L2VPN Workgroup                                      A. Sajassi (Editor)
INTERNET-DRAFT                                                     Cisco
Intended Status: Standards Track                       J. Drake (Editor)
                                                                 Juniper
                                                                N. Bitar
                                                                   Nokia
                                                              R. Shekhar
                                                                 Juniper
                                                               J. Uttaro
                                                                    AT&T
                                                           W. Henderickx
                                                                   Nokia

Expires: December 10, 2016                                 June 10, 2016


          A Network Virtualization Overlay Solution using EVPN
                    draft-ietf-bess-evpn-overlay-04


Abstract

   This document describes how Ethernet VPN (EVPN), as specified in
   RFC 7432, can be used as a Network Virtualization Overlay (NVO)
   solution and explores the various tunnel encapsulation options over
   IP and their impact on the EVPN control plane and procedures.  In
   particular, the following encapsulation options are analyzed: VXLAN,
   NVGRE, and MPLS over GRE.


Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html


Copyright and License Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.


Table of Contents

   1  Introduction
   2  Specification of Requirements
   3  Terminology
   4  EVPN Features
   5  Encapsulation Options for EVPN Overlays
      5.1  VXLAN/NVGRE Encapsulation
         5.1.1  Virtual Identifiers Scope
            5.1.1.1  Data Center Interconnect with Gateway
            5.1.1.2  Data Center Interconnect without Gateway
         5.1.2  Virtual Identifiers to EVI Mapping
            5.1.2.1  Auto Derivation of RT
         5.1.3  Constructing EVPN BGP Routes
      5.2  MPLS over GRE
   6  EVPN with Multiple Data Plane Encapsulations
   7  NVE Residing in Hypervisor
      7.1  Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE
           Encapsulation
      7.2  Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation
   8  NVE Residing in ToR Switch
      8.1  EVPN Multi-Homing Features
         8.1.1  Multi-homed Ethernet Segment Auto-Discovery
         8.1.2  Fast Convergence and Mass Withdraw
         8.1.3  Split-Horizon
         8.1.4  Aliasing and Backup-Path
         8.1.5  DF Election
      8.2  Impact on EVPN BGP Routes & Attributes
      8.3  Impact on EVPN Procedures
         8.3.1  Split Horizon
         8.3.2  Aliasing and Backup-Path
   9  Support for Multicast
   10 Data Center Interconnections - DCI
      10.1  DCI using GWs
      10.2  DCI using ASBRs
         10.2.1  ASBR Functionality with NVEs in Hypervisors
         10.2.2  ASBR Functionality with NVEs in TORs
   11 Acknowledgement
   12 Security Considerations
   13 IANA Considerations
   14 References
      14.1  Normative References
      14.2  Informative References
   Contributors
   Authors' Addresses


1 Introduction

   In the context of this document, a Network Virtualization Overlay
   (NVO) is a solution to address the requirements of a multi-tenant
   data center, especially one with virtualized hosts, e.g., Virtual
   Machines (VMs).
   The key requirements of such a solution, as described in
   [Problem-Statement], are:

   - Isolation of network traffic per tenant

   - Support for a large number of tenants (tens or hundreds of
     thousands)

   - Extending L2 connectivity among different VMs belonging to a given
     tenant segment (subnet) across different PODs within a data center
     or between different data centers

   - Allowing a given VM to move between different physical points of
     attachment within a given L2 segment

   The underlay network for NVO solutions is assumed to provide IP
   connectivity between NVO endpoints (NVEs).

   This document describes how Ethernet VPN (EVPN) can be used as an NVO
   solution and explores the applicability of EVPN functions and
   procedures.  In particular, it describes the various tunnel
   encapsulation options for EVPN over IP and their impact on the EVPN
   control plane and procedures for two main scenarios:

   a) when the NVE resides in the hypervisor, and
   b) when the NVE resides in a Top of Rack (ToR) device.

   Note that the use of EVPN as an NVO solution does not necessarily
   mandate that the BGP control plane be running on the NVE.  For such
   scenarios, it is still possible to leverage the EVPN solution by
   using XMPP, or alternative mechanisms, to extend the control plane to
   the NVE as discussed in [L3VPN-ENDSYSTEMS].

   The encapsulation options for EVPN overlays that are analyzed in this
   document are:

   - VXLAN and NVGRE
   - MPLS over GRE

   Before getting into the description of the different encapsulation
   options for EVPN over IP, it is important to highlight the EVPN
   solution's main features, how those features are currently supported,
   and any impact that the encapsulation has on those features.
2 Specification of Requirements

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3 Terminology

   NVO: Network Virtualization Overlay

   NVE: Network Virtualization Endpoint

   VNI: Virtual Network Identifier (for VXLAN)

   VSID: Virtual Subnet Identifier (for NVGRE)

   EVPN: Ethernet VPN

   EVI: An EVPN instance spanning the Provider Edge (PE) devices
   participating in that EVPN.

   MAC-VRF: A Virtual Routing and Forwarding table for Media Access
   Control (MAC) addresses on a PE.

   Ethernet Segment (ES): When a customer site (device or network) is
   connected to one or more PEs via a set of Ethernet links, that set
   of links is referred to as an 'Ethernet segment'.

   Ethernet Segment Identifier (ESI): A unique non-zero identifier that
   identifies an Ethernet segment.

   Ethernet Tag: An Ethernet tag identifies a particular broadcast
   domain, e.g., a VLAN.  An EVPN instance consists of one or more
   broadcast domains.

   PE: Provider Edge device.

   Single-Active Redundancy Mode: When only a single PE, among all the
   PEs attached to an Ethernet segment, is allowed to forward traffic
   to/from that Ethernet segment for a given VLAN, the Ethernet segment
   is defined to be operating in Single-Active redundancy mode.

   All-Active Redundancy Mode: When all PEs attached to an Ethernet
   segment are allowed to forward known unicast traffic to/from that
   Ethernet segment for a given VLAN, the Ethernet segment is defined
   to be operating in All-Active redundancy mode.

4 EVPN Features

   EVPN was originally designed to support the requirements detailed in
   [RFC7209] and therefore has the following attributes, which directly
   address control-plane scaling and ease-of-deployment issues.
   1) Control-plane traffic is distributed with BGP, and broadcast and
   multicast traffic is sent using a shared multicast tree or with
   ingress replication.

   2) Control-plane learning is used for MAC (and IP) addresses instead
   of data-plane learning.  The latter requires the flooding of unknown
   unicast and ARP frames, whereas the former does not require any
   flooding.

   3) A Route Reflector (RR) is used to reduce a full mesh of BGP
   sessions among PE devices to a single BGP session between a PE and
   the RR.  Furthermore, RR hierarchy can be leveraged to scale the
   number of BGP routes on the RR.

   4) Auto-discovery via BGP is used to discover PE devices
   participating in a given VPN, PE devices participating in a given
   redundancy group, tunnel encapsulation types, multicast tunnel type,
   multicast members, etc.

   5) All-Active multihoming is used.  This allows a given customer
   device (CE) to have multiple links to multiple PEs, and traffic
   to/from that CE fully utilizes all of these links.  This set of links
   is termed an Ethernet Segment (ES).

   6) When a link between a CE and a PE fails, the PEs for that EVI are
   notified of the failure via the withdrawal of a single EVPN route.
   This allows those PEs to remove the withdrawing PE as a next hop for
   every MAC address associated with the failed link.  This is termed
   'mass withdrawal'.

   7) BGP route filtering and constrained route distribution are
   leveraged to ensure that the control-plane traffic for a given EVI is
   only distributed to the PEs in that EVI.

   8) When an 802.1Q interface is used between a CE and a PE, each VLAN
   ID (VID) on that interface can be mapped onto a bridge table (for up
   to 4094 such bridge tables).  All these bridge tables may be mapped
   onto a single MAC-VRF (in the case of VLAN-aware bundle service).
   9) VM mobility mechanisms ensure that all PEs in a given EVI know
   the ES with which a given VM, as identified by its MAC and IP
   addresses, is currently associated.

   10) Route Targets are used to allow the operator (or customer) to
   define a spectrum of logical network topologies including mesh, hub &
   spoke, and extranets (e.g., a VPN whose sites are owned by different
   enterprises), without the need for proprietary software or the aid of
   other virtual or physical devices.

   11) Because the design goal for NVO is millions of instances per
   common physical infrastructure, the scaling properties of the control
   plane for NVO are extremely important.  EVPN and the extensions
   described herein are designed with this level of scalability in
   mind.

5 Encapsulation Options for EVPN Overlays

5.1 VXLAN/NVGRE Encapsulation

   Both VXLAN and NVGRE are examples of technologies that provide a data
   plane encapsulation which is used to transport a packet over the
   common physical IP infrastructure between Network Virtualization
   Edges (NVEs), e.g., VXLAN Tunnel End Points (VTEPs) in a VXLAN
   network.  Both of these technologies include the identifier of the
   specific NVO instance, Virtual Network Identifier (VNI) in VXLAN and
   Virtual Subnet Identifier (VSID) in NVGRE, in each packet.  In the
   remainder of this document, we use VNI to represent an NVO instance,
   with the understanding that VSID can equally be used if the
   encapsulation is NVGRE, unless stated otherwise.

   Note that a Provider Edge (PE) is equivalent to an NVE/VTEP.

   VXLAN encapsulation is based on UDP, with an 8-byte header following
   the UDP header.  VXLAN provides a 24-bit VNI, which typically provides
   a one-to-one mapping to the tenant VLAN ID, as described in
   [RFC7348].
   In this scenario, the ingress VTEP does not include an inner VLAN tag
   on the encapsulated frame, and the egress VTEP discards frames with
   an inner VLAN tag.  This mode of operation in [RFC7348] maps to the
   VLAN Based Service in [RFC7432], where a tenant VLAN ID gets mapped to
   an EVPN instance (EVI).

   VXLAN also provides an option of including an inner VLAN tag in the
   encapsulated frame, if explicitly configured at the VTEP.  This mode
   of operation can map to the VLAN Bundle Service in [RFC7432] because
   all the tenant's tagged frames map to a single bridge table/MAC-VRF,
   and the inner VLAN tag is not used for lookup by the disposition PE
   when performing VXLAN decapsulation, as described in section 6 of
   [RFC7348].

   [NVGRE] encapsulation is based on [GRE], and it mandates the
   inclusion of the optional GRE Key field, which carries the VSID.
   There is a one-to-one mapping between the VSID and the tenant VLAN
   ID, as described in [NVGRE], and the inclusion of an inner VLAN tag
   is prohibited.  This mode of operation in [NVGRE] maps to the VLAN
   Based Service in [RFC7432].

   As described in Section 5.1.3, there is no change to the encoding of
   EVPN routes to support VXLAN or NVGRE encapsulation except for the
   use of the BGP Encapsulation extended community to indicate the
   encapsulation type (e.g., VXLAN or NVGRE).  However, there is
   potential impact to the EVPN procedures depending on where the NVE is
   located (i.e., in the hypervisor or ToR) and whether multi-homing
   capabilities are required.
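   The two data-plane headers discussed above can be illustrated with a
   short Python sketch.  It assumes the VXLAN header layout of [RFC7348]
   (a flags byte whose "I" bit marks the VNI as valid, reserved bits,
   and a 24-bit VNI) and the NVGRE use of the GRE Key field from [NVGRE]
   (a 24-bit VSID followed by an 8-bit FlowID, with GRE protocol type
   0x6558, Transparent Ethernet Bridging).  The function names are
   hypothetical; this illustrates the header layouts only and is not a
   reference implementation.

```python
import struct

VXLAN_FLAG_I = 0x08        # "I" flag: the VNI field is valid (RFC 7348)
GRE_FLAG_KEY = 0x2000      # K bit in the first 16 bits of the GRE header
GRE_PROTO_TEB = 0x6558     # Transparent Ethernet Bridging (used by NVGRE)
MAX_24_BIT = (1 << 24) - 1


def vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header that follows the outer UDP header."""
    if not 0 <= vni <= MAX_24_BIT:
        raise ValueError("VNI must fit in 24 bits")
    # Flags (1 byte), 3 reserved bytes, then VNI (3 bytes) + 1 reserved byte.
    return struct.pack("!B3xI", VXLAN_FLAG_I, vni << 8)


def vxlan_vni(header: bytes) -> int:
    """Return the VNI carried in an 8-byte VXLAN header."""
    flags, word = struct.unpack("!B3xI", header[:8])
    if not flags & VXLAN_FLAG_I:
        raise ValueError("I flag not set: VNI is not valid")
    return word >> 8  # drop the trailing reserved byte


def nvgre_header(vsid: int, flow_id: int = 0) -> bytes:
    """Return an 8-byte GRE header whose Key field carries VSID + FlowID."""
    if not 0 <= vsid <= MAX_24_BIT or not 0 <= flow_id <= 0xFF:
        raise ValueError("VSID must fit in 24 bits and FlowID in 8 bits")
    # C=0, K=1, S=0, version 0 | protocol type | Key = VSID (24) : FlowID (8)
    return struct.pack("!HHI", GRE_FLAG_KEY, GRE_PROTO_TEB,
                       (vsid << 8) | flow_id)


if __name__ == "__main__":
    hdr = vxlan_header(5001)
    print(hdr.hex())                    # 0800000000138900
    print(vxlan_vni(hdr))               # 5001
    print(nvgre_header(5001, 7).hex())  # 2000655800138907
```

   Both headers place the 24-bit identifier in the upper three bytes of
   a 32-bit word, which is why the same shift-by-8 appears in each case.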
5.1.1 Virtual Identifiers Scope

   Although VNIs are defined as 24-bit globally unique values, there are
   scenarios in which it is desirable to use a locally significant value
   for VNI, especially in the context of data center interconnect:

5.1.1.1 Data Center Interconnect with Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use VNIs as globally unique
   identifiers within a data center, a Gateway needs to be employed at
   the edge of the data center network.  This is because the Gateway
   provides the functionality of translating the VNI when crossing
   network boundaries, which may align with operator span-of-control
   boundaries.  As an example, consider the network of Figure 1 below.
   Assume there are three network operators: one for each of the DC1,
   DC2, and WAN networks.  The Gateways at the edge of the data centers
   are responsible for translating the VNIs between the values used in
   each of the data center networks and the values used in the WAN.

                           +--------------+
                           |              |
            +-----------+  |     WAN      |  +-----------+
    +----+  |       +---+  +----+    +----+  +---+       |  +----+
    |NVE1|--|       |   |  |WAN |    |WAN |  |   |       |--|NVE3|
    +----+  |IP     |GW |--|Edge|    |Edge|--|GW |     IP|  +----+
    +----+  |Fabric +---+  +----+    +----+  +---+ Fabric|  +----+
    |NVE2|--|           |  |              |  |           |--|NVE4|
    +----+  +-----------+  +--------------+  +-----------+  +----+

    |<------ DC 1 ------>                     <------ DC2 ------>|

              Figure 1: Data Center Interconnect with Gateway

5.1.1.2 Data Center Interconnect without Gateway

   In the case where NVEs in different data centers need to be
   interconnected, and the NVEs need to use locally assigned VNIs (e.g.,
   similar to MPLS labels), there may be no need to employ Gateways at
   the edge of the data center network.
   More specifically, the VNI value that is used by the transmitting NVE
   is allocated by the NVE that is receiving the traffic (in other
   words, this is similar to a "downstream-assigned" MPLS label).  This
   allows the VNI space to be decoupled between different data center
   networks without the need for a dedicated Gateway at the edge of the
   data centers.  This topic is covered in Section 10.2.

                          +--------------+
                          |              |
            +---------+   |     WAN      |   +---------+
    +----+  |         |   +----+    +----+   |         |  +----+
    |NVE1|--|         |   |ASBR|    |ASBR|   |         |--|NVE3|
    +----+  |IP Fabric|---|    |    |    |---|IP Fabric|  +----+
    +----+  |         |   +----+    +----+   |         |  +----+
    |NVE2|--|         |   |              |   |         |--|NVE4|
    +----+  +---------+   +--------------+   +---------+  +----+

    |<------ DC 1 ------>                   <------ DC2 ------>|

               Figure 2: Data Center Interconnect with ASBR

5.1.2 Virtual Identifiers to EVI Mapping

   When the EVPN control plane is used in conjunction with VXLAN (or
   NVGRE) encapsulation, two options for mapping the VXLAN VNI (or NVGRE
   VSID) to an EVI are possible:

   1. Option 1: Single Subnet per EVI

   In this option, a single subnet represented by a VNI is mapped to a
   unique EVI.  This corresponds to the VLAN Based Service in [RFC7432],
   where a tenant VLAN ID gets mapped to an EVPN instance (EVI).  As
   such, a BGP RD and RT are needed per VNI on every NVE.  The advantage
   of this model is that it allows the BGP RT constraint mechanisms to
   be used in order to limit the propagation and import of routes to
   only the NVEs that are interested in a given VNI.  The disadvantage
   of this model may be the provisioning overhead if the RD and RT are
   not derived automatically from the VNI.

   In this option, the MAC-VRF table is identified by the RT in the
   control plane and by the VNI in the data plane.  In this option, the
   MAC-VRF table corresponds to only a single bridge table.

   2. Option 2: Multiple Subnets per EVI

   In this option, multiple subnets, each represented by a unique VNI,
   are mapped to a single EVI.  For example, if a tenant has multiple
   segments/subnets, each represented by a VNI, then all the VNIs for
   that tenant are mapped to a single EVI; i.e., the EVI in this case
   represents the tenant and not a subnet.  This corresponds to the
   VLAN-aware bundle service in [RFC7432].  The advantage of this model
   is that it doesn't require the provisioning of an RD/RT per VNI.
   However, this is a moot point if Option 1 with auto-derivation is
   used.  The disadvantage of this model is that routes would be
   imported by NVEs that may not be interested in a given VNI.

   In this option, the MAC-VRF table is identified by the RT in the
   control plane, and a specific bridge table for that MAC-VRF is
   identified by the <RT, Ethernet Tag ID> in the control plane.  In
   this option, the VNI in the data plane is sufficient to identify a
   specific bridge table; i.e., there is no need to do a lookup based on
   the VNI and Ethernet Tag ID fields to identify a bridge table.

5.1.2.1 Auto Derivation of RT

   When the option of a single VNI per EVI is used, it is important to
   auto-derive the RT for EVPN BGP routes in order to simplify
   configuration for data center operations.  The RD can be
   auto-generated as described in [RFC7432], and the RT can be
   auto-derived as described next.

   Since a gateway PE, as depicted in Figure 1, participates in both the
   DCN and WAN BGP sessions, it is important that when RT values are
   auto-derived for VNIs, there is no conflict in RT spaces between DCN
   and WAN networks, assuming that both are operating within the same
   AS.  Also, there can be scenarios where both VXLAN and NVGRE
   encapsulations may be needed within the same DCN and their
   corresponding VNIs are administered independently, which means VNI
   spaces can overlap.
   In order to ensure that no such conflict in RT spaces arises, RT
   values for DCNs are auto-derived as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |             AS #              |A| TYPE| D-ID  | Service Inst. |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Service Instance ID (cont.)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   - The 2-byte global administrator field of the RT is set to the AS
     number.

   - The three least significant bytes of the local administrator field
     of the RT are set to the VNI, VSID, I-SID, or VID.

   - The most significant bit of the local administrator field of the RT
     is set as follows:
       0: auto-derived
       1: manually-derived

   - The next 3 bits of the most significant byte of the local
     administrator field of the RT identify the space in which the other
     3 bytes are defined.  The following spaces are defined:
       0 : VID
       1 : VXLAN
       2 : NVGRE
       3 : I-SID
       4 : EVI
       5 : dual-VID

   - The remaining 4 bits of the most significant byte of the local
     administrator field of the RT identify the domain-id.  The default
     value of domain-id is zero, indicating that only a single numbering
     space exists for a given technology.  However, if more than one
     number space exists for a given technology (e.g., overlapping VXLAN
     spaces), then each of the number spaces needs to be identified by
     its corresponding domain-id, starting from 1.

5.1.3 Constructing EVPN BGP Routes

   In EVPN, an MPLS label is distributed by the egress PE via the EVPN
   control plane and is placed in the MPLS header of a given packet by
   the ingress PE.  This label is used upon receipt of that packet by
   the egress PE for disposition of that packet.  This is very similar
   to the use of the VNI by the egress NVE, with the difference being
   that an MPLS label has local significance while a VNI typically has
   global significance.
   Accordingly, and specifically to support the option of locally
   assigned VNIs, the MPLS label field in the MAC Advertisement,
   Ethernet AD per EVI, and Inclusive Multicast Ethernet Tag routes is
   used to carry the VNI.  For the balance of this memo, the MPLS label
   field will be referred to as the VNI field.  The VNI field is used
   for both local and global VNIs, and in either case the entire 24-bit
   field is used to encode the VNI value.

   For the VLAN-based service (a single VNI per MAC-VRF), the Ethernet
   Tag field in the MAC/IP Advertisement, Ethernet AD per EVI, and
   Inclusive Multicast routes MUST be set to zero, just as in the VLAN
   Based Service in [RFC7432].

   For the VLAN-aware bundle service (multiple VNIs per MAC-VRF, with
   each VNI associated with its own bridge table), the Ethernet Tag
   field in the MAC Advertisement, Ethernet AD per EVI, and Inclusive
   Multicast routes MUST identify a bridge table within a MAC-VRF, and
   the set of Ethernet Tags for that EVI needs to be configured
   consistently on all PEs within that EVI.  For local VNIs, the value
   advertised in the Ethernet Tag field MUST be set to a VID, just as in
   the VLAN-aware bundle service in [RFC7432].  Such a setting must be
   done consistently on all PE devices participating in that EVI within
   a given domain.  For global VNIs, the value advertised in the
   Ethernet Tag field SHOULD be set to a VNI as long as it matches the
   existing semantics of the Ethernet Tag, i.e., it identifies a bridge
   table within a MAC-VRF and the set of VNIs is configured consistently
   on each PE in that EVI.

   In order to indicate which type of data plane encapsulation (i.e.,
   VXLAN, NVGRE, MPLS, or MPLS in GRE) is to be used, the BGP
   Encapsulation extended community defined in [TUNNEL-ENCAP] and
   [RFC5512] is included with all EVPN routes (i.e.,
   MAC Advertisement, Ethernet AD per EVI, Ethernet AD per ESI,
   Inclusive Multicast Ethernet Tag, and Ethernet Segment) advertised by
   an egress PE.  Five new values have been assigned by IANA to extend
   the list of encapsulation types defined in [TUNNEL-ENCAP]; they are
   listed in Section 13.

   The MPLS encapsulation tunnel type, listed in Section 13, is needed
   in order to distinguish between an advertising node that only
   supports non-MPLS encapsulations and one that supports both MPLS and
   non-MPLS encapsulations.  An advertising node that only supports MPLS
   encapsulation does not need to advertise any encapsulation tunnel
   types; i.e., if the BGP Encapsulation extended community is not
   present, then either MPLS encapsulation or a statically configured
   encapsulation is assumed.

   The Ethernet Segment and Ethernet AD per ESI routes MAY be advertised
   with multiple encapsulation types as long as they use the same EVPN
   multi-homing procedures; e.g., mixing VXLAN and NVGRE encapsulation
   types is valid, but mixing VXLAN and MPLS encapsulation types is not.

   The Next Hop field of the MP_REACH_NLRI attribute of the route MUST
   be set to the IPv4 or IPv6 address of the NVE.  The remaining fields
   in each route are set as per [RFC7432].

5.2 MPLS over GRE

   The EVPN data plane is modeled as an EVPN MPLS client layer sitting
   over an MPLS PSN-tunnel server layer.  Some of the EVPN functions
   (split-horizon, aliasing, and backup-path) are tied to the MPLS
   client layer.  If MPLS over GRE encapsulation is used, then the EVPN
   MPLS client layer can be carried over an IP PSN tunnel transparently.
   Therefore, there is no impact to the EVPN procedures and associated
   data-plane operation.
   The existing standards for MPLS over GRE encapsulation as defined by
   [RFC4023] can be used for this purpose; however, when it is used in
   conjunction with EVPN, the GRE Key field SHOULD be present and SHOULD
   be used to provide a 32-bit entropy field.  The Checksum and Sequence
   Number fields are not needed, and their corresponding C and S bits
   MUST be set to zero.  A PE capable of supporting this encapsulation
   should advertise its EVPN routes along with the BGP Encapsulation
   extended community indicating MPLS over GRE encapsulation, as
   described in Section 5.1.3.

6 EVPN with Multiple Data Plane Encapsulations

   The use of the BGP Encapsulation extended community per
   [TUNNEL-ENCAP] and [RFC5512] allows each NVE in a given EVI to know
   each of the encapsulations supported by each of the other NVEs in
   that EVI; i.e., each of the NVEs in a given EVI may support multiple
   data plane encapsulations.  An ingress NVE can send a frame to an
   egress NVE only if the set of encapsulations advertised by the egress
   NVE in the subject MAC/IP Advertisement or per EVI Ethernet AD route
   forms a non-empty intersection with the set of encapsulations
   supported by the ingress NVE, and it is at the discretion of the
   ingress NVE which encapsulation to choose from this intersection.
   (As noted in Section 5.1.3, if the BGP Encapsulation extended
   community is not present, then the default MPLS encapsulation or a
   statically configured encapsulation is assumed.)

   An ingress node that uses shared multicast trees for sending
   broadcast or multicast frames MUST maintain distinct trees for each
   different encapsulation type.

   It is the responsibility of the operator of a given EVI to ensure
   that all of the NVEs in that EVI support at least one common
   encapsulation.  If this condition is violated, it could result in
   service disruption or failure.
   The use of the BGP Encapsulation extended community provides a
   method to detect when this condition is violated, but the actions to
   be taken are at the discretion of the operator and are outside the
   scope of this document.

7 NVE Residing in Hypervisor

   When an NVE and its hosts/VMs are co-located in the same physical
   device, e.g., when they reside in a server, the links between them
   are virtual and they typically share fate; i.e., the subject
   hosts/VMs are typically not multi-homed, or, if they are multi-homed,
   the multi-homing is a purely local matter to the server hosting the
   VM and the NVE; it need not be "visible" to any other NVEs residing
   on other servers and thus does not require any specific protocol
   mechanisms.  The most common case of this is when the NVE resides on
   the hypervisor.

   In the sub-sections that follow, we discuss the impact on EVPN
   procedures for the case when the NVE resides on the hypervisor and
   the VXLAN (or NVGRE) encapsulation is used.

7.1 Impact on EVPN BGP Routes & Attributes for VXLAN/NVGRE Encapsulation

   In the scenario where all data centers are under a single
   administrative domain, and there is a single global VNI space, the RD
   MAY be set to zero in the EVPN routes.  However, in the scenario
   where different groups of data centers are under different
   administrative domains, and these data centers are connected via one
   or more backbone core providers as described in [NOV3-FRWK], the RD
   must be a unique value per EVI or per NVE as described in [RFC7432].
   In other words, whenever there is more than one administrative domain
   for global VNIs, or whenever the VNI value has local significance, a
   non-zero RD MUST be used.  It is recommended that a non-zero RD be
   used at all times.
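   A minimal Python sketch of the RD rules above follows.  It assumes
   the non-zero RD is encoded in the Type 1 format of [RFC4364] (a
   2-byte type of 1, the NVE's IPv4 address, and a 2-byte number), the
   format [RFC7432] describes for RD assignment.  The function names,
   the "prefer_zero" switch, and the use of an EVI number as the 2-byte
   value are illustrative assumptions, not part of this specification.

```python
import ipaddress
import struct

ZERO_RD = bytes(8)


def type1_rd(nve_ipv4: str, number: int) -> bytes:
    """Encode a Type 1 RD: type 0x0001, 4-byte IPv4 address, 2-byte number."""
    if not 0 <= number <= 0xFFFF:
        raise ValueError("assigned number must fit in 16 bits")
    addr = ipaddress.IPv4Address(nve_ipv4)
    return struct.pack("!H4sH", 1, addr.packed, number)


def choose_rd(single_admin_domain: bool, global_vni_space: bool,
              nve_ipv4: str, evi_number: int,
              prefer_zero: bool = False) -> bytes:
    """Pick an RD for an EVI following the rules in this section.

    A zero RD is only permitted (MAY) with a single administrative domain
    and a single global VNI space; in every other case a non-zero RD MUST
    be used.  Because a non-zero RD is recommended at all times, the zero
    RD is returned only when the caller explicitly asks for it.
    """
    zero_permitted = single_admin_domain and global_vni_space
    if prefer_zero and zero_permitted:
        return ZERO_RD
    return type1_rd(nve_ipv4, evi_number)


if __name__ == "__main__":
    print(choose_rd(True, True, "192.0.2.1", 100).hex())  # 0001c00002010064
    print(choose_rd(True, True, "192.0.2.1", 100, prefer_zero=True).hex())
    # 0000000000000000
    print(choose_rd(False, True, "192.0.2.1", 100, prefer_zero=True).hex())
    # 0001c00002010064
```

   Defaulting to the non-zero RD keeps the sketch aligned with the
   recommendation; the zero RD is only an opt-in optimization for the
   single-domain, global-VNI case.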
When the NVEs reside on the hypervisor, the EVPN BGP routes and attributes associated with multi-homing are no longer required. This reduces the required routes and attributes to the following subset of four out of eight:

   - MAC/IP Advertisement Route
   - Inclusive Multicast Ethernet Tag Route
   - MAC Mobility Extended Community
   - Default Gateway Extended Community

However, as noted in section 8.6 of [RFC7432], in order to enable a single-homing ingress NVE to take advantage of fast convergence, aliasing, and backup-path when interacting with multi-homed egress NVEs attached to a given Ethernet segment, the single-homing ingress NVE SHOULD be able to receive and process Ethernet AD per ES and Ethernet AD per EVI routes.

7.2 Impact on EVPN Procedures for VXLAN/NVGRE Encapsulation

When the NVEs reside on the hypervisors, the EVPN procedures associated with multi-homing are no longer required. This limits the procedures on the NVE to the following subset of the EVPN procedures:

   1. Local learning of MAC addresses received from the VMs per section 9.1 of [RFC7432].

   2. Advertising locally learned MAC addresses in BGP using the MAC/IP Advertisement routes.

   3. Performing remote learning using BGP per section 9.2 of [RFC7432].

   4. Discovering other NVEs and constructing the multicast tunnels using the Inclusive Multicast Ethernet Tag routes.

   5. Handling MAC address mobility events per the procedures of section 15 of [RFC7432].
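Steps 1, 2, 3 and 5 above can be sketched as a small per-EVI MAC table. This is a simplified Python illustration, not the full procedure of [RFC7432]: the class and method names are hypothetical, and the mobility rule is reduced to "a MAC/IP route with a higher MAC Mobility sequence number wins; an NVE that locally learns a MAC currently known via another NVE advertises it with the next sequence number". Tie-breaking between equal sequence numbers is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class MacEntry:
    next_hop: str  # "local", or the IP address of the remote NVE
    seq: int       # MAC Mobility sequence number (0 when absent)


@dataclass
class HypervisorNve:
    """Hypothetical per-EVI MAC table of an NVE residing in a hypervisor."""
    my_ip: str
    macs: dict = field(default_factory=dict)        # mac -> MacEntry
    advertised: list = field(default_factory=list)  # (mac, next hop, seq)

    def learn_local(self, mac: str) -> None:
        """Steps 1 and 2 (plus 5): learn from a VM, advertise a MAC/IP route."""
        old = self.macs.get(mac)
        if old is not None and old.next_hop == "local":
            return  # already local; nothing new to advertise
        # A MAC previously reachable via another NVE has moved here, so it
        # is advertised with a sequence number one higher than before.
        seq = old.seq + 1 if old is not None else 0
        self.macs[mac] = MacEntry("local", seq)
        self.advertised.append((mac, self.my_ip, seq))

    def learn_remote(self, mac: str, nve_ip: str, seq: int) -> None:
        """Steps 3 and 5: install a received MAC/IP route if it is newer."""
        old = self.macs.get(mac)
        if old is None or seq > old.seq:
            self.macs[mac] = MacEntry(nve_ip, seq)
```

No multi-homing state appears anywhere in the table, which reflects why the hypervisor case needs only this reduced set of procedures.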
However, as noted in section 8.6 of [RFC7432], in order to enable a single-homing ingress NVE to take advantage of fast convergence, aliasing, and back-up path when interacting with multi-homed egress NVEs attached to a given Ethernet segment, a single-homing ingress NVE SHOULD implement the ingress node processing of Ethernet AD per ES and Ethernet AD per EVI routes as defined in sections 8.2 (Fast Convergence) and 8.4 (Aliasing and Backup Path) of [RFC7432].

8 NVE Residing in ToR Switch

In this section, we discuss the scenario where the NVEs reside in the Top of Rack (ToR) switches AND the servers (where VMs reside) are multi-homed to these ToR switches. The multi-homing may operate in All-Active or Single-Active redundancy mode. If the servers are single-homed to the ToR switches, then the scenario becomes similar to that where the NVE resides on the hypervisor, as discussed in Section 7, as far as the required EVPN functionality is concerned.

[RFC7432] defines a set of BGP routes, attributes, and procedures to support multi-homing. We first describe these functions and procedures, then discuss which of these are impacted by the VXLAN (or NVGRE) encapsulation and what modifications are required.

8.1 EVPN Multi-Homing Features

In this section, we recap the multi-homing features of EVPN to highlight the encapsulation dependencies. The section only describes the features and functions at a high level. For more details, the reader should refer to [RFC7432].

8.1.1 Multi-homed Ethernet Segment Auto-Discovery

EVPN NVEs (or PEs) connected to the same Ethernet Segment (e.g., the same server via a LAG) can automatically discover each other with minimal to no configuration through the exchange of BGP Ethernet Segment routes.
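The outcome of this auto-discovery can be sketched as a per-ESI peer list. In this hypothetical Python illustration, each received Ethernet Segment route is reduced to an (ESI, originating NVE IP) pair. In a real deployment the ES-Import Route Target of [RFC7432] already limits which ES routes an NVE imports, whereas the sketch simply filters on locally configured ESIs. Peers are sorted by numeric IP address, the ordering that the default DF election of [RFC7432] uses.

```python
import ipaddress
from collections import defaultdict


def discover_es_peers(local_esis, es_routes, my_ip):
    """Return {esi: [NVE IPs attached to that ESI]} for local segments only.

    local_esis: ESIs configured on this NVE (hypothetical string form).
    es_routes:  iterable of (esi, originating NVE IP) pairs, one per
                received Ethernet Segment route.
    """
    peers = defaultdict(set)
    for esi in local_esis:
        peers[esi].add(my_ip)  # this NVE is itself attached to the segment
    for esi, nve_ip in es_routes:
        if esi in peers:       # skip segments this NVE is not attached to
            peers[esi].add(nve_ip)
    # Sort numerically (10.0.0.9 before 10.0.0.10), not as strings.
    return {esi: sorted(ips, key=ipaddress.ip_address)
            for esi, ips in peers.items()}
```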
8.1.2 Fast Convergence and Mass Withdraw

EVPN defines a mechanism to efficiently and quickly signal, to remote NVEs, the need to update their forwarding tables upon the occurrence of a failure in connectivity to an Ethernet segment (e.g., a link or a port failure). This is done by having each NVE advertise an Ethernet A-D route per Ethernet segment for each locally attached segment. Upon a failure in connectivity to the attached segment, the NVE withdraws the corresponding Ethernet A-D route. This triggers all NVEs that receive the withdrawal to update their next-hop adjacencies for all MAC addresses associated with the Ethernet segment in question. If no other NVE had advertised an Ethernet A-D route for the same segment, then the NVE that received the withdrawal simply invalidates the MAC entries for that segment. Otherwise, the NVE updates the next-hop adjacency list accordingly.

8.1.3 Split-Horizon

If a server that is multi-homed to two or more NVEs (represented by an Ethernet segment ES1) and operating in all-active redundancy mode sends a BUM (i.e., Broadcast, Unknown unicast, or Multicast) packet to one of these NVEs, then it is important to ensure that the packet is not looped back to the server via another NVE connected to this server. The filtering mechanism on the NVE to prevent such loops and packet duplication is called "split-horizon filtering".

8.1.4 Aliasing and Backup-Path

In the case where a station is multi-homed to multiple NVEs, it is possible that only a single NVE learns a set of the MAC addresses associated with traffic transmitted by the station. This leads to a situation where remote NVEs receive MAC advertisement routes for these addresses from a single NVE, even though multiple NVEs are connected to the multi-homed station.
As a result, the remote NVEs are not able to effectively load-balance traffic among the NVEs connected to the multi-homed Ethernet segment. This could be the case, for example, when the NVEs perform data-path learning on the access, and the load-balancing function on the station hashes traffic from a given source MAC address to a single NVE. Another scenario where this occurs is when the NVEs rely on control plane learning on the access (e.g., using ARP), since ARP traffic will be hashed to a single link in the LAG.

To alleviate this issue, EVPN introduces the concept of Aliasing. This refers to the ability of an NVE to signal that it has reachability to a given locally attached Ethernet segment, even when it has learned no MAC addresses from that segment. The Ethernet A-D route per EVI is used to that end. Remote NVEs that receive MAC advertisement routes with a non-zero ESI SHOULD consider the MAC address as reachable via all NVEs that advertise reachability to the relevant Segment using Ethernet A-D routes with the same ESI and with the Single-Active flag reset.

Backup-Path is a closely related function, although it applies to the case where the redundancy mode is Single-Active. In this case, the NVE signals that it has reachability to a given locally attached Ethernet Segment using the Ethernet A-D route as well. Remote NVEs that receive the MAC advertisement routes with a non-zero ESI SHOULD consider the MAC address as reachable via the advertising NVE. Furthermore, the remote NVEs SHOULD install a Backup-Path, for said MAC, to the NVE that had advertised reachability to the relevant Segment using an Ethernet A-D route with the same ESI and with the Single-Active flag set.
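The remote-NVE behavior described in sections 8.1.2 and 8.1.4 can be sketched as follows. This is a simplified, hypothetical Python model of one EVI: it collapses the per-ES and per-EVI Ethernet A-D routes into a single table keyed by ESI, represents ESIs and addresses as plain strings, and uses a stand-in value for the all-zero (single-homed) ESI. It shows aliasing (All-Active), Backup-Path (Single-Active), and how a single A-D withdrawal updates every MAC on the segment.

```python
from collections import defaultdict

ZERO_ESI = "0"  # hypothetical stand-in for the all-zero ESI (single-homed)


class RemoteNveEvi:
    """Hypothetical remote-NVE state for a single EVI."""

    def __init__(self):
        self.mac_routes = {}                # mac -> (esi, advertising NVE)
        self.ad_routes = defaultdict(dict)  # esi -> {nve_ip: single_active}

    def add_ad_route(self, esi, nve_ip, single_active):
        self.ad_routes[esi][nve_ip] = single_active

    def withdraw_ad_route(self, esi, nve_ip):
        # One withdrawal changes the resolution of every MAC on the segment.
        self.ad_routes[esi].pop(nve_ip, None)

    def add_mac_route(self, mac, esi, nve_ip):
        self.mac_routes[mac] = (esi, nve_ip)

    def resolve(self, mac):
        """Return (active next hops, backup next hops) for a MAC."""
        esi, adv = self.mac_routes[mac]
        if esi == ZERO_ESI:
            return [adv], []               # single-homed: no aliasing/backup
        alive = self.ad_routes.get(esi, {})
        if not alive:
            return [], []                  # nobody reaches the ES: invalidate
        if any(alive.values()):            # Single-Active flag set
            others = sorted(n for n in alive if n != adv)
            if adv in alive:
                return [adv], others       # primary plus Backup-Path
            return others[:1], others[1:]  # fail over to the backup
        return sorted(alive), []           # All-Active: aliasing
```

Because MACs point at an ESI rather than directly at next hops, withdrawing one Ethernet A-D route changes the resolution of every MAC on that segment at once, which is the essence of mass withdraw.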
8.1.5 DF Election

If a host is multi-homed to two or more NVEs on an Ethernet segment operating in all-active redundancy mode, then for a given EVI only one of these NVEs, termed the Designated Forwarder (DF), is responsible for sending it broadcast, multicast, and, if configured for that EVI, unknown unicast frames.

This is required in order to prevent duplicate delivery of multi-destination frames to a multi-homed host or VM in the case of all-active redundancy.

In NVEs where 802.1Q tagged frames are received from hosts, the DF election is performed on host VLAN IDs (VIDs). It is assumed that, for a given Ethernet Segment, VIDs are unique and consistent (e.g., no duplicate VIDs exist).

In GWs where VXLAN-encapsulated frames are received, the DF election is performed on VNIs. Again, it is assumed that, for a given Ethernet Segment, VNIs are unique and consistent (e.g., no duplicate VNIs exist).

8.2 Impact on EVPN BGP Routes & Attributes

Since multi-homing is supported in this scenario, the entire set of BGP routes and attributes defined in [RFC7432] is used. The setting of the Ethernet Tag field in the MAC Advertisement, Ethernet AD per EVI, and Inclusive Multicast routes follows that of section 5.1.3. Furthermore, the setting of the VNI field in the MAC Advertisement and Ethernet AD per EVI routes follows that of section 5.1.3.

8.3 Impact on EVPN Procedures

Two cases need to be examined here, depending on whether the NVEs are operating in Active/Standby or in All-Active redundancy.

First, let's consider the case of Active/Standby redundancy, where the hosts are multi-homed to a set of NVEs but only a single NVE is active at a given point in time for a given VNI.
In this case, aliasing is not required and split-horizon may not be required, but other functions such as multi-homed Ethernet segment auto-discovery, fast convergence and mass withdraw, backup path, and DF election are required.

Second, let's consider the case of All-Active redundancy. In this case, out of all the EVPN multi-homing features listed in section 8.1, the use of the VXLAN or NVGRE encapsulation impacts the split-horizon and aliasing features, since those two rely on the MPLS client layer. Given that this MPLS client layer is absent with these types of encapsulations, alternative procedures and mechanisms are needed to provide the required functions. Those are discussed in detail next.

8.3.1 Split Horizon

In EVPN, an MPLS label is used for split-horizon filtering to support All-Active multi-homing: an ingress NVE adds a label corresponding to the site of origin (the ESI label) when encapsulating the packet. The egress NVE checks the ESI label when attempting to forward a multi-destination frame out an interface, and if the label corresponds to the same site identifier (ESI) associated with that interface, the packet is dropped. This prevents the occurrence of forwarding loops.

Since the VXLAN or NVGRE encapsulation does not include this ESI label, other means of performing the split-horizon filtering function MUST be devised. The following approach is recommended for split-horizon filtering when VXLAN (or NVGRE) encapsulation is used.

Every NVE tracks the IP address(es) associated with the other NVE(s) with which it has shared multi-homed Ethernet Segments.
When the NVE receives a multi-destination frame from the overlay network, it examines the source IP address in the tunnel header (which corresponds to the ingress NVE) and filters out the frame on all local interfaces connected to Ethernet Segments that are shared with the ingress NVE. With this approach, it is required that the ingress NVE perform replication locally to all directly attached Ethernet Segments (regardless of the DF Election state) for all flooded traffic ingressing from the access interfaces (i.e., from the hosts). This approach is referred to as "Local Bias" and has the advantage that only a single IP address needs to be used per NVE for split-horizon filtering, as opposed to requiring an IP address per Ethernet Segment per NVE.

In order to prevent unhealthy interactions between the split-horizon procedures defined in [RFC7432] and the Local Bias procedures described in this document, mixing MPLS over GRE encapsulation on the one hand and VXLAN/NVGRE encapsulation on the other on a given Ethernet Segment is prohibited.

8.3.2 Aliasing and Backup-Path

The Aliasing and Backup-Path procedures for VXLAN/NVGRE encapsulation are very similar to the ones for MPLS. In the case of MPLS, two different Ethernet A-D routes are used for this purpose. The one used for Aliasing has a VPN scope (per EVI) and carries a VPN label, whereas the one used for Backup-Path has Ethernet segment scope (per ES) and does not carry any VPN-specific information (e.g., the Ethernet Tag and MPLS label are set to zero). In the case of VXLAN/NVGRE, the same two routes are used for Aliasing and Backup-Path. In the case of Aliasing, the Ethernet Tag and VNI fields in the Ethernet A-D per EVI route are set as described in section 5.1.3.
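The Local Bias procedure of section 8.3.1 can be sketched as two flooding decisions. In this hypothetical Python illustration, local interfaces map to an ESI (or None when single-homed), es_peers holds the other NVEs' IP addresses learned for each shared segment, and df_for lists the segments on which this NVE is the DF for the EVI. For segments not shared with the ingress NVE, the sketch assumes the usual DF filtering of [RFC7432] still applies, which section 8.3.1 does not restate.

```python
def flood_from_overlay(src_nve_ip, local_ifaces, es_peers, df_for):
    """Local interfaces that receive a BUM frame arriving from the overlay.

    src_nve_ip:   source IP address in the tunnel header (the ingress NVE)
    local_ifaces: {interface: esi, or None for a single-homed host}
    es_peers:     {esi: set of other NVE IPs sharing that ESI}
    df_for:       set of ESIs for which this NVE is the DF
    """
    out = []
    for iface, esi in sorted(local_ifaces.items()):
        if esi is None:
            out.append(iface)  # single-homed host: always deliver
        elif src_nve_ip in es_peers.get(esi, set()):
            continue           # Local Bias: ingress NVE already delivered it
        elif esi in df_for:
            out.append(iface)  # segment not shared with ingress: DF forwards
    return out


def flood_from_access(in_iface, local_ifaces):
    """Ingress NVE replicates to every other local interface, ignoring DF."""
    return [i for i in sorted(local_ifaces) if i != in_iface]
```

The filter keys only on the tunnel source IP address of the ingress NVE, which is why a single IP address per NVE suffices.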
9 Support for Multicast

The EVPN Inclusive Multicast BGP route is used to discover the multicast tunnels among the endpoints associated with a given EVI (e.g., a given VNI) for VLAN-based service and a given <EVI, VLAN> for VLAN-aware bundle service. The Ethernet Tag field of this route is set as described in section 5.1.3. The Originating Router's IP Address field is set to the NVE's IP address. This route is tagged with the PMSI Tunnel attribute, which is used to encode the type of multicast tunnel to be used as well as the multicast tunnel identifier. The tunnel encapsulation is encoded by adding the BGP Encapsulation extended community as per section 5.1.1. The following tunnel types as defined in [RFC6514] can be used in the PMSI Tunnel attribute for VXLAN/NVGRE:

   + 3 - PIM-SSM Tree
   + 4 - PIM-SM Tree
   + 5 - BIDIR-PIM Tree
   + 6 - Ingress Replication

Except for Ingress Replication, this multicast tunnel is used by the PE originating the route for sending multicast traffic to other PEs, and is used by PEs that receive this route for receiving the traffic originated by hosts connected to the PE that originated the route.

In the scenario where the multicast tunnel is a tree, both the Inclusive and the Aggregate Inclusive variants may be used. In the former case, a multicast tree is dedicated to a VNI, whereas in the latter, a multicast tree is shared among multiple VNIs. This is done by having the NVEs advertise multiple Inclusive Multicast routes with different VNIs encoded in the Ethernet Tag field, but with the same tunnel identifier encoded in the PMSI Tunnel attribute.
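The Inclusive versus Aggregate Inclusive distinction can be sketched by grouping received Inclusive Multicast Ethernet Tag routes by their PMSI tunnel. In this hypothetical Python illustration each route is reduced to (originator IP, VNI, tunnel type, tunnel identifier); the tunnel type values are those listed above from [RFC6514], and any other value raises ValueError.

```python
from collections import defaultdict
from enum import IntEnum


class PmsiTunnelType(IntEnum):
    """PMSI Tunnel Types from [RFC6514] usable for VXLAN/NVGRE (section 9)."""
    PIM_SSM = 3
    PIM_SM = 4
    BIDIR_PIM = 5
    INGRESS_REPLICATION = 6


def group_imet_routes(routes):
    """Map (originator, tunnel type, tunnel id) -> set of VNIs carried.

    routes: iterable of (originator_ip, vni, tunnel_type, tunnel_id) tuples,
    one per received Inclusive Multicast Ethernet Tag route. Unsupported
    tunnel types raise ValueError when converted to PmsiTunnelType.
    """
    trees = defaultdict(set)
    for originator, vni, ttype, tunnel_id in routes:
        trees[(originator, PmsiTunnelType(ttype), tunnel_id)].add(vni)
    return dict(trees)


def ingress_replication_list(routes, vni):
    """NVEs to which a BUM frame for `vni` is replicated as unicast."""
    return sorted(orig for orig, v, ttype, _ in routes
                  if v == vni and ttype == PmsiTunnelType.INGRESS_REPLICATION)
```

A tunnel key that carries several VNIs is an Aggregate Inclusive tree. With Ingress Replication, the flood list for a VNI is simply the set of NVEs that advertised that VNI with tunnel type 6.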
10 Data Center Interconnections - DCI

For DCI, the following two main scenarios are considered when connecting data centers running evpn-overlay (as described here) over an MPLS/IP core network:

   - Scenario 1: DCI using GWs
   - Scenario 2: DCI using ASBRs

The following two subsections describe the operations for each of these scenarios.

10.1 DCI using GWs

This is the typical scenario for interconnecting data centers over a WAN. In this scenario, EVPN routes are terminated and processed in each GW. MAC/IP routes are always re-advertised from DC to WAN, but from WAN to DC they are not re-advertised if the unknown MAC route (and default IP route) are used in the NVEs. In this scenario, each GW maintains a MAC-VRF (and/or IP-VRF) for each EVI. The main advantage of this approach is that NVEs do not need to maintain MAC and IP addresses from any remote data centers when the default IP route and unknown MAC route are used; i.e., they only need to maintain routes that are local to their own DC. When the default IP route and unknown MAC route are used, any unknown IP and MAC packets from NVEs are forwarded to the GWs, where all the VPN MAC and IP routes are maintained. This approach reduces the size of the MAC-VRF and IP-VRF significantly at NVEs. Furthermore, it results in a faster convergence time upon a link or NVE failure in a multi-homed network or device redundancy scenario, because the failure-related BGP routes (such as mass withdraw messages) do not need to be propagated all the way to the remote NVEs in the remote DCs. This approach is described in detail in section 3.4 of [DCI-EVPN-OVERLAY].
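The GW-based model can be sketched from the perspective of an NVE and a GW. In this hypothetical Python illustration, the unknown MAC route is modeled simply as a list of GW next hops used whenever a destination MAC is absent from the NVE's MAC-VRF; the actual route encoding is specified in [DCI-EVPN-OVERLAY], not here.

```python
def forward_unicast(dst_mac, mac_vrf, unknown_mac_next_hops):
    """Pick next hop(s) for a unicast frame on an NVE in the GW model.

    mac_vrf:               {mac: next-hop NVE IP}, MACs local to this DC only
    unknown_mac_next_hops: GW IPs that advertised the unknown MAC route
                           (hypothetical model); empty if none was received
    Returns a list of next hops; an empty list means flood as unknown unicast.
    """
    nh = mac_vrf.get(dst_mac)
    if nh is not None:
        return [nh]                       # MAC learned within this DC
    return sorted(unknown_mac_next_hops)  # send to the DC's GW(s)


def gw_readvertise(route_origin, use_unknown_mac_route):
    """Whether a GW re-advertises a MAC/IP route received from route_origin."""
    if route_origin == "DC":
        return True                       # DC -> WAN: always re-advertised
    return not use_unknown_mac_route      # WAN -> DC: suppressed with UMR
```

The NVE's table holds only MACs from its own DC, which is the source of both the table-size reduction and the faster convergence described above.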
10.2 DCI using ASBRs

This approach can be considered the opposite of the first approach. It favors simplification at DCI devices over NVEs, such that larger MAC-VRF (and IP-VRF) tables need to be maintained on NVEs, whereas DCI devices do not need to maintain any MAC (and IP) forwarding tables. Furthermore, DCI devices do not need to terminate and process routes related to multi-homing, but rather relay these messages for the establishment of an end-to-end LSP. In other words, DCI devices in this approach operate similarly to ASBRs for inter-AS option B. This requires locally assigned VNIs to be used just like downstream-assigned MPLS VPN labels, where for all practical purposes the VNIs function like 24-bit VPN labels. This approach is equally applicable to data centers (or Carrier Ethernet networks) with MPLS encapsulation.

In inter-AS option B, when an ASBR receives an EVPN route from its DC over iBGP and re-advertises it to other ASBRs, it re-advertises the EVPN route by rewriting the BGP next hop to itself, thus losing the identity of the PE that originated the advertisement. This rewrite of the BGP next hop adversely impacts the EVPN Mass Withdraw route (Ethernet A-D per ES) and its procedure. However, it does not impact the EVPN Aliasing mechanism/procedure, because when the Aliasing routes (Ethernet A-D per EVI) are advertised, the receiving PE first resolves a MAC address for a given EVI into its corresponding <ES, EVI> and subsequently resolves the <ES, EVI> into multiple paths (and their associated next hops) via which the <ES, EVI> is reachable.
Since Aliasing and MAC routes are both advertised on a per-EVI basis and use the same RD and RT (per EVI), the receiving PE can associate them together on a per-BGP-path basis (e.g., per originating PE) and thus perform recursive route resolution; e.g., a MAC is reachable via an <ES, EVI>, which in turn is reachable via a set of BGP paths; thus, the MAC is reachable via that set of BGP paths. Since, on a per-EVI basis, the association of MAC routes and the corresponding Aliasing route is fixed and determined by the same RD and RT, there is no ambiguity when the BGP next hop for these routes is rewritten as these routes pass through ASBRs; i.e., the receiving PE may receive multiple Aliasing routes for the same EVI from a single next hop (a single ASBR), and it can still create multiple paths toward that <ES, EVI>.

However, when the BGP next-hop address corresponding to the originating PE is rewritten, the association between the Mass Withdraw route (Ethernet A-D per ES) and its corresponding MAC routes cannot be made based on their RDs and RTs, because the RD for the Mass Withdraw route is different from the one for the MAC routes. Therefore, the functionality needed at the ASBRs and the receiving PEs depends on whether the Mass Withdraw route is originated and whether there is a need to handle route resolution ambiguity for this route. The following two subsections describe the functionality needed by the ASBRs and the receiving PEs depending on whether the NVEs reside in hypervisors or in ToRs.

10.2.1 ASBR Functionality with NVEs in Hypervisors

When NVEs reside in hypervisors as described in section 7.1, there is no multi-homing, and thus there is no need for the originating NVE to send Ethernet A-D per ES or Ethernet A-D per EVI routes.
However, as noted in section 7, in order to enable a single-homing ingress NVE to take advantage of fast convergence, aliasing, and backup-path when interacting with multi-homed egress NVEs attached to a given Ethernet segment, the single-homing NVE SHOULD be able to receive and process Ethernet AD per ES and Ethernet AD per EVI routes. The handling of these routes is described in the next section.

10.2.2 ASBR Functionality with NVEs in ToRs

When NVEs reside in ToRs and operate in multi-homing redundancy mode, then, as described in section 8, there is a need for the originating NVE to send Ethernet A-D per ES route(s) (used for mass withdraw) and Ethernet A-D per EVI routes (used for aliasing). As described above, the rewrite of the BGP next hop by ASBRs creates ambiguities when Ethernet A-D per ES routes are received by a remote NVE in a different AS, because the receiving NVE cannot associate that route with the MAC/IP routes of that Ethernet Segment advertised by the same originating NVE. This ambiguity inhibits the function of mass-withdraw per ES by the receiving NVE in a different AS.

As an example, consider a scenario where a CE is multi-homed to PE1 and PE2, and these PEs are connected via ASBR1 and then ASBR2 to the remote PE3, as shown in Figure 1. Furthermore, consider that PE1 receives M1 from the CE but PE2 does not. Therefore, PE1 advertises Ethernet A-D per ES1, Ethernet A-D per EVI1, and M1, whereas PE2 only advertises Ethernet A-D per ES1 and Ethernet A-D per EVI1. ASBR1 receives all five of these advertisements and passes them to ASBR2 (with itself as the BGP next hop). ASBR2, in turn, passes them to the remote PE3 with itself as the BGP next hop. PE3 receives these five routes, all of which have the same BGP next hop (i.e., ASBR2). Furthermore, the two Ethernet A-D per ES routes received by PE3 have the same information, i.e., the same ESI and the same BGP next hop.
Although both of these routes are maintained by the BGP process in PE3 (because they have different RDs and are thus treated as different BGP routes), information from only one of them is used in the L2 routing table (L2 RIB).

      PE1
     /   \
   CE     ASBR1---ASBR2---PE3
     \   /
      PE2

   Figure 1: Inter-AS Option B

Now, when the AC between PE2 and the CE fails, PE2 sends an NLRI withdrawal for its Ethernet A-D per ES route. When this withdrawal is propagated to and received by PE3, the BGP process in PE3 removes the corresponding BGP route; however, it does not remove the associated information (namely, the ESI and BGP next hop) from the L2 RIB, because it still has the other Ethernet A-D per ES route (originated by PE1) with the same information. That is why the mass-withdraw mechanism does not work when doing DCI with inter-AS option B. However, as described previously, the aliasing function works, and so does "mass-withdraw per EVI" (which is associated with withdrawing the EVPN route associated with Aliasing, i.e., the Ethernet A-D per EVI route).

In the above example, PE3 receives two Aliasing routes with the same BGP next hop (ASBR2) but different RDs. One of the Aliasing routes has the same RD as the advertised MAC route (M1). Upon receiving the two Aliasing routes, PE3 follows the route resolution procedure specified in [RFC7432]; i.e., it resolves M1 to <ES1, EVI1>, and subsequently it resolves <ES1, EVI1> to a BGP path list with two paths along with the corresponding VNIs/MPLS labels (one associated with PE1 and the other associated with PE2). It should be noted that even though both paths are advertised by the same BGP next hop (ASBR2), the receiving PE3 can handle them properly. Therefore, M1 is reachable via two paths.
This creates two end-to-end LSPs from PE3 for M1, one toward PE1 and one toward PE2, such that when PE3 wants to forward traffic destined to M1, it can load-balance between the two paths. Although route resolution for Aliasing routes with the same BGP next hop is not explicitly mentioned in [RFC7432], this is the expected operation, and thus it is elaborated here.

When the AC between PE2 and the CE fails and PE2 sends an NLRI withdrawal for its Ethernet A-D per EVI routes, and these withdrawals are propagated to and received by PE3, PE3 removes the Aliasing route and updates the path list; i.e., it removes the path corresponding to PE2. Therefore, all the corresponding MAC routes for that <ES1, EVI1> that point to that path list will now have the updated path list with a single path associated with PE1. This action can be considered the mass-withdraw at the per-EVI level. The mass-withdraw at the per-EVI level has a longer convergence time than the mass-withdraw at the per-ES level; however, it is much faster than the convergence time when the withdrawal is done on a per-MAC basis.

In summary, aliasing (and backup path) functionality should work as is for inter-AS option B without requiring any additional functionality in ASBRs or PEs. However, the mass-withdraw functionality falls back from per-ES mode to per-EVI mode for inter-AS option B; i.e., PEs receiving the mass-withdraw route from the same AS use the Ethernet A-D per ES route, whereas PEs receiving the mass-withdraw route from a different AS use the Ethernet A-D per EVI route.

11 Acknowledgements

The authors would like to thank Aldrin Isaac, David Smith, John Mullooly, and Thomas Nadeau for their valuable comments and feedback. The authors would also like to thank Jakob Heitz for his contribution to section 10.2.
12 Security Considerations

This document uses IP-based tunnel technologies to support data plane transport. Consequently, the security considerations of those tunnel technologies apply. This document defines support for VXLAN and NVGRE encapsulations. The security considerations from those documents ([RFC7348] and [NVGRE]), as well as [RFC4301], apply to the data plane aspects of this document.

As with [RFC5512], any modification of the information that is used to form encapsulation headers, to choose a tunnel type, or to choose a particular tunnel for a particular payload type may lead to user data packets getting misrouted, misdelivered, and/or dropped.

More broadly, the security considerations for the transport of IP reachability information using BGP are discussed in [RFC4271] and [RFC4272], and are equally applicable to the extensions described in this document.

If the integrity of the BGP session is not itself protected, then an imposter could mount a denial-of-service attack by establishing numerous BGP sessions and forcing an IPsec SA to be created for each one. However, as such an imposter could wreak havoc on the entire routing system, this particular sort of attack is probably not of any special importance.

It should be noted that a BGP session may itself be transported over an IPsec tunnel. Such IPsec tunnels can provide additional security to a BGP session. The management of such IPsec tunnels is outside the scope of this document.
13 IANA Considerations

IANA has allocated the following BGP Tunnel Encapsulation Attribute Tunnel Types:

   8    VXLAN Encapsulation
   9    NVGRE Encapsulation
   10   MPLS Encapsulation
   11   MPLS in GRE Encapsulation
   12   VXLAN GPE Encapsulation

14 References

14.1 Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC4023] Worster, T., Rekhter, Y., and E. Rosen, Ed., "Encapsulating MPLS in IP or Generic Routing Encapsulation (GRE)", RFC 4023, March 2005.

[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006.

[RFC4272] Murphy, S., "BGP Security Vulnerabilities Analysis", RFC 4272, January 2006.

[RFC4301] Kent, S. and K. Seo, "Security Architecture for the Internet Protocol", RFC 4301, December 2005.

[RFC5512] Mohapatra, P. and E. Rosen, "The BGP Encapsulation Subsequent Address Family Identifier (SAFI) and the BGP Tunnel Encapsulation Attribute", RFC 5512, April 2009.

[RFC6514] Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP Encoding and Procedures for Multicast in MPLS/BGP IP VPNs", RFC 6514, February 2012.

[RFC7432] Sajassi, A., Ed., et al., "BGP MPLS-Based Ethernet VPN", RFC 7432, February 2015.

14.2 Informative References

[RFC7209] Sajassi, A., et al., "Requirements for Ethernet VPN (EVPN)", RFC 7209, May 2014.

[RFC7348] Mahalingam, M., et al., "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014.

[NVGRE] Garg, P., et al., "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre-07, work in progress, November 11, 2014.

[Problem-Statement] Narten, T., et al., "Problem Statement: Overlays for Network Virtualization", draft-ietf-nvo3-overlay-problem-statement-01, work in progress, September 2012.

[L3VPN-ENDSYSTEMS] Marques, P., et al., "BGP-signaled End-system IP/VPNs", draft-ietf-l3vpn-end-system, work in progress, October 2012.

[NOV3-FRWK] Lasserre, M., et al., "Framework for DC Network Virtualization", draft-ietf-nvo3-framework-01, work in progress, October 2012.
[DCI-EVPN-OVERLAY] Rabadan, J., et al., "Interconnect Solution for EVPN Overlay networks", draft-ietf-bess-dci-evpn-overlay-02, work in progress, February 29, 2016.

[TUNNEL-ENCAP] Rosen, E., et al., "The BGP Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-encaps-02, work in progress, May 31, 2016.

Contributors

   S. Salam, K. Patel, D. Rao, S. Thoria, D. Cai (Cisco)
   Y. Rekhter, R. Shekhar, Wen Lin, Nischal Sheth (Juniper)
   L. Yong (Huawei)

Authors' Addresses

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   John Drake
   Juniper Networks
   Email: jdrake@juniper.net

   Nabil Bitar
   Nokia
   Email: nabil.bitar@nokia.com

   R. Shekhar
   Juniper
   Email: rshekhar@juniper.net

   James Uttaro
   AT&T
   Email: uttaro@att.com

   Wim Henderickx
   Alcatel-Lucent
   Email: wim.henderickx@nokia.com