Internet Engineering Task Force                          R. Sharma, Ed.
Internet-Draft                                               A. Banerjee
Intended status: Standards Track                             R. Sivaramu
Expires: January 18, 2018                                     A. Sajassi
                                                           Cisco Systems
                                                           July 17, 2017


           Multi-site EVPN based VXLAN using Border Gateways
                    draft-sharma-multi-site-evpn-03

Abstract

   This document describes the procedures for interconnecting two or
   more BGP-based Ethernet VPN (EVPN) sites in a scalable fashion over
   an IP-only network.  The motivation is to support extension of EVPN
   sites without having to rely on typical Data Center Interconnect
   (DCI) technologies such as MPLS/VPLS for the interconnection.  The
   requirements for such a deployment are very similar to the ones
   specified in RFC 7209, "Requirements for Ethernet VPN (EVPN)".

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 18, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Multi-Site EVPN Overview
     3.1.  MS-EVPN Interconnect Requirements
     3.2.  MS-EVPN Interconnect Concept and Framework
   4.  Multi-site EVPN Interconnect Procedures
     4.1.  Border Gateway Discovery
     4.2.  Border Gateway Provisioning
       4.2.1.  Border Gateway Designated Forwarder Election
       4.2.2.  Anycast Border Gateway
       4.2.3.  Multi-path Border Gateway
     4.3.  EVPN Route Processing at the Border Gateway
     4.4.  Multi-Destination Tree between Border Gateways
     4.5.  Inter-site Unicast Traffic
     4.6.  Inter-site Multi-destination Traffic
     4.7.  Host Mobility
   5.  Convergence
     5.1.  Fabric to Border Gateway Failure
     5.2.  Border Gateway to Border Gateway Failures
   6.  Interoperability
   7.  Isolation of Fault Domains
   8.  Loop Detection and Prevention
   9.  MVPN with Multi-site EVPN
     9.1.  Inter-Site MI-PMSI
     9.2.  Stitching of Customer Multicast Trees across Sites
     9.3.  RPF Resolution in the Anycast Border Gateway Model
     9.4.  Inter-Site S-PMSI
   10. Acknowledgements
   11. IANA Considerations
   12. Security Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  Additional Stuff
   Authors' Addresses

1.  Introduction

   BGP-based Ethernet VPNs (EVPNs) are used to support various VPN
   topologies; the motivation and requirements are discussed in detail
   in RFC 7209 [RFC7209].  EVPN has been used to provide a Network
   Virtualization Overlay (NVO) solution with a variety of tunnel
   encapsulation options over IP, as described in [DCI-EVPN-OVERLAY].
   The use of EVPN for Data Center Interconnect (DCI) at the WAN edge
   is also discussed in [DCI-EVPN-OVERLAY].  The EVPN DCI procedures
   are defined for IP and MPLS hand-off at the site boundaries.

   In current EVPN deployments, there is a need to segment the EVPN
   domains within a Data Center (DC), primarily because of the service
   architecture and the scaling requirements around it.
   The number of routes, tunnel end-points, and next-hops needed in the
   DC is larger than what some of the deployed hardware elements can
   support.  Network operators would like the option of building
   smaller sites within the data center, if they so desire, without
   having to deploy traditional DCI technologies to interconnect them.
   In essence, they want smaller multi-site EVPN domains over an IP
   backbone.

   Network operators today use the Virtual Network Identifier (VNI) to
   designate a service.  However, for administrative reasons they would
   like this service to be available to a smaller set of nodes within
   the DC; in essence, they want to break up the EVPN domain into
   multiple smaller sites.  A smaller footprint for these EVPN sites
   also means that the fault isolation domains are more constrained.
   It is also feasible to re-use the VNI space across these sites if
   desired.  These motivations for smaller multi-site EVPN domains are
   over and above the ones already detailed in RFC 7209 [RFC7209].

   In this document we focus primarily on the VXLAN encapsulation for
   EVPN deployments.  We assume that the underlay provides simple IP
   connectivity.  We describe the IP/VXLAN hand-off mechanisms used to
   interconnect these smaller sites within the data center itself.  We
   refer to this deployment model as a scalable multi-site EVPN
   (MS-EVPN) deployment.  The procedures described here go into
   substantial detail on interconnecting L2 and L3, unicast and
   multicast domains across multiple EVPN sites.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Terminology

   o  Border Gateway (BG): The node that interacts both with nodes
      within a site and with nodes external to the site.  For example,
      in a leaf-spine data center fabric, it can be a leaf, a spine, or
      a separate device acting as a gateway to interconnect the sites.

   o  Anycast Border Gateway: A virtual set of Border Gateways (or
      next-hops) sharing a common address and acting as multiple
      entry/exit points for a site.

   o  Multi-path Border Gateway: A virtual set of Border Gateways with
      unique addresses (or next-hops) acting as multiple entry/exit
      points for a site.

   o  A-D: Auto-Discovery.

3.  Multi-Site EVPN Overview

   In this section we describe the motivation, requirements, and
   framework of the multi-site EVPN enhancements.

3.1.  MS-EVPN Interconnect Requirements

   In this section we discuss the requirements and motivation for
   interconnecting different EVPN sites within a data center.  In
   general, any interconnect technology has the following requirements:

   a.  Scalability: Multi-Site EVPN (MS-EVPN) should be able to
       interconnect multiple sites in a scalable fashion.  In other
       words, interconnecting such sites should not lead to one giant
       fabric with a full mesh of end-to-end VXLAN tunnels across leafs
       in different sites, which leads to scale issues in managing a
       large number of tunnel end-points and tunnel next-hops.
       A huge flat fabric also rules out the option of ingress
       replication (IR) trees, as the number of replications becomes
       practically unachievable due to the internal bandwidth needed in
       hardware.

   b.  Multi-destination traffic over a unicast-only cloud: MS-EVPN
       mechanisms should provide an efficient forwarding mechanism for
       multi-destination frames even if the underlay inter-site network
       is not capable of forwarding multicast frames.  This requirement
       ensures that the solution places no additional constraints on
       the IP network, allowing existing network elements to be used
       as-is.

   c.  Maintain site-specific administrative control: The MS-EVPN
       technology should be able to interconnect fabrics from different
       administrative domains.  Different sites may have different
       VLAN-VNI mappings, use different underlay routing protocols,
       and/or have different PIM-SM group ranges.  The technology
       should not impose any additional constraints on the various
       administrative domains.

   d.  Co-existence with other VPN address families: MS-EVPN should
       coexist with other VPN address families such as L3VPN or MVPN.
       As each VPN address family provides a different set of services,
       the MS-EVPN architecture should allow such VPN address families
       to coexist in the same topology.

   e.  Isolation of fault domains: The MS-EVPN hand-off should be
       capable of isolating traffic across site boundaries and
       preventing defects from percolating from one site to another.
       For example, a broadcast storm in one site should not lead to a
       meltdown of all other sites.

   f.  Loop detection and prevention: In scenarios where flood domains
       are stretched across fabrics, the interconnected sites are very
       vulnerable to loops and flood storms.  There is a need to
       provide comprehensive loop detection and prevention
       capabilities.

   g.  Plug-and-play and extensibility: Adding new sites or increasing
       the capacity of existing sites should be achievable in a
       completely plug-and-play fashion.  This essentially means that
       all control plane and forwarding state (L2 or L3 interconnect)
       should be built in downstream allocation mode.  MS-EVPN should
       not impose any maximum limits on scale and capacity; it should
       be easily extensible along those metrics.

3.2.  MS-EVPN Interconnect Concept and Framework

   EVPN with an IP-only interconnect is conceptualized as multiple
   site-local EVPN control planes and IP forwarding domains
   interconnected via a single common EVPN control and IP forwarding
   domain.  Every EVPN node is identified by a site-scope identifier.
   A site-local EVPN domain consists of EVPN nodes with the same site
   identifier.  Border Gateways are, on one hand, part of the
   site-specific EVPN domain and, on the other hand, part of a common
   EVPN domain that interconnects them with the Border Gateways of
   other sites.  Although a Border Gateway has only a single explicit
   site-id (that of the site it is a member of), it can be considered
   to also have a second, implicit site-id: that of the interconnect
   domain, whose membership consists of all the BGs from all the sites
   being interconnected.  This implicit membership is derived from the
   presence of the Border A-D route announced by that Border Gateway
   node (refer to Section 4.1 for details of the route format).
   These Border Gateways discover each other through EVPN Border A-D
   routes and act as both control plane and forwarding plane gateways
   across sites.  This allows site-specific nodes to see all other
   sites as reachable only via their Border Gateways.

            ____________________________
           | ooo    Encapsulation tunnel|
           | X X X  Leaf-spine fabric   |
           |____________________________|

      Site A (EVPN site A)              Site B (EVPN site B)
      ___________________________       ____________________________
     | X X X X        X X X X   |      | X X X X        X X X X    |
     |     X X          X X     |      |     X X          X X      |
     |      o            o      |      |      o            o       |
     |BG-1 Site A    BG-2 Site A|      |BG-1 Site B    BG-2 Site B |
      ___________________________       ____________________________
          o            o                    o            o
           o            o                  o            o
            o            o                o            o
             o            o              o            o
            _______________________________________________
           |                                               |
           |                                               |
           |         Inter-site common EVPN site           |
           |                                               |
           |                                               |
            _______________________________________________
                          o            o
                          o            o
                          o            o
                          o            o
                 ___________________________
                | BG-1 Site C    BG-2 Site C|
                |      X X          X X     |
                | X X X X        X X X X    |
                 ___________________________

                      Site C (EVPN site C)

                              Figure 1

   We describe the MS-EVPN deployment model using the topology above.
   In the topology there are three sites, Site A, Site B, and Site C,
   that are interconnected using IP.  The entire topology is deemed to
   be part of the same data center.  In most deployments these sites
   can be thought of as pods, which may span a rack, a row, or multiple
   rows in the data center, depending on the size of the domain desired
   for scale and for fault and/or administrative isolation.

   In this topology, site-local nodes are connected to each other by
   iBGP EVPN peering, and Border Gateways are connected by eBGP
   multi-hop EVPN peering via the inter-site cloud.  We explicitly
   spell this out to ensure that we can re-use BGP semantics of route
   announcement between and across the sites.  There are other BGP
   mechanisms to instantiate this; they are not discussed in this
   document.  This implies that each domain has its own AS number
   associated with it.  In the topology, only two Border Gateways per
   site are shown; this is for ease of illustration and explanation,
   and the technology poses no such limitation.  As mentioned earlier,
   a site-specific EVPN domain consists of only the site-local nodes in
   that site.  A Border Gateway is logically partitioned into a
   site-specific EVPN domain towards the site and a common EVPN domain
   towards the other sites.  This allows it to act as a control plane
   and forwarding plane gateway for traffic across sites.

   EVPN nodes within a site discover each other via regular EVPN
   procedures and build site-local bidirectional VXLAN tunnels and
   multi-destination trees from the leaves to the Border Gateways.
   Border Gateways discover each other via A-D routes with unique site
   identifiers (as described in Section 4.1) and build inter-site
   bidirectional VXLAN tunnels and multi-destination trees between
   them.  We thus build an end-to-end bidirectional forwarding path
   across all sites by stitching (rather than stretching end-to-end)
   site-local VXLAN tunnels with inter-site VXLAN tunnels.

   In essence, an MS-EVPN fabric is built in a completely downstream
   and modular fashion.

   o  Site-local tenant bridging domains are interconnected ONLY via
      Border Gateways with bridging domains from other sites.
      Such an interconnect does not assume uniform MAC-VRF VNI-to-VLAN
      mappings across sites and stitches the bridging domains in a
      completely downstream fashion using EVPN route advertisements.

   o  Site-local tenant routing domains are interconnected ONLY via
      Border Gateways with routing domains from other sites.  Such an
      interconnect does not assume uniform IP VRF-to-VNI mappings
      across sites and stitches the routing domains in a completely
      downstream fashion using EVPN route advertisements.

   o  Site-local tenant flood domains are interconnected ONLY via
      Border Gateways with flood domains from other sites.  Such an
      interconnect does not assume uniform MAC-VRF VNI mappings across
      sites (or uniform mechanisms to build flood domains within a
      site) and stitches the flood domains in a completely downstream
      fashion using EVPN route advertisements.  It does not, however,
      exclude the possibility of building an end-to-end flood domain if
      desired for other reasons.

   o  Site-local tenant multicast domains are interconnected ONLY via
      Border Gateways, by stitching tenant multicast trees at the
      Border Gateways.

   o  There are potential use cases where Border Gateways should act as
      gateways for a subset of the VXLAN tunnels (or VNIs) and as an
      underlay pass-through for the rest.  In other words, an MS-EVPN
      fabric can be built by stitching VXLAN tunnels at the Border
      Gateways while allowing other VXLAN tunnels (or VNIs) to pass
      through the Border Gateways as native L3 underlay traffic.  The
      procedures defined here provide the flexibility to accommodate
      such use cases.

   The above architecture satisfies the requirements laid out in
   Section 3.1.  For example, the size of a domain may be chosen based
   on the route and next-hop scale that the deployed network nodes can
   support.  There are no constraints on the network that connects the
   nodes within a domain or across domains.  If multicast capability is
   available and enabled, the nodes can use it.  If the underlay
   provides only unicast connectivity, ingress replication lists ensure
   that multi-destination frames reach their destinations.  The domains
   may have their own deployment constraints, and the overlay does not
   need any form of stretching.  Containment of fault isolation domains
   remains within the control of the administrator.  The automated
   discovery of the border nodes needs no further configuration for
   existing deployed domains.

4.  Multi-site EVPN Interconnect Procedures

   In this section we describe the new functionality on the Border
   Gateway nodes for interconnecting EVPN sites within the DC.

4.1.  Border Gateway Discovery

   Border Gateway discovery facilitates termination and re-origination
   of inter-site VXLAN tunnels.  Such discovery provides the
   flexibility for intra-site leaf-to-leaf VXLAN tunnels to co-exist
   with inter-site tunnels terminating on Border Gateways.  In other
   words, Border Gateway discovery facilitates the learning of VXLAN
   tunnel termination points while still allowing the Border Gateways
   to behave as native L3 transit for other VXLAN tunnels.

   Border Gateways leverage the Type-1 A-D route defined in RFC 7432
   [RFC7432].  Border Gateways in different sites use Type-1 A-D routes
   with unique site identifiers to announce themselves as "Borders" to
   other Border Gateways.
   Nodes within the same site MUST be configured with, or
   auto-generate, the same site identifier.  Nodes that are not
   configured as border nodes build VXLAN tunnels only to the other
   members of their site (which they are aware of due to the site
   identifier that is additionally announced by them).  Border nodes
   additionally build VXLAN tunnels to other border nodes announced
   with a different site identifier.  Note that the site identifier is
   encoded within the ESI itself, as described below.

   In this specification, we define a new Ethernet Segment Type (as
   described in Section 5 of RFC 7432 [RFC7432]) that can be
   auto-generated or configured by the operator.

   o  Type 6 (T=0x06) - This type indicates a multi-site router-ID ESI
      Value that can be auto-generated or configured by the operator.
      The ESI Value is constructed as follows:

      *  Site Identifier (4 octets): The Site Identifier MUST be
         encoded in the high-order 4 octets of the ESI Value field.  It
         is used to announce the site to which a Border Gateway
         belongs.  All Border Gateways MUST announce this value.

      *  Reserved (5 octets): The low-order octets of the ESI Value are
         set to 0 and are ignored on receipt.

   Along with the Type-1 A-D routes, border nodes MUST announce an ESI
   Label extended community with such A-D routes.  They also announce
   Type-4 Ethernet Segment routes with the ESI Label extended community
   (defined in Section 7.5 of RFC 7432 [RFC7432] and shown below in
   Figure 2) in order to perform the Designated Forwarder election
   among the Border Gateways of the same site.  These Type-4 routes and
   the ESI Label extended community carry a new bit in the Flags field
   to indicate that the DF election is for Border Gateways, as opposed
   to the traditional Ethernet Segment DF election.  Routes with this
   bit set are generated only by Border Gateways and are imported by
   all site-local leafs, site-local Border Gateways, and inter-site
   Border Gateways.

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type=0x06   | Sub-Type=0x01 | Flags(1 octet)|  Reserved=0   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Reserved=0   |                  ESI Label                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                Figure 2

   The lowest-order bit of the Flags octet in the ESI Label extended
   community is already defined to address multihoming with the
   Single-Active or All-Active redundancy mode.  In this specification,
   we define the second lowest-order bit of the Flags octet.  It MUST
   be set to 1 by a Border Gateway node if it is willing to take part
   in the DF election for the VNI carried in the associated ESI Label.

   Type-4 Ethernet Segment routes with the ESI Label extended community
   are leveraged to perform the Designated Forwarder election among the
   Border Gateways of the same site.  The ESI Label extended community
   is encoded in the same way as described above for the Type-1 A-D
   routes.  The Site Identifier encoded in the ESI helps the Border
   Gateways negotiate the DF winner within a site and ignore Type-4
   routes from other sites.

   These Type-1 A-D routes are advertised with MAC-VRF or IP-VRF Route
   Targets, depending on whether the VNI carried is a MAC-VRF VNI or an
   IP-VRF VNI.
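   The following Python fragment is a non-normative sketch of the
   encodings described above; the function names are illustrative only.
   It builds the proposed Type-6 ESI Value (Site Identifier in the
   high-order 4 octets, remaining octets zero) and an ESI Label
   extended community with the second lowest-order Flags bit set, under
   the assumption that the 3-octet label field carries the VXLAN VNI
   directly.

      import struct

      def build_type6_esi(site_id):
          # 10-octet ESI: Type 0x06, 4-octet Site Identifier, 5 reserved
          # zero octets (ignored on receipt).
          if not 0 <= site_id <= 0xFFFFFFFF:
              raise ValueError("Site Identifier must fit in 4 octets")
          return struct.pack("!BI", 0x06, site_id) + b"\x00" * 5

      SINGLE_ACTIVE_BIT = 0x01   # lowest-order Flags bit, RFC 7432
      BORDER_DF_BIT     = 0x02   # second lowest-order bit proposed here

      def build_esi_label_ext_community(vni, border_df=True):
          # 8 octets: Type 0x06, Sub-Type 0x01, Flags, two Reserved
          # octets, 3-octet label field (assumed to carry the VNI).
          flags = BORDER_DF_BIT if border_df else 0x00
          return (struct.pack("!BBBH", 0x06, 0x01, flags, 0)
                  + (vni & 0xFFFFFF).to_bytes(3, "big"))

      # Example: a Border Gateway in site 100 announcing VNI 10100.
      esi = build_type6_esi(100)
      ext_comm = build_esi_label_ext_community(10100)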
   After a Border Gateway is provisioned, it re-originates EVPN routes
   only after a delay interval.  This provides sufficient time to learn
   Border A-D routes from the Border Gateways of the other sites.
   Border Gateways do not build VXLAN tunnels between Border Gateway
   members of the same site.

   Once the Border Gateways are discovered, any Type-2/Type-5 routes
   are terminated and re-originated on the Border Gateways.  Similarly,
   Type-1, Type-3, and Type-4 routes from other sites are terminated at
   the Border Gateways.  (See also Section 8 for Type-1 handling for
   loop detection and prevention across sites.)

   As defined in the base specifications, Type-2, Type-3, and Type-5
   routes carry downstream VNI labels.  Type-3 routes help pre-build
   VXLAN tunnels in the common EVPN domain for L2, L3, and
   multi-destination traffic.  Because Border Gateway discovery is
   agnostic to symmetric or downstream VNI provisioning, rewriting the
   next-hop attribute before re-advertising these routes from other
   sites into a given site makes it possible to keep different VNI-VLAN
   mappings in different sites and still interconnect the L3 and L2
   domains.

   All control plane and data plane state is interconnected in a
   completely downstream fashion.  For example, the BGP import rules
   for a Type-3 route should be able to extend a flood domain for a
   VNI, and flood traffic destined to the advertising EVPN node should
   carry the VNI announced in that Type-3 route.  Similarly, Type-2 and
   Type-5 control and forwarding state should be interconnected in a
   completely downstream fashion.

4.2.  Border Gateway Provisioning

   Border Gateway nodes handle both the control plane communication and
   the data plane forwarding for all inter-site traffic.  Border
   Gateway functionality in an EVPN site SHOULD be enabled on more than
   one node in the network for redundancy and high availability.  Any
   external Type-2/Type-5 routes received by the BGs of a site are
   advertised to all the intra-site nodes by all the BGs.  Internal
   Type-2/Type-5 routes received by the BGs from the intra-site nodes
   are advertised by all the BGs of the site to the remote BGs, so any
   known L2/L3 unicast traffic to internal destinations can be sent by
   remote sources to any one of the local BGs.  For known L2 and L3
   unicast traffic, the individual Border Gateway nodes behave either
   as a single logical forwarding node or as a set of active forwarding
   nodes.  Intra-site nodes therefore perceive multiple entry/exit
   points for inter-site traffic.  For unknown unicast and
   multi-destination traffic, there must be a Designated Forwarder
   election mechanism to determine which node performs the primary
   forwarding role at any given point in time, ensuring that no traffic
   is duplicated for any given flow (see Section 4.2.1).

4.2.1.  Border Gateway Designated Forwarder Election

   When more than one Border Gateway node is present in a site,
   forwarding of multi-destination L2 or L3 traffic, both into the site
   and out of the site, needs to be carried out by a single node.
   Border Gateways of the same site run a Designated Forwarder election
   per MAC-VRF VNI for multi-destination traffic for that site.
   Border Type-4 Ethernet Segment routes received from Border Gateways
   of different sites do not trigger a DF election on the local side;
   they are only used to terminate VXLAN tunnels from those Border
   Gateways.

   The Border Gateway DF election leverages the Type-4 EVPN route and
   the Ethernet Segment DF election defined in RFC 7432 [RFC7432].  The
   Ethernet Segment and the ESI Label extended community are encoded as
   explained in the Border Gateway discovery procedures.  The ESI Label
   extended community MUST be announced with such routes.  The DF
   election ignores routes announced by Border Gateways that carry a
   different site identifier value.

   This DF election can be done independently by each candidate Border
   Gateway, by subjecting an ordered "candidate list" of all the BGs
   present in the same site (identified by reception of the Border
   Type-4 Ethernet Segment route per VNI with the same site-id as its
   own) to a hash function on a per-VNI basis.  All the candidate
   Border Gateways of the same site are required to use the same hash
   function so that they arrive at the same result.  Failure events
   that cause a BG to lose all of its connectivity to the IP
   interconnect backbone should trigger the BG to withdraw its Border
   Type-4 Ethernet Segment route(s) and its Type-1 A-D route,
   indicating to the other BGs of the same site that it is no longer a
   candidate BG and to the BGs of other sites that it is no longer a
   Border Gateway.

   Two modes are proposed for Border Gateway provisioning.

4.2.2.  Anycast Border Gateway

   In this mode, all Border Gateways share the same gateway IP address
   and rewrite the EVPN next-hop attribute with this shared logical
   next-hop entity.  However, each gateway also maintains a unique
   gateway IP address to facilitate building IR trees from site-local
   nodes for forwarding multi-destination traffic.  EVPN Type-2 and
   Type-5 routes are advertised to the nodes in the site from all
   Border Gateways, and the Border Gateways run a DF election per VNI
   for multi-destination traffic.  Type-3 routes may be advertised only
   by the DF winner Border Gateway for a given VNI so that only the DF
   receives and forwards inter-site traffic.  It is also possible for
   all Border Gateways at a site to advertise and draw traffic, to
   improve the convergence properties of the network.  In the case of
   multi-destination trees built by non-EVPN procedures (say PIM), all
   Border Gateways receive the traffic but only the DF winner forwards
   it.

   This mode is useful when there is no preference among the Border
   Gateways for forwarding traffic from different VNIs.  Standard data
   plane hashing of the VXLAN header load-balances traffic among the
   Border Gateways.

   Additionally, it is recommended that the Border Gateway be enabled
   in Anycast mode, wherein the BG functionality is available to the
   rest of the network as a single logical entity (as in Anycast) for
   inter-site communication.  In the absence of Anycast capability, the
   BGs can be enabled as individual gateways (Single-Active BG),
   wherein a single node performs the active BG role for a given flow
   at a given time.  At present, the Border Gateway system MAC of the
   other border nodes belonging to the same site is expected to be
   configured out-of-band.
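   The per-VNI DF election of Section 4.2.1, which both provisioning
   modes rely on, can be summarized by the following non-normative
   sketch.  It assumes a simple modulo hash over the ordered candidate
   list, in the spirit of the service-carving procedure of RFC 7432; an
   implementation is free to use any hash function as long as all
   Border Gateways of the site use the same one.

      def elect_df(candidate_bgs, vni):
          # candidate_bgs: addresses learned from Border Type-4 routes
          # of the same site (including this node).  Every BG builds
          # the same ordered list, so the modulo hash picks the same
          # winner on all of them without any extra signalling.
          ordered = sorted(candidate_bgs)
          return ordered[vni % len(ordered)]

      # Example: two Border Gateways in one site share the VNIs between
      # them; the withdrawal of one BG's routes re-runs the election
      # over a shorter list.
      site_bgs = ["192.0.2.1", "192.0.2.2"]
      assert elect_df(site_bgs, 10001) != elect_df(site_bgs, 10002)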
4.2.3.  Multi-path Border Gateway

   In this mode, Border Gateways rewrite the EVPN next-hop attribute
   with their own unique next-hop entities.  This provides the
   flexibility to apply the usual policies and pick per-VRF, per-VNI,
   or per-flow primary/backup Border Gateways.  An intra-site node
   therefore sees each BG as a next-hop for any external L2 or L3
   unicast destination and performs an ECMP path selection to
   load-balance traffic sent to external destinations.  If an
   intra-site node is not capable of performing ECMP hash-based path
   selection (as is the case for some L2 forwarding implementations),
   the node is expected to choose one of the BGs as its designated
   forwarder.  EVPN Type-2 and Type-5 routes are advertised to the
   nodes in the site from all Border Gateways, and the Border Gateways
   run a DF election per VNI for multi-destination traffic.  Type-3
   routes are advertised only by the DF winner Border Gateway for a
   given VNI so that only the DF receives and forwards inter-site
   traffic.  It is also possible for all Border Gateways at a site to
   advertise and draw traffic, to improve the convergence properties of
   the network.  In the case of multi-destination trees built by
   non-EVPN procedures (say PIM), all Border Gateways receive the
   traffic but only the DF winner forwards it.

4.3.  EVPN Route Processing at the Border Gateway

   Border Gateways build EVPN peering upon processing the Type-4
   Ethernet Segment routes from other Border Gateways.  Route Targets
   MAY be auto-generated based on a site-specific identifier.  If the
   BGP AS number is used as the site-specific identifier, the import
   and export Route Targets can be auto-generated as explained in RFC
   7432 [RFC7432].  This allows site-local nodes to import routes from
   other nodes in the same site and from their Border Gateways, and it
   prevents route exchange between nodes of different sites.  However,
   in this auto-generation scheme, the import mechanism on the Border
   Gateway must be relaxed to allow the import of Border A-D routes
   from other Border Gateways.  As defined in Section 4.1, the ESI
   Label extended community carries a bit that facilitates the import
   of Type-1 A-D routes for Border Gateway discovery.  In addition,
   routes that are imported at a Border Gateway and re-advertised need
   a mechanism to avoid looping of updates should they come back to the
   Border Gateways.

   Type-2/Type-5 EVPN routes are rewritten with the Border Gateway IP
   and Border Gateway system MAC as the next-hop and re-advertised.
   Only EVPN routes received from discovered Border Gateways with
   different site identifiers are rewritten and re-advertised.  This
   avoids rewriting every EVPN update when the Border Gateways also act
   as Route Reflectors (RRs) for the site-local EVPN peering.  It also
   helps an MS-EVPN fabric interoperate with sites that do not have
   Border Gateway functionality.

   A few mechanisms are suggested below for re-advertising these
   inter-site routes into a site and providing connectivity between
   inter-site hosts and subnets.

   o  All routes everywhere: In this mode, all inter-site EVPN
      Type-2/Type-5 routes are downloaded to site-local leafs from the
      Border Gateways.  In other words, every leaf in the MS-EVPN
      fabric has routes from every intra-site and inter-site leaf.
      This mechanism is the best fit for scenarios where inter-site
      traffic is as voluminous as intra-site traffic.  It also
      preserves the usual glean processing, silent host discovery, and
      unknown traffic handling at the leafs.
   o  Default bridging and routing to Border Gateways: In this mode,
      all received inter-site EVPN Type-2/Type-5 routes are installed
      only at the Border Gateways and are not advertised into the site.
      The Border Gateways inject Type-5 default routes towards the
      site-local nodes and avoid re-advertising Type-2 routes from
      other sites.  This mode provides a scaling advantage by not
      downloading all inter-site routes to every leaf in the MS-EVPN
      fabric.  This mechanism MAY require glean processing and unknown
      traffic handling to be tailored to provide efficient traffic
      forwarding.

   o  Site-scope flow registry and discovery: This mechanism provides a
      scaling advantage by downloading inter-site routes on demand.  It
      provides the scaling advantages of default routing without the
      need to tailor glean processing and unknown traffic handling at
      the leafs.  Leafs create an on-demand flow registry on their
      Border Gateways, and based on this flow registry the Border
      Gateways advertise Type-2 routes into the site.  In other words,
      assuming that we have a trigger to send the EVPN routes that are
      needed by the site for conversational learning from the Border
      Gateways, we can optimize the control plane state that is needed
      at the various leaf nodes.  Hardware programming can be further
      optimized based on the actual conversations needed by each leaf,
      as opposed to the ones needed by the site.  We will describe a
      mechanism in the appendix with respect to ARP processing at the
      Border Gateway.

   Type-3 routes are imported and processed on Border Gateways from
   other Border Gateways but MUST NOT be advertised again.  In both
   modes (Anycast and Multi-path), Type-3 routes are generated locally
   and advertised by the DF winner Border Gateway with its unique
   gateway IP.  This facilitates building fast-converging flood domain
   connectivity both inter-site and intra-site while avoiding duplicate
   traffic, since a DF winner is elected to forward multi-destination
   inter-site traffic.

4.4.  Multi-Destination Tree between Border Gateways

   The procedures described here recommend building an Ingress
   Replication (IR) tree between the Border Gateways.  This allows
   every site to independently build site-specific multi-destination
   trees.  A multi-destination end-to-end tree between leafs can
   therefore be PIM (site 1) + IR (between Border Gateways) + PIM
   (site 2), or IR-IR-IR, or PIM-IR-IR.  This does not, however, rule
   out using IR-PIM-IR or end-to-end PIM to build the multi-destination
   trees.

   Border Gateways generate Type-3 routes with their unique gateway IP
   and advertise them to the Border Gateways of other sites.  These
   Type-3 routes help build the IR trees between the Border Gateways.
   However, only the DF winner per VNI forwards multi-destination
   traffic across sites.

   Because Border Gateways are part of both the site-specific and the
   inter-site multi-destination IR trees, a split-horizon mechanism is
   used to avoid loops.  The multi-destination tree rooted at a Border
   Gateway towards other sites (or Border Gateways) is placed in one
   split-horizon group.  Similarly, the multi-destination IR tree
   rooted at a Border Gateway towards the site-local nodes is placed in
   another split-horizon group.

   If PIM is used to build the multi-destination trees in the
   site-specific domain, all Border Gateways join such PIM trees and
   draw multi-destination traffic; however, only the DF Border Gateway
   forwards the traffic towards other sites, as illustrated in the
   sketch below.
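   The following non-normative sketch summarizes how a Border Gateway
   combines the per-VNI DF role with the two split-horizon groups
   described above when replicating a multi-destination frame.  The
   group names and the function are illustrative only.

      SITE_LOCAL = "site-local"   # split-horizon group towards the site
      INTER_SITE = "inter-site"   # split-horizon group towards other BGs

      def replicate_targets(arrival_group, is_df_for_vni,
                            site_local_tunnels, inter_site_tunnels):
          # A frame is never replicated back into the split-horizon
          # group it arrived on, and only the DF for the VNI forwards it
          # across the site boundary in either direction.
          if not is_df_for_vni:
              return []
          if arrival_group == SITE_LOCAL:
              return list(inter_site_tunnels)   # towards other sites
          return list(site_local_tunnels)       # into the local site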
4.5.  Inter-site Unicast Traffic

   Because site-local nodes see all inter-site EVPN routes via their
   Border Gateways, VXLAN tunnels are built between the leafs and the
   site-local Border Gateways, and inter-site VXLAN tunnels are built
   between the Border Gateways of different sites.  An end-to-end
   bidirectional VXLAN forwarding path between inter-site leafs
   therefore consists of a VXLAN tunnel from the leaf (say, in Site A)
   to its Border Gateway, another VXLAN tunnel from that Border Gateway
   to a Border Gateway in the other site (say, Site B), and a tunnel
   from that Border Gateway to the leaf in Site B.  Such an arrangement
   of tunnels is very scalable, as the full mesh of VXLAN tunnels
   across inter-site leafs is replaced by a combination of intra-site
   and inter-site tunnels.

   L2 and L3 unicast frames from site-local leafs reach the Border
   Gateway using VXLAN encapsulation.  At the Border Gateway, the VXLAN
   header is stripped and another VXLAN header is pushed to send the
   frames to the destination site's Border Gateway.  The destination
   site's Border Gateway strips that VXLAN header and pushes another
   VXLAN header to send the frame to the destination leaf.

4.6.  Inter-site Multi-destination Traffic

   Multi-destination traffic is forwarded from one site to another only
   by the DF for that VNI.  As frames reach the Border Gateway from
   site-local nodes, the VXLAN header is popped and another VXLAN
   header (derived from the downstream Type-3 EVPN routes) is pushed to
   forward the frame to the destination site's Border Gateway.
   Similarly, the destination site's Border Gateway strips the VXLAN
   header and forwards the frame after pushing another VXLAN header
   towards the destination leaf.

   As explained in Section 4.4, a split-horizon mechanism is used to
   avoid looping of inter-site multi-destination frames.

4.7.  Host Mobility

   Host moves are handled as defined in RFC 7432 [RFC7432].  When a
   host moves, EVPN Type-2 routes with an updated sequence number are
   propagated to every EVPN node.  When a host moves between sites,
   only the Border Gateways may see EVPN updates with both next-hop
   attribute and sequence number changes, while the leafs may see
   updates with only updated sequence numbers.  In other cases, both
   the Border Gateways and the leafs may see next-hop and sequence
   number changes.
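   As a non-normative illustration of the RFC 7432 procedure that
   Section 4.7 relies on, a node receiving a Type-2 update for a MAC it
   already holds might compare the routes as sketched below; the
   tie-break shown is one deterministic choice, not a requirement of
   this document.

      def prefer_new_type2(existing_seq, new_seq, existing_nh, new_nh):
          # A higher MAC Mobility sequence number indicates a more
          # recent move, whether or not the next-hop changed (an
          # inter-site move is seen at a leaf via its Border Gateway).
          if new_seq != existing_seq:
              return new_seq > existing_seq
          # Same sequence number: fall back to a deterministic
          # tie-break (here the numerically lower next-hop) so that
          # every node converges on the same entry.
          return new_nh < existing_nh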
5.  Convergence

5.1.  Fabric to Border Gateway Failure

   If a Border Gateway is lost, its next-hop is withdrawn for the
   Type-2 routes.  In addition, a per-VNI DF election is triggered to
   choose a new DF, and the new DF winner becomes the forwarder of
   multi-destination inter-site traffic.

5.2.  Border Gateway to Border Gateway Failures

   When the inter-site cloud has link failures, the direct forwarding
   path between Border Gateways can be lost.  In this case, traffic
   from one site can reach the other site via the Border Gateway of an
   intermediate site.  This is handled like a regular underlay failure,
   and the traffic termination end-points remain the same for the
   inter-site traffic flows.

6.  Interoperability

   The procedures defined here apply only to Border Gateways.  Other
   EVPN nodes in the network therefore only need to be RFC 7432
   [RFC7432] compliant to operate in such topologies.

   As the procedures described here apply only after a Border A-D route
   has been received, other connected domains that are not capable of
   this multi-site gateway model can continue to work in regular EVPN
   mode.  The exact procedures will be detailed in a future version of
   this draft.

   The procedures here also provide the flexibility to connect non-EVPN
   VXLAN sites, by provisioning Border Gateways in such sites and
   interconnecting them with the Border Gateways of the other sites.
   Such Border Gateways in non-EVPN VXLAN sites play a dual role: EVPN
   gateway towards the common EVPN domain and non-EVPN gateway towards
   the non-EVPN VXLAN site.

7.  Isolation of Fault Domains

   Isolation of network defects requires policies such as storm control
   and security ACLs to be implemented at the site boundaries.  Border
   Gateways should be capable of inspecting the inner payload of
   packets received from VXLAN tunnels and enforcing the configured
   policies to prevent defects from percolating from one part of the
   network to the rest.

8.  Loop Detection and Prevention

   Customer L2 networks deploy some flavor of the Spanning Tree
   Protocol (STP) to detect and prevent loops.  Customer L2 segments
   also deploy some form of multihoming to connect the L2 segments to
   EVPN nodes or VTEPs.  Such multihomed connectivity relies on the
   multihoming mechanisms at the VTEPs to prevent L2 loops.  However,
   misconfiguration or other unexpected events in the customer L2
   segments can lead to inconsistent connectivity to the VTEPs,
   resulting in L2 loops.

   This specification leverages the Type-2 ESI encoding (which carries
   the root bridge MAC address and priority) in the Type-1 A-D route,
   as defined in RFC 7432 [RFC7432], to exchange STP root bridge
   information among VTEPs.  When a VTEP discovers the same STP root
   bridge from VTEPs that are not its multihoming VTEP peers for a
   given L2 segment, this signals a possible loop, and the forwarding
   engine prunes the VNI from the server-facing ports to break the
   loop.  Once the root bridge conflict across the VTEPs is resolved,
   the forwarding engine re-establishes the VNI on the server-facing
   ports.  This mechanism can coexist with other mechanisms, such as
   fast MAC move detection, and is recommended as additional protection
   against L2 loops caused by inconsistent connectivity of customer L2
   segments to the L3 MS-EVPN fabric.

   Such route advertisements should be originated by every EVPN node
   and terminated at the Border Gateways.  However, if there is a
   possibility of server-facing L2 segments being stretched across
   sites, such routes can be terminated and re-originated without
   modification so that they are received by every other EVPN node.
   This behavior is an exception to the usual guideline of terminating
   (and re-originating where required) all route types at the Border
   Gateway; the exception helps in detecting loops when a customer L2
   segment is inconsistently connected to VTEPs in different sites.

   In addition, as defined in Section 4.2.1, the Border Gateways use
   mechanisms such as Designated Forwarder election and split-horizon
   forwarding to prevent inter-site loops in this network.
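   The root bridge conflict check described above can be sketched as
   follows; the state layout and names are illustrative only, and the
   root bridge identifiers are assumed to have been learned from the
   Type-1 A-D routes described in this section.

      def vni_must_be_pruned(vni, local_root, learned_roots, mh_peers):
          # learned_roots: VTEP address -> STP root bridge identifier
          # mh_peers:      VTEPs that are multihoming peers for this
          #                L2 segment (sharing a root with them is fine)
          for vtep, root in learned_roots.items():
              if root == local_root and vtep not in mh_peers:
                  # The same customer L2 segment appears behind a VTEP
                  # we do not share multihoming with: potential L2 loop,
                  # so prune the VNI from the server-facing ports.
                  return True
          return False   # no conflict: keep or re-establish the VNI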
9.  MVPN with Multi-site EVPN

   BGP-based MVPN, as defined in RFC 6513 [RFC6513] and RFC 6514
   [RFC6514], coexists with Multi-site EVPN without any changes to the
   route types and encodings defined for the MVPN route types in those
   RFCs.  The Route Distinguisher and VRF Route Import extended
   communities are attached to MVPN routes as defined in the BGP MVPN
   RFCs.  Import and export Route Targets are attached to MVPN routes
   either by auto-generating them from the VNI or by explicit
   configuration per MVPN.  Since the BGP MVPN procedures can use any
   VPN address family to provide the RPF information needed to build
   C-multicast trees, the EVPN route types are used to provide the
   required RPF information for multicast sources in MVPNs.  In order
   to follow the segmentation model of Multi-site EVPN, the following
   procedures are recommended to build the provider and customer
   multicast trees between sources and receivers across sites.

9.1.  Inter-Site MI-PMSI

   As defined in the MVPN RFCs mentioned above, I-PMSI A-D routes are
   used to signal a provider tunnel, or MI-PMSI, per MVPN.  Multi-site
   EVPN recommends using EVPN Type-3 routes to build such an MI-PMSI
   provider tunnel per VPN between the Border Gateways of different
   sites.  Every MVPN node uses its unique router identifier to build
   these MI-PMSI provider tunnels.  In the Anycast Border Gateway model
   as well, these MI-PMSI provider tunnels are built using the unique
   router identifiers of the Border Gateways.  In a similar fashion,
   these Type-3 routes can be used to build an MI-PMSI provider tunnel
   per MVPN within a site.

9.2.  Stitching of Customer Multicast Trees across Sites

   All Border Gateways rewrite the next-hop and re-originate MVPN
   routes received from other sites into the local site, and from the
   local site to the other sites.  Customer multicast trees are
   therefore logically built end-to-end across sites by stitching these
   trees at the Border Gateways.  A C-multicast join route (say, an
   MVPN Type-7 route) follows the EVPN RPF path to build the
   C-multicast tree from the leaf in a site to its Border Gateway and
   on to the source site's leafs via the source site's Border Gateways.
   Similarly, a Source Active A-D MVPN route (MVPN Type 5) is
   re-originated with a rewritten next-hop via the Border Gateways so
   that the source C-multicast trees are stitched at the Border
   Gateways.

9.3.  RPF Resolution in the Anycast Border Gateway Model

   In the Anycast Border Gateway model, multicast sources are
   re-advertised in EVPN routes after the next-hop has been rewritten
   to the Anycast Border Gateway next-hop.  However, the VRF Route
   Import (VRI) extended community attached to these EVPN routes is
   advertised with a unique next-hop (typically a local loopback IP) by
   every such Border Gateway.  Receiver leafs should therefore use
   these VRIs to uniquely identify one member of the Anycast Border
   Gateway set when building tenant MVPN trees.  Similarly, a BG in the
   receiver site should pick one member of the Anycast set of BGs in
   the source site as its upstream RPF neighbor, by referring to the
   unique next-hop advertised in the VRI extended community.

9.4.  Inter-Site S-PMSI

   As defined in the BGP MVPN RFCs, S-PMSI A-D routes (MVPN Type 3) are
   used to signal selective PMSI trees for high-bandwidth C-multicast
   streams.  These S-PMSI A-D routes are signaled across sites via the
   Border Gateways, which rewrite the next-hop and re-originate them to
   the other sites.  The PMSI Tunnel attribute in the re-originated
   S-PMSI routes is adjusted to the provider tunnel type used between
   the Border Gateways across sites.

10.  Acknowledgements

   The authors would like to thank Max Ardica, Lukas Krattiger, Anuj
   Mittal, Lilian Quan, Veera Ravinutala, Murali Garimella, and Tarun
   Wadhwa for their review and comments.

11.  IANA Considerations

   TBD.

12.  Security Considerations

   TBD.

13.  References

13.1.  Normative References
   [DCI-EVPN-OVERLAY]
              Sajassi, A., Ed., et al., "A Network Virtualization
              Overlay Solution Using EVPN", Work in Progress,
              draft-ietf-bess-evpn-overlay-02, 2017.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <http://www.rfc-editor.org/info/rfc2119>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <http://www.rfc-editor.org/info/rfc7432>.

13.2.  Informative References

   [RFC6513]  Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in MPLS/
              BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513, February
              2012, <http://www.rfc-editor.org/info/rfc6513>.

   [RFC6514]  Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
              Encodings and Procedures for Multicast in MPLS/BGP IP
              VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012,
              <http://www.rfc-editor.org/info/rfc6514>.

   [RFC7209]  Sajassi, A., Aggarwal, R., Uttaro, J., Bitar, N.,
              Henderickx, W., and A. Isaac, "Requirements for Ethernet
              VPN (EVPN)", RFC 7209, DOI 10.17487/RFC7209, May 2014,
              <http://www.rfc-editor.org/info/rfc7209>.

Appendix A.  Additional Stuff

   TBD.

Authors' Addresses

   Rajesh Sharma (editor)
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: rajshr@cisco.com

   Ayan Banerjee
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: ayabaner@cisco.com

   Raghava Sivaramu
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: raghavas@cisco.com

   Ali Sajassi
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: sajassi@cisco.com