Internet Engineering Task Force                           R. Sharma, Ed.
Internet-Draft                                               A. Banerjee
Intended status: Standards Track                             R. Sivaramu
Expires: November 7, 2016                                  Cisco Systems
                                                              May 6, 2016

          Multi-site EVPN based VXLAN using Border Gateways
                    draft-sharma-multi-site-evpn-00

Abstract

This document describes the procedures for interconnecting two or more BGP based Ethernet VPN (EVPN) sites in a scalable fashion over an IP-only network.  The motivation is to support extension of EVPN sites without having to rely on typical Data Center Interconnect (DCI) technologies like MPLS/VPLS for the interconnection.  The requirements for such a deployment are very similar to the ones specified in RFC 7209 -- "Requirements for Ethernet VPN (EVPN)".

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on November 7, 2016.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Multi-Site EVPN Overview
     3.1.  MS-EVPN Interconnect Requirements
     3.2.  MS-EVPN Interconnect concept and framework
   4.  Multi-site EVPN Interconnect Procedures
     4.1.  Border Auto-Discovery Route
     4.2.  Border Gateway Provisioning
       4.2.1.  Border Gateway Designated Forwarder Election
       4.2.2.  All-active Border Gateway
       4.2.3.  Multi-path Border Gateway
     4.3.  EVPN route processing at Border Gateway
     4.4.  Multi-Destination tree between Border Gateways
     4.5.  Inter-site Unicast traffic
     4.6.  Inter-site Multi-destination traffic
     4.7.  Host Mobility
   5.  Convergence
     5.1.  Fabric to Border Gateway Failure
     5.2.  Border Gateway to Border Gateway Failures
   6.  Interoperability
   7.  Isolation of Fault Domains
   8.  Loop detection and Prevention
   9.  Acknowledgements
   10. IANA Considerations
   11. Security Considerations
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Appendix A.  Additional Stuff
   Authors' Addresses

1.  Introduction

BGP based Ethernet VPNs (EVPNs) are being used to support various VPN topologies, with the motivation and requirements discussed in detail in RFC 7209 [RFC7209].  EVPN has been used to provide a Network Virtualization Overlay (NVO) solution with a variety of tunnel encapsulation options over IP, as described in [DCI-EVPN-OVERLAY].  EVPN used for Data Center Interconnect (DCI) at the WAN edge is also discussed in [DCI-EVPN-OVERLAY].  The EVPN DCI procedures are defined for IP and MPLS hand-off at the site boundaries.

In current EVPN deployments, there is a need to segment the EVPN domains within a Data Center (DC), primarily due to the service architecture and the scaling requirements around it.  The number of routes, tunnel end-points, and next-hops needed in the DC is larger than what some of the deployed hardware elements can support.  Network operators would like to ensure that they have the means to build smaller sites within the data center, if they so desire, without having to use traditional DCI technologies to interconnect them.
In essence, they want smaller multi-site EVPN domains with an IP backbone.

Network operators today are using the Virtual Network Identifier (VNI) to designate a service.  However, they would like to have this service available to a smaller set of nodes within the DC for administrative reasons; in essence, they want to break up the EVPN domain into multiple smaller sites.  An advantage of a smaller footprint for these EVPN sites is that the various fault isolation domains are more constrained.  It is also feasible to have features that re-use the VNI space across these sites, if desired.  The above-mentioned motivations for having smaller multi-site EVPN domains are over and above the ones already detailed in RFC 7209 [RFC7209].

In this document we focus primarily on the VXLAN encapsulation for EVPN deployments.  We assume that the underlay provides simple IP connectivity.  We go into the details of the IP/VXLAN hand-off mechanisms used to interconnect these smaller sites within the data center itself.  We describe this deployment model as a scalable multi-site EVPN (MS-EVPN) deployment.  The procedures described here go into substantial detail regarding interconnecting L2 and L3, unicast and multicast domains across multiple EVPN sites.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Terminology

o  Border Gateway (BG): The node that interacts with nodes within a site and with nodes that are external to the site.  For example, in a leaf-spine data center fabric, it can be a leaf, a spine, or a separate device acting as a gateway to interconnect the sites.

o  All-Active Border Gateway: A virtual set of shared Border Gateways (or next-hops) acting as multiple entry-exit points for a site.

o  Single-Active Border Gateway: A virtual set of unique Border Gateways (or next-hops) acting as multiple entry-exit points for a site.

o  A-D: Auto-Discovery.

3.  Multi-Site EVPN Overview

In this section we describe the motivation, requirements, and framework of the multi-site EVPN enhancements.

3.1.  MS-EVPN Interconnect Requirements

In this section we discuss the requirements and motivation for interconnecting different EVPN sites within a data center.  In general, any interconnect technology has the following requirements:

a.  Scalability: Multi-Site EVPN (MS-EVPN) should be able to interconnect multiple sites in a scalable fashion.  In other words, interconnecting such sites should not lead to one giant fabric with a full mesh of end-to-end VXLAN tunnels across leafs in different sites.  Such a mesh leads to scale issues with respect to managing a large number of tunnel end-points and a large number of tunnel next-hops.  Also, a huge flat fabric rules out the option of ingress replication (IR) trees, as the number of replications becomes practically unachievable due to the internal bandwidth needed in hardware.

b.  Multi-Destination traffic over a unicast-only cloud: MS-EVPN mechanisms should be able to provide an efficient forwarding mechanism for multi-destination frames even if the underlay inter-site network is not capable of forwarding multicast frames.
This requirement is meant to ensure that, for the solution to work, no additional constraints are placed on the IP network.  This allows existing network elements to be used as-is.

c.  Maintain site-specific administrative control: The MS-EVPN technology should be able to interconnect fabrics from different administrative domains.  It is possible that different sites have different VLAN-VNI mappings, use different underlay routing protocols, and/or have different PIM-SM group ranges, etc.  It is expected that the technology should not impose any additional constraints on the various administrative domains.

d.  Isolate fault domains: The MS-EVPN hand-off should be capable of isolating traffic across site boundaries and preventing defects from percolating from one site to another.  As an example, a broadcast storm in one site should not lead to a meltdown of all other sites.

e.  Loop detection and prevention: In scenarios where flood domains are stretched across fabrics, the interconnected sites are very vulnerable to loops and flood storms.  There is a need to provide comprehensive loop detection and prevention capabilities.

f.  Plug-and-play and extensibility: Adding new sites or increasing the capacity of existing sites should be achievable in a completely plug-and-play fashion.  This essentially means that all control plane and forwarding states (L2 or L3 interconnect) should be built in downstream allocation mode.  MS-EVPN should not impose any maximum on scale and capacity; it should be easily extensible along those metrics.

3.2.  MS-EVPN Interconnect concept and framework

EVPN with an IP-only interconnect is conceptualized as multiple site-local EVPN control planes and IP forwarding domains interconnected via a single common EVPN control and IP forwarding domain.  Every EVPN node is identified with a unique site-scope identifier.  A site-local EVPN domain consists of EVPN nodes with the same site identifier.  Border Gateways are, on the one hand, part of the site-specific EVPN domain and, on the other hand, part of a common EVPN domain used to interconnect with Border Gateways from other sites.  Although a Border Gateway has only a single explicit site-id (that of the site it is a member of), it can be considered to also have a second implicit site-id, that of the interconnect domain whose membership consists of all the BGs from all sites being interconnected.  This implicit site-id membership is derived from the presence of the Border A-D route announced by that Border Gateway node.

These Border Gateways discover each other through EVPN Border A-D routes and act as both control plane and forwarding plane gateways across sites.  This allows site-local nodes to view all other sites as reachable only via their Border Gateways.

We describe the MS-EVPN deployment model using the topology below.  In the topology there are three sites, Site A, Site B, and Site C, that are interconnected using IP.  This entire topology is deemed to be part of the same Data Center.  In most deployments these sites can be thought of as pods, which may span a rack, a row, or multiple rows in the data center, depending on the domain size desired for scale and for fault and/or administrative isolation.
     ____________________________
    | ooo    Encapsulation tunnel|
    | X X X  Leaf-spine fabric   |
    |____________________________|

   Site A (EVPN site A)                 Site B (EVPN site B)
    ___________________________          ____________________________
   | X X X X           X X X X |        | X X X X           X X X X  |
   |      X X X X              |        |      X X X X               |
   |       o            o      |        |       o            o       |
   |BG-1 Site A    BG-2 Site A |        |BG-1 Site B    BG-2 Site B  |
    ___________________________          ____________________________
            o         o                         o          o
             o         o                       o          o
              o         o                     o          o
               o         o                   o          o
            _______________________________________________
           |                                               |
           |                                               |
           |          Inter-site common EVPN site          |
           |                                               |
           |                                               |
            _______________________________________________
                            o           o
                            o           o
                            o           o
                            o           o
                       ___________________________
                      | BG-1 Site C   BG-2 Site C |
                      |      X X X X              |
                      | X X X X           X X X X |
                       ___________________________
                         Site C (EVPN site C)

                                Figure 1

In this topology, site-local nodes are connected to each other by iBGP EVPN peering, and Border Gateways are connected by eBGP multi-hop EVPN peering via the inter-site cloud.  We explicitly spell this out to ensure that we can re-use BGP semantics of route announcement between and across the sites.  There are other BGP mechanisms to instantiate this; they are not discussed in this document.  This implies that each domain has its own AS number associated with it.  In the topology, only two Border Gateways per site are shown; this is for ease of illustration and explanation, and the technology poses no such limitation.  As mentioned earlier, a site-specific EVPN domain consists only of the site-local nodes in that site.  A Border Gateway is logically partitioned into the site-specific EVPN domain towards its site and the common EVPN domain towards other sites.  This allows it to act as a control plane and forwarding plane gateway for traffic across sites.

EVPN nodes within a site will discover each other via regular EVPN procedures and build site-local bidirectional VXLAN tunnels and multi-destination trees from leafs to Border Gateways.  Border Gateways will discover each other via Border A-D routes (defined in Section 4.1) and build inter-site bidirectional VXLAN tunnels and multi-destination trees between them.  We thus build an end-to-end bidirectional forwarding path across all sites by stitching (and not by stretching end-to-end) site-local VXLAN tunnels with inter-site VXLAN tunnels.

In essence, an MS-EVPN fabric is proposed to be built in a completely downstream and modular fashion.

o  Site-local bridging domains are interconnected ONLY via Border Gateways with bridging domains from other sites.  Such an interconnect does not assume uniform MAC-VRF VNI-VLAN mappings across sites and stitches the bridging domains together in a completely downstream fashion using EVPN route advertisements.

o  Site-local routing domains are interconnected ONLY via Border Gateways with routing domains from other sites.  Such an interconnect does not assume uniform IP VRF-VNI mappings across sites and stitches the routing domains together in a completely downstream fashion using EVPN route advertisements.

o  Site-local flood domains are interconnected ONLY via Border Gateways with flood domains from other sites.  Such an interconnect does not assume uniform MAC-VRF VNI mappings across sites (or the mechanisms used to build flood domains within a site) and stitches the flood domains together in a completely downstream fashion using EVPN route advertisements.  It does not, however, exclude the possibility of building an end-to-end flood domain if desired for other reasons.  (A non-normative sketch of this downstream stitching follows this list.)
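The following non-normative sketch illustrates the downstream stitching described above for a bridging domain.  The site names, VLAN and VNI values, and data structures are illustrative assumptions, not part of the procedures; the point is only that the VNI used towards any node is the one that node advertised in its EVPN routes, so no fabric-wide VLAN-VNI mapping is required.

   # Non-normative illustration of downstream-assigned VNIs.  Each
   # site keeps its own VLAN-VNI mapping for the same tenant bridging
   # domain; a sender always uses the VNI advertised by the node it
   # is encapsulating towards.  All names and values are assumptions.

   # VNI advertised for the "tenant-red" MAC-VRF by each node,
   # learned from that node's EVPN routes.
   ADVERTISED_VNI = {
       "leaf1.site-a": 10100,  # Site A: tenant-red is VLAN 100 / VNI 10100
       "bg1.site-b": 20100,    # Site B: tenant-red is VLAN 200 / VNI 20100
   }

   def vni_towards(destination_node):
       """Return the VNI to push when encapsulating towards the node."""
       return ADVERTISED_VNI[destination_node]

   # Site A's Border Gateway encapsulating towards Site B's Border
   # Gateway uses Site B's VNI, even though Site A uses a different
   # VNI for the same bridging domain.
   assert vni_towards("bg1.site-b") == 20100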
The above architecture satisfies the constraints laid out in Section 3.1.  For example, the size of a domain may be made dependent on the route and next-hop scale that can be supported by the deployed network nodes.  There are no constraints on the network that connects the nodes within a domain or across domains.  In the event multicast capability is available and enabled, the nodes can use those resources.  In the event the underlay is connected using unicast semantics, the creation of ingress replication lists ensures that multi-destination frames reach their destinations.  The domains may have their own deployment constraints, and the overlay does not need any form of stretching.  Containing fault isolation domains is within the control of the administrator.  The automated discovery of the border nodes needs no further configuration for existing deployed domains.

4.  Multi-site EVPN Interconnect Procedures

In this section we describe the new functionalities in the Border Gateway nodes for interconnecting EVPN sites within the DC.

4.1.  Border Auto-Discovery Route

These routes are generated by Border Gateways and imported by leafs and Border Gateways.  These routes serve the following purposes:

o  Discover Border Gateways from the same site.  This helps in electing a designated forwarder (DF) for inter-site multi-destination traffic.  Once the DF election is complete, inter-site multi-destination traffic will be forwarded by the DF winner.

o  Discover Border Gateways from other sites.  This helps in deciding which VXLAN tunnels should be terminated for inter-site traffic.  Along with the Type 3 routes, this may help in optimal traffic flow within the common core for multi-destination frames.

A Border A-D route-type-specific EVPN NLRI is defined as follows.  It is proposed as a new route type in the EVPN NLRI defined in RFC 7432 [RFC7432].

   +--------------------------------------------+
   |  Site identifier (2 octets)                |
   +--------------------------------------------+
   |  Sequence number (2 octets)                |
   +--------------------------------------------+
   |  IP Address Length (1 octet)               |
   +--------------------------------------------+
   |  Gateway IP (4 or 16 octets)               |
   +--------------------------------------------+
   |  VNI Label (3 octets)                      |
   +--------------------------------------------+
   |  Multi-destination flow Priority (1 octet) |
   +--------------------------------------------+
   |  Multi-destination forwarder (1 octet)     |
   +--------------------------------------------+

                     Figure 2

o  Site Identifier: Used to distinguish A-D routes received from Border Gateways in the same site from those received from Border Gateways in different sites.  Border Gateways discover each other by processing these A-D routes from different sites.  The site identifier can be explicitly configured, or the BGP Autonomous System (AS) number can automatically be carried as the site identifier.

o  Sequence number: A monotonically increasing sequence number added by the Border Gateway when sending the A-D route.
   In case there are multiple Border A-D routes, the one with the highest sequence number is honored while processing.

o  IP Address Length: The number of bytes in the Gateway IP field: 4 for an IPv4 address or 16 for an IPv6 address.

o  Gateway IP: The unique IP address of the announcing Border Gateway.  This Gateway IP will be used to build multi-destination trees.

o  VNI Label: The MAC-VRF VNI or the IP-VRF VNI.

o  Multi-destination flow Priority: This field is optional and is 0 if not used.  It can be used to assist in forwarder election for multi-destination traffic by assigning a higher priority to one of the Border Gateways of the same site.  This forwarder election is per MAC-VRF or IP-VRF VNI.

o  Multi-destination forwarder: This field is set to TRUE once the DF election for multi-destination traffic is complete and the announcing Border Gateway is the DF winner.

These A-D routes are advertised with MAC-VRF or IP-VRF RTs depending on whether the VNI carried is a MAC-VRF VNI or an IP-VRF VNI.

After a Border Gateway is provisioned, Border A-D routes will be announced from all Border Gateways after some delay interval.  This provides sufficient time to learn Border A-D routes from other Border Gateways.

Border Gateways within the same site will run a Designated Forwarder election per MAC-VRF VNI for multi-destination traffic across the site.  Border A-D routes coming from a different site will not trigger DF election and will only be cached in order to terminate VXLAN tunnels from those Border Gateways.

A Multi-destination flow priority can be assigned (based on optional policies) to prefer a Border Gateway for DF election per MAC-VRF or IP-VRF VNI; the DF election will then prefer the higher-priority Border Gateway as the forwarder for multi-destination traffic.

As defined in the existing specifications, Type 2, Type 3, and Type 5 routes carry downstream VNI labels.  The A-D routes will help pre-build VXLAN tunnels in the common EVPN domain for L2, L3, and multi-destination traffic.  The A-D routes will also help in correlating the next-hops of EVPN routes and will facilitate rewriting the next-hop attributes before re-advertising these routes from other sites into a given site.  This provides the flexibility to keep different VNI-VLAN mappings in different sites and still be able to interconnect L3 and L2 domains.

All control plane and data plane states are interconnected in a completely downstream fashion.  For example, the BGP import rules for a Type 3 route should be able to extend a flood domain for a VNI, and flood traffic destined to the advertising EVPN node should carry the VNI announced in that node's Type 3 route.  Similarly, Type 2 and Type 5 control and forwarding states should be interconnected in a completely downstream fashion.
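As a non-normative illustration of the Border A-D route format in Figure 2, the following sketch packs the fields for an IPv4 gateway.  The helper name and example values are assumptions; the EVPN route type code for this NLRI is not assigned in this document and is therefore not shown.

   import struct

   # Illustrative packing of the Border A-D route fields (Figure 2)
   # for the IPv4 case.  Values are examples only.
   def pack_border_ad(site_id, seq, gw_ipv4, vni, prio=0, forwarder=False):
       body = struct.pack("!HHB", site_id, seq, 4)        # site-id, seq, addr len
       body += bytes(int(x) for x in gw_ipv4.split("."))  # 4-octet Gateway IP
       body += vni.to_bytes(3, "big")                     # 3-octet VNI Label
       body += struct.pack("!BB", prio, 1 if forwarder else 0)
       return body

   # Example: a Border Gateway of site 65001 (a 2-octet AS number used
   # as the site identifier) announcing gateway 192.0.2.1 for VNI 10100.
   nlri = pack_border_ad(site_id=65001, seq=1, gw_ipv4="192.0.2.1", vni=10100)
   assert len(nlri) == 2 + 2 + 1 + 4 + 3 + 1 + 1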
4.2.  Border Gateway Provisioning

Border Gateway nodes manage both the control plane communications and the data forwarding plane for any inter-site traffic.  Border Gateway functionality in an EVPN site SHOULD be enabled on more than one node in the network for redundancy and high-availability purposes.  Any external Type 2/Type 5 routes that are received by the BGs of a site are advertised to all the intra-site nodes by all the BGs.  For internal Type 2/Type 5 routes received by the BGs from the intra-site nodes, all the BGs of a site would advertise them to the remote BGs, so any known L2/L3 unicast traffic to internal destinations could be sent to any one of the local BGs by remote sources.  For known L2 and L3 unicast traffic, the individual Border Gateway nodes will behave either as a single logical forwarding node or as a set of active forwarding nodes.  This can be perceived by intra-site nodes as multiple entry/exit points for inter-site traffic.  For unknown unicast/multi-destination traffic, there must be a designated forwarder election mechanism to determine which node performs the primary forwarding role at any given point in time, to ensure there is no duplication of traffic for any given flow (see Section 4.2.1).

4.2.1.  Border Gateway Designated Forwarder Election

In the presence of more than one Border Gateway node in a site, forwarding of multi-destination L2 or L3 traffic, both into the site and out of the site, needs to be carried out by a single node.  This DF election could be done independently by each candidate Border Gateway, by subjecting an ordered "candidate list" of all the BGs present in the same site (identified by reception of the per-VNI Border A-D routes with the same site-id as its own) to a hash function on a per-VNI basis.  All the candidate Border Gateways of the same site are required to use a uniform hash function in order to yield the same result.  Failure events that lead to a BG losing all of its connectivity to the IP interconnect backbone should trigger the BG to withdraw its Border A-D route(s), to indicate to the other BGs of the site that it is no longer a candidate BG.  It is also possible to configure policies that prefer one Border Gateway over the others and pick it as the DF winner.

Two modes are proposed for Border Gateway provisioning.

4.2.2.  All-active Border Gateway

In this mode, all Border Gateways share the same gateway IP and rewrite EVPN next-hop attributes with this shared logical next-hop entity.  However, each gateway also maintains a unique gateway IP to facilitate building IR trees from site-local nodes for forwarding multi-destination traffic.  EVPN Type 2 and Type 5 routes will be advertised to the nodes in the site from all Border Gateways, and the Border Gateways will run a DF election per VNI for multi-destination traffic.  Type 3 routes will be advertised by all Border Gateways, but only the DF will forward inter-site traffic.

This mode is useful when there is no preference among Border Gateways for forwarding traffic from different VNIs.  Standard data-plane hashing of the VXLAN header will load-balance traffic among the Border Gateways.

Additionally, it is recommended that Border Gateways be enabled in the All-Active mode, wherein the BG functionality is available to the rest of the network as a single logical entity (as in Anycast) for inter-site communication.  In the absence of All-Active capability, the BGs could be enabled as individual gateways (Single-Active BG), wherein a single node performs the active BG role for a given flow at a given time.
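The following non-normative sketch illustrates the per-VNI DF election described in Section 4.2.1 as it could run independently on each candidate Border Gateway.  The specific hash function, ordering, and priority handling shown here are assumptions; the only requirement stated above is that all candidates of a site apply the same function to the same ordered candidate list.

   import hashlib

   # Illustrative per-VNI DF election among the Border Gateways of one
   # site (Section 4.2.1).  The hash function and priority handling
   # below are assumptions, not normative behaviour.
   def elect_df(candidate_gateway_ips, vni, priorities=None):
       """Return the gateway IP elected DF for the given VNI.

       candidate_gateway_ips: gateway IPs learned from Border A-D
       routes carrying the local site-id (including this node's own).
       priorities: optional map gateway IP -> Multi-destination flow
       Priority; a higher priority wins before the hash is consulted.
       """
       candidates = sorted(candidate_gateway_ips)   # ordered candidate list
       if priorities:
           best = max(priorities.get(ip, 0) for ip in candidates)
           candidates = [ip for ip in candidates
                         if priorities.get(ip, 0) == best]
       # A uniform hash so every candidate computes the same winner.
       digest = hashlib.sha256(str(vni).encode()).digest()
       return candidates[digest[0] % len(candidates)]

   # Example: two BGs of the same site electing the DF for VNI 10100.
   print(elect_df(["192.0.2.1", "192.0.2.2"], vni=10100))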
4.2.3.  Multi-path Border Gateway

In this mode, Border Gateways rewrite EVPN next-hop attributes with unique next-hop entities.  This provides the flexibility to apply the usual policies and pick per-VRF, per-VNI, or per-flow primary/backup Border Gateways.  Hence, an intra-site node will see each BG as a next hop for any external L2 or L3 unicast destination and will perform ECMP path selection to load-balance traffic sent to external destinations.  In case an intra-site node is not capable of performing ECMP hash-based path selection (possibly some L2 forwarding implementations), the node is expected to choose one of the BGs as its designated forwarder.  EVPN Type 2 and Type 5 routes will be advertised to the nodes in the site from all Border Gateways, and the Border Gateways will run a DF election per VNI for multi-destination traffic.  Type 3 routes will be advertised by all Border Gateways, but only the DF will forward inter-site traffic.

4.3.  EVPN route processing at Border Gateway

Border Gateways will build EVPN peerings upon processing A-D routes from other Border Gateways.  Route targets MAY be auto-generated based on a site-specific identifier.  If the BGP AS number is used as the site-specific identifier, import and export route targets can be auto-generated as explained in RFC 7432 [RFC7432].  This allows site-local nodes to import routes from other nodes in the same site and from their Border Gateways.  It also prevents route exchange directly between nodes in different sites.  However, in this auto-generated scheme, the import mechanism on the Border Gateway should be relaxed to allow unconditional import of Border A-D routes from other Border Gateways.  In addition, a mechanism is needed to avoid looping of updates in case routes imported and re-advertised by a Border Gateway come back to the Border Gateways.

Type 2/Type 5 EVPN routes will be rewritten with the Border Gateway IP and the Border Gateway system MAC as the next hop and then re-advertised.  Only EVPN routes received from discovered Border Gateways with different site identifiers will be rewritten and re-advertised.  This avoids rewriting every EVPN update if the Border Gateways are also acting as Route Reflectors (RRs) for the site-local EVPN peering.  It also helps the MS-EVPN fabric interoperate with sites that do not have Border Gateway functionality.  (A non-normative sketch of this rewrite follows the list of mechanisms below.)

A few mechanisms are suggested below for re-advertising these inter-site routes into a site and providing connectivity for inter-site hosts and subnets.

o  All routes everywhere: In this mode, all inter-site EVPN Type 2/Type 5 routes are downloaded to site-local leafs from the Border Gateways.  In other words, every leaf in the MS-EVPN fabric will have routes from every intra-site and inter-site leaf.  This mechanism is best suited for scenarios where inter-site traffic is as voluminous as intra-site traffic.  It also preserves the usual glean processing, silent-host discovery, and unknown-traffic handling at the leafs.

o  Default routing to Border Gateways: In this mode, all received inter-site EVPN Type 2/Type 5 routes will be installed only at the Border Gateways and will not be advertised into the site.  The Border Gateways will inject Type 5 default routes towards the site-local nodes and will not re-advertise Type 2 routes from other sites.  This mode provides a scaling advantage by not downloading all inter-site routes to every leaf in the MS-EVPN fabric.  This mechanism may require glean processing and unknown-traffic handling to be tailored to provide efficient traffic forwarding.

o  Site-scope flow registry and discovery: This mechanism provides a scaling advantage by downloading inter-site routes on demand.  It provides the scaling advantages of default routing without the need to tailor glean processing and unknown-traffic handling at the leafs.  Leafs will create an on-demand flow registry on their Border Gateways, and based on this flow registry the Border Gateways will advertise Type 2 routes into the site.  In other words, assuming that there is a trigger to send the EVPN routes that are needed by the site for conversational learning from the Border Gateways, the control plane state needed at the various leaf nodes can be optimized.  Hardware programming can be further optimized based on the actual conversations needed by the leaf, as opposed to the ones needed by the site.  We will describe a mechanism in the appendix with respect to ARP processing at the Border Gateway.
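As a non-normative illustration of the next-hop rewrite and re-advertisement filtering described at the beginning of this section, the following sketch shows the checks a Border Gateway might apply to a received Type 2/Type 5 route.  The attribute names and the specific loop-avoidance check are assumptions, since the document leaves the loop-avoidance mechanism unspecified.

   # Illustrative Border Gateway processing of a Type 2/Type 5 route
   # before re-advertising it into the local site (Section 4.3).
   # Attribute names and the loop-avoidance check are assumptions.

   LOCAL_SITE_ID = 65001
   LOCAL_GATEWAY_IP = "192.0.2.1"
   LOCAL_SYSTEM_MAC = "00:00:5e:00:53:01"

   def process_for_readvertisement(route, discovered_bgs):
       """Return the route to re-advertise into the site, or None."""
       src = discovered_bgs.get(route["received_from"])
       # Only routes from discovered Border Gateways of *other* sites
       # are rewritten; routes from the local site are left untouched.
       if src is None or src["site_id"] == LOCAL_SITE_ID:
           return None
       # Loop avoidance (assumed mechanism): drop a route that already
       # carries this gateway as its next hop, i.e. it has come back.
       if route["next_hop"] == LOCAL_GATEWAY_IP:
           return None
       rewritten = dict(route)
       rewritten["next_hop"] = LOCAL_GATEWAY_IP
       rewritten["router_mac"] = LOCAL_SYSTEM_MAC
       return rewritten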
Type 3 routes will be imported and processed on Border Gateways from other Border Gateways but MUST NOT be advertised again.  In both modes (All-Active and Multi-path), Type 3 routes will be generated and advertised by all Border Gateways with their unique gateway IPs.  This facilitates building fast-converging inter-site and intra-site flood domain connectivity while at the same time avoiding duplicate traffic, by electing a DF winner to forward inter-site multi-destination traffic.

4.4.  Multi-Destination tree between Border Gateways

The procedures described here recommend building an Ingress Replication (IR) tree between Border Gateways.  This allows every site to independently build its site-specific multi-destination trees.  Multi-destination end-to-end trees between leafs could be PIM (site 1) + IR (between Border Gateways) + PIM (site 2), or IR-IR-IR, or PIM-IR-IR.  However, this does not rule out using IR-PIM-IR or end-to-end PIM to build multi-destination trees end to end.

Border Gateways will generate Type 3 routes with their unique gateway IPs and advertise them to the Border Gateways of other sites.  These Type 3 routes will help in building IR trees between Border Gateways.  However, only the per-VNI DF winner will forward multi-destination traffic across sites.

As Border Gateways are part of both site-specific and inter-site multi-destination IR trees, a split-horizon mechanism will be used to avoid loops.  The multi-destination tree rooted at a Border Gateway towards other sites (or Border Gateways) will be in one split-horizon group.  Similarly, the multi-destination IR tree rooted at a Border Gateway towards site-local nodes will be in another split-horizon group.

If PIM is used to build multi-destination trees in the site-specific domain, all Border Gateways will join such PIM trees and draw multi-destination traffic.  However, only the DF Border Gateway will forward traffic towards other sites.

4.5.  Inter-site Unicast traffic

As site-local nodes will see all inter-site EVPN routes via the Border Gateways, VXLAN tunnels will be built between leafs and site-local Border Gateways, and inter-site VXLAN tunnels will be built between Border Gateways in different sites.  An end-to-end bidirectional VXLAN forwarding path between inter-site leafs will consist of a VXLAN tunnel from a leaf (say, in Site A) to its Border Gateway, another VXLAN tunnel from that Border Gateway to a Border Gateway in the other site (say, Site B), and a tunnel from that Border Gateway to the leaf in Site B.  Such an arrangement of tunnels is very scalable, as a full mesh of VXLAN tunnels across inter-site leafs is substituted by a combination of intra-site and inter-site tunnels.

L2 and L3 unicast frames from site-local leafs will reach the Border Gateway using VXLAN encapsulation.  At the Border Gateway, the VXLAN header is stripped off and another VXLAN header is pushed to send the frames to the destination site's Border Gateway.  The destination site's Border Gateway will strip off the VXLAN header and push another VXLAN header to send the frame to the destination leaf.
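A non-normative sketch of the decapsulation and re-encapsulation step at a Border Gateway for inter-site known unicast traffic is given below.  The table layout and field names are assumptions; the VNI pushed towards a destination is the downstream VNI learned from that destination's EVPN routes, and multi-destination frames would additionally be gated by the DF election and split-horizon rules of Sections 4.4 and 4.6.

   # Illustrative re-encapsulation at a Border Gateway for inter-site
   # known unicast traffic (Section 4.5).  Table layout and names are
   # assumptions only.

   # destination MAC -> (next hop learned from EVPN, downstream VNI)
   MAC_VRF = {
       "00:11:22:33:44:55": {"next_hop": "198.51.100.2",  # a BG of Site B
                             "downstream_vni": 20100},
   }

   def forward_at_border_gateway(vxlan_frame):
       """Decapsulate a frame from a site-local leaf and re-encapsulate
       it towards the remote Border Gateway advertised for the MAC."""
       inner = vxlan_frame["inner"]             # strip the outer VXLAN header
       entry = MAC_VRF.get(inner["dst_mac"])
       if entry is None:
           return None                          # unknown unicast: DF handling applies
       return {"outer_dst_ip": entry["next_hop"],
               "vni": entry["downstream_vni"],  # downstream-assigned VNI
               "inner": inner}

   frame = {"vni": 10100, "inner": {"dst_mac": "00:11:22:33:44:55"}}
   print(forward_at_border_gateway(frame))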
4.6.  Inter-site Multi-destination traffic

Multi-destination traffic will be forwarded from one site to another only by the DF for that VNI.  As frames reach a Border Gateway from site-local nodes, the VXLAN header will be popped and another VXLAN header (derived from the downstream Type 3 EVPN routes) will be pushed to forward the frame to the destination site's Border Gateway.  Similarly, the destination site's Border Gateway will strip off the VXLAN header and forward the frame after pushing another VXLAN header towards the destination leaf.

As explained in Section 4.4, a split-horizon mechanism will be used to avoid looping of inter-site multi-destination frames.

4.7.  Host Mobility

Host movement handling will be the same as defined in RFC 7432 [RFC7432].  When a host moves, EVPN Type 2 routes with an updated sequence number will be propagated to every EVPN node.  When a host moves between sites, only the Border Gateways may see EVPN updates with both next-hop attribute and sequence number changes, while leafs may see updates only with updated sequence numbers.  In other cases, both Border Gateways and leafs may see next-hop and sequence number changes.

5.  Convergence

5.1.  Fabric to Border Gateway Failure

If a Border Gateway is lost, its next hop will be withdrawn for Type 2 routes.  A per-VNI DF election will also be triggered to choose a new DF.  The new DF winner will become the forwarder for inter-site multi-destination traffic.

5.2.  Border Gateway to Border Gateway Failures

In cases where the inter-site cloud has link failures, the direct forwarding path between Border Gateways can be lost.  In this case, traffic from one site can reach the other site via the Border Gateway of an intermediate site.  However, this is handled like a regular underlay failure, and the traffic termination end-points will stay the same for inter-site traffic flows.

6.  Interoperability

The procedures defined here apply only to Border Gateways.  Therefore, other EVPN nodes in the network should be compliant with RFC 7432 [RFC7432] to operate in such topologies.

As the procedures described here are applicable only after receiving a Border A-D route, other connected domains that are not capable of this multi-site gateway model can work in regular EVPN mode.  The exact procedures will be detailed in a future version of this draft.

7.  Isolation of Fault Domains

Isolation of network defects requires policies such as storm control, security ACLs, etc. to be implemented at site boundaries.
Border Gateways should be capable of inspecting the inner payload of packets received from VXLAN tunnels and enforcing the configured policies, to prevent defects from percolating from one part of the network to the rest.

8.  Loop detection and Prevention

This has already been addressed in Section 4.2.1.  In essence, the Designated Forwarder and split-horizon procedures are used to break loops in this network.

9.  Acknowledgements

The authors would like to thank Max Ardica, Lukas Krattiger, Anuj Mittal, Lilian Quan, and Veera Ravinutala for their review and comments.

10.  IANA Considerations

TBD.

11.  Security Considerations

TBD.

12.  References

12.1.  Normative References

[DCI-EVPN-OVERLAY]
           Sajassi, A., Ed., et al., "A Network Virtualization Overlay
           Solution using EVPN", draft-ietf-bess-evpn-overlay-02 (work
           in progress), 2016.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
           Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
           Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
           2015, <https://www.rfc-editor.org/info/rfc7432>.

12.2.  Informative References

[RFC7209]  Sajassi, A., Aggarwal, R., Uttaro, J., Bitar, N.,
           Henderickx, W., and A. Isaac, "Requirements for Ethernet
           VPN (EVPN)", RFC 7209, DOI 10.17487/RFC7209, May 2014,
           <https://www.rfc-editor.org/info/rfc7209>.

Appendix A.  Additional Stuff

TBD.

Authors' Addresses

   Rajesh Sharma (editor)
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: rajshr@cisco.com

   Ayan Banerjee
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: ayabaner@cisco.com

   Raghava Sivaramu
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: raghavas@cisco.com