BESS Working Group                                        R. Sharma, Ed.
Internet-Draft                                          A. Banerjee, Ed.
Intended status: Standards Track                              A. Sajassi
Expires: December 27, 2018                                  L. Krattiger
                                                             R. Sivaramu
                                                            Cisco Systems
                                                            June 25, 2018

          Multi-site EVPN based VXLAN using Border Gateways
                 draft-sharma-bess-multi-site-evpn-00

Abstract

   This document describes the procedures for interconnecting two or
   more BGP-based Ethernet VPN (EVPN) sites in a scalable fashion over
   an IP-only network.  The motivation is to support the extension of
   EVPN sites without having to rely on typical Data Center
   Interconnect (DCI) technologies like MPLS/VPLS.  The requirements
   for such a deployment are very similar to the ones specified in
   RFC 7209 -- "Requirements for Ethernet VPN (EVPN)".

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 27, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Multi-Site EVPN Overview
     3.1.  MS-EVPN Interconnect Requirements
     3.2.  MS-EVPN Interconnect Concept and Framework
   4.  Multi-site EVPN Interconnect Procedures
     4.1.  Border Gateway Discovery
     4.2.  Border Gateway Provisioning
       4.2.1.  Border Gateway Designated Forwarder Election
       4.2.2.  Anycast Border Gateway
       4.2.3.  Multi-path Border Gateway
     4.3.  EVPN Route Processing at the Border Gateway
     4.4.  Multi-Destination Trees between Border Gateways
     4.5.  Inter-site Unicast Traffic
     4.6.  Inter-site Multi-destination Traffic
     4.7.  Host Mobility
   5.  Convergence
     5.1.  Fabric to Border Gateway Failure
     5.2.  Border Gateway to Border Gateway Failures
   6.  Interoperability
   7.  Isolation of Fault Domains
   8.  MVPN with Multi-site EVPN
     8.1.  Inter-Site MI-PMSI
     8.2.  Stitching of Customer Multicast Trees across Sites
     8.3.  RP Placement across Sites
     8.4.  Inter-Site S-PMSI
   9.  Observations with Multi-site EVPN
   10. Acknowledgements
   11. IANA Considerations
   12. Security Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  Additional Stuff
   Authors' Addresses

1.  Introduction

   BGP-based Ethernet VPNs (EVPNs) are being used to support various
   VPN topologies, with the motivation and requirements discussed in
   RFC 7209 [RFC7209].  EVPN has been used to provide a Network
   Virtualization Overlay (NVO) solution with a variety of tunnel
   encapsulation options in RFC 8365 [RFC8365] for Data Center
   Interconnect (DCI) at the WAN edge.  Procedures for IP and MPLS
   hand-off at site boundaries are additionally discussed in
   [DCI-OVERLAY].

   In current EVPN deployments, there is a need to segment the EVPN
   domains within a Data Center (DC), primarily due to the service
   architecture and the scaling requirements around it.  The number of
   routes, tunnel end-points, and next-hops needed in the DC is
   sometimes larger than the capability of the hardware elements being
   deployed.  Network operators would like to interconnect these
   domains without using traditional DCI technologies.  In essence,
   they want smaller multi-site EVPN domains with an IP backbone.
   Additionally, they would like to have an Anycast model for the
   nodes at the gateways.  This relieves the hardware of having to
   support multi-path on overlay reachability.

   Network operators today use the Virtual Network Identifier (VNI) to
   designate a service.  They would like to make this service
   available to a smaller set of nodes within the DC for
   administrative reasons; in essence, they want to break up the EVPN
   domain into multiple smaller sites.  A smaller footprint for these
   EVPN sites also constrains the fault-isolation domains and allows
   for re-use of the VNI space across sites.

   In a traditional leaf-spine architecture, it is conceivable that
   the network operator may decide to support both the Route-Reflector
   and Gateway functionality on the spine nodes.  In such a deployment
   model, it is necessary to have a site identifier marked with each
   domain, so that route import and export rules can work effectively.

   In this document we focus primarily on the VXLAN encapsulation for
   EVPN deployments, with the underlay providing only IP connectivity.
   We describe in detail the IP/VXLAN hand-off mechanisms to
   interconnect these smaller sites within the data center itself, and
   refer to this deployment model as multi-site EVPN (MS-EVPN).  The
   procedures described here go into substantial detail regarding
   interconnecting Layer-2 (L2) and Layer-3 (L3) networks, for unicast
   and multicast domains across MS-EVPNs.  In this specification, we
   also define the use of the Type 5 Ethernet Segment Identifier (ESI)
   (Section 5 of RFC 7432 [RFC7432]) between multiple sites using the
   Anycast routing model.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

2.  Terminology

   o  Border Gateway (BG): The node that interacts with nodes internal
      to a site and external to it.  It is responsible for
      functionality related to traffic entering and exiting a site.

   o  Anycast Border Gateway: A virtual set of shared BGs acting as
      multiple entry-exit points for a single site.

   o  Multipath Border Gateway: A virtual set of unique BGs acting as
      multiple entry-exit points for a single site.

   o  RT-X: Route Type X as defined for the various EVPN route types.

3.  Multi-Site EVPN Overview

   In this section we describe the motivation, requirements, and
   framework for the Multi-Site EVPN (MS-EVPN) functionality.

3.1.  MS-EVPN Interconnect Requirements

   a.  Scalability: Multi-Site EVPN (MS-EVPN) should be able to
       interconnect multiple sites, allowing new sites to be added or
       deleted, and the capacity of existing ones to be modified,
       seamlessly.

   b.  Multi-Destination traffic over a unicast-only cloud: MS-EVPN
       mechanisms should provide an efficient forwarding mechanism for
       multi-destination frames by using existing network elements
       as-is.  A large flat fabric rules out the option of ingress
       replication, as the number of replications becomes practically
       unachievable due to the internal hardware bandwidth needed.

   c.  Maintain site-specific administrative control: MS-EVPN should
       be able to interconnect fabrics from different administrative
       domains.  The solution should allow different sites to have
       different VLAN-VNI mappings, use different underlay routing
       protocols, and/or have different PIM-SM group ranges.

   d.  Isolate fault domains: The MS-EVPN technology hand-off should
       be capable of isolating traffic across site boundaries and
       preventing defects from percolating from one site to another.
       As an example, a broadcast storm in a site should not propagate
       to other sites.

3.2.  MS-EVPN Interconnect Concept and Framework

   EVPN with an IP-only interconnect is conceptualized as multiple
   site-local EVPN control planes and IP forwarding domains
   interconnected via a single common EVPN control and IP forwarding
   domain.  Every node is identified with a unique site-scope
   identifier.  A site-local EVPN domain consists of EVPN nodes with
   the same site identifier.

   Border Gateways (BGs) are explicitly part of a site-specific EVPN
   domain, and implicitly part of a common interconnect EVPN domain
   with BGs from other sites.  Although a BG has only a single
   explicit site-id (that of the site it is a member of, see
   Section 4.1), it can be considered to also have a second implicit
   site-id, that of the interconnect domain whose membership comprises
   all the BGs from all sites being interconnected.  BGs discover each
   other through EVPN RT-1 A-D routes and act as both control plane
   and forwarding plane gateways across sites.  As a result,
   site-local nodes see all other sites as reachable only via their
   BGs.

   We describe the MS-EVPN deployment model using the topology shown
   in Figure 1.  In the topology there are three sites, Site A,
   Site B, and Site C, that are interconnected using IP.  This entire
   topology is deemed to be part of the same Data Center.  In most
   deployments these sites can be thought of as pods, which may span a
   rack, a row, or multiple rows in the data center, depending on the
   size of domain desired for scale and fault and/or administrative
   isolation.

   In this topology, site-local nodes are connected to each other by
   iBGP EVPN peering, and BGs are connected by eBGP multi-hop EVPN
   peering via the inter-site cloud.  We explicitly spell this out to
   ensure that we can re-use the BGP semantics of route announcement
   between and across the sites.  Other BGP mechanisms to instantiate
   this will be discussed in a separate document.  This implies that
   each domain/site has its own AS number.  In the topology, only two
   border gateways per site are shown; this is for ease of
   illustration and explanation, and the technology poses no such
   limitation.  As mentioned earlier, a site-specific EVPN domain
   consists of only the site-local nodes in that site.  A BG is
   logically partitioned into a site-specific EVPN domain towards the
   site and a common EVPN domain towards other sites.  This enables
   BGs to act as control and forwarding plane gateways for traffic
   crossing site boundaries.

   EVPN nodes within a site will discover each other via regular EVPN
   procedures and build site-local bidirectional VXLAN tunnels and
   multi-destination trees from leaves to BGs.  BGs will discover each
   other via RT-1 routes with unique site-identifiers and build
   inter-site bidirectional VXLAN tunnels and multi-destination trees
   between them.  We thus build an end-to-end bidirectional forwarding
   path across all sites by stitching (and not by stretching
   end-to-end) site-local VXLAN tunnels with inter-site VXLAN tunnels.
   In essence, an MS-EVPN fabric is built in a completely downstream
   and modular fashion.

              ______________________________
             | ooo    Encapsulation tunnel  |
             | X X X  Leaf-spine fabric     |
             |______________________________|

     Site A (EVPN site A)             Site B (EVPN site B)
    ___________________________      ____________________________
   | X X X X          X X X X |    | X X X X          X X X X    |
   |    X X X X    X X X X    |    |    X X X X    X X X X       |
   |       o          o       |    |       o          o          |
   |BG-1 Site A    BG-2 Site A|    |BG-1 Site B    BG-2 Site B   |
    ___________________________      ____________________________
          o            o                   o            o
           o            o                 o            o
            o            o               o            o
             o            o             o            o
           _______________________________________________
          |                                               |
          |                                               |
          |         Inter-site common EVPN site           |
          |                                               |
          |                                               |
           _______________________________________________
                          o            o
                          o            o
                          o            o
                          o            o
                ___________________________
               | BG-1 Site C   BG-2 Site C |
               |    X X X X    X X X X     |
               | X X X X          X X X X  |
                ___________________________
                   Site C (EVPN site C)

                                 Figure 1

   Site-local tenant domains (for example, bridging, flood, routing,
   and multicast) are interconnected with the corresponding
   site-remote tenant domains from other sites only via BGs.  The BGs
   stitch these tenant domains (bridging, flood, routing, and
   multicast) in a completely downstream fashion using EVPN route
   advertisements.  Such interconnects do not assume uniform mappings
   of MAC-VRF (or IP-VRF) to VNI across sites.

4.  Multi-site EVPN Interconnect Procedures

   In this section we describe the new functionalities in the Border
   Gateway (BG) nodes for interconnecting EVPN sites within the DC.

   In a nutshell, BG discovery facilitates the termination and
   re-origination of inter-site VXLAN tunnels.  Such discovery
   provides the flexibility for intra-site leaf-to-leaf VXLAN tunnels
   to coexist with inter-site tunnels terminating on BGs.
   Additionally, BGs need to discover each other so that a Designated
   Forwarder (DF) election can be run between the border nodes of a
   site.  A BG also needs to be aware of other remote BGs so that it
   can allow for the appropriate import/export of routes from other
   sites.
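
   The peer classification that results from this discovery can be
   sketched as follows.  This is a minimal, non-normative Python
   illustration; the route structure is a hypothetical abstraction of
   the RT-1 A-D contents described in Section 4.1, not a wire format.

      from dataclasses import dataclass

      @dataclass
      class RT1Route:
          """Subset of an EVPN RT-1 A-D route relevant to discovery."""
          originator: str   # BGP next-hop of the advertising node
          site_id: int      # site-identifier carried in the Type 5 ESI
          ms_border: bool   # MS-Border bit of the ESI Label Ext. Comm.

      def classify_peer(local_site: int, local_is_bg: bool,
                        route: RT1Route) -> str:
          """Decide the local node's relationship with the advertiser."""
          if route.site_id == local_site:
              if route.ms_border:
                  return "same-site BG: candidate in site-scoped DF election"
              return "site-local node: build intra-site VXLAN tunnel"
          if route.ms_border and local_is_bg:
              return "remote BG: build inter-site VXLAN tunnel"
          return "remote site: reachable only via the local BGs"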

4.1.  Border Gateway Discovery

   BGs leverage the RT-1 A-D route type defined in RFC 7432 [RFC7432].
   BGs in different sites use RT-1 A-D routes with unique
   site-identifiers to announce themselves as "Borders" to other BGs.
   Nodes within the same site MUST be configured with, or
   auto-generate, the same site-identifier.  Nodes that are not
   configured to be border nodes build VXLAN tunnels only to the other
   members of the site (of which they are aware due to the
   site-identifier that is additionally announced by those members).
   Border nodes additionally build VXLAN tunnels between themselves
   and other border nodes announced with a different site-identifier.
   The site-identifier is encoded within the ESI label itself as
   described below.

   In this specification, we reuse the AS-based Ethernet Segment
   Identifier (ESI) Type 5 (see Section 5 of RFC 7432 [RFC7432]),
   which can be auto-generated or configured by the operator.  It is
   repeated here to illustrate the encoding of the site-identifier.

   o  Type 5 (T=0x05): The ESI value is constructed with the site-id
      parameter embedded as follows.

      *  AS number (4 octets): An AS number owned by the system.  It
         MUST be encoded in the high-order 4 octets of the ESI Value
         field.  If a 2-octet AS number is used, the high-order extra
         2 octets will be 0x0000.

      *  Local Discriminator/Site Identifier (4 octets): The Local
         Discriminator is also referred to as the Site Identifier, and
         its value MUST be encoded as follows.  The high-order
         2 octets will be 0x0000, and the low-order 2 octets will be
         set to the site-identifier of the site to which this node
         belongs.  All border gateways MUST announce this value.  The
         AS number and the site-identifier together need to be
         automatically derivable into fewer than 6 octets; this
         enables the auto import and export of routes (see the
         ES-Import RT definition in RFC 7432 [RFC7432]).

      *  Reserved (1 octet): The low-order octet of the ESI Value will
         be set to 0 on transmission and will be ignored on receipt.
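
   The resulting 10-octet ESI can be constructed as in the following
   sketch (Python; the field layout is as just described, while the
   example AS number and site-identifier are chosen arbitrarily):

      import struct

      def build_type5_esi(as_number: int, site_id: int) -> bytes:
          """Type 5 ESI: T=0x05, 4-octet AS number, 4-octet Local
          Discriminator (high-order 2 octets zero, low-order 2 octets
          = site-identifier), and 1 reserved octet."""
          assert 0 <= site_id <= 0xFFFF, "site-identifier is 2 octets"
          return struct.pack("!BIIB", 0x05, as_number, site_id, 0)

      esi = build_type5_esi(as_number=65001, site_id=42)
      assert len(esi) == 10 and esi[0] == 0x05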

   Along with the RT-1 Ethernet A-D routes, border nodes MUST set the
   second low-order bit (Flags B0: Single Active, B1: MS-Border) of
   the flags octet in the ESI Label Extended Community attribute that
   is announced in tandem.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type=0x01 | Flags(1 octet)|  Reserved=0   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Reserved=0   |          ESI Label                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 2

   The site-identifier value is globally unique within the deployment.
   The RT-1 Ethernet A-D route, along with (i) the MS-Border bit being
   set in the ESI Label Extended Community and (ii) the per-VNI RT
   Extended Community, enables all BGs to be aware of all the other
   BGs in the network.  All BGs are thus able to determine the other
   members of the same site and, armed with this information, run a
   Designated Forwarder (DF) election that is site and VNI scoped, as
   opposed to the traditional Ethernet-segment-scoped DF election.  In
   Figure 1, nodes BG-A1, BG-A2, BG-B1, BG-B2, BG-C1, and BG-C2
   announce the ESI Label and the per-VNI RT Extended Communities.
   Nodes BG-A1 and BG-A2 perform a DF election for Site A, whereas
   nodes BG-B1 and BG-B2 perform one for Site B.  Even though all BG
   nodes are able to see all the advertisements, the site-identifier
   scopes the DF election (using RT-4 ES routes) to the site members.
   This specification uses the All-Active Redundancy Mode, especially
   when the Anycast model of route announcement is used for the local
   routes.

4.2.  Border Gateway Provisioning

   Border Gateway nodes manage both the control-plane communications
   and the data forwarding plane for any inter-site traffic.  Once BGs
   are discovered (using RT-1 routes), any RT-2/RT-5 routes from other
   sites are terminated and re-originated on such BGs.  RT-2/RT-5
   routes carry downstream VNI labels.  Because BG discovery is
   agnostic to symmetric or downstream VNI provisioning, rewriting the
   next-hop attributes before re-advertising routes from other sites
   into a given site provides the flexibility to keep different
   MAC-VRF or IP-VRF to VNI mappings in different sites while still
   interconnecting the L3 and L2 domains.

   RT-1, RT-3, and RT-4 routes from other sites are terminated at the
   BGs.  As defined in the base specifications, RT-3 routes carry
   downstream VNI labels and are used to pre-build VXLAN tunnels in
   the common EVPN domain for L2, L3, and multi-destination traffic.

4.2.1.  Border Gateway Designated Forwarder Election

   In the presence of more than one BG node in a site, the forwarding
   of multi-destination L2 or L3 traffic, both into and out of the
   site, needs to be carried out by a single node.  This node is
   termed the Designated Forwarder (DF) and is elected per VNI as per
   the rules defined in Section 8.5 of RFC 7432 [RFC7432].  RT-4
   Ethernet Segment routes are used for the DF election.  In the
   multi-site deployment, the RT-4 Ethernet Segment routes carry an
   ES-Import RT Extended Community attribute.  These routes are
   imported only by the local site members, i.e., when the received
   ES-Import value matches the node's own value.  The 6-byte value is
   generated by concatenating the 4-byte AS number of the AS to which
   the member belongs with the 2-byte site-identifier.  As a result,
   only local site members match and form the candidate list.  All the
   BGs are able to extract the site-identifier from this attribute,
   and the set of nodes among which this election is run is thus
   constrained to the BGs of the same site.

   In both modes (Anycast and Multipath), RT-3 routes are generated
   locally and advertised by the DF winner Border Gateway with its
   unique gateway IP.  This facilitates building fast-converging flood
   domain connectivity inter-site and intra-site while at the same
   time avoiding duplicate traffic, since only the elected DF winner
   forwards multi-destination inter-site traffic.

   Failure events that lead to a BG losing all of its connectivity to
   the IP interconnect backbone should trigger the BG to withdraw its
   Border RT-4 Ethernet Segment route(s) and RT-1 A-D route, to
   indicate to other BGs of the same site that it is no longer a
   candidate BG and to BGs of different sites that it is no longer a
   Border Gateway.
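
   The ES-Import derivation and the site- and VNI-scoped election can
   be summarized in the following sketch (Python; the modulus-based
   choice follows Section 8.5 of RFC 7432, and the candidate addresses
   are illustrative):

      import struct

      def es_import_value(as_number: int, site_id: int) -> bytes:
          """6-octet ES-Import: 4-octet AS number + 2-octet site-id."""
          return struct.pack("!IH", as_number, site_id)

      def elect_df(candidate_bgs: list[str], vni: int) -> str:
          """Per-VNI DF election among the BGs of one site: order the
          candidate IPs numerically and pick entry (VNI mod N)."""
          ordered = sorted(candidate_bgs,
                           key=lambda ip: tuple(int(o)
                                                for o in ip.split(".")))
          return ordered[vni % len(ordered)]

      # BGs whose received ES-Import matches their own value form the
      # candidate list, e.g. the two BGs of Site A:
      df = elect_df(["192.0.2.1", "192.0.2.2"], vni=10100)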

4.2.2.  Anycast Border Gateway

   In this mode, all BGs share the same gateway IP and rewrite the
   EVPN next-hop attribute with a shared logical next-hop entity.
   However, these BGs each maintain a unique gateway IP to facilitate
   building IR trees from site-local nodes for forwarding
   multi-destination traffic.  EVPN RT-2 and RT-5 routes are
   advertised to the nodes in the site from all BGs, and the BGs run a
   per-VNI DF election for multi-destination traffic.  RT-3 routes are
   advertised by the DF winner BG for a given VNI so that only the DF
   receives and forwards inter-site traffic.  It is also possible for
   all BGs of a site to advertise and draw traffic, to improve the
   convergence properties of the network.  In the case of
   multi-destination trees built by non-EVPN procedures (say PIM), all
   BGs receive the traffic but only the DF winner forwards it.

   It is recommended that the BG be enabled in the Anycast mode,
   wherein the BG functionality is available to the rest of the
   network as a single logical entity for inter-site communication.
   In the absence of Anycast capability, the BG could be enabled as
   individual gateways (Single-Active BG), wherein a single node
   performs the active BG role for a given flow at a given time.  As
   of now, the Border Gateway system MAC of the other border nodes
   belonging to the same site is expected to be configured
   out-of-band.

4.2.3.  Multi-path Border Gateway

   In this mode, Border Gateways rewrite the EVPN next-hop attribute
   with unique next-hop entities.  This provides the flexibility to
   apply the usual policies and pick per-VRF, per-VNI, or per-flow
   primary/backup Border Gateways.  Hence, an intra-site node sees
   each BG as a next-hop for any external L2 or L3 unicast
   destination, and performs an ECMP path selection to load-balance
   traffic sent to external destinations.  In case an intra-site node
   is not capable of performing ECMP hash-based path selection
   (possibly some L2 forwarding implementations), the node is expected
   to choose one of the BGs as its designated forwarder.  EVPN RT-2
   and RT-5 routes are advertised to the nodes in the site from all
   Border Gateways, and the Border Gateways run a per-VNI DF election
   for multi-destination traffic.  RT-3 routes are advertised by the
   DF winner Border Gateway for a given VNI so that only the DF
   receives and forwards inter-site traffic.  It is also possible for
   all Border Gateways of a site to advertise and draw traffic, to
   improve the convergence properties of the network.  In the case of
   multi-destination trees built by non-EVPN procedures (say PIM), all
   Border Gateways receive the traffic but only the DF winner forwards
   it.

4.3.  EVPN Route Processing at the Border Gateway

   BG functionality in an EVPN site SHOULD be enabled on more than one
   node in the network for redundancy and high-availability purposes.
   Any external RT-2/RT-5 routes received by the BGs of a site are
   advertised to all the intra-site nodes by all the BGs.  Internal
   RT-2/RT-5 routes received by the BGs from the intra-site nodes are
   advertised by all the BGs of a site to the remote BGs, so that any
   L2/L3 known unicast traffic to internal destinations can be sent to
   any one of the local BGs by remote sources.  For known L2 and L3
   unicast traffic, the individual BGs behave either as a single
   logical forwarding node (Anycast model) or as a set of active
   forwarding nodes.

   All control plane and data plane states are interconnected in a
   completely downstream fashion.  For example, the BGP import rules
   for a Type 3 route should be able to extend a flood domain for a
   VNI, and flood traffic destined to the advertising EVPN node should
   carry the VNI announced in that Type 3 route.  Similarly, Type 2
   and Type 5 control and forwarding states should be interconnected
   in a completely downstream fashion.

   o  Route Target processing for RT-1 routes: Every IP-VRF and
      MAC-VRF generates RT-1 routes with the format described in
      Section 4.1.  Route Targets can be auto-derived from the
      Ethernet Tag ID (VLAN ID) for that EVPN instance as described in
      Section 7.10.1 of RFC 7432 [RFC7432].  The ES-Import Route
      Target extended community as described in Section 7.6 of
      RFC 7432 [RFC7432] is optional for RT-1 routes in this context.
      The ESI Label Extended Community attribute MUST be attached in
      this context, since it carries the MS-Border indication as a new
      flag bit.

   o  Route Target processing for RT-4 routes: Every IP-VRF and
      MAC-VRF generates RT-4 routes with the format described in
      Section 4.1.  Route Targets can be auto-derived from the
      Ethernet Tag ID (VLAN ID) for that EVPN instance as described in
      Section 7.10.1 of RFC 7432 [RFC7432].  The ES-Import Route
      Target extended community as described in Section 7.6 of
      RFC 7432 [RFC7432] is mandatory for RT-4 routes in this context.
      The encoding of the ES-Import value is based on the AS number
      and site-identifier as described in Section 4.2.1.  Such an
      import Route Target allows the import of RT-4 routes only by the
      Border Gateways of the same site.

   o  Route Target processing for RT-2, RT-3, and RT-5 routes: These
      routes carry either auto-derived Route Targets (based on the
      Ethernet Tag ID (VLAN ID) for that EVPN instance) or explicit
      Route Targets.  The usual Border Gateway import rules import
      these routes and re-advertise them with Border Gateway
      next-hops.  Implementations SHOULD also provide a mechanism that
      prevents routes imported and re-advertised at Border Gateways
      from looping should they come back to a Border Gateway.  RT-3
      routes will be imported and processed on Border Gateways from
      other Border Gateways but MUST NOT be advertised again.
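
   A sketch of this re-origination step is shown below (Python; the
   route structure, the VNI translation table, and the use of the
   site of origin as the loop-avoidance check are illustrative
   assumptions, since this document leaves the exact loop-avoidance
   mechanism open):

      from dataclasses import dataclass, replace

      @dataclass(frozen=True)
      class EvpnRoute:
          route_type: int      # 2 or 5
          prefix: str
          next_hop: str
          vni: int             # downstream-assigned VNI
          origin_site: int     # site-id derived from the Type 5 ESI

      def reoriginate(route: EvpnRoute, bg_ip: str, local_site: int,
                      vni_map: dict[int, int]) -> EvpnRoute | None:
          """Rewrite the next-hop and translate the downstream VNI
          before re-advertising a remote route into the local site."""
          if route.origin_site == local_site:
              return None  # assumed loop avoidance: drop returning routes
          return replace(route,
                         next_hop=bg_ip,  # Anycast or unique BG address
                         vni=vni_map.get(route.vni, route.vni))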

4.4.  Multi-Destination Trees between Border Gateways

   The procedures described here recommend building an Ingress
   Replication (IR) tree between Border Gateways.  This allows every
   site to independently build its site-specific multi-destination
   trees.  A multi-destination end-to-end tree between leaves could be
   PIM (site 1) + IR (between Border Gateways) + PIM (site 2), or
   IR-IR-IR, or PIM-IR-IR.  However, this does not rule out using
   IR-PIM-IR or end-to-end PIM to build multi-destination trees
   end-to-end.

   Border Gateways generate RT-3 routes with their unique gateway IPs
   and advertise them to the Border Gateways of other sites.  These
   RT-3 routes help in building the IR trees between Border Gateways.
   However, only the per-VNI DF winner forwards multi-destination
   traffic across sites.

   As Border Gateways are part of both the site-specific and the
   inter-site multi-destination IR trees, a split-horizon mechanism is
   used to avoid loops.  The multi-destination tree rooted at a Border
   Gateway towards other sites (or Border Gateways) is placed in one
   split-horizon group; similarly, the multi-destination IR tree
   rooted at a Border Gateway towards site-local nodes is placed in
   another split-horizon group.

   If PIM is used to build the multi-destination trees in the
   site-specific domain, all Border Gateways join such PIM trees and
   draw multi-destination traffic.  However, only the DF Border
   Gateway forwards traffic towards other sites.
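
   The split-horizon rule at a Border Gateway can be captured in a few
   lines (a Python sketch; the group labels and the DF check reflect
   the text above, not a prescribed implementation):

      SITE_LOCAL, INTER_SITE = "site-local", "inter-site"

      def horizon_group(tunnel_peer_site: int, local_site: int) -> str:
          """Each VXLAN tunnel on a BG belongs to one horizon group."""
          return (SITE_LOCAL if tunnel_peer_site == local_site
                  else INTER_SITE)

      def may_forward(ingress: str, egress: str, is_df: bool) -> bool:
          """Never forward a multi-destination frame back into the
          horizon group it arrived on; only the per-VNI DF stitches
          the two groups."""
          if ingress == egress:
              return False   # split horizon blocks intra-group replay
          return is_df       # non-DF BGs receive but do not forward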

4.5.  Inter-site Unicast Traffic

   As site-local nodes see all inter-site EVPN routes via their Border
   Gateways, VXLAN tunnels are built between the leaves and the
   site-local Border Gateways, and inter-site VXLAN tunnels are built
   between the Border Gateways of different sites.  An end-to-end
   bidirectional VXLAN forwarding path between leaves in different
   sites consists of a VXLAN tunnel from the leaf (say, in Site A) to
   its Border Gateway (BG-A1), another VXLAN tunnel from that Border
   Gateway (BG-A1) to a Border Gateway (BG-B1) in the other site (say,
   Site B), and a tunnel from that Border Gateway (BG-B1) to the leaf
   in Site B.  Such an arrangement of tunnels is scalable, as a full
   mesh of VXLAN tunnels across inter-site leaves is substituted by a
   combination of intra-site and inter-site tunnels.

   L2 and L3 unicast frames from site-local leaves reach the Border
   Gateway using VXLAN encapsulation.  At the Border Gateway, the
   VXLAN header is stripped off and another VXLAN header is pushed to
   send the frames to the destination site's Border Gateway.  The
   destination site's Border Gateway in turn strips off the VXLAN
   header and pushes another VXLAN header to send the frame to the
   destination site leaf.

4.6.  Inter-site Multi-destination Traffic

   Multi-destination traffic is forwarded from one site to another
   only by the DF for that VNI.  As frames reach the Border Gateway
   from site-local nodes, the VXLAN header is decapsulated from the
   payload, and another VXLAN header (derived from the downstream
   Type 3 EVPN routes received from the Border Gateways of the
   destination site) is pushed to forward the payload to the
   destination site's Border Gateway.  Similarly, the destination
   site's Border Gateway strips off the VXLAN header and forwards the
   payload after encapsulating it with another VXLAN header towards
   the destination leaf.

   As explained in Section 4.4, the split-horizon mechanism is used to
   avoid looping of inter-site multi-destination frames.
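
   The per-hop header rewrite described in the two sections above
   amounts to a decapsulate-then-encapsulate operation on the 8-octet
   VXLAN header (RFC 7348).  A minimal sketch, ignoring the outer
   IP/UDP rewrite towards the next tunnel endpoint:

      import struct

      VXLAN_I_FLAG = 0x08   # "I" bit: the VNI field is valid

      def rewrite_vxlan(frame: bytes, egress_vni: int) -> bytes:
          """Strip the ingress VXLAN header and push a fresh one
          carrying the VNI signaled in the destination's downstream
          EVPN routes."""
          assert frame[0] & VXLAN_I_FLAG, "not a valid VXLAN packet"
          inner = frame[8:]                        # decapsulated payload
          header = struct.pack("!BBH", VXLAN_I_FLAG, 0, 0)  # flags+rsvd
          header += struct.pack("!I", egress_vni << 8)      # VNI | rsvd
          return header + inner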

4.7.  Host Mobility

   Host-move handling is the same as defined in RFC 7432 [RFC7432].
   When a host moves, EVPN RT-2 routes with an updated sequence number
   are propagated to every EVPN node.  When a host moves between
   sites, only the Border Gateways may see EVPN updates with both
   next-hop attribute and sequence number changes, while the leaves
   may see updates with only updated sequence numbers.  In other
   cases, both the Border Gateways and the leaves may see next-hop and
   sequence number changes.
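
   The sequence-number comparison that drives this behavior can be
   sketched as follows (Python; the tie-breaking rule shown is the
   lowest-address rule of RFC 7432, and the structures are
   illustrative):

      def ip_key(ip: str) -> tuple[int, ...]:
          return tuple(int(octet) for octet in ip.split("."))

      def accept_move(installed_seq: int, installed_nh: str,
                      new_seq: int, new_nh: str) -> bool:
          """A MAC route with a higher MAC Mobility sequence number
          supersedes the installed one; on a tie, the advertisement
          from the numerically lowest address is retained."""
          if new_seq != installed_seq:
              return new_seq > installed_seq
          return ip_key(new_nh) < ip_key(installed_nh)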

5.  Convergence

5.1.  Fabric to Border Gateway Failure

   If a Border Gateway is lost, its next-hop is withdrawn for the
   RT-2/RT-5 routes.  In addition, a per-VNI DF election is triggered
   to choose a new DF.  The new DF winner becomes the forwarder of
   multi-destination inter-site traffic.

5.2.  Border Gateway to Border Gateway Failures

   In cases where the inter-site cloud has link failures, the direct
   forwarding path between Border Gateways can be lost.  In this case,
   traffic from one site can reach another site via the Border Gateway
   of an intermediate site.  However, this is addressed like a regular
   underlay failure, and the traffic termination end-points stay the
   same for inter-site traffic flows.

6.  Interoperability

   The procedures defined here apply only to Border Gateways.
   Therefore, the other EVPN nodes in the network should be RFC 7432
   [RFC7432] compliant to operate in such topologies.

   As the procedures described here are applicable only after
   receiving a Border A-D route, other connected domains that are not
   capable of this multi-site gateway model can work in regular EVPN
   mode.  The exact procedures will be detailed in a future version of
   this draft.

   The procedures here also provide the flexibility to connect
   non-EVPN VXLAN sites, by provisioning Border Gateways on such sites
   and interconnecting those Border Gateways with the Border Gateways
   of other sites.  Such Border Gateways in non-EVPN VXLAN sites play
   the dual role of an EVPN gateway towards the common EVPN domain and
   a non-EVPN gateway towards the non-EVPN VXLAN site.

7.  Isolation of Fault Domains

   Isolation of network defects requires policies such as storm
   control, security ACLs, etc., to be implemented at site boundaries.
   Border Gateways should be capable of inspecting the inner payload
   of packets received from VXLAN tunnels and of enforcing the
   configured policies, to prevent defects from percolating from one
   part of the network to the rest.

8.  MVPN with Multi-site EVPN

   BGP-based MVPN as defined in RFC 6513 [RFC6513] and RFC 6514
   [RFC6514] will coexist with Multi-site EVPN without any changes to
   the route types and encodings defined for the MVPN route types in
   those RFCs.  The Route Distinguisher and VRF Route Import extended
   communities will be attached to MVPN routes as defined in the BGP
   MVPN RFCs.  Import and Export Route Targets will be attached to
   MVPN routes either by auto-generating them from the VNI or by
   explicit configuration per MVPN.  Since the BGP MVPN RFCs can adapt
   to any VPN address family to provide the RPF information needed to
   build C-multicast trees, EVPN route types will be used to provide
   the required RPF information for multicast sources in MVPNs.  In
   order to follow the segmentation model of Multi-site EVPN, the
   following procedures are recommended to build provider and customer
   multicast trees between sources and receivers across sites.

8.1.  Inter-Site MI-PMSI

   As defined in the above-mentioned MVPN RFCs, I-PMSI A-D routes are
   used to signal a provider tunnel, or MI-PMSI, per MVPN.  Multi-site
   EVPN recommends EVPN Type 3 routes to build such an MI-PMSI
   provider tunnel per VPN between the Border Gateways of different
   sites.  Every MVPN node will use its unique router identifier to
   build these MI-PMSI provider tunnels.  In the Anycast Border
   Gateway model too, these MI-PMSI provider tunnels are built using
   the unique router identifiers of the Border Gateways.  In a similar
   fashion, these Type 3 routes can be used to build an MI-PMSI
   provider tunnel per MVPN within sites.

8.2.  Stitching of Customer Multicast Trees across Sites

   All Border Gateways will rewrite the next-hop and re-originate MVPN
   routes received from other sites into the local site, and from the
   local site to other sites.  Therefore, customer multicast trees
   will be logically built end-to-end across sites by stitching these
   trees via the Border Gateways.  A C-multicast join route (say, MVPN
   Type 7) will follow the EVPN RPF path to build the C-multicast tree
   from a leaf in a site to its Border Gateway, and on to the
   destination site leaves via the destination site Border Gateways.
   Similarly, the Source Active A-D MVPN route (MVPN Type 5) will have
   its next-hop rewritten and be re-originated via the Border
   Gateways, so that the source C-multicast trees are stitched via the
   Border Gateways.

8.3.  RP Placement across Sites

   Multi-site EVPN recommends only source C-multicast trees across
   sites.  Therefore, customer RP placement per MVPN should be
   restricted to within sites.  The Source Active A-D MVPN route type
   (Type 5) will be used to signal C-multicast sources across sites.

8.4.  Inter-Site S-PMSI

   As defined in the BGP MVPN RFCs, S-PMSI A-D routes (MVPN Type 3)
   will be used to signal selective PMSI trees for high-bandwidth
   C-multicast streams.  These S-PMSI A-D routes will be signaled
   across sites via the Border Gateways rewriting the next-hop and
   re-originating them to other sites.  The PMSI Tunnel attribute in
   the re-originated S-PMSI routes will be adjusted to the provider
   tunnel types used between the Border Gateways across sites.

9.  Observations with Multi-site EVPN

   Since an Anycast address is now advertised in the underlay
   protocols per ES, this solution does increase the scale of routes
   for the underlay.  Furthermore, ES failures are now conveyed via
   the underlay protocols.  To drop down to single-homing mode, one
   would need to track the interfaces that are used for the inter-site
   traffic.  It is a requirement that intra-site and inter-site
   traffic not use the same links from the nodes.  Due to the Anycast
   formulation of the gateways, it is not possible to entertain any
   load-balancing per ES link for the gateway nodes.

   Loop avoidance by the use of the domain-path-id as defined in
   [EVPN-IPVPN-INTERWORKING] will be detailed in a future version of
   this draft.

10.  Acknowledgements

   The authors would like to thank Max Ardica, Murali Garimella, Anuj
   Mittal, Lilian Quan, Veera Ravinutala, and Tarun Wadhwa for their
   review and comments.

11.  IANA Considerations

   TBD.

12.  Security Considerations

   TBD.

13.  References

13.1.  Normative References

   [DCI-OVERLAY]
              Sajassi, A., et al., "A Network Virtualization Overlay
              Solution using EVPN",
              draft-ietf-bess-dci-evpn-overlay-07 (work in progress),
              2018.

   [EVPN-IPVPN-INTERWORKING]
              Sajassi, A., et al., "EVPN Interworking with IPVPN",
              draft-rabadan-sajassi-bess-evpn-ipvpn-interworking-00
              (work in progress), 2018.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-
              Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015, <https://www.rfc-editor.org/info/rfc7432>.

13.2.  Informative References

   [RFC6513]  Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in MPLS/
              BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513, February
              2012, <https://www.rfc-editor.org/info/rfc6513>.

   [RFC6514]  Aggarwal, R., Rosen, E., Morin, T., and Y. Rekhter, "BGP
              Encodings and Procedures for Multicast in MPLS/BGP IP
              VPNs", RFC 6514, DOI 10.17487/RFC6514, February 2012,
              <https://www.rfc-editor.org/info/rfc6514>.

   [RFC7209]  Sajassi, A., Aggarwal, R., Uttaro, J., Bitar, N.,
              Henderickx, W., and A. Isaac, "Requirements for Ethernet
              VPN (EVPN)", RFC 7209, DOI 10.17487/RFC7209, May 2014,
              <https://www.rfc-editor.org/info/rfc7209>.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar,
              R., Uttaro, J., and W. Henderickx, "A Network
              Virtualization Overlay Solution Using Ethernet VPN
              (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018,
              <https://www.rfc-editor.org/info/rfc8365>.

Appendix A.  Additional Stuff

   TBD.

Authors' Addresses

   Rajesh Sharma (editor)
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: rajshr@cisco.com

   Ayan Banerjee (editor)
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: ayabaner@cisco.com

   Ali Sajassi
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: sajassi@cisco.com

   Lukas Krattiger
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: lkrattig@cisco.com

   Raghava Sivaramu
   Cisco Systems
   170 W Tasman Drive
   San Jose, CA
   USA

   Email: raghavas@cisco.com