BESS Working Group                                           Ali Sajassi
Internet Draft                                             Gaurav Badoni
Category: Standard Track                                 Priyanka Warade
                                                         Suresh Pasupula
                                                           Cisco Systems

Expires: January 2, 2017                                    July 2, 2017


            L3 Aliasing and Mass Withdrawal Support for EVPN
               draft-sajassi-bess-evpn-ip-aliasing-00.txt

Abstract

   This draft proposes an extension to [RFC7432] to do Aliasing for
   Layer 3 routes that is needed for symmetric IRB to build a complete
   IP ECMP.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
     1.1  Terminology
   2  IP Aliasing and Backup Path
     2.1  Constructing Ethernet A-D per EVPN Instance Route
   3  Fast Convergence for Routed Traffic
     3.1  Constructing Ethernet A-D per Ethernet Segment Route
       3.1.1  Ethernet A-D Route Targets
     3.2  Avoiding convergence issues by syncing IP prefixes
     3.3  Handling Silent Host
     3.4  MAC Aging
   4  Determining Reach-ability to Unicast IP Addresses
     4.1  Local Learning
     4.2  Remote Learning
       4.2.1  Constructing MAC/IP Address Advertisement
       4.2.2  Route Resolution
   5  Forwarding Unicast Packets
   6  Load Balancing of Unicast Packets
   7  Security Considerations
   8  IANA Considerations
   9  References
     9.1  Normative References
     9.2  Informative References
   Authors' Addresses

1  Introduction

                               +---------+
            +-------------+    |         |
            |             |    |         |
          / |    PE1      |----|         |   +-------------+
         /  |             |    |  MPLS/  |   |             |
        /   +-------------+    |  VxLAN/ |   |     PE3     |---H3
   H1---                       |  NVGRE  |   |             |
        \   +-------------+    |         |---|             |
         \  |             |    |         |   +-------------+
          \ |    PE2      |----|         |
            |             |    |         |
            +-------------+    |         |
                               |         |
                               |         |
                               +---------+

   Figure 1: Inter-subnet traffic between Multihoming PEs and Remote PE

   Consider a pair of multi-homing TORs PE1 and PE2. Let there be a host
   H1 attached to them. Consider another TOR PE3 and a host H3 attached
   to it.

   With Asymmetric IRB, if H3 sends inter-subnet traffic to H1, routing
   will happen at PE3. PE3 will have the destination SVI and will
   trigger ARP if it does not have an ARP adjacency to H1. Finally, the
   routing lookup will resolve the destination MAC to H1's MAC address.
   Furthermore, H1's MAC will point to a VxLAN ECMP to PE1 and PE2,
   either due to host route advertisement or MAC Aliasing as detailed in
   [RFC7432].

   With Symmetric IRB, if H3 sends inter-subnet traffic to H1, the
   routing lookup will happen at PE3. PE3 will do a routing lookup in
   the L3VNI-VRF context and is not expected to have the destination
   SVI. Therefore, at PE3, an IP ECMP list (PE1/PE2) needs to be built
   for H1's IP address for proper load balancing. If H1 is locally
   learnt at only one of the PEs (PE1 or PE2) due to port-channel
   hashing, PE3 will not be able to build the IP ECMP, as there is no
   Aliasing for Layer 3 addresses.

   This draft proposes an extension to do Aliasing for Layer 3 routes
   that is needed for symmetric IRB to build a complete IP ECMP.

1.1  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   IRB: Integrated Routing and Bridging

   IRB Interface: A virtual interface that connects the bridging module
   and the routing module on an NVE.

   Broadcast Domain: In a bridged network, the broadcast domain
   corresponds to a Virtual LAN (VLAN), where a VLAN is typically
   represented by a single VLAN ID (VID) but can be represented by
   several VIDs where Shared VLAN Learning (SVL) is used per [802.1Q].

   Bridge Table: An instantiation of a broadcast domain on a MAC-VRF.

   CE: Customer Edge device, e.g., a host, router, or switch.

   EVI: An EVPN instance spanning the Provider Edge (PE) devices
   participating in that EVPN.

   MAC-VRF: A Virtual Routing and Forwarding table for Media Access
   Control (MAC) addresses on a PE.

   Ethernet Segment (ES): When a customer site (device or network) is
   connected to one or more PEs via a set of Ethernet links, then that
   set of links is referred to as an 'Ethernet segment'.

   Ethernet Segment Identifier (ESI): A unique non-zero identifier that
   identifies an Ethernet segment is called an 'Ethernet Segment
   Identifier'.

   LACP: Link Aggregation Control Protocol.

   PE: Provider Edge device.

   Single-Active Redundancy Mode: When only a single PE, among all the
   PEs attached to an Ethernet segment, is allowed to forward traffic
   to/from that Ethernet segment for a given VLAN, then the Ethernet
   segment is defined to be operating in Single-Active redundancy mode.

   All-Active Redundancy Mode: When all PEs attached to an Ethernet
   segment are allowed to forward known unicast traffic to/from that
   Ethernet segment for a given VLAN, then the Ethernet segment is
   defined to be operating in All-Active redundancy mode.

2  IP Aliasing and Backup Path

   Host IP and MAC routes are learnt by PEs on the access side via a
   control plane protocol like ARP.
   In the case where a CE is multihomed to
   multiple PE nodes using a LAG and is running in All-Active Redundancy
   Mode, the Host IP will be learnt and advertised in the MAC/IP
   Advertisement only by the PE that receives the ARP packet. As a
   result, the remote PE sees only one next-hop for the Host IP and
   forwards traffic to that advertising PE. Hence, the remote PE is not
   able to effectively load balance the traffic towards the multihomed
   Ethernet Segment.

   To address this issue, the concept of Aliasing introduced in
   [RFC7432] can be extended to Layer 3 routes as well. The PE SHOULD
   advertise reachability to an L3 VRF instance on a given ES for IP
   addresses using the existing EAD/EVI route. In this case, the EVPN
   instance is the VRF table to which the host IP address belongs. This
   will henceforth be referred to as the IP-EAD/EVI route.

   A remote PE that receives an IP route with a non-reserved ESI SHOULD
   consider it reachable by all PEs that have advertised the IP-EAD/EVI
   advertisement route and the EAD/ES advertisement route containing the
   VRF Route-Targets for that ES. The EAD/ES route must have the
   Single-Active bit in the flags of the ESI Label extended community
   set to 0 for Aliasing to take effect.

   The IP-EAD/EVI route cannot be used for route forwarding until the
   associated Ethernet A-D per ES route is received.

   In the case of Single-Active redundancy mode, the remote PE SHOULD
   use the EVPN Layer 2 attribute extended community of the IP-EAD/EVI
   route, as described in draft-ietf-bess-evpn-vpws-07, in combination
   with the EAD/ES route to determine the Backup Path for the IP
   addresses in the given IP VRF context. This alternate path SHOULD be
   installed as a backup path for the IP address.
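
   As an illustration only, the path selection described in this
   section can be sketched in Python as below. The names EsState and
   paths_for_host are hypothetical and are not defined by this draft.
   The sketch assumes that the advertising PE's own Ethernet A-D per ES
   route has already been received, and it simplifies the Single-Active
   case by treating every other eligible PE as a backup instead of
   decoding the EVPN Layer 2 attribute extended community.

```python
from dataclasses import dataclass, field


@dataclass
class EsState:
    """Routes a remote PE holds for one (ESI, IP-VRF) pair (hypothetical)."""
    # PE name -> value of the Single-Active bit carried in its EAD/ES route
    ead_es: dict = field(default_factory=dict)
    # PEs from which an IP-EAD/EVI route has been received
    ip_ead_evi: set = field(default_factory=set)


def paths_for_host(advertiser, state):
    """Return (active, backup) next-hop lists for one host IP route.

    advertiser is the PE that sent the MAC/IP Advertisement route.
    """
    # A PE is eligible for Aliasing only if both its EAD/ES route and
    # its IP-EAD/EVI route are present.
    eligible = sorted(pe for pe in state.ip_ead_evi if pe in state.ead_es)
    single_active = any(state.ead_es[pe] for pe in eligible)
    if not single_active:
        # All-Active: alias the host onto every eligible PE (IP ECMP).
        return sorted(set(eligible) | {advertiser}), []
    # Single-Active (simplified): advertiser is primary, others are backup.
    return [advertiser], [pe for pe in eligible if pe != advertiser]
```

   With H1 advertised only by PE1 and both routes received from PE1 and
   PE2 in All-Active mode, the sketch yields the ECMP list (PE1/PE2)
   that the remote PE is expected to build.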

2.1  Constructing Ethernet A-D per EVPN Instance Route

   This draft proposes the advertisement of a per-EVI Ethernet A-D route
   for IP VRFs to enable Aliasing for IP addresses. The
   usage/construction of this route remains similar to that described in
   RFC 7432, with a few notable exceptions as listed below.

   * The Route-Distinguisher should be set to the corresponding L3VPN
   context.

   * The Ethernet Tag should be set to 0.

   * The L3 EAD/EVI SHOULD carry one or more IP VRF Route-Target (RT)
   attributes.

   * The L3 EAD/EVI SHOULD carry the RMAC Extended Community attribute.

   * The MPLS Label usage should be as described in RFC 7432.

   It is important to note that the prefix for an IP-EAD/EVI and an
   L2-EAD/EVI may be identical. However, since the RD of the IP-EAD/EVI
   is set to the corresponding L3VPN context and the RD of the
   L2-EAD/EVI is set to the corresponding MAC-VRF context, the import
   will happen in the respective IP-VRFs and MAC-VRFs and hence the
   prefix will not be overwritten.

3  Fast Convergence for Routed Traffic

   In EVPN, Host IP reachability is learned via the BGP control plane
   over the MPLS network. All the hosts that are dually connected behind
   an ES are advertised by the PEs belonging to the redundancy group. A
   remote TOR receiving these host routes can lose reachability to any
   of the PEs due to a box reload, a core failure, or an access failure
   on that PE.

   BGP PIC functionality is the existing mechanism for fast convergence
   as described in https://tools.ietf.org/html/draft-rtgwg-bgp-pic-02.
   The PIC feature does not solve the convergence issue for the access
   failure cases, as the PEs are still reachable from the remote TOR.

   To alleviate this, EVPN defines a mechanism to efficiently and
   quickly signal, to remote PE nodes, the need to update their
   forwarding tables upon the occurrence of a failure in connectivity to
   an Ethernet segment.
   This is done by having each PE advertise a set
   of one or more Ethernet A-D per ES routes for each locally attached
   Ethernet segment (refer to Section 3.1 below for details on how these
   routes are constructed). A PE may need to advertise more than one
   Ethernet A-D per ES route for a given ES because the ES may be in a
   multiplicity of EVIs and the RTs for all of these EVIs may not fit
   into a single route. Advertising a set of Ethernet A-D per ES routes
   for the ES allows each route to contain a subset of the complete set
   of RTs. Each Ethernet A-D per ES route is differentiated from the
   other routes in the set by a different Route Distinguisher (RD).

   Upon failure in connectivity to the attached ES, the PE withdraws the
   corresponding set of Ethernet A-D per ES routes. This triggers all
   PEs that receive the withdrawal to update their next-hop adjacencies
   for all IP addresses across IP VRFs associated with the Ethernet
   segment in question. If no other PE has advertised an Ethernet A-D
   route for the same segment, then the PE that received the withdrawal
   simply invalidates the IP entries for that segment. Otherwise, the
   PE updates its next-hop adjacencies accordingly.

   These routes should be processed with higher priority than other MAC
   or MAC-IP withdrawals upon failure. Similar priority processing is
   needed on the intermediate RRs as well.

   This draft addresses the mass withdrawal behavior for routed
   traffic. For Layer 2, please refer to Section 8.2 of RFC 7432.

3.1  Constructing Ethernet A-D per Ethernet Segment Route

   This section describes the procedures used to construct the Ethernet
   A-D per ES route, which is used for fast convergence (as discussed
   above). The usage/construction of this route remains similar to that
   described in Section 8.2.1 of RFC 7432, with a few notable exceptions
   as explained in the following sections.
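
   The splitting of the RT set across several Ethernet A-D per ES
   routes, as described in Section 3, can be sketched as follows. This
   is a minimal illustration, not a definitive implementation: the
   function name build_ead_es_routes, the max_rts_per_route limit, and
   the way the RD number is allocated are all assumptions made for the
   example.

```python
def build_ead_es_routes(esi, pe_ip, route_targets, max_rts_per_route):
    """Split the RT set for one ES across several EAD/ES routes.

    Each route carries a subset of the RTs and is made distinct from
    the others by its Route Distinguisher.
    """
    if max_rts_per_route < 1:
        raise ValueError("max_rts_per_route must be at least 1")
    rts = sorted(set(route_targets))  # drop duplicates, keep a stable order
    routes = []
    for start in range(0, len(rts), max_rts_per_route):
        index = start // max_rts_per_route
        routes.append({
            # Hypothetical RD allocation: PE address plus a running number.
            "rd": f"{pe_ip}:{index + 1}",
            "esi": esi,
            "route_targets": rts[start:start + max_rts_per_route],
        })
    return routes
```

   Withdrawing every route returned for a given ESI corresponds to the
   mass withdrawal for that Ethernet segment.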

3.1.1  Ethernet A-D Route Targets

   Each Ethernet A-D per ES route MUST carry one or more Route Target
   (RT) attributes. The set of Ethernet A-D routes per ES MUST carry the
   entire set of IP VRF RTs for all the IP VRFs, in addition to the MAC
   VRF RTs for all the EVPN instances to which the Ethernet segment
   belongs.

3.2  Avoiding convergence issues by syncing IP prefixes

   Consider a pair of multi-homing TORs PE1 and PE2. Let there be a host
   H1 attached to them. Consider another TOR PE3 and a host H3 attached
   to it.

   If the host H1 is learnt on both the PEs, an ECMP path list is formed
   on PE3 pointing to (PE1/PE2). Traffic from H3 to H1 is not impacted
   even if one of the TORs becomes unreachable, as the path list gets
   corrected upon receiving the mass withdrawal route (Ethernet A-D per
   ES).

   Let us consider a case where H1 is locally learnt only on PE1 due to
   port-channel hashing. At PE3, H1 has the ECMP path list (PE1/PE2)
   using Aliasing as described in Section 2 of this draft. Traffic from
   H3 can reach either of the TORs PE1 or PE2.

   On PE2, all the remote MAC-IP routes belonging to the same Ethernet
   Segment that are advertised by its respective peers (PE1 in our
   example) should be synced and installed locally on PE2 but not
   advertised as local routes by BGP. When the traffic from H3 reaches
   PE2, it will be able to forward the traffic to H1 without any
   convergence delay caused by triggering ARP/ND. In a scaled setup, the
   convergence delay can be significant, as the ARP and ND resolution
   can take a lot of time. So syncing the IPv4/IPv6 prefixes that belong
   to the same Ethernet Segment helps in solving convergence issues.

3.3  Handling Silent Host

   In continuation with the discussion above, if the reachability of PE1
   is lost, PE3 will update the ECMP list for H1 to PE2, upon receiving
   mass withdrawal from PE1.
   If host H1 is also withdrawn from PE1, then
   the same route is withdrawn from PE2 and PE3. Hence traffic from H3
   to H1 is black-holed till H1 is re-learnt on PE2.

   This black-holing can be much worse if H1 behaves like a silent
   host. The IP address of H1 will not be re-learnt on PE2 till H1
   re-ARPs or some traffic triggers ARP for H1.

   PE2 can detect the failure of PE1's reachability in the following
   ways:

   a) When a core failure or box reload happens on PE1, the loss of
   next-hop reachability to PE1 can be detected by the underlay routing
   protocols.

   b) Upon access failure, PE1 withdraws the EAD/ES route and PE2
   can use this as a trigger to detect the failure.

   Thus, to avoid the black-holing, when PE2 detects loss of
   reachability to PE1, it should trigger ARP/ND for all remote IP
   prefixes received from its ES peers (i.e., PE1) belonging to the same
   Ethernet Segment across IP-VRF contexts. This will force host H1 to
   reply to the solicited ARP/ND from PE2 and refresh both MAC and IP
   for the corresponding host in its tables.

   Even in a core failure scenario on PE1, PE1 must withdraw all its
   local L2 connectivity, as L2 traffic should not be received by PE1.
   So when ARP/ND is triggered from PE2, the replies from host H1 can
   only be received by PE2. Thus H1 will be learnt as a local route and
   also advertised from PE2.

   It is recommended to have a staggered or delayed deletion of the IP
   routes from PE1, so that the ARP/ND refresh can happen on PE2 before
   the deletion.

3.4  MAC Aging

   PE1 would do an ARP/ND refresh for H1 before it ages out. During this
   process, H1 can age out genuinely or due to the ARP/ND reply landing
   on PE2. PE1 must withdraw the local entry from BGP when the H1 entry
   ages out. PE1 deletes the entry from the local forwarding only when
   there are no remote synced entries.
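
   The ageing rule above can be sketched as a small piece of per-host
   state. This is only an illustration under stated assumptions: the
   HostEntry class, its fields, and the action strings are hypothetical
   names chosen for the example and do not appear in this draft.

```python
class HostEntry:
    """Per-host state on a multihoming PE (hypothetical structure)."""

    def __init__(self, local=False, synced_from=None):
        self.local = local                          # learnt locally via ARP/ND
        self.synced_from = set(synced_from or ())   # ES peers whose route is synced

    def in_forwarding(self):
        # The entry stays in forwarding while any source still backs it.
        return self.local or bool(self.synced_from)


def age_out_local(entry):
    """Handle expiry of the locally learnt entry; return the actions taken."""
    actions = []
    if entry.local:
        entry.local = False
        actions.append("withdraw MAC-IP route from BGP")
    if not entry.in_forwarding():
        # No remote synced entry remains, so forwarding state can go too.
        actions.append("delete from local forwarding")
    return actions
```

   For example, if H1 ages out on PE1 while a synced entry from PE2 is
   still present, only the BGP withdrawal is performed and the
   forwarding entry is kept.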

4  Determining Reach-ability to Unicast IP Addresses

4.1  Local Learning

   The procedures for local learning do not change from [RFC7432].

4.2  Remote Learning

   The procedures for remote learning do not change from [RFC7432].

4.2.1  Constructing MAC/IP Address Advertisement

   The procedures for constructing the MAC/IP Address Advertisement do
   not change from RFC 7432.

4.2.2  Route Resolution

   If the ESI field is set to the reserved values of 0 or MAX-ESI, the
   IP route resolution MUST be based on the MAC-IP route alone.

   If the ESI field is set to a non-reserved ESI, the IP route
   resolution MUST happen only when both the MAC-IP route and the
   associated set of Ethernet A-D per ES routes have been received. To
   illustrate this with an example, consider a pair of multi-homed TORs
   PE1 and PE2 connected to an Ethernet Segment ES1 in All-Active
   redundancy mode. A given host with IP address H1 is learnt by PE1 but
   not by PE2. When the MAC-IP advertisement route from PE1 and a set of
   EAD/ES and Layer 3 EAD/EVI routes from PE1 and PE2 are received, PE3
   can forward traffic destined to H1 to both PE1 and PE2. Call this
   state (1).

   If after (1) PE1 withdraws EAD/ES, then PE3 will forward the said
   traffic to PE2 only.

   If after (1) PE2 withdraws EAD/ES, then PE3 will forward the said
   traffic to PE1 only.

   If after (1) PE1 withdraws the MAC-IP route, then PE3 will do delayed
   deletion of H1, as described in Section 3.3.

   If after (1) PE2 advertises the MAC-IP route, but PE1 withdraws it,
   PE3 will continue forwarding to both PE1 and PE2 as long as it has
   the EAD/ES and the Layer 3 EAD/EVI route from both.
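
   The resolution rules and the example above can be replayed with the
   short Python sketch below. It is a simplified model offered for
   illustration: the function name resolve and its arguments are
   hypothetical, the ESI is represented as an integer, and the delayed
   deletion of Section 3.3 is not modeled (the host simply resolves to
   no next hop once no MAC-IP route remains).

```python
# ESI 0 and MAX-ESI (ten octets of 0xFF) are the reserved values.
RESERVED_ESIS = {0, 2**80 - 1}


def resolve(esi, mac_ip_from, ead_es_from, l3_ead_evi_from):
    """Return the sorted next hops a remote PE uses for one host IP.

    Each *_from argument is the set of PEs from which that route type
    is currently held.
    """
    if not mac_ip_from:
        return []  # no MAC-IP route left for this host
    if esi in RESERVED_ESIS:
        # Reserved ESI: resolve on the MAC-IP route alone.
        return sorted(mac_ip_from)
    # Non-reserved ESI: forward to PEs with both EAD/ES and L3 EAD/EVI.
    return sorted(set(ead_es_from) & set(l3_ead_evi_from))


both = {"PE1", "PE2"}
print(resolve(1, {"PE1"}, both, both))      # state (1)
print(resolve(1, {"PE1"}, {"PE2"}, both))   # PE1 withdrew EAD/ES
print(resolve(1, {"PE2"}, both, both))      # MAC-IP now only from PE2
```

   The three calls correspond to state (1), the withdrawal of EAD/ES by
   PE1, and the case where the MAC-IP route is held from PE2 only.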

5  Forwarding Unicast Packets

   Please refer to Section 5 of
   draft-ietf-bess-evpn-inter-subnet-forwarding-01.

6  Load Balancing of Unicast Packets

   The procedures for load balancing of Unicast Packets do not change
   from [RFC7432].

7  Security Considerations

   The mechanisms in this document use the EVPN control plane as defined
   in [RFC7432]. Security considerations described in [RFC7432] are
   equally applicable.

   This document uses MPLS and IP-based tunnel technologies to support
   data plane transport. Security considerations described in [RFC7432]
   and in [ietf-evpn-overlay] are equally applicable.

8  IANA Considerations

9  References

9.1  Normative References

   [KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC1776]  Crocker, S., "The Address is the Message", RFC 1776,
              April 1 1995.

   [TRUTHS]   Callon, R., "The Twelve Networking Truths", RFC 1925,
              April 1 1996.

9.2  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Suresh Pasupula
   Cisco
   Email: spasupula@cisco.com

   Gaurav Badoni
   Cisco
   Email: gbadoni@cisco.com

   Priyanka Warade
   Cisco
   Email: pwarade@cisco.com