INTERNET-DRAFT                                          N. Malhotra, Ed.
                                                               S. Thoria
                                                              A. Sajassi
Intended Status: Proposed Standard                               (Cisco)
                                                              A. Lingala
                                                                  (AT&T)

Expires: May 3, 2018                                    October 30, 2017


    Weighted Multi-Path Procedures for EVPN All-Active Multi-Homing
                 draft-malhotra-bess-evpn-unequal-lb-00

Abstract

   In an EVPN-IRB based network overlay, EVPN LAG enables all-active
   multi-homing for a host or CE device connected to two or more PEs via
   a LAG bundle, such that bridged and routed traffic from remote PEs
   can be equally load balanced (ECMPed) across the multi-homing PEs.
   This document defines extensions to EVPN procedures to optimally
   handle unequal access bandwidth distribution across a set of
   multi-homing PEs in order to:

   o  provide greater flexibility, with respect to adding or
      removing individual PE-CE links within the access LAG

   o  handle PE-CE LAG member link failures that can result in unequal
      PE-CE access bandwidth across a set of multi-homing PEs

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided
   without warranty as described in the Simplified BSD License.

Table of Contents

   1  Introduction
     1.1  PE CE Link Provisioning
     1.2  PE CE Link Failures
     1.3  Design Requirement
     1.4  Terminology
   2. Solution Overview
   3. Weighted Unicast Traffic Load-balancing
     3.1  LOCAL PE Behavior
     3.2  REMOTE PE Behavior
   4. Weighted BUM Traffic Load-Sharing
   5. Routed EVPN Overlay
   6. EVPN-IRB Multi-homing with non-EVPN routing
   7. References
     7.1  Normative References
     7.2  Informative References
   8. Acknowledgements
   Authors' Addresses

1  Introduction

   In an EVPN-IRB based network overlay, with an access CE multi-homed
   via a LAG interface, bridged and routed traffic from remote PEs can
   be equally load balanced (ECMPed) across the multi-homing PEs:

   o  ECMP load-balancing for bridged unicast traffic is enabled via
      aliasing and mass-withdraw procedures detailed in RFC 7432.

   o  ECMP load-balancing for routed unicast traffic is enabled via
      existing L3 ECMP mechanisms.

   o  Load-sharing of bridged BUM traffic on local ports is enabled
      via the EVPN DF election procedure detailed in RFC 7432.

   All of the above load-balancing and DF election procedures implicitly
   assume equal bandwidth distribution between the CE and the set of
   multi-homing PEs. Essentially, with this assumption of equal "access"
   bandwidth distribution across all PEs, ALL remote traffic is equally
   load balanced across the multi-homing PEs. This assumption of equal
   access bandwidth distribution can be restrictive with respect to
   adding / removing links in a multi-homed LAG interface and may also
   be easily broken on individual link failures. A solution to handle
   unequal access bandwidth distribution across a set of multi-homing
   EVPN PEs is proposed in this document. The primary motivation behind
   this proposal is to enable greater flexibility with respect to adding
   / removing member PE-CE links as needed, and to optimally handle
   PE-CE link failures.

1.1  PE CE Link Provisioning

                   +------------------------+
                   | Underlay Network Fabric|
                   +------------------------+

                      +-----+       +-----+
                      | PE1 |       | PE2 |
                      +-----+       +-----+
                           \         /
                            \ ESI-1 /
                             \     /
                             +\---/+
                             | \ / |
                             +--+--+
                                |
                               CE1

                             Figure 1

   Consider a CE1 that is dual-homed to PE1 and PE2 via EVPN-LAG with
   single member links of equal bandwidth to each PE (i.e., equal access
   bandwidth distribution across PE1 and PE2). If the provider wants to
   increase link bandwidth to CE1, it MUST add a link to both PE1 and
   PE2 in order to maintain equal access bandwidth distribution and
   inter-work with EVPN ECMP load-balancing. In other words, for a
   dual-homed CE, the total number of CE links must be provisioned in
   multiples of two (2, 4, 6, and so on). For a triple-homed CE, the
   number of CE links must be provisioned in multiples of three (3, 6,
   9, and so on). To generalize, for a CE that is multi-homed to "n"
   PEs, the number of PE-CE physical links provisioned must be an
   integral multiple of "n". This is restrictive in the case of
   dual-homing and very quickly becomes prohibitive as the number of
   multi-homing PEs grows.

   Instead, a provider may wish to increase PE-CE bandwidth OR number of
   links in ANY link increments. As an example, for CE1 dual-homed to
   PE1 and PE2 in all-active mode, the provider may wish to add a third
   link to ONLY PE1 to increase total bandwidth for this CE by 50%,
   rather than being required to increase access bandwidth by 100% by
   adding a link to each of the two PEs. While existing EVPN based
   all-active load-balancing procedures do not necessarily preclude such
   asymmetric access bandwidth distribution among the PEs providing
   redundancy, it may result in unexpected traffic loss due to
   congestion in the access interface towards CE.
   This traffic loss is due to the fact that PE1 and PE2 will continue
   to attract an equal amount of CE1 destined traffic from remote PEs,
   even when PE2 only has half the bandwidth to CE1 as PE1. This may
   lead to congestion and traffic loss on the PE2-CE1 link. If bandwidth
   distribution to CE1 across PE1 and PE2 is 2:1, traffic from remote
   hosts MUST also be load-balanced across PE1 and PE2 in a 2:1 manner.

1.2  PE CE Link Failures

   More importantly, the unequal PE-CE bandwidth distribution described
   above may occur during regular operation following a link failure,
   even when PE-CE links were provisioned to provide equal bandwidth
   distribution across multi-homing PEs.

                    +------------------------+
                    | Underlay Network Fabric|
                    +------------------------+

                      +-----+         +-----+
                      | PE1 |         | PE2 |
                      +-----+         +-----+
                           \\         //
                            \\ ESI-1 //
                             \\     /X
                             +\\---//+
                             | \\ // |
                             +---+---+
                                 |
                                CE1

   Consider a CE1 that is multi-homed to PE1 and PE2 via a link bundle
   with two member links to each PE. On a PE2-CE1 physical link failure,
   the link bundle represented by ESI-1 on PE2 stays up; however, its
   bandwidth is cut in half. With the existing ECMP procedures, both PE1
   and PE2 will continue to attract an equal amount of traffic from
   remote PEs, even when PE1 has double the bandwidth to CE1. If
   bandwidth distribution to CE1 across PE1 and PE2 is 2:1, traffic from
   remote hosts MUST also be load-balanced across PE1 and PE2 in a 2:1
   manner to avoid unexpected congestion and traffic loss on PE2-CE1
   links within the LAG.

1.3  Design Requirement

                     +-----------------------+
                     |Underlay Network Fabric|
                     +-----------------------+

          +-----+ +-----+             +-----+   +-----+
          | PE1 | | PE2 |    .....    | PEx |   | PEn |
          +-----+ +-----+             +-----+   +-----+
               \       \                 //        //
                \ L1    \ L2            // Lx     // Ln
                 \       \             //        //
                +-\-------\-----------//--------//-+
                |  \       \ ESI-1   //        //  |
                +----------------------------------+
                                 |
                                 CE

   To generalize, if total link bandwidth to a CE is distributed across
   "n" multi-homing PEs, with Lx being the number of links / bandwidth
   to PEx, traffic from remote PEs to this CE MUST be load-balanced
   unequally across [PE1, PE2, ....., PEn] such that the proportion of
   unicast and BUM flows destined for the CE that are serviced by PEx
   is:

      Lx / [L1+L2+.....+Ln]

   The solution proposed below includes extensions to EVPN procedures to
   achieve the above.

1.4  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   "LOCAL PE" in the context of an ESI refers to a provider edge switch
   OR router that physically hosts the ESI.

   "REMOTE PE" in the context of an ESI refers to a provider edge switch
   OR router in an EVPN overlay, whose overlay reachability to the ESI
   is via the LOCAL PE.

2. Solution Overview

   In order to achieve weighted load balancing for overlay unicast
   traffic, the EVPN per-ESI EAD route (Route Type 1) is leveraged to
   signal the ESI bandwidth to remote PEs. Using the per-ESI EAD route
   to signal the ESI bandwidth provides a mechanism to react to changes
   in access bandwidth in a service and host independent manner. Remote
   PEs computing the MAC path-lists based on global and aliasing EAD
   routes now have the ability to compute weighted load-balancing based
   on the ESI access bandwidth received from each PE that the ESI is
   multi-homed to.
   If the per-ESI EAD route is also leveraged for IP path-list
   computation, as per [EVPN-IP-ALIASING], it would also provide a
   method to do weighted load-balancing for IP routed traffic.

   In order to achieve weighted load-balancing of overlay BUM traffic,
   the EVPN ES route (Route Type 4) is leveraged to signal the ESI
   bandwidth to PEs within an ESI's redundancy group to influence
   per-service DF election. PEs in an ESI redundancy group now have the
   ability to do per-service DF election in a manner that is
   proportionate to their relative ESI bandwidth.

   Procedures to accomplish this are described in greater detail next.

3. Weighted Unicast Traffic Load-balancing

3.1  LOCAL PE Behavior

   A PE that is part of an ESI's redundancy group would advertise an
   additional "link bandwidth" EXT-COMM attribute with the per-ESI EAD
   route (EVPN Route Type 1) that represents the total bandwidth of the
   PE's physical links in an ESI. The BGP link bandwidth EXT-COMM
   defined in [BGP-LINK-BW] would be re-used for this purpose.

3.2  REMOTE PE Behavior

   A receiving PE should use the per-ESI link bandwidth attribute
   received from each PE to compute a relative weight for each remote
   PE, per-ESI, as shown below.

   if,

      L(x,y) : link bandwidth advertised by PE-x for ESI-y

      W(x,y) : normalized weight assigned to PE-x for ESI-y

      H(y)   : Highest Common Factor (HCF) of [L(1,y), L(2,y), .....,
               L(n,y)]

   then, the normalized weight assigned to PE-x for ESI-y may be
   computed as follows:

      W(x,y) = L(x,y) / H(y)

   For a MAC+IP route (EVPN Route Type 2) received with ESI-y, the
   receiving PE MUST compute MAC and IP forwarding path-lists weighted
   by the above normalized weights.
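
   The weight computation above can be sketched in Python as follows.
   This is a minimal illustration under stated assumptions, not a
   normative part of the procedure: the function names
   normalized_weights and weighted_path_list, the dictionary keyed by
   PE name, and the sample bandwidth values are all hypothetical, and
   the sketch assumes the advertised link bandwidths have already been
   converted to positive integers in a common unit (for example, Mbps)
   so that an HCF is well defined.

```python
from functools import reduce
from math import gcd


def normalized_weights(link_bw):
    """Return W(x,y) = L(x,y) / H(y) for every PE of one ESI.

    link_bw maps a PE name to L(x,y), the link bandwidth that PE
    advertised for the ESI. Values are assumed (hypothetically) to be
    positive integers in a common unit such as Mbps.
    """
    if not link_bw:
        raise ValueError("no PEs advertise this ESI")
    if any(not isinstance(bw, int) or bw <= 0 for bw in link_bw.values()):
        raise ValueError("link bandwidths must be positive integers")
    # H(y): highest common factor of all advertised bandwidths.
    hcf = reduce(gcd, link_bw.values())
    return {pe: bw // hcf for pe, bw in link_bw.items()}


def weighted_path_list(link_bw):
    """Build a forwarding path-list in which each PE appears W(x,y) times."""
    weights = normalized_weights(link_bw)
    return [pe for pe, weight in weights.items() for _ in range(weight)]


if __name__ == "__main__":
    # Hypothetical ESI reachable via three PEs with unequal bandwidth.
    esi_bw = {"PE-1": 3000, "PE-2": 2000, "PE-3": 1000}
    print(normalized_weights(esi_bw))
    print(weighted_path_list(esi_bw))
```

   Repeating a PE in the path-list is only one possible way to realize
   the weights; an implementation could equally keep the weights
   alongside a list of unique next hops.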

   As an example, for a CE multi-homed to PE-1, PE-2, PE-3 via 2, 1, and
   1 GE physical links respectively, as part of a link bundle
   represented by ESI-10:

      L(1, 10) = 2000 Mbps

      L(2, 10) = 1000 Mbps

      L(3, 10) = 1000 Mbps

      H(10) = 1000

   Normalized weights assigned to each PE for ESI-10 are as follows:

      W(1, 10) = 2000 / 1000 = 2.

      W(2, 10) = 1000 / 1000 = 1.

      W(3, 10) = 1000 / 1000 = 1.

   For a remote MAC+IP host route received with ESI-10, the forwarding
   load-balancing path-list must now be computed as: [PE-1, PE-1, PE-2,
   PE-3] instead of [PE-1, PE-2, PE-3]. This now results in
   load-balancing of all traffic destined for ESI-10 across the three
   multi-homing PEs in proportion to the ESI-10 bandwidth at each PE.

   The above weighted path-list computation MUST only be done for an
   ESI IF a link bandwidth attribute is received from ALL of the PEs
   advertising reachability to that ESI via per-ESI EAD Route Type 1. In
   the event that the link bandwidth attribute is not received from one
   or more PEs, the forwarding path-list would be computed using regular
   ECMP semantics.

4. Weighted BUM Traffic Load-Sharing

   Load sharing of the per-service DF role, weighted by link bandwidth,
   is currently under discussion and needs to be reconciled with
   [EVPN-PREF-DF-ELECT]. This will be closed in the next revision of
   this draft.

5. Routed EVPN Overlay

   An additional use case is possible, such that traffic to an end host
   in the overlay is always IP routed.
   In a purely routed overlay such as this:

   o  A host MAC is never advertised in the EVPN overlay control plane

   o  Host /32 or /128 IP reachability is distributed across the
      overlay via EVPN route type 5 (RT-5) along with a zero or
      non-zero ESI

   o  An overlay IP subnet may still be stretched across the underlay
      fabric; however, intra-subnet traffic across the stretched
      overlay is never bridged

   o  Both inter-subnet and intra-subnet traffic in the overlay is
      IP routed at the EVPN GW.

   Please refer to [RFC7814] for more details.

   The weighted multi-path procedure described in this document may be
   used together with the procedures described in [EVPN-IP-ALIASING] for
   this use case. The per-ES EAD route advertised with Layer 3 VRF RTs
   would be used to signal the ES link bandwidth attribute instead of
   the per-ES EAD route with Layer 2 VRF RTs. All other procedures
   described earlier in this document would apply as is.

6. EVPN-IRB Multi-homing with non-EVPN routing

   EVPN-LAG based multi-homing on an IRB gateway may also be deployed
   together with non-EVPN routing, such as global routing or an L3VPN
   routing control plane. The key property that differentiates this set
   of use cases from the EVPN IRB use cases discussed earlier is that
   the EVPN control plane is used only to enable LAG interface based
   multi-homing and NOT as an overlay VPN control plane. The EVPN
   control plane in this case enables:

   o  DF election via EVPN RT-4 based procedures described in [RFC7432]

   o  LOCAL MAC sync across multi-homing PEs via EVPN RT-2

   o  LOCAL ARP and ND sync across multi-homing PEs via EVPN RT-2

   Applicability of the weighted ECMP procedures proposed in this
   document to this set of use cases is still under discussion and will
   be addressed in subsequent revisions.

7. References

7.1  Normative References

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [BGP-LINK-BW]  Mohapatra, P., Fernando, R., "BGP Link Bandwidth
              Extended Community", January 2013.

   [EVPN-IP-ALIASING]  Sajassi, A., Badoni, G., "L3 Aliasing and Mass
              Withdrawal Support for EVPN", July 2017.

   [EVPN-PREF-DF-ELECT]  Rabadan, J., et al., "Preference-based EVPN DF
              Election", June 2017.

7.2  Informative References

8. Acknowledgements

Authors' Addresses

   Neeraj Malhotra
   Cisco
   Email: nmalhotr@cisco.com

   Samir Thoria
   Cisco
   Email: sthoria@cisco.com

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Avinash Lingala
   AT&T
   Email: ar977m@att.com