idnits 2.17.1

draft-mohanty-bess-ebgp-dmz-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See
     Section 2.2 of https://www.ietf.org/id-info/checklist for how to
     handle the case when there are no actions for IANA.)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does
     not match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have
     RFC 2119 boilerplate text.

  -- The document date (July 13, 2020) is 1376 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Outdated reference: A later version (-07) exists of
     draft-ietf-idr-link-bandwidth-06

  -- Obsolete informational reference (is this intentional?): RFC 2547
     (Obsoleted by RFC 4364)

     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information
     about the items above.

--------------------------------------------------------------------------------

BESS WorkGroup                                                S. Mohanty
Internet-Draft                                             Cisco Systems
Intended status: Informational                                 A. Vayner
Expires: January 14, 2021                                        Nutanix
                                                              A. Gattani
                                                                 A. Kini
                                                         Arista Networks
                                                           July 13, 2020

            Cumulative DMZ Link Bandwidth and load-balancing
                     draft-mohanty-bess-ebgp-dmz-01

Abstract

   The DMZ Link Bandwidth draft provides a way to load-balance traffic
   to a destination (in a different AS than the source) that is
   reachable via more than one path.  Typically, the link bandwidth
   (either configured on the link of the EBGP egress interface or set
   via a policy) is encoded in an extended community and sent to the
   IBGP peer, which employs multi-path.  The link-bandwidth value is
   then extracted from the path's extended community and used as a
   weight in the FIB, which does the load-balancing.  This draft
   extends the usage of the DMZ link bandwidth to another setting,
   where the ingress BGP speaker requires knowledge of the cumulative
   bandwidth while doing the load-balancing.  The draft also proposes
   neighbor-level knobs that allow the link bandwidth extended
   community to be regenerated and then advertised to EBGP peers,
   overriding the default behavior of not advertising optional
   non-transitive attributes to EBGP peers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on January 14, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Description
   4.  Large Scale Data Centers Use Case
   5.  Non-Conforming BGP Topologies
   6.  Protocol Considerations
   7.  Operational Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. References
       10.1.  Normative References
       10.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Demilitarized Zone (DMZ) Link Bandwidth (LB) extended community,
   along with the multi-path feature, can be used to provide unequal-
   cost load-balancing under user control.  In
   [I-D.ietf-idr-link-bandwidth] the EBGP egress link bandwidth is
   encoded in the link bandwidth extended community and sent along with
   the BGP update to the IBGP peer.  It is assumed that either a
   labeled path exists to each of the EBGP links or, alternatively,
   that the IGP cost to each link is the same.  When the same
   prefix/net is advertised into the receiving AS via different egress
   points or next-hops, the receiving IBGP peer that employs multi-path
   will use the value of the DMZ LB to load-balance traffic to the
   egress BGP speakers (ASBRs) in proportion to the link bandwidths.

   The link bandwidth extended community cannot be advertised to EBGP
   peers, as it is defined to be optional non-transitive.  This draft
   discusses a new use case in which the link bandwidth needs to be
   advertised to EBGP peers.  The new use case requires that the router
   calculate the aggregate link bandwidth, regenerate the DMZ link
   bandwidth extended community, and advertise it to EBGP peers.  The
   new use case also negates the [I-D.ietf-idr-link-bandwidth]
   restriction that the DMZ link bandwidth extended community not be
   sent when the advertising router sets the next-hop to itself.
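   For concreteness, the following minimal Python sketch shows the
   on-the-wire layout of the link bandwidth extended community as
   described in [I-D.ietf-idr-link-bandwidth]: type 0x40 (optional
   non-transitive), sub-type 0x04, a 2-octet AS number in the Global
   Administrator field, and the bandwidth as a 4-octet IEEE floating
   point value in bytes per second.  The helper name is illustrative
   and not taken from any implementation.

      import struct

      def encode_dmz_link_bandwidth(asn: int, bw_bytes_per_sec: float) -> bytes:
          """Build the 8-octet Link Bandwidth extended community:
          type 0x40 (optional non-transitive), sub-type 0x04, a
          2-octet AS, then the bandwidth as a 4-octet IEEE
          single-precision float in bytes per second."""
          return struct.pack("!BBHf", 0x40, 0x04, asn, bw_bytes_per_sec)

      # Example: AS 65001 advertising a 10 Gbps link (1.25e9 bytes/sec).
      community = encode_dmz_link_bandwidth(65001, 1.25e9)
      assert len(community) == 8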
   In [I-D.ietf-idr-link-bandwidth], the DMZ link bandwidth advertised
   by the EBGP egress speaker to the IBGP speaker represents the link
   bandwidth of the EBGP link.  However, sometimes there is a need to
   aggregate the link bandwidth of all the paths that advertise a given
   net and then send it to an upstream neighbor.  This is represented
   pictorially in Figure 1.  The aggregate link bandwidth is used by
   the upstream router to do load-balancing, as it may in turn receive
   several such paths for the same net, each carrying an accumulated
   bandwidth.

      R1- -20 - - |
                  R3- -100 - -|
      R2- -10 - - |           |
                              |
      R6- -40 - - |           |- - R4
                  |           |
                  R5- -100 - -|
      R7- -30 - - |

            EBGP Network with cumulative DMZ requirement

                               Figure 1

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Problem Description

   Figure 1 above represents an all-EBGP network.  Router R3 peers with
   two downstream EBGP routers, R1 and R2, and with an upstream EBGP
   router, R4.  Another router, R5, peers with two downstream routers,
   R6 and R7, and also peers with R4.  A net, p/m, is learnt by R1, R2,
   R6, and R7 from their downstream routers (not shown).  From the
   perspective of R4, the topology looks like a directed tree.  The
   link bandwidths of the EBGP links are shown alongside the links (the
   exact units are not important; for simplicity they can be taken as
   weights proportional to the operational link bandwidths).  It is
   assumed that R3, R4, and R5 have multi-path configured and that
   paths with differing AS_PATH attributes can still be considered for
   multi-path (knobs exist in many implementations for this).  When the
   ingress router, R4, sends traffic to the destination p/m, the
   traffic needs to be spread among the links in the ratio of their
   link bandwidths.  Today this is not possible, as there is no way to
   signal the link bandwidth extended community over the EBGP session
   from R3 to R4.  In the absence of a mechanism to regenerate the link
   bandwidth over the EBGP sessions from R3 to R4 and from R5 to R4,
   the assumed link bandwidth for paths received over those sessions
   would be equal to the operational link bandwidth of the
   corresponding EBGP links.

   As per EBGP rules, the advertising router sets the next-hop to
   itself.  Accordingly, R3 computes the best path from the
   advertisements received from R1 and R2, and R5 computes the best
   path from the advertisements received from R6 and R7.  R4 receives
   the updates from R3 and R5, in turn computes its best path, and may
   advertise it further upstream (not shown).  The expected behavior is
   that when R4 sends traffic for p/m towards R3 and R5, and then on to
   R1, R2, R6, and R7, the traffic is load-balanced based on the
   calculated weights at the routers that employ multi-path.  R4 should
   send 30% of the traffic to R3 and the remaining 70% to R5.  R3 in
   turn should send 67% of the traffic that it receives from R4 to R1
   and 33% to R2.  Similarly, R5 should send 57% of the traffic
   received from R4 to R6 and the remaining 43% to R7.  Instead, R4
   sends 50% of the traffic to each of R3 and R5.  R3 in turn sends
   more traffic than desired towards R1 and R2, while R5 sends less
   traffic than desired towards R6 and R7.  Effectively, the load
   balancing is skewed towards R1 and R2 even though their combined
   egress link bandwidth is less than that of R6 and R7.
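   The intended weights can be reproduced with a few lines of
   arithmetic.  The Python sketch below (illustrative only; the helper
   name is ours, not from any implementation) aggregates the downstream
   link bandwidths of Figure 1 and derives the per-neighbor split
   ratios quoted above.

      def split_ratios(downstream_bw):
          """Fraction of traffic a router sends to each multipath-
          eligible downstream peer, in proportion to the advertised
          link bandwidth."""
          total = sum(downstream_bw.values())
          return {peer: bw / total for peer, bw in downstream_bw.items()}

      # Figure 1: R3 aggregates R1 (20) and R2 (10); R5 aggregates
      # R6 (40) and R7 (30).  R4 balances on the regenerated sums.
      r3 = split_ratios({"R1": 20, "R2": 10})            # R1: 0.67, R2: 0.33
      r5 = split_ratios({"R6": 40, "R7": 30})            # R6: 0.57, R7: 0.43
      r4 = split_ratios({"R3": 20 + 10, "R5": 40 + 30})  # R3: 0.30, R5: 0.70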
      R1- -20 - - |
                  R3- -30 (100) - -|
      R2- -10 - - |                |
                                   |
      R6- -40 - - |                |- - R4
                  |                |
                  R5- -70 (100) - -|
      R7- -30 - - |

     EBGP Network showing advertisement of cumulative link bandwidth

                               Figure 2

   With the existing rules for the DMZ link bandwidth, this is not
   possible.  First, the LB extended community is not sent over EBGP.
   Second, the DMZ feature has no notion of conveying the cumulative
   link bandwidth (of the directed tree rooted at a node) to an
   upstream router.  To enable the use case described above, the
   cumulative link bandwidth of R1 and R2 has to be advertised by R3 to
   R4, and, similarly, the cumulative bandwidth of R6 and R7 has to be
   advertised by R5 to R4.  This enables R4 to load-balance in
   proportion to the cumulative link bandwidths it receives from its
   downstream routers R3 and R5.  Figure 2 shows the cumulative link
   bandwidths advertised by R3 and R5 towards R4, with the original
   link bandwidth values in '()' for comparison.

   To address cases like the above example, rather than introducing a
   new attribute for aggregate link bandwidth, we reuse the link
   bandwidth extended community attribute and relax a few assumptions.
   With neighbor-specific knobs, or policy configuration applied to the
   neighbor outbound or inbound as the case may be, we can regenerate
   and advertise and/or accept the link bandwidth extended community
   over the EBGP link.  In addition, we can define neighbor-specific
   knobs that aggregate the link bandwidth values from the LB extended
   communities learnt from the downstream routers (whether received as
   a link bandwidth extended community in the path update, assigned at
   ingress using a neighbor inbound policy configuration, or derived
   from the operational link speed of the peer link) and then
   regenerate and advertise (via a neighbor outbound policy knob) this
   aggregate link bandwidth value, in the form of the LB extended
   community, to the upstream EBGP router.  Since the advertisement is
   made to EBGP neighbors, the next-hop is reset at the advertising
   router.

   In terms of the overall traffic profile: if traffic for net p/m
   arrives at R4 at a data rate of 'x', then in the absence of link
   bandwidth regeneration at R3 and R5 the resulting traffic profile
   is:

      link     ratio                 percent approximation (~)
      -----    ------------------    --------------------------
      R4-R3    1/2 x                 50%
      R4-R5    1/2 x                 50%
      R3-R1    1/3 x   (1/2 * 2/3)   33%
      R3-R2    1/6 x   (1/2 * 1/3)   17%
      R5-R6    2/7 x   (1/2 * 4/7)   29%
      R5-R7    3/14 x  (1/2 * 3/7)   21%

   For comparison, the resulting traffic profile in the presence of
   cumulative link bandwidth regeneration at R3 and R5 is:

      link     ratio                 percent approximation (~)
      -----    ------------------    --------------------------
      R4-R3    3/10 x                30%
      R4-R5    7/10 x                70%
      R3-R1    1/5 x   (3/10 * 2/3)  20%
      R3-R2    1/10 x  (3/10 * 1/3)  10%
      R5-R6    2/5 x   (7/10 * 4/7)  40%
      R5-R7    3/10 x  (7/10 * 3/7)  30%

   As is evident, the second table is closer to the desired traffic
   profile that should be received by the leaf nodes (R1, R2, R6, R7)
   than the first one.
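   The two tables above can be checked mechanically.  The following
   self-contained sketch (again illustrative) multiplies the per-hop
   fractions to obtain the end-to-end share of R4's ingress rate 'x'
   carried on each link.

      def end_to_end(r4_split, r3_split, r5_split):
          """Share of R4's ingress rate 'x' carried on each link."""
          return {
              "R4-R3": r4_split["R3"],
              "R4-R5": r4_split["R5"],
              "R3-R1": r4_split["R3"] * r3_split["R1"],
              "R3-R2": r4_split["R3"] * r3_split["R2"],
              "R5-R6": r4_split["R5"] * r5_split["R6"],
              "R5-R7": r4_split["R5"] * r5_split["R7"],
          }

      # Per-hop splits from Figure 1: R3 splits 20:10, R5 splits 40:30.
      r3 = {"R1": 2 / 3, "R2": 1 / 3}
      r5 = {"R6": 4 / 7, "R7": 3 / 7}

      # First table: no regeneration, so R4 splits 50/50.
      no_regen = end_to_end({"R3": 1 / 2, "R5": 1 / 2}, r3, r5)
      # R3-R1: 0.333, R3-R2: 0.167, R5-R6: 0.286, R5-R7: 0.214

      # Second table: cumulative regeneration, so R4 splits 30/70.
      regen = end_to_end({"R3": 3 / 10, "R5": 7 / 10}, r3, r5)
      # R3-R1: 0.200, R3-R2: 0.100, R5-R6: 0.400, R5-R7: 0.300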
4.  Large Scale Data Centers Use Case

   "Use of BGP for Routing in Large-Scale Data Centers" [RFC7938]
   describes a way to design large scale data centers using EBGP across
   the different routing layers.  Section 6.3 of [RFC7938] ("Weighted
   ECMP") describes a use case in which a service (most likely
   represented by an anycast virtual IP) is served by an unequal set of
   resources across the data center regions.  Figure 3 shows a typical
   data center topology, as described in Section 3.1 of [RFC7938],
   where an unequal number of servers is deployed advertising a certain
   BGP prefix.  As can be seen in the figure, the left side of the data
   center hosts only 3 servers while the right side hosts 10 servers.

                  +------+  +------+
                  |      |  |      |
                  | AS1  |  | AS1  |               Tier 1
                  |      |  |      |
                  +------+  +------+
                    |  |      |  |
          +---------+  |      |  +----------+
          | +-------+--+------+--+-------+  |
          | |       |  |      |  |       |  |
        +----+     +----+    +----+     +----+
        |    |     |    |    |    |     |    |
        |AS2 |     |AS2 |    |AS3 |     |AS3 |     Tier 2
        |    |     |    |    |    |     |    |
        +----+     +----+    +----+     +----+
          |          |         |          |
          |          |         |          |
          | +-----+  |         | +-----+  |
          +-| AS4 |--+         +-| AS5 |--+        Tier 3
            +-----+              +-----+
            | | |                | | |
         <- 3 Servers ->     <- 10 Servers ->

               Typical Data Center Topology (RFC7938)

                               Figure 3

   In a regular ECMP environment, the Tier 1 layer would see an ECMP
   path load-sharing equally across all 4 Tier 2 paths.  This could
   overload the servers on the left side of the data center while
   leaving the servers on the right underutilized.  Using link
   bandwidth advertisements, the servers could attach a link bandwidth
   extended community to the advertised service prefix.  Another option
   is to add the extended community on the Tier 3 network devices as
   the routes are received from the servers, or generated locally on
   the network devices.  If the link bandwidth value advertised for the
   service represents the server capacity for that service, each data
   center tier would aggregate the values when sending the update to
   the next tier up.  The result is a set of weighted load-sharing
   metrics at each tier, allowing the network to distribute the flow
   load among the different servers in the most optimal way.  If a
   server is added to or removed from the service prefix, it would add
   or remove its link bandwidth value and the network would adjust
   accordingly.

   Figure 4 shows the more popular spine-leaf architecture, similar to
   Section 3.2 of [RFC7938].  Tor1, Tor2, and Tor3 are in the same
   tier, i.e., the leaf tier (the representation shown in Figure 4 is
   the unfolded Clos).  Using the same example as above, it is clear
   that the LB extended community values received by each of Spine1 and
   Spine2 from Tor1 and Tor2 are in the ratio 3 to 10.  The spines will
   then aggregate the bandwidth, and regenerate and advertise the LB
   extended community to Tor3.  Tor3 will do equal-cost sharing towards
   both spines, which in turn will split the traffic in the ratio 3 to
   10 when forwarding it to Tor1 and Tor2 respectively.

                     +------+
                     | Tor3 |              Tier 1
                     +------+
                        |
              +- - - - -+- - - - +
              |                  |
          +------+            +------+
          |Spine1|            |Spine2|
          +------+            +------+
              |  \            /  |
              |   \          /   |
              |    \        /    |
              |     \      /     |
              |      \    /      |
              |       \  /       |
              |        \/        |
              |        /\        |
              |       /  \       |
          +------+   /    \   +------+
          | Tor1 |--+      +--| Tor2 |     Tier 1
          +------+            +------+
           | | |               | | |
        <- 3 Servers ->    <- 10 Servers ->

                Two-tier Clos Data Center Topology

                               Figure 4
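   A small sketch of the tier-by-tier arithmetic (names are ours; one
   unit of link bandwidth per server is an assumption for illustration)
   shows why Tor3 can use plain equal-cost sharing while the spines
   keep the 3:10 split:

      # Advertised capacities, assuming one bandwidth unit per server.
      tor_capacity = {"Tor1": 3, "Tor2": 10}

      # Each spine aggregates both ToR advertisements and regenerates
      # the sum, so Tor3 receives 13 from Spine1 and 13 from Spine2
      # and load-shares equally between them.
      spine_regenerated = {s: sum(tor_capacity.values())
                           for s in ("Spine1", "Spine2")}

      # Each spine splits traffic received from Tor3 in the 3:10 ratio.
      spine_split = {tor: bw / sum(tor_capacity.values())
                     for tor, bw in tor_capacity.items()}
      # {'Tor1': 0.23..., 'Tor2': 0.77...}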
5.  Non-Conforming BGP Topologies

   This use case does not readily apply to all topologies.  Figure 5
   shows an all-EBGP topology: R1, R2, R3, R4, R5, and R6 are in AS1,
   AS2, AS3, AS4, AS5, and AS6 respectively.  A net, p/m, is advertised
   from a server S1, with LB extended community value 10, to R1 and R5.
   R1 advertises p/m to R2 and R3 and regenerates the LB extended
   community with value 10.  R4 receives the advertisements from R2,
   R3, and R5 and computes the aggregate bandwidth to be 30.  R4
   advertises p/m to R6 with LB extended community value 30.  The link
   bandwidths are as shown in the figure.

   As can be seen in the example, R4 computes the cumulative bandwidth
   of the LB values that it receives from R2, R3, and R5, which is 30.
   When R4 receives traffic from R6, it load-balances it across R2, R3,
   and R5.  As a result, R1 receives twice the volume of traffic that
   R5 does.  This is not desirable, because the bandwidth from S1 to R1
   and the bandwidth from S1 to R5 are the same, i.e., 10.  The
   discrepancy arises because, when R4 aggregated the link bandwidth
   values from the received advertisements, the contribution from R1
   was factored in twice.

                |- - R2 - 10 --|
                |              |
                |              |
      S1- - 10- R1             R4- - - --30 - -R6
       |        |              |
       |        |              |
      10        |- - -R3- 10 - -|
       |                       |
       |- - - R5 - - -- - -- - - -|

           A non-conforming topology for the Cumulative DMZ

                               Figure 5

   One way to make the topology in the figure above conform would be to
   regenerate a normalized value of the aggregate link bandwidth when
   it is advertised over more than one EBGP peer link.  Such
   normalization can be achieved through outbound policy applied on top
   of the aggregate link bandwidth value.  A couple of options in this
   context (illustrated in the sketch at the end of this section) are:

   a.  divide the aggregate link bandwidth equally across the EBGP
       peers; or

   b.  divide the aggregate link bandwidth across the EBGP peers in the
       ratio of the operational link capacities of the EBGP peer links.

   These and similar options for regenerating the link bandwidth to
   cater to load-balancing requirements in such topologies are outside
   the scope of this document; they can be implemented as additional
   outbound policy enhancements on top of a computed aggregate link
   bandwidth.
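   A minimal sketch of the two normalization options, under the
   assumption (ours, not this document's) that outbound policy can
   rewrite the regenerated value per peer:

      def normalize_equal(aggregate_bw, peers):
          """Option (a): split the aggregate equally across the
          eBGP peers."""
          return {p: aggregate_bw / len(peers) for p in peers}

      def normalize_by_capacity(aggregate_bw, peer_link_capacity):
          """Option (b): split the aggregate in the ratio of the
          operational capacities of the eBGP peer links."""
          total = sum(peer_link_capacity.values())
          return {p: aggregate_bw * c / total
                  for p, c in peer_link_capacity.items()}

      # Figure 5: with option (a), R1 advertises 5 (instead of 10) to
      # each of R2 and R3, so R4's aggregate becomes 5 + 5 + 10 = 20
      # and R1 and R5 each attract half of the traffic, as desired.
      print(normalize_equal(10, ["R2", "R3"]))   # {'R2': 5.0, 'R3': 5.0}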
6.  Protocol Considerations

   [I-D.ietf-idr-link-bandwidth] needs to be refreshed.  No protocol
   changes are necessary if the knobs are implemented as recommended.
   An alternative way to achieve the same purpose might be a more
   complicated policy framework, but that is only a conjecture.

7.  Operational Considerations

   Note that these solutions are also applicable to other address
   families, such as L3VPN [RFC2547], IPv4 labeled unicast [RFC8277],
   and EVPN [RFC7432].

   In topologies and implementations where there is an option to
   advertise all multipath-eligible (equal-cost) paths to EBGP peers
   (i.e., the 'ecmp' form of additional-path advertisement is enabled),
   aggregate link bandwidth advertisement may not be required or may be
   redundant: the receiving BGP speaker receives the link bandwidth
   extended community values with all eligible paths, so the aggregate
   link bandwidth is effectively received by the downstream EBGP
   speaker and can be used in the local computation to affect the
   forwarding behaviour.  This assumes the additional paths are
   advertised with next-hop self.

8.  Security Considerations

   This document raises no new security issues.

9.  Acknowledgements

   Viral Patel did substantial work on an implementation along with the
   first author.  The authors would like to thank Acee Lindem and Jakob
   Heitz for their help in reviewing the draft and for their valuable
   suggestions.  The authors would also like to thank Shyam Sethuram,
   Sameer Gulrajani, Nitin Kumar, Keyur Patel, and Juan Alcaide for
   discussions related to the draft.

10.  References

10.1.  Normative References

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

10.2.  Informative References

   [RFC2547]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547,
              DOI 10.17487/RFC2547, March 1999,
              <https://www.rfc-editor.org/info/rfc2547>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: satyamoh@cisco.com

   Arie Vayner
   Nutanix
   1740 Technology Drive
   San Jose, CA  95110
   USA

   Email: ariev@vayner.net

   Akshay Gattani
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA  95054
   USA

   Email: akshay@arista.com

   Ajay Kini
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA  95054
   USA

   Email: ajkini@arista.com