BESS WorkGroup                                                S. Mohanty
Internet-Draft                                             Cisco Systems
Intended status: Informational                                 A. Vayner
Expires: September 16, 2021                                       Google
                                                              A. Gattani
                                                                 A. Kini
                                                         Arista Networks
                                                          March 15, 2021

           Cumulative DMZ Link Bandwidth and load-balancing
                     draft-mohanty-bess-ebgp-dmz-03

Abstract

   The DMZ Link Bandwidth draft provides a way to load-balance traffic
   to a destination that is in a different AS than the source and is
   reachable via more than one path.  Typically, the link bandwidth
   (either configured on the EBGP egress interface or set via policy)
   is encoded in an extended community and sent to the IBGP peers that
   employ multi-path.  The link-bandwidth value is then extracted from
   the extended community of each path and used as a weight in the
   FIB, which performs the load-balancing.  This draft extends the use
   of the DMZ link bandwidth to a setting in which the ingress BGP
   speaker requires knowledge of the cumulative bandwidth while
   load-balancing.  The draft also proposes neighbor-level knobs that
   allow the link bandwidth extended community to be regenerated and
   then advertised to EBGP peers, overriding the default behavior of
   not advertising optional non-transitive attributes to EBGP peers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.
   It is inappropriate to use Internet-Drafts as reference material or
   to cite them other than as "work in progress."

   This Internet-Draft will expire on September 16, 2021.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Description
   4.  Large Scale Data Centers Use Case
   5.  Non-Conforming BGP Topologies
   6.  Protocol Considerations
   7.  Operational Considerations
   8.  Security Considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Demilitarized Zone (DMZ) Link Bandwidth (LB) extended community,
   together with the multi-path feature, can be used to provide
   unequal-cost load-balancing under user control.  In
   [I-D.ietf-idr-link-bandwidth], the EBGP egress link bandwidth is
   encoded in the link bandwidth extended community and sent along with
   the BGP update to the IBGP peer.  It is assumed that either a
   labeled path exists to each of the EBGP links or, alternatively,
   that the IGP cost to each link is the same.  When the same
   prefix/net is advertised into the receiving AS via different egress
   points or next-hops, a receiving IBGP peer that employs multi-path
   will use the DMZ LB values to load-balance traffic to the egress
   BGP speakers (ASBRs) in proportion to the link bandwidths.

   The link bandwidth extended community cannot be advertised to EBGP
   peers, as it is defined to be optional non-transitive.  This draft
   discusses a new use case in which the link bandwidth needs to be
   advertised to EBGP peers.  The new use case requires that the
   router calculate the aggregate link bandwidth, regenerate the DMZ
   link bandwidth extended community, and advertise it to EBGP peers.
   It also lifts the restriction in [I-D.ietf-idr-link-bandwidth] that
   the DMZ link bandwidth extended community not be sent when the
   advertising router sets the next-hop to itself.
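   For concreteness, the following Python sketch (illustrative only,
   and not itself part of any protocol specification) encodes and
   decodes the LB extended community as described in
   [I-D.ietf-idr-link-bandwidth]: a two-octet-AS-specific extended
   community of type 0x40 (optional non-transitive) with sub-type
   0x04, whose value field carries the AS number in two octets
   followed by the bandwidth, in bytes per second, as a four-octet
   IEEE floating-point number.

      import struct

      LB_TYPE = 0x40     # two-octet-AS-specific, optional non-transitive
      LB_SUBTYPE = 0x04  # link bandwidth sub-type

      def encode_lb(asn, bw_bytes_per_sec):
          # 8 octets: type, sub-type, 2-octet AS, 4-octet IEEE float.
          return struct.pack("!BBHf", LB_TYPE, LB_SUBTYPE, asn,
                             bw_bytes_per_sec)

      def decode_lb(octets):
          t, s, asn, bw = struct.unpack("!BBHf", octets)
          if (t, s) != (LB_TYPE, LB_SUBTYPE):
              raise ValueError("not a link bandwidth extended community")
          return asn, bw

   For example, encode_lb(64500, 12.5e6) produces the community a
   speaker might attach for a 100 Mbps link (12.5 million bytes per
   second); AS 64500, from the documentation range, is used here
   purely for illustration.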
   In [I-D.ietf-idr-link-bandwidth], the DMZ link bandwidth advertised
   by the EBGP egress BGP speaker to an IBGP speaker represents the
   link bandwidth of the EBGP link.  However, there is sometimes a
   need to aggregate the link bandwidth of all the paths that
   advertise a given net and then send the aggregate to an upstream
   neighbor.  This is represented pictorially in Figure 1.  The
   aggregated link bandwidth is used by the upstream router for
   load-balancing, as that router may itself receive several such
   paths for the same net, each carrying an accumulated bandwidth.

      R1- -20 - - |
                  R3- -100 - -|
      R2- -10 - - |           |
                              |
      R6- -40 - - |           |- - R4
                  |           |
                  R5- -100 - -|
      R7- -30 - - |

             EBGP Network with cumulative DMZ requirement

                               Figure 1

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

3.  Problem Description

   Figure 1 above represents an all-EBGP network.  Router R3 peers
   with two downstream EBGP routers, R1 and R2, and with an upstream
   EBGP router, R4.  Another router, R5, peers with two downstream
   routers, R6 and R7, and also peers with R4.  A net, p/m, is learnt
   by R1, R2, R6, and R7 from their downstream routers (not shown).
   From the perspective of R4, the topology looks like a directed
   tree.  The link bandwidths of the EBGP links are shown alongside
   the links (the exact units are not important; for simplicity they
   can be taken as weights proportional to the operational link
   bandwidths).  It is assumed that R3, R4, and R5 have multi-path
   configured, and that paths with different AS-path attribute values
   can still be considered for multi-path (knobs exist in many
   implementations for this).  When the ingress router, R4, sends
   traffic to the destination p/m, the traffic needs to be spread
   across the links in the ratio of their link bandwidths.  Today this
   is not possible, as there is no way to signal the link bandwidth
   extended community over the EBGP session from R3 to R4.  In the
   absence of a mechanism to regenerate the link bandwidth over the
   EBGP sessions from R3 to R4 and from R5 to R4, the link bandwidth
   assumed for paths received over those sessions would simply be the
   operational link bandwidth of the corresponding EBGP links.

   As per EBGP rules, the advertising router sets the next-hop to
   itself.  Accordingly, R3 computes the best-path from the
   advertisements received from R1 and R2, and R5 computes the
   best-path from the advertisements received from R6 and R7.  R4
   receives the updates from R3 and R5, computes its own best-path,
   and may advertise it further upstream (not shown).  The expected
   behavior is that when R4 sends traffic for p/m towards R3 and R5,
   and on to R1, R2, R6, and R7, the traffic is load-balanced using
   the weights calculated at each router that employs multi-path.  R4
   should send 30% of the traffic to R3 and the remaining 70% to R5.
   R3 in turn should send 67% of the traffic it receives from R4 to R1
   and 33% to R2.  Similarly, R5 should send 57% of the traffic it
   receives from R4 to R6 and the remaining 43% to R7.  What happens
   instead is that R4 sends 50% of the traffic to each of R3 and R5.
   R3 consequently sends more traffic than desired towards R1 and R2,
   and R5 sends less traffic than desired towards R6 and R7.
   Effectively, the load-balancing is skewed towards R1 and R2, even
   though their egress link bandwidth is smaller than that of R6 and
   R7.
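   The desired split at each hop is simply each next-hop's advertised
   bandwidth divided by the sum over all multi-path next-hops.  A
   minimal Python sketch of this arithmetic, using the values from
   Figure 1 (illustrative only):

      def split(bw_by_nexthop):
          # Weighted multi-path share: bandwidth / total, per next-hop.
          total = sum(bw_by_nexthop.values())
          return {nh: bw / total for nh, bw in bw_by_nexthop.items()}

      # R3 and R5 aggregate the LB values learnt from their downstream
      # peers and regenerate 30 and 70, respectively, towards R4.
      print(split({"R1": 20, "R2": 10}))  # at R3: R1 ~67%, R2 ~33%
      print(split({"R6": 40, "R7": 30}))  # at R5: R6 ~57%, R7 ~43%
      print(split({"R3": 30, "R5": 70}))  # at R4: R3 30%,  R5 70%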
      R1- -20 - - |
                  R3- -30 (100) - -|
      R2- -10 - - |                |
                                   |
      R6- -40 - - |                |- - R4
                  |                |
                  R5- -70 (100) - -|
      R7- -30 - - |

     EBGP Network showing advertisement of cumulative link bandwidth

                               Figure 2

   With the existing rules for the DMZ link bandwidth, this is not
   possible.  First, the LB extended community is not sent over EBGP.
   Second, the DMZ link bandwidth has no notion of conveying the
   cumulative link bandwidth (of the directed tree rooted at a node)
   to an upstream router.  To enable the use case described above, the
   cumulative link bandwidth of R1 and R2 has to be advertised by R3
   to R4 and, similarly, the cumulative bandwidth of R6 and R7 has to
   be advertised by R5 to R4.  This enables R4 to load-balance in
   proportion to the cumulative link bandwidths it receives from its
   downstream routers R3 and R5.  Figure 2 shows the cumulative link
   bandwidth advertised by R3 and by R5 towards R4, with the original
   link bandwidth values in '()' for comparison.

   To address cases like the example above, rather than introducing a
   new attribute for the aggregate link bandwidth, we reuse the link
   bandwidth extended community and relax a few assumptions.  With
   neighbor-specific knobs, or policy configuration applied to the
   neighbor outbound or inbound as the case may be, we can regenerate,
   advertise, and/or accept the link bandwidth extended community over
   an EBGP link.  In addition, we can define neighbor-specific knobs
   that aggregate the link bandwidth values learnt from the downstream
   routers (whether received as a link bandwidth extended community in
   the path update, assigned at ingress using a neighbor inbound
   policy, or derived from the operational link speed of the peer
   link) and then regenerate and advertise (via a neighbor outbound
   policy knob) this aggregate link bandwidth value, in the form of
   the LB extended community, to the upstream EBGP router.  Since the
   advertisement is made to EBGP neighbors, the next-hop is reset at
   the advertising router.

   In terms of the overall traffic profile, if traffic for net p/m
   arrives at the ingress router R4 at a data rate of 'x', then in the
   absence of link bandwidth regeneration at R3 and R5 the resulting
   traffic profile is:

      link    ratio                  percent (~)

      R4-R3   1/2 x                  50%
      R4-R5   1/2 x                  50%
      R3-R1   1/3 x   (1/2 * 2/3)    33%
      R3-R2   1/6 x   (1/2 * 1/3)    17%
      R5-R6   2/7 x   (1/2 * 4/7)    29%
      R5-R7   3/14 x  (1/2 * 3/7)    21%

   For comparison, the resulting traffic profile with cumulative link
   bandwidth regeneration at R3 and R5 is:

      link    ratio                  percent (~)

      R4-R3   3/10 x                 30%
      R4-R5   7/10 x                 70%
      R3-R1   1/5 x   (3/10 * 2/3)   20%
      R3-R2   1/10 x  (3/10 * 1/3)   10%
      R5-R6   2/5 x   (7/10 * 4/7)   40%
      R5-R7   3/10 x  (7/10 * 3/7)   30%

   As is evident, the second profile is much closer to the traffic
   distribution that the leaf nodes (R1, R2, R6, R7) should receive
   than the first one.
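   The two tables above can be reproduced by walking the tree and
   multiplying the split ratio at each hop.  A self-contained Python
   sketch (illustrative only; router names and bandwidths are those of
   Figures 1 and 2):

      from fractions import Fraction

      def split(weights):
          # Weighted multi-path share for each next-hop.
          total = sum(weights.values())
          return {nh: Fraction(w, total) for nh, w in weights.items()}

      def profile(r4_w, r3_w, r5_w):
          # Fraction of the ingress rate 'x' carried on each link.
          r4 = split(r4_w)
          links = {("R4", nh): f for nh, f in r4.items()}
          links.update({("R3", nh): r4["R3"] * f
                        for nh, f in split(r3_w).items()})
          links.update({("R5", nh): r4["R5"] * f
                        for nh, f in split(r5_w).items()})
          return links

      # Without regeneration, R4 weighs R3 and R5 equally; with
      # regeneration, it sees the cumulative values 30 and 70.
      print(profile({"R3": 1, "R5": 1},
                    {"R1": 20, "R2": 10}, {"R6": 40, "R7": 30}))
      print(profile({"R3": 30, "R5": 70},
                    {"R1": 20, "R2": 10}, {"R6": 40, "R7": 30}))

   The first call yields 1/3 x on R3-R1, 1/6 x on R3-R2, and so on,
   matching the first table; the second call matches the second table.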
4.  Large Scale Data Centers Use Case

   "Use of BGP for Routing in Large-Scale Data Centers" [RFC7938]
   describes a way to design large-scale data centers using EBGP
   across the different routing layers.  Section 6.3 of [RFC7938]
   ("Weighted ECMP") describes a use case in which a service (most
   likely represented by an anycast virtual IP) is served by an
   unequal set of resources across the data center regions.  Figure 3
   shows a typical data center topology, as described in Section 3.1
   of [RFC7938], in which an unequal number of servers advertise a
   certain BGP prefix.  As can be seen in the figure, the left side of
   the data center hosts only 3 servers while the right side hosts 10.

               +------+   +------+
               |      |   |      |
               | AS1  |   | AS1  |            Tier 1
               |      |   |      |
               +------+   +------+
                 |  |       |  |
       +---------+  |       |  +----------+
       |  +-------+--+-------+--+-------+ |
       |  |       |  |       |  |       | |
     +----+     +----+     +----+     +----+
     |    |     |    |     |    |     |    |
     |AS2 |     |AS2 |     |AS3 |     |AS3 |  Tier 2
     |    |     |    |     |    |     |    |
     +----+     +----+     +----+     +----+
       |          |           |          |
       |          |           |          |
       |  +-----+ |           |  +-----+ |
       +--| AS4 |-+           +--| AS5 |-+    Tier 3
          +-----+                +-----+
          | | |                  | | |

       <- 3 Servers ->       <- 10 Servers ->

               Typical Data Center Topology (RFC7938)

                               Figure 3

   In a regular ECMP environment, the Tier 1 layer would see an ECMP
   path load-sharing equally across all four Tier 2 paths.  This could
   leave the servers on the left side of the data center overloaded
   while the servers on the right side are underutilized.  Using link
   bandwidth advertisements, the servers could add a link bandwidth
   extended community to the advertised service prefix.  Another
   option is to add the extended community on the Tier 3 network
   devices, either as the routes are received from the servers or as
   they are generated locally on the network devices.  If the link
   bandwidth value advertised for the service represents the server
   capacity for that service, each data center tier would aggregate
   the values as it sends its update to the next higher tier.  The
   result is a set of weighted load-sharing metrics at each tier,
   allowing the network to distribute flows among the different
   servers in the most optimal way.  If a server is added to or
   removed from the service prefix, it adds or removes its link
   bandwidth value and the network adjusts accordingly.

   Figure 4 shows the more popular spine-leaf architecture, similar to
   Section 3.2 of [RFC7938].  Tor1, Tor2, and Tor3 are in the same
   tier, i.e. the leaf tier (Figure 3 above is the unfolded Clos of
   this design).  Using the same example as above, the LB extended
   community values received by each of Spine1 and Spine2 from Tor1
   and Tor2 are in the ratio 3 to 10.  The spines then aggregate the
   bandwidth, regenerate the LB extended community, and advertise it
   to Tor3.  Tor3 load-shares equally to both spines, which in turn
   split the traffic in the ratio 3 to 10 when forwarding to Tor1 and
   Tor2 respectively.

                     +--------+
                     |  Tor3  |               Tier 1
                     +--------+
                          |
               +- - - - - + - - - - -+
               |                     |
           +--------+            +--------+
           |        |            |        |
           | Spine1 |            | Spine2 |
           |        |            |        |
           +---+--+-+            +-+--+---+
               |   \               /  |
               |    + - - - - - - +   |
               |   /               \  |
           +---+--+-+            +-+--+---+
           |  Tor1  |            |  Tor2  |   Tier 1
           +--------+            +--------+
            |  |  |               |  |  |

         <- 3 Servers ->      <- 10 Servers ->

               Two-tier Clos Data Center Topology

                               Figure 4
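   A short Python sketch of the tier-by-tier aggregation in Figure 4
   follows; it is illustrative only, and the assumption that each
   server contributes one unit of capacity is ours:

      # Each ToR advertises the aggregate capacity of its servers;
      # each spine aggregates what it learns and regenerates the LB
      # extended community towards Tor3.
      tor1_lb = 3 * 1          # 3 servers, 1 capacity unit each
      tor2_lb = 10 * 1         # 10 servers

      spine_lb = tor1_lb + tor2_lb   # 13, regenerated by each spine

      # Tor3 sees the same LB (13) from Spine1 and Spine2, so it
      # splits 50/50; each spine then splits towards the ToRs 3:10.
      print(tor1_lb / spine_lb)      # ~0.23 of each spine's traffic
      print(tor2_lb / spine_lb)      # ~0.77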
5.  Non-Conforming BGP Topologies

   This use case does not readily apply to all topologies.  Figure 5
   shows an all-EBGP topology: R1, R2, R3, R4, R5, and R6 are in AS1,
   AS2, AS3, AS4, AS5, and AS6 respectively.  A net, p/m, is
   advertised by a server S1, with LB extended community value 10, to
   R1 and R5.  R1 advertises p/m to R2 and R3 and regenerates the LB
   extended community with value 10.  R4 receives the advertisements
   from R2, R3, and R5 and computes the aggregate bandwidth to be 30.
   R4 advertises p/m to R6 with LB extended community value 30.  The
   link bandwidths are as shown in the figure.

   As can be seen in the example, R4 computes the cumulative bandwidth
   of the LB values it receives from R2, R3, and R5, which is 30.
   When R4 receives traffic from R6, it load-balances it across R2,
   R3, and R5.  As a result, R1 receives twice the volume of traffic
   that R5 does.  This is not desirable, because the bandwidth from S1
   to R1 and the bandwidth from S1 to R5 are the same, i.e. 10.  The
   discrepancy arises because, when R4 aggregates the link bandwidth
   values from the received advertisements, the contribution from R1
   is factored in twice.

            |- - R2 - 10 - -|
            |               |
            |               |
      S1- -10- R1           R4- - - -30- - -R6
       |       |            |
       |       |            |
      10       |- -R3- 10- -|
       |                    |
       |- - - -R5- - - - - -|

           A non-conforming topology for the Cumulative DMZ

                               Figure 5

   One way to make the topology in the figure above conforming would
   be to regenerate a normalized value of the aggregate link bandwidth
   whenever the aggregate is advertised over more than one EBGP peer
   link.  Such normalization can be achieved through outbound policy
   applied on top of the aggregate link bandwidth value.  Two options
   in this context, sketched after this section, are:

   1.  divide the aggregate link bandwidth equally across the EBGP
       peers

   2.  divide the aggregate link bandwidth across the EBGP peers in
       the ratio of the operational link capacities of the EBGP peer
       links

   These and similar options for regenerating the link bandwidth to
   cater to load-balancing requirements in such topologies are outside
   the scope of this document; they can be implemented as additional
   outbound policy enhancements on top of a computed aggregate link
   bandwidth.
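   Although such policies are out of scope, the two options above can
   be stated precisely.  A Python sketch, purely for illustration and
   assuming the per-peer operational capacities are known locally:

      def normalize_equal(aggregate_lb, peers):
          # Option 1: split the aggregate equally across EBGP peers.
          share = aggregate_lb / len(peers)
          return {p: share for p in peers}

      def normalize_by_capacity(aggregate_lb, capacity):
          # Option 2: split in the ratio of operational capacities.
          total = sum(capacity.values())
          return {p: aggregate_lb * c / total
                  for p, c in capacity.items()}

      # At R1 in Figure 5, the aggregate (10) would be advertised as
      # 5 to each of R2 and R3 instead of 10 to both; R4's aggregate
      # then becomes 20, and R1 and R5 each attract half the traffic.
      print(normalize_equal(10, ["R2", "R3"]))
      print(normalize_by_capacity(10, {"R2": 10, "R3": 10}))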
6.  Protocol Considerations

   [I-D.ietf-idr-link-bandwidth] needs to be refreshed.  No protocol
   changes are necessary if the knobs are implemented as recommended.
   Alternatively, the same result might be achieved with sufficiently
   elaborate policy frameworks, but that remains conjecture.

7.  Operational Considerations

   Note that these solutions are also applicable to many address
   families, such as L3VPN [RFC2547], IPv4 with labeled unicast
   [RFC8277], and EVPN [RFC7432].

   In topologies and implementations where there is an option to
   advertise all multipath (equal-cost) eligible paths to EBGP peers
   (i.e. the 'ecmp' form of additional-path advertisement is enabled),
   aggregate link bandwidth advertisement may not be required and may
   even be redundant: the receiving BGP speaker receives the link
   bandwidth extended community values with all eligible paths, so the
   aggregate link bandwidth is effectively received by the downstream
   EBGP speaker and can be used in the local computation to affect
   forwarding behavior.  This assumes the additional paths are
   advertised with next-hop self.

8.  Security Considerations

   This document raises no new security issues.

9.  IANA Considerations

   This document has no IANA actions.

10.  Acknowledgements

   Viral Patel did substantial work on an implementation along with
   the first author.  The authors would like to thank Acee Lindem and
   Jakob Heitz for their help in reviewing the draft and for their
   valuable suggestions.  Thanks also to Shyam Sethuram, Sameer
   Gulrajani, Nitin Kumar, Keyur Patel, and Juan Alcaide for
   discussions related to the draft.

11.  References

11.1.  Normative References

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

11.2.  Informative References

   [RFC2547]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547,
              DOI 10.17487/RFC2547, March 1999,
              <https://www.rfc-editor.org/info/rfc2547>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-
              Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: satyamoh@cisco.com

   Arie Vayner
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA  94043
   USA

   Email: avayner@google.com

   Akshay Gattani
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA  95054
   USA

   Email: akshay@arista.com

   Ajay Kini
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA  95054
   USA

   Email: ajkini@arista.com