BESS WorkGroup                                                S. Mohanty
Internet-Draft                                               A. Millisor
Intended status: Informational                             Cisco Systems
Expires: January 14, 2021                                      A. Vayner
                                                                 Nutanix
                                                              A. Gattani
                                                                 A. Kini
                                                         Arista Networks
                                                           July 13, 2020

           Cumulative DMZ Link Bandwidth and load-balancing
                    draft-mohanty-bess-ebgp-dmz-02

Abstract

   The DMZ Link Bandwidth draft provides a way to load-balance traffic
   to a destination (which is in a different AS than the source) that
   is reachable via more than one path.  Typically, the link bandwidth
   (either configured on the link of the EBGP egress interface or set
   via a policy) is encoded in an extended community and then sent to
   the IBGP peer, which employs multi-path.  The link-bandwidth value
   is then extracted from the path's extended community and used as a
   weight in the FIB, which does the load-balancing.  This draft
   extends the usage of the DMZ link bandwidth to another setting,
   where the ingress BGP speaker requires knowledge of the cumulative
   bandwidth while doing the load-balancing.  The draft also proposes
   neighbor-level knobs that allow the link bandwidth extended
   community to be regenerated and then advertised to EBGP peers,
   overriding the default behavior of not advertising optional
   non-transitive attributes to EBGP peers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 14, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Description
   4.  Large Scale Data Centers Use Case
   5.  Non-Conforming BGP Topologies
   6.  Protocol Considerations
   7.  Operational Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Demilitarized Zone (DMZ) Link Bandwidth (LB) extended community,
   along with the multi-path feature, can be used to provide unequal-
   cost load-balancing under user control.  In
   [I-D.ietf-idr-link-bandwidth], the EBGP egress link bandwidth is
   encoded in the link bandwidth extended community and sent along with
   the BGP update to the IBGP peer.
   It is assumed that either a labeled path exists to each of the EBGP
   links or, alternatively, that the IGP cost to each link is the same.
   When the same prefix/net is advertised into the receiving AS via
   different egress points or next-hops, the receiving IBGP peer that
   employs multi-path will use the value of the DMZ LB to load-balance
   traffic to the egress BGP speakers (ASBRs) in proportion to the
   link bandwidths.

   The link bandwidth extended community cannot be advertised to EBGP
   peers, as it is defined to be optional non-transitive.  This draft
   discusses a new use case in which the link bandwidth needs to be
   advertised to EBGP peers.  The new use case requires that the router
   calculate the aggregate link bandwidth, regenerate the DMZ link
   bandwidth extended community, and advertise it to EBGP peers.  The
   new use case also negates the [I-D.ietf-idr-link-bandwidth]
   restriction that the DMZ link bandwidth extended community not be
   sent when the advertising router sets the next-hop to itself.

   In [I-D.ietf-idr-link-bandwidth], the DMZ link bandwidth advertised
   by the EBGP egress BGP speaker to the IBGP speaker represents the
   link bandwidth of the EBGP link.  However, sometimes there is a need
   to aggregate the link bandwidth of all the paths that are
   advertising a given net and then send it to an upstream neighbor.
   This is represented pictorially in Figure 1.  The aggregate link
   bandwidth is used by the upstream router to do load-balancing, as it
   may in turn receive several such paths for the same net, each
   carrying an accumulated bandwidth.

      R1- -20 - - |
                  R3- -100 - -|
      R2- -10 - - |           |
                              |
      R6- -40 - - |           |- - R4
                  |           |
                  R5- -100 - -|
      R7- -30 - - |

            EBGP Network with cumulative DMZ requirement

                              Figure 1

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3.  Problem Description

   Figure 1 above represents an all-EBGP network.  Router R3 peers with
   two downstream EBGP routers, R1 and R2, and with an upstream EBGP
   router, R4.  Another router, R5, peers with two downstream routers,
   R6 and R7, and also peers with R4.  A net, p/m, is learnt by R1, R2,
   R6, and R7 from their downstream routers (not shown).  From the
   perspective of R4, the topology looks like a directed tree.  The
   link bandwidths of the EBGP links are shown alongside the links (the
   exact units are not important; for simplicity these can be assumed
   to be weights proportional to the operational link bandwidths).  It
   is assumed that R3, R4, and R5 have multi-path configured, and that
   paths with different AS-path attribute values can still be
   considered for multi-path (knobs exist in many implementations for
   this).  When the ingress router, R4, sends traffic to the
   destination p/m, the traffic needs to be spread among the links in
   the ratio of their link bandwidths.  Today this is not possible, as
   there is no way to signal the link bandwidth extended community over
   the EBGP session from R3 to R4.  In the absence of a mechanism to
   regenerate the link bandwidth over the EBGP sessions from R3 to R4
   and from R5 to R4, the assumed link bandwidth for paths received
   over those sessions would be equal to the operational link bandwidth
   of the corresponding EBGP links.

   As per EBGP rules, the advertising router sets the next-hop to
   itself.
   Accordingly, R3 computes the best path from the advertisements
   received from R1 and R2, and R5 computes the best path from the
   advertisements received from R6 and R7.  R4 receives the updates
   from R3 and R5, in turn computes its best path, and may advertise
   it upstream (not shown).  The expected behavior is that when R4
   sends traffic for p/m towards R3 and R5, and then on to R1, R2, R6,
   and R7, the traffic should be load-balanced based on the calculated
   weights at the routers that employ multi-path.  R4 should send 30%
   of the traffic to R3 and the remaining 70% to R5.  R3 in turn
   should send 67% of the traffic that it received from R4 to R1 and
   33% to R2.  Similarly, R5 should send 57% of the traffic received
   from R4 to R6 and the remaining 43% to R7.  Instead, R4 sends 50%
   of the traffic to each of R3 and R5.  R3 in turn sends more traffic
   than desired towards R1 and R2, and R5 sends less traffic than
   desired towards R6 and R7.  Effectively, the load-balancing is
   skewed towards R1 and R2, even though their egress link bandwidth
   is less than that of R6 and R7.

      R1- -20 - - |
                  R3- -30 (100) - -|
      R2- -10 - - |                |
                                   |
      R6- -40 - - |                |- - R4
                  |                |
                  R5- -70 (100) - -|
      R7- -30 - - |

     EBGP Network showing advertisement of cumulative link bandwidth

                              Figure 2

   With the existing rules for the DMZ link bandwidth, this is not
   possible.  First, the LB extended community is not sent over EBGP.
   Second, the DMZ link bandwidth has no notion of conveying the
   cumulative link bandwidth (of the directed tree rooted at a node)
   to an upstream router.  To enable the use case described above, the
   cumulative link bandwidth of R1 and R2 has to be advertised by R3
   to R4 and, similarly, the cumulative bandwidth of R6 and R7 has to
   be advertised by R5 to R4.
   This will enable R4 to load-balance in proportion to the cumulative
   link bandwidths that it receives from its downstream routers R3 and
   R5.  Figure 2 shows the cumulative link bandwidth advertised by R3
   towards R4 and by R5 towards R4, with the original link bandwidth
   values in parentheses for comparison.

   To address cases like the above example, rather than introducing a
   new attribute for aggregate link bandwidth, we reuse the link
   bandwidth extended community attribute and relax a few assumptions.
   With neighbor-specific knobs, or policy configuration applied to
   the neighbor outbound or inbound as the case may be, we can
   regenerate, advertise, and/or accept the link bandwidth extended
   community over the EBGP link.  In addition, we can define
   neighbor-specific knobs that aggregate the link bandwidth values
   from the LB extended communities learnt from the downstream routers
   (whether received as a link bandwidth extended community in the
   path update, assigned at ingress using a neighbor inbound policy
   configuration, or derived from the operational link speed of the
   peer link) and then regenerate and advertise (via a neighbor
   outbound policy knob) this aggregate link bandwidth value, in the
   form of the LB extended community, to the upstream EBGP router.
   Since the advertisement is made to EBGP neighbors, the next-hop is
   reset at the advertising router.
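   The aggregate-and-regenerate step described above can be sketched in
   a few lines.  The sketch below assumes the community layout from
   [I-D.ietf-idr-link-bandwidth] (type 0x40, sub-type 0x04, a 2-octet
   AS number, and the bandwidth as a 4-octet IEEE floating-point
   value); the function names and AS numbers are illustrative only and
   do not come from any particular BGP implementation.

```python
import struct

# Link Bandwidth extended community: type 0x40 (optional,
# non-transitive), sub-type 0x04, per draft-ietf-idr-link-bandwidth.
LB_TYPE, LB_SUBTYPE = 0x40, 0x04

def encode_link_bandwidth(asn: int, bandwidth: float) -> bytes:
    """Encode an 8-octet DMZ Link Bandwidth extended community."""
    return struct.pack("!BBHf", LB_TYPE, LB_SUBTYPE, asn, bandwidth)

def decode_link_bandwidth(ecomm: bytes) -> float:
    """Extract the bandwidth value from a Link Bandwidth community."""
    t, st, _asn, bw = struct.unpack("!BBHf", ecomm)
    if (t, st) != (LB_TYPE, LB_SUBTYPE):
        raise ValueError("not a link-bandwidth extended community")
    return bw

def regenerate_cumulative(local_asn: int, downstream: list) -> bytes:
    """Sum the link bandwidths of the downstream paths and regenerate
    the community for advertisement to the upstream EBGP peer
    (which will also set next-hop self)."""
    total = sum(decode_link_bandwidth(e) for e in downstream)
    return encode_link_bandwidth(local_asn, total)

# R3 aggregating the R1 (20) and R2 (10) paths from Figure 1;
# the AS numbers are arbitrary private ASNs for illustration:
paths = [encode_link_bandwidth(64512, 20.0),
         encode_link_bandwidth(64513, 10.0)]
assert decode_link_bandwidth(regenerate_cumulative(64501, paths)) == 30.0
```

   Whether such regeneration happens by default or only under an
   explicit neighbor-level knob is the policy question this draft
   raises; the arithmetic itself is just a sum over the received
   communities.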
   In terms of the overall traffic profile, if we assume that traffic
   for net p/m arrives at the ingress router R4 at a data rate of x,
   then in the absence of link bandwidth regeneration at R3 and R5 the
   resulting traffic profile is:

      Link    Ratio                 Percent (~)
      -----   -------------------   -----------
      R4-R3   1/2 x                 50%
      R4-R5   1/2 x                 50%
      R3-R1   1/3 x  (1/2 * 2/3)    33%
      R3-R2   1/6 x  (1/2 * 1/3)    17%
      R5-R6   2/7 x  (1/2 * 4/7)    29%
      R5-R7   3/14 x (1/2 * 3/7)    21%

   For comparison, the resulting traffic profile in the presence of
   cumulative link bandwidth regeneration at R3 and R5 is:

      Link    Ratio                 Percent (~)
      -----   -------------------   -----------
      R4-R3   3/10 x                30%
      R4-R5   7/10 x                70%
      R3-R1   1/5 x  (3/10 * 2/3)   20%
      R3-R2   1/10 x (3/10 * 1/3)   10%
      R5-R6   2/5 x  (7/10 * 4/7)   40%
      R5-R7   3/10 x (7/10 * 3/7)   30%

   As is evident, the second table is closer to the desired traffic
   profile at the leaf nodes (R1, R2, R6, R7) than the first.

4.  Large Scale Data Centers Use Case

   "Use of BGP for Routing in Large-Scale Data Centers" [RFC7938]
   describes a way to design large-scale data centers using EBGP
   across the different routing layers.  Section 6.3 of [RFC7938]
   ("Weighted ECMP") describes a use case in which a service (most
   likely represented by an anycast virtual IP) has an unequal set of
   resources serving it across the data center regions.  Figure 3
   shows a typical data center topology, as described in Section 3.1
   of [RFC7938], where an unequal number of servers are deployed
   advertising a certain BGP prefix.  As can be seen in the figure,
   the left side of the data center hosts only 3 servers while the
   right side hosts 10 servers.
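   As a rough illustration of the weighted ECMP behavior that
   Section 6.3 of [RFC7938] calls for, a hash-based weighted next-hop
   selection over unequal weights (3 vs. 10, matching the server
   counts in this example) might look like the sketch below.  The
   function and pod names are hypothetical and not taken from any
   vendor's FIB implementation.

```python
import zlib

def weighted_nexthop(flow_key: bytes, weights: dict) -> str:
    """Pick a next-hop for a flow in proportion to link-bandwidth
    weights, using a stable hash of the flow key (illustrative only;
    real FIBs typically hash the packet 5-tuple in hardware)."""
    total = sum(weights.values())
    bucket = zlib.crc32(flow_key) % total
    for nexthop, weight in sorted(weights.items()):
        if bucket < weight:
            return nexthop
        bucket -= weight
    raise AssertionError("unreachable: bucket < total by construction")

# With a 3-server pod and a 10-server pod, Tier 1 should steer
# roughly 3/13 of the flows left and 10/13 right:
weights = {"left-pod": 3, "right-pod": 10}
counts = {"left-pod": 0, "right-pod": 0}
for i in range(10000):
    counts[weighted_nexthop(b"flow-%d" % i, weights)] += 1
```

   Because flows are mapped by hash, the split is only approximately
   proportional over many flows; individual long-lived flows are still
   pinned to a single path.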
        +------+          +------+
        |      |          |      |
        | AS1  |          | AS1  |            Tier 1
        |      |          |      |
        +------+          +------+
          |  |              |  |
     +-------+  |           |  +--------+
     |  +------+--+---------+--+------+ |
     |  |      |  |         |  |      | |
   +----+    +----+       +----+    +----+
   |    |    |    |       |    |    |    |
   |AS2 |    |AS2 |       |AS3 |    |AS3 |    Tier 2
   |    |    |    |       |    |    |    |
   +----+    +----+       +----+    +----+
      |         |            |         |
      |         |            |         |
      | +-----+ |            | +-----+ |
      +-| AS4 |-+            +-| AS5 |-+      Tier 3
        +-----+                +-----+
        | |  |                 | |  |
      <- 3 Servers ->       <- 10 Servers ->

               Typical Data Center Topology (RFC7938)

                              Figure 3

   In a regular ECMP environment, the Tier 1 layer would see an ECMP
   path equally load-sharing across all four Tier 2 paths.  This could
   cause the servers on the left part of the data center to be
   overloaded, while the servers on the right are underutilized.
   Using link bandwidth advertisements, the servers could add a link
   bandwidth extended community to the advertised service prefix.
   Another option is to add the extended community on the Tier 3
   network devices as the routes are received from the servers, or as
   they are generated locally on the network devices.  If the link
   bandwidth value advertised for the service represents the server
   capacity for that service, each data center tier would aggregate
   the values up when sending the update to the higher tier.  The
   result would be a set of weighted load-sharing metrics at each
   tier, allowing the network to distribute the flow load among the
   different servers in the most optimal way.  If a server is added to
   or removed from the service prefix, it would add or remove its link
   bandwidth value, and the network would adjust accordingly.

   Figure 4 shows a more popular spine-leaf architecture, similar to
   Section 3.2 of [RFC7938].  Tor1, Tor2, and Tor3 are in the same
   tier, i.e. the leaf tier (the representation shown in Figure 4 is
   the unfolded Clos).
   Using the same example as above, it is clear that the LB extended
   community values received by Spine1 and Spine2 from Tor1 and Tor2
   are in the ratio 3 to 10.  The spines will then aggregate the
   bandwidth, and regenerate and advertise the LB extended community
   to Tor3.  Tor3 will do equal-cost sharing towards both spines,
   which in turn will split the traffic in the ratio 3 to 10 when
   forwarding it to Tor1 and Tor2 respectively.

                 +------+
                 | Tor3 |            Tier 1
                 +------+
                     |
           +- - - - -+- - - - +
           |                  |
        +----+             +----+
        |    |             |    |
        |Spine1            |Spine2
        |    |             |    |
        +----+--+        +-+----+
           |     \      /    |
           -      + - -      -
           |     /      \    |
        +-----+-+        +-+-----+
        |Tor1 |            |Tor2 |   Tier 1
        +-----+            +-----+
         | | |              | | |
      <- 3 Servers ->    <- 10 Servers ->

               Two-tier Clos Data Center Topology

                              Figure 4

5.  Non-Conforming BGP Topologies

   This use case will not readily apply to all topologies.  Figure 5
   shows an all-EBGP topology: R1, R2, R3, R4, R5, and R6 are in AS1,
   AS2, AS3, AS4, AS5, and AS6 respectively.  A net, p/m, is
   advertised from a server S1 with LB extended community value 10 to
   R1 and R5.  R1 advertises p/m to R2 and R3 and regenerates the LB
   extended community with value 10.  R4 receives the advertisements
   from R2, R3, and R5 and computes the aggregate bandwidth to be 30.
   R4 advertises p/m to R6 with LB extended community value 30.  The
   link bandwidths are as shown in the figure.

   As can be seen in the example, R4 computes the cumulative bandwidth
   of the LB values that it receives from R2, R3, and R5, which is 30.
   When R4 receives the traffic from R6, it will load-balance it
   across R2, R3, and R5.  As a result, R1 will receive twice the
   volume of traffic that R5 does.  This is not desirable, because the
   bandwidth between S1 and R1 and between S1 and R5 is the same,
   i.e. 10.
   The discrepancy arose because, when R4 aggregated the link
   bandwidth values from the received advertisements, the contribution
   from R1 was factored in twice.

                |- - R2 - 10 --|
                |              |
                |              |
      S1- - 10- R1             R4- - - --30 - -R6
       |        |              |
       |        |              |
       10       |- - -R3- 10 - |
       |                       |
       |- - - R5 - - -- - -- - |

           A non-conforming topology for the Cumulative DMZ

                              Figure 5

   One way to make the topology in the figure above conform would be
   to regenerate a normalized value of the aggregate link bandwidth
   when the aggregate is advertised over more than one EBGP peer link.
   Such normalization can be achieved by applying an outbound policy
   on top of the aggregate link bandwidth value.  Two options in this
   context are:

   a)  divide the aggregate link bandwidth equally across the EBGP
       peers, or

   b)  divide the aggregate link bandwidth across the EBGP peers in
       proportion to the operational link capacity of the EBGP peer
       links.

   These and similar options for regenerating the link bandwidth to
   cater to load-balancing requirements in such topologies are outside
   the scope of this document and can be implemented as additional
   outbound policy enhancements on top of a computed aggregate link
   bandwidth.

6.  Protocol Considerations

   [I-D.ietf-idr-link-bandwidth] needs to be refreshed.  No protocol
   changes are necessary if the knobs are implemented as recommended.
   An alternative way to achieve the same result would be to use
   complex policy frameworks, but that remains conjecture.

7.  Operational Considerations

   Note that these solutions are also applicable to other address
   families, such as L3VPN [RFC2547], IPv4 labeled unicast [RFC8277],
   and EVPN [RFC7432].

   In topologies and implementations where there is an option to
   advertise all multi-path (equal-cost) eligible paths to EBGP peers
   (i.e. the 'ecmp' form of additional-path advertisement is enabled),
   aggregate link bandwidth advertisement may not be required, or may
   be redundant, since the receiving BGP speaker receives the link
   bandwidth extended community values with all eligible paths.  The
   aggregate link bandwidth is thus effectively received by the
   downstream EBGP speaker and can be used in the local computation to
   affect the forwarding behavior.  This assumes the additional paths
   are advertised with next-hop self.

8.  Security Considerations

   This document raises no new security issues.

9.  Acknowledgements

   Viral Patel did substantial work on an implementation along with
   the first author.  The authors would like to thank Acee Lindem and
   Jakob Heitz for their help in reviewing the draft and their
   valuable suggestions.  The authors would also like to thank Shyam
   Sethuram, Sameer Gulrajani, Nitin Kumar, Keyur Patel, and Juan
   Alcaide for discussions related to the draft.

10.  References

10.1.  Normative References

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

10.2.  Informative References

   [RFC2547]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547,
              DOI 10.17487/RFC2547, March 1999,
              <https://www.rfc-editor.org/info/rfc2547>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-
              Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015, <https://www.rfc-editor.org/info/rfc7432>.
   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: satyamoh@cisco.com

   Aaron Millisor
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: amilliso@cisco.com

   Arie Vayner
   Nutanix
   1740 Technology Drive
   San Jose, CA 95110
   USA

   Email: ariev@vayner.net

   Akshay Gattani
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA 95054
   USA

   Email: akshay@arista.com

   Ajay Kini
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA 95054
   USA

   Email: ajkini@arista.com