BESS WorkGroup                                                S. Mohanty
Internet-Draft                                               A. Millisor
Intended status: Informational                             Cisco Systems
Expires: September 4, 2018                                     A. Vayner
                                                                  Google
                                                           March 3, 2018

            Cumulative DMZ Link Bandwidth and load-balancing
                     draft-mohanty-bess-ebgp-dmz-00

Abstract

   The DMZ Link Bandwidth draft provides a way to load-balance traffic
   to a destination (in a different AS than the source) that is
   reachable via more than one path.  Typically, the link bandwidth
   (either configured on the link of the EBGP egress interface or set
   via a policy) is encoded in an extended community and then sent to
   the IBGP peer that employs multi-path.  The link-bandwidth value is
   then extracted from the path's extended community and used as a
   weight in the FIB, which does the load-balancing.  This draft
   extends the usage of the DMZ link bandwidth to another setting, in
   which the ingress BGP speaker requires knowledge of the cumulative
   bandwidth while doing the load-balancing.  The draft also proposes
   neighbor-level knobs that allow the link bandwidth extended
   community to be regenerated and then advertised to EBGP peers,
   overriding the default behavior of not advertising optional
   non-transitive attributes to EBGP peers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 4, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Description
   4.  Large Scale Data Centers Use Case
   5.  Non-Conforming BGP Topologies
   6.  Protocol Considerations
   7.  Operational Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Demilitarized Zone (DMZ) Link Bandwidth (LB) extended community,
   together with the multi-path feature, can be used to provide
   unequal-cost load-balancing under user control.  In
   [I-D.ietf-idr-link-bandwidth], the EBGP egress link bandwidth is
   encoded in the link bandwidth extended community and sent along with
   the BGP update to the IBGP peer.  It is assumed that either a
   labeled path exists to each of the EBGP links or, alternatively,
   that the IGP cost to each link is the same.
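For context, [I-D.ietf-idr-link-bandwidth] defines the value field of this extended community as a two-octet AS number followed by the link bandwidth as a four-octet IEEE floating-point number in bytes per second.  A minimal encoding sketch follows (illustrative only; the 0x40/0x04 type octets assume the optional non-transitive encoding, and the function names are our own):

```python
import struct

LB_TYPE_HIGH = 0x40  # optional, non-transitive (assumed encoding)
LB_TYPE_LOW = 0x04   # link bandwidth sub-type

def encode_link_bandwidth(asn: int, bw_bytes_per_sec: float) -> bytes:
    """Pack an 8-octet link bandwidth extended community:
    type (2 octets), AS number (2 octets), then the bandwidth as a
    4-octet IEEE 754 float in bytes per second."""
    return struct.pack("!BBHf", LB_TYPE_HIGH, LB_TYPE_LOW,
                       asn, bw_bytes_per_sec)

def decode_link_bandwidth(ec: bytes) -> tuple[int, float]:
    """Return (asn, bandwidth) from an encoded community."""
    _th, _tl, asn, bw = struct.unpack("!BBHf", ec)
    return asn, bw

# A ~100 Mbit/s link expressed in bytes per second.
ec = encode_link_bandwidth(65001, 12_500_000.0)
assert len(ec) == 8
assert decode_link_bandwidth(ec) == (65001, 12_500_000.0)
```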
   When the same prefix/net is advertised into the receiving AS via
   different egress points or next-hops, the receiving IBGP peer that
   employs multi-path will use the value of the DMZ LB to load-balance
   traffic to the egress BGP speakers (ASBRs) in proportion to the
   link bandwidths.

   The link bandwidth extended community cannot be advertised to EBGP
   peers, as it is defined to be optional non-transitive.  This draft
   discusses a new use case in which the link bandwidth needs to be
   advertised to EBGP peers.  The new use case requires that the
   router calculate the aggregate link bandwidth, regenerate the DMZ
   link bandwidth extended community, and advertise it to EBGP peers.
   The new use case also negates the [I-D.ietf-idr-link-bandwidth]
   restriction that the DMZ link bandwidth extended community not be
   sent when the advertising router sets the next-hop to itself.

   In [I-D.ietf-idr-link-bandwidth], the DMZ link bandwidth advertised
   by the EBGP egress BGP speaker to the IBGP speaker represents the
   link bandwidth of the EBGP link.  However, there is sometimes a
   need to aggregate the link bandwidths of all the paths advertising
   a given net and then send the total to an upstream neighbor.  This
   is represented pictorially in Figure 1.  The aggregated link
   bandwidth is used by the upstream router to do load-balancing, as
   it may also receive several such paths for the same net which in
   turn carry the accumulated bandwidth.

       R1- -20 - - |
                   R3- -100- -|
       R2- -10 - - |          |
                              |
       R6- -40 - - |          |- - R4
                   |          |
                   R5- -100- -|
       R7- -30 - - |

              EBGP Network with cumulative DMZ requirement

                                Figure 1

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3.  Problem Description

   Figure 1 above represents an all-EBGP network.  Router R3 peers
   with two downstream EBGP routers, R1 and R2, and with an upstream
   EBGP router, R4.  Another router, R5, peers with two downstream
   routers, R6 and R7, and also peers with R4.  A net, p/m, is learnt
   by R1, R2, R6, and R7 from their downstream routers (not shown).
   From the perspective of R4, the topology looks like a directed
   tree.  The link bandwidths of the EBGP links are shown alongside
   the links (the exact units are not important).  It is assumed that
   R3, R4, and R5 have multi-path configured and that paths with
   different AS-path attribute values can still be considered for
   multi-path (knobs exist in many implementations for this).  When
   the ingress router, R4, sends traffic to the destination p/m, the
   traffic needs to be spread among the links in the ratio of their
   link bandwidths.  Today this is not possible, as there is no way to
   signal the link bandwidth extended community over the EBGP session
   from R3 to R4.

   As per EBGP rules, the advertising router sets the next-hop to
   itself.  Accordingly, R3 computes the best path from the
   advertisements received from R1 and R2, and R5 computes the best
   path from the advertisements received from R6 and R7.  R4 receives
   the updates from R3 and R5, in turn computes the best path, and may
   advertise it upstream (not shown).  The expected behavior is that
   when R4 sends traffic for p/m towards R3 and R5, and then on to R1,
   R2, R6, and R7, the traffic should be load-balanced based on the
   weights calculated at the routers that employ multi-path.  R4
   should send 30% of the traffic to R3 and the remaining 70% to R5.
   R3 in turn should send 67% of the traffic that it received from R4
   to R1 and 33% to R2.
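These ratios follow directly from proportional weighting on the advertised bandwidth values of Figure 1; the arithmetic can be checked with a short sketch (illustrative only, not part of any protocol machinery):

```python
# Link bandwidths from Figure 1: paths received by R3 and R5 from
# their downstream EBGP neighbors.
received = {
    "R3": {"R1": 20, "R2": 10},
    "R5": {"R6": 40, "R7": 30},
}

# Each router aggregates its downstream bandwidths; the cumulative
# value is what it would regenerate and advertise upstream to R4.
cumulative = {rtr: sum(paths.values()) for rtr, paths in received.items()}

def split(paths):
    """Percentage of traffic sent to each next-hop, by bandwidth."""
    total = sum(paths.values())
    return {nbr: round(100 * bw / total) for nbr, bw in paths.items()}

print(split(cumulative))      # R4's split: {'R3': 30, 'R5': 70}
print(split(received["R3"]))  # R3's split: {'R1': 67, 'R2': 33}
print(split(received["R5"]))  # R5's split: {'R6': 57, 'R7': 43}
```

If the cumulative LB values were signaled from R3 and R5 to R4, these are exactly the ratios the routers' FIBs would install as weights.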
   Similarly, R5 should send 57% of the traffic to R6 and the
   remaining 43% to R7.

   With the existing rules for the DMZ link bandwidth, this is not
   possible.  First, the LB extended community is not sent over EBGP.
   Second, the DMZ link bandwidth has no notion of conveying the
   cumulative link bandwidth (of the directed tree rooted at a node)
   to an upstream router.  To enable the use case described above, the
   cumulative link bandwidth of R1 and R2 has to be advertised by R3
   to R4, and, similarly, the cumulative bandwidth of R6 and R7 has to
   be advertised by R5 to R4.  This enables R4 to load-balance in
   proportion to the cumulative link bandwidths that it receives from
   its downstream routers R3 and R5.

   To address cases like the above example, rather than inventing
   something new from scratch, we relax a few assumptions of the link
   bandwidth extended community.  With neighbor-specific knobs,
   outbound or inbound as the case may be, we can regenerate and
   advertise and/or accept the link bandwidth extended community over
   the EBGP link.  In addition, we can define neighbor-specific knobs
   that aggregate the link bandwidth values from the LB extended
   communities (accepted via the neighbor inbound policy knob) of the
   downstream routers and then regenerate and advertise (via the
   neighbor outbound policy knob) this aggregate link bandwidth value
   in the LB extended community to the upstream EBGP router.  Since
   the advertisement is made to EBGP neighbors, the next-hop is reset
   at the advertising router.

4.  Large Scale Data Centers Use Case

   "Use of BGP for Routing in Large-Scale Data Centers" [RFC7938]
   describes a way to design large-scale data centers using EBGP
   across the different routing layers.
   Section 6.3 of [RFC7938] ("Weighted ECMP") describes a use case in
   which a service (most likely represented by an anycast virtual IP)
   has an unequal set of resources serving it across the data center
   regions.  Figure 2 shows a typical data center topology, as
   described in Section 3.1 of [RFC7938], where an unequal number of
   servers are deployed advertising a certain BGP prefix.  As can be
   seen in the figure, the left side of the data center hosts only 3
   servers while the right side hosts 10 servers.

              +------+        +------+
              |      |        |      |
              | AS1  |        | AS1  |           Tier 1
              |      |        |      |
              +------+        +------+
               |    |          |    |
     +---------+    |          |    +----------+
     |  +-------+---+----------+---+-------+   |
     |  |       |   |          |   |       |   |
   +----+     +----+         +----+      +----+
   |    |     |    |         |    |      |    |
   |AS2 |     |AS2 |         |AS3 |      |AS3 |  Tier 2
   |    |     |    |         |    |      |    |
   +----+     +----+         +----+      +----+
      |         |               |          |
      |         |               |          |
      |  +-----+ |              |  +-----+ |
      +--| AS4 |-+              +--| AS5 |-+     Tier 3
         +-----+                   +-----+
          | | |                     | | |
      <- 3 Servers ->          <- 10 Servers ->

              Typical Data Center Topology (RFC 7938)

                                Figure 2

   In a regular ECMP environment, the Tier 1 layer would see an ECMP
   path load-sharing equally across all four Tier 2 paths.  This would
   potentially overload the servers on the left side of the data
   center, while the servers on the right side would be underutilized.
   Using link bandwidth advertisements, the servers could add a link
   bandwidth extended community to the advertised service prefix.
   Another option is to add the extended community on the Tier 3
   network devices as the routes are received from the servers or
   generated locally on the network devices.  If the link bandwidth
   value advertised for the service represents the server capacity for
   that service, each data center tier would aggregate the values as
   it sends the update to the next higher tier.
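Assuming, for illustration, that each server contributes one unit of capacity for the service prefix, the tier-by-tier aggregation can be sketched as follows (hypothetical values; the AS names are taken from Figure 2):

```python
# Per-server LB values for the anycast service prefix: 3 servers
# under AS4 (left), 10 under AS5 (right), one unit each.
advertised = {"AS4": [1, 1, 1], "AS5": [1] * 10}

# Each Tier 3 device sums the per-server LB values before
# re-advertising the prefix upward with the aggregate.
aggregate = {dev: sum(vals) for dev, vals in advertised.items()}
assert aggregate == {"AS4": 3, "AS5": 10}

# The upper tier then weights its multipath toward the two sides
# 3:10 instead of splitting equally.
total = sum(aggregate.values())
weights = {dev: agg / total for dev, agg in aggregate.items()}
print(weights)  # AS4 -> 3/13, AS5 -> 10/13
```

Adding or removing a server changes only its own LB contribution, and the aggregates recomputed at each tier adjust the weights accordingly.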
   The result would be a set of weighted load-sharing metrics at each
   tier, allowing the network to distribute the flow load among the
   different servers in the most optimal way.  If a server is added to
   or removed from the service prefix, it would add or remove its link
   bandwidth value, and the network would adjust accordingly.

   Figure 3 shows a more popular spine-leaf architecture, similar to
   Section 3.2 of [RFC7938].  Tor1, Tor2, and Tor3 are in the same
   tier, i.e., the leaf tier (the representation shown in Figure 3 is
   the unfolded Clos).  Using the same example as above, the LB
   extended community values received by each of Spine1 and Spine2
   from Tor1 and Tor2 are in the ratio 3 to 10.  The spines then
   aggregate the bandwidth, regenerate, and advertise the LB extended
   community to Tor3.  Tor3 does equal-cost sharing towards both
   spines, which in turn split the traffic in the ratio 3 to 10 when
   forwarding it to Tor1 and Tor2 respectively.

                  +------+
                  | Tor3 |                 Tier 1
                  +------+
                     |
           +- - - - -+- - - - +
           |                  |
        +----+             +----+
        |    |             |    |
        |Spine1            |Spine2
        |    |             |    |
        +----+--+        +-+----+
           |     \      /    |
            - - - + - - -
           |     /      \    |
        +-----+-+        +-+-----+
        |Tor1 |            |Tor2 |         Tier 1
        +-----+            +-----+
         | | |              | | |
       <- 3 Servers ->  <- 10 Servers ->

              Two-tier Clos Data Center Topology

                             Figure 3

5.  Non-Conforming BGP Topologies

   This use case does not readily apply to all topologies.  Figure 4
   shows an all-EBGP topology: R1, R2, R3, R4, R5, and R6 are in AS1,
   AS2, AS3, AS4, AS5, and AS6 respectively.  A net, p/m, is
   advertised by a server S1 with LB extended community value 10 to R1
   and R5.  R1 advertises p/m to R2 and R3 and regenerates the LB
   extended community with value 10.  R4 receives the advertisements
   from R2, R3, and R5 and computes the aggregate bandwidth to be 30.
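Assuming R4 simply sums the LB values it receives, the arithmetic behind this example can be traced with a short sketch (illustrative only):

```python
# Figure 4: S1 advertises p/m with LB 10 to both R1 and R5; R1
# regenerates LB 10 toward R2 and toward R3, so R4 receives three
# paths, each carrying LB 10.
received_at_r4 = {"R2": 10, "R3": 10, "R5": 10}

aggregate = sum(received_at_r4.values())
assert aggregate == 30  # the value R4 regenerates toward R6

# R4 splits incoming traffic proportionally across the three paths.
share = {nbr: bw / aggregate for nbr, bw in received_at_r4.items()}

# R2 and R3 both forward to R1, so R1 carries two thirds of the
# load while R5 carries one third, despite S1 having equal links
# to R1 and R5.
to_r1 = share["R2"] + share["R3"]
to_r5 = share["R5"]
assert to_r1 == 2 * to_r5
```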
   R4 advertises p/m to R6 with LB extended community value 30.  The
   link bandwidths are as shown in the figure.

   As can be seen in the example, R4 computes the cumulative bandwidth
   of the LB values it receives from R2, R3, and R5, which is 30.
   When R4 receives traffic from R6, it load-balances it across R2,
   R3, and R5.  As a result, R1 receives twice the volume of traffic
   that R5 does.  This is not desirable, because the bandwidth from S1
   to R1 and the bandwidth from S1 to R5 are the same, i.e., 10.  The
   discrepancy arises because, when R4 aggregates the link bandwidth
   values from the received advertisements, the contribution from R1
   is factored in twice.

               |- - R2 - 10 --|
               |              |
               |              |
     S1- - 10- R1             R4- - - --30 - -R6
      |        |              |
      |        |              |
      10       |- - -R3- 10 - |
      |                       |
      |- - - R5 - - -- - -- - |

          A non-conforming topology for the Cumulative DMZ

                             Figure 4

6.  Protocol Considerations

   [I-D.ietf-idr-link-bandwidth] needs to be refreshed.  No protocol
   changes are necessary if the knobs are implemented as recommended.
   The same purpose might alternatively be achieved with complicated
   policy frameworks, but that is only a conjecture.

7.  Operational Considerations

   Note that these solutions are also applicable to many address
   families, such as L3VPN [RFC2547], IPv4 labeled unicast [RFC8277],
   and EVPN [RFC7432].

8.  Security Considerations

   This document raises no new security issues.

9.  Acknowledgements

   Viral Patel did substantial work on an implementation along with
   the first author.  The authors would like to thank Acee Lindem and
   Jakob Heitz for their help in reviewing the draft and their
   valuable suggestions.  The authors would also like to thank Shyam
   Sethuram, Sameer Gulrajani, Nitin Kumar, Keyur Patel, and Juan
   Alcaide for discussions related to the draft.

10.  References

10.1.  Normative References

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

10.2.  Informative References

   [RFC2547]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547,
              DOI 10.17487/RFC2547, March 1999,
              <https://www.rfc-editor.org/info/rfc2547>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: satyamoh@cisco.com

   Aaron Millisor
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: amilliso@cisco.com

   Arie Vayner
   Google
   1600 Amphitheatre Pkwy
   Mountain View, CA  94043
   USA

   Email: avayner@google.com