IDR                                                             J. Heitz
Internet-Draft                                                    D. Rao
Intended status: Standards Track                                   Cisco
Expires: April 25, 2019                                 October 22, 2018

         Aggregating BGP routes in Massive Scale Data Centers
                 draft-heitz-idr-msdc-bgp-aggregation-00

Abstract

   A design for a fabric of switches to connect up to one million
   servers in a data center is described.  At that scale, it is
   impractical for every switch to maintain knowledge about every other
   switch and every other link in the fabric.  Aggregation of routes is
   an excellent way to scale such a fabric.  However, aggregation
   presents some problems under link failures or switch failures.  This
   design solves those problems.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 25, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Introduction
   2.  Solution Overview
   3.  Problems with negative routes
   4.  Use of a negative route in BGP
   5.  Implementation Notes to Reduce CPU Time Consumption
   6.  Smooth Startup and Avoidance of Too Many Negative Routes
   7.  Avoidance of Transients
   8.  Configuration
   9.  South Triggered Automatic Disaggregation (STAD)
   10. Configuration for STAD
   11. Security Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. References
     14.1. Normative References
     14.2. Informative References
   Authors' Addresses

1.  Introduction

   [RFC7938] defines a massive scale data center as one that contains
   over one hundred thousand servers.  It describes the advantages of
   using BGP as a routing protocol in a Clos switching fabric that
   connects these servers.  It laments the need to announce all routes
   individually, because of the problems associated with route
   aggregation.  A fabric design that scales to one million servers is
   considered enough for the foreseeable future and is the design goal
   of this document.  Of course, the design should also work for
   smaller fabrics.

   A switch fabric to connect one million servers will consist of
   between 35,000 and 130,000 switches and 1.5 million to 8 million
   links, depending on how redundantly the servers are connected to the
   fabric and the level of oversubscription in the fabric.  A switch
   that needs to store, send and operate on hundreds of routes is
   clearly cheaper than one that needs to store, send and operate on
   millions of links.

   A switch running BGP and aggregating its routes needs to send only
   one route.  In the ideal case, each switch receives just one route
   from each of its neighbors.  For each link or neighbor that fails,
   the switch should send just one extra route.  No single link failure
   needs to be known by every switch in the fabric, and some switch
   failures do not need to be known by every switch either.  The routes
   that advertise these failures should only propagate to those
   switches that need to know about them.  During normal operation,
   failures are few, so the advertisements are few.

   A route that advertises a failure is called a negative route.
   Negative routes are not a new idea, but they are unpopular, because
   they cause a number of problems.  This document solves those
   problems.

2.  Solution Overview

   In a Clos network, all northbound links can reach all destinations,
   and there is typically only one or very few southbound links to
   reach any specific destination.  Therefore, traffic from source to
   destination is spread over all available northbound links, reaches
   the spines and then concentrates southbound towards its destination.
   When a link fails, a spine will lose connectivity to some southbound
   destinations.  That means any northbound link to that spine also
   loses connectivity to the same destinations.

   When the fabric is fully connected with no failed links, the
   forwarding tables in the switches can simply contain multipath
   aggregate routes over all the northbound links.  Each of the
   multipath routes is the same, so traffic is spread out smoothly
   among these routes.  As soon as a link fails, the forwarding tables
   must exclude the resultant unreachable destinations from some of the
   northbound links.  The way to do that is to add specific routes for
   the failed destinations that point at the remaining links that can
   reach those destinations.  Since traffic will always prefer specific
   routes to aggregate routes, the traffic to the failed destinations
   will no longer take the aggregate routes.

   Two methods to create these specific routes are described.  One way
   is to send a negative route from the point where the failure is
   detected.  Receivers use the negative route to punch holes out of
   the aggregate routes and create the specific routes by subtracting
   the negative route from the aggregates.  This method is described
   starting at Section 4.  The other method creates the specific routes
   at the point of the failure and announces them in BGP.  This method
   is described starting at Section 9.
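   As a non-normative illustration of why this works, the following
   minimal sketch (hypothetical prefixes and link names) shows
   longest-prefix-match forwarding choosing a punched-out specific
   route over the covering aggregate, so only traffic to the failed
   destinations avoids the failed link:

      # Minimal longest-prefix-match demonstration (illustrative only;
      # prefixes and link names are hypothetical).
      import ipaddress

      # Forwarding table: a multipath aggregate plus one specific route
      # created after a failure made 10.1.5.0/24 unreachable via link3.
      fib = {
          ipaddress.ip_network("10.1.0.0/16"): ["link1", "link2", "link3"],
          ipaddress.ip_network("10.1.5.0/24"): ["link1", "link2"],
      }

      def lookup(addr):
          """Return the next hops of the longest matching prefix."""
          dest = ipaddress.ip_address(addr)
          matches = [net for net in fib if dest in net]
          return fib[max(matches, key=lambda net: net.prefixlen)]

      print(lookup("10.1.7.9"))  # ['link1', 'link2', 'link3'] (aggregate)
      print(lookup("10.1.5.9"))  # ['link1', 'link2'] (hole punched)

   Traffic to all other destinations in the aggregate continues to use
   every northbound link.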
3.  Problems with negative routes

   -  Massive failures can cause lots of negative routes and overwhelm
      the switches.

   -  In order for a switch to know what has failed, it must know what
      is supposed to be up.  Knowing this requires either an error-
      prone algorithm or an error-prone configuration.

   -  During certain network events that cause multiple routes to be
      sent and/or withdrawn, the messages may race each other and cause
      transient loss of connectivity to paths that were otherwise
      unaffected by the event.  This occurs in link state routing
      protocols as well.

   -  Computation of forwarding table entries may consume a lot of CPU
      time in pathological cases.  However, even in pathological cases,
      this is still much less CPU time than it takes to compute an SPF
      over a million links.

4.  Use of a negative route in BGP

   Three new BGP well known communities are defined:

   -  Hole-Punch: A route with this community can punch a hole out of
      another route with a shorter netmask that covers the address
      space of this route.

   -  Punch-Accept: A route with this community can have holes punched
      out of it by Hole-Punch routes.

   -  Do-not-Aggregate: Do not aggregate this route.

   A fabric switch will aggregate routes learned from neighbors to its
   south.  It must know all the routes that are expected to complete
   the aggregate.  It will announce the aggregate with the Punch-Accept
   community.  If any of the routes that are expected to complete the
   aggregate are missing, then it will announce those missing routes
   with the Hole-Punch and Do-not-Aggregate communities along with the
   aggregate route.
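   The following non-normative sketch (hypothetical names; the actual
   community code points would be assigned by IANA) illustrates this
   sender-side rule: the aggregate is announced with Punch-Accept, and
   every expected-but-missing component is announced with Hole-Punch
   and Do-not-Aggregate:

      # Sender-side sketch of the rule above (illustrative only; the
      # community names stand in for code points assigned by IANA).
      HOLE_PUNCH = "Hole-Punch"
      PUNCH_ACCEPT = "Punch-Accept"
      DO_NOT_AGGREGATE = "Do-not-Aggregate"

      def announcements(aggregate, expected, received):
          """Build the routes to announce northbound.

          aggregate -- the configured aggregate prefix
          expected  -- set of prefixes needed to complete the aggregate
          received  -- set of prefixes actually learned from the south
          """
          out = [(aggregate, {PUNCH_ACCEPT})]
          for prefix in sorted(expected - received):
              # A missing component becomes a negative route: it punches
              # a hole out of the aggregate and must not be re-aggregated.
              out.append((prefix, {HOLE_PUNCH, DO_NOT_AGGREGATE}))
          return out

      # Example: one component is missing after a southbound failure.
      expected = {"10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24"}
      received = {"10.1.0.0/24", "10.1.2.0/24"}
      for prefix, comms in announcements("10.1.0.0/16", expected, received):
          print(prefix, sorted(comms))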
   A receiver of a route with the Hole-Punch community will give it a
   lower than normal local preference and will search the BGP table for
   other routes with the following properties:

   -  a shorter netmask than this route,

   -  covers the address space of this route,

   -  has the Punch-Accept community,

   -  is installed in the Routing Table.

   This is the candidate set.  Then, it will remove any routes that
   have a shorter netmask than the route with the longest netmask in
   the set.  The final candidate set of routes will all have the same
   prefix.  For each route in the candidate set, BGP will create a new
   route with the same prefix as the Hole-Punch route and the same
   attributes as the Punch-Accept route.  This new route is called a
   chad route.  If a route has an MPLS label, then the label is
   considered part of the attributes, not part of the prefix.

   Chad routes will take part in bestpath and multipath selection.  If
   a chad route becomes a bestpath or a multipath, it will be installed
   in the Routing Table.  However, chad routes are not advertised by
   default.  That means if a chad route is bestpath and other routes
   exist for the same prefix, then no route is advertised for that
   prefix.

   If a chad route has the same nexthop (and MPLS label, if labels are
   used) as a Hole-Punch route of the same prefix, then the chad route
   becomes hidden.  Hidden means that it cannot take part in route
   selection.

5.  Implementation Notes to Reduce CPU Time Consumption

   This section is not normative.

   When a Punch-Accept route is received, BGP needs to scan a subtree
   of the BGP prefix table rooted at the prefix of the Punch-Accept
   route to look for Hole-Punch routes that might create chad routes
   from it.  That subtree could be large.  To reduce the number of
   routes to scan, a separate prefix table is created to store copies
   of the Hole-Punch routes.  The number of Hole-Punch routes is
   expected to be much smaller than the total number of routes.  That
   makes the scan much quicker.  The Hole-Punch routes must
   additionally be stored in the regular BGP route table.
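   The following non-normative sketch (illustrative types and names
   only) shows the receiver-side derivation of chad routes described in
   Section 4: collect installed Punch-Accept routes with a shorter
   netmask that cover the Hole-Punch prefix, keep only those with the
   longest netmask, and mint chad routes that carry the Punch-Accept
   attributes with the Hole-Punch prefix.  Chads that would share the
   Hole-Punch route's nexthop are hidden, which the sketch models by
   omitting them:

      # Receiver-side sketch of chad route creation (Section 4).
      # All names are illustrative; this is not a normative algorithm.
      import ipaddress
      from dataclasses import dataclass

      @dataclass(frozen=True)
      class Route:
          prefix: str            # e.g. "10.1.0.0/16"
          nexthop: str
          communities: frozenset
          installed: bool = True

      def chad_routes(hole_punch, bgp_table):
          hp_net = ipaddress.ip_network(hole_punch.prefix)
          # Candidate set: shorter netmask, covers the Hole-Punch
          # prefix, carries Punch-Accept, installed in the Routing Table.
          candidates = [
              r for r in bgp_table
              if "Punch-Accept" in r.communities and r.installed
              and ipaddress.ip_network(r.prefix).prefixlen < hp_net.prefixlen
              and ipaddress.ip_network(r.prefix).supernet_of(hp_net)
          ]
          if not candidates:
              return []
          # Keep only the candidates with the longest netmask.
          longest = max(ipaddress.ip_network(r.prefix).prefixlen
                        for r in candidates)
          candidates = [r for r in candidates
                        if ipaddress.ip_network(r.prefix).prefixlen == longest]
          # Chad route: Hole-Punch prefix, Punch-Accept attributes.  A
          # chad sharing the Hole-Punch nexthop is hidden (omitted here).
          return [Route(hole_punch.prefix, r.nexthop, r.communities)
                  for r in candidates if r.nexthop != hole_punch.nexthop]

      table = [
          Route("10.1.0.0/16", "spine1", frozenset({"Punch-Accept"})),
          Route("10.1.0.0/16", "spine2", frozenset({"Punch-Accept"})),
      ]
      hp = Route("10.1.5.0/24", "spine2", frozenset({"Hole-Punch"}))
      for chad in chad_routes(hp, table):
          print(chad.prefix, "via", chad.nexthop)  # 10.1.5.0/24 via spine1

   In the example, the hole punched by spine2 leaves a chad route that
   steers traffic for 10.1.5.0/24 to spine1 only.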
6.  Smooth Startup and Avoidance of Too Many Negative Routes

   When several switches of a data center fabric start up at the same
   time, many negative routes can be transiently created before the
   whole system is up.

   When the BGP process starts, it will typically start in receive-only
   mode for some time, then perform route selection and send out its
   own updates.  To ensure a smooth startup of the data center when
   many nodes start at the same time, the startup sequence is modified
   as follows.

   -  All BGP speakers SHOULD send EOR after sending all routes after
      the BGP session becomes established.

   -  When all southbound configured BGP neighbors have sent their EOR,
      the BGP speaker will perform route selection, send all updates to
      the northbound neighbors and then send EOR.  If some southbound
      neighbors cannot establish, a timer will be used to prevent
      waiting forever.

   -  After the previous step completes, when all northbound configured
      BGP neighbors have sent their EOR, the BGP speaker will perform
      route selection, send all updates to the southbound neighbors and
      then send EOR.  If some northbound neighbors cannot establish, a
      timer will be used to prevent waiting forever.

   -  If the number of received negative routes causes too many
      forwarding entries, then BGP can look for aggregate routes that
      are accompanied by many Hole-Punch routes and invalidate some of
      the aggregate routes and their accompanying Hole-Punch routes.
      If the number of received negative routes is too large to hold in
      the BGP table, then BGP can shut down the neighbor sessions that
      are sending the most negative routes.

7.  Avoidance of Transients

   If one event were to cause both an aggregate and a Hole-Punch route
   to be announced at the same time, but the Hole-Punch route were to
   arrive late, a transient could result.  The following rules prevent
   that.

   -  It is common practice for aggregate routes to be withdrawn when
      no components of the aggregate exist.  Hole-Punch routes need to
      always be announced, even if the aggregate is not.

   -  After a BGP session establishes, no routes that are received from
      it should be installed in the RIB until the EOR is received from
      that session.

   -  If overlapping Hole-Punch routes need to be updated and
      withdrawn, then the updates must be sent before the withdraws.

   -  If overlapping Hole-Punch and Punch-Accept routes need to be
      updated, then the Hole-Punch routes must be updated first.

   -  If overlapping Hole-Punch and Punch-Accept routes need to be
      withdrawn, then the Punch-Accept routes must be withdrawn first.

8.  Configuration

   All the BGP sessions need to be configured on each switch.  The BGP
   sessions need to be configured as northbound or southbound.  The
   routes that are expected to complete an aggregate route must be
   configured.

   A companion document describes a protocol that can discover and
   configure the entire fabric.  If that companion document is used,
   then no IP addresses, tier designations or any other location
   dependent configuration is required on the switches.

9.  South Triggered Automatic Disaggregation (STAD)

   In this method, a node that is south of a failed link or node
   announces its prefix(es) along alternative links with a hint to
   trigger automatic disaggregation or inhibit their suppression on
   upstream tier nodes.  These disaggregated or unsuppressed routes
   traverse along redundant paths and disjoint planes to switches in
   other clusters in the topology, where they are used in forwarding.

   The hint is in the form of a well known BGP community.  A few new
   well known communities are used in this scheme.

   -  Do-not-Aggregate: Do not aggregate this route.

   -  Tier: An Extended Community identifying the tier of the
      originated route.

   -  Dis-Aggregate: Triggers announcement of more specific routes at
      the receiving node.

   The techniques in this draft assume a Clos topology of the form
   described in Figure 3 of [RFC7938], where an access switch such as a
   TOR forms the lowest tier and is connected to multiple northbound
   upper tier switches, which in turn are connected to multiple upper
   tier switches, forming disjoint planes across the topology with
   fan-outs.

   A figure illustrating the topology will be added in a subsequent
   version.

   Upon a link failure, the node south of the failure announces its
   prefix on its other northbound BGP sessions with the
   Do-not-Aggregate community.

   A higher tier node that receives a route with a Do-not-Aggregate
   community will not suppress this route when there is a local
   covering aggregate, but will propagate it further as is.

   This procedure enables the more specific route to reach the
   appropriate tier switches in other clusters where the topology fans
   out on multiple northbound links.  The received paths for the more
   specific prefix form a multipath excluding the links that would lead
   to failed paths in the topology.

   A route that is advertised with the Do-not-Aggregate community as
   per this section will also add a Tier extended community.  If this
   extended community is present, then the Do-not-Aggregate community
   is only applicable at tiers that are more north than the tier
   indicated in the extended community.

   The Tier Extended Community ensures that the unsuppressed specific
   routes do not propagate further beyond the corresponding fan-out
   points in the other clusters.

   If all the northbound links or BGP sessions at a node have failed,
   then the node will announce its southbound route with the
   Dis-Aggregate community.  This signals all its south-side nodes to
   advertise their north-bound routes with the Do-not-Aggregate
   community along the other north-bound links.

   These techniques are applicable to any tier in the topology.

   At the lowest tier, if there are servers that are attached to more
   than one fabric switch (e.g., a TOR), then the host routes (or
   configured more-specific routes) for the server are not aggregated
   by the TOR to its connected upper tier switches.  In this case,
   these routes are aggregated by the upper-tier switches towards the
   rest of the topology.
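   The following non-normative sketch (the tier numbering, community
   names and suppression model are assumptions made for illustration)
   shows the receiver-side decision of whether a local covering
   aggregate may suppress a more specific route under these rules:

      # Receiver-side STAD sketch (Section 9).  Illustrative only.
      from dataclasses import dataclass
      from typing import Optional

      @dataclass
      class StadRoute:
          prefix: str
          do_not_aggregate: bool = False
          tier: Optional[int] = None   # Tier extended community, if any

      def suppress_by_aggregate(route, my_tier, has_covering_aggregate):
          """Return True if the more specific route should be suppressed
          in favor of the local covering aggregate at this node.

          Tiers are numbered so that larger numbers are further north.
          """
          if not has_covering_aggregate:
              return False              # nothing to suppress into
          if not route.do_not_aggregate:
              return True               # normal aggregation applies
          if route.tier is None:
              return False              # unconditional Do-not-Aggregate
          # Do-not-Aggregate applies only at tiers more north than the
          # Tier community, so the specific stops at the fan-out points.
          return my_tier <= route.tier

      r = StadRoute("10.1.5.0/24", do_not_aggregate=True, tier=2)
      print(suppress_by_aggregate(r, 3, True))   # False: propagate as is
      print(suppress_by_aggregate(r, 2, True))   # True: aggregate again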
10.  Configuration for STAD

   Each switch has the notion of northbound and southbound sessions or
   links.  In addition, it is assigned to a tier in the hierarchy.  The
   switch uses this configuration to drive the procedures described in
   the section above.  A switch at the lowest tier (e.g., a TOR) will
   have server subnet prefixes configured.  Switches at higher tiers
   have aggregates configured.

11.  Security Considerations

   TBD

12.  IANA Considerations

   TBD

13.  Acknowledgements

14.  References

14.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

14.2.  Informative References

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

Authors' Addresses

   Jakob Heitz
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: jheitz@cisco.com

   Dhananjaya Rao
   Cisco
   170 West Tasman Drive
   San Jose, CA 95134
   USA

   Email: dhrao@cisco.com