IDR Working Group                                            P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Informational                               J. Tantsura
Expires: November 18, 2020                                  Apstra, Inc.
                                                            May 17, 2020


              Equal-Cost Multipath Considerations for BGP
               draft-lapukhov-bgp-ecmp-considerations-04

Abstract

   BGP (Border Gateway Protocol) [RFC4271] employs tie-breaking logic to
   select a single best path among multiple paths available, known as
   BGP best path selection.  At the same time, it has become a common
   practice to allow for "equal-cost multipath" (ECMP) selection and
   programming of multiple next-hops in routing tables.  This document
   summarizes some common considerations for the ECMP logic when BGP is
   used as the routing protocol, with the intent of providing a common
   reference for an otherwise unstandardized set of features.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 18, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  AS-PATH attribute comparison
   3.  Multipath among eBGP-learned paths
   4.  Multipath among iBGP learned paths
   5.  Multipath among eBGP and iBGP paths
   6.  Multipath with AIGP
   7.  Best path advertisement
   8.  Multipath and non-deterministic tie-breaking
   9.  Weighted equal-cost multipath
   10. Acknowledgements
   11. Informative References
   Authors' Addresses

1.  Introduction

   Section 9.1.2.2 of [RFC4271] defines a step-by-step tie-breaking
   procedure for selecting a single "best-path" among multiple
   alternatives available for the same route.  In order to improve
   efficiency, in densely meshed symmetric network topologies it has
   become a common practice to allow selection of multiple "equal" paths
   for the same route.  The most commonly used approach is to abort the
   tie-breaking process after comparing the IGP cost for the NEXT_HOP
   attribute and to select either all eBGP or all iBGP paths that
   remained "equal" under the tie-breaking rules (see [BGPMP] for a
   vendor document explaining the logic).  In other words, the steps
   that compare the BGP identifiers and BGP peer IP addresses (steps (f)
   and (g) in [RFC4271]) are ignored for the purpose of multipath
   routing.
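
   The truncated tie-breaking described above can be illustrated with a
   minimal sketch.  It is not taken from any implementation: the Path
   record, its field names, and the select_multipath() helper with its
   max_paths limit are all hypothetical, and the sketch assumes that MED
   is compared across all paths regardless of the neighboring AS.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """Hypothetical, simplified view of one BGP path for a prefix."""
    local_pref: int
    as_path_len: int
    origin: int      # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: int
    is_ebgp: bool
    igp_cost: int    # internal cost to reach NEXT_HOP, step (e)
    router_id: int   # BGP identifier, step (f)
    peer_addr: str   # peer IP address, step (g)


def multipath_key(p: Path) -> tuple:
    # Smaller tuples are preferred: higher LOCAL_PREF, shorter AS_PATH,
    # lower ORIGIN, lower MED, eBGP over iBGP, lower IGP cost.
    # Assumption: MED is compared across all paths (simplification).
    return (-p.local_pref, p.as_path_len, p.origin, p.med,
            not p.is_ebgp, p.igp_cost)


def full_key(p: Path) -> tuple:
    # Full tie-breaking adds steps (f) and (g) to obtain a unique winner.
    return multipath_key(p) + (
        p.router_id, int(ipaddress.ip_address(p.peer_addr)))


def select_multipath(paths, max_paths):
    """Return (best path, multipath set limited to max_paths entries)."""
    best = min(paths, key=full_key)
    # Paths are "equal" when they tie with the best path on every
    # attribute up to and including the IGP cost; steps (f) and (g)
    # are ignored for this comparison.
    equal = sorted(
        (p for p in paths if multipath_key(p) == multipath_key(best)),
        key=full_key)
    return best, equal[:max_paths]
```

   Because the eBGP/iBGP flag is part of the comparison key, the
   resulting set contains either only eBGP or only iBGP paths, and the
   best path is always its first member.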
   BGP implementations commonly have a configuration knob that specifies
   the maximum number of equal paths that are allowed to be programmed
   in the routing table.  Commonly, there is also a knob to enable
   multipath separately for iBGP-learned or eBGP-learned paths.

2.  AS-PATH attribute comparison

   The mandatory requirement for all paths that are considered as
   candidates for ECMP selection is to have the same AS_PATH length,
   computed using the logic defined in [RFC4271] and [RFC5065], i.e.,
   counting an AS_SET as a single AS number regardless of its size and
   not counting the AS_CONFED_SEQUENCE and AS_CONFED_SET segments.  The
   content of those segments is used purely for loop detection and
   prevention.  Assuming that the AS_PATH lengths computed in this
   fashion are the same, many implementations require that the content
   of the AS_SEQUENCE segment be the same among all the paths
   considered.  Two common configuration knobs to alter this behaviour
   are usually provided.  The first relaxes the otherwise mandatory
   AS_SEQUENCE comparison rule, enforcing only the AS_PATH length rule
   while ignoring the content of AS_SEQUENCE.  The second requires that
   the first AS number in the first AS_SEQUENCE segment found in AS_PATH
   (often referred to as the "peer AS" number) be the same as the one
   found in the best path (determined by running the full tie-breaking
   procedure).  This document refers to those two as "multipath as-path
   relaxed" and "multipath same peer-as".

3.  Multipath among eBGP-learned paths

   Step (d) in Section 9.1.2.2 of [RFC4271] mandates that, in the
   presence of an eBGP path, all iBGP paths be removed from the ECMP
   candidate set.  This leaves the BGP tie-breaking procedure with just
   eBGP paths.  At this point, the mandatory BGP NEXT_HOP attribute
   value most commonly belongs to the IP subnet that the BGP speaker
   shares with the advertising neighbor.
   In this case, it is common for implementations to treat all NEXT_HOP
   values as having the same "internal cost" to reach them per the
   guidance of step (e) of Section 9.1.2.2.  In some cases, either
   static routing or an IGP routing protocol could be running between
   the BGP speakers peering over an eBGP session.  An implementation may
   use the metric discovered from the above sources to perform
   tie-breaking even for eBGP paths.

   Notice that when the MED attribute is present in some paths, the set
   of multipath routes allowed will most likely be reduced to the ones
   coming from the same peer AS, per step (c) of Section 9.1.2.2.  This
   is unless an implementation provides a configuration knob to always
   compare MED attributes across all paths, as recommended by [RFC4451].
   In the latter case, the presence of the MED attribute does not
   automatically reduce the candidate path set to the same peer AS only.

4.  Multipath among iBGP learned paths

   When all paths for a prefix are learned via iBGP, since in most cases
   iBGP is used along with an underlying IGP, the tie-breaking commonly
   occurs based on the IGP metric of the NEXT_HOP attribute.  In some
   implementations, it is however possible to ignore the IGP cost as
   well, if all of the paths are reachable via some kind of tunneling
   mechanism, such as MPLS [RFC3031].  This is enabled via a knob
   referred to in this document as "skip igp check".  Notice that there
   is no standard way for a BGP speaker to detect the presence of such
   tunneling techniques other than relying on the configuration
   settings.

   When iBGP is deployed with BGP route-reflectors per [RFC4456], the
   path attribute list may include the CLUSTER_LIST attribute.  Most
   implementations ignore it for the purpose of ECMP route selection,
   assuming that the IGP cost alone should be sufficient for loop
   prevention.
   This assumption may not hold when an IGP is not deployed, and instead
   iBGP sessions are configured to reset the NEXT_HOP attribute to self
   on every node (this also assumes the use of directly connected link
   addresses for session formation).  In this case, ignoring
   CLUSTER_LIST length might lead to routing loops.  It is therefore
   recommended for implementations to have a knob that enables
   accounting for CLUSTER_LIST length when performing multipath route
   selection.  In this case, the CLUSTER_LIST attribute length should
   effectively be used to replace the IGP metric.

   Similarly to the route-reflector scenario, the use of BGP
   confederations in multipath scenarios assumes the presence of an IGP
   for proper loop prevention and uses the IGP metric as the final
   tie-breaker for multipath routing.  In addition to that, and similar
   to the eBGP case, implementations often require that in order to be
   considered equal, paths under consideration must belong to the same
   peer member AS as the best-path.  It is useful to have the following
   two configuration knobs: one enabling "multipath same confederation
   member peer-as" and another enabling the less restrictive "confed
   as-path multipath relaxed" rule, which allows selecting multipath
   routes reachable via any confederation member peer AS.  As mentioned
   above, the AS_CONFED_SEQUENCE length is usually ignored for the
   purpose of AS_PATH length comparison, with loop prevention relying
   instead on the IGP cost.

   When an IGP is not present in a BGP confederation deployment, and
   similar to the route-reflection case, it may be necessary to consider
   AS_CONFED_SEQUENCE length when selecting the equivalent routes,
   effectively using it as a substitute for an IGP metric.  A separate
   configuration knob is needed to allow this behavior.

   Per [RFC5065], paths learned over BGP intra-confederation peering
   sessions are treated as iBGP.
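
   The substitution of CLUSTER_LIST or AS_CONFED_SEQUENCE length for
   the IGP metric discussed in this section can be sketched as follows.
   This is only an illustration under stated assumptions: the IbgpPath
   record and the keyword arguments skip_igp_check,
   use_cluster_list_len, and use_confed_seq_len are hypothetical names
   standing in for the configuration knobs, not options of any real
   implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IbgpPath:
    """Hypothetical, simplified iBGP-learned path."""
    igp_cost: int
    cluster_list: Tuple[int, ...] = ()
    as_confed_sequence: Tuple[int, ...] = ()


def internal_metric(path, *, skip_igp_check=False,
                    use_cluster_list_len=False,
                    use_confed_seq_len=False):
    """Metric used at the 'internal cost' step for multipath selection."""
    if use_cluster_list_len:
        # No IGP deployed: CLUSTER_LIST length replaces the IGP metric.
        return len(path.cluster_list)
    if use_confed_seq_len:
        # No IGP deployed: AS_CONFED_SEQUENCE length replaces it.
        return len(path.as_confed_sequence)
    if skip_igp_check:
        # Paths reached through tunnels: treat all costs as equal.
        return 0
    return path.igp_cost


def equal_internal_cost(paths, **knobs):
    """Keep only the paths that share the lowest internal metric."""
    metrics = [internal_metric(p, **knobs) for p in paths]
    lowest = min(metrics)
    return [p for p, m in zip(paths, metrics) if m == lowest]
```

   For example, equal_internal_cost(paths, use_cluster_list_len=True)
   keeps the paths with the shortest CLUSTER_LIST, whatever their IGP
   cost.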
   There is no specification or operational document that defines how
   mixed iBGP route-reflector and confederation based deployments would
   work together.  Therefore, this document does not make
   recommendations for the above case.

5.  Multipath among eBGP and iBGP paths

   The best-path selection algorithm explicitly prefers eBGP paths over
   iBGP paths (or paths learned from a BGP confederation member AS,
   which are, as per [RFC5065], treated the same as iBGP from the
   perspective of best-path selection).  In some cases however, it might
   be beneficial to allow multipath routing between eBGP and iBGP
   learned paths.  This is only possible if some sort of tunneling
   technique is used to reach both the eBGP and iBGP paths.  If this
   feature is enabled, the equal routes are selected prior to the MED
   comparison step (c) in Section 9.1.2.2 of [RFC4271].

6.  Multipath with AIGP

   The AIGP attribute defined in [RFC7311] must be used for best-path
   selection prior to running any logic of Section 9.1.2.2 of [RFC4271].
   Only the paths with the minimal value of the AIGP metric are eligible
   for further consideration under the tie-breaking rules.  The rest of
   the multipath selection logic remains the same.

7.  Best path advertisement

   Unless the BGP "Add-Path" feature described in [RFC7911] is enabled,
   and even though multiple equal paths may be selected for programming
   into the routing table, a BGP speaker announces only a single
   best-path to its peers.  The unique best-path is elected among the
   multipath set using the standard tie-breaking rules.

8.  Multipath and non-deterministic tie-breaking

   Some implementations may implement non-standard tie-breaking logic,
   for example using the oldest path rule (see [RFC5004] for an IETF
   reference and [BGPMP] for a vendor implementation example).  This is
   generally not recommended, and may interact with multipath route
   selection on downstream BGP speakers.
   That is, after a route flap that affects the best-path upstream, the
   original best path would not be recovered, and the older path would
   still be advertised, possibly affecting the tie-breaking rules on a
   downstream device if, for example, the AS_PATH contents are different
   from the previous ones.  Another side effect of using non-standard
   tie-breaking could be an increased number of BGP next-hop sets for
   prefixes learned from eBGP neighbors and advertised downstream
   towards iBGP neighbors.  This could potentially cause ECMP
   group/entry tables to overflow (depending on the platform) as the
   prefixes will be less coalesced.

9.  Weighted equal-cost multipath

   The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions
   where the iBGP multipath feature might inform the routing table of
   "weights" associated with the multiple external paths.
   [I-D.ietf-idr-link-bandwidth] defines the weight extended community
   attribute as non-transitive and considers its applicability for iBGP
   only, though there are implementations that apply it to eBGP as well.
   The proposal does not change the equal-cost multipath selection
   logic; it only associates additional load-sharing attributes with
   equivalent paths.

10.  Acknowledgements

   We would like to thank Diptanshu Singh for their reviews and valuable
   comments.

11.  Informative References

   [BGPMP]    "BGP Best Path Selection Algorithm".

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-07
              (work in progress), March 2018.

   [RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
              Label Switching Architecture", RFC 3031,
              DOI 10.17487/RFC3031, January 2001.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC4451]  McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED)
              Considerations", RFC 4451, DOI 10.17487/RFC4451, March
              2006.

   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
              Reflection: An Alternative to Full Mesh Internal BGP
              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006.

   [RFC5004]  Chen, E. and S. Sangli, "Avoid BGP Best Path Transitions
              from One External to Another", RFC 5004,
              DOI 10.17487/RFC5004, September 2007.

   [RFC5065]  Traina, P., McPherson, D., and J. Scudder, "Autonomous
              System Confederations for BGP", RFC 5065,
              DOI 10.17487/RFC5065, August 2007.

   [RFC7311]  Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
              "The Accumulated IGP Metric Attribute for BGP", RFC 7311,
              DOI 10.17487/RFC7311, August 2014.

   [RFC7911]  Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP", RFC 7911,
              DOI 10.17487/RFC7911, July 2016.

Authors' Addresses

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA  94025
   US

   Email: petr@fb.com


   Jeff Tantsura
   Apstra, Inc.
   333 Middlefield Rd #200
   Menlo Park, CA  94025
   US

   Email: jefftant.ietf@gmail.com