Network Working Group                                        P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Informational                               J. Tantsura
Expires: January 2, 2020                                    Apstra, Inc.
                                                            July 1, 2019


              Equal-Cost Multipath Considerations for BGP
               draft-lapukhov-bgp-ecmp-considerations-02

Abstract

   The Border Gateway Protocol (BGP), specified in RFC 4271, employs
   tie-breaking logic to select a single best path among the multiple
   paths available for a route, a procedure known as BGP best path
   selection.  At the same time, it is common practice to allow
   "equal-cost multipath" (ECMP) selection and the programming of
   multiple next hops in routing tables.  This document summarizes
   common considerations for ECMP logic when BGP is used as the routing
   protocol, with the intent of providing a common reference for an
   otherwise unstandardized set of features.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 2, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  AS-PATH attribute comparison
   3.  Multipath among eBGP-learned paths
   4.  Multipath among iBGP learned paths
   5.  Multipath among eBGP and iBGP paths
   6.  Multipath with AIGP
   7.  Best path advertisement
   8.  Multipath and non-deterministic tie-breaking
   9.  Weighted equal-cost multipath
   10. Informative References
   Authors' Addresses

1.  Introduction

   Section 9.1.2.2 of [RFC4271] defines a step-by-step tie-breaking
   procedure for selecting a single "best path" among the multiple
   alternatives available for the same route.  To improve efficiency in
   densely meshed, symmetric network topologies, it is common to allow
   the selection of multiple "equal-cost" paths for the same route.  A
   typical approach is to stop the tie-breaking process after comparing
   the IGP cost to the NEXT_HOP and to select either all of the eBGP
   paths or all of the iBGP paths that remain "equal" under the
   tie-breaking rules.  See [BGPMP] for a vendor document explaining
   this logic.  In a nutshell, the steps that compare the BGP
   Identifiers and the BGP peer IP addresses (steps (f) and (g) in
   Section 9.1.2.2 of [RFC4271]) are ignored for the purpose of
   multipath routing.  BGP implementations commonly have a
   configuration knob that specifies the maximum number of equal paths
   that are allowed to be programmed in the routing table.
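   The selection just described can be sketched in Python as follows.
   This is a minimal, non-authoritative sketch: the Path record, its
   field names, and the max_paths and always_compare_med parameters are
   hypothetical.  The input is assumed to be a non-empty set of paths
   that already share the highest degree of preference (for example,
   LOCAL_PREF), so only the tie-breaking steps of Section 9.1.2.2 of
   [RFC4271] remain.  The always_compare_med flag models the MED knob
   discussed in Section 3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Path:
    """Hypothetical, simplified view of one BGP path for a single prefix."""
    as_path_len: int      # computed as described in Section 2
    origin: int           # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: Optional[int]    # None when MULTI_EXIT_DISC is absent
    neighbor_as: int      # AS of the peer that advertised the path
    ebgp: bool            # True if learned from an external peer
    igp_cost: int         # interior cost to reach the NEXT_HOP
    router_id: int        # BGP Identifier of the advertising peer
    peer_addr: int        # peer address, encoded as an integer


def _keep_min(paths: List[Path], key) -> List[Path]:
    """Keep only the paths that share the minimal value of key."""
    best = min(key(p) for p in paths)
    return [p for p in paths if key(p) == best]


def _med(p: Path) -> int:
    # A missing MED is treated as the lowest possible value (RFC 4271).
    return p.med if p.med is not None else 0


def equal_cost_set(paths: List[Path],
                   always_compare_med: bool = False) -> List[Path]:
    """Apply steps (a) through (e) of RFC 4271, Section 9.1.2.2."""
    paths = _keep_min(paths, lambda p: p.as_path_len)       # step (a)
    paths = _keep_min(paths, lambda p: p.origin)            # step (b)
    if always_compare_med:                                  # step (c), knob
        paths = _keep_min(paths, _med)
    else:                                                   # step (c)
        # MED is only compared among paths from the same neighbor AS.
        paths = [p for p in paths
                 if not any(q.neighbor_as == p.neighbor_as
                            and _med(q) < _med(p) for q in paths)]
    if any(p.ebgp for p in paths):                          # step (d)
        paths = [p for p in paths if p.ebgp]
    return _keep_min(paths, lambda p: p.igp_cost)           # step (e)


def select_multipath(paths: List[Path], max_paths: int,
                     always_compare_med: bool = False
                     ) -> Tuple[Path, List[Path]]:
    """Return (best path, multipath set) for one prefix."""
    candidates = equal_cost_set(paths, always_compare_med)
    # Steps (f) and (g) still elect the single best path ...
    ordered = sorted(candidates, key=lambda p: (p.router_id, p.peer_addr))
    # ... but they remove nothing from the multipath set, which is only
    # truncated to the configured maximum (best path kept first).
    return ordered[0], ordered[:max_paths]
```

   Because steps (f) and (g) are used here only to order the
   candidates, every path that survives step (e) stays in the multipath
   set, up to max_paths.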

   Implementations commonly also provide separate knobs to enable
   multipath for iBGP-learned paths and for eBGP-learned paths.

2.  AS-PATH attribute comparison

   A mandatory requirement for all paths considered as candidates for
   ECMP selection is that they have the same AS_PATH length, computed
   using the logic defined in [RFC4271] and [RFC5065]: each AS_SET
   counts as one AS regardless of how many ASes it contains, and
   AS_CONFED_SEQUENCE and AS_CONFED_SET segments are not counted at all.
   Beyond that, the ASes listed in these segments are used only for
   routing loop prevention.  Assuming that the AS_PATH lengths computed
   in this fashion are the same, many implementations additionally
   require the content of the AS_SEQUENCE segments to be identical
   among all the paths considered.  Two configuration knobs that alter
   this behavior are commonly provided.  The first relaxes the
   AS_SEQUENCE comparison, enforcing only the AS_PATH length rule and
   ignoring the content of AS_SEQUENCE.  The second, instead of full
   content equality, requires that the first AS number in the first
   AS_SEQUENCE segment of the AS_PATH (often referred to as the "peer
   AS" number) be the same as the one found in the best path, as
   determined by running the full tie-breaking procedure.  This
   document refers to these two knobs as "multipath as-path relaxed"
   and "multipath same peer-as", respectively.

3.  Multipath among eBGP-learned paths

   Step (d) in Section 9.1.2.2 of [RFC4271] mandates that, when an eBGP
   path is present, all iBGP paths be removed from the set of ECMP
   candidates.  This leaves the BGP tie-breaking procedure with only
   eBGP paths.  At this point, the value of the mandatory NEXT_HOP
   attribute most commonly belongs to the IP subnet that the BGP
   speaker shares with the advertising neighbor.  In this case,
   implementations commonly treat all NEXT_HOP values as having the
   same "interior cost" to reach them, per the guidance of step (e) of
   Section 9.1.2.2.
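   For eBGP paths, which typically arrive from different neighbor ASes,
   the AS_PATH rules of Section 2 largely decide which paths may join
   the best path.  The sketch below illustrates those rules under a
   hypothetical representation of AS_PATH as a list of (segment type,
   AS numbers) pairs; the function and enum names are illustrative
   only, with the enum values reusing the knob names from Section 2.
   The segment type codes are those of [RFC4271] and [RFC5065].

```python
from enum import Enum
from typing import Optional, Sequence, Tuple

# Segment type codes: AS_SET and AS_SEQUENCE from RFC 4271,
# AS_CONFED_SEQUENCE and AS_CONFED_SET from RFC 5065.
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4

# Hypothetical representation: an AS_PATH is a sequence of
# (segment type, tuple of AS numbers) pairs.
Segment = Tuple[int, Tuple[int, ...]]


def as_path_length(as_path: Sequence[Segment]) -> int:
    """AS_PATH length as used for tie-breaking (Section 2)."""
    length = 0
    for seg_type, asns in as_path:
        if seg_type == AS_SEQUENCE:
            length += len(asns)
        elif seg_type == AS_SET:
            length += 1   # an AS_SET counts as one AS, whatever its size
        # confederation segments are not counted at all
    return length


def sequence_content(as_path: Sequence[Segment]) -> Tuple[int, ...]:
    """All AS numbers found in AS_SEQUENCE segments, in order."""
    return tuple(asn for seg_type, asns in as_path
                 if seg_type == AS_SEQUENCE for asn in asns)


def peer_as(as_path: Sequence[Segment]) -> Optional[int]:
    """First AS number of the first AS_SEQUENCE segment ("peer AS")."""
    for seg_type, asns in as_path:
        if seg_type == AS_SEQUENCE and asns:
            return asns[0]
    return None


class AsPathMode(Enum):
    STRICT = "default"                        # same AS_SEQUENCE content
    RELAXED = "multipath as-path relaxed"     # length only
    SAME_PEER_AS = "multipath same peer-as"   # length and peer AS


def may_join_multipath(best: Sequence[Segment],
                       candidate: Sequence[Segment],
                       mode: AsPathMode = AsPathMode.STRICT) -> bool:
    """Check the AS_PATH of a candidate against that of the best path."""
    if as_path_length(candidate) != as_path_length(best):
        return False              # mandatory in every mode
    if mode is AsPathMode.RELAXED:
        return True
    if mode is AsPathMode.SAME_PEER_AS:
        return peer_as(candidate) == peer_as(best)
    return sequence_content(candidate) == sequence_content(best)
```

   In the default (strict) mode, two eBGP paths learned from different
   neighbor ASes never qualify, because their AS_SEQUENCE contents
   differ at least in the first AS; this is the situation the relaxed
   knob addresses.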

   In some cases, either static routing or an IGP may be used between
   BGP speakers that peer over an eBGP session.  An implementation may
   then use the next-hop metric learned from these sources to perform
   tie-breaking even among eBGP paths.

   If the MED attribute is present in some paths, the set of multipath
   routes allowed will most likely be reduced to those coming from the
   same peer AS, per step (c) of Section 9.1.2.2.  The exception is an
   implementation that provides a configuration knob to always compare
   MED attributes across all paths, as recommended by [RFC4451].  In
   that case, the presence of the MED attribute does not automatically
   reduce the candidate path set to paths from the same peer AS.

4.  Multipath among iBGP learned paths

   In most cases, iBGP is used along with an underlying IGP.  Thus,
   when all paths for a prefix are learned via iBGP, tie-breaking
   commonly occurs based on the IGP metric to the NEXT_HOP.  In some
   implementations, it is possible to ignore the IGP cost as well if
   all of the paths are reachable via some kind of tunneling mechanism,
   such as MPLS [RFC3031].  This is enabled via a knob referred to in
   this document as "skip igp check".  Note that there is no standard
   way for a BGP speaker to detect the presence of such tunneling
   techniques other than relying on configuration settings.

   When iBGP is deployed with BGP route reflectors per [RFC4456], the
   path attributes may include the CLUSTER_LIST attribute.  Many
   implementations ignore it for the purpose of ECMP route selection,
   assuming that the IGP cost alone is sufficient for loop prevention.
   This assumption may not hold when no IGP is deployed and iBGP
   sessions are instead configured to reset the NEXT_HOP attribute to
   "self" on every node.  This setup also assumes that directly
   connected link addresses are used for session formation.
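   The iBGP comparison described in this section, including the "skip
   igp check" knob and the CLUSTER_LIST knob recommended below for
   deployments without an IGP, might be sketched as follows.  (For
   context, [RFC4456] places its own shorter-CLUSTER_LIST preference
   between steps (f) and (g), which multipath selection skips.)  The
   IbgpPath record and both flag names are hypothetical, and treating
   the CLUSTER_LIST length as a tie-breaker after the IGP cost is an
   assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class IbgpPath:
    """Hypothetical, simplified view of an iBGP-learned path."""
    next_hop: str
    igp_cost: int                        # 0 when no IGP metric is available
    cluster_list: Tuple[int, ...] = ()   # CLUSTER_LIST from RFC 4456


def ibgp_equal_paths(paths: List[IbgpPath],
                     skip_igp_check: bool = False,
                     count_cluster_list: bool = False) -> List[IbgpPath]:
    """Return the iBGP paths treated as equal for multipath.

    Both flags are hypothetical names for the knobs discussed in the
    text; neither is enabled by default.
    """
    def metric(p: IbgpPath) -> Tuple[int, int]:
        igp = 0 if skip_igp_check else p.igp_cost
        hops = len(p.cluster_list) if count_cluster_list else 0
        # Assumption: CLUSTER_LIST length breaks ties after the IGP cost,
        # so it acts as the metric when the IGP cost is equal or absent.
        return (igp, hops)

    best = min(metric(p) for p in paths)
    return [p for p in paths if metric(p) == best]
```

   With NEXT_HOP reset to self and no IGP, every path reports the same
   (zero) IGP cost, so without the CLUSTER_LIST knob all paths tie
   regardless of how many reflection hops they crossed.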

   In such a deployment, ignoring the CLUSTER_LIST length might lead to
   routing loops.  It is therefore recommended that implementations
   provide a knob that enables accounting for the CLUSTER_LIST length
   when performing multipath route selection.  Effectively, the
   CLUSTER_LIST length should then be used as an IGP metric.

   Similarly to the route-reflector scenario, the use of BGP
   confederations in multipath scenarios assumes the presence of an IGP
   for proper loop prevention and uses the IGP metric as the final
   tie-breaker for multipath routing.  In addition, and similar to the
   eBGP case, implementations often require that, in order to be
   considered equal, the paths belong to the same member-AS peer as the
   best path.  It is useful to have the following two configuration
   knobs: the first enabling "multipath same confederation member
   peer-as", and the second enabling the less restrictive
   "confed as-path multipath relaxed" rule, which allows selecting
   multipath routes reachable via any confederation member peer AS.  As
   mentioned above, the length of AS_CONFED_SEQUENCE is usually ignored
   for the purpose of AS_PATH length comparison, relying instead on the
   IGP cost for loop prevention.

   In cases where no IGP is present in a BGP confederation deployment,
   and similar to the route-reflection case, it may be necessary to
   consider the AS_CONFED_SEQUENCE length when selecting equivalent
   routes, effectively using it as a substitute for an IGP metric.  A
   separate configuration knob is needed to allow this behavior.

   Per [RFC5065], paths learned over BGP intra-confederation peering
   sessions are treated as iBGP.  There is no specification or
   operational document that defines how mixed iBGP route-reflector and
   confederation-based deployments would work together.  Therefore,
   this document does not make recommendations for that case.

5.  Multipath among eBGP and iBGP paths

   The best-path selection algorithm explicitly prefers eBGP paths over
   paths learned via iBGP or from a BGP confederation member AS (which,
   per [RFC5065], are treated the same as iBGP paths for the purpose of
   best-path selection).  In some cases, however, it might be
   beneficial to allow multipath routing between eBGP-learned and
   iBGP-learned paths.  This is only possible if some sort of tunneling
   technique is used to reach both the eBGP and iBGP paths.  If this
   feature is enabled, the equal routes are selected prior to the MED
   comparison step (c) in Section 9.1.2.2 of [RFC4271].

6.  Multipath with AIGP

   The AIGP attribute defined in [RFC7311] must be used for best-path
   selection prior to running any logic of Section 9.1.2.2 of
   [RFC4271].  Only the paths with the minimal AIGP metric value are
   eligible for further consideration by the tie-breaking rules.  The
   rest of the multipath selection logic remains the same.

7.  Best path advertisement

   Even though multiple equal paths may be selected for programming
   into the routing table, a BGP speaker advertises only a single best
   path to its peers, unless the BGP "Add-Path" feature described in
   [RFC7911] is enabled.  This unique best path is elected from the
   multipath set using the standard tie-breaking rules.

8.  Multipath and non-deterministic tie-breaking

   Some implementations may use non-standard tie-breaking logic, for
   example the "oldest path" rule (reference).  This is generally not
   recommended, as it may interact with multipath route selection on
   downstream BGP speakers.  For example, after a route flap that
   affects the best path upstream, the original best path would not be
   recovered, and the older path would continue to be advertised.  If
   the AS_PATH contents of the two paths differ, this could change the
   outcome of the tie-breaking rules on a downstream device.

9.  Weighted equal-cost multipath

   The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions
   under which the iBGP multipath feature may inform the routing table
   of "weights" associated with multiple external paths.
   [I-D.ietf-idr-link-bandwidth] defines the link bandwidth extended
   community, which carries this weight, as non-transitive and
   considers its applicability in the iBGP case only, although some
   implementations apply it to eBGP as well.  The proposal does not
   change the equal-cost multipath selection logic, but associates
   additional load-sharing attributes with equivalent paths.

10.  Informative References

   [BGPMP]    "BGP Best Path Selection Algorithm".

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-07
              (work in progress), March 2018.

   [RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
              Label Switching Architecture", RFC 3031,
              DOI 10.17487/RFC3031, January 2001.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC4451]  McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED)
              Considerations", RFC 4451, DOI 10.17487/RFC4451,
              March 2006.

   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
              Reflection: An Alternative to Full Mesh Internal BGP
              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006.

   [RFC5065]  Traina, P., McPherson, D., and J. Scudder, "Autonomous
              System Confederations for BGP", RFC 5065,
              DOI 10.17487/RFC5065, August 2007.

   [RFC7311]  Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
              "The Accumulated IGP Metric Attribute for BGP", RFC 7311,
              DOI 10.17487/RFC7311, August 2014.

   [RFC7911]  Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP", RFC 7911,
              DOI 10.17487/RFC7911, July 2016.

Authors' Addresses

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA  94025
   US

   Email: petr@fb.com


   Jeff Tantsura
   Apstra, Inc.
   Menlo Park, CA  94025
   US

   Email: jefftant.ietf@gmail.com