Network Working Group                                        P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Informational                          October 31, 2016
Expires: May 4, 2017


              Equal-Cost Multipath Considerations for BGP
               draft-lapukhov-bgp-ecmp-considerations-00

Abstract

   The BGP routing protocol, defined in RFC 4271, employs tie-breaking
   logic to elect a single best path among multiple possible paths.  At
   the same time, virtually all BGP implementations allow "equal-cost
   multipath" (ECMP) election and programming of multiple next hops in
   routing tables.  This document summarizes some common considerations
   for the ECMP logic, with the intent of providing a common reference
   for an otherwise unstandardized feature.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  AS-PATH attribute comparison
   3.  Multipath among eBGP-learned paths
   4.  Multipath among iBGP learned paths
   5.  Multipath among eBGP and iBGP paths
   6.  Multipath with AIGP
   7.  Best path advertisement
   8.  Multipath and non-deterministic tie-breaking
   9.  Weighted equal-cost multipath
   10. Informative References
   Author's Address

1.  Introduction

   Section 9.1.2.2 of [RFC4271] defines a step-by-step procedure for
   selecting a single "best path" among multiple alternatives available
   for the same NLRI (Network Layer Reachability Information) element.
   To improve efficiency in symmetric network topologies, it has become
   common practice to allow selecting multiple "equivalent" paths for
   the same prefix.  The most commonly used approach is to stop the
   tie-breaking process after comparing the interior (IGP) cost to the
   NEXT_HOP (step (e)).  All paths that remain equivalent at that point
   are selected; because of step (d), these are either all eBGP or all
   iBGP paths (see [BGPMP] for a vendor document explaining the logic).
   In other words, the steps that compare the BGP Identifier and the
   peer IP address (steps (f) and (g)) are skipped for the purpose of
   multipath routing.
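   The selection procedure above can be sketched in Python.  This is a
   minimal, non-normative sketch and is not taken from any
   implementation.  The Path fields are hypothetical, and the input is
   assumed to already be tied on degree of preference (LOCAL_PREF, see
   Section 9.1.1 of [RFC4271]).  Vendor-specific steps such as weight or
   route-reflection tie-breakers are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Path:
    """Hypothetical, simplified view of one BGP path for a single prefix."""
    as_path_len: int      # AS_PATH length as counted for step (a)
    origin: int           # 0 = IGP, 1 = EGP, 2 = INCOMPLETE; lower wins
    med: Optional[int]    # MULTI_EXIT_DISC, or None when absent
    neighbor_as: int      # neighboring ("peer") AS the path came from
    is_ebgp: bool         # learned over an eBGP session?
    igp_cost: int         # interior cost to reach the NEXT_HOP
    bgp_id: int           # BGP Identifier of the advertising speaker
    peer_addr: int        # peer address, as an integer for comparison


def keep_min(paths: List[Path], key: Callable[[Path], int]) -> List[Path]:
    """Keep only the paths tied for the smallest key value (paths must be non-empty)."""
    best = min(key(p) for p in paths)
    return [p for p in paths if key(p) == best]


def med_value(p: Path) -> int:
    # A path without MED is treated as having the lowest possible MED.
    return p.med if p.med is not None else 0


def multipath_set(paths: List[Path]) -> List[Path]:
    """Apply steps (a) to (e) of RFC 4271 Section 9.1.2.2 and return all survivors.

    Steps (f) and (g) are deliberately not applied: every survivor is
    treated as an equivalent path for ECMP.
    """
    paths = keep_min(paths, lambda p: p.as_path_len)   # (a) AS_PATH length
    paths = keep_min(paths, lambda p: p.origin)        # (b) ORIGIN
    # (c) MED is compared only among paths from the same neighbor AS.
    paths = [p for p in paths
             if not any(q.neighbor_as == p.neighbor_as
                        and med_value(q) < med_value(p) for q in paths)]
    if any(p.is_ebgp for p in paths):                  # (d) eBGP over iBGP
        paths = [p for p in paths if p.is_ebgp]
    return keep_min(paths, lambda p: p.igp_cost)       # (e) interior cost


def best_path(paths: List[Path]) -> Path:
    """The single best path: steps (f) and (g) applied to the multipath set."""
    return min(multipath_set(paths), key=lambda p: (p.bgp_id, p.peer_addr))
```

   For example, given two eBGP paths with equal AS_PATH length plus one
   iBGP path, multipath_set() returns both eBGP paths.  best_path() then
   picks the one with the lower BGP Identifier.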

   BGP implementations commonly have a configuration knob that specifies
   the maximum number of equivalent paths that may be programmed into
   the routing table.  There is also commonly a knob to enable multipath
   separately for iBGP-learned and eBGP-learned paths.

2.  AS-PATH attribute comparison

   A mandatory requirement is that all paths that are candidates for
   ECMP selection have the same AS_PATH length.  This length is computed
   using the standard logic defined in [RFC4271] and [RFC5065]: an
   AS_SET counts as a single AS regardless of how many AS numbers it
   contains, and AS_CONFED_SEQUENCE and AS_CONFED_SET segments are not
   counted at all.  Beyond this, the content of these segments is used
   purely for loop detection.  Assuming that the AS_PATH lengths
   computed in this fashion are the same, many implementations
   additionally require the content of the AS_SEQUENCE segments to be
   identical among all equivalent paths.  Two configuration knobs are
   commonly provided to relax this requirement:

   o  one requires only the AS_PATH lengths to be equal;

   o  the other requires that the first AS number in the first
      AS_SEQUENCE segment of the AS_PATH (often referred to as the "peer
      AS" number) be the same as in the best path, which is determined
      by running the full tie-breaking algorithm.

   This document refers to these two as the "multipath as-path relaxed"
   and "multipath same peer-as" knobs, respectively.

3.  Multipath among eBGP-learned paths

   Step (d) in Section 9.1.2.2 of [RFC4271] removes all iBGP paths from
   consideration if an eBGP path is present in the candidate set,
   leaving the BGP process with just eBGP paths.  At this point, the
   value of the mandatory NEXT_HOP attribute most commonly belongs to
   the IP subnet that the BGP speaker shares with the advertising
   neighbor.  In this case, implementations commonly treat all NEXT_HOP
   values as having the same interior cost, per the guidance of step
   (e) of Section 9.1.2.2.
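   The AS_PATH checks of Section 2 can be sketched in Python.  Every
   multipath candidate, including the eBGP paths discussed here, must
   pass them.  This is a non-normative illustration with the following
   assumptions:

   o  paths are modeled as lists of (segment type, AS numbers) tuples;

   o  "strict" mode compares the AS numbers of all AS_SEQUENCE segments
      in order;

   o  the mode strings are hypothetical stand-ins for the knobs named in
      Section 2, not the syntax of any implementation.

```python
from typing import List, Optional, Tuple

# Segment type codes: AS_SET and AS_SEQUENCE from RFC 4271,
# AS_CONFED_SEQUENCE and AS_CONFED_SET from RFC 5065.
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4

Segment = Tuple[int, Tuple[int, ...]]  # (segment type, AS numbers)


def as_path_length(segments: List[Segment]) -> int:
    """AS_PATH length as compared in step (a) of RFC 4271 Section 9.1.2.2."""
    length = 0
    for seg_type, asns in segments:
        if seg_type == AS_SEQUENCE:
            length += len(asns)   # every AS in a sequence counts
        elif seg_type == AS_SET:
            length += 1           # a whole AS_SET counts as one AS
        # AS_CONFED_SEQUENCE and AS_CONFED_SET are not counted
    return length


def sequence_asns(segments: List[Segment]) -> Tuple[int, ...]:
    """AS numbers of all AS_SEQUENCE segments, in path order."""
    return tuple(asn for seg_type, asns in segments
                 if seg_type == AS_SEQUENCE for asn in asns)


def peer_as(segments: List[Segment]) -> Optional[int]:
    """First AS number of the first AS_SEQUENCE segment (the "peer AS")."""
    for seg_type, asns in segments:
        if seg_type == AS_SEQUENCE and asns:
            return asns[0]
    return None


def as_path_compatible(best: List[Segment], candidate: List[Segment],
                       mode: str = "strict") -> bool:
    """Can `candidate` join the multipath set whose best path is `best`?

    Hypothetical modes:
      "strict"       - identical AS_SEQUENCE content (a common default)
      "relaxed"      - "multipath as-path relaxed": equal length only
      "same-peer-as" - "multipath same peer-as": equal length, same peer AS
    """
    if as_path_length(best) != as_path_length(candidate):
        return False  # equal length is mandatory in every mode
    if mode == "relaxed":
        return True
    if mode == "same-peer-as":
        return peer_as(best) == peer_as(candidate)
    if mode == "strict":
        return sequence_asns(best) == sequence_asns(candidate)
    raise ValueError(f"unknown mode: {mode!r}")
```

   Consider two paths from neighbor AS 65001 with AS_PATHs "65001 65010"
   and "65001 65020".  "strict" mode rejects them as equivalent.  The
   "same-peer-as" mode accepts them, because only the length and the
   first AS number must match.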

   In some cases, either static routing or an IGP could be running
   between the BGP speakers peering over an eBGP session.  An
   implementation may then use the metric learned from these sources to
   perform tie-breaking even for eBGP paths.

   Note that when the MED attribute is present in some paths, the set
   of allowed multipath routes will most likely be reduced to those
   coming from the same peer AS, per step (c) of Section 9.1.2.2.  The
   exception is an implementation that provides a configuration knob to
   always compare MED values across all paths, as recommended in
   [RFC4451].  With such a knob enabled, the presence of the MED
   attribute does not automatically narrow the candidate set to paths
   from the same peer AS.

4.  Multipath among iBGP learned paths

   When all paths for a prefix are learned via iBGP, tie-breaking
   commonly concludes with the IGP metric to the NEXT_HOP, since in most
   cases iBGP is used along with an underlying IGP.  Some
   implementations can also ignore the IGP cost if all of the paths are
   reachable via some kind of tunneling mechanism, such as MPLS
   [RFC3031].  This document refers to the knob enabling this behavior
   as "skip igp check".  Note that there is no standard way for a BGP
   speaker to detect the presence of such tunneling techniques other
   than relying on configuration settings.

   When iBGP is deployed with BGP route reflectors per [RFC4456], the
   path attributes may include the CLUSTER_LIST attribute.  Most
   implementations ignore it for the purpose of ECMP route selection,
   assuming that the IGP cost alone is sufficient for loop prevention.
   This assumption may not hold when no IGP is deployed and iBGP
   sessions are instead configured to set the NEXT_HOP attribute to
   self on every node (which also assumes the use of directly connected
   link addresses for session formation).
   In this case, ignoring the CLUSTER_LIST length might lead to routing
   loops.  It is therefore recommended that implementations provide a
   knob that enables accounting for the CLUSTER_LIST length when
   performing multipath route selection.  With this knob enabled, the
   CLUSTER_LIST length should effectively be used in place of the IGP
   metric.

   As in the route-reflector scenario, the use of BGP confederations
   assumes the presence of an IGP for proper loop prevention in
   multipath scenarios, with the IGP metric used as the final
   tie-breaker for multipath routing.  In addition, and similar to the
   eBGP case, implementations often require that equivalent paths be
   learned from the same member AS peer as the best path.  It is useful
   to have two configuration knobs:

   o  one enabling "multipath same confederation member peer-as";

   o  another enabling the less restrictive "confed as-path multipath
      relaxed", which allows selecting multipath routes via any
      confederation member peer AS.

   As mentioned in Section 2, the AS_CONFED_SEQUENCE length is usually
   ignored for the purpose of AS_PATH length comparison, relying on the
   IGP cost instead for loop prevention.

   If no IGP is present in a BGP confederation deployment then, as in
   the route-reflection case, it may be necessary to consider the
   AS_CONFED_SEQUENCE length when selecting equivalent routes,
   effectively using it as a substitute for the IGP metric.  A separate
   configuration knob is needed to enable this behavior.

   Per [RFC5065], paths learned over intra-confederation peering
   sessions are treated as iBGP.  There is no specification or
   operational document that defines how a mixed iBGP route-reflector
   and confederation model would work.  Therefore, this document neither
   considers this case nor makes recommendations for it.

5.  Multipath among eBGP and iBGP paths

   The best-path selection algorithm explicitly prefers eBGP paths over
   iBGP paths.  This includes paths learned from a BGP confederation
   member AS, which per [RFC5065] are treated the same as iBGP paths for
   the purpose of best-path selection.  In some cases, allowing
   multipath routing between eBGP-learned and iBGP-learned paths might
   be beneficial.  This is only possible if some sort of tunneling
   technique is used to reach both the eBGP and the iBGP paths.  If this
   feature is enabled, the equivalent routes are selected by stopping
   the tie-breaking process at the MED comparison, step (c) in Section
   9.1.2.2 of [RFC4271], so that the eBGP-over-iBGP preference of step
   (d) is not applied.

6.  Multipath with AIGP

   The AIGP attribute defined in [RFC7311] must be used for best-path
   selection prior to running any of the logic of Section 9.1.2.2.  Only
   the paths with the minimal AIGP metric value are eligible for further
   consideration by the tie-breaking rules.  The rest of the multipath
   selection logic remains the same.

7.  Best path advertisement

   Even though multiple equivalent paths may be selected for programming
   into the routing table, the BGP speaker always announces a single
   best path to its peers, unless the BGP "Add-Path" feature has been
   enabled as described in [I-D.ietf-idr-add-paths].  This unique best
   path is elected from the multipath set using the standard
   tie-breaking rules.

8.  Multipath and non-deterministic tie-breaking

   Some implementations use non-standard tie-breaking based on the
   "oldest path" rule.  This is generally not recommended, as it may
   interact with multipath route selection on downstream BGP speakers.
   After a route flap that affects the best path upstream, the original
   best path would not be restored, and the older path would continue to
   be advertised.  This may affect tie-breaking on downstream devices,
   for example if the AS_PATH content differs from that of the
   previously advertised path.

9.  Weighted equal-cost multipath

   The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions
   under which the iBGP multipath feature may inform the routing table
   of "weights" associated with the multiple paths.  That document
   defines applicability only for the iBGP case, although some
   implementations apply it to eBGP multipath as well.  The proposal
   does not change the equal-cost multipath selection logic; it only
   associates additional load-sharing attributes with the equivalent
   paths.

10.  Informative References

   [RFC3031]  Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
              Label Switching Architecture", RFC 3031,
              DOI 10.17487/RFC3031, January 2001.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006.

   [RFC4451]  McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED)
              Considerations", RFC 4451, DOI 10.17487/RFC4451, March
              2006.

   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
              Reflection: An Alternative to Full Mesh Internal BGP
              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006.

   [RFC5065]  Traina, P., McPherson, D., and J. Scudder, "Autonomous
              System Confederations for BGP", RFC 5065,
              DOI 10.17487/RFC5065, August 2007.

   [RFC7311]  Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
              "The Accumulated IGP Metric Attribute for BGP", RFC 7311,
              DOI 10.17487/RFC7311, August 2014.

   [I-D.ietf-idr-add-paths]
              Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP",
              draft-ietf-idr-add-paths-15 (work in progress), May 2016.

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [BGPMP]    "BGP Best Path Selection Algorithm".

Author's Address

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA 94025
   US

   Email: petr@fb.com