| draft-lapukhov-bgp-ecmp-considerations-00.txt | draft-lapukhov-bgp-ecmp-considerations-01.txt | |||
|---|---|---|---|---|
| Network Working Group P. Lapukhov | Network Working Group P. Lapukhov | |||
| Internet-Draft Facebook | Internet-Draft Facebook | |||
| Intended status: Informational October 31, 2016 | Intended status: Informational October 30, 2017 | |||
| Expires: May 4, 2017 | Expires: May 3, 2018 | |||
| Equal-Cost Multipath Considerations for BGP | Equal-Cost Multipath Considerations for BGP | |||
| draft-lapukhov-bgp-ecmp-considerations-00 | draft-lapukhov-bgp-ecmp-considerations-01 | |||
| Abstract | Abstract | |||
| BGP routing protocol defined in ([RFC4271]) employs tie-breaking | BGP routing protocol defined in ([RFC4271]) employs tie-breaking | |||
| logic to elect single best path among multiple possible. At the same | logic to elect single best path among multiple possible. At the same | |||
| time, it has been common in virtually all BGP implementations to | time, it has been common in all practical BGP implementations to | |||
| allow for "equal-cost multipath" (ECMP) election and programming of | allow for "equal-cost multipath" (ECMP) path election and programming | |||
| multiple next-hops in routing tables. This document summarizes some | of multiple next-hops in routing tables. This document provides | |||
| common considerations for the ECMP logic, with the intent of | some common considerations for the ECMP logic, with the intent of | |||
| providing common reference on otherwise unstandardized feature. | providing common reference on otherwise unstandardized feature. | |||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on May 4, 2017. | This Internet-Draft will expire on May 3, 2018. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2016 IETF Trust and the persons identified as the | Copyright (c) 2017 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (https://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| described in the Simplified BSD License. | described in the Simplified BSD License. | |||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| skipping to change at page 2, line 27 | skipping to change at page 2, line 27 | |||
| 9. Weighted equal-cost multipath . . . . . . . . . . . . . . . . 5 | 9. Weighted equal-cost multipath . . . . . . . . . . . . . . . . 5 | |||
| 10. Informative References . . . . . . . . . . . . . . . . . . . 5 | 10. Informative References . . . . . . . . . . . . . . . . . . . 5 | |||
| Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 6 | Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
| 1. Introduction | 1. Introduction | |||
| Section 9.1.2.2 of [RFC4271] defines a step-by-step procedure for | Section 9.1.2.2 of [RFC4271] defines a step-by-step procedure for | |||
| selecting a single "best-path" among multiple alternatives available for | selecting a single "best-path" among multiple alternatives available for | |||
| the same NLRI (Network Layer Reachability Information) element. In | the same NLRI (Network Layer Reachability Information) element. In | |||
| order to improve efficiency in symmetric network topologies it has | order to improve efficiency in symmetric network topologies it has | |||
| become common practice to allow for selecting multiple "equivalent" | become common practice to allow selecting multiple "equivalent" paths | |||
| paths for the same prefix. Most commonly used approach is to abort | for the same prefix. Most commonly used approach is to stop the tie- | |||
| the tie-breaking process after comparing the IGP cost for the | breaking process after comparing the IGP cost for the NEXT_HOP | |||
| NEXT_HOP attribute and selecting either all eBGP or all iBGP paths | attribute and selecting either all eBGP or all iBGP paths that | |||
| that remained equivalent under the tie-breaking rules (see [BGPMP] | remained equivalent under the tie-breaking rules (see [BGPMP] for a | |||
| for a vendor document explaining the logic). Basically, the steps | vendor document explaining the logic). Basically, the steps that | |||
| that compare the BGP identifier and BGP peer IP addresses (steps (f) | compare the BGP identifier and BGP peer IP addresses (steps (f) and | |||
| and (g)) are ignored for the purpose of multipath routing. BGP | (g)) are ignored for the purpose of multipath routing. BGP | |||
| implementations commonly have a configuration knob that specifies the | implementations commonly have a configuration knob that specifies the | |||
| maximum number of equivalent paths that may be programmed to the | maximum number of equivalent paths that may be programmed to the | |||
| routing table. There is also commonly a knob to enable multipath | routing table. There is also commonly a knob to enable multipath | |||
| separately for iBGP-learned or eBGP-learned paths. | separately for iBGP-learned or eBGP-learned paths. | |||
| 2. AS-PATH attribute comparison | 2. AS-PATH attribute comparison | |||
| A mandatory requirement is for all paths that are candidates for ECMP | A mandatory requirement is for all paths that are candidates for ECMP | |||
| selection to have the same AS_PATH length, computed using the | selection to have the same AS_PATH length, computed using the | |||
| standard logic defined in [RFC4271] and [RFC5065], i.e. ignoring the | standard logic defined in [RFC4271] and [RFC5065], i.e. ignoring the | |||
| skipping to change at page 3, line 25 | skipping to change at page 3, line 25 | |||
| point, the mandatory BGP NEXT_HOP attribute value most commonly | point, the mandatory BGP NEXT_HOP attribute value most commonly | |||
| belongs to the IP subnet that the BGP speaker shares with advertising | belongs to the IP subnet that the BGP speaker shares with advertising | |||
| neighbor. In this case, it is common for implementations to treat all | neighbor. In this case, it is common for implementations to treat all | |||
| NEXT_HOP values as having the same "internal cost" to reach them per | NEXT_HOP values as having the same "internal cost" to reach them per | |||
| the guidance of step (e) of Section 9.1.2.2. In some cases, either | the guidance of step (e) of Section 9.1.2.2. In some cases, either | |||
| static routing or an IGP routing protocol could be running between | static routing or an IGP routing protocol could be running between | |||
| the BGP speakers peering over eBGP session. An implementation may | the BGP speakers peering over eBGP session. An implementation may | |||
| use the metric discovered from the above sources to perform tie- | use the metric discovered from the above sources to perform tie- | |||
| breaking even for eBGP paths. | breaking even for eBGP paths. | |||
| Notice that in case when MED attribute is present in some paths, the | In case when MED attribute is present in some paths, the set of | |||
| set of allowed multipath routes will most likely be reduced to the | allowed multipath routes will most likely be reduced to the ones | |||
| ones coming from the same peer AS, per step (c) of Section 9.1.2.2. | coming from the same peer AS, per step (c) of Section 9.1.2.2. This | |||
| This is unless the implementation provided a configuration knob to | is unless the implementation provided a configuration knob to always | |||
| always compare MED attributes across all paths, as recommended in | compare MED attributes across all paths, as recommended in [RFC4451]. | |||
| [RFC4451]. In the latter case, the presence of MED attribute does | In the latter case, the presence of MED attribute does not narrow the | |||
| not automatically narrow the candidate path set only to the same peer | candidate path set only to the same peer AS. | |||
| AS. | ||||
| 4. Multipath among iBGP learned paths | 4. Multipath among iBGP learned paths | |||
| When all paths for a prefix are learned via iBGP, the tie-breaking | When all paths for a prefix are learned via iBGP, the tie-breaking | |||
| commonly occurs based on IGP metric of the NEXT_HOP attribute, since | commonly occurs based on IGP metric of the NEXT_HOP attribute, since | |||
| in most cases iBGP is used along with an underlying IGP. It is | in most cases iBGP is used along with an underlying IGP. It is | |||
| possible, in some implementations, to ignore the IGP cost as well, if | possible, in some implementations, to ignore the IGP cost as well, if | |||
| all of the paths are reachable via some kind of tunneling mechanism, | all of the paths are reachable via some kind of tunneling mechanism, | |||
| such as MPLS ([RFC3031]). This is enabled via a knob referred to as | such as MPLS ([RFC3031]). This is enabled via a knob referred to as | |||
| "skip igp check" in this document. Notice that there is no standard | "skip igp check" in this document. Notice that there is no standard | |||
| way for a BGP speaker to detect presence of such tunneling techniques | way for a BGP speaker to detect presence of such tunneling techniques | |||
| other than relying on configuration settings. | other than relying on configuration settings. | |||
| When iBGP is deployed with BGP route-reflectors per [RFC4456] the | When iBGP is deployed with BGP route-reflectors per [RFC4456] the | |||
| path attribute list may include the CLUSTER_LIST attribute. Most | path attribute list may include the CLUSTER_LIST attribute. Most | |||
| implementations commonly ignore it for the purpose of ECMP route | implementations commonly ignore it for the purpose of ECMP route | |||
| selection, assuming that IGP cost alone should be sufficient for loop | selection, assuming that IGP cost alone should be sufficient for loop | |||
| prevention. This assumption may not hold when IGP is not deployed, | prevention. This assumption may not hold when IGP is not deployed, | |||
| and instead iBGP sessions are configured to reset the NEXT_HOP | and instead iBGP sessions are configured to reset the NEXT_HOP | |||
| attribute to self on every node (this also assumes the use of | attribute on every node (this also assumes the use of directly | |||
| directly connected link addresses for session formation). In this | connected link IP addresses for session formation). In this case, | |||
| case, ignoring CLUSTER_LIST length might lead to routing loops. It | ignoring CLUSTER_LIST length might lead to routing loops. It is | |||
| is therefore recommended for implementations to have a knob that | therefore recommended for implementations to have a knob that enables | |||
| enables accounting for CLUSTER_LIST length when performing multipath | accounting for CLUSTER_LIST length when performing multipath route | |||
| route selection. In this case, CLUSTER_LIST attribute length should | selection. In this case, CLUSTER_LIST attribute length should be | |||
| be effectively used to replace the IGP metric. | effectively used to replace the IGP metric. | |||
| Similar to the route-reflector scenario, the use of BGP | Similar to the route-reflector scenario, the use of BGP | |||
| confederations assumes presence of an IGP for proper loop prevention | confederations assumes presence of an IGP for proper loop prevention | |||
| in multipath scenarios, and use the IGP metric as the final tie- | in multipath scenarios, and use the IGP metric as the final tie- | |||
| breaker for multipath routing. In addition to this, and similar to | breaker for multipath routing. In addition to this, and similar to | |||
| eBGP case, implementations often require that equivalent paths belong | eBGP case, implementations often require that equivalent paths belong | |||
| to the same peer member AS as the best-path. It is useful to have | to the same peer member AS as the best-path. It is useful to have | |||
| two configuration knobs, one enabling "multipath same confederation | two configuration knobs, one enabling "multipath same confederation | |||
| member peer-as" and another enabling less restrictive "confed as-path | member peer-as" and another enabling less restrictive "confed as-path | |||
| multipath relaxed", which allows selecting multipath routes going via | multipath relaxed", which allows selecting multipath routes going via | |||
| skipping to change at page 5, line 18 | skipping to change at page 5, line 18 | |||
| selection prior to running any logic of Section 9.1.2.2. Only the | selection prior to running any logic of Section 9.1.2.2. Only the | |||
| paths with minimal value of AIGP metric are eligible for further | paths with minimal value of AIGP metric are eligible for further | |||
| consideration of tie-breaking rules. The rest of multipath selection | consideration of tie-breaking rules. The rest of multipath selection | |||
| logic remains the same. | logic remains the same. | |||
| 7. Best path advertisement | 7. Best path advertisement | |||
| Even though multiple equivalent paths may be selected for | Even though multiple equivalent paths may be selected for | |||
| programming into the routing table, the BGP speaker always announces | programming into the routing table, the BGP speaker always announces | |||
| single best-path to its peers, unless BGP "Add-Path" feature has been | single best-path to its peers, unless BGP "Add-Path" feature has been | |||
| enabled as described in [I-D.ietf-idr-add-paths]. The unique best- | enabled as described in [RFC7911]. The unique best-path is elected | |||
| path is elected among the multi-path set using the standard tie- | among the multi-path set using the standard tie-breaking rules. | |||
| breaking rules. | ||||
| 8. Multipath and non-deterministic tie-breaking | 8. Multipath and non-deterministic tie-breaking | |||
| Some implementations may implement non-standard tie-breaking using | Some implementations may implement non-standard tie-breaking using | |||
| the oldest path rule. This is generally not recommended, and may | the oldest path rule to improve routing stability. This is generally | |||
| interact with multi-path route selection on downstream BGP speakers. | not recommended, and may interact with multi-path route selection on | |||
| That is, after a route flap that affects the best-path upstream, the | downstream BGP speakers. That is, after a route flap that affects | |||
| original best path would not be recovered, and the older path still | the best-path upstream, the original best path would not be | |||
| be advertised, possibly affecting the tie-breaking rules on down- | recovered, and the older path still be advertised, possibly affecting | |||
| stream device, for example if the AS_PATH contents are different from | the tie-breaking rules on down-stream device, for example if the | |||
| previous. | AS_PATH contents are different from previous. | |||
| 9. Weighted equal-cost multipath | 9. Weighted equal-cost multipath | |||
| The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions | The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions | |||
| where iBGP multipath feature might inform the routing table of the | where iBGP multipath feature might inform the routing table of the | |||
| "weights" associated with the multiple paths. The document defines | "weights" associated with the multiple paths. The document defines | |||
| the applicability only in iBGP case, though there are implementations | the applicability only in iBGP case, though there are implementations | |||
| that apply it to eBGP multipath as well. The proposal does not | that apply it to eBGP multipath as well. The proposal does not | |||
| change the equal-cost multipath selection logic, only associates | change the equal-cost multipath selection logic, only associates | |||
| additional load-sharing attributes with equivalent paths. | additional load-sharing attributes with equivalent paths. | |||
| 10. Informative References | 10. Informative References | |||
| [BGPMP] "BGP Best Path Selection Algorithm", | ||||
| <http://www.cisco.com/c/en/us/support/docs/ip/ | ||||
| border-gateway-protocol-bgp/13753-25.html>. | ||||
| [I-D.ietf-idr-link-bandwidth] | ||||
| Mohapatra, P. and R. Fernando, "BGP Link Bandwidth | ||||
| Extended Community", draft-ietf-idr-link-bandwidth-06 | ||||
| (work in progress), January 2013. | ||||
| [RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol | [RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol | |||
| Label Switching Architecture", RFC 3031, | Label Switching Architecture", RFC 3031, | |||
| DOI 10.17487/RFC3031, January 2001, | DOI 10.17487/RFC3031, January 2001, | |||
| <http://www.rfc-editor.org/info/rfc3031>. | <https://www.rfc-editor.org/info/rfc3031>. | |||
| [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A | [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A | |||
| Border Gateway Protocol 4 (BGP-4)", RFC 4271, | Border Gateway Protocol 4 (BGP-4)", RFC 4271, | |||
| DOI 10.17487/RFC4271, January 2006, | DOI 10.17487/RFC4271, January 2006, | |||
| <http://www.rfc-editor.org/info/rfc4271>. | <https://www.rfc-editor.org/info/rfc4271>. | |||
| [RFC4451] McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED) | [RFC4451] McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED) | |||
| Considerations", RFC 4451, DOI 10.17487/RFC4451, March | Considerations", RFC 4451, DOI 10.17487/RFC4451, March | |||
| 2006, <http://www.rfc-editor.org/info/rfc4451>. | 2006, <https://www.rfc-editor.org/info/rfc4451>. | |||
| [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route | [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route | |||
| Reflection: An Alternative to Full Mesh Internal BGP | Reflection: An Alternative to Full Mesh Internal BGP | |||
| (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006, | (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006, | |||
| <http://www.rfc-editor.org/info/rfc4456>. | <https://www.rfc-editor.org/info/rfc4456>. | |||
| [RFC5065] Traina, P., McPherson, D., and J. Scudder, "Autonomous | [RFC5065] Traina, P., McPherson, D., and J. Scudder, "Autonomous | |||
| System Confederations for BGP", RFC 5065, | System Confederations for BGP", RFC 5065, | |||
| DOI 10.17487/RFC5065, August 2007, | DOI 10.17487/RFC5065, August 2007, | |||
| <http://www.rfc-editor.org/info/rfc5065>. | <https://www.rfc-editor.org/info/rfc5065>. | |||
| [RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro, | [RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro, | |||
| "The Accumulated IGP Metric Attribute for BGP", RFC 7311, | "The Accumulated IGP Metric Attribute for BGP", RFC 7311, | |||
| DOI 10.17487/RFC7311, August 2014, | DOI 10.17487/RFC7311, August 2014, | |||
| <http://www.rfc-editor.org/info/rfc7311>. | <https://www.rfc-editor.org/info/rfc7311>. | |||
| [I-D.ietf-idr-add-paths] | ||||
| Walton, D., Retana, A., Chen, E., and J. Scudder, | ||||
| "Advertisement of Multiple Paths in BGP", draft-ietf-idr- | ||||
| add-paths-15 (work in progress), May 2016. | ||||
| [I-D.ietf-idr-link-bandwidth] | ||||
| Mohapatra, P. and R. Fernando, "BGP Link Bandwidth | ||||
| Extended Community", draft-ietf-idr-link-bandwidth-06 | ||||
| (work in progress), January 2013. | ||||
| [BGPMP] "BGP Best Path Selection Algorithm", | [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, | |||
| <http://www.cisco.com/c/en/us/support/docs/ip/ | "Advertisement of Multiple Paths in BGP", RFC 7911, | |||
| border-gateway-protocol-bgp/13753-25.html>. | DOI 10.17487/RFC7911, July 2016, | |||
| <https://www.rfc-editor.org/info/rfc7911>. | ||||
| Author's Address | Author's Address | |||
| Petr Lapukhov | Petr Lapukhov | |||
| 1 Hacker Way | 1 Hacker Way | |||
| Menlo Park, CA 94025 | Menlo Park, CA 94025 | |||
| US | US | |||
| Email: petr@fb.com | Email: petr@fb.com | |||
| End of changes. 20 change blocks. | ||||
| 63 lines changed or deleted | 61 lines changed or added | |||
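
The multipath election described in Sections 1 through 4 of the draft (stop tie-breaking after the IGP cost comparison, keep only all-eBGP or all-iBGP paths that remain equivalent, cap the result with a maximum-paths knob) can be sketched in Python. This is a minimal illustration, not code from the draft or from any BGP implementation. The `Path` fields, the functions `multipath_key` and `select_multipath`, and the keyword knobs `skip_igp_check` and `use_cluster_list_len` are hypothetical names chosen for this example. The sketch also assumes that MED is compared across all paths (the "always compare MED" behavior) and omits the peer-address comparison of step (g).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    """Illustrative subset of BGP path attributes (hypothetical structure)."""
    next_hop: str
    local_pref: int
    as_path_len: int
    origin: int            # 0 = IGP, 1 = EGP, 2 = INCOMPLETE
    med: int
    is_ebgp: bool
    igp_cost: int          # interior cost to reach NEXT_HOP
    router_id: int         # BGP identifier of the advertising peer
    cluster_list_len: int = 0


def multipath_key(p: Path, skip_igp_check: bool = False,
                  use_cluster_list_len: bool = False) -> Tuple:
    """Attributes that must be equal for two paths to be ECMP-equivalent.

    Lower tuples are preferred. The BGP identifier is deliberately left out,
    since the draft says that step is ignored for multipath purposes.
    """
    if use_cluster_list_len:
        interior = p.cluster_list_len   # knob: CLUSTER_LIST length replaces IGP metric
    elif skip_igp_check:
        interior = 0                    # knob: ignore IGP cost (e.g. tunneled next hops)
    else:
        interior = p.igp_cost
    # not p.is_ebgp sorts eBGP (False) ahead of iBGP (True) and also keeps
    # the multipath set all-eBGP or all-iBGP.
    return (-p.local_pref, p.as_path_len, p.origin, p.med,
            not p.is_ebgp, interior)


def select_multipath(paths: List[Path], max_paths: int = 1,
                     **knobs: bool) -> List[Path]:
    """Return the best path first, followed by its ECMP-equivalent paths."""
    if not paths:
        return []
    # Full ordering adds the BGP identifier so a unique best path exists.
    ranked = sorted(paths, key=lambda p: multipath_key(p, **knobs) + (p.router_id,))
    best_key = multipath_key(ranked[0], **knobs)
    equal = [p for p in ranked if multipath_key(p, **knobs) == best_key]
    return equal[:max_paths]


if __name__ == "__main__":
    candidates = [
        Path("10.0.0.1", 100, 2, 0, 0, True, 0, router_id=3),
        Path("10.0.0.2", 100, 2, 0, 0, True, 0, router_id=1),
        Path("10.0.0.3", 100, 3, 0, 0, True, 0, router_id=2),   # longer AS_PATH
        Path("10.0.0.4", 100, 2, 0, 0, False, 10, router_id=0),  # iBGP-learned
    ]
    for p in select_multipath(candidates, max_paths=4):
        print(p.next_hop)
```

The first element of the returned list is the unique best path, which is the one a speaker would advertise to its peers when Add-Path is not in use; the remaining elements are the additional next hops that would be programmed into the routing table.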