< draft-lapukhov-bgp-ecmp-considerations-00.txt   draft-lapukhov-bgp-ecmp-considerations-01.txt >
Network Working Group P. Lapukhov Network Working Group P. Lapukhov
Internet-Draft Facebook Internet-Draft Facebook
Intended status: Informational October 31, 2016 Intended status: Informational October 30, 2017
Expires: May 4, 2017 Expires: May 3, 2018
Equal-Cost Multipath Considerations for BGP Equal-Cost Multipath Considerations for BGP
draft-lapukhov-bgp-ecmp-considerations-00 draft-lapukhov-bgp-ecmp-considerations-01
Abstract Abstract
BGP routing protocol defined in ([RFC4271]) employs tie-breaking BGP routing protocol defined in ([RFC4271]) employs tie-breaking
logic to elect single best path among multiple possible. At the same logic to elect single best path among multiple possible. At the same
time, it has been common in virtually all BGP implementations to time, it has been common in all practical BGP implementations to
allow for "equal-cost multipath" (ECMP) election and programming of allow for "equal-cost multipath" (ECMP) path election and programming
multiple next-hops in routing tables. This document summarizes some of multiple next-hops in routing tables. This document provides
common considerations for the ECMP logic, with the intent of some common considerations for the ECMP logic, with the intent of
providing common reference on otherwise unstandardized feature. providing common reference on otherwise unstandardized feature.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 4, 2017. This Internet-Draft will expire on May 3, 2018.
Copyright Notice Copyright Notice
Copyright (c) 2016 IETF Trust and the persons identified as the Copyright (c) 2017 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
skipping to change at page 2, line 27 skipping to change at page 2, line 27
9. Weighted equal-cost multipath . . . . . . . . . . . . . . . . 5 9. Weighted equal-cost multipath . . . . . . . . . . . . . . . . 5
10. Informative References . . . . . . . . . . . . . . . . . . . 5 10. Informative References . . . . . . . . . . . . . . . . . . . 5
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 6 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 6
1. Introduction 1. Introduction
Section 9.1.2.2 of [RFC4271] defines step-by-step procedure for Section 9.1.2.2 of [RFC4271] defines step-by-step procedure for
selecting single "best-path" among multiple alternatives available for selecting single "best-path" among multiple alternatives available for
the same NLRI (Network Layer Reachability Information) element. In the same NLRI (Network Layer Reachability Information) element. In
order to improve efficiency in symmetric network topologies it has order to improve efficiency in symmetric network topologies it has
become common practice to allow for selecting multiple "equivalent" become common practice to allow selecting multiple "equivalent" paths
paths for the same prefix. Most commonly used approach is to abort for the same prefix. Most commonly used approach is to stop the tie-
the tie-breaking process after comparing the IGP cost for the breaking process after comparing the IGP cost for the NEXT_HOP
NEXT_HOP attribute and selecting either all eBGP or all iBGP paths attribute and selecting either all eBGP or all iBGP paths that
that remained equivalent under the tie-breaking rules (see [BGPMP] remained equivalent under the tie-breaking rules (see [BGPMP] for a
for a vendor document explaining the logic). Basically, the steps vendor document explaining the logic). Basically, the steps that
that compare the BGP identifier and BGP peer IP addresses (steps (f) compare the BGP identifier and BGP peer IP addresses (steps (f) and
and (g)) are ignored for the purpose of multipath routing. BGP (g)) are ignored for the purpose of multipath routing. BGP
implementations commonly have a configuration knob that specifies the implementations commonly have a configuration knob that specifies the
maximum number of equivalent paths that may be programmed to the maximum number of equivalent paths that may be programmed to the
routing table. There is also commonly a knob to enable multipath routing table. There is also commonly a knob to enable multipath
separately for iBGP-learned or eBGP-learned paths. separately for iBGP-learned or eBGP-learned paths.
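The selection logic described above can be sketched as follows. This is a minimal illustration, not any vendor's implementation. The Path fields, the function names and the knob names (max_paths, ebgp_multipath, ibgp_multipath) are hypothetical. The sketch assumes a non-empty path list and compares MED across all paths, which is a simplification of step (c).

```python
from dataclasses import dataclass
from ipaddress import ip_address


@dataclass(frozen=True)
class Path:
    peer_addr: str      # BGP peer IP address, step (g)
    bgp_id: int         # BGP identifier of the peer, step (f)
    local_pref: int
    as_path_len: int    # step (a)
    origin: int         # step (b): 0=IGP, 1=EGP, 2=INCOMPLETE
    med: int            # step (c), simplified: compared across all paths
    is_ebgp: bool       # step (d)
    igp_cost: int       # step (e): interior cost to the NEXT_HOP


def multipath_key(p):
    # Everything compared up to and including the IGP cost; lower wins.
    return (-p.local_pref, p.as_path_len, p.origin, p.med,
            0 if p.is_ebgp else 1, p.igp_cost)


def best_path_key(p):
    # Full tie-breaking: adds steps (f) and (g) to elect one best path.
    return multipath_key(p) + (p.bgp_id, ip_address(p.peer_addr))


def select_paths(paths, max_paths=1, ebgp_multipath=False, ibgp_multipath=False):
    ordered = sorted(paths, key=best_path_key)
    best = ordered[0]
    # Multipath is enabled separately for eBGP-learned and iBGP-learned paths.
    enabled = ebgp_multipath if best.is_ebgp else ibgp_multipath
    if not enabled:
        return [best]
    # Steps (f) and (g) are ignored: keep every path tied with the best one.
    # is_ebgp is part of the key, so the result is all eBGP or all iBGP.
    equal = [p for p in ordered if multipath_key(p) == multipath_key(best)]
    return equal[:max_paths]
```

With multipath disabled the function returns only the best path. With it enabled, the best path is always the first element of the result, followed by up to max_paths - 1 equivalent paths.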
2. AS-PATH attribute comparison 2. AS-PATH attribute comparison
A mandatory requirement is for all paths that are candidates for ECMP A mandatory requirement is for all paths that are candidates for ECMP
selection to have the same AS_PATH length, computed using the selection to have the same AS_PATH length, computed using the
standard logic defined in [RFC4271] and [RFC5065], i.e. ignoring the standard logic defined in [RFC4271] and [RFC5065], i.e. ignoring the
skipping to change at page 3, line 25 skipping to change at page 3, line 25
point, the mandatory BGP NEXT_HOP attribute value most commonly point, the mandatory BGP NEXT_HOP attribute value most commonly
belongs to the IP subnet that the BGP speaker shares with advertising belongs to the IP subnet that the BGP speaker shares with advertising
neighbor. In this case, it is common for implementations to treat all neighbor. In this case, it is common for implementations to treat all
NEXT_HOP values as having the same "internal cost" to reach them per NEXT_HOP values as having the same "internal cost" to reach them per
the guidance of step (e) of Section 9.1.2.2. In some cases, either the guidance of step (e) of Section 9.1.2.2. In some cases, either
static routing or an IGP routing protocol could be running between static routing or an IGP routing protocol could be running between
the BGP speakers peering over an eBGP session. An implementation may the BGP speakers peering over an eBGP session. An implementation may
use the metric discovered from the above sources to perform tie- use the metric discovered from the above sources to perform tie-
breaking even for eBGP paths. breaking even for eBGP paths.
Notice that in case when MED attribute is present in some paths, the In case when MED attribute is present in some paths, the set of
set of allowed multipath routes will most likely be reduced to the allowed multipath routes will most likely be reduced to the ones
ones coming from the same peer AS, per step (c) of Section 9.1.2.2. coming from the same peer AS, per step (c) of Section 9.1.2.2. This
This is unless the implementation provided a configuration knob to is unless the implementation provided a configuration knob to always
always compare MED attributes across all paths, as recommended in compare MED attributes across all paths, as recommended in [RFC4451].
[RFC4451]. In the latter case, the presence of MED attribute does In the latter case, the presence of MED attribute does not narrow the
not automatically narrow the candidate path set only to the same peer candidate path set only to the same peer AS.
AS.
4. Multipath among iBGP learned paths 4. Multipath among iBGP learned paths
When all paths for a prefix are learned via iBGP, the tie-breaking When all paths for a prefix are learned via iBGP, the tie-breaking
commonly occurs based on IGP metric of the NEXT_HOP attribute, since commonly occurs based on IGP metric of the NEXT_HOP attribute, since
in most cases iBGP is used along with an underlying IGP. It is in most cases iBGP is used along with an underlying IGP. It is
possible, in some implementations, to ignore the IGP cost as well, if possible, in some implementations, to ignore the IGP cost as well, if
all of the paths are reachable via some kind of tunneling mechanism, all of the paths are reachable via some kind of tunneling mechanism,
such as MPLS ([RFC3031]). This is enabled via a knob referred to as such as MPLS ([RFC3031]). This is enabled via a knob referred to as
"skip igp check" in this document. Notice that there is no standard "skip igp check" in this document. Notice that there is no standard
way for a BGP speaker to detect presence of such tunneling techniques way for a BGP speaker to detect presence of such tunneling techniques
other than relying on configuration settings. other than relying on configuration settings.
When iBGP is deployed with BGP route-reflectors per [RFC4456] the When iBGP is deployed with BGP route-reflectors per [RFC4456] the
path attribute list may include the CLUSTER_LIST attribute. Most path attribute list may include the CLUSTER_LIST attribute. Most
implementations commonly ignore it for the purpose of ECMP route implementations commonly ignore it for the purpose of ECMP route
selection, assuming that IGP cost alone should be sufficient for loop selection, assuming that IGP cost alone should be sufficient for loop
prevention. This assumption may not hold when IGP is not deployed, prevention. This assumption may not hold when IGP is not deployed,
and instead iBGP sessions are configured to reset the NEXT_HOP and instead iBGP sessions are configured to reset the NEXT_HOP
attribute to self on every node (this also assumes the use of attribute on every node (this also assumes the use of directly
directly connected link addresses for session formation). In this connected link IP addresses for session formation). In this case,
case, ignoring CLUSTER_LIST length might lead to routing loops. It ignoring CLUSTER_LIST length might lead to routing loops. It is
is therefore recommended for implementations to have a knob that therefore recommended for implementations to have a knob that enables
enables accounting for CLUSTER_LIST length when performing multipath accounting for CLUSTER_LIST length when performing multipath route
route selection. In this case, CLUSTER_LIST attribute length should selection. In this case, CLUSTER_LIST attribute length should be
be effectively used to replace the IGP metric. effectively used to replace the IGP metric.
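The interaction of the two iBGP knobs discussed above can be sketched as follows. This is an illustration under stated assumptions: paths are plain dictionaries, the candidates are already tied with the best path on all earlier steps, and the knob name cluster_list_as_metric is hypothetical ("skip igp check" is the name used in this document). Giving the CLUSTER_LIST knob precedence when both are set is also an assumption.

```python
def ibgp_multipath_metric(path, skip_igp_check=False, cluster_list_as_metric=False):
    # Value that must match the best path for a candidate to join the ECMP set.
    if cluster_list_as_metric:
        # CLUSTER_LIST length replaces the IGP metric (no IGP deployed).
        return len(path["cluster_list"])
    if skip_igp_check:
        # All next-hops are reached via tunnels: IGP cost is ignored.
        return 0
    return path["igp_cost"]


def ibgp_multipath_set(best, candidates, **knobs):
    target = ibgp_multipath_metric(best, **knobs)
    return [p for p in candidates if ibgp_multipath_metric(p, **knobs) == target]
```

For example, a candidate with the same IGP cost as the best path but a longer CLUSTER_LIST is accepted by default and rejected once cluster_list_as_metric is set.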
Similar to the route-reflector scenario, the use of BGP Similar to the route-reflector scenario, the use of BGP
confederations assumes presence of an IGP for proper loop prevention confederations assumes presence of an IGP for proper loop prevention
in multipath scenarios, and uses the IGP metric as the final tie- in multipath scenarios, and uses the IGP metric as the final tie-
breaker for multipath routing. In addition to this, and similar to breaker for multipath routing. In addition to this, and similar to
the eBGP case, implementations often require that equivalent paths belong the eBGP case, implementations often require that equivalent paths belong
to the same peer member AS as the best-path. It is useful to have to the same peer member AS as the best-path. It is useful to have
two configuration knobs, one enabling "multipath same confederation two configuration knobs, one enabling "multipath same confederation
member peer-as" and another enabling less restrictive "confed as-path member peer-as" and another enabling less restrictive "confed as-path
multipath relaxed", which allows selecting multipath routes going via multipath relaxed", which allows selecting multipath routes going via
skipping to change at page 5, line 18 skipping to change at page 5, line 18
selection prior to running any logic of Section 9.1.2.2. Only the selection prior to running any logic of Section 9.1.2.2. Only the
paths with minimal value of AIGP metric are eligible for further paths with minimal value of AIGP metric are eligible for further
consideration of tie-breaking rules. The rest of multipath selection consideration of tie-breaking rules. The rest of multipath selection
logic remains the same. logic remains the same.
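The AIGP pre-filter described above can be sketched as follows. This is an illustration only. Paths are plain dictionaries with a hypothetical "aigp" key. Dropping paths that lack the attribute when at least one path carries it is an assumption made here, and the sketch compares the raw AIGP value as this document describes, without adding the interior cost to the next-hop.

```python
def aigp_eligible(paths):
    # Paths carrying an AIGP metric (None or a missing key means "no attribute").
    with_aigp = [p for p in paths if p.get("aigp") is not None]
    if not with_aigp:
        # No path carries AIGP: nothing to filter on.
        return list(paths)
    lowest = min(p["aigp"] for p in with_aigp)
    # Only paths with the minimal AIGP value go on to the tie-breaking rules.
    return [p for p in with_aigp if p["aigp"] == lowest]
```

The returned list is then fed to the regular multipath selection logic unchanged.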
7. Best path advertisement 7. Best path advertisement
Even though multiple equivalent paths may be selected for Even though multiple equivalent paths may be selected for
programming into the routing table, the BGP speaker always announces programming into the routing table, the BGP speaker always announces
single best-path to its peers, unless BGP "Add-Path" feature has been single best-path to its peers, unless BGP "Add-Path" feature has been
enabled as described in [I-D.ietf-idr-add-paths]. The unique best- enabled as described in [RFC7911]. The unique best-path is elected
path is elected among the multi-path set using the standard tie- among the multi-path set using the standard tie-breaking rules.
breaking rules.
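The advertisement rule above can be sketched as follows. This is an illustration only: paths are plain dictionaries with hypothetical keys "bgp_id" and "peer_addr", the multipath set is assumed to be non-empty and already tied on every step up to the IGP cost, and advertising the whole set when Add-Path is enabled is an assumption (the actual choice of paths is implementation and policy specific).

```python
from ipaddress import ip_address


def elect_best(multipath_set):
    # Remaining tie-breakers: lowest BGP identifier, then lowest peer address.
    return min(multipath_set,
               key=lambda p: (p["bgp_id"], ip_address(p["peer_addr"])))


def paths_to_advertise(multipath_set, add_path=False):
    if add_path:
        # Assumption: with Add-Path, every path in the set may be announced.
        return list(multipath_set)
    # Without Add-Path, exactly one best path is announced to peers.
    return [elect_best(multipath_set)]
```

Peer addresses are compared numerically rather than as strings, so 192.0.2.9 is preferred over 192.0.2.10.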
8. Multipath and non-deterministic tie-breaking 8. Multipath and non-deterministic tie-breaking
Some implementations may implement non-standard tie-breaking using Some implementations may implement non-standard tie-breaking using
the oldest path rule. This is generally not recommended, and may the oldest path rule to improve routing stability. This is generally
interact with multi-path route selection on downstream BGP speakers. not recommended, and may interact with multi-path route selection on
That is, after a route flap that affects the best-path upstream, the downstream BGP speakers. That is, after a route flap that affects
original best path would not be recovered, and the older path would still the best-path upstream, the original best path would not be
be advertised, possibly affecting the tie-breaking rules on down- recovered, and the older path would still be advertised, possibly affecting
stream device, for example if the AS_PATH contents are different from the tie-breaking rules on down-stream device, for example if the
previous. AS_PATH contents are different from previous.
9. Weighted equal-cost multipath 9. Weighted equal-cost multipath
The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions The proposal in [I-D.ietf-idr-link-bandwidth] defines conditions
where iBGP multipath feature might inform the routing table of the where iBGP multipath feature might inform the routing table of the
"weights" associated with the multiple paths. The document defines "weights" associated with the multiple paths. The document defines
the applicability only in iBGP case, though there are implementations the applicability only in iBGP case, though there are implementations
that apply it to eBGP multipath as well. The proposal does not that apply it to eBGP multipath as well. The proposal does not
change the equal-cost multipath selection logic, only associates change the equal-cost multipath selection logic, only associates
additional load-sharing attributes with equivalent paths. additional load-sharing attributes with equivalent paths.
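One way to derive such weights is sketched below. This is an illustration only: it assumes each equivalent path carries a positive integer bandwidth value keyed by next-hop, and that the routing table accepts small integer weights. Neither the function name nor the reduction by greatest common divisor comes from the referenced proposal.

```python
from functools import reduce
from math import gcd


def ecmp_weights(link_bandwidth):
    # link_bandwidth: non-empty mapping of next-hop -> positive integer bandwidth.
    divisor = reduce(gcd, link_bandwidth.values())
    # Weights are proportional to bandwidth, reduced to the smallest integers.
    return {nh: bw // divisor for nh, bw in link_bandwidth.items()}
```

The set of paths is unchanged by this step. Only the share of traffic assigned to each next-hop differs.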
10. Informative References 10. Informative References
[BGPMP] "BGP Best Path Selection Algorithm",
<http://www.cisco.com/c/en/us/support/docs/ip/
border-gateway-protocol-bgp/13753-25.html>.
[I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-06
(work in progress), January 2013.
[RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol [RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol
Label Switching Architecture", RFC 3031, Label Switching Architecture", RFC 3031,
DOI 10.17487/RFC3031, January 2001, DOI 10.17487/RFC3031, January 2001,
<http://www.rfc-editor.org/info/rfc3031>. <https://www.rfc-editor.org/info/rfc3031>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271, Border Gateway Protocol 4 (BGP-4)", RFC 4271,
DOI 10.17487/RFC4271, January 2006, DOI 10.17487/RFC4271, January 2006,
<http://www.rfc-editor.org/info/rfc4271>. <https://www.rfc-editor.org/info/rfc4271>.
[RFC4451] McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED) [RFC4451] McPherson, D. and V. Gill, "BGP MULTI_EXIT_DISC (MED)
Considerations", RFC 4451, DOI 10.17487/RFC4451, March Considerations", RFC 4451, DOI 10.17487/RFC4451, March
2006, <http://www.rfc-editor.org/info/rfc4451>. 2006, <https://www.rfc-editor.org/info/rfc4451>.
[RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route
Reflection: An Alternative to Full Mesh Internal BGP Reflection: An Alternative to Full Mesh Internal BGP
(IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006, (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006,
<http://www.rfc-editor.org/info/rfc4456>. <https://www.rfc-editor.org/info/rfc4456>.
[RFC5065] Traina, P., McPherson, D., and J. Scudder, "Autonomous [RFC5065] Traina, P., McPherson, D., and J. Scudder, "Autonomous
System Confederations for BGP", RFC 5065, System Confederations for BGP", RFC 5065,
DOI 10.17487/RFC5065, August 2007, DOI 10.17487/RFC5065, August 2007,
<http://www.rfc-editor.org/info/rfc5065>. <https://www.rfc-editor.org/info/rfc5065>.
[RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro, [RFC7311] Mohapatra, P., Fernando, R., Rosen, E., and J. Uttaro,
"The Accumulated IGP Metric Attribute for BGP", RFC 7311, "The Accumulated IGP Metric Attribute for BGP", RFC 7311,
DOI 10.17487/RFC7311, August 2014, DOI 10.17487/RFC7311, August 2014,
<http://www.rfc-editor.org/info/rfc7311>. <https://www.rfc-editor.org/info/rfc7311>.
[I-D.ietf-idr-add-paths]
Walton, D., Retana, A., Chen, E., and J. Scudder,
"Advertisement of Multiple Paths in BGP", draft-ietf-idr-
add-paths-15 (work in progress), May 2016.
[I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-06
(work in progress), January 2013.
[BGPMP] "BGP Best Path Selection Algorithm", [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder,
<http://www.cisco.com/c/en/us/support/docs/ip/ "Advertisement of Multiple Paths in BGP", RFC 7911,
border-gateway-protocol-bgp/13753-25.html>. DOI 10.17487/RFC7911, July 2016,
<https://www.rfc-editor.org/info/rfc7911>.
Author's Address Author's Address
Petr Lapukhov Petr Lapukhov
Facebook Facebook
1 Hacker Way 1 Hacker Way
Menlo Park, CA 94025 Menlo Park, CA 94025
US US
Email: petr@fb.com Email: petr@fb.com
 End of changes. 20 change blocks. 
63 lines changed or deleted 61 lines changed or added
