< draft-mohanty-bess-ebgp-dmz-03.txt   draft-ietf-bess-ebgp-dmz-00.txt >
BESS WorkGroup S. Mohanty BESS WorkGroup S R. Mohanty
Internet-Draft Cisco Systems Internet-Draft Cisco Systems
Intended status: Informational A. Vayner Intended status: Informational A. Vayner
Expires: September 16, 2021 Google Expires: 28 August 2022 Google
A. Gattani A. Gattani
A. Kini A. Kini
Arista Networks Arista Networks
March 15, 2021 24 February 2022
Cumulative DMZ Link Bandwidth and load-balancing Cumulative DMZ Link Bandwidth and load-balancing
draft-mohanty-bess-ebgp-dmz-03 draft-ietf-bess-ebgp-dmz-00
Abstract Abstract
The DMZ Link Bandwidth draft provides a way to load-balance traffic The DMZ Link Bandwidth draft provides a way to load-balance traffic
to a destination (which is in a different AS than the source) which to a destination (which is in a different AS than the source) which
is reachable via more than one path. Typically, the link bandwidth is reachable via more than one path. Typically, the link bandwidth
(either configured on the link of the EBGP egress interface or set (either configured on the link of the EBGP egress interface or set
via a policy) is encoded in an extended community and then sent to via a policy) is encoded in an extended community and then sent to
the IBGP peer which employs multi-path. The link-bandwidth value is the IBGP peer which employs multi-path. The link-bandwidth value is
then extracted from the path extended community and is used as a then extracted from the path extended community and is used as a
skipping to change at page 1, line 48 skipping to change at page 1, line 48
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 16, 2021. This Internet-Draft will expire on 28 August 2022.
Copyright Notice Copyright Notice
Copyright (c) 2021 IETF Trust and the persons identified as the Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents (https://trustee.ietf.org/
(https://trustee.ietf.org/license-info) in effect on the date of license-info) in effect on the date of publication of this document.
publication of this document. Please review these documents Please review these documents carefully, as they describe your rights
carefully, as they describe your rights and restrictions with respect and restrictions with respect to this document. Code Components
to this document. Code Components extracted from this document must extracted from this document must include Revised BSD License text as
include Simplified BSD License text as described in Section 4.e of described in Section 4.e of the Trust Legal Provisions and are
the Trust Legal Provisions and are provided without warranty as provided without warranty as described in the Revised BSD License.
described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Requirements Language . . . . . . . . . . . . . . . . . . . . 3 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 3
3. Problem Description . . . . . . . . . . . . . . . . . . . . . 3 3. Problem Description . . . . . . . . . . . . . . . . . . . . . 3
4. Large Scale Data Centers Use Case . . . . . . . . . . . . . . 6 4. Large Scale Data Centers Use Case . . . . . . . . . . . . . . 6
5. Non-Conforming BGP Topologies . . . . . . . . . . . . . . . . 8 5. Non-Conforming BGP Topologies . . . . . . . . . . . . . . . . 8
6. Protocol Considerations . . . . . . . . . . . . . . . . . . . 10 6. Protocol Considerations . . . . . . . . . . . . . . . . . . . 9
7. Operational Considerations . . . . . . . . . . . . . . . . . 10 7. Operational Considerations . . . . . . . . . . . . . . . . . 10
8. Security Considerations . . . . . . . . . . . . . . . . . . . 10 8. Security Considerations . . . . . . . . . . . . . . . . . . . 10
9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 10
10. References . . . . . . . . . . . . . . . . . . . . . . . . . 10 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 10
10.1. Normative References . . . . . . . . . . . . . . . . . . 10 10.1. Normative References . . . . . . . . . . . . . . . . . . 10
10.2. Informative References . . . . . . . . . . . . . . . . . 11 10.2. Informative References . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11
1. Introduction 1. Introduction
skipping to change at page 3, line 31 skipping to change at page 3, line 30
R1- -20 - - | R1- -20 - - |
R3- -100 - -| R3- -100 - -|
R2- -10 - - | | R2- -10 - - | |
| |
R6- -40 - - | |- - R4 R6- -40 - - | |- - R4
| | | |
R5- -100 - -| R5- -100 - -|
R7- -30 - - | R7- -30 - - |
EBGP Network with cumulative DMZ requirement Figure 1
Figure 1 EBGP Network with cumulative DMZ requirement
2. Requirements Language 2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119]. document are to be interpreted as described in [RFC2119].
3. Problem Description 3. Problem Description
Figure 1 above represents an all-EBGP network. Router R3 is peering Figure 1 above represents an all-EBGP network. Router R3 is peering
skipping to change at page 5, line 14 skipping to change at page 4, line 46
R1- -20 - - | R1- -20 - - |
R3- -30 (100) - -| R3- -30 (100) - -|
R2- -10 - - | | R2- -10 - - | |
| |
R6- -40 - - | |- - R4 R6- -40 - - | |- - R4
| | | |
R5- -70 (100) - -| R5- -70 (100) - -|
R7- -30 - - | R7- -30 - - |
EBGP Network showing advertisement of cumulative link bandwidth Figure 2
Figure 2
EBGP Network showing advertisement of cumulative link bandwidth
With the existing rules for the DMZ link bandwidth, this is not With the existing rules for the DMZ link bandwidth, this is not
possible. First, the LB extended community is not sent over EBGP. possible. First, the LB extended community is not sent over EBGP.
Second, the DMZ does not have a notion of conveying the cumulative Second, the DMZ does not have a notion of conveying the cumulative
link bandwidth (of the directed tree rooted at a node) to an upstream link bandwidth (of the directed tree rooted at a node) to an upstream
router. To enable the use case described above, the cumulative link router. To enable the use case described above, the cumulative link
bandwidth of R1 and R2 has to be advertised by R3 to R4, and, bandwidth of R1 and R2 has to be advertised by R3 to R4, and,
similarly, the cumulative bandwidth of R6 and R7 has to be advertised similarly, the cumulative bandwidth of R6 and R7 has to be advertised
by R5 to R4. This will enable R4 to load-balance based on the by R5 to R4. This will enable R4 to load-balance based on the
proportion of the cumulative link bandwidth that it receives from its proportion of the cumulative link bandwidth that it receives from its
downstream routers R3 and R5. Figure 2 shows the cumulative link downstream routers R3 and R5. Figure 2 shows the cumulative link
skipping to change at page 7, line 32 skipping to change at page 7, line 28
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
| | | | | | | |
| | | | | | | |
| +-----+ | | +-----+ | | +-----+ | | +-----+ |
+-| AS4 |-+ +-| AS5 |-+ Tier 3 +-| AS4 |-+ +-| AS5 |-+ Tier 3
+-----+ +-----+ +-----+ +-----+
| | | | | | | | | | | |
<- 3 Servers -> <- 10 Servers -> <- 3 Servers -> <- 10 Servers ->
Typical Data Center Topology (RFC7938) Figure 3
Figure 3
In a regular ECMP environment, the tier 1 layer would see an ECMP In a regular ECMP environment, the tier 1 layer would see an ECMP
path equally load-sharing across all 4 tier 2 paths. This would path equally load-sharing across all 4 tier 2 paths. This would
cause the servers on the left part of the data center to be cause the servers on the left part of the data center to be
potentially overloaded, while the servers on the right would be potentially overloaded, while the servers on the right would be
underutilized. Using link bandwidth advertisements the servers could underutilized. Using link bandwidth advertisements the servers could
add a link bandwidth extended community to the advertised service add a link bandwidth extended community to the advertised service
prefix. Another option is to add the extended community on the tier prefix. Another option is to add the extended community on the tier
3 network devices as the routes are received from the servers or 3 network devices as the routes are received from the servers or
generated locally on the network devices. If the link bandwidth generated locally on the network devices. If the link bandwidth
value advertised for the service represents the server capacity for value advertised for the service represents the server capacity for
that service, each data center tier would aggregate the values up that service, each data center tier would aggregate the values up
when sending the update to the higher tier. The result would be a when sending the update to the higher tier. The result would be a
set of weighted load-sharing metrics at each tier allowing the set of weighted load-sharing metrics at each tier allowing the
network to distribute the flow load among the different servers in network to distribute the flow load among the different servers in
the most optimal way. If a server is added to or removed from the service the most optimal way. If a server is added to or removed from the service
prefix, it would add or remove its link bandwidth value and the prefix, it would add or remove its link bandwidth value and the
network would adjust accordingly. network would adjust accordingly.
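
The tier-by-tier aggregation described above can be sketched as follows. This is a minimal illustration, not part of either draft revision: the function name and the per-server value of 1 are assumptions chosen so that the totals line up with the 3-server and 10-server groups in Figure 3.

```python
# Sketch only: the names and the per-server link bandwidth value of 1 are
# illustrative assumptions, not values taken from the draft.

def aggregate(link_bandwidths):
    """Cumulative link bandwidth a device advertises upstream: the sum of
    the values received on its downstream paths for the service prefix."""
    return sum(link_bandwidths)

# Each server advertises the service prefix with an assumed value of 1.
as4_servers = [1, 1, 1]   # 3 servers behind AS4
as5_servers = [1] * 10    # 10 servers behind AS5

as4_advertised = aggregate(as4_servers)
as5_advertised = aggregate(as5_servers)

# Removing a server withdraws its value, so the aggregate shrinks with it.
as5_after_removal = aggregate(as5_servers[:-1])
```

Under these assumptions AS4 and AS5 advertise values in the ratio 3 to 10, and removing one server behind AS5 lowers its advertised value accordingly.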
Typical Data Center Topology (RFC7938)
Figure 4 shows a more popular Spine Leaf architecture similar to Figure 4 shows a more popular Spine Leaf architecture similar to
[RFC7938] section 3.2. Tor1, Tor2 and Tor3 are in the same tier, [RFC7938] section 3.2. Tor1, Tor2 and Tor3 are in the same tier,
i.e., the leaf tier (the representation shown in Figure 3 is the i.e., the leaf tier (the representation shown in Figure 3 is the
unfolded Clos). Using the same example above, it is clear that the unfolded Clos). Using the same example above, it is clear that the
LB extended community value received by each of Spine1 and Spine2 LB extended community value received by each of Spine1 and Spine2
from Tor1 and Tor2 is in the ratio 3 to 10 respectively. The Spines from Tor1 and Tor2 is in the ratio 3 to 10 respectively. The Spines
will then aggregate the bandwidth, regenerate and advertise the LB will then aggregate the bandwidth, regenerate and advertise the LB
extended-community to Tor3. Tor3 will do equal cost sharing to both extended-community to Tor3. Tor3 will do equal cost sharing to both
the spines which in turn will do the traffic-splitting in the ratio 3 the spines which in turn will do the traffic-splitting in the ratio 3
to 10 when forwarding the traffic to the Tor1 and Tor2 respectively. to 10 when forwarding the traffic to the Tor1 and Tor2 respectively.
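
One way a spine could realize the 3 to 10 split is to expand the received weights into hash buckets and map each flow to a bucket. The sketch below is only an assumption about how such a split might be implemented; the draft does not specify a forwarding mechanism, and the function names, the flow key format, and the use of CRC32 as the hash are all hypothetical.

```python
import zlib

def build_buckets(weights):
    """Expand {next_hop: integer weight} into a list holding one slot per
    unit of weight, so slot counts mirror the link bandwidth ratio."""
    buckets = []
    for next_hop, weight in sorted(weights.items()):
        buckets.extend([next_hop] * weight)
    return buckets

def pick_next_hop(flow_key, buckets):
    """Map a flow to a slot with a deterministic hash of its key, so all
    packets of one flow follow the same next hop."""
    return buckets[zlib.crc32(flow_key.encode()) % len(buckets)]

# Weights a spine would hold after receiving values in the ratio 3 to 10.
spine_weights = {"Tor1": 3, "Tor2": 10}
buckets = build_buckets(spine_weights)

# Hypothetical flow key; any stable per-flow string would do.
chosen = pick_next_hop("192.0.2.1:40000->p/m", buckets)
```

With 13 buckets, 3 point at Tor1 and 10 at Tor2, so a uniform hash spreads flows in roughly that proportion.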
skipping to change at page 8, line 39 skipping to change at page 8, line 32
+----+--+ +-+----+ +----+--+ +-+----+
| \ / | | \ / |
- + - - - + - -
| / \ | | / \ |
+-----+- + -+-----+ +-----+- + -+-----+
|Tor1 | |Tor2 | Tier 1 |Tor1 | |Tor2 | Tier 1
+-----+ +-----+ +-----+ +-----+
| | | | | | | | | | | |
<- 3 Servers -> <- 10 Servers -> <- 3 Servers -> <- 10 Servers ->
Two-tier Clos Data Center Topology Figure 4
Figure 4 Two-tier Clos Data Center Topology
5. Non-Conforming BGP Topologies 5. Non-Conforming BGP Topologies
This use-case will not readily apply to all topologies. Figure 5 This use-case will not readily apply to all topologies. Figure 5
shows an all-EBGP topology: R1, R2, R3, R4, R5 and R6 are in AS1, AS2, shows an all-EBGP topology: R1, R2, R3, R4, R5 and R6 are in AS1, AS2,
AS3, AS4, AS5 and AS6 respectively. A net p/m is being advertised AS3, AS4, AS5 and AS6 respectively. A net p/m is being advertised
from a server S1 with LB extended-community value 10 to R1 and R5. from a server S1 with LB extended-community value 10 to R1 and R5.
R1 advertises p/m to R2 and R3 and also regenerates the LB extended- R1 advertises p/m to R2 and R3 and also regenerates the LB extended-
community with value 10. R4 receives the advertisements from R2, R3 community with value 10. R4 receives the advertisements from R2, R3
and R5 and computes the aggregate bandwidth to be 30. R4 advertises and R5 and computes the aggregate bandwidth to be 30. R4 advertises
skipping to change at page 9, line 31 skipping to change at page 9, line 20
|- - R2 - 10 --| |- - R2 - 10 --|
| | | |
| | | |
S1- - 10- R1 R4- - - --30 - -R6 S1- - 10- R1 R4- - - --30 - -R6
| | | | | |
| | | | | |
10 |- - -R3- 10 - -| 10 |- - -R3- 10 - -|
| | | |
|- - - R5 - - -- - -- - - -| |- - - R5 - - -- - -- - - -|
A non-conforming topology for the Cumulative DMZ Figure 5
Figure 5 A non-conforming topology for the Cumulative DMZ
One way to make the topology in the figure above conforming would be One way to make the topology in the figure above conforming would be
to regenerate a normalized value of the aggregate link bandwidth when to regenerate a normalized value of the aggregate link bandwidth when
the aggregate link bandwidth is being advertised over more than one the aggregate link bandwidth is being advertised over more than one
eBGP peer link. Such normalization can be achieved through outbound eBGP peer link. Such normalization can be achieved through outbound
policy application on top of the aggregate link bandwidth value. A policy application on top of the aggregate link bandwidth value. A
couple of options in this context are: couple of options in this context are:
1. divide the aggregate link bandwidth across the eBGP peers equally 1. divide the aggregate link bandwidth across the eBGP peers equally
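
The equal-division option can be worked through with the Figure 5 values. This is a hedged sketch: the function name is hypothetical, and it assumes that R2 and R3 pass the value they receive from R1 on to R4 unchanged and that R5 passes on the value 10 it receives from S1.

```python
from fractions import Fraction

def normalize_equally(aggregate_bw, num_ebgp_peers):
    """Split an aggregate link bandwidth evenly across the eBGP peers it is
    advertised to (hypothetical helper for the equal-division option)."""
    return Fraction(aggregate_bw, num_ebgp_peers)

# R1 holds an aggregate of 10 from S1 and advertises p/m to R2 and R3.
r1_per_peer = normalize_equally(10, 2)

# Assumption: R2 and R3 relay R1's value unchanged; R5 relays S1's 10.
r4_without_normalization = 10 + 10 + 10
r4_with_normalization = r1_per_peer + r1_per_peer + 10
```

Under these assumptions R4 computes 20 rather than 30, which equals the sum of the two values of 10 that S1 advertised to R1 and R5.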
skipping to change at page 11, line 7 skipping to change at page 10, line 41
The authors would like to thank Shyam Sethuram, Sameer Gulrajani, The authors would like to thank Shyam Sethuram, Sameer Gulrajani,
Nitin Kumar, Keyur Patel and Juan Alcaide for discussions related to Nitin Kumar, Keyur Patel and Juan Alcaide for discussions related to
the draft. the draft.
10. References 10. References
10.1. Normative References 10.1. Normative References
[I-D.ietf-idr-link-bandwidth] [I-D.ietf-idr-link-bandwidth]
Mohapatra, P. and R. Fernando, "BGP Link Bandwidth Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
Extended Community", draft-ietf-idr-link-bandwidth-06 Extended Community", Work in Progress, Internet-Draft,
(work in progress), January 2013. draft-ietf-idr-link-bandwidth-06, 21 January 2013,
<http://www.ietf.org/internet-drafts/draft-ietf-idr-link-
bandwidth-06.txt>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997, DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>. <https://www.rfc-editor.org/info/rfc2119>.
[RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
BGP for Routing in Large-Scale Data Centers", RFC 7938, BGP for Routing in Large-Scale Data Centers", RFC 7938,
DOI 10.17487/RFC7938, August 2016, DOI 10.17487/RFC7938, August 2016,
<https://www.rfc-editor.org/info/rfc7938>. <https://www.rfc-editor.org/info/rfc7938>.
skipping to change at page 11, line 40 skipping to change at page 11, line 30
[RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address [RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address
Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017, Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
<https://www.rfc-editor.org/info/rfc8277>. <https://www.rfc-editor.org/info/rfc8277>.
Authors' Addresses Authors' Addresses
Satya Ranjan Mohanty Satya Ranjan Mohanty
Cisco Systems Cisco Systems
170 W. Tasman Drive 170 W. Tasman Drive
San Jose, CA 95134 San Jose, CA 95134
USA United States of America
Email: satyamoh@cisco.com Email: satyamoh@cisco.com
Arie Vayner Arie Vayner
Google Google
1600 Amphitheatre Parkway 1600 Amphitheatre Parkway
Mountain View, CA 94043 Mountain View, CA 94043
USA United States of America
Email: avayner@google.com Email: avayner@google.com
Akshay Gattani Akshay Gattani
Arista Networks Arista Networks
5453 Great America Parkway 5453 Great America Parkway
Santa Clara, CA 95054 Santa Clara, CA 95054
USA United States of America
Email: akshay@arista.com Email: akshay@arista.com
Ajay Kini Ajay Kini
Arista Networks Arista Networks
5453 Great America Parkway 5453 Great America Parkway
Santa Clara, CA 95054 Santa Clara, CA 95054
USA United States of America
Email: ajkini@arista.com Email: ajkini@arista.com
 End of changes. 25 change blocks. 
42 lines changed or deleted 38 lines changed or added
