BESS WorkGroup                                                S. Mohanty
Internet-Draft                                               A. Millisor
Intended status: Informational                             Cisco Systems
Expires: January 14, 2021                                      A. Vayner
                                                                 Nutanix
                                                              A. Gattani
                                                                 A. Kini
                                                         Arista Networks
                                                           July 13, 2020

           Cumulative DMZ Link Bandwidth and load-balancing
                    draft-mohanty-bess-ebgp-dmz-02

Abstract

   The DMZ Link Bandwidth draft provides a way to load-balance traffic
   to a destination (which is in a different AS than the source) that
   is reachable via more than one path.  Typically, the link bandwidth
   (either configured on the link of the EBGP egress interface or set
   via a policy) is encoded in an extended community and then sent to
   the IBGP peer, which employs multi-path.  The link-bandwidth value
   is then extracted from the path's extended community and used as a
   weight in the FIB, which does the load-balancing.  This draft
   extends the usage of the DMZ link bandwidth to another setting,
   where the ingress BGP speaker requires knowledge of the cumulative
   bandwidth while doing the load-balancing.  The draft also proposes
   neighbor-level knobs that allow the link bandwidth extended
   community to be regenerated and then advertised to EBGP peers,
   overriding the default behavior of not advertising optional
   non-transitive attributes to EBGP peers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 14, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Description
   4.  Large Scale Data Centers Use Case
   5.  Non-Conforming BGP Topologies
   6.  Protocol Considerations
   7.  Operational Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Demilitarized Zone (DMZ) Link Bandwidth (LB) extended community,
   along with the multi-path feature, can be used to provide unequal-
   cost load-balancing under user control.  In
   [I-D.ietf-idr-link-bandwidth], the EBGP egress link bandwidth is
   encoded in the link bandwidth extended community and sent along with
   the BGP update to the IBGP peer.
   It is assumed that either a labeled path exists to each of the EBGP
   links or, alternatively, that the IGP cost to each link is the same.
   When the same prefix/net is advertised into the receiving AS via
   different egress points or next-hops, the receiving IBGP peer that
   employs multi-path will use the value of the DMZ LB to load-balance
   traffic to the egress BGP speakers (ASBRs) in proportion to the
   link bandwidths.

   The link bandwidth extended community cannot be advertised to EBGP
   peers, as it is defined to be optional non-transitive.  This draft
   discusses a new use case in which the link bandwidth needs to be
   advertised to EBGP peers.  The new use case requires that the router
   calculate the aggregate link bandwidth, regenerate the DMZ link
   bandwidth extended community, and advertise it to EBGP peers.  The
   new use case also negates the [I-D.ietf-idr-link-bandwidth]
   restriction that the DMZ link bandwidth extended community not be
   sent when the advertising router sets the next-hop to itself.

   In [I-D.ietf-idr-link-bandwidth], the DMZ link bandwidth advertised
   by the EBGP egress BGP speaker to the IBGP speaker represents the
   link bandwidth of the EBGP link.  However, sometimes there is a need
   to aggregate the link bandwidth of all the paths that are
   advertising a given net and then send it to an upstream neighbor.
   This is represented pictorially in Figure 1.  The aggregate link
   bandwidth is used by the upstream router to do load-balancing, as it
   may in turn receive several such paths for the same net, each
   carrying an accumulated bandwidth.

      R1- -20 - - |
                  R3- -100 - -|
      R2- -10 - - |           |
                              |
      R6- -40 - - |           |- - R4
                  |           |
                  R5- -100 - -|
      R7- -30 - - |

            EBGP Network with cumulative DMZ requirement

                              Figure 1

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3.  Problem Description

   Figure 1 above represents an all-EBGP network.  Router R3 peers with
   two downstream EBGP routers, R1 and R2, and with an upstream EBGP
   router, R4.  Another router, R5, peers with two downstream routers,
   R6 and R7, and also peers with R4.  A net, p/m, is learnt by R1, R2,
   R6, and R7 from their downstream routers (not shown).  From the
   perspective of R4, the topology looks like a directed tree.  The
   link bandwidths of the EBGP links are shown alongside the links (the
   exact units are not important; for simplicity these can be assumed
   to be weights proportional to the operational link bandwidths).  It
   is assumed that R3, R4, and R5 have multi-path configured, and that
   paths with different AS-path attribute values can still be
   considered for multi-path (knobs exist in many implementations for
   this).  When the ingress router, R4, sends traffic to the
   destination p/m, the traffic needs to be spread among the links in
   the ratio of their link bandwidths.  Today this is not possible, as
   there is no way to signal the link bandwidth extended community over
   the EBGP session from R3 to R4.  In the absence of a mechanism to
   regenerate the link bandwidth over the EBGP sessions from R3 to R4
   and from R5 to R4, the assumed link bandwidth for paths received
   over those sessions would be equal to the operational link bandwidth
   of the corresponding EBGP links.

   As per EBGP rules, the advertising router sets the next-hop to
   itself.
   Accordingly, R3 computes the best path from the advertisements
   received from R1 and R2, and R5 computes the best path from the
   advertisements received from R6 and R7.  R4 receives the updates
   from R3 and R5, in turn computes its best path, and may advertise
   it upstream (not shown).  The expected behavior is that when R4
   sends traffic for p/m towards R3 and R5, and then on to R1, R2, R6,
   and R7, the traffic should be load-balanced based on the calculated
   weights at the routers that employ multi-path.  R4 should send 30%
   of the traffic to R3 and the remaining 70% to R5.  R3 in turn
   should send 67% of the traffic that it received from R4 to R1 and
   33% to R2.  Similarly, R5 should send 57% of the traffic received
   from R4 to R6 and the remaining 43% to R7.  Instead, R4 sends 50%
   of the traffic to each of R3 and R5.  R3 in turn sends more traffic
   than desired towards R1 and R2, and R5 sends less traffic than
   desired towards R6 and R7.  Effectively, the load-balancing is
   skewed towards R1 and R2, even though their egress link bandwidth
   is less than that of R6 and R7.

      R1- -20 - - |
                  R3- -30 (100) - -|
      R2- -10 - - |                |
                                   |
      R6- -40 - - |                |- - R4
                  |                |
                  R5- -70 (100) - -|
      R7- -30 - - |

     EBGP Network showing advertisement of cumulative link bandwidth

                              Figure 2

   With the existing rules for the DMZ link bandwidth, this is not
   possible.  First, the LB extended community is not sent over EBGP.
   Second, the DMZ link bandwidth has no notion of conveying the
   cumulative link bandwidth (of the directed tree rooted at a node)
   to an upstream router.  To enable the use case described above, the
   cumulative link bandwidth of R1 and R2 has to be advertised by R3
   to R4 and, similarly, the cumulative bandwidth of R6 and R7 has to
   be advertised by R5 to R4.
   This will enable R4 to load-balance in proportion to the cumulative
   link bandwidths that it receives from its downstream routers R3 and
   R5.  Figure 2 shows the cumulative link bandwidth advertised by R3
   towards R4 and by R5 towards R4, with the original link bandwidth
   values in parentheses for comparison.

   To address cases like the above example, rather than introducing a
   new attribute for aggregate link bandwidth, we reuse the link
   bandwidth extended community attribute and relax a few assumptions.
   With neighbor-specific knobs, or policy configuration applied to
   the neighbor outbound or inbound as the case may be, we can
   regenerate, advertise, and/or accept the link bandwidth extended
   community over the EBGP link.  In addition, we can define
   neighbor-specific knobs that aggregate the link bandwidth values
   from the LB extended communities learnt from the downstream routers
   (whether received as a link bandwidth extended community in the
   path update, assigned at ingress using a neighbor inbound policy
   configuration, or derived from the operational link speed of the
   peer link) and then regenerate and advertise (via a neighbor
   outbound policy knob) this aggregate link bandwidth value, in the
   form of the LB extended community, to the upstream EBGP router.
   Since the advertisement is made to EBGP neighbors, the next-hop is
   reset at the advertising router.
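   The aggregate-and-regenerate step described above can be sketched in
   a few lines.  The sketch below assumes the community layout from
   [I-D.ietf-idr-link-bandwidth] (type 0x40, sub-type 0x04, a 2-octet
   AS number, and the bandwidth as a 4-octet IEEE floating-point
   value); the function names and AS numbers are illustrative only and
   do not come from any particular BGP implementation.

```python
import struct

# Link Bandwidth extended community: type 0x40 (optional,
# non-transitive), sub-type 0x04, per draft-ietf-idr-link-bandwidth.
LB_TYPE, LB_SUBTYPE = 0x40, 0x04

def encode_link_bandwidth(asn: int, bandwidth: float) -> bytes:
    """Encode an 8-octet DMZ Link Bandwidth extended community."""
    return struct.pack("!BBHf", LB_TYPE, LB_SUBTYPE, asn, bandwidth)

def decode_link_bandwidth(ecomm: bytes) -> float:
    """Extract the bandwidth value from a Link Bandwidth community."""
    t, st, _asn, bw = struct.unpack("!BBHf", ecomm)
    if (t, st) != (LB_TYPE, LB_SUBTYPE):
        raise ValueError("not a link-bandwidth extended community")
    return bw

def regenerate_cumulative(local_asn: int, downstream: list) -> bytes:
    """Sum the link bandwidths of the downstream paths and regenerate
    the community for advertisement to the upstream EBGP peer
    (which will also set next-hop self)."""
    total = sum(decode_link_bandwidth(e) for e in downstream)
    return encode_link_bandwidth(local_asn, total)

# R3 aggregating the R1 (20) and R2 (10) paths from Figure 1;
# the AS numbers are arbitrary private ASNs for illustration:
paths = [encode_link_bandwidth(64512, 20.0),
         encode_link_bandwidth(64513, 10.0)]
assert decode_link_bandwidth(regenerate_cumulative(64501, paths)) == 30.0
```

   Whether such regeneration happens by default or only under an
   explicit neighbor-level knob is the policy question this draft
   raises; the arithmetic itself is just a sum over the received
   communities.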
   In terms of the overall traffic profile, if we assume that traffic
   for net p/m arrives at the ingress router R4 at a data rate of x,
   then in the absence of link bandwidth regeneration at R3 and R5 the
   resulting traffic profile is:

      Link    Ratio                 Percent (~)
      -----   -------------------   -----------
      R4-R3   1/2 x                 50%
      R4-R5   1/2 x                 50%
      R3-R1   1/3 x  (1/2 * 2/3)    33%
      R3-R2   1/6 x  (1/2 * 1/3)    17%
      R5-R6   2/7 x  (1/2 * 4/7)    29%
      R5-R7   3/14 x (1/2 * 3/7)    21%

   For comparison, the resulting traffic profile in the presence of
   cumulative link bandwidth regeneration at R3 and R5 is:

      Link    Ratio                 Percent (~)
      -----   -------------------   -----------
      R4-R3   3/10 x                30%
      R4-R5   7/10 x                70%
      R3-R1   1/5 x  (3/10 * 2/3)   20%
      R3-R2   1/10 x (3/10 * 1/3)   10%
      R5-R6   2/5 x  (7/10 * 4/7)   40%
      R5-R7   3/10 x (7/10 * 3/7)   30%

   As is evident, the second table is closer to the desired traffic
   profile at the leaf nodes (R1, R2, R6, R7) than the first.

4.  Large Scale Data Centers Use Case

   "Use of BGP for Routing in Large-Scale Data Centers" [RFC7938]
   describes a way to design large-scale data centers using EBGP
   across the different routing layers.  Section 6.3 of [RFC7938]
   ("Weighted ECMP") describes a use case in which a service (most
   likely represented by an anycast virtual IP) has an unequal set of
   resources serving it across the data center regions.  Figure 3
   shows a typical data center topology, as described in Section 3.1
   of [RFC7938], where an unequal number of servers are deployed
   advertising a certain BGP prefix.  As can be seen in the figure,
   the left side of the data center hosts only 3 servers while the
   right side hosts 10 servers.
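   As a rough illustration of the weighted ECMP behavior that
   Section 6.3 of [RFC7938] calls for, a hash-based weighted next-hop
   selection over unequal weights (3 vs. 10, matching the server
   counts in this example) might look like the sketch below.  The
   function and pod names are hypothetical and not taken from any
   vendor's FIB implementation.

```python
import zlib

def weighted_nexthop(flow_key: bytes, weights: dict) -> str:
    """Pick a next-hop for a flow in proportion to link-bandwidth
    weights, using a stable hash of the flow key (illustrative only;
    real FIBs typically hash the packet 5-tuple in hardware)."""
    total = sum(weights.values())
    bucket = zlib.crc32(flow_key) % total
    for nexthop, weight in sorted(weights.items()):
        if bucket < weight:
            return nexthop
        bucket -= weight
    raise AssertionError("unreachable: bucket < total by construction")

# With a 3-server pod and a 10-server pod, Tier 1 should steer
# roughly 3/13 of the flows left and 10/13 right:
weights = {"left-pod": 3, "right-pod": 10}
counts = {"left-pod": 0, "right-pod": 0}
for i in range(10000):
    counts[weighted_nexthop(b"flow-%d" % i, weights)] += 1
```

   Because flows are mapped by hash, the split is only approximately
   proportional over many flows; individual long-lived flows are still
   pinned to a single path.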
        +------+          +------+
        |      |          |      |
        | AS1  |          | AS1  |            Tier 1
        |      |          |      |
        +------+          +------+
          |  |              |  |
     +-------+  |           |  +--------+
     |  +------+--+---------+--+------+ |
     |  |      |  |         |  |      | |
   +----+    +----+       +----+    +----+
   |    |    |    |       |    |    |    |
   |AS2 |    |AS2 |       |AS3 |    |AS3 |    Tier 2
   |    |    |    |       |    |    |    |
   +----+    +----+       +----+    +----+
      |         |            |         |
      |         |            |         |
      | +-----+ |            | +-----+ |
      +-| AS4 |-+            +-| AS5 |-+      Tier 3
        +-----+                +-----+
        | |  |                 | |  |
      <- 3 Servers ->       <- 10 Servers ->

               Typical Data Center Topology (RFC7938)

                              Figure 3

   In a regular ECMP environment, the Tier 1 layer would see an ECMP
   path equally load-sharing across all four Tier 2 paths.  This could
   cause the servers on the left part of the data center to be
   overloaded, while the servers on the right are underutilized.
   Using link bandwidth advertisements, the servers could add a link
   bandwidth extended community to the advertised service prefix.
   Another option is to add the extended community on the Tier 3
   network devices as the routes are received from the servers, or as
   they are generated locally on the network devices.  If the link
   bandwidth value advertised for the service represents the server
   capacity for that service, each data center tier would aggregate
   the values up when sending the update to the higher tier.  The
   result would be a set of weighted load-sharing metrics at each
   tier, allowing the network to distribute the flow load among the
   different servers in the most optimal way.  If a server is added to
   or removed from the service prefix, it would add or remove its link
   bandwidth value, and the network would adjust accordingly.

   Figure 4 shows a more popular spine-leaf architecture, similar to
   Section 3.2 of [RFC7938].  Tor1, Tor2, and Tor3 are in the same
   tier, i.e. the leaf tier (the representation shown in Figure 4 is
   the unfolded Clos).
   Using the same example as above, it is clear that the LB extended
   community values received by Spine1 and Spine2 from Tor1 and Tor2
   are in the ratio 3 to 10.  The spines will then aggregate the
   bandwidth, and regenerate and advertise the LB extended community
   to Tor3.  Tor3 will do equal-cost sharing towards both spines,
   which in turn will split the traffic in the ratio 3 to 10 when
   forwarding it to Tor1 and Tor2 respectively.

                 +------+
                 | Tor3 |            Tier 1
                 +------+
                     |
           +- - - - -+- - - - +
           |                  |
        +----+             +----+
        |    |             |    |
        |Spine1            |Spine2
        |    |             |    |
        +----+--+        +-+----+
           |     \      /    |
           -      + - -      -
           |     /      \    |
        +-----+-+        +-+-----+
        |Tor1 |            |Tor2 |   Tier 1
        +-----+            +-----+
         | | |              | | |
      <- 3 Servers ->    <- 10 Servers ->

               Two-tier Clos Data Center Topology

                              Figure 4

5.  Non-Conforming BGP Topologies

   This use case will not readily apply to all topologies.  Figure 5
   shows an all-EBGP topology: R1, R2, R3, R4, R5, and R6 are in AS1,
   AS2, AS3, AS4, AS5, and AS6 respectively.  A net, p/m, is
   advertised from a server S1 with LB extended community value 10 to
   R1 and R5.  R1 advertises p/m to R2 and R3 and regenerates the LB
   extended community with value 10.  R4 receives the advertisements
   from R2, R3, and R5 and computes the aggregate bandwidth to be 30.
   R4 advertises p/m to R6 with LB extended community value 30.  The
   link bandwidths are as shown in the figure.

   As can be seen in the example, R4 computes the cumulative bandwidth
   of the LB values that it receives from R2, R3, and R5, which is 30.
   When R4 receives the traffic from R6, it will load-balance it
   across R2, R3, and R5.  As a result, R1 will receive twice the
   volume of traffic that R5 does.  This is not desirable, because the
   bandwidth between S1 and R1 and between S1 and R5 is the same,
   i.e. 10.
   The discrepancy arose because, when R4 aggregated the link
   bandwidth values from the received advertisements, the contribution
   from R1 was factored in twice.

                |- - R2 - 10 --|
                |              |
                |              |
      S1- - 10- R1             R4- - - --30 - -R6
       |        |              |
       |        |              |
       10       |- - -R3- 10 - |
       |                       |
       |- - - R5 - - -- - -- - |

           A non-conforming topology for the Cumulative DMZ

                              Figure 5

   One way to make the topology in the figure above conform would be
   to regenerate a normalized value of the aggregate link bandwidth
   when the aggregate is advertised over more than one EBGP peer link.
   Such normalization can be achieved by applying an outbound policy
   on top of the aggregate link bandwidth value.  Two options in this
   context are:

   a)  divide the aggregate link bandwidth equally across the EBGP
       peers, or

   b)  divide the aggregate link bandwidth across the EBGP peers in
       proportion to the operational link capacity of the EBGP peer
       links.

   These and similar options for regenerating the link bandwidth to
   cater to load-balancing requirements in such topologies are outside
   the scope of this document and can be implemented as additional
   outbound policy enhancements on top of a computed aggregate link
   bandwidth.

6.  Protocol Considerations

   [I-D.ietf-idr-link-bandwidth] needs to be refreshed.  No protocol
   changes are necessary if the knobs are implemented as recommended.
   An alternative way to achieve the same result would be to use
   complex policy frameworks, but that remains conjecture.

7.  Operational Considerations

   Note that these solutions are also applicable to other address
   families, such as L3VPN [RFC2547], IPv4 labeled unicast [RFC8277],
   and EVPN [RFC7432].

   In topologies and implementations where there is an option to
   advertise all multi-path (equal-cost) eligible paths to EBGP peers
   (i.e. the 'ecmp' form of additional-path advertisement is enabled),
   aggregate link bandwidth advertisement may not be required, or may
   be redundant, since the receiving BGP speaker receives the link
   bandwidth extended community values with all eligible paths.  The
   aggregate link bandwidth is thus effectively received by the
   downstream EBGP speaker and can be used in the local computation to
   affect the forwarding behavior.  This assumes the additional paths
   are advertised with next-hop self.

8.  Security Considerations

   This document raises no new security issues.

9.  Acknowledgements

   Viral Patel did substantial work on an implementation along with
   the first author.  The authors would like to thank Acee Lindem and
   Jakob Heitz for their help in reviewing the draft and their
   valuable suggestions.  The authors would also like to thank Shyam
   Sethuram, Sameer Gulrajani, Nitin Kumar, Keyur Patel, and Juan
   Alcaide for discussions related to the draft.

10.  References

10.1.  Normative References

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

10.2.  Informative References

   [RFC2547]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547,
              DOI 10.17487/RFC2547, March 1999,
              <https://www.rfc-editor.org/info/rfc2547>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-
              Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432,
              February 2015, <https://www.rfc-editor.org/info/rfc7432>.
   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: satyamoh@cisco.com

   Aaron Millisor
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA 95134
   USA

   Email: amilliso@cisco.com

   Arie Vayner
   Nutanix
   1740 Technology Drive
   San Jose, CA 95110
   USA

   Email: ariev@vayner.net

   Akshay Gattani
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA 95054
   USA

   Email: akshay@arista.com

   Ajay Kini
   Arista Networks
   5453 Great America Parkway
   Santa Clara, CA 95054
   USA

   Email: ajkini@arista.com