BESS WorkGroup                                                S. Mohanty
Internet-Draft                                               A. Millisor
Intended status: Informational                             Cisco Systems
Expires: September 4, 2018                                     A. Vayner
                                                                  Google
                                                           March 3, 2018

            Cumulative DMZ Link Bandwidth and load-balancing
                     draft-mohanty-bess-ebgp-dmz-00

Abstract

   The DMZ Link Bandwidth draft provides a way to load-balance traffic
   to a destination (in a different AS than the source) that is
   reachable via more than one path.  Typically, the link bandwidth
   (either configured on the link of the EBGP egress interface or set
   via a policy) is encoded in an extended community and then sent to
   the IBGP peer that employs multi-path.  The link-bandwidth value is
   then extracted from the path's extended community and used as a
   weight in the FIB, which does the load-balancing.  This draft
   extends the usage of the DMZ link bandwidth to another setting, in
   which the ingress BGP speaker requires knowledge of the cumulative
   bandwidth while doing the load-balancing.  The draft also proposes
   neighbor-level knobs that allow the link bandwidth extended
   community to be regenerated and then advertised to EBGP peers,
   overriding the default behavior of not advertising optional
   non-transitive attributes to EBGP peers.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 4, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Requirements Language
   3.  Problem Description
   4.  Large Scale Data Centers Use Case
   5.  Non-Conforming BGP Topologies
   6.  Protocol Considerations
   7.  Operational Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   The Demilitarized Zone (DMZ) Link Bandwidth (LB) extended community,
   together with the multi-path feature, can be used to provide
   unequal-cost load-balancing under user control.  In
   [I-D.ietf-idr-link-bandwidth], the EBGP egress link bandwidth is
   encoded in the link bandwidth extended community and sent along with
   the BGP update to the IBGP peer.  It is assumed that either a
   labeled path exists to each of the EBGP links or, alternatively,
   that the IGP cost to each link is the same.
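For context, [I-D.ietf-idr-link-bandwidth] defines the value field of this extended community as a two-octet AS number followed by the link bandwidth as a four-octet IEEE floating-point number in bytes per second.  A minimal encoding sketch follows (illustrative only; the 0x40/0x04 type octets assume the optional non-transitive encoding, and the function names are our own):

```python
import struct

LB_TYPE_HIGH = 0x40  # optional, non-transitive (assumed encoding)
LB_TYPE_LOW = 0x04   # link bandwidth sub-type

def encode_link_bandwidth(asn: int, bw_bytes_per_sec: float) -> bytes:
    """Pack an 8-octet link bandwidth extended community:
    type (2 octets), AS number (2 octets), then the bandwidth as a
    4-octet IEEE 754 float in bytes per second."""
    return struct.pack("!BBHf", LB_TYPE_HIGH, LB_TYPE_LOW,
                       asn, bw_bytes_per_sec)

def decode_link_bandwidth(ec: bytes) -> tuple[int, float]:
    """Return (asn, bandwidth) from an encoded community."""
    _th, _tl, asn, bw = struct.unpack("!BBHf", ec)
    return asn, bw

# A ~100 Mbit/s link expressed in bytes per second.
ec = encode_link_bandwidth(65001, 12_500_000.0)
assert len(ec) == 8
assert decode_link_bandwidth(ec) == (65001, 12_500_000.0)
```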
   When the same prefix/net is advertised into the receiving AS via
   different egress points or next-hops, the receiving IBGP peer that
   employs multi-path will use the value of the DMZ LB to load-balance
   traffic to the egress BGP speakers (ASBRs) in proportion to the
   link bandwidths.

   The link bandwidth extended community cannot be advertised to EBGP
   peers, as it is defined to be optional non-transitive.  This draft
   discusses a new use case in which the link bandwidth needs to be
   advertised to EBGP peers.  The new use case requires that the
   router calculate the aggregate link bandwidth, regenerate the DMZ
   link bandwidth extended community, and advertise it to EBGP peers.
   The new use case also negates the [I-D.ietf-idr-link-bandwidth]
   restriction that the DMZ link bandwidth extended community not be
   sent when the advertising router sets the next-hop to itself.

   In [I-D.ietf-idr-link-bandwidth], the DMZ link bandwidth advertised
   by the EBGP egress BGP speaker to the IBGP speaker represents the
   link bandwidth of the EBGP link.  However, there is sometimes a
   need to aggregate the link bandwidths of all the paths advertising
   a given net and then send the total to an upstream neighbor.  This
   is represented pictorially in Figure 1.  The aggregated link
   bandwidth is used by the upstream router to do load-balancing, as
   it may also receive several such paths for the same net which in
   turn carry the accumulated bandwidth.

       R1- -20 - - |
                   R3- -100- -|
       R2- -10 - - |          |
                              |
       R6- -40 - - |          |- - R4
                   |          |
                   R5- -100- -|
       R7- -30 - - |

              EBGP Network with cumulative DMZ requirement

                                Figure 1

2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

3.  Problem Description

   Figure 1 above represents an all-EBGP network.  Router R3 peers
   with two downstream EBGP routers, R1 and R2, and with an upstream
   EBGP router, R4.  Another router, R5, peers with two downstream
   routers, R6 and R7, and also peers with R4.  A net, p/m, is learnt
   by R1, R2, R6, and R7 from their downstream routers (not shown).
   From the perspective of R4, the topology looks like a directed
   tree.  The link bandwidths of the EBGP links are shown alongside
   the links (the exact units are not important).  It is assumed that
   R3, R4, and R5 have multi-path configured and that paths with
   different AS-path attribute values can still be considered for
   multi-path (knobs exist in many implementations for this).  When
   the ingress router, R4, sends traffic to the destination p/m, the
   traffic needs to be spread among the links in the ratio of their
   link bandwidths.  Today this is not possible, as there is no way to
   signal the link bandwidth extended community over the EBGP session
   from R3 to R4.

   As per EBGP rules, the advertising router sets the next-hop to
   itself.  Accordingly, R3 computes the best path from the
   advertisements received from R1 and R2, and R5 computes the best
   path from the advertisements received from R6 and R7.  R4 receives
   the updates from R3 and R5, in turn computes the best path, and may
   advertise it upstream (not shown).  The expected behavior is that
   when R4 sends traffic for p/m towards R3 and R5, and then on to R1,
   R2, R6, and R7, the traffic should be load-balanced based on the
   weights calculated at the routers that employ multi-path.  R4
   should send 30% of the traffic to R3 and the remaining 70% to R5.
   R3 in turn should send 67% of the traffic that it received from R4
   to R1 and 33% to R2.
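These ratios follow directly from proportional weighting on the advertised bandwidth values of Figure 1; the arithmetic can be checked with a short sketch (illustrative only, not part of any protocol machinery):

```python
# Link bandwidths from Figure 1: paths received by R3 and R5 from
# their downstream EBGP neighbors.
received = {
    "R3": {"R1": 20, "R2": 10},
    "R5": {"R6": 40, "R7": 30},
}

# Each router aggregates its downstream bandwidths; the cumulative
# value is what it would regenerate and advertise upstream to R4.
cumulative = {rtr: sum(paths.values()) for rtr, paths in received.items()}

def split(paths):
    """Percentage of traffic sent to each next-hop, by bandwidth."""
    total = sum(paths.values())
    return {nbr: round(100 * bw / total) for nbr, bw in paths.items()}

print(split(cumulative))      # R4's split: {'R3': 30, 'R5': 70}
print(split(received["R3"]))  # R3's split: {'R1': 67, 'R2': 33}
print(split(received["R5"]))  # R5's split: {'R6': 57, 'R7': 43}
```

If the cumulative LB values were signaled from R3 and R5 to R4, these are exactly the ratios the routers' FIBs would install as weights.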
   Similarly, R5 should send 57% of the traffic to R6 and the
   remaining 43% to R7.

   With the existing rules for the DMZ link bandwidth, this is not
   possible.  First, the LB extended community is not sent over EBGP.
   Second, the DMZ link bandwidth has no notion of conveying the
   cumulative link bandwidth (of the directed tree rooted at a node)
   to an upstream router.  To enable the use case described above, the
   cumulative link bandwidth of R1 and R2 has to be advertised by R3
   to R4, and, similarly, the cumulative bandwidth of R6 and R7 has to
   be advertised by R5 to R4.  This enables R4 to load-balance in
   proportion to the cumulative link bandwidths that it receives from
   its downstream routers R3 and R5.

   To address cases like the above example, rather than inventing
   something new from scratch, we relax a few assumptions of the link
   bandwidth extended community.  With neighbor-specific knobs,
   outbound or inbound as the case may be, we can regenerate and
   advertise and/or accept the link bandwidth extended community over
   the EBGP link.  In addition, we can define neighbor-specific knobs
   that aggregate the link bandwidth values from the LB extended
   communities (accepted via the neighbor inbound policy knob) of the
   downstream routers and then regenerate and advertise (via the
   neighbor outbound policy knob) this aggregate link bandwidth value
   in the LB extended community to the upstream EBGP router.  Since
   the advertisement is made to EBGP neighbors, the next-hop is reset
   at the advertising router.

4.  Large Scale Data Centers Use Case

   "Use of BGP for Routing in Large-Scale Data Centers" [RFC7938]
   describes a way to design large-scale data centers using EBGP
   across the different routing layers.
   Section 6.3 of [RFC7938] ("Weighted ECMP") describes a use case in
   which a service (most likely represented by an anycast virtual IP)
   has an unequal set of resources serving it across the data center
   regions.  Figure 2 shows a typical data center topology, as
   described in Section 3.1 of [RFC7938], where an unequal number of
   servers are deployed advertising a certain BGP prefix.  As can be
   seen in the figure, the left side of the data center hosts only 3
   servers while the right side hosts 10 servers.

              +------+        +------+
              |      |        |      |
              | AS1  |        | AS1  |           Tier 1
              |      |        |      |
              +------+        +------+
               |    |          |    |
     +---------+    |          |    +----------+
     |  +-------+---+----------+---+-------+   |
     |  |       |   |          |   |       |   |
   +----+     +----+         +----+      +----+
   |    |     |    |         |    |      |    |
   |AS2 |     |AS2 |         |AS3 |      |AS3 |  Tier 2
   |    |     |    |         |    |      |    |
   +----+     +----+         +----+      +----+
      |         |               |          |
      |         |               |          |
      |  +-----+ |              |  +-----+ |
      +--| AS4 |-+              +--| AS5 |-+     Tier 3
         +-----+                   +-----+
          | | |                     | | |
      <- 3 Servers ->          <- 10 Servers ->

              Typical Data Center Topology (RFC 7938)

                                Figure 2

   In a regular ECMP environment, the Tier 1 layer would see an ECMP
   path load-sharing equally across all four Tier 2 paths.  This would
   potentially overload the servers on the left side of the data
   center, while the servers on the right side would be underutilized.
   Using link bandwidth advertisements, the servers could add a link
   bandwidth extended community to the advertised service prefix.
   Another option is to add the extended community on the Tier 3
   network devices as the routes are received from the servers or
   generated locally on the network devices.  If the link bandwidth
   value advertised for the service represents the server capacity for
   that service, each data center tier would aggregate the values as
   it sends the update to the next higher tier.
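Assuming, for illustration, that each server contributes one unit of capacity for the service prefix, the tier-by-tier aggregation can be sketched as follows (hypothetical values; the AS names are taken from Figure 2):

```python
# Per-server LB values for the anycast service prefix: 3 servers
# under AS4 (left), 10 under AS5 (right), one unit each.
advertised = {"AS4": [1, 1, 1], "AS5": [1] * 10}

# Each Tier 3 device sums the per-server LB values before
# re-advertising the prefix upward with the aggregate.
aggregate = {dev: sum(vals) for dev, vals in advertised.items()}
assert aggregate == {"AS4": 3, "AS5": 10}

# The upper tier then weights its multipath toward the two sides
# 3:10 instead of splitting equally.
total = sum(aggregate.values())
weights = {dev: agg / total for dev, agg in aggregate.items()}
print(weights)  # AS4 -> 3/13, AS5 -> 10/13
```

Adding or removing a server changes only its own LB contribution, and the aggregates recomputed at each tier adjust the weights accordingly.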
   The result would be a set of weighted load-sharing metrics at each
   tier, allowing the network to distribute the flow load among the
   different servers in the most optimal way.  If a server is added to
   or removed from the service prefix, it would add or remove its link
   bandwidth value, and the network would adjust accordingly.

   Figure 3 shows a more popular spine-leaf architecture, similar to
   Section 3.2 of [RFC7938].  Tor1, Tor2, and Tor3 are in the same
   tier, i.e., the leaf tier (the representation shown in Figure 3 is
   the unfolded Clos).  Using the same example as above, the LB
   extended community values received by each of Spine1 and Spine2
   from Tor1 and Tor2 are in the ratio 3 to 10.  The spines then
   aggregate the bandwidth, regenerate, and advertise the LB extended
   community to Tor3.  Tor3 does equal-cost sharing towards both
   spines, which in turn split the traffic in the ratio 3 to 10 when
   forwarding it to Tor1 and Tor2 respectively.

                  +------+
                  | Tor3 |                 Tier 1
                  +------+
                     |
           +- - - - -+- - - - +
           |                  |
        +----+             +----+
        |    |             |    |
        |Spine1            |Spine2
        |    |             |    |
        +----+--+        +-+----+
           |     \      /    |
            - - - + - - -
           |     /      \    |
        +-----+-+        +-+-----+
        |Tor1 |            |Tor2 |         Tier 1
        +-----+            +-----+
         | | |              | | |
       <- 3 Servers ->  <- 10 Servers ->

              Two-tier Clos Data Center Topology

                             Figure 3

5.  Non-Conforming BGP Topologies

   This use case does not readily apply to all topologies.  Figure 4
   shows an all-EBGP topology: R1, R2, R3, R4, R5, and R6 are in AS1,
   AS2, AS3, AS4, AS5, and AS6 respectively.  A net, p/m, is
   advertised by a server S1 with LB extended community value 10 to R1
   and R5.  R1 advertises p/m to R2 and R3 and regenerates the LB
   extended community with value 10.  R4 receives the advertisements
   from R2, R3, and R5 and computes the aggregate bandwidth to be 30.
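Assuming R4 simply sums the LB values it receives, the arithmetic behind this example can be traced with a short sketch (illustrative only):

```python
# Figure 4: S1 advertises p/m with LB 10 to both R1 and R5; R1
# regenerates LB 10 toward R2 and toward R3, so R4 receives three
# paths, each carrying LB 10.
received_at_r4 = {"R2": 10, "R3": 10, "R5": 10}

aggregate = sum(received_at_r4.values())
assert aggregate == 30  # the value R4 regenerates toward R6

# R4 splits incoming traffic proportionally across the three paths.
share = {nbr: bw / aggregate for nbr, bw in received_at_r4.items()}

# R2 and R3 both forward to R1, so R1 carries two thirds of the
# load while R5 carries one third, despite S1 having equal links
# to R1 and R5.
to_r1 = share["R2"] + share["R3"]
to_r5 = share["R5"]
assert to_r1 == 2 * to_r5
```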
   R4 advertises p/m to R6 with LB extended community value 30.  The
   link bandwidths are as shown in the figure.

   As can be seen in the example, R4 computes the cumulative bandwidth
   of the LB values it receives from R2, R3, and R5, which is 30.
   When R4 receives traffic from R6, it load-balances it across R2,
   R3, and R5.  As a result, R1 receives twice the volume of traffic
   that R5 does.  This is not desirable, because the bandwidth from S1
   to R1 and the bandwidth from S1 to R5 are the same, i.e., 10.  The
   discrepancy arises because, when R4 aggregates the link bandwidth
   values from the received advertisements, the contribution from R1
   is factored in twice.

               |- - R2 - 10 --|
               |              |
               |              |
     S1- - 10- R1             R4- - - --30 - -R6
      |        |              |
      |        |              |
      10       |- - -R3- 10 - |
      |                       |
      |- - - R5 - - -- - -- - |

          A non-conforming topology for the Cumulative DMZ

                             Figure 4

6.  Protocol Considerations

   [I-D.ietf-idr-link-bandwidth] needs to be refreshed.  No protocol
   changes are necessary if the knobs are implemented as recommended.
   The same purpose might alternatively be achieved with complicated
   policy frameworks, but that is only a conjecture.

7.  Operational Considerations

   Note that these solutions are also applicable to many address
   families, such as L3VPN [RFC2547], IPv4 labeled unicast [RFC8277],
   and EVPN [RFC7432].

8.  Security Considerations

   This document raises no new security issues.

9.  Acknowledgements

   Viral Patel did substantial work on an implementation along with
   the first author.  The authors would like to thank Acee Lindem and
   Jakob Heitz for their help in reviewing the draft and their
   valuable suggestions.  The authors would also like to thank Shyam
   Sethuram, Sameer Gulrajani, Nitin Kumar, Keyur Patel, and Juan
   Alcaide for discussions related to the draft.

10.  References

10.1.  Normative References

   [I-D.ietf-idr-link-bandwidth]
              Mohapatra, P. and R. Fernando, "BGP Link Bandwidth
              Extended Community", draft-ietf-idr-link-bandwidth-06
              (work in progress), January 2013.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

10.2.  Informative References

   [RFC2547]  Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547,
              DOI 10.17487/RFC2547, March 1999,
              <https://www.rfc-editor.org/info/rfc2547>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC8277]  Rosen, E., "Using BGP to Bind MPLS Labels to Address
              Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017,
              <https://www.rfc-editor.org/info/rfc8277>.

Authors' Addresses

   Satya Ranjan Mohanty
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: satyamoh@cisco.com

   Aaron Millisor
   Cisco Systems
   170 W. Tasman Drive
   San Jose, CA  95134
   USA

   Email: amilliso@cisco.com

   Arie Vayner
   Google
   1600 Amphitheatre Pkwy
   Mountain View, CA  94043
   USA

   Email: avayner@google.com