idnits 2.17.1 

draft-ietf-bess-bgp-multicast-controller-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 2 instances of lines with non-RFC6890-compliant IPv4 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (September 22, 2020) is 1304 days in the past.  Is
     this intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC5331' is mentioned on line 249, but not defined

  == Missing Reference: 'RFC 7752' is mentioned on line 332, but not defined

  ** Obsolete undefined reference: RFC 7752 (Obsoleted by RFC 9552)

  == Unused Reference: 'RFC6513' is defined on line 872, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-07) exists of
     draft-ietf-bess-bgp-multicast-02

  == Outdated reference: A later version (-26) exists of
     draft-ietf-idr-segment-routing-te-policy-09

  == Outdated reference: A later version (-22) exists of
     draft-ietf-idr-tunnel-encaps-17

  == Outdated reference: A later version (-11) exists of
     draft-ietf-idr-wide-bgp-communities-05


     Summary: 1 error (**), 0 flaws (~~), 9 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	BESS                                                            Z. Zhang
3	Internet-Draft                                          Juniper Networks
4	Intended status: Standards Track                               R. Raszuk
5	Expires: March 26, 2021                                     Bloomberg LP
6	                                                              D. Pacella
7	                                                                 Verizon
8	                                                                A. Gulko
9	                                                               Refinitiv
10	                                                      September 22, 2020

12	                Controller Based BGP Multicast Signaling
13	              draft-ietf-bess-bgp-multicast-controller-05

15	Abstract

17	   This document specifies a way that one or more centralized
18	   controllers can use BGP to set up a multicast distribution tree in a
19	   network.  In the case of a labeled tree, the labels are assigned by
20	   the controllers either from the controllers' local label spaces, or
21	   from a common Segment Routing Global Block (SRGB), or from each
22	   router's Segment Routing Local Block (SRLB) that the controllers
23	   learn.  In the case of a labeled unidirectional tree and label
24	   allocation from the common SRGB or from the controllers' local
25	   spaces, a single common label can be used by all routers on the tree
26	   to send and receive traffic.  Since the controllers calculate the
27	   trees, they can use sophisticated algorithms and constraints to
28	   achieve traffic engineering.

30	Requirements Language

32	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
33	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
34	   "OPTIONAL" in this document are to be interpreted as described in BCP
35	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
36	   capitals, as shown here.

38	Status of This Memo

40	   This Internet-Draft is submitted in full conformance with the
41	   provisions of BCP 78 and BCP 79.

43	   Internet-Drafts are working documents of the Internet Engineering
44	   Task Force (IETF).  Note that other groups may also distribute
45	   working documents as Internet-Drafts.  The list of current Internet-
46	   Drafts is at https://datatracker.ietf.org/drafts/current/.

48	   Internet-Drafts are draft documents valid for a maximum of six months
49	   and may be updated, replaced, or obsoleted by other documents at any
50	   time.  It is inappropriate to use Internet-Drafts as reference
51	   material or to cite them other than as "work in progress."

53	   This Internet-Draft will expire on March 26, 2021.

55	Copyright Notice

57	   Copyright (c) 2020 IETF Trust and the persons identified as the
58	   document authors.  All rights reserved.

60	   This document is subject to BCP 78 and the IETF Trust's Legal
61	   Provisions Relating to IETF Documents
62	   (https://trustee.ietf.org/license-info) in effect on the date of
63	   publication of this document.  Please review these documents
64	   carefully, as they describe your rights and restrictions with respect
65	   to this document.  Code Components extracted from this document must
66	   include Simplified BSD License text as described in Section 4.e of
67	   the Trust Legal Provisions and are provided without warranty as
68	   described in the Simplified BSD License.

70	Table of Contents

72	   1.  Overview
73	     1.1.  Introduction
74	     1.2.  Resilience
75	     1.3.  Signaling
76	     1.4.  Label Allocation
77	       1.4.1.  Using a Common per-tree Label for All Routers
78	       1.4.2.  Upstream-assignment from Controller's Local Label
79	               Space
80	     1.5.  Determining Root/Leaves
81	       1.5.1.  PIM-SSM/Bidir or mLDP
82	       1.5.2.  PIM ASM
83	     1.6.  Multiple Domains
84	     1.7.  SR-P2MP
85	   2.  Specification
86	     2.1.  Enhancements to TEA
87	       2.1.1.  Any-Encapsulation Tunnel
88	       2.1.2.  Load-balancing Tunnel
89	       2.1.3.  Receiving MPLS Label Stack
90	       2.1.4.  RPF Sub-TLV
91	       2.1.5.  Tree Label Stack sub-TLV
92	       2.1.6.  Backup Tunnel sub-TLV
93	     2.2.  Context Label TLV in BGP-LS Node Attribute
94	     2.3.  SR P2MP Signaling
95	       2.3.1.  S-PMSI A-D Route for SR P2MP
96	       2.3.2.  BGP Community Container for SR P2MP Policy
97	       2.3.3.  SR Policy Tunnel Type
98	   3.  Procedures
99	   4.  Security Considerations
100	   5.  IANA Considerations
101	   6.  Acknowledgements
102	   7.  References
103	     7.1.  Normative References
104	     7.2.  Informative References
105	   Authors' Addresses

107	1.  Overview

109	1.1.  Introduction

111	   [I-D.ietf-bess-bgp-multicast] describes a way to use BGP as a
112	   replacement signaling for PIM [RFC7761] or mLDP [RFC6388].  The BGP-
113	   based multicast signaling described there provides a mechanism for
114	   setting up both (s,g)/(*,g) multicast trees (as PIM does, but
115	   optionally with labels) and labeled (MPLS) multicast tunnels (as mLDP
116	   does).  Each router on a tree performs essentially the same
117	   procedures as it would perform if using PIM or mLDP, but all the
118	   inter-router signaling is done using BGP.

120	   These procedures allow the routers to set up a separate tree for each
121	   individual multicast (x,g) flow where the 'x' could be either 's' or
122	   '*', but they also allow the routers to set up trees that are used
123	   for more than one flow.  In the latter case, the trees are often
124	   referred to as "multicast tunnels" or "multipoint tunnels", and
125	   specifically in this document they are mLDP tunnels (except that they
126	   are set up with BGP signaling).  While it actually does not have to
127	   be restricted to mLDP tunnels, mLDP FEC is conveniently borrowed to
128	   identify the tunnel.  In the rest of the document, the terms tree and
129	   tunnel are used interchangeably.

131	   The trees/tunnels are set up using the "receiver-initiated join"
132	   technique of PIM/mLDP, hop by hop from downstream routers towards the
133	   root.  The BGP messages are either sent hop by hop between downstream
134	   routers and their upstream neighbors, or can be reflected by Route
135	   Reflectors (RRs).

137	   As an alternative to each hop independently determining its upstream
138	   router and signaling upstream towards the root (following PIM/mLDP
139	   model), the entire tree can be calculated by a centralized
140	   controller, and the signaling can be entirely done from the
141	   controller, using the same BGP messages as defined in
142	   [I-D.ietf-bess-bgp-multicast].  For that, some additional procedures
143	   and optimizations are specified in this document.

145	   While it is outside the scope of this document, signaling from the
146	   controllers could be done via other means as well, like Netconf or
147	   any other SDN methods.

149	1.2.  Resilience

151	   Each router could establish direct BGP sessions with one or more
152	   controllers, or it could establish BGP sessions with RRs who in turn
153	   peer with controllers.  For the same tree/tunnel, each controller may
154	   independently calculate the tree/tunnel and signal the routers on the
155	   tree/tunnel using MCAST-TREE Leaf A-D routes
156	   [I-D.ietf-bess-bgp-multicast].  How the tree/tunnel roots/leaves are
157	   discovered and how the calculation is done are outside the scope of
158	   this document.

160	   On each router, BGP route selection rules will lead to one
161	   controller's route for the tree/tunnel being selected as the active
162	   route and used for setting up forwarding state.  As long as all the
163	   routers on a tree/tunnel consistently pick the same controller's
164	   routes for the tree/tunnel, the setup should be consistent.  If the
165	   tree/tunnel is labeled, different labels will be used from different
166	   controllers so there is no traffic loop issue even if the routers do
167	   not consistently select the same controller's routes.  In the
168	   unlabeled case, to ensure consistency, the selection SHOULD be
169	   solely based on the identifier of the controller, which could be
170	   carried in an Address Specific Extended Community (EC).
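
   The identifier-only selection for the unlabeled case can be sketched
   as follows.  This is a minimal illustration, not the draft's
   procedure: the ControllerRoute type, its fields, and the "lowest
   identifier wins" tie-break are all assumptions.  The draft only asks
   that the choice depend solely on the controller identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerRoute:
    """Hypothetical view of one controller's Leaf A-D route for a tree."""
    controller_id: int     # assumed to come from an Address Specific EC
    tree: str              # hypothetical tree/tunnel key
    local_pref: int = 100  # deliberately ignored by the selection below


def select_route(candidates):
    """Pick one controller's route using only the controller identifier.

    Assumption: the lowest identifier wins. Any rule works as long as
    every router applies the same one.
    """
    return min(candidates, key=lambda r: r.controller_id)


# Two routers see the same controllers with different other attributes,
# yet both end up on the same controller's route.
seen_by_r1 = [ControllerRoute(20, "T", 200), ControllerRoute(10, "T", 50)]
seen_by_r2 = [ControllerRoute(10, "T", 300), ControllerRoute(20, "T", 10)]
chosen = {select_route(seen_by_r1).controller_id,
          select_route(seen_by_r2).controller_id}
```

   Because attributes such as local preference are ignored, two routers
   that receive the same set of controller routes always agree.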

172	   Another consistency issue is when a bidirectional tree/tunnel needs
173	   to be re-routed.  Because this is no longer triggered hop-by-hop from
174	   downstream to upstream, it is possible that the upstream change
175	   happens before the downstream, causing a traffic loop.  In the
176	   unlabeled case, there is no good solution (other than that the
177	   controller issues upstream change only after it gets acknowledgement
178	   from downstream).  In the labeled case, as long as a new label is
179	   used there should be no problem.

181	   Besides the traffic loop issue, there could be transient traffic loss
182	   before both the upstream's and downstream's forwarding state are
183	   updated.  This could be mitigated if the upstream keeps sending
184	   traffic on the old path (in addition to the new path) and the
185	   downstream keeps accepting traffic on the old path (but not on the
186	   new path) for some time.  It is a local matter when the downstream
187	   switches to the new path - it could be data driven (e.g., after
188	   traffic arrives on the new path) or timer driven.
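
   The downstream side of this mitigation might look like the sketch
   below.  The DownstreamSwitch class, its parameters, and the choice to
   accept the first packet seen on the new path are assumptions made for
   illustration; the draft leaves the switchover policy as a local
   matter.

```python
class DownstreamSwitch:
    """Tracks which path a downstream router currently accepts traffic on."""

    def __init__(self, old_path, new_path, deadline=None, data_driven=True):
        self.active = old_path          # keep accepting the old path first
        self.new_path = new_path
        self.deadline = deadline        # hypothetical absolute switch time
        self.data_driven = data_driven  # switch when new-path traffic shows up

    def accept(self, path, now):
        """Return True if a packet arriving on `path` at time `now` is accepted."""
        if self.active != self.new_path:
            timer_fired = self.deadline is not None and now >= self.deadline
            data_seen = self.data_driven and path == self.new_path
            if timer_fired or data_seen:
                self.active = self.new_path
        # Only one path is accepted at any time, so no duplicates are delivered.
        return path == self.active
```

   For example, with data_driven=True the router accepts "old" until the
   first packet arrives on "new", and rejects "old" from then on.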

190	   For each tree, multiple disjoint instances could be calculated and
191	   signaled for live-live protection.  Different labels are used for
192	   different instances, so that the leaves can differentiate incoming
193	   traffic on different instances.  As far as transit routers are
194	   concerned, the instances are just independent.  Note that the two
195	   instances are not expected to share common transit routers (it is
196	   otherwise outside the scope of this document/revision).

198	1.3.  Signaling

200	   Each router only receives Leaf A-D routes from the controllers but
201	   does not originate or re-advertise S-PMSI/Leaf A-D routes.  The re-
202	   advertisement of a received route can be blocked based on the fact
203	   that a configured import RT matches the RT of the route, which
204	   indicates that this router is the target and consumer of the route
205	   hence it should not be re-advertised further.  The routes include
206	   the forwarding information in the form of Tunnel Encapsulation
207	   Attributes (TEA) [I-D.ietf-idr-tunnel-encaps], with enhancements
208	   specified in this document.

210	   Suppose that for a particular tree, there are two downstream routers
211	   D1 and D2 for a particular upstream router U.  A controller C may
212	   send two Leaf A-D routes to U, as if the two routes were originated
213	   by D1 and D2 but reflected by the controller.  Alternatively, C could
214	   just send one route to U, with the Upstream Router's IP Address field
215	   set to U's IP address and the TEA specifying both downstreams
216	   and its upstream (see Section 2.1.4).  In this case, the Originating
217	   Router's Address field of the Leaf A-D route is set to the
218	   controller's address.  Note that for a TEA attached to a unicast
219	   NLRI, only one of the tunnels in a TEA is used for forwarding a
220	   particular packet, while all the tunnels in a TEA are used to reach
221	   multiple endpoints when it is attached to a multicast NLRI.

223	   Note that, in case of labeled trees, the (x,g) or mLDP FEC signaling
224	   is actually not needed on transit routers but only on the tunnel
225	   root/leaves.  However, for consistency, the same signaling is used
226	   for all routers.

228	1.4.  Label Allocation

230	   In the case of labeled multicast signaled hop by hop towards the
231	   root, whether it's (x,g) multicast or "mLDP" tunnel, labels are
232	   assigned by a downstream router and advertised to its upstream router
233	   (from traffic direction point of view).  In the case of controller
234	   based signaling, routers do not originate tree join (S-PMSI/Leaf A-D)
235	   routes anymore, so the controllers have to assign labels on behalf of
236	   routers, and there are three options for label assignment:

238	   o  From each router's SRLB that the controller learns

240	   o  From the common SRGB that the controller learns

241	   o  From the controller's local label space

243	   Assignment from each router's SRLB is no different from each router
244	   assigning labels from its own local label space in the hop-by-hop
245	   signaling case.  The assignments for a router are independent of
246	   assignments for another router, even for the same tree.

248	   Assignment from the controller's local label space is upstream-
249	   assigned [RFC5331].  It is used if the controller does not learn the
250	   common SRGB or each router's SRLB.  Assignment from the SRGB
251	   [RFC8402] is only meaningful if all SRGBs are the same and a single
252	   common label is used for all the routers on a tree in case of
253	   unidirectional tree/tunnel (Section 1.4.1).  Otherwise, assignment
254	   from SRLB is preferred.

256	   The choice of which of the options to use depends on many factors.
257	   An operator may want to use a single common label per tree for ease
258	   of monitoring and debugging, but that requires explicit RPF checking
259	   and either SRGB or upstream assigned labels, which may not be
260	   supported due to either software or hardware limitations (e.g.,
261	   label imposition/disposition limits).  In an SR network, assignment
262	   from the common SRGB (if a single common label per unidirectional
263	   tree is required) or otherwise from the SRLB is a good choice,
264	   because neither requires support for context label spaces.
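
   One possible reading of the selection logic in this section is
   sketched below.  The function name, its parameters, and the fallback
   to the controller's local space when no common SRGB exists are
   assumptions; a real controller would also weigh the platform limits
   mentioned above.

```python
from enum import Enum


class LabelSource(Enum):
    SRLB = "each router's SRLB"
    SRGB = "common SRGB"
    CONTROLLER_LOCAL = "controller's local label space"


def choose_label_source(srgbs, srlb_known, unidirectional, want_common_label):
    """Choose where the controller allocates labels from for one tree.

    srgbs: (start, end) SRGB ranges learned per router, empty if none learned.
    srlb_known: True if the controller learned each router's SRLB.
    """
    common_srgb = bool(srgbs) and len(set(srgbs)) == 1
    if want_common_label and unidirectional:
        # A single per-tree label needs the common SRGB or upstream assignment.
        return LabelSource.SRGB if common_srgb else LabelSource.CONTROLLER_LOCAL
    if srlb_known:
        return LabelSource.SRLB
    # Nothing learned from the routers: fall back to upstream assignment.
    return LabelSource.CONTROLLER_LOCAL
```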

266	1.4.1.  Using a Common per-tree Label for All Routers

268	   MPLS labels only have local significance.  For an LSP that goes
269	   through a series of routers, each router allocates a label
270	   independently and it swaps the incoming label (that it advertised to
271	   its upstream) to an outgoing label (that it received from its
272	   downstream) when it forwards a labeled packet.  Even if the incoming
273	   and outgoing labels happen to be the same on a particular router,
274	   that is just incidental.

276	   With Segment Routing, it is becoming a common practice that all
277	   routers use the same SRGB so that a SID maps to the same label on all
278	   routers.  This makes it easier for operators to monitor and debug
279	   their network.  The same concept applies to multicast trees as well -
280	   a common per-tree label is used for a router to receive traffic from
281	   its upstream neighbor and replicate traffic to all its downstream
282	   neighbors.

284	   However, a common per-tree label can only be used for unidirectional
285	   trees.  Additionally, it requires each router to do explicit RPF
286	   check, so that only packets from its expected upstream neighbor are
287	   accepted.  Otherwise, traffic loops may form during topology
288	   changes, because the forwarding state update is no longer ordered.

290	   Traditionally, P2MP MPLS forwarding does not require explicit RPF
291	   check as a downstream router advertises a label only to its upstream
292	   router and all traffic with that incoming label is presumed to be
293	   from the upstream router and accepted.  When a downstream router
294	   switches to a different upstream router a different label will be
295	   advertised, so it can determine if traffic is from its expected
296	   upstream neighbor purely based on the label.  Now with a single
297	   common label used for all routers on a tree to send and receive
298	   traffic with, a router can no longer determine if the traffic is from
299	   its expected neighbor just based on that common tree label.
300	   Therefore, explicit RPF check is needed.  Instead of interface based
301	   RPF checking as in PIM case, neighbor based RPF checking is used - a
302	   label identifying the upstream neighbor precedes the tree label and
303	   the receiving router checks if that preceding neighbor label matches
304	   its expected upstream neighbor.  Notice that this is similar to
305	   what's described in Section "9.1.1 Discarding Packets from Wrong PE"
306	   of RFC 6513 (an egress PE discards traffic sent from a wrong ingress
307	   PE).  The only difference is one is used for label based forwarding
308	   and the other is used for (s,g) based forwarding. [note: for
309	   bidirectional trees, we may be able to use two labels per tree - one
310	   for upstream traffic and one for downstream traffic.  This needs
311	   further verification].
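
   The neighbor based RPF check described above can be illustrated with
   a small sketch.  The function name and the label values in the usage
   note are hypothetical, and the stack is modeled as a plain list with
   the outermost label first.

```python
def rpf_accept(label_stack, tree_label, expected_neighbor_label):
    """Neighbor based RPF check for a common per-tree label.

    label_stack is outermost-first and is expected to end with
    [neighbor_label, tree_label]: the neighbor label precedes the tree label.
    """
    if len(label_stack) < 2:
        return False  # no neighbor label present, so the sender is unknown
    neighbor_label, inner_label = label_stack[-2], label_stack[-1]
    return (inner_label == tree_label
            and neighbor_label == expected_neighbor_label)
```

   With tree label 2000 and expected upstream neighbor label 1001, a
   packet carrying [1001, 2000] is accepted while one carrying
   [1002, 2000] is dropped, even though both use the same tree label.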

313	   Both the common per-tree label and the neighbor label are allocated
314	   either from the common SRGB or from the controller's local label
315	   space.  In the latter case, an additional label identifying the
316	   controller's label space is needed, as described in the following
317	   section.

319	1.4.2.  Upstream-assignment from Controller's Local Label Space

321	   In this case, in the multicast packet's label stack, the tree label
322	   and upstream neighbor label (if used, in the case of a single common
323	   label per tree) are preceded by a downstream-assigned "context
324	   label".  The context label identifies a context-specific label space
325	   (the controller's local label space), and the upstream-assigned
326	   label that follows it is looked up in that space.

328	   This specification requires that, in case of upstream-assignment from
329	   a controller's local label space, each router D assign,
330	   corresponding to each controller C, a context label that identifies
331	   the upstream-assigned label space used by that controller.  This
332	   label, call it Lc-D, is communicated by D to C via BGP-LS [RFC 7752].

334	   Suppose a controller is setting up unidirectional tree T.  It assigns
335	   that tree the label Lt, and assigns label Lu to identify router U
336	   which is the upstream of router D on tree T.  C needs to tell U: "to
337	   send a packet on the given tree/tunnel, one of the things you have to
338	   do is push Lt onto the packet's label stack, then push Lu, then push
339	   Lc-D onto the packet's label stack, then unicast the packet to D".
340	   Controller C also needs to inform router D of the correspondence
341	   between <Lc-D, Lu, Lt> and tree T.

343	   To achieve that, when C sends a Leaf A-D route, for each tunnel in
344	   the TEA, it includes a label stack Sub-TLV
345	   [I-D.ietf-idr-tunnel-encaps], with the outer label being the context
346	   label Lc-D (received by the controller from the corresponding
347	   downstream), the next label being the upstream neighbor label Lu, and
348	   the inner label being the label Lt assigned by the controller for the
349	   tree.  The router receiving the route will use the label stacks to
350	   send traffic to its downstreams.
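
   The push order above (Lt first, then Lu, then Lc-D) yields a stack
   whose outermost label is the context label.  The sketch below shows
   that ordering; the function name, the router names, and all label
   values are hypothetical.

```python
def build_outgoing_stack(lt, lu=None, lc_d=None):
    """Return the label stack U pushes for one downstream, outermost first."""
    stack = []
    stack.insert(0, lt)        # push the tree label Lt
    if lu is not None:
        stack.insert(0, lu)    # then push the upstream neighbor label Lu
    if lc_d is not None:
        stack.insert(0, lc_d)  # finally push D's context label Lc-D
    return stack


# Hypothetical context labels that D1 and D2 advertised to the controller.
context_labels = {"D1": 1001, "D2": 1002}
stacks = {d: build_outgoing_stack(lt=3000, lu=2000, lc_d=lc)
          for d, lc in context_labels.items()}
```

   Here stacks["D1"] is [1001, 2000, 3000]: only the context label
   differs per downstream, while Lu and Lt are shared.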

352	   For C to signal the expected label stack for D to receive traffic
353	   with, we overload a tunnel TLV in the TEA of the Leaf A-D route sent
354	   to D - if the tunnel TLV has an RPF sub-TLV (Section 2.1.4), then it
355	   indicates that this is actually for receiving traffic from the
356	   upstream.

358	1.5.  Determining Root/Leaves

360	   For the controller to calculate a tree, it needs to determine the
361	   root and leaves of the tree.  This may be based on provisioning
362	   (static or dynamically programmed), or based on BGP signaling using
363	   the BGP multicast messages defined in [I-D.ietf-bess-bgp-multicast],
364	   as described in the following two sections.

366	   In both cases, the BGP updates are targeted at the controller, via an
367	   address specific Route Target with Global Administration Field set to
368	   the controller's address and the Local Administration Field set to 0,
369	   or a value pre-assigned to identify a VPN.
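
   As an illustration, such a Route Target can be encoded as an IPv4
   Address Specific Extended Community (type 0x01, sub-type 0x02, per
   RFC 4360).  The function name and the example address are
   assumptions, and the sketch covers IPv4 controllers only.

```python
import ipaddress
import struct


def controller_route_target(controller_addr, local_admin=0):
    """Encode an IPv4-address-specific Route Target as 8 octets.

    Global Administrator = controller's IPv4 address (4 octets),
    Local Administrator  = 0 or a value pre-assigned to a VPN (2 octets).
    """
    addr = ipaddress.IPv4Address(controller_addr)
    return struct.pack("!BB4sH", 0x01, 0x02, addr.packed, local_admin)


# 192.0.2.1 is a documentation address standing in for the controller.
rt = controller_route_target("192.0.2.1")
```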

371	1.5.1.  PIM-SSM/Bidir or mLDP

373	   In this case, the PIM Last Hop Routers (LHRs) with interested
374	   receivers or mLDP tunnel leaves encode a Leaf A-D route with the
375	   Upstream Router's IP Address field set to the controller's address
376	   and the Originating Router's IP Address set to the address of the LHR
377	   or the P2MP tunnel leaf.  The encoded PIM SSM source or mLDP FEC
378	   provides root information and the Originating Router's IP Address
379	   provides leaves information.

381	1.5.2.  PIM ASM

383	   In this case, the First Hop Routers (FHRs) originate Source Active
384	   routes which provide root information, and the LHRs originate Leaf
385	   A-D routes, encoded as in the PIM-SSM case except that it is (*,G)
386	   instead of (S,G).  The Leaf A-D routes provide leaf information.

388	1.6.  Multiple Domains

390	   An end to end multicast tree may span multiple routing domains, and
391	   the setup of the tree in each domain may be done differently as
392	   specified in [I-D.ietf-bess-bgp-multicast].  This section discusses a
393	   few aspects specific to controller signaling.

395	   Consider two adjacent domains each with its own controller in the
396	   following configuration where router B is an upstream node of C for a
397	   multicast tree:

399	                            |
400	                  domain 1  |  domain 2
401	                            |
402	                   ctrlr1   |   ctrlr2
403	                     /\     |     /\
404	                    /  \    |    /  \
405	                   /    \   |   /    \
406	                  A--...-B--|--C--...-D
407	                            |

409	   In the case of native (un-labeled) IP multicast, nothing special is
410	   needed.  Controller 1 signals B to send traffic out of B-C link while
411	   Controller 2 signals C to accept traffic on the B-C link.

413	   In the case of labeled IP multicast or mLDP tunnel, the controllers
414	   may be able to coordinate their actions such that Controller 1
415	   signals B to send traffic out of B-C link with label X while
416	   Controller 2 signals C to accept traffic with the same label X on the
417	   B-C link.  If the coordination is not possible, then C needs to use
418	   hop-by-hop BGP signaling to signal towards B, as specified in
419	   [I-D.ietf-bess-bgp-multicast].

421	   The configuration could also be as follows, where router B borders
422	   both domain 1 and domain 2 and is controlled by both controllers:

424	                          |
425	                 domain 1 | domain 2
426	                          |
427	                   ctrlr1 | ctrlr2
428	                     /\   |   /\
429	                    /  \  |  /  \
430	                   /    \ | /    \
431	                  /      \|/      \
432	                 A--...---B--...---C
433	                          |

435	   As discussed in Section 1.2, when B receives signaling from both
436	   Controller 1 and Controller 2, only one of the routes would be
437	   selected as the best route and used for programming the forwarding
438	   state of the corresponding segment.  For B to stitch the two segments
439	   together, it is expected for B to know by provisioning that it is a
440	   border router so that B will look for the other segment (represented
441	   by the signaling from the other controller) and stitch the two
442	   together.

444	1.7.  SR-P2MP

446	   [I-D.voyer-pim-sr-p2mp-policy] describes an architecture to construct
447	   a Point-to-Multipoint (P2MP) tree to deliver Multi-point services in
448	   a Segment Routing domain.  An SR P2MP tree is constructed by
449	   stitching together a set of Replication Segments that are specified
450	   in [I-D.voyer-spring-sr-replication-segment].  An SR Point-to-
451	   Multipoint (SR P2MP) Policy is used to define and instantiate a P2MP
452	   tree which is computed by a controller.

454	   An SR P2MP tree is no different from an mLDP tunnel in MPLS
455	   forwarding plane.  The difference is in control plane - instead of
456	   hop-by-hop mLDP signaling from leaves towards the root, to set up SR
457	   P2MP trees controllers program forwarding state (referred to as
458	   Replication Segments) to the root, leaves, and intermediate
459	   replication points using Netconf, PCEP, BGP or any other reasonable
460	   signaling/programming methods.

462	   Procedures in this document can be used for controllers to set up SR
463	   P2MP trees with just an additional S-PMSI route type.

465	   If/once the SR Replication Segment is extended to be bidirectional,
466	   and SR MP2MP is introduced, the same procedures in this document
467	   would apply to SR MP2MP as well.

469	2.  Specification

471	2.1.  Enhancements to TEA

473	   This document specifies two new Tunnel Types and four new sub-TLVs.
474	   The type codes will be assigned by IANA from the "BGP Tunnel
475	   Encapsulation Attribute Tunnel Types".

477	2.1.1.  Any-Encapsulation Tunnel

479	   When a multicast packet needs to be sent from an upstream node to a
480	   downstream node, it may not matter how it is sent - natively when the
481	   two nodes are directly connected or tunneled otherwise.  In case of
482	   tunneling, it may not matter what kind of tunnel is used - MPLS, GRE,
483	   IPinIP, or whatever.

485	   To support this, an "Any-Encapsulation" tunnel type is defined.  This
486	   tunnel MUST have a Tunnel Endpoint Sub-TLV and SHOULD NOT have any
487	   other Sub-TLVs.  The Tunnel Endpoint Sub-TLV specifies an IP address,
488	   which could be any of the following:

490	   o  An interface's local address - when a packet needs to be sent out
491	      of the corresponding interface natively.  On a LAN, a multicast
492	      MAC address MUST be used.

494	   o  A directly connected neighbor's interface address - when a packet
495	      needs to be unicast to the address natively.

497	   o  An address that is not directly connected - when a packet needs to
498	      be tunneled to the address (any tunnel type/instance can be used).
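
   The three cases above could be distinguished roughly as follows.
   This is a sketch under stated assumptions: the interface table
   layout and names are hypothetical, and "directly connected" is
   approximated as "the endpoint falls in a local interface's subnet".

```python
import ipaddress


def classify_endpoint(endpoint, local_interfaces):
    """Decide how to send toward an Any-Encapsulation Tunnel Endpoint.

    local_interfaces: hypothetical map of name -> (local address, prefix).
    Returns (action, interface name or None).
    """
    ep = ipaddress.ip_address(endpoint)
    # Case 1: the endpoint is one of our own interface addresses.
    for name, (addr, _prefix) in local_interfaces.items():
        if ep == ipaddress.ip_address(addr):
            return ("send-native", name)
    # Case 2: the endpoint is a neighbor on a directly connected subnet.
    for name, (_addr, prefix) in local_interfaces.items():
        if ep in ipaddress.ip_network(prefix):
            return ("unicast-native", name)
    # Case 3: not directly connected, so any tunnel type/instance may be used.
    return ("tunnel", None)


interfaces = {"if0": ("192.0.2.1", "192.0.2.0/24")}
```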

500	2.1.2.  Load-balancing Tunnel

502	   Consider that a multicast packet needs to be sent to a downstream
503	   node, which could be reached via four paths P1~P4.  If it does not
504	   matter which path is taken, an "Any-Encapsulation" tunnel with the
505	   Tunnel Endpoint Sub-TLV specifying the downstream node's loopback
506	   address works well.  If the controller wants to specify that only
507	   P1~P2 should be used, then a "Load-balancing" tunnel needs to be
508	   used, listing P1 and P2 as member tunnels of the "Load-balancing"
509	   tunnel.

511	   A load-balancing tunnel has one "Member Tunnels" Sub-TLV defined in
512	   this document.  The Sub-TLV is a list of tunnels, each specifying a
513	   way to reach the downstream.  A packet will be sent out of one of the
514	   tunnels listed in the Member Tunnels Sub-TLV of the load-balancing
515	   tunnel.
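
   How a router picks one member tunnel is not specified here.  The
   sketch below assumes per-flow hashing with CRC32, which is only one
   plausible choice; the function name and the flow key format are
   hypothetical.

```python
import zlib


def pick_member_tunnel(member_tunnels, flow_key):
    """Pick one tunnel from the Member Tunnels list for a given flow.

    flow_key: bytes identifying the flow (hypothetical, e.g. b"S,G").
    Hashing keeps a flow on one member, which avoids reordering.
    """
    if not member_tunnels:
        raise ValueError("a load-balancing tunnel needs at least one member")
    index = zlib.crc32(flow_key) % len(member_tunnels)
    return member_tunnels[index]


# The controller listed only P1 and P2, so P3 and P4 are never chosen.
members = ["P1", "P2"]
```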

517	2.1.3.  Receiving MPLS Label Stack

519	   While [I-D.ietf-bess-bgp-multicast] uses S-PMSI A-D routes to signal
520	   forwarding information for MP2MP upstream traffic, when controller
521	   signaling is used, a single Leaf A-D route is used for both upstream
522	   and downstream traffic.  Since different upstream and downstream
523	   labels need to be used, a new "Receiving MPLS Label Stack" of type
524	   TBD is added as a tunnel sub-TLV in addition to the existing MPLS
525	   Label Stack sub-TLV.  Other than the type difference, the two are
526	   encoded the same way.

528	   The Receiving MPLS Label Stack sub-TLV is added to each downstream
529	   tunnel in the TEA of Leaf A-D route for an MP2MP tunnel to specify
530	   the forwarding information for upstream traffic from the
531	   corresponding downstream node.  A label stack instead of a single
532	   label is used because of the need for neighbor based RPF check, as
533	   further explained in the following section.

535	   The Receiving MPLS Label Stack sub-TLV is also used for downstream
536	   traffic from the upstream for both P2MP and MP2MP, as specified
537	   below.

539	2.1.4.  RPF Sub-TLV

541	   The RPF sub-TLV has a type to be allocated by IANA and a one-octet
542	   length.  The length is 0 currently, but if necessary in the future,
543	   sub-sub-TLVs could be placed in its value part.  If the RPF sub-TLV
544	   appears in a tunnel, it indicates that the "tunnel" is for the
545	   upstream node instead of a downstream node.  The tunnel contains a
546	   Receiving MPLS Label Stack sub-TLV for downstream traffic from the
547	   upstream node, and in case of MP2MP it also contains a regular MPLS
548	   Label Stack sub-TLV for upstream traffic to the upstream node.

550	   The innermost label in the Receiving MPLS Label Stack is the
551	   incoming label identifying the tree (for comparison, the innermost
552	   label for a regular MPLS Label Stack is the outgoing label).  If the
553	   Receiving MPLS Label Stack sub-TLV has more than one label, the
554	   second innermost label in the stack identifies the expected upstream
555	   neighbor, and explicit RPF checking needs to be set up for the tree
556	   label accordingly.
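
   A minimal sketch of how a node might interpret the Receiving MPLS
   Label Stack for the RPF check is shown below.  It assumes label
   lists are ordered outermost first and that transport labels have
   already been removed from the received packet; the function names
   are hypothetical.

```python
def parse_receiving_stack(labels):
    """Split a Receiving MPLS Label Stack into (tree, neighbor) labels.

    labels is ordered outermost first, so the innermost label is last.
    The neighbor label is None when the stack has a single label.
    """
    if not labels:
        raise ValueError("empty label stack")
    tree_label = labels[-1]
    neighbor_label = labels[-2] if len(labels) > 1 else None
    return tree_label, neighbor_label


def rpf_accept(expected_stack, received_stack):
    """Return True if the received labels pass the RPF check."""
    tree_label, neighbor_label = parse_receiving_stack(expected_stack)
    if not received_stack or received_stack[-1] != tree_label:
        return False
    if neighbor_label is None:
        return True  # no neighbor-based check was requested
    return len(received_stack) > 1 and received_stack[-2] == neighbor_label
```

   For example, with an expected stack of [300, 17], a packet carrying
   [300, 17] is accepted, while one carrying [301, 17] is rejected
   because it comes from an unexpected upstream neighbor.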

558	2.1.5.  Tree Label Stack sub-TLV

560	   The MPLS Label Stack sub-TLV can be used to specify the complete
561	   label stack used to send traffic, with the stack including both a
562	   transport label (stack) and label(s) that identify the (tree,
563	   neighbor) to the downstream node.  There are cases where the
564	   controller only wants to specify the tree-identifying labels but
565	   leave the transport details to the router itself.  For example, the
566	   router could locally determine a transport label (stack) and combine
567	   with the tree-identifying labels signaled from the controller to get
568	   the complete outgoing label stack.

570	   For that purpose, a new Tree Label Stack sub-TLV is defined, with a
571	   one-octet length field.  The value field contains a label stack with
572	   the same encoding as the value part of the MPLS Label Stack sub-TLV,
573	   but the sub-TLV has a different type.  A stack is specified because
574	   it may take up to three labels (see Section 1.4):

576	   o  If different nodes use different labels (allocated from the common
577	      SRGB or the node's SRLB) for a (tree, neighbor) tuple, only a
578	      single label is in the stack.  This is similar to the current
579	      mLDP hop-by-hop signaling case.

581	   o  If different nodes use the same tree label, then an additional
582	      neighbor-identifying label is needed in front of the tree label.

584	   o  For the previous bullet, if the neighbor-identifying label is
585	      allocated from the controller's local label space, then an
586	      additional context label is needed in front of the neighbor label.
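
   The three cases above can be sketched as follows.  The function
   names and label values are hypothetical, and the transport labels
   are assumed to be chosen locally by the router.

```python
def build_tree_label_stack(tree_label, neighbor_label=None, context_label=None):
    """Build a Tree Label Stack of one to three labels, outermost first."""
    if context_label is not None and neighbor_label is None:
        raise ValueError("a context label requires a neighbor label")
    stack = []
    if context_label is not None:
        stack.append(context_label)   # identifies the label space
    if neighbor_label is not None:
        stack.append(neighbor_label)  # identifies the sending neighbor
    stack.append(tree_label)          # innermost: identifies the tree
    return stack


def outgoing_stack(transport_labels, tree_label_stack):
    """Combine locally chosen transport labels with the signaled stack."""
    return list(transport_labels) + list(tree_label_stack)


# Case 3: shared tree label, neighbor label from the controller's space.
stack = build_tree_label_stack(17, neighbor_label=300, context_label=100)
full = outgoing_stack([16005], stack)
```

   Here the router prepends its own transport label (16005 in this
   example) to the tree-identifying labels signaled by the controller.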

588	2.1.6.  Backup Tunnel sub-TLV

590	   The Backup Tunnel sub-TLV is used to specify the backup paths for the
591	   tunnel.  The length field is two octets.  The value part encodes a
592	   one-octet flags field and a variable-length Tunnel Encapsulation
593	   Attribute.  If the tunnel goes down, traffic normally sent out of the
594	   tunnel is fast rerouted to the tunnels listed in the encoded TEA.

596	                  +--------------------------------+
597	                  | Sub-TLV Type (1 Octet, TBD)    |
598	                  +--------------------------------+
599	                  | Sub-TLV Length (2 Octets)      |
600	                  +--------------------------------+
601	                  | P | rest of 1 Octet Flags      |
602	                  +--------------------------------+
603	                  | Backup TEA (variable length)   |
604	                  +--------------------------------+

606	   The backup tunnels can go to the same node as the original tunnel or
607	   to different nodes.

609	   If the tunnel carries an RPF sub-TLV and a Backup Tunnel sub-TLV,
610	   then traffic arriving on either the original tunnel or the tunnels
611	   encoded in the Backup Tunnel sub-TLV's TEA can be accepted, if the
612	   Parallel (P-)bit in the flags field is set.  If the P-bit is not set,
613	   then traffic arriving on the backup tunnel is accepted only if the
614	   router has switched to receiving on the backup tunnel (this is the
615	   equivalent of PIM/mLDP MoFRR).
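
   The encoding and the acceptance rule can be sketched as follows.
   Several details are assumptions: the sub-TLV type is a hypothetical
   placeholder (the real value is TBD), the P-bit is taken to be the
   most significant bit of the flags octet as the figure suggests, and
   the handling of the original tunnel after a switch follows the
   MoFRR analogy rather than explicit text.

```python
import struct

BACKUP_TUNNEL_TYPE = 0xFE  # hypothetical placeholder; real type is TBD
P_BIT = 0x80               # assumption: P is the leftmost flag bit


def encode_backup_tunnel_sub_tlv(parallel, backup_tea):
    """Encode type (1 octet), length (2 octets), flags, and backup TEA."""
    flags = P_BIT if parallel else 0
    value = bytes([flags]) + backup_tea
    return struct.pack("!BH", BACKUP_TUNNEL_TYPE, len(value)) + value


def accept_upstream_traffic(arrived_on_backup, parallel, switched_to_backup):
    """Decide whether traffic on an RPF tunnel or its backup is accepted."""
    if parallel:
        return True  # P-bit set: accept on original and backup tunnels
    # P-bit clear: accept on the backup only after switching to it.
    # Assumption: the original tunnel is accepted only before the switch.
    return arrived_on_backup == switched_to_backup


encoded = encode_backup_tunnel_sub_tlv(True, b"\x01\x02")
```

   The two-octet length means a real type would come from the 128-255
   range of the sub-TLV registry.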

617	2.2.  Context Label TLV in BGP-LS Node Attribute

619	   For a router to signal the context label that it assigns for a
620	   controller (or any label allocator that assigns labels, from its
621	   local label space, that will be received by this router), a new
622	   BGP-LS Node Attribute TLV is defined:

624	       0                   1                   2                   3
625	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
626	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
627	      |               Type            |            Length             |
628	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
629	      |                      Context Label                            |
630	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
631	      |            IPv4/v6 Address of Label Space Owner               |
632	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

634	   The Length field implies the type of the address.  Multiple Context
635	   Label TLVs may be included in a Node Attribute, one for each label
636	   space owner.

638	   As an example, a controller with address 11.11.11.11 allocates label
639	   200 from its own label space, and router A assigns label 100 to
640	   identify this controller's label space.  Router A includes the
641	   Context Label TLV (100, 11.11.11.11) in its BGP-LS Node Attribute.
642	   The controller instructs router B to send traffic to router A with a
643	   label stack (100, 200), and router A uses label 100 to determine the
644	   Label FIB in which to look up label 200.
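
   The example can be sketched in code as follows.  The TLV type is a
   hypothetical placeholder (the real value is to be assigned), the
   context label is assumed to be carried as a plain integer in the
   four-octet field, and the FIB entry name is invented for
   illustration.

```python
import ipaddress
import struct

CONTEXT_LABEL_TLV_TYPE = 0xFFFF  # hypothetical placeholder; real type TBD


def encode_context_label_tlv(context_label, owner_address):
    """Encode Type, Length, Context Label, and the owner's address."""
    address = ipaddress.ip_address(owner_address).packed
    value = struct.pack("!I", context_label) + address
    return struct.pack("!HH", CONTEXT_LABEL_TLV_TYPE, len(value)) + value


def decode_context_label_tlv(data):
    """Decode the TLV; the Length implies IPv4 (8) or IPv6 (20)."""
    _tlv_type, length = struct.unpack("!HH", data[:4])
    value = data[4:4 + length]
    if length not in (8, 20) or len(value) != length:
        raise ValueError("unexpected Context Label TLV length")
    (context_label,) = struct.unpack("!I", value[:4])
    return context_label, ipaddress.ip_address(value[4:])


def lookup_with_context(context_fibs, label_stack):
    """Use the outer label to select a Label FIB, then look up the inner."""
    context_label, inner_label = label_stack[0], label_stack[1]
    return context_fibs[context_label][inner_label]


# Router A: label 100 selects the FIB for the controller's label space.
tlv = encode_context_label_tlv(100, "11.11.11.11")
fibs_on_router_a = {100: {200: "tree-T"}}  # "tree-T" is a made-up entry
entry = lookup_with_context(fibs_on_router_a, [100, 200])
```

   A router that supports several label space owners would carry one
   such TLV, and one context-specific Label FIB, per owner.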

646	2.3.  SR P2MP Signaling

648	   An SR P2MP policy for an SR P2MP tree is identified by a (Root,
649	   Tree-id) tuple.  It has a set of leaves and a set of Candidate Paths
650	   (CPs).  The policy is instantiated on the root of the tree, with
651	   corresponding Replication Segments - identified by (Root, Tree-id,
652	   Tree-Node-id) - instantiated on the tree nodes (root, leaves, and
653	   intermediate replication points).  The Candidate Path is implicitly
654	   identified by the Route Distinguisher.

656	2.3.1.  S-PMSI A-D Route for SR P2MP

658	   With BGP signaled IP multicast trees and mLDP tunnels, the tree/
659	   tunnel identification is encoded in the NLRI of S-PMSI A-D routes and
660	   corresponding Leaf A-D routes.  The signaling sets up forwarding
661	   state on each node of the tree, so the NLRI also contains the
662	   identification of the node in the "Upstream Router's IP Address"
663	   field.

665	   For SR P2MP, forwarding state is represented as Replication Segments
666	   and is signaled from controllers to tree nodes.  A Replication
667	   Segment is identified in a new type of S-PMSI A-D route and
668	   corresponding Leaf A-D route (note that the "Leaf" term here does not
669	   refer to tree leaves):

671	            +-     +-----------------------------------+
672	            |      |    Route Type - 4 (Leaf A-D)      |
673	            |      +-----------------------------------+
674	            |      |     Length (1 octet)              |
675	            | L +- +-----------------------------------+ --+
676	          L | E |  | Route Type - 0x83 (SR P2MP S-PMSI)|   | S
677	          E | A |  +-----------------------------------+   | |
678	          A | F |  |     Length (1 octet)              |   | P
679	          F |   |  +-----------------------------------+   | M
680	            | R |  |      RD   (8 octets)              |   | S
681	            | O |  +-----------------------------------+   | I
682	            | U |  |  Root ID (4 or 16 octets)         |   |
683	          N | T |  +-----------------------------------+   | N
684	          L | E |  |       Tree ID (4 octets)          |   | L
685	          R |   |  +-----------------------------------+   | R
686	          I | K |  |  Upstream Router's IP Address     |   | I
687	            | E |  +-----------------------------------+ --+
688	            | Y |  |  Originating Router's IP Address  |
689	            +-  +- +-----------------------------------+

691	              Leaf A-D route for SR Replication Segment
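
   The layout in the figure can be sketched as an encoder.  This is an
   illustration under assumptions: each Length field is taken to count
   the octets that follow it within its own route, and the RD, tree ID,
   and addresses are made-up values.

```python
import ipaddress
import struct

LEAF_AD_ROUTE_TYPE = 4
SR_P2MP_SPMSI_ROUTE_TYPE = 0x83  # suggested value from this document


def encode_sr_p2mp_leaf_ad_nlri(rd, root, tree_id, upstream, originator):
    """Encode a Leaf A-D NLRI whose route key is an SR P2MP S-PMSI NLRI."""
    if len(rd) != 8:
        raise ValueError("RD must be 8 octets")
    spmsi_body = (
        rd
        + ipaddress.ip_address(root).packed      # Root ID, 4 or 16 octets
        + struct.pack("!I", tree_id)             # Tree ID, 4 octets
        + ipaddress.ip_address(upstream).packed  # Upstream Router's address
    )
    route_key = struct.pack("!BB", SR_P2MP_SPMSI_ROUTE_TYPE, len(spmsi_body))
    route_key += spmsi_body
    body = route_key + ipaddress.ip_address(originator).packed
    return struct.pack("!BB", LEAF_AD_ROUTE_TYPE, len(body)) + body


nlri = encode_sr_p2mp_leaf_ad_nlri(
    rd=bytes(8),             # hypothetical all-zero RD
    root="192.0.2.1",
    tree_id=7,
    upstream="192.0.2.2",
    originator="192.0.2.3",
)
```

   With IPv4 addresses throughout, the inner S-PMSI NLRI body is 20
   octets and the whole Leaf A-D NLRI is 28 octets.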

693	2.3.2.  BGP Community Container for SR P2MP Policy

695	   The Leaf A-D route for Replication Segments signaled to the root is
696	   also used to signal (parts of) the SR P2MP Policy.  The policy name,
697	   the set of leaves (optional, for informational purposes), the
698	   preference of the CP, and other information are encoded in a newly
699	   defined BGP Community Container (BCC)
700	   [I-D.ietf-idr-wide-bgp-communities] called SR P2MP Policy BCC.

702	   The SR P2MP Policy BCC has a BGP Community Container type to be
703	   assigned by IANA.  It is composed of a fixed 4-octet Candidate Path
704	   Preference value, optionally followed by TLVs.

706	        0                   1                   2                   3
707	        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
708	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
709	       |                Candidate Path Preference                      |
710	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
711	       |                                                               |
712	       |                        TLVs (optional)                        |
713	       |                                                               |
714	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

716	               BGP Community Container for SR P2MP Policy

718	   One optional TLV encloses the following optional Atoms TLVs, which
719	   are already defined in [I-D.ietf-idr-wide-bgp-communities]:

721	   o  An IPv4 or IPv6 Prefix list - for the set of leaves

723	   o  A UTF-8 string - for the policy name

725	   If more information for the policy is needed, more Atoms TLVs or SR
726	   P2MP Policy BCC specific TLVs can be defined.

728	   The root receives one Leaf A-D route for each Candidate Path of the
729	   policy.  Only one of the routes needs to include the above listed
730	   optional Atoms TLVs in the SR P2MP Policy BCC, though more than one
731	   MAY do so.
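
   A minimal sketch of the container value described in this section
   follows.  Only the fixed 4-octet Candidate Path Preference is
   modeled; the optional TLVs are treated as opaque bytes because
   their encoding is defined elsewhere.  The function names are
   hypothetical.

```python
import struct


def encode_sr_p2mp_policy_bcc_value(preference, tlvs=b""):
    """Encode the 4-octet preference followed by optional, opaque TLVs."""
    if not 0 <= preference <= 0xFFFFFFFF:
        raise ValueError("preference must fit in 4 octets")
    return struct.pack("!I", preference) + tlvs


def decode_sr_p2mp_policy_bcc_value(data):
    """Return (preference, remaining TLV bytes)."""
    if len(data) < 4:
        raise ValueError("value shorter than the fixed preference field")
    (preference,) = struct.unpack("!I", data[:4])
    return preference, data[4:]


value = encode_sr_p2mp_policy_bcc_value(100)
```

   A route that omits the optional Atoms TLVs carries just the four
   preference octets, as in the value above.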

733	2.3.3.  SR Policy Tunnel Type

735	   The Tunnel Encapsulation Attribute (TEA) attached to Leaf A-D routes
736	   encodes all replication branch information.  For example, if an SR
737	   explicit path is to be used to reach a particular downstream node,
738	   the TEA will include a tunnel that lists the entire label stack for
739	   that SR path, plus the label that identifies the SR P2MP tree to the
740	   downstream node.

742	   That SR path may have been installed on the node as a unicast SR
743	   policy with a corresponding Binding SID.  Instead of listing the
744	   entire label stack in an MPLS tunnel in the TEA, a different tunnel,
745	   SR Policy Tunnel [I-D.ietf-idr-segment-routing-te-policy], can be
746	   used as an alternative.  The tunnel includes a Binding SID sub-TLV,
747	   an optional endpoint sub-TLV that identifies the downstream node, and
748	   an optional one-segment segment list that identifies the SR P2MP tree
749	   to the downstream node.  When a node receives the Leaf A-D route with
750	   the TEA that contains an SR Policy Tunnel without an RPF sub-TLV, the
751	   Binding SID is used to locate the corresponding outgoing segment
752	   lists used to reach the downstream node; the tree-identifying segment
753	   from the optional one-segment segment list is added to the outgoing
754	   segment lists mapped from the Binding SID to form the entire segment
755	   list used to send traffic to the downstream node.
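
   The Binding SID resolution described above can be sketched as
   follows.  The table contents and SID values are made up, and
   placing the tree-identifying segment last in each list is an
   assumption about what "added to" means here.

```python
def build_outgoing_segment_lists(binding_sid_table, binding_sid, tree_segment=None):
    """Resolve a Binding SID and add the tree-identifying segment.

    binding_sid_table maps a Binding SID to the outgoing segment lists
    of a pre-installed unicast SR policy.
    """
    if binding_sid not in binding_sid_table:
        raise KeyError("no SR policy installed for this Binding SID")
    segment_lists = binding_sid_table[binding_sid]
    if tree_segment is None:
        return [list(segments) for segments in segment_lists]
    # Assumption: the tree segment is placed last, so the downstream
    # node sees it after the unicast path segments are consumed.
    return [list(segments) + [tree_segment] for segments in segment_lists]


# Hypothetical pre-installed policy with two segment lists.
table = {40001: [[16001, 16002], [16003, 16002]]}
lists = build_outgoing_segment_lists(table, 40001, tree_segment=17)
```

   The controller only needs to signal the Binding SID and the tree
   segment; the unicast path details stay local to the node.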

757	   Note that the SR Policy Tunnel was originally defined to instantiate
758	   an SR policy.  For that use case it provides information associated
759	   with the policy, e.g., Binding SID, preference, and segment lists.
760	   The receiving node installs that policy and establishes the mapping
761	   from the Binding SID to the outgoing segments.  In this document, the
762	   SR Policy Tunnel instead refers to a pre-installed SR policy, so the
763	   preference and segment lists are not used.

765	   If a tunnel in the TEA carries an RPF sub-TLV, it is for the upstream
766	   node.  The tunnel may be an MPLS tunnel in case of SR-MPLS, and the
767	   Receiving MPLS Label Stack sub-TLV specifies the incoming label stack
768	   that identifies the tree and optionally the upstream neighbor.
769	   Alternatively, for both SR-MPLS and SRv6 an SR Policy Tunnel with the
770	   RPF sub-TLV can be used, in which the Binding SID sub-TLV is the SID
771	   for the tree.

773	   If the node is the root and a Binding SID is allocated by the
774	   controller, the Binding SID is signaled to the root in a TEA tunnel
775	   with an RPF sub-TLV as above but without a destination sub-TLV.

777	3.  Procedures

779	   Details to be added.  The general idea is described in the
780	   introduction section.

782	4.  Security Considerations

784	   This document does not introduce new security risks.

786	5.  IANA Considerations

788	   This document makes the following IANA requests:

790	   o  Assign "Any-Encapsulation" and "Load-balancing" tunnel types from
791	      the "BGP Tunnel Encapsulation Attribute Tunnel Types" registry.

793	   o  Assign "Member Tunnels", "Receiving MPLS Label Stack", "Tree Label
794	      Stack" and "RPF" sub-TLV types from the "BGP Tunnel Encapsulation
795	      Attribute Sub-TLVs" registry.  The "Member Tunnels" sub-TLV has a
796	      two-octet value length (so the type should be in the 128-255
797	      range), while the "Receiving MPLS Label Stack", "Tree Label Stack"
798	      and "RPF" sub-TLVs have a one-octet value length.

800	   o  Assign "Context Label TLV" type from the "BGP-LS Node Descriptor,
801	      Link Descriptor, Prefix Descriptor, and Attribute TLVs" registry.

803	   o  Assign "S-PMSI A-D Route for SR P2MP" route type from the "BGP
804	      MCAST-TREE Route Types" registry, with a suggested value of 0x83.

806	   o  Assign a new BGP Community Container type "SR P2MP Policy", and
807	      create an "SR P2MP Policy Community Container TLV Registry", with
808	      an initial entry for "TLV for Atoms".

810	6.  Acknowledgements

812	   The authors thank Eric Rosen for his questions, suggestions, and help
813	   finding solutions to some issues like the neighbor based explicit RPF
814	   checking.  The authors also thank Lenny Giuliano, Sanoj Vivekanandan
815	   and IJsbrand Wijnands for their review and comments.

817	7.  References

819	7.1.  Normative References

821	   [I-D.ietf-bess-bgp-multicast]
822	              Zhang, Z., Giuliano, L., Patel, K., Wijnands, I., Mishra,
823	              M., and A. Gulko, "BGP Based Multicast", draft-ietf-bess-
824	              bgp-multicast-02 (work in progress), June 2020.

826	   [I-D.ietf-idr-segment-routing-te-policy]
827	              Previdi, S., Filsfils, C., Talaulikar, K., Mattes, P.,
828	              Rosen, E., Jain, D., and S. Lin, "Advertising Segment
829	              Routing Policies in BGP", draft-ietf-idr-segment-routing-
830	              te-policy-09 (work in progress), May 2020.

832	   [I-D.ietf-idr-tunnel-encaps]
833	              Patel, K., Velde, G., Sangli, S., and J. Scudder, "The BGP
834	              Tunnel Encapsulation Attribute", draft-ietf-idr-tunnel-
835	              encaps-17 (work in progress), July 2020.

837	   [I-D.ietf-idr-wide-bgp-communities]
838	              Raszuk, R., Haas, J., Lange, A., Decraene, B., Amante, S.,
839	              and P. Jakma, "BGP Community Container Attribute", draft-
840	              ietf-idr-wide-bgp-communities-05 (work in progress), July
841	              2018.

843	   [I-D.voyer-pim-sr-p2mp-policy]
844	              Voyer, D., Filsfils, C., Parekh, R., Bidgoli, H., and Z.
845	              Zhang, "Segment Routing Point-to-Multipoint Policy",
846	              draft-voyer-pim-sr-p2mp-policy-02 (work in progress), July
847	              2020.

849	   [I-D.voyer-spring-sr-replication-segment]
850	              Voyer, D., Filsfils, C., Parekh, R., Bidgoli, H., and Z.
851	              Zhang, "SR Replication Segment for Multi-point Service
852	              Delivery", draft-voyer-spring-sr-replication-segment-04
853	              (work in progress), July 2020.

855	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
856	              Requirement Levels", BCP 14, RFC 2119,
857	              DOI 10.17487/RFC2119, March 1997,
858	              <https://www.rfc-editor.org/info/rfc2119>.

860	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
861	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
862	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

864	7.2.  Informative References

866	   [RFC6388]  Wijnands, IJ., Ed., Minei, I., Ed., Kompella, K., and B.
867	              Thomas, "Label Distribution Protocol Extensions for Point-
868	              to-Multipoint and Multipoint-to-Multipoint Label Switched
869	              Paths", RFC 6388, DOI 10.17487/RFC6388, November 2011,
870	              <https://www.rfc-editor.org/info/rfc6388>.

872	   [RFC6513]  Rosen, E., Ed. and R. Aggarwal, Ed., "Multicast in MPLS/
873	              BGP IP VPNs", RFC 6513, DOI 10.17487/RFC6513, February
874	              2012, <https://www.rfc-editor.org/info/rfc6513>.

876	   [RFC7761]  Fenner, B., Handley, M., Holbrook, H., Kouvelas, I.,
877	              Parekh, R., Zhang, Z., and L. Zheng, "Protocol Independent
878	              Multicast - Sparse Mode (PIM-SM): Protocol Specification
879	              (Revised)", STD 83, RFC 7761, DOI 10.17487/RFC7761, March
880	              2016, <https://www.rfc-editor.org/info/rfc7761>.

882	   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
883	              Decraene, B., Litkowski, S., and R. Shakir, "Segment
884	              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
885	              July 2018, <https://www.rfc-editor.org/info/rfc8402>.

887	Authors' Addresses

889	   Zhaohui Zhang
890	   Juniper Networks

892	   EMail: zzhang@juniper.net
893	   Robert Raszuk
894	   Bloomberg LP

896	   EMail: robert@raszuk.net

898	   Dante Pacella
899	   Verizon

901	   EMail: dante.j.pacella@verizon.com

903	   Arkadiy Gulko
904	   Refinitiv

906	   EMail: arkadiy.gulko@refinitiv.com