MBONED                                                       M. McBride
Internet-Draft                                                Futurewei
Intended status: Informational                              O. Komolafe
Expires: August 7, 2020                                 Arista Networks
                                                       February 4, 2020

                 Multicast in the Data Center Overview
                    draft-ietf-mboned-dc-deploy-08

Abstract

   The volume and importance of one-to-many traffic patterns in data centers is likely to increase significantly in the future.  Reasons for this increase are discussed and then attention is paid to the manner in which this traffic pattern may be judiciously handled in data centers.  The intuitive solution of deploying conventional IP multicast within data centers is explored and evaluated.  Thereafter, a number of emerging innovative approaches are described before a number of recommendations are made.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reasons for increasing one-to-many traffic patterns
     2.1.  Applications
     2.2.  Overlays
     2.3.  Protocols
     2.4.  Summary
   3.  Handling one-to-many traffic using conventional multicast
     3.1.  Layer 3 multicast
     3.2.  Layer 2 multicast
     3.3.  Example use cases
     3.4.  Advantages and disadvantages
   4.  Alternative options for handling one-to-many traffic
     4.1.  Minimizing traffic volumes
     4.2.  Head end replication
     4.3.  Programmable Forwarding Planes
     4.4.  BIER
     4.5.  Segment Routing
   5.  Conclusions
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   The volume and importance of one-to-many traffic patterns in data centers is likely to increase significantly in the future.
   Reasons for this increase include the nature of the traffic generated by applications hosted in the data center, the need to handle broadcast, unknown unicast and multicast (BUM) traffic within the overlay technologies used to support multi-tenancy at scale, and the use of certain protocols that traditionally require one-to-many control message exchanges.

   These trends, allied with the expectation that future highly virtualized large-scale data centers must support communication between potentially thousands of participants, may lead to the natural assumption that IP multicast will be widely used in data centers, specifically given the bandwidth savings it potentially offers.  However, such an assumption would be wrong.  In fact, there is widespread reluctance to enable conventional IP multicast in data centers for a number of reasons, mostly pertaining to concerns about its scalability and reliability.

   This draft discusses some of the main drivers for the increasing volume and importance of one-to-many traffic patterns in data centers.  Thereafter, the manner in which conventional IP multicast may be used to handle this traffic pattern is discussed and some of the associated challenges highlighted.  Following this discussion, a number of alternative emerging approaches are introduced, before concluding by discussing key trends and making a number of recommendations.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

   Key trends suggest that the nature of the applications likely to dominate future highly-virtualized multi-tenant data centers will produce large volumes of one-to-many traffic.  For example, it is well-known that traffic flows in data centers have evolved from being predominantly North-South (e.g. client-server) to predominantly East-West (e.g. distributed computation).  This change has led to the consensus that topologies such as Leaf/Spine, which are easier to scale in the East-West direction, are better suited to the data center of the future.  This increase in East-West traffic flows results from VMs often having to exchange numerous messages between themselves as part of executing a specific workload.  For example, a computational workload could require data, or an executable, to be disseminated to workers distributed throughout the data center which may be subsequently polled for status updates.  The emergence of such applications means there is likely to be an increase in one-to-many traffic flows with the increasing dominance of East-West traffic.

   The TV broadcast industry is another potential future source of applications with one-to-many traffic patterns in data centers.  The requirement for robustness, stability and predictability has meant the TV broadcast industry has traditionally used TV-specific protocols, infrastructure and technologies for transmitting video signals between end points such as cameras, monitors, mixers, graphics devices and video servers.
   However, the growing cost and complexity of supporting this approach, especially as the bit rates of the video signals increase due to demand for formats such as 4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast industry will transition from industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based infrastructure.  The development of pertinent standards by the Society of Motion Picture and Television Engineers (SMPTE) [SMPTE2110], along with the increasing performance of IP routers, means this transition is gathering pace.  A possible outcome of this transition will be the building of IP data centers in broadcast plants.  Traffic flows in the broadcast industry are frequently one-to-many and so if IP data centers are deployed in broadcast plants, it is imperative that this traffic pattern is supported efficiently in that infrastructure.  In fact, a pivotal consideration for broadcasters considering transitioning to IP is the manner in which these one-to-many traffic flows will be managed and monitored in a data center with an IP fabric.

   One of the few success stories in using conventional IP multicast has been for disseminating market trading data.  For example, IP multicast is commonly used today to deliver stock quotes from stock exchanges to financial service providers and then to the stock analysts or brokerages.  It is essential that the network infrastructure delivers very low latency and high throughput, especially given the proliferation of automated and algorithmic trading, which means stock analysts or brokerages may gain an edge on competitors simply by receiving an update a few milliseconds earlier.  As would be expected, in such deployments reliability is critical.  The network must be designed with no single point of failure and in such a way that it can respond in a deterministic manner to failure.  Typically, redundant servers (in a primary/backup or live-live mode) send multicast streams into the network, with diverse paths being used across the network.  The stock exchange generating the one-to-many traffic and the stock analysts/brokerages that receive the traffic will typically have their own data centers.  Therefore, the manner in which one-to-many traffic patterns are handled in these data centers is extremely important, especially given the requirements and constraints mentioned.

   Another reason for the growing volume of one-to-many traffic patterns in modern data centers is the increasing adoption of streaming telemetry.  This transition is motivated by the observation that traditional poll-based approaches for monitoring network devices are usually inadequate in modern data centers.  These approaches typically suffer from poor scalability, extensibility and responsiveness.  In contrast, in streaming telemetry, network devices in the data center stream highly-granular real-time updates to a telemetry collector/database.  This collector then collates, normalizes and encodes this data for convenient consumption by monitoring applications.  The monitoring applications can subscribe to the notifications of interest, allowing them to gain insight into pertinent state and performance metrics.
   Thus, the traffic flows associated with streaming telemetry are typically many-to-one between the network devices and the telemetry collector and then one-to-many from the collector to the monitoring applications.

   The use of publish and subscribe applications is growing within data centers, contributing to the rising volume of one-to-many traffic flows.  Such applications are attractive as they provide a robust low-latency asynchronous messaging service, allowing senders to be decoupled from receivers.  The usual approach is for a publisher to create and transmit a message to a specific topic.  The publish and subscribe application will retain the message and ensure it is delivered to all subscribers to that topic.  The flexibility in the number of publishers and subscribers to a specific topic means such applications cater for one-to-one, one-to-many and many-to-one traffic patterns.

2.2.  Overlays

   Another key contributor to the rise in one-to-many traffic patterns is the proposed architecture for supporting large-scale multi-tenancy in highly virtualized data centers [RFC8014].  In this architecture, a tenant's VMs are distributed across the data center and are connected by a virtual network known as the overlay network.  A number of different technologies have been proposed for realizing the overlay network, including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably partisan debate about the relative merits of these overlay technologies belies the fact that, conceptually, these overlays simply provide a means to encapsulate and tunnel Ethernet frames from the VMs over the data center IP fabric, thus emulating a Layer 2 segment between the VMs.  Consequently, the VMs believe and behave as if they are connected to the tenant's other VMs by a conventional Layer 2 segment, regardless of their physical location within the data center.

   Naturally, in a Layer 2 segment, point-to-multipoint traffic can result from handling BUM (broadcast, unknown unicast and multicast) traffic.  Compounding this issue within data centers, since the tenant's VMs attached to the emulated segment may be dispersed throughout the data center, the BUM traffic may need to traverse the data center fabric.

   Hence, regardless of the overlay technology used, due consideration must be given to handling BUM traffic, forcing the data center operator to pay attention to the manner in which one-to-many communication is handled within the data center.  This consideration is likely to become increasingly important with the anticipated rise in the number and importance of overlays.  In fact, it may be asserted that the manner in which one-to-many communications arising from overlays are handled is pivotal to the performance and stability of the entire data center network.

2.3.  Protocols

   Conventionally, some key networking protocols used in data centers require one-to-many communications for control messages.  Thus, the data center operator must pay due attention to how these control message exchanges are supported.

   For example, ARP [RFC0826] and ND [RFC4861] use broadcast and multicast messages within IPv4 and IPv6 networks respectively to discover MAC address to IP address mappings.
   Furthermore, when these protocols are running within an overlay network, it is essential to ensure the messages are delivered to all the hosts on the emulated Layer 2 segment, regardless of physical location within the data center.  The challenges associated with optimally delivering ARP and ND messages in data centers have attracted lots of attention [RFC6820].

   Another example of a protocol that may necessitate having one-to-many traffic flows in the data center is IGMP [RFC2236], [RFC3376].  If the VMs attached to the Layer 2 segment wish to join a multicast group they must send IGMP reports in response to queries from the querier.  As these devices could be located at different locations within the data center, there is the somewhat ironic prospect of IGMP itself leading to an increase in the volume of one-to-many communications in the data center.

2.4.  Summary

   Section 2.1, Section 2.2 and Section 2.3 have discussed how the trends in the types of applications, the overlay technologies used and some of the essential networking protocols result in an increase in the volume of one-to-many traffic patterns in modern highly-virtualized data centers.  Section 3 explores how such traffic flows may be handled using conventional IP multicast.

3.  Handling one-to-many traffic using conventional multicast

   Faced with ever increasing volumes of one-to-many traffic flows for the reasons presented in Section 2, arguably the intuitive initial course of action for a data center operator is to explore if and how conventional IP multicast could be deployed within the data center.  This section introduces the key protocols, presents some example use cases where they are deployed in data centers and discusses some of the advantages and disadvantages of such deployments.

3.1.  Layer 3 multicast

   PIM is the most widely deployed multicast routing protocol and so, unsurprisingly, is the primary multicast routing protocol considered for use in the data center.  There are three popular modes of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015].  It may be said that these different modes of PIM trade off the optimality of the multicast forwarding tree against the amount of multicast forwarding state that must be maintained at routers.  SSM provides the most efficient forwarding between sources and receivers and thus is most suitable for applications with one-to-many traffic patterns.  State is built and maintained for each (S,G) flow.  Thus, the amount of multicast forwarding state held by routers in the data center is proportional to the number of sources and groups.  At the other end of the spectrum, BIDIR is the most efficient shared tree solution as one tree is built for all flows, therefore minimizing the amount of state.  This state reduction is at the expense of optimal forwarding paths between sources and receivers.  The use of a shared tree makes BIDIR particularly well-suited for applications with many-to-many traffic patterns, given that the amount of state is uncorrelated to the number of sources.  SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is the most widely deployed of the three modes and can also be the most complex.  PIM-SM relies upon an RP (Rendezvous Point) to set up the multicast tree and subsequently there is the option of switching to the SPT (shortest path tree), similar to SSM, or staying on the shared tree, similar to BIDIR.

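   To make this state trade-off concrete, the following Python sketch gives a back-of-the-envelope comparison of per-router forwarding state under each mode.  It is purely illustrative and assumes a worst case in which every source sends to every group and the router lies on every resulting tree; real numbers depend on topology and receiver placement.

      # Illustrative comparison of per-router multicast forwarding state
      # for the PIM modes discussed above (worst-case assumptions).

      def pim_state_estimate(num_sources, num_groups):
          ssm = num_sources * num_groups   # one (S,G) entry per source/group pair
          sm = num_groups + num_sources * num_groups  # (*,G) plus (S,G) after SPT switchover
          bidir = num_groups               # one shared (*,G) entry per group
          return {"PIM-SSM": ssm,
                  "PIM-SM (worst case)": sm,
                  "PIM-BIDIR": bidir}

      # e.g. 1000 sources sending into 100 groups
      for mode, entries in pim_state_estimate(1000, 100).items():
          print(f"{mode}: ~{entries} entries")
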
3.2.  Layer 2 multicast

   With IPv4 unicast address resolution, the translation of an IP address to a MAC address is done dynamically by ARP.  With multicast address resolution, the mapping from a multicast IPv4 address to a multicast MAC address is done by assigning the low-order 23 bits of the multicast IPv4 address to fill the low-order 23 bits of the multicast MAC address.  Each IPv4 multicast address has 28 unique bits (the multicast address range is 224.0.0.0/4) and therefore mapping a multicast IP address to a MAC address ignores 5 bits of the IP address.  Hence, groups of 32 multicast IP addresses are mapped to the same MAC address, and so a multicast MAC address cannot be uniquely mapped back to a multicast IPv4 address.  Therefore, IPv4 multicast addresses must be chosen judiciously in order to avoid unnecessary address aliasing.  When sending IPv6 multicast packets on an Ethernet link, the corresponding destination MAC address is a direct mapping of the last 32 bits of the 128 bit IPv6 multicast address into the 48 bit MAC address.  It is possible for more than one IPv6 multicast address to map to the same 48 bit MAC address.

   The default behaviour of many hosts (and, in fact, routers) is to block multicast traffic.  Consequently, when a host wishes to join an IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376] report to the router attached to the Layer 2 segment and also instructs its data link layer to receive Ethernet frames that match the corresponding MAC address.  The data link layer filters the frames, passing those with matching destination addresses to the IP module.  Similarly, when sending to a multicast group, a host simply hands the packet to the data link layer, which adds the Layer 2 encapsulation using the MAC address derived in the manner previously discussed.

   When this Ethernet frame with a multicast MAC address is received by a switch configured to forward multicast traffic, the default behaviour is to flood it to all the ports in the Layer 2 segment.  Clearly there may not be a receiver for this multicast group present on each port, and IGMP snooping is used to avoid sending the frame out of ports without receivers.

   A switch running IGMP snooping listens to the IGMP messages exchanged between hosts and the router in order to identify which ports have active receivers for a specific multicast group, allowing the forwarding of multicast frames to be suitably constrained.  Normally, the multicast router will generate IGMP queries to which the hosts send IGMP reports in response.  However, a number of optimizations in which a switch generates IGMP queries (and so appears to be the router from the hosts' perspective) and/or generates IGMP reports (and so appears to be hosts from the router's perspective) are commonly used to improve performance by reducing the amount of state maintained at the router, suppressing superfluous IGMP messages and improving responsiveness when hosts join/leave the group.

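   The address mappings described at the start of this section, and the aliasing that results, can be illustrated with a short Python sketch.  It is purely illustrative; the first group address below is taken from the 233.252.0.0/24 example range and the second differs from it only in one of the five ignored bits, so both map to the same MAC address.

      # Mapping IPv4/IPv6 multicast group addresses to MAC addresses,
      # showing how 32 IPv4 group addresses alias to one MAC address.

      import ipaddress

      def ipv4_mcast_to_mac(addr):
          # low-order 23 bits of the group address are copied into the
          # low-order 23 bits of the fixed prefix 01:00:5e:00:00:00
          low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
          return "01:00:5e:%02x:%02x:%02x" % (
              (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF)

      def ipv6_mcast_to_mac(addr):
          # low-order 32 bits of the group address follow the prefix 33:33
          low32 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFFFF
          return "33:33:%02x:%02x:%02x:%02x" % (
              (low32 >> 24) & 0xFF, (low32 >> 16) & 0xFF,
              (low32 >> 8) & 0xFF, low32 & 0xFF)

      print(ipv4_mcast_to_mac("233.252.0.1"))     # 01:00:5e:7c:00:01
      print(ipv4_mcast_to_mac("233.124.0.1"))     # 01:00:5e:7c:00:01 (aliased)
      print(ipv6_mcast_to_mac("ff02::1:ff00:1"))  # 33:33:ff:00:00:01
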
   Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by IPv6 routers for discovering multicast listeners on a directly attached link, performing a similar function to IGMP in IPv4 networks.  MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810] [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD does not send its own distinct protocol messages.  Rather, MLD is a subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of ICMPv6 messages.  MLD snooping works similarly to IGMP snooping, described earlier.

3.3.  Example use cases

   A use case where PIM and IGMP are currently used in data centers is to support multicast in VXLAN deployments.  In the original VXLAN specification [RFC7348], a data-driven flood-and-learn control plane was proposed, requiring the data center IP fabric to support multicast routing.  A multicast group is associated with each virtual network, each uniquely identified by its VXLAN network identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the hypervisor or ToR switch, with local VMs that belong to this VNI would join the multicast group and use it for the exchange of BUM traffic with the other VTEPs.  Essentially, the VTEP would encapsulate any BUM traffic from attached VMs in an IP multicast packet, whose destination address is the associated multicast group address, and transmit the packet to the data center fabric.  Thus, PIM must be running in the fabric to maintain a multicast distribution tree per VNI.

   Alternatively, rather than setting up a multicast distribution tree per VNI, a tree can be set up whenever hosts within the VNI wish to exchange multicast traffic.  For example, whenever a VTEP receives an IGMP report from a locally connected host, it would translate this into a PIM join message which will be propagated into the IP fabric.  In order to ensure this join message is sent to the IP fabric rather than over the VXLAN interface (since the VTEP will have a route back to the source of the multicast packet over the VXLAN interface and so would naturally attempt to send the join over this interface), a more specific route back to the source over the IP fabric must be configured.  In this approach PIM must be configured on the SVIs associated with the VXLAN interface.

   Another use case of PIM and IGMP in data centers is when IPTV servers use multicast to deliver content from the data center to end users.  IPTV is typically a one-to-many application where the hosts are configured for IGMPv3, the switches are configured with IGMP snooping, and the routers are running PIM-SSM mode.  Often redundant servers send multicast streams into the network and the network forwards the data across diverse paths.

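   A minimal sketch of the flood-and-learn behaviour described in the first use case above is given below.  It is deliberately simplified and hypothetical: the VNIs, the VNI-to-group mapping and the use of a plain UDP socket stand in for a real VTEP implementation of the [RFC7348] encapsulation, but they show how BUM frames are handed to the multicast-enabled fabric.

      # Simplified VTEP behaviour for flood-and-learn VXLAN: BUM frames
      # are VXLAN-encapsulated and sent to the IP multicast group
      # associated with the frame's VNI, so every remote VTEP that has
      # joined the group receives a copy.  Values are hypothetical.

      import socket
      import struct

      VXLAN_PORT = 4789
      VNI_TO_GROUP = {                 # provisioned per virtual network
          10010: "233.252.0.10",
          10020: "233.252.0.20",
      }

      def vxlan_encap(vni, inner_frame):
          # 8-byte VXLAN header: I flag set, 24-bit VNI, rest reserved
          return struct.pack("!II", 0x08000000, vni << 8) + inner_frame

      def send_bum(vni, inner_frame):
          group = VNI_TO_GROUP[vni]
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
          sock.sendto(vxlan_encap(vni, inner_frame), (group, VXLAN_PORT))
          sock.close()

      # e.g. an ARP broadcast received from a local VM on VNI 10010
      send_bum(10010, b"\xff\xff\xff\xff\xff\xff" + b"<rest of frame>")
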
3.4.  Advantages and disadvantages

   Arguably the biggest advantage of using PIM and IGMP to support one-to-many communication in data centers is that these protocols are relatively mature.  Consequently, PIM is available in most routers and IGMP is supported by most hosts and routers.  As such, no specialized hardware or relatively immature software is involved in using these protocols in data centers.  Furthermore, the maturity of these protocols means their behaviour and performance in operational networks is well-understood, with widely available best practices and deployment guides for optimizing their performance.  For these reasons, PIM and IGMP have been used successfully for supporting one-to-many traffic flows within modern data centers, as discussed earlier.

   However, somewhat ironically, the relative disadvantages of PIM and IGMP usage in data centers also stem mostly from their maturity.  Specifically, these protocols were standardized and implemented long before the highly-virtualized multi-tenant data centers of today existed.  Consequently, PIM and IGMP are neither optimally placed to deal with the requirements of one-to-many communication in modern data centers nor to exploit the idiosyncrasies of data centers.  For example, there may be thousands of VMs participating in a multicast session, with some of these VMs migrating to other servers within the data center, new VMs being continually spun up and wishing to join the sessions while all the time other VMs are leaving.  In such a scenario, the churn in the PIM and IGMP state machines, the volume of control messages they would generate and the amount of state they would necessitate within routers, especially if they were deployed naively, would be untenable.  Furthermore, PIM is a relatively complex protocol.  As such, PIM can be challenging to debug even in significantly more benign deployments than those envisaged for future data centers, a fact that has evidently had a dissuasive effect on data center operators considering enabling it within the IP fabric.

4.  Alternative options for handling one-to-many traffic

   Section 2 has shown that there is likely to be an increasing amount of one-to-many communication in data centers for multiple reasons.  Section 3 has discussed how conventional multicast may be used to handle this traffic, presenting some of the associated advantages and disadvantages.  Unsurprisingly, as discussed in the remainder of Section 4, there are a number of alternative options for handling this traffic pattern in data centers.  Critically, it should be noted that many of these techniques are not mutually exclusive; in fact many deployments involve a combination of more than one of these techniques.  Furthermore, as will be shown, introducing a centralized controller or a distributed control plane typically makes these techniques more potent.

4.1.  Minimizing traffic volumes

   If handling one-to-many traffic flows in data centers is considered onerous, then arguably the most intuitive solution is to aim to minimize the volume of said traffic.

   It was previously mentioned in Section 2 that the three main contributors to one-to-many traffic in data centers are applications, overlays and protocols.  Typically the applications running on VMs are outside the control of the data center operator and thus, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications.  Luckily, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols (and often by protocols running within overlays).

   This reduction is possible by exploiting certain characteristics of data center networks such as a fixed and regular topology, single administrative control, consistent hardware and software, well-known overlay encapsulation endpoints and systematic IP address allocation.

   A way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller.  For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of this new MAC address, the associated virtual network, IP address etc.  The controller could subsequently distribute this information to every encapsulation endpoint.  Consequently, when any endpoint receives an ARP request from a locally attached VM, it could simply consult its local copy of the information distributed by the controller and reply.  Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric.

   Alternatively, the functionality supported by the controller can be realized by a distributed control plane.  BGP-EVPN [RFC7432] [RFC8365] is the most popular control plane used in data centers.  Typically, the encapsulation endpoints will exchange pertinent information with each other by all peering with a BGP route reflector (RR).  Thus, information such as local MAC addresses, MAC to IP address mappings, virtual network identifiers, IP prefixes, and local IGMP group membership can be disseminated.  Consequently, for example, ARP requests from local VMs can be suppressed by the encapsulation endpoint using the information learnt from the control plane about the MAC to IP mappings at remote peers.  In a similar fashion, encapsulation endpoints can use information gleaned from the BGP-EVPN messages to proxy for both IGMP reports and queries for the attached VMs, thus obviating the need to transmit IGMP messages across the data center fabric.

4.2.  Head end replication

   A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER).  HER means the traffic is duplicated and sent to each end point individually using conventional IP unicast.  Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end.  Nevertheless, HER is especially attractive when overlays are in use as the replication can be carried out by the hypervisor or encapsulation end point.  Consequently, the VMs and IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points.  Additionally, a number of approaches may be used for constructing and disseminating the list of which endpoints should receive which traffic.

   For example, the reluctance of data center operators to enable PIM within the data center fabric means VXLAN is often used with HER.  Thus, BUM traffic from each VNI is replicated and sent using unicast to the remote VTEPs with VMs in that VNI.  The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP.  Alternatively, the VTEPs may transmit pertinent local state to a centralized controller which in turn sends each VTEP the list of remote VTEPs for each VNI.  Lastly, HER also works well when a distributed control plane is used instead of the centralized controller.  Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.

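   The following Python sketch illustrates head end replication at a VTEP: one unicast VXLAN-encapsulated copy of the BUM frame is sent to each remote VTEP on the VNI's flood list.  The flood list is statically defined here purely for illustration; as noted above, it could equally be configured manually, pushed by a controller or learnt via BGP-EVPN.  The addresses, VNIs and helper functions are hypothetical.

      # Head end replication (HER): instead of relying on a multicast-
      # enabled fabric, the ingress VTEP sends one unicast VXLAN copy of
      # the BUM frame to every remote VTEP in the VNI's flood list.

      import socket
      import struct

      VXLAN_PORT = 4789
      FLOOD_LIST = {                   # VNI -> remote VTEP addresses
          10010: ["192.0.2.11", "192.0.2.12", "192.0.2.13"],
      }

      def vxlan_encap(vni, inner_frame):
          return struct.pack("!II", 0x08000000, vni << 8) + inner_frame

      def replicate_bum(vni, inner_frame):
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          packet = vxlan_encap(vni, inner_frame)
          for remote_vtep in FLOOD_LIST[vni]:
              # one copy per remote VTEP: simple, but the head end bears
              # the full replication cost
              sock.sendto(packet, (remote_vtep, VXLAN_PORT))
          sock.close()

      replicate_bum(10010, b"\xff\xff\xff\xff\xff\xff" + b"<rest of frame>")
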
4.3.  Programmable Forwarding Planes

   As discussed in Section 3.1, one of the main functions of PIM is to build and maintain multicast distribution trees.  Such a tree indicates the path a specific flow will take through the network.  Thus, in routers traversed by the flow, the information from PIM is ultimately used to create a multicast forwarding entry for the specific flow and insert it into the multicast forwarding table.  The multicast forwarding table will have entries for each multicast flow traversing the router, with the lookup key usually being a concatenation of the source and group addresses.  Critically, each entry will contain information such as the legal input interface for the flow and a list of output interfaces to which matching packets should be replicated.

   Viewed in this way, there is nothing remarkable about the multicast forwarding state constructed in routers based on the information gleaned from PIM.  In fact, it is perfectly feasible to build such state in the absence of PIM.  Such prospects have been significantly enhanced with the increasing popularity and performance of network devices with programmable forwarding planes.  These devices are attractive for use in data centers since they are amenable to being programmed by a centralized controller.  If such a controller has a global view of the sources and receivers for each multicast flow (which can be provided by the devices attached to the end hosts in the data center communicating with the controller) and an accurate representation of the data center topology (which is usually well-known), then it can readily compute the multicast forwarding state that must be installed at each router to ensure the one-to-many traffic flow is delivered properly to the correct receivers.  All that is needed is an API to program the forwarding planes of all the network devices that need to handle the flow appropriately.  Such APIs do in fact exist and so, unsurprisingly, handling one-to-many traffic flows using such an approach is attractive for data centers.

   Being able to program the forwarding plane in this manner offers the enticing possibility of introducing novel algorithms and concepts for forwarding multicast traffic in data centers.  These schemes typically aim to exploit the idiosyncrasies of the data center network architecture to create ingenious, pithy and elegant encodings of the information needed to facilitate multicast forwarding.  Depending on the scheme, this information may be carried in packet headers, stored in the multicast forwarding table in routers or a combination of both.  The key characteristic is that the terseness of the forwarding information means the volume of forwarding state is significantly reduced.  Additionally, the overhead associated with building and maintaining a multicast forwarding tree is eliminated.  The result of these reductions in the overhead associated with multicast forwarding is a significant and impressive increase in the effective number of multicast flows that can be supported within the data center.

   [Shabaz19] is a good example of such an approach and also presents a comprehensive discussion of other schemes in its treatment of related work.  Although a number of promising schemes have been proposed, no consensus has yet emerged as to which approach is best, and in fact what "best" means.  Even if a clear winner were to emerge, it faces significant challenges to gain the vendor and operator buy-in needed to ensure it is widely deployed in data centers.

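   The sketch below illustrates the kind of computation such a controller might perform: given the (well-known) topology and the locations of the ingress switch and the receivers for a flow, it derives the per-switch output port sets for that (S,G), which would then be installed through whatever forwarding-plane API is available.  The topology, names and omission of the actual device programming step are simplifications for illustration only.

      # Controller-side computation of multicast forwarding state for a
      # programmable fabric: walk the shortest paths from the ingress
      # switch towards each receiver's switch and record, per switch,
      # the output ports for this (S,G).  Topology is hypothetical.

      from collections import defaultdict, deque

      # adjacency: switch -> {neighbour: output port on this switch}
      TOPOLOGY = {
          "leaf1":  {"spine1": 1, "spine2": 2},
          "leaf2":  {"spine1": 1, "spine2": 2},
          "leaf3":  {"spine1": 1, "spine2": 2},
          "spine1": {"leaf1": 1, "leaf2": 2, "leaf3": 3},
          "spine2": {"leaf1": 1, "leaf2": 2, "leaf3": 3},
      }

      def shortest_path(src, dst):
          prev, queue = {src: None}, deque([src])
          while queue:
              node = queue.popleft()
              if node == dst:
                  break
              for nbr in TOPOLOGY[node]:
                  if nbr not in prev:
                      prev[nbr] = node
                      queue.append(nbr)
          path, node = [], dst
          while node is not None:
              path.append(node)
              node = prev[node]
          return path[::-1]

      def compute_entries(ingress_switch, egress_switches):
          # returns {switch: set of output ports} for one (S,G) flow
          entries = defaultdict(set)
          for egress in egress_switches:
              path = shortest_path(ingress_switch, egress)
              for here, nxt in zip(path, path[1:]):
                  entries[here].add(TOPOLOGY[here][nxt])
          return dict(entries)

      # source behind leaf1, receivers behind leaf2 and leaf3
      print(compute_entries("leaf1", ["leaf2", "leaf3"]))
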
4.4.  BIER

   As discussed in Section 3.4, PIM and IGMP face potential scalability challenges when deployed in data centers.  These challenges are typically due to the requirement to build and maintain a distribution tree and the requirement to hold per-flow state in routers.  Bit Index Explicit Replication (BIER) [RFC8279] is a new multicast forwarding paradigm that avoids these two requirements.

   When a multicast packet enters a BIER domain, the ingress router, known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header to the packet.  This header contains a bit string in which each bit maps to an egress router, known as a Bit-Forwarding Egress Router (BFER).  If a bit is set, then the packet should be forwarded to the associated BFER.  The routers within the BIER domain, Bit-Forwarding Routers (BFRs), use the BIER header in the packet and information in the Bit Index Forwarding Table (BIFT) to carry out simple bitwise operations to determine how the packet should be replicated optimally so it reaches all the appropriate BFERs.

   BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases].  The deployment envisioned with overlay networks is that the encapsulation endpoints would be the BFIRs.  Consequently, knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast.  Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information.  A challenge associated with using BIER is that it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

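   The bitwise replication procedure described above can be captured in a few lines.  The following Python sketch is a simplification of the forwarding procedure in [RFC8279] (a single BIER sub-domain and set, with a hypothetical BIFT and neighbours); it shows how a BFR uses only bitwise operations on the packet's bit string to replicate a packet towards the BFERs whose bits are set, without holding any per-flow state.

      # Simplified BIER forwarding at a Bit-Forwarding Router (BFR).
      # Each BFER is represented by one bit position.  The BIFT maps a
      # bit position to the forwarding bit mask (all BFERs reachable via
      # a given neighbour) and that neighbour.  Values are hypothetical;
      # see RFC 8279 for the full procedure.

      BIFT = {
          # bit position: (forwarding bit mask, BFR neighbour)
          1: (0b0011, "bfr-a"),    # BFERs 1 and 2 reached via bfr-a
          2: (0b0011, "bfr-a"),
          3: (0b1100, "bfr-b"),    # BFERs 3 and 4 reached via bfr-b
          4: (0b1100, "bfr-b"),
      }

      def bier_forward(bitstring, payload, send):
          remaining = bitstring
          while remaining:
              # lowest-numbered bit still set in the packet's bit string
              bit_pos = (remaining & -remaining).bit_length()
              fbm, neighbour = BIFT[bit_pos]
              # the copy sent to this neighbour only carries the bits
              # reachable via it, preventing duplicate delivery
              send(neighbour, remaining & fbm, payload)
              remaining &= ~fbm

      def log_send(neighbour, bits, payload):
          print(f"to {neighbour}: bitstring={bits:04b} payload={payload!r}")

      # packet destined to BFERs 1, 2 and 4
      bier_forward(0b1011, b"example", log_send)
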
4.5.  Segment Routing

   Segment Routing (SR) [RFC8402] is a manifestation of the source routing paradigm, so called as the path a packet takes through a network is determined at the source.  The source encodes this information in the packet header as a sequence of instructions.  These instructions are followed by intermediate routers, ultimately resulting in the delivery of the packet to the desired destination.  In SR, the instructions are known as segments and a number of different kinds of segments have been defined.  Each segment has an identifier (SID) which is distributed throughout the network by newly defined extensions to standard routing protocols.  Thus, using this information, sources are able to determine the exact sequence of segments to encode into the packet.  The manner in which these instructions are encoded depends on the underlying data plane.  Segment Routing can be applied to the MPLS and IPv6 data planes.  In the former, the list of segments is represented by the label stack and in the latter it is represented as an IPv6 routing extension header.  Advantages of segment routing include the reduction in the amount of forwarding state routers need to hold and the removal of the need to run a signaling protocol, thus improving network scalability while reducing operational complexity.

   The advantages of segment routing and the ability to run it over an unmodified MPLS data plane mean that one of its anticipated use cases is in BGP-based large-scale data centers [RFC7938].  The exact manner in which multicast traffic will be handled in SR has not yet been standardized, with a number of different options being considered.  For example, since segments are simply encoded as a label stack with the MPLS data plane, the protocols traditionally used to create point-to-multipoint LSPs could be reused to allow SR to support one-to-many traffic flows.  Alternatively, a special SID may be defined for a multicast distribution tree, with a centralized controller being used to program routers appropriately to ensure the traffic is delivered to the desired destinations, while avoiding the costly process of building and maintaining a multicast distribution tree.

5.  Conclusions

   As the volume and importance of one-to-many traffic in data centers increases, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its relatively poor scalability and inability to exploit characteristics of data center network architectures.  Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built and maintained by PIM in the future.  Rather, approaches which exploit the idiosyncrasies of data center network architectures are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane, particularly one based on BGP-EVPN.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   No new security considerations result from this document.

8.  Acknowledgements

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

9.2.  Informative References

   [I-D.ietf-bier-use-cases]  Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-09 (work in progress), January 2019.

   [I-D.ietf-nvo3-geneve]  Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-13 (work in progress), March 2019.

   [I-D.ietf-nvo3-vxlan-gpe]  Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-07 (work in progress), April 2019.

   [RFC0826]  Plummer, D., "An Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, DOI 10.17487/RFC0826, November 1982, <https://www.rfc-editor.org/info/rfc826>.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, <https://www.rfc-editor.org/info/rfc2236>.

   [RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999, <https://www.rfc-editor.org/info/rfc2710>.

   [RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, <https://www.rfc-editor.org/info/rfc3376>.

   [RFC3810]  Vida, R., Ed. and L. Costa, Ed., "Multicast Listener Discovery Version 2 (MLDv2) for IPv6", RFC 3810, DOI 10.17487/RFC3810, June 2004, <https://www.rfc-editor.org/info/rfc3810>.

   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006, <https://www.rfc-editor.org/info/rfc4443>.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006, <https://www.rfc-editor.org/info/rfc4601>.

   [RFC4604]  Holbrook, H., Cain, B., and B. Haberman, "Using Internet Group Management Protocol Version 3 (IGMPv3) and Multicast Listener Discovery Protocol Version 2 (MLDv2) for Source-Specific Multicast", RFC 4604, DOI 10.17487/RFC4604, August 2006, <https://www.rfc-editor.org/info/rfc4604>.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, <https://www.rfc-editor.org/info/rfc4607>.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, DOI 10.17487/RFC4861, September 2007, <https://www.rfc-editor.org/info/rfc4861>.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, <https://www.rfc-editor.org/info/rfc5015>.

   [RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013, <https://www.rfc-editor.org/info/rfc6820>.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, <https://www.rfc-editor.org/info/rfc7348>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015, <https://www.rfc-editor.org/info/rfc7637>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016, <https://www.rfc-editor.org/info/rfc7938>.

   [RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016, <https://www.rfc-editor.org/info/rfc8014>.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017, <https://www.rfc-editor.org/info/rfc8279>.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018, <https://www.rfc-editor.org/info/rfc8365>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, July 2018, <https://www.rfc-editor.org/info/rfc8402>.

   [Shabaz19]  Shahbaz, M., Suresh, L., Rexford, J., Feamster, N., Rottenstreich, O., and M. Hira, "Elmo: Source Routed Multicast for Public Clouds", ACM SIGCOMM 2019 Conference (SIGCOMM '19), ACM, DOI 10.1145/3341302.3342066, August 2019.

   [SMPTE2110]  "SMPTE2110 Standards Suite".

Authors' Addresses

   Mike McBride
   Futurewei

   Email: michael.mcbride@futurewei.com

   Olufemi Komolafe
   Arista Networks

   Email: femi@arista.com