MBONED                                                       M. McBride
Internet-Draft                                                   Huawei
Intended status: Informational                              O. Komolafe
Expires: December 31, 2018                              Arista Networks
                                                           June 29, 2018

                 Multicast in the Data Center Overview
                    draft-ietf-mboned-dc-deploy-03

Abstract

The volume and importance of one-to-many traffic patterns in data centers are likely to increase significantly in the future.  Reasons for this increase are discussed, and attention is then paid to the manner in which this traffic pattern may be judiciously handled in data centers.  The intuitive solution of deploying conventional IP multicast within data centers is explored and evaluated.  Thereafter, a number of emerging innovative approaches are described before a number of recommendations are made.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 31, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Requirements Language
   2. Reasons for increasing one-to-many traffic patterns
      2.1. Applications
      2.2. Overlays
      2.3. Protocols
   3. Handling one-to-many traffic using conventional multicast
      3.1. Layer 3 multicast
      3.2. Layer 2 multicast
      3.3. Example use cases
      3.4. Advantages and disadvantages
   4. Alternative options for handling one-to-many traffic
      4.1. Minimizing traffic volumes
      4.2. Head end replication
      4.3. BIER
      4.4. Segment Routing
   5. Conclusions
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Authors' Addresses

1. Introduction

The volume and importance of one-to-many traffic patterns in data centers are likely to increase significantly in the future.  Reasons for this increase include the nature of the traffic generated by applications hosted in the data center, the need to handle broadcast, unknown unicast and multicast (BUM) traffic within the overlay technologies used to support multi-tenancy at scale, and the use of certain protocols that traditionally require one-to-many control message exchanges.  These trends, allied with the expectation that future highly virtualized data centers must support communication between potentially thousands of participants, may lead to the natural assumption that IP multicast will be widely used in data centers, especially given the bandwidth savings it potentially offers.  However, such an assumption would be wrong.  In fact, there is widespread reluctance to enable IP multicast in data centers for a number of reasons, mostly pertaining to concerns about its scalability and reliability.

This draft discusses some of the main drivers for the increasing volume and importance of one-to-many traffic patterns in data centers.  Thereafter, the manner in which conventional IP multicast may be used to handle this traffic pattern is discussed and some of the associated challenges are highlighted.  Following this discussion, a number of alternative emerging approaches are introduced, before concluding by discussing key trends and making a number of recommendations.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Reasons for increasing one-to-many traffic patterns

2.1. Applications

Key trends suggest that the nature of the applications likely to dominate future highly-virtualized multi-tenant data centers will produce large volumes of one-to-many traffic.  For example, it is well-known that traffic flows in data centers have evolved from being predominantly North-South (e.g. client-server) to predominantly East-West (e.g. distributed computation).  This change has led to the consensus that topologies such as Leaf/Spine, which are easier to scale in the East-West direction, are better suited to the data center of the future.  This increase in East-West traffic flows results from VMs often having to exchange numerous messages between themselves as part of executing a specific workload.  For example, a computational workload could require data, or an executable, to be disseminated to workers distributed throughout the data center, which may subsequently be polled for status updates.  The emergence of such applications means there is likely to be an increase in one-to-many traffic flows with the increasing dominance of East-West traffic.

The TV broadcast industry is another potential future source of applications with one-to-many traffic patterns in data centers.  The requirement for robustness, stability and predictability has meant the TV broadcast industry has traditionally used TV-specific protocols, infrastructure and technologies for transmitting video signals between cameras, studios, mixers, encoders, servers etc.  However, the growing cost and complexity of supporting this approach, especially as the bit rates of the video signals increase due to demand for formats such as 4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast industry will transition from industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based infrastructure.  The development of pertinent standards by the SMPTE, along with the increasing performance of IP routers, means this transition is gathering pace.  A possible outcome of this transition will be the building of IP data centers in broadcast plants.  Traffic flows in the broadcast industry are frequently one-to-many and so, if IP data centers are deployed in broadcast plants, it is imperative that this traffic pattern is supported efficiently in that infrastructure.  In fact, a pivotal consideration for broadcasters considering transitioning to IP is the manner in which these one-to-many traffic flows will be managed and monitored in a data center with an IP fabric.

Arguably one of the few success stories in using conventional IP multicast has been the dissemination of market trading data.  For example, IP multicast is commonly used today to deliver stock quotes from the stock exchange to financial services providers and then on to the stock analysts or brokerages.  The network must be designed with no single point of failure and in such a way that the network can respond in a deterministic manner to any failure.  Typically, redundant servers (in a primary/backup or live-live mode) send multicast streams into the network, with diverse paths being used across the network.  Another critical requirement is reliability and traceability; regulatory and legal requirements mean that the producer of the market data must know exactly where the flow was sent and be able to prove conclusively that the data was received within agreed SLAs.  The stock exchange generating the one-to-many traffic and the stock analysts/brokerages that receive the traffic will typically have their own data centers.  Therefore, the manner in which one-to-many traffic patterns are handled in these data centers is extremely important, especially given the requirements and constraints mentioned.

Many data center cloud providers provide publish and subscribe applications.  There can be numerous publishers and subscribers and many message channels within a data center.  With unicast delivery, the publish and subscribe servers send a separate copy of each message to each subscriber of a publication.  With multicast publish/subscribe, only one message is sent, regardless of the number of subscribers.  In a publish/subscribe system, client applications, some of which are publishers and some of which are subscribers, are connected to a network of message brokers that receive publications on a number of topics, and send the publications on to the subscribers for those topics.  The more subscribers there are in the publish/subscribe system, the greater the improvement to network utilization there might be with multicast.
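
The fan-out effect can be quantified with the following sketch (illustrative Python, not drawn from any particular publish/subscribe product; the function name and workload figures are assumptions chosen purely for the example): with unicast delivery the broker injects one copy per subscriber, whereas with multicast it injects one copy per publication.

   # Illustrative comparison of broker fan-out with unicast versus
   # multicast delivery in a publish/subscribe system.  The workload
   # figures are hypothetical.

   def copies_injected(publications: int, subscribers: int,
                       multicast: bool) -> int:
       """Copies of the published messages the broker must inject."""
       # Unicast: one copy per subscriber per publication.
       # Multicast: one copy per publication, independent of fan-out.
       return publications * (1 if multicast else subscribers)

   if __name__ == "__main__":
       pubs, subs = 1000, 500    # hypothetical workload
       print("unicast  :", copies_injected(pubs, subs, multicast=False))
       print("multicast:", copies_injected(pubs, subs, multicast=True))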

2.2. Overlays

The proposed architecture for supporting large-scale multi-tenancy in highly virtualized data centers [RFC8014] consists of a tenant's VMs distributed across the data center connected by a virtual network known as the overlay network.  A number of different technologies have been proposed for realizing the overlay network, including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably partisan debate about the relative merits of these overlay technologies belies the fact that, conceptually, these overlays simply provide a means to encapsulate and tunnel Ethernet frames from the VMs over the data center IP fabric, thus emulating a layer 2 segment between the VMs.  Consequently, the VMs believe and behave as if they are connected to the tenant's other VMs by a conventional layer 2 segment, regardless of their physical location within the data center.  Naturally, in a layer 2 segment, point-to-multipoint traffic can result from handling BUM (broadcast, unknown unicast and multicast) traffic.  Compounding this issue within data centers, since the tenant's VMs attached to the emulated segment may be dispersed throughout the data center, the BUM traffic may need to traverse the data center fabric.  Hence, regardless of the overlay technology used, due consideration must be given to handling BUM traffic, forcing the data center operator to consider the manner in which one-to-many communication is handled within the IP fabric.

2.3. Protocols

Conventionally, some key networking protocols used in data centers require one-to-many communication.  For example, ARP and ND use broadcast and multicast messages within IPv4 and IPv6 networks respectively to discover the mappings between IP addresses and MAC addresses.  Furthermore, when these protocols are running within an overlay network, it is essential to ensure the messages are delivered to all the hosts on the emulated layer 2 segment, regardless of physical location within the data center.  The challenges associated with optimally delivering ARP and ND messages in data centers have attracted much attention [RFC6820].  Popular approaches in use mostly seek to exploit characteristics of data center networks to avoid having to broadcast/multicast these messages, as discussed in Section 4.1.

3. Handling one-to-many traffic using conventional multicast

3.1. Layer 3 multicast

PIM is the most widely deployed multicast routing protocol and so, unsurprisingly, is the primary multicast routing protocol considered for use in the data center.  There are three popular flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015].  It may be said that these different modes of PIM trade off the optimality of the multicast forwarding tree for the amount of multicast forwarding state that must be maintained at routers.  SSM provides the most efficient forwarding between sources and receivers and thus is most suitable for applications with one-to-many traffic patterns.  State is built and maintained for each (S,G) flow.  Thus, the amount of multicast forwarding state held by routers in the data center is proportional to the number of sources and groups.  At the other end of the spectrum, BIDIR is the most efficient shared tree solution as one tree is built for all (S,G)s, therefore minimizing the amount of state.  This state reduction is at the expense of an optimal forwarding path between sources and receivers.  This use of a shared tree makes BIDIR particularly well-suited for applications with many-to-many traffic patterns, given that the amount of state is uncorrelated with the number of sources.  SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is still the most widely deployed multicast routing protocol and can also be the most complex.  PIM-SM relies upon an RP (Rendezvous Point) to set up the multicast tree and subsequently there is the option of switching to the SPT (shortest path tree), similar to SSM, or staying on the shared tree, similar to BIDIR.
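
The scaling difference can be made concrete with a deliberately simplified sketch (illustrative Python; the workload figures are assumptions, and PIM-SM is omitted because its state depends on whether routers switch to the shortest path tree):

   # Rough sketch of how multicast forwarding state scales with the
   # number of sources and groups for the PIM variants discussed
   # above.  The figures are hypothetical and the model is simplified.

   def ssm_state(sources_per_group: int, groups: int) -> int:
       # PIM-SSM: one (S,G) entry per source per group.
       return sources_per_group * groups

   def bidir_state(groups: int) -> int:
       # PIM-BIDIR: one (*,G) entry per group, shared by all sources.
       return groups

   if __name__ == "__main__":
       sources_per_group, groups = 50, 200    # hypothetical workload
       print("PIM-SSM entries  :", ssm_state(sources_per_group, groups))
       print("PIM-BIDIR entries:", bidir_state(groups))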

3.2. Layer 2 multicast

With IPv4 unicast address resolution, the translation of an IP address to a MAC address is done dynamically by ARP.  With multicast address resolution, the mapping from a multicast IPv4 address to a multicast MAC address is done by assigning the low-order 23 bits of the multicast IPv4 address to fill the low-order 23 bits of the multicast MAC address.  Each IPv4 multicast address has 28 unique bits (the multicast address range is 224.0.0.0/4), so mapping a multicast IP address to a MAC address ignores 5 bits of the IP address.  Hence, groups of 32 multicast IP addresses are mapped to the same MAC address, meaning a multicast MAC address cannot be uniquely mapped back to a multicast IPv4 address.  Therefore, planning is required within an organization to choose IPv4 multicast addresses judiciously in order to avoid address aliasing.  When sending IPv6 multicast packets on an Ethernet link, the corresponding destination MAC address is a direct mapping of the last 32 bits of the 128 bit IPv6 multicast address into the 48 bit MAC address.  It is possible for more than one IPv6 multicast address to map to the same 48 bit MAC address.
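
The two mappings, and the resulting aliasing, can be illustrated with the following sketch (illustrative Python; the example group addresses are chosen purely to demonstrate aliasing):

   # Sketch of the IP-to-MAC mappings described above: the low-order
   # 23 bits of an IPv4 group address are copied into the 01:00:5e
   # MAC prefix, and the low-order 32 bits of an IPv6 group address
   # are copied into the 33:33 prefix.
   import ipaddress

   def ipv4_multicast_mac(group: str) -> str:
       low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
       return "01:00:5e:%02x:%02x:%02x" % (
           (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)

   def ipv6_multicast_mac(group: str) -> str:
       low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
       return "33:33:%02x:%02x:%02x:%02x" % (
           (low32 >> 24) & 0xFF, (low32 >> 16) & 0xFF,
           (low32 >> 8) & 0xFF, low32 & 0xFF)

   if __name__ == "__main__":
       # 233.252.0.1 and 234.124.0.1 differ only in the 5 ignored
       # bits, so both yield 01:00:5e:7c:00:01 (address aliasing).
       print(ipv4_multicast_mac("233.252.0.1"))
       print(ipv4_multicast_mac("234.124.0.1"))
       print(ipv6_multicast_mac("ff02::1:ff00:1"))   # 33:33:ff:00:00:01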

The default behaviour of many hosts (and, in fact, routers) is to block multicast traffic.  Consequently, when a host wishes to join an IPv4 multicast group, it sends an IGMP [RFC2236] [RFC3376] report to the router attached to the layer 2 segment and also instructs its data link layer to receive Ethernet frames that match the corresponding MAC address.  The data link layer filters the frames, passing those with matching destination addresses to the IP module.  Similarly, when transmitting, a host simply hands the multicast packet to the data link layer, which adds the layer 2 encapsulation using the MAC address derived in the manner previously discussed.

When this Ethernet frame with a multicast MAC address is received by a switch configured to forward multicast traffic, the default behaviour is to flood it to all the ports in the layer 2 segment.  Clearly there may not be a receiver for this multicast group present on each port, and IGMP snooping is used to avoid sending the frame out of ports without receivers.

IGMP snooping, with proxy reporting or report suppression, actively filters IGMP packets in order to reduce load on the multicast router by ensuring only the minimal quantity of information is sent.  The switch aims to ensure the router has only a single entry for the group, regardless of the number of active listeners.  If there are two active listeners in a group and the first one leaves, then the switch determines that the router does not need this information since it does not affect the status of the group from the router's point of view.  However, the next time there is a routine query from the router, the switch will forward the reply from the remaining host, to prevent the router from believing there are no active listeners.  It follows that in active IGMP snooping, the router will generally only know about the most recently joined member of the group.

In order for IGMP and thus IGMP snooping to function, a multicast router must exist on the network and generate IGMP queries.  The tables (holding the member ports for each multicast group) created for snooping are associated with the querier.  Without a querier, the tables are not created and snooping will not work.  Furthermore, IGMP general queries must be unconditionally forwarded by all switches involved in IGMP snooping.  Some IGMP snooping implementations include full querier capability.  Others are able to proxy and retransmit queries from the multicast router.
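
A much simplified sketch of the state a snooping switch builds is shown below (illustrative Python, not modelled on any particular switch implementation; timers, proxy reporting and IGMP version differences are deliberately ignored, and the port numbers and group address are assumptions):

   # Simplified sketch of the forwarding table an IGMP snooping switch
   # might build: it learns member ports from IGMP reports, removes
   # them on leaves, and forwards multicast frames only to member
   # ports plus the port facing the querier/multicast router.
   from collections import defaultdict

   class SnoopingSwitch:
       def __init__(self, router_port: int):
           self.router_port = router_port
           self.members = defaultdict(set)   # group address -> member ports

       def igmp_report(self, group: str, port: int) -> None:
           self.members[group].add(port)

       def igmp_leave(self, group: str, port: int) -> None:
           self.members[group].discard(port)

       def egress_ports(self, group: str, ingress_port: int) -> set:
           # Member ports plus the querier/router port, never the ingress.
           ports = set(self.members.get(group, set())) | {self.router_port}
           return ports - {ingress_port}

   if __name__ == "__main__":
       sw = SnoopingSwitch(router_port=48)
       sw.igmp_report("233.252.0.1", port=3)
       sw.igmp_report("233.252.0.1", port=7)
       print(sw.egress_ports("233.252.0.1", ingress_port=7))   # {3, 48}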

Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by IPv6 routers for discovering multicast listeners on a directly attached link, performing a similar function to IGMP in IPv4 networks.  MLDv1 [RFC2710] is similar to IGMPv2, and MLDv2 [RFC3810] [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD does not send its own distinct protocol messages.  Rather, MLD is a subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of ICMPv6 messages.  MLD snooping works similarly to IGMP snooping, described earlier.

3.3. Example use cases

A use case where PIM and IGMP are currently used in data centers is to support multicast in VXLAN deployments.  In the original VXLAN specification [RFC7348], a data-driven flood and learn control plane was proposed, requiring the data center IP fabric to support multicast routing.  A multicast group is associated with each virtual network, each uniquely identified by its VXLAN network identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the hypervisor or ToR switch, with local VMs that belong to this VNI would join the multicast group and use it for the exchange of BUM traffic with the other VTEPs.  Essentially, the VTEP would encapsulate any BUM traffic from attached VMs in an IP multicast packet, whose destination address is the associated multicast group address, and transmit the packet to the data center fabric.  Thus, PIM must be running in the fabric to maintain a multicast distribution tree per VNI.

Alternatively, rather than setting up a multicast distribution tree per VNI, a tree can be set up whenever hosts within the VNI wish to exchange multicast traffic.  For example, whenever a VTEP receives an IGMP report from a locally connected host, it would translate this into a PIM join message, which is propagated into the IP fabric.  In order to ensure this join message is sent to the IP fabric rather than over the VXLAN interface (since the VTEP will have a route back to the source of the multicast packet over the VXLAN interface and so would naturally attempt to send the join over this interface), a more specific route back to the source over the IP fabric must be configured.  In this approach, PIM must be configured on the SVIs associated with the VXLAN interface.
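
The flood-and-learn behaviour described in the first of these approaches can be sketched as follows (illustrative Python; the VNI-to-group mapping, VNI values and VTEP address are assumptions rather than values from any particular deployment):

   # Sketch of flood-and-learn VXLAN: each VNI is associated with a
   # multicast group, and a VTEP encapsulates BUM frames from local
   # VMs towards that group.  All values are illustrative.

   VNI_TO_GROUP = {
       10001: "233.252.0.10",
       10002: "233.252.0.11",
   }

   def groups_to_join(local_vnis):
       """Multicast groups a VTEP joins for the VNIs of its local VMs."""
       return {VNI_TO_GROUP[vni] for vni in local_vnis}

   def encapsulate_bum(vni: int, inner_frame: bytes, vtep_ip: str) -> dict:
       """Conceptual outer headers for a BUM frame (not a wire format)."""
       return {
           "outer_src_ip": vtep_ip,
           "outer_dst_ip": VNI_TO_GROUP[vni],   # the per-VNI group
           "vni": vni,
           "payload": inner_frame,
       }

   if __name__ == "__main__":
       print(groups_to_join([10001, 10002]))
       print(encapsulate_bum(10001, b"broadcast-frame", "192.0.2.1"))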

Another use case of PIM and IGMP in data centers is when IPTV servers use multicast to deliver content from the data center to end users.  IPTV is typically a one-to-many application where the hosts are configured for IGMPv3, the switches are configured with IGMP snooping, and the routers are running PIM-SSM mode.  Often redundant servers send multicast streams into the network and the network forwards the data across diverse paths.

Windows Media servers send multicast streams to clients.  Windows Media Services streams to an IP multicast address and all clients subscribe to the IP address to receive the same stream.  This allows a single stream to be played simultaneously by multiple clients, thus reducing bandwidth utilization.

3.4. Advantages and disadvantages

Arguably the biggest advantage of using PIM and IGMP to support one-to-many communication in data centers is that these protocols are relatively mature.  Consequently, PIM is available in most routers and IGMP is supported by most hosts and routers.  As such, no specialized hardware or relatively immature software is involved in using them in data centers.  Furthermore, the maturity of these protocols means their behaviour and performance in operational networks is well-understood, with widely available best practices and deployment guides for optimizing their performance.

However, somewhat ironically, the relative disadvantages of PIM and IGMP usage in data centers also stem mostly from their maturity.  Specifically, these protocols were standardized and implemented long before the highly-virtualized multi-tenant data centers of today existed.  Consequently, PIM and IGMP are neither optimally placed to deal with the requirements of one-to-many communication in modern data centers nor to exploit the characteristics and idiosyncrasies of data centers.  For example, there may be thousands of VMs participating in a multicast session, with some of these VMs migrating between servers within the data center and new VMs being continually spun up and wishing to join the session, while other VMs are leaving all the time.  In such a scenario, the churn in the PIM and IGMP state machines, the volume of control messages they would generate and the amount of state they would necessitate within routers, especially if they were deployed naively, would be untenable.

4. Alternative options for handling one-to-many traffic

Section 2 has shown that there is likely to be an increasing amount of one-to-many communication in data centers, and Section 3 has discussed how conventional multicast may be used to handle this traffic.  Having said that, there are a number of alternative options for handling this traffic pattern in data centers, as discussed in the following subsections.  It should be noted that many of these techniques are not mutually exclusive; in fact many deployments involve a combination of more than one of these techniques.  Furthermore, as will be shown, introducing a centralized controller or a distributed control plane makes these techniques more potent.

4.1. Minimizing traffic volumes

If handling one-to-many traffic in data centers is challenging, then arguably the most intuitive solution is to aim to minimize the volume of such traffic.

It was previously mentioned in Section 2 that the three main causes of one-to-many traffic in data centers are applications, overlays and protocols.  While, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols, and often by protocols running within overlays.  This reduction is possible by exploiting certain characteristics of data center networks: a fixed and regular topology, ownership and exclusive control by a single organization, well-known overlay encapsulation endpoints, etc.

A way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller.  For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of this new MAC address, the associated virtual network, the IP address, etc.  The controller could subsequently distribute this information to every encapsulation endpoint.  Consequently, when any endpoint receives an ARP request from a locally attached VM, it could simply consult its local copy of the information distributed by the controller and reply.  Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric.
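
This ARP suppression can be sketched as follows (illustrative Python; the class and method names, VNI value and addresses are assumptions used only to show the idea):

   # Sketch of controller-assisted ARP suppression: the controller
   # pushes IP-to-MAC bindings per virtual network to every
   # encapsulation endpoint, which can then answer local ARP requests
   # itself instead of flooding them across the fabric.

   class EncapsulationEndpoint:
       def __init__(self):
           # (virtual network identifier, IP address) -> MAC address,
           # as distributed by the centralized controller.
           self.bindings = {}

       def controller_update(self, vni: int, ip: str, mac: str) -> None:
           self.bindings[(vni, ip)] = mac

       def handle_arp_request(self, vni: int, target_ip: str):
           """Return the MAC address to answer with locally, or None
           to fall back to flooding the request over the overlay."""
           return self.bindings.get((vni, target_ip))

   if __name__ == "__main__":
       ep = EncapsulationEndpoint()
       ep.controller_update(10001, "198.51.100.10", "52:54:00:12:34:56")
       print(ep.handle_arp_request(10001, "198.51.100.10"))  # answered locally
       print(ep.handle_arp_request(10001, "198.51.100.99"))  # None -> flood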

Alternatively, the functionality supported by the controller can be realized by a distributed control plane.  BGP-EVPN [RFC7432] [RFC8365] is the most popular control plane used in data centers.  Typically, the encapsulation endpoints will exchange pertinent information with each other by all peering with a BGP route reflector (RR).  Thus, information about local MAC addresses, MAC to IP address mappings, virtual network identifiers, etc. can be disseminated.  Consequently, ARP requests from local VMs can be suppressed by the encapsulation endpoint.

4.2. Head end replication

A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER).  HER means the traffic is duplicated and sent to each end point individually using conventional IP unicast.  Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end.  Nevertheless, HER is especially attractive when overlays are in use as the replication can be carried out by the hypervisor or encapsulation end point.  Consequently, the VMs and IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points.  Additionally, it is possible to use a number of approaches for constructing and disseminating the list of endpoints that should receive each traffic flow.

For example, the reluctance of data center operators to enable PIM and IGMP within the data center fabric means VXLAN is often used with HER.  Thus, BUM traffic for each VNI is replicated and sent using unicast to the remote VTEPs with VMs in that VNI.  The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP.  Alternatively, the VTEPs may transmit appropriate state to a centralized controller, which in turn sends each VTEP the list of remote VTEPs for each VNI.  Lastly, HER also works well when a distributed control plane is used instead of the centralized controller.  Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.
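
A sketch of HER at a VTEP is shown below (illustrative Python; the flood list and addresses are assumptions and would in practice come from static configuration, a controller or BGP-EVPN routes):

   # Sketch of head end replication: a BUM frame for a given VNI is
   # encapsulated once per remote VTEP in that VNI and sent as
   # ordinary unicast, so the fabric sees no multicast at all.

   FLOOD_LIST = {
       # VNI -> remote VTEP IP addresses with VMs in that VNI
       10001: ["192.0.2.11", "192.0.2.12", "192.0.2.13"],
   }

   def head_end_replicate(vni: int, inner_frame: bytes, local_vtep: str):
       """Yield one unicast-encapsulated copy per remote VTEP."""
       for remote_vtep in FLOOD_LIST.get(vni, []):
           yield {
               "outer_src_ip": local_vtep,
               "outer_dst_ip": remote_vtep,   # unicast; no fabric multicast
               "vni": vni,
               "payload": inner_frame,
           }

   if __name__ == "__main__":
       copies = list(head_end_replicate(10001, b"broadcast-frame", "192.0.2.1"))
       print(len(copies), "unicast copies generated")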
are 563 better placed to deliver one-to-many traffic in data centers, 564 especially when judiciously combined with a centralized controller 565 and/or a distributed control plane (particularly one based on BGP- 566 EVPN). 568 6. IANA Considerations 570 This memo includes no request to IANA. 572 7. Security Considerations 574 No new security considerations result from this document 576 8. Acknowledgements 578 9. References 580 9.1. Normative References 582 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 583 Requirement Levels", BCP 14, RFC 2119, 584 DOI 10.17487/RFC2119, March 1997, 585 . 587 9.2. Informative References 589 [I-D.ietf-bier-use-cases] 590 Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., 591 Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. 592 Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 593 (work in progress), January 2018. 595 [I-D.ietf-nvo3-geneve] 596 Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic 597 Network Virtualization Encapsulation", draft-ietf- 598 nvo3-geneve-06 (work in progress), March 2018. 600 [I-D.ietf-nvo3-vxlan-gpe] 601 Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol 602 Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work 603 in progress), April 2018. 605 [I-D.ietf-spring-segment-routing] 606 Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., 607 Litkowski, S., and R. Shakir, "Segment Routing 608 Architecture", draft-ietf-spring-segment-routing-15 (work 609 in progress), January 2018. 611 [RFC2236] Fenner, W., "Internet Group Management Protocol, Version 612 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, 613 . 615 [RFC2710] Deering, S., Fenner, W., and B. Haberman, "Multicast 616 Listener Discovery (MLD) for IPv6", RFC 2710, 617 DOI 10.17487/RFC2710, October 1999, 618 . 620 [RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. 621 Thyagarajan, "Internet Group Management Protocol, Version 622 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, 623 . 625 [RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, 626 "Protocol Independent Multicast - Sparse Mode (PIM-SM): 627 Protocol Specification (Revised)", RFC 4601, 628 DOI 10.17487/RFC4601, August 2006, 629 . 631 [RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for 632 IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, 633 . 635 [RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, 636 "Bidirectional Protocol Independent Multicast (BIDIR- 637 PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, 638 . 640 [RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution 641 Problems in Large Data Center Networks", RFC 6820, 642 DOI 10.17487/RFC6820, January 2013, 643 . 645 [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, 646 L., Sridhar, T., Bursell, M., and C. Wright, "Virtual 647 eXtensible Local Area Network (VXLAN): A Framework for 648 Overlaying Virtualized Layer 2 Networks over Layer 3 649 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, 650 . 652 [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., 653 Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based 654 Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 655 2015, . 657 [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network 658 Virtualization Using Generic Routing Encapsulation", 659 RFC 7637, DOI 10.17487/RFC7637, September 2015, 660 . 662 [RFC7938] Lapukhov, P., Premji, A., and J. 

BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases].  The deployment envisioned with overlay networks is that the encapsulation endpoints would be the BFIRs, so knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast.  Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information.  A challenge associated with using BIER is that, unlike most of the other approaches discussed in this draft, it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

4.4. Segment Routing

Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the source routing paradigm, in which the manner in which a packet traverses a network is determined by an ordered list of instructions.  These instructions, known as segments, may have semantics that are local to an SR node or global within the SR domain.  SR allows a flow to be enforced through any topological path while maintaining per-flow state only at the ingress node to the SR domain.  Segment Routing can be applied to the MPLS and IPv6 data planes.  In the former, the list of segments is represented by the label stack and in the latter it is represented as a routing extension header.  Use cases are described in [I-D.ietf-spring-segment-routing] and are being considered in the context of BGP-based large-scale data center (DC) design [RFC7938].

Multicast in SR continues to be discussed in a variety of drafts and working groups.  The SPRING WG has not yet been chartered to work on multicast in SR.  One option is to locally allocate a Segment Identifier (SID) for existing replication solutions, such as PIM, mLDP, P2MP RSVP-TE and BIER.  It may also be that a new way to signal and install trees in SR is developed without creating state in the network.

5. Conclusions

As the volume and importance of one-to-many traffic in data centers increase, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its relatively poor scalability and its inability to exploit characteristics of data center network architectures.  Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built by PIM.  Rather, approaches which exploit characteristics of data center network architectures (e.g. a fixed and regular topology, ownership and exclusive control by a single organization, well-known overlay encapsulation endpoints, etc.) are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane (particularly one based on BGP-EVPN).

6. IANA Considerations

This memo includes no request to IANA.

7. Security Considerations

No new security considerations result from this document.

8. Acknowledgements

9. References

9.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

9.2. Informative References

[I-D.ietf-bier-use-cases]  Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 (work in progress), January 2018.

[I-D.ietf-nvo3-geneve]  Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-06 (work in progress), March 2018.

[I-D.ietf-nvo3-vxlan-gpe]  Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work in progress), April 2018.

[I-D.ietf-spring-segment-routing]  Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.

[RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, <https://www.rfc-editor.org/info/rfc2236>.

[RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999, <https://www.rfc-editor.org/info/rfc2710>.

[RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, <https://www.rfc-editor.org/info/rfc3376>.

[RFC3810]  Vida, R., Ed. and L. Costa, Ed., "Multicast Listener Discovery Version 2 (MLDv2) for IPv6", RFC 3810, DOI 10.17487/RFC3810, June 2004, <https://www.rfc-editor.org/info/rfc3810>.

[RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006, <https://www.rfc-editor.org/info/rfc4443>.

[RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006, <https://www.rfc-editor.org/info/rfc4601>.

[RFC4604]  Holbrook, H., Cain, B., and B. Haberman, "Using Internet Group Management Protocol Version 3 (IGMPv3) and Multicast Listener Discovery Protocol Version 2 (MLDv2) for Source-Specific Multicast", RFC 4604, DOI 10.17487/RFC4604, August 2006, <https://www.rfc-editor.org/info/rfc4604>.

[RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, <https://www.rfc-editor.org/info/rfc4607>.

[RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, <https://www.rfc-editor.org/info/rfc5015>.

[RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013, <https://www.rfc-editor.org/info/rfc6820>.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, <https://www.rfc-editor.org/info/rfc7348>.

[RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015, <https://www.rfc-editor.org/info/rfc7432>.

[RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015, <https://www.rfc-editor.org/info/rfc7637>.

[RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016, <https://www.rfc-editor.org/info/rfc7938>.

[RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016, <https://www.rfc-editor.org/info/rfc8014>.

[RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017, <https://www.rfc-editor.org/info/rfc8279>.

[RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018, <https://www.rfc-editor.org/info/rfc8365>.

Authors' Addresses

Mike McBride
Huawei

Email: michael.mcbride@huawei.com

Olufemi Komolafe
Arista Networks

Email: femi@arista.com