MBONED                                                        M. McBride
Internet-Draft                                                    Huawei
Intended status: Informational                               O. Komolafe
Expires: December 31, 2018                               Arista Networks
                                                           June 29, 2018

                 Multicast in the Data Center Overview
                     draft-ietf-mboned-dc-deploy-03

Abstract

   The volume and importance of one-to-many traffic patterns in data
   centers is likely to increase significantly in the future.  Reasons
   for this increase are discussed and then attention is paid to the
   manner in which this traffic pattern may be judiciously handled in
   data centers.  The intuitive solution of deploying conventional IP
   multicast within data centers is explored and evaluated.  Thereafter,
   a number of emerging innovative approaches are described before a
   number of recommendations are made.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 31, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  Reasons for increasing one-to-many traffic patterns . . . . .   3
     2.1.  Applications  . . . . . . . . . . . . . . . . . . . . . .   3
     2.2.  Overlays  . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.3.  Protocols . . . . . . . . . . . . . . . . . . . . . . . .   5
   3.  Handling one-to-many traffic using conventional multicast . .   5
     3.1.  Layer 3 multicast . . . . . . . . . . . . . . . . . . . .   6
     3.2.  Layer 2 multicast . . . . . . . . . . . . . . . . . . . .   6
     3.3.  Example use cases . . . . . . . . . . . . . . . . . . . .   8
     3.4.  Advantages and disadvantages  . . . . . . . . . . . . . .   9
   4.  Alternative options for handling one-to-many traffic  . . . .   9
     4.1.  Minimizing traffic volumes  . . . . . . . . . . . . . . .   9
     4.2.  Head end replication  . . . . . . . . . . . . . . . . . .  10
     4.3.  BIER  . . . . . . . . . . . . . . . . . . . . . . . . . .  11
     4.4.  Segment Routing . . . . . . . . . . . . . . . . . . . . .  12
   5.  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  12
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  13
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  13
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  13
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  13
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  13
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  15

1.  Introduction

   The volume and importance of one-to-many traffic patterns in data
   centers is likely to increase significantly in the future.  Reasons
   for this increase include the nature of the traffic generated by
   applications hosted in the data center, the need to handle broadcast,
   unknown unicast and multicast (BUM) traffic within the overlay
   technologies used to support multi-tenancy at scale, and the use of
   certain protocols that traditionally require one-to-many control
   message exchanges.  These trends, allied with the expectation that
   future highly virtualized data centers must support communication
   between potentially thousands of participants, may lead to the
   natural assumption that IP multicast will be widely used in data
   centers, specifically given the bandwidth savings it potentially
   offers.  However, such an assumption would be wrong.  In fact, there
   is widespread reluctance to enable IP multicast in data centers for a
   number of reasons, mostly pertaining to concerns about its
   scalability and reliability.

   This draft discusses some of the main drivers for the increasing
   volume and importance of one-to-many traffic patterns in data
   centers.  Thereafter, the manner in which conventional IP multicast
   may be used to handle this traffic pattern is discussed and some of
   the associated challenges highlighted.  Following this discussion, a
   number of alternative emerging approaches are introduced, before
   concluding by discussing key trends and making a number of
   recommendations.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

   Key trends suggest that the nature of the applications likely to
   dominate future highly-virtualized multi-tenant data centers will
   produce large volumes of one-to-many traffic.  For example, it is
   well-known that traffic flows in data centers have evolved from being
   predominantly North-South (e.g. client-server) to predominantly East-
   West (e.g. distributed computation).  This change has led to the
   consensus that topologies such as the Leaf/Spine, that are easier to
   scale in the East-West direction, are better suited to the data
   center of the future.  This increase in East-West traffic flows
   results from VMs often having to exchange numerous messages between
   themselves as part of executing a specific workload.  For example, a
   computational workload could require data, or an executable, to be
   disseminated to workers distributed throughout the data center, which
   may subsequently be polled for status updates.  The emergence of such
   applications means there is likely to be an increase in one-to-many
   traffic flows with the increasing dominance of East-West traffic.

   The TV broadcast industry is another potential future source of
   applications with one-to-many traffic patterns in data centers.  The
   requirement for robustness, stability and predictability has meant
   the TV broadcast industry has traditionally used TV-specific
   protocols, infrastructure and technologies for transmitting video
   signals between cameras, studios, mixers, encoders, servers etc.
   However, the growing cost and complexity of supporting this approach,
   especially as the bit rates of the video signals increase due to
   demand for formats such as 4K-UHD and 8K-UHD, means there is a
   consensus that the TV broadcast industry will transition from
   industry-specific transmission formats (e.g.  SDI, HD-SDI) over TV-
   specific infrastructure to using IP-based infrastructure.  The
   development of pertinent standards by the SMPTE, along with the
   increasing performance of IP routers, means this transition is
   gathering pace.  A possible outcome of this transition will be the
   building of IP data centers in broadcast plants.  Traffic flows in
   the broadcast industry are frequently one-to-many and so if IP data
   centers are deployed in broadcast plants, it is imperative that this
   traffic pattern is supported efficiently in that infrastructure.  In
   fact, a pivotal consideration for broadcasters considering
   transitioning to IP is the manner in which these one-to-many traffic
   flows will be managed and monitored in a data center with an IP
   fabric.

   Arguably one of the (few?) success stories in using conventional IP
   multicast has been for disseminating market trading data.  For
   example, IP multicast is commonly used today to deliver stock quotes
   from the data center stock exchange to a financial services provider and then to
   the stock analysts.  The most critical requirement of a multicast
   trading floor is that it be highly available. analysts or brokerages.  The network must be designed with
   no single point of failure and in such a way that the network can
   respond in a deterministic manner to any failure.  Typically  Typically,
   redundant servers (in a primary/backup or live live live-live mode) are sending send
   multicast streams into the network and network, with diverse paths being used
   across the network network.  Another critical requirement is forwarding reliability and
   traceability; regulatory and legal requirements means that the
   producer of the marketing data across diverse paths (when duplicate data is must know exactly where the flow was
   sent by multiple
   servers). and be able to prove conclusively that the data was received
   within agreed SLAs.  The stock exchange generating the one-to-many
   traffic and stock analysts/brokerage that receive the traffic will
   typically have their own data centers.  Therefore, the manner in
   which one-to-many traffic patterns are handled in these data centers
   are extremely important, especially given the requirements and
   constraints mentioned.

   Many data center cloud providers provide publish and subscribe
   applications.  There can be numerous publishers and subscribers and
   many message channels within a data center.  With unicast publish and
   subscribe, a separate message is sent to each subscriber of a
   publication.  With multicast publish/subscribe, only one message is
   sent, regardless of the number of subscribers.  In a publish/
   subscribe system, client applications, some of which are publishers
   and some of which are subscribers, are connected to a network of
   message brokers that receive publications on a number of topics, and
   send the publications on to the subscribers for those topics.  The
   more subscribers there are in the publish/subscribe system, the
   greater the improvement to network utilization there might be with
   multicast.
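   The bandwidth argument can be stated simply: with unicast a broker
   transmits one copy of each publication per subscriber, whereas with
   multicast it transmits a single copy and the network replicates it.
   A hypothetical sketch (the function name is ours, purely for
   illustration):

```python
def messages_sent(num_subscribers: int, multicast: bool) -> int:
    """Messages a broker must transmit per publication.

    With unicast, one copy per subscriber; with multicast, a single
    copy regardless of group size, since replication happens in the
    network rather than at the sender.
    """
    return 1 if multicast else num_subscribers

# For 1000 subscribers: 1000 unicast transmissions vs. 1 multicast.
```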

2.2.  Overlays

   The proposed architecture for supporting large-scale multi-tenancy in
   highly virtualized data centers [RFC8014] consists of a tenant's VMs
   distributed across the data center connected by a virtual network
   known as the overlay network.  A number of different technologies
   have been proposed for realizing the overlay network, including VXLAN
   [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and
   GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably
   partisan debate about the relative merits of these overlay
   technologies belies the fact that, conceptually, these overlays
   simply provide a means to encapsulate and tunnel Ethernet frames from
   the VMs over the data center IP fabric, thus emulating a layer 2
   segment between the VMs.  Consequently, the VMs believe and behave as
   if they are connected to the tenant's other VMs by a conventional
   layer 2 segment, regardless of their physical location within the
   data center.  Naturally, in a layer 2 segment, point to multi-point
   traffic can result from handling BUM (broadcast, unknown unicast and
   multicast) traffic.  Compounding this issue within data centers,
   since the tenant's VMs attached to the emulated segment may be
   dispersed throughout the data center, the BUM traffic may need to
   traverse the data center fabric.  Hence, regardless of the overlay
   technology used, due consideration must be given to handling BUM
   traffic, forcing the data center operator to consider the manner in
   which one-to-many communication is handled within the IP fabric.

2.3.  Protocols

   Conventionally, some key networking protocols used in data centers
   require one-to-many communication.  For example, ARP and ND use
   broadcast and multicast messages within IPv4 and IPv6 networks
   respectively to discover MAC address to IP address mappings.
   Furthermore, when these protocols are running within an overlay
   network, it is essential to ensure these messages are delivered to
   all the hosts on the emulated layer 2 segment, regardless of physical
   location within the data center.  The challenges associated with
   optimally delivering ARP and ND messages in data centers have
   attracted lots of attention [RFC6820].  Popular approaches in use
   mostly seek to exploit characteristics of data center networks to
   avoid having to broadcast/multicast these messages, as discussed in
   Section 4.1.

3.  Handling one-to-many traffic using conventional multicast

3.1.  Layer 3 multicast

   PIM is the most widely deployed multicast routing protocol and so,
   unsurprisingly, is the primary multicast routing protocol considered
   for use in the data center.  There are three potential popular
   flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607]
   or PIM-BIDIR [RFC5015].  It may be said that these different modes of
   PIM tradeoff the optimality of the multicast forwarding tree for the
   amount of multicast forwarding state that must be maintained at
   routers.  SSM provides the most efficient forwarding between sources
   and receivers and thus is most suitable for applications with one-to-
   many traffic patterns.  State is built and maintained for each (S,G)
   flow.  Thus, the amount of multicast forwarding state held by routers
   in the data center is proportional to the number of sources and
   groups.  At the other end of the spectrum, BIDIR is the most
   efficient shared tree solution as one tree is built for all (S,G)s,
   therefore minimizing the amount of state.  This state reduction is at
   the expense of optimal forwarding path between sources and receivers.
   This use of a shared tree makes BIDIR particularly well-suited for
   applications with many-to-many traffic patterns, given that the
   amount of state is uncorrelated to the number of sources.  SSM and
   BIDIR are optimizations of PIM-SM.  PIM-SM is still the most widely
   deployed multicast routing protocol.  PIM-SM can also be the most
   complex.  PIM-SM relies upon a single RP (Rendezvous Point) to set up
   the multicast tree and subsequently there is the option of switching
   to the SPT (shortest path tree), similar to SSM, or staying on the
   shared tree, similar to BIDIR.
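   The state trade-off between the three PIM modes can be made concrete
   with a back-of-the-envelope calculation.  The sketch below is
   illustrative only; its simplifying assumptions (every source sends to
   every group, and PIM-SM has switched every flow onto its shortest-
   path tree) are noted in the comments:

```python
def forwarding_state(sources: int, groups: int, mode: str) -> int:
    """Rough count of multicast forwarding entries at a router.

    Assumes every source sends to every group.  For PIM-SM, assumes
    every flow has switched to its SPT, so the router holds both
    (*,G) and (S,G) entries.
    """
    if mode == "SSM":    # one (S,G) entry per source/group pair
        return sources * groups
    if mode == "BIDIR":  # one shared (*,G) entry per group
        return groups
    if mode == "SM":     # (*,G) per group plus (S,G) after switchover
        return groups + sources * groups
    raise ValueError("unknown PIM mode: " + mode)

# 100 sources, 50 groups: SSM=5000, BIDIR=50, SM=5050 entries.
```

   The calculation shows why BIDIR scales best in state while SSM gives
   the most direct forwarding paths.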

3.2.  Layer 2 multicast

   With IPv4 unicast address resolution, the translation of an IP
   address to a MAC address is done dynamically by ARP.  With multicast
   address resolution, the mapping from a multicast IPv4 address to a
   multicast MAC address is done by assigning the low-order 23 bits of
   the multicast IPv4 address to fill the low-order 23 bits of the
   multicast MAC address.  Each IPv4 multicast address has 28 unique
   bits (the multicast address range is 224.0.0.0/4), therefore mapping
   a multicast IP address to a MAC address ignores 5 bits of the IP
   address.  Hence, groups of 32 multicast IP addresses are mapped to
   the same MAC address, meaning a multicast MAC address cannot be
   uniquely mapped to a multicast IPv4 address.  Therefore, planning is
   required within an organization to choose IPv4 multicast addresses
   judiciously in order to avoid address aliasing.  When sending IPv6
   multicast packets on an Ethernet link, the corresponding destination
   MAC address is a direct mapping of the last 32 bits of the 128 bit
   IPv6 multicast address into the 48 bit MAC address.  It is possible
   for more than one IPv6 multicast address to map to the same 48 bit
   MAC address.
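   The aliasing described above can be illustrated with a short Python
   sketch (the function names are ours, purely for illustration):

```python
def ipv4_mcast_to_mac(ip: str) -> str:
    """Map an IPv4 multicast address to its Ethernet MAC address.

    The low-order 23 bits of the IP address fill the low-order 23
    bits of the fixed 01:00:5e OUI prefix; the remaining 5 variable
    bits of the 28-bit group ID are discarded.
    """
    octets = [int(o) for o in ip.split(".")]
    assert 224 <= octets[0] <= 239, "not in 224.0.0.0/4"
    return "01:00:5e:%02x:%02x:%02x" % (
        octets[1] & 0x7F, octets[2], octets[3])

def ipv6_mcast_to_mac(last32: int) -> str:
    """Map the last 32 bits of an IPv6 multicast address into a MAC
    address with the fixed 33:33 prefix."""
    return "33:33:%02x:%02x:%02x:%02x" % (
        (last32 >> 24) & 0xFF, (last32 >> 16) & 0xFF,
        (last32 >> 8) & 0xFF, last32 & 0xFF)

# 224.1.1.1 and 225.1.1.1 alias to the same MAC: 01:00:5e:01:01:01
```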

   The default behaviour of many hosts (and, in fact, routers) is to
   block multicast traffic.  Consequently, when a host wishes to join an
   IPv4 multicast group, it sends an IGMP [RFC2236] [RFC3376] report to
   the router attached to the layer 2 segment and also instructs its
   data link layer to receive Ethernet frames that match the
   corresponding MAC address.  The data link layer filters the frames,
   passing those with matching destination addresses to the IP module.
   Similarly, a sending host simply hands the multicast packet to the
   data link layer for transmission, which adds the layer 2
   encapsulation using the MAC address derived in the manner previously
   discussed.
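   On most hosts, the join described above is triggered through the
   standard Berkeley sockets API.  The sketch below (the group address
   and port are arbitrary examples) causes the kernel to emit the IGMP
   report and program the data link layer filter:

```python
import socket
import struct

GROUP = "239.1.1.1"   # example administratively-scoped group
PORT = 5000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# IP_ADD_MEMBERSHIP makes the kernel send an IGMP membership report
# for the group and accept frames sent to the mapped multicast MAC.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                   socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# data, addr = sock.recvfrom(1500)  # receive datagrams for the group
```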

   When this Ethernet frame with a multicast MAC address is received by
   a switch, the default behaviour is to flood it to all the ports in
   the layer 2 segment.  Clearly there may not be a receiver for this
   multicast group present on each port, and IGMP snooping is used to
   avoid sending the frame out of ports without receivers.

   IGMP snooping, with proxy reporting or report suppression, actively
   filters IGMP packets in order to reduce load on the multicast router
   by ensuring only the minimal quantity of information is sent.  The
   switch aims to ensure the router has only a single entry for the
   group, regardless of the number of active listeners.  If there are
   two active listeners in a group and the first one leaves, then the
   switch determines that the router does not need this information
   since it does not affect the status of the group from the router's
   point of view.  However, the next time there is a routine query from
   the router, the switch will forward the reply from the remaining
   host, to prevent the router from believing there are no active
   listeners.  It follows that in active IGMP snooping, the router will
   generally only know about the most recently joined member of the
   group.
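   The pruning behaviour of IGMP snooping can be modelled with a toy
   sketch (a hypothetical class, not a real switch implementation):

```python
class SnoopingSwitch:
    """Toy model of IGMP snooping: forward a multicast frame only to
    ports from which a membership report for the group was heard,
    plus the multicast-router port.  Unknown groups are flooded."""

    def __init__(self, ports, router_port):
        self.ports = set(ports)
        self.router_port = router_port
        self.members = {}          # group -> set of listener ports

    def igmp_report(self, group, port):
        self.members.setdefault(group, set()).add(port)

    def egress_ports(self, group, ingress):
        listeners = self.members.get(group)
        if listeners is None:      # no snooping state: flood
            out = self.ports
        else:                      # prune to listeners + router port
            out = listeners | {self.router_port}
        return out - {ingress}

sw = SnoopingSwitch(ports=[1, 2, 3, 4], router_port=1)
sw.igmp_report("239.1.1.1", 3)
# a frame for 239.1.1.1 arriving on port 2 goes only to ports 1 and 3
```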

   In order for IGMP, and thus IGMP snooping, to function, a multicast
   router must exist on the network and generate IGMP queries.  The
   tables (holding the member ports for each multicast group) created
   for snooping are associated with the querier.  Without a querier the
   tables are not created and snooping will not work.  Furthermore, IGMP
   general queries must be unconditionally forwarded by all switches
   involved in IGMP snooping.  Some IGMP snooping implementations
   include full querier capability.  Others are able to proxy and
   retransmit queries from the multicast router.

   Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by
   IPv6 routers for discovering multicast listeners on a directly
   attached link, performing a similar function to IGMP in IPv4
   networks.  MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810]
   [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD
   does not send its own distinct protocol messages.  Rather, MLD is a
   subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of
   ICMPv6 messages.  MLD snooping works similarly to IGMP snooping,
   described earlier.

3.3.  Example use cases

   A use case where PIM and IGMP are currently used in data centers is
   to support multicast in VXLAN deployments.  In the original VXLAN
   specification [RFC7348], a data-driven flood and learn control plane
   was proposed, requiring the data center IP fabric to support
   multicast routing.  A multicast group is associated with each virtual
   network, each uniquely identified by its VXLAN network identifier
   (VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the
   hypervisor or ToR switch, with local VMs that belong to a given VNI
   would join the associated multicast group and use it for the exchange
   of BUM traffic with the other VTEPs.  Essentially, the VTEP would
   encapsulate any BUM traffic from attached VMs in an IP multicast
   packet, whose destination address is the associated multicast group
   address, and transmit the packet to the data center fabric.  Thus,
   PIM must be running in the fabric to maintain a multicast
   distribution tree per VNI.

   Alternatively, rather than setting up a multicast distribution tree
   per VNI, a tree can be set up whenever hosts within the VNI wish to
   exchange multicast traffic.  For example, whenever a VTEP receives an
   IGMP report from a locally connected host, it would translate this
   into a PIM join message which will be propagated into the IP fabric.
   In order to ensure this join message is sent to the IP fabric rather
   than over the VXLAN interface (since the VTEP will have a route back
   to the source of the multicast packet over the VXLAN interface and so
   would naturally attempt to send the join over this interface), a more
   specific route back to the source over the IP fabric must be
   configured.  In this approach PIM must be configured on the SVIs
   associated with the VXLAN interface.

   Another use case of PIM and IGMP in data centers is when IPTV servers
   use multicast to deliver content from the data center to end users.
   IPTV is typically a one-to-many application where the hosts are
   configured for IGMPv3, the switches are configured with IGMP
   snooping, and the routers are running PIM-SSM mode.  Often redundant
   servers send multicast streams into the network and the network
   forwards the data across diverse paths.

   Windows Media servers send multicast streams to clients.  Windows
   Media Services streams to an IP multicast address and all clients
   subscribe to the IP address to receive the same stream.  This allows
   a single stream to be played simultaneously by multiple clients,
   thus reducing bandwidth utilization.

3.4.  Advantages and disadvantages

   Arguably the biggest advantage of using PIM and IGMP to support one-
   to-many communication in data centers is that these protocols are
   relatively mature.  Consequently, PIM is available in most routers
   and IGMP is supported by most hosts and routers.  As such, no
   specialized hardware or relatively immature software is involved in
   using them in data centers.  Furthermore, the maturity of these
   protocols means their behaviour and performance in operational
   networks is well-understood, with widely available best practices
   and deployment guides for optimizing their performance.

   However, somewhat ironically, the relative disadvantages of VM traffic PIM and mobility within
   IGMP usage in data centers also stem mostly from their maturity.
   Specifically, these protocols were standardized and between DC networks.  DC networks have large
   numbers implemented long
   before the highly-virtualized multi-tenant data centers of servers.  DC networks today
   existed.  Consequently, PIM and IGMP are often used neither optimally placed to
   deal with cloud
   orchestration software.  DC networks often use IP Multicast in their
   unique environments.  This section looks at the challenges requirements of using one-to-many communication in modern
   data centers nor to exploit characteristics and idiosyncrasies of
   data centers.  For example, there may be thousands of VMs
   participating in a multicast session, with some of these VMs
   migrating to servers within the challenging data center environment.

   When IGMP/MLD Snooping is not implemented, ethernet switches will
   flood multicast frames out of center, new VMs being
   continually spun up and wishing to join the sessions while all switch-ports, which turns the
   traffic into something more like
   time other VMs are leaving.  In such a broadcast.

   VRRP uses multicast heartbeat to communicate between routers.  The
   communication between scenario, the host churn in the PIM
   and IGMP state machines, the default gateway volume of control messages they would
   generate and the amount of state they would necessitate within
   routers, especially if they were deployed naively, would be
   untenable.
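
   The scale of this churn can be sketched with a back-of-envelope
   calculation in which every figure is an assumption chosen purely for
   illustration:

```python
# Illustrative estimate of IGMP/PIM control-plane load under VM churn.
# All figures are assumed for the sake of the example.
vms = 10_000            # VMs participating in multicast sessions
churn_per_min = 0.05    # fraction of VMs joining or leaving each minute
msgs_per_event = 2      # e.g. an IGMP report plus a triggered PIM join/prune

events = vms * churn_per_min               # membership changes per minute
control_msgs = events * msgs_per_event     # control messages per minute
print(int(control_msgs))                   # 1000
```

   Even with these modest assumptions the routers must absorb a steady
   stream of control messages and the associated state updates.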

4.  Alternative options for handling one-to-many traffic

   Section 2 has shown that there is likely to be an increasing amount
   of one-to-many communications in data centers.  And Section 3 has
   discussed how conventional multicast may be used to handle this
   traffic.  Having said that, there are a number of alternative options
   for handling this traffic pattern in data centers, as discussed in
   the subsequent sections.  It should be noted that many of these
   techniques are not mutually exclusive; in fact many deployments
   involve a combination of more than one of these techniques.
   Furthermore, as will be shown, introducing a centralized controller
   or a distributed control plane makes these techniques more potent.

4.1.  Minimizing traffic volumes

   If handling one-to-many traffic in data centers can be challenging,
   then arguably the most intuitive solution is to aim to minimize the
   volume of such traffic.

   It was previously mentioned in Section 2 that the three main causes
   of one-to-many traffic in data centers are applications, overlays and
   protocols.  While, relatively speaking, little can be done about the
   volume of one-to-many traffic generated by applications, there is
   more scope for attempting to reduce the volume of such traffic
   generated by overlays and protocols (and often by protocols within
   overlays).  This reduction is possible by exploiting certain
   characteristics of data center networks: fixed and regular topology,
   owned and exclusively controlled by a single organization, well-known
   overlay encapsulation endpoints, etc.

   A way of minimizing the amount of one-to-many traffic that traverses
   the data center fabric is to use a centralized controller.  For
   example, whenever a new VM is instantiated, the hypervisor or
   encapsulation endpoint can notify a centralized controller of this
   new MAC address, the associated virtual network, IP address, etc.
   The controller could subsequently distribute this information to
   every encapsulation endpoint.  Consequently, when any endpoint
   receives an ARP request from a locally attached VM, it could simply
   consult its local copy of the information distributed by the
   controller and reply.  Thus, the ARP request is suppressed and does
   not result in one-to-many traffic traversing the data center IP
   fabric.
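
   A minimal sketch of such controller-driven ARP suppression at an
   encapsulation endpoint, with hypothetical data structures and
   values, is:

```python
# Controller-driven ARP suppression at an encapsulation endpoint.
# The (virtual network, IP) -> MAC table mimics state pushed down by a
# centralized controller; all names and values are hypothetical.
controller_db = {
    ("vni-5000", "192.0.2.10"): "52:54:00:aa:bb:cc",
}

def handle_arp_request(vni: str, target_ip: str):
    """Reply locally if the binding is known; otherwise fall back to flooding."""
    mac = controller_db.get((vni, target_ip))
    if mac is not None:
        return ("reply", mac)   # ARP suppressed: answered from the local copy
    return ("flood", None)      # unknown binding: one-to-many delivery needed

print(handle_arp_request("vni-5000", "192.0.2.10"))  # ('reply', '52:54:00:aa:bb:cc')
```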

   Alternatively, the upper layer, functionality supported by the controller can
   realized by a distributed control plane.  BGP-EVPN [RFC7432, RFC8365]
   is the most popular control plane used in data centers.  Typically,
   the encapsulation endpoints will exchange pertinent information with respect
   each other by all peering with a BGP route reflector (RR).  Thus,
   information about local MAC addresses, MAC to
   broadcast/multicast, affects IP address mapping,
   virtual networks identifiers etc can be disseminated.  Consequently,
   ARP requests from local VMs can be suppressed by the choice of head encapsulation
   endpoint.

4.2.  Head end replication

   A popular option for handling one-to-many traffic patterns in data
   centers is head end replication (HER).  HER means the traffic is
   duplicated and sent to each end point individually using conventional
   IP unicast.  Obvious disadvantages of HER include traffic duplication
   and the additional processing burden on the head end.  Nevertheless,
   HER is especially attractive when overlays are in use as the
   replication can be carried out by the hypervisor or encapsulation end
   point.  Consequently, the VMs and the data center fabric are
   unmodified and unaware of how the traffic is delivered to the
   multiple end points.  Additionally, it is possible to use a number of
   approaches for constructing and disseminating the list of which
   endpoints should receive what traffic.

   For example, the reluctance of data center operators to enable PIM
   and IGMP within the data center fabric means VXLAN is often used with
   HER.  Thus, BUM traffic from each VNI is replicated and sent using
   unicast to the remote VTEPs with VMs in that VNI.  The list of remote
   VTEPs to which the traffic should be sent may be configured manually
   on the VTEP.  Alternatively, the VTEPs may transmit appropriate state
   to a centralized controller which in turn sends each VTEP the list of
   remote VTEPs for each VNI.  Lastly, HER also works well when a
   distributed control plane is used instead of the centralized
   controller.  Again, BGP-EVPN may be used to distribute the
   information needed to facilitate HER to the VTEPs.
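
   The replication step itself is simple; the following sketch, with an
   invented flood list, shows a VTEP making one unicast-encapsulated
   copy of a BUM frame per remote VTEP in the VNI:

```python
# Head end replication at a VTEP: one unicast-encapsulated copy of a
# BUM frame per remote VTEP in the VNI's flood list.  The flood list
# here is invented; in practice it may be configured manually, pushed
# by a controller, or learned via BGP-EVPN.
flood_list = {
    5000: ["192.0.2.1", "192.0.2.2", "192.0.2.3"],
}

def replicate(vni: int, frame: bytes):
    """Return one (remote VTEP, encapsulated frame) pair per replica."""
    header = f"VXLAN(vni={vni})".encode()   # stand-in for a real header
    return [(vtep, header + frame) for vtep in flood_list.get(vni, [])]

print(len(replicate(5000, b"bum-frame")))   # 3 -- one copy per remote VTEP
```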

4.3.  BIER

   As discussed in Section 3.4, PIM and IGMP face potential scalability
   challenges when deployed in data centers.  These challenges are
   typically due to the requirement to build and maintain a distribution
   tree and the requirement to hold per-flow state in routers.  Bit
   Index Explicit Replication (BIER) [RFC8279] is a new multicast
   forwarding paradigm that avoids these two requirements.

   When a multicast packet enters a BIER domain, the ingress router,
   known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header
   to the packet.  This header contains a bit string in which each bit
   maps to an egress router, known as a Bit-Forwarding Egress Router
   (BFER).  If a bit is set, then the packet should be forwarded to the
   associated BFER.  The routers within the BIER domain, Bit-Forwarding
   Routers (BFRs), use the BIER header in the packet and information in
   the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise
   operations to determine how the packet should be replicated optimally
   so it reaches all the appropriate BFERs.
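
   The bit-wise forwarding step can be illustrated with a toy BIFT; the
   topology, bit assignments and masks below are invented for
   illustration:

```python
# Toy BIER forwarding step: each set bit in the packet's bit string
# identifies a BFER; the BIFT maps each bit to a neighbor and a
# forwarding bit mask.  Topology, bits and masks are invented.
bift = {
    # bit position: (neighbor, bit mask of BFERs reachable via that neighbor)
    0: ("nbrA", 0b0011),
    1: ("nbrA", 0b0011),
    2: ("nbrB", 0b0100),
}

def bier_forward(bitstring: int):
    """Replicate once per neighbor, clearing the bits each copy covers."""
    copies = []
    remaining = bitstring
    while remaining:
        bit = (remaining & -remaining).bit_length() - 1  # lowest set bit
        nbr, mask = bift[bit]
        copies.append((nbr, remaining & mask))  # copy carries only nbr's BFERs
        remaining &= ~mask                      # those BFERs are now handled
    return copies

print(bier_forward(0b0111))   # [('nbrA', 3), ('nbrB', 4)]
```

   Note that the router holds only the per-neighbor BIFT, not per-flow
   state: the set of receivers travels in the packet itself.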

   BIER is no request from receiver.  This often leads deemed to waste of switch cache and link bandwidth when be attractive for facilitating one-to-many
   communications in data ceneters [I-D.ietf-bier-use-cases].  The
   deployment envisioned with overlay networks is that the multicast
   streams are not actually required.  [I-D.pim-umf-problem-statement]
   details the problem and defines design goals for a generic mechanism
   to restrain
   encapsulation endpoints would be the unnecessary BFIR.  So knowledge about the
   actual multicast stream flooding.

6.  Layer 3 / Layer 2 Topological Variations

   As discussed groups does not reside in RFC6820, the ARMD problems statement, there are a
   variety of topological data center variations including L3 to Access
   Switches, L3 to Aggregation Switches, and L3 in fabric,
   improving the Core only.
   Further analysis is needed in order scalability compared to understand how these
   variations affect conventional IP Multicast scalability

7.  Address Resolution

7.1.  Solicited-node Multicast Addresses for IPv6 address resolution

   Solicited-node Multicast Addresses are multicast.
   Additionally, a centralized controller or a BGP-EVPN control plane
   may be used with IPv6 Neighbor
   Discovery BIER to provide ensure the same function as BFIR have the Address Resolution
   Protocol (ARP) required
   information.  A challenge associated with using BIER is that, unlike
   most of the other approaches discussed in IPv4.  ARP uses broadcasts, this draft, it requires
   changes to send an ARP
   Requests, which are received by all end hosts on the local link.
   Only forwarding behaviour of the host being queried responds.  However, routers used in the other hosts still
   have to process and discard data
   center IP fabric.

4.4.  Segment Routing

   Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the
   source routing paradigm, in which the manner in which a packet
   traverses a network is determined by an ordered list of instructions.
   These instructions are known as segments and may have a semantic
   local to an SR node or global within an SR domain.  SR allows a flow
   to be forced through any topological path while maintaining per-flow
   state only at the ingress node to the SR domain.  Segment Routing can
   be applied to the MPLS and IPv6 data planes.  In the former, the list
   of segments is represented by the label stack, and in the latter it
   is represented as a routing extension header.  Use cases are
   described in [I-D.ietf-spring-segment-routing] and are being
   considered in the context of BGP-based large-scale data center (DC)
   design [RFC7938].
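
   The property that per-flow state exists only at the ingress can be
   illustrated with a toy model of SR-MPLS forwarding, using invented
   SID values:

```python
# Toy model of SR-MPLS forwarding: the ingress pushes the full segment
# list and each subsequent node merely pops its own label, so no
# per-flow state exists beyond the ingress.  SID values are invented.
def ingress_push(segment_list, payload):
    """Ingress: impose the ordered segment list (outermost label first)."""
    return segment_list + [payload]

def transit_pop(packet, node_sid):
    """Transit node: consume its own SID and forward the rest unchanged."""
    assert packet[0] == node_sid, "packet not addressed to this node"
    return packet[1:]

pkt = ingress_push([16001, 16005], "payload")
pkt = transit_pop(pkt, 16001)   # first node along the enforced path
pkt = transit_pop(pkt, 16005)   # second node
print(pkt)   # ['payload']
```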

   Multicast in SR continues to be discussed in a variety of drafts and
   working groups.  The SPRING WG has not yet been chartered to work on
   multicast in SR.  Options include locally allocating a Segment
   Identifier (SID) to existing replication solutions, such as PIM,
   mLDP, P2MP RSVP-TE and BIER.  It may also be that a new way to signal
   and install trees in SR is developed without creating state in the
   network.

5.  Conclusions

   As the frames volume and passes frames with matching destination
   addresses to the importance of one-to-many traffic in data centers
   increases, conventional IP module.  Since the mapping from multicast IP
   address is likely to become increasingly
   unattractive for deployment in data centers for a MAC address ignores 5 bits number of the IP address, groups reasons,
   mostly pertaining its inherent relatively poor scalability and
   inability to exploit characteristics of
   32 multicast IP addresses are mapped data center network
   architectures.  Hence, even though IGMP/MLD is likely to remain the same MAC address.  As a
   result a multicast MAC address cannot be uniquely mapped to
   most popular manner in which end hosts signal interest in joining a
   multicast IPv4 address.  Planning group, it is required within an organization
   to select IPv4 groups unlikely that are far enough away from each other as to
   not end up with the same L2 address used.  Any this multicast address in
   the [224-239].0.0.x and [224-239].128.0.x ranges should not traffic will be
   considered.  When sending IPv6 multicast packets on an Ethernet link,
   transported over the corresponding destination MAC address is data center IP fabric using a direct mapping of the
   last 32 bits of the 128 bit IPv6 multicast address into the 48 bit
   MAC address.  It is possible for more than one IPv6 Multicast address
   to map
   distribution tree built by PIM.  Rather, approaches which exploit
   characteristics of data center network architectures (e.g. fixed and
   regular topology, owned and exclusively controlled by single
   organization, well-known overlay encapsulation endpoints etc.) are
   better placed to the same 48 bit MAC address.

8. deliver one-to-many traffic in data centers,
   especially when judiciously combined with a centralized controller
   and/or a distributed control plane (particularly one based on BGP-
   EVPN).

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   No new security considerations result from this document.

8.  Acknowledgements

   The authors would like to thank the many individuals who contributed
   opinions on the ARMD wg mailing list about this topic: Linda Dunbar,
   Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor
   Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli and Thomas
   Narten.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

9.2.  Informative References

   [I-D.ietf-bier-use-cases]
              Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A.,
              Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C.
              Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06
              (work in progress), January 2018.

   [I-D.ietf-nvo3-geneve]
              Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic
              Network Virtualization Encapsulation", draft-ietf-
              nvo3-geneve-06 (work in progress), March 2018.

   [I-D.ietf-nvo3-vxlan-gpe]
              Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol
              Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work
              in progress), April 2018.

   [I-D.ietf-spring-segment-routing]
              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing
              Architecture", draft-ietf-spring-segment-routing-15 (work
              in progress), January 2018.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version
              2", RFC 2236, DOI 10.17487/RFC2236, November 1997,
              <https://www.rfc-editor.org/info/rfc2236>.

   [RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast
              Listener Discovery (MLD) for IPv6", RFC 2710,
              DOI 10.17487/RFC2710, October 1999,
              <https://www.rfc-editor.org/info/rfc2710>.

   [RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A.
              Thyagarajan, "Internet Group Management Protocol, Version
              3", RFC 3376, DOI 10.17487/RFC3376, October 2002,
              <https://www.rfc-editor.org/info/rfc3376>.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601,
              DOI 10.17487/RFC4601, August 2006,
              <https://www.rfc-editor.org/info/rfc4601>.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for
              IP", RFC 4607, DOI 10.17487/RFC4607, August 2006,
              <https://www.rfc-editor.org/info/rfc4607>.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast (BIDIR-
              PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007,
              <https://www.rfc-editor.org/info/rfc5015>.

   [RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution
              Problems in Large Data Center Networks", RFC 6820,
              DOI 10.17487/RFC6820, January 2013,
              <https://www.rfc-editor.org/info/rfc6820>.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014,
              <https://www.rfc-editor.org/info/rfc7348>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
              Virtualization Using Generic Routing Encapsulation",
              RFC 7637, DOI 10.17487/RFC7637, September 2015,
              <https://www.rfc-editor.org/info/rfc7637>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

   [RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T.
              Narten, "An Architecture for Data-Center Network
              Virtualization over Layer 3 (NVO3)", RFC 8014,
              DOI 10.17487/RFC8014, December 2016,
              <https://www.rfc-editor.org/info/rfc8014>.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017,
              <https://www.rfc-editor.org/info/rfc8279>.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018,
              <https://www.rfc-editor.org/info/rfc8365>.

Authors' Addresses

   Mike McBride
   Huawei

   Email: michael.mcbride@huawei.com

   Olufemi Komolafe
   Arista Networks

   Email: femi@arista.com