MBONED                                                        M. McBride
Internet-Draft                                                    Huawei
Intended status: Informational                               O. Komolafe
Expires: December 31, 2018                               Arista Networks
                                                           June 29, 2018


                 Multicast in the Data Center Overview
                      draft-ietf-mboned-dc-deploy-03

Abstract

   The volume and importance of one-to-many traffic patterns in data
   centers is likely to increase significantly in the future.  Reasons
   for this increase are discussed and then attention is paid to the
   manner in which this traffic pattern may be judiciously handled in
   data centers.  The intuitive solution of deploying conventional IP
   multicast within data centers is explored and evaluated.
   Thereafter, a number of emerging innovative approaches are described
   before a number of recommendations are made.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 31, 2018.
Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  Reasons for increasing one-to-many traffic patterns . . . . .   3
     2.1.  Applications  . . . . . . . . . . . . . . . . . . . . . .   3
     2.2.  Overlays  . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.3.  Protocols . . . . . . . . . . . . . . . . . . . . . . . .   5
   3.  Handling one-to-many traffic using conventional multicast . .   5
     3.1.  Layer 3 multicast . . . . . . . . . . . . . . . . . . . .   6
     3.2.  Layer 2 multicast . . . . . . . . . . . . . . . . . . . .   6
     3.3.  Example use cases . . . . . . . . . . . . . . . . . . . .   8
     3.4.  Advantages and disadvantages  . . . . . . . . . . . . . .   9
   4.  Alternative options for handling one-to-many traffic  . . . .   9
     4.1.  Minimizing traffic volumes  . . . . . . . . . . . . . . .   9
     4.2.  Head end replication  . . . . . . . . . . . . . . . . . .  10
     4.3.  BIER  . . . . . . . . . . . . . . . . . . . . . . . . . .  11
     4.4.  Segment Routing . . . . . . . . . . . . . . . . . . . . .  12
   5.  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  12
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  13
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  13
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  13
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  13
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  13
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  15

1.  Introduction

   The volume and importance of one-to-many traffic patterns in data
   centers is likely to increase significantly in the future.  Reasons
   for this increase include the nature of the traffic generated by
   applications hosted in the data center, the need to handle broadcast,
   unknown unicast and multicast (BUM) traffic within the overlay
   technologies used to support multi-tenancy at scale, and the use of
   certain protocols that traditionally require one-to-many control
   message exchanges.  These trends, allied with the expectation that
   future highly virtualized data centers must support communication
   between potentially thousands of participants, may lead to the
   natural assumption that IP multicast will be widely used in data
   centers, specifically given the bandwidth savings it potentially
   offers.  However, such an assumption would be wrong.  In fact, there
   is widespread reluctance to enable IP multicast in data centers for a
   number of reasons, mostly pertaining to concerns about its
   scalability and reliability.  This draft discusses some of the main
   drivers for the increasing volume and importance of one-to-many
   traffic patterns in data centers.  Thereafter, the manner in which
   conventional IP multicast may be used to handle this traffic pattern
   is discussed and some of the associated challenges highlighted.
   Following this discussion, a number of alternative emerging
   approaches are introduced, before concluding by discussing key
   trends and making a number of recommendations.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications
   Key trends suggest that the nature of the applications likely to
   dominate future highly-virtualized multi-tenant data centers will
   produce large volumes of one-to-many traffic.  For example, it is
   well-known that traffic flows in data centers have evolved from being
   predominantly North-South (e.g. client-server) to predominantly East-
   West (e.g. distributed computation).  This change has led to the
   consensus that topologies such as the Leaf/Spine, that are easier to
   scale in the East-West direction, are better suited to the data
   center of the future.  This increase in East-West traffic flows
   results from VMs often having to exchange numerous messages between
   themselves as part of executing a specific workload.  For example, a
   computational workload could require data, or an executable, to be
   disseminated to workers distributed throughout the data center which
   may be subsequently polled for status updates.  The emergence of such
   applications means there is likely to be an increase in one-to-many
   traffic flows with the increasing dominance of East-West traffic.

   The TV broadcast industry is another potential future source of
   applications with one-to-many traffic patterns in data centers.  The
   requirement for robustness, stability and predictability has meant
   the
   TV broadcast industry has traditionally used TV-specific protocols,
   infrastructure and technologies for transmitting video signals
   between cameras, studios, mixers, encoders, servers etc.  However,
   the growing cost and complexity of supporting this approach,
   especially as the bit rates of the video signals increase due to
   demand for formats such as 4K-UHD and 8K-UHD, means there is a
   consensus that the TV broadcast industry will transition from
   industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-
   specific infrastructure to using IP-based infrastructure.  The
   development of pertinent standards by the SMPTE, along with the
   increasing performance of IP routers, means this transition is
   gathering pace.  A possible outcome of this transition will be the
   building of IP data centers in broadcast plants.  Traffic flows in
   the broadcast industry are frequently one-to-many and so if IP data
   centers are deployed in broadcast plants, it is imperative that this
   traffic pattern is supported efficiently in that infrastructure.  In
   fact, a pivotal consideration for broadcasters considering
   transitioning to IP is the manner in which these one-to-many traffic
   flows will be managed and monitored in a data center with an IP
   fabric.

   Arguably one of the (few?) success stories in using conventional IP
   multicast has been for disseminating market trading data.  For
   example, IP multicast is commonly used today to deliver stock quotes
   from the stock exchange to a financial services provider and then to
   the stock
   analysts or brokerages.  The network must be designed with no single
   point of failure and in such a way that the network can respond in a
   deterministic manner to any failure.  Typically, redundant servers
   (in a primary/backup or live-live mode) send multicast streams into
   the network, with diverse paths being used across the network.
   Another critical requirement is reliability and traceability;
   regulatory and legal requirements mean that the producer of the
   market data must know exactly where the flow was sent and be able to
   prove conclusively that the data was received within agreed SLAs.
   The stock exchange generating the one-to-many traffic and the stock
   analysts/brokerages that receive the traffic will typically have
   their own data centers.  Therefore, the manner in which one-to-many
   traffic patterns are handled in these data centers is extremely
   important, especially given the requirements and constraints
   mentioned.

   Many data center cloud providers provide publish and subscribe
   applications.  There can be numerous publishers and subscribers and
   many message channels within a data center.  With publish and
   subscribe servers, a separate message is sent to each subscriber of a
   publication.  With multicast publish/subscribe, only one message is
   sent, regardless of the number of subscribers.  In a
   publish/subscribe system, client applications, some of which are
   publishers and some of which are subscribers, are connected to a
   network of message brokers that receive publications on a number of
   topics, and send the publications on to the subscribers for those
   topics.  The more subscribers there are in the publish/subscribe
   system, the greater the improvement to network utilization there
   might be with multicast.
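   The network-utilization argument above can be made concrete with a
   simple message-count model.  This is purely illustrative (real
   brokers batch, acknowledge and retransmit, and the function name is
   invented for this sketch):

```python
def broker_messages(publications, subscribers, multicast=False):
    """Messages a broker must transmit: one copy per subscriber with
    unicast delivery, but a single copy per publication with multicast."""
    return publications * (1 if multicast else subscribers)

# 10 publications fanned out to 1000 subscribers:
assert broker_messages(10, 1000) == 10000               # unicast copies
assert broker_messages(10, 1000, multicast=True) == 10  # one per publication
```

   The saving therefore grows linearly with the number of subscribers,
   which is why large publish/subscribe deployments are natural
   candidates for one-to-many delivery.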
2.2.  Overlays

   The proposed architecture for supporting large-scale multi-tenancy
   in highly virtualized data centers [RFC8014] consists of a tenant's
   VMs distributed across the data center connected by a virtual
   network known as the overlay network.  A number of different
   technologies have been proposed for realizing the overlay network,
   including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe],
   NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].

   The often fervent and arguably partisan debate about the relative
   merits of these overlay technologies belies the fact that,
   conceptually, it may be said that these overlays typically simply
   provide a means to encapsulate and tunnel Ethernet frames from the
   VMs over the data center IP fabric, thus emulating a layer 2 segment
   between the VMs.  Consequently, the VMs believe and behave as if
   they are connected to the tenant's other VMs by a conventional layer
   2 segment, regardless of their physical location within the data
   center.  Naturally, in a layer 2 segment, point to multi-point
   traffic can result from handling BUM (broadcast, unknown unicast and
   multicast) traffic.  And, compounding this issue within data
   centers, since the tenant's VMs attached to the emulated segment may
   be dispersed throughout the data center, the BUM traffic may
   need to traverse the data center fabric.  Hence, regardless of the
   overlay technology used, due consideration must be given to handling
   BUM traffic, forcing the data center operator to consider the manner
   in which one-to-many communication is handled within the IP fabric.

2.3.  Protocols

   Conventionally, some key networking protocols used in data centers
   require one-to-many communication.  For example, ARP and ND use
   broadcast and multicast messages within IPv4 and IPv6 networks
   respectively to discover MAC address to IP address mappings.
   Furthermore, when these protocols are running within an overlay
   network, it is essential to ensure the messages are delivered to all
   the hosts on the emulated layer 2 segment, regardless of physical
   location within the data center.  The challenges associated with
   optimally delivering ARP and ND messages in data centers have
   attracted lots of attention [RFC6820].  Popular approaches in use
   mostly seek to exploit characteristics of data center networks to
   avoid having to broadcast/multicast these messages, as discussed in
   Section 4.1.

3.  Handling one-to-many traffic using conventional multicast

3.1.  Layer 3 multicast

   PIM is the most widely deployed multicast routing protocol and so,
   unsurprisingly, is the primary multicast routing protocol considered
   for use in the data center.
   There are three potential popular flavours of PIM that may be used:
   PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015].  It may
   be said that these different modes of PIM trade off the optimality
   of the multicast forwarding tree for the amount of multicast
   forwarding state that must be maintained at routers.  SSM provides
   the most efficient forwarding between sources and receivers and thus
   is most suitable for applications with one-to-many traffic patterns.
   State is built and maintained for each (S,G) flow.  Thus, the amount
   of multicast forwarding state held by routers in the data center is
   proportional to the number of sources and groups.  At the other end
   of the spectrum, BIDIR is the most efficient shared tree solution as
   one tree is built for all (S,G)s, therefore minimizing the amount of
   state.  This state reduction is at the expense of an optimal
   forwarding path between sources and receivers.  This use of a shared
   tree makes BIDIR particularly well-suited for applications with
   many-to-many traffic patterns, given that the amount of state is
   uncorrelated to the number of sources.  SSM and BIDIR are
   optimizations of PIM-SM.  PIM-SM is still the most widely deployed
   multicast routing protocol.  PIM-SM can also be the most complex.
   PIM-SM relies upon a RP (Rendezvous Point) to set up the multicast
   tree and
   subsequently there is the option of switching to the SPT (shortest
   path tree), similar to SSM, or staying on the shared tree, similar
   to BIDIR.

3.2.  Layer 2 multicast

   With IPv4 unicast address resolution, the translation of an IP
   address to a MAC address is done dynamically by ARP.  With multicast
   address resolution, the mapping from a multicast IPv4 address to a
   multicast MAC address is done by assigning the low-order 23 bits of
   the multicast IPv4 address to fill the low-order 23 bits of the
   multicast MAC address.  Each IPv4 multicast address has 28 unique
   bits (the multicast address range is 224.0.0.0/4) therefore mapping
   a multicast IP address to a MAC address ignores 5 bits of the IP
   address.  Hence, groups of 32 multicast IP addresses are mapped to
   the same MAC address, meaning a multicast MAC address cannot be
   uniquely mapped to a multicast IPv4 address.  Therefore, planning is
   required within an organization to choose IPv4 multicast addresses
   judiciously in order to avoid address aliasing.
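   The 23-bit IPv4 mapping just described, and the analogous IPv6
   mapping onto the 33:33 MAC prefix, can be sketched as follows.  This
   is an illustrative sketch only (the function names are invented for
   this example), not an implementation from any standard:

```python
import ipaddress

def ipv4_mcast_to_mac(addr):
    """IPv4 group -> MAC: prefix 01:00:5e plus the low-order 23 bits."""
    packed = ipaddress.IPv4Address(addr).packed
    assert packed[0] >> 4 == 0xE, "not in 224.0.0.0/4"
    return "01:00:5e:%02x:%02x:%02x" % (packed[1] & 0x7F, packed[2], packed[3])

def ipv6_mcast_to_mac(addr):
    """IPv6 group -> MAC: prefix 33:33 plus the low-order 32 bits."""
    packed = ipaddress.IPv6Address(addr).packed
    return "33:33:" + ":".join("%02x" % b for b in packed[-4:])

# 5 IPv4 bits are ignored, so 32 groups alias onto one MAC address:
assert ipv4_mcast_to_mac("224.1.1.1") == "01:00:5e:01:01:01"
assert ipv4_mcast_to_mac("239.129.1.1") == "01:00:5e:01:01:01"
# IPv6 discards 96 bits, so distinct groups can also collide:
assert ipv6_mcast_to_mac("ff02::1") == "33:33:00:00:00:01"
assert ipv6_mcast_to_mac("ff05::1") == "33:33:00:00:00:01"
```

   The two aliasing assertions show concretely why group addresses must
   be chosen judiciously when filtering is done on MAC addresses.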
   When sending IPv6 multicast packets on an Ethernet link, the
   corresponding destination MAC address is a direct mapping of the
   last 32 bits of the 128 bit IPv6 multicast address into the 48 bit
   MAC address.  It is possible for more than one IPv6 multicast
   address to map to the same 48 bit MAC address.

   The default behaviour of many hosts (and, in fact, routers) is to
   block multicast traffic.  Consequently, when a host wishes to join
   an IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376]
   report to the router attached to the layer 2 segment and also it
   instructs its data link layer to receive Ethernet frames that match
   the corresponding MAC address.  The data link layer filters the
   frames, passing those with matching destination addresses to the IP
   module.  Similarly, hosts simply hand the multicast packet for
   transmission to the data link layer which would add the layer 2
   encapsulation, using the MAC address derived in the manner
   previously discussed.

   When this Ethernet frame with a multicast MAC address is received by
   a switch configured to forward multicast traffic, the default
   behaviour is to flood it to all the ports in the layer 2 segment.
   Clearly there may not be a
   receiver for this multicast group present on each port and IGMP
   snooping is used to avoid sending the frame out of ports without
   receivers.  IGMP snooping, with proxy reporting or report
   suppression, actively filters IGMP packets in order to reduce load
   on the multicast router by ensuring only the minimal quantity of
   information is sent.  The switch is trying to ensure the router has
   only a single entry for the group, regardless of the number of
   active listeners.  If there are two active listeners in a group and
   the first one leaves, then the switch determines that the router
   does not need this information since it does not affect the status
   of the group from the router's point of view.  However, the next
   time there is a routine query from the router, the switch will
   forward the reply from the remaining host, to prevent the router
   from believing there are no active listeners.  It follows that in
   active IGMP snooping, the router will generally only know about the
   most recently joined member of the group.

   In order for IGMP, and thus IGMP snooping, to function, a multicast
   router must exist on the network and generate IGMP queries.  The
   tables (holding the member ports for each multicast group) created
   for snooping are associated with the querier.  Without a querier the
   tables are not created and snooping will not work.  Furthermore,
   IGMP general queries must be unconditionally forwarded by all
   switches involved in IGMP snooping.  Some IGMP snooping
   implementations include full querier capability.
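   The report-suppression behaviour described above can be sketched
   with a toy model.  This is a deliberately simplified illustration
   (the class and method names are invented; timers, leave processing
   and queries are omitted) of a switch that tracks member ports per
   group while sending the upstream router only the first join:

```python
class SnoopingSwitch:
    """Toy IGMP snooping model with report suppression: member ports
    are tracked per group, but only the first report per group is
    forwarded upstream, so the router holds a single entry."""

    def __init__(self):
        self.members = {}            # group -> set of member ports
        self.reports_to_router = []  # joins actually sent upstream

    def report(self, group, port):
        ports = self.members.setdefault(group, set())
        first_listener = not ports   # suppress all joins after the first
        ports.add(port)
        if first_listener:
            self.reports_to_router.append(group)

sw = SnoopingSwitch()
for port in (1, 2, 3):
    sw.report("239.1.1.1", port)

assert sw.members["239.1.1.1"] == {1, 2, 3}     # frames go to all three ports
assert sw.reports_to_router == ["239.1.1.1"]    # router saw only one join
```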
   Others are able to proxy and retransmit queries from the multicast
   router.

   Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by
   IPv6 routers for discovering multicast listeners on a directly
   attached link, performing a similar function to IGMP in IPv4
   networks.  MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810]
   [RFC4604] similar to IGMPv3.  However, in contrast to IGMP, MLD does
   not send its own distinct protocol messages.  Rather, MLD is a
   subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of
   ICMPv6 messages.  MLD snooping works similarly to IGMP snooping,
   described earlier.

3.3.  Example use cases

   A use case where PIM and IGMP are currently used in data centers is
   to support multicast in VXLAN deployments.  In the original VXLAN
   specification [RFC7348], a data-driven flood and learn control plane
   was proposed, requiring the data center IP fabric to support
   multicast routing.  A multicast group is associated with each
   virtual network, each uniquely identified by its VXLAN network
   identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located
   in the hypervisor or ToR switch, with local VMs that belong to this
   VNI would join the multicast group and use it for the exchange of
   BUM traffic with the
   other VTEPs.  Essentially, the VTEP would encapsulate any BUM
   traffic from attached VMs in an IP multicast packet, whose
   destination address is the associated multicast group address, and
   transmit the packet to the data center fabric.  Thus, PIM must be
   running in the fabric to maintain a multicast distribution tree per
   VNI.

   Alternatively, rather than setting up a multicast distribution tree
   per VNI, a tree can be set up whenever hosts within the VNI wish to
   exchange multicast traffic.  For example, whenever a VTEP receives
   an IGMP report from a locally connected host, it would translate
   this into a PIM join message which will be propagated into the IP
   fabric.  In order to ensure this join message is sent to the IP
   fabric rather than over the VXLAN interface (since the VTEP will
   have a route back to the source of the multicast packet over the
   VXLAN interface and so would naturally attempt to send the join over
   this interface) a more specific route back to the source over the IP
   fabric must be configured.  In this approach PIM must be configured
   on the SVIs associated with the VXLAN interface.
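   The flood-and-learn forwarding decision at a VTEP can be sketched as
   follows.  This is an illustrative toy model only (function name,
   table layout and the example addresses are invented for this
   sketch); a real VTEP also learns inner-MAC-to-VTEP bindings from
   decapsulated traffic:

```python
def vtep_outer_destination(frame_dst_mac, mac_table, vni_group):
    """Toy flood-and-learn VTEP decision: a known unicast MAC is
    tunnelled to the learned remote VTEP address, while BUM frames are
    encapsulated towards the multicast group assigned to the VNI."""
    if frame_dst_mac in mac_table:      # known unicast: point-to-point tunnel
        return mac_table[frame_dst_mac]
    return vni_group                    # BUM: flood via the VNI's group

mac_table = {"00:00:00:00:00:01": "10.0.0.1"}   # hypothetical learned entry
group = "239.1.1.100"                           # group assigned to this VNI

assert vtep_outer_destination("00:00:00:00:00:01", mac_table, group) == "10.0.0.1"
assert vtep_outer_destination("ff:ff:ff:ff:ff:ff", mac_table, group) == group
```

   The second assertion is the case that obliges the fabric to run PIM:
   any unknown or broadcast destination falls back to the multicast
   group for the VNI.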
   Another use case of PIM and IGMP in data centers is when IPTV
   servers use multicast to deliver content from the data center to end
   users.  IPTV is typically a one to many application where the hosts
   are configured for IGMPv3, the switches are configured with IGMP
   snooping, and the routers are running PIM-SSM mode.  Often redundant
   servers send multicast streams into the network and the network
   forwards the data across diverse paths.  Windows Media servers send
   multicast streams to clients.  Windows Media Services streams to an
   IP multicast address and all clients subscribe to the IP address to
   receive the same stream.  This allows a single stream to be played
   simultaneously by multiple clients, thus reducing bandwidth
   utilization.

3.4.  Advantages and disadvantages

   Arguably the biggest advantage of using PIM and IGMP to support one-
   to-many communication in data centers is that these protocols are
   relatively mature.  Consequently, PIM is available in most routers
   and IGMP is supported by most hosts and routers.  As such, no
   specialized hardware or relatively immature software is involved in
   using them in data centers.  Furthermore, the maturity of these
   protocols means their behaviour and performance in operational
   networks is well-understood, with widely available best-practices
   and deployment guides for optimizing their performance.  However,
   somewhat ironically, the relative disadvantages of PIM and IGMP
   usage in data centers also stem mostly from their maturity.
   Specifically, these protocols were standardized and implemented long
   before the highly-virtualized multi-tenant data centers of today
   existed.
   Consequently, PIM and IGMP are neither optimally placed to deal with
   the requirements of one-to-many communication in modern data centers
   nor to exploit the characteristics and idiosyncrasies of data
   centers.  For example, there may be thousands of VMs participating
   in a multicast session, with some of these VMs migrating to servers
   within the data center, new VMs being continually spun up and
   wishing to join the sessions while all the time other VMs are
   leaving.  In such a scenario, the churn in the PIM and IGMP state
   machines, the volume of control messages they would generate and the
   amount of state they would necessitate within routers, especially if
   they were deployed naively, would be untenable.

4.  Alternative options for handling one-to-many traffic

   Section 2 has shown that there is likely to be an increasing amount
   of one-to-many communication in data centers.  And Section 3 has
   discussed how conventional multicast may be used to handle this
   traffic.  Having said that, there are a number of alternative
   options for handling this traffic pattern in data centers, as
   discussed in this section.  It should be noted that many of these
   techniques are not mutually exclusive; in fact many deployments
   involve a combination of more than one of these techniques.
Furthermore, as will be shown, introducing a centralized controller or a distributed control plane makes these techniques more potent.

4.1. Minimizing traffic volumes

If handling one-to-many traffic in data centers can be challenging, then arguably the most intuitive solution is to aim to minimize the volume of such traffic. It was previously mentioned in Section 2 that the three main causes of one-to-many traffic in data centers are applications, overlays and protocols. While, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols (and often by protocols within overlays). This reduction is possible by exploiting certain characteristics of data center networks: a fixed and regular topology, ownership and exclusive control by a single organization, well-known overlay encapsulation endpoints, etc. One way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller. For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of the new MAC address, the associated virtual network, IP address, etc. The controller could subsequently distribute this information to every encapsulation endpoint.
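The registration and dissemination steps just described can be sketched as follows. This is a deliberately minimal illustration; the class and method names are hypothetical and no particular controller protocol is implied:

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    """Central registry of VM reachability information (hypothetical)."""
    table: dict = field(default_factory=dict)     # (vni, ip) -> (mac, vtep)
    endpoints: list = field(default_factory=list)

    def register_vm(self, vni, ip, mac, vtep):
        # Called by the hypervisor/encapsulation endpoint when a VM is
        # instantiated; the mapping is then pushed to every endpoint.
        self.table[(vni, ip)] = (mac, vtep)
        for ep in self.endpoints:
            ep.table[(vni, ip)] = (mac, vtep)

@dataclass
class Endpoint:
    """Encapsulation endpoint holding a local copy of the controller state."""
    table: dict = field(default_factory=dict)

    def handle_arp_request(self, vni, target_ip):
        # If the mapping is known locally, the endpoint replies itself and
        # the ARP request never enters the data center fabric.
        entry = self.table.get((vni, target_ip))
        return entry[0] if entry else None        # MAC for the ARP reply
```

In a real deployment the push from the controller to the endpoints would of course be a network protocol (e.g. OpenFlow, OVSDB or a proprietary API) rather than a direct table write, but the division of labour is the same.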
Consequently, when any endpoint receives an ARP request from a locally attached VM, it could simply consult its local copy of the information distributed by the controller and reply. Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric. Alternatively, the functionality supported by the controller can be realized by a distributed control plane. BGP-EVPN [RFC7432][RFC8365] is the most popular such control plane used in data centers. Typically, the encapsulation endpoints will exchange pertinent information with each other by all peering with a BGP route reflector (RR). Thus, information about local MAC addresses, MAC to IP address mappings, virtual network identifiers, etc can be disseminated. Consequently, ARP requests from local VMs can be suppressed by the encapsulation endpoint.

4.2. Head end replication

A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER). HER means the traffic is duplicated and sent to each end point individually using conventional IP unicast. Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end. Nevertheless, HER is especially attractive when overlays are in use, as the replication can be carried out by the hypervisor or encapsulation end point. Consequently, the VMs and IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points. Additionally, it is possible to use a number of approaches for constructing and disseminating the list of which endpoints should receive what traffic. For example, the reluctance of data center operators to enable PIM and IGMP within the data center fabric means VXLAN is often used with HER. Thus, BUM traffic from each VNI is replicated and sent using unicast to the remote VTEPs with VMs in that VNI. The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP. Alternatively, the VTEPs may transmit appropriate state to a centralized controller which in turn sends each VTEP the list of remote VTEPs for each VNI. Lastly, HER also works well when a distributed control plane is used instead of a centralized controller. Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.

4.3. BIER

As discussed in Section 3.4, PIM and IGMP face potential scalability challenges when deployed in data centers. These challenges are typically due to the requirement to build and maintain a distribution tree and the requirement to hold per-flow state in routers. Bit Index Explicit Replication (BIER) [RFC8279] is a new
multicast forwarding paradigm that avoids these two requirements. When a multicast packet enters a BIER domain, the ingress router, known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header to the packet. This header contains a bit string in which each bit maps to an egress router, known as a Bit-Forwarding Egress Router (BFER). If a bit is set, then the packet should be forwarded to the associated BFER. The routers within the BIER domain, Bit-Forwarding Routers (BFRs), use the BIER header in the packet and information in the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise operations to determine how the packet should be replicated optimally so it reaches all the appropriate BFERs. BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases]. The deployment envisioned with overlay networks is that the encapsulation endpoints would be the BFIRs, so knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast. Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information. A challenge associated with using BIER is that, unlike most of the other approaches discussed in this draft, it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

4.4. Segment Routing

Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the source routing paradigm, in which the manner in which a packet traverses a network is determined by an ordered list of instructions. These instructions, known as segments, may have a semantic local to an SR node or global within an SR domain.
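The effect of such an ordered list of instructions can be illustrated with a short simulation. The five-node topology and function names below are purely hypothetical; in practice the segment list is encoded in the packet itself, not computed by the nodes:

```python
from collections import deque

# Toy five-node topology (purely illustrative, not from the draft).
TOPOLOGY = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "E"],
    "D": ["A", "E"],
    "E": ["C", "D"],
}

def shortest_path(src, dst):
    """BFS: how a packet travels between consecutive segments when no
    further instruction constrains it."""
    prev = {src: None}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in TOPOLOGY[node]:
            if nbr not in prev:
                prev[nbr] = node
                frontier.append(nbr)
    raise ValueError("destination unreachable")

def sr_route(ingress, segment_list):
    """Steer a packet through the ordered segment list imposed at the
    ingress; transit nodes need no per-flow state."""
    path = [ingress]
    at = ingress
    for segment in segment_list:   # execute the instructions in order
        hop = shortest_path(at, segment)
        path.extend(hop[1:])
        at = segment
    return path
```

For instance, sr_route("A", ["D", "E", "C"]) steers the packet along A-D-E-C even though the shortest path from A to C is A-B-C; the explicit path exists only in the segment list imposed at the ingress.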
SR allows enforcing a flow through any topological path while maintaining per-flow state only at the ingress node of the SR domain. Segment Routing can be applied to the MPLS and IPv6 data planes. In the former, the list of segments is represented by the label stack, and in the latter it is represented as a routing extension header. Use cases are described in [I-D.ietf-spring-segment-routing] and are being considered in the context of BGP-based large-scale data center (DC) design [RFC7938]. Multicast in SR continues to be discussed in a variety of drafts and working groups. The SPRING WG has not yet been chartered to work on multicast in SR. Multicast can include locally allocating a Segment Identifier (SID) to existing replication solutions, such as PIM, mLDP, P2MP RSVP-TE and BIER. It may also be that a new way to signal and install trees in SR is developed without creating state in the network.

5. Conclusions

As the volume and importance of one-to-many traffic in data centers increases, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its inherent relatively poor scalability and inability to exploit characteristics of data center network architectures. Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built by PIM. Rather, approaches which exploit characteristics of data center network architectures (e.g. a fixed and regular topology, ownership and exclusive control by a single organization, well-known overlay encapsulation endpoints, etc.) are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane (particularly one based on BGP-EVPN).

6. IANA Considerations

This memo includes no request to IANA.

7. Security Considerations

No new security considerations result from this document.

8.
Acknowledgements

9. References

9.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

9.2. Informative References

[I-D.ietf-bier-use-cases] Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 (work in progress), January 2018.

[I-D.ietf-nvo3-geneve] Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-06 (work in progress), March 2018.

[I-D.ietf-nvo3-vxlan-gpe] Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work in progress), April 2018.

[I-D.ietf-spring-segment-routing] Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.

[RFC2236] Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, <https://www.rfc-editor.org/info/rfc2236>.

[RFC2710] Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999, <https://www.rfc-editor.org/info/rfc2710>.

[RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, <https://www.rfc-editor.org/info/rfc3376>.

[RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006, <https://www.rfc-editor.org/info/rfc4601>.

[RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, <https://www.rfc-editor.org/info/rfc4607>.

[RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, <https://www.rfc-editor.org/info/rfc5015>.

[RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013, <https://www.rfc-editor.org/info/rfc6820>.

[RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, <https://www.rfc-editor.org/info/rfc7348>.

[RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015, <https://www.rfc-editor.org/info/rfc7432>.

[RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015, <https://www.rfc-editor.org/info/rfc7637>.

[RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016, <https://www.rfc-editor.org/info/rfc7938>.

[RFC8014] Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016, <https://www.rfc-editor.org/info/rfc8014>.
[RFC8279] Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017, <https://www.rfc-editor.org/info/rfc8279>. [RFC8365] Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018, <https://www.rfc-editor.org/info/rfc8365>. Authors' Addresses Mike McBride Huawei Email: michael.mcbride@huawei.com Olufemi Komolafe Arista Networks Email: femi@arista.com