MBONED                                                        M. McBride
Internet-Draft                                                    Huawei
Intended status: Informational                               O. Komolafe
Expires: December 31, 2018                               Arista Networks
                                                           June 29, 2018


                 Multicast in the Data Center Overview
                     draft-ietf-mboned-dc-deploy-03
Abstract

   The volume and importance of one-to-many traffic patterns in data
   centers is likely to increase significantly in the future.  Reasons
   for this increase are discussed and then attention is paid to the
   manner in which this traffic pattern may be judiciously handled in
   data centers.  The intuitive solution of deploying conventional IP
   multicast within data centers is explored and evaluated.
   Thereafter, a number of emerging innovative approaches are
   described before a number of recommendations are made.
Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 31, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  Reasons for increasing one-to-many traffic patterns . . . . .   3
     2.1.  Applications  . . . . . . . . . . . . . . . . . . . . . .   3
     2.2.  Overlays  . . . . . . . . . . . . . . . . . . . . . . . .   5
     2.3.  Protocols . . . . . . . . . . . . . . . . . . . . . . . .   5
   3.  Handling one-to-many traffic using conventional multicast . .   5
     3.1.  Layer 3 multicast . . . . . . . . . . . . . . . . . . . .   6
     3.2.  Layer 2 multicast . . . . . . . . . . . . . . . . . . . .   6
     3.3.  Example use cases . . . . . . . . . . . . . . . . . . . .   8
     3.4.  Advantages and disadvantages  . . . . . . . . . . . . . .   9
   4.  Alternative options for handling one-to-many traffic  . . . .   9
     4.1.  Minimizing traffic volumes  . . . . . . . . . . . . . . .   9
     4.2.  Head end replication  . . . . . . . . . . . . . . . . . .  10
     4.3.  BIER  . . . . . . . . . . . . . . . . . . . . . . . . . .  11
     4.4.  Segment Routing . . . . . . . . . . . . . . . . . . . . .  12
   5.  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . .  12
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  12
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  13
   8.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  13
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  13
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  13
     9.2.  Informative References  . . . . . . . . . . . . . . . . .  13
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  15
1.  Introduction

   The volume and importance of one-to-many traffic patterns in data
   centers is likely to increase significantly in the future.  Reasons
   for this increase include the nature of the traffic generated by
   applications hosted in the data center, the need to handle
   broadcast, unknown unicast and multicast (BUM) traffic within the
   overlay technologies used to support multi-tenancy at scale, and
   the use of certain protocols that traditionally require one-to-many
   control message exchanges.  These trends, allied with the
   expectation that future highly virtualized data centers must
   support communication between potentially thousands of
   participants, may lead to the natural assumption that IP multicast
   will be widely used in data centers, specifically given the
   bandwidth savings it potentially offers.  However, such an
   assumption would be wrong.  In fact, there is widespread reluctance
   to enable IP multicast in data centers for a number of reasons,
   mostly pertaining to concerns about its scalability and
   reliability.

   This draft discusses some of the main drivers for the increasing
   volume and importance of one-to-many traffic patterns in data
   centers.  Thereafter, the manner in which conventional IP multicast
   may be used to handle this traffic pattern is discussed and some of
   the associated challenges highlighted.  Following this discussion,
   a number of alternative emerging approaches are introduced, before
   concluding by discussing key trends and making a number of
   recommendations.
1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119.
2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

   Key trends suggest that the nature of the applications likely to
   dominate future highly-virtualized multi-tenant data centers will
   produce large volumes of one-to-many traffic.  For example, it is
   well-known that traffic flows in data centers have evolved from
   being predominantly North-South (e.g. client-server) to
   predominantly East-West (e.g. distributed computation).  This
   change has led to the consensus that topologies such as Leaf/Spine,
   which are easier to scale in the East-West direction, are better
   suited to the data center of the future.  The increase in East-West
   traffic flows results from VMs often having to exchange numerous
   messages between themselves as part of executing a specific
   workload.  For example, a computational workload could require
   data, or an executable, to be disseminated to workers distributed
   throughout the data center, which may subsequently be polled for
   status updates.  The emergence of such applications means there is
   likely to be an increase in one-to-many traffic flows with the
   increasing dominance of East-West traffic.
   The TV broadcast industry is another potential future source of
   applications with one-to-many traffic patterns in data centers.
   The requirement for robustness, stability and predictability has
   meant the TV broadcast industry has traditionally used TV-specific
   protocols, infrastructure and technologies for transmitting video
   signals between cameras, studios, mixers, encoders, servers etc.
   However, the growing cost and complexity of supporting this
   approach, especially as the bit rates of the video signals increase
   due to demand for formats such as 4K-UHD and 8K-UHD, means there is
   a consensus that the TV broadcast industry will transition from
   industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-
   specific infrastructure to using IP-based infrastructure.  The
   development of pertinent standards by SMPTE, along with the
   increasing performance of IP routers, means this transition is
   gathering pace.  A possible outcome of this transition will be the
   building of IP data centers in broadcast plants.  Traffic flows in
   the broadcast industry are frequently one-to-many and so, if IP
   data centers are deployed in broadcast plants, it is imperative
   that this traffic pattern is supported efficiently in that
   infrastructure.  In fact, a pivotal consideration for broadcasters
   considering transitioning to IP is the manner in which these one-
   to-many traffic flows will be managed and monitored in a data
   center with an IP fabric.
   Arguably one of the few success stories in using conventional IP
   multicast has been its use for disseminating market trading data.
   For example, IP multicast is commonly used today to deliver stock
   quotes from the stock exchange to financial services providers and
   then on to stock analysts or brokerages.  The network must be
   designed with no single point of failure and in such a way that it
   can respond in a deterministic manner to any failure.  Typically,
   redundant servers (in a primary/backup or live-live mode) send
   multicast streams into the network, with diverse paths being used
   across the network.  Another critical requirement is reliability
   and traceability; regulatory and legal requirements mean that the
   producer of the market data must know exactly where the flow was
   sent and be able to prove conclusively that the data was received
   within agreed SLAs.  The stock exchange generating the one-to-many
   traffic and the stock analysts/brokerages that receive the traffic
   will typically have their own data centers.  Therefore, the manner
   in which one-to-many traffic patterns are handled in these data
   centers is extremely important, especially given the requirements
   and constraints mentioned.
   Many data center cloud providers offer publish/subscribe
   applications.  There can be numerous publishers and subscribers,
   and many message channels, within a data center.  With unicast
   publish/subscribe, a separate message is sent to each subscriber of
   a publication.  With multicast publish/subscribe, only one message
   is sent, regardless of the number of subscribers.  In a publish/
   subscribe system, client applications, some of which are publishers
   and some of which are subscribers, are connected to a network of
   message brokers that receive publications on a number of topics and
   send the publications on to the subscribers for those topics.  The
   more subscribers there are in the publish/subscribe system, the
   greater the improvement to network utilization multicast might
   offer.
2.2.  Overlays

   The proposed architecture for supporting large-scale multi-tenancy
   in highly virtualized data centers [RFC8014] consists of a tenant's
   VMs distributed across the data center connected by a virtual
   network known as the overlay network.  A number of different
   technologies have been proposed for realizing the overlay network,
   including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe],
   NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].  The often
   fervent and arguably partisan debate about the relative merits of
   these overlay technologies belies the fact that, conceptually, they
   typically simply provide a means to encapsulate and tunnel Ethernet
   frames from the VMs over the data center IP fabric, thus emulating
   a layer 2 segment between the VMs.  Consequently, the VMs believe
   and behave as if they are connected to the tenant's other VMs by a
   conventional layer 2 segment, regardless of their physical location
   within the data center.  Naturally, in a layer 2 segment, point to
   multi-point traffic can result from handling BUM (broadcast,
   unknown unicast and multicast) traffic.  Compounding this issue
   within data centers, since the tenant's VMs attached to the
   emulated segment may be dispersed throughout the data center, the
   BUM traffic may need to traverse the data center fabric.  Hence,
   regardless of the overlay technology used, due consideration must
   be given to handling BUM traffic, forcing the data center operator
   to consider the manner in which one-to-many communication is
   handled within the IP fabric.
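   As an illustration of the encapsulation these overlays perform, the
   following sketch builds the 8-byte VXLAN header defined in
   [RFC7348].  It is a simplified illustration only; a real VTEP would
   additionally add the outer UDP (destination port 4789), IP and
   Ethernet headers, and the function names are ours:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348: byte 0 carries
    only the I flag (0x08) indicating a valid VNI, the 24-bit VNI
    occupies bytes 4-6, and the remaining bytes are reserved (zero)."""
    assert 0 <= vni < 2**24, "VNI is a 24-bit identifier"
    return struct.pack("!B3xI", 0x08, vni << 8)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header to a tenant's Ethernet frame; the
    result would be the payload of the outer UDP datagram."""
    return vxlan_header(vni) + inner_frame

print(vxlan_header(5000).hex())  # 0800000000138800
```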
2.3.  Protocols

   Conventionally, some key networking protocols used in data centers
   require one-to-many communication.  For example, ARP and ND use
   broadcast and multicast messages within IPv4 and IPv6 networks
   respectively to discover MAC address to IP address mappings.
   Furthermore, when these protocols are running within an overlay
   network, it is essential to ensure the messages are delivered to
   all the hosts on the emulated layer 2 segment, regardless of
   physical location within the data center.  The challenges
   associated with optimally delivering ARP and ND messages in data
   centers have attracted lots of attention [RFC6820].  Popular
   approaches in use mostly seek to exploit characteristics of data
   center networks to avoid having to broadcast/multicast these
   messages, as discussed in Section 4.1.
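   One such characteristic-exploiting approach is ARP suppression at
   the virtual switch: the VTEP answers ARP requests from a locally
   learned cache rather than flooding them across the fabric.  The
   sketch below shows only the core lookup logic; the data structure
   and addresses are hypothetical, and real implementations typically
   populate the cache from the overlay control plane:

```python
# Locally learned IP-to-MAC bindings (hypothetical example entries).
arp_cache = {
    "10.0.0.5": "52:54:00:aa:bb:01",
    "10.0.0.6": "52:54:00:aa:bb:02",
}

def handle_arp_request(target_ip: str) -> str:
    """Return the action the VTEP takes for an ARP request."""
    mac = arp_cache.get(target_ip)
    if mac is not None:
        # Known binding: reply locally; no BUM traffic enters the fabric.
        return f"reply {mac}"
    # Unknown binding: fall back to flooding the request as BUM traffic.
    return "flood"

print(handle_arp_request("10.0.0.5"))  # reply 52:54:00:aa:bb:01
print(handle_arp_request("10.0.0.9"))  # flood
```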
3.  Handling one-to-many traffic using conventional multicast

3.1.  Layer 3 multicast

   PIM is the most widely deployed multicast routing protocol and so,
   unsurprisingly, is the primary multicast routing protocol
   considered for use in the data center.  There are three popular
   flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM
   [RFC4607] and PIM-BIDIR [RFC5015].  These different modes of PIM
   trade off the optimality of the multicast forwarding tree against
   the amount of multicast forwarding state that must be maintained at
   routers.  SSM provides the most efficient forwarding between
   sources and receivers and thus is most suitable for applications
   with one-to-many traffic patterns.  State is built and maintained
   for each (S,G) flow, so the amount of multicast forwarding state
   held by routers in the data center is proportional to the number of
   sources and groups.  At the other end of the spectrum, BIDIR is the
   most efficient shared tree solution, as one tree is built for all
   (S,G)s, thereby minimizing the amount of state.  This state
   reduction comes at the expense of the optimal forwarding path
   between sources and receivers.  The use of a shared tree makes
   BIDIR particularly well-suited to applications with many-to-many
   traffic patterns, given that the amount of state is uncorrelated
   with the number of sources.  SSM and BIDIR are optimizations of
   PIM-SM, which remains the most widely deployed, and arguably the
   most complex, of the three.  PIM-SM relies upon an RP (Rendezvous
   Point) to set up the multicast tree; subsequently there is the
   option of switching to the SPT (shortest path tree), similar to
   SSM, or staying on the shared tree, similar to BIDIR.
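   The state trade-off described above can be made concrete with a
   back-of-the-envelope sketch (the deployment figures below are
   illustrative assumptions, not measurements):

```python
def ssm_state_entries(sources_per_group: int, groups: int) -> int:
    """PIM-SSM builds one (S,G) entry per source per group."""
    return sources_per_group * groups

def bidir_state_entries(sources_per_group: int, groups: int) -> int:
    """PIM-BIDIR keeps a single shared tree per group, so the state
    is independent of the number of active sources."""
    return groups

# Hypothetical deployment: 200 groups, each with 50 active sources.
print(ssm_state_entries(50, 200))    # 10000 entries per router
print(bidir_state_entries(50, 200))  # 200 entries per router
```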
3.2.  Layer 2 multicast

   With IPv4 unicast address resolution, the translation of an IP
   address to a MAC address is done dynamically by ARP.  With
   multicast address resolution, the mapping from a multicast IPv4
   address to a multicast MAC address is done by copying the low-order
   23 bits of the multicast IPv4 address into the low-order 23 bits of
   the multicast MAC address.  Each IPv4 multicast address has 28
   unique bits (the multicast address range is 224.0.0.0/4), so
   mapping a multicast IP address to a MAC address ignores 5 bits of
   the IP address.  Hence, groups of 32 multicast IP addresses are
   mapped to the same MAC address, meaning a multicast MAC address
   cannot be uniquely mapped back to a multicast IPv4 address.
   Therefore, planning is required within an organization to choose
   IPv4 multicast addresses judiciously in order to avoid address
   aliasing.  When sending IPv6 multicast packets on an Ethernet link,
   the corresponding destination MAC address is a direct mapping of
   the last 32 bits of the 128-bit IPv6 multicast address into the
   48-bit MAC address.  It is likewise possible for more than one IPv6
   multicast address to map to the same 48-bit MAC address.
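   The IPv4 mapping, and the resulting 32-to-1 address aliasing, can
   be sketched as follows (the mapping itself is standard, per
   RFC 1112; the function name is ours):

```python
import ipaddress

def ipv4_multicast_mac(addr: str) -> str:
    """Map an IPv4 multicast address to its Ethernet MAC address:
    the fixed prefix 01:00:5e followed by the low-order 23 bits of
    the IP address (RFC 1112), leaving 5 IP bits unrepresented."""
    ip = ipaddress.IPv4Address(addr)
    assert ip.is_multicast, "expects an address in 224.0.0.0/4"
    low23 = int(ip) & 0x7FFFFF
    octets = [0x01, 0x00, 0x5E,
              low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)

# The 5 ignored bits mean e.g. 224.1.1.1 and 225.1.1.1 collide.
print(ipv4_multicast_mac("224.1.1.1"))  # 01:00:5e:01:01:01
print(ipv4_multicast_mac("225.1.1.1"))  # 01:00:5e:01:01:01
```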
   The default behaviour of many hosts (and, in fact, routers) is to
   block multicast traffic.  Consequently, when a host wishes to join
   an IPv4 multicast group, it sends an IGMP [RFC2236] [RFC3376]
   report to the router attached to the layer 2 segment and also
   instructs its data link layer to receive Ethernet frames that match
   the corresponding MAC address.  The data link layer filters the
   frames, passing those with matching destination addresses to the IP
   module.  Similarly, a sending host simply hands the multicast
   packet to the data link layer for transmission, which adds the
   layer 2 encapsulation using the MAC address derived in the manner
   previously discussed.
   When an Ethernet frame with a multicast MAC address is received by
   a switch configured to forward multicast traffic, the default
   behaviour is to flood it to all the ports in the layer 2 segment.
   Clearly there may not be a receiver for this multicast group
   present on each port, and IGMP snooping is used to avoid sending
   the frame out of ports without receivers.
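   The forwarding decision an IGMP snooping switch makes can be
   sketched as follows (the table contents and port numbers are
   hypothetical; a real switch learns the table from the IGMP reports
   seen on each port):

```python
# Snooped state: which ports have interested receivers per group MAC.
group_ports = {"01:00:5e:01:01:01": {2, 5}}
router_ports = {8}                  # ports leading to multicast routers
all_ports = {1, 2, 3, 4, 5, 6, 7, 8}

def egress_ports(dst_mac: str, ingress: int) -> set:
    """Ports a multicast frame is forwarded out of."""
    receivers = group_ports.get(dst_mac)
    if receivers is None:
        # No snooped state for this group: flood the segment.
        return all_ports - {ingress}
    # Constrain to interested receivers plus router ports.
    return (receivers | router_ports) - {ingress}

print(sorted(egress_ports("01:00:5e:01:01:01", ingress=2)))  # [5, 8]
```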
| IGMP snooping, with proxy reporting or report suppression, actively | IGMP snooping, with proxy reporting or report suppression, actively | |||
| filters IGMP packets in order to reduce load on the multicast router. | filters IGMP packets in order to reduce load on the multicast router | |||
| Joins and leaves heading upstream to the router are filtered so that | by ensuring only the minimal quantity of information is sent. The | |||
| only the minimal quantity of information is sent. The switch is | switch is trying to ensure the router has only a single entry for the | |||
| trying to ensure the router only has a single entry for the group, | group, regardless of the number of active listeners. If there are | |||
| regardless of how many active listeners there are. If there are two | two active listeners in a group and the first one leaves, then the | |||
| active listeners in a group and the first one leaves, then the switch | switch determines that the router does not need this information | |||
| determines that the router does not need this information since it | since it does not affect the status of the group from the router's | |||
| does not affect the status of the group from the router's point of | point of view. However the next time there is a routine query from | |||
| view. However the next time there is a routine query from the router | the router the switch will forward the reply from the remaining host, | |||
| the switch will forward the reply from the remaining host, to prevent | to prevent the router from believing there are no active listeners. | |||
| the router from believing there are no active listeners. It follows | It follows that in active IGMP snooping, the router will generally | |||
| that in active IGMP snooping, the router will generally only know | only know about the most recently joined member of the group. | |||
| about the most recently joined member of the group. | ||||
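The report-suppression behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not any particular switch implementation; the class and method names are hypothetical.

```python
# Simplified sketch of IGMP snooping with report suppression.
# The switch tracks member ports per group and forwards a join upstream
# only for the first member of a group; a leave is forwarded only when
# the last member departs. Names are illustrative only.

class IgmpSnoopingSwitch:
    def __init__(self):
        self.members = {}  # group address -> set of member ports

    def on_report(self, group, port):
        """Host on `port` joins `group`; returns True if the report is
        forwarded upstream to the multicast router."""
        ports = self.members.setdefault(group, set())
        first_member = not ports
        ports.add(port)
        return first_member  # suppress reports for an already-known group

    def on_leave(self, group, port):
        """Host on `port` leaves `group`; returns True if the leave is
        forwarded upstream (i.e. the last member has departed)."""
        ports = self.members.get(group, set())
        ports.discard(port)
        if not ports:
            self.members.pop(group, None)
            return True
        return False  # router still has at least one active listener

sw = IgmpSnoopingSwitch()
assert sw.on_report("239.1.1.1", 1) is True    # first join forwarded
assert sw.on_report("239.1.1.1", 2) is False   # second join suppressed
assert sw.on_leave("239.1.1.1", 1) is False    # group still has a listener
assert sw.on_leave("239.1.1.1", 2) is True     # last leave forwarded
```

This also illustrates why, with active snooping, the router generally only knows about the most recently joined member.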
| In order for IGMP, and thus IGMP snooping, to function, a multicast | In order for IGMP and thus IGMP snooping to function, a multicast | |||
| router must exist on the network and generate IGMP queries. The | router must exist on the network and generate IGMP queries. The | |||
| tables (holding the member ports for each multicast group) created | tables (holding the member ports for each multicast group) created | |||
| for snooping are associated with the querier. Without a querier the | for snooping are associated with the querier. Without a querier the | |||
| tables are not created and snooping will not work. Furthermore IGMP | tables are not created and snooping will not work. Furthermore, IGMP | |||
| general queries must be unconditionally forwarded by all switches | general queries must be unconditionally forwarded by all switches | |||
| involved in IGMP snooping. Some IGMP snooping implementations | involved in IGMP snooping. Some IGMP snooping implementations | |||
| include full querier capability. Others are able to proxy and | include full querier capability. Others are able to proxy and | |||
| retransmit queries from the multicast router. | retransmit queries from the multicast router. | |||
| In source-only networks, however, which presumably describes most | Multicast Listener Discovery (MLD) [RFC 2710] [RFC 3810] is used by | |||
| data center networks, there are no IGMP hosts on switch ports to | IPv6 routers for discovering multicast listeners on a directly | |||
| generate IGMP packets. Switch ports are connected to multicast | attached link, performing a similar function to IGMP in IPv4 | |||
| source ports and multicast router ports. The switch typically learns | networks. MLDv1 [RFC 2710] is similar to IGMPv2 and MLDv2 [RFC 3810] | |||
| about multicast groups from the multicast data stream by using a type | [RFC 4604] similar to IGMPv3. However, in contrast to IGMP, MLD does | |||
| of source only learning (when only receiving multicast data on the | not send its own distinct protocol messages. Rather, MLD is a | |||
| port, no IGMP packets). The switch forwards traffic only to the | subprotocol of ICMPv6 [RFC 4443] and so MLD messages are a subset of | |||
| multicast router ports. When the switch receives traffic for new IP | ICMPv6 messages. MLD snooping works similarly to IGMP snooping, | |||
| multicast groups, it will typically flood the packets to all ports in | described earlier. | |||
| the same VLAN. This unnecessary flooding can impact switch | ||||
| performance. | ||||
| 4. L3 Multicast Protocols in the Data Center | 3.3. Example use cases | |||
| There are three flavors of PIM used for Multicast Routing in the Data | A use case where PIM and IGMP are currently used in data centers is | |||
| Center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR [RFC5015]. | to support multicast in VXLAN deployments. In the original VXLAN | |||
| SSM provides the most efficient forwarding between sources and | specification [RFC7348], a data-driven flood and learn control plane | |||
| receivers and is most suitable for one to many types of multicast | was proposed, requiring the data center IP fabric to support | |||
| applications. State is built for each (S,G) channel; therefore, the more | multicast routing. A multicast group is associated with each virtual | |||
| sources and groups there are, the more state there is in the network. | network, each uniquely identified by its VXLAN network identifier | |||
| BIDIR is the most efficient shared tree solution as one tree is built | (VNI). VXLAN tunnel endpoints (VTEPs), typically located in the | |||
| for all S,G's, therefore saving state. But it is not the most | hypervisor or ToR switch, with local VMs that belong to this VNI | |||
| efficient in forwarding path between sources and receivers. SSM and | would join the multicast group and use it for the exchange of BUM | |||
| BIDIR are optimizations of PIM-SM. PIM-SM is still the most widely | traffic with the other VTEPs. Essentially, the VTEP would | |||
| deployed multicast routing protocol. PIM-SM can also be the most | encapsulate any BUM traffic from attached VMs in an IP multicast | |||
| complex. PIM-SM relies upon a RP (Rendezvous Point) to set up the | packet, whose destination address is the associated multicast group | |||
| multicast tree and then will either switch to the SPT (shortest path | address, and transmit the packet to the data center fabric. Thus, | |||
| tree), similar to SSM, or stay on the shared tree (similar to BIDIR). | PIM must be running in the fabric to maintain a multicast | |||
| For massive amounts of hosts sending (and receiving) multicast, the | distribution tree per VNI. | |||
| shared tree (particularly with PIM-BIDIR) provides the best potential | ||||
| scaling since no matter how many multicast sources exist within a | ||||
| VLAN, the tree number stays the same. IGMP snooping, IGMP proxy, and | ||||
| PIM-BIDIR have the potential to scale to the huge scaling numbers | ||||
| required in a data center. | ||||
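The state-scaling contrast between per-channel SSM trees and the shared BIDIR tree can be made concrete with a back-of-envelope calculation; the numbers below are illustrative only.

```python
# PIM-SSM keeps one (S,G) entry per source per group, whereas PIM-BIDIR
# keeps a single shared (*,G) entry per group regardless of how many
# sources exist. Illustrative figures, not measurements.

def ssm_state_entries(sources_per_group, groups):
    return sources_per_group * groups   # one (S,G) entry per source/group

def bidir_state_entries(sources_per_group, groups):
    return groups                       # one shared (*,G) tree per group

groups, sources = 1000, 50
print(ssm_state_entries(sources, groups))    # 50000 (S,G) entries
print(bidir_state_entries(sources, groups))  # 1000 (*,G) entries
```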
| 5. Challenges of using multicast in the Data Center | Alternatively, rather than setting up a multicast distribution tree | |||
| per VNI, a tree can be set up whenever hosts within the VNI wish to | ||||
| exchange multicast traffic. For example, whenever a VTEP receives an | ||||
| IGMP report from a locally connected host, it would translate this | ||||
| into a PIM join message which will be propagated into the IP fabric. | ||||
| In order to ensure this join message is sent to the IP fabric rather | ||||
| than over the VXLAN interface (since the VTEP will have a route back | ||||
| to the source of the multicast packet over the VXLAN interface and so | ||||
| would naturally attempt to send the join over this interface) a more | ||||
| specific route back to the source over the IP fabric must be | ||||
| configured. In this approach PIM must be configured on the SVIs | ||||
| associated with the VXLAN interface. | ||||
| Data Center environments may create unique challenges for IP | Another use case of PIM and IGMP in data centers is when IPTV servers | |||
| Multicast. Data Center networks require a high amount of VM traffic | centers. IPTV is typically a one-to-many application where the hosts | |||
| and mobility within and between DC networks. DC networks have large | IPTV is typically a one to many application where the hosts are | |||
| numbers of servers. DC networks are often used with cloud | configured for IGMPv3, the switches are configured with IGMP | |||
| orchestration software. DC networks often use IP Multicast in their | snooping, and the routers are running PIM-SSM mode. Often redundant | |||
| unique environments. This section looks at the challenges of using | servers send multicast streams into the network and the network | |||
| multicast within the data center environment. | forwards the data across diverse paths. | |||
| When IGMP/MLD Snooping is not implemented, ethernet switches will | Windows Media servers send multicast streams to clients. Windows | |||
| flood multicast frames out of all switch-ports, which turns the | Media Services streams to an IP multicast address and all clients | |||
| traffic into something more like a broadcast. | subscribe to the IP address to receive the same stream. This allows | |||
| | a single stream to be played simultaneously by multiple clients, | |||
| | thus reducing bandwidth utilization. | |||
| VRRP uses multicast heartbeat to communicate between routers. The | 3.4. Advantages and disadvantages | |||
| communication between the host and the default gateway is unicast. | ||||
| The multicast heartbeat can be very chatty when there are thousands | ||||
| of VRRP pairs with sub-second heartbeat calls back and forth. | ||||
| Link-local multicast should scale well within one IP subnet | Arguably the biggest advantage of using PIM and IGMP to support one- | |||
| particularly with a large layer 3 domain extending down to the access | to-many communication in data centers is that these protocols are | |||
| or aggregation switches. But if multicast traverses beyond one IP | relatively mature. Consequently, PIM is available in most routers | |||
| subnet, which is necessary for an overlay like VXLAN, you could | and IGMP is supported by most hosts and routers. As such, no | |||
| potentially have scaling concerns. If using a VXLAN overlay, it is | specialized hardware or relatively immature software is involved in | |||
| necessary to map the L2 multicast in the overlay to L3 multicast in | using them in data centers. Furthermore, the maturity of these | |||
| the underlay or do head end replication in the overlay and receive | protocols means their behaviour and performance in operational | |||
| duplicate frames on the first link from the router to the core | networks is well-understood, with widely available best-practices and | |||
| switch. The solution could be to run potentially thousands of PIM | deployment guides for optimizing their performance. | |||
| messages to generate/maintain the required multicast state in the IP | ||||
| underlay. The behavior of the upper layer, with respect to | ||||
| broadcast/multicast, affects the choice of head end (*,G) or (S,G) | ||||
| replication in the underlay, which affects the opex and capex of the | ||||
| entire solution. A VXLAN, with thousands of logical groups, maps to | ||||
| head end replication in the hypervisor or to IGMP from the hypervisor | ||||
| and then PIM between the TOR and CORE 'switches' and the gateway | ||||
| router. | ||||
| Requiring IP multicast (especially PIM BIDIR) from the network can | However, somewhat ironically, the relative disadvantages of PIM and | |||
| prove challenging for data center operators especially at the kind of | IGMP usage in data centers also stem mostly from their maturity. | |||
| scale that the VXLAN/NVGRE proposals require. This is also true when | Specifically, these protocols were standardized and implemented long | |||
| the L2 topological domain is large and extended all the way to the L3 | before the highly-virtualized multi-tenant data centers of today | |||
| core. In data centers with highly virtualized servers, even small L2 | existed. Consequently, PIM and IGMP are neither optimally placed to | |||
| domains may spread across many server racks (i.e. multiple switches | deal with the requirements of one-to-many communication in modern | |||
| and router ports). | data centers nor to exploit characteristics and idiosyncrasies of | |||
| data centers. For example, there may be thousands of VMs | ||||
| participating in a multicast session, with some of these VMs | ||||
| migrating to servers within the data center, new VMs being | ||||
| continually spun up and wishing to join the sessions while all the | ||||
| time other VMs are leaving. In such a scenario, the churn in the PIM | ||||
| and IGMP state machines, the volume of control messages they would | ||||
| generate and the amount of state they would necessitate within | ||||
| routers, especially if they were deployed naively, would be | ||||
| untenable. | ||||
| It's not uncommon for there to be 10-20 VMs per server in a | 4. Alternative options for handling one-to-many traffic | |||
| virtualized environment. One vendor reported a customer requesting a | ||||
| scale to 400 VMs per server. For multicast to be a viable solution | ||||
| in this environment, the network needs to be able to scale to these | ||||
| numbers when these VMs are sending/receiving multicast. | ||||
| A lot of switching/routing hardware has problems with IP Multicast, | Section 2 has shown that there is likely to be an increasing amount | |||
| particularly with regard to hardware support of PIM-BIDIR. | of one-to-many communication in data centers. Section 3 has | |||
| discussed how conventional multicast may be used to handle this | ||||
| traffic. Having said that, there are a number of alternative options | ||||
| | for handling this traffic pattern in data centers, as discussed in the | |||
| | subsequent subsections. It should be noted that many of these techniques | |||
| are not mutually-exclusive; in fact many deployments involve a | ||||
| combination of more than one of these techniques. Furthermore, as | ||||
| will be shown, introducing a centralized controller or a distributed | ||||
| | control plane makes these techniques more potent. | |||
| Sending L2 multicast over a campus or data center backbone, in any | 4.1. Minimizing traffic volumes | |||
| sort of significant way, is a new challenge enabled for the first | ||||
| time by overlays. There are interesting challenges when pushing | ||||
| large amounts of multicast traffic through a network, and have thus | ||||
| far been dealt with using purpose-built networks. While the overlay | ||||
| proposals have been careful not to impose new protocol requirements, | ||||
| they have not addressed the issues of performance and scalability, | ||||
| nor the large-scale availability of these protocols. | ||||
| There is an unnecessary multicast stream flooding problem in the link | If handling one-to-many traffic in data centers can be challenging | |||
| layer switches between the multicast source and the PIM First Hop | then arguably the most intuitive solution is to aim to minimize the | |||
| Router (FHR). The IGMP-Snooping Switch will forward multicast | volume of such traffic. | |||
| streams to router ports, and the PIM FHR must receive all multicast | ||||
| streams even if there is no request from receiver. This often leads | ||||
| to waste of switch cache and link bandwidth when the multicast | ||||
| streams are not actually required. [I-D.pim-umf-problem-statement] | ||||
| details the problem and defines design goals for a generic mechanism | ||||
| to restrain the unnecessary multicast stream flooding. | ||||
| 6. Layer 3 / Layer 2 Topological Variations | It was previously mentioned in Section 2 that the three main causes | |||
| of one-to-many traffic in data centers are applications, overlays and | ||||
| protocols. While, relatively speaking, little can be done about the | ||||
| volume of one-to-many traffic generated by applications, there is | ||||
| more scope for attempting to reduce the volume of such traffic | ||||
| generated by overlays and protocols. (And often by protocols within | ||||
| overlays.) This reduction is possible by exploiting certain | ||||
| characteristics of data center networks: fixed and regular topology, | ||||
| owned and exclusively controlled by single organization, well-known | ||||
| overlay encapsulation endpoints etc. | ||||
| As discussed in RFC6820, the ARMD problem statement, there are a | A way of minimizing the amount of one-to-many traffic that traverses | |||
| variety of topological data center variations including L3 to Access | the data center fabric is to use a centralized controller. For | |||
| Switches, L3 to Aggregation Switches, and L3 in the Core only. | example, whenever a new VM is instantiated, the hypervisor or | |||
| Further analysis is needed in order to understand how these | encapsulation endpoint can notify a centralized controller of this | |||
| variations affect IP Multicast scalability. | new MAC address, the associated virtual network, IP address etc. The | |||
| controller could subsequently distribute this information to every | ||||
| encapsulation endpoint. Consequently, when any endpoint receives an | ||||
| ARP request from a locally attached VM, it could simply consult its | ||||
| local copy of the information distributed by the controller and | ||||
| reply. Thus, the ARP request is suppressed and does not result in | ||||
| one-to-many traffic traversing the data center IP fabric. | ||||
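The controller-driven ARP suppression described above can be sketched as follows. The data structures and names are hypothetical, intended only to show how a pre-distributed binding table lets an encapsulation endpoint answer locally instead of flooding.

```python
# Sketch of ARP suppression at an encapsulation endpoint. A centralized
# controller has pre-distributed IP-to-MAC bindings per virtual network,
# so the endpoint replies locally rather than flooding the request
# across the fabric. Illustrative only, not any vendor's API.

class EncapEndpoint:
    def __init__(self):
        # (virtual network id, IP address) -> MAC, pushed by the controller
        self.bindings = {}

    def controller_update(self, vni, ip, mac):
        self.bindings[(vni, ip)] = mac

    def on_arp_request(self, vni, target_ip):
        """Return the MAC for a local ARP reply, or None if the request
        would have to be flooded as one-to-many traffic."""
        return self.bindings.get((vni, target_ip))

ep = EncapEndpoint()
ep.controller_update(5000, "10.0.0.2", "00:11:22:33:44:55")
assert ep.on_arp_request(5000, "10.0.0.2") == "00:11:22:33:44:55"  # suppressed
assert ep.on_arp_request(5000, "10.0.0.9") is None                 # unknown: flood
```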
| 7. Address Resolution | Alternatively, the functionality supported by the controller can be | |||
| | realized by a distributed control plane. BGP-EVPN [RFC7432, RFC8365] | |||
| is the most popular control plane used in data centers. Typically, | ||||
| the encapsulation endpoints will exchange pertinent information with | ||||
| each other by all peering with a BGP route reflector (RR). Thus, | ||||
| information about local MAC addresses, MAC to IP address mapping, | ||||
| | virtual network identifiers, etc. can be disseminated. Consequently, | |||
| ARP requests from local VMs can be suppressed by the encapsulation | ||||
| endpoint. | ||||
| 7.1. Solicited-node Multicast Addresses for IPv6 address resolution | 4.2. Head end replication | |||
| Solicited-node Multicast Addresses are used with IPv6 Neighbor | A popular option for handling one-to-many traffic patterns in data | |||
| Discovery to provide the same function as the Address Resolution | centers is head end replication (HER). HER means the traffic is | |||
| Protocol (ARP) in IPv4. ARP uses broadcasts to send ARP | duplicated and sent to each end point individually using conventional | |||
| Requests, which are received by all end hosts on the local link. | IP unicast. Obvious disadvantages of HER include traffic duplication | |||
| Only the host being queried responds. However, the other hosts still | and the additional processing burden on the head end. Nevertheless, | |||
| have to process and discard the request. With IPv6, a host is | HER is especially attractive when overlays are in use as the | |||
| required to join a Solicited-Node multicast group for each of its | replication can be carried out by the hypervisor or encapsulation end | |||
| configured unicast or anycast addresses. Because a Solicited-node | point. Consequently, the VMs and IP fabric are unmodified and | |||
| Multicast Address is a function of the last 24-bits of an IPv6 | unaware of how the traffic is delivered to the multiple end points. | |||
| unicast or anycast address, the number of hosts that are subscribed | Additionally, it is possible to use a number of approaches for | |||
| to each Solicited-node Multicast Address would typically be one | constructing and disseminating the list of which endpoints should | |||
| (there could be more because the mapping function is not a 1:1 | receive what traffic and so on. | |||
| mapping). Compared to ARP in IPv4, a host should not need to be | ||||
| interrupted as often to service Neighbor Solicitation requests. | ||||
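The Solicited-node Multicast Address derivation described above (the last 24 bits of the unicast or anycast address appended to a fixed prefix) can be sketched with the standard library:

```python
# Derive an IPv6 solicited-node multicast address: the low-order
# 24 bits of the unicast/anycast address are combined with the fixed
# prefix ff02::1:ff00:0/104. Minimal sketch using only the stdlib.

import ipaddress

def solicited_node(addr: str) -> ipaddress.IPv6Address:
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

print(solicited_node("2001:db8::4321:8765"))  # ff02::1:ff21:8765
```

Because only 24 bits of the unicast address feed the mapping, more than one configured address could in principle map to the same group, which is why the subscription count per group is "typically one" rather than exactly one.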
| 7.2. Direct Mapping for Multicast address resolution | For example, the reluctance of data center operators to enable PIM | |||
| and IGMP within the data center fabric means VXLAN is often used with | ||||
| HER. Thus, BUM traffic from each VNI is replicated and sent using | ||||
| unicast to remote VTEPs with VMs in that VNI. The list of remote | ||||
| VTEPs to which the traffic should be sent may be configured manually | ||||
| on the VTEP. Alternatively, the VTEPs may transmit appropriate state | ||||
| to a centralized controller which in turn sends each VTEP the list of | ||||
| remote VTEPs for each VNI. Lastly, HER also works well when a | ||||
| distributed control plane is used instead of the centralized | ||||
| controller. Again, BGP-EVPN may be used to distribute the | ||||
| | information needed to facilitate HER to the VTEPs. | |||
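The HER behaviour at a VTEP can be sketched as follows; the function and data structures are hypothetical stand-ins, since the flood list per VNI may come from static configuration, a controller, or BGP-EVPN.

```python
# Sketch of head end replication (HER) at a VTEP: one BUM frame is
# encapsulated once per remote VTEP in the VNI and sent as ordinary
# unicast. The encapsulation is represented by a dict as a stand-in
# for a real VXLAN header; everything here is illustrative.

def her_replicate(vni, inner_frame, flood_lists, local_vtep_ip):
    """Return the list of (destination VTEP IP, encapsulated packet)
    unicast transmissions produced for one BUM frame."""
    out = []
    for vtep_ip in flood_lists.get(vni, []):
        if vtep_ip == local_vtep_ip:
            continue  # never reflect traffic back to the sending VTEP
        out.append((vtep_ip, {"vni": vni, "payload": inner_frame}))
    return out

flood = {5000: ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
pkts = her_replicate(5000, b"arp-request", flood, local_vtep_ip="192.0.2.1")
assert [dst for dst, _ in pkts] == ["192.0.2.2", "192.0.2.3"]
```

The duplication cost is visible directly: one inbound frame becomes N-1 unicast copies, which is the processing burden on the head end noted above.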
| With IPv4 unicast address resolution, the translation of an IP | 4.3. BIER | |||
| address to a MAC address is done dynamically by ARP. With multicast | ||||
| address resolution, the mapping from a multicast IP address to a | ||||
| multicast MAC address is derived from direct mapping. In IPv4, the | ||||
| mapping is done by assigning the low-order 23 bits of the multicast | ||||
| IP address to fill the low-order 23 bits of the multicast MAC | ||||
| address. When a host joins an IP multicast group, it instructs the | ||||
| data link layer to receive frames that match the MAC address that | ||||
| corresponds to the IP address of the multicast group. The data link | ||||
| layer filters the frames and passes frames with matching destination | ||||
| addresses to the IP module. Since the mapping from multicast IP | ||||
| address to a MAC address ignores 5 bits of the IP address, groups of | ||||
| 32 multicast IP addresses are mapped to the same MAC address. As a | ||||
| result a multicast MAC address cannot be uniquely mapped to a | ||||
| multicast IPv4 address. Planning is required within an organization | ||||
| to select IPv4 groups that are far enough away from each other as to | ||||
| not end up with the same L2 address used. Any multicast address in | ||||
| the [224-239].0.0.x and [224-239].128.0.x ranges should not be | ||||
| considered. When sending IPv6 multicast packets on an Ethernet link, | ||||
| the corresponding destination MAC address is a direct mapping of the | ||||
| last 32 bits of the 128 bit IPv6 multicast address into the 48 bit | ||||
| MAC address. It is possible for more than one IPv6 Multicast address | ||||
| to map to the same 48 bit MAC address. | ||||
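The direct IP-to-MAC mappings described above, including the IPv4 32-to-1 ambiguity, can be sketched as:

```python
# Direct mapping of multicast IP addresses to MAC addresses: IPv4
# copies the low-order 23 bits into 01:00:5e:00:00:00, IPv6 copies the
# low-order 32 bits into 33:33:00:00:00:00. Because 5 bits of the IPv4
# address are ignored, 32 IPv4 groups collide on one MAC address.

import ipaddress

def _fmt(mac: int) -> str:
    return ":".join(f"{(mac >> s) & 0xFF:02x}" for s in range(40, -8, -8))

def ipv4_mcast_mac(addr: str) -> str:
    low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
    return _fmt(0x01005E000000 | low23)

def ipv6_mcast_mac(addr: str) -> str:
    low32 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFFFF
    return _fmt(0x333300000000 | low32)

print(ipv4_mcast_mac("224.1.1.1"))          # 01:00:5e:01:01:01
print(ipv4_mcast_mac("239.129.1.1"))        # 01:00:5e:01:01:01 (collision)
print(ipv6_mcast_mac("ff02::1:ff21:8765"))  # 33:33:ff:21:87:65
```

The collision shown for 224.1.1.1 and 239.129.1.1 is exactly the planning concern mentioned above: groups whose addresses differ only in the ignored bits share a layer 2 address.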
| 8. IANA Considerations | As discussed in Section 3.4, PIM and IGMP face potential scalability | |||
| challenges when deployed in data centers. These challenges are | ||||
| typically due to the requirement to build and maintain a distribution | ||||
| tree and the requirement to hold per-flow state in routers. Bit | ||||
| Index Explicit Replication (BIER) [RFC 8279] is a new multicast | ||||
| forwarding paradigm that avoids these two requirements. | ||||
| When a multicast packet enters a BIER domain, the ingress router, | ||||
| known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header | ||||
| to the packet. This header contains a bit string in which each bit | ||||
| maps to an egress router, known as Bit-Forwarding Egress Router | ||||
| (BFER). If a bit is set, then the packet should be forwarded to the | ||||
| associated BFER. The routers within the BIER domain, Bit-Forwarding | ||||
| Routers (BFRs), use the BIER header in the packet and information in | ||||
| | the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise | |||
| operations to determine how the packet should be replicated optimally | ||||
| so it reaches all the appropriate BFERs. | ||||
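The bit-wise replication performed by a BFR can be sketched as follows. This is a heavily simplified illustration of the idea, with a plain dict standing in for the BIFT; it is not the full RFC 8279 procedure.

```python
# Simplified sketch of BIER bit-wise forwarding. Each set bit in the
# packet's bit string identifies a BFER. For each neighbor, the BFR
# sends a copy carrying only the bits reachable via that neighbor and
# clears those bits, so no BFER receives a duplicate.

def bier_forward(bitstring, bift):
    """bitstring: int whose set bits are the target BFERs.
    bift: neighbor -> bitmask of BFERs reachable via that neighbor.
    Returns a list of (neighbor, bit string placed in that copy)."""
    copies = []
    remaining = bitstring
    for neighbor, mask in bift.items():
        bits_via_neighbor = remaining & mask
        if bits_via_neighbor:
            copies.append((neighbor, bits_via_neighbor))
            remaining &= ~bits_via_neighbor  # avoid duplicate delivery
    return copies

# BFERs 1 and 2 are reachable via neighbor A, BFER 3 via neighbor B.
bift = {"A": 0b011, "B": 0b100}
assert bier_forward(0b101, bift) == [("A", 0b001), ("B", 0b100)]
```

Note that the per-packet bit string replaces per-flow state in the BFRs, which is the scalability gain over PIM highlighted above.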
| BIER is deemed to be attractive for facilitating one-to-many | ||||
| | communications in data centers [I-D.ietf-bier-use-cases]. The | |||
| | deployment envisioned with overlay networks is that the | |||
| | encapsulation endpoints would be the BFIRs. Thus, knowledge about the | |||
| actual multicast groups does not reside in the data center fabric, | ||||
| improving the scalability compared to conventional IP multicast. | ||||
| Additionally, a centralized controller or a BGP-EVPN control plane | ||||
| may be used with BIER to ensure the BFIR have the required | ||||
| information. A challenge associated with using BIER is that, unlike | ||||
| most of the other approaches discussed in this draft, it requires | ||||
| changes to the forwarding behaviour of the routers used in the data | ||||
| center IP fabric. | ||||
| 4.4. Segment Routing | ||||
| | Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the | |||
| source routing paradigm in which the manner in which a packet | ||||
| traverses a network is determined by an ordered list of instructions. | ||||
| | These instructions, known as segments, may have a local semantic to | |||
| | an SR node or a global semantic within an SR domain. SR allows enforcing a flow | |||
| through any topological path while maintaining per-flow state only at | ||||
| the ingress node to the SR domain. Segment Routing can be applied to | ||||
| the MPLS and IPv6 data-planes. In the former, the list of segments | ||||
| is represented by the label stack and in the latter it is represented | ||||
| | as a routing extension header. Use-cases are described in | |||
| | [I-D.ietf-spring-segment-routing] and are being considered in the context of | |||
| BGP-based large-scale data-center (DC) design [RFC7938]. | ||||
| Multicast in SR continues to be discussed in a variety of drafts and | ||||
| working groups. The SPRING WG has not yet been chartered to work on | ||||
| | Multicast in SR. Multicast support can include locally allocating a Segment | |||
| Identifier (SID) to existing replication solutions, such as PIM, | ||||
| mLDP, P2MP RSVP-TE and BIER. It may also be that a new way to signal | ||||
| and install trees in SR is developed without creating state in the | ||||
| network. | ||||
| 5. Conclusions | ||||
| As the volume and importance of one-to-many traffic in data centers | ||||
| increases, conventional IP multicast is likely to become increasingly | ||||
| unattractive for deployment in data centers for a number of reasons, | ||||
| | mostly pertaining to its relatively poor inherent scalability and | |||
| inability to exploit characteristics of data center network | ||||
| architectures. Hence, even though IGMP/MLD is likely to remain the | ||||
| most popular manner in which end hosts signal interest in joining a | ||||
| multicast group, it is unlikely that this multicast traffic will be | ||||
| transported over the data center IP fabric using a multicast | ||||
| distribution tree built by PIM. Rather, approaches which exploit | ||||
| characteristics of data center network architectures (e.g. fixed and | ||||
| regular topology, owned and exclusively controlled by single | ||||
| organization, well-known overlay encapsulation endpoints etc.) are | ||||
| better placed to deliver one-to-many traffic in data centers, | ||||
| especially when judiciously combined with a centralized controller | ||||
| and/or a distributed control plane (particularly one based on BGP- | ||||
| EVPN). | ||||
| 6. IANA Considerations | ||||
| This memo includes no request to IANA. | This memo includes no request to IANA. | |||
| 9. Security Considerations | 7. Security Considerations | |||
| No new security considerations result from this document. | No new security considerations result from this document. | |||
| 10. Acknowledgements | 8. Acknowledgements | |||
| The authors would like to thank the many individuals who contributed | ||||
| opinions on the ARMD wg mailing list about this topic: Linda Dunbar, | ||||
| Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor | ||||
| Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli and Thomas | ||||
| Narten. | ||||
| 11. References | 9. References | |||
| 11.1. Normative References | 9.1. Normative References | |||
| [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
| Requirement Levels", BCP 14, RFC 2119, | Requirement Levels", BCP 14, RFC 2119, | |||
| DOI 10.17487/RFC2119, March 1997, | DOI 10.17487/RFC2119, March 1997, | |||
| <https://www.rfc-editor.org/info/rfc2119>. | <https://www.rfc-editor.org/info/rfc2119>. | |||
| 11.2. Informative References | 9.2. Informative References | |||
| [I-D.ietf-bier-use-cases] | ||||
| Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., | ||||
| Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. | ||||
| Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 | ||||
| (work in progress), January 2018. | ||||
| [I-D.ietf-nvo3-geneve] | ||||
| Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic | ||||
| Network Virtualization Encapsulation", draft-ietf- | ||||
| nvo3-geneve-06 (work in progress), March 2018. | ||||
| [I-D.ietf-nvo3-vxlan-gpe] | ||||
| Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol | ||||
| Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work | ||||
| in progress), April 2018. | ||||
| [I-D.ietf-spring-segment-routing] | ||||
| Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., | ||||
| Litkowski, S., and R. Shakir, "Segment Routing | ||||
| Architecture", draft-ietf-spring-segment-routing-15 (work | ||||
| in progress), January 2018. | ||||
| [RFC2236] Fenner, W., "Internet Group Management Protocol, Version | ||||
| 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, | ||||
| <https://www.rfc-editor.org/info/rfc2236>. | ||||
| [RFC2710] Deering, S., Fenner, W., and B. Haberman, "Multicast | ||||
| Listener Discovery (MLD) for IPv6", RFC 2710, | ||||
| DOI 10.17487/RFC2710, October 1999, | ||||
| <https://www.rfc-editor.org/info/rfc2710>. | ||||
| [RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. | ||||
| Thyagarajan, "Internet Group Management Protocol, Version | ||||
| 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, | ||||
| <https://www.rfc-editor.org/info/rfc3376>. | ||||
| [RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, | ||||
| "Protocol Independent Multicast - Sparse Mode (PIM-SM): | ||||
| Protocol Specification (Revised)", RFC 4601, | ||||
| DOI 10.17487/RFC4601, August 2006, | ||||
| <https://www.rfc-editor.org/info/rfc4601>. | ||||
| [RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for | ||||
| IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, | ||||
| <https://www.rfc-editor.org/info/rfc4607>. | ||||
| [RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, | ||||
| "Bidirectional Protocol Independent Multicast (BIDIR- | ||||
| PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, | ||||
| <https://www.rfc-editor.org/info/rfc5015>. | ||||
| [RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution | [RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution | |||
| Problems in Large Data Center Networks", RFC 6820, | Problems in Large Data Center Networks", RFC 6820, | |||
| DOI 10.17487/RFC6820, January 2013, | DOI 10.17487/RFC6820, January 2013, | |||
| <https://www.rfc-editor.org/info/rfc6820>. | <https://www.rfc-editor.org/info/rfc6820>. | |||
| Author's Address | [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, | |||
| L., Sridhar, T., Bursell, M., and C. Wright, "Virtual | ||||
| eXtensible Local Area Network (VXLAN): A Framework for | ||||
| Overlaying Virtualized Layer 2 Networks over Layer 3 | ||||
| Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, | ||||
| <https://www.rfc-editor.org/info/rfc7348>. | ||||
| [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., | ||||
| Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based | ||||
| Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February | ||||
| 2015, <https://www.rfc-editor.org/info/rfc7432>. | ||||
| [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network | ||||
| Virtualization Using Generic Routing Encapsulation", | ||||
| RFC 7637, DOI 10.17487/RFC7637, September 2015, | ||||
| <https://www.rfc-editor.org/info/rfc7637>. | ||||
| [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of | ||||
| BGP for Routing in Large-Scale Data Centers", RFC 7938, | ||||
| DOI 10.17487/RFC7938, August 2016, | ||||
| <https://www.rfc-editor.org/info/rfc7938>. | ||||
| [RFC8014] Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. | ||||
| Narten, "An Architecture for Data-Center Network | ||||
| Virtualization over Layer 3 (NVO3)", RFC 8014, | ||||
| DOI 10.17487/RFC8014, December 2016, | ||||
| <https://www.rfc-editor.org/info/rfc8014>. | ||||
| [RFC8279] Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., | ||||
| Przygienda, T., and S. Aldrin, "Multicast Using Bit Index | ||||
| Explicit Replication (BIER)", RFC 8279, | ||||
| DOI 10.17487/RFC8279, November 2017, | ||||
| <https://www.rfc-editor.org/info/rfc8279>. | ||||
| [RFC8365] Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., | ||||
| Uttaro, J., and W. Henderickx, "A Network Virtualization | ||||
| Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, | ||||
| DOI 10.17487/RFC8365, March 2018, | ||||
| <https://www.rfc-editor.org/info/rfc8365>. | ||||
| Authors' Addresses | ||||
| Mike McBride | Mike McBride | |||
| Huawei | Huawei | |||
| Email: michael.mcbride@huawei.com | Email: michael.mcbride@huawei.com | |||
| Olufemi Komolafe | ||||
| Arista Networks | ||||
| Email: femi@arista.com | ||||