MBONED                                                       M. McBride
Internet-Draft                                                   Huawei
Intended status: Informational                              O. Komolafe
Expires: December 31, 2018                              Arista Networks
                                                           June 29, 2018

                 Multicast in the Data Center Overview
                    draft-ietf-mboned-dc-deploy-03

Abstract

The volume and importance of one-to-many traffic patterns in data centers are likely to increase significantly in the future.  Reasons for this increase are discussed, and attention is then paid to the manner in which this traffic pattern may be judiciously handled in data centers.  The intuitive solution of deploying conventional IP multicast within data centers is explored and evaluated.  Thereafter, a number of emerging innovative approaches are described before a number of recommendations are made.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 31, 2018.

Copyright Notice

Copyright (c) 2018 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Requirements Language
   2. Reasons for increasing one-to-many traffic patterns
      2.1. Applications
      2.2. Overlays
      2.3. Protocols
   3. Handling one-to-many traffic using conventional multicast
      3.1. Layer 3 multicast
      3.2. Layer 2 multicast
      3.3. Example use cases
      3.4. Advantages and disadvantages
   4. Alternative options for handling one-to-many traffic
      4.1. Minimizing traffic volumes
      4.2. Head end replication
      4.3. BIER
      4.4. Segment Routing
   5. Conclusions
   6. IANA Considerations
   7. Security Considerations
   8. Acknowledgements
   9. References
      9.1. Normative References
      9.2. Informative References
   Authors' Addresses

1. Introduction

The volume and importance of one-to-many traffic patterns in data centers are likely to increase significantly in the future.  Reasons for this increase include the nature of the traffic generated by applications hosted in the data center, the need to handle broadcast, unknown unicast and multicast (BUM) traffic within the overlay technologies used to support multi-tenancy at scale, and the use of certain protocols that traditionally require one-to-many control message exchanges.  These trends, allied with the expectation that future highly virtualized data centers must support communication between potentially thousands of participants, may lead to the natural assumption that IP multicast will be widely used in data centers, especially given the bandwidth savings it potentially offers.  However, such an assumption would be wrong.  In fact, there is widespread reluctance to enable IP multicast in data centers for a number of reasons, mostly pertaining to concerns about its scalability and reliability.

This draft discusses some of the main drivers for the increasing volume and importance of one-to-many traffic patterns in data centers.  Thereafter, the manner in which conventional IP multicast may be used to handle this traffic pattern is discussed and some of the associated challenges are highlighted.  Following this discussion, a number of alternative emerging approaches are introduced, before concluding by discussing key trends and making a number of recommendations.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Reasons for increasing one-to-many traffic patterns

2.1. Applications

Key trends suggest that the nature of the applications likely to dominate future highly-virtualized multi-tenant data centers will produce large volumes of one-to-many traffic.  For example, it is well-known that traffic flows in data centers have evolved from being predominantly North-South (e.g. client-server) to predominantly East-West (e.g. distributed computation).  This change has led to the consensus that topologies such as Leaf/Spine, which are easier to scale in the East-West direction, are better suited to the data center of the future.  This increase in East-West traffic flows results from VMs often having to exchange numerous messages between themselves as part of executing a specific workload.  For example, a computational workload could require data, or an executable, to be disseminated to workers distributed throughout the data center, which may subsequently be polled for status updates.  The emergence of such applications means there is likely to be an increase in one-to-many traffic flows with the increasing dominance of East-West traffic.

The TV broadcast industry is another potential future source of applications with one-to-many traffic patterns in data centers.  The requirement for robustness, stability and predictability has meant the TV broadcast industry has traditionally used TV-specific protocols, infrastructure and technologies for transmitting video signals between cameras, studios, mixers, encoders, servers etc.  However, the growing cost and complexity of supporting this approach, especially as the bit rates of the video signals increase due to demand for formats such as 4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast industry will transition from industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based infrastructure.  The development of pertinent standards by the SMPTE, along with the increasing performance of IP routers, means this transition is gathering pace.  A possible outcome of this transition will be the building of IP data centers in broadcast plants.  Traffic flows in the broadcast industry are frequently one-to-many and so, if IP data centers are deployed in broadcast plants, it is imperative that this traffic pattern is supported efficiently in that infrastructure.  In fact, a pivotal consideration for broadcasters considering transitioning to IP is the manner in which these one-to-many traffic flows will be managed and monitored in a data center with an IP fabric.

Arguably one of the few success stories in using conventional IP multicast has been the dissemination of market trading data.  For example, IP multicast is commonly used today to deliver stock quotes from the stock exchange to financial services providers and then on to the stock analysts or brokerages.  The network must be designed with no single point of failure and in such a way that the network can respond in a deterministic manner to any failure.  Typically, redundant servers (in a primary/backup or live-live mode) send multicast streams into the network, with diverse paths being used across the network.  Another critical requirement is reliability and traceability; regulatory and legal requirements mean that the producer of the market data must know exactly where the flow was sent and be able to prove conclusively that the data was received within agreed SLAs.  The stock exchange generating the one-to-many traffic and the stock analysts/brokerages that receive the traffic will typically have their own data centers.  Therefore, the manner in which one-to-many traffic patterns are handled in these data centers is extremely important, especially given the requirements and constraints mentioned.

Many data center cloud providers provide publish and subscribe applications.  There can be numerous publishers and subscribers and many message channels within a data center.  With unicast delivery, the publish and subscribe servers send a separate copy of each message to each subscriber of a publication.  With multicast publish/subscribe, only one message is sent, regardless of the number of subscribers.  In a publish/subscribe system, client applications, some of which are publishers and some of which are subscribers, are connected to a network of message brokers that receive publications on a number of topics, and send the publications on to the subscribers for those topics.  The more subscribers there are in the publish/subscribe system, the greater the improvement to network utilization there might be with multicast.
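
The fan-out effect can be quantified with the following sketch (illustrative Python, not drawn from any particular publish/subscribe product; the function name and workload figures are assumptions chosen purely for the example): with unicast delivery the broker injects one copy per subscriber, whereas with multicast it injects one copy per publication.

   # Illustrative comparison of broker fan-out with unicast versus
   # multicast delivery in a publish/subscribe system.  The workload
   # figures are hypothetical.

   def copies_injected(publications: int, subscribers: int,
                       multicast: bool) -> int:
       """Copies of the published messages the broker must inject."""
       # Unicast: one copy per subscriber per publication.
       # Multicast: one copy per publication, independent of fan-out.
       return publications * (1 if multicast else subscribers)

   if __name__ == "__main__":
       pubs, subs = 1000, 500    # hypothetical workload
       print("unicast  :", copies_injected(pubs, subs, multicast=False))
       print("multicast:", copies_injected(pubs, subs, multicast=True))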

2.2. Overlays

The proposed architecture for supporting large-scale multi-tenancy in highly virtualized data centers [RFC8014] consists of a tenant's VMs distributed across the data center connected by a virtual network known as the overlay network.  A number of different technologies have been proposed for realizing the overlay network, including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably partisan debate about the relative merits of these overlay technologies belies the fact that, conceptually, these overlays simply provide a means to encapsulate and tunnel Ethernet frames from the VMs over the data center IP fabric, thus emulating a layer 2 segment between the VMs.  Consequently, the VMs believe and behave as if they are connected to the tenant's other VMs by a conventional layer 2 segment, regardless of their physical location within the data center.  Naturally, in a layer 2 segment, point-to-multipoint traffic can result from handling BUM (broadcast, unknown unicast and multicast) traffic.  Compounding this issue within data centers, since the tenant's VMs attached to the emulated segment may be dispersed throughout the data center, the BUM traffic may need to traverse the data center fabric.  Hence, regardless of the overlay technology used, due consideration must be given to handling BUM traffic, forcing the data center operator to consider the manner in which one-to-many communication is handled within the IP fabric.

2.3. Protocols

Conventionally, some key networking protocols used in data centers require one-to-many communication.  For example, ARP and ND use broadcast and multicast messages within IPv4 and IPv6 networks respectively to discover the mappings between IP addresses and MAC addresses.  Furthermore, when these protocols are running within an overlay network, it is essential to ensure the messages are delivered to all the hosts on the emulated layer 2 segment, regardless of physical location within the data center.  The challenges associated with optimally delivering ARP and ND messages in data centers have attracted much attention [RFC6820].  Popular approaches in use mostly seek to exploit characteristics of data center networks to avoid having to broadcast/multicast these messages, as discussed in Section 4.1.

3. Handling one-to-many traffic using conventional multicast

3.1. Layer 3 multicast

PIM is the most widely deployed multicast routing protocol and so, unsurprisingly, is the primary multicast routing protocol considered for use in the data center.  There are three popular flavours of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015].  It may be said that these different modes of PIM trade off the optimality of the multicast forwarding tree for the amount of multicast forwarding state that must be maintained at routers.  SSM provides the most efficient forwarding between sources and receivers and thus is most suitable for applications with one-to-many traffic patterns.  State is built and maintained for each (S,G) flow.  Thus, the amount of multicast forwarding state held by routers in the data center is proportional to the number of sources and groups.  At the other end of the spectrum, BIDIR is the most efficient shared tree solution as one tree is built for all (S,G)s, therefore minimizing the amount of state.  This state reduction is at the expense of an optimal forwarding path between sources and receivers.  This use of a shared tree makes BIDIR particularly well-suited for applications with many-to-many traffic patterns, given that the amount of state is uncorrelated with the number of sources.  SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is still the most widely deployed multicast routing protocol and can also be the most complex.  PIM-SM relies upon an RP (Rendezvous Point) to set up the multicast tree and subsequently there is the option of switching to the SPT (shortest path tree), similar to SSM, or staying on the shared tree, similar to BIDIR.
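
The scaling difference can be made concrete with a deliberately simplified sketch (illustrative Python; the workload figures are assumptions, and PIM-SM is omitted because its state depends on whether routers switch to the shortest path tree):

   # Rough sketch of how multicast forwarding state scales with the
   # number of sources and groups for the PIM variants discussed
   # above.  The figures are hypothetical and the model is simplified.

   def ssm_state(sources_per_group: int, groups: int) -> int:
       # PIM-SSM: one (S,G) entry per source per group.
       return sources_per_group * groups

   def bidir_state(groups: int) -> int:
       # PIM-BIDIR: one (*,G) entry per group, shared by all sources.
       return groups

   if __name__ == "__main__":
       sources_per_group, groups = 50, 200    # hypothetical workload
       print("PIM-SSM entries  :", ssm_state(sources_per_group, groups))
       print("PIM-BIDIR entries:", bidir_state(groups))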

3.2. Layer 2 multicast

With IPv4 unicast address resolution, the translation of an IP address to a MAC address is done dynamically by ARP.  With multicast address resolution, the mapping from a multicast IPv4 address to a multicast MAC address is done by assigning the low-order 23 bits of the multicast IPv4 address to fill the low-order 23 bits of the multicast MAC address.  Each IPv4 multicast address has 28 unique bits (the multicast address range is 224.0.0.0/4), so mapping a multicast IP address to a MAC address ignores 5 bits of the IP address.  Hence, groups of 32 multicast IP addresses are mapped to the same MAC address, meaning a multicast MAC address cannot be uniquely mapped back to a multicast IPv4 address.  Therefore, planning is required within an organization to choose IPv4 multicast addresses judiciously in order to avoid address aliasing.  When sending IPv6 multicast packets on an Ethernet link, the corresponding destination MAC address is a direct mapping of the last 32 bits of the 128 bit IPv6 multicast address into the 48 bit MAC address.  It is possible for more than one IPv6 multicast address to map to the same 48 bit MAC address.
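
The two mappings, and the resulting aliasing, can be illustrated with the following sketch (illustrative Python; the example group addresses are chosen purely to demonstrate aliasing):

   # Sketch of the IP-to-MAC mappings described above: the low-order
   # 23 bits of an IPv4 group address are copied into the 01:00:5e
   # MAC prefix, and the low-order 32 bits of an IPv6 group address
   # are copied into the 33:33 prefix.
   import ipaddress

   def ipv4_multicast_mac(group: str) -> str:
       low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
       return "01:00:5e:%02x:%02x:%02x" % (
           (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)

   def ipv6_multicast_mac(group: str) -> str:
       low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
       return "33:33:%02x:%02x:%02x:%02x" % (
           (low32 >> 24) & 0xFF, (low32 >> 16) & 0xFF,
           (low32 >> 8) & 0xFF, low32 & 0xFF)

   if __name__ == "__main__":
       # 233.252.0.1 and 234.124.0.1 differ only in the 5 ignored
       # bits, so both yield 01:00:5e:7c:00:01 (address aliasing).
       print(ipv4_multicast_mac("233.252.0.1"))
       print(ipv4_multicast_mac("234.124.0.1"))
       print(ipv6_multicast_mac("ff02::1:ff00:1"))   # 33:33:ff:00:00:01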

The default behaviour of many hosts (and, in fact, routers) is to block multicast traffic.  Consequently, when a host wishes to join an IPv4 multicast group, it sends an IGMP [RFC2236] [RFC3376] report to the router attached to the layer 2 segment and also instructs its data link layer to receive Ethernet frames that match the corresponding MAC address.  The data link layer filters the frames, passing those with matching destination addresses to the IP module.  Similarly, when transmitting, a host simply hands the multicast packet to the data link layer, which adds the layer 2 encapsulation using the MAC address derived in the manner previously discussed.

When this Ethernet frame with a multicast MAC address is received by a switch configured to forward multicast traffic, the default behaviour is to flood it to all the ports in the layer 2 segment.  Clearly there may not be a receiver for this multicast group present on each port, and IGMP snooping is used to avoid sending the frame out of ports without receivers.

IGMP snooping, with proxy reporting or report suppression, actively filters IGMP packets in order to reduce load on the multicast router by ensuring only the minimal quantity of information is sent.  The switch aims to ensure the router has only a single entry for the group, regardless of the number of active listeners.  If there are two active listeners in a group and the first one leaves, then the switch determines that the router does not need this information since it does not affect the status of the group from the router's point of view.  However, the next time there is a routine query from the router, the switch will forward the reply from the remaining host, to prevent the router from believing there are no active listeners.  It follows that in active IGMP snooping, the router will generally only know about the most recently joined member of the group.

In order for IGMP and thus IGMP snooping to function, a multicast router must exist on the network and generate IGMP queries.  The tables (holding the member ports for each multicast group) created for snooping are associated with the querier.  Without a querier, the tables are not created and snooping will not work.  Furthermore, IGMP general queries must be unconditionally forwarded by all switches involved in IGMP snooping.  Some IGMP snooping implementations include full querier capability.  Others are able to proxy and retransmit queries from the multicast router.
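
A much simplified sketch of the state a snooping switch builds is shown below (illustrative Python, not modelled on any particular switch implementation; timers, proxy reporting and IGMP version differences are deliberately ignored, and the port numbers and group address are assumptions):

   # Simplified sketch of the forwarding table an IGMP snooping switch
   # might build: it learns member ports from IGMP reports, removes
   # them on leaves, and forwards multicast frames only to member
   # ports plus the port facing the querier/multicast router.
   from collections import defaultdict

   class SnoopingSwitch:
       def __init__(self, router_port: int):
           self.router_port = router_port
           self.members = defaultdict(set)   # group address -> member ports

       def igmp_report(self, group: str, port: int) -> None:
           self.members[group].add(port)

       def igmp_leave(self, group: str, port: int) -> None:
           self.members[group].discard(port)

       def egress_ports(self, group: str, ingress_port: int) -> set:
           # Member ports plus the querier/router port, never the ingress.
           ports = set(self.members.get(group, set())) | {self.router_port}
           return ports - {ingress_port}

   if __name__ == "__main__":
       sw = SnoopingSwitch(router_port=48)
       sw.igmp_report("233.252.0.1", port=3)
       sw.igmp_report("233.252.0.1", port=7)
       print(sw.egress_ports("233.252.0.1", ingress_port=7))   # {3, 48}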

Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by IPv6 routers for discovering multicast listeners on a directly attached link, performing a similar function to IGMP in IPv4 networks.  MLDv1 [RFC2710] is similar to IGMPv2, and MLDv2 [RFC3810] [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD does not send its own distinct protocol messages.  Rather, MLD is a subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of ICMPv6 messages.  MLD snooping works similarly to IGMP snooping, described earlier.

3.3. Example use cases

A use case where PIM and IGMP are currently used in data centers is to support multicast in VXLAN deployments.  In the original VXLAN specification [RFC7348], a data-driven flood and learn control plane was proposed, requiring the data center IP fabric to support multicast routing.  A multicast group is associated with each virtual network, each uniquely identified by its VXLAN network identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the hypervisor or ToR switch, with local VMs that belong to this VNI would join the multicast group and use it for the exchange of BUM traffic with the other VTEPs.  Essentially, the VTEP would encapsulate any BUM traffic from attached VMs in an IP multicast packet, whose destination address is the associated multicast group address, and transmit the packet to the data center fabric.  Thus, PIM must be running in the fabric to maintain a multicast distribution tree per VNI.

Alternatively, rather than setting up a multicast distribution tree per VNI, a tree can be set up whenever hosts within the VNI wish to exchange multicast traffic.  For example, whenever a VTEP receives an IGMP report from a locally connected host, it would translate this into a PIM join message, which is propagated into the IP fabric.  In order to ensure this join message is sent to the IP fabric rather than over the VXLAN interface (since the VTEP will have a route back to the source of the multicast packet over the VXLAN interface and so would naturally attempt to send the join over this interface), a more specific route back to the source over the IP fabric must be configured.  In this approach, PIM must be configured on the SVIs associated with the VXLAN interface.
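
The flood-and-learn behaviour described in the first of these approaches can be sketched as follows (illustrative Python; the VNI-to-group mapping, VNI values and VTEP address are assumptions rather than values from any particular deployment):

   # Sketch of flood-and-learn VXLAN: each VNI is associated with a
   # multicast group, and a VTEP encapsulates BUM frames from local
   # VMs towards that group.  All values are illustrative.

   VNI_TO_GROUP = {
       10001: "233.252.0.10",
       10002: "233.252.0.11",
   }

   def groups_to_join(local_vnis):
       """Multicast groups a VTEP joins for the VNIs of its local VMs."""
       return {VNI_TO_GROUP[vni] for vni in local_vnis}

   def encapsulate_bum(vni: int, inner_frame: bytes, vtep_ip: str) -> dict:
       """Conceptual outer headers for a BUM frame (not a wire format)."""
       return {
           "outer_src_ip": vtep_ip,
           "outer_dst_ip": VNI_TO_GROUP[vni],   # the per-VNI group
           "vni": vni,
           "payload": inner_frame,
       }

   if __name__ == "__main__":
       print(groups_to_join([10001, 10002]))
       print(encapsulate_bum(10001, b"broadcast-frame", "192.0.2.1"))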

Another use case of PIM and IGMP in data centers is when IPTV servers use multicast to deliver content from the data center to end users.  IPTV is typically a one-to-many application where the hosts are configured for IGMPv3, the switches are configured with IGMP snooping, and the routers are running PIM-SSM mode.  Often redundant servers send multicast streams into the network and the network forwards the data across diverse paths.

Windows Media servers send multicast streams to clients.  Windows Media Services streams to an IP multicast address and all clients subscribe to the IP address to receive the same stream.  This allows a single stream to be played simultaneously by multiple clients, thus reducing bandwidth utilization.

3.4. Advantages and disadvantages

Arguably the biggest advantage of using PIM and IGMP to support one-to-many communication in data centers is that these protocols are relatively mature.  Consequently, PIM is available in most routers and IGMP is supported by most hosts and routers.  As such, no specialized hardware or relatively immature software is involved in using them in data centers.  Furthermore, the maturity of these protocols means their behaviour and performance in operational networks is well-understood, with widely available best practices and deployment guides for optimizing their performance.

However, somewhat ironically, the relative disadvantages of PIM and IGMP usage in data centers also stem mostly from their maturity.  Specifically, these protocols were standardized and implemented long before the highly-virtualized multi-tenant data centers of today existed.  Consequently, PIM and IGMP are neither optimally placed to deal with the requirements of one-to-many communication in modern data centers nor to exploit the characteristics and idiosyncrasies of data centers.  For example, there may be thousands of VMs participating in a multicast session, with some of these VMs migrating between servers within the data center and new VMs being continually spun up and wishing to join the session, while other VMs are leaving all the time.  In such a scenario, the churn in the PIM and IGMP state machines, the volume of control messages they would generate and the amount of state they would necessitate within routers, especially if they were deployed naively, would be untenable.

4. Alternative options for handling one-to-many traffic

Section 2 has shown that there is likely to be an increasing amount of one-to-many communication in data centers, and Section 3 has discussed how conventional multicast may be used to handle this traffic.  Having said that, there are a number of alternative options for handling this traffic pattern in data centers, as discussed in the following subsections.  It should be noted that many of these techniques are not mutually exclusive; in fact many deployments involve a combination of more than one of these techniques.  Furthermore, as will be shown, introducing a centralized controller or a distributed control plane makes these techniques more potent.

4.1. Minimizing traffic volumes

If handling one-to-many traffic in data centers is challenging, then arguably the most intuitive solution is to aim to minimize the volume of such traffic.

It was previously mentioned in Section 2 that the three main causes of one-to-many traffic in data centers are applications, overlays and protocols.  While, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols, and often by protocols running within overlays.  This reduction is possible by exploiting certain characteristics of data center networks: a fixed and regular topology, ownership and exclusive control by a single organization, well-known overlay encapsulation endpoints, etc.

A way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller.  For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of this new MAC address, the associated virtual network, the IP address, etc.  The controller could subsequently distribute this information to every encapsulation endpoint.  Consequently, when any endpoint receives an ARP request from a locally attached VM, it could simply consult its local copy of the information distributed by the controller and reply.  Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric.
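
This ARP suppression can be sketched as follows (illustrative Python; the class and method names, VNI value and addresses are assumptions used only to show the idea):

   # Sketch of controller-assisted ARP suppression: the controller
   # pushes IP-to-MAC bindings per virtual network to every
   # encapsulation endpoint, which can then answer local ARP requests
   # itself instead of flooding them across the fabric.

   class EncapsulationEndpoint:
       def __init__(self):
           # (virtual network identifier, IP address) -> MAC address,
           # as distributed by the centralized controller.
           self.bindings = {}

       def controller_update(self, vni: int, ip: str, mac: str) -> None:
           self.bindings[(vni, ip)] = mac

       def handle_arp_request(self, vni: int, target_ip: str):
           """Return the MAC address to answer with locally, or None
           to fall back to flooding the request over the overlay."""
           return self.bindings.get((vni, target_ip))

   if __name__ == "__main__":
       ep = EncapsulationEndpoint()
       ep.controller_update(10001, "198.51.100.10", "52:54:00:12:34:56")
       print(ep.handle_arp_request(10001, "198.51.100.10"))  # answered locally
       print(ep.handle_arp_request(10001, "198.51.100.99"))  # None -> flood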

Alternatively, the functionality supported by the controller can be realized by a distributed control plane.  BGP-EVPN [RFC7432] [RFC8365] is the most popular control plane used in data centers.  Typically, the encapsulation endpoints will exchange pertinent information with each other by all peering with a BGP route reflector (RR).  Thus, information about local MAC addresses, MAC to IP address mappings, virtual network identifiers, etc. can be disseminated.  Consequently, ARP requests from local VMs can be suppressed by the encapsulation endpoint.

4.2. Head end replication

A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER).  HER means the traffic is duplicated and sent to each end point individually using conventional IP unicast.  Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end.  Nevertheless, HER is especially attractive when overlays are in use as the replication can be carried out by the hypervisor or encapsulation end point.  Consequently, the VMs and IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points.  Additionally, it is possible to use a number of approaches for constructing and disseminating the list of endpoints that should receive each traffic flow.

For example, the reluctance of data center operators to enable PIM and IGMP within the data center fabric means VXLAN is often used with HER.  Thus, BUM traffic for each VNI is replicated and sent using unicast to the remote VTEPs with VMs in that VNI.  The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP.  Alternatively, the VTEPs may transmit appropriate state to a centralized controller, which in turn sends each VTEP the list of remote VTEPs for each VNI.  Lastly, HER also works well when a distributed control plane is used instead of the centralized controller.  Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.
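
A sketch of HER at a VTEP is shown below (illustrative Python; the flood list and addresses are assumptions and would in practice come from static configuration, a controller or BGP-EVPN routes):

   # Sketch of head end replication: a BUM frame for a given VNI is
   # encapsulated once per remote VTEP in that VNI and sent as
   # ordinary unicast, so the fabric sees no multicast at all.

   FLOOD_LIST = {
       # VNI -> remote VTEP IP addresses with VMs in that VNI
       10001: ["192.0.2.11", "192.0.2.12", "192.0.2.13"],
   }

   def head_end_replicate(vni: int, inner_frame: bytes, local_vtep: str):
       """Yield one unicast-encapsulated copy per remote VTEP."""
       for remote_vtep in FLOOD_LIST.get(vni, []):
           yield {
               "outer_src_ip": local_vtep,
               "outer_dst_ip": remote_vtep,   # unicast; no fabric multicast
               "vni": vni,
               "payload": inner_frame,
           }

   if __name__ == "__main__":
       copies = list(head_end_replicate(10001, b"broadcast-frame", "192.0.2.1"))
       print(len(copies), "unicast copies generated")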
are 563 better placed to deliver one-to-many traffic in data centers, 564 especially when judiciously combined with a centralized controller 565 and/or a distributed control plane (particularly one based on BGP- 566 EVPN). 568 6. IANA Considerations 570 This memo includes no request to IANA. 572 7. Security Considerations 574 No new security considerations result from this document 576 8. Acknowledgements 578 9. References 580 9.1. Normative References 582 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 583 Requirement Levels", BCP 14, RFC 2119, 584 DOI 10.17487/RFC2119, March 1997, 585 . 587 9.2. Informative References 589 [I-D.ietf-bier-use-cases] 590 Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., 591 Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. 592 Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 593 (work in progress), January 2018. 595 [I-D.ietf-nvo3-geneve] 596 Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic 597 Network Virtualization Encapsulation", draft-ietf- 598 nvo3-geneve-06 (work in progress), March 2018. 600 [I-D.ietf-nvo3-vxlan-gpe] 601 Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol 602 Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work 603 in progress), April 2018. 605 [I-D.ietf-spring-segment-routing] 606 Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., 607 Litkowski, S., and R. Shakir, "Segment Routing 608 Architecture", draft-ietf-spring-segment-routing-15 (work 609 in progress), January 2018. 611 [RFC2236] Fenner, W., "Internet Group Management Protocol, Version 612 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, 613 . 615 [RFC2710] Deering, S., Fenner, W., and B. Haberman, "Multicast 616 Listener Discovery (MLD) for IPv6", RFC 2710, 617 DOI 10.17487/RFC2710, October 1999, 618 . 620 [RFC3376] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. 621 Thyagarajan, "Internet Group Management Protocol, Version 622 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, 623 . 625 [RFC4601] Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, 626 "Protocol Independent Multicast - Sparse Mode (PIM-SM): 627 Protocol Specification (Revised)", RFC 4601, 628 DOI 10.17487/RFC4601, August 2006, 629 . 631 [RFC4607] Holbrook, H. and B. Cain, "Source-Specific Multicast for 632 IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, 633 . 635 [RFC5015] Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, 636 "Bidirectional Protocol Independent Multicast (BIDIR- 637 PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, 638 . 640 [RFC6820] Narten, T., Karir, M., and I. Foo, "Address Resolution 641 Problems in Large Data Center Networks", RFC 6820, 642 DOI 10.17487/RFC6820, January 2013, 643 . 645 [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, 646 L., Sridhar, T., Bursell, M., and C. Wright, "Virtual 647 eXtensible Local Area Network (VXLAN): A Framework for 648 Overlaying Virtualized Layer 2 Networks over Layer 3 649 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, 650 . 652 [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., 653 Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based 654 Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 655 2015, . 657 [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network 658 Virtualization Using Generic Routing Encapsulation", 659 RFC 7637, DOI 10.17487/RFC7637, September 2015, 660 . 662 [RFC7938] Lapukhov, P., Premji, A., and J. 

BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases].  The deployment envisioned with overlay networks is that the encapsulation endpoints would be the BFIRs, so knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast.  Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information.  A challenge associated with using BIER is that, unlike most of the other approaches discussed in this draft, it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

4.4. Segment Routing

Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the source routing paradigm, in which the manner in which a packet traverses a network is determined by an ordered list of instructions.  These instructions, known as segments, may have semantics that are local to an SR node or global within the SR domain.  SR allows a flow to be enforced through any topological path while maintaining per-flow state only at the ingress node to the SR domain.  Segment Routing can be applied to the MPLS and IPv6 data planes.  In the former, the list of segments is represented by the label stack and in the latter it is represented as a routing extension header.  Use cases are described in [I-D.ietf-spring-segment-routing] and are being considered in the context of BGP-based large-scale data center (DC) design [RFC7938].

Multicast in SR continues to be discussed in a variety of drafts and working groups.  The SPRING WG has not yet been chartered to work on multicast in SR.  One option is to locally allocate a Segment Identifier (SID) for existing replication solutions, such as PIM, mLDP, P2MP RSVP-TE and BIER.  It may also be that a new way to signal and install trees in SR is developed without creating state in the network.

5. Conclusions

As the volume and importance of one-to-many traffic in data centers increase, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its relatively poor scalability and its inability to exploit characteristics of data center network architectures.  Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built by PIM.  Rather, approaches which exploit characteristics of data center network architectures (e.g. a fixed and regular topology, ownership and exclusive control by a single organization, well-known overlay encapsulation endpoints, etc.) are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane (particularly one based on BGP-EVPN).

6. IANA Considerations

This memo includes no request to IANA.

7. Security Considerations

No new security considerations result from this document.

8. Acknowledgements

9. References

9.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

9.2. Informative References

[I-D.ietf-bier-use-cases]  Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 (work in progress), January 2018.

[I-D.ietf-nvo3-geneve]  Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-06 (work in progress), March 2018.

[I-D.ietf-nvo3-vxlan-gpe]  Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work in progress), April 2018.

[I-D.ietf-spring-segment-routing]  Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.

[RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, <https://www.rfc-editor.org/info/rfc2236>.

[RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999, <https://www.rfc-editor.org/info/rfc2710>.

[RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, <https://www.rfc-editor.org/info/rfc3376>.

[RFC3810]  Vida, R., Ed. and L. Costa, Ed., "Multicast Listener Discovery Version 2 (MLDv2) for IPv6", RFC 3810, DOI 10.17487/RFC3810, June 2004, <https://www.rfc-editor.org/info/rfc3810>.

[RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006, <https://www.rfc-editor.org/info/rfc4443>.

[RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006, <https://www.rfc-editor.org/info/rfc4601>.

[RFC4604]  Holbrook, H., Cain, B., and B. Haberman, "Using Internet Group Management Protocol Version 3 (IGMPv3) and Multicast Listener Discovery Protocol Version 2 (MLDv2) for Source-Specific Multicast", RFC 4604, DOI 10.17487/RFC4604, August 2006, <https://www.rfc-editor.org/info/rfc4604>.

[RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, <https://www.rfc-editor.org/info/rfc4607>.

[RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, <https://www.rfc-editor.org/info/rfc5015>.

[RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013, <https://www.rfc-editor.org/info/rfc6820>.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, <https://www.rfc-editor.org/info/rfc7348>.

[RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015, <https://www.rfc-editor.org/info/rfc7432>.

[RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015, <https://www.rfc-editor.org/info/rfc7637>.

[RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016, <https://www.rfc-editor.org/info/rfc7938>.

[RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016, <https://www.rfc-editor.org/info/rfc8014>.

[RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017, <https://www.rfc-editor.org/info/rfc8279>.

[RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018, <https://www.rfc-editor.org/info/rfc8365>.

Authors' Addresses

Mike McBride
Huawei

Email: michael.mcbride@huawei.com

Olufemi Komolafe
Arista Networks

Email: femi@arista.com