MBONED                                                        M. McBride
Internet-Draft                                                    Huawei
Intended status: Informational                               O. Komolafe
Expires: September 12, 2019                              Arista Networks
                                                          March 11, 2019


                 Multicast in the Data Center Overview
                     draft-ietf-mboned-dc-deploy-05

Abstract

The volume and importance of one-to-many traffic patterns in data centers is likely to increase significantly in the future.  Reasons for this increase are discussed, and attention is then paid to the manner in which this traffic pattern may be judiciously handled in data centers.  The intuitive solution of deploying conventional IP multicast within data centers is explored and evaluated.  Thereafter, a number of emerging innovative approaches are described before a number of recommendations are made.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 12, 2019.

Copyright Notice

Copyright (c) 2019 IETF Trust and the persons identified as the document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reasons for increasing one-to-many traffic patterns
     2.1.  Applications
     2.2.  Overlays
     2.3.  Protocols
   3.  Handling one-to-many traffic using conventional multicast
     3.1.  Layer 3 multicast
     3.2.  Layer 2 multicast
     3.3.  Example use cases
     3.4.  Advantages and disadvantages
   4.  Alternative options for handling one-to-many traffic
     4.1.  Minimizing traffic volumes
     4.2.  Head end replication
     4.3.  BIER
     4.4.  Segment Routing
   5.  Conclusions
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

The volume and importance of one-to-many traffic patterns in data centers is likely to increase significantly in the future.  Reasons for this increase include the nature of the traffic generated by applications hosted in the data center, the need to handle broadcast, unknown unicast and multicast (BUM) traffic within the overlay technologies used to support multi-tenancy at scale, and the use of certain protocols that traditionally require one-to-many control message exchanges.
These trends, allied with the expectation that future highly virtualized data centers must support communication between potentially thousands of participants, may lead to the natural assumption that IP multicast will be widely used in data centers, especially given the bandwidth savings it potentially offers.  However, such an assumption would be wrong.  In fact, there is widespread reluctance to enable IP multicast in data centers for a number of reasons, mostly pertaining to concerns about its scalability and reliability.

This draft discusses some of the main drivers for the increasing volume and importance of one-to-many traffic patterns in data centers.  Thereafter, the manner in which conventional IP multicast may be used to handle this traffic pattern is discussed and some of the associated challenges highlighted.  Following this discussion, a number of alternative emerging approaches are introduced, before concluding by discussing key trends and making a number of recommendations.

1.1.  Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

Key trends suggest that the nature of the applications likely to dominate future highly-virtualized multi-tenant data centers will produce large volumes of one-to-many traffic.  For example, it is well known that traffic flows in data centers have evolved from being predominantly North-South (e.g. client-server) to predominantly East-West (e.g. distributed computation).  This change has led to the consensus that topologies such as Leaf/Spine, which are easier to scale in the East-West direction, are better suited to the data center of the future.  This increase in East-West traffic flows results from VMs often having to exchange numerous messages among themselves as part of executing a specific workload.  For example, a computational workload could require data, or an executable, to be disseminated to workers distributed throughout the data center, which may subsequently be polled for status updates.  The emergence of such applications means there is likely to be an increase in one-to-many traffic flows with the increasing dominance of East-West traffic.

The TV broadcast industry is another potential future source of applications with one-to-many traffic patterns in data centers.  The requirement for robustness, stability and predictability has meant the TV broadcast industry has traditionally used TV-specific protocols, infrastructure and technologies for transmitting video signals between end points such as cameras, monitors, mixers, graphics devices and video servers.  However, the growing cost and complexity of supporting this approach, especially as the bit rates of the video signals increase due to demand for formats such as 4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast industry will transition from industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based infrastructure.  The development of pertinent standards by the SMPTE, along with the increasing performance of IP routers, means this transition is gathering pace.
A possible outcome of this transition will be the building of IP data centers in broadcast plants.  Traffic flows in the broadcast industry are frequently one-to-many and so, if IP data centers are deployed in broadcast plants, it is imperative that this traffic pattern is supported efficiently in that infrastructure.  In fact, a pivotal consideration for broadcasters considering transitioning to IP is the manner in which these one-to-many traffic flows will be managed and monitored in a data center with an IP fabric.

One of the few success stories in using conventional IP multicast has been for disseminating market trading data.  For example, IP multicast is commonly used today to deliver stock quotes from the stock exchange to financial services providers and then to the stock analysts or brokerages.  The network must be designed with no single point of failure and in such a way that the network can respond in a deterministic manner to any failure.  Typically, redundant servers (in a primary/backup or live-live mode) send multicast streams into the network, with diverse paths being used across the network.  Another critical requirement is reliability and traceability; regulatory and legal requirements mean that the producer of the market data may need to know exactly where the flow was sent and be able to prove conclusively that the data was received within agreed SLAs.  The stock exchange generating the one-to-many traffic and the stock analysts or brokerages that receive the traffic will typically have their own data centers.  Therefore, the manner in which one-to-many traffic patterns are handled in these data centers is extremely important, especially given the requirements and constraints mentioned.

Many data center cloud providers offer publish/subscribe applications.  There can be numerous publishers and subscribers and many message channels within a data center.  With unicast publish/subscribe servers, a separate message is sent to each subscriber of a publication.  With multicast publish/subscribe, only one message is sent, regardless of the number of subscribers.  In a publish/subscribe system, client applications, some of which are publishers and some of which are subscribers, are connected to a network of message brokers that receive publications on a number of topics and send the publications on to the subscribers for those topics.  The more subscribers there are in the publish/subscribe system, the greater the improvement to network utilization there might be with multicast.

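As an informal illustration of this difference, the minimal Python sketch below shows a publisher reaching every subscriber with a single send, and a subscriber joining the group using the standard socket API.  The group address, port and TTL are illustrative assumptions and not taken from any particular deployment.

    import socket
    import struct

    GROUP, PORT = "233.252.0.1", 5004   # illustrative group and port

    def publish(message):
        # A single send reaches every subscriber of the group, however
        # many there are; the sender makes no per-subscriber copies.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
        s.sendto(message, (GROUP, PORT))

    def subscribe():
        # Joining the group triggers an IGMP report; thereafter every
        # publication sent to the group is delivered to this socket.
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        mreq = struct.pack("!4s4s", socket.inet_aton(GROUP),
                           socket.inet_aton("0.0.0.0"))
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            data, _ = s.recvfrom(65535)
            print(data)

Delivering the same publication to N subscribers therefore costs the publisher one transmission rather than N.
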
2.2.  Overlays

The proposed architecture for supporting large-scale multi-tenancy in highly virtualized data centers [RFC8014] consists of a tenant's VMs distributed across the data center connected by a virtual network known as the overlay network.  A number of different technologies have been proposed for realizing the overlay network, including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably partisan debate about the relative merits of these overlay technologies belies the fact that, conceptually, these overlays simply provide a means to encapsulate and tunnel Ethernet frames from the VMs over the data center IP fabric, thus emulating a layer 2 segment between the VMs.  Consequently, the VMs believe and behave as if they are connected to the tenant's other VMs by a conventional layer 2 segment, regardless of their physical location within the data center.  Naturally, in a layer 2 segment, point-to-multipoint traffic can result from handling BUM (broadcast, unknown unicast and multicast) traffic.  Compounding this issue within data centers, since the tenant's VMs attached to the emulated segment may be dispersed throughout the data center, the BUM traffic may need to traverse the data center fabric.  Hence, regardless of the overlay technology used, due consideration must be given to handling BUM traffic, forcing the data center operator to consider the manner in which one-to-many communication is handled within the IP fabric.

2.3.  Protocols

Conventionally, some key networking protocols used in data centers require one-to-many communication.  For example, ARP and ND use broadcast and multicast messages within IPv4 and IPv6 networks respectively to discover IP address to MAC address mappings.  Furthermore, when these protocols are running within an overlay network, it is essential to ensure the messages are delivered to all the hosts on the emulated layer 2 segment, regardless of their physical location within the data center.  The challenges associated with optimally delivering ARP and ND messages in data centers have attracted much attention [RFC6820].  Popular approaches in use mostly seek to exploit characteristics of data center networks to avoid having to broadcast/multicast these messages, as discussed in Section 4.1.

Some networking protocols are also being modified or developed specifically to work well in data center Clos environments.  BGP has been extended to work in these types of DC environments and supports multicast well.  RIFT (Routing in Fat Trees) is a new protocol being developed to work efficiently in DC Clos environments, and it is also being specified to support multicast addressing and forwarding.

3.  Handling one-to-many traffic using conventional multicast

3.1.  Layer 3 multicast

PIM is the most widely deployed multicast routing protocol and so, unsurprisingly, is the primary multicast routing protocol considered for use in the data center.  Three popular modes of PIM may potentially be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015].  It may be said that these different modes of PIM trade off the optimality of the multicast forwarding tree against the amount of multicast forwarding state that must be maintained at routers.  SSM provides the most efficient forwarding between sources and receivers and thus is most suitable for applications with one-to-many traffic patterns.  State is built and maintained for each (S,G) flow.  Thus, the amount of multicast forwarding state held by routers in the data center is proportional to the number of sources and groups.  At the other end of the spectrum, BIDIR is the most efficient shared tree solution as one tree is built for all flows, therefore minimizing the amount of state.  This state reduction is at the expense of optimal forwarding paths between sources and receivers.  The use of a shared tree makes BIDIR particularly well suited for applications with many-to-many traffic patterns, given that the amount of state is uncorrelated to the number of sources.
SSM and BIDIR are optimizations of PIM-SM.  PIM-SM remains the most widely deployed multicast routing protocol; it can also be the most complex.  PIM-SM relies upon a Rendezvous Point (RP) to set up the multicast tree; subsequently there is the option of switching to the shortest path tree (SPT), similar to SSM, or staying on the shared tree, similar to BIDIR.

3.2.  Layer 2 multicast

With IPv4 unicast address resolution, the translation of an IP address to a MAC address is done dynamically by ARP.  With multicast address resolution, the mapping from a multicast IPv4 address to a multicast MAC address is done by assigning the low-order 23 bits of the multicast IPv4 address to fill the low-order 23 bits of the multicast MAC address.  Each IPv4 multicast address has 28 unique bits (the multicast address range is 224.0.0.0/4), so mapping a multicast IP address to a MAC address ignores 5 bits of the IP address.  Hence, groups of 32 multicast IP addresses are mapped to the same MAC address, and so a multicast MAC address cannot be uniquely mapped back to a multicast IPv4 address.  Therefore, planning is required within an organization to choose IPv4 multicast addresses judiciously in order to avoid address aliasing.  When sending IPv6 multicast packets on an Ethernet link, the corresponding destination MAC address is a direct mapping of the last 32 bits of the 128-bit IPv6 multicast address into the 48-bit MAC address.  It is possible for more than one IPv6 multicast address to map to the same 48-bit MAC address.

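The mapping just described can be captured in a short sketch (Python is used purely for illustration; the example addresses are arbitrary):

    import socket
    import struct

    def ipv4_multicast_mac(group):
        # Copy the low-order 23 bits of the group address into the fixed
        # 01:00:5e OUI prefix; the remaining 5 variable bits are discarded.
        addr = struct.unpack("!I", socket.inet_aton(group))[0]
        low23 = addr & 0x7FFFFF
        return "01:00:5e:%02x:%02x:%02x" % (
            (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF)

    def ipv6_multicast_mac(group):
        # The low-order 32 bits of the IPv6 group address are appended
        # to the fixed 33:33 prefix.
        addr = socket.inet_pton(socket.AF_INET6, group)
        return "33:33:" + ":".join("%02x" % b for b in addr[-4:])

    # ipv4_multicast_mac() returns 01:00:5e:7c:00:01 for 233.252.0.1,
    # 224.124.0.1 and 239.252.0.1 alike, demonstrating the 32:1
    # aliasing that careful address planning must avoid.

Any two groups whose IPv4 addresses differ only in the 5 discarded bits will collide in this way.
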
The default behaviour of many hosts (and, in fact, routers) is to block multicast traffic.  Consequently, when a host wishes to join an IPv4 multicast group, it sends an IGMP [RFC2236] [RFC3376] report to the router attached to the layer 2 segment and also instructs its data link layer to receive Ethernet frames that match the corresponding MAC address.  The data link layer filters the frames, passing those with matching destination addresses to the IP module.  Similarly, when transmitting, a host simply hands the multicast packet to the data link layer, which adds the layer 2 encapsulation using the MAC address derived in the manner previously discussed.

When an Ethernet frame with a multicast MAC address is received by a switch configured to forward multicast traffic, the default behaviour is to flood it to all the ports in the layer 2 segment.  Clearly there may not be a receiver for this multicast group present on each port, and IGMP snooping is used to avoid sending the frame out of ports without receivers.

A switch running IGMP snooping listens to the IGMP messages exchanged between hosts and the router in order to identify which ports have active receivers for a specific multicast group, allowing the forwarding of multicast frames to be suitably constrained.  Normally, the multicast router will generate IGMP queries to which the hosts send IGMP reports in response.  However, a number of optimizations in which a switch generates IGMP queries (and so appears to be the router from the hosts' perspective) and/or generates IGMP reports (and so appears to be a host from the router's perspective) are commonly used to improve performance by reducing the amount of state maintained at the router, suppressing superfluous IGMP messages and improving responsiveness when hosts join or leave the group.

Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by IPv6 routers for discovering multicast listeners on a directly attached link, performing a similar function to IGMP in IPv4 networks.  MLDv1 [RFC2710] is similar to IGMPv2, and MLDv2 [RFC3810] [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD does not send its own distinct protocol messages.  Rather, MLD is a subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of ICMPv6 messages.  MLD snooping works similarly to IGMP snooping, described earlier.

3.3.  Example use cases

A use case where PIM and IGMP are currently used in data centers is to support multicast in VXLAN deployments.  In the original VXLAN specification [RFC7348], a data-driven flood-and-learn control plane was proposed, requiring the data center IP fabric to support multicast routing.  A multicast group is associated with each virtual network, each uniquely identified by its VXLAN network identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the hypervisor or ToR switch, with local VMs that belong to this VNI join the multicast group and use it for the exchange of BUM traffic with the other VTEPs.  Essentially, the VTEP encapsulates any BUM traffic from attached VMs in an IP multicast packet, whose destination address is the associated multicast group address, and transmits the packet to the data center fabric.  Thus, PIM must be running in the fabric to maintain a multicast distribution tree per VNI.  (A simplified sketch of this VNI-to-group mapping is given at the end of this subsection.)

Alternatively, rather than setting up a multicast distribution tree per VNI, a tree can be set up whenever hosts within the VNI wish to exchange multicast traffic.  For example, whenever a VTEP receives an IGMP report from a locally connected host, it translates this into a PIM join message, which is propagated into the IP fabric.  In order to ensure this join message is sent to the IP fabric rather than over the VXLAN interface (since the VTEP will have a route back to the source of the multicast packet over the VXLAN interface and so would naturally attempt to send the join over this interface), a more specific route back to the source over the IP fabric must be configured.  In this approach PIM must be configured on the SVIs associated with the VXLAN interface.

Another use case of PIM and IGMP in data centers is when IPTV servers use multicast to deliver content from the data center to end users.  IPTV is typically a one-to-many application where the hosts are configured for IGMPv3, the switches are configured with IGMP snooping, and the routers are running PIM-SSM mode.  Often redundant servers send multicast streams into the network and the network forwards the data across diverse paths.

Windows Media servers send multicast streams to clients.  Windows Media Services streams to an IP multicast address and all clients subscribe to the IP address to receive the same stream.  This allows a single stream to be played simultaneously by multiple clients, thus reducing bandwidth utilization.

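Returning to the flood-and-learn VXLAN use case above, the sketch below captures the essence of what a VTEP does: derive a group address for a VNI, join that group so that BUM traffic flooded by other VTEPs is received, and encapsulate locally originated BUM frames towards the group.  Only the 8-byte VXLAN header layout and UDP port 4789 come from [RFC7348]; the base group address, the VNI-to-group allocation scheme and the socket handling are hypothetical assumptions, not taken from any implementation.

    import socket
    import struct

    VXLAN_PORT = 4789            # IANA-assigned VXLAN UDP port (RFC 7348)
    GROUP_BASE = "239.1.0.0"     # assumed locally administered base group

    def group_for_vni(vni):
        # Hypothetical allocation: one group per VNI, derived by adding
        # the low 16 bits of the VNI to a locally chosen base address.
        base = struct.unpack("!I", socket.inet_aton(GROUP_BASE))[0]
        return socket.inet_ntoa(struct.pack("!I", base + (vni & 0xFFFF)))

    def join_vni(sock, vni):
        # The VTEP joins the VNI's group so that it receives BUM traffic
        # flooded by the other VTEPs in the same virtual network.
        mreq = struct.pack("!4s4s", socket.inet_aton(group_for_vni(vni)),
                           socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    def flood_bum(sock, vni, inner_frame):
        # Prepend the 8-byte VXLAN header (I flag set, 24-bit VNI) and
        # send the result to the group associated with the VNI; the
        # fabric's multicast tree delivers it to all interested VTEPs.
        vxlan_header = struct.pack("!II", 0x08000000, vni << 8)
        sock.sendto(vxlan_header + inner_frame,
                    (group_for_vni(vni), VXLAN_PORT))
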
3.4.  Advantages and disadvantages

Arguably the biggest advantage of using PIM and IGMP to support one-to-many communication in data centers is that these protocols are relatively mature.  Consequently, PIM is available in most routers and IGMP is supported by most hosts and routers.  As such, no specialized hardware or relatively immature software is involved in using them in data centers.  Furthermore, the maturity of these protocols means their behaviour and performance in operational networks is well understood, with widely available best practices and deployment guides for optimizing their performance.

However, somewhat ironically, the relative disadvantages of PIM and IGMP usage in data centers also stem mostly from their maturity.  Specifically, these protocols were standardized and implemented long before the highly-virtualized multi-tenant data centers of today existed.  Consequently, PIM and IGMP are neither optimally placed to deal with the requirements of one-to-many communication in modern data centers nor to exploit the characteristics and idiosyncrasies of data centers.  For example, there may be thousands of VMs participating in a multicast session, with some of these VMs migrating between servers within the data center, new VMs continually being spun up and wishing to join the sessions, while all the time other VMs are leaving.  In such a scenario, the churn in the PIM and IGMP state machines, the volume of control messages they would generate and the amount of state they would necessitate within routers, especially if they were deployed naively, would be untenable.

4.  Alternative options for handling one-to-many traffic

Section 2 has shown that there is likely to be an increasing amount of one-to-many communication in data centers, and Section 3 has discussed how conventional multicast may be used to handle this traffic.  Nevertheless, there are a number of alternative options for handling this traffic pattern in data centers, as discussed in the following subsections.  It should be noted that many of these techniques are not mutually exclusive; in fact many deployments involve a combination of more than one of these techniques.  Furthermore, as will be shown, introducing a centralized controller or a distributed control plane makes these techniques more potent.

4.1.  Minimizing traffic volumes

If handling one-to-many traffic in data centers can be challenging, then arguably the most intuitive solution is to aim to minimize the volume of such traffic.

It was previously mentioned in Section 2 that the three main causes of one-to-many traffic in data centers are applications, overlays and protocols.  While, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols (and often by protocols running within overlays).  This reduction is possible by exploiting certain characteristics of data center networks: fixed and regular topology, single administrative control, consistent hardware and software, well-known overlay encapsulation endpoints and so on.

A way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller.  For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of this new MAC address, the associated virtual network, IP address and so on.  The controller can subsequently distribute this information to every encapsulation endpoint.  Consequently, when any endpoint receives an ARP request from a locally attached VM, it can simply consult its local copy of the information distributed by the controller and reply.  Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric.

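A minimal sketch of this ARP suppression logic at the encapsulation endpoint follows.  It assumes the mapping table has been populated by the controller (or, as described next, by a distributed control plane); the table contents, function names and the choice to return None for unknown targets are illustrative assumptions rather than any particular implementation.

    import socket
    import struct

    # IP-to-MAC mappings distributed by the controller; contents are
    # purely illustrative.
    ip_to_mac = {"192.0.2.10": bytes.fromhex("0242c0a80a0a")}

    def proxy_arp_reply(requested_ip, requested_mac, asker_ip, asker_mac):
        # Build an Ethernet/ARP reply answering on behalf of the remote
        # VM, so that the request never has to cross the IP fabric.
        eth = asker_mac + requested_mac + b"\x08\x06"
        arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)        # opcode 2: reply
        arp += requested_mac + socket.inet_aton(requested_ip)  # sender = answer
        arp += asker_mac + socket.inet_aton(asker_ip)          # target = asker
        return eth + arp

    def handle_arp_request(requested_ip, asker_ip, asker_mac):
        mac = ip_to_mac.get(requested_ip)
        if mac is None:
            return None   # unknown target: caller floods the request as BUM
        return proxy_arp_reply(requested_ip, mac, asker_ip, asker_mac)
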
Alternatively, the functionality supported by the controller can be realized by a distributed control plane.  BGP-EVPN [RFC7432] [RFC8365] is the most popular control plane used in data centers.  Typically, the encapsulation endpoints exchange pertinent information with each other by all peering with a BGP route reflector (RR).  Thus, information about local MAC addresses, MAC to IP address mappings, virtual network identifiers and so on can be disseminated.  Consequently, ARP requests from local VMs can be suppressed by the encapsulation endpoint.

4.2.  Head end replication

A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER).  With HER, the traffic is duplicated and sent to each end point individually using conventional IP unicast.  Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end.  Nevertheless, HER is especially attractive when overlays are in use, as the replication can be carried out by the hypervisor or encapsulation end point.  Consequently, the VMs and the IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points.  Additionally, a number of approaches may be used for constructing and disseminating the list of which endpoints should receive what traffic.

For example, the reluctance of data center operators to enable PIM and IGMP within the data center fabric means VXLAN is often used with HER.  Thus, BUM traffic from each VNI is replicated and sent using unicast to the remote VTEPs with VMs in that VNI.  The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP.  Alternatively, the VTEPs may transmit appropriate state to a centralized controller, which in turn sends each VTEP the list of remote VTEPs for each VNI.  Lastly, HER also works well when a distributed control plane is used instead of the centralized controller.  Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.

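The core of HER is simple enough to capture in a few lines.  In the sketch below (illustrative only), the per-VNI flood list is assumed to have been populated by manual configuration, a controller or BGP-EVPN, and the frame passed in is assumed to be already VXLAN-encapsulated.

    import socket

    # Per-VNI flood lists of remote VTEP addresses; the values are
    # examples and would in practice come from configuration, a
    # controller or BGP-EVPN routes.
    flood_list = {10010: ["192.0.2.11", "192.0.2.12", "192.0.2.13"]}

    def head_end_replicate(sock, vni, encapsulated_frame, port=4789):
        # The head end sends one unicast copy of the encapsulated frame
        # to every remote VTEP participating in the VNI; the IP fabric
        # only ever sees ordinary unicast packets.
        for vtep in flood_list.get(vni, []):
            sock.sendto(encapsulated_frame, (vtep, port))

    # Example usage: sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    #                head_end_replicate(sock, 10010, frame)

The trade-off described above is visible directly: the head end transmits N copies for N remote VTEPs, but no multicast state is required anywhere in the fabric.
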
4.3.  BIER

As discussed in Section 3.4, PIM and IGMP face potential scalability challenges when deployed in data centers.  These challenges typically stem from the requirement to build and maintain a distribution tree and the requirement to hold per-flow state in routers.  Bit Index Explicit Replication (BIER) [RFC8279] is a new multicast forwarding paradigm that avoids these two requirements.

When a multicast packet enters a BIER domain, the ingress router, known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header to the packet.  This header contains a bit string in which each bit maps to an egress router, known as a Bit-Forwarding Egress Router (BFER).  If a bit is set, then the packet should be forwarded to the associated BFER.  The routers within the BIER domain, Bit-Forwarding Routers (BFRs), use the BIER header in the packet and information in the Bit Index Forwarding Table (BIFT) to carry out simple bit-wise operations to determine how the packet should be replicated optimally so it reaches all the appropriate BFERs (a simplified model of this replication is sketched at the end of this subsection).

BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases].  The envisioned deployment with overlay networks is that the encapsulation endpoints would be the BFIRs, so knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast.  Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information.  A challenge associated with using BIER is that, unlike most of the other approaches discussed in this draft, it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

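The following simplified model, loosely based on the forwarding procedure of [RFC8279], conveys the idea; the four-bit topology, the forwarding bit masks (F-BMs) and the send() callback are illustrative assumptions, and details such as BIER sub-domains, sets and encapsulation are omitted.

    # Each BIFT entry maps a BFER bit position to the neighbor used to
    # reach it and the forwarding bit mask (F-BM) covering every BFER
    # reachable via that neighbor.
    bift = {
        1: ("neighbor-A", 0b0011),   # BFERs 1 and 2 via neighbor-A
        2: ("neighbor-A", 0b0011),
        3: ("neighbor-B", 0b1100),   # BFERs 3 and 4 via neighbor-B
        4: ("neighbor-B", 0b1100),
    }

    def bier_forward(bitstring, payload, send):
        remaining = bitstring
        position, bit = 1, 1
        while remaining:
            if remaining & bit:
                neighbor, fbm = bift[position]
                # The copy carries only the bits this neighbor serves.
                send(neighbor, remaining & fbm, payload)
                # Clear those bits so no BFER receives duplicates.
                remaining &= ~fbm
            position, bit = position + 1, bit << 1

    # bier_forward(0b1010, b"pkt", lambda n, bs, p: print(n, bin(bs)))
    # emits one copy towards neighbor-A (bits 0b0010) and one towards
    # neighbor-B (bits 0b1000); no per-flow state is ever consulted.
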
4.4.  Segment Routing

Segment Routing (SR) [I-D.ietf-spring-segment-routing] adopts the source routing paradigm, in which the manner in which a packet traverses a network is determined by an ordered list of instructions.  These instructions, known as segments, may have a semantic local to an SR node or global within an SR domain.  SR allows a flow to be forced through any topological path while maintaining per-flow state only at the ingress node to the SR domain.  Segment Routing can be applied to the MPLS and IPv6 data planes.  In the former, the list of segments is represented by the label stack and in the latter it is represented as a routing extension header.  Use cases are described in [I-D.ietf-spring-segment-routing] and are being considered in the context of BGP-based large-scale data center (DC) design [RFC7938].

Multicast in SR continues to be discussed in a variety of drafts and working groups.  The SPRING WG has not yet been chartered to work on multicast in SR.  Multicast can include locally allocating a Segment Identifier (SID) to existing replication solutions, such as PIM, mLDP, P2MP RSVP-TE and BIER.  It may also be that a new way to signal and install trees in SR is developed without creating state in the network.

5.  Conclusions

As the volume and importance of one-to-many traffic in data centers increases, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its relatively poor scalability and its inability to exploit characteristics of data center network architectures.  Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built by PIM.  Rather, approaches which exploit characteristics of data center network architectures (e.g. fixed and regular topology, single administrative control, consistent hardware and software, well-known overlay encapsulation endpoints etc.) are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane (particularly one based on BGP-EVPN).

6.  IANA Considerations

This memo includes no request to IANA.

7.  Security Considerations

No new security considerations result from this document.

8.  Acknowledgements

9.  References

9.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

9.2.  Informative References

[I-D.ietf-bier-use-cases]  Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-06 (work in progress), January 2018.

[I-D.ietf-nvo3-geneve]  Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-11 (work in progress), March 2019.

[I-D.ietf-nvo3-vxlan-gpe]  Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-06 (work in progress), April 2018.

[I-D.ietf-spring-segment-routing]  Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", draft-ietf-spring-segment-routing-15 (work in progress), January 2018.

[RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997.

[RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999.

[RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002.

[RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006.

[RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006.

[RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007.

[RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013.

[RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

[RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015.

[RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015.

[RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016.

[RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016.

[RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017.

[RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018.

Authors' Addresses

   Mike McBride
   Huawei

   Email: michael.mcbride@huawei.com

   Olufemi Komolafe
   Arista Networks

   Email: femi@arista.com