MBONED                                                       M. McBride
Internet-Draft                                                Futurewei
Intended status: Informational                              O. Komolafe
Expires: August 7, 2020                                 Arista Networks
                                                       February 4, 2020

                 Multicast in the Data Center Overview
                    draft-ietf-mboned-dc-deploy-08

Abstract

   The volume and importance of one-to-many traffic patterns in data centers is likely to increase significantly in the future.  Reasons for this increase are discussed and then attention is paid to the manner in which this traffic pattern may be judiciously handled in data centers.  The intuitive solution of deploying conventional IP multicast within data centers is explored and evaluated.  Thereafter, a number of emerging innovative approaches are described before a number of recommendations are made.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document.  Please review these documents carefully, as they describe your rights and restrictions with respect to this document.  Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reasons for increasing one-to-many traffic patterns
     2.1.  Applications
     2.2.  Overlays
     2.3.  Protocols
     2.4.  Summary
   3.  Handling one-to-many traffic using conventional multicast
     3.1.  Layer 3 multicast
     3.2.  Layer 2 multicast
     3.3.  Example use cases
     3.4.  Advantages and disadvantages
   4.  Alternative options for handling one-to-many traffic
     4.1.  Minimizing traffic volumes
     4.2.  Head end replication
     4.3.  Programmable Forwarding Planes
     4.4.  BIER
     4.5.  Segment Routing
   5.  Conclusions
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   The volume and importance of one-to-many traffic patterns in data centers is likely to increase significantly in the future.
   Reasons for this increase include the nature of the traffic generated by applications hosted in the data center, the need to handle broadcast, unknown unicast and multicast (BUM) traffic within the overlay technologies used to support multi-tenancy at scale, and the use of certain protocols that traditionally require one-to-many control message exchanges.

   These trends, allied with the expectation that future highly virtualized large-scale data centers must support communication between potentially thousands of participants, may lead to the natural assumption that IP multicast will be widely used in data centers, specifically given the bandwidth savings it potentially offers.  However, such an assumption would be wrong.  In fact, there is widespread reluctance to enable conventional IP multicast in data centers for a number of reasons, mostly pertaining to concerns about its scalability and reliability.

   This draft discusses some of the main drivers for the increasing volume and importance of one-to-many traffic patterns in data centers.  Thereafter, the manner in which conventional IP multicast may be used to handle this traffic pattern is discussed and some of the associated challenges highlighted.  Following this discussion, a number of alternative emerging approaches are introduced, before concluding by discussing key trends and making a number of recommendations.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

   Key trends suggest that the nature of the applications likely to dominate future highly-virtualized multi-tenant data centers will produce large volumes of one-to-many traffic.  For example, it is well-known that traffic flows in data centers have evolved from being predominantly North-South (e.g. client-server) to predominantly East-West (e.g. distributed computation).  This change has led to the consensus that topologies such as Leaf/Spine, which are easier to scale in the East-West direction, are better suited to the data center of the future.  This increase in East-West traffic flows results from VMs often having to exchange numerous messages between themselves as part of executing a specific workload.  For example, a computational workload could require data, or an executable, to be disseminated to workers distributed throughout the data center which may be subsequently polled for status updates.  The emergence of such applications means there is likely to be an increase in one-to-many traffic flows with the increasing dominance of East-West traffic.

   The TV broadcast industry is another potential future source of applications with one-to-many traffic patterns in data centers.  The requirement for robustness, stability and predictability has meant the TV broadcast industry has traditionally used TV-specific protocols, infrastructure and technologies for transmitting video signals between end points such as cameras, monitors, mixers, graphics devices and video servers.
   However, the growing cost and complexity of supporting this approach, especially as the bit rates of the video signals increase due to demand for formats such as 4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast industry will transition from industry-specific transmission formats (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based infrastructure.  The development of pertinent standards by the Society of Motion Picture and Television Engineers (SMPTE) [SMPTE2110], along with the increasing performance of IP routers, means this transition is gathering pace.  A possible outcome of this transition will be the building of IP data centers in broadcast plants.  Traffic flows in the broadcast industry are frequently one-to-many and so if IP data centers are deployed in broadcast plants, it is imperative that this traffic pattern is supported efficiently in that infrastructure.  In fact, a pivotal consideration for broadcasters considering transitioning to IP is the manner in which these one-to-many traffic flows will be managed and monitored in a data center with an IP fabric.

   One of the few success stories in using conventional IP multicast has been for disseminating market trading data.  For example, IP multicast is commonly used today to deliver stock quotes from stock exchanges to financial service providers and then to the stock analysts or brokerages.  It is essential that the network infrastructure delivers very low latency and high throughput, especially given the proliferation of automated and algorithmic trading, which means stock analysts or brokerages may gain an edge on competitors simply by receiving an update a few milliseconds earlier.  As would be expected, in such deployments reliability is critical.  The network must be designed with no single point of failure and in such a way that it can respond in a deterministic manner to failure.  Typically, redundant servers (in a primary/backup or live-live mode) send multicast streams into the network, with diverse paths being used across the network.  The stock exchange generating the one-to-many traffic and the stock analysts/brokerages that receive the traffic will typically have their own data centers.  Therefore, the manner in which one-to-many traffic patterns are handled in these data centers is extremely important, especially given the requirements and constraints mentioned.

   Another reason for the growing volume of one-to-many traffic patterns in modern data centers is the increasing adoption of streaming telemetry.  This transition is motivated by the observation that traditional poll-based approaches for monitoring network devices are usually inadequate in modern data centers.  These approaches typically suffer from poor scalability, extensibility and responsiveness.  In contrast, in streaming telemetry, network devices in the data center stream highly-granular real-time updates to a telemetry collector/database.  This collector then collates, normalizes and encodes this data for convenient consumption by monitoring applications.  The monitoring applications can subscribe to the notifications of interest, allowing them to gain insight into pertinent state and performance metrics.
   Thus, the traffic flows associated with streaming telemetry are typically many-to-one between the network devices and the telemetry collector and then one-to-many from the collector to the monitoring applications.

   The use of publish and subscribe applications is growing within data centers, contributing to the rising volume of one-to-many traffic flows.  Such applications are attractive as they provide a robust low-latency asynchronous messaging service, allowing senders to be decoupled from receivers.  The usual approach is for a publisher to create and transmit a message to a specific topic.  The publish and subscribe application will retain the message and ensure it is delivered to all subscribers to that topic.  The flexibility in the number of publishers and subscribers to a specific topic means such applications cater for one-to-one, one-to-many and many-to-one traffic patterns.

2.2.  Overlays

   Another key contributor to the rise in one-to-many traffic patterns is the proposed architecture for supporting large-scale multi-tenancy in highly virtualized data centers [RFC8014].  In this architecture, a tenant's VMs are distributed across the data center and are connected by a virtual network known as the overlay network.  A number of different technologies have been proposed for realizing the overlay network, including VXLAN [RFC7348], VXLAN-GPE [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE [I-D.ietf-nvo3-geneve].  The often fervent and arguably partisan debate about the relative merits of these overlay technologies belies the fact that, conceptually, these overlays simply provide a means to encapsulate and tunnel Ethernet frames from the VMs over the data center IP fabric, thus emulating a Layer 2 segment between the VMs.  Consequently, the VMs believe and behave as if they are connected to the tenant's other VMs by a conventional Layer 2 segment, regardless of their physical location within the data center.

   Naturally, in a Layer 2 segment, point-to-multipoint traffic can result from handling BUM (broadcast, unknown unicast and multicast) traffic.  Compounding this issue within data centers, since the tenant's VMs attached to the emulated segment may be dispersed throughout the data center, the BUM traffic may need to traverse the data center fabric.

   Hence, regardless of the overlay technology used, due consideration must be given to handling BUM traffic, forcing the data center operator to pay attention to the manner in which one-to-many communication is handled within the data center.  This consideration is likely to become increasingly important with the anticipated rise in the number and importance of overlays.  In fact, it may be asserted that the manner in which one-to-many communications arising from overlays are handled is pivotal to the performance and stability of the entire data center network.

2.3.  Protocols

   Conventionally, some key networking protocols used in data centers require one-to-many communications for control messages.  Thus, the data center operator must pay due attention to how these control message exchanges are supported.

   For example, ARP [RFC0826] and ND [RFC4861] use broadcast and multicast messages within IPv4 and IPv6 networks respectively to discover MAC address to IP address mappings.
   Furthermore, when these protocols are running within an overlay network, it is essential to ensure the messages are delivered to all the hosts on the emulated Layer 2 segment, regardless of physical location within the data center.  The challenges associated with optimally delivering ARP and ND messages in data centers have attracted lots of attention [RFC6820].

   Another example of a protocol that may necessitate having one-to-many traffic flows in the data center is IGMP [RFC2236], [RFC3376].  If the VMs attached to the Layer 2 segment wish to join a multicast group they must send IGMP reports in response to queries from the querier.  As these devices could be located at different locations within the data center, there is the somewhat ironic prospect of IGMP itself leading to an increase in the volume of one-to-many communications in the data center.

2.4.  Summary

   Section 2.1, Section 2.2 and Section 2.3 have discussed how the trends in the types of applications, the overlay technologies used and some of the essential networking protocols result in an increase in the volume of one-to-many traffic patterns in modern highly-virtualized data centers.  Section 3 explores how such traffic flows may be handled using conventional IP multicast.

3.  Handling one-to-many traffic using conventional multicast

   Faced with ever increasing volumes of one-to-many traffic flows for the reasons presented in Section 2, arguably the intuitive initial course of action for a data center operator is to explore if and how conventional IP multicast could be deployed within the data center.  This section introduces the key protocols, presents some example use cases where they are deployed in data centers and discusses some of the advantages and disadvantages of such deployments.

3.1.  Layer 3 multicast

   PIM is the most widely deployed multicast routing protocol and so, unsurprisingly, is the primary multicast routing protocol considered for use in the data center.  There are three popular modes of PIM that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR [RFC5015].  It may be said that these different modes of PIM trade off the optimality of the multicast forwarding tree against the amount of multicast forwarding state that must be maintained at routers.  SSM provides the most efficient forwarding between sources and receivers and thus is most suitable for applications with one-to-many traffic patterns.  State is built and maintained for each (S,G) flow.  Thus, the amount of multicast forwarding state held by routers in the data center is proportional to the number of sources and groups.  At the other end of the spectrum, BIDIR is the most efficient shared tree solution as one tree is built for all flows, therefore minimizing the amount of state.  This state reduction is at the expense of optimal forwarding paths between sources and receivers.  The use of a shared tree makes BIDIR particularly well-suited for applications with many-to-many traffic patterns, given that the amount of state is uncorrelated to the number of sources.  SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is the most widely deployed of the three modes and can also be the most complex.  PIM-SM relies upon an RP (Rendezvous Point) to set up the multicast tree and subsequently there is the option of switching to the SPT (shortest path tree), similar to SSM, or staying on the shared tree, similar to BIDIR.

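   To make this state trade-off concrete, the following Python sketch gives a back-of-the-envelope comparison of per-router forwarding state under each mode.  It is purely illustrative and assumes a worst case in which every source sends to every group and the router lies on every resulting tree; real numbers depend on topology and receiver placement.

      # Illustrative comparison of per-router multicast forwarding state
      # for the PIM modes discussed above (worst-case assumptions).

      def pim_state_estimate(num_sources, num_groups):
          ssm = num_sources * num_groups   # one (S,G) entry per source/group pair
          sm = num_groups + num_sources * num_groups  # (*,G) plus (S,G) after SPT switchover
          bidir = num_groups               # one shared (*,G) entry per group
          return {"PIM-SSM": ssm,
                  "PIM-SM (worst case)": sm,
                  "PIM-BIDIR": bidir}

      # e.g. 1000 sources sending into 100 groups
      for mode, entries in pim_state_estimate(1000, 100).items():
          print(f"{mode}: ~{entries} entries")
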
3.2.  Layer 2 multicast

   With IPv4 unicast address resolution, the translation of an IP address to a MAC address is done dynamically by ARP.  With multicast address resolution, the mapping from a multicast IPv4 address to a multicast MAC address is done by assigning the low-order 23 bits of the multicast IPv4 address to fill the low-order 23 bits of the multicast MAC address.  Each IPv4 multicast address has 28 unique bits (the multicast address range is 224.0.0.0/4) and therefore mapping a multicast IP address to a MAC address ignores 5 bits of the IP address.  Hence, groups of 32 multicast IP addresses are mapped to the same MAC address, and so a multicast MAC address cannot be uniquely mapped back to a multicast IPv4 address.  Therefore, IPv4 multicast addresses must be chosen judiciously in order to avoid unnecessary address aliasing.  When sending IPv6 multicast packets on an Ethernet link, the corresponding destination MAC address is a direct mapping of the last 32 bits of the 128 bit IPv6 multicast address into the 48 bit MAC address.  It is possible for more than one IPv6 multicast address to map to the same 48 bit MAC address.

   The default behaviour of many hosts (and, in fact, routers) is to block multicast traffic.  Consequently, when a host wishes to join an IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376] report to the router attached to the Layer 2 segment and also instructs its data link layer to receive Ethernet frames that match the corresponding MAC address.  The data link layer filters the frames, passing those with matching destination addresses to the IP module.  Similarly, when sending to a multicast group, a host simply hands the packet to the data link layer, which adds the Layer 2 encapsulation using the MAC address derived in the manner previously discussed.

   When this Ethernet frame with a multicast MAC address is received by a switch configured to forward multicast traffic, the default behaviour is to flood it to all the ports in the Layer 2 segment.  Clearly there may not be a receiver for this multicast group present on each port, and IGMP snooping is used to avoid sending the frame out of ports without receivers.

   A switch running IGMP snooping listens to the IGMP messages exchanged between hosts and the router in order to identify which ports have active receivers for a specific multicast group, allowing the forwarding of multicast frames to be suitably constrained.  Normally, the multicast router will generate IGMP queries to which the hosts send IGMP reports in response.  However, a number of optimizations in which a switch generates IGMP queries (and so appears to be the router from the hosts' perspective) and/or generates IGMP reports (and so appears to be hosts from the router's perspective) are commonly used to improve performance by reducing the amount of state maintained at the router, suppressing superfluous IGMP messages and improving responsiveness when hosts join/leave the group.

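   The address mappings described at the start of this section, and the aliasing that results, can be illustrated with a short Python sketch.  It is purely illustrative; the first group address below is taken from the 233.252.0.0/24 example range and the second differs from it only in one of the five ignored bits, so both map to the same MAC address.

      # Mapping IPv4/IPv6 multicast group addresses to MAC addresses,
      # showing how 32 IPv4 group addresses alias to one MAC address.

      import ipaddress

      def ipv4_mcast_to_mac(addr):
          # low-order 23 bits of the group address are copied into the
          # low-order 23 bits of the fixed prefix 01:00:5e:00:00:00
          low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
          return "01:00:5e:%02x:%02x:%02x" % (
              (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF)

      def ipv6_mcast_to_mac(addr):
          # low-order 32 bits of the group address follow the prefix 33:33
          low32 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFFFF
          return "33:33:%02x:%02x:%02x:%02x" % (
              (low32 >> 24) & 0xFF, (low32 >> 16) & 0xFF,
              (low32 >> 8) & 0xFF, low32 & 0xFF)

      print(ipv4_mcast_to_mac("233.252.0.1"))     # 01:00:5e:7c:00:01
      print(ipv4_mcast_to_mac("233.124.0.1"))     # 01:00:5e:7c:00:01 (aliased)
      print(ipv6_mcast_to_mac("ff02::1:ff00:1"))  # 33:33:ff:00:00:01
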
   Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by IPv6 routers for discovering multicast listeners on a directly attached link, performing a similar function to IGMP in IPv4 networks.  MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810] [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD does not send its own distinct protocol messages.  Rather, MLD is a subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of ICMPv6 messages.  MLD snooping works similarly to IGMP snooping, described earlier.

3.3.  Example use cases

   A use case where PIM and IGMP are currently used in data centers is to support multicast in VXLAN deployments.  In the original VXLAN specification [RFC7348], a data-driven flood-and-learn control plane was proposed, requiring the data center IP fabric to support multicast routing.  A multicast group is associated with each virtual network, each uniquely identified by its VXLAN network identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located in the hypervisor or ToR switch, with local VMs that belong to this VNI would join the multicast group and use it for the exchange of BUM traffic with the other VTEPs.  Essentially, the VTEP would encapsulate any BUM traffic from attached VMs in an IP multicast packet, whose destination address is the associated multicast group address, and transmit the packet to the data center fabric.  Thus, PIM must be running in the fabric to maintain a multicast distribution tree per VNI.

   Alternatively, rather than setting up a multicast distribution tree per VNI, a tree can be set up whenever hosts within the VNI wish to exchange multicast traffic.  For example, whenever a VTEP receives an IGMP report from a locally connected host, it would translate this into a PIM join message which will be propagated into the IP fabric.  In order to ensure this join message is sent to the IP fabric rather than over the VXLAN interface (since the VTEP will have a route back to the source of the multicast packet over the VXLAN interface and so would naturally attempt to send the join over this interface), a more specific route back to the source over the IP fabric must be configured.  In this approach PIM must be configured on the SVIs associated with the VXLAN interface.

   Another use case of PIM and IGMP in data centers is when IPTV servers use multicast to deliver content from the data center to end users.  IPTV is typically a one-to-many application where the hosts are configured for IGMPv3, the switches are configured with IGMP snooping, and the routers are running PIM-SSM mode.  Often redundant servers send multicast streams into the network and the network forwards the data across diverse paths.

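   A minimal sketch of the flood-and-learn behaviour described in the first use case above is given below.  It is deliberately simplified and hypothetical: the VNIs, the VNI-to-group mapping and the use of a plain UDP socket stand in for a real VTEP implementation of the [RFC7348] encapsulation, but they show how BUM frames are handed to the multicast-enabled fabric.

      # Simplified VTEP behaviour for flood-and-learn VXLAN: BUM frames
      # are VXLAN-encapsulated and sent to the IP multicast group
      # associated with the frame's VNI, so every remote VTEP that has
      # joined the group receives a copy.  Values are hypothetical.

      import socket
      import struct

      VXLAN_PORT = 4789
      VNI_TO_GROUP = {                 # provisioned per virtual network
          10010: "233.252.0.10",
          10020: "233.252.0.20",
      }

      def vxlan_encap(vni, inner_frame):
          # 8-byte VXLAN header: I flag set, 24-bit VNI, rest reserved
          return struct.pack("!II", 0x08000000, vni << 8) + inner_frame

      def send_bum(vni, inner_frame):
          group = VNI_TO_GROUP[vni]
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
          sock.sendto(vxlan_encap(vni, inner_frame), (group, VXLAN_PORT))
          sock.close()

      # e.g. an ARP broadcast received from a local VM on VNI 10010
      send_bum(10010, b"\xff\xff\xff\xff\xff\xff" + b"<rest of frame>")
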
3.4.  Advantages and disadvantages

   Arguably the biggest advantage of using PIM and IGMP to support one-to-many communication in data centers is that these protocols are relatively mature.  Consequently, PIM is available in most routers and IGMP is supported by most hosts and routers.  As such, no specialized hardware or relatively immature software is involved in using these protocols in data centers.  Furthermore, the maturity of these protocols means their behaviour and performance in operational networks is well-understood, with widely available best practices and deployment guides for optimizing their performance.  For these reasons, PIM and IGMP have been used successfully for supporting one-to-many traffic flows within modern data centers, as discussed earlier.

   However, somewhat ironically, the relative disadvantages of PIM and IGMP usage in data centers also stem mostly from their maturity.  Specifically, these protocols were standardized and implemented long before the highly-virtualized multi-tenant data centers of today existed.  Consequently, PIM and IGMP are neither optimally placed to deal with the requirements of one-to-many communication in modern data centers nor to exploit the idiosyncrasies of data centers.  For example, there may be thousands of VMs participating in a multicast session, with some of these VMs migrating to other servers within the data center, new VMs being continually spun up and wishing to join the sessions while all the time other VMs are leaving.  In such a scenario, the churn in the PIM and IGMP state machines, the volume of control messages they would generate and the amount of state they would necessitate within routers, especially if they were deployed naively, would be untenable.  Furthermore, PIM is a relatively complex protocol.  As such, PIM can be challenging to debug even in significantly more benign deployments than those envisaged for future data centers, a fact that has evidently had a dissuasive effect on data center operators considering enabling it within the IP fabric.

4.  Alternative options for handling one-to-many traffic

   Section 2 has shown that there is likely to be an increasing amount of one-to-many communication in data centers for multiple reasons.  Section 3 has discussed how conventional multicast may be used to handle this traffic, presenting some of the associated advantages and disadvantages.  Unsurprisingly, as discussed in the remainder of Section 4, there are a number of alternative options for handling this traffic pattern in data centers.  Critically, it should be noted that many of these techniques are not mutually exclusive; in fact many deployments involve a combination of more than one of these techniques.  Furthermore, as will be shown, introducing a centralized controller or a distributed control plane typically makes these techniques more potent.

4.1.  Minimizing traffic volumes

   If handling one-to-many traffic flows in data centers is considered onerous, then arguably the most intuitive solution is to aim to minimize the volume of said traffic.

   It was previously mentioned in Section 2 that the three main contributors to one-to-many traffic in data centers are applications, overlays and protocols.  Typically the applications running on VMs are outside the control of the data center operator and thus, relatively speaking, little can be done about the volume of one-to-many traffic generated by applications.  Luckily, there is more scope for attempting to reduce the volume of such traffic generated by overlays and protocols (and often by protocols running within overlays).

   This reduction is possible by exploiting certain characteristics of data center networks such as a fixed and regular topology, single administrative control, consistent hardware and software, well-known overlay encapsulation endpoints and systematic IP address allocation.

   A way of minimizing the amount of one-to-many traffic that traverses the data center fabric is to use a centralized controller.  For example, whenever a new VM is instantiated, the hypervisor or encapsulation endpoint can notify a centralized controller of this new MAC address, the associated virtual network, IP address etc.  The controller could subsequently distribute this information to every encapsulation endpoint.  Consequently, when any endpoint receives an ARP request from a locally attached VM, it could simply consult its local copy of the information distributed by the controller and reply.  Thus, the ARP request is suppressed and does not result in one-to-many traffic traversing the data center IP fabric.

   Alternatively, the functionality supported by the controller can be realized by a distributed control plane.  BGP-EVPN [RFC7432] [RFC8365] is the most popular control plane used in data centers.  Typically, the encapsulation endpoints will exchange pertinent information with each other by all peering with a BGP route reflector (RR).  Thus, information such as local MAC addresses, MAC to IP address mappings, virtual network identifiers, IP prefixes, and local IGMP group membership can be disseminated.  Consequently, for example, ARP requests from local VMs can be suppressed by the encapsulation endpoint using the information learnt from the control plane about the MAC to IP mappings at remote peers.  In a similar fashion, encapsulation endpoints can use information gleaned from the BGP-EVPN messages to proxy for both IGMP reports and queries for the attached VMs, thus obviating the need to transmit IGMP messages across the data center fabric.

4.2.  Head end replication

   A popular option for handling one-to-many traffic patterns in data centers is head end replication (HER).  HER means the traffic is duplicated and sent to each end point individually using conventional IP unicast.  Obvious disadvantages of HER include traffic duplication and the additional processing burden on the head end.  Nevertheless, HER is especially attractive when overlays are in use as the replication can be carried out by the hypervisor or encapsulation end point.  Consequently, the VMs and IP fabric are unmodified and unaware of how the traffic is delivered to the multiple end points.  Additionally, a number of approaches may be used for constructing and disseminating the list of which endpoints should receive which traffic.

   For example, the reluctance of data center operators to enable PIM within the data center fabric means VXLAN is often used with HER.  Thus, BUM traffic from each VNI is replicated and sent using unicast to the remote VTEPs with VMs in that VNI.  The list of remote VTEPs to which the traffic should be sent may be configured manually on the VTEP.  Alternatively, the VTEPs may transmit pertinent local state to a centralized controller which in turn sends each VTEP the list of remote VTEPs for each VNI.  Lastly, HER also works well when a distributed control plane is used instead of the centralized controller.  Again, BGP-EVPN may be used to distribute the information needed to facilitate HER to the VTEPs.

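   The following Python sketch illustrates head end replication at a VTEP: one unicast VXLAN-encapsulated copy of the BUM frame is sent to each remote VTEP on the VNI's flood list.  The flood list is statically defined here purely for illustration; as noted above, it could equally be configured manually, pushed by a controller or learnt via BGP-EVPN.  The addresses, VNIs and helper functions are hypothetical.

      # Head end replication (HER): instead of relying on a multicast-
      # enabled fabric, the ingress VTEP sends one unicast VXLAN copy of
      # the BUM frame to every remote VTEP in the VNI's flood list.

      import socket
      import struct

      VXLAN_PORT = 4789
      FLOOD_LIST = {                   # VNI -> remote VTEP addresses
          10010: ["192.0.2.11", "192.0.2.12", "192.0.2.13"],
      }

      def vxlan_encap(vni, inner_frame):
          return struct.pack("!II", 0x08000000, vni << 8) + inner_frame

      def replicate_bum(vni, inner_frame):
          sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          packet = vxlan_encap(vni, inner_frame)
          for remote_vtep in FLOOD_LIST[vni]:
              # one copy per remote VTEP: simple, but the head end bears
              # the full replication cost
              sock.sendto(packet, (remote_vtep, VXLAN_PORT))
          sock.close()

      replicate_bum(10010, b"\xff\xff\xff\xff\xff\xff" + b"<rest of frame>")
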
4.3.  Programmable Forwarding Planes

   As discussed in Section 3.1, one of the main functions of PIM is to build and maintain multicast distribution trees.  Such a tree indicates the path a specific flow will take through the network.  Thus, in routers traversed by the flow, the information from PIM is ultimately used to create a multicast forwarding entry for the specific flow and insert it into the multicast forwarding table.  The multicast forwarding table will have entries for each multicast flow traversing the router, with the lookup key usually being a concatenation of the source and group addresses.  Critically, each entry will contain information such as the legal input interface for the flow and a list of output interfaces to which matching packets should be replicated.

   Viewed in this way, there is nothing remarkable about the multicast forwarding state constructed in routers based on the information gleaned from PIM.  In fact, it is perfectly feasible to build such state in the absence of PIM.  Such prospects have been significantly enhanced with the increasing popularity and performance of network devices with programmable forwarding planes.  These devices are attractive for use in data centers since they are amenable to being programmed by a centralized controller.  If such a controller has a global view of the sources and receivers for each multicast flow (which can be provided by the devices attached to the end hosts in the data center communicating with the controller) and an accurate representation of the data center topology (which is usually well-known), then it can readily compute the multicast forwarding state that must be installed at each router to ensure the one-to-many traffic flow is delivered properly to the correct receivers.  All that is needed is an API to program the forwarding planes of all the network devices that need to handle the flow appropriately.  Such APIs do in fact exist and so, unsurprisingly, handling one-to-many traffic flows using such an approach is attractive for data centers.

   Being able to program the forwarding plane in this manner offers the enticing possibility of introducing novel algorithms and concepts for forwarding multicast traffic in data centers.  These schemes typically aim to exploit the idiosyncrasies of the data center network architecture to create ingenious, pithy and elegant encodings of the information needed to facilitate multicast forwarding.  Depending on the scheme, this information may be carried in packet headers, stored in the multicast forwarding table in routers or a combination of both.  The key characteristic is that the terseness of the forwarding information means the volume of forwarding state is significantly reduced.  Additionally, the overhead associated with building and maintaining a multicast forwarding tree is eliminated.  The result of these reductions in the overhead associated with multicast forwarding is a significant and impressive increase in the effective number of multicast flows that can be supported within the data center.

   [Shabaz19] is a good example of such an approach and also presents a comprehensive discussion of other schemes in its treatment of related work.  Although a number of promising schemes have been proposed, no consensus has yet emerged as to which approach is best, and in fact what "best" means.  Even if a clear winner were to emerge, it faces significant challenges to gain the vendor and operator buy-in needed to ensure it is widely deployed in data centers.

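   The sketch below illustrates the kind of computation such a controller might perform: given the (well-known) topology and the locations of the ingress switch and the receivers for a flow, it derives the per-switch output port sets for that (S,G), which would then be installed through whatever forwarding-plane API is available.  The topology, names and omission of the actual device programming step are simplifications for illustration only.

      # Controller-side computation of multicast forwarding state for a
      # programmable fabric: walk the shortest paths from the ingress
      # switch towards each receiver's switch and record, per switch,
      # the output ports for this (S,G).  Topology is hypothetical.

      from collections import defaultdict, deque

      # adjacency: switch -> {neighbour: output port on this switch}
      TOPOLOGY = {
          "leaf1":  {"spine1": 1, "spine2": 2},
          "leaf2":  {"spine1": 1, "spine2": 2},
          "leaf3":  {"spine1": 1, "spine2": 2},
          "spine1": {"leaf1": 1, "leaf2": 2, "leaf3": 3},
          "spine2": {"leaf1": 1, "leaf2": 2, "leaf3": 3},
      }

      def shortest_path(src, dst):
          prev, queue = {src: None}, deque([src])
          while queue:
              node = queue.popleft()
              if node == dst:
                  break
              for nbr in TOPOLOGY[node]:
                  if nbr not in prev:
                      prev[nbr] = node
                      queue.append(nbr)
          path, node = [], dst
          while node is not None:
              path.append(node)
              node = prev[node]
          return path[::-1]

      def compute_entries(ingress_switch, egress_switches):
          # returns {switch: set of output ports} for one (S,G) flow
          entries = defaultdict(set)
          for egress in egress_switches:
              path = shortest_path(ingress_switch, egress)
              for here, nxt in zip(path, path[1:]):
                  entries[here].add(TOPOLOGY[here][nxt])
          return dict(entries)

      # source behind leaf1, receivers behind leaf2 and leaf3
      print(compute_entries("leaf1", ["leaf2", "leaf3"]))
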
4.4.  BIER

   As discussed in Section 3.4, PIM and IGMP face potential scalability challenges when deployed in data centers.  These challenges are typically due to the requirement to build and maintain a distribution tree and the requirement to hold per-flow state in routers.  Bit Index Explicit Replication (BIER) [RFC8279] is a new multicast forwarding paradigm that avoids these two requirements.

   When a multicast packet enters a BIER domain, the ingress router, known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER header to the packet.  This header contains a bit string in which each bit maps to an egress router, known as a Bit-Forwarding Egress Router (BFER).  If a bit is set, then the packet should be forwarded to the associated BFER.  The routers within the BIER domain, Bit-Forwarding Routers (BFRs), use the BIER header in the packet and information in the Bit Index Forwarding Table (BIFT) to carry out simple bitwise operations to determine how the packet should be replicated optimally so it reaches all the appropriate BFERs.

   BIER is deemed to be attractive for facilitating one-to-many communications in data centers [I-D.ietf-bier-use-cases].  The deployment envisioned with overlay networks is that the encapsulation endpoints would be the BFIRs.  Consequently, knowledge about the actual multicast groups does not reside in the data center fabric, improving the scalability compared to conventional IP multicast.  Additionally, a centralized controller or a BGP-EVPN control plane may be used with BIER to ensure the BFIRs have the required information.  A challenge associated with using BIER is that it requires changes to the forwarding behaviour of the routers used in the data center IP fabric.

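   The bitwise replication procedure described above can be captured in a few lines.  The following Python sketch is a simplification of the forwarding procedure in [RFC8279] (a single BIER sub-domain and set, with a hypothetical BIFT and neighbours); it shows how a BFR uses only bitwise operations on the packet's bit string to replicate a packet towards the BFERs whose bits are set, without holding any per-flow state.

      # Simplified BIER forwarding at a Bit-Forwarding Router (BFR).
      # Each BFER is represented by one bit position.  The BIFT maps a
      # bit position to the forwarding bit mask (all BFERs reachable via
      # a given neighbour) and that neighbour.  Values are hypothetical;
      # see RFC 8279 for the full procedure.

      BIFT = {
          # bit position: (forwarding bit mask, BFR neighbour)
          1: (0b0011, "bfr-a"),    # BFERs 1 and 2 reached via bfr-a
          2: (0b0011, "bfr-a"),
          3: (0b1100, "bfr-b"),    # BFERs 3 and 4 reached via bfr-b
          4: (0b1100, "bfr-b"),
      }

      def bier_forward(bitstring, payload, send):
          remaining = bitstring
          while remaining:
              # lowest-numbered bit still set in the packet's bit string
              bit_pos = (remaining & -remaining).bit_length()
              fbm, neighbour = BIFT[bit_pos]
              # the copy sent to this neighbour only carries the bits
              # reachable via it, preventing duplicate delivery
              send(neighbour, remaining & fbm, payload)
              remaining &= ~fbm

      def log_send(neighbour, bits, payload):
          print(f"to {neighbour}: bitstring={bits:04b} payload={payload!r}")

      # packet destined to BFERs 1, 2 and 4
      bier_forward(0b1011, b"example", log_send)
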
4.5.  Segment Routing

   Segment Routing (SR) [RFC8402] is a manifestation of the source routing paradigm, so called as the path a packet takes through a network is determined at the source.  The source encodes this information in the packet header as a sequence of instructions.  These instructions are followed by intermediate routers, ultimately resulting in the delivery of the packet to the desired destination.  In SR, the instructions are known as segments and a number of different kinds of segments have been defined.  Each segment has an identifier (SID) which is distributed throughout the network by newly defined extensions to standard routing protocols.  Thus, using this information, sources are able to determine the exact sequence of segments to encode into the packet.  The manner in which these instructions are encoded depends on the underlying data plane.  Segment Routing can be applied to the MPLS and IPv6 data planes.  In the former, the list of segments is represented by the label stack and in the latter it is represented as an IPv6 routing extension header.  Advantages of segment routing include the reduction in the amount of forwarding state routers need to hold and the removal of the need to run a signaling protocol, thus improving network scalability while reducing operational complexity.

   The advantages of segment routing and the ability to run it over an unmodified MPLS data plane mean that one of its anticipated use cases is in BGP-based large-scale data centers [RFC7938].  The exact manner in which multicast traffic will be handled in SR has not yet been standardized, with a number of different options being considered.  For example, since segments are simply encoded as a label stack with the MPLS data plane, the protocols traditionally used to create point-to-multipoint LSPs could be reused to allow SR to support one-to-many traffic flows.  Alternatively, a special SID may be defined for a multicast distribution tree, with a centralized controller being used to program routers appropriately to ensure the traffic is delivered to the desired destinations, while avoiding the costly process of building and maintaining a multicast distribution tree.

5.  Conclusions

   As the volume and importance of one-to-many traffic in data centers increases, conventional IP multicast is likely to become increasingly unattractive for deployment in data centers for a number of reasons, mostly pertaining to its relatively poor scalability and inability to exploit characteristics of data center network architectures.  Hence, even though IGMP/MLD is likely to remain the most popular manner in which end hosts signal interest in joining a multicast group, it is unlikely that this multicast traffic will be transported over the data center IP fabric using a multicast distribution tree built and maintained by PIM in the future.  Rather, approaches which exploit the idiosyncrasies of data center network architectures are better placed to deliver one-to-many traffic in data centers, especially when judiciously combined with a centralized controller and/or a distributed control plane, particularly one based on BGP-EVPN.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   No new security considerations result from this document.

8.  Acknowledgements

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

9.2.  Informative References

   [I-D.ietf-bier-use-cases]  Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A., Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C. Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-09 (work in progress), January 2019.

   [I-D.ietf-nvo3-geneve]  Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-13 (work in progress), March 2019.

   [I-D.ietf-nvo3-vxlan-gpe]  Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-07 (work in progress), April 2019.

   [RFC0826]  Plummer, D., "An Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware", STD 37, RFC 826, DOI 10.17487/RFC0826, November 1982, <https://www.rfc-editor.org/info/rfc826>.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, DOI 10.17487/RFC2236, November 1997, <https://www.rfc-editor.org/info/rfc2236>.

   [RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, DOI 10.17487/RFC2710, October 1999, <https://www.rfc-editor.org/info/rfc2710>.

   [RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, DOI 10.17487/RFC3376, October 2002, <https://www.rfc-editor.org/info/rfc3376>.

   [RFC3810]  Vida, R., Ed. and L. Costa, Ed., "Multicast Listener Discovery Version 2 (MLDv2) for IPv6", RFC 3810, DOI 10.17487/RFC3810, June 2004, <https://www.rfc-editor.org/info/rfc3810>.

   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006, <https://www.rfc-editor.org/info/rfc4443>.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas, "Protocol Independent Multicast - Sparse Mode (PIM-SM): Protocol Specification (Revised)", RFC 4601, DOI 10.17487/RFC4601, August 2006, <https://www.rfc-editor.org/info/rfc4601>.

   [RFC4604]  Holbrook, H., Cain, B., and B. Haberman, "Using Internet Group Management Protocol Version 3 (IGMPv3) and Multicast Listener Discovery Protocol Version 2 (MLDv2) for Source-Specific Multicast", RFC 4604, DOI 10.17487/RFC4604, August 2006, <https://www.rfc-editor.org/info/rfc4604>.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for IP", RFC 4607, DOI 10.17487/RFC4607, August 2006, <https://www.rfc-editor.org/info/rfc4607>.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, DOI 10.17487/RFC4861, September 2007, <https://www.rfc-editor.org/info/rfc4861>.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano, "Bidirectional Protocol Independent Multicast (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007, <https://www.rfc-editor.org/info/rfc5015>.

   [RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution Problems in Large Data Center Networks", RFC 6820, DOI 10.17487/RFC6820, January 2013, <https://www.rfc-editor.org/info/rfc6820>.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, <https://www.rfc-editor.org/info/rfc7348>.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 2015, <https://www.rfc-editor.org/info/rfc7432>.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network Virtualization Using Generic Routing Encapsulation", RFC 7637, DOI 10.17487/RFC7637, September 2015, <https://www.rfc-editor.org/info/rfc7637>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, August 2016, <https://www.rfc-editor.org/info/rfc7938>.

   [RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. Narten, "An Architecture for Data-Center Network Virtualization over Layer 3 (NVO3)", RFC 8014, DOI 10.17487/RFC8014, December 2016, <https://www.rfc-editor.org/info/rfc8014>.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017, <https://www.rfc-editor.org/info/rfc8279>.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, March 2018, <https://www.rfc-editor.org/info/rfc8365>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, July 2018, <https://www.rfc-editor.org/info/rfc8402>.

   [Shabaz19]  Shahbaz, M., Suresh, L., Rexford, J., Feamster, N., Rottenstreich, O., and M. Hira, "Elmo: Source Routed Multicast for Public Clouds", ACM SIGCOMM 2019 Conference (SIGCOMM '19), ACM, DOI 10.1145/3341302.3342066, August 2019.

   [SMPTE2110]  "SMPTE2110 Standards Suite".

Authors' Addresses

   Mike McBride
   Futurewei

   Email: michael.mcbride@futurewei.com

   Olufemi Komolafe
   Arista Networks

   Email: femi@arista.com