MBONED                                                        M. McBride
Internet-Draft                                                 Futurewei
Intended status: Informational                               O. Komolafe
Expires: August 7, 2020                                  Arista Networks
                                                        February 4, 2020

                 Multicast in the Data Center Overview
                     draft-ietf-mboned-dc-deploy-09

Abstract

   The volume and importance of one-to-many traffic patterns in data
   centers are likely to increase significantly in the future.  Reasons
   for this increase are discussed and then attention is paid to the
   manner in which this traffic pattern may be judiciously handled in
   data centers.  The intuitive solution of deploying conventional IP
   multicast within data centers is explored and evaluated.
   Thereafter, a number of emerging innovative approaches are described
   before recommendations are made.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reasons for increasing one-to-many traffic patterns
     2.1.  Applications
     2.2.  Overlays
     2.3.  Protocols
     2.4.  Summary
   3.  Handling one-to-many traffic using conventional multicast
     3.1.  Layer 3 multicast
     3.2.  Layer 2 multicast
     3.3.  Example use cases
     3.4.  Advantages and disadvantages
   4.  Alternative options for handling one-to-many traffic
     4.1.  Minimizing traffic volumes
     4.2.  Head end replication
     4.3.  Programmable Forwarding Planes
     4.4.  BIER
     4.5.  Segment Routing
   5.  Conclusions
   6.  IANA Considerations
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   The volume and importance of one-to-many traffic patterns in data
   centers will likely continue to increase.
   Reasons for this increase
   include the nature of the traffic generated by applications hosted
   in the data center, the need to handle broadcast, unknown unicast
   and multicast (BUM) traffic within the overlay technologies used to
   support multi-tenancy at scale, and the use of certain protocols
   that traditionally require one-to-many control message exchanges.

   These trends, allied with the expectation that highly virtualized
   large-scale data centers must support communication between
   potentially thousands of participants, may lead to the natural
   assumption that IP multicast will be widely used in data centers,
   especially given the bandwidth savings it potentially offers.
   However, such an assumption would be wrong.  In fact, there is
   widespread reluctance to enable conventional IP multicast in data
   centers for a number of reasons, mostly pertaining to concerns about
   its scalability and reliability.

   This draft discusses some of the main drivers for the increasing
   volume and importance of one-to-many traffic patterns in data
   centers.  Thereafter, the manner in which conventional IP multicast
   may be used to handle this traffic pattern is discussed and some of
   the associated challenges highlighted.  Following this discussion, a
   number of alternative emerging approaches are introduced, before
   concluding by discussing key trends and making a number of
   recommendations.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Reasons for increasing one-to-many traffic patterns

2.1.  Applications

   Key trends suggest that the nature of the applications likely to
   dominate future highly-virtualized multi-tenant data centers will
   produce large volumes of one-to-many traffic.  For example, it is
   well-known that traffic flows in data centers have evolved from
   being predominantly North-South (e.g. client-server) to
   predominantly East-West (e.g. distributed computation).  This change
   has led to the consensus that topologies, such as Leaf/Spine, which
   are easier to scale in the East-West direction are better suited to
   the data center of the future.  This increase in East-West traffic
   flows results from VMs often having to exchange numerous messages
   between themselves as part of executing a specific workload.  For
   example, a computational workload could require data, or an
   executable, to be disseminated to workers distributed throughout the
   data center, which may subsequently be polled for status updates.
   The emergence of such applications means there is likely to be an
   increase in one-to-many traffic flows with the increasing dominance
   of East-West traffic.

   The TV broadcast industry is another potential future source of
   applications with one-to-many traffic patterns in data centers.  The
   requirement for robustness, stability and predictability has meant
   the TV broadcast industry has traditionally used TV-specific
   protocols, infrastructure and technologies for transmitting video
   signals between end points such as cameras, monitors, mixers,
   graphics devices and video servers.
   However, the growing cost and
   complexity of supporting this approach, especially as the bit rates
   of the video signals increase due to demand for formats such as
   4K-UHD and 8K-UHD, means there is a consensus that the TV broadcast
   industry will transition from industry-specific transmission formats
   (e.g. SDI, HD-SDI) over TV-specific infrastructure to using IP-based
   infrastructure.  The development of pertinent standards by the
   Society of Motion Picture and Television Engineers (SMPTE)
   [SMPTE2110], along with the increasing performance of IP routers,
   means this transition is gathering pace.  A possible outcome of this
   transition will be the building of IP data centers in broadcast
   plants.  Traffic flows in the broadcast industry are frequently
   one-to-many and so if IP data centers are deployed in broadcast
   plants, it is imperative that this traffic pattern is supported
   efficiently in that infrastructure.  In fact, a pivotal
   consideration for broadcasters considering transitioning to IP is
   the manner in which these one-to-many traffic flows will be managed
   and monitored in a data center with an IP fabric.

   One of the few success stories in using conventional IP multicast
   has been for disseminating market trading data.  For example, IP
   multicast is commonly used today to deliver stock quotes from stock
   exchanges to financial service providers and then to the stock
   analysts or brokerages.  It is essential that the network
   infrastructure delivers very low latency and high throughput,
   especially given the proliferation of automated and algorithmic
   trading, which means stock analysts or brokerages may gain an edge
   on competitors simply by receiving an update a few milliseconds
   earlier.  As would be expected, in such deployments reliability is
   critical.  The network must be designed with no single point of
   failure and in such a way that it can respond in a deterministic
   manner to failure.  Typically, redundant servers (in a
   primary/backup or live-live mode) send multicast streams into the
   network, with diverse paths being used across the network.  The
   stock exchange generating the one-to-many traffic and the stock
   analysts/brokerages that receive the traffic will typically have
   their own data centers.  Therefore, the manner in which one-to-many
   traffic patterns are handled in these data centers is extremely
   important, especially given the requirements and constraints
   mentioned.

   Another reason for the growing volume of one-to-many traffic
   patterns in modern data centers is the increasing adoption of
   streaming telemetry.  This transition is motivated by the
   observation that traditional poll-based approaches for monitoring
   network devices are usually inadequate in modern data centers.
   These approaches typically suffer from poor scalability,
   extensibility and responsiveness.  In contrast, in streaming
   telemetry, network devices in the data center stream highly-granular
   real-time updates to a telemetry collector/database.  This collector
   then collates, normalizes and encodes this data for convenient
   consumption by monitoring applications.  The monitoring applications
   can subscribe to the notifications of interest, allowing them to
   gain insight into pertinent state and performance metrics.
   Thus, the traffic flows associated with streaming telemetry are
   typically many-to-one between the network devices and the telemetry
   collector and then one-to-many from the collector to the monitoring
   applications.

   The use of publish and subscribe applications is growing within data
   centers, contributing to the rising volume of one-to-many traffic
   flows.  Such applications are attractive as they provide a robust
   low-latency asynchronous messaging service, allowing senders to be
   decoupled from receivers.  The usual approach is for a publisher to
   create and transmit a message to a specific topic.  The publish and
   subscribe application will retain the message and ensure it is
   delivered to all subscribers to that topic.  The flexibility in the
   number of publishers and subscribers to a specific topic means such
   applications cater for one-to-one, one-to-many and many-to-one
   traffic patterns.

2.2.  Overlays

   Another key contributor to the rise in one-to-many traffic patterns
   is the proposed architecture for supporting large-scale
   multi-tenancy in highly virtualized data centers [RFC8014].  In this
   architecture, a tenant's VMs are distributed across the data center
   and are connected by a virtual network known as the overlay network.
   A number of different technologies have been proposed for realizing
   the overlay network, including VXLAN [RFC7348], VXLAN-GPE
   [I-D.ietf-nvo3-vxlan-gpe], NVGRE [RFC7637] and GENEVE
   [I-D.ietf-nvo3-geneve].  The often fervent and arguably partisan
   debate about the relative merits of these overlay technologies
   belies the fact that, conceptually, these overlays simply provide a
   means to encapsulate and tunnel Ethernet frames from the VMs over
   the data center IP fabric, thus emulating a Layer 2 segment between
   the VMs.  Consequently, the VMs believe and behave as if they are
   connected to the tenant's other VMs by a conventional Layer 2
   segment, regardless of their physical location within the data
   center.

   Naturally, in a Layer 2 segment, point-to-multipoint traffic can
   result from handling BUM (broadcast, unknown unicast and multicast)
   traffic.  Compounding this issue within data centers, since the
   tenant's VMs attached to the emulated segment may be dispersed
   throughout the data center, the BUM traffic may need to traverse the
   data center fabric.

   Hence, regardless of the overlay technology used, due consideration
   must be given to handling BUM traffic, forcing the data center
   operator to pay attention to the manner in which one-to-many
   communication is handled within the data center.  This consideration
   is likely to become increasingly important with the anticipated rise
   in the number and importance of overlays.  In fact, it may be
   asserted that the manner in which one-to-many communications arising
   from overlays are handled is pivotal to the performance and
   stability of the entire data center network.

2.3.  Protocols

   Conventionally, some key networking protocols used in data centers
   require one-to-many communications for control messages.  Thus, the
   data center operator must pay due attention to how these control
   message exchanges are supported.

   For example, ARP [RFC0826] and ND [RFC4861] use broadcast and
   multicast messages within IPv4 and IPv6 networks respectively to
   discover IP address to MAC address mappings.
   Furthermore, when these
   protocols are running within an overlay network, it is essential to
   ensure the messages are delivered to all the hosts on the emulated
   Layer 2 segment, regardless of physical location within the data
   center.  The challenges associated with optimally delivering ARP and
   ND messages in data centers have attracted considerable attention
   [RFC6820].

   Another example of a protocol that may necessitate having
   one-to-many traffic flows in the data center is IGMP [RFC2236],
   [RFC3376].  If the VMs attached to the Layer 2 segment wish to join
   a multicast group they must send IGMP reports in response to queries
   from the querier.  As these devices could be located at different
   locations within the data center, there is the somewhat ironic
   prospect of IGMP itself leading to an increase in the volume of
   one-to-many communications in the data center.

2.4.  Summary

   Section 2.1, Section 2.2 and Section 2.3 have discussed how the
   trends in the types of applications, the overlay technologies used
   and some of the essential networking protocols result in an increase
   in the volume of one-to-many traffic patterns in modern
   highly-virtualized data centers.  Section 3 explores how such
   traffic flows may be handled using conventional IP multicast.

3.  Handling one-to-many traffic using conventional multicast

   Faced with ever-increasing volumes of one-to-many traffic flows, for
   the reasons presented in Section 2, it makes sense for a data center
   operator to explore if and how conventional IP multicast could be
   deployed within the data center.  This section introduces the key
   protocols, discusses some example use cases where they are deployed
   in data centers and discusses some of the advantages and
   disadvantages of such deployments.

3.1.  Layer 3 multicast

   PIM is the most widely deployed multicast routing protocol and so,
   unsurprisingly, is the primary multicast routing protocol considered
   for use in the data center.  There are three popular modes of PIM
   that may be used: PIM-SM [RFC4601], PIM-SSM [RFC4607] or PIM-BIDIR
   [RFC5015].  It may be said that these different modes of PIM trade
   off the optimality of the multicast forwarding tree against the
   amount of multicast forwarding state that must be maintained at
   routers.  SSM provides the most efficient forwarding between sources
   and receivers and thus is most suitable for applications with
   one-to-many traffic patterns.  State is built and maintained for
   each (S,G) flow.  Thus, the amount of multicast forwarding state
   held by routers in the data center is proportional to the number of
   sources and groups.  At the other end of the spectrum, BIDIR is the
   most efficient shared tree solution as one tree is built for all
   flows, therefore minimizing the amount of state.  This state
   reduction is at the expense of the optimality of the forwarding
   paths between sources and receivers.  The use of a shared tree makes
   BIDIR particularly well-suited for applications with many-to-many
   traffic patterns, given that the amount of state is uncorrelated to
   the number of sources.  SSM and BIDIR are optimizations of PIM-SM.
   PIM-SM remains the most widely deployed mode of PIM and can also be
   the most complex.
   PIM-SM relies upon an RP (Rendezvous Point) to
   set up the multicast tree and subsequently there is the option of
   switching to the SPT (shortest path tree), similar to SSM, or
   staying on the shared tree, similar to BIDIR.

3.2.  Layer 2 multicast

   With IPv4 unicast address resolution, the translation of an IP
   address to a MAC address is done dynamically by ARP.  With multicast
   address resolution, the mapping from a multicast IPv4 address to a
   multicast MAC address is done by assigning the low-order 23 bits of
   the multicast IPv4 address to fill the low-order 23 bits of the
   multicast MAC address.  Each IPv4 multicast address has 28 unique
   bits (the multicast address range is 224.0.0.0/4) and therefore
   mapping a multicast IP address to a MAC address ignores 5 bits of
   the IP address.  Hence, groups of 32 multicast IP addresses are
   mapped to the same MAC address and so a multicast MAC address cannot
   be uniquely mapped back to a single multicast IPv4 address.
   Therefore, IPv4 multicast addresses must be chosen judiciously in
   order to avoid unnecessary address aliasing.  When sending IPv6
   multicast packets on an Ethernet link, the corresponding destination
   MAC address is a direct mapping of the last 32 bits of the 128 bit
   IPv6 multicast address into the 48 bit MAC address.  It is possible
   for more than one IPv6 multicast address to map to the same 48 bit
   MAC address.

   The default behaviour of many hosts (and, in fact, routers) is to
   block multicast traffic.  Consequently, when a host wishes to join
   an IPv4 multicast group, it sends an IGMP [RFC2236], [RFC3376]
   report to the router attached to the Layer 2 segment and also
   instructs its data link layer to receive Ethernet frames that match
   the corresponding MAC address.  The data link layer filters the
   frames, passing those with matching destination addresses to the IP
   module.  Similarly, when transmitting multicast packets, hosts
   simply hand them to the data link layer, which adds the Layer 2
   encapsulation using the MAC address derived in the manner previously
   discussed.

   When this Ethernet frame with a multicast MAC address is received by
   a switch configured to forward multicast traffic, the default
   behaviour is to flood it to all the ports in the Layer 2 segment.
   Clearly there may not be a receiver for this multicast group present
   on each port and IGMP snooping is used to avoid sending the frame
   out of ports without receivers.

   A switch running IGMP snooping listens to the IGMP messages
   exchanged between hosts and the router in order to identify which
   ports have active receivers for a specific multicast group, allowing
   the forwarding of multicast frames to be suitably constrained.
   Normally, the multicast router will generate IGMP queries to which
   the hosts send IGMP reports in response.  However, a number of
   optimizations in which a switch generates IGMP queries (and so
   appears to be the router from the hosts' perspective) and/or
   generates IGMP reports (and so appears to be hosts from the router's
   perspective) are commonly used to improve performance by reducing
   the amount of state maintained at the router, suppressing
   superfluous IGMP messages and improving responsiveness when hosts
   join/leave the group.

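   The address mappings and snooping behaviour described in this
   section can be summarized with a short illustrative sketch.  The
   Python fragment below is not taken from any implementation or
   specification; it merely demonstrates the 23-bit (IPv4) and 32-bit
   (IPv6) mappings, the resulting address aliasing, and the kind of
   per-group, per-port state an IGMP snooping switch might maintain.
   All names, ports and example addresses are hypothetical.

      # Illustrative sketch of multicast IP-to-MAC mapping and IGMP
      # snooping state (not from any implementation).
      import ipaddress

      def ipv4_mcast_to_mac(addr: str) -> str:
          # Place the low-order 23 bits of the IPv4 group address into
          # the low-order 23 bits of the 01:00:5e:00:00:00 MAC prefix.
          # The 5 remaining significant bits of the group address are
          # ignored, so 32 IPv4 groups share each MAC address.
          low23 = int(ipaddress.IPv4Address(addr)) & 0x7FFFFF
          mac = 0x01005E000000 | low23
          return ':'.join(f'{(mac >> s) & 0xFF:02x}' for s in range(40, -8, -8))

      def ipv6_mcast_to_mac(addr: str) -> str:
          # Place the low-order 32 bits of the IPv6 group address after
          # the 33:33 MAC prefix.
          low32 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFFFF
          mac = 0x333300000000 | low32
          return ':'.join(f'{(mac >> s) & 0xFF:02x}' for s in range(40, -8, -8))

      # Address aliasing: these two groups map to the same MAC address.
      assert ipv4_mcast_to_mac('233.252.0.1') == ipv4_mcast_to_mac('234.124.0.1')

      # Minimal IGMP snooping table: group address -> set of ports.
      snooping_table = {}

      def igmp_report(group: str, port: int) -> None:
          snooping_table.setdefault(group, set()).add(port)

      def igmp_leave(group: str, port: int) -> None:
          snooping_table.get(group, set()).discard(port)

      def egress_ports(group: str, router_ports: set) -> set:
          # Forward only to ports with known receivers, plus router
          # ports, instead of flooding every port in the segment.
          return snooping_table.get(group, set()) | router_ports
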
   Multicast Listener Discovery (MLD) [RFC2710] [RFC3810] is used by
   IPv6 routers for discovering multicast listeners on a directly
   attached link, performing a similar function to IGMP in IPv4
   networks.  MLDv1 [RFC2710] is similar to IGMPv2 and MLDv2 [RFC3810]
   [RFC4604] is similar to IGMPv3.  However, in contrast to IGMP, MLD
   does not send its own distinct protocol messages.  Rather, MLD is a
   subprotocol of ICMPv6 [RFC4443] and so MLD messages are a subset of
   ICMPv6 messages.  MLD snooping works similarly to IGMP snooping,
   described earlier.

3.3.  Example use cases

   A use case where PIM and IGMP are currently used in data centers is
   to support multicast in VXLAN deployments.  In the original VXLAN
   specification [RFC7348], a data-driven flood and learn control plane
   was proposed, requiring the data center IP fabric to support
   multicast routing.  A multicast group is associated with each
   virtual network, each uniquely identified by its VXLAN network
   identifier (VNI).  VXLAN tunnel endpoints (VTEPs), typically located
   in the hypervisor or ToR switch, with local VMs that belong to this
   VNI would join the multicast group and use it for the exchange of
   BUM traffic with the other VTEPs.  Essentially, the VTEP would
   encapsulate any BUM traffic from attached VMs in an IP multicast
   packet, whose destination address is the associated multicast group
   address, and transmit the packet to the data center fabric.  Thus, a
   multicast routing protocol (typically PIM) must be running in the
   fabric to maintain a multicast distribution tree per VNI.

   Alternatively, rather than setting up a multicast distribution tree
   per VNI, a tree can be set up whenever hosts within the VNI wish to
   exchange multicast traffic.  For example, whenever a VTEP receives
   an IGMP report from a locally connected host, it would translate
   this into a PIM join message which will be propagated into the IP
   fabric.  In order to ensure this join message is sent to the IP
   fabric rather than over the VXLAN interface (since the VTEP will
   have a route back to the source of the multicast packet over the
   VXLAN interface and so would naturally attempt to send the join over
   this interface), a more specific route back to the source over the
   IP fabric must be configured.  In this approach PIM must be
   configured on the switch virtual interfaces (SVIs) associated with
   the VXLAN interface.

   Another use case of PIM and IGMP in data centers is when IPTV
   servers use multicast to deliver content from the data center to end
   users.  IPTV is typically a one-to-many application where the hosts
   are configured for IGMPv3, the switches are configured with IGMP
   snooping, and the routers are running PIM-SSM mode.  Often redundant
   servers send multicast streams into the network and the network
   forwards the data across diverse paths.

3.4.  Advantages and disadvantages

   Arguably the biggest advantage of using PIM and IGMP to support
   one-to-many communication in data centers is that these protocols
   are relatively mature.  Consequently, PIM is available in most
   routers and IGMP is supported by most hosts and routers.  As such,
   no specialized hardware or relatively immature software is involved
   in using these protocols in data centers.
   Furthermore, the maturity
   of these protocols means their behaviour and performance in
   operational networks is well-understood, with widely available best
   practices and deployment guides for optimizing their performance.
   For these reasons, PIM and IGMP have been used successfully for
   supporting one-to-many traffic flows within modern data centers, as
   discussed earlier.

   However, somewhat ironically, the relative disadvantages of PIM and
   IGMP usage in data centers also stem mostly from their maturity.
   Specifically, these protocols were standardized and implemented long
   before the highly-virtualized multi-tenant data centers of today
   existed.  Consequently, PIM and IGMP are neither optimally placed to
   deal with the requirements of one-to-many communication in modern
   data centers nor to exploit the idiosyncrasies of data centers.  For
   example, there may be thousands of VMs participating in a multicast
   session, with some of these VMs migrating between servers within the
   data center, new VMs being continually spun up and wishing to join
   the sessions while all the time other VMs are leaving.  In such a
   scenario, the churn in the PIM and IGMP state machines, the volume
   of control messages they would generate and the amount of state they
   would necessitate within routers, especially if they were deployed
   naively, would be untenable.  Furthermore, PIM is a relatively
   complex protocol.  As such, PIM can be challenging to debug even in
   significantly more benign deployments than those envisaged for
   future data centers, a fact that has evidently had a dissuasive
   effect on data center operators considering enabling it within the
   IP fabric.

4.  Alternative options for handling one-to-many traffic

   Section 2 has shown that there is likely to be an increasing amount
   of one-to-many communication in data centers for multiple reasons,
   and Section 3 has discussed how conventional multicast may be used
   to handle this traffic, presenting some of the associated advantages
   and disadvantages.  Unsurprisingly, as discussed in the remainder of
   Section 4, there are a number of alternative options for handling
   this traffic pattern in data centers.  Critically, it should be
   noted that many of these techniques are not mutually exclusive; in
   fact many deployments involve a combination of more than one of
   these techniques.  Furthermore, as will be shown, introducing a
   centralized controller or a distributed control plane typically
   makes these techniques more potent.

4.1.  Minimizing traffic volumes

   If handling one-to-many traffic flows in data centers is considered
   onerous, then arguably the most intuitive solution is to aim to
   minimize the volume of said traffic.

   It was previously mentioned in Section 2 that the three main
   contributors to one-to-many traffic in data centers are
   applications, overlays and protocols.  Typically the applications
   running on VMs are outside the control of the data center operator
   and thus, relatively speaking, little can be done about the volume
   of one-to-many traffic generated by applications.  Luckily, there is
   more scope for attempting to reduce the volume of such traffic
   generated by overlays and protocols (and often by protocols within
   overlays).
   This reduction is possible by exploiting certain
   characteristics of data center networks such as a fixed and regular
   topology, single administrative control, consistent hardware and
   software, well-known overlay encapsulation endpoints and systematic
   IP address allocation.

   A way of minimizing the amount of one-to-many traffic that traverses
   the data center fabric is to use a centralized controller.  For
   example, whenever a new VM is instantiated, the hypervisor or
   encapsulation endpoint can notify a centralized controller of the
   new VM's MAC address, IP address and associated virtual network.
   The controller could subsequently distribute this information to
   every encapsulation endpoint.  Consequently, when any endpoint
   receives an ARP request from a locally attached VM, it could simply
   consult its local copy of the information distributed by the
   controller and reply.  Thus, the ARP request is suppressed and does
   not result in one-to-many traffic traversing the data center IP
   fabric.

   Alternatively, the functionality supported by the controller can be
   realized by a distributed control plane.  BGP-EVPN [RFC7432]
   [RFC8365] is the most popular control plane used in data centers.
   Typically, the encapsulation endpoints will exchange pertinent
   information with each other by all peering with a BGP route
   reflector (RR).  Thus, information such as local MAC addresses, MAC
   to IP address mappings, virtual network identifiers, IP prefixes,
   and local IGMP group membership can be disseminated.  Consequently,
   for example, ARP requests from local VMs can be suppressed by the
   encapsulation endpoint using the information learnt from the control
   plane about the MAC to IP mappings at remote peers.  In a similar
   fashion, encapsulation endpoints can use information gleaned from
   the BGP-EVPN messages to proxy for both IGMP reports and queries for
   the attached VMs, thus obviating the need to transmit IGMP messages
   across the data center fabric.

4.2.  Head end replication

   A popular option for handling one-to-many traffic patterns in data
   centers is head end replication (HER).  HER means the traffic is
   duplicated and sent to each endpoint individually using conventional
   IP unicast.  Obvious disadvantages of HER include traffic
   duplication and the additional processing burden on the head end.
   Nevertheless, HER is especially attractive when overlays are in use
   as the replication can be carried out by the hypervisor or
   encapsulation endpoint.  Consequently, the VMs and IP fabric are
   unmodified and unaware of how the traffic is delivered to the
   multiple endpoints.  Additionally, a number of approaches may be
   used for constructing and disseminating the list of which endpoints
   should receive which traffic.

   For example, the reluctance of data center operators to enable PIM
   within the data center fabric means VXLAN is often used with HER.
   Thus, BUM traffic from each VNI is replicated and sent using unicast
   to remote VTEPs with VMs in that VNI.  The list of remote VTEPs to
   which the traffic should be sent may be configured manually on the
   VTEP.  Alternatively, the VTEPs may transmit pertinent local state
   to a centralized controller which in turn sends each VTEP the list
   of remote VTEPs for each VNI.  Lastly, HER also works well when a
   distributed control plane is used instead of the centralized
   controller.
   Again, BGP-EVPN may be used to distribute the
   information needed to facilitate HER to the VTEPs.

4.3.  Programmable Forwarding Planes

   As discussed in Section 3, one of the main functions of PIM is to
   build and maintain multicast distribution trees.  Such a tree
   indicates the path a specific flow will take through the network.
   Thus, in routers traversed by the flow, the information from PIM is
   ultimately used to create a multicast forwarding entry for the
   specific flow and insert it into the multicast forwarding table.
   The multicast forwarding table will have entries for each multicast
   flow traversing the router, with the lookup key usually being a
   concatenation of the source and group addresses.  Critically, each
   entry will contain information such as the legal input interface for
   the flow and a list of output interfaces to which matching packets
   should be replicated.

   Viewed in this way, there is nothing remarkable about the multicast
   forwarding state constructed in routers based on the information
   gleaned from PIM.  In fact, it is perfectly feasible to build such
   state in the absence of PIM.  Such prospects have been significantly
   enhanced with the increasing popularity and performance of network
   devices with programmable forwarding planes.  These devices are
   attractive for use in data centers since they are amenable to being
   programmed by a centralized controller.  If such a controller has a
   global view of the sources and receivers for each multicast flow
   (which can be provided by the devices attached to the end hosts in
   the data center communicating with the controller) and an accurate
   representation of the data center topology (which is usually
   well-known), then it can readily compute the multicast forwarding
   state that must be installed at each router to ensure the
   one-to-many traffic flow is delivered properly to the correct
   receivers.  All that is needed is an API to program the forwarding
   planes of all the network devices that need to handle the flow
   appropriately.  Such APIs do in fact exist and so, unsurprisingly,
   handling one-to-many traffic flows using such an approach is
   attractive for data centers.

   Being able to program the forwarding plane in this manner offers the
   enticing possibility of introducing novel algorithms and concepts
   for forwarding multicast traffic in data centers.  These schemes
   typically aim to exploit the idiosyncrasies of the data center
   network architecture to create ingenious, pithy and elegant
   encodings of the information needed to facilitate multicast
   forwarding.  Depending on the scheme, this information may be
   carried in packet headers, stored in the multicast forwarding table
   in routers or a combination of both.  The key characteristic is that
   the terseness of the forwarding information means the volume of
   forwarding state is significantly reduced.  Additionally, the
   overhead associated with building and maintaining a multicast
   forwarding tree is eliminated.  The result of these reductions in
   the overhead associated with multicast forwarding is a significant
   and impressive increase in the effective number of multicast flows
   that can be supported within the data center.

   [Shabaz19] is a good example of such an approach and also presents a
   comprehensive discussion of other related schemes.
   Although a number of
   promising schemes have been proposed, no consensus has yet emerged
   as to which approach is best, or indeed what "best" means.  Even if
   a clear winner were to emerge, it would face significant challenges
   in gaining the vendor and operator buy-in needed to ensure it is
   widely deployed in data centers.

4.4.  BIER

   As discussed in Section 3.4, PIM and IGMP face potential scalability
   challenges when deployed in data centers.  These challenges are
   typically due to the requirement to build and maintain a
   distribution tree and the requirement to hold per-flow state in
   routers.  Bit Index Explicit Replication (BIER) [RFC8279] is a new
   multicast forwarding paradigm that avoids these two requirements.

   When a multicast packet enters a BIER domain, the ingress router,
   known as the Bit-Forwarding Ingress Router (BFIR), adds a BIER
   header to the packet.  This header contains a bit string in which
   each bit maps to an egress router, known as a Bit-Forwarding Egress
   Router (BFER).  If a bit is set, then the packet should be forwarded
   to the associated BFER.  The routers within the BIER domain,
   Bit-Forwarding Routers (BFRs), use the BIER header in the packet and
   information in the Bit Index Forwarding Table (BIFT) to carry out
   simple bit-wise operations to determine how the packet should be
   replicated optimally so it reaches all the appropriate BFERs.

   BIER is deemed to be attractive for facilitating one-to-many
   communications in data centers [I-D.ietf-bier-use-cases].  In the
   deployment envisioned with overlay networks, the encapsulation
   endpoints act as the BFIRs.  Thus, knowledge about the actual
   multicast groups does not reside in the data center fabric,
   improving the scalability compared to conventional IP multicast.
   Additionally, a centralized controller or a BGP-EVPN control plane
   may be used with BIER to ensure the BFIRs have the required
   information.  A challenge associated with using BIER is that it
   requires changes to the forwarding behaviour of the routers used in
   the data center IP fabric.

4.5.  Segment Routing

   Segment Routing (SR) [RFC8402] is a manifestation of the source
   routing paradigm, so called because the path a packet takes through
   a network is determined at the source.  The source encodes this
   information in the packet header as a sequence of instructions.
   These instructions are followed by intermediate routers, ultimately
   resulting in the delivery of the packet to the desired destination.
   In SR, the instructions are known as segments and a number of
   different kinds of segments have been defined.  Each segment has an
   identifier (SID) which is distributed throughout the network by
   newly defined extensions to standard routing protocols.  Thus, using
   this information, sources are able to determine the exact sequence
   of segments to encode into the packet.  The manner in which these
   instructions are encoded depends on the underlying data plane.
   Segment Routing can be applied to the MPLS and IPv6 data planes.  In
   the former, the list of segments is represented by the label stack
   and in the latter it is represented as an IPv6 routing extension
   header.  Advantages of segment routing include the reduction in the
   amount of forwarding state routers need to hold and the removal of
   the need to run a signaling protocol, thus improving the network
   scalability while reducing the operational complexity.

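   As an illustration of the MPLS instantiation just described, the
   Python sketch below shows a source imposing a segment list as a
   label stack and a transit node acting on the topmost label.  It is
   purely illustrative and greatly simplified: the SID values, SRGB
   base, node names and interface names are hypothetical, and details
   such as penultimate hop popping and adjacency segments are ignored.

      # Simplified sketch of SR-MPLS forwarding with node SIDs only.
      # All values below are hypothetical examples.
      SRGB_BASE = 16000                     # assumed SRGB base

      # Node SID indices, as would be learnt via IGP extensions.
      node_sid_index = {'leaf1': 1, 'spine1': 11, 'leaf3': 3}

      # Each node's next hop towards every other node (from the IGP).
      next_hop_towards = {
          'leaf1':  {'spine1': 'eth1', 'leaf3': 'eth1'},
          'spine1': {'leaf1': 'eth1', 'leaf3': 'eth3'},
      }

      def impose_label_stack(path):
          # At the source: translate an explicit node path into labels.
          return [SRGB_BASE + node_sid_index[node] for node in path]

      def process(node, label_stack):
          # At a node: if our own label is on top, that segment is
          # complete, so pop it; then forward towards the node named
          # by the new topmost label.
          own_label = SRGB_BASE + node_sid_index[node]
          if label_stack and label_stack[0] == own_label:
              label_stack = label_stack[1:]
          if not label_stack:
              return None                   # this node is the last segment
          target = next(n for n, idx in node_sid_index.items()
                        if SRGB_BASE + idx == label_stack[0])
          return next_hop_towards[node][target], label_stack

      stack = impose_label_stack(['spine1', 'leaf3'])   # [16011, 16003]
      print(process('leaf1', stack))    # ('eth1', [16011, 16003])
      print(process('spine1', stack))   # ('eth3', [16003])
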
   The advantages of segment routing and the ability to run it over an
   unmodified MPLS data plane mean that one of its anticipated use
   cases is in BGP-based large-scale data centers [RFC7938].  The exact
   manner in which multicast traffic will be handled in SR has not yet
   been standardized, with a number of different options being
   considered.  For example, since segments are simply encoded as a
   label stack with the MPLS data plane, the protocols traditionally
   used to create point-to-multipoint LSPs could be reused to allow SR
   to support one-to-many traffic flows.  Alternatively, a special SID
   may be defined for a multicast distribution tree, with a centralized
   controller being used to program routers appropriately to ensure the
   traffic is delivered to the desired destinations, while avoiding the
   costly process of building and maintaining a multicast distribution
   tree.

5.  Conclusions

   As the volume and importance of one-to-many traffic in data centers
   increases, conventional IP multicast is likely to become
   increasingly unattractive for deployment in data centers for a
   number of reasons, mostly pertaining to its relatively poor
   scalability and inability to exploit characteristics of data center
   network architectures.  Hence, even though IGMP/MLD is likely to
   remain the most popular manner in which end hosts signal interest in
   joining a multicast group, it is unlikely that this multicast
   traffic will be transported over the data center IP fabric using a
   multicast distribution tree built and maintained by PIM in the
   future.  Rather, approaches which exploit the idiosyncrasies of data
   center network architectures are better placed to deliver
   one-to-many traffic in data centers, especially when judiciously
   combined with a centralized controller and/or a distributed control
   plane, particularly one based on BGP-EVPN.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   No new security considerations result from this document.

8.  Acknowledgements

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

9.2.  Informative References

   [I-D.ietf-bier-use-cases]
              Kumar, N., Asati, R., Chen, M., Xu, X., Dolganow, A.,
              Przygienda, T., Gulko, A., Robinson, D., Arya, V., and C.
              Bestler, "BIER Use Cases", draft-ietf-bier-use-cases-09
              (work in progress), January 2019.

   [I-D.ietf-nvo3-geneve]
              Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic
              Network Virtualization Encapsulation",
              draft-ietf-nvo3-geneve-13 (work in progress), March 2019.

   [I-D.ietf-nvo3-vxlan-gpe]
              Maino, F., Kreeger, L., and U. Elzur, "Generic Protocol
              Extension for VXLAN", draft-ietf-nvo3-vxlan-gpe-07 (work
              in progress), April 2019.

   [RFC0826]  Plummer, D., "An Ethernet Address Resolution Protocol: Or
              Converting Network Protocol Addresses to 48.bit Ethernet
              Address for Transmission on Ethernet Hardware", STD 37,
              RFC 826, DOI 10.17487/RFC0826, November 1982.

   [RFC2236]  Fenner, W., "Internet Group Management Protocol, Version
              2", RFC 2236, DOI 10.17487/RFC2236, November 1997.

   [RFC2710]  Deering, S., Fenner, W., and B. Haberman, "Multicast
              Listener Discovery (MLD) for IPv6", RFC 2710,
              DOI 10.17487/RFC2710, October 1999.

   [RFC3376]  Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A.
              Thyagarajan, "Internet Group Management Protocol, Version
              3", RFC 3376, DOI 10.17487/RFC3376, October 2002.

   [RFC3810]  Vida, R., Ed. and L. Costa, Ed., "Multicast Listener
              Discovery Version 2 (MLDv2) for IPv6", RFC 3810,
              DOI 10.17487/RFC3810, June 2004.

   [RFC4443]  Conta, A., Deering, S., and M. Gupta, Ed., "Internet
              Control Message Protocol (ICMPv6) for the Internet
              Protocol Version 6 (IPv6) Specification", STD 89,
              RFC 4443, DOI 10.17487/RFC4443, March 2006.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601,
              DOI 10.17487/RFC4601, August 2006.

   [RFC4604]  Holbrook, H., Cain, B., and B. Haberman, "Using Internet
              Group Management Protocol Version 3 (IGMPv3) and
              Multicast Listener Discovery Protocol Version 2 (MLDv2)
              for Source-Specific Multicast", RFC 4604,
              DOI 10.17487/RFC4604, August 2006.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for
              IP", RFC 4607, DOI 10.17487/RFC4607, August 2006.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              DOI 10.17487/RFC4861, September 2007.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast (BIDIR-
              PIM)", RFC 5015, DOI 10.17487/RFC5015, October 2007.

   [RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution
              Problems in Large Data Center Networks", RFC 6820,
              DOI 10.17487/RFC6820, January 2013.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

   [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
              2015.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
              Virtualization Using Generic Routing Encapsulation",
              RFC 7637, DOI 10.17487/RFC7637, September 2015.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [RFC8014]  Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T.
              Narten, "An Architecture for Data-Center Network
              Virtualization over Layer 3 (NVO3)", RFC 8014,
              DOI 10.17487/RFC8014, December 2016.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018.

   [Shabaz19]
              Shabaz, M., Suresh, L., Rexford, J., Feamster, N.,
              Rottenstreich, O., and M. Hira, "Elmo: Source Routed
              Multicast for Public Clouds", ACM SIGCOMM 2019 Conference
              (SIGCOMM '19), DOI 10.1145/3341302.3342066, August 2019.

   [SMPTE2110]
              "SMPTE2110 Standards Suite", .

Authors' Addresses

   Mike McBride
   Futurewei

   Email: michael.mcbride@futurewei.com

   Olufemi Komolafe
   Arista Networks

   Email: femi@arista.com