MBONED                                                        M. McBride
Internet-Draft                                                    Huawei
Intended status: Informational                         February 28, 2018
Expires: September 1, 2018

                 Multicast in the Data Center Overview
                    draft-ietf-mboned-dc-deploy-02

Abstract

   There has been much interest in the issues surrounding massive
   numbers of hosts in the data center.  These issues include the
   prevalent use of IP multicast within the data center.  It is
   important to understand how IP multicast is being deployed in the
   data center in order to understand the issues that arise from doing
   so.  This document provides a quick survey of the uses of multicast
   in the data center and should serve as an aid to further discussion
   of issues related to large amounts of multicast in the data center.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 1, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Multicast Applications in the Data Center
     2.1.  Client-Server Applications
     2.2.  Non-Client-Server Multicast Applications
   3.  L2 Multicast Protocols in the Data Center
   4.  L3 Multicast Protocols in the Data Center
   5.  Challenges of Using Multicast in the Data Center
   6.  Layer 3 / Layer 2 Topological Variations
   7.  Address Resolution
     7.1.  Solicited-Node Multicast Addresses for IPv6 Address
           Resolution
     7.2.  Direct Mapping for Multicast Address Resolution
   8.  IANA Considerations
   9.  Security Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Author's Address

1.  Introduction

   Data center servers often use IP multicast to send data to clients
   or other application servers.  IP multicast is expected to help
   conserve bandwidth in the data center and reduce the load on
   servers.  IP multicast is also a key component in several data
   center overlay solutions.  Increased reliance on multicast in next-
   generation data centers requires higher performance and capacity,
   especially from the switches.  If multicast is to continue to be
   used in the data center, it must scale well within and between data
   centers.  There has been much interest in the issues surrounding
   massive numbers of hosts in the data center.  There was a lengthy
   discussion, in the now-closed ARMD WG, of the issues with address
   resolution for non-ARP/ND multicast traffic in data centers.  This
   document provides a quick survey of multicast in the data center
   and should serve as an aid to further discussion of issues related
   to multicast in the data center.

   ARP/ND issues are not addressed in this document except to explain
   how address resolution occurs with multicast.
1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Multicast Applications in the Data Center

   There are many data center operators who do not deploy multicast in
   their networks for scalability and stability reasons.  There are
   also many operators for whom multicast is a critical protocol within
   their network and is enabled on their data center switches and
   routers.  For this latter group, there are several uses of multicast
   in their data centers.  An understanding of these uses is important
   in order to properly support the applications in ever-evolving data
   centers.  If, for instance, the majority of the applications are
   discovering/signaling each other using multicast, there may be
   better ways to support them than using multicast.  If, however, the
   multicasting of data is occurring in large volumes, there is a need
   for good data center overlay multicast support.  The applications
   fall either into the category of those that leverage L2 multicast
   for discovery or into the category of those that require L3 support
   and likely span multiple subnets.

2.1.  Client-Server Applications

   IPTV servers use multicast to deliver content from the data center
   to end users.  IPTV is typically a one-to-many application where the
   hosts are configured for IGMPv3, the switches are configured with
   IGMP snooping, and the routers are running PIM-SSM mode.  Often
   redundant servers send multicast streams into the network, and the
   network forwards the data across diverse paths.

   Windows Media servers send multicast streams to clients.  Windows
   Media Services streams to an IP multicast address, and all clients
   subscribe to that IP address to receive the same stream.  This
   allows a single stream to be played simultaneously by multiple
   clients, thus reducing bandwidth utilization.

   Market data relies extensively on IP multicast to deliver stock
   quotes from the data center to a financial services provider and
   then to the stock analysts.  The most critical requirement of a
   multicast trading floor is that it be highly available.  The network
   must be designed with no single point of failure and in a way that
   allows the network to respond in a deterministic manner to any
   failure.  Typically redundant servers (in a primary/backup or live-
   live mode) send multicast streams into the network, and the network
   forwards the data across diverse paths (when duplicate data is sent
   by multiple servers).

   With publish and subscribe servers, a separate message is sent to
   each subscriber of a publication.  With multicast publish/subscribe,
   only one message is sent, regardless of the number of subscribers.
   In a publish/subscribe system, client applications, some of which
   are publishers and some of which are subscribers, are connected to a
   network of message brokers that receive publications on a number of
   topics and send the publications on to the subscribers for those
   topics.  The more subscribers there are in the publish/subscribe
   system, the greater the improvement to network utilization there
   might be with multicast.
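   All of the client-server applications above depend on receivers
   joining an IP multicast group so that the network can deliver the
   stream.  As an informal illustration, the following Python sketch
   shows a minimal receiver performing an any-source (ASM) join; the
   group address (taken from the 233.252.0.0/24 example range of RFC
   5771) and the UDP port are illustrative assumptions, and an
   IGMPv3/SSM client would additionally specify the source address.

      import socket
      import struct

      # Illustrative values only: a group from the 233.252.0.0/24
      # (MCAST-TEST-NET) example range and an arbitrary UDP port.
      GROUP = "233.252.0.1"
      PORT = 5004

      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM,
                           socket.IPPROTO_UDP)
      sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      sock.bind(("", PORT))

      # Joining the group triggers an IGMP Membership Report, which
      # IGMP-snooping switches use to add this port to the group.
      mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                         socket.inet_aton("0.0.0.0"))
      sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

      while True:
          data, src = sock.recvfrom(2048)
          print("received %d bytes of stream data from %s"
                % (len(data), src))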
2.2.  Non-Client-Server Multicast Applications

   Routers running the Virtual Router Redundancy Protocol (VRRP)
   communicate with one another using a multicast address.  VRRP
   packets are sent, encapsulated in IP packets, to 224.0.0.18.  A
   failure to receive a multicast packet from the master router for a
   period longer than three times the advertisement timer causes the
   backup routers to assume that the master router is dead.  The
   virtual router then transitions into an unsteady state, and an
   election process is initiated to select the next master router from
   the backup routers.  This is fulfilled through the use of multicast
   packets.  Backup routers are only to send multicast packets during
   an election process.

   Overlays may use IP multicast to virtualize L2 multicasts.  IP
   multicast is used to reduce the scope of the L2-over-UDP flooding to
   only those hosts that have expressed explicit interest in the
   frames.  VXLAN, for instance, is an encapsulation scheme to carry L2
   frames over L3 networks.  The VXLAN Tunnel End Point (VTEP)
   encapsulates frames inside an L3 tunnel.  VXLANs are identified by a
   24-bit VXLAN Network Identifier (VNI).  The VTEP maintains a table
   of known destination MAC addresses and stores, for each, the IP
   address of the remote VTEP tunnel to use.  Unicast frames between
   VMs are sent directly to the unicast L3 address of the remote VTEP.
   Multicast frames are sent to a multicast IP group associated with
   the VNI.  The underlying IP multicast protocols (PIM-SM/SSM/BIDIR)
   are used to forward the encapsulated multicast data through the
   underlay.

   The Ganglia application relies upon multicast for distributed
   discovery and monitoring of computing systems such as clusters and
   grids.  It has been used to link clusters across university campuses
   and can scale to handle clusters with 2000 nodes.

   Windows Server cluster node exchange relies upon the use of
   multicast heartbeats between servers.  Only the other interfaces in
   the same multicast group use the data.  Unlike broadcast, multicast
   traffic does not need to be flooded throughout the network, reducing
   the chance that unnecessary CPU cycles are expended filtering
   traffic on nodes outside the cluster.  As the number of nodes
   increases, the ability to replace several unicast messages with a
   single multicast message improves node performance and decreases
   network bandwidth consumption.  Multicast messages replace unicast
   messages in two components of clustering:

   o  Heartbeats: The clustering failure detection engine is based on a
      scheme whereby nodes send heartbeat messages to other nodes.
      Specifically, for each network interface, a node sends a
      heartbeat message to all other nodes with interfaces on that
      network.  Heartbeat messages are sent every 1.2 seconds.  In the
      common case where each node has an interface on each cluster
      network, there are N * (N - 1) unicast heartbeats sent per
      network every 1.2 seconds in an N-node cluster.  With multicast
      heartbeats, the message count drops to N multicast heartbeats per
      network every 1.2 seconds, because each node sends 1 message
      instead of N - 1 (see the sketch after this list).  This
      represents a reduction in processing cycles on the sending node
      and a reduction in network bandwidth consumed.

   o  Regroup: The clustering membership engine executes a regroup
      protocol during a membership view change.  The regroup protocol
      algorithm assumes the ability to broadcast messages to all
      cluster nodes.  To avoid unnecessary network flooding and to
      properly authenticate messages, the broadcast primitive is
      implemented by a sequence of unicast messages.  Converting the
      unicast messages to a single multicast message conserves
      processing power on the sending node and reduces network
      bandwidth consumption.
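   The heartbeat savings described in the first item are
   straightforward to quantify.  The following sketch simply evaluates
   the N * (N - 1) unicast versus N multicast message counts per
   1.2-second interval; the cluster sizes are arbitrary examples.

      def heartbeat_messages(n_nodes):
          """Heartbeats per network, per 1.2-second interval, in an
          N-node cluster: unicast versus multicast."""
          unicast = n_nodes * (n_nodes - 1)  # each node sends to every other node
          multicast = n_nodes                # each node sends a single message
          return unicast, multicast

      for n in (2, 8, 16, 32):
          u, m = heartbeat_messages(n)
          print("N=%2d: %4d unicast vs %2d multicast heartbeats" % (n, u, m))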
   Multicast addresses in the 224.0.0.x range are considered link-local
   multicast addresses.  They are used for protocol discovery and are
   flooded to every port.  For example, OSPF uses 224.0.0.5 and
   224.0.0.6 for neighbor and DR discovery.  These addresses are
   reserved and will not be constrained by IGMP snooping.  These
   addresses are not to be used by any application.

3.  L2 Multicast Protocols in the Data Center

   The switches in between the servers and the routers rely upon IGMP
   snooping to constrain the multicast to the ports leading to
   interested hosts and to L3 routers.  A switch will, by default,
   flood multicast traffic to all the ports in a broadcast domain
   (VLAN).  IGMP snooping is designed to prevent hosts on a local
   network from receiving traffic for a multicast group they have not
   explicitly joined.  It provides switches with a mechanism to prune
   multicast traffic from links that do not contain a multicast
   listener (an IGMP client).  IGMP snooping is an L2 optimization for
   L3 IGMP.
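   As an informal illustration of the forwarding state such a switch
   builds, the following toy model (not any vendor's implementation)
   keeps a per-group set of member ports, forwards known groups to
   those ports plus the router ports, and falls back to flooding the
   VLAN for groups it has no state for.  The port numbers and group
   addresses are arbitrary.

      class IgmpSnoopingSwitch:
          """Toy model of per-VLAN IGMP snooping state."""

          def __init__(self, all_ports, router_ports):
              self.all_ports = set(all_ports)
              self.router_ports = set(router_ports)  # assumed already learned
              self.members = {}                      # group -> set of member ports

          def igmp_report(self, group, port):
              # A membership report seen on a port adds it to the group.
              self.members.setdefault(group, set()).add(port)

          def igmp_leave(self, group, port):
              self.members.get(group, set()).discard(port)

          def egress_ports(self, group, ingress_port):
              # Known groups go to member ports plus router ports;
              # unknown groups are flooded to the whole VLAN.
              if group in self.members:
                  out = self.members[group] | self.router_ports
              else:
                  out = self.all_ports
              return out - {ingress_port}

      sw = IgmpSnoopingSwitch(all_ports=range(1, 9), router_ports={1})
      sw.igmp_report("233.252.0.1", port=5)
      print(sw.egress_ports("233.252.0.1", ingress_port=2))  # {1, 5}
      print(sw.egress_ports("233.252.0.2", ingress_port=2))  # flooded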
   IGMP snooping, with proxy reporting or report suppression, actively
   filters IGMP packets in order to reduce the load on the multicast
   router.  Joins and leaves heading upstream to the router are
   filtered so that only the minimal quantity of information is sent.
   The switch is trying to ensure that the router has only a single
   entry for the group, regardless of how many active listeners there
   are.  If there are two active listeners in a group and the first one
   leaves, the switch determines that the router does not need this
   information, since it does not affect the status of the group from
   the router's point of view.  However, the next time there is a
   routine query from the router, the switch will forward the reply
   from the remaining host, to prevent the router from believing there
   are no active listeners.  It follows that with active IGMP snooping,
   the router will generally only know about the most recently joined
   member of the group.

   In order for IGMP, and thus IGMP snooping, to function, a multicast
   router must exist on the network and generate IGMP queries.  The
   tables (holding the member ports for each multicast group) created
   for snooping are associated with the querier.  Without a querier,
   the tables are not created and snooping will not work.  Furthermore,
   IGMP general queries must be unconditionally forwarded by all
   switches involved in IGMP snooping.  Some IGMP snooping
   implementations include full querier capability.  Others are able to
   proxy and retransmit queries from the multicast router.

   In source-only networks, however, which presumably describe most
   data center networks, there are no IGMP hosts on switch ports to
   generate IGMP packets.  Switch ports are connected to multicast
   source ports and multicast router ports.  The switch typically
   learns about multicast groups from the multicast data stream by
   using a type of source-only learning (the port receives only
   multicast data, with no IGMP packets).  The switch then forwards
   traffic only to the multicast router ports.  When the switch
   receives traffic for new IP multicast groups, it will typically
   flood the packets to all ports in the same VLAN.  This unnecessary
   flooding can impact switch performance.

4.  L3 Multicast Protocols in the Data Center

   There are three flavors of PIM used for multicast routing in the
   data center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR
   [RFC5015].  SSM provides the most efficient forwarding between
   sources and receivers and is most suitable for one-to-many types of
   multicast applications.  State is built for each (S,G) channel;
   therefore, the more sources and groups there are, the more state
   there is in the network.  BIDIR is the most efficient shared-tree
   solution, as one tree is built for all (S,G)s, thereby saving state.
   But it does not provide the most efficient forwarding path between
   sources and receivers.  SSM and BIDIR are optimizations of PIM-SM.
   PIM-SM is still the most widely deployed multicast routing protocol.
   PIM-SM can also be the most complex.  PIM-SM relies upon an RP
   (Rendezvous Point) to set up the multicast tree; it will then either
   switch to the SPT (shortest path tree), similar to SSM, or stay on
   the shared tree, similar to BIDIR.  For massive numbers of hosts
   sending (and receiving) multicast, the shared tree (particularly
   with PIM-BIDIR) provides the best potential scaling, since the
   number of trees stays the same no matter how many multicast sources
   exist within a VLAN.  IGMP snooping, IGMP proxy, and PIM-BIDIR have
   the potential to scale to the large numbers required in a data
   center.

5.  Challenges of Using Multicast in the Data Center

   Data center environments may create unique challenges for IP
   multicast.  Data center networks must support a large amount of VM
   traffic and mobility within and between DC networks.  DC networks
   have large numbers of servers and are often used with cloud
   orchestration software, and they often use IP multicast in these
   unique environments.  This section looks at the challenges of using
   multicast within this environment.

   When IGMP/MLD snooping is not implemented, Ethernet switches will
   flood multicast frames out of all switch ports, which turns the
   traffic into something more like a broadcast.

   VRRP uses multicast heartbeats to communicate between routers.  The
   communication between the host and the default gateway is unicast.
   The multicast heartbeat can be very chatty when there are thousands
   of VRRP pairs with sub-second heartbeat calls back and forth.

   Link-local multicast should scale well within one IP subnet,
   particularly with a large layer-3 domain extending down to the
   access or aggregation switches.  But if multicast traverses beyond
   one IP subnet, which is necessary for an overlay like VXLAN, there
   could be scaling concerns.  When using a VXLAN overlay, it is
   necessary either to map the L2 multicast in the overlay to L3
   multicast in the underlay or to do head-end replication in the
   overlay and receive duplicate frames on the first link from the
   router to the core switch.  The solution could be to run potentially
   thousands of PIM messages to generate/maintain the required
   multicast state in the IP underlay.  The behavior of the upper
   layer, with respect to broadcast/multicast, affects the choice of
   head-end (*,G) or (S,G) replication in the underlay, which affects
   the opex and capex of the entire solution.  A VXLAN with thousands
   of logical groups maps to head-end replication in the hypervisor, or
   to IGMP from the hypervisor and then PIM between the ToR and core
   switches and the gateway router.
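   To make the trade-off concrete, the following sketch counts how many
   copies of a single broadcast/unknown-unicast/multicast (BUM) frame
   the source VTEP transmits under head-end replication versus underlay
   multicast, and shows one hypothetical way a 24-bit VNI could be
   folded onto a block of underlay ASM groups.  The group block, the
   modulo mapping, and the VTEP count are assumptions made for
   illustration and are not part of any VXLAN specification.

      import ipaddress

      def source_vtep_copies(num_remote_vteps, underlay_multicast):
          """Copies of one BUM frame the source VTEP transmits for a VNI."""
          if underlay_multicast:
              # One VXLAN packet to the VNI's underlay group; the
              # PIM-built tree replicates the packet inside the fabric.
              return 1
          # Head-end replication: one unicast copy per remote VTEP.
          return num_remote_vteps

      def vni_to_underlay_group(vni, base="239.1.0.0"):
          """Hypothetical mapping of a 24-bit VNI onto a /16 block of ASM
          groups; several VNIs may share a group, widening delivery scope."""
          return str(ipaddress.IPv4Address(base) + (vni % 2**16))

      print(source_vtep_copies(200, underlay_multicast=False))  # 200 copies
      print(source_vtep_copies(200, underlay_multicast=True))   # 1 copy
      print(vni_to_underlay_group(5001))                        # 239.1.19.137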
   Requiring IP multicast (especially PIM-BIDIR) from the network can
   prove challenging for data center operators, especially at the kind
   of scale that the VXLAN/NVGRE proposals require.  This is also true
   when the L2 topological domain is large and extended all the way to
   the L3 core.  In data centers with highly virtualized servers, even
   small L2 domains may spread across many server racks (i.e., multiple
   switches and router ports).

   It is not uncommon for there to be 10-20 VMs per server in a
   virtualized environment.  One vendor reported a customer requesting
   support for a scale of 400 VMs per server.  For multicast to be a
   viable solution in this environment, the network needs to be able to
   scale to these numbers when these VMs are sending/receiving
   multicast.

   A lot of switching/routing hardware has problems with IP multicast,
   particularly with regard to hardware support of PIM-BIDIR.

   Sending L2 multicast over a campus or data center backbone, in any
   sort of significant way, is a new challenge enabled for the first
   time by overlays.  There are interesting challenges when pushing
   large amounts of multicast traffic through a network, and these have
   thus far been dealt with using purpose-built networks.  While the
   overlay proposals have been careful not to impose new protocol
   requirements, they have not addressed the issues of performance and
   scalability, nor the large-scale availability of these protocols.

   There is an unnecessary multicast stream flooding problem in the
   link-layer switches between the multicast source and the PIM First
   Hop Router (FHR).  The IGMP-snooping switch will forward multicast
   streams to router ports, and the PIM FHR must receive all multicast
   streams even if there is no request from a receiver.  This often
   leads to wasted switch cache and link bandwidth when the multicast
   streams are not actually required.  [I-D.pim-umf-problem-statement]
   details the problem and defines design goals for a generic mechanism
   to restrain the unnecessary multicast stream flooding.

6.  Layer 3 / Layer 2 Topological Variations

   As discussed in [RFC6820], the ARMD problem statement, there are a
   variety of topological data center variations, including L3 to
   access switches, L3 to aggregation switches, and L3 in the core
   only.  Further analysis is needed in order to understand how these
   variations affect IP multicast scalability.

7.  Address Resolution

7.1.  Solicited-Node Multicast Addresses for IPv6 Address Resolution

   Solicited-Node multicast addresses are used with IPv6 Neighbor
   Discovery to provide the same function as the Address Resolution
   Protocol (ARP) in IPv4.  ARP uses broadcast to send ARP Requests,
   which are received by all end hosts on the local link.  Only the
   host being queried responds; however, the other hosts still have to
   process and discard the request.  With IPv6, a host is required to
   join a Solicited-Node multicast group for each of its configured
   unicast or anycast addresses.  Because a Solicited-Node multicast
   address is a function of the last 24 bits of an IPv6 unicast or
   anycast address, the number of hosts that are subscribed to each
   Solicited-Node multicast address would typically be one (there could
   be more because the mapping function is not a 1:1 mapping).
   Compared to ARP in IPv4, a host should not need to be interrupted as
   often to service Neighbor Solicitation requests.

7.2.  Direct Mapping for Multicast Address Resolution

   With IPv4 unicast address resolution, the translation of an IP
   address to a MAC address is done dynamically by ARP.  With multicast
   address resolution, the mapping from a multicast IP address to a
   multicast MAC address is derived by direct mapping.  In IPv4, the
   mapping is done by using the low-order 23 bits of the multicast IP
   address as the low-order 23 bits of the multicast MAC address.  When
   a host joins an IP multicast group, it instructs the data link layer
   to receive frames that match the MAC address that corresponds to the
   IP address of the multicast group.  The data link layer filters the
   frames and passes frames with matching destination addresses to the
   IP module.  Since the mapping from a multicast IP address to a MAC
   address ignores 5 bits of the IP address, groups of 32 multicast IP
   addresses are mapped to the same MAC address.  As a result, a
   multicast MAC address cannot be uniquely mapped back to a multicast
   IPv4 address.  Planning is required within an organization to select
   IPv4 groups that are far enough apart that they do not map to the
   same L2 address.  Any multicast address in the 224-239.0.0.x and
   224-239.128.0.x ranges should not be used, since these map to the
   same MAC addresses as the link-local 224.0.0.x block, which is
   flooded to every port.  When sending IPv6 multicast packets on an
   Ethernet link, the corresponding destination MAC address is a direct
   mapping of the last 32 bits of the 128-bit IPv6 multicast address
   into the 48-bit MAC address.  It is possible for more than one IPv6
   multicast address to map to the same 48-bit MAC address.
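   The mappings described in this section are simple enough to compute
   directly.  The following sketch derives the Solicited-Node group for
   an IPv6 unicast address, and the Ethernet MAC addresses for IPv4
   (01-00-5E plus the low-order 23 bits) and IPv6 (33-33 plus the
   low-order 32 bits) multicast groups; the sample addresses are
   illustrative only.  Note how two IPv4 groups that differ only in the
   ignored 5 bits collide on the same MAC address.

      import ipaddress

      def solicited_node(addr):
          """ff02::1:ffXX:XXXX, built from the last 24 bits of addr."""
          low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
          base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
          return ipaddress.IPv6Address(base | low24)

      def ipv4_group_to_mac(group):
          """01-00-5E prefix plus the low-order 23 bits of the group."""
          low23 = int(ipaddress.IPv4Address(group)) & 0x7FFFFF
          return "01:00:5e:%02x:%02x:%02x" % (
              low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF)

      def ipv6_group_to_mac(group):
          """33-33 prefix plus the low-order 32 bits of the group."""
          low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
          return "33:33:%02x:%02x:%02x:%02x" % (
              low32 >> 24, (low32 >> 16) & 0xFF, (low32 >> 8) & 0xFF,
              low32 & 0xFF)

      print(solicited_node("2001:db8::1:800:200e:8c6c"))  # ff02::1:ff0e:8c6c
      print(ipv4_group_to_mac("224.1.1.1"))         # 01:00:5e:01:01:01
      print(ipv4_group_to_mac("239.129.1.1"))       # 01:00:5e:01:01:01 (collision)
      print(ipv6_group_to_mac("ff02::1:ff0e:8c6c")) # 33:33:ff:0e:8c:6c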
8.  IANA Considerations

   This memo includes no request to IANA.

9.  Security Considerations

   No new security considerations result from this document.

10.  Acknowledgements

   The author would like to thank the many individuals who contributed
   opinions on the ARMD WG mailing list about this topic: Linda Dunbar,
   Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor
   Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli, and Thomas
   Narten.

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

11.2.  Informative References

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601,
              DOI 10.17487/RFC4601, August 2006,
              <https://www.rfc-editor.org/info/rfc4601>.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for
              IP", RFC 4607, DOI 10.17487/RFC4607, August 2006,
              <https://www.rfc-editor.org/info/rfc4607>.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast
              (BIDIR-PIM)", RFC 5015, DOI 10.17487/RFC5015, October
              2007, <https://www.rfc-editor.org/info/rfc5015>.

   [RFC6820]  Narten, T., Karir, M., and I. Foo, "Address Resolution
              Problems in Large Data Center Networks", RFC 6820,
              DOI 10.17487/RFC6820, January 2013,
              <https://www.rfc-editor.org/info/rfc6820>.

Author's Address

   Mike McBride
   Huawei

   Email: michael.mcbride@huawei.com