idnits 2.17.1 

draft-ietf-mboned-dc-deploy-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 11 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 3 instances of lines with multicast IPv4 addresses in the
     document.  If these are generic example addresses, they should be changed
     to use the 233.252.0.x range defined in RFC 5771


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (August 23, 2013) is 3898 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: '224-239' is mentioned on line 412, but not defined

  -- Obsolete informational reference (is this intentional?): RFC 4601
     (Obsoleted by RFC 7761)


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

1	Internet Engineering Task Force                               M. McBride
2	Internet-Draft                                       Huawei Technologies
3	Intended status: Informational                           August 23, 2013
4	Expires: February 24, 2014

6	                 Multicast in the Data Center Overview
7	                     draft-ietf-mboned-dc-deploy-01

9	Abstract

11	   There has been much interest in issues surrounding large numbers of
12	   hosts in the data center.  These issues include the prevalent use of
13	   IP Multicast within the data center.  It is important to understand
14	   how IP Multicast is being deployed in the data center in order to
15	   understand the issues that surround its use.  This document
16	   provides a quick survey of uses of multicast in the data center and
17	   should serve as an aid to further discussion of issues related to
18	   large amounts of multicast in the data center.

20	Status of this Memo

22	   This Internet-Draft is submitted in full conformance with the
23	   provisions of BCP 78 and BCP 79.

25	   Internet-Drafts are working documents of the Internet Engineering
26	   Task Force (IETF).  Note that other groups may also distribute
27	   working documents as Internet-Drafts.  The list of current Internet-
28	   Drafts is at http://datatracker.ietf.org/drafts/current/.

30	   Internet-Drafts are draft documents valid for a maximum of six months
31	   and may be updated, replaced, or obsoleted by other documents at any
32	   time.  It is inappropriate to use Internet-Drafts as reference
33	   material or to cite them other than as "work in progress."

35	   This Internet-Draft will expire on February 24, 2014.

37	Copyright Notice

39	   Copyright (c) 2013 IETF Trust and the persons identified as the
40	   document authors.  All rights reserved.

42	   This document is subject to BCP 78 and the IETF Trust's Legal
43	   Provisions Relating to IETF Documents
44	   (http://trustee.ietf.org/license-info) in effect on the date of
45	   publication of this document.  Please review these documents
46	   carefully, as they describe your rights and restrictions with respect
47	   to this document.  Code Components extracted from this document must
48	   include Simplified BSD License text as described in Section 4.e of
49	   the Trust Legal Provisions and are provided without warranty as
50	   described in the Simplified BSD License.

52	Table of Contents

54	   1.  Introduction
55	   2.  Multicast Applications in the Data Center
56	     2.1.  Client-Server Applications
57	     2.2.  Non Client-Server Multicast Applications
58	   3.  L2 Multicast Protocols in the Data Center
59	   4.  L3 Multicast Protocols in the Data Center
60	   5.  Challenges of using multicast in the Data Center
61	   6.  Layer 3 / Layer 2 Topological Variations
62	   7.  Address Resolution
63	     7.1.  Solicited-node Multicast Addresses for IPv6 address
64	           resolution
65	     7.2.  Direct Mapping for Multicast address resolution
66	   8.  Acknowledgements
67	   9.  IANA Considerations
68	   10. Security Considerations
69	   11. Informative References
70	   Author's Address

72	1.  Introduction

74	   Data center servers often use IP Multicast to send data to clients or
75	   other application servers.  IP Multicast is expected to help conserve
76	   bandwidth in the data center and reduce the load on servers.  IP
77	   Multicast is also a key component in several data center overlay
78	   solutions.  Increased reliance on multicast in next-generation data
79	   centers requires higher performance and capacity, especially from
80	   the switches.  If multicast is to continue to be used in the data
81	   center, it must scale well within and between data centers.  There
82	   has been much interest in issues surrounding large numbers of hosts
83	   in the data center.  There was a lengthy discussion in the now-closed
84	   ARMD WG involving the issues with address resolution for non-ARP/ND
85	   multicast traffic in data centers.  This document provides a quick
86	   survey of multicast in the data center and should serve as an aid to
87	   further discussion of issues related to multicast in the data center.

89	   ARP/ND issues are not addressed in this document except to explain
90	   how address resolution occurs with multicast.

92	2.  Multicast Applications in the Data Center

94	   There are many data center operators who do not deploy Multicast in
95	   their networks for scalability and stability reasons.  There are also
96	   many operators for whom multicast is a critical protocol within their
97	   network and is enabled on their data center switches and routers.
98	   For this latter group, there are several uses of multicast in their
99	   data centers.  An understanding of those uses is important in order
100	   to properly support these applications in ever-evolving data
101	   centers.  If, for instance, the majority of the applications are
102	   discovering or signaling each other using multicast, there may be
103	   better ways to support them than using multicast.  If, however, the
104	   multicasting of data is occurring in large volumes, there is a need
105	   for good data center overlay multicast support.  The applications
106	   fall into one of two categories: those that leverage L2 multicast
107	   for discovery, and those that require L3 support and likely span
108	   multiple subnets.

110	2.1.  Client-Server Applications

112	   IPTV servers use multicast to deliver content from the data center to
113	   end users.  IPTV is typically a one-to-many application where the
114	   hosts are configured for IGMPv3, the switches are configured with
115	   IGMP snooping, and the routers are running PIM-SSM.  Often,
116	   redundant servers are sending multicast streams into the network and
117	   the network is forwarding the data across diverse paths.

119	   Windows Media servers send multicast streaming to clients.  Windows
120	   Media Services streams to an IP multicast address and all clients
121	   subscribe to the IP address to receive the same stream.  This allows
122	   a single stream to be played simultaneously by multiple clients,
123	   thus reducing bandwidth utilization.

125	   Market data relies extensively on IP multicast to deliver stock
126	   quotes from the data center to a financial services provider and then
127	   to the stock analysts.  The most critical requirement of a multicast
128	   trading floor is that it be highly available.  The network must be
129	   designed with no single point of failure and in such a way that it
130	   can respond in a deterministic manner to any failure.  Typically,
131	   redundant servers (in a primary/backup or live-live mode) are sending
132	   multicast streams into the network and the network is forwarding the
133	   data across diverse paths (when duplicate data is sent by multiple
134	   servers).

136	   With unicast publish and subscribe servers, a separate message is
137	   sent to each subscriber of a publication.  With multicast
138	   publish/subscribe, only one message is sent, regardless of the number
139	   of subscribers.  In a publish/subscribe system, client applications,
140	   some of which are publishers and some of which are subscribers, are
141	   connected to a network of message brokers that receive publications
142	   on a number of topics and send the publications on to the
143	   subscribers for those topics.  The more subscribers there are in the
144	   publish/subscribe system, the greater the potential improvement in
145	   network utilization with multicast.

147	2.2.  Non Client-Server Multicast Applications

149	   Routers running the Virtual Router Redundancy Protocol (VRRP)
150	   communicate with one another using a multicast address.  VRRP packets
151	   are sent, encapsulated in IP packets, to 224.0.0.18.  A failure to
152	   receive a multicast packet from the master router for a period longer
153	   than three times the advertisement timer causes the backup routers to
154	   assume that the master router is dead.  The virtual router then
155	   transitions into an unsteady state and an election process is
156	   initiated to select the next master router from the backup routers.
157	   The election is carried out using multicast packets.  Backup
158	   routers send multicast packets only during an election
159	   process.

161	   Overlays may use IP multicast to virtualize L2 multicasts.  IP
162	   multicast is used to reduce the scope of the L2-over-UDP flooding to
163	   only those hosts that have expressed explicit interest in the
164	   frames.  VXLAN, for instance, is an encapsulation scheme to carry L2
165	   frames over L3 networks.  The VXLAN Tunnel End Point (VTEP)
166	   encapsulates frames inside an L3 tunnel.  VXLANs are identified by a
167	   24-bit VXLAN Network Identifier (VNI).  The VTEP maintains a table of
168	   known destination MAC addresses, and stores the IP address of the
169	   tunnel to the remote VTEP to use for each.  Unicast frames, between
170	   VMs, are sent directly to the unicast L3 address of the remote VTEP.
171	   Multicast frames are sent to a multicast IP group associated with the
172	   VNI.  Underlying IP Multicast protocols (PIM-SM/SSM/BIDIR) are used
173	   to forward multicast data across the overlay.
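
   The VTEP forwarding decision described above can be sketched in a few
   lines of Python.  This is only an illustration under stated
   assumptions: the class name Vtep, its methods, and the VNI-to-group
   table are hypothetical, and sending unknown unicast destinations to
   the VNI's group reflects the flood-and-learn behavior of VXLAN rather
   than anything stated in this document.

```python
class Vtep:
    """Toy model of a VTEP's outer-destination lookup (hypothetical API)."""

    def __init__(self, vni_to_group):
        # vni_to_group: operator-configured mapping, VNI -> multicast group.
        for vni in vni_to_group:
            if not 0 <= vni < 2 ** 24:
                raise ValueError("VNI must fit in 24 bits")
        self.vni_to_group = dict(vni_to_group)
        self.mac_table = {}  # (vni, inner MAC) -> remote VTEP unicast IP

    def learn(self, vni, mac, remote_vtep_ip):
        """Record which remote VTEP a destination MAC sits behind."""
        self.mac_table[(vni, mac.lower())] = remote_vtep_ip

    @staticmethod
    def is_group_mac(mac):
        # The least-significant bit of the first octet marks a
        # multicast or broadcast MAC address.
        return bool(int(mac.split(":")[0], 16) & 1)

    def outer_destination(self, vni, dst_mac):
        """Return the outer IP destination for an encapsulated frame."""
        group = self.vni_to_group[vni]
        if self.is_group_mac(dst_mac):
            return group  # L2 multicast/broadcast -> group for the VNI
        # Known unicast goes straight to the remote VTEP; unknown
        # unicast falls back to the group (assumed flood-and-learn).
        return self.mac_table.get((vni, dst_mac.lower()), group)


vtep = Vtep({5000: "233.252.0.10"})
vtep.learn(5000, "52:54:00:aa:bb:cc", "192.0.2.20")
print(vtep.outer_destination(5000, "52:54:00:aa:bb:cc"))
print(vtep.outer_destination(5000, "01:00:5e:7c:00:01"))
```

   The example group 233.252.0.10 is taken from the range reserved for
   documentation in RFC 5771; a real deployment would use its own
   assignments.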

175	   The Ganglia application relies upon multicast for distributed
176	   discovery and monitoring of computing systems such as clusters and
177	   grids.  It has been used to link clusters across university campuses
178	   and can scale to handle clusters with 2000 nodes.

180	   Windows Server cluster node exchange relies upon the use of
181	   multicast heartbeats between servers.  Only the other interfaces in
182	   the same multicast group use the data.  Unlike broadcast, multicast
183	   traffic does not need to be flooded throughout the network, reducing
184	   the chance that unnecessary CPU cycles are expended filtering traffic
185	   on nodes outside the cluster.  As the number of nodes increases, the
186	   ability to replace several unicast messages with a single multicast
187	   message improves node performance and decreases network bandwidth
188	   consumption.  Multicast messages replace unicast messages in two
189	   components of clustering:

191	   o  Heartbeats: The clustering failure detection engine is based on a
192	      scheme whereby nodes send heartbeat messages to other nodes.
193	      Specifically, for each network interface, a node sends a heartbeat
194	      message to all other nodes with interfaces on that network.
195	      Heartbeat messages are sent every 1.2 seconds.  In the common case
196	      where each node has an interface on each cluster network, there
197	      are N * (N - 1) unicast heartbeats sent per network every 1.2
198	      seconds in an N-node cluster.  With multicast heartbeats, the
199	      message count drops to N multicast heartbeats per network every
200	      1.2 seconds, because each node sends 1 message instead of N - 1.
201	      This represents a reduction in processing cycles on the sending
202	      node and a reduction in network bandwidth consumed.

204	   o  Regroup: The clustering membership engine executes a regroup
205	      protocol during a membership view change.  The regroup protocol
206	      algorithm assumes the ability to broadcast messages to all cluster
207	      nodes.  To avoid unnecessary network flooding and to properly
208	      authenticate messages, the broadcast primitive is implemented by a
209	      sequence of unicast messages.  Converting the unicast messages to
210	      a single multicast message conserves processing power on the
211	      sending node and reduces network bandwidth consumption.
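
   The heartbeat arithmetic above can be checked with a short Python
   sketch.  The function names are illustrative only, and the cluster
   sizes are arbitrary examples rather than measurements.

```python
def unicast_heartbeats(n: int) -> int:
    """Heartbeats per network per interval when every node unicasts
    to each of its N - 1 peers."""
    return n * (n - 1)


def multicast_heartbeats(n: int) -> int:
    """Heartbeats per network per interval when every node sends a
    single multicast message."""
    return n


for nodes in (2, 16, 64):
    print(nodes, unicast_heartbeats(nodes), multicast_heartbeats(nodes))
```

   Unicast heartbeat load grows quadratically with cluster size while
   multicast load grows linearly, so the saving widens as nodes are
   added.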

213	   Multicast addresses in the 224.0.0.x range are considered link-local
214	   multicast addresses.  They are used for protocol discovery and are
215	   flooded to every port.  For example, OSPF uses 224.0.0.5 and
216	   224.0.0.6 for neighbor and DR discovery.  These addresses are
217	   reserved and will not be constrained by IGMP snooping.  These
218	   addresses are not to be used by any application.

220	3.  L2 Multicast Protocols in the Data Center

222	   The switches, in between the servers and the routers, rely upon IGMP
223	   snooping to bound the multicast to the ports leading to interested
224	   hosts and to L3 routers.  A switch will, by default, flood multicast
225	   traffic to all the ports in a broadcast domain (VLAN).  IGMP snooping
226	   is designed to prevent hosts on a local network from receiving
227	   traffic for a multicast group they have not explicitly joined.  It
228	   provides switches with a mechanism to prune multicast traffic from
229	   links that do not contain a multicast listener (an IGMP client).
230	   IGMP snooping is an L2 optimization for L3 IGMP.

232	   IGMP snooping, with proxy reporting or report suppression, actively
233	   filters IGMP packets in order to reduce load on the multicast router.
234	   Joins and leaves heading upstream to the router are filtered so that
235	   only the minimal quantity of information is sent.  The switch is
236	   trying to ensure the router only has a single entry for the group,
237	   regardless of how many active listeners there are.  If there are two
238	   active listeners in a group and the first one leaves, then the switch
239	   determines that the router does not need this information since it
240	   does not affect the status of the group from the router's point of
241	   view.  However, the next time there is a routine query from the
242	   router, the switch will forward the reply from the remaining host to
243	   prevent the router from believing there are no active listeners.  It
244	   follows that in active IGMP snooping, the router will generally only
245	   know about the most recently joined member of the group.
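
   The join/leave filtering described above can be modeled roughly as
   follows.  This is a minimal sketch, not a description of any
   particular switch: the class SnoopingSwitch and its methods are
   hypothetical, ports are plain integers, and query handling and timers
   are left out.

```python
class SnoopingSwitch:
    """Toy IGMP snooping table with report suppression (hypothetical)."""

    def __init__(self):
        self.members = {}  # group address -> set of listener ports

    def report(self, group, port):
        """Record a join.  Return True if it should go upstream, which
        is only the case for the first listener of the group."""
        ports = self.members.setdefault(group, set())
        first_listener = not ports
        ports.add(port)
        return first_listener

    def leave(self, group, port):
        """Record a leave.  Return True if it should go upstream, which
        is only the case when the last listener has gone."""
        ports = self.members.get(group)
        if ports is None or port not in ports:
            return False
        ports.remove(port)
        if ports:
            return False  # other listeners remain; router need not know
        del self.members[group]
        return True

    def egress_ports(self, group, router_ports):
        """Ports that receive traffic for the group: interested hosts
        plus the ports leading to multicast routers."""
        return set(self.members.get(group, ())) | set(router_ports)


sw = SnoopingSwitch()
print(sw.report("233.252.0.1", 1))   # first join is forwarded upstream
print(sw.report("233.252.0.1", 2))   # second join is suppressed
print(sw.leave("233.252.0.1", 1))    # a listener remains, so suppressed
print(sorted(sw.egress_ports("233.252.0.1", {24})))
```

   The router sees a single join and a single leave for the group no
   matter how many hosts come and go behind the switch.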

247	   In order for IGMP, and thus IGMP snooping, to function, a multicast
248	   router must exist on the network and generate IGMP queries.  The
249	   tables (holding the member ports for each multicast group) created
250	   for snooping are associated with the querier.  Without a querier the
251	   tables are not created and snooping will not work.  Furthermore, IGMP
252	   general queries must be unconditionally forwarded by all switches
253	   involved in IGMP snooping.  Some IGMP snooping implementations
254	   include full querier capability.  Others are able to proxy and
255	   retransmit queries from the multicast router.

257	   In source-only networks, however, which presumably describes most
258	   data center networks, there are no IGMP hosts on switch ports to
259	   generate IGMP packets.  Switch ports are connected to multicast
260	   source ports and multicast router ports.  The switch typically learns
261	   about multicast groups from the multicast data stream by using a type
262	   of source-only learning (when only receiving multicast data on the
263	   port, no IGMP packets).  The switch forwards traffic only to the
264	   multicast router ports.  When the switch receives traffic for new IP
265	   multicast groups, it will typically flood the packets to all ports in
266	   the same VLAN.  This unnecessary flooding can impact switch
267	   performance.

269	4.  L3 Multicast Protocols in the Data Center

271	   There are three flavors of PIM used for Multicast Routing in the Data
272	   Center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR [RFC5015].
273	   SSM provides the most efficient forwarding between sources and
274	   receivers and is most suitable for one-to-many types of multicast
275	   applications.  State is built for each (S,G) channel; therefore, the
276	   more sources and groups there are, the more state there is in the
277	   network.  BIDIR is the most efficient shared tree solution, as one
278	   tree is built for all (S,G)s, therefore saving state.  But it is not
279	   the most efficient in forwarding path between sources and receivers.
280	   SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is still the most
281	   widely deployed multicast routing protocol.  PIM-SM can also be the
282	   most complex.  PIM-SM relies upon an RP (Rendezvous Point) to set up
283	   the multicast tree and then will either switch to the SPT (shortest
284	   path tree), similar to SSM, or stay on the shared tree, similar to
285	   BIDIR.  For large numbers of hosts sending (and receiving)
286	   multicast, the shared tree (particularly with PIM-BIDIR) provides
287	   the best potential scaling since, no matter how many multicast
288	   sources exist within a VLAN, the number of trees stays the same.
289	   IGMP snooping, IGMP proxy, and PIM-BIDIR have the potential to scale
290	   to the very large numbers required in a data center.
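
   The state trade-off between SSM and BIDIR can be made concrete with a
   rough count.  The sketch below is a deliberate simplification: it
   assumes one (S,G) entry per source per group for SSM and one shared
   entry per group for BIDIR, ignores per-router and per-interface
   detail, and uses a made-up workload.

```python
from typing import Mapping


def ssm_entries(sources_per_group: Mapping[str, int]) -> int:
    """Simplified SSM model: one (S,G) entry per source of each group."""
    return sum(sources_per_group.values())


def bidir_entries(sources_per_group: Mapping[str, int]) -> int:
    """Simplified BIDIR model: one shared entry per group, regardless
    of how many sources send to it."""
    return len(sources_per_group)


# Hypothetical workload: 100 groups, each with 50 sending hosts.
workload = {f"233.252.0.{i}": 50 for i in range(100)}
print(ssm_entries(workload), bidir_entries(workload))
```

   Under these assumptions the SSM count scales with the number of
   senders, while the BIDIR count depends only on the number of groups.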

292	5.  Challenges of using multicast in the Data Center

294	   Data Center environments may create unique challenges for IP
295	   Multicast.  Data Center networks require a high amount of VM traffic
296	   and mobility within and between DC networks.  DC networks have large
297	   numbers of servers.  DC networks are often used with cloud
298	   orchestration software.  DC networks often use IP Multicast in their
299	   unique environments.  This section looks at the challenges of using
300	   multicast within the data center environment.

302	   When IGMP/MLD Snooping is not implemented, Ethernet switches will
303	   flood multicast frames out of all switch-ports, which turns the
304	   traffic into something more like a broadcast.

306	   VRRP uses a multicast heartbeat to communicate between routers.  The
307	   communication between the host and the default gateway is unicast.

309	   The multicast heartbeat can be very chatty when there are thousands
310	   of VRRP pairs with sub-second heartbeat calls back and forth.

312	   Link-local multicast should scale well within one IP subnet,
313	   particularly with a large Layer 3 domain extending down to the access
314	   or aggregation switches.  But if multicast traverses beyond one IP
315	   subnet, which is necessary for an overlay like VXLAN, there could
316	   potentially be scaling concerns.  If using a VXLAN overlay, it is
317	   necessary to map the L2 multicast in the overlay to L3 multicast in
318	   the underlay or do head end replication in the overlay and receive
319	   duplicate frames on the first link from the router to the core
320	   switch.  The solution could be to run potentially thousands of PIM
321	   messages to generate/maintain the required multicast state in the IP
322	   underlay.  The behavior of the upper layer, with respect to
323	   broadcast/multicast, affects the choice of head end (*,G) or (S,G)
324	   replication in the underlay, which affects the opex and capex of the
325	   entire solution.  A VXLAN, with thousands of logical groups, maps to
326	   head end replication in the hypervisor or to IGMP from the hypervisor
327	   and then PIM between the TOR and CORE 'switches' and the gateway
328	   router.

330	   Requiring IP multicast (especially PIM BIDIR) from the network can
331	   prove challenging for data center operators especially at the kind of
332	   scale that the VXLAN/NVGRE proposals require.  This is also true when
333	   the L2 topological domain is large and extended all the way to the L3
334	   core.  In data centers with highly virtualized servers, even small L2
335	   domains may spread across many server racks (i.e. multiple switches
336	   and router ports).

338	   It's not uncommon for there to be 10-20 VMs per server in a
339	   virtualized environment.  One vendor reported a customer requesting a
340	   scale to 400 VMs per server.  For multicast to be a viable solution
341	   in this environment, the network needs to be able to scale to these
342	   numbers when these VMs are sending/receiving multicast.

344	   A lot of switching/routing hardware has problems with IP Multicast,
345	   particularly with regard to hardware support of PIM-BIDIR.

347	   Sending L2 multicast over a campus or data center backbone, in any
348	   sort of significant way, is a new challenge enabled for the first
349	   time by overlays.  There are interesting challenges when pushing
350	   large amounts of multicast traffic through a network, and these have
351	   thus far been dealt with using purpose-built networks.  While the
352	   overlay proposals have been careful not to impose new protocol
353	   requirements, they have not addressed the issues of performance and
354	   scalability, nor the large-scale availability of these protocols.

356	   There is an unnecessary multicast stream flooding problem in the link
357	   layer switches between the multicast source and the PIM First Hop
358	   Router (FHR).  The IGMP-Snooping Switch will forward multicast
359	   streams to router ports, and the PIM FHR must receive all multicast
360	   streams even if there is no request from a receiver.  This often
361	   leads to wasted switch cache and link bandwidth when the multicast
362	   streams are not actually required.  [I-D.pim-umf-problem-statement]
363	   details the problem and defines design goals for a generic mechanism
364	   to restrain the unnecessary multicast stream flooding.

366	6.  Layer 3 / Layer 2 Topological Variations

368	   As discussed in [I-D.armd-problem-statement], there are a variety of
369	   topological data center variations including L3 to Access Switches,
370	   L3 to Aggregation Switches, and L3 in the Core only.  Further
371	   analysis is needed in order to understand how these variations affect
372	   IP Multicast scalability.

374	7.  Address Resolution

376	7.1.  Solicited-node Multicast Addresses for IPv6 address resolution

378	   Solicited-node Multicast Addresses are used with IPv6 Neighbor
379	   Discovery to provide the same function as the Address Resolution
380	   Protocol (ARP) in IPv4.  ARP uses broadcasts to send ARP
381	   Requests, which are received by all end hosts on the local link.
382	   Only the host being queried responds.  However, the other hosts still
383	   have to process and discard the request.  With IPv6, a host is
384	   required to join a Solicited-Node multicast group for each of its
385	   configured unicast or anycast addresses.  Because a Solicited-node
386	   Multicast Address is a function of the last 24 bits of an IPv6
387	   unicast or anycast address, the number of hosts that are subscribed
388	   to each Solicited-node Multicast Address would typically be one
389	   (there could be more because the mapping function is not a 1:1
390	   mapping).  Compared to ARP in IPv4, a host should not need to be
391	   interrupted as often to service Neighbor Solicitation requests.
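
   As an illustration of the mapping function, the following Python
   sketch derives a Solicited-node Multicast Address using the standard
   library's ipaddress module.  The prefix ff02::1:ff00:0/104 comes from
   the IPv6 addressing architecture (RFC 4291) rather than from this
   document, and the function name and sample addresses are
   illustrative.

```python
import ipaddress

# Solicited-node prefix ff02::1:ff00:0/104 (per RFC 4291).
SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Append the low-order 24 bits of a unicast or anycast address
    to the solicited-node prefix."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


# Two different addresses that share their low-order 24 bits land in
# the same group, which is why the mapping is not 1:1.
print(solicited_node("2001:db8::1:800:200e:8c6c"))
print(solicited_node("2001:db8:1::ff0e:8c6c"))
```

   Both sample addresses yield the same group, so a Neighbor
   Solicitation for either one reaches both hosts but no others on the
   link.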

393	7.2.  Direct Mapping for Multicast address resolution

395	   With IPv4 unicast address resolution, the translation of an IP
396	   address to a MAC address is done dynamically by ARP.  With multicast
397	   address resolution, the mapping from a multicast IP address to a
398	   multicast MAC address is derived from direct mapping.  In IPv4, the
399	   mapping is done by assigning the low-order 23 bits of the multicast
400	   IP address to fill the low-order 23 bits of the multicast MAC
401	   address.  When a host joins an IP multicast group, it instructs the
402	   data link layer to receive frames that match the MAC address that
403	   corresponds to the IP address of the multicast group.  The data link
404	   layer filters the frames and passes frames with matching destination
405	   addresses to the IP module.  Since the mapping from multicast IP
406	   address to a MAC address ignores 5 bits of the IP address, groups of
407	   32 multicast IP addresses are mapped to the same MAC address.  As a
408	   result, a multicast MAC address cannot be uniquely mapped to a
409	   multicast IPv4 address.  Planning is required within an organization
410	   to select IPv4 groups that differ in their low-order 23 bits, so that
411	   they do not end up sharing an L2 address.  Any multicast address in
412	   the [224-239].0.0.x and [224-239].128.0.x ranges should not be
413	   considered.  When sending IPv6 multicast packets on an Ethernet link,
414	   the corresponding destination MAC address is a direct mapping of the
415	   last 32 bits of the 128-bit IPv6 multicast address into the 48-bit
416	   MAC address.  It is possible for more than one IPv6 multicast address
417	   to map to the same 48-bit MAC address.
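
   Both direct mappings can be sketched in Python.  The MAC prefixes
   used here, 01:00:5e for IPv4 (RFC 1112) and 33:33 for IPv6 (RFC
   2464), are not stated in this document and are taken from those
   specifications; the function names are illustrative.

```python
import ipaddress


def _format_mac(value: int) -> str:
    """Render a 48-bit integer as a colon-separated MAC address."""
    return ":".join(f"{(value >> shift) & 0xFF:02x}"
                    for shift in range(40, -8, -8))


def ipv4_mcast_mac(group: str) -> str:
    """Copy the low-order 23 bits of the group into 01:00:5e:00:00:00."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    return _format_mac((0x01005E << 24) | (int(addr) & 0x7FFFFF))


def ipv6_mcast_mac(group: str) -> str:
    """Copy the last 32 bits of the group into 33:33:00:00:00:00."""
    addr = ipaddress.IPv6Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv6 multicast address")
    return _format_mac((0x3333 << 32) | (int(addr) & 0xFFFFFFFF))


# 224.0.0.5 (OSPF) and 239.128.0.5 differ only in bits that the
# mapping ignores, so they share one MAC address.
print(ipv4_mcast_mac("224.0.0.5"))
print(ipv4_mcast_mac("239.128.0.5"))
print(ipv6_mcast_mac("ff02::1:ff0e:8c6c"))
```

   The first two calls return the same MAC address, which shows why
   groups in the [224-239].0.0.x and [224-239].128.0.x ranges collide
   with link-local protocol traffic at L2.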

419	8.  Acknowledgements

421	   The authors would like to thank the many individuals who contributed
422	   opinions on the ARMD WG mailing list about this topic: Linda Dunbar,
423	   Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor
424	   Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli and Thomas
425	   Narten.

427	9.  IANA Considerations

429	   This memo includes no request to IANA.

431	10.  Security Considerations

433	   No new security considerations result from this document.

435	11.  Informative References

437	   [I-D.armd-problem-statement]
438	              Narten, T., Karir, M., and I. Foo,
439	              "draft-ietf-armd-problem-statement", February 2012.

441	   [I-D.pim-umf-problem-statement]
442	              Zhou, D., Deng, H., Shi, Y., Liu, H., and I. Bhattacharya,
443	              "draft-dizhou-pim-umf-problem-statement", October 2010.

445	   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
446	              "Protocol Independent Multicast - Sparse Mode (PIM-SM):

448	              Protocol Specification (Revised)", RFC 4601, August 2006.

450	   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for
451	              IP", RFC 4607, August 2006.

453	   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
454	              "Bidirectional Protocol Independent Multicast (BIDIR-
455	              PIM)", RFC 5015, October 2007.

457	Author's Address

459	   Mike McBride
460	   Huawei Technologies
461	   2330 Central Expressway
462	   Santa Clara, CA  95050
463	   USA

465	   Email: michael.mcbride@huawei.com