Internet Engineering Task Force                               M. McBride
Internet-Draft                                                    H. Liu
Intended status: Informational                       Huawei Technologies
Expires: September 4, 2012                                 March 3, 2012

                 Multicast in the Data Center Overview
                 draft-mcbride-armd-mcast-overview-00

Abstract

   There has been much interest in the issues surrounding massive
   numbers of hosts in the data center.  There was discussion, in the
   ARMD working group, of the issues surrounding non-ARP/ND multicast
   traffic in data centers with massive numbers of hosts.  This
   document provides a quick survey of multicast in the data center and
   should serve as an aid to further discussion of issues related to
   large amounts of multicast in the data center.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 4, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Multicast Applications in the Data Center
     2.1.  L3 Multicast Applications
     2.2.  L2 Multicast Applications
   3.  L2 Multicast Protocols in the Data Center
   4.  L3 Multicast Solutions in the Data Center
   5.  Challenges of Using Multicast in the Data Center
   6.  Layer 3 / Layer 2 Topological Variations
   7.  Acknowledgements
   8.  IANA Considerations
   9.  Security Considerations
   10. Informative References
   Authors' Addresses

1.  Introduction

   Data center servers often use IP multicast to send data to clients
   or other application servers.  IP multicast is expected to help
   conserve bandwidth in the data center and reduce the load on
   servers.  Increased reliance on multicast, in next-generation data
   centers, requires higher performance and capacity, especially from
   the switches.  If multicast is to continue to be used in the data
   center, it must scale well within and between data centers.  There
   has been much interest in the issues surrounding massive numbers of
   hosts in the data center.  There was discussion, in the ARMD working
   group, of the issues surrounding non-ARP/ND multicast traffic in
   data centers.  This document provides a quick survey of multicast in
   the data center and should serve as an aid to further discussion of
   issues related to multicast in the data center.

   ARP/ND issues are not addressed in this document.  They are
   addressed in [I-D.armd-problem-statement].

2.  Multicast Applications in the Data Center

   There are many data center operators who do not deploy multicast in
   their networks for scalability and stability reasons.  There are
   also many operators for whom multicast is critical and is enabled on
   their data center switches and routers.  For this latter group,
   there are several uses of multicast in their data centers.  An
   understanding of those uses is important in order to properly
   support these applications in ever-evolving data centers.  If, for
   instance, the majority of the applications are discovering/signaling
   each other using multicast, there may be better ways to support them
   than using multicast.  If, however, the multicasting of data is
   occurring in large volumes, there is a need for very good data
   center underlay/overlay multicast support.  The applications either
   fall into the category of those that leverage L2 multicast for
   discovery or of those that require L3 support and likely span
   multiple subnets.

2.1.  L3 Multicast Applications

   IPTV servers use multicast to deliver content from the data center
   to end users.  IPTV is typically a one-to-many application where the
   hosts are configured for IGMPv3, the switches are configured with
   IGMP snooping, and the routers are running PIM-SSM mode.  Often
   redundant servers send multicast streams into the network, and the
   network forwards the data across diverse paths.
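   As a receiver-side illustration, the sketch below shows how a host
   might issue the IGMPv3 source-specific join that such a deployment
   relies on.  It is a minimal, hypothetical Python example rather than
   any product's implementation: the IP_ADD_SOURCE_MEMBERSHIP value
   (39) and the ip_mreq_source layout are Linux-specific, and the
   group, source, and port values are placeholders (the group is taken
   from the 233.252.0.0/24 example range).

   # Hypothetical IGMPv3 source-specific (SSM) join, receiver side.
   # Assumes Linux: option value 39 and the field order of
   # struct ip_mreq_source below are Linux-specific.
   import socket

   GROUP = "233.252.0.1"    # example group address
   SOURCE = "192.0.2.10"    # example content server address
   PORT = 5004              # example UDP port

   IP_ADD_SOURCE_MEMBERSHIP = 39   # not exported by Python's socket module

   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
   sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
   sock.bind(("", PORT))

   # struct ip_mreq_source (Linux): imr_multiaddr, imr_interface,
   # imr_sourceaddr, each a 4-byte IPv4 address in network order.
   mreq = (socket.inet_aton(GROUP) +
           socket.inet_aton("0.0.0.0") +
           socket.inet_aton(SOURCE))
   sock.setsockopt(socket.IPPROTO_IP, IP_ADD_SOURCE_MEMBERSHIP, mreq)

   # Blocks until a packet from the (S,G) channel arrives.
   data, addr = sock.recvfrom(2048)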
   Windows Media servers send multicast streams to clients.  Windows
   Media Services streams to an IP multicast address, and all clients
   subscribe to that address to receive the same stream.  This allows a
   single stream to be played simultaneously by multiple clients and
   thus reduces bandwidth utilization.

   Market data relies extensively on IP multicast to deliver stock
   quotes from the data center to a financial services provider and
   then to the stock analysts.  The most critical requirement of a
   multicast trading floor is that it be highly available.  The network
   must be designed with no single point of failure and in a way that
   the network can respond in a deterministic manner to any failure.
   Typically, redundant servers (in a primary/backup or live-live mode)
   send multicast streams into the network, and the network forwards
   the data across diverse paths (when duplicate data is sent by
   multiple servers).

   With publish and subscribe servers, a separate message is sent to
   each subscriber of a publication.  With multicast publish/subscribe,
   only one message is sent, regardless of the number of subscribers.
   In a publish/subscribe system, client applications, some of which
   are publishers and some of which are subscribers, are connected to a
   network of message brokers that receive publications on a number of
   topics and send the publications on to the subscribers for those
   topics.  The more subscribers there are in the publish/subscribe
   system, the greater the improvement to network utilization there
   might be with multicast.

   With redundancy and load-balancing protocols, such as VRRP, routers
   communicate with one another using a multicast address.

   Overlays may use IP multicast to virtualize L2 multicast.  VXLAN,
   for instance, is an encapsulation scheme to carry L2 frames over L3
   networks.  The VXLAN Tunnel End Point (VTEP) encapsulates frames
   inside an L3 tunnel.  VXLANs are identified by a 24-bit VXLAN
   Network Identifier (VNI).  The VTEP maintains a table of known
   destination MAC addresses and stores, for each, the IP address of
   the tunnel to the remote VTEP to use.  Unicast frames between VMs
   are sent directly to the unicast L3 address of the remote VTEP.
   Multicast frames are sent to a multicast IP group associated with
   the VNI.  Underlying IP multicast protocols (PIM-SM/SSM/BIDIR) are
   then used to forward the encapsulated multicast data across the
   underlay network.
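   The sketch below illustrates only the mapping just described: how a
   hypothetical VTEP might choose between the unicast address of a
   remote VTEP and the multicast group assigned to the VNI.  The class
   name, addresses, and VNI value are invented for illustration, and
   the actual VXLAN encapsulation (UDP port, header format) is omitted.

   # Illustrative-only model of a VTEP's forwarding decision.
   from typing import Dict

   class VtepForwardingTable:
       def __init__(self, vni: int, vni_mcast_group: str) -> None:
           self.vni = vni
           self.vni_mcast_group = vni_mcast_group   # group assigned to this VNI
           self.mac_to_vtep: Dict[str, str] = {}    # inner MAC -> remote VTEP IP

       def learn(self, inner_src_mac: str, remote_vtep_ip: str) -> None:
           """Learn a remote MAC from a decapsulated frame."""
           self.mac_to_vtep[inner_src_mac] = remote_vtep_ip

       def outer_destination(self, inner_dst_mac: str) -> str:
           """Choose the outer IP destination for an inner frame."""
           is_group_mac = int(inner_dst_mac.split(":")[0], 16) & 0x01
           if is_group_mac or inner_dst_mac not in self.mac_to_vtep:
               # Broadcast, multicast, and unknown unicast go to the VNI's
               # group; the IP underlay (PIM-SM/SSM/BIDIR) replicates them
               # to all VTEPs in the VNI.
               return self.vni_mcast_group
           return self.mac_to_vtep[inner_dst_mac]   # known unicast: one tunnel

   # Example: VNI 5000 mapped to a group from the example range.
   table = VtepForwardingTable(vni=5000, vni_mcast_group="233.252.0.5")
   table.learn("00:11:22:33:44:55", "192.0.2.20")
   print(table.outer_destination("00:11:22:33:44:55"))   # 192.0.2.20
   print(table.outer_destination("ff:ff:ff:ff:ff:ff"))   # 233.252.0.5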
2.2.  L2 Multicast Applications

   Applications such as Ganglia use multicast for distributed
   monitoring of computing systems such as clusters and grids.

   Windows Server cluster node exchange relies upon the use of
   multicast heartbeats between servers.  Only the other interfaces in
   the same multicast group use the data.  Unlike broadcast, multicast
   traffic does not need to be flooded throughout the network, reducing
   the chance that unnecessary CPU cycles are expended filtering
   traffic on nodes outside the cluster.  As the number of nodes
   increases, the ability to replace several unicast messages with a
   single multicast message improves node performance and decreases
   network bandwidth consumption.  Multicast messages replace unicast
   messages in two components of clustering:

   o  Heartbeats: The clustering failure detection engine is based on a
      scheme whereby nodes send heartbeat messages to other nodes.
      Specifically, for each network interface, a node sends a
      heartbeat message to all other nodes with interfaces on that
      network.  Heartbeat messages are sent every 1.2 seconds.  In the
      common case where each node has an interface on each cluster
      network, there are N * (N - 1) unicast heartbeats sent per
      network every 1.2 seconds in an N-node cluster.  With multicast
      heartbeats, the message count drops to N multicast heartbeats per
      network every 1.2 seconds, because each node sends 1 message
      instead of N - 1.  This represents a reduction in processing
      cycles on the sending node and a reduction in network bandwidth
      consumed.

   o  Regroup: The clustering membership engine executes a regroup
      protocol during a membership view change.  The regroup protocol
      algorithm assumes the ability to broadcast messages to all
      cluster nodes.  To avoid unnecessary network flooding and to
      properly authenticate messages, the broadcast primitive is
      implemented by a sequence of unicast messages.  Converting the
      unicast messages to a single multicast message conserves
      processing power on the sending node and reduces network
      bandwidth consumption.

   Multicast addresses in the 224.0.0.x range are considered link-local
   multicast addresses.  They are used for protocol discovery and are
   flooded to every port.  For example, OSPF uses 224.0.0.5 and
   224.0.0.6 for neighbor and DR discovery.  These addresses are
   reserved and will not be constrained by IGMP snooping.  These
   addresses are not to be used by any application.

   These types of multicast applications should be able to be supported
   in data centers which support multicast.

3.  L2 Multicast Protocols in the Data Center

   The switches, in between the servers and the routers, rely upon IGMP
   snooping to constrain multicast traffic to the ports leading to
   interested hosts and to L3 routers.  A switch will, by default,
   flood multicast traffic to all the ports in a broadcast domain
   (VLAN).  IGMP snooping is designed to prevent hosts on a local
   network from receiving traffic for a multicast group they have not
   explicitly joined.  It provides switches with a mechanism to prune
   multicast traffic from links that do not contain a multicast
   listener (an IGMP client).  IGMP snooping is an L2 optimization for
   L3 IGMP.

   IGMP snooping, with proxy reporting or report suppression, actively
   filters IGMP packets in order to reduce load on the multicast
   router.  Joins and leaves heading upstream to the router are
   filtered so that only the minimal quantity of information is sent.
   The switch tries to ensure that the router has only a single entry
   for the group, regardless of how many active listeners there are.
   If there are two active listeners in a group and the first one
   leaves, the switch determines that the router does not need this
   information, since it does not affect the status of the group from
   the router's point of view.  However, the next time there is a
   routine query from the router, the switch will forward the reply
   from the remaining host, to prevent the router from believing there
   are no active listeners.  It follows that, in active IGMP snooping,
   the router will generally only know about the most recently joined
   member of the group.
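   The sketch below is a deliberately simplified, hypothetical model of
   the proxy-reporting behavior just described: the snooping switch
   keeps a member-port set per group and sends the router only the
   minimum it needs (the first join, the last leave, and one report per
   routine query).  Timers, IGMP versions, and packet formats are
   omitted; the class and method names are invented for illustration.

   # Simplified model of IGMP snooping with report suppression.
   from collections import defaultdict

   class SnoopingSwitch:
       def __init__(self) -> None:
           self.members = defaultdict(set)   # group -> set of member ports

       def host_join(self, group: str, port: int) -> str:
           first = not self.members[group]
           self.members[group].add(port)
           # Only the first listener changes the router's view of the group.
           return "forward join upstream" if first else "suppress join"

       def host_leave(self, group: str, port: int) -> str:
           self.members[group].discard(port)
           # The router only needs to know when the last listener goes away.
           if self.members[group]:
               return "suppress leave"
           return "forward leave upstream"

       def router_query(self, group: str) -> str:
           # Answer a routine query with a single report if any listener
           # remains, so the router does not conclude the group is idle.
           return "forward one report" if self.members[group] else "no report"

   sw = SnoopingSwitch()
   print(sw.host_join("233.252.0.7", 1))    # forward join upstream
   print(sw.host_join("233.252.0.7", 2))    # suppress join
   print(sw.host_leave("233.252.0.7", 1))   # suppress leave
   print(sw.router_query("233.252.0.7"))    # forward one report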
   In order for IGMP, and thus IGMP snooping, to function, a multicast
   router must exist on the network and generate IGMP queries.  The
   tables (holding the member ports for each multicast group) created
   for snooping are associated with the querier.  Without a querier,
   the tables are not created and snooping will not work.  Furthermore,
   IGMP general queries must be unconditionally forwarded by all
   switches involved in IGMP snooping.  Some IGMP snooping
   implementations include full querier capability.  Others are able to
   proxy and retransmit queries from the multicast router.

   In source-only networks, however, which presumably describe most
   data center networks, there are no IGMP hosts on switch ports to
   generate IGMP packets.  Switch ports are connected to multicast
   source ports and multicast router ports.  The switch typically
   learns about multicast groups from the multicast data stream by
   using a type of source-only learning (when only multicast data, and
   no IGMP packets, is received on the port).  The switch forwards
   traffic only to the multicast router ports.  When the switch
   receives traffic for new IP multicast groups, it will typically
   flood the packets to all ports in the same VLAN.  This unnecessary
   flooding can impact switch performance.

4.  L3 Multicast Solutions in the Data Center

   There are three flavors of PIM used for multicast routing in the
   data center: PIM-SM [RFC4601], PIM-SSM [RFC4607], and PIM-BIDIR
   [RFC5015].

   SSM provides the most efficient forwarding between sources and
   receivers and is most suitable for one-to-many types of multicast
   applications.  State is built for each (S,G) channel; therefore, the
   more sources and groups there are, the more state there is in the
   network.  BIDIR is the most efficient shared tree solution, as one
   tree is built for all (S,G)s, thereby saving state.  But it does not
   provide the most efficient forwarding path between sources and
   receivers.  SSM and BIDIR are optimizations of PIM-SM.  PIM-SM is
   still the most widely deployed multicast routing protocol.  PIM-SM
   can also be the most complex.  PIM-SM relies upon an RP (Rendezvous
   Point) to set up the multicast tree; it then either switches to the
   SPT (shortest path tree), similar to SSM, or stays on the shared
   tree (similar to BIDIR).  For massive numbers of hosts sending (and
   receiving) multicast, the shared tree (particularly with PIM-BIDIR)
   provides the best potential scaling, since no matter how many
   multicast sources exist within a VLAN, the number of trees stays the
   same.  IGMP snooping, IGMP proxy, and PIM-BIDIR have the potential
   to scale to the huge numbers required in a data center.
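   The state difference can be made concrete with a small, illustrative
   calculation.  The sketch below is a hypothetical back-of-the-
   envelope comparison that assumes every source sends to every group;
   real deployments will differ, but the scaling shape is the point.

   # Rough comparison of PIM state: one (S,G) entry per source per group
   # for SSM versus one shared (*,G) entry per group for BIDIR.
   def ssm_state(num_sources: int, num_groups: int) -> int:
       return num_sources * num_groups   # (S,G) per source per group

   def bidir_state(num_sources: int, num_groups: int) -> int:
       return num_groups                 # one shared (*,G) tree per group

   for sources, groups in [(10, 100), (1000, 1000)]:
       print(f"{sources} sources x {groups} groups: "
             f"SSM {ssm_state(sources, groups)} entries, "
             f"BIDIR {bidir_state(sources, groups)} entries")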
5.  Challenges of Using Multicast in the Data Center

   When IGMP/MLD snooping is not implemented, Ethernet switches will
   flood multicast frames out of all switch ports, which turns the
   traffic into something more like broadcast.

   VRRP uses a multicast heartbeat to communicate between routers.  The
   communication between the host and the default gateway is unicast.
   The multicast heartbeat can be very chatty when there are thousands
   of VRRP pairs with sub-second heartbeat calls back and forth.

   Link-local multicast should scale well within one IP subnet,
   particularly with a large Layer 3 domain extending down to the
   access or aggregation switches.  But if multicast traverses beyond
   one IP subnet, which is necessary for an overlay like VXLAN, there
   could be scaling concerns.  When using a VXLAN overlay, it is
   necessary either to map the L2 multicast in the overlay to L3
   multicast in the underlay, or to do head-end replication in the
   overlay and receive duplicate frames on the first link from the
   router to the core switch.  The solution could be to run potentially
   thousands of PIM messages to generate/maintain the required
   multicast state in the IP underlay.  The behavior of the upper
   layer, with respect to broadcast/multicast, affects the choice of
   head-end, (*,G), or (S,G) replication in the underlay, which in turn
   affects the opex and capex of the entire solution.  A VXLAN with
   thousands of logical groups maps either to head-end replication in
   the hypervisor or to IGMP from the hypervisor and then PIM between
   the ToR and core switches and the gateway router.

   Requiring IP multicast (especially PIM-BIDIR) from the network can
   prove challenging for data center operators, especially at the kind
   of scale that the VXLAN/NVGRE proposals require.  This is also true
   when the L2 topological domain is large and extended all the way to
   the L3 core.  In data centers with highly virtualized servers, even
   small L2 domains may spread across many server racks (i.e., multiple
   switches and router ports).

   It is not uncommon for there to be 10-20 VMs per server in a
   virtualized environment.  One vendor reported a customer requesting
   a scale of 400 VMs per server.  For multicast to be a viable
   solution in this environment, the network needs to be able to scale
   to these numbers when these VMs are sending/receiving multicast.

   A lot of switching/routing hardware has problems with IP multicast,
   particularly with regard to hardware support of PIM-BIDIR.

   Sending L2 multicast over a campus or data center backbone, in any
   sort of significant way, is a new challenge enabled for the first
   time by overlays.  There are interesting challenges when pushing
   large amounts of multicast traffic through a network; these have
   thus far been dealt with using purpose-built networks.  While the
   overlay proposals have been careful not to impose new protocol
   requirements, they have not addressed the issues of performance and
   scalability, nor the large-scale availability of these protocols.

   There is an unnecessary multicast stream flooding problem in the
   link layer switches between the multicast source and the PIM First
   Hop Router (FHR).  The IGMP snooping switch will forward multicast
   streams to router ports, and the PIM FHR must receive all multicast
   streams even if there is no request from a receiver.  This often
   leads to wasted switch cache and link bandwidth when the multicast
   streams are not actually required.  [I-D.pim-umf-problem-statement]
   details the problem and defines design goals for a generic mechanism
   to restrain the unnecessary multicast stream flooding.
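   To make the replication trade-off discussed above more concrete, the
   following hypothetical sketch counts copies of a single tenant frame
   on the source hypervisor's uplink under head-end replication versus
   underlay multicast, along with the underlay group state the latter
   implies.  The function names and numbers are illustrative only.

   # Illustrative trade-off: copies on the first-hop link vs. PIM state.
   def headend_copies_on_uplink(remote_vteps: int) -> int:
       return remote_vteps              # one unicast copy per remote VTEP

   def underlay_multicast_copies_on_uplink(remote_vteps: int) -> int:
       return 1                         # the underlay tree does the fan-out

   def underlay_groups(vnis: int, groups_per_vni: int = 1) -> int:
       # State the underlay must carry, e.g. one (*,G) per VNI with BIDIR.
       return vnis * groups_per_vni

   remotes, vnis = 200, 4000
   print(headend_copies_on_uplink(remotes))             # 200 copies, one link
   print(underlay_multicast_copies_on_uplink(remotes))  # 1 copy on that link
   print(underlay_groups(vnis))                         # 4000 groups of state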
6.  Layer 3 / Layer 2 Topological Variations

   As discussed in [I-D.armd-problem-statement], there are a variety of
   topological data center variations, including L3 to Access Switches,
   L3 to Aggregation Switches, and L3 in the Core only.  Further
   analysis is needed in order to understand how these variations
   affect IP multicast scalability.

7.  Acknowledgements

   The authors would like to thank the many individuals who contributed
   opinions on the ARMD WG mailing list about this topic: Linda Dunbar,
   Anoop Ghanwani, Peter Ashwoodsmith, David Allan, Aldrin Isaac, Igor
   Gashinsky, Michael Smith, Patrick Frejborg, Joel Jaeggli, and Thomas
   Narten.

8.  IANA Considerations

   This memo includes no request to IANA.

9.  Security Considerations

   No security considerations at this time.

10.  Informative References

   [I-D.armd-problem-statement]
              Narten, T., Karir, M., and I. Foo,
              "draft-ietf-armd-problem-statement", February 2012.

   [I-D.pim-umf-problem-statement]
              Zhou, D., Deng, H., Shi, Y., Liu, H., and I. Bhattacharya,
              "draft-dizhou-pim-umf-problem-statement", October 2010.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601, August 2006.

   [RFC4607]  Holbrook, H. and B. Cain, "Source-Specific Multicast for
              IP", RFC 4607, August 2006.

   [RFC5015]  Handley, M., Kouvelas, I., Speakman, T., and L. Vicisano,
              "Bidirectional Protocol Independent Multicast (BIDIR-
              PIM)", RFC 5015, October 2007.

Authors' Addresses

   Mike McBride
   Huawei Technologies
   2330 Central Expressway
   Santa Clara, CA  95050
   USA

   Email: michael.mcbride@huawei.com

   Helen Liu
   Huawei Technologies
   Building Q14, No. 156, Beiqing Rd.
   Beijing  100095
   China

   Email: helen.liu@huawei.com