ARMD                                                         L. Dunbar
Internet Draft                                                  Huawei
Intended status: Informational                               W. Kumari
Expires: February 2013                                          Google
                                                        Igor Gashinsky
                                                                 Yahoo
                                                       August 31, 2012

       Practices for scaling ARP and ND for large data centers

            draft-dunbar-armd-arp-nd-scaling-practices-03

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on February 28, 2013.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Internet-Draft          Practices to scale ARP/ND in large DC

Abstract

This draft documents some simple practices that scale ARP/ND in data center environments.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119].

Table of Contents

1. Introduction
2. Terminology
3. Common DC Network Designs
4. Layer 3 to Access Switches
5. Layer 2 practices to scale ARP/ND
   5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers
        5.1.1. Station communicating with an external peer
        5.1.2. L2/L3 boundary router processing of inbound traffic
        5.1.3. Inter-subnet communications
   5.2. Static ARP/ND entries on switches
   5.3. ARP/ND Proxy approaches
6. Practices to scale ARP/ND in Overlay models
7. Summary and Recommendations
8. Security Considerations
9. IANA Considerations
10. Acknowledgements
11. References
    11.1. Normative References
    11.2. Informative References
Authors' Addresses

1. Introduction

As described in [ARMD-Problem], the increasing trend of rapid workload shifting and server virtualization in modern data centers requires servers to be loaded (or re-loaded) with different VMs or applications at different times. Different VMs residing on one physical server may have different IP addresses, or may even be in different IP subnets.
In order to allow a physical server to be loaded with VMs in different subnets, or VMs to be moved to different server racks without IP address re-configuration, the corresponding networks need to enable multiple broadcast domains (many VLANs) on the interfaces of L2/L3 boundary routers and ToR switches. Unfortunately, when the combined number of VMs (or hosts) in all those subnets is large, this can lead to address resolution scaling issues, especially on the L2/L3 boundary routers.

This draft documents some simple practices which can scale ARP/ND in data center environments.

2. Terminology

This document reuses much of the terminology from [ARMD-Problem]. Many of the definitions are presented here to aid the reader.

ARP: IPv4 Address Resolution Protocol [RFC826]

Aggregation Switch: A Layer 2 switch interconnecting ToR switches

Bridge: An IEEE 802.1Q-compliant device. In this draft, "Bridge" is used interchangeably with "Layer 2 switch".

DC: Data Center

DA: Destination Address

End Station: A VM or physical server whose address is either the destination or the source of a data frame.

EOR: End-of-Row switch in a data center.

NA: IPv6's Neighbor Advertisement

ND: IPv6's Neighbor Discovery [RFC4861]

NS: IPv6's Neighbor Solicitation

SA: Source Address

Station: A node which is either a destination or source of a data frame.

ToR: Top-of-Rack switch (also known as access switch).

UNA: IPv6's Unsolicited Neighbor Advertisement

VM: Virtual Machine

3. Common DC Network Designs

Some common network designs for data centers include:

1) Layer 3 connectivity to the access switch,

2) Large Layer 2,

3) Overlay models.

There is no single network design that fits all cases.
The following sections document some of the common practices used to scale address resolution under each network design.

4. Layer 3 to Access Switches

This refers to the network design with Layer 3 to the access switches.

As described in [ARMD-Problem], many data centers are architected so that ARP/ND broadcast/multicast messages are confined to a few ports (interfaces) of the access switches (i.e., ToR switches).

Another variant of the Layer 3 solution is Layer 3 all the way to the servers (or even to the VMs), which confines the ARP/ND broadcast/multicast messages to the small number of VMs within the server.

Advantage: Both ARP and ND scale well. There are no address resolution issues in this design.

Disadvantage: The main disadvantage of this network design is that IP addresses have to be re-configured on switches when a server needs to be re-loaded with an application in a different subnet, or when VMs need to be moved to a different location.

Summary: This solution is more suitable for data centers which have static workloads and/or network operators who can re-configure IP addresses/subnets on switches before any workload change. No protocol changes are suggested.

5. Layer 2 practices to scale ARP/ND

5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers

The ARP/ND broadcast/multicast messages in a Layer 2 domain can negatively affect the L2/L3 boundary routers, especially with a large number of VMs and subnets. This section describes some commonly used practices for reducing the ARP/ND processing required on L2/L3 boundary routers.

5.1.1. Station communicating with an external peer

When the external peer is in a different subnet, the originating end station needs to send ARP/ND requests to its default gateway router to resolve the router's MAC address.
If there are many subnets on the gateway router and a large number of end stations in those subnets, the gateway router has to process a very large number of ARP/ND requests. This is often CPU intensive, as ARP/ND messages are usually processed by the CPU (and not in hardware).

Solution: For IPv4 networks, a practice to alleviate this problem is to have the L2/L3 boundary router send periodic gratuitous ARP messages [Gratuitous ARP], so that all the connected end stations can refresh their ARP caches. As a result, most (if not all) end stations will not need to ARP for the gateway routers when they need to communicate with external peers.

However, because IPv6 requires bi-directional path validation, IPv6 end stations are still required to send unicast ND messages to their default gateway router (even with those routers periodically sending Unsolicited Neighbor Advertisements).

Advantage: Reduction of the ARP requests to be processed by the L2/L3 boundary router for IPv4.

Disadvantage: No reduction of ND processing on the L2/L3 boundary router for IPv6 traffic.

Recommendation: Use for IPv4-only networks, or make changes to the ND protocol to allow data frames to be sent without requiring bi-directional frame validation. Some work in progress in this area is [Impatient-NUD].

5.1.2. L2/L3 boundary router processing of inbound traffic

When an L2/L3 boundary router receives a data frame whose destination is not in the router's ARP/ND cache, some routers hold the packet and trigger an ARP/ND request to resolve the L2 address. The router may need to send multiple ARP/ND requests until either a timeout is reached or an ARP/ND reply is received before forwarding the data packets towards the target's MAC address. This process is not only CPU intensive but also buffer intensive.
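The hold-and-resolve behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name, the retry limit, and the per-target buffer cap below are invented for the example and are not taken from any particular router implementation.

```python
# Illustrative limits (assumptions for this sketch; real routers differ).
ARP_RETRY_LIMIT = 3      # ARP/ND requests triggered per unresolved target
HELD_PACKET_LIMIT = 10   # packets buffered per unresolved target

class ArpOnMissRouter:
    """Simulates a router that, on an ARP/ND cache miss, holds the
    packet and triggers address resolution for the target."""

    def __init__(self, send_arp_request):
        self.cache = {}             # ip -> mac (the ARP/ND cache)
        self.held = {}              # ip -> packets buffered awaiting resolution
        self.requests_sent = {}     # ip -> ARP/ND requests already triggered
        self.send_arp_request = send_arp_request

    def forward(self, dst_ip, packet):
        mac = self.cache.get(dst_ip)
        if mac is not None:
            return ("forwarded", mac)          # fast path: cache hit
        # Slow path: buffer the packet (memory cost) and ARP (CPU cost).
        queue = self.held.setdefault(dst_ip, [])
        if len(queue) < HELD_PACKET_LIMIT:
            queue.append(packet)
        if self.requests_sent.get(dst_ip, 0) < ARP_RETRY_LIMIT:
            self.requests_sent[dst_ip] = self.requests_sent.get(dst_ip, 0) + 1
            self.send_arp_request(dst_ip)
        return ("held", None)

    def on_arp_reply(self, ip, mac):
        # Resolution succeeded: cache the mapping and release held packets.
        self.cache[ip] = mac
        self.requests_sent.pop(ip, None)
        return [("forwarded", mac, p) for p in self.held.pop(ip, [])]
```

Packets towards non-existent or inactive targets keep the slow path busy, which is why the practices in this section try to keep the cache populated (for example, from snooped ARP traffic or static entries) before inbound data arrives.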
Solution: For IPv4 networks, a common practice to alleviate this problem is for the router to snoop ARP messages, so that its ARP cache can be refreshed with active addresses in the L2 domain. As a result, there is an increased likelihood of the router's ARP cache having the IP-MAC entry when it receives data frames from external peers.

For IPv6 end stations, routers are supposed to send unicast ND messages even if they have snooped UNA/NS/NA messages from those stations. Therefore, this practice doesn't help IPv6 very much.

Advantage: Reduction of the number of ARP requests which routers have to send upon receiving IPv4 packets, and of the number of IPv4 data frames from external peers which routers have to hold.

Disadvantage: The amount of ND processing on routers for IPv6 traffic is not reduced. Even for IPv4, routers still need to hold data packets from external peers and trigger ARP requests if the targets of the data packets either don't exist or are not very active.

Recommendation: Do not use with IPv6, or make protocol changes to IPv6's ND. For IPv4, if there is a higher chance of routers receiving data packets destined to non-existent or inactive targets, alternative approaches should be considered.

5.1.3. Inter-subnet communications

The router will be hit twice when the originating and destination stations are in different subnets under the same router: once for the originating station in subnet-A initiating an ARP/ND request to the L2/L3 boundary router (5.1.1 above), and a second time for the L2/L3 boundary router initiating ARP/ND requests to the target in subnet-B (5.1.2 above).

Again, the practices described in 5.1.1 and 5.1.2 can alleviate problems in IPv4 networks, but don't help very much for IPv6.

Advantage: Reduction of ARP processing on L2/L3 boundary routers for IPv4 traffic.
For IPv6 traffic, there is no reduction of ND processing on L2/L3 boundary routers.

Recommendation: Do not use with IPv6, or consider other approaches.

5.2. Static ARP/ND entries on switches

In a data center environment, the placement of L2 and L3 addressing may be orchestrated by Server (or VM) Management System(s). Therefore, it may be possible for static ARP/ND entries to be configured on routers and/or servers.

Advantage: This methodology has been used to reduce ARP/ND fluctuations in large-scale data center networks.

Disadvantage: There is no well-defined mechanism for devices to get prompt incremental updates of static ARP/ND entries when changes occur.

Recommendation: The IETF should consider creating a standard mechanism (or protocol) for switches or servers to get incremental updates of static ARP/ND entries.

5.3. ARP/ND Proxy approaches

RFC1027 specifies one ARP proxy approach. Since the publication of RFC1027 in 1987, many variants of ARP proxy have been deployed. The term "ARP Proxy" is a loaded phrase, with different interpretations depending on vendors and/or environments. RFC1027's ARP Proxy is for a gateway to return its own MAC address on behalf of the target station. Another technique, also called "ARP Proxy", is for a ToR switch to snoop ARP requests and return the target station's MAC address if the ToR has the information.

Advantage: Proxy ARP [RFC1027] and its variants have enabled multi-subnet ARP operation for over two decades.

Disadvantage: The Proxy ARP protocol [RFC1027] was developed for hosts which don't support subnets.

Recommendation: Revise RFC1027 with VLAN support and make it scale for the data center environment.

6. Practices to scale ARP/ND in Overlay models

There are several drafts on using overlay networks to scale large Layer 2 networks (or avoid the need for large L2 networks) and enable mobility (e.g., draft-wkumari-dcops-l3-vmmobility-00, draft-mahalingam-dutt-dcops-vxlan-00). TRILL and IEEE 802.1ah (MAC-in-MAC) are other types of overlay networks used to scale Layer 2.

Overlay networks hide the VMs' addresses from the interior switches and routers, thereby relieving those routers from having to perform ARP/ND services for as many addresses. The overlay edge nodes which perform the network address encapsulation/decapsulation still see all the remote stations' addresses which communicate with stations attached locally.

For a large data center with many applications, these applications' IP addresses need to be reachable by external peers. Therefore, the overlay network may have a bottleneck at the gateway device(s) in resolving target stations' physical addresses (MAC or IP) and overlay edge addresses within the data center.

Here are some approaches being used to minimize the problem:

1. Use static mapping as described in Section 5.2.

2. Have multiple gateway nodes (i.e., routers), with each handling a subset of the station addresses which are visible to external peers; e.g., Gateway #1 handles one set of prefixes, Gateway #2 handles another set, etc.

7. Summary and Recommendations

This memo describes some common practices which can alleviate the impact of address resolution on L2/L3 gateway routers.

In data centers, no single solution fits all deployments. This memo has summarized some practices for various scenarios, along with the advantages and disadvantages of each.
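The second approach listed in Section 6, splitting the externally visible prefixes across multiple gateway nodes, can be sketched with Python's standard ipaddress module. The prefix-to-gateway table below is a hypothetical example, not taken from this draft or from any deployment.

```python
import ipaddress

# Hypothetical assignment of externally visible prefixes to gateways,
# so each gateway performs address resolution for only a subset of stations.
PREFIX_TO_GATEWAY = {
    ipaddress.ip_network("192.0.2.0/25"):    "gateway-1",
    ipaddress.ip_network("192.0.2.128/25"):  "gateway-2",
    ipaddress.ip_network("198.51.100.0/24"): "gateway-3",
}

def gateway_for(ip_str):
    """Longest-prefix match over the partitioned gateway table."""
    ip = ipaddress.ip_address(ip_str)
    matches = [net for net in PREFIX_TO_GATEWAY if ip in net]
    if not matches:
        return None  # no gateway advertises this address externally
    best = max(matches, key=lambda net: net.prefixlen)
    return PREFIX_TO_GATEWAY[best]
```

Because each gateway only resolves (and caches) addresses inside its own prefixes, the ARP/ND and mapping-resolution load is divided across the gateways rather than concentrated on one device.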
In some of these scenarios, the common practices could be improved by creating new IETF protocols and/or extending existing ones. These protocol change recommendations are:

- Extend the IPv6 ND method,

- Create an incremental "download" scheme for static ARP/ND entries,

- Revise Proxy ARP [RFC1027] for use in the data center.

8. Security Considerations

This draft documents existing solutions and proposes additional work that could be initiated to extend various IETF protocols to better scale ARP/ND for the data center environment. As such, we do not believe that it introduces any security concerns.

9. IANA Considerations

This document does not request any action from IANA.

10. Acknowledgements

We want to acknowledge the following people for their valuable inputs to this draft: T. Sridhar, Ron Bonica, Kireeti Kompella, and K.K. Ramakrishnan.

11. References

11.1. Normative References

[ARMD-Problem] Narten, T., "Problem Statement for ARMD", draft-ietf-armd-problem-statement (http://datatracker.ietf.org/doc/draft-ietf-armd-problem-statement/), August 2012.

[Gratuitous ARP] Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227, July 2008.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC826] Plummer, D.C., "An Ethernet Address Resolution Protocol", RFC 826, November 1982.

[RFC1027] Carl-Mitchell, S. and J. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987.

[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

11.2. Informative References

[DC-ARCH] Karir, M., et al., "draft-karir-armd-datacenter-reference-arch", work in progress.

[Impatient-NUD] Nordmark, E. and I. Gashinsky, "draft-ietf-6man-impatient-nud", work in progress.

Authors' Addresses

Linda Dunbar
Huawei Technologies
5340 Legacy Drive, Suite 175
Plano, TX 75024, USA
Phone: (469) 277 5840
Email: ldunbar@huawei.com

Warren Kumari
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
US
Email: warren@kumari.net

Igor Gashinsky
Yahoo
45 West 18th Street, 6th floor
New York, NY 10011
Email: igor@yahoo-inc.com