ARMD                                                          L. Dunbar
Internet Draft                                                   Huawei
Category: Informational                                       W. Kumari
                                                                 Google
                                                           I. Gashinsky
                                                                  Yahoo

Expires: November 2012                                     July 3, 2012

          Practices for Scaling ARP/ND in Large Data Centers
            draft-dunbar-armd-arp-nd-scaling-practices-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 30, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document describes some simple, well-established practices that
   can scale ARP/ND in data center environments.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Introduction ................................................ 3
   2. Terminology ................................................. 3
   3. Potential Solutions to Scale Address Resolution in DC ....... 4
      3.1. Layer 3 to Access Switches ............................. 4
      3.2. Practices to scale ARP/ND in Layer 2 ................... 5
           3.2.1. When a station needs to communicate with an
                  external peer ................................... 5
           3.2.2. L2/L3 boundary router processing of inbound
                  traffic ......................................... 6
           3.2.3. Inter-subnet communications ..................... 7
      3.3. Static ARP/ND entries on switches ...................... 7
      3.4. DNS-based solution ..................................... 7
      3.5. ARP/ND Proxy approaches ................................ 8
      3.6. Overlay models ......................................... 9
   4. Summary and Recommendations ................................. 10
   5. Manageability Considerations ................................ 10
   6. Security Considerations ..................................... 10
   7. IANA Considerations ......................................... 10
   8. Acknowledgements ............................................ 10
   9. References .................................................. 11
   Authors' Addresses ............................................. 11

1. Introduction

   As described in [ARMD-Problem], the trend toward rapid workload
   shifting and server virtualization in modern data centers requires
   servers to be loaded (or re-loaded) with different VMs or
   applications at different times.  The different VMs loaded onto one
   physical server may have different IP addresses and may even be in
   different IP subnets.

   In order to allow a physical server to be re-loaded with VMs in
   different subnets, or VMs to be moved to different server racks
   without IP address re-configuration, the corresponding networks have
   to support multiple broadcast domains (many VLANs) on the interfaces
   of L2/L3 boundary routers and ToR switches.  Unfortunately, this
   kind of network design can lead to address resolution scaling
   issues, especially on the L2/L3 boundary routers, when the combined
   number of VMs (or hosts) in all those subnets is large.

   This document describes some potential solutions that can minimize
   ARP/ND scaling issues in a data center environment.

2. Terminology

   ARP:          IPv4 Address Resolution Protocol [RFC826]

   Aggregation Switch:  A Layer 2 switch interconnecting ToR switches

   Bridge:       IEEE 802.1Q-compliant device.  In this document,
                 "Bridge" is used interchangeably with "Layer 2
                 switch".
   DC:           Data Center

   DA:           Destination Address

   End Station:  A VM or physical server whose address is either the
                 destination or the source of a data frame.

   EOR:          End-of-Row switch in a data center.

   NA:           IPv6 Neighbor Advertisement

   ND:           IPv6 Neighbor Discovery [RFC4861]

   NS:           IPv6 Neighbor Solicitation

   SA:           Source Address

   Station:      A node that is either the destination or the source
                 of a data frame.

   ToR:          Top-of-Rack switch, also known as an access switch.

   UNA:          IPv6 Unsolicited Neighbor Advertisement

   VM:           Virtual Machine

3. Potential Solutions to Scale Address Resolution in DC

   Data center operators have indicated that the following solutions
   are used to scale ARP/ND:

   1) Layer 3 connectivity to the access switch,

   2) practices to scale ARP/ND in Layer 2,

   3) static ARP/ND entries,

   4) DNS-based approaches, and

   5) extensions to proxy ARP [RFC1027].

   No single solution fits all cases.  This section describes the
   common practices for each type of solution.

3.1. Layer 3 to Access Switches

   This refers to a network design with Layer 3 extended to the access
   switches.

   As described in [ARMD-Problem], many data centers are designed this
   way so that ARP/ND broadcast/multicast messages are confined to a
   few ports (interfaces) of the access switches (i.e., ToR switches).

   Another variant of the Layer 3 solution is Layer 3 all the way to
   the servers, or even to the VMs.  The ARP/ND broadcast/multicast
   messages are then further confined to the small number of VMs within
   each server, or eliminated entirely.

   Advantage: Both ARP and ND scale well.  There is no address
   resolution issue in this design.

   Disadvantage: IP addresses have to be re-configured on the switches
   whenever a server needs to be re-loaded with an application in a
   different subnet, or VMs need to be moved to a different location.
   Summary: This solution is best suited to data centers that have a
   static workload, or whose operators can properly re-configure IP
   addresses/subnets on switches before any workload change.  No
   protocol changes are suggested.

3.2. Practices to scale ARP/ND in Layer 2

   L2/L3 boundary routers can be heavily impacted by the ARP/ND
   broadcast/multicast messages in a Layer 2 domain, especially with a
   large number of VMs and subnets.  This section describes some
   commonly used practices for reducing the ARP/ND processing required
   on L2/L3 boundary routers.

3.2.1. When a station needs to communicate with an external peer

   When the external peer is in a different subnet, the originating end
   station needs to send ARP/ND requests to its default gateway router
   to get the router's MAC address.  If many subnets are enabled on the
   gateway router, with a large combined number of end stations in all
   those subnets, the gateway router has to process a very large number
   of ARP/ND requests.  This is often CPU intensive, as such
   requests/responses are processed by the CPU and not in hardware.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is to have the L2/L3 boundary router send periodic
   gratuitous ARP messages, so that all the connected end stations can
   refresh their ARP caches.  As a result, most end stations, if not
   all, won't send ARP requests to gateway routers when they need to
   communicate with external peers.

   However, IPv6 end stations are still required to send unicast ND
   messages to their default gateway router, even when the gateway
   router periodically sends Unsolicited Neighbor Advertisements.
   This is because IPv6 requires bidirectional path validation before
   a data packet can be sent.

   Advantage: Reduction of the ARP requests to be processed by the
   L2/L3 boundary router for IPv4.
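   As an illustration, the frame a gateway might broadcast for such a
   periodic refresh can be sketched as follows.  This is a minimal,
   hypothetical example: the field layout follows [RFC826], but the
   gateway MAC/IP values are placeholders, and implementations vary in
   whether the announcement uses the ARP request or reply opcode.

```python
import struct

def build_gratuitous_arp(gw_mac: bytes, gw_ip: bytes) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP announcement.

    In a gratuitous ARP the gateway announces its own IP-to-MAC
    binding: the sender and target protocol addresses are both the
    gateway's IP, and the frame is broadcast so every end station in
    the L2 domain can refresh its ARP cache.
    """
    bcast = b"\xff" * 6
    # Ethernet header: broadcast destination, gateway source,
    # EtherType 0x0806 (ARP).
    eth_header = bcast + gw_mac + struct.pack("!H", 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # hardware / protocol address lengths
        2,        # opcode 2 (reply); some implementations use 1
        gw_mac, gw_ip,   # sender MAC / sender IP (the gateway's own)
        bcast, gw_ip,    # target MAC (ignored) / target IP = sender IP
    )
    return eth_header + arp_payload

frame = build_gratuitous_arp(b"\x00\x11\x22\x33\x44\x55",
                             bytes([192, 0, 2, 1]))
```

   A gateway following this practice would send such a frame, per VLAN,
   at an interval shorter than the end stations' ARP cache timeout.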
   Disadvantage: No reduction of ND processing on the L2/L3 boundary
   router for IPv6 traffic.

   Recommendation: Use for IPv4-only networks, or change the ND
   protocol to allow data frames to be sent without requiring
   bidirectional path validation.

3.2.2. L2/L3 boundary router processing of inbound traffic

   When an L2/L3 boundary router receives a data frame from the L3
   domain and the destination is not in the router's ARP/ND cache, the
   router usually holds the packet and triggers an ARP/ND request to
   make sure the target actually exists in its L2 domain.  The router
   may need to send multiple ARP/ND requests until either a timeout is
   reached or an ARP/ND reply is received before forwarding the data
   packets towards the target's MAC address.  This process is not only
   CPU intensive but also buffer intensive.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is for the L2/L3 boundary router to snoop ARP messages, so
   that its ARP cache can be refreshed with the active addresses in
   its L2 domain.  As a result, there is an increased likelihood of
   the router's ARP cache having the IP-MAC entry when it receives
   data frames from external peers.

   For IPv6 end stations, a router is still expected to send unicast
   ND messages even if it has snooped UNA/NS/NA messages from those
   stations.  Therefore, this practice doesn't help IPv6 very much.

   Advantage: Reduction of the ARP requests that routers have to send
   upon receiving IPv4 packets, and of the number of IPv4 data frames
   from external peers that routers have to hold.

   Disadvantage: The amount of ND processing on routers for IPv6
   traffic is not reduced.  Even for IPv4, routers still need to hold
   data packets from external peers and trigger ARP requests if the
   targets of the data packets either don't exist or are not very
   active.

   Recommendation: Do not use with IPv6, or make protocol changes to
   IPv6's ND.
   For IPv4, if there is a high chance of routers receiving data
   packets destined to non-existent or inactive targets, alternative
   approaches should be considered.

3.2.3. Inter-subnet communications

   The router is hit twice when the originating and destination
   stations are in different subnets under the same router: once when
   the originating station in subnet A sends an ARP/ND request to the
   L2/L3 boundary router (Section 3.2.1 above), and a second time when
   the L2/L3 boundary router sends ARP/ND requests to the target in
   subnet B (Section 3.2.2 above).

   Again, the practices described in Sections 3.2.1 and 3.2.2 can
   alleviate problems in IPv4 networks but don't help very much for
   IPv6.

   Advantage: Reduction of ARP processing on L2/L3 boundary routers
   for IPv4 traffic.  For IPv6 traffic, however, there is no reduction
   of ND processing on L2/L3 boundary routers.

   Recommendation: Do not use with IPv6, or consider other approaches.

3.3. Static ARP/ND entries on switches

   In a data center environment, the placement of applications on
   servers, racks, and rows may be orchestrated by Server (or VM)
   Management System(s).  It is therefore possible for static ARP/ND
   entries to be downloaded to switches, routers, or servers.

   Advantage: This methodology has been used to reduce ARP/ND
   fluctuations in large-scale data center networks.

   Disadvantage: There is no well-defined mechanism for switches to
   get prompt incremental updates of static ARP/ND entries when
   changes occur, or to perform certain steps when switches go through
   a reset.

   Recommendation: The IETF should create a well-defined mechanism (or
   protocol) for switches or servers to get static ARP/ND entries.

3.4. DNS-based solution

   This solution is best suited to environments where applications
   resolve the addresses of the destinations they need to communicate
   with via DNS, and periodically refresh these addresses.
   While this solution is very well known and extensively used, it is
   mainly appropriate for stateless services, or for services that
   have a large number of short-lived connections.  While simple, this
   technique may not be appropriate for generic VM migration.

   If a VM can get a new IP address when it is moved to a new
   location, the migration steps are:

   -  Instantiate the service on a VM in a distant rack.  The new VM
      gets a new IP address.

   -  Change the address of the service in DNS.

   -  Wait for the DNS TTL to expire.  While you are waiting, watch
      the number of connections to the new VM increase and the number
      of connections to the old VM decrease.

   -  Wait a little longer.  When the number of connections to the old
      VM reaches zero, shut down the old VM.

   Advantage: DNS is an existing technology, and this is a well-known,
   commonly practiced technique.

   Disadvantage: This approach is not suitable for multi-tenant
   scenarios where each tenant needs to use its own address space, or
   when the data center operator does not have full control of the
   addresses used by stations/VMs.

   Summary: Use is limited to deployments where the data center
   operator is in control of the entire application and runs the DNS.
   More appropriate for service migration than for VM migration.

3.5. ARP/ND Proxy approaches

   RFC 1027 specifies one ARP proxy approach.  Since RFC 1027 was
   published in 1987, many variants of ARP proxy have been deployed.
   The term "ARP Proxy" is a loaded phrase, with different
   interpretations depending on vendors and/or environments.
   RFC 1027's ARP proxy is for a gateway to return its own MAC address
   on behalf of the target station.  Another technique, also called
   "ARP Proxy", is for a ToR switch to snoop ARP requests and return
   the target station's MAC address if the ToR has the information.
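   The difference between these two flavors of "ARP Proxy" can be
   sketched as follows.  This is an illustrative model, not any
   vendor's implementation; the class, its method names, and the
   example addresses are assumptions made for the sketch.

```python
class TorArpProxy:
    """Sketch of the ToR-switch flavor of "ARP Proxy": the switch
    snoops ARP traffic to learn IP-to-MAC bindings, then answers later
    ARP requests itself instead of flooding them into the L2 domain."""

    def __init__(self, own_mac: str):
        self.own_mac = own_mac
        self.cache = {}  # snooped IP -> MAC bindings

    def snoop(self, sender_ip: str, sender_mac: str) -> None:
        # Called for every ARP request/reply seen on an access port.
        self.cache[sender_ip] = sender_mac

    def answer_request(self, target_ip: str, rfc1027_style: bool = False):
        """Decide how to handle an ARP request for target_ip.

        RFC 1027 style: the gateway answers with *its own* MAC and
        forwards traffic on the target's behalf.  ToR style: answer
        with the *target's* MAC if it was snooped earlier; otherwise
        flood the request as a normal switch would.
        """
        if rfc1027_style:
            return ("reply", self.own_mac)
        if target_ip in self.cache:
            return ("reply", self.cache[target_ip])
        return ("flood", None)
```

   For example, after the proxy snoops a station at 192.0.2.10, a
   request for that address is answered locally with the station's own
   MAC, whereas the RFC 1027 mode returns the gateway's MAC regardless
   of the target.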
   Advantage: Proxy ARP [RFC1027] and its variants have allowed multi-
   subnet ARP traffic for over a decade.

   Disadvantage: The proxy ARP protocol [RFC1027] was developed prior
   to the concept of VLANs and for hosts that don't support subnets.

   Recommendation: Revise RFC 1027 with VLAN support and make it scale
   for the data center environment.

3.6. Overlay models

   There are several drafts on using overlay networks to scale large
   Layer 2 networks and enable mobility (e.g., draft-wkumari-dcops-l3-
   vmmobility-00, draft-mahalingam-dutt-dcops-vxlan-00).  TRILL and
   IEEE 802.1ah (MAC-in-MAC) are other types of overlay networks used
   to scale Layer 2.

   Overlay networks hide the VMs' addresses from the interior switches
   and routers.  The overlay edge nodes, which perform the network
   address encapsulation/decapsulation, still see the addresses of all
   remote stations that communicate with locally attached stations.

   For a large data center with tens of thousands of applications
   communicating with peers outside the data center, all those
   applications' IP addresses are visible to external peers.  When a
   great number of VMs move freely within a data center, those VMs'
   IP addresses might not aggregate well on the gateway routers,
   causing the forwarding table size to explode.

   When a gateway router receives a data frame from external peers
   destined to a target within the data center, it needs to resolve
   both the target's MAC address and the overlay edge node's address
   in order to perform the proper overlay encapsulation.

   Therefore, the overlay network will have a bottleneck at the
   gateway router(s) in resolving target stations' physical addresses
   (MAC or IP) and overlay edge addresses within the data center.

   Here are some approaches being used to minimize this problem:

   1. Use static mapping, as described in Section 3.3.

   2. Have multiple gateway nodes (i.e., routers), each handling a
      subset of the station addresses that are visible to external
      peers, e.g., Gateway #1 handles one set of prefixes, Gateway #2
      handles another set, etc.  This architecture assumes that each
      gateway has enough downstream ports to be connected to all
      server racks.

   If each server rack is allowed to instantiate VMs/applications with
   any IP addresses, or any VM is allowed to move anywhere without re-
   configuring its IP/MAC addresses, each gateway has to resolve
   addresses that are potentially located on any server rack.  The
   address resolution processing load on each gateway can still be
   very heavy.

4. Summary and Recommendations

   This memo describes some common practices that can alleviate the
   impact of address resolution on L2/L3 gateway routers.

   In data centers, no single solution fits all deployments.  This
   memo has summarized five different practices for various scenarios,
   along with the advantages and disadvantages of each.

   In some of these scenarios, the common practices could be improved
   by creating new IETF protocols and/or extending existing ones.
   These protocol change recommendations are:

   -  Extend the IPv6 ND protocol,

   -  Create an incremental "download" scheme for static ARP/ND
      entries,

   -  Revise proxy ARP [RFC1027] for use in the data center.

5. Manageability Considerations

   This document recommends practices intended to improve the
   manageability of data centers.

6. Security Considerations

   Security will be addressed in a separate document.

7. IANA Considerations

   This document does not request any action from IANA.

8. Acknowledgements

   We want to acknowledge the following people for their valuable
   input to this draft: T. Sridhar, Ron Bonica, Kireeti Kompella, and
   K. K. Ramakrishnan.

9. References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC826]   Plummer, D., "An Ethernet Address Resolution Protocol",
              RFC 826, November 1982.

   [RFC1027]  Carl-Mitchell, S. and J. Quarterman, "Using ARP to
              Implement Transparent Subnet Gateways", RFC 1027,
              October 1987.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              September 2007.

   [ARMD-Problem]  Narten, T., "draft-ietf-armd-problem-statement",
              work in progress, October 2011.

   [DC-ARCH]  Karir, M., et al., "draft-karir-armd-datacenter-
              reference-arch", work in progress.

   [Gratuitous ARP]  Cheshire, S., "IPv4 Address Conflict Detection",
              RFC 5227, July 2008.

Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: ldunbar@huawei.com

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US
   Email: warren@kumari.net

   Igor Gashinsky
   Yahoo
   45 West 18th Street, 6th floor
   New York, NY 10011
   Email: igor@yahoo-inc.com