ARMD                                                          L. Dunbar
Internet Draft                                                   Huawei
Intended status: Informational                                W. Kumari
Expires: February 2013                                           Google
                                                         Igor Gashinsky
                                                                  Yahoo
                                                        August 31, 2012

        Practices for scaling ARP and ND for Large Data Centers
             draft-dunbar-armd-arp-nd-scaling-practices-02

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire in February 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Internet-Draft       Practices to scale ARP/ND in large DC

Abstract

   This draft documents some simple practices that scale ARP/ND in data
   center environments.
Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Introduction
   2. Terminology
   3. Common DC network Designs
   4. Layer 3 to Access Switches
   5. Layer 2 practices to scale ARP/ND
      5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary
           routers
         5.1.1. Station communicating with an external peer
         5.1.2. L2/L3 boundary router processing of inbound traffic
         5.1.3. Inter-subnet communications
      5.2. Static ARP/ND entries on switches
      5.3. ARP/ND Proxy approaches
   6. Practices to scale ARP/ND in Overlay models
   7. Summary and Recommendations
   8. Security Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References
   Authors' Addresses

1. Introduction

   As described in [ARMD-Problem], the increasing trend of rapid
   workload shifting and server virtualization in modern data centers
   requires servers to be loaded (or re-loaded) with different VMs or
   applications at different times.  Different VMs residing on one
   physical server may have different IP addresses, or may even be in
   different IP subnets.

   In order to allow a physical server to be loaded with VMs in
   different subnets, or VMs to be moved to different server racks
   without IP address re-configuration, the corresponding networks need
   to enable multiple broadcast domains (many VLANs) on the interfaces
   of L2/L3 boundary routers and ToR switches.  Unfortunately, when the
   combined number of VMs (or hosts) in all those subnets is large,
   this can lead to address resolution scaling issues, especially on
   the L2/L3 boundary routers.

   This draft documents some simple practices which can scale ARP/ND in
   data center environments.

2. Terminology

   This document reuses much of the terminology from [ARMD-Problem].
   Many of the definitions are presented here to aid the reader.

   ARP:  IPv4 Address Resolution Protocol [RFC826]

   Aggregation Switch:  A Layer 2 switch interconnecting ToR switches

   Bridge:  IEEE 802.1Q compliant device.  In this draft, "Bridge" is
      used interchangeably with "Layer 2 switch".

   DC:  Data Center

   DA:  Destination Address

   End Station:  VM or physical server, whose address is either the
      destination or the source of a data frame.

   EOR:  End of Row switch in a data center.

   NA:  IPv6 Neighbor Advertisement

   ND:  IPv6 Neighbor Discovery [RFC4861]

   NS:  IPv6 Neighbor Solicitation

   SA:  Source Address

   Station:  A node which is either the destination or the source of a
      data frame.

   ToR:  Top of Rack switch (also known as access switch).
   UNA:  IPv6 Unsolicited Neighbor Advertisement

   VM:  Virtual Machine

3. Common DC network Designs

   Some common network designs for data centers include:

   1) Layer 3 connectivity to the access switch,

   2) Large Layer 2,

   3) Overlay models.

   There is no single network design that fits all cases.  The
   following sections document some of the common practices to scale
   address resolution under each network design.

4. Layer 3 to Access Switches

   This refers to the network design with Layer 3 to the access
   switches.

   As described in [ARMD-Problem], many data centers are architected so
   that ARP/ND broadcast/multicast messages are confined to a few ports
   (interfaces) of the access switches (i.e., ToR switches).

   Another variant of the Layer 3 solution is Layer 3 all the way to
   the servers (or even to the VMs), which confines the ARP/ND
   broadcast/multicast messages to the small number of VMs within the
   server.

   Advantage: Both ARP and ND scale well.  There are no address
   resolution issues in this design.

   Disadvantage: The main disadvantage of this network design is that
   IP addresses have to be re-configured on switches when a server
   needs to be re-loaded with an application in a different subnet or
   when VMs need to be moved to a different location.

   Summary: This solution is more suitable for data centers which have
   static workloads and/or network operators who can re-configure IP
   addresses/subnets on switches before any workload change.  No
   protocol changes are suggested.

5. Layer 2 practices to scale ARP/ND

5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers

   The ARP/ND broadcast/multicast messages in a Layer 2 domain can
   negatively affect the L2/L3 boundary routers, especially with a
   large number of VMs and subnets.  This section describes some
   commonly used practices for reducing the ARP/ND processing required
   on L2/L3 boundary routers.

5.1.1. Station communicating with an external peer

   When the external peer is in a different subnet, the originating end
   station needs to send ARP/ND requests to its default gateway router
   to resolve the router's MAC address.  If there are many subnets on
   the gateway router and a large number of end stations in those
   subnets, the gateway router has to process a very large number of
   ARP/ND requests.  This is often CPU intensive, as ARP/ND messages
   are usually processed by the CPU (and not in hardware).

   Solution: For IPv4 networks, a practice to alleviate this problem is
   to have the L2/L3 boundary router send periodic gratuitous ARP
   messages, so that all the connected end stations can refresh their
   ARP caches.  As a result, most (if not all) end stations will not
   need to ARP for the gateway router when they need to communicate
   with external peers.

   However, because IPv6 requires bi-directional path validation, IPv6
   end stations are still required to send unicast ND messages to their
   default gateway router (even with those routers periodically sending
   Unsolicited Neighbor Advertisements).

   Advantage: Reduction of ARP requests to be processed by the L2/L3
   boundary router for IPv4.

   Disadvantage: No reduction of ND processing on the L2/L3 boundary
   router for IPv6 traffic.

   Recommendation: Use for IPv4-only networks, or make changes to the
   ND protocol to allow data frames to be sent without requiring
   bi-directional path validation.  Some work in progress in this area
   is [Impatient-NUD].

5.1.2. L2/L3 boundary router processing of inbound traffic

   When an L2/L3 boundary router receives a data frame whose
   destination is not in the router's ARP/ND cache, some routers hold
   the packet and trigger an ARP/ND request to resolve the L2 address.
   The router may need to send multiple ARP/ND requests until either a
   timeout is reached or an ARP/ND reply is received before forwarding
   the data packets towards the target's MAC address.  This process is
   not only CPU intensive but also buffer intensive.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is for the router to snoop ARP messages, so that its ARP
   cache can be refreshed with active addresses in the L2 domain.  As a
   result, there is an increased likelihood of the router's ARP cache
   having the IP-MAC entry when it receives data frames from external
   peers.

   For IPv6 end stations, routers are supposed to send unicast ND
   messages even if they have snooped UNA/NS/NA messages from those
   stations.  Therefore, this practice doesn't help IPv6 very much.

   Advantage: Reduction of the number of ARP requests which routers
   have to send upon receiving IPv4 packets, and of the number of IPv4
   data frames from external peers which routers have to hold.

   Disadvantage: The amount of ND processing on routers for IPv6
   traffic is not reduced.  Even for IPv4, routers still need to hold
   data packets from external peers and trigger ARP requests if the
   targets of the data packets either don't exist or are not very
   active.

   Recommendation: Do not use with IPv6, or make protocol changes to
   IPv6's ND.  For IPv4, if there is a higher chance of routers
   receiving data packets towards non-existent or inactive targets,
   alternative approaches should be considered.

5.1.3. Inter-subnet communications

   The router will be hit twice when the originating and destination
   stations are in different subnets under the same router: once for
   the originating station in subnet-A initiating an ARP/ND request to
   the L2/L3 boundary router (5.1.1 above), and a second time for the
   L2/L3 boundary router initiating ARP/ND requests to the target in
   subnet-B (5.1.2 above).

   Again, the practices described in 5.1.1 and 5.1.2 can alleviate
   problems in IPv4 networks, but don't help very much for IPv6.

   Advantage: Reduction of ARP processing on L2/L3 boundary routers for
   IPv4 traffic.

   For IPv6 traffic, there is no reduction of ND processing on L2/L3
   boundary routers.

   Recommendation: Do not use with IPv6, or consider other approaches.

5.2. Static ARP/ND entries on switches

   In a data center environment the placement of L2 and L3 addressing
   may be orchestrated by Server (or VM) Management System(s).
   Therefore it may be possible for static ARP/ND entries to be
   configured on routers and/or servers.

   Advantage: This methodology has been used to reduce ARP/ND
   fluctuations in large scale data center networks.

   Disadvantage: There is no well-defined mechanism for devices to get
   prompt incremental updates of static ARP/ND entries when changes
   occur.

   Recommendation: The IETF should consider creating a standard
   mechanism (or protocols) for switches or servers to get incremental
   updates of static ARP/ND entries.

5.3. ARP/ND Proxy approaches

   [RFC1027] specifies one ARP proxy approach.  Since the publication
   of RFC 1027 in 1987 there have been many variants of ARP proxy
   deployed.  The term "ARP Proxy" is a loaded phrase, with different
   interpretations depending on vendors and/or environments.
   RFC 1027's ARP Proxy is for a gateway to return its own MAC address
   on behalf of the target station.  Another technique, also called
   "ARP Proxy", is for a ToR switch to snoop ARP requests and return
   the target station's MAC address if the ToR has the information.
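   As an illustration (not part of this draft or of RFC 1027), the
   ToR-style variant can be sketched as follows: the switch keeps a
   table of IP-to-MAC bindings learned by snooping, and answers an ARP
   request directly when the target is known, instead of flooding it.
   The ARP payload layout follows RFC 826; the table contents and
   addresses below are made-up examples.

```python
import struct

# Hypothetical sketch of a ToR-style "ARP Proxy": answer a snooped ARP
# request from a locally learned IP-to-MAC table.  The 28-byte ARP payload
# layout follows RFC 826 (fields: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP).
ARP_FMT = "!HHBBH6s4s6s4s"

def build_arp_reply(request, table):
    """Return an ARP reply payload if the target IP is in the snooped
    table, else None (the request would then be flooded as usual)."""
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = \
        struct.unpack(ARP_FMT, request)
    if oper != 1:                 # only answer requests (oper=1)
        return None
    target_mac = table.get(tpa)   # MAC learned earlier by snooping
    if target_mac is None:
        return None
    # Reply (oper=2) on behalf of the target: sender/target roles swap.
    return struct.pack(ARP_FMT, htype, ptype, hlen, plen, 2,
                       target_mac, tpa, sha, spa)

# Example: station 10.0.0.1 asks for 10.0.0.2, whose MAC the ToR snooped.
table = {bytes([10, 0, 0, 2]): bytes.fromhex("0200deadbeef")}
request = struct.pack(ARP_FMT, 1, 0x0800, 6, 4, 1,
                      bytes.fromhex("0200c0ffee01"), bytes([10, 0, 0, 1]),
                      b"\x00" * 6, bytes([10, 0, 0, 2]))
reply = build_arp_reply(request, table)
```

   The same skeleton also describes the RFC 1027 gateway behavior: the
   only difference is that the gateway would answer with its own MAC
   address rather than the target station's.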
   Advantage: Proxy ARP [RFC1027] and its variants have allowed multi-
   subnet ARP traffic for decades.

   Disadvantage: The Proxy ARP protocol [RFC1027] was developed for
   hosts which don't support subnets.

   Recommendation: Revise RFC 1027 with VLAN support and make it scale
   for data center environments.

6. Practices to scale ARP/ND in Overlay models

   There are several drafts on using overlay networks to scale large
   Layer 2 networks (or avoid the need for large L2 networks) and
   enable mobility (e.g., draft-wkumari-dcops-l3-vmmobility-00,
   draft-mahalingam-dutt-dcops-vxlan-00).  TRILL and IEEE 802.1ah
   (MAC-in-MAC) are other types of overlay networks to scale Layer 2.

   Overlay networks hide the VMs' addresses from the interior switches
   and routers, thereby relieving the routers from having to perform
   ARP/ND services for as many addresses.  The overlay edge nodes which
   perform the network address encapsulation/decapsulation still see
   all the remote station addresses which communicate with stations
   attached locally.

   For a large data center with many applications, these applications'
   IP addresses need to be reachable by external peers.  Therefore, the
   overlay network may have a bottleneck at the gateway device(s) in
   resolving target stations' addresses (MAC or IP) and overlay edge
   addresses within the data center.

   Here are some approaches being used to minimize the problem:

   1. Use static mapping as described in Section 5.2.

   2. Have multiple gateway nodes (i.e., routers), with each handling a
      subset of the station addresses which are visible to external
      peers, e.g., Gateway #1 handles one set of prefixes, Gateway #2
      handles another set, etc.

7. Summary and Recommendations

   This memo describes some common practices which can alleviate the
   impact of address resolution on L2/L3 gateway routers.

   In data centers, no single solution fits all deployments.  This memo
   has summarized some practices in various scenarios and the
   advantages and disadvantages of these practices.

   In some of these scenarios, the common practices could be improved
   by creating new IETF protocols and/or extending existing ones.
   These protocol change recommendations are:

   - Extend the IPv6 ND method,

   - Create an incremental "download" scheme for static ARP/ND
     entries,

   - Revise Proxy ARP [RFC1027] for use in the data center.

8. Security Considerations

   This draft documents existing solutions and proposes additional work
   that could be initiated to extend various IETF protocols to better
   scale ARP/ND for the data center environment.  As such, we do not
   believe that it introduces any security concerns.

9. IANA Considerations

   This document does not request any action from IANA.

10. Acknowledgements

   We want to acknowledge the following people for their valuable input
   to this draft: T. Sridhar, Ron Bonica, Kireeti Kompella, and
   K.K. Ramakrishnan.

11. References

11.1. Normative References

   [ARMD-Problem]   Narten, et al., "draft-ietf-armd-problem-
                    statement", work in progress, October 2011.

   [RFC2119]        Bradner, S., "Key words for use in RFCs to Indicate
                    Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC826]         Plummer, D.C., "An Ethernet Address Resolution
                    Protocol", RFC 826, November 1982.

   [Gratuitous ARP] Cheshire, S., "IPv4 Address Conflict Detection",
                    RFC 5227, July 2008.

11.2. Informative References

   [RFC1027]        Carl-Mitchell, S. and J. Quarterman, "Using ARP to
                    Implement Transparent Subnet Gateways", RFC 1027,
                    October 1987.

   [RFC4861]        Narten, T., Nordmark, E., Simpson, W., and H.
                    Soliman, "Neighbor Discovery for IP version 6
                    (IPv6)", RFC 4861, September 2007.

   [DC-ARCH]        Karir, et al., "draft-karir-armd-datacenter-
                    reference-arch", work in progress.

   [Impatient-NUD]  Nordmark, E. and I. Gashinsky, "draft-ietf-6man-
                    impatient-nud", work in progress.

Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: ldunbar@huawei.com

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US
   Email: warren@kumari.net

   Igor Gashinsky
   Yahoo
   45 West 18th Street 6th floor
   New York, NY 10011
   Email: igor@yahoo-inc.com