ARMD                                                     L. Dunbar
Internet Draft                                              Huawei
Intended status: Informational                           W. Kumari
Expires: June 2013                                          Google
                                                    Igor Gashinsky
                                                             Yahoo
                                                 December 11, 2012

Practices for scaling ARP and ND for large data centers

draft-dunbar-armd-arp-nd-scaling-practices-04

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance
with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other
than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be
accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on June 11, 2013.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as
the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's
Legal Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date
of publication of this document. Please review these
documents carefully, as they describe your rights and
restrictions with respect to this document.

Internet-Draft     Practices to scale ARP/ND in large DC

Abstract

This draft documents some simple practices that scale ARP/ND
in data center environments.

Table of Contents

1. Introduction
2. Terminology
3. Common DC network Designs
4. Layer 3 to Access Switches
5. Layer 2 practices to scale ARP/ND
   5.1. Practices to alleviate ARP/ND burden on L2/L3
        boundary routers
      5.1.1. Station communicating with an external peer
      5.1.2. L2/L3 boundary router processing of inbound
             traffic
      5.1.3. Inter subnets communications
   5.2. Static ARP/ND entries on switches
   5.3. ARP/ND Proxy approaches
6. Practices to scale ARP/ND in Overlay models
7. Summary and Recommendations
8. Security Considerations
9. IANA Considerations
10. Acknowledgements
11. References
   11.1. Normative References
   11.2. Informative References
Authors' Addresses

1. Introduction

As described in [ARMD-Problem], the increasing trend of
rapid workload shifting and server virtualization in modern
data centers requires servers to be loaded (or re-loaded)
with different VMs or applications at different times.
Different VMs residing on one physical server may have
different IP addresses, or may even be in different IP
subnets.

In order to allow a physical server to be loaded with VMs in
different subnets, or VMs to be moved to different server
racks without IP address re-configuration, the corresponding
networks need to enable multiple broadcast domains (many
VLANs) on the interfaces of L2/L3 boundary routers and ToR
switches. Unfortunately, when the combined number of VMs (or
hosts) in all those subnets is large, this can lead to
address resolution scaling issues, especially on the L2/L3
boundary routers.

This draft documents some simple practices which can scale
ARP/ND in data center environments.

2. Terminology

This document reuses much of the terminology from
[ARMD-Problem]. Many of the definitions are presented here
to aid the reader.

ARP: IPv4 Address Resolution Protocol [RFC826]

Aggregation Switch: A Layer 2 switch interconnecting ToR
switches

Bridge: An IEEE 802.1Q compliant device. In this draft,
Bridge is used interchangeably with Layer 2 switch.

DC: Data Center

DA: Destination Address

End Station: VM or physical server, whose address is
either a destination or the source of a data frame.

EOR: End of Row switch in a data center.

NA: IPv6's Neighbor Advertisement

ND: IPv6's Neighbor Discovery [RFC4861]

NS: IPv6's Neighbor Solicitation

SA: Source Address

Station: A node which is either a destination or source of a
data frame.

ToR: Top of Rack Switch (also known as access switch).

UNA: IPv6's Unsolicited Neighbor Advertisement

VM: Virtual Machine

3. Common DC network Designs

Some common network designs for data centers include:

1) Layer 3 connectivity to the access switch,

2) Large Layer 2,

3) Overlay models

There is no single network design that fits all cases. The
following sections document some of the common practices to
scale Address Resolution under each network design.

4. Layer 3 to Access Switches

This refers to the network design with Layer 3 to the access
switches.

As described in [ARMD-Problem], many data centers are
architected so that ARP/ND broadcast/multicast messages are
confined to a few ports (interfaces) of the access switches
(i.e. ToR switches).

Another variant of the Layer 3 solution is Layer 3 all the
way to servers (or even to the VMs), which confines the
ARP/ND broadcast/multicast messages to the small number of
VMs within the server.

Advantage: Both ARP and ND scale well. There are no address
resolution issues in this design.

Disadvantage: The main disadvantage of this network design
is that IP addresses have to be re-configured on switches
when a server needs to be re-loaded with an application in a
different subnet or when VMs need to be moved to a different
location.

Summary: This solution is more suitable for data centers
which have static workloads and/or network operators who can
re-configure IP addresses/subnets on switches before any
workload change. No protocol changes are suggested.

5. Layer 2 practices to scale ARP/ND

5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary
routers

The ARP/ND broadcast/multicast messages in a Layer 2 domain
can negatively affect the L2/L3 boundary routers, especially
with a large number of VMs and subnets. This section
describes some commonly used practices for reducing the
ARP/ND processing required on L2/L3 boundary routers.

5.1.1. Station communicating with an external peer

When the external peer is in a different subnet, the
originating end station needs to send ARP/ND requests to its
default gateway router to resolve the router's MAC address.
If there are many subnets on the gateway router and a large
number of end stations in those subnets, the gateway router
has to process a very large number of ARP/ND requests. This
is often CPU intensive as ARP/ND are usually processed by
the CPU (and not in hardware).

Solution: For IPv4 networks, a practice to alleviate this
problem is to have the L2/L3 boundary router send periodic
gratuitous ARP [GratuitousARP] messages, so that all the
connected end stations can refresh their ARP caches. As a
result, most (if not all) end stations will not need to ARP
for the gateway routers when they need to communicate with
external peers.
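
As an illustration only, the frame a router could broadcast for
this practice can be sketched as below. The sketch assumes the ARP
Announcement layout of RFC 5227 (an ARP Request whose sender and
target protocol addresses both carry the router's own IPv4
address). The function name and the example addresses are
hypothetical, and the sketch only builds the frame; sending it
periodically is left to the platform.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
BROADCAST_MAC = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6-byte form."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_gratuitous_arp(router_mac: str, router_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP Announcement.

    Hypothetical helper: sender and target IP are both the router's
    address, and the target hardware address is all zeroes.
    """
    src_mac = mac_to_bytes(router_mac)
    ip = socket.inet_aton(router_ip)

    # Ethernet header: broadcast destination, router source, ARP type.
    eth_header = BROADCAST_MAC + src_mac + struct.pack("!H", ETHERTYPE_ARP)

    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4,
    # opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp_body = src_mac + ip + b"\x00" * 6 + ip

    return eth_header + arp_header + arp_body


frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.1")
```

End stations that already hold an entry for the gateway and accept
such announcements refresh that entry, so they have less reason to
send their own ARP requests to the router.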

For IPv6, however, end stations are still required to send
unicast ND messages to their default gateway router (even
when those routers periodically send Unsolicited Neighbor
Advertisements), because IPv6 requires bi-directional path
validation.

Advantage: Reduction of ARP requests to be processed by
L2/L3 boundary router for IPv4.

Disadvantage: No reduction of ND processing on L2/L3
boundary router for IPv6 traffic.

Recommendation: Use for IPv4-only networks, or change the ND
protocol to allow data frames to be sent without requiring
bi-directional frame validation. Some work in progress in
this area is described in [Impatient-NUD].

5.1.2. L2/L3 boundary router processing of inbound traffic

When an L2/L3 boundary router receives a data frame and the
destination is not in the router's ARP/ND cache, some
routers hold the packet and trigger an ARP/ND request to
resolve the L2 address. The router may need to send multiple
ARP/ND requests until either a timeout is reached or an
ARP/ND reply is received before forwarding the data packets
towards the target's MAC address. This process is not only
CPU intensive but also buffer intensive.

Solution: For IPv4 networks, a common practice to alleviate
this problem is for the router to snoop ARP messages, so
that its ARP cache can be refreshed with active addresses in
the L2 domain. As a result, there is an increased likelihood
of the router's ARP cache having the IP-MAC entry when it
receives data frames from external peers.

For IPv6 end stations, routers are supposed to send unicast
ND messages even if they have snooped UNA/NS/NA messages
from those stations. Therefore, this practice doesn't help
IPv6 very much.
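
The IPv4 snooping practice above can be sketched as a small cache
that learns the sender IP-MAC pair from every ARP message the
router observes. This is a minimal sketch under stated
assumptions: the class name, the aging policy, and the 300-second
default are hypothetical and are not taken from any router
implementation. The caller passes the current time explicitly so
the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    mac: str
    last_seen: float  # time (in seconds) the mapping was last observed


class SnoopingArpCache:
    """Hypothetical ARP cache refreshed from snooped ARP messages."""

    def __init__(self, max_age_seconds: float = 300.0) -> None:
        self.max_age_seconds = max_age_seconds
        self._entries: Dict[str, CacheEntry] = {}

    def snoop(self, sender_ip: str, sender_mac: str, now: float) -> None:
        """Record the sender fields of any observed ARP message."""
        self._entries[sender_ip] = CacheEntry(sender_mac, now)

    def lookup(self, ip: str, now: float) -> Optional[str]:
        """Return the MAC for ip, or None if unknown or aged out."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        if now - entry.last_seen > self.max_age_seconds:
            # Stale entry: drop it so the router resolves it again.
            del self._entries[ip]
            return None
        return entry.mac


cache = SnoopingArpCache(max_age_seconds=60.0)
cache.snoop("10.0.0.5", "00:11:22:33:44:55", now=0.0)
hit = cache.lookup("10.0.0.5", now=30.0)    # still fresh
miss = cache.lookup("10.0.0.5", now=100.0)  # aged out
```

A lookup that returns None corresponds to the case where the
router still has to hold the packet and send an ARP request.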

Advantage: Reduction of the number of ARP requests which
routers have to send upon receiving IPv4 packets and the
number of IPv4 data frames from external peers which routers
have to hold.

Disadvantage: The amount of ND processing on routers for
IPv6 traffic is not reduced. Even for IPv4, routers still
need to hold data packets from external peers and trigger
ARP requests if the targets of the data packets either don't
exist or are not very active.

Recommendation: Do not use with IPv6, or make protocol
changes to IPv6's ND. For IPv4, if there is a higher chance
of routers receiving data packets destined for non-existent
or inactive targets, alternative approaches should be
considered.

5.1.3. Inter subnets communications

The router is hit twice when the originating and destination
stations are in different subnets under the same router:
once when the originating station in subnet-A initiates an
ARP/ND request to the L2/L3 boundary router (5.1.1 above),
and a second time when the L2/L3 boundary router initiates
ARP/ND requests to the target in subnet-B (5.1.2 above).

Again, the practices described in 5.1.1 and 5.1.2 can
alleviate problems in IPv4 networks, but don't help very
much for IPv6.

Advantage: Reduction of ARP processing on L2/L3 boundary
routers for IPv4 traffic.

Disadvantage: For IPv6 traffic, there is no reduction of ND
processing on L2/L3 boundary routers.

Recommendation: Do not use with IPv6 or consider other
approaches.

5.2. Static ARP/ND entries on switches

In a data center environment the placement of L2 and L3
addressing may be orchestrated by Server (or VM) Management
System(s). Therefore it may be possible for static ARP/ND
entries to be configured on routers and/or servers.
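
As a rough sketch of how a management system might push such
entries, the example below renders one command per IP-MAC pair.
It assumes a Linux host with the iproute2 "ip neigh" command,
which covers both IPv4 (ARP) and IPv6 (ND) neighbor entries. The
inventory contents, the interface name, and the function name are
hypothetical; the draft does not prescribe any particular
mechanism.

```python
from typing import Dict, List


def render_static_neighbor_commands(
    inventory: Dict[str, str], device: str
) -> List[str]:
    """Render one 'ip neigh replace' command per IP-to-MAC mapping.

    'nud permanent' marks the entry as static, so the kernel does
    not age it out or re-resolve it.
    """
    return [
        f"ip neigh replace {ip} lladdr {mac} dev {device} nud permanent"
        for ip, mac in sorted(inventory.items())
    ]


# Hypothetical inventory exported by a VM management system.
inventory = {
    "10.1.1.10": "00:11:22:33:44:55",
    "2001:db8::10": "00:11:22:33:44:66",
}

commands = render_static_neighbor_commands(inventory, "eth0")
for command in commands:
    print(command)
```

The sketch only prints the commands; applying them, and keeping
them in step with VM placement changes, is the part for which no
well-defined mechanism exists.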

Advantage: This methodology has been used to reduce ARP/ND
fluctuations in large scale data center networks.

Disadvantage: There is no well-defined mechanism for devices
to get prompt incremental updates of static ARP/ND entries
when changes occur.

Recommendation: The IETF should consider creating a standard
mechanism (or protocols) for switches or servers to get
incremental updates of static ARP/ND entries.

5.3. ARP/ND Proxy approaches

RFC1027 specifies one ARP proxy approach. Since the
publication of RFC1027 in 1987, many variants of ARP proxy
have been deployed. The term "ARP Proxy" is a loaded phrase,
with different interpretations depending on vendors and/or
environments. RFC1027's ARP Proxy is for a Gateway to return
its own MAC address on behalf of the target station. Another
technique, also called "ARP Proxy", is for a ToR switch to
snoop ARP requests and return the target station's MAC if
the ToR has the information.

Advantage: Proxy ARP [RFC1027] and its variants have allowed
multi-subnet ARP traffic for over a decade.

Disadvantage: The Proxy ARP protocol [RFC1027] was developed
for hosts which don't support subnets.

Recommendation: Revise RFC1027 with VLAN support and make it
scale for the data center environment.

6. Practices to scale ARP/ND in Overlay models

There are several drafts on using overlay networks to scale
large Layer 2 networks (or avoid the need for large L2
networks) and enable mobility (e.g.
draft-wkumari-dcops-l3-vmmobility-00,
draft-mahalingam-dutt-dcops-vxlan-00). TRILL and
IEEE 802.1ah (MAC-in-MAC) are other types of overlay network
to scale Layer 2.

Overlay networks hide the VMs' addresses from the interior
switches and routers, thereby stopping the router from
having to perform ARP/ND services for as many addresses. The
Overlay Edge nodes which perform the network address
encapsulation/decapsulation still see the addresses of all
remote stations that communicate with locally attached
stations.

For a large data center with many applications, these
applications' IP addresses need to be reachable by external
peers. Therefore, the overlay network may have a bottleneck
at the gateway device(s) when resolving target stations'
physical addresses (MAC or IP) and overlay edge addresses
within the data center.

Here are some approaches being used to minimize the problem:

1. Use static mapping as described in Section 5.2.

2. Have multiple gateway nodes (i.e. routers), with each
   handling a subset of the station addresses which are
   visible to external peers, e.g. Gateway #1 handles a
   set of prefixes, Gateway #2 handles another subset of
   prefixes, etc.

7. Summary and Recommendations

This memo describes some common practices which can
alleviate the impact of address resolution on L2/L3 gateway
routers.

In data centers, no single solution fits all deployments.
This memo has summarized some practices in various
scenarios, along with the advantages and disadvantages of
each practice.

In some of these scenarios, the common practices could be
improved by creating and/or extending existing IETF
protocols. These protocol change recommendations are:

- Extend the IPv6 ND method,

- Create an incremental "download" scheme for static
  ARP/ND entries,

- Revise Proxy ARP [RFC1027] for use in the data center.

8. Security Considerations

This draft documents existing solutions and proposes
additional work that could be initiated to extend various
IETF protocols to better scale ARP/ND for the data center
environment. As such we do not believe that this introduces
any security concerns.

9. IANA Considerations

This document does not request any action from IANA.

10. Acknowledgements

We want to acknowledge the following people for their
valuable inputs to this draft: T. Sridhar, Ron Bonica,
Kireeti Kompella, and K.K. Ramakrishnan.

11. References

11.1. Normative References

[ARMD-Problem] Narten, "Problem Statement for ARMD"
   (http://datatracker.ietf.org/doc/draft-ietf-armd-problem-statement/),
   Aug 2012.

[GratuitousARP] S. Cheshire, "IPv4 Address Conflict
   Detection", RFC 5227, July 2008.

[RFC826] D.C. Plummer, "An Ethernet address resolution
   protocol", RFC 826, Nov 1982.

[RFC1027] Mitchell, et al, "Using ARP to Implement
   Transparent Subnet Gateways"
   (http://datatracker.ietf.org/doc/rfc1027/).

[RFC4861] Narten, et al, "Neighbor Discovery for IP version
   6 (IPv6)", RFC 4861, Sept 2007.

11.2. Informative References

[Impatient-NUD] E. Nordmark, I. Gashinsky,
   "draft-ietf-6man-impatient-nud".

Authors' Addresses

Linda Dunbar
Huawei Technologies
5340 Legacy Drive, Suite 175
Plano, TX 75024, USA
Phone: (469) 277 5840
Email: ldunbar@huawei.com

Warren Kumari
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
US
Email: warren@kumari.net

Igor Gashinsky
Yahoo
45 West 18th Street 6th floor
New York, NY 10011
Email: igor@yahoo-inc.com