ARMD                                                         L. Dunbar
Internet Draft                                                  Huawei
Intended status: Informational                               W. Kumari
Expires: February 2013                                          Google
                                                        Igor Gashinsky
                                                                 Yahoo
                                                       August 31, 2012

       Practices for scaling ARP and ND for large data centers

            draft-dunbar-armd-arp-nd-scaling-practices-03

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on February 28, 2013.

Copyright Notice

Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.

Internet-Draft          Practices to scale ARP/ND in large DC

Abstract

This draft documents some simple practices that scale ARP/ND in data center environments.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119 [RFC2119].

Table of Contents

1. Introduction
2. Terminology
3. Common DC Network Designs
4. Layer 3 to Access Switches
5. Layer 2 practices to scale ARP/ND
   5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers
        5.1.1. Station communicating with an external peer
        5.1.2. L2/L3 boundary router processing of inbound traffic
        5.1.3. Inter-subnet communications
   5.2. Static ARP/ND entries on switches
   5.3. ARP/ND Proxy approaches
6. Practices to scale ARP/ND in Overlay models
7. Summary and Recommendations
8. Security Considerations
9. IANA Considerations
10. Acknowledgements
11. References
    11.1. Normative References
    11.2. Informative References
Authors' Addresses

1. Introduction

As described in [ARMD-Problem], the increasing trend of rapid workload shifting and server virtualization in modern data centers requires servers to be loaded (or re-loaded) with different VMs or applications at different times. Different VMs residing on one physical server may have different IP addresses, or may even be in different IP subnets.
In order to allow a physical server to be loaded with VMs in different subnets, or VMs to be moved to different server racks without IP address re-configuration, the corresponding networks need to enable multiple broadcast domains (many VLANs) on the interfaces of L2/L3 boundary routers and ToR switches. Unfortunately, when the combined number of VMs (or hosts) in all those subnets is large, this can lead to address resolution scaling issues, especially on the L2/L3 boundary routers.

This draft documents some simple practices which can scale ARP/ND in data center environments.

2. Terminology

This document reuses much of the terminology from [ARMD-Problem]. Many of the definitions are presented here to aid the reader.

ARP: IPv4 Address Resolution Protocol [RFC826]

Aggregation Switch: A Layer 2 switch interconnecting ToR switches

Bridge: An IEEE 802.1Q-compliant device. In this draft, "Bridge" is used interchangeably with "Layer 2 switch".

DC: Data Center

DA: Destination Address

End Station: A VM or physical server whose address is either the destination or the source of a data frame.

EOR: End-of-Row switch in a data center.

NA: IPv6's Neighbor Advertisement

ND: IPv6's Neighbor Discovery [RFC4861]

NS: IPv6's Neighbor Solicitation

SA: Source Address

Station: A node which is either a destination or source of a data frame.

ToR: Top-of-Rack switch (also known as access switch).

UNA: IPv6's Unsolicited Neighbor Advertisement

VM: Virtual Machine

3. Common DC Network Designs

Some common network designs for data centers include:

1) Layer 3 connectivity to the access switch,

2) Large Layer 2,

3) Overlay models.

There is no single network design that fits all cases.
The following sections document some of the common practices used to scale address resolution under each network design.

4. Layer 3 to Access Switches

This refers to the network design with Layer 3 to the access switches.

As described in [ARMD-Problem], many data centers are architected so that ARP/ND broadcast/multicast messages are confined to a few ports (interfaces) of the access switches (i.e., ToR switches).

Another variant of the Layer 3 solution is Layer 3 all the way to the servers (or even to the VMs), which confines the ARP/ND broadcast/multicast messages to the small number of VMs within the server.

Advantage: Both ARP and ND scale well. There are no address resolution issues in this design.

Disadvantage: The main disadvantage of this network design is that IP addresses have to be re-configured on switches when a server needs to be re-loaded with an application in a different subnet, or when VMs need to be moved to a different location.

Summary: This solution is more suitable for data centers which have static workloads and/or network operators who can re-configure IP addresses/subnets on switches before any workload change. No protocol changes are suggested.

5. Layer 2 practices to scale ARP/ND

5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers

The ARP/ND broadcast/multicast messages in a Layer 2 domain can negatively affect the L2/L3 boundary routers, especially with a large number of VMs and subnets. This section describes some commonly used practices for reducing the ARP/ND processing required on L2/L3 boundary routers.

5.1.1. Station communicating with an external peer

When the external peer is in a different subnet, the originating end station needs to send ARP/ND requests to its default gateway router to resolve the router's MAC address.
If there are many subnets on the gateway router and a large number of end stations in those subnets, the gateway router has to process a very large number of ARP/ND requests. This is often CPU intensive, as ARP/ND messages are usually processed by the CPU (and not in hardware).

Solution: For IPv4 networks, a practice to alleviate this problem is to have the L2/L3 boundary router send periodic gratuitous ARP messages [Gratuitous ARP], so that all the connected end stations can refresh their ARP caches. As a result, most (if not all) end stations will not need to ARP for the gateway routers when they need to communicate with external peers.

However, because IPv6 requires bi-directional path validation, IPv6 end stations are still required to send unicast ND messages to their default gateway router (even with those routers periodically sending Unsolicited Neighbor Advertisements).

Advantage: Reduction of the ARP requests to be processed by the L2/L3 boundary router for IPv4.

Disadvantage: No reduction of ND processing on the L2/L3 boundary router for IPv6 traffic.

Recommendation: Use for IPv4-only networks, or make changes to the ND protocol to allow data frames to be sent without requiring bi-directional frame validation. Some work in progress in this area is [Impatient-NUD].

5.1.2. L2/L3 boundary router processing of inbound traffic

When an L2/L3 boundary router receives a data frame whose destination is not in the router's ARP/ND cache, some routers hold the packet and trigger an ARP/ND request to resolve the L2 address. The router may need to send multiple ARP/ND requests until either a timeout is reached or an ARP/ND reply is received before forwarding the data packets towards the target's MAC address. This process is not only CPU intensive but also buffer intensive.
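The hold-and-resolve behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name, the retry limit, and the per-target buffer cap below are invented for the example and are not taken from any particular router implementation.

```python
# Illustrative limits (assumptions for this sketch; real routers differ).
ARP_RETRY_LIMIT = 3      # ARP/ND requests triggered per unresolved target
HELD_PACKET_LIMIT = 10   # packets buffered per unresolved target

class ArpOnMissRouter:
    """Simulates a router that, on an ARP/ND cache miss, holds the
    packet and triggers address resolution for the target."""

    def __init__(self, send_arp_request):
        self.cache = {}             # ip -> mac (the ARP/ND cache)
        self.held = {}              # ip -> packets buffered awaiting resolution
        self.requests_sent = {}     # ip -> ARP/ND requests already triggered
        self.send_arp_request = send_arp_request

    def forward(self, dst_ip, packet):
        mac = self.cache.get(dst_ip)
        if mac is not None:
            return ("forwarded", mac)          # fast path: cache hit
        # Slow path: buffer the packet (memory cost) and ARP (CPU cost).
        queue = self.held.setdefault(dst_ip, [])
        if len(queue) < HELD_PACKET_LIMIT:
            queue.append(packet)
        if self.requests_sent.get(dst_ip, 0) < ARP_RETRY_LIMIT:
            self.requests_sent[dst_ip] = self.requests_sent.get(dst_ip, 0) + 1
            self.send_arp_request(dst_ip)
        return ("held", None)

    def on_arp_reply(self, ip, mac):
        # Resolution succeeded: cache the mapping and release held packets.
        self.cache[ip] = mac
        self.requests_sent.pop(ip, None)
        return [("forwarded", mac, p) for p in self.held.pop(ip, [])]
```

Packets towards non-existent or inactive targets keep the slow path busy, which is why the practices in this section try to keep the cache populated (for example, from snooped ARP traffic or static entries) before inbound data arrives.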
Solution: For IPv4 networks, a common practice to alleviate this problem is for the router to snoop ARP messages, so that its ARP cache can be refreshed with active addresses in the L2 domain. As a result, there is an increased likelihood of the router's ARP cache having the IP-MAC entry when it receives data frames from external peers.

For IPv6 end stations, routers are supposed to send unicast ND messages even if they have snooped UNA/NS/NA messages from those stations. Therefore, this practice doesn't help IPv6 very much.

Advantage: Reduction of the number of ARP requests which routers have to send upon receiving IPv4 packets, and of the number of IPv4 data frames from external peers which routers have to hold.

Disadvantage: The amount of ND processing on routers for IPv6 traffic is not reduced. Even for IPv4, routers still need to hold data packets from external peers and trigger ARP requests if the targets of the data packets either don't exist or are not very active.

Recommendation: Do not use with IPv6, or make protocol changes to IPv6's ND. For IPv4, if there is a higher chance of routers receiving data packets destined to non-existent or inactive targets, alternative approaches should be considered.

5.1.3. Inter-subnet communications

The router will be hit twice when the originating and destination stations are in different subnets under the same router: once for the originating station in subnet-A initiating an ARP/ND request to the L2/L3 boundary router (5.1.1 above), and a second time for the L2/L3 boundary router initiating ARP/ND requests to the target in subnet-B (5.1.2 above).

Again, the practices described in 5.1.1 and 5.1.2 can alleviate problems in IPv4 networks, but don't help very much for IPv6.

Advantage: Reduction of ARP processing on L2/L3 boundary routers for IPv4 traffic.
For IPv6 traffic, there is no reduction of ND processing on L2/L3 boundary routers.

Recommendation: Do not use with IPv6, or consider other approaches.

5.2. Static ARP/ND entries on switches

In a data center environment, the placement of L2 and L3 addressing may be orchestrated by Server (or VM) Management System(s). Therefore, it may be possible for static ARP/ND entries to be configured on routers and/or servers.

Advantage: This methodology has been used to reduce ARP/ND fluctuations in large-scale data center networks.

Disadvantage: There is no well-defined mechanism for devices to get prompt incremental updates of static ARP/ND entries when changes occur.

Recommendation: The IETF should consider creating a standard mechanism (or protocol) for switches or servers to get incremental updates of static ARP/ND entries.

5.3. ARP/ND Proxy approaches

RFC1027 specifies one ARP proxy approach. Since the publication of RFC1027 in 1987, many variants of ARP proxy have been deployed. The term "ARP Proxy" is a loaded phrase, with different interpretations depending on vendors and/or environments. RFC1027's ARP Proxy is for a gateway to return its own MAC address on behalf of the target station. Another technique, also called "ARP Proxy", is for a ToR switch to snoop ARP requests and return the target station's MAC address if the ToR has the information.

Advantage: Proxy ARP [RFC1027] and its variants have enabled multi-subnet ARP operation for over two decades.

Disadvantage: The Proxy ARP protocol [RFC1027] was developed for hosts which don't support subnets.

Recommendation: Revise RFC1027 with VLAN support and make it scale for the data center environment.

6. Practices to scale ARP/ND in Overlay models

There are several drafts on using overlay networks to scale large Layer 2 networks (or avoid the need for large L2 networks) and enable mobility (e.g., draft-wkumari-dcops-l3-vmmobility-00, draft-mahalingam-dutt-dcops-vxlan-00). TRILL and IEEE 802.1ah (MAC-in-MAC) are other types of overlay networks used to scale Layer 2.

Overlay networks hide the VMs' addresses from the interior switches and routers, thereby relieving those routers from having to perform ARP/ND services for as many addresses. The overlay edge nodes which perform the network address encapsulation/decapsulation still see all the remote stations' addresses which communicate with stations attached locally.

For a large data center with many applications, these applications' IP addresses need to be reachable by external peers. Therefore, the overlay network may have a bottleneck at the gateway device(s) in resolving target stations' physical addresses (MAC or IP) and overlay edge addresses within the data center.

Here are some approaches being used to minimize the problem:

1. Use static mapping as described in Section 5.2.

2. Have multiple gateway nodes (i.e., routers), with each handling a subset of the station addresses which are visible to external peers; e.g., Gateway #1 handles one set of prefixes, Gateway #2 handles another set, etc.

7. Summary and Recommendations

This memo describes some common practices which can alleviate the impact of address resolution on L2/L3 gateway routers.

In data centers, no single solution fits all deployments. This memo has summarized some practices for various scenarios, along with the advantages and disadvantages of each.
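The second approach listed in Section 6, splitting the externally visible prefixes across multiple gateway nodes, can be sketched with Python's standard ipaddress module. The prefix-to-gateway table below is a hypothetical example, not taken from this draft or from any deployment.

```python
import ipaddress

# Hypothetical assignment of externally visible prefixes to gateways,
# so each gateway performs address resolution for only a subset of stations.
PREFIX_TO_GATEWAY = {
    ipaddress.ip_network("192.0.2.0/25"):    "gateway-1",
    ipaddress.ip_network("192.0.2.128/25"):  "gateway-2",
    ipaddress.ip_network("198.51.100.0/24"): "gateway-3",
}

def gateway_for(ip_str):
    """Longest-prefix match over the partitioned gateway table."""
    ip = ipaddress.ip_address(ip_str)
    matches = [net for net in PREFIX_TO_GATEWAY if ip in net]
    if not matches:
        return None  # no gateway advertises this address externally
    best = max(matches, key=lambda net: net.prefixlen)
    return PREFIX_TO_GATEWAY[best]
```

Because each gateway only resolves (and caches) addresses inside its own prefixes, the ARP/ND and mapping-resolution load is divided across the gateways rather than concentrated on one device.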
In some of these scenarios, the common practices could be improved by creating new IETF protocols and/or extending existing ones. These protocol change recommendations are:

- Extend the IPv6 ND method,

- Create an incremental "download" scheme for static ARP/ND entries,

- Revise Proxy ARP [RFC1027] for use in the data center.

8. Security Considerations

This draft documents existing solutions and proposes additional work that could be initiated to extend various IETF protocols to better scale ARP/ND for the data center environment. As such, we do not believe that it introduces any security concerns.

9. IANA Considerations

This document does not request any action from IANA.

10. Acknowledgements

We want to acknowledge the following people for their valuable inputs to this draft: T. Sridhar, Ron Bonica, Kireeti Kompella, and K.K. Ramakrishnan.

11. References

11.1. Normative References

[ARMD-Problem] Narten, T., "Problem Statement for ARMD", draft-ietf-armd-problem-statement (http://datatracker.ietf.org/doc/draft-ietf-armd-problem-statement/), August 2012.

[Gratuitous ARP] Cheshire, S., "IPv4 Address Conflict Detection", RFC 5227, July 2008.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC826] Plummer, D.C., "An Ethernet Address Resolution Protocol", RFC 826, November 1982.

[RFC1027] Carl-Mitchell, S. and J. Quarterman, "Using ARP to Implement Transparent Subnet Gateways", RFC 1027, October 1987.

[RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

11.2. Informative References

[DC-ARCH] Karir, M., et al., "draft-karir-armd-datacenter-reference-arch", work in progress.

[Impatient-NUD] Nordmark, E. and I. Gashinsky, "draft-ietf-6man-impatient-nud", work in progress.

Authors' Addresses

Linda Dunbar
Huawei Technologies
5340 Legacy Drive, Suite 175
Plano, TX 75024, USA
Phone: (469) 277 5840
Email: ldunbar@huawei.com

Warren Kumari
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
US
Email: warren@kumari.net

Igor Gashinsky
Yahoo
45 West 18th Street, 6th floor
New York, NY 10011
Email: igor@yahoo-inc.com