L. Dunbar
Internet Draft                                                   Huawei
Intended status: Informational                                W. Kumari
Expires: July 2013                                               Google
                                                         Igor Gashinsky
                                                                  Yahoo
                                                       January 31, 2013

       Practices for scaling ARP and ND for large data centers
             draft-dunbar-armd-arp-nd-scaling-practices-05

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 31, 2013.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Internet-Draft     Practices to scale ARP/ND in large DC

Abstract

   This draft documents some operational practices that allow ARP/ND
   to scale in data center environments.

Table of Contents

   1. Introduction ................................................ 3
   2. Terminology ................................................. 3
   3. Common DC network Designs ................................... 4
   4. Layer 3 to Access Switches .................................. 4
   5. Layer 2 practices to scale ARP/ND ........................... 5
      5.1. Practices to alleviate ARP/ND burden on L2/L3
           boundary routers ....................................... 5
         5.1.1. Communicating with a peer in a different subnet ... 5
         5.1.2. L2/L3 boundary router processing of inbound
                traffic ........................................... 6
         5.1.3. Inter-subnet communications ....................... 7
      5.2. Static ARP/ND entries on switches ...................... 7
      5.3. ARP/ND Proxy approaches ................................ 8
   6. Practices to scale ARP/ND in Overlay models ................. 8
   7. Summary and Recommendations ................................. 9
   8. Security Considerations ..................................... 9
   9. IANA Considerations ........................................ 10
   10. Acknowledgements .......................................... 10
   11. References ................................................ 10
      11.1. Normative References ................................. 10
      11.2. Informative References ............................... 10
   Authors' Addresses ............................................ 11

1. Introduction

   This draft documents some operational practices that allow ARP/ND
   to scale in data center environments.

   As described in [ARMD-Problem], the increasing trend of rapid
   workload shifting and server virtualization in modern data centers
   requires servers to be loaded (or re-loaded) with different VMs or
   applications at different times.  Different VMs residing on one
   physical server may have different IP addresses, or may even be in
   different IP subnets.

   In order to allow a physical server to be loaded with VMs in
   different subnets, or VMs to be moved to different server racks
   without IP address re-configuration, the networks need to enable
   multiple broadcast domains (many VLANs) on the interfaces of L2/L3
   boundary routers and ToR switches.  Unfortunately, when the
   combined number of VMs (or hosts) in all those subnets is large,
   this can lead to address resolution scaling issues, especially on
   the L2/L3 boundary routers.

   This draft documents some simple practices that can scale ARP/ND
   in data center environments.

2. Terminology

   This document reuses much of the terminology from [ARMD-Problem].
   Many of the definitions are presented here to aid the reader.

   ARP: IPv4 Address Resolution Protocol [RFC826]

   Aggregation Switch: A Layer 2 switch interconnecting ToR switches

   Bridge: IEEE 802.1Q compliant device.  In this draft, Bridge is
        used interchangeably with Layer 2 switch.

   DC: Data Center

   DA: Destination Address

   End Station: VM or physical server whose address is either the
        destination or the source of a data frame.

   EOR: End of Row switches in data center.

   NA: IPv6's Neighbor Advertisement

   ND: IPv6's Neighbor Discovery [RFC4861]

   NS: IPv6's Neighbor Solicitation

   SA: Source Address

   ToR: Top of Rack Switch (also known as access switch).

   UNA: IPv6's Unsolicited Neighbor Advertisement

   VM: Virtual Machine

3. Common DC network Designs

   Some common network designs for data centers include:

   1) Layer 3 connectivity to the access switch,

   2) Large Layer 2, and

   3) Overlay models.

   There is no single network design that fits all cases.  The
   following sections document some of the common practices to scale
   address resolution under each network design.

4. Layer 3 to Access Switches

   This network design extends Layer 3 to the access switches,
   effectively making the access switches the L2/L3 boundary routers
   for the attached VMs.

   As described in [ARMD-Problem], many data centers are architected
   so that ARP/ND broadcast/multicast messages are confined to a few
   ports (interfaces) of the access switches (i.e. ToR switches).

   Another variant of the Layer 3 solution is Layer 3 infrastructure
   configured all the way to the servers (or even to the VMs), which
   confines the ARP/ND broadcast/multicast messages to the small
   number of VMs within the server.

   Advantage: Both ARP and ND scale well.
   There is no address resolution issue in this design.

   Disadvantage: The main disadvantage of this network design shows
   up during VM movement: either the VMs need an address change, or
   the switches/routers need a configuration change when the VMs are
   moved to different locations.

   Summary: This solution is more suitable for data centers with
   static workloads, and/or for network operators who can
   re-configure IP addresses/subnets on switches before any workload
   change.  No protocol changes are suggested.

5. Layer 2 practices to scale ARP/ND

5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers

   The ARP/ND broadcast/multicast messages in a Layer 2 domain can
   negatively affect the L2/L3 boundary routers, especially with a
   large number of VMs and subnets.  This section describes some
   commonly used practices for reducing the ARP/ND processing
   required on L2/L3 boundary routers.

5.1.1. Communicating with a peer in a different subnet

   When the communicating peer is in a different subnet, the
   originating end station needs to send ARP/ND requests to its
   default gateway router to resolve the router's MAC address.  If
   there are many subnets on the gateway router and a large number of
   end stations in those subnets, the gateway router has to process a
   very large number of ARP/ND requests.  This is often CPU
   intensive, as ARP/ND are usually processed by the CPU (and not in
   hardware).

   Solution: For IPv4 networks, a practice to alleviate this problem
   is to have the L2/L3 boundary router send periodic gratuitous ARP
   [GratuitousARP] messages, so that all the connected end stations
   can refresh their ARP caches.  As a result, most (if not all) end
   stations will not need to ARP for the gateway routers when they
   need to communicate with external peers.
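   As a rough sketch of this practice (the addresses below are
   examples from the documentation ranges, and a production router
   builds the announcement in its forwarding code, not in Python), a
   gratuitous ARP is a single broadcast frame in which the sender and
   target protocol addresses are both the router's own IP:

```python
import struct

def gratuitous_arp(gw_mac: bytes, gw_ip: bytes) -> bytes:
    """Build a gratuitous ARP announcement (ARP Request form, as in
    RFC 5227): sender IP and target IP are both the router's own."""
    bcast = b"\xff" * 6
    eth = bcast + gw_mac + struct.pack("!H", 0x0806)  # EtherType = ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)   # Ethernet/IPv4, op=request
    arp += gw_mac + gw_ip                             # sender MAC / sender IP
    arp += b"\x00" * 6 + gw_ip                        # target MAC unknown, target IP = own
    return eth + arp

# Example gateway: 192.0.2.1 with an example IANA unicast MAC
frame = gratuitous_arp(bytes.fromhex("00005e005301"), bytes([192, 0, 2, 1]))
assert len(frame) == 42  # 14-byte Ethernet header + 28-byte ARP payload
```

   Every station that receives this broadcast can refresh its cache
   entry for the gateway without ever sending its own ARP request.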
   However, because IPv6 requires bi-directional path validation,
   IPv6 end stations are still required to send unicast ND messages
   to their default gateway router (even with those routers
   periodically sending Unsolicited Neighbor Advertisements).

   Advantage: Reduction of the ARP requests to be processed by the
   L2/L3 boundary router for IPv4.

   Disadvantage: No reduction of ND processing on the L2/L3 boundary
   router for IPv6 traffic.

   Recommendation: Use this practice for IPv4-only networks, or
   change the ND protocol to allow data frames to be sent without
   requiring bi-directional path validation.  Some work in progress
   in this area is [Impatient-NUD].

5.1.2. L2/L3 boundary router processing of inbound traffic

   When a L2/L3 boundary router receives a data frame destined for a
   local subnet and the destination is not in the router's ARP/ND
   cache, some routers hold the packet and trigger an ARP/ND request
   to resolve the L2 address.  The router may need to send multiple
   ARP/ND requests until either a timeout is reached or an ARP/ND
   reply is received before forwarding the data packets towards the
   target's MAC address.  This process is not only CPU intensive but
   also buffer intensive.

   Solution: To protect a router from being overburdened by resolving
   target MAC addresses, one solution is for the router to limit the
   rate at which it resolves target MAC addresses for inbound traffic
   whose target is not in the router's ARP cache.  When the rate is
   exceeded, the incoming traffic whose target is not in the ARP
   cache is dropped.

   For an IPv4 network, another common practice to alleviate this
   problem is for the router to snoop ARP messages between other
   hosts, so that its ARP cache can be refreshed with active
   addresses in the L2 domain.
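   A minimal sketch of this snooping practice (the field offsets
   assume untagged Ethernet ARP frames, and a real router does this
   in its packet-processing path rather than in Python):

```python
import struct
import time

arp_cache = {}  # IPv4 dotted-quad -> (MAC string, last-seen timestamp)

def snoop_arp(frame: bytes) -> None:
    """Refresh the ARP cache from any ARP frame observed in the L2
    domain, not only those addressed to the router itself."""
    if len(frame) < 42 or frame[12:14] != b"\x08\x06":
        return                              # not an untagged ARP frame
    op = struct.unpack("!H", frame[20:22])[0]
    if op not in (1, 2):                    # ARP request or reply only
        return
    sender_mac = frame[22:28].hex(":")
    sender_ip = ".".join(str(b) for b in frame[28:32])
    arp_cache[sender_ip] = (sender_mac, time.time())
```

   Entries learned this way still age out on the router's normal ARP
   timer; snooping only raises the likelihood that an entry is
   present when inbound traffic arrives.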
   As a result, there is an increased likelihood of the router's ARP
   cache having the IP-MAC entry when it receives data frames from
   external peers.

   For IPv6 end stations, routers are supposed to send unicast ND
   even if they have snooped UNA/NS/NA from those stations.
   Therefore, this practice doesn't help IPv6 very much.

   Advantage: Reduction of the number of ARP requests that routers
   have to send upon receiving IPv4 packets, and of the number of
   IPv4 data frames from external peers that routers have to hold.

   Disadvantage: The amount of ND processing on routers for IPv6
   traffic is not reduced.  Even for IPv4, routers still need to hold
   data packets from external peers and trigger ARP requests if the
   targets of the data packets either don't exist or are not very
   active.

   Recommendation: This scheme doesn't work with IPv6.  For IPv4, if
   there is a higher chance of routers receiving data packets for
   non-existent or inactive targets, alternative approaches should be
   considered.

5.1.3. Inter-subnet communications

   The router is hit with ARP/ND twice when the originating and
   destination stations are in different subnets attached to the same
   router: once when the originating station in subnet-A initiates an
   ARP/ND request to the L2/L3 boundary router (5.1.1 above), and a
   second time when the L2/L3 boundary router initiates ARP/ND
   requests to the target in subnet-B (5.1.2 above).

   Again, the practices described in 5.1.1 and 5.1.2 can alleviate
   the problems in IPv4 networks, but don't help very much for IPv6.

   Advantage: Reduction of ARP processing on L2/L3 boundary routers
   for IPv4 traffic.

   For IPv6 traffic, there is no reduction of ND processing on L2/L3
   boundary routers.

   Recommendation: Consider the recommended approaches described in
   5.1.1 & 5.1.2.

5.2. Static ARP/ND entries on switches

   In a data center environment, the placement of L2 and L3 addresses
   may be orchestrated by Server (or VM) Management System(s).
   Therefore it may be possible for static ARP/ND entries to be
   configured on routers and/or servers.

   Advantage: This methodology has been used to reduce ARP/ND
   fluctuations in large scale data center networks.

   Disadvantage: There is no well-defined mechanism for devices to
   get prompt incremental updates of static ARP/ND entries when
   changes occur.

   Recommendation: The IETF should consider creating a standard
   mechanism (or protocols) for switches or servers to get
   incremental updates of static ARP/ND entries.

5.3. ARP/ND Proxy approaches

   RFC1027 [RFC1027] specifies one ARP proxy approach.  Since the
   publication of RFC1027 in 1987, many variants of ARP proxy have
   been deployed.  The term "ARP Proxy" is a loaded phrase, with
   different interpretations depending on vendors and/or
   environments.  RFC1027's ARP Proxy is for a gateway to return its
   own MAC address on behalf of the target station.  Another
   technique, also called "ARP Proxy", is for a ToR switch to snoop
   ARP requests and return the target station's MAC if the ToR has
   the information.

   Advantage: Proxy ARP [RFC1027] and its variants have supported
   multi-subnet ARP traffic for over a decade.

   Disadvantage: The Proxy ARP protocol [RFC1027] was developed for
   hosts which don't support subnets.

   Recommendation: Revise RFC1027 with VLAN support and make it scale
   for the data center environment.

6. Practices to scale ARP/ND in Overlay models

   There are several drafts on using overlay networks to scale large
   Layer 2 networks (or avoid the need for large L2 networks) and
   enable mobility (e.g. draft-wkumari-dcops-l3-vmmobility-00,
   draft-mahalingam-dutt-dcops-vxlan-00).
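   As a concrete illustration of the encapsulation such overlays add
   (VXLAN is used here only as an example; the 8-byte header layout
   follows the VXLAN draft cited above), the outer header that hides
   the VMs' addresses carries little more than a 24-bit virtual
   network identifier:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: a flags byte with the I bit set
    (VNI is valid), 3 reserved bytes, the 24-bit VNI, 1 reserved byte."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI is a 24-bit value")
    return struct.pack("!II", 0x08 << 24, vni << 8)

def vxlan_vni(header: bytes) -> int:
    """Recover the VNI an overlay edge node uses to select the tenant
    forwarding table before decapsulation."""
    return struct.unpack("!I", header[4:8])[0] >> 8
```

   Interior switches forward only on the outer (underlay) headers,
   which is what keeps the VMs' MAC/IP addresses out of their tables.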
   TRILL and IEEE 802.1ah (MAC-in-MAC) are other types of overlay
   networks used to scale Layer 2.

   Overlay networks hide the VMs' addresses from the interior
   switches and routers, thereby greatly reducing the number of
   addresses exposed to the interior switches and routers.  The
   overlay edge nodes which perform the network address
   encapsulation/decapsulation still handle the addresses of all
   remote stations which communicate with locally attached stations.

   For a large data center with many applications, these
   applications' IP addresses need to be reachable by external peers.
   Therefore, the overlay network may have a bottleneck at the
   gateway device(s) in resolving target stations' physical addresses
   (MAC or IP) and overlay edge addresses within the data center.

   Here are some approaches being used to minimize the problem:

   1. Use static mapping as described in Section 5.2.

   2. Have multiple gateway nodes (i.e. routers), with each handling
      a subset of the station addresses which are visible to external
      peers, e.g. Gateway #1 handles one set of prefixes, Gateway #2
      handles another set, etc.

7. Summary and Recommendations

   This memo describes some common practices which can alleviate the
   impact of address resolution on L2/L3 gateway routers.

   In data centers, no single solution fits all deployments.  This
   memo has summarized some practices in various scenarios, along
   with the advantages and disadvantages of each.

   In some of these scenarios, the common practices could be improved
   by creating and/or extending existing IETF protocols.  These
   protocol change recommendations are:

   - Extend the IPv6 ND method,

   - Create an incremental "update" scheme for static ARP/ND entries,

   - Revise Proxy ARP [RFC1027] for use in the data center.

8. Security Considerations

   This draft documents existing solutions and proposes additional
   work that could be initiated to extend various IETF protocols to
   better scale ARP/ND for the data center environment.  The security
   of future protocol extensions will be discussed in their
   respective documents.

9. IANA Considerations

   This document does not request any action from IANA.

10. Acknowledgements

   We want to acknowledge the ARMD WG and the following people for
   their valuable input to this draft: Susan Hares, Benson
   Schliesser, T. Sridhar, Ron Bonica, Kireeti Kompella, and
   K.K. Ramakrishnan.

11. References

11.1. Normative References

   [ARMD-Problem] Narten, T., et al., "Problem Statement for ARMD",
        draft-ietf-armd-problem-statement (work in progress), August
        2012.

   [GratuitousARP] Cheshire, S., "IPv4 Address Conflict Detection",
        RFC 5227, July 2008.

   [RFC826] Plummer, D., "An Ethernet Address Resolution Protocol",
        RFC 826, November 1982.

   [RFC1027] Carl-Mitchell, S. and J. Quarterman, "Using ARP to
        Implement Transparent Subnet Gateways", RFC 1027, October
        1987.

   [RFC4861] Narten, T., et al., "Neighbor Discovery for IP version 6
        (IPv6)", RFC 4861, September 2007.

11.2. Informative References

   [Impatient-NUD] Nordmark, E. and I. Gashinsky,
        draft-ietf-6man-impatient-nud (work in progress).

Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: ldunbar@huawei.com

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US
   Email: warren@kumari.net

   Igor Gashinsky
   Yahoo
   45 West 18th Street 6th floor
   New York, NY 10011
   Email: igor@yahoo-inc.com