ARMD                                                          L. Dunbar
Internet Draft                                                   Huawei
Category: Informational                                       W. Kumari
                                                                 Google
                                                           I. Gashinsky
                                                                  Yahoo

Expires: November 2012                                     July 3, 2012

          Practices for Scaling ARP/ND in Large Data Centers
            draft-dunbar-armd-arp-nd-scaling-practices-00

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 30, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document describes some simple, well-established practices that
   can scale ARP/ND in data center environments.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Introduction ................................................ 3
   2. Terminology ................................................. 3
   3. Potential Solutions to Scale Address Resolution in DC ....... 4
      3.1. Layer 3 to Access Switches ............................. 4
      3.2. Practices to scale ARP/ND in Layer 2 ................... 5
           3.2.1. When a station needs to communicate with an
                  external peer ................................... 5
           3.2.2. L2/L3 boundary router processing of inbound
                  traffic ......................................... 6
           3.2.3. Inter-subnet communications ..................... 7
      3.3. Static ARP/ND entries on switches ...................... 7
      3.4. DNS-based solution ..................................... 7
      3.5. ARP/ND Proxy approaches ................................ 8
      3.6. Overlay models ......................................... 9
   4. Summary and Recommendations ................................. 10
   5. Manageability Considerations ................................ 10
   6. Security Considerations ..................................... 10
   7. IANA Considerations ......................................... 10
   8. Acknowledgements ............................................ 10
   9. References .................................................. 11
   Authors' Addresses ............................................. 11

1. Introduction

   As described in [ARMD-Problem], the trend toward rapid workload
   shifting and server virtualization in modern data centers requires
   servers to be loaded (or re-loaded) with different VMs or
   applications at different times.  The different VMs loaded onto one
   physical server may have different IP addresses and may even be in
   different IP subnets.

   In order to allow a physical server to be re-loaded with VMs in
   different subnets, or VMs to be moved to different server racks
   without IP address re-configuration, the corresponding networks have
   to support multiple broadcast domains (many VLANs) on the interfaces
   of L2/L3 boundary routers and ToR switches.  Unfortunately, this
   kind of network design can lead to address resolution scaling
   issues, especially on the L2/L3 boundary routers, when the combined
   number of VMs (or hosts) in all those subnets is large.

   This document describes some potential solutions that can minimize
   ARP/ND scaling issues in a data center environment.

2. Terminology

   ARP:          IPv4 Address Resolution Protocol [RFC826]

   Aggregation Switch:  A Layer 2 switch interconnecting ToR switches

   Bridge:       IEEE 802.1Q-compliant device.  In this document,
                 "Bridge" is used interchangeably with "Layer 2
                 switch".
   DC:           Data Center

   DA:           Destination Address

   End Station:  A VM or physical server whose address is either the
                 destination or the source of a data frame.

   EOR:          End-of-Row switch in a data center.

   NA:           IPv6 Neighbor Advertisement

   ND:           IPv6 Neighbor Discovery [RFC4861]

   NS:           IPv6 Neighbor Solicitation

   SA:           Source Address

   Station:      A node that is either the destination or the source
                 of a data frame.

   ToR:          Top-of-Rack switch, also known as an access switch.

   UNA:          IPv6 Unsolicited Neighbor Advertisement

   VM:           Virtual Machine

3. Potential Solutions to Scale Address Resolution in DC

   Data center operators have indicated that the following solutions
   are used to scale ARP/ND:

   1) Layer 3 connectivity to the access switch,

   2) practices to scale ARP/ND in Layer 2,

   3) static ARP/ND entries,

   4) DNS-based approaches, and

   5) extensions to proxy ARP [RFC1027].

   No single solution fits all cases.  This section describes the
   common practices for each type of solution.

3.1. Layer 3 to Access Switches

   This refers to a network design with Layer 3 extended to the access
   switches.

   As described in [ARMD-Problem], many data centers are designed this
   way so that ARP/ND broadcast/multicast messages are confined to a
   few ports (interfaces) of the access switches (i.e., ToR switches).

   Another variant of the Layer 3 solution is Layer 3 all the way to
   the servers, or even to the VMs.  The ARP/ND broadcast/multicast
   messages are then further confined to the small number of VMs within
   each server, or eliminated entirely.

   Advantage: Both ARP and ND scale well.  There is no address
   resolution issue in this design.

   Disadvantage: IP addresses have to be re-configured on the switches
   whenever a server needs to be re-loaded with an application in a
   different subnet, or VMs need to be moved to a different location.
   Summary: This solution is best suited to data centers that have a
   static workload, or whose operators can properly re-configure IP
   addresses/subnets on switches before any workload change.  No
   protocol changes are suggested.

3.2. Practices to scale ARP/ND in Layer 2

   L2/L3 boundary routers can be heavily impacted by the ARP/ND
   broadcast/multicast messages in a Layer 2 domain, especially with a
   large number of VMs and subnets.  This section describes some
   commonly used practices for reducing the ARP/ND processing required
   on L2/L3 boundary routers.

3.2.1. When a station needs to communicate with an external peer

   When the external peer is in a different subnet, the originating end
   station needs to send ARP/ND requests to its default gateway router
   to get the router's MAC address.  If many subnets are enabled on the
   gateway router, with a large combined number of end stations in all
   those subnets, the gateway router has to process a very large number
   of ARP/ND requests.  This is often CPU intensive, as such
   requests/responses are processed by the CPU and not in hardware.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is to have the L2/L3 boundary router send periodic
   gratuitous ARP messages, so that all the connected end stations can
   refresh their ARP caches.  As a result, most end stations, if not
   all, won't send ARP requests to gateway routers when they need to
   communicate with external peers.

   However, IPv6 end stations are still required to send unicast ND
   messages to their default gateway router, even when the gateway
   router periodically sends Unsolicited Neighbor Advertisements.
   This is because IPv6 requires bidirectional path validation before
   a data packet can be sent.

   Advantage: Reduction of the ARP requests to be processed by the
   L2/L3 boundary router for IPv4.
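   As an illustration, the frame a gateway might broadcast for such a
   periodic refresh can be sketched as follows.  This is a minimal,
   hypothetical example: the field layout follows [RFC826], but the
   gateway MAC/IP values are placeholders, and implementations vary in
   whether the announcement uses the ARP request or reply opcode.

```python
import struct

def build_gratuitous_arp(gw_mac: bytes, gw_ip: bytes) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP announcement.

    In a gratuitous ARP the gateway announces its own IP-to-MAC
    binding: the sender and target protocol addresses are both the
    gateway's IP, and the frame is broadcast so every end station in
    the L2 domain can refresh its ARP cache.
    """
    bcast = b"\xff" * 6
    # Ethernet header: broadcast destination, gateway source,
    # EtherType 0x0806 (ARP).
    eth_header = bcast + gw_mac + struct.pack("!H", 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # hardware / protocol address lengths
        2,        # opcode 2 (reply); some implementations use 1
        gw_mac, gw_ip,   # sender MAC / sender IP (the gateway's own)
        bcast, gw_ip,    # target MAC (ignored) / target IP = sender IP
    )
    return eth_header + arp_payload

frame = build_gratuitous_arp(b"\x00\x11\x22\x33\x44\x55",
                             bytes([192, 0, 2, 1]))
```

   A gateway following this practice would send such a frame, per VLAN,
   at an interval shorter than the end stations' ARP cache timeout.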
   Disadvantage: No reduction of ND processing on the L2/L3 boundary
   router for IPv6 traffic.

   Recommendation: Use for IPv4-only networks, or change the ND
   protocol to allow data frames to be sent without requiring
   bidirectional path validation.

3.2.2. L2/L3 boundary router processing of inbound traffic

   When an L2/L3 boundary router receives a data frame from the L3
   domain and the destination is not in the router's ARP/ND cache, the
   router usually holds the packet and triggers an ARP/ND request to
   make sure the target actually exists in its L2 domain.  The router
   may need to send multiple ARP/ND requests until either a timeout is
   reached or an ARP/ND reply is received before forwarding the data
   packets towards the target's MAC address.  This process is not only
   CPU intensive but also buffer intensive.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is for the L2/L3 boundary router to snoop ARP messages, so
   that its ARP cache can be refreshed with the active addresses in
   its L2 domain.  As a result, there is an increased likelihood of
   the router's ARP cache having the IP-MAC entry when it receives
   data frames from external peers.

   For IPv6 end stations, a router is still expected to send unicast
   ND messages even if it has snooped UNA/NS/NA messages from those
   stations.  Therefore, this practice doesn't help IPv6 very much.

   Advantage: Reduction of the ARP requests that routers have to send
   upon receiving IPv4 packets, and of the number of IPv4 data frames
   from external peers that routers have to hold.

   Disadvantage: The amount of ND processing on routers for IPv6
   traffic is not reduced.  Even for IPv4, routers still need to hold
   data packets from external peers and trigger ARP requests if the
   targets of the data packets either don't exist or are not very
   active.

   Recommendation: Do not use with IPv6, or make protocol changes to
   IPv6's ND.
   For IPv4, if there is a high chance of routers receiving data
   packets destined to non-existent or inactive targets, alternative
   approaches should be considered.

3.2.3. Inter-subnet communications

   The router is hit twice when the originating and destination
   stations are in different subnets under the same router: once when
   the originating station in subnet A sends an ARP/ND request to the
   L2/L3 boundary router (Section 3.2.1 above), and a second time when
   the L2/L3 boundary router sends ARP/ND requests to the target in
   subnet B (Section 3.2.2 above).

   Again, the practices described in Sections 3.2.1 and 3.2.2 can
   alleviate problems in IPv4 networks but don't help very much for
   IPv6.

   Advantage: Reduction of ARP processing on L2/L3 boundary routers
   for IPv4 traffic.  For IPv6 traffic, however, there is no reduction
   of ND processing on L2/L3 boundary routers.

   Recommendation: Do not use with IPv6, or consider other approaches.

3.3. Static ARP/ND entries on switches

   In a data center environment, the placement of applications on
   servers, racks, and rows may be orchestrated by Server (or VM)
   Management System(s).  It is therefore possible for static ARP/ND
   entries to be downloaded to switches, routers, or servers.

   Advantage: This methodology has been used to reduce ARP/ND
   fluctuations in large-scale data center networks.

   Disadvantage: There is no well-defined mechanism for switches to
   get prompt incremental updates of static ARP/ND entries when
   changes occur, or to perform certain steps when switches go through
   a reset.

   Recommendation: The IETF should create a well-defined mechanism (or
   protocol) for switches or servers to get static ARP/ND entries.

3.4. DNS-based solution

   This solution is best suited to environments where applications
   resolve the addresses of the destinations they need to communicate
   with via DNS, and periodically refresh these addresses.
   While this solution is very well known and extensively used, it is
   mainly appropriate for stateless services, or for services that
   have a large number of short-lived connections.  While simple, this
   technique may not be appropriate for generic VM migration.

   If a VM can get a new IP address when it is moved to a new
   location, the migration steps are:

   -  Instantiate the service on a VM in a distant rack.  The new VM
      gets a new IP address.

   -  Change the address of the service in DNS.

   -  Wait for the DNS TTL to expire.  While you are waiting, watch
      the number of connections to the new VM increase and the number
      of connections to the old VM decrease.

   -  Wait a little longer.  When the number of connections to the old
      VM reaches zero, shut down the old VM.

   Advantage: DNS is an existing technology, and this is a well-known,
   commonly practiced technique.

   Disadvantage: This approach is not suitable for multi-tenant
   scenarios where each tenant needs to use its own address space, or
   when the data center operator does not have full control of the
   addresses used by stations/VMs.

   Summary: Use is limited to deployments where the data center
   operator is in control of the entire application and runs the DNS.
   More appropriate for service migration than for VM migration.

3.5. ARP/ND Proxy approaches

   RFC 1027 specifies one ARP proxy approach.  Since RFC 1027 was
   published in 1987, many variants of ARP proxy have been deployed.
   The term "ARP Proxy" is a loaded phrase, with different
   interpretations depending on vendors and/or environments.
   RFC 1027's ARP proxy is for a gateway to return its own MAC address
   on behalf of the target station.  Another technique, also called
   "ARP Proxy", is for a ToR switch to snoop ARP requests and return
   the target station's MAC address if the ToR has the information.
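   The difference between these two flavors of "ARP Proxy" can be
   sketched as follows.  This is an illustrative model, not any
   vendor's implementation; the class, its method names, and the
   example addresses are assumptions made for the sketch.

```python
class TorArpProxy:
    """Sketch of the ToR-switch flavor of "ARP Proxy": the switch
    snoops ARP traffic to learn IP-to-MAC bindings, then answers later
    ARP requests itself instead of flooding them into the L2 domain."""

    def __init__(self, own_mac: str):
        self.own_mac = own_mac
        self.cache = {}  # snooped IP -> MAC bindings

    def snoop(self, sender_ip: str, sender_mac: str) -> None:
        # Called for every ARP request/reply seen on an access port.
        self.cache[sender_ip] = sender_mac

    def answer_request(self, target_ip: str, rfc1027_style: bool = False):
        """Decide how to handle an ARP request for target_ip.

        RFC 1027 style: the gateway answers with *its own* MAC and
        forwards traffic on the target's behalf.  ToR style: answer
        with the *target's* MAC if it was snooped earlier; otherwise
        flood the request as a normal switch would.
        """
        if rfc1027_style:
            return ("reply", self.own_mac)
        if target_ip in self.cache:
            return ("reply", self.cache[target_ip])
        return ("flood", None)
```

   For example, after the proxy snoops a station at 192.0.2.10, a
   request for that address is answered locally with the station's own
   MAC, whereas the RFC 1027 mode returns the gateway's MAC regardless
   of the target.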
   Advantage: Proxy ARP [RFC1027] and its variants have allowed multi-
   subnet ARP traffic for over a decade.

   Disadvantage: The proxy ARP protocol [RFC1027] was developed prior
   to the concept of VLANs and for hosts that don't support subnets.

   Recommendation: Revise RFC 1027 with VLAN support and make it scale
   for the data center environment.

3.6. Overlay models

   There are several drafts on using overlay networks to scale large
   Layer 2 networks and enable mobility (e.g., draft-wkumari-dcops-l3-
   vmmobility-00, draft-mahalingam-dutt-dcops-vxlan-00).  TRILL and
   IEEE 802.1ah (MAC-in-MAC) are other types of overlay networks used
   to scale Layer 2.

   Overlay networks hide the VMs' addresses from the interior switches
   and routers.  The overlay edge nodes, which perform the network
   address encapsulation/decapsulation, still see the addresses of all
   remote stations that communicate with locally attached stations.

   For a large data center with tens of thousands of applications
   communicating with peers outside the data center, all those
   applications' IP addresses are visible to external peers.  When a
   great number of VMs move freely within a data center, those VMs'
   IP addresses might not aggregate well on the gateway routers,
   causing the forwarding table size to explode.

   When a gateway router receives a data frame from external peers
   destined to a target within the data center, it needs to resolve
   both the target's MAC address and the overlay edge node's address
   in order to perform the proper overlay encapsulation.

   Therefore, the overlay network will have a bottleneck at the
   gateway router(s) in resolving target stations' physical addresses
   (MAC or IP) and overlay edge addresses within the data center.

   Here are some approaches being used to minimize this problem:

   1. Use static mapping, as described in Section 3.3.

   2. Have multiple gateway nodes (i.e., routers), each handling a
      subset of the station addresses that are visible to external
      peers, e.g., Gateway #1 handles one set of prefixes, Gateway #2
      handles another set, etc.  This architecture assumes that each
      gateway has enough downstream ports to be connected to all
      server racks.

   If each server rack is allowed to instantiate VMs/applications with
   any IP addresses, or any VM is allowed to move anywhere without re-
   configuring its IP/MAC addresses, each gateway has to resolve
   addresses that are potentially located on any server rack.  The
   address resolution processing load on each gateway can still be
   very heavy.

4. Summary and Recommendations

   This memo describes some common practices that can alleviate the
   impact of address resolution on L2/L3 gateway routers.

   In data centers, no single solution fits all deployments.  This
   memo has summarized five different practices for various scenarios,
   along with the advantages and disadvantages of each.

   In some of these scenarios, the common practices could be improved
   by creating new IETF protocols and/or extending existing ones.
   These protocol change recommendations are:

   -  Extend the IPv6 ND protocol,

   -  Create an incremental "download" scheme for static ARP/ND
      entries,

   -  Revise proxy ARP [RFC1027] for use in the data center.

5. Manageability Considerations

   This document recommends practices intended to improve the
   manageability of data centers.

6. Security Considerations

   Security will be addressed in a separate document.

7. IANA Considerations

   This document does not request any action from IANA.

8. Acknowledgements

   We want to acknowledge the following people for their valuable
   input to this draft: T. Sridhar, Ron Bonica, Kireeti Kompella, and
   K. K. Ramakrishnan.

9. References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC826]   Plummer, D., "An Ethernet Address Resolution Protocol",
              RFC 826, November 1982.

   [RFC1027]  Carl-Mitchell, S. and J. Quarterman, "Using ARP to
              Implement Transparent Subnet Gateways", RFC 1027,
              October 1987.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              September 2007.

   [ARMD-Problem]  Narten, T., "draft-ietf-armd-problem-statement",
              work in progress, October 2011.

   [DC-ARCH]  Karir, M., et al., "draft-karir-armd-datacenter-
              reference-arch", work in progress.

   [Gratuitous ARP]  Cheshire, S., "IPv4 Address Conflict Detection",
              RFC 5227, July 2008.

Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: ldunbar@huawei.com

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US
   Email: warren@kumari.net

   Igor Gashinsky
   Yahoo
   45 West 18th Street, 6th floor
   New York, NY 10011
   Email: igor@yahoo-inc.com