ARMD                                                           L. Dunbar
Internet Draft                                                    Huawei
Intended status: Information Track                             W. Kumari
Expires: July 2012                                                Google
                                                            I. Gashinsky
                                                                   Yahoo
                                                         January 3, 2012

             BCP for ARP-ND Scaling for Large Data Centers

              draft-dunbar-armd-arp-nd-scaling-bcp-00.txt

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on July 3, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the BSD License.

Abstract

   This draft documents some simple, well-established practices that
   can scale ARP/ND in data center environments.

Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

Table of Contents

   1. Introduction
   2. Terminology
   3. Potential Solutions to Scale Address Resolution in DC
      3.1. Layer 3 solution
      3.2. Commonly practiced Layer 2 solution to scale address
           resolution
         3.2.1. When a host needs to communicate with an external peer
         3.2.2. When the L2/L3 boundary router receives an IP packet
                towards a host in one of its subnets
         3.2.3. Hosts in two different subnets served by the router
                communicate with each other
      3.3. Static ARP/ND entries on switches
      3.4. DNS based solution
      3.5. ARP/ND Proxy approaches
      3.6. Overlay models
   4. Summary and Recommendations
   5. Manageability Considerations
   6. Security Considerations
   7. IANA Considerations
   8. Acknowledgments
   9. References
   Authors' Addresses
   Intellectual Property Statement
   Disclaimer of Validity

1. Introduction

   As described in [ARMD-Problem], the increasing trend of rapid
   workload shifting and server virtualization in modern data centers
   requires servers to be loaded (or re-loaded) with different hosts or
   applications at different times. The different hosts loaded onto one
   physical server may have different IP addresses, or even be in
   different IP subnets.

   In order to allow a physical server to be re-loaded with hosts in
   different subnets, or VMs to be moved to different server racks
   without IP address re-configuration, the corresponding networks need
   multiple broadcast domains (many VLANs) on the interfaces of
   L2/L3 boundary routers and ToR switches.
   Unfortunately, this kind of network can lead to address resolution
   scaling issues, especially on the L2/L3 boundary routers, when the
   combined number of hosts in all those subnets is large.

   This document describes some potential solutions which can minimize
   the ARP/ND scaling issues in a Data Center environment.

2. Terminology

   ARP: IPv4 Address Resolution Protocol [RFC826]

   Aggregation Switch: A Layer 2 switch interconnecting ToR switches

   Bridge: IEEE802.1Q compliant device. In this draft, Bridge is used
   interchangeably with Layer 2 switch.

   DC: Data Center

   DA: Destination Address

   EOR: End of Row switch in a data center.

   NA: IPv6 Neighbor Advertisement

   ND: IPv6 Neighbor Discovery [RFC4861]

   NS: IPv6 Neighbor Solicitation

   SA: Source Address

   ToR: Top of Rack Switch. It is also known as an access switch.

   UNA: IPv6 Unsolicited Neighbor Advertisement

   VM: Virtual Machine

3. Potential Solutions to Scale Address Resolution in DC

   The following solutions have been indicated by data center operators:

   1) layer-3 connectivity to the access switch,

   2) practices to scale ARP/ND in layer 2,

   3) static ARP/ND entries,

   4) DNS based approaches, and

   5) extensions to proxy ARP [RFC1027].

   There is no single solution that fits all cases. This section
   suggests the best practices for each type of solution.

3.1. Layer 3 solution

   This refers to the network design with Layer 3 to the access
   switches.

   As described in [ARMD-Problem], many data centers are designed this
   way, so that ARP/ND broadcast/multicast messages are confined to a
   few ports (interfaces) of the access switches (i.e. ToR switches).

   Another variant of the Layer 3 solution is Layer 3 all the way to
   the servers, or even to the VMs. The ARP/ND broadcast/multicast
   messages are then further confined to the small number of hosts
   within the server, or to none at all.

   Advantage: Both ARP and ND scale well. There is no address
   resolution issue in this design.

   Disadvantage: The main disadvantage of this solution is that IP
   addresses have to be re-configured on switches when a server needs to
   be re-loaded with an application in a different subnet, or when VMs
   need to be moved to a different location.

   Recommendation: This solution is more suitable for data centers which
   have static workloads, or for network operators who can properly
   re-configure IP addresses/subnets on switches before any workload
   change. No protocol changes are suggested.

3.2. Commonly practiced Layer 2 solution to scale address resolution

   L2/L3 boundary routers can be heavily impacted by the ARP/ND
   broadcast/multicast messages in a Layer 2 domain which is mapped to
   one or multiple subnets (or VLANs) with a large combined number of
   hosts. This section describes some commonly used practices for
   reducing the ARP/ND processing required on L2/L3 boundary (or
   gateway) routers.

3.2.1. When a host needs to communicate with an external peer

   When the external peer is in a different subnet, the originating host
   needs to send ARP/ND requests to its default gateway router to get
   the router's MAC address. If there are many subnets enabled on the
   gateway router with a large combined number of hosts in all those
   subnets, the gateway router has to process a very large number of
   ARP/ND requests, which is CPU intensive.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is to have the L2/L3 boundary router (or gateway router) send
   periodic gratuitous ARP messages, so that all the connected hosts can
   refresh their ARP caches. As a result, most hosts, if not all,
   won't send ARP messages to gateway routers when they need to
   communicate with external hosts.
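
   As an illustration only, the following Python sketch builds the
   kind of gratuitous ARP frame a gateway could broadcast
   periodically. It assumes the ARP announcement form described in
   RFC 5227 (an ARP request whose sender and target IP addresses are
   both the announced address). The function names and the example
   MAC/IP values are hypothetical, and actually transmitting the frame
   (which needs a raw socket and privileges) is left out.

```python
import ipaddress
import struct

ETHERTYPE_ARP = 0x0806
BROADCAST_MAC = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    raw = bytes(int(octet, 16) for octet in mac.split(":"))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return raw


def build_gratuitous_arp(gateway_mac: str, gateway_ip: str) -> bytes:
    """Return an Ethernet frame carrying an ARP announcement for the gateway."""
    sender_mac = mac_to_bytes(gateway_mac)
    sender_ip = ipaddress.IPv4Address(gateway_ip).packed

    # Ethernet header: broadcast destination, gateway source, ARP ethertype.
    ethernet_header = BROADCAST_MAC + sender_mac + struct.pack("!H", ETHERTYPE_ARP)

    # Fixed part of the ARP payload.
    arp_fixed = struct.pack(
        "!HHBBH",
        1,       # hardware type: Ethernet
        0x0800,  # protocol type: IPv4
        6,       # hardware address length
        4,       # protocol address length
        1,       # opcode: request (assumed announcement form, see lead-in)
    )

    # Sender MAC/IP, zeroed target MAC, and target IP equal to sender IP.
    arp_addresses = sender_mac + sender_ip + b"\x00" * 6 + sender_ip
    return ethernet_header + arp_fixed + arp_addresses
```

   Hosts that receive such a frame can refresh their cache entry for
   the gateway without sending their own ARP request. How often the
   gateway sends it is a deployment choice relative to host ARP cache
   timeouts and is not specified here.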

   However, IPv6 hosts are still required to send ND messages, via
   unicast, to their default gateway router even when the gateway
   router periodically sends Unsolicited Neighbor Advertisements. This
   is because IPv6 requires bi-directional path validation before a
   data packet can be sent.

   Advantage: Reduction of ARP requests to be processed by the L2/L3
   boundary router for IPv4.

   Disadvantage: No reduction of ND processing on the L2/L3 boundary
   router for IPv6 traffic.

   Recommendation: Use for IPv4-only networks, or change the ND protocol
   to allow data frames to be sent without requiring bidirectional frame
   validation.

3.2.2. When the L2/L3 boundary router receives an IP packet towards a
   host in one of its subnets

   When the source address is in a different subnet and the target is
   not in the router's ARP/ND cache, the router usually holds the packet
   and triggers an ARP/ND request to make sure the target actually
   exists in its L2 domain. The router may need to send multiple ARP/ND
   requests until either a timeout is reached or an ARP/ND reply is
   received. After this, the gateway router can forward the data packets
   towards the target's MAC address. This process is not only CPU
   intensive but also buffer intensive.

   Solution: For IPv4 networks, a common practice to alleviate this
   problem is for the L2/L3 boundary router (or gateway router) to snoop
   ARP messages, so that its ARP cache is refreshed with the active
   hosts in its L2 domain. As a result, there is an increased likelihood
   that the router's ARP cache already holds the IP-MAC entry when it
   receives data frames from external subnets.

   For IPv6 hosts, a router is supposed to send unicast ND messages even
   if it has snooped UNA/NS/NA messages from those hosts. Therefore,
   this practice doesn't help IPv6 very much.
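
   The IPv4 snooping practice in this section can be sketched as a
   small cache that is refreshed from every observed ARP message and
   consulted before the router decides to hold a packet. This is a
   minimal illustration, not a router implementation: the class name,
   the max_age parameter, and the idea of passing the sender fields in
   as strings are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    mac: str
    last_seen: float


class SnoopingArpCache:
    """IP-to-MAC cache refreshed from snooped ARP sender fields."""

    def __init__(self, max_age: float = 300.0) -> None:
        self.max_age = max_age  # seconds an entry stays usable (assumed value)
        self._entries: Dict[str, CacheEntry] = {}

    def snoop(self, sender_ip: str, sender_mac: str,
              now: Optional[float] = None) -> None:
        """Record the sender of any ARP message seen on the L2 domain."""
        now = time.monotonic() if now is None else now
        self._entries[sender_ip] = CacheEntry(sender_mac, now)

    def lookup(self, ip: str, now: Optional[float] = None) -> Optional[str]:
        """Return the cached MAC, or None if the router must send an ARP request."""
        now = time.monotonic() if now is None else now
        entry = self._entries.get(ip)
        if entry is None:
            return None
        if now - entry.last_seen > self.max_age:
            # Stale entry: drop it so the caller falls back to an ARP request.
            del self._entries[ip]
            return None
        return entry.mac
```

   A lookup that returns None corresponds to the case where the router
   still has to hold the packet and send an ARP request, which is why
   the practice helps only for hosts that are active enough to be
   snooped.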

   Advantage: Reduction of the ARP requests which routers have to send
   upon receiving IPv4 packets, and of the number of IPv4 data frames
   from external subnets which routers have to hold.

   Disadvantage: The amount of ND processing on routers for IPv6 traffic
   is not reduced. Even for IPv4, routers still need to hold data
   packets from external subnets and trigger ARP requests if the targets
   of the data packets either don't exist or are not very active.

   Recommendation: Do not use with IPv6, or make protocol changes to
   IPv6 ND. For IPv4, if there is a high chance of routers receiving
   data packets towards non-existing or inactive targets, alternative
   approaches should be considered.

3.2.3. Hosts in two different subnets served by the router communicate
   with each other

   The router will be hit twice in this scenario: once when the
   originating host in subnet-A sends an ARP/ND request to the gateway
   (3.2.1 above), and a second time when the gateway sends ARP/ND
   requests to the target in subnet-B (3.2.2 above).

   Again, the practices described in 3.2.1 and 3.2.2 can alleviate
   problems in IPv4 networks, but don't help very much for IPv6.

   Advantage: Reduction of ARP processing on L2/L3 boundary routers for
   IPv4 traffic.

   Disadvantage: For IPv6 traffic, there is no reduction of ND
   processing on L2/L3 boundary routers.

   Recommendation: Do not use with IPv6, or consider other approaches.

3.3. Static ARP/ND entries on switches

   In a data center environment, the placement of applications on
   servers, racks, and rows may be orchestrated by Server (or VM)
   Management System(s). Therefore it is possible for static ARP/ND
   entries to be downloaded to switches, routers or servers.

   Advantage: This methodology has been used to reduce ARP/ND
   fluctuations in large scale deployments.

   Disadvantage: There is no well-defined mechanism for switches to get
   static ARP/ND entries, to get prompt updates of static ARP/ND entries
   when changes occur, or to perform certain steps when switches go
   through a reset.

   Recommendation: The IETF should create a well-defined mechanism (or
   protocols) for switches or servers to get static ARP/ND entries.

3.4. DNS based solution

   This solution is best suited to environments where applications
   resolve the addresses of the things they need to connect to via DNS,
   and periodically refresh these addresses. While this solution is very
   well known and extensively used, it is mainly appropriate for
   stateless services, or for services that have a large number of
   short-lived connections. While elegant, it may not be appropriate for
   generic host migration.

   When a VM is to be moved to a new location, the steps are as
   follows:

   1) Instantiate the service on a VM in a distant rack. The new VM
      gets a new IP address.

   2) Change the address of the service in DNS.

   3) Wait for the DNS TTL to expire. While you are waiting, watch the
      number of connections to the new VM increase and the number of
      connections to the old VM decrease.

   4) Wait a little longer. When the number of connections to the old
      VM reaches zero, shut down the old VM.

   Advantage: DNS is existing technology and this is a well-known,
   commonly practiced technique.

   Disadvantage: This approach is not suitable for multi-tenant
   scenarios, or when the data center operator does not have full
   control of the applications.

   Recommendation: Limit use to cases where the data center operator
   controls the entire application and runs the DNS. This approach is
   more appropriate for service migration than for host/VM migration.

3.5. ARP/ND Proxy approaches

   RFC1027 specifies one ARP proxy approach.
   Since RFC1027, which was published in 1987, many variants of ARP
   proxy have been deployed. The term "ARP Proxy" is a loaded phrase,
   with different interpretations depending on vendors and/or
   environments. RFC1027's ARP Proxy is for a Gateway to return its own
   MAC address on behalf of the target host. Another technique, also
   called "ARP Proxy", is for a ToR switch to snoop ARP requests and
   return the target host's MAC address if it knows it.

   Advantage: Proxy ARP [RFC1027] and its variants have allowed
   multi-subnet ARP traffic for over a decade.

   Disadvantage: The Proxy ARP protocol [RFC1027] was developed prior to
   the concept of VLANs and for hosts which don't support subnets, and
   it does not provide the needed scaling.

   Recommendation: Revise RFC1027 with VLAN support and scalability for
   the Data Center environment.

3.6. Overlay models

   There are several drafts on using overlay networks to scale large
   layer 2 networks and enable mobility (e.g.
   draft-wkumari-dcops-l3-vmmobility-00,
   draft-mahalingam-dutt-dcops-vxlan-00). TRILL and IEEE802.1ah
   (Mac-in-Mac) are other types of overlay network to scale Layer 2.

   Overlay networks hide the hosts' addresses from the interior switches
   and routers. The Overlay Edge nodes which perform the network address
   encapsulation/decapsulation still see the addresses of all remote
   hosts which communicate with locally attached hosts.

   For a large data center with tens of thousands of applications
   communicating with peers outside the data center, all those
   applications' IP addresses are visible to external peers. When a
   great number of VMs move freely within a data center, those VMs'
   IP addresses might not aggregate well on gateway routers, causing
   the forwarding table size to explode.
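
   A small worked example may help show why free VM movement hurts
   aggregation. The Python sketch below uses the standard ipaddress
   module to count how many routes are needed when host addresses are
   summarized per location. The rack names, the addresses, and the
   simplification that each location is summarized independently are
   assumptions made purely for illustration.

```python
import ipaddress
from typing import Dict, List


def routes_per_location(placement: Dict[str, List[str]]) -> Dict[str, list]:
    """Collapse the host addresses at each location into the fewest prefixes."""
    routes = {}
    for location, hosts in placement.items():
        host_routes = [ipaddress.ip_network(f"{host}/32") for host in hosts]
        routes[location] = list(ipaddress.collapse_addresses(host_routes))
    return routes


def total_routes(placement: Dict[str, List[str]]) -> int:
    """Total number of prefixes needed across all locations."""
    return sum(len(prefixes) for prefixes in routes_per_location(placement).values())
```

   With four VMs 10.1.0.0 through 10.1.0.3 all on one rack, the
   addresses collapse into the single prefix 10.1.0.0/30. If 10.1.0.1
   and 10.1.0.2 move to a second rack, neither rack's remaining
   addresses can be merged, and four host routes are needed instead of
   one.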

   When the Gateway router receives a data frame from an external peer
   destined to a target within the data center, the router needs to
   resolve the target's MAC address and the Overlay Edge node's address
   in order to perform the proper overlay encapsulation.

   Therefore, the overlay network will have a bottleneck at the Gateway
   router(s), which must resolve the target hosts' physical addresses
   (MAC or IP) and overlay edge addresses within the data center.

   Here are some approaches being used to minimize the problem:

   1. Use static mapping as described in Section 3.3.

   2. Have multiple gateway nodes (i.e. routers), with each handling
      a subset of the host addresses which are visible to external
      peers, e.g. Gateway #1 handles one set of prefixes, Gateway #2
      handles another set of prefixes, etc. This architecture assumes
      that each gateway has enough downstream ports to be connected to
      all server racks.

      If each server rack is allowed to instantiate hosts/applications
      with any IP addresses, or any VM is allowed to move anywhere
      without re-configuring IP/MAC addresses, each gateway has to
      resolve addresses which are potentially located on any server
      rack. The address resolution processing for each gateway can
      still be very heavy.

4. Summary and Recommendations

   This memo describes some best practices which can alleviate the
   impact of address resolution on L2/L3 gateway routers.

   In the Data Center, no single solution fits all deployments. This
   memo has summarized five different technologies and the advantages
   and disadvantages of each.

   In some of these scenarios, the best practices could be improved by
   creating and/or extending existing IETF protocols. These protocol
   change recommendations are:

      Extend the IPv6 ND method,

      Create a "download" static ARP/ND entry protocol,

      Revise Proxy ARP [RFC1027] for use in the data center.

5. Manageability Considerations

   This text gives recommendations for best practices in order to
   improve the manageability of data centers.

6. Security Considerations

   Security will be addressed in a separate document.

7. IANA Considerations

   None.

8. Acknowledgments

   We want to acknowledge the following people for their valuable input
   to this draft: K.K. Ramakrishnan.

   This document was prepared using 2-Word-v2.0.template.dot.

9. References

   [ARP] Plummer, D.C., "An Ethernet address resolution protocol",
         RFC 826, Nov 1982.

   [DC-ARCH] Karir, et al., "draft-karir-armd-datacenter-reference-arch".

   [ARMD-Problem] Narten, "draft-ietf-armd-problem-statement", work in
         progress, Oct 2011.

   [Gratuitous ARP] Cheshire, S., "IPv4 Address Conflict Detection",
         RFC 5227, July 2008.

Authors' Addresses

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: ldunbar@huawei.com

   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US
   Email: warren@kumari.net

   Igor Gashinsky
   Yahoo
   45 West 18th Street 6th floor
   New York, NY 10011
   Email: igor@yahoo-inc.com

Intellectual Property Statement

   The IETF Trust takes no position regarding the validity or scope of
   any Intellectual Property Rights or other rights that might be
   claimed to pertain to the implementation or use of the technology
   described in any IETF Document or the extent to which any license
   under such rights might or might not be available; nor does it
   represent that it has made any independent effort to identify any
   such rights.

   Copies of Intellectual Property disclosures made to the IETF
   Secretariat and any assurances of licenses to be made available, or
   the result of an attempt made to obtain a general license or
   permission for the use of such proprietary rights by implementers or
   users of this specification can be obtained from the IETF on-line IPR
   repository at http://www.ietf.org/ipr

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   any standard or specification contained in an IETF Document. Please
   address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

   All IETF Documents and the information contained therein are provided
   on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION THEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.