Network Working Group                                         L. Dunbar
Internet Draft                                                   Huawei
Intended status: Informational                                W. Kumari
Expires: November 2014                                           Google
                                                         Igor Gashinsky
                                                                  Yahoo
                                                             May 7, 2014

      Practices for scaling ARP and ND for large data centers
           draft-dunbar-armd-arp-nd-scaling-practices-08

Status of this Memo

This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This Internet-Draft will expire on November 7, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.

Internet-Draft        Practices to scale ARP/ND in large DC

Abstract

This draft documents some operational practices that allow ARP/ND to
scale in data center environments.

Table of Contents

1. Introduction...................................................3
2. Terminology....................................................4
3. Common DC network Designs......................................5
4. Layer 3 to Access Switches.....................................5
5. Layer 2 practices to scale ARP/ND..............................6
   5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary
        routers...................................................6
      5.1.1. Communicating with a peer in a different subnet......6
      5.1.2. L2/L3 boundary router processing of inbound traffic..7
      5.1.3. Inter-subnet communications..........................8
   5.2. Static ARP/ND entries on switches.........................8
   5.3.
ARP/ND Proxy approaches....................................9
   5.4. Multicast Scaling Issues..................................10
6. Practices to scale ARP/ND in Overlay models...................10
7. Summary and Recommendations...................................11
8. Security Considerations.......................................11
9. IANA Considerations...........................................11
10. Acknowledgements.............................................12
11. References...................................................12
   11.1. Normative References....................................12
   11.2. Informative References..................................13
Authors' Addresses...............................................13

1. Introduction

This draft documents some operational practices that allow ARP/ND to
scale in data center environments.

As described in [RFC6820], the increasing trend of rapid workload
shifting and server virtualization in modern data centers requires
servers to be loaded (or re-loaded) with different VMs or
applications at different times. Different VMs residing on one
physical server may have different IP addresses, or may even be in
different IP subnets.

In order to allow a physical server to be loaded with VMs in
different subnets, or VMs to be moved to different server racks
without IP address re-configuration, the network needs to enable
multiple broadcast domains (many VLANs) on the interfaces of L2/L3
boundary routers and ToR switches and allow some subnets to span
multiple router ports.

Note: The L2/L3 boundary routers in this draft are capable of
forwarding IEEE 802.1 Ethernet frames (layer 2) without MAC header
change.
When subnets span multiple ports of those routers, they still fall
under the category of a single link, specifically the multi-access
link model recommended by [RFC4903]. They are different from the
"multi-link" subnets described in [Multi-Link] and RFC4903, which
refer to different physical media with the same prefix connected to
one router. Within the "multi-link" subnet described in RFC4903,
layer 2 frames from one port cannot be natively forwarded to another
port without a header change.

Unfortunately, when the combined number of VMs (or hosts) in all
those subnets is large, this can lead to address resolution (i.e.,
IPv4 ARP and IPv6 ND) scaling issues. There are three major issues
associated with the ARP/ND address resolution protocols when subnets
span multiple L2/L3 boundary router ports:

1) ARP/ND messages are flooded to many physical link segments, which
   can reduce the bandwidth available for user traffic;
2) the ARP/ND processing load impacts the L2/L3 boundary routers;
3) in IPv4, every end station in a subnet receives ARP broadcast
   messages from all other end stations in the subnet. IPv6 ND has
   eliminated this issue by using multicast.

Since the majority of data center servers are moving towards 1G or
10G ports, the bandwidth consumed by ARP/ND messages, even when
flooded to all physical links, becomes negligible compared to the
link bandwidth. In addition, IGMP/MLD snooping [RFC4541] can further
reduce the ND multicast traffic to some physical link segments.

As modern servers' computing power increases, the processing consumed
by a large number of ARP broadcast messages becomes less significant
to servers. For example, lab testing shows that 2000 ARP requests per
second consume only 2% of a single-core CPU server.
Therefore, the impact of ARP broadcasts on end stations is not
significant on today's servers.

Statistics gathered by Merit Network [ARMD-Statistics] have shown
that the major impact of a large number of mobile VMs in a data
center is on the L2/L3 boundary routers, i.e., issue #2 above.

This draft documents some simple practices that can scale ARP/ND in a
data center environment, especially by reducing the processing load
on L2/L3 boundary routers.

2. Terminology

This document reuses much of the terminology from [RFC6820]. Many of
the definitions are presented here to aid the reader.

ARP:          IPv4 Address Resolution Protocol [RFC826]

Aggregation Switch: A Layer 2 switch interconnecting ToR switches

Bridge:       IEEE 802.1Q-compliant device. In this draft, "Bridge"
              is used interchangeably with "Layer 2 switch".

DC:           Data Center

DA:           Destination Address

End Station:  VM or physical server, whose address is either the
              destination or the source of a data frame.

EOR:          End of Row switches in a data center.

NA:           IPv6's Neighbor Advertisement

ND:           IPv6's Neighbor Discovery [RFC4861]

NS:           IPv6's Neighbor Solicitation

SA:           Source Address

ToR:          Top of Rack switch (also known as access switch).

UNA:          IPv6's Unsolicited Neighbor Advertisement

VM:           Virtual Machine

Subnet:       Refers to the multi-access link subnet model of
              [RFC4903].

3. Common DC network Designs

Some common network designs for a data center include:

1) Layer 3 connectivity to the access switch,

2) Large Layer 2, and

3) Overlay models.

There is no single network design that fits all cases. The following
sections document some of the common practices to scale address
resolution under each network design.

4.
Layer 3 to Access Switches

This network design configures Layer 3 to the access switches,
effectively making the access switches the L2/L3 boundary routers for
the attached VMs.

As described in [RFC6820], many data centers are architected so that
ARP/ND broadcast/multicast messages are confined to a few ports
(interfaces) of the access switches (i.e., ToR switches).

Another variant of the Layer 3 solution is a Layer 3 infrastructure
configured all the way to the servers (or even to the VMs), which
confines the ARP/ND broadcast/multicast messages to the small number
of VMs within each server.

Advantage: Both ARP and ND scale well. There is no address resolution
issue in this design.

Disadvantage: The main disadvantage of this network design occurs
during VM movement: either the VMs need an address change or the
switches/routers need a configuration change when the VMs are moved
to different locations.

Summary: This solution is more suitable for data centers that have a
static workload and/or network operators who can re-configure IP
addresses/subnets on switches before any workload change. No protocol
changes are suggested.

5. Layer 2 practices to scale ARP/ND

5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers

The ARP/ND broadcast/multicast messages in a Layer 2 domain can
negatively affect the L2/L3 boundary routers, especially with a large
number of VMs and subnets. This section describes some commonly used
practices for reducing the ARP/ND processing required on L2/L3
boundary routers.

5.1.1.
Communicating with a peer in a different subnet

Scenario: When the originating end station doesn't have its default
gateway's MAC address in its ARP/ND cache and needs to communicate
with a peer in a different subnet, it needs to send ARP/ND requests
to its default gateway router to resolve the router's MAC address. If
there are many subnets on the gateway router and a large number of
end stations in those subnets don't have the gateway MAC address in
their ARP/ND caches, the gateway router has to process a very large
number of ARP/ND requests. This is often CPU intensive, as ARP/ND
messages are usually processed by the CPU (and not in hardware).

Note: Any centralized configuration that pre-loads the default MAC
addresses is not included in this scenario.

Solution: For IPv4 networks, a practice to alleviate this problem is
to have the L2/L3 boundary router send periodic gratuitous ARP
[GratuitousARP] messages, so that all the connected end stations can
refresh their ARP caches. As a result, most (if not all) end stations
will not need to send ARP requests for the gateway routers when they
need to communicate with external peers.

For the above scenario, IPv6 end stations are still required to send
unicast ND messages to their default gateway router (even with those
routers periodically sending Unsolicited Neighbor Advertisements)
because IPv6 requires bi-directional path validation.

Advantage: Reduction of the ARP requests to be processed by the L2/L3
boundary router for IPv4.

Disadvantage: This practice doesn't reduce the ND processing on the
L2/L3 boundary router for IPv6 traffic.

Recommendation: If the network is an IPv4-only network, then this
approach can be used. For an IPv6 network, consider the work in
progress described in [Impatient-NUD].
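To illustrate the IPv4 practice above, the sketch below builds the kind of gratuitous ARP frame a gateway could broadcast periodically so that attached stations refresh their cache entry for it. This is only an illustrative sketch, not part of this draft's specification: the gateway MAC and IP addresses are hypothetical, and a real router would emit such a frame on each relevant VLAN.

```python
import struct

def gratuitous_arp_frame(gw_mac: bytes, gw_ip: bytes) -> bytes:
    """Build a gratuitous ARP request: the sender and target protocol
    addresses are both the gateway's own IP, and the Ethernet
    destination is broadcast, so every station on the link can refresh
    its ARP cache entry for the gateway."""
    # Ethernet header: broadcast DA, gateway SA, EtherType 0x0806 (ARP)
    eth = b"\xff" * 6 + gw_mac + struct.pack("!H", 0x0806)
    # ARP header: htype=1 (Ethernet), ptype=0x0800 (IPv4),
    # hlen=6, plen=4, op=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += gw_mac + gw_ip          # sender hardware / protocol address
    arp += b"\x00" * 6 + gw_ip     # target hardware (ignored) / protocol
    return eth + arp

# Hypothetical gateway: MAC 02:00:00:00:00:01, IP 192.0.2.1
frame = gratuitous_arp_frame(bytes.fromhex("020000000001"),
                             bytes([192, 0, 2, 1]))
# A deployment would write this 42-byte frame to a raw socket on each
# VLAN at a modest interval (e.g., once every few minutes).
```

Because the sender and target protocol addresses are identical, stations treat the frame as an announcement rather than a question, which is what lets the gateway pre-populate caches without soliciting replies.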
Note: ND and SEND [RFC3971] use the bi-directional nature of queries
to detect and prevent security attacks.

5.1.2. L2/L3 boundary router processing of inbound traffic

Scenario: When an L2/L3 boundary router receives a data frame
destined for a local subnet and the destination is not in the
router's ARP/ND cache, some routers hold the packet and trigger an
ARP/ND request to resolve the L2 address. The router may need to send
multiple ARP/ND requests until either a timeout is reached or an
ARP/ND reply is received before forwarding the data packets towards
the target's MAC address. This process is not only CPU intensive but
also buffer intensive.

Solution: To protect a router from being overburdened by resolving
target MAC addresses, one solution is for the router to limit the
rate at which it resolves target MAC addresses for inbound traffic
whose target is not in the router's ARP/ND cache. When the rate is
exceeded, the incoming traffic whose target is not in the ARP/ND
cache is dropped.

For an IPv4 network, another common practice to alleviate the pain
caused by this problem is for the router to snoop ARP messages
between other hosts, so that its ARP cache can be refreshed with
active addresses in the L2 domain. As a result, there is an increased
likelihood of the router's ARP cache having the IP-to-MAC entry when
it receives data frames from external peers. Section 7.1 of [RFC6820]
provides a full description of this problem.

For IPv6 end stations, routers are supposed to send unicast RAs even
if they have snooped UNA/NS/NA messages from those stations.
Therefore, this practice allows an L2/L3 boundary router to send a
unicast RA to the target instead of a multicast one. Section 7.2 of
[RFC6820] has a full description of this problem.
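The rate-limiting behaviour described in the solution above can be modelled with a simple token bucket: each inbound packet whose target is missing from the ARP/ND cache consumes a token before a resolution is triggered, and packets arriving when the bucket is empty are dropped. This is an illustrative model only; the rate and burst values are arbitrary, and it is not any particular vendor's implementation.

```python
import time

class ResolutionRateLimiter:
    """Token bucket limiting how many ARP/ND resolutions a router
    triggers per second for inbound packets with unknown targets."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens (resolutions) added per second
        self.burst = burst        # maximum bucket depth
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True if a resolution may be triggered now."""
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop instead of queuing ARP/ND

# Hypothetical policy: at most 100 resolutions/s, bursts of 10.
limiter = ResolutionRateLimiter(rate=100.0, burst=10.0)
decisions = [limiter.allow() for _ in range(20)]
```

In a burst of back-to-back cache misses, only the first `burst` packets trigger resolutions; the rest are dropped until tokens refill, which bounds both the CPU spent on ARP/ND and the buffer space tied up holding packets.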
Advantage: Reduction of the number of ARP requests that routers have
to send upon receiving IPv4 packets, and of the number of IPv4 data
frames from external peers that routers have to hold due to targets
not being in the ARP cache.

Disadvantage: For IPv6 traffic, the amount of ND processing on
routers is not reduced. IPv4 routers still need to hold data packets
from external peers and trigger ARP requests if the targets of the
data packets either don't exist or are not very active. In this case,
neither the IPv4 processing nor the IPv4 buffering is reduced.

Recommendation: If there is a higher chance of routers receiving data
packets that are destined for non-existent or inactive targets,
alternative approaches should be considered.

5.1.3. Inter-subnet communications

The router can be hit with ARP/ND twice when the originating and
destination stations are in different subnets attached to the same
router and those hosts don't communicate with external peers often
enough. The first hit is when the originating station in subnet A
initiates an ARP/ND request to the L2/L3 boundary router if the
router's MAC address is not in the host's cache (Section 5.1.1
above); the second hit is when the L2/L3 boundary router initiates
ARP/ND requests to the target in subnet B if the target is not in the
router's ARP/ND cache (Section 5.1.2 above).

Again, the practices described in Sections 5.1.1 and 5.1.2 can
alleviate some problems in some IPv4 networks.

For IPv6 traffic, those practices don't reduce the ND processing on
L2/L3 boundary routers.

Recommendation: Consider the recommended approaches described in
Sections 5.1.1 and 5.1.2. However, any solution that relaxes the
bi-directional requirement of IPv6 ND disables the security that the
two-way ND communication exchange provides.

5.2.
Static ARP/ND entries on switches

In a data center environment, the placement of L2 and L3 addresses
may be orchestrated by Server (or VM) Management System(s). Therefore
it may be possible for static ARP/ND entries to be configured on
routers and/or servers.

Advantage: This methodology has been used to reduce ARP/ND
fluctuations in large-scale data center networks.

Disadvantage: When some VMs are added, deleted, or moved, many
switches' static entries need to be updated. In a data center with
virtualized servers, those events can happen frequently. For example,
when one VM is added to one server, if the subnet of this VM spans 15
access switches, all of them need to be updated. Network management
(SNMP, NETCONF, or proprietary) mechanisms are available to provide
updates or incremental updates. However, there is no well-defined
approach for switches to synchronize their content with the
management system for efficient incremental updates.

Recommendation: Additional work may be needed within the IETF (e.g.,
NETCONF, NVO3, I2RS, etc.) to get prompt incremental updates of
static ARP/ND entries when changes occur.

5.3. ARP/ND Proxy approaches

RFC1027 [RFC1027] specifies one ARP proxy approach. However, RFC1027
is not a scaling mechanism. Since the publication of RFC1027 in 1987,
many variants of ARP proxy have been deployed. RFC1027's ARP proxy is
for a gateway to return its own MAC address on behalf of the target
station.

[ARP_Reduction] describes a type of "ARP proxy" in which a ToR switch
snoops ARP requests and returns the target station's MAC address if
the ToR has the information in its cache.
However, [RFC4903] doesn't recommend the caching approach described
in [ARP_Reduction], because such a cache prevents any type of fast
mobility between layer 2 ports and breaks Secure Neighbor Discovery
[RFC3971].

IPv6 ND Proxy [RFC4389] specifies a proxy used between an Ethernet
segment and other segments, such as wireless or PPP segments. ND
Proxy [RFC4389] doesn't allow a proxy to send an NA on behalf of the
target, to ensure that the proxy does not interfere with hosts moving
from one segment to another. Therefore, ND Proxy [RFC4389] doesn't
reduce the number of ND messages to the L2/L3 boundary router.

Bottom line: the term "ARP/ND proxy" has different interpretations
depending on vendors and/or environments.

Recommendation: For IPv4, even though those proxy ARP variants (not
RFC1027) have been used to reduce ARP traffic in various
environments, there are many issues with caching.

The IETF should consider making proxy recommendations for the data
center environment as a transition issue, to help DC operators
transition to IPv6. The "Guidelines for proxy developers" in
[RFC4389] should be considered when developing any new proxy
protocols to scale ARP.

5.4. Multicast Scaling Issues

Multicast snooping (IGMP/MLD) has different implementations and
scaling issues. [RFC4541] notes that multicast IGMPv2/v3 snooping has
trouble with subnets that include both IGMPv2 and IGMPv3. [RFC4541]
also notes that MLDv2 snooping requires the use of either DMAC
address filtering or deeper inspection of frames/packets to allow for
scaling.

MLDv2 snooping needs to be re-examined for scaling within the DC.
Efforts such as IGMP/MLD explicit tracking [IGMP-MLD-tracking] for
downstream hosts need to provide better scaling than IGMP/MLDv2
snooping.

6.
Practices to scale ARP/ND in Overlay models

There are several drafts on using overlay networks to scale large
layer 2 networks (or avoid the need for large L2 networks) and enable
mobility (e.g., draft-wkumari-dcops-l3-vmmobility-00,
draft-mahalingam-dutt-dcops-vxlan-00). TRILL and IEEE 802.1ah
(MAC-in-MAC) are other types of overlay networks to scale Layer 2.

Overlay networks hide the VMs' addresses from the interior switches
and routers, thereby greatly reducing the number of addresses exposed
to the interior switches and routers. The overlay edge nodes that
perform the network address encapsulation/decapsulation still handle
all the remote stations' addresses that communicate with the locally
attached end stations.

For a large data center with many applications, these applications'
IP addresses need to be reachable by external peers. Therefore, the
overlay network may have a bottleneck at the gateway node(s) in
resolving target stations' physical addresses (MAC or IP) and the
overlay edge addresses within the data center.

Here are some approaches that can be used to minimize the problem:

1. Use static mapping as described in Section 5.2.

2. Have multiple L2/L3 boundary nodes (i.e., routers), with each
   handling a subset of the station addresses that are visible to
   external peers (e.g., Gateway #1 handles one set of prefixes,
   Gateway #2 handles another set of prefixes, etc.).

7. Summary and Recommendations

This memo describes some common practices that can alleviate the
impact of address resolution on L2/L3 gateway routers.

In data centers, no single solution fits all deployments. This memo
has summarized some practices for various scenarios and the
advantages and disadvantages of those practices.
In some of these scenarios, the common practices could be improved by
creating and/or extending existing IETF protocols. These protocol
change recommendations are:

. Relax some of the bi-directional requirements of IPv6 ND in some
  environments. However, other issues will be introduced when the
  bi-directional requirement of ND is relaxed. Therefore, a
  comprehensive study is necessary before making those changes.

. Create incremental "update" schemes for efficient static ARP/ND
  entries.

. Develop IPv4 ARP / IPv6 ND proxy standards for use in the data
  center. The "Guidelines for proxy developers" in [RFC4389] should
  be considered when developing any new proxy protocols to scale
  ARP/ND.

. Consider the scaling issues with IGMP/MLD snooping to determine
  whether new alternatives can provide better scaling.

8. Security Considerations

This draft documents existing solutions and proposes additional work
that could be initiated to extend various IETF protocols to better
scale ARP/ND for the data center environment.

Security is a major issue for the data center environment. Therefore,
security should be seriously considered when developing any future
protocol extensions.

9. IANA Considerations

This document does not request any action from IANA.

10. Acknowledgements

We want to acknowledge the ARMD WG and the following people for their
valuable input to this draft: Joel Jaeggli, Dave Thaler, Susan Hares,
Benson Schliesser, T. Sridhar, Ron Bonica, Kireeti Kompella, and
K. K. Ramakrishnan.

11. References

11.1. Normative References

[GratuitousARP] S. Cheshire, "IPv4 Address Conflict Detection",
          RFC 5227, July 2008.

[IGMP-MLD-tracking] H. Asaeda and N.
Leymann, "IGMP/MLD-Based Explicit Membership Tracking Function for
          Multicast Routers", draft-ietf-pim-explicit-tracking-02
          (http://tools.ietf.org/html/draft-ietf-pim-explicit-
          tracking-02), Oct 2012.

[RFC826]  D. C. Plummer, "An Ethernet Address Resolution Protocol",
          RFC 826, Nov 1982.

[RFC1027] Mitchell, et al., "Using ARP to Implement Transparent
          Subnet Gateways"
          (http://datatracker.ietf.org/doc/rfc1027/).

[RFC3971] Arkko, et al., "SEcure Neighbor Discovery (SEND)",
          RFC 3971, March 2005.

[RFC4389] Thaler, et al., "Neighbor Discovery Proxies (ND Proxy)",
          RFC 4389, April 2006.

[RFC4541] Christensen, et al., "Considerations for Internet Group
          Management Protocol (IGMP) and Multicast Listener Discovery
          (MLD) Snooping Switches", RFC 4541, May 2006.

[RFC4861] Narten, et al., "Neighbor Discovery for IP version 6
          (IPv6)", RFC 4861, Sept 2007.

[RFC4903] Thaler, "Multilink Subnet Issues", RFC 4903, July 2007.

[RFC6820] Narten, et al., "Address Resolution Problems in Large Data
          Center Networks", RFC 6820, Jan 2013.

11.2. Informative References

[Impatient-NUD] E. Nordmark and I. Gashinsky,
          draft-ietf-6man-impatient-nud.

[ARMD-Statistics] M. Karir and J. Rees, "Address Resolution
          Statistics", draft-karir-armd-statistics-01.txt (expired),
          July 2011.
          https://datatracker.ietf.org/doc/draft-karir-armd-
          statistics/

[ARP_Reduction] Shah, et al., "ARP Broadcast Reduction for Large Data
          Centers", draft-shah-armd-arp-reduction-02.txt (expired),
          Oct 2011.
          https://datatracker.ietf.org/doc/draft-shah-armd-arp-
          reduction/

[Multi-Link] Thaler, et al., "Multi-link Subnet Support in IPv6",
          draft-ietf-ipv6-multilink-subnets-00.txt (expired),
          Dec 2002.
          https://datatracker.ietf.org/doc/draft-ietf-ipv6-
          multilink-subnets/

Authors' Addresses

Linda Dunbar
Huawei Technologies
5340 Legacy Drive, Suite 175
Plano, TX 75024, USA
Phone: (469) 277 5840
Email: ldunbar@huawei.com

Warren Kumari
Google
1600 Amphitheatre Parkway
Mountain View, CA 94043
US
Email: warren@kumari.net

Igor Gashinsky
Yahoo
45 West 18th Street, 6th floor
New York, NY 10011
Email: igor@yahoo-inc.com