Internet Engineering Task Force                                T. Narten
Internet-Draft                                                       IBM
Intended status: Informational                                  M. Karir
Expires: September 13, 2012                           Merit Network Inc.
                                                                  I. Foo
                                                    Huawei Technologies
                                                          March 12, 2012

                       Problem Statement for ARMD
                  draft-ietf-armd-problem-statement-02

Abstract

   This document examines address resolution issues related to the massive scaling of data centers. Our initial scope is relatively narrow. Specifically, it focuses on address resolution (ARP and ND) within the data center.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 13, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Background
   4.  Generalized Data Center Design
     4.1.  Access Layer
     4.2.  Aggregation Layer
     4.3.  Core
     4.4.  Layer 3 / Layer 2 Topological Variations
       4.4.1.  Layer 3 to Access Switches
       4.4.2.  L3 to Aggregation Switches
       4.4.3.  L3 in the Core only
       4.4.4.  Overlays
     4.5.  Factors that Affect Data Center Design
       4.5.1.  Traffic Patterns
       4.5.2.  Virtualization
   5.  Address Resolution in IPv4
   6.  Address Resolution in IPv6
   7.  Problem Itemization
     7.1.  ARP Processing on Routers
     7.2.  IPv6 Neighbor Discovery
     7.3.  MAC Address Table Size Limitations in Switches
   8.  Summary
   9.  Acknowledgments
   10.  IANA Considerations
   11.  Security Considerations
   12.  Change Log
     12.1.  Changes from -01
     12.2.  Changes from -00
   13.  Informative References
   Authors' Addresses

1.  Introduction

   This document examines issues related to the massive scaling of data centers. Specifically, we focus on address resolution (ARP in IPv4 and Neighbor Discovery in IPv6) within the data center. Although, strictly speaking, the scope of address resolution is confined to a single L2 broadcast domain (i.e., ARP operates at the link layer, below IP), the issue is complicated by routers having many interfaces on which address resolution must be performed and by IEEE 802.1Q domains, in which individual VLANs form their own broadcast domains. Thus, the scope of address resolution spans both the L2 links and the devices attached to those links.

   This document is a product of the ARMD WG and identifies potential issues associated with address resolution in data centers with a massive number of hosts. The scope of this document is intentionally narrow, as it mirrors the ARMD WG charter. It aims to list the "pain points" being experienced in current data centers, focusing on address resolution issues rather than the broader set of issues that might arise in data centers.

2.  Terminology

   Application: A software process that runs on either a physical or virtual machine, providing a service (e.g., a web server or database server).

   Broadcast Domain: The set of all links, repeaters, and switches that are traversed in order to reach all nodes that are members of a given L2 domain. For example, when a broadcast packet is sent on a VLAN, the broadcast domain includes all the links and switches that the packet traverses.

   Host (or server): A computer system on the network. This might be a standalone physical host, a hypervisor host capable of running multiple VMs, or a VM itself. A physical host can support an application running on an operating system on the "bare metal" or multiple applications running within individual VMs on top of a hypervisor. Traditional non-virtualized systems have a single IP address (or a small number of them) assigned to them. In contrast, a virtualized system uses many IP addresses: one for the hypervisor plus one (or more) for each individual VM.

   Hypervisor: Software running on a host that allows multiple VMs to run on the same host.

   L2 domain: A Layer 2 broadcast domain, such as an IEEE 802.1Q domain, which is capable of supporting up to 4095 VLANs. The notion of an L2 broadcast domain is closely tied to individual VLANs; broadcast traffic (or flooding used to reach all destinations) reaches every member of the specific VLAN being used.

   Virtual machine (VM): A software implementation of a physical machine that runs programs as if they were executing on a bare machine. Applications do not know they are running on a VM as opposed to running on a "bare" host or server.

   ToR: Top of Rack Switch. A switch placed in a single rack to aggregate network connectivity to and from hosts in that rack.

   EoR: End of Row Switch. A switch used to aggregate network connectivity from multiple racks.

3.  Background

   Large, flat L2 networks have long been known to have scaling problems. As the size of an L2 network increases, the level of broadcast traffic from protocols like ARP increases. Large amounts of broadcast traffic pose a particular burden because every device (switch, host and router) must process and possibly act on such traffic. In extreme cases, "broadcast storms" can occur, where the quantity of broadcast traffic reaches a level that effectively brings down part or all of a network. For example, poor implementations of loop detection and prevention can create conditions that lead to broadcast storms as network conditions change. The conventional wisdom for addressing such problems has been to say "don't do that": that is, split large L2 networks into multiple smaller L2 networks, each operating as its own L3/IP subnet. Numerous data center networks have been designed according to this principle, e.g., with each rack placed within its own L3 IP subnet. By doing so, the broadcast domain (and address resolution) is confined to one Top-of-Rack switch, which works well from a scaling perspective. Unfortunately, this conflicts in some ways with the current trend towards dynamic workload shifting in data centers and increased virtualization, as discussed below.

   Workload placement has become a challenging task within data centers. Ideally, it is desirable to be able to move workloads around within a data center in order to optimize server utilization, add additional servers in response to increased demand, etc.
However, servers are often pre-configured to run with a given set of IP addresses, and the placement of such servers is then constrained by the IP addressing scheme of the data center. For example, servers configured with addresses from a particular subnet can only be placed where they can connect to the IP subnet corresponding to their addresses. If each top-of-rack switch acts as the gateway for its own subnet, such a server can only be connected to that one top-of-rack switch. This gateway switch represents the Layer 2/Layer 3 boundary. A similar constraint occurs in virtualized environments, as discussed next.

   Server virtualization is fast becoming the norm in data centers. With server virtualization, each physical server supports multiple virtual servers, each running its own operating system, middleware and applications. Virtualization is a key enabler of workload agility, i.e., allowing any server to host any application and providing the flexibility of adding, shrinking, or moving services within the physical infrastructure. Server virtualization provides numerous benefits, including higher utilization, increased data security, reduced user downtime, and even significant power conservation, along with the promise of a more flexible and dynamic computing environment.

   The discussion below focuses on VM placement and migration. Keep in mind, however, that even in a non-virtualized environment, many of the same issues apply to individual workloads running on standalone machines. For example, when increasing the number of servers running a particular workload to meet demand, placement of those workloads may be constrained by IP subnet numbering considerations.

   The greatest flexibility in VM and workload management occurs when it is possible to place a VM (or workload) anywhere in the data center, regardless of what IP addresses the VM uses and how the physical network is laid out. In practice, movement of VMs within a data center is easiest when VM placement and movement do not conflict with the IP subnet boundaries of the data center's network, so that the VM's IP address need not be changed to reflect its actual point of attachment on the network from an L3/IP perspective. In contrast, if a VM moves to a new IP subnet, its address must change, and clients will need to be made aware of that change. From a VM management perspective, management is simplified if all servers are on a single large L2 network.

   With virtualization, a single physical server can host 10 (or more) VMs, each having its own IP (and MAC) addresses. Consequently, the number of addresses per machine (and hence per subnet) is increasing, even when the number of physical machines stays constant. Today, it is not uncommon to support tens of VMs per physical server; in a few years, the numbers will likely be even higher.

   In the past, services were static in the sense that they tended to stay in one physical place. A service installed on a machine would stay on that machine because the cost of moving a service elsewhere was generally high. Moreover, services would tend to be placed in such a way as to facilitate communication locality; that is, servers would be physically located near the services they accessed most heavily.
The network traffic patterns in such environments could thus be optimized, in some cases keeping significant traffic local to one network segment. In these more static and carefully managed environments, it was possible to build networks that approached scaling limitations but did not actually cross the threshold.

   Today, with the proliferation of VMs, traffic patterns are becoming more diverse and less predictable. In particular, there can easily be less locality of network traffic as services are moved for such reasons as reducing overall power usage (by consolidating VMs and powering off idle machines) or moving a virtual service to a physical server with more capacity or a lower load. In today's changing environments, it is becoming more difficult to engineer networks, as traffic patterns continually shift as VMs move around.

   In summary, both the size and the density of L2 networks are increasing. In addition, increasingly dynamic workloads and the increased use of VMs are creating pressure for ever-larger L2 networks. Today, there are already data centers with over 100,000 physical machines and many times that number of VMs, and these numbers will only increase going forward. In addition, traffic patterns within a data center are constantly changing. Ultimately, the issues described in this document might be observed at any scale, depending on the particular design of the data center. In the next section, we describe a generalized design that allows us to more easily describe the L2 scaling issues.

4.  Generalized Data Center Design

   There are many different ways in which data centers might be designed. The designs are usually engineered to suit the particular application that is being deployed in the data center. For example, a massive web server farm might be engineered in a very different way than a general-purpose multi-tenant cloud hosting service. However, in most cases the designs can be abstracted into a typical three-layer model consisting of an Access Layer, an Aggregation Layer and the Core. The access layer generally refers to the Layer 2 switches that are closest to the physical or virtual servers; the aggregation layer serves to interconnect multiple access-layer devices; and the core switches connect the aggregation switches to the larger network core. Figure 1 shows a generalized data center design, which captures the essential elements of the various alternatives.

            +-----+-----+            +-----+-----+
            |   Core0   |            |   Core1   |        Core
            +-----+-----+            +-----+-----+
              /      \                /       /
             /        \--------------/---\   /
            /          /-------------/    \ /
        +-------+                       +------+
       +/------+|                      +/-----+|
       | Aggr11|+ -------------------- |AggrN1|+          Aggregation Layer
       +---+---+/                      +------+/
          /     \                        /     \
         /       \                      /       \
      +---+     +---+                +---+     +---+
      |T11| ... |T1x|                |TN1| ... |TNy|      Access Layer
      +---+     +---+                +---+     +---+
        |         |                    |         |
      +---+     +---+                +---+     +---+
      |   | ... |   |                |   | ... |   |
      +---+     +---+                +---+     +---+
      |   | ... |   |                |   | ... |   |      Server racks
      +---+     +---+                +---+     +---+
      |   | ... |   |                |   | ... |   |
      +---+     +---+                +---+     +---+

              Figure 1: Typical Layered Architecture in DC

4.1.  Access Layer

   The Access switches provide connectivity directly to/from physical and virtual servers. The access switches may be deployed in either a top-of-rack (ToR) or an end-of-row (EoR) physical configuration.
A server rack may have a single uplink to one access switch, or may have dual uplinks to two different access switches.

4.2.  Aggregation Layer

   In a typical data center, aggregation switches interconnect many ToR switches. Usually, there are multiple parallel aggregation switches serving the same group of ToRs to achieve load sharing. It is no longer uncommon to see aggregation switches interconnecting hundreds of ToR switches in large data centers.

4.3.  Core

   Core switches connect multiple aggregation switches and act as the data center's gateway to external networks or interconnect different sets of racks within one data center.

4.4.  Layer 3 / Layer 2 Topological Variations

4.4.1.  Layer 3 to Access Switches

   In this scenario, the L3 domain is extended all the way to the access switches. Each rack enclosure consists of a single Layer 2 domain, which is confined to the rack. In general, there are no significant ARP/ND scaling issues in this scenario, as the Layer 2 domain cannot grow very large. This topology is ideal for scenarios where the servers attached to a particular access switch generally run applications that are confined to a single subnet, and those applications are not moved (migrated) to other racks that might be attached to different access switches (and different IP subnets). A small server farm or a very static compute cluster might be best served by this design.

4.4.2.  L3 to Aggregation Switches

   When the Layer 3 domain extends only to the aggregation switches, hosts in any of the IP subnets configured on the aggregation switches can be reached via Layer 2 through any access switch, provided the access switches enable all of the VLANs. This topology allows for a great deal of flexibility: servers attached to one access switch can be reloaded with applications using a different IP prefix, and VMs can migrate between racks without IP address changes. The drawback of this design, however, is that multiple VLANs have to be enabled on all access switches and on all ports of the aggregation switches. Even though Layer 2 traffic is still partitioned by VLAN, enabling all VLANs on all ports causes broadcast traffic on every VLAN to traverse all links and ports, which has the same effect as one big Layer 2 domain. In addition, internal traffic might itself have to cross Layer 2 boundaries, resulting in significant ARP/ND load at the aggregation switches. This design provides the best trade-off between flexibility and Layer 2 domain size. A moderately sized data center might use this approach to provide high-availability services at a single location.

4.4.3.  L3 in the Core only

   In some cases where a wider range of VM mobility is desired (i.e., a greater number of racks among which VMs can move without an IP address change), the Layer 3 routed domain might be terminated at the core routers themselves. In this case, VLANs can span multiple groups of aggregation switches, which allows hosts to be moved among a greater number of server racks without an IP address change. This scenario results in the largest ARP/ND performance impact, as explained later. A data center with very rapid workload shifting may consider this kind of design.
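
   The relative ARP/ND exposure of these three variations can be illustrated with a rough back-of-the-envelope calculation. The following sketch (Python, purely illustrative) estimates how many addresses share one broadcast domain under each design; the rack count, servers per rack, VMs per server, and ToR-per-aggregation fan-out are all assumed example values rather than measurements and should be replaced with figures from an actual deployment.

      # Rough estimate of how many addresses share a single broadcast
      # domain under the L3/L2 boundary placements of Sections
      # 4.4.1-4.4.3.  All inputs are illustrative assumptions.

      RACKS            = 500   # racks in the data center (assumption)
      SERVERS_PER_RACK = 40    # physical servers per rack (assumption)
      VMS_PER_SERVER   = 20    # VMs per physical server (assumption)
      TORS_PER_AGGR    = 100   # ToR switches behind one aggregation group

      addrs_per_rack = SERVERS_PER_RACK * (1 + VMS_PER_SERVER)  # hypervisor + VMs

      designs = {
          "4.4.1 L3 at access switches": addrs_per_rack,
          "4.4.2 L3 at aggregation    ": addrs_per_rack * TORS_PER_AGGR,
          "4.4.3 L3 at core only      ": addrs_per_rack * RACKS,
      }

      for name, hosts in designs.items():
          print("%s : ~%d addresses in one broadcast domain" % (name, hosts))

   Under these assumptions, the per-rack design keeps roughly 840 addresses in each broadcast domain, the aggregation design roughly 84,000, and the core-only design roughly 420,000, which is why the ARP/ND impact grows as the Layer 3 boundary moves up in the topology.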

4.4.4.  Overlays

   There are several approaches in which overlay networks can be used to make very large Layer 2 networks scale and to enable mobility. Overlay networks using various Layer 2 or Layer 3 mechanisms allow interior switches/routers to mask host addresses, which can help the data center designer control the size of the L2 domain. However, the overlay edge switches/routers that perform the encapsulation/decapsulation of network addresses must ultimately perform L2 address resolution themselves and could still face scaling issues at that point.

   A potential problem arises in a large data center when a large number of hosts communicate with peers in different subnets: all of these hosts send (and receive) data packets through their respective L2/L3 boundary nodes, since traffic flows are generally bidirectional. This has the potential to further expose any scaling problems. These L2/L3 boundary nodes have to process ARP/ND requests sent from the originating subnets and resolve physical (MAC) addresses in the target subnets for what are generally bidirectional flows. Therefore, for maximum flexibility in managing the data center workload, it is often desirable to use overlays to place related groups of hosts in the same topological subnet, thereby avoiding the L2/L3 boundary translation. The use of overlays in the data center network can be a useful design mechanism for managing a potential bottleneck at the Layer 2 / Layer 3 boundary by redefining where that boundary exists.

4.5.  Factors that Affect Data Center Design

4.5.1.  Traffic Patterns

   Expected traffic patterns play an important role in designing appropriately sized Access, Aggregation and Core networks. Traffic patterns also vary based on the expected use of the data center. Broadly speaking, it is desirable to keep as much traffic as possible at the Access Layer in order to minimize bandwidth usage at the Aggregation Layer. If the expected use of the data center is to serve as a large web server farm, where thousands of nodes are doing similar things and the traffic pattern is largely in and out of the data center, an access layer built with EoR switches might be used, as it minimizes complexity, allows servers and databases to be located in the same Layer 2 domain, and provides maximum density.

   A data center that is expected to host a multi-tenant cloud hosting service might have quite different requirements. In order to isolate inter-customer traffic, smaller Layer 2 domains might be preferred; even though the size of the overall data center might be comparable to the previous example, the multi-tenant nature of the cloud hosting application requires a smaller, more compartmentalized Access Layer. A multi-tenant environment might also require the use of Layer 3 all the way to the Access Layer ToR switch.

   Yet another example of an application with a unique traffic pattern is a high-performance compute cluster, where most of the traffic is expected to stay within the cluster, but at the same time there is a high degree of crosstalk between the nodes. This again calls for a large Access Layer in order to minimize the requirements at the Aggregation Layer.

4.5.2.  Virtualization

   Using virtualization in the data center further increases the possible densities that can be achieved.
Virtualization also further complicates the requirements on the Access Layer, since the access layer design determines the scope over which server migration, or failover of servers after physical hardware failures, can take place.

   Virtualization can also place additional requirements on the aggregation switches in terms of address resolution table size and the scalability of any address-learning protocols that might be used on those switches. The use of virtualization often also requires additional VLANs for high-availability beaconing, which need to span the entire virtualized infrastructure; this requires the Access Layer to span as widely as the virtualized infrastructure.

5.  Address Resolution in IPv4

   In IPv4 over Ethernet, ARP provides the address resolution function. To determine the link-layer address of a given IP address, a node broadcasts an ARP Request. The request is delivered to all portions of the L2 network, and the node with the requested IP address replies with an ARP Response. ARP is an old protocol and, by current standards, is sparsely documented. For example, there are no clear requirements for retransmitting ARP Requests in the absence of replies. Consequently, implementations vary in the details of what they actually implement [RFC0826][RFC1122].

   From a scaling perspective, there are a number of problems with ARP. First, it uses broadcast, and any network with a large number of attached hosts will see a correspondingly large amount of broadcast ARP traffic. The second problem is that it is not feasible to change host implementations of ARP: current implementations are too widely entrenched, and any changes to host implementations of ARP would take years to become sufficiently deployed to matter. That said, it may be possible to change ARP implementations in hypervisors, L2/L3 boundary routers, and/or ToR access switches to leverage techniques such as Proxy ARP [RFC1027] and/or directory-assistance approaches based on OpenFlow [OpenFlow]. Finally, ARP implementations need to take steps to flush out stale or otherwise invalid entries. Unfortunately, existing standards do not provide clear implementation guidelines for how to do this. Consequently, implementations vary significantly, and some implementations are "chatty" in that they simply flush their caches every few minutes and rerun ARP.

6.  Address Resolution in IPv6

   Broadly speaking, from the perspective of address resolution, IPv6's Neighbor Discovery (ND) behaves much like ARP, with a few notable differences. First, ARP uses broadcast, whereas ND uses multicast. Specifically, when querying for a target IP address, ND maps the target address into an IPv6 Solicited-Node multicast address. From an L2 perspective, sending to a multicast rather than a broadcast address may still result in the packet being delivered to all nodes, but most (if not all) nodes will filter out the (unwanted) query via filters installed in the NIC, so that hosts never see such packets. Thus, whereas all nodes must process every ARP query, ND queries are processed only by the nodes for which they are intended.
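
   The practical consequence of this multicast mapping can be illustrated with a short sketch (Python, standard library only; the target address below is an example from the RFC 3849 documentation prefix and is purely hypothetical). It computes the Solicited-Node multicast group for a target address and the derived Ethernet multicast MAC address that a Neighbor Solicitation would be sent to, and contrasts that with the all-ones broadcast MAC used by ARP. A NIC only needs to accept the handful of 33:33:xx:xx:xx:xx addresses its host has joined, which is why unrelated hosts never see the query.

      import ipaddress

      def solicited_node_group(target):
          # Solicited-Node group = ff02::1:ffXX:XXXX, where XX:XXXX are
          # the low-order 24 bits of the target address.
          t = int(ipaddress.IPv6Address(target))
          base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
          return ipaddress.IPv6Address(base | (t & 0xFFFFFF))

      def ipv6_multicast_mac(group):
          # Ethernet MAC for an IPv6 multicast group = 33:33 followed by
          # the low-order 32 bits of the group address.
          low32 = int(group) & 0xFFFFFFFF
          return "33:33:" + ":".join(
              "%02x" % ((low32 >> shift) & 0xFF) for shift in (24, 16, 8, 0))

      target = "2001:db8::a12:3456"   # hypothetical example target
      group = solicited_node_group(target)
      print("NS for %s goes to %s (MAC %s)"
            % (target, group, ipv6_multicast_mac(group)))
      print("An ARP Request for the same host would go to ff:ff:ff:ff:ff:ff")

   For this example target, the Neighbor Solicitation is sent to ff02::1:ff12:3456, which maps to the Ethernet multicast address 33:33:ff:12:34:56, whereas an equivalent ARP Request would be broadcast to every host on the link.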

7.  Problem Itemization

   This section articulates some specific problems, or "pain points", related to large data centers. It is a future activity to determine which of these areas can or will be addressed by ARMD or some other IETF WG.

7.1.  ARP Processing on Routers

   One pain point with large L2 broadcast domains is that the routers connected to the L2 domain need to process "a lot of" ARP traffic. Even though the vast majority of ARP traffic may well not be aimed at a given router, the router still has to process enough of each ARP request to determine whether it can safely be ignored. The ARP algorithm specifies that a recipient must update its ARP cache if it receives an ARP query from a source for which it has an entry [RFC0826].

   One common router implementation architecture has ARP processing handled by a "slow path" software processor, rather than directly by a hardware ASIC as is the case when forwarding packets. Such a design significantly limits the rate at which ARP traffic can be processed. Current implementations can support on the order of a few thousand ARP packets per second, which is several orders of magnitude lower than the rate at which ASICs can forward packets.

   To further reduce the ARP load, some routers have implemented additional optimizations in their ASIC fast paths. For example, some routers can be configured to discard ARP requests for target addresses other than those assigned to the router. That way, the router's software processor only receives ARP requests for addresses it owns and must respond to. This can significantly reduce the number of ARP requests that the router must process.

   Another optimization concerns reducing the number of ARP queries targeted at routers, whether for address resolution or to validate existing cache entries. Some routers can be configured to send out periodic gratuitous ARPs. Upon receipt of a gratuitous ARP, implementations mark the associated entry as "fresh", resetting the revalidation timer to its maximum setting. Consequently, sending out periodic gratuitous ARPs can effectively prevent nodes from needing to send ARP requests intended to revalidate stale entries for a router. The net result is an overall reduction in the number of ARP queries routers receive. Gratuitous ARPs can also pre-populate the ARP caches on neighboring devices, further reducing ARP traffic.

   Finally, another area concerns how routers process IP packets for which no ARP entry exists. Such packets must be held in a queue while address resolution is performed; once the ARP query has been resolved, the packets are forwarded on. Again, the processing of such packets is handled in the "slow path", which effectively limits the rate at which a router can process ARP "cache misses" and is viewed as a problem in some deployments today. Additionally, if no response is received, the router has to retransmit the ARP/ND query multiple times, and if no response is received after a number of ARP/ND requests, it must drop all of the queued data packets. This process can be CPU intensive.

   Although address resolution traffic remains local to one L2 network, some data center designs terminate L2 subnets at individual aggregation switches/routers (e.g., see Section 4.4.2). Such routers can be connected to a large number of interfaces (e.g., 100 or more). While the address resolution traffic on any one interface may be manageable, the aggregate address resolution traffic across all interfaces can become problematic.
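
   The aggregate load is easy to underestimate. The following sketch (Python; every input is an assumed, illustrative value rather than a measurement) multiplies a modest per-host ARP rate by the number of hosts behind each interface and compares the total against a slow-path budget in the low thousands of packets per second, as described above.

      # Back-of-the-envelope ARP load at an L2/L3 boundary router.
      # All inputs are illustrative assumptions.

      INTERFACES       = 100    # subnets/VLANs terminated on the router
      HOSTS_PER_SUBNET = 1000   # hosts (hypervisors + VMs) per subnet
      ARP_PER_HOST_PPS = 0.05   # one ARP request per host every ~20 seconds
      SLOW_PATH_BUDGET = 2000   # ARP packets/sec the CPU path can handle

      per_interface = HOSTS_PER_SUBNET * ARP_PER_HOST_PPS
      aggregate     = per_interface * INTERFACES

      print("Per-interface ARP load: %.0f packets/sec" % per_interface)
      print("Aggregate ARP load    : %.0f packets/sec" % aggregate)
      print("Exceeds slow-path budget of %d: %s"
            % (SLOW_PATH_BUDGET, aggregate > SLOW_PATH_BUDGET))

   Even though each interface sees only about 50 ARP requests per second in this example, the aggregate of roughly 5,000 packets per second exceeds the assumed slow-path budget, which is exactly the situation described above.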

   Another variant of the above issue has individual routers servicing a relatively small number of interfaces, with the individual interfaces themselves serving very large subnets. Once again, it is the aggregate quantity of ARP traffic seen across all of the router's interfaces that can be problematic. This "pain point" is essentially the same as the one discussed above, the only difference being whether a given number of hosts is spread across a few large IP subnets or many smaller ones.

   When hosts in two different subnets under the same L2/L3 boundary router need to communicate with each other, the L2/L3 router not only has to initiate ARP/ND requests on the target's subnet, it also has to process the ARP/ND requests from the originating subnet. This further adds to the overall ARP processing load.

7.2.  IPv6 Neighbor Discovery

   Though IPv6's Neighbor Discovery behaves much like ARP, there are several notable differences that result in a different set of potential issues. From an L2 perspective, there is the simple difference of sending to a multicast rather than a broadcast address, which results in ND queries being processed only by the nodes for which they are intended.

   Another key difference concerns revalidating stale ND entries. ND requires that nodes periodically revalidate any entries they are using, to ensure that bad entries are timed out quickly enough that TCP does not terminate a connection. Consequently, some implementations will send out "probe" ND queries to validate in-use ND entries as frequently as every 35 seconds [RFC4861]. Such probes are sent via unicast (unlike in the case of ARP). However, on larger networks, these probes can result in routers receiving many such queries. Unfortunately, the IPv4 mitigation technique of sending gratuitous ARPs does not work in IPv6: the ND specification explicitly states that gratuitous ND "updates" cannot cause an ND entry to be marked "valid". Rather, such entries are marked "probe", which causes the receiving node to (eventually) generate a probe back to the sender -- in this case precisely the behavior that the router is trying to prevent!

   It should be noted that ND does not require the sending of probes in all cases. Section 7.3.1 of [RFC4861] describes a technique whereby hints from TCP can be used to verify that an existing ND entry is working fine and does not need to be revalidated.

7.3.  MAC Address Table Size Limitations in Switches

   L2 switches maintain L2 MAC address forwarding tables for all sources and destinations whose traffic traverses the switch. These tables are populated through learning and are used to forward L2 frames to their correct destinations. The larger the L2 domain, the larger the tables have to be. While in theory a switch only needs to keep track of addresses it is actively using, switches flood broadcast frames (e.g., from ARP), multicast frames (e.g., from Neighbor Discovery) and unicast frames sent to unknown destinations, and they add entries for the source addresses of such flooded frames to their forwarding tables. Consequently, MAC address table size can become a problem as the size of the L2 domain increases. The table size problem is made worse with VMs, where a single physical machine now hosts ten (or more) VMs, each of which has its own MAC address that is visible to switches.
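
   To see how quickly learned MAC entries accumulate, consider the following sketch (Python; the fan-out, VM density, and table capacity are all assumed, illustrative figures, since actual switch table sizes vary widely by product). It estimates the number of MAC addresses visible within one L2 domain in an aggregation-layer design such as that of Section 4.4.2 and compares the result with a hypothetical forwarding-table capacity.

      # Estimate of MAC forwarding-table pressure within one L2 domain.
      # All inputs are illustrative assumptions, not vendor data.

      TORS_IN_L2_DOMAIN = 100     # ToR switches behind one aggregation pair
      SERVERS_PER_TOR   = 40      # physical servers per ToR
      VMS_PER_SERVER    = 20      # "ten (or more)" VMs per physical machine
      MAC_TABLE_SIZE    = 64000   # hypothetical switch MAC table capacity

      macs_per_server = 1 + VMS_PER_SERVER   # hypervisor plus its VMs
      visible_macs = TORS_IN_L2_DOMAIN * SERVERS_PER_TOR * macs_per_server

      print("MAC addresses visible in the L2 domain: %d" % visible_macs)
      print("Hypothetical MAC table capacity       : %d" % MAC_TABLE_SIZE)
      print("Table exhausted                       : %s"
            % (visible_macs > MAC_TABLE_SIZE))

   With these assumptions, roughly 84,000 addresses would need to be learned, exceeding the assumed 64,000-entry table; once the table overflows, frames for addresses that cannot be learned are flooded as unknown unicast, compounding the traffic problem.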

   When Layer 3 extends all the way to access switches (see Section 4.4.1), the size of the MAC address tables in switches is not generally a problem. When Layer 3 extends only to aggregation switches (see Section 4.4.2), however, MAC table size limitations can be a real issue.

8.  Summary

   This document has outlined a number of problems or issues related to address resolution in large data centers. In particular, we have described the scenarios in which such issues might arise, what the potential issues are, and the fundamental factors that cause them. It is hoped that describing specific pain points will facilitate a discussion of whether and how best to address them.

9.  Acknowledgments

   This document has been significantly improved by comments from Benson Schliesser, Linda Dunbar and Sue Hares. Igor Gashinsky deserves additional credit for highlighting some of the ARP-related pain points and for clarifying the difference between what the standards require and what some router vendors have actually implemented in response to operator requests.

10.  IANA Considerations

   This document makes no request of IANA.

11.  Security Considerations

   This document lists existing problems and pain points related to address resolution in data centers; it does not create any new security implications. The security vulnerabilities in ARP are well known, and this document does not change or mitigate them in any way.

12.  Change Log

12.1.  Changes from -01

   1.  Wordsmithing and editorial improvements.

12.2.  Changes from -00

   1.  Merged draft-karir-armd-datacenter-reference-arch-00.txt into this document.

   2.  Added a section explaining how ND differs from ARP and the implications for address resolution "pain".

13.  Informative References

   [DATA1]    Cisco Systems, "Data Center Design - IP Infrastructure", October 2009.

   [DATA2]    Juniper Networks, "Government Data Center Network Reference Architecture", 2010.

   [OpenFlow] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and J. Turner, "OpenFlow: Enabling Innovation in Campus Networks", March 2008.

   [RFC0826]  Plummer, D., "Ethernet Address Resolution Protocol: Or converting network protocol addresses to 48.bit Ethernet address for transmission on Ethernet hardware", STD 37, RFC 826, November 1982.

   [RFC1027]  Carl-Mitchell, S. and J. Quarterman, "Using ARP to implement transparent subnet gateways", RFC 1027, October 1987.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman, "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, September 2007.

   [STUDY]    Rees, J. and M. Karir, "ARP Traffic Study", NANOG 52, June 2011, http://www.nanog.org/meetings/nanog52/presentations/Tuesday/Karir-4-ARP-Study-Merit Network.pdf.

Authors' Addresses

   Thomas Narten
   IBM

   Email: narten@us.ibm.com

   Manish Karir
   Merit Network Inc.

   Email: mkarir@merit.edu

   Ian Foo
   Huawei Technologies

   Email: Ian.Foo@huawei.com