Working Group: ARMD                                        Himanshu Shah
Intended Status: Informational                                Ciena Corp
Internet Draft
                                                          Anoop Ghanwani
Expiration Date: April 27, 2012                                  Brocade

                                                             Nabil Bitar
                                                                 Verizon

                                                        October 28, 2011


             ARP Broadcast Reduction for Large Data Centers
                  draft-shah-armd-arp-reduction-02.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 27, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

Internet Draft        draft-shah-arp-reduction-02.txt

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   With the advent of server virtualization technologies, a host is
   able to support multiple Virtual Machines (VMs) in a single physical
   machine. Data centers can leverage these capabilities to instantiate
   on the order of 10s to 100s of VMs in a single server with current
   technology. It is conceivable that this number can be much higher
   in the future. Each VM operates as an independent IP host with a set
   of Virtual Network Interface Cards (vNICs), each having its own MAC
   address and mapping to a physical Ethernet interface. These physical
   servers are typically installed in a rack with their Ethernet
   interfaces connected to a Top-of-Rack (ToR) switch. The ToR
   switches are interconnected through End-of-Row (EoR) or
   aggregation switches, which are in turn connected to core switches.

   As discussed in [ARP-Problem], the host VMs use ARP broadcasts to
   find other host VMs and use periodic (broadcast) Gratuitous ARPs to
   refresh their IP to MAC address bindings in other VM hosts. Such
   broadcasts in a large data center with potentially thousands of VM
   hosts in a Layer 2 based topology can overwhelm the network.

   This memo proposes mechanisms to reduce the number of broadcasts
   that are sent throughout the network. This is done by having the
   ToRs intelligently process ARP frames, rather than simply
   broadcasting them throughout the broadcast domain.

   While this document addresses ARP, the Neighbor Discovery mechanisms
   used by IPv6 hosts, which make use of multicast rather than
   broadcast, pose similar issues in the data center. The solutions
   defined herein should be equally applicable to hosts running IPv6.
   The details will be specified in a subsequent revision.

Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC 2119].

Table of Contents

   Copyright Notice
   Abstract
   1.0 Overview
      1.1 Terminology
   2.0 Configuration
   3.0 Building the ARP tables
      3.1 ARP Requests
      3.2 ARP Reply
      3.3 Gratuitous ARP
      3.4 Host movement
      3.5 Applicability to environments with overlay transport
      3.6 Scaling Considerations
      3.7 Miscellaneous Issues
   4.0 Conclusion
   5.0 Security Considerations
   6.0 Acknowledgments
   7.0 References
      7.1 Normative References
      7.2 Informative References
   8.0 Author's Address

1.0 Overview

   The traditional topology in a data center consists of racks of
   servers connected to top-of-rack (ToR) switches, which connect to
   aggregation switches, which in turn connect to core switches. The
   network architecture typically combines Layer 2 and Layer 3. In
   some architectures, Layer 2 is terminated at the ToR, with Layer 3
   being run in the aggregation and core devices. In other
   architectures, Layer 2 may be extended all the way to the
   aggregation switch. The primary concerns that have influenced
   network architectures in the data center have been keeping broadcast
   domains manageable and spanning tree domains contained.

   Moving forward, these traditional network architectures are being
   challenged by emerging technologies such as server virtualization.

   Server virtualization brings some challenges to the data center.
   Because of virtualization, the number of hosts that the network
   sees increases dramatically, to 10 to 100 times the number of
   physical servers. These virtual hosts are referred to as Virtual
   Machines (VMs). VMs offer server mobility, wherein a VM can be
   relocated to run on a different physical server. In order for the
   mobility to be non-disruptive to other hosts that have communication
   in progress with the VM being moved, the VM must retain its MAC
   address and IP address.
   Because of the requirement to retain the MAC and IP address, it is
   desirable to develop network architectures that place the fewest
   restrictions on server mobility.

   As an example, in a network architecture where ToR switches
   terminate the L2 domain, the range of mobility would be restricted
   to a single ToR switch. It would be preferable to allow the
   flexibility of moving the VM anywhere within the data center, or
   perhaps even to a different data center.

   Technologies such as TRILL [TRILL] overcome some of the spanning
   tree issues by which traditional Layer 2 topologies have been
   constrained. However, virtualization introduces two specific
   problems with respect to broadcast traffic.
   1. A larger number of hosts. A single physical server now hosts
      multiple virtual machines, taking the scale factor to a
      different level. If each VM generates the same number of
      broadcasts as a physical server, the amount of broadcast traffic
      increases by a factor of 10 to more than 100.
   2. If the Layer 2 domains are extended across data centers, then
      broadcast traffic will now cross the backbone. If Layer 2 were
      terminated at the ToR switch, the increase in broadcast traffic
      would have been restricted to a single ToR switch, but as
      discussed earlier, this restriction is not desirable.

   Broadcasts in Layer 2 networks therefore have far-reaching impacts:
   they waste network bandwidth as well as the CPU resources that all
   the VMs spend processing superfluous ARP broadcasts (IPv6 avoids
   the latter by running ND as a multicast service rather than a
   broadcast service).

   The solution presented here attempts to minimize the negative
   effects of ARP broadcasts.
   The solution requires the first hop Ethernet switches, typically
   ToRs, to maintain an ARP table learned from the ARP PDUs received by
   the switch and to selectively propagate the ARP PDUs to, or
   proxy-respond on behalf of, the remote peer. These types of ARP
   processing principles are well known and are used and described in
   L2VPN Working Group documents such as [ARP-Mediation] and [IPLS].
   The ARP proxy response differs from that described in [RFC1027] in
   that the ARP response contains the MAC address of the destination
   and not that of the switch, as is suggested in [RFC1027].

   The following sections describe the details of ARP snooping,
   learning and maintaining ARP tables, and using the learned
   information to limit broadcast propagation and to respond by proxy
   on behalf of the remote peers.

1.1 Terminology

   ToR switch     Top-of-Rack switch. An Ethernet switch installed
                  at the top of a rack of servers which provides
                  network connectivity to those servers.

   Downlink       The Ethernet link between the ToR switch and a
                  directly connected host/server in the rack.

   Uplink         The network-facing Ethernet connection in the
                  ToR switch. Typically, the uplinks from ToRs
                  connect to end-of-row or aggregation switches.

   EoR switch     End-of-Row switch. An Ethernet switch which
                  aggregates traffic from multiple racks. Also
                  commonly referred to as an aggregation switch.
                  Uplinks from the ToR connect to EoR switches,
                  and uplinks from EoR switches in turn connect
                  to core switches.

   Host/Server    A host or server running the IP protocol. This
                  could be a physical entity or a logical entity
                  (such as a Virtual Machine) in a physical host.
                  The term server refers to its role in the data
                  center. Both terms are used interchangeably
                  and refer to an IP end station.

   Local hosts    Used in the context of a ToR switch to denote
                  the VM hosts connected to a ToR switch on the
                  downlink, i.e. directly connected hosts.

   Remote hosts   Used in the context of a ToR switch to denote
                  the hosts that are accessible via the uplink of
                  the ToR switch.

   VM             Virtual Machine. This is a logical instance of
                  a host that operates independently in a
                  physical host and has its own IP and MAC
                  addresses. The VM architecture allows efficient
                  use of physical host resources (such as
                  multiple CPU cores).

2.0 Configuration

   It is assumed that the ARP reduction methodologies defined in this
   document will be limited to ToR switches. The maximum benefit of
   restraining ARP broadcasts in the network is achieved by the first
   hop switches (the ones directly connected to the hosts), without
   placing additional burden on second or third tier switches.

   First, the ToR switches need to be configured to enable the ARP
   reduction feature. Every Ethernet interface needs to be identified
   as either a downlink or an uplink within the context of this
   feature. The ARP reduction feature treats ARP frames received from
   a downlink differently from those received from an uplink, as
   described in the following sections.

   In addition, the operator may optionally configure various ARP
   reduction related parameters such as:
   . ARP aging timer,
   . size of the ARP table,
   . static entries of IP to MAC address, etc.

3.0 Building the ARP tables

   When ARP reduction is enabled, the ToR switch will monitor all ARP
   traffic transiting the switch (regardless of uplink port or downlink
   port) and will process any ARP PDUs in the following manner:
   . ARP Request PDUs must be redirected to the control plane CPU.
   . Gratuitous ARP PDUs (ARP Reply PDUs with a broadcast MAC DA)
     must be redirected to the control plane CPU.
   . Other ARP Reply PDUs (ARP Reply PDUs with a unicast MAC DA)
     should be bicast: one copy is sent to the control plane CPU and
     the other copy is forwarded normally.

3.1 ARP Requests

   The ToR examines the source IP address and the source hardware
   address (MAC address) in the ARP Request. The source IP and MAC
   address association is learned, or is updated/refreshed if already
   learned. The destination IP address is then looked up in the ARP
   table. If an entry exists, the associated MAC address from the
   table is used to prepare a unicast ARP Reply PDU. The same MAC
   address is used as the source MAC address in the MAC header and as
   the hardware address of the requested host in the body of the
   unicast ARP Reply PDU.

   If the destination IP address in the request is not present in the
   ARP table, then the original ARP Request PDU is broadcast to all the
   switch ports that are members of the same VLAN, except the source
   port on which the Request was received. However, if the requested
   (destination) IP address is present in the ARP table, a unicast ARP
   Reply PDU is prepared as described above and sent to the switch port
   from which the ARP Request was received, and the original ARP
   Request PDU is dropped.

   The intent is to prevent propagation of ARP Request PDU broadcasts
   as much as possible using the information present in the ARP table.
   The following observations can be made from such behavior.
   . Most of the ARP requests from the local hosts of a ToR switch
     for the local hosts of the ToR switch can be prevented.
   . Most of the ARP requests from the remote hosts of a ToR switch
     for the local hosts of the ToR switch can be prevented from
     getting forwarded on downlinks or other uplinks of the ToR
     switch.
   . Many of the ARP requests from the local hosts of a ToR switch
     for the remote hosts of the ToR switch can be prevented from
     being forwarded on uplinks if the remote host IP to MAC
     association is known to the ToR switch.

3.2 ARP Reply

   The unicast ARP Reply is examined to learn/update the ARP table for
   the source and destination IP/MAC address associations, but is also
   forwarded as a normal frame.

3.3 Gratuitous ARP

   Gratuitous ARP is a broadcast ARP Reply PDU with the destination IP
   address set to the IP address of the sender and the target hardware
   address set to the MAC address of the sender. It is typically used
   by IP hosts (including VMs) to keep their associations fresh in
   their peers' ARP caches.

   The ToR switch should process Gratuitous ARP in the following
   manner.
   . Learn/update/refresh the ARP table entry.
   . If the IP address is new, or exists but with a different
     hardware address, then the Gratuitous ARP PDU is forwarded;
     otherwise the PDU is discarded.

   The goal in handling the Gratuitous ARP PDUs received from the
   downlinks (i.e. local hosts) is to avoid propagating them into the
   'network' (i.e. to uplinks), unless there is a new association.

   When the propagation of Gratuitous ARP PDUs is suppressed, the peer
   IP hosts will end up aging out the corresponding ARP table entries.
   This will result in the generation of broadcast ARP Requests by
   those IP hosts if they need to continue to communicate with the IP
   host whose Gratuitous ARPs were suppressed. With the ARP Request
   handling described above, the first hop ToR switch will be able to
   respond to such a request based on the ARP cache maintained in the
   ToR switch.
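
   The Request, Reply, and Gratuitous ARP handling described in
   Sections 3.1 through 3.3 can be sketched as follows. This is a
   minimal illustration under stated assumptions, not an implementation
   defined by this document: the names ArpPdu, ArpReducer, and
   age_limit, the action strings, and the 600 second default timer are
   all hypothetical. For brevity the sketch learns only the sender
   binding of each PDU, keeps one table per VLAN, and names the PDU
   fields as in RFC 826, where a reply carries the resolved address in
   its sender fields.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


@dataclass
class ArpPdu:
    """Simplified ARP PDU; field names are illustrative assumptions."""
    op: str          # "request" or "reply"
    sender_ip: str
    sender_mac: str
    target_ip: str
    eth_dst: str     # destination MAC address in the Ethernet header


class ArpReducer:
    """Hypothetical per-VLAN ARP table kept in a ToR control plane."""

    def __init__(self, age_limit: float = 600.0) -> None:
        self.age_limit = age_limit  # assumed aging timer, in seconds
        self.table: Dict[str, Tuple[str, float]] = {}  # ip -> (mac, last_seen)

    def lookup(self, ip: str, now: float) -> Optional[str]:
        """Return the MAC bound to ip, or None if unknown or aged out."""
        entry = self.table.get(ip)
        if entry is None:
            return None
        mac, last_seen = entry
        if now - last_seen > self.age_limit:
            del self.table[ip]
            return None
        return mac

    def learn(self, ip: str, mac: str, now: float) -> bool:
        """Learn or refresh a binding; True if it is new or changed."""
        changed = self.lookup(ip, now) != mac
        self.table[ip] = (mac, now)
        return changed

    def handle(self, pdu: ArpPdu, now: float) -> Tuple[str, Optional[ArpPdu]]:
        """Return (action, proxy_reply) for one received ARP PDU."""
        changed = self.learn(pdu.sender_ip, pdu.sender_mac, now)
        if pdu.op == "reply":
            if pdu.eth_dst == BROADCAST_MAC:
                # Gratuitous ARP (3.3): propagate only new associations.
                return ("forward" if changed else "discard", None)
            # Unicast ARP Reply (3.2): learn, then forward normally.
            return ("forward", None)
        # ARP Request (3.1): answer from the table when possible.
        mac = self.lookup(pdu.target_ip, now)
        if mac is None:
            return ("flood", None)  # broadcast within the VLAN
        reply = ArpPdu("reply", pdu.target_ip, mac,
                       pdu.sender_ip, pdu.sender_mac)
        return ("proxy_reply", reply)  # the original request is dropped
```

   In this sketch a caller would send the returned proxy reply out of
   the port on which the request arrived, and would apply the "flood"
   action to every other port in the same VLAN.
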
   In essence, the presence of large ARP tables with longer age-out
   times in the ToR switch compensates for the smaller ARP tables
   present in the IP hosts and eliminates the need for periodic use of
   Gratuitous ARPs to refresh the ARP tables in the IP hosts.

3.4 Host movement

   As mentioned earlier, server virtualization technology allows
   movement of VMs to different physical servers. The flexibility to
   move VMs is one of the key benefits of server virtualization. The
   VM movement could be manual (operator initiated) or may be done
   automatically in reaction to demands placed by the application
   users. The important point is that in either case, VM movement is
   not transparent and is made known to the network.

   There is ongoing work in the IEEE 802.1 standards organization (IEEE
   802.1Qbg) to coordinate/communicate the presence and capabilities of
   the VMs to the directly connected network switch.

   VMs typically retain their MAC and IP addresses, and as such, there
   would be little impact on the ARP table maintained by the ARP
   reduction mechanism described herein. However, the ARP reduction
   mechanism would benefit from knowing if a VM is completely
   decommissioned so that the ToR can remove the ARP entry it has for
   that VM in a timely fashion, rather than waiting for it to time out.

3.5 Applicability to environments with overlay transport

   Recently, there have been multiple proposals for using overlay
   transport technologies such as VXLAN [VXLAN] and NVGRE [NVGRE].
   These proposals allow the network operator to build the network
   using L2 or L3 technologies while building an L2 overlay on top of
   that.
   As such, while they address the issue of network design, they do not
   eliminate the need for a mechanism to reduce the amount of broadcast
   traffic that may have to traverse the core if there are VMs of the
   same tenant on servers attached to different ToR switches.

   One of the ways for the overlay transport proposals to address this
   issue would be to implement the mechanism discussed in this document
   at the point where the overlay encapsulation and decapsulation is
   performed (i.e. in the virtual switch).

3.6 Scaling Considerations

   Depending on the number of hosts in the network, the ARP table can
   be quite large. Although it is possible to implement some of the
   mechanisms for ARP reduction described in this document in hardware
   in the forwarding plane, the number of ARP entries may favor
   maintaining the ARP table in control plane memory.

3.7 Miscellaneous Issues

   Because of the distributed nature of the mechanisms described
   herein, there are a few additional issues that warrant consideration
   by the network operator.

   Section 2.0 mentioned the configuration of an aging timer for ARP
   entries. A longer timer for holding on to ARP entries helps reduce
   broadcasts. However, too large a timer can cause problems in certain
   situations. Consider the following scenario. Host A is attached to
   ToR switch #1, and host B is attached to ToR switch #2. If host B
   issues an ARP Request for host A and the entry is available at
   switch #2, then switch #2 sends the ARP Reply on behalf of host A.
   It is possible that host A is no longer available, but there is no
   way for switch #2 to know this, and it would continue to respond on
   behalf of host A until its entry for host A has timed out. In this
   case, it is easy to see that a smaller timer would be beneficial.
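
   The effect of the aging timer in this scenario can be illustrated
   with a small calculation. The sketch below is only an illustration
   under assumed values: the function name stale_reply_window and the
   example times are hypothetical, and the model assumes that switch #2
   removes the entry exactly age_limit seconds after its last refresh
   and that host A departs no earlier than that refresh.

```python
def stale_reply_window(last_refresh: float, departed_at: float,
                       age_limit: float) -> float:
    """Seconds during which switch #2 may still answer for a departed host.

    Assumes the entry was last refreshed at last_refresh, the host left
    at departed_at (not earlier than last_refresh), and the switch ages
    the entry out age_limit seconds after the last refresh.
    """
    expiry = last_refresh + age_limit
    return max(0.0, expiry - departed_at)


# Host A refreshed its binding at t=0 s and left the network at t=60 s.
for age_limit in (300.0, 1200.0):
    window = stale_reply_window(0.0, 60.0, age_limit)
    print(f"aging timer {age_limit:.0f} s: "
          f"stale replies possible for up to {window:.0f} s")
```

   With these assumed values, a 300 second timer leaves a 240 second
   window of stale proxy replies, while a 1200 second timer leaves a
   1140 second window.
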
   Additionally, because host B maintains its own ARP age timer, host B
   would find out about host A's unavailability only after its own
   entry has aged out, and then only once the entry has also aged out
   of switch #2.

   Another issue that can be somewhat problematic is the inconsistency
   of tables across switches. Once again, consider a scenario similar
   to the one described above, with two hosts each connected to its
   respective ToR switch. Let the ARP entries for both A and B be
   learned by both switches. Now assume that the IP address on host A
   changes. This change is signaled to switch #1, which in turn
   broadcasts the message on its uplink. If this message is discarded
   due to network congestion or signal integrity issues, then switch #2
   will not learn about the change and will continue to respond to host
   B's ARP Requests for host A's old IP address with stale information.
   This lasts until the ARP entry for A times out at switch #2.

4.0 Conclusion

   Based on the procedures described in this document, it is possible
   for ToR switches in the data center to contain ARP broadcasts
   significantly. The solution is based on well known, non-intrusive
   procedures and strives to curtail broadcasts that are increasingly
   becoming a cause for concern in data centers. In essence, ToR
   switches offload the extended ARP table management from the IP hosts
   to themselves. The ARP table timeout can be tuned higher by the
   operator based on the available switch resources and network traffic
   behavior. A larger ARP table capacity directly translates to more
   effective suppression of ARP broadcasts.

5.0 Security Considerations

   The details of the security aspects will be addressed in a future
   revision.

6.0 Acknowledgments

   This document resulted from discussions with Linda Dunbar (Huawei),
   Sue Hares (Huawei), and T Sridhar (VMware). We would like to
   acknowledge their contribution to this work.

7.0 References

7.1 Normative References

   [ARP] D. Plummer, "An Ethernet Address Resolution Protocol: Or
      Converting Network Protocol Addresses to 48.bit Ethernet
      Addresses for Transmission on Ethernet Hardware," RFC 826, STD
      37.

   [ARP-Problem] T. Narten, "Problem Statement for ARMD,"
      work in progress.

7.2 Informative References

   [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking
      in Layer 2 VPN," work in progress.

   [IPLS] H. Shah et al., "IP-only LAN service," work in progress.

   [PROXY-ARP] J. Postel, "Multi-LAN Address Resolution," RFC 925.

   [RFC1027] Smoot et al., "Using ARP to Implement Transparent Subnet
      Gateways," RFC 1027.

   [VXLAN] M. Mahalingam et al., "VXLAN: A Framework for Overlaying
      Virtualized Layer 2 Networks over Layer 3 Networks," work in
      progress.

   [NVGRE] M. Sridharan et al., "NVGRE: Network Virtualization using
      Generic Routing Encapsulation," work in progress.

8.0 Author's Address

   Himanshu Shah
   Ciena Corp
   Email: hshah@ciena.com

   Anoop Ghanwani
   Brocade
   Email: anoop@alumni.duke.edu

   Nabil Bitar
   Verizon
   Email: nabil.n.bitar@verizon.com