Working Group: ARMD                                    Himanshu Shah
Intended Status: Informational                         Ciena Corp
Internet Draft
                                                       Anoop Ghanwani
Expiration Date: April 27, 2012                        Brocade

                                                       Nabil Bitar
                                                       Verizon

                                                       October 28, 2011

ARP Broadcast Reduction for Large Data Centers
draft-shah-armd-arp-reduction-02.txt

Status of this Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on April 27, 2012.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

Internet Draft draft-shah-arp-reduction-02.txt

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Abstract

With the advent of server virtualization technologies, a host is able to support multiple Virtual Machines (VMs) in a single physical machine. Data centers can leverage these capabilities to instantiate on the order of 10s to 100s of VMs in a single server with current technology. It is conceivable that this number can be much higher in the future. Each VM operates as an independent IP host with a set of Virtual Network Interface Cards (vNICs), each having its own MAC address and mapping to a physical Ethernet interface. These physical servers are typically installed in a rack with their Ethernet interfaces connected to a top-of-the-rack (ToR) switch. The ToR switches are interconnected through End-of-the-Row (EoR) or aggregation switches, which are in turn connected to core switches.

As discussed in [ARP-Problem], the host VMs use ARP broadcasts to find other host VMs and use periodic (broadcast) Gratuitous ARPs to refresh their IP to MAC address binding in other VM hosts. Such broadcasts in a large data center with potentially thousands of VM hosts in a Layer 2 based topology can overwhelm the network.

This memo proposes mechanisms to reduce the number of broadcasts that are sent throughout the network. This is done by having the ToRs intelligently process ARP frames, rather than simply broadcasting them throughout the broadcast domain.

While this document addresses ARP, the Neighbor Discovery mechanisms used by IPv6 hosts, which make use of multicast rather than broadcast, pose similar issues in the data center. The solutions defined herein should be equally applicable to hosts running IPv6. The details will be specified in a subsequent revision.

Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].
Table of Contents

   Copyright Notice
   Abstract
   1.0 Overview
   1.1 Terminology
   2.0 Configuration
   3.0 Building the ARP tables
   3.1 ARP Requests
   3.2 ARP Reply
   3.3 Gratuitous ARP
   3.4 Uplink Versus Downlink Processing
   3.5 Host movement
   4.0 Conclusion
   5.0 Security Considerations
   6.0 Acknowledgments
   7.0 References
   7.1 Normative References
   7.2 Informative References
   8.0 Author's Address

1.0 Overview

The traditional topology in a data center consists of racks of servers connected to top-of-rack (ToR) switches, which connect to aggregation switches, which in turn connect to core switches. The network architecture typically combines Layer 2 and Layer 3. In some architectures, Layer 2 is terminated at the ToR, with Layer 3 being run in the aggregation and core devices. In other architectures, Layer 2 may be extended all the way to the aggregation switch. The primary concerns that have influenced network architectures in the data center have been keeping broadcast domains manageable and spanning tree domains contained. Moving forward, these traditional network architectures are being challenged by emerging technologies such as server virtualization.

Server virtualization brings some challenges to the data center. Because of virtualization, the number of hosts that the network sees increases dramatically: 10 to 100 times the number of physical servers. These virtual hosts are referred to as Virtual Machines (VMs). VMs offer server mobility, wherein a VM can be relocated to run on a different physical server. In order for the mobility to be non-disruptive to other hosts that have communication in progress with the VM being moved, the VM must retain its MAC address and IP address. Because of the requirement to retain the MAC and IP address, it is desirable to develop network architectures that place the fewest restrictions on server mobility.
As an example, in a network architecture where ToR switches terminate the L2 domain, the range of mobility would be restricted to a single ToR switch. It would be preferable to allow the flexibility of moving the VM anywhere within the data center, or perhaps even to a different data center. Technologies such as TRILL [TRILL] overcome some of the spanning tree issues because of which traditional Layer 2 topologies have been constrained.

However, virtualization introduces 2 specific problems with respect to broadcast traffic.

1. A larger number of hosts. A single physical server now hosts multiple virtual machines, taking the scale factor to a different level. If each VM issues the same number of broadcasts as a physical server, the amount of broadcast traffic increases by a factor of 10 to more than 100.

2. If the Layer 2 domains are extended to go across data centers, then broadcast traffic will now go across the backbone.

If Layer 2 were terminated at the ToR switch, the increase in broadcast traffic would be restricted to a single ToR switch, but as discussed earlier, this restriction is not desirable. Broadcast traffic in Layer 2 networks has far-reaching impacts: it wastes network bandwidth, and it wastes the CPU resources that all the VMs spend processing superfluous ARP broadcasts (IPv6 gets rid of the latter by running ND as a multicast service rather than a broadcast service).

The solution presented here attempts to minimize the negative effects of ARP broadcasts. The solution requires the first hop Ethernet switch, typically the ToR, to maintain an ARP table learned from the ARP PDUs received by the switch. The switch then selectively propagates the ARP PDU to, or proxy-responds on behalf of, the remote peer.
These types of ARP processing principles are well known and are used and described in L2VPN Working Group documents such as [ARP-Mediation] and [IPLS]. The ARP proxy response differs from that described in [RFC1027]: the ARP response contains the MAC address of the destination, not that of the switch as suggested in [RFC1027].

The following sections describe the details of ARP snooping, learning and maintaining ARP tables, using the learned information to limit broadcast propagation, and proxy responding on behalf of the remote peers.

1.1 Terminology

ToR switch
   Top-of-Rack switch. An Ethernet switch installed at the top of a rack of servers which provides network connectivity to those servers.

Downlink
   The Ethernet link between the ToR switch and a directly connected host/server in the rack.

Uplink
   The network-facing Ethernet connection in the ToR switch. Typically, the uplinks from ToRs connect to end-of-row or aggregation switches.

EoR switch
   End-of-Row switch. An Ethernet switch which aggregates traffic from multiple racks. Also commonly referred to as an aggregation switch. Uplinks from the ToR switches connect to EoR switches, and uplinks from EoR switches in turn connect to core switches.

Host/Server
   A host or server running the IP protocol. This could be a physical entity or a logical entity (such as a Virtual Machine) in a physical host. The term server refers to its role in the data center. Both terms are used interchangeably and refer to an IP end station.

Local hosts
   Used in the context of a ToR switch to denote the VM hosts connected to a ToR switch on the downlink, i.e. directly connected hosts.
Remote hosts
   Used in the context of a ToR switch to denote the hosts that are accessible via the uplink of the ToR switch.

VM
   Virtual Machine. This is a logical instance of a host that operates independently in a physical host and has its own IP and MAC addresses. The VM architecture allows efficient use of physical host resources (such as multiple CPU cores).

2.0 Configuration

It is assumed that the ARP reduction methodologies defined in this document will be limited to ToR switches. The maximum benefit of restraining ARP broadcasts in the network is achieved by the first hop switches (the ones directly connected to the hosts), without placing additional burden on second or third tier switches.

First, the ToR switches need to be configured to enable the ARP reduction feature. Every Ethernet interface needs to be identified as either a downlink or an uplink within the context of this feature. The ARP reduction feature treats ARP frames received from a downlink or an uplink differently, as described in the following sections.

In addition, the operator may optionally configure various ARP reduction related parameters such as:

. ARP aging timer,
. size of the ARP table,
. static entries of IP to MAC address, etc.

3.0 Building the ARP tables

When ARP reduction is enabled, the ToR switch will monitor all ARP traffic transiting the switch (regardless of uplink port or downlink port) and will process any ARP PDUs in the following manner:

. ARP Request PDUs must be redirected to the control plane CPU.
. Gratuitous ARP PDUs (ARP Reply PDUs with a broadcast MAC DA) must be redirected to the control plane CPU.
. Other ARP Reply PDUs (ARP Reply PDUs with a unicast MAC DA) should be bi-casted; one copy is sent to the control plane CPU and the other copy is forwarded out normally.
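The three dispatch rules above can be sketched as a small classifier. This is only an illustration under stated assumptions, not part of the specification: the names `Action` and `classify_arp` are hypothetical, MAC addresses are assumed to be colon-separated strings, and only the ARP opcode values (1 for Request, 2 for Reply) come from [ARP].

```python
from enum import Enum

ARP_REQUEST = 1   # ARP opcode for a Request, per RFC 826
ARP_REPLY = 2     # ARP opcode for a Reply, per RFC 826
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


class Action(Enum):
    """Hypothetical names for the two dispositions described in Section 3.0."""
    REDIRECT_TO_CPU = "redirect"  # PDU goes to the control plane only
    BICAST = "bicast"             # one copy to the control plane, one forwarded normally


def classify_arp(opcode: int, dest_mac: str) -> Action:
    """Decide how a snooped ARP PDU is handled, based on opcode and MAC DA."""
    if opcode == ARP_REQUEST:
        return Action.REDIRECT_TO_CPU
    if opcode == ARP_REPLY:
        if dest_mac.lower() == BROADCAST_MAC:
            # A Reply sent to the broadcast MAC DA is treated as a Gratuitous ARP.
            return Action.REDIRECT_TO_CPU
        return Action.BICAST
    raise ValueError(f"unsupported ARP opcode: {opcode}")
```

Redirecting Requests and Gratuitous ARPs, rather than bi-casting them, leaves the decision to forward or suppress the broadcast with the control plane.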
3.1 ARP Requests

The ToR examines the source IP and the source hardware address (MAC address) in the ARP Request. The source IP and MAC address association is learned, or is updated/refreshed if already learned. The destination IP address is searched in the ARP table. If an entry exists, the associated MAC address from the table is used to prepare a unicast ARP Reply PDU. The same MAC address is used as the source MAC address in the MAC header, as well as for the target hardware address, in the unicast ARP Reply PDU.

If the destination IP address in the ARP Request is not present in the ARP table, then the original ARP Request PDU is broadcast to all the switch ports that are members of the same VLAN, except the source port that the Request was received from. However, if the requested (destination) IP address is present in the ARP table, a unicast ARP Reply PDU is prepared as described above and sent to the switch port from which the ARP Request was received, and the original ARP Request PDU is dropped.

The intent is to prevent propagation of ARP Request PDU broadcasts as much as possible using the information present in the ARP table. The following observations can be made from such behavior.

. Most of the ARP Requests from the local hosts of a ToR switch for the local hosts of the ToR switch can be prevented from being broadcast on uplinks or downlinks.
. Most of the ARP Requests from the remote hosts of a ToR switch for the local hosts of the ToR switch can be prevented from getting forwarded on downlinks or other uplinks of the ToR switch.
. Many of the ARP Requests from the local hosts of a ToR switch for the remote hosts of the ToR switch can be prevented from being forwarded on uplinks if the remote host IP to MAC association is known to the ToR switch.

3.2 ARP Reply

The unicast ARP Reply is examined to learn/update the ARP table for source and destination IP/MAC address association, but is also forwarded out as a normal frame.

3.3 Gratuitous ARP

Gratuitous ARP is a broadcast ARP Reply PDU with the destination IP address set to the IP address of the sender and the target hardware address set to the MAC address of the sender. It is typically used by IP hosts (including VMs) to keep their IP-to-MAC address association fresh in their peers' ARP caches.

The ToR switch should process Gratuitous ARP in the following manner.

. Learn/update/refresh the ARP table entry.
. If the IP address is new, or exists but with a different hardware address, then the Gratuitous ARP PDU is forwarded out; otherwise the PDU is discarded.

The goal for handling of the Gratuitous ARP PDUs received from the downlinks (i.e. local hosts) is to avoid propagating them into the 'network' (i.e. to the uplinks), unless there is a new association.

By suppressing the propagation of Gratuitous ARP PDUs, the peer IP hosts will end up aging out the corresponding ARP table entries. This will result in generation of broadcast ARP Requests by those IP hosts if they need to continue to communicate with the IP host whose Gratuitous ARPs were suppressed. By handling the ARP Request as described above, the first hop ToR switch will be able to respond to this request based on the ARP cache maintained in the ToR switch.
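The Request and Gratuitous ARP handling described in Sections 3.1 and 3.3 can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: `ArpTable`, `handle_request` and `handle_gratuitous` are hypothetical names, the table is a plain IP to MAC dictionary, and aging, table size limits, VLANs and port selection are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ArpTable:
    """Hypothetical control plane ARP table: IP address -> MAC address."""
    entries: Dict[str, str] = field(default_factory=dict)

    def learn(self, ip: str, mac: str) -> bool:
        """Learn or refresh a binding; return True if it is new or changed."""
        changed = self.entries.get(ip) != mac
        self.entries[ip] = mac
        return changed


def handle_request(table: ArpTable, sender_ip: str, sender_mac: str,
                   target_ip: str) -> Tuple[str, Optional[str]]:
    """Section 3.1: learn the sender, then proxy-reply or flood."""
    table.learn(sender_ip, sender_mac)
    target_mac = table.entries.get(target_ip)
    if target_mac is None:
        # Unknown target: flood the original Request within the VLAN,
        # excluding the port it arrived on.
        return ("flood", None)
    # Known target: send a unicast Reply carrying the target's own MAC
    # back to the ingress port and drop the original Request.
    return ("proxy_reply", target_mac)


def handle_gratuitous(table: ArpTable, sender_ip: str, sender_mac: str) -> str:
    """Section 3.3: forward only when the binding is new or has changed."""
    return "forward" if table.learn(sender_ip, sender_mac) else "discard"
```

For example, once a Gratuitous ARP from 10.0.0.2 has been learned, a later Request for 10.0.0.2 is answered from the table instead of being flooded, and a repeated Gratuitous ARP with the same binding is discarded.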
In essence, the presence of large ARP tables with longer aging times in the ToR switches compensates for the smaller ARP tables present in the IP hosts and eliminates the need for periodic use of Gratuitous ARPs to refresh the ARP tables in the IP hosts.

3.4 Uplink Versus Downlink Processing

With respect to the processing of ARP PDUs as described above, the behavior differs depending on whether the PDU was received from an uplink or a downlink in the following ways.

. The aging timer will typically be higher for entries learned from an uplink than for those learned from a downlink. The reason for this is to avoid flooding ARP broadcasts on uplinks, since they have a much larger negative impact.
. If the ARP table fills up, then entries learned from downlinks (i.e. directly attached hosts) will take precedence over those learned from an uplink (i.e. remote hosts). This will trade off sending broadcasts on host links versus sending them into the core of the network. The reason for this is that access links are typically lower bandwidth, and also this will conserve CPU resources involved in processing unnecessary ARP traffic.

3.5 Host movement

As mentioned earlier, server virtualization technology allows movement of VMs to different physical servers. The flexibility to move VMs is one of the key benefits of server virtualization. The VM movement could be manual (operator initiated) or may be done automatically in reaction to demands placed by the application users. The important point is that in either case, VM movement is not transparent and is made known to the network. There is ongoing work in the IEEE 802.1 standards organization (IEEE 802.1Qbg) to coordinate/communicate the presence and capabilities of the VMs to the directly connected network switch.
VMs typically retain their MAC and IP address, and as such, there would be little impact to the ARP table maintained by the ARP reduction mechanism described herein. However, the ARP reduction mechanism would benefit from knowing if a VM is completely decommissioned so that the ToR can remove the ARP entry it has for that VM in a timely fashion, rather than waiting for it to time out.

3.6 Applicability to environments with overlay transport

Recently, there have been multiple proposals for using overlay transport technologies such as VXLAN [VXLAN] and NVGRE [NVGRE]. These proposals allow the network operator to build the network using L2 or L3 technologies while building an L2 overlay on top of that. As such, while they address the issue of network design, they do not eliminate the need for a mechanism to reduce the amount of broadcast traffic that may have to traverse the core if there are VMs of the same tenant on servers attached to different ToR switches. One way for the overlay transport proposals to address this issue would be to implement the mechanism discussed in this document at the point where the overlay encapsulation and decapsulation is performed (i.e. in the virtual switch).

3.7 Scaling Considerations

Depending on the number of hosts in the network, the ARP table can be quite large. Although it is possible to implement some of the mechanisms for ARP reduction described in this document in hardware in the forwarding plane, the number of ARP entries may favor maintaining the ARP table in control plane memory.

3.8 Miscellaneous Issues

Because of the distributed nature of the mechanisms described herein, there are a few additional issues that warrant consideration from the network operator.
Earlier in the document, we mentioned the configuration of a timer for ARP entries. A longer timer for holding on to ARP entries helps with the reduction of broadcasts. However, a timer that is too large can cause problems in certain situations. Consider the following scenario. Host A is attached to ToR switch #1, and host B is attached to ToR switch #2. If host B issues an ARP Request for host A, and the entry is available at switch #2, then switch #2 would send the ARP Reply on behalf of host A. It is possible that host A is no longer available, but there is no way for switch #2 to know this, and it would continue to respond on behalf of host A until its entry for host A has timed out. In this case, it is easy to see that a smaller timer would be beneficial. Additionally, since host B has its own ARP age timer, host B would find out about host A's unavailability only after its own entry has aged out, which would be after the entry has aged out of switch #2.

Another issue that can be somewhat problematic is the inconsistency of tables in switches. Once again, consider a scenario similar to the one described above, with 2 hosts each connected to its respective ToR switch. Let the ARP entries for both A and B be learned by both switches. Now assume that the IP address on host A changes. This change is signaled to switch #1, which in turn broadcasts the message on its uplink. Now, if this message is discarded due to network congestion or signal integrity issues, then switch #2 will not learn about the change and will continue to respond to host B's ARP Requests for host A's old IP address with stale information. This lasts until the ARP entry for A times out at switch #2.

4.0 Conclusion

Based on the procedures described in this document, it is possible for ToR switches in the data center to contain ARP broadcasts significantly.
The solution is based on well known, non-intrusive procedures and strives to curtail broadcasts that are increasingly becoming a cause for concern in data centers. In essence, ToR switches facilitate the offloading of the extended ARP table management from the IP hosts to themselves. The ARP table timeout can be tuned higher by the operator based on the available switch resources and network traffic behavior. The larger capacity of the ARP table directly translates to more effective subduing of the ARP broadcasts.

5.0 Security Considerations

The details of the security aspects will be addressed in a future revision.

6.0 Acknowledgments

This document resulted from discussions with Linda Dunbar (Huawei), Sue Hares (Huawei), and T Sridhar (VMware). We would like to acknowledge their contribution to this work.

7.0 References

7.1 Normative References

[ARP] D. Plummer, "An Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Addresses for Transmission on Ethernet Hardware," RFC 826, STD 37, November 1982.

[ARP-Problem] T. Narten, "Problem Statement for ARMD," work in progress, <draft-ietf-armd-problem-statement>.

7.2 Informative References

[ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking in Layer 2 VPN," work in progress, <draft-ietf-l2vpn-arp-mediation>.

[IPLS] H. Shah et al., "IP-only LAN service," work in progress, <draft-ietf-l2vpn-ipls>.

[PROXY-ARP] J. Postel, "Multi-LAN Address Resolution," RFC 925, October 1984.

[TRILL] R. Perlman et al., "RBridges: Base Protocol Specification," <draft-ietf-trill-rbridge-protocol-16>, March 2010.

[RFC1027] S. Carl-Mitchell and J. Quarterman, "Using ARP to Implement Transparent Subnet Gateways," RFC 1027, October 1987.

[VXLAN] M. Mahalingam et al., "VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks," work in progress, <draft-mahalingam-dutt-dcops-vxlan>.

[NVGRE] M. Sridharan et al., "NVGRE: Network Virtualization using Generic Routing Encapsulation," work in progress, <draft-sridharan-virtualization-nvgre>.

8.0 Author's Address

Himanshu Shah
Ciena Corp
Email: hshah@ciena.com

Anoop Ghanwani
Brocade
Email: anoop@alumni.duke.edu

Nabil Bitar
Verizon
Email: nabil.n.bitar@verizon.com