   Working Group: ARMD                                    Himanshu Shah
   Intended Status: Informational                            Ciena Corp
   Internet Draft
                                                         Anoop Ghanwani
   Expiration Date: April 27, 2012                              Brocade

                                                            Nabil Bitar
                                                                Verizon

                                                       October 28, 2011

              ARP Broadcast Reduction for Large Data Centers
                   draft-shah-armd-arp-reduction-02.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 27, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   Internet Draft   draft-shah-arp-reduction-02.txt

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   With the advent of server virtualization technologies, a host is able
   to support multiple Virtual Machines (VMs) in a single physical
   machine. Data centers can leverage these capabilities to instantiate
   on the order of 10s to 100s of VMs in a single server with current
   technology.  It is conceivable that this number can be much higher
   in the future. Each VM operates as an independent IP host with a set
   of Virtual Network Interface Cards (vNICs), each having its own MAC
   address and mapping to a physical Ethernet interface. These physical
   servers are typically installed in a rack with their Ethernet
   interfaces connected to a top-of-rack (ToR) switch. The ToR switches
   are interconnected through End-of-Row (EoR) or aggregation switches,
   which are in turn connected to core switches.

   As discussed in [ARP-Problem] the host VMs use ARP broadcasts to
   find other host VMs and use periodic (broadcast) Gratuitous ARPs to
   refresh their IP to MAC address binding in other VM hosts. Such
   broadcasts in a large data center with potentially thousands of VM
   hosts in a Layer 2 based topology can overwhelm the network.

   This memo proposes mechanisms to reduce the number of broadcasts
   that are sent throughout the network. This is done by having the ToR
   switches intelligently process ARP packets, rather than simply
   broadcasting them throughout the broadcast domain.

   While this document specifically addresses ARP, the Neighbor
   Discovery mechanisms used by IPv6 hosts, which make use of multicast
   rather than broadcast, pose similar issues in the data center. The
   solutions defined herein should be equally applicable to hosts
   running IPv6. The details will be specified in a subsequent revision.

Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC 2119].

Table of Contents

   Copyright Notice
   Abstract
   1.0 Overview
    1.1 Terminology
   2.0 Configuration
   3.0 Building the ARP Tables
    3.1 ARP Request
    3.2 ARP Reply
    3.3 Gratuitous ARP
    3.4 Uplink Versus Downlink Processing
    3.5 Host Mobility
   4.0 Concluding Remarks
   5.0 Security Considerations
   6.0 Acknowledgments
   7.0 References
    7.1 Normative References
    7.2 Informative References
   8.0 Author's Address

1.0 Overview

   The traditional topology in a data center consists of racks of
   servers connected to top-of-rack (ToR) switches, which connect to
   aggregation switches, which in turn connect to core switches.  The
   network architecture typically combines Layer 2 and Layer 3
   functionality.  In some architectures, Layer 2 is terminated at the
   ToR, with Layer 3 being run in the aggregation and core devices.  In
   other architectures, Layer 2 may be extended all the way to the
   aggregation switch.  The primary concerns that have influenced
   network architectures in the data center have been keeping broadcast
   domains manageable and the spanning tree diameter contained.

   Moving forward, these traditional network architectures are being
   challenged due to emerging technologies such as server
   virtualization.

   Server virtualization brings some challenges to the data center.
   Because of virtualization, the number of hosts seen by the network
   increases dramatically, to 10 to 100 times the number of physical
   servers.  These virtual hosts are referred to as Virtual Machines
   (VMs).  In addition, virtualized environments offer a feature
   referred to as "VM mobility", wherein a VM can be relocated to run
   on a different physical server.  For VM mobility to be
   non-disruptive to other hosts that have communication in progress
   with the VM being moved, the VM must retain its MAC address and IP
   address.  Because of this requirement, it is desirable to develop
   network architectures that place the fewest restrictions on VM
   mobility.

   As an example, in a network architecture where ToR switches
   terminate the L2 domain, the range of VM mobility would be
   restricted to a single ToR switch.  It would be preferable to allow
   the flexibility of moving the VM anywhere within the data center, or
   perhaps even to a different data center.

   Technologies such as TRILL [TRILL] overcome some of the issues of
   spanning trees that forced traditional Layer 2 topologies to be
   severely constrained.  However, virtualization introduces two
   specific problems with respect to broadcast traffic.
     1. A larger number of hosts.  A single physical server now hosts
        multiple VMs, taking the scale factor to a different level.  If
        each VM issues the same number of broadcasts as a physical
        server, the amount of broadcast traffic will increase by a
        factor of 10 to more than 100.
     2. If the Layer 2 domains are extended across data centers, then
        broadcast traffic will now go across the backbone.  If Layer 2
        were terminated at the ToR switch, the increase in broadcast
        traffic would be restricted to a single ToR switch, but as
        discussed earlier, this restriction is not desirable.
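
   As a rough illustration of the first problem, the following Python
   sketch multiplies out the host count.  The server count, the VM
   density, and the per-host broadcast rate are hypothetical values
   chosen only to show the arithmetic; they are not measurements taken
   from this document.

```python
def broadcasts_per_minute(servers, hosts_per_server, per_host_per_minute):
    """Aggregate broadcast rate if every IP host behaves identically."""
    return servers * hosts_per_server * per_host_per_minute


# Hypothetical data center: 1000 physical servers, each IP host sending
# 6 ARP broadcasts per minute (both numbers are assumptions).
physical_only = broadcasts_per_minute(1000, 1, 6)
virtualized = broadcasts_per_minute(1000, 50, 6)   # assumes 50 VMs per server
scale_factor = virtualized // physical_only
```

   Under these assumptions the broadcast load grows in direct
   proportion to the number of VMs per server.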

   Excessive broadcast traffic in Layer 2 networks wastes network
   bandwidth, and it also wastes CPU resources because every VM must
   process superfluous ARP broadcasts (IPv6 avoids the latter by
   running ND as a multicast service rather than a broadcast service).

   The solution presented here attempts to minimize the negative
   effects of ARP broadcast packets. The solution requires the first
   hop Ethernet switch, typically the ToR switch, to maintain an ARP
   table that is learned from the ARP packets received by the switch.
   The switch then selectively propagates the ARP packet to, or
   proxy-responds on behalf of, the remote peer. These types of ARP
   processing principles are well known and are described in L2VPN
   Working Group documents such as [ARP-Mediation] and [IPLS]. The ARP
   proxy response differs from that described in [RFC1027] in that the
   ARP response contains the MAC address of the destination and not
   that of the switch, as is suggested in [RFC1027].

   The following sections describe the details of ARP snooping, the
   learning and maintenance of ARP tables, the use of learned
   information to limit broadcast propagation, and proxy responses on
   behalf of the remote peers.

1.1 Terminology

        ToR            Top-of-Rack. An Ethernet switch installed at
                       the top of a rack of servers which provides
                       network connectivity to those servers.

        Downlink       The Ethernet link between the ToR switch and a
                       directly connected host (server in the rack).

        Uplink         The network-facing Ethernet connection in the
                       ToR switch. Typically, the uplinks from ToRs
                       connect to end-of-row or aggregation switches.

        EoR            End-of-Row. An Ethernet switch that aggregates
                       traffic from multiple racks.  Also commonly
                       referred to as an aggregation switch.  Uplinks
                       from the ToR switches connect to an EoR switch,
                       and uplinks from EoR switches in turn connect
                       to a core switch.

        Host/Server    A host or server running the IP protocol.  This
                       could be a physical entity or a logical entity
                       (such as a Virtual Machine) in a physical host.
                       The term server refers to its role in the data
                       center.  Both terms are used interchangeably to
                       refer to an IP host.

        Local hosts    Used in the context of a ToR switch to denote
                       the hosts connected to a ToR switch on the
                       downlink, i.e. directly attached hosts.

        Remote hosts   Used in the context of a ToR switch to denote
                       the hosts that are accessible through the
                       uplink of the ToR switch.

        VM             Virtual Machine. This is a logical instance of
                       a host that operates independently in a
                       physical host and has its own IP and MAC
                       addresses. VMs allow efficient use of physical
                       host resources (such as multiple CPU cores).

2.0 Configuration

   It is assumed that the ARP reduction mechanisms defined in this
   document will be limited to ToR switches.  The maximum benefit of
   restraining ARP broadcasts in the network is achieved at the first
   hop switches (the ones directly connected to the hosts), without
   placing additional burden on second or third tier switches.

   First, the ToR switches need to be configured to enable the ARP
   reduction feature. Every Ethernet interface needs to be identified
   as either a downlink or an uplink within the context of this
   feature. The ARP reduction feature treats ARP frames received from
   a downlink differently from those received from an uplink, as
   described in the following sections.

   In addition, the operator may optionally configure various ARP
   reduction related parameters such as:
     . ARP aging timer.
     . Size of the ARP table.
     . Static entries of IP to MAC address.
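
   The configuration items above could be grouped as in the following
   Python sketch.  The class name, field names, and default values are
   hypothetical; this document does not define a configuration model,
   so this is only one possible shape under those assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class PortRole(Enum):
    DOWNLINK = "downlink"   # faces directly attached hosts
    UPLINK = "uplink"       # faces EoR or aggregation switches


@dataclass
class ArpReductionConfig:
    enabled: bool = False
    port_roles: dict = field(default_factory=dict)      # port name -> PortRole
    aging_timer_seconds: int = 300                       # assumed default
    max_table_entries: int = 16384                       # assumed default
    static_entries: dict = field(default_factory=dict)   # IP -> MAC

    def validate(self, ports):
        """Every Ethernet interface must be marked downlink or uplink."""
        missing = [p for p in ports if p not in self.port_roles]
        if self.enabled and missing:
            raise ValueError(f"ports without a role: {missing}")
```

   A switch would call validate() with its full list of Ethernet
   interfaces before enabling the feature.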

3.0 Building the ARP Tables

   When ARP reduction is enabled, the ToR switch will monitor all ARP
   traffic transiting the switch (regardless of uplink port or downlink
   port) and will process any ARP packets in the following manner:
     . ARP Request packets must be redirected to the control plane CPU.
     . Gratuitous ARP packets (ARP Reply packets with a broadcast MAC
        DA) must be redirected to the control plane CPU.
     . Other ARP Reply packets (ARP Reply packets with a unicast MAC
        DA) should be bi-casted: one copy is sent to the control plane
        CPU and the other copy is forwarded out normally.
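
   The three rules above amount to a small classification step.  The
   Python sketch below is illustrative only; the function and
   enumeration names are hypothetical, and a real switch would
   normally express these rules as forwarding plane filters rather
   than as software.

```python
from enum import Enum

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"
ARP_REQUEST, ARP_REPLY = 1, 2   # ARP opcodes defined in RFC 826


class Action(Enum):
    REDIRECT_TO_CPU = "redirect"   # only the control plane sees the packet
    BICAST = "bicast"              # copy to the CPU and forward normally


def classify_arp(opcode, dst_mac):
    """Map an ARP packet to one of the handling rules listed above."""
    if opcode == ARP_REQUEST:
        return Action.REDIRECT_TO_CPU
    if opcode == ARP_REPLY and dst_mac.lower() == BROADCAST_MAC:
        return Action.REDIRECT_TO_CPU      # Gratuitous ARP
    if opcode == ARP_REPLY:
        return Action.BICAST               # unicast ARP Reply
    raise ValueError(f"not an ARP Request or Reply: opcode {opcode}")
```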

3.1 ARP Request

   The ToR examines the source IP and the source hardware address (MAC
   address) in the ARP Request. The source IP and MAC address
   association is learned, or is updated/refreshed if already learned.
   The ARP table is then searched for the destination IP address. If an
   entry exists, the associated MAC address from the table is used to
   prepare a unicast ARP Reply packet. The same MAC address is used as
   the source MAC address in the MAC header, as well as for the target
   hardware address, in the unicast ARP Reply packet.

   If the destination IP address in the ARP Request is not present in
   the ARP table, then the original ARP Request packet is broadcast to
   all the switch ports that are members of the same VLAN, except the
   port on which the ARP Request was received. However, if the
   requested (destination) IP address is present in the ARP table, a
   unicast ARP Reply packet is prepared as described above and sent to
   the switch port from which the ARP Request was received, and the
   original ARP Request packet is dropped.

   The intent is to prevent propagation of ARP Request broadcasts as
   much as possible using the information present in the ARP table.
   The following observations can be made from such behavior.
      . Most of the ARP Request packets from the local hosts of a ToR
         switch for the local hosts of that ToR switch can be prevented
         from being broadcast on uplinks or downlinks.
      . Most of the ARP Request packets from the remote hosts of a ToR
         switch for the local hosts of that ToR switch can be prevented
         from being broadcast on downlinks or other uplinks of the ToR
         switch.
      . Many of the ARP Request packets from the local hosts of a ToR
         switch for the remote hosts of that ToR switch can be
         prevented from being forwarded on uplinks if the remote host
         IP to MAC association is known to the ToR switch.
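
   The ARP Request handling described in this section can be sketched
   in Python as follows.  The class and method names are hypothetical,
   entry aging is omitted, and the result is returned as a simple
   action object instead of an encoded ARP packet; this is a sketch
   under those assumptions, not a definitive implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyReply:
    """Unicast ARP Reply generated by the switch for the requester."""
    resolved_ip: str
    resolved_mac: str
    out_port: str


@dataclass
class Flood:
    """Original ARP Request sent to the other ports of the VLAN."""
    out_ports: list


@dataclass
class ArpSnooper:
    vlan_ports: list
    table: dict = field(default_factory=dict)   # IP -> MAC

    def on_request(self, src_ip, src_mac, target_ip, in_port):
        self.table[src_ip] = src_mac             # learn or refresh the sender
        mac = self.table.get(target_ip)
        if mac is not None:
            # Known target: answer locally; the original request is dropped.
            return ProxyReply(target_ip, mac, in_port)
        # Unknown target: flood on every VLAN port except the ingress port.
        return Flood([p for p in self.vlan_ports if p != in_port])
```

   For example, a first request for an unknown address is flooded, and
   a later request for the address learned from that first request is
   answered by the switch on the ingress port.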

3.2 ARP Reply

   The unicast ARP Reply is examined to learn/update the ARP table for
   source and destination IP/MAC address association, but is also
   forwarded out as a normal frame.

3.3 Gratuitous ARP

   Gratuitous ARP is a broadcast ARP Reply packet with the destination
   IP address set to the IP address of the sender and the target
   hardware address set to the MAC address of the sender. It is
   typically used by IP hosts (including VMs) to keep their IP-to-MAC
   address association fresh in their peers' ARP caches.

   The ToR switch should process Gratuitous ARP in the following
   manner.
      . Learn/update/refresh the ARP table entry.
      . If the IP address is new, or exists but with a different
         hardware address, then the Gratuitous ARP packet is forwarded
         out; otherwise the packet is discarded.

   The goal in handling Gratuitous ARP packets received from the
   downlinks (i.e. local hosts) is to avoid propagating them into the
   'network' (i.e. to the uplinks) unless there is a new association.

   Because the propagation of Gratuitous ARP packets is suppressed, the
   peer IP hosts will end up aging out the corresponding ARP table
   entries. This will result in the generation of broadcast ARP
   Requests by those IP hosts if they need to continue to communicate
   with the IP host whose Gratuitous ARPs were suppressed. The first
   hop ToR switch, handling the ARP Request as described above, will be
   able to respond to this request based on the ARP cache maintained in
   the ToR switch. In essence, the presence of large ARP tables with
   longer aging times in the ToR switch compensates for the smaller ARP
   tables present in the IP hosts and eliminates the need for periodic
   use of Gratuitous ARPs in order to refresh the ARP tables in the IP
   hosts.
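
   The Gratuitous ARP rule reduces to a comparison against the current
   table entry.  The following Python sketch uses a hypothetical
   function name and a plain dictionary as the ARP table; timestamp
   refresh for aging is left out for brevity.

```python
def on_gratuitous_arp(table, ip, mac):
    """Update the table; return True if the packet should be forwarded."""
    # A missing entry or a changed MAC address is a new association.
    is_new_association = table.get(ip) != mac
    table[ip] = mac          # learn, update, or refresh the entry
    return is_new_association
```

   A periodic refresh that repeats a known association returns False
   and is therefore discarded rather than propagated.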

3.4 Uplink Versus Downlink Processing

   With respect to processing of the ARP packets as described above,
   the behavior is different depending on whether the packet was
   received from an uplink or downlink in the following ways.

     . The aging timer will typically be longer for entries learned
        from an uplink than for those learned from a downlink.  The
        reason for this is to avoid flooding ARP broadcast packets on
        uplinks, since they have a much larger negative impact there.
     . If the ARP table fills up, then entries learned from downlinks
        (i.e. directly attached hosts) will take precedence over those
        learned from an uplink (i.e. remote hosts).  This trades off
        sending broadcasts on host links against sending them into
        the core of the network.  The reason for this is that access
        links typically have lower bandwidth, and this will also
        conserve CPU resources involved in processing unnecessary ARP
        traffic.
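
   One way to realize the two rules above is sketched below in Python.
   The class names, the timer values used in any example, and the exact
   eviction policy (evict the least recently seen uplink entry, and
   refuse a new entry when nothing may be evicted) are assumptions made
   for illustration; this document only states that downlink entries
   take precedence.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    mac: str
    from_uplink: bool
    last_seen: float


class ArpTable:
    def __init__(self, max_entries, downlink_age, uplink_age):
        self.max_entries = max_entries
        self.age = {False: downlink_age, True: uplink_age}  # by from_uplink
        self.entries = {}                                    # IP -> Entry

    def expire(self, now):
        """Remove entries older than the aging timer for their link type."""
        stale = [ip for ip, e in self.entries.items()
                 if now - e.last_seen > self.age[e.from_uplink]]
        for ip in stale:
            del self.entries[ip]

    def learn(self, ip, mac, from_uplink, now):
        """Insert or refresh an entry; return False if it was not stored."""
        if ip not in self.entries and len(self.entries) >= self.max_entries:
            uplink_ips = [i for i, e in self.entries.items() if e.from_uplink]
            if from_uplink or not uplink_ips:
                return False     # table full and nothing that may be evicted
            # Downlink entries take precedence: evict the stalest uplink one.
            victim = min(uplink_ips, key=lambda i: self.entries[i].last_seen)
            del self.entries[victim]
        self.entries[ip] = Entry(mac, from_uplink, now)
        return True
```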

3.5 Host Mobility

   As mentioned earlier, server virtualization technology allows
   mobility of VMs to different physical servers. The flexibility to
   move VMs is one of the key benefits of server virtualization. VM
   mobility could be manual (operator initiated) or may be done
   automatically in reaction to demands placed by the application
   users. The important point is that in either case, VM movement is
   not transparent and is made known to the network.

   There is ongoing work in the IEEE 802.1 standards organization (IEEE
   802.1Qbg) to coordinate/communicate the presence and capabilities of
   the VMs to the directly connected network switch.

   VMs typically retain their MAC and IP address across a VM mobility
   event, and as such, there would be little impact to the ARP table
   maintained by the ARP reduction mechanism described herein.
   However, the ARP reduction mechanism would benefit from knowing if a
   VM is completely decommissioned so that the ToR switch can remove
   the ARP entry that it has for that VM in a timely fashion, rather
   than waiting for it to age out.

3.6 Applicability to Environments with Overlay Transport

   Recently, there have been multiple proposals for using overlay
   transport technologies such as VXLAN [VXLAN] and NVGRE [NVGRE].
   These proposals allow the network operator to build the network
   using L2 or L3 technologies while building an L2 overlay on top of
   that.  As such, while they address the issue of network design, they
   do not eliminate the need for a mechanism to reduce the amount of
   broadcast traffic that may have to traverse the core if there are
   VMs of the same tenant on servers attached to different ToR
   switches.

   One way for the overlay transport proposals to address this issue
   would be to implement the mechanism discussed in this document at
   the point where the overlay encapsulation and decapsulation are
   performed (i.e. in the virtual switch).

3.7 Scaling Considerations

   Depending on the number of hosts in the network, the ARP table in a
   ToR switch needed for the ARP reduction mechanisms described above
   can be quite large. Although it is possible to implement some of the
   mechanisms for ARP reduction described in this document in hardware
   in the forwarding plane, the number of ARP entries may favor
   maintaining the ARP table in control plane memory.

3.8 Miscellaneous Issues

   Because of the distributed nature of the mechanisms described
   herein, there are a few additional issues that warrant consideration
   from the network operator.

   Earlier in the document, we mentioned the configuration of an aging
   timer for ARP entries.  A longer timer for holding on to ARP entries
   helps with the reduction of broadcasts.  However, a timer that is
   too long can cause problems in certain situations.  Consider the
   following scenario.  Host A is attached to ToR switch #1, and host B
   is attached to ToR switch #2.  If host B issues an ARP Request for
   host A, and if the entry is available at switch #2, then switch #2
   would send the ARP Reply on behalf of host A.  It is possible that
   host A is no longer available, but there is no way for switch #2 to
   know this, and it would continue to respond on behalf of host A
   until its entry for host A has aged out.  In this case, it is easy
   to see that a smaller aging timer would be beneficial.
   Additionally, since host B has its own ARP aging timer, host B would
   find out about host A's unavailability only after its own entry has
   aged out, which would be some time after the entry has aged out of
   switch #2.
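
   The interaction of the two timers in this scenario can be walked
   through with a small Python sketch.  It rests on simplifying
   assumptions that are not stated in this document: switch #2 last
   refreshed its entry for host A at time 0, host A disappears
   immediately afterwards, and host B issues a new ARP Request at the
   moment its own cached entry ages out.  The function name and timer
   values are hypothetical.

```python
def time_host_learns_peer_is_gone(switch_aging, host_aging, host_first_learned=0):
    """Return when host B's ARP Request is first not answered by switch #2."""
    if host_aging <= 0:
        raise ValueError("host_aging must be positive")
    switch_expiry = switch_aging      # switch #2 last heard from A at t = 0
    t = host_first_learned
    while True:
        t += host_aging               # B's entry for A ages out; B re-ARPs
        if t >= switch_expiry:
            return t                  # switch entry is gone; request is flooded
        # Otherwise switch #2 proxy-replies and B caches the stale entry again.


# Hypothetical timers: 1200 s on the switch, 500 s on the host.
detected_at = time_host_learns_peer_is_gone(1200, 500)
```

   Under this model the stale binding survives at host B for less than
   the sum of the two aging timers, which is one reason to keep the
   switch timer from growing without bound.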

   Another potential issue is inconsistency between the tables in
   different switches.  Once again, consider a scenario similar to the
   one described above, with two hosts each connected to its respective
   ToR switch.  Let the ARP entries for both A and B be learned by both
   switches.  Now assume that the IP address on host A changes.  This
   change is signaled to switch #1, which in turn broadcasts the
   message on its uplink.  If this message is discarded due to network
   congestion or signal integrity issues, then switch #2 will not learn
   about the change and will continue to respond to host B's ARP
   Requests for host A's old IP address with stale information.  This
   lasts until the ARP entry for A ages out at switch #2.

4.0 Concluding Remarks

   Based on the procedures described in this document, it is possible
   for ToR switches in the data center to contain ARP broadcasts
   significantly. The solution is based on well known, non-intrusive
   procedures and strives to curtail ARP broadcasts that are
   increasingly becoming a cause for concern in data centers. In
   essence, ToR switches offload some of the ARP table management from
   the IP hosts to themselves. The ARP table aging timer can be tuned
   higher by the operator based on the available switch resources and
   network traffic behavior. The larger capacity of the ARP table,
   coupled with a long aging time for entries in the table, directly
   translates to more effective subduing of ARP broadcasts.

5.0  Security Considerations

   Security aspects will be addressed in a subsequent revision.

6.0  Acknowledgments

   This document resulted from discussions with Linda Dunbar (Huawei),
   Sue Hares (Huawei), and T Sridhar (VMware).  We would like to
   acknowledge their contribution to this work.

7.0 References

7.1 Normative References

   [ARP] D. Plummer, "An Ethernet Address Resolution Protocol:  Or
      Converting Network Protocol Addresses to 48.bit Ethernet
      Addresses for Transmission on Ethernet Hardware," RFC 826,
      STD 37, November 1982.

   [ARP-Problem] T. Narten, "Problem Statement for ARMD," work in
      progress, <draft-ietf-armd-problem-statement>.

7.2 Informative References

   [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking
      in Layer 2 VPN," work in progress,
      <draft-ietf-l2vpn-arp-mediation>.

   [IPLS] H. Shah et al., "IP-only LAN service," work in progress,
      <draft-ietf-l2vpn-ipls>.

   [PROXY-ARP] J. Postel, "Multi-LAN Address Resolution," RFC 925,
      October 1984.

   [TRILL] R. Perlman et al., "RBridges: Base Protocol Specification,"
      <draft-ietf-trill-rbridge-protocol-16>, March 2010.

   [RFC1027] S. Carl-Mitchell and J. Quarterman, "Using ARP to
      Implement Transparent Subnet Gateways," RFC 1027, October 1987.

   [VXLAN] M. Mahalingam et al., "VXLAN: A Framework for Overlaying
      Virtualized Layer 2 Networks over Layer 3 Networks," work in
      progress, <draft-mahalingam-dutt-dcops-vxlan>.

   [NVGRE] M. Sridharan et al., "NVGRE: Network Virtualization using
      Generic Routing Encapsulation," work in progress,
      <draft-sridharan-virtualization-nvgre>.

8.0 Author's Address

   Himanshu Shah
   Ciena Corp
   Email: hshah@ciena.com

   Anoop Ghanwani
   Brocade
   Email: anoop@alumni.duke.edu

   Nabil Bitar
   Verizon
   Email: nabil.n.bitar@verizon.com