| < draft-shah-armd-arp-reduction-01.txt | draft-shah-armd-arp-reduction-02.txt > | |||
|---|---|---|---|---|
| Working Group: ARMD Himanshu Shah | Working Group: ARMD Himanshu Shah | |||
| Intended Status: Proposed Standard Ciena Corp | Intended Status: Informational Ciena Corp | |||
| Internet Draft | Internet Draft | |||
| Anoop Ghanwani | Anoop Ghanwani | |||
| Expiration Date: May, 2011 Brocade | Expiration Date: April 27, 2012 Brocade | |||
| Nabil Bitar | Nabil Bitar | |||
| Verizon | Verizon | |||
| October 25, 2010 | October 28, 2011 | |||
| ARP Broadcast Reduction for Large Data Centers | ARP Broadcast Reduction for Large Data Centers | |||
| draft-shah-armd-arp-reduction-01.txt | draft-shah-armd-arp-reduction-02.txt | |||
| Status of this Memo | Status of this Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
| other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
| Drafts. | Drafts. | |||
| skipping to change at line 37 | skipping to change at line 37 | |||
| months and may be updated, replaced, or obsoleted by other documents | months and may be updated, replaced, or obsoleted by other documents | |||
| at any time. It is inappropriate to use Internet-Drafts as reference | at any time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
| http://www.ietf.org/1id-abstracts.html | http://www.ietf.org/1id-abstracts.html | |||
| The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
| http://www.ietf.org/shadow.html | http://www.ietf.org/shadow.html | |||
| This Internet-Draft will expire on May 25, 2011 | This Internet-Draft will expire on April 27, 2012 | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2010 IETF Trust and the persons identified as the | Copyright (c) 2011 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| Internet Draft draft-shah-arp-reduction-01.txt | Internet Draft draft-shah-arp-reduction-02.txt | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with | carefully, as they describe your rights and restrictions with | |||
| respect to this document. Code Components extracted from this | respect to this document. Code Components extracted from this | |||
| document must include Simplified BSD License text as described in | document must include Simplified BSD License text as described in | |||
| Section 4.e of the Trust Legal Provisions and are provided without | Section 4.e of the Trust Legal Provisions and are provided without | |||
| warranty as described in the Simplified BSD License. | warranty as described in the Simplified BSD License. | |||
| Abstract | Abstract | |||
| With the emergence of server virtualization technologies, a host is | With the advent of server virtualization technologies, a host is able to | |||
| able to support multiple Virtual Machines (VMs) in a single physical | support multiple Virtual Machines (VMs) in a single physical | |||
| machine. Data centers can leverage these capabilities to instantiate | machine. Data Centers can leverage these capabilities to instantiate | |||
| on the order of 10s to 100s of VMs in a server. Each VM operates as | on the order of 10s to 100s of VMs in a single server with current | |||
| an independent IP host with a set of Virtual Network Interface Cards | technology. It is conceivable that this number can be much higher | |||
| (vNICs), each having its own MAC address and mapping to a physical | in the future. Each VM operates as an independent IP host with a set | |||
| Ethernet interface. These physical servers are typically installed | of Virtual Network Interface Cards (vNICs), each having its own MAC | |||
| in a rack with their Ethernet interfaces connected to a top-of-rack | address and mapping to a physical Ethernet interface. These physical | |||
| (ToR) switch. The ToR switches are interconnected through End-of- | servers are typically installed in a rack with their Ethernet | |||
| the-Row (EoR) or aggregation switches which are in turn connected to | interfaces connected to a top-of-the-rack (ToR) switch. The ToR | |||
| core switches. | switches are interconnected through End-of-the-Row (EoR) or | |||
| aggregation switches which are in turn connected to core switches. | ||||
| As discussed in [ARP-Problem] the host VMs use ARP broadcasts to | As discussed in [ARP-Problem] the host VMs use ARP broadcasts to | |||
| find other host VMs and use periodic (broadcast) Gratuitous ARPs to | find other host VMs and use periodic (broadcast) Gratuitous ARPs to | |||
| refresh their IP to MAC address binding in other VM hosts. Such | refresh their IP to MAC address binding in other VM hosts. Such | |||
| broadcasts in a large data center with potentially thousands of VM | broadcasts in a large data center with potentially thousands of VM | |||
| hosts in a Layer 2 based topology can overwhelm the network. | hosts in a Layer 2 based topology can overwhelm the network. | |||
| This memo proposes mechanisms to reduce the number of broadcasts | This memo proposes mechanisms to reduce the number of broadcasts | |||
| that are sent throughout the network. This is done by having the ToR | that are sent throughout the network. This is done by having the | |||
| switches intelligently process ARP packets, rather than simply | ToRs intelligently process ARP frames, rather than simply | |||
| broadcasting them throughout the broadcast domain. | broadcasting them throughout the broadcast domain. | |||
| While this document specifically addresses ARP, the Neighbor | While this document addresses ARP, the Neighbor Discovery mechanisms | |||
| Discovery mechanisms used by IPv6 hosts that make use of multicast | used by the IPv6 hosts that make use of multicast rather than | |||
| rather than broadcast also pose similar issues for the data center. | broadcast also pose similar issues in the Data Center. The solutions | |||
| The solutions defined herein should be equally applicable to hosts | defined herein should be equally applicable to hosts running IPv6. | |||
| running IPv6. The details will be specified in a subsequent | The details will be specified in a subsequent revision. | |||
| revision. | ||||
| Conventions | Conventions | |||
| The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
| "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
| document are to be interpreted as described in RFC 2119 [RFC 2119]. | document are to be interpreted as described in RFC 2119 [RFC 2119]. | |||
| Table of Contents | Table of Contents | |||
| Copyright Notice .................................................... 1 | Copyright Notice ........................................... 1 | |||
| Abstract.............................................................. 2 | Abstract .................................................... 2 | |||
| 1.0 Overview.......................................................... 3 | 1.0 Overview ................................................ 3 | |||
| 1.1 Terminology ..................................................... 5 | 1.1 Terminology ............................................ 5 | |||
| 2.0 Configuration..................................................... 6 | 2.0 Configuration ........................................... 6 | |||
| 3.0 Building the ARP Tables........................................... 6 | 3.0 Building the ARP tables ................................. 6 | |||
| 3.1 ARP Request ..................................................... 6 | 3.1 ARP Requests ........................................... 6 | |||
| 3.2 ARP Reply ....................................................... 7 | 3.2 ARP Reply .............................................. 7 | |||
| 3.3 Gratuitous ARP .................................................. 7 | 3.3 Gratuitous ARP ......................................... 7 | |||
| 3.4 Uplink Versus Downlink Processing ............................... 8 | 3.4 Host movement .......................................... 8 | |||
| 3.5 Host Mobility ................................................... 8 | 4.0 Conclusion .............................................. 9 | |||
| 4.0 Concluding Remarks................................................ 9 | 5.0 Security Considerations ................................. 10 | |||
| 5.0 Security Considerations ......................................... 10 | 6.0 Acknowledgments ......................................... 10 | |||
| 6.0 Acknowledgments ................................................. 10 | 7.0 References .............................................. 10 | |||
| 7.0 References....................................................... 10 | 7.1 Normative References.................................... 10 | |||
| 7.1 Normative References ........................................... 10 | 7.2 Informative References ................................. 10 | |||
| 7.2 Informative References ......................................... 10 | 8.0 Author's Address ........................................ 11 | |||
| 8.0 Author's Address................................................. 10 | ||||
| 1.0 Overview | 1.0 Overview | |||
| The traditional topology in a data center consists of racks of | The traditional topology in a data center consists of racks of | |||
| servers connected to top-of-rack (ToR) switches, which connect to | servers connected to top-of-rack (ToR) switches, which connect to | |||
| aggregation switches, which in turn connect to core switches. The | aggregation switches, which in turn connect to core switches. The | |||
| network architecture is typically a combination of Layer 2 and Layer 3 | network architecture typically combines Layer 2 and Layer 3. In | |||
| functionality. In some architectures, Layer 2 is terminated at the | some architectures, Layer 2 is terminated at the ToR, with Layer 3 | |||
| ToR, with Layer 3 being run in the aggregation and core devices. In | being run in the aggregation and core devices. In other | |||
| other architectures, Layer 2 may be extended all the way to the | architectures, Layer 2 may be extended all the way to the | |||
| aggregation switch. The primary concerns that have influenced | aggregation switch. The primary concerns that have influenced | |||
| network architectures in the data center have been keeping broadcast | network architectures in the data center have been keeping broadcast | |||
| domains manageable and the spanning tree diameter contained. | domains manageable and spanning tree domains contained. | |||
| Moving forward, these traditional network architectures are being | Moving forward, these traditional network architectures are being | |||
| challenged due to emerging technologies such as server | challenged due to emerging technologies such as server | |||
| virtualization. | virtualization. | |||
| The effect of server virtualization in the data center brings some | The effect of server virtualization in the data center brings some | |||
| challenges. Because of virtualization, the number of hosts seen by | challenges. Because of virtualization, the number of hosts that the | |||
| the network increases dramatically - 10 to 100 times the number of | network sees increases dramatically - 10 to 100 times the number of | |||
| physical servers. These virtual hosts are referred to as Virtual | physical servers. These virtual hosts are referred to as Virtual | |||
| Machines (VMs). In addition, virtualized environments offer a | Machines (VMs). VMs offer server mobility wherein a VM can be | |||
| feature referred to as "VM mobility" wherein a VM can be relocated | relocated to run on a different physical server. In order for the | |||
| to run on a different physical server. In order for the VM mobility | mobility to be non-disruptive to other hosts that have communication | |||
| to be non-disruptive to other hosts that have communication in | in progress with the VM being moved, the VM must retain its MAC | |||
| progress with the VM being moved, the VM must retain its MAC address | address and IP address. Because of the requirement to retain the | |||
| and IP address. Because of the requirement to retain the MAC and IP | MAC and IP address, it is desirable to develop network architectures | |||
| address, it is desirable to develop network architectures that would | that would offer the least restrictions in terms of server mobility. | |||
| offer the least restrictions in terms of VM mobility. | ||||
| As an example, in a network architecture where ToR switches | As an example, in a network architecture where ToR switches | |||
| terminate the L2 domain, the range of VM mobility would be | terminate the L2 domain, the range of mobility would be restricted | |||
| restricted to a single ToR switch. It would be preferable to | to a single ToR switch. It would be preferable to allow the | |||
| allow the flexibility of moving the VM anywhere within the data | flexibility of moving the VM anywhere within the data center, or | |||
| center, or perhaps even a different data center. | perhaps even a different data center. | |||
| Technologies such as TRILL [TRILL] overcome some of the issues of | Technologies such as TRILL [TRILL] overcome some of the issues of | |||
| spanning trees that forced traditional Layer 2 topologies to be | spanning trees because of which traditional Layer 2 topologies have | |||
| severely constrained. However, because of virtualization there are | been constrained. However, because of virtualization there are 2 | |||
| 2 specific problems that are introduced with respect to broadcast | specific problems that are introduced with respect to broadcast | |||
| traffic. | traffic. | |||
| 1. A larger number of hosts. A single physical server now hosts | 1. A larger number of hosts. A single physical server now hosts | |||
| multiple VMs taking the scale factor to a different level. If | multiple virtual machines taking the scale factor to a | |||
| each VM issues the same number of broadcasts as a physical | different level. If each VM has the same number of broadcasts | |||
| server, the amount of broadcast traffic will increase 10 to | as a physical server, the amount of broadcast traffic has | |||
| greater than 100 times. | increased 10 to greater than 100 times. | |||
| 2. If the Layer 2 domains are extended to go across data centers, | 2. If the Layer 2 domains are extended to go across data centers, | |||
| then broadcast traffic will now go across the backbone. If | then broadcast traffic will now go across the backbone. If | |||
| Layer 2 was terminated at the ToR switch, the increase in | Layer 2 was terminated at the ToR switch, the increase in | |||
| broadcast traffic would have been restricted to a single ToR | broadcast traffic would have been restricted to a single ToR | |||
| switch, but as discussed earlier, this restriction is not | switch, but as discussed earlier, this restriction is not | |||
| desirable. | desirable. | |||
| Excessive broadcast traffic in Layer 2 networks results in wastage | The broadcast as such in Layer 2 networks has far reaching impacts; | |||
| of network bandwidth, as well as in the wastage of CPU resources due | i.e. wastage in network bandwidth as well as CPU resources used by | |||
| to all of the VMs processing superfluous ARP broadcasts (IPv6 gets | all the VMs while processing superfluous ARP broadcasts (IPv6 gets | |||
| rid of the latter by running ND as a multicast service rather than a | rid of the latter by running ND as a multicast service rather than a | |||
| broadcast service). | broadcast service). | |||
| The solution presented here attempts to minimize the negative | The solution presented here attempts to minimize negative effects of | |||
| effects of ARP broadcast packets. The solution requires the first | ARP broadcasts. The solution requires the first hop Ethernet | |||
| hop Ethernet switches, typically the ToR switch, to maintain an ARP | switches, typically ToR, to maintain an ARP table learned from the | |||
| table that is learned from the ARP packets received by the switch. | ARP PDUs received by the switch and selectively propagate the ARP | |||
| The switch then selectively propagates the ARP packet to, or proxy- | to, or proxy-responds on behalf of, the remote peer. These types of | |||
| responds on behalf of, the remote peer. These types of ARP | ARP processing principles are well known and used/described in L2VPN | |||
| processing principles are well-known and are described in L2VPN | Working Group documents such as [ARP-Mediation] and [IPLS]. The ARP | |||
| Working Group documents such as [ARP-Mediation] and [IPLS]. | proxy response differs from that described in [RFC 1027] as the ARP | |||
| response contains the MAC address of the destination and not that of the | ||||
| switch, as is suggested in [RFC 1027]. | ||||
| The following sections describe the details of ARP snooping, the | The following sections describe the details of ARP snooping, | |||
| learning and maintenance of ARP tables, the use of learned | learning and maintaining ARP tables, using the learned information | |||
| information to limit broadcast propagation, and proxy (the response) | to limit broadcast propagation and proxy (the response) on behalf of | |||
| on behalf of the remote peers. | the remote peers. | |||
| 1.1 Terminology | 1.1 Terminology | |||
| ToR Top-of-Rack. An Ethernet switch present on top | ToR switch Top-of-Rack switch. An Ethernet switch installed | |||
| of a rack which provides network connectivity to | at the top of a rack of servers which provides | |||
| the servers present on the rack. | network connectivity to those servers. | |||
| Downlink The Ethernet link between the ToR switch and a | Downlink The Ethernet link between the ToR switch and a | |||
| directly connected host (server in the rack). | directly connected host/server in the rack. | |||
| Uplink The network-facing Ethernet connection in the | Uplink The network-facing Ethernet connection in the | |||
| ToR switch. Typically, the uplinks from ToRs | ToR switch. Typically, the uplinks from ToRs | |||
| connect to end-of-row or aggregation switches. | connect to end-of-row or aggregation switches. | |||
| EoR End-of-Row. An Ethernet switch to which the | EoR switch End-of-Row switch. An Ethernet switch which | |||
| ToR switches connect, also referred to as an | aggregates traffic from multiple racks. Also | |||
| aggregation switch. Uplinks from ToR switches | commonly referred to as an aggregation switch. | |||
| connect to an EoR switch and uplinks from EoR | Uplinks from the ToR connect to EoR switches | |||
| switches connect to a core switch. | and uplinks from EoR switches in turn connect | |||
| to core switches. | ||||
| Host/Server A host or server running the IP protocol. This | Host/Server A host or server running the IP protocol. This | |||
| could be a physical entity or a logical entity | could be a physical entity or a logical entity | |||
| (such as a Virtual Machine) in a physical host. | (such as a Virtual Machine) in a physical host. | |||
| The term server refers to its role in the data | The term server refers to its role in the data | |||
| center. Both terms are used interchangeably to | center. Both terms are used interchangeably | |||
| refer to an IP host. | and refer to an IP end station. | |||
| Local hosts Used in the context of a ToR switch to denote | Local hosts Used in the context of a ToR switch to denote | |||
| the VM hosts connected to a ToR on the | the VM hosts connected to a ToR switch on the | |||
| downlink, i.e. directly attached hosts. | downlink, i.e. directly connected hosts. | |||
| Remote hosts Used in the context of a ToR switch to denote | Remote hosts Used in the context of a ToR switch to denote | |||
| the hosts that are accessible through uplink of | the hosts that are accessible via the uplink of | |||
| the ToR. | the ToR switch. | |||
| VM Virtual Machine. This is a logical instance of | VM Virtual Machine. This is a logical instance of | |||
| a host that operates independently in a | a host that operates independently in a | |||
| physical host and has its own IP and MAC | physical host and has its own IP and MAC | |||
| addresses. VMs allow efficient use of physical | addresses. The VM architecture allows efficient | |||
| host resources (such as multiple CPU cores). | use of physical host resources (such as | |||
| multiple CPU cores). | ||||
| 2.0 Configuration | 2.0 Configuration | |||
| It is assumed that ARP reduction mechanisms that are defined in this | It is assumed that ARP reduction methodologies that are defined in | |||
| document will be limited to ToR switches. The maximum benefit of | this document will be limited to ToR switches. The maximum benefit | |||
| restraining ARP broadcasts in the network is achieved by the first | of restraining ARP broadcasts in the network is achieved by the | |||
| hop switches (the ones directly connected to the hosts) without | first hop switches (the ones directly connected to the hosts) | |||
| placing additional burden on second or third tier switches. | without placing additional burden on second or third tier switches. | |||
| First, the ToR switches would need to be configured in order to | First, the ToR switches would need to be configured in order to | |||
| enable the ARP reduction feature. Every Ethernet interface needs to | enable the ARP reduction feature. Every Ethernet interface needs to | |||
| be identified as either a downlink or uplink within the context of | be identified as either a downlink or uplink within the context of | |||
| this feature. | this feature. The ARP reduction feature treats ARP frames received | |||
| from downlink or uplink differently as described in the following | ||||
| sections. | ||||
| In addition, the operator may optionally configure various ARP | In addition, the operator may optionally configure various ARP | |||
| reduction related parameters such as: | reduction related parameters such as: | |||
| . ARP aging timer. | . ARP aging timer, | |||
| . Size of the ARP table. | . size of the ARP table, | |||
| . Static entries of IP to MAC address. | . static entries of IP to MAC address, etc. | |||
| 3.0 Building the ARP Tables | 3.0 Building the ARP tables | |||
| When ARP reduction is enabled, the ToR switch will monitor all ARP | When ARP reduction is enabled, the ToR switch will monitor all ARP | |||
| traffic transiting the switch (regardless of uplink port or downlink | traffic transiting the switch (regardless of uplink port or downlink | |||
| port) and will process any ARP packets in the following manner: | port) and will process any ARP PDUs in the following manner: | |||
| . ARP Request packets must be redirected to control plane CPU. | . ARP Request PDUs must be redirected to control plane CPU. | |||
| . Gratuitous ARP packets (ARP Reply packet with a broadcast MAC | . Gratuitous ARP PDUs (ARP Reply PDU with a broadcast MAC DA) | |||
| DA) must be redirected to control plane CPU. | must be redirected to control plane CPU. | |||
| . Other ARP Reply packets (ARP Reply packet with a unicast MAC | . Other ARP Reply PDUs (ARP Reply PDU with a unicast MAC DA) | |||
| DA) should be bi-casted; one copy sent to the control plane CPU and | should be bi-casted; one copy sent to the control plane CPU and | |||
| the other copy forwarded out normally. | the other copy forwarded out normally. | |||
| 3.1 ARP Request | 3.1 ARP Requests | |||
| The ToR examines the source IP and the source hardware address (MAC | The ToR examines the source IP and the source hardware address (MAC | |||
| address) in the ARP Request. The source IP and MAC address | address) in the ARP Request. The source IP and MAC address | |||
| association is learned, or is updated/refreshed if already learned. | association is learned, or is updated/refreshed if already learned. | |||
| The destination IP address is searched in the ARP table. If an entry | The destination IP address is searched in the ARP table. If an entry | |||
| exists, the associated MAC address from the table is used to prepare | exists, the associated MAC address from the table is used to prepare | |||
| a unicast ARP Reply packet. The same MAC address is used as the | a unicast ARP Reply PDU. The same MAC address is used as the source | |||
| source MAC address in the MAC header, as well as for the target | MAC address in the MAC header, as well as for the target hardware | |||
| hardware address, in the unicast ARP Reply packet. | address, in the unicast ARP Reply PDU. | |||
| If the destination IP address in the ARP Request is not present in | If the destination IP address in the request is not present in the | |||
| the ARP table, then the original ARP Request packet is broadcast to | ARP table, then the original ARP request PDU is broadcast to all the | |||
| all the switch ports that are members of the same VLAN except the | switch ports that are members of the same VLAN except the source port | |||
| source port that the ARP Request was received from. However, if the | that the Request was received from. However, if the requested | |||
| requested (destination) IP address is present in the ARP table, a | (destination) IP address is present in the ARP table, a unicast ARP | |||
| unicast ARP Reply packet is prepared as described above and sent to | Reply PDU is prepared as described above and sent to the switch port | |||
| the switch port from which the ARP Request was received and original | from which the ARP Request was received and original ARP request PDU | |||
| ARP Request packet is dropped. | is dropped. | |||
| The intent is to prevent propagation of ARP Request broadcasts as | The intent is to prevent propagation of ARP Request PDU broadcasts | |||
| much as possible using the information present in the ARP table. The | as much as possible using the information present in the ARP table. | |||
| following observations can be made from such behavior. | The following observations can be made from such behavior. | |||
| . Most of the ARP Request packets from the local hosts of a ToR | . Most of the ARP requests from the local hosts of a ToR switch | |||
| switch for the local hosts of that ToR switch can be prevented | for the local hosts of the ToR switch can be prevented. | |||
| from being broadcast on uplinks or downlinks. | . Most of the ARP requests from the remote hosts of a ToR switch | |||
| . Most of the ARP Request packets from remote hosts of a ToR | for the local hosts of the ToR switch can be prevented from | |||
| switch for local hosts of that ToR switch can be prevented | getting forwarded on downlinks or other uplinks of the ToR | |||
| from being broadcast on downlinks or other uplinks of the ToR | ||||
| switch. | switch. | |||
| . Many of the ARP Request packets from local hosts of a ToR | . Many of the ARP requests from the local hosts of a ToR switch | |||
| switch for remote hosts of that ToR switch can be prevented | for the remote hosts of the ToR switch can be prevented from | |||
| from being forwarded on uplinks if the remote host IP to MAC | being forwarded on uplinks if the remote host IP to MAC | |||
| association is known to the ToR switch. | association is known to the ToR switch. | |||
| 3.2 ARP Reply | 3.2 ARP Reply | |||
| The unicast ARP Reply is examined to learn/update the ARP table for | The unicast ARP Reply is examined to learn/update the ARP table for | |||
| source and destination IP/MAC address association, but is also | source and destination IP/MAC address association, but is also | |||
| forwarded out as a normal frame. | forwarded out as a normal frame. | |||
| 3.3 Gratuitous ARP | 3.3 Gratuitous ARP | |||
| Gratuitous ARP is a broadcast ARP Reply packet with the destination | Gratuitous ARP is a broadcast ARP Reply PDU with destination IP | |||
| IP address set to the IP address of the sender and target hardware | address set to the IP address of the sender and target hardware | |||
| address set to the MAC address of the sender. It is typically used | address set to the MAC address of the sender. It is typically used | |||
| by IP hosts (including VMs) to keep their IP-to-MAC address | by the IP hosts (including VMs) to keep their associations fresh in | |||
| associations fresh in their peers' ARP caches. | their peers' ARP caches. | |||
| The ToR switch should process Gratuitous ARP in the following | The ToR switch should process Gratuitous ARP in the following | |||
| manner. | manner. | |||
| . Learn/update/refresh the ARP table entry. | . Learn/update/refresh the ARP table entry. | |||
| . If the IP address is new, or exists but with a different | . If the IP address is new, or exists but with a different | |||
| hardware address, then the Gratuitous ARP packet is forwarded | hardware address, then the Gratuitous ARP PDU is forwarded | |||
| out; otherwise the packet is discarded. | out; otherwise the PDU is discarded. | |||
| The goal for handling of Gratuitous ARP packets received from the | The goal for handling of the Gratuitous ARP PDU received from the | |||
| downlinks (i.e. local hosts) is to avoid propagating it into the | downlinks (i.e. local hosts) is to avoid propagating it into the | |||
| 'network' (i.e. to the uplinks), unless there is a new association. | 'network' (i.e. to uplinks), unless there is a new association. | |||
| By suppressing the propagation of Gratuitous ARP packets, the peer | By suppressing the propagation of Gratuitous ARP PDUs, the peer IP | |||
| IP hosts will end up aging out the corresponding ARP table entries. | hosts will end up aging out the corresponding ARP table entries. | |||
| This will result in generation of the broadcast ARP Requests by | This will result in generation of the broadcast ARP Requests by | |||
| those IP hosts if they need to continue to communicate with the IP | those IP hosts if they need to continue to communicate with the IP | |||
| host whose Gratuitous ARPs were obstructed. The handling of the ARP | host whose Gratuitous ARPs were obstructed. The handling of the ARP | |||
| Request by the first-hop ToR switch, as described above, will be | Request, as described above, by the first hop ToR switch will be | |||
| able to respond to this request based on the ARP cache maintained in | able to respond to this request based on the ARP cache maintained in | |||
| the ToR switch. In essence, the presence of large ARP tables with | the ToR switch. In essence, presence of large ARP tables with longer | |||
| longer aging times compensates for the smaller ARP table present in | age out times compensates for the smaller ARP table present in the | |||
| the IP hosts and eliminates the need for periodic use of Gratuitous | ||||
| ARPs in order to refresh the ARP table in the IP hosts. | ||||
| 3.4 Uplink Versus Downlink Processing | ||||
| With respect to processing of the ARP packets as described above, | ||||
| the behavior is different depending on whether the packet was | ||||
| received from an uplink or downlink in the following ways. | ||||
| . The aging timer will typically be higher for entries learned | IP hosts and eliminates the need for periodic use of Gratuitous ARPs | |||
| from an uplink versus those learned from a downlink. The | in order to refresh the ARP table in the IP hosts. | |||
| reason for this is to avoid flooding ARP broadcast packets on | ||||
| uplinks since they have a much larger negative impact. | ||||
| . If ARP table fills up, then entries learned from downlinks | ||||
| (i.e. directly attached hosts) will take precedence over those | ||||
| learned from an uplink (i.e. remote hosts). This will trade | ||||
| off sending broadcasts on host links versus sending them into | ||||
| the core of the network. The reason for this is that access | ||||
| links are typically lower bandwidth, and also this will | ||||
| conserve CPU resources involved in processing unnecessary ARP | ||||
| traffic. | ||||
| 3.5 Host Mobility | 3.4 Host movement | |||
| As mentioned earlier, server virtualization technology allows | As mentioned earlier, server virtualization technology allows | |||
| mobility of VMs to different physical servers. The flexibility to | movement of VMs to different physical servers. The flexibility to | |||
| move VMs is one of the key benefits of server virtualization. VM | move VMs is one of the key benefits of server virtualization. The | |||
| mobility could be manual (operator initiated) or may be done | VM movement could be manual (operator initiated) or may be done | |||
| automatically in reaction to demands placed by the application | automatically in reaction to demands placed by the application | |||
| users. The important point is that in either case, VM movement is | users. The important point is that in either case, VM movement is | |||
| not transparent and is made known to the network. | not transparent and is made known to the network. | |||
| There is ongoing work in IEEE 802.1 standards organization (IEEE | There is ongoing work in IEEE 802.1 standards organization (IEEE | |||
| 802.1Qbg) to coordinate/communicate the presence and capabilities of | 802.1Qbg) to coordinate/communicate the presence and capabilities of | |||
| the VMs to the directly connected network switch. | the VMs to the directly connected network switch. | |||
| VMs typically retain their MAC and IP address across a VM mobility | VMs typically retain their MAC and IP address, and as such, there | |||
| event, and as such, there would be little impact to the ARP table | would be little impact to the ARP table maintained by the ARP | |||
| maintained by the ARP reduction mechanism described herein. | reduction mechanism described herein. However, the ARP reduction | |||
| However, the ARP reduction mechanism would benefit from knowing if a | mechanism would benefit from knowing if a VM is completely | |||
| VM is completely decommissioned so that the ToR switch can remove | decommissioned so that the ToR can remove the ARP entry it has for | |||
| the ARP entry that it has for that VM in a timely fashion, rather | that VM in a timely fashion, rather than waiting for it to timeout. | |||
| than waiting for it to age out. | ||||
| 3.5 Applicability to environments with overlay transport | ||||
| Recently, there have been multiple proposals for using overlay | ||||
| transport technologies such as VXLAN [VXLAN] and NVGRE [NVGRE]. | ||||
| These proposals allow the network operator to build the network | ||||
| using L2 or L3 technologies while building an L2-overlay on top of | ||||
| that. As such, while they address the issue of network design, they | ||||
| do not eliminate the need for a mechanism to reduce the amount of | ||||
| broadcast traffic that may have to traverse the core, if there are | ||||
| VMs of the same tenant on servers attached to different ToR | ||||
| switches. | ||||
| One of the ways for the overlay transport proposals to address this | ||||
| issue would be to implement the mechanism discussed in this document | ||||
| at the point where the overlay encapsulation and decapsulation is | ||||
| performed (i.e. in the virtual switch). | ||||
| 3.6 Scaling Considerations | 3.6 Scaling Considerations | |||
| Depending on the number of hosts in the network, the ARP table in a | Depending on the number of hosts in the networks, the ARP table can | |||
| ToR switch needed for the ARP reduction mechanisms described above | be quite large. Although it is possible to implement some of the | |||
| can be quite large. Although it is possible to implement some of the | mechanisms for ARP reduction as described in this document in | |||
| mechanisms for ARP reduction in hardware in the forwarding plane, | hardware in the forwarding plane, the number of ARP entries may | |||
| the number of ARP entries favors maintaining the ARP table in the | favor maintaining the ARP table in the control plane memory. | |||
| control plane memory. | ||||
| 3.7 Miscellaneous Issues | 3.7 Miscellaneous Issues | |||
| Because of the distributed nature of the mechanisms described | Because of the distributed nature of the mechanisms described | |||
| herein, there are a few additional issues that warrant consideration | herein, there are a few additional issues that warrant consideration | |||
| from the network operator. | from the network operator. | |||
| Earlier in the document, we had mentioned the configuration of an | Earlier in the document, we had mentioned the configuration of a | |||
| aging timer for ARP entries. A longer timer for holding onto ARP | timer for ARP entries. A longer timer for holding on to ARP entries | |||
| entries helps with reduction of broadcasts. However, having a "too | helps with reduction of broadcasts. However, the risk of having a | |||
| large timer" can lead to problems in certain situations. | "too large timer" can cause problems in certain situations. | |||
| Consider the following scenario. Host A is attached to ToR switch | Consider the following scenario. Host A is attached to ToR switch | |||
| #1, and host B is attached to ToR switch #2. If host B issues an | #1, and host B is attached to ToR switch #2. If host B issues an | |||
| ARP Request for host A, and if the entry is available at switch #2, | ARP request for host A, if the entry is available at switch #2, then | |||
| then switch #2 would send the ARP Reply on behalf of host A. It is | switch #2 would send the ARP Reply on behalf of host A. It is | |||
| possible that host A is no longer available, but there is no way for | possible that host A is no longer available, but there is no way for | |||
| switch #2 to know this, and it would continue to respond on behalf | switch #2 to know this, and it would continue to respond on behalf | |||
| of host A, until its entry for host A has aged out. In this case, | of host A, until its entry for host A has timed out. In this case, | |||
| it is easy to see that a smaller aging timer would be beneficial. | it is easy to see that a smaller timer would be beneficial. | |||
| Additionally, since host B has an ARP aging timer, it means that | Additionally, since host B has an ARP age timer, it means that host | |||
| host B would find out about host A's unavailability only after its | B would find out about host A's unavailability only after its entry | |||
| entry has aged out, which would be some time after the entry has | has aged, which would be after it has aged out of switch #2. | |||
| aged out of switch #2. | ||||
| Another issue that can be somewhat problematic could be the | Another issue that can be somewhat problematic could be the | |||
| inconsistency of tables in switches. Once again, consider a | inconsistency of tables in switches. Once again, consider a | |||
| scenario similar to the one described above with two hosts each | scenario similar to the one described above with 2 hosts each | |||
| connected to its respective ToR switch. Let the ARP entries at both A | connected to its respective ToR switch. Let the ARP entries at both A | |||
| and B be learned by both switches. Now assume that the IP address | and B be learned by both switches. Now assume that the IP address | |||
| on host A changes. This change is signaled to switch #1 which in | on host A changes. This change is signaled to switch #1 which in | |||
| turn broadcasts the message on its uplink. Now, if this message is | turn broadcasts the message on its uplink. Now, if this message is | |||
| discarded due to network congestion or signal integrity issues, then | discarded due to network congestion or signal integrity issues, then | |||
| switch #2 will not learn about the change and will continue to | switch #2 will not learn about the change and will continue to | |||
| respond to host B's ARP Requests for host A's old IP address with | respond to host B's ARP Requests for host A's old IP address with | |||
| stale information. This lasts until the ARP entry for A ages out at | stale information. This lasts until the ARP entry for A times out | |||
| Switch #2. | at Switch #2. | |||
| 4.0 Concluding Remarks | 4.0 Conclusion | |||
| Based on the procedures described in this document, it is possible | Based on the procedures described in this document, it is possible | |||
| for ToR switches in the data center to contain ARP broadcasts | for ToR switches in the data center to contain ARP broadcasts | |||
| significantly. The solution is based on well known, non-intrusive | significantly. The solution is based on well known, non-intrusive | |||
| procedures and strives to curtail ARP broadcasts that are | procedures and strives to curtail broadcasts that are increasingly | |||
| increasingly becoming a cause for concern in the data centers. In | becoming a cause for concern in the data centers. In essence, ToR | |||
| essence, ToR switches offload some of the ARP table management from | switches facilitate the offloading of the extended ARP table | |||
| the IP hosts to themselves. The ARP table aging timer can be tuned | management from the IP hosts to itself. The ARP table timeout can be | |||
| higher by the operator based on the available switch resources and | tuned higher by the operator based on the available switch resources | |||
| network traffic behavior. The larger capacity of the ARP table | and network traffic behavior. The larger capacity of the ARP table | |||
| coupled with a long aging time for entries in the table directly | directly translates to more effective subduing of the ARP | |||
| translates to more effective subduing of the ARP broadcasts. | broadcasts. | |||
| 5.0 Security Considerations | 5.0 Security Considerations | |||
| Security aspects will be addressed in a subsequent revision. | The details of the security aspects will be addressed in future | |||
| revision. | ||||
| 6.0 Acknowledgments | 6.0 Acknowledgments | |||
| This document resulted from discussions with Linda Dunbar (Huawei), | This document resulted from discussions with Linda Dunbar (Huawei), | |||
| Sue Hares (Huawei), and T Sridhar (Force10). We would like to | Sue Hares (Huawei), and T Sridhar (VMware). We would like to | |||
| acknowledge their contribution to this work. | acknowledge their contribution to this work. | |||
| 7.0 References | 7.0 References | |||
| 7.1 Normative References | 7.1 Normative References | |||
| [ARP] D. Plummer, "An Ethernet Address Resolution Protocol: Or | [ARP] D. Plummer, "An Ethernet Address Resolution Protocol: Or | |||
| Converting Network Protocol Addresses to 48.bit Ethernet | Converting Network Protocol Addresses to 48.bit Ethernet | |||
| Addresses for Transmission on Ethernet Hardware, " RFC 826 (also | Addresses for Transmission on Ethernet Hardware," RFC 826, STD | |||
| STD 37), November 1982. | 37. | |||
| [ARP-Problem] L.Dunbar et al., "Scalable Address Resolution for | [ARP-Problem] T. Narten, "Problem Statement for ARMD," | |||
| Large Data Center Problem Statements," <draft-dunbar-arp-for- | work in progress, <draft-ietf-armd-problem-statement>. | |||
| large-dc-problem-statement-00>, July 2010. | ||||
| 7.2 Informative References | 7.2 Informative References | |||
| [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking | [ARP-Mediation] H. Shah et al., "ARP Mediation for IP interworking | |||
| in Layer 2 VPN," <draft-ietf-l2vpn-arp-mediation-14>, July 2010. | in Layer 2 VPN," work in progress, <draft-ietf-l2vpn-arp- | |||
| mediation>. | ||||
| [IPLS] H.Shah et al., "IP-only LAN service," | [IPLS] H.Shah et al., "IP-only LAN service," work in progress, | |||
| <draft-ietf-l2vpn-ipls-09>, February 2010. | <draft-ietf-l2vpn-ipls>. | |||
| [PROXY-ARP] J. Postel, "Multi-LAN Address Resolution," RFC 925, | [PROXY-ARP] J. Postel, "Multi-LAN Address Resolution," RFC 925. | |||
| October 1984. | ||||
| [TRILL] R. Perlman et al., "RBridges: Base Protocol Specification", | [RFC1027] Smoot et al., "Using ARP to Implement Transparent Subnet | |||
| <draft-ietf-trill-rbridge-protocol-16>, March 2010. | Gateways". | |||
| [VXLAN] M. Mahalingam et al., "VXLAN: A Framework for Overlaying | ||||
| Virtualized Layer 2 Networks over Layer 3 Networks," work in | ||||
| progress, <draft-mahalingam-dutt-dcops-vxlan>. | ||||
| [NVGRE] M. Sridharan et al., "NVGRE: Network Virtualization using | ||||
| Generic Routing Encapsulation," work in progress, | ||||
| <draft-sridharan-virtualization-nvgre>. | ||||
| 8.0 Author's Address | 8.0 Author's Address | |||
| Himanshu Shah | Himanshu Shah | |||
| Ciena Corp | Ciena Corp | |||
| Email: hshah@ciena.com | Email: hshah@ciena.com | |||
| Anoop Ghanwani | Anoop Ghanwani | |||
| Brocade | Brocade | |||
| Email: anoop@brocade.com | Email: anoop@alumni.duke.edu | |||
| Nabil Bitar | Nabil Bitar | |||
| Verizon | Verizon | |||
| Email: nabil.n.bitar@verizon.com | Email: nabil.n.bitar@verizon.com | |||
| End of changes. 79 change blocks. | ||||
| 248 lines changed or deleted | 252 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||
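
For illustration only, and not part of either draft revision: the Gratuitous ARP handling that both revisions describe for the ToR switch (learn/update/refresh the entry, then forward only when the IP address is new or its hardware address has changed) might be sketched as follows. The names `TorArpTable`, `ArpEntry`, and `handle_gratuitous_arp`, and the `last_seen` timestamp field, are assumptions introduced for this sketch. Aging, table-size limits, and uplink versus downlink handling are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ArpEntry:
    mac: str
    last_seen: float  # time of the most recent learn/refresh (hypothetical field)


class TorArpTable:
    """Hypothetical ARP table kept by a ToR switch; names are illustrative."""

    def __init__(self) -> None:
        self.entries: Dict[str, ArpEntry] = {}

    def handle_gratuitous_arp(self, ip: str, mac: str, now: float) -> bool:
        """Learn/update/refresh the entry for `ip`.

        Returns True when the Gratuitous ARP should be forwarded (the IP is
        new, or it is known with a different hardware address), and False
        when it only refreshes an existing association and can be discarded.
        """
        old = self.entries.get(ip)
        is_new_association = old is None or old.mac != mac
        # The table is always updated, whether or not the PDU is forwarded.
        self.entries[ip] = ArpEntry(mac=mac, last_seen=now)
        return is_new_association


table = TorArpTable()
first = table.handle_gratuitous_arp("10.0.0.5", "00:11:22:33:44:55", now=0.0)
repeat = table.handle_gratuitous_arp("10.0.0.5", "00:11:22:33:44:55", now=30.0)
moved = table.handle_gratuitous_arp("10.0.0.5", "66:77:88:99:aa:bb", now=60.0)
print(first, repeat, moved)
```

In this sketch a periodic refresh from an unchanged host updates the timestamp but is not propagated, which is the suppression behavior the text relies on to keep Gratuitous ARPs off the uplinks.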