| < draft-ietf-rtgwg-bgp-routing-large-dc-05.txt | draft-ietf-rtgwg-bgp-routing-large-dc-06.txt > | |||
|---|---|---|---|---|
| Routing Area Working Group P. Lapukhov | Routing Area Working Group P. Lapukhov | |||
| Internet-Draft Facebook | Internet-Draft Facebook | |||
| Intended status: Informational A. Premji | Intended status: Informational A. Premji | |||
| Expires: February 1, 2016 Arista Networks | Expires: February 20, 2016 Arista Networks | |||
| J. Mitchell, Ed. | J. Mitchell, Ed. | |||
| July 31, 2015 | August 19, 2015 | |||
| Use of BGP for routing in large-scale data centers | Use of BGP for routing in large-scale data centers | |||
| draft-ietf-rtgwg-bgp-routing-large-dc-05 | draft-ietf-rtgwg-bgp-routing-large-dc-06 | |||
| Abstract | Abstract | |||
| Some network operators build and operate data centers that support | Some network operators build and operate data centers that support | |||
| over one hundred thousand servers. In this document, such data | over one hundred thousand servers. In this document, such data | |||
| centers are referred to as "large-scale" to differentiate them from | centers are referred to as "large-scale" to differentiate them from | |||
| smaller infrastructures. Environments of this scale have a unique | smaller infrastructures. Environments of this scale have a unique | |||
| set of network requirements with an emphasis on operational | set of network requirements with an emphasis on operational | |||
| simplicity and network stability. This document summarizes | simplicity and network stability. This document summarizes | |||
| operational experience in designing and operating large-scale data | operational experience in designing and operating large-scale data | |||
| skipping to change at page 1, line 41 ¶ | skipping to change at page 1, line 41 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on February 1, 2016. | This Internet-Draft will expire on February 20, 2016. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2015 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| skipping to change at page 2, line 34 ¶ | skipping to change at page 2, line 34 ¶ | |||
| 3.2.1. Overview . . . . . . . . . . . . . . . . . . . . . . 7 | 3.2.1. Overview . . . . . . . . . . . . . . . . . . . . . . 7 | |||
| 3.2.2. Clos Topology Properties . . . . . . . . . . . . . . 8 | 3.2.2. Clos Topology Properties . . . . . . . . . . . . . . 8 | |||
| 3.2.3. Scaling the Clos topology . . . . . . . . . . . . . . 9 | 3.2.3. Scaling the Clos topology . . . . . . . . . . . . . . 9 | |||
| 3.2.4. Managing the Size of Clos Topology Tiers . . . . . . 10 | 3.2.4. Managing the Size of Clos Topology Tiers . . . . . . 10 | |||
| 4. Data Center Routing Overview . . . . . . . . . . . . . . . . 11 | 4. Data Center Routing Overview . . . . . . . . . . . . . . . . 11 | |||
| 4.1. Layer 2 Only Designs . . . . . . . . . . . . . . . . . . 11 | 4.1. Layer 2 Only Designs . . . . . . . . . . . . . . . . . . 11 | |||
| 4.2. Hybrid L2/L3 Designs . . . . . . . . . . . . . . . . . . 12 | 4.2. Hybrid L2/L3 Designs . . . . . . . . . . . . . . . . . . 12 | |||
| 4.3. Layer 3 Only Designs . . . . . . . . . . . . . . . . . . 12 | 4.3. Layer 3 Only Designs . . . . . . . . . . . . . . . . . . 12 | |||
| 5. Routing Protocol Selection and Design . . . . . . . . . . . . 13 | 5. Routing Protocol Selection and Design . . . . . . . . . . . . 13 | |||
| 5.1. Choosing EBGP as the Routing Protocol . . . . . . . . . . 13 | 5.1. Choosing EBGP as the Routing Protocol . . . . . . . . . . 13 | |||
| 5.2. EBGP Configuration for Clos topology . . . . . . . . . . 14 | 5.2. EBGP Configuration for Clos topology . . . . . . . . . . 15 | |||
| 5.2.1. EBGP Configuration Guidelines and Example ASN Scheme 15 | 5.2.1. EBGP Configuration Guidelines and Example ASN Scheme 15 | |||
| 5.2.2. Private Use ASNs . . . . . . . . . . . . . . . . . . 16 | 5.2.2. Private Use ASNs . . . . . . . . . . . . . . . . . . 16 | |||
| 5.2.3. Prefix Advertisement . . . . . . . . . . . . . . . . 17 | 5.2.3. Prefix Advertisement . . . . . . . . . . . . . . . . 17 | |||
| 5.2.4. External Connectivity . . . . . . . . . . . . . . . . 18 | 5.2.4. External Connectivity . . . . . . . . . . . . . . . . 18 | |||
| 5.2.5. Route Summarization at the Edge . . . . . . . . . . . 19 | 5.2.5. Route Summarization at the Edge . . . . . . . . . . . 19 | |||
| 6. ECMP Considerations . . . . . . . . . . . . . . . . . . . . . 19 | 6. ECMP Considerations . . . . . . . . . . . . . . . . . . . . . 19 | |||
| 6.1. Basic ECMP . . . . . . . . . . . . . . . . . . . . . . . 20 | 6.1. Basic ECMP . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
| 6.2. BGP ECMP over Multiple ASNs . . . . . . . . . . . . . . . 21 | 6.2. BGP ECMP over Multiple ASNs . . . . . . . . . . . . . . . 21 | |||
| 6.3. Weighted ECMP . . . . . . . . . . . . . . . . . . . . . . 21 | 6.3. Weighted ECMP . . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 6.4. Consistent Hashing . . . . . . . . . . . . . . . . . . . 22 | 6.4. Consistent Hashing . . . . . . . . . . . . . . . . . . . 22 | |||
| skipping to change at page 3, line 49 ¶ | skipping to change at page 3, line 49 ¶ | |||
| choice, presents details of the EBGP routing design, and | choice, presents details of the EBGP routing design, and | |||
| explores ideas for further enhancements. | explores ideas for further enhancements. | |||
| This document first presents an overview of network design | This document first presents an overview of network design | |||
| requirements and considerations for large-scale data centers. Then | requirements and considerations for large-scale data centers. Then | |||
| traditional hierarchical data center network topologies are | traditional hierarchical data center network topologies are | |||
| contrasted with Clos networks [CLOS1953] that are horizontally scaled | contrasted with Clos networks [CLOS1953] that are horizontally scaled | |||
| out. This is followed by arguments for selecting EBGP with a Clos | out. This is followed by arguments for selecting EBGP with a Clos | |||
| topology as the most appropriate routing protocol to meet the | topology as the most appropriate routing protocol to meet the | |||
| requirements and the proposed design is described in detail. | requirements and the proposed design is described in detail. | |||
| Finally, the document reviews some additional considerations and | Finally, this document reviews some additional considerations and | |||
| design options. | design options. | |||
| 2. Network Design Requirements | 2. Network Design Requirements | |||
| This section describes and summarizes network design requirements for | This section describes and summarizes network design requirements for | |||
| large-scale data centers. | large-scale data centers. | |||
| 2.1. Bandwidth and Traffic Patterns | 2.1. Bandwidth and Traffic Patterns | |||
| The primary requirement when building an interconnection network for | The primary requirement when building an interconnection network for | |||
| skipping to change at page 5, line 7 ¶ | skipping to change at page 5, line 7 ¶ | |||
| o Driving costs down using competitive pressures, by introducing | o Driving costs down using competitive pressures, by introducing | |||
| multiple network equipment vendors. | multiple network equipment vendors. | |||
| In order to allow for good vendor diversity, it is important to | In order to allow for good vendor diversity, it is important to | |||
| minimize the software feature requirements for the network elements. | minimize the software feature requirements for the network elements. | |||
| This strategy provides maximum flexibility of vendor equipment | This strategy provides maximum flexibility of vendor equipment | |||
| choices while enforcing interoperability using open standards. | choices while enforcing interoperability using open standards. | |||
| 2.3. OPEX Minimization | 2.3. OPEX Minimization | |||
| Operating large-scale infrastructure could be expensive, provided | Operating large-scale infrastructure can be expensive as a larger | |||
| that a larger amount of elements will statistically fail more often. | number of elements will statistically fail more often. Having a | |||
| Having a simpler design and operating using a limited software | simpler design and operating using a limited software feature set | |||
| feature set minimizes software issue-related failures. | minimizes software issue-related failures. | |||
| An important aspect of Operational Expenditure (OPEX) minimization is | An important aspect of Operational Expenditure (OPEX) minimization is | |||
| reducing the size of failure domains in the network. Ethernet networks | reducing the size of failure domains in the network. Ethernet networks | |||
| are known to be susceptible to broadcast or unicast traffic storms | are known to be susceptible to broadcast or unicast traffic storms | |||
| that can have a dramatic impact on network performance and | that can have a dramatic impact on network performance and | |||
| availability. The use of a fully routed design significantly reduces | availability. The use of a fully routed design significantly reduces | |||
| the size of the data plane failure domains - i.e. limits them to the | the size of the data plane failure domains - i.e. limits them to the | |||
| lowest level in the network hierarchy. However, such designs | lowest level in the network hierarchy. However, such designs | |||
| introduce the problem of distributed control plane failures. This | introduce the problem of distributed control plane failures. This | |||
| observation calls for simpler and fewer control plane protocols to | observation calls for simpler and fewer control plane protocols to | |||
| skipping to change at page 5, line 34 ¶ | skipping to change at page 5, line 34 ¶ | |||
| requirements. | requirements. | |||
| 2.4. Traffic Engineering | 2.4. Traffic Engineering | |||
| In any data center, application load balancing is a critical function | In any data center, application load balancing is a critical function | |||
| performed by network devices. Traditionally, load balancers are | performed by network devices. Traditionally, load balancers are | |||
| deployed as dedicated devices in the traffic forwarding path. The | deployed as dedicated devices in the traffic forwarding path. The | |||
| problem arises in scaling load balancers under growing traffic | problem arises in scaling load balancers under growing traffic | |||
| demand. A preferable solution would be able to scale the load | demand. A preferable solution would be able to scale the load | |||
| balancing layer horizontally by adding more uniform nodes and | balancing layer horizontally by adding more uniform nodes and | |||
| distributing incoming traffic across these nodes. In situation like | distributing incoming traffic across these nodes. In situations like | |||
| this, an ideal choice would be to use network infrastructure itself | this, an ideal choice would be to use network infrastructure itself | |||
| to distribute traffic across a group of load balancers. The | to distribute traffic across a group of load balancers. The | |||
| combination of Anycast prefix advertisement [RFC4786] and Equal Cost | combination of Anycast prefix advertisement [RFC4786] and Equal Cost | |||
| Multipath (ECMP) functionality can be used to accomplish this goal. | Multipath (ECMP) functionality can be used to accomplish this goal. | |||
| To allow for more granular load distribution, it is beneficial for | To allow for more granular load distribution, it is beneficial for | |||
| the network to support the ability to perform controlled per-hop | the network to support the ability to perform controlled per-hop | |||
| traffic engineering. For example, it is beneficial to directly | traffic engineering. For example, it is beneficial to directly | |||
| control the ECMP next-hop set for Anycast prefixes at every level of | control the ECMP next-hop set for Anycast prefixes at every level of | |||
| network hierarchy. | network hierarchy. | |||
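The combination of Anycast advertisement and per-flow ECMP described above can be sketched in a few lines. This is an illustration only, not part of the design: the hash function, addresses, and flow-tuple fields below are invented assumptions.

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Select one next-hop for a flow by hashing its 5-tuple.

    Packets of the same flow always hash to the same next-hop, so a
    flow stays on one path; distinct flows spread across the set.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return next_hops[digest % len(next_hops)]

# Hypothetical Anycast VIP served by four load-balancer instances:
load_balancers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
flow = ("192.0.2.7", "198.51.100.1", 49152, 80, "tcp")
nh = ecmp_next_hop(flow, load_balancers)  # same flow -> same next-hop
```

Real routers compute this hash in hardware over vendor-specific inputs; the point is only that per-flow hashing over an Anycast next-hop set spreads load without a dedicated appliance in the forwarding path.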
| skipping to change at page 10, line 21 ¶ | skipping to change at page 10, line 21 ¶ | |||
| main reason to limit oversubscription at a single layer of the | main reason to limit oversubscription at a single layer of the | |||
| network is to simplify application development that would otherwise | network is to simplify application development that would otherwise | |||
| need to account for multiple bandwidth pools: within rack (Tier-3), | need to account for multiple bandwidth pools: within rack (Tier-3), | |||
| between racks (Tier-2), and between clusters (Tier-1). Since | between racks (Tier-2), and between clusters (Tier-1). Since | |||
| oversubscription does not have a direct relationship to the routing | oversubscription does not have a direct relationship to the routing | |||
| design it is not discussed further in this document. | design it is not discussed further in this document. | |||
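As a numeric aside (with assumed port counts), the oversubscription of a tier is simply the ratio of server-facing bandwidth to uplink bandwidth:

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of downstream (server-facing) to upstream capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# A hypothetical Tier-3 (ToR) switch with 48x10G server-facing ports
# and 4x40G uplinks: 480G down / 160G up = 3:1 oversubscription.
ratio = oversubscription(48, 10, 4, 40)  # -> 3.0
```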
| 3.2.4. Managing the Size of Clos Topology Tiers | 3.2.4. Managing the Size of Clos Topology Tiers | |||
| If a data center network size is small, it is possible to reduce the | If a data center network size is small, it is possible to reduce the | |||
| number of switches in Tier-1 or Tier-2 of Clos topology by a power of | number of switches in Tier-1 or Tier-2 of Clos topology by a factor | |||
| two. To understand how this could be done, take Tier-1 as an | of two. To understand how this could be done, take Tier-1 as an | |||
| example. Every Tier-2 device connects to a single group of Tier-1 | example. Every Tier-2 device connects to a single group of Tier-1 | |||
| devices. If half of the ports on each of the Tier-1 devices are not | devices. If half of the ports on each of the Tier-1 devices are not | |||
| being used then it is possible to reduce the number of Tier-1 devices | being used then it is possible to reduce the number of Tier-1 devices | |||
| by half and simply map two uplinks from a Tier-2 device to the same | by half and simply map two uplinks from a Tier-2 device to the same | |||
| Tier-1 device that were previously mapped to different Tier-1 | Tier-1 device that were previously mapped to different Tier-1 | |||
| devices. This technique maintains the same bisectional bandwidth | devices. This technique maintains the same bisectional bandwidth | |||
| while reducing the number of elements in the Tier-1 layer, thus | while reducing the number of elements in the Tier-1 layer, thus | |||
| saving on CAPEX. The tradeoff, in this example, is the reduction of | saving on CAPEX. The tradeoff, in this example, is the reduction of | |||
| maximum DC size in terms of overall server count by half. | maximum DC size in terms of overall server count by half. | |||
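The halving tradeoff described above can be sketched numerically. The model below is a deliberate simplification with assumed parameters: every Tier-2 device uses `links_per_t1` parallel uplinks to each Tier-1 device it connects to, so doubling that value halves both the Tier-1 devices required and the maximum fabric size.

```python
def clos_size(t1_ports, t2_per_cluster, links_per_t1,
              t3_per_cluster, servers_per_t3):
    """Max clusters and servers in a 5-stage Clos (simplified model).

    One cluster consumes t2_per_cluster * links_per_t1 ports on
    every Tier-1 device, which bounds the number of clusters.
    """
    clusters = t1_ports // (t2_per_cluster * links_per_t1)
    return clusters, clusters * t3_per_cluster * servers_per_t3

full = clos_size(32, 4, 1, 32, 48)    # (8, 12288): one link per Tier-1
halved = clos_size(32, 4, 2, 32, 48)  # (4, 6144): two links per Tier-1,
                                      # half the maximum DC size
```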
| skipping to change at page 11, line 23 ¶ | skipping to change at page 11, line 23 ¶ | |||
| Originally most data center designs used Spanning-Tree Protocol (STP) | Originally most data center designs used Spanning-Tree Protocol (STP) | |||
| defined in [IEEE8021D-1990] for loop-free topology | defined in [IEEE8021D-1990] for loop-free topology | |||
| creation, typically utilizing variants of the traditional DC topology | creation, typically utilizing variants of the traditional DC topology | |||
| described in Section 3.1. At the time, many DC switches either did | described in Section 3.1. At the time, many DC switches either did | |||
| not support Layer 3 routed protocols or supported them with additional | not support Layer 3 routed protocols or supported them with additional | |||
| licensing fees, which played a part in the design choice. Although | licensing fees, which played a part in the design choice. Although | |||
| many enhancements have been made through the introduction of Rapid | many enhancements have been made through the introduction of Rapid | |||
| Spanning Tree Protocol (RSTP) in the latest revision of | Spanning Tree Protocol (RSTP) in the latest revision of | |||
| [IEEE8021D-2004] and Multiple Spanning Tree Protocol (MST) specified | [IEEE8021D-2004] and Multiple Spanning Tree Protocol (MST) specified | |||
| in [IEEE8021Q] that improve convergence, stability, and load | in [IEEE8021Q] that improve convergence, stability, and load | |||
| balancing in larger topologies many of the fundamentals of the | balancing in larger topologies, many of the fundamentals of the | |||
| protocol limit its applicability in large-scale DCs. STP and its | protocol limit its applicability in large-scale DCs. STP and its | |||
| newer variants use an active/standby approach to path selection and | newer variants use an active/standby approach to path selection and | |||
| are therefore hard to deploy in horizontally-scaled topologies as | are therefore hard to deploy in horizontally-scaled topologies as | |||
| described in Section 3.2. Further, operators have had many | described in Section 3.2. Further, operators have had many | |||
| experiences with large failures due to issues caused by improper | experiences with large failures due to issues caused by improper | |||
| cabling, misconfiguration, or flawed software on a single device. | cabling, misconfiguration, or flawed software on a single device. | |||
| These failures regularly affected the entire spanning-tree domain and | These failures regularly affected the entire spanning-tree domain and | |||
| were very hard to troubleshoot due to the nature of the protocol. | were very hard to troubleshoot due to the nature of the protocol. | |||
| For these reasons, and since almost all DC traffic is now IP, | For these reasons, and since almost all DC traffic is now IP, | |||
| thereby requiring a Layer 3 routing protocol at the network edge | thereby requiring a Layer 3 routing protocol at the network edge | |||
| for external connectivity, designs utilizing STP usually fail all of | for external connectivity, designs utilizing STP usually fail all of | |||
| the requirements of large-scale DC operators. Various enhancements | the requirements of large-scale DC operators. Various enhancements | |||
| to link-aggregation protocols such as [IEEE8023AD], generally known | to link-aggregation protocols such as [IEEE8023AD], generally known | |||
| as Multi-Chassis Link-Aggregation (M-LAG) made it possible to use | as Multi-Chassis Link-Aggregation (M-LAG) made it possible to use | |||
| Layer 2 designs with active-active network paths while relying on STP | Layer 2 designs with active-active network paths while relying on STP | |||
| as the backup for loop prevention. The major downside of this | as the backup for loop prevention. The major downsides of this | |||
| approach is the proprietary nature of such extensions. | approach are the lack of ability to scale linearly past two in most | |||
| implementations, lack of standards based implementations, and added | ||||
| failure domain risk of keeping state between the devices. | ||||
| It should be noted that building large, horizontally scalable, Layer | It should be noted that building large, horizontally scalable, Layer | |||
| 2 only networks without STP has recently become possible through the | 2 only networks without STP has recently become possible through the | |||
| introduction of the TRILL protocol in [RFC6325]. TRILL resolves many | introduction of the TRILL protocol in [RFC6325]. TRILL resolves many | |||
| of the issues STP has for large-scale DC design however currently the | of the issues STP has for large-scale DC design however due to the | |||
| maturity of the protocol, limited number of implementations, and | lack of maturity of the protocol, the limited number of | |||
| requirement for new equipment that supports it has limited its | implementations, and requirement for new equipment that supports it, | |||
| applicability and increased the cost of such designs. | this has limited its applicability and increased the cost of such | |||
| designs. | ||||
| Finally, neither TRILL nor the M-LAG approach eliminates the | Finally, neither TRILL nor the M-LAG approach eliminates the | |||
| fundamental problem of the shared broadcast domain, which is so | fundamental problem of the shared broadcast domain, which is so | |||
| detrimental to the operation of any Layer 2, Ethernet-based | detrimental to the operation of any Layer 2, Ethernet-based | |||
| solution. | solution. | |||
| 4.2. Hybrid L2/L3 Designs | 4.2. Hybrid L2/L3 Designs | |||
| Operators have sought to limit the impact of data plane faults and | Operators have sought to limit the impact of data plane faults and | |||
| build large-scale topologies through implementing routing protocols | build large-scale topologies through implementing routing protocols | |||
| in either the Tier-1 or Tier-2 parts of the network and dividing the | in either the Tier-1 or Tier-2 parts of the network and dividing the | |||
| Layer-2 domain into numerous, smaller domains. This design has | Layer 2 domain into numerous, smaller domains. This design has | |||
| allowed data centers to scale up, but at the cost of complexity in | allowed data centers to scale up, but at the cost of complexity in | |||
| the network managing multiple protocols. For the following reasons, | the network managing multiple protocols. For the following reasons, | |||
| operators have retained Layer 2 in either the access (Tier-3) or both | operators have retained Layer 2 in either the access (Tier-3) or both | |||
| access and aggregation (Tier-3 and Tier-2) parts of the network: | access and aggregation (Tier-3 and Tier-2) parts of the network: | |||
| o Supporting legacy applications that may require direct Layer 2 | o Supporting legacy applications that may require direct Layer 2 | |||
| adjacency or use non-IP protocols. | adjacency or use non-IP protocols. | |||
| o Seamless mobility for virtual machines that require the | o Seamless mobility for virtual machines that require the | |||
| preservation of IP addresses when a virtual machine moves to | preservation of IP addresses when a virtual machine moves to | |||
| a different Tier-3 switch. | a different Tier-3 switch. | |||
| o Simplified IP addressing = fewer IP subnets are required for the | o Simplified IP addressing = fewer IP subnets are required for the | |||
| data center. | data center. | |||
| o Application load balancing may require direct Layer 2 reachability | o Application load balancing may require direct Layer 2 reachability | |||
| to perform certain functions such as Layer 2 Direct Server Return | to perform certain functions such as Layer 2 Direct Server Return | |||
| (DSR). | (DSR). | |||
| o Continued CAPEX differences between Layer-2 and Layer-3 capable | o Continued CAPEX differences between Layer 2 and Layer 3 capable | |||
| switches. | switches. | |||
| 4.3. Layer 3 Only Designs | 4.3. Layer 3 Only Designs | |||
| Network designs that leverage IP routing down to Tier-3 of the | Network designs that leverage IP routing down to Tier-3 of the | |||
| network have gained popularity as well. The main benefit of these | network have gained popularity as well. The main benefit of these | |||
| designs is improved network stability and scalability, as a result of | designs is improved network stability and scalability, as a result of | |||
| confining L2 broadcast domains. Commonly an Interior Gateway | confining L2 broadcast domains. Commonly an Interior Gateway | |||
| Protocol (IGP) such as OSPF [RFC2328] is used as the primary routing | Protocol (IGP) such as OSPF [RFC2328] is used as the primary routing | |||
| protocol in such a design. As data centers grow in scale, and server | protocol in such a design. As data centers grow in scale, and server | |||
| count exceeds tens of thousands, such fully routed designs have | count exceeds tens of thousands, such fully routed designs have | |||
| become more attractive. | become more attractive. | |||
| Choosing a Layer 3 only design greatly simplifies the network, | Choosing a Layer 3 only design greatly simplifies the network, | |||
| facilitating the meeting of REQ1 and REQ2, and has widespread | facilitating the meeting of REQ1 and REQ2, and has widespread | |||
| adoption in networks where large Layer 2 adjacency and larger size | adoption in networks where large Layer 2 adjacency and larger size | |||
| Layer 3 subnets are not as critical compared to network scalability | Layer 3 subnets are not as critical compared to network scalability | |||
| and stability. Application providers and network operators also | and stability. Application providers and network operators also | |||
| continue to develop new solutions to meet some of the requirements that | continue to develop new solutions to meet some of the requirements that | |||
| previously have driven large Layer 2 domains. | previously have driven large Layer 2 domains by using various overlay | |||
| or tunneling techniques. | ||||
| 5. Routing Protocol Selection and Design | 5. Routing Protocol Selection and Design | |||
| In this section the motivations for using External BGP (EBGP) as the | In this section the motivations for using External BGP (EBGP) as the | |||
| single routing protocol for data center networks having a Layer 3 | single routing protocol for data center networks having a Layer 3 | |||
| protocol design and Clos topology are reviewed. Then, a practical | protocol design and Clos topology are reviewed. Then, a practical | |||
| approach for designing an EBGP based network is provided. | approach for designing an EBGP based network is provided. | |||
| 5.1. Choosing EBGP as the Routing Protocol | 5.1. Choosing EBGP as the Routing Protocol | |||
| skipping to change at page 14, line 12 ¶ | skipping to change at page 14, line 14 ¶ | |||
| flow-control, BGP simply relies on TCP as the underlying | flow-control, BGP simply relies on TCP as the underlying | |||
| transport. This fulfills REQ2 and REQ3. | transport. This fulfills REQ2 and REQ3. | |||
| o BGP information flooding overhead is less when compared to link- | o BGP information flooding overhead is less when compared to link- | |||
| state IGPs. Since every BGP router calculates and propagates only | state IGPs. Since every BGP router calculates and propagates only | |||
| the best-path selected, a network failure is masked as soon as the | the best-path selected, a network failure is masked as soon as the | |||
| BGP speaker finds an alternate path, which exists when highly | BGP speaker finds an alternate path, which exists when highly | |||
| symmetric topologies, such as Clos, are coupled with EBGP only | symmetric topologies, such as Clos, are coupled with EBGP only | |||
| design. In contrast, the event propagation scope of a link-state | design. In contrast, the event propagation scope of a link-state | |||
| IGP is an entire area, regardless of the failure type. This meets | IGP is an entire area, regardless of the failure type. This meets | |||
| REQ3 and REQ4. It is worth mentioning that all widely deployed | REQ3 and REQ4. It is also worth mentioning that all widely | |||
| link-state IGPs also feature periodic refreshes of routing | deployed link-state IGPs feature periodic refreshes of routing | |||
| information, while BGP does not expire routing state, even if this | information, even if this rarely causes impact to modern router | |||
| rarely causes significant impact to modern router control planes. | control planes, while BGP does not expire routing state. | |||
| o BGP supports third-party (recursively resolved) next-hops. This | o BGP supports third-party (recursively resolved) next-hops. This | |||
| allows for manipulating multipath to be non-ECMP based or | allows for manipulating multipath to be non-ECMP based or | |||
| forwarding based on application-defined paths, through | forwarding based on application-defined paths, through | |||
| establishment of a peering session with an application | establishment of a peering session with an application | |||
| "controller" which can inject routing information into the system, | "controller" which can inject routing information into the system, | |||
| satisfying REQ5. OSPF provides similar functionality using | satisfying REQ5. OSPF provides similar functionality using | |||
| concepts such as "Forwarding Address", but with more difficulty in | concepts such as "Forwarding Address", but with more difficulty in | |||
| implementation and far less control of information propagation | implementation and far less control of information propagation | |||
| scope. | scope. | |||
| skipping to change at page 14, line 44 ¶ | skipping to change at page 14, line 46 ¶ | |||
| Using a traditional single flooding domain, which most DC designs | Using a traditional single flooding domain, which most DC designs | |||
| utilize, under certain failure conditions may pick up unwanted | utilize, under certain failure conditions may pick up unwanted | |||
| lengthy paths, e.g. traversing multiple Tier-2 devices. | lengthy paths, e.g. traversing multiple Tier-2 devices. | |||
| o EBGP configuration that is implemented with minimal routing policy | o EBGP configuration that is implemented with minimal routing policy | |||
| is easier to troubleshoot for network reachability issues. In | is easier to troubleshoot for network reachability issues. In | |||
| most implementations, it is straightforward to view contents of | most implementations, it is straightforward to view contents of | |||
| BGP Loc-RIB and compare it to the router's RIB. Also, in most | BGP Loc-RIB and compare it to the router's RIB. Also, in most | |||
| implementations an operator can view every BGP neighbor's Adj-RIB- | implementations an operator can view every BGP neighbor's Adj-RIB- | |||
| In and Adj-RIB-Out structures and therefore incoming and outgoing | In and Adj-RIB-Out structures and therefore incoming and outgoing | |||
| NRLI information can be easily correlated on both sides of a BGP | NLRI information can be easily correlated on both sides of a BGP | |||
| session. Thus, BGP satisfies REQ3. | session. Thus, BGP satisfies REQ3. | |||
5.2.  EBGP Configuration for Clos topology

   Clos topologies that have more than 5 stages are very uncommon due
   to the large numbers of interconnects required by such a design.
   Therefore, the examples below are made with reference to the 5-stage
   Clos topology (in unfolded state).

5.2.1.  EBGP Configuration Guidelines and Example ASN Scheme
skipping to change at page 15, line 28

   point links interconnecting the network nodes, no multi-hop or
   loopback sessions are used even in the case of multiple links
   between the same pair of nodes.

   o  Private Use ASNs from the range 64512-65534 are used so as to
      avoid ASN conflicts.

   o  A single ASN is allocated to all of the Clos topology's Tier-1
      devices.

   o  A unique ASN is allocated to each set of Tier-2 devices in the
      same cluster.

   o  A unique ASN is allocated to every Tier-3 device (e.g. ToR) in
      this topology.
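The allocation rules above can be sketched in a few lines.  This is a minimal, hypothetical illustration: the cluster and ToR counts, and the allocator function itself, are assumptions made for the example, not part of this design.

```python
# Hypothetical sketch of the ASN scheme above: one shared ASN for all
# Tier-1 devices, one ASN per Tier-2 cluster, and a unique ASN per
# Tier-3 (ToR) device, all drawn from the Private Use range.

PRIVATE_ASN_FIRST, PRIVATE_ASN_LAST = 64512, 65534
TIER1_ASN = 65534  # single ASN shared by all Tier-1 devices

def allocate_asns(num_clusters, tors_per_cluster):
    """Return per-cluster Tier-2 ASNs and per-ToR Tier-3 ASNs."""
    next_asn = PRIVATE_ASN_FIRST
    plan = {}
    for cluster in range(num_clusters):
        tier2_asn = next_asn          # shared by Tier-2 devices here
        next_asn += 1
        tier3_asns = list(range(next_asn, next_asn + tors_per_cluster))
        next_asn += tors_per_cluster  # unique ASN per ToR
        assert next_asn <= TIER1_ASN, "private ASN range exhausted"
        plan[cluster] = {"tier2": tier2_asn, "tier3": tier3_asns}
    return plan

plan = allocate_asns(num_clusters=4, tors_per_cluster=8)
# Cluster 0 gets Tier-2 ASN 64512 and ToR ASNs 64513..64520
```

The 16-bit Private Use range caps such a scheme at roughly a thousand ASNs, which is one motivation for the ASN reuse discussion elsewhere in this document.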
                                ASN 65534
                               +---------+
                               | +-----+ |
                               | |     | |
                             +-|-|     |-|-+
                             | | +-----+ | |
skipping to change at page 18, line 5

   subnets in a Clos topology results in route black-holing under a
   single link failure (e.g. between Tier-2 and Tier-3 devices) and
   hence must be avoided.  The use of peer links within the same tier
   to resolve the black-holing problem by providing "bypass paths" is
   undesirable due to the O(N^2) complexity of the peering mesh and the
   waste of ports on the devices.  An alternative to the full mesh of
   peer links would be using a simpler bypass topology, e.g. a "ring"
   as described in [FB4POST], but such a topology adds extra hops and
   has very limited bisection bandwidth, in addition to requiring
   special tweaks to make BGP routing work, such as possibly splitting
   every device into an ASN on its own.  Later in this document,
   Section 8.2 introduces a less intrusive method for performing a
   limited form of route summarization in Clos networks and discusses
   its associated trade-offs.
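The O(N^2) cost of a same-tier peering mesh mentioned above is easy to quantify.  A quick illustrative comparison with a ring bypass (device counts are arbitrary):

```python
# Links needed to interconnect N same-tier devices: a full mesh grows
# quadratically, while a ring bypass (as in [FB4POST]) stays linear at
# the cost of extra hops and far less bisection bandwidth.

def full_mesh_links(n):
    return n * (n - 1) // 2

def ring_links(n):
    # one link per device to its neighbor around the ring (n >= 3)
    return n

for n in (4, 16, 64):
    print(n, full_mesh_links(n), ring_links(n))
# 64 same-tier devices need 2016 mesh links but only 64 ring links
```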
5.2.4.  External Connectivity

   A dedicated cluster (or clusters) in the Clos topology could be used
   for the purpose of connecting to the Wide Area Network (WAN) edge
   devices, or WAN Routers.  Tier-3 devices in such a cluster would be
   replaced with WAN routers, and EBGP peering would be used again,
   though WAN routers are likely to belong to a public ASN if Internet
   connectivity is required in the design.  The Tier-2 devices in such
   a dedicated cluster will be referred to as "Border Routers" in this

skipping to change at page 19, line 24
   due to the lack of peer links inside every tier.

   However, it is possible to lift this restriction for the Border
   Routers, by devising a different connectivity model for these
   devices.  There are two possible options:

   o  Interconnect the Border Routers using a full mesh of physical
      links or any other "peer-mesh" topology, such as ring or
      hub-and-spoke.  Configure BGP accordingly on all Border Leafs to
      exchange network reachability information, e.g. by adding a mesh
      of IBGP sessions.  The interconnecting peer links need to be
      appropriately sized for the traffic that will be present in the
      case of a device or link failure underneath the Border Routers.
   o  Tier-1 devices may have additional physical links provisioned
      toward the Border Routers (which are Tier-2 devices from the
      perspective of Tier-1).  Specifically, if protection from a
      single link or node failure is desired, each Tier-1 device would
      have to connect to at least two Border Routers.  This puts
      additional requirements on the port count for Tier-1 devices and
      Border Routers, potentially making them non-uniform, larger
      port-count devices compared with the other devices in the Clos.
      This also reduces the number of ports available to "regular"
      Tier-2 switches and hence the number of clusters that could be
      interconnected via the Tier-1 layer.
   If any of the above options are implemented, it is possible to
   perform route summarization at the Border Routers toward the WAN
   network core without risking a routing black-hole condition under a
   single link failure.  Both of the options would result in a
   non-uniform topology, as additional links have to be provisioned on
   some network devices.

6.  ECMP Considerations

skipping to change at page 21, line 10
   able to connect to a multitude of Tier-1 devices if route
   summarization at the Border Router level is implemented as described
   in Section 5.2.5.  If a device's hardware does not support wider
   ECMP, logical link-grouping (link-aggregation at layer 2) could be
   used to provide "hierarchical" ECMP (Layer 3 ECMP followed by Layer
   2 ECMP) to compensate for fan-out limitations.  Such an approach,
   however, increases the risk of flow polarization, as less entropy
   will be available to the second stage of ECMP.
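The flow-polarization risk just described can be demonstrated with a toy model; the flows, seeds, and hash choice below are all hypothetical.  When both ECMP stages hash the same fields the same way, the second stage receives no fresh entropy:

```python
import hashlib

# Toy model of "hierarchical" ECMP: stage 1 (Layer 3) hashes a flow
# onto one of 4 LAGs, stage 2 (Layer 2) hashes it onto one of 2 LAG
# members.  If both stages use the same hash over the same fields, the
# LAG choice fully determines the member choice -- the polarization
# risk noted above.

def flow_hash(flow, seed):
    data = (repr(flow) + seed).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def pick(flow, n_choices, seed):
    return flow_hash(flow, seed) % n_choices

flows = [("10.0.%d.%d" % (i, j), "10.128.0.1", 80, 40000 + j)
         for i in range(4) for j in range(64)]

# Same seed at both stages: only 4 of the 8 (LAG, member) combinations
# can ever occur, leaving half of each LAG's members unused.
same_seed = {(pick(f, 4, "s1"), pick(f, 2, "s1")) for f in flows}
# Different seeds decorrelate the stages and restore the full fan-out.
diff_seed = {(pick(f, 4, "s1"), pick(f, 2, "s2")) for f in flows}
```

With 256 sample flows, `same_seed` can contain at most 4 distinct (LAG, member) pairs, while `diff_seed` will typically use all 8.  Real devices avoid this by perturbing the hash (per-stage seeds) rather than by changing the hashed fields.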
   Most BGP implementations declare paths to be equal from the ECMP
   perspective if they match up to and including step (e) in
   Section 9.1.2.2 of [RFC4271].  In the proposed network design there
   is no underlying IGP, so all IGP costs are assumed to be zero or
   otherwise the same value across all paths, and policies may be
   applied as necessary to equalize BGP attributes that vary in vendor
   defaults, such as MED and origin code.  For historical reasons it is
   also useful to not use 0 as the equalized MED value; this and some
   other useful BGP information is available in [RFC4277].  Routing
   loops are unlikely due to the BGP best-path selection process, which
   prefers shorter AS_PATH length, and longer paths through the Tier-1
   devices which don't allow their own ASN in the path and have the
   same ASN are

skipping to change at page 22, line 7
   and send more traffic over paths that have more capacity.  The
   prefixes that require weighted ECMP would have to be injected using
   a remote BGP speaker (central agent) over a multihop session as
   described further in Section 8.1.  If support in implementations is
   available, weight distribution for multiple BGP paths could be
   signaled using the technique described in
   [I-D.ietf-idr-link-bandwidth].
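A weighted ECMP group of the kind described above can be sketched as a bucket list sized proportionally to per-path weights, e.g. weights derived from the link bandwidth extended community of [I-D.ietf-idr-link-bandwidth].  The next-hop names and bandwidth values below are illustrative:

```python
from functools import reduce
from math import gcd

# Illustrative weighted ECMP: expand {next_hop: weight} into a list of
# hash buckets whose entry counts match the desired traffic split.

def build_weighted_group(next_hops):
    g = reduce(gcd, next_hops.values())  # minimize the bucket count
    buckets = []
    for nh, weight in sorted(next_hops.items()):
        buckets.extend([nh] * (weight // g))
    return buckets

# Two healthy 40G paths and one degraded 10G path: a 4:4:1 split.
group = build_weighted_group({"t1-a": 40, "t1-b": 40, "t1-c": 10})
# A flow's hash modulo len(group) then selects one bucket, so "t1-c"
# receives roughly 1/9th of the flows.
```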
6.4.  Consistent Hashing

   It is often desirable for the hashing function used for ECMP to be
   consistent (see [CONS-HASH]), to minimize the impact on flow to
   next-hop affinity when a next-hop is added to or removed from an
   ECMP group.  This could be used if the network device is used as a
   load balancer, mapping flows toward multiple destinations; in this
   case, losing or adding a destination will not have a detrimental
   effect on currently established flows.  One particular
   recommendation on implementing consistent hashing is provided in
   [RFC2992], though other implementations are possible.  This
   functionality could be naturally combined with weighted ECMP, with
   the impact of next-hop changes being proportional to the weight of
   the given next-hop.
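A minimal sketch of consistent flow-to-next-hop mapping follows.  It uses rendezvous (highest-random-weight) hashing, which is one possible scheme and not the hash-threshold method that [RFC2992] actually describes; the next-hop names and flow tuples are hypothetical:

```python
import hashlib

# Rendezvous hashing: each flow is sent to the next-hop with the
# highest per-(flow, next-hop) score.  Removing a next-hop only
# remaps the flows that were using it -- the "consistent" property.

def score(flow, nh):
    return hashlib.sha256((repr(flow) + nh).encode()).digest()

def pick_next_hop(flow, next_hops):
    return max(next_hops, key=lambda nh: score(flow, nh))

flows = [("10.0.0.%d" % i, 40000 + i) for i in range(200)]
before = {f: pick_next_hop(f, ["a", "b", "c", "d"]) for f in flows}
after = {f: pick_next_hop(f, ["a", "b", "c"]) for f in flows}

moved = [f for f in flows if before[f] != after[f]]
# Only flows previously mapped to the removed next-hop "d" move.
assert all(before[f] == "d" for f in moved)
```

Naive modulo hashing, by contrast, remaps on average (N-1)/N of all flows when the group size changes from N to N-1.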
   The downside of consistent hashing is increased load on hardware

skipping to change at page 22, line 50
   convergence delays, in the order of multiple seconds (on many BGP
   implementations the minimum configurable BGP hold timer value is
   three seconds).  However, many BGP implementations can shut down
   local EBGP peering sessions in response to the "link down" event for
   the outgoing interface used for BGP peering.  This feature is
   sometimes called "fast fallover".  Since links in modern data
   centers are predominantly point-to-point fiber connections, a
   physical interface failure is often detected in milliseconds and
   subsequently triggers a BGP re-convergence.
   Ethernet links may support failure signaling or detection standards
   such as Connectivity Fault Management (CFM) as described in
   [IEEE8021Q], which may make failure detection more robust.
   Alternatively, some platforms may support Bidirectional Forwarding
   Detection (BFD) [RFC5880] to allow for sub-second failure detection
   and fault signaling to the BGP process.  However, use of either of
   these presents additional requirements on vendor software and
   possibly hardware, and may contradict REQ1.  Until recently with
   [RFC7130], BFD also did not allow detection of a single member link
   failure on a LAG, which would have limited its usefulness in some
   designs.
7.2.  Event Propagation Timing

   In the proposed design the impact of the BGP Minimum Route
   Advertisement Interval (MRAI) timer (see Section 9.2.1.1 of
   [RFC4271]) should be considered.  Per the standard, BGP
   implementations are required to space out consecutive BGP UPDATE
   messages by at least MRAI seconds, which is often a configurable
   value.  The initial BGP UPDATE messages after an event carrying
   withdrawn routes are commonly not

skipping to change at page 25, line 23
   ECMP groups for all IP prefixes from the non-local cluster.  The
   Tier-3 devices are once again not involved in the re-convergence
   process, but may receive "implicit withdraws" as described above.

   Even though in the case of such failures multiple IP prefixes will
   have to be reprogrammed in the FIB, it is worth noting that ALL of
   these prefixes share a single ECMP group on a Tier-2 device.
   Therefore, in the case of implementations with a hierarchical FIB,
   only a single change has to be made to the FIB.  Hierarchical FIB
   here means a FIB structure where the next-hop forwarding information
   is stored separately from the prefix lookup table, and the latter
   only stores pointers to the respective forwarding information.
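The hierarchical FIB behavior just described can be illustrated with a toy structure; the device names and prefix counts are made up.  Prefixes hold a pointer to a shared ECMP group object, so one update to the group repoints every dependent prefix:

```python
# Toy hierarchical FIB: the prefix table stores pointers to a shared
# ECMP group, not copies of the next-hop list, so a single group
# rewrite is visible through all 64 prefixes at once.

class EcmpGroup:
    def __init__(self, next_hops):
        self.next_hops = list(next_hops)

groups = {"cluster2": EcmpGroup(["t1-a", "t1-b", "t1-c", "t1-d"])}
prefix_table = {p: groups["cluster2"]          # pointer, not a copy
                for p in ("10.2.%d.0/24" % i for i in range(64))}

# Tier-1 device t1-d fails: one update to the shared group...
groups["cluster2"].next_hops.remove("t1-d")

# ...and every prefix now resolves via the three surviving next-hops,
# with no per-prefix FIB reprogramming.
```

A flat FIB would instead require 64 separate entry rewrites for the same event.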
   Even though BGP offers reduced failure scope for some cases, further
   reduction of the fault domain using summarization is not always
   possible with the proposed design, since using this technique may
   create routing black-holes as mentioned previously.  Therefore, the
   worst control plane failure impact scope is the network as a whole,
   for instance in the case of a link failure between Tier-2 and Tier-3
   devices.  The number of impacted prefixes in this case would be much
   less than in the case of a failure in the upper layers of a Clos

skipping to change at page 26, line 11
   Tier-2 will bounce it back again using the default route.  This
   micro-loop will last for the duration of the time it takes the
   upstream device to fully update its forwarding tables.

   To minimize the impact of such micro-loops, Tier-2 and Tier-1
   switches can be configured with static "discard" or "null" routes
   that will be more specific than the default route for prefixes
   missing during network convergence.  For Tier-2 switches, the
   discard route should be a summary route, covering all server subnets
   of the underlying Tier-3 devices.  For Tier-1 devices, the discard
   route should be a summary covering the server IP address subnets
   allocated for the whole data center.  Those discard routes will only
   take precedence for the duration of network convergence, until the
   device learns a more specific prefix via a new path.
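The discard-route mechanism above can be illustrated with a toy longest-prefix-match table; the addresses and device names are hypothetical:

```python
import ipaddress

# Toy RIB: during convergence, a static "discard" summary is more
# specific than the default, so traffic to a withdrawn server prefix
# is dropped locally instead of micro-looping via the default route.

rib = {
    ipaddress.ip_network("0.0.0.0/0"): "default-to-upstream",
    ipaddress.ip_network("10.2.0.0/16"): "discard",    # cluster summary
    ipaddress.ip_network("10.2.5.0/24"): "tier3-sw5",  # learned route
}

def lookup(dst):
    """Longest-prefix match over the toy RIB."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in rib if addr in net]
    return rib[max(matches, key=lambda net: net.prefixlen)]

assert lookup("10.2.5.10") == "tier3-sw5"      # specific route wins
del rib[ipaddress.ip_network("10.2.5.0/24")]   # route withdrawn
assert lookup("10.2.5.10") == "discard"        # dropped, no micro-loop
assert lookup("192.0.2.1") == "default-to-upstream"
```

Once BGP re-converges and a new specific path is learned, the /24 re-enters the table and again takes precedence over the discard summary.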
8.  Additional Options for Design

8.1.  Third-party Route Injection

   BGP allows for a "third-party", i.e. directly attached, BGP speaker
   to inject routes anywhere in the network topology, meeting REQ5.
   This can be achieved by peering via a multihop BGP session with some
   or even all devices in the topology.  Furthermore, BGP diverse path
   distribution [RFC6774] could be used to inject multiple BGP next
   hops for the same prefix to facilitate load balancing, or the BGP
   ADD-PATH capability [I-D.ietf-idr-add-paths] could be used if
   supported by the implementation.  Unfortunately, in many
   implementations ADD-PATH has been found to only support IBGP
   properly due to the use cases it was originally optimized for, which
   limits the "third-party" peering to IBGP only, if the feature is
   used.
   To implement route injection in the proposed design, a third-party
   BGP speaker may peer with Tier-3 and Tier-1 switches, injecting the
   same prefix, but using a special set of BGP next-hops for Tier-1
   devices.  Those next-hops are assumed to resolve recursively via
   BGP, and could be, for example, IP addresses on Tier-3 devices.  The
   resulting forwarding table programming could provide the desired
   traffic proportion distribution among different clusters.
8.2.  Route Summarization within Clos Topology

skipping to change at page 27, line 8

   devices.  However, some operators may find route aggregation
   desirable to improve control plane stability.

   If planning on using any technique to summarize within the topology,
   modeling of the routing behavior and potential for black-holing
   should be done not only for single or multiple link failures, but
   also for fiber pathway failures or optical domain failures if the
   topology extends beyond a physical location.  Simple modeling can be
   done by checking the reachability on devices doing summarization
   under the condition of a link or pathway failure between a set of
   devices in every tier as well as to the WAN routers if external
   connectivity is present.
   Route summarization would be possible with a small modification to
   the network topology, though the trade-off would be a reduction of
   the total size of the network as well as network congestion under
   specific failures.  This approach is very similar to the technique
   described above, which allows Border Routers to summarize the entire
   data center address space.

8.2.1.  Collapsing Tier-1 Devices Layer

skipping to change at page 28, line 35
8.2.2.  Simple Virtual Aggregation

   A completely different approach to route summarization is possible,
   provided that the main goal is to reduce the FIB pressure, while
   allowing the control plane to disseminate full routing information.
   Firstly, it could easily be noted that in many cases multiple
   prefixes, some of which are less specific, share the same set of
   next-hops (same ECMP group).  For example, from the perspective of a
   Tier-3 device, all routes learned from upstream Tier-2 devices,
   including the default route, will share the same set of BGP
   next-hops, provided that there are no failures in the network.  This
   makes it possible to use a technique similar to that described in
   [RFC6769] and only install the least specific route in the FIB,
   ignoring more specific routes if they share the same next-hop set.
   For example, under normal network conditions, only the default route
   needs to be programmed into the FIB.
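The suppression logic described here can be sketched as a filter over a toy RIB; the prefixes and next-hop sets are hypothetical.  A more specific route is installed in the FIB only when its next-hop set differs from that of its covering, less specific route:

```python
import ipaddress

# Toy FIB suppression in the spirit of [RFC6769]: keep a more specific
# route only if its next-hop set differs from its covering route's.

def suppress(rib):
    """Return the subset of {prefix: next_hops} worth installing."""
    nets = sorted(rib, key=lambda n: n.prefixlen)  # least specific first
    fib = {}
    for net in nets:
        covers = [c for c in fib if net != c and net.subnet_of(c)]
        if covers:
            parent = max(covers, key=lambda c: c.prefixlen)
            if rib[net] == fib[parent]:
                continue  # same ECMP group as covering route: suppress
        fib[net] = rib[net]
    return fib

up = frozenset({"t2-a", "t2-b"})
rib = {ipaddress.ip_network("0.0.0.0/0"): up,
       ipaddress.ip_network("10.1.0.0/16"): up,               # healthy
       ipaddress.ip_network("10.9.9.0/24"): frozenset({"t2-a"})}  # failure
fib = suppress(rib)
# Only the default and the divergent /24 need FIB entries; the /16 is
# suppressed because it shares the default route's next-hop set.
```

Under normal conditions the FIB thus collapses to the default route alone; entries reappear only for prefixes whose next-hop set diverges during a failure.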
   Furthermore, if the Tier-2 devices are configured with summary
   prefixes covering all of their attached Tier-3 devices' prefixes,
   the same logic could be applied on Tier-1 devices as well, and, by
   induction, to Tier-2/Tier-3 switches in different clusters.  These

skipping to change at page 30, line 11
10.  IANA Considerations

   This document includes no request to IANA.

11.  Acknowledgements

   This publication summarizes the work of many people who participated
   in developing, testing and deploying the proposed network design,
   some of whom were George Chen, Parantap Lahiri, Dave Maltz, Edet
   Nkposong, Robert Toomey, and Lihua Yuan.  The authors would also
   like to thank Linda Dunbar, Anoop Ghanwani, Susan Hares, Danny
   McPherson, Robert Raszuk and Russ White for reviewing this document
   and providing valuable feedback, and Mary Mitchell for initial
   grammar and style suggestions.
12.  References

12.1.  Normative References

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006,
              <http://www.rfc-editor.org/info/rfc4271>.

skipping to change at page 32, line 14
| [I-D.ietf-idr-link-bandwidth] | [I-D.ietf-idr-link-bandwidth] | |||
| Mohapatra, P. and R. Fernando, "BGP Link Bandwidth | Mohapatra, P. and R. Fernando, "BGP Link Bandwidth | |||
| Extended Community", draft-ietf-idr-link-bandwidth-06 | Extended Community", draft-ietf-idr-link-bandwidth-06 | |||
| (work in progress), January 2013. | (work in progress), January 2013. | |||
| [CLOS1953] | [CLOS1953] | |||
| Clos, C., "A Study of Non-Blocking Switching Networks: | Clos, C., "A Study of Non-Blocking Switching Networks: | |||
| Bell System Technical Journal Vol. 32(2)", March 1953. | Bell System Technical Journal Vol. 32(2)", March 1953. | |||
| [HADOOP] Apache, "Apache Hadoop", July 2015, | [HADOOP] Apache, "Apache Hadoop", August 2015, | |||
| <https://hadoop.apache.org/>. | <https://hadoop.apache.org/>. | |||
| [GREENBERG2009] | [GREENBERG2009] | |||
| Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a | Greenberg, A., Hamilton, J., and D. Maltz, "The Cost of a | |||
| Cloud: Research Problems in Data Center Networks", January | Cloud: Research Problems in Data Center Networks", January | |||
| 2009. | 2009. | |||
| [IEEE8021D-1990] | [IEEE8021D-1990] | |||
| IEEE 802.1D, "IEEE Standard for Local and Metropolitan | IEEE 802.1D, "IEEE Standard for Local and Metropolitan | |||
| Area Networks--Media access control (MAC) Bridges", May | Area Networks--Media access control (MAC) Bridges", May | |||
| skipping to change at page 32, line 46 ¶ | skipping to change at page 32, line 46 ¶ | |||
| [INTERCON] | [INTERCON] | |||
| Dally, W. and B. Towles, "Principles and Practices of | Dally, W. and B. Towles, "Principles and Practices of | |||
| Interconnection Networks", ISBN 978-0122007514, January | Interconnection Networks", ISBN 978-0122007514, January | |||
| 2004. | 2004. | |||
| [ALFARES2008] | [ALFARES2008] | |||
| Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, | Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, | |||
| Commodity Data Center Network Architecture", August 2008. | Commodity Data Center Network Architecture", August 2008. | |||
| [IANA.AS] IANA, "Autonomous System (AS) Numbers", July 2015, | [IANA.AS] IANA, "Autonomous System (AS) Numbers", August 2015, | |||
| <http://www.iana.org/assignments/as-numbers/>. | <http://www.iana.org/assignments/as-numbers/>. | |||
| [IEEE8023AD] | [IEEE8023AD] | |||
| IEEE 802.3ad, "IEEE Standard for Link aggregation for | IEEE 802.3ad, "IEEE Standard for Link aggregation for | |||
| parallel links", October 2000. | parallel links", October 2000. | |||
| [ALLOWASIN] | [ALLOWASIN] | |||
| Cisco Systems, "Allowas-in Feature in BGP Configuration | Cisco Systems, "Allowas-in Feature in BGP Configuration | |||
| Example", February 2015, | Example", February 2015, | |||
| <http://www.cisco.com/c/en/us/support/docs/ip/border- | <http://www.cisco.com/c/en/us/support/docs/ip/border- | |||
| End of changes. 33 change blocks. | ||||
| 56 lines changed or deleted | 63 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||