| < draft-white-lsr-distoptflood-01.txt | draft-white-lsr-distoptflood-02.txt > | |||
|---|---|---|---|---|
| Network Working Group R. White | Network Working Group R. White | |||
| Internet-Draft S. Hegde | Internet-Draft S. Hegde | |||
| Intended status: Informational T. Przygienda | Intended status: Informational T. Przygienda | |||
| Expires: 3 September 2022 Juniper Networks | Expires: 27 October 2022 Juniper Networks | |||
| 2 March 2022 | 25 April 2022 | |||
| IS-IS Optimal Distributed Flooding for Dense Topologies | IS-IS Optimal Distributed Flooding for Dense Topologies | |||
| draft-white-lsr-distoptflood-01 | draft-white-lsr-distoptflood-02 | |||
| Abstract | Abstract | |||
| In dense topologies (such as data center fabrics based on the Clos | In dense topologies (such as data center fabrics based on the Clos | |||
| and butterfly, though not limited to these topologies), flooding | and butterfly topologies, though not limited to these), IGP flooding | |||
| mechanisms designed for sparse topologies can flood excessively, or | mechanisms designed for sparse topologies can "overflood," or carry | |||
| carry too many copies of topology and reachability to fabric devices. | too many copies of topology and reachability information to fabric | |||
| This results in slower convergence times and higher resource | devices. This results in slower convergence times and higher | |||
| utilization. The modifications to the flooding mechanism in the | resource utilization. The modifications to the flooding mechanism in | |||
| Intermediate System to Intermediate System (IS-IS) link state | the Intermediate System to Intermediate System (IS-IS) link state | |||
| protocol described in this document reduce resource utilization to a | protocol described in this document reduce resource utilization | |||
| minimum, while increasing convergence performance in dense | significantly, while increasing convergence performance in dense | |||
| topologies. | topologies. | |||
| Note that a Clos fabric is used as the primary example of a dense | Note that a Clos fabric is used as the primary example of a dense | |||
| flooding topology throughout this document. However, the flooding | flooding topology throughout this document. However, the flooding | |||
| optimizations described in this document apply to any dense topology. | optimizations described in this document apply to any topology. | |||
| Status of This Memo | Status of This Memo | |||
| This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
| provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on 3 September 2022. | This Internet-Draft will expire on 27 October 2022. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2022 IETF Trust and the persons identified as the | Copyright (c) 2022 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents (https://trustee.ietf.org/ | |||
| license-info) in effect on the date of publication of this document. | license-info) in effect on the date of publication of this document. | |||
| Please review these documents carefully, as they describe your rights | Please review these documents carefully, as they describe your rights | |||
| skipping to change at page 2, line 25 ¶ | skipping to change at page 2, line 25 ¶ | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 1.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1.1. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . 2 | |||
| 1.2. Contributors . . . . . . . . . . . . . . . . . . . . . . 3 | 1.2. Contributors . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 1.3. Experience . . . . . . . . . . . . . . . . . . . . . . . 3 | 1.3. Experience . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 1.4. Sample Network . . . . . . . . . . . . . . . . . . . . . 3 | 1.4. Sample Network . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2. Flooding Modifications . . . . . . . . . . . . . . . . . . . 5 | 2. Flooding Modifications . . . . . . . . . . . . . . . . . . . 5 | |||
| 2.1. Optimizing Flooding . . . . . . . . . . . . . . . . . . . 5 | 2.1. Optimizing Flooding . . . . . . . . . . . . . . . . . . . 5 | |||
| 2.2. Optimization Process . . . . . . . . . . . . . . . . . . 6 | 2.2. Optimization Process . . . . . . . . . . . . . . . . . . 6 | |||
| 2.3. Flooding Failures . . . . . . . . . . . . . . . . . . . . 7 | 2.3. Flooding Failures . . . . . . . . . . . . . . . . . . . . 7 | |||
| 2.4. Flooding Example . . . . . . . . . . . . . . . . . . . . 7 | 2.4. Flooding Example . . . . . . . . . . . . . . . . . . . . 8 | |||
| 2.5. A Note on Performance . . . . . . . . . . . . . . . . . . 8 | 2.5. A Note on Performance . . . . . . . . . . . . . . . . . . 8 | |||
| 3. Security Considerations . . . . . . . . . . . . . . . . . . . 8 | 3. Security Considerations . . . . . . . . . . . . . . . . . . . 8 | |||
| 4. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 | 4. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 | |||
| 4.1. Normative References . . . . . . . . . . . . . . . . . . 8 | 4.1. Normative References . . . . . . . . . . . . . . . . . . 8 | |||
| 4.2. Informative References . . . . . . . . . . . . . . . . . 10 | 4.2. Informative References . . . . . . . . . . . . . . . . . 10 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 11 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12 | |||
| 1. Introduction | 1. Introduction | |||
| 1.1. Goals | 1.1. Goals | |||
| The goal of this draft is to solve one specific set of problems | The goal of this draft is to solve one specific set of problems | |||
| involved in operating a link state protocol in a dense mesh topology. | involved in operating a link state protocol in a densely meshed | |||
| The problem with such topologies is the connectivity density, which | topology. The problem with such topologies is the connectivity | |||
| causes too much information to be flooded (or too much repeated state | density, which causes too many copies of identical information to be | |||
| to be flooded). Analysis and experiment show, for instance, that in | flooded. Analysis and experiment show, for instance, that in a | |||
| a butterfly fabric of around 2500 intermediate systems, each | butterfly fabric of around 2500 intermediate systems, each | |||
| intermediate system will receive 40+ copies of any changed LSP | intermediate system will receive more than 40 copies of any changed | |||
| fragment. This not only wastes bandwidth and processor time, this | LSP fragment. This not only wastes bandwidth and processor time, | |||
| dramatically slows convergence speed. | this dramatically slows convergence speed. | |||
| This document describes a set of modifications to existing IS-IS | This document describes a set of modifications to existing IS-IS | |||
| flooding mechanisms which minimize the number of LSP framgents | flooding mechanisms which minimize the number of LSP fragments | |||
| received by individual intermediate systems, potentially to one copy | received by individual intermediate systems, in its extreme version | |||
| per intermediate system. The mechanism described here chooses | to one copy per intermediate system. The mechanisms described in | |||
| different flooding intermediates using a hash across the system ID to | this document are similar to those implemented in OSPF to support | |||
| prevent single failures from having a large impact on flooding, and | mobile ad-hoc networks, as described in [RFC5449], [RFC5614], and | |||
| to spread the processing load of flooding across many systems and | [RFC7182]. These mechanisms have been widely implemented and | |||
| prevent bottlenecks. | deployed. | |||
| The mechanisms described in this document are similar to those | ||||
| implemented in OSPF to support mobile ad-hoc networks, as described | ||||
| in [RFC5449], [RFC5614], and [RFC7182]. Mechanisms similar to these | ||||
| have been widely implemented and deployed. | ||||
| 1.2. Contributors | 1.2. Contributors | |||
| The following people have contributed to this draft: Abhishek Kumar, | The following people have contributed to this draft: Abhishek Kumar, | |||
| Nikos Triantafillis, Ivan Pepelnjak, Christian Franke, Hannes | Nikos Triantafillis, Ivan Pepelnjak, Christian Franke, Hannes | |||
| Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo, and | Gredler, Les Ginsberg, Naiming Shen, Uma Chunduri, Nick Russo, Shawn | |||
| Rodny Molina. | Zandi, and Rodny Molina. | |||
| 1.3. Experience | 1.3. Experience | |||
| Lab testing shows modifications similar to these reduce flooding in a | Laboratory tests show modifications similar to these reduce flooding | |||
| large scale emulated butterfly network topology; without these | in a large scale emulated butterfly network topology; without these | |||
| modifications, intermediate systems receive, on average, 40 copies of | modifications, intermediate systems receive, on average, 40 copies of | |||
| any changed LSP fragment. With these modifications, intermediate | any changed LSP fragment. With the modifications described in this | |||
| systems recieve, on average, two copies of any changed LSP fragment. | document intermediate systems recieve, on average, two copies of any | |||
| In many cases, each intermediate system receives one copy of each | changed LSP fragment. In many cases, each intermediate system | |||
| changed LSP. In terms of performance, the modifications described | receives only a single copy of each changed LSP. In terms of | |||
| here reduce convergence times by around 50%. A network that converges | performance, the modifications described here cut convergence times | |||
| in about 30-40 seconds without these modifications converged in 15-20 | in half. Processor load times were not checked, as this was an | |||
| seconds with these modifications. Processor load times were not | emulated environment. | |||
| checked, as this was an emulated environment. | ||||
| A mechanism similar to the one described in this document has been | A mechanism similar to the one described in this document has been | |||
| implemented in the FR Routing open source routing stack as part of | implemented in the FR Routing open source routing stack as part of | |||
| fabricd. | fabricd. | |||
| 1.4. Sample Network | 1.4. Sample Network | |||
| The following spine and leaf fabric will be used to describe these | The following spine and leaf fabric will be used to describe these | |||
| modifications. | modifications. | |||
| skipping to change at page 4, line 49 ¶ | skipping to change at page 4, line 49 ¶ | |||
| * 4B is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and | * 4B is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and | |||
| 5F | 5F | |||
| * etc. | * etc. | |||
| The tiers or stages of the fabric are also marked for easier | The tiers or stages of the fabric are also marked for easier | |||
| reference. T0 is assumed to be connected to application servers, or | reference. T0 is assumed to be connected to application servers, or | |||
| rather they are Top of Rack (ToR) intermediate systems. The | rather they are Top of Rack (ToR) intermediate systems. The | |||
| remaining tiers, T1 and T2, are connected only to other devices in | remaining tiers, T1 and T2, are connected only to other devices in | |||
| the fabric itself. | the fabric itself. A common alternate representation of this | |||
| topology is drawn "folded" with T2, the "top of fabric," shown on | ||||
| top, while T1 is shown below, and T0 below T1. | ||||
| 2. Flooding Modifications | 2. Flooding Modifications | |||
| Flooding is perhaps the most challenging scaling issue for a link | Flooding is perhaps the most challenging scaling issue for a link | |||
| state protocol running on a dense, large scale fabric. This section | state protocol running on a dense, large scale topology. This | |||
| describes modifications to the IS-IS flooding process to reduce | section describes detailed modifications to the IS-IS flooding | |||
| flooding load on a dense or mesh topology. | process to reduce flooding load in a densely meshed topology. | |||
| 2.1. Optimizing Flooding | 2.1. Optimizing Flooding | |||
| The simplest way to conceive of the solution presented here is in two | The simplest way to conceive of the solution presented here is in two | |||
| stages: | stages: | |||
| * Stage 1: Forward Optimization | * Stage 1: Forward Optimization | |||
| - Find the group of intermediate systems that will all flood to | - Find the group of intermediate systems that will all flood to | |||
| the same set of neighbors as the local IS | the same set of neighbors as the local IS | |||
| - Decide (deterministially) which of the intermediate systems | - Decide (deterministically) which subset of the intermediate | |||
| within this group should flood any received LSPs | systems within this group should re-flood any received LSPs | |||
| * Stage 2: Reverse Optimization | * Stage 2: Reverse Optimization | |||
| - Find neighbors on the shortest path towards the origin of the | - Find neighbors on the shortest path towards the origin of the | |||
| change | change | |||
| - Do not flood towards these neighbors | - Do not flood towards these neighbors | |||
| The first stage is best explained through an illustration. In the | The first stage is best explained through an illustration. In the | |||
| network above, if 5A transmits a modified Link State Protocol Data | network above, if 5A transmits a modified Link State Protocol Data | |||
| Unit (LSP) to 4A-4F, each of 4A-4F will, in turn, flood this modified | Unit (LSP) to 4A-4F, each of 4A-4F will, in turn, flood this modified | |||
| LSP to 3A (for instance). 3A will receive 6 copies of the modified | LSP to 3A (for instance). 3A will receive 6 copies of the modified | |||
| LSP, while only one copy is necessary for the intermediate systems | LSP, while only one copy is necessary for the intermediate systems | |||
| shown to converge on a single view of the topology. If 4A-4F could | shown to converge on a single view of the topology. If 4A-4F could | |||
| determine they will all flood identical copies of the modified LSP to | determine they will all flood identical copies of the modified LSP to | |||
| 3A, it is possible for all of them except one to decide not to flood | 3A, it is possible for all of them except one to decide not to flood | |||
| the changed LSP to 3A. | the changed LSP to 3A. | |||
| The technique used in this draft to determine the flooding group is | The technique used in this draft to determine the flooding group is | |||
| for each intermediate system to calculate a special SPT from the | for each intermediate system to calculate a special Shortest-path | |||
| point of view of the transmitting neighbor. By setting the metric of | Spanning Tree (SPT) from the point of view of the transmitting | |||
| all links to 1 and truncating the SPT to two hops, the local IS can | neighbor. By setting the metric of all links to 1 and truncating the | |||
| find the group of neighbors it will flood any changed LSP towards and | SPT to two hops, the local IS can find the group of neighbors it will | |||
| the set of intermediate systems (not necessarily neighbors) which | flood any changed LSP towards and the set of intermediate systems | |||
| will also flood to this same set of neighbors. If every intermediate | (not necessarily neighbors) which will also flood to this same set of | |||
| system in the flooding set performs this same calculation, they will | neighbors. If every intermediate system in the flooding set performs | |||
| all obtain the same flooding group. | this same calculation, they will all obtain the same flooding group. | |||
| Once this flooding group is determined, the members of the flooding | Once this flooding group is determined, the members of the flooding | |||
| group will each (independently) choose which of the members should | group will each (independently) choose which of the members should | |||
| flood. Each member of the flooding group calculates this | re-flood the received information. Each member of the flooding group | |||
| independently of all the other members, using a common hash across a | calculates this independently of all the other members, but a common | |||
| set of shared variables so each member of the group comes to the same | hash MUST be used across a set of shared variables so each member of | |||
| conclusion. The group member which is selected to flood the changed | the group comes to the same conclusion. The group member which is | |||
| LSP does so normally; the remaining group members do not flood the | selected to flood the changed LSP does so normally; the remaining | |||
| LSP. | group members do not flood the LSP. | |||
| Note there is no signaling between the intermediate systems running | Note there is no signaling between the intermediate systems running | |||
| this flooding reduction mechanism. Each IS calculates a special, | this flooding reduction mechanism. Each IS calculates the special, | |||
| truncated SPT separately, and determines which IS should flood any | truncated SPT separately, and determines which IS should flood any | |||
| changed LSPs independently. Because these calculations are performed | changed LSPs independently based on a common hash function. Because | |||
| using a shared view of the network, however (based on the common link | these calculations are performed using a shared view of the network, | |||
| state database) and a shared sorting hash, each member of the | however (based on the common link state database) and a shared hash | |||
| flooding group will make the same decision. | function, each member of the flooding group will make the same | |||
| decision. | ||||
| The second stage is simpler, consisting of a single rule: do not | The second stage is simpler, consisting of a single rule: do not | |||
| flood modified LSPs along the shortest path towards the origin of the | flood modified LSPs along the shortest path towards the origin of the | |||
| modified LSP. This rule relies on the observation that any IS | modified LSP. This rule relies on the observation that any IS | |||
| between the origin of the modified LSP and the local IS should | between the origin of the modified LSP and the local IS should | |||
| receive the modified LSP from some other IS closer to the source of | receive the modified LSP from some other IS closer to the source of | |||
| the modified LSP. | the modified LSP. | |||
| 2.2. Optimization Process | 2.2. Optimization Process | |||
| Each intermediate system will determine whether it should reflood as | Each intermediate system will determine whether it should re-flood | |||
| described below, when a modified LSP arrives from a Transmitting | LSPs as described below. When a modified LSP arrives from a | |||
| Neighbor (TN), by: | Transmitting Neighbor (TN), the result of the following algorithm | |||
| obtains the necessary decision: | ||||
| Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) | Step 1: Build the Two-Hop List (THL) and Remote Neighbor's List (RNL) | |||
| by: | by: | |||
| * Set all link metrics to 1 | * Set all link metrics to 1 | |||
| * Calculate an SPT truncated to 2 hops from the perspective of TN | * Calculate an SPT truncated to 2 hops from the perspective of TN | |||
| * For each IS that is two hops (has a metric of two in the truncated | * For each IS that is two hops (has a metric of two in the truncated | |||
| SPT) from TN: | SPT) from TN: | |||
| skipping to change at page 7, line 4 ¶ | skipping to change at page 7, line 6 ¶ | |||
| * For each IS that is two hops (has a metric of two in the truncated | * For each IS that is two hops (has a metric of two in the truncated | |||
| SPT) from TN: | SPT) from TN: | |||
| - If the IS is on the shortest path towards the originator of the | - If the IS is on the shortest path towards the originator of the | |||
| modified LSP, skip | modified LSP, skip | |||
| - If the IS is not on the shortest path towards the originator of | - If the IS is not on the shortest path towards the originator of | |||
| the modified LSP, add it to THL | the modified LSP, add it to THL | |||
| * Add each IS that is one hop away from TN to the RNL | * Add each IS that is one hop away from TN to the RNL | |||
| Step 2: Sort RNL by system IDs, from the least to the greatest. | Step 2: Sort RNL by system IDs, from the least to the greatest. | |||
| Step 3: Calculate a number, N, by adding each byte in the LSP | Step 3: Calculate a number, N, by adding each byte in LSP-ID (without | |||
| originator's System-ID and then taking MOD on the number of | the fragment ID) and fragment ID MOD 2 (allowing for some balancing | |||
| neighbors. N MUST be less than the number of members of RNL. | of LSPs coming from same system ID without introducing excessive | |||
| amount of state in an implementation) and then taking MOD on the | ||||
| number of neighbors. N MUST be less than the number of members of | ||||
| RNL. | ||||
| Step 4: Starting with the Nth member of RNL: | Step 4: Starting with the Nth member of RNL: | |||
| * If THL is empty, exit | * If THL is empty, exit | |||
| * If this member of RNL is the local calculating IS, this IS MUST | * If this member of RNL is the local calculating IS, this IS MUST | |||
| reflood the modified LSP; exit | reflood the modified LSP; exit | |||
| * Remove all members of THL connected to (adjacent to) this member | * Remove all members of THL connected to (adjacent to) this member | |||
| of RNL | of RNL | |||
| skipping to change at page 7, line 32 ¶ | skipping to change at page 7, line 38 ¶ | |||
| Note: This description is geared to clarity rather than optimal | Note: This description is geared to clarity rather than optimal | |||
| performance. | performance. | |||
| 2.3. Flooding Failures | 2.3. Flooding Failures | |||
| It is possible in some failure modes for flooding to be incomplete | It is possible in some failure modes for flooding to be incomplete | |||
| because of the flooding optimizations outlined. Specifically, if a | because of the flooding optimizations outlined. Specifically, if a | |||
| reflooder fails, or is somehow disconnected from all the links across | reflooder fails, or is somehow disconnected from all the links across | |||
| which it should be reflooding, it is possible an LSP is only | which it should be reflooding, it is possible an LSP is only | |||
| partially flooded through the fabric. To prevent faliures, an | partially flooded through the fabric. To prevent such partition | |||
| intermediate system which does not reflood an LSP (or fragment) | failures, an intermediate system which does not reflood an LSP (or | |||
| should: | fragment) should: | |||
| * Set a short timer; the default should be less than one second | * Set a short timer; the default should be one second | |||
| * When the timer expires, send a Partial Sequence Number Packet | * When the timer expires, send Partial Sequence Number Packet (PSNP) | |||
| (PSNP) to all neighbors | of all LSPs that have not been reflooded during the timer runtime | |||
| to all neighbors unless an up-to-date PSNP or CSNP has been | ||||
| already received from the neighbor | ||||
| * Process any Partial Sequence Number Packets (PSNPs) as required to | * Process any Partial Sequence Number Packets (PSNPs) received that | |||
| resynchronize | indicate that neighbors still have older versions of the LSP per | |||
| normal protocol procedures to resynchronize | ||||
| * Log/notify if a resynchronization is required | * If resynchronization above a configurable threshold is required, | |||
| an implementation SHOULD notify the network operator | ||||
| 2.4. Flooding Example | 2.4. Flooding Example | |||
| Assume, in the network above, 5A floods some modified LSP towards 4A- | Assume, in the network above, 5A floods some modified LSP towards 4A- | |||
| 4F. To determine whether 4A should flood this LSP to 3A-3F: | 4F. To determine whether 4A should flood this LSP to 3A-3F: | |||
| * 5A is TN; 4A calculates a truncated SPT from 5A's perspective with | * 5A is TN; 4A calculates a truncated SPT from 5A's perspective with | |||
| all link metrics set to 1 | all link metrics set to 1 | |||
| * 4A builds THL, which contains 3A, 3B, 3C, 3D, 3E, 3F, 5B, 5C, 5D, | * 4A builds THL, which contains 3A, 3B, 3C, 3D, 3E, 3F, 5B, 5C, 5D, | |||
| End of changes. 27 change blocks. | ||||
| 87 lines changed or deleted | 93 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||