| < draft-ietf-rift-applicability-07.txt | draft-ietf-rift-applicability-08.txt > | |||
|---|---|---|---|---|
| RIFT WG Yuehua. Wei, Ed. | RIFT WG Yuehua. Wei, Ed. | |||
| Internet-Draft Zheng. Zhang | Internet-Draft Zheng. Zhang | |||
| Intended status: Informational ZTE Corporation | Intended status: Informational ZTE Corporation | |||
| Expires: 21 March 2022 Dmitry. Afanasiev | Expires: 11 May 2022 Dmitry. Afanasiev | |||
| Yandex | Yandex | |||
| P. Thubert | P. Thubert | |||
| Cisco Systems | Cisco Systems | |||
| Jaroslaw. Kowalczyk | Jaroslaw. Kowalczyk | |||
| Orange Polska | Orange Polska | |||
| 17 September 2021 | 7 November 2021 | |||
| RIFT Applicability | RIFT Applicability | |||
| draft-ietf-rift-applicability-07 | draft-ietf-rift-applicability-08 | |||
| Abstract | Abstract | |||
| This document discusses the properties, applicability and operational | This document discusses the properties, applicability and operational | |||
| considerations of RIFT in different network scenarios. It intends to | considerations of RIFT in different network scenarios. It intends to | |||
| provide a rough guide how RIFT can be deployed to simplify routing | provide a rough guide how RIFT can be deployed to simplify routing | |||
| operations in Clos topologies and their variations. | operations in Clos topologies and their variations. | |||
| Status of This Memo | Status of This Memo | |||
| skipping to change at page 1, line 39 ¶ | skipping to change at page 1, line 39 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on 21 March 2022. | This Internet-Draft will expire on 11 May 2022. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2021 IETF Trust and the persons identified as the | Copyright (c) 2021 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents (https://trustee.ietf.org/ | |||
| license-info) in effect on the date of publication of this document. | license-info) in effect on the date of publication of this document. | |||
| Please review these documents carefully, as they describe your rights | Please review these documents carefully, as they describe your rights | |||
| skipping to change at page 2, line 36 ¶ | skipping to change at page 2, line 36 ¶ | |||
| 4.3.4. Internal Router Switching Fabrics . . . . . . . . . . 13 | 4.3.4. Internal Router Switching Fabrics . . . . . . . . . . 13 | |||
| 4.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 13 | 4.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
| 5. Operational Considerations . . . . . . . . . . . . . . . . . 15 | 5. Operational Considerations . . . . . . . . . . . . . . . . . 15 | |||
| 5.1. South Reflection . . . . . . . . . . . . . . . . . . . . 16 | 5.1. South Reflection . . . . . . . . . . . . . . . . . . . . 16 | |||
| 5.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 16 | 5.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 16 | |||
| 5.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 18 | 5.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 18 | |||
| 5.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 19 | 5.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 19 | |||
| 5.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 20 | 5.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 20 | |||
| 5.6. Positive vs. Negative Disaggregation . . . . . . . . . . 22 | 5.6. Positive vs. Negative Disaggregation . . . . . . . . . . 22 | |||
| 5.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 24 | 5.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 24 | |||
| 5.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 25 | 5.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 5.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 26 | 5.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 26 | |||
| 5.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 28 | 5.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 28 | |||
| 5.11. Fabric With A Controller . . . . . . . . . . . . . . . . 29 | 5.11. Fabric With A Controller . . . . . . . . . . . . . . . . 28 | |||
| 5.11.1. Controller Attached to ToFs . . . . . . . . . . . . 29 | 5.11.1. Controller Attached to ToFs . . . . . . . . . . . . 29 | |||
| 5.11.2. Controller Attached to Leaf . . . . . . . . . . . . 29 | 5.11.2. Controller Attached to Leaf . . . . . . . . . . . . 29 | |||
| 5.12. Internet Connectivity With Underlay . . . . . . . . . . . 30 | 5.12. Internet Connectivity Within Underlay . . . . . . . . . . 29 | |||
| 5.12.1. Internet Default on the Leaf . . . . . . . . . . . . 30 | 5.12.1. Internet Default on the Leaf . . . . . . . . . . . . 30 | |||
| 5.12.2. Internet Default on the ToFs . . . . . . . . . . . . 30 | 5.12.2. Internet Default on the ToFs . . . . . . . . . . . . 30 | |||
| 5.13. Subnet Mismatch and Address Families . . . . . . . . . . 30 | 5.13. Subnet Mismatch and Address Families . . . . . . . . . . 30 | |||
| 5.14. Anycast Considerations . . . . . . . . . . . . . . . . . 31 | 5.14. Anycast Considerations . . . . . . . . . . . . . . . . . 30 | |||
| 5.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 32 | 5.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 31 | |||
| 5.16. Key Management . . . . . . . . . . . . . . . . . . . . . 32 | 5.16. Key Management . . . . . . . . . . . . . . . . . . . . . 32 | |||
| 6. Security Considerations . . . . . . . . . . . . . . . . . . . 33 | 6. Security Considerations . . . . . . . . . . . . . . . . . . . 32 | |||
| 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 33 | 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 33 | |||
| 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 33 | 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 33 | |||
| 9. Normative References . . . . . . . . . . . . . . . . . . . . 33 | 9. Normative References . . . . . . . . . . . . . . . . . . . . 33 | |||
| 10. Informative References . . . . . . . . . . . . . . . . . . . 35 | 10. Informative References . . . . . . . . . . . . . . . . . . . 35 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 36 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
| 1. Introduction | 1. Introduction | |||
| This document discusses the properties and applicability of "Routing | This document discusses the properties and applicability of "Routing | |||
| in Fat Trees" [RIFT] in different deployment scenarios and highlights | in Fat Trees" [RIFT] in different deployment scenarios and highlights | |||
| skipping to change at page 3, line 47 ¶ | skipping to change at page 3, line 47 ¶ | |||
| or negatively to repel it. Disaggregation is performed to prevent | or negatively to repel it. Disaggregation is performed to prevent | |||
| black-holing and suboptimal routing to the more specific prefixes. | black-holing and suboptimal routing to the more specific prefixes. | |||
| TIE: | TIE: | |||
| This is an acronym for a "Topology Information Element". TIEs are | This is an acronym for a "Topology Information Element". TIEs are | |||
| exchanged between RIFT nodes to describe parts of a network such as | exchanged between RIFT nodes to describe parts of a network such as | |||
| links and address prefixes. A TIE always has a direction and a type. | links and address prefixes. A TIE always has a direction and a type. | |||
| North TIEs (sometimes abbreviated as N-TIEs) are used when dealing | North TIEs (sometimes abbreviated as N-TIEs) are used when dealing | |||
| with TIEs in the northbound representation and South-TIEs (sometimes | with TIEs in the northbound representation and South-TIEs (sometimes | |||
| abbreviated as S- TIEs) for the southbound equivalent. TIEs have | abbreviated as S-TIEs) for the southbound equivalent. TIEs have | |||
| different types such as node and prefix TIEs. | different types such as node and prefix TIEs. | |||
| Node TIE: | Node TIE: | |||
| This stands as an acronym for a "Node Topology Information Element", | This stands as an acronym for a "Node Topology Information Element", | |||
| which contains all adjacencies the node discovered and information | which contains all adjacencies the node discovered and information | |||
| about the node itself. Node TIE should NOT be confused with a North | about the node itself. Node TIE should NOT be confused with a North | |||
| TIE since "node" defines the type of TIE rather than its direction. | TIE since "node" defines the type of TIE rather than its direction. | |||
| Consequently North Node TIEs and South Node TIEs exist. | Consequently North Node TIEs and South Node TIEs exist. | |||
| Prefix TIE: | Prefix TIE: | |||
| This is an acronym for a "Prefix Topology Information Element" and it | This is an acronym for a "Prefix Topology Information Element" and it | |||
| contains all prefixes directly attached to this node in case of a | contains all prefixes directly attached to this node in case of a | |||
| North TIE and in case of South TIE the necessary default routes the | North TIE and in case of South TIE the necessary default routes and | |||
| node advertises southbound. | disaggregated routes the node advertises southbound. | |||
| South Reflection: | South Reflection: | |||
| Often abbreviated just as "reflection", it defines a mechanism where | Often abbreviated just as "reflection", it defines a mechanism where | |||
| South Node TIEs are "reflected" from the level south back up north to | South Node TIEs are "reflected" from the level south back up north to | |||
| allow nodes in the same level without East-West links to "see" each | allow nodes in the same level without East-West links to "see" each | |||
| other's node Topology Information Elements (TIEs). | other's node Topology Information Elements (TIEs). | |||
| LIE: | LIE: | |||
| This is an acronym for a "Link Information Element" exchanged on all | This is an acronym for a "Link Information Element" exchanged on all | |||
| the system's links running RIFT to form ThreeWay adjacencies and | the system's links running RIFT to form ThreeWay adjacencies and | |||
| carry information used to perform Zero Touch Provisioning (ZTP) of | carry information used to perform Zero Touch Provisioning (ZTP) of | |||
| levels. | levels. | |||
| Shortest-Path First (SPF): | Shortest-Path First (SPF): | |||
| A well-known graph algorithm attributed to Dijkstra that establishes | A well-known graph algorithm attributed to Dijkstra that establishes | |||
| a tree of shortest paths from a source to destinations on the graph. | a tree of shortest paths from a source to destinations on the graph. | |||
| SPF acronym is used due to its familiarity as general term for the | SPF acronym is used due to its familiarity as general term for the | |||
| node reachability calculations. RIFT can employ to ultimately | node reachability calculations that RIFT can employ to ultimately | |||
| calculate routes of which Dijkstra algorithm is a possible one. | calculate routes of which Dijkstra algorithm is a possible one. | |||
| North SPF (N-SPF): | North SPF (N-SPF): | |||
| A reachability calculation that is progressing northbound, for example | A reachability calculation that is progressing northbound, for example | |||
| an SPF that uses South Node TIEs only. Normally it progresses a | an SPF that uses South Node TIEs only. Normally it progresses a | |||
| single hop only and installs default routes. | single hop only and installs default routes. | |||
| South SPF (S-SPF): | South SPF (S-SPF): | |||
| A reachability calculation that is progressing southbound, for example | A reachability calculation that is progressing southbound, for example | |||
| an SPF that uses North Node TIEs only. | an SPF that uses North Node TIEs only. | |||
| 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | |||
| Clos [CLOS] topologies (called commonly a fat tree/network in modern | Clos [CLOS] topologies (called commonly a fat tree/network in modern | |||
| IP fabric considerations as homonym to the original definition of the | IP fabric considerations as homonym to the original definition of the | |||
| term Fat Tree [FATTREE])have gained prominence in today's networking, | term Fat Tree [FATTREE]) have gained prominence in today's | |||
| primarily as a result of the paradigm shift towards a centralized | networking, primarily as a result of the paradigm shift towards a | |||
| data-center based architecture that delivers a majority of computation | centralized data-center based architecture that delivers a majority of | |||
| and storage services. | computation and storage services. | |||
| Today's current routing protocols were geared towards a network with | Current routing protocols were geared towards a network with an | |||
| an irregular topology with isotropic properties, and low degree of | irregular topology with isotropic properties, and low degree of | |||
| connectivity. When applied to Fat Tree topologies: | connectivity. When applied to Fat Tree topologies: | |||
| * They tend to need extensive configuration or provisioning during | * They tend to need extensive configuration or provisioning during | |||
| bring up and re-dimensioning. | bring up and adding or removing RIFT nodes from the fabric. | |||
| * All nodes including spine and leaf nodes learn the entire network | * All nodes including spine and leaf nodes learn the entire network | |||
| topology and routing information, which is, in fact, not needed on | topology and routing information, which is, in fact, not needed on | |||
| the leaf nodes during normal operation. | the leaf nodes during normal operation. | |||
| * They flood significant amounts of duplicate link state information | * They flood significant amounts of duplicate link state information | |||
| between spine and leaf nodes during topology updates and | between spine and leaf nodes during topology updates and | |||
| convergence events, requiring that additional CPU and link | convergence events, requiring that additional CPU and link | |||
| bandwidth be consumed. This may impact the stability and | bandwidth be consumed. This may impact the stability and | |||
| scalability of the fabric, make the fabric less reactive to | scalability of the fabric, make the fabric less reactive to | |||
| skipping to change at page 6, line 25 ¶ | skipping to change at page 6, line 25 ¶ | |||
| information is never flooded east-west or back south again. So a top | information is never flooded east-west or back south again. So a top | |||
| tier node has the full set of prefixes from the Shortest Path First (SPF) | tier node has the full set of prefixes from the Shortest Path First (SPF) | |||
| calculation. | calculation. | |||
| In the southbound direction, the protocol operates like a "fully | In the southbound direction, the protocol operates like a "fully | |||
| summarizing, unidirectional" path-vector protocol or rather a | summarizing, unidirectional" path-vector protocol or rather a | |||
| distance-vector with implicit split horizon. Routing information, | distance-vector with implicit split horizon. Routing information, | |||
| normally just the default route, propagates one hop south and is "re- | normally just the default route, propagates one hop south and is "re- | |||
| advertised" by nodes at the next lower level. | advertised" by nodes at the next lower level. | |||
| +-----------+ +-----------+ | +---------------+ +----------------+ | |||
| | ToF | | ToF | LEVEL 2 | | ToF | | ToF | LEVEL 2 | |||
| + +-----+--+--+ +-+--+------+ | + ++------+--+--+-+ ++-+--+----+-----+ | |||
| | | | | | | | | | ^ | | | | | | | | | | ^ | |||
| + | | | +-------------------------+ | | + | | | +-------------------------+ | | |||
| Distance | +-------------------+ | | | | | | Distance | +-------------------+ | | | | | | |||
| Vector | | | | | | | | + | Vector | | | | | | | | + | |||
| South | | | | +--------+ | | | Link-State | South | | | | +--------+ | | | Link+State | |||
| + | | | | | | | | Flooding | + | | | | | | | | Flooding | |||
| | | | +-------------+ | | | North | | | | +----------------+ | | | North | |||
| v | | | | | | | | + | v | | | | | | | | + | |||
| +-+--+-+ +------+ +-------+ +--+--+-+ | | ++---+-+ +------+ +-+----+ ++----++ | | |||
| |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 | |SPINE | |SPINE | | SPINE| | SPINE| | LEVEL 1 | |||
| + ++----++ ++---+-+ +--+--+-+ ++----+-+ | | + ++----++ ++---+-+ +-+--+-+ ++----++ | | |||
| + | | | | | | | | | ^ N | + | | | | | | | | | ^ N | |||
| Distance | +-------+ | | +--------+ | | | E | Distance | +-------+ | | +--------+ | | | E | |||
| Vector | | | | | | | | | +------> | Vector | | | | | | | | | +------> | |||
| South | +-------+ | | | +-------+ | | | | | South | +-------+ | | | +------+ | | | | | |||
| + | | | | | | | | | + | + | | | | | | | | | + | |||
| v ++--++ +-+-++ ++-+-+ +-+--++ + | v ++--++ +-+-++ ++--++ ++--++ + | |||
| |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 | |LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0 | |||
| +----+ +----+ +----+ +-----+ | +----+ +----+ +----+ +----+ | |||
| Figure 1: RIFT overview | Figure 1: RIFT overview | |||
| A spine node has only information necessary for its level, which is | A spine node has only information necessary for its level, which is | |||
| all destinations south of the node based on SPF calculation, default | all destinations south of the node based on SPF calculation, default | |||
| route, and potential disaggregated routes. | route, and potential disaggregated routes. | |||
| RIFT combines the advantages of both link-state and distance-vector: | RIFT combines the advantages of both link-state and distance-vector: | |||
| * Fastest possible convergence | * Fastest possible convergence | |||
| skipping to change at page 7, line 23 ¶ | skipping to change at page 7, line 27 ¶ | |||
| * High degree of ECMP | * High degree of ECMP | |||
| * Fast de-commissioning of nodes | * Fast de-commissioning of nodes | |||
| * Maximum propagation speed with flexible prefixes in an update | * Maximum propagation speed with flexible prefixes in an update | |||
| So there are two types of link-state database which are "north | So there are two types of link-state database which are "north | |||
| representation" North Topology Information Elements (N-TIEs) and | representation" North Topology Information Elements (N-TIEs) and | |||
| "south representation" South Topology Information Elements (S-TIEs). | "south representation" South Topology Information Elements (S-TIEs). | |||
| The N-TIEs contain a link-state topology description of lower levels | The N-TIEs contain a link-state topology description of lower levels | |||
| and S-TIEs carry simply default routes for the lower levels. | and S-TIEs carry simply default and disaggregated routes for the | |||
| lower levels. | ||||
| RIFT also eliminates major disadvantages of link-state and distance- | RIFT also eliminates major disadvantages of link-state and distance- | |||
| vector with: | vector with: | |||
| * Reduced and balanced flooding | * Reduced and balanced flooding | |||
| * Automatic neighbor detection | * Level constrained automatic neighbor detection | |||
| To achieve this, RIFT builds on the art of IGPs, not only OSPF and | To achieve this, RIFT builds on the art of IGPs, not only OSPF and | |||
| IS-IS but also MANET and IoT, to provide unique features: | IS-IS but also MANET and IoT, to provide unique features: | |||
| * Automatic (positive or negative) route disaggregation of | * Automatic (positive or negative) route disaggregation of | |||
| northwards routes upon fallen leaves | northwards routes upon fallen leaves | |||
| * Recursive operation in the case of negative route disaggregation | * Recursive operation in the case of negative route disaggregation | |||
| * Anisotropic routing that extends a principle seen in RPL [RFC6550] | * Anisotropic routing that extends a principle seen in RPL [RFC6550] | |||
| skipping to change at page 8, line 4 ¶ | skipping to change at page 8, line 9 ¶ | |||
| to wide superspines | to wide superspines | |||
| * Optimal flooding reduction that derives from the concept of a | * Optimal flooding reduction that derives from the concept of a | |||
| "multipoint relay" (MPR) found in OLSR [RFC3626] and balances the | "multipoint relay" (MPR) found in OLSR [RFC3626] and balances the | |||
| flooding load over northbound links and nodes. | flooding load over northbound links and nodes. | |||
| Additional advantages that are unique to RIFT are listed below, the | Additional advantages that are unique to RIFT are listed below, the | |||
| details of which can be found in RIFT [RIFT]. | details of which can be found in RIFT [RIFT]. | |||
| * True ZTP (Zero Touch Provisioning) | * True ZTP (Zero Touch Provisioning) | |||
| * Minimal blast radius on failures | * Minimal blast radius on failures | |||
| * Can utilize all paths through fabric without looping | * Can utilize all paths through fabric without looping | |||
| * Simple leaf implementation that can scale down to servers | * Simple leaf implementation that can scale down to servers | |||
| * Key-Value store | * Key-Value store | |||
| * Horizontal links used for protection only | * Horizontal links used for protection only | |||
| * Supports non-equal cost multipath and can replace multi-chassis | ||||
| link aggregation group (MLAG or MC-LAG) | ||||
| 4.2. Applicable Topologies | 4.2. Applicable Topologies | |||
| Although RIFT is specified primarily for "proper" Clos or Fat Tree | Although RIFT is specified primarily for "proper" Clos or Fat Tree | |||
| topologies, the protocol natively supports Points of Delivery (PoD) | topologies, the protocol natively supports Points of Delivery (PoD) | |||
| concepts, which, strictly speaking, are not found in the original | concepts, which, strictly speaking, are not found in the original | |||
| Clos concept. | Clos concept. | |||
| Further, the specification explains and supports operations of multi- | Further, the specification explains and supports operations of multi- | |||
| plane Clos variants where the protocol recommends the use of inter- | plane Clos variants where the protocol recommends the use of inter- | |||
| plane rings at the Top-of-Fabric level to allow the reconciliation of | plane rings at the Top-of-Fabric level to allow the reconciliation of | |||
| topology view of different planes to make the negative disaggregation | topology view of different planes to make the negative disaggregation | |||
| viable in case of failures within a plane. These observations hold | viable in case of failures within a plane. These observations hold | |||
| not only in case of RIFT but also in the generic case of dynamic | not only in case of RIFT but also in the generic case of dynamic | |||
| routing on Clos variants with multiple planes and failures in bi- | routing on Clos variants with multiple planes and failures in bi- | |||
| sectional bandwidth, especially on the leafs. | sectional bandwidth, especially on the leafs. | |||
| 4.2.1. Horizontal Links | 4.2.1. Horizontal Links | |||
| RIFT is not limited to pure Clos divided into PoD and multi-planes | RIFT is not limited to pure Clos divided into PoD and multi-planes | |||
| but supports horizontal (East-West) links below the top of fabric | but supports horizontal (East-West) links below the top of fabric | |||
| level. Those links are used only for last resort northbound routes | level. Those links are used only for last resort northbound | |||
| when a spine loses all its northbound links or cannot compute a | forwarding when a spine loses all its northbound links or cannot | |||
| default route through them. | compute a default route through them. | |||
| A possible configuration is a "ring" of horizontal links at a level. | ||||
| In presence of such a "ring" in any level (except Top of Fabric (ToF) | ||||
| level) neither North SPF (N-SPF) nor South SPF (S-SPF) will provide a | ||||
| "ring-based protection" scheme since such a computation would have to | ||||
| deal necessarily with breaking of "loops" in Dijkstra sense; an | ||||
| application for which RIFT is not intended. | ||||
| A full-mesh connectivity between nodes on the same level can be | A full-mesh connectivity between nodes on the same level can be | |||
| employed and that allows N-SPF to provide for any node loosing all | employed and that allows N-SPF to provide for any node losing all its | |||
| its northbound adjacencies (as long as any of the other nodes in the | northbound adjacencies (as long as any of the other nodes in the | |||
| level are northbound connected) to still participate in northbound | level are northbound connected) to still participate in northbound | |||
| forwarding. | forwarding. | |||
| Note that a "ring" of horizontal links at any level below ToF does | ||||
| not provide a "ring-based protection" scheme since the SPF | ||||
| computation would necessarily have to deal with breaking of "loops" | ||||
| in the Dijkstra sense--an application for which RIFT is not intended. | ||||
| 4.2.2. Vertical Shortcuts | 4.2.2. Vertical Shortcuts | |||
| Through relaxations of the specified adjacency forming rules, RIFT | Through relaxations of the specified adjacency forming rules, RIFT | |||
| implementations can be extended to support vertical "shortcuts". The | implementations can be extended to support vertical "shortcuts". The | |||
| RIFT specification itself does not provide the exact details since | RIFT specification itself does not provide the exact details since | |||
| the resulting solution suffers from either much larger blast radius | the resulting solution suffers from either much larger blast radius | |||
| with increased flooding volumes or in case of maximum aggregation | with increased flooding volumes or in case of maximum aggregation | |||
| routing, bow-tie problems. | routing, bow-tie problems. | |||
| 4.2.3. Generalizing to any Directed Acyclic Graph | 4.2.3. Generalizing to any Directed Acyclic Graph | |||
| skipping to change at page 9, line 36 ¶ | skipping to change at page 9, line 35 ¶ | |||
| * Northbound, RIFT operates as a link-state protocol, whereby the | * Northbound, RIFT operates as a link-state protocol, whereby the | |||
| control packets are reflooded first all the way north and only | control packets are reflooded first all the way north and only | |||
| interpreted later. All the individual fine grained routes are | interpreted later. All the individual fine grained routes are | |||
| advertised. | advertised. | |||
| * Southbound, RIFT operates as a distance-vector protocol, whereby | * Southbound, RIFT operates as a distance-vector protocol, whereby | |||
| the control packets are flooded only one-hop, interpreted, and the | the control packets are flooded only one-hop, interpreted, and the | |||
| consequence of that computation is what gets flooded one more hop | consequence of that computation is what gets flooded one more hop | |||
| south. In the most common use-cases, a ToF node can reach most of | south. In the most common use-cases, a ToF node can reach most of | |||
| the prefixes in the fabric. If that is the case, the ToF node | the prefixes in the fabric. If that is the case, the ToF node | |||
| advertises the fabric default and disaggregates the prefixes that | advertises the fabric default and negatively disaggregates the | |||
| it cannot reach. On the other hand, a ToF node that can reach | prefixes that it cannot reach. On the other hand, a ToF node that | |||
| only a small subset of the prefixes in the fabric will preferably | can reach only a small subset of the prefixes in the fabric will | |||
| advertise those prefixes and refrain from aggregating. | preferably advertise those prefixes and refrain from aggregating. | |||
| In the general case, what gets advertised south is in more | In the general case, what gets advertised south are: | |||
| details: | ||||
| 1. A fabric default that aggregates all the prefixes that are | 1. A fabric default that aggregates all the prefixes that are | |||
| reachable within the fabric, and that could be a default route | reachable within the fabric, and that could be a default route | |||
| or a prefix that is dedicated to this particular fabric. | or a prefix that is dedicated to this particular fabric. | |||
| 2. The loopback addresses of the northbound nodes, e.g., for | 2. The loopback addresses of the northbound nodes, e.g., for | |||
| inband management. | inband management. | |||
| 3. The disaggregated prefixes for the dynamic exceptions to the | 3. The disaggregated prefixes for the dynamic exceptions to the | |||
| fabric default, advertised to route around the black hole that | fabric default, advertised to route around the black hole that | |||
| may form. | may form. | |||
| * East-West routing can optionally be used, with specific | * East-West routing can optionally be used, with specific | |||
| restrictions. It is used when a sibling has access to the fabric | restrictions. It is used when a sibling has access to the fabric | |||
| default but this node does not. | default but this node does not. | |||
| A Directed Acyclic Graph (DAG) provides a sense of north (the | Since a Directed Acyclic Graph (DAG) provides a sense of north (the | |||
| direction of the DAG) and of south (the reverse), which can be used | direction of the DAG) and of south (the reverse), it can be used to | |||
| to apply RIFT. For the purpose of RIFT, an edge in the DAG that has | apply RIFT--an edge in the DAG that has only incoming vertices is a | |||
| only incoming vertices is a ToF node. | ToF node. | |||
| There are a number of caveats though: | There are a number of caveats though: | |||
| * The DAG structure must exist before RIFT starts, so there is a | * The DAG structure must exist before RIFT starts, so there is a | |||
| need for a companion protocol to establish the logical DAG | need for a companion protocol to establish the logical DAG | |||
| structure. | structure. | |||
| * A generic DAG does not have a sense of east and west. The | * A generic DAG does not have a sense of east and west. The | |||
| operation specified for east-west links and the southbound | operation specified for east-west links and the southbound | |||
| reflection between nodes are not applicable. Also ZTP (Zero Touch | reflection between nodes are not applicable. Also ZTP (Zero Touch | |||
| skipping to change at page 11, line 18 ¶ | skipping to change at page 11, line 14 ¶ | |||
| 4.2.4. Reachability of Internal Nodes in the Fabric | 4.2.4. Reachability of Internal Nodes in the Fabric | |||
| RIFT does not require that nodes have reachable addresses in the | RIFT does not require that nodes have reachable addresses in the | |||
| fabric, though it is clearly desirable for operational purposes. | fabric, though it is clearly desirable for operational purposes. | |||
| Under normal operating conditions this can be easily achieved by | Under normal operating conditions this can be easily achieved by | |||
| injecting the node's loopback address into North and South Prefix | injecting the node's loopback address into North and South Prefix | |||
| TIEs or other implementation specific mechanisms. | TIEs or other implementation specific mechanisms. | |||
| Special considerations arise when a node loses all northbound | Special considerations arise when a node loses all northbound | |||
| adjacencies, but is not at the top of the fabric. These are outside | adjacencies, but is not at the top of the fabric. If a spine node | |||
| the scope of this document and could be discussed in a separate | loses all northbound links, the spine node doesn't advertise a default | |||
| document. | route. But if the level of the spine node is auto-determined by ZTP, | |||
| it will "fall down" as depicted in Figure 8. | ||||
| 4.3. Use Cases | 4.3. Use Cases | |||
| 4.3.1. Data Center Topologies | 4.3.1. Data Center Topologies | |||
| 4.3.1.1. Data Center Fabrics | 4.3.1.1. Data Center Fabrics | |||
| RIFT is well suited to data center (DC) IP fabric underlay | RIFT is well suited to data center (DC) IP fabric underlay | |||
| routing, the vast majority of which seem to be currently (and for the | routing, the vast majority of which seem to be currently (and for the | |||
| foreseeable future) Clos architectures. It significantly simplifies | foreseeable future) Clos architectures. It significantly simplifies | |||
| skipping to change at page 12, line 33 ¶ | skipping to change at page 12, line 33 ¶ | |||
| . +-----+ +-----+ | . +-----+ +-----+ | |||
| Figure 2: Level Shortcut | Figure 2: Level Shortcut | |||
| RIFT is not strictly limited to Clos topologies. The protocol only | RIFT is not strictly limited to Clos topologies. The protocol only | |||
| requires a sense of "compass rose directionality" either achieved | requires a sense of "compass rose directionality" either achieved | |||
| through configuration or derivation of levels. So, conceptually, | through configuration or derivation of levels. So, conceptually, | |||
| shortcuts between levels could be included. Figure 2 depicts an | shortcuts between levels could be included. Figure 2 depicts an | |||
| example of a shortcut between levels. In this example, sub-optimal | example of a shortcut between levels. In this example, sub-optimal | |||
| routing will occur when traffic is sent from L0 to L1 via S0's | routing will occur when traffic is sent from L0 to L1 via S0's | |||
| default route and back down through A0 or A1. In order to ensure | default route and back down through A0 or A1. In order to avoid | |||
| that, only default routes from A0 or A1 are used, all leaves would be | that, only default routes from A0 or A1 are used, all leaves would be | |||
| required to install each other's routes. | required to install each other's routes. | |||
| While various technical and operational challenges may require the | While various technical and operational challenges may require the | |||
| use of such modifications, discussion of those topics is outside the | use of such modifications, discussion of those topics is outside the | |||
| scope of this document. | scope of this document. | |||
| 4.3.2. Metro Fabrics | 4.3.2. Metro Fabrics | |||
| The demand for bandwidth is increasing steadily, driven primarily by | The demand for bandwidth is increasing steadily, driven primarily by | |||
| environments close to content producers (server farms connection via | environments close to content producers (server farms connection via | |||
| DC fabrics) but in proximity to content consumers as well. Consumers | DC fabrics) but in proximity to content consumers as well. Consumers | |||
| are often clustered in metro areas with their own network | are often clustered in metro areas with their own network | |||
| architectures that can benefit from simplified, regular Clos | architectures that can benefit from simplified, regular Clos | |||
| structures and hence from RIFT. | structures and hence from RIFT. | |||
| 4.3.3. Building Cabling | 4.3.3. Building Cabling | |||
| Commercial edifices are often cabled in topologies that are either | Commercial edifices are often cabled in topologies that are either | |||
| Clos or its isomorphic equivalents. The Clos can grow rather high | Clos or its isomorphic equivalents. The Clos can grow rather high | |||
| with many floors. That presents a challenge for traditional routing | with many levels. That presents a challenge for traditional routing | |||
| protocols (except BGP and by now largely phased-out PNNI) which do | protocols (except BGP and by now largely phased-out PNNI) which do | |||
| not support an arbitrary number of levels which RIFT does naturally. | not support an arbitrary number of levels which RIFT does naturally. | |||
| Moreover, due to the limited sizes of forwarding tables in network | Moreover, due to the limited sizes of forwarding tables in network | |||
| elements of building cabling, the minimum FIB size RIFT maintains | elements of building cabling, the minimum FIB size RIFT maintains | |||
| under normal conditions is cost-effective in terms of hardware and | under normal conditions is cost-effective in terms of hardware and | |||
| operational costs. | operational costs. | |||
| 4.3.4. Internal Router Switching Fabrics | 4.3.4. Internal Router Switching Fabrics | |||
| It is common in high-speed communications switching and routing | It is common in high-speed communications switching and routing | |||
| devices to use fabrics when a crossbar is not feasible due to cost, | devices to use fabrics when a crossbar is not feasible due to cost, | |||
| head-of-line blocking or size trade-offs. Normally such fabrics are | head-of-line blocking or size trade-offs. Normally such fabrics are | |||
| not self-healing or rely on 1:/+1 protection schemes but it is | not self-healing or rely on 1:/+1 protection schemes but it is | |||
| conceivable to use RIFT to operate Clos fabrics that can deal | conceivable to use RIFT to operate Clos fabrics that can deal | |||
| effectively with interconnections or subsystem failures in such | effectively with interconnections or subsystem failures in such | |||
| module. RIFT is neither IP specific and hence any link addressing | module. RIFT is not IP specific and hence any link addressing | |||
| connecting internal device subnets is conceivable. | connecting internal device subnets is conceivable. | |||
| 4.3.5. CloudCO | 4.3.5. CloudCO | |||
| The Cloud Central Office (CloudCO) is a new stage of telecom Central | The Cloud Central Office (CloudCO) is a new stage of telecom Central | |||
| Office. It takes advantage of Software Defined Networking (SDN) | Office. It takes advantage of Software Defined Networking (SDN) | |||
| and Network Function Virtualization (NFV) in conjunction with general | and Network Function Virtualization (NFV) in conjunction with general | |||
| purpose hardware to optimize current networks. The following figure | purpose hardware to optimize current networks. The following figure | |||
| illustrates this architecture at a high level. It describes a single | illustrates this architecture at a high level. It describes a single | |||
| instance or macro-node of cloud CO that provides a number of Value | instance or macro-node of cloud CO that provides a number of Value | |||
| skipping to change at page 15, line 25 ¶ | skipping to change at page 15, line 25 ¶ | |||
| good scaling properties while delivering maximum reactiveness. | good scaling properties while delivering maximum reactiveness. | |||
| * RIFT allows for extensive Zero Touch Provisioning within the | * RIFT allows for extensive Zero Touch Provisioning within the | |||
| protocol. In its most extreme version RIFT does not rely on any | protocol. In its most extreme version RIFT does not rely on any | |||
| specific addressing and for IP fabric can operate using IPv6 ND | specific addressing and for IP fabric can operate using IPv6 ND | |||
| [RFC4861] only. | [RFC4861] only. | |||
| * RIFT has provisions to detect common IP fabric mis-cabling | * RIFT has provisions to detect common IP fabric mis-cabling | |||
| scenarios. | scenarios. | |||
| * RIFT automatically negotiates BFD per link allowing this way for | * RIFT automatically negotiates BFD per link. This allows for IP | |||
| IP and micro-BFD [RFC7130] to replace Link Aggregation Groups | and micro-BFD [RFC7130] to replace Link Aggregation Groups (LAGs) | |||
| (LAGs) which do hide bandwidth imbalances in case of constituent | which do hide bandwidth imbalances in case of constituent | |||
| failures. Further automatic link validation techniques similar to | failures. Further automatic link validation techniques similar to | |||
| [RFC5357] could be supported as well. | [RFC5357] could be supported as well. | |||
| * RIFT inherently solves many difficult problems associated with the | * RIFT inherently solves many difficult problems associated with the | |||
| use of traditional routing topologies with dense meshes and high | use of traditional routing topologies with dense meshes and high | |||
| degrees of ECMP by including automatic bandwidth balancing, flood | degrees of ECMP by including automatic bandwidth balancing, flood | |||
| reduction and automatic disaggregation on failures while providing | reduction and automatic disaggregation on failures while providing | |||
| maximum aggregation of prefixes in default scenarios. | maximum aggregation of prefixes in default scenarios. | |||
| * RIFT reduces FIB size towards the bottom of the IP fabric where | * RIFT reduces FIB size towards the bottom of the IP fabric where | |||
| skipping to change at page 17, line 40 ¶ | skipping to change at page 17, line 40 ¶ | |||
| Figure 4: Suboptimal routing upon link failure use case | Figure 4: Suboptimal routing upon link failure use case | |||
| As shown in Figure 4, as the result of the south reflection between | As shown in Figure 4, as the result of the south reflection between | |||
| Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and | Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and | |||
| Spine122 know each other at level 1. | Spine122 know each other at level 1. | |||
| Without disaggregation mechanism, when linkSL6 fails, the packet from | Without disaggregation mechanism, when linkSL6 fails, the packet from | |||
| leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 | leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 | |||
| then go down through linkTS4 to linkSL8 to Leaf122 or go up through | then go down through linkTS4 to linkSL8 to Leaf122 or go up through | |||
| linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to | linkSL5 to linkTS6 then go down through linkTS8 and linkSL8 to | |||
| Leaf122 based on pure default route. It's the case of suboptimal | Leaf122 based on pure default route. It's the case of suboptimal | |||
| routing or bow-tieing. | routing or bow-tieing. | |||
| With disaggregation mechanism, when linkSL6 fails, Spine122 will | With disaggregation mechanism, when linkSL6 fails, Spine122 will | |||
| detect the failure according to the reflected node S-TIE from | detect the failure according to the reflected node S-TIE from | |||
| Spine121. Based on the disaggregation algorithm provided by RIFT, | Spine121. Based on the disaggregation algorithm provided by RIFT, | |||
| Spine122 will explicitly advertise prefix122 in Disaggregated Prefix | Spine122 will explicitly advertise prefix122 in Disaggregated Prefix | |||
| S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to | S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to | |||
| prefix122 will only be sent to linkSL7 following a longest-prefix | prefix122 will only be sent to linkSL7 following a longest-prefix | |||
| match to prefix122 directly then go down through linkSL8 to Leaf122 | match to prefix122 directly then go down through linkSL8 to Leaf122 | |||
| skipping to change at page 19, line 14 ¶ | skipping to change at page 19, line 14 ¶ | |||
| The packet from leaf111 to prefix122 will not be routed to linkTS1 or | The packet from leaf111 to prefix122 will not be routed to linkTS1 or | |||
| linkTS2. The packet from leaf111 to prefix122 will only be routed to | linkTS2. The packet from leaf111 to prefix122 will only be routed to | |||
| linkTS5 or linkTS7 following a longest-prefix match to prefix122. | linkTS5 or linkTS7 following a longest-prefix match to prefix122. | |||
| 5.4. Zero Touch Provisioning (ZTP) | 5.4. Zero Touch Provisioning (ZTP) | |||
| RIFT is designed to require a very minimal configuration to simplify | RIFT is designed to require a very minimal configuration to simplify | |||
| its operation and avoid human errors; based on that minimal | its operation and avoid human errors; based on that minimal | |||
| information, Zero Touch Provisioning (ZTP) autoconfigures the key | information, Zero Touch Provisioning (ZTP) autoconfigures the key | |||
| operational parameters of all the RIFT nodes, that is, on the one | operational parameters of all the RIFT nodes, including the SystemID | |||
| hand, the SystemID of the node that must be unique in the RIFT | of the node that must be unique in the RIFT network and the level of | |||
| network, and on the other hand the level of the node in the Fat Tree, | the node in the Fat Tree, which determines which peers are northwards | |||
| which determines which peers are northwards "parents" and which are | "parents" and which are southwards "children". | |||
| southwards "children". | ||||
| ZTP is always on, but its decisions can be overridden when a network | ZTP is always on, but its decisions can be overridden when a network | |||
| administrator prefers to impose its own configuration. In that case, | administrator prefers to impose its own configuration. In that case, | |||
| it is the responsibility of the administrator to ensure that the | it is the responsibility of the administrator to ensure that the | |||
| configured parameters are correct, in other words that the SystemID | configured parameters are correct, in other words that the SystemID | |||
| of each node is unique, and that the administratively set levels | of each node is unique, and that the administratively set levels | |||
| truly reflect the relative position of the nodes in the fabric. It | truly reflect the relative position of the nodes in the fabric. It | |||
| is recommended to let ZTP configure the network, and when not, it is | is recommended to let ZTP configure the network, and when not, it is | |||
| recommended to configure the level of all the nodes but those that | recommended to configure the level of all the nodes to avoid an | |||
| are forced as leaves to avoid an undesirable interaction between ZTP | undesirable interaction between ZTP and the manual configuration. | |||
| and the manual configuration. | ||||
| ZTP requires that the administrator points out the Top-of-Fabric | ZTP requires that the administrator points out the Top-of-Fabric | |||
| (ToF) nodes to set the baseline from which the fabric topology is | (ToF) nodes to set the baseline from which the fabric topology is | |||
| derived. The Top-of-Fabric nodes are configured with the TOP_OF_FABRIC | derived. The Top-of-Fabric nodes are configured with the TOP_OF_FABRIC | |||
| flag and serve as the initial 'seeds' needed for other ZTP nodes to derive | flag and serve as the initial 'seeds' needed for other ZTP nodes to derive | |||
| their level in the topology. ZTP computes the level of each node | their level in the topology. ZTP computes the level of each node | |||
| based on the Highest Available Level (HAL) of the potential parent(s) | based on the Highest Available Level (HAL) of the potential parent(s) | |||
| nearest that baseline, which represents the superspine. In a | nearest that baseline, which represents the superspine. In a | |||
| fashion, RIFT can be seen as a distance-vector protocol that computes | fashion, RIFT can be seen as a distance-vector protocol that computes | |||
| a set of feasible successors towards the superspine and auto- | a set of feasible successors towards the superspine and auto- | |||
| skipping to change at page 20, line 16 ¶ | skipping to change at page 20, line 12 ¶ | |||
| highest level either leaving or entering the domain (with some finer | highest level either leaving or entering the domain (with some finer | |||
| distinctions not explained further). It is therefore recommended | distinctions not explained further). It is therefore recommended | |||
| that each node is multi-homed towards nodes with respective HAL | that each node is multi-homed towards nodes with respective HAL | |||
| offerings. Fortunately, this is the natural state of things for the | offerings. Fortunately, this is the natural state of things for the | |||
| topology variants considered in RIFT. | topology variants considered in RIFT. | |||
| A RIFT node may also be configured to confine it to the leaf role | A RIFT node may also be configured to confine it to the leaf role | |||
| with the LEAF_ONLY flag. A leaf node can also be configured to | with the LEAF_ONLY flag. A leaf node can also be configured to | |||
| support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either | support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either | |||
| case the node cannot be TOP_OF_FABRIC and its level cannot be | case the node cannot be TOP_OF_FABRIC and its level cannot be | |||
| configured. RIFT will fully configure the node's level after it is | configured. RIFT will fully determine the node's level after it is | |||
| attached to the topology and ensure that the node is at the "bottom | attached to the topology and ensure that the node is at the "bottom | |||
| of the hierarchy" (southernmost). | of the hierarchy" (southernmost). | |||
| 5.5. Mis-cabling Examples | 5.5. Mis-cabling Examples | |||
| +----------------+ +-----------------+ | +----------------+ +-----------------+ | |||
| | ToF21 | +------+ ToF22 | LEVEL 2 | | ToF21 | +------+ ToF22 | LEVEL 2 | |||
| +-------+----+---+ | +----+---+--------+ | +-------+----+---+ | +----+---+--------+ | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | | | +----------------------------+ | | | | | +----------------------------+ | | |||
| skipping to change at page 21, line 20 ¶ | skipping to change at page 21, line 25 ¶ | |||
| and Spine121 belong to two different PoDs, the adjacency between | and Spine121 belong to two different PoDs, the adjacency between | |||
| Leaf112 and Spine121 cannot be formed. link-W would be detected and | Leaf112 and Spine121 cannot be formed. link-W would be detected and | |||
| prevented. | prevented. | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 | |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | | | +-----------------+ | | | | | | | +-----------------+ | | | | |||
| | +--------------------------+ | | | | | | +--------------------------+ | | | | | |||
| | | | | | | | | | ||||
| | +------+ | | | +------+ | | | +------+ | | | +------+ | | |||
| | | +-----------------+ | | | | | | | | +-----------------+ | | | | | | |||
| | | | +--------------------------+ | | | | | | +--------------------------+ | | | |||
| | A | | B | | A | | B | | | A | | B | | A | | B | | |||
| +-----+--+ +-+---+--+ +--+---+-+ +--+-----+ | +-----+--+ +-+---+--+ +--+---+-+ +--+-----+ | |||
| |Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1 | |Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1 | |||
| +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ | +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | +---------+ | | | +---------+ | | | +---------+ | | | +---------+ | | |||
| | | | | link-W | | | | | | | | | link-W | | | | | |||
| skipping to change at page 22, line 25 ¶ | skipping to change at page 22, line 25 ¶ | |||
| | | | | | | | | | | | | | | |||
| | +-------+ | | | | | +-------+ | | | | |||
| + + | | ====> | | | + + | | ====> | | | |||
| X X +------+ | +------+ | | X X +------+ | +------+ | | |||
| + + | | | | | + + | | | | | |||
| +----+--+ +-+-----+ +-+-----+ | +----+--+ +-+-----+ +-+-----+ | |||
| |Spine11| |Spine12| |Spine12| | |Spine11| |Spine12| |Spine12| | |||
| +-+---+-+ ++----+-+ ++----+-+ | +-+---+-+ ++----+-+ ++----+-+ | |||
| | | | | | | | | | | | | | | |||
| | +---------+ | | | | | +---------+ | | | | |||
| | | | | | | | ||||
| | +-------+ | | +-------+ | | | +-------+ | | +-------+ | | |||
| | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ | +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ | |||
| |Leaf111| |Leaf112| |Leaf111| |Leaf112| | |Leaf111| |Leaf112| |Leaf111| |Leaf112| | |||
| +-------+ +-------+ +-+-----+ +-+-----+ | +-------+ +-------+ +-+-----+ +-+-----+ | |||
| | | | | | | |||
| | +--------+ | | +--------+ | |||
| | | | | | | |||
| +-+---+-+ | +-+---+-+ | |||
| |Spine11| | |Spine11| | |||
| skipping to change at page 23, line 7 ¶ | skipping to change at page 22, line 52 ¶ | |||
| specific route southwards as an exception to the aggregated fabric- | specific route southwards as an exception to the aggregated fabric- | |||
| default north. Disaggregation is useful when a prefix within the | default north. Disaggregation is useful when a prefix within the | |||
| aggregation is reachable via some of the parents but not the others | aggregation is reachable via some of the parents but not the others | |||
| at the same level of the fabric. It is mandatory when the level is | at the same level of the fabric. It is mandatory when the level is | |||
| the ToF since a ToF node that cannot reach a prefix becomes a black | the ToF since a ToF node that cannot reach a prefix becomes a black | |||
| hole for that prefix. The hard problem is to know which prefixes are | hole for that prefix. The hard problem is to know which prefixes are | |||
| reachable by whom. | reachable by whom. | |||
| In the general case, [RIFT] solves that problem by interconnecting | In the general case, [RIFT] solves that problem by interconnecting | |||
| the ToF nodes. So the ToF nodes can exchange the full list of | the ToF nodes. So the ToF nodes can exchange the full list of | |||
| prefixes that exist in the fabric and figure when a ToF node lacks | prefixes that exist in the fabric and figure out when a ToF node | |||
| reachability and to existing prefix. This requires additional ports | lacks reachability to some prefixes. This requires additional ports | |||
| at the ToF, typically 2 ports per ToF node to form a ToF-spanning | at the ToF, typically 2 ports per ToF node to form a ToF-spanning | |||
| ring. [RIFT] also defines the southbound reflection procedure that | ring. [RIFT] also defines the southbound reflection procedure that | |||
| enables a parent to explore the direct connectivity of its peers, | enables a parent to explore the direct connectivity of its peers, | |||
| meaning their own parents and children; based on the advertisements | meaning their own parents and children; based on the advertisements | |||
| received from the shared parents and children, it may enable the | received from the shared parents and children, it may enable the | |||
| parent to infer the prefixes its peers can reach. | parent to infer the prefixes its peers can reach. | |||
| When a parent lacks reachability to a prefix, it may disaggregate the | When a parent lacks reachability to a prefix, it may disaggregate the | |||
| prefix negatively, i.e., advertise that this parent can be used to | prefix negatively, i.e., advertise that this parent can be used to | |||
| reach any prefix in the aggregation except that one. The Negative | reach any prefix in the aggregation except that one. The Negative | |||
| skipping to change at page 24, line 41 ¶ | skipping to change at page 24, line 36 ¶ | |||
| leaf is obsolete, and a stale route may exist for a while. The | leaf is obsolete, and a stale route may exist for a while. The | |||
| common parent needs to select the freshest route advertisement in | common parent needs to select the freshest route advertisement in | |||
| order to install the correct route via the next-leaf. This requires | order to install the correct route via the next-leaf. This requires | |||
| that the fabric determines the sequence of the movements of the | that the fabric determines the sequence of the movements of the | |||
| mobile node. | mobile node. | |||
| On the one hand, a classical sequence counter provides a total order | On the one hand, a classical sequence counter provides a total order | |||
| for a while but it will eventually wrap. On the other hand, a | for a while but it will eventually wrap. On the other hand, a | |||
| timestamp provides a permanent order but it may miss a movement that | timestamp provides a permanent order but it may miss a movement that | |||
| happens too quickly vs. the granularity of the timing information. | happens too quickly vs. the granularity of the timing information. | |||
| It is not envisioned in the short term that the average fabric | It is not envisioned that an average fabric supports Precision Time | |||
| supports a Precision Time Protocol [IEEEstd1588], and the precision | Protocol [IEEEstd1588] in the short term, nor that the precision | |||
| that may be available with the Network Time Protocol [RFC5905], in | available with the Network Time Protocol [RFC5905] (in the order of | |||
| the order of 100 to 200ms, may not be necessarily enough to cover, | 100 to 200ms) is necessarily enough to cover, e.g., the fast | |||
| e.g., the fast mobility of a Virtual Machine. | mobility of a Virtual Machine. | |||
| Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that | Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that | |||
| combines a sequence counter from the mobile node and a timestamp from | combines a sequence counter from the mobile node and a timestamp from | |||
| the network taken at the leaf when the route is injected. If the | the network taken at the leaf when the route is injected. If the | |||
| timestamps of the concurrent advertisements are comparable (i.e., | timestamps of the concurrent advertisements are comparable (i.e., | |||
| more distant than the precision of the timing protocol), then the | more distant than the precision of the timing protocol), then the | |||
| timestamp alone is used to determine the relative freshness of the | timestamp alone is used to determine the relative freshness of the | |||
| routes. Otherwise, the sequence counter from the mobile node, if | routes. Otherwise, the sequence counter from the mobile node, if | |||
| available, is used. One caveat is that the sequence counter must not | available, is used. One caveat is that the sequence counter must not | |||
| wrap within the precision of the timing protocol. Another is that | wrap within the precision of the timing protocol. Another is that | |||
| skipping to change at page 26, line 37 ¶ | skipping to change at page 26, line 47 ¶ | |||
| +---+----+ +---+----+ | +---+----+ +---+----+ | |||
| | V4 | | V4 | | | V4 | | V4 | | |||
| | subnet | | subnet | | | subnet | | subnet | | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| Figure 9: IPv4 over IPv6 | Figure 9: IPv4 over IPv6 | |||
| 5.9. In-Band Reachability of Nodes | 5.9. In-Band Reachability of Nodes | |||
| RIFT doesn't precondition that nodes of the fabric have reachable | RIFT doesn't precondition that nodes of the fabric have reachable | |||
| addresses. But the operational purposes to reach the internal nodes | addresses. But the operational reasons to reach the internal nodes | |||
| may exist. Figure 10 shows an example in which the network management | may exist. Figure 10 shows an example in which the network management | |||
| station (NMS) attaches to leaf1. | station (NMS) attaches to leaf1. | |||
| +-------+ +-------+ | +-------+ +-------+ | |||
| | ToF1 | | ToF2 | | | ToF1 | | ToF2 | | |||
| ++---- ++ ++-----++ | ++---- ++ ++-----++ | |||
| | | | | | | | | | | |||
| | +----------+ | | | +----------+ | | |||
| | +--------+ | | | | +--------+ | | | |||
| | | | | | | | | | | |||
| skipping to change at page 27, line 42 ¶ | skipping to change at page 27, line 42 ¶ | |||
| NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | |||
| ToF2-Spine2. | ToF2-Spine2. | |||
| If NMS wants to access ToF2, ToF2's loopback address needs to be | If NMS wants to access ToF2, ToF2's loopback address needs to be | |||
| injected into its Prefix South TIE. This TIE must be seen by all | injected into its Prefix South TIE. This TIE must be seen by all | |||
| nodes at the level below - the spine nodes in Figure 10 - that must | nodes at the level below - the spine nodes in Figure 10 - that must | |||
| form a ceiling for all the traffic coming from below (south). | form a ceiling for all the traffic coming from below (south). | |||
| Otherwise, the traffic from NMS may follow the default route to the | Otherwise, the traffic from NMS may follow the default route to the | |||
| wrong ToF Node, e.g., ToF1. | wrong ToF Node, e.g., ToF1. | |||
| In a fully connected ToF, in case of failure between ToF2 and spine | In case of failure between ToF2 and spine nodes, ToF2's loopback | |||
| nodes, ToF2's loopback address must be disaggregated recursively all | address must be disaggregated recursively all the way to the leaves. | |||
| the way to the leaves. | In a partitioned ToF, even with recursive disaggregation a ToF node | |||
| is only reachable within its plane. | ||||
| In a partitioned ToF, a TOF node is only reachable within its Plane, | A possible alternative to recursive disaggregation is to use a ring | |||
| and the disaggregation to the leaves is also required. A possible | that interconnects the ToF nodes to transmit packets between them for | |||
| alternative is to use the ring that interconnects the ToF nodes to | their loopback addresses only. The idea is that this is mostly | |||
| transmit packets between them for their loopback addresses only. The | control traffic and should not alter the load balancing properties of | |||
| idea is that this is mostly control traffic and should not alter the | the fabric. | |||
| load balancing properties of the fabric. | ||||
| 5.10. Dual Homing Servers | 5.10. Dual Homing Servers | |||
| Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. It | Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. It | |||
| has no configuration (unless it is a Top-of-Fabric at the top of the | has no configuration (unless it is a Top-of-Fabric at the top of the | |||
| topology or it must operate in the topology as a leaf and/or support | topology or it must operate in the topology as a leaf and/or support | |||
| leaf-2-leaf procedures) and it will fully configure itself after | leaf-2-leaf procedures) and it will fully configure itself after | |||
| being attached to the topology. | being attached to the topology. | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| |ToF| |ToF| |ToF| ToF | |ToF| |ToF| |ToF| ToF | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | |||
| | +----------------+ | | | | +----------------+ | | | |||
| | | | | | | | ||||
| | +----------------+ | | | +----------------+ | | |||
| | | | | | | | | | | | | | | |||
| +----------+--+ +--+----------+ | +----------+--+ +--+----------+ | |||
| | ToR1 | | ToR2 | Spine | | ToR1 | | ToR2 | Spine | |||
| +--+------+---+ +--+-------+--+ | +--+------+---+ +--+-------+--+ | |||
| +---+ | | | | | | +---+ | +---+ | | | | | | +---+ | |||
| | | | | | | | | | ||||
| | +-----------------+ | | | | | +-----------------+ | | | | |||
| | | | +-------------+ | | | | | | +-------------+ | | | |||
| + | + | | |-----------------+ | | + | + | | +-----------------+ | | |||
| X | X | +--------x-----+ | X | | X | X | +--------x-----+ | X | | |||
| + | + | | | + | | + | + | | | + | | |||
| +---+ +---+ +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | | | | | |||
| +---+ +---+ ...............+---+ +---+ | +---+ +---+ ...............+---+ +---+ | |||
| SV(1) SV(2) SV(n+1) SV(n) Leaf | SV(1) SV(2) SV(n-1) SV(n) Leaf | |||
| Figure 11: Dual-homing servers | Figure 11: Dual-homing servers | |||
| In the single plane, the worst condition is disaggregation of every | In the single plane, the worst condition is disaggregation of every | |||
| other server at the same level. Suppose the links from ToR1 (Top of | other server at the same level. Suppose the links from ToR1 (Top of | |||
| Rack) to all the leaves become unavailable. All the servers' | Rack) to all the leaves become unavailable. All the servers' | |||
| routes are disaggregated and the FIB of the servers will be expanded | routes are disaggregated and the FIB of the servers will be expanded | |||
| with n-1 more specific routes. | with n-1 more specific routes. | |||
| Sometimes, people may prefer to disaggregate from ToR to servers from | Sometimes, people may prefer to disaggregate from ToR to servers from | |||
| skipping to change at page 29, line 42 ¶ | skipping to change at page 29, line 36 ¶ | |||
| +-----+ +-----+ | +-----+ +-----+ | |||
| Figure 12: Fabric with a controller | Figure 12: Fabric with a controller | |||
| 5.11.1. Controller Attached to ToFs | 5.11.1. Controller Attached to ToFs | |||
| If a controller is attaching to the RIFT domain from ToF, it usually | If a controller is attaching to the RIFT domain from ToF, it usually | |||
| uses dual-homing connections. The loopback prefix of the controller | uses dual-homing connections. The loopback prefix of the controller | |||
| should be advertised down by the ToF and spine to leaves. If the | should be advertised down by the ToF and spine to leaves. If the | |||
| controller loses link to ToF, make sure the ToF withdraws the prefix | controller loses link to ToF, make sure the ToF withdraws the prefix | |||
| of the controller(use different mechanisms). | of the controller. | |||
| 5.11.2. Controller Attached to Leaf | 5.11.2. Controller Attached to Leaf | |||
| If the controller is attaching from a leaf to the fabric, no special | If the controller is attaching from a leaf to the fabric, no special | |||
| provisions are needed. | provisions are needed. | |||
| 5.12. Internet Connectivity With Underlay | 5.12. Internet Connectivity Within Underlay | |||
| If global addressing is running without overlay, an external default | If global addressing is running without overlay, an external default | |||
| route needs to be advertised through RIFT fabric to achieve internet | route needs to be advertised through RIFT fabric to achieve internet | |||
| connectivity. For the purpose of forwarding of the entire RIFT | connectivity. For the purpose of forwarding of the entire RIFT | |||
| fabric, an internal fabric prefix needs to be advertised in the South | fabric, an internal fabric prefix needs to be advertised in the South | |||
| Prefix TIE by ToF and spine nodes. | Prefix TIE by ToF and spine nodes. | |||
| 5.12.1. Internet Default on the Leaf | 5.12.1. Internet Default on the Leaf | |||
| In case that an internet access request comes from a leaf and the | In case that the internet gateway is a leaf, the leaf node as the | |||
| internet gateway is another leaf, the leaf node as the internet | internet gateway needs to advertise a default route in its Prefix | |||
| gateway needs to advertise a default route in its Prefix North TIE. | North TIE. | |||
| 5.12.2. Internet Default on the ToFs | 5.12.2. Internet Default on the ToFs | |||
| In case that an internet access request comes from a leaf and the | In case that the internet gateway is a ToF, the ToF and spine nodes | |||
| internet gateway is a ToF, the ToF and spine nodes need to advertise | need to advertise a default route in the Prefix South TIE. | |||
| a default route in the Prefix South TIE. | ||||
| 5.13. Subnet Mismatch and Address Families | 5.13. Subnet Mismatch and Address Families | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| | | LIE LIE | | | | | LIE LIE | | | |||
| | A | +----> <----+ | B | | | A | +----> <----+ | B | | |||
| | +---------------------+ | | | +---------------------+ | | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| X/24 Y/24 | X/24 Y/24 | |||
| skipping to change at page 31, line 12 ¶ | skipping to change at page 31, line 4 ¶ | |||
| node A and B may form, but the forwarding between node A and node B | node A and B may form, but the forwarding between node A and node B | |||
| may fail because subnet X mismatches with subnet Y. | may fail because subnet X mismatches with subnet Y. | |||
| To prevent this a RIFT implementation should check for subnet | To prevent this a RIFT implementation should check for subnet | |||
| mismatch just like e.g. ISIS does. This can lead to scenarios where | mismatch just like e.g. ISIS does. This can lead to scenarios where | |||
| an adjacency, despite exchange of LIEs in both address families may | an adjacency, despite exchange of LIEs in both address families may | |||
| end up having an adjacency in a single AF only. This is a | end up having an adjacency in a single AF only. This is a | |||
| consideration especially in Section 5.8 scenarios. | consideration especially in Section 5.8 scenarios. | |||
| 5.14. Anycast Considerations | 5.14. Anycast Considerations | |||
| + traffic | + traffic | |||
| | | | | |||
| v | v | |||
| +------+------+ | +------+------+ | |||
| | ToF | | | ToF | | |||
| +---+-----+---+ | +---+-----+---+ | |||
| | | | | | | | | | | |||
| +------------+ | | +------------+ | +------------+ | | +------------+ | |||
| | | | | | | | | | | |||
| +---+---+ +-------+ +-------+ +---+---+ | +---+---+ +-------+ +-------+ +---+---+ | |||
| | | | | | | | | | | | | | | | | | | |||
| |Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1 | |Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1 | |||
| +-+---+-+ ++----+-+ +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ +-+---+-+ ++----+-+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | +---------+ | | +---------+ | | | +---------+ | | +---------+ | | |||
| | | | | | | | | | ||||
| | +-------+ | | | +-------+ | | | | +-------+ | | | +-------+ | | | |||
| | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | |||
| | | | | | | | | | | | | | | | | | | |||
| |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |||
| +-+-----+ ++------+ +-----+-+ +-----+-+ | +-+-----+ ++------+ +-----+-+ +-----+-+ | |||
| + + + ^ | | + + + ^ | | |||
| PrefixA PrefixB PrefixA | PrefixC | PrefixA PrefixB PrefixA | PrefixC | |||
| | | | | |||
| + traffic | + traffic | |||
| Figure 14: Anycast | Figure 14: Anycast | |||
| If the traffic comes from ToF to Leaf111 or Leaf121 which has anycast | If the traffic comes from ToF to Leaf111 or Leaf121 which has anycast | |||
| prefix PrefixA. RIFT can deal with this case well. But if the | prefix PrefixA, RIFT can deal with this case well. But if the | |||
| traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1. | traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1. | |||
| Spine21 and Spine22 do not know of the other PrefixA attached to | Spine21 and Spine22 do not know of the other PrefixA attached to | |||
| Leaf111, so the traffic always reaches Leaf121 and never Leaf111. | Leaf111, so the traffic always reaches Leaf121 and never Leaf111. | |||
| If the intention is that the traffic should be offloaded to | If the intention is that the traffic should be offloaded to | |||
| Leaf111, then use policy guided prefixes defined in RIFT [RIFT]. | Leaf111, then use policy guided prefixes defined in RIFT [RIFT]. | |||
| 5.15. IoT Applicability | 5.15. IoT Applicability | |||
| The design of RIFT inherits from RPL [RFC6550] the anisotropic design | The design of RIFT inherits from RPL [RFC6550] the anisotropic design | |||
| of a default route upwards (northwards); it also inherits the | of a default route upwards (northwards); it also inherits the | |||
| capability to inject external host routes at the Leaf level using | capability to inject external host routes at the Leaf level using | |||
| Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host | Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host | |||
| and a RIFT router. Both the RPL and the RIFT protocols are meant for | and a RIFT router. Both the RPL and the RIFT protocols are meant for | |||
| large scale, and WiND enables device mobility at the edge the same | large scale, and WiND enables device mobility at the edge the same | |||
| way in both cases. | way in both cases. | |||
| The main difference between RIFT and RPL is that with RPL, there's a | The main difference between RIFT and RPL is that with RPL, there's a | |||
| single Root, whereas RIFT has many ToF nodes. The adds huge | single Root, whereas RIFT has many ToF nodes. This adds huge | |||
| capabilities for leaf-2-leaf ECMP paths, but additional complexity | capabilities for leaf-2-leaf ECMP paths, but additional complexity | |||
| with the need to disaggregate. Also RIFT uses Link State flooding | with the need to disaggregate. Also RIFT uses Link State flooding | |||
| northwards, and is not designed for low-power operation. | northwards, and is not designed for low-power operation. | |||
| Still nothing prevents that the IP devices connected at the Leaf are | Still nothing prevents that the IP devices connected at the Leaf are | |||
| IoT (Internet of Things) devices, which typically expose their | IoT (Internet of Things) devices, which typically expose their | |||
| address using WiND - which is an upgrade from 6LoWPAN ND [RFC6775]. | address using WiND - which is an upgrade from 6LoWPAN ND [RFC6775]. | |||
| A network that serves high speed/ high power IoT devices should | A network that serves high speed/ high power IoT devices should | |||
| typically provide deterministic capabilities for applications such as | typically provide deterministic capabilities for applications such as | |||
| high speed control loops or movement detection. The Fat Tree is | high speed control loops or movement detection. The Fat Tree is | |||
| highly reliable, and in normal condition provides an equilatent | highly reliable, and in normal condition provides an equivalent | |||
| multipath operation; but the ECMP doesn't provide hard guarantees for | multipath operation; but the ECMP doesn't provide hard guarantees for | |||
| either delivery or latency. As long as the fabric is non-blocking | either delivery or latency. As long as the fabric is non-blocking | |||
| the result is the same; but there can be load imbalances resulting in | the result is the same; but there can be load imbalances resulting in | |||
| incast and possibly congestion loss that will prevent the delivery | incast and possibly congestion loss that will prevent the delivery | |||
| within bounded latency. | within bounded latency. | |||
| This could be alleviated with Packet Replication, Elimination and | This could be alleviated with Packet Replication, Elimination and | |||
| Reordering (PREOF) [RFC8655] leaf-2-leaf but PREOF is hard to provide | Reordering (PREOF) [RFC8655] leaf-2-leaf but PREOF is hard to provide | |||
| at the scale of all flows, and the replication may increase the | at the scale of all flows, and the replication may increase the | |||
| probability of the overload that it attempts to solve. | probability of the overload that it attempts to solve. | |||
| End of changes. 56 change blocks. | ||||
| 122 lines changed or deleted | 110 lines changed or added | |||
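The subnet mismatch check recommended in Section 5.13 (unchanged between the two versions) could look roughly like the sketch below. It is an illustration only, not taken from the draft or from any RIFT implementation: the function names `same_subnet` and `usable_address_families` and the string-based address format are assumptions. It shows how an adjacency that exchanges LIEs in both address families may end up usable in only one of them.

```python
import ipaddress


def same_subnet(local_if: str, remote_if: str) -> bool:
    """Return True if both interface addresses (e.g. '10.0.1.1/24')
    belong to the same address family and the same subnet.

    Hypothetical helper; a real implementation would take the addresses
    from its own interface state and from the neighbor's LIE packets.
    """
    a = ipaddress.ip_interface(local_if)
    b = ipaddress.ip_interface(remote_if)
    if a.version != b.version:
        return False
    return a.network == b.network


def usable_address_families(local_addrs, remote_addrs):
    """Return the set of IP versions (4 and/or 6) for which at least one
    local/remote address pair shares a subnet."""
    usable = set()
    for local in local_addrs:
        for remote in remote_addrs:
            if same_subnet(local, remote):
                usable.add(ipaddress.ip_interface(local).version)
    return usable


# Node A uses X/24 and node B uses Y/24 for IPv4, but they share an IPv6 /64.
node_a = ["10.0.1.1/24", "2001:db8::1/64"]
node_b = ["10.0.2.1/24", "2001:db8::2/64"]
print(usable_address_families(node_a, node_b))
```

In this example the IPv4 subnets differ, so only IPv6 passes the check. This corresponds to the single-AF adjacency outcome the section describes.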