| < draft-ietf-rift-applicability-03.txt | draft-ietf-rift-applicability-04.txt > | |||
|---|---|---|---|---|
| RIFT WG Yuehua. Wei, Ed. | RIFT WG Yuehua. Wei, Ed. | |||
| Internet-Draft Zheng. Zhang | Internet-Draft Zheng. Zhang | |||
| Intended status: Informational ZTE Corporation | Intended status: Informational ZTE Corporation | |||
| Expires: 16 April 2021 Dmitry. Afanasiev | Expires: 24 July 2021 Dmitry. Afanasiev | |||
| Yandex | Yandex | |||
| Tom. Verhaeg | Tom. Verhaeg | |||
| Juniper Networks | Juniper Networks | |||
| Jaroslaw. Kowalczyk | Jaroslaw. Kowalczyk | |||
| Orange Polska | Orange Polska | |||
| P. Thubert | P. Thubert | |||
| Cisco Systems | Cisco Systems | |||
| 13 October 2020 | 20 January 2021 | |||
| RIFT Applicability | RIFT Applicability | |||
| draft-ietf-rift-applicability-03 | draft-ietf-rift-applicability-04 | |||
| Abstract | Abstract | |||
| This document discusses the properties, applicability and operational | This document discusses the properties, applicability and operational | |||
| considerations of RIFT in different network scenarios. It intends to | considerations of RIFT in different network scenarios. It intends to | |||
| provide a rough guide how RIFT can be deployed to simplify routing | provide a rough guide how RIFT can be deployed to simplify routing | |||
| operations in Clos topologies and their variations. | operations in Clos topologies and their variations. | |||
| Status of This Memo | Status of This Memo | |||
| skipping to change at page 1, line 41 ¶ | skipping to change at page 1, line 41 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on 16 April 2021. | This Internet-Draft will expire on 24 July 2021. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2020 IETF Trust and the persons identified as the | Copyright (c) 2021 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents (https://trustee.ietf.org/ | |||
| license-info) in effect on the date of publication of this document. | license-info) in effect on the date of publication of this document. | |||
| Please review these documents carefully, as they describe your rights | Please review these documents carefully, as they describe your rights | |||
| and restrictions with respect to this document. Code Components | and restrictions with respect to this document. Code Components | |||
| extracted from this document must include Simplified BSD License text | extracted from this document must include Simplified BSD License text | |||
| as described in Section 4.e of the Trust Legal Provisions and are | as described in Section 4.e of the Trust Legal Provisions and are | |||
| provided without warranty as described in the Simplified BSD License. | provided without warranty as described in the Simplified BSD License. | |||
| skipping to change at page 2, line 26 ¶ | skipping to change at page 2, line 26 ¶ | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2. Problem Statement of Routing in Modern IP Fabric Fat Tree | 2. Problem Statement of Routing in Modern IP Fabric Fat Tree | |||
| Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 | Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 | 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 | |||
| 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 4 | 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 4 | |||
| 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 6 | 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 6 | |||
| 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 | 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 | |||
| 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 | 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 | |||
| 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 7 | 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 7 | |||
| 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 | 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 8 | 3.3.1. Data Center Fabrics . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 8 | 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 8 | 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 9 | 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 9 | |||
| 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 9 | 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 11 | 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 11 | |||
| 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 12 | 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 12 | |||
| 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 12 | 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 12 | |||
| 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 14 | 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 14 | |||
| 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 15 | 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 15 | |||
| 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 15 | 4.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 15 | |||
| 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 18 | 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 17 | |||
| 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 19 | 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 19 | |||
| 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 21 | 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 22 | 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 22 | |||
| 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 23 | 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 23 | |||
| 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 24 | 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 24 | |||
| 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 24 | 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 24 | |||
| 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 24 | 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 25 | |||
| 4.12. Internet Connectivity With Underlay . . . . . . . . . . . 25 | 4.12. Internet Connectivity With Underlay . . . . . . . . . . . 25 | |||
| 4.12.1. Internet Default on the Leaf . . . . . . . . . . . . 25 | 4.12.1. Internet Default on the Leaf . . . . . . . . . . . . 25 | |||
| 4.12.2. Internet Default on the ToFs . . . . . . . . . . . . 25 | 4.12.2. Internet Default on the ToFs . . . . . . . . . . . . 25 | |||
| 4.13. Subnet Mismatch and Address Families . . . . . . . . . . 25 | 4.13. Subnet Mismatch and Address Families . . . . . . . . . . 25 | |||
| 4.14. Anycast Considerations . . . . . . . . . . . . . . . . . 26 | 4.14. Anycast Considerations . . . . . . . . . . . . . . . . . 26 | |||
| 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 | 4.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 27 | |||
| 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 27 | 5. Security Considerations . . . . . . . . . . . . . . . . . . . 27 | |||
| 7. Normative References . . . . . . . . . . . . . . . . . . . . 27 | 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 28 | |||
| 8. Informative References . . . . . . . . . . . . . . . . . . . 28 | 7. Normative References . . . . . . . . . . . . . . . . . . . . 28 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 28 | 8. Informative References . . . . . . . . . . . . . . . . . . . 29 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 30 | ||||
| 1. Introduction | 1. Introduction | |||
| This document intends to explain the properties and applicability of | This document intends to explain the properties and applicability of | |||
| "Routing in Fat Trees" [RIFT] in different deployment scenarios and | "Routing in Fat Trees" [RIFT] in different deployment scenarios and | |||
| highlight the operational simplicity of the technology compared to | highlight the operational simplicity of the technology compared to | |||
| traditional routing solutions. It also documents special | traditional routing solutions. It also documents special | |||
| considerations when RIFT is used with or without overlays, | considerations when RIFT is used with or without overlays, with or | |||
| controllers and corrects topology miscablings and/or node and link | without controllers, corrects topology mis-cablings, and node or link | |||
| failures. | failures. | |||
| 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | |||
| Clos and Fat-Tree topologies have gained prominence in today's | Clos [CLOS] and fat tree [FATTREE] topologies have gained prominence | |||
| networking, primarily as result of the paradigm shift towards a | in today's networking, primarily as a result of the paradigm shift | |||
| centralized data-center based architecture that is poised to deliver | towards a centralized data-center based architecture that delivers a | |||
| a majority of computation and storage services in the future. | majority of computation and storage services. | |||
| Today's current routing protocols were geared towards a network with | Today's current routing protocols were geared towards a network with | |||
| an irregular topology and low degree of connectivity originally. | an irregular topology and low degree of connectivity originally. | |||
| When they are applied to Fat-Tree topologies: | When they are applied to fat tree topologies: | |||
| * they tend to need extensive configuration or provisioning during | * They tend to need extensive configuration or provisioning during | |||
| bring up and re-dimensioning. | bring up and re-dimensioning. | |||
| * spine and leaf nodes have the entire network topology and routing | * Spine and leaf nodes have the entire network topology and routing | |||
| information, which is in fact, not needed on the leaf nodes during | information which is in fact not needed on the leaf nodes during | |||
| normal operation. | normal operation. | |||
| * significant Link State PDUs (LSPs) flooding duplication between | * Significant Link State PDUs (LSPs) flooding duplication between | |||
| spine nodes and leaf nodes occurs during network bring up and | spine nodes and leaf nodes occurs during network bring up and | |||
| topology updates. It consumes both spine and leaf nodes' CPU and | topology updates. It consumes both spine and leaf nodes' CPU and | |||
| link bandwidth resources and with that limits protocol | link bandwidth resources. | |||
| scalability. | ||||
| 3. Applicability of RIFT to Clos IP Fabrics | 3. Applicability of RIFT to Clos IP Fabrics | |||
| Further content of this document assumes that the reader is familiar | Further content of this document assumes that the reader is familiar | |||
| with the terms and concepts used in OSPF [RFC2328] and IS-IS | with the terms and concepts used in OSPF [RFC2328] and IS-IS | |||
| [ISO10589-Second-Edition] link-state protocols and at least the | [ISO10589-Second-Edition] link-state protocols. The sections of RIFT | |||
| sections of [RIFT] outlining the requirement of routing in IP fabrics | [RIFT] outline the requirements of routing in IP fabrics and RIFT | |||
| and RIFT protocol concepts. | protocol concepts. | |||
| 3.1. Overview of RIFT | 3.1. Overview of RIFT | |||
| RIFT is a dynamic routing protocol for Clos and fat-tree network | RIFT is a dynamic routing protocol for Clos and fat tree network | |||
| topologies. It defines a link-state protocol when "pointing north" | topologies. It defines a link-state protocol when "pointing north" | |||
| and path-vector protocol when "pointing south". | and path-vector protocol when "pointing south". | |||
| It floods flat link-state information northbound only so that each | It floods flat link-state information northbound only so that each | |||
| level obtains the full topology of levels south of it. That | level obtains the full topology of levels south of it. That | |||
| information is never flooded east-west or back South again. So a top | information is never flooded east-west or back south again. So a top | |||
| tier node has full set of prefixes from the SPF calculation. | tier node has full set of prefixes from the Shortest Path First (SPF) | |||
| calculation. | ||||
| In the southbound direction the protocol operates like a "fully | In the southbound direction, the protocol operates like a "fully | |||
| summarizing, unidirectional" path vector protocol or rather a | summarizing, unidirectional" path vector protocol or rather a | |||
| distance vector with implicit split horizon whereas the information | distance vector with implicit split horizon. Routing information, | |||
| propagates one hop south and is 're-advertised' by nodes at next | normally just the default route, propagates one hop south and is 're- | |||
| lower level, normally just the default route. | advertised' by nodes at next lower level. | |||
| +-----------+ +-----------+ | +-----------+ +-----------+ | |||
| | ToF | | ToF | LEVEL 2 | | ToF | | ToF | LEVEL 2 | |||
| + +-----+--+--+ +-+--+------+ | + +-----+--+--+ +-+--+------+ | |||
| | | | | | | | | | ^ | | | | | | | | | | ^ | |||
| + | | | +-------------------------+ | | + | | | +-------------------------+ | | |||
| Distance | +-------------------+ | | | | | | Distance | +-------------------+ | | | | | | |||
| Vector | | | | | | | | + | Vector | | | | | | | | + | |||
| South | | | | +--------+ | | | Link-state | South | | | | +--------+ | | | Link-state | |||
| + | | | | | | | | Flooding | + | | | | | | | | Flooding | |||
| skipping to change at page 4, line 47 ¶ | skipping to change at page 4, line 48 ¶ | |||
| Distance | +-------+ | | +--------+ | | | E | Distance | +-------+ | | +--------+ | | | E | |||
| Vector | | | | | | | | | +------> | Vector | | | | | | | | | +------> | |||
| South | +-------+ | | | +-------+ | | | | | South | +-------+ | | | +-------+ | | | | | |||
| + | | | | | | | | | + | + | | | | | | | | | + | |||
| v ++--++ +-+-++ ++-+-+ +-+--++ + | v ++--++ +-+-++ ++-+-+ +-+--++ + | |||
| |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 | |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 | |||
| +----+ +----+ +----+ +-----+ | +----+ +----+ +----+ +-----+ | |||
| Figure 1: Rift overview | Figure 1: Rift overview | |||
| A middle tier node has only information necessary for its level, | A spine node has only information necessary for its level, which is | |||
| which are all destinations south of the node based on SPF | all destinations south of the node based on SPF calculation, default | |||
| calculation, default route and potential disaggregated routes. | route, and potential disaggregated routes. | |||
| RIFT combines the advantage of both link-state and distance vector: | RIFT combines the advantage of both link-state and distance vector: | |||
| * Fastest Possible Convergence | * Fastest possible convergence | |||
| * Automatic Detection of Topology | * Automatic detection of topology | |||
| * Minimal Routes/Info on TORs | * Minimal routes/info on ToRs | |||
| * High Degree of ECMP | * High degree of ECMP | |||
| * Fast De-commissioning of Nodes | * Fast de-commissioning of nodes | |||
| * Maximum Propagation Speed with Flexible Prefixes in an Update | * Maximum propagation speed with flexible prefixes in an update | |||
| And RIFT eliminates the disadvantages of link-state or distance | And RIFT eliminates the disadvantages of link-state or distance | |||
| vector: | vector: | |||
| * Reduced and Balanced Flooding | * Reduced and balanced flooding | |||
| * Automatic Neighbor Detection | * Automatic neighbor detection | |||
| So there are two types of link-state database which are "north | So there are two types of link-state database which are "north | |||
| representation" N-TIEs and "south representation" S-TIEs. The N-TIEs | representation" North Topology Information Elements (N-TIEs) and | |||
| contain a link-state topology description of lower levels and S-TIEs | "south representation" South Topology Information Elements (S-TIEs). | |||
| carry simply default routes for the lower levels. | The N-TIEs contain a link-state topology description of lower levels | |||
| and S-TIEs carry simply default routes for the lower levels. | ||||
| There are a bunch of more advantages unique to RIFT listed below | There are more advantages unique to RIFT listed below which could be | |||
| which could be understood if you read the details of [RIFT]. | understood if you read the details of RIFT [RIFT]. | |||
| * True ZTP | * True ZTP | |||
| * Minimal Blast Radius on Failures | * Minimal blast radius on failures | |||
| * Can Utilize All Paths Through Fabric Without Looping | * Can utilize all paths through fabric without looping | |||
| * Automatic Disaggregation on Failures | * Automatic disaggregation on failures | |||
| * Simple Leaf Implementation that Can Scale Down to Servers | * Simple leaf implementation that can scale down to servers | |||
| * Key-Value Store | * Key-Value store | |||
| * Horizontal Links Used for Protection Only | * Horizontal links used for protection only | |||
| * Supports Non-Equal Cost Multipath and Can Replace MC-LAG | * Supports non-equal cost multipath and can replace MC-LAG | |||
| * Optimal Flooding Reduction and Load-Balancing | * Optimal flooding reduction and load-balancing | |||
| 3.2. Applicable Topologies | 3.2. Applicable Topologies | |||
| Albeit RIFT is specified primarily for "proper" Clos or "fat-tree" | Albeit RIFT is specified primarily for "proper" Clos or "fat tree" | |||
| structures, it already supports PoD concepts which are strictly | structures, it already supports Point of Delivery (PoD) concepts | |||
| speaking not found in original Clos concepts. | which are strictly speaking not found in original Clos concepts. | |||
| Further, the specification explains and supports operations of multi- | Further, the specification explains and supports operations of multi- | |||
| plane Clos variants where the protocol relies on set of rings to | plane Clos variants where the protocol relies on set of rings to | |||
| allow the reconciliation of topology view of different planes as most | allow the reconciliation of topology view of different planes as most | |||
| desirable solution making proper disaggregation viable in case of | desirable solution making proper disaggregation viable in case of | |||
| failures. These observations hold not only in case of RIFT but in | failures. These observations hold not only in case of RIFT but also | |||
| the generic case of dynamic routing on Clos variants with multiple | in the generic case of dynamic routing on Clos variants with multiple | |||
| planes and failures in bi-sectional bandwidth, especially on the | planes and failures in bi-sectional bandwidth, especially on the | |||
| leafs. | leafs. | |||
| 3.2.1. Horizontal Links | 3.2.1. Horizontal Links | |||
| RIFT is not limited to pure Clos divided into PoD and multi-planes | RIFT is not limited to pure Clos divided into PoD and multi-planes | |||
| but supports horizontal links below the top of fabric level. Those | but supports horizontal links below the top of fabric level. Those | |||
| links are used however only as routes of last resort northbound when | links are used only as routes of last resort northbound when a spine | |||
| a spine loses all northbound links or cannot compute a default route | loses all northbound links or cannot compute a default route through | |||
| through them. | them. | |||
| A possible configuration is a "ring" of horizontal links at a level. | A possible configuration is a "ring" of horizontal links at a level. | |||
| In presence of such a "ring" in any level (except ToF level) neither | In presence of such a "ring" in any level (except Top of Fabric (ToF) | |||
| N-SPF nor S-SPF will provide a "ring-based protection" scheme since | level) neither North SPF (N-SPF) nor South SPF (S-SPF) will provide a | |||
| such a computation would have to deal necessarily with breaking of | "ring-based protection" scheme since such a computation would have to | |||
| "loops" in Dijkstra sense; an application for which RIFT is not | deal necessarily with breaking of "loops" in Dijkstra sense; an | |||
| intended. | application for which RIFT is not intended. | |||
| A full-mesh connectivity between nodes on the same level can be | A full-mesh connectivity between nodes on the same level can be | |||
| employed and that allows N-SPF to provide for any node losing all | employed and that allows N-SPF to provide for any node losing all | |||
| its northbound adjacencies (as long as any of the other nodes in the | its northbound adjacencies (as long as any of the other nodes in the | |||
| level are northbound connected) to still participate in northbound | level are northbound connected) to still participate in northbound | |||
| forwarding. | forwarding. | |||
| 3.2.2. Vertical Shortcuts | 3.2.2. Vertical Shortcuts | |||
| Through relaxations of the specified adjacency forming rules RIFT | Through relaxations of the specified adjacency forming rules, RIFT | |||
| implementations can be extended to support vertical "shortcuts" as | implementations can be extended to support vertical "shortcuts" as | |||
| proposed by e.g. [I-D.white-distoptflood]. The RIFT specification | proposed by e.g. [I-D.white-distoptflood]. The RIFT specification | |||
| itself does not provide the exact details since the resulting | itself does not provide the exact details since the resulting | |||
| solution suffers from either much larger blast radius with increased | solution suffers from either much larger blast radius with increased | |||
| flooding volumes or in case of maximum aggregation routing bow-tie | flooding volumes or in case of maximum aggregation routing bow-tie | |||
| problems. | problems. | |||
| 3.2.3. Generalizing to any Directed Acyclic Graph | 3.2.3. Generalizing to any Directed Acyclic Graph | |||
| RIFT is an anisotropic routing protocol, meaning that it has a sense | RIFT is an anisotropic routing protocol, meaning that it has a sense | |||
| of direction (northbound, southbound, east-west) and that it operates | of direction (northbound, southbound, east-west) and that it operates | |||
| differently depending on the direction. | differently depending on the direction. | |||
| * Northbound, RIFT operates as a link-state IGP, whereby the control | * Northbound, RIFT operates as a link-state IGP, whereby the control | |||
| packets are reflooded first all the way North and only interpreted | packets are reflooded first all the way north and only interpreted | |||
| later. All the individual fine grained routes are advertised. | later. All the individual fine grained routes are advertised. | |||
| * Southbound, RIFT operates as a distance vector IGP, whereby the | * Southbound, RIFT operates as a distance vector IGP, whereby the | |||
| control packets are flooded only one hop, interpreted, and the | control packets are flooded only one hop, interpreted, and the | |||
| consequence of that computation is what gets flooded on more hop | consequence of that computation is what gets flooded one more hop | |||
| South. In the most common use-cases, a ToF node can reach most of | south. In the most common use-cases, a ToF node can reach most of | |||
| the prefixes in the fabric. If that is the case, the ToF node | the prefixes in the fabric. If that is the case, the ToF node | |||
| advertises the fabric default and disaggregates the prefixes that | advertises the fabric default and disaggregates the prefixes that | |||
| it cannot reach. On the other hand, a ToF Node that can reach | it cannot reach. On the other hand, a ToF node that can reach | |||
| only a small subset of the prefixes in the fabric will preferably | only a small subset of the prefixes in the fabric will preferably | |||
| advertise those prefixes and refrain from aggregating. | advertise those prefixes and refrain from aggregating. | |||
| In the general case, what gets advertised South is in more | In the general case, what gets advertised south is in more | |||
| details: | details: | |||
| 1. A fabric default that aggregates all the prefixes that are | 1. A fabric default that aggregates all the prefixes that are | |||
| reachable within the fabric, and that could be a default route | reachable within the fabric, and that could be a default route | |||
| or a prefix that is dedicated to this particular fabric. | or a prefix that is dedicated to this particular fabric. | |||
| 2. The loopback addresses of the northbound nodes, e.g., for | 2. The loopback addresses of the northbound nodes, e.g., for | |||
| inband management. | inband management. | |||
| 3. The disaggregated prefixes for the dynamic exceptions to the | 3. The disaggregated prefixes for the dynamic exceptions to the | |||
| fabric Default, advertised to route around the black hole that | fabric default, advertised to route around the black hole that | |||
| may form | may form. | |||
| * east-west routing can optionally be used, with specific | * East-west routing can optionally be used, with specific | |||
| restrictions. It is useful in particular when a sibling has | restrictions. It is useful in particular when a sibling has | |||
| access to the fabric default but this node does not. | access to the fabric default but this node does not. | |||
| A Directed Acyclic Graph (DAG) provides a sense of North (the | A Directed Acyclic Graph (DAG) provides a sense of north (the | |||
| direction of the DAG) and of South (the reverse), which can be used | direction of the DAG) and of south (the reverse), which can be used | |||
| to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has | to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has | |||
| only incoming edges is a ToF node. | only incoming edges is a ToF node. | |||
| There are a number of caveats though: | There are a number of caveats though: | |||
| * The DAG structure must exist before RIFT starts, so there is a | * The DAG structure must exist before RIFT starts, so there is a | |||
| need for a companion protocol to establish the logical DAG | need for a companion protocol to establish the logical DAG | |||
| structure. | structure. | |||
| * A generic DAG does not have a sense of east and west. The | * A generic DAG does not have a sense of east and west. The | |||
| skipping to change at page 8, line 18 ¶ | skipping to change at page 8, line 18 ¶ | |||
| * In order to aggregate and disaggregate routes, RIFT requires that | * In order to aggregate and disaggregate routes, RIFT requires that | |||
| all the ToF nodes share the full knowledge of the prefixes in the | all the ToF nodes share the full knowledge of the prefixes in the | |||
| fabric. This can be achieved with a ring as suggested by the RIFT | fabric. This can be achieved with a ring as suggested by the RIFT | |||
| main specification, by some preconfiguration, or using a | main specification, by some preconfiguration, or using a | |||
| synchronization with a common repository where all the active | synchronization with a common repository where all the active | |||
| prefixes are registered. | prefixes are registered. | |||
| 3.3. Use Cases | 3.3. Use Cases | |||
| 3.3.1. DC Fabrics | 3.3.1. Data Center Fabrics | |||
| RIFT is largely driven by demands and hence ideally suited for | RIFT is largely driven by demands and hence ideally suited for | |||
| application in underlay of data center IP fabrics, vast majority of | applying in data center (DC) IP fabrics underlay routing, vast | |||
| which seem to be currently (and for the foreseeable future) Clos | majority of which seem to be currently (and for the foreseeable | |||
| architectures. It significantly simplifies operation and deployment | future) Clos architectures. It significantly simplifies operation | |||
| of such fabrics as described in Section 4 for environments compared | and deployment of such fabrics as described in Section 4 for | |||
| to extensive proprietary provisioning and operational solutions. | environments compared to extensive proprietary provisioning and | |||
| operational solutions. | ||||
| 3.3.2. Metro Fabrics | 3.3.2. Metro Fabrics | |||
| The demand for bandwidth is increasing steadily, driven primarily by | The demand for bandwidth is increasing steadily, driven primarily by | |||
| environments close to content producers (server farms connection via | environments close to content producers (server farms connection via | |||
| DC fabrics) but in proximity to content consumers as well. Consumers | DC fabrics) but in proximity to content consumers as well. Consumers | |||
| are often clustered in metro areas with their own network | are often clustered in metro areas with their own network | |||
| architectures that can benefit from simplified, regular Clos | architectures that can benefit from simplified, regular Clos | |||
| structures and hence RIFT. | structures and hence RIFT. | |||
| 3.3.3. Building Cabling | 3.3.3. Building Cabling | |||
| Commercial edifices are often cabled in topologies that are either | Commercial edifices are often cabled in topologies that are either | |||
| Clos or its isomorphic equivalents. With many floors the Clos can | Clos or its isomorphic equivalents. The Clos can grow rather high | |||
| grow rather high and with that present a challenge for traditional | with many floors. That presents a challenge for traditional routing | |||
| routing protocols (except BGP and by now largely phased-out PNNI) | protocols (except BGP and by now largely phased-out PNNI) which do | |||
| which do not support an arbitrary number of levels which RIFT does | not support an arbitrary number of levels which RIFT does naturally. | |||
| naturally. Moreover, due to limited sizes of forwarding tables in | Moreover, due to the limited sizes of forwarding tables in network | |||
| active elements of building cabling the minimum FIB size RIFT | elements of building cabling, the minimum FIB size RIFT | |||
| maintains under normal conditions can prove particularly cost- | maintains under normal conditions is cost-effective in terms of | |||
| effective in terms of hardware and operational costs. | hardware and operational costs. | |||
| 3.3.4. Internal Router Switching Fabrics | 3.3.4. Internal Router Switching Fabrics | |||
| It is common in high-speed communications switching and routing | It is common in high-speed communications switching and routing | |||
| devices to use fabrics when a crossbar is not feasible due to cost, | devices to use fabrics when a crossbar is not feasible due to cost, | |||
| head-of-line blocking or size trade-offs. Normally such fabrics are | head-of-line blocking or size trade-offs. Normally such fabrics are | |||
| not self-healing or rely on 1:1 or 1+1 protection schemes but it is | not self-healing or rely on 1:1 or 1+1 protection schemes but it is | |||
| conceivable to use RIFT to operate Clos fabrics that can deal | conceivable to use RIFT to operate Clos fabrics that can deal | |||
| effectively with interconnections or subsystem failures in such | effectively with interconnections or subsystem failures in such | |||
| modules. RIFT is not IP specific and hence any link addressing | modules. RIFT is not IP specific and hence any link addressing | |||
| connecting internal device subnets is conceivable. | connecting internal device subnets is conceivable. | |||
| 3.3.5. CloudCO | 3.3.5. CloudCO | |||
| The Cloud Central Office (CloudCO) is a new stage of telecom Central | The Cloud Central Office (CloudCO) is a new stage of telecom Central | |||
| Office. It takes advantage of Software Defined Networking (SDN) | Office. It takes advantage of Software Defined Networking (SDN) | |||
| and Network Function Virtualization (NFV) in conjunction with general | and Network Function Virtualization (NFV) in conjunction with general | |||
| purpose hardware to optimize current networks. The following figure | purpose hardware to optimize current networks. The following figure | |||
| illustrates this architecture at a high level. It describes a single | illustrates this architecture at a high level. It describes a single | |||
| instance or macro-node of cloud CO. An Access I/O module faces a | instance or macro-node of cloud CO. An Access I/O module faces a | |||
| Cloud CO Access Node, and the CPEs behind it. A Network I/O module | Cloud CO access node, and the Customer Premises Equipments (CPEs) | |||
| is facing the core network. The two I/O modules are interconnected | behind it. A Network I/O module is facing the core network. The two | |||
| by a leaf and spine fabric. [TR-384] | I/O modules are interconnected by a leaf and spine fabric. [TR-384] | |||
| +---------------------+ +----------------------+ | +---------------------+ +----------------------+ | |||
| | Spine | | Spine | | | Spine | | Spine | | |||
| | Switch | | Switch | | | Switch | | Switch | | |||
| +------+---+------+-+-+ +--+-+-+-+-----+-------+ | +------+---+------+-+-+ +--+-+-+-+-----+-------+ | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | | | | +-------------------------------+ | | | | | | | +-------------------------------+ | | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | | | +-------------------------+ | | | | | | | | +-------------------------+ | | | | |||
| | | | | | | | | | | | | | | | | | | | | | | | | | | |||
| | | +----------------------+ | | | | | | | | | | | +----------------------+ | | | | | | | | | |||
| skipping to change at page 11, line 12 ¶ | skipping to change at page 11, line 12 ¶ | |||
| The Spine-Leaf architecture deployed inside CloudCO meets the network | The Spine-Leaf architecture deployed inside CloudCO meets the network | |||
| requirements of being adaptable, agile, scalable and dynamic. | requirements of being adaptable, agile, scalable and dynamic. | |||
| 4. Deployment Considerations | 4. Deployment Considerations | |||
| RIFT presents the opportunity for organizations building and | RIFT presents the opportunity for organizations building and | |||
| operating IP fabrics to simplify their operation and deployments | operating IP fabrics to simplify their operation and deployments | |||
| while achieving many desirable properties of dynamic routing on | while achieving many desirable properties of dynamic routing on | |||
| such a substrate: | such a substrate: | |||
| * RIFT design follows minimum blast radius and minimum necessary | * RIFT only floods routing information to the devices that absolutely | |||
| epistemological scope philosophy which leads to very good scaling | need it. RIFT design follows minimum blast radius and minimum | |||
| properties while delivering maximum reactiveness. | necessary epistemological scope philosophy which leads to good | |||
| scaling properties while delivering maximum reactiveness. | ||||
| * RIFT allows for extensive Zero Touch Provisioning within the | * RIFT allows for extensive Zero Touch Provisioning within the | |||
| protocol. In its most extreme version RIFT does not rely on any | protocol. In its most extreme version RIFT does not rely on any | |||
| specific addressing and for IP fabric can operate using IPv6 ND | specific addressing and for IP fabric can operate using IPv6 ND | |||
| [RFC4861] only. | [RFC4861] only. | |||
| * RIFT has provisions to detect common IP fabric mis-cabling | * RIFT has provisions to detect common IP fabric mis-cabling | |||
| scenarios. | scenarios. | |||
| * RIFT automatically negotiates BFD per link, allowing this way for | * RIFT automatically negotiates BFD per link, allowing this way for | |||
| IP and micro-BFD [RFC7130] to replace LAGs which do hide bandwidth | IP and micro-BFD [RFC7130] to replace Link Aggregation Groups | |||
| imbalances in case of constituent failures. Further automatic | (LAGs) which do hide bandwidth imbalances in case of constituent | |||
| link validation techniques similar to [RFC5357] could be supported | failures. Further automatic link validation techniques similar to | |||
| as well. | [RFC5357] could be supported as well. | |||
| * RIFT inherently solves many difficult problems associated with the | * RIFT inherently solves many difficult problems associated with the | |||
| use of traditional routing topologies with dense meshes and high | use of traditional routing topologies with dense meshes and high | |||
| degrees of ECMP by including automatic bandwidth balancing, flood | degrees of ECMP by including automatic bandwidth balancing, flood | |||
| reduction and automatic disaggregation on failures while providing | reduction and automatic disaggregation on failures while providing | |||
| maximum aggregation of prefixes in default scenarios. | maximum aggregation of prefixes in default scenarios. | |||
| * RIFT reduces FIB size towards the bottom of the IP fabric where | * RIFT reduces FIB size towards the bottom of the IP fabric where | |||
| most nodes reside and allows with that for cheaper hardware on the | most nodes reside and allows with that for cheaper hardware on the | |||
| edges and introduction of modern IP fabric architectures that | edges and introduction of modern IP fabric architectures that | |||
| skipping to change at page 12, line 14 ¶ | skipping to change at page 12, line 17 ¶ | |||
| * Many further operational and design points collected over many | * Many further operational and design points collected over many | |||
| years of routing protocol deployments have been incorporated in | years of routing protocol deployments have been incorporated in | |||
| RIFT such as fast flooding rates, protection of information | RIFT such as fast flooding rates, protection of information | |||
| lifetimes and operationally easily recognizable remote ends of | lifetimes and operationally easily recognizable remote ends of | |||
| links and node names. | links and node names. | |||
| 4.1. South Reflection | 4.1. South Reflection | |||
| South reflection is a mechanism by which South Node TIEs are "reflected" | South reflection is a mechanism by which South Node TIEs are "reflected" | |||
| back up north to allow nodes in same level without E-W links to "see" | back up north to allow nodes in same level without East-west links to | |||
| each other. | "see" each other. | |||
| For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs | For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs | |||
| from ToF21 to ToF22 separately. Respectively, | from ToF21 to ToF22 separately. Respectively, | |||
| Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22 | Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22 | |||
| to ToF21 separately. So ToF22 and ToF21 see each other's node | to ToF21 separately. So ToF22 and ToF21 see each other's node | |||
| information as level 2 nodes. | information as level 2 nodes. | |||
| In an equivalent fashion, as the result of the south reflection | In an equivalent fashion, as the result of the south reflection | |||
| between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, | between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, | |||
| Spine121 and Spine122 know each other at level 1. | Spine121 and Spine122 know each other at level 1. | |||
| 4.2. Suboptimal Routing on Link Failures | 4.2. Suboptimal Routing on Link Failures | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| | ToF21 | | ToF22 | LEVEL 2 | | ToF21 | | ToF22 | LEVEL 2 | |||
| ++--+-+-++ ++-+--+-++ | ++--+-+-++ ++-+--+-++ | |||
| | | | | | | | + | | | | | | | | + | |||
| | | | | | | | linkTS8 | | | | | | | | linkTS8 | |||
| +-------------+ | +-+linkTS3+-+ | | | +--------------+ | +-------------+ | +-+linkTS3+-+ | | | +-------------+ | |||
| | | | | | | + | | | | | | | | + | | |||
| | +----------------------------+ | linkTS7 | | | +----------------------------+ | linkTS7 | | |||
| | | | | + + + | | | | | | + + + | | |||
| | | | +-------+linkTS4+------------+ | | | | | +-------+linkTS4+------------+ | | |||
| | | | + + | | | | | | | + + | | | | |||
| | | | +------------+--+ | | | | | | +------------+--+ | | | |||
| | | | | | linkTS6 | | | | | | | | linkTS6 | | | |||
| +-+----++ ++-----++ ++------+ ++-----++ | +-+----+-+ +-----+--+ ++--------+ +-+----+-+ | |||
| |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 | |Spine111| |Spine112| |Spine121 | |Spine122| LEVEL 1 | |||
| +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ | +-+---+--+ +----+---+ +-+---+---+ +-+---+--+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | +--------------+ | + ++XX+linkSL6+---+ + | | +--------------+ | + ++XX+linkSL6+---+ + | |||
| | | | | linkSL5 | | linkSL8 | | | | | linkSL5 | | linkSL8 | |||
| | +------------+ | | + +---+linkSL7+-+ | + | | +------------+ | | + +---+linkSL7+-+ | + | |||
| | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ | |||
| |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |||
| +-+-----+ ++------+ +-----+-+ +-+-----+ | +-+-----+ ++------+ +-----+-+ +-+-----+ | |||
| + + + + | + + + + | |||
| Prefix111 Prefix112 Prefix121 Prefix122 | Prefix111 Prefix112 Prefix121 Prefix122 | |||
| skipping to change at page 14, line 10 ¶ | skipping to change at page 14, line 10 ¶ | |||
| S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to | S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to | |||
| prefix122 will only be sent to linkSL7 following a longest-prefix | prefix122 will only be sent to linkSL7 following a longest-prefix | |||
| match to prefix122 directly and then go down through linkSL8 to Leaf122. | match to prefix122 directly and then go down through linkSL8 to Leaf122. | |||
| 4.3. Black-Holing on Link Failures | 4.3. Black-Holing on Link Failures | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| | ToF 21 | | ToF 22 | LEVEL 2 | | ToF 21 | | ToF 22 | LEVEL 2 | |||
| ++-+--+-++ ++-+--+-++ | ++-+--+-++ ++-+--+-++ | |||
| | | | | | | | | | | | | | | | | + | |||
| | | | | | | | linkTS8 | | | | | | | | linkTS8 | |||
| +--------------+ | +--linkTS3-X+ | | | +--------------+ | +--------------+ | +-+linkTS3+X+ | | | +--------------+ | |||
| linkTS1 | | | | | | | | linkTS1 | | | | | + | | |||
| | +-----------------------------+ | linkTS7 | | + +-----------------------------+ | linkTS7 | | |||
| | | | | | | | | | | | + | + + + | | |||
| | | linkTS2 +--------linkTS4-X-----------+ | | | | linkTS2 +-------+linkTS4+X+----------+ | | |||
| | | | | | | | | | | + + + + | | | | |||
| | linkTS5 +-+ +---------------+ | | | | linkTS5 +-+ +------------+--+ | | | |||
| | | | | | linkTS6 | | | | + | | | linkTS6 | | | |||
| +-+----++ +-+-----+ ++----+-+ ++-----++ | +-+----+-+ +-+----+-+ ++-------+ +-+-----++ | |||
| |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 | |Spine111| |Spine112| |Spine121| |Spine122| LEVEL 1 | |||
| +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ | +-+---+--+ ++----+--+ +-+---+--+ +-+---+--+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | +---------------+ | | +----linkSL6----+ | | + +---------------+ | + +---+linkSL6+---+ + | |||
| linkSL1 | | | linkSL5 | | linkSL8 | linkSL1 | | | linkSL5 | | linkSL8 | |||
| | +---linkSL3---+ | | | +----linkSL7--+ | | | + +--+linkSL3+--+ | | + +---+linkSL7+-+ | + | |||
| | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ | |||
| |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |||
| +-+-----+ ++------+ +-----+-+ +-+-----+ | +-+-----+ ++------+ +-----+-+ +-+-----+ | |||
| + + + + | + + + + | |||
| Prefix111 Prefix112 Prefix121 Prefix122 | Prefix111 Prefix112 Prefix121 Prefix122 | |||
| Figure 4: Black-holing upon link failure use case | Figure 4: Black-holing upon link failure use case | |||
| This scenario illustrates a case when double link failure occurs and | This scenario illustrates a case when double link failure occurs and | |||
| skipping to change at page 15, line 12 ¶ | skipping to change at page 15, line 12 ¶ | |||
| with prefix 121 and prefix 122, that is flooded to spines 111, 112, | with prefix 121 and prefix 122, that is flooded to spines 111, 112, | |||
| 121 and 122. | 121 and 122. | |||
| The packet from leaf111 to prefix122 will not be routed to linkTS1 or | The packet from leaf111 to prefix122 will not be routed to linkTS1 or | |||
| linkTS2. The packet from leaf111 to prefix122 will only be routed to | linkTS2. The packet from leaf111 to prefix122 will only be routed to | |||
| linkTS5 or linkTS7 following a longest-prefix match to prefix122. | linkTS5 or linkTS7 following a longest-prefix match to prefix122. | |||
| 4.4. Zero Touch Provisioning (ZTP) | 4.4. Zero Touch Provisioning (ZTP) | |||
| Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | |||
| has no configuration (unless it is a Top-of-Fabric at the top of the | has no configuration (unless it is a ToF at the top of the topology | |||
| topology or it is desired to confine it to leaf role w/o leaf-2-leaf | or it is desired to confine it to leaf role w/o leaf-2-leaf | |||
| procedures). In such case RIFT will fully configure the node's level | procedures). In such case RIFT will fully configure the node's level | |||
| after it is attached to the topology. | after it is attached to the topology. | |||
| The most import component for ZTP is the automatic level derivation | The most important component for ZTP is the automatic level | |||
| procedure. All the Top-of-Fabric nodes are explicitly marked with | derivation procedure. All the ToF nodes are explicitly marked with | |||
| TOP_OF_FABRIC flag which are initial 'seeds' needed for other ZTP | TOP_OF_FABRIC flag which are initial 'seeds' needed for other ZTP | |||
| nodes to derive their level in the topology. The derivation of the | nodes to derive their level in the topology. The derivation of the | |||
| level of each node happens then based on LIEs received from its | level of each node happens then based on Link Information Elements | |||
| neighbors whereas each node (with the possible exception of configured | (LIEs) received from its neighbors whereas each node (with the possible | |||
| leafs) tries to attach at the highest possible point in the fabric. | exception of configured leafs) tries to attach at the highest | |||
| This guarantees that even if the diffusion front reaches a node from | possible point in the fabric. This guarantees that even if the | |||
| "below" faster than from "above", it will greedily abandon already | diffusion front reaches a node from "below" faster than from "above", | |||
| negotiated level derived from nodes topologically below it and | it will greedily abandon already negotiated level derived from nodes | |||
| properly peer with nodes above. | topologically below it and properly peer with nodes above. | |||
| 4.5. Miscabling Examples | 4.5. Mis-cabling Examples | |||
| +----------------+ +-----------------+ | +----------------+ +-----------------+ | |||
| | ToF21 | +------+ ToF22 | LEVEL 2 | | ToF21 | +------+ ToF22 | LEVEL 2 | |||
| +-------+----+---+ | +----+---+--------+ | +-------+----+---+ | +----+---+--------+ | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | | | +----------------------------+ | | | | | +----------------------------+ | | |||
| | +---------------------------+ | | | | | | +---------------------------+ | | | | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | | | | +-----------------------+ | | | | | | | +-----------------------+ | | | |||
| | | +------------------------+ | | | | | | +------------------------+ | | | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +-+---+-+ | +-+---+-+ +-+---+-+ | +-+---+--+ +-+---+--+ | +--+---+-+ +--+---+-+ | |||
| |Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1 | |Spine111| |Spine112| | |Spine121| |Spine122| LEVEL 1 | |||
| +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ | +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | +---------+ | link-M | +---------+ | | | +---------+ | link-M | +---------+ | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | +-------+ | | | | +-------+ | | | | +-------+ | | | | +-------+ | | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | |||
| |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| Figure 5: A single plane miscabling example | Figure 5: A single plane mis-cabling example | |||
| Figure 5 shows a single plane miscabling example. It's a perfect | Figure 5 shows a single plane mis-cabling example. It's a perfect | |||
| fat-tree fabric except link-M connecting Leaf112 to ToF22. | fat tree fabric except link-M connecting Leaf112 to ToF22. | |||
| The RIFT control protocol can discover the physical links | The RIFT control protocol can discover the physical links | |||
| automatically and be able to detect cabling that violates fat-tree | automatically and be able to detect cabling that violates fat tree | |||
| topology constraints. It react accordingly to such mis-cabling | topology constraints. It reacts accordingly to such mis-cabling | |||
| attempts, at a minimum preventing adjacencies between nodes from | attempts, at a minimum preventing adjacencies between nodes from | |||
| being formed and traffic from being forwarded on those mis-cabled | being formed and traffic from being forwarded on those mis-cabled | |||
| links. Leaf112 will in such scenario use link-M to derive its level | links. Leaf112 will in such scenario use link-M to derive its level | |||
| (unless it is leaf) and can report links to spines 111 and 112 as | (unless it is leaf) and can report links to Spine111 and Spine112 as | |||
| miscabled unless the implementation allows horizontal links. | mis-cabled unless the implementation allows horizontal links. | |||
| Figure 6 shows a multiple plane miscabling example. Since Leaf112 | Figure 6 shows a multiple plane mis-cabling example. Since Leaf112 | |||
| and Spine121 belong to two different PoDs, the adjacency between | and Spine121 belong to two different PoDs, the adjacency between | |||
| Leaf112 and Spine121 cannot be formed. link-W would be detected and | Leaf112 and Spine121 cannot be formed. link-W would be detected and | |||
| prevented. | prevented. | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 | |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | | | +-----------------+ | | | | | | | +-----------------+ | | | | |||
| | +--------------------------+ | | | | | | +--------------------------+ | | | | | |||
| | | | | | | | | | | | | | | | | | | |||
| | +------+ | | | +------+ | | | +------+ | | | +------+ | | |||
| | | +-----------------+ | | | | | | | | +-----------------+ | | | | | | |||
| | | | +--------------------------+ | | | | | | +--------------------------+ | | | |||
| | A | | B | | A | | B | | | A | | B | | A | | B | | |||
| +-----+-+ +-+---+-+ +-+---+-+ +-+-----+ | +-----+--+ +-+---+--+ +--+---+-+ +--+-----+ | |||
| |Spin111| |Spin112| +----+Spin121| |Spin122| LEVEL 1 | |Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1 | |||
| +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ | +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | +---------+ | | | +---------+ | | | +---------+ | | | +---------+ | | |||
| | | | | link-W | | | | | | | | | link-W | | | | | |||
| | +-------+ | | | | +-------+ | | | | +-------+ | | | | +-------+ | | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | |||
| |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| +--------PoD#1----------+ +---------PoD#2---------+ | +--------PoD#1----------+ +---------PoD#2---------+ | |||
| Figure 6: A multiple plane miscabling example | Figure 6: A multiple plane mis-cabling example | |||
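The two mis-cabling cases in Figures 5 and 6 can be sketched as a simple per-link check. This is only an illustration under stated assumptions, not the adjacency rules of [RIFT]: a link is flagged when both ends carry different PoD identifiers, or when both ends sit at the same level and horizontal links are not allowed. The function classify_link, its return strings and the allow_horizontal parameter are hypothetical names introduced here.

```python
def classify_link(a, b, level, pod, allow_horizontal=False):
    """Return a verdict for the link between nodes a and b.

    level: dict node -> derived or configured level.
    pod:   dict node -> PoD identifier; nodes without a PoD (for example
           ToF nodes in this sketch) are simply absent from the dict.
    Simplified, assumed rules for illustration only.
    """
    pod_a, pod_b = pod.get(a), pod.get(b)
    if pod_a is not None and pod_b is not None and pod_a != pod_b:
        return "mis-cabled: PoD mismatch"
    if level[a] == level[b]:
        return "ok" if allow_horizontal else "mis-cabled: horizontal"
    return "ok"


# Figure 5: Leaf112 derived level 1 through link-M to ToF22, so its
# links to Spine111 and Spine112 now look horizontal.
fig5_level = {"ToF22": 2, "Spine111": 1, "Spine112": 1, "Leaf112": 1}
fig5 = classify_link("Leaf112", "Spine111", fig5_level, pod={})

# Figure 6: link-W joins nodes that belong to different PoDs.
fig6_level = {"Leaf112": 0, "Spine121": 1}
fig6 = classify_link("Leaf112", "Spine121", fig6_level,
                     pod={"Leaf112": 1, "Spine121": 2})
```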
| RIFT provides an optional level determination procedure in its Zero | RIFT provides an optional level determination procedure in its Zero | |||
| Touch Provisioning mode. Nodes in the fabric without their level | Touch Provisioning mode. Nodes in the fabric without their level | |||
| configured determine it automatically. This can have possibly | configured determine it automatically. This can have possibly | |||
| counter-intuitive consequences however. One extreme failure scenario | counter-intuitive consequences however. One extreme failure scenario | |||
| is depicted in Figure 7 and it shows that if all northbound links of | is depicted in Figure 7 and it shows that if all northbound links of | |||
| Spine11 fail at the same time, Spine11 negotiates a lower level than | Spine11 fail at the same time, Spine11 negotiates a lower level than | |||
| Leaf11 and Leaf12. | Leaf11 and Leaf12. | |||
| To prevent such scenario where leafs are expected to act as switches, | To prevent such scenario where leafs are expected to act as switches, | |||
| LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is | LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is | |||
| invalid, Spine11 would not derive a valid level from the topology in | invalid, Spine11 would not derive a valid level from the topology in | |||
| Figure 7. It will be isolated from the whole fabric and it would be | Figure 7. It will be isolated from the whole fabric and it would be | |||
| up to the leafs to declare the links towards such spine as miscabled. | up to the leafs to declare the links towards such spine as mis- | |||
| cabled. | ||||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| |ToF A1| |ToF A2| |ToF A1| |ToF A2| | |ToF A1| |ToF A2| |ToF A1| |ToF A2| | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| | | | | | | | | | | | | | | |||
| | +-------+ | | | | | +-------+ | | | | |||
| + + | | ====> | | | + + | | ====> | | | |||
| X X +------+ | +------+ | | X X +------+ | +------+ | | |||
| + + | | | | | + + | | | | | |||
| +----+--+ +-+-----+ +-+-----+ | +----+--+ +-+-----+ +-+-----+ | |||
| skipping to change at page 18, line 8 ¶ | skipping to change at page 17, line 47 ¶ | |||
| | | | | | | |||
| +-+---+-+ | +-+---+-+ | |||
| |Spine11| | |Spine11| | |||
| +-------+ | +-------+ | |||
| Figure 7: Fallen spine | Figure 7: Fallen spine | |||
| 4.6. Positive vs. Negative Disaggregation | 4.6. Positive vs. Negative Disaggregation | |||
| Disaggregation is the procedure whereby [RIFT] advertises a more | Disaggregation is the procedure whereby [RIFT] advertises a more | |||
| specific route Southwards as an exception to the aggregated fabric- | specific route southwards as an exception to the aggregated fabric- | |||
| default North. Disaggregation is useful when a prefix within the | default north. Disaggregation is useful when a prefix within the | |||
| aggregation is reachable via some of the parents but not the others | aggregation is reachable via some of the parents but not the others | |||
| at the same level of the fabric. It is mandatory when the level is | at the same level of the fabric. It is mandatory when the level is | |||
| the ToF since a ToF node that cannot reach a prefix becomes a black | the ToF since a ToF node that cannot reach a prefix becomes a black | |||
| hole for that prefix. The hard problem is to know which prefixes are | hole for that prefix. The hard problem is to know which prefixes are | |||
| reachable by whom. | reachable by whom. | |||
| In the general case, [RIFT] solves that problem by interconnecting | In the general case, [RIFT] solves that problem by interconnecting | |||
| the ToF nodes so they can exchange the full list of prefixes that | the ToF nodes. So the ToF nodes can exchange the full list of | |||
| exist in the fabric and figure when a ToF node lacks reachability | prefixes that exist in the fabric and figure when a ToF node lacks | |||
| to an existing prefix. This requires additional ports at the ToF, | reachability to an existing prefix. This requires additional ports | |||
| typically 2 ports per ToF node to form a ToF-spanning ring. [RIFT] | at the ToF, typically 2 ports per ToF node to form a ToF-spanning | |||
| also defines the southbound reflection procedure that enables a | ring. [RIFT] also defines the southbound reflection procedure that | |||
| parent to explore the direct connectivity of its peers, meaning their | enables a parent to explore the direct connectivity of its peers, | |||
| own parents and children; based on the advertisements received from | meaning their own parents and children; based on the advertisements | |||
| the shared parents and children, it may enable the parent to infer | received from the shared parents and children, it may enable the | |||
| the prefixes its peers can reach. | parent to infer the prefixes its peers can reach. | |||
| When a parent lacks reachability to a prefix, it may disaggregate the | When a parent lacks reachability to a prefix, it may disaggregate the | |||
| prefix negatively, i.e., advertise that this parent can be used to | prefix negatively, i.e., advertise that this parent can be used to | |||
| reach any prefix in the aggregation except that one. The Negative | reach any prefix in the aggregation except that one. The Negative | |||
| Disaggregation signaling is simple and functions transitively from | Disaggregation signaling is simple and functions transitively from | |||
| ToF to ToP and then from Top to Leaf. But it is hard for a parent to | ToF to top-of-pod (ToP) and then from ToP to Leaf. But it is hard | |||
| figure which prefix it needs to disaggregate, because it does not | for a parent to figure which prefix it needs to disaggregate, because | |||
| know what it does not know; it results that the use of a spanning | it does not know what it does not know; it results that the use of a | |||
| ring at the ToF is required to operate the Negative Disaggregation. | spanning ring at the ToF is required to operate the Negative | |||
| Also, though it is only an implementation problem, the programming | Disaggregation. Also, though it is only an implementation problem, | |||
| of the FIB is complex compared to normal routes, and may incur | the programming of the FIB is complex compared to normal routes, | |||
| recursions. | and may incur recursions. | |||
| The more classical alternative is, for the parents that can reach a | The more classical alternative is, for the parents that can reach a | |||
| prefix that peers at the same level cannot, to advertise a more | prefix that peers at the same level cannot, to advertise a more | |||
| specific route to that prefix. This leverages the normal longest | specific route to that prefix. This leverages the normal longest | |||
| prefix match in the FIB, and does not require a special | prefix match in the FIB, and does not require a special | |||
| implementation. But as opposed to the Negative Disaggregation, the | implementation. But as opposed to the Negative Disaggregation, the | |||
| Positive Disaggregation is difficult and inefficient to operate | Positive Disaggregation is difficult and inefficient to operate | |||
| transitively. | transitively. | |||
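The reliance of Positive Disaggregation on ordinary longest prefix match can be shown with a minimal sketch. The addresses, the next-hop names and the lookup helper are assumptions made for illustration; a real FIB would use a trie rather than a linear scan.

```python
import ipaddress


def lookup(fib, destination):
    """Return the next hops of the longest matching prefix, or None.

    fib: dict mapping prefix strings to lists of next-hop names.
    Hypothetical helper; linear scan kept for readability.
    """
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hops in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hops)
    return best[1] if best else None


# Default route towards both parents, plus one positively disaggregated
# prefix advertised only by the parent that can still reach it.
fib = {
    "0.0.0.0/0": ["ToF21", "ToF22"],
    "192.0.2.0/24": ["ToF22"],
}

via_specific = lookup(fib, "192.0.2.10")
via_default = lookup(fib, "198.51.100.1")
```

Traffic to the disaggregated prefix follows only the parent that advertised the more specific route, while all other traffic keeps using the default towards both parents, with no special handling in the FIB.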
| Transitivity is not needed to a grandchild if all its parents | Transitivity is not needed to a grandchild if all its parents | |||
| skipping to change at page 19, line 45 ¶ | skipping to change at page 19, line 37 ¶ | |||
| meantime. In the case of Negative Disaggregation, the last ToF | meantime. In the case of Negative Disaggregation, the last ToF | |||
| node(s) that injects the route may also incur an incast issue; this | node(s) that injects the route may also incur an incast issue; this | |||
| problem would occur if a prefix that becomes totally unreachable is | problem would occur if a prefix that becomes totally unreachable is | |||
| disaggregated, but doing so is mostly useless and is not recommended. | disaggregated, but doing so is mostly useless and is not recommended. | |||
| 4.7. Mobile Edge and Anycast | 4.7. Mobile Edge and Anycast | |||
| When a physical or a virtual node changes its point of attachment in | When a physical or a virtual node changes its point of attachment in | |||
| the fabric from a previous-leaf to a next-leaf, new routes must be | the fabric from a previous-leaf to a next-leaf, new routes must be | |||
| installed that supersede the old ones. Since the flooding flows | installed that supersede the old ones. Since the flooding flows | |||
| Northwards, the nodes (if any) between the previous-leaf and the | northwards, the nodes (if any) between the previous-leaf and the | |||
| common parent are not immediately aware that the path via previous- | common parent are not immediately aware that the path via previous- | |||
| leaf is obsolete, and a stale route may exist for a while. The | leaf is obsolete, and a stale route may exist for a while. The | |||
| common parent needs to select the freshest route advertisement in | common parent needs to select the freshest route advertisement in | |||
| order to install the correct route via the next-leaf. This requires | order to install the correct route via the next-leaf. This requires | |||
| that the fabric determines the sequence of the movements of the | that the fabric determines the sequence of the movements of the | |||
| mobile node. | mobile node. | |||
| On the one hand, a classical sequence counter provides a total order | On the one hand, a classical sequence counter provides a total order | |||
| for a while but it will eventually wrap. On the other hand, a | for a while but it will eventually wrap. On the other hand, a | |||
| timestamp provides a permanent order but it may miss a movement that | timestamp provides a permanent order but it may miss a movement that | |||
| happens too quickly vs. the granularity of the timing information. | happens too quickly vs. the granularity of the timing information. | |||
| It is not envisioned in the short term that the average fabric | It is not envisioned in the short term that the average fabric | |||
| supports a Precision Time Protocol, and the precision that may be | supports a Precision Time Protocol [IEEEstd1588], and the precision | |||
| available with the Network Time Protocol [RFC5905], in the order of | that may be available with the Network Time Protocol [RFC5905], in | |||
| 100 to 200ms, may not necessarily be enough to cover, e.g., the fast | the order of 100 to 200ms, may not necessarily be enough to cover, | |||
| mobility of a Virtual Machine. | e.g., the fast mobility of a Virtual Machine. | |||
| Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that | Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that | |||
| combines a sequence counter from the mobile node and a timestamp from | combines a sequence counter from the mobile node and a timestamp from | |||
| the network taken at the leaf when the route is injected. If the | the network taken at the leaf when the route is injected. If the | |||
| timestamps of the concurrent advertisements are comparable (i.e., | timestamps of the concurrent advertisements are comparable (i.e., | |||
| more distant than the precision of the timing protocol), then the | more distant than the precision of the timing protocol), then the | |||
| timestamp alone is used to determine the relative freshness of the | timestamp alone is used to determine the relative freshness of the | |||
| routes. Otherwise, the sequence counter from the mobile node, if | routes. Otherwise, the sequence counter from the mobile node, if | |||
| available, is used. One caveat is that the sequence counter must not | available, is used. One caveat is that the sequence counter must not | |||
| wrap within the precision of the timing protocol. Another is that | wrap within the precision of the timing protocol. Another is that | |||
| the mobile node may not even provide a sequence counter, in which | the mobile node may not even provide a sequence counter, in which | |||
| case the mobility itself must be slower than the precision of the | case the mobility itself must be slower than the precision of the | |||
| timing. | timing. | |||
| Mobility must not be confused with Anycast. In both cases, the same | Mobility must not be confused with anycast. In both cases, the same | |||
| address is injected in RIFT at different leaves. In the case of | address is injected in RIFT at different leaves. In the case of | |||
| mobility, only the freshest route must be conserved, since the mobile | mobility, only the freshest route must be conserved, since the mobile | |||
| node changed its point of attachment from one leaf to the next. In the | node changed its point of attachment from one leaf to the next. In the | |||
| case of anycast, the node may be either multihomed (attached to | case of anycast, the node may be either multihomed (attached to | |||
| multiple leaves in parallel) or reachable beyond the fabric via | multiple leaves in parallel) or reachable beyond the fabric via | |||
| multiple routes that are redistributed to different leaves; either | multiple routes that are redistributed to different leaves; either | |||
| way, in the case of anycast, the multiple routes are equally valid | way, in the case of anycast, the multiple routes are equally valid | |||
| and should be conserved. Without further information from the | and should be conserved. Without further information from the | |||
| redistributed routing protocol, it is impossible to sort out a | redistributed routing protocol, it is impossible to sort out a | |||
| movement from a redistribution that happens asynchronously on | movement from a redistribution that happens asynchronously on | |||
| skipping to change at page 20, line 50 ¶ | skipping to change at page 20, line 50 ¶ | |||
| advertised within the timing precision, which is typically the case | advertised within the timing precision, which is typically the case | |||
| with a low-precision timing and a multihomed node. Beyond that time | with a low-precision timing and a multihomed node. Beyond that time | |||
| interval, RIFT interprets the lag as a mobility and only the freshest | interval, RIFT interprets the lag as a mobility and only the freshest | |||
| route is retained. | route is retained. | |||
| When using IPv6 [RFC8200], RIFT suggests leveraging "Registration | When using IPv6 [RFC8200], RIFT suggests leveraging "Registration | |||
| Extensions for IPv6 over Low-Power Wireless Personal Area Network | Extensions for IPv6 over Low-Power Wireless Personal Area Network | |||
| (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND | (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND | |||
| interaction between the mobile node and the leaf. This provides not | interaction between the mobile node and the leaf. This provides not | |||
| only a sequence counter but also a lifetime and a security token that | only a sequence counter but also a lifetime and a security token that | |||
| may be used to protect the ownership of an address. When using | may be used to protect the ownership of an address [RFC8928]. When | |||
| [RFC8505], the parallel registration of an anycast address to | using [RFC8505], the parallel registration of an anycast address to | |||
| multiple leaves is done with the same sequence counter, whereas the | multiple leaves is done with the same sequence counter, whereas the | |||
| sequence counter is incremented when the point of attachment | sequence counter is incremented when the point of attachment | |||
| changes. This way, it is possible to differentiate a mobile node | changes. This way, it is possible to differentiate a mobile node | |||
| from a multihomed node, even when the mobility happens within the | from a multihomed node, even when the mobility happens within the | |||
| timing precision. It is also possible for a mobile node to be | timing precision. It is also possible for a mobile node to be | |||
| multihomed as well, e.g., to change only one of its points of | multihomed as well, e.g., to change only one of its points of | |||
| attachment. | attachment. | |||
| 4.8. IPv4 over IPv6 | 4.8. IPv4 over IPv6 | |||
| RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. IPv6 | RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. IPv6 | |||
| AF configures via the usual ND mechanisms and then V4 can use V6 | Address Family (AF) configures via the usual Neighbor Discovery (ND) | |||
| nexthops analogous to RFC5549. It is expected that the whole fabric | mechanisms and then V4 can use V6 nexthops analogous to [RFC5549]. | |||
| supports the same type of forwarding of address families on all the | It is expected that the whole fabric supports the same type of | |||
| links. RIFT provides an indication whether a node is v4 forwarding | forwarding of address families on all the links. RIFT provides an | |||
| capable and implementations are possible where different routing | indication whether a node is v4 forwarding capable and | |||
| tables are computed per address family as long as the computation | implementations are possible where different routing tables are | |||
| remains loop-free. | computed per address family as long as the computation remains loop- | |||
| free. | ||||
| +-----+ +-----+ | +-----+ +-----+ | |||
| +---+---+ | ToF | | ToF | | +---+---+ | ToF | | ToF | | |||
| ^ +--+--+ +-----+ | ^ +--+--+ +-----+ | |||
| | | | | | | | | | | | | |||
| | | +-------------+ | | | | +-------------+ | | |||
| | | +--------+ | | | | | +--------+ | | | |||
| | | | | | | + | | | | | |||
| V6 +-----+ +-+---+ | V6 +-----+ +-+---+ | |||
| Forwarding |SPINE| |SPINE| | Forwarding |Spine| |Spine| | |||
| | +--+--+ +-----+ | + +--+--+ +-----+ | |||
| | | | | | | | | | | | | |||
| | | +-------------+ | | | | +-------------+ | | |||
| | | +--------+ | | | | | +--------+ | | | |||
| | | | | | | | | | | | | |||
| v +-----+ +-+---+ | v +-----+ +-+---+ | |||
| +---+---+ |LEAF | | LEAF| | +---+---+ |Leaf | | Leaf| | |||
| +--+--+ +--+--+ | +--+--+ +--+--+ | |||
| | | | | | | |||
| IPv4 prefixes| |IPv4 prefixes | IPv4 prefixes| |IPv4 prefixes | |||
| | | | | | | |||
| +---+----+ +---+----+ | +---+----+ +---+----+ | |||
| | V4 | | V4 | | | V4 | | V4 | | |||
| | subnet | | subnet | | | subnet | | subnet | | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| Figure 8: IPv4 over IPv6 | Figure 8: IPv4 over IPv6 | |||
| 4.9. In-Band Reachability of Nodes | 4.9. In-Band Reachability of Nodes | |||
| RIFT does not require that nodes of the fabric have reachable | RIFT does not require that nodes of the fabric have reachable | |||
| addresses, but operational reasons to reach the internal nodes may | addresses, but operational reasons to reach the internal nodes may | |||
| exist. Figure 9 shows an example in which the NMS attaches to LEAF1. | exist. Figure 9 shows an example in which the network management | |||
| station (NMS) attaches to Leaf1. | ||||
| +-------+ +-------+ | +-------+ +-------+ | |||
| | ToF1 | | ToF2 | | | ToF1 | | ToF2 | | |||
| ++---- ++ ++-----++ | ++---- ++ ++-----++ | |||
| | | | | | | | | | | |||
| | +----------+ | | | +----------+ | | |||
| | +--------+ | | | | +--------+ | | | |||
| | | | | | | | | | | |||
| ++-----++ +--+---++ | ++-----++ +--+---++ | |||
| |SPINE1 | |SPINE2 | | |Spine1 | |Spine2 | | |||
| ++-----++ ++-----++ | ++-----++ ++-----++ | |||
| | | | | | | | | | | |||
| | +----------+ | | | +----------+ | | |||
| | +--------+ | | | | +--------+ | | | |||
| | | | | | | | | | | |||
| ++-----++ +--+---++ | ++-----++ +--+---++ | |||
| | LEAF1 | | LEAF2 | | | Leaf1 | | Leaf2 | | |||
| +---+---+ +-------+ | +---+---+ +-------+ | |||
| | | | | |||
| |NMS | |NMS | |||
| Figure 9: In-Band reachability of node | Figure 9: In-Band reachability of node | |||
| If the NMS wants to access LEAF2, it simply works, because the | If the NMS wants to access Leaf2, it simply works, because the | |||
| loopback address of LEAF2 is flooded in its Prefix North TIE. | loopback address of Leaf2 is flooded in its Prefix North TIE. | |||
| If the NMS wants to access SPINE2, it simply works too, because a | If the NMS wants to access Spine2, it simply works too, because a | |||
| spine node always advertises its loopback address in its Prefix North | spine node always advertises its loopback address in its Prefix North | |||
| TIE. The NMS may reach SPINE2 via LEAF1-SPINE2 or LEAF1-SPINE1-ToF1/ | TIE. The NMS may reach Spine2 via Leaf1-Spine2 or Leaf1-Spine1-ToF1/ | |||
| ToF2-SPINE2. | ToF2-Spine2. | |||
| If NMS wants to access ToF2, ToF2's loopback address needs to be | If NMS wants to access ToF2, ToF2's loopback address needs to be | |||
| injected into its Prefix South TIE. Otherwise, the traffic from NMS | injected into its Prefix South TIE. This TIE must be seen by all | |||
| may be sent to ToF1. | nodes at the level below - the spine nodes in Figure 9 - that must | |||
| form a ceiling for all the traffic coming from below (south). | ||||
| Otherwise, the traffic from NMS may follow the default route to the | ||||
| wrong ToF Node, e.g., ToF1. | ||||
| And in case of failure between ToF2 and spine nodes, ToF2's loopback | In a fully connected ToF, in case of failure between ToF2 and spine | |||
| address must be sent all the way down to the leaves. | nodes, ToF2's loopback address must be disaggregated recursively all | |||
| the way to the leaves. | ||||
| In a partitioned ToF, a ToF node is only reachable within its Plane, | ||||
| and the disaggregation to the leaves is also required. A possible | ||||
| alternative is to use the ring that interconnects the ToF nodes to | ||||
| transmit packets between them for their loopback addresses only. The | ||||
| idea is that this is mostly control traffic and should not alter the | ||||
| load balancing properties of the fabric. | ||||
| 4.10. Dual Homing Servers | 4.10. Dual Homing Servers | |||
| Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. It | |||
| has no configuration (unless it is a Top-of-Fabric at the top of the | has no configuration (unless it is a Top-of-Fabric at the top of the | |||
| topology or it must operate in the topology as a leaf and/or support | topology or it must operate in the topology as a leaf and/or support | |||
| leaf-2-leaf procedures) and it will fully configure itself after | leaf-2-leaf procedures) and it will fully configure itself after | |||
| being attached to the topology. | being attached to the topology. | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| |ToF| |ToF| |ToF| | |ToF| |ToF| |ToF| ToF | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | |||
| | +----------------+ | | | | +----------------+ | | | |||
| | | | | | | | | | | | | | | |||
| | +----------------+ | | | +----------------+ | | |||
| | | | | | | | | | | | | | | |||
| +----------+--+ +--+----------+ | +----------+--+ +--+----------+ | |||
| | Spine|ToR1 | | Spine|ToR2 | | | ToR1 | | ToR2 | Spine | |||
| +--+------+---+ +--+-------+--+ | +--+------+---+ +--+-------+--+ | |||
| +---+ | | | | | | +---+ | +---+ | | | | | | +---+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | +-----------------+ | | | | | +-----------------+ | | | | |||
| | | | +-------------+ | | | | | | +-------------+ | | | |||
| + | + | | |-----------------+ | | + | + | | |-----------------+ | | |||
| X | X | +--------x-----+ | X | | X | X | +--------x-----+ | X | | |||
| + | + | | | + | | + | + | | | + | | |||
| +---+ +---+ +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | | | | | |||
| +---+ +---+ ...............+---+ +---+ | +---+ +---+ ...............+---+ +---+ | |||
| SV(1) SV(2) SV(n+1) SV(n) | SV(1) SV(2) SV(n+1) SV(n) Leaf | |||
| Figure 10: Dual-homing servers | Figure 10: Dual-homing servers | |||
| In the single plane, the worst condition is disaggregation of every | In the single plane, the worst condition is disaggregation of every | |||
| other server at the same level. Suppose the links from ToR1 to all | other server at the same level. Suppose the links from ToR1 (Top of | |||
| the leaves become unavailable. All the servers' routes are | Rack) to all the leaves become unavailable. All the servers' | |||
| disaggregated and the FIB of the servers will be expanded with n-1 | routes are disaggregated and the FIB of the servers will be expanded | |||
| more specific routes. | with n-1 more specific routes. | |||
| Sometimes, people may prefer to disaggregate from ToR to servers from | Sometimes, people may prefer to disaggregate from ToR to servers from | |||
| the start, i.e., the servers have a few tens of routes in the FIB from | the start, i.e., the servers have a few tens of routes in the FIB from | |||
| the start besides default routes to avoid breakages at rack level. | the start besides default routes to avoid breakages at rack level. | |||
| Full disaggregation of the fabric could be achieved by configuration | Full disaggregation of the fabric could be achieved by configuration | |||
| supported by RIFT. | supported by RIFT. | |||
| 4.11. Fabric With A Controller | 4.11. Fabric With A Controller | |||
| There are many different ways to deploy the controller. One | There are many different ways to deploy the controller. One | |||
| skipping to change at page 24, line 24 ¶ | skipping to change at page 24, line 30 ¶ | |||
| | | | | | | |||
| | | | | | | |||
| +----++ ++----+ | +----++ ++----+ | |||
| ------- | ToF | | ToF | | ------- | ToF | | ToF | | |||
| | +--+--+ +-----+ | | +--+--+ +-----+ | |||
| | | | | | | | | | | | | |||
| | | +-------------+ | | | | +-------------+ | | |||
| | | +--------+ | | | | | +--------+ | | | |||
| | | | | | | | | | | | | |||
| +-----+ +-+---+ | +-----+ +-+---+ | |||
| RIFT domain |SPINE| |SPINE| | RIFT domain |Spine| |Spine| | |||
| +--+--+ +-----+ | +--+--+ +-----+ | |||
| | | | | | | | | | | | | |||
| | | +-------------+ | | | | +-------------+ | | |||
| | | +--------+ | | | | | +--------+ | | | |||
| | | | | | | | | | | | | |||
| | +-----+ +-+---+ | | +-----+ +-+---+ | |||
| ------- |LEAF | | LEAF| | ------- |Leaf | | Leaf| | |||
| +-----+ +-----+ | +-----+ +-----+ | |||
| Figure 11: Fabric with a controller | Figure 11: Fabric with a controller | |||
| 4.11.1. Controller Attached to ToFs | 4.11.1. Controller Attached to ToFs | |||
| If a controller is attaching to the RIFT domain from ToF, it usually | If a controller is attaching to the RIFT domain from ToF, it usually | |||
| uses dual-homing connections. The loopback prefix of the controller | uses dual-homing connections. The loopback prefix of the controller | |||
| should be advertised down by the ToF and spine to leaves. If the | should be advertised down by the ToF and spine to leaves. If the | |||
| controller loses its link to a ToF, make sure the ToF withdraws the prefix | controller loses its link to a ToF, make sure the ToF withdraws the prefix | |||
| skipping to change at page 26, line 48 ¶ | skipping to change at page 26, line 48 ¶ | |||
| + traffic | + traffic | |||
| Figure 13: Anycast | Figure 13: Anycast | |||
| If the traffic comes from a ToF to Leaf111 or Leaf121, which have the | If the traffic comes from a ToF to Leaf111 or Leaf121, which have the | |||
| anycast prefix PrefixA, RIFT can deal with this case well. But if the | anycast prefix PrefixA, RIFT can deal with this case well. But if the | |||
| traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level | traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level | |||
| 1. Spine21 and Spine22 do not know that PrefixA is also attached to | 1. Spine21 and Spine22 do not know that PrefixA is also attached to | |||
| Leaf111, so the traffic always goes to Leaf121 and never to Leaf111. | Leaf111, so the traffic always goes to Leaf121 and never to Leaf111. | |||
| If the intention is that the traffic should be offloaded to | If the intention is that the traffic should be offloaded to | |||
| Leaf111, then use policy guided prefixes [PGP reference]. | Leaf111, then use policy guided prefixes defined in "Routing in Fat | |||
| Trees" [RIFT]. | ||||
| 4.15. IoT Applicability | ||||
| The design of RIFT inherits from RPL [RFC6550] the anisotropic design | ||||
| of a default route upwards (northwards); it also inherits the | ||||
| capability to inject external host routes at the Leaf level using | ||||
| Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host | ||||
| and a RIFT router. Both the RPL and the RIFT protocols are meant for | ||||
| large scale, and WiND enables device mobility at the edge the same | ||||
| way in both cases. | ||||
| The main difference between RIFT and RPL is that with RPL, there's a | ||||
| single Root, whereas RIFT has many ToF nodes. This adds huge | ||||
| capabilities for leaf-2-leaf ECMP paths, but additional complexity | ||||
| with the need to disaggregate. Also RIFT uses Link State flooding | ||||
| northwards, and is not designed for low-power operation. | ||||
| Still, nothing prevents the IP devices connected at the Leaf from | ||||
| being IoT (Internet of Things) devices, which typically expose their | ||||
| address using WiND, which is an upgrade from 6LoWPAN ND [RFC6775]. | ||||
| A network that serves high-speed/high-power IoT devices should | ||||
| typically provide deterministic capabilities for applications such as | ||||
| high-speed control loops or movement detection. The Fat Tree is | ||||
| highly reliable, and in normal conditions provides an equivalent | ||||
| multipath operation, but ECMP doesn't provide hard guarantees for | ||||
| either delivery or latency. As long as the fabric is non-blocking, | ||||
| the result is the same, but there can be load imbalances resulting in | ||||
| incast and possibly congestion loss that will prevent delivery | ||||
| within bounded latency. | ||||
| This could be alleviated with Packet Replication, Elimination and | ||||
| Reordering (PREOF) [RFC8655] leaf-2-leaf but PREOF is hard to provide | ||||
| at the scale of all flows, and the replication may increase the | ||||
| probability of the overload that it attempts to solve. | ||||
| Note that the load balancing is not RIFT's problem, but it is key to | ||||
| serve IoT adequately. | ||||
| 5. Security Considerations | ||||
| This document presents the applicability of RIFT. As such, it does not | ||||
| introduce any security considerations. However, a number of | ||||
| security concerns are discussed in [RIFT]. | ||||
| 5. Acknowledgements | ||||
| 6. Contributors | 6. Contributors | |||
| The following people (listed in alphabetical order) contributed | The following people (listed in alphabetical order) contributed | |||
| significantly to the content of this document and should be | significantly to the content of this document and should be | |||
| considered co-authors: | considered co-authors: | |||
| Tony Przygienda | Tony Przygienda | |||
| Juniper Networks | Juniper Networks | |||
| skipping to change at page 28, line 11 ¶ | skipping to change at page 29, line 11 ¶ | |||
| Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", | Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", | |||
| RFC 5357, DOI 10.17487/RFC5357, October 2008, | RFC 5357, DOI 10.17487/RFC5357, October 2008, | |||
| <https://www.rfc-editor.org/info/rfc5357>. | <https://www.rfc-editor.org/info/rfc5357>. | |||
| [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., | [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., | |||
| Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional | Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional | |||
| Forwarding Detection (BFD) on Link Aggregation Group (LAG) | Forwarding Detection (BFD) on Link Aggregation Group (LAG) | |||
| Interfaces", RFC 7130, DOI 10.17487/RFC7130, February | Interfaces", RFC 7130, DOI 10.17487/RFC7130, February | |||
| 2014, <https://www.rfc-editor.org/info/rfc7130>. | 2014, <https://www.rfc-editor.org/info/rfc7130>. | |||
| [RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network | ||||
| Layer Reachability Information with an IPv6 Next Hop", | ||||
| RFC 5549, DOI 10.17487/RFC5549, May 2009, | ||||
| <https://www.rfc-editor.org/info/rfc5549>. | ||||
| [RFC6550] Winter, T., Ed., Thubert, P., Ed., Brandt, A., Hui, J., | ||||
| Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur, | ||||
| JP., and R. Alexander, "RPL: IPv6 Routing Protocol for | ||||
| Low-Power and Lossy Networks", RFC 6550, | ||||
| DOI 10.17487/RFC6550, March 2012, | ||||
| <https://www.rfc-editor.org/info/rfc6550>. | ||||
| [RFC6775] Shelby, Z., Ed., Chakrabarti, S., Nordmark, E., and C. | ||||
| Bormann, "Neighbor Discovery Optimization for IPv6 over | ||||
| Low-Power Wireless Personal Area Networks (6LoWPANs)", | ||||
| RFC 6775, DOI 10.17487/RFC6775, November 2012, | ||||
| <https://www.rfc-editor.org/info/rfc6775>. | ||||
| [RFC8655] Finn, N., Thubert, P., Varga, B., and J. Farkas, | ||||
| "Deterministic Networking Architecture", RFC 8655, | ||||
| DOI 10.17487/RFC8655, October 2019, | ||||
| <https://www.rfc-editor.org/info/rfc8655>. | ||||
| [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and | [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and | |||
| D. Afanasiev, "RIFT: Routing in Fat Trees", Work in | D. Afanasiev, "RIFT: Routing in Fat Trees", Work in | |||
| Progress, Internet-Draft, draft-ietf-rift-rift-12, 26 May | Progress, Internet-Draft, draft-ietf-rift-rift-12, 26 May | |||
| 2020, | 2020, | |||
| <https://tools.ietf.org/html/draft-ietf-rift-rift-12>. | <https://tools.ietf.org/html/draft-ietf-rift-rift-12>. | |||
| [I-D.white-distoptflood] | [I-D.white-distoptflood] | |||
| White, R., Hegde, S., and S. Zandi, "IS-IS Optimal | White, R., Hegde, S., and S. Zandi, "IS-IS Optimal | |||
| Distributed Flooding for Dense Topologies", Work in | Distributed Flooding for Dense Topologies", Work in | |||
| Progress, Internet-Draft, draft-white-distoptflood-04, 27 | Progress, Internet-Draft, draft-white-distoptflood-04, 27 | |||
| July 2020, | July 2020, | |||
| <https://tools.ietf.org/html/draft-white-distoptflood-04>. | <https://tools.ietf.org/html/draft-white-distoptflood-04>. | |||
| 8. Informative References | 8. Informative References | |||
| [IEEEstd1588] | ||||
| IEEE standard for Information Technology, "IEEE Standard | ||||
| for a Precision Clock Synchronization Protocol for | ||||
| Networked Measurement and Control Systems", | ||||
| <https://standards.ieee.org/standard/1588-2019.html>. | ||||
| [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer | ||||
| Communication Environments", IEEE International Parallel & | ||||
| Distributed Processing Symposium, 2011. | ||||
| [FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for | ||||
| Hardware-Efficient Supercomputing", 1985. | ||||
| [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, | [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, | |||
| "Network Time Protocol Version 4: Protocol and Algorithms | "Network Time Protocol Version 4: Protocol and Algorithms | |||
| Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, | Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, | |||
| <https://www.rfc-editor.org/info/rfc5905>. | <https://www.rfc-editor.org/info/rfc5905>. | |||
| [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 | [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 | |||
| (IPv6) Specification", STD 86, RFC 8200, | (IPv6) Specification", STD 86, RFC 8200, | |||
| DOI 10.17487/RFC8200, July 2017, | DOI 10.17487/RFC8200, July 2017, | |||
| <https://www.rfc-editor.org/info/rfc8200>. | <https://www.rfc-editor.org/info/rfc8200>. | |||
| [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. | [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. | |||
| Perkins, "Registration Extensions for IPv6 over Low-Power | Perkins, "Registration Extensions for IPv6 over Low-Power | |||
| Wireless Personal Area Network (6LoWPAN) Neighbor | Wireless Personal Area Network (6LoWPAN) Neighbor | |||
| Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, | Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, | |||
| <https://www.rfc-editor.org/info/rfc8505>. | <https://www.rfc-editor.org/info/rfc8505>. | |||
| [RFC8928] Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. Struik, | ||||
| "Address-Protected Neighbor Discovery for Low-Power and | ||||
| Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November | ||||
| 2020, <https://www.rfc-editor.org/info/rfc8928>. | ||||
| Authors' Addresses | Authors' Addresses | |||
| Yuehua Wei (editor) | Yuehua Wei (editor) | |||
| ZTE Corporation | ZTE Corporation | |||
| No.50, Software Avenue | No.50, Software Avenue | |||
| Nanjing | Nanjing | |||
| 210012 | 210012 | |||
| China | China | |||
| Email: wei.yuehua@zte.com.cn | Email: wei.yuehua@zte.com.cn | |||
| End of changes. 104 change blocks. | ||||
| 264 lines changed or deleted | 367 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||
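
Editorial illustration (not part of either draft revision): the hybrid freshness rule described in Section 4.7 above, combined with the [RFC8505] convention that parallel anycast registrations share one sequence counter, can be sketched as follows. This is a minimal sketch under stated assumptions, not the RIFT specification's algorithm. The names `Advertisement`, `select_routes` and `precision_ms` are hypothetical; the sequence counter is compared as a plain integer on the assumption, stated in the text, that it does not wrap within the timing precision; and keeping both routes when nothing distinguishes them within the precision window is a reading of the visible text, since part of that paragraph is elided in this diff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Advertisement:
    """One route advertisement for the same address (hypothetical model)."""
    leaf: str                  # leaf that injected the route
    timestamp_ms: int          # timestamp taken by the network at the leaf
    seq: Optional[int] = None  # sequence counter from the mobile node, if any


def select_routes(a: Advertisement, b: Advertisement,
                  precision_ms: int) -> List[Advertisement]:
    """Return the advertisement(s) to keep for two concurrent routes."""
    # Timestamps are comparable when they are further apart than the
    # precision of the timing protocol: the timestamp alone decides.
    if abs(a.timestamp_ms - b.timestamp_ms) > precision_ms:
        return [a if a.timestamp_ms > b.timestamp_ms else b]

    # Within the precision window, fall back to the sequence counter when
    # both advertisements carry one and the values differ (a movement).
    # Assumption: the counter has not wrapped within the window.
    if a.seq is not None and b.seq is not None and a.seq != b.seq:
        return [a if a.seq > b.seq else b]

    # Same counter, or no counter at all, within the window: nothing
    # indicates a movement, so both routes are kept (anycast/multihoming).
    return [a, b]
```

For example, with a 200 ms precision, two advertisements taken 50 ms apart with counters 5 and 6 resolve to the one with counter 6, whereas the same pair with equal counters is kept as two equally valid anycast routes.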