| < draft-ietf-rift-applicability-00.txt | draft-ietf-rift-applicability-01.txt > | |||
|---|---|---|---|---|
| RIFT WG Yuehua. Wei | RIFT WG Yuehua. Wei, Ed. | |||
| Internet-Draft Zheng. Zhang | Internet-Draft Zheng. Zhang | |||
| Intended status: Informational ZTE Corporation | Intended status: Informational ZTE Corporation | |||
| Expires: August 25, 2020 Dmitry. Afanasiev | Expires: 5 October 2020 Dmitry. Afanasiev | |||
| Yandex | Yandex | |||
| Tom. Verhaeg | Tom. Verhaeg | |||
| Interconnect Services B.V. | Juniper Networks | |||
| Jaroslaw. Kowalczyk | Jaroslaw. Kowalczyk | |||
| Orange Polska | Orange Polska | |||
| February 22, 2020 | P. Thubert | |||
| Cisco Systems | ||||
| 3 April 2020 | ||||
| RIFT Applicability | RIFT Applicability | |||
| draft-ietf-rift-applicability-00 | draft-ietf-rift-applicability-01 | |||
| Abstract | Abstract | |||
| This document discusses the properties, applicability and operational | This document discusses the properties, applicability and operational | |||
| considerations of RIFT in different network scenarios. It intends to | considerations of RIFT in different network scenarios. It intends to | |||
| provide a rough guide to how RIFT can be deployed to simplify routing | provide a rough guide to how RIFT can be deployed to simplify routing | |||
| operations in Clos topologies and their variations. | operations in Clos topologies and their variations. | |||
| Status of This Memo | Status of This Memo | |||
| skipping to change at page 1, line 39 ¶ | skipping to change at page 1, line 41 ¶ | |||
| Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
| Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
| working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
| Drafts is at https://datatracker.ietf.org/drafts/current/. | Drafts is at https://datatracker.ietf.org/drafts/current/. | |||
| Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
| and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
| time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
| material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
| This Internet-Draft will expire on August 25, 2020. | This Internet-Draft will expire on 5 October 2020. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2020 IETF Trust and the persons identified as the | Copyright (c) 2020 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents (https://trustee.ietf.org/ | |||
| (https://trustee.ietf.org/license-info) in effect on the date of | license-info) in effect on the date of publication of this document. | |||
| publication of this document. Please review these documents | Please review these documents carefully, as they describe your rights | |||
| carefully, as they describe your rights and restrictions with respect | and restrictions with respect to this document. Code Components | |||
| to this document. Code Components extracted from this document must | extracted from this document must include Simplified BSD License text | |||
| include Simplified BSD License text as described in Section 4.e of | as described in Section 4.e of the Trust Legal Provisions and are | |||
| the Trust Legal Provisions and are provided without warranty as | provided without warranty as described in the Simplified BSD License. | |||
| described in the Simplified BSD License. | ||||
| Table of Contents | Table of Contents | |||
| 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 2. Problem Statement of Routing in Modern IP Fabric Fat Tree | 2. Problem Statement of Routing in Modern IP Fabric Fat Tree | |||
| Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 3 | Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
| 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 | 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 | |||
| 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 3 | 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 3 | |||
| 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 5 | 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 5 | |||
| 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 | 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 | |||
| 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 | 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 | |||
| 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 6 | 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 6 | |||
| 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 6 | 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 7 | 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 7 | 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 7 | 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 8 | |||
| 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 7 | 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 8 | |||
| 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 9 | 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
| 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 10 | 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 11 | |||
| 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 10 | 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 12 | |||
| 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 12 | 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 12 | |||
| 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 13 | 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 14 | |||
| 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 13 | 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 15 | |||
| 4.6. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 16 | 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 15 | |||
| 4.7. In-Band Reachability of Nodes . . . . . . . . . . . . . . 17 | 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 18 | |||
| 4.7.1. Reachability of Leafs . . . . . . . . . . . . . . . . 17 | 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 19 | |||
| 4.7.2. Reachability of Spines . . . . . . . . . . . . . . . 17 | 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 21 | |||
| 4.8. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 17 | 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 21 | |||
| 4.9. Fabric With A Controller . . . . . . . . . . . . . . . . 18 | 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 22 | |||
| 4.9.1. Controller Attached to ToFs . . . . . . . . . . . . . 19 | 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 23 | |||
| 4.9.2. Controller Attached to Leaf . . . . . . . . . . . . . 19 | 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 24 | |||
| 4.10. Internet Connectivity Without Underlay . . . . . . . . . 19 | 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 24 | |||
| 4.10.1. Internet Default on the Leafs . . . . . . . . . . . 19 | 4.12. Internet Connectivity With Underlay . . . . . . . . . . . 24 | |||
| 4.10.2. Internet Default on the ToFs . . . . . . . . . . . . 20 | 4.12.1. Internet Default on the Leaf . . . . . . . . . . . . 25 | |||
| 4.11. Subnet Mismatch and Address Families . . . . . . . . . . 20 | 4.12.2. Internet Default on the ToFs . . . . . . . . . . . . 25 | |||
| 4.12. Anycast Considerations . . . . . . . . . . . . . . . . . 20 | 4.13. Subnet Mismatch and Address Families . . . . . . . . . . 25 | |||
| 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 21 | 4.14. Anycast Considerations . . . . . . . . . . . . . . . . . 25 | |||
| 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 21 | 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 | |||
| 7. Normative References . . . . . . . . . . . . . . . . . . . . 22 | 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26 | |||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23 | 7. Normative References . . . . . . . . . . . . . . . . . . . . 27 | |||
| 8. Informative References . . . . . . . . . . . . . . . . . . . 28 | ||||
| Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 28 | ||||
| 1. Introduction | 1. Introduction | |||
| This document intends to explain the properties and applicability of | This document intends to explain the properties and applicability of | |||
| RIFT [I-D.ietf-rift-rift] in different deployment scenarios and | "Routing in Fat Trees" [RIFT] in different deployment scenarios and | |||
| highlight the operational simplicity of the technology compared to | highlight the operational simplicity of the technology compared to | |||
| traditional routing solutions. It also documents special | traditional routing solutions. It also documents special | |||
| considerations when RIFT is used with or without overlays or | considerations when RIFT is used with or without overlays or | |||
| controllers, and when it corrects topology miscablings and/or node and link | controllers, and when it corrects topology miscablings and/or node and link | |||
| failures. | failures. | |||
| 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks | |||
| Clos and Fat-Tree topologies have gained prominence in today's | Clos and Fat-Tree topologies have gained prominence in today's | |||
| networking, primarily as a result of the paradigm shift towards a | networking, primarily as a result of the paradigm shift towards a | |||
| centralized data-center based architecture that is poised to deliver | centralized data-center based architecture that is poised to deliver | |||
| a majority of computation and storage services in the future. | a majority of computation and storage services in the future. | |||
| Today's routing protocols were originally geared towards networks with | Today's routing protocols were originally geared towards networks with | |||
| an irregular topology and a low degree of connectivity. | an irregular topology and a low degree of connectivity. | |||
| When they are applied to Fat-Tree topologies: | When they are applied to Fat-Tree topologies: | |||
| o they tend to need extensive configuration or provisioning during | * they tend to need extensive configuration or provisioning during | |||
| bring up and re-dimensioning. | bring up and re-dimensioning. | |||
| o spine and leaf nodes have the entire network topology and routing | * spine and leaf nodes have the entire network topology and routing | |||
| information, which is, in fact, not needed on the leaf nodes during | information, which is, in fact, not needed on the leaf nodes during | |||
| normal operation. | normal operation. | |||
| o significant Link State PDUs (LSPs) flooding duplication between | * significant Link State PDUs (LSPs) flooding duplication between | |||
| spine nodes and leaf nodes occurs during network bring up and | spine nodes and leaf nodes occurs during network bring up and | |||
| topology updates. It consumes both spine and leaf nodes' CPU and | topology updates. It consumes both spine and leaf nodes' CPU and | |||
| link bandwidth resources and with that limits protocol | link bandwidth resources and with that limits protocol | |||
| scalability. | scalability. | |||
| 3. Applicability of RIFT to Clos IP Fabrics | 3. Applicability of RIFT to Clos IP Fabrics | |||
| Further content of this document assumes that the reader is familiar | Further content of this document assumes that the reader is familiar | |||
| with the terms and concepts used in OSPF [RFC2328] and IS-IS | with the terms and concepts used in OSPF [RFC2328] and IS-IS | |||
| [ISO10589-Second-Edition] link-state protocols and at least the | [ISO10589-Second-Edition] link-state protocols and at least the | |||
| sections of RIFT [I-D.ietf-rift-rift] outlining the requirement of | sections of [RIFT] outlining the requirement of routing in IP fabrics | |||
| routing in IP fabrics and RIFT protocol concepts. | and RIFT protocol concepts. | |||
| 3.1. Overview of RIFT | 3.1. Overview of RIFT | |||
| RIFT is a dynamic routing protocol for Clos and fat-tree network | RIFT is a dynamic routing protocol for Clos and fat-tree network | |||
| topologies. It defines a link-state protocol when "pointing north" | topologies. It defines a link-state protocol when "pointing north" | |||
| and path-vector protocol when "pointing south". | and path-vector protocol when "pointing south". | |||
| It floods flat link-state information northbound only so that each | It floods flat link-state information northbound only so that each | |||
| level obtains the full topology of levels south of it. That | level obtains the full topology of levels south of it. That | |||
| information is never flooded East-West or back South again. So a top | information is never flooded East-West or back South again. So a top | |||
| skipping to change at page 4, line 47 ¶ | skipping to change at page 4, line 47 ¶ | |||
| +----+ +----+ +----+ +-----+ | +----+ +----+ +----+ +-----+ | |||
| Figure 1: RIFT overview | Figure 1: RIFT overview | |||
| A middle tier node has only the information necessary for its level: | A middle tier node has only the information necessary for its level: | |||
| all destinations south of the node based on SPF | all destinations south of the node based on SPF | |||
| calculation, the default route and potential disaggregated routes. | calculation, the default route and potential disaggregated routes. | |||
| RIFT combines the advantages of both Link-State and Distance Vector: | RIFT combines the advantages of both Link-State and Distance Vector: | |||
| o Fastest Possible Convergence | * Fastest Possible Convergence | |||
| o Automatic Detection of Topology | * Automatic Detection of Topology | |||
| o Minimal Routes/Info on TORs | * Minimal Routes/Info on TORs | |||
| o High Degree of ECMP | * High Degree of ECMP | |||
| o Fast De-commissioning of Nodes | * Fast De-commissioning of Nodes | |||
| o Maximum Propagation Speed with Flexible Prefixes in an Update | * Maximum Propagation Speed with Flexible Prefixes in an Update | |||
| And RIFT eliminates the disadvantages of Link-State or Distance | And RIFT eliminates the disadvantages of Link-State or Distance | |||
| Vector: | Vector: | |||
| o Reduced and Balanced Flooding | * Reduced and Balanced Flooding | |||
| o Automatic Neighbor Detection | * Automatic Neighbor Detection | |||
| There are therefore two types of link state database: "north | There are therefore two types of link state database: "north | |||
| representation" N-TIEs and "south representation" S-TIEs. The N-TIEs | representation" N-TIEs and "south representation" S-TIEs. The N-TIEs | |||
| contain a link state topology description of lower levels and S-TIEs | contain a link state topology description of lower levels and S-TIEs | |||
| simply carry default routes for the lower levels. | simply carry default routes for the lower levels. | |||
| RIFT has several further unique advantages, listed below, which | RIFT has several further unique advantages, listed below, which | |||
| are described in detail in RIFT | are described in detail in [RIFT]. | |||
| [I-D.ietf-rift-rift]. | ||||
| o True ZTP | * True ZTP | |||
| o Minimal Blast Radius on Failures | * Minimal Blast Radius on Failures | |||
| o Can Utilize All Paths Through Fabric Without Looping | * Can Utilize All Paths Through Fabric Without Looping | |||
| o Automatic Disaggregation on Failures | * Automatic Disaggregation on Failures | |||
| o Simple Leaf Implementation that Can Scale Down to Servers | * Simple Leaf Implementation that Can Scale Down to Servers | |||
| o Key-Value Store | * Key-Value Store | |||
| o Horizontal Links Used for Protection Only | * Horizontal Links Used for Protection Only | |||
| o Supports Non-Equal Cost Multipath and Can Replace MC-LAG | * Supports Non-Equal Cost Multipath and Can Replace MC-LAG | |||
| o Optimal Flooding Reduction and Load-Balancing | * Optimal Flooding Reduction and Load-Balancing | |||
| 3.2. Applicable Topologies | 3.2. Applicable Topologies | |||
| Although RIFT is specified primarily for "proper" Clos or "fat-tree" | Although RIFT is specified primarily for "proper" Clos or "fat-tree" | |||
| structures, it already supports PoD concepts which are strictly | structures, it already supports PoD concepts which are strictly | |||
| speaking not found in original Clos concepts. | speaking not found in original Clos concepts. | |||
| Further, the specification explains and supports operations of multi- | Further, the specification explains and supports operations of multi- | |||
| plane Clos variants where the protocol relies on a set of rings to | plane Clos variants where the protocol relies on a set of rings to | |||
| allow the reconciliation of topology view of different planes as most | allow the reconciliation of topology view of different planes as most | |||
| skipping to change at page 6, line 40 ¶ | skipping to change at page 6, line 40 ¶ | |||
| its northbound adjacencies (as long as any of the other nodes in the | its northbound adjacencies (as long as any of the other nodes in the | |||
| level are northbound connected) to still participate in northbound | level are northbound connected) to still participate in northbound | |||
| forwarding. | forwarding. | |||
| 3.2.2. Vertical Shortcuts | 3.2.2. Vertical Shortcuts | |||
| Through relaxations of the specified adjacency forming rules RIFT | Through relaxations of the specified adjacency forming rules RIFT | |||
| implementations can be extended to support vertical "shortcuts" as | implementations can be extended to support vertical "shortcuts" as | |||
| proposed by e.g. [I-D.white-distoptflood]. The RIFT specification | proposed by e.g. [I-D.white-distoptflood]. The RIFT specification | |||
| itself does not provide the exact details since the resulting | itself does not provide the exact details since the resulting | |||
| solution suffers from either much larger blast radii with increased | solution suffers from either much larger blast radius with increased | |||
| flooding volumes or, in the case of maximum aggregation, routing bow-tie | flooding volumes or, in the case of maximum aggregation, routing bow-tie | |||
| problems. | problems. | |||
| 3.2.3. Generalizing to any Directed Acyclic Graph | ||||
| RIFT is an anisotropic routing protocol, meaning that it has a sense | ||||
| of direction (Northbound, Southbound, East-West) and that it operates | ||||
| differently depending on the direction. | ||||
| * Northbound, RIFT operates as a Link State IGP, whereby the control | ||||
| packets are reflooded first all the way North and only interpreted | ||||
| later. All the individual fine-grained routes are advertised. | ||||
| * Southbound, RIFT operates as a Distance Vector IGP, whereby the | ||||
| control packets are flooded only one hop, interpreted, and the | ||||
| consequence of that computation is what gets flooded one more hop | ||||
| South. In the most common use-cases, a ToF node can reach most of | ||||
| the prefixes in the fabric. If that is the case, the ToF node | ||||
| advertises the fabric default and disaggregates the prefixes that | ||||
| it cannot reach. On the other hand, a ToF Node that can reach | ||||
| only a small subset of the prefixes in the fabric will preferably | ||||
| advertise those prefixes and refrain from aggregating. | ||||
| In the general case, what gets advertised South is, in more | ||||
| detail: | ||||
| 1. A fabric default that aggregates all the prefixes that are | ||||
| reachable within the fabric, and that could be a default route | ||||
| or a prefix that is dedicated to this particular fabric. | ||||
| 2. The loopback addresses of the Northbound nodes, e.g., for | ||||
| inband management. | ||||
| 3. The disaggregated prefixes for the dynamic exceptions to the | ||||
| fabric default, advertised to route around the black hole that | ||||
| may form. | ||||
| * East-West routing can optionally be used, with specific | ||||
| restrictions. It is useful in particular when a sibling has | ||||
| access to the fabric default but this node does not. | ||||
| A Directed Acyclic Graph (DAG) provides a sense of North (the | ||||
| direction of the DAG) and of South (the reverse), which can be used | ||||
| to apply RIFT. For the purpose of RIFT a vertex in the DAG that has | ||||
| only incoming edges is a ToF node. | ||||
| There are a number of caveats though: | ||||
| * The DAG structure must exist before RIFT starts, so there is a | ||||
| need for a companion protocol to establish the logical DAG | ||||
| structure. | ||||
| * A generic DAG does not have a sense of East and West. The | ||||
| operation specified for East-West links and the Southbound | ||||
| reflection between nodes are not applicable. | ||||
| * In order to aggregate and disaggregate routes, RIFT requires that | ||||
| all the ToF nodes share the full knowledge of the prefixes in the | ||||
| fabric. This can be achieved with a ring as suggested by the RIFT | ||||
| main specification, by some preconfiguration, or using a | ||||
| synchronization with a common repository where all the active | ||||
| prefixes are registered. | ||||
| 3.3. Use Cases | 3.3. Use Cases | |||
| 3.3.1. DC Fabrics | 3.3.1. DC Fabrics | |||
| RIFT is largely driven by the demands of, and hence ideally suited for, | RIFT is largely driven by the demands of, and hence ideally suited for, | |||
| the underlay of data center IP fabrics, the vast majority of | the underlay of data center IP fabrics, the vast majority of | |||
| which seem to be currently (and for the foreseeable future) Clos | which seem to be currently (and for the foreseeable future) Clos | |||
| architectures. It significantly simplifies operation and deployment | architectures. It significantly simplifies operation and deployment | |||
| of such fabrics, as described in Section 4, compared to environments | of such fabrics, as described in Section 4, compared to environments | |||
| that rely on extensive proprietary provisioning and operational solutions. | that rely on extensive proprietary provisioning and operational solutions. | |||
| skipping to change at page 8, line 44 ¶ | skipping to change at page 10, line 44 ¶ | |||
| | |--------| |--------| |----------| |-------| | | | |--------| |--------| |----------| |-------| | | |||
| | |--------| |--------| |----------| |-------| | | | |--------| |--------| |----------| |-------| | | |||
| | || VAS7 || || VAS4 || || vIGMP || ||BAA || | | | || VAS7 || || VAS4 || || vIGMP || ||BAA || | | |||
| | |--------| |--------| |----------| |-------| | | | |--------| |--------| |----------| |-------| | | |||
| | +--------+ +--------+ +----------+ +-------+ | | | +--------+ +--------+ +----------+ +-------+ | | |||
| | | | | | | |||
| ++-----------+ +---------++ | ++-----------+ +---------++ | |||
| |Network I/O | |Access I/O| | |Network I/O | |Access I/O| | |||
| +------------+ +----------+ | +------------+ +----------+ | |||
| Figure 2: An example of CloudCO architecture | Figure 2: An example of CloudCO architecture | |||
| The Spine-Leaf architecture deployed inside CloudCO meets the | The Spine-Leaf architecture deployed inside CloudCO meets the | |||
| network requirements of being adaptable, agile, scalable and dynamic. | network requirements of being adaptable, agile, scalable and dynamic. | |||
| 4. Deployment Considerations | 4. Deployment Considerations | |||
| RIFT presents the opportunity for organizations building and | RIFT presents the opportunity for organizations building and | |||
| operating IP fabrics to simplify their operation and deployments | operating IP fabrics to simplify their operation and deployments | |||
| while achieving many desirable properties of dynamic routing on | while achieving many desirable properties of dynamic routing on | |||
| such a substrate: | such a substrate: | |||
| o RIFT design follows minimum blast radius and minimum necessary | * RIFT design follows minimum blast radius and minimum necessary | |||
| epistemological scope philosophy which leads to very good scaling | epistemological scope philosophy which leads to very good scaling | |||
| properties while delivering maximum reactiveness. | properties while delivering maximum reactiveness. | |||
| o RIFT allows for extensive Zero Touch Provisioning within the | * RIFT allows for extensive Zero Touch Provisioning within the | |||
| protocol. In its most extreme version RIFT does not rely on any | protocol. In its most extreme version RIFT does not rely on any | |||
| specific addressing and for IP fabric can operate using IPv6 ND | specific addressing and for IP fabric can operate using IPv6 ND | |||
| [RFC4861] only. | [RFC4861] only. | |||
| o RIFT has provisions to detect common IP fabric mis-cabling | * RIFT has provisions to detect common IP fabric mis-cabling | |||
| scenarios. | scenarios. | |||
| o RIFT automatically negotiates BFD per link, which allows | * RIFT automatically negotiates BFD per link, which allows | |||
| IP and micro-BFD [RFC7130] to replace LAGs, which hide bandwidth | IP and micro-BFD [RFC7130] to replace LAGs, which hide bandwidth | |||
| imbalances in case of constituent failures. Further automatic | imbalances in case of constituent failures. Further automatic | |||
| link validation techniques similar to [RFC5357] could be supported | link validation techniques similar to [RFC5357] could be supported | |||
| as well. | as well. | |||
| o RIFT inherently solves many difficult problems associated with the | * RIFT inherently solves many difficult problems associated with the | |||
| use of traditional routing topologies with dense meshes and high | use of traditional routing topologies with dense meshes and high | |||
| degrees of ECMP by including automatic bandwidth balancing, flood | degrees of ECMP by including automatic bandwidth balancing, flood | |||
| reduction and automatic disaggregation on failures while providing | reduction and automatic disaggregation on failures while providing | |||
| maximum aggregation of prefixes in default scenarios. | maximum aggregation of prefixes in default scenarios. | |||
| o RIFT reduces FIB size towards the bottom of the IP fabric where | * RIFT reduces FIB size towards the bottom of the IP fabric where | |||
| most nodes reside, which allows for cheaper hardware on the | most nodes reside, which allows for cheaper hardware on the | |||
| edges and introduction of modern IP fabric architectures that | edges and introduction of modern IP fabric architectures that | |||
| encompass e.g. server multi-homing. | encompass e.g. server multi-homing. | |||
| o RIFT provides valley-free routing and is therefore loop free. | * RIFT provides valley-free routing and is therefore loop free. | |||
| This allows the use of any such valley-free path in bi-sectional | This allows the use of any such valley-free path in bi-sectional | |||
| fabric bandwidth between two destinations irrespective of their | fabric bandwidth between two destinations irrespective of their | |||
| metrics which can be used to balance load on the fabric in | metrics which can be used to balance load on the fabric in | |||
| different ways. | different ways. | |||
| o RIFT includes a key-value distribution mechanism which allows for | * RIFT includes a key-value distribution mechanism which allows for | |||
| many future applications such as automatic provisioning of basic | many future applications such as automatic provisioning of basic | |||
| overlay services or automatic key roll-overs over whole fabrics. | overlay services or automatic key roll-overs over whole fabrics. | |||
| o RIFT is designed for minimum delay in case of prefix mobility on | * RIFT is designed for minimum delay in case of prefix mobility on | |||
| the fabric. | the fabric. | |||
| o Many further operational and design points collected over many | * Many further operational and design points collected over many | |||
| years of routing protocol deployments have been incorporated in | years of routing protocol deployments have been incorporated in | |||
| RIFT such as fast flooding rates, protection of information | RIFT such as fast flooding rates, protection of information | |||
| lifetimes and operationally easily recognizable remote ends of | lifetimes and operationally easily recognizable remote ends of | |||
| links and node names. | links and node names. | |||
| 4.1. South Reflection | 4.1. South Reflection | |||
| South reflection is a mechanism by which South Node TIEs are "reflected" | South reflection is a mechanism by which South Node TIEs are "reflected" | |||
| back up north to allow nodes in the same level without E-W links to "see" | back up north to allow nodes in the same level without E-W links to "see" | |||
| each other. | each other. | |||
| skipping to change at page 12, line 51 ¶ | skipping to change at page 14, line 49 ¶ | |||
| Without a disaggregation mechanism, when linkTS3 and linkTS4 both fail, | Without a disaggregation mechanism, when linkTS3 and linkTS4 both fail, | |||
| packets from leaf111 to prefix122 would suffer 50% black-holing | packets from leaf111 to prefix122 would suffer 50% black-holing | |||
| based on the pure default route. A packet that goes up through | based on the pure default route. A packet that goes up through | |||
| linkSL1 to linkTS1 and would then go down through linkTS3 or linkTS4 will be | linkSL1 to linkTS1 and would then go down through linkTS3 or linkTS4 will be | |||
| dropped. A packet that goes up through linkSL3 to linkTS2 | dropped. A packet that goes up through linkSL3 to linkTS2 | |||
| and would then go down through linkTS3 or linkTS4 will be dropped as well. | and would then go down through linkTS3 or linkTS4 will be dropped as well. | |||
| This is a case of black-holing. | This is a case of black-holing. | |||
| With the disaggregation mechanism, when linkTS3 and linkTS4 both fail, | With the disaggregation mechanism, when linkTS3 and linkTS4 both fail, | |||
| ToF22 will detect the failure according to the reflected node S-TIE | ToF22 will detect the failure according to the reflected node S-TIE | |||
| of ToF21 from Spine111\Spine112\Spine121\Spine122. Based on the | of ToF21 from Spine111\Spine112. Based on the disaggregation | |||
| disaggregation algorithm provided by RIFT, ToF22 will explicitly | algorithm provided by RIFT, ToF22 will explicitly originate an S-TIE | |||
| originate an S-TIE with prefix 121 and prefix 122, that is flooded to | with prefix 121 and prefix 122, that is flooded to spines 111, 112, | |||
| spines 111, 112, 121 and 122. | 121 and 122. | |||
| The packet from leaf111 to prefix122 will not be routed to linkTS1 or | The packet from leaf111 to prefix122 will not be routed to linkTS1 or | |||
| linkTS2. The packet from leaf111 to prefix122 will only be routed to | linkTS2. The packet from leaf111 to prefix122 will only be routed to | |||
| linkTS5 or linkTS7 following a longest-prefix match to prefix122. | linkTS5 or linkTS7 following a longest-prefix match to prefix122. | |||
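The effect of the more specific route on forwarding can be sketched in Python as follows. This is a minimal illustration, not part of RIFT: the address used for prefix122, the assumption that Spine111 reaches ToF21 over linkTS1 and ToF22 over linkTS5, and the helper name lookup() are all chosen for this example only.

```python
import ipaddress

def lookup(fib, destination):
    """Return the next hops of the longest prefix in fib that covers destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]

# Hypothetical addressing: prefix122 is assumed to be 198.51.100.0/24.
default = ipaddress.ip_network("0.0.0.0/0")
prefix122 = ipaddress.ip_network("198.51.100.0/24")

# Assumed FIB of Spine111 before ToF22 disaggregates: default route only.
fib_before = {default: ["linkTS1", "linkTS5"]}

# After ToF22 originates the S-TIE with prefix122, the spine installs a
# more specific route that points to ToF22 only.
fib_after = {default: ["linkTS1", "linkTS5"], prefix122: ["linkTS5"]}
```

With fib_before, a packet to 198.51.100.7 may still be hashed onto linkTS1 and dropped at ToF21; with fib_after, the longest-prefix match selects linkTS5 only, while all other destinations keep using both uplinks.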
| 4.4. Zero Touch Provisioning (ZTP) | 4.4. Zero Touch Provisioning (ZTP) | |||
| Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | |||
| has no configuration (unless it is a Top-of-Fabric at the top of the | has no configuration (unless it is a Top-of-Fabric at the top of the | |||
| topology or it is desired to confine it to leaf role w/o leaf-2-leaf | topology or it is desired to confine it to leaf role w/o leaf-2-leaf | |||
| skipping to change at page 14, line 25 ¶ | skipping to change at page 16, line 4 ¶ | |||
| |Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1 | |Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1 | |||
| +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | +---------+ | link-M | +---------+ | | | +---------+ | link-M | +---------+ | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| | +-------+ | | | | +-------+ | | | | +-------+ | | | | +-------+ | | | |||
| | | | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ | |||
| |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| Figure 5: A single plane miscabling example | Figure 5: A single plane miscabling example | |||
| Figure Figure 5 shows a single plane miscabling example. It's a | Figure 5 shows a single plane miscabling example. It's a perfect | |||
| perfect fat-tree fabric except link-M connecting Leaf112 to ToF22. | fat-tree fabric except link-M connecting Leaf112 to ToF22. | |||
| The RIFT control protocol can discover the physical links | The RIFT control protocol can discover the physical links | |||
| automatically and is able to detect cabling that violates fat-tree | automatically and is able to detect cabling that violates fat-tree | |||
| topology constraints. It reacts accordingly to such mis-cabling | topology constraints. It reacts accordingly to such mis-cabling | |||
| attempts, at a minimum preventing adjacencies between nodes from | attempts, at a minimum preventing adjacencies between nodes from | |||
| being formed and traffic from being forwarded on those mis-cabled | being formed and traffic from being forwarded on those mis-cabled | |||
| links. Leaf112 will in such a scenario use link-M to derive its level | links. Leaf112 will in such a scenario use link-M to derive its level | |||
| (unless it is a leaf) and can report links to spines 111 and 112 as | (unless it is a leaf) and can report links to spines 111 and 112 as | |||
| miscabled unless the implementation allows horizontal links. | miscabled unless the implementation allows horizontal links. | |||
| Figure Figure 6 shows a multiple plane miscabling example. Since | Figure 6 shows a multiple plane miscabling example. Since Leaf112 | |||
| Leaf112 and Spine121 belong to two different PoDs, the adjacency | and Spine121 belong to two different PoDs, the adjacency between | |||
| between Leaf112 and Spine121 can not be formed. link-W would be | Leaf112 and Spine121 can not be formed. link-W would be detected and | |||
| detected and prevented. | prevented. | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 | |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 | |||
| +-------+ +-------+ +-------+ +-------+ | +-------+ +-------+ +-------+ +-------+ | |||
| | | | | | | | | | | | | | | | | | | |||
| | | | +-----------------+ | | | | | | | +-----------------+ | | | | |||
| | +--------------------------+ | | | | | | +--------------------------+ | | | | | |||
| | | | | | | | | | | | | | | | | | | |||
| | +------+ | | | +------+ | | | +------+ | | | +------+ | | |||
| | | +-----------------+ | | | | | | | | +-----------------+ | | | | | | |||
| skipping to change at page 16, line 31 ¶ | skipping to change at page 17, line 41 ¶ | |||
| +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ | +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ | |||
| |Leaf111| |Leaf112| |Leaf111| |Leaf112| | |Leaf111| |Leaf112| |Leaf111| |Leaf112| | |||
| +-------+ +-------+ +-+-----+ +-+-----+ | +-------+ +-------+ +-+-----+ +-+-----+ | |||
| | | | | | | |||
| | +--------+ | | +--------+ | |||
| | | | | | | |||
| +-+---+-+ | +-+---+-+ | |||
| |Spine11| | |Spine11| | |||
| +-------+ | +-------+ | |||
| Figure 7: Fallen spine | Figure 7: Fallen spine | |||
| 4.6. IPv4 over IPv6 | 4.6. Positive vs. Negative Disaggregation | |||
| Disaggregation is the procedure whereby [RIFT] advertises a more | ||||
| specific route Southwards as an exception to the aggregated fabric- | ||||
| default North. Disaggregation is useful when a prefix within the | ||||
| aggregation is reachable via some of the parents but not the others | ||||
| at the same level of the fabric. It is mandatory when the level is | ||||
| the ToF since a ToF node that cannot reach a prefix becomes a black | ||||
| hole for that prefix. The hard problem is to know which prefixes are | ||||
| reachable by whom. | ||||
| In the general case, [RIFT] solves that problem by interconnecting | ||||
| the ToF nodes so they can exchange the full list of prefixes that | ||||
| exist in the fabric and figure out when a ToF node lacks reachability | ||||
| to an existing prefix. This requires additional ports at the ToF, | ||||
| typically 2 ports per ToF node to form a ToF-spanning ring. [RIFT] | ||||
| also defines the Southbound Reflection | ||||
| procedure that enables a parent to explore the direct connectivity of | ||||
| its peers, meaning their own parents and children; based on the | ||||
| advertisements received from the shared parents and children, it may | ||||
| enable the parent to infer the prefixes its peers can reach. | ||||
| When a parent lacks reachability to a prefix, it may disaggregate the | ||||
| prefix negatively, i.e., advertise that this parent can be used to | ||||
| reach any prefix in the aggregation except that one. The Negative | ||||
| Disaggregation signaling is simple and functions transitively from | ||||
| ToF to ToP and then from ToP to Leaf. But it is hard for a parent to | ||||
| figure out which prefix it needs to disaggregate, because it does not | ||||
| know what it does not know; as a result, the use of a spanning | ||||
| ring at the ToF is required to operate the Negative Disaggregation. | ||||
| Also, though it is only an implementation problem, the programming | ||||
| of the FIB is complex compared to normal routes, and may incur | ||||
| recursions. | ||||
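As a rough illustration of why the FIB handling differs from normal routes, the following Python sketch resolves a negatively disaggregated prefix into ordinary next hops: it starts from the next hops of the longest covering prefix (here the default route) and removes the parents that advertised the prefix negatively. The prefix, the parent names and the function name resolve_negative() are assumptions made for this sketch, and only a single level of nesting is handled.

```python
import ipaddress

def resolve_negative(fib, negative_prefix, negative_parents):
    """Return the next hops to install for negative_prefix.

    The result is the next-hop set of the longest prefix in fib that
    covers negative_prefix, minus the parents that advertised the
    prefix negatively.
    """
    target = ipaddress.ip_network(negative_prefix)
    covering = [net for net in fib if target.subnet_of(net)]
    parent_net = max(covering, key=lambda net: net.prefixlen)
    return sorted(set(fib[parent_net]) - set(negative_parents))

# Hypothetical node with three parents sharing the default route.
fib = {ipaddress.ip_network("0.0.0.0/0"): ["ToF1", "ToF2", "ToF3"]}

# ToF2 negatively disaggregates 198.51.100.0/24 (an assumed prefix).
next_hops = resolve_negative(fib, "198.51.100.0/24", ["ToF2"])
```

If another negative advertisement arrives for a prefix nested inside 198.51.100.0/24, the same resolution has to be repeated against the newly installed entry, which hints at the recursion mentioned above.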
| The more classical alternative is, for the parents that can reach a | ||||
| prefix that peers at the same level cannot, to advertise a more | ||||
| specific route to that prefix. This leverages the normal longest | ||||
| prefix match in the FIB, and does not require a special | ||||
| implementation. But as opposed to the Negative Disaggregation, the | ||||
| Positive Disaggregation is difficult and inefficient to operate | ||||
| transitively. | ||||
| Transitivity is not needed to a grandchild if all its parents | ||||
| received the Positive Disaggregation, meaning that they shall all | ||||
| avoid the black hole; when that is the case, they collectively build | ||||
| a ceiling that protects the grandchild. But until then, a parent | ||||
| that received a Positive Disaggregation may believe that some peers | ||||
| are lacking the reachability and readvertise too early, or defer and | ||||
| maintain a black hole situation longer than necessary. | ||||
| In a non-partitioned fabric, all the ToF nodes see one another | ||||
| through the reflection and can figure if one is missing a child. In | ||||
| that case it is possible to compute the prefixes that the peer cannot | ||||
| reach and disaggregate positively without a ToF-spanning ring. The | ||||
| ToF nodes can also ascertain that the ToP nodes are connected each to | ||||
| at least a ToF node that can still reach the prefix, meaning that the | ||||
| transitive operation is not required. | ||||
| The bottom line is that in a fabric that is partitioned (e.g., using | ||||
| multiple planes) and/or where the ToP nodes are not guaranteed to | ||||
| always form a ceiling for their children, it is mandatory to use the | ||||
| Negative Disaggregation. On the other hand, in a highly symmetrical | ||||
| and fully connected fabric, (e.g., a canonical Clos Network), the | ||||
| Positive Disaggregation method saves the complexity and | ||||
| cost associated with the ToF-spanning ring. | ||||
| Note that in the case of Positive Disaggregation, the first ToF | ||||
| node(s) that announces a more-specific route attracts all the traffic | ||||
| for that route and may suffer from a transient incast. A ToP node | ||||
| that defers injecting the longer prefix in the FIB, in order to | ||||
| receive more advertisements and spread the packets better, also keeps | ||||
| on sending a portion of the traffic to the black hole in the | ||||
| meantime. In the case of Negative Disaggregation, the last ToF | ||||
| node(s) that injects the route may also incur an incast issue; this | ||||
| problem would occur if a prefix that becomes totally unreachable is | ||||
| disaggregated, but doing so is mostly useless and is not recommended. | ||||
| 4.7. Mobile Edge and Anycast | ||||
| When a physical or a virtual node changes its point of attachment in | ||||
| the fabric from a previous-leaf to a next-leaf, new routes must be | ||||
| installed that supersede the old ones. Since the flooding flows | ||||
| Northwards, the nodes (if any) between the previous-leaf and the | ||||
| common parent are not immediately aware that the path via previous- | ||||
| leaf is obsolete, and a stale route may exist for a while. The | ||||
| common parent needs to select the freshest route advertisement in | ||||
| order to install the correct route via the next-leaf. This requires | ||||
| that the fabric determines the sequence of the movements of the | ||||
| mobile node. | ||||
| On the one hand, a classical sequence counter provides a total order | ||||
| for a while but it will eventually wrap. On the other hand, a | ||||
| timestamp provides a permanent order but it may miss a movement that | ||||
| happens too quickly vs. the granularity of the timing information. | ||||
| It is not envisioned in the short term that the average fabric | ||||
| supports a Precision Time Protocol, and the precision that may be | ||||
| available with the Network Time Protocol [RFC5905], in the order of | ||||
| 100 to 200ms, may not be necessarily enough to cover, e.g., the fast | ||||
| mobility of a Virtual Machine. | ||||
| Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that | ||||
| combines a sequence counter from the mobile node and a timestamp from | ||||
| the network taken at the leaf when the route is injected. If the | ||||
| timestamps of the concurrent advertisements are comparable (i.e., | ||||
| more distant than the precision of the timing protocol), then the | ||||
| timestamp alone is used to determine the relative freshness of the | ||||
| routes. Otherwise, the sequence counter from the mobile node, if | ||||
| available, is used. One caveat is that the sequence counter must not | ||||
| wrap within the precision of the timing protocol. Another is that | ||||
| the mobile node may not even provide a sequence counter, in which | ||||
| case the mobility itself must be slower than the precision of the | ||||
| timing. | ||||
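The hybrid comparison described above can be sketched as follows. This is a simplified reading of the text, not the normative procedure of [RIFT]: the Advertisement type, its field names and the fresher() helper are hypothetical, timestamps are plain seconds, and sequence counter wrap-around is deliberately not handled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Advertisement:
    leaf: str
    timestamp: float          # seconds, taken at the leaf on injection
    sequence: Optional[int]   # from the mobile node, may be absent

def fresher(a, b, precision):
    """Return the fresher advertisement, or None if they cannot be ordered."""
    # Timestamps further apart than the timing precision are comparable,
    # so the timestamp alone decides.
    if abs(a.timestamp - b.timestamp) > precision:
        return a if a.timestamp > b.timestamp else b
    # Otherwise fall back to the sequence counter, if both carry one.
    if a.sequence is not None and b.sequence is not None:
        if a.sequence != b.sequence:
            return a if a.sequence > b.sequence else b
    return None

old = Advertisement("previous-leaf", 10.0, 4)
moved_slowly = Advertisement("next-leaf", 10.5, 3)
moved_quickly = Advertisement("next-leaf", 10.1, 5)
no_counter = Advertisement("next-leaf", 10.1, None)
```

With a precision of 0.2 seconds, moved_slowly wins on its timestamp even though its counter is lower, moved_quickly wins on its counter, and the pair (old, no_counter) cannot be ordered, which is the case where the mobility itself must be slower than the timing precision.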
| Mobility must not be confused with Anycast. In both cases, the same | ||||
| address is injected in RIFT at different leaves. In the case of | ||||
| mobility, only the freshest route must be conserved, since the mobile | ||||
| node changed its point of attachment from one leaf to the next. In the | ||||
| case of anycast, the node may be either multihomed (attached to | ||||
| multiple leaves in parallel) or reachable beyond the fabric via | ||||
| multiple routes that are redistributed to different leaves; either | ||||
| way, in the case of anycast, the multiple routes are equally valid | ||||
| and should be conserved. Without further information from the | ||||
| redistributed routing protocol, it is impossible to sort out a | ||||
| movement from a redistribution that happens asynchronously on | ||||
| different leaves. [RIFT] expects that anycast addresses are | ||||
| advertised within the timing precision, which is typically the case | ||||
| with a low-precision timing and a multihomed node. Beyond that time | ||||
| interval, RIFT interprets the lag as a mobility and only the freshest | ||||
| route is retained. | ||||
| When using IPv6 [RFC8200], RIFT suggests leveraging "Registration | ||||
| Extensions for IPv6 over Low-Power Wireless Personal Area Network | ||||
| (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND | ||||
| interaction between the mobile node and the leaf. This provides not | ||||
| only a sequence counter but also a lifetime and a security token that | ||||
| may be used to protect the ownership of an address. When using | ||||
| [RFC8505], the parallel registration of an anycast address to | ||||
| multiple leaves is done with the same sequence counter, whereas the | ||||
| sequence counter is incremented when the point of attachment | ||||
| changes. This way, it is possible to differentiate a mobile node | ||||
| from a multihomed node, even when the mobility happens within the | ||||
| timing precision. It is also possible for a mobile node to be | ||||
| multihomed as well, e.g., to change only one of its points of | ||||
| attachment. | ||||
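The distinction drawn above between a multihomed (anycast) registration and a movement can be illustrated with a short Python sketch. It assumes that each leaf reports the sequence counter it received in the registration; the function name retained_leaves() and the leaf names are hypothetical, and counter wrap-around is ignored.

```python
def retained_leaves(registrations):
    """Return the leaves whose route is kept for one address.

    registrations is a list of (leaf, sequence_counter) pairs. Leaves
    that registered with the highest counter are kept: equal counters
    indicate parallel (anycast or multihomed) registrations, a lower
    counter indicates a stale point of attachment.
    """
    newest = max(seq for _, seq in registrations)
    return sorted(leaf for leaf, seq in registrations if seq == newest)

# Same counter at both leaves: multihomed node, both routes are valid.
multihomed = retained_leaves([("leaf1", 7), ("leaf2", 7)])

# Counter incremented at leaf2: the node moved, leaf1 is stale.
moved = retained_leaves([("leaf1", 7), ("leaf2", 8)])
```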
| 4.8. IPv4 over IPv6 | ||||
| RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. IPv6 | RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. IPv6 | |||
| AF configures via the usual ND mechanisms and then V4 can use V6 | AF configures via the usual ND mechanisms and then V4 can use V6 | |||
| nexthops analogous to RFC5549. It is expected that the whole fabric | nexthops analogous to RFC5549. It is expected that the whole fabric | |||
| supports the same type of forwarding of address families on all the | supports the same type of forwarding of address families on all the | |||
| links. RIFT provides an indication whether a node is v4 forwarding | links. RIFT provides an indication whether a node is v4 forwarding | |||
| capable and implementations are possible where different routing | capable and implementations are possible where different routing | |||
| tables are computed per address family as long as the computation | tables are computed per address family as long as the computation | |||
| remains loop-free. | remains loop-free. | |||
| skipping to change at page 17, line 30 ¶ | skipping to change at page 21, line 43 ¶ | |||
| +---+---+ |LEAF | | LEAF| | +---+---+ |LEAF | | LEAF| | |||
| +--+--+ +--+--+ | +--+--+ +--+--+ | |||
| | | | | | | |||
| IPv4 prefixes| |IPv4 prefixes | IPv4 prefixes| |IPv4 prefixes | |||
| | | | | | | |||
| +---+----+ +---+----+ | +---+----+ +---+----+ | |||
| | V4 | | V4 | | | V4 | | V4 | | |||
| | subnet | | subnet | | | subnet | | subnet | | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| Figure 8: IPv4 over IPv6 | Figure 8: IPv4 over IPv6 | |||
| 4.7. In-Band Reachability of Nodes | 4.9. In-Band Reachability of Nodes | |||
| 4.7.1. Reachability of Leafs | RIFT does not require that nodes of the fabric have reachable | |||
| addresses, but there may be operational reasons to reach the internal | ||||
| nodes. Figure 9 shows an example in which the NMS attaches to LEAF1. | ||||
| TODO | +-------+ +-------+ | |||
| | ToF1 | | ToF2 | | ||||
| ++-----++ ++-----++ | ||||
| | | | | | ||||
| | +----------+ | | ||||
| | +--------+ | | | ||||
| | | | | | ||||
| ++-----++ +--+---++ | ||||
| |SPINE1 | |SPINE2 | | ||||
| ++-----++ ++-----++ | ||||
| | | | | | ||||
| | +----------+ | | ||||
| | +--------+ | | | ||||
| | | | | | ||||
| ++-----++ +--+---++ | ||||
| | LEAF1 | | LEAF2 | | ||||
| +---+---+ +-------+ | ||||
| | | ||||
| |NMS | ||||
| 4.7.2. Reachability of Spines | Figure 9: In-Band reachability of node | |||
| TODO | If the NMS wants to access LEAF2, it simply works, because the loopback | |||
| address of LEAF2 is flooded in its Prefix North TIE. | ||||
| 4.8. Dual Homing Servers | If the NMS wants to access SPINE2, it simply works too, because a spine | |||
| node always advertises its loopback address in the Prefix North TIE. | ||||
| The NMS may reach SPINE2 via LEAF1-SPINE2 or via LEAF1-SPINE1-ToF1/ | ||||
| ToF2-SPINE2. | ||||
| If the NMS wants to access ToF2, ToF2's loopback address needs to be | ||||
| injected into its Prefix South TIE. Otherwise, the traffic from the NMS | ||||
| may be sent to ToF1. | ||||
| In case of failure between ToF2 and the spine nodes, ToF2's loopback | ||||
| address must be sent all the way down to the leaves. | ||||
| 4.10. Dual Homing Servers | ||||
| Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | Each RIFT node may operate in zero touch provisioning (ZTP) mode. It | |||
| has no configuration (unless it is a Top-of-Fabric at the top of the | has no configuration (unless it is a Top-of-Fabric at the top of the | |||
| topology or it must operate in the topology as leaf and/or support | topology or it must operate in the topology as leaf and/or support | |||
| leaf-2-leaf procedures) and it will fully configure itself after | leaf-2-leaf procedures) and it will fully configure itself after | |||
| being attached to the topology. | being attached to the topology. | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| |ToF| |ToF| |ToF| | |ToF| |ToF| |ToF| | |||
| +---+ +---+ +---+ | +---+ +---+ +---+ | |||
| skipping to change at page 18, line 28 ¶ | skipping to change at page 23, line 28 ¶ | |||
| | +-----------------+ | | | | | +-----------------+ | | | | |||
| | | | +-------------+ | | | | | | +-------------+ | | | |||
| + | + | | |-----------------+ | | + | + | | |-----------------+ | | |||
| X | X | +--------x-----+ | X | | X | X | +--------x-----+ | X | | |||
| + | + | | | + | | + | + | | | + | | |||
| +---+ +---+ +---+ +---+ | +---+ +---+ +---+ +---+ | |||
| | | | | | | | | | | | | | | | | | | |||
| +---+ +---+ ...............+---+ +---+ | +---+ +---+ ...............+---+ +---+ | |||
| SV(1) SV(2) SV(n+1) SV(n) | SV(1) SV(2) SV(n+1) SV(n) | |||
| Figure 9: Dual-homing servers | Figure 10: Dual-homing servers | |||
| In the single plane, the worst condition is disaggregation of every | In the single plane, the worst condition is disaggregation of every | |||
| other server at the same level. Suppose the links from ToR1 to all | other server at the same level. Suppose the links from ToR1 to all | |||
| the leaves become unavailable. All the servers' routes are | the leaves become unavailable. All the servers' routes are | |||
| disaggregated and the FIB of the servers will be expanded with n-1 | disaggregated and the FIB of the servers will be expanded with n-1 | |||
| more specific routes. | more specific routes. | |||
| Sometimes, people may prefer to disaggregate from ToR to servers | Sometimes, people may prefer to disaggregate from ToR to servers | |||
| from the start, i.e., the servers have a couple of tens of routes in the FIB | from the start, i.e., the servers have a couple of tens of routes in the FIB | |||
| from the start besides default routes to avoid breakages at rack level. | from the start besides default routes to avoid breakages at rack level. | |||
| Full disaggregation of the fabric could be achieved by configuration | Full disaggregation of the fabric could be achieved by configuration | |||
| supported by RIFT. | supported by RIFT. | |||
| 4.9. Fabric With A Controller | 4.11. Fabric With A Controller | |||
| There are many different ways to deploy the controller. One | There are many different ways to deploy the controller. One | |||
| possibility is attaching a controller to the RIFT domain from ToF and | possibility is attaching a controller to the RIFT domain from ToF and | |||
| another possibility is attaching a controller from the leaf. | another possibility is attaching a controller from the leaf. | |||
| +------------+ | +------------+ | |||
| | Controller | | | Controller | | |||
| ++----------++ | ++----------++ | |||
| | | | | | | |||
| | | | | | | |||
| +----++ ++----+ | +----++ ++----+ | |||
| ---------- | ToF | | ToF | | ------- | ToF | | ToF | | |||
| | +--+--+ +-----+ | | +--+--+ +-----+ | |||
| | | | | | | | | | | | | |||
| | | +-------------+ | | | | +-------------+ | | |||
| | | +--------+ | | | | | +--------+ | | | |||
| | | | | | | | | | | | | |||
| +-----+ +-+---+ | +-----+ +-+---+ | |||
| RIFT domain |SPINE| |SPINE| | RIFT domain |SPINE| |SPINE| | |||
| +--+--+ +-----+ | +--+--+ +-----+ | |||
| | | | | | | | | | | | | |||
| | | +-------------+ | | | | +-------------+ | | |||
| | | +--------+ | | | | | +--------+ | | | |||
| | | | | | | | | | | | | |||
| | +-----+ +-+---+ | | +-----+ +-+---+ | |||
| ---------- |LEAF | | LEAF| | ------- |LEAF | | LEAF| | |||
| +-----+ +-----+ | +-----+ +-----+ | |||
| Figure 10: Fabric with a controller | Figure 11: Fabric with a controller | |||
| 4.9.1. Controller Attached to ToFs | 4.11.1. Controller Attached to ToFs | |||
| If a controller is attaching to the RIFT domain from ToF, it usually | If a controller is attaching to the RIFT domain from ToF, it usually | |||
| uses dual-homing connections. The loopback prefix of the controller | uses dual-homing connections. The loopback prefix of the controller | |||
| should be advertised down by the ToF and spine to leaves. If the | should be advertised down by the ToF and spine to leaves. If the | |||
| controller loses its link to a ToF, make sure the ToF withdraws the prefix | controller loses its link to a ToF, make sure the ToF withdraws the prefix | |||
| of the controller (different mechanisms can be used). | of the controller (different mechanisms can be used). | |||
| 4.9.2. Controller Attached to Leaf | 4.11.2. Controller Attached to Leaf | |||
| If the controller is attaching from a leaf to the fabric, no special | If the controller is attaching from a leaf to the fabric, no special | |||
| provisions are needed. | provisions are needed. | |||
| 4.10. Internet Connectivity Without Underlay | 4.12. Internet Connectivity With Underlay | |||
| 4.10.1. Internet Default on the Leafs | If global addressing is running without overlay, an external default | |||
| route needs to be advertised through the RIFT fabric to achieve internet | ||||
| connectivity. For the purpose of forwarding of the entire RIFT | ||||
| fabric, an internal fabric prefix needs to be advertised in the South | ||||
| Prefix TIE by ToF and spine nodes. | ||||
| TODO | 4.12.1. Internet Default on the Leaf | |||
| 4.10.2. Internet Default on the ToFs | In case that an internet access request comes from a leaf and the | |||
| internet gateway is another leaf, the leaf node as the internet | ||||
| gateway needs to advertise a default route in its Prefix North TIE. | ||||
| TODO | 4.12.2. Internet Default on the ToFs | |||
| 4.11. Subnet Mismatch and Address Families | In case that an internet access request comes from a leaf and the | |||
| internet gateway is a ToF, the ToF and spine nodes need to advertise | ||||
| a default route in the Prefix South TIE. | ||||
| 4.13. Subnet Mismatch and Address Families | ||||
| +--------+ +--------+ | +--------+ +--------+ | |||
| | | LIE LIE | | | | | LIE LIE | | | |||
| | A | +----> <----+ | B | | | A | +----> <----+ | B | | |||
| | +---------------------+ | | | +---------------------+ | | |||
| +--------+ +--------+ | +--------+ +--------+ | |||
| X/24 Y/24 | X/24 Y/24 | |||
| Figure 11: subnet mismatch | Figure 12: subnet mismatch | |||
| LIEs are exchanged over all links running RIFT to perform Link | LIEs are exchanged over all links running RIFT to perform Link | |||
| (Neighbor) Discovery. A node MUST NOT originate LIEs on an address | (Neighbor) Discovery. A node MUST NOT originate LIEs on an address | |||
| family if it does not process received LIEs on that family. LIEs on | family if it does not process received LIEs on that family. LIEs on | |||
| same link are considered part of the same negotiation independent of | same link are considered part of the same negotiation independent of | |||
| the address family they arrive on. An implementation MUST be ready | the address family they arrive on. An implementation MUST be ready | |||
| to accept TIEs on all addresses it used as source of LIE frames. | to accept TIEs on all addresses it used as source of LIE frames. | |||
| As shown in the above figure, without further checks adjacency of | As shown in the above figure, without further checks adjacency of | |||
| node A and B may form, but the forwarding between node A and node B | node A and B may form, but the forwarding between node A and node B | |||
| may fail because subnet X mismatches with subnet Y. | may fail because subnet X mismatches with subnet Y. | |||
| To prevent this, a RIFT implementation should check for subnet | To prevent this, a RIFT implementation should check for subnet | |||
| mismatch just like, e.g., ISIS does. This can lead to scenarios where | mismatch just like, e.g., ISIS does. This can lead to scenarios where | |||
| an adjacency, despite the exchange of LIEs in both address families, may | an adjacency, despite the exchange of LIEs in both address families, may | |||
| end up being formed in a single AF only. This is a | end up being formed in a single AF only. This is a | |||
| consideration especially in Section 4.6 scenarios. | consideration especially in Section 4.8 scenarios. | |||
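A subnet mismatch check of this kind can be sketched with the Python standard library. The sketch assumes that a node knows its own interface address and prefix length as well as those of its neighbor; how that information is learned is outside the scope of this example, and the function name same_subnet() and the addresses are hypothetical.

```python
import ipaddress

def same_subnet(local_if, remote_if):
    """Return True if both interface addresses are in the same subnet
    of the same address family."""
    a = ipaddress.ip_interface(local_if)
    b = ipaddress.ip_interface(remote_if)
    return a.version == b.version and a.network == b.network

# Node A and node B configured consistently (subnet X on both ends).
matching = same_subnet("192.0.2.1/24", "192.0.2.2/24")

# Node A in subnet X, node B in subnet Y: forwarding may fail.
mismatching = same_subnet("192.0.2.1/24", "198.51.100.2/24")
```

Running the check per address family reproduces the situation described above: the IPv6 check may pass while the IPv4 check fails, leaving an adjacency in a single AF only.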
| 4.12. Anycast Considerations | 4.14. Anycast Considerations | |||
| + traffic | + traffic | |||
| | | | | |||
| v | v | |||
| +------+------+ | +------+------+ | |||
| | ToF | | | ToF | | |||
| +---+-----+---+ | +---+-----+---+ | |||
| | | | | | | | | | | |||
| +------------+ | | +------------+ | +------------+ | | +------------+ | |||
| | | | | | | | | | | |||
| +---+---+ +-------+ +-------+ +---+---+ | +---+---+ +-------+ +-------+ +---+---+ | |||
| skipping to change at page 21, line 31 ¶ | skipping to change at page 26, line 31 ¶ | |||
| | | | | | | | | | | | | | | | | | | |||
| +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ | |||
| | | | | | | | | | | | | | | | | | | |||
| |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 | |||
| +-+-----+ ++------+ +-----+-+ +-----+-+ | +-+-----+ ++------+ +-----+-+ +-----+-+ | |||
| + + + ^ | | + + + ^ | | |||
| PrefixA PrefixB PrefixA | PrefixC | PrefixA PrefixB PrefixA | PrefixC | |||
| | | | | |||
| + traffic | + traffic | |||
| Figure 12: Anycast | Figure 13: Anycast | |||
| If the traffic comes from a ToF to Leaf111 or Leaf121, which both have the anycast | If the traffic comes from a ToF to Leaf111 or Leaf121, which both have the anycast | |||
| prefix PrefixA, RIFT can deal with this case well. But if the | prefix PrefixA, RIFT can deal with this case well. But if the | |||
| traffic comes from Leaf122, it will always get to Leaf121 and never | traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1. | |||
| get to Leaf111. If the intention is that the traffic should be | But Spine21 and Spine22 do not know that another PrefixA is attached to | |||
| offloaded to Leaf111, then use policy guided prefixes [PGP | Leaf111. So it will always get to Leaf121 and never get to Leaf111. | |||
| reference]. | If the intention is that the traffic should be offloaded to | |||
| Leaf111, then use policy guided prefixes [PGP reference]. | ||||
| 5. Acknowledgements | 5. Acknowledgements | |||
| 6. Contributors | 6. Contributors | |||
| The following people (listed in alphabetical order) contributed | The following people (listed in alphabetical order) contributed | |||
| significantly to the content of this document and should be | significantly to the content of this document and should be | |||
| considered co-authors: | considered co-authors: | |||
| Tony Przygienda | Tony Przygienda | |||
| Juniper Networks | Juniper Networks | |||
| 1194 N. Mathilda Ave | 1194 N. Mathilda Ave | |||
| Sunnyvale, CA 94089 | Sunnyvale, CA 94089 | |||
| US | US | |||
| Email: prz@juniper.net | Email: prz@juniper.net | |||
| 7. Normative References | 7. Normative References | |||
| [I-D.ietf-rift-rift] | ||||
| Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and | ||||
| D. Afanasiev, "RIFT: Routing in Fat Trees", draft-ietf- | ||||
| rift-rift-10 (work in progress), January 2020. | ||||
| [I-D.white-distoptflood] | ||||
| White, R., Hegde, S., and S. Zandi, "IS-IS Optimal | ||||
| Distributed Flooding for Dense Topologies", draft-white- | ||||
| distoptflood-01 (work in progress), September 2019. | ||||
| [ISO10589-Second-Edition] | [ISO10589-Second-Edition] | |||
| International Organization for Standardization, | International Organization for Standardization, | |||
| "Intermediate system to Intermediate system intra-domain | "Intermediate system to Intermediate system intra-domain | |||
| routeing information exchange protocol for use in | routeing information exchange protocol for use in | |||
| conjunction with the protocol for providing the | conjunction with the protocol for providing the | |||
| connectionless-mode Network Service (ISO 8473)", Nov 2002. | connectionless-mode Network Service (ISO 8473)", November | |||
| 2002. | ||||
| [TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central | ||||
| Office Reference Architectural Framework", January 2018. | ||||
| [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, | [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, | |||
| DOI 10.17487/RFC2328, April 1998, | DOI 10.17487/RFC2328, April 1998, | |||
| <https://www.rfc-editor.org/info/rfc2328>. | <https://www.rfc-editor.org/info/rfc2328>. | |||
| [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, | [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, | |||
| "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, | "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, | |||
| DOI 10.17487/RFC4861, September 2007, | DOI 10.17487/RFC4861, September 2007, | |||
| <https://www.rfc-editor.org/info/rfc4861>. | <https://www.rfc-editor.org/info/rfc4861>. | |||
| skipping to change at page 23, line 5 ¶ | skipping to change at page 27, line 47 ¶ | |||
| Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", | Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", | |||
| RFC 5357, DOI 10.17487/RFC5357, October 2008, | RFC 5357, DOI 10.17487/RFC5357, October 2008, | |||
| <https://www.rfc-editor.org/info/rfc5357>. | <https://www.rfc-editor.org/info/rfc5357>. | |||
| [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., | [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., | |||
| Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional | Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional | |||
| Forwarding Detection (BFD) on Link Aggregation Group (LAG) | Forwarding Detection (BFD) on Link Aggregation Group (LAG) | |||
| Interfaces", RFC 7130, DOI 10.17487/RFC7130, February | Interfaces", RFC 7130, DOI 10.17487/RFC7130, February | |||
| 2014, <https://www.rfc-editor.org/info/rfc7130>. | 2014, <https://www.rfc-editor.org/info/rfc7130>. | |||
| [TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central | [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and | |||
| Office Reference Architectural Framework", Jan 2018. | D. Afanasiev, "RIFT: Routing in Fat Trees", Work in | |||
| Progress, Internet-Draft, draft-ietf-rift-rift-11, 10 | ||||
| March 2020, | ||||
| <https://tools.ietf.org/html/draft-ietf-rift-rift-11>. | ||||
| [I-D.white-distoptflood] | ||||
| White, R., Hegde, S., and S. Zandi, "IS-IS Optimal | ||||
| Distributed Flooding for Dense Topologies", Work in | ||||
| Progress, Internet-Draft, draft-white-distoptflood-01, 30 | ||||
| September 2019, | ||||
| <https://tools.ietf.org/html/draft-white-distoptflood-01>. | ||||
| 8. Informative References | ||||
| [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, | ||||
| "Network Time Protocol Version 4: Protocol and Algorithms | ||||
| Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, | ||||
| <https://www.rfc-editor.org/info/rfc5905>. | ||||
| [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 | ||||
| (IPv6) Specification", STD 86, RFC 8200, | ||||
| DOI 10.17487/RFC8200, July 2017, | ||||
| <https://www.rfc-editor.org/info/rfc8200>. | ||||
| [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. | ||||
| Perkins, "Registration Extensions for IPv6 over Low-Power | ||||
| Wireless Personal Area Network (6LoWPAN) Neighbor | ||||
| Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, | ||||
| <https://www.rfc-editor.org/info/rfc8505>. | ||||
| Authors' Addresses | Authors' Addresses | |||
| Yuehua Wei | Yuehua Wei (editor) | |||
| ZTE Corporation | ZTE Corporation | |||
| No.50, Software Avenue | No.50, Software Avenue | |||
| Nanjing 210012 | Nanjing | |||
| P. R. China | 210012 | |||
| China | ||||
| Email: wei.yuehua@zte.com.cn | Email: wei.yuehua@zte.com.cn | |||
| Zheng Zhang | Zheng Zhang | |||
| ZTE Corporation | ZTE Corporation | |||
| No.50, Software Avenue | No.50, Software Avenue | |||
| Nanjing 210012 | Nanjing | |||
| P. R. China | 210012 | |||
| China | ||||
| Email: zzhang_ietf@hotmail.com | Email: zzhang_ietf@hotmail.com | |||
| Dmitry Afanasiev | Dmitry Afanasiev | |||
| Yandex | Yandex | |||
| Email: fl0w@yandex-team.ru | Email: fl0w@yandex-team.ru | |||
| Tom Verhaeg | Tom Verhaeg | |||
| Interconnect Services B.V. | Juniper Networks | |||
| Email: t.verhaeg@interconnect.nl | Email: tverhaeg@juniper.net | |||
| Jaroslaw Kowalczyk | Jaroslaw Kowalczyk | |||
| Orange Polska | Orange Polska | |||
| Email: jaroslaw.kowalczyk2@orange.com | Email: jaroslaw.kowalczyk2@orange.com | |||
| Pascal Thubert | ||||
| Cisco Systems, Inc | ||||
| Building D | ||||
| 45 Allee des Ormes - BP1200 | ||||
| 06254 MOUGINS - Sophia Antipolis | ||||
| France | ||||
| Phone: +33 497 23 26 34 | ||||
| Email: pthubert@cisco.com | ||||
| End of changes. 86 change blocks. | ||||
| 164 lines changed or deleted | 437 lines changed or added | |||