< draft-ietf-rift-applicability-07.txt   draft-ietf-rift-applicability-08.txt >
RIFT WG Yuehua. Wei, Ed. RIFT WG Yuehua. Wei, Ed.
Internet-Draft Zheng. Zhang Internet-Draft Zheng. Zhang
Intended status: Informational ZTE Corporation Intended status: Informational ZTE Corporation
Expires: 21 March 2022 Dmitry. Afanasiev Expires: 11 May 2022 Dmitry. Afanasiev
Yandex Yandex
P. Thubert P. Thubert
Cisco Systems Cisco Systems
Jaroslaw. Kowalczyk Jaroslaw. Kowalczyk
Orange Polska Orange Polska
17 September 2021 7 November 2021
RIFT Applicability RIFT Applicability
draft-ietf-rift-applicability-07 draft-ietf-rift-applicability-08
Abstract Abstract
This document discusses the properties, applicability and operational This document discusses the properties, applicability and operational
considerations of RIFT in different network scenarios. It intends to considerations of RIFT in different network scenarios. It intends to
provide a rough guide how RIFT can be deployed to simplify routing provide a rough guide how RIFT can be deployed to simplify routing
operations in Clos topologies and their variations. operations in Clos topologies and their variations.
Status of This Memo Status of This Memo
skipping to change at page 1, line 39 skipping to change at page 1, line 39
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on 21 March 2022. This Internet-Draft will expire on 11 May 2022.
Copyright Notice Copyright Notice
Copyright (c) 2021 IETF Trust and the persons identified as the Copyright (c) 2021 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/ Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document. license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights Please review these documents carefully, as they describe your rights
skipping to change at page 2, line 36 skipping to change at page 2, line 36
4.3.4. Internal Router Switching Fabrics . . . . . . . . . . 13 4.3.4. Internal Router Switching Fabrics . . . . . . . . . . 13
4.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 13 4.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 13
5. Operational Considerations . . . . . . . . . . . . . . . . . 15 5. Operational Considerations . . . . . . . . . . . . . . . . . 15
5.1. South Reflection . . . . . . . . . . . . . . . . . . . . 16 5.1. South Reflection . . . . . . . . . . . . . . . . . . . . 16
5.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 16 5.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 16
5.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 18 5.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 18
5.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 19 5.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 19
5.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 20 5.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 20
5.6. Positive vs. Negative Disaggregation . . . . . . . . . . 22 5.6. Positive vs. Negative Disaggregation . . . . . . . . . . 22
5.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 24 5.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 24
5.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 25 5.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 26
5.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 26 5.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 26
5.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 28 5.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 28
5.11. Fabric With A Controller . . . . . . . . . . . . . . . . 29 5.11. Fabric With A Controller . . . . . . . . . . . . . . . . 28
5.11.1. Controller Attached to ToFs . . . . . . . . . . . . 29 5.11.1. Controller Attached to ToFs . . . . . . . . . . . . 29
5.11.2. Controller Attached to Leaf . . . . . . . . . . . . 29 5.11.2. Controller Attached to Leaf . . . . . . . . . . . . 29
5.12. Internet Connectivity With Underlay . . . . . . . . . . . 30 5.12. Internet Connectivity Within Underlay . . . . . . . . . . 29
5.12.1. Internet Default on the Leaf . . . . . . . . . . . . 30 5.12.1. Internet Default on the Leaf . . . . . . . . . . . . 30
5.12.2. Internet Default on the ToFs . . . . . . . . . . . . 30 5.12.2. Internet Default on the ToFs . . . . . . . . . . . . 30
5.13. Subnet Mismatch and Address Families . . . . . . . . . . 30 5.13. Subnet Mismatch and Address Families . . . . . . . . . . 30
5.14. Anycast Considerations . . . . . . . . . . . . . . . . . 31 5.14. Anycast Considerations . . . . . . . . . . . . . . . . . 30
5.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 32 5.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 31
5.16. Key Management . . . . . . . . . . . . . . . . . . . . . 32 5.16. Key Management . . . . . . . . . . . . . . . . . . . . . 32
6. Security Considerations . . . . . . . . . . . . . . . . . . . 33 6. Security Considerations . . . . . . . . . . . . . . . . . . . 32
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 33 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 33
8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 33 8. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 33
9. Normative References . . . . . . . . . . . . . . . . . . . . 33 9. Normative References . . . . . . . . . . . . . . . . . . . . 33
10. Informative References . . . . . . . . . . . . . . . . . . . 35 10. Informative References . . . . . . . . . . . . . . . . . . . 35
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 36 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 36
1. Introduction 1. Introduction
This document discusses the properties and applicability of "Routing This document discusses the properties and applicability of "Routing
in Fat Trees" [RIFT] in different deployment scenarios and highlights in Fat Trees" [RIFT] in different deployment scenarios and highlights
skipping to change at page 3, line 47 skipping to change at page 3, line 47
or negatively to repel it. Disaggregation is performed to prevent or negatively to repel it. Disaggregation is performed to prevent
black-holing and suboptimal routing to the more specific prefixes. black-holing and suboptimal routing to the more specific prefixes.
TIE: TIE:
This is an acronym for a "Topology Information Element". TIEs are This is an acronym for a "Topology Information Element". TIEs are
exchanged between RIFT nodes to describe parts of a network such as exchanged between RIFT nodes to describe parts of a network such as
links and address prefixes. A TIE always has a direction and a type. links and address prefixes. A TIE always has a direction and a type.
North TIEs (sometimes abbreviated as N-TIEs) are used when dealing North TIEs (sometimes abbreviated as N-TIEs) are used when dealing
with TIEs in the northbound representation and South-TIEs (sometimes with TIEs in the northbound representation and South-TIEs (sometimes
abbreviated as S- TIEs) for the southbound equivalent. TIEs have abbreviated as S-TIEs) for the southbound equivalent. TIEs have
different types such as node and prefix TIEs. different types such as node and prefix TIEs.
Node TIE: Node TIE:
This stands as an acronym for a "Node Topology Information Element", This stands as an acronym for a "Node Topology Information Element",
which contains all adjacencies the node discovered and information which contains all adjacencies the node discovered and information
about the node itself. Node TIE should NOT be confused with a North about the node itself. Node TIE should NOT be confused with a North
TIE since "node" defines the type of TIE rather than its direction. TIE since "node" defines the type of TIE rather than its direction.
Consequently North Node TIEs and South Node TIEs exist. Consequently North Node TIEs and South Node TIEs exist.
Prefix TIE: Prefix TIE:
This is an acronym for a "Prefix Topology Information Element" and it This is an acronym for a "Prefix Topology Information Element" and it
contains all prefixes directly attached to this node in case of a contains all prefixes directly attached to this node in case of a
North TIE and in case of South TIE the necessary default routes the North TIE and in case of South TIE the necessary default routes and
node advertises southbound. disaggregated routes the node advertises southbound.
South Reflection: South Reflection:
Often abbreviated just as "reflection", it defines a mechanism where Often abbreviated just as "reflection", it defines a mechanism where
South Node TIEs are "reflected" from the level south back up north to South Node TIEs are "reflected" from the level south back up north to
allow nodes in the same level without East-West links to "see" each allow nodes in the same level without East-West links to "see" each
other's node Topology Information Elements (TIEs). other's node Topology Information Elements (TIEs).
LIE: LIE:
This is an acronym for a "Link Information Element" exchanged on all This is an acronym for a "Link Information Element" exchanged on all
the system's links running RIFT to form ThreeWay adjacencies and the system's links running RIFT to form ThreeWay adjacencies and
carry information used to perform Zero Touch Provisioning (ZTP) of carry information used to perform Zero Touch Provisioning (ZTP) of
levels. levels.
Shortest-Path First (SPF): Shortest-Path First (SPF):
A well-known graph algorithm attributed to Dijkstra that establishes A well-known graph algorithm attributed to Dijkstra that establishes
a tree of shortest paths from a source to destinations on the graph. a tree of shortest paths from a source to destinations on the graph.
SPF acronym is used due to its familiarity as general term for the The SPF acronym is used, owing to its familiarity, as a general term
node reachability calculations. RIFT can employ to ultimately for the node reachability calculations that RIFT can employ to
calculate routes of which Dijkstra algorithm is a possible one. ultimately calculate routes; Dijkstra's algorithm is one possibility.
North SPF (N-SPF): North SPF (N-SPF):
A reachability calculation that progresses northbound, for example an A reachability calculation that progresses northbound, for example an
SPF that uses South Node TIEs only. Normally it progresses a SPF that uses South Node TIEs only. Normally it progresses a
single hop only and installs default routes. single hop only and installs default routes.
South SPF (S-SPF): South SPF (S-SPF):
A reachability calculation that progresses southbound, for example an A reachability calculation that progresses southbound, for example an
SPF that uses North Node TIEs only. SPF that uses North Node TIEs only.
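The N-SPF/S-SPF distinction can be made concrete with a Dijkstra run that relaxes only links pointing in one direction. The following Python sketch is a simplification under stated assumptions:

* The adjacency map and the `level` numbers (higher means further north) are hypothetical.
* The name `directional_spf` is hypothetical.
* The sketch works on a plain adjacency list, not on South or North Node TIEs.
* It does not stop after one hop, whereas an N-SPF normally does.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical fabric model: node -> [(neighbor, link cost)], plus
# node -> level, where a higher level is further north.
Adjacency = Dict[str, List[Tuple[str, int]]]


def directional_spf(adj: Adjacency, level: Dict[str, int],
                    source: str, direction: str) -> Dict[str, int]:
    """Dijkstra that only relaxes links pointing in `direction`.

    "south" follows links to strictly lower levels (S-SPF-like),
    "north" follows links to strictly higher levels (N-SPF-like).
    Links between nodes on the same level (east-west) are ignored.
    """
    if direction not in ("north", "south"):
        raise ValueError("direction must be 'north' or 'south'")
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry: a shorter path was already found
        for nbr, cost in adj.get(node, []):
            if direction == "north" and level[nbr] <= level[node]:
                continue
            if direction == "south" and level[nbr] >= level[node]:
                continue
            candidate = d + cost
            if candidate < dist.get(nbr, float("inf")):
                dist[nbr] = candidate
                heapq.heappush(heap, (candidate, nbr))
    return dist


if __name__ == "__main__":
    level = {"tof": 2, "s1": 1, "s2": 1, "l1": 0, "l2": 0}
    adj = {
        "tof": [("s1", 1), ("s2", 1)],
        "s1": [("tof", 1), ("l1", 1), ("l2", 1)],
        "s2": [("tof", 1), ("l2", 1)],
        "l1": [("s1", 1)],
        "l2": [("s1", 1), ("s2", 1)],
    }
    print(directional_spf(adj, level, "tof", "south"))
    print(directional_spf(adj, level, "l1", "north"))
```

Run southbound from the ToF, the sketch reaches both leaves at cost 2. Run northbound from leaf l1, it reaches only s1 and the ToF, because southbound and same-level links are never relaxed.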
3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks 3. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks
Clos [CLOS] topologies (called commonly a fat tree/network in modern Clos [CLOS] topologies (called commonly a fat tree/network in modern
IP fabric considerations as homonym to the original definition of the IP fabric considerations as homonym to the original definition of the
term Fat Tree [FATTREE])have gained prominence in today's networking, term Fat Tree [FATTREE]) have gained prominence in today's
primarily as a result of the paradigm shift towards a centralized networking, primarily as a result of the paradigm shift towards a
data-center based architecture that delivers a majority of computation centralized data-center based architecture that delivers a majority of
and storage services. computation and storage services.
Today's current routing protocols were geared towards a network with Current routing protocols were geared towards a network with an
an irregular topology with isotropic properties, and low degree of irregular topology with isotropic properties, and low degree of
connectivity. When applied to Fat Tree topologies: connectivity. When applied to Fat Tree topologies:
* They tend to need extensive configuration or provisioning during * They tend to need extensive configuration or provisioning during
bring up and re-dimensioning. bring up and adding or removing nodes from the fabric.
* All nodes including spine and leaf nodes learn the entire network * All nodes including spine and leaf nodes learn the entire network
topology and routing information, which is, in fact, not needed on topology and routing information, which is, in fact, not needed on
the leaf nodes during normal operation. the leaf nodes during normal operation.
* They flood significant amounts of duplicate link state information * They flood significant amounts of duplicate link state information
between spine and leaf nodes during topology updates and between spine and leaf nodes during topology updates and
convergence events, requiring that additional CPU and link convergence events, requiring that additional CPU and link
bandwidth be consumed. This may impact the stability and bandwidth be consumed. This may impact the stability and
scalability of the fabric, make the fabric less reactive to scalability of the fabric, make the fabric less reactive to
skipping to change at page 6, line 25 skipping to change at page 6, line 25
information is never flooded east-west or back south again. So a top information is never flooded east-west or back south again. So a top
tier node has a full set of prefixes from the Shortest Path First (SPF) tier node has a full set of prefixes from the Shortest Path First (SPF)
calculation. calculation.
In the southbound direction, the protocol operates like a "fully In the southbound direction, the protocol operates like a "fully
summarizing, unidirectional" path-vector protocol or rather a summarizing, unidirectional" path-vector protocol or rather a
distance-vector with implicit split horizon. Routing information, distance-vector with implicit split horizon. Routing information,
normally just the default route, propagates one hop south and is "re- normally just the default route, propagates one hop south and is "re-
advertised" by nodes at next lower level. advertised" by nodes at next lower level.
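The one-hop southbound propagation of the default route can be sketched as follows. This is a hypothetical simplification, not the protocol's actual computation. The model works level by level:

1. Nodes at the highest level originate the default route.
2. Every lower node installs it toward the northbound neighbors that advertise it.
3. A lower node re-advertises it one hop further south only if it has at least one such neighbor.

The names `default_next_hops` and `north_nbrs` are illustrative.

```python
from typing import Dict, List, Set


def default_next_hops(level: Dict[str, int],
                      north_nbrs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each node to the northbound neighbors it uses for the default route.

    Simplified model (assumption): the nodes at the highest level originate
    the default route southbound; a lower node re-advertises it one hop
    further south only if at least one northbound neighbor advertised it.
    """
    top = max(level.values())
    advertising: Set[str] = set()
    next_hops: Dict[str, List[str]] = {}
    # Walk the levels from north to south so each level sees the one above.
    for lvl in sorted(set(level.values()), reverse=True):
        for node in sorted(n for n, lv in level.items() if lv == lvl):
            hops = [n for n in north_nbrs.get(node, []) if n in advertising]
            if hops:
                next_hops[node] = hops
            if lvl == top or hops:
                advertising.add(node)
    return next_hops


if __name__ == "__main__":
    level = {"tof": 2, "s1": 1, "s2": 1, "l1": 0, "l2": 0}
    # s2 has lost its only northbound link.
    north_nbrs = {"s1": ["tof"], "s2": [], "l1": ["s1", "s2"], "l2": ["s2"]}
    print(default_next_hops(level, north_nbrs))
```

In this example, s2 has no northbound adjacency, so it does not re-advertise the default. As a result, l1 uses only s1, and l2, which is attached only to s2, ends up with no default route at all.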
+-----------+ +-----------+ +---------------+ +----------------+
| ToF | | ToF | LEVEL 2 | ToF | | ToF | LEVEL 2
+ +-----+--+--+ +-+--+------+ + ++------+--+--+-+ ++-+--+----+-----+
| | | | | | | | | ^ | | | | | | | | | ^
+ | | | +-------------------------+ | + | | | +-------------------------+ |
Distance | +-------------------+ | | | | | Distance | +-------------------+ | | | | |
Vector | | | | | | | | + Vector | | | | | | | | +
South | | | | +--------+ | | | Link-State South | | | | +--------+ | | | Link+State
+ | | | | | | | | Flooding + | | | | | | | | Flooding
| | | +-------------+ | | | North | | | +----------------+ | | | North
v | | | | | | | | + v | | | | | | | | +
+-+--+-+ +------+ +-------+ +--+--+-+ | ++---+-+ +------+ +-+----+ ++----++ |
|SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 |SPINE | |SPINE | | SPINE| | SPINE| | LEVEL 1
+ ++----++ ++---+-+ +--+--+-+ ++----+-+ | + ++----++ ++---+-+ +-+--+-+ ++----++ |
+ | | | | | | | | | ^ N + | | | | | | | | | ^ N
Distance | +-------+ | | +--------+ | | | E Distance | +-------+ | | +--------+ | | | E
Vector | | | | | | | | | +------> Vector | | | | | | | | | +------>
South | +-------+ | | | +-------+ | | | | South | +-------+ | | | +------+ | | | |
+ | | | | | | | | | + + | | | | | | | | | +
v ++--++ +-+-++ ++-+-+ +-+--++ + v ++--++ +-+-++ ++--++ ++--++ +
|LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 |LEAF| |LEAF| |LEAF| |LEAF| LEVEL 0
+----+ +----+ +----+ +-----+ +----+ +----+ +----+ +----+
Figure 1: RIFT overview Figure 1: RIFT overview
A spine node has only information necessary for its level, which is A spine node has only information necessary for its level, which is
all destinations south of the node based on SPF calculation, default all destinations south of the node based on SPF calculation, default
route, and potential disaggregated routes. route, and potential disaggregated routes.
RIFT combines the advantages of both link-state and distance-vector: RIFT combines the advantages of both link-state and distance-vector:
* Fastest possible convergence * Fastest possible convergence
skipping to change at page 7, line 23 skipping to change at page 7, line 27
* High degree of ECMP * High degree of ECMP
* Fast de-commissioning of nodes * Fast de-commissioning of nodes
* Maximum propagation speed with flexible prefixes in an update * Maximum propagation speed with flexible prefixes in an update
So there are two types of link-state database which are "north So there are two types of link-state database which are "north
representation" North Topology Information Elements (N-TIEs) and representation" North Topology Information Elements (N-TIEs) and
"south representation" South Topology Information Elements (S-TIEs). "south representation" South Topology Information Elements (S-TIEs).
The N-TIEs contain a link-state topology description of lower levels The N-TIEs contain a link-state topology description of lower levels
and S-TIEs carry only default routes for the lower levels. and S-TIEs carry only default and disaggregated routes for the
lower levels.
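A TIE's direction and its type are independent. As a result, a North Prefix TIE and a South Prefix TIE differ only in direction, yet they carry different content. The data model below is a hypothetical illustration: the class and field names are assumptions, not the RIFT schema. It also assumes an IPv4-only fabric, so the default route is `0.0.0.0/0`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Direction(Enum):
    NORTH = "North"
    SOUTH = "South"


class TieType(Enum):
    NODE = "Node"
    PREFIX = "Prefix"


@dataclass(frozen=True)
class Tie:
    """Hypothetical, simplified TIE: direction and type are independent."""
    direction: Direction
    tie_type: TieType
    originator: str
    payload: Tuple[str, ...]  # adjacencies (node TIE) or prefixes (prefix TIE)

    def name(self) -> str:
        return f"{self.direction.value} {self.tie_type.value} TIE"


def prefix_tie(direction: Direction, originator: str,
               attached: List[str], disaggregated: List[str]) -> Tie:
    """Build a prefix TIE whose content depends on its direction.

    North: the prefixes directly attached to the originator.
    South: a default route (IPv4-only assumption) plus disaggregated prefixes.
    """
    if direction is Direction.NORTH:
        payload = tuple(sorted(attached))
    else:
        payload = ("0.0.0.0/0",) + tuple(sorted(disaggregated))
    return Tie(direction, TieType.PREFIX, originator, payload)


if __name__ == "__main__":
    north = prefix_tie(Direction.NORTH, "leaf1", ["10.0.1.0/24"], [])
    south = prefix_tie(Direction.SOUTH, "spine1", [], ["10.0.9.0/24"])
    print(north.name(), north.payload)
    print(south.name(), south.payload)
```

In this model, a North Node TIE would be `Tie(Direction.NORTH, TieType.NODE, ...)`, which mirrors the glossary: "node" names the type of the TIE, not its direction.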
RIFT also eliminates major disadvantages of link-state and distance- RIFT also eliminates major disadvantages of link-state and distance-
vector with: vector with:
* Reduced and balanced flooding * Reduced and balanced flooding
* Automatic neighbor detection * Level-constrained automatic neighbor detection
To achieve this, RIFT builds on the art of IGPs, not only OSPF and To achieve this, RIFT builds on the art of IGPs, not only OSPF and
IS-IS but also MANET and IoT, to provide unique features: IS-IS but also MANET and IoT, to provide unique features:
* Automatic (positive or negative) route disaggregation of * Automatic (positive or negative) route disaggregation of
northwards routes upon fallen leaves northwards routes upon fallen leaves
* Recursive operation in the case of negative route disaggregation * Recursive operation in the case of negative route disaggregation
* Anisotropic routing that extends a principle seen in RPL [RFC6550] * Anisotropic routing that extends a principle seen in RPL [RFC6550]
skipping to change at page 8, line 4 skipping to change at page 8, line 9
to wide superspines to wide superspines
* Optimal flooding reduction that derives from the concept of a * Optimal flooding reduction that derives from the concept of a
"multipoint relay" (MPR) found in OLSR [RFC3626] and balances the "multipoint relay" (MPR) found in OLSR [RFC3626] and balances the
flooding load over northbound links and nodes. flooding load over northbound links and nodes.
Additional advantages that are unique to RIFT are listed below, the Additional advantages that are unique to RIFT are listed below, the
details of which can be found in RIFT [RIFT]. details of which can be found in RIFT [RIFT].
* True ZTP (Zero Touch Provisioning) * True ZTP (Zero Touch Provisioning)
* Minimal blast radius on failures * Minimal blast radius on failures
* Can utilize all paths through fabric without looping * Can utilize all paths through fabric without looping
* Simple leaf implementation that can scale down to servers * Simple leaf implementation that can scale down to servers
* Key-Value store * Key-Value store
* Horizontal links used for protection only * Horizontal links used for protection only
* Supports non-equal cost multipath and can replace multi-chassis
link aggregation group (MLAG or MC-LAG)
4.2. Applicable Topologies 4.2. Applicable Topologies
Although RIFT is specified primarily for "proper" Clos or Fat Tree Although RIFT is specified primarily for "proper" Clos or Fat Tree
topologies, the protocol natively supports Points of Delivery (PoD) topologies, the protocol natively supports Points of Delivery (PoD)
concepts, which, strictly speaking, are not found in the original concepts, which, strictly speaking, are not found in the original
Clos concept. Clos concept.
Further, the specification explains and supports operations of multi- Further, the specification explains and supports operations of multi-
plane Clos variants where the protocol recommends the use of inter- plane Clos variants where the protocol recommends the use of inter-
plane rings at the Top-of-Fabric level to allow the reconciliation of plane rings at the Top-of-Fabric level to allow the reconciliation of
topology view of different planes to make the negative disaggregation topology view of different planes to make the negative disaggregation
viable in case of failures within a plane. These observations hold viable in case of failures within a plane. These observations hold
not only in case of RIFT but also in the generic case of dynamic not only in case of RIFT but also in the generic case of dynamic
routing on Clos variants with multiple planes and failures in routing on Clos variants with multiple planes and failures in
bisectional bandwidth, especially on the leaves. bisectional bandwidth, especially on the leaves.
4.2.1. Horizontal Links 4.2.1. Horizontal Links
RIFT is not limited to pure Clos divided into PoD and multi-planes RIFT is not limited to pure Clos divided into PoD and multi-planes
but supports horizontal (East-West) links below the top of fabric but supports horizontal (East-West) links below the top of fabric
level. Those links are used only for last resort northbound routes level. Those links are used only for last resort northbound
when a spine loses all its northbound links or cannot compute a forwarding when a spine loses all its northbound links or cannot
default route through them. compute a default route through them.
A possible configuration is a "ring" of horizontal links at a level.
In presence of such a "ring" in any level (except Top of Fabric (ToF)
level) neither North SPF (N-SPF) nor South SPF (S-SPF) will provide a
"ring-based protection" scheme since such a computation would have to
deal necessarily with breaking of "loops" in Dijkstra sense; an
application for which RIFT is not intended.
Full-mesh connectivity between nodes on the same level can be Full-mesh connectivity between nodes on the same level can be
employed and that allows N-SPF to provide for any node loosing all employed and that allows N-SPF to provide for any node losing all its
its northbound adjacencies (as long as any of the other nodes in the northbound adjacencies (as long as any of the other nodes in the
level are northbound connected) to still participate in northbound level are northbound connected) to still participate in northbound
forwarding. forwarding.
Note that a "ring" of horizontal links at any level below ToF does
not provide a "ring-based protection" scheme since the SPF
computation would necessarily have to deal with breaking "loops"
in the Dijkstra sense, an application for which RIFT is not intended.
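The last-resort use of horizontal links described above can be approximated by a simple next-hop selection rule. The sketch below is a simplification resting on assumptions, not the N-SPF itself:

* A node uses its own northbound adjacencies whenever it has any.
* Only when it has none does it fall back to same-level siblings that still have a northbound adjacency.

The function and argument names are hypothetical.

```python
from typing import Dict, List


def northbound_next_hops(node: str,
                         north_nbrs: Dict[str, List[str]],
                         east_west_nbrs: Dict[str, List[str]]) -> List[str]:
    """Next hops for northbound (default) forwarding from `node`.

    Horizontal links are used only as a last resort: when `node` has no
    northbound adjacency, siblings that still have one are eligible.
    """
    own = north_nbrs.get(node, [])
    if own:
        return list(own)  # normal case: horizontal links stay unused
    return [sib for sib in east_west_nbrs.get(node, [])
            if north_nbrs.get(sib)]


if __name__ == "__main__":
    # Three spines in a full mesh; s1 and s3 have lost their northbound links.
    north = {"s1": [], "s2": ["tof1", "tof2"], "s3": []}
    mesh = {"s1": ["s2", "s3"], "s2": ["s1", "s3"], "s3": ["s1", "s2"]}
    for spine in ("s1", "s2", "s3"):
        print(spine, northbound_next_hops(spine, north, mesh))
```

The rule only looks one hop sideways, which is why full-mesh connectivity matters. In a ring, the nearest sibling may itself lack northbound links, and reaching a connected node would require the multi-hop, loop-breaking computation that RIFT does not attempt.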
4.2.2. Vertical Shortcuts 4.2.2. Vertical Shortcuts
Through relaxations of the specified adjacency forming rules, RIFT Through relaxations of the specified adjacency forming rules, RIFT
implementations can be extended to support vertical "shortcuts". The implementations can be extended to support vertical "shortcuts". The
RIFT specification itself does not provide the exact details since RIFT specification itself does not provide the exact details since
the resulting solution suffers from either a much larger blast radius the resulting solution suffers from either a much larger blast radius
with increased flooding volumes or in case of maximum aggregation with increased flooding volumes or in case of maximum aggregation
routing, bow-tie problems. routing, bow-tie problems.
4.2.3. Generalizing to any Directed Acyclic Graph 4.2.3. Generalizing to any Directed Acyclic Graph
skipping to change at page 9, line 36 skipping to change at page 9, line 35
* Northbound, RIFT operates as a link-state protocol, whereby the * Northbound, RIFT operates as a link-state protocol, whereby the
control packets are reflooded first all the way north and only control packets are reflooded first all the way north and only
interpreted later. All the individual fine grained routes are interpreted later. All the individual fine grained routes are
advertised. advertised.
* Southbound, RIFT operates as a distance-vector protocol, whereby * Southbound, RIFT operates as a distance-vector protocol, whereby
the control packets are flooded only one-hop, interpreted, and the the control packets are flooded only one-hop, interpreted, and the
consequence of that computation is what gets flooded one more hop consequence of that computation is what gets flooded one more hop
south. In the most common use-cases, a ToF node can reach most of south. In the most common use-cases, a ToF node can reach most of
the prefixes in the fabric. If that is the case, the ToF node the prefixes in the fabric. If that is the case, the ToF node
advertises the fabric default and disaggregates the prefixes that advertises the fabric default and negatively disaggregates the
it cannot reach. On the other hand, a ToF node that can reach prefixes that it cannot reach. On the other hand, a ToF node that
only a small subset of the prefixes in the fabric will preferably can reach only a small subset of the prefixes in the fabric will
advertise those prefixes and refrain from aggregating. preferably advertise those prefixes and refrain from aggregating.
In the general case, what gets advertised south is in more In the general case, the following are advertised south:
details:
1. A fabric default that aggregates all the prefixes that are 1. A fabric default that aggregates all the prefixes that are
reachable within the fabric, and that could be a default route reachable within the fabric, and that could be a default route
or a prefix that is dedicated to this particular fabric. or a prefix that is dedicated to this particular fabric.
2. The loopback addresses of the northbound nodes, e.g., for 2. The loopback addresses of the northbound nodes, e.g., for
inband management. inband management.
3. The disaggregated prefixes for the dynamic exceptions to the 3. The disaggregated prefixes for the dynamic exceptions to the
fabric default, advertised to route around the black hole that fabric default, advertised to route around the black hole that
may form. may form.
* East-West routing can optionally be used, with specific * East-West routing can optionally be used, with specific
restrictions. It is used when a sibling has access to the fabric restrictions. It is used when a sibling has access to the fabric
default but this node does not. default but this node does not.
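The southbound choice a ToF node makes can be sketched in a few lines. Several details are assumptions:

* The 50% cutoff that separates "most of the prefixes" from "a small subset" is a hypothetical parameter; the text gives no number.
* The function name and the IPv4 fabric default are hypothetical.

The sketch also leaves out East-West routing.

```python
from typing import Iterable, List, Set, Tuple


def tof_southbound(fabric_prefixes: Set[str], reachable: Set[str],
                   loopbacks: Iterable[str] = (),
                   fabric_default: str = "0.0.0.0/0",
                   threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Return (positive, negative) southbound advertisements of a ToF node.

    `threshold` is a hypothetical knob: the fraction of fabric prefixes the
    ToF must reach before it prefers default plus negative disaggregation.
    """
    reached = fabric_prefixes & reachable
    holes = sorted(fabric_prefixes - reachable)
    lo_addrs = sorted(loopbacks)  # loopbacks, e.g. for in-band management
    if len(reached) >= threshold * len(fabric_prefixes):
        # Reaches most prefixes: advertise the fabric default and
        # negatively disaggregate the prefixes it cannot reach.
        return [fabric_default] + lo_addrs, holes
    # Reaches only a small subset: advertise those prefixes positively
    # and refrain from aggregating.
    return sorted(reached) + lo_addrs, []


if __name__ == "__main__":
    fabric = {"10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24", "10.0.4.0/24"}
    print(tof_southbound(fabric, fabric - {"10.0.4.0/24"}, ["192.0.2.1/32"]))
    print(tof_southbound(fabric, {"10.0.1.0/24"}, ["192.0.2.1/32"]))
```

With three of four prefixes reachable, the sketch advertises the default and negatively disaggregates 10.0.4.0/24. With only one reachable, it advertises that prefix positively and sends no negative routes.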
A Directed Acyclic Graph (DAG) provides a sense of north (the Since a Directed Acyclic Graph (DAG) provides a sense of north (the
direction of the DAG) and of south (the reverse), which can be used direction of the DAG) and of south (the reverse), it can be used to
to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has apply RIFT; a vertex in the DAG that has only incoming edges is a
only incoming edges is a ToF node. ToF node.
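Given a DAG whose edges point north, the ToF nodes are its sinks: vertices that have incoming edges but no outgoing ones. The caveats that follow require the DAG structure to exist before RIFT starts. The sketch below therefore does two things:

* It finds the sinks.
* It checks acyclicity with Kahn's topological-sort algorithm.

The function names and the (south_end, north_end) edge encoding are hypothetical.

```python
from collections import defaultdict, deque
from typing import List, Set, Tuple

# Each edge is (south_end, north_end): the DAG is oriented toward the north.
Edge = Tuple[str, str]


def is_dag(edges: List[Edge]) -> bool:
    """Kahn's algorithm: True if every vertex can be topologically ordered."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for south, north in edges:
        succ[south].append(north)
        indegree[north] += 1
        nodes.update((south, north))
    queue = deque(n for n in nodes if indegree[n] == 0)
    ordered = 0
    while queue:
        node = queue.popleft()
        ordered += 1
        for nxt in succ[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return ordered == len(nodes)


def tof_nodes(edges: List[Edge]) -> Set[str]:
    """ToF nodes are sinks: they have incoming edges but no outgoing ones."""
    has_out = {south for south, _ in edges}
    has_in = {north for _, north in edges}
    return has_in - has_out


if __name__ == "__main__":
    edges = [("l1", "s1"), ("l2", "s1"), ("l2", "s2"),
             ("s1", "t1"), ("s2", "t1"), ("s2", "t2")]
    print(is_dag(edges), sorted(tof_nodes(edges)))
```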
There are a number of caveats though: There are a number of caveats though:
* The DAG structure must exist before RIFT starts, so there is a * The DAG structure must exist before RIFT starts, so there is a
need for a companion protocol to establish the logical DAG need for a companion protocol to establish the logical DAG
structure. structure.
* A generic DAG does not have a sense of east and west. The * A generic DAG does not have a sense of east and west. The
operation specified for east-west links and the southbound operation specified for east-west links and the southbound
reflection between nodes are not applicable. Also ZTP (Zero Touch reflection between nodes are not applicable. Also ZTP (Zero Touch
skipping to change at page 11, line 18 skipping to change at page 11, line 14
4.2.4. Reachability of Internal Nodes in the Fabric 4.2.4. Reachability of Internal Nodes in the Fabric
RIFT does not require that nodes have reachable addresses in the RIFT does not require that nodes have reachable addresses in the
fabric, though it is clearly desirable for operational purposes. fabric, though it is clearly desirable for operational purposes.
Under normal operating conditions this can be easily achieved by Under normal operating conditions this can be easily achieved by
injecting the node's loopback address into North and South Prefix injecting the node's loopback address into North and South Prefix
TIEs or other implementation specific mechanisms. TIEs or other implementation specific mechanisms.
Special considerations arise when a node loses all northbound Special considerations arise when a node loses all northbound
adjacencies, but is not at the top of the fabric. These are outside adjacencies, but is not at the top of the fabric. If a spine node
the scope of this document and could be discussed in a separate loses all northbound links, it does not advertise a default
document. route. However, if the level of the spine node is auto-determined by ZTP,
it will "fall down" as depicted in Figure 8.
4.3. Use Cases 4.3. Use Cases
4.3.1. Data Center Topologies 4.3.1. Data Center Topologies
4.3.1.1. Data Center Fabrics 4.3.1.1. Data Center Fabrics
RIFT is suited for data center (DC) IP fabric underlay RIFT is suited for data center (DC) IP fabric underlay
routing, the vast majority of which seem to be currently (and for the routing, the vast majority of which seem to be currently (and for the
foreseeable future) Clos architectures. It significantly simplifies foreseeable future) Clos architectures. It significantly simplifies
skipping to change at page 12, line 33 skipping to change at page 12, line 33
. +-----+ +-----+ . +-----+ +-----+
Figure 2: Level Shortcut Figure 2: Level Shortcut
RIFT is not strictly limited to Clos topologies. The protocol only RIFT is not strictly limited to Clos topologies. The protocol only
requires a sense of "compass rose directionality" either achieved requires a sense of "compass rose directionality" either achieved
through configuration or derivation of levels. So, conceptually, through configuration or derivation of levels. So, conceptually,
shortcuts between levels could be included. Figure 2 depicts an shortcuts between levels could be included. Figure 2 depicts an
example of a shortcut between levels. In this example, sub-optimal example of a shortcut between levels. In this example, sub-optimal
routing will occur when traffic is sent from L0 to L1 via S0's routing will occur when traffic is sent from L0 to L1 via S0's
default route and back down through A0 or A1. In order to ensure default route and back down through A0 or A1. In order to avoid
that only default routes from A0 or A1 are used, all leaves would be that and ensure that only default routes from A0 or A1 are used, all leaves would be
required to install each other's routes. required to install each other's routes.
While various technical and operational challenges may require the While various technical and operational challenges may require the
use of such modifications, discussion of those topics is outside the use of such modifications, discussion of those topics is outside the
scope of this document. scope of this document.
4.3.2. Metro Fabrics 4.3.2. Metro Fabrics
The demand for bandwidth is increasing steadily, driven primarily by The demand for bandwidth is increasing steadily, driven primarily by
environments close to content producers (server farms connected via environments close to content producers (server farms connected via
DC fabrics) but in proximity to content consumers as well. Consumers DC fabrics) but in proximity to content consumers as well. Consumers
are often clustered in metro areas with their own network are often clustered in metro areas with their own network
architectures that can benefit from simplified, regular Clos architectures that can benefit from simplified, regular Clos
structures and hence from RIFT. structures and hence from RIFT.
4.3.3. Building Cabling 4.3.3. Building Cabling
Commercial edifices are often cabled in topologies that are either Commercial edifices are often cabled in topologies that are either
Clos or its isomorphic equivalents. The Clos can grow rather high Clos or its isomorphic equivalents. The Clos can grow rather high
with many floors. That presents a challenge for traditional routing with many levels. That presents a challenge for traditional routing
protocols (except BGP and by now largely phased-out PNNI) which do protocols (except BGP and by now largely phased-out PNNI) which do
not support an arbitrary number of levels which RIFT does naturally. not support an arbitrary number of levels which RIFT does naturally.
Moreover, due to the limited sizes of forwarding tables in network Moreover, due to the limited sizes of forwarding tables in network
elements of building cabling, the minimum FIB size RIFT maintains elements of building cabling, the minimum FIB size RIFT maintains
under normal conditions is cost-effective in terms of hardware and under normal conditions is cost-effective in terms of hardware and
operational costs. operational costs.
4.3.4. Internal Router Switching Fabrics 4.3.4. Internal Router Switching Fabrics
It is common in high-speed communications switching and routing It is common in high-speed communications switching and routing
devices to use fabrics when a crossbar is not feasible due to cost, devices to use fabrics when a crossbar is not feasible due to cost,
head-of-line blocking or size trade-offs. Normally such fabrics are head-of-line blocking or size trade-offs. Normally such fabrics are
not self-healing or rely on 1:/+1 protection schemes but it is not self-healing or rely on 1:/+1 protection schemes but it is
conceivable to use RIFT to operate Clos fabrics that can deal conceivable to use RIFT to operate Clos fabrics that can deal
effectively with interconnection or subsystem failures in such effectively with interconnection or subsystem failures in such
modules. RIFT is neither IP specific and hence any link addressing modules. RIFT is not IP specific and hence any link addressing
connecting internal device subnets is conceivable. connecting internal device subnets is conceivable.
4.3.5. CloudCO 4.3.5. CloudCO
The Cloud Central Office (CloudCO) is a new stage of telecom Central The Cloud Central Office (CloudCO) is a new stage of telecom Central
Office. It takes advantage of Software Defined Networking (SDN) Office. It takes advantage of Software Defined Networking (SDN)
and Network Function Virtualization (NFV) in conjunction with general and Network Function Virtualization (NFV) in conjunction with general
purpose hardware to optimize current networks. The following figure purpose hardware to optimize current networks. The following figure
illustrates this architecture at a high level. It describes a single illustrates this architecture at a high level. It describes a single
instance or macro-node of cloud CO that provides a number of Value instance or macro-node of cloud CO that provides a number of Value
skipping to change at page 15, line 25 skipping to change at page 15, line 25
good scaling properties while delivering maximum reactiveness. good scaling properties while delivering maximum reactiveness.
* RIFT allows for extensive Zero Touch Provisioning within the * RIFT allows for extensive Zero Touch Provisioning within the
protocol. In its most extreme version RIFT does not rely on any protocol. In its most extreme version RIFT does not rely on any
specific addressing and for IP fabric can operate using IPv6 ND specific addressing and for IP fabric can operate using IPv6 ND
[RFC4861] only. [RFC4861] only.
* RIFT has provisions to detect common IP fabric mis-cabling * RIFT has provisions to detect common IP fabric mis-cabling
scenarios. scenarios.
* RIFT automatically negotiates BFD per link allowing this way for * RIFT automatically negotiates BFD per link. This allows for IP
IP and micro-BFD [RFC7130] to replace Link Aggregation Groups and micro-BFD [RFC7130] to replace Link Aggregation Groups (LAGs)
(LAGs) which hide bandwidth imbalances in case of constituent which hide bandwidth imbalances in case of constituent
failures. Further automatic link validation techniques similar to failures. Further automatic link validation techniques similar to
[RFC5357] could be supported as well. [RFC5357] could be supported as well.
* RIFT inherently solves many difficult problems associated with the * RIFT inherently solves many difficult problems associated with the
use of traditional routing topologies with dense meshes and high use of traditional routing topologies with dense meshes and high
degrees of ECMP by including automatic bandwidth balancing, flood degrees of ECMP by including automatic bandwidth balancing, flood
reduction and automatic disaggregation on failures while providing reduction and automatic disaggregation on failures while providing
maximum aggregation of prefixes in default scenarios. maximum aggregation of prefixes in default scenarios.
* RIFT reduces FIB size towards the bottom of the IP fabric where * RIFT reduces FIB size towards the bottom of the IP fabric where
skipping to change at page 17, line 40 skipping to change at page 17, line 40
Figure 4: Suboptimal routing upon link failure use case Figure 4: Suboptimal routing upon link failure use case
As shown in Figure 4, as the result of the south reflection between As shown in Figure 4, as the result of the south reflection between
Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and
Spine122 know each other at level 1. Spine122 know each other at level 1.
Without a disaggregation mechanism, when linkSL6 fails, the packet from Without a disaggregation mechanism, when linkSL6 fails, the packet from
leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 leaf121 to prefix122 will probably go up through linkSL5 to linkTS3
then go down through linkTS4 to linkSL8 to Leaf122 or go up through then go down through linkTS4 to linkSL8 to Leaf122 or go up through
linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to linkSL5 to linkTS6 then go down through linkTS8 and linkSL8 to
Leaf122 based on pure default route. It's the case of suboptimal Leaf122 based on pure default route. It's the case of suboptimal
routing or bow-tieing. routing or bow-tieing.
With the disaggregation mechanism, when linkSL6 fails, Spine122 will With the disaggregation mechanism, when linkSL6 fails, Spine122 will
detect the failure according to the reflected node S-TIE from detect the failure according to the reflected node S-TIE from
Spine121. Based on the disaggregation algorithm provided by RIFT, Spine121. Based on the disaggregation algorithm provided by RIFT,
Spine122 will explicitly advertise prefix122 in Disaggregated Prefix Spine122 will explicitly advertise prefix122 in Disaggregated Prefix
S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to
prefix122 will only be sent to linkSL7 following a longest-prefix prefix122 will only be sent to linkSL7 following a longest-prefix
match to prefix122 directly and then go down through linkSL8 to Leaf122 match to prefix122 directly and then go down through linkSL8 to Leaf122
skipping to change at page 19, line 14 skipping to change at page 19, line 14
The packet from leaf111 to prefix122 will not be routed to linkTS1 or The packet from leaf111 to prefix122 will not be routed to linkTS1 or
linkTS2. The packet from leaf111 to prefix122 will only be routed to linkTS2. The packet from leaf111 to prefix122 will only be routed to
linkTS5 or linkTS7 following a longest-prefix match to prefix122. linkTS5 or linkTS7 following a longest-prefix match to prefix122.
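The longest-prefix match behavior described above can be illustrated with a simplified FIB on Leaf121 in the Figure 4 scenario, after Spine122 has positively disaggregated prefix122. This is a minimal Python sketch, not RIFT code. The IPv4 addresses standing in for prefix122 and for an arbitrary destination are hypothetical. The next hops simply reuse the link names from Figure 4. `lookup()` is a plain longest-prefix match helper introduced here for illustration only.

```python
import ipaddress
from typing import Dict, List

# Hypothetical FIB of Leaf121 (Figure 4) after Spine122 disaggregates prefix122.
# 10.1.22.0/24 stands in for prefix122; next hops reuse the Figure 4 link names.
FIB: Dict[ipaddress.IPv4Network, List[str]] = {
    ipaddress.ip_network("0.0.0.0/0"): ["linkSL5", "linkSL7"],  # default via both spines
    ipaddress.ip_network("10.1.22.0/24"): ["linkSL7"],          # disaggregated by Spine122
}


def lookup(fib: Dict[ipaddress.IPv4Network, List[str]], destination: str) -> List[str]:
    """Plain longest-prefix match: next hops of the most specific covering prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    if not matches:
        return []
    return fib[max(matches, key=lambda net: net.prefixlen)]


print(lookup(FIB, "10.1.22.7"))  # ['linkSL7']
print(lookup(FIB, "10.9.9.9"))   # ['linkSL5', 'linkSL7']
```

Removing the 10.1.22.0/24 entry makes traffic to prefix122 fall back to the default route over both linkSL5 and linkSL7. That is the bow-tie case described for a fabric without disaggregation.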
5.4. Zero Touch Provisioning (ZTP) 5.4. Zero Touch Provisioning (ZTP)
RIFT is designed to require a very minimal configuration to simplify RIFT is designed to require a very minimal configuration to simplify
its operation and avoid human errors; based on that minimal its operation and avoid human errors; based on that minimal
information, Zero Touch Provisioning (ZTP) autoconfigures the key information, Zero Touch Provisioning (ZTP) autoconfigures the key
operational parameters of all the RIFT nodes, that is, on the one operational parameters of all the RIFT nodes, including the SystemID
hand, the SystemID of the node that must be unique in the RIFT of the node that must be unique in the RIFT network and the level of
network, and on the other hand the level of the node in the Fat Tree, the node in the Fat Tree, which determines which peers are northwards
which determines which peers are northwards "parents" and which are "parents" and which are southwards "children".
southwards "children".
ZTP is always on, but its decisions can be overridden when a network ZTP is always on, but its decisions can be overridden when a network
administrator prefers to impose its own configuration. In that case, administrator prefers to impose its own configuration. In that case,
it is the responsibility of the administrator to ensure that the it is the responsibility of the administrator to ensure that the
configured parameters are correct, in other words that the SystemID configured parameters are correct, in other words that the SystemID
of each node is unique, and that the administratively set levels of each node is unique, and that the administratively set levels
truly reflect the relative position of the nodes in the fabric. It truly reflect the relative position of the nodes in the fabric. It
is recommended to let ZTP configure the network, and when not, it is is recommended to let ZTP configure the network, and when not, it is
recommended to configure the level of all the nodes but those that recommended to configure the level of all the nodes to avoid an
are forced as leaves to avoid an undesirable interaction between ZTP undesirable interaction between ZTP and the manual configuration.
and the manual configuration.
ZTP requires that the administrator points out the Top-of-Fabric ZTP requires that the administrator points out the Top-of-Fabric
(ToF) nodes to set the baseline from which the fabric topology is (ToF) nodes to set the baseline from which the fabric topology is
derived. The Top-of-Fabric nodes are configured with the TOP_OF_FABRIC derived. The Top-of-Fabric nodes are configured with the TOP_OF_FABRIC
flag and are the initial 'seeds' needed for other ZTP nodes to derive flag and are the initial 'seeds' needed for other ZTP nodes to derive
their level in the topology. ZTP computes the level of each node their level in the topology. ZTP computes the level of each node
based on the Highest Available Level (HAL) of the potential parent(s) based on the Highest Available Level (HAL) of the potential parent(s)
nearest that baseline, which represents the superspine. In a nearest that baseline, which represents the superspine. In a
sense, RIFT can be seen as a distance-vector protocol that computes sense, RIFT can be seen as a distance-vector protocol that computes
a set of feasible successors towards the superspine and auto- a set of feasible successors towards the superspine and auto-
skipping to change at page 20, line 16 skipping to change at page 20, line 12
highest level either leaving or entering the domain (with some finer highest level either leaving or entering the domain (with some finer
distinctions not explained further). It is therefore recommended distinctions not explained further). It is therefore recommended
that each node is multi-homed towards nodes with respective HAL that each node is multi-homed towards nodes with respective HAL
offerings. Fortunately, this is the natural state of things for the offerings. Fortunately, this is the natural state of things for the
topology variants considered in RIFT. topology variants considered in RIFT.
A RIFT node may also be configured to confine it to the leaf role A RIFT node may also be configured to confine it to the leaf role
with the LEAF_ONLY flag. A leaf node can also be configured to with the LEAF_ONLY flag. A leaf node can also be configured to
support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either support leaf-2-leaf procedures with the LEAF_2_LEAF flag. In either
case the node cannot be TOP_OF_FABRIC and its level cannot be case the node cannot be TOP_OF_FABRIC and its level cannot be
configured. RIFT will fully configure the node's level after it is configured. RIFT will fully determine the node's level after it is
attached to the topology and ensure that the node is at the "bottom attached to the topology and ensure that the node is at the "bottom
of the hierarchy" (southernmost). of the hierarchy" (southernmost).
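The level derivation described in this section can be sketched in simplified form. In outline: a node takes its level from the Highest Available Level (HAL) offered by its potential parents, ToF nodes are seeded via TOP_OF_FABRIC, and LEAF_ONLY nodes stay at the bottom. The Python sketch below rests on several assumptions:

- The derived level is HAL minus one.
- The numeric ToF level is 24.
- Undefined offers and offers of level 0 are ignored.
- Holddown timers, adjacency state handling and the other rules of [RIFT] are omitted.

`derive_level()` is a hypothetical helper, not an API defined by [RIFT].

```python
from typing import Iterable, Optional

# Assumption: numeric level used by nodes flagged TOP_OF_FABRIC.
TOP_OF_FABRIC_LEVEL = 24
LEAF_LEVEL = 0


def derive_level(configured: Optional[int], leaf_only: bool,
                 offered_levels: Iterable[Optional[int]]) -> Optional[int]:
    """Simplified ZTP level derivation; returns None while undefined."""
    if leaf_only:
        # LEAF_ONLY nodes are always southernmost and cannot be configured.
        return LEAF_LEVEL
    if configured is not None:
        # An administratively set level overrides ZTP.
        return configured
    # Only defined, non-leaf offers can come from a potential parent.
    valid = [lvl for lvl in offered_levels if lvl is not None and lvl > LEAF_LEVEL]
    if not valid:
        return None
    hal = max(valid)  # Highest Available Level among the offers
    return hal - 1


# A spine hearing two ToF nodes, then nodes below that spine.
spine = derive_level(None, False, [TOP_OF_FABRIC_LEVEL, TOP_OF_FABRIC_LEVEL])
print(spine)                                     # 23
print(derive_level(None, False, [spine, None]))  # 22
print(derive_level(None, True, [spine]))         # 0
```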
5.5. Mis-cabling Examples 5.5. Mis-cabling Examples
+----------------+ +-----------------+ +----------------+ +-----------------+
| ToF21 | +------+ ToF22 | LEVEL 2 | ToF21 | +------+ ToF22 | LEVEL 2
+-------+----+---+ | +----+---+--------+ +-------+----+---+ | +----+---+--------+
| | | | | | | | | | | | | | | | | |
| | | +----------------------------+ | | | | +----------------------------+ |
skipping to change at page 21, line 20 skipping to change at page 21, line 25
and Spine121 belong to two different PoDs, the adjacency between and Spine121 belong to two different PoDs, the adjacency between
Leaf112 and Spine121 cannot be formed. link-W would be detected and Leaf112 and Spine121 cannot be formed. link-W would be detected and
prevented. prevented.
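The PoD-based rejection of link-W can be expressed as a simple predicate on the PoD identifiers of the two nodes. The Python sketch below is illustrative only and rests on three assumptions:

- PoD value 0 means "no PoD configured" and is compatible with any PoD.
- The PoD numbers given to Leaf112 and Spine121 are hypothetical.
- The other adjacency acceptance checks that [RIFT] performs are ignored.

```python
UNDEFINED_POD = 0  # assumption: 0 means "no PoD configured"


def pod_allows_adjacency(local_pod: int, neighbor_pod: int) -> bool:
    """Return False only when both PoDs are defined and differ."""
    if UNDEFINED_POD in (local_pod, neighbor_pod):
        return True
    return local_pod == neighbor_pod


# Hypothetical PoD numbers for the nodes joined by link-W.
LEAF112_POD, SPINE121_POD = 1, 2
print(pod_allows_adjacency(LEAF112_POD, SPINE121_POD))  # False: link-W rejected
print(pod_allows_adjacency(LEAF112_POD, 1))             # True
```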
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| | | | | | | | | | | | | | | |
| | | +-----------------+ | | | | | | +-----------------+ | | |
| +--------------------------+ | | | | | +--------------------------+ | | | |
| | | | | | | |
| +------+ | | | +------+ | | +------+ | | | +------+ |
| | +-----------------+ | | | | | | | +-----------------+ | | | | |
| | | +--------------------------+ | | | | | +--------------------------+ | |
| A | | B | | A | | B | | A | | B | | A | | B |
+-----+--+ +-+---+--+ +--+---+-+ +--+-----+ +-----+--+ +-+---+--+ +--+---+-+ +--+-----+
|Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1 |Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1
+-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+
| | | | | | | | | | | | | | | | | |
| +---------+ | | | +---------+ | | +---------+ | | | +---------+ |
| | | | link-W | | | | | | | | link-W | | | |
skipping to change at page 22, line 25 skipping to change at page 22, line 25
| | | | | | | | | | | |
| +-------+ | | | | +-------+ | | |
+ + | | ====> | | + + | | ====> | |
X X +------+ | +------+ | X X +------+ | +------+ |
+ + | | | | + + | | | |
+----+--+ +-+-----+ +-+-----+ +----+--+ +-+-----+ +-+-----+
|Spine11| |Spine12| |Spine12| |Spine11| |Spine12| |Spine12|
+-+---+-+ ++----+-+ ++----+-+ +-+---+-+ ++----+-+ ++----+-+
| | | | | | | | | | | |
| +---------+ | | | | +---------+ | | |
| | | | | |
| +-------+ | | +-------+ | | +-------+ | | +-------+ |
| | | | | | | | | | | |
+-+---+-+ +--+--+-+ +-----+-+ +-----+-+ +-+---+-+ +--+--+-+ +-----+-+ +-----+-+
|Leaf111| |Leaf112| |Leaf111| |Leaf112| |Leaf111| |Leaf112| |Leaf111| |Leaf112|
+-------+ +-------+ +-+-----+ +-+-----+ +-------+ +-------+ +-+-----+ +-+-----+
| | | |
| +--------+ | +--------+
| | | |
+-+---+-+ +-+---+-+
|Spine11| |Spine11|
skipping to change at page 23, line 7 skipping to change at page 22, line 52
specific route southwards as an exception to the aggregated fabric- specific route southwards as an exception to the aggregated fabric-
default north. Disaggregation is useful when a prefix within the default north. Disaggregation is useful when a prefix within the
aggregation is reachable via some of the parents but not the others aggregation is reachable via some of the parents but not the others
at the same level of the fabric. It is mandatory when the level is at the same level of the fabric. It is mandatory when the level is
the ToF since a ToF node that cannot reach a prefix becomes a black the ToF since a ToF node that cannot reach a prefix becomes a black
hole for that prefix. The hard problem is to know which prefixes are hole for that prefix. The hard problem is to know which prefixes are
reachable by whom. reachable by whom.
In the general case, [RIFT] solves that problem by interconnecting In the general case, [RIFT] solves that problem by interconnecting
the ToF nodes. So the ToF nodes can exchange the full list of the ToF nodes. So the ToF nodes can exchange the full list of
prefixes that exist in the fabric and figure when a ToF node lacks prefixes that exist in the fabric and figure out when a ToF node
reachability and to existing prefix. This requires additional ports lacks reachability to some prefixes. This requires additional ports
at the ToF, typically 2 ports per ToF node to form a ToF-spanning at the ToF, typically 2 ports per ToF node to form a ToF-spanning
ring. [RIFT] also defines the southbound reflection procedure that ring. [RIFT] also defines the southbound reflection procedure that
enables a parent to explore the direct connectivity of its peers, enables a parent to explore the direct connectivity of its peers,
meaning their own parents and children; based on the advertisements meaning their own parents and children; based on the advertisements
received from the shared parents and children, it may enable the received from the shared parents and children, it may enable the
parent to infer the prefixes its peers can reach. parent to infer the prefixes its peers can reach.
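Once a ToF node has learned the full prefix list over the ToF ring, its comparison reduces to a set difference. Prefixes that exist somewhere in the fabric but that the node cannot reach southbound are the candidates for the negative disaggregation discussed next. The Python sketch below uses hypothetical node names and prefix labels. It abstracts away how the reachability sets are built from TIEs. `unreachable_prefixes()` is a hypothetical helper.

```python
from typing import Dict, Set


def unreachable_prefixes(local_tof: str, reach: Dict[str, Set[str]]) -> Set[str]:
    """Prefixes reachable via some ToF in `reach` but not via local_tof."""
    fabric_wide: Set[str] = set().union(*reach.values())
    return fabric_wide - reach[local_tof]


# Hypothetical southbound reachability learned over the ToF ring.
reach = {
    "ToF21": {"prefix111", "prefix112", "prefix121"},
    "ToF22": {"prefix111", "prefix112", "prefix121", "prefix122"},
}
print(unreachable_prefixes("ToF21", reach))  # {'prefix122'}
print(unreachable_prefixes("ToF22", reach))  # set()
```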
When a parent lacks reachability to a prefix, it may disaggregate the When a parent lacks reachability to a prefix, it may disaggregate the
prefix negatively, i.e., advertise that this parent can be used to prefix negatively, i.e., advertise that this parent can be used to
reach any prefix in the aggregation except that one. The Negative reach any prefix in the aggregation except that one. The Negative
skipping to change at page 24, line 41 skipping to change at page 24, line 36
leaf is obsolete, and a stale route may exist for a while. The leaf is obsolete, and a stale route may exist for a while. The
common parent needs to select the freshest route advertisement in common parent needs to select the freshest route advertisement in
order to install the correct route via the next-leaf. This requires order to install the correct route via the next-leaf. This requires
that the fabric determines the sequence of the movements of the that the fabric determines the sequence of the movements of the
mobile node. mobile node.
On the one hand, a classical sequence counter provides a total order On the one hand, a classical sequence counter provides a total order
for a while but it will eventually wrap. On the other hand, a for a while but it will eventually wrap. On the other hand, a
timestamp provides a permanent order but it may miss a movement that timestamp provides a permanent order but it may miss a movement that
happens too quickly vs. the granularity of the timing information. happens too quickly vs. the granularity of the timing information.
It is not envisioned in the short term that the average fabric It is not envisioned that an average fabric supports the Precision Time
supports a Precision Time Protocol [IEEEstd1588], and the precision Protocol [IEEEstd1588] in the short term, and the precision
that may be available with the Network Time Protocol [RFC5905], in available with the Network Time Protocol [RFC5905] (on the order of
the order of 100 to 200ms, may not be necessarily enough to cover, 100 to 200ms) may not be sufficient to cover, e.g., the fast
e.g., the fast mobility of a Virtual Machine. mobility of a Virtual Machine.
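Section 4.3.3 of [RIFT] specifies a hybrid method, summarized in the next paragraph. It orders two advertisements by timestamp when their timestamps differ by more than the timing precision, and otherwise falls back to the sequence counter of the mobile node. The Python sketch below shows only that decision logic. It is not the encoding or comparison defined in [RIFT], and it makes these assumptions:

- Timestamps are in seconds.
- The precision is 200 ms, taken from the NTP figure above.
- The sequence counter is 16 bits wide.
- Counters are compared with RFC 1982 style serial-number arithmetic. This is a substitute technique chosen here so that a wrapped counter still compares correctly.

`fresher()` and `seq_newer()` are hypothetical helpers.

```python
from typing import Optional

SEQ_BITS = 16  # assumption: width of the mobile node's sequence counter


def seq_newer(a: int, b: int, bits: int = SEQ_BITS) -> bool:
    """RFC 1982 style serial comparison: True if counter a is newer than b."""
    diff = (a - b) % (1 << bits)
    return 0 < diff < (1 << (bits - 1))


def fresher(ts_a: float, ts_b: float, precision: float,
            seq_a: Optional[int] = None,
            seq_b: Optional[int] = None) -> Optional[bool]:
    """True if advertisement A is fresher, False if B is, None if undecided."""
    if abs(ts_a - ts_b) > precision:
        # Timestamps are far enough apart to be trusted on their own.
        return ts_a > ts_b
    if seq_a is not None and seq_b is not None:
        # Too close to call by time: fall back to the sequence counter.
        if seq_newer(seq_a, seq_b):
            return True
        if seq_newer(seq_b, seq_a):
            return False
    return None


print(fresher(10.0, 10.5, precision=0.2))               # False
print(fresher(10.0, 10.05, 0.2, seq_a=5, seq_b=4))      # True
print(fresher(10.0, 10.05, 0.2, seq_a=1, seq_b=65535))  # True (counter wrapped)
print(fresher(10.0, 10.05, 0.2))                        # None
```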
Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that Section 4.3.3. "Mobility" of [RIFT] specifies a hybrid method that
combines a sequence counter from the mobile node and a timestamp from combines a sequence counter from the mobile node and a timestamp from
the network taken at the leaf when the route is injected. If the the network taken at the leaf when the route is injected. If the
timestamps of the concurrent advertisements are comparable (i.e., timestamps of the concurrent advertisements are comparable (i.e.,
more distant than the precision of the timing protocol), then the more distant than the precision of the timing protocol), then the
timestamp alone is used to determine the relative freshness of the timestamp alone is used to determine the relative freshness of the
routes. Otherwise, the sequence counter from the mobile node, if routes. Otherwise, the sequence counter from the mobile node, if
available, is used. One caveat is that the sequence counter must not available, is used. One caveat is that the sequence counter must not
wrap within the precision of the timing protocol. Another is that wrap within the precision of the timing protocol. Another is that
skipping to change at page 26, line 37 skipping to change at page 26, line 47
+---+----+ +---+----+ +---+----+ +---+----+
| V4 | | V4 | | V4 | | V4 |
| subnet | | subnet | | subnet | | subnet |
+--------+ +--------+ +--------+ +--------+
Figure 9: IPv4 over IPv6 Figure 9: IPv4 over IPv6
5.9. In-Band Reachability of Nodes 5.9. In-Band Reachability of Nodes
RIFT does not require that nodes of the fabric have reachable RIFT does not require that nodes of the fabric have reachable
addresses. But the operational purposes to reach the internal nodes addresses. But the operational reasons to reach the internal nodes
may exist. Figure 10 shows an example in which the network management may exist. Figure 10 shows an example in which the network management
station (NMS) attaches to leaf1. station (NMS) attaches to leaf1.
+-------+ +-------+ +-------+ +-------+
| ToF1 | | ToF2 | | ToF1 | | ToF2 |
++---- ++ ++-----++ ++---- ++ ++-----++
| | | | | | | |
| +----------+ | | +----------+ |
| +--------+ | | | +--------+ | |
| | | | | | | |
skipping to change at page 27, line 42 skipping to change at page 27, line 42
NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/ NMS may reach Spine2 from Leaf1-Spine2 or Leaf1-Spine1-ToF1/
ToF2-Spine2. ToF2-Spine2.
If NMS wants to access ToF2, ToF2's loopback address needs to be If NMS wants to access ToF2, ToF2's loopback address needs to be
injected into its Prefix South TIE. This TIE must be seen by all injected into its Prefix South TIE. This TIE must be seen by all
nodes at the level below - the spine nodes in Figure 10 - that must nodes at the level below - the spine nodes in Figure 10 - that must
form a ceiling for all the traffic coming from below (south). form a ceiling for all the traffic coming from below (south).
Otherwise, the traffic from NMS may follow the default route to the Otherwise, the traffic from NMS may follow the default route to the
wrong ToF Node, e.g., ToF1. wrong ToF Node, e.g., ToF1.
In a fully connected ToF, in case of failure between ToF2 and spine In case of failure between ToF2 and spine nodes, ToF2's loopback
nodes, ToF2's loopback address must be disaggregated recursively all address must be disaggregated recursively all the way to the leaves.
the way to the leaves. In a partitioned ToF, even with recursive disaggregation a ToF node
is only reachable within its plane.
In a partitioned ToF, a TOF node is only reachable within its Plane, A possible alternative to recursive disaggregation is to use a ring
and the disaggregation to the leaves is also required. A possible that interconnects the ToF nodes to transmit packets between them for
alternative is to use the ring that interconnects the ToF nodes to their loopback addresses only. The idea is that this is mostly
transmit packets between them for their loopback addresses only. The control traffic and should not alter the load balancing properties of
idea is that this is mostly control traffic and should not alter the the fabric.
load balancing properties of the fabric.
5.10. Dual Homing Servers 5.10. Dual Homing Servers
Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. It Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. It
has no configuration (unless it is a Top-of-Fabric at the top of the has no configuration (unless it is a Top-of-Fabric at the top of the
topology or must operate in the topology as a leaf and/or support topology or must operate in the topology as a leaf and/or support
leaf-2-leaf procedures) and it will fully configure itself after leaf-2-leaf procedures) and it will fully configure itself after
being attached to the topology. being attached to the topology.
+---+ +---+ +---+ +---+ +---+ +---+
|ToF| |ToF| |ToF| ToF |ToF| |ToF| |ToF| ToF
+---+ +---+ +---+ +---+ +---+ +---+
| | | | | | | | | | | |
| +----------------+ | | | +----------------+ | |
| | | | | |
| +----------------+ | | +----------------+ |
| | | | | | | | | | | |
+----------+--+ +--+----------+ +----------+--+ +--+----------+
| ToR1 | | ToR2 | Spine | ToR1 | | ToR2 | Spine
+--+------+---+ +--+-------+--+ +--+------+---+ +--+-------+--+
+---+ | | | | | | +---+ +---+ | | | | | | +---+
| | | | | | | |
| +-----------------+ | | | | +-----------------+ | | |
| | | +-------------+ | | | | | +-------------+ | |
+ | + | | |-----------------+ | + | + | | +-----------------+ |
X | X | +--------x-----+ | X | X | X | +--------x-----+ | X |
+ | + | | | + | + | + | | | + |
+---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+
| | | | | | | | | | | | | | | |
+---+ +---+ ...............+---+ +---+ +---+ +---+ ...............+---+ +---+
SV(1) SV(2) SV(n+1) SV(n) Leaf SV(1) SV(2) SV(n-1) SV(n) Leaf
Figure 11: Dual-homing servers Figure 11: Dual-homing servers
In the single plane, the worst condition is disaggregation of every In the single plane, the worst condition is disaggregation of every
other server at the same level. Suppose the links from ToR1 (Top of other server at the same level. Suppose the links from ToR1 (Top of
Rack) to all the leaves become unavailable. All the servers' Rack) to all the leaves become unavailable. All the servers'
routes are disaggregated and the FIB of the servers will be expanded routes are disaggregated and the FIB of the servers will be expanded
with n-1 more specific routes. with n-1 more specific routes.
Sometimes, people may prefer to disaggregate from ToR to servers from Sometimes, people may prefer to disaggregate from ToR to servers from
skipping to change at page 29, line 42 skipping to change at page 29, line 36
+-----+ +-----+ +-----+ +-----+
Figure 12: Fabric with a controller Figure 12: Fabric with a controller
5.11.1. Controller Attached to ToFs 5.11.1. Controller Attached to ToFs
If a controller is attaching to the RIFT domain from ToF, it usually If a controller is attaching to the RIFT domain from ToF, it usually
uses dual-homing connections. The loopback prefix of the controller uses dual-homing connections. The loopback prefix of the controller
should be advertised down by the ToF and spine to leaves. If the should be advertised down by the ToF and spine to leaves. If the
controller loses its link to the ToF, make sure the ToF withdraws the prefix controller loses its link to the ToF, make sure the ToF withdraws the prefix
of the controller(use different mechanisms). of the controller.
5.11.2. Controller Attached to Leaf 5.11.2. Controller Attached to Leaf
If the controller is attaching from a leaf to the fabric, no special If the controller is attaching from a leaf to the fabric, no special
provisions are needed. provisions are needed.
5.12. Internet Connectivity With Underlay 5.12. Internet Connectivity Within Underlay
If global addressing is running without overlay, an external default If global addressing is running without overlay, an external default
route needs to be advertised through RIFT fabric to achieve internet route needs to be advertised through RIFT fabric to achieve internet
connectivity. For the purpose of forwarding of the entire RIFT connectivity. For the purpose of forwarding of the entire RIFT
fabric, an internal fabric prefix needs to be advertised in the South fabric, an internal fabric prefix needs to be advertised in the South
Prefix TIE by ToF and spine nodes. Prefix TIE by ToF and spine nodes.
5.12.1. Internet Default on the Leaf 5.12.1. Internet Default on the Leaf
In case that an internet access request comes from a leaf and the In case that the internet gateway is a leaf, the leaf node as the
internet gateway is another leaf, the leaf node as the internet internet gateway needs to advertise a default route in its Prefix
gateway needs to advertise a default route in its Prefix North TIE. North TIE.
5.12.2. Internet Default on the ToFs 5.12.2. Internet Default on the ToFs
In case that an internet access request comes from a leaf and the In case that the internet gateway is a ToF, the ToF and spine nodes
internet gateway is a ToF, the ToF and spine nodes need to advertise need to advertise a default route in the Prefix South TIE.
a default route in the Prefix South TIE.
5.13. Subnet Mismatch and Address Families 5.13. Subnet Mismatch and Address Families
+--------+ +--------+ +--------+ +--------+
| | LIE LIE | | | | LIE LIE | |
| A | +----> <----+ | B | | A | +----> <----+ | B |
| +---------------------+ | | +---------------------+ |
+--------+ +--------+ +--------+ +--------+
X/24 Y/24 X/24 Y/24
skipping to change at page 31, line 12 skipping to change at page 31, line 4
node A and B may form, but the forwarding between node A and node B node A and B may form, but the forwarding between node A and node B
may fail because subnet X mismatches with subnet Y. may fail because subnet X mismatches with subnet Y.
To prevent this, a RIFT implementation should check for subnet To prevent this, a RIFT implementation should check for subnet
mismatch just like, e.g., IS-IS does. This can lead to scenarios where mismatch just like, e.g., IS-IS does. This can lead to scenarios where
an adjacency, despite the exchange of LIEs in both address families, an adjacency, despite the exchange of LIEs in both address families,
ends up being formed in a single AF only. This is a ends up being formed in a single AF only. This is a
consideration especially in Section 5.8 scenarios. consideration especially in Section 5.8 scenarios.
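An implementation might perform the subnet check per address family as sketched below in Python, using the standard `ipaddress` module. This is a sketch under assumptions, not behavior mandated by [RIFT]. The addresses are documentation examples chosen for illustration. The neighbor addresses are assumed to be taken from the received LIEs. `usable_families()` is a hypothetical helper. In the example the IPv4 subnets mismatch while the IPv6 subnets match, so the adjacency would be formed in a single AF only.

```python
import ipaddress
from typing import Iterable, Set


def same_subnet(local_if: str, neighbor_addr: str) -> bool:
    """True if neighbor_addr falls inside the subnet configured on local_if."""
    iface = ipaddress.ip_interface(local_if)
    addr = ipaddress.ip_address(neighbor_addr)
    return addr.version == iface.version and addr in iface.network


def usable_families(local_ifs: Iterable[str],
                    neighbor_addrs: Iterable[str]) -> Set[int]:
    """Address families (4 or 6) in which some local subnet matches the neighbor."""
    neighbors = list(neighbor_addrs)
    return {
        ipaddress.ip_interface(local_if).version
        for local_if in local_ifs
        for nbr in neighbors
        if same_subnet(local_if, nbr)
    }


# Node A's interface addresses and the addresses node B sends in its LIEs.
node_a = ["192.0.2.1/24", "2001:db8:1::1/64"]
node_b = ["198.51.100.2", "2001:db8:1::2"]
print(usable_families(node_a, node_b))  # {6}
```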
5.14. Anycast Considerations 5.14. Anycast Considerations
+ traffic + traffic
| |
v v
+------+------+ +------+------+
| ToF | | ToF |
+---+-----+---+ +---+-----+---+
| | | | | | | |
+------------+ | | +------------+ +------------+ | | +------------+
| | | | | | | |
+---+---+ +-------+ +-------+ +---+---+ +---+---+ +-------+ +-------+ +---+---+
| | | | | | | | | | | | | | | |
|Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1 |Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1
+-+---+-+ ++----+-+ +-+---+-+ ++----+-+ +-+---+-+ ++----+-+ +-+---+-+ ++----+-+
| | | | | | | | | | | | | | | |
| +---------+ | | +---------+ | | +---------+ | | +---------+ |
| | | | | | | |
| +-------+ | | | +-------+ | | | +-------+ | | | +-------+ | |
| | | | | | | | | | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
| | | | | | | | | | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-----+-+ +-+-----+ ++------+ +-----+-+ +-----+-+
+ + + ^ | + + + ^ |
PrefixA PrefixB PrefixA | PrefixC PrefixA PrefixB PrefixA | PrefixC
| |
+ traffic + traffic
Figure 14: Anycast Figure 14: Anycast
If the traffic comes from ToF to Leaf111 or Leaf121 which has anycast If the traffic comes from ToF to Leaf111 or Leaf121 which has anycast
prefix PrefixA. RIFT can deal with this case well. But if the prefix PrefixA, RIFT can deal with this case well. But if the
traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1. traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1.
But Spine21 or Spine22 doesn't know about the other PrefixA attached to But Spine21 or Spine22 doesn't know about the other PrefixA attached to
Leaf111. So it will always get to Leaf121 and never get to Leaf111. Leaf111. So it will always get to Leaf121 and never get to Leaf111.
If the intention is that the traffic should be offloaded to If the intention is that the traffic should be offloaded to
Leaf111, then use policy guided prefixes defined in RIFT [RIFT]. Leaf111, then use policy guided prefixes defined in RIFT [RIFT].
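The anycast behaviour above can be reproduced with a small reachability model of Figure 14. This Python sketch is a hypothetical simplification, not RIFT's actual route computation: the ToF layer is collapsed into one node, each node is assumed to learn exactly the prefixes advertised from its southern side, and policy guided prefixes and disaggregation are ignored. It shows that traffic entering at the ToF can reach both PrefixA instances, while traffic from Leaf122 that climbs to Spine21 or Spine22 only ever reaches Leaf121.

```python
# Hypothetical model of Figure 14. The single ToF node, the adjacency lists
# and the reachability rule are assumptions; PGP and disaggregation are ignored.
LEAF_PREFIXES = {
    "Leaf111": {"PrefixA"},
    "Leaf112": {"PrefixB"},
    "Leaf121": {"PrefixA"},
    "Leaf122": {"PrefixC"},
}
SOUTH_NEIGHBORS = {
    "Spine11": ["Leaf111", "Leaf112"],
    "Spine12": ["Leaf111", "Leaf112"],
    "Spine21": ["Leaf121", "Leaf122"],
    "Spine22": ["Leaf121", "Leaf122"],
    "ToF": ["Spine11", "Spine12", "Spine21", "Spine22"],
}


def prefixes_below(node: str) -> set:
    """All prefixes a node learns from its southern side."""
    if node in LEAF_PREFIXES:
        return LEAF_PREFIXES[node]
    result = set()
    for child in SOUTH_NEIGHBORS[node]:
        result |= prefixes_below(child)
    return result


def leaves_reached(node: str, prefix: str) -> set:
    """Leaves that southbound ECMP forwarding from `node` can deliver `prefix` to."""
    if node in LEAF_PREFIXES:
        return {node} if prefix in LEAF_PREFIXES[node] else set()
    reached = set()
    for child in SOUTH_NEIGHBORS[node]:
        if prefix in prefixes_below(child):  # child advertised the prefix north
            reached |= leaves_reached(child, prefix)
    return reached


# Traffic entering at the ToF can reach both anycast instances.
print(sorted(leaves_reached("ToF", "PrefixA")))  # ['Leaf111', 'Leaf121']
# Traffic from Leaf122 goes north on its default route to Spine21 or Spine22,
# which only know the PrefixA instance behind Leaf121.
from_leaf122 = leaves_reached("Spine21", "PrefixA") | leaves_reached("Spine22", "PrefixA")
print(sorted(from_leaf122))  # ['Leaf121']
```

Because Spine21 and Spine22 hold a more specific southbound route for PrefixA, they never fall back to their default route towards the ToF, so Leaf111 stays unreachable for this traffic unless policy changes the forwarding.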
5.15. IoT Applicability 5.15. IoT Applicability
The design of RIFT inherits from RPL [RFC6550] the anisotropic design The design of RIFT inherits from RPL [RFC6550] the anisotropic design
of a default route upwards (northwards); it also inherits the of a default route upwards (northwards); it also inherits the
capability to inject external host routes at the Leaf level using capability to inject external host routes at the Leaf level using
Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host
and a RIFT router. Both the RPL and the RIFT protocols are meant for and a RIFT router. Both the RPL and the RIFT protocols are meant for
large scale, and WiND enables device mobility at the edge the same large scale, and WiND enables device mobility at the edge the same
way in both cases. way in both cases.
The main difference between RIFT and RPL is that with RPL, there's a The main difference between RIFT and RPL is that with RPL, there's a
single Root, whereas RIFT has many ToF nodes. The adds huge single Root, whereas RIFT has many ToF nodes. This adds huge
capabilities for leaf-2-leaf ECMP paths, but additional complexity capabilities for leaf-2-leaf ECMP paths, but additional complexity
with the need to disaggregate. Also RIFT uses Link State flooding with the need to disaggregate. Also RIFT uses Link State flooding
northwards, and is not designed for low-power operation. northwards, and is not designed for low-power operation.
Still, nothing prevents the IP devices connected at the Leaf from being Still, nothing prevents the IP devices connected at the Leaf from being
IoT (Internet of Things) devices, which typically expose their IoT (Internet of Things) devices, which typically expose their
address using WiND - which is an upgrade from 6LoWPAN ND [RFC6775]. address using WiND - which is an upgrade from 6LoWPAN ND [RFC6775].
A network that serves high-speed/high-power IoT devices should A network that serves high-speed/high-power IoT devices should
typically provide deterministic capabilities for applications such as typically provide deterministic capabilities for applications such as
high speed control loops or movement detection. The Fat Tree is high speed control loops or movement detection. The Fat Tree is
highly reliable, and in normal conditions provides an equilatent highly reliable, and in normal conditions provides an equivalent
multipath operation; but the ECMP doesn't provide hard guarantees for multipath operation; but the ECMP doesn't provide hard guarantees for
either delivery or latency. As long as the fabric is non-blocking either delivery or latency. As long as the fabric is non-blocking
the result is the same; but there can be load imbalances resulting in the result is the same; but there can be load imbalances resulting in
incast and possibly congestion loss that will prevent the delivery incast and possibly congestion loss that will prevent the delivery
within bounded latency. within bounded latency.
This could be alleviated with Packet Replication, Elimination and This could be alleviated with Packet Replication, Elimination and
Reordering (PREOF) [RFC8655] leaf-2-leaf but PREOF is hard to provide Reordering (PREOF) [RFC8655] leaf-2-leaf but PREOF is hard to provide
at the scale of all flows, and the replication may increase the at the scale of all flows, and the replication may increase the
probability of the overload that it attempts to solve. probability of the overload that it attempts to solve.
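To make the replication and elimination idea concrete, here is a minimal Python sketch of two of the functions named in [RFC8655]. It is a hypothetical simplification rather than a DetNet implementation: the (sequence number, payload) packet format, the function names and the 64-entry history window are all assumptions, and the ordering function is deliberately left out. It shows both sides of the trade-off described above: losses on one path are masked by the other, but every packet is carried once per path.

```python
from collections import deque
from typing import List, Tuple

Packet = Tuple[int, str]  # (sequence number, payload); format is hypothetical


def replicate(packets: List[Packet], n_paths: int = 2) -> List[List[Packet]]:
    """Replication: send an identical copy of the flow on each path."""
    return [list(packets) for _ in range(n_paths)]


def eliminate(arrivals: List[Packet], window: int = 64) -> List[Packet]:
    """Elimination: deliver the first copy of each sequence number, drop the rest.

    Only the last `window` sequence numbers are remembered to bound state;
    the default of 64 is an arbitrary assumption.
    """
    seen = set()
    history = deque()
    delivered = []
    for seq, payload in arrivals:
        if seq in seen:
            continue  # duplicate that arrived over another path
        delivered.append((seq, payload))
        seen.add(seq)
        history.append(seq)
        if len(history) > window:
            seen.discard(history.popleft())
    return delivered


stream = [(i, f"p{i}") for i in range(5)]
path_a, path_b = replicate(stream)
path_a = [p for p in path_a if p[0] != 2]  # path A loses packet 2
path_b = [p for p in path_b if p[0] != 3]  # path B loses packet 3
out = eliminate(path_a + path_b)
print(sorted(seq for seq, _ in out))  # [0, 1, 2, 3, 4]
print(len(path_a) + len(path_b), "packets carried for", len(stream), "delivered")
```

All five packets are delivered, but eight packets crossed the fabric, and they come out in the order 0, 1, 3, 4, 2, which is why PREOF also includes an ordering function that this sketch omits.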
End of changes. 56 change blocks.
122 lines changed or deleted 110 lines changed or added

This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/