< draft-wei-rift-applicability-01.txt   draft-wei-rift-applicability-02.txt >
RIFT WG Yuehua. Wei RIFT WG Yuehua. Wei
Internet-Draft Zheng. Zhang Internet-Draft Zheng. Zhang
Intended status: Standards Track ZTE Corporation Intended status: Standards Track ZTE Corporation
Expires: December 21, 2019 Dmitry. Afanasiev Expires: May 6, 2020 Dmitry. Afanasiev
Yandex Yandex
Tom. Verhaeg Tom. Verhaeg
Interconnect Services B.V. Interconnect Services B.V.
Jaroslaw. Kowalczyk Jaroslaw. Kowalczyk
Orange Polska Orange Polska
June 19, 2019 November 3, 2019
RIFT Applicability RIFT Applicability
draft-wei-rift-applicability-01 draft-wei-rift-applicability-02
Abstract Abstract
This document discusses the properties and applicability of RIFT in This document discusses the properties, applicability and operational
different network topologies. It intends to provide a rough guide considerations of RIFT in different network scenarios. It intends to
how RIFT can be deployed to simplify routing operations in Clos provide a rough guide how RIFT can be deployed to simplify routing
topologies and their variations. operations in Clos topologies and their variations.
Status of This Memo Status of This Memo
This Internet-Draft is submitted in full conformance with the This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79. provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/. Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 21, 2019. This Internet-Draft will expire on May 6, 2020.
Copyright Notice Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of (https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Problem statement of a Fat Tree network in modern IP fabric . 2 2. Problem Statement of Routing in Modern IP Fabric Fat Tree
3. Why ritf is chosen to address this use case . . . . . . . . . 3 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3
3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 3 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 3
3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 5 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 5
3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 5 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6
3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6
3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 6
3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 6 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 6
3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 6 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 7
3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 6 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 7
3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 7 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 7
3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 7 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 7
4. Operational Simplifications and Considerations . . . . . . . 9 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 9
4.1. Automatic Disaggregation . . . . . . . . . . . . . . . . 10 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 10
4.1.1. South reflection . . . . . . . . . . . . . . . . . . 10 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 10
4.1.2. Suboptimal routing upon link failure use case . . . . 10 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 12
4.1.3. Black-holing upon link failure use case . . . . . . . 12 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 13
4.2. Usage of ZTP . . . . . . . . . . . . . . . . . . . . . . 13 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 13
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 13 4.6. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 16
6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 13 4.7. In-Band Reachability of Nodes . . . . . . . . . . . . . . 17
7. Normative References . . . . . . . . . . . . . . . . . . . . 14 4.7.1. Reachability of Leafs . . . . . . . . . . . . . . . . 17
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 15 4.7.2. Reachability of Spines . . . . . . . . . . . . . . . 17
4.8. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 17
4.9. Fabric With A Controller . . . . . . . . . . . . . . . . 18
4.9.1. Controller Attached to ToFs . . . . . . . . . . . . . 19
4.9.2. Controller Attached to Leaf . . . . . . . . . . . . . 19
4.10. Internet Connectivity Without Underlay . . . . . . . . . 19
4.10.1. Internet Default on the Leafs . . . . . . . . . . . 19
4.10.2. Internet Default on the ToFs . . . . . . . . . . . . 20
4.11. Subnet Mismatch and Address Families . . . . . . . . . . 20
4.12. Anycast Considerations . . . . . . . . . . . . . . . . . 20
5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 21
6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 21
7. Normative References . . . . . . . . . . . . . . . . . . . . 22
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23
1. Introduction 1. Introduction
This document intends to explain the properties and applicability of This document intends to explain the properties and applicability of
RIFT [I-D.ietf-rift-rift] in different deployment scenarios and RIFT [I-D.ietf-rift-rift] in different deployment scenarios and
highlight the operational simplicity of the technology compared to highlight the operational simplicity of the technology compared to
traditional routing solutions. traditional routing solutions. It also documents special
considerations when RIFT is used with or without overlays and
controllers, and when it corrects topology miscablings and/or node
and link failures.
2. Problem statement of a Fat Tree network in modern IP fabric 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks
Clos and Fat-Tree topologies have gained prominence in today's Clos and Fat-Tree topologies have gained prominence in today's
networking, primarily as a result of the paradigm shift towards a networking, primarily as a result of the paradigm shift towards a
centralized data-center based architecture that is poised to deliver centralized data-center based architecture that is poised to deliver
a majority of computation and storage services in the future. a majority of computation and storage services in the future.
Today's routing protocols were originally geared towards a network with Today's routing protocols were originally geared towards a network with
an irregular topology and a low degree of connectivity. an irregular topology and a low degree of connectivity.
When they are applied to Fat-Tree topologies: When they are applied to Fat-Tree topologies:
o There are always extensive configuration or provisioning during o they tend to need extensive configuration or provisioning during
bring up and re-dimensioning. bring up and re-dimensioning.
o Both the spine node and the leaf node have the entire network o spine and leaf nodes have the entire network topology and routing
topology and routing information, but in fact, the leaf node does information, which is in fact, not needed on the leaf nodes during
not need so much complete information. normal operation.
o There is significant Link State PDUs (LSPs) flooding duplication
between spine nodes and leaf nodes during network bring up and
topology update. It consumes both spine and leaf nodes' CPU and
link bandwidth resources.
o When a spine node advertises a topology change, every leaf node o significant Link State PDUs (LSPs) flooding duplication between
connected to it will flood the update to all the other spine spine nodes and leaf nodes occurs during network bring up and
nodes, and those spine nodes will further flood them to all the topology updates. It consumes both spine and leaf nodes' CPU and
leaf nodes, causing a O(n^2) flooding storm which is largely link bandwidth resources and with that limits protocol
redundant. scalability.
3. Why ritf is chosen to address this use case 3. Applicability of RIFT to Clos IP Fabrics
Further content of this document assumes that the reader is familiar Further content of this document assumes that the reader is familiar
with the terms and concepts used in OSPF [RFC2328] and IS-IS with the terms and concepts used in OSPF [RFC2328] and IS-IS
[ISO10589-Second-Edition] link-state protocols and at least the [ISO10589-Second-Edition] link-state protocols and at least the
sections of RIFT [I-D.ietf-rift-rift] outlining the requirement of sections of RIFT [I-D.ietf-rift-rift] outlining the requirement of
routing in IP fabrics and RIFT protocol concepts. routing in IP fabrics and RIFT protocol concepts.
3.1. Overview of RIFT 3.1. Overview of RIFT
RIFT is a dynamic routing protocol for Clos and fat-tree network RIFT is a dynamic routing protocol for Clos and fat-tree network
skipping to change at page 4, line 5 skipping to change at page 4, line 16
level obtains the full topology of levels south of it. That level obtains the full topology of levels south of it. That
information is never flooded East-West or back South again. So a top information is never flooded East-West or back South again. So a top
tier node has full set of prefixes from the SPF calculation. tier node has full set of prefixes from the SPF calculation.
In the southbound direction the protocol operates like a "fully In the southbound direction the protocol operates like a "fully
summarizing, unidirectional" path vector protocol or rather a summarizing, unidirectional" path vector protocol or rather a
distance vector with implicit split horizon whereas the information distance vector with implicit split horizon whereas the information
propagates one hop south and is 're-advertised' by nodes at next propagates one hop south and is 're-advertised' by nodes at next
lower level, normally just the default route. lower level, normally just the default route.
+-----------+ +-----------+ +-----------+ +-----------+
| ToF | | ToF | LEVEL 2 | ToF | | ToF | LEVEL 2
+ +-----+--+--+ +-+--+------+ + +-----+--+--+ +-+--+------+
| | | | | | | | | ^ | | | | | | | | | ^
+ | | | +-------------------------+ | + | | | +-------------------------+ |
Distance | +-------------------+ | | | | | Distance | +-------------------+ | | | | |
Vector | | | | | | | | + Vector | | | | | | | | +
South | | | | +--------+ | | | Link+State South | | | | +--------+ | | | Link+State
+ | | | | | | | | Flooding + | | | | | | | | Flooding
| | | +-------------+ | | | North | | | +-------------+ | | | North
v | | | | | | | | + v | | | | | | | | +
+-+--+-+ +------+ +-------+ +--+--+-+ | +-+--+-+ +------+ +-------+ +--+--+-+ |
|SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1
+ ++----++ ++---+-+ +--+--+-+ ++----+-+ | + ++----++ ++---+-+ +--+--+-+ ++----+-+ |
+ | | | | | | | | | ^N + | | | | | | | | | ^ N
Distance | +-------+ | | +--------+ | | | E Distance | +-------+ | | +--------+ | | | E
Vector | | | | | | | | | +------> Vector | | | | | | | | | +------>
South | +-------+ | | | +-------+ | | | | South | +-------+ | | | +-------+ | | | |
+ | | | | | | | | | + + | | | | | | | | | +
v ++--++ +-+-++ ++-+-+ +-+--++ + v ++--++ +-+-++ ++-+-+ +-+--++ +
|LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0
+----+ +----+ +----+ +-----+ +----+ +----+ +----+ +-----+
Figure 1: RIFT overview Figure 1: RIFT overview
A middle tier node has only information necessary for its level, A middle tier node has only information necessary for its level,
which are all destinations south of the node based on SPF which are all destinations south of the node based on SPF
calculation, default route and potential disaggregated routes. calculation, default route and potential disaggregated routes.
RIFT combines the advantage of both Link-State and Distance Vector: RIFT combines the advantage of both Link-State and Distance Vector:
o Fastest Possible Convergence o Fastest Possible Convergence
skipping to change at page 5, line 51 skipping to change at page 6, line 17
allow the reconciliation of topology view of different planes as most allow the reconciliation of topology view of different planes as most
desirable solution making proper disaggregation viable in case of desirable solution making proper disaggregation viable in case of
failures. These observations hold not only in case of RIFT but in the failures. These observations hold not only in case of RIFT but in the
generic case of dynamic routing on Clos variants with multiple planes generic case of dynamic routing on Clos variants with multiple planes
and failures in bi-sectional bandwidth, especially on the leafs. and failures in bi-sectional bandwidth, especially on the leafs.
3.2.1. Horizontal Links 3.2.1. Horizontal Links
RIFT is not limited to pure Clos divided into PoD and multi-planes RIFT is not limited to pure Clos divided into PoD and multi-planes
but supports horizontal links below the top of fabric level. Those but supports horizontal links below the top of fabric level. Those
links are used however only as routes of last resort when a spine links are used however only as routes of last resort northbound when
loses all northbound links or cannot compute a default route through a spine loses all northbound links or cannot compute a default route
them. through them.
A possible configuration is a "ring" of horizontal links at a level.
In presence of such a "ring" in any level (except ToF level) neither
N-SPF nor S-SPF will provide a "ring-based protection" scheme since
such a computation would have to deal necessarily with breaking of
"loops" in Dijkstra sense; an application for which RIFT is not
intended.
A full-mesh connectivity between nodes on the same level can be
employed and that allows N-SPF to provide for any node losing all
its northbound adjacencies (as long as any of the other nodes in the
level are northbound connected) to still participate in northbound
forwarding.
3.2.2. Vertical Shortcuts 3.2.2. Vertical Shortcuts
Through relaxations of the specified adjacency forming rules RIFT Through relaxations of the specified adjacency forming rules RIFT
implementations can be extended to support vertical "shortcuts" as implementations can be extended to support vertical "shortcuts" as
proposed by e.g. [I-D.white-distoptflood]. The RIFT specification proposed by e.g. [I-D.white-distoptflood]. The RIFT specification
itself does not provide the exact details since the resulting itself does not provide the exact details since the resulting
solution suffers from either much larger blast radii with increased solution suffers from either much larger blast radii with increased
flooding volumes or in case of maximum aggregation routing bow-tie flooding volumes or in case of maximum aggregation routing bow-tie
problems. problems.
skipping to change at page 8, line 21 skipping to change at page 8, line 21
| | | | +-------------------------+ | | | | | | | +-------------------------+ | | |
| | | | | | | | | | | | | | | | | | | | | | | |
| | +----------------------+ | | | | | | | | | | +----------------------+ | | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | |
| +---------------------------------+ | | | | | | | | +---------------------------------+ | | | | | | |
| | | | | | | | | | | | | | | | | | | | | | | |
| | | +-----------------------------+ | | | | | | | | +-----------------------------+ | | | | |
| | | | | | | | | | | | | | | | | | | | | | | |
| | | | | +--------------------+ | | | | | | | | | +--------------------+ | | | |
| | | | | | | | | | | | | | | | | | | | | | | |
| | | | | | | | | | | |
+--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+ +--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+
|L | | Leaf | | Leaf | | Leaf | | Leaf | |L | |L | | Leaf | | Leaf | | Leaf | | Leaf | |L |
|S | | Switch | | Switch | | Switch | | Switch| |S | |S | | Switch | | Switch | | Switch | | Switch| |S |
++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++ ++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++
| | | | | | | | | | | | | | | | | | | | | | | | | | | |
| +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ | | +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ |
| |Compute | |Compute | | Compute | |Compute| | | |Compute | |Compute | | Compute | |Compute| |
| |Node | |Node | | Node | |Node | | | |Node | |Node | | Node | |Node | |
| | | | | | | | | |
| +--------+ +--------+ +----------+ +-------+ | | +--------+ +--------+ +----------+ +-------+ |
| || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || | | || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || | | || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| || VAS7 || || VAS4 || || vIGMP || ||BAA || | | || VAS7 || || VAS4 || || vIGMP || ||BAA || |
| |--------| |--------| |----------| |-------| | | |--------| |--------| |----------| |-------| |
| +--------+ +--------+ +----------+ +-------+ | | +--------+ +--------+ +----------+ +-------+ |
| | | |
++-----------+ +---------++ ++-----------+ +---------++
|Network I/O | |Access I/O| |Network I/O | |Access I/O|
+------------+ +----------+ +------------+ +----------+
Figure 2: An example of CloudCo architecture Figure 2: An example of CloudCO architecture
The Spine-Leaf architecture deployed inside CloudCO meets the The Spine-Leaf architecture deployed inside CloudCO meets the
network requirements of being adaptable, agile, scalable and dynamic. network requirements of being adaptable, agile, scalable and dynamic.
4. Operational Simplifications and Considerations 4. Deployment Considerations
RIFT presents the opportunity for organizations building and RIFT presents the opportunity for organizations building and
operating IP fabrics to simplify their operation and deployments operating IP fabrics to simplify their operation and deployments
while achieving many desirable properties of a dynamic routing on while achieving many desirable properties of a dynamic routing on
such a substrate: such a substrate:
o RIFT design follows minimum blast radius and minimum necessary o RIFT design follows minimum blast radius and minimum necessary
epistemological scope philosophy which leads to very good scaling epistemological scope philosophy which leads to very good scaling
properties while delivering maximum reactiveness. properties while delivering maximum reactiveness.
skipping to change at page 10, line 11 skipping to change at page 10, line 11
o RIFT is designed for minimum delay in case of prefix mobility on o RIFT is designed for minimum delay in case of prefix mobility on
the fabric. the fabric.
o Many further operational and design points collected over many o Many further operational and design points collected over many
years of routing protocol deployments have been incorporated in years of routing protocol deployments have been incorporated in
RIFT such as fast flooding rates, protection of information RIFT such as fast flooding rates, protection of information
lifetimes and operationally easily recognizable remote ends of lifetimes and operationally easily recognizable remote ends of
links and node names. links and node names.
4.1. Automatic Disaggregation 4.1. South Reflection
4.1.1. South reflection
South reflection is a mechanism by which South Node TIEs are "reflected" South reflection is a mechanism by which South Node TIEs are "reflected"
back up north to allow nodes in the same level without E-W links to "see" back up north to allow nodes in the same level without E-W links to "see"
each other. each other.
For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs
from ToF21 to ToF22 separately. Spine111\Spine112\Spine121\Spine122 from ToF21 to ToF22 separately. Respectively,
reflects Node S-TIEs from ToF22 to ToF21 separately. So ToF22 and Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22
ToF21 knows each other as level 2 node. to ToF21 separately. So ToF22 and ToF21 see each other's node
information as level 2 nodes.
As the result of the south reflection between
Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and
Spine122 know each other at level 1.
This is a use case to explain the deployment of a Fat-Tree and the In an equivalent fashion, as the result of the south reflection
algorithm to achieve automatic disaggregation. between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122,
Spine121 and Spine122 know each other at level 1.
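The effect of south reflection can be illustrated with a small sketch. It is not the RIFT flooding procedure itself, only a model under simplifying assumptions: every lower-level node reflects the node information of each northbound neighbor to all its other northbound neighbors. The function name `reflected_view` and the adjacency table are hypothetical and follow the node names used in the figures of this section.

```python
from collections import defaultdict


def reflected_view(northbound):
    """Return, for each upper node, the same-level peers it learns about.

    `northbound` maps a lower-level node to the set of upper-level nodes
    it is adjacent to. A lower node is assumed to reflect the node
    information of every upper neighbor to all its other upper neighbors.
    """
    seen = defaultdict(set)
    for uppers in northbound.values():
        for upper in uppers:
            # Everything the lower node heard from the other upper
            # neighbors is reflected north to `upper`.
            seen[upper] |= uppers - {upper}
    return dict(seen)


# Hypothetical adjacency table matching the example fabric.
northbound = {
    "Spine111": {"ToF21", "ToF22"},
    "Spine112": {"ToF21", "ToF22"},
    "Spine121": {"ToF21", "ToF22"},
    "Spine122": {"ToF21", "ToF22"},
    "Leaf121": {"Spine121", "Spine122"},
    "Leaf122": {"Spine121", "Spine122"},
}

view = reflected_view(northbound)
print(view["ToF21"])     # peers ToF21 sees at level 2
print(view["Spine121"])  # peers Spine121 sees at level 1
```

In this model ToF21 and ToF22 see each other through any of the four spines, and Spine121 and Spine122 see each other through Leaf121 and Leaf122, without any East-West link being present.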
4.1.2. Suboptimal routing upon link failure use case 4.2. Suboptimal Routing on Link Failures
+--------+ +--------+ +--------+ +--------+
| | | | | ToF21 | | ToF22 | LEVEL 2
| ToF21 | | ToF22 | LEVEL 2 ++--+-+-++ ++-+--+-++
++-+--+-++ ++-+--+-++ | | | | | | | +
| | | | | | | | | | | | | | | linkTS8
| | | | | | | linkTS8 +-------------+ | +-+linkTS3+-+ | | | +--------------+
| | | | | | | | | | | | | | + |
| | | | | | | | | +----------------------------+ | linkTS7 |
+--------------+ | +--linkTS3--+ | | | +--------------+ | | | | + + + |
| | | | | | | | | | | +-------+linkTS4+------------+ |
| +-----------------------------+ | linkTS7 | | | | + + | | |
| | | | | | | | | | | +------------+--+ | |
| | | +--------linkTS4-------------+ | | | | | | linkTS6 | |
| | | | | | | | +-+----++ ++-----++ ++------+ ++-----++
| | +-+ +---------------+ | | |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1
| | | | | linkTS6 | | +-+---+-+ ++----+-+ +-+---+-+ ++---+--+
+-+----++ +-+-----+ ++----+-+ ++-----++ | | | | | | | |
| | | | | | | | | +--------------+ | + ++XX+linkSL6+---+ +
|Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 | | | | linkSL5 | | linkSL8
+-+---+-+ ++----+-+ +-+---+-+ ++---+--+ | +------------+ | | + +---+linkSL7+-+ | +
| | | | | | | | | | | | | | | |
| +---------------+ | | +-XX-linkSL6----+ | +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+
| | | | linkSL5 | | linkSL8 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
| +-------------+ | | | +----linkSL7--+ | | +-+-----+ ++------+ +-----+-+ +-+-----+
| | | | | | | | + + + +
+-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ Prefix111 Prefix112 Prefix121 Prefix122
| | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-+-----+
+ + + +
Prefix111 Prefix112 Prefix121 Prefix122
Figure 3: Suboptimal routing upon link failure use case Figure 3: Suboptimal routing upon link failure use case
As shown in figure above, as the result of the south reflection As shown in Figure 3, as the result of the south reflection between
between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and
Spine121 and Spine122 know each other at level 1. Spine122 know each other at level 1.
Without disaggregation mechanism, when linkSL6 fails, the packet from Without disaggregation mechanism, when linkSL6 fails, the packet from
leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 leaf121 to prefix122 will probably go up through linkSL5 to linkTS3
then go down through linkTS4 to linkSL8 to Leaf122 or go up through then go down through linkTS4 to linkSL8 to Leaf122 or go up through
linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to
Leaf122 based on pure default route. It's the case of suboptimal Leaf122 based on pure default route. It's the case of suboptimal
routing. routing or bow-tieing.
With disaggregation mechanism, when linkSL6 fails, Spine122 will With disaggregation mechanism, when linkSL6 fails, Spine122 will
detect the failure according to the reflected node S-TIE from detect the failure according to the reflected node S-TIE from
Spine121. Based on the disaggregation algorithm provided by RITF, Spine121. Based on the disaggregation algorithm provided by RIFT,
Spine122 will explicitly advertise prefix122 in Prefix S-TIE Spine122 will explicitly advertise prefix122 in Disaggregated Prefix
SouthPrefixesElement(prefix122, cost 1). The packet from leaf121 to S-TIE PrefixesElement(prefix122, cost 1). The packet from leaf121 to
prefix122 will only be sent to linkSL7 following a longest-prefix prefix122 will only be sent to linkSL7 following a longest-prefix
match to prefix122 directly, then go down through linkSL8 to Leaf122. match to prefix122 directly, then go down through linkSL8 to Leaf122.
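The forwarding change at Leaf121 can be sketched as a longest-prefix match over a small route table. This is an illustration only: the address 10.1.22.0/24 standing in for prefix122 is an assumption, the `lookup` helper is hypothetical, and the next hops are named after the leaf uplinks in Figure 3.

```python
import ipaddress


def lookup(rib, destination):
    """Return the next hops of the longest prefix covering `destination`."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hops) for net, hops in rib.items() if dest in net]
    _, next_hops = max(matches, key=lambda m: m[0].prefixlen)
    return next_hops


default = ipaddress.ip_network("0.0.0.0/0")
prefix122 = ipaddress.ip_network("10.1.22.0/24")  # assumed value for prefix122

# Leaf121 before disaggregation: only the default route over both uplinks.
rib = {default: ["linkSL5", "linkSL7"]}
before = lookup(rib, "10.1.22.7")

# Spine122 advertises prefix122 southbound; Leaf121 installs it via linkSL7.
rib[prefix122] = ["linkSL7"]
after = lookup(rib, "10.1.22.7")
other = lookup(rib, "10.1.11.7")  # unrelated traffic still uses the default

print(before, after, other)
```

Only traffic towards the disaggregated prefix is pinned to linkSL7; all other destinations keep using the default route over both uplinks.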
4.1.3. Black-holing upon link failure use case 4.3. Black-Holing on Link Failures
+--------+ +--------+ +--------+ +--------+
| | | |
| ToF 21 | | ToF 22 | LEVEL 2 | ToF 21 | | ToF 22 | LEVEL 2
++-+--+-++ ++-+--+-++ ++-+--+-++ ++-+--+-++
| | | | | | | | | | | | | | | |
| | | | | | | linkTS8 | | | | | | | linkTS8
| | | | | | | |
| | | | | | | |
+--------------+ | +--linkTS3-X+ | | | +--------------+ +--------------+ | +--linkTS3-X+ | | | +--------------+
linkTS1 | | | | | | | linkTS1 | | | | | | |
| +-----------------------------+ | linkTS7 | | +-----------------------------+ | linkTS7 |
| | | | | | | | | | | | | | | |
| | linkTS2 +--------linkTS4-X-----------+ | | | linkTS2 +--------linkTS4-X-----------+ |
| | | | | | | | | | | | | | | |
| linkTS5 +-+ +---------------+ | | | linkTS5 +-+ +---------------+ | |
| | | | | linkTS6 | | | | | | | linkTS6 | |
+-+----++ +-+-----+ ++----+-+ ++-----++ +-+----++ +-+-----+ ++----+-+ ++-----++
| | | | | | | |
|Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1
+-+---+-+ ++----+-+ +-+---+-+ ++---+--+ +-+---+-+ ++----+-+ +-+---+-+ ++---+--+
| | | | | | | | | | | | | | | |
| +---------------+ | | +----linkSL6----+ | | +---------------+ | | +----linkSL6----+ |
linkSL1 | | | linkSL5 | | linkSL8 linkSL1 | | | linkSL5 | | linkSL8
| +---linkSL3---+ | | | +----linkSL7--+ | | | +---linkSL3---+ | | | +----linkSL7--+ | |
| | | | | | | | | | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+
| | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-+-----+ +-+-----+ ++------+ +-----+-+ +-+-----+
+ + + + + + + +
Prefix111 Prefix112 Prefix121 Prefix122 Prefix111 Prefix112 Prefix121 Prefix122
Figure 4: Black-holing upon link failure use case Figure 4: Black-holing upon link failure use case
This scenario illustrates a case when double link failure occurs, This scenario illustrates a case when double link failure occurs and
black-holing happens. with that black-holing can happen.
Without disaggregation mechanism, when linkTS3 and linkTS4 both fail, Without disaggregation mechanism, when linkTS3 and linkTS4 both fail,
the packet from leaf111 to prefix122 would suffer 50% black-holing the packet from leaf111 to prefix122 would suffer 50% black-holing
based on pure default route. The packet supposed to go up through based on pure default route. The packet supposed to go up through
linkSL1 to linkTS1 then go down through linkTS3 or linkTS4 will be linkSL1 to linkTS1 then go down through linkTS3 or linkTS4 will be
dropped. The packet supposed to go up through linkSL3 to linkTS2 dropped. The packet supposed to go up through linkSL3 to linkTS2
then go down through linkTS3 or linkTS4 will be dropped as well. then go down through linkTS3 or linkTS4 will be dropped as well.
It's the case of black-holing. It's the case of black-holing.
With disaggregation mechanism, when linkTS3 and linkTS4 both fail, With disaggregation mechanism, when linkTS3 and linkTS4 both fail,
ToF22 will detect the failure according to the reflected node S-TIE ToF22 will detect the failure according to the reflected node S-TIE
of ToF21 from Spine111\Spine112\Spine121\Spine122. Based on the of ToF21 from Spine111\Spine112\Spine121\Spine122. Based on the
disaggregation algorithm provided by RIFT, ToF22 will explicitly disaggregation algorithm provided by RIFT, ToF22 will explicitly
originate an S-TIE with prefix 121 and prefix 122, that is flooded to originate an S-TIE with prefix 121 and prefix 122, that is flooded to
spines 111, 112, 121 and 122. spines 111, 112, 121 and 122.
The packet from leaf111 to prefix122 will not be routed to linkTS1 or The packet from leaf111 to prefix122 will not be routed to linkTS1 or
linkTS2. The packet from leaf111 to prefix122 will only be routed to linkTS2. The packet from leaf111 to prefix122 will only be routed to
linkTS5 or linkTS7 following a longest-prefix match to prefix122. linkTS5 or linkTS7 following a longest-prefix match to prefix122.
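The 50% figure can be checked with a short calculation. The sketch below assumes equal-cost splitting at every hop and models only the path from Leaf111 towards prefix122. The tables and the `delivered` helper are hypothetical names chosen for this illustration.

```python
from fractions import Fraction

# PoD-2 spines each ToF still reaches after linkTS3 and linkTS4 fail.
TOF_SOUTH = {"ToF21": set(), "ToF22": {"Spine121", "Spine122"}}

# Northbound next hops used by the PoD-1 spines on the default route.
SPINE_NORTH = {"Spine111": ["ToF21", "ToF22"],
               "Spine112": ["ToF21", "ToF22"]}

# Northbound next hops of Leaf111.
LEAF_NORTH = ["Spine111", "Spine112"]


def delivered(spine_north):
    """Fraction of Leaf111 traffic that reaches PoD 2 under equal splitting."""
    total = Fraction(0)
    for spine in LEAF_NORTH:
        tofs = spine_north[spine]
        for tof in tofs:
            share = Fraction(1, len(LEAF_NORTH)) * Fraction(1, len(tofs))
            if TOF_SOUTH[tof]:  # the ToF still has a way down
                total += share
    return total


# Pure default route: both ToFs are used, ToF21 drops its share.
without_disagg = delivered(SPINE_NORTH)

# With disaggregation the spines only use ToFs that can reach prefix122.
with_disagg = delivered(
    {s: [t for t in tofs if TOF_SOUTH[t]] for s, tofs in SPINE_NORTH.items()}
)

print(without_disagg, with_disagg)
```

Under these assumptions half of the traffic is lost on the pure default route, and all of it is delivered once the spines prefer the more specific route via ToF22.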
4.2. Usage of ZTP 4.4. Zero Touch Provisioning (ZTP)
Each RIFT node may operate in zero touch provisioning (ZTP) mode. It Each RIFT node may operate in zero touch provisioning (ZTP) mode. It
has no configuration (unless it is a Top-of-Fabric at the top of the has no configuration (unless it is a Top-of-Fabric at the top of the
topology or the must operate in the topology as leaf and/or support topology or it is desired to confine it to leaf role w/o leaf-2-leaf
leaf-2-leaf procedures) and it will fully configure itself after procedures). In such case RIFT will fully configure the node's level
being attached to the topology. after it is attached to the topology.
The most important component for ZTP is the automatic level derivation The most important component for ZTP is the automatic level derivation
procedure. All the Top-of-Fabric nodes are explicitly marked with procedure. All the Top-of-Fabric nodes are explicitly marked with
TOP_OF_FABRIC flag which are initial 'seeds' needed for other ZTP TOP_OF_FABRIC flag which are initial 'seeds' needed for other ZTP
nodes to derive their level in the topology. nodes to derive their level in the topology. The derivation of the
level of each node happens then based on LIEs received from its
The derivation of the level of each node happens based on LIEs neighbors whereas each node (with possibly exceptions of configured
received from its neighbors whereas each node (with possibly leafs) tries to attach at the highest possible point in the fabric.
exceptions of configured leafs) tries to attach at the highest
possible point in the fabric.
This guarantees that even if the diffusion front reaches a node from This guarantees that even if the diffusion front reaches a node from
"below" faster than from "above", it will greedily abandon already "below" faster than from "above", it will greedily abandon already
negotiated level derived from nodes topologically below it and negotiated level derived from nodes topologically below it and
properly peers with nodes above. properly peer with nodes above.
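The greedy behaviour described above can be sketched as follows. This
is a simplification under stated assumptions: a node is assumed to
derive its level as one below the highest level offered in received
LIEs, and the function name derive_level() is hypothetical. The RIFT
specification defines the complete procedure.

```python
def derive_level(offers):
    """Derive a level from the levels offered by neighbors in their LIEs.

    offers: iterable of int levels; None marks a neighbor with no level yet.
    Returns the derived level, or None if no valid level can be derived.
    Assumption: derived level = highest offered level - 1.
    """
    valid = [level for level in offers if level is not None]
    if not valid:
        return None
    level = max(valid) - 1
    # Negative levels are invalid, so the node stays without a level.
    return level if level >= 0 else None


# The diffusion front first reaches the node from "below" (offer 1),
# later from "above" (offer 3); the node re-derives and moves up.
seen = []
for offer in [1, 3]:
    seen.append(offer)
    print(offer, "->", derive_level(seen))
```

A node that only hears offers of level 0 cannot derive a valid level
in this sketch, because the result would be negative.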
4.5. Miscabling Examples
+----------------+ +-----------------+
| ToF21 | +------+ ToF22 | LEVEL 2
+-------+----+---+ | +----+---+--------+
| | | | | | | | |
| | | +----------------------------+ |
| +---------------------------+ | | | |
| | | | | | | | |
| | | | +-----------------------+ | |
| | +------------------------+ | | |
| | | | | | | | |
+-+---+-+ +-+---+-+ | +-+---+-+ +-+---+-+
|Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1
+-+---+-+ ++----+-+ | +-+---+-+ ++----+-+
| | | | | | | | |
| +---------+ | link-M | +---------+ |
| | | | | | | | |
| +-------+ | | | | +-------+ | |
| | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+
Figure 5: A single plane miscabling example
Figure 5 shows a single plane miscabling example. It is a perfect
fat-tree fabric except for link-M, which connects Leaf112 to ToF22.
The RIFT control protocol can discover the physical links
automatically and is able to detect cabling that violates fat-tree
topology constraints. It reacts accordingly to such miscabling
attempts, at a minimum preventing adjacencies between nodes from
being formed and traffic from being forwarded on those miscabled
links. In such a scenario Leaf112 will use link-M to derive its
level (unless it is configured as a leaf) and can report the links to
Spine111 and Spine112 as miscabled unless the implementation allows
horizontal links.
Figure 6 shows a multiple plane miscabling example. Since Leaf112
and Spine121 belong to two different PoDs, the adjacency between
Leaf112 and Spine121 cannot be formed. The miscabled link-W would be
detected and no adjacency would form over it.
+-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2
+-------+ +-------+ +-------+ +-------+
| | | | | | | |
| | | +-----------------+ | | |
| +--------------------------+ | | | |
| | | | | | | |
| +------+ | | | +------+ |
| | +-----------------+ | | | | |
| | | +--------------------------+ | |
| A | | B | | A | | B |
+-----+-+ +-+---+-+ +-+---+-+ +-+-----+
|Spin111| |Spin112| +----+Spin121| |Spin122| LEVEL 1
+-+---+-+ ++----+-+ | +-+---+-+ ++----+-+
| | | | | | | | |
| +---------+ | | | +---------+ |
| | | | link-W | | | |
| +-------+ | | | | +-------+ | |
| | | | | | | | |
+-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+
|Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0
+-------+ +-------+ +-------+ +-------+
+--------PoD#1----------+ +---------PoD#2---------+
Figure 6: A multiple plane miscabling example
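The two miscabling cases of Figure 5 and Figure 6 can be sketched as a
per-link check. This is deliberately simplified and not the adjacency
state machine of the RIFT specification, whose rules are more
detailed. The function check_link(), its return strings and the use
of PoD 0 as "any PoD" are assumptions made for this illustration.

```python
ANY_POD = 0  # assumption: 0 stands for "not bound to a particular PoD"


def check_link(my_level, my_pod, nbr_level, nbr_pod, allow_horizontal=False):
    """Classify a link from the levels and PoDs seen on both ends."""
    if my_pod != ANY_POD and nbr_pod != ANY_POD and my_pod != nbr_pod:
        return "pod-mismatch"
    if my_level == nbr_level and not allow_horizontal:
        return "horizontal"
    return "ok"


# Figure 6: Leaf112 (PoD 1) cabled to Spine121 (PoD 2) over link-W.
print(check_link(0, 1, 1, 2))

# Figure 5: Leaf112 derived level 1 via link-M, so its links to the
# level 1 spines now look horizontal.
print(check_link(1, ANY_POD, 1, ANY_POD))
print(check_link(1, ANY_POD, 1, ANY_POD, allow_horizontal=True))
```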
RIFT provides an optional level determination procedure in its Zero
Touch Provisioning mode. Nodes in the fabric without their level
configured determine it automatically. This can, however, have
counter-intuitive consequences. One extreme failure scenario is
depicted in Figure 7: if all northbound links of Spine11 fail at the
same time, Spine11 negotiates a lower level than Leaf111 and Leaf112.
To prevent such a scenario, in which leafs would be expected to act
as switches, the LEAF_ONLY flag can be set for Leaf111 and Leaf112.
Since level -1 is invalid, Spine11 would then not derive a valid
level from the topology in Figure 7. It will be isolated from the
whole fabric and it would be up to the leafs to declare the links
towards such a spine as miscabled.
+-------+ +-------+ +-------+ +-------+
|ToF A1| |ToF A2| |ToF A1| |ToF A2|
+-------+ +-------+ +-------+ +-------+
| | | | | |
| +-------+ | | |
+ + | | ====> | |
X X +------+ | +------+ |
+ + | | | |
+----+--+ +-+-----+ +-+-----+
|Spine11| |Spine12| |Spine12|
+-+---+-+ ++----+-+ ++----+-+
| | | | | |
| +---------+ | | |
| | | | | |
| +-------+ | | +-------+ |
| | | | | |
+-+---+-+ +--+--+-+ +-----+-+ +-----+-+
|Leaf111| |Leaf112| |Leaf111| |Leaf112|
+-------+ +-------+ +-+-----+ +-+-----+
| |
| +--------+
| |
+-+---+-+
|Spine11|
+-------+
Figure 7: Fallen spine
4.6. IPv4 over IPv6
RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. The
IPv6 AF is configured via the usual ND mechanisms and IPv4 can then
use IPv6 next hops analogous to RFC5549. It is expected that the whole fabric
supports the same type of forwarding of address families on all the
links. RIFT provides an indication whether a node is v4 forwarding
capable and implementations are possible where different routing
tables are computed per address family as long as the computation
remains loop-free.
+-----+ +-----+
+---+---+ | ToF | | ToF |
^ +--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
V6 +-----+ +-+---+
Forwarding |SPINE| |SPINE|
| +--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
v +-----+ +-+---+
+---+---+ |LEAF | | LEAF|
+--+--+ +--+--+
| |
IPv4 prefixes| |IPv4 prefixes
| |
+---+----+ +---+----+
| V4 | | V4 |
| subnet | | subnet |
+--------+ +--------+
Figure 8: IPv4 over IPv6
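A route of this kind pairs an IPv4 prefix with an IPv6 next hop. The
following sketch shows one way to model this. The Neighbor class, the
field name v4_forwarding_capable and the addresses are assumptions
chosen for illustration and do not reflect the RIFT schema.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Neighbor:
    name: str
    link_local: ipaddress.IPv6Address
    v4_forwarding_capable: bool  # hypothetical name for the capability indication


def v4_next_hops(neighbors):
    """Keep only IPv6 next hops of neighbors that can forward IPv4."""
    return [n.link_local for n in neighbors if n.v4_forwarding_capable]


neighbors = [
    Neighbor("spine-a", ipaddress.ip_address("fe80::a"), True),
    Neighbor("spine-b", ipaddress.ip_address("fe80::b"), False),
]

# IPv4 prefix installed with IPv6 next hops only.
rib = {ipaddress.ip_network("192.0.2.0/24"): v4_next_hops(neighbors)}

for prefix, hops in rib.items():
    print(prefix, "via", [str(h) for h in hops])
```

Filtering on the capability indication keeps IPv4 traffic away from
neighbors that cannot forward it.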
4.7. In-Band Reachability of Nodes
4.7.1. Reachability of Leafs
TODO
4.7.2. Reachability of Spines
TODO
4.8. Dual Homing Servers
Each RIFT node may operate in zero touch provisioning (ZTP) mode. It
has no configuration (unless it is a Top-of-Fabric at the top of the
topology or it must operate in the topology as leaf and/or support
leaf-2-leaf procedures) and it will fully configure itself after
being attached to the topology.
+---+ +---+ +---+
|ToF| |ToF| |ToF|
+---+ +---+ +---+
| | | | | |
| +----------------+ | |
| | | | | |
| +----------------+ |
| | | | | |
+----------+--+ +--+----------+
| Spine|ToR1 | | Spine|ToR2 |
+--+------+---+ +--+-------+--+
+---+ | | | | | | +---+
| | | | | | | |
| +-----------------+ | | |
| | | +-------------+ | |
+ | + | | |-----------------+ |
X | X | +--------x-----+ | X |
+ | + | | | + |
+---+ +---+ +---+ +---+
| | | | | | | |
+---+ +---+ ...............+---+ +---+
SV(1) SV(2) SV(n+1) SV(n)
Figure 9: Dual-homing servers
In the single plane, the worst condition is disaggregation of every
other server at the same level. Suppose the links from ToR1 to all
the leaves become unavailable. All the servers' routes are
disaggregated and the FIB of each server will be expanded with n-1
more specific routes.
Sometimes, people may prefer to disaggregate from ToR to servers
from the start, i.e. the servers have a couple of tens of routes in
the FIB from the start, besides the default routes, to avoid
breakages at rack level. Full disaggregation of the fabric can be
achieved by configuration supported by RIFT.
4.9. Fabric With A Controller
There are many different ways to deploy the controller. One
possibility is attaching a controller to the RIFT domain from ToF and
another possibility is attaching a controller from the leaf.
+------------+
| Controller |
++----------++
| |
| |
+----++ ++----+
---------- | ToF | | ToF |
| +--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
+-----+ +-+---+
RIFT domain |SPINE| |SPINE|
+--+--+ +-----+
| | | | |
| | +-------------+ |
| | +--------+ | |
| | | | |
| +-----+ +-+---+
---------- |LEAF | | LEAF|
+-----+ +-----+
Figure 10: Fabric with a controller
4.9.1. Controller Attached to ToFs
If a controller is attached to the RIFT domain at the ToFs, it
usually uses dual-homing connections. The loopback prefix of the
controller should be advertised down by the ToFs and spines to the
leaves. If the controller loses its link to a ToF, make sure that
the ToF withdraws the prefix of the controller (different mechanisms
can be used for this).
4.9.2. Controller Attached to Leaf
If the controller is attached to the fabric at a leaf, no special
provisions are needed.
4.10. Internet Connectivity Without Underlay
4.10.1. Internet Default on the Leafs
TODO
4.10.2. Internet Default on the ToFs
TODO
4.11. Subnet Mismatch and Address Families
+--------+ +--------+
| | LIE LIE | |
| A | +----> <----+ | B |
| +---------------------+ |
+--------+ +--------+
X/24 Y/24
Figure 11: subnet mismatch
LIEs are exchanged over all links running RIFT to perform Link
(Neighbor) Discovery. A node MUST NOT originate LIEs on an address
family if it does not process received LIEs on that family. LIEs on
the same link are considered part of the same negotiation independent of
the address family they arrive on. An implementation MUST be ready
to accept TIEs on all addresses it used as source of LIE frames.
As shown in the above figure, without further checks the adjacency
between node A and node B may form, but the forwarding between node A
and node B may fail because subnet X mismatches with subnet Y.
To prevent this, a RIFT implementation should check for subnet
mismatch just like e.g. IS-IS does. This can lead to scenarios where
a link, despite the exchange of LIEs in both address families, may
end up having an adjacency in a single AF only. This is a
consideration especially in Section 4.6 scenarios.
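A subnet mismatch check of this kind can be sketched as below. The
helper names same_subnet() and usable_families() and all addresses
are assumptions for illustration. How an implementation learns the
neighbor's interface addresses is outside the scope of this sketch.

```python
import ipaddress


def same_subnet(local_if, remote_if):
    """True if both interface addresses are in the same subnet of one AF."""
    a = ipaddress.ip_interface(local_if)
    b = ipaddress.ip_interface(remote_if)
    return a.version == b.version and a.network == b.network


def usable_families(local_ifs, remote_ifs):
    """Return the IP versions (4 and/or 6) that pass the subnet check."""
    families = set()
    for local in local_ifs:
        for remote in remote_ifs:
            if same_subnet(local, remote):
                families.add(ipaddress.ip_interface(local).version)
    return families


# Node A and node B: IPv4 subnets differ (X/24 vs Y/24), IPv6 matches.
node_a = ["192.0.2.1/24", "fe80::1/64"]
node_b = ["198.51.100.2/24", "fe80::2/64"]
print(usable_families(node_a, node_b))
```

In this example only IPv6 passes the check, which corresponds to the
case of an adjacency in a single AF.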
4.12. Anycast Considerations
+ traffic
|
v
+------+------+
| ToF |
+---+-----+---+
| | | |
+------------+ | | +------------+
| | | |
+---+---+ +-------+ +-------+ +---+---+
| | | | | | | |
|Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1
+-+---+-+ ++----+-+ +-+---+-+ ++----+-+
| | | | | | | |
| +---------+ | | +---------+ |
| | | | | | | |
| +-------+ | | | +-------+ | |
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+
| | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0
+-+-----+ ++------+ +-----+-+ +-----+-+
+ + + ^ |
PrefixA PrefixB PrefixA | PrefixC
|
+ traffic
Figure 12: Anycast
If the traffic comes from the ToF towards Leaf111 or Leaf121, which
both advertise the anycast prefix PrefixA, RIFT can deal with this
case well. But if the traffic comes from Leaf122, it will always get
to Leaf121 and never get to Leaf111. If the intention is that the
traffic should be offloaded to Leaf111, then use policy guided
prefixes [PGP reference].
5. Acknowledgements 5. Acknowledgements
6. Contributors 6. Contributors
The following people (listed in alphabetical order) contributed The following people (listed in alphabetical order) contributed
significantly to the content of this document and should be significantly to the content of this document and should be
considered co-authors: considered co-authors:
Tony Przygienda Tony Przygienda
Juniper Networks
Juniper Networks
1194 N. Mathilda Ave 1194 N. Mathilda Ave
Sunnyvale, CA 94089 Sunnyvale, CA 94089
US US
Email: prz@juniper.net Email: prz@juniper.net
7. Normative References 7. Normative References
[I-D.ietf-rift-rift] [I-D.ietf-rift-rift]
Team, T., "RIFT: Routing in Fat Trees", draft-ietf-rift- Przygienda, T., Sharma, A., Thubert, P., and D. Afanasiev,
rift-05 (work in progress), April 2019. "RIFT: Routing in Fat Trees", draft-ietf-rift-rift-08
(work in progress), September 2019.
[I-D.white-distoptflood] [I-D.white-distoptflood]
White, R. and S. Zandi, "IS-IS Optimal Distributed White, R., Hegde, S., and S. Zandi, "IS-IS Optimal
Flooding for Dense Topologies", draft-white- Distributed Flooding for Dense Topologies", draft-white-
distoptflood-00 (work in progress), March 2019. distoptflood-01 (work in progress), September 2019.
[ISO10589-Second-Edition] [ISO10589-Second-Edition]
International Organization for Standardization, International Organization for Standardization,
"Intermediate system to Intermediate system intra-domain "Intermediate system to Intermediate system intra-domain
routeing information exchange protocol for use in routeing information exchange protocol for use in
conjunction with the protocol for providing the conjunction with the protocol for providing the
connectionless-mode Network Service (ISO 8473)", Nov 2002. connectionless-mode Network Service (ISO 8473)", Nov 2002.
[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328,
DOI 10.17487/RFC2328, April 1998, DOI 10.17487/RFC2328, April 1998,
 End of changes. 42 change blocks. 
147 lines changed or deleted 477 lines changed or added
