idnits 2.17.1 draft-ietf-rift-applicability-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 1107: '...scovery. A node MUST NOT originate LI...' RFC 2119 keyword, line 1110: '...e on. An implementation MUST be ready...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (20 January 2021) is 1192 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- ** Obsolete normative reference: RFC 5549 (Obsoleted by RFC 8950) == Outdated reference: A later version (-21) exists of draft-ietf-rift-rift-12 Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 RIFT WG Yuehua. Wei, Ed. 3 Internet-Draft Zheng. 
Zhang 4 Intended status: Informational ZTE Corporation 5 Expires: 24 July 2021 Dmitry. Afanasiev 6 Yandex 7 Tom. Verhaeg 8 Juniper Networks 9 Jaroslaw. Kowalczyk 10 Orange Polska 11 P. Thubert 12 Cisco Systems 13 20 January 2021 15 RIFT Applicability 16 draft-ietf-rift-applicability-04 18 Abstract 20 This document discusses the properties, applicability and operational 21 considerations of RIFT in different network scenarios. It intends to 22 provide a rough guide how RIFT can be deployed to simplify routing 23 operations in Clos topologies and their variations. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at https://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on 24 July 2021. 42 Copyright Notice 44 Copyright (c) 2021 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 49 license-info) in effect on the date of publication of this document. 50 Please review these documents carefully, as they describe your rights 51 and restrictions with respect to this document. Code Components 52 extracted from this document must include Simplified BSD License text 53 as described in Section 4.e of the Trust Legal Provisions and are 54 provided without warranty as described in the Simplified BSD License. 56 Table of Contents 58 1. 
Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 59 2. Problem Statement of Routing in Modern IP Fabric Fat Tree 60 Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 61 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 62 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 4 63 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 6 64 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 65 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 66 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 7 67 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 68 3.3.1. Data Center Fabrics . . . . . . . . . . . . . . . . . 8 69 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 8 70 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 8 71 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 9 72 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 9 73 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 11 74 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 12 75 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 12 76 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 14 77 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 15 78 4.5. Mis-cabling Examples . . . . . . . . . . . . . . . . . . 15 79 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 17 80 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 19 81 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 21 82 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 22 83 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 23 84 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 24 85 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 24 86 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 25 87 4.12. Internet Connectivity With Underlay . . . . . 
. . . . . . 25 88 4.12.1. Internet Default on the Leaf . . . . . . . . . . . . 25 89 4.12.2. Internet Default on the ToFs . . . . . . . . . . . . 25 90 4.13. Subnet Mismatch and Address Families . . . . . . . . . . 25 91 4.14. Anycast Considerations . . . . . . . . . . . . . . . . . 26 92 4.15. IoT Applicability . . . . . . . . . . . . . . . . . . . . 27 93 5. Security Considerations . . . . . . . . . . . . . . . . . . . 27 94 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 28 95 7. Normative References . . . . . . . . . . . . . . . . . . . . 28 96 8. Informative References . . . . . . . . . . . . . . . . . . . 29 97 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 30 99 1. Introduction 101 This document intends to explain the properties and applicability of 102 "Routing in Fat Trees" [RIFT] in different deployment scenarios and 103 highlight the operational simplicity of the technology compared to 104 traditional routing solutions. It also documents special 105 considerations when RIFT is used with or without overlays, with or 106 without controllers, corrects topology mis-cablings, and node or link 107 failures. 109 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks 111 Clos [CLOS] and fat tree [FATTREE] topologies have gained prominence 112 in today's networking, primarily as a result of the paradigm shift 113 towards a centralized data-center based architecture that deliver a 114 majority of computation and storage services. 116 Today's current routing protocols were geared towards a network with 117 an irregular topology and low degree of connectivity originally. 118 When they are applied to fat tree topologies: 120 * They tend to need extensive configuration or provisioning during 121 bring up and re-dimensioning. 123 * Spine and leaf nodes have the entire network topology and routing 124 information which is in fact not needed on the leaf nodes during 125 normal operation. 
127 * Significant Link State PDUs (LSPs) flooding duplication between 128 spine nodes and leaf nodes occurs during network bring up and 129 topology updates. It consumes both spine and leaf nodes' CPU and 130 link bandwidth resources. 132 3. Applicability of RIFT to Clos IP Fabrics 134 Further content of this document assumes that the reader is familiar 135 with the terms and concepts used in OSPF [RFC2328] and IS-IS 136 [ISO10589-Second-Edition] link-state protocols. The sections of RIFT 137 [RIFT] outline the requirements of routing in IP fabrics and RIFT 138 protocol concepts. 140 3.1. Overview of RIFT 142 RIFT is a dynamic routing protocol for Clos and fat tree network 143 topologies. It defines a link-state protocol when "pointing north" 144 and path-vector protocol when "pointing south". 146 It floods flat link-state information northbound only so that each 147 level obtains the full topology of levels south of it. That 148 information is never flooded east-west or back south again. So a top 149 tier node has full set of prefixes from the Shortest Path First (SPF) 150 calculation. 152 In the southbound direction, the protocol operates like a "fully 153 summarizing, unidirectional" path vector protocol or rather a 154 distance vector with implicit split horizon. Routing information, 155 normally just the default route, propagates one hop south and is 're- 156 advertised' by nodes at next lower level. 
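The split between northbound link-state flooding and southbound default-route propagation described above can be illustrated with a minimal Python sketch. It is not taken from the RIFT specification: the three-level topology, the node and prefix names, and the helpers parents_of, prefixes_south_of and routing_view are all hypothetical and only model the resulting routing state, not the TIE exchange itself.

```python
from typing import Dict, List, Set

# Hypothetical three-level fabric; all node and prefix names are illustrative.
CHILDREN: Dict[str, List[str]] = {
    "tof1": ["spine1", "spine2"],
    "tof2": ["spine1", "spine2"],
    "spine1": ["leaf1", "leaf2"],
    "spine2": ["leaf1", "leaf2"],
}
LOCAL_PREFIXES: Dict[str, Set[str]] = {"leaf1": {"prefix1"}, "leaf2": {"prefix2"}}


def parents_of(node: str) -> List[str]:
    """Nodes one level north of `node` (sources of the southbound default)."""
    return sorted(p for p, kids in CHILDREN.items() if node in kids)


def prefixes_south_of(node: str) -> Set[str]:
    """Prefixes known from northbound flooding: local ones plus everything south."""
    learned = set(LOCAL_PREFIXES.get(node, set()))
    for child in CHILDREN.get(node, []):
        learned |= prefixes_south_of(child)
    return learned


def routing_view(node: str) -> Dict[str, object]:
    """Specific routes (northbound view) plus default next hops (southbound view)."""
    return {"specific": prefixes_south_of(node), "default_via": parents_of(node)}


if __name__ == "__main__":
    for name in ("tof1", "spine1", "leaf1"):
        print(name, routing_view(name))
```

In this sketch a leaf ends up with only its own prefix and a default route via both spines, while a ToF node holds every prefix in the fabric and no default.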
158 +-----------+ +-----------+ 159 | ToF | | ToF | LEVEL 2 160 + +-----+--+--+ +-+--+------+ 161 | | | | | | | | | ^ 162 + | | | +-------------------------+ | 163 Distance | +-------------------+ | | | | | 164 Vector | | | | | | | | + 165 South | | | | +--------+ | | | Link-state 166 + | | | | | | | | Flooding 167 | | | +-------------+ | | | North 168 v | | | | | | | | + 169 +-+--+-+ +------+ +-------+ +--+--+-+ | 170 |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 171 + ++----++ ++---+-+ +--+--+-+ ++----+-+ | 172 + | | | | | | | | | ^ N 173 Distance | +-------+ | | +--------+ | | | E 174 Vector | | | | | | | | | +------> 175 South | +-------+ | | | +-------+ | | | | 176 + | | | | | | | | | + 177 v ++--++ +-+-++ ++-+-+ +-+--++ + 178 |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 179 +----+ +----+ +----+ +-----+ 181 Figure 1: Rift overview 183 A spine node has only information necessary for its level, which is 184 all destinations south of the node based on SPF calculation, default 185 route, and potential disaggregated routes. 187 RIFT combines the advantage of both link-state and distance vector: 189 * Fastest possible convergence 191 * Automatic detection of topology 193 * Minimal routes/info on tors 195 * High degree of ECMP 197 * Fast de-commissioning of nodes 199 * Maximum Propagation speed with flexible prefixes in an update 201 And RIFT eliminates the disadvantages of link-state or distance 202 vector: 204 * Reduced and balanced flooding 206 * Automatic neighbor detection 208 So there are two types of link-state database which are "north 209 representation" North Topology Information Elements (N-TIEs) and 210 "south representation" South Topology Information Elements (S-TIEs). 211 The N-TIEs contain a link-state topology description of lower levels 212 and S-TIEs carry simply default routes for the lower levels. 214 There are more advantages unique to RIFT listed below which could be 215 understood if you read the details of RIFT [RIFT]. 
217 * True ZTP 219 * Minimal blast radius on failures 221 * Can utilize all paths through fabric without looping 223 * Automatic disaggregation on failures 225 * Simple leaf implementation that can scale down to servers 227 * Key-Value store 229 * Horizontal links used for protection only 231 * Supports non-equal cost multipath and can replace MC-LAG 233 * Optimal flooding reduction and load-balancing 235 3.2. Applicable Topologies 237 Albeit RIFT is specified primarily for "proper" Clos or "fat tree" 238 structures, it already supports Points of Delivery (PoD) concepts 239 which are strictly speaking not found in original Clos concepts. 241 Further, the specification explains and supports operations of multi- 242 plane Clos variants where the protocol relies on set of rings to 243 allow the reconciliation of topology view of different planes as most 244 desirable solution making proper disaggregation viable in case of 245 failures. These observations hold not only in case of RIFT but also 246 in the generic case of dynamic routing on Clos variants with multiple 247 planes and failures in bi-sectional bandwidth, especially on the 248 leafs. 250 3.2.1. Horizontal Links 252 RIFT is not limited to pure Clos divided into PoD and multi-planes 253 but supports horizontal links below the top of fabric level. Those 254 links are used only as routes of last resort northbound when a spine 255 loses all northbound links or cannot compute a default route through 256 them. 258 A possible configuration is a "ring" of horizontal links at a level. 259 In presence of such a "ring" in any level (except Top of Fabric (ToF) 260 level) neither North SPF (N-SPF) nor South SPF (S-SPF) will provide a 261 "ring-based protection" scheme since such a computation would have to 262 deal necessarily with breaking of "loops" in Dijkstra sense; an 263 application for which RIFT is not intended. 
265 Full-mesh connectivity between nodes on the same level can be 266 employed, and that allows N-SPF to provide for any node losing all 267 its northbound adjacencies (as long as any of the other nodes in the 268 level are northbound connected) to still participate in northbound 269 forwarding. 271 3.2.2. Vertical Shortcuts 273 Through relaxations of the specified adjacency forming rules, RIFT 274 implementations can be extended to support vertical "shortcuts" as 275 proposed by e.g. [I-D.white-distoptflood]. The RIFT specification 276 itself does not provide the exact details since the resulting 277 solution suffers either from a much larger blast radius with increased 278 flooding volumes or, in the case of maximum aggregation, from routing bow-tie 279 problems. 281 3.2.3. Generalizing to any Directed Acyclic Graph 283 RIFT is an anisotropic routing protocol, meaning that it has a sense 284 of direction (northbound, southbound, east-west) and that it operates 285 differently depending on the direction. 287 * Northbound, RIFT operates as a link-state IGP, whereby the control 288 packets are reflooded first all the way north and only interpreted 289 later. All the individual fine-grained routes are advertised. 291 * Southbound, RIFT operates as a distance vector IGP, whereby the 292 control packets are flooded only one hop, interpreted, and the 293 consequence of that computation is what gets flooded one more hop 294 south. In the most common use-cases, a ToF node can reach most of 295 the prefixes in the fabric. If that is the case, the ToF node 296 advertises the fabric default and disaggregates the prefixes that 297 it cannot reach. On the other hand, a ToF node that can reach 298 only a small subset of the prefixes in the fabric will preferably 299 advertise those prefixes and refrain from aggregating. 301 In the general case, what gets advertised south is, in more 302 detail: 304 1. 
A fabric default that aggregates all the prefixes that are 305 reachable within the fabric, and that could be a default route 306 or a prefix that is dedicated to this particular fabric. 308 2. The loopback addresses of the northbound nodes, e.g., for 309 inband management. 311 3. The disaggregated prefixes for the dynamic exceptions to the 312 fabric default, advertised to route around the black hole that 313 may form. 315 * East-west routing can optionally be used, with specific 316 restrictions. It is useful in particular when a sibling has 317 access to the fabric default but this node does not. 319 A Directed Acyclic Graph (DAG) provides a sense of north (the 320 direction of the DAG) and of south (the reverse), which can be used 321 to apply RIFT. For the purpose of RIFT, a vertex in the DAG that has 322 only incoming edges is a ToF node. 324 There are a number of caveats though: 326 * The DAG structure must exist before RIFT starts, so there is a 327 need for a companion protocol to establish the logical DAG 328 structure. 330 * A generic DAG does not have a sense of east and west. The 331 operation specified for east-west links and the southbound 332 reflection between nodes are not applicable. 334 * In order to aggregate and disaggregate routes, RIFT requires that 335 all the ToF nodes share the full knowledge of the prefixes in the 336 fabric. This can be achieved with a ring as suggested by the RIFT 337 main specification, by some preconfiguration, or by 338 synchronization with a common repository where all the active 339 prefixes are registered. 341 3.3. Use Cases 343 3.3.1. Data Center Fabrics 345 RIFT is largely driven by the demands of, and hence ideally suited to, 346 underlay routing in data center (DC) IP fabrics, the vast 347 majority of which seem to be currently (and for the foreseeable 348 future) Clos architectures. 
It significantly simplifies operation 349 and deployment of such fabrics, as described in Section 4, 350 compared to environments that rely on extensive proprietary provisioning and 351 operational solutions. 353 3.3.2. Metro Fabrics 355 The demand for bandwidth is increasing steadily, driven primarily by 356 environments close to content producers (server farms connected via 357 DC fabrics) but in proximity to content consumers as well. Consumers 358 are often clustered in metro areas with their own network 359 architectures that can benefit from simplified, regular Clos 360 structures and hence RIFT. 362 3.3.3. Building Cabling 364 Commercial edifices are often cabled in topologies that are either 365 Clos or its isomorphic equivalents. The Clos can grow rather high 366 with many floors. That presents a challenge for traditional routing 367 protocols (except BGP and by now largely phased-out PNNI), which do 368 not support an arbitrary number of levels, whereas RIFT does so naturally. 369 Moreover, due to the limited sizes of forwarding tables in network 370 elements of building cabling, the minimum FIB size RIFT 371 maintains under normal conditions is cost-effective in terms of 372 hardware and operational costs. 374 3.3.4. Internal Router Switching Fabrics 376 It is common in high-speed communications switching and routing 377 devices to use fabrics when a crossbar is not feasible due to cost, 378 head-of-line blocking or size trade-offs. Normally such fabrics are 379 not self-healing or rely on 1:1 or 1+1 protection schemes, but it is 380 conceivable to use RIFT to operate Clos fabrics that can deal 381 effectively with interconnection or subsystem failures in such 382 modules. RIFT is also not IP specific and hence any link addressing 383 connecting internal device subnets is conceivable. 385 3.3.5. CloudCO 387 The Cloud Central Office (CloudCO) is a new stage of the telecom Central 388 Office. 
It takes the advantage of Software Defined Networking (SDN) 389 and Network Function Virtualization (NFV) in conjunction with general 390 purpose hardware to optimize current networks. The following figure 391 illustrates this architecture at a high level. It describes a single 392 instance or macro-node of cloud CO. An Access I/O module faces a 393 Cloud CO access node, and the Customer Premises Equipments (CPEs) 394 behind it. A Network I/O module is facing the core network. The two 395 I/O modules are interconnected by a leaf and spine fabric. [TR-384] 396 +---------------------+ +----------------------+ 397 | Spine | | Spine | 398 | Switch | | Switch | 399 +------+---+------+-+-+ +--+-+-+-+-----+-------+ 400 | | | | | | | | | | | | 401 | | | | | +-------------------------------+ | 402 | | | | | | | | | | | | 403 | | | | +-------------------------+ | | | 404 | | | | | | | | | | | | 405 | | +----------------------+ | | | | | | | | 406 | | | | | | | | | | | | 407 | +---------------------------------+ | | | | | | | 408 | | | | | | | | | | | | 409 | | | +-----------------------------+ | | | | | 410 | | | | | | | | | | | | 411 | | | | | +--------------------+ | | | | 412 | | | | | | | | | | | | 413 +--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+ 414 |L | | Leaf | | Leaf | | Leaf | | Leaf | |L | 415 |S | | Switch | | Switch | | Switch | | Switch| |S | 416 ++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++ 417 | | | | | | | | | | | | | | 418 | +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ | 419 | |Compute | |Compute | | Compute | |Compute| | 420 | |Node | |Node | | Node | |Node | | 421 | +--------+ +--------+ +----------+ +-------+ | 422 | || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || | 423 | |--------| |--------| |----------| |-------| | 424 | |--------| |--------| |----------| |-------| | 425 | || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || | 426 | |--------| |--------| |----------| |-------| | 427 | |--------| |--------| |----------| |-------| | 428 | || VAS7 
|| || VAS4 || || vIGMP || ||BAA || | 429 | |--------| |--------| |----------| |-------| | 430 | +--------+ +--------+ +----------+ +-------+ | 431 | | 432 ++-----------+ +---------++ 433 |Network I/O | |Access I/O| 434 +------------+ +----------+ 436 Figure 2: An example of CloudCO architecture 438 The Spine-Leaf architecture deployed inside CloudCO meets the network 439 requirements of being adaptable, agile, scalable and dynamic. 441 4. Deployment Considerations 443 RIFT presents the opportunity for organizations building and 444 operating IP fabrics to simplify their operation and deployments 445 while achieving many desirable properties of dynamic routing on 446 such a substrate: 448 * RIFT only floods routing information to the devices that absolutely 449 need it. RIFT design follows a minimum blast radius and minimum 450 necessary epistemological scope philosophy, which leads to good 451 scaling properties while delivering maximum reactiveness. 453 * RIFT allows for extensive Zero Touch Provisioning within the 454 protocol. In its most extreme version RIFT does not rely on any 455 specific addressing and for an IP fabric can operate using IPv6 ND 456 [RFC4861] only. 458 * RIFT has provisions to detect common IP fabric mis-cabling 459 scenarios. 461 * RIFT automatically negotiates BFD per link, which allows 462 IP and micro-BFD [RFC7130] to replace Link Aggregation Groups 463 (LAGs), which hide bandwidth imbalances in case of constituent 464 failures. Further automatic link validation techniques similar to 465 [RFC5357] could be supported as well. 467 * RIFT inherently solves many difficult problems associated with the 468 use of traditional routing topologies with dense meshes and high 469 degrees of ECMP by including automatic bandwidth balancing, flood 470 reduction and automatic disaggregation on failures while providing 471 maximum aggregation of prefixes in default scenarios. 
473 * RIFT reduces FIB size towards the bottom of the IP fabric where 474 most nodes reside and thereby allows for cheaper hardware on the 475 edges and the introduction of modern IP fabric architectures that 476 encompass e.g. server multi-homing. 478 * RIFT provides valley-free routing and with that is loop-free. 479 This allows the use of any such valley-free path in bi-sectional 480 fabric bandwidth between two destinations irrespective of their 481 metrics, which can be used to balance load on the fabric in 482 different ways. 484 * RIFT includes a key-value distribution mechanism which allows for 485 many future applications such as automatic provisioning of basic 486 overlay services or automatic key roll-overs over whole fabrics. 488 * RIFT is designed for minimum delay in case of prefix mobility on 489 the fabric. 491 * Many further operational and design points collected over many 492 years of routing protocol deployments have been incorporated in 493 RIFT such as fast flooding rates, protection of information 494 lifetimes and operationally easily recognizable remote ends of 495 links and node names. 497 4.1. South Reflection 499 South reflection is a mechanism whereby South Node TIEs are "reflected" 500 back up north to allow nodes in the same level without east-west links to 501 "see" each other. 503 For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs 504 from ToF21 to ToF22 separately. Likewise, 505 Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22 506 to ToF21 separately. So ToF22 and ToF21 see each other's node 507 information as level 2 nodes. 509 In an equivalent fashion, as the result of the south reflection 510 between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, 511 Spine121 and Spine122 know each other at level 1. 513 4.2. 
Suboptimal Routing on Link Failures 514 +--------+ +--------+ 515 | ToF21 | | ToF22 | LEVEL 2 516 ++--+-+-++ ++-+--+-++ 517 | | | | | | | + 518 | | | | | | | linkTS8 519 +-------------+ | +-+linkTS3+-+ | | | +-------------+ 520 | | | | | | + | 521 | +----------------------------+ | linkTS7 | 522 | | | | + + + | 523 | | | +-------+linkTS4+------------+ | 524 | | | + + | | | 525 | | | +------------+--+ | | 526 | | | | | linkTS6 | | 527 +-+----+-+ +-----+--+ ++--------+ +-+----+-+ 528 |Spine111| |Spine112| |Spine121 | |Spine122| LEVEL 1 529 +-+---+--+ +----+---+ +-+---+---+ +-+---+--+ 530 | | | | | | | | 531 | +--------------+ | + ++XX+linkSL6+---+ + 532 | | | | linkSL5 | | linkSL8 533 | +------------+ | | + +---+linkSL7+-+ | + 534 | | | | | | | | 535 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 536 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 537 +-+-----+ ++------+ +-----+-+ +-+-----+ 538 + + + + 539 Prefix111 Prefix112 Prefix121 Prefix122 541 Figure 3: Suboptimal routing upon link failure use case 543 As shown in Figure 3, as the result of the south reflection between 544 Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and 545 Spine 122 knows each other at level 1. 547 Without disaggregation mechanism, when linkSL6 fails, the packet from 548 leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 549 then go down through linkTS4 to linkSL8 to Leaf122 or go up through 550 linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to 551 Leaf122 based on pure default route. It's the case of suboptimal 552 routing or bow-tieing. 554 With disaggregation mechanism, when linkSL6 fails, Spine122 will 555 detect the failure according to the reflected node S-TIE from 556 Spine121. Based on the disaggregation algorithm provided by RIFT, 557 Spine122 will explicitly advertise prefix122 in Disaggregated Prefix 558 S-TIE PrefixesElement(prefix122, cost 1). 
The packet from leaf121 to 559 prefix122 will only be sent to linkSL7 following a longest-prefix 560 match to prefix 122 directly then go down through linkSL8 to Leaf122 561 . 563 4.3. Black-Holing on Link Failures 565 +--------+ +--------+ 566 | ToF 21 | | ToF 22 | LEVEL 2 567 ++-+--+-++ ++-+--+-++ 568 | | | | | | | + 569 | | | | | | | linkTS8 570 +--------------+ | +-+linkTS3+X+ | | | +--------------+ 571 linkTS1 | | | | | + | 572 + +-----------------------------+ | linkTS7 | 573 | | + | + + + | 574 | | linkTS2 +-------+linkTS4+X+----------+ | 575 | + + + + | | | 576 | linkTS5 +-+ +------------+--+ | | 577 | + | | | linkTS6 | | 578 +-+----+-+ +-+----+-+ ++-------+ +-+-----++ 579 |Spine111| |Spine112| |Spine121| |Spine122| LEVEL 1 580 +-+---+--+ ++----+--+ +-+---+--+ +-+---+--+ 581 | | | | | | | | 582 + +---------------+ | + +---+linkSL6+---+ + 583 linkSL1 | | | linkSL5 | | linkSL8 584 + +--+linkSL3+--+ | | + +---+linkSL7+-+ | + 585 | | | | | | | | 586 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 587 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 588 +-+-----+ ++------+ +-----+-+ +-+-----+ 589 + + + + 590 Prefix111 Prefix112 Prefix121 Prefix122 592 Figure 4: Black-holing upon link failure use case 594 This scenario illustrates a case when double link failure occurs and 595 with that black-holing can happen. 597 Without disaggregation mechanism, when linkTS3 and linkTS4 both fail, 598 the packet from leaf111 to prefix122 would suffer 50% black-holing 599 based on pure default route. The packet supposed to go up through 600 linkSL1 to linkTS1 then go down through linkTS3 or linkTS4 will be 601 dropped. The packet supposed to go up through linkSL3 to linkTS2 602 then go down through linkTS3 or linkTS4 will be dropped as well. 603 It's the case of black-holing. 605 With disaggregation mechanism, when linkTS3 and linkTS4 both fail, 606 ToF22 will detect the failure according to the reflected node S-TIE 607 of ToF21 from Spine111\Spine112. 
Based on the disaggregation 608 algorithm provided by RIFT, ToF22 will explicitly originate an S-TIE 609 with prefix121 and prefix122, which is flooded to Spine111, Spine112, 610 Spine121 and Spine122. 612 The packet from leaf111 to prefix122 will not be routed to linkTS1 or 613 linkTS2. The packet from leaf111 to prefix122 will only be routed to 614 linkTS5 or linkTS7 following a longest-prefix match to prefix122. 616 4.4. Zero Touch Provisioning (ZTP) 618 Each RIFT node may operate in zero touch provisioning (ZTP) mode. It 619 has no configuration (unless it is a ToF at the top of the topology 620 or it is desired to confine it to the leaf role without leaf-to-leaf 621 procedures). In such a case RIFT will fully configure the node's level 622 after it is attached to the topology. 624 The most important component for ZTP is the automatic level 625 derivation procedure. All the ToF nodes are explicitly marked with 626 the TOP_OF_FABRIC flag and are the initial 'seeds' needed for other ZTP 627 nodes to derive their level in the topology. The derivation of the 628 level of each node then happens based on Link Information Elements 629 (LIEs) received from its neighbors, whereby each node (with the possible 630 exception of configured leafs) tries to attach at the highest 631 possible point in the fabric. This guarantees that even if the 632 diffusion front reaches a node from "below" faster than from "above", 633 it will greedily abandon an already negotiated level derived from nodes 634 topologically below it and properly peer with nodes above. 636 4.5. 
Mis-cabling Examples 638 +----------------+ +-----------------+ 639 | ToF21 | +------+ ToF22 | LEVEL 2 640 +-------+----+---+ | +----+---+--------+ 641 | | | | | | | | | 642 | | | +----------------------------+ | 643 | +---------------------------+ | | | | 644 | | | | | | | | | 645 | | | | +-----------------------+ | | 646 | | +------------------------+ | | | 647 | | | | | | | | | 648 +-+---+--+ +-+---+--+ | +--+---+-+ +--+---+-+ 649 |Spine111| |Spine112| | |Spine121| |Spine122| LEVEL 1 650 +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ 651 | | | | | | | | | 652 | +---------+ | link-M | +---------+ | 653 | | | | | | | | | 654 | +-------+ | | | | +-------+ | | 655 | | | | | | | | | 656 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 657 |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 658 +-------+ +-------+ +-------+ +-------+ 659 Figure 5: A single plane mis-cabling example 661 Figure 5 shows a single plane mis-cabling example. It's a perfect 662 fat tree fabric except link-M connecting Leaf112 to ToF22. 664 The RIFT control protocol can discover the physical links 665 automatically and be able to detect cabling that violates fat tree 666 topology constraints. It reacts accordingly to such mis-cabling 667 attempts, at a minimum preventing adjacencies between nodes from 668 being formed and traffic from being forwarded on those mis-cabled 669 links. Leaf112 will in such scenario use link-M to derive its level 670 (unless it is leaf) and can report links to Spine111 and Spine112 as 671 mis-cabled unless the implementations allows horizontal links. 673 Figure 6 shows a multiple plane mis-cabling example. Since Leaf112 674 and Spine121 belong to two different PoDs, the adjacency between 675 Leaf112 and Spine121 can not be formed. link-W would be detected and 676 prevented. 
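As a rough illustration of the two mis-cabling cases above (link-M and link-W), the following Python sketch rejects an adjacency when the level difference or the PoD membership is inconsistent. It is a deliberate simplification and not the adjacency rules of the RIFT specification: the Node type, the adjacency_allowed function, the convention that PoD 0 means "no PoD membership", and the assumption that all levels are already known (no ZTP) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    level: int
    pod: int  # 0 means "no PoD membership" (hypothetical convention)


def adjacency_allowed(a: Node, b: Node, allow_horizontal: bool = False) -> bool:
    """Simplified check: same PoD (or none) and adjacent levels."""
    if a.pod and b.pod and a.pod != b.pod:
        return False  # e.g. link-W between two different PoDs
    diff = abs(a.level - b.level)
    if diff == 1:
        return True
    # Same-level (horizontal) links only if the implementation allows them.
    return diff == 0 and allow_horizontal


leaf112 = Node("Leaf112", level=0, pod=1)
spine112 = Node("Spine112", level=1, pod=1)
spine121 = Node("Spine121", level=1, pod=2)
tof22 = Node("ToF22", level=2, pod=0)

if __name__ == "__main__":
    print("Leaf112-Spine112:", adjacency_allowed(leaf112, spine112))
    print("link-M (Leaf112-ToF22):", adjacency_allowed(leaf112, tof22))
    print("link-W (Leaf112-Spine121):", adjacency_allowed(leaf112, spine121))
```

Under these assumptions the regular leaf-to-spine link is accepted, while link-M (a level difference of two) and link-W (a PoD mismatch) are both refused.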
678 +-------+ +-------+ +-------+ +-------+ 679 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 680 +-------+ +-------+ +-------+ +-------+ 681 | | | | | | | | 682 | | | +-----------------+ | | | 683 | +--------------------------+ | | | | 684 | | | | | | | | 685 | +------+ | | | +------+ | 686 | | +-----------------+ | | | | | 687 | | | +--------------------------+ | | 688 | A | | B | | A | | B | 689 +-----+--+ +-+---+--+ +--+---+-+ +--+-----+ 690 |Spine111| |Spine112| +---+Spine121| |Spine122| LEVEL 1 691 +-+---+--+ ++----+--+ | +--+---+-+ +-+----+-+ 692 | | | | | | | | | 693 | +---------+ | | | +---------+ | 694 | | | | link-W | | | | 695 | +-------+ | | | | +-------+ | | 696 | | | | | | | | | 697 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 698 |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 699 +-------+ +-------+ +-------+ +-------+ 700 +--------PoD#1----------+ +---------PoD#2---------+ 702 Figure 6: A multiple plane mis-cabling example 704 RIFT provides an optional level determination procedure in its Zero 705 Touch Provisioning mode. Nodes in the fabric without their level 706 configured determine it automatically. This can have possibly 707 counter-intuitive consequences however. One extreme failure scenario 708 is depicted in Figure 7 and it shows that if all northbound links of 709 spine11 fail at the same time, spine11 negotiates a lower level than 710 Leaf11 and Leaf12. 712 To prevent such scenario where leafs are expected to act as switches, 713 LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is 714 invalid, Spine11 would not derive a valid level from the topology in 715 Figure 7. It will be isolated from the whole fabric and it would be 716 up to the leafs to declare the links towards such spine as mis- 717 cabled. 
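The fallen spine behaviour can be sketched with a simplified level derivation in Python. This is an assumption-laden model, not the ZTP procedure of the RIFT specification: derive_level is a hypothetical helper that places a node one level below the highest level offered by its neighbors, pins LEAF_ONLY nodes to level 0, and treats negative levels as invalid. The top level of 10 used in the usage note is an arbitrary illustrative value.

```python
from typing import Iterable, Optional


def derive_level(offered_levels: Iterable[int], leaf_only: bool = False) -> Optional[int]:
    """Return the level a node would adopt, or None if no valid level exists."""
    if leaf_only:
        return 0  # LEAF_ONLY nodes never take a higher role
    offers = list(offered_levels)
    if not offers:
        return None  # no neighbor offers a level yet
    level = max(offers) - 1  # attach as high as possible, one below the best offer
    return level if level >= 0 else None


if __name__ == "__main__":
    top = 10  # arbitrary illustrative ToF level
    spine12 = derive_level([top, top])
    leaf = derive_level([spine12])
    print("Spine12:", spine12, "leaf:", leaf)
    print("fallen Spine11, leafs not LEAF_ONLY:", derive_level([leaf, leaf]))
    print("fallen Spine11, leafs LEAF_ONLY:", derive_level([0, 0]))
```

With leafs that derive their level freely, the fallen spine ends up below them (level 7 under levels 8 in this sketch). With LEAF_ONLY leafs at level 0, the only candidate level is -1, so no valid level is derived and the spine stays isolated.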
719 +-------+ +-------+ +-------+ +-------+ 720 |ToF A1| |ToF A2| |ToF A1| |ToF A2| 721 +-------+ +-------+ +-------+ +-------+ 722 | | | | | | 723 | +-------+ | | | 724 + + | | ====> | | 725 X X +------+ | +------+ | 726 + + | | | | 727 +----+--+ +-+-----+ +-+-----+ 728 |Spine11| |Spine12| |Spine12| 729 +-+---+-+ ++----+-+ ++----+-+ 730 | | | | | | 731 | +---------+ | | | 732 | | | | | | 733 | +-------+ | | +-------+ | 734 | | | | | | 735 +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ 736 |Leaf111| |Leaf112| |Leaf111| |Leaf112| 737 +-------+ +-------+ +-+-----+ +-+-----+ 738 | | 739 | +--------+ 740 | | 741 +-+---+-+ 742 |Spine11| 743 +-------+ 745 Figure 7: Fallen spine 747 4.6. Positive vs. Negative Disaggregation 749 Disaggregation is the procedure whereby [RIFT] advertises a more 750 specific route southwards as an exception to the aggregated fabric- 751 default north. Disaggregation is useful when a prefix within the 752 aggregation is reachable via some of the parents but not the others 753 at the same level of the fabric. It is mandatory when the level is 754 the ToF since a ToF node that cannot reach a prefix becomes a black 755 hole for that prefix. The hard problem is to know which prefixes are 756 reachable by whom. 758 In the general case, [RIFT] solves that problem by interconnecting 759 the ToF nodes, so that they can exchange the full list of 760 prefixes that exist in the fabric and figure out when a ToF node lacks 761 reachability to an existing prefix. This requires additional ports 762 at the ToF, typically 2 ports per ToF node to form a ToF-spanning 763 ring. [RIFT] also defines the southbound reflection procedure that 764 enables a parent to explore the direct connectivity of its peers, 765 meaning their own parents and children; based on the advertisements 766 received from the shared parents and children, it may enable the 767 parent to infer the prefixes its peers can reach.
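The notion of a more specific route acting as an exception to the fabric default, introduced at the start of this section, relies on ordinary longest prefix match. The following minimal Python sketch shows the effect on a leaf; the prefixes, the next-hop names and the lookup function are hypothetical and serve only as an illustration.

```python
import ipaddress
from typing import List

# Hypothetical FIB of a leaf: a default route via both spines, plus a
# disaggregated, more specific route via the only spine that still
# reaches 10.1.2.0/24.
fib = {
    ipaddress.ip_network("0.0.0.0/0"): ["Spine1", "Spine2"],
    ipaddress.ip_network("10.1.2.0/24"): ["Spine2"],
}


def lookup(destination: str) -> List[str]:
    """Return the next hops of the longest matching prefix."""
    address = ipaddress.ip_address(destination)
    matches = [net for net in fib if address in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]


print(lookup("10.1.2.7"))  # ['Spine2']
print(lookup("10.9.9.9"))  # ['Spine1', 'Spine2']
```

Traffic to the disaggregated prefix avoids Spine1, while all other traffic keeps using both spines through the default route.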
769 When a parent lacks reachability to a prefix, it may disaggregate the 770 prefix negatively, i.e., advertise that this parent can be used to 771 reach any prefix in the aggregation except that one. The Negative 772 Disaggregation signaling is simple and functions transitively from 773 ToF to top-of-pod (ToP) and then from ToP to Leaf. But it is hard 774 for a parent to figure out which prefix it needs to disaggregate, because 775 it does not know what it does not know; as a result, a 776 spanning ring at the ToF is required to operate the Negative 777 Disaggregation. Also, though it is only an implementation problem, 778 the programming of the FIB is complex compared to normal routes, 779 and may involve recursion. 781 The more classical alternative is, for the parents that can reach a 782 prefix that peers at the same level cannot, to advertise a more 783 specific route to that prefix. This leverages the normal longest 784 prefix match in the FIB, and does not require a special 785 implementation. But as opposed to the Negative Disaggregation, the 786 Positive Disaggregation is difficult and inefficient to operate 787 transitively. 789 Transitivity towards a grandchild is not needed if all its parents 790 received the Positive Disaggregation, meaning that they shall all 791 avoid the black hole; when that is the case, they collectively build 792 a ceiling that protects the grandchild. But until then, a parent 793 that received a Positive Disaggregation may believe that some peers 794 are lacking the reachability and readvertise too early, or defer and 795 maintain a black hole situation longer than necessary. 797 In a non-partitioned fabric, all the ToF nodes see one another 798 through the reflection and can figure out if one is missing a child. In 799 that case it is possible to compute the prefixes that the peer cannot 800 reach and disaggregate positively without a ToF-spanning ring.
The 801 ToF nodes can also ascertain that the ToP nodes are each connected to 802 at least one ToF node that can still reach the prefix, meaning that the 803 transitive operation is not required. 805 The bottom line is that in a fabric that is partitioned (e.g., using 806 multiple planes) and/or where the ToP nodes are not guaranteed to 807 always form a ceiling for their children, it is mandatory to use the 808 Negative Disaggregation. On the other hand, in a highly symmetrical 809 and fully connected fabric (e.g., a canonical Clos Network), the 810 Positive Disaggregation method avoids the complexity and 811 cost associated with the ToF-spanning ring. 813 Note that in the case of Positive Disaggregation, the first ToF 814 node(s) that announces a more-specific route attracts all the traffic 815 for that route and may suffer from a transient incast. A ToP node 816 that defers injecting the longer prefix in the FIB, in order to 817 receive more advertisements and spread the packets better, also keeps 818 on sending a portion of the traffic to the black hole in the 819 meantime. In the case of Negative Disaggregation, the last ToF 820 node(s) that injects the route may also incur an incast issue; this 821 problem would occur if a prefix that becomes totally unreachable is 822 disaggregated, but doing so is mostly useless and is not recommended. 824 4.7. Mobile Edge and Anycast 826 When a physical or a virtual node changes its point of attachment in 827 the fabric from a previous-leaf to a next-leaf, new routes must be 828 installed that supersede the old ones. Since the flooding flows 829 northwards, the nodes (if any) between the previous-leaf and the 830 common parent are not immediately aware that the path via previous- 831 leaf is obsolete, and a stale route may exist for a while. The 832 common parent needs to select the freshest route advertisement in 833 order to install the correct route via the next-leaf.
This requires 834 that the fabric determines the sequence of the movements of the 835 mobile node. 837 On the one hand, a classical sequence counter provides a total order 838 for a while but it will eventually wrap. On the other hand, a 839 timestamp provides a permanent order but it may miss a movement that 840 happens too quickly relative to the granularity of the timing information. 841 It is not envisioned in the short term that the average fabric 842 supports the Precision Time Protocol [IEEEstd1588], and the precision 843 that may be available with the Network Time Protocol [RFC5905], on 844 the order of 100 to 200 ms, may not necessarily be enough to cover, 845 e.g., the fast mobility of a Virtual Machine. 847 Section 4.3.3, "Mobility", of [RIFT] specifies a hybrid method that 848 combines a sequence counter from the mobile node and a timestamp from 849 the network taken at the leaf when the route is injected. If the 850 timestamps of the concurrent advertisements are comparable (i.e., 851 further apart than the precision of the timing protocol), then the 852 timestamp alone is used to determine the relative freshness of the 853 routes. Otherwise, the sequence counter from the mobile node, if 854 available, is used. One caveat is that the sequence counter must not 855 wrap within the precision of the timing protocol. Another is that 856 the mobile node may not even provide a sequence counter, in which 857 case the mobility itself must be slower than the precision of the 858 timing. 860 Mobility must not be confused with anycast. In both cases, the same 861 address is injected into RIFT at different leaves. In the case of 862 mobility, only the freshest route must be retained, since the mobile 863 node changed its point of attachment from one leaf to the next.
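The hybrid freshness comparison described above for mobility can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure of [RIFT]: timestamps are plain integers in milliseconds, sequence counter wrap is ignored, and the names Advertisement and fresher are hypothetical. The 200 ms precision mirrors the NTP figure quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Advertisement:
    leaf: str
    timestamp_ms: int               # taken at the leaf when the route is injected
    sequence: Optional[int] = None  # from the mobile node, if it provides one


def fresher(a: Advertisement, b: Advertisement,
            precision_ms: int) -> Optional[Advertisement]:
    """Return the fresher advertisement, or None if they cannot be ordered."""
    # Timestamps further apart than the timing precision are comparable:
    # the timestamp alone decides.
    if abs(a.timestamp_ms - b.timestamp_ms) > precision_ms:
        return a if a.timestamp_ms > b.timestamp_ms else b
    # Otherwise fall back to the sequence counters, when both are present
    # and differ (wrap-around is ignored in this sketch).
    if a.sequence is not None and b.sequence is not None and a.sequence != b.sequence:
        return a if a.sequence > b.sequence else b
    return None


old = Advertisement("previous-leaf", timestamp_ms=1000, sequence=5)
new = Advertisement("next-leaf", timestamp_ms=1050, sequence=6)
late = Advertisement("next-leaf", timestamp_ms=2000)

print(fresher(old, new, precision_ms=200).leaf)   # next-leaf, by sequence counter
print(fresher(old, late, precision_ms=200).leaf)  # next-leaf, by timestamp
```

When neither rule applies, the sketch returns None and leaves the decision to the caller.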
In the 864 case of anycast, the node may be either multihomed (attached to 865 multiple leaves in parallel) or reachable beyond the fabric via 866 multiple routes that are redistributed to different leaves; either 867 way, the multiple routes are equally valid 868 and should be retained. Without further information from the 869 redistributed routing protocol, it is impossible to distinguish a 870 movement from a redistribution that happens asynchronously on 871 different leaves. [RIFT] expects that anycast addresses are 872 advertised within the timing precision, which is typically the case 873 with low-precision timing and a multihomed node. Beyond that time 874 interval, RIFT interprets the lag as mobility and only the freshest 875 route is retained. 877 When using IPv6 [RFC8200], RIFT suggests leveraging "Registration 878 Extensions for IPv6 over Low-Power Wireless Personal Area Network 879 (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND 880 interaction between the mobile node and the leaf. This provides not 881 only a sequence counter but also a lifetime and a security token that 882 may be used to protect the ownership of an address [RFC8928]. When 883 using [RFC8505], the parallel registration of an anycast address to 884 multiple leaves is done with the same sequence counter, whereas the 885 sequence counter is incremented when the point of attachment 886 changes. This way, it is possible to differentiate a mobile node 887 from a multihomed node, even when the mobility happens within the 888 timing precision. It is also possible for a mobile node to be 889 multihomed as well, e.g., to change only one of its points of 890 attachment. 892 4.8. IPv4 over IPv6 894 RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. The IPv6 895 Address Family (AF) is configured via the usual Neighbor Discovery (ND) 896 mechanisms, and IPv4 can then use IPv6 next hops, analogous to [RFC5549].
897 It is expected that the whole fabric supports the same type of 898 forwarding of address families on all the links. RIFT provides an 899 indication of whether a node is IPv4 forwarding capable, and 900 implementations are possible where different routing tables are 901 computed per address family as long as the computation remains loop- 902 free. 904 +-----+ +-----+ 905 +---+---+ | ToF | | ToF | 906 ^ +--+--+ +-----+ 907 | | | | | 908 | | +-------------+ | 909 | | +--------+ | | 910 + | | | | 911 V6 +-----+ +-+---+ 912 Forwarding |Spine| |Spine| 913 + +--+--+ +-----+ 914 | | | | | 915 | | +-------------+ | 916 | | +--------+ | | 917 | | | | | 918 v +-----+ +-+---+ 919 +---+---+ |Leaf | | Leaf| 920 +--+--+ +--+--+ 921 | | 922 IPv4 prefixes| |IPv4 prefixes 923 | | 924 +---+----+ +---+----+ 925 | V4 | | V4 | 926 | subnet | | subnet | 927 +--------+ +--------+ 929 Figure 8: IPv4 over IPv6 931 4.9. In-Band Reachability of Nodes 933 RIFT does not require that nodes of the fabric have reachable 934 addresses, but there may be operational reasons to reach the internal nodes. 935 Figure 9 shows an example in which the network management 936 station (NMS) attaches to Leaf1. 938 +-------+ +-------+ 939 | ToF1 | | ToF2 | 940 ++-----++ ++-----++ 941 | | | | 942 | +----------+ | 943 | +--------+ | | 944 | | | | 945 ++-----++ +--+---++ 946 |Spine1 | |Spine2 | 947 ++-----++ ++-----++ 948 | | | | 949 | +----------+ | 950 | +--------+ | | 951 | | | | 952 ++-----++ +--+---++ 953 | Leaf1 | | Leaf2 | 954 +---+---+ +-------+ 955 | 956 |NMS 958 Figure 9: In-Band reachability of node 960 If the NMS wants to access Leaf2, it simply works, because the loopback 961 address of Leaf2 is flooded in its Prefix North TIE. 963 If the NMS wants to access Spine2, it simply works too, because a spine 964 node always advertises its loopback address in the Prefix North TIE. 965 The NMS may reach Spine2 via Leaf1-Spine2 or Leaf1-Spine1-ToF1/ 966 ToF2-Spine2.
968 If the NMS wants to access ToF2, ToF2's loopback address needs to be 969 injected into its Prefix South TIE. This TIE must be seen by all 970 nodes at the level below - the spine nodes in Figure 9 - which must 971 form a ceiling for all the traffic coming from below (south). 972 Otherwise, the traffic from the NMS may follow the default route to the 973 wrong ToF node, e.g., ToF1. 975 In a fully connected ToF, in case of failure between ToF2 and spine 976 nodes, ToF2's loopback address must be disaggregated recursively all 977 the way to the leaves. 979 In a partitioned ToF, a ToF node is only reachable within its Plane, 980 and the disaggregation to the leaves is also required. A possible 981 alternative is to use the ring that interconnects the ToF nodes to 982 transmit packets between them for their loopback addresses only. The 983 idea is that this is mostly control traffic and should not alter the 984 load balancing properties of the fabric. 986 4.10. Dual Homing Servers 988 Each RIFT node may operate in Zero Touch Provisioning (ZTP) mode. Such a node 989 has no configuration (unless it is a Top-of-Fabric at the top of the 990 topology or it must operate in the topology as leaf and/or support 991 leaf-2-leaf procedures) and it will fully configure itself after 992 being attached to the topology.
994 +---+ +---+ +---+ 995 |ToF| |ToF| |ToF| ToF 996 +---+ +---+ +---+ 997 | | | | | | 998 | +----------------+ | | 999 | | | | | | 1000 | +----------------+ | 1001 | | | | | | 1002 +----------+--+ +--+----------+ 1003 | ToR1 | | ToR2 | Spine 1004 +--+------+---+ +--+-------+--+ 1005 +---+ | | | | | | +---+ 1006 | | | | | | | | 1007 | +-----------------+ | | | 1008 | | | +-------------+ | | 1009 + | + | | |-----------------+ | 1010 X | X | +--------x-----+ | X | 1011 + | + | | | + | 1012 +---+ +---+ +---+ +---+ 1013 | | | | | | | | 1014 +---+ +---+ ...............+---+ +---+ 1015 SV(1) SV(2) SV(n+1) SV(n) Leaf 1017 Figure 10: Dual-homing servers 1019 In a single plane, the worst case is the disaggregation of every 1020 other server at the same level. Suppose the links from ToR1 (Top of 1021 Rack) to all the leaves become unavailable. All the servers' 1022 routes are disaggregated and the FIB of each server is expanded 1023 with n-1 more specific routes. 1025 Some operators may prefer to disaggregate from the ToR to the servers from 1026 the start, i.e., the servers hold a few tens of routes in their FIB from 1027 the start, besides the default routes, to avoid breakages at rack level. 1028 Full disaggregation of the fabric could be achieved by configuration 1029 supported by RIFT. 1031 4.11. Fabric With A Controller 1033 There are many different ways to deploy the controller. One 1034 possibility is attaching a controller to the RIFT domain at the ToF and 1035 another possibility is attaching a controller at the leaf.
1037 +------------+ 1038 | Controller | 1039 ++----------++ 1040 | | 1041 | | 1042 +----++ ++----+ 1043 ------- | ToF | | ToF | 1044 | +--+--+ +-----+ 1045 | | | | | 1046 | | +-------------+ | 1047 | | +--------+ | | 1048 | | | | | 1049 +-----+ +-+---+ 1050 RIFT domain |Spine| |Spine| 1051 +--+--+ +-----+ 1052 | | | | | 1053 | | +-------------+ | 1054 | | +--------+ | | 1055 | | | | | 1056 | +-----+ +-+---+ 1057 ------- |Leaf | | Leaf| 1058 +-----+ +-----+ 1060 Figure 11: Fabric with a controller 1062 4.11.1. Controller Attached to ToFs 1064 If a controller attaches to the RIFT domain at the ToF, it usually 1065 uses dual-homed connections. The loopback prefix of the controller 1066 should be advertised southwards by the ToF and spine nodes to the leaves. If the 1067 controller loses its link to a ToF, that ToF must withdraw the prefix 1068 of the controller (different mechanisms can be used for this). 1070 4.11.2. Controller Attached to Leaf 1072 If the controller attaches to the fabric at a leaf, no special 1073 provisions are needed. 1075 4.12. Internet Connectivity With Underlay 1077 If global addressing is running without overlay, an external default 1078 route needs to be advertised through the RIFT fabric to achieve internet 1079 connectivity. For forwarding across the entire RIFT 1080 fabric, an internal fabric prefix needs to be advertised in the Prefix 1081 South TIE by ToF and spine nodes. 1083 4.12.1. Internet Default on the Leaf 1085 If an internet access request comes from a leaf and the 1086 internet gateway is another leaf, the leaf node acting as the internet 1087 gateway needs to advertise a default route in its Prefix North TIE. 1089 4.12.2. Internet Default on the ToFs 1091 If an internet access request comes from a leaf and the 1092 internet gateway is a ToF, the ToF and spine nodes need to advertise 1093 a default route in the Prefix South TIE. 1095 4.13.
Subnet Mismatch and Address Families 1097 +--------+ +--------+ 1098 | | LIE LIE | | 1099 | A | +----> <----+ | B | 1100 | +---------------------+ | 1101 +--------+ +--------+ 1102 X/24 Y/24 1104 Figure 12: Subnet mismatch 1106 LIEs are exchanged over all links running RIFT to perform Link 1107 (Neighbor) Discovery. A node MUST NOT originate LIEs on an address 1108 family if it does not process received LIEs on that family. LIEs on 1109 the same link are considered part of the same negotiation independent of 1110 the address family they arrive on. An implementation MUST be ready 1111 to accept TIEs on all addresses it used as source of LIE frames. 1113 As shown in Figure 12, without further checks an adjacency between 1114 nodes A and B may form, but the forwarding between node A and node B 1115 may fail because subnet X does not match subnet Y. 1117 To prevent this, a RIFT implementation should check for subnet 1118 mismatch, just as, e.g., IS-IS does. This can lead to scenarios where 1119 two nodes, despite exchanging LIEs in both address families, 1120 end up having an adjacency in a single AF only. This is a 1121 consideration especially in the scenarios of Section 4.8. 1123 4.14.
Anycast Considerations 1125 + traffic 1126 | 1127 v 1128 +------+------+ 1129 | ToF | 1130 +---+-----+---+ 1131 | | | | 1132 +------------+ | | +------------+ 1133 | | | | 1134 +---+---+ +-------+ +-------+ +---+---+ 1135 | | | | | | | | 1136 |Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1 1137 +-+---+-+ ++----+-+ +-+---+-+ ++----+-+ 1138 | | | | | | | | 1139 | +---------+ | | +---------+ | 1140 | | | | | | | | 1141 | +-------+ | | | +-------+ | | 1142 | | | | | | | | 1143 +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ 1144 | | | | | | | | 1145 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 1146 +-+-----+ ++------+ +-----+-+ +-----+-+ 1147 + + + ^ | 1148 PrefixA PrefixB PrefixA | PrefixC 1149 | 1150 + traffic 1152 Figure 13: Anycast 1154 If the traffic flows from the ToF towards Leaf111 or Leaf121, both of which hold the anycast 1155 prefix PrefixA, RIFT handles this case well. But if the 1156 traffic comes from Leaf122, it arrives at Spine21 or Spine22 at level 1. 1157 Spine21 and Spine22 do not know that another instance of PrefixA is attached to 1158 Leaf111, so the traffic always reaches Leaf121 and never Leaf111. 1159 If the intention is that the traffic should be offloaded to 1160 Leaf111, then policy guided prefixes, defined in "Routing in Fat 1161 Trees" [RIFT], can be used. 1163 4.15. IoT Applicability 1165 The design of RIFT inherits from RPL [RFC6550] the anisotropic design 1166 of a default route upwards (northwards); it also inherits the 1167 capability to inject external host routes at the Leaf level using 1168 Wireless ND (WiND) [RFC8505][RFC8928] between a RIFT-agnostic host 1169 and a RIFT router. Both the RPL and the RIFT protocols are meant for 1170 large scale, and WiND enables device mobility at the edge the same 1171 way in both cases. 1173 The main difference between RIFT and RPL is that with RPL, there is a 1174 single Root, whereas RIFT has many ToF nodes. This adds significant 1175 capabilities for leaf-2-leaf ECMP paths, but additional complexity 1176 with the need to disaggregate.
Also, RIFT uses Link State flooding 1177 northwards, and is not designed for low-power operation. 1179 Still, nothing prevents the IP devices connected at the Leaf from being 1180 IoT (Internet of Things) devices, which typically expose their 1181 address using WiND, which is an upgrade from 6LoWPAN ND [RFC6775]. 1183 A network that serves high-speed, high-power IoT devices should 1184 typically provide deterministic capabilities for applications such as 1185 high speed control loops or movement detection. The Fat Tree is 1186 highly reliable, and under normal conditions provides an equivalent 1187 multipath operation; but ECMP does not provide hard guarantees for 1188 either delivery or latency. As long as the fabric is non-blocking 1189 the result is the same; but there can be load imbalances resulting in 1190 incast and possibly congestion loss that will prevent delivery 1191 within bounded latency. 1193 This could be alleviated with Packet Replication, Elimination, and 1194 Ordering Functions (PREOF) [RFC8655] leaf-2-leaf, but PREOF is hard to provide 1195 at the scale of all flows, and the replication may increase the 1196 probability of the overload that it attempts to solve. 1198 Note that load balancing is not RIFT's problem, but it is key to 1199 serving IoT adequately. 1201 5. Security Considerations 1203 This document presents the applicability of RIFT. As such, it does not 1204 introduce any security considerations. However, a number 1205 of security concerns are described in [RIFT]. 1207 6. Contributors 1209 The following people (listed in alphabetical order) contributed 1210 significantly to the content of this document and should be 1211 considered co-authors: 1213 Tony Przygienda 1215 Juniper Networks 1217 1194 N. Mathilda Ave 1219 Sunnyvale, CA 94089 1221 US 1223 Email: prz@juniper.net 1225 7.
Normative References 1227 [ISO10589-Second-Edition] 1228 International Organization for Standardization, 1229 "Intermediate system to Intermediate system intra-domain 1230 routeing information exchange protocol for use in 1231 conjunction with the protocol for providing the 1232 connectionless-mode Network Service (ISO 8473)", November 1233 2002. 1235 [TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central 1236 Office Reference Architectural Framework", January 2018. 1238 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 1239 DOI 10.17487/RFC2328, April 1998, 1240 . 1242 [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, 1243 "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, 1244 DOI 10.17487/RFC4861, September 2007, 1245 . 1247 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1248 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1249 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1250 . 1252 [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., 1253 Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional 1254 Forwarding Detection (BFD) on Link Aggregation Group (LAG) 1255 Interfaces", RFC 7130, DOI 10.17487/RFC7130, February 1256 2014, . 1258 [RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network 1259 Layer Reachability Information with an IPv6 Next Hop", 1260 RFC 5549, DOI 10.17487/RFC5549, May 2009, 1261 . 1263 [RFC6550] Winter, T., Ed., Thubert, P., Ed., Brandt, A., Hui, J., 1264 Kelsey, R., Levis, P., Pister, K., Struik, R., Vasseur, 1265 JP., and R. Alexander, "RPL: IPv6 Routing Protocol for 1266 Low-Power and Lossy Networks", RFC 6550, 1267 DOI 10.17487/RFC6550, March 2012, 1268 . 1270 [RFC6775] Shelby, Z., Ed., Chakrabarti, S., Nordmark, E., and C. 1271 Bormann, "Neighbor Discovery Optimization for IPv6 over 1272 Low-Power Wireless Personal Area Networks (6LoWPANs)", 1273 RFC 6775, DOI 10.17487/RFC6775, November 2012, 1274 . 1276 [RFC8655] Finn, N., Thubert, P., Varga, B., and J. 
Farkas, 1277 "Deterministic Networking Architecture", RFC 8655, 1278 DOI 10.17487/RFC8655, October 2019, 1279 . 1281 [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and 1282 D. Afanasiev, "RIFT: Routing in Fat Trees", Work in 1283 Progress, Internet-Draft, draft-ietf-rift-rift-12, 26 May 1284 2020, 1285 . 1287 [I-D.white-distoptflood] 1288 White, R., Hegde, S., and S. Zandi, "IS-IS Optimal 1289 Distributed Flooding for Dense Topologies", Work in 1290 Progress, Internet-Draft, draft-white-distoptflood-04, 27 1291 July 2020, 1292 . 1294 8. Informative References 1296 [IEEEstd1588] 1297 IEEE standard for Information Technology, "IEEE Standard 1298 for a Precision Clock Synchronization Protocol for 1299 Networked Measurement and Control Systems", 1300 . 1302 [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer 1303 Communication Environments", IEEE International Parallel & 1304 Distributed Processing Symposium, 2011. 1306 [FATTREE] Leiserson, C. E., "Fat-Trees: Universal Networks for 1307 Hardware-Efficient Supercomputing", 1985. 1309 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 1310 "Network Time Protocol Version 4: Protocol and Algorithms 1311 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 1312 . 1314 [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 1315 (IPv6) Specification", STD 86, RFC 8200, 1316 DOI 10.17487/RFC8200, July 2017, 1317 . 1319 [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. 1320 Perkins, "Registration Extensions for IPv6 over Low-Power 1321 Wireless Personal Area Network (6LoWPAN) Neighbor 1322 Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, 1323 . 1325 [RFC8928] Thubert, P., Ed., Sarikaya, B., Sethi, M., and R. Struik, 1326 "Address-Protected Neighbor Discovery for Low-Power and 1327 Lossy Networks", RFC 8928, DOI 10.17487/RFC8928, November 1328 2020, . 
1330 Authors' Addresses 1332 Yuehua Wei (editor) 1333 ZTE Corporation 1334 No.50, Software Avenue 1335 Nanjing 1336 210012 1337 China 1339 Email: wei.yuehua@zte.com.cn 1340 Zheng Zhang 1341 ZTE Corporation 1342 No.50, Software Avenue 1343 Nanjing 1344 210012 1345 China 1347 Email: zhang.zheng@zte.com.cn 1349 Dmitry Afanasiev 1350 Yandex 1352 Email: fl0w@yandex-team.ru 1354 Tom Verhaeg 1355 Juniper Networks 1357 Email: tverhaeg@juniper.net 1359 Jaroslaw Kowalczyk 1360 Orange Polska 1362 Email: jaroslaw.kowalczyk2@orange.com 1364 Pascal Thubert 1365 Cisco Systems, Inc 1366 Building D 1367 45 Allee des Ormes - BP1200 1368 06254 MOUGINS - Sophia Antipolis 1369 France 1371 Phone: +33 497 23 26 34 1372 Email: pthubert@cisco.com