idnits 2.17.1 draft-wei-rift-applicability-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 838: '...scovery. A node MUST NOT originate LI...' RFC 2119 keyword, line 841: '...e on. An implementation MUST be ready...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (November 3, 2019) is 1636 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-21) exists of draft-ietf-rift-rift-08 == Outdated reference: A later version (-04) exists of draft-white-distoptflood-01 ** Downref: Normative reference to an Informational draft: draft-white-distoptflood (ref. 'I-D.white-distoptflood') -- Possible downref: Non-RFC (?) normative reference: ref. 
'ISO10589-Second-Edition' -- Possible downref: Non-RFC (?) normative reference: ref. 'TR-384' Summary: 4 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 RIFT WG Yuehua. Wei 3 Internet-Draft Zheng. Zhang 4 Intended status: Standards Track ZTE Corporation 5 Expires: May 6, 2020 Dmitry. Afanasiev 6 Yandex 7 Tom. Verhaeg 8 Interconnect Services B.V. 9 Jaroslaw. Kowalczyk 10 Orange Polska 11 November 3, 2019 13 RIFT Applicability 14 draft-wei-rift-applicability-02 16 Abstract 18 This document discusses the properties, applicability and operational 19 considerations of RIFT in different network scenarios. It intends to 20 provide a rough guide how RIFT can be deployed to simplify routing 21 operations in Clos topologies and their variations. 23 Status of This Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79. 28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at https://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on May 6, 2020. 40 Copyright Notice 42 Copyright (c) 2019 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (https://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. 
Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 58 2. Problem Statement of Routing in Modern IP Fabric Fat Tree 59 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 3 60 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 61 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 3 62 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 5 63 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 64 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 65 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 6 66 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 6 67 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 7 68 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 7 69 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 7 70 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 7 71 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 9 72 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 10 73 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 10 74 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 12 75 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 13 76 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 13 77 4.6. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 16 78 4.7. In-Band Reachability of Nodes . . . . . . . . . . . . . . 17 79 4.7.1. Reachability of Leafs . . . . . . . . . . . . . . . . 17 80 4.7.2. Reachability of Spines . . . . . . . . . . . . . . . 17 81 4.8. 
Dual Homing Servers . . . . . . . . . . . . . . . . . . . 17 82 4.9. Fabric With A Controller . . . . . . . . . . . . . . . . 18 83 4.9.1. Controller Attached to ToFs . . . . . . . . . . . . . 19 84 4.9.2. Controller Attached to Leaf . . . . . . . . . . . . . 19 85 4.10. Internet Connectivity Without Underlay . . . . . . . . . 19 86 4.10.1. Internet Default on the Leafs . . . . . . . . . . . 19 87 4.10.2. Internet Default on the ToFs . . . . . . . . . . . . 20 88 4.11. Subnet Mismatch and Address Families . . . . . . . . . . 20 89 4.12. Anycast Considerations . . . . . . . . . . . . . . . . . 20 90 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 21 91 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 21 92 7. Normative References . . . . . . . . . . . . . . . . . . . . 22 93 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 23 95 1. Introduction 97 This document explains the properties and applicability of 98 RIFT [I-D.ietf-rift-rift] in different deployment scenarios and 99 highlights the operational simplicity of the technology compared to 100 traditional routing solutions. It also documents special 101 considerations that apply when RIFT is used with or without overlays and 102 controllers, and when it handles topology miscablings and/or node and link 103 failures. 105 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks 107 Clos and Fat-Tree topologies have gained prominence in today's 108 networking, primarily as a result of the paradigm shift towards a 109 centralized data-center based architecture that is poised to deliver 110 a majority of computation and storage services in the future. 112 Today's routing protocols were originally geared towards networks with 113 an irregular topology and a low degree of connectivity. 114 When they are applied to Fat-Tree topologies: 116 o they tend to need extensive configuration or provisioning during 117 bring up and re-dimensioning.
119 o spine and leaf nodes have the entire network topology and routing 120 information, which is in fact not needed on the leaf nodes during 121 normal operation. 123 o significant duplication of Link State PDU (LSP) flooding between 124 spine nodes and leaf nodes occurs during network bring up and 125 topology updates. It consumes both spine and leaf nodes' CPU and 126 link bandwidth resources and with that limits protocol 127 scalability. 129 3. Applicability of RIFT to Clos IP Fabrics 131 The remainder of this document assumes that the reader is familiar 132 with the terms and concepts used in the OSPF [RFC2328] and IS-IS 133 [ISO10589-Second-Edition] link-state protocols and at least the 134 sections of RIFT [I-D.ietf-rift-rift] outlining the requirements of 135 routing in IP fabrics and RIFT protocol concepts. 137 3.1. Overview of RIFT 139 RIFT is a dynamic routing protocol for Clos and fat-tree network 140 topologies. It defines a link-state protocol when "pointing north" 141 and a path-vector protocol when "pointing south". 143 It floods flat link-state information northbound only so that each 144 level obtains the full topology of levels south of it. That 145 information is never flooded East-West or back South again. As a result, a top 146 tier node has the full set of prefixes from the SPF calculation. 148 In the southbound direction the protocol operates like a "fully 149 summarizing, unidirectional" path vector protocol or rather a 150 distance vector with implicit split horizon, where the information 151 propagates one hop south and is 're-advertised' by nodes at the next 152 lower level, normally as just the default route.
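The asymmetry described above (full topology information flowing north, only a default route flowing south) can be illustrated with a toy model. This is a minimal sketch and not RIFT itself: the fabric in NORTH, the prefixes in PREFIXES and the function names are hypothetical, and TIEs, SPF and disaggregation are not modelled.

```python
from collections import defaultdict

# Hypothetical three-level fabric: node -> list of northbound neighbors.
NORTH = {
    "leaf1": ["spine1", "spine2"],
    "leaf2": ["spine1", "spine2"],
    "spine1": ["tof1"],
    "spine2": ["tof1"],
    "tof1": [],
}
# Hypothetical prefixes attached to the leafs.
PREFIXES = {"leaf1": {"10.0.1.0/24"}, "leaf2": {"10.0.2.0/24"}}


def northbound_view(north, prefixes):
    """Return, per node, the prefixes it originates or learns from the south."""
    view = defaultdict(set)
    for node, own in prefixes.items():
        view[node] |= own
    # Propagate strictly northbound until nothing changes any more.
    changed = True
    while changed:
        changed = False
        for node, parents in north.items():
            for parent in parents:
                before = len(view[parent])
                view[parent] |= view[node]
                changed |= len(view[parent]) != before
    return view


def routing_table(node, north, view):
    """Specific routes towards the south, plus a default if a northbound neighbor exists."""
    routes = set(view[node])
    if north[node]:
        routes.add("0.0.0.0/0")
    return routes
```

In this model a leaf ends up with its own prefix and a default route, while the top-of-fabric node holds every prefix and no default.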
154 +-----------+ +-----------+ 155 | ToF | | ToF | LEVEL 2 156 + +-----+--+--+ +-+--+------+ 157 | | | | | | | | | ^ 158 + | | | +-------------------------+ | 159 Distance | +-------------------+ | | | | | 160 Vector | | | | | | | | + 161 South | | | | +--------+ | | | Link+State 162 + | | | | | | | | Flooding 163 | | | +-------------+ | | | North 164 v | | | | | | | | + 165 +-+--+-+ +------+ +-------+ +--+--+-+ | 166 |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 167 + ++----++ ++---+-+ +--+--+-+ ++----+-+ | 168 + | | | | | | | | | ^ N 169 Distance | +-------+ | | +--------+ | | | E 170 Vector | | | | | | | | | +------> 171 South | +-------+ | | | +-------+ | | | | 172 + | | | | | | | | | + 173 v ++--++ +-+-++ ++-+-+ +-+--++ + 174 |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 175 +----+ +----+ +----+ +-----+ 177 Figure 1: RIFT overview 179 A middle tier node has only the information necessary for its level, 180 which is all destinations south of the node based on SPF 181 calculation, the default route and potential disaggregated routes. 183 RIFT combines the advantages of both Link-State and Distance Vector: 185 o Fastest Possible Convergence 187 o Automatic Detection of Topology 188 o Minimal Routes/Info on TORs 190 o High Degree of ECMP 192 o Fast De-commissioning of Nodes 194 o Maximum Propagation Speed with Flexible Prefixes in an Update 196 RIFT also eliminates the disadvantages of Link-State or Distance 197 Vector: 199 o Reduced and Balanced Flooding 201 o Automatic Neighbor Detection 203 There are therefore two types of link state database: "north 204 representation" N-TIEs and "south representation" S-TIEs. The N-TIEs 205 contain a link state topology description of lower levels and S-TIEs 206 simply carry default routes for the lower levels. 208 Further advantages unique to RIFT are listed below; the details 209 behind each of them are given in the RIFT specification 210 [I-D.ietf-rift-rift].
212 o True ZTP 214 o Minimal Blast Radius on Failures 216 o Can Utilize All Paths Through Fabric Without Looping 218 o Automatic Disaggregation on Failures 220 o Simple Leaf Implementation that Can Scale Down to Servers 222 o Key-Value Store 224 o Horizontal Links Used for Protection Only 226 o Supports Non-Equal Cost Multipath and Can Replace MC-LAG 228 o Optimal Flooding Reduction and Load-Balancing 230 3.2. Applicable Topologies 232 Although RIFT is specified primarily for "proper" Clos or "fat-tree" 233 structures, it already supports PoD concepts which are strictly 234 speaking not found in the original Clos concepts. 236 Further, the specification explains and supports operations of 237 multi-plane Clos variants where the protocol relies on a set of rings to 238 allow the reconciliation of the topology views of different planes, as the most 239 desirable solution for making proper disaggregation viable in case of 240 failures. These observations hold not only for RIFT but also in the 241 generic case of dynamic routing on Clos variants with multiple planes 242 and failures in bi-sectional bandwidth, especially on the leafs. 244 3.2.1. Horizontal Links 246 RIFT is not limited to pure Clos divided into PoDs and multiple planes 247 but supports horizontal links below the top-of-fabric level. Those 248 links are, however, used only as routes of last resort northbound when 249 a spine loses all northbound links or cannot compute a default route 250 through them. 252 A possible configuration is a "ring" of horizontal links at a level. 253 In the presence of such a "ring" at any level (except the ToF level) neither 254 N-SPF nor S-SPF will provide a "ring-based protection" scheme since 255 such a computation would necessarily have to deal with breaking 256 "loops" in the Dijkstra sense, an application for which RIFT is not 257 intended.
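The "route of last resort" role of horizontal links described above can be sketched as a simple next-hop selection. This is an illustrative assumption about how an implementation might structure the choice, not the N-SPF computation of the RIFT specification; the function name and both input dictionaries are hypothetical.

```python
def northbound_next_hops(north, horizontal):
    """Pick next hops for the northbound default route.

    north:      dict neighbor -> True if that northbound link is usable.
    horizontal: dict neighbor -> True if that same-level neighbor still
                has northbound connectivity of its own.
    """
    primary = sorted(name for name, usable in north.items() if usable)
    if primary:
        # Any usable northbound link wins; horizontal links stay unused.
        return primary
    # Last resort: same-level neighbors that can still reach the north.
    return sorted(name for name, ok in horizontal.items() if ok)
```

With at least one usable northbound neighbor the horizontal neighbors are ignored; they are returned only once every northbound link is gone.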
259 Full-mesh connectivity between nodes on the same level can be 260 employed; it allows N-SPF to let any node losing all 261 its northbound adjacencies still participate in northbound 262 forwarding (as long as any of the other nodes in the 263 level is northbound connected). 265 3.2.2. Vertical Shortcuts 267 Through relaxations of the specified adjacency forming rules RIFT 268 implementations can be extended to support vertical "shortcuts" as 269 proposed by e.g. [I-D.white-distoptflood]. The RIFT specification 270 itself does not provide the exact details since the resulting 271 solution suffers from either much larger blast radii with increased 272 flooding volumes or, in case of maximum aggregation, routing bow-tie 273 problems. 275 3.3. Use Cases 277 3.3.1. DC Fabrics 279 RIFT is largely driven by the demands of data center IP fabric underlays and is hence ideally suited for 280 them; the vast majority of 281 these fabrics seem to be currently (and for the foreseeable future) Clos 282 architectures. It significantly simplifies operation and deployment 283 of such fabrics, as described in Section 4, compared 284 to extensive proprietary provisioning and operational solutions. 286 3.3.2. Metro Fabrics 288 The demand for bandwidth is increasing steadily, driven primarily by 289 environments close to content producers (server farm connections via 290 DC fabrics) but in proximity to content consumers as well. Consumers 291 are often clustered in metro areas with their own network 292 architectures that can benefit from simplified, regular Clos 293 structures and hence RIFT. 295 3.3.3. Building Cabling 297 Commercial edifices are often cabled in topologies that are either 298 Clos or its isomorphic equivalents.
With many floors the Clos can 299 grow rather high and with that present a challenge for traditional 300 routing protocols (except BGP and by now largely phased-out PNNI) 301 which do not support an arbitrary number of levels, which RIFT does 302 naturally. Moreover, due to limited sizes of forwarding tables in 303 active elements of building cabling, the minimum FIB size RIFT 304 maintains under normal conditions can prove particularly 305 cost-effective in terms of hardware and operational costs. 307 3.3.4. Internal Router Switching Fabrics 309 It is common in high-speed communications switching and routing 310 devices to use fabrics when a crossbar is not feasible due to cost, 311 head-of-line blocking or size trade-offs. Normally such fabrics are 312 not self-healing or rely on 1:1 or 1+1 protection schemes, but it is 313 conceivable to use RIFT to operate Clos fabrics that can deal 314 effectively with interconnection or subsystem failures in such 315 modules. RIFT is not IP specific either, and hence any link addressing 316 connecting internal device subnets is conceivable. 318 3.3.5. CloudCO 320 The Cloud Central Office (CloudCO) is a new stage of the telecom Central 321 Office. It takes advantage of Software Defined Networking (SDN) 322 and Network Function Virtualization (NFV) in conjunction with general 323 purpose hardware to optimize current networks. The following figure 324 illustrates this architecture at a high level. It describes a single 325 instance or macro-node of CloudCO. An Access I/O module faces a 326 Cloud CO Access Node, and the CPEs behind it. A Network I/O module 327 faces the core network. The two I/O modules are interconnected 328 by a leaf and spine fabric.
[TR-384] 329 +---------------------+ +----------------------+ 330 | Spine | | Spine | 331 | Switch | | Switch | 332 +------+---+------+-+-+ +--+-+-+-+-----+-------+ 333 | | | | | | | | | | | | 334 | | | | | +-------------------------------+ | 335 | | | | | | | | | | | | 336 | | | | +-------------------------+ | | | 337 | | | | | | | | | | | | 338 | | +----------------------+ | | | | | | | | 339 | | | | | | | | | | | | 340 | +---------------------------------+ | | | | | | | 341 | | | | | | | | | | | | 342 | | | +-----------------------------+ | | | | | 343 | | | | | | | | | | | | 344 | | | | | +--------------------+ | | | | 345 | | | | | | | | | | | | 346 +--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+ 347 |L | | Leaf | | Leaf | | Leaf | | Leaf | |L | 348 |S | | Switch | | Switch | | Switch | | Switch| |S | 349 ++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++ 350 | | | | | | | | | | | | | | 351 | +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ | 352 | |Compute | |Compute | | Compute | |Compute| | 353 | |Node | |Node | | Node | |Node | | 354 | +--------+ +--------+ +----------+ +-------+ | 355 | || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || | 356 | |--------| |--------| |----------| |-------| | 357 | |--------| |--------| |----------| |-------| | 358 | || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || | 359 | |--------| |--------| |----------| |-------| | 360 | |--------| |--------| |----------| |-------| | 361 | || VAS7 || || VAS4 || || vIGMP || ||BAA || | 362 | |--------| |--------| |----------| |-------| | 363 | +--------+ +--------+ +----------+ +-------+ | 364 | | 365 ++-----------+ +---------++ 366 |Network I/O | |Access I/O| 367 +------------+ +----------+ 369 Figure 2: An example of CloudCO architecture 371 The Spine-Leaf architecture deployed inside CloudCO meets the 372 network requirements of being adaptable, agile, scalable and dynamic. 374 4.
Deployment Considerations 376 RIFT presents the opportunity for organizations building and 377 operating IP fabrics to simplify their operation and deployments 378 while achieving many desirable properties of dynamic routing on 379 such a substrate: 381 o The RIFT design follows a philosophy of minimum blast radius and minimum necessary 382 epistemological scope, which leads to very good scaling 383 properties while delivering maximum reactiveness. 385 o RIFT allows for extensive Zero Touch Provisioning within the 386 protocol. In its most extreme version RIFT does not rely on any 387 specific addressing and for an IP fabric can operate using IPv6 ND 388 [RFC4861] only. 390 o RIFT has provisions to detect common IP fabric mis-cabling 391 scenarios. 393 o RIFT automatically negotiates BFD per link, which allows 394 IP and micro-BFD [RFC7130] to replace LAGs, which hide bandwidth 395 imbalances in case of constituent failures. Further automatic 396 link validation techniques similar to [RFC5357] could be supported 397 as well. 399 o RIFT inherently solves many difficult problems associated with the 400 use of traditional routing in topologies with dense meshes and high 401 degrees of ECMP by including automatic bandwidth balancing, flood 402 reduction and automatic disaggregation on failures while providing 403 maximum aggregation of prefixes in default scenarios. 405 o RIFT reduces FIB size towards the bottom of the IP fabric where 406 most nodes reside and with that allows for cheaper hardware on the 407 edges and the introduction of modern IP fabric architectures that 408 encompass e.g. server multi-homing. 410 o RIFT provides valley-free routing and with that is loop free. 411 This allows the use of any such valley-free path in bi-sectional 412 fabric bandwidth between two destinations irrespective of their 413 metrics, which can be used to balance load on the fabric in 414 different ways.
416 o RIFT includes a key-value distribution mechanism which allows for 417 many future applications such as automatic provisioning of basic 418 overlay services or automatic key roll-overs over whole fabrics. 420 o RIFT is designed for minimum delay in case of prefix mobility on 421 the fabric. 423 o Many further operational and design points collected over many 424 years of routing protocol deployments have been incorporated in 425 RIFT such as fast flooding rates, protection of information 426 lifetimes and operationally easily recognizable remote ends of 427 links and node names. 429 4.1. South Reflection 431 South reflection is a mechanism by which South Node TIEs are "reflected" 432 back up north to allow nodes in the same level without E-W links to "see" 433 each other. 435 For example, in the topology shown in Figure 3, Spine111\Spine112\Spine121\Spine122 each reflect Node S-TIEs 436 from ToF21 to ToF22. Likewise, 437 Spine111\Spine112\Spine121\Spine122 each reflect Node S-TIEs from ToF22 438 to ToF21. So ToF22 and ToF21 see each other's node 439 information as level 2 nodes. 441 In an equivalent fashion, as the result of the south reflection 442 between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, 443 Spine121 and Spine122 know each other at level 1. 445 4.2.
Suboptimal Routing on Link Failures 446 +--------+ +--------+ 447 | ToF21 | | ToF22 | LEVEL 2 448 ++--+-+-++ ++-+--+-++ 449 | | | | | | | + 450 | | | | | | | linkTS8 451 +-------------+ | +-+linkTS3+-+ | | | +--------------+ 452 | | | | | | + | 453 | +----------------------------+ | linkTS7 | 454 | | | | + + + | 455 | | | +-------+linkTS4+------------+ | 456 | | | + + | | | 457 | | | +------------+--+ | | 458 | | | | | linkTS6 | | 459 +-+----++ ++-----++ ++------+ ++-----++ 460 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 461 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 462 | | | | | | | | 463 | +--------------+ | + ++XX+linkSL6+---+ + 464 | | | | linkSL5 | | linkSL8 465 | +------------+ | | + +---+linkSL7+-+ | + 466 | | | | | | | | 467 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 468 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 469 +-+-----+ ++------+ +-----+-+ +-+-----+ 470 + + + + 471 Prefix111 Prefix112 Prefix121 Prefix122 473 Figure 3: Suboptimal routing upon link failure use case 475 As shown in Figure 3, as the result of the south reflection between 476 Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and 477 Spine122 know each other at level 1. 479 Without the disaggregation mechanism, when linkSL6 fails, a packet from 480 Leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 481 then go down through linkTS4 to linkSL8 to Leaf122, or go up through 482 linkSL5 to linkTS6 then go down through linkTS8 and linkSL8 to 483 Leaf122, based on the pure default route. This is a case of suboptimal 484 routing, or bow-tieing. 486 With the disaggregation mechanism, when linkSL6 fails, Spine122 will 487 detect the failure according to the reflected node S-TIE from 488 Spine121. Based on the disaggregation algorithm provided by RIFT, 489 Spine122 will explicitly advertise prefix122 in Disaggregated Prefix 490 S-TIE PrefixesElement(prefix122, cost 1).
A packet from Leaf121 to 491 prefix122 will then be sent only over linkSL7, following a longest-prefix 492 match on prefix122, and go directly down through linkSL8 to Leaf122. 495 4.3. Black-Holing on Link Failures 497 +--------+ +--------+ 498 | ToF 21 | | ToF 22 | LEVEL 2 499 ++-+--+-++ ++-+--+-++ 500 | | | | | | | | 501 | | | | | | | linkTS8 502 +--------------+ | +--linkTS3-X+ | | | +--------------+ 503 linkTS1 | | | | | | | 504 | +-----------------------------+ | linkTS7 | 505 | | | | | | | | 506 | | linkTS2 +--------linkTS4-X-----------+ | 507 | | | | | | | | 508 | linkTS5 +-+ +---------------+ | | 509 | | | | | linkTS6 | | 510 +-+----++ +-+-----+ ++----+-+ ++-----++ 511 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 512 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 513 | | | | | | | | 514 | +---------------+ | | +----linkSL6----+ | 515 linkSL1 | | | linkSL5 | | linkSL8 516 | +---linkSL3---+ | | | +----linkSL7--+ | | 517 | | | | | | | | 518 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 519 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 520 +-+-----+ ++------+ +-----+-+ +-+-----+ 521 + + + + 522 Prefix111 Prefix112 Prefix121 Prefix122 524 Figure 4: Black-holing upon link failure use case 526 This scenario illustrates a case where a double link failure occurs and 527 black-holing can result. 529 Without the disaggregation mechanism, when linkTS3 and linkTS4 both fail, 530 traffic from Leaf111 to prefix122 would suffer 50% black-holing 531 based on the pure default route. A packet that goes up through 532 linkSL1 to linkTS1 and would then go down through linkTS3 or linkTS4 will be 533 dropped. A packet that goes up through linkSL3 to linkTS2 534 and would then go down through linkTS3 or linkTS4 will be dropped as well. 535 This is a case of black-holing. 537 With the disaggregation mechanism, when linkTS3 and linkTS4 both fail, 538 ToF22 will detect the failure according to the reflected node S-TIE 539 of ToF21 from Spine111\Spine112\Spine121\Spine122.
Based on the 540 disaggregation algorithm provided by RIFT, ToF22 will explicitly 541 originate an S-TIE with prefix121 and prefix122, which is flooded to 542 Spine111, Spine112, Spine121 and Spine122. 544 A packet from Leaf111 to prefix122 will not be routed to linkTS1 or 545 linkTS2. It will only be routed to 546 linkTS5 or linkTS7 following a longest-prefix match on prefix122. 548 4.4. Zero Touch Provisioning (ZTP) 550 Each RIFT node may operate in zero touch provisioning (ZTP) mode. It 551 has no configuration (unless it is a Top-of-Fabric at the top of the 552 topology or it is desired to confine it to the leaf role without leaf-2-leaf 553 procedures). In such a case RIFT will fully configure the node's level 554 after it is attached to the topology. 556 The most important component for ZTP is the automatic level derivation 557 procedure. All the Top-of-Fabric nodes are explicitly marked with the 558 TOP_OF_FABRIC flag and serve as the initial 'seeds' needed for other ZTP 559 nodes to derive their level in the topology. The derivation of the 560 level of each node then happens based on LIEs received from its 561 neighbors, whereby each node (with the possible exception of configured 562 leafs) tries to attach at the highest possible point in the fabric. 563 This guarantees that even if the diffusion front reaches a node from 564 "below" faster than from "above", it will greedily abandon an already 565 negotiated level derived from nodes topologically below it and 566 properly peer with nodes above. 568 4.5.
Miscabling Examples 569 +----------------+ +-----------------+ 570 | ToF21 | +------+ ToF22 | LEVEL 2 571 +-------+----+---+ | +----+---+--------+ 572 | | | | | | | | | 573 | | | +----------------------------+ | 574 | +---------------------------+ | | | | 575 | | | | | | | | | 576 | | | | +-----------------------+ | | 577 | | +------------------------+ | | | 578 | | | | | | | | | 579 +-+---+-+ +-+---+-+ | +-+---+-+ +-+---+-+ 580 |Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1 581 +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ 582 | | | | | | | | | 583 | +---------+ | link-M | +---------+ | 584 | | | | | | | | | 585 | +-------+ | | | | +-------+ | | 586 | | | | | | | | | 587 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 588 |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 589 +-------+ +-------+ +-------+ +-------+ 591 Figure 5: A single plane miscabling example 593 Figure 5 shows a single plane miscabling example. It is a 594 perfect fat-tree fabric except for link-M connecting Leaf112 to ToF22. 596 The RIFT control protocol can discover the physical links 597 automatically and is able to detect cabling that violates fat-tree 598 topology constraints. It reacts accordingly to such mis-cabling 599 attempts, at a minimum preventing adjacencies between nodes from 600 being formed and traffic from being forwarded on those mis-cabled 601 links. In such a scenario Leaf112 will use link-M to derive its level 602 (unless it is configured as a leaf) and can report the links to Spine111 and Spine112 as 603 miscabled unless the implementation allows horizontal links. 605 Figure 6 shows a multiple plane miscabling example. Since 606 Leaf112 and Spine121 belong to two different PoDs, the adjacency 607 between Leaf112 and Spine121 cannot be formed. link-W would be 608 detected as miscabled and the adjacency prevented.
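The two kinds of check discussed above (level distance and PoD membership) can be sketched as follows. This is a simplified illustration that assumes both nodes already know their level and PoD; the Node class, the function name, the returned strings and the convention that PoD 0 means "any PoD" are assumptions of this sketch, and the authoritative adjacency rules are those in [I-D.ietf-rift-rift].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    level: int
    pod: int  # assumption of this sketch: 0 means "any PoD"


def adjacency_problem(a, b, allow_horizontal=False):
    """Return a reason string if the link a-b looks miscabled, else None."""
    if a.pod and b.pod and a.pod != b.pod:
        return "PoD mismatch"
    difference = abs(a.level - b.level)
    if difference == 0:
        return None if allow_horizontal else "horizontal link not allowed"
    if difference > 1:
        return "level difference greater than one"
    return None
```

With levels and PoDs assigned as drawn in the figures, the Leaf112 to ToF22 link is rejected because of its level difference and the Leaf112 to Spine121 link because of the PoD mismatch, while a regular Leaf112 to Spine112 link passes.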
610 +-------+ +-------+ +-------+ +-------+ 611 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 612 +-------+ +-------+ +-------+ +-------+ 613 | | | | | | | | 614 | | | +-----------------+ | | | 615 | +--------------------------+ | | | | 616 | | | | | | | | 617 | +------+ | | | +------+ | 618 | | +-----------------+ | | | | | 619 | | | +--------------------------+ | | 620 | A | | B | | A | | B | 621 +-----+-+ +-+---+-+ +-+---+-+ +-+-----+ 622 |Spin111| |Spin112| +----+Spin121| |Spin122| LEVEL 1 623 +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ 624 | | | | | | | | | 625 | +---------+ | | | +---------+ | 626 | | | | link-W | | | | 627 | +-------+ | | | | +-------+ | | 628 | | | | | | | | | 629 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 630 |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 631 +-------+ +-------+ +-------+ +-------+ 633 +--------PoD#1----------+ +---------PoD#2---------+ 635 Figure 6: A multiple plane miscabling example 637 RIFT provides an optional level determination procedure in its Zero 638 Touch Provisioning mode. Nodes in the fabric without their level 639 configured determine it automatically. This can, however, have 640 counter-intuitive consequences. One extreme failure scenario 641 is depicted in Figure 7: if all northbound links of 642 Spine11 fail at the same time, Spine11 negotiates a lower level than 643 Leaf111 and Leaf112. 645 To prevent such a scenario, in which leafs are expected to act as switches, 646 the LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is 647 invalid, Spine11 would not derive a valid level from the topology in 648 Figure 7. It will be isolated from the whole fabric and it would be 649 up to the leafs to declare the links towards such a spine as miscabled.
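The arithmetic behind the fallen-spine case can be sketched as follows. This is a deliberately simplified model of level derivation, assuming a node takes the highest level offered by its neighbors minus one and treats a negative result as invalid; the function name is hypothetical and the real ZTP procedure in [I-D.ietf-rift-rift] has further rules that are not shown here.

```python
def derive_level(offered_levels):
    """Return the highest offered level minus one, or None if no valid level results."""
    offers = [level for level in offered_levels if level is not None]
    if not offers:
        return None
    level = max(offers) - 1
    # A level below 0 is invalid, so the node stays without a level.
    return level if level >= 0 else None
```

A spine that hears two ToFs at level 2 derives level 1. Once Spine11 has lost its northbound links and hears only leafs pinned at level 0, the result would be -1, so the function returns None and the spine remains isolated.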
651 +-------+ +-------+ +-------+ +-------+ 652 |ToF A1| |ToF A2| |ToF A1| |ToF A2| 653 +-------+ +-------+ +-------+ +-------+ 654 | | | | | | 655 | +-------+ | | | 656 + + | | ====> | | 657 X X +------+ | +------+ | 658 + + | | | | 659 +----+--+ +-+-----+ +-+-----+ 660 |Spine11| |Spine12| |Spine12| 661 +-+---+-+ ++----+-+ ++----+-+ 662 | | | | | | 663 | +---------+ | | | 664 | | | | | | 665 | +-------+ | | +-------+ | 666 | | | | | | 667 +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ 668 |Leaf111| |Leaf112| |Leaf111| |Leaf112| 669 +-------+ +-------+ +-+-----+ +-+-----+ 670 | | 671 | +--------+ 672 | | 673 +-+---+-+ 674 |Spine11| 675 +-------+ 677 Figure 7: Fallen spine 679 4.6. IPv4 over IPv6 681 RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. The IPv6 682 address family is configured via the usual ND mechanisms, and IPv4 can then use IPv6 683 next hops analogous to RFC 5549. It is expected that the whole fabric 684 supports the same type of forwarding of address families on all the 685 links. RIFT provides an indication of whether a node is IPv4 forwarding 686 capable, and implementations are possible where different routing 687 tables are computed per address family as long as the computation 688 remains loop-free. 690 +-----+ +-----+ 691 +---+---+ | ToF | | ToF | 692 ^ +--+--+ +-----+ 693 | | | | | 694 | | +-------------+ | 695 | | +--------+ | | 696 | | | | | 697 V6 +-----+ +-+---+ 698 Forwarding |SPINE| |SPINE| 699 | +--+--+ +-----+ 700 | | | | | 701 | | +-------------+ | 702 | | +--------+ | | 703 | | | | | 704 v +-----+ +-+---+ 705 +---+---+ |LEAF | | LEAF| 706 +--+--+ +--+--+ 707 | | 708 IPv4 prefixes| |IPv4 prefixes 709 | | 710 +---+----+ +---+----+ 711 | V4 | | V4 | 712 | subnet | | subnet | 713 +--------+ +--------+ 715 Figure 8: IPv4 over IPv6 717 4.7. In-Band Reachability of Nodes 719 4.7.1. Reachability of Leafs 721 TODO 723 4.7.2. Reachability of Spines 725 TODO 727 4.8. Dual Homing Servers 729 Each RIFT node may operate in zero touch provisioning (ZTP) mode.
   It then needs no configuration (unless it is a Top-of-Fabric node at
   the top of the topology, or it must operate in the topology as a leaf
   and/or support leaf-2-leaf procedures) and it will fully configure
   itself after being attached to the topology.

       +---+         +---+         +---+
       |ToF|         |ToF|         |ToF|
       +---+         +---+         +---+
       |   |         |   |         |   |
       |   +----------------+      |   |
       |             |   |  |      |   |
       |          +----------------+   |
       |          |  |   |  |          |
       +----------+--+   +--+----------+
       | Spine|ToR1  |   | Spine|ToR2  |
       +--+------+---+   +--+-------+--+
   +---+  |      |   |   |  |       |  +---+
   |      |      |   |   |  |       |      |
   |   +-----------------+  |       |      |
   |   |  |   +-------------+       |      |
   +   |  +   |  |   +-----------------+   |
   X   |  X   |  +--------x-----+   |  X   |
   +   |  +   |                 |   |  +   |
   +---+  +---+                 +---+  +---+
   |   |  |   |                 |   |  |   |
   +---+  +---+  ...............+---+  +---+
   SV(1)  SV(2)                SV(n-1) SV(n)

                      Figure 9: Dual-homing servers

   In a single plane, the worst case is disaggregation of every other
   server at the same level.  Suppose the links from ToR1 to all the
   leaves become unavailable.  All the servers' routes are then
   disaggregated, and the FIB of each server is expanded with n-1 more
   specific routes.

   Sometimes, people may prefer to disaggregate from the ToR to the
   servers from the start, i.e., the servers have a couple of tens of
   routes in their FIB from the start, besides the default routes, to
   avoid breakages at rack level.  Full disaggregation of the fabric can
   be achieved by configuration supported by RIFT.

4.9.  Fabric With A Controller

   There are many different ways to deploy the controller.  One
   possibility is attaching the controller to the RIFT domain at the
   ToF; another is attaching it at a leaf.

                      +------------+
                      | Controller |
                      ++----------++
                       |          |
                       |          |
                  +----++        ++----+
   ----------     | ToF |        | ToF |
   |              +--+--+        +-----+
   |              |  |           |     |
   |              |  +-------------+   |
   |              |     +--------+ |   |
   |              |     |          |   |
                  +-----+        +-+---+
   RIFT domain    |SPINE|        |SPINE|
                  +--+--+        +-----+
   |              |  |           |     |
   |              |  +-------------+   |
   |              |     +--------+ |   |
   |              |     |          |   |
   |              +-----+        +-+---+
   ----------     |LEAF |        | LEAF|
                  +-----+        +-----+

                  Figure 10: Fabric with a controller

4.9.1.  Controller Attached to ToFs

   If a controller is attached to the RIFT domain at the ToF, it usually
   uses dual-homed connections.  The loopback prefix of the controller
   should be advertised southbound by the ToF and spine nodes to the
   leaves.  If the controller loses its link to a ToF, the operator
   needs to make sure that the ToF withdraws the prefix of the
   controller (different mechanisms can be used for this).

4.9.2.  Controller Attached to Leaf

   If the controller is attached to the fabric at a leaf, no special
   provisions are needed.

4.10.  Internet Connectivity Without Underlay

4.10.1.  Internet Default on the Leafs

   TODO

4.10.2.  Internet Default on the ToFs

   TODO

4.11.  Subnet Mismatch and Address Families

   +--------+                     +--------+
   |        |  LIE           LIE  |        |
   |   A    | +---->       <----+ |   B    |
   |        +---------------------+        |
   +--------+                     +--------+
      X/24                           Y/24

                       Figure 11: Subnet mismatch

   LIEs are exchanged over all links running RIFT to perform Link
   (Neighbor) Discovery.  A node MUST NOT originate LIEs on an address
   family if it does not process received LIEs on that family.  LIEs on
   the same link are considered part of the same negotiation,
   independent of the address family they arrive on.  An implementation
   MUST be ready to accept TIEs on all addresses it used as source of
   LIE frames.

   As shown in Figure 11, without further checks an adjacency between
   node A and node B may form, but forwarding between them may fail
   because subnet X does not match subnet Y.

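   The condition in Figure 11 can be detected by comparing the subnets
   configured on both ends of the link.  The following is a minimal
   sketch using Python's standard ipaddress module.  The function name
   same_subnet and the example addresses (taken from documentation
   ranges) are assumptions for illustration; how the neighbor's address
   and prefix length are learned is outside the scope of the sketch.

```python
import ipaddress


def same_subnet(local_if: str, remote_if: str) -> bool:
    """Return True if both interface addresses, given in CIDR
    notation, belong to the same address family and the same subnet."""
    local = ipaddress.ip_interface(local_if)
    remote = ipaddress.ip_interface(remote_if)
    # Addresses of different families can never share a subnet.
    if local.version != remote.version:
        return False
    return local.network == remote.network


# Node A uses X/24 and node B uses Y/24, as in Figure 11.
mismatch = same_subnet("192.0.2.1/24", "198.51.100.1/24")
# Both ends of the link are configured in the same /24.
match = same_subnet("192.0.2.1/24", "192.0.2.2/24")
```

   A check of this kind would be applied per address family, so a link
   can pass the check for one family and fail it for the other.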
   To prevent such forwarding failures, a RIFT implementation should
   check for subnet mismatch, just as, e.g., IS-IS does.  This can lead
   to scenarios where an adjacency, despite the exchange of LIEs in both
   address families, ends up being formed in a single AF only.  This is
   a consideration especially in the scenarios of Section 4.6.

4.12.  Anycast Considerations

                           + traffic
                           |
                           v
                    +------+------+
                    |     ToF     |
                    +---+-----+---+
                    |   |     |   |
       +------------+   |     |   +------------+
       |                |     |                |
   +---+---+    +-------+     +-------+    +---+---+
   |       |    |       |     |       |    |       |
   |Spine11|    |Spine12|     |Spine21|    |Spine22| LEVEL 1
   +-+---+-+    ++----+-+     +-+---+-+    ++----+-+
     |   |       |    |         |   |       |    |
     |   +---------+  |         |   +---------+  |
     |           | |  |         |           | |  |
     |   +-------+ |  |         |   +-------+ |  |
     |   |         |  |         |   |         |  |
   +-+---+-+    +--+--+-+     +-+---+-+    +--+--+-+
   |       |    |       |     |       |    |       |
   |Leaf111|    |Leaf112|     |Leaf121|    |Leaf122| LEVEL 0
   +-+-----+    ++------+     +-----+-+    +-----+-+
     +           +                  +       ^    |
   PrefixA    PrefixB             PrefixA    | PrefixC
                                            |
                                            + traffic

                           Figure 12: Anycast

   If traffic arrives from the ToF and is destined to Leaf111 or
   Leaf121, both of which advertise the anycast prefix PrefixA, RIFT
   deals with this case well.  But if the traffic comes from Leaf122, it
   will always reach Leaf121 and never Leaf111.  If the intention is
   that the traffic should be offloaded to Leaf111, then policy guided
   prefixes [PGP reference] should be used.

5.  Acknowledgements

6.  Contributors

   The following people (listed in alphabetical order) contributed
   significantly to the content of this document and should be
   considered co-authors:

      Tony Przygienda
      Juniper Networks
      1194 N. Mathilda Ave
      Sunnyvale, CA 94089
      US

      Email: prz@juniper.net

7.  Normative References

   [I-D.ietf-rift-rift]
              Przygienda, T., Sharma, A., Thubert, P., and D. Afanasiev,
              "RIFT: Routing in Fat Trees", draft-ietf-rift-rift-08
              (work in progress), September 2019.

   [I-D.white-distoptflood]
              White, R., Hegde, S., and S. Zandi, "IS-IS Optimal
              Distributed Flooding for Dense Topologies",
              draft-white-distoptflood-01 (work in progress),
              September 2019.

   [ISO10589-Second-Edition]
              International Organization for Standardization,
              "Intermediate system to Intermediate system intra-domain
              routeing information exchange protocol for use in
              conjunction with the protocol for providing the
              connectionless-mode Network Service (ISO 8473)", Nov 2002.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              DOI 10.17487/RFC4861, September 2007.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC7130]  Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed.,
              Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional
              Forwarding Detection (BFD) on Link Aggregation Group (LAG)
              Interfaces", RFC 7130, DOI 10.17487/RFC7130, February
              2014.

   [TR-384]   Broadband Forum Technical Report, "TR-384 Cloud Central
              Office Reference Architectural Framework", Jan 2018.

Authors' Addresses

   Yuehua Wei
   ZTE Corporation
   No.50, Software Avenue
   Nanjing  210012
   P. R. China

   Email: wei.yuehua@zte.com.cn


   Zheng Zhang
   ZTE Corporation
   No.50, Software Avenue
   Nanjing  210012
   P. R. China

   Email: zzhang_ietf@hotmail.com


   Dmitry Afanasiev
   Yandex

   Email: fl0w@yandex-team.ru


   Tom Verhaeg
   Interconnect Services B.V.

   Email: t.verhaeg@interconnect.nl


   Jaroslaw Kowalczyk
   Orange Polska

   Email: jaroslaw.kowalczyk2@orange.com