2 RIFT WG Yuehua. Wei, Ed. 3 Internet-Draft Zheng. Zhang 4 Intended status: Informational ZTE Corporation 5 Expires: 5 October 2020 Dmitry. Afanasiev 6 Yandex 7 Tom. Verhaeg 8 Juniper Networks 9 Jaroslaw. Kowalczyk 10 Orange Polska 11 P. Thubert 12 Cisco Systems 13 3 April 2020
15 RIFT Applicability 16 draft-ietf-rift-applicability-01
18 Abstract
20 This document discusses the properties, applicability and operational 21 considerations of RIFT in different network scenarios. It intends to 22 provide a rough guide how RIFT can be deployed to simplify routing 23 operations in Clos topologies and their variations.
25 Status of This Memo
27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79.
30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at https://datatracker.ietf.org/drafts/current/.
35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress."
40 This Internet-Draft will expire on 5 October 2020.
42 Copyright Notice
44 Copyright (c) 2020 IETF Trust and the persons identified as the 45 document authors. All rights reserved.
47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 49 license-info) in effect on the date of publication of this document. 50 Please review these documents carefully, as they describe your rights 51 and restrictions with respect to this document. Code Components 52 extracted from this document must include Simplified BSD License text 53 as described in Section 4.e of the Trust Legal Provisions and are 54 provided without warranty as described in the Simplified BSD License. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 59 2. Problem Statement of Routing in Modern IP Fabric Fat Tree 60 Networks . . . . . . . . . . . . . . . . . . . . . . . . 3 61 3. Applicability of RIFT to Clos IP Fabrics . . . . . . . . . . 3 62 3.1. Overview of RIFT . . . . . . . . . . . . . . . . . . . . 3 63 3.2. Applicable Topologies . . . . . . . . . . . . . . . . . . 5 64 3.2.1. Horizontal Links . . . . . . . . . . . . . . . . . . 6 65 3.2.2. Vertical Shortcuts . . . . . . . . . . . . . . . . . 6 66 3.2.3. Generalizing to any Directed Acyclic Graph . . . . . 6 67 3.3. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . 8 68 3.3.1. DC Fabrics . . . . . . . . . . . . . . . . . . . . . 8 69 3.3.2. Metro Fabrics . . . . . . . . . . . . . . . . . . . . 8 70 3.3.3. Building Cabling . . . . . . . . . . . . . . . . . . 8 71 3.3.4. Internal Router Switching Fabrics . . . . . . . . . . 8 72 3.3.5. CloudCO . . . . . . . . . . . . . . . . . . . . . . . 9 73 4. Deployment Considerations . . . . . . . . . . . . . . . . . . 11 74 4.1. South Reflection . . . . . . . . . . . . . . . . . . . . 12 75 4.2. Suboptimal Routing on Link Failures . . . . . . . . . . . 12 76 4.3. Black-Holing on Link Failures . . . . . . . . . . . . . . 14 77 4.4. Zero Touch Provisioning (ZTP) . . . . . . . . . . . . . . 15 78 4.5. Miscabling Examples . . . . . . . . . . . . . . . . . . . 15 79 4.6. Positive vs. Negative Disaggregation . . . . . . . . . . 18 80 4.7. Mobile Edge and Anycast . . . . . . . . . . . . . . . . . 19 81 4.8. IPv4 over IPv6 . . . . . . . . . . . . . . . . . . . . . 21 82 4.9. In-Band Reachability of Nodes . . . . . . . . . . . . . . 21 83 4.10. Dual Homing Servers . . . . . . . . . . . . . . . . . . . 22 84 4.11. Fabric With A Controller . . . . . . . . . . . . . . . . 23 85 4.11.1. Controller Attached to ToFs . . . . . . . . . . . . 24 86 4.11.2. Controller Attached to Leaf . . . . . . . . . . . . 24 87 4.12. Internet Connectivity With Underlay . . . . . . . . . . . 24 88 4.12.1. Internet Default on the Leaf . . . . . . . . . . . . 25 89 4.12.2. Internet Default on the ToFs . . . . . . . . . . . . 25 90 4.13. Subnet Mismatch and Address Families . . . . . . . . . . 25 91 4.14. Anycast Considerations . . . . . . . . . . . . . . . . . 25 92 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 26 93 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 26 94 7. Normative References . . . . . . . . . . . . . . . . . . . . 27 95 8. Informative References . . . . . . . . . . . . . . . . . . . 28 96 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 28 98 1. Introduction 100 This document intends to explain the properties and applicability of 101 "Routing in Fat Trees" [RIFT] in different deployment scenarios and 102 highlight the operational simplicity of the technology compared to 103 traditional routing solutions. 
It also documents special 104 considerations when RIFT is used with or without overlays and 105 controllers, and how RIFT corrects topology miscablings and handles node and link 106 failures.
108 2. Problem Statement of Routing in Modern IP Fabric Fat Tree Networks
110 Clos and Fat-Tree topologies have gained prominence in today's 111 networking, primarily as a result of the paradigm shift towards a 112 centralized data-center based architecture that is poised to deliver 113 a majority of computation and storage services in the future.
115 Today's routing protocols were originally geared towards networks with 116 an irregular topology and a low degree of connectivity. 117 When they are applied to Fat-Tree topologies:
119 * they tend to need extensive configuration or provisioning during 120 bring up and re-dimensioning.
122 * spine and leaf nodes have the entire network topology and routing 123 information, which is, in fact, not needed on the leaf nodes during 124 normal operation.
126 * significant duplication of Link State PDU (LSP) flooding between 127 spine nodes and leaf nodes occurs during network bring up and 128 topology updates. It consumes both spine and leaf nodes' CPU and 129 link bandwidth resources and with that limits protocol 130 scalability.
132 3. Applicability of RIFT to Clos IP Fabrics
134 Further content of this document assumes that the reader is familiar 135 with the terms and concepts used in the OSPF [RFC2328] and IS-IS 136 [ISO10589-Second-Edition] link-state protocols and at least the 137 sections of [RIFT] outlining the requirements of routing in IP fabrics 138 and RIFT protocol concepts.
140 3.1. Overview of RIFT
142 RIFT is a dynamic routing protocol for Clos and fat-tree network 143 topologies. It defines a link-state protocol when "pointing north" 144 and a path-vector protocol when "pointing south".
146 It floods flat link-state information northbound only so that each 147 level obtains the full topology of levels south of it. That 148 information is never flooded East-West or back South again. Thus a top 149 tier node has the full set of prefixes from its SPF calculation.
151 In the southbound direction the protocol operates like a "fully 152 summarizing, unidirectional" path vector protocol or rather a 153 distance vector with implicit split horizon, where the information 154 propagates one hop south and is 're-advertised' by nodes at the next 155 lower level, normally just as the default route.
157 +-----------+ +-----------+ 158 | ToF | | ToF | LEVEL 2 159 + +-----+--+--+ +-+--+------+ 160 | | | | | | | | | ^ 161 + | | | +-------------------------+ | 162 Distance | +-------------------+ | | | | | 163 Vector | | | | | | | | + 164 South | | | | +--------+ | | | Link+State 165 + | | | | | | | | Flooding 166 | | | +-------------+ | | | North 167 v | | | | | | | | + 168 +-+--+-+ +------+ +-------+ +--+--+-+ | 169 |SPINE | |SPINE | | SPINE | | SPINE | | LEVEL 1 170 + ++----++ ++---+-+ +--+--+-+ ++----+-+ | 171 + | | | | | | | | | ^ N 172 Distance | +-------+ | | +--------+ | | | E 173 Vector | | | | | | | | | +------> 174 South | +-------+ | | | +-------+ | | | | 175 + | | | | | | | | | + 176 v ++--++ +-+-++ ++-+-+ +-+--++ + 177 |LEAF| |LEAF| |LEAF| |LEAF | LEVEL 0 178 +----+ +----+ +----+ +-----+ 180 Figure 1: RIFT overview
182 A middle tier node has only the information necessary for its level: 183 all destinations south of the node based on the SPF 184 calculation, the default route and potential disaggregated routes.
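The direction-dependent behaviour described above can be summarized with a small sketch (Python; purely illustrative, with invented names, and not part of the protocol specification): northbound TIEs keep being re-flooded up and are never sent back South or East-West, while southbound information is re-originated one hop down, normally as just a default route plus any disaggregated exceptions.

   # Illustrative sketch only; directions and TIE handling are simplified.
   DEFAULT_ROUTE = "0.0.0.0/0"

   def reflood_targets(tie_direction, received_from):
       """Where a received TIE is re-flooded in this toy model.

       North TIEs travel only up: re-flood them to northern neighbors
       when they were received from the South, never back South or
       East-West.  South TIEs travel one hop only and are not re-flooded.
       """
       if tie_direction == "north" and received_from == "south":
           return ["north"]
       return []

   def southbound_origination(disaggregated=()):
       """What a node re-originates one hop down: normally just the
       default route, plus any disaggregated exceptions."""
       return [DEFAULT_ROUTE, *disaggregated]

   # A spine that received an N-TIE from a leaf floods it further north,
   assert reflood_targets("north", "south") == ["north"]
   # but advertises only a default route (no topology) to its own leaves.
   assert southbound_origination() == ["0.0.0.0/0"]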
186 RIFT combines the advantages of both Link-State and Distance Vector:
188 * Fastest Possible Convergence
190 * Automatic Detection of Topology
192 * Minimal Routes/Info on TORs 193 * High Degree of ECMP
195 * Fast De-commissioning of Nodes
197 * Maximum Propagation Speed with Flexible Prefixes in an Update
199 And RIFT eliminates the disadvantages of Link-State or Distance 200 Vector:
202 * Reduced and Balanced Flooding
204 * Automatic Neighbor Detection
206 Consequently there are two types of link state databases: the "north 207 representation" N-TIEs and the "south representation" S-TIEs. The N-TIEs 208 contain a link state topology description of lower levels and S-TIEs 209 simply carry default routes for the lower levels.
211 Further advantages unique to RIFT are listed below; 212 they are explained in detail in [RIFT].
214 * True ZTP
216 * Minimal Blast Radius on Failures
218 * Can Utilize All Paths Through Fabric Without Looping
220 * Automatic Disaggregation on Failures
222 * Simple Leaf Implementation that Can Scale Down to Servers
224 * Key-Value Store
226 * Horizontal Links Used for Protection Only
228 * Supports Non-Equal Cost Multipath and Can Replace MC-LAG
230 * Optimal Flooding Reduction and Load-Balancing
232 3.2. Applicable Topologies
234 Although RIFT is specified primarily for "proper" Clos or "fat-tree" 235 structures, it already supports PoD concepts which are, strictly 236 speaking, not found in the original Clos concept.
238 Further, the specification explains and supports operation of multi- 239 plane Clos variants where the protocol relies on a set of rings to 240 allow the reconciliation of the topology views of different planes as the most 241 desirable solution for making proper disaggregation viable in case of 242 failures. These observations hold not only in the case of RIFT but in the 243 generic case of dynamic routing on Clos variants with multiple planes 244 and failures in bi-sectional bandwidth, especially on the leafs.
246 3.2.1. Horizontal Links
248 RIFT is not limited to pure Clos divided into PoDs and multiple planes 249 but supports horizontal links below the top of fabric level. Those 250 links are, however, used only as routes of last resort northbound when 251 a spine loses all northbound links or cannot compute a default route 252 through them.
254 A possible configuration is a "ring" of horizontal links at a level. 255 In the presence of such a "ring" at any level (except the ToF level) neither 256 N-SPF nor S-SPF will provide a "ring-based protection" scheme since 257 such a computation would necessarily have to deal with breaking of 258 "loops" in the Dijkstra sense; an application for which RIFT is not 259 intended.
261 Full-mesh connectivity between nodes on the same level can be 262 employed; that allows N-SPF to let any node losing all 263 its northbound adjacencies (as long as any of the other nodes in the 264 level are northbound connected) still participate in northbound 265 forwarding.
267 3.2.2. Vertical Shortcuts
269 Through relaxation of the specified adjacency-forming rules RIFT 270 implementations can be extended to support vertical "shortcuts" as 271 proposed by e.g. [I-D.white-distoptflood]. The RIFT specification 272 itself does not provide the exact details since the resulting 273 solution suffers either from a much larger blast radius with increased 274 flooding volumes or, in case of maximum aggregation, from routing bow-tie 275 problems.
277 3.2.3.
Generalizing to any Directed Acyclic Graph
279 RIFT is an anisotropic routing protocol, meaning that it has a sense 280 of direction (Northbound, Southbound, East-West) and that it operates 281 differently depending on the direction.
283 * Northbound, RIFT operates as a Link State IGP, whereby the control 284 packets are reflooded first all the way North and only interpreted 285 later. All the individual fine-grained routes are advertised.
287 * Southbound, RIFT operates as a Distance Vector IGP, whereby the 288 control packets are flooded only one hop, interpreted, and the 289 consequence of that computation is what gets flooded one more hop 290 South. In the most common use-cases, a ToF node can reach most of 291 the prefixes in the fabric. If that is the case, the ToF node 292 advertises the fabric default and disaggregates the prefixes that 293 it cannot reach. On the other hand, a ToF node that can reach 294 only a small subset of the prefixes in the fabric will preferably 295 advertise those prefixes and refrain from aggregating.
297 In the general case, what gets advertised South is, in more 298 detail:
300 1. A fabric default that aggregates all the prefixes that are 301 reachable within the fabric, and that could be a default route 302 or a prefix that is dedicated to this particular fabric.
304 2. The loopback addresses of the Northbound nodes, e.g., for 305 inband management.
307 3. The disaggregated prefixes for the dynamic exceptions to the 308 fabric default, advertised to route around the black hole that 309 may form.
311 * East-West routing can optionally be used, with specific 312 restrictions. It is useful in particular when a sibling has 313 access to the fabric default but this node does not.
315 A Directed Acyclic Graph (DAG) provides a sense of North (the 316 direction of the DAG) and of South (the reverse), which can be used 317 to apply RIFT. For the purposes of RIFT, a vertex in the DAG that has 318 only incoming edges is a ToF node.
320 There are a number of caveats though:
322 * The DAG structure must exist before RIFT starts, so there is a 323 need for a companion protocol to establish the logical DAG 324 structure.
326 * A generic DAG does not have a sense of East and West. The 327 operation specified for East-West links and the Southbound 328 reflection between nodes are not applicable.
330 * In order to aggregate and disaggregate routes, RIFT requires that 331 all the ToF nodes share the full knowledge of the prefixes in the 332 fabric. This can be achieved with a ring as suggested by the RIFT 333 main specification, by some preconfiguration, or by 334 synchronization with a common repository where all the active 335 prefixes are registered.
337 3.3. Use Cases
339 3.3.1. DC Fabrics
341 RIFT is largely driven by data center demands and hence is ideally suited for 342 application in the underlay of data center IP fabrics, the vast majority of 343 which seem to be currently (and for the foreseeable future) Clos 344 architectures. It significantly simplifies operation and deployment 345 of such fabrics, as described in Section 4, compared 346 to environments relying on extensive proprietary provisioning and operational solutions.
348 3.3.2. Metro Fabrics
350 The demand for bandwidth is increasing steadily, driven primarily by 351 environments close to content producers (server farms connected via 352 DC fabrics) but in proximity to content consumers as well.
Consumers 353 are often clustered in metro areas with their own network 354 architectures that can benefit from simplified, regular Clos 355 structures and hence RIFT.
357 3.3.3. Building Cabling
359 Commercial edifices are often cabled in topologies that are either 360 Clos or its isomorphic equivalents. With many floors the Clos can 361 grow rather high and with that present a challenge for traditional 362 routing protocols (except BGP and the by now largely phased-out PNNI) 363 which, unlike RIFT, do not naturally support an arbitrary number of levels. 364 Moreover, due to the limited sizes of forwarding tables in the 365 active elements of building cabling, the minimal FIB size RIFT 366 maintains under normal conditions can prove particularly cost- 367 effective in terms of hardware and operational costs.
369 3.3.4. Internal Router Switching Fabrics
371 It is common in high-speed communications switching and routing 372 devices to use fabrics when a crossbar is not feasible due to cost, 373 head-of-line blocking or size trade-offs. Normally such fabrics are 374 not self-healing or rely on 1:1 or 1+1 protection schemes, but it is 375 conceivable to use RIFT to operate Clos fabrics that can deal 376 effectively with interconnection or subsystem failures in such 377 modules. RIFT is not IP specific, hence any link addressing 378 connecting internal device subnets is conceivable.
380 3.3.5. CloudCO
382 The Cloud Central Office (CloudCO) is a new stage of the telecom Central 383 Office. It takes advantage of Software Defined Networking (SDN) 384 and Network Function Virtualization (NFV) in conjunction with general 385 purpose hardware to optimize current networks. The following figure 386 illustrates this architecture at a high level. It describes a single 387 instance or macro-node of CloudCO. An Access I/O module faces a 388 CloudCO Access Node, and the CPEs behind it. A Network I/O module 389 faces the core network. The two I/O modules are interconnected 390 by a leaf and spine fabric.
[TR-384] 391 +---------------------+ +----------------------+ 392 | Spine | | Spine | 393 | Switch | | Switch | 394 +------+---+------+-+-+ +--+-+-+-+-----+-------+ 395 | | | | | | | | | | | | 396 | | | | | +-------------------------------+ | 397 | | | | | | | | | | | | 398 | | | | +-------------------------+ | | | 399 | | | | | | | | | | | | 400 | | +----------------------+ | | | | | | | | 401 | | | | | | | | | | | | 402 | +---------------------------------+ | | | | | | | 403 | | | | | | | | | | | | 404 | | | +-----------------------------+ | | | | | 405 | | | | | | | | | | | | 406 | | | | | +--------------------+ | | | | 407 | | | | | | | | | | | | 408 +--+ +-+---+--+ +-+---+--+ +--+----+--+ +-+--+--+ +--+ 409 |L | | Leaf | | Leaf | | Leaf | | Leaf | |L | 410 |S | | Switch | | Switch | | Switch | | Switch| |S | 411 ++-+ +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ +-++ 412 | | | | | | | | | | | | | | 413 | +-+-+-+--+ +-+-+-+--+ +--+-+--+--+ ++-+--+-+ | 414 | |Compute | |Compute | | Compute | |Compute| | 415 | |Node | |Node | | Node | |Node | | 416 | +--------+ +--------+ +----------+ +-------+ | 417 | || VAS5 || || vDHCP|| || vRouter|| ||VAS1 || | 418 | |--------| |--------| |----------| |-------| | 419 | |--------| |--------| |----------| |-------| | 420 | || VAS6 || || VAS3 || || v802.1x|| ||VAS2 || | 421 | |--------| |--------| |----------| |-------| | 422 | |--------| |--------| |----------| |-------| | 423 | || VAS7 || || VAS4 || || vIGMP || ||BAA || | 424 | |--------| |--------| |----------| |-------| | 425 | +--------+ +--------+ +----------+ +-------+ | 426 | | 427 ++-----------+ +---------++ 428 |Network I/O | |Access I/O| 429 +------------+ +----------+ 431 Figure 2: An example of CloudCO architecture 433 The Spine-Leaf architectures deployed inside CloudCO meets the 434 network requirements of adaptable, agile, scalable and dynamic. 436 4. Deployment Considerations 438 RIFT presents the opportunity for organizations building and 439 operating IP fabrics to simplify their operation and deployments 440 while achieving many desirable properties of a dynamic routing on 441 such a substrate: 443 * RIFT design follows minimum blast radius and minimum necessary 444 epistemological scope philosophy which leads to very good scaling 445 properties while delivering maximum reactiveness. 447 * RIFT allows for extensive Zero Touch Provisioning within the 448 protocol. In its most extreme version RIFT does not rely on any 449 specific addressing and for IP fabric can operate using IPv6 ND 450 [RFC4861] only. 452 * RIFT has provisions to detect common IP fabric mis-cabling 453 scenarios. 455 * RIFT negotiates automatically BFD per link allowing this way for 456 IP and micro-BFD [RFC7130] to replace LAGs which do hide bandwidth 457 imbalances in case of constituent failures. Further automatic 458 link validation techniques similar to [RFC5357] could be supported 459 as well. 461 * RIFT inherently solves many difficult problems associated with the 462 use of traditional routing topologies with dense meshes and high 463 degrees of ECMP by including automatic bandwidth balancing, flood 464 reduction and automatic disaggregation on failures while providing 465 maximum aggregation of prefixes in default scenarios. 467 * RIFT reduces FIB size towards the bottom of the IP fabric where 468 most nodes reside and allows with that for cheaper hardware on the 469 edges and introduction of modern IP fabric architectures that 470 encompass e.g. server multi-homing. 
472 * RIFT provides valley-free routing and with that is loop free. 473 This allows the use of any such valley-free path in bi-sectional 474 fabric bandwidth between two destination irrespective of their 475 metrics which can be used to balance load on the fabric in 476 different ways. 478 * RIFT includes a key-value distribution mechanism which allows for 479 many future applications such as automatic provisioning of basic 480 overlay services or automatic key roll-overs over whole fabrics. 482 * RIFT is designed for minimum delay in case of prefix mobility on 483 the fabric. 485 * Many further operational and design points collected over many 486 years of routing protocol deployments have been incorporated in 487 RIFT such as fast flooding rates, protection of information 488 lifetimes and operationally easily recognizable remote ends of 489 links and node names. 491 4.1. South Reflection 493 South reflection is a mechanism that South Node TIEs are "reflected" 494 back up north to allow nodes in same level without E-W links to "see" 495 each other. 497 For example, Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs 498 from ToF21 to ToF22 separately. Respectively, 499 Spine111\Spine112\Spine121\Spine122 reflects Node S-TIEs from ToF22 500 to ToF21 separately. So ToF22 and ToF21 see each other's node 501 information as level 2 nodes. 503 In an equivalent fashion, as the result of the south reflection 504 between Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, 505 Spine121 and Spine 122 knows each other at level 1. 507 4.2. Suboptimal Routing on Link Failures 508 +--------+ +--------+ 509 | ToF21 | | ToF22 | LEVEL 2 510 ++--+-+-++ ++-+--+-++ 511 | | | | | | | + 512 | | | | | | | linkTS8 513 +-------------+ | +-+linkTS3+-+ | | | +--------------+ 514 | | | | | | + | 515 | +----------------------------+ | linkTS7 | 516 | | | | + + + | 517 | | | +-------+linkTS4+------------+ | 518 | | | + + | | | 519 | | | +------------+--+ | | 520 | | | | | linkTS6 | | 521 +-+----++ ++-----++ ++------+ ++-----++ 522 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 523 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 524 | | | | | | | | 525 | +--------------+ | + ++XX+linkSL6+---+ + 526 | | | | linkSL5 | | linkSL8 527 | +------------+ | | + +---+linkSL7+-+ | + 528 | | | | | | | | 529 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 530 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 531 +-+-----+ ++------+ +-----+-+ +-+-----+ 532 + + + + 533 Prefix111 Prefix112 Prefix121 Prefix122 535 Figure 3: Suboptimal routing upon link failure use case 537 As shown in Figure 3, as the result of the south reflection between 538 Spine121-Leaf121-Spine122 and Spine121-Leaf122-Spine122, Spine121 and 539 Spine 122 knows each other at level 1. 541 Without disaggregation mechanism, when linkSL6 fails, the packet from 542 leaf121 to prefix122 will probably go up through linkSL5 to linkTS3 543 then go down through linkTS4 to linkSL8 to Leaf122 or go up through 544 linkSL5 to linkTS6 then go down through linkTS4 and linkSL8 to 545 Leaf122 based on pure default route. It's the case of suboptimal 546 routing or bow-tieing. 548 With disaggregation mechanism, when linkSL6 fails, Spine122 will 549 detect the failure according to the reflected node S-TIE from 550 Spine121. Based on the disaggregation algorithm provided by RIFT, 551 Spine122 will explicitly advertise prefix122 in Disaggregated Prefix 552 S-TIE PrefixesElement(prefix122, cost 1). 
The packet from leaf121 to 553 prefix122 will only be sent to linkSL7 following a longest-prefix 554 match to prefix122 directly and then go down through linkSL8 to 555 Leaf122.
557 4.3. Black-Holing on Link Failures
559 +--------+ +--------+ 560 | ToF 21 | | ToF 22 | LEVEL 2 561 ++-+--+-++ ++-+--+-++ 562 | | | | | | | | 563 | | | | | | | linkTS8 564 +--------------+ | +--linkTS3-X+ | | | +--------------+ 565 linkTS1 | | | | | | | 566 | +-----------------------------+ | linkTS7 | 567 | | | | | | | | 568 | | linkTS2 +--------linkTS4-X-----------+ | 569 | | | | | | | | 570 | linkTS5 +-+ +---------------+ | | 571 | | | | | linkTS6 | | 572 +-+----++ +-+-----+ ++----+-+ ++-----++ 573 |Spin111| |Spin112| |Spin121| |Spin122| LEVEL 1 574 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 575 | | | | | | | | 576 | +---------------+ | | +----linkSL6----+ | 577 linkSL1 | | | linkSL5 | | linkSL8 578 | +---linkSL3---+ | | | +----linkSL7--+ | | 579 | | | | | | | | 580 +-+---+-+ +--+--+-+ +-+---+-+ +--+-+--+ 581 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 582 +-+-----+ ++------+ +-----+-+ +-+-----+ 583 + + + + 584 Prefix111 Prefix112 Prefix121 Prefix122
586 Figure 4: Black-holing upon link failure use case
588 This scenario illustrates a case where a double link failure occurs and 589 with that black-holing can happen.
591 Without the disaggregation mechanism, when linkTS3 and linkTS4 both fail, 592 the packet from leaf111 to prefix122 would suffer 50% black-holing 593 based on the pure default route. A packet supposed to go up through 594 linkSL1 to linkTS1 and then go down through linkTS3 or linkTS4 will be 595 dropped. A packet supposed to go up through linkSL3 to linkTS2 596 and then go down through linkTS3 or linkTS4 will be dropped as well. 597 This is the black-holing case.
599 With the disaggregation mechanism, when linkTS3 and linkTS4 both fail, 600 ToF22 will detect the failure according to the reflected node S-TIE 601 of ToF21 from Spine111\Spine112. Based on the disaggregation 602 algorithm provided by RIFT, ToF22 will explicitly originate an S-TIE 603 with prefix121 and prefix122, which is flooded to spines 111, 112, 604 121 and 122.
606 The packet from leaf111 to prefix122 will not be routed to linkTS1 or 607 linkTS2. The packet from leaf111 to prefix122 will only be routed to 608 linkTS5 or linkTS7 following a longest-prefix match to prefix122.
610 4.4. Zero Touch Provisioning (ZTP)
612 Each RIFT node may operate in zero touch provisioning (ZTP) mode. It 613 has no configuration (unless it is a Top-of-Fabric at the top of the 614 topology or it is desired to confine it to the leaf role without leaf-2-leaf 615 procedures). In such a case RIFT will fully configure the node's level 616 after it is attached to the topology.
618 The most important component of ZTP is the automatic level derivation 619 procedure. All the Top-of-Fabric nodes are explicitly marked with the 620 TOP_OF_FABRIC flag; they are the initial 'seeds' needed for other ZTP 621 nodes to derive their level in the topology. The derivation of the 622 level of each node then happens based on LIEs received from its 623 neighbors, whereby each node (with the possible exception of configured 624 leafs) tries to attach at the highest possible point in the fabric. 625 This guarantees that even if the diffusion front reaches a node from 626 "below" faster than from "above", it will greedily abandon an already 627 negotiated level derived from nodes topologically below it and 628 properly peer with nodes above.
630 4.5.
Miscabling Examples 632 +----------------+ +-----------------+ 633 | ToF21 | +------+ ToF22 | LEVEL 2 634 +-------+----+---+ | +----+---+--------+ 635 | | | | | | | | | 636 | | | +----------------------------+ | 637 | +---------------------------+ | | | | 638 | | | | | | | | | 639 | | | | +-----------------------+ | | 640 | | +------------------------+ | | | 641 | | | | | | | | | 642 +-+---+-+ +-+---+-+ | +-+---+-+ +-+---+-+ 643 |Spin111| |Spin112| | |Spin121| |Spin122| LEVEL 1 644 +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ 645 | | | | | | | | | 646 | +---------+ | link-M | +---------+ | 647 | | | | | | | | | 648 | +-------+ | | | | +-------+ | | 649 | | | | | | | | | 650 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 651 |Leaf111| |Leaf112+-----+ |Leaf121| |Leaf122| LEVEL 0 652 +-------+ +-------+ +-------+ +-------+ 653 Figure 5: A single plane miscabling example 655 Figure 5 shows a single plane miscabling example. It's a perfect 656 fat-tree fabric except link-M connecting Leaf112 to ToF22. 658 The RIFT control protocol can discover the physical links 659 automatically and be able to detect cabling that violates fat-tree 660 topology constraints. It react accordingly to such mis-cabling 661 attempts, at a minimum preventing adjacencies between nodes from 662 being formed and traffic from being forwarded on those mis-cabled 663 links. Leaf112 will in such scenario use link-M to derive its level 664 (unless it is leaf) and can report links to spines 111 and 112 as 665 miscabled unless the implementations allows horizontal links. 667 Figure 6 shows a multiple plane miscabling example. Since Leaf112 668 and Spine121 belong to two different PoDs, the adjacency between 669 Leaf112 and Spine121 can not be formed. link-W would be detected and 670 prevented. 672 +-------+ +-------+ +-------+ +-------+ 673 |ToF A1| |ToF A2| |ToF B1| |ToF B2| LEVEL 2 674 +-------+ +-------+ +-------+ +-------+ 675 | | | | | | | | 676 | | | +-----------------+ | | | 677 | +--------------------------+ | | | | 678 | | | | | | | | 679 | +------+ | | | +------+ | 680 | | +-----------------+ | | | | | 681 | | | +--------------------------+ | | 682 | A | | B | | A | | B | 683 +-----+-+ +-+---+-+ +-+---+-+ +-+-----+ 684 |Spin111| |Spin112| +----+Spin121| |Spin122| LEVEL 1 685 +-+---+-+ ++----+-+ | +-+---+-+ ++----+-+ 686 | | | | | | | | | 687 | +---------+ | | | +---------+ | 688 | | | | link-W | | | | 689 | +-------+ | | | | +-------+ | | 690 | | | | | | | | | 691 +-+---+-+ +--+--+-+ | +-+---+-+ +--+--+-+ 692 |Leaf111| |Leaf112+------+ |Leaf121| |Leaf122| LEVEL 0 693 +-------+ +-------+ +-------+ +-------+ 694 +--------PoD#1----------+ +---------PoD#2---------+ 696 Figure 6: A multiple plane miscabling example 698 RIFT provides an optional level determination procedure in its Zero 699 Touch Provisioning mode. Nodes in the fabric without their level 700 configured determine it automatically. This can have possibly 701 counter-intuitive consequences however. One extreme failure scenario 702 is depicted in Figure 7 and it shows that if all northbound links of 703 spine11 fail at the same time, spine11 negotiates a lower level than 704 Leaf11 and Leaf12. 706 To prevent such scenario where leafs are expected to act as switches, 707 LEAF_ONLY flag can be set for Leaf111 and Leaf112. Since level -1 is 708 invalid, Spine11 would not derive a valid level from the topology in 709 Figure 7. It will be isolated from the whole fabric and it would be 710 up to the leafs to declare the links towards such spine as miscabled. 
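Before looking at Figure 7, the level derivation just described can be summarized with a small sketch (Python; purely illustrative and not the normative ZTP rules of [RIFT]): a ZTP node adopts a level one below the highest level offered by its neighbors, so a spine that only hears LEAF_ONLY leaves at level 0 would derive -1, which is invalid.

   # Toy illustration of the ZTP level derivation; not the normative
   # algorithm of [RIFT].  Levels, offers and flags are simplified.
   LEAF_LEVEL = 0

   def derive_level(offered_levels, leaf_only=False):
       """Return the level a node would adopt, or None if none is valid."""
       if leaf_only:
           return LEAF_LEVEL              # a LEAF_ONLY node never moves up
       if not offered_levels:
           return None                    # no usable neighbor offers yet
       level = max(offered_levels) - 1    # attach as high as possible
       return level if level >= LEAF_LEVEL else None

   # Spine11 with all northbound links down only hears leaves at level 0,
   # so it would derive 0 - 1 = -1, i.e. no valid level at all.
   assert derive_level([0, 0]) is None
   # With its northbound links up it hears ToF nodes at level 2 and derives 1.
   assert derive_level([2, 2, 0, 0]) == 1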
712 +-------+ +-------+ +-------+ +-------+ 713 |ToF A1| |ToF A2| |ToF A1| |ToF A2| 714 +-------+ +-------+ +-------+ +-------+ 715 | | | | | | 716 | +-------+ | | | 717 + + | | ====> | | 718 X X +------+ | +------+ | 719 + + | | | | 720 +----+--+ +-+-----+ +-+-----+ 721 |Spine11| |Spine12| |Spine12| 722 +-+---+-+ ++----+-+ ++----+-+ 723 | | | | | | 724 | +---------+ | | | 725 | | | | | | 726 | +-------+ | | +-------+ | 727 | | | | | | 728 +-+---+-+ +--+--+-+ +-----+-+ +-----+-+ 729 |Leaf111| |Leaf112| |Leaf111| |Leaf112| 730 +-------+ +-------+ +-+-----+ +-+-----+ 731 | | 732 | +--------+ 733 | | 734 +-+---+-+ 735 |Spine11| 736 +-------+
738 Figure 7: Fallen spine
740 4.6. Positive vs. Negative Disaggregation
742 Disaggregation is the procedure whereby [RIFT] advertises a more 743 specific route Southwards as an exception to the aggregated fabric- 744 default North. Disaggregation is useful when a prefix within the 745 aggregation is reachable via some of the parents but not the others 746 at the same level of the fabric. It is mandatory at the ToF level 747 since a ToF node that cannot reach a prefix becomes a black 748 hole for that prefix. The hard problem is to know which prefixes are 749 reachable by whom.
751 In the general case, [RIFT] solves that problem by interconnecting 752 the ToF nodes so they can exchange the full list of prefixes that 753 exist in the fabric and figure out when a ToF node lacks reachability 754 to an existing prefix. This requires additional ports at the ToF, 755 typically 2 ports per ToF node to form a ToF-spanning ring. [RIFT] 756 also defines the Southbound Reflection 757 procedure that enables a parent to explore the direct connectivity of 758 its peers, meaning their own parents and children; based on the 759 advertisements received from the shared parents and children, it may 760 enable the parent to infer the prefixes its peers can reach.
762 When a parent lacks reachability to a prefix, it may disaggregate the 763 prefix negatively, i.e., advertise that this parent can be used to 764 reach any prefix in the aggregation except that one. The Negative 765 Disaggregation signaling is simple and functions transitively from 766 ToF to ToP and then from ToP to Leaf. But it is hard for a parent to 767 figure out which prefix it needs to disaggregate, because it does not 768 know what it does not know; as a result, the use of a spanning 769 ring at the ToF is required to operate the Negative Disaggregation. 770 Also, though it is only an implementation problem, programming 771 the FIB is complex compared to normal routes, and may incur 772 recursions.
774 The more classical alternative is, for the parents that can reach a 775 prefix that peers at the same level cannot, to advertise a more 776 specific route to that prefix. This leverages the normal longest 777 prefix match in the FIB, and does not require a special 778 implementation. But as opposed to the Negative Disaggregation, the 779 Positive Disaggregation is difficult and inefficient to operate 780 transitively.
782 Transitivity is not needed for a grandchild if all its parents 783 received the Positive Disaggregation, meaning that they shall all 784 avoid the black hole; when that is the case, they collectively build 785 a ceiling that protects the grandchild.
But until then, a parent 786 that received a Positive Disaggregation may believe that some peers 787 lack the reachability and readvertise too early, or defer and 788 maintain a black hole situation longer than necessary.
790 In a non-partitioned fabric, all the ToF nodes see one another 791 through the reflection and can figure out if one is missing a child. In 792 that case it is possible to compute the prefixes that the peer cannot 793 reach and disaggregate positively without a ToF-spanning ring. The 794 ToF nodes can also ascertain that the ToP nodes are each connected to 795 at least one ToF node that can still reach the prefix, meaning that the 796 transitive operation is not required.
798 The bottom line is that in a fabric that is partitioned (e.g., using 799 multiple planes) and/or where the ToP nodes are not guaranteed to 800 always form a ceiling for their children, it is mandatory to use the 801 Negative Disaggregation. On the other hand, in a highly symmetrical 802 and fully connected fabric (e.g., a canonical Clos Network), the 803 Positive Disaggregation method allows saving the complexity and 804 cost associated with the ToF-spanning ring.
806 Note that in the case of Positive Disaggregation, the first ToF 807 node(s) that announces a more-specific route attracts all the traffic 808 for that route and may suffer from a transient incast. A ToP node 809 that defers injecting the longer prefix into the FIB, in order to 810 receive more advertisements and spread the packets better, also keeps 811 on sending a portion of the traffic to the black hole in the 812 meantime. In the case of Negative Disaggregation, the last ToF 813 node(s) that injects the route may also incur an incast issue; this 814 problem would occur if a prefix that becomes totally unreachable is 815 disaggregated, but doing so is mostly useless and is not recommended.
817 4.7. Mobile Edge and Anycast
819 When a physical or a virtual node changes its point of attachment in 820 the fabric from a previous-leaf to a next-leaf, new routes must be 821 installed that supersede the old ones. Since the flooding flows 822 Northwards, the nodes (if any) between the previous-leaf and the 823 common parent are not immediately aware that the path via the previous- 824 leaf is obsolete, and a stale route may exist for a while. The 825 common parent needs to select the freshest route advertisement in 826 order to install the correct route via the next-leaf. This requires 827 that the fabric determines the sequence of the movements of the 828 mobile node.
830 On the one hand, a classical sequence counter provides a total order 831 for a while but it will eventually wrap. On the other hand, a 832 timestamp provides a permanent order but it may miss a movement that 833 happens too quickly relative to the granularity of the timing information. 834 It is not envisioned in the short term that the average fabric 835 supports a Precision Time Protocol, and the precision that may be 836 available with the Network Time Protocol [RFC5905], in the order of 837 100 to 200ms, may not necessarily be enough to cover, e.g., the fast 838 mobility of a Virtual Machine.
840 Section 4.3.3 "Mobility" of [RIFT] specifies a hybrid method that 841 combines a sequence counter from the mobile node and a timestamp from 842 the network taken at the leaf when the route is injected.
If the 843 timestamps of the concurrent advertisements are comparable (i.e., 844 more distant from one another than the precision of the timing protocol), then the 845 timestamp alone is used to determine the relative freshness of the 846 routes. Otherwise, the sequence counter from the mobile node, if 847 available, is used. One caveat is that the sequence counter must not 848 wrap within the precision of the timing protocol. Another is that 849 the mobile node may not even provide a sequence counter, in which 850 case the mobility itself must be slower than the precision of the 851 timing.
853 Mobility must not be confused with Anycast. In both cases, the same 854 address is injected in RIFT at different leaves. In the case of 855 mobility, only the freshest route must be conserved, since the mobile 856 node changed its point of attachment from one leaf to the next. In the 857 case of anycast, the node may be either multihomed (attached to 858 multiple leaves in parallel) or reachable beyond the fabric via 859 multiple routes that are redistributed to different leaves; either 860 way, in the case of anycast, the multiple routes are equally valid 861 and should be conserved. Without further information from the 862 redistributed routing protocol, it is impossible to sort out a 863 movement from a redistribution that happens asynchronously on 864 different leaves. [RIFT] expects that anycast addresses are 865 advertised within the timing precision, which is typically the case 866 with low-precision timing and a multihomed node. Beyond that time 867 interval, RIFT interprets the lag as mobility and only the freshest 868 route is retained.
870 When using IPv6 [RFC8200], RIFT suggests leveraging "Registration 871 Extensions for IPv6 over Low-Power Wireless Personal Area Network 872 (6LoWPAN) Neighbor Discovery (ND)" [RFC8505] as the IPv6 ND 873 interaction between the mobile node and the leaf. This provides not 874 only a sequence counter but also a lifetime and a security token that 875 may be used to protect the ownership of an address. When using 876 [RFC8505], the parallel registration of an anycast address to 877 multiple leaves is done with the same sequence counter, whereas the 878 sequence counter is incremented when the point of attachment 879 changes. This way, it is possible to differentiate a mobile node 880 from a multihomed node, even when the mobility happens within the 881 timing precision. It is also possible for a mobile node to be 882 multihomed, e.g., to change only one of its points of 883 attachment.
885 4.8. IPv4 over IPv6
887 RIFT allows advertising IPv4 prefixes over an IPv6 RIFT network. The IPv6 888 address family is configured via the usual ND mechanisms and IPv4 can then use IPv6 889 next hops analogous to RFC 5549. It is expected that the whole fabric 890 supports the same type of forwarding of address families on all the 891 links. RIFT provides an indication of whether a node is IPv4 forwarding 892 capable, and implementations are possible where different routing 893 tables are computed per address family as long as the computation 894 remains loop-free.
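As a purely illustrative sketch (Python; addresses and field names are invented and do not represent any particular implementation), such a deployment installs IPv4 prefixes whose next hops are IPv6, typically link-local, addresses learned through ND:

   # Illustrative only: IPv4 prefixes resolved via IPv6 next hops,
   # in the spirit of the RFC 5549-style forwarding described above.
   rib = {
       "203.0.113.0/24": {"next_hop": "fe80::a", "interface": "eth0"},
       "0.0.0.0/0":      {"next_hop": "fe80::b", "interface": "eth1"},
   }

   def next_hop(prefix):
       """Exact-match lookup; a real FIB would do a longest-prefix match."""
       entry = rib.get(prefix)
       return (entry["next_hop"], entry["interface"]) if entry else None

   assert next_hop("203.0.113.0/24") == ("fe80::a", "eth0")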
896 +-----+ +-----+ 897 +---+---+ | ToF | | ToF | 898 ^ +--+--+ +-----+ 899 | | | | | 900 | | +-------------+ | 901 | | +--------+ | | 902 | | | | | 903 V6 +-----+ +-+---+ 904 Forwarding |SPINE| |SPINE| 905 | +--+--+ +-----+ 906 | | | | | 907 | | +-------------+ | 908 | | +--------+ | | 909 | | | | | 910 v +-----+ +-+---+ 911 +---+---+ |LEAF | | LEAF| 912 +--+--+ +--+--+ 913 | | 914 IPv4 prefixes| |IPv4 prefixes 915 | | 916 +---+----+ +---+----+ 917 | V4 | | V4 | 918 | subnet | | subnet | 919 +--------+ +--------+ 921 Figure 8: IPv4 over IPv6 923 4.9. In-Band Reachability of Nodes 925 RIFT doesn't precondition that nodes of the fabric have reachable 926 addresses. But the operational purposes to reach the internal nodes 927 may exist. Figure 9 shows an example that the NMS attaches to LEAF1. 929 +-------+ +-------+ 930 | ToF1 | | ToF2 | 931 ++---- ++ ++-----++ 932 | | | | 933 | +----------+ | 934 | +--------+ | | 935 | | | | 936 ++-----++ +--+---++ 937 |SPINE1 | |SPINE2 | 938 ++-----++ ++-----++ 939 | | | | 940 | +----------+ | 941 | +--------+ | | 942 | | | | 943 ++-----++ +--+---++ 944 | LEAF1 | | LEAF2 | 945 +---+---+ +-------+ 946 | 947 |NMS 949 Figure 9: In-Band reachability of node 951 If NMS wants to access LEAF2, it simply works. Because loopback 952 address of LEAF2 is flooded in its Prefix North TIE. 954 If NMS wants to access SPINE2, it simply works too. Because spine 955 node always advertises its loopback address in the Prefix North TIE. 956 NMS may reach SPINE2 from LEAF1-SPINE2 or LEAF1-SPINE1-ToF1/ 957 ToF2-SPINE2. 959 If NMS wants to access ToF2, ToF2's loopback address needs to be 960 injected into its Prefix South TIE. Otherwise, the traffic from NMS 961 may be sent to ToF1. 963 And in case of failure between ToF2 and spine nodes, ToF2's loopback 964 address must be sent all the way down to the leaves. 966 4.10. Dual Homing Servers 968 Each RIFT node may operate in zero touch provisioning (ZTP) mode. It 969 has no configuration (unless it is a Top-of-Fabric at the top of the 970 topology or the must operate in the topology as leaf and/or support 971 leaf-2-leaf procedures) and it will fully configure itself after 972 being attached to the topology. 974 +---+ +---+ +---+ 975 |ToF| |ToF| |ToF| 976 +---+ +---+ +---+ 977 | | | | | | 978 | +----------------+ | | 979 | | | | | | 980 | +----------------+ | 981 | | | | | | 982 +----------+--+ +--+----------+ 983 | Spine|ToR1 | | Spine|ToR2 | 984 +--+------+---+ +--+-------+--+ 985 +---+ | | | | | | +---+ 986 | | | | | | | | 987 | +-----------------+ | | | 988 | | | +-------------+ | | 989 + | + | | |-----------------+ | 990 X | X | +--------x-----+ | X | 991 + | + | | | + | 992 +---+ +---+ +---+ +---+ 993 | | | | | | | | 994 +---+ +---+ ...............+---+ +---+ 995 SV(1) SV(2) SV(n+1) SV(n) 997 Figure 10: Dual-homing servers 999 In the single plane, the worst condition is disaggregation of every 1000 other servers at the same level. Suppose the links from ToR1 to all 1001 the leaves become not available. All the servers' routes are 1002 disaggregated and the FIB of the servers will be expanded with n-1 1003 more spicific routes. 1005 Sometimes, pleople may prefer to disaggregate from ToR to servers 1006 from start on, i.e. the servers have couple tens of routes in FIB 1007 from start on beside default routes to avoid breakages at rack level. 1008 Full disaggregation of the fabric could be achieved by configuration 1009 supported by RIFT. 1011 4.11. 
Fabric With A Controller 1013 There are many different ways to deploy the controller. One 1014 possibility is attaching a controller to the RIFT domain from ToF and 1015 another possibility is attaching a controller from the leaf. 1017 +------------+ 1018 | Controller | 1019 ++----------++ 1020 | | 1021 | | 1022 +----++ ++----+ 1023 ------- | ToF | | ToF | 1024 | +--+--+ +-----+ 1025 | | | | | 1026 | | +-------------+ | 1027 | | +--------+ | | 1028 | | | | | 1029 +-----+ +-+---+ 1030 RIFT domain |SPINE| |SPINE| 1031 +--+--+ +-----+ 1032 | | | | | 1033 | | +-------------+ | 1034 | | +--------+ | | 1035 | | | | | 1036 | +-----+ +-+---+ 1037 ------- |LEAF | | LEAF| 1038 +-----+ +-----+ 1040 Figure 11: Fabric with a controller 1042 4.11.1. Controller Attached to ToFs 1044 If a controller is attaching to the RIFT domain from ToF, it usually 1045 uses dual-homing connections. The loopback prefix of the controller 1046 should be advertised down by the ToF and spine to leaves. If the 1047 controller loses link to ToF, make sure the ToF withdraw the prefix 1048 of the controller(use different mechanisms). 1050 4.11.2. Controller Attached to Leaf 1052 If the controller is attaching from a leaf to the fabric, no special 1053 provisions are needed. 1055 4.12. Internet Connectivity With Underlay 1057 If global addressing is running without overlay, an external default 1058 route needs to be advertised through rift fabric to achieve internet 1059 connectivity. For the purpose of forwarding of the entire rift 1060 fabric, an internal fabric prefix needs to be advertised in the South 1061 Prefix TIE by ToF and spine nodes. 1063 4.12.1. Internet Default on the Leaf 1065 In case that an internet access request comes from a leaf and the 1066 internet gateway is another leaf, the leaf node as the internet 1067 gateway needs to advertise a default route in its Prefix North TIE. 1069 4.12.2. Internet Default on the ToFs 1071 In case that an internet access request comes from a leaf and the 1072 internet gateway is a ToF, the ToF and spine nodes need to advertise 1073 a default route in the Prefix South TIE. 1075 4.13. Subnet Mismatch and Address Families 1077 +--------+ +--------+ 1078 | | LIE LIE | | 1079 | A | +----> <----+ | B | 1080 | +---------------------+ | 1081 +--------+ +--------+ 1082 X/24 Y/24 1084 Figure 12: subnet mismatch 1086 LIEs are exchanged over all links running RIFT to perform Link 1087 (Neighbor) Discovery. A node MUST NOT originate LIEs on an address 1088 family if it does not process received LIEs on that family. LIEs on 1089 same link are considered part of the same negotiation independent on 1090 the address family they arrive on. An implementation MUST be ready 1091 to accept TIEs on all addresses it used as source of LIE frames. 1093 As shown in the above figure, without further checks adjacency of 1094 node A and B may form, but the forwarding between node A and node B 1095 may fail because subnet X mismatches with subnet Y. 1097 To prevent this a RIFT implementation should check for subnet 1098 mismatch just like e.g. ISIS does. This can lead to scenarios where 1099 an adjacency, despite exchange of LIEs in both address families may 1100 end up having an adjacency in a single AF only. This is a 1101 consideration especially in Section 4.8 scenarios. 1103 4.14. 
Anycast Considerations 1104 + traffic 1105 | 1106 v 1107 +------+------+ 1108 | ToF | 1109 +---+-----+---+ 1110 | | | | 1111 +------------+ | | +------------+ 1112 | | | | 1113 +---+---+ +-------+ +-------+ +---+---+ 1114 | | | | | | | | 1115 |Spine11| |Spine12| |Spine21| |Spine22| LEVEL 1 1116 +-+---+-+ ++----+-+ +-+---+-+ ++----+-+ 1117 | | | | | | | | 1118 | +---------+ | | +---------+ | 1119 | | | | | | | | 1120 | +-------+ | | | +-------+ | | 1121 | | | | | | | | 1122 +-+---+-+ +--+--+-+ +-+---+-+ +--+--+-+ 1123 | | | | | | | | 1124 |Leaf111| |Leaf112| |Leaf121| |Leaf122| LEVEL 0 1125 +-+-----+ ++------+ +-----+-+ +-----+-+ 1126 + + + ^ | 1127 PrefixA PrefixB PrefixA | PrefixC 1128 | 1129 + traffic 1131 Figure 13: Anycast 1133 If the traffic comes from ToF to Leaf111 or Leaf121 which has anycast 1134 prefix PrefixA. RIFT can deal with this case well. But if the 1135 traffic comes from Leaf122, it arrives Spine21 or Spine22 at level 1. 1136 But Spine21 or Spine22 doesn't know another PrefixA attaching 1137 Leaf111. So it will always get to Leaf121 and never get to Leaf111. 1138 If the intension is that the traffic should been offloaded to 1139 Leaf111, then use policy guided prefixes [PGP reference]. 1141 5. Acknowledgements 1143 6. Contributors 1145 The following people (listed in alphabetical order) contributed 1146 significantly to the content of this document and should be 1147 considered co-authors: 1149 Tony Przygienda 1150 Juniper Networks 1152 1194 N. Mathilda Ave 1154 Sunnyvale, CA 94089 1156 US 1158 Email: prz@juniper.net 1160 7. Normative References 1162 [ISO10589-Second-Edition] 1163 International Organization for Standardization, 1164 "Intermediate system to Intermediate system intra-domain 1165 routeing information exchange protocol for use in 1166 conjunction with the protocol for providing the 1167 connectionless-mode Network Service (ISO 8473)", November 1168 2002. 1170 [TR-384] Broadband Forum Technical Report, "TR-384 Cloud Central 1171 Office Reference Architectural Framework", January 2018. 1173 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 1174 DOI 10.17487/RFC2328, April 1998, 1175 . 1177 [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, 1178 "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, 1179 DOI 10.17487/RFC4861, September 2007, 1180 . 1182 [RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. 1183 Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", 1184 RFC 5357, DOI 10.17487/RFC5357, October 2008, 1185 . 1187 [RFC7130] Bhatia, M., Ed., Chen, M., Ed., Boutros, S., Ed., 1188 Binderberger, M., Ed., and J. Haas, Ed., "Bidirectional 1189 Forwarding Detection (BFD) on Link Aggregation Group (LAG) 1190 Interfaces", RFC 7130, DOI 10.17487/RFC7130, February 1191 2014, . 1193 [RIFT] Przygienda, T., Sharma, A., Thubert, P., Rijsman, B., and 1194 D. Afanasiev, "RIFT: Routing in Fat Trees", Work in 1195 Progress, Internet-Draft, draft-ietf-rift-rift-11, 10 1196 March 2020, 1197 . 1199 [I-D.white-distoptflood] 1200 White, R., Hegde, S., and S. Zandi, "IS-IS Optimal 1201 Distributed Flooding for Dense Topologies", Work in 1202 Progress, Internet-Draft, draft-white-distoptflood-01, 30 1203 September 2019, 1204 . 1206 8. Informative References 1208 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 1209 "Network Time Protocol Version 4: Protocol and Algorithms 1210 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 1211 . 1213 [RFC8200] Deering, S. and R. 
Hinden, "Internet Protocol, Version 6 1214 (IPv6) Specification", STD 86, RFC 8200, 1215 DOI 10.17487/RFC8200, July 2017, 1216 . 1218 [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. 1219 Perkins, "Registration Extensions for IPv6 over Low-Power 1220 Wireless Personal Area Network (6LoWPAN) Neighbor 1221 Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, 1222 . 1224 Authors' Addresses 1226 Yuehua Wei (editor) 1227 ZTE Corporation 1228 No.50, Software Avenue 1229 Nanjing 1230 210012 1231 China 1233 Email: wei.yuehua@zte.com.cn 1235 Zheng Zhang 1236 ZTE Corporation 1237 No.50, Software Avenue 1238 Nanjing 1239 210012 1240 China 1242 Email: zzhang_ietf@hotmail.com 1243 Dmitry Afanasiev 1244 Yandex 1246 Email: fl0w@yandex-team.ru 1248 Tom Verhaeg 1249 Juniper Networks 1251 Email: tverhaeg@juniper.net 1253 Jaroslaw Kowalczyk 1254 Orange Polska 1256 Email: jaroslaw.kowalczyk2@orange.com 1258 Pascal Thubert 1259 Cisco Systems, Inc 1260 Building D 1261 45 Allee des Ormes - BP1200 1262 06254 MOUGINS - Sophia Antipolis 1263 France 1265 Phone: +33 497 23 26 34 1266 Email: pthubert@cisco.com