RIFT Working Group                                    T. Przygienda, Ed.
Internet-Draft                                          Juniper Networks
Intended status: Standards Track                               A. Sharma
Expires: July 15, 2018                                           Comcast
                                                                A. Atlas
                                                                J. Drake
                                                        Juniper Networks
                                                        January 11, 2018


                       RIFT: Routing in Fat Trees
                        draft-przygienda-rift-04

Abstract

   This document outlines a specialized, dynamic routing protocol for
   Clos and fat-tree network topologies.
   The protocol (1) deals with automatic construction of fat-tree
   topologies based on detection of links, (2) minimizes the amount of
   routing state held at each level, (3) automatically prunes the
   topology distribution exchanges to a sufficient subset of links,
   (4) supports automatic disaggregation of prefixes on link and node
   failures to prevent black-holing and suboptimal routing, (5) allows
   traffic steering and re-routing policies, (6) allows non-ECMP
   forwarding and ultimately (7) provides mechanisms to synchronize a
   limited key-value data-store that can be used after protocol
   convergence to e.g. bootstrap higher levels of functionality on
   nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 15, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reference Frame
     2.1.  Terminology
     2.2.  Topology
   3.  Requirement Considerations
   4.  RIFT: Routing in Fat Trees
     4.1.  Overview
     4.2.  Specification
       4.2.1.  Transport
       4.2.2.  Link (Neighbor) Discovery (LIE Exchange)
       4.2.3.  Topology Exchange (TIE Exchange)
         4.2.3.1.  Topology Information Elements
         4.2.3.2.  South- and Northbound Representation
         4.2.3.3.  Flooding
         4.2.3.4.  TIE Flooding Scopes
         4.2.3.5.  Initial and Periodic Database Synchronization
         4.2.3.6.  Purging
         4.2.3.7.  Southbound Default Route Origination
         4.2.3.8.  Optional Automatic Flooding Reduction and
                   Partitioning
       4.2.4.  Policy-Guided Prefixes
         4.2.4.1.  Ingress Filtering
         4.2.4.2.  Applying Policy
         4.2.4.3.  Store Policy-Guided Prefix for Route Computation
                   and Regeneration
         4.2.4.4.  Re-origination
         4.2.4.5.  Overlap with Disaggregated Prefixes
       4.2.5.  Reachability Computation
         4.2.5.1.  Northbound SPF
         4.2.5.2.  Southbound SPF
         4.2.5.3.  East-West Forwarding Within a Level
       4.2.6.  Attaching Prefixes
       4.2.7.  Attaching Policy-Guided Prefixes
       4.2.8.  Automatic Disaggregation on Link & Node Failures
       4.2.9.  Optional Autoconfiguration
         4.2.9.1.  Terminology
         4.2.9.2.  Automatic SystemID Selection
         4.2.9.3.  Generic Fabric Example
         4.2.9.4.  Level Determination Procedure
         4.2.9.5.  Resulting Topologies
       4.2.10. Stability Considerations
     4.3.  Further Mechanisms
       4.3.1.  Overload Bit
       4.3.2.  Optimized Route Computation on Leafs
       4.3.3.  Key/Value Store
       4.3.4.  Interactions with BFD
       4.3.5.  Leaf to Leaf Procedures
       4.3.6.  Other End-to-End Services
       4.3.7.  Address Family and Multi Topology Considerations
       4.3.8.  Reachability of Internal Nodes in the Fabric
       4.3.9.  One-Hop Healing of Levels with East-West Links
   5.  Examples
     5.1.  Normal Operation
     5.2.  Leaf Link Failure
     5.3.  Partitioned Fabric
     5.4.  Northbound Partitioned Router and Optional East-West Links
   6.  Implementation and Operation: Further Details
     6.1.  Considerations for Leaf-Only Implementation
     6.2.  Adaptations to Other Proposed Data Center Topologies
     6.3.  Originating Non-Default Route Southbound
   7.  Security Considerations
   8.  Information Elements Schema
     8.1.  common.thrift
     8.2.  encoding.thrift
   9.  IANA Considerations
   10. Acknowledgments
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Clos [CLOS] and Fat-Tree [FATTREE] have gained prominence in today's
   networking, primarily as a result of the paradigm shift towards a
   centralized data-center based architecture that is poised to deliver
   a majority of computation and storage services in the future.  The
   existing set of dynamic routing protocols was geared originally
   towards networks with an irregular topology and low degree of
   connectivity and consequently several attempts have been made to
   adapt those.  Most successfully, BGP [RFC4271] [RFC7938] has been
   extended to this purpose, not so much due to its inherent
   suitability to solve the problem but rather because of the perceived
   capability to modify it "quicker" and the inherent difficulties of
   link-state [DIJKSTRA] based protocols to fulfill certain of the
   resulting requirements.

   In looking at the problem through the very lens of its requirements
   an optimal approach does not seem to be a simple modification of
   either a link-state (distributed computation) or distance-vector
   (diffused computation) approach but rather a mixture of both,
   colloquially best described as 'link-state towards the spine' and
   'distance vector towards the leafs'.  The balance of this document
   details the resulting protocol.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

2.  Reference Frame

2.1.  Terminology

   This section presents the terminology used in this document.  It is
   assumed that the reader is thoroughly familiar with the terms and
   concepts used in OSPF [RFC2328] and IS-IS [RFC1142], [ISO10589] as
   well as the corresponding graph-theoretical concepts of shortest
   path first (SPF) [DIJKSTRA] computation and directed acyclic graphs
   (DAG).

   Level:  Clos and Fat Tree networks are trees and 'level' denotes
      the set of nodes at the same height in such a network, where the
      bottom level is level 0.  A node has links to nodes one level
      down and/or one level up.  Under some circumstances, a node may
      have links to nodes at the same level.  As a footnote: Clos
      terminology often uses the concept of "stage" but due to the
      folded nature of the Fat Tree we do not use it, to prevent
      misunderstandings.

   Spine/Aggregation/Edge Levels:  Traditional names for Levels 2, 1
      and 0 respectively.  Level 0 is often called leaf as well.

   Point of Delivery (PoD):  A self-contained vertical slice of a Clos
      or Fat Tree network, normally containing only level 0 and level
      1 nodes.  It communicates with nodes in other PoDs via the
      spine.

   Spine:  The set of nodes that provide inter-PoD communication.
      These nodes are also organized into levels (typically one, three,
      or five levels).  Spine nodes do not belong to any PoD and are
      assigned the PoD value 0 to indicate this.

   Leaf:  A node without southbound adjacencies.  Its level is 0
      (except cases where it is deriving its level via ZTP and is
      running without LEAF_ONLY, which will be explained in
      Section 4.2.9).

   Connected Spine:  In case a spine level represents a connected
      graph (discounting links terminating at different levels), we
      call it a "connected spine"; in case a spine level consists of
      multiple partitions, we call it a "disconnected" or "partitioned
      spine".  In other terms, a spine without east-west links is
      disconnected and is the typical configuration for Clos and Fat
      Tree networks.

   South/Southbound and North/Northbound (Direction):  When describing
      protocol elements and procedures, we will be using the
      directionality of the compass in different situations.  I.e.,
      'south' or 'southbound' means moving towards the bottom of the
      Clos or Fat Tree network and 'north' and 'northbound' mean
      moving towards the top of the Clos or Fat Tree network.

   Northbound Link:  A link to a node one level up or, in other words,
      one level further north.

   Southbound Link:  A link to a node one level down or, in other
      words, one level further south.

   East-West Link:  A link between two nodes at the same level.  East-
      west links are normally not part of Clos or "fat-tree"
      topologies.

   Leaf shortcuts (L2L):  East-west links at the leaf level, which
      need to be differentiated from east-west links at other levels.

   Southbound representation:  Information sent towards a lower level,
      representing only a limited amount of information.

   TIE:  This is an acronym for a "Topology Information Element".
      TIEs are exchanged between RIFT nodes to describe parts of a
      network such as links and address prefixes.
      It can be thought of as largely equivalent to ISIS LSPs or OSPF
      LSAs.  We will talk about N-TIEs when talking about TIEs in the
      northbound representation and S-TIEs for the southbound
      equivalent.

   Node TIE:  This is an acronym for a "Node Topology Information
      Element", largely equivalent to an OSPF Node LSA, i.e. it
      contains all neighbors the node discovered and information about
      the node itself.

   Prefix TIE:  This is an acronym for a "Prefix Topology Information
      Element" and it contains all prefixes directly attached to this
      node in case of an N-TIE and, in case of an S-TIE, the necessary
      default and de-aggregated prefixes the node passes southbound.

   Policy-Guided Information:  Information that is passed in either
      the southbound direction or the northbound direction by the
      means of diffusion and can be filtered via policies.  Policy-
      Guided Prefixes and KV TIEs are examples of Policy-Guided
      Information.

   Key Value TIE:  An S-TIE that is carrying a set of key value pairs
      [DYNAMO].  It can be used to distribute information in the
      southbound direction within the protocol.

   TIDE:  Topology Information Description Element, equivalent to a
      CSNP in ISIS.

   TIRE:  Topology Information Request Element, equivalent to a PSNP
      in ISIS.  It can both confirm received and request missing TIEs.

   PGP:  Policy-Guided Prefixes allow supporting traffic engineering
      that cannot be achieved by the means of SPF computation or
      normal node and prefix S-TIE origination.  S-PGPs are propagated
      in the southbound direction only and N-PGPs strictly in the
      northbound direction.

   De-aggregation/Disaggregation:  The process in which a node decides
      to advertise certain prefixes it received in N-TIEs to prevent
      black-holing and suboptimal routing upon link failures.

   LIE:  This is an acronym for a "Link Information Element", largely
      equivalent to HELLOs in IGPs, exchanged over all the links
      between systems running RIFT to form adjacencies.

   FL:  Flooding Leader.  The Flooding Leader for a specific system
      has a dedicated role to flood TIEs of that system.

2.2.  Topology

   . +--------+ +--------+
   . | | | | ^ N
   . |Spine 21| |Spine 22| |
   .Level 2 ++-+--+-++ ++-+--+-++ <-*-> E/W
   . | | | | | | | | |
   . P111/2| |P121 | | | | S v
   . ^ ^ ^ ^ | | | |
   . | | | | | | | |
   . +--------------+ | +-----------+ | | | +---------------+
   . | | | | | | | |
   . South +-----------------------------+ | | ^
   . | | | | | | | All TIEs
   . 0/0 0/0 0/0 +-----------------------------+ |
   . v v v | | | | |
   . | | +-+ +<-0/0----------+ | |
   . | | | | | | | |
   .+-+----++ optional +-+----++ ++----+-+ ++-----++
   .| | E/W link | | | | | |
   .|Node111+----------+Node112| |Node121| |Node122|
   .+-+---+-+ ++----+-+ +-+---+-+ ++---+--+
   . | | | South | | | |
   . | +---0/0--->-----+ 0/0 | +----------------+ |
   . 0/0 | | | | | | |
   . | +---<-0/0-----+ | v | +--------------+ | |
   . v | | | | | | |
   .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
   .| | (L2L) | | | | Level 0 | |
   .|Leaf111~~~~~~~~~~~~Leaf112| |Leaf121| |Leaf122|
   .+-+-----+ +-+---+-+ +--+--+-+ +-+-----+
   . + + \ / + +
   . Prefix111 Prefix112 \ / Prefix121 Prefix122
   . multi-homed
   . Prefix
   .+---------- Pod 1 ---------+ +---------- Pod 2 ---------+

            Figure 1: A two level spine-and-leaf topology

   We will use this topology (commonly called a fat tree/network in
   modern DC considerations [VAHDAT08], as a homonym to the original
   definition of the term [FATTREE]) in all further considerations.
   It depicts a generic "fat-tree" and the concepts explained here in
   three levels carry by induction to further levels and higher
   degrees of connectivity.
3.  Requirement Considerations

   [RFC7938] gives the original set of requirements, augmented here
   based upon recent experience in the operation of fat-tree networks.

   REQ1:   The control protocol should discover the physical links
           automatically and be able to detect cabling that violates
           fat-tree topology constraints.  It must react accordingly
           to such mis-cabling attempts, at a minimum preventing
           adjacencies between nodes from being formed and traffic
           from being forwarded on those mis-cabled links.  E.g.
           connecting a leaf to a spine at level 2 should be detected
           and ideally prevented.

   REQ2:   A node without any configuration besides default values
           should come up at the correct level in any PoD it is
           introduced into.  Optionally, it must be possible to
           configure nodes to restrict their participation to the
           PoD(s) targeted at any level.

   REQ3:   Optionally, the protocol should allow provisioning of data
           centers where the individual switches carry no
           configuration information and all derive their level from
           a "seed".  Observe that this requirement may collide with
           the desire to detect cabling misconfiguration, so that
           only one of the two requirements can be fully met in a
           chosen configuration mode.

   REQ4:   The solution should allow for a minimum size routing
           information base and forwarding tables at the leaf level
           for speed, cost and simplicity reasons.  Keeping excessive
           amounts of information away from leaf nodes simplifies
           operation and lowers the cost of the underlay.

   REQ5:   A very high degree of ECMP must be supported.  Maximum
           ECMP is currently understood as the most efficient routing
           approach to maximize the throughput of switching fabrics
           [MAKSIC2013].

   REQ6:   Non-equal-cost anycast must be supported to allow for easy
           and robust multi-homing of services without regressing to
           careful balancing of link costs.

   REQ7:   Traffic engineering should be allowed by modification of
           prefixes and/or their next-hops.

   REQ8:   The solution should allow for access to the link states of
           the whole topology to enable efficient support for modern
           control architectures like SPRING [RFC7855] or PCE
           [RFC4655].

   REQ9:   The solution should easily accommodate opaque data to be
           carried throughout the topology to subsets of nodes.  This
           can be used for many purposes, one of them being a
           key-value store that allows bootstrapping of nodes right
           at the time of topology discovery.

   REQ10:  Nodes should be taken out of and introduced into
           production with minimum wait-times and a minimum of
           "shaking" of the network, i.e. the radius of propagation
           (often called "blast radius") of changed information
           should be as small as feasible.

   REQ11:  The protocol should allow for maximum aggregation of
           carried routing information while at the same time
           automatically de-aggregating the prefixes to prevent
           black-holing in case of failures.  The de-aggregation
           should support the maximum possible ECMP/N-ECMP remaining
           after failure.

   REQ12:  Reducing the scope of communication needed throughout the
           network on link and state failure, as well as reducing
           advertisements of repeating, idiomatic or policy-guided
           information in stable state, is highly desirable since it
           leads to better stability and faster convergence behavior.

   REQ13:  Once a packet traverses a link in a "southbound"
           direction, it must not take any further "northbound" steps
           along its path to delivery to its destination under normal
           conditions.  Taking a path through the spine in cases
           where a shorter path is available is highly undesirable.

   REQ14:  Parallel links between the same set of nodes must be
           distinguishable for SPF, failure and traffic engineering
           purposes.

   REQ15:    The protocol must not rely on interfaces having
             discernible unique addresses, i.e. it must operate in
             the presence of unnumbered links (even parallel ones) or
             links of a single node having the same addresses.

   The following list represents possible requirements and
   requirements under discussion:

   PEND1:    Supporting anything but point-to-point links is a non-
             requirement.  Questions remain: for connecting to the
             leaves, is there a case where multipoint is desirable?
             One could still model it as point-to-point links; it
             seems there is no need for anything more than a
             NBMA-type construct.

   PEND2:    What is the maximum number of leaf prefixes we need to
             carry?  Is 500,000 enough?

   Finally, the following are non-requirements:

   NONREQ1:  Broadcast media support is unnecessary.

   NONREQ2:  Purging is unnecessary given its fragility and
             complexity and today's large memory size on even modest
             switches and routers.

   NONREQ3:  Special support for layer 3 multi-hop adjacencies is not
             part of the protocol specification.  Such support can be
             easily provided by using tunneling technologies, the
             same way IGPs today are solving the problem.

4.  RIFT: Routing in Fat Trees

   Derived from the above requirements, we present a detailed outline
   of a protocol optimized for Routing in Fat Trees (RIFT) that in the
   most abstract terms has many properties of a modified link-state
   protocol [RFC2328][RFC1142] when "pointing north" and of a path-
   vector [RFC4271] protocol when "pointing south".  Albeit an unusual
   combination, it does quite naturally exhibit the desirable
   properties we seek.

4.1.  Overview

   The novel property of RIFT is that it floods northbound "flat"
   link-state information so that each level understands the full
   topology of the levels south of it.
   In contrast, in the southbound direction the protocol operates like
   a path vector protocol, or rather a distance vector with implicit
   split horizon, since the topology constraints make a diffused
   computation front propagating in all directions unnecessary.

4.2.  Specification

4.2.1.  Transport

   All protocol elements are carried over UDP.  Once QUIC [QUIC]
   achieves the desired stability in deployments it may prove a
   valuable candidate for TIE transport.

   All packet formats are defined in Thrift models in Section 8.

   Future versions may include a [PROTOBUF] schema.

4.2.2.  Link (Neighbor) Discovery (LIE Exchange)

   LIE exchange happens over the well-known administratively locally
   scoped IPv4 multicast address [RFC2365] or link-local multicast
   scope for IPv6 [RFC4291] and SHOULD be sent with a TTL of 1 to
   prevent RIFT information from reaching beyond a single L3 next-hop
   in the topology.  LIEs are exchanged over all links running RIFT.

   Unless Section 4.2.9 is used, each node is provisioned with the
   level at which it is operating and its PoD; otherwise a default
   level and PoD of zero are assumed, meaning that leafs do not need
   to be configured with a level (or even PoD).  Nodes in the spine
   are configured with a PoD of zero.  This information is propagated
   in the LIEs exchanged.  Adjacencies are formed if and only if
   (definitions of LEAF_ONLY are found in Section 4.2.9)

   1.  the node is in the same PoD or either the node or the neighbor
       advertises "any" PoD membership (PoD# = 0) AND

   2.  the neighboring node is running the same MAJOR schema version
       AND

   3.  the neighbor is not a member of some PoD while the node has a
       northbound adjacency already joining another PoD AND

   4.  the advertised MTUs match on both sides AND

   5.  both nodes advertise defined level values AND
   6.  [

       i)    the node is at level 0 and it has no adjacencies to
             nodes with a level higher than this neighboring node OR

       ii)   both nodes are at level 0 AND both indicate support for
             Section 4.3.5 OR

       iii)  neither node is at level 0 and the neighboring node is
             at most one level away

       ].

   A node configured with "any" PoD membership MUST, after building
   its first northbound adjacency making it a participant in a PoD,
   advertise that PoD as part of its LIEs.

   LIEs arriving with a TTL larger than 1 MUST be ignored.

   A node SHOULD NOT send out LIEs without a defined level in the
   header, but in certain scenarios it may be beneficial for trouble-
   shooting purposes.

   LIE exchange uses a three-way handshake mechanism [RFC5303].
   Precise finite state machines will be provided in later versions of
   this specification.  LIE packets contain nonces and may contain an
   SHA-1 hash [RFC6234] over the nonces and some of the LIE data,
   which prevents corruption and replay attacks.  TIE flooding reuses
   those nonces to prevent mismatches and can use them for security
   purposes in case it is using QUIC [QUIC].  Section 7 will address
   the precise security mechanisms in the future.

4.2.3.  Topology Exchange (TIE Exchange)

4.2.3.1.  Topology Information Elements

   Topology and reachability information in RIFT is conveyed by the
   means of TIEs, which have a good amount of commonality with LSAs in
   OSPF.

   The TIE exchange mechanism uses the port indicated by each node in
   the LIE exchange and the interface on which the adjacency has been
   formed as destination.  It SHOULD use a TTL of 1 as well.

   TIEs contain sequence numbers, lifetimes and a type.  Each type has
   a large identifying number space and information is spread across
   possibly many TIEs of a certain type by the means of a hash
   function that a node or deployment can individually determine.
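   As a non-normative illustration of such spreading (the function
   names and the bucket parameter below are our own, not part of the
   schema), a node could map prefixes into the TIE number space of one
   type with a stable hash:

```python
# Hypothetical sketch only: distribute prefixes across a locally chosen
# number of TIEs of one type using a stable hash.  The bucket count is
# an implementation/deployment choice, not mandated by this document.
import hashlib


def tie_number(prefix: str, tie_buckets: int) -> int:
    """Map a prefix to a TIE identifier within the type's number space."""
    digest = hashlib.sha256(prefix.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % tie_buckets


def pack_prefixes(prefixes, tie_buckets):
    """Group prefixes into per-TIE lists, one list per generated TIE."""
    ties = {}
    for prefix in prefixes:
        ties.setdefault(tie_number(prefix, tie_buckets), []).append(prefix)
    return ties
```

   With a large `tie_buckets` value the node approaches one prefix per
   TIE; with `tie_buckets = 1` all prefixes pack densely into a single
   TIE.  Changing the bucket count (rehashing) moves prefixes between
   TIEs and hence forces re-advertisement, matching the trade-off
   discussed in the text.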
   One extreme point of the design space is a prefix per TIE, which
   leads to BGP-like behavior, vs. dense packing into few TIEs,
   leading to the more traditional IGP trade-off of fewer TIEs.  An
   implementation may even rehash, at the cost of a significant amount
   of re-advertisements of TIEs.

   More information about the TIE structure can be found in the schema
   in Section 8.

4.2.3.2.  South- and Northbound Representation

   As a central concept of RIFT, each node represents itself
   differently depending on the direction in which it is advertising
   information.  More precisely, a spine node represents two different
   databases to its neighbors depending on whether it advertises TIEs
   to the north or to the south/sideways.  We call those differing TIE
   databases either south- or northbound (S-TIEs and N-TIEs),
   depending on the direction of distribution.

   The N-TIEs hold all of the node's adjacencies, local prefixes and
   northbound policy-guided prefixes while the S-TIEs hold only all of
   the node's adjacencies plus the default prefix with the necessary
   disaggregated prefixes and southbound policy-guided prefixes.  We
   will explain this in detail further in Section 4.2.8 and
   Section 4.2.4.

   The TIE types are symmetric in both directions and Table 1 provides
   a quick reference to the different TIE types including direction
   and their function.

   TIE-Type   Content
   ---------  ---------------------------------------------------------
   node       node properties, adjacencies and information helping in
   N-TIE      complex disaggregation scenarios
   node       same content as node N-TIE except the information to
   S-TIE      help disaggregation
   Prefix     contains the node's directly reachable prefixes
   N-TIE
   Prefix     contains originated defaults and de-aggregated prefixes
   S-TIE
   PGP N-TIE  contains the node's north PGPs
   PGP S-TIE  contains the node's south PGPs
   KV N-TIE   contains the node's northbound KVs
   KV S-TIE   contains the node's southbound KVs

                          Table 1: TIE Types

   As an example illustrating databases holding both representations,
   consider the topology in Figure 1 with the optional link between
   node 111 and node 112 (so that the flooding on an east-west link
   can be shown).  This example assumes unnumbered interfaces.  First,
   here are the TIEs generated by some nodes.  For simplicity, the key
   value elements and the PGP elements which may be included in their
   S-TIEs or N-TIEs are not shown.

   Spine21 S-TIEs:
   Node S-TIE:
     NodeElement(layer=2, neighbors((Node111, layer 1, cost 1),
       (Node112, layer 1, cost 1), (Node121, layer 1, cost 1),
       (Node122, layer 1, cost 1)))
   Prefix S-TIE:
     SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1))

   Node111 S-TIEs:
   Node S-TIE:
     NodeElement(layer=1, neighbors((Spine21, layer 2, cost 1, links(...)),
       (Spine22, layer 2, cost 1, links(...)),
       (Node112, layer 1, cost 1, links(...)),
       (Leaf111, layer 0, cost 1, links(...)),
       (Leaf112, layer 0, cost 1, links(...))))
   Prefix S-TIE:
     SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1))

   Node111 N-TIEs:
   Node N-TIE:
     NodeElement(layer=1,
       neighbors((Spine21, layer 2, cost 1, links(...)),
       (Spine22, layer 2, cost 1, links(...)),
       (Node112, layer 1, cost 1, links(...)),
       (Leaf111, layer 0, cost 1, links(...)),
       (Leaf112, layer 0, cost 1, links(...))))
   Prefix N-TIE:
     NorthPrefixesElement(prefixes(Node111.loopback))

   Node121 S-TIEs:
   Node S-TIE:
     NodeElement(layer=1, neighbors((Spine21, layer 2, cost 1),
       (Spine22, layer 2, cost 1), (Leaf121, layer 0, cost 1),
       (Leaf122, layer 0, cost 1)))
   Prefix S-TIE:
     SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1))

   Node121 N-TIEs:
   Node N-TIE:
     NodeLinkElement(layer=1,
       neighbors((Spine21, layer 2, cost 1, links(...)),
       (Spine22, layer 2, cost 1, links(...)),
       (Leaf121, layer 0, cost 1, links(...)),
       (Leaf122, layer 0, cost 1, links(...))))
   Prefix N-TIE:
     NorthPrefixesElement(prefixes(Node121.loopback))

   Leaf112 N-TIEs:
   Node N-TIE:
     NodeLinkElement(layer=0,
       neighbors((Node111, layer 1, cost 1, links(...)),
       (Node112, layer 1, cost 1, links(...))))
   Prefix N-TIE:
     NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112,
       Prefix_MH))

   Figure 2: example TIEs generated in a 2 level spine-and-leaf
   topology
Flooding 659 The mechanism used to distribute TIEs is the well-known (albeit 660 modified in several respects to address fat tree requirements) 661 flooding mechanism used by today's link-state protocols. Albeit 662 initially more demanding to implement, it avoids many of the problems 663 of the diffused-computation update style used by path-vector protocols. As described 664 before, TIEs themselves are transported over UDP with the ports 665 indicated in the LIE exchanges and using the destination address (for 666 unnumbered IPv4 interfaces the same considerations apply as in the equivalent 667 OSPF case) on which the LIE adjacency has been formed. 669 On reception of a TIE with an undefined level value in the packet 670 header the node SHOULD issue a warning and indiscriminately discard 671 the packet. 673 Precise finite state machines and procedures will be provided in 674 later versions of this specification. 676 4.2.3.4. TIE Flooding Scopes 678 In a somewhat analogous fashion to link-local, area and domain 679 flooding scopes, RIFT defines several complex "flooding scopes" 680 depending on the direction and type of TIE propagated. 682 Every N-TIE is flooded northbound, providing a node at a given level 683 with the complete topology of the Clos or Fat Tree network underneath 684 it, including all specific prefixes. This means that a packet 685 received from a node at the same or lower level whose destination is 686 covered by one of those specific prefixes may be routed directly 687 towards the node advertising that prefix rather than sending the 688 packet to a node at a higher level. 690 A node's node S-TIEs, consisting of all the node's adjacencies and prefix 691 S-TIEs with default IP prefix and disaggregated prefixes, are flooded 692 southbound in order to allow the nodes one level down to see 693 connectivity of the higher level as well as reachability to the rest 694 of the fabric.
In order to allow an E-W disconnected node in a given 695 level to receive the S-TIEs of other nodes at its level, every *NODE* 696 S-TIE is "reflected" northbound to the level from which it was received. 697 It should be noted that east-west links are included in South TIE 698 flooding; those TIEs need to be flooded to satisfy algorithms in 699 Section 4.2.5. In that way nodes at the same level can learn about each 700 other without a lower level, e.g. in the case of the leaf level. The precise 701 flooding scopes are given in Table 2. Those rules also govern 702 what SHOULD be included in TIDEs towards neighbors. East-West 703 flooding scopes are identical to South flooding scopes. 705 Node S-TIE "reflection" allows supporting disaggregation on failures 706 described in Section 4.2.8 and flooding reduction in Section 4.2.3.8. 708 Packet Type South North 709 vs. Peer 710 Direction 711 ------------- ------------------------------ ------------------------ 712 node S-TIE flood self-originated only flood if TIE 713 originator's level is 714 higher than own level 715 non-node flood self-originated only flood only if TIE 716 S-TIE originator is equal peer 717 all N-TIEs never flood flood always 718 TIDE include TIEs in flooding scope include TIEs in flooding 719 scope 720 TIRE include all N-TIEs and all include only if TIE 721 peer's self-originated TIEs originator is equal peer 722 and all node S-TIEs 724 Table 2: Flooding Scopes 726 As an example to illustrate these rules, consider using the topology 727 in Figure 1, with the optional link between node 111 and node 112, 728 and the associated TIEs given in Figure 2. The flooding from 729 particular nodes of the TIEs is given in Table 3.
731 Router Neighbor TIEs 732 floods to 733 ------------ -------- ----------------------------------------------- 734 Leaf111 Node112 Leaf111 N-TIEs, Node111 node S-TIE 735 Leaf111 Node111 Leaf111 N-TIEs, Node112 node S-TIE 737 Node111 Leaf111 Node111 S-TIEs 738 Node111 Leaf112 Node111 S-TIEs 739 Node111 Node112 Node111 S-TIEs 740 Node111 Spine21 Node111 N-TIEs, Leaf111 N-TIEs, Leaf112 N-TIEs, 741 Spine22 node S-TIE 742 Node111 Spine22 Node111 N-TIEs, Leaf111 N-TIEs, Leaf112 N-TIEs, 743 Spine21 node S-TIE 745 ... ... ... 746 Spine21 Node111 Spine21 S-TIEs 747 Spine21 Node112 Spine21 S-TIEs 748 Spine21 Node121 Spine21 S-TIEs 749 Spine21 Node122 Spine21 S-TIEs 750 ... ... ... 752 Table 3: Flooding some TIEs from example topology 754 4.2.3.5. Initial and Periodic Database Synchronization 756 The initial exchange of RIFT is modeled after ISIS with TIDE being 757 equivalent to CSNP and TIRE playing the role of PSNP. The content of 758 TIDEs and TIREs is governed by Table 2. 760 4.2.3.6. Purging 762 RIFT does not purge information that has been distributed by the 763 protocol. Purging mechanisms in other routing protocols have proven 764 to be complex and fragile over many years of experience. Abundant 765 amounts of memory are available today even on low-end platforms. The 766 information will age out and all computations will deliver correct 767 results if a node leaves the network due to the new information 768 distributed by its adjacent nodes. 770 Once a RIFT node issues a TIE with an ID, it MUST preserve the ID as 771 long as feasible (also when the protocol restarts), even if the TIE 772 loses all content. The re-advertisement of an empty TIE fulfills the 773 purpose of purging any information advertised in previous versions. 774 The originator is free to not re-originate the according empty TIE 775 again or originate an empty TIE with a relatively short lifetime to 776 prevent a large number of long-lived empty stubs from polluting the network.
777 Each node will timeout and clean up the according empty TIEs 778 independently. 780 4.2.3.7. Southbound Default Route Origination 782 Under certain conditions nodes issue a default route in their South 783 Prefix TIEs. 785 A node X that 787 1. is NOT overloaded AND 789 2. has southbound or east-west adjacencies 791 originates in its south prefix TIE such a default route IIF 793 1. all other nodes at X's level are overloaded OR 795 2. all other nodes at X's level have NO northbound adjacencies OR 797 3. X has computed reachability to a default route during N-SPF. 799 The term "all other nodes at X's level" obviously describes just the 800 nodes at the same level in the POD with a viable lower layer 801 (otherwise the node S-TIEs cannot be reflected and the nodes in e.g. 802 POD 1 and POD 2 are "invisible" to each other). 804 A node originating a southbound default route MUST install a default 805 discard route if it did not compute a default route during N-SPF. 807 4.2.3.8. Optional Automatic Flooding Reduction and Partitioning 809 Several nodes can, but strictly only under conditions defined below, 810 run a hashing function based on TIE originator value and partition 811 flooding between them. 813 Steps for flooding reduction and partitioning: 815 1. select all nodes in the same level for which 817 A. node S-TIEs have been received AND 819 B. which have precisely the same non-empty sets of respectively 820 north and south neighbor adjacencies AND 822 C. have at least one shared southern neighbor including backlink 823 verification and 825 D. support flooding reduction (overload bits are ignored) 827 and then 829 2. run on the chosen set a hash algorithm using nodes' flood 830 priorities and IDs to select flooding leader and backup per TIE 831 originator ID, i.e. each node immediately floods to all 832 its necessary neighbors the TIEs that it received with an originator 833 ID that makes it the flooding leader or backup for this 834 originator.
The preference (higher is better) is computed as 835 XOR(TIE-ORIGINATOR-ID<<1, ~OWN-SYSTEM-ID), where << is a non- 836 circular shift and ~ is bit-wise NOT. 838 3. In the very unlikely case of hash collisions on either of the two 839 nodes with the highest values (i.e. either does NOT produce unique 840 hashes as compared to all other hash values), the node running 841 the election does not attempt to reduce flooding. 843 Additional rules for flooding reduction and partitioning: 845 1. A node always floods its own TIEs. 847 2. A node generates TIDEs as usual but when receiving TIREs with 848 requests for TIEs for a node for which it is not a flooding 849 leader or backup it ignores such TIREs on the first request only. 850 Normally, the flooding leader should satisfy the requestor and 851 with that no further TIREs for such TIEs will be generated. 852 Otherwise, the next set of TIDEs and TIREs will lead to flooding 853 independent of the flooding leader status. 855 3. A node receiving a TIE originated by a node for which it is not a 856 flooding leader floods such TIEs only when receiving an out-of- 857 date TIDE for them, except for the first one. 859 The mechanism can be implemented optionally in each node. The 860 capability is carried in the node S-TIE (and for symmetry purposes in 861 node N-TIE as well but it serves no purpose there currently). 863 Obviously flooding reduction does NOT apply to self-originated TIEs. 864 Observe further that all policy-guided information consists of self- 865 originated TIEs. 867 4.2.4. Policy-Guided Prefixes 869 In a fat tree, it can sometimes be desirable to guide traffic to 870 particular destinations or keep specific flows to certain paths. In 871 RIFT, this is done by using policy-guided prefixes with their 872 associated communities. Each community is an abstract value whose 873 meaning is determined by configuration.
It is assumed that the 874 fabric is under a single administrative control so that the meaning 875 and intent of the communities are understood by all the nodes in the 876 fabric. Any node can originate a policy-guided prefix. 878 Since RIFT uses distance vector concepts in a southbound direction, 879 it is straightforward to add a policy-guided prefix to an S-TIE. For 880 easier troubleshooting, the approach taken in RIFT is that a node's 881 southbound policy-guided prefixes are sent in its S-TIE and the 882 receiver does inbound filtering based on the associated communities 883 (an egress policy is imaginable but could lead to different S-TIEs 884 per neighbor, which is not considered in RIFT protocol 885 procedures). A southbound policy-guided prefix can only use links in 886 the south direction. If a PGP S-TIE is received on an east-west or 887 northbound link, it must be discarded by ingress filtering. 889 Conceptually, a southbound policy-guided prefix guides traffic from 890 the leaves up to at most the north-most layer. It is also necessary 891 to have northbound policy-guided prefixes to guide traffic from 892 the north-most layer down to the appropriate leaves. Therefore, RIFT 893 includes northbound policy-guided prefixes in its N PGP-TIE and the 894 receiver does inbound filtering based on the associated communities. 895 A northbound policy-guided prefix can only use links in the northern 896 direction. If an N PGP TIE is received on an east-west or southbound 897 link, it must be discarded by ingress filtering. 899 By separating southbound and northbound policy-guided prefixes and 900 requiring that the cost associated with a PGP is strictly 901 monotonically increasing at each hop, the path cannot loop. Because 902 the costs are strictly increasing, it is not possible to have a loop 903 between a northbound PGP and a southbound PGP.
If east-west links 904 were to be allowed, then looping could occur and issues such as 905 counting to infinity would have to be solved. If complete 906 generality of path is desired - such as including east-west links and using both 907 north and south links in arbitrary sequence - then a Path Vector 908 protocol or a similar solution must be considered. 910 If a node has received the same prefix, after ingress filtering, as a 911 PGP in an S-TIE and in an N-TIE, then the node determines which 912 policy-guided prefix to use based upon the advertised cost. 914 A policy-guided prefix is always preferred to a regular prefix, even 915 if the policy-guided prefix has a larger cost. Section 8 provides 916 normative indication of prefix preferences. 918 The set of policy-guided prefixes received in a TIE is subject to 919 ingress filtering and then re-originated to be sent out in the 920 receiver's appropriate TIE. Both the ingress filtering and the re- 921 origination use the communities associated with the policy-guided 922 prefixes to determine the correct behavior. The cost on re- 923 advertisement MUST increase in a strictly monotonic fashion. 925 4.2.4.1. Ingress Filtering 927 When a node X receives a PGP S-TIE or a PGP N-TIE that is originated 928 from a node Y which does not have an adjacency with X, all PGPs in 929 such a TIE MUST be filtered. Similarly, if node Y is at the same 930 layer as node X, then X MUST filter out PGPs in such S- and N-TIEs to 931 prevent loops. 933 Next, policy can be applied to determine which policy-guided prefixes 934 to accept. Since ingress filtering is chosen rather than egress 935 filtering and per-neighbor PGPs, policy that applies to links is done 936 at the receiver. Because the RIFT adjacency is between nodes and 937 there may be parallel links between the two nodes, the policy-guided 938 prefix is considered to start with the next-hop set that has all 939 links to the originating node Y.
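The ingress-filter checks and attribute initialization just described can be sketched as follows; this is a non-normative Python sketch, and the names `Pgp`, `accept_pgp` and the data layout are hypothetical illustrations rather than anything defined by the protocol:

```python
from dataclasses import dataclass, field

@dataclass
class Pgp:
    """Hypothetical PGP record with the attributes listed in Section 4.2.4.1."""
    prefix: str
    cost: int
    community_list: list = field(default_factory=list)
    next_hop_set: set = field(default_factory=set)

def accept_pgp(x_adjacencies, x_layer, y_node, y_layer, received):
    """Ingress filter at node X for a PGP received from originator Y.

    Drops the PGP if Y has no adjacency with X, or if Y is at the same
    layer as X (loop prevention). Otherwise the attributes are
    initialized, with the next-hop set starting as all links to Y."""
    if y_node not in x_adjacencies:
        return None          # no adjacency with the originator: filter
    if y_layer == x_layer:
        return None          # same layer: filter to prevent loops
    return Pgp(prefix=received["prefix"],
               cost=received["cost"],
               community_list=list(received["communities"]),
               next_hop_set=set(x_adjacencies[y_node]))  # all links to Y
```

For example, with two parallel links to Y, the resulting `next_hop_set` contains both links, reflecting that the PGP starts with the full next-hop set towards the originating node.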
941 A policy-guided prefix has or is assigned the following attributes: 943 cost: This is initialized to the cost received. 944 community_list: This is initialized to the list of the communities 945 received. 947 next_hop_set: This is initialized to the set of links to the 948 originating node Y. 950 4.2.4.2. Applying Policy 952 The specific action to apply based upon a community is deployment 953 specific. Here are some examples of things that can be done with 954 communities. A community is a 64-bit number and it 955 can be written as a single field M or as a multi-field (S = M[0-31], 956 T = M[32-63]) in these examples. For simplicity, the policy-guided 957 prefix is referred to as P, the processing node as X and the 958 originator as Y. 960 Prune Next-Hops: Community Required: For each next-hop in 961 P.next_hop_set, if the next-hop does not have the community, prune 962 that next-hop from P.next_hop_set. 964 Prune Next-Hops: Avoid Community: For each next-hop in 965 P.next_hop_set, if the next-hop has the community, prune that 966 next-hop from P.next_hop_set. 968 Drop if Community: If node X has community M, discard P. 970 Drop if not Community: If node X does not have the community M, 971 discard P. 973 Prune to ifIndex T: For each next-hop in P.next_hop_set, if the 974 next-hop's ifIndex is not the value T specified in the community 975 (S,T), then prune that next-hop from P.next_hop_set. 977 Add Cost T: For each appearance of community S in P.community_list, 978 if the node X has community S, then add T to P.cost. 980 Accumulate Min-BW T: Let bw be the sum of the bandwidths of 981 P.next_hop_set. If that sum is less than T, then replace (S,T) 982 with (S, bw). 984 Add Community T if Node matches S: If the node X has community S, 985 then add community T to P.community_list. 987 4.2.4.3.
Store Policy-Guided Prefix for Route Computation and 988 Regeneration 990 Once a policy-guided prefix has completed ingress filtering and 991 policy, it is almost ready to store and use. It is still necessary 992 to adjust the cost of the prefix to account for the link from the 993 computing node X to the originating neighbor node Y. 995 There are three different policies that can be used: 997 Minimum Equal-Cost: Find the lowest cost C next-hops in 998 P.next_hop_set and prune to those. Add C to P.cost. 1000 Minimum Unequal-Cost: Find the lowest cost C next-hop in 1001 P.next_hop_set. Add C to P.cost. 1003 Maximum Unequal-Cost: Find the highest cost C next-hop in 1004 P.next_hop_set. Add C to P.cost. 1006 The default policy is Minimum Unequal-Cost but well-known communities 1007 can be defined to get the other behaviors. 1009 Regardless of the policy used, a node MUST store a PGP cost that is 1010 at least 1 greater than the PGP cost received. This enforces the 1011 strictly monotonically increasing condition that avoids loops. 1013 Two databases of PGPs - from N-TIEs and from S-TIEs - are stored. When 1014 a PGP is inserted into the appropriate database, the usual tie- 1015 breaking on cost is performed. Observe that the node retains all PGP 1016 TIEs due to normal flooding behavior and hence loss of the best 1017 prefix will lead to re-evaluation of TIEs present and re- 1018 advertisement of a new best PGP. 1020 4.2.4.4. Re-origination 1022 A node must re-originate policy-guided prefixes and retransmit them. 1023 The node has its database of southbound policy-guided prefixes to 1024 send in its S-TIE and its database of northbound policy-guided 1025 prefixes to send in its N-TIE. 1027 Of course, a leaf does not need to re-originate southbound policy- 1028 guided prefixes. 1030 4.2.4.5. Overlap with Disaggregated Prefixes 1032 PGPs may overlap with prefixes introduced by automatic de- 1033 aggregation. The topic is under further discussion.
The break in 1034 connectivity that leads to infeasibility of a PGP is mirrored in 1035 adjacency tear-down and according removal of such PGPs. 1036 Nevertheless, the underlying link-state flooding will likely 1037 react significantly faster than a hop-by-hop redistribution and 1038 with that the preference for PGPs may cause intermittent black-holes. 1040 4.2.5. Reachability Computation 1042 A node has three sources of relevant information. A node knows the 1043 full topology south from the received N-TIEs. A node has the set of 1044 prefixes with associated distances and bandwidths from received 1045 S-TIEs. A node can also have a set of PGPs. 1047 To compute reachability, a node conceptually runs a northbound and a 1048 southbound SPF. We call these N-SPF and S-SPF. 1050 Since neither computation can "loop" (with due considerations given 1051 to PGPs), it is possible to compute non-equal-cost or even k-shortest 1052 paths [EPPSTEIN] and "saturate" the fabric to the extent desired. 1054 4.2.5.1. Northbound SPF 1056 N-SPF uses northbound and east-west adjacencies in North Node TIEs 1057 when progressing Dijkstra. Observe that this is really just a one-hop 1058 variety since South Node TIEs are not re-flooded southbound 1059 beyond a single level (or east-west) and with that the computation 1060 cannot progress beyond adjacent nodes. 1062 A default route found when crossing an E-W link is used IIF 1064 1. the node itself does NOT have any northbound adjacencies AND 1066 2. the adjacent node has one or more northbound adjacencies 1068 This rule forms a "one-hop default route split-horizon" and prevents 1069 looping over default routes while allowing for "one-hop protection" 1070 of nodes that lost all northbound adjacencies. 1072 Other south prefixes found when crossing an E-W link MAY be used IIF 1074 1. no north neighbors are advertising same or supersuming prefix AND 1076 2. the node does not originate a supersuming prefix itself. 1078 i.e.
the E-W link can be used as the gateway of last resort for a 1079 specific prefix only. Using south prefixes across an E-W link can be 1080 beneficial e.g. for automatic de-aggregation in pathological fabric 1081 partitioning scenarios. 1083 A detailed example can be found in Section 5.4. 1085 For N-SPF we are using the South Node TIEs to find according 1086 adjacencies to verify backlink connectivity. Just as in the case of IS- 1087 IS or OSPF, two unidirectional links are associated together to 1088 confirm bidirectional connectivity. 1090 4.2.5.2. Southbound SPF 1092 S-SPF uses only the southbound adjacencies in the south node TIEs, 1093 i.e. progresses towards nodes at lower levels. Observe that E-W 1094 adjacencies are NEVER used in the computation. This enforces the 1095 requirement that a packet traversing in a southbound direction must 1096 never change its direction. 1098 S-SPF uses northbound adjacencies in north node TIEs to verify 1099 backlink connectivity. 1101 4.2.5.3. East-West Forwarding Within a Level 1103 Ultimately, it should be observed that in the presence of a "ring" of E-W 1104 links in a level neither SPF will provide a "ring protection" scheme 1105 since such a computation would necessarily have to deal with breaking 1106 "loops" in the generic Dijkstra sense; an application for which RIFT 1107 is not intended. It is outside the scope of this document how an 1108 underlay can be used to provide full-mesh connectivity between 1109 nodes in the same layer that would allow for N-SPF to provide 1110 protection for a single node losing all its northbound adjacencies 1111 (as long as any of the other nodes in the level are northbound 1112 connected). 1114 Using south prefixes over horizontal links is optional and can 1115 protect against pathological fabric partitioning cases that leave 1116 only paths to destinations that would necessitate multiple changes of 1117 forwarding direction between north and south. 1119 4.2.6.
Attaching Prefixes 1121 After the SPF is run, it is necessary to attach the according prefixes. 1122 For S-SPF, prefixes from an N-TIE are attached to the originating 1123 node with that node's next-hop set and a distance equal to the 1124 prefix's cost plus the node's minimized path distance. The RIFT 1125 route database, a set of (prefix, type=spf, path_distance, next-hop 1126 set), accumulates these results. Obviously, the prefix retains its 1127 type which is used to tie-break between the same prefix advertised 1128 with different types. 1130 In the case of N-SPF, prefixes from each S-TIE also need to be added to 1131 the RIFT route database. The N-SPF is really just a stub so the 1132 computing node simply needs to determine, for each prefix in an S-TIE 1133 that originated from an adjacent node, what next-hops to use to reach 1134 that node. Since there may be parallel links, the next-hops to use 1135 can be a set; presence of the computing node in the associated Node 1136 S-TIE is sufficient to verify that at least one link has 1137 bidirectional connectivity. The set of minimum cost next-hops from 1138 the computing node X to the originating adjacent node is determined. 1140 Each prefix has its cost adjusted before being added into the RIFT 1141 route database. The cost of the prefix is set to the cost received 1142 plus the cost of the minimum cost next-hop to that neighbor. Then 1143 each prefix can be added into the RIFT route database with the 1144 next_hop_set; ties are broken based upon type first and then 1145 distance. RIFT route preferences are normalized by the according 1146 thrift model type.
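The cost adjustment and equal-cost merging described in this section can be sketched as follows; this is a non-normative Python sketch with hypothetical names (`attach_prefix`, `route_db`) and data layout, and the type-based tie-breaking is omitted for brevity:

```python
def attach_prefix(route_db, prefix, received_cost, links_to_originator):
    """Attach one prefix from an adjacent node's S-TIE at computing node X.

    links_to_originator maps link -> cost. The prefix cost becomes the
    received cost plus the minimum-cost link to the originator; a lower
    cost replaces the stored route and an equal cost merges the
    next-hop sets (ECMP)."""
    min_cost = min(links_to_originator.values())
    next_hops = {l for l, c in links_to_originator.items() if c == min_cost}
    cost = received_cost + min_cost
    if prefix not in route_db or cost < route_db[prefix][0]:
        route_db[prefix] = (cost, next_hops)        # first or better path
    elif cost == route_db[prefix][0]:
        # equal cost: merge next-hop sets for ECMP
        route_db[prefix] = (cost, route_db[prefix][1] | next_hops)
    return route_db
```

For instance, attaching `0/0` with received cost 1 over links of cost 1 and 2 stores the prefix at cost 2 via the cost-1 link only; a second equal-cost attachment merges its next-hops into the stored set.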
1148 An exemplary implementation for node X follows: 1150 for each S-TIE 1151 if S-TIE.layer > X.layer 1152 next_hop_set = set of minimum cost links to the S-TIE.originator 1153 next_hop_cost = minimum cost link to S-TIE.originator 1154 end if 1155 for each prefix P in the S-TIE 1156 P.cost = P.cost + next_hop_cost 1157 if P not in route_database: 1158 add (P, type=DistVector, P.cost, next_hop_set) to route_database 1159 end if 1160 if (P in route_database) and 1161 (route_database[P].type is not PolicyGuided): 1162 if route_database[P].cost > P.cost: 1163 update route_database[P] with (P, DistVector, P.cost, next_hop_set) 1164 else if route_database[P].cost == P.cost 1165 update route_database[P] with (P, DistVector, P.cost, 1166 merge(next_hop_set, route_database[P].next_hop_set)) 1167 else 1168 // Not preferred route so ignore 1169 end if 1170 end if 1171 end for 1172 end for 1174 Figure 3: Adding Routes from S-TIE Prefixes 1176 4.2.7. Attaching Policy-Guided Prefixes 1178 Each policy-guided prefix P has its cost and next_hop_set already 1179 stored in the associated database, as specified in Section 4.2.4.3; 1180 the cost stored for the PGP is already updated to consider the 1181 cost of the link to the advertising neighbor. By definition, a 1182 policy-guided prefix is preferred to a regular prefix. 1184 for each policy-guided prefix P: 1185 if P not in route_database: 1186 add (P, type=PolicyGuided, P.cost, next_hop_set) 1187 end if 1188 if P in route_database: 1189 if (route_database[P].type is not PolicyGuided) or 1190 (route_database[P].cost > P.cost): 1191 update route_database[P] with (P, PolicyGuided, P.cost, next_hop_set) 1192 else if route_database[P].cost == P.cost 1193 update route_database[P] with (P, PolicyGuided, P.cost, 1194 merge(next_hop_set, route_database[P].next_hop_set)) 1195 else 1196 // Not preferred route so ignore 1197 end if 1198 end if 1199 end for 1201 Figure 4: Adding Routes from Policy-Guided Prefixes 1203 4.2.8.
Automatic Disaggregation on Link & Node Failures 1205 Under normal circumstances, a node's S-TIEs contain just the 1206 adjacencies, a default route and policy-guided prefixes. However, if 1207 a node detects that its default IP prefix covers one or more prefixes 1208 that are reachable through it but not through one or more other nodes 1209 at the same level, then it MUST explicitly advertise those prefixes 1210 in an S-TIE. Otherwise, some percentage of the northbound traffic 1211 for those prefixes would be sent to nodes without according 1212 reachability, causing it to be black-holed. Even when not black- 1213 holing, the resulting forwarding could 'backhaul' packets through the 1214 higher level spines, clearly an undesirable condition affecting the 1215 blocking probabilities of the fabric. 1217 We refer to the process of advertising additional prefixes as 'de- 1218 aggregation' or 'dis-aggregation'. 1220 A node determines the set of prefixes needing de-aggregation using 1221 the following steps: 1223 1. A DAG computation in the southern direction is performed first, 1224 i.e. the N-TIEs are used to find all of the prefixes it can reach and 1225 the set of next-hops in the lower level for each. Such a 1226 computation can be easily performed on a fat tree by e.g. setting 1227 all link costs in the southern direction to 1 and all northern 1228 directions to infinity. We term the set of those prefixes |R, and 1229 for each prefix, r, in |R, we define its set of next-hops to 1230 be |H(r). Observe that policy-guided prefixes are NOT affected 1231 since their scope is controlled by configuration. 1233 2. The node uses reflected S-TIEs to find all nodes at the same 1234 level in the same PoD and the set of southbound adjacencies for 1235 each. The set of nodes at the same level is termed |N and for 1236 each node, n, in |N, we define its set of southbound adjacencies 1237 to be |A(n). 1239 3.
For a given r, if the intersection of |H(r) and |A(n), for any n, 1240 is null then that prefix r must be explicitly advertised by the 1241 node in an S-TIE. 1243 4. An identical set of de-aggregated prefixes is flooded on each of the 1244 node's southbound adjacencies. In accordance with the normal 1245 flooding rules for an S-TIE, a node at the lower level that 1246 receives this S-TIE will not propagate it south-bound. Neither 1247 is it necessary for the receiving node to reflect the 1248 disaggregated prefixes back over its adjacencies to nodes at the 1249 level from which it was received. 1251 To summarize the above in simplest terms: if a node detects that its 1252 default route encompasses prefixes for which one of the other nodes 1253 in its level has no possible next-hops in the level below, it has to 1254 disaggregate those prefixes to prevent black-holing or suboptimal routing. Hence 1255 a node X needs to determine if it can reach a different set of south 1256 neighbors than other nodes at the same level, which are connected to 1257 it via at least one common south or east-west neighbor. If it can, 1258 then prefix disaggregation may be required. If it can't, then no 1259 prefix disaggregation is needed. An example of disaggregation is 1260 provided in Section 5.3. 1262 A possible algorithm is described last: 1264 1. Create partial_neighbors = (empty), a set of neighbors with 1265 partial connectivity to the node X's layer from X's perspective. 1266 Each entry consists of a south neighbor of X and a list of nodes 1267 at X.layer that can't reach that neighbor. 1269 2. A node X determines its set of southbound neighbors 1270 X.south_neighbors. 1272 3.
For each S-TIE that X holds which was originated from a node Y at 1273 X.layer, if Y.south_neighbors is not the same as 1274 X.south_neighbors but the nodes share at least one southern 1275 neighbor, for each neighbor N in X.south_neighbors but not in 1276 Y.south_neighbors, add (N, (Y)) to partial_neighbors if N isn't 1277 there or add Y to the list for N. 1279 4. If partial_neighbors is empty, then node X does not need to 1280 disaggregate any prefixes. If node X is advertising 1281 disaggregated prefixes in its S-TIE, X SHOULD remove them and re- 1282 advertise its according S-TIEs. 1284 A node X computes its SPF based upon the received N-TIEs. This 1285 results in a set of routes, each categorized by (prefix, 1286 path_distance, next-hop-set). Alternately, for clarity in the 1287 following procedure, these can be organized by next-hop-set as ( 1288 (next-hops), {(prefix, path_distance)}). If partial_neighbors isn't 1289 empty, then the following procedure describes how to identify 1290 prefixes to disaggregate. 1292 disaggregated_prefixes = { empty } 1293 nodes_same_layer = { empty } 1294 for each S-TIE 1295 if (S-TIE.layer == X.layer and 1296 S-TIE.originator shares at least one S-neighbor with X) 1297 add S-TIE.originator to nodes_same_layer 1298 end if 1299 end for 1301 for each next-hop-set NHS 1302 isolated_nodes = nodes_same_layer 1303 for each NH in NHS 1304 if NH in partial_neighbors 1305 isolated_nodes = intersection(isolated_nodes, 1306 partial_neighbors[NH].nodes) 1307 end if 1308 end for 1310 if isolated_nodes is not empty 1311 for each prefix using NHS 1312 add (prefix, distance) to disaggregated_prefixes 1313 end for 1314 end if 1315 end for 1317 copy disaggregated_prefixes to X's S-TIE 1318 if X's S-TIE is different 1319 schedule S-TIE for flooding 1320 end if 1322 Figure 5: Computation to Disaggregate Prefixes 1324 Each disaggregated prefix is sent with the accurate path_distance. 1325 This allows a node to send the same S-TIE to each south neighbor.
1326 The south neighbor which is connected to that prefix will thus have a 1327 shorter path. 1329 Finally, to summarize the less obvious points partially omitted in 1330 the algorithms to keep them more tractable: 1332 1. all neighbor relationships MUST perform backlink checks. 1334 2. overload bits as introduced in Section 4.3.1 have to be respected 1335 during the computation. 1337 3. all the lower level nodes are flooded the same disaggregated 1338 prefixes since we don't want to build an S-TIE per node and 1339 complicate things unnecessarily. The PoD containing the prefix 1340 will prefer southbound anyway. 1342 4. disaggregated prefixes do NOT have to propagate to lower levels. 1343 With that the disturbance in terms of new flooding is contained 1344 to a single level experiencing failures only. 1346 5. disaggregated prefix S-TIEs are not "reflected" by the lower 1347 layer, i.e. nodes within the same level do NOT need to be aware 1348 which node computed the need for disaggregation. 1350 6. The fabric still supports maximum load-balancing properties 1351 while not sending traffic northbound unless necessary. 1353 4.2.9. Optional Autoconfiguration 1355 Each RIFT node can optionally operate in zero touch provisioning 1356 (ZTP) mode, i.e. a node will fully configure itself after being 1357 attached to the topology. This section describes the necessary 1358 concepts and procedures. 1360 4.2.9.1. Terminology 1362 Automatic Level Derivation: Procedures which allow nodes without a 1363 configured level to derive it automatically. Only applied if 1364 CONFIGURED_LEVEL is undefined. 1366 UNDEFINED_LEVEL: An imaginary value that indicates that the level 1367 has not been determined and has not been configured. 1369 LEAF_ONLY: An optional configuration flag that can be configured on 1370 a node to make sure it never leaves the "bottom of the hierarchy". 1371 SUPERSPINE_FLAG and CONFIGURED_LEVEL cannot be defined at the same 1372 time as this flag.
It implies a CONFIGURED_LEVEL value of 0. 1374 CONFIGURED_LEVEL: A level value provided manually. When this is 1375 defined (i.e. not UNDEFINED_LEVEL) a node is not participating in 1376 ZTP. SUPERSPINE_FLAG MUST NOT be set when this value is defined. 1377 LEAF_ONLY can be set only if this value is undefined or set to 0. 1379 DERIVED_LEVEL: Level value computed via automatic level derivation 1380 when CONFIGURED_LEVEL remains at 1381 UNDEFINED_LEVEL. 1383 LEAF_2_LEAF: An optional configuration flag that can be configured 1384 on a node to make sure it supports procedures defined in 1385 Section 4.3.5. SUPERSPINE_FLAG cannot be defined at the same time 1386 as this flag. It implies LEAF_ONLY and the according 1387 restrictions. 1389 LEVEL_VALUE: In the ZTP case the original definition of "level" in 1390 Section 2.1 is both extended and relaxed. First, level is now 1391 defined as LEVEL_VALUE, i.e. CONFIGURED_LEVEL if defined, 1392 otherwise DERIVED_LEVEL. Second, it is 1393 possible for nodes more than one level apart to form 1394 adjacencies if any of the nodes is at least LEAF_ONLY. 1396 Valid Level Offer: A neighbor's level received on a valid LIE (i.e. 1397 passing all checks for adjacency formation while disregarding all 1398 clauses involving level values) within the holdtime interval 1399 advertised by the last received LIE. 1401 Highest Available Level (HAL): Highest defined level value seen among 1402 all valid level offers received. 1404 SUPERSPINE_FLAG: Configuration flag provided to all superspines. 1405 LEAF_FLAG and CONFIGURED_LEVEL cannot be defined at the same time 1406 as this flag. It implies a CONFIGURED_LEVEL value of 64. In fact, 1407 it is basically a shortcut for configuring the same level at all 1408 superspine nodes, which is unavoidable since an initial 'seed' is 1409 needed for other ZTP nodes to derive their level in the topology. 1411 4.2.9.2.
Automatic SystemID Selection 1413 RIFT identifies each node via a SystemID which is 64 bits wide. It 1414 is relatively simple to derive a value that is, for all practical 1415 purposes, collision free for each node on startup. As simple examples, 1416 either system MAC + 2 random bytes can be used or an IPv4/IPv6 router 1417 ID interface address. The router MUST ensure that such an identifier 1418 does not change frequently (at least not without sending all its 1419 TIEs with fairly short lifetime) since otherwise the network may be 1420 left with large amounts of stale TIEs in other nodes, albeit this is 1421 not necessarily a serious problem if the procedures suggested in 1422 Section 7 are implemented. 1424 4.2.9.3. Generic Fabric Example 1426 ZTP forces us to think about miscabled or unusually cabled fabrics and 1427 how such a topology can be forced into the "lattice" structure which a 1428 fabric represents (with further restrictions). Let us consider a 1429 necessary and sufficient physical cabling in Figure 6. We assume all 1430 nodes are in the same PoD. 1432 . +---+ 1433 . | A | s = SUPERSPINE_FLAG 1434 . | S | l = LEAF_ONLY 1435 . ++-++ l2l = LEAF_2_LEAF 1436 . | | 1437 . +--+ +--+ 1438 . | | 1439 . +--++ ++--+ 1440 . | E | | F | 1441 . | +-+ | +-----------+ 1442 . ++--+ | ++-++ | 1443 . | | | | | 1444 . | +-------+ | | 1445 . | | | | | 1446 . | | +----+ | | 1447 . | | | | | 1448 . ++-++ ++-++ | 1449 . | I +-----+ J | | 1450 . | | | +-+ | 1451 . ++-++ +--++ | | 1452 . | | | | | 1453 . +---------+ | +------+ | 1454 . | | | | | 1455 . +-----------------+ | | 1456 . | | | | | 1457 . ++-++ ++-++ | 1458 . | X +-----+ Y +-+ 1459 . |l2l| | l | 1460 . +---+ +---+ 1462 Figure 6: Generic ZTP Cabling Considerations 1464 First, we need to anchor the "top" of the cabling and that's what the 1465 SUPERSPINE_FLAG at node A is for.
Then things look smooth until we 1466 have to decide whether node Y is at the same level as X, or at the 1467 same level as I and J and consequently X is south of it. This is 1468 unresolvable here until we "nail down the bottom" of the topology, 1469 which is what the leaf flags are for. We will then see whether Y 1470 chooses to form adjacencies to F or to I and J, respectively. 1472 4.2.9.4. Level Determination Procedure 1474 A node starting up with UNDEFINED_LEVEL (i.e. without a 1475 CONFIGURED_LEVEL or any leaf or superspine flag) MUST follow these 1476 additional procedures: 1478 1. It advertises its LEVEL_VALUE on all LIEs (observe that this can 1479 be UNDEFINED_LEVEL which in terms of the schema is simply an 1480 omitted optional value). 1482 2. It chooses on an ongoing basis from all received valid level 1483 offers the value of MAX(HAL-1,0) as its DERIVED_LEVEL. The node 1484 then starts to advertise this derived level. 1486 3. A node that lost all adjacencies with the HAL value MUST revert to 1487 UNDEFINED_LEVEL as its level and hold down computation of a new 1488 DERIVED_LEVEL for a short period of time. 1490 4. A node MUST reset any adjacency that has changed the level it is 1491 offering and is in three-way state. 1493 5. A node that changed its defined level value MUST readvertise its 1494 own TIEs (since the new `PacketHeader` will contain a different 1495 level than before). The sequence number of each TIE MUST be 1496 increased. 1498 A node starting with a LEVEL_VALUE of 0 (i.e. it assumes a leaf 1499 function or has a CONFIGURED_LEVEL of 0) MUST follow these additional 1500 procedures: 1502 1. It computes HAL per the procedures above but does NOT use it to 1503 compute DERIVED_LEVEL. HAL is used to limit adjacency formation 1504 per Section 4.2.2. 1506 Precise finite state machines will be provided in later versions of 1507 this specification. 1509 4.2.9.5.
Resulting Topologies 1511 The procedures defined in Section 4.2.9.4 will lead to the RIFT 1512 topology and levels depicted in Figure 7. 1514 . +---+ 1515 . | As| 1516 . | 64| 1517 . ++-++ 1518 . | | 1519 . +--+ +--+ 1520 . | | 1521 . +--++ ++--+ 1522 . | E | | F | 1523 . | 63+-+ | 63+-----------+ 1524 . ++--+ | ++-++ | 1525 . | | | | | 1526 . | +-------+ | | 1527 . | | | | | 1528 . | | +----+ | | 1529 . | | | | | 1530 . ++-++ ++-++ | 1531 . | I +-----+ J | | 1532 . | 62| | 62| | 1533 . ++--+ +--++ | 1534 . | | | 1535 . +---------+ | | 1536 . | | | 1537 . ++-++ +---+ | 1538 . | X | | Y +-+ 1539 . | 0 | | 0 | 1540 . +---+ +---+ 1542 Figure 7: Generic ZTP Topology Autoconfigured 1544 If we imagine the LEAF_ONLY restriction on Y removed, however, the 1545 outcome would be very different and result in Figure 8. This 1546 demonstrates that autoconfiguration prevents miscabling 1547 detection and with that can lead to undesirable effects when leafs 1548 are not "nailed down" and are arbitrarily cabled. 1550 . +---+ 1551 . | As| 1552 . | 64| 1553 . ++-++ 1554 . | | 1555 . +--+ +--+ 1556 . | | 1557 . +--++ ++--+ 1558 . | E | | F | 1559 . | 63+-+ | 63+-------+ 1560 . ++--+ | ++-++ | 1561 . | | | | | 1562 . | +-------+ | | 1563 . | | | | | 1564 . | | +----+ | | 1565 . | | | | | 1566 . ++-++ ++-++ +-+-+ 1567 . | I +-----+ J +-----+ Y | 1568 . | 62| | 62| | 62| 1569 . ++-++ +--++ ++-++ 1570 . | | | | | 1571 . | +-----------------+ | 1572 . | | | 1573 . +---------+ | | 1574 . | | | 1575 . ++-++ | 1576 . | X +--------+ 1577 . | 0 | 1578 . +---+ 1580 Figure 8: Generic ZTP Topology Autoconfigured 1582 4.2.10. Stability Considerations 1584 The autoconfiguration mechanism computes a global maximum of levels 1585 by diffusion. The achieved equilibrium can be disturbed massively by 1586 all nodes with the highest level either leaving or entering the domain 1587 (with some finer distinctions not explained further).
It is 1588 therefore recommended that each node is multi-homed towards nodes 1589 with respective HAL offerings. Fortunately, this is the natural 1590 state of things for the topology variants considered in RIFT. 1592 4.3. Further Mechanisms 1594 4.3.1. Overload Bit 1596 The Overload Bit MUST be respected in all according reachability 1597 computations. A node with the overload bit set SHOULD NOT advertise any 1598 reachability prefixes southbound except locally hosted ones. 1600 The leaf node SHOULD set the 'overload' bit on its node TIEs, since 1601 if the spine nodes were to forward traffic not meant for the local 1602 node, the leaf node does not have the topology information to prevent 1603 a routing/forwarding loop. 1605 4.3.2. Optimized Route Computation on Leafs 1607 Since leafs see only "one hop away", they do not need to run a 1608 full SPF but can simply gather prefix candidates from their neighbors 1609 and build the according routing table. 1611 A leaf will have no N-TIEs except its own and optionally from its 1612 east-west neighbors. A leaf will have S-TIEs from its neighbors. 1614 Instead of creating a network graph from its N-TIEs and neighbors' 1615 S-TIEs and then running an SPF, a leaf node can simply compute the 1616 minimum cost and next_hop_set to each leaf neighbor by examining its 1617 local interfaces, determining bi-directionality from the associated 1618 N-TIE, and deriving the neighbor's next_hop_set and cost from 1619 the minimum cost local interfaces to that neighbor. 1621 Then a leaf attaches prefixes as in Section 4.2.6 as well as the 1622 policy-guided prefixes as in Section 4.2.7. 1624 4.3.3. Key/Value Store 1626 The protocol supports a southbound distribution of key-value pairs 1627 that can be used to e.g. distribute configuration information during 1628 topology bring-up. The KV TIEs (which are always S-TIEs) can arrive 1629 from multiple nodes; tie-breaking per key uses the following 1630 rules: 1632 1.
Only KV TIEs originated by a node to which the receiver has an 1633 adjacency are considered. 1635 2. Within all valid KV S-TIEs containing the key, the value of the 1636 S-TIE with the highest level and, within the same level, the highest 1637 originator ID is preferred. 1639 Observe that if a node goes down, the nodes south of it lose their 1640 adjacencies to it; with that the according KVs will be disregarded and, on 1641 tie-break changes, new KVs re-advertised to prevent stale information 1642 being used by nodes further south. KV information is not the result of an 1643 independent computation at every node but of a diffused computation. 1645 4.3.4. Interactions with BFD 1647 RIFT MAY incorporate BFD [RFC5881] to react quickly to link failures. 1648 In such a case the following procedures are introduced: 1650 After RIFT 3-way hello adjacency convergence a BFD session MAY be 1651 formed automatically between the RIFT endpoints without further 1652 configuration. 1654 In case RIFT loses the 3-way hello adjacency, the BFD session should 1655 be brought down until the 3-way adjacency is formed again. 1657 In case an established BFD session goes Down after it was Up, the RIFT 1658 adjacency should be re-initialized from scratch. 1660 In case of parallel links between nodes each link may run its own 1661 independent BFD session. 1663 In case RIFT changes link identifiers both the hello as well as 1664 the BFD sessions will be brought down and back up again. 1666 4.3.5. Leaf to Leaf Procedures 1668 RIFT can optionally allow special leaf East-West adjacencies under an 1669 additional set of rules.
The leaf supporting those procedures MUST: 1671 advertise the LEAF_2_LEAF flag in its node capabilities AND 1673 set the overload bit on all its node TIEs AND 1675 flood only its own north and south TIEs over E-W leaf 1676 adjacencies AND 1678 always use the E-W leaf adjacency in both north as well as south 1679 computation AND 1681 install a discard route for any advertised aggregate in its 1682 TIEs AND 1684 never form southbound adjacencies. 1686 This will allow the E-W leaf nodes to exchange traffic strictly for 1687 the prefixes advertised in each other's north prefix TIEs (since the 1688 southbound computation will find the reverse direction in the other 1689 node's TIE and install its north prefixes). 1691 4.3.6. Other End-to-End Services 1693 Losing full, flat topology information at every node will have an 1694 impact on some of the end-to-end network services. This is the price 1695 paid for minimal disturbance in case of failures and reduced flooding 1696 and memory requirements on nodes further south in the level hierarchy. 1698 4.3.7. Address Family and Multi Topology Considerations 1700 Multi-Topology (MT)[RFC5120] and Multi-Instance (MI)[RFC6822] are used 1701 today in link-state routing protocols to support several domains on 1702 the same physical topology. RIFT supports this capability by 1703 carrying transport ports in the LIE protocol exchanges. Multiplexing 1704 of LIEs can be achieved by either choosing varying multicast 1705 addresses or ports on the same address. 1707 BFD interactions in Section 4.3.4 are implementation dependent when 1708 multiple RIFT instances run on the same link. 1710 4.3.8. Reachability of Internal Nodes in the Fabric 1712 RIFT does not precondition that its nodes have reachable addresses, 1713 albeit for operational purposes this is clearly desirable. Under 1714 normal operating conditions this can be easily achieved by e.g. 1715 injecting the node's loopback address into North Prefix TIEs.
1717 Things get more interesting in case a node loses all its northbound 1718 adjacencies but is not at the top of the fabric. In such a case a 1719 node that detects that some other members at its level are 1720 advertising northbound adjacencies MAY inject its loopback address 1721 into a southbound PGP TIE and become reachable "from the south" that 1722 way. Further, a solution may be implemented where, based on e.g. a 1723 "well known" community, such a southbound PGP is reflected at level 0 1724 and advertised as a northbound PGP again to allow for "reachability 1725 from the north" at the cost of additional flooding. 1727 4.3.9. One-Hop Healing of Levels with East-West Links 1729 Based on the rules defined in Section 4.2.5, Section 4.2.3.7 and 1730 given the presence of E-W links, RIFT can provide one-hop protection for 1731 nodes that lost all their northbound links or in other complex link 1732 set failure scenarios. Section 5.4 explains the resulting behavior 1733 based on one such example. 1735 5. Examples 1737 5.1. Normal Operation 1739 This section describes RIFT deployment in the example topology 1740 without any node or link failures. We disregard flooding reduction 1741 for simplicity's sake. 1743 As a first step, the following bi-directional adjacencies will be 1744 created (and any other links that do not fulfill the LIE rules in 1745 Section 4.2.2 disregarded): 1747 1. Spine 21 (PoD 0) to Node 111, Node 112, Node 121, and Node 122 1749 2. Spine 22 (PoD 0) to Node 111, Node 112, Node 121, and Node 122 1751 3. Node 111 to Leaf 111, Leaf 112 1753 4. Node 112 to Leaf 111, Leaf 112 1755 5. Node 121 to Leaf 121, Leaf 122 1757 6. Node 122 to Leaf 121, Leaf 122 1759 Consequently, N-TIEs would be originated by Node 111 and Node 112 and 1760 each set would be sent to both Spine 21 and Spine 22.
N-TIEs also 1761 would be originated by Leaf 111 (w/ Prefix 111) and Leaf 112 (w/ 1762 Prefix 112 and the multi-homed prefix) and each set would be sent to 1763 Node 111 and Node 112. Node 111 and Node 112 would then flood these 1764 N-TIEs to Spine 21 and Spine 22. 1766 Similarly, N-TIEs would be originated by Node 121 and Node 122 and 1767 each set would be sent to both Spine 21 and Spine 22. N-TIEs also 1768 would be originated by Leaf 121 (w/ Prefix 121 and the multi-homed 1769 prefix) and Leaf 122 (w/ Prefix 122) and each set would be sent to 1770 Node 121 and Node 122. Node 121 and Node 122 would then flood these 1771 N-TIEs to Spine 21 and Spine 22. 1773 At this point both Spine 21 and Spine 22, as well as any controller 1774 to which they are connected, would have the complete network 1775 topology. At the same time, Node 111/112/121/122 hold only the 1776 N-TIEs of level 0 of their respective PoD. Leafs hold only their own 1777 N-TIEs. 1779 S-TIEs with adjacencies and a default IP prefix would then be 1780 originated by Spine 21 and Spine 22 and each would be flooded to Node 1781 111, Node 112, Node 121, and Node 122. Node 111, Node 112, Node 121, 1782 and Node 122 would each send the S-TIE from Spine 21 to Spine 22 and 1783 the S-TIE from Spine 22 to Spine 21. (S-TIEs are reflected up to the 1784 level from which they are received but they are NOT propagated 1785 southbound.) 1787 An S-TIE with a default IP prefix would be originated by Node 111 and 1788 Node 112 and each would be sent to Leaf 111 and Leaf 112. Leaf 111 1789 and Leaf 112 would each send the S-TIE from Node 111 to Node 112 and 1790 the S-TIE from Node 112 to Node 111. 1792 Similarly, an S-TIE with a default IP prefix would be originated by 1793 Node 121 and Node 122 and each would be sent to Leaf 121 and Leaf 1794 122. Leaf 121 and Leaf 122 would each send the S-TIE from Node 121 1795 to Node 122 and the S-TIE from Node 122 to Node 121.
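The re-flooding behavior narrated above can be sketched minimally. This is a simplification covering only the rules stated in this section (N-TIEs keep propagating northbound; received S-TIEs are reflected back up but never sent southbound), not the protocol's complete flooding-scope rules, and all names are hypothetical:

```python
def reflood_targets(tie_dir, received_from, north, south):
    """tie_dir: 'N' or 'S'; received_from: neighbor the TIE arrived on;
    north/south: sets of northbound/southbound neighbors of this node."""
    if tie_dir == "N" and received_from in south:
        # N-TIEs keep propagating northbound, as Node 111/112 do above.
        return set(north)
    if tie_dir == "S" and received_from in north:
        # S-TIEs are reflected up to the level they came from (so the
        # originator's peers see them) but are NOT propagated southbound.
        return north - {received_from}
    return set()
```

For instance, Leaf 111 receiving Node 111's S-TIE would re-send it only to Node 112, matching the narrative above.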
At this point 1796 IP connectivity with maximum possible ECMP has been established 1797 between the leafs while constraining the amount of information held 1798 by each node to the minimum necessary for normal operation and 1799 dealing with failures. 1801 5.2. Leaf Link Failure 1803 . | | | | 1804 .+-+---+-+ +-+---+-+ 1805 .| | | | 1806 .|Node111| |Node112| 1807 .+-+---+-+ ++----+-+ 1808 . | | | | 1809 . | +---------------+ X 1810 . | | | X Failure 1811 . | +-------------+ | X 1812 . | | | | 1813 .+-+---+-+ +--+--+-+ 1814 .| | | | 1815 .|Leaf111| |Leaf112| 1816 .+-------+ +-------+ 1817 . + + 1818 . Prefix111 Prefix112 1820 Figure 9: Single Leaf link failure 1822 In case of a failing leaf link between node 112 and leaf 112 the 1823 link-state information will cause re-computation of the necessary SPF 1824 and the higher levels will stop forwarding towards prefix 112 through 1825 node 112. Only nodes 111 and 112, as well as both spines, will see 1826 control traffic. Leaf 111 will receive a new S-TIE from node 112 and 1827 reflect it back to node 111. Node 111 will de-aggregate prefix 111 and 1828 prefix 112 but we will not describe it further here since de- 1829 aggregation is emphasized in the next example. It is worth observing 1830 however in this example that if leaf 111 kept forwarding 1831 traffic towards prefix 112 using the advertised south-bound default 1832 of node 112, the traffic would end up on spine 21 and spine 22 and 1833 cross back into pod 1 using node 111. This is arguably not as bad as 1834 the black-holing present in the next example but clearly undesirable. 1835 Fortunately, de-aggregation prevents this type of behavior except for 1836 a transitory period of time. 1838 5.3. Partitioned Fabric 1840 . +--------+ +--------+ S-TIE of Spine21 1841 . | | | | received by 1842 . |Spine 21| |Spine 22| reflection of 1843 . ++-+--+-++ ++-+--+-++ Nodes 112 and 111 1844 . | | | | | | | | 1845 . | | | | | | | 0/0 1846 . | | | | | | | | 1847 .
| | | | | | | | 1848 . +--------------+ | +--- XXXXXX + | | | +---------------+ 1849 . | | | | | | | | 1850 . | +-----------------------------+ | | | 1851 . 0/0 | | | | | | | 1852 . | 0/0 0/0 +- XXXXXXXXXXXXXXXXXXXXXXXXX -+ | 1853 . | 1.1/16 | | | | | | 1854 . | | +-+ +-0/0-----------+ | | 1855 . | | | 1.1/16 | | | | 1856 .+-+----++ +-+-----+ ++-----0/0 ++----0/0 1857 .| | | | | 1.1/16 | 1.1/16 1858 .|Node111| |Node112| |Node121| |Node122| 1859 .+-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 1860 . | | | | | | | | 1861 . | +---------------+ | | +----------------+ | 1862 . | | | | | | | | 1863 . | +-------------+ | | | +--------------+ | | 1864 . | | | | | | | | 1865 .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 1866 .| | | | | | | | 1867 .|Leaf111| |Leaf112| |Leaf121| |Leaf122| 1868 .+-+-----+ ++------+ +-----+-+ +-+-----+ 1869 . + + + + 1870 . Prefix111 Prefix112 Prefix121 Prefix122 1871 . 1.1/16 1873 Figure 10: Fabric partition 1875 Figure 10 shows the arguably most catastrophic but also the most 1876 interesting case. Spine 21 is completely severed from access to 1877 Prefix 121 (we use 1.1/16 in the figure as an example) by a double link 1878 failure. However unlikely such a scenario may be, if left unresolved, forwarding from leaf 1879 111 and leaf 112 to prefix 121 would suffer 50% black-holing based on 1880 pure default route advertisements by spine 21 and spine 22. 1882 The mechanism used to resolve this scenario hinges on the 1883 distribution of the southbound representation by spine 21 that is 1884 reflected by node 111 and node 112 to spine 22. Spine 22, having 1885 computed reachability to all prefixes in the network, advertises, along with 1886 the default route, the ones that are reachable only via lower level 1887 neighbors to which spine 21 does not show an adjacency. That results 1888 in node 111 and node 112 obtaining a longest-prefix match to prefix 1889 121 which leads through spine 22 and prevents black-holing through 1890 spine 21 still advertising the 0/0 aggregate only.
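The check spine 22 performs can be sketched with the sets of Section 4.2.8 in a hypothetical minimal model, where h maps each prefix to its computed next-hop nodes (|H per prefix) and a_peer is the set of southern neighbors the partitioned spine still shows an adjacency to (|A):

```python
def prefixes_to_disaggregate(h, a_peer):
    """h: prefix -> set of next-hop nodes (|H per prefix);
    a_peer: southern adjacencies of the partitioned peer (|A).
    A prefix whose entire next-hop set is invisible to the peer must be
    advertised southbound alongside the default route."""
    return {prefix for prefix, hops in h.items() if hops.isdisjoint(a_peer)}
```

With the values listed at the end of this section this yields exactly Prefix 121 and Prefix 122.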
1892 The prefix 121 advertised by spine 22 does not have to be propagated 1893 further towards the leafs since they do not benefit from this information. 1894 Hence the amount of flooding is restricted to spine 21 reissuing its 1895 S-TIEs and the reflection of those by node 111 and node 112. The 1896 resulting SPF in spine 22 issues a new prefix S-TIE containing 1897 1.1/16. None of the leafs become aware of the changes and the 1898 failure is constrained strictly to the level that became partitioned. 1900 To finish with an example of the resulting sets computed using the 1901 notation introduced in Section 4.2.8, spine 22 constructs the 1902 following sets: 1904 |R = Prefix 111, Prefix 112, Prefix 121, Prefix 122 1906 |H (for r=Prefix 111) = Node 111, Node 112 1908 |H (for r=Prefix 112) = Node 111, Node 112 1910 |H (for r=Prefix 121) = Node 121, Node 122 1912 |H (for r=Prefix 122) = Node 121, Node 122 1914 |A (for Spine 21) = Node 111, Node 112 1916 With that and |H (for r=prefix 121) and |H (for r=prefix 122) being 1917 disjoint from |A (for spine 21), spine 22 will originate an S-TIE 1918 with prefix 121 and prefix 122 that is flooded to nodes 111, 112, 1919 121 and 122. 1921 5.4. Northbound Partitioned Router and Optional East-West Links 1923 . + + + 1924 . X N1 | N2 | N3 1925 . X | | 1926 .+--+----+ +--+----+ +--+-----+ 1927 .| |0/0> <0/0| |0/0> <0/0| | 1928 .| A01 +----------+ A02 +----------+ A03 | Level 1 1929 .++-+-+--+ ++--+--++ +---+-+-++ 1930 . | | | | | | | | | 1931 . | | +----------------------------------+ | | | 1932 . | | | | | | | | | 1933 . | +-------------+ | | | +--------------+ | 1934 . | | | | | | | | | 1935 . | +----------------+ | +-----------------+ | 1936 . | | | | | | | | | 1937 . | | +------------------------------------+ | | 1938 .
| | | | | | | | | 1939 .++-+-+--+ | +---+---+ | +-+---+-++ 1940 .| | +-+ +-+ | | 1941 .| L01 | | L02 | | L03 | Level 0 1942 .+-------+ +-------+ +--------+ 1944 Figure 11: North Partitioned Router 1946 Figure 11 shows a part of a fabric where level 1 is horizontally 1947 connected and A01 lost its only northbound adjacency. Based on the N-SPF 1948 rules in Section 4.2.5.1 A01 will compute northbound reachability by 1949 using the link A01 to A02 (whereas A02 will NOT use this link during 1950 N-SPF). Hence A01 will still advertise the default towards level 0 1951 and route unidirectionally using the horizontal link. Moreover, 1952 based on Section 4.3.8 it may advertise its loopback address as a south 1953 PGP to remain reachable "from the south" for operational purposes. 1954 This is necessary since A02 will NOT route towards A01 using the E-W 1955 link (doing otherwise may form routing loops). 1957 As a further consideration, the moment A02 loses link N2 the situation 1958 evolves again. A01 will have no more northbound reachability while 1959 still seeing A03 advertising northbound adjacencies in its south node 1960 TIE. With that it will stop advertising a default route due to 1961 Section 4.2.3.7. Moreover, A02 may now inject its loopback address 1962 as a south PGP. 1964 6. Implementation and Operation: Further Details 1966 6.1. Considerations for Leaf-Only Implementation 1968 Ideally RIFT can be stretched out to the lowest level in the IP 1969 fabric to integrate ToRs or even servers. Since those entities would 1970 run as leafs only, it is worth observing that a leaf-only version is 1971 significantly simpler to implement and requires far fewer resources: 1973 1. Under normal conditions, the leaf needs to support a multipath 1974 default route only. In the worst partitioning case it has to be 1975 capable of accommodating all the leaf routes in its own PoD to 1976 prevent black-holing. 1978 2.
Leaf nodes hold only their own N-TIEs and S-TIEs of the Level 1 nodes 1979 they are connected to, so overall they are few in number. 1981 3. A leaf node does not have to support flooding reduction and de- 1982 aggregation. 1984 4. Unless optional leaf-2-leaf procedures are desired, default route 1985 origination and S-TIE origination are unnecessary. 1987 6.2. Adaptations to Other Proposed Data Center Topologies 1989 . +-----+ +-----+ 1990 . | | | | 1991 .+-+ S0 | | S1 | 1992 .| ++---++ ++---++ 1993 .| | | | | 1994 .| | +------------+ | 1995 .| | | +------------+ | 1996 .| | | | | 1997 .| ++-+--+ +--+-++ 1998 .| | | | | 1999 .| | A0 | | A1 | 2000 .| +-+--++ ++---++ 2001 .| | | | | 2002 .| | +------------+ | 2003 .| | +-----------+ | | 2004 .| | | | | 2005 .| +-+-+-+ +--+-++ 2006 .+-+ | | | 2007 . | L0 | | L1 | 2008 . +-----+ +-----+ 2010 Figure 12: Level Shortcut 2012 Strictly speaking, RIFT is not limited to Clos variations only. The 2013 protocol preconditions only a sense of 'compass rose direction' 2014 achieved by configuration (or derivation) of levels and other 2015 topologies are possible within this framework. So, conceptually, one 2016 could include leaf to leaf links and even shortcuts between layers but 2017 certain requirements in Section 3 will not be met anymore. As an 2018 example, shortcutting levels as illustrated in Figure 12 will lead 2019 either to suboptimal routing when L0 sends traffic to L1 (since using 2020 S0's default route will lead to the traffic being sent back to A0 or 2021 A1) or the leafs need each other's routes installed to understand 2022 that only A0 and A1 should be used to talk to each other. 2024 Whether such modifications of topology constraints make sense is 2025 dependent on many technology variables and the exhaustive treatment 2026 of the topic is definitely outside the scope of this document. 2028 6.3.
Originating Non-Default Route Southbound 2030 Obviously, an implementation may choose to originate southbound a 2031 shorter prefix P' instead of a strict default route (as described in 2032 Section 4.2.3.7), but in such a scenario all addresses carried within 2033 the RIFT domain must be contained within P'. 2035 7. Security Considerations 2037 The protocol has provisions for nonces and can include authentication 2038 mechanisms in the future comparable to [RFC5709] and [RFC7987]. 2040 One can additionally consider attack vectors where a router may 2041 reboot many times while changing its system ID and pollute the 2042 network with many stale TIEs, or where TIEs are sent with very long 2043 lifetimes and not cleaned up when the routes vanish. Those attack 2044 vectors are not unique to RIFT. Given the large memory footprints 2045 available today those attacks should be relatively benign. Otherwise 2046 a node can implement a strategy of e.g. discarding contents of all 2047 TIEs of nodes that were not present in the SPF tree over a certain 2048 period of time. Since the protocol, like all modern link-state 2049 protocols, is self-stabilizing and will advertise the presence of 2050 such TIEs to its neighbors, they can be re-requested again if a 2051 computation finds that it sees an adjacency formed towards the system 2052 ID of the discarded TIEs. 2054 Section 4.2.9 presents many attack vectors in untrusted environments, 2055 starting with nodes that oscillate their level offers to the 2056 possibility of a node offering the highest available level value to 2057 put itself "on top of the lattice" and with that observing the whole 2058 topology. Session authentication mechanisms are necessary in 2059 environments where this is possible. 2061 8. Information Elements Schema 2063 This section introduces the schema for information elements. 2065 On schema changes that 2067 1. change field numbers or 2069 2. add new required fields or 2071 3. remove fields or 2073 4.
change lists into sets, unions into structures or 2075 5. change multiplicity of fields or 2077 6. change datatypes of any field or 2079 7. change the default value of any field 2081 the major version of the schema MUST increase. All other changes MUST 2082 increase the minor version within the same major version. 2084 The Thrift serializer/deserializer MUST NOT discard optional, unknown 2085 fields but preserve and serialize them again when re-flooding, whereas 2086 missing optional fields MAY be replaced with according default values 2087 if present. 2089 All signed integers, as forced by Thrift support, must be cast for 2090 internal purposes to equivalent unsigned values without discarding 2091 the signedness bit. An implementation SHOULD try to avoid using the 2092 signedness bit when generating values. 2094 The schema is normative. 2096 8.1. common.thrift 2098 /** 2099 Thrift file with common definitions for RIFT 2100 */ 2102 namespace * common 2104 /** @note MUST be interpreted in implementation as unsigned 64 bits. 2105 * The implementation SHOULD NOT use the MSB.
2106 */ 2107 typedef i64 SystemIDType 2108 typedef i32 IPv4Address 2109 /** this has to be of length long enough to accommodate a prefix */ 2110 typedef binary IPv6Address 2111 /** @note MUST be interpreted in implementation as unsigned 16 bits */ 2112 typedef i16 UDPPortType 2113 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2114 typedef i32 TIENrType 2115 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2116 typedef i32 MTUSizeType 2117 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2118 typedef i32 SeqNrType 2119 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2120 typedef i32 LifeTimeInSecType 2121 /** @note MUST be interpreted in implementation as unsigned 16 bits */ 2122 typedef i16 LevelType 2123 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2124 typedef i32 PodType 2125 /** @note MUST be interpreted in implementation as unsigned 16 bits */ 2126 typedef i16 VersionType 2127 /** @note MUST be interpreted in implementation as unsigned 32 bits */ 2128 typedef i32 MetricType 2129 typedef string KeyIDType 2130 /** node local, unique identification for a link (interface/tunnel 2131 * etc. Basically anything RIFT runs on). This is kept 2132 * at 32 bits so it aligns with BFD [RFC5880] discriminator size. 2133 */ 2134 typedef i32 LinkIDType 2135 typedef string KeyNameType 2136 typedef i8 PrefixLenType 2137 /** timestamp in seconds since the epoch */ 2138 typedef i64 TimestampInSecsType 2139 /** security nonce */ 2140 typedef i64 NonceType 2141 /** adjacency holdtime */ 2142 typedef i16 HoldTimeInSecType 2144 /** Flags indicating nodes behavior in case of ZTP and support 2145 for special optimization procedures.
2146 */ 2147 enum LeafIndications { 2148 NO_RESTRICTIONS =0, 2149 LEAF_ONLY =1, 2150 LEAF_ONLY_AND_LEAF_2_LEAF_PROCEDURES =2, 2151 } 2153 /** fixed leaf level when ZTP is not used */ 2154 const LevelType leaf_level = 0 2155 const LevelType default_level = 0 2156 const PodType default_pod = 0 2157 const LinkIDType undefined_linkid = 0 2158 const MetricType default_distance = 1 2159 /** any distance larger than this will be considered infinity */ 2160 const MetricType infinite_distance = 0x70000000 2161 /** any element with 0 distance will be ignored, 2162 * missing metrics will be replaced with default_distance 2163 */ 2164 const MetricType invalid_distance = 0 2165 const bool overload_default = false 2166 const bool flood_reduction_default = true 2167 const LeafIndications leaf_indication_default = LeafIndications.NO_RESTRICTIONS 2168 const bool south_transitive_default = false 2169 const HoldTimeInSecType default_holdtime = 3 2170 /** 0 is illegal for SystemID */ 2171 const SystemIDType IllegalSystemID = 0 2172 /** empty set of nodes */ 2173 const set<SystemIDType> empty_set_of_nodeids = {} 2175 /** default UDP port to run LIEs on */ 2176 const UDPPortType default_lie_udp_port = 6949 2177 const UDPPortType default_tie_udp_flood_port = 6950 2179 /** default MTU size to use */ 2180 const MTUSizeType default_mtu_size = 1400 2181 /** default mcast is v4 224.0.1.150, we make it i64 to 2182 * help languages struggling with highest bit */ 2183 const i64 default_lie_v4_mcast_group = 3758096790 2185 /** indicates whether the direction is northbound/east-west 2186 * or southbound */ 2187 enum TieDirectionType { 2188 Illegal = 0, 2189 South = 1, 2190 North = 2, 2191 DirectionMaxValue = 3, 2192 } 2194 enum AddressFamilyType { 2195 Illegal = 0, 2196 AddressFamilyMinValue = 1, 2197 IPv4 = 2, 2198 IPv6 = 3, 2199 AddressFamilyMaxValue = 4, 2200 } 2202 struct IPv4PrefixType { 2203 1: required IPv4Address address; 2204 2: required PrefixLenType prefixlen; 2206 } 2208 struct IPv6PrefixType { 2209
       1: required IPv6Address   address;
       2: required PrefixLenType prefixlen;
   }

   union IPAddressType {
       1: optional IPv4Address ipv4address;
       2: optional IPv6Address ipv6address;
   }

   union IPPrefixType {
       1: optional IPv4PrefixType ipv4prefix;
       2: optional IPv6PrefixType ipv6prefix;
   }

   enum TIETypeType {
       Illegal         = 0,
       TIETypeMinValue = 1,
       /** first legal value */
       NodeTIEType     = 2,
       PrefixTIEType   = 3,
       PGPrefixTIEType = 4,
       KeyValueTIEType = 5,
       TIETypeMaxValue = 6,
   }

   /** @note: route types which MUST be ordered on their preference.
    *  PGP prefixes are most preferred, attracting
    *  traffic north (towards spine) and then south;
    *  normal prefixes attract traffic south (towards leafs),
    *  i.e. a prefix in a NORTH PREFIX TIE is preferred over one in a
    *  SOUTH PREFIX TIE
    */
   enum RouteType {
       Illegal           = 0,
       RouteTypeMinValue = 1,
       /** First legal value. */
       /** Discard routes are most preferred */
       Discard           = 2,
       /** Local prefixes are directly attached prefixes on the
        *  system, such as e.g. interface routes.
        */
       LocalPrefix       = 3,
       /** advertised in S-TIEs */
       SouthPGPPrefix    = 4,
       /** advertised in N-TIEs */
       NorthPGPPrefix    = 5,
       /** advertised in N-TIEs */
       NorthPrefix       = 6,
       /** advertised in S-TIEs */
       SouthPrefix       = 7,
       /** transitive southbound prefixes are least preferred */
       TransitiveSouthPrefix = 8,
       RouteTypeMaxValue = 9
   }

8.2.  encoding.thrift

   /**
       Thrift file for packet encodings for RIFT
   */

   include "common.thrift"

   namespace * encoding

   /** represents protocol encoding schema major version */
   const i32 protocol_major_version = 5
   /** represents protocol encoding schema minor version */
   const i32 protocol_minor_version = 0

   /** common RIFT packet header */
   struct PacketHeader {
       1: required common.VersionType major_version =
               protocol_major_version;
       2: required common.VersionType minor_version =
               protocol_minor_version;
       /** this is the node sending the packet; in case of LIE/TIRE/TIDE
           it is also the originator of it */
       3: required common.SystemIDType sender;
       /** level of the node sending the packet, required on everything
        *  except LIEs.  Lack of presence on LIEs indicates
        *  UNDEFINED_LEVEL and is used in ZTP procedures.
        */
       4: optional common.LevelType level;
   }

   /** Community serves as community for PGP purposes */
   struct Community {
       1: required i32 top;
       2: required i32 bottom;
   }

   /** Neighbor structure */
   struct Neighbor {
       1: required common.SystemIDType originator;
       2: required common.LinkIDType   remote_id;
   }

   /** Capabilities the node supports */
   struct NodeCapabilities {
       /** can this node participate in flood reduction,
           only relevant at level > 0 */
       1: optional bool flood_reduction =
               common.flood_reduction_default;
       /** does this node restrict itself to be leaf only (in ZTP) and
           does it support leaf-2-leaf procedures */
       2: optional common.LeafIndications leaf_indications =
               LeafIndications.NO_RESTRICTIONS;
   }

   /** RIFT LIE packet

       @note this node's level is already included in the packet header */
   struct LIEPacket {
       /** optional node or adjacency name */
       1: optional string name;
       /** local link ID */
       2: required common.LinkIDType local_id;
       /** UDP port to
           which we can receive flooded TIEs */
       3: required common.UDPPortType flood_port =
               common.default_tie_udp_flood_port;
       /** layer 3 MTU */
       4: optional common.MTUSizeType link_mtu_size =
               common.default_mtu_size;
       /** this will reflect the neighbor once received */
       5: optional Neighbor neighbor;
       6: optional common.PodType pod = common.default_pod;
       /** optional nonce used for security computations */
       7: optional common.NonceType nonce;
       /** optional node capabilities shown in the LIE.  The
           capabilities MUST match the capabilities shown in the Node
           TIEs, otherwise the behavior is unspecified.  A node
           detecting the mismatch SHOULD generate a corresponding error.
       */
       8: optional NodeCapabilities capabilities;
       /** required holdtime of the adjacency, i.e. how much time
           MUST expire without LIEs for the adjacency to drop
       */
       9: required common.HoldTimeInSecType holdtime =
               common.default_holdtime;
   }

   /** LinkID pair describes one of the parallel links between two
       nodes */
   struct LinkIDPair {
       /** node-wide unique value for the local link */
       1: required common.LinkIDType local_id;
       /** received remote link ID for this link */
       2: required common.LinkIDType remote_id;
       /** more properties of the link can go in here */
   }

   /** ID of a TIE

       @note: TIEID space is a total order achieved by comparing the
              elements in the sequence defined and comparing each value
              as an unsigned integer of according length
   */
   struct TIEID {
       /** indicates direction of the TIE */
       1: required common.TieDirectionType direction;
       /** indicates originator of the TIE */
       2: required common.SystemIDType originator;
       3: required common.TIETypeType  tietype;
       4: required common.TIENrType    tie_nr;
   }

   /** Header of a TIE */
   struct TIEHeader {
       2: required TIEID tieid;
       3: required common.SeqNrType seq_nr;
       /** lifetime
           expires down to 0 just like in IS-IS */
       4: required common.LifeTimeInSecType lifetime;
   }

   /** A sorted TIDE packet; if unsorted, the behavior is undefined */
   struct TIDEPacket {
       /** all 00s mark the start */
       1: required TIEID start_range;
       /** all FFs mark the end */
       2: required TIEID end_range;
       /** _sorted_ list of headers */
       3: required list<TIEHeader> headers;
   }

   /** A TIRE packet */
   struct TIREPacket {
       1: required set<TIEHeader> headers;
   }

   /** Neighbor of a node */
   struct NodeNeighborsTIEElement {
       2: required common.LevelType level;
       /** Cost to neighbor.

           @note: All parallel links to the same node
                  incur the same cost; in case the neighbor has multiple
                  parallel links at different cost, the largest distance
                  (highest numerical value) MUST be advertised
           @note: any neighbor with cost <= 0 MUST be ignored in
                  computations */
       3: optional common.MetricType cost = common.default_distance;
       /** can carry description of multiple parallel links in a TIE */
       4: optional set<LinkIDPair> link_ids;
   }

   /** Flags the node sets */
   struct NodeFlags {
       /** node is in overload, do not transit traffic through it */
       1: optional bool overload = common.overload_default;
   }

   /** Description of a node.

       It may occur multiple times in different TIEs but if either
           * capabilities values do not match or
           * flags values do not match or
           * neighbors repeat with different values or
           * visible in same level/having partition upper do not match
       the behavior is undefined and a warning SHOULD be generated.
       Neighbors can, however, be distributed across multiple TIEs
       as long as the sets are disjoint.

       @note: observe that absence of fields implies defined defaults
   */
   struct NodeTIEElement {
       1: required common.LevelType level;
       /** if a neighbor systemID repeats in other node TIEs of the
           same node the behavior is undefined.
           Equivalent to |A_(n,s)(N) in spec. */
       2: required map<common.SystemIDType, NodeNeighborsTIEElement>
               neighbors;

       3: optional NodeCapabilities capabilities;
       4: optional NodeFlags flags;
       /** optional node name for easier operations */
       5: optional string name;

       /** Nodes seen at the same level through reflection, i.e.
           through nodes having a backlink to both nodes.  They are
           equivalent to |V(N) in future specifications.  Ignored in
           Node S-TIEs if present.
       */
       6: optional set<common.SystemIDType> visible_in_same_level
               = common.empty_set_of_nodeids;
       /** Non-overloaded nodes in |V seen as attached to another north
        *  level partition due to the fact that some nodes in its |V
        *  have adjacencies to higher level nodes that this node
        *  doesn't see.  This may be used in the computation at higher
        *  levels to prevent blackholing.  Ignored in Node S-TIEs if
        *  present.  Equivalent to |PUL(N) in spec. */
       7: optional set<common.SystemIDType>
               same_level_unknown_north_partitions
               = common.empty_set_of_nodeids;
   }

   struct PrefixAttributes {
       2: required common.MetricType metric = common.default_distance;
       /** Respected only on prefixes advertised in S-TIEs; indicates
        *  that the prefix SHOULD be propagated southwards towards
        *  lower levels to heal upper level partitioning, otherwise
        *  blackholes may occur.  Required since every implementation
        *  MUST understand it.
        */
       3: required bool south_transitive =
               common.south_transitive_default;
   }

   /** multiple prefixes */
   struct PrefixTIEElement {
       /** prefixes with the associated attributes.
           if the same prefix repeats in multiple TIEs of the same node
           the behavior is unspecified */
       1: required map<common.IPPrefixType, PrefixAttributes> prefixes;
   }

   /** keys with their values */
   struct KeyValueTIEElement {
       /** if the same key repeats in multiple TIEs of the same node
           or with different values, the behavior is unspecified */
       1: required map<common.KeyIDType, string> keyvalues;
   }

   /** single element in a TIE.
       The enum common.TIETypeType
       in TIEID indicates which elements MUST be present
       in the TIEElement.  In case of a mismatch the unexpected
       elements MUST be ignored.
   */
   union TIEElement {
       /** in case of enum common.TIETypeType.NodeTIEType */
       1: optional NodeTIEElement node;
       /** in case of enum common.TIETypeType.PrefixTIEType */
       2: optional PrefixTIEElement prefixes;
       3: optional KeyValueTIEElement keyvalues;
       /** @todo: policy guided prefixes */
   }

   /** @todo: flood header separately in UDP to allow caching of TIEs
       while changing lifetime?
   */
   struct TIEPacket {
       1: required TIEHeader header;
       2: required TIEElement element;
   }

   union PacketContent {
       1: optional LIEPacket  lie;
       2: optional TIDEPacket tide;
       3: optional TIREPacket tire;
       4: optional TIEPacket  tie;
   }

   /** protocol packet structure */
   struct ProtocolPacket {
       1: required PacketHeader header;
       2: required PacketContent content;
   }

9.  IANA Considerations

   At an opportune time, this specification will request multiple
   registry points to exchange protocol packets in a standardized way,
   amongst them multicast address assignments and standard port numbers.
   The schema itself defines many values and codepoints which can be
   considered registries themselves.

10.  Acknowledgments

   Many thanks to Naiming Shen for some of the early discussions around
   the topic of using IGPs for routing in topologies related to Clos.
   Adrian Farrel, Joel Halpern and Jeffrey Zhang provided thoughtful
   comments that improved the readability of the document and found a
   good number of corners where the light failed to shine.  Kris Price
   was the first to mention single router, single arm default
   considerations.
   Jeff Tantsura helped out with some initial thoughts on BFD
   interactions while Jeff Haas corrected several misconceptions about
   BFD's finer points.  Artur Makutunowicz pointed out many possible
   improvements and acted as a sounding board in regard to the modern
   protocol implementation techniques RIFT is exploring.  Barak Gafni
   was the first to clearly formalize the problem of a partitioned
   spine, on a (clean) napkin in Singapore.

11.  References

11.1.  Normative References

   [ISO10589]
              ISO "International Organization for Standardization",
              "Intermediate system to Intermediate system intra-domain
              routeing information exchange protocol for use in
              conjunction with the protocol for providing the
              connectionless-mode Network Service (ISO 8473), ISO/IEC
              10589:2002, Second Edition.", Nov 2002.

   [RFC1142]  Oran, D., Ed., "OSI IS-IS Intra-domain Routing Protocol",
              RFC 1142, DOI 10.17487/RFC1142, February 1990,
              <https://www.rfc-editor.org/info/rfc1142>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998,
              <https://www.rfc-editor.org/info/rfc2328>.

   [RFC2365]  Meyer, D., "Administratively Scoped IP Multicast", BCP 23,
              RFC 2365, DOI 10.17487/RFC2365, July 1998,
              <https://www.rfc-editor.org/info/rfc2365>.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006,
              <https://www.rfc-editor.org/info/rfc4271>.

   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
              Architecture", RFC 4291, DOI 10.17487/RFC4291, February
              2006, <https://www.rfc-editor.org/info/rfc4291>.

   [RFC4655]  Farrel, A., Vasseur, J., and J. Ash, "A Path Computation
              Element (PCE)-Based Architecture", RFC 4655,
              DOI 10.17487/RFC4655, August 2006,
              <https://www.rfc-editor.org/info/rfc4655>.

   [RFC5120]  Przygienda, T., Shen, N., and N.
              Sheth, "M-ISIS: Multi
              Topology (MT) Routing in Intermediate System to
              Intermediate Systems (IS-ISs)", RFC 5120,
              DOI 10.17487/RFC5120, February 2008,
              <https://www.rfc-editor.org/info/rfc5120>.

   [RFC5303]  Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way
              Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303,
              DOI 10.17487/RFC5303, October 2008,
              <https://www.rfc-editor.org/info/rfc5303>.

   [RFC5709]  Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M.,
              Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic
              Authentication", RFC 5709, DOI 10.17487/RFC5709, October
              2009, <https://www.rfc-editor.org/info/rfc5709>.

   [RFC5881]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881,
              DOI 10.17487/RFC5881, June 2010,
              <https://www.rfc-editor.org/info/rfc5881>.

   [RFC6234]  Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
              (SHA and SHA-based HMAC and HKDF)", RFC 6234,
              DOI 10.17487/RFC6234, May 2011,
              <https://www.rfc-editor.org/info/rfc6234>.

   [RFC6822]  Previdi, S., Ed., Ginsberg, L., Shand, M., Roy, A., and D.
              Ward, "IS-IS Multi-Instance", RFC 6822,
              DOI 10.17487/RFC6822, December 2012,
              <https://www.rfc-editor.org/info/rfc6822>.

   [RFC7855]  Previdi, S., Ed., Filsfils, C., Ed., Decraene, B.,
              Litkowski, S., Horneffer, M., and R. Shakir, "Source
              Packet Routing in Networking (SPRING) Problem Statement
              and Requirements", RFC 7855, DOI 10.17487/RFC7855, May
              2016, <https://www.rfc-editor.org/info/rfc7855>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

   [RFC7987]  Ginsberg, L., Wells, P., Decraene, B., Przygienda, T., and
              H. Gredler, "IS-IS Minimum Remaining Lifetime", RFC 7987,
              DOI 10.17487/RFC7987, October 2016,
              <https://www.rfc-editor.org/info/rfc7987>.

11.2.  Informative References

   [CLOS]     Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
              Communication Environments", IEEE International Parallel &
              Distributed Processing Symposium, 2011.
   [DIJKSTRA] Dijkstra, E., "A Note on Two Problems in Connexion with
              Graphs", Numer. Math., 1959.

   [DYNAMO]   DeCandia, G. et al., "Dynamo: Amazon's Highly Available
              Key-value Store", ACM SIGOPS Symposium on Operating
              Systems Principles (SOSP '07), 2007.

   [EPPSTEIN] Eppstein, D., "Finding the k Shortest Paths", 1997.

   [FATTREE]  Leiserson, C., "Fat-Trees: Universal Networks for
              Hardware-Efficient Supercomputing", 1985.

   [MAKSIC2013]
              Maksic, N. et al., "Improving Utilization of Data Center
              Networks", IEEE Communications Magazine, Nov 2013.

   [PROTOBUF] Google, Inc., "Protocol Buffers",
              https://developers.google.com/protocol-buffers.

   [QUIC]     Iyengar, J. et al., "QUIC: A UDP-Based Multiplexed and
              Secure Transport", 2016.

   [VAHDAT08] Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
              Commodity Data Center Network Architecture", SIGCOMM,
              2008.

Authors' Addresses

   Tony Przygienda (editor)
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA  94089
   US

   Email: prz@juniper.net

   Alankar Sharma
   Comcast
   1800 Bishops Gate Blvd
   Mount Laurel, NJ  08054
   US

   Email: Alankar_Sharma@comcast.com

   Alia Atlas
   Juniper Networks
   10 Technology Park Drive
   Westford, MA  01886
   US

   Email: akatlas@juniper.net

   John Drake
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA  94089
   US

   Email: jdrake@juniper.net
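   As a purely illustrative, non-normative sketch (not part of the
   draft), the TIEID total order defined in the schema of Section 8.2,
   comparing the elements in the sequence defined and each value as an
   unsigned integer, could be realized as follows.  All names and the
   field widths used for the "all FFs" end-of-range marker are
   hypothetical choices of this sketch:

   ```python
   from typing import NamedTuple


   class TIEID(NamedTuple):
       """Mirror of the schema's TIEID; each field is an unsigned int."""
       direction: int   # common.TieDirectionType (South = 1, North = 2)
       originator: int  # common.SystemIDType
       tietype: int     # common.TIETypeType
       tie_nr: int      # common.TIENrType


   def tieid_key(tid: TIEID):
       """Total-order key: fields compared in the sequence the schema
       defines, each as an unsigned integer."""
       return (tid.direction, tid.originator, tid.tietype, tid.tie_nr)


   # TIDE ranges rely on this order: an all-0s TIEID marks the start of
   # the range, an all-FFs TIEID the end.  The widths below are
   # illustrative assumptions, not taken from the schema.
   ALL_ZEROS = TIEID(0, 0, 0, 0)
   ALL_FFS = TIEID(0xFF, 0xFFFFFFFFFFFFFFFF, 0xFF, 0xFFFFFFFF)


   def in_tide_range(tid: TIEID, start: TIEID, end: TIEID) -> bool:
       """True if the TIEID falls in [start_range, end_range] of a TIDE."""
       return tieid_key(start) <= tieid_key(tid) <= tieid_key(end)
   ```

   Since a NamedTuple already compares lexicographically over its fields
   in declaration order, `tieid_key` only makes the intent explicit; the
   "_sorted_ list of headers" in a TIDE would be sorted under the same
   key.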