Networking Working Group                                   T. Przygienda
Internet-Draft                                          Juniper Networks
Intended status: Standards Track                               A. Sharma
Expires: December 28, 2017                                       Comcast
                                                                J. Drake
                                                                A. Atlas
                                                        Juniper Networks
                                                           June 26, 2017

                       RIFT: Routing in Fat Trees
                        draft-przygienda-rift-02

Abstract

   This document outlines a specialized, dynamic routing protocol for
   Clos and fat-tree network topologies.
   The protocol (1) deals with automatic construction of fat-tree
   topologies based on detection of links, (2) minimizes the amount of
   routing state held at each level, (3) automatically prunes the
   topology distribution exchanges to a sufficient subset of links, (4)
   supports automatic disaggregation of prefixes on link and node
   failures to prevent black-holing and suboptimal routing, (5) allows
   traffic steering and re-routing policies and ultimately (6) provides
   mechanisms to synchronize a limited key-value data-store that can be
   used after protocol convergence to e.g. bootstrap higher levels of
   functionality on nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 28, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Reference Frame
     2.1.  Terminology
     2.2.  Topology
   3.  Requirement Considerations
   4.  RIFT: Routing in Fat Trees
     4.1.  Overview
     4.2.  Specification
       4.2.1.  Transport
       4.2.2.  Link (Neighbor) Discovery (LIE Exchange)
       4.2.3.  Topology Exchange (TIE Exchange)
         4.2.3.1.  Topology Information Elements
         4.2.3.2.  South- and Northbound Representation
         4.2.3.3.  Flooding
         4.2.3.4.  TIE Flooding Scopes
         4.2.3.5.  Initial and Periodic Database Synchronization
         4.2.3.6.  Purging
         4.2.3.7.  Southbound Default Route Origination
         4.2.3.8.  Optional Automatic Flooding Reduction and
                   Partitioning
       4.2.4.  Policy-Guided Prefixes
         4.2.4.1.  Ingress Filtering
         4.2.4.2.  Applying Policy
         4.2.4.3.  Store Policy-Guided Prefix for Route Computation
                   and Regeneration
         4.2.4.4.  Re-origination
         4.2.4.5.  Overlap with Disaggregated Prefixes
       4.2.5.  Reachability Computation
         4.2.5.1.  Northbound SPF
         4.2.5.2.  Southbound SPF
         4.2.5.3.  East-West Forwarding Within a Level
       4.2.6.  Attaching Prefixes
       4.2.7.  Attaching Policy-Guided Prefixes
       4.2.8.  Automatic Disaggregation on Link & Node Failures
     4.3.  Further Mechanisms
       4.3.1.  Overload Bit
       4.3.2.  Optimized Route Computation on Leafs
       4.3.3.  Key/Value Store
       4.3.4.  Interactions with BFD
       4.3.5.  Leaf to Leaf Procedures
       4.3.6.  Other End-to-End Services
       4.3.7.  Address Family and Multi Topology Considerations
       4.3.8.  Reachability of Internal Nodes in the Fabric
       4.3.9.  One-Hop Healing of Levels with East-West Links
   5.  Examples
     5.1.  Normal Operation
     5.2.  Leaf Link Failure
     5.3.  Partitioned Fabric
     5.4.  Northbound Partitioned Router and Optional East-West Links
   6.  Implementation and Operation: Further Details
     6.1.  Considerations for Leaf-Only Implementation
     6.2.  Adaptations to Other Proposed Data Center Topologies
     6.3.  Originating Non-Default Route Southbound
   7.  Security Considerations
   8.  Information Elements Schema
     8.1.  common.thrift
     8.2.  encoding.thrift
   9.  IANA Considerations
   10. Security Considerations
   11. Acknowledgments
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Authors' Addresses

1.  Introduction

   Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence
   in today's networking, primarily as a result of the paradigm shift
   towards a centralized, data-center based architecture that is poised
   to deliver a majority of computation and storage services in the
   future.  The existing set of dynamic routing protocols was geared
   originally towards networks with an irregular topology and a low
   degree of connectivity, and consequently several attempts to adapt
   those protocols have been made.  Most successfully, BGP [RFC4271]
   [RFC7938] has been
Most successfully BGP [RFC4271] [RFC7938] has been 140 extended to this purpose, not as much due to its inherent suitability 141 to solve the problem but rather because the perceived capability to 142 modify it "quicker" and the immanent difficulties with link-state 143 [DIJKSTRA] based protocols to fulfill certain of the resulting 144 requirements. 146 In looking at the problem through the very lens of its requirements 147 an optimal approach does not seem to be a simple modification of 148 either a link-state (distributed computation) or distance-vector 149 (diffused computation) approach but rather a mixture of both, 150 colloquially best described as 'link-state towards the spine' and 151 'distance vector towards the leafs'. The balance of this document 152 details the resulting protocol. 154 1.1. Requirements Language 156 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 157 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 158 document are to be interpreted as described in RFC 2119 [RFC2119]. 160 2. Reference Frame 162 2.1. Terminology 164 This section presents the terminology used in this document. It is 165 assumed that the reader is thoroughly familiar with the terms and 166 concepts used in OSPF [RFC2328] and IS-IS [RFC1142], [ISO10589] as 167 well as the according graph theoretical concepts of shortest path 168 first (SPF) [DIJKSTRA] computation and directed acyclic graphs (DAG). 170 Level: Clos and Fat Tree networks are trees and 'level' denotes the 171 set of nodes at the same height in such a network, where the 172 bottom level is level 0. A node has links to nodes one level down 173 and/or one level up. Under some circumstances, a node may have 174 links to nodes at the same level. As footnote: Clos terminology 175 uses often the concept of "stage" but due to the folded nature of 176 the Fat Tree we do not use it to prevent misunderstandings. 178 Spine/Aggregation/Edge Levels: Traditional names for Level 2, 1 and 179 0 respectively. Level 0 is often called leaf as well. 181 Point of Delivery (PoD): A self-contained vertical slice of a Clos 182 or Fat Tree network containing normally only level 0 and level 1 183 nodes. It communicates with nodes in other PoDs via the spine. 185 Spine: The set of nodes that provide inter-PoD communication. These 186 nodes are also organized into levels (typically one, three, or 187 five levels). Spine nodes do not belong to any PoD and are 188 assigned the PoD value 0 to indicate this. 190 Leaf: A node at level 0. 192 Connected Spine: In case a spine level represents a connected graph 193 (discounting links terminating at different levels), we call it a 194 "connected spine", in case a spine level consists of multiple 195 partitions, we call it a "disconnected" or "partitioned spine". 196 In other terms, a spine without east-west links is disconnected 197 and is the typical configuration for Clos and Fat Tree networks. 199 South/Southbound and North/Northbound (Direction): When describing 200 protocol elements and procedures, we will be using in different 201 situations the directionality of the compass. I.e., 'south' or 202 'southbound' mean moving towards the bottom of the Clos or Fat 203 Tree network and 'north' and 'northbound' mean moving towards the 204 top of the Clos or Fat Tree network. 206 Northbound Link: A link to a node one level up or in other words, 207 one level further north. 209 Southbound Link: A link to a node one level down or in other words, 210 one level further south. 
212 East-West Link: A link between two nodes at the same level. East- 213 west links are normally not part of Clos or "fat-tree" topologies. 215 Leaf shortcuts (L2L): East-west links at leaf level will need to be 216 differentiated from East-west links at other levels. 218 Southbound representation: Information sent towards a lower level 219 representing only limited amount of information. 221 TIE: This is an acronym for a "Topology Information Element". TIEs 222 are exchanged between RIFT nodes to describe parts of a network 223 such as links and address prefixes. It can be thought of as 224 largely equivalent to ISIS LSPs or OSPF LSA. We will talk about 225 N-TIEs when talking about TIEs in the northbound representation 226 and S-TIEs for the southbound equivalent. 228 Node TIE: This is an acronym for a "Node Topology Information 229 Element", largely equivalent to OSPF Node LSA, i.e. it contains 230 all neighbors the node discovered and information about node 231 itself. 233 Prefix TIE: This is an acronym for a "Prefix Topology Information 234 Element" and it contains all prefixes directly attached to this 235 node in case of a N-TIE and in case of S-TIE the necessary default 236 and de-aggregated prefixes the node passes southbound. 238 Policy-Guided Information: Information that is passed in either 239 southbound direction or north-bound direction by the means of 240 diffusion and can be filtered via policies. Policy-Guided 241 Prefixes and KV Ties are examples of Policy-Guided Information. 243 Key Value TIE: A S-TIE that is carrying a set of key value pairs 244 [DYNAMO]. It can be used to distribute information in the 245 southbound direction within the protocol. 247 TIDE: Topology Information Description Element, equivalent to CSNP 248 in ISIS. 250 TIRE: Topology Information Request Element, equivalent to PSNP in 251 ISIS. It can both confirm received and request missing TIEs. 253 PGP: Policy-Guided Prefixes allow to support traffic engineering 254 that cannot be achieved by the means of SPF computation or normal 255 node and prefix S-TIE origination. S-PGPs are propagated in south 256 direction only and N-PGPs follow northern direction strictly. 258 De-aggregation/Disaggregation: Process in which a node decides to 259 advertise certain prefixes it received in N-TIEs to prevent black- 260 holing and suboptimal routing upon link failures. 262 LIE: This is an acronym for a "Link Information Element", largely 263 equivalent to HELLOs in IGPs and exchanged over all the links 264 between systems running RIFT to form adjacencies. 266 FL: Flooding Leader for a specific system has a dedicated role to 267 flood TIEs of that system. 269 2.2. Topology 270 . +--------+ +--------+ 271 . | | | | ^ N 272 . |Spine 21| |Spine 22| | 273 .Level 2 ++-+--+-++ ++-+--+-++ <-*-> E/W 274 . | | | | | | | | | 275 . P111/2| |P121 | | | | S v 276 . ^ ^ ^ ^ | | | | 277 . | | | | | | | | 278 . +--------------+ | +-----------+ | | | +---------------+ 279 . | | | | | | | | 280 . South +-----------------------------+ | | ^ 281 . | | | | | | | All TIEs 282 . 0/0 0/0 0/0 +-----------------------------+ | 283 . v v v | | | | | 284 . | | +-+ +<-0/0----------+ | | 285 . | | | | | | | | 286 .+-+----++ optional +-+----++ ++----+-+ ++-----++ 287 .| | E/W link | | | | | | 288 .|Node111+----------+Node112| |Node121| |Node122| 289 .+-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 290 . | | | South | | | | 291 . | +---0/0--->-----+ 0/0 | +----------------+ | 292 . 0/0 | | | | | | | 293 . 
| +---<-0/0-----+ | v | +--------------+ | | 294 . v | | | | | | | 295 .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 296 .| | (L2L) | | | | Level 0 | | 297 .|Leaf111~~~~~~~~~~~~Leaf112| |Leaf121| |Leaf122| 298 .+-+-----+ +-+---+-+ +--+--+-+ +-+-----+ 299 . + + \ / + + 300 . Prefix111 Prefix112 \ / Prefix121 Prefix122 301 . multi-homed 302 . Prefix 303 .+---------- Pod 1 ---------+ +---------- Pod 2 ---------+ 305 Figure 1: A two level spine-and-leaf topology 307 We will use this topology (called commonly a fat tree/network in 308 modern DC considerations [VAHDAT08] as homonym to the original 309 definition of the term [FATTREE]) in all further considerations. It 310 depicts a generic "fat-tree" and the concepts explained in three 311 levels here carry by induction for further levels and higher degrees 312 of connectivity. 314 3. Requirement Considerations 316 [RFC7938] gives the original set of requirements augmented here based 317 upon recent experience in the operation of fat-tree networks. 319 REQ1: The solution should allow for minimum size routing 320 information base and forwarding tables at leaf level for 321 speed, cost and simplicity reasons. Holding excessive 322 amount of information away from leaf nodes simplifies 323 operation and lowers cost of the underlay. 325 REQ2: Very high degree of ECMP (and ideally non equal cost) must 326 be supported. Maximum ECMP is currently understood as the 327 most efficient routing approach to maximize the throughput 328 of switching fabrics [MAKSIC2013]. 330 REQ3: Traffic engineering should be allowed by modification of 331 prefixes and/or their next-hops. 333 REQ4: The control protocol should discover the physical links 334 automatically and be able to detect cabling that violates 335 fat-tree topology constraints. It must react accordingly to 336 such mis-cabling attempts, at a minimum preventing 337 adjacencies between nodes from being formed and traffic from 338 being forwarded on those mis-cabled links. E.g. connecting 339 a leaf to a spine at level 2 should be detected and ideally 340 prevented. 342 REQ5: The solution should allow for access to link states of the 343 whole topology to enable efficient support for modern 344 control architectures like SPRING [RFC7855] or PCE 345 [RFC4655]. 347 REQ6: The solution should easily accommodate opaque data to be 348 carried throughout the topology to subsets of nodes. This 349 can be used for many purposes, one of them being a key-value 350 store that allows bootstrapping of nodes based right at the 351 time of topology discovery. 353 REQ7: Nodes should be taken out and introduced into production 354 with minimum wait-times and minimum of "shaking" of the 355 network, i.e. radius of propagation (often called "blast 356 radius") of changed information should be as small as 357 feasible. 359 REQ8: The protocol should allow for maximum aggregation of carried 360 routing information while at the same time automatically de- 361 aggregating the prefixes to prevent black-holing in case of 362 failures. The de-aggregation should support maximum 363 possible ECMP/N-ECMP remaining after failure. 365 REQ9: A node without any configuration beside default values 366 should come up as leaf in any PoD it is introduced into. 367 Optionally, it must be possible to configure nodes to 368 restrict their participation to the PoD(s) targeted at any 369 level. 
   REQ10:   Reducing the scope of communication needed throughout the
            network on link and state failure, as well as reducing
            advertisements of repeating, idiomatic or policy-guided
            information in stable state, is highly desirable, since it
            leads to better stability and faster convergence behavior.

   REQ11:   Once a packet traverses a link in a "southbound" direction,
            it must not take any further "northbound" steps along its
            path to delivery to its destination under normal
            conditions.  Taking a path through the spine in cases
            where a shorter path is available is highly undesirable.

   REQ12:   Parallel links between the same set of nodes must be
            distinguishable for SPF, failure and traffic engineering
            purposes.

   REQ13:   The protocol must not rely on interfaces having
            discernible unique addresses, i.e. it must operate in the
            presence of unnumbered links (even parallel ones) or links
            of a single node having the same addresses.

   REQ14:   Optionally, the protocol should allow provisioning of data
            centers where the individual switches carry no
            configuration information and all determine the topology
            from a "seed".  Observe that this requirement may collide
            with the desire to detect cabling misconfiguration, and
            with that only one of the two requirements can be fully
            met in a chosen configuration mode.

   The following list represents possible requirements and requirements
   under discussion:

   PEND1:   Supporting anything but point-to-point links is a non-
            requirement.  Questions remain: for connecting to the
            leaves, is there a case where multipoint is desirable?
            One could still model it as point-to-point links; it seems
            there is no need for anything more than an NBMA-type
            construct.

   PEND2:   What is the maximum scale of the number of leaf prefixes
            we need to carry?  Is 500'000 enough?

   Finally, the following are non-requirements:

   NONREQ1:  Broadcast media support is unnecessary.

   NONREQ2:  Purging is unnecessary given its fragility and complexity
             and today's large memory size on even modest switches and
             routers.

   NONREQ3:  Special support for layer 3 multi-hop adjacencies is not
             part of the protocol specification.  Such support can be
             easily provided by using tunneling technologies, the same
             way IGPs today solve the problem.

4.  RIFT: Routing in Fat Trees

   Derived from the above requirements, we present a detailed outline
   of a protocol optimized for Routing in Fat Trees (RIFT) that in most
   abstract terms has many properties of a modified link-state protocol
   [RFC2328] [RFC1142] when "pointing north" and of a path-vector
   [RFC4271] protocol when "pointing south".  Albeit an unusual
   combination, it does quite naturally exhibit the desirable
   properties we seek.

4.1.  Overview

   The novel property of RIFT is that it floods northbound "flat" link-
   state information so that each level understands the full topology
   of the levels south of it.  In contrast, in the southbound direction
   the protocol operates like a path-vector protocol, or rather a
   distance vector with implicit split horizon, since the topology
   constraints make a diffused computation front propagating in all
   directions unnecessary.

4.2.  Specification

4.2.1.  Transport

   All protocol elements are carried over UDP.  Once QUIC [QUIC]
   achieves the desired stability in deployments, it may prove a
   valuable candidate for TIE transport.

   All packet formats are defined in Thrift models in Section 8.
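   As a non-normative illustration of the transport choice, the
   following Python sketch sends one serialized protocol element as a
   single UDP datagram.  The port number and the pre-serialized payload
   are illustrative assumptions only; neither is defined by this
   specification.

   <CODE BEGINS>
   import socket

   RIFT_PORT = 9142  # hypothetical port, for illustration only

   def send_element(payload: bytes, dst_ip: str,
                    port: int = RIFT_PORT) -> None:
       # One Thrift-serialized protocol element maps to one datagram.
       s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
       try:
           s.sendto(payload, (dst_ip, port))
       finally:
           s.close()
   <CODE ENDS>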
4.2.2.  Link (Neighbor) Discovery (LIE Exchange)

   LIE exchange happens over a well-known, administratively locally
   scoped IPv4 multicast address [RFC2365] or the link-local multicast
   scope for IPv6 [RFC4291], and LIEs SHOULD be sent with a TTL of 1 to
   prevent RIFT information from reaching beyond a single link in the
   topology.  LIEs are exchanged over all links running RIFT.

   Each node is provisioned with the level at which it is operating and
   its PoD.  A default level and PoD of zero are assumed, meaning that
   leafs do not need to be configured with a level (or even a PoD).
   Nodes in the spine are configured with a PoD of zero.  This
   information is propagated in the LIEs exchanged.  Adjacencies are
   formed if and only if

   1.  the node is in the same PoD or either the node or the neighbor
       advertises "any" PoD membership (PoD# = 0) AND

   2.  the neighboring node is at most one level away AND

   3.  the neighboring node is running the same MAJOR schema version
       AND

   4.  the neighbor is not a member of some PoD while the node has a
       northbound adjacency already joining another PoD.

   A node configured with "any" PoD membership MUST, after building its
   first northbound adjacency (which makes it a participant in a PoD),
   advertise that PoD as part of its LIEs.

   LIEs arriving with a TTL larger than 1 MUST be ignored.

   LIE exchange uses a three-way handshake mechanism [RFC5303].
   Precise final state machines will be provided in later versions of
   this specification.  LIE packets contain nonces and may contain an
   SHA-1 hash [RFC6234] over the nonces and some of the LIE data, which
   prevents corruption and replay attacks.  TIE flooding reuses those
   nonces to prevent mismatches and can use them for security purposes
   in case it is using QUIC [QUIC].  Section 7 will address the precise
   security mechanisms in the future.
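   As a non-normative restatement of the adjacency rules above, the
   following Python sketch expresses them as a predicate.  The field
   names are illustrative only, and the interpretation of rule 4 shown
   in the comments is one possible reading.

   <CODE BEGINS>
   def adjacency_acceptable(node, nbr_lie) -> bool:
       # Rule 1: same PoD, or either side advertises "any" PoD
       # membership (PoD# = 0).
       pod_ok = (node.pod == nbr_lie.pod
                 or node.pod == 0 or nbr_lie.pod == 0)
       # Rule 2: the neighbor is at most one level away.
       level_ok = abs(node.level - nbr_lie.level) <= 1
       # Rule 3: both run the same MAJOR schema version.
       schema_ok = node.major_version == nbr_lie.major_version
       # Rule 4 (one reading): reject a neighbor in a concrete PoD
       # that differs from a PoD this node already joined northbound.
       pod_conflict = (nbr_lie.pod != 0
                       and node.joined_pod not in (0, nbr_lie.pod))
       return pod_ok and level_ok and schema_ok and not pod_conflict
   <CODE ENDS>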
4.2.3.  Topology Exchange (TIE Exchange)

4.2.3.1.  Topology Information Elements

   Topology and reachability information in RIFT is conveyed by means
   of TIEs, which have a good amount of commonality with LSAs in OSPF.

   The TIE exchange mechanism uses the port indicated by each node in
   the LIE exchange and, as destination, the interface on which the
   adjacency has been formed.  It SHOULD use a TTL of 1 as well.

   TIEs contain sequence numbers, lifetimes and a type.  Each type has
   a large identifying number space, and information is spread across
   possibly many TIEs of a certain type by means of a hash function
   that a node or deployment can individually determine.  One extreme
   side of the scale is a prefix per TIE, which leads to BGP-like
   behavior; dense packing into few TIEs leads to the more traditional
   IGP trade-off.  An implementation may even rehash, at the cost of a
   significant amount of TIE re-advertisement.

   More information about the TIE structure can be found in the schema
   in Section 8.

4.2.3.2.  South- and Northbound Representation

   As a central concept to RIFT, each node represents itself
   differently depending on the direction in which it is advertising
   information.  More precisely, a spine node represents two different
   databases to its neighbors, depending on whether it advertises TIEs
   to the south or to the north/sideways.  We call those differing TIE
   databases either south- or northbound (S-TIEs and N-TIEs) depending
   on the direction of distribution.

   The N-TIEs hold all of the node's adjacencies, local prefixes and
   northbound policy-guided prefixes, while the S-TIEs hold only all of
   the node's neighbors and the default prefix with the necessary
   disaggregated prefixes and southbound policy-guided prefixes.  We
   will explain this in detail further in Section 4.2.8 and
   Section 4.2.4.

   As an example illustrating a database holding both representations,
   consider the topology in Figure 1 with the optional link between
   node 111 and node 112 (so that the flooding on an east-west link can
   be shown).  This example assumes unnumbered interfaces.  First, here
   are the TIEs generated by some nodes.  For simplicity, the key value
   elements and the PGP elements which may be included in their S-TIEs
   or N-TIEs are not shown.

   Spine21 S-TIE:
     NodeElement(layer=2, neighbors((Node111, layer 1, cost 1),
       (Node112, layer 1, cost 1), (Node121, layer 1, cost 1),
       (Node122, layer 1, cost 1)))
     SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1))

   Node111 S-TIE:
     NodeElement(layer=1, neighbors((Spine21, layer 2, cost 1),
       (Spine22, layer 2, cost 1), (Node112, layer 1, cost 1),
       (Leaf111, layer 0, cost 1), (Leaf112, layer 0, cost 1)))
     SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1))

   Node111 N-TIE:
     NodeLinkElement(layer=1,
       neighbors((Spine21, layer 2, cost 1, links(...)),
         (Spine22, layer 2, cost 1, links(...)),
         (Node112, layer 1, cost 1, links(...)),
         (Leaf111, layer 0, cost 1, links(...)),
         (Leaf112, layer 0, cost 1, links(...))))
     NorthPrefixesElement(prefixes(Node111.loopback))

   Node121 S-TIE:
     NodeElement(layer=1, neighbors((Spine21, layer 2, cost 1),
       (Spine22, layer 2, cost 1), (Leaf121, layer 0, cost 1),
       (Leaf122, layer 0, cost 1)))
     SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1))

   Node121 N-TIE:
     NodeLinkElement(layer=1,
       neighbors((Spine21, layer 2, cost 1, links(...)),
         (Spine22, layer 2, cost 1, links(...)),
         (Leaf121, layer 0, cost 1, links(...)),
         (Leaf122, layer 0, cost 1, links(...))))
     NorthPrefixesElement(prefixes(Node121.loopback))

   Leaf112 N-TIE:
     NodeLinkElement(layer=0,
       neighbors((Node111, layer 1, cost 1, links(...)),
         (Node112, layer 1, cost 1, links(...))))
     NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112,
       Prefix_MH))

     Figure 2: Example TIEs generated in a 2 level spine-and-leaf
                                topology

4.2.3.3.  Flooding

   The mechanism used to distribute TIEs is the well-known (albeit
   modified in several respects to address fat-tree requirements)
   flooding mechanism used by today's link-state protocols.  Albeit
   initially more demanding to implement, it avoids many problems of
   the diffused computation update style used by path vector.  As
   described before, TIEs themselves are transported over UDP with the
   ports indicated in the LIE exchanges, and using as destination the
   address (for unnumbered IPv4 interfaces the same considerations
   apply as in the equivalent OSPF case) on which the LIE adjacency has
   been formed.

   Precise final state machines and procedures will be provided in
   later versions of this specification.
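   As a non-normative illustration of the flooding core, the following
   Python sketch shows the standard link-state acceptance rule applied
   to TIEs: a received TIE replaces a stored one only when its sequence
   number is higher, and only then is it re-flooded.  The TIE
   identifier shown and the helper flood_scope_neighbors(), which would
   apply the scopes of Section 4.2.3.4, are illustrative assumptions.

   <CODE BEGINS>
   def receive_tie(tie_db, tie, from_neighbor, node):
       # tie.tie_id: e.g. (direction, originator, type, number)
       stored = tie_db.get(tie.tie_id)
       if stored is not None and stored.seq_nr >= tie.seq_nr:
           # Nothing newer: confirm via TIRE instead of re-flooding.
           return
       tie_db[tie.tie_id] = tie  # store the newer version
       for nbr in flood_scope_neighbors(node, tie):  # per Table 1
           if nbr != from_neighbor:  # do not echo to the sender
               nbr.send(tie)
   <CODE ENDS>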
4.2.3.4.  TIE Flooding Scopes

   In a somewhat analogous fashion to link-local, area and domain
   flooding scopes, RIFT defines several complex "flooding scopes"
   depending on the direction and type of TIE propagated.

   Every N-TIE is flooded northbound, providing a node at a given level
   with the complete topology of the Clos or Fat Tree network
   underneath it, including all specific prefixes.  This means that a
   packet received from a node at the same or lower level whose
   destination is covered by one of those specific prefixes may be
   routed directly towards the node advertising that prefix, rather
   than sending the packet to a node at a higher level.

   A node's node S-TIEs, consisting of all of the node's adjacencies,
   and its prefix S-TIEs, with the default IP prefix and disaggregated
   prefixes, are flooded southbound in order to allow the nodes one
   level down to see the connectivity of the higher level as well as
   reachability to the rest of the fabric.  In order to allow an E-W
   disconnected node in a given level to receive the S-TIEs of other
   nodes at its level, every *NODE* S-TIE is "reflected" northbound to
   the level from which it was received.  It should be noted that east-
   west links are included in South TIE flooding; those TIEs need to be
   flooded to satisfy the algorithms in Section 4.2.5.  In that way,
   nodes at the same level can learn about each other without a more
   southern level, e.g. in the case of the leaf level.  The precise
   flooding scopes are given in Table 1.  Those rules govern in a
   symmetric fashion what SHOULD be included in TIDEs towards
   neighbors.

   Node S-TIE "reflection" allows supporting disaggregation on failures
   as described in Section 4.2.8 and flooding reduction as described in
   Section 4.2.3.8.

   +-------------+----------------------+----------------+------------+
   | Packet Type | South                | North          | East-West  |
   | vs. Peer    |                      |                |            |
   | Direction   |                      |                |            |
   +-------------+----------------------+----------------+------------+
   | node S-TIE  | flood own only       | flood always   | same as    |
   |             |                      |                | South      |
   +-------------+----------------------+----------------+------------+
   | other S-TIE | flood own only       | flood only if  | same as    |
   |             |                      | TIE originator | South      |
   |             |                      | is equal peer  |            |
   +-------------+----------------------+----------------+------------+
   | all N-TIEs  | never flood          | flood always   | same as    |
   |             |                      |                | South      |
   +-------------+----------------------+----------------+------------+
   | TIDE        | include TIEs in      | include TIEs   | same as    |
   |             | flooding scope       | in flooding    | South      |
   |             |                      | scope          |            |
   +-------------+----------------------+----------------+------------+
   | TIRE        | include all N-TIEs   | include only   | same as    |
   |             | and all peer's self- | if TIE         | South      |
   |             | originated TIEs and  | originator is  |            |
   |             | all node S-TIEs      | equal peer     |            |
   +-------------+----------------------+----------------+------------+

                        Table 1: Flooding Scopes

   As an example to illustrate these rules, consider using the topology
   in Figure 1, with the optional link between node 111 and node 112,
   and the associated TIEs given in Figure 2.  The flooding from
   particular nodes of the TIEs is given in Table 2.

   +----------+----------+---------------------------------------------+
   | Router   | Neighbor | TIEs                                        |
   | floods   |          |                                             |
   | to       |          |                                             |
   +----------+----------+---------------------------------------------+
   | Leaf111  | Node112  | Leaf111 N-TIEs, Node111 node S-TIE          |
   | Leaf111  | Node111  | Leaf111 N-TIEs, Node112 node S-TIE          |
   +----------+----------+---------------------------------------------+
   | Node111  | Leaf111  | Node111 S-TIEs                              |
   | Node111  | Leaf112  | Node111 S-TIEs                              |
   | Node111  | Node112  | Node111 S-TIEs                              |
   | Node111  | Spine21  | Node111 N-TIEs, Node112 N-TIEs, Leaf111     |
   |          |          | N-TIEs, Leaf112 N-TIEs, Spine22 node S-TIE  |
   | Node111  | Spine22  | Node111 N-TIEs, Node112 N-TIEs, Leaf111     |
   |          |          | N-TIEs, Leaf112 N-TIEs, Spine21 node S-TIE  |
   +----------+----------+---------------------------------------------+
   | ...      | ...      | ...                                         |
   | Spine21  | Node111  | Spine21 S-TIEs                              |
   | Spine21  | Node112  | Spine21 S-TIEs                              |
   | Spine21  | Node121  | Spine21 S-TIEs                              |
   | Spine21  | Node122  | Spine21 S-TIEs                              |
   | ...      | ...      | ...                                         |
   +----------+----------+---------------------------------------------+

            Table 2: Flooding some TIEs from example topology

4.2.3.5.  Initial and Periodic Database Synchronization

   The initial exchange of RIFT is modeled after IS-IS, with the TIDE
   being equivalent to a CSNP and the TIRE playing the role of a PSNP.
   The content of TIDEs and TIREs is governed by Table 1.
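   As a non-normative illustration of the synchronization flow, the
   following Python sketch processes one received TIDE against the
   local database, requesting newer or missing TIEs via a TIRE and re-
   flooding anything the local node holds in a newer version.  The
   header layout and the make_tire() helper are illustrative
   assumptions.

   <CODE BEGINS>
   def process_tide(tie_db, tide, neighbor):
       to_request = []  # TIE headers to ask for in a TIRE
       for hdr in tide.headers:  # (tie_id, seq_nr) pairs in the TIDE
           mine = tie_db.get(hdr.tie_id)
           if mine is None or mine.seq_nr < hdr.seq_nr:
               to_request.append(hdr)  # neighbor holds something newer
           elif mine.seq_nr > hdr.seq_nr:
               neighbor.send(mine)  # we hold something newer: flood it
       if to_request:
           neighbor.send(make_tire(to_request))
   <CODE ENDS>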
4.2.3.6.  Purging

   RIFT does not purge information that has been distributed by the
   protocol.  Purging mechanisms in other routing protocols have proven
   through many years of experience to be complex and fragile.
   Abundant amounts of memory are available today even on low-end
   platforms.  The information will age out, and all computations will
   deliver correct results if a node leaves the network, due to the new
   information distributed by its adjacent nodes.

   Once a RIFT node issues a TIE with an ID, it MUST preserve the ID as
   long as feasible (also when the protocol restarts), even if the TIE
   loses all content.  The re-advertisement of an empty TIE fulfills
   the purpose of purging any information advertised in previous
   versions.  The originator is free to not re-originate the according
   empty TIE again, or to originate an empty TIE with a relatively
   short lifetime to prevent a large number of long-lived empty stubs
   from polluting the network.  Each node will time out and clean up
   the according empty TIEs independently.

4.2.3.7.  Southbound Default Route Origination

   Under certain conditions, nodes issue a default route in their South
   Prefix TIEs.

   A node X that is NOT overloaded AND has southbound or east-west
   adjacencies originates such a default route in its south prefix TIE
   if and only if all other nodes at X's level are overloaded OR have
   NO northbound adjacencies OR X has computed reachability to a
   default route during N-SPF.

   A node originating a southbound default route MUST install a default
   discard route if it did not compute a default route during N-SPF.

4.2.3.8.  Optional Automatic Flooding Reduction and Partitioning

   Several nodes can, but strictly only under the conditions defined
   below, run a hashing function based on the TIE originator value and
   partition flooding between them.

   Steps for flooding reduction and partitioning:

   1.  select all nodes in the same level for which node S-TIEs have
       been received, which have precisely the same non-empty sets of
       respectively north and south neighbor adjacencies, and which
       support flooding reduction (overload bits are ignored), and then

   2.  run on the chosen set a hash algorithm using the nodes' flood
       priorities and IDs to select a flooding leader and backup per
       TIE originator ID, i.e. each node floods immediately through to
       all its necessary neighbors those TIEs that it received with an
       originator ID that makes it the flooding leader or backup for
       this originator.  The preference (higher is better) is computed
       as XOR(TIE-ORIGINATOR-ID<<1, ~OWN-SYSTEM-ID), where << is a
       non-circular shift and ~ is bit-wise NOT.  A non-normative
       sketch of this election is given at the end of this section.

   3.  In the very unlikely case of hash collisions on either of the
       two nodes with the highest values (i.e. either does NOT produce
       unique hashes as compared to all other hash values), the node
       running the election does not attempt to reduce flooding.

   Additional rules for flooding reduction and partitioning:

   1.  A node always floods its own TIEs.

   2.  A node generates TIDEs as usual, but when receiving TIREs with
       requests for TIEs of a node for which it is not a flooding
       leader or backup, it ignores such TIREs on the first request
       only.  Normally, the flooding leader should satisfy the
       requestor, and with that no further TIREs for such TIEs will be
       generated.  Otherwise, the next set of TIDEs and TIREs will lead
       to flooding independent of the flooding leader status.
   3.  A node receiving a TIE originated by a node for which it is not
       a flooding leader floods such TIEs only when receiving an out-
       of-date TIDE for them, except for the first one.

   The mechanism can be implemented optionally in each node.  The
   capability is carried in the node S-TIE (and for symmetry purposes
   in the node N-TIE as well, though it serves no purpose there
   currently).

   Obviously, flooding reduction does NOT apply to self-originated
   TIEs.  Observe further that all policy-guided information consists
   of self-originated TIEs.
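   As a non-normative illustration of step 2 above, the following
   Python sketch computes the election preference and picks a flooding
   leader and backup for a given TIE originator.  The 64-bit width
   assumed for system IDs is an illustrative assumption.

   <CODE BEGINS>
   MASK64 = (1 << 64) - 1

   def fl_preference(tie_originator_id: int,
                     own_system_id: int) -> int:
       # XOR(TIE-ORIGINATOR-ID<<1, ~OWN-SYSTEM-ID); higher is better.
       return (((tie_originator_id << 1) & MASK64)
               ^ (~own_system_id & MASK64))

   def elect_fl(candidate_system_ids, tie_originator_id):
       # Rank eligible nodes by preference; the two highest become
       # flooding leader and backup for this originator.
       ranked = sorted(
           candidate_system_ids, reverse=True,
           key=lambda sid: fl_preference(tie_originator_id, sid))
       backup = ranked[1] if len(ranked) > 1 else None
       return ranked[0], backup
   <CODE ENDS>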
Section 8 provides 820 normative indication of prefix preferences. 822 The set of policy-guided prefixes received in a TIE is subject to 823 ingress filtering and then re-originated to be sent out in the 824 receiver's appropriate TIE. Both the ingress filtering and the re- 825 origination use the communities associated with the policy-guided 826 prefixes to determine the correct behavior. The cost on re- 827 advertisement MUST increase in a strictly monotonic fashion. 829 4.2.4.1. Ingress Filtering 831 When a node X receives a PGP S-TIE or a PGP N-TIE that is originated 832 from a node Y which does not have an adjacency with X, all PGPs in 833 such a TIE MUST be filtered. Similarly, if node Y is at the same 834 layer as node X, then X MUST filter out PGPs in such S- and N-TIEs to 835 prevent loops. 837 Next, policy can be applied to determine which policy-guided prefixes 838 to accept. Since ingress filtering is chosen rather than egress 839 filtering and per-neighbor PGPs, policy that applies to links is done 840 at the receiver. Because the RIFT adjacency is between nodes and 841 there may be parallel links between the two nodes, the policy-guided 842 prefix is considered to start with the next-hop set that has all 843 links to the originating node Y. 845 A policy-guided prefix has or is assigned the following attributes: 847 cost: This is initialized to the cost received 849 community_list: This is initialized to the list of the communities 850 received. 852 next_hop_set: This is initialized to the set of links to the 853 originating node Y. 855 4.2.4.2. Applying Policy 857 The specific action to apply based upon a community is deployment 858 specific. Here are some examples of things that can be done with 859 communities. The length of a community is a 64 bits number and it 860 can be written as a single field M or as a multi-field (S = M[0-31], 861 T = M[32-63]) in these examples. For simplicity, the policy-guided 862 prefix is referred to as P, the processing node as X and the 863 originator as Y. 865 Prune Next-Hops: Community Required: For each next-hop in 866 P.next_hop_set, if the next-hop does not have the community, prune 867 that next-hop from P.next_hop_set. 869 Prune Next-Hops: Avoid Community: For each next-hop in 870 P.next_hop_set, if the next-hop has the community, prune that 871 next-hop from P.next_hop_set. 873 Drop if Community: If node X has community M, discard P. 875 Drop if not Community: If node X does not have the community M, 876 discard P. 878 Prune to ifIndex T: For each next-hop in P.next_hop_set, if the 879 next-hop's ifIndex is not the value T specified in the community 880 (S,T), then prune that next-hop from P.next_hop_set. 882 Add Cost T: For each appearance of community S in P.community_list, 883 if the node X has community S, then add T to P.cost. 885 Accumulate Min-BW T: Let bw be the sum of the bandwidth for 886 P.next_hop_set. If that sum is less than T, then replace (S,T) 887 with (S, bw). 889 Add Community T if Node matches S: If the node X has community S, 890 then add community T to P.community_list. 892 4.2.4.3. Store Policy-Guided Prefix for Route Computation and 893 Regeneration 895 Once a policy-guided prefix has completed ingress filtering and 896 policy, it is almost ready to store and use. It is still necessary 897 to adjust the cost of the prefix to account for the link from the 898 computing node X to the originating neighbor node Y. 
4.2.4.3.  Store Policy-Guided Prefix for Route Computation and
          Regeneration

   Once a policy-guided prefix has completed ingress filtering and
   policy, it is almost ready to store and use.  It is still necessary
   to adjust the cost of the prefix to account for the link from the
   computing node X to the originating neighbor node Y.

   There are three different policies that can be used:

   Minimum Equal-Cost:  Find the lowest cost C next-hops in
      P.next_hop_set and prune to those.  Add C to P.cost.

   Minimum Unequal-Cost:  Find the lowest cost C next-hop in
      P.next_hop_set.  Add C to P.cost.

   Maximum Unequal-Cost:  Find the highest cost C next-hop in
      P.next_hop_set.  Add C to P.cost.

   The default policy is Minimum Unequal-Cost, but well-known
   communities can be defined to get the other behaviors.

   Regardless of the policy used, a node MUST store a PGP cost that is
   at least 1 greater than the PGP cost received.  This enforces the
   strictly monotonically increasing condition that avoids loops.

   Two databases of PGPs - one from N-TIEs and one from S-TIEs - are
   stored.  When a PGP is inserted into the appropriate database, the
   usual tie-breaking on cost is performed.  Observe that the node
   retains all PGP TIEs due to normal flooding behavior, and hence the
   loss of the best prefix will lead to re-evaluation of the TIEs
   present and re-advertisement of a new best PGP.

4.2.4.4.  Re-origination

   A node must re-originate policy-guided prefixes and retransmit them.
   The node has its database of southbound policy-guided prefixes to
   send in its S-TIE and its database of northbound policy-guided
   prefixes to send in its N-TIE.

   Of course, a leaf does not need to re-originate southbound policy-
   guided prefixes.

4.2.4.5.  Overlap with Disaggregated Prefixes

   PGPs may overlap with prefixes introduced by automatic de-
   aggregation.  The topic is under further discussion.  The break in
   connectivity that leads to infeasibility of a PGP is mirrored in
   adjacency tear-down and the according removal of such PGPs.
   Nevertheless, the underlying link-state flooding will likely react
   significantly faster than a hop-by-hop redistribution, and with that
   the preference for PGPs may cause intermittent black-holes.

4.2.5.  Reachability Computation

   A node has three sources of relevant information.  A node knows the
   full topology south of it from the received N-TIEs.  A node has the
   set of prefixes with associated distances and bandwidths from
   received S-TIEs.  A node can also have a set of PGPs.

   To compute reachability, a node conceptually runs a northbound and a
   southbound SPF.  We call those N-SPF and S-SPF.

   Since neither computation can "loop" (with due considerations given
   to PGPs), it is possible to compute non-equal-cost or even
   k-shortest paths [EPPSTEIN] and "saturate" the fabric to the extent
   desired.

4.2.5.1.  Northbound SPF

   N-SPF uses northbound adjacencies in north node TIEs when
   progressing Dijkstra.  N-SPF uses E-W adjacencies ONLY if the node
   itself does NOT have any northbound adjacencies and the adjacent
   node has one or more northbound adjacencies.  This forms a "one-hop
   split-horizon" and prevents looping over default routes while
   allowing for "one-hop protection" of nodes that lost all northbound
   adjacencies.  A detailed example can be found in Section 4.3.8.

   For N-SPF we use the South Node TIEs to find the according
   adjacencies and verify backlink connectivity.  Just as in the case
   of IS-IS or OSPF, unidirectional links are associated together to
   confirm bidirectional connectivity.
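   As a non-normative illustration, the following Python sketch shows
   the backlink check used while expanding a candidate node during
   either SPF; the TIE-database accessor is an assumed helper.

   <CODE BEGINS>
   def backlink_verified(tie_db, node_a, node_b) -> bool:
       # True if node_b's node TIE, taken from the opposite
       # direction's database, also lists node_a as a neighbor, i.e.
       # the unidirectional adjacencies pair up into a bidirectional
       # link.
       node_tie_b = tie_db.node_tie(node_b)  # assumed lookup helper
       if node_tie_b is None:
           return False
       return any(adj.neighbor_id == node_a
                  for adj in node_tie_b.neighbors)
   <CODE ENDS>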
4.2.5.2.  Southbound SPF

   S-SPF uses only the southbound adjacencies in the south node TIEs,
   i.e. it progresses towards nodes at lower levels.  Observe that E-W
   adjacencies are NEVER used in this computation.  This enforces the
   requirement that a packet traversing in a southbound direction must
   never change direction.

   S-SPF uses northbound adjacencies in north node TIEs to verify
   backlink connectivity.

4.2.5.3.  East-West Forwarding Within a Level

   Ultimately, it should be observed that in the presence of a "ring"
   of E-W links in a level, neither SPF will provide a "ring
   protection" scheme, since such a computation would necessarily have
   to deal with breaking "loops" in the generic Dijkstra sense; an
   application for which RIFT is not intended.  It is outside the scope
   of this document how an underlay can be used to provide full-mesh
   connectivity between nodes in the same layer that would allow N-SPF
   to provide protection for a single node losing all its northbound
   adjacencies (as long as any of the other nodes in the level are
   northbound connected).

4.2.6.  Attaching Prefixes

   After the SPF is run, it is necessary to attach the according
   prefixes.  For S-SPF, prefixes from an N-TIE are attached to the
   originating node with that node's next-hop set and a distance equal
   to the prefix's cost plus the node's minimized path distance.  The
   RIFT route database, a set of (prefix, type=spf, path_distance,
   next-hop set), accumulates these results.  Obviously, the prefix
   retains its type, which is used to tie-break between the same prefix
   advertised with different types.

   In the case of N-SPF, prefixes from each S-TIE need to also be added
   to the RIFT route database.  The N-SPF is really just a stub, so the
   computing node simply needs to determine, for each prefix in an
   S-TIE that originated from an adjacent node, what next-hops to use
   to reach that node.  Since there may be parallel links, the next-
   hops to use can be a set; the presence of the computing node in the
   associated Node S-TIE is sufficient to verify that at least one link
   has bidirectional connectivity.  The set of minimum cost next-hops
   from the computing node X to the originating adjacent node is
   determined.

   Each prefix has its cost adjusted before being added into the RIFT
   route database.  The cost of the prefix is set to the cost received
   plus the cost of the minimum cost next-hop to that neighbor.  Then
   each prefix can be added into the RIFT route database with the
   next_hop_set; ties are broken based upon type first and then
   distance.  RIFT route preferences are normalized by the according
   thrift model type.
   An exemplary implementation for node X follows:

   for each S-TIE
     if S-TIE.layer > X.layer
       next_hop_set = set of minimum cost links to the S-TIE.originator
       next_hop_cost = minimum cost link to S-TIE.originator
     end if
     for each prefix P in the S-TIE
       P.cost = P.cost + next_hop_cost
       if P not in route_database:
         add (P, type=DistVector, P.cost, next_hop_set) to
             route_database
       end if
       if (P in route_database) and
          (route_database[P].type is not PolicyGuided):
         if route_database[P].cost > P.cost:
           update route_database[P] with (P, DistVector, P.cost,
                                          next_hop_set)
         else if route_database[P].cost == P.cost
           update route_database[P] with (P, DistVector, P.cost,
                  merge(next_hop_set, route_database[P].next_hop_set))
         else
           // Not preferred route so ignore
         end if
       end if
     end for
   end for

             Figure 3: Adding Routes from S-TIE Prefixes

4.2.7.  Attaching Policy-Guided Prefixes

   Each policy-guided prefix P has its cost and next_hop_set already
   stored in the associated database, as specified in Section 4.2.4.3;
   the cost stored for the PGP is already updated to consider the cost
   of the link to the advertising neighbor.  By definition, a policy-
   guided prefix is preferred to a regular prefix.

   for each policy-guided prefix P:
     if P not in route_database:
       add (P, type=PolicyGuided, P.cost, next_hop_set)
     end if
     if P in route_database:
       if (route_database[P].type is not PolicyGuided) or
          (route_database[P].cost > P.cost):
         update route_database[P] with (P, PolicyGuided, P.cost,
                                        next_hop_set)
       else if route_database[P].cost == P.cost
         update route_database[P] with (P, PolicyGuided, P.cost,
                merge(next_hop_set, route_database[P].next_hop_set))
       else
         // Not preferred route so ignore
       end if
     end if
   end for

         Figure 4: Adding Routes from Policy-Guided Prefixes

4.2.8.  Automatic Disaggregation on Link & Node Failures

   Under normal circumstances, a node's S-TIEs contain just the
   adjacencies, a default route and policy-guided prefixes.  However,
   if a node detects that its default IP prefix covers one or more
   prefixes that are reachable through it but not through one or more
   other nodes at the same level, then it MUST explicitly advertise
   those prefixes in an S-TIE.  Otherwise, some percentage of the
   northbound traffic for those prefixes would be sent to nodes without
   according reachability, causing it to be black-holed.  Even when not
   black-holing, the resulting forwarding could 'backhaul' packets
   through the higher level spines, clearly an undesirable condition
   affecting the blocking probabilities of the fabric.

   We refer to the process of advertising additional prefixes as 'de-
   aggregation' or 'dis-aggregation'.

   A node determines the set of prefixes needing de-aggregation using
   the following steps:

   1.  A DAG computation in the southern direction is performed first,
       i.e. the N-TIEs are used to find all of the prefixes it can
       reach and the set of next-hops in the lower level for each.
       Such a computation can be easily performed on a fat tree by
       e.g. setting all link costs in the southern direction to 1 and
       all northern directions to infinity.  We term the set of those
       prefixes |R, and for each prefix, r, in |R, we define its set
       of next-hops to be |H(r).
       Observe that policy-guided prefixes are NOT affected, since
       their scope is controlled by configuration.

   2.  The node uses reflected S-TIEs to find all nodes at the same
       level in the same PoD and the set of southbound adjacencies for
       each.  The set of nodes at the same level is termed |N, and for
       each node, n, in |N, we define its set of southbound
       adjacencies to be |A(n).

   3.  For a given r, if the intersection of |H(r) and |A(n), for any
       n, is null, then that prefix r must be explicitly advertised by
       the node in an S-TIE.

   4.  An identical set of de-aggregated prefixes is flooded on each
       of the node's southbound adjacencies.  In accordance with the
       normal flooding rules for an S-TIE, a node at the lower level
       that receives this S-TIE will not propagate it south-bound.
       Neither is it necessary for the receiving node to reflect the
       disaggregated prefixes back over its adjacencies to nodes at
       the level from which it was received.

   To summarize the above in simplest terms: if a node detects that its
   default route encompasses prefixes for which one of the other nodes
   in its level has no possible next-hops in the level below, it has to
   disaggregate them to prevent black-holing or suboptimal routing.
   Hence a node X needs to determine if it can reach a different set of
   south neighbors than other nodes at the same level, which are
   connected to it via at least one common south or east-west neighbor.
   If it can, then prefix disaggregation may be required.  If it can't,
   then no prefix disaggregation is needed.  An example of
   disaggregation is provided in Section 5.3.

   A possible algorithm is described below:

   1.  Create partial_neighbors = (empty), a set of neighbors with
       partial connectivity to node X's layer from X's perspective.
       Each entry is a south neighbor of X and a list of nodes of
       X.layer that can't reach that neighbor.

   2.  A node X determines its set of southbound neighbors
       X.south_neighbors.

   3.  For each S-TIE that X has which is originated from a node Y at
       X.layer, if Y.south_neighbors is not the same as
       X.south_neighbors but the nodes share at least one southern
       neighbor, then for each neighbor N in X.south_neighbors but not
       in Y.south_neighbors, add (N, (Y)) to partial_neighbors if N
       isn't there, or add Y to the list for N.

   4.  If partial_neighbors is empty, then node X does not need to
       disaggregate any prefixes.  If node X is advertising
       disaggregated prefixes in its S-TIE, X SHOULD remove them and
       re-advertise its according S-TIEs.

   A node X computes its SPF based upon the received N-TIEs.  This
   results in a set of routes, each categorized by (prefix,
   path_distance, next-hop-set).  Alternately, for clarity in the
   following procedure, these can be organized by next-hop-set as
   ((next-hops), {(prefix, path_distance)}).  If partial_neighbors
   isn't empty, then the following procedure describes how to identify
   prefixes to disaggregate.
A node X computes its SPF based upon the received N-TIEs.  This
results in a set of routes, each categorized by (prefix,
path_distance, next-hop-set).  Alternately, for clarity in the
following procedure, these can be organized by next-hop-set as
((next-hops), {(prefix, path_distance)}).  If partial_neighbors is
not empty, then the following procedure describes how to identify
prefixes to disaggregate.

    disaggregated_prefixes = { empty }
    nodes_same_layer = { empty }

    for each S-TIE
      if (S-TIE.layer == X.layer and
          S-TIE.originator shares at least one S-neighbor with X)
        add S-TIE.originator to nodes_same_layer
      end if
    end for

    for each next-hop-set NHS
      isolated_nodes = nodes_same_layer
      for each NH in NHS
        if NH in partial_neighbors
          isolated_nodes = intersection(isolated_nodes,
                                        partial_neighbors[NH].nodes)
        else
          // NH is reachable by all same-layer nodes
          isolated_nodes = { empty }
        end if
      end for

      if isolated_nodes is not empty
        for each prefix using NHS
          add (prefix, distance) to disaggregated_prefixes
        end for
      end if
    end for

    copy disaggregated_prefixes to X's S-TIE
    if X's S-TIE is different
      schedule S-TIE for flooding
    end if

           Figure 5: Computation to Disaggregate Prefixes

Each disaggregated prefix is sent with the accurate path_distance.
This allows a node to send the same S-TIE to each south neighbor.
The south neighbor that is connected to that prefix will thus have a
shorter path.

Finally, to summarize the less obvious points partially omitted in
the algorithms to keep them more tractable:

1.  All neighbor relationships MUST perform backlink checks.

2.  Overload bits, as introduced in Section 4.3.1, have to be
    respected during the computation.

3.  All the lower level nodes are flooded the same disaggregated
    prefixes, since building an S-TIE per node would complicate
    things unnecessarily.  The PoD containing the prefix will prefer
    the southbound path anyway.

4.  Disaggregated prefixes do NOT have to propagate to lower levels.
    With that, the disturbance in terms of new flooding is contained
    to the single level experiencing failures.

5.  Disaggregated prefix S-TIEs are not "reflected" by the lower
    layer, i.e. nodes within the same level do NOT need to be aware
    which node computed the need for disaggregation.

6.  The fabric still supports maximum load balancing properties
    while not trying to send traffic northbound unless necessary.

4.3. Further Mechanisms

4.3.1. Overload Bit

The overload bit MUST be respected in all according reachability
computations.  A node with the overload bit set SHOULD NOT advertise
any reachability prefixes southbound except locally hosted ones.

A leaf node SHOULD set the 'overload' bit on its node TIEs, since if
the spine nodes were to forward traffic not meant for the local
node, the leaf node does not have the topology information to
prevent a routing/forwarding loop.

4.3.2. Optimized Route Computation on Leafs

Since leafs only "see one hop away", they do not need to run a full
SPF; they can simply gather prefix candidates from their neighbors
and build the according routing table.

A leaf will have no N-TIEs except its own and, optionally, those
from its east-west neighbors.  A leaf will have S-TIEs from its
neighbors.

Instead of creating a network graph from its N-TIEs and its
neighbors' S-TIEs and then running an SPF, a leaf node can simply
compute the minimum cost and next_hop_set to each leaf neighbor by
examining its local interfaces, determining bi-directionality from
the associated N-TIE, and deriving the neighbor's next_hop_set and
cost from the minimum cost local interfaces to that neighbor.

Then a leaf attaches prefixes as in Section 4.2.6 as well as the
policy-guided prefixes as in Section 4.2.7.
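A sketch of this leaf-only computation follows, under the assumption
of a hypothetical local interface record carrying the neighbor's
system ID, the link cost and a bi-directionality flag derived from
the neighbor's N-TIE (none of these names are normative):

    # Illustrative leaf-only next-hop derivation; no SPF involved.
    from dataclasses import dataclass

    @dataclass
    class Interface:
        name: str            # local interface name
        neighbor_id: int     # neighbor's system ID from its N-TIE
        cost: int            # configured or default link cost
        bidirectional: bool  # backlink check against the N-TIE

    def leaf_next_hops(interfaces):
        """Map each neighbor to (min cost, interfaces at that cost)."""
        best = {}
        for ifc in interfaces:
            if not ifc.bidirectional:
                continue
            cost, names = best.get(ifc.neighbor_id, (None, None))
            if cost is None or ifc.cost < cost:
                best[ifc.neighbor_id] = (ifc.cost, {ifc.name})
            elif ifc.cost == cost:
                names.add(ifc.name)
        return best

Prefixes from the neighbors' S-TIEs are then attached on top of
these next-hop sets exactly as in Figure 3 and Figure 4.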
4.3.3. Key/Value Store

The protocol supports a southbound distribution of key-value pairs
that can be used to e.g. distribute configuration information during
topology bring-up.  The KV TIEs (which are always S-TIEs) can arrive
from multiple nodes; tie-breaking per key uses the following rules:

1.  Only KV TIEs originated by a node to which the receiver has an
    adjacency are considered.

2.  Within all valid KV S-TIEs containing the key, the value of the
    S-TIE with the highest level and, within the same level, the
    highest originator ID is preferred.

Observe that if a node goes down, the node south of it loses the
adjacencies to it and with that the according KVs will be
disregarded; on tie-break changes, new KVs are re-advertised to
prevent stale information being used by nodes further south.  KV
information is not the result of an independent computation at every
node but of a diffused computation.
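The two tie-breaking rules reduce to a lexicographic maximum over
(level, originator ID).  An illustrative sketch, assuming the caller
has already filtered the candidates down to adjacent originators
(the tuple shape is hypothetical):

    # Hypothetical KV tie-break: a candidate is a tuple of
    # (originator level, originator system ID, value).

    def select_kv(candidates):
        """Return the winning value for one key, or None."""
        if not candidates:
            return None
        level, originator, value = max(
            candidates, key=lambda c: (c[0], c[1]))
        return value

    # select_kv([(1, 7, "a"), (2, 3, "b"), (2, 9, "c")]) -> "c"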
4.3.4. Interactions with BFD

RIFT MAY incorporate BFD [RFC5881] to react quickly to link
failures.  In such a case, the following procedures are introduced:

   After RIFT 3-way hello adjacency convergence, a BFD session MAY
   be formed automatically between the RIFT endpoints without
   further configuration.

   In case RIFT loses the 3-way hello adjacency, the BFD session
   should be brought down until the 3-way adjacency is formed again.

   In case an established BFD session goes Down after it was Up, the
   RIFT adjacency should be re-initialized from scratch.

   In case of parallel links between nodes, each link may run its
   own independent BFD session.

   In case RIFT changes link identifiers, both the hello as well as
   the BFD sessions will be brought down and back up again.

4.3.5. Leaf to Leaf Procedures

RIFT can optionally be relaxed to allow leaf East-West adjacencies
under an additional set of rules:

   Only nodes supporting leaf-to-leaf procedures MAY advertise LIEs
   on E-W links at level 0 and MUST in such a case advertise the
   according flag in their node capabilities as "true".

   The overload bit MUST be set on all of the leaf's node TIEs.

   Only the node's own north and south TIEs are flooded over an E-W
   leaf adjacency.

   An E-W leaf adjacency is always used in both the north as well as
   the south computation.

   Any advertised aggregate in the leaf's south TIE MUST install a
   discard route.

This will allow the E-W leaf nodes to exchange traffic strictly for
the prefixes advertised in each other's north prefix TIEs (since the
southbound computation will find the reverse direction in the other
node's TIE and install its north prefixes).

4.3.6. Other End-to-End Services

Losing the full, flat topology information at every node will have
an impact on some end-to-end network services.  This is the price
paid for minimal disturbance in case of failures and for reduced
flooding and memory requirements on nodes lower south in the level
hierarchy.

4.3.7. Address Family and Multi Topology Considerations

Multi-Topology (MT) [RFC5120] and Multi-Instance (MI) [RFC6822] are
used today in link-state routing protocols to support several
domains on the same physical topology.  RIFT supports this
capability by carrying transport ports in the LIE protocol
exchanges.  Multiplexing of LIEs can be achieved by either choosing
varying multicast addresses or varying ports on the same address.

BFD interactions in Section 4.3.4 are implementation dependent when
multiple RIFT instances run on the same link.

4.3.8. Reachability of Internal Nodes in the Fabric

RIFT does not precondition that its nodes have reachable addresses,
although for operational purposes this is clearly desirable.  Under
normal operating conditions this can be easily achieved by e.g.
injecting the node's loopback address into North Prefix TIEs.

Things get more interesting in case a node loses all its northbound
adjacencies but is not at the top of the fabric.  In such a case, a
node that detects that some other members at its level are
advertising northbound adjacencies MAY inject its loopback address
into a southbound PGP TIE and become reachable "from the south" that
way.  Further, a solution may be implemented where, based on e.g. a
"well known" community, such a southbound PGP is reflected at level
0 and advertised as a northbound PGP again to allow for
"reachability from the north" at the cost of additional flooding.

4.3.9. One-Hop Healing of Levels with East-West Links

Based on the rules defined in Section 4.2.5 and Section 4.2.3.7, and
given the presence of E-W links, RIFT can provide one-hop protection
for nodes that lost all their northbound links.  Section 5.4
explains the resulting behavior based on an example.

5. Examples

5.1. Normal Operation

This section describes RIFT deployment in the example topology
without any node or link failures.  We disregard flooding reduction
for simplicity's sake.

As a first step, the following bi-directional adjacencies will be
created (and any other links that do not fulfill the LIE rules in
Section 4.2.2 will be disregarded):

1.  Spine 21 (PoD 0) to Node 111, Node 112, Node 121, and Node 122

2.  Spine 22 (PoD 0) to Node 111, Node 112, Node 121, and Node 122

3.  Node 111 to Leaf 111, Leaf 112

4.  Node 112 to Leaf 111, Leaf 112

5.  Node 121 to Leaf 121, Leaf 122

6.  Node 122 to Leaf 121, Leaf 122

Consequently, N-TIEs would be originated by Node 111 and Node 112
and each set would be sent to both Spine 21 and Spine 22.  N-TIEs
also would be originated by Leaf 111 (w/ Prefix 111) and Leaf 112
(w/ Prefix 112 and the multi-homed prefix) and each set would be
sent to Node 111 and Node 112.  Node 111 and Node 112 would then
flood these N-TIEs to Spine 21 and Spine 22.

Similarly, N-TIEs would be originated by Node 121 and Node 122 and
each set would be sent to both Spine 21 and Spine 22.  N-TIEs also
would be originated by Leaf 121 (w/ Prefix 121 and the multi-homed
prefix) and Leaf 122 (w/ Prefix 122) and each set would be sent to
Node 121 and Node 122.  Node 121 and Node 122 would then flood these
N-TIEs to Spine 21 and Spine 22.

At this point both Spine 21 and Spine 22, as well as any controller
to which they are connected, would have the complete network
topology.  At the same time, Node 111/112/121/122 hold only the
N-TIEs of level 0 of their respective PoD.  Leafs hold only their
own N-TIEs.

S-TIEs with adjacencies and a default IP prefix would then be
originated by Spine 21 and Spine 22 and each would be flooded to
Node 111, Node 112, Node 121, and Node 122.  Node 111, Node 112,
Node 121, and Node 122 would each send the S-TIE from Spine 21 to
Spine 22 and the S-TIE from Spine 22 to Spine 21.
(S-TIEs are reflected up to the level from which they are received
but they are NOT propagated southbound.)

An S-TIE with a default IP prefix would be originated by Node 111
and Node 112 and each would be sent to Leaf 111 and Leaf 112.
Leaf 111 and Leaf 112 would each send the S-TIE from Node 111 to
Node 112 and the S-TIE from Node 112 to Node 111.

Similarly, an S-TIE with a default IP prefix would be originated by
Node 121 and Node 122 and each would be sent to Leaf 121 and
Leaf 122.  Leaf 121 and Leaf 122 would each send the S-TIE from
Node 121 to Node 122 and the S-TIE from Node 122 to Node 121.  At
this point IP connectivity with maximum possible ECMP has been
established between the leafs, while constraining the amount of
information held by each node to the minimum necessary for normal
operation and dealing with failures.

5.2. Leaf Link Failure

   .   |   |               |   |
   . +-+---+-+           +-+---+-+
   . |       |           |       |
   . |Node111|           |Node112|
   . +-+---+-+           ++----+-+
   .   |   |              |    |
   .   |   +----------------+  X
   .   |                  |  |  X  Failure
   .   |   +--------------+  |  X
   .   |   |                 |
   . +-+---+-+           +--+--+-+
   . |       |           |       |
   . |Leaf111|           |Leaf112|
   . +-------+           +-------+
   .     +                   +
   .  Prefix111           Prefix112

              Figure 6: Single Leaf link failure

In case of a failing leaf link between node 112 and leaf 112, the
link-state information will cause re-computation of the necessary
SPF and the higher levels will stop forwarding towards prefix 112
through node 112.  Only nodes 111 and 112, as well as both spines,
will see control traffic.  Leaf 111 will receive a new S-TIE from
node 112 and reflect it back to node 111.  Node 111 will
de-aggregate prefix 111 and prefix 112, but we will not describe
that further here since de-aggregation is emphasized in the next
example.  It is worth observing in this example, however, that if
leaf 111 kept forwarding traffic towards prefix 112 using the
advertised south-bound default of node 112, the traffic would end up
on spine 21 and spine 22 and cross back into pod 1 using node 111.
This is arguably not as bad as the black-holing present in the next
example but clearly undesirable.  Fortunately, de-aggregation
prevents this type of behavior except for a transitory period of
time.

5.3. Partitioned Fabric

   .          +--------+          +--------+     S-TIE of Spine21
   .          |        |          |        |     received by
   .          |Spine 21|          |Spine 22|     reflection of
   .          ++-+--+-++          ++-+--+-++     Nodes 112 and 111
   .           | |  | |            | |  | |
   .           | |  | |            | |  | 0/0
   .           | |  | |            | |  | |
   .           | |  | |            | |  | |
   . +---------+ |  | +--- XXXXXX +  | |  | +---------------+
   . |           |  |             |  | |  |                 |
   . |  +-----------------------------+ |  |                |
   . 0/0 |       |  |             |     |  |                |
   . |  0/0     0/0 +- XXXXXXXXXXXXXXXXXXXXXXXXX -+         |
   . |   | 1.1/16   |                   |         |         |
   . |   |        +-+                   |         |         |
   . |   |        |   +-0/0-------------+         |         |
   . |   |        |   |  1.1/16                   |         |
   .+-+---+-+   +-+---+-+           ++-----0/0  ++----0/0
   .|       |   |       |           | 1.1/16    | 1.1/16
   .|Node111|   |Node112|           |Node121|   |Node122|
   .+-+---+-+   ++----+-+           +-+---+-+   ++---+--+
   .  |   |      |    |               |   |      |   |
   .  |   +--------------+            |   +--------------+
   .  |          |    |  |            |          |       |
   .  |   +------+    |  |            |   +------+       |
   .  |   |           |  |            |   |              |
   .+-+---+-+   +--+--+--+          +-+---+-+   +---+-+-+
   .|       |   |        |          |       |   |       |
   .|Leaf111|   |Leaf112 |          |Leaf121|   |Leaf122|
   .+-+-----+   ++-------+          +-----+-+   +-+-----+
   .    +            +                    +         +
   . Prefix111   Prefix112            Prefix121  Prefix122
   .                                   1.1/16

                   Figure 7: Fabric partition
Figure 7 shows the arguably most catastrophic, but also the most
interesting, case.  Spine 21 is completely severed from access to
Prefix 121 (we use 1.1/16 in the figure as example) by a double link
failure.  However unlikely, if left unresolved, forwarding from
leaf 111 and leaf 112 to prefix 121 would suffer 50% black-holing
based on pure default route advertisements by spine 21 and spine 22.

The mechanism used to resolve this scenario hinges on the
distribution of the southbound representation by spine 21 that is
reflected by node 111 and node 112 to spine 22.  Spine 22, having
computed reachability to all prefixes in the network, advertises
with the default route the ones that are reachable only via lower
level neighbors that spine 21 does not show an adjacency to.  That
results in node 111 and node 112 obtaining a longest-prefix match to
prefix 121 which leads through spine 22 and prevents black-holing
through spine 21, which still advertises the 0/0 aggregate only.

The prefix 121 advertised by spine 22 does not have to be propagated
further towards the leafs since they do not benefit from this
information.  Hence the amount of flooding is restricted to spine 21
reissuing its S-TIEs and the reflection of those by node 111 and
node 112.  The resulting SPF in spine 22 issues a new prefix S-TIE
containing 1.1/16.  None of the leafs become aware of the changes
and the failure is constrained strictly to the level that became
partitioned.

To finish with an example of the resulting sets computed using the
notation introduced in Section 4.2.8, spine 22 constructs the
following sets:

   |R = Prefix 111, Prefix 112, Prefix 121, Prefix 122

   |H (for r=Prefix 111) = Node 111, Node 112

   |H (for r=Prefix 112) = Node 111, Node 112

   |H (for r=Prefix 121) = Node 121, Node 122

   |H (for r=Prefix 122) = Node 121, Node 122

   |A (for Spine 21) = Node 111, Node 112

With that, and |H (for r=prefix 121) and |H (for r=prefix 122) being
disjoint from |A (for spine 21), spine 22 will originate an S-TIE
with prefix 121 and prefix 122, that is flooded to nodes 111, 112,
121 and 122.
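Expressed in code, the decision spine 22 makes reduces to a
set-intersection test.  An illustrative Python rendering using the
example's sets (the short names are stand-ins for the prefixes and
nodes above):

    # The disjointness test of Section 4.2.8 on the example's sets.
    H = {"P111": {"N111", "N112"}, "P112": {"N111", "N112"},
         "P121": {"N121", "N122"}, "P122": {"N121", "N122"}}
    A = {"Spine21": {"N111", "N112"}}

    to_disaggregate = [r for r, nhs in H.items()
                       if any(not (nhs & adj) for adj in A.values())]
    # -> ["P121", "P122"]: these must be advertised southbound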
5.4. Northbound Partitioned Router and Optional East-West Links

   .     +                   +                   +
   .     X  N1               |  N2               |  N3
   .     X                   |                   |
   . +--+----+          +--+----+          +--+-----+
   . |       |0/0>  <0/0|       |0/0>  <0/0|        |
   . |  A01  +----------+  A02  +----------+  A03   |   Level 1
   . ++-+-+--+          ++--+--++          +---+-+-++
   .  | | |              |  |  |            |  |  |
   .  | | +----------------------------------+  |  |
   .  | |                |  |  |                |  |
   .  | +---------+      |  |  |      +---------+  |
   .  |           |      |  |  |      |            |
   .  |      +-----------+  |  +-----------+       |
   .  |      |    |         |         |    |       |
   .  | +------------------------------------+     |
   .  | |    |    |         |         |    | |     |
   . ++-+-+--+    |     +---+---+     |    ++-+---+-+
   . |       |    +-----+       +-----+    |        |
   . |  L01  |          |  L02  |          |  L03   |   Level 0
   . +-------+          +-------+          +--------+

               Figure 8: North Partitioned Router

Figure 8 shows a part of a fabric where level 1 is horizontally
connected and A01 lost its only northbound adjacency.  Based on the
N-SPF rules in Section 4.2.5.1, A01 will compute northbound
reachability by using the link A01 to A02 (whereas A02 will NOT use
this link during N-SPF).  Hence A01 will still advertise the default
towards level 0 and route unidirectionally using the horizontal
link.  Moreover, based on Section 4.3.8, it may advertise its
loopback address as a southbound PGP to remain reachable "from the
south" for operational purposes.  This is necessary since A02 will
NOT route towards A01 using the E-W link (doing otherwise may form
routing loops).

As a further consideration, the moment A02 loses link N2 the
situation evolves again.  A01 will have no more northbound
reachability while still seeing A03 advertising northbound
adjacencies in its south node TIE.  With that, it will stop
advertising a default route due to Section 4.2.3.7.  Moreover, A02
may now inject its loopback address as a southbound PGP.

6. Implementation and Operation: Further Details

6.1. Considerations for Leaf-Only Implementation

Ideally, RIFT can be stretched out to the lowest level in the IP
fabric to integrate ToRs or even servers.  Since those entities
would run as leafs only, it is worth observing that a leaf-only
version is significantly simpler to implement and requires much
fewer resources:

1.  Under normal conditions, the leaf needs to support a multipath
    default route only.  In the worst partitioning case, it has to
    be capable of accommodating all the leaf routes in its own PoD
    to prevent black-holing.

2.  Leaf nodes hold only their own N-TIEs and the S-TIEs of the
    Level 1 nodes they are connected to, so overall few TIEs in
    number.

3.  A leaf node does not have to support flooding reduction and
    de-aggregation.

4.  Unless optional leaf-to-leaf procedures are desired, neither
    default route origination nor S-TIE origination is necessary.

6.2. Adaptations to Other Proposed Data Center Topologies

   .       +-----+          +-----+
   .       |     |          |     |
   .     +-+ S0  |          | S1  |
   .     | ++---++          ++---++
   .     |  |   |            |   |
   .     |  |   +--------------+ |
   .     |  |               |  | |
   .     |  |   +-----------+  | |
   .     |  |   |              | |
   .     | ++---+--+        +--+-++
   .     | |       |        |     |
   .     | |  A0   |        | A1  |
   .     | +-+---+-+        ++---++
   .     |   |   |           |   |
   .     |   |   +-------------+ |
   .     |   |              |  | |
   .     |   |   +----------+  | |
   .     |   |   |             | |
   .     | +-+---+-+        +--+-++
   .     +-+       |        |     |
   .       |  L0   |        | L1  |
   .       +-------+        +-----+

                   Figure 9: Level Shortcut

Strictly speaking, RIFT is not limited to Clos variations only.  The
protocol preconditions only a sense of 'compass rose direction'
achieved by configuration of levels, and other topologies are
possible within this framework.  So, conceptually, one could include
leaf-to-leaf links and even shortcuts between layers, but certain
requirements in Section 3 will not be met anymore.  As an example,
shortcutting levels as illustrated in Figure 9 will lead either to
suboptimal routing when L0 sends traffic to L1 (since using S0's
default route will lead to the traffic being sent back to A0 or A1)
or the leafs need each other's routes installed to understand that
only A0 and A1 should be used to talk to each other.

Whether such modifications of topology constraints make sense
depends on many technology variables and an exhaustive treatment of
the topic is definitely outside the scope of this document.

6.3. Originating Non-Default Route Southbound

Obviously, an implementation may choose to originate southbound,
instead of a strict default route (as described in Section 4.2.3.7),
a shorter prefix P'; in such a scenario, however, all addresses
carried within the RIFT domain must be contained within P'.
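The containment condition on P' is easy to verify mechanically.  An
illustrative sketch using Python's ipaddress module; the check
itself is an operational aid, not part of the protocol:

    # Verify that every prefix carried in the RIFT domain is covered
    # by the non-default southbound prefix P'.
    import ipaddress

    def covers_all(p_prime, domain_prefixes):
        net = ipaddress.ip_network(p_prime)
        return all(ipaddress.ip_network(p).subnet_of(net)
                   for p in domain_prefixes)

    # covers_all("10.0.0.0/8", ["10.1.0.0/16", "10.2.0.0/16"])
    # -> True; a False result means P' is too short a summary.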
7. Security Considerations

The protocol has provisions for nonces and can include
authentication mechanisms in the future, comparable to [RFC5709] and
[RFC7987].

One can additionally consider attack vectors where a router may
reboot many times while changing its system ID and pollute the
network with many stale TIEs, or where TIEs are sent with very long
lifetimes and not cleaned up after the routes vanish.  Those attack
vectors are not unique to RIFT.  Given the large memory footprints
available today, those attacks should be relatively benign.
Otherwise, a node can implement a strategy of e.g. discarding the
contents of all TIEs of nodes that were not present in the SPF tree
over a certain period of time.  Since the protocol, like all modern
link-state protocols, is self-stabilizing and will advertise the
presence of such TIEs to its neighbors, they can be re-requested
again if a computation finds that it sees an adjacency formed
towards the system ID of the discarded TIEs.

8. Information Elements Schema

This section introduces the schema for information elements.

On schema changes that

1.  change field numbers or

2.  add new required fields or

3.  change lists into sets, unions into structures or

4.  change multiplicity of fields or

5.  change datatypes of any field or

6.  change the default value of any field

the major version of the schema MUST increase.  All other changes
MUST increase the minor version within the same major.

Thrift serializers/deserializers MUST NOT discard optional, unknown
fields but preserve and serialize them again when re-flooding.

All signed integers, as forced by Thrift support, must be cast for
internal purposes to equivalent unsigned values without discarding
the signedness bit.  An implementation SHOULD try to avoid using the
signedness bit when generating values.

The schema is normative.
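One consequence of these rules is that a receiver can gate packet
processing on the major version alone, since minor revisions within
a major version stay wire-compatible.  A minimal sketch of such a
gate; the drop-on-mismatch policy shown here is a plausible
implementation choice, not something the text mandates:

    # Hypothetical version gate: differing major versions are
    # incompatible; higher minors within the same major may carry
    # unknown optional fields that MUST be preserved on re-flooding.
    CURRENT_MAJOR, CURRENT_MINOR = 2, 1  # mirrors encoding.thrift

    def accept_packet(major, minor):
        if major != CURRENT_MAJOR:
            return False    # incompatible schema, drop the packet
        # minor differences are compatible; keep unknown optionals
        return True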
8.1. common.thrift

   /** Thrift file for common definitions in RIFT */

   namespace * models

   typedef i64 SystemID
   typedef i32 IPv4Address
   /** this has to be of length long enough to accommodate prefix */
   typedef binary IPv6Address
   typedef i16 UDPPortType
   typedef i32 TIENrType
   typedef i16 MTUSizeType
   typedef i32 SeqNrType
   /** lifetime in seconds */
   typedef i32 LifeTimeType
   typedef i16 LevelType
   typedef i16 PodType
   typedef i16 VersionType
   typedef i32 MetricType
   typedef string KeyIDType
   /** node local, unique identification for a link, this is kept
    *  at 32 bits so it aligns with BFD [RFC5880] discriminator size
    */
   typedef i32 LinkIDType
   typedef string KeyNameType
   /** indicates whether the direction is northbound/east-west
    *  or southbound */
   typedef bool TieDirectionType
   typedef byte PrefixLenType
   /** timestamp in seconds since the epoch */
   typedef i64 TimestampInSecsType
   /** security nonce */
   typedef i64 NonceType

   const LevelType default_level = 0
   const PodType default_pod = 0
   const LinkIDType undefined_linkid = 0
   const MetricType default_distance = 1
   /** any distance larger than this will be considered infinity */
   const MetricType infinite_distance = 0x70000000
   /** any element with 0 distance will be ignored,
    *  missing metrics will be replaced with default_distance
    */
   const MetricType invalid_distance = 0
   const bool overload_default = false;
   const bool flood_reduction_default = true;
   const bool leaf_2_leaf_procedures_default = false;

   enum AddressFamilyType {
       Illegal = 0,
       AddressFamilyMinValue = 1,
       IPv4 = 2,
       IPv6 = 3,
       AddressFamilyMaxValue = 4,
   }

   struct IPv4PrefixType {
       1: required IPv4Address address;
       2: required PrefixLenType prefixlen;
   }

   struct IPv6PrefixType {
       1: required IPv6Address address;
       2: required PrefixLenType prefixlen;
   }

   union IPPrefixType {
       1: optional IPv4PrefixType ipv4prefix;
       2: optional IPv6PrefixType ipv6prefix;
   }

   enum TIETypeType {
       Illegal = 0,
       TIETypeMinValue = 1,
       /** first legal value */
       NodeTIEType = 2,
       PrefixTIEType = 3,
       PGPrefixTIEType = 4,
       KeyValueTIEType = 5,
       TIETypeMaxValue = 6,
   }

   /** @note: route types which MUST be ordered on preference
    *  PGP prefixes are most preferred, attracting
    *  traffic north (towards spine);
    *  normal prefixes are attracting traffic south (towards leafs),
    *  i.e. prefix in NORTH PREFIX TIE is preferred
    */
   enum RouteType {
       Illegal = 0,
       RouteTypeMinValue = 1,
       /** First legal value.  Local prefixes are
        *  locally hosted routes on the system.
        */
       LocalPrefix = 2,
       SouthPGPPrefix = 3,
       NorthPGPPrefix = 4,
       NorthPrefix = 5,
       /** advertised in N-TIEs */
       SouthPrefix = 6,
       RouteTypeMaxValue = 7
   }
8.2. encoding.thrift

   /** Thrift file for packet encodings for RIFT */

   include "common.thrift"

   namespace * models

   /** represents protocol major version */
   const i32 current_major_version = 2
   /** represents protocol minor version */
   const i32 current_minor_version = 1

   /** common RIFT packet header */
   struct PacketHeader {
       1: required common.VersionType major_version
              = current_major_version;
       2: required common.VersionType minor_version
              = current_minor_version;
       /** this is the node sending the packet, in case of
           LIE/TIRE/TIDE also the originator of it */
       3: required common.SystemID sender;
       /** level of the node sending the packet */
       4: required common.LevelType level;
   }

   /** protocol packet structure */
   struct ProtocolPacket {
       1: required PacketHeader header;
       2: required Content content;
   }

   union Content {
       1: optional LIE hello;
       2: optional TIDEPacket tide;
       3: optional TIREPacket tire;
       4: optional TIEPacket tie;
   }

   /** Community serves as community for PGP purposes */
   struct Community {
       1: required i32 top;
       2: required i32 bottom;
   }

   /** @todo: flood header separately in UDP to allow caching of
       TIEs while changing lifetime?
    */
   struct TIEPacket {
       1: required TIEHeader header;
       2: required TIEElement element;
   }

   /** RIFT LIE packet

       @note this node's level is already included in the packet
       header */
   struct LIE {
       1: optional string name;
       /** local link ID */
       2: required common.LinkIDType local_id;
       /** UDP port to which we can flood TIEs, same address
           as the hello TX this hello has been received on */
       3: required common.UDPPortType flood_port;
       /** this will reflect the neighbor once received */
       5: optional Neighbor neighbor;
       6: optional common.PodType pod = common.default_pod;
       /** optional nonce used for security computations */
       7: optional common.NonceType nonce;
       /** optional node capabilities shown in the LIE.  The
           capabilities MUST match the capabilities shown in the
           Node TIEs, otherwise the behavior is unspecified.  A node
           detecting the mismatch SHOULD generate an according
           error.
        */
       8: optional NodeCapabilities capabilities;
   }
   /** Neighbor structure */
   struct Neighbor {
       1: required common.SystemID originator;
       2: required common.LinkIDType remote_id;
   }

   /** LinkID pair describes one of parallel links between
       two nodes */
   struct LinkIDPair {
       /** node-wide unique value for the local link */
       1: required common.LinkIDType local_id;
       /** received remote link ID for this link */
       2: required common.LinkIDType remote_id;
       /** more properties of the link can go in here */
   }

   /** ID of a TIE
       @note: TIEID space is a total order achieved by comparing
       the elements in sequence defined and comparing each value as
       an unsigned integer of according length
    */
   struct TIEID {
       /** indicates whether N or S-TIE, True > False */
       1: required common.TieDirectionType northbound;
       2: required common.SystemID originator;
       3: required common.TIETypeType tietype;
       4: required common.TIENrType tie_nr;
   }

   /** Header of a TIE */
   struct TIEHeader {
       2: required TIEID tieid;
       3: required common.SeqNrType seq_nr;
       4: required common.LifeTimeType lifetime;
   }

   /** A sorted TIDE packet, if unsorted, behavior is undefined */
   struct TIDEPacket {
       /** all 00s marks start */
       1: required TIEID start_range;
       /** all FFs mark end */
       2: required TIEID end_range;
       /** _sorted_ list of headers */
       3: required list<TIEHeader> headers;
   }

   /** A TIRE packet */
   struct TIREPacket {
       1: required set<TIEHeader> headers;
   }

   /** Neighbor of a node */
   struct NodeNeighborsTIEElement {
       2: required common.LevelType level;
       3: optional common.MetricType cost = common.default_distance;
       /** can carry description of multiple parallel links in a TIE
        *  all parallel links to same node incur same cost
        **/
       4: optional set<LinkIDPair> link_ids;
   }

   /** Capabilities the node supports */
   struct NodeCapabilities {
       /** can this node participate in flood reduction,
           only relevant at level > 0 */
       1: required bool flood_reduction;
       /** does this node support leaf-2-leaf procedures,
           only relevant at level 0 */
       2: optional bool leaf_2_leaf_procedures;
   }

   /** Flags the node sets */
   struct NodeFlags {
       /** node is in overload, do not transit traffic through it */
       1: required bool overload;
   }

   /** Description of a node.

       It may occur multiple times in different TIEs but if either

       * capabilities values do not match or
       * flags values do not match or
       * neighbors repeat with different values

       the behavior is undefined and a warning SHOULD be generated.

       @note: observe that absence of fields implies defined
       defaults
    */
   struct NodeTIEElement {
       1: required common.LevelType level;
       2: optional NodeCapabilities capabilities =
          {
            "flood_reduction": common.flood_reduction_default,
            "leaf_2_leaf_procedures":
                common.leaf_2_leaf_procedures_default
          };
       3: optional NodeFlags flags =
          {
            "overload": common.overload_default
          };
       /** if neighbor systemID repeats in other node TIEs of same
           node the behavior is undefined */
       4: required map<common.SystemID,
                       NodeNeighborsTIEElement> neighbors;
   }
   /** multiple prefixes */
   struct PrefixTIEElement {
       /** prefixes with the associated cost.
           if the same prefix repeats in multiple TIEs of same node
           or with different metrics, behavior is unspecified */
       1: required map<common.IPPrefixType,
                       common.MetricType> prefixes;
   }

   /** keys with their values */
   struct KeyValueTIEElement {
       /** if the same key repeats in multiple TIEs of same node
           or with different values, behavior is unspecified */
       1: required map<common.KeyIDType, string> keyvalues;
   }

   /** single element in a TIE.  enum common.TIETypeType
       in TIEID indicates which elements MUST be present
       in the TIEElement.  In case of mismatch the unexpected
       elements MUST be ignored.
    */
   union TIEElement {
       /** in case of enum common.TIETypeType.NodeTIEType */
       1: optional NodeTIEElement node;
       /** in case of enum common.TIETypeType.PrefixTIEType */
       2: optional PrefixTIEElement prefixes;
       /** in case of enum common.TIETypeType.KeyValueTIEType */
       3: optional KeyValueTIEElement keyvalues;
       /** @todo: policy guided prefixes */
   }

9. IANA Considerations

This specification will request, at an opportune time, multiple
registry points to exchange protocol packets in a standardized way,
amongst them multicast address assignments and standard port
numbers.  The schema itself defines many values and codepoints which
can be considered registries themselves.

10. Security Considerations

Security mechanisms will be addressed in upcoming versions of this
specification.

11. Acknowledgments

Many thanks to Naiming Shen for some of the early discussions around
the topic of using IGPs for routing in topologies related to Clos.
Adrian Farrel and Jeffrey Zhang provided thoughtful comments that
improved the readability of the document and found a good amount of
corners where the light failed to shine.  Kris Price was first to
mention single router, single arm default considerations.  Jeff
Tantsura helped out with some initial thoughts on BFD interactions
while Jeff Haas corrected several misconceptions about BFD's finer
points.

12. References

12.1. Normative References

   [ISO10589] ISO "International Organization for Standardization",
              "Intermediate system to Intermediate system
              intra-domain routeing information exchange protocol
              for use in conjunction with the protocol for providing
              the connectionless-mode Network Service (ISO 8473),
              ISO/IEC 10589:2002, Second Edition.", Nov 2002.

   [RFC1142]  Oran, D., Ed., "OSI IS-IS Intra-domain Routing
              Protocol", RFC 1142, DOI 10.17487/RFC1142,
              February 1990,
              <https://www.rfc-editor.org/info/rfc1142>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998,
              <https://www.rfc-editor.org/info/rfc2328>.

   [RFC2365]  Meyer, D., "Administratively Scoped IP Multicast",
              BCP 23, RFC 2365, DOI 10.17487/RFC2365, July 1998,
              <https://www.rfc-editor.org/info/rfc2365>.

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006,
              <https://www.rfc-editor.org/info/rfc4271>.

   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
              Architecture", RFC 4291, DOI 10.17487/RFC4291,
              February 2006,
              <https://www.rfc-editor.org/info/rfc4291>.

   [RFC4655]  Farrel, A., Vasseur, J., and J. Ash, "A Path
              Computation Element (PCE)-Based Architecture",
              RFC 4655, DOI 10.17487/RFC4655, August 2006,
              <https://www.rfc-editor.org/info/rfc4655>.

   [RFC5120]  Przygienda, T., Shen, N., and N. Sheth, "M-ISIS:
              Multi Topology (MT) Routing in Intermediate System to
              Intermediate Systems (IS-ISs)", RFC 5120,
              DOI 10.17487/RFC5120, February 2008,
              <https://www.rfc-editor.org/info/rfc5120>.
   [RFC5303]  Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way
              Handshake for IS-IS Point-to-Point Adjacencies",
              RFC 5303, DOI 10.17487/RFC5303, October 2008,
              <https://www.rfc-editor.org/info/rfc5303>.

   [RFC5709]  Bhatia, M., Manral, V., Fanto, M., White, R., Barnes,
              M., Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA
              Cryptographic Authentication", RFC 5709,
              DOI 10.17487/RFC5709, October 2009,
              <https://www.rfc-editor.org/info/rfc5709>.

   [RFC5881]  Katz, D. and D. Ward, "Bidirectional Forwarding
              Detection (BFD) for IPv4 and IPv6 (Single Hop)",
              RFC 5881, DOI 10.17487/RFC5881, June 2010,
              <https://www.rfc-editor.org/info/rfc5881>.

   [RFC6234]  Eastlake 3rd, D. and T. Hansen, "US Secure Hash
              Algorithms (SHA and SHA-based HMAC and HKDF)",
              RFC 6234, DOI 10.17487/RFC6234, May 2011,
              <https://www.rfc-editor.org/info/rfc6234>.

   [RFC6822]  Previdi, S., Ed., Ginsberg, L., Shand, M., Roy, A.,
              and D. Ward, "IS-IS Multi-Instance", RFC 6822,
              DOI 10.17487/RFC6822, December 2012,
              <https://www.rfc-editor.org/info/rfc6822>.

   [RFC7855]  Previdi, S., Ed., Filsfils, C., Ed., Decraene, B.,
              Litkowski, S., Horneffer, M., and R. Shakir, "Source
              Packet Routing in Networking (SPRING) Problem
              Statement and Requirements", RFC 7855,
              DOI 10.17487/RFC7855, May 2016,
              <https://www.rfc-editor.org/info/rfc7855>.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use
              of BGP for Routing in Large-Scale Data Centers",
              RFC 7938, DOI 10.17487/RFC7938, August 2016,
              <https://www.rfc-editor.org/info/rfc7938>.

   [RFC7987]  Ginsberg, L., Wells, P., Decraene, B., Przygienda,
              T., and H. Gredler, "IS-IS Minimum Remaining
              Lifetime", RFC 7987, DOI 10.17487/RFC7987,
              October 2016,
              <https://www.rfc-editor.org/info/rfc7987>.

12.2. Informative References

   [CLOS]     Yuan, X., "On Nonblocking Folded-Clos Networks in
              Computer Communication Environments", IEEE
              International Parallel & Distributed Processing
              Symposium, 2011.

   [DIJKSTRA] Dijkstra, E., "A Note on Two Problems in Connexion
              with Graphs", Journal Numer. Math., 1959.

   [DYNAMO]   De Candia et al., G., "Dynamo: amazon's highly
              available key-value store", ACM SIGOPS symposium on
              Operating systems principles (SOSP '07), 2007.

   [EPPSTEIN] Eppstein, D., "Finding the k-Shortest Paths", 1997.

   [FATTREE]  Leiserson, C., "Fat-Trees: Universal Networks for
              Hardware-Efficient Supercomputing", 1985.

   [MAKSIC2013]
              Maksic et al., N., "Improving Utilization of Data
              Center Networks", IEEE Communications Magazine,
              Nov 2013.

   [QUIC]     Iyengar et al., J., "QUIC: A UDP-Based Multiplexed and
              Secure Transport", 2016.

   [VAHDAT08] Al-Fares, M., Loukissas, A., and A. Vahdat, "A
              Scalable, Commodity Data Center Network Architecture",
              SIGCOMM, 2008.

Authors' Addresses

   Tony Przygienda
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA 94089
   US

   Email: prz@juniper.net

   Alankar Sharma
   Comcast
   1800 Bishops Gate Blvd
   Mount Laurel, NJ 08054
   US

   Email: Alankar_Sharma@comcast.com

   John Drake
   Juniper Networks
   1194 N. Mathilda Ave
   Sunnyvale, CA 94089
   US

   Email: jdrake@juniper.net

   Alia Atlas
   Juniper Networks
   10 Technology Park Drive
   Westford, MA 01886
   US

   Email: akatlas@juniper.net