idnits 2.17.1 draft-ietf-rift-rift-15.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 58 instances of too long lines in the document, the longest one being 30 characters in excess of 72. == There are 2 instances of lines with multicast IPv4 addresses in the document. If these are generic example addresses, they should be changed to use the 233.252.0.x range defined in RFC 5771 Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 7065 has weird spacing: '...berType undef...' == Line 7067 has weird spacing: '...velType top_...' == Line 7069 has weird spacing: '...itsType defau...' == Line 7071 has weird spacing: '...velType leaf...' == Line 7072 has weird spacing: '...velType defa...' == (31 more instances...) == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. 
  == The expression 'MAY NOT', while looking like RFC 2119 requirements text,
     is not defined in RFC 2119, and should not be used. Consider using 'MUST
     NOT' instead (if that is what you mean).

     Found 'MAY NOT' in this paragraph:

     Any attempt to transition from a state towards another on reception of an
     event where no action is specified MUST be considered an unrecoverable
     error, i.e. the protocol MUST reset all adjacencies, discard all the state
     and MAY NOT start again.

  == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD',
     or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please
     use uppercase 'NOT' together with RFC 2119 keywords (if that is what you
     mean).

     Found 'MUST not' in this paragraph:

     A prefix can carry the `directly_attached` attribute to indicate that the
     prefix is directly attached, i.e. should be routed to even if the node is
     in overload. In case of a negatively distributed prefix this attribute
     MUST not be included by the originator and it MUST be ignored by all nodes
     during SPF computation. If a prefix is locally originated the attribute
     `from_link` can indicate the interface to which the address belongs to.
     In case of a negatively distributed prefix this attribute MUST NOT be
     included by the originator and it MUST be ignored by all nodes during
     computation. A prefix can also carry the `loopback` attribute to indicate
     the said property.

  -- The document date (28 December 2021) is 850 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'A' is mentioned on line 258, but not defined == Missing Reference: 'B' is mentioned on line 258, but not defined == Missing Reference: 'C' is mentioned on line 268, but not defined == Missing Reference: 'D' is mentioned on line 268, but not defined == Missing Reference: 'E' is mentioned on line 261, but not defined == Missing Reference: 'F' is mentioned on line 261, but not defined == Missing Reference: 'NH' is mentioned on line 3481, but not defined == Missing Reference: 'P' is mentioned on line 3688, but not defined == Missing Reference: 'RFC5880' is mentioned on line 7023, but not defined -- Possible downref: Non-RFC (?) normative reference: ref. 'EUI64' ** Obsolete normative reference: RFC 6830 (Obsoleted by RFC 9300, RFC 9301) -- Possible downref: Non-RFC (?) normative reference: ref. 'VFR' == Outdated reference: A later version (-14) exists of draft-ietf-rift-applicability-10 Summary: 2 errors (**), 0 flaws (~~), 21 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 RIFT Working Group A. Przygienda, Ed. 3 Internet-Draft Juniper 4 Intended status: Standards Track A. Sharma 5 Expires: 1 July 2022 Comcast 6 P. Thubert 7 Cisco 8 Bruno. Rijsman 9 Individual 10 Dmitry. Afanasiev 11 Yandex 12 28 December 2021 14 RIFT: Routing in Fat Trees 15 draft-ietf-rift-rift-15 17 Abstract 19 This document defines a specialized, dynamic routing protocol for 20 Clos and fat-tree network topologies optimized towards minimization 21 of control plane state as well as configuration and operational 22 complexity. 
24 Status of This Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at https://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 39 This Internet-Draft will expire on 1 July 2022. 41 Copyright Notice 43 Copyright (c) 2021 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 48 license-info) in effect on the date of publication of this document. 49 Please review these documents carefully, as they describe your rights 50 and restrictions with respect to this document. Code Components 51 extracted from this document must include Simplified BSD License text 52 as described in Section 4.e of the Trust Legal Provisions and are 53 provided without warranty as described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4 58 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 7 59 2. A Reader's Digest . . . . . . . . . . . . . . . . . . . . . . 7 60 3. Reference Frame . . . . . . . . . . . . . . . . . . . . . . . 9 61 3.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 9 62 3.2. Topology . . . . . . . . . . . . . . . . . . . . . . . . 15 63 4. RIFT: Routing in Fat Trees . . . . . . . . . . . . . . . . . 16 64 4.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 16 65 4.1.1. Properties . . 
. . . . . . . . . . . . . . . . . . . 17 66 4.1.2. Generalized Topology View . . . . . . . . . . . . . . 17 67 4.1.3. Fallen Leaf Problem . . . . . . . . . . . . . . . . . 29 68 4.1.4. Discovering Fallen Leaves . . . . . . . . . . . . . . 31 69 4.1.5. Addressing the Fallen Leaves Problem . . . . . . . . 32 70 4.2. Specification . . . . . . . . . . . . . . . . . . . . . . 33 71 4.2.1. Transport . . . . . . . . . . . . . . . . . . . . . . 34 72 4.2.2. Link (Neighbor) Discovery (LIE Exchange) . . . . . . 35 73 4.2.3. Topology Exchange (TIE Exchange) . . . . . . . . . . 49 74 4.2.4. Reachability Computation . . . . . . . . . . . . . . 74 75 4.2.5. Automatic Disaggregation on Link & Node Failures . . 76 76 4.2.6. Attaching Prefixes . . . . . . . . . . . . . . . . . 82 77 4.2.7. Optional Zero Touch Provisioning (ZTP) . . . . . . . 90 78 4.3. Further Mechanisms . . . . . . . . . . . . . . . . . . . 102 79 4.3.1. Route Preferences . . . . . . . . . . . . . . . . . . 102 80 4.3.2. Overload Bit . . . . . . . . . . . . . . . . . . . . 103 81 4.3.3. Optimized Route Computation on Leaves . . . . . . . . 103 82 4.3.4. Mobility . . . . . . . . . . . . . . . . . . . . . . 104 83 4.3.5. Key/Value Store . . . . . . . . . . . . . . . . . . . 107 84 4.3.6. Interactions with BFD . . . . . . . . . . . . . . . . 108 85 4.3.7. Fabric Bandwidth Balancing . . . . . . . . . . . . . 109 86 4.3.8. Label Binding . . . . . . . . . . . . . . . . . . . . 111 87 4.3.9. Leaf to Leaf Procedures . . . . . . . . . . . . . . . 111 88 4.3.10. Address Family and Multi Topology Considerations . . 112 89 4.3.11. One-Hop Healing of Levels with East-West Links . . . 112 90 4.4. Security . . . . . . . . . . . . . . . . . . . . . . . . 112 91 4.4.1. Security Model . . . . . . . . . . . . . . . . . . . 112 92 4.4.2. Security Mechanisms . . . . . . . . . . . . . . . . . 114 93 4.4.3. Security Envelope . . . . . . . . . . . . . . . . . . 115 94 4.4.4. Weak Nonces . . . . . . . . . . . . . . . . . . . . . 
118 95 4.4.5. Lifetime . . . . . . . . . . . . . . . . . . . . . . 120 96 4.5. Security Association Changes . . . . . . . . . . . . . . 120 98 5. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 120 99 5.1. Normal Operation . . . . . . . . . . . . . . . . . . . . 120 100 5.2. Leaf Link Failure . . . . . . . . . . . . . . . . . . . . 122 101 5.3. Partitioned Fabric . . . . . . . . . . . . . . . . . . . 123 102 5.4. Northbound Partitioned Router and Optional East-West 103 Links . . . . . . . . . . . . . . . . . . . . . . . . . . 125 104 6. Further Details on Implementation . . . . . . . . . . . . . . 126 105 6.1. Considerations for Leaf-Only Implementation . . . . . . . 126 106 6.2. Considerations for Spine Implementation . . . . . . . . . 127 107 7. Security Considerations . . . . . . . . . . . . . . . . . . . 127 108 7.1. General . . . . . . . . . . . . . . . . . . . . . . . . . 127 109 7.2. Malformed Packets . . . . . . . . . . . . . . . . . . . . 128 110 7.3. ZTP . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 111 7.4. Lifetime . . . . . . . . . . . . . . . . . . . . . . . . 128 112 7.5. Packet Number . . . . . . . . . . . . . . . . . . . . . . 128 113 7.6. Outer Fingerprint Attacks . . . . . . . . . . . . . . . . 129 114 7.7. TIE Origin Fingerprint DoS Attacks . . . . . . . . . . . 129 115 7.8. Host Implementations . . . . . . . . . . . . . . . . . . 129 116 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 130 117 8.1. Requested Multicast and Port Numbers . . . . . . . . . . 130 118 8.2. Requested Registries with Suggested Values . . . . . . . 130 119 8.2.1. Registry RIFT_v5/common/AddressFamilyType" . . . . . 131 120 8.2.2. Registry RIFT_v5/common/HierarchyIndications" . . . . 131 121 8.2.3. Registry RIFT_v5/common/IEEE802_1ASTimeStampType" . . 131 122 8.2.4. Registry RIFT_v5/common/IPAddressType" . . . . . . . 132 123 8.2.5. Registry RIFT_v5/common/IPPrefixType" . . . . . . . . 132 124 8.2.6. 
Registry RIFT_v5/common/IPv4PrefixType" . . . . . . . 133 125 8.2.7. Registry RIFT_v5/common/IPv6PrefixType" . . . . . . . 133 126 8.2.8. Registry RIFT_v5/common/PrefixSequenceType" . . . . . 133 127 8.2.9. Registry RIFT_v5/common/RouteType" . . . . . . . . . 134 128 8.2.10. Registry RIFT_v5/common/TIETypeType" . . . . . . . . 135 129 8.2.11. Registry RIFT_v5/common/TieDirectionType" . . . . . . 135 130 8.2.12. Registry RIFT_v5/encoding/Community" . . . . . . . . 136 131 8.2.13. Registry RIFT_v5/encoding/KeyValueTIEElement" . . . . 136 132 8.2.14. Registry RIFT_v5/encoding/LIEPacket" . . . . . . . . 137 133 8.2.15. Registry RIFT_v5/encoding/LinkCapabilities" . . . . . 139 134 8.2.16. Registry RIFT_v5/encoding/LinkIDPair" . . . . . . . . 139 135 8.2.17. Registry RIFT_v5/encoding/Neighbor" . . . . . . . . . 140 136 8.2.18. Registry RIFT_v5/encoding/NodeCapabilities" . . . . . 141 137 8.2.19. Registry RIFT_v5/encoding/NodeFlags" . . . . . . . . 141 138 8.2.20. Registry RIFT_v5/encoding/NodeNeighborsTIEElement" . 142 139 8.2.21. Registry RIFT_v5/encoding/NodeTIEElement" . . . . . . 142 140 8.2.22. Registry RIFT_v5/encoding/PacketContent" . . . . . . 143 141 8.2.23. Registry RIFT_v5/encoding/PacketHeader" . . . . . . . 144 142 8.2.24. Registry RIFT_v5/encoding/PrefixAttributes" . . . . . 144 143 8.2.25. Registry RIFT_v5/encoding/PrefixTIEElement" . . . . . 145 144 8.2.26. Registry RIFT_v5/encoding/ProtocolPacket" . . . . . . 145 145 8.2.27. Registry RIFT_v5/encoding/TIDEPacket" . . . . . . . . 146 146 8.2.28. Registry RIFT_v5/encoding/TIEElement" . . . . . . . . 146 147 8.2.29. Registry RIFT_v5/encoding/TIEHeader" . . . . . . . . 147 148 8.2.30. Registry RIFT_v5/encoding/TIEHeaderWithLifeTime" . . 147 149 8.2.31. Registry RIFT_v5/encoding/TIEID" . . . . . . . . . . 148 150 8.2.32. Registry RIFT_v5/encoding/TIEPacket" . . . . . . . . 148 151 8.2.33. Registry RIFT_v5/encoding/TIREPacket" . . . . . . . . 149 152 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . 
. . 149 153 10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 150 154 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 150 155 11.1. Normative References . . . . . . . . . . . . . . . . . . 150 156 11.2. Informative References . . . . . . . . . . . . . . . . . 152 157 Appendix A. Sequence Number Binary Arithmetic . . . . . . . . . 154 158 Appendix B. Information Elements Schema . . . . . . . . . . . . 155 159 B.1. Backwards-Compatible Extension of Schema . . . . . . . . 156 160 B.2. common.thrift . . . . . . . . . . . . . . . . . . . . . . 157 161 B.3. encoding.thrift . . . . . . . . . . . . . . . . . . . . . 163 162 Appendix C. Constants . . . . . . . . . . . . . . . . . . . . . 170 163 C.1. Configurable Protocol Constants . . . . . . . . . . . . . 170 164 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 172 166 1. Introduction 168 Clos [CLOS] topologies (called commonly a fat tree/network in modern 169 IP fabric considerations [VAHDAT08] as homonym to the original 170 definition of the term [FATTREE]) have gained prominence in today's 171 networking, primarily as result of the paradigm shift towards a 172 centralized data-center based architecture that is poised to deliver 173 a majority of computation and storage services in the future. Many 174 builders of such IP fabrics desire a protocol that auto-configures 175 itself and deals with failures and mis-configurations with a minimum 176 of human intervention. Such a solution would allow local IP fabric 177 bandwidth to be consumed in a 'standard component' fashion, i.e. 178 provision it much faster and operate it at much lower costs than 179 today, much like compute or storage is consumed already. 
181    In looking at the problem through the lens of such IP fabric
182    requirements, RIFT addresses those challenges not through an
183    incremental modification of either link-state (distributed
184    computation) or distance-vector (diffused computation) techniques but
185    rather through a mixture of both, colloquially best described as
186    "link-state towards the spines" and "distance vector towards the
187    leaves". In other words, the "bottom" levels flood their link-state
188    information in the "northern" direction while each node generates,
189    under normal conditions, a "default route" and floods it in the
190    "southern" direction. This type of protocol naturally allows for
191    highly desirable aggregation. Alas, such aggregation could blackhole
192    traffic in cases of misconfiguration or while failures are being
193    resolved, or even cause partial network partitioning, and this has to
194    be addressed by some adequate mechanism. The approach RIFT takes is
195    described in Section 4.2.5 and is based on automatic,
196    sufficient disaggregation of prefixes in case of link and node
197    failures.
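As a rough illustration of the "link-state towards the spines" and
"distance vector towards the leaves" split described above, the following
Python sketch computes the routes each node would hold on a small fabric.
It is not the RIFT route computation itself. The topology (leaves A and B,
spines C and D, top nodes E and F) and the names `LEVEL`, `LINKS`,
`nodes_below` and `routes` are assumptions made only for this example.

```python
# Hypothetical three-level fabric, assumed for illustration only:
# leaves A and B (level 0), spines C and D, top-of-fabric nodes E and F.
LEVEL = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2, "F": 2}
LINKS = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
         ("C", "E"), ("C", "F"), ("D", "E"), ("D", "F")]


def neighbors(node):
    """Return all nodes sharing a link with `node`."""
    return [b if a == node else a for a, b in LINKS if node in (a, b)]


def northbound(node):
    """Neighbors at a higher level than `node`."""
    return [n for n in neighbors(node) if LEVEL[n] > LEVEL[node]]


def southbound(node):
    """Neighbors at a lower level than `node`."""
    return [n for n in neighbors(node) if LEVEL[n] < LEVEL[node]]


def nodes_below(node):
    """All nodes reachable from `node` by only moving south."""
    below = set()
    for south in southbound(node):
        below.add(south)
        below |= nodes_below(south)
    return below


def routes(node):
    """Return {prefix: sorted next hops} held by `node` in this toy model."""
    table = {}
    # Link-state flows north, so a node knows every node below it and can
    # hold a specific loopback route via each southbound neighbor.
    for south in southbound(node):
        for target in {south} | nodes_below(south):
            table.setdefault(target + "/32", set()).add(south)
    # Only a default route is sent south, so a node with northbound
    # neighbors holds 0/0 load balanced across all of them.
    north = northbound(node)
    if north:
        table["0/0"] = set(north)
    return {prefix: sorted(hops) for prefix, hops in table.items()}


for name in ("E", "C", "A"):
    print(name, routes(name))
```

In this toy model a leaf ends up with nothing but a default route, while a
top node holds a specific route for every node below it, which is the
aggregation property the paragraph above refers to.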
199    The protocol further provides

201    * optional fully automated construction of fat-tree topologies based
202      on detection of links without any configuration (Section 4.2.7)
203      while allowing for traditional configuration and an arbitrary mix
204      of both types of nodes as well,

206    * a minimum amount of routing state held at each level,

208    * automatic pruning and load balancing of topology flooding
209      exchanges over a sufficient subset of links, which resolves the
210      traditional problem of link-state protocols struggling with densely
211      meshed graphs due to the high volume of flooding traffic
212      (Section 4.2.3.9),

214    * automatic aggregation (Section 4.2.3.8) and consequently automatic
215      disaggregation (Section 4.2.5) of prefixes on link and node
216      failures to prevent black-holing and suboptimal routing,

218    * loop-free non-ECMP forwarding due to its inherent valley-free
219      nature,

221    * fast mobility (Section 4.3.4),

223    * re-balancing of traffic towards the spines based on bandwidth
224      available (Section 4.3.7.1) and finally

226    * mechanisms to synchronize a limited key-value data-store
227      (Section 4.3.5.1) that can be used after protocol convergence to
228      e.g. bootstrap higher levels of functionality on nodes.

230    Figure 1 presents, as a first example of operation, a simplified,
231    conceptual view of the resulting information and routes on a RIFT
232    fabric. The top of the fabric holds in its link-state database
233    the information about the nodes below it and the routes to them,
234    where the notation A/32 is used to indicate a loopback route to
235    node A and 0/0 is the usual notation for a default route. The first
236    row of information represents the nodes for which full topology
237    information is available. The second row of the database table
238    indicates that partial information about other nodes in the same
239    level is available as well. Such information is needed by
240    certain algorithms necessary for correct protocol operation.
When
241    the "bottom" of the fabric is considered, or in other words the
242    leaves, the topology is basically empty and, under normal conditions,
243    the leaves hold a load-balanced default route to the next level.

245    The balance of this document fills in the protocol specification
246    details.

248               [A,B,C,D]
249                  [E]
250                +-----+      +-----+
251                |  E  |      |  F  | A/32 @ [C,D]
252                +-+-+-+      +-+-+-+ B/32 @ [C,D]
253                  | |          | |   C/32 @ C
254                  | |    +-----+ |   D/32 @ D
255                  | |    |       |
256                  | +------+     |
257                  |      | |     |
258          [A,B] +-+---+  | | +---+-+ [A,B]
259          [D]   |  C  +--+ +-+  D  | [C]
260                +-+-+-+      +-+-+-+
261  0/0 @ [E,F]     | |          | |   0/0 @ [E,F]
262  A/32 @ A        | |    +-----+ |   A/32 @ A
263  B/32 @ B        | |    |       |   B/32 @ B
264                  | +------+     |
265                  |      | |     |
266                +-+---+  | | +---+-+
267                |  A  +--+ +-+  B  |
268  0/0 @ [C,D]   +-----+      +-----+ 0/0 @ [C,D]

270              Figure 1: RIFT Information Distribution

272 1.1. Requirements Language

274    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
275    "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
276    "OPTIONAL" in this document are to be interpreted as described in BCP
277    14 [RFC2119] [RFC8174] when, and only when, they
278    appear in all capitals, as shown here.

280 2. A Reader's Digest

282    This section should serve as an initial guided tour through the
283    document in order to convey the necessary information for any reader,
284    depending on their level of interest. The glossary section
285    (Section 3.1) should be used as a supporting reference as the
286    document is read.

288    The indications of direction (i.e. "top", "bottom", etc.) referenced
289    in Section 1 are of paramount importance. RIFT requires a
290    topology with a sense of top and bottom in order to properly achieve
291    a sorted topology. Clos, Fat-Tree, and other similarly structured
292    networks are conducive to such requirements. RIFT does allow for
293    further relaxation of these constraints, which will be mentioned
294    later in this section.
296    Operators and implementors alike must understand if multi-plane IP
297    fabrics are of interest or not. Section 3.2 illustrates both a
298    single-plane fabric in Figure 2 and a multi-plane fabric in Figure 3.
299    Multi-plane fabrics require understanding of additional RIFT concepts
300    (e.g. negative disaggregation in Section 4.2.5.2) that are otherwise
301    unnecessary in the context of strictly single-plane fabrics. The
302    Overview (Section 4.1) and Section 4.1.2 aim to provide enough context
303    to determine if multi-plane fabrics are of interest to the reader. The
304    Fallen Leaf part (Section 4.1.3), and additionally Section 4.1.4 and
305    Section 4.1.5, describe further considerations that are specific to
306    multi-plane fabrics.

308    The fundamental protocol concepts are described starting in the
309    specification part (Section 4.2), but some sub-sections are not quite
310    as relevant unless dealing with implementation of the protocol. The
311    protocol transport (Section 4.2.1) is of particular importance for
312    two reasons. First, it introduces RIFT's packet formats in the form
313    of a normative Thrift model given in Appendix B.3. Second, the
314    Thrift model component is a prelude to understanding RIFT's
315    inherent security features as defined in the security segment
316    (Section 7). The normative schema defining the Thrift model can be
317    found in both Appendix B.2 and Appendix B.3. Furthermore, while a
318    detailed understanding of Thrift and the models is not required
319    unless implementing RIFT, it may provide additional useful
320    information for other readers.

322    If implementing RIFT to support multi-plane topologies, Section 4.2
323    should be reviewed in its entirety in conjunction with the previously
324    mentioned Thrift schemas. Sections not relevant to single-plane
325    implementations will be noted later in the section.
Special
326    attention should be paid to the LIE definitions part (Section 4.2.2)
327    as it not only outlines basic neighbor discovery and adjacency
328    formation, but also provides necessary context for RIFT's ZTP
329    (Section 4.2.7) and mis-cabling detection capabilities that allow it
330    to automatically detect and build the underlay topology with
331    negligible configuration. These specific capabilities are detailed
332    in Section 4.2.7.

334    For other readers, the following sections provide a more detailed
335    understanding of the fundamental properties and highlight some
336    additional benefits of RIFT such as link-state packet formats, highly
337    efficient flooding, synchronization, loop-free path computation and
338    link-state database maintenance: Section 4.2.3, Section 4.2.3.2,
339    Section 4.2.3.3, Section 4.2.3.4, Section 4.2.3.6, Section 4.2.3.7,
340    Section 4.2.3.8, Section 4.2.4, Section 4.2.4.1, Section 4.2.4.2,
341    Section 4.2.4.3, Section 4.2.4.4. RIFT's unique ability to perform
342    weighted unequal-cost load balancing of traffic across all available
343    links is outlined in Section 4.3.7 with an accompanying example.

345    Section 4.2.5 is the place where the single-plane vs. multi-plane
346    requirement is explained in more detail. For those interested in
347    single-plane fabrics, only Section 4.2.5.1 is required. For readers
348    interested in multi-plane fabrics, Section 4.2.5.2, Section 4.2.5.2.1,
349    Section 4.2.5.2.2, and Section 4.2.5.2.3 are also mandatory.
350    Section 4.2.6 is especially important for any reader interested in
351    multi-plane fabrics as it not only outlines how the RIB and FIB are
352    built via the disaggregation mechanisms, but also illustrates how they
353    prevent defective routing decisions (e.g. black holes) in both
354    single-plane and multi-plane topologies.

356    Section 5 contains a set of comprehensive examples that continue to
357    highlight just how efficiently RIFT handles failures by containing
358    impact to only the required set of nodes.
It should also help cement 359 some of RIFT's core concepts in the reader's mind. 361 Last, but not least, RIFT has other optional capabilities. One 362 example is the key-value data-store, which enables RIFT to advertise 363 data post-convergence in order to bootstrap higher levels of 364 functionality (e.g. operational telemetry). Those are covered in 365 Section 4.3 and Section 6. 367 More information related to RIFT can be found in the "RIFT 368 Applicability" [APPLICABILITY] document, which discusses alternate 369 topologies upon which RIFT may be deployed, use cases where it is 370 applicable, and presents operational considerations that complement 371 this document. 373 3. Reference Frame 375 3.1. Terminology 377 This section presents the terminology used in this document. 379 Crossbar: 380 Physical arrangement of ports in a switching matrix without 381 implying any further scheduling or buffering disciplines. 383 Clos/Fat Tree: 384 This document uses the terms Clos and Fat Tree interchangeably 385 whereas it always refers to a folded spine-and-leaf topology with 386 possibly multiple Points of Delivery (PoDs) and one or multiple 387 Top of Fabric (ToF) planes. Several modifications such as leaf- 388 2-leaf shortcuts and multiple level shortcuts are possible and 389 described further in the document. 391 Directed Acyclic Graph (DAG): 392 A finite directed graph with no directed cycles (loops). If links 393 in a Clos are considered as either being all directed towards the 394 top or vice versa, each of such two graphs is a DAG. 396 Folded Spine-and-Leaf: 397 In case the Clos fabric input and output stages are analogous, the 398 fabric can be "folded" to build a "superspine" or top which is 399 called Top of Fabric (ToF) in this document. 
401    Level:
402       Clos and Fat Tree networks are topologically partially ordered
403       graphs and 'level' denotes the set of nodes at the same height in
404       such a network, where the bottom level (leaf) is the level with
405       the lowest value. A node has links to nodes one level down and/or
406       one level up. Under some circumstances, a node may have links to
407       nodes at the same level and a leaf may have links to nodes
408       multiple levels higher. RIFT counts levels from top-of-fabric
409       (ToF) numerically down. Level 0 always implies a leaf in RIFT but
410       a leaf does not have to be level 0. A node's level in RIFT can be
411       configured or derived automatically via ZTP as explained
412       in Section 4.2.7. As a final footnote: Clos terminology often uses
413       the concept of "stage" but due to the folded nature of the Fat
414       Tree it is not used from this point on to prevent
415       misunderstandings.

417    Superspine, Aggregation/Spine and Edge/Leaf Switches:
418       Traditional level names in a 5-stage folded Clos for Level 2, 1 and
419       0 respectively (counting up from the bottom). We normalize this
420       language to talk about top-of-fabric (ToF), top-of-pod (ToP) and
421       leaves.

423    Zero Touch Provisioning (ZTP):
424       Optional RIFT mechanism that derives node levels
425       automatically based on minimal configuration. Such a minimal
426       configuration consists solely of ToFs being configured as such.

428    Point of Delivery (PoD):
429       A self-contained vertical slice or subset of a Clos or Fat Tree
430       network containing normally only level 0 and level 1 nodes. A
431       node in a PoD communicates with nodes in other PoDs via the Top-
432       of-Fabric. PoDs are numbered to distinguish them and PoD value 0
433       (defined later in the encoding schema as `common.default_pod`) is
434       used to denote "undefined" or "any" PoD.

436    Top of PoD (ToP):
437       The set of nodes that provide intra-PoD communication and have
438       northbound adjacencies outside of the PoD, i.e. are at the "top"
439       of the PoD.
441    Top of Fabric (ToF):
442       The set of nodes that provide inter-PoD communication and have no
443       northbound adjacencies, i.e. are at the "very top" of the fabric.
444       ToF nodes do not belong to any PoD and are assigned the
445       `common.default_pod` PoD value to indicate the equivalent of "any"
446       PoD.

448    Spine:
449       Any node north of leaves and south of top-of-fabric nodes.
450       Multiple layers of spines in a PoD are possible.

452    Leaf:
453       A node without southbound adjacencies. As mentioned before, Level
454       0 implies a leaf in RIFT but a leaf does not have to be level 0.

456    Top-of-fabric Plane or Partition:
457       In large fabrics, top-of-fabric switches may not have enough ports
458       to aggregate all switches south of them and, as a result, the ToF
459       is 'split' into multiple independent planes. Section 4.1.2 explains
460       the concept in more detail. A plane is a subset of ToF nodes that
461       see each other through south reflection or E-W links.

463    Radix:
464       The radix of a switch is the number of switching ports it provides.
465       It is sometimes called fanout as well.

467    North Radix:
468       Ports cabled northbound to higher level nodes.

470    South Radix:
471       Ports cabled southbound to lower level nodes.

473    South/Southbound and North/Northbound (Direction):
474       When describing protocol elements and procedures, the
475       directionality of the compass is used in different situations.
476       That is, 'lower', 'south' or 'southbound' mean moving towards the
477       bottom of the Clos or Fat Tree network and 'higher', 'north' and
478       'northbound' mean moving towards the top of the Clos or Fat Tree
479       network.

481    Northbound Link:
482       A link to a node one level up or in other words, one level further
483       north.

485    Southbound Link:
486       A link to a node one level down or in other words, one level
487       further south.

489    East-West (E-W) Link:
490       A link between two nodes at the same level. East-West links are
491       normally not part of Clos or "fat-tree" topologies.
493    Leaf shortcuts (L2L):
494       East-West links at leaf level will need to be differentiated from
495       East-West links at other levels.

497    Routing on the host (RotH):
498       Modern data center architecture variant where servers/leaves are
499       multi-homed and consequently participate in routing.

501    Northbound representation:
502       Subset of topology information flooded towards higher levels of
503       the fabric.

505    Southbound representation:
506       Subset of topology information sent towards a lower level.

508    South Reflection:
509       Often abbreviated just as "reflection", it defines a mechanism
510       where South Node TIEs are "reflected" from the level south back up
511       north to allow nodes in the same level without E-W links to "see"
512       each other's node Topology Information Elements (TIEs).

514    TIE:
515       This is an acronym for a "Topology Information Element". TIEs are
516       exchanged between RIFT nodes to describe parts of a network such
517       as links and address prefixes. A TIE always has a direction and a
518       type. North TIEs (sometimes abbreviated as N-TIEs) are used when
519       dealing with TIEs in the northbound representation and South TIEs
520       (sometimes abbreviated as S-TIEs) for the southbound equivalent.
521       TIEs have different types such as node and prefix TIEs.

523    Node TIE:
524       This is an acronym for a "Node Topology Information Element",
525       which contains all adjacencies the node discovered and information
526       about the node itself. A Node TIE should not be confused with a
527       North TIE since "node" defines the type of TIE rather than its
528       direction. Consequently North Node TIEs and South Node TIEs
529       exist.

531    Prefix TIE:
532       This is an acronym for a "Prefix Topology Information Element" and
533       it contains all prefixes directly attached to this node in the case
534       of a North TIE and, in the case of a South TIE, the necessary
535       default routes the node advertises southbound.

537    Key Value (KV) TIE:
538       A TIE that is carrying a set of key value pairs [DYNAMO].
It can
539       be used to distribute non-topology-related information within the
540       protocol.

542    TIDE:
543       Topology Information Description Element carrying descriptors of
544       the TIEs stored in the node.

546    TIRE:
547       Topology Information Request Element carrying a set of TIDE
548       descriptors. It can both confirm received and request missing
549       TIEs.

551    Disaggregation:
552       Process in which a node decides to advertise more specific
553       prefixes southwards, either positively to attract the
554       corresponding traffic, or negatively to repel it. Disaggregation
555       is performed to prevent black-holing and suboptimal routing to the
556       more specific prefixes.

558    LIE:
559       This is an acronym for a "Link Information Element" exchanged on
560       all the system's links running RIFT to form ThreeWay adjacencies
561       and carry information used to perform Zero Touch Provisioning
562       (ZTP) of levels.

564    Flood Repeater (FR):
565       A node can designate one or more northbound neighbor nodes to be
566       flood repeaters. The flood repeaters are responsible for flooding
567       northbound TIEs further north. The document sometimes calls them
568       flood leaders as well.

570    Bandwidth Adjusted Distance (BAD):
571       Each RIFT node can calculate the amount of northbound bandwidth
572       available towards a node compared to other nodes at the same level
573       and can modify the route distance accordingly to allow the
574       lower level to adjust its load balancing towards spines.

576    Overloaded:
577       Applies to a node advertising the `overload` attribute as set.
578       The overload attribute is carried in the `NodeFlags` object of the
579       encoding schema.

581    Interface:
582       A layer 3 entity over which RIFT control packets are exchanged.

584    ThreeWay Adjacency:
585       RIFT tries to form a unique adjacency over an interface and
586       exchange local configuration and necessary ZTP information. An
587       adjacency is only advertised in node TIEs and used for
588       computations after it has achieved ThreeWay state, i.e.
both routers 589 reflected each other in LIEs including relevant security 590 information. Nevertheless, LIEs sent before ThreeWay state is reached 591 may already carry ZTP-related information. 593 Bi-directional Adjacency: 594 A bi-directional adjacency is an adjacency where the nodes on both sides 595 of the adjacency advertised it in their node TIEs with the correct 596 levels and system IDs. Bi-directionality is used to check in 597 different algorithms whether the link should be included. 599 Neighbor: 600 Once a ThreeWay adjacency has been formed a neighborship 601 relationship contains the neighbor's properties. Multiple 602 adjacencies can be formed to a remote node via parallel interfaces 603 but such adjacencies do *not* share a neighbor structure. 604 Saying "neighbor" is thus equivalent to saying "a ThreeWay 605 adjacency". 607 Cost: 608 The term signifies the weighted distance between two neighbors. 610 Distance: 611 Sum of costs (bound by infinite distance) between two nodes. 613 Shortest-Path First (SPF): 614 A well-known graph algorithm attributed to Dijkstra [DIJKSTRA] 615 that establishes a tree of shortest paths from a source to 616 destinations on the graph. The SPF acronym is used due to its 617 familiarity as a general term for the node reachability calculations 618 RIFT can employ to ultimately calculate routes, of which Dijkstra's 619 algorithm is one possible choice. 621 North SPF (N-SPF): 622 A reachability calculation that is progressing northbound, for 623 example an SPF that is using South Node TIEs only. Normally it 624 progresses a single hop only and installs default routes. 626 South SPF (S-SPF): 627 A reachability calculation that is progressing southbound, for 628 example an SPF that is using North Node TIEs only. 630 Security Envelope: 631 RIFT packets are flooded within an authenticated security envelope 632 that protects the integrity of information a node 633 accepts.
635 System ID: 636 Each RIFT node identifies itself by a valid, network wide unique 637 number when trying to build adjacencies or describing its 638 topology. RIFT System IDs can be auto-derived or configured. 640 Additionally, when the specification refers to elements of packet 641 encoding or constants provided in the Appendix B grave accents are 642 used, e.g. `invalid_distance`. Same convention is used when 643 referring to finite state machine states or events outside the 644 context of the machine itself, e.g. `OneWay`. 646 3.2. Topology 648 ^ N +--------+ +--------+ 649 Level 2 | |ToF 21| |ToF 22| 650 W <-*-> E ++-+--+-++ ++-+--+-++ 651 | | | | | | | | | 652 S v P111/2 P121/2 | | | | 653 ^ ^ ^ ^ | | | | 654 | | | | | | | | 655 +--------------+ | +-----------+ | | | +---------------+ 656 | | | | | | | | 657 South +-----------------------------+ | | ^ 658 | | | | | | | All TIEs 659 0/0 0/0 0/0 +-----------------------------+ | 660 v v v | | | | | 661 | | +-+ +<-0/0----------+ | | 662 | | | | | | | | 663 +-+----++ optional +-+----++ ++----+-+ ++-----++ 664 Level 1 | | E/W link | | | | | | 665 |Spin111+----------+Spin112| |Spin121| |Spin122| 666 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 667 | | | South | | | | 668 | +---0/0--->-----+ 0/0 | +----------------+ | 669 0/0 | | | | | | | 670 | +---<-0/0-----+ | v | +--------------+ | | 671 v | | | | | | | 672 +-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 673 Level 0 | | (L2L) | | | | | | 674 |Leaf111+~~~~~~~~~~+Leaf112| |Leaf121| |Leaf122| 675 +-+-----+ +-+---+-+ +--+--+-+ +-+-----+ 676 + + \ / + + 677 Prefix111 Prefix112 \ / Prefix121 Prefix122 678 multi-homed 679 Prefix 680 +---------- PoD 1 ---------+ +---------- PoD 2 ---------+ 682 Figure 2: A Three Level Spine-and-Leaf Topology 683 +--------+ +--------+ +--------+ +--------+ 684 |ToF A1| |ToF B1| |ToF B2| |ToF A2| 685 ++-+-----+ ++-+-----+ ++-+-----+ ++-+-----+ 686 | | | | | | | | 687 | | | | | +---------------+ 688 | | | | | | | | 689 | | | +-------------------------+ 
| 690 | | | | | | | | 691 | +-----------------------+ | | | | 692 | | | | | | | | 693 | | +---------+ | +---------+ | | 694 | | | | | | | | 695 | +---------------------------------+ | | 696 | | | | | | | | 697 ++-+-----+ ++-+-----+ +--+-+---+ +----+-+-+ 698 |Spine111| |Spine112| |Spine121| |Spine122| 699 +-+---+--+ ++----+--+ +-+---+--+ ++---+---+ 700 | | | | | | | | 701 | +--------+ | | +--------+ | 702 | | | | | | | | 703 | -------+ | | | +------+ | | 704 | | | | | | | | 705 +-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 706 |Leaf111| |Leaf112| |Leaf121| |Leaf122| 707 +-------+ +-------+ +-------+ +-------+ 709 Figure 3: Topology with Multiple Planes 711 The topology in Figure 2 is referred to in all further considerations. 712 This figure depicts a generic "single plane fat-tree" and the 713 concepts explained using three levels apply by induction to further 714 levels and higher degrees of connectivity. Further, this document 715 will also deal with designs that provide only sparser connectivity 716 and "partitioned spines" as shown in Figure 3 and explained further 717 in Section 4.1.2. 719 4. RIFT: Routing in Fat Trees 721 The remainder of this document presents the detailed specification of a 722 protocol optimized for Routing in Fat Trees (RIFT) that in the most 723 abstract terms has many properties of a modified link-state protocol 724 when distributing information northbound and a distance vector 725 protocol when distributing information southbound. While this is an 726 unusual combination, it quite naturally exhibits the desired 727 properties. 729 4.1. Overview 730 4.1.1. Properties 732 The most singular property of RIFT is that it floods link-state 733 information northbound only so that each level obtains the full 734 topology of levels south of it. Link-state information is, with some 735 exceptions, never flooded East-West or back South again.
Among the exceptions, 736 south reflection is explained in detail in Section 4.2.5.1 and 737 east-west flooding at ToF level in multi-plane fabrics is outlined in 738 Section 4.1.2. In the southbound direction, the necessary routing 739 information, normally just the default route, propagates one hop 740 south and is 're-advertised' by nodes at the next lower level. However, 741 RIFT uses flooding in the southern direction as well to avoid the 742 overhead of building an update per adjacency. For the moment, 743 the description of the East-West direction is left out. 745 These information flow constraints create not only an anisotropic 746 protocol (i.e. the information is not distributed "evenly" or 747 "clumped" but summarized along the N-S gradient) but also a "smooth" 748 information propagation where nodes do not receive the same 749 information from multiple directions at the same time. Normally, 750 accepting the same reachability on any link, without understanding 751 its topological significance, forces tie-breaking on some kind of 752 distance metric. In hop-by-hop forwarding, such tie-breaking 753 ultimately leads to the use of shortest paths only. In contrast to that, RIFT, 754 under normal conditions, does not need to tie-break the same 755 reachability information from multiple directions. Its computation 756 principles (south forwarding direction is always preferred) lead to 757 valley-free [VFR] forwarding behavior. Since valley-free routing 758 is loop-free, RIFT can use all feasible paths, which is another highly 759 desirable property if available bandwidth should be utilized to the 760 maximum extent possible. 762 To account for the "northern" and the "southern" information split 763 the link state database is partitioned accordingly into "north 764 representation" and "south representation" TIEs.
In simplest terms, 765 the North TIEs contain a link-state topology description of lower 766 levels and the South TIEs simply carry a node description of the level 767 above and default routes pointing north. This oversimplified view 768 will be refined gradually in the following sections while introducing 769 protocol procedures and state machines at the same time. 771 4.1.2. Generalized Topology View 773 This section and the resulting Section 4.2.5.2 are dedicated to multi- 774 plane fabrics, in contrast with the single plane designs where all 775 top-of-fabric nodes are topologically equal and initially connected 776 to all the switches at the level below them. 778 It is quite difficult to visualize multi-plane designs, which are 779 effectively multi-dimensional switching matrices. To cope with that, 780 this document introduces a methodology that depicts the 781 connectivity in two-dimensional pictures. Further, it can be 782 leveraged that the fabrics under consideration here are basically stacked 783 crossbar fabrics where ports align "on top of each other" in a 784 regular fashion. 786 A word of caution to the reader: at this point it should be observed 787 that the language used to describe Clos variations, especially in 788 multi-plane designs, varies widely between sources. This description 789 follows the terminology introduced in Section 3.1. It is necessary 790 to keep that terminology in mind to follow the rest of this section 791 correctly. 793 4.1.2.1. Terminology and Glossary 795 This section describes the terminology and acronyms used in the rest 796 of the text. Though the glossary may not be comprehensible on a 797 first read, the following sections will gradually introduce the terms 798 in their proper context. 800 P: 801 Denotes the number of PoDs in a topology. 803 S: 804 Denotes the number of ToF nodes in a topology.
806 K: 807 To simplify the visual aids, notations and further considerations, 808 an implicit assumption is made that the switches are symmetrical, 809 i.e. an equal number of ports point northbound and southbound. With 810 that simplification, K denotes half of the radix of a symmetrical 811 switch, meaning that the switch has K ports pointing north and K 812 ports pointing south. K_LEAF (K of a leaf) thus represents both 813 the number of access ports in a leaf Node and the maximum number 814 of planes in the fabric, whereas K_TOP (K of a ToP) represents the 815 number of leaves in the PoD and the number of ports pointing north 816 in a ToP Node towards a higher spine level, thus the number of ToF 817 nodes in a plane. 819 ToF Plane: 820 Set of ToFs that are aware of each other by means of south 821 reflection. Planes are designated by capital letters, e.g. plane 822 A. 824 N: 825 Denotes the number of independent ToF planes in a topology. 827 R: 828 Denotes a redundancy factor, i.e. the number of connections a spine 829 has towards a ToF plane. In a single plane design K_TOP is equal to 830 R. 832 Fallen Leaf: 833 A fallen leaf in a plane Z is a switch that lost all connectivity 834 northbound to Z. 836 4.1.2.2. Clos as Crossed, Stacked Crossbars 838 The typical topology for which RIFT is defined is built of P 839 PoDs connected together by S ToF nodes. A PoD node 840 has K number of ports. From here on half of them (K=Radix/2) are 841 assumed to connect host devices from the south, and the other half to 842 connect to interleaved PoD Top-Level switches to the north. The K 843 ratio can be chosen differently without loss of generality when port 844 speeds differ or the fabric is oversubscribed but K=Radix/2 allows 845 for a more readable representation whereby there are as many ports 846 facing north as south on any intermediate node.
A node is hence 847 represented in a schematic fashion with ports "sticking out" to its 848 north and south rather than by the usual real-world front faceplate 849 designs of the day. 851 Figure 4 provides a view of a leaf node as seen from the north, i.e. 852 showing ports that connect northbound. For lack of a better symbol, 853 the document chooses to use the "o" as ASCII visualisation of a 854 single port. In this example, K_LEAF is 6. Observe that the 855 number of PoDs is not related to Radix unless the ToF Nodes are 856 constrained to be the same as the PoD nodes in a particular 857 deployment. 859 Top view 860 +---+ 861 | | 862 | O | e.g., Radix = 12, K_LEAF = 6 863 | | 864 | O | 865 | | ------------------------- 866 | o <------ Physical Port (Ethernet) ----+ 867 | | ------------------------- | 868 | O | | 869 | | | 870 | O | | 871 | | | 872 | O | | 873 | | | 874 +---+ v 876 || || || || || || || 877 +----+ +------------------------------------------------+ 878 | | | | 879 +----+ +------------------------------------------------+ 880 || || || || || || || 881 Side views 883 Figure 4: A Leaf Node, K_LEAF=6 885 The Radix of a PoD's top node may be different from that of the leaf 886 node. More often than not, though, the same type of node is used for 887 both, effectively forming a square (K*K). In the general case, 888 switches at the top of the PoD with K_TOP southern ports not 889 necessarily equal to K_LEAF could be considered. For instance, in 890 the representations below, we pick a K_LEAF of 6 ports and a K_TOP of 8 891 ports. In order to form a crossbar, K_TOP Leaf Nodes are necessary 892 as illustrated in Figure 5.
894 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 895 | | | | | | | | | | | | | | | | 896 | O | | O | | O | | O | | O | | O | | O | | O | 897 | | | | | | | | | | | | | | | | 898 | O | | O | | O | | O | | O | | O | | O | | O | 899 | | | | | | | | | | | | | | | | 900 | O | | O | | O | | O | | O | | O | | O | | O | 901 | | | | | | | | | | | | | | | | 902 | O | | O | | O | | O | | O | | O | | O | | O | 903 | | | | | | | | | | | | | | | | 904 | O | | O | | O | | O | | O | | O | | O | | O | 905 | | | | | | | | | | | | | | | | 906 | O | | O | | O | | O | | O | | O | | O | | O | 907 | | | | | | | | | | | | | | | | 908 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 910 Figure 5: Southern View of a PoD, K_TOP=8 912 As further visualized in Figure 6 the K_TOP Leaf Nodes are fully 913 interconnected with the K_LEAF ToP nodes, providing connectivity that 914 can be represented as a crossbar when "looked at" from the north. 915 The result is that, in the absence of a failure, a packet entering 916 the PoD from the north on any port can be routed to any port in the 917 south of the PoD and vice versa. And that is precisely why it makes 918 sense to talk about a "switching matrix". 
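The full interconnection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pod_links` and `reachable_leaves`, and the integer indexing of ToP and leaf nodes, are hypothetical and not part of the specification. The sketch enumerates the links of one PoD with K_LEAF=6 ToP nodes and K_TOP=8 leaf nodes and checks that, absent failures, every ToP node reaches every leaf.

```python
from itertools import product


def pod_links(k_leaf, k_top):
    """Return the set of (top_node, leaf_node) links of one fully meshed PoD.

    Per the text, a PoD holds K_LEAF ToP nodes and K_TOP leaf nodes, and
    every ToP node is connected to every leaf node. Indices are hypothetical.
    """
    return set(product(range(k_leaf), range(k_top)))


def reachable_leaves(links, top_node, failed=frozenset()):
    """Leaves one hop south of `top_node`, ignoring links listed in `failed`."""
    return {leaf for (top, leaf) in links
            if top == top_node and (top, leaf) not in failed}


links = pod_links(k_leaf=6, k_top=8)

# Without failures every ToP node reaches all K_TOP leaves, which is what
# makes the PoD behave like a crossbar when seen from the north.
full_mesh = all(len(reachable_leaves(links, t)) == 8 for t in range(6))
```

Passing a non-empty `failed` set shows how a single broken link removes exactly one leaf from the view of one ToP node while the other ToP nodes still reach it.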
920 E<-*->W 922 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 923 | | | | | | | | | | | | | | | | 924 +--------------------------------------------------------+ 925 | o o o o o o o o | 926 +--------------------------------------------------------+ 927 +--------------------------------------------------------+ 928 | o o o o o o o o | 929 +--------------------------------------------------------+ 930 +--------------------------------------------------------+ 931 | o o o o o o o o | 932 +--------------------------------------------------------+ 933 +--------------------------------------------------------+ 934 | o o o o o o o o | 935 +--------------------------------------------------------+ 936 +--------------------------------------------------------+ 937 | o o o o o o o o |<-+ 938 +--------------------------------------------------------+ | 939 +--------------------------------------------------------+ | 940 | o o o o o o o o | | 941 +--------------------------------------------------------+ | 942 | | | | | | | | | | | | | | | | | 943 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 944 ^ | 945 | | 946 | ---------- --------------------- | 947 +----- Leaf Node PoD top Node (Spine) --+ 948 ---------- --------------------- 950 Figure 6: Northern View of a PoD's Spines, K_TOP=8 952 Side views of this PoD is illustrated in Figure 7 and Figure 8. 
954 Connecting to Spine 956 || || || || || || || || 957 +----------------------------------------------------------------+ N 958 | PoD top Nodes seen sideways | ^ 959 +----------------------------------------------------------------+ | 960 || || || || || || || || * 961 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 962 | | | | | | | | | | | | | | | | v 963 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ S 964 || || || || || || || || 966 Connecting to Client nodes 968 Figure 7: Side View of a PoD, K_TOP=8, K_LEAF=6 970 Connecting to Spine 972 || || || || || || 973 +----+ +----+ +----+ +----+ +----+ +----+ N 974 | | | | | | | | | | | PoD top Nodes ^ 975 +----+ +----+ +----+ +----+ +----+ +----+ | 976 || || || || || || * 977 +------------------------------------------------+ | 978 | Leaf seen sideways | v 979 +------------------------------------------------+ S 981 Connecting to Client nodes 983 Figure 8: Other Side View of a PoD, K_TOP=8, K_LEAF=6, 90 degree turn 984 in E-W Plane from the previous figure 986 As a next step, observe further that a resulting PoD can be abstracted 987 as a bigger node with a number K of K_POD = K_TOP * K_LEAF, and the 988 design can recurse. 990 It is critical at this point that, before progressing further, 991 the concept and the picture of "crossed crossbars" are clear. Otherwise, 992 the following considerations might be difficult to comprehend. 994 To continue, the PoDs are interconnected with each other through a 995 Top-of-Fabric (ToF) node at the very top or the north edge of the 996 fabric. The resulting ToF is *not* partitioned if, and only if 997 (IFF), every PoD top level node (spine) is connected to every ToF 998 Node. This topology is also referred to as a single plane 999 configuration and is quite popular due to its simplicity.
In order 1000 to reach a 1:1 connectivity ratio between the ToF and the leaves, it 1001 results that there are K_TOP ToF nodes, because each port of a ToP 1002 node connects to a different ToF node, and K_LEAF ToP nodes for the 1003 same reason. Consequently, it will take (P * K_LEAF) ports on a ToF 1004 node to connect to each of the K_LEAF ToP nodes of the P PoDs. 1005 Figure 9 illustrates this, looking at P=3 PoDs from above and 2 1006 sides. The large view is the one from above, with the 8 ToF of 3*6 1007 ports each interconnecting the PoDs, every ToP Node being connected 1008 to every ToF node. 1010 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] <-----+ 1011 | | | | | | | | | 1012 [=================================] | -------------- 1013 | | | | | | | | +----- Top-of-Fabric 1014 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] +----- Node -------+ 1015 | -------------- | 1016 | v 1017 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ <-----+ +-+ 1018 | | | | | | | | | | | | | | | | | | 1019 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 1020 [ |o| |o| |o| |o| |o| |o| |o| |o| ] ------------------------- | | 1021 [ |o| |o| |o| |o| |o| |o| |o| |o<--- Physical Port (Ethernet) | | 1022 [ |o| |o| |o| |o| |o| |o| |o| |o| ] ------------------------- | | 1023 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 1024 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 1025 | | | | | | | | | | | | | | | | | | 1026 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 1027 [ |o| |o| |o| |o| |o| |o| |o| |o| ] -------------- | | 1028 [ |o| |o| |o| |o| |o| |o| |o| |o| ] <--- PoD top level | | 1029 [ |o| |o| |o| |o| |o| |o| |o| |o| ] node (Spine) ---+ | | 1030 [ |o| |o| |o| |o| |o| |o| |o| |o| ] -------------- | | | 1031 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | | 1032 | | | | | | | | | | | | | | | | -+ +- +-+ v | | 1033 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ ]--| | 1034 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | ----- | --| |--[ ]--| | 1035 [ |o| |o| |o| |o| |o| |o| |o| |o| ] +--- PoD ---+ --| |--[ ]--| | 1036 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | ----- | --| |--[ 
]--| | 1037 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ ]--| | 1038 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ ]--| | 1039 | | | | | | | | | | | | | | | | -+ +- +-+ | | 1040 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ 1042 Figure 9: Fabric Spines and TOFs in Single Plane Design, 3 PoDs 1044 The top view can be collapsed into a third dimension where the hidden 1045 depth index is representing the PoD number. One PoD can be shown 1046 then as a class of PoDs and hence save one dimension in the 1047 representation. The Spine Node expands in the depth and the vertical 1048 dimensions, whereas the PoD top level Nodes are constrained, in 1049 horizontal dimension. A port in the 2-D representation represents 1050 effectively the class of all the ports at the same position in all 1051 the PoDs that are projected in its position along the depth axis. 1052 This is shown in Figure 10. 1054 / / / / / / / / / / / / / / / / 1055 / / / / / / / / / / / / / / / / 1056 / / / / / / / / / / / / / / / / 1057 / / / / / / / / / / / / / / / / ] 1058 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ ]] 1059 | | | | | | | | | | | | | | | | ] --------------------------- 1060 [ |o| |o| |o| |o| |o| |o| |o| |o| ] <-- PoD top level node (Spine) 1061 [ |o| |o| |o| |o| |o| |o| |o| |o| ] --------------------------- 1062 [ |o| |o| |o| |o| |o| |o| |o| |o| ]]]] 1063 [ |o| |o| |o| |o| |o| |o| |o| |o| ]]] ^^ 1064 [ |o| |o| |o| |o| |o| |o| |o| |o| ]] // PoDs 1065 [ |o| |o| |o| |o| |o| |o| |o| |o| ] // (in depth) 1066 | |/| |/| |/| |/| |/| |/| |/| |/ // 1067 +-+ +-+ +-+/+-+/+-+ +-+ +-+ +-+ // 1068 ^ 1069 | ---------------- 1070 +----- Top-of-Fabric Node 1071 ---------------- 1073 Figure 10: Collapsed Northern View of a Fabric for Any Number of PoDs 1075 As simple as single plane deployment is, it introduces a limit due to 1076 the bound on the available radix of the ToF nodes that has to be at 1077 least P * K_LEAF. 
Nevertheless, it will become clear that a 1078 distinct advantage of a connected or non-partitioned Top-of-Fabric is 1079 that all failures can be resolved by simple, non-transitive, positive 1080 disaggregation (i.e. nodes advertising more specific prefixes with 1081 the default to the level below them that is however not propagated 1082 further down the fabric) as described in Section 4.2.5.1. In other 1083 words, non-partitioned ToF nodes can always reach nodes below or 1084 withdraw the routes from PoDs they cannot reach unambiguously. 1085 With this, positive disaggregation can heal all failures and still 1086 allow all the ToF nodes to see each other via south reflection. 1087 Disaggregation will be explained in further detail in Section 4.2.5. 1089 In order to scale beyond the "single plane limit", the Top-of-Fabric 1090 can be partitioned into N identically wired planes where N 1091 is an integer divisor of K_LEAF. The 1:1 ratio and the desired 1092 symmetry are still served, this time with (K_TOP * N) ToF nodes, each 1093 with (P * K_LEAF / N) ports. N=1 represents a non-partitioned Spine 1094 and N=K_LEAF is a maximally partitioned Spine. Further, if R is any 1095 integer divisor of K_LEAF, then N=K_LEAF/R is a feasible number of 1096 planes and R a redundancy factor that denotes the number of 1097 independent paths between 2 leaves within a plane. It proves 1098 convenient for deployments to use a radix for the leaf nodes that is 1099 a power of 2 so they can pick a number of planes that is a lower 1100 power of 2. The example in Figure 11 splits the Spine in 2 planes 1101 with a redundancy factor R=3, meaning that there are 3 non- 1102 intersecting paths between any leaf node and any ToF node. A ToF 1103 node must have, in this case, at least 3*P ports, and be directly 1104 connected to 3 of the 6 ToP nodes (spines) in each PoD. The ToP 1105 nodes are represented horizontally with K_TOP=8 ports northwards 1106 each.
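The sizing rules stated above can be checked with a small Python sketch. The function and key names (`feasible_planes`, `tof_dimensions`, `tof_nodes`, `ports_per_tof`) are hypothetical and chosen only for illustration; the arithmetic is taken directly from the text: N must divide K_LEAF, R = K_LEAF / N, there are K_TOP * N ToF nodes, and each needs P * K_LEAF / N ports.

```python
def feasible_planes(k_leaf):
    """All plane counts N allowed by the text: the integer divisors of K_LEAF."""
    return [n for n in range(1, k_leaf + 1) if k_leaf % n == 0]


def tof_dimensions(p, k_leaf, k_top, n):
    """Size the ToF level for P PoDs split into N identically wired planes."""
    if k_leaf % n != 0:
        raise ValueError("N must be an integer divisor of K_LEAF")
    return {
        "planes": n,
        "redundancy": k_leaf // n,          # R = K_LEAF / N
        "tof_nodes": k_top * n,             # K_TOP * N ToF nodes in total
        "ports_per_tof": p * k_leaf // n,   # P * K_LEAF / N ports on each
    }


# Single plane (N=1) versus the two-plane example with K_LEAF=6, K_TOP=8;
# P=3 PoDs is an assumed value matching the earlier single plane figure.
single = tof_dimensions(p=3, k_leaf=6, k_top=8, n=1)
two_planes = tof_dimensions(p=3, k_leaf=6, k_top=8, n=2)
```

For K_LEAF=6 the feasible plane counts are 1, 2, 3 and 6; with N=2 the redundancy factor is 3 and each ToF node needs 3*P ports, in line with the example discussed in the text.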
1108 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1109 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1110 | | O | | O | | O | | O | | O | | O | | O | | O | | 1111 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1112 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1113 | | O | | O | | O | | O | | O | | O | | O | | O | | 1114 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1115 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1116 | | O | | O | | O | | O | | O | | O | | O | | O | | 1117 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1118 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1120 Plane 1 1121 ----------- . ------------ . ------------ . ------------ . -------- 1122 Plane 2 1124 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1125 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1126 | | O | | O | | O | | O | | O | | O | | O | | O | | 1127 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1128 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1129 | | O | | O | | O | | O | | O | | O | | O | | O | | 1130 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1131 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1132 | | O | | O | | O | | O | | O | | O | | O | | O | | 1133 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1134 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1135 ^ 1136 | 1137 | ---------------- 1138 +----- Top-of-Fabric node 1139 "across" depth 1140 ---------------- 1142 Figure 11: Northern View of a Multi-Plane ToF Level, K_LEAF=6, N=2 1144 At the extreme end of the spectrum it is even possible to fully 1145 partition the spine with N = K_LEAF and R=1, while maintaining 1146 connectivity between each leaf node and each Top-of-Fabric node. In 1147 that case the ToF node connects to a single Port per PoD, so it 1148 appears as a single port in the projected view represented in 1149 Figure 12. The number of ports required on the Spine Node is more 1150 than or equal to P, the number of PoDs. 
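To make the wiring of a partitioned ToF level more concrete, the following Python sketch lists which ToP nodes a ToF node of a given plane connects to. It rests on an assumption that is not in the specification: ToP nodes within a PoD are indexed 0 to K_LEAF-1 and each plane is served by a contiguous group of R = K_LEAF / N of them. The names `plane_of_top` and `tof_neighbors` are likewise hypothetical.

```python
def plane_of_top(top_index, k_leaf, n):
    """Plane serving a ToP node, assuming contiguous groups of R ToP nodes."""
    r = k_leaf // n  # redundancy factor R
    return top_index // r


def tof_neighbors(plane, p, k_leaf, n):
    """(pod, top_index) pairs that one ToF node of `plane` connects to."""
    r = k_leaf // n
    return [(pod, plane * r + i) for pod in range(p) for i in range(r)]


# Maximally partitioned fabric: N = K_LEAF = 6, hence R = 1. P = 3 is an
# assumed number of PoDs used only for this illustration.
max_partition = tof_neighbors(plane=5, p=3, k_leaf=6, n=6)

# Two planes (N = 2, R = 3) for comparison.
two_plane = tof_neighbors(plane=1, p=3, k_leaf=6, n=2)
```

With R=1 the ToF node has exactly one neighbor per PoD, so P ports suffice, whereas with N=2 it has three neighbors per PoD and needs 3*P ports.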
1152 Plane 1 1153 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ -+ 1154 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1155 | | O | | O | | O | | O | | O | | O | | O | | O | | | 1156 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1157 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1158 ----------- . ------------------- . ------------ . ------- | 1159 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1160 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1161 | | O | | O | | O | | O | | O | | O | | O | | O | | | 1162 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1163 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1164 ----------- . ------------ . ---- . ------------ . ------- | 1165 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1166 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1167 | | O | | O | | O | | O | | O | | O | | O | | O | | | 1168 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1169 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1170 ----------- . ------------ . ------------------- . --------+<-+ 1171 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1172 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1173 | | O | | O | | O | | O | | O | | O | | O | | O | | | | 1174 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1175 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1176 ----------- . ------------ . ------------ . ---- . ------- | | 1177 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1178 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1179 | | O | | O | | O | | O | | O | | O | | O | | O | | | | 1180 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1181 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1182 ----------- . ------------ . ------------ . 
-------------- | | 1183 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1184 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1185 | | O | | O | | O | | O | | O | | O | | O | | O | | | | 1186 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1187 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ -+ | 1188 Plane 6 ^ | 1189 | | 1190 | ---------------- ------------- | 1191 +----- ToF Node Class of PoDs ---+ 1192 ---------------- ------------- 1194 Figure 12: Northern View of a Maximally Partitioned ToF Level, R=1 1196 4.1.3. Fallen Leaf Problem 1198 As mentioned earlier, RIFT exhibits an anisotropic behavior tailored 1199 for fabrics with a North / South orientation and a high level of 1200 interleaving paths. A non-partitioned fabric makes a total loss of 1201 connectivity between a Top-of-Fabric node at the north and a leaf 1202 node at the south a very rare, yet possible, occurrence that is fully 1203 healed by positive disaggregation as described in Section 4.2.5.1. 1204 In large fabrics or fabrics built from switches with low radix, the 1205 ToF often ends up being partitioned into planes, which makes it 1206 more likely that a given leaf is reachable from only a subset of the ToF 1207 nodes. This makes some further considerations 1208 necessary. 1210 A "Fallen Leaf" is a leaf that can be reached by only a subset, but 1211 not all, of Top-of-Fabric nodes due to missing connectivity. If R is 1212 the redundancy factor, then it takes at least R breakages to reach a 1213 "Fallen Leaf" situation. 1215 In a maximally partitioned fabric, the redundancy factor is R=1, so 1216 any breakage in the fabric will cause one or more fallen leaves in 1217 the affected plane. R=2 guarantees that a single breakage will not 1218 cause a fallen leaf. However, not all cases require disaggregation.
1219 The following cases do not require particular action: 1221 * If a southern link on a node goes down, then connectivity through 1222 that node is lost for all nodes south of it. There is no need to 1223 disaggregate since the connectivity to this node is lost for all 1224 spine nodes in the same fashion. 1226 * If a ToF Node goes down, then northern traffic towards it is 1227 routed via alternate ToF nodes in the same plane and there is no 1228 need to disaggregate routes. 1230 In general, the mechanism of non-transitive positive 1231 disaggregation is sufficient when the disaggregating ToF nodes 1232 collectively connect to all the ToP nodes in the broken plane. This 1233 happens in the following case: 1235 * If the breakage is the last northern link from a ToP node to a ToF 1236 node going down, then the fallen leaf problem affects only the ToF 1237 node, and the connectivity to all the nodes in the PoD is lost 1238 from that ToF node. This can be observed by other ToF nodes 1239 within the plane where the ToP node is located and positively 1240 disaggregated within that plane. 1242 On the other hand, there is a need to disaggregate the routes to 1243 Fallen Leaves within the plane in a transitive fashion, that is, all 1244 the way to the other leaves, in the following cases: 1246 * If the breakage is the last northern link from a leaf node within 1247 a plane (there is only one such link in a maximally partitioned 1248 fabric) that goes down, then connectivity to all unicast prefixes 1249 attached to the leaf node is lost within the plane where the link 1250 is located. Southern Reflection by a leaf node, e.g., between ToP 1251 nodes, if the PoD has only 2 levels, happens in between planes, 1252 allowing the ToP nodes to detect the problem within the PoD where 1253 it occurs and positively disaggregate. The breakage can be 1254 observed by the ToF nodes in the same plane through the North 1255 flooding of TIEs from the ToP nodes.
The ToF nodes however need 1256 to be aware of all the affected prefixes for the negative, 1257 possibly transitive disaggregation to be fully effective (i.e. a 1258 node advertising in the control plane that it cannot reach a 1259 certain more specific prefix than default whereas such 1260 disaggregation must in the extreme condition propagate further 1261 down southbound). The problem can also be observed by the ToF 1262 nodes in the other planes through the flooding of North TIEs from 1263 the affected leaf nodes, together with non-node North TIEs which 1264 indicate the affected prefixes. To be effective in that case, the 1265 positive disaggregation must reach down to the nodes that make the 1266 plane selection, which are typically the ingress leaf nodes. The 1267 information is not useful for routing in the intermediate levels. 1269 * If the breakage is a ToP node in a maximally partitioned fabric 1270 (in which case it is the only ToP node serving the plane in that 1271 PoD that goes down), then the connectivity to all the nodes in the 1272 PoD is lost within the plane where the ToP node is located. 1273 Consequently, all leaves of the PoD fall in this plane. Since the 1274 Southern Reflection between the ToF nodes happens only within a 1275 plane, ToF nodes in other planes cannot discover fallen leaves in 1276 a different plane. They also cannot determine beyond their local 1277 plane whether a leaf node that was initially reachable has become 1278 unreachable. As the breakage can be observed by the ToF nodes in 1279 the plane where the breakage happened, the ToF nodes in the plane 1280 need to be aware of all the affected prefixes for the negative 1281 disaggregation to be fully effective. 
The problem can also be 1282 observed by the ToF nodes in the other planes through the flooding 1283 of North TIEs from the affected leaf nodes, if there are only 3 1284 levels and the ToP nodes are directly connected to the leaf nodes, 1285 and then again it can only be effective if it is propagated 1286 transitively to the leaf, and is useless above that level. 1288 For the sake of easy comprehension the abstractions are rolled back 1289 into a simple example that shows that in Figure 3 the loss of link 1290 Spine 122 to Leaf 122 will make Leaf 122 a fallen leaf for Top-of- 1291 Fabric plane B. Worse, if the cabling was never present in the first 1292 place, plane B will not even be able to know that such a fallen leaf 1293 exists. Hence partitioning without further treatment results in two 1294 grave problems: 1296 * Leaf 111 trying to route to Leaf 122 must choose Spine 111 in 1297 plane A as its next hop since plane B will inevitably blackhole 1298 the packet when forwarding using default routes or do excessive 1299 bow tying. This information must be in its routing table. 1301 * A path computation trying to deal with the problem by distributing 1302 host routes may only form paths through leaves. The flooding of 1303 information about Leaf 122 would have to go up to Top-of-Fabric A 1304 and then "loopback" over other leaves to ToF B leading in extreme 1305 cases to traffic for Leaf 122 when presented to plane B taking an 1306 "inverted fabric" path where leaves start to serve as ToFs, at 1307 least for the duration of a protocol's convergence. 1309 4.1.4. Discovering Fallen Leaves 1311 When aggregation is used, RIFT deals with fallen leaves by ensuring 1312 that all the ToF nodes share the same north topology database. This 1313 happens naturally in single-plane designs by the means of northbound 1314 flooding and south reflection but needs additional considerations in 1315 multi-plane fabrics.
To enable routing to fallen leaves in multi- 1316 plane designs, RIFT requires additional interconnection across planes 1317 between the ToF nodes, e.g., using rings as illustrated in Figure 13. 1318 Other solutions are possible but they either need more cabling or end 1319 up having much longer flooding paths and/or single points of failure. 1321 In detail, by reserving two ports on each Top-of-Fabric node it is 1322 possible to connect them together by interplane bi-directional rings 1323 as illustrated in Figure 13. The rings will be used to exchange full 1324 north topology information between planes. All ToFs having same 1325 north topology allows by the means of transitive, negative 1326 disaggregation described in Section 4.2.5.2 to efficiently fix any 1327 possible fallen leaf scenario. Somewhat as a side-effect, the 1328 exchange of information fulfills the requirement to have a full view 1329 of the fabric topology at the Top-of-Fabric level, without the need 1330 to collate it from multiple points. 1332 +---+ +---+ +---+ +---+ +---+ +---+ +--------+ 1333 | | | | | | | | | | | | | | 1334 | | | | | | | | 1335 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1336 +-| |--| |--| |--| |--| |--| |--| |-+ | 1337 | | o | | o | | o | | o | | o | | o | | o | | | Plane A 1338 +-| |--| |--| |--| |--| |--| |--| |-+ | 1339 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1340 | | | | | | | | 1341 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1342 +-| |--| |--| |--| |--| |--| |--| |-+ | 1343 | | o | | o | | o | | o | | o | | o | | o | | | Plane B 1344 +-| |--| |--| |--| |--| |--| |--| |-+ | 1345 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1346 | | | | | | | | 1347 ... 
| 1348 | | | | | | | | 1349 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1350 +-| |--| |--| |--| |--| |--| |--| |-+ | 1351 | | o | | o | | o | | o | | o | | o | | o | | | Plane X 1352 +-| |--| |--| |--| |--| |--| |--| |-+ | 1353 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1354 | | | | | | | | 1355 | | | | | | | | | | | | | | 1356 +---+ +---+ +---+ +---+ +---+ +---+ +--------+ 1357 Rings 1 2 3 4 5 6 7 1359 Figure 13: Using rings to bring all planes and at the ToF bind them 1361 4.1.5. Addressing the Fallen Leaves Problem 1363 One consequence of the "Fallen Leaf" problem is that some prefixes 1364 attached to the fallen leaf become unreachable from some of the ToF 1365 nodes. RIFT defines two methods to address this issue, the positive 1366 and the negative disaggregation. Both methods flood corresponding types 1367 of South TIEs to advertise the impacted prefix(es). 1369 When used for the operation of disaggregation, a positive South TIE 1370 contained in `positive_disaggregation_prefixes`, as usual, indicates 1371 reachability to a prefix of given length and all addresses subsumed 1372 by it. In contrast, a negative route advertisement contained in 1373 `negative_disaggregation_prefixes` indicates that the origin cannot 1374 route to the advertised prefix. 1376 The positive disaggregation is originated by a router that can still 1377 reach the advertised prefix, and the operation is not transitive. In 1378 other words, the receiver does *not* generate its own TIEs or flood 1379 them south as a consequence of receiving positive disaggregation 1380 advertisements from a higher level node. The effect of a positive 1381 disaggregation is that the traffic to the impacted prefix will follow 1382 the longest match and will be limited to the northbound routers that 1383 advertised the more specific route. 1385 In contrast, the negative disaggregation can be transitive, and is 1386 propagated south when all the possible routes have been advertised as 1387 negative exceptions.
A negative route advertisement is only 1388 actionable when the negative prefix is aggregated by a positive route 1389 advertisement for a shorter prefix. In such a case, the negative 1390 advertisement "punches out a hole" in the positive route in the 1391 routing table, making the positive prefix reachable through the 1392 originator with the special consideration of the negative prefix 1393 removing certain next hop neighbors. The specific procedures will be 1394 explained in detail in Section 4.2.5.2.3. 1396 When the top of fabric switches are not partitioned into multiple 1397 planes, the resulting southbound flooding of the positive 1398 disaggregation by the ToF nodes that can still reach the impacted 1399 prefix is in general enough to cover all the switches at the next 1400 level south, typically the ToP nodes. If all those switches are 1401 aware of the disaggregation, they collectively create a ceiling that 1402 intercepts all the traffic north and forwards it to the ToF nodes 1403 that advertised the more specific route. In that case, the positive 1404 disaggregation alone is sufficient to solve the fallen leaf problem. 1406 On the other hand, when the fabric is partitioned in planes, the 1407 positive disaggregation from ToF nodes in different planes does not 1408 reach the ToP switches in the affected plane and therefore cannot solve the 1409 fallen leaves problem. In other words, a breakage in a plane can 1410 only be solved in that plane. Also, the selection of the plane for a 1411 packet typically occurs at the leaf level and the disaggregation must 1412 be transitive and reach all the leaves. In that case, the negative 1413 disaggregation is necessary. The details on the RIFT approach to 1414 deal with fallen leaves in an optimal way are specified in 1415 Section 4.2.5.2. 1417 4.2. Specification 1419 This section specifies the protocol in a normative fashion by either 1420 prescriptive procedures or behavior defined by Finite State Machines 1421 (FSM).
1423 The FSMs, as usual, are presented as states the FSM can assume, 1424 events that it can be given and according actions performed when 1425 transitioning between states on event processing. 1427 Actions are performed before the end state is assumed. 1429 The FSMs can queue events against itself to chain actions or against 1430 other FSMs in the specification. Events are always processed in the 1431 sequence they have been queued. 1433 Consequently, "On Entry" actions on FSM state are performed every 1434 time and right before the according state is entered, i.e. after any 1435 transitions from previous state. 1437 "On Exit" actions are performed every time and immediately when a 1438 state is exited, i.e. before any transitions towards target state are 1439 performed. 1441 Any attempt to transition from a state towards another on reception 1442 of an event where no action is specified MUST be considered an 1443 unrecoverable error, i.e. the protocol MUST reset all adjacencies, 1444 discard all the state and MAY NOT start again. 1446 The data structures and FSMs described in this document are 1447 conceptual and do not have to be implemented precisely as described 1448 here, as long as the implementations support the described 1449 functionality and exhibit the same externally visible behavior. 1451 The machines can use conceptually "timers" for different situations. 1452 Those timers are started through actions and their expiration leads 1453 to queuing of according events to be processed. 1455 The term `holdtime` is used often as short-hand for `holddown timer` 1456 and signifies either the length of the holding down period or the 1457 timer used to expire after such period. Such timers are used to 1458 "hold down" state within an FSM that is cleaned if the machine 1459 triggers a `HoldtimeExpired` event. 1461 4.2.1. Transport 1463 All packet formats are defined in Thrift [thrift] models in 1464 Appendix B. 
LIE packet format is contained in the `LIEPacket` schema 1465 element. TIE packet format is contained in `TIEPacket`, TIDE and 1466 TIRE accordingly in `TIDEPacket`, `TIREPacket` and the whole packet 1467 is a union of the above in `ProtocolPacket` while it contains a 1468 `PacketHeader` as well. 1470 Such a packet being in terms of bits on the wire a serialized 1471 `ProtocolPacket` is carried in an envelope defined in Section 4.4.3 1472 within a UDP frame that provides security and allows validation/ 1473 modification of several important fields without de-serialization for 1474 performance and security reasons. Security model and procedures are 1475 further explained in Section 7. 1477 4.2.2. Link (Neighbor) Discovery (LIE Exchange) 1479 RIFT LIE exchange auto-discovers neighbors, negotiates ZTP parameters 1480 and discovers miscablings. The formation progresses under normal 1481 conditions from OneWay to TwoWay and then ThreeWay state at which 1482 point it is ready to exchange TIEs per Section 4.2.3. The adjacency 1483 exchanges ZTP information (Section 4.2.7) in any of the states, i.e. 1484 it is not necessary to reach ThreeWay for zero-touch provisioning to 1485 operate. 1487 RIFT supports any combination of IPv4 and IPv6 addressing on the 1488 fabric with the additional capability for forwarding paths that are 1489 capable of forwarding IPv4 packets in presence of IPv6 addressing 1490 only. 1492 For IPv4 LIE exchange happens over well-known administratively 1493 locally scoped and configured or otherwise well-known IPv4 multicast 1494 address [RFC2365]. For IPv6 [RFC8200] exchange is performed over 1495 link-local multicast scope [RFC4291] address which is configured or 1496 otherwise well-known. In both cases a destination UDP port defined 1497 in Appendix C.1 is used unless configured otherwise. 
LIEs SHOULD be 1498 sent with an IPv4 Time to Live (TTL) / IPv6 Hop Limit (HL) of either 1499 1 or 255 to prevent RIFT information reaching beyond a single L3 1500 next-hop in the topology. LIEs SHOULD be sent with network control 1501 precedence unless an implementation is prevented from doing so. 1503 The originating port of the LIE has no further significance other 1504 than identifying the origination point. LIEs are exchanged over all 1505 links running RIFT. 1507 An implementation MAY listen and send LIEs on IPv4 and/or IPv6 1508 multicast addresses. A node MUST NOT originate LIEs on an address 1509 family if it does not process received LIEs on that family. LIEs on 1510 same link are considered part of the same LIE FSM independent of the 1511 address family they arrive on. Observe further that the LIE source 1512 address may not identify the peer uniquely in unnumbered or link- 1513 local address cases so the response transmission MUST occur over the 1514 same interface the LIEs have been received on. A node MAY use any of 1515 the adjacency's source addresses it saw in LIEs on the specific 1516 interface during adjacency formation to send TIEs (Section 4.2.3.3). 1517 That implies that an implementation MUST be ready to accept TIEs on 1518 all addresses it used as source of LIE frames. 1520 A simplified version on platforms with limited multicast support MAY 1521 implement optional sending and reception of LIE frames on IPv4 subnet 1522 broadcast addresses and IPv6 all routers multicast address though 1523 such technique is less optimal and presents a wider attack surface 1524 from security perspective. 1526 A ThreeWay adjacency (as defined in the glossary) over any address 1527 family implies support for IPv4 forwarding if the 1528 `ipv4_forwarding_capable` flag in `LinkCapabilities` is set to true. 
1529 A node, in case of absence of IPv4 addresses on such links and 1530 advertising `ipv4_forwarding_capable` as true, MUST forward IPv4 1531 packets using gateways discovered on IPv6-only links advertising this 1532 capability. It is expected that the whole fabric supports the same 1533 type of forwarding of address families on all the links; any other 1534 combination is outside the scope of this specification. 1535 `ipv4_forwarding_capable` MUST be set to true when LIEs from an IPv4 1536 address are sent and MAY be set to true in LIEs on an IPv6 address if no 1537 LIEs are sent from an IPv4 address. If IPv4 and IPv6 LIEs indicate 1538 contradicting information, protocol behavior is unspecified. 1540 Operation of a fabric where only some of the links are supporting 1541 forwarding on an address family or have an address in a family and 1542 others do not is outside the scope of this specification. 1544 Any attempt to construct IPv6 forwarding over IPv4 only adjacencies 1545 is outside the scope of this specification. 1547 Table 1 outlines protocol behavior in case of different address 1548 family combinations. 1550 +=======+=======+=============================================+ 1551 | AF | AF | Behavior | 1552 +=======+=======+=============================================+ 1553 | IPv4 | IPv4 | LIEs and TIEs are exchanged over IPv4, no | 1554 | | | IPv6 forwarding. TIEs are received on any | 1555 | | | of the LIE sending addresses. | 1556 +-------+-------+---------------------------------------------+ 1557 | IPv6 | IPv6 | LIEs and TIEs are exchanged over IPv6 only, | 1558 | | | no IPv4 forwarding if either of the | 1559 | | | `ipv4_forwarding_capable` flags is false. | 1560 | | | If both `ipv4_forwarding_capable` flags are | 1561 | | | true IPv4 is forwarded. TIEs are received | 1562 | | | on any of the LIE sending addresses.
| 1563 +-------+-------+---------------------------------------------+ 1564 | IPv4, | IPv6 | LIEs and TIEs are exchanged over IPv6, no | 1565 | IPv6 | | IPv4 forwarding if either of the | 1566 | | | `ipv4_forwarding_capable` flags is false. | 1567 | | | If both `ipv4_forwarding_capable` are true | 1568 | | | IPv4 is forwarded. TIEs are received on | 1569 | | | any of the IPv6 LIE sending addresses. | 1570 +-------+-------+---------------------------------------------+ 1571 | IPv4, | IPv4, | LIEs and TIEs are exchanged over IPv6 and | 1572 | IPv6 | IPv6 | IPv4, unspecified behavior if either of the | 1573 | | | `ipv4_forwarding_capable` flags is false or | 1574 | | | IPv4 and IPv6 advertise different flags as | 1575 | | | described previously. IPv4 and IPv6 are | 1576 | | | forwarded. TIEs are received on any of the | 1577 | | | IPv4 and IPv6 LIE sending addresses. | 1578 +-------+-------+---------------------------------------------+ 1580 Table 1: Neighbor AF Combination Behavior 1582 The protocol does *not* support selective disabling of address 1583 families after adjacency formation, disabling IPv4 forwarding 1584 capability or any local address changes in ThreeWay state, i.e. if a 1585 link has entered ThreeWay IPv4 and/or IPv6 with a neighbor on an 1586 adjacency and it wants to stop supporting one of the families or 1587 change any of its local addresses or stop IPv4 forwarding, it has to 1588 tear down and rebuild the adjacency. It also has to remove any state 1589 it stored about the remote side of the adjacency such as LIE source 1590 addresses seen. 1592 Unless ZTP as described in Section 4.2.7 is used, each node is 1593 provisioned with the level at which it is operating and advertises it 1594 in the `level` of the `PacketHeader` schema element. It MAY be also 1595 provisioned with its PoD. If level is not provisioned it is not 1596 present in the optional `PacketHeader` schema element and established 1597 by ZTP procedures if feasible. 
If PoD is not provisioned it is 1598 governed by the `LIEPacket` schema element, assuming the 1599 `common.default_pod` value. This means that switches except top of 1600 fabric do not need to be configured at all. Necessary information to 1601 configure all values is exchanged in the `LIEPacket` and 1602 `PacketHeader` or derived by the node automatically. 1604 Further definitions of leaf flags are found in Section 4.2.7 given 1605 they have implications in terms of level and adjacency forming here. 1606 Leaf flags are carried in `HierarchyIndications`. 1608 A node MUST form a ThreeWay adjacency (or in other words consider the 1609 neighbor "valid" and hence reflect it) if and only if the 1610 following first order logic conditions are satisfied on a LIE packet 1611 as specified by the `LIEPacket` schema element and received on a link: 1613 1. the neighboring node is running the same major schema version as 1614 indicated in the `major_version` element in `PacketHeader` *and* 1616 2. the neighboring node uses a valid System ID (i.e. value different 1617 from `IllegalSystemID`) in `sender` element in `PacketHeader` 1618 *and* 1620 3. the neighboring node uses a different System ID than the node 1621 itself *and* 1623 4. the advertised MTUs in `LIEPacket` element match on both sides 1624 *and* 1626 5. both nodes advertise defined level values in `level` element in 1627 `PacketHeader` *and* 1629 6. [ 1631 i) the node is at `leaf_level` value and has no ThreeWay 1632 adjacencies already to nodes at Highest Adjacency ThreeWay 1633 (HAT as defined later in Section 4.2.7.1) with level different 1634 than the adjacent node *or* 1636 ii) the node is not at `leaf_level` value and the neighboring 1637 node is at `leaf_level` value *or* 1639 iii) both nodes are at `leaf_level` values *and* both indicate 1640 support for Section 4.3.9 *or* 1642 iv) neither node is at `leaf_level` value and the neighboring 1643 node is at most one level difference away 1645 ].
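As a non-normative illustration, the six conditions above can be written as a single predicate. The sketch below is an assumption-laden reading, not part of the specification: `LieView`, `may_form_three_way` and the `leaf_to_leaf` flag (standing in for "support for Section 4.3.9") are hypothetical names, the values of `IllegalSystemID` and `leaf_level` are assumed to be 0, and condition 6.i is interpreted as "the local node is a leaf and either has no HAT yet or its HAT equals the neighbor's level".

```python
from dataclasses import dataclass
from typing import Optional

ILLEGAL_SYSTEM_ID = 0  # assumed value of `IllegalSystemID`
LEAF_LEVEL = 0         # assumed value of `leaf_level`


@dataclass
class LieView:
    """Hypothetical summary of the fields compared during LIE exchange."""
    major_version: int
    system_id: int
    mtu: int
    level: Optional[int]        # None models an undefined level
    leaf_to_leaf: bool = False  # stands in for "support for Section 4.3.9"


def may_form_three_way(local: LieView, remote: LieView,
                       hat: Optional[int] = None) -> bool:
    """Evaluate conditions 1-6 for `remote` as seen by `local`.

    `hat` is the local node's Highest Adjacency ThreeWay level, or None
    when the node has no ThreeWay adjacency yet.
    """
    if remote.major_version != local.major_version:  # condition 1
        return False
    if remote.system_id == ILLEGAL_SYSTEM_ID:         # condition 2
        return False
    if remote.system_id == local.system_id:           # condition 3
        return False
    if remote.mtu != local.mtu:                       # condition 4
        return False
    if local.level is None or remote.level is None:   # condition 5
        return False

    local_leaf = local.level == LEAF_LEVEL
    remote_leaf = remote.level == LEAF_LEVEL
    return (
        # 6.i, under the reading stated in the lead-in
        (local_leaf and (hat is None or hat == remote.level))
        # 6.ii
        or (not local_leaf and remote_leaf)
        # 6.iii
        or (local_leaf and remote_leaf
            and local.leaf_to_leaf and remote.leaf_to_leaf)
        # 6.iv
        or (not local_leaf and not remote_leaf
            and abs(local.level - remote.level) <= 1)
    )
```

Returning early on conditions 1 to 5 mirrors the order in which PROCESS_LIE rejects a LIE, but the sketch only answers the yes/no question and pushes no FSM events.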
1647 LIEs arriving with IPv4 Time to Live (TTL) / IPv6 Hop Limit (HL) 1648 different than 1 or 255 SHOULD be ignored. 1650 4.2.2.1. LIE Finite State Machine 1652 This section specifies the precise, normative LIE FSM. For easier 1653 reference the according figure is given as well in Figure 14. 1654 Additionally, some sets of actions repeat often and are hence 1655 summarized into well-known procedures. 1657 Events generated are fairly fine grained, especially when indicating 1658 problems in adjacency forming conditions. The intention of such 1659 differentiation is to simplify tracking of problems in deployment. 1661 Initial state is `OneWay`. 1663 The machine sends LIEs proactively on several transitions to 1664 accelerate adjacency bring-up without waiting for the according timer 1665 tic. 1667 Enter 1668 | 1669 V 1670 +-----------+ 1671 | OneWay |<----+ 1672 | | | HALChanged 1673 | | | HALSChanged 1674 | | | HATChanged 1675 | | | HoldTimerExpired 1676 | | | InstanceNameMismatch 1677 | | | LevelChanged 1678 | | | LieRcvd 1679 | | | MTUMismatch 1680 | | | NeighborChangedAddress 1681 | | | NeighborChangedLevel 1682 | | | NeighborChangedMinorFields 1683 | | | NeighborDroppedReflection 1684 | | | SendLIE 1685 | | | TimerTick 1686 | | | UnacceptableHeader 1687 | | | UpdateZTPOffer 1688 | |-----+ 1689 | | 1690 | |<--------------------- (ThreeWay) 1691 | |---------------------> 1692 | | ValidReflection 1693 | | 1694 | |---------------------> (Multiple 1695 | | MultipleNeighbors Neighbors 1696 +-----------+ Wait) 1697 ^ | 1698 | | 1699 | | NewNeighbor 1700 | V 1701 (TwoWay) 1703 (OneWay) 1704 | ^ 1705 | | HoldTimeExpired 1706 | | InstanceNameMismatch 1707 | | MTUMismatch 1708 | | NeighborChangedAddress 1709 | | NeighborChangedLevel 1710 | | UnacceptableHeader 1711 V | 1712 +-----------+ 1713 | TwoWay |<----+ 1714 | | | HALChanged 1715 | | | HALSChanged 1716 | | | HATChanged 1717 | | | LevelChanged 1718 | | | LIERcvd 1719 | | | SendLIE 1720 | | | TimerTick 1721 | | | 
UpdateZTPOffer 1722 | | | FloodLeadersChanged 1723 | |-----+ 1724 | | 1725 | |<---------------------- 1726 | |----------------------> (Multiple 1727 | | NewNeighbor Neighbors 1728 | | Wait) 1729 | | MultipleNeighbors 1730 +-----------+ 1731 ^ | 1732 | | ValidReflection 1733 | V 1734 (ThreeWay) 1736 (TwoWay) (OneWay) 1737 ^ | ^ 1738 | | | HoldTimerExpired 1739 | | | InstanceNameMismatch 1740 | | | LevelChanged 1741 | | | MTUMismatch 1742 | | | NeighborChangedAddress 1743 | | | NeighborChangedLevel 1744 NeighborDropped- | | | UnacceptableHeader 1745 Reflection | | | 1746 | V | 1747 +-----------+ | 1748 | ThreeWay |-----+ 1749 | | 1750 | |<----+ 1751 | | | HALChanged 1752 | | | HALSChanged 1753 | | | HATChanged 1754 | | | LieRcvd 1755 | | | SendLIE 1756 | | | TimerTick 1757 | | | UpdateZTPOffer 1758 | | | ValidReflection 1759 | | | FloodLeadersChanged 1760 | |-----+ 1761 | |----------------------> (Multiple 1762 | | MultipleNeighbors Neighbors 1763 +-----------+ Wait) 1765 (TwoWay) (ThreeWay) 1766 | | 1767 V V 1768 +------------+ 1769 | Multiple |<----+ 1770 | Neighbors | | HALChanged 1771 | Wait | | HALSChanged 1772 | | | HATChanged 1773 | | | MultipleNeighbors 1774 | | | TimerTick 1775 | | | UpdateZTPOffer 1776 | | | FloodLeadersChanged 1777 | | | NeighborChangedAddress 1778 | | | UnacceptableHeader 1779 | | | SendLie 1780 | | | MTUMismatch 1781 | | | LieRcvd 1782 | | | NeighborDroppedReflection 1783 | | | 1784 | | | 1785 | |-----+ 1786 | | 1787 | |<--------------------------- 1788 | |---------------------------> (OneWay) 1789 | | LevelChanged 1790 +------------+ MultipleNeighborsDone 1791 Figure 14: LIE FSM 1793 The following words are used for well known procedures: 1795 * PUSH Event: queues an event to be executed by the FSM upon exit of 1796 this action 1798 * CLEANUP: neighbor MUST be reset to unknown 1800 * SEND_LIE: create and send a new LIE packet 1802 1. reflecting the neighbor if known and valid and 1804 2. 
setting the necessary `not_a_ztp_offer` variable if level was 1805 derived from last known neighbor on this interface and 1807 3. setting `you_are_not_flood_repeater` to computed value 1809 * PROCESS_LIE: 1811 1. if LIE has major version not equal to this node's *or* system 1812 ID equal to this node's system ID or `IllegalSystemID` then 1813 CLEANUP else 1815 2. if LIE has non-matching MTUs then CLEANUP, PUSH 1816 UpdateZTPOffer, PUSH MTUMismatch else 1818 3. if LIE has undefined level OR this node's level is undefined 1819 OR this node is a leaf and remote level is lower than HAT OR 1820 (LIE's level is not leaf AND its difference is more than one 1821 from this node's level) then CLEANUP, PUSH UpdateZTPOffer, 1822 PUSH UnacceptableHeader else 1824 4. PUSH UpdateZTPOffer, construct temporary new neighbor 1825 structure with values from LIE, if no current neighbor exists 1826 then set neighbor to new neighbor, PUSH NewNeighbor event, 1827 CHECK_THREE_WAY else 1829 1. if current neighbor system ID differs from LIE's system ID 1830 then PUSH MultipleNeighbors else 1832 2. if current neighbor stored level differs from LIE's level 1833 then PUSH NeighborChangedLevel else 1835 3. if current neighbor stored IPv4/v6 address differs from 1836 LIE's address then PUSH NeighborChangedAddress else 1838 4. if any of neighbor's flood address port, name, local 1839 LinkID changed then PUSH NeighborChangedMinorFields 1841 5. CHECK_THREE_WAY 1843 * CHECK_THREE_WAY: if current state is OneWay do nothing else 1845 1. if LIE packet does not contain neighbor then if current state 1846 is ThreeWay then PUSH NeighborDroppedReflection else 1848 2. if packet reflects this system's ID and local port and state 1849 is ThreeWay then PUSH event ValidReflection else PUSH event 1850 MultipleNeighbors 1852 States: 1854 * OneWay: the initial state the FSM starts from. In this state the 1855 node has not seen any valid LIEs from a neighbor since the 1856 state was entered.
1858 * TwoWay: this state is entered when a node has seen a LIE from a 1859 neighbor but it did not contain its reflection. 1861 * ThreeWay: this state signifies that LIEs from a neighbor are seen 1862 with correct reflection. On achieving this state the link can be 1863 advertised in `neighbors` element in `NodeTIEElement`. 1865 * MultipleNeighborsWait: occurs normally when more than two nodes 1866 see each other on the same link or a remote node is quickly 1867 reconfigured or rebooted without regressing to `OneWay` first. 1868 Each occurrence of the event SHOULD generate a clear, corresponding 1869 notification to help operational deployments. 1871 Events: 1873 * TimerTick: one second timer tick, i.e. the event is generated for 1874 FSM by some external entity once a second. To be quietly ignored 1875 if transition does not exist. 1877 * LevelChanged: node's level has been changed by ZTP or 1878 configuration. This is provided by the ZTP FSM. 1880 * HALChanged: best HAL computed by ZTP has changed. This is 1881 provided by the ZTP FSM. 1883 * HATChanged: HAT computed by ZTP has changed. This is provided by 1884 the ZTP FSM. 1886 * HALSChanged: set of HAL offering systems computed by ZTP has 1887 changed. This is provided by the ZTP FSM. 1889 * LieRcvd: received LIE on the interface. 1891 * NewNeighbor: new neighbor seen on the received LIE. 1893 * ValidReflection: received reflection of this node from neighbor, 1894 i.e. `neighbor` element in `LIEPacket` corresponds to this node. 1896 * NeighborDroppedReflection: lost previously seen reflection from 1897 neighbor, i.e. `neighbor` element in `LIEPacket` does not 1898 correspond to this node or is not present. 1900 * NeighborChangedLevel: neighbor changed advertised level from the 1901 previously seen one. 1903 * NeighborChangedAddress: neighbor changed IP address, i.e. LIE has 1904 been received from an address different from previous LIEs.
Those 1905 changes will influence the sockets used to listen to TIEs, TIREs, 1906 TIDEs. 1908 * UnacceptableHeader: Unacceptable header seen. 1910 * MTUMismatch: MTU mismatched. 1912 * NeighborChangedMinorFields: minor fields changed in neighbor's 1913 LIE. 1915 * HoldtimeExpired: adjacency holddown timer expired. 1917 * MultipleNeighbors: more than one neighbor seen on interface 1919 * MultipleNeighborsDone: multiple neighbors timer expired. 1921 * FloodLeadersChanged: node's election algorithm determined new set 1922 of flood leaders. 1924 * SendLie: send a LIE out. 1926 * UpdateZTPOffer: update this node's ZTP offer. This is sent to the 1927 ZTP FSM. 1929 Actions: 1931 * on TimerTick in OneWay finishes in OneWay: PUSH SendLie event 1933 * on UnacceptableHeader in OneWay finishes in OneWay: no action 1934 * on LevelChanged in OneWay finishes in OneWay: update level with 1935 event value, PUSH SendLie event 1937 * on NeighborChangedMinorFields in OneWay finishes in OneWay: no 1938 action 1940 * on NeighborChangedLevel in OneWay finishes in OneWay: no action 1942 * on NewNeighbor in OneWay finishes in TwoWay: PUSH SendLie event 1944 * on HoldtimeExpired in OneWay finishes in OneWay: no action 1946 * on HALSChanged in OneWay finishes in OneWay: store HALS 1948 * on NeighborChangedAddress in OneWay finishes in OneWay: no action 1950 * on LieRcvd in OneWay finishes in OneWay: PROCESS_LIE 1952 * on ValidReflection in OneWay finishes in ThreeWay: no action 1954 * on SendLie in OneWay finishes in OneWay: SEND_LIE 1956 * on UpdateZTPOffer in OneWay finishes in OneWay: send offer to ZTP 1957 FSM 1959 * on HATChanged in OneWay finishes in OneWay: store HAT 1961 * on MultipleNeighbors in OneWay finishes in MultipleNeighborsWait: 1962 start multiple neighbors timer with interval 1963 `multiple_neighbors_lie_holdtime_multipler` * 1964 `default_lie_holdtime` 1966 * on MTUMismatch in OneWay finishes in OneWay: no action 1968 * on FloodLeadersChanged in OneWay finishes in OneWay: 
update 1969 `you_are_flood_repeater` LIE elements based on flood leader 1970 election results 1972 * on NeighborDroppedReflection in OneWay finishes in OneWay: no 1973 action 1975 * on HALChanged in OneWay finishes in OneWay: store new HAL 1977 * on NeighborChangedAddress in TwoWay finishes in OneWay: no action 1979 * on LieRcvd in TwoWay finishes in TwoWay: PROCESS_LIE 1980 * on UpdateZTPOffer in TwoWay finishes in TwoWay: send offer to ZTP 1981 FSM 1983 * on HoldtimeExpired in TwoWay finishes in OneWay: no action 1985 * on MTUMismatch in TwoWay finishes in OneWay: no action 1987 * on UnacceptableHeader in TwoWay finishes in OneWay: no action 1989 * on ValidReflection in TwoWay finishes in ThreeWay: no action 1991 * on SendLie in TwoWay finishes in TwoWay: SEND_LIE 1993 * on HATChanged in TwoWay finishes in TwoWay: store HAT 1995 * on HALChanged in TwoWay finishes in TwoWay: store new HAL 1997 * on LevelChanged in TwoWay finishes in TwoWay: update level with 1998 event value 2000 * on FloodLeadersChanged in TwoWay finishes in TwoWay: update 2001 `you_are_flood_repeater` LIE elements based on flood leader 2002 election results 2004 * on NewNeighbor in TwoWay finishes in MultipleNeighborsWait: PUSH 2005 SendLie event 2007 * on TimerTick in TwoWay finishes in TwoWay: PUSH SendLie event, if 2008 last valid LIE was received more than `holdtime` ago as advertised 2009 by neighbor then PUSH HoldtimeExpired event 2011 * on NeighborChangedLevel in TwoWay finishes in OneWay: no action 2013 * on MultipleNeighbors in TwoWay finishes in MultipleNeighborsWait: 2014 start multiple neighbors timer with interval 2015 `multiple_neighbors_lie_holdtime_multipler` * 2016 `default_lie_holdtime` 2018 * on HALSChanged in TwoWay finishes in TwoWay: store HALS 2020 * on NeighborChangedAddress in ThreeWay finishes in OneWay: no 2021 action 2023 * on ValidReflection in ThreeWay finishes in ThreeWay: no action 2025 * on HoldtimeExpired in ThreeWay finishes in OneWay: no action 2027 * on 
UnacceptableHeader in ThreeWay finishes in OneWay: no action 2028 * on NeighborDroppedReflection in ThreeWay finishes in TwoWay: no 2029 action 2031 * on HALChanged in ThreeWay finishes in ThreeWay: store new HAL 2033 * on MultipleNeighbors in ThreeWay finishes in 2034 MultipleNeighborsWait: start multiple neighbors timer with 2035 interval `multiple_neighbors_lie_holdtime_multipler` * 2036 `default_lie_holdtime` 2038 * on LevelChanged in ThreeWay finishes in OneWay: update level with 2039 event value 2041 * on HALSChanged in ThreeWay finishes in ThreeWay: store HALS 2043 * on TimerTick in ThreeWay finishes in ThreeWay: PUSH SendLie event, 2044 if last valid LIE was received more than `holdtime` ago as 2045 advertised by neighbor then PUSH HoldtimeExpired event 2047 * on HATChanged in ThreeWay finishes in ThreeWay: store HAT 2049 * on UpdateZTPOffer in ThreeWay finishes in ThreeWay: send offer to 2050 ZTP FSM 2052 * on LieRcvd in ThreeWay finishes in ThreeWay: PROCESS_LIE 2054 * on NeighborChangedLevel in ThreeWay finishes in OneWay: no action 2056 * on SendLie in ThreeWay finishes in ThreeWay: SEND_LIE 2058 * on FloodLeadersChanged in ThreeWay finishes in ThreeWay: update 2059 `you_are_flood_repeater` LIE elements based on flood leader 2060 election results, PUSH SendLie 2062 * on MTUMismatch in ThreeWay finishes in OneWay: no action 2064 * on HoldtimeExpired in MultipleNeighborsWait finishes in 2065 MultipleNeighborsWait: no action 2067 * on LieRcvd in MultipleNeighborsWait finishes in 2068 MultipleNeighborsWait: no action 2070 * on NeighborDroppedReflection in MultipleNeighborsWait finishes in 2071 MultipleNeighborsWait: no action 2073 * on MTUMismatch in MultipleNeighborsWait finishes in 2074 MultipleNeighborsWait: no action 2076 * on NeighborChangedBFDCapability in MultipleNeighborsWait finishes 2077 in MultipleNeighborsWait: no action 2079 * on LevelChanged in MultipleNeighborsWait finishes in OneWay: 2080 update level with event value 2082 * on SendLie in 
MultipleNeighborsWait finishes in 2083 MultipleNeighborsWait: no action 2085 * on UpdateZTPOffer in MultipleNeighborsWait finishes in 2086 MultipleNeighborsWait: send offer to ZTP FSM 2088 * on MultipleNeighborsDone in MultipleNeighborsWait finishes in 2089 OneWay: no action 2091 * on HATChanged in MultipleNeighborsWait finishes in 2092 MultipleNeighborsWait: store HAT 2094 * on NeighborChangedAddress in MultipleNeighborsWait finishes in 2095 MultipleNeighborsWait: no action 2097 * on HALSChanged in MultipleNeighborsWait finishes in 2098 MultipleNeighborsWait: store HALS 2100 * on HALChanged in MultipleNeighborsWait finishes in 2101 MultipleNeighborsWait: store new HAL 2103 * on MultipleNeighbors in MultipleNeighborsWait finishes in 2104 MultipleNeighborsWait: start multiple neighbors timer with 2105 interval `multiple_neighbors_lie_holdtime_multipler` * 2106 `default_lie_holdtime` 2108 * on FloodLeadersChanged in MultipleNeighborsWait finishes in 2109 MultipleNeighborsWait: update `you_are_flood_repeater` LIE 2110 elements based on flood leader election results 2112 * on ValidReflection in MultipleNeighborsWait finishes in 2113 MultipleNeighborsWait: no action 2115 * on TimerTick in MultipleNeighborsWait finishes in 2116 MultipleNeighborsWait: check MultipleNeighbors timer, if timer 2117 expired PUSH MultipleNeighborsDone 2119 * on UnacceptableHeader in MultipleNeighborsWait finishes in 2120 MultipleNeighborsWait: no action 2122 * on Entry into OneWay: CLEANUP 2124 4.2.3. Topology Exchange (TIE Exchange) 2126 4.2.3.1. Topology Information Elements 2128 Topology and reachability information in RIFT is conveyed by the 2129 means of TIEs. 2131 The TIE exchange mechanism uses the port indicated by each node in 2132 the LIE exchange as `flood_port` in `LIEPacket` and the interface on 2133 which the adjacency has been formed as destination. It SHOULD use 2134 TTL of 1 or 255 as well and set inter-network control precedence on 2135 according packets. 
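The transport parameters named in the paragraph above can be illustrated with a short sketch. This is not part of the specification: the helper names are hypothetical, and "inter-network control precedence" is assumed here to mean IP precedence 6, i.e. a TOS octet of 0xC0. The destination port would be the neighbor's `flood_port` learned from its `LIEPacket`.

```python
import socket

# Assumption: "inter-network control" is IP precedence 6 (binary 110).
INTERNETWORK_CONTROL = 6


def tos_for_precedence(precedence: int) -> int:
    """Place a 3-bit IP precedence into the top bits of the TOS octet."""
    if not 0 <= precedence <= 7:
        raise ValueError("precedence must be in 0..7")
    return precedence << 5


def flood_socket_options(ttl: int = 255) -> dict:
    """Return the TTL and TOS values a TIE flooding socket would use."""
    if ttl not in (1, 255):
        raise ValueError("the text recommends a TTL of 1 or 255")
    return {"ttl": ttl, "tos": tos_for_precedence(INTERNETWORK_CONTROL)}


def make_flood_socket(ttl: int = 255) -> socket.socket:
    """Create an IPv4 UDP socket with the options applied (hypothetical helper)."""
    opts = flood_socket_options(ttl)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, opts["ttl"])
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, opts["tos"])
    return sock
```

A caller would then send encoded TIEs with `sock.sendto(payload, (neighbor_address, flood_port))`, where both values come from the LIE exchange on that interface.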
2137 TIEs contain sequence numbers, lifetimes and a type. Each type has 2138 ample identifying number space and information is spread across 2139 possibly many TIEs of a certain type by the means of a hash function 2140 that an implementation can individually determine. One extreme 2141 design choice is a prefix per TIE which leads to more BGP-like 2142 behavior where small increments are only advertised on route changes 2143 vs. deploying with dense prefix packing into few TIEs leading to more 2144 traditional IGP trade-off with fewer TIEs. An implementation may 2145 even rehash prefix to TIE mapping at any time at the cost of 2146 significant amount of re-advertisements of TIEs. 2148 More information about the TIE structure can be found in the schema 2149 in Appendix B starting with `TIEPacket` root. 2151 4.2.3.2. Southbound and Northbound TIE Representation 2153 A central concept of RIFT is that each node represents itself 2154 differently depending on the direction in which it is advertising 2155 information. More precisely, a spine node represents two different 2156 databases over its adjacencies depending whether it advertises TIEs 2157 to the north or to the south/east-west. Those differing TIE 2158 databases are called either south- or northbound (South TIEs and 2159 North TIEs) depending on the direction of distribution. 2161 The North TIEs hold all of the node's adjacencies and local prefixes 2162 while the South TIEs hold only all of the node's adjacencies, the 2163 default prefix with necessary disaggregated prefixes and local 2164 prefixes. Section 4.2.5 explains further details. 2166 The TIE types are mostly symmetric in both directions and Table 2 2167 provides a quick reference to main TIE types including direction and 2168 their function. The direction itself is carried in `direction` of 2169 `TIEID` schema element. 
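The two representations described in this section can be sketched as follows. The sketch is illustrative only: the `Node` class and both functions are hypothetical, prefixes are plain strings, and the node is assumed to originate both default prefixes southbound.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: the node originates IPv4 and IPv6 defaults southbound.
DEFAULT_PREFIXES = ["0.0.0.0/0", "::/0"]


@dataclass
class Node:
    adjacencies: List[str]
    local_prefixes: List[str]
    disaggregated_prefixes: List[str] = field(default_factory=list)


def north_representation(node: Node) -> Dict[str, List[str]]:
    """North TIEs: all adjacencies plus the node's local prefixes."""
    return {
        "node": list(node.adjacencies),
        "prefixes": list(node.local_prefixes),
    }


def south_representation(node: Node) -> Dict[str, List[str]]:
    """South TIEs: all adjacencies, defaults, disaggregated and local prefixes."""
    return {
        "node": list(node.adjacencies),
        "prefixes": DEFAULT_PREFIXES
        + list(node.disaggregated_prefixes)
        + list(node.local_prefixes),
    }
```

Both views carry the same adjacency set; only the prefix content differs with the direction of advertisement.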
2171  +=========================+=================================+
2172  | TIE-Type                | Content                         |
2173  +=========================+=================================+
2174  | Node North TIE          | node properties and adjacencies |
2175  +-------------------------+---------------------------------+
2176  | Node South TIE          | same content as node North TIE  |
2177  +-------------------------+---------------------------------+
2178  | Prefix North TIE        | contains nodes' directly        |
2179  |                         | reachable prefixes              |
2180  +-------------------------+---------------------------------+
2181  | Prefix South TIE        | contains originated defaults    |
2182  |                         | and directly reachable prefixes |
2183  +-------------------------+---------------------------------+
2184  | Positive Disaggregation | contains disaggregated prefixes |
2185  | South TIE               |                                 |
2186  +-------------------------+---------------------------------+
2187  | Negative Disaggregation | contains special, negatively    |
2188  | South TIE               | disaggregated prefixes to       |
2189  |                         | support multi-plane designs     |
2190  +-------------------------+---------------------------------+
2191  | External Prefix North   | contains external prefixes      |
2192  | TIE                     |                                 |
2193  +-------------------------+---------------------------------+
2194  | Key-Value North TIE     | contains nodes' northbound KVs  |
2195  +-------------------------+---------------------------------+
2196  | Key-Value South TIE     | contains nodes' southbound KVs  |
2197  +-------------------------+---------------------------------+

2199  Table 2: TIE Types

2201  As an example illustrating databases holding both representations,
2202  the topology in Figure 2 with the optional link between spine 111 and
2203  spine 112 (so that the flooding on an East-West link can be shown) is
2204  considered. Unnumbered interfaces are implicitly assumed and, for
2205  simplicity, the key value elements which may be included in their
2206  South TIEs or North TIEs are not shown. First, Figure 15 shows the
2207  TIEs generated by some nodes.
2209  ToF 21 South TIEs:
2210    Node South TIE:
2211      NodeElement(level=2, neighbors((Spine 111, level 1, cost 1),
2212        (Spine 112, level 1, cost 1), (Spine 121, level 1, cost 1),
2213        (Spine 122, level 1, cost 1)))
2214    Prefix South TIE:
2215      SouthPrefixesElement(prefixes((0/0, cost 1), (::/0, cost 1)))

2217  Spine 111 South TIEs:
2218    Node South TIE:
2220      NodeElement(level=1, neighbors((ToF 21, level 2, cost 1,
2221        links(...)),
2222        (ToF 22, level 2, cost 1, links(...)),
2223        (Spine 112, level 1, cost 1, links(...)),
2224        (Leaf111, level 0, cost 1, links(...)),
2225        (Leaf112, level 0, cost 1, links(...))))
2226    Prefix South TIE:
2227      SouthPrefixesElement(prefixes((0/0, cost 1), (::/0, cost 1)))

2229  Spine 111 North TIEs:
2230    Node North TIE:
2231      NodeElement(level=1,
2232        neighbors((ToF 21, level 2, cost 1, links(...)),
2233        (ToF 22, level 2, cost 1, links(...)),
2234        (Spine 112, level 1, cost 1, links(...)),
2235        (Leaf111, level 0, cost 1, links(...)),
2236        (Leaf112, level 0, cost 1, links(...))))
2237    Prefix North TIE:
2238      NorthPrefixesElement(prefixes(Spine 111.loopback))

2240  Spine 121 South TIEs:
2241    Node South TIE:
2242      NodeElement(level=1, neighbors((ToF 21, level 2, cost 1),
2243        (ToF 22, level 2, cost 1), (Leaf121, level 0, cost 1),
2244        (Leaf122, level 0, cost 1)))
2245    Prefix South TIE:
2246      SouthPrefixesElement(prefixes((0/0, cost 1), (::/0, cost 1)))

2248  Spine 121 North TIEs:
2249    Node North TIE:
2250      NodeElement(level=1,
2251        neighbors((ToF 21, level 2, cost 1, links(...)),
2252        (ToF 22, level 2, cost 1, links(...)),
2253        (Leaf121, level 0, cost 1, links(...)),
2254        (Leaf122, level 0, cost 1, links(...))))
2255    Prefix North TIE:
2256      NorthPrefixesElement(prefixes(Spine 121.loopback))

2258  Leaf112 North TIEs:
2259    Node North TIE:
2260      NodeElement(level=0,
2261        neighbors((Spine 111, level 1, cost 1, links(...)),
2262        (Spine 112, level 1, cost 1, links(...))))
2263    Prefix North TIE:
2264      NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112,
2265        Prefix_MH))
2267 Figure 15: Example TIEs Generated in a 2 Level Spine-and-Leaf 2268 Topology 2270 It may not be obvious here why the node South TIEs 2271 contain all the adjacencies of the corresponding node. This will be 2272 necessary for algorithms further elaborated on in Section 4.2.3.9 and 2273 Section 4.3.7. 2275 For node TIEs to carry more adjacencies than fit into an MTU, the 2276 element `neighbors` may contain a different set of neighbors in each 2277 TIE. Those disjoint sets of neighbors MUST be joined during 2278 the corresponding computation. Nevertheless, if across multiple node 2279 TIEs 2281 1. `capabilities` do not match *or* 2283 2. `flags` values do not match *or* 2285 3. same neighbor repeats in multiple TIEs with different values 2287 the behavior is undefined and a warning SHOULD be generated after a 2288 period of time. 2290 The element `miscabled_links` SHOULD be repeated in every node TIE, 2291 otherwise the behavior is undefined. 2293 A top of fabric node MUST include in the 2294 `same_plane_tofs` element of its node TIEs all the other ToFs it sees through 2295 reflection. To prevent MTU overrun problems, multiple node TIEs can 2296 carry disjoint sets of ToFs which can be joined to form a single set. 2297 This element allows nodes in other planes that are on the multi-plane 2298 ring with this node to see the complete plane and with that all ToFs 2299 in a multi-plane fabric are aware of all other ToFs which can be used 2300 further to form input to complex multi-plane elections. 2302 Different TIE types are carried in `TIEElement`. Schema enum 2303 `common.TIETypeType` in `TIEID` indicates which elements MUST be 2304 present in the `TIEElement`. In case of mismatch the unexpected 2305 elements MUST be ignored. If an expected element is missing from the 2306 TIE an error MUST be reported and the TIE MUST be ignored.
The 2307 elements `positive_disaggregation_prefixes` and 2308 `positive_external_disaggregation_prefixes` MUST be advertised 2309 southbound only and MUST be ignored in North TIEs. The element 2310 `negative_disaggregation_prefixes` MUST be aggregated and propagated 2311 according to Section 4.2.5.2 southwards towards lower levels to heal 2312 pathological upper level partitioning, otherwise blackholes may occur 2313 in multiplane fabrics. It MUST NOT be advertised within a North TIE 2314 and MUST be ignored if received in one. 2316 4.2.3.3. Flooding 2318 The mechanism used to distribute TIEs is the well-known (albeit 2319 modified in several respects to take advantage of Fat Tree topology) 2320 flooding mechanism used in link-state protocols. Although flooding 2321 is initially more demanding to implement, it avoids many problems with 2322 the update style used in diffused computation by distance vector 2323 protocols. However, since flooding tends to present a significant 2324 burden in large, densely meshed topologies (Fat Trees being 2325 unfortunately such a topology) RIFT provides as a solution a close-to- 2326 optimal global flood reduction and load balancing optimization in 2327 Section 4.2.3.9. 2329 As described before, TIEs themselves are transported over UDP with 2330 the ports indicated in the LIE exchanges and using the destination 2331 address on which the LIE adjacency has been formed. For unnumbered 2332 IPv4 interfaces the same considerations apply as in other link-state 2333 routing protocols and are largely implementation dependent. 2335 TIEs are uniquely identified by the `TIEID` schema element. `TIEID` space 2336 is a total order achieved by comparing the elements in the sequence 2337 defined in the element and comparing each value as an unsigned 2338 integer of corresponding length. TIEs contain a `seq_nr` element to 2339 distinguish newer versions of the same TIE. The `TIEHeader` also carries 2340 `origination_time` and `origination_lifetime`.
Field 2341 `origination_time` contains the absolute timestamp when the TIE was 2342 generated. Field `origination_lifetime` carries the lifetime assigned when the 2343 TIE was generated. Those are normally disregarded during comparison 2344 and carried purely for debugging/security purposes if present. They 2345 may be used for comparison of last resort to differentiate otherwise 2346 equal TIEs and they can be used on fabrics with synchronized clocks to 2347 prevent lifetime modification attacks. 2349 Remaining lifetime counts down to 0 from origination lifetime. TIEs 2350 with lifetimes differing by less than `lifetime_diff2ignore` MUST be 2351 considered EQUAL (if all other fields are equal). This constant MUST 2352 be larger than `purge_lifetime` to avoid retransmissions. 2354 All valid TIE types are defined in `TIETypeType`. This enum indicates 2355 what TIE type the TIE is carrying. In case the value is not known to 2356 the receiver, the TIE MUST be re-flooded. This allows for future 2357 extensions of the protocol within the same major schema with types 2358 opaque to some nodes with some restrictions. 2360 4.2.3.3.1. Normative Flooding Procedures 2362 On reception of a TIE with an undefined level value in the packet 2363 header the node MAY issue a warning and indiscriminately discard the 2364 packet. Such packets can be useful however to establish e.g. via 2365 `instance_name`, `name` and `originator` elements in `LIEPacket` 2366 whether the cabling of the node fulfills expectations, even before 2367 ZTP procedures determine levels across the topology. 2369 This section specifies the precise, normative flooding mechanism and 2370 can be omitted unless the reader is pursuing an implementation of the 2371 protocol or is looking for a deep understanding of the underlying information 2372 distribution mechanism. 2374 Flooding Procedures are described in terms of a flooding state of an 2375 adjacency and resulting operations on it driven by packet arrivals.
2376 The FSM itself has basically just a single state and is not well 2377 suited to represent the behavior. An implementation MUST either 2378 implement the given procedures in a verbatim manner or behave on the 2379 wire in the same way as the provided normative procedures of this 2380 paragraph. 2382 RIFT does not specify any kind of flood rate limiting since such 2383 specifications always assume particular points in available 2384 technology speeds and feeds and those points are shifting at an ever 2385 faster rate (speed of light holding for the moment). 2387 To help with adjustment of flooding speeds the encoded packets 2388 provide hints to react accordingly to losses or overruns via 2389 `you_are_sending_too_quickly` in `LIEPacket` and `Packet Number` in 2390 the security envelope described in Section 4.4.3. Flooding of all 2391 corresponding topology exchange elements SHOULD be performed at the highest 2392 feasible rate whereas the rate of transmission MUST be throttled by 2393 reacting to packet elements and adequate features of the system such 2394 as queue lengths or congestion indications in the protocol 2395 packets. 2397 A node SHOULD NOT send out any topology information elements if the 2398 adjacency is not in a "ThreeWay" state. No further tightening of 2399 this rule as to e.g. sequence is possible due to possible link 2400 buffering and re-ordering of LIEs and TIEs/TIDEs/TIREs in a real 2401 implementation for e.g. performance purposes. 2403 A node MUST drop any received TIEs/TIDEs/TIREs unless it is in 2404 ThreeWay state. 2406 TIDEs and TIREs MUST NOT be re-flooded the way TIEs of other nodes 2407 are; they MUST always be generated by the node itself and cross only to the 2408 neighboring node. 2410 4.2.3.3.1.1. FloodState Structure per Adjacency 2412 Conceptually, the structure contains the following elements on each 2413 adjacency.
The word collection or queue indicates a set of elements 2414 that can be iterated over: 2416 TIES_TX: 2417 Collection containing all the TIEs to transmit on the adjacency. 2419 TIES_ACK: 2420 Collection containing all the TIEs that have to be acknowledged on 2421 the adjacency. 2423 TIES_REQ: 2424 Collection containing all the TIE headers that have to be 2425 requested on the adjacency. 2427 TIES_RTX: 2428 Collection containing all TIEs that need retransmission with the 2429 corresponding time to retransmit. 2431 The following terms are used for well known elements and procedures 2432 operating on this structure: 2434 TIE: 2435 Describes either a full RIFT TIE or accordingly just the 2436 `TIEHeader` or `TIEID` equivalent as defined in Appendix B.3. The 2437 intended meaning is unambiguously contained in the context of 2438 each algorithm. 2440 is_flood_reduced(TIE): 2441 returns whether a TIE can be flood reduced or not. 2443 is_tide_entry_filtered(TIE): 2444 returns whether a header should be propagated in TIDE according to 2445 flooding scopes. 2447 is_request_filtered(TIE): 2448 returns whether a TIE request should be propagated to neighbor or 2449 not according to flooding scopes. 2451 is_flood_filtered(TIE): 2452 returns whether a TIE should be flooded to neighbor or not 2453 according to flooding scopes. 2455 try_to_transmit_tie(TIE): 2456 A. if not is_flood_filtered(TIE) then 2458 1. remove TIE from TIES_RTX if present 2460 2. if TIE" with same key is found on TIES_ACK then 2462 a. if TIE" is same or newer than TIE do nothing else 2464 b. remove TIE" from TIES_ACK and add TIE to TIES_TX 2466 3. else insert TIE into TIES_TX 2468 ack_tie(TIE): 2469 remove TIE from all collections and then insert TIE into TIES_ACK. 2471 tie_been_acked(TIE): 2472 remove TIE from all collections. 2474 remove_from_all_queues(TIE): 2475 same as `tie_been_acked`. 2477 request_tie(TIE): 2478 if not is_request_filtered(TIE) then remove_from_all_queues(TIE) 2479 and add to TIES_REQ.
2481 move_to_rtx_list(TIE): 2482 remove TIE from TIES_TX and then add to TIES_RTX using TIE 2483 retransmission interval. 2485 clear_requests(TIEs): 2486 remove all TIEs from TIES_REQ. 2488 bump_own_tie(TIE): 2489 for a self-originated TIE, originate an empty TIE or re-generate it with a 2490 version number higher than the one in TIE. 2492 The collections SHOULD be served with the following priorities if the 2493 system cannot process all the collections in real time: 2495 1. Elements on TIES_ACK should be processed with highest priority 2497 2. TIES_TX 2499 3. TIES_REQ and TIES_RTX 2501 4.2.3.3.1.2. TIDEs 2503 `TIEID` and `TIEHeader` space forms a strict total order (modulo 2504 incomparable sequence numbers as explained in Appendix A in the very 2505 unlikely event that can occur if a TIE is "stuck" in a part of a 2506 network while the originator reboots and reissues TIEs many times to 2507 the point its sequence# rolls over and forms incomparable distance to 2508 the "stuck" copy) which implies that a comparison relation is 2509 possible between two elements. With that it is implicitly possible 2510 to compare TIEs, TIEHeaders and TIEIDs to each other whereas the 2511 shortest viable key is always implied. 2513 When generating and sending TIDEs an implementation SHOULD ensure 2514 that enough bandwidth is left to send elements from other queues of 2515 the `FloodState` structure. 2517 4.2.3.3.1.2.1. TIDE Generation 2519 As given by timer constant, periodically generate TIDEs by: 2521 NEXT_TIDE_ID: ID of next TIE to be sent in TIDE. 2523 TIDE_START: Beginning of TIDE packet range. 2525 a. NEXT_TIDE_ID = MIN_TIEID 2527 b. while NEXT_TIDE_ID not equal to MAX_TIEID do 2529 1. TIDE_START = NEXT_TIDE_ID 2531 2. HEADERS = At most TIRDEs_PER_PKT headers in TIEDB starting at 2532 NEXT_TIDE_ID or higher that SHOULD be filtered by 2533 is_tide_entry_filtered and MUST either have a lifetime left > 2534 0 or have no content 2536 3.
if HEADERS is empty then START = MIN_TIEID else START = first 2537 element in HEADERS 2539 4. if HEADERS' size less than TIRDEs_PER_PKT then END = 2540 MAX_TIEID else END = last element in HEADERS 2542 5. send *sorted* HEADERS as TIDE setting START and END as its 2543 range 2545 6. NEXT_TIDE_ID = END 2547 The constant `TIRDEs_PER_PKT` SHOULD be computed per interface and 2548 used by the implementation to limit the amount of TIE headers per 2549 TIDE so the sent TIDE PDU does not exceed interface MTU. 2551 TIDE PDUs SHOULD be spaced on sending to prevent packet drops. 2553 4.2.3.3.1.2.2. TIDE Processing 2555 On reception of TIDEs the following processing is performed: 2557 TXKEYS: Collection of TIE Headers to be sent after processing of 2558 the packet 2560 REQKEYS: Collection of TIEIDs to be requested after processing of 2561 the packet 2563 CLEARKEYS: Collection of TIEIDs to be removed from flood state 2564 queues 2566 LASTPROCESSED: Last processed TIEID in TIDE 2568 DBTIE: TIE in the LSDB if found 2570 a. LASTPROCESSED = TIDE.start_range 2572 b. for every HEADER in TIDE do 2574 1. DBTIE = find HEADER in current LSDB 2576 2. if HEADER < LASTPROCESSED then report error and reset 2577 adjacency and return 2579 3. put all TIEs in LSDB where (TIE.HEADER > LASTPROCESSED and 2580 TIE.HEADER < HEADER) into TXKEYS 2582 4. LASTPROCESSED = HEADER 2584 5. if DBTIE not found then 2586 I) if originator is this node then bump_own_tie 2588 II) else put HEADER into REQKEYS 2590 6. if DBTIE.HEADER < HEADER then 2592 I) if originator is this node then bump_own_tie else 2593 i. if this is a North TIE header from a northbound 2594 neighbor then override DBTIE in LSDB with HEADER 2596 ii. else put HEADER into REQKEYS 2598 7. if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS 2600 8. if DBTIE.HEADER = HEADER then 2602 I) if DBTIE has content already then put DBTIE.HEADER into 2603 CLEARKEYS 2605 II) else put HEADER into REQKEYS 2607 c. 
put all TIEs in LSDB where (TIE.HEADER > LASTPROCESSED and 2608 TIE.HEADER <= TIDE.end_range) into TXKEYS 2610 d. for all TIEs in TXKEYS try_to_transmit_tie(TIE) 2612 e. for all TIEs in REQKEYS request_tie(TIE) 2614 f. for all TIEs in CLEARKEYS remove_from_all_queues(TIE) 2616 4.2.3.3.1.3. TIREs 2618 4.2.3.3.1.3.1. TIRE Generation 2620 Elements from both TIES_REQ and TIES_ACK MUST be collected and sent 2621 out as fast as feasible as TIREs. When sending TIREs with elements 2622 from TIES_REQ the `remaining_lifetime` field in 2623 `TIEHeaderWithLifeTime` MUST be set to 0 to force reflooding from the 2624 neighbor even if the TIEs seem to be the same. 2626 4.2.3.3.1.3.2. TIRE Processing 2628 On reception of TIREs the following processing is performed: 2630 TXKEYS: Collection of TIE Headers to be sent after processing of 2631 the packet 2633 REQKEYS: Collection of TIEIDs to be requested after processing of 2634 the packet 2636 ACKKEYS: Collection of TIEIDs that have been acked 2638 DBTIE: TIE in the LSDB if found 2640 a. for every HEADER in TIRE do 2641 1. DBTIE = find HEADER in current LSDB 2643 2. if DBTIE not found then do nothing 2645 3. if DBTIE.HEADER < HEADER then put HEADER into REQKEYS 2647 4. if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS 2649 5. if DBTIE.HEADER = HEADER then put DBTIE.HEADER into ACKKEYS 2651 b. for all TIEs in TXKEYS try_to_transmit_tie(TIE) 2653 c. for all TIEs in REQKEYS request_tie(TIE) 2655 d. for all TIEs in ACKKEYS tie_been_acked(TIE) 2657 4.2.3.3.1.4. TIEs Processing on Flood State Adjacency 2659 On reception of TIEs the following processing is performed: 2661 ACKTIE: TIE to acknowledge 2663 TXTIE: TIE to transmit 2665 DBTIE: TIE in the LSDB if found 2667 a. DBTIE = find TIE in current LSDB 2669 b. if DBTIE not found then 2671 1. if originator is this node then bump_own_tie with a short 2672 remaining lifetime 2674 2. else insert TIE into LSDB and ACKTIE = TIE 2676 else 2678 1. if DBTIE.HEADER = TIE.HEADER then 2680 i.
if DBTIE has content already then ACKTIE = TIE 2682 ii. else process like the "DBTIE.HEADER < TIE.HEADER" case 2684 2. if DBTIE.HEADER < TIE.HEADER then 2686 i. if originator is this node then bump_own_tie 2688 ii. else insert TIE into LSDB and ACKTIE = TIE 2690 3. if DBTIE.HEADER > TIE.HEADER then 2692 i. if DBTIE has content already then TXTIE = DBTIE 2694 ii. else ACKTIE = DBTIE 2696 c. if TXTIE is set then try_to_transmit_tie(TXTIE) 2698 d. if ACKTIE is set then ack_tie(TIE) 2700 4.2.3.3.1.5. Sending TIEs 2702 On a periodic basis all TIEs with lifetime left > 0 MUST be sent out 2703 on the adjacency, removed from the TIES_TX list and requeued onto the 2704 TIES_RTX list. 2706 4.2.3.3.1.6. TIEs Processing In LSDB 2708 The Link State Database can be considered to be a switchboard that 2709 does not need any flooding procedures but can be given versions of 2710 TIEs by peers. Subsequently, after version tie-breaking by LSDB, a 2711 peer receives from the LSDB the newest versions of TIEs received by other 2712 peers and processes them (without any filtering) just like receiving 2713 TIEs from its remote peer. Such a publisher model can be implemented 2714 in many ways, either in a single thread of execution or in parallel 2715 threads. 2717 LSDB can be logically considered as the entity aging out TIEs, i.e. 2718 being responsible for discarding TIEs that are stored longer than 2719 `remaining_lifetime` on their reception. 2721 LSDB is also expected to periodically re-originate the node's own 2722 TIEs. It is recommended to originate at an interval significantly 2723 shorter than `default_lifetime` to prevent TIE expiration by other 2724 nodes in the network which can lead to instabilities. 2726 4.2.3.4. TIE Flooding Scopes 2728 In a somewhat analogous fashion to link-local, area and domain 2729 flooding scopes, RIFT defines several complex "flooding scopes" 2730 depending on the direction and type of TIE propagated.
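The TIE reception procedure of Section 4.2.3.3.1.4, together with the `try_to_transmit_tie` and `ack_tie` helpers of Section 4.2.3.3.1.1, can be sketched as below. This is a simplified illustration, not the normative procedure: the `Tie` and `FloodState` classes are hypothetical, header comparison is reduced to comparing `seq_nr`, flood-scope filtering is omitted, and step d is interpreted as acknowledging ACKTIE.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class Tie:
    key: str              # stands in for the TIEID (hypothetical)
    seq_nr: int           # stands in for the full header comparison
    originator: int
    has_content: bool = True


@dataclass
class FloodState:
    ties_tx: Dict[str, Tie] = field(default_factory=dict)
    ties_ack: Dict[str, Tie] = field(default_factory=dict)
    ties_req: Dict[str, Tie] = field(default_factory=dict)
    ties_rtx: Dict[str, Tie] = field(default_factory=dict)

    def remove_from_all_queues(self, tie: Tie) -> None:
        for queue in (self.ties_tx, self.ties_ack, self.ties_req, self.ties_rtx):
            queue.pop(tie.key, None)

    def ack_tie(self, tie: Tie) -> None:
        self.remove_from_all_queues(tie)
        self.ties_ack[tie.key] = tie

    def try_to_transmit_tie(self, tie: Tie) -> None:
        # Flood-scope filtering (is_flood_filtered) is left out of this sketch.
        self.ties_rtx.pop(tie.key, None)
        pending = self.ties_ack.get(tie.key)
        if pending is not None:
            if pending.seq_nr >= tie.seq_nr:
                return  # same or newer version already awaits acknowledgment
            del self.ties_ack[tie.key]
        self.ties_tx[tie.key] = tie


def process_received_tie(lsdb: Dict[str, Tie], state: FloodState, own_id: int,
                         tie: Tie, bump_own_tie: Callable[[Tie], None]) -> None:
    ack = tx = None
    db_tie = lsdb.get(tie.key)
    same_but_empty = (db_tie is not None and db_tie.seq_nr == tie.seq_nr
                      and not db_tie.has_content)
    if db_tie is None or db_tie.seq_nr < tie.seq_nr or same_but_empty:
        # Unknown or newer TIE: bump our own, otherwise store and acknowledge.
        if tie.originator == own_id:
            bump_own_tie(tie)
        else:
            lsdb[tie.key] = tie
            ack = tie
    elif db_tie.seq_nr == tie.seq_nr:
        ack = tie
    elif db_tie.has_content:
        tx = db_tie       # LSDB holds a newer version: send it back
    else:
        ack = db_tie
    if tx is not None:
        state.try_to_transmit_tie(tx)
    if ack is not None:
        state.ack_tie(ack)
```

Keying each collection by TIEID keeps at most one version of a TIE per queue, which is what allows `remove_from_all_queues` to be a plain key removal.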
2732 Every North TIE is flooded northbound, providing a node at a given 2733 level with the complete topology of the Clos or Fat Tree network that 2734 is reachable southwards of it, including all specific prefixes. This 2735 means that a packet received from a node at the same or lower level 2736 whose destination is covered by one of those specific prefixes will 2737 be routed directly towards the node advertising that prefix rather 2738 than sending the packet to a node at a higher level. 2740 A node's Node South TIEs, consisting of all node's adjacencies and 2741 prefix South TIEs limited to those related to default IP prefix and 2742 disaggregated prefixes, are flooded southbound in order to allow the 2743 nodes one level down to see connectivity of the higher level as well 2744 as reachability to the rest of the fabric. In order to allow an E-W 2745 disconnected node in a given level to receive the South TIEs of other 2746 nodes at its level, every *NODE* South TIE is "reflected" northbound 2747 to the level from which it was received. It should be noted that East- 2748 West links are included in South TIE flooding (except at ToF level); 2749 those TIEs need to be flooded to satisfy algorithms in Section 4.2.4. 2750 In that way nodes at the same level can learn about each other without a 2751 lower level except in case of leaf level. The precise, normative 2752 flooding scopes are given in Table 3. Those rules govern as well 2753 what SHOULD be included in TIDEs on the adjacency. Again, East-West 2754 flooding scopes are identical to South flooding scopes except in case 2755 of ToF East-West links (rings) which are basically performing 2756 northbound flooding. 2758 Node South TIE "south reflection" supports positive 2759 disaggregation on failures as described in Section 4.2.5 and 2760 flooding reduction in Section 4.2.3.9.
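The three TIE rows of Table 3 can be read as a single predicate over the TIE, the local node and the direction of the neighbor. The sketch below is one such reading and is illustrative only: the class and function names are hypothetical, and the TIDE and TIRE rows are not covered.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    SOUTH = "south"
    NORTH = "north"
    EAST_WEST = "east-west"


@dataclass(frozen=True)
class ScopedTie:
    is_north: bool          # North TIE (True) or South TIE (False)
    is_node: bool           # node TIE (True) or any other type (False)
    originator: str
    originator_level: int


@dataclass(frozen=True)
class LocalNode:
    system_id: str
    level: int
    is_tof: bool


def should_flood(tie: ScopedTie, me: LocalNode, neighbor_id: str,
                 direction: Direction) -> bool:
    """Return True if Table 3 allows flooding `tie` to the given neighbor."""
    if tie.is_north:
        # Row "all North TIEs".
        if direction is Direction.SOUTH:
            return False
        if direction is Direction.NORTH:
            return True
        return me.is_tof
    if tie.is_node:
        # Row "node South TIE".
        if direction is Direction.SOUTH:
            return tie.originator_level == me.level
        if direction is Direction.NORTH:
            return tie.originator_level > me.level
        return not me.is_tof
    # Row "non-node South TIE".
    self_originated = tie.originator == me.system_id
    if direction is Direction.SOUTH:
        return self_originated
    if direction is Direction.NORTH:
        return tie.originator == neighbor_id
    return self_originated and not me.is_tof
```

For example, with spine 111 as the local node, a node South TIE originated by ToF 22 is reflected north to ToF 21 but is not flooded further south to Leaf111.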
2762 +===========+======================+==============+=================+ 2763 | Type / | South | North | East-West | 2764 | Direction | | | | 2765 +===========+======================+==============+=================+ 2766 | node | flood if level of | flood if | flood only if | 2767 | South TIE | originator is | level of | this node is | 2768 | | equal to this | originator | not ToF | 2769 | | node | is higher | | 2770 | | | than this | | 2771 | | | node | | 2772 +-----------+----------------------+--------------+-----------------+ 2773 | non-node | flood self- | flood only | flood only if | 2774 | South TIE | originated only | if neighbor | self-originated | 2775 | | | is | and this node | 2776 | | | originator | is not ToF | 2777 | | | of TIE | | 2778 +-----------+----------------------+--------------+-----------------+ 2779 | all North | never flood | flood always | flood only if | 2780 | TIEs | | | this node is | 2781 | | | | ToF | 2782 +-----------+----------------------+--------------+-----------------+ 2783 | TIDE | include at least | include at | if this node is | 2784 | | all non-self | least all | ToF then | 2785 | | originated North | node South | include all | 2786 | | TIE headers and | TIEs and all | North TIEs, | 2787 | | self-originated | South TIEs | otherwise only | 2788 | | South TIE headers | originated | self-originated | 2789 | | and node South | by peer and | TIEs | 2790 | | TIEs of nodes at | all North | | 2791 | | same level | TIEs | | 2792 +-----------+----------------------+--------------+-----------------+ 2793 | TIRE as | request all North | request all | if this node is | 2794 | Request | TIEs and all | South TIEs | ToF then apply | 2795 | | peer's self- | | North scope | 2796 | | originated TIEs | | rules, | 2797 | | and all node | | otherwise South | 2798 | | South TIEs | | scope rules | 2799 +-----------+----------------------+--------------+-----------------+ 2800 | TIRE as | Ack all received | Ack all | Ack all | 2801 | Ack | TIEs 
| received | received TIEs | 2802 | | | TIEs | | 2803 +-----------+----------------------+--------------+-----------------+ 2805 Table 3: Normative Flooding Scopes 2807 If the TIDE includes additional TIE headers beside the ones 2808 specified, the receiving neighbor must apply according filter to the 2809 received TIDE strictly and MUST NOT request the extra TIE headers 2810 that were not allowed by the flooding scope rules in its direction. 2812 As an example to illustrate these rules, consider using the topology 2813 in Figure 2, with the optional link between spine 111 and spine 112, 2814 and the associated TIEs given in Figure 15. The flooding from 2815 particular nodes of the TIEs is given in Table 4. 2817 +============+==========+===========================================+ 2818 | Local | Neighbor | TIEs Flooded from Local to Neighbor Node | 2819 | Node | Node | | 2820 +============+==========+===========================================+ 2821 | Leaf111 | Spine | Leaf111 North TIEs, Spine 111 node South | 2822 | | 112 | TIE | 2823 +------------+----------+-------------------------------------------+ 2824 | Leaf111 | Spine | Leaf111 North TIEs, Spine 112 node South | 2825 | | 111 | TIE | 2826 +------------+----------+-------------------------------------------+ 2827 | ... | ... | ... 
| 2828 +------------+----------+-------------------------------------------+ 2829 | Spine | Leaf111 | Spine 111 South TIEs | 2830 | 111 | | | 2831 +------------+----------+-------------------------------------------+ 2832 | Spine | Leaf112 | Spine 111 South TIEs | 2833 | 111 | | | 2834 +------------+----------+-------------------------------------------+ 2835 | Spine | Spine | Spine 111 South TIEs | 2836 | 111 | 112 | | 2837 +------------+----------+-------------------------------------------+ 2838 | Spine | ToF 21 | Spine 111 North TIEs, Leaf111 North TIEs, | 2839 | 111 | | Leaf112 North TIEs, ToF 22 node South TIE | 2840 +------------+----------+-------------------------------------------+ 2841 | Spine | ToF 22 | Spine 111 North TIEs, Leaf111 North TIEs, | 2842 | 111 | | Leaf112 North TIEs, ToF 21 node South TIE | 2843 +------------+----------+-------------------------------------------+ 2844 | ... | ... | ... | 2845 +------------+----------+-------------------------------------------+ 2846 | ToF 21 | Spine | ToF 21 South TIEs | 2847 | | 111 | | 2848 +------------+----------+-------------------------------------------+ 2849 | ToF 21 | Spine | ToF 21 South TIEs | 2850 | | 112 | | 2851 +------------+----------+-------------------------------------------+ 2852 | ToF 21 | Spine | ToF 21 South TIEs | 2853 | | 121 | | 2854 +------------+----------+-------------------------------------------+ 2855 | ToF 21 | Spine | ToF 21 South TIEs | 2856 | | 122 | | 2857 +------------+----------+-------------------------------------------+ 2858 | ... | ... | ... | 2859 +------------+----------+-------------------------------------------+ 2861 Table 4: Flooding some TIEs from example topology 2863 4.2.3.5. 'Flood Only Node TIEs' Bit 2865 RIFT includes an optional ECN (Explicit Congestion Notification) 2866 mechanism to prevent "flooding inrush" on restart or bring-up with 2867 many southbound neighbors. 
A node MAY set on its LIEs the according 2868 `you_are_sending_too_quickly` flag to indicate to the neighbor that 2869 it should temporarily flood node TIEs only to it and slow down the 2870 flooding of any other TIEs. It SHOULD only set it in the southbound 2871 direction. The receiving node SHOULD accommodate the request to 2872 lessen the flooding load on the affected node if south of the sender 2873 and SHOULD ignore the indication if northbound. 2875 Obviously this mechanism is most useful in the southbound direction. 2876 The distribution of node TIEs guarantees correct behavior of 2877 algorithms like disaggregation or default route origination. 2878 Furthermore though, the use of this bit presents an inherent trade- 2879 off between processing load and convergence speed since suppressing 2880 flooding of northbound prefixes from neighbors permanently will lead 2881 to blackholes. 2883 4.2.3.6. Initial and Periodic Database Synchronization 2885 The initial exchange of RIFT includes periodic TIDE exchanges that 2886 contain description of the link state database and TIREs which 2887 perform the function of requesting unknown TIEs as well as confirming 2888 reception of flooded TIEs. The content of TIDEs and TIREs is 2889 governed by Table 3. 2891 4.2.3.7. Purging and Roll-Overs 2893 When a node exits the network, if "unpurged", residual stale TIEs may 2894 exist in the network until their lifetimes expire (which in case of 2895 RIFT is by default a rather long period to prevent ongoing re- 2896 origination of TIEs in very large topologies). RIFT does however not 2897 have a "purging mechanism" in the traditional sense based on sending 2898 specialized "purge" packets. In other routing protocols such 2899 mechanism has proven to be complex and fragile based on many years of 2900 experience. RIFT simply issues a new, i.e. 
higher sequence number,
2901     empty version of the TIE with a short lifetime given by the
2902     `purge_lifetime` constant and relies on each node to age out and
2903     delete such a TIE copy independently. Abundant amounts of memory are
2904     available today even on low-end platforms and hence keeping those
2905     relatively short-lived extra copies for a while is acceptable. The
2906     information will age out, and in the meantime all computations will
2907     deliver correct results when a node leaves the network, because the new
2908     information distributed by its adjacent nodes breaks the bidirectional
2909     connectivity checks in the different computations.

2911     Once a RIFT node issues a TIE with an ID, it SHOULD preserve the ID
2912     as long as feasible (also when the protocol restarts), even if the
2913     TIE loses all content. The re-advertisement of an empty TIE fulfills
2914     the purpose of purging any information advertised in previous
2915     versions. The originator is free to not re-originate the corresponding
2916     empty TIE again or to originate an empty TIE with a relatively short
2917     lifetime to prevent a large number of long-lived empty stubs polluting
2918     the network. Each node MUST time out and clean up the corresponding empty
2919     TIEs independently.

2921     Upon restart a node MUST, as any link-state implementation, be
2922     prepared to receive TIEs with its own system ID and supersede them
2923     with equivalent, newly generated, empty TIEs with a higher sequence
2924     number. As above, the lifetime can be relatively short since it only
2925     needs to exceed the necessary propagation and processing delay by all
2926     the nodes that are within the TIE's flooding scope.

2928     TIE sequence numbers are rolled over using the method described in
2929     Appendix A. The first sequence number of any spontaneously originated
2930     TIE (i.e.
not originated to override a detected older copy in the
2931     network) MUST be a reasonably unpredictable random number in the
2932     interval [0, 2^30-1], which prevents otherwise identical TIE
2933     headers from remaining "stuck" in the network with content different from
2934     the TIE originated after reboot. In traditional link-state protocols
2935     this is delegated to a 16-bit checksum on packet content. RIFT
2936     avoids this design due to the CPU burden presented by computation of
2937     such checksums and additional complications tied to the fact that the
2938     checksum must be "patched" into the packet after the generation of
2939     the content, a difficult proposition in binary hand-crafted formats
2940     already and highly incompatible with model-based, serialized formats.
2941     The sequence number space is hence consciously chosen to be 64 bits
2942     wide to make the occurrence of a TIE with the same sequence number but
2943     different content as unlikely as, or even more unlikely than, with the
2944     checksum method. To emulate the "checksum behavior" an implementation could
2945     e.g. choose to compute a 64-bit checksum over the TIE content and use
2946     that as part of the first sequence number after reboot.

2948  4.2.3.8.  Southbound Default Route Origination

2950     Under certain conditions nodes issue a default route in their South
2951     Prefix TIEs with costs as computed in Section 4.3.7.1.

2953     A node X that

2955     1.  is *not* overloaded *and*

2957     2.  has southbound or East-West adjacencies
2958     SHOULD originate in its south prefix TIE such a default route if and
2959     only if

2961     1.  all other nodes at X's level are overloaded *or*

2963     2.  all other nodes at X's level have NO northbound adjacencies *or*

2965     3.  X has computed reachability to a default route during N-SPF.

2967     The term "all other nodes at X's level" obviously describes just the
2968     nodes at the same level in the PoD with a viable lower level
2969     (otherwise the node South TIEs cannot be reflected and the nodes in
2970     e.g.
PoD 1 and PoD 2 are "invisible" to each other).

2972     A node originating a southbound default route SHOULD install a
2973     default discard route if it did not compute a default route during
2974     N-SPF. This makes the top of the fabric basically a blackhole for
2975     unreachable addresses.

2977  4.2.3.9.  Northbound TIE Flooding Reduction

2979     RIFT chooses only a subset of northbound nodes to propagate flooding
2980     and with that both balances it across the fabric (to prevent 'hot'
2981     flooding links) and reduces its volume. The solution is
2982     based on several principles:

2984     1.  a node MUST flood self-originated North TIEs to all the reachable
2985         nodes at the level above, which are called the node's "parents";

2987     2.  it is typically not necessary that all parents reflood the North
2988         TIEs to achieve a complete flooding of all the reachable nodes
2989         two levels above, which we choose to call the node's
2990         "grandparents";

2992     3.  to control the volume of its flooding two hops North and yet keep
2993         it robust enough, it is advantageous for a node to select a
2994         subset of its parents as "Flood Repeaters" (FRs), which combined
2995         together deliver two or more copies of its flooding to all of their
2996         parents, i.e. the originating node's grandparents;

2998     4.  nodes at the same level do *not* have to agree on a specific
2999         algorithm to select the FRs, but overall load balancing should be
3000         achieved so that different nodes at the same level tend to
3001         select different parents as FRs;

3003     5.  there are usually many solutions to the problem of finding a set
3004         of FRs for a given node; the problem of finding the minimal set
3005         is (similar to) an NP-complete problem and a globally optimal set
3006         may not be the minimal one if load-balancing with other nodes is
3007         an important consideration;

3009     6.  it is expected that there will often be sets of equivalent nodes
3010         at a level L, defined as having a common set of parents at L+1.
3011 Applying this observation at both L and L+1, an algorithm may 3012 attempt to split the larger problem in a sum of smaller separate 3013 problems; 3015 7. it is another expectation that there will be from time to time a 3016 broken link between a parent and a grandparent, and in that case 3017 the parent is probably a poor FR due to its lower reliability. 3018 An algorithm may attempt to eliminate parents with broken 3019 northbound adjacencies first in order to reduce the number of 3020 FRs. Albeit it could be argued that relying on higher fanout FRs 3021 will slow flooding due to higher replication load reliability of 3022 FR's links seems to be a more pressing concern. 3024 In a fully connected Clos Network, this means that a node selects one 3025 arbitrary parent as FR and then a second one for redundancy. The 3026 computation can be kept relatively simple and completely distributed 3027 without any need for synchronization amongst nodes. In a "PoD" 3028 structure, where the Level L+2 is partitioned in silos of equivalent 3029 grandparents that are only reachable from respective parents, this 3030 means treating each silo as a fully connected Clos Network and solve 3031 the problem within the silo. 3033 In terms of signaling, a node has enough information to select its 3034 set of FRs; this information is derived from the node's parents' Node 3035 South TIEs, which indicate the parent's reachable northbound 3036 adjacencies to its own parents, i.e. the node's grandparents. A node 3037 may send a LIE to a northbound neighbor with the optional boolean 3038 field `you_are_flood_repeater` set to false, to indicate that the 3039 northbound neighbor is not a flood repeater for the node that sent 3040 the LIE. In that case the northbound neighbor SHOULD NOT reflood 3041 northbound TIEs received from the node that sent the LIE. 
If the
3042     `you_are_flood_repeater` field is absent or is
3043     set to true, then the northbound neighbor is a flood repeater for the
3044     node that sent the LIE and MUST reflood northbound TIEs received from
3045     that node. The element `you_are_flood_repeater` MUST be ignored if
3046     received from a northbound adjacency.

3048     This specification provides a simple default algorithm that SHOULD be
3049     implemented and used by default on every RIFT node.

3051     *  let |NA(Node) be the set of Northbound adjacencies of node Node
3052        and CN(Node) be the cardinality of |NA(Node);

3054     *  let |SA(Node) be the set of Southbound adjacencies of node Node
3055        and CS(Node) be the cardinality of |SA(Node);

3057     *  let |P(Node) be the set of node Node's parents;

3059     *  let |G(Node) be the set of node Node's grandparents. Observe
3060        that |G(Node) = |P(|P(Node));

3062     *  let N be the child node at level L computing a set of FRs;

3064     *  let P be a node at level L+1 and a parent node of N, i.e.
3065        bidirectionally reachable over adjacency ADJ(N, P);

3067     *  let G be a grandparent node of N, reachable transitively via a
3068        parent P over adjacencies ADJ(N, P) and ADJ(P, G). Observe that N
3069        does not have enough information to check bidirectional
3070        reachability of ADJ(P, G);

3072     *  let R be a redundancy constant integer; a value of 2 or higher for
3073        R is RECOMMENDED;

3075     *  let S be a similarity constant integer; a value in range 0 .. 2
3076        for S is RECOMMENDED, the value of 1 SHOULD be used. Two
3077        cardinalities are considered equivalent if their absolute
3078        difference is less than or equal to S, i.e. |a-b|<=S.

3080     *  let RND be a 64-bit random number generated by the system once on
3081        startup.

3083     The algorithm consists of the following steps:

3085     1.  Derive a 64-bit number by XOR'ing N's system ID with RND.

3087     2.
Derive a 16-bit pseudo-random unsigned integer PR(N) from the
3088         resulting 64-bit number by splitting it into 16-bit-long words
3089         W1, W2, W3, W4 (where W1 is the least significant 16 bits of the
3090         64-bit number, and W4 is the most significant 16 bits) and then
3091         XOR'ing the circularly shifted resulting words together:

3093         A.  (W1<<1) xor (W2<<2) xor (W3<<3) xor (W4<<4);

3095         where << is the circular shift operator.

3097     3.  Sort the parents by decreasing number of northbound adjacencies
3098         (using decreasing system id of the parent as tie-breaker):
3099         sort |P(N) by decreasing CN(P), for all P in |P(N), as ordered
3100         array |A(N)

3102     4.  Partition |A(N) into subarrays |A_k(N) of parents with equivalent
3103         cardinality of northbound adjacencies (in other words with
3104         equivalent number of grandparents they can reach):

3106         A.  set k=0; // k is the ID of the subarray

3108         B.  set i=0;

3110         C.  while i < CN(N) do

3112             i)  set j=i;

3114             ii)  while i < CN(N) and CN(|A(N)[j]) - CN(|A(N)[i]) <= S

3116                 a.  place |A(N)[i] in |A_k(N) // abstract action, maybe
3117                     noop

3119                 b.  set i=i+1;

3121             iii)  /* At this point j is the index in |A(N) of the first
3122                   member of |A_k(N) and (i-j) is C_k(N) defined as the
3123                   cardinality of |A_k(N) */

3125                   set k=k+1;

3127         /* At this point k is the total number of subarrays, initialized
3128         for the shuffling operation below */

3130     5.  Shuffle each subarray |A_k(N) of cardinality C_k(N) individually
3131         within |A(N) using the Durstenfeld variation of the Fisher-Yates
3132         algorithm that depends on N's System ID:

3134         A.  while k > 0 do

3136             i)  for i from C_k(N)-1 to 1 decrementing by 1 do

3138                 a.  set j to PR(N) modulo i;

3140                 b.  exchange |A_k[j] and |A_k[i];

3142             ii)  set k=k-1;

3144     6.  For each grandparent G, initialize a counter c(G) with the number
3145         of its southbound adjacencies to elected flood repeaters (which
3146         is initially zero):

3148         A.  for each G in |G(N) set c(G) = 0;

3150     7.
Finally keep as FRs only parents that are needed to maintain the 3151 number of adjacencies between the FRs and any grandparent G equal 3152 or above the redundancy constant R: 3154 A. for each P in reshuffled |A(N); 3156 i) if there exists an adjacency ADJ(P, G) in |NA(P) such 3157 that c(G) < R then 3159 a. place P in FR set; 3161 b. for all adjacencies ADJ(P, G') in |NA(P) increment 3162 c(G') 3164 B. If any c(G) is still < R, it was not possible to elect a set 3165 of FRs that covers all grandparents with redundancy R 3167 Additional rules for flooding reduction: 3169 1. The algorithm MUST be re-evaluated by a node on every change of 3170 local adjacencies or reception of a parent South TIE with changed 3171 adjacencies. A node MAY apply a hysteresis to prevent excessive 3172 amount of computation during periods of network instability just 3173 like in case of reachability computation. 3175 2. Upon a change of the flood repeater set, a node SHOULD send out 3176 LIEs that grant flood repeater status to newly promoted nodes 3177 before it sends LIEs that revoke the status to the nodes that 3178 have been newly demoted. This is done to prevent transient 3179 behavior where the full coverage of grandparents is not 3180 guaranteed. Such a condition is sometimes unavoidable in case of 3181 lost LIEs but it will correct itself though at possible transient 3182 hit in flooding propagation speeds. The election can use the LIE 3183 FSM `FloodLeadersChanged` event to notify LIE FSMs of necessity 3184 to update the sent LIEs. 3186 3. A node MUST always flood its self-originated TIEs to all its 3187 neighbors. 3189 4. A node receiving a TIE originated by a node for which it is not a 3190 flood repeater SHOULD NOT reflood such TIEs to its neighbors 3191 except for rules in Section 4.2.3.9, Paragraph 10, Item 6. 3193 5. 
The indication of flood reduction capability MUST be carried in
3194         the node TIEs in the `flood_reduction` element and MAY be used to
3195         optimize the algorithm to account for nodes that will flood
3196         regardless.

3198     6.  A node generates TIDEs as usual but when receiving TIREs or TIDEs
3199         resulting in requests for a TIE of which the newest received copy
3200         came on an adjacency where the node was not a flood repeater, it
3201         SHOULD ignore such requests on the first and only the first request.
3202         Normally, the nodes that received the TIEs as flood repeaters
3203         should satisfy the requesting node and with that no further TIREs
3204         for such TIEs will be generated. Otherwise, the next set of
3205         TIDEs and TIREs MUST lead to flooding independent of the flood
3206         repeater status. This solves a very difficult incast problem on
3207         nodes restarting with a very wide fanout, especially northbound.
3208         To retrieve the full database they often end up processing many
3209         in-rushing copies, whereas this approach load-balances the
3210         incoming database between adjacent nodes, and flood repeaters
3211         should guarantee that two copies are sent by different nodes to
3212         insure against any losses.

3214  4.2.3.10.  Special Considerations

3216     First, the distributed, asynchronous nature of ZTP can
3217     create temporary convergence anomalies where nodes at higher levels
3218     of the fabric temporarily see themselves lower than where they
3219     ultimately belong. Since flooding can begin before ZTP is "finished"
3220     and in fact must do so given there is no global termination criterion
3221     for the unsynchronized ZTP algorithm, information may end up
3222     temporarily in the wrong layers. A special clause when changing level
3223     takes care of that.

3225     More difficult is a condition where a node (e.g.
a leaf) floods a TIE 3226 north towards its grandparent, then its parent reboots, in fact 3227 partitioning the grandparent from leaf directly and then the leaf 3228 itself reboots. That can leave the grandparent holding the "primary 3229 copy" of the leaf's TIE. Normally this condition is resolved easily 3230 by the leaf re-originating its TIE with a higher sequence number than 3231 it sees in the northbound TIEs, here however, when the parent comes 3232 back it won't be able to obtain leaf's North TIE from the grandparent 3233 easily and with that the leaf may not issue the TIE with a higher 3234 sequence number that can reach the grandparent for a long time. 3235 Flooding procedures are extended to deal with the problem by the 3236 means of special clauses that override the database of a lower level 3237 with headers of newer TIEs seen in TIDEs coming from the north. 3238 Those headers are then propagated southbound towards the leaf nudging 3239 it to originate a higher sequence number of the TIE effectively 3240 refreshing it all the way up to ToF. 3242 4.2.4. Reachability Computation 3244 A node has three possible sources of relevant information for 3245 reachability computation. A node knows the full topology south of it 3246 from the received North Node TIEs or alternately north of it from the 3247 South Node TIEs. A node has the set of prefixes with their 3248 associated distances and bandwidths from corresponding prefix TIEs. 3250 To compute prefix reachability, a node runs conceptually a northbound 3251 and a southbound SPF. N-SPF and S-SPF notation denotes here the 3252 direction in which the computation front is progressing. 3254 Since neither computation can "loop", it is possible to compute non- 3255 equal-cost or even k-shortest paths [EPPSTEIN] and "saturate" the 3256 fabric to the extent desired. This specification however uses 3257 simple, familiar SPF algorithms and concepts as example due to their 3258 prevalence in today's routing. 
3260     For reachability computation purposes RIFT considers all parallel
3261     links between two nodes to be of the same cost advertised in the `cost`
3262     element of `NodeNeighborsTIEElement`. In case the neighbor has
3263     multiple parallel links at different costs, the largest distance
3264     (highest numerical value) MUST be advertised. Given the range of
3265     thrift encodings, `infinite_distance` is defined as the largest
3266     non-negative `MetricType`. Any link with metric larger than that (i.e.
3267     negative MetricType) MUST be ignored in computations. Any link with
3268     metric set to `invalid_distance` MUST be ignored in computation as
3269     well. In case of a negatively distributed prefix the metric
3270     attribute MUST be set to `infinite_distance` by the originator and it
3271     MUST be ignored by all nodes during computation except for the
3272     purpose of determining transitive propagation and building the
3273     corresponding routing table.

3275     A prefix can carry the `directly_attached` attribute to indicate that
3276     the prefix is directly attached, i.e. should be routed to even if the
3277     node is in overload. In case of a negatively distributed prefix this
3278     attribute MUST NOT be included by the originator and it MUST be
3279     ignored by all nodes during SPF computation. If a prefix is locally
3280     originated the attribute `from_link` can indicate the interface to
3281     which the address belongs. In case of a negatively distributed
3282     prefix this attribute MUST NOT be included by the originator and it
3283     MUST be ignored by all nodes during computation. A prefix can also
3284     carry the `loopback` attribute to indicate the said property.

3286     Prefixes are carried in different types of TIEs indicating their type.
3287     When the same prefix is included in different TIE types, the rules in
3288     Section 4.3.1 apply. In case the same prefix is included multiple times in
3289     multiple TIEs of the same type originating at the same node the resulting
3290     behavior is unspecified.
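
As a non-normative illustration of the metric rules above, the following minimal Python sketch shows how a single cost could be chosen for parallel links and how unusable metrics could be filtered out before computation. It assumes that `MetricType` is a signed 32-bit integer and that `invalid_distance` has the value 0; neither value is stated in this section. The function names are hypothetical and are not part of the RIFT schema.

```python
from typing import Iterable

# Assumption: MetricType is a signed 32-bit integer, so the largest
# non-negative value is 2**31 - 1.
INFINITE_DISTANCE = 2**31 - 1
# Assumption: the numeric value of invalid_distance is 0.
INVALID_DISTANCE = 0


def link_usable(metric: int) -> bool:
    """Return True if a link with this metric may take part in computation.

    Negative metrics and invalid_distance are ignored; infinite_distance
    itself is still within the accepted range.
    """
    return metric != INVALID_DISTANCE and 0 <= metric <= INFINITE_DISTANCE


def advertised_cost(parallel_link_costs: Iterable[int]) -> int:
    """Return the single cost advertised for parallel links to one neighbor.

    All parallel links are treated as having the same cost, and the
    largest distance (highest numerical value) is the one advertised.
    """
    costs = list(parallel_link_costs)
    if not costs:
        raise ValueError("at least one link is required")
    return max(costs)
```

With these definitions, three parallel links with costs 1, 5 and 3 would be advertised with cost 5, and a link carrying a negative metric would be skipped during computation.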
3292  4.2.4.1.  Northbound Reachability SPF

3294     N-SPF MUST use exclusively northbound and East-West adjacencies in
3295     the computing node's node North TIEs (since if the node is a leaf it
3296     may not have generated a node South TIE) when starting SPF. Observe
3297     that N-SPF is really just a one-hop variety since Node South TIEs are
3298     not re-flooded southbound beyond a single level (or East-West) and
3299     with that the computation cannot progress beyond adjacent nodes.

3301     Once progressing, the computation uses the next higher level's node
3302     South TIEs to find the corresponding adjacencies to verify backlink
3303     connectivity. Two unidirectional links MUST be associated together
3304     to confirm bidirectional connectivity, a process often known as
3305     `backlink check`. As part of the check, both node TIEs MUST contain
3306     the correct system IDs *and* expected levels.

3308     A default route found when crossing an E-W link SHOULD be used if and
3309     only if

3311     1.  the node itself does *not* have any northbound adjacencies *and*

3313     2.  the adjacent node has one or more northbound adjacencies

3315     This rule forms a "one-hop default route split-horizon" and prevents
3316     looping over default routes while allowing for "one-hop protection"
3317     of nodes that lost all northbound adjacencies except at Top-of-Fabric
3318     where the links are used exclusively to flood topology information in
3319     multi-plane designs.

3321     Other south prefixes found when crossing an E-W link MAY be used if and
3322     only if

3324     1.  no north neighbors are advertising the same or a subsuming
3325         non-default prefix *and*

3327     2.  the node does not originate a non-default subsuming prefix
3328         itself.

3330     i.e. the E-W link can be used as a gateway of last resort for a
3331     specific prefix only. Using south prefixes across an E-W link can be
3332     beneficial e.g. on automatic disaggregation in pathological fabric
3333     partitioning scenarios.

3335     A detailed example can be found in Section 5.4.

3337  4.2.4.2.
Southbound Reachability SPF 3339 S-SPF MUST use the southbound adjacencies in the node South TIEs 3340 exclusively, i.e. progresses towards nodes at lower levels. Observe 3341 that E-W adjacencies are NEVER used in this computation. This 3342 enforces the requirement that a packet traversing in a southbound 3343 direction must never change its direction. 3345 S-SPF MUST use northbound adjacencies in node North TIEs to verify 3346 backlink connectivity by checking for presence of the link beside 3347 correct System ID and level. 3349 4.2.4.3. East-West Forwarding Within a non-ToF Level 3351 Using south prefixes over horizontal links MAY occur if the N-SPF 3352 includes East-West adjacencies in computation. It can protect 3353 against pathological fabric partitioning cases that leave only paths 3354 to destinations that would necessitate multiple changes of forwarding 3355 direction between north and south. 3357 4.2.4.4. East-West Links Within ToF Level 3359 E-W ToF links behave in terms of flooding scopes defined in 3360 Section 4.2.3.4 like northbound links and MUST be used exclusively 3361 for control plane information flooding. Even though a ToF node could 3362 be tempted to use those links during southbound SPF and carry traffic 3363 over them this MUST NOT be attempted since it may lead in, e.g. 3364 anycast cases to routing loops. An implementation MAY try to resolve 3365 the looping problem by following on the ring strictly tie-broken 3366 shortest-paths only but the details are outside this specification. 3367 And even then, the problem of proper capacity provisioning of such 3368 links when they become traffic-bearing in case of failures is vexing 3369 and when used for forwarding purposes, they defeat statistical non- 3370 blocking guarantees that Clos is providing normally. 3372 4.2.5. Automatic Disaggregation on Link & Node Failures 3374 4.2.5.1. 
Positive, Non-transitive Disaggregation

3376     Under normal circumstances, a node's South TIEs contain just the
3377     adjacencies and a default route. However, if a node detects that its
3378     default IP prefix covers one or more prefixes that are reachable
3379     through it but not through one or more other nodes at the same level,
3380     then it MUST explicitly advertise those prefixes in a South TIE.
3381     Otherwise, some percentage of the northbound traffic for those
3382     prefixes would be sent to nodes without the corresponding reachability,
3383     causing it to be black-holed. Even when not black-holing, the
3384     resulting forwarding could 'backhaul' packets through the higher
3385     level spines, clearly an undesirable condition affecting the blocking
3386     probabilities of the fabric.

3388     This specification refers to the process of advertising additional
3389     prefixes southbound as 'positive disaggregation'. Such
3390     disaggregation is non-transitive, i.e. its effects are always
3391     confined to a single level of the fabric. Naturally, multiple
3392     node or link failures can lead to several independent instances of
3393     positive disaggregation necessary to prevent looping or bow-tying the
3394     fabric.

3396     A node determines the set of prefixes needing disaggregation using
3397     the following steps:

3399     1.  A DAG computation in the southern direction is performed first,
3400         i.e. the North TIEs are used to find all of the prefixes it can reach
3401         and the set of next-hops in the lower level for each of them.
3402         Such a computation can be easily performed on a Fat Tree by e.g.
3403         setting all link costs in the southern direction to 1 and all
3404         northern directions to infinity. We term the set of those
3405         prefixes |R, and for each prefix, r, in |R, its set of next-hops
3406         is defined to be |H(r).

3408     2.  The node uses reflected South TIEs to find all nodes at the same
3409         level in the same PoD and the set of southbound adjacencies for
3410         each.
The set of nodes at the same level is termed |N and for
3411         each node, n, in |N, its set of southbound adjacencies is defined
3412         to be |A(n).

3414     3.  For a given r, if the intersection of |H(r) and |A(n), for any n,
3415         is empty then that prefix r must be explicitly advertised by the
3416         node in a South TIE.

3418     4.  An identical set of disaggregated prefixes is flooded on each of the
3419         node's southbound adjacencies. In accordance with the normal
3420         flooding rules for a South TIE, a node at the lower level that
3421         receives this South TIE SHOULD NOT propagate it southbound or
3422         reflect the disaggregated prefixes back over its adjacencies to
3423         nodes at the level from which it was received.

3425     To summarize the above in simplest terms: if a node detects that its
3426     default route encompasses prefixes for which one of the other nodes
3427     in its level has no possible next-hops in the level below, it has to
3428     disaggregate those prefixes to prevent black-holing or suboptimal routing
3429     through such nodes. Hence a node X needs to determine if it can reach a
3430     different set of south neighbors than other nodes at the same level,
3431     which are connected to it via at least one common south neighbor. If
3432     it can, then prefix disaggregation may be required. If it can't,
3433     then no prefix disaggregation is needed. An example of
3434     disaggregation is provided in Section 5.3.

3436     Finally, a possible algorithm is described here:

3438     1.  Create partial_neighbors = (empty), a set of neighbors with
3439         partial connectivity to the node X's level from X's perspective.
3440         Each entry in the set is a south neighbor of X and a list of
3441         nodes of X.level that can't reach that neighbor.

3443     2.  A node X determines its set of southbound neighbors
3444         X.south_neighbors.

3446     3.
For each South TIE that X holds and that was originated by a node Y
3447         at X.level, if Y.south_neighbors is not the same as
3448         X.south_neighbors but the nodes share at least one southern
3449         neighbor, for each neighbor N in X.south_neighbors but not in
3450         Y.south_neighbors, add (N, (Y)) to partial_neighbors if N isn't
3451         there or add Y to the list for N.

3453     4.  If partial_neighbors is empty, then node X does not disaggregate
3454         any prefixes. If node X is advertising disaggregated prefixes in
3455         its South TIE, X SHOULD remove them and re-advertise its
3456         corresponding South TIEs.

3458     A node X computes reachability to all nodes below it based upon the
3459     received North TIEs first. This results in a set of routes, each
3460     categorized by (prefix, path_distance, next-hop set). Alternately,
3461     for clarity in the following procedure, these can be organized by
3462     next-hop set as ((next-hops), {(prefix, path_distance)}). If
3463     partial_neighbors isn't empty, then the following procedure describes
3464     how to identify prefixes to disaggregate.

3466     disaggregated_prefixes = { empty }
3467     nodes_same_level = { empty }
3468     for each South TIE
3469       if (South TIE.level == X.level and
3470           South TIE.originator shares at least one S-neighbor with X)
3471         add South TIE.originator to nodes_same_level
3472       end if
3473     end for

3475     for each next-hop-set NHS
3476       isolated_nodes = nodes_same_level
3477       for each NH in NHS
3478         if NH in partial_neighbors
3479           isolated_nodes =
3480             intersection(isolated_nodes,
3481                          partial_neighbors[NH].nodes)
3482         end if
3483       end for

3485       if isolated_nodes is not empty
3486         for each prefix using NHS
3487           add (prefix, distance) to disaggregated_prefixes
3488         end for
3489       end if
3490     end for

3492     copy disaggregated_prefixes to X's South TIE
3493     if X's South TIE is different
3494       schedule South TIE for flooding
3495     end if

3497               Figure 16: Computation of Disaggregated Prefixes

3499     Each disaggregated prefix is sent with the corresponding path_distance.
3500     This allows a node to send the same South TIE to each south neighbor.
3501     The south neighbor which is connected to that prefix will thus have a
3502     shorter path.

3504     Finally, to summarize the less obvious points partially omitted in
3505     the algorithms to keep them more tractable:

3507     1.  all neighbor relationships MUST perform backlink checks.

3509     2.  overload bits as introduced in Section 4.3.2 and carried in the
3510         `overload` schema element have to be respected during the
3511         computation, i.e. nodes advertising themselves as overloaded MUST
3512         NOT be transited in reachability computation but MUST be used as
3513         terminal nodes with prefixes they advertise being reachable.

3515     3.  all the lower level nodes are flooded the same disaggregated
3516         prefixes since RIFT does not build a South TIE per node, which
3517         would complicate things unnecessarily. The lower level node that
3518         can compute a southbound route to the prefix will prefer it to
3519         the disaggregated route anyway based on route preference rules.

3521     4.  positively disaggregated prefixes do *not* have to propagate to
3522         lower levels. With that the disturbance in terms of new flooding
3523         is contained to a single level experiencing failures.

3525     5.  disaggregated Prefix South TIEs are not "reflected" by the lower
3526         level, i.e. nodes within the same level do *not* need to be aware
3527         which node computed the need for disaggregation.

3529     6.  The fabric is still supporting maximum load balancing properties
3530         while not trying to send traffic northbound unless necessary.

3532     When positive disaggregation is triggered, the very
3533     stable but unsynchronized nature of the algorithm means that the nodes may
3534     issue the necessary disaggregated prefixes at different points in
3535     time. This can lead for a short time to an "incast" behavior where
3536     the first advertising router, based on the nature of longest prefix
3537     match, will attract all the traffic.
Different implementation 3538 strategies can be used to lessen that effect but those are clearly 3539 outside the scope of this specification. 3541 To close this section it is worth to observe that in a single plane 3542 ToF this disaggregation prevents blackholing up to (K_LEAF * P) link 3543 failures in terms of Section 4.1.2 or in other terms, it takes at 3544 minimum that many link failures to partition the ToF into multiple 3545 planes. 3547 4.2.5.2. Negative, Transitive Disaggregation for Fallen Leaves 3549 As explained in Section 4.1.3 failures in multi-plane Top-of-Fabric 3550 or more than (K_LEAF * P) links failing in single plane design can 3551 generate fallen leaves. Such scenario cannot be addressed by 3552 positive disaggregation only and needs a further mechanism. 3554 4.2.5.2.1. Cabling of Multiple Top-of-Fabric Planes 3556 Returning in this section to designs with multiple planes as shown 3557 originally in Figure 3, Figure 17 highlights now how the ToF is 3558 cabled in case of two planes by the means of dual-rings to distribute 3559 all the North TIEs within both planes. 3561 ++==========++ ++==========++ 3562 II II II II 3563 +----++--+ +----++--+ +----++--+ +----++--+ 3564 |ToF A1| |ToF B1| |ToF B2| |ToF A2| 3565 ++-+-++--+ ++-+-++--+ ++-+-++--+ ++-+-++--+ 3566 | | II | | II | | II | | II 3567 | | ++==========++ | | ++==========++ 3568 | | | | | | | | 3570 ~~~ Highlighted ToF of the previous multi-plane figure ~~ 3572 Figure 17: Topologically Connected Planes 3574 Section 4.1.3 already describes how failures in multi-plane fabrics 3575 can lead to blackholes which normal positive disaggregation cannot 3576 fix. The mechanism of negative, transitive disaggregation 3577 incorporated in RIFT provides the according solution and next section 3578 explains the involved mechanisms in more detail. 3580 4.2.5.2.2. 
Transitive Advertisement of Negative Disaggregates

   A ToF node discovering that it cannot reach a fallen leaf SHOULD
   disaggregate all the prefixes of such leaves. It uses for that
   purpose negative prefix South TIEs that are, as usual, flooded
   southwards with the scope defined in Section 4.2.3.4.

   Transitively, a node explicitly loses connectivity to a prefix when
   none of its children advertises it and when the prefix is negatively
   disaggregated by all of its parents. When that happens, the node
   originates the negative prefix further down south. Since the
   mechanism applies recursively south, the negative prefix may
   propagate transitively all the way down to the leaf. This is
   necessary since leaves connected to multiple planes by means of
   disjoint paths may have to choose the correct plane already at the
   very bottom of the fabric to make sure that they don't send traffic
   towards another leaf using a plane where it is "fallen", at which
   point a blackhole is unavoidable.

   When the connectivity is restored, a node that disaggregated a prefix
   withdraws the negative disaggregation by the usual mechanism of
   re-advertising TIEs omitting the negative prefix.

4.2.5.2.3. Computation of Negative Disaggregates

   The document has so far omitted the description of the computation
   necessary to generate the correct set of negative prefixes. Negative
   prefixes can in fact be advertised due to two different triggers,
   which are described in turn.

   The first origination reason is a computation that uses all the node
   North TIEs to build the set of all reachable nodes by reachability
   computation over the complete graph, including horizontal ToF links.
   The computation uses the node itself as root. This is compared with
   the result of the normal southbound SPF as described in
   Section 4.2.4.2.
   The difference is the set of fallen leaves, and all their attached
   prefixes are advertised as negative prefixes southbound if the node
   does not see the prefix being reachable within the southbound SPF.

   The second mechanism hinges on understanding how the negative
   prefixes are used within the computation as described in Figure 18.
   When negative prefixes are attached, a negative prefix may at some
   point find itself with all the viable nodes from the shorter match
   nexthop pruned. In other words, all its northbound neighbors
   provided a negative prefix advertisement. This is the trigger to
   advertise this negative prefix transitively south; it is normally
   caused by the node being in a plane where the prefix belongs to a
   fabric leaf that has "fallen" in this plane. Obviously, when one of
   the northbound switches withdraws its negative advertisement, the
   node has to withdraw its transitively provided negative prefix as
   well.

4.2.6. Attaching Prefixes

   After SPF is run, it is necessary to attach the resulting
   reachability information in form of prefixes. For S-SPF, prefixes
   from a North TIE are attached to the originating node with that
   node's next-hop set and a distance equal to the prefix's cost plus
   the node's minimized path distance. The RIFT route database, a set
   of (prefix, prefix-type, attributes, path_distance, next-hop set),
   accumulates these results.

   In case of N-SPF, prefixes from each South TIE need to also be added
   to the RIFT route database. The N-SPF is really just a stub, so the
   computing node simply needs to determine, for each prefix in a South
   TIE that originated from an adjacent node, what next-hops to use to
   reach that node.
Since there may be parallel links, the next-hops to 3648 use can be a set; presence of the computing node in the associated 3649 Node South TIE is sufficient to verify that at least one link has 3650 bidirectional connectivity. The set of minimum cost next-hops from 3651 the computing node X to the originating adjacent node is determined. 3653 Each prefix has its cost adjusted before being added into the RIFT 3654 route database. The cost of the prefix is set to the cost received 3655 plus the cost of the minimum distance next-hop to that neighbor while 3656 taking into account its attributes such as mobility per 3657 Section 4.3.4. Then each prefix can be added into the RIFT route 3658 database with the next-hop set; ties are broken based upon type first 3659 and then distance and further on `PrefixAttributes` and only the best 3660 combination is used for forwarding. RIFT route preferences are 3661 normalized by the according Thrift [thrift] model type. 3663 An example implementation for node X follows: 3665 for each South TIE 3666 if South TIE.level > X.level 3667 next_hop_set = set of minimum cost links to the 3668 South TIE.originator 3669 next_hop_cost = minimum cost link to 3670 South TIE.originator 3671 end if 3672 for each prefix P in the South TIE 3673 P.cost = P.cost + next_hop_cost 3674 if P not in route_database: 3675 add (P, P.cost, P.type, 3676 P.attributes, next_hop_set) to route_database 3677 end if 3678 if (P in route_database): 3679 if route_database[P].cost > P.cost or 3680 route_database[P].type > P.type: 3681 update route_database[P] with (P, P.type, P.cost, 3682 P.attributes, 3683 next_hop_set) 3684 else if route_database[P].cost == P.cost and 3685 route_database[P].type == P.type: 3686 update route_database[P] with (P, P.type, 3687 P.cost, P.attributes, 3688 merge(next_hop_set, route_database[P].next_hop_set)) 3689 else 3690 // Not preferred route so ignore 3691 end if 3692 end if 3693 end for 3694 end for 3696 Figure 18: Adding Routes from 
South TIE Positive and Negative Prefixes

   After the positive prefixes are attached and tie-broken, negative
   prefixes are attached and used in case of northbound computation,
   ideally from the shortest length to the longest. The nexthop
   adjacencies for a negative prefix are inherited from the longest
   positive prefix that aggregates it, and subsequently adjacencies to
   nodes that advertised negative for this prefix are removed.

   The rule of inheritance MUST be maintained when the nexthop list for
   a prefix is modified, as the modification may affect the entries for
   matching negative prefixes of the immediately longer prefix length.
   For instance, if a nexthop is added, then by inheritance it must be
   added to all the negative routes of the immediately longer prefix
   length unless it is pruned due to a negative advertisement for the
   same next hop. Similarly, if a nexthop is deleted for a given
   prefix, then it is deleted for all the immediately aggregated
   negative routes. This will recurse in the case of nested negative
   prefix aggregations.

   The rule of inheritance must also be maintained when a new prefix of
   intermediate length is inserted, or when the immediately aggregating
   prefix is deleted from the routing table, making an even shorter
   aggregating prefix the one from which the negative routes now inherit
   their adjacencies. As the aggregating prefix changes, all the
   negative routes must be recomputed, and then again the process may
   recurse in case of nested negative prefix aggregations.

   Although these operations can be computationally expensive, the
   overall load on devices in the network is low because these
   computations are not run very often, as positive route advertisements
   are always preferred over negative ones. This prevents recursion in
   most cases because positive reachability information never inherits
   next hops.
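
   The inheritance rule described above can be sketched as follows.
   This is a minimal, non-normative illustration and not part of the
   protocol specification: the function name
   `resolve_negative_next_hops` and the dictionary-based route tables
   are assumptions made for this example only. It processes negative
   prefixes from the shortest to the longest, lets each one inherit the
   next hops of the longest prefix that aggregates it, and then removes
   the neighbors that advertised the prefix negatively.

```python
import ipaddress


def resolve_negative_next_hops(positive, negative):
    """Compute forwarding next hops for negative prefixes (illustrative).

    positive: dict mapping prefix string -> set of next-hop names
    negative: dict mapping prefix string -> set of neighbors that
              advertised the prefix negatively
    Returns a dict mapping ip_network -> set of next hops. An empty set
    stands for a discard route.
    """
    table = {ipaddress.ip_network(p): set(nh) for p, nh in positive.items()}
    holes = {ipaddress.ip_network(p): set(nh) for p, nh in negative.items()}

    # Shortest negative prefixes first, so nested negatives inherit from
    # an already resolved, shorter negative entry.
    for net in sorted(holes, key=lambda n: n.prefixlen):
        if net in table:
            # A positive route for the same prefix is preferred.
            continue
        covering = [
            cand for cand in table
            if cand.version == net.version and net.subnet_of(cand)
        ]
        if not covering:
            # Nothing aggregates this prefix, so nothing can be inherited.
            continue
        parent = max(covering, key=lambda n: n.prefixlen)
        # Inherit from the longest aggregating prefix, then punch holes.
        table[net] = table[parent] - holes[net]
    return table


if __name__ == "__main__":
    routes = resolve_negative_next_hops(
        {"::/0": {"S1", "S2", "S3", "S4"}},
        {"2001:db8::/32": {"S1"}, "2001:db8:1::/48": {"S2"}},
    )
    for prefix, next_hops in sorted(routes.items(), key=lambda kv: kv[0].prefixlen):
        print(prefix, sorted(next_hops))
```

   For a hypothetical node whose default route points at four parents
   S1 to S4, a negative 2001:db8::/32 from S1 leaves S2, S3 and S4 for
   that prefix, and a further negative 2001:db8:1::/48 from S2 leaves
   S3 and S4. The sketch recomputes everything from scratch; an
   implementation would more likely update the affected entries
   incrementally when a next hop or an aggregating prefix changes.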
3731 To make the negative disaggregation less abstract and provide an 3732 example ToP node T1 with 4 ToF parents S1..S4 as represented in 3733 Figure 19 are considered further: 3735 +----+ +----+ +----+ +----+ N 3736 | S1 | | S2 | | S3 | | S4 | ^ 3737 +----+ +----+ +----+ +----+ W< + >E 3738 | | | | v 3739 |+--------+ | | S 3740 ||+-----------------+ | 3741 |||+----------------+---------+ 3742 |||| 3743 +----+ 3744 | T1 | 3745 +----+ 3747 Figure 19: A ToP Node with 4 Parents 3749 If all ToF nodes can reach all the prefixes in the network; with 3750 RIFT, they will normally advertise a default route south. An 3751 abstract Routing Information Base (RIB), more commonly known as a 3752 routing table, stores all types of maintained routes including the 3753 negative ones and "tie-breaks" for the best one, whereas an abstract 3754 Forwarding table (FIB) retains only the ultimately computed 3755 "positive" routing instructions. In T1, those tables would look as 3756 illustrated in Figure 20: 3758 +---------+ 3759 | Default | 3760 +---------+ 3761 | 3762 | +--------+ 3763 +---> | Via S1 | 3764 | +--------+ 3765 | 3766 | +--------+ 3767 +---> | Via S2 | 3768 | +--------+ 3769 | 3770 | +--------+ 3771 +---> | Via S3 | 3772 | +---------+ 3773 | 3774 | +--------+ 3775 +---> | Via S4 | 3776 +--------+ 3778 Figure 20: Abstract RIB 3780 In case T1 receives a negative advertisement for prefix 2001:db8::/32 3781 from S1 a negative route is stored in the RIB (indicated by a ~ 3782 sign), while the more specific routes to the complementing ToF nodes 3783 are installed in FIB. 
   RIB and FIB in T1 now look as illustrated in Figure 21 and
   Figure 22, respectively:

   +---------+                 +-----------------+
   | Default | <-------------- | ~2001:db8::/32  |
   +---------+                 +-----------------+
        |                               |
        |     +--------+                |     +--------+
        +---> | Via S1 |                +---> | Via S1 |
        |     +--------+                      +--------+
        |
        |     +--------+
        +---> | Via S2 |
        |     +--------+
        |
        |     +--------+
        +---> | Via S3 |
        |     +--------+
        |
        |     +--------+
        +---> | Via S4 |
              +--------+

       Figure 21: Abstract RIB after Negative 2001:db8::/32 from S1

   The negative 2001:db8::/32 prefix entry inherits from ::/0, so the
   positive more specific routes are the complements to S1 in the set of
   next-hops for the default route. That entry is composed of S2, S3,
   and S4, or, in other words, it uses all entries of the default route
   with a "hole punched" for S1 into them. These are the next hops that
   are still available to reach 2001:db8::/32, now that S1 advertised
   that it will not forward 2001:db8::/32 anymore. Ultimately, those
   resulting next-hops are installed in FIB for the more specific route
   to 2001:db8::/32 as illustrated below:

   +---------+      +---------------+
   | Default |      | 2001:db8::/32 |
   +---------+      +---------------+
        |                   |
        |     +--------+    |
        +---> | Via S1 |    |
        |     +--------+    |
        |                   |
        |     +--------+    |     +--------+
        +---> | Via S2 |    +---> | Via S2 |
        |     +--------+    |     +--------+
        |                   |
        |     +--------+    |     +--------+
        +---> | Via S3 |    +---> | Via S3 |
        |     +--------+    |     +--------+
        |                   |
        |     +--------+    |     +--------+
        +---> | Via S4 |    +---> | Via S4 |
              +--------+          +--------+

       Figure 22: Abstract FIB after Negative 2001:db8::/32 from S1

   To illustrate matters further, consider T1 receiving a negative
   advertisement for prefix 2001:db8:1::/48 from S2, which is stored in
   RIB again.
After the update, the RIB in T1 is illustrated in 3843 Figure 23: 3845 +---------+ +----------------+ +------------------+ 3846 | Default | <----- | ~2001:db8::/32 | <------ | ~2001:db8:1::/48 | 3847 +---------+ +----------------+ +------------------+ 3848 | | | 3849 | +--------+ | +--------+ | 3850 +---> | Via S1 | +---> | Via S1 | | 3851 | +--------+ +--------+ | 3852 | | 3853 | +--------+ | +--------+ 3854 +---> | Via S2 | +---> | Via S2 | 3855 | +--------+ +--------+ 3856 | 3857 | +--------+ 3858 +---> | Via S3 | 3859 | +---------+ 3860 | 3861 | +--------+ 3862 +---> | Via S4 | 3863 +--------+ 3865 Figure 23: Abstract RIB after Negative 2001:db8:1::/48 from S2 3867 Negative 2001:db8:1::/48 inherits from 2001:db8::/32 now, so the 3868 positive more specific routes are the complements to S2 in the set of 3869 next hops for 2001:db8::/32, which are S3 and S4, or, in other words, 3870 all entries of the parent with the negative holes "punched in" again. 3871 After the update, the FIB in T1 shows as illustrated in Figure 24: 3873 +---------+ +---------------+ +-----------------+ 3874 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 3875 +---------+ +---------------+ +-----------------+ 3876 | | | 3877 | +--------+ | | 3878 +---> | Via S1 | | | 3879 | +--------+ | | 3880 | | | 3881 | +--------+ | +--------+ | 3882 +---> | Via S2 | +---> | Via S2 | | 3883 | +--------+ | +--------+ | 3884 | | | 3885 | +--------+ | +--------+ | +--------+ 3886 +---> | Via S3 | +---> | Via S3 | +---> | Via S3 | 3887 | +--------+ | +--------+ | +--------+ 3888 | | | 3889 | +--------+ | +--------+ | +--------+ 3890 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 3891 +--------+ +--------+ +--------+ 3893 Figure 24: Abstract FIB after Negative 2001:db8:1::/48 from S2 3895 Further, assume that S3 stops advertising its service as default 3896 gateway. The entry is removed from RIB as usual. 
In order to update 3897 the FIB, it is necessary to eliminate the FIB entry for the default 3898 route, as well as all the FIB entries that were created for negative 3899 routes pointing to the RIB entry being removed (::/0). This is done 3900 recursively for 2001:db8::/32 and then for, 2001:db8:1::/48. The 3901 related FIB entries via S3 are removed, as illustrated in Figure 25. 3903 +---------+ +---------------+ +-----------------+ 3904 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 3905 +---------+ +---------------+ +-----------------+ 3906 | | | 3907 | +--------+ | | 3908 +---> | Via S1 | | | 3909 | +--------+ | | 3910 | | | 3911 | +--------+ | +--------+ | 3912 +---> | Via S2 | +---> | Via S2 | | 3913 | +--------+ | +--------+ | 3914 | | | 3915 | | | 3916 | | | 3917 | | | 3918 | | | 3919 | +--------+ | +--------+ | +--------+ 3920 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 3921 +--------+ +--------+ +--------+ 3923 Figure 25: Abstract FIB after Loss of S3 3925 Say that at that time, S4 would also disaggregate prefix 3926 2001:db8:1::/48. This would mean that the FIB entry for 3927 2001:db8:1::/48 becomes a discard route, and that would be the signal 3928 for T1 to disaggregate prefix 2001:db8:1::/48 negatively in a 3929 transitive fashion with its own children. 3931 Finally, the case occurs where S3 becomes available again as a 3932 default gateway, and a negative advertisement is received from S4 3933 about prefix 2001:db8:2::/48 as opposed to 2001:db8:1::/48. Again, a 3934 negative route is stored in the RIB, and the more specific route to 3935 the complementing ToF nodes are installed in FIB. Since 3936 2001:db8:2::/48 inherits from 2001:db8::/32, the positive FIB routes 3937 are chosen by removing S4 from S2, S3, S4. 
The abstract FIB in T1 3938 now shows as illustrated in Figure 26: 3940 +-----------------+ 3941 | 2001:db8:2::/48 | 3942 +-----------------+ 3943 | 3944 +---------+ +---------------+ +-----------------+ 3945 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 3946 +---------+ +---------------+ +-----------------+ 3947 | | | | 3948 | +--------+ | | | +--------+ 3949 +---> | Via S1 | | | +---> | Via S2 | 3950 | +--------+ | | | +--------+ 3951 | | | | 3952 | +--------+ | +--------+ | | +--------+ 3953 +---> | Via S2 | +---> | Via S2 | | +---> | Via S3 | 3954 | +--------+ | +--------+ | +--------+ 3955 | | | 3956 | +--------+ | +--------+ | +--------+ 3957 +---> | Via S3 | +---> | Via S3 | +---> | Via S3 | 3958 | +--------+ | +--------+ | +--------+ 3959 | | | 3960 | +--------+ | +--------+ | +--------+ 3961 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 3962 +--------+ +--------+ +--------+ 3964 Figure 26: Abstract FIB after Negative 2001:db8:2::/48 from S4 3966 4.2.7. Optional Zero Touch Provisioning (ZTP) 3968 Each RIFT node can operate in zero touch provisioning (ZTP) mode, 3969 i.e. it has no configuration (unless it is a ToF or it is configured 3970 to operate in the overall topology as leaf and/or support leaf-2-leaf 3971 procedures) and it will fully configure itself after being attached 3972 to the topology. Configured nodes and nodes operating in ZTP can be 3973 mixed and will form a valid topology if achievable. 3975 The derivation of the level of each node happens based on offers 3976 received from its neighbors whereas each node (with possibly 3977 exceptions of configured leaves) tries to attach at the highest 3978 possible point in the fabric. This guarantees that even if the 3979 diffusion front of offers reaches a node from "below" faster than 3980 from "above", it will greedily abandon already negotiated level 3981 derived from nodes topologically below it and properly peer with 3982 nodes above. 
3984 The fabric is very consciously numbered from the top down to allow 3985 for PoDs of different heights and minimize number of provisioning 3986 necessary, in this case just a TOP_OF_FABRIC flag on every node at 3987 the top of the fabric. 3989 This section describes the necessary concepts and procedures for ZTP 3990 operation. 3992 4.2.7.1. Terminology 3994 The interdependencies between the different flags and the configured 3995 level can be somewhat vexing at first and it may take multiple reads 3996 of the glossary to comprehend them. 3998 Automatic Level Derivation: 3999 Procedures which allow nodes without level configured to derive it 4000 automatically. Only applied if CONFIGURED_LEVEL is undefined. 4002 UNDEFINED_LEVEL: 4003 A "null" value that indicates that the level has not been 4004 determined and has not been configured. Schemas normally indicate 4005 that by a missing optional value without an available defined 4006 default. 4008 LEAF_ONLY: 4009 An optional configuration flag that can be configured on a node to 4010 make sure it never leaves the "bottom of the hierarchy". 4011 TOP_OF_FABRIC flag and CONFIGURED_LEVEL cannot be defined at the 4012 same time as this flag. It implies CONFIGURED_LEVEL value of 4013 `leaf_level`. It is indicated in `leaf_only` schema element. 4015 TOP_OF_FABRIC flag: 4016 Configuration flag that MUST be provided to all Top-of-Fabric 4017 nodes. LEAF_FLAG and CONFIGURED_LEVEL cannot be defined at the 4018 same time as this flag. It implies a CONFIGURED_LEVEL value. In 4019 fact, it is basically a shortcut for configuring same level at all 4020 Top-of-Fabric nodes which is unavoidable since an initial 'seed' 4021 is needed for other ZTP nodes to derive their level in the 4022 topology. The flag plays an important role in fabrics with 4023 multiple planes to enable successful negative disaggregation 4024 (Section 4.2.5.2). It is carried in `top_of_fabric` schema 4025 element. 
      A standards-conformant RIFT implementation implies a
      CONFIGURED_LEVEL value of `top_of_fabric_level` in case of
      TOP_OF_FABRIC. This value is kept reasonably low to allow for
      fast ZTP re-convergence on failures.

   CONFIGURED_LEVEL:
      A level value provided manually. When this is defined (i.e. it is
      not an UNDEFINED_LEVEL) the node is not participating in ZTP in
      the sense of deriving its own level based on other nodes'
      information. TOP_OF_FABRIC flag is ignored when this value is
      defined. LEAF_ONLY can be set only if this value is undefined or
      set to `leaf_level`.

   DERIVED_LEVEL:
      Level value computed via automatic level derivation when
      CONFIGURED_LEVEL is equal to UNDEFINED_LEVEL.

   LEAF_2_LEAF:
      An optional flag that can be configured on a node to make sure it
      supports procedures defined in Section 4.3.9. In a strict sense
      it is a capability that implies LEAF_ONLY and the corresponding
      restrictions. TOP_OF_FABRIC flag is ignored when set at the same
      time as this flag. It is carried in the
      `leaf_only_and_leaf_2_leaf_procedures` schema flag.

   LEVEL_VALUE:
      In the ZTP case the original definition of "level" in Section 3.1
      is both extended and relaxed. First, level is defined now as
      LEVEL_VALUE and is the first defined value of CONFIGURED_LEVEL
      followed by DERIVED_LEVEL. Second, it is possible for nodes to be
      more than one level apart to form adjacencies if any of the nodes
      is at least LEAF_ONLY.

   Valid Offered Level (VOL):
      A neighbor's level received on a valid LIE (i.e. passing all
      checks for adjacency formation while disregarding all clauses
      involving level values) persisting for the duration of the
      holdtime interval on the LIE.
      Observe that offers from nodes offering level value of
      `leaf_level` do not constitute VOLs (since no valid DERIVED_LEVEL
      can be obtained from those and consequently `not_a_ztp_offer`
      flag MUST be ignored). Offers from LIEs with `not_a_ztp_offer`
      being true are not VOLs either. If a node maintains parallel
      adjacencies to the neighbor, VOL on each adjacency is considered
      as equivalent, i.e. the newest VOL from any such adjacency
      updates the VOL received from the same node.

   Highest Available Level (HAL):
      Highest defined level value seen from all VOLs received.

   Highest Available Level Systems (HALS):
      Set of nodes offering HAL VOLs.

   Highest Adjacency ThreeWay (HAT):
      Highest neighbor level of all the formed ThreeWay adjacencies for
      the node.

4.2.7.2. Automatic System ID Selection

   RIFT nodes require a 64 bit System ID which SHOULD be derived as an
   EUI-64 MA-L according to [EUI64]. The organizationally governed
   portion of this ID (24 bits) can be used to generate multiple IDs if
   required to indicate more than one RIFT instance.

   As a matter of operational concern, the router MUST ensure that such
   an identifier is not changing very frequently (or at least not
   without sending all its TIEs with fairly short lifetimes, i.e.
   purging them) since otherwise the network may be left with large
   amounts of stale TIEs in other nodes (though this is not necessarily
   a serious problem if the procedures described in Section 7 are
   implemented).

4.2.7.3. Generic Fabric Example

   ZTP forces considerations of miscabled or unusually cabled fabric and
   how such a topology can be forced into a "lattice" structure which a
   fabric represents (with further restrictions). A necessary and
   sufficient physical cabling is shown in Figure 27. The assumption
   here is that all nodes are in the same PoD.
4103 +---+ 4104 | A | s = TOP_OF_FABRIC 4105 | s | l = LEAF_ONLY 4106 ++-++ l2l = LEAF_2_LEAF 4107 | | 4108 +--+ +--+ 4109 | | 4110 +--++ ++--+ 4111 | E | | F | 4112 | +-+ | +-----------+ 4113 ++--+ | ++-++ | 4114 | | | | | 4115 | +-------+ | | 4116 | | | | | 4117 | | +----+ | | 4118 | | | | | 4119 ++-++ ++-++ | 4120 | I +-----+ J | | 4121 | | | +-+ | 4122 ++-++ +--++ | | 4123 | | | | | 4124 +---------+ | +------+ | 4125 | | | | | 4126 +-----------------+ | | 4127 | | | | | 4128 ++-++ ++-++ | 4129 | X +-----+ Y +-+ 4130 |l2l| | l | 4131 +---+ +---+ 4133 Figure 27: Generic ZTP Cabling Considerations 4135 First, RIFT must anchor the "top" of the cabling and that's what the 4136 TOP_OF_FABRIC flag at node A is for. Then things look smooth until 4137 the protocol has to decide whether node Y is at the same level as I, 4138 J (and as consequence, X is south of it) or at the same level as X. 4139 This is unresolvable here until we "nail down the bottom" of the 4140 topology. To achieve that the protocol chooses to use in this 4141 example the leaf flags in X and Y. In case where Y would not have a 4142 leaf flag it will try to elect highest level offered and end up being 4143 in same level as I and J. 4145 4.2.7.4. Level Determination Procedure 4147 A node starting up with UNDEFINED_VALUE (i.e. without a 4148 CONFIGURED_LEVEL or any leaf or TOP_OF_FABRIC flag) MUST follow those 4149 additional procedures: 4151 1. It advertises its LEVEL_VALUE on all LIEs (observe that this can 4152 be UNDEFINED_LEVEL which in terms of the schema is simply an 4153 omitted optional value). 4155 2. It computes HAL as numerically highest available level in all 4156 VOLs. 4158 3. It chooses then MAX(HAL-1,0) as its DERIVED_LEVEL. The node then 4159 starts to advertise this derived level. 4161 4. A node that lost all adjacencies with HAL value MUST hold down 4162 computation of new DERIVED_LEVEL for a short period of time 4163 unless it has no VOLs from southbound adjacencies. 
After the 4164 holddown timer expired, it MUST discard all received offers, 4165 recompute DERIVED_LEVEL and announce it to all neighbors. 4167 5. A node MUST reset any adjacency that has changed the level it is 4168 offering and is in ThreeWay state. 4170 6. A node that changed its defined level value MUST readvertise its 4171 own TIEs (since the new `PacketHeader` will contain a different 4172 level than before). Sequence number of each TIE MUST be 4173 increased. 4175 7. After a level has been derived the node MUST set the 4176 `not_a_ztp_offer` on LIEs towards all systems offering a VOL for 4177 HAL. 4179 8. A node that changed its level SHOULD flush from its link state 4180 database TIEs of all other nodes, otherwise stale information may 4181 persist on "direction reversal", i.e. nodes that seemed south 4182 are now north or east-west. This will not prevent the correct 4183 operation of the protocol but could be slightly confusing 4184 operationally. 4186 A node starting with LEVEL_VALUE being 0 (i.e. it assumes a leaf 4187 function by being configured with the appropriate flags or has a 4188 CONFIGURED_LEVEL of 0) MUST follow those additional procedures: 4190 1. It computes HAT per procedures above but does *not* use it to 4191 compute DERIVED_LEVEL. HAT is used to limit adjacency formation 4192 per Section 4.2.2. 4194 It MAY also follow modified procedures: 4196 1. It may pick a different strategy to choose VOL, e.g. use the VOL 4197 value with highest number of VOLs. Such strategies are only 4198 possible since the node always remains "at the bottom of the 4199 fabric" while another layer could "invert" the fabric by picking 4200 its preferred VOL in a different fashion than always trying to 4201 achieve the highest viable level. 4203 4.2.7.5. ZTP FSM 4205 This section specifies the precise, normative ZTP FSM and can be 4206 omitted unless the reader is pursuing an implementation of the 4207 protocol. 4209 Initial state is ComputeBestOffer. 
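
   Steps 2 and 3 of the level determination procedure in
   Section 4.2.7.4 can be sketched as follows. This is a non-normative
   illustration under stated assumptions: the `Offer` container, the
   function names and the numeric value 0 for `leaf_level` (taken from
   the text's treatment of leaves as level 0) are chosen for this
   example only. Holdtime expiry and the holddown of step 4 are
   deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple

LEAF_LEVEL = 0  # assumption: numeric value of `leaf_level`


@dataclass(frozen=True)
class Offer:
    """One level offer taken from a valid LIE (hypothetical container)."""
    system_id: str
    level: Optional[int]          # None stands for UNDEFINED_LEVEL
    not_a_ztp_offer: bool = False


def is_vol(offer: Offer) -> bool:
    """An offer is a VOL only if it carries a level above leaf level
    and is not marked `not_a_ztp_offer`."""
    return (
        offer.level is not None
        and offer.level > LEAF_LEVEL
        and not offer.not_a_ztp_offer
    )


def derive_level(
    offers: Iterable[Offer],
) -> Tuple[Optional[int], Set[str], Optional[int]]:
    """Return (HAL, HALS, DERIVED_LEVEL) for the given offers.

    All three are empty/None when no VOL is available, i.e. the level
    stays undefined.
    """
    vols = [o for o in offers if is_vol(o)]
    if not vols:
        return None, set(), None
    hal = max(o.level for o in vols)                      # step 2
    hals = {o.system_id for o in vols if o.level == hal}
    return hal, hals, max(hal - 1, 0)                     # step 3


if __name__ == "__main__":
    offers = [Offer("F", 23), Offer("J", 22), Offer("X", 0)]
    print(derive_level(offers))
```

   The returned HALS set identifies the neighbors towards which
   `not_a_ztp_offer` would be set on LIEs once a level has been derived
   (step 7).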
4211 Enter 4212 | 4213 v 4214 +------------------+ 4215 | ComputeBestOffer | 4216 | |<----+ 4217 | | | BetterHAL 4218 | | | BetterHAT 4219 | | | ChangeLocalConfiguredLevel 4220 | | | ChangeLocalHierarchyIndications 4221 | | | LostHAT 4222 | | | NeighborOffer 4223 | | | ShortTic 4224 | |-----+ 4225 | | 4226 | |<--------------------- 4227 | |---------------------> (UpdatingClients) 4228 | | ComputationDone 4229 +------------------+ 4230 ^ | 4231 | | LostHAL 4232 | V 4233 (HoldingDown) 4235 (ComputeBestOffer) 4236 | ^ 4237 | | ChangeLocalConfiguredLevel 4238 | | ChangeLocalHierarchyIndications 4239 | | HoldDownExpired 4240 V | 4241 +------------------+ 4242 | HoldingDown | 4243 | |<----+ 4244 | | | BetterHAL 4245 | | | BetterHAT 4246 | | | ComputationDone 4247 | | | LostHAL 4248 | | | LostHat 4249 | | | NeighborOffer 4250 | | | ShortTic 4251 | |-----+ 4252 +------------------+ 4253 ^ 4254 | 4255 (UpdatingClients) 4257 (ComputeBestOffer) 4258 | ^ 4259 | | BetterHAL 4260 | | BetterHAT 4261 | | LostHAT 4262 | | ChangeLocalHierarchyIndications 4263 | | ChangeLocalConfiguredLevel 4264 V | 4265 +------------------+ 4266 | UpdatingClients | 4267 | |<----+ 4268 | | | 4269 | | | NeighborOffer 4270 | | | ShortTic 4271 | |-----+ 4272 +------------------+ 4273 | 4274 | LostHAL 4275 V 4276 (HoldingDown) 4278 Figure 28: ZTP FSM 4280 The following words are used for well known procedures: 4282 * PUSH Event: queues an event to be executed by the FSM upon exit of 4283 this action 4285 * COMPARE_OFFERS: checks whether based on current offers and held 4286 last results the events BetterHAL/LostHAL/BetterHAT/LostHAT are 4287 necessary and returns them 4289 * UPDATE_OFFER: store current offer with adjancency holdtime as 4290 lifetime and COMPARE_OFFERS, then PUSH according events 4292 * LEVEL_COMPUTE: compute best offered or configured level and HAL/ 4293 HAT, if anything changed PUSH ComputationDone 4295 * REMOVE_OFFER: remove the according offer and COMPARE_OFFERS, PUSH 4296 according 
      events

   *  PURGE_OFFERS: REMOVE_OFFER for all held offers, COMPARE_OFFERS,
      PUSH according events

   *  PROCESS_OFFER:

      1.  if no level offered then REMOVE_OFFER

      2.  else

          1.  if offered level > leaf then UPDATE_OFFER

          2.  else REMOVE_OFFER

   States:

   *  ComputeBestOffer: processes received offers to derive ZTP
      variables

   *  HoldingDown: holding down while receiving updates

   *  UpdatingClients: updates other FSMs with computation results

   Events:

   *  ChangeLocalHierarchyIndications: node locally configured with new
      leaf flags.

   *  ChangeLocalConfiguredLevel: node locally configured with a defined
      level.

   *  NeighborOffer: a new neighbor offer with optional level and
      neighbor state.

   *  BetterHAL: better HAL computed internally.

   *  BetterHAT: better HAT computed internally.

   *  LostHAL: lost last HAL in computation.

   *  LostHAT: lost HAT in computation.

   *  ComputationDone: computation performed.

   *  HoldDownExpired: holddown timer expired.

   *  ShortTic: one second timer tic, i.e. the event is generated for
      FSM by some external entity once a second. To be ignored if
      transition does not exist.
4347 Actions: 4349 * on ChangeLocalConfiguredLevel in HoldingDown finishes in 4350 ComputeBestOffer: store configured level 4352 * on BetterHAT in HoldingDown finishes in HoldingDown: no action 4354 * on ShortTic in HoldingDown finishes in HoldingDown: remove expired 4355 offers and if holddown timer expired PUSH_EVENT HoldDownExpired 4357 * on NeighborOffer in HoldingDown finishes in HoldingDown: 4358 PROCESS_OFFER 4360 * on ComputationDone in HoldingDown finishes in HoldingDown: no 4361 action 4363 * on BetterHAL in HoldingDown finishes in HoldingDown: no action 4365 * on LostHAT in HoldingDown finishes in HoldingDown: no action 4367 * on LostHAL in HoldingDown finishes in HoldingDown: no action 4369 * on HoldDownExpired in HoldingDown finishes in ComputeBestOffer: 4370 PURGE_OFFERS 4372 * on ChangeLocalHierarchyIndications in HoldingDown finishes in 4373 ComputeBestOffer: store leaf flags 4375 * on LostHAT in ComputeBestOffer finishes in ComputeBestOffer: 4376 LEVEL_COMPUTE 4378 * on NeighborOffer in ComputeBestOffer finishes in ComputeBestOffer: 4379 PROCESS_OFFER 4381 * on BetterHAT in ComputeBestOffer finishes in ComputeBestOffer: 4382 LEVEL_COMPUTE 4384 * on ChangeLocalHierarchyIndications in ComputeBestOffer finishes in 4385 ComputeBestOffer: store leaf flags and LEVEL_COMPUTE 4387 * on LostHAL in ComputeBestOffer finishes in HoldingDown: if any 4388 southbound adjacencies present then update holddown timer to 4389 normal duration else fire holddown timer immediately 4391 * on ShortTic in ComputeBestOffer finishes in ComputeBestOffer: 4392 remove expired offers 4394 * on ComputationDone in ComputeBestOffer finishes in 4395 UpdatingClients: no action 4397 * on ChangeLocalConfiguredLevel in ComputeBestOffer finishes in 4398 ComputeBestOffer: store configured level and LEVEL_COMPUTE 4400 * on BetterHAL in ComputeBestOffer finishes in ComputeBestOffer: 4401 LEVEL_COMPUTE 4403 * on ShortTic in UpdatingClients finishes in UpdatingClients: remove 4404 expired 
offers 4406 * on LostHAL in UpdatingClients finishes in HoldingDown: if any 4407 southbound adjacencies present then update holddown timer to 4408 normal duration else fire holddown timer immediately 4410 * on BetterHAT in UpdatingClients finishes in ComputeBestOffer: no 4411 action 4413 * on BetterHAL in UpdatingClients finishes in ComputeBestOffer: no 4414 action 4416 * on ChangeLocalConfiguredLevel in UpdatingClients finishes in 4417 ComputeBestOffer: store configured level 4419 * on ChangeLocalHierarchyIndications in UpdatingClients finishes in 4420 ComputeBestOffer: store leaf flags 4422 * on NeighborOffer in UpdatingClients finishes in UpdatingClients: 4423 PROCESS_OFFER 4425 * on LostHAT in UpdatingClients finishes in ComputeBestOffer: no 4426 action 4428 * on Entry into ComputeBestOffer: LEVEL_COMPUTE 4430 * on Entry into UpdatingClients: update all LIE FSMs with 4431 computation results 4433 4.2.7.6. Resulting Topologies 4435 The procedures defined in Section 4.2.7.4 will lead to the RIFT 4436 topology and levels depicted in Figure 29. 4438 +---+ 4439 | As| 4440 | 24| 4441 ++-++ 4442 | | 4443 +--+ +--+ 4444 | | 4445 +--++ ++--+ 4446 | E | | F | 4447 | 23+-+ | 23+-----------+ 4448 ++--+ | ++-++ | 4449 | | | | | 4450 | +-------+ | | 4451 | | | | | 4452 | | +----+ | | 4453 | | | | | 4454 ++-++ ++-++ | 4455 | I +-----+ J | | 4456 | 22| | 22| | 4457 ++--+ +--++ | 4458 | | | 4459 +---------+ | | 4460 | | | 4461 ++-++ +---+ | 4462 | X | | Y +-+ 4463 | 0 | | 0 | 4464 +---+ +---+ 4466 Figure 29: Generic ZTP Topology Autoconfigured 4468 In case where the LEAF_ONLY restriction on Y is removed the outcome 4469 would be very different however and result in Figure 30. This 4470 demonstrates basically that auto configuration makes miscabling 4471 detection hard and with that can lead to undesirable effects in cases 4472 where leaves are not "nailed" by the accordingly configured flags and 4473 arbitrarily cabled. 
4475 A node MAY analyze the outstanding level offers on its interfaces and 4476 generate warnings when its internal ruleset flags a possible 4477 miscabling. As an example, when a node sees ZTP level offers that 4478 differ by more than one level from its chosen level (with proper 4479 accounting for leaves being at level `leaf_level`) this can indicate 4480 miscabling. 4482 . +---+ 4483 . | As| 4484 . | 24| 4485 . ++-++ 4486 . | | 4487 . +--+ +--+ 4488 . | | 4489 . +--++ ++--+ 4490 . | E | | F | 4491 . | 23+-+ | 23+-------+ 4492 . ++--+ | ++-++ | 4493 . | | | | | 4494 . | +-------+ | | 4495 . | | | | | 4496 . | | +----+ | | 4497 . | | | | | 4498 . ++-++ ++-++ +-+-+ 4499 . | I +-----+ J +-----+ Y | 4500 . | 22| | 22| | 22| 4501 . ++-++ +--++ ++-++ 4502 . | | | | | 4503 . | +-----------------+ | 4504 . | | | 4505 . +---------+ | | 4506 . | | | 4507 . ++-++ | 4508 . | X +--------+ 4509 . | 0 | 4510 . +---+ 4512 Figure 30: Generic ZTP Topology Autoconfigured 4514 4.3. Further Mechanisms 4516 4.3.1. Route Preferences 4518 Since RIFT distinguishes between different route types such as 4519 external routes from other protocols and additionally advertises 4520 special types of routes on disaggregation, the protocol MUST tie- 4521 break internally different types on a clear preference scale to 4522 prevent blackholes or loops. The preferences are given in the schema 4523 type `RouteType`. 4525 Table 5 contains the route type as derived from the TIE type 4526 carrying it from the most preferred to the least preferred one. 
4528 +==================================+======================+ 4529 | TIE Type | Resulting Route Type | 4530 +==================================+======================+ 4531 | None | Discard | 4532 +----------------------------------+----------------------+ 4533 | Local Interface | LocalPrefix | 4534 +----------------------------------+----------------------+ 4535 | S-PGP | South PGP | 4536 +----------------------------------+----------------------+ 4537 | N-PGP | North PGP | 4538 +----------------------------------+----------------------+ 4539 | North Prefix | NorthPrefix | 4540 +----------------------------------+----------------------+ 4541 | North External Prefix | NorthExternalPrefix | 4542 +----------------------------------+----------------------+ 4543 | South Prefix and South Positive | SouthPrefix | 4544 | Disaggregation | | 4545 +----------------------------------+----------------------+ 4546 | South External Prefix and South | SouthExternalPrefix | 4547 | Positive External Disaggregation | | 4548 +----------------------------------+----------------------+ 4549 | South Negative Prefix | NegativeSouthPrefix | 4550 +----------------------------------+----------------------+ 4552 Table 5: TIEs and Contained Route Types 4554 4.3.2. Overload Bit 4556 Overload attribute is specified in the packet encoding schema 4557 (Appendix B). 4559 The overload bit MUST be respected by all necessary SPF computations. 4560 A node with the overload bit set SHOULD advertise all locally hosted 4561 prefixes both northbound and southbound, all other southbound 4562 prefixes SHOULD NOT be advertised. 4564 Leaf nodes SHOULD set the overload attribute on all originated Node 4565 TIEs. If spine nodes were to forward traffic not intended for the 4566 local node, the leaf node would not be able to prevent routing/ 4567 forwarding loops as it does not have the necessary topology 4568 information to do so. 4570 4.3.3. 
Optimized Route Computation on Leaves 4572 Leaf nodes only have visibility to directly connected nodes and 4573 therefore are not required to run "full" SPF computations. Instead, 4574 prefixes from neighboring nodes can be gathered to run a "partial" 4575 SPF computation in order to build the routing table. 4577 Leaf nodes SHOULD only hold their own N-TIEs, and in cases of L2L 4578 implementations, the N-TIEs of their East/West neighbors. Leaf nodes 4579 MUST hold all S-TIEs from their neighbors. 4581 Normally, a full network graph is created based on local N-TIEs and 4582 remote S-TIEs that it receives from neighbors, at which time, 4583 necessary SPF computations are performed. Instead, leaf nodes can 4584 simply compute the minimum cost and next-hop set of each leaf 4585 neighbor by examining its local adjacencies. Associated N-TIEs are 4586 used to determine bi-directionality and derive the next-hop set. 4587 Cost is then derived from the minimum cost of the local adjacency to 4588 the neighbor and the prefix cost. 4590 Leaf nodes would then attach necessary prefixes as described in 4591 Section 4.2.6. 4593 4.3.4. Mobility 4595 The RIFT control plane MUST maintain the real time status of every 4596 prefix, to which port it is attached, and to which leaf node that 4597 port belongs. This is still true in cases of IP mobility where the 4598 point of attachment may change several times a second. 4600 There are two classic approaches to explicitly maintain this 4601 information: 4603 timestamp: 4604 With this method, the infrastructure SHOULD record the precise 4605 time at which the movement is observed. One key advantage of this 4606 technique is that it has no dependency on the mobile device. One 4607 drawback is that the infrastructure MUST be precisely synchronized 4608 in order to be able to compare timestamps as the points of 4609 attachment change. This could be accomplished by utilizing 4610 Precision Time Protocol (PTP) IEEE Std. 
1588 [IEEEstd1588] or 4611 802.1AS [IEEEstd8021AS] which is designed for bridged LANs. Both 4612 the precision of the synchronization protocol and the resolution 4613 of the timestamp must beat the highest possible roaming time on 4614 the fabric. Another drawback is that the presence of a mobile 4615 device may only be observed asynchronously, such as when it starts 4616 using an IP protocol like ARP [RFC0826], IPv6 Neighbor Discovery 4617 [RFC4861], IPv6 Stateless Address Configuration [RFC4862], DHCP 4618 [RFC2131], or DHCPv6 [RFC8415]. 4620 sequence counter: 4621 With this method, a mobile device notifies its point of attachment 4622 on arrival with a sequence counter that is incremented upon each 4623 movement. On the positive side, this method does not have a 4624 dependency on a precise sense of time, since the sequence of 4625 movements is kept in order by the mobile device. The disadvantage 4626 of this approach is the lack of support for protocols that may be 4627 used by the mobile device to register its presence to the leaf 4628 node with the capability to provide a sequence counter. Well- 4629 known issues with sequence counters such as wrapping and 4630 comparison rules MUST be addressed properly. Sequence numbers 4631 MUST be compared by a single homogeneous source to make operation 4632 feasible. Sequence number comparison from multiple heterogeneous 4633 sources would be extremely difficult to implement. 4635 RIFT supports a hybrid approach by using an optional 4636 'PrefixSequenceType' attribute (that is also called a `monotonic 4637 clock` in the schema) that consists of a timestamp and an optional 4638 sequence number field. In case of a negatively distributed prefix 4639 this attribute MUST NOT be included by the originator and it MUST be 4640 ignored by all nodes during computation. 
When this attribute is 4641 present (observe that per the data schema the attribute itself is 4642 optional but in case it is included the 'timestamp' field is 4643 required): 4645 * The leaf node MAY advertise a timestamp of the latest sighting of 4646 a prefix, e.g., by snooping IP protocols or the node using the 4647 time at which it advertised the prefix. RIFT transports the 4648 timestamp within the desired prefix North TIEs as an 802.1AS 4649 timestamp. 4651 * RIFT MAY interoperate with "Registration Extensions for 6LoWPAN 4652 Neighbor Discovery" [RFC8505], which provides a method for 4653 registering a prefix with a sequence number called a Transaction 4654 ID (TID). In such cases, RIFT SHOULD transport the derived TID 4655 without modification. 4657 * RIFT also defines an abstract negative clock (ASNC) (also called 4658 an 'undefined' clock). ASNC MUST be considered older than any 4659 other defined clock. By default, when a node receives a prefix 4660 North TIE that does not contain a 'PrefixSequenceType' attribute, 4661 it MUST interpret the absence as ASNC. 4663 * Any prefix present on the fabric in multiple nodes that has the 4664 `same` clock is considered as anycast. 4666 * The RIFT specification assumes that all nodes are synchronized 4667 to at least 200 milliseconds of precision. This is achievable 4668 through the use of NTP [RFC5905]. An implementation MAY provide a 4669 way to reconfigure a domain to a different value, and provides for 4670 this purpose a variable called MAXIMUM_CLOCK_DELTA. 4672 4.3.4.1. Clock Comparison 4674 All monotonic clock values MUST be compared to each other using the 4675 following rules: 4677 1. ASNC is older than any other value except ASNC *and* 4679 2. Clocks with timestamps differing by more than MAXIMUM_CLOCK_DELTA 4680 are comparable by using the timestamps only *and* 4682 3. Clocks with timestamps differing by less than MAXIMUM_CLOCK_DELTA 4683 are comparable by using their TIDs only *and* 4685 4. 
An undefined TID is always older than any other TID *and* 4687 5. TIDs are compared using rules of [RFC8505]. 4689 4.3.4.2. Interaction between Time Stamps and Sequence Counters 4691 For attachment changes that occur less frequently (e.g. once per 4692 second), the timestamp that the RIFT infrastructure captures should 4693 be enough to determine the most current discovery. If the point of 4694 attachment changes faster than the maximum drift of the time stamping 4695 mechanism (i.e. MAXIMUM_CLOCK_DELTA), then a sequence number SHOULD 4696 be used to enable necessary precision to determine currency. 4698 The sequence counter in [RFC8505] is encoded as one octet and wraps 4699 around using Appendix A. 4701 Within the resolution of MAXIMUM_CLOCK_DELTA, sequence counter values 4702 captured during 2 sequential iterations of the same timestamp SHOULD 4703 be comparable. This means that with default values, a node may move 4704 up to 127 times in a 200 millisecond period and the clocks will 4705 remain comparable. This allows the RIFT infrastructure to explicitly 4706 assert the most up-to-date advertisement. 4708 4.3.4.3. Anycast vs. Unicast 4710 A unicast prefix can be attached to at most one leaf, whereas an 4711 anycast prefix may be reachable via more than one leaf. 4713 If a monotonic clock attribute is provided on the prefix, then the 4714 prefix with the `newest` clock value is strictly preferred. An 4715 anycast prefix does not carry a clock or all clock attributes MUST be 4716 the same under the rules of Section 4.3.4.1. 4718 Observe that it is important that in mobility events the leaf is re- 4719 flooding as quickly as possible the absence of the prefix that moved 4720 away. 4722 Observe further that without support for [RFC8505] movements on the 4723 fabric within intervals smaller than 100msec will be seen as anycast. 4725 4.3.4.4. 
Overlays and Signaling 4727 RIFT is agnostic to any overlay technologies and their associated 4728 control and transports that run on top of it (e.g. VXLAN). It is 4729 expected that leaf nodes and possibly Top-of-Fabric nodes can perform 4730 necessary data plane encapsulation. 4732 In the context of mobility, overlays provide another possible 4733 solution to avoid injecting mobile prefixes into the fabric as well 4734 as improving scalability of the deployment. It makes sense to 4735 consider overlays for mobility solutions in IP fabrics. As an 4736 example, a mobility protocol such as LISP [RFC6830] may inform the 4737 ingress leaf of the location of the egress leaf in real time. 4739 Another possibility is to consider mobility as an underlay 4740 service and support it in RIFT to an extent. The load on the fabric 4741 grows with the amount of mobility, since a move forces 4742 flooding and computation on all nodes in the scope of the move, so 4743 tunneling from the leaf to the Top-of-Fabric may be desired to speed up 4744 convergence times. 4746 4.3.5. Key/Value Store 4748 4.3.5.1. Southbound 4750 RIFT supports the southbound distribution of key-value pairs that can 4751 be used to distribute information to facilitate higher levels of 4752 functionality (e.g. distribution of configuration information). KV 4753 South TIEs may arrive from multiple nodes and therefore a node MUST execute 4754 the following tie-breaking rules for each key: 4756 1. Only KV TIEs received from nodes to which a bi-directional 4757 adjacency exists MUST be considered. 4759 2. For each valid KV South TIE that contains the same key, the 4760 value within the South TIE with the highest level will be 4761 preferred. If the levels are identical, the highest originating 4762 system ID will be preferred. In the case of overlapping keys in 4763 the winning South TIE, the behavior is undefined. 
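The two southbound KV tie-breaking rules above can be sketched as follows. This is a minimal, non-normative illustration. The `KvSouthTie` record, its field names, and the use of integer keys with byte-string values are assumptions made for the example and do not come from the schema. The case of overlapping keys within one winning TIE, which the text leaves undefined, cannot occur here because each TIE is modeled as a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class KvSouthTie:
    """Hypothetical, simplified view of a received KV South TIE."""
    system_id: int          # originating system ID
    level: int              # level of the originating node
    bidirectional: bool     # True if a bi-directional adjacency to the originator exists
    kv: Dict[int, bytes]    # key/value pairs carried by the TIE (layout is an assumption)


def select_kv(ties: Iterable[KvSouthTie]) -> Dict[int, bytes]:
    """Pick, per key, the value from the preferred South TIE."""
    winners: Dict[int, KvSouthTie] = {}
    for tie in ties:
        # Rule 1: ignore TIEs from nodes without a bi-directional adjacency.
        if not tie.bidirectional:
            continue
        for key in tie.kv:
            best = winners.get(key)
            # Rule 2: highest level wins; equal levels fall back to highest system ID.
            if best is None or (tie.level, tie.system_id) > (best.level, best.system_id):
                winners[key] = tie
    return {key: tie.kv[key] for key, tie in winners.items()}
```

Comparing the tuple `(level, system_id)` expresses both parts of rule 2 in one step, because Python compares tuples element by element.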
4765 Consider that if a node goes down, nodes south of it will lose 4766 associated adjacencies causing them to disregard corresponding KVs. 4767 New KV South TIEs are advertised to prevent stale information being 4768 used by nodes that are farther south. KV advertisements southbound 4769 are not a result of independent computation by every node over the 4770 same set of South TIEs, but a diffused computation. 4772 4.3.5.2. Northbound 4774 Certain use cases necessitate distribution of essential KV 4775 information that is generated by the leaves in the northbound 4776 direction. Such information is flooded in KV North TIEs. Since the 4777 originator of the KV North TIEs is preserved during flooding, the 4778 corresponding mechanism will define, if necessary, appropriate tie-breaking 4779 rules depending on the semantics of the information. 4781 Only KV TIEs from nodes that are reachable via multiplane 4782 reachability computation mentioned in Section 4.2.5.2.3 SHOULD be 4783 considered. 4785 4.3.6. Interactions with BFD 4787 RIFT MAY incorporate BFD [RFC5881] to react quickly to link failures. 4788 In such a case, the following procedures are introduced: 4790 After RIFT ThreeWay hello adjacency convergence a BFD session MAY 4791 be formed automatically between the RIFT endpoints without further 4792 configuration using the exchanged discriminators. The capability 4793 of the remote side to support BFD is carried in the LIEs in 4794 `LinkCapabilities`. 4796 In case an established BFD session goes Down after it was Up, the RIFT 4797 adjacency SHOULD be re-initialized and subsequently started from 4798 Init after it sees a consecutive BFD Up. 4800 In case of parallel links between nodes each link MAY run its own 4801 independent BFD session or they MAY share a session. 4803 If link identifiers or BFD capabilities change, both the LIE and 4804 any BFD sessions SHOULD be brought down and back up again. 
In 4805 case only the advertised capabilities change, the node MAY choose 4806 to persist the BFD session. 4808 Multiple RIFT instances MAY choose to share a single BFD session, 4809 in such cases the behavior for which discriminators are used is 4810 undefined. However, RIFT MAY advertise the same link ID for the 4811 same interface in multiple instances to "share" discriminators. 4813 BFD TTL follows [RFC5082]. 4815 4.3.7. Fabric Bandwidth Balancing 4817 A well understood problem in fabrics is that in case of link 4818 failures, it would be ideal to rebalance how much traffic is sent to 4819 switches in the next level based on available ingress and egress 4820 bandwidth. 4822 RIFT supports a very light weight mechanism that can deal with the 4823 problem in an approximate way based on the fact that RIFT is loop- 4824 free. 4826 4.3.7.1. Northbound Direction 4828 Every RIFT node SHOULD compute the amount of northbound bandwidth 4829 available through neighbors at higher level and modify distance 4830 received on default route from this neighbor. The bandwidth is 4831 advertised in `NodeNeighborsTIEElement` element which represents the 4832 sum of the bandwidths of all the parallel links to a neighbor. 4833 Default routes with differing distances SHOULD be used to support 4834 weighted ECMP forwarding. Such a distance is called Bandwidth 4835 Adjusted Distance or BAD. This is best illustrated by a simple 4836 example. 
4838 100 x 100 100 MBits 4839 | x | | 4840 +-+---+-+ +-+---+-+ 4841 | | | | 4842 |Spin111| |Spin112| 4843 +-+---+++ ++----+++ 4844 |x || || || 4845 || |+---------------+ || 4846 || +---------------+| || 4847 || || || || 4848 || || || || 4849 -----All Links 10 MBit------- 4850 || || || || 4851 || || || || 4852 || +------------+| || || 4853 || |+------------+ || || 4854 |x || || || 4855 +-+---+++ +--++-+++ 4856 | | | | 4857 |Leaf111| |Leaf112| 4858 +-------+ +-------+ 4860 Figure 31: Balancing Bandwidth 4862 Figure 31 depicts an example topology where links between leaf and 4863 spine nodes are 10 MBit/s and links from spine nodes northbound are 4864 100 MBit/s. It includes a parallel link failure between Leaf 111 and 4865 Spine 111 and as a result, Leaf 111 wants to forward more traffic 4866 toward Spine 112. Additionally, it also includes an uplink 4867 failure on Spine 111. 4869 The local modification of the received default route distance from 4870 the upper level is achieved by running a relatively simple algorithm 4871 where the bandwidth is weighted exponentially, while the distance on 4872 the default route represents a multiplier for the bandwidth weight 4873 for easy operational adjustments. 4875 On a node, L, use Node TIEs to compute 3 values for each non-overloaded 4876 northbound neighbor N: 4878 L_N_u: as sum of the bandwidth available to N 4880 N_u: as sum of the uplink bandwidth available on N 4882 T_N_u: as sum of L_N_u * OVERSUBSCRIPTION_CONSTANT + N_u 4884 For all T_N_u determine the corresponding M_N_u as 4885 log_2(next_power_2(T_N_u)) and determine MAX_M_N_u as maximum value 4886 of all such M_N_u values. 4888 For each advertised default route from a node N modify the advertised 4889 distance D to BAD = D * (1 + MAX_M_N_u - M_N_u) and use BAD instead 4890 of distance D to weight balance default forwarding towards N. 4892 For the example above, a simple table of values helps in 4893 understanding the concept. 
The implicit assumption here is that 4894 all default route distances are advertised with D=1 and that 4895 OVERSUBSCRIPTION_CONSTANT = 1. 4897 +=========+===========+=======+=======+=====+ 4898 | Node | N | T_N_u | M_N_u | BAD | 4899 +=========+===========+=======+=======+=====+ 4900 | Leaf111 | Spine 111 | 110 | 7 | 2 | 4901 +---------+-----------+-------+-------+-----+ 4902 | Leaf111 | Spine 112 | 220 | 8 | 1 | 4903 +---------+-----------+-------+-------+-----+ 4904 | Leaf112 | Spine 111 | 120 | 7 | 2 | 4905 +---------+-----------+-------+-------+-----+ 4906 | Leaf112 | Spine 112 | 220 | 8 | 1 | 4907 +---------+-----------+-------+-------+-----+ 4909 Table 6: BAD Computation 4911 If a calculation produces a result exceeding the range of the type, 4912 e.g. bandwidth, the result is set to the highest possible value for 4913 that type. 4915 BAD SHOULD be only computed for default routes. A node MAY compute 4916 and use BAD for any disaggregated prefixes or other RIFT routes. A 4917 node MAY use a different algorithm to weight northbound traffic based 4918 on bandwidth. If a different algorithm is used, its successful 4919 behavior MUST NOT depend on uniformity of algorithm or 4920 synchronization of BAD computations across the fabric. E.g. it is 4921 conceivable that leaves could use real time link loads gathered by 4922 analytics to change the amount of traffic assigned to each default 4923 route next hop. 4925 Furthermore, a change in available bandwidth will only affect, at 4926 most, two levels down in the fabric, i.e. the blast radius of 4927 bandwidth adjustments is constrained no matter the fabric's height. 4929 4.3.7.2. Southbound Direction 4931 Due to its loop free nature, during South SPF, a node MAY account for 4932 maximum available bandwidth on nodes in lower levels and modify the 4933 amount of traffic offered to the next level's southbound nodes. 
It 4934 is worth considering that such computations may be more effective if 4935 standardized, but do not have to be. As long as a packet continues 4936 to flow southbound, it will take some viable, loop-free path to reach 4937 its destination. 4939 4.3.8. Label Binding 4941 A node MAY advertise in its LIEs a locally significant, downstream 4942 assigned, interface specific label. One use of such a label is a 4943 hop-by-hop encapsulation allowing forwarding planes to be easily 4944 distinguished among multiple RIFT instances. 4946 4.3.9. Leaf to Leaf Procedures 4948 RIFT implementations SHOULD support special East-West adjacencies 4949 between leaf nodes. Leaf nodes supporting these procedures MUST: 4951 advertise the LEAF_2_LEAF flag in its node capabilities *and* 4953 set the overload bit on all of the leaf's Node TIEs *and* 4955 flood only a node's own north and south TIEs over E-W leaf 4956 adjacencies *and* 4958 always use E-W leaf adjacency in all SPF computations *and* 4959 install a discard route for any advertised aggregate routes in a 4960 leaf's TIE *and* 4962 never form southbound adjacencies. 4964 This will allow the E-W leaf nodes to exchange traffic strictly for 4965 the prefixes advertised in each other's north prefix TIEs (since the 4966 southbound computation will find the reverse direction in the other 4967 node's TIE and install its north prefixes). 4969 4.3.10. Address Family and Multi Topology Considerations 4971 Multi-Topology (MT) [RFC5120] and Multi-Instance (MI) [RFC8202] 4972 concepts are used today in link-state routing protocols to support 4973 several domains on the same physical topology. RIFT supports this 4974 capability by carrying transport ports in the LIE protocol exchanges. 4975 Multiplexing of LIEs can be achieved by either choosing varying 4976 multicast addresses or ports on the same address. 4978 BFD interactions in Section 4.3.6 are implementation dependent when 4979 multiple RIFT instances run on the same link. 4981 4.3.11. 
One-Hop Healing of Levels with East-West Links 4983 Based on the rules defined in Section 4.2.4, Section 4.2.3.8 and 4984 given presence of E-W links, RIFT can provide a one-hop protection 4985 for nodes that lost all their northbound links. This can also be 4986 applied to multi-plane designs where complex link set failures occur 4987 at the Top-of-Fabric when links are exclusively used for flooding 4988 topology information. Section 5.4 outlines this behavior. 4990 4.4. Security 4992 4.4.1. Security Model 4994 An inherent property of any security and ZTP architecture is the 4995 resulting trade-off in regard to integrity verification of the 4996 information distributed through the fabric vs. provisioning and auto- 4997 configuration requirements. At a minimum the security of an 4998 established adjacency should be ensured. The stricter the security 4999 model the more provisioning must take over the role of ZTP. 5001 RIFT supports the following security models to allow for flexible 5002 control by the operator. 5004 * The most security conscious operators may choose to have control 5005 over which ports interconnect between a given pair of nodes, such 5006 a model is called the "Port-Association Model" (PAM). This is 5007 achievable by configuring each pair of directly connected ports 5008 with a designated shared key or public/private key pair. 5010 * In physically secure data center locations, operators may choose 5011 to control connectivity between entire nodes, called here the 5012 "Node-Association Model" (NAM). A benefit of this model is that 5013 it allows for simplified port sparing. 5015 * In the most relaxed environments, an operator may only choose to 5016 control which nodes join a particular fabric. This is denoted as 5017 the "Fabric-Association Model" (FAM). This is achievable by using 5018 a single shared secret across the entire fabric. 
Such flexibility 5019 makes sense when servers are considered as leaf devices, as those 5020 are replaced more often than network nodes. In addition, this 5021 model allows for simplified node sparing. 5023 * These models may be mixed throughout the fabric depending upon 5024 security requirements at various levels of the fabric and 5025 willingness to accept increased provisioning complexity. 5027 In order to support the cases mentioned above, RIFT implementations 5028 support, through operator control, mechanisms that allow for: 5030 a. specification of the appropriate level in the fabric, 5032 b. discovery and reporting of missing connections, 5034 c. discovery and reporting of unexpected connections while 5035 preventing them from forming insecure adjacencies. 5037 Operators may choose to configure only the level of each node, but 5038 not explicitly configure which connections are allowed. In this 5039 case, RIFT will only allow adjacencies to be established between nodes 5040 that are in adjacent levels. Operators with the lowest security 5041 requirements may not use any configuration to specify which 5042 connections are allowed. Nodes in such fabrics could rely fully on 5043 ZTP and only establish adjacencies between nodes in adjacent 5044 levels. Figure 32 illustrates inherent tradeoffs between the 5045 different security models. 5047 Some level of link quality verification may be required prior to an 5048 adjacency being used for forwarding. For example, an implementation 5049 may require that a BFD session comes up before advertising the 5050 adjacency. 5052 For the cases outlined above, RIFT has two approaches to enforce that 5053 a local port is connected to the correct port on the correct remote 5054 node. One approach is to piggy-back on RIFT's authentication 5055 mechanism. Assuming the provisioning model (e.g. the YANG model) is 5056 flexible enough, operators can choose to provision a unique 5057 authentication key for: 5059 a. 
each pair of ports in "port-association model" or 5061 b. each pair of switches in "node-association model" or 5063 c. each pair of levels or 5065 d. the entire fabric in "fabric-association model". 5067 The other approach is to rely on the system-id, port-id and level 5068 fields in the LIE message to validate an adjacency against the 5069 expected cabling topology, and optionally introduce some new rules in 5070 the FSM to allow the adjacency to come up if the expectations are 5071 met. 5073 ^ /\ | 5074 /|\ / \ | 5075 | / \ | 5076 | / PAM \ | 5077 Increasing / \ Increasing 5078 Integrity +----------+ Flexibility 5079 & / NAM \ & 5080 Increasing +--------------+ Less 5081 Provisioning / FAM \ Configuration 5082 | +------------------+ | 5083 | / Level Provisioning \ | 5084 | +----------------------+ \|/ 5085 | / Zero Configuration \ v 5086 +--------------------------+ 5088 Figure 32: Security Model 5090 4.4.2. Security Mechanisms 5092 RIFT Security goals are to ensure: 5094 1. authentication 5096 2. message integrity 5098 3. the prevention of replay attacks 5099 4. low processing overhead 5101 5. efficient messaging 5103 Message confidentiality is a non-goal. 5105 The model in the previous section allows a range of security key 5106 types that are analogous to the various security association models. 5107 PAM and NAM allow security associations at the port or node level 5108 using symmetric or asymmetric keys that are pre-installed. FAM 5109 argues for security associations to be applied only at a group level 5110 or to be refined once the topology has been established. RIFT does 5111 not specify how security keys are installed or updated, though it 5112 does specify how the key can be used to achieve security goals. 5114 The protocol has provisions for "weak" nonces to prevent replay 5115 attacks and includes authentication mechanisms comparable to 5116 [RFC5709] and [RFC7987]. 5118 4.4.3. 
Security Envelope 5120 A serialized schema `ProtocolPacket` MUST be carried in a secure 5121 envelope illustrated in Figure 33. The `ProtocolPacket` MUST be 5122 serialized using the default Thrift's Binary Protocol. Any value in 5123 the packet following a security fingerprint MUST be used only after 5124 the appropriate fingerprint has been validated against the data 5125 covered by it and advertised key. 5127 Local configuration MAY allow for the envelope's integrity checks to 5128 be skipped. 5130 0 1 2 3 5131 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 5133 UDP Header: 5134 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5135 | Source Port | RIFT destination port | 5136 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5137 | UDP Length | UDP Checksum | 5138 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5140 Outer Security Envelope Header: 5141 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5142 | RIFT MAGIC | Packet Number | 5143 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5144 | Reserved | RIFT Major | Outer Key ID | Fingerprint | 5145 | | Version | | Length | 5146 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5147 | | 5148 ~ Security Fingerprint covers all following content ~ 5149 | | 5150 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5151 | Weak Nonce Local | Weak Nonce Remote | 5152 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5153 | Remaining TIE Lifetime (all 1s in case of LIE) | 5154 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5156 TIE Origin Security Envelope Header: 5157 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5158 | TIE Origin Key ID | Fingerprint | 5159 | | Length | 5160 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 5161 | | 5162 ~ Security Fingerprint covers all following content ~ 5163 | | 5164 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Serialized RIFT Model Object
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
~                 Serialized RIFT Model Object                  ~
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 33: Security Envelope

RIFT MAGIC:
   16 bits. Constant value of 0xA1F7 that allows RIFT packets to be
   classified easily, independent of the UDP port used.

Packet Number:
   16 bits. An optional, per adjacency, per packet type
   monotonically increasing number rolling over using sequence number
   arithmetic defined in Appendix A. A node SHOULD correctly set the
   number on subsequent packets or otherwise MUST set the value to
   `undefined_packet_number` as provided in the schema. This number
   can be used to detect losses and misordering in flooding for
   either operational purposes or in implementation to adjust
   flooding behavior to current link or buffer quality. This number
   MUST NOT be used to discard or validate the correctness of
   packets. Packet numbers are incremented on each interface and
   within that for each type of packet independently. This allows
   packet generation and processing to be parallelized for different
   types within an implementation if so desired.

RIFT Major Version:
   8 bits. It allows checking whether protocol versions are
   compatible, i.e. whether the serialized object can be decoded at
   all. An implementation MUST drop packets with unexpected values
   and MAY report a problem.

Outer Key ID:
   8 bits to allow key rollovers. This implies key type and
   algorithm. Value `invalid_key_value_key` means that no valid
   fingerprint was computed. This key ID scope is local to the nodes
   on both ends of the adjacency.

TIE Origin Key ID:
   24 bits. This implies key type and algorithm used.
   Value `invalid_key_value_key` means that no valid fingerprint was
   computed. This key ID scope is global to the RIFT instance since
   it may imply the originator of the TIE so the contained object
   does not have to be de-serialized to obtain the originator.

Length of Fingerprint:
   8 bits. Length in 32-bit multiples of the following fingerprint
   (not including lifetime or weak nonces). It allows the structure
   to be navigated when an unknown key type is present. To clarify,
   a common corner case is a value of 0, which signifies an empty
   (0 bytes long) security fingerprint.

Security Fingerprint:
   32 bits * Length of Fingerprint. This is a signature that is
   computed over all data following it. If the significant bits of
   the fingerprint are fewer than the 32-bit padded length, then the
   significant bits MUST be left-aligned and the remaining bits on
   the right padded with 0s. When using PKI, the node originating
   the security fingerprint uses its private key to create the
   signature. The original packet can then be verified provided the
   public key is shared and current.

Remaining TIE Lifetime:
   32 bits. In case of anything but TIEs this field MUST be set to
   all ones and the Origin Security Envelope Header MUST NOT be
   present in the packet. For TIEs this field represents the
   remaining lifetime of the TIE and the Origin Security Envelope
   Header MUST be present in the packet.

Weak Nonce Local:
   16 bits. Local Weak Nonce of the adjacency as advertised in LIEs.

Weak Nonce Remote:
   16 bits. Remote Weak Nonce of the adjacency as received in LIEs.

TIE Origin Security Envelope Header:
   It MUST be present if and only if the Remaining TIE Lifetime field
   is *not* all ones. It carries through the originator's key ID and
   the corresponding fingerprint of the object to protect the TIE
   from modification during flooding.
   This ensures origin validation and integrity (but does not provide
   validation of a chain of trust).

Observe that due to the schema migration rules per Appendix B the
contained model can always be decoded if the major version matches
and the envelope integrity has been validated. Consequently, the
description of the TIE is available to flood it properly, including
unknown TIE types.

4.4.4.  Weak Nonces

The protocol uses two 16-bit nonces to salt generated signatures.
The term "nonce" is used a bit loosely since RIFT nonces are not
changed in every packet, as is common in cryptography. For
efficiency purposes they are changed at a high enough frequency to
dwarf practical replay attack attempts. Hence, such nonces are
called "weak" nonces from this point on.

Any implementation including RIFT security MUST generate and wrap
around local nonces properly. When a nonce increment leads to the
`undefined_nonce` value, the value MUST be incremented again
immediately. All implementations MUST reflect the neighbor's nonces.
An implementation SHOULD increment a chosen nonce on every LIE FSM
transition that ends up in a different state from the previous one
and MUST increment its nonce at least every
`nonce_regeneration_interval` (such considerations allow for
efficient implementations without opening a significant security
risk). When flooding TIEs, the implementation MUST use recent (i.e.
within allowed difference) nonces reflected in the LIE exchange. The
schema specifies in `maximum_valid_nonce_delta` the maximum allowable
nonce value difference on a packet compared to reflected nonces in
the LIEs. Any packet received with nonces deviating more than the
allowed delta MUST be discarded without further computation of
signatures to prevent computation load attacks.
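
The nonce rules above can be sketched in a few lines of Python. This
is an illustrative sketch only, not the schema's definition: the
helper names, the `UNDEFINED_NONCE = 0` value and the delta of 5 used
in the example are assumptions, and taking the distance across the
16-bit wrap-around is one plausible reading of "difference".

```python
NONCE_MODULUS = 1 << 16   # nonces are 16 bits wide
UNDEFINED_NONCE = 0       # assumed value; the real one comes from the schema


def next_nonce(current: int) -> int:
    """Increment a local nonce, wrapping around and skipping the undefined value."""
    candidate = (current + 1) % NONCE_MODULUS
    if candidate == UNDEFINED_NONCE:
        # The text requires incrementing again immediately in this case.
        candidate = (candidate + 1) % NONCE_MODULUS
    return candidate


def nonce_distance(a: int, b: int) -> int:
    """Smallest difference between two 16-bit nonces in either direction."""
    forward = (a - b) % NONCE_MODULUS
    return min(forward, NONCE_MODULUS - forward)


def nonce_within_delta(received: int, expected: int, max_delta: int) -> bool:
    """True if a reflected nonce is close enough to the expected value.

    A packet failing this check would be discarded before any
    fingerprint is computed.
    """
    return nonce_distance(received, expected) <= max_delta
```

With a hypothetical delta of 5, `nonce_within_delta(65534, 2, 5)`
holds because the distance is taken across the wrap-around, while
`nonce_within_delta(100, 110, 5)` does not.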
The delta is either a negative or positive difference that a
mirrored nonce can deviate from the local value to be considered
valid. If nonces are not changed on every packet but at the maximum
interval on both sides, this statistically opens a
`maximum_valid_nonce_delta`/2 window of identical LIE, TIE and
TI(x)E replays. The interval cannot be too small since the LIE FSM
may change states fairly quickly during ZTP without sending LIEs
and, additionally, UDP can both lose and misorder packets.

In cases where a secure implementation does not receive signatures or
receives undefined nonces from a neighbor (indicating that it does
not support or verify signatures), it is a matter of local policy as
to how those packets are treated. A secure implementation MAY refuse
to form an adjacency with an implementation that is not advertising
signatures or valid nonces, or it MAY continue signing local packets
while accepting a neighbor's packets without further security
validation.

As a necessary exception, an implementation MUST advertise the remote
nonce value as `undefined_nonce` when the FSM is not in TwoWay or
ThreeWay state and accept an `undefined_nonce` for its local nonce
value on packets in any other state than ThreeWay.

As an optional optimization, an implementation MAY send one LIE with
the previously negotiated neighbor's nonce to try to speed up a
neighbor's transition from ThreeWay to OneWay and MUST revert to
sending `undefined_nonce` after that.

4.4.5.  Lifetime

Protecting flooding lifetime may lead to an excessive number of
security fingerprint computations. To avoid this, the application
generating the fingerprints for advertised TIEs MAY round the value
down to the next `rounddown_lifetime_interval`.
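
The rounding just described amounts to truncating the lifetime to a
multiple of the interval. A minimal Python sketch, assuming lifetimes
are whole seconds; the function name is hypothetical and the
60-second interval mentioned in the comment is an illustrative value,
not one taken from the schema:

```python
def round_down_lifetime(remaining_lifetime: int, interval: int) -> int:
    """Round a remaining TIE lifetime down to a multiple of `interval`.

    With an interval of 60, every lifetime from 604740 through 604799
    maps to 604740, so the value covered by the fingerprint does not
    change within that range.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if remaining_lifetime < 0:
        raise ValueError("remaining lifetime must not be negative")
    return remaining_lifetime - (remaining_lifetime % interval)
```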
This will limit the number of computations performed for security
purposes caused by lifetime attacks as long as the weak nonce did not
advance.

4.5.  Security Association Changes

There is no mechanism to convert a security envelope for the same key
ID from one algorithm to another once the envelope is operational.
The recommended procedure to change to a new algorithm is to take the
adjacency down, make the necessary changes, and bring the adjacency
back up. Obviously, an implementation MAY choose to stop verifying
the security envelope for the duration of the algorithm change to
keep the adjacency up, but since this introduces a security
vulnerability window, such a roll-over is NOT RECOMMENDED.

5.  Examples

5.1.  Normal Operation

^ N +--------+ +--------+
Level 2 | |ToF 21| |ToF 22|
E <-*-> W ++-+--+-++ ++-+--+-++
| | | | | | | | |
S v P111/2 |P121/2 | | | |
^ ^ ^ ^ | | | |
| | | | | | | |
+--------------+ | +-----------+ | | | +---------------+
| | | | | | | |
South +-----------------------------+ | | ^
| | | | | | | All TIEs
0/0 0/0 0/0 +-----------------------------+ |
v v v | | | | |
| | +-+ +<-0/0----------+ | |
| | | | | | | |
+-+----++ +-+----++ ++----+-+ ++-----++
Level 1 | | | | | | | |
|Spin111| |Spin112| |Spin121| |Spin122|
+-+---+-+ ++----+-+ +-+---+-+ ++---+--+
| | | South | | | |
| +---0/0--->-----+ 0/0 | +----------------+ |
0/0 | | | | | | |
| +---<-0/0-----+ | v | +--------------+ | |
v | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
Level 0 | | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122|
+-+-----+ +-+---+-+ +--+--+-+ +-+-----+
+ + \ / + +
Prefix111 Prefix112 \ / Prefix121 Prefix122
multi-homed
Prefix
+---------- PoD 1 ---------+ +---------- PoD 2 ---------+

                Figure 34: Normal Case Topology

This section
describes RIFT deployment in the example topology given in
Figure 34 without any node or link failures. The scenario disregards
flooding reduction for simplicity's sake and compresses the node
names in some cases to fit them into the picture better.

First, the following bi-directional adjacencies will be established:

1.  ToF 21 (PoD 0) to Spine 111, Spine 112, Spine 121, and Spine 122

2.  ToF 22 (PoD 0) to Spine 111, Spine 112, Spine 121, and Spine 122

3.  Spine 111 to Leaf 111, Leaf 112

4.  Spine 112 to Leaf 111, Leaf 112

5.  Spine 121 to Leaf 121, Leaf 122

6.  Spine 122 to Leaf 121, Leaf 122

Leaf 111 and Leaf 112 originate North TIEs for Prefix 111 and Prefix
112 (respectively) to both Spine 111 and Spine 112 (Leaf 112 also
originates a North TIE for the multi-homed prefix). Spine 111 and
Spine 112 will then originate their own North TIEs, as well as flood
the North TIEs received from Leaf 111 and Leaf 112 to both ToF 21 and
ToF 22.

Similarly, Leaf 121 and Leaf 122 originate North TIEs for Prefix 121
and Prefix 122 (respectively) to Spine 121 and Spine 122 (Leaf 121
also originates a North TIE for the multi-homed prefix). Spine 121
and Spine 122 will then originate their own North TIEs, as well as
flood the North TIEs received from Leaf 121 and Leaf 122 to both ToF
21 and ToF 22.

Spines hold only North TIEs of level 0 for their PoD, while leaves
hold only their own North TIEs. At this point, both ToF 21 and ToF 22
(as well as any northbound connected controllers) would have the
complete network topology.

ToF 21 and ToF 22 would then originate and flood South TIEs
containing any established adjacencies and a default IP route to all
spines. Spine 111, Spine 112, Spine 121, and Spine 122 will reflect
all Node South TIEs received from ToF 21 to ToF 22, and all Node
South TIEs from ToF 22 to ToF 21.
South TIEs will not be re-propagated southbound.

South TIEs containing a default IP route are then originated by both
Spine 111 and Spine 112 toward Leaf 111 and Leaf 112. Similarly,
South TIEs containing a default IP route are originated by Spine 121
and Spine 122 toward Leaf 121 and Leaf 122.

At this point IP connectivity across the maximum number of viable
paths has been established for all leaves, with routing information
constrained to only the minimum amount that allows for normal
operation and redundancy.

5.2.  Leaf Link Failure

  |   |              |   |
+-+---+-+          +-+---+-+
|       |          |       |
|Spin111|          |Spin112|
+-+---+-+          ++----+-+
  |   |             |    |
  |   +---------------+  X
  |                 | |  X Failure
  |   +-------------+ |  X
  |   |               |  |
+-+---+-+          +--+--+-+
|       |          |       |
|Leaf111|          |Leaf112|
+-------+          +-------+
    +                  +
Prefix111          Prefix112

           Figure 35: Single Leaf Link Failure

In the event of a link failure between Spine 112 and Leaf 112, both
nodes will originate new Node TIEs that contain their connected
adjacencies, except for the one that just failed. Leaf 112 will send
a Node North TIE to Spine 111. Spine 112 will send a Node North TIE
to ToF 21 and ToF 22 as well as a new Node South TIE to Leaf 111 that
will be reflected to Spine 111. Necessary SPF recomputation will
occur, resulting in Spine 112 no longer being in the forwarding path
for Prefix 112.

Spine 111 will also disaggregate Prefix 112 by sending a new Prefix
South TIE to Leaf 111 and Leaf 112. Though disaggregation is covered
in more detail in the following section, it is worth mentioning in
this example as it further illustrates RIFT's blackhole mitigation
mechanism. Consider that Leaf 111 has yet to receive the more
specific (disaggregated) route from Spine 111.
In such a scenario, traffic from Leaf 111 toward Prefix 112 may still
use Spine 112's default route, causing it to traverse ToF 21 and ToF
22 back down via Spine 111. While this behavior is suboptimal, it is
transient in nature and preferred to black-holing traffic.

5.3.  Partitioned Fabric

+--------+ +--------+
Level 2 |ToF 21| |ToF 22|
++-+--+-++ ++-+--+-++
| | | | | | | |
| | | | | | | 0/0
| | | | | | | |
| | | | | | | |
+--------------+ | +--- XXXXXX + | | | +---------------+
| | | | | | | |
| +-----------------------------+ | | |
0/0 | | | | | | |
| 0/0 0/0 +- XXXXXXXXXXXXXXXXXXXXXXXXX -+ |
| 1.1/16 | | | | | |
| | +-+ +-0/0-----------+ | |
| | | 1.1/16 | | | |
+-+----++ +-+-----+ ++-----0/0 ++----0/0
Level 1 | | | | | 1.1/16 | 1.1/16
|Spin111| |Spin112| |Spin121| |Spin122|
+-+---+-+ ++----+-+ +-+---+-+ ++---+--+
| | | | | | | |
| +---------------+ | | +----------------+ |
| | | | | | | |
| +-------------+ | | | +--------------+ | |
| | | | | | | |
+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
Level 0 | | | | | | | |
|Leaf111| |Leaf112| |Leaf121| |Leaf122|
+-+-----+ ++------+ +-----+-+ +-+-----+
+ + + +
Prefix111 Prefix112 Prefix121 Prefix122
1.1/16

                  Figure 36: Fabric Partition

Figure 36 shows one of the more catastrophic scenarios where ToF 21
is completely severed from access to Prefix 121 due to a double link
failure. If only default routes existed, this would result in 50% of
traffic from Leaf 111 and Leaf 112 toward Prefix 121 being
black-holed.

The mechanism to resolve this scenario hinges on ToF 21's South TIEs
being reflected from Spine 111 and Spine 112 to ToF 22.
Once ToF 22 sees that Prefix 121 cannot be reached from ToF 21, it
will begin to disaggregate Prefix 121 by advertising a more specific
route (1.1/16) along with the default IP prefix route to all spines
(ToF 21 still only sends a default route). The result is Spine 111
and Spine 112 using the more specific route to Prefix 121 via ToF 22.
All other prefixes continue to use the default IP prefix route toward
both ToF 21 and ToF 22.

The more specific route for Prefix 121 being advertised by ToF 22
does not need to be propagated further south to the leaves, as they
do not benefit from this information. Spine 111 and Spine 112 are
only required to reflect the new South Node TIEs received from ToF 22
to ToF 21. In short, only the relevant nodes receive the relevant
updates, thereby restricting the failure to only the partitioned
level rather than burdening the whole fabric with the flooding and
recomputation of the new topology information.

To finish this example, the following table shows sets computed by
ToF 22 using notation introduced in Section 4.2.5:

   |R = Prefix 111, Prefix 112, Prefix 121, Prefix 122

   |H (for r=Prefix 111) = Spine 111, Spine 112

   |H (for r=Prefix 112) = Spine 111, Spine 112

   |H (for r=Prefix 121) = Spine 121, Spine 122

   |H (for r=Prefix 122) = Spine 121, Spine 122

   |A (for ToF 21) = Spine 111, Spine 112

With that, and with |H (for r=Prefix 121) and |H (for r=Prefix 122)
being disjoint from |A (for ToF 21), ToF 22 will originate a South
TIE with Prefix 121 and Prefix 122, which will be flooded to all
spines.

5.4.
Northbound Partitioned Router and Optional East-West Links

+ + +
X N1 | N2 | N3
X | |
+--+----+ +--+----+ +--+-----+
| |0/0> <0/0| |0/0> <0/0| |
| A01 +----------+ A02 +----------+ A03 | Level 1
++-+-+--+ ++--+--++ +---+-+-++
| | | | | | | | |
| | +----------------------------------+ | | |
| | | | | | | | |
| +-------------+ | | | +--------------+ |
| | | | | | | | |
| +----------------+ | +-----------------+ |
| | | | | | | | |
| | +------------------------------------+ | |
| | | | | | | | |
++-+-+--+ | +---+---+ | +-+---+-++
| | +-+ +-+ | |
| L01 | | L02 | | L03 | Level 0
+-------+ +-------+ +--------+

              Figure 37: North Partitioned Router

Figure 37 shows a part of a fabric where level 1 is horizontally
connected and A01 lost its only northbound adjacency. Based on N-SPF
rules in Section 4.2.4.1, A01 will compute northbound reachability by
using the link A01 to A02. A02, however, will *not* use this link
during N-SPF. The result is A01 utilizing the horizontal link for
default route advertisement and unidirectional routing.

Furthermore, if A02 also loses its only northbound adjacency (N2),
the situation evolves. A01 will no longer have northbound
reachability while it sees A03's northbound adjacencies in South Node
TIEs reflected by nodes south of it. As a result, A01 will no longer
advertise its default route in accordance with Section 4.2.3.8.

6.  Further Details on Implementation

6.1.  Considerations for Leaf-Only Implementation

RIFT can and is intended to be stretched to the lowest level in the
IP fabric to integrate ToRs or even servers. Since those entities
would run as leaves only, it is worth observing that a leaf-only
version is significantly simpler to implement and requires far fewer
resources:

1.
    Leaf nodes only need to maintain a multipath default route under
    normal circumstances. However, in cases of catastrophic
    partitioning, leaf nodes SHOULD be capable of accommodating all
    the leaf routes in their own PoD to prevent black-holing.

2.  Leaf nodes hold only their own North TIEs and South TIEs of Level
    1 nodes they are connected to.

3.  Leaf nodes do not have to support any type of disaggregation
    computation or propagation.

4.  Leaf nodes are not required to support the overload bit.

5.  Leaf nodes do not need to originate South TIEs unless optional
    leaf-2-leaf features are desired.

6.2.  Considerations for Spine Implementation

Nodes that do not act as ToF are not required to discover fallen
leaves by comparing reachable destinations with peers and therefore
do not need to run the computation of disaggregated routes based on
that discovery. On the other hand, non-ToF nodes need to respect
disaggregated routes advertised from the north. In the case of
negative disaggregation, spine nodes need to generate southbound
disaggregated routes when all parents are lost for a fallen leaf.

7.  Security Considerations

7.1.  General

One can consider attack vectors where a router reboots many times
while changing its system ID and pollutes the network with many stale
TIEs, or where TIEs are sent with very long lifetimes and are not
cleaned up when the routes vanish. Those attack vectors are not
unique to RIFT. Given the large memory footprints available today,
those attacks should be relatively benign. Otherwise, a node SHOULD
implement a strategy of discarding contents of all TIEs that were not
present in the SPF tree over a certain, configurable period of time.
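
One way to picture the discard strategy just described is a table
recording when each TIE was last part of the SPF tree. The Python
sketch below is illustrative only: the `TieStore` class, its method
names and the use of plain string TIE identifiers are assumptions,
not part of the protocol.

```python
from typing import Dict, List, Set


class TieStore:
    """Tracks TIE contents and when each TIE was last part of the SPF tree."""

    def __init__(self, max_idle_seconds: float) -> None:
        self.max_idle_seconds = max_idle_seconds  # the configurable period
        self.contents: Dict[str, bytes] = {}
        self.last_in_spf: Dict[str, float] = {}

    def store(self, tie_id: str, body: bytes, now: float) -> None:
        """Keep a TIE; a new TIE starts its idle clock at arrival time."""
        self.contents[tie_id] = body
        self.last_in_spf.setdefault(tie_id, now)

    def mark_spf_result(self, tie_ids_in_tree: Set[str], now: float) -> None:
        """Record which stored TIEs the latest SPF run actually used."""
        for tie_id in tie_ids_in_tree:
            if tie_id in self.contents:
                self.last_in_spf[tie_id] = now

    def discard_stale(self, now: float) -> List[str]:
        """Drop contents of TIEs unused by SPF for longer than the period."""
        stale = sorted(
            tie_id
            for tie_id, seen in self.last_in_spf.items()
            if now - seen > self.max_idle_seconds
        )
        for tie_id in stale:
            del self.contents[tie_id]
            del self.last_in_spf[tie_id]
        return stale
```

A caller would invoke `mark_spf_result` after each SPF run and
`discard_stale` periodically; both the period and the clock source
are left to the implementation.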
Since the protocol, like all modern link-state protocols, is
self-stabilizing and will advertise the presence of such TIEs to its
neighbors, they can be re-requested if a computation finds that it
sees an adjacency formed towards the system ID of the discarded TIEs.

7.2.  Malformed Packets

The protocol protects packets extensively through optional signatures
and nonces, so if the possibility of maliciously injected malformed or
replayed packets exists in a deployment, this conclusively protects
against such attacks.

Even with the security envelope, since RIFT relies on Thrift encoders
and decoders generated automatically from IDL, it is conceivable that
errors in such encoders/decoders could be discovered and lead to
delivery of corrupted packets or reception of packets that cannot be
decoded. Misformatted packets normally lead to the decoder returning
an error condition to the caller, and with that the packet is
basically unparsable with no other choice but to discard it. Should
the unlikely scenario occur of the decoder being forced to abort the
protocol, this is neither better nor worse than today's behavior of
other protocols.

7.3.  ZTP

Section 4.2.7 presents many attack vectors in untrusted environments,
ranging from nodes that oscillate their level offers to nodes that
offer a ThreeWay adjacency with the highest possible level value and
a very long holdtime, trying to put themselves "on top of the
lattice" and thereby gain access to the whole southbound topology.
Session authentication mechanisms are necessary in environments where
this is possible, and RIFT provides the security envelope to ensure
this if so desired.

7.4.  Lifetime

Traditional IGP protocols are vulnerable to lifetime modification and
replay attacks that can be somewhat mitigated by using techniques
like [RFC7987].
RIFT removes this attack vector by protecting the lifetime behind a
signature computed over it and an additional nonce combination, which
makes even the replay attack window very small and for practical
purposes irrelevant since the lifetime cannot be artificially
shortened by the attacker.

7.5.  Packet Number

The optional packet number is carried in the security envelope
without any cryptographic protection and is hence vulnerable to
replay and modification attacks. Contrary to nonces, this number must
change on every packet and would present a very high cryptographic
load if signed. The attack vector that the packet number presents is
relatively benign. Changing the packet number by a man-in-the-middle
attack will only affect operational validation tools and possibly
some performance optimizations on flooding. It is expected that an
implementation detecting too many "fake losses" or "misorderings" due
to the attack on the packet number would simply suppress its further
processing.

7.6.  Outer Fingerprint Attacks

A node observing a conversation on the wire can try to inject LIE
packets by using the outer key ID, although it cannot generate valid
hashes if it changes the content of the message, so the only possible
attack is DoS due to excessive LIE validation.

A node can try to replay previous LIEs with changed state that it
recorded, but the attack is hard to replicate since the nonce
combination must match the ongoing exchange and is then limited to a
single flap only, since both nodes will advance their nonces in case
the adjacency state changed. Even in the most unlikely case, the
attack length is limited due to both sides periodically increasing
their nonces.

7.7.  TIE Origin Fingerprint DoS Attacks

A compromised node can attempt to generate "fake TIEs" using other
nodes' TIE origin key identifiers.
Although the ultimate validation of the origin fingerprint will fail
in such scenarios and the TIEs will not progress further than
immediately peering nodes, the resulting denial-of-service attack
seems unavoidable since the TIE origin key ID is only protected by
the node that is, here, assumed to be compromised.

7.8.  Host Implementations

It can be reasonably expected that, with the proliferation of RotH,
servers rather than dedicated networking devices will represent a
significant share of RIFT devices. Given their normally far wider
software envelope and access granted to them, such servers are also
far more likely to be compromised and present an attack vector on the
protocol. Hijacking of prefixes to attract traffic is a trust
problem and cannot be easily addressed within the protocol if the
trust model is breached, i.e. the server presents valid credentials
to form an adjacency and issue TIEs. In an even more devious way,
the servers can present DoS (or even DDoS) vectors by issuing too
many LIE packets, flooding large amounts of North TIEs and attempting
similar resource overrun attacks. A prudent implementation forming
adjacencies to leaves should implement appropriate threshold
mechanisms and raise warnings when e.g. a leaf is advertising an
excess number of TIEs or prefixes. Additionally, such an
implementation could refuse any topology information except the
node's own TIEs and authenticated, reflected South Node TIEs at its
own level.

To isolate possible attack vectors on the leaf to the largest
possible extent, a dedicated leaf-only implementation could run
without any configuration by hard-coding a well-known adjacency key
(which can always be rolled over by the means of e.g. a well-known
key-value distributed from top of the fabric), leaf level value and
always setting the overload bit.
All other values can be derived by automatic means as described
earlier in the protocol specification.

8.  IANA Considerations

This specification requests multicast address assignments and
standard port numbers. Additionally, registries for the schema are
requested and suggested values provided that reflect the numbers
allocated in the given schema.

8.1.  Requested Multicast and Port Numbers

This document requests allocation of the suggested value 224.0.0.120
in the 'IPv4 Multicast Address Space' registry as
'ALL_V4_RIFT_ROUTERS' and of the suggested value FF02::A1F7 in the
'IPv6 Multicast Address Space' registry as 'ALL_V6_RIFT_ROUTERS'.

This document requests allocation in the 'Service Name and Transport
Protocol Port Number Registry' of the suggested value 914 on UDP for
'RIFT_LIES_PORT' and the suggested value 915 for 'RIFT_TIES_PORT'.

8.2.  Requested Registries with Suggested Values

This section requests registries that help govern the schema via
usual IANA registry procedures. A top level 'RIFT' registry should
hold the corresponding registries requested in the following sections
with their pre-defined values. IANA is requested to store the schema
version introducing the allocated value as well as, optionally, its
description when present. This will allow different values to be
assigned to an entry depending on schema version. Alternately, IANA
is requested to consider a root RIFT/3 registry to store RIFT schema
major version 3 values and may be requested in the future to create a
RIFT/4 registry under that. In any case, IANA is requested to store
the schema version in the entries since that will allow
distinguishing between minor versions in the same major schema
version. All values not suggested are to be considered `Unassigned`.
The range of every registry is a 16-bit integer.
Allocation of new values is always performed via `Expert Review`
action.

8.2.1.  Registry RIFT_v5/common/AddressFamilyType

Address family type.

8.2.1.1.  Requested Entries

+=======================+=======+================+=============+
| Name                  | Value | Schema Version | Description |
+=======================+=======+================+=============+
| Illegal               | 0     | 5.0            |             |
+-----------------------+-------+----------------+-------------+
| AddressFamilyMinValue | 1     | 5.0            |             |
+-----------------------+-------+----------------+-------------+
| IPv4                  | 2     | 5.0            |             |
+-----------------------+-------+----------------+-------------+
| IPv6                  | 3     | 5.0            |             |
+-----------------------+-------+----------------+-------------+
| AddressFamilyMaxValue | 4     | 5.0            |             |
+-----------------------+-------+----------------+-------------+

                            Table 7

8.2.2.  Registry RIFT_v5/common/HierarchyIndications

Flags indicating node configuration in case of ZTP.

8.2.2.1.  Requested Entries

+======================================+=====+========+=============+
| Name                                 |Value|  Schema| Description |
|                                      |     | Version|             |
+======================================+=====+========+=============+
| leaf_only                            |    0|     5.0|             |
+--------------------------------------+-----+--------+-------------+
| leaf_only_and_leaf_2_leaf_procedures |    1|     5.0|             |
+--------------------------------------+-----+--------+-------------+
| top_of_fabric                        |    2|     5.0|             |
+--------------------------------------+-----+--------+-------------+

                              Table 8

8.2.3.  Registry RIFT_v5/common/IEEE802_1ASTimeStampType

Timestamp per IEEE 802.1AS, all values MUST be interpreted in
implementation as unsigned.

8.2.3.1.
Requested Entries

+=========+=======+================+=============+
| Name    | Value | Schema Version | Description |
+=========+=======+================+=============+
| AS_sec  | 1     | 5.0            |             |
+---------+-------+----------------+-------------+
| AS_nsec | 2     | 5.0            |             |
+---------+-------+----------------+-------------+

                     Table 9

8.2.4.  Registry RIFT_v5/common/IPAddressType

IP address type.

8.2.4.1.  Requested Entries

+=============+=======+================+=================+
| Name        | Value | Schema Version | Description     |
+=============+=======+================+=================+
| ipv4address | 1     | 5.0            | Content is IPv4 |
+-------------+-------+----------------+-----------------+
| ipv6address | 2     | 5.0            | Content is IPv6 |
+-------------+-------+----------------+-----------------+

                         Table 10

8.2.5.  Registry RIFT_v5/common/IPPrefixType

Prefix advertisement.

Note: for interface addresses the protocol can propagate the address
part beyond the subnet mask, and on reachability computation that has
to be normalized. The non-significant bits can be used for
operational purposes.

8.2.5.1.  Requested Entries

+============+=======+================+=============+
| Name       | Value | Schema Version | Description |
+============+=======+================+=============+
| ipv4prefix | 1     | 5.0            |             |
+------------+-------+----------------+-------------+
| ipv6prefix | 2     | 5.0            |             |
+------------+-------+----------------+-------------+

                      Table 11

8.2.6.  Registry RIFT_v5/common/IPv4PrefixType

IPv4 prefix type.

8.2.6.1.
Requested Entries

5873     +===========+=======+================+=============+
5874     | Name      | Value | Schema Version | Description |
5875     +===========+=======+================+=============+
5876     | address   | 1     | 5.0            |             |
5877     +-----------+-------+----------------+-------------+
5878     | prefixlen | 2     | 5.0            |             |
5879     +-----------+-------+----------------+-------------+

5881                          Table 12

5883  8.2.7.  Registry RIFT_v5/common/IPv6PrefixType

5885     IPv6 prefix type.

5887  8.2.7.1.  Requested Entries

5889     +===========+=======+================+=============+
5890     | Name      | Value | Schema Version | Description |
5891     +===========+=======+================+=============+
5892     | address   | 1     | 5.0            |             |
5893     +-----------+-------+----------------+-------------+
5894     | prefixlen | 2     | 5.0            |             |
5895     +-----------+-------+----------------+-------------+

5897                          Table 13

5899  8.2.8.  Registry RIFT_v5/common/PrefixSequenceType

5901     Sequence of a prefix in case of move.

5903  8.2.8.1.  Requested Entries

5905     +===============+=======+=========+============================+
5906     | Name          | Value | Schema  | Description                |
5907     |               |       | Version |                            |
5908     +===============+=======+=========+============================+
5909     | timestamp     | 1     | 5.0     |                            |
5910     +---------------+-------+---------+----------------------------+
5911     | transactionid | 2     | 5.0     | Transaction ID set by the  |
5912     |               |       |         | client, e.g. in 6LoWPAN.   |
5913     +---------------+-------+---------+----------------------------+

5915                          Table 14

5917  8.2.9.  Registry RIFT_v5/common/RouteType

5919     RIFT route types.  @note: The only purpose of these values is to
5920     introduce an ordering; an implementation can internally choose any
5921     other values as long as the ordering is preserved.

5923  8.2.9.1.
Requested Entries 5925 +=====================+=======+================+=============+ 5926 | Name | Value | Schema Version | Description | 5927 +=====================+=======+================+=============+ 5928 | Illegal | 0 | 5.0 | | 5929 +---------------------+-------+----------------+-------------+ 5930 | RouteTypeMinValue | 1 | 5.0 | | 5931 +---------------------+-------+----------------+-------------+ 5932 | Discard | 2 | 5.0 | | 5933 +---------------------+-------+----------------+-------------+ 5934 | LocalPrefix | 3 | 5.0 | | 5935 +---------------------+-------+----------------+-------------+ 5936 | SouthPGPPrefix | 4 | 5.0 | | 5937 +---------------------+-------+----------------+-------------+ 5938 | NorthPGPPrefix | 5 | 5.0 | | 5939 +---------------------+-------+----------------+-------------+ 5940 | NorthPrefix | 6 | 5.0 | | 5941 +---------------------+-------+----------------+-------------+ 5942 | NorthExternalPrefix | 7 | 5.0 | | 5943 +---------------------+-------+----------------+-------------+ 5944 | SouthPrefix | 8 | 5.0 | | 5945 +---------------------+-------+----------------+-------------+ 5946 | SouthExternalPrefix | 9 | 5.0 | | 5947 +---------------------+-------+----------------+-------------+ 5948 | NegativeSouthPrefix | 10 | 5.0 | | 5949 +---------------------+-------+----------------+-------------+ 5950 | RouteTypeMaxValue | 11 | 5.0 | | 5951 +---------------------+-------+----------------+-------------+ 5953 Table 15 5955 8.2.10. Registry RIFT_v5/common/TIETypeType" 5957 Type of TIE. 5959 8.2.10.1. 
Requested Entries 5961 +===========================================+=====+=======+===========+ 5962 |Name |Value| Schema|Description| 5963 | | |Version| | 5964 +===========================================+=====+=======+===========+ 5965 |Illegal | 0| 5.0| | 5966 +-------------------------------------------+-----+-------+-----------+ 5967 |TIETypeMinValue | 1| 5.0| | 5968 +-------------------------------------------+-----+-------+-----------+ 5969 |NodeTIEType | 2| 5.0| | 5970 +-------------------------------------------+-----+-------+-----------+ 5971 |PrefixTIEType | 3| 5.0| | 5972 +-------------------------------------------+-----+-------+-----------+ 5973 |PositiveDisaggregationPrefixTIEType | 4| 5.0| | 5974 +-------------------------------------------+-----+-------+-----------+ 5975 |NegativeDisaggregationPrefixTIEType | 5| 5.0| | 5976 +-------------------------------------------+-----+-------+-----------+ 5977 |PGPrefixTIEType | 6| 5.0| | 5978 +-------------------------------------------+-----+-------+-----------+ 5979 |KeyValueTIEType | 7| 5.0| | 5980 +-------------------------------------------+-----+-------+-----------+ 5981 |ExternalPrefixTIEType | 8| 5.0| | 5982 +-------------------------------------------+-----+-------+-----------+ 5983 |PositiveExternalDisaggregationPrefixTIEType| 9| 5.0| | 5984 +-------------------------------------------+-----+-------+-----------+ 5985 |TIETypeMaxValue | 10| 5.0| | 5986 +-------------------------------------------+-----+-------+-----------+ 5988 Table 16 5990 8.2.11. Registry RIFT_v5/common/TieDirectionType" 5992 Direction of TIEs. 5994 8.2.11.1. 
Requested Entries 5996 +===================+=======+================+=============+ 5997 | Name | Value | Schema Version | Description | 5998 +===================+=======+================+=============+ 5999 | Illegal | 0 | 5.0 | | 6000 +-------------------+-------+----------------+-------------+ 6001 | South | 1 | 5.0 | | 6002 +-------------------+-------+----------------+-------------+ 6003 | North | 2 | 5.0 | | 6004 +-------------------+-------+----------------+-------------+ 6005 | DirectionMaxValue | 3 | 5.0 | | 6006 +-------------------+-------+----------------+-------------+ 6008 Table 17 6010 8.2.12. Registry RIFT_v5/encoding/Community" 6012 Prefix community. 6014 8.2.12.1. Requested Entries 6016 +========+=======+================+===================+ 6017 | Name | Value | Schema Version | Description | 6018 +========+=======+================+===================+ 6019 | top | 1 | 5.0 | Higher order bits | 6020 +--------+-------+----------------+-------------------+ 6021 | bottom | 2 | 5.0 | Lower order bits | 6022 +--------+-------+----------------+-------------------+ 6024 Table 18 6026 8.2.13. Registry RIFT_v5/encoding/KeyValueTIEElement" 6028 Generic key value pairs. 6030 8.2.13.1. Requested Entries 6032 +===========+=======+================+=============+ 6033 | Name | Value | Schema Version | Description | 6034 +===========+=======+================+=============+ 6035 | keyvalues | 1 | 5.0 | | 6036 +-----------+-------+----------------+-------------+ 6038 Table 19 6040 8.2.14. Registry RIFT_v5/encoding/LIEPacket" 6042 RIFT LIE Packet. 6044 @note: this node's level is already included on the packet header 6046 8.2.14.1. Requested Entries 6048 +=============================+=======+=========+=================+ 6049 | Name | Value | Schema | Description | 6050 | | | Version | | 6051 +=============================+=======+=========+=================+ 6052 | name | 1 | 5.0 | Node or | 6053 | | | | adjacency name. 
| 6054 +-----------------------------+-------+---------+-----------------+ 6055 | local_id | 2 | 5.0 | Local link ID. | 6056 +-----------------------------+-------+---------+-----------------+ 6057 | flood_port | 3 | 5.0 | UDP port to | 6058 | | | | which we can | 6059 | | | | receive flooded | 6060 | | | | TIEs. | 6061 +-----------------------------+-------+---------+-----------------+ 6062 | link_mtu_size | 4 | 5.0 | Layer 3 MTU, | 6063 | | | | used to | 6064 | | | | discover | 6065 | | | | mismatch. | 6066 +-----------------------------+-------+---------+-----------------+ 6067 | link_bandwidth | 5 | 5.0 | Local link | 6068 | | | | bandwidth on | 6069 | | | | the interface. | 6070 +-----------------------------+-------+---------+-----------------+ 6071 | neighbor | 6 | 5.0 | Reflects the | 6072 | | | | neighbor once | 6073 | | | | received to | 6074 | | | | provide 3-way | 6075 | | | | connectivity. | 6076 +-----------------------------+-------+---------+-----------------+ 6077 | pod | 7 | 5.0 | Node's PoD. | 6078 +-----------------------------+-------+---------+-----------------+ 6079 | node_capabilities | 10 | 5.0 | Node | 6080 | | | | capabilities | 6081 | | | | supported. | 6082 +-----------------------------+-------+---------+-----------------+ 6083 | link_capabilities | 11 | 5.0 | Capabilities of | 6084 | | | | this link. | 6085 +-----------------------------+-------+---------+-----------------+ 6086 | holdtime | 12 | 5.0 | Required | 6087 | | | | holdtime of the | 6088 | | | | adjacency, i.e. | 6089 | | | | for how long a | 6090 | | | | period should | 6091 | | | | adjacency be | 6092 | | | | kept up without | 6093 | | | | valid LIE | 6094 | | | | reception. | 6095 +-----------------------------+-------+---------+-----------------+ 6096 | label | 13 | 5.0 | Optional, | 6097 | | | | unsolicited, | 6098 | | | | downstream | 6099 | | | | assigned | 6100 | | | | locally | 6101 | | | | significant | 6102 | | | | label value for | 6103 | | | | the adjacency. 
| 6104 +-----------------------------+-------+---------+-----------------+ 6105 | not_a_ztp_offer | 21 | 5.0 | Indicates that | 6106 | | | | the level on | 6107 | | | | the LIE must | 6108 | | | | not be used to | 6109 | | | | derive a ZTP | 6110 | | | | level by the | 6111 | | | | receiving node. | 6112 +-----------------------------+-------+---------+-----------------+ 6113 | you_are_flood_repeater | 22 | 5.0 | Indicates to | 6114 | | | | northbound | 6115 | | | | neighbor that | 6116 | | | | it should be | 6117 | | | | reflooding TIEs | 6118 | | | | received from | 6119 | | | | this node to | 6120 | | | | achieve flood | 6121 | | | | reduction and | 6122 | | | | balancing for | 6123 | | | | northbound | 6124 | | | | flooding. | 6125 +-----------------------------+-------+---------+-----------------+ 6126 | you_are_sending_too_quickly | 23 | 5.0 | Indicates to | 6127 | | | | neighbor to | 6128 | | | | flood node TIEs | 6129 | | | | only and slow | 6130 | | | | down all other | 6131 | | | | TIEs. Ignored | 6132 | | | | when received | 6133 | | | | from southbound | 6134 | | | | neighbor. | 6135 +-----------------------------+-------+---------+-----------------+ 6136 | instance_name | 24 | 5.0 | Instance name | 6137 | | | | in case | 6138 | | | | multiple RIFT | 6139 | | | | instances | 6140 | | | | running on same | 6141 | | | | interface. | 6142 +-----------------------------+-------+---------+-----------------+ 6144 Table 20 6146 8.2.15. Registry RIFT_v5/encoding/LinkCapabilities" 6148 Link capabilities. 6150 8.2.15.1. Requested Entries 6152 +=========================+=======+=========+===================+ 6153 | Name | Value | Schema | Description | 6154 | | | Version | | 6155 +=========================+=======+=========+===================+ 6156 | bfd | 1 | 5.0 | Indicates that | 6157 | | | | the link is | 6158 | | | | supporting BFD. 
| 6159 +-------------------------+-------+---------+-------------------+ 6160 | ipv4_forwarding_capable | 2 | 5.0 | Indicates whether | 6161 | | | | the interface | 6162 | | | | will support IPv4 | 6163 | | | | forwarding. | 6164 +-------------------------+-------+---------+-------------------+ 6166 Table 21 6168 8.2.16. Registry RIFT_v5/encoding/LinkIDPair" 6170 LinkID pair describes one of parallel links between two nodes. 6172 8.2.16.1. Requested Entries 6174 +============================+=======+=========+====================+ 6175 | Name | Value | Schema | Description | 6176 | | | Version | | 6177 +============================+=======+=========+====================+ 6178 | local_id | 1 | 5.0 | Node-wide unique | 6179 | | | | value for the | 6180 | | | | local link. | 6181 +----------------------------+-------+---------+--------------------+ 6182 | remote_id | 2 | 5.0 | Received remote | 6183 | | | | link ID for this | 6184 | | | | link. | 6185 +----------------------------+-------+---------+--------------------+ 6186 | platform_interface_index | 10 | 5.0 | Describes the | 6187 | | | | local interface | 6188 | | | | index of the | 6189 | | | | link. | 6190 +----------------------------+-------+---------+--------------------+ 6191 | platform_interface_name | 11 | 5.0 | Describes the | 6192 | | | | local interface | 6193 | | | | name. | 6194 +----------------------------+-------+---------+--------------------+ 6195 | trusted_outer_security_key | 12 | 5.0 | Indicates | 6196 | | | | whether the link | 6197 | | | | is secured, i.e. | 6198 | | | | protected by | 6199 | | | | outer key, | 6200 | | | | absence of this | 6201 | | | | element means no | 6202 | | | | indication, | 6203 | | | | undefined outer | 6204 | | | | key means not | 6205 | | | | secured. 
| 6206 +----------------------------+-------+---------+--------------------+ 6207 | bfd_up | 13 | 5.0 | Indicates | 6208 | | | | whether the link | 6209 | | | | is protected by | 6210 | | | | established BFD | 6211 | | | | session. | 6212 +----------------------------+-------+---------+--------------------+ 6213 | address_families | 14 | 5.0 | Optional | 6214 | | | | indication which | 6215 | | | | address families | 6216 | | | | are up on the | 6217 | | | | interface | 6218 +----------------------------+-------+---------+--------------------+ 6220 Table 22 6222 8.2.17. Registry RIFT_v5/encoding/Neighbor" 6224 Neighbor structure. 6226 8.2.17.1. Requested Entries 6228 +============+=======+================+===================+ 6229 | Name | Value | Schema Version | Description | 6230 +============+=======+================+===================+ 6231 | originator | 1 | 5.0 | System ID of the | 6232 | | | | originator. | 6233 +------------+-------+----------------+-------------------+ 6234 | remote_id | 2 | 5.0 | ID of remote side | 6235 | | | | of the link. | 6236 +------------+-------+----------------+-------------------+ 6238 Table 23 6240 8.2.18. Registry RIFT_v5/encoding/NodeCapabilities" 6242 Capabilities the node supports. 6244 8.2.18.1. Requested Entries 6246 +========================+=======+=========+======================+ 6247 | Name | Value | Schema | Description | 6248 | | | Version | | 6249 +========================+=======+=========+======================+ 6250 | protocol_minor_version | 1 | 5.0 | Must advertise | 6251 | | | | supported minor | 6252 | | | | version dialect that | 6253 | | | | way. | 6254 +------------------------+-------+---------+----------------------+ 6255 | flood_reduction | 2 | 5.0 | indicates that node | 6256 | | | | supports flood | 6257 | | | | reduction. | 6258 +------------------------+-------+---------+----------------------+ 6259 | hierarchy_indications | 3 | 5.0 | indicates place in | 6260 | | | | hierarchy, i.e. 
top- |
6261     |                        |       |         | of-fabric or leaf    |
6262     |                        |       |         | only (in ZTP) or     |
6263     |                        |       |         | support for leaf-    |
6264     |                        |       |         | 2-leaf procedures.   |
6265     +------------------------+-------+---------+----------------------+

6267                          Table 24

6269  8.2.19.  Registry RIFT_v5/encoding/NodeFlags

6271     Indication flags of the node.

6273  8.2.19.1.  Requested Entries

6275     +==========+=======+=========+=====================================+
6276     | Name     | Value | Schema  | Description                         |
6277     |          |       | Version |                                     |
6278     +==========+=======+=========+=====================================+
6279     | overload | 1     | 5.0     | Indicates that node is in overload, |
6280     |          |       |         | do not transit traffic through it.  |
6281     +----------+-------+---------+-------------------------------------+

6283                          Table 25

6285  8.2.20.  Registry RIFT_v5/encoding/NodeNeighborsTIEElement

6287     Neighbor of a node.

6289  8.2.20.1.  Requested Entries

6291     +===========+=======+=========+====================================+
6292     | Name      | Value | Schema  | Description                        |
6293     |           |       | Version |                                    |
6294     +===========+=======+=========+====================================+
6295     | level     | 1     | 5.0     | Level of neighbor.                 |
6296     +-----------+-------+---------+------------------------------------+
6297     | cost      | 3     | 5.0     | Cost to neighbor. Ignore anything  |
6298     |           |       |         | larger than `infinite_distance`    |
6299     |           |       |         | and `invalid_distance`.            |
6300     +-----------+-------+---------+------------------------------------+
6301     | link_ids  | 4     | 5.0     | Can carry description of multiple  |
6302     |           |       |         | parallel links in a TIE.           |
6303     +-----------+-------+---------+------------------------------------+
6304     | bandwidth | 5     | 5.0     | Total bandwidth to neighbor as sum |
6305     |           |       |         | of all parallel links.             |
6306     +-----------+-------+---------+------------------------------------+

6308                          Table 26

6310  8.2.21.  Registry RIFT_v5/encoding/NodeTIEElement

6312     Description of a node.

6314  8.2.21.1.
Requested Entries 6316 +=================+=======+=========+=============================+ 6317 | Name | Value | Schema | Description | 6318 | | | Version | | 6319 +=================+=======+=========+=============================+ 6320 | level | 1 | 5.0 | Level of the node. | 6321 +-----------------+-------+---------+-----------------------------+ 6322 | neighbors | 2 | 5.0 | Node's neighbors. Multiple | 6323 | | | | node TIEs can carry | 6324 | | | | disjoint sets of neighbors. | 6325 +-----------------+-------+---------+-----------------------------+ 6326 | capabilities | 3 | 5.0 | Capabilities of the node. | 6327 +-----------------+-------+---------+-----------------------------+ 6328 | flags | 4 | 5.0 | Flags of the node. | 6329 +-----------------+-------+---------+-----------------------------+ 6330 | name | 5 | 5.0 | Optional node name for | 6331 | | | | easier operations. | 6332 +-----------------+-------+---------+-----------------------------+ 6333 | pod | 6 | 5.0 | PoD to which the node | 6334 | | | | belongs. | 6335 +-----------------+-------+---------+-----------------------------+ 6336 | startup_time | 7 | 5.0 | optional startup time of | 6337 | | | | the node | 6338 +-----------------+-------+---------+-----------------------------+ 6339 | miscabled_links | 10 | 5.0 | If any local links are | 6340 | | | | miscabled, this indication | 6341 | | | | is flooded. | 6342 +-----------------+-------+---------+-----------------------------+ 6344 Table 27 6346 8.2.22. Registry RIFT_v5/encoding/PacketContent" 6348 Content of a RIFT packet. 6350 8.2.22.1. 
Requested Entries 6352 +======+=======+================+=============+ 6353 | Name | Value | Schema Version | Description | 6354 +======+=======+================+=============+ 6355 | lie | 1 | 5.0 | | 6356 +------+-------+----------------+-------------+ 6357 | tide | 2 | 5.0 | | 6358 +------+-------+----------------+-------------+ 6359 | tire | 3 | 5.0 | | 6360 +------+-------+----------------+-------------+ 6361 | tie | 4 | 5.0 | | 6362 +------+-------+----------------+-------------+ 6364 Table 28 6366 8.2.23. Registry RIFT_v5/encoding/PacketHeader" 6368 Common RIFT packet header. 6370 8.2.23.1. Requested Entries 6372 +===============+=======+=========+===============================+ 6373 | Name | Value | Schema | Description | 6374 | | | Version | | 6375 +===============+=======+=========+===============================+ 6376 | major_version | 1 | 5.0 | Major version of protocol. | 6377 +---------------+-------+---------+-------------------------------+ 6378 | minor_version | 2 | 5.0 | Minor version of protocol. | 6379 +---------------+-------+---------+-------------------------------+ 6380 | sender | 3 | 5.0 | Node sending the packet, in | 6381 | | | | case of LIE/TIRE/TIDE also | 6382 | | | | the originator of it. | 6383 +---------------+-------+---------+-------------------------------+ 6384 | level | 4 | 5.0 | Level of the node sending the | 6385 | | | | packet, required on | 6386 | | | | everything except LIEs. Lack | 6387 | | | | of presence on LIEs indicates | 6388 | | | | UNDEFINED_LEVEL and is used | 6389 | | | | in ZTP procedures. | 6390 +---------------+-------+---------+-------------------------------+ 6392 Table 29 6394 8.2.24. Registry RIFT_v5/encoding/PrefixAttributes" 6396 Attributes of a prefix. 6398 8.2.24.1. 
Requested Entries

6400     +===================+=======+=========+========================+
6401     | Name              | Value | Schema  | Description            |
6402     |                   |       | Version |                        |
6403     +===================+=======+=========+========================+
6404     | metric            | 2     | 5.0     | Distance of the        |
6405     |                   |       |         | prefix.                |
6406     +-------------------+-------+---------+------------------------+
6407     | tags              | 3     | 5.0     | Generic unordered set  |
6408     |                   |       |         | of route tags, can be  |
6409     |                   |       |         | redistributed to other |
6410     |                   |       |         | protocols or used      |
6411     |                   |       |         | within the context of  |
6412     |                   |       |         | real-time analytics.   |
6413     +-------------------+-------+---------+------------------------+
6414     | monotonic_clock   | 4     | 5.0     | Monotonic clock for    |
6415     |                   |       |         | mobile addresses.      |
6416     +-------------------+-------+---------+------------------------+
6417     | loopback          | 6     | 5.0     | Indicates if the       |
6418     |                   |       |         | prefix is a node       |
6419     |                   |       |         | loopback.              |
6420     +-------------------+-------+---------+------------------------+
6421     | directly_attached | 7     | 5.0     | Indicates that the     |
6422     |                   |       |         | prefix is directly     |
6423     |                   |       |         | attached.              |
6424     +-------------------+-------+---------+------------------------+
6425     | from_link         | 10    | 5.0     | Link to which the      |
6426     |                   |       |         | address belongs.       |
6427     +-------------------+-------+---------+------------------------+

6429                          Table 30

6431  8.2.25.  Registry RIFT_v5/encoding/PrefixTIEElement

6433     TIE carrying prefixes.

6435  8.2.25.1.  Requested Entries

6437     +==========+=======+================+========================+
6438     | Name     | Value | Schema Version | Description            |
6439     +==========+=======+================+========================+
6440     | prefixes | 1     | 5.0            | Prefixes with the      |
6441     |          |       |                | associated attributes. |
6442     +----------+-------+----------------+------------------------+

6444                          Table 31

6446  8.2.26.  Registry RIFT_v5/encoding/ProtocolPacket

6448     RIFT packet structure.

6450  8.2.26.1.
Requested Entries 6452 +=========+=======+================+=============+ 6453 | Name | Value | Schema Version | Description | 6454 +=========+=======+================+=============+ 6455 | header | 1 | 5.0 | | 6456 +---------+-------+----------------+-------------+ 6457 | content | 2 | 5.0 | | 6458 +---------+-------+----------------+-------------+ 6460 Table 32 6462 8.2.27. Registry RIFT_v5/encoding/TIDEPacket" 6464 TIDE with *sorted* TIE headers. 6466 8.2.27.1. Requested Entries 6468 +=============+=======+================+=====================+ 6469 | Name | Value | Schema Version | Description | 6470 +=============+=======+================+=====================+ 6471 | start_range | 1 | 5.0 | First TIE header in | 6472 | | | | the tide packet. | 6473 +-------------+-------+----------------+---------------------+ 6474 | end_range | 2 | 5.0 | Last TIE header in | 6475 | | | | the tide packet. | 6476 +-------------+-------+----------------+---------------------+ 6477 | headers | 3 | 5.0 | _Sorted_ list of | 6478 | | | | headers. | 6479 +-------------+-------+----------------+---------------------+ 6481 Table 33 6483 8.2.28. Registry RIFT_v5/encoding/TIEElement" 6485 Single element in a TIE. 6487 8.2.28.1. 
Requested Entries 6489 +=========================================+=====+=======+=================================+ 6490 |Name |Value| Schema|Description | 6491 | | |Version| | 6492 +=========================================+=====+=======+=================================+ 6493 |node | 1| 5.0| Used in case of enum| 6494 | | | | common.TIETypeType.NodeTIEType.| 6495 +-----------------------------------------+-----+-------+---------------------------------+ 6496 |prefixes | 2| 5.0| Used in case of enum| 6497 | | | |common.TIETypeType.PrefixTIEType.| 6498 +-----------------------------------------+-----+-------+---------------------------------+ 6499 |positive_disaggregation_prefixes | 3| 5.0| Positive prefixes (always| 6500 | | | | southbound).| 6501 +-----------------------------------------+-----+-------+---------------------------------+ 6502 |negative_disaggregation_prefixes | 5| 5.0| Transitive, negative prefixes| 6503 | | | | (always southbound)| 6504 +-----------------------------------------+-----+-------+---------------------------------+ 6505 |external_prefixes | 6| 5.0| Externally reimported prefixes.| 6506 +-----------------------------------------+-----+-------+---------------------------------+ 6507 |positive_external_disaggregation_prefixes| 7| 5.0| Positive external disaggregated| 6508 | | | | prefixes (always southbound).| 6509 +-----------------------------------------+-----+-------+---------------------------------+ 6510 |keyvalues | 9| 5.0| Key-Value store elements.| 6511 +-----------------------------------------+-----+-------+---------------------------------+ 6513 Table 34 6515 8.2.29. Registry RIFT_v5/encoding/TIEHeader" 6517 Header of a TIE. 6519 8.2.29.1. Requested Entries 6521 +======================+=======+=========+=========================+ 6522 | Name | Value | Schema | Description | 6523 | | | Version | | 6524 +======================+=======+=========+=========================+ 6525 | tieid | 2 | 5.0 | ID of the tie. 
| 6526 +----------------------+-------+---------+-------------------------+ 6527 | seq_nr | 3 | 5.0 | Sequence number of the | 6528 | | | | tie. | 6529 +----------------------+-------+---------+-------------------------+ 6530 | origination_time | 10 | 5.0 | Absolute timestamp when | 6531 | | | | the TIE was generated. | 6532 +----------------------+-------+---------+-------------------------+ 6533 | origination_lifetime | 12 | 5.0 | Original lifetime when | 6534 | | | | the TIE was generated. | 6535 +----------------------+-------+---------+-------------------------+ 6537 Table 35 6539 8.2.30. Registry RIFT_v5/encoding/TIEHeaderWithLifeTime" 6541 Header of a TIE as described in TIRE/TIDE. 6543 8.2.30.1. Requested Entries 6545 +====================+=======+================+=====================+ 6546 | Name | Value | Schema Version | Description | 6547 +====================+=======+================+=====================+ 6548 | header | 1 | 5.0 | | 6549 +--------------------+-------+----------------+---------------------+ 6550 | remaining_lifetime | 2 | 5.0 | Remaining | 6551 | | | | lifetime. | 6552 +--------------------+-------+----------------+---------------------+ 6554 Table 36 6556 8.2.31. Registry RIFT_v5/encoding/TIEID" 6558 Unique ID of a TIE. 6560 8.2.31.1. 
Requested Entries

6562     +============+=======+================+======================+
6563     | Name       | Value | Schema Version | Description          |
6564     +============+=======+================+======================+
6565     | direction  | 1     | 5.0            | Direction of TIE.    |
6566     +------------+-------+----------------+----------------------+
6567     | originator | 2     | 5.0            | Indicates originator |
6568     |            |       |                | of the TIE.          |
6569     +------------+-------+----------------+----------------------+
6570     | tietype    | 3     | 5.0            | Type of the TIE.     |
6571     +------------+-------+----------------+----------------------+
6572     | tie_nr     | 4     | 5.0            | Number of the TIE.   |
6573     +------------+-------+----------------+----------------------+

6575                          Table 37

6577  8.2.32.  Registry RIFT_v5/encoding/TIEPacket

6579     TIE packet.

6581  8.2.32.1.  Requested Entries

6583     +=========+=======+================+=============+
6584     | Name    | Value | Schema Version | Description |
6585     +=========+=======+================+=============+
6586     | header  | 1     | 5.0            |             |
6587     +---------+-------+----------------+-------------+
6588     | element | 2     | 5.0            |             |
6589     +---------+-------+----------------+-------------+

6591                          Table 38

6593  8.2.33.  Registry RIFT_v5/encoding/TIREPacket

6595     TIRE packet.

6597  8.2.33.1.  Requested Entries

6599     +=========+=======+================+=============+
6600     | Name    | Value | Schema Version | Description |
6601     +=========+=======+================+=============+
6602     | headers | 1     | 5.0            |             |
6603     +---------+-------+----------------+-------------+

6605                          Table 39

6607  9.  Acknowledgments

6609     A routing protocol of this complexity is the product not of a
6610     single parent but of a village, as the author list already shows.
6611     However, many more people provided input and fine-combed the
6612     specification based on their experience in the design,
6613     implementation, or application of protocols in IP fabrics.  This
6614     section makes an inadequate attempt at recording their contributions.
6616     Many thanks to Naiming Shen for some of the early discussions around
6617     the topic of using IGPs for routing in topologies related to Clos.
6618     Russ White is to be especially acknowledged for the key conversation
6619     on epistemology that made it possible to tie current asynchronous
6620     distributed systems theory results to the modern protocol design
6621     presented here.  Adrian Farrel, Joel Halpern, Jeffrey Zhang, Krzysztof
6622     Szarkowicz, Nagendra Kumar, Melchior Aelmans, Kaushal Tank, Will
6623     Jones, Moin Ahmed, Sandy Zhang and Jordan Head (in no particular
6624     order) provided thoughtful comments that improved the readability of
6625     the document and found a good number of corners where the light
6626     failed to shine.  Kris Price was the first to mention single router,
6627     single arm default considerations.  Jeff Tantsura helped out with
6628     some initial thoughts on BFD interactions while Jeff Haas corrected
6629     several misconceptions about BFD's finer points and helped to improve
6630     the security section around leaf considerations.  Artur Makutunowicz
6631     pointed out many possible improvements and acted as a sounding board
6632     in regard to the modern protocol implementation techniques RIFT is
6633     exploring.  Barak Gafni was the first to clearly formalize the
6634     problem of partitioned spine and fallen leaves, on a (clean) napkin
6635     in Singapore, which led to the very important part of the
6636     specification centered around multiple Top-of-Fabric planes and
6637     negative disaggregation.  Igor Gashinsky and others shared many
6638     thoughts on problems encountered in the design and operation of
6639     large-scale data center fabrics.  Xu Benchong found a delicate error
6640     in the flooding procedures and a schema datatype size mismatch.

6642     Last but not least, Alvaro Retana guided the undertaking by asking
6643     many necessary procedural and technical questions, which not only
6644     improved the content but also laid out the track towards
6645     publication.

6647  10.
Contributors

6649     This work is the product of a list of individuals who are all to be
6650     considered major contributors, regardless of whether their names
6651     made it onto the limited boilerplate author list.

6653     +======================+================+==================+
6654     +======================+================+==================+
6655     | Tony Przygienda, Ed. | Alankar Sharma | Pascal Thubert   |
6656     +----------------------+----------------+------------------+
6657     | Juniper              | Comcast        | Cisco            |
6658     +----------------------+----------------+------------------+
6659     | Bruno Rijsman        | Jordan Head    | Dmitry Afanasiev |
6660     +----------------------+----------------+------------------+
6661     | Individual           | Juniper        | Yandex           |
6662     +----------------------+----------------+------------------+
6663     | Don Fedyk            | Alia Atlas     | John Drake       |
6664     +----------------------+----------------+------------------+
6665     | Individual           | Individual     | Juniper          |
6666     +----------------------+----------------+------------------+
6667     | Ilya Vershkov        |                |                  |
6668     +----------------------+----------------+------------------+
6669     | Mellanox             |                |                  |
6670     +----------------------+----------------+------------------+

6672                      Table 40: RIFT Authors

6674  11.  References

6676  11.1.  Normative References

6678     [EUI64]    IEEE, "Guidelines for Use of Extended Unique Identifier
6679                (EUI), Organizationally Unique Identifier (OUI), and
6680                Company ID (CID)", IEEE EUI,
6681                .

6683     [RFC1982]  Elz, R. and R. Bush, "Serial Number Arithmetic", RFC 1982,
6684                DOI 10.17487/RFC1982, August 1996,
6685                .

6687     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
6688                Requirement Levels", BCP 14, RFC 2119,
6689                DOI 10.17487/RFC2119, March 1997,
6690                .

   [RFC2365]  Meyer, D., "Administratively Scoped IP Multicast", BCP 23,
              RFC 2365, DOI 10.17487/RFC2365, July 1998.

   [RFC4291]  Hinden, R. and S. Deering, "IP Version 6 Addressing
              Architecture", RFC 4291, DOI 10.17487/RFC4291, February
              2006.

   [RFC5082]  Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C.
              Pignataro, "The Generalized TTL Security Mechanism
              (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007.

   [RFC5120]  Przygienda, T., Shen, N., and N. Sheth, "M-ISIS: Multi
              Topology (MT) Routing in Intermediate System to
              Intermediate Systems (IS-ISs)", RFC 5120,
              DOI 10.17487/RFC5120, February 2008.

   [RFC5709]  Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M.,
              Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic
              Authentication", RFC 5709, DOI 10.17487/RFC5709, October
              2009.

   [RFC5881]  Katz, D. and D. Ward, "Bidirectional Forwarding Detection
              (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881,
              DOI 10.17487/RFC5881, June 2010.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and Algorithms
              Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010.

   [RFC6830]  Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The
              Locator/ID Separation Protocol (LISP)", RFC 6830,
              DOI 10.17487/RFC6830, January 2013.

   [RFC7987]  Ginsberg, L., Wells, P., Decraene, B., Przygienda, T., and
              H. Gredler, "IS-IS Minimum Remaining Lifetime", RFC 7987,
              DOI 10.17487/RFC7987, October 2016.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8200]  Deering, S. and R. Hinden, "Internet Protocol, Version 6
              (IPv6) Specification", STD 86, RFC 8200,
              DOI 10.17487/RFC8200, July 2017.

   [RFC8202]  Ginsberg, L., Previdi, S., and W. Henderickx, "IS-IS
              Multi-Instance", RFC 8202, DOI 10.17487/RFC8202, June
              2017.

   [RFC8505]  Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C.
              Perkins, "Registration Extensions for IPv6 over Low-Power
              Wireless Personal Area Network (6LoWPAN) Neighbor
              Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018.

   [thrift]   Apache Software Foundation, "Thrift Language
              Implementation and Documentation".

   [VFR]      Erlebach, T., et al., "Cuts and Disjoint Paths in the
              Valley-Free Path Model of Internet BGP Routing", Springer
              Berlin Heidelberg Combinatorial and Algorithmic Aspects of
              Networking, 2005.

11.2.  Informative References

   [APPLICABILITY]
              Wei, Y., Zhang, Z., Afanasiev, D., Thubert, P., and T.
              Przygienda, "RIFT Applicability", Work in Progress,
              Internet-Draft, draft-ietf-rift-applicability-10, 16
              December 2021.

   [CLOS]     Yuan, X., "On Nonblocking Folded-Clos Networks in Computer
              Communication Environments", IEEE International Parallel &
              Distributed Processing Symposium, 2011.

   [DIJKSTRA] Dijkstra, E. W., "A Note on Two Problems in Connexion with
              Graphs", Journal Numer. Math., 1959.

   [DYNAMO]   De Candia, G., et al., "Dynamo: Amazon's highly available
              key-value store", ACM SIGOPS symposium on Operating
              systems principles (SOSP '07), 2007.

   [EPPSTEIN] Eppstein, D., "Finding the k-Shortest Paths", 1997.

   [FATTREE]  Leiserson, C. E., "Fat-Trees: Universal Networks for
              Hardware-Efficient Supercomputing", 1985.

   [IEEEstd1588]
              IEEE, "IEEE Standard for a Precision Clock Synchronization
              Protocol for Networked Measurement and Control Systems",
              IEEE Standard 1588.

   [IEEEstd8021AS]
              IEEE, "IEEE Standard for Local and Metropolitan Area
              Networks - Timing and Synchronization for Time-Sensitive
              Applications in Bridged Local Area Networks",
              IEEE Standard 802.1AS.

   [RFC0826]  Plummer, D., "An Ethernet Address Resolution Protocol: Or
              Converting Network Protocol Addresses to 48.bit Ethernet
              Address for Transmission on Ethernet Hardware", STD 37,
              RFC 826, DOI 10.17487/RFC0826, November 1982.

   [RFC2131]  Droms, R., "Dynamic Host Configuration Protocol",
              RFC 2131, DOI 10.17487/RFC2131, March 1997.

   [RFC4861]  Narten, T., Nordmark, E., Simpson, W., and H. Soliman,
              "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861,
              DOI 10.17487/RFC4861, September 2007.

   [RFC4862]  Thomson, S., Narten, T., and T. Jinmei, "IPv6 Stateless
              Address Autoconfiguration", RFC 4862,
              DOI 10.17487/RFC4862, September 2007.

   [RFC8415]  Mrugalski, T., Siodelski, M., Volz, B., Yourtchenko, A.,
              Richardson, M., Jiang, S., Lemon, T., and T. Winters,
              "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)",
              RFC 8415, DOI 10.17487/RFC8415, November 2018.

   [VAHDAT08] Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable,
              Commodity Data Center Network Architecture", SIGCOMM,
              2008.

   [Wikipedia]
              Wikipedia,
              "https://en.wikipedia.org/wiki/Serial_number_arithmetic",
              2016.

Appendix A.  Sequence Number Binary Arithmetic

   The only reasonable reference to a sequence number solution cleaner
   than [RFC1982] is given in [Wikipedia].  It basically converts the
   problem into two's complement arithmetic.  Assuming straight two's
   complement subtraction on the bit-width of the sequence number, the
   corresponding >: and =: relations are defined as:

      U_1, U_2 are 12-bit aligned unsigned version numbers

      D_f is ( U_1 - U_2 ) interpreted as two's complement signed 12-bit
      D_b is ( U_2 - U_1 ) interpreted as two's complement signed 12-bit

      U_1 >: U_2 IFF D_f > 0 *and* D_b < 0
      U_1 =: U_2 IFF D_f = 0

   The >: relationship is anti-symmetric but not transitive.  Observe
   that this leaves >: undefined for numbers at the maximum two's
   complement distance, e.g. ( 0 and 0x800 ) in the 12-bit case, since
   D_f and D_b are then both -0x800.

   A simple example of the relationship for 3-bit arithmetic follows.
   The first table indicates the signs of D_f/D_b and the second the
   resulting relationship of U_1 to U_2:

      U2 / U1   0    1    2    3    4    5    6    7
         0     +/+  +/-  +/-  +/-  -/-  -/+  -/+  -/+
         1     -/+  +/+  +/-  +/-  +/-  -/-  -/+  -/+
         2     -/+  -/+  +/+  +/-  +/-  +/-  -/-  -/+
         3     -/+  -/+  -/+  +/+  +/-  +/-  +/-  -/-
         4     -/-  -/+  -/+  -/+  +/+  +/-  +/-  +/-
         5     +/-  -/-  -/+  -/+  -/+  +/+  +/-  +/-
         6     +/-  +/-  -/-  -/+  -/+  -/+  +/+  +/-
         7     +/-  +/-  +/-  -/-  -/+  -/+  -/+  +/+

      U2 / U1   0    1    2    3    4    5    6    7
         0      =    >    >    >    ?    <    <    <
         1      <    =    >    >    >    ?    <    <
         2      <    <    =    >    >    >    ?    <
         3      <    <    <    =    >    >    >    ?
         4      ?    <    <    <    =    >    >    >
         5      >    ?    <    <    <    =    >    >
         6      >    >    ?    <    <    <    =    >
         7      >    >    >    ?    <    <    <    =

Appendix B.  Information Elements Schema

   This section introduces the schema for information elements.  The IDL
   is Thrift [thrift].

   On schema changes that

   1.   change field numbers *or*

   2.   add new *required* fields *or*

   3.   remove any fields *or*

   4.   change lists into sets, unions into structures *or*

   5.   change multiplicity of fields *or*

   6.   change the name of any field or type *or*

   7.   change data types of any field *or*

   8.   add, change or remove a default value of any *existing* field
        *or*

   9.   remove or change any defined constant or constant value *or*

   10.  change any enumeration type except extending
        `common.TIETypeType` (use of enumeration types is generally
        discouraged) *or*

   11.  add a new TIE type to `TIETypeType` with flooding scope
        different from prefix TIE flooding scope

   the major version of the schema MUST increase.  All other changes
   MUST increase the minor version within the same major version.
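
   The major/minor rule above can be sketched in a few lines of Python.
   This is only an illustration under stated assumptions: the `Change`
   labels and the `next_version` helper are hypothetical names that do
   not appear in the schema, and resetting the minor version to zero on
   a major increase is an assumption the text does not spell out.

```python
from enum import Enum, auto


class Change(Enum):
    """Hypothetical labels for kinds of schema change (not part of RIFT)."""
    FIELD_NUMBER_CHANGED = auto()         # rule 1
    REQUIRED_FIELD_ADDED = auto()         # rule 2
    FIELD_REMOVED = auto()                # rule 3
    CONTAINER_KIND_CHANGED = auto()       # rule 4 (list->set, union->struct)
    MULTIPLICITY_CHANGED = auto()         # rule 5
    NAME_CHANGED = auto()                 # rule 6
    DATA_TYPE_CHANGED = auto()            # rule 7
    EXISTING_DEFAULT_CHANGED = auto()     # rule 8
    CONSTANT_REMOVED_OR_CHANGED = auto()  # rule 9
    ENUM_CHANGED = auto()                 # rule 10
    TIE_TYPE_ADDED_OTHER_SCOPE = auto()   # rule 11
    OPTIONAL_FIELD_ADDED = auto()         # not in the list: minor only
    TIE_TYPE_ADDED_PREFIX_SCOPE = auto()  # not in the list: minor only


# Changes that are not covered by rules 1 to 11.
MINOR_ONLY = frozenset({
    Change.OPTIONAL_FIELD_ADDED,
    Change.TIE_TYPE_ADDED_PREFIX_SCOPE,
})


def next_version(major, minor, changes):
    """Return the (major, minor) pair a set of schema changes calls for."""
    changes = set(changes)
    if not changes:
        return major, minor
    if changes - MINOR_ONLY:
        # At least one change falls under rules 1 to 11.
        # Assumption: the minor version restarts at 0 on a major bump.
        return major + 1, 0
    return major, minor + 1


print(next_version(6, 0, [Change.OPTIONAL_FIELD_ADDED]))
# (6, 1)
print(next_version(6, 0, [Change.OPTIONAL_FIELD_ADDED, Change.FIELD_REMOVED]))
# (7, 0)
```

   A single change covered by the list is enough to force the major
   increase, regardless of how many minor-only changes accompany it.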

   Introducing an optional field does not cause a major version increase
   even if the fields inside the structure are optional with defaults.

   All signed integers, as forced by Thrift [thrift] support, must be
   cast for internal purposes to equivalent unsigned values without
   discarding the signedness bit.  An implementation SHOULD try to avoid
   using the signedness bit when generating values.

   The schema is normative.

B.1.  Backwards-Compatible Extension of Schema

   The set of rules in Appendix B guarantees that every decoder can
   process serialized content generated by a higher minor version of the
   schema, and with that the protocol can progress without a 'fork-lift'
   upgrade.  Conversely, content serialized using a major version X is
   *not* expected to be decodable by any implementation using a decoder
   for a model with a major version lower than X.

   Additionally, based on the propagated minor version in encoded
   content and added optional node capabilities, new TIE types or even
   de-facto mandatory fields can be introduced without progressing the
   major version, albeit only nodes supporting such new extensions would
   decode them.  Given that the model is encoded at the source and never
   re-encoded, flooding through nodes that do not understand new
   extensions will preserve the corresponding fields.  However, it is
   important to understand that a higher minor version of a schema does
   *not* guarantee that capabilities introduced in lower minors of the
   same major are supported.  The `node_capabilities` field is used to
   indicate which capabilities are supported.

   Specifically, for future capabilities the schema SHOULD add elements
   to the `NodeCapabilities` field to indicate whether a node supports
   interpretation of schema extensions on the same major revision if
   they are present.  Such fields MUST be optional and have an implicit
   or explicit false default value.  However, if a future capability
   changes route selection or generates blackholes when some nodes do
   not support it, then a major version increment will be unavoidable.
   `NodeCapabilities` shown in LIE MUST match the capabilities shown in
   the Node TIEs, otherwise the behavior is unspecified.  A node
   detecting the mismatch SHOULD generate a notification.

   Alternately or additionally, new optional fields can be introduced
   into e.g. `NodeTIEElement` if a special field is chosen to indicate
   via its presence that an optional feature is enabled (since
   capability to support a feature does not necessarily mean that the
   feature is actually configured and operational).

   To support new TIE types without increasing the major version, the
   `TIEElement` union can be extended with new optional elements for new
   `common.TIETypeType` values as long as the scope of the new TIE
   matches the prefix TIE scope.  In case it is necessary to understand
   whether all nodes can parse the new TIE type, a node capability MUST
   be added in `NodeCapabilities` to prevent a non-homogeneous network.

B.2.  common.thrift

   /**
       Thrift file with common definitions for RIFT
   */

   namespace py common

   /** @note MUST be interpreted in implementation as unsigned 64 bits.
    */
   typedef i64 SystemIDType
   typedef i32 IPv4Address
   typedef i32 MTUSizeType
   /** @note MUST be interpreted in implementation as unsigned
       rolling over number */
   typedef i64 SeqNrType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i32 LifeTimeInSecType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i8  LevelType
   typedef i16 PacketNumberType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i32 PodType
   /** @note MUST be interpreted in implementation as unsigned.
   /** this has to be long enough to accommodate prefix */
   typedef binary IPv6Address
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i16 UDPPortType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i32 TIENrType
   /** @note MUST be interpreted in implementation as unsigned
       This is carried in the
       security envelope and must hence fit into 8 bits. */
   typedef i8  VersionType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i16 MinorVersionType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i32 MetricType
   /** @note MUST be interpreted in implementation as unsigned
       and unstructured */
   typedef i64 RouteTagType
   /** @note MUST be interpreted in implementation as unstructured
       label value */
   typedef i32 LabelType
   /** @note MUST be interpreted in implementation as unsigned */
   typedef i32 BandwithInMegaBitsType
   /** @note Key Value key ID type */
   typedef i32 KeyIDType
   /** node local, unique identification for a link (interface/tunnel
     * etc. Basically anything RIFT runs on). This is kept
     * at 32 bits so it aligns with BFD [RFC5880] discriminator size.
     */
   typedef i32 LinkIDType
   /** @note MUST be interpreted in implementation as unsigned,
       especially since we have the /128 IPv6 case. */
   typedef i8  PrefixLenType
   /** timestamp in seconds since the epoch */
   typedef i64 TimestampInSecsType
   /** security nonce.
       @note MUST be interpreted in implementation as rolling
             over unsigned value */
   typedef i16 NonceType
   /** LIE FSM holdtime type */
   typedef i16 TimeIntervalInSecType
   /** Transaction ID type for prefix mobility as specified by RFC6550,
       value MUST be interpreted in implementation as unsigned */
   typedef i8  PrefixTransactionIDType
   /** Timestamp per IEEE 802.1AS, all values MUST be interpreted in
       implementation as unsigned. */
   struct IEEE802_1ASTimeStampType {
       1: required i64 AS_sec;
       2: optional i32 AS_nsec;
   }
   /** generic counter type */
   typedef i64 CounterType
   /** Platform Interface Index type, i.e. index of interface on hardware,
       can be used e.g. with RFC5837 */

   typedef i32 PlatformInterfaceIndex

   /** Flags indicating node configuration in case of ZTP.
    */
   enum HierarchyIndications {
       /** forces level to `leaf_level` and enables according procedures */
       leaf_only                            = 0,
       /** forces level to `leaf_level` and enables according procedures */
       leaf_only_and_leaf_2_leaf_procedures = 1,
       /** forces level to `top_of_fabric` and enables according
           procedures */
       top_of_fabric                        = 2,
   }

   const PacketNumberType undefined_packet_number = 0
   /** used when node is configured as top of fabric in ZTP.*/
   const LevelType top_of_fabric_level = 24
   /** default bandwidth on a link */
   const BandwithInMegaBitsType default_bandwidth = 100
   /** fixed leaf level when ZTP is not used */
   const LevelType leaf_level = 0
   const LevelType default_level = leaf_level
   const PodType default_pod = 0
   const LinkIDType undefined_linkid = 0

   /** invalid key for key value */
   const KeyIDType invalid_key_value_key = 0
   /** default distance used */
   const MetricType default_distance = 1
   /** any distance larger than this will be considered infinity */
   const MetricType infinite_distance = 0x7FFFFFFF
   /** represents invalid distance */
   const MetricType invalid_distance = 0
   const bool overload_default = false
   const bool flood_reduction_default = true
   /** default LIE FSM LIE TX interval time */
   const TimeIntervalInSecType default_lie_tx_interval = 1
   /** default LIE FSM holddown time */
   const TimeIntervalInSecType default_lie_holdtime = 3
   /** multiplier for default_lie_holdtime to hold down multiple neighbors */
   const i8 multiple_neighbors_lie_holdtime_multipler = 4
   /** default ZTP FSM holddown time */
   const TimeIntervalInSecType default_ztp_holdtime = 1
   /** by default LIE levels are ZTP offers */
   const bool default_not_a_ztp_offer = false
   /** by default everyone is repeating flooding */
   const bool default_you_are_flood_repeater = true
   /** 0 is illegal for SystemID */
   const SystemIDType IllegalSystemID = 0
   /** empty set of nodes */
   const set<SystemIDType> empty_set_of_nodeids = {}
   /** default lifetime of TIE is one week */
   const LifeTimeInSecType default_lifetime = 604800
   /** default lifetime when TIEs are purged is 5 minutes */
   const LifeTimeInSecType purge_lifetime = 300
   /** optional round down interval when TIEs are sent with security hashes
       to prevent excessive computation. **/
   const LifeTimeInSecType rounddown_lifetime_interval = 60
   /** any `TieHeader` that has a smaller lifetime difference
       than this constant is equal (if other fields equal). */
   const LifeTimeInSecType lifetime_diff2ignore = 400

   /** default UDP port to run LIEs on */
   const UDPPortType default_lie_udp_port = 914
   /** default UDP port to receive TIEs on, that can be peer specific */
   const UDPPortType default_tie_udp_flood_port = 915

   /** default MTU link size to use */
   const MTUSizeType default_mtu_size = 1400
   /** default link being BFD capable */
   const bool bfd_default = true

   /** undefined nonce, equivalent to missing nonce */
   const NonceType undefined_nonce = 0;
   /** outer security key id, MUST be interpreted in implementation
       as unsigned */
   typedef i8 OuterSecurityKeyID
   /** security key id, MUST be interpreted in implementation
       as unsigned */
   typedef i32 TIESecurityKeyID
   /** undefined key */
   const TIESecurityKeyID undefined_securitykey_id = 0;
   /** Maximum delta (negative or positive) that a mirrored nonce can
       deviate from local value to be considered valid. */
   const i16 maximum_valid_nonce_delta = 5;
   const TimeIntervalInSecType nonce_regeneration_interval = 300;

   /** Direction of TIEs. */
   enum TieDirectionType {
       Illegal           = 0,
       South             = 1,
       North             = 2,
       DirectionMaxValue = 3,
   }

   /** Address family type. */
   enum AddressFamilyType {
       Illegal               = 0,
       AddressFamilyMinValue = 1,
       IPv4                  = 2,
       IPv6                  = 3,
       AddressFamilyMaxValue = 4,
   }

   /** IPv4 prefix type. */
   struct IPv4PrefixType {
       1: required IPv4Address   address;
       2: required PrefixLenType prefixlen;
   } (python.immutable = "")

   /** IPv6 prefix type. */
   struct IPv6PrefixType {
       1: required IPv6Address   address;
       2: required PrefixLenType prefixlen;
   } (python.immutable = "")

   /** IP address type. */
   union IPAddressType {
       /** Content is IPv4 */
       1: optional IPv4Address ipv4address;
       /** Content is IPv6 */
       2: optional IPv6Address ipv6address;
   } (python.immutable = "")

   /** Prefix advertisement.

       @note: for interface
           addresses the protocol can propagate the address part beyond
           the subnet mask and on reachability computation that has to
           be normalized. The non-significant bits can be used
           for operational purposes.
   */
   union IPPrefixType {
       1: optional IPv4PrefixType ipv4prefix;
       2: optional IPv6PrefixType ipv6prefix;
   } (python.immutable = "")

   /** Sequence of a prefix in case of move.
    */
   struct PrefixSequenceType {
       1: required IEEE802_1ASTimeStampType timestamp;
       /** Transaction ID set by client, e.g. in 6LoWPAN. */
       2: optional PrefixTransactionIDType  transactionid;
   }
   /** Type of TIE.
   */
   enum TIETypeType {
       Illegal                                     = 0,
       TIETypeMinValue                             = 1,
       /** first legal value */
       NodeTIEType                                 = 2,
       PrefixTIEType                               = 3,
       PositiveDisaggregationPrefixTIEType         = 4,
       NegativeDisaggregationPrefixTIEType         = 5,
       PGPrefixTIEType                             = 6,
       KeyValueTIEType                             = 7,
       ExternalPrefixTIEType                       = 8,
       PositiveExternalDisaggregationPrefixTIEType = 9,
       TIETypeMaxValue                             = 10,
   }

   /** RIFT route types.
       @note: The only purpose of those values is to introduce an
              ordering whereas an implementation can choose internally
              any other values as long as the ordering is preserved
    */
   enum RouteType {
       Illegal               = 0,
       RouteTypeMinValue     = 1,
       /** First legal value. */
       /** Discard routes are most preferred */
       Discard               = 2,

       /** Local prefixes are directly attached prefixes on the
        *  system such as e.g. interface routes.
        */
       LocalPrefix           = 3,
       /** Advertised in S-TIEs */
       SouthPGPPrefix        = 4,
       /** Advertised in N-TIEs */
       NorthPGPPrefix        = 5,
       /** Advertised in N-TIEs */
       NorthPrefix           = 6,
       /** Externally imported north */
       NorthExternalPrefix   = 7,
       /** Advertised in S-TIEs, either normal prefix or positive
           disaggregation */
       SouthPrefix           = 8,
       /** Externally imported south */
       SouthExternalPrefix   = 9,
       /** Negative, transitive prefixes are least preferred */
       NegativeSouthPrefix   = 10,
       RouteTypeMaxValue     = 11,
   }

   enum KVTypes {
       OUI       = 1,
       WellKnown = 2,
   }

B.3.  encoding.thrift

   /**
       Thrift file for packet encodings for RIFT

   */

   include "common.thrift"

   namespace py encoding

   /** Represents protocol encoding schema major version */
   const common.VersionType protocol_major_version = 6
   /** Represents protocol encoding schema minor version */
   const common.MinorVersionType protocol_minor_version = 0

   /** Common RIFT packet header. */
   struct PacketHeader {
       /** Major version of protocol. */
       1: required common.VersionType major_version =
               protocol_major_version;
       /** Minor version of protocol. */
       2: required common.MinorVersionType minor_version =
               protocol_minor_version;
       /** Node sending the packet, in case of LIE/TIRE/TIDE
           also the originator of it. */
       3: required common.SystemIDType sender;
       /** Level of the node sending the packet, required on everything
           except LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL
           and is used in ZTP procedures.
        */
       4: optional common.LevelType level;
   }

   /** Prefix community. */
   struct Community {
       /** Higher order bits */
       1: required i32 top;
       /** Lower order bits */
       2: required i32 bottom;

   } (python.immutable = "")

   /** Neighbor structure. */
   struct Neighbor {
       /** System ID of the originator. */
       1: required common.SystemIDType originator;
       /** ID of remote side of the link. */
       2: required common.LinkIDType remote_id;
   } (python.immutable = "")

   /** Capabilities the node supports. */
   struct NodeCapabilities {
       /** Must advertise supported minor version dialect that way. */
       1: required common.MinorVersionType protocol_minor_version =
               protocol_minor_version;
       /** indicates that node supports flood reduction. */
       2: optional bool flood_reduction =
               common.flood_reduction_default;
       /** indicates place in hierarchy, i.e. top-of-fabric or
           leaf only (in ZTP) or support for leaf-2-leaf
           procedures. */
       3: optional common.HierarchyIndications hierarchy_indications;

   } (python.immutable = "")

   /** Link capabilities. */
   struct LinkCapabilities {
       /** Indicates that the link is supporting BFD. */
       1: optional bool bfd =
               common.bfd_default;
       /** Indicates whether the interface will support IPv4 forwarding. */
       2: optional bool ipv4_forwarding_capable =
               true;
   } (python.immutable = "")

   /** RIFT LIE Packet.

       @note: this node's level is already included on the packet header
   */
   struct LIEPacket {
       /** Node or adjacency name. */
       1: optional string name;
       /** Local link ID. */
       2: required common.LinkIDType local_id;
       /** UDP port to which we can receive flooded TIEs. */
       3: required common.UDPPortType flood_port =
               common.default_tie_udp_flood_port;

       /** Layer 3 MTU, used to discover mismatch. */
       4: optional common.MTUSizeType link_mtu_size =
               common.default_mtu_size;
       /** Local link bandwidth on the interface. */
       5: optional common.BandwithInMegaBitsType
               link_bandwidth = common.default_bandwidth;
       /** Reflects the neighbor once received to provide
           3-way connectivity. */
       6: optional Neighbor neighbor;
       /** Node's PoD. */
       7: optional common.PodType pod =
               common.default_pod;
       /** Node capabilities supported. */
       10: required NodeCapabilities node_capabilities;
       /** Capabilities of this link. */
       11: optional LinkCapabilities link_capabilities;
       /** Required holdtime of the adjacency, i.e. for how long a
           period the adjacency should be kept up without valid LIE
           reception. */
       12: required common.TimeIntervalInSecType
               holdtime = common.default_lie_holdtime;
       /** Optional, unsolicited, downstream assigned locally significant
           label value for the adjacency. */
       13: optional common.LabelType label;
       /** Indicates that the level on the LIE must not be used
           to derive a ZTP level by the receiving node. */
       21: optional bool not_a_ztp_offer =
               common.default_not_a_ztp_offer;
       /** Indicates to northbound neighbor that it should
           be reflooding TIEs received from this node to achieve flood
           reduction and balancing for northbound flooding. */
       22: optional bool you_are_flood_repeater =
               common.default_you_are_flood_repeater;
       /** Indicates to neighbor to flood node TIEs only and slow down
           all other TIEs. Ignored when received from southbound
           neighbor. */
       23: optional bool you_are_sending_too_quickly =
               false;
       /** Instance name in case multiple RIFT instances running on same
           interface. */
       24: optional string instance_name;

   }

   /** LinkID pair describes one of parallel links between two nodes. */
   struct LinkIDPair {
       /** Node-wide unique value for the local link. */
       1: required common.LinkIDType local_id;
       /** Received remote link ID for this link. */
       2: required common.LinkIDType remote_id;
       /** Describes the local interface index of the link. */
       10: optional common.PlatformInterfaceIndex platform_interface_index;
       /** Describes the local interface name. */
       11: optional string platform_interface_name;
       /** Indicates whether the link is secured, i.e. protected by
           outer key, absence of this element means no indication,
           undefined outer key means not secured. */
       12: optional common.OuterSecurityKeyID
               trusted_outer_security_key;
       /** Indicates whether the link is protected by established
           BFD session. */
       13: optional bool bfd_up;
       /** Optional indication which address families are up on the
           interface */
       14: optional set<common.AddressFamilyType>
               (python.immutable = "") address_families;
   } (python.immutable = "")

   /** Unique ID of a TIE. */
   struct TIEID {
       /** direction of TIE */
       1: required common.TieDirectionType direction;
       /** indicates originator of the TIE */
       2: required common.SystemIDType originator;
       /** type of the tie */
       3: required common.TIETypeType tietype;
       /** number of the tie */
       4: required common.TIENrType tie_nr;
   } (python.immutable = "")

   /** Header of a TIE. */
   struct TIEHeader {
       /** ID of the tie. */
       2: required TIEID tieid;
       /** Sequence number of the tie. */
       3: required common.SeqNrType seq_nr;

       /** Absolute timestamp when the TIE was generated. */
       10: optional common.IEEE802_1ASTimeStampType origination_time;
       /** Original lifetime when the TIE was generated. */
       12: optional common.LifeTimeInSecType origination_lifetime;
   }

   /** Header of a TIE as described in TIRE/TIDE.
   */
   struct TIEHeaderWithLifeTime {
       1: required TIEHeader header;
       /** Remaining lifetime. */
       2: required common.LifeTimeInSecType remaining_lifetime;
   }

   /** TIDE with *sorted* TIE headers. */
   struct TIDEPacket {
       /** First TIE header in the tide packet. */
       1: required TIEID start_range;
       /** Last TIE header in the tide packet. */
       2: required TIEID end_range;
       /** _Sorted_ list of headers. */
       3: required list<TIEHeaderWithLifeTime>
               (python.immutable = "") headers;
   }

   /** TIRE packet */
   struct TIREPacket {
       1: required set<TIEHeaderWithLifeTime>
               (python.immutable = "") headers;
   }

   /** neighbor of a node */
   struct NodeNeighborsTIEElement {
       /** level of neighbor */
       1: required common.LevelType level;
       /** Cost to neighbor. Ignore anything larger than
           `infinite_distance` and `invalid_distance` */
       3: optional common.MetricType cost
               = common.default_distance;
       /** can carry description of multiple parallel links in a TIE */
       4: optional set<LinkIDPair>
               (python.immutable = "") link_ids;
       /** total bandwidth to neighbor as sum of all parallel links */
       5: optional common.BandwithInMegaBitsType
               bandwidth = common.default_bandwidth;
   } (python.immutable = "")

   /** Indication flags of the node. */
   struct NodeFlags {
       /** Indicates that node is in overload, do not transit traffic
           through it. */
       1: optional bool overload = common.overload_default;
   } (python.immutable = "")

   /** Description of a node. */
   struct NodeTIEElement {
       /** Level of the node. */
       1: required common.LevelType level;
       /** Node's neighbors. Multiple node TIEs can carry disjoint sets
           of neighbors. */
       2: required map<common.SystemIDType,
               NodeNeighborsTIEElement> neighbors;

       /** Capabilities of the node. */
       3: required NodeCapabilities capabilities;
       /** Flags of the node. */
       4: optional NodeFlags flags;
       /** Optional node name for easier operations. */
       5: optional string name;
       /** PoD to which the node belongs. */
       6: optional common.PodType pod;
       /** optional startup time of the node */
       7: optional common.TimestampInSecsType startup_time;

       /** If any local links are miscabled, this indication is flooded. */
       10: optional set<common.LinkIDType>
               (python.immutable = "") miscabled_links;

       /** ToFs in the same plane. Only carried by ToF. Multiple node
           TIEs can carry disjoint sets of ToFs which can be joined to
           form a single set. Used in complex multi-plane elections. */
       12: optional set<common.SystemIDType> same_plane_tofs;

   } (python.immutable = "")

   /** Attributes of a prefix. */
   struct PrefixAttributes {
       /** Distance of the prefix. */
       2: required common.MetricType metric
               = common.default_distance;
       /** Generic unordered set of route tags, can be redistributed
           to other protocols or used within the context of real time
           analytics. */
       3: optional set<common.RouteTagType>
               (python.immutable = "") tags;
       /** Monotonic clock for mobile addresses. */
       4: optional common.PrefixSequenceType monotonic_clock;
       /** Indicates if the prefix is a node loopback. */
       6: optional bool loopback = false;
       /** Indicates that the prefix is directly attached. */
       7: optional bool directly_attached = true;
       /** link to which the address belongs. */
       10: optional common.LinkIDType from_link;
   } (python.immutable = "")

   /** TIE carrying prefixes */
   struct PrefixTIEElement {
       /** Prefixes with the associated attributes. */
       1: required map<common.IPPrefixType, PrefixAttributes> prefixes;
   } (python.immutable = "")
   /** Generic key value pairs. */
   struct KeyValueTIEElement {
       1: required map keyvalues;
   } (python.immutable = "")

   /** Single element in a TIE. */
   union TIEElement {
       /** Used in case of enum common.TIETypeType.NodeTIEType. */
       1: optional NodeTIEElement node;
       /** Used in case of enum common.TIETypeType.PrefixTIEType. */
       2: optional PrefixTIEElement prefixes;
       /** Positive prefixes (always southbound). */
       3: optional PrefixTIEElement positive_disaggregation_prefixes;
       /** Transitive, negative prefixes (always southbound) */
       5: optional PrefixTIEElement negative_disaggregation_prefixes;
       /** Externally reimported prefixes.
             */
7547         6: optional PrefixTIEElement external_prefixes;
7548         /** Positive external disaggregated prefixes (always southbound). */
7549         7: optional PrefixTIEElement
7550                     positive_external_disaggregation_prefixes;
7551         /** Key-Value store elements. */
7552         9: optional KeyValueTIEElement keyvalues;
7553     } (python.immutable = "")

7555     /** TIE packet */
7556     struct TIEPacket {
7557         1: required TIEHeader header;
7558         2: required TIEElement element;
7559     }

7561     /** Content of a RIFT packet. */
7562     union PacketContent {
7563         1: optional LIEPacket lie;
7564         2: optional TIDEPacket tide;
7565         3: optional TIREPacket tire;
7566         4: optional TIEPacket tie;
7567     }

7569     /** RIFT packet structure. */
7570     struct ProtocolPacket {
7571         1: required PacketHeader header;
7572         2: required PacketContent content;
7573     }
7574  Appendix C.  Constants

7576  C.1.  Configurable Protocol Constants

7578     This section gathers constants that are provided in the schema files
7579     and in the document.

7581     +================+==============+==================================+
7582     |                | Type         | Value                            |
7583     +================+==============+==================================+
7584     | LIE IPv4       | Default      | 224.0.0.120 or all-rift-routers  |
7585     | Multicast      | Value,       | to be assigned in IPv4 Multicast |
7586     | Address        | Configurable | Address Space Registry in Local  |
7587     |                |              | Network Control Block            |
7588     +----------------+--------------+----------------------------------+
7589     | LIE IPv6       | Default      | FF02::A1F7 or all-rift-routers   |
7590     | Multicast      | Value,       | to be assigned in IPv6 Multicast |
7591     | Address        | Configurable | Address Assignments              |
7592     +----------------+--------------+----------------------------------+
7593     | LIE            | Default      | 914                              |
7594     | Destination    | Value,       |                                  |
7595     | Port           | Configurable |                                  |
7596     +----------------+--------------+----------------------------------+
7597     | Level value    | Constant     | 24                               |
7598     | for            |              |                                  |
7599     | TOP_OF_FABRIC  |              |                                  |
7600     | flag           |              |                                  |
7601     +----------------+--------------+----------------------------------+
7602     | Default LIE    | Default      | 3 seconds                        |
7603     | Holdtime       | Value,       |                                  |
7604     |                | Configurable |                                  |
7605     +----------------+--------------+----------------------------------+
7606     | TIE            | Default      | 1 second                         |
7607     | Retransmission | Value        |                                  |
7608     | Interval       |              |                                  |
7609     +----------------+--------------+----------------------------------+
7610     | TIDE           | Default      | 5 seconds                        |
7611     | Generation     | Value,       |                                  |
7612     | Interval       | Configurable |                                  |
7613     +----------------+--------------+----------------------------------+
7614     | MIN_TIEID      | Constant     | TIE Key with minimal values:     |
7615     | signifies      |              | TIEID(originator=0,              |
7616     | start of TIDEs |              | tietype=TIETypeMinValue,         |
7617     |                |              | tie_nr=0, direction=South)       |
7618     +----------------+--------------+----------------------------------+
7619     | MAX_TIEID      | Constant     | TIE Key with maximal values:     |
7620     | signifies end  |              | TIEID(originator=MAX_UINT64,     |
7621     | of TIDEs       |              | tietype=TIETypeMaxValue,         |
7622     |                |              | tie_nr=MAX_UINT64,               |
7623     |                |              | direction=North)                 |
7624     +----------------+--------------+----------------------------------+

7626                          Table 41: all_constants

7628  Authors' Addresses

7630     Tony Przygienda (editor)
7631     Juniper
7632     1137 Innovation Way
7633     Sunnyvale, CA
7634     United States of America

7636     Email: prz@juniper.net

7638     Alankar Sharma
7639     Comcast
7640     1800 Bishops Gate Blvd
7641     Mount Laurel, NJ 08054
7642     United States of America

7644     Email: as3957@gmail.com

7646     Pascal Thubert
7647     Cisco Systems, Inc
7648     Building D
7649     45 Allee des Ormes - BP1200
7650     06254 MOUGINS - Sophia Antipolis
7651     France

7653     Phone: +33 497 23 26 34
7654     Email: pthubert@cisco.com

7656     Bruno Rijsman
7657     Individual

7659     Email: brunorijsman@gmail.com

7661     Dmitry Afanasiev
7662     Yandex

7664     Email: fl0w@yandex-team.ru