RIFT Working Group                                    A. Przygienda, Ed.
Internet-Draft                                                   Juniper
Intended status: Standards Track                               A. Sharma
Expires: February 15, 2020                                       Comcast
                                                              P. Thubert
                                                                   Cisco
                                                              B. Rijsman
                                                              Individual
                                                            D. Afanasiev
                                                                  Yandex
                                                         August 14, 2019

                       RIFT: Routing in Fat Trees
                         draft-ietf-rift-rift-07

Abstract

   This document outlines a specialized, dynamic routing protocol for
   Clos and fat-tree network topologies.  The protocol (1) deals with
   fully automated construction of fat-tree topologies based on
   detection of links, (2) minimizes the amount of routing state held
   at each level, (3) automatically prunes and load balances topology
   flooding exchanges over a sufficient subset of links, (4) supports
   automatic disaggregation of prefixes on link and node failures to
   prevent black-holing and suboptimal routing, (5) allows traffic
   steering and re-routing policies, (6) allows loop-free non-ECMP
   forwarding, (7) automatically re-balances traffic towards the
   spines based on bandwidth available and finally (8) provides
   mechanisms to synchronize a limited key-value data-store that can
   be used after protocol convergence to e.g. bootstrap higher levels
   of functionality on nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 15, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Authors
   2.  Introduction
     2.1.  Requirements Language
   3.  Reference Frame
     3.1.  Terminology
     3.2.  Topology
   4.  Requirement Considerations
   5.  RIFT: Routing in Fat Trees
     5.1.  Overview
       5.1.1.  Properties
       5.1.2.  Generalized Topology View
       5.1.3.  Fallen Leaf Problem
       5.1.4.  Discovering Fallen Leaves
       5.1.5.  Addressing the Fallen Leaves Problem
     5.2.  Specification
       5.2.1.  Transport
       5.2.2.  Link (Neighbor) Discovery (LIE Exchange)
       5.2.3.  Topology Exchange (TIE Exchange)
         5.2.3.1.  Topology Information Elements
         5.2.3.2.  South- and Northbound Representation
         5.2.3.3.  Flooding
         5.2.3.4.  TIE Flooding Scopes
         5.2.3.5.  'Flood Only Node TIEs' Bit
         5.2.3.6.  Initial and Periodic Database Synchronization
         5.2.3.7.  Purging and Roll-Overs
         5.2.3.8.  Southbound Default Route Origination
         5.2.3.9.  Northbound TIE Flooding Reduction
         5.2.3.10. Special Considerations
       5.2.4.  Reachability Computation
         5.2.4.1.  Northbound SPF
         5.2.4.2.  Southbound SPF
         5.2.4.3.  East-West Forwarding Within a non-ToF Level
         5.2.4.4.  East-West Links Within ToF Level
       5.2.5.  Automatic Disaggregation on Link & Node Failures
         5.2.5.1.  Positive, Non-transitive Disaggregation
         5.2.5.2.  Negative, Transitive Disaggregation for Fallen
                   Leafs
       5.2.6.  Attaching Prefixes
       5.2.7.  Optional Zero Touch Provisioning (ZTP)
         5.2.7.1.  Terminology
         5.2.7.2.  Automatic SystemID Selection
         5.2.7.3.  Generic Fabric Example
         5.2.7.4.  Level Determination Procedure
         5.2.7.5.  Resulting Topologies
       5.2.8.  Stability Considerations
     5.3.  Further Mechanisms
       5.3.1.  Overload Bit
       5.3.2.  Optimized Route Computation on Leafs
       5.3.3.  Mobility
         5.3.3.1.  Clock Comparison
         5.3.3.2.  Interaction between Time Stamps and Sequence
                   Counters
         5.3.3.3.  Anycast vs. Unicast
         5.3.3.4.  Overlays and Signaling
       5.3.4.  Key/Value Store
         5.3.4.1.  Southbound
         5.3.4.2.  Northbound
       5.3.5.  Interactions with BFD
       5.3.6.  Fabric Bandwidth Balancing
         5.3.6.1.  Northbound Direction
         5.3.6.2.  Southbound Direction
       5.3.7.  Label Binding
       5.3.8.  Segment Routing Support with RIFT
         5.3.8.1.  Global Segment Identifiers Assignment
         5.3.8.2.  Distribution of Topology Information
       5.3.9.  Leaf to Leaf Procedures
       5.3.10. Address Family and Multi Topology Considerations
       5.3.11. Reachability of Internal Nodes in the Fabric
       5.3.12. One-Hop Healing of Levels with East-West Links
     5.4.  Security
       5.4.1.  Security Model
       5.4.2.  Security Mechanisms
       5.4.3.  Security Envelope
       5.4.4.  Weak Nonces
       5.4.5.  Lifetime
       5.4.6.  Key Management
       5.4.7.  Security Association Changes
   6.  Examples
     6.1.  Normal Operation
     6.2.  Leaf Link Failure
     6.3.  Partitioned Fabric
     6.4.  Northbound Partitioned Router and Optional East-West Links
   7.  Implementation and Operation: Further Details
     7.1.  Considerations for Leaf-Only Implementation
     7.2.  Considerations for Spine Implementation
     7.3.  Adaptations to Other Proposed Data Center Topologies
     7.4.  Originating Non-Default Route Southbound
   8.  Security Considerations
     8.1.  General
     8.2.  ZTP
     8.3.  Lifetime
     8.4.  Packet Number
     8.5.  Outer Fingerprint Attacks
     8.6.  TIE Origin Fingerprint DoS Attacks
     8.7.  Host Implementations
   9.  IANA Considerations
     9.1.  Requested Multicast and Port Numbers
     9.2.  Requested Registries with Suggested Values
       9.2.1.  RIFT/common/AddressFamilyType
         9.2.1.1.  Requested Entries
       9.2.2.  RIFT/common/HierarchyIndications
         9.2.2.1.  Requested Entries
       9.2.3.  RIFT/common/IEEE802_1ASTimeStampType
         9.2.3.1.  Requested Entries
       9.2.4.  RIFT/common/IPAddressType
         9.2.4.1.  Requested Entries
       9.2.5.  RIFT/common/IPPrefixType
         9.2.5.1.  Requested Entries
       9.2.6.  RIFT/common/IPv4PrefixType
         9.2.6.1.  Requested Entries
       9.2.7.  RIFT/common/IPv6PrefixType
         9.2.7.1.  Requested Entries
       9.2.8.  RIFT/common/PrefixSequenceType
         9.2.8.1.  Requested Entries
       9.2.9.  RIFT/common/RouteType
         9.2.9.1.  Requested Entries
       9.2.10. RIFT/common/TIETypeType
         9.2.10.1.  Requested Entries
       9.2.11. RIFT/common/TieDirectionType
         9.2.11.1.  Requested Entries
       9.2.12. RIFT/encoding/Community
         9.2.12.1.  Requested Entries
       9.2.13. RIFT/encoding/KeyValueTIEElement
         9.2.13.1.  Requested Entries
       9.2.14. RIFT/encoding/LIEPacket
         9.2.14.1.  Requested Entries
       9.2.15. RIFT/encoding/LinkCapabilities
         9.2.15.1.  Requested Entries
       9.2.16. RIFT/encoding/LinkIDPair
         9.2.16.1.  Requested Entries
       9.2.17. RIFT/encoding/Neighbor
         9.2.17.1.  Requested Entries
       9.2.18. RIFT/encoding/NodeCapabilities
         9.2.18.1.  Requested Entries
       9.2.19. RIFT/encoding/NodeFlags
         9.2.19.1.  Requested Entries
       9.2.20. RIFT/encoding/NodeNeighborsTIEElement
         9.2.20.1.  Requested Entries
       9.2.21. RIFT/encoding/NodeTIEElement
         9.2.21.1.  Requested Entries
       9.2.22. RIFT/encoding/PacketContent
         9.2.22.1.  Requested Entries
       9.2.23. RIFT/encoding/PacketHeader
         9.2.23.1.  Requested Entries
       9.2.24. RIFT/encoding/PrefixAttributes
         9.2.24.1.  Requested Entries
       9.2.25. RIFT/encoding/PrefixTIEElement
         9.2.25.1.  Requested Entries
       9.2.26. RIFT/encoding/ProtocolPacket
         9.2.26.1.  Requested Entries
       9.2.27. RIFT/encoding/TIDEPacket
         9.2.27.1.  Requested Entries
       9.2.28. RIFT/encoding/TIEElement
         9.2.28.1.  Requested Entries
       9.2.29. RIFT/encoding/TIEHeader
         9.2.29.1.  Requested Entries
       9.2.30. RIFT/encoding/TIEHeaderWithLifeTime
         9.2.30.1.  Requested Entries
       9.2.31. RIFT/encoding/TIEID
         9.2.31.1.  Requested Entries
       9.2.32. RIFT/encoding/TIEPacket
         9.2.32.1.  Requested Entries
       9.2.33. RIFT/encoding/TIREPacket
         9.2.33.1.  Requested Entries
   10. Acknowledgments
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Sequence Number Binary Arithmetic
   Appendix B.  Information Elements Schema
     B.1.  common.thrift
     B.2.  encoding.thrift
   Appendix C.  Finite State Machines and Precise Operational
                Specifications
     C.1.  LIE FSM
     C.2.  ZTP FSM
     C.3.  Flooding Procedures
       C.3.1.  FloodState Structure per Adjacency
       C.3.2.  TIDEs
         C.3.2.1.  TIDE Generation
         C.3.2.2.  TIDE Processing
       C.3.3.  TIREs
         C.3.3.1.  TIRE Generation
         C.3.3.2.  TIRE Processing
       C.3.4.  TIEs Processing on Flood State Adjacency
       C.3.5.  TIEs Processing When LSDB Received Newer Version on
               Other Adjacencies
       C.3.6.  Sending TIEs
   Appendix D.  Constants
     D.1.  Configurable Protocol Constants
   Authors' Addresses

1.  Authors

   This work is a product of a list of individuals, all of whom are to
   be considered major contributors, independent of whether their
   names made it onto the limited boilerplate author list or not.

      Tony Przygienda, Ed. | Alankar Sharma | Pascal Thubert
      Juniper Networks     | Comcast        | Cisco

      Bruno Rijsman        | Ilya Vershkov  | Dmitry Afanasiev
      Individual           | Mellanox       | Yandex

      Don Fedyk            | Alia Atlas     | John Drake
      Individual           | Individual     | Juniper

                         Table 1: RIFT Authors

2.  Introduction

   Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained
   prominence in today's networking, primarily as a result of the
   paradigm shift towards a centralized, data-center-based
   architecture that is poised to deliver a majority of computation
   and storage services in the future.  Today's routing protocols were
   originally geared towards networks with irregular topology and a
   low degree of connectivity but, given that they were the only
   available options, several attempts have been made to apply those
   protocols to Clos fabrics.  Most successfully, BGP [RFC4271]
   [RFC7938] has been extended to this purpose, not so much due to its
   inherent suitability but rather because of the perceived capability
   to easily modify BGP and the immanent difficulties of link-state
   [DIJKSTRA] based protocols in optimizing topology exchange and
   converging quickly in large-scale, densely meshed topologies.  The
   incumbent protocols normally presuppose extensive configuration or
   provisioning during bring-up and re-dimensioning, which is only
   viable for a set of organizations with the corresponding networking
   operation skills and budgets.  For the majority of data center
   consumers, a preferable protocol would be one that auto-configures
   itself and deals with failures and misconfigurations with a minimum
   of human intervention.  Such a solution would allow local IP fabric
   bandwidth to be consumed in a 'standard component' fashion, i.e.
   provisioned much faster and operated at much lower cost, much like
   compute or storage is consumed today.
   In looking at the problem through the lens of data center
   requirements, an optimal approach does not, however, seem to be a
   simple modification of either a link-state (distributed
   computation) or distance-vector (diffused computation) approach,
   but rather a mixture of both, colloquially best described as "link-
   state towards the spine" and "distance vector towards the leafs".
   In other words, "bottom" levels flood their link-state information
   in the "northern" direction, while each node under normal
   conditions generates a default route and floods it in the
   "southern" direction.  This type of protocol naturally allows for
   highly desirable aggregation.  Alas, such aggregation could
   blackhole traffic in cases of misconfiguration or while failures
   are being resolved, or even cause partial network partitioning, and
   this has to be addressed.  The approach RIFT takes is described in
   Section 5.2.5 and is based on automatic, sufficient disaggregation
   of prefixes.

   For the visually oriented reader, Figure 1 presents a first-level,
   simplified view of the resulting information and routes on a RIFT
   fabric.  The top of the fabric holds in its link-state database the
   nodes below it and the routes to them.  In the second row of the
   database table we indicate that partial information about other
   nodes in the same level is available as well.  The details of how
   this is achieved are postponed for the moment.  When we look at the
   "bottom" of the fabric, the leafs, we see that the topology is
   basically empty and they only hold a load-balanced default route to
   the next level.

   The balance of this document details the requirements of a
   dedicated fabric routing protocol, fills in the specification
   details and ultimately includes resulting security considerations.

   .  [A,B,C,D]
   .  [E]
   .            +-----+          +-----+
   .            |  E  |          |  F  |   A/32 @ [C,D]
   .            +-+-+-+          +-+-+-+   B/32 @ [C,D]
   .              | |              | |     C/32 @ C
   .              | |     +--------+ |     D/32 @ D
   .              | |     |          |
   .              | +--------+       |
   .              |       |  |       |
   . [A,B]      +-+---+   |  |   +---+-+  [A,B]
   . [D]        |  C  +---+  +---+  D  |   [C]
   .            +-+-+-+          +-+-+-+
   .0/0 @ [E,F]   | |              | |   0/0 @ [E,F]
   .A/32 @ A      | |     +--------+ |   A/32 @ A
   .B/32 @ B      | |     |          |   B/32 @ B
   .              | +--------+       |
   .              |       |  |       |
   .            +-+---+   |  |   +---+-+
   .            |  A  +---+  +---+  B  |
   .0/0 @ [C,D] +-----+          +-----+   0/0 @ [C,D]

              Figure 1: RIFT information distribution

2.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC 2119
   [RFC2119].

3.  Reference Frame

3.1.  Terminology

   This section presents the terminology used in this document.  It is
   assumed that the reader is thoroughly familiar with the terms and
   concepts used in OSPF [RFC2328] and IS-IS [ISO10589-Second-
   Edition], [ISO10589], as well as the corresponding graph-
   theoretical concepts of shortest path first (SPF) [DIJKSTRA]
   computation and directed acyclic graphs (DAG).

   Crossbar:  Physical arrangement of ports in a switching matrix
      without implying any further scheduling or buffering
      disciplines.

   Clos/Fat Tree:  This document uses the terms Clos and Fat Tree
      interchangeably, whereby both always refer to a folded spine-
      and-leaf topology with possibly multiple PoDs and one or
      multiple ToF planes.  Several modifications such as leaf-2-leaf
      shortcuts and multiple-level shortcuts are possible and
      described further in the document.

   Folded Spine-and-Leaf:  In case Clos fabric input and output stages
      are analogous, the fabric can be "folded" to build a
      "superspine" or top, which we will call Top of Fabric (ToF) in
      this document.
   Level:  Clos and Fat Tree networks are topologically partially
      ordered graphs, and 'level' denotes the set of nodes at the same
      height in such a network, where the bottom level (leaf) is the
      level with the lowest value.  A node has links to nodes one
      level down and/or one level up.  Under some circumstances, a
      node may have links to nodes at the same level.  As a footnote:
      Clos terminology often uses the concept of "stage", but due to
      the folded nature of the Fat Tree we do not use it, to prevent
      misunderstandings.

   Superspine/Aggregation or Spine/Edge Levels:  Traditional names in
      a 5-stage folded Clos for Level 2, 1 and 0 respectively.  Level
      0 is often called leaf as well.  We normalize this language to
      talk about leafs, spines and top-of-fabric (ToF).

   Point of Delivery (PoD):  A self-contained vertical slice or subset
      of a Clos or Fat Tree network, normally containing only level 0
      and level 1 nodes.  A node in a PoD communicates with nodes in
      other PoDs via the Top-of-Fabric.  We number PoDs to distinguish
      them and use PoD #0 to denote the "undefined" PoD.

   Top of PoD (ToP):  The set of nodes that provide intra-PoD
      communication and have northbound adjacencies outside of the
      PoD, i.e. are at the "top" of the PoD.

   Top of Fabric (ToF):  The set of nodes that provide inter-PoD
      communication and have no northbound adjacencies, i.e. are at
      the "very top" of the fabric.  ToF nodes do not belong to any
      PoD and are assigned the "undefined" PoD value to indicate the
      equivalent of "any" PoD.

   Spine:  Any nodes north of leafs and south of top-of-fabric nodes.
      Multiple layers of spines in a PoD are possible.

   Leaf:  A node without southbound adjacencies.  Its level is 0
      (except in cases where it derives its level via ZTP while
      running without LEAF_ONLY, which will be explained in
      Section 5.2.7).
   Top-of-fabric Plane or Partition:  In large fabrics, top-of-fabric
      switches may not have enough ports to aggregate all switches
      south of them, and with that the ToF is 'split' into multiple
      independent planes.  The introduction and Section 5.1.2 explain
      the concept in more detail.  A plane is a subset of ToF nodes
      that see each other through south reflection or E-W links.

   Radix:  The radix of a switch is the number of switching ports it
      provides.  It is sometimes called fanout as well.

   North Radix:  Ports cabled northbound to higher level nodes.

   South Radix:  Ports cabled southbound to lower level nodes.

   South/Southbound and North/Northbound (Direction):  When describing
      protocol elements and procedures, we will use the directionality
      of the compass in different situations.  I.e., 'south' or
      'southbound' means moving towards the bottom of the Clos or Fat
      Tree network, and 'north' and 'northbound' mean moving towards
      the top of the Clos or Fat Tree network.

   Northbound Link:  A link to a node one level up or, in other words,
      one level further north.

   Southbound Link:  A link to a node one level down or, in other
      words, one level further south.

   East-West Link:  A link between two nodes at the same level.  East-
      West links are normally not part of Clos or "fat-tree"
      topologies.

   Leaf shortcuts (L2L):  East-West links at leaf level will need to
      be differentiated from East-West links at other levels.

   Routing on the host (RotH):  Modern data center architecture
      variant where servers/leafs are multi-homed and consequently
      participate in routing.

   Southbound representation:  Subset of topology information sent
      towards a lower level.

   South Reflection:  Often abbreviated just as "reflection", it
      defines a mechanism where South Node TIEs are "reflected" back
      up north to allow nodes in the same level without E-W links to
      "see" each other.
   TIE:  This is an acronym for a "Topology Information Element".
      TIEs are exchanged between RIFT nodes to describe parts of a
      network such as links and address prefixes, in a fashion similar
      to IS-IS LSPs or OSPF LSAs.  We will talk about N-TIEs when
      referring to TIEs in the northbound representation and S-TIEs
      for the southbound equivalent.

   Node TIE:  This is an acronym for a "Node Topology Information
      Element", which contains all adjacencies the node discovered and
      information about the node itself.

   Prefix TIE:  This is an acronym for a "Prefix Topology Information
      Element", and it contains all prefixes directly attached to this
      node in the case of an N-TIE, and in the case of an S-TIE the
      necessary default route the node passes southbound.

   Key Value TIE:  An S-TIE that is carrying a set of key value pairs
      [DYNAMO].  It can be used to distribute information in the
      southbound direction within the protocol.

   TIDE:  Topology Information Description Element, equivalent to a
      CSNP in IS-IS.

   TIRE:  Topology Information Request Element, equivalent to a PSNP
      in IS-IS.  It can both confirm received and request missing
      TIEs.

   De-aggregation/Disaggregation:  Process in which a node decides to
      advertise certain prefixes it received in N-TIEs to prevent
      black-holing and suboptimal routing upon link failures.

   LIE:  This is an acronym for a "Link Information Element", largely
      equivalent to HELLOs in IGPs and exchanged over all the links
      between systems running RIFT to form adjacencies.

   Flood Repeater (FR):  A node can designate one or more northbound
      neighbor nodes to be flood repeaters.  The flood repeaters are
      responsible for flooding northbound TIEs further north.  They
      are similar to MPRs in OLSR.  The document sometimes calls them
      flood leaders as well.

   Bandwidth Adjusted Distance (BAD):  This is an acronym for
      Bandwidth Adjusted Distance.
      Each RIFT node calculates the amount of northbound bandwidth
      available towards a node compared to other nodes at the same
      level and modifies the default route distance accordingly to
      allow the lower level to adjust its load balancing towards
      spines.

   Overloaded:  Applies to a node advertising the `overload` attribute
      as set.  The semantics closely follow the meaning of the same
      attribute in [ISO10589-Second-Edition].

   Interface:  A layer 3 entity over which RIFT control packets are
      exchanged.

   Adjacency:  RIFT tries to form a unique adjacency over an interface
      and exchange local configuration and necessary ZTP information.

   Neighbor:  Once a three-way adjacency has been formed, a
      neighborship relationship contains the neighbor's properties.
      Multiple adjacencies can be formed to a neighbor via parallel
      interfaces, but such adjacencies do NOT share a neighbor
      structure.  Saying "neighbor" is thus equivalent to saying "a
      three-way adjacency".

   Cost:  The term signifies the weighted distance between two
      neighbors.

   Distance:  Sum of costs (bounded by infinite distance) between two
      nodes.

   Metric:  Without going deeper into the proper differentiation, a
      metric is equivalent to distance.

3.2.  Topology

   .                +--------+           +--------+          ^ N
   .                |ToF   21|           |ToF   22|          |
   .Level 2         ++-+--+-++           ++-+--+-++        <-*-> E/W
   .                 | |  | |             | |  | |           |
   .           P111/2| |P121              | |  | |         S v
   .                 ^ ^  ^ ^             | |  | |
   .                 | |  | |             | |  | |
   .  +--------------+ |  +-----------+   | |  | +--------------+
   .  |                |              |   | |  |                |
   . South +------------------------------+ |  |                ^
   .  |    |           |              |     |  |                | All TIEs
   .  0/0  0/0        0/0             +----------------------------+
   .  v    v           v              |     |  |                |  |
   .  |    |   +-+    +<-0/0----------+     |  |                |  |
   .  |    |   | |    |                     |  |                |  |
   .+-+----++ optional +-+----++    ++----+-+  ++-----++
   .|       | E/W link |       |    |       |  |       |
   .|Spin111+----------+Spin112|    |Spin121|  |Spin122|
   .+-+---+-+          ++----+-+    +-+---+-+  ++---+--+
   .  |   |    South    |    |        |   |    |    |
   .  |   +---0/0--->---+    | 0/0    |   +-------------+  |
   .  0/0 |                  |  |     |        |        |  |
   .  |   +---<-0/0-----+    |  v     |  +-----------+  |  |
   .  v                 |    |        |  |           |  |  |
   .+-+---+-+         +-+---+-+     +-+--+--+      +-+--+--+
   .|       |  (L2L)  |       |     |       |      |       | Level 0
   .|Leaf111~~~~~~~~~~~Leaf112|     |Leaf121|      |Leaf122|
   .+-+-----+         +-+---+-+     +--+--+-+      +-+-----+
   .  +                 +    \        /   +          +
   .  Prefix111         Prefix112 \  /  Prefix121    Prefix122
   .                          multi-homed
   .                            Prefix
   .+---------- Pod 1 --------+     +---------- Pod 2 --------+

            Figure 2: A three level spine-and-leaf topology

   .+--------+  +--------+  +--------+  +--------+
   .|ToF   A1|  |ToF   B1|  |ToF   B2|  |ToF   A2|
   .++-+-----+  ++-+-----+  ++-+-----+  ++-+-----+
   . | |         | |         | |         | |
   . | |         | |         | |         | +---------------+
   . | |         | |         | |         |                 |
   . | |         | |         | +-------------------------+ |
   . | |         | |         |                           | |
   . | +-----------------------+                         | |
   . | |         | |         | |                         | |
   . | |         | +---------+ |      +---------+        | |
   . | |         |             |      |         |        | |
   . | +--------------------------------+       |        | |
   . | |         |             |        |       |        | |
   .++-+-----+  ++-+-----+  +--+-+---+  +----+-+-+
   .|Spine111|  |Spine112|  |Spine121|  |Spine122|
   .+-+---+--+  ++----+--+  +-+---+--+  ++---+---+
   . | |         | |         | |         | |
   . | +--------+ | |        | +------+  | |
   . | |        | | |        | |      |  | |
   . | | -------+ | |        | | +----+  | |
   . | |          | |        | | |       | |
   .+-+---+-+  +--+--+-+  +-+---+-+  +---+-+-+
   .|Leaf111|  |Leaf112|  |Leaf121|  |Leaf122|
   .+-------+  +-------+  +-------+  +-------+

               Figure 3: Topology with multiple planes

   We will use the topology in Figure 2 (commonly called a fat tree/
   network in modern IP fabric considerations [VAHDAT08], as a homonym
   to the original definition of the term [FATTREE]) in all further
   considerations.  This figure depicts a generic "single plane fat-
   tree", and the concepts explained using three levels apply by
   induction to further levels and higher degrees of connectivity.
615 Further, this document will also deal with designs that provide only 616 sparser connectivity and "partitioned spines" as shown in Figure 3 617 and explained further in Section 5.1.2. 619 4. Requirement Considerations 621 [RFC7938] gives the original set of requirements, augmented here based 622 upon recent experience in the operation of fat-tree networks. 624 REQ1: The control protocol should discover the physical links 625 automatically and be able to detect cabling that violates 626 fat-tree topology constraints. It must react accordingly to 627 such mis-cabling attempts, at a minimum preventing 628 adjacencies between nodes from being formed and traffic from 629 being forwarded on those mis-cabled links. E.g., connecting 630 a leaf to a spine at level 2 should be detected and ideally 631 prevented. 633 REQ2: A node without any configuration besides default values 634 should come up at the correct level in any PoD it is 635 introduced into. Optionally, it must be possible to 636 configure nodes to restrict their participation to the 637 PoD(s) targeted at any level. 639 REQ3: Optionally, the protocol should allow provisioning of IP 640 fabrics where the individual switches carry no configuration 641 information and all derive their level from a "seed". 642 Observe that this requirement may collide with the desire to 643 detect cabling misconfiguration, so that only one of 644 the two requirements can be fully met in a chosen configuration 645 mode. 647 REQ4: The solution should allow for minimum-size routing 648 information bases and forwarding tables at the leaf level for 649 speed, cost and simplicity reasons. Keeping excessive 650 amounts of information away from leaf nodes simplifies 651 operation, lowers the cost of the underlay and allows 652 scaling and introducing proper multi-homing down to the server 653 level. The routing solution should allow for easy 654 instantiation of multiple routing planes.
Coupled with 655 mobility defined in Paragraph 17 this should allow for 656 "light-weight" overlays on an IP fabric with e.g. native 657 IPv6 mobility support. 659 REQ5: A very high degree of ECMP must be supported. Maximum ECMP is 660 currently understood as the most efficient routing approach 661 to maximize the throughput of switching fabrics 662 [MAKSIC2013]. 664 REQ6: Non-equal-cost anycast must be supported to allow for easy 665 and robust multi-homing of services without regressing to 666 careful balancing of link costs. 668 REQ7: Traffic engineering should be allowed by modification of 669 prefixes and/or their next-hops. 671 REQ8: The solution should allow for access to link states of the 672 whole topology to enable efficient support for modern 673 control architectures like SPRING [RFC7855] or PCE 674 [RFC4655]. 676 REQ9: The solution should easily accommodate opaque data to be 677 carried throughout the topology to subsets of nodes. This 678 can be used for many purposes, one of them being a key-value 679 store that allows bootstrapping of nodes right at the 680 time of topology discovery. Another use is distributing MAC 681 to L3 address bindings from the leaves northbound in case of 682 e.g. DHCP. 684 REQ10: Nodes should be taken out of and introduced into production 685 with minimum wait-times and a minimum of "shaking" of the 686 network, i.e. the radius of propagation (often called "blast 687 radius") of changed information should be as small as 688 feasible. 690 REQ11: The protocol should allow for maximum aggregation of carried 691 routing information while at the same time automatically 692 de-aggregating the prefixes to prevent black-holing in case of 693 failures. The de-aggregation should support the maximum 694 possible ECMP/N-ECMP remaining after failure.
696 REQ12: Reducing the scope of communication needed throughout the 697 network on link and node failure, as well as reducing 698 advertisements of repeating or idiomatic information in 699 stable state is highly desirable since it leads to better 700 stability and faster convergence behavior. 702 REQ13: Under normal, fully converged conditions, once a packet is 703 forwarded along a link in a "southbound" direction, it must 704 not take any further "northbound" links (Valley Free 705 Routing). Taking a path through the spine in cases where a 706 shorter path is available is highly undesirable (Bow Tying). 708 REQ14: Parallel links between the same set of nodes must be 709 distinguishable for SPF, failure and traffic engineering 710 purposes. 712 REQ15: The protocol must support interfaces sharing the same 713 address. Specifically, it must operate in the presence of 714 unnumbered links (even parallel ones) and/or links of a 715 single node being configured with the same addresses. 717 REQ16: It would be desirable to achieve fast re-balancing of flows 718 when links, especially towards the spines, are lost or 719 provisioned, without regressing to per-flow traffic 720 engineering, which introduces a significant amount of 721 complexity while possibly not being reactive enough to 722 account for short-lived flows. 724 REQ17: The control plane should be able to unambiguously determine 725 the current point of attachment (which port on which leaf 726 node) of a prefix, even in the context of fast mobility, e.g., 727 when the prefix is a host address on a wireless node that 1) 728 may associate to any of multiple access points (APs) that 729 are attached to different ports on the same leaf node or to 730 different leaf nodes, and 2) may move and reassociate 731 several times to a different access point within a sub-second 732 period.
734 REQ18: The protocol must provide security mechanisms that allow the 735 operator to restrict nodes, especially leaf nodes without 736 proper credentials, from forming a three-way adjacency and 737 participating in routing. 739 The following list represents requirements that are still under discussion: 741 PEND1: Supporting anything but point-to-point links is not 742 necessary. 744 Finally, the following are the non-requirements: 746 NONREQ1: Broadcast media support is unnecessary. However, 747 miscabling leading to multiple nodes on a broadcast 748 segment must be operationally easily recognizable and 749 detectable while not taxing the protocol excessively. 751 NONREQ2: Purging link state elements is unnecessary given the 752 fragility and complexity of purging and today's large memory size on 753 even modest switches and routers. 755 NONREQ3: Special support for layer 3 multi-hop adjacencies is not 756 part of the protocol specification. Such support can be 757 easily provided by using tunneling technologies, the same 758 way IGPs solve the problem today. 760 5. RIFT: Routing in Fat Trees 762 Derived from the above requirements, we present a detailed outline of 763 a protocol optimized for Routing in Fat Trees (RIFT) that in the most 764 abstract terms has many properties of a modified link-state protocol 765 [RFC2328][ISO10589-Second-Edition] when "pointing north" and of a distance 766 vector [RFC4271] protocol when "pointing south". While this is an 767 unusual combination, it quite naturally exhibits the desirable 768 properties we seek. 770 5.1. Overview 772 5.1.1. Properties 774 The most singular property of RIFT is that it floods flat link-state 775 information northbound only so that each level obtains the full 776 topology of the levels south of it. That information is never flooded 777 East-West (we'll talk about exceptions later) or back South again.
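The direction constraints above can be condensed into a small decision function. This is an illustrative sketch only, not the normative flooding-scope rules of this specification; south reflection, East-West exceptions and flood-repeater election are deliberately ignored:

```python
# Hedged sketch of the flooding-direction constraints: northbound
# TIEs keep travelling north and are never sent East-West or back
# South; southbound TIEs travel a single hop south only.

def should_flood(tie_direction: str, received_from: str,
                 neighbor_direction: str) -> bool:
    """tie_direction: 'N' (northbound TIE) or 'S' (southbound TIE).
    received_from: 'north', 'south' or 'self' (self-originated).
    neighbor_direction: 'north', 'south' or 'east-west'."""
    if tie_direction == 'N':
        # Northbound TIEs propagate north only, never back the way
        # they came.
        return neighbor_direction == 'north' and received_from != 'north'
    # Southbound TIEs are sent one hop south only by their
    # originator; lower levels re-originate (summarize) rather than
    # re-flood them.
    return neighbor_direction == 'south' and received_from == 'self'
```

Running this over all neighbor directions reproduces the anisotropy described in the text: information moves north unmodified and south only as locally re-originated summaries.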
778 In the southbound direction the protocol operates like a "fully 779 summarizing, unidirectional" path vector protocol, or rather a 780 distance vector with implicit split horizon, whereby the information 781 propagates one hop south and is 're-advertised' by nodes at the next 782 lower level, normally just as the default route. However, RIFT uses 783 flooding in the southern direction as well to avoid the necessity to 784 build an update per adjacency. We omit describing the East-West 785 direction for the moment. 787 Those information flow constraints create not only an anisotropic 788 protocol (i.e. the information is not distributed "evenly" or 789 "clumped" but summarized along the N-S gradient) but also a "smooth" 790 information propagation where nodes do not receive the same 791 information from multiple directions at the same time. Normally, 792 accepting the same reachability on any link without understanding its 793 topological significance forces tie-breaking on some kind of distance 794 metric and, in hop-by-hop forwarding substrates, ultimately leads to 795 utilization of variants of shortest paths only. RIFT under normal 796 conditions does not need to reconcile the same reachability information 797 from multiple directions, and its computation principles (the south 798 forwarding direction is always preferred) lead to valley-free 799 forwarding behavior. And since valley-free routing is loop-free, it 800 can use all feasible paths, another highly desirable property if 801 available bandwidth should be utilized to the maximum extent 802 possible. 804 To account for the "northern" and the "southern" information split, 805 the link state database is accordingly partitioned into "north 806 representation" and "south representation" TIEs. In the simplest terms, 807 the N-TIEs contain a link state topology description of the lower levels 808 and the S-TIEs simply carry default routes of the level above.
This 809 oversimplified view will be refined gradually in the following sections 810 while introducing protocol procedures aimed at fulfilling the described 811 requirements. 813 5.1.2. Generalized Topology View 815 This section will shed some light on the topologies addressed by RIFT, 816 including multi-plane fabrics, and their related implications. 817 Readers that are only interested in single plane designs, i.e. all 818 top-of-fabric nodes being topologically equal and initially connected 819 to all the switches at the level below them, can skip this section and 820 the resulting Section 5.2.5.2 as well. 822 It is quite difficult to visualize multi-plane designs, which are 823 effectively multi-dimensional switching matrices. To cope with that, 824 we will introduce a methodology allowing us to depict the 825 connectivity in a two-dimensional plane. Further, we will leverage 826 the fact that we are dealing basically with crossbar fabrics stacked 827 on top of each other where ports align "on top of each other" in a 828 regular fashion. 830 As a word of caution to the reader, it should be 831 observed at this point that the language used to describe Clos variations, 832 especially in multi-plane designs, varies widely between sources. 833 This description follows the terminology introduced in Section 3.1 and it is 834 paramount to keep it present to follow the rest of this section 835 correctly. 837 The typical topology for which RIFT is defined is built of a number P 838 of PoDs, connected together by a number S of ToF nodes. A PoD node 839 has a number of ports called Radix, with half of them (K=Radix/2) 840 used to connect host devices from the south, and half to connect to 841 interleaved PoD Top-Level switches to the north. The ratio K can be 842 chosen differently without loss of generality when port speeds differ 843 or the fabric is oversubscribed, but K=Radix/2 allows for a more readable 844 representation whereby there are as many ports facing north as south 845 on any intermediate node.
We represent a node hence in a schematic 846 fashion with ports "sticking out" to its north and south rather than 847 by the usual real-world front faceplate designs of the day. 849 Figure 4 provides a view of a leaf node as seen from the north, i.e. 850 showing ports that connect northbound and for lack of a better 851 symbol, we have chosen to use the "HH" symbol as ASCII visualisation 852 of a RJ45 jack. In that example, K_LEAF is chosen to be 6 ports. 853 Observe that the number of PoDs is not related to Radix unless the 854 ToF Nodes are constrained to be the same as the PoD nodes in a 855 particular deployment. 857 Top view 858 +----+ 859 | | 860 | HH | e.g., Radix = 12, K_LEAF = 6 861 | | 862 | HH | 863 | | ------------------------- 864 | HH ------- Physical Port (Ethernet) ----+ 865 | | ------------------------- | 866 | HH | | 867 | | | 868 | HH | | 869 | | | 870 | HH | | 871 | | | 872 +----+ | 874 || || || || || || || 875 +----+ +------------------------------------------------+ 876 | | | | 877 +----+ +------------------------------------------------+ 878 || || || || || || || 879 Side views 881 Figure 4: A Leaf Node, K_LEAF=6 883 The Radix of a node on top of a PoD may be different than that of the 884 leaf node, though more often than not a same type of node is used for 885 both, effectively forming a square (K*K). In the general case, we 886 could have switches with K_TOP southern ports on nodes at the top of 887 the PoD that is not necessarily the same as K_LEAF; for instance, in 888 the representations below, we pick a K_LEAF of 6 and a K_TOP of 8. 889 In order to form a crossbar, we need K_TOP Leaf Nodes as illustrated 890 in Figure 5. 
892 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 893 | | | | | | | | | | | | | | | | 894 | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | 895 | | | | | | | | | | | | | | | | 896 | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | 897 | | | | | | | | | | | | | | | | 898 | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | 899 | | | | | | | | | | | | | | | | 900 | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | 901 | | | | | | | | | | | | | | | | 902 | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | 903 | | | | | | | | | | | | | | | | 904 | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | 905 | | | | | | | | | | | | | | | | 906 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 908 Figure 5: Southern View of a PoD, K_TOP=8 910 The K_TOP Leaf Nodes are fully interconnected with the K_LEAF PoD-top 911 nodes, providing a connectivity that can be represented as a crossbar 912 as seen from the north and illustrated in Figure 6. The result is 913 that, in the absence of a breakage, a packet entering the PoD from 914 North on any port can be routed to any port on the south of the PoD 915 and vice versa. 
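The full interconnect just described can be checked mechanically. The following is a minimal sketch (illustrative names only, not part of the specification) that builds the K_TOP x K_LEAF crossbar of a PoD and verifies that, absent breakage, any northern port reaches any southern port:

```python
# Hedged sketch of the PoD crossbar property: K_TOP leaf nodes, each
# connected to every one of the K_LEAF PoD-top nodes, yield full
# north-south reachability within the PoD.

K_TOP, K_LEAF = 8, 6  # values used in the figures of this section

# Each of the K_TOP leaves links to every one of the K_LEAF top nodes.
links = {(leaf, top) for leaf in range(K_TOP) for top in range(K_LEAF)}

def reachable(north_top: int, south_leaf: int) -> bool:
    # A packet entering the PoD at top node `north_top` reaches leaf
    # `south_leaf` iff the corresponding crossbar link exists.
    return (south_leaf, north_top) in links

# In the absence of breakage every (top, leaf) pair is connected.
assert all(reachable(t, l) for t in range(K_LEAF) for l in range(K_TOP))
```

The link count, K_TOP * K_LEAF = 48 here, is exactly the K_POD port count of the abstracted "bigger node" mentioned later in this section.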
917 E<-*->W 919 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 920 | | | | | | | | | | | | | | | | 921 +----------------------------------------------------------------+ 922 | HH HH HH HH HH HH HH HH | 923 +----------------------------------------------------------------+ 924 +----------------------------------------------------------------+ 925 | HH HH HH HH HH HH HH HH | 926 +----------------------------------------------------------------+ 927 +----------------------------------------------------------------+ 928 | HH HH HH HH HH HH HH HH | 929 +----------------------------------------------------------------+ 930 +----------------------------------------------------------------+ 931 | HH HH HH HH HH HH HH HH | 932 +----------------------------------------------------------------+ 933 +----------------------------------------------------------------+ 934 | HH HH HH HH HH HH HH HH |<-+ 935 +----------------------------------------------------------------+ | 936 +----------------------------------------------------------------+ | 937 | HH HH HH HH HH HH HH HH | | 938 +----------------------------------------------------------------+ | 939 | | | | | | | | | | | | | | | | | 940 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 941 ^ | 942 | | 943 | ---------- --------------------- | 944 +----- Leaf Node PoD top Node (Spine) --+ 945 ---------- --------------------- 947 Figure 6: Northern View of a PoD's Spines, K_TOP=8 949 Side views of this PoD is illustrated in Figure 7 and Figure 8. 
951 Connecting to Spine 953 || || || || || || || || 954 +----------------------------------------------------------------+ N 955 | PoD top Node seen sideways | ^ 956 +----------------------------------------------------------------+ | 957 || || || || || || || || * 958 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 959 | | | | | | | | | | | | | | | | v 960 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ S 961 || || || || || || || || 963 Connecting to Client nodes 965 Figure 7: Side View of a PoD, K_TOP=8, K_LEAF=6 967 Connecting to Spine 969 || || || || || || 970 +----+ +----+ +----+ +----+ +----+ +----+ N 971 | | | | | | | | | | | PoD top Nodes ^ 972 +----+ +----+ +----+ +----+ +----+ +----+ | 973 || || || || || || * 974 +------------------------------------------------+ | 975 | Leaf seen sideways | v 976 +------------------------------------------------+ S 977 || || || || || || 979 Connecting to Client nodes 981 Figure 8: Other side View of a PoD, K_TOP=8, K_LEAF=6, 90o turn in 982 E-W Plane 984 Note that a resulting PoD can be abstracted as a bigger node with a 985 number of ports K_POD = K_TOP * K_LEAF, and the design can recurse. 987 It is critical at this juncture that the concept and the picture of 988 those "crossed crossbars" is clear before progressing further, 989 otherwise the following considerations will be difficult to comprehend. 991 Further, the PoDs are interconnected with one another through a Top-of-Fabric 992 at the very top or the north edge of the fabric. The 993 resulting ToF is NOT partitioned if and only if (IFF) every PoD top 994 level node (spine) is connected to every ToF Node. This is also 995 referred to as a single plane configuration. In order to reach a 996 1::1 connectivity ratio between the ToF and the Leaves, it follows 997 that there are K_TOP ToF nodes, because each port of a ToP node 998 connects to a different ToF node, and K_LEAF ToP nodes for the same 999 reason.
Consequently, it takes (P * K_LEAF) ports on a ToF node to 1000 connect to each of the K_LEAF ToP nodes of the P PoDs, as illustrated 1001 in Figure 9. 1003 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] <-----+ 1004 | | | | | | | | | 1005 [=================================] | ----------- 1006 | | | | | | | | +----- Top-of-Fabric 1007 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] +----- Node -------+ 1008 | ----------- | 1009 | v 1010 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ <-----+ +-+ 1011 | | | | | | | | | | | | | | | | | | 1012 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | 1013 [ |H| |H| |H| |H| |H| |H| |H| |H| ] ------------------------- | | 1014 [ |H| |H| |H| |H| |H| |H| |H| |H<--- Physical Port (Ethernet) | | 1015 [ |H| |H| |H| |H| |H| |H| |H| |H| ] ------------------------- | | 1016 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | 1017 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | 1018 | | | | | | | | | | | | | | | | | | 1019 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | 1020 [ |H| |H| |H| |H| |H| |H| |H| |H| ] -------------- | | 1021 [ |H| |H| |H| |H| |H| |H| |H| |H| ] <--- PoD top level | | 1022 [ |H| |H| |H| |H| |H| |H| |H| |H| ] node (Spine) ---+ | | 1023 [ |H| |H| |H| |H| |H| |H| |H| |H| ] -------------- | | | 1024 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | | 1025 | | | | | | | | | | | | | | | | -+ +- +-+ v | | 1026 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | --| |--[ ]--| | 1027 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | ----- | --| |--[ ]--| | 1028 [ |H| |H| |H| |H| |H| |H| |H| |H| ] +--- PoD ---+ --| |--[ ]--| | 1029 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | ----- | --| |--[ ]--| | 1030 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | --| |--[ ]--| | 1031 [ |H| |H| |H| |H| |H| |H| |H| |H| ] | | --| |--[ ]--| | 1032 | | | | | | | | | | | | | | | | -+ +- +-+ | | 1033 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ 1035 Figure 9: Fabric Spines and TOFs in Single Plane Design, 3 PoDs 1037 The top view can be collapsed into a third dimension where the hidden 1038 depth index is representing the PoD number. 
So we can show one PoD 1039 as a class of PoDs and hence save one dimension in our 1040 representation. The Spine Node expands in the depth and the vertical 1041 dimensions whereas the PoD top level Nodes are constrained in 1042 horizontal dimension. A port in the 2-D representation represents 1043 effectively the class of all the ports at the same position in all 1044 the PoDs that are projected in its position along the depth axis. 1045 This is shown in Figure 10. 1047 / / / / / / / / / / / / / / / / 1048 / / / / / / / / / / / / / / / / 1049 / / / / / / / / / / / / / / / / 1050 / / / / / / / / / / / / / / / / ] 1051 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ ]] 1052 | | | | | | | | | | | | | | | | ] --------------------------- 1053 [ |H| |H| |H| |H| |H| |H| |H| |H| ] <-- PoD top level node (Spine) 1054 [ |H| |H| |H| |H| |H| |H| |H| |H| ] --------------------------- 1055 [ |H| |H| |H| |H| |H| |H| |H| |H| ]]]] 1056 [ |H| |H| |H| |H| |H| |H| |H| |H| ]]] ^^ 1057 [ |H| |H| |H| |H| |H| |H| |H| |H| ]] // PoDs 1058 [ |H| |H| |H| |H| |H| |H| |H| |H| ] // (in depth) 1059 | |/| |/| |/| |/| |/| |/| |/| |/ // 1060 +-+ +-+ +-+/+-+/+-+ +-+ +-+ +-+ // 1061 ^ 1062 | ---------------- 1063 +----- Top-of-Fabric Node 1064 ---------------- 1066 Figure 10: Collapsed Northern View of a Fabric for Any Number of PoDs 1068 This type of deployment introduces a "single plane limit" where the 1069 bound is the available radix of the ToF nodes, which limits (P * 1070 K_LEAF). Nevertheless, a distinct advantage of a connected or 1071 unpartitioned Top-of-Fabric is that all failures can be resolved by 1072 simple, non-transitive, positive disaggregation described in 1073 Section 5.2.5.1 that propagates only within one level of the fabric. 1074 In other words unpartitoned ToF nodes can always reach nodes below or 1075 withdraw the routes from PoDs they cannot reach unambiguously. 
To be 1076 more precise, this holds for all failures which still allow all the ToF nodes to see 1077 each other via south reflection as explained in Section 5.2.5. 1079 In order to scale beyond the "single plane limit", the Top-of-Fabric 1080 can be partitioned into a number N of identically wired planes, N being 1081 an integer divisor of K_LEAF. The 1::1 ratio and the desired 1082 symmetry are still served, this time with (K_TOP * N) ToF nodes, each 1083 of (P * K_LEAF / N) ports. N=1 represents a non-partitioned Spine 1084 and N=K_LEAF is a maximally partitioned Spine. Further, if R is any 1085 divisor of K_LEAF, then (N=K_LEAF/R) is a feasible number of planes 1086 and R a redundancy factor. It proves convenient for deployments to 1087 use a radix for the leaf nodes that is a power of 2 so they can pick 1088 a number of planes that is a lower power of 2. The example in 1089 Figure 11 splits the Spine in 2 planes with a redundancy factor R=3, 1090 meaning that there are 3 non-intersecting paths between any leaf node 1091 and any ToF node. A ToF node must have in this case at least 3*P 1092 ports, and be directly connected to 3 of the 6 PoD-ToP nodes (spines) 1093 in each PoD. 1095 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 1096 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1097 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | 1098 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1099 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1100 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | 1101 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1102 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1103 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | 1104 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1105 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 1107 Plane 1 1108 ----------- . ------------ . ------------ . ------------ .
-------- 1109 Plane 2 1111 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 1112 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1113 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | 1114 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1115 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1116 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | 1117 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1118 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1119 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | 1120 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1121 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ 1122 ^ 1123 | 1124 | ---------------- 1125 +----- Top-of-Fabric node 1126 "across" depth 1127 ---------------- 1129 Figure 11: Northern View of a Multi-Plane ToF Level, K_LEAF=6, N=2 1131 At the extreme end of the spectrum, it is even possible to fully 1132 partition the spine with N = K_LEAF and R=1, while maintaining 1133 connectivity between each leaf node and each Top-of-Fabric node. In 1134 that case the ToF node connects to a single Port per PoD, so it 1135 appears as a single port in the projected view represented in 1136 Figure 12 and the number of ports required on the Spine Node is more 1137 or equal to P, the number of PoDs. 1139 Plane 1 1140 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ -+ 1141 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1142 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | 1143 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1144 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 1145 ----------- . ------------ . ------------ . ------------ . -------- | 1146 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 1147 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1148 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | 1149 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1150 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 1151 ----------- . ------------ . ------------ . 
------------ . -------- | 1152 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 1153 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1154 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | 1155 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1156 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 1157 ----------- . ------------ . ------------ . ------------ . -------- +<-+ 1158 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | | 1159 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1160 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | | 1161 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1162 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | | 1163 ----------- . ------------ . ------------ . ------------ . -------- | | 1164 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | | 1165 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1166 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | | 1167 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1168 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | | 1169 ----------- . ------------ . ------------ . ------------ . -------- | | 1170 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | | 1171 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1172 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | | 1173 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1174 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ -+ | 1175 Plane 6 ^ | 1176 | | 1177 | ---------------- -------------- | 1178 +----- ToF Node Class of PoDs ---+ 1179 ---------------- ------------- 1181 Figure 12: Northern View of a Maximally Partitioned ToF Level, R=1 1183 5.1.3. Fallen Leaf Problem 1185 As mentioned earlier, RIFT exhibits an anisotropic behavior tailored 1186 for fabrics with a North / South orientation and a high level of 1187 interleaving paths. 
A non-partitioned fabric makes a total loss of 1188 connectivity between a Top-of-Fabric node at the north and a leaf 1189 node at the south a rare but possible occurrence that is fully 1190 healed by the positive disaggregation described in Section 5.2.5.1. In 1191 large fabrics, or fabrics built from switches with a low radix, the ToF 1192 often ends up being partitioned into planes, which makes it more likely 1193 that a given leaf is reachable from only a subset of the ToF 1194 nodes. This makes some further considerations 1195 necessary. 1197 We define a "Fallen Leaf" as a leaf that can be reached by only a 1198 subset of Top-of-Fabric nodes but cannot be reached by all due to 1199 missing connectivity. If R is the redundancy factor, then it takes 1200 at least R breakages to reach a "Fallen Leaf" situation. 1202 In general, the mechanism of non-transitive positive 1203 disaggregation is sufficient when the disaggregating ToF nodes 1204 collectively connect to all the ToP nodes in the broken plane. This 1205 happens in the following case: 1207 If the breakage is the last northern link from a ToP node to a ToF 1208 node going down, then the fallen leaf problem affects only that ToF 1209 node, and the connectivity to all the nodes in the PoD is lost 1210 from that ToF node. This can be observed by other ToF nodes 1211 within the plane where the ToP node is located and positively 1212 disaggregated within that plane. 1214 On the other hand, there is a need to disaggregate the routes to 1215 Fallen Leaves in a transitive fashion all the way to the other leaves 1216 in the following cases: 1218 If the breakage is the last northern link from a Leaf node within 1219 a plane - there is only one such link in a maximally partitioned 1220 fabric - that goes down, then connectivity to all unicast prefixes 1221 attached to the Leaf node is lost within the plane where the link 1222 is located.
Southern Reflection by a Leaf Node - e.g., between 1223 ToP nodes if the PoD has only 2 levels - happens in between 1224 planes, allowing the ToP nodes to detect the problem within the 1225 PoD where it occurs and positively disaggregate. The breakage can 1226 be observed by the ToF nodes in the same plane through the 1227 flooding of N-TIEs from the ToP nodes, but the ToF nodes need to 1228 be aware of all the affected prefixes for the negative 1229 disaggregation to be fully effective. The problem can also be 1230 observed by the ToF nodes in the other planes through the flooding 1231 of N-TIEs from the affected Leaf nodes, together with non-node 1232 N-TIEs which indicate the affected prefixes. To be effective in 1233 that case, the positive disaggregation must reach down to the 1234 nodes that make the plane selection, which are typically the 1235 ingress Leaf nodes, and the information is not useful for routing 1236 in the intermediate levels. 1238 If the breakage is a ToP node in a maximally partitioned fabric - 1239 in which case it is the only ToP node serving that plane in that 1240 PoD - that goes down, then the connectivity to all the nodes in 1241 the PoD is lost within the plane where the ToP node is located - 1242 all leaves fall. Since the Southern Reflection between the ToF 1243 nodes happens only within a plane, ToF nodes in other planes 1244 cannot discover the case of fallen leaves in a different plane, 1245 and cannot determine beyond their local plane whether a Leaf node 1246 that was initially reachable has become unreachable. As above, 1247 the breakage can be observed by the ToF nodes in the plane where 1248 the breakage happened, and then again, the ToF nodes in the plane 1249 need to be aware of all the affected prefixes for the negative 1250 disaggregation to be fully effective. 
The problem can also be 1251 observed by the ToF nodes in the other planes through the flooding 1252 of N-TIEs from the affected Leaf nodes, if there are only 3 levels 1253 and the ToP nodes are directly connected to the Leaf nodes, and 1254 then again it can only be effective if it is propagated transitively 1255 to the Leaf, and is useless above that level. 1257 For the sake of easy comprehension let us roll the abstractions back 1258 to a simple example and observe that in Figure 3 the loss of link 1259 Spine 122 to Leaf 122 will make Leaf 122 a fallen leaf for Top-of- 1260 Fabric plane B. Worse, if the cabling was never present in the first 1261 place, plane B will not even be able to know that such a fallen leaf 1262 exists. Hence partitioning without further treatment results in two 1263 grave problems: 1265 o Leaf111 trying to route to Leaf122 MUST choose Spine 111 in plane 1266 A as its next hop since plane B will inevitably blackhole the 1267 packet when forwarding using default routes or do excessive bow 1268 tie'ing, i.e. this information must be in its routing table. 1270 o any kind of "flooding" or distance vector trying to deal with the 1271 problem by distributing host routes will be able to converge only 1272 using paths through leafs, i.e. the flooding of information on 1273 Leaf122 will go up to Top-of-Fabric A and then "loopback" over 1274 other leafs to ToF B leading in extreme cases to traffic for 1275 Leaf122 when presented to plane B taking an "inverted fabric" path 1276 where leafs start to serve as ToFs. 1278 5.1.4. Discovering Fallen Leaves 1280 As we illustrate later and without further proof here, to deal with 1281 fallen leafs in multi-plane designs when aggregation is used, RIFT 1282 requires all the ToF nodes to share the same topology database. This 1283 happens naturally in a single plane design but needs additional 1284 considerations in multi-plane fabrics.
To satisfy this, RIFT in 1285 multi-plane designs relies at the ToF Level on ring interconnection 1286 of switches in multiple planes. Other solutions are possible but 1287 they either need more cabling or end up having much longer flooding 1288 paths and/or single points of failure. 1290 In more detail, by reserving two ports on each Top-of-Fabric node it 1291 is possible to connect them together in an interplane bi-directional 1292 ring as illustrated in Figure 13 (where we show a bi-directional ring 1293 connecting switches across planes). The rings will exchange full 1294 topology information between planes and with that consequently allow, 1295 by means of the transitive, negative disaggregation described in 1296 Section 5.2.5.2, to efficiently fix any possible fallen leaf scenario. 1297 Somewhat as a side-effect, the exchange of information fulfills the 1298 requirement to present a full view of the fabric topology at the Top- 1299 of-Fabric level without the need to collate it from multiple points 1300 by the additional complexity of technologies like [RFC7752]. 1302 +----+ +----+ +----+ +----+ +----+ +----+ +--------+ 1303 | | | | | | | | | | | | | | 1304 | | | | | | | | 1305 +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ | 1306 +-| |--| |--| |--| |--| |--| |--| |-+ | 1307 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | Plane A 1308 +-| |--| |--| |--| |--| |--| |--| |-+ | 1309 +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ | 1310 | | | | | | | | 1311 +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ | 1312 +-| |--| |--| |--| |--| |--| |--| |-+ | 1313 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | Plane B 1314 +-| |--| |--| |--| |--| |--| |--| |-+ | 1315 +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ | 1316 | | | | | | | | 1317 ...
| 1318 | | | | | | | | 1319 +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ | 1320 +-| |--| |--| |--| |--| |--| |--| |-+ | 1321 | | HH | | HH | | HH | | HH | | HH | | HH | | HH | | | Plane X 1322 +-| |--| |--| |--| |--| |--| |--| |-+ | 1323 +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ +-o--+ | 1324 | | | | | | | | 1325 | | | | | | | | | | | | | | 1326 +----+ +----+ +----+ +----+ +----+ +----+ +--------+ 1328 Figure 13: Connecting Top-of-Fabric Nodes Across Planes by Two Rings 1330 5.1.5. Addressing the Fallen Leaves Problem 1332 One consequence of the Fallen Leaf problem is that some prefixes 1333 attached to the fallen leaf become unreachable from some of the ToF 1334 nodes. RIFT proposes two methods to address this issue, the positive 1335 and the negative disaggregation. Both methods flood S-TIEs to 1336 advertise the impacted prefix(es). 1338 When used for the operation of disaggregation, a positive S-TIE, as 1339 usual, indicates reachability to a prefix of given length and all 1340 addresses subsumed by it. In contrast, a negative route 1341 advertisement indicates that the origin cannot route to the 1342 advertised prefix. 1344 The positive disaggregation is originated by a router that can still 1345 reach the advertised prefix, and the operation is not transitive, 1346 meaning that the receiver does not generate its own flooding south as 1347 a consequence of receiving positive disaggregation advertisements 1348 from a higher level node. The effect of a positive disaggregation 1349 is that the traffic to the impacted prefix will follow the 1350 longest prefix match and will be limited to the northbound routers that 1351 advertised the more specific route. 1353 In contrast, the negative disaggregation is transitive, and is 1354 propagated south when all the possible routes northwards are barred. 1355 A negative route advertisement is only actionable when the negative 1356 prefix is aggregated by a positive route advertisement for a shorter 1357 prefix.
In that case, the negative advertisement carves an exception 1358 to the positive route in the routing table (one could think of it as 1359 "punching a hole"), making the positive prefix reachable through the 1360 originator with the special consideration of the negative prefix 1361 removing certain next hop neighbors. 1363 When the ToF is not partitioned, the collective southern flooding of 1364 the positive disaggregation by the ToF nodes that can still reach the 1365 impacted prefix is in general enough to cover all the switches at the 1366 next level south, typically the ToP nodes. If all those switches are 1367 aware of the disaggregation, they collectively create a ceiling that 1368 intercepts all the traffic north and forwards it to the ToF nodes 1369 that advertised the more specific route. In that case, the positive 1370 disaggregation alone is sufficient to solve the fallen leaf problem. 1372 On the other hand, when the fabric is partitioned in planes, the 1373 positive disaggregation from ToF nodes in different planes does not 1374 reach the ToP switches in the affected plane and cannot solve the 1375 fallen leaves problem. In other words, a breakage in a plane can 1376 only be solved in that plane. Also, the selection of the plane for a 1377 packet typically occurs at the leaf level and the disaggregation must 1378 be transitive and reach all the leaves. In that case, the negative 1379 disaggregation is necessary. The details of the RIFT approach to 1380 deal with fallen leafs in an optimal way are specified in 1381 Section 5.2.5.2. 1383 5.2. Specification 1385 5.2.1. Transport 1387 All packet formats are defined in Thrift [thrift] models in 1388 Appendix B. 1390 The serialized model is carried in an envelope within a UDP frame 1391 that provides security and allows validation/modification of several 1392 important fields without de-serialization for performance and 1393 security reasons. 1395 5.2.2.
Link (Neighbor) Discovery (LIE Exchange) 1397 LIE exchange happens over a well-known, administratively locally scoped 1398 and configured or otherwise well-known IPv4 multicast address 1399 [RFC2365] and/or link-local multicast scope [RFC4291] for IPv6 1400 [RFC8200] using a configured or otherwise well-known destination 1401 UDP port defined in Appendix D.1. LIEs SHOULD be sent with a TTL of 1402 1 to prevent RIFT information reaching beyond a single L3 next-hop in 1403 the topology. LIEs SHOULD be sent with network control precedence. 1405 The originating port of the LIE has no further significance other than 1406 identifying the origination point. LIEs are exchanged over all links 1407 running RIFT. 1409 An implementation MAY listen and send LIEs on IPv4 and/or IPv6 1410 multicast addresses. A node MUST NOT originate LIEs on an address 1411 family if it does not process received LIEs on that family. LIEs on the 1412 same link are considered part of the same negotiation independent of 1413 the address family they arrive on. Observe further that the LIE 1414 source address may not identify the peer uniquely in unnumbered or 1415 link-local address cases so the response transmission MUST occur over 1416 the same interface the LIEs have been received on. A node may use 1417 any of the adjacency's source addresses it saw in LIEs on the 1418 specific interface during adjacency formation to send TIEs. That 1419 implies that an implementation MUST be ready to accept TIEs on all 1420 addresses it used as source of LIE frames. 1422 A three way adjacency over any address family implies support for 1423 IPv4 forwarding if the `v4_forwarding_capable` flag is set to true 1424 and a node can use [RFC5549] type of forwarding in such a situation. 1425 It is expected that the whole fabric supports the same type of 1426 forwarding of address families on all the links.
Operation of a 1427 fabric where only some of the links support forwarding on an 1428 address family and others do not is outside the scope of this 1429 specification. 1431 Observe further that the protocol does NOT support selective 1432 disabling of address families, disabling v4 forwarding capability or 1433 any local address changes in three way state, i.e. if a link has 1434 entered three way IPv4 and/or IPv6 with a neighbor on an adjacency 1435 and it wants to stop supporting one of the families or change any of 1436 its local addresses or stop v4 forwarding, it has to tear down and 1437 rebuild the adjacency. It also has to remove any information it 1438 stored about the adjacency such as LIE source addresses seen. 1440 Unless Section 5.2.7 is used, each node is provisioned with the level 1441 at which it is operating and its PoD (or otherwise a default level 1442 and "undefined" PoD are assumed; meaning that leafs do not need to be 1443 configured at all if initial configuration values are all left at 0). 1444 Nodes in the spine are configured with "any" PoD which has the same 1445 value as "undefined" PoD; hence we will talk about "undefined/any" PoD. 1446 This information is propagated in the LIEs exchanged. 1448 Further definitions of leaf flags are found in Section 5.2.7 given 1449 they have implications in terms of level and adjacency forming here. 1451 A node tries to form a three way adjacency if and only if 1453 1. the node is in the same PoD or either the node or the neighbor 1454 advertises "undefined/any" PoD membership (PoD# = 0) AND 1456 2. the neighboring node is running the same MAJOR schema version AND 1458 3. the neighbor is not a member of some PoD while the node has a 1459 northbound adjacency already joining another PoD AND 1461 4. the neighboring node uses a valid System ID AND 1463 5. the neighboring node uses a different System ID than the node 1464 itself AND 1466 6. the advertised MTUs match on both sides AND 1468 7.
both nodes advertise defined level values AND 1470 8. [ 1472 i) the node is at level 0 and has no three way adjacencies 1473 already to HAT nodes with level different than the adjacent 1474 node OR 1476 ii) the node is not at level 0 and the neighboring node is at 1477 level 0 OR 1479 iii) both nodes are at level 0 AND both indicate support for 1480 Section 5.3.9 OR 1482 iv) neither node is at level 0 and the neighboring node is at 1483 most one level away 1485 ]. 1487 The rule in Paragraph 3 MAY be optionally disregarded by a node if 1488 PoD detection is undesirable or has to be ignored. 1490 A node configured with "undefined" PoD membership MUST, after 1491 building its first northbound three way adjacency to a node in a 1492 defined PoD, advertise that PoD as part of its LIEs. In case that 1493 adjacency is lost, from all available northbound three way 1494 adjacencies the node with the highest System ID and defined PoD is 1495 chosen. That way the northmost defined PoD value (normally the top 1496 spines in a PoD) can diffuse southbound towards the leafs "forcing" 1497 the PoD value on any node with "undefined" PoD. 1499 LIEs arriving with a TTL larger than 1 MUST be ignored. 1501 A node SHOULD NOT send out LIEs without a defined level in the header 1502 but in certain scenarios it may be beneficial for trouble-shooting 1503 purposes. 1505 LIE exchange uses a three way handshake mechanism which is a cleaned up 1506 version of [RFC5303]. Observe that for easier comprehension the 1507 terminology of one/two and three-way states does NOT align with OSPF 1508 or ISIS FSMs albeit they use roughly the same mechanisms. 1510 5.2.3. Topology Exchange (TIE Exchange) 1512 5.2.3.1. Topology Information Elements 1514 Topology and reachability information in RIFT is conveyed by the 1515 means of TIEs which have a good amount of commonalities with LSAs in 1516 OSPF.
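Like LSAs, TIEs carry enough identifying and freshness information (sequence numbers, lifetimes and a type, as described below) to be compared and flooded reliably. A minimal sketch of such a header, with hypothetical field names (the normative schema is the Thrift model in Appendix B):

```python
# Illustrative-only sketch of a TIE identity/freshness header; field
# names are assumptions, not the normative schema of Appendix B.
from dataclasses import dataclass

@dataclass(frozen=True)
class TIEHeader:
    originator: int   # System ID of the originating node
    tie_type: str     # e.g. "NodeNTIE", "PrefixSTIE"
    tie_id: int       # identifier within the type's ample number space
    seq_nr: int       # sequence number, rolled over per Appendix A
    lifetime: int     # remaining lifetime in seconds

def same_tie(a: TIEHeader, b: TIEHeader) -> bool:
    """Two headers describe the same logical TIE."""
    return (a.originator, a.tie_type, a.tie_id) == \
           (b.originator, b.tie_type, b.tie_id)

def is_newer(a: TIEHeader, b: TIEHeader) -> bool:
    """As in other link-state protocols, a higher sequence number
    supersedes an older copy of the same TIE."""
    return same_tie(a, b) and a.seq_nr > b.seq_nr
```

A database then keeps, per (originator, type, ID) key, only the copy with the highest sequence number.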
1518 The TIE exchange mechanism uses the port indicated by each node in 1519 the LIE exchange and the interface on which the adjacency has been 1520 formed as destination. It SHOULD use a TTL of 1 as well and set inter- 1521 network control precedence on the according packets. 1523 TIEs contain sequence numbers, lifetimes and a type. Each type has 1524 ample identifying number space and information is spread across 1525 possibly many TIEs of a certain type by the means of a hash function 1526 that a node or deployment can individually determine. One extreme 1527 design choice is a prefix per TIE which leads to more BGP-like 1528 behavior where small increments are only advertised on route changes 1529 vs. deploying with dense prefix packing into few TIEs leading to a more 1530 traditional IGP trade-off with fewer TIEs. An implementation may 1531 even rehash the prefix to TIE mapping at any time at the cost of a 1532 significant amount of re-advertisements of TIEs. 1534 More information about the TIE structure can be found in the schema 1535 in Appendix B. 1537 5.2.3.2. South- and Northbound Representation 1539 A central concept of RIFT is that each node represents itself 1540 differently depending on the direction in which it is advertising 1541 information. More precisely, a spine node represents two different 1542 databases over its adjacencies depending on whether it advertises TIEs 1543 to the north or to the south/sideways. We call those differing TIE 1544 databases either south- or northbound (S-TIEs and N-TIEs) depending 1545 on the direction of distribution. 1547 The N-TIEs hold all of the node's adjacencies and local prefixes 1548 while the S-TIEs hold only all of the node's adjacencies, the default 1549 prefix with necessary disaggregated prefixes and local prefixes. 1550 We will explain this in detail further in Section 5.2.5.
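The hash-based spreading of prefixes across TIEs of a type, described in Section 5.2.3.1 above, can be sketched as follows; the hash choice and the number of buckets are local implementation decisions, and the names here are illustrative:

```python
# Sketch of spreading prefixes across multiple TIEs of one type via a
# deterministic hash. num_ties=1 packs densely (traditional IGP-like
# trade-off); a very large num_ties approaches one prefix per TIE
# (more BGP-like, small increments on route changes).
import hashlib
from collections import defaultdict

def tie_id_for_prefix(prefix: str, num_ties: int) -> int:
    # A stable hash keeps the mapping consistent across restarts;
    # rehashing (changing num_ties or the function) forces
    # re-advertisement of all affected TIEs.
    digest = hashlib.sha256(prefix.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_ties

def pack_prefixes(prefixes, num_ties):
    """Group prefixes into per-TIE buckets keyed by TIE ID."""
    ties = defaultdict(list)
    for p in prefixes:
        ties[tie_id_for_prefix(p, num_ties)].append(p)
    return dict(ties)
```

On a route change only the TIE whose bucket contains the changed prefix needs re-origination.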
1552 The TIE types are mostly symmetric in both directions and Table 2 1553 provides a quick reference to the main TIE types including direction and 1554 their function. 1556 +-------------------+-----------------------------------------------+ 1557 | TIE-Type | Content | 1558 +-------------------+-----------------------------------------------+ 1559 | Node N-TIE | node properties and adjacencies | 1560 +-------------------+-----------------------------------------------+ 1561 | Node S-TIE | same content as node N-TIE | 1562 +-------------------+-----------------------------------------------+ 1563 | Prefix N-TIE | contains node's directly reachable prefixes | 1564 +-------------------+-----------------------------------------------+ 1565 | Prefix S-TIE | contains originated defaults and directly | 1566 | | reachable prefixes | 1567 +-------------------+-----------------------------------------------+ 1568 | Positive | contains disaggregated prefixes | 1569 | Disaggregation | | 1570 | S-TIE | | 1571 +-------------------+-----------------------------------------------+ 1572 | Negative | contains special, negatively disaggregated | 1573 | Disaggregation | prefixes to support multi-plane designs | 1574 | S-TIE | | 1575 +-------------------+-----------------------------------------------+ 1576 | External Prefix | contains external prefixes | 1577 | N-TIE | | 1578 +-------------------+-----------------------------------------------+ 1579 | Key-Value N-TIE | contains node's northbound KVs | 1580 +-------------------+-----------------------------------------------+ 1581 | Key-Value S-TIE | contains node's southbound KVs | 1582 +-------------------+-----------------------------------------------+ 1584 Table 2: TIE Types 1586 As an example illustrating a database holding both representations, 1587 consider the topology in Figure 2 with the optional link between 1588 spine 111 and spine 112 (so that the flooding on an East-West link 1589 can be shown).
This example assumes unnumbered interfaces. First, 1590 here are the TIEs generated by some nodes. For simplicity, the key 1591 value elements which may be included in their S-TIEs or N-TIEs are 1592 not shown. 1594 Spine21 S-TIEs: 1595 Node S-TIE: 1596 NodeElement(level=2, neighbors((Spine 111, level 1, cost 1), 1597 (Spine 112, level 1, cost 1), (Spine 121, level 1, cost 1), 1598 (Spine 122, level 1, cost 1))) 1599 Prefix S-TIE: 1600 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 1602 Spine 111 S-TIEs: 1603 Node S-TIE: 1604 NodeElement(level=1, neighbors((Spine21, level 2, cost 1, links(...)), 1605 (Spine22, level 2, cost 1, links(...)), 1606 (Spine 112, level 1, cost 1, links(...)), 1607 (Leaf111, level 0, cost 1, links(...)), 1608 (Leaf112, level 0, cost 1, links(...)))) 1609 Prefix S-TIE: 1610 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 1612 Spine 111 N-TIEs: 1613 Node N-TIE: 1614 NodeElement(level=1, 1615 neighbors((Spine21, level 2, cost 1, links(...)), 1616 (Spine22, level 2, cost 1, links(...)), 1617 (Spine 112, level 1, cost 1, links(...)), 1618 (Leaf111, level 0, cost 1, links(...)), 1619 (Leaf112, level 0, cost 1, links(...)))) 1620 Prefix N-TIE: 1621 NorthPrefixesElement(prefixes(Spine 111.loopback)) 1623 Spine 121 S-TIEs: 1624 Node S-TIE: 1625 NodeElement(level=1, neighbors((Spine21,level 2,cost 1), 1626 (Spine22, level 2, cost 1), (Leaf121, level 0, cost 1), 1627 (Leaf122, level 0, cost 1))) 1628 Prefix S-TIE: 1629 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 1631 Spine 121 N-TIEs: 1632 Node N-TIE: 1634 NodeElement(level=1, 1635 neighbors((Spine21, level 2, cost 1, links(...)), 1636 (Spine22, level 2, cost 1, links(...)), 1637 (Leaf121, level 0, cost 1, links(...)), 1638 (Leaf122, level 0, cost 1, links(...)))) 1639 Prefix N-TIE: 1640 NorthPrefixesElement(prefixes(Spine 121.loopback)) 1642 Leaf112 N-TIEs: 1643 Node N-TIE: 1644 NodeElement(level=0, 1645 neighbors((Spine 111, level 1, cost 1, links(...)), 1646
(Spine 112, level 1, cost 1, links(...)))) 1647 Prefix N-TIE: 1648 NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112, 1649 Prefix_MH)) 1651 Figure 14: Example TIEs generated in a 2 level spine-and-leaf 1652 topology 1654 5.2.3.3. Flooding 1656 The mechanism used to distribute TIEs is the well-known (albeit 1657 modified in several respects to address fat tree requirements) 1658 flooding mechanism used by today's link-state protocols. Although 1659 flooding is initially more demanding to implement it avoids many 1660 problems with the update style used in diffused computation such as in 1661 distance vector protocols. Since flooding tends to present an 1662 unscalable burden in large, densely meshed topologies (fat trees 1663 being unfortunately such a topology) we provide as a solution a close- 1664 to-optimal global flood reduction and load balancing optimization in 1665 Section 5.2.3.9. 1667 As described before, TIEs themselves are transported over UDP with 1668 the ports indicated in the LIE exchanges and using the destination 1669 address on which the LIE adjacency has been formed. For unnumbered 1670 IPv4 interfaces the same considerations apply as in the equivalent OSPF case. 1672 On reception of a TIE with an undefined level value in the packet 1673 header the node SHOULD issue a warning and indiscriminately discard 1674 the packet. 1676 Precise finite state machines and procedures can be found in 1677 Appendix C.3. 1679 5.2.3.4. TIE Flooding Scopes 1681 In a somewhat analogous fashion to link-local, area and domain 1682 flooding scopes, RIFT defines several complex "flooding scopes" 1683 depending on the direction and type of TIE propagated. 1685 Every N-TIE is flooded northbound, providing a node at a given level 1686 with the complete topology of the Clos or Fat Tree network underneath 1687 it, including all specific prefixes.
This means that a packet 1688 received from a node at the same or lower level whose destination is 1689 covered by one of those specific prefixes may be routed directly 1690 towards the node advertising that prefix rather than sending the 1691 packet to a node at a higher level. 1693 A node's Node S-TIEs, consisting of all node's adjacencies and prefix 1694 S-TIEs limited to those related to the default IP prefix and 1695 disaggregated prefixes, are flooded southbound in order to allow the 1696 nodes one level down to see connectivity of the higher level as well 1697 as reachability to the rest of the fabric. In order to allow an E-W 1698 disconnected node in a given level to receive the S-TIEs of other 1699 nodes at its level, every *NODE* S-TIE is "reflected" northbound to the 1700 level from which it was received. It should be noted that East-West 1701 links are included in South TIE flooding (except at ToF level); those 1702 TIEs need to be flooded to satisfy algorithms in Section 5.2.4. In 1703 that way nodes at the same level can learn about each other without a 1704 lower level, e.g. in case of the leaf level. The precise flooding scopes 1705 are given in Table 3. Those rules govern as well what SHOULD be 1706 included in TIDEs on the adjacency. Again, East-West flooding scopes 1707 are identical to South flooding scopes except in case of ToF East- 1708 West links (rings) which are basically performing northbound 1709 flooding. 1711 Node S-TIE "south reflection" allows supporting positive 1712 disaggregation on failures as described in Section 5.2.5 and flooding 1713 reduction as described in Section 5.2.3.9.
1715 +-----------+---------------------+---------------+-----------------+ 1716 | Type / | South | North | East-West | 1717 | Direction | | | | 1718 +-----------+---------------------+---------------+-----------------+ 1719 | node | flood if level of | flood if | flood only if | 1720 | S-TIE | originator is equal | level of | this node is | 1721 | | to this node | originator is | not ToF | 1722 | | | higher than | | 1723 | | | this node | | 1724 +-----------+---------------------+---------------+-----------------+ 1725 | non-node | flood self- | flood only if | flood only if | 1726 | S-TIE | originated only | neighbor is | self-originated | 1727 | | | originator of | and this node | 1728 | | | TIE | is not ToF | 1729 +-----------+---------------------+---------------+-----------------+ 1730 | all | never flood | flood always | flood only if | 1731 | N-TIEs | | | this node is | 1732 | | | | ToF | 1733 +-----------+---------------------+---------------+-----------------+ 1734 | TIDE | include at least | include at | if this node is | 1735 | | all non-self | least all | ToF then | 1736 | | originated N-TIE | node S-TIEs | include all | 1737 | | headers and self- | and all | N-TIEs, | 1738 | | originated S-TIE | S-TIEs | otherwise only | 1739 | | headers and node | originated by | self-originated | 1740 | | S-TIEs of nodes at | peer and all | TIEs | 1741 | | same level | N-TIEs | | 1742 +-----------+---------------------+---------------+-----------------+ 1743 | TIRE as | request all N-TIEs | request all | if this node is | 1744 | Request | and all peer's | S-TIEs | ToF then apply | 1745 | | self-originated | | North scope | 1746 | | TIEs and all node | | rules, | 1747 | | S-TIEs | | otherwise South | 1748 | | | | scope rules | 1749 +-----------+---------------------+---------------+-----------------+ 1750 | TIRE as | Ack all received | Ack all | Ack all | 1751 | Ack | TIEs | received TIEs | received TIEs | 1752 
+-----------+---------------------+---------------+-----------------+ 1754 Table 3: Flooding Scopes 1756 If the TIDE includes additional TIE headers beside the ones 1757 specified, the receiving neighbor must strictly apply the according 1758 filter to the received TIDE and MUST NOT request the extra TIE headers 1759 that were not allowed by the flooding scope rules in its direction. 1761 As an example to illustrate these rules, consider using the topology 1762 in Figure 2, with the optional link between spine 111 and spine 112, 1763 and the associated TIEs given in Figure 14. The flooding from 1764 particular nodes of the TIEs is given in Table 4. 1766 +-------------+----------+------------------------------------------+ 1767 | Router | Neighbor | TIEs | 1768 | floods to | | | 1769 +-------------+----------+------------------------------------------+ 1770 | Leaf111 | Spine | Leaf111 N-TIEs, Spine 111 node S-TIE | 1771 | | 112 | | 1772 | Leaf111 | Spine | Leaf111 N-TIEs, Spine 112 node S-TIE | 1773 | | 111 | | 1774 | | | | 1775 | Spine 111 | Leaf111 | Spine 111 S-TIEs | 1776 | Spine 111 | Leaf112 | Spine 111 S-TIEs | 1777 | Spine 111 | Spine | Spine 111 S-TIEs | 1778 | | 112 | | 1779 | Spine 111 | Spine21 | Spine 111 N-TIEs, Leaf111 N-TIEs, | 1780 | | | Leaf112 N-TIEs, Spine22 node S-TIE | 1781 | Spine 111 | Spine22 | Spine 111 N-TIEs, Leaf111 N-TIEs, | 1782 | | | Leaf112 N-TIEs, Spine21 node S-TIE | 1783 | | | | 1784 | ... | ... | ... | 1785 | Spine21 | Spine | Spine21 S-TIEs | 1786 | | 111 | | 1787 | Spine21 | Spine | Spine21 S-TIEs | 1788 | | 112 | | 1789 | Spine21 | Spine | Spine21 S-TIEs | 1790 | | 121 | | 1791 | Spine21 | Spine | Spine21 S-TIEs | 1792 | | 122 | | 1793 | ... | ... | ... | 1794 +-------------+----------+------------------------------------------+ 1796 Table 4: Flooding some TIEs from example topology 1798 5.2.3.5.
'Flood Only Node TIEs' Bit 1800 RIFT includes an optional ECN mechanism to prevent "flooding inrush" 1801 on restart or bring-up with many southbound neighbors. A node MAY 1802 set on its LIEs the according bit to indicate to the neighbor that it 1803 should temporarily flood node TIEs only to it. It should only set it 1804 in the southbound direction. The receiving node SHOULD accommodate 1805 the request to lessen the flooding load on the affected node if south 1806 of the sender and SHOULD ignore the bit if northbound. 1808 Obviously this mechanism is most useful in the southbound direction. The 1809 distribution of node TIEs guarantees correct behavior of algorithms 1810 like disaggregation or default route origination. Furthermore 1811 though, the use of this bit presents an inherent trade-off between 1812 processing load and convergence speed since suppressing flooding of 1813 northbound prefixes from neighbors will lead to blackholes. 1815 5.2.3.6. Initial and Periodic Database Synchronization 1817 The initial exchange of RIFT is modeled after ISIS with TIDE being 1818 equivalent to CSNP and TIRE playing the role of PSNP. The content of 1819 TIDEs and TIREs is governed by Table 3. 1821 5.2.3.7. Purging and Roll-Overs 1823 RIFT does not purge information that has been distributed by the 1824 protocol. Purging mechanisms in other routing protocols have proven 1825 to be complex and fragile over many years of experience. Abundant 1826 amounts of memory are available today even on low-end platforms. The 1827 information will age out and all computations will deliver correct 1828 results if a node leaves the network due to the new information 1829 distributed by its adjacent nodes. 1831 Once a RIFT node issues a TIE with an ID, it MUST preserve the ID as 1832 long as feasible (also when the protocol restarts), even if the TIE 1833 loses all content. The re-advertisement of an empty TIE fulfills the 1834 purpose of purging any information advertised in previous versions.
1835 The originator is free to not re-originate the according empty TIE 1836 again or to originate an empty TIE with a relatively short lifetime to 1837 prevent a large number of long-lived empty stubs polluting the network. 1838 Each node MUST timeout and clean up the according empty TIEs 1839 independently. 1841 Upon restart a node MUST, as any link-state implementation, be 1842 prepared to receive TIEs with its own system ID and supersede them 1843 with equivalent, newly generated, empty TIEs with a higher sequence 1844 number. As above, the lifetime can be relatively short since it only 1845 needs to exceed the necessary propagation and processing delay by all 1846 the nodes that are within the TIE's flooding scope. 1848 TIE sequence numbers are rolled over using the method described in 1849 Appendix A. The first sequence number of any spontaneously originated 1850 TIE (i.e. not originated to override a detected older copy in the 1851 network) MUST be a reasonably unpredictable random number in the 1852 interval [0, 2^10-1] which will prevent otherwise identical TIE 1853 headers from remaining "stuck" in the network with content different from 1854 the TIE originated after reboot. 1856 5.2.3.8. Southbound Default Route Origination 1858 Under certain conditions nodes issue a default route in their South 1859 Prefix TIEs with costs as computed in Section 5.3.6.1. 1861 A node X that 1863 1. is NOT overloaded AND 1865 2. has southbound or East-West adjacencies 1867 originates in its south prefix TIE such a default route IIF 1869 1. all other nodes at X's level are overloaded OR 1871 2. all other nodes at X's level have NO northbound adjacencies OR 1873 3. X has computed reachability to a default route during N-SPF. 1875 The term "all other nodes at X's level" describes obviously just the 1876 nodes at the same level in the PoD with a viable lower level 1877 (otherwise the node S-TIEs cannot be reflected and the nodes in e.g. 1878 PoD 1 and PoD 2 are "invisible" to each other).
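The southbound default-route origination condition above can be sketched as a predicate; parameter names are illustrative, and the "other nodes" inputs correspond to what X learns about its same-level peers via south reflection:

```python
# Sketch of the southbound default-route origination condition
# (Section 5.2.3.8): a non-overloaded node X with southbound or
# East-West adjacencies originates a default IIF one of the three
# listed conditions holds.
def originates_default(overloaded: bool,
                       has_south_or_ew_adjacency: bool,
                       others_all_overloaded: bool,
                       others_have_no_north_adjacency: bool,
                       reaches_default_via_nspf: bool) -> bool:
    # Preconditions on X itself.
    if overloaded or not has_south_or_ew_adjacency:
        return False
    # The "if and only if" disjunction from the list above.
    return (others_all_overloaded
            or others_have_no_north_adjacency
            or reaches_default_via_nspf)
```

Note the companion rule in the text that follows: a node originating a southbound default without having computed one during N-SPF must install a default discard route.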
1880 A node originating a southbound default route MUST install a default 1881 discard route if it did not compute a default route during N-SPF. 1883 5.2.3.9. Northbound TIE Flooding Reduction 1885 Section 1.4 of the Optimized Link State Routing Protocol [RFC3626] 1886 (OLSR) introduces the concept of a "multipoint relay" (MPR) that 1887 minimizes the overhead of flooding messages in the network by reducing 1888 redundant retransmissions in the same region. 1890 A similar technique is applied to RIFT to control northbound 1891 flooding. Important observations first: 1893 1. a node MUST flood self-originated N-TIEs to all the reachable 1894 nodes at the level above which we call the node's "parents"; 1896 2. it is typically not necessary that all parents reflood the N-TIEs 1897 to achieve a complete flooding of all the reachable nodes two 1898 levels above which we choose to call the node's "grandparents"; 1900 3. to control the volume of its flooding two hops North and yet keep 1901 it robust enough, it is advantageous for a node to select a 1902 subset of its parents as "Flood Repeaters" (FRs), which combined 1903 together deliver two or more copies of its flooding to all of its 1904 parents, i.e. the originating node's grandparents; 1906 4. nodes at the same level do NOT have to agree on a specific 1907 algorithm to select the FRs, but overall load balancing should be 1908 achieved so that different nodes at the same level should tend to 1909 select different parents as FRs; 1911 5. there are usually many solutions to the problem of finding a set 1912 of FRs for a given node; the problem of finding the minimal set 1913 is (similar to) an NP-Complete problem and a globally optimal set 1914 may not be the minimal one if load-balancing with other nodes is 1915 an important consideration; 1917 6. it is expected that there will often be sets of equivalent nodes 1918 at a level L, defined as having a common set of parents at L+1.
1919 Applying this observation at both L and L+1, an algorithm may 1920 attempt to split the larger problem into a sum of smaller separate 1921 problems; 1923 7. it is another expectation that there will be from time to time a 1924 broken link between a parent and a grandparent, and in that case 1925 the parent is probably a poor FR due to its lower reliability. 1926 An algorithm may attempt to eliminate parents with broken 1927 northbound adjacencies first in order to reduce the number of 1928 FRs. Albeit it could be argued that relying on higher fanout FRs 1929 will slow flooding due to higher replication load, the reliability of 1930 FRs' links seems to be a more pressing concern. 1932 In a fully connected Clos Network, this means that a node selects one 1933 arbitrary parent as FR and then a second one for redundancy. The 1934 computation can be kept relatively simple and completely distributed 1935 without any need for synchronization amongst nodes. In a "PoD" 1936 structure, where the Level L+2 is partitioned in silos of equivalent 1937 grandparents that are only reachable from respective parents, this 1938 means treating each silo as a fully connected Clos Network and solving 1939 the problem within the silo. 1941 In terms of signaling, a node has enough information to select its 1942 set of FRs; this information is derived from the node's parents' Node 1943 S-TIEs, which indicate the parent's reachable northbound adjacencies 1944 to its own parents, i.e. the node's grandparents. A node may send a 1945 LIE to a northbound neighbor with the optional boolean field 1946 `you_are_flood_repeater` set to false, to indicate that the 1947 northbound neighbor is not a flood repeater for the node that sent 1948 the LIE. In that case the northbound neighbor SHOULD NOT reflood 1949 northbound TIEs received from the node that sent the LIE.
If the 1950 `you_are_flood_repeater` is absent or if `you_are_flood_repeater` is 1951 set to true, then the northbound neighbor is a flood repeater for the 1952 node that sent the LIE and MUST reflood northbound TIEs received from 1953 that node. 1955 This specification proposes a simple default algorithm that SHOULD be 1956 implemented and used by default on every RIFT node. 1958 o let |NA(Node) be the set of Northbound adjacencies of node Node 1959 and CN(Node) be the cardinality of |NA(Node); 1961 o let |SA(Node) be the set of Southbound adjacencies of node Node 1962 and CS(Node) be the cardinality of |SA(Node); 1964 o let |P(Node) be the set of node Node's parents; 1966 o let |G(Node) be the set of node Node's grandparents. Observe 1967 that |G(Node) = |P(|P(Node)); 1969 o let N be the child node at level L computing a set of FRs; 1971 o let P be a node at level L+1 and a parent node of N, i.e. 1972 bidirectionally reachable over adjacency ADJ(N, P); 1974 o let G be a grandparent node of N, reachable transitively via a 1975 parent P over adjacencies ADJ(N, P) and ADJ(P, G). Observe that N 1976 does not have enough information to check bidirectional 1977 reachability of ADJ(P, G); 1979 o let R be a redundancy constant integer; a value of 2 or higher for 1980 R is RECOMMENDED; 1982 o let S be a similarity constant integer; a value in range 0 .. 2 1983 for S is RECOMMENDED, the value of 1 SHOULD be used. Two 1984 cardinalities are considered as equivalent if their absolute 1985 difference is less than or equal to S, i.e. |a-b|<=S. 1987 o let RND be a 64-bit random number generated by the system once on 1988 startup. 1990 The algorithm consists of the following steps: 1992 1. Derive a 64-bit number by XOR'ing N's system ID with RND. 1994 2.
Derive a 16-bit pseudo-random unsigned integer PR(N) from the 1995 resulting 64-bit number by splitting it in 16-bit-long words 1996 W1, W2, W3, W4 (where W1 are the least significant 16 bits of the 1997 64-bit number, and W4 are the most significant 16 bits) and then 1998 XOR'ing the circularly shifted resulting words together: 2000 (W1<<1) xor (W2<<2) xor (W3<<3) xor (W4<<4); 2002 where << is the circular shift operator. 2004 3. Sort the parents by decreasing number of northbound adjacencies 2005 (using decreasing system id of the parent as tie-breaker): 2006 sort |P(N) by decreasing CN(P), for all P in |P(N), as ordered 2007 array |A(N) 2009 4. Partition |A(N) in subarrays |A_k(N) of parents with equivalent 2010 cardinality of northbound adjacencies (in other words with 2011 equivalent number of grandparents they can reach): 2013 1. set k=0; // k is the ID of the subarray 2015 2. set i=0; 2017 3. while i < CN(N) do 2019 1. set j=i; 2021 2. while i < CN(N) and CN(|A(N)[j]) - CN(|A(N)[i]) <= S 2023 1. place |A(N)[i] in |A_k(N) // abstract action, maybe 2024 noop 2026 2. set i=i+1; 2028 3. /* At this point j is the index in |A(N) of the first 2029 member of |A_k(N) and (i-j) is C_k(N) defined as the 2030 cardinality of |A_k(N) */ 2032 4. set k=k+1; 2034 4. /* At this point k is the total number of subarrays, 2035 initialized for the shuffling operation below */ 2037 5. shuffle each subarray |A_k(N) of cardinality C_k(N) individually 2038 within |A(N) using the Durstenfeld variation of the Fisher-Yates 2039 algorithm that depends on N's System ID: 2041 1. while k > 0 do 2043 1. for i from C_k(N)-1 to 1 decrementing by 1 do 2045 1. set j to PR(N) modulo i; 2047 2. exchange |A_k[j] and |A_k[i]; 2049 2. set k=k-1; 2051 6. For each grandparent G, initialize a counter c(G) with the number 2052 of its south-bound adjacencies to elected flood repeaters (which 2053 is initially zero): 2055 1. for each G in |G(N) set c(G) = 0; 2057 7.
Finally keep as FRs only parents that are needed to maintain the 2058 number of adjacencies between the FRs and any grandparent G equal 2059 or above the redundancy constant R: 2061 1. for each P in reshuffled |A(N); 2063 1. if there exists an adjacency ADJ(P, G) in |NA(P) such 2064 that c(G) < R then 2066 1. place P in FR set; 2068 2. for all adjacencies ADJ(P, G') in |NA(P) increment 2069 c(G') 2071 2. If any c(G) is still < R, it was not possible to elect a set 2072 of FRs that covers all grandparents with redundancy R 2074 Additional rules for flooding reduction: 2076 1. The algorithm MUST be re-evaluated by a node on every change of 2077 local adjacencies or reception of a parent S-TIE with changed 2078 adjacencies. A node MAY apply a hysteresis to prevent an excessive 2079 amount of computation during periods of network instability just 2080 like in the case of reachability computation. 2082 2. A node SHOULD send out LIEs that grant flood repeater status 2083 before LIEs that revoke it on flood repeater set changes to 2084 prevent transient behavior where the full coverage of 2085 grandparents is not guaranteed. Albeit the condition will correct in a 2086 positively stable manner due to LIE retransmission and periodic 2087 TIDEs, it can slow down flooding convergence on flood repeater 2088 status changes. 2090 3. A node always floods its self-originated TIEs. 2092 4. A node receiving a TIE originated by a node for which it is not a 2093 flood repeater does NOT re-flood such TIEs to its neighbors 2094 except for rules in Paragraph 6. 2096 5. The indication of flood reduction capability is carried in the 2097 node TIEs and can be used to optimize the algorithm to account 2098 for nodes that will flood regardless. 2100 6.
A node generates TIDEs as usual but when receiving TIREs or TIDEs 2101 resulting in requests for a TIE of which the newest received copy 2102 came on an adjacency where the node was not flood repeater it 2103 SHOULD ignore such requests on the first, and ONLY the first, request. 2104 Normally, the nodes that received the TIEs as flood repeaters 2105 should satisfy the requesting node and with that no further TIREs 2106 for such TIEs will be generated. Otherwise, the next set of 2107 TIDEs and TIREs MUST lead to flooding independent of the flood 2108 repeater status. This solves a very difficult incast problem on 2109 nodes restarting with a very wide fanout, especially northbound. 2110 To retrieve the full database they often end up processing many 2111 in-rushing copies whereas this approach should load-balance the 2112 incoming database between adjacent nodes and flood repeaters 2113 should guarantee that two copies are sent by different nodes to 2114 ensure against any losses. 2116 7. Obviously, since flooding reduction does NOT apply to self- 2117 originated TIEs and all policy-guided information consists 2118 of self-originated TIEs, the latter is unaffected. 2120 5.2.3.10. Special Considerations 2122 First, due to the distributed, asynchronous nature of ZTP, it can 2123 create temporary convergence anomalies where nodes at higher levels 2124 of the fabric temporarily see themselves at a lower level than they belong to. 2125 Since flooding can begin before ZTP is "finished" and in fact must do 2126 so given there is no global termination criterion, information may end 2127 up in wrong layers. A special clause when changing level takes care 2128 of that. 2130 More difficult is a condition where a node floods a TIE north towards 2131 its super-spine, then its spine reboots, in fact partitioning the 2132 super-spine from it directly, and then the node itself reboots. That 2133 leaves in a sense the super-spine holding the "primary copy" of the 2134 node's TIE.
Normally this condition is resolved easily by the node 2135 re-originating its TIE with a higher sequence number than it sees in 2136 northbound TIEs; here however, when the spine comes back it won't be able 2137 to obtain an N-TIE from its super-spine easily and with that the node 2138 below may issue the same version of the TIE with a lower sequence 2139 number. Flooding procedures are extended to deal with the 2140 problem by the means of special clauses that override the database of 2141 a lower level with headers of newer TIEs seen in TIDEs coming from 2142 the north. 2144 5.2.4. Reachability Computation 2146 A node has two sources of relevant information. A node knows the 2147 full topology south from the received N-TIEs. A node has the set of 2148 prefixes with associated distances and bandwidths from received 2149 S-TIEs. 2151 To compute reachability, a node runs conceptually a northbound and a 2152 southbound SPF. We call these N-SPF and S-SPF. 2154 Since neither computation can "loop", it is possible to compute non- 2155 equal-cost or even k-shortest paths [EPPSTEIN] and "saturate" the 2156 fabric to the extent desired but we use simple, familiar SPF 2157 algorithms and concepts here due to their prevalence in today's 2158 routing. 2160 5.2.4.1. Northbound SPF 2162 N-SPF uses northbound and East-West adjacencies in the computing 2163 node's node N-TIEs (since if the node is a leaf it may not have 2164 generated a node S-TIE) when starting Dijkstra. Observe that N-SPF 2165 is really just a one-hop variety since Node S-TIEs are not re-flooded 2166 southbound beyond a single level (or East-West) and with that the 2167 computation cannot progress beyond adjacent nodes. 2169 Once progressing, we use the next level's node S-TIEs to find 2170 the according adjacencies to verify backlink connectivity. Just as in the 2171 case of IS-IS or OSPF, two unidirectional links are associated 2172 together to confirm bidirectional connectivity.
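The backlink check can be sketched as below; the data model (a map from system ID to the set of `(neighbor system ID, neighbor level)` pairs advertised in that node's Node TIE) is an illustrative assumption, not the protocol encoding. The sketch folds in the requirement that levels match, not only system IDs:

```python
def backlink_verified(node_ties, levels, a, b):
    """Sketch of bidirectional (backlink) adjacency verification.

    `node_ties` maps a system ID to the set of (neighbor system ID,
    neighbor level) pairs found in that node's Node TIE; `levels`
    maps a system ID to the node's actual level.  Both structures
    are illustrative assumptions for this sketch."""
    # Each side must list the other with the correct system ID AND level;
    # only then are the two unidirectional links associated together.
    return ((b, levels[b]) in node_ties.get(a, set())
            and (a, levels[a]) in node_ties.get(b, set()))
```
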
Particular care MUST 2173 be paid that the Node TIEs contain not only the correct system IDs 2174 but matching levels as well. 2176 A default route found when crossing an E-W link is used IIF 2178 1. the node itself does NOT have any northbound adjacencies AND 2180 2. the adjacent node has one or more northbound adjacencies 2182 This rule forms a "one-hop default route split-horizon" and prevents 2183 looping over default routes while allowing for "one-hop protection" 2184 of nodes that lost all northbound adjacencies, except at the Top-of-Fabric 2185 where the links are used exclusively to flood topology information in 2186 multi-plane designs. 2188 Other south prefixes found when crossing an E-W link MAY be used IIF 2190 1. no north neighbors are advertising the same or a supersuming non- 2191 default prefix AND 2193 2. the node does not originate a non-default supersuming prefix 2194 itself. 2196 i.e. the E-W link can be used as a gateway of last resort for a 2197 specific prefix only. Using south prefixes across an E-W link can be 2198 beneficial e.g. for automatic de-aggregation in pathological fabric 2199 partitioning scenarios. 2201 A detailed example can be found in Section 6.4. 2203 5.2.4.2. Southbound SPF 2205 S-SPF uses only the southbound adjacencies in the node S-TIEs, i.e. 2206 it progresses towards nodes at lower levels. Observe that E-W 2207 adjacencies are NEVER used in this computation. This enforces the 2208 requirement that a packet traversing in a southbound direction must 2209 never change its direction. 2211 S-SPF uses northbound adjacencies in node N-TIEs to verify backlink 2212 connectivity. 2214 5.2.4.3.
East-West Forwarding Within a non-ToF Level 2216 Ultimately, it should be observed that in presence of a "ring" of E-W 2217 links in any level (except the ToF level) neither SPF will provide a 2218 "ring protection" scheme since such a computation would have to deal 2219 necessarily with breaking of "loops" in the generic Dijkstra sense; an 2220 application for which RIFT is not intended. It is outside the scope 2221 of this document how an underlay can be used to provide full-mesh 2222 connectivity between nodes in the same level that would allow for 2223 N-SPF to provide protection for a single node losing all its 2224 northbound adjacencies (as long as any of the other nodes in the 2225 level are northbound connected). 2227 Using south prefixes over horizontal links is optional and can 2228 protect against pathological fabric partitioning cases that leave 2229 only paths to destinations that would necessitate multiple changes of 2230 forwarding direction between north and south. 2232 5.2.4.4. East-West Links Within ToF Level 2234 E-W ToF links behave in terms of flooding scopes defined in 2235 Section 5.2.3.4 like northbound links. Even though a ToF node could 2236 be tempted to use those links during southbound SPF this MUST NOT be 2237 attempted since it may lead, e.g. in anycast cases, to routing loops. 2238 An implementation could try to resolve the looping problem by following 2239 strictly tie-broken shortest paths on the ring only but the details 2240 are outside this specification. And even then, the problem of proper 2241 capacity provisioning of such links when they become traffic-bearing 2242 in case of failures is vexing. 2244 5.2.5. Automatic Disaggregation on Link & Node Failures 2246 5.2.5.1. Positive, Non-transitive Disaggregation 2248 Under normal circumstances, a node's S-TIEs contain just the 2249 adjacencies and a default route.
However, if a node detects that its 2250 default IP prefix covers one or more prefixes that are reachable 2251 through it but not through one or more other nodes at the same level, 2252 then it MUST explicitly advertise those prefixes in an S-TIE. 2253 Otherwise, some percentage of the northbound traffic for those 2254 prefixes would be sent to nodes without according reachability, 2255 causing it to be black-holed. Even when not black-holing, the 2256 resulting forwarding could 'backhaul' packets through the higher 2257 level spines, clearly an undesirable condition affecting the blocking 2258 probabilities of the fabric. 2260 We refer to the process of advertising additional prefixes southbound 2261 as 'positive de-aggregation' or 'positive dis-aggregation'. Such 2262 dis-aggregation is non-transitive, i.e. its effects are always 2263 contained to a single level of the fabric only. Naturally, multiple 2264 node or link failures can lead to several independent instances of 2265 positive dis-aggregation necessary to prevent looping or bow-tying 2266 the fabric. 2268 A node determines the set of prefixes needing de-aggregation using 2269 the following steps: 2271 1. A DAG computation in the southern direction is performed first, 2272 i.e. the N-TIEs are used to find all of the prefixes it can reach and 2273 the set of next-hops in the lower level for each of them. Such a 2274 computation can be easily performed on a fat tree by e.g. setting 2275 all link costs in the southern direction to 1 and all northern 2276 directions to infinity. We term the set of those prefixes |R, and 2277 for each prefix, r, in |R, we define its set of next-hops to 2278 be |H(r). 2280 2. The node uses reflected S-TIEs to find all nodes at the same 2281 level in the same PoD and the set of southbound adjacencies for 2282 each. The set of nodes at the same level is termed |N and for 2283 each node, n, in |N, we define its set of southbound adjacencies 2284 to be |A(n). 2286 3.
For a given r, if the intersection of |H(r) and |A(n), for any n, 2287 is null then that prefix r must be explicitly advertised by the 2288 node in an S-TIE. 2290 4. An identical set of de-aggregated prefixes is flooded on each of the 2291 node's southbound adjacencies. In accordance with the normal 2292 flooding rules for an S-TIE, a node at the lower level that 2293 receives this S-TIE will not propagate it south-bound. Neither 2294 is it necessary for the receiving node to reflect the 2295 disaggregated prefixes back over its adjacencies to nodes at the 2296 level from which it was received. 2298 To summarize the above in simplest terms: if a node detects that its 2299 default route encompasses prefixes for which one of the other nodes 2300 in its level has no possible next-hops in the level below, it has to 2301 disaggregate them to prevent black-holing or suboptimal routing through 2302 such nodes. Hence a node X needs to determine if it can reach a 2303 different set of south neighbors than other nodes at the same level, 2304 which are connected to it via at least one common south neighbor. If 2305 it can, then prefix disaggregation may be required. If it can't, 2306 then no prefix disaggregation is needed. An example of 2307 disaggregation is provided in Section 6.3. 2309 A possible algorithm is described below: 2311 1. Create partial_neighbors = (empty), a set of neighbors with 2312 partial connectivity to the node X's level from X's perspective. 2313 Each entry is a south neighbor of X and a list of nodes 2314 of X.level that can't reach that neighbor. 2316 2. A node X determines its set of southbound neighbors 2317 X.south_neighbors. 2319 3.
For each S-TIE originated from a node Y that X has which is at 2320 X.level, if Y.south_neighbors is not the same as 2321 X.south_neighbors but the nodes share at least one southern 2322 neighbor, for each neighbor N in X.south_neighbors but not in 2323 Y.south_neighbors, add (N, (Y)) to partial_neighbors if N isn't 2324 there or add Y to the list for N. 2326 4. If partial_neighbors is empty, then node X does not need to 2327 disaggregate any prefixes. If node X is advertising 2328 disaggregated prefixes in its S-TIE, X SHOULD remove them and 2329 re-advertise its according S-TIEs. 2331 A node X computes reachability to all nodes below it based upon the 2332 received N-TIEs first. This results in a set of routes, each 2333 categorized by (prefix, path_distance, next-hop-set). Alternately, 2334 for clarity in the following procedure, these can be organized by 2335 next-hop-set as ( (next-hops), {(prefix, path_distance)}). If 2336 partial_neighbors isn't empty, then the following procedure describes 2337 how to identify prefixes to disaggregate. 2339 disaggregated_prefixes = { empty } 2340 nodes_same_level = { empty } 2341 for each S-TIE 2342 if (S-TIE.level == X.level and 2343 S-TIE.originator shares at least one S-neighbor with X) 2344 add S-TIE.originator to nodes_same_level 2345 end if 2346 end for 2348 for each next-hop-set NHS 2349 isolated_nodes = nodes_same_level 2350 for each NH in NHS 2351 if NH in partial_neighbors 2352 isolated_nodes = intersection(isolated_nodes, 2353 partial_neighbors[NH].nodes) 2354 end if 2355 end for 2357 if isolated_nodes is not empty 2358 for each prefix using NHS 2359 add (prefix, distance) to disaggregated_prefixes 2360 end for 2361 end if 2362 end for 2364 copy disaggregated_prefixes to X's S-TIE 2365 if X's S-TIE is different 2366 schedule S-TIE for flooding 2367 end if 2369 Figure 15: Computation of Disaggregated Prefixes 2371 Each disaggregated prefix is sent with the according path_distance.
2372 This allows a node to send the same S-TIE to each south neighbor. 2373 The south neighbor which is connected to that prefix will thus have a 2374 shorter path. 2376 Finally, to summarize the less obvious points partially omitted in 2377 the algorithms to keep them more tractable: 2379 1. all neighbor relationships MUST perform backlink checks. 2381 2. overload bits as introduced in Section 5.3.1 have to be respected 2382 during the computation. 2384 3. all the lower level nodes are flooded the same disaggregated 2385 prefixes since we don't want to build an S-TIE per node and 2386 complicate things unnecessarily. The PoD containing the prefix 2387 will prefer southbound anyway. 2389 4. positively disaggregated prefixes do NOT have to propagate to 2390 lower levels. With that the disturbance in terms of new flooding 2391 is contained to a single level experiencing failures. 2393 5. disaggregated prefix S-TIEs are not "reflected" by the lower 2394 level, i.e. nodes within the same level do NOT need to be aware 2395 which node computed the need for disaggregation. 2397 6. The fabric still supports maximum load balancing properties 2398 while not trying to send traffic northbound unless necessary. 2400 In case positive disaggregation is triggered, then due to the very 2401 stable but un-synchronized nature of the algorithm the nodes may 2402 issue the necessary disaggregated prefixes at different points in 2403 time. This can lead for a short time to an "incast" behavior where 2404 the first advertising router, based on the nature of longest prefix 2405 match, will attract all the traffic. An implementation MAY hence 2406 choose different strategies to address this behavior if needed.
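The procedure of Figure 15 can be sketched in executable form as follows; the organization of routes by next-hop-set follows the description above, and treating a next-hop that is absent from partial_neighbors as reachable by every same-level node (emptying the intersection) is this sketch's interpretation of the intersection step:

```python
def disaggregated_prefixes(routes_by_nhs, partial_neighbors, nodes_same_level):
    """Sketch of Figure 15 (Computation of Disaggregated Prefixes).

    `routes_by_nhs` maps a frozenset of next-hops to a list of
    (prefix, path_distance) tuples; `partial_neighbors` maps a south
    neighbor to the set of same-level nodes that cannot reach it;
    `nodes_same_level` is the set of same-level nodes sharing at
    least one south neighbor with the computing node."""
    result = set()
    for nhs, prefixes in routes_by_nhs.items():
        isolated = set(nodes_same_level)
        for nh in nhs:
            # A next-hop absent from partial_neighbors is reachable by
            # every node at this level, so no node is isolated via it.
            isolated &= partial_neighbors.get(nh, set())
        if isolated:
            # Some same-level node cannot reach ANY next-hop of this set:
            # its prefixes must be positively disaggregated.
            result.update(prefixes)
    return result
```
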
2408 To close this section it is worth observing that in a single plane 2409 ToF this disaggregation prevents blackholing up to (K_LEAF * P) link 2410 failures in terms of Section 5.1.2 or, in other terms, it takes at 2411 minimum that many link failures to partition the ToF into multiple 2412 planes. 2414 5.2.5.2. Negative, Transitive Disaggregation for Fallen Leafs 2416 As explained in Section 5.1.3, failures in a multi-plane Top-of-Fabric 2417 or more than (K_LEAF * P) links failing in a single plane design can 2418 generate fallen leafs. Such a scenario cannot be addressed by positive 2419 disaggregation only and needs a further mechanism. 2421 5.2.5.2.1. Cabling of Multiple Top-of-Fabric Planes 2423 Let us return in this section to designs with multiple planes as 2424 shown in Figure 3. Figure 16 highlights how the ToF is cabled in 2425 case of two planes by the means of dual-rings to distribute all the 2426 N-TIEs within both planes. For people familiar with traditional 2427 link-state routing protocols, the ToF level can be considered equivalent 2428 to area 0 in OSPF or level-2 in ISIS, which need to be "connected" as 2429 well for the protocol to operate correctly. 2431 . ++==========++ ++==========++ 2432 . II II II II 2433 .+----++--+ +----++--+ +----++--+ +----++--+ 2434 .|ToF A1| |ToF B1| |ToF B2| |ToF A2| 2435 .++-+-++--+ ++-+-++--+ ++-+-++--+ ++-+-++--+ 2436 . | | II | | II | | II | | II 2437 . | | ++==========++ | | ++==========++ 2438 . | | | | | | | | 2439 . 2440 . ~~~ Highlighted ToF of the previous multi-plane figure ~~ 2442 Figure 16: Topologically connected planes 2444 As described in Section 5.1.3, failures in multi-plane fabrics can 2445 lead to blackholes which normal positive disaggregation cannot fix. 2446 The mechanism of negative, transitive disaggregation incorporated in 2447 RIFT provides the according solution. 2449 5.2.5.2.2.
Transitive Advertisement of Negative Disaggregates 2451 A ToF node that discovers that it cannot reach a fallen leaf 2452 disaggregates all the prefixes of such leafs. It uses for that 2453 purpose negative prefix S-TIEs that are, as usual, flooded southwards 2454 with the scope defined in Section 5.2.3.4. 2456 Transitively, a node explicitly loses connectivity to a prefix when 2457 none of its children advertises it and when the prefix is negatively 2458 disaggregated by all of its parents. When that happens, the node 2459 originates the negative prefix further down south. Since the 2460 mechanism applies recursively south, the negative prefix may propagate 2461 transitively all the way down to the leaf. This is necessary since 2462 leafs connected to multiple planes by means of disjoint paths may 2463 have to choose the correct plane already at the very bottom of the 2464 fabric to make sure that they don't send traffic towards another leaf 2465 using a plane where it is "fallen", at which point a blackhole is 2466 unavoidable. 2468 When the connectivity is restored, a node that disaggregated a prefix 2469 withdraws the negative disaggregation by the usual mechanism of 2470 re-advertising TIEs omitting the negative prefix. 2472 5.2.5.2.3. Computation of Negative Disaggregates 2474 The document has so far omitted the description of the computation 2475 necessary to generate the correct set of negative prefixes. Negative 2476 prefixes can in fact be advertised due to two different triggers. We 2477 describe them consecutively. 2479 The first origination reason is a computation that uses all the node 2480 N-TIEs to build the set of all reachable nodes by reachability 2481 computation over the complete graph, including ToF links. The 2482 computation uses the node itself as root. This is compared with the 2483 result of the normal southbound SPF as described in Section 5.2.4.2.
2484 The difference is the set of fallen leafs, and all their attached prefixes 2485 are advertised as negative prefixes southbound if the node does not 2486 see the prefix as reachable within the southbound SPF. 2488 The second mechanism hinges on the understanding of how the negative 2489 prefixes are used within the computation as described in Figure 17. 2490 When attaching the negative prefixes, at a certain point in time the 2491 negative prefix may find itself with all the viable nodes from the 2492 shorter match nexthop being pruned. In other words, all its 2493 northbound neighbors provided a negative prefix advertisement. This 2494 is the trigger to advertise this negative prefix transitively south; 2495 it is normally caused by the node being in a plane where the prefix 2496 belongs to a fabric leaf that has "fallen" in this plane. Obviously, 2497 when one of the northbound switches withdraws its negative 2498 advertisement, the node has to withdraw its transitively provided 2499 negative prefix as well. 2501 5.2.6. Attaching Prefixes 2503 After SPF is run, it is necessary to attach the resulting 2504 reachability information in the form of prefixes. For S-SPF, prefixes 2505 from an N-TIE are attached to the originating node with that node's 2506 next-hop set and a distance equal to the prefix's cost plus the 2507 node's minimized path distance. The RIFT route database, a set of 2508 (prefix, prefix-type, attributes, path_distance, next-hop set), 2509 accumulates these results. 2511 In case of N-SPF, prefixes from each S-TIE need to also be added to 2512 the RIFT route database. The N-SPF is really just a stub so the 2513 computing node needs simply to determine, for each prefix in an S-TIE 2514 that originated from an adjacent node, what next-hops to use to reach 2515 that node.
Since there may be parallel links, the next-hops to use 2516 can be a set; presence of the computing node in the associated Node 2517 S-TIE is sufficient to verify that at least one link has 2518 bidirectional connectivity. The set of minimum cost next-hops from 2519 the computing node X to the originating adjacent node is determined. 2521 Each prefix has its cost adjusted before being added into the RIFT 2522 route database. The cost of the prefix is set to the cost received 2523 plus the cost of the minimum distance next-hop to that neighbor, while 2524 taking into account its attributes such as mobility per Section 5.3.3 2525 as necessary. Then each prefix can be added into the RIFT route 2526 database with the next_hop_set; ties are broken based upon type first 2527 and then distance and further attributes, and only the best 2528 combination is used for forwarding. RIFT route preferences are 2529 normalized by the according Thrift [thrift] model type. 2531 An example implementation for node X follows: 2533 for each S-TIE 2534 if S-TIE.level > X.level 2535 next_hop_set = set of minimum cost links to the S-TIE.originator 2536 next_hop_cost = minimum cost link to S-TIE.originator 2537 end if 2538 for each prefix P in the S-TIE 2539 P.cost = P.cost + next_hop_cost 2540 if P not in route_database: 2541 add (P, P.cost, P.type, P.attributes, next_hop_set) to route_database 2542 end if 2543 if (P in route_database): 2544 if route_database[P].cost > P.cost or route_database[P].type > P.type: 2545 update route_database[P] with (P, P.type, P.cost, P.attributes, next_hop_set) 2546 else if route_database[P].cost == P.cost and route_database[P].type == P.type: 2547 update route_database[P] with (P, P.type, P.cost, P.attributes, 2548 merge(next_hop_set, route_database[P].next_hop_set)) 2549 else 2550 // Not preferred route so ignore 2551 end if 2552 end if 2553 end for 2554 end for 2556 Figure 17: Adding Routes from S-TIE Positive and Negative Prefixes 2558 After the positive
prefixes are attached and tie-broken, negative 2559 prefixes are attached and used in case of northbound computation, 2560 ideally from the shortest length to the longest. The nexthop 2561 adjacencies for a negative prefix are inherited from the longest 2562 prefix that aggregates it, and subsequently adjacencies to nodes that 2563 advertised negative for this prefix are removed. 2565 The rule of inheritance MUST be maintained when the nexthop list for 2566 a prefix is modified, as the modification may affect the entries for 2567 matching negative prefixes of immediately longer prefix length. For 2568 instance, if a nexthop is added, then by inheritance it must be added 2569 to all the negative routes of immediately longer prefix length unless 2570 it is pruned due to a negative advertisement for the same next hop. 2571 Similarly, if a nexthop is deleted for a given prefix, then it is 2572 deleted for all the immediately aggregated negative routes. This 2573 will recurse in the case of nested negative prefix aggregations. 2575 The rule of inheritance must also be maintained when a new prefix of 2576 intermediate length is inserted, or when the immediately aggregating 2577 prefix is deleted from the routing table, making an even shorter 2578 aggregating prefix the one from which the negative routes now inherit 2579 their adjacencies. As the aggregating prefix changes, all the 2580 negative routes must be recomputed, and then again the process may 2581 recurse in case of nested negative prefix aggregations. 2583 Although these operations can be computationally expensive, the 2584 overall load on devices in the network is low because these 2585 computations are not run very often, as positive route advertisements 2586 are always preferred over negative ones. This prevents recursion in 2587 most cases because positive reachability information never inherits 2588 next hops.
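As a minimal sketch of the inheritance rule, the following derives the next hops for one negative prefix from its longest aggregating prefix in an abstract FIB keyed by `ipaddress` networks; the data model is an illustrative assumption:

```python
import ipaddress

def install_negative(fib, neg_prefix, advertisers):
    """Sketch of attaching one negative prefix: inherit the next hops
    of the longest aggregating (shorter) prefix already present in
    `fib` (a dict mapping ip networks to sets of next hops), then
    punch holes for the nodes that advertised the prefix negatively."""
    neg = ipaddress.ip_network(neg_prefix)
    # Find the longest prefix in the FIB that strictly contains `neg`.
    aggregates = [p for p in fib if p != neg and neg.subnet_of(p)]
    parent = max(aggregates, key=lambda p: p.prefixlen)
    # Inherited next hops minus the negative advertisers.
    fib[neg] = fib[parent] - set(advertisers)
    return fib[neg]
```

Replaying the T1 example that follows: with a default route via S1..S4, a negative 2001:db8::/32 from S1 leaves {S2, S3, S4}, and a subsequent negative 2001:db8:1::/48 from S2 then inherits from the /32 and leaves {S3, S4}.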
2590 To make the negative disaggregation less abstract and provide an 2591 example, let us consider a ToP node T1 with 4 ToF parents S1..S4 as 2592 represented in Figure 18: 2594 +----+ +----+ +----+ +----+ N 2595 | S1 | | S2 | | S3 | | S4 | ^ 2596 +----+ +----+ +----+ +----+ W< + >E 2597 | | | | v 2598 |+--------+ | | S 2599 ||+-----------------+ | 2600 |||+----------------+---------+ 2601 |||| 2602 +----+ 2603 | T1 | 2604 +----+ 2606 Figure 18: A ToP node with 4 parents 2608 If all ToF nodes can reach all the prefixes in the network, they will, with 2609 RIFT, normally advertise a default route south. An 2610 abstract Routing Information Base (RIB), more commonly known as a 2611 routing table, stores all types of maintained routes including the 2612 negative ones and "tie-breaks" for the best one, whereas an abstract 2613 Forwarding table (FIB) retains only the ultimately computed 2614 "positive" routing instructions. In T1, those tables would look as 2615 illustrated in Figure 19: 2617 +---------+ 2618 | Default | 2619 +---------+ 2620 | 2621 | +--------+ 2622 +---> | Via S1 | 2623 | +--------+ 2624 | 2625 | +--------+ 2626 +---> | Via S2 | 2627 | +--------+ 2628 | 2629 | +--------+ 2630 +---> | Via S3 | 2631 | +--------+ 2632 | 2633 | +--------+ 2634 +---> | Via S4 | 2635 +--------+ 2637 Figure 19: Abstract RIB 2639 In case T1 receives a negative advertisement for prefix 2001:db8::/32 2640 from S1, a negative route is stored in the RIB (indicated by a ~ 2641 sign), while the more specific routes to the complementing ToF nodes 2642 are installed in FIB.
RIB and FIB in T1 now look as illustrated in 2643 Figure 20 and Figure 21, respectively: 2645 +---------+ +-----------------+ 2646 | Default | <-------------- | ~2001:db8::/32 | 2647 +---------+ +-----------------+ 2648 | | 2649 | +--------+ | +--------+ 2650 +---> | Via S1 | +---> | Via S1 | 2651 | +--------+ +--------+ 2652 | 2653 | +--------+ 2654 +---> | Via S2 | 2655 | +--------+ 2656 | 2657 | +--------+ 2658 +---> | Via S3 | 2659 | +--------+ 2660 | 2661 | +--------+ 2662 +---> | Via S4 | 2663 +--------+ 2665 Figure 20: Abstract RIB after negative 2001:db8::/32 from S1 2667 The negative 2001:db8::/32 prefix entry inherits from ::/0, so the 2668 positive more specific routes are the complements to S1 in the set of 2669 next-hops for the default route. That entry is composed of S2, S3, 2670 and S4, or, in other words, it uses all entries of the default route 2671 with a "hole punched" for S1 into them. These are the next hops that 2672 are still available to reach 2001:db8::/32, now that S1 advertised 2673 that it will not forward 2001:db8::/32 anymore. Ultimately, those 2674 resulting next-hops are installed in FIB for the more specific route 2675 to 2001:db8::/32 as illustrated below: 2677 +---------+ +---------------+ 2678 | Default | | 2001:db8::/32 | 2679 +---------+ +---------------+ 2680 | | 2681 | +--------+ | 2682 +---> | Via S1 | | 2683 | +--------+ | 2684 | | 2685 | +--------+ | +--------+ 2686 +---> | Via S2 | +---> | Via S2 | 2687 | +--------+ | +--------+ 2688 | | 2689 | +--------+ | +--------+ 2690 +---> | Via S3 | +---> | Via S3 | 2691 | +--------+ | +--------+ 2692 | | 2693 | +--------+ | +--------+ 2694 +---> | Via S4 | +---> | Via S4 | 2695 +--------+ +--------+ 2697 Figure 21: Abstract FIB after negative 2001:db8::/32 from S1 2699 To illustrate matters further let us consider T1 receiving a negative 2700 advertisement for prefix 2001:db8:1::/48 from S2, which is stored in 2701 RIB again.
After the update, the RIB in T1 is illustrated in 2702 Figure 22: 2704 +---------+ +----------------+ +------------------+ 2705 | Default | <----- | ~2001:db8::/32 | <------ | ~2001:db8:1::/48 | 2706 +---------+ +----------------+ +------------------+ 2707 | | | 2708 | +--------+ | +--------+ | 2709 +---> | Via S1 | +---> | Via S1 | | 2710 | +--------+ +--------+ | 2711 | | 2712 | +--------+ | +--------+ 2713 +---> | Via S2 | +---> | Via S2 | 2714 | +--------+ +--------+ 2715 | 2716 | +--------+ 2717 +---> | Via S3 | 2718 | +--------+ 2719 | 2720 | +--------+ 2721 +---> | Via S4 | 2722 +--------+ 2724 Figure 22: Abstract RIB after negative 2001:db8:1::/48 from S2 2726 Negative 2001:db8:1::/48 inherits from 2001:db8::/32 now, so the 2727 positive more specific routes are the complements to S2 in the set of 2728 next hops for 2001:db8::/32, which are S3 and S4, or, in other words, 2729 all entries of the parent prefix with the negative holes "punched in" 2730 again. After the update, the FIB in T1 shows as illustrated in Figure 23: 2732 +---------+ +---------------+ +-----------------+ 2733 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 2734 +---------+ +---------------+ +-----------------+ 2735 | | | 2736 | +--------+ | | 2737 +---> | Via S1 | | | 2738 | +--------+ | | 2739 | | | 2740 | +--------+ | +--------+ | 2741 +---> | Via S2 | +---> | Via S2 | | 2742 | +--------+ | +--------+ | 2743 | | | 2744 | +--------+ | +--------+ | +--------+ 2745 +---> | Via S3 | +---> | Via S3 | +---> | Via S3 | 2746 | +--------+ | +--------+ | +--------+ 2747 | | | 2748 | +--------+ | +--------+ | +--------+ 2749 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 2750 +--------+ +--------+ +--------+ 2752 Figure 23: Abstract FIB after negative 2001:db8:1::/48 from S2 2754 Further, let us say that S3 stops advertising its service as default 2755 gateway. The entry is removed from RIB as usual.
In order to update 2756 the FIB, it is necessary to eliminate the FIB entry for the default 2757 route, as well as all the FIB entries that were created for negative 2758 routes pointing to the RIB entry being removed (::/0). This is done 2759 recursively for 2001:db8::/32 and then for 2001:db8:1::/48. The 2760 related FIB entries via S3 are removed, as illustrated in Figure 24. 2762 +---------+ +---------------+ +-----------------+ 2763 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 2764 +---------+ +---------------+ +-----------------+ 2765 | | | 2766 | +--------+ | | 2767 +---> | Via S1 | | | 2768 | +--------+ | | 2769 | | | 2770 | +--------+ | +--------+ | 2771 +---> | Via S2 | +---> | Via S2 | | 2772 | +--------+ | +--------+ | 2773 | | | 2774 | | | 2775 | | | 2776 | | | 2777 | | | 2778 | +--------+ | +--------+ | +--------+ 2779 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 2780 +--------+ +--------+ +--------+ 2782 Figure 24: Abstract FIB after loss of S3 2784 Say that at that time, S4 would also disaggregate prefix 2785 2001:db8:1::/48. This would mean that the FIB entry for 2786 2001:db8:1::/48 becomes a discard route, and that would be the signal 2787 for T1 to disaggregate prefix 2001:db8:1::/48 negatively in a 2788 transitive fashion with its own children. 2790 Finally, let us look at the case where S3 becomes available again as 2791 a default gateway, and a negative advertisement is received from S4 2792 about prefix 2001:db8:2::/48 as opposed to 2001:db8:1::/48. Again, a 2793 negative route is stored in the RIB, and the more specific routes to 2794 the complementing ToF nodes are installed in FIB. Since 2795 2001:db8:2::/48 inherits from 2001:db8::/32, the positive FIB routes 2796 are chosen by removing S4 from S2, S3, S4.
The abstract FIB in T1 2797 now shows as illustrated in Figure 25: 2799 +-----------------+ 2800 | 2001:db8:2::/48 | 2801 +-----------------+ 2802 | 2803 +---------+ +---------------+ +-----------------+ 2804 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 2805 +---------+ +---------------+ +-----------------+ 2806 | | | | 2807 | +--------+ | | | +--------+ 2808 +---> | Via S1 | | | +---> | Via S2 | 2809 | +--------+ | | | +--------+ 2810 | | | | 2811 | +--------+ | +--------+ | | +--------+ 2812 +---> | Via S2 | +---> | Via S2 | | +---> | Via S3 | 2813 | +--------+ | +--------+ | +--------+ 2814 | | | 2815 | +--------+ | +--------+ | +--------+ 2816 +---> | Via S3 | +---> | Via S3 | +---> | Via S3 | 2817 | +--------+ | +--------+ | +--------+ 2818 | | | 2819 | +--------+ | +--------+ | +--------+ 2820 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 2821 +--------+ +--------+ +--------+ 2823 Figure 25: Abstract FIB after negative 2001:db8:2::/48 from S4 2825 5.2.7. Optional Zero Touch Provisioning (ZTP) 2827 Each RIFT node can operate in zero touch provisioning (ZTP) mode, 2828 i.e. it has no configuration (unless it is a Top-of-Fabric at the top 2829 of the topology or it must operate in the topology as leaf and/or 2830 support leaf-2-leaf procedures) and it will fully configure itself 2831 after being attached to the topology. Configured nodes and nodes 2832 operating in ZTP can be mixed and will form a valid topology if 2833 achievable. 2835 The derivation of the level of each node happens based on offers 2836 received from its neighbors whereas each node (with the possible 2837 exception of configured leafs) tries to attach at the highest 2838 possible point in the fabric. This guarantees that even if the 2839 diffusion front reaches a node from "below" faster than from "above", 2840 it will greedily abandon an already negotiated level derived from nodes 2841 topologically below it and properly peer with nodes above.
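The greedy derivation described above boils down to the MAX(HAL-1,0) rule formalized in Section 5.2.7.4. A minimal, non-normative Python sketch follows; the representation of offers is illustrative, and details such as `not_a_ztp_offer` handling, holddown timers, and adjacency resets are deliberately omitted:

```python
def derived_level(configured_level, offered_levels):
    """Sketch of ZTP level derivation. A node with a CONFIGURED_LEVEL does
    not participate in ZTP; otherwise HAL is the highest level among Valid
    Offered Levels (offers of level 0 never constitute VOLs per the
    terminology section) and the node derives MAX(HAL - 1, 0)."""
    if configured_level is not None:
        return configured_level
    vols = [level for level in offered_levels if level is not None and level > 0]
    if not vols:
        return None          # level stays UNDEFINED_LEVEL, keep listening
    hal = max(vols)
    return max(hal - 1, 0)

# A node hearing a Top-of-Fabric node at level 24 attaches directly below it:
assert derived_level(None, [24, 23]) == 23
# Offers of level 0 are not VOLs, so no level can be derived from them alone:
assert derived_level(None, [0, 0]) is None
# A configured node keeps its configured level regardless of offers:
assert derived_level(7, [24]) == 7
```

This illustrates why the diffusion front from "below" cannot pin a node at a low level: a later, higher VOL raises HAL and the node re-derives and re-advertises accordingly.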
2843 The fabric is very consciously numbered from the top to allow for PoDs 2844 of different heights and to minimize the amount of provisioning necessary, 2845 in this case just a TOP_OF_FABRIC flag on every node at the top of 2846 the fabric. 2848 This section describes the necessary concepts and procedures for ZTP 2849 operation. 2851 5.2.7.1. Terminology 2853 The interdependencies between the different flags and the configured 2854 level can be somewhat vexing at first and it may take multiple reads 2855 of the glossary to comprehend them. 2857 Automatic Level Derivation: Procedures which allow nodes without a 2858 configured level to derive it automatically. Only applied if 2859 CONFIGURED_LEVEL is undefined. 2861 UNDEFINED_LEVEL: A "null" value that indicates that the level has 2862 not been determined and has not been configured. Schemas 2863 normally indicate that by a missing optional value without an 2864 available defined default. 2866 LEAF_ONLY: An optional configuration flag that can be configured on 2867 a node to make sure it never leaves the "bottom of the hierarchy". 2868 TOP_OF_FABRIC flag and CONFIGURED_LEVEL cannot be defined at the 2869 same time as this flag. It implies a CONFIGURED_LEVEL value of 0. 2871 TOP_OF_FABRIC flag: Configuration flag that MUST be provided to all 2872 Top-of-Fabric nodes. LEAF_ONLY and CONFIGURED_LEVEL cannot be 2873 defined at the same time as this flag. It implies a 2874 CONFIGURED_LEVEL value. In fact, it is basically a shortcut for 2875 configuring the same level at all Top-of-Fabric nodes which is 2876 unavoidable since an initial 'seed' is needed for other ZTP nodes 2877 to derive their level in the topology. The flag plays an 2878 important role in fabrics with multiple planes to enable 2879 successful negative disaggregation (Section 5.2.5.2). 2881 CONFIGURED_LEVEL: A level value provided manually. When this is 2882 defined (i.e. it is not an UNDEFINED_LEVEL) the node is not 2883 participating in ZTP.
The TOP_OF_FABRIC flag is ignored when this 2884 value is defined. LEAF_ONLY can be set only if this value is 2885 undefined or set to 0. 2887 DERIVED_LEVEL: Level value computed via automatic level derivation 2888 when CONFIGURED_LEVEL is equal to UNDEFINED_LEVEL. 2890 LEAF_2_LEAF: An optional flag that can be configured on a node to 2891 make sure it supports procedures defined in Section 5.3.9. In a 2892 strict sense it is a capability that implies LEAF_ONLY and the 2893 according restrictions. The TOP_OF_FABRIC flag is ignored when set at 2894 the same time as this flag. 2896 LEVEL_VALUE: In the ZTP case the original definition of "level" in 2897 Section 3.1 is both extended and relaxed. First, level is defined 2898 now as LEVEL_VALUE and is the first defined value of 2899 CONFIGURED_LEVEL followed by DERIVED_LEVEL. Second, it is 2900 possible for nodes more than one level apart to form 2901 adjacencies if any of the nodes is at least LEAF_ONLY. 2903 Valid Offered Level (VOL): A neighbor's level received on a valid 2904 LIE (i.e. passing all checks for adjacency formation while 2905 disregarding all clauses involving level values) persisting for 2906 the duration of the holdtime interval on the LIE. Observe that 2907 offers from nodes offering a level value of 0 do not constitute VOLs 2908 (since no valid DERIVED_LEVEL can be obtained from those and 2909 consequently `not_a_ztp_offer` MUST be ignored). Offers from LIEs 2910 with `not_a_ztp_offer` being true are not VOLs either. If a node 2911 maintains parallel adjacencies to the neighbor, the VOL on each 2912 adjacency is considered as equivalent, i.e. the newest VOL from 2913 any such adjacency updates the VOL received from the same node. 2915 Highest Available Level (HAL): Highest defined level value seen from 2916 all VOLs received. 2918 Highest Available Level Systems (HALS): Set of nodes offering HAL 2919 VOLs.
2921 Highest Adjacency Three Way (HAT): Highest neighbor level of all the 2922 formed three way adjacencies for the node. 2924 5.2.7.2. Automatic SystemID Selection 2926 RIFT nodes require a 64 bit SystemID which SHOULD be derived in 2927 EUI-64 MA-L format according to [EUI64]. The organizationally 2928 governed portion of this ID (24 bits) can be used to generate 2929 multiple IDs if required to indicate more than one RIFT instance. 2931 As a matter of operational concern, the router MUST ensure that such 2932 an identifier does not change very frequently (or at least not without 2933 sending all its TIEs with fairly short lifetimes) since otherwise the 2934 network may be left with large amounts of stale TIEs in other nodes 2935 (though this is not necessarily a serious problem if the procedures 2936 described in Section 8 are implemented). 2938 5.2.7.3. Generic Fabric Example 2940 ZTP forces us to think about miscabled or unusually cabled fabrics and 2941 how such a topology can be forced into the "lattice" structure which a 2942 fabric represents (with further restrictions). Let us consider the 2943 necessary and sufficient physical cabling in Figure 26. We assume 2944 all nodes are in the same PoD. 2946 . +---+ 2947 . | A | s = TOP_OF_FABRIC 2948 . | s | l = LEAF_ONLY 2949 . ++-++ l2l = LEAF_2_LEAF 2950 . | | 2951 . +--+ +--+ 2952 . | | 2953 . +--++ ++--+ 2954 . | E | | F | 2955 . | +-+ | +-----------+ 2956 . ++--+ | ++-++ | 2957 . | | | | | 2958 . | +-------+ | | 2959 . | | | | | 2960 . | | +----+ | | 2961 . | | | | | 2962 . ++-++ ++-++ | 2963 . | I +-----+ J | | 2964 . | | | +-+ | 2965 . ++-++ +--++ | | 2966 . | | | | | 2967 . +---------+ | +------+ | 2968 . | | | | | 2969 . +-----------------+ | | 2970 . | | | | | 2971 . ++-++ ++-++ | 2972 . | X +-----+ Y +-+ 2973 . |l2l| | l | 2974 . +---+ +---+ 2976 Figure 26: Generic ZTP Cabling Considerations 2978 First, we must anchor the "top" of the cabling and that's what the 2979 TOP_OF_FABRIC flag at node A is for.
Then things look smooth until 2980 we have to decide whether node Y is at the same level as I, J or at 2981 the same level as F and consequently, X is south of it. This is 2982 unresolvable here until we "nail down the bottom" of the topology. 2983 To achieve that, in this example we choose to use the leaf flags. We 2984 will then see whether Y chooses to form adjacencies to F or 2985 I, J successively. 2987 5.2.7.4. Level Determination Procedure 2989 A node starting up with UNDEFINED_LEVEL (i.e. without a 2990 CONFIGURED_LEVEL or any leaf or TOP_OF_FABRIC flag) MUST follow these 2991 additional procedures: 2993 1. It advertises its LEVEL_VALUE on all LIEs (observe that this can 2994 be UNDEFINED_LEVEL which in terms of the schema is simply an 2995 omitted optional value). 2997 2. It computes HAL as the numerically highest available level in all 2998 VOLs. 3000 3. It then chooses MAX(HAL-1,0) as its DERIVED_LEVEL. The node then 3001 starts to advertise this derived level. 3003 4. A node that lost all adjacencies with HAL value MUST hold down 3004 computation of a new DERIVED_LEVEL for a short period of time 3005 unless it has no VOLs from southbound adjacencies. After the 3006 holddown expired, it MUST discard all received offers, recompute 3007 DERIVED_LEVEL and announce it to all neighbors. 3009 5. A node MUST reset any adjacency that has changed the level it is 3010 offering and is in three way state. 3012 6. A node that changed its defined level value MUST readvertise its 3013 own TIEs (since the new `PacketHeader` will contain a different 3014 level than before). The sequence number of each TIE MUST be 3015 increased. 3017 7. After a level has been derived the node MUST set the 3018 `not_a_ztp_offer` on LIEs towards all systems offering a VOL for 3019 HAL. 3021 8. A node that changed its level SHOULD flush from its link state 3022 database TIEs of all other nodes, otherwise stale information may 3023 persist on "direction reversal", i.e.
nodes that seemed south 3024 are now north or east-west. This will not prevent the correct 3025 operation of the protocol but could be slightly confusing 3026 operationally. 3028 A node starting with LEVEL_VALUE being 0 (i.e. it assumes a leaf 3029 function by being configured with the appropriate flags or has a 3030 CONFIGURED_LEVEL of 0) MUST follow these additional procedures: 3032 1. It computes HAT per the procedures above but does NOT use it to 3033 compute DERIVED_LEVEL. HAT is used to limit adjacency formation 3034 per Section 5.2.2. 3036 It MAY also follow modified procedures: 3038 1. It may pick a different strategy to choose a VOL, e.g. use the VOL 3039 value with the highest number of VOLs. Such strategies are only 3040 possible since the node always remains "at the bottom of the 3041 fabric" while another layer could "invert" the fabric by picking 3042 its preferred VOL in a different fashion than always trying to 3043 achieve the highest viable level. 3045 5.2.7.5. Resulting Topologies 3047 The procedures defined in Section 5.2.7.4 will lead to the RIFT 3048 topology and levels depicted in Figure 27. 3050 . +---+ 3051 . | As| 3052 . | 24| 3053 . ++-++ 3054 . | | 3055 . +--+ +--+ 3056 . | | 3057 . +--++ ++--+ 3058 . | E | | F | 3059 . | 23+-+ | 23+-----------+ 3060 . ++--+ | ++-++ | 3061 . | | | | | 3062 . | +-------+ | | 3063 . | | | | | 3064 . | | +----+ | | 3065 . | | | | | 3066 . ++-++ ++-++ | 3067 . | I +-----+ J | | 3068 . | 22| | 22| | 3069 . ++--+ +--++ | 3070 . | | | 3071 . +---------+ | | 3072 . | | | 3073 . ++-++ +---+ | 3074 . | X | | Y +-+ 3075 . | 0 | | 0 | 3076 . +---+ +---+ 3078 Figure 27: Generic ZTP Topology Autoconfigured 3080 In case we imagine the LEAF_ONLY restriction on Y is removed, the 3081 outcome would be very different and result in Figure 28.
3082 This demonstrates that autoconfiguration makes miscabling 3083 detection hard and can with that lead to undesirable effects in cases 3084 where leafs are not "nailed down" by the according flags and are 3085 arbitrarily cabled. 3087 A node MAY analyze the outstanding level offers on its interfaces and 3088 generate warnings when its internal ruleset flags a possible 3089 miscabling. As an example, when a node sees ZTP level offers that 3090 differ by more than one level from its chosen level (with proper 3091 accounting for leafs being at level 0) this can indicate miscabling. 3093 . +---+ 3094 . | As| 3095 . | 24| 3096 . ++-++ 3097 . | | 3098 . +--+ +--+ 3099 . | | 3100 . +--++ ++--+ 3101 . | E | | F | 3102 . | 23+-+ | 23+-------+ 3103 . ++--+ | ++-++ | 3104 . | | | | | 3105 . | +-------+ | | 3106 . | | | | | 3107 . | | +----+ | | 3108 . | | | | | 3109 . ++-++ ++-++ +-+-+ 3110 . | I +-----+ J +-----+ Y | 3111 . | 22| | 22| | 22| 3112 . ++-++ +--++ ++-++ 3113 . | | | | | 3114 . | +-----------------+ | 3115 . | | | 3116 . +---------+ | | 3117 . | | | 3118 . ++-++ | 3119 . | X +--------+ 3120 . | 0 | 3121 . +---+ 3123 Figure 28: Generic ZTP Topology Autoconfigured 3125 5.2.8. Stability Considerations 3127 The autoconfiguration mechanism computes a global maximum of levels 3128 by diffusion. The achieved equilibrium can be disturbed massively by 3129 all nodes with the highest level either leaving or entering the domain 3130 (with some finer distinctions not explained further). It is 3131 therefore recommended that each node is multi-homed towards nodes 3132 with respective HAL offerings. Fortunately, this is the natural 3133 state of things for the topology variants considered in RIFT. 3135 5.3. Further Mechanisms 3137 5.3.1. Overload Bit 3139 The overload bit MUST be respected in all according reachability 3140 computations. A node with the overload bit set SHOULD NOT advertise any 3141 reachability prefixes southbound except locally hosted ones.
A node 3142 in overload SHOULD advertise all its locally hosted prefixes north 3143 and southbound. 3145 The leaf node SHOULD set the 'overload' bit on its node TIEs, since 3146 if the spine nodes were to forward traffic not meant for the local 3147 node, the leaf node does not have the topology information to prevent 3148 a routing/forwarding loop. 3150 5.3.2. Optimized Route Computation on Leafs 3152 Since the leafs see only "one hop away", they do not need to run a 3153 "proper" SPF. Instead, they can gather the available prefix 3154 candidates from their neighbors and build the routing table 3155 accordingly. 3157 A leaf will have no N-TIEs except its own and optionally from its 3158 East-West neighbors. A leaf will have S-TIEs from its neighbors. 3160 Instead of creating a network graph from its N-TIEs and neighbor's 3161 S-TIEs and then running an SPF, a leaf node can simply compute the 3162 minimum cost and next_hop_set to each leaf neighbor by examining its 3163 local adjacencies, determining bi-directionality from the associated 3164 N-TIE, and specifying the neighbor's next_hop_set and cost from 3165 the minimum cost local adjacency to that neighbor. 3167 Then a leaf attaches prefixes as described in Section 5.2.6. 3169 5.3.3. Mobility 3171 It is a requirement for RIFT to maintain at the control plane a real 3172 time status of which prefix is attached to which port of which leaf, 3173 even in a context of mobility where the point of attachment may 3174 change several times in a subsecond period of time. 3176 There are two classical approaches to maintain such knowledge in an 3177 unambiguous fashion: 3179 time stamp: With this method, the infrastructure records the precise 3180 time at which the movement is observed. One key advantage of this 3181 technique is that it has no dependency on the mobile device.
One 3182 drawback is that the infrastructure must be precisely synchronized 3183 to be able to compare time stamps as observed by the various 3184 points of attachment, e.g., using the variation of the Precision 3185 Time Protocol (PTP) IEEE Std. 1588 [IEEEstd1588] 3186 designed for bridged LANs, IEEE Std. 802.1AS [IEEEstd8021AS]. Both 3187 the precision of the synchronization protocol and the resolution 3188 of the time stamp must beat the highest possible roaming time on 3189 the fabric. Another drawback is that the presence of the mobile 3190 device may be observed only asynchronously, e.g., after it starts 3191 using an IP protocol such as ARP [RFC0826], IPv6 Neighbor 3192 Discovery [RFC4861][RFC4862], or DHCP [RFC2131][RFC8415]. 3194 sequence counter: With this method, a mobile node notifies its point 3195 of attachment on arrival with a sequence counter that is 3196 incremented upon each movement. On the positive side, this method 3197 does not have a dependency on a precise sense of time, since the 3198 sequence of movements is kept in order by the device. The 3199 disadvantage of this approach is that protocols 3200 that may be used by the mobile node to register its presence to 3201 the leaf node may lack the capability to provide a sequence counter. 3202 Well-known issues with wrapping sequence counters must be 3203 addressed properly, and many forms of sequence counters exist that vary 3204 in both wrapping rules and comparison rules. A particular 3205 knowledge of the source of the sequence counter is required to 3206 operate it, and the comparison between sequence counters from 3207 heterogeneous sources can be hard or even impossible. 3209 RIFT supports a hybrid approach contained in an optional 3210 `PrefixSequenceType` prefix attribute that we call a `monotonic 3211 clock` consisting of a timestamp and optional sequence number.
In 3212 case of presence of the attribute: 3214 o The leaf node MAY advertise a time stamp of the latest sighting of 3215 a prefix, e.g., by snooping IP protocols or the node using the 3216 time at which it advertised the prefix. RIFT transports the time 3217 stamp within the desired prefix N-TIEs as an 802.1AS timestamp. 3219 o RIFT may interoperate with the "update to 6LoWPAN Neighbor 3220 Discovery" [RFC8505], which provides a method for registering a 3221 prefix with a sequence counter called a Transaction ID (TID). 3222 RIFT transports in such case the TID in its native form. 3224 o RIFT also defines an abstract negative clock (ASNC) that compares 3225 as less than any other clock. By default, the lack of a 3226 `PrefixSequenceType` in a Prefix N-TIE is interpreted as ASNC. We 3227 call this also an `undefined` clock. 3229 o Any prefix present on the fabric in multiple nodes that has the 3230 `same` clock is considered as anycast. ASNC is always considered 3231 smaller than any defined clock. 3233 o A RIFT implementation assumes by default that all nodes are being 3234 synchronized to 200 milliseconds precision which is easily 3235 achievable even in very large fabrics using [RFC5905]. An 3236 implementation MAY provide a way to reconfigure a domain to a 3237 different value. We call this variable MAXIMUM_CLOCK_DELTA. 3239 5.3.3.1. Clock Comparison 3241 All monotonic clock values are comparable to each other using the 3242 following rules: 3244 1. ASNC is older than any other value except ASNC AND 3246 2. Clocks with timestamps differing by more than MAXIMUM_CLOCK_DELTA 3247 are comparable by using the timestamps only AND 3249 3. Clocks with timestamps differing by less than MAXIMUM_CLOCK_DELTA 3250 are comparable by using their TIDs only AND 3252 4. An undefined TID is always older than any other TID AND 3254 5. TIDs are compared using the rules of [RFC8505]. 3256 5.3.3.2.
Interaction between Time Stamps and Sequence Counters 3258 For slow movements that occur less frequently than e.g. once per 3259 second, the time stamp that the RIFT infrastructure captures is enough 3260 to determine the freshest discovery. If the point of attachment 3261 changes faster than the maximum drift of the time stamping mechanism 3262 (i.e. MAXIMUM_CLOCK_DELTA), then a sequence counter is required to 3263 add resolution to the freshness evaluation, and it must be sized so 3264 that the counters stay comparable within the resolution of the time 3265 stamping mechanism. 3267 The sequence counter in [RFC8505] is encoded as one octet and wraps 3268 around using the rules in Appendix A. 3270 Within the resolution of MAXIMUM_CLOCK_DELTA the sequence counters 3271 captured during 2 sequential values of the time stamp SHOULD be 3272 comparable. This means with default values that a node may move up 3273 to 127 times during a 200 milliseconds period and the clocks still 3274 remain comparable, thus allowing the infrastructure to assert the 3275 freshest advertisement with no ambiguity. 3277 5.3.3.3. Anycast vs. Unicast 3279 A unicast prefix can be attached to at most one leaf, whereas an 3280 anycast prefix may be reachable via more than one leaf. 3282 If a monotonic clock attribute is provided on the prefix, then the 3283 prefix with the `newest` clock value is strictly preferred. An 3284 anycast prefix either does not carry a clock or all clock attributes MUST be 3285 the same under the rules of Section 5.3.3.1. 3287 Observe that it is important that in mobility events the leaf re- 3288 floods as quickly as possible the absence of the prefix that moved 3289 away. 3291 Observe further that without support for [RFC8505] movements on the 3292 fabric within intervals smaller than 100msec will be seen as anycast. 3294 5.3.3.4.
Overlays and Signaling 3296 RIFT is agnostic to whether any overlay technology like [MIP, LISP, 3297 VxLAN, NVO3] and the associated signaling is deployed over it. But 3298 it is expected that leaf nodes, and possibly Top-of-Fabric nodes, can 3299 perform the correct encapsulation. 3301 In the context of mobility, overlays provide a classical solution to 3302 avoid injecting mobile prefixes into the fabric and improve the 3303 scalability of the solution. It makes sense in a data center that 3304 already uses overlays to consider their applicability to the mobility 3305 solution; as an example, a mobility protocol such as LISP may inform 3306 the ingress leaf of the location of the egress leaf in real time. 3308 Another possibility is to consider mobility as an underlay 3309 service and support it in RIFT to an extent. The load on the fabric 3310 obviously augments with the amount of mobility, since a move forces 3311 flooding and computation on all nodes in the scope of the move, so 3312 tunneling from the leaf to the Top-of-Fabric may be desired. Future 3313 versions of this document may describe support for such tunneling in 3314 RIFT. 3316 5.3.4. Key/Value Store 3318 5.3.4.1. Southbound 3320 The protocol supports a southbound distribution of key-value pairs 3321 that can be used to e.g. distribute configuration information during 3322 topology bring-up. The KV S-TIEs can arrive from multiple nodes and 3323 hence need tie-breaking per key. We use the following rules: 3325 1. Only KV TIEs originated by nodes to which the receiver has a bi- 3326 directional adjacency are considered. 3328 2. Within all such valid KV S-TIEs containing the key, the value of 3329 the KV S-TIE whose according node S-TIE is present, carries 3330 the highest level and, within the same level, the highest 3331 originating system ID, is preferred. If keys in the most 3332 preferred TIEs are overlapping, the behavior is undefined.
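The two tie-breaking rules above can be sketched as follows. This is a minimal, non-normative Python sketch; the tuple representation of a KV S-TIE is illustrative and not the schema encoding:

```python
def select_kv_value(key, kv_sties, bidir_neighbors):
    """Pick the winning value for `key` among southbound KV S-TIEs:
    consider only originators with a bi-directional adjacency, then
    prefer the highest originator level and, within the same level,
    the highest originating system ID."""
    candidates = [
        (level, system_id, kvs[key])
        for (system_id, level, kvs) in kv_sties
        if system_id in bidir_neighbors and key in kvs
    ]
    if not candidates:
        return None
    # max() over (level, system_id) tuples implements the preference order
    return max(candidates)[2]

sties = [
    (1, 10, {"ntp": "10.0.0.1"}),   # (system_id, originator level, key/values)
    (2, 20, {"ntp": "10.0.0.2"}),
    (3, 20, {"ntp": "10.0.0.3"}),
]
assert select_kv_value("ntp", sties, {1, 2, 3}) == "10.0.0.3"  # highest level, then highest ID
assert select_kv_value("ntp", sties, {1, 2}) == "10.0.0.2"     # node 3 lacks bi-directional adjacency
```

Note how losing the adjacency to the currently preferred originator immediately changes the tie-break result, which is why the tie-break must be re-run on adjacency changes as described below.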
3334 Observe that if a node goes down, the node south of it loses 3335 adjacencies to it and with that the KVs will be disregarded and, on 3336 tie-break changes, the new KV re-advertised to prevent stale information 3337 being used by nodes further south. KV information in the southbound 3338 direction is not the result of an independent computation by every node over 3339 the same set of TIEs but a diffused computation. 3341 5.3.4.2. Northbound 3343 Certain use cases seem to necessitate distribution of essentially KV 3344 information that is generated in the leafs in the northbound 3345 direction. Such information is flooded in KV N-TIEs. Since the 3346 originator of a northbound KV is preserved during northbound flooding, 3347 overlapping keys could be used. However, to omit further protocol 3348 complexity, only the value of the key in the TIE tie-broken in the same 3349 fashion as southbound KV TIEs is used. 3351 5.3.5. Interactions with BFD 3353 RIFT MAY incorporate BFD [RFC5881] to react quickly to link failures. 3354 In such case the following procedures are introduced: 3356 After RIFT three way hello adjacency convergence a BFD session MAY 3357 be formed automatically between the RIFT endpoints without further 3358 configuration using the exchanged discriminators. The capability 3359 of the remote side to support BFD is carried on the LIEs. 3361 In case an established BFD session goes Down after it was Up, the RIFT 3362 adjacency should be re-initialized, starting from Init. 3364 In case of parallel links between nodes each link may run its own 3365 independent BFD session or they may share a session. 3367 In case RIFT changes link identifiers or the BFD capability indication, 3368 both the LIE as well as the BFD sessions SHOULD be brought down 3369 and back up again.
3371 Multiple RIFT instances MAY choose to share a single BFD session 3372 (in such case it is undefined what discriminators are used albeit 3373 RIFT CAN advertise the same link ID for the same interface in 3374 multiple instances and with that "share" the discriminators). 3376 BFD TTL follows [RFC5082]. 3378 5.3.6. Fabric Bandwidth Balancing 3380 A well understood problem in fabrics is that in case of link losses 3381 it would be ideal to rebalance how much traffic is offered to 3382 switches in the next level based on the ingress and egress bandwidth 3383 they have. Current attempts rely mostly on specialized traffic 3384 engineering via controllers or on leafs being aware of the complete 3385 topology, with according cost and complexity. 3387 RIFT can support a very lightweight mechanism that can deal with the 3388 problem in an approximate way based on the fact that RIFT is loop- 3389 free. 3391 5.3.6.1. Northbound Direction 3393 Every RIFT node SHOULD compute the amount of northbound bandwidth 3394 available through neighbors at the higher level and modify the distance 3395 received on the default route from these neighbors. Those different 3396 distances SHOULD be used to support weighted ECMP forwarding towards 3397 the higher level when using the default route. We call such a distance 3398 Bandwidth Adjusted Distance or BAD. This is best illustrated by a 3399 simple example. 3401 . 100 x 100 100 MBits 3402 . | x | | 3403 . +-+---+-+ +-+---+-+ 3404 . | | | | 3405 . |Spin111| |Spin112| 3406 . +-+---+++ ++----+++ 3407 . |x || || || 3408 . || |+---------------+ || 3409 . || +---------------+| || 3410 . || || || || 3411 . || || || || 3412 . -----All Links 10 MBit------- 3413 . || || || || 3414 . || || || || 3415 . || +------------+| || || 3416 . || |+------------+ || || 3417 . |x || || || 3418 . +-+---+++ +--++-+++ 3419 . | | | | 3420 . |Leaf111| |Leaf112| 3421 .
+-------+ +-------+

3423 Figure 29: Balancing Bandwidth

3425 All links from leafs in Figure 29 are assumed to be 10 MBit/s bandwidth
3426 while the uplinks one level further up are assumed to be 100 MBit/s.
3427 Further, in Figure 29 we assume that Leaf111 lost one of the parallel
3428 links to Spine 111 and with that wants to possibly push more traffic
3429 onto Spine 112. Leaf 112 has equal bandwidth to Spine 111 and Spine
3430 112 but Spine 111 lost one of its uplinks.

3432 The local modification of the received default route distance from the
3433 upper level is achieved by running a relatively simple algorithm
3434 where the bandwidth is weighted exponentially while the distance on
3435 the default route represents a multiplier for the bandwidth weight
3436 for easy operational adjustments.

3438 On a node L use Node TIEs to compute for each non-overloaded
3439 northbound neighbor N three values:

3441 L_N_u: as the sum of the bandwidth available to N

3443 N_u: as the sum of the uplink bandwidth available on N

3445 T_N_u: as L_N_u * OVERSUBSCRIPTION_CONSTANT + N_u

3447 For all T_N_u determine the according M_N_u as
3448 log_2(next_power_2(T_N_u)) and determine MAX_M_N_u as the maximum value
3449 of all M_N_u.

3451 For each advertised default route from a node N modify the advertised
3452 distance D to BAD = D * (1 + MAX_M_N_u - M_N_u) and use BAD instead
3453 of distance D to weight balance default forwarding towards N.

3455 For the example above a simple table of values will help the
3456 understanding. We assume the default route distance is advertised
3457 with D=1 everywhere and OVERSUBSCRIPTION_CONSTANT = 1.
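As a non-normative illustration, the BAD computation described above can be sketched as follows (the helper and variable names are invented for this sketch; per-neighbor bandwidths are given as (L_N_u, N_u) pairs):

```python
import math

OVERSUBSCRIPTION_CONSTANT = 1

def next_power_2(x):
    # Smallest power of two greater than or equal to x.
    return 1 << max(x - 1, 0).bit_length()

def bad_per_neighbor(bandwidths, d=1):
    """Compute BAD for each northbound neighbor.

    bandwidths maps a neighbor N to (L_N_u, N_u): the bandwidth from
    this node towards N and the sum of N's own uplink bandwidth.
    d is the distance advertised on the default route.
    """
    # M_N_u = log2(next_power_2(T_N_u)) with
    # T_N_u = L_N_u * OVERSUBSCRIPTION_CONSTANT + N_u
    m = {n: int(math.log2(next_power_2(l * OVERSUBSCRIPTION_CONSTANT + u)))
         for n, (l, u) in bandwidths.items()}
    max_m = max(m.values())
    # BAD = D * (1 + MAX_M_N_u - M_N_u)
    return {n: d * (1 + max_m - m[n]) for n in m}
```

For Leaf111 in the example, `bad_per_neighbor({"Spine111": (10, 100), "Spine112": (20, 200)})` yields a BAD of 2 towards Spine 111 and 1 towards Spine 112, matching the rows of Table 5.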
3459 +---------+-----------+-------+-------+-----+
3460 | Node    | N         | T_N_u | M_N_u | BAD |
3461 +---------+-----------+-------+-------+-----+
3462 | Leaf111 | Spine 111 | 110   | 7     | 2   |
3463 +---------+-----------+-------+-------+-----+
3464 | Leaf111 | Spine 112 | 220   | 8     | 1   |
3465 +---------+-----------+-------+-------+-----+
3466 | Leaf112 | Spine 111 | 120   | 7     | 2   |
3467 +---------+-----------+-------+-------+-----+
3468 | Leaf112 | Spine 112 | 220   | 8     | 1   |
3469 +---------+-----------+-------+-------+-----+

3471 Table 5: BAD Computation

3473 If a calculation produces a result exceeding the range of the type,
3474 e.g. bandwidth, the result is set to the highest possible value for
3475 that type.

3477 BAD is only computed for default routes. A node MAY compute and use
3478 BAD for any disaggregated prefixes or other RIFT routes. A node MAY
3479 use another algorithm than BAD to weight northbound traffic based on
3480 bandwidth given that the algorithm is distributed and unsynchronized
3481 and, ultimately, its correct behavior does not depend on uniformity of the
3482 balancing algorithms used in the fabric. E.g., it is conceivable that
3483 leafs could use real-time link loads gathered by analytics to change
3484 the amount of traffic assigned to each default route next hop.

3486 Observe further that a change in available bandwidth will only affect
3487 at maximum two levels down in the fabric, i.e. the blast radius of
3488 bandwidth changes is contained no matter the fabric's height.

3490 5.3.6.2. Southbound Direction

3492 Due to its loop-free properties a node CAN take into account during S-SPF
3493 the available bandwidth on the nodes in lower levels and
3494 modify the amount of traffic offered to the next level's "southbound"
3495 nodes based on what it sees as the total achievable maximum flow
3496 through those nodes. It is worth observing that such computations
3497 may work better if standardized but do not necessarily have to be.
3498 As long as the packet keeps heading south it will take one of the
3499 available paths and arrive at the intended destination.

3501 5.3.7. Label Binding

3503 A node MAY advertise on its TIEs a locally significant, downstream
3504 assigned label for the according interface. One use of such a label is
3505 a hop-by-hop encapsulation that makes it easy to distinguish forwarding
3506 planes served by a multiplicity of RIFT instances.

3508 5.3.8. Segment Routing Support with RIFT

3510 Recently, an alternative architecture reusing labels as segment
3511 identifiers [RFC8402] has gained traction and may present use cases
3512 in IP fabrics that would justify its deployment. Such use cases will
3513 either precondition an assignment of a label per node (or other
3514 entities where the mechanisms are equivalent) or a global assignment
3515 and a knowledge of the topology everywhere to compute segment stacks of
3516 interest. We deal with the two issues separately.

3518 5.3.8.1. Global Segment Identifiers Assignment

3520 Global segment identifiers are normally assumed to be provided by
3521 some kind of a centralized "controller" instance and distributed to
3522 other entities. This can be performed in RIFT by attaching a
3523 controller to the Top-of-Fabric nodes at the top of the fabric where
3524 the whole topology is always visible, assigning such identifiers and
3525 then distributing those via the KV mechanism towards all nodes so they
3526 can perform things like probing the fabric for failures using a stack
3527 of segments.

3529 5.3.8.2. Distribution of Topology Information

3531 Some segment routing use cases seem to precondition full knowledge of
3532 the fabric topology in all nodes, which can be provided albeit at the
3533 loss of one of the highly desirable properties of RIFT, namely its minimal
3534 blast radius. Basically, RIFT can function as a flat IGP by
3535 switching off its flooding scopes.
All nodes will end up with a full
3536 topology view and, albeit the N-SPF and S-SPF are still performed
3537 based on RIFT rules, any computation with segment identifiers that
3538 needs the full topology can use it.

3540 Besides the blast radius problem, excessive flooding may present a
3541 significant load on implementations.

3543 5.3.9. Leaf to Leaf Procedures

3545 RIFT can optionally allow special leaf East-West adjacencies under an
3546 additional set of rules. A leaf supporting those procedures MUST:

3548 advertise the LEAF_2_LEAF flag in node capabilities AND

3550 set the overload bit on all of the leaf's node TIEs AND

3552 flood only the node's own north and south TIEs over E-W leaf
3553 adjacencies AND

3555 always use the E-W leaf adjacency in both the north as well as the south
3556 computation AND

3558 install a discard route for any advertised aggregate in the leaf's
3559 TIEs AND

3561 never form southbound adjacencies.

3563 This will allow the E-W leaf nodes to exchange traffic strictly for
3564 the prefixes advertised in each other's north prefix TIEs (since the
3565 southbound computation will find the reverse direction in the other
3566 node's TIE and install its north prefixes).

3568 5.3.10. Address Family and Multi Topology Considerations

3570 Multi-Topology (MT) [RFC5120] and Multi-Instance (MI) [RFC8202] are used
3571 today in link-state routing protocols to support several domains on
3572 the same physical topology. RIFT supports this capability by
3573 carrying transport ports in the LIE protocol exchanges. Multiplexing
3574 of LIEs can be achieved by either choosing varying multicast
3575 addresses or ports on the same address.

3577 BFD interactions in Section 5.3.5 are implementation dependent when
3578 multiple RIFT instances run on the same link.

3580 5.3.11. Reachability of Internal Nodes in the Fabric

3582 RIFT does not precondition that its nodes have reachable addresses
3583 albeit for operational purposes this is clearly desirable.
Under
3584 normal operating conditions this can be easily achieved by, e.g.,
3585 injecting the node's loopback address into North and South Prefix
3586 TIEs or by other implementation-specific mechanisms.

3588 Things get more interesting in case a node loses all its northbound
3589 adjacencies but is not at the top of the fabric. That is outside the
3590 scope of this document and may be covered in a separate document
3591 about policy guided prefixes [PGP reference].

3593 5.3.12. One-Hop Healing of Levels with East-West Links

3595 Based on the rules defined in Section 5.2.4, Section 5.2.3.8 and
3596 given the presence of E-W links, RIFT can provide one-hop protection for
3597 nodes that lost all their northbound links or in other complex link
3598 set failure scenarios, except at the Top-of-Fabric where the links are
3599 used exclusively to flood topology information in multi-plane
3600 designs. Section 6.4 explains the resulting behavior based on one
3601 such example.

3603 5.4. Security

3605 5.4.1. Security Model

3607 An inherent property of any security and ZTP architecture is the
3608 resulting trade-off in regard to integrity verification of the
3609 information distributed through the fabric vs. the necessary provisioning
3610 and auto-configuration. At a minimum, in all approaches, the
3611 security of an established adjacency can be ensured. The stricter
3612 the security model, the more provisioning must take over the role of
3613 ZTP.

3615 The most security-conscious operators will want to have full control
3616 over which port on which router/switch is connected to the respective
3617 port on the "other side", which we will call the "port-association
3618 model" (PAM), achievable e.g. by configuring on each port pair a
3619 designated shared key or a pair of private/public keys.
In secure data
3620 center locations, operators may want to control only which router/switch
3621 is connected to which other router/switch, or choose a "node-
3622 association model" (NAM), which allows, for example, simplified port
3623 sparing. In an even more relaxed environment, an operator may only
3624 be concerned that the routers/switches share credentials ensuring that
3625 they belong to this particular data center network, hence allowing the
3626 flexible sparing of whole routers/switches. We will define that case
3627 as the "fabric-association model" (FAM), equivalent to using a shared
3628 secret for the whole fabric. Such flexibility may make sense for
3629 leaf nodes such as servers where the addition and swapping of servers
3630 is more frequent than in the rest of the data center network.
3631 Generally, leafs of the fabric tend to be less trusted than switches.
3632 The different models could be mixed throughout the fabric if the
3633 benefits outweigh the cost of increased complexity in provisioning.

3635 In each of the above cases, some configuration mechanism is needed to
3636 allow the operator to specify which connections are allowed, and some
3637 mechanism is needed to:

3639 a. specify the according level in the fabric,

3641 b. discover and report missing connections,

3643 c. discover and report unexpected connections, and prevent such
3644 adjacencies from forming.

3646 On the more relaxed configuration side of the spectrum, operators
3647 might only configure the level of each switch, but don't explicitly
3648 configure which connections are allowed. In this case, RIFT will
3649 only allow adjacencies to come up between nodes that are in adjacent
3650 levels. The operators with the lowest security requirements may not use
3651 any configuration to specify which connections are allowed.
Such 3652 fabrics could rely fully on ZTP for each router/switch to discover 3653 its level and would only allow adjacencies between adjacent levels to 3654 come up. Figure 30 illustrates the tradeoffs inherent in the 3655 different security models. 3657 Ultimately, some level of verification of the link quality may be 3658 required before an adjacency is allowed to be used for forwarding. 3659 For example, an implementation may require that a BFD session comes 3660 up before advertising the adjacency. 3662 For the above outlined cases, RIFT has two approaches to enforce that 3663 a local port is connected to the correct port on the correct remote 3664 router/switch. One approach is to piggy-back on RIFT's 3665 authentication mechanism. Assuming the provisioning model (e.g. the 3666 YANG model) is flexible enough, operators can choose to provision a 3667 unique authentication key for: 3669 a. each pair of ports in "port-association model" or 3671 b. each pair of switches in "node-association model" or 3673 c. each pair of levels or 3675 d. the entire fabric in "fabric-association model". 3677 The other approach is to rely on the system-id, port-id and level 3678 fields in the LIE message to validate an adjacency against the 3679 configured expected cabling topology, and optionally introduce some 3680 new rules in the FSM to allow the adjacency to come up if the 3681 expectations are met. 3683 ^ /\ | 3684 /|\ / \ | 3685 | / \ | 3686 | / PAM \ | 3687 Increasing / \ Increasing 3688 Integrity +----------+ Flexibility 3689 & / NAM \ & 3690 Increasing +--------------+ Less 3691 Provisioning / FAM \ Configuration 3692 | +------------------+ | 3693 | / Level Provisioning \ | 3694 | +----------------------+ \|/ 3695 | / Zero Configuration \ v 3696 +--------------------------+ 3698 Figure 30: Security Model 3700 5.4.2. Security Mechanisms 3702 RIFT Security goals are to ensure authentication, message integrity 3703 and prevention of replay attacks. 
Low processing overhead and
3704 efficient messaging are also goals. Message confidentiality is a
3705 non-goal.

3707 The model in the previous section allows a range of security key
3708 types that are analogous to the various security association models.
3709 PAM and NAM allow security associations at the port or node level
3710 using symmetric or asymmetric keys that are pre-installed. FAM
3711 argues for security associations to be applied only at a group level
3712 or to be refined once the topology has been established. RIFT does
3713 not specify how security keys are installed or updated; it specifies
3714 how a key can be used to achieve those goals.

3716 The protocol has provisions for "weak" nonces to prevent replay
3717 attacks and includes authentication mechanisms comparable to
3718 [RFC5709] and [RFC7987].

3720 5.4.3. Security Envelope

3722 RIFT MUST be carried in a mandatory secure envelope illustrated in
3723 Figure 31. Any value in the packet following a security fingerprint
3724 MUST be used only after the according fingerprint has been validated.

3726 Local configuration MAY allow skipping the checking of the envelope's
3727 integrity.
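A minimal, non-normative sketch of parsing the fixed portion of the outer security envelope laid out in Figure 31 below (network byte order is assumed; the function and field names are invented for this sketch, and no fingerprint validation is performed):

```python
import struct

RIFT_MAGIC = 0xA1F7

def parse_outer_envelope(payload: bytes) -> dict:
    """Illustrative parser for the outer security envelope header.

    Returns the fixed fields plus the raw fingerprint bytes; it does
    NOT validate the fingerprint itself.
    """
    # Magic(16) | Packet Number(16) | Reserved(8) | Major(8) |
    # Outer Key ID(8) | Fingerprint Length(8)  -> 8 bytes total
    magic, packet_number, _reserved, major_version, outer_key_id, fp_len = \
        struct.unpack_from("!HHBBBB", payload, 0)
    if magic != RIFT_MAGIC:
        raise ValueError("not a RIFT packet")
    # Fingerprint length is expressed in 32-bit multiples.
    fp_end = 8 + 4 * fp_len
    nonce_local, nonce_remote, lifetime = \
        struct.unpack_from("!HHI", payload, fp_end)
    return {
        "packet_number": packet_number,
        "major_version": major_version,
        "outer_key_id": outer_key_id,
        "fingerprint": payload[8:fp_end],
        "nonce_local": nonce_local,
        "nonce_remote": nonce_remote,
        "remaining_lifetime": lifetime,  # all 1s for anything but TIEs
    }
```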
3729 0 1 2 3
3730 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1

3732 UDP Header:
3733 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3734 | Source Port | RIFT destination port |
3735 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3736 | UDP Length | UDP Checksum |
3737 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3739 Outer Security Envelope Header:
3740 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3741 | RIFT MAGIC | Packet Number |
3742 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3743 | Reserved | RIFT Major | Outer Key ID | Fingerprint |
3744 | | Version | | Length |
3745 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3746 | |
3747 ~ Security Fingerprint covers all following content ~
3748 | |
3749 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3750 | Weak Nonce Local | Weak Nonce Remote |
3751 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3752 | Remaining TIE Lifetime (all 1s in case of LIE) |
3753 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3755 TIE Origin Security Envelope Header:
3756 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3757 | TIE Origin Key ID | Fingerprint |
3758 | | Length |
3759 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3760 | |
3761 ~ Security Fingerprint covers all following content ~
3762 | |
3763 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3765 Serialized RIFT Model Object
3766 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
3767 | |
3768 ~ Serialized RIFT Model Object ~
3769 | |
3770 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

3772 Figure 31: Security Envelope

3774 RIFT MAGIC: 16 bits. Constant value of 0xA1F7 that allows
3775 classification of RIFT packets independent of the UDP port used.

3777 Packet Number: 16 bits.
An optional, per-packet-type monotonically
3778 growing number rolling over using the sequence number arithmetic
3779 defined in Appendix A. A node SHOULD correctly set the number on
3780 subsequent packets or otherwise MUST set the value to
3781 `undefined_packet_number` as provided in the schema. This number
3782 can be used to detect losses and misordering in flooding for
3783 either operational purposes or in an implementation to adjust
3784 flooding behavior to the current link or buffer quality. This number
3785 MUST NOT be used to discard or validate the correctness of
3786 packets.

3788 RIFT Major Version: 8 bits. It allows checking whether protocol
3789 versions are compatible, i.e. whether the serialized object can be decoded
3790 at all. An implementation MUST drop packets with an unexpected value
3791 and MAY report a problem. It must be the same as in the encoded model
3792 object, otherwise the packet is dropped.

3794 Outer Key ID: 8 bits to allow key rollovers. This implies the key type
3795 and the algorithm used. Value 0 means that no valid fingerprint was
3796 computed. This key ID scope is local to the nodes on both ends of
3797 the adjacency.

3799 TIE Origin Key ID: 24 bits. This implies the key type and the
3800 algorithm used. Value 0 means that no valid fingerprint was computed.
3801 This key ID scope is global to the RIFT instance since it implies
3802 the originator of the TIE, so the contained object does not have to
3803 be de-serialized to obtain it.

3805 Length of Fingerprint: 8 bits. Length in 32-bit multiples of the
3806 following fingerprint, not including the lifetime or weak nonces. It
3807 allows navigating the structure when an unknown key type is
3808 present. To clarify a common corner case: when this value is set to
3809 0 it signifies an empty (0 bytes long) security fingerprint.

3811 Security Fingerprint: 32 bits * Length of Fingerprint. This is a
3812 signature that is computed over all data following after it.
If
3813 the significant bits of the fingerprint are fewer than the 32-bit
3814 padded length, then the significant bits MUST be left aligned and the
3815 remaining bits on the right padded with 0s. When using PKI, the
3816 node originating the security fingerprint uses its private key to
3817 create the signature. The original packet can then be verified
3818 provided the public key is shared and current.

3820 Remaining TIE Lifetime: 32 bits. In case of anything but TIEs this
3821 field MUST be set to all ones and the Origin Security Envelope Header
3822 MUST NOT be present in the packet. For TIEs this field represents
3823 the remaining lifetime of the TIE and the Origin Security Envelope
3824 Header MUST be present in the packet. The value in the serialized
3825 model object MUST be ignored.

3827 Weak Nonce Local: 16 bits. Local Weak Nonce of the adjacency as
3828 advertised in LIEs.

3830 Weak Nonce Remote: 16 bits. Remote Weak Nonce of the adjacency as
3831 received in LIEs.

3833 TIE Origin Security Envelope Header: It MUST be present if and only
3834 if the Remaining TIE Lifetime field is NOT all ones. It carries
3835 through the originator's key ID and the according fingerprint of the
3836 object to protect the TIE from modification during flooding. This
3837 ensures origin validation and integrity (but does not provide
3838 validation of a chain of trust).

3840 Observe that due to the schema migration rules per Appendix B the
3841 contained model can always be decoded if the major version matches
3842 and the envelope integrity has been validated. Consequently, the
3843 description of the TIE is available to flood it properly, including
3844 unknown TIE types.

3846 5.4.4. Weak Nonces

3848 The protocol uses two 16-bit nonces to salt generated signatures. We
3849 use the term "nonce" a bit loosely since RIFT nonces are not being
3850 changed on every packet as is common in cryptography.
For efficiency
3851 purposes they are changed at a frequency high enough to dwarf replay
3852 attack attempts for all practical purposes. Therefore, we call them
3853 "weak" nonces.

3855 Any implementation including RIFT security MUST generate and wrap
3856 around local nonces properly. When a nonce increment leads to the
3857 `undefined_nonce` value the value SHOULD be incremented again
3858 immediately. All implementations MUST reflect the neighbor's nonces.
3859 An implementation SHOULD increment a chosen nonce on every LIE FSM
3860 transition that ends up in a different state from the previous one and
3861 MUST increment its nonce at least every 5 minutes (such
3862 considerations allow for efficient implementations without opening a
3863 significant security risk). When flooding TIEs, the implementation
3864 MUST use recent (i.e. within the allowed difference) nonces reflected in
3865 the LIE exchange. The schema specifies the maximum allowable nonce value
3866 difference on a packet compared to the reflected nonces in the LIEs. Any
3867 packet received with nonces deviating more than the allowed delta
3868 MUST be discarded without further computation of signatures to
3869 prevent computation load attacks.

3871 In cases where a secure implementation does not receive signatures or
3872 receives undefined nonces from a neighbor, indicating that it does not
3873 support or verify signatures, it is a matter of local policy how such
3874 packets are treated. Any secure implementation may choose to either
3875 refuse forming an adjacency with an implementation not advertising
3876 signatures or valid nonces or to simply keep on signing local packets
3877 while accepting the neighbor's packets without further security
3878 verification.

3880 As a necessary exception, an implementation MUST advertise
3881 `undefined_nonce` for the remote nonce value when the FSM is not in 2-way
3882 or 3-way state and accept an `undefined_nonce` for its local nonce
3883 value on packets in any other state than 3-way.
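A non-normative sketch of the nonce handling above, assuming the schema's `undefined_nonce` value is 0 (an assumption for this illustration) and interpreting the allowed-delta check as 16-bit sequence arithmetic in either direction:

```python
UNDEFINED_NONCE = 0  # assumed schema value; check the schema constant

def next_nonce(nonce: int) -> int:
    """Increment a 16-bit weak nonce, skipping `undefined_nonce` on wrap."""
    nonce = (nonce + 1) & 0xFFFF
    if nonce == UNDEFINED_NONCE:
        nonce = (nonce + 1) & 0xFFFF
    return nonce

def nonces_acceptable(packet_nonce: int, lie_nonce: int, max_delta: int) -> bool:
    """Discard packets whose nonce deviates more than the allowed delta
    from the nonce reflected in LIEs (one possible interpretation of the
    schema's maximum allowable difference)."""
    delta = (packet_nonce - lie_nonce) & 0xFFFF
    return delta <= max_delta or delta >= 0x10000 - max_delta
```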
3885 As an optional optimization, an implementation MAY send one LIE with the
3886 previously negotiated neighbor's nonce to try to speed up a
3887 neighbor's transition from 3-way to 1-way and MUST revert to sending
3888 `undefined_nonce` after that.

3890 5.4.5. Lifetime

3892 Protecting the lifetime on flooding may lead to an excessive number of
3893 security fingerprint computations and hence an application generating
3894 such fingerprints on TIEs MAY round the value down to the next
3895 `rounddown_lifetime_interval` defined in the schema when sending TIEs,
3896 albeit such an optimization may not be feasible in the presence of security
3897 hashes over advancing weak nonces.

3899 5.4.6. Key Management

3901 As outlined in the Security Model, a private shared key or a public/
3902 private key pair is used to authenticate the adjacency. The actual
3903 method of key distribution and key synchronization is assumed to be
3904 out of band from RIFT's perspective. Both nodes in the adjacency
3905 must share the same keys and configuration of key type and algorithm
3906 for a key ID. Mismatched keys will obviously not inter-operate due
3907 to an unverifiable security envelope.

3909 Key roll-over while the adjacency is active is allowed and the
3910 technique is well known and described in e.g. [RFC6518]. Key
3911 distribution procedures are out of scope for RIFT.

3913 5.4.7. Security Association Changes

3915 There is no mechanism to convert a security envelope for the same key
3916 ID from one algorithm to another once the envelope is operational.
3917 The recommended procedure to change to a new algorithm is to take the
3918 adjacency down, make the changes, and then bring the adjacency up.

3920 Obviously, an implementation may choose to stop verifying the security
3921 envelope for the duration of the key change to keep the adjacency up but
3922 since this introduces a security vulnerability window, such a roll-over
3923 is not recommended.

3925 6. Examples

3927 6.1.
Normal Operation

3929 This section describes RIFT deployment in the example topology
3930 without any node or link failures. We disregard flooding reduction
3931 for simplicity's sake.

3933 As a first step, the following bi-directional adjacencies will be
3934 created (and any other links that do not fulfill the LIE rules in
3935 Section 5.2.2 are disregarded):

3937 1. Spine 21 (PoD 0) to Spine 111, Spine 112, Spine 121, and Spine
3938 122

3940 2. Spine 22 (PoD 0) to Spine 111, Spine 112, Spine 121, and Spine
3941 122

3943 3. Spine 111 to Leaf 111, Leaf 112

3945 4. Spine 112 to Leaf 111, Leaf 112

3947 5. Spine 121 to Leaf 121, Leaf 122

3949 6. Spine 122 to Leaf 121, Leaf 122

3951 Consequently, N-TIEs would be originated by Spine 111 and Spine 112
3952 and each set would be sent to both Spine 21 and Spine 22. N-TIEs
3953 also would be originated by Leaf 111 (w/ Prefix 111) and Leaf 112 (w/
3954 Prefix 112 and the multi-homed prefix) and each set would be sent to
3955 Spine 111 and Spine 112. Spine 111 and Spine 112 would then flood
3956 these N-TIEs to Spine 21 and Spine 22.

3958 Similarly, N-TIEs would be originated by Spine 121 and Spine 122 and
3959 each set would be sent to both Spine 21 and Spine 22. N-TIEs also
3960 would be originated by Leaf 121 (w/ Prefix 121 and the multi-homed
3961 prefix) and Leaf 122 (w/ Prefix 122) and each set would be sent to
3962 Spine 121 and Spine 122. Spine 121 and Spine 122 would then flood
3963 these N-TIEs to Spine 21 and Spine 22.

3965 At this point both Spine 21 and Spine 22, as well as any controller
3966 to which they are connected, would have the complete network
3967 topology. At the same time, Spine 111/112/121/122 hold only the
3968 N-TIEs of level 0 of their respective PoD. Leafs hold only their own
3969 N-TIEs.

3971 S-TIEs with adjacencies and a default IP prefix would then be
3972 originated by Spine 21 and Spine 22 and each would be flooded to
3973 Spine 111, Spine 112, Spine 121, and Spine 122.
Spine 111, Spine
3974 112, Spine 121, and Spine 122 would each send the S-TIE from Spine 21
3975 to Spine 22 and the S-TIE from Spine 22 to Spine 21. (S-TIEs are
3976 reflected up to the level from which they are received but they are NOT
3977 propagated southbound.)

3979 An S-TIE with a default IP prefix would be originated by Spine 111 and
3980 Spine 112 and each would be sent to Leaf 111 and Leaf 112.

3982 Similarly, an S-TIE with a default IP prefix would be originated by
3983 Spine 121 and Spine 122 and each would be sent to Leaf 121 and Leaf
3984 122. At this point IP connectivity with the maximum possible ECMP has
3985 been established between the leafs while constraining the amount of
3986 information held by each node to the minimum necessary for normal
3987 operation and dealing with failures.

3989 6.2. Leaf Link Failure

3991 . | | | |
3992 .+-+---+-+ +-+---+-+
3993 .| | | |
3994 .|Spin111| |Spin112|
3995 .+-+---+-+ ++----+-+
3996 . | | | |
3997 . | +---------------+ X
3998 . | | | X Failure
3999 . | +-------------+ | X
4000 . | | | |
4001 .+-+---+-+ +--+--+-+
4002 .| | | |
4003 .|Leaf111| |Leaf112|
4004 .+-------+ +-------+
4005 . + +
4006 . Prefix111 Prefix112

4008 Figure 32: Single Leaf link failure

4010 In case of a failing leaf link between spine 112 and leaf 112 the
4011 link-state information will cause re-computation of the necessary SPF
4012 and the higher levels will stop forwarding towards prefix 112 through
4013 spine 112. Only spines 111 and 112, as well as both Top-of-Fabric nodes,
4014 will see control traffic. Leaf 111 will receive a new S-TIE from spine 112
4015 and reflect it back to spine 111. Spine 111 will de-aggregate prefix
4016 111 and prefix 112 but we will not describe it further here since
4017 de-aggregation is emphasized in the next example.
It is worth observing,
4018 however, in this example that if leaf 111 kept on forwarding
4019 traffic towards prefix 112 using the advertised south-bound default
4020 of spine 112, the traffic would end up on Top-of-Fabric 21 and ToF 22
4021 and cross back into PoD 1 using spine 111. This is arguably not as
4022 bad as the black-holing present in the next example but clearly
4023 undesirable. Fortunately, de-aggregation prevents this type of
4024 behavior except for a transitory period of time.

4026 6.3. Partitioned Fabric

4028 . +--------+ +--------+ S-TIE of Spine21
4029 . | | | | received by
4030 . |ToF 21| |ToF 22| south reflection of
4031 . ++-+--+-++ ++-+--+-++ spines 112 and 111
4032 . | | | | | | | |
4033 . | | | | | | | 0/0
4034 . | | | | | | | |
4035 . | | | | | | | |
4036 . +--------------+ | +--- XXXXXX + | | | +---------------+
4037 . | | | | | | | |
4038 . | +-----------------------------+ | | |
4039 . 0/0 | | | | | | |
4040 . | 0/0 0/0 +- XXXXXXXXXXXXXXXXXXXXXXXXX -+ |
4041 . | 1.1/16 | | | | | |
4042 . | | +-+ +-0/0-----------+ | |
4043 . | | | 1.1/16 | | | |
4044 .+-+----++ +-+-----+ ++-----0/0 ++----0/0
4045 .| | | | | 1.1/16 | 1.1/16
4046 .|Spin111| |Spin112| |Spin121| |Spin122|
4047 .+-+---+-+ ++----+-+ +-+---+-+ ++---+--+
4048 . | | | | | | | |
4049 . | +---------------+ | | +----------------+ |
4050 . | | | | | | | |
4051 . | +-------------+ | | | +--------------+ | |
4052 . | | | | | | | |
4053 .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+
4054 .| | | | | | | |
4055 .|Leaf111| |Leaf112| |Leaf121| |Leaf122|
4056 .+-+-----+ ++------+ +-----+-+ +-+-----+
4057 . + + + +
4058 . Prefix111 Prefix112 Prefix121 Prefix122
4059 . 1.1/16

4061 Figure 33: Fabric partition

4063 Figure 33 shows an arguably more catastrophic but also more
4064 interesting case. ToF 21 is completely severed from access to
4065 Prefix 121 (we use 1.1/16 in the figure as an example) by a double link
4066 failure.
However unlikely, if left unresolved, forwarding from leaf
4067 111 and leaf 112 to prefix 121 would suffer 50% black-holing based on
4068 pure default route advertisements by Top-of-Fabric 21 and ToF 22.

4070 The mechanism used to resolve this scenario hinges on the
4071 distribution of the southbound representation of Top-of-Fabric 21 that is
4072 reflected by spine 111 and spine 112 to ToF 22. ToF 22, having
4073 computed reachability to all prefixes in the network, advertises with
4074 the default route the ones that are reachable only via lower-level
4075 neighbors that ToF 21 does not show an adjacency to. That results in
4076 spine 111 and spine 112 obtaining a longest-prefix match to prefix
4077 121 which leads through ToF 22 and prevents black-holing through ToF
4078 21, which still advertises the 0/0 aggregate only.

4080 The prefix 121 advertised by Top-of-Fabric 22 does not have to be
4081 propagated further towards leafs since they do not benefit from this
4082 information. Hence the amount of flooding is restricted to ToF 21
4083 reissuing its S-TIEs and the south reflection of those by spine 111 and
4084 spine 112. The resulting SPF in ToF 22 issues new prefix S-TIEs
4085 containing 1.1/16. None of the leafs become aware of the changes and
4086 the failure is constrained strictly to the level that became
4087 partitioned.
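The decision just described (a ToF advertises, alongside the default, exactly those prefixes that are reachable only via lower-level neighbors to which the partner ToF shows no adjacency) can be sketched as a simple set computation; the names here are illustrative, not taken from the schema:

```python
def prefixes_to_disaggregate(reachable_prefixes, next_hops, partner_adjacencies):
    """Return the prefixes this ToF must advertise southbound because the
    partner ToF has no adjacency to any of their next hops.

    next_hops maps each prefix r to its set |H of possible next hops;
    partner_adjacencies is the set |A of the partner ToF's southbound
    neighbors.
    """
    return {r for r in reachable_prefixes
            if next_hops[r].isdisjoint(partner_adjacencies)}
```

With the sets of this example, only Prefix 121 and Prefix 122 are returned, so ToF 22 originates an S-TIE containing exactly those prefixes.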
4089 To finish with an example of the resulting sets computed using the
4090 notation introduced in Section 5.2.5, Top-of-Fabric 22 constructs the
4091 following sets:

4093 |R = Prefix 111, Prefix 112, Prefix 121, Prefix 122

4095 |H (for r=Prefix 111) = Spine 111, Spine 112

4097 |H (for r=Prefix 112) = Spine 111, Spine 112

4099 |H (for r=Prefix 121) = Spine 121, Spine 122

4101 |H (for r=Prefix 122) = Spine 121, Spine 122

4103 |A (for Spine 21) = Spine 111, Spine 112

4105 With that and |H (for r=prefix 121) and |H (for r=prefix 122) being
4106 disjoint from |A (for Top-of-Fabric 21), ToF 22 will originate an
4107 S-TIE with prefix 121 and prefix 122, which is flooded to spines 111,
4108 112, 121 and 122.

4110 6.4. Northbound Partitioned Router and Optional East-West Links

4112 . + + +
4113 . X N1 | N2 | N3
4114 . X | |
4115 .+--+----+ +--+----+ +--+-----+
4116 .| |0/0> <0/0| |0/0> <0/0| |
4117 .| A01 +----------+ A02 +----------+ A03 | Level 1
4118 .++-+-+--+ ++--+--++ +---+-+-++
4119 . | | | | | | | | |
4120 . | | +----------------------------------+ | | |
4121 . | | | | | | | | |
4122 . | +-------------+ | | | +--------------+ |
4123 . | | | | | | | | |
4124 . | +----------------+ | +-----------------+ |
4125 . | | | | | | | | |
4126 . | | +------------------------------------+ | |
4127 . | | | | | | | | |
4128 .++-+-+--+ | +---+---+ | +-+---+-++
4129 .| | +-+ +-+ | |
4130 .| L01 | | L02 | | L03 | Level 0
4131 .+-------+ +-------+ +--------+

4133 Figure 34: North Partitioned Router

4135 Figure 34 shows a part of a fabric where level 1 is horizontally
4136 connected and A01 lost its only northbound adjacency. Based on the N-SPF
4137 rules in Section 5.2.4.1 A01 will compute northbound reachability by
4138 using the link A01 to A02 (whereas A02 will NOT use this link during
4139 N-SPF). Hence A01 will still advertise the default towards level 0
4140 and route unidirectionally using the horizontal link.
4142 As a further consideration, the moment A02 loses link N2 the situation 4143 evolves again. A01 will have no more northbound reachability while 4144 still seeing A03 advertising northbound adjacencies in its South Node 4145 TIE. With that it will stop advertising a default route due to 4146 Section 5.2.3.8. 4148 7. Implementation and Operation: Further Details 4150 7.1. Considerations for Leaf-Only Implementation 4152 RIFT can be, and is intended to be, stretched to the lowest level in the 4153 IP fabric to integrate ToRs or even servers. Since those entities 4154 would run as leafs only, it is worth observing that a leaf-only 4155 version is significantly simpler to implement and requires far fewer 4156 resources: 4158 1. Under normal conditions, the leaf needs to support a multipath 4159 default route only. In the most catastrophic partitioning case it 4160 has to be capable of accommodating all the leaf routes in its own 4161 PoD to prevent black-holing. 4163 2. Leaf nodes hold only their own N-TIEs and the S-TIEs of the Level 1 nodes 4164 they are connected to; those are overall few in number. 4166 3. A leaf node does not have to support any type of de-aggregation 4167 computation or propagation. 4169 4. Leaf nodes normally do not have to support the overload bit. 4171 5. Unless optional leaf-2-leaf procedures are desired, default route 4172 origination and S-TIE origination are unnecessary. 4174 7.2. Considerations for Spine Implementation 4176 In case of spines, i.e. nodes that will never act as Top of Fabric, a 4177 full implementation is not required; specifically, the node does not 4178 need to perform any computation of negative disaggregation except 4179 respecting northbound disaggregation advertised from the north. 4181 7.3. Adaptations to Other Proposed Data Center Topologies 4183 . +-----+ +-----+ 4184 .
| | | | 4185 .+-+ S0 | | S1 | 4186 .| ++---++ ++---++ 4187 .| | | | | 4188 .| | +------------+ | 4189 .| | | +------------+ | 4190 .| | | | | 4191 .| ++-+--+ +--+-++ 4192 .| | | | | 4193 .| | A0 | | A1 | 4194 .| +-+--++ ++---++ 4195 .| | | | | 4196 .| | +------------+ | 4197 .| | +-----------+ | | 4198 .| | | | | 4199 .| +-+-+-+ +--+-++ 4200 .+-+ | | | 4201 . | L0 | | L1 | 4202 . +-----+ +-----+ 4204 Figure 35: Level Shortcut 4206 Strictly speaking, RIFT is not limited to Clos variations only. The 4207 protocol preconditions only a sense of 'compass rose direction' 4208 achieved by configuration (or derivation) of levels, and other 4209 topologies are possible within this framework. So, conceptually, one 4210 could include leaf-to-leaf links and even shortcuts between levels, but 4211 certain requirements in Section 4 will not be met anymore. As an 4212 example, shortcutting levels as illustrated in Figure 35 will lead 4213 either to suboptimal routing when L0 sends traffic to L1 (since using 4214 S0's default route will lead to the traffic being sent back to A0 or 4215 A1) or to the leafs needing each other's routes installed to understand 4216 that only A0 and A1 should be used to talk to each other. 4218 Whether such modifications of topology constraints make sense 4219 depends on many technology variables, and an exhaustive treatment 4220 of the topic is definitely outside the scope of this document. 4222 7.4. Originating Non-Default Route Southbound 4224 Obviously, an implementation may choose to originate southbound a 4225 shorter prefix P' instead of a strict default route (as described in 4226 Section 5.2.3.8), but in such a scenario all addresses carried within 4227 the RIFT domain must be contained within P'. 4229 8. Security Considerations 4231 8.1.
General 4233 One can consider attack vectors where a router may reboot many times 4234 while changing its system ID and pollute the network with many stale 4235 TIEs, or where TIEs are sent with very long lifetimes and not cleaned up 4236 when the routes vanish. Those attack vectors are not unique to 4237 RIFT. Given the large memory footprints available today, those attacks 4238 should be relatively benign. Otherwise a node SHOULD implement a 4239 strategy of discarding contents of all TIEs that were not present in 4240 the SPF tree over a certain, configurable period of time. Since the 4241 protocol, like all modern link-state protocols, is self-stabilizing 4242 and will advertise the presence of such TIEs to its neighbors, they 4243 can be re-requested if a computation finds that it sees an 4244 adjacency formed towards the system ID of the discarded TIEs. 4246 8.2. ZTP 4248 Section 5.2.7 presents many attack vectors in untrusted environments, 4249 starting with nodes that oscillate their level offers up to the 4250 possibility of a node offering a three-way adjacency with the highest 4251 possible level value and a very long holdtime, trying to put itself 4252 "on top of the lattice" and with that gaining access to the whole 4253 southbound topology. Session authentication mechanisms are necessary 4254 in environments where this is possible, and RIFT provides the 4255 corresponding security envelope to ensure this if desired. 4257 8.3. Lifetime 4259 Traditional IGP protocols are vulnerable to lifetime modification and 4260 replay attacks that can be somewhat mitigated by using techniques 4261 like [RFC7987]. RIFT removes this attack vector by protecting the 4262 lifetime behind a signature computed over it and an additional nonce 4263 combination, which makes even the replay attack window very small and 4264 for practical purposes irrelevant since the lifetime cannot be 4265 artificially shortened by the attacker. 4267 8.4.
Packet Number 4269 The optional packet number is carried in the security envelope without 4270 any encryption protection and is hence vulnerable to replay and 4271 modification attacks. Contrary to nonces, this number must change on 4272 every packet and would present a very high cryptographic load if 4273 signed. The attack vector the packet number presents is relatively 4274 benign. Changing the packet number by a man-in-the-middle attack 4275 will only affect operational validation tools and possibly some 4276 performance optimizations on flooding. It is expected that an 4277 implementation detecting too many "fake losses" or "misorderings" due 4278 to the attack on the packet number would simply suppress its further 4279 processing. 4281 8.5. Outer Fingerprint Attacks 4283 A node can try to inject LIE packets, observing a conversation on the 4284 wire, by using the outer key ID, although it cannot generate valid hashes 4285 in case it changes the integrity of the message, so the only possible 4286 attack is DoS due to excessive LIE validation. 4288 A node can try to replay previous LIEs with changed state that it 4289 recorded, but the attack is hard to replicate since the nonce 4290 combination must match the ongoing exchange and is then limited to a 4291 single flap only, since both nodes will advance their nonces in case 4292 the adjacency state changed. Even in the most unlikely case the 4293 attack length is limited due to both sides periodically increasing 4294 their nonces. 4296 8.6. TIE Origin Fingerprint DoS Attacks 4298 A compromised node can attempt to generate "fake TIEs" using other 4299 nodes' TIE origin key identifiers. Although the ultimate validation of 4300 the origin fingerprint will fail in such scenarios and not progress 4301 further than immediately peering nodes, the resulting denial of 4302 service attack seems unavoidable since the TIE origin key ID is only 4303 protected by the, here assumed to be compromised, node. 4305 8.7.
Host Implementations 4307 It can be reasonably expected that, with the proliferation of RotH, 4308 servers rather than dedicated networking devices will constitute a 4309 significant share of RIFT devices. Given their normally far wider 4310 software envelope and the access granted to them, such servers are also 4311 far more likely to be compromised and present an attack vector on the 4312 protocol. Hijacking of prefixes to attract traffic is a trust 4313 problem and cannot be addressed within the protocol if the trust 4314 model is breached, i.e. the server presents valid credentials to form 4315 an adjacency and issue TIEs. However, in a more devious way, the 4316 servers can present DoS (or even DDoS) vectors by issuing too many 4317 LIE packets, flooding large amounts of N-TIEs and similar anomalies. A 4318 prudent implementation hosting leafs should implement thresholds and 4319 raise warnings when a leaf advertises a number of TIEs in excess of 4320 those thresholds. 4322 9. IANA Considerations 4324 This specification requests multicast address assignments and 4325 standard port numbers. Additionally, registries for the schema are 4326 requested and suggested values provided that reflect the numbers 4327 allocated in the given schema. 4329 9.1. Requested Multicast and Port Numbers 4331 This document requests allocation in the 'IPv4 Multicast Address 4332 Space' registry of the suggested value 224.0.0.120 as 4333 'ALL_V4_RIFT_ROUTERS' and in the 'IPv6 Multicast Address Space' 4334 registry of the suggested value FF02::A1F7 as 'ALL_V6_RIFT_ROUTERS'. 4336 This document requests allocation in the 'Service Name and Transport 4337 Protocol Port Number Registry' of the suggested value of 4338 914 on UDP for 'RIFT_LIES_PORT' and the suggested value of 915 for 4339 'RIFT_TIES_PORT'. 4341 9.2. Requested Registries with Suggested Values 4343 This section requests registries that help govern the schema via 4344 usual IANA registry procedures.
Allocation of new values is always 4345 performed via `Expert Review` action. IANA is requested to store the 4346 schema version introducing the allocated value as well as, 4347 optionally, its description when present. All values not suggested 4348 are to be considered `Unassigned`. The range of every registry is a 4349 16-bit integer. 4351 9.2.1. RIFT/common/AddressFamilyType 4353 address family 4355 9.2.1.1. Requested Entries 4357 Name Value Schema Version Description 4358 Illegal 0 1.0 4359 AddressFamilyMinValue 1 1.0 4360 IPv4 2 1.0 4361 IPv6 3 1.0 4362 AddressFamilyMaxValue 4 1.0 4364 9.2.2. RIFT/common/HierarchyIndications 4366 flags indicating a node's behavior in case of ZTP 4368 9.2.2.1. Requested Entries 4370 Name Value Schema Version Description 4371 leaf_only 0 1.0 4372 leaf_only_and_leaf_2_leaf_procedures 1 1.0 4373 top_of_fabric 2 1.0 4375 9.2.3. RIFT/common/IEEE802_1ASTimeStampType 4377 timestamp per IEEE 802.1AS, values MUST be interpreted by the 4378 implementation as unsigned 4380 9.2.3.1. Requested Entries 4382 Name Value Schema Version Description 4383 AS_sec 1 1.0 4384 AS_nsec 2 1.0 4386 9.2.4. RIFT/common/IPAddressType 4388 IP address type 4390 9.2.4.1. Requested Entries 4392 Name Value Schema Version Description 4393 ipv4address 1 1.0 4394 ipv6address 2 1.0 4396 9.2.5. RIFT/common/IPPrefixType 4398 prefix representing reachability. 4400 @note: for interface addresses the protocol can propagate the address 4401 part beyond the subnet mask and on reachability computation that has 4402 to be normalized. The non-significant bits can be used for 4403 operational purposes. 4405 9.2.5.1. Requested Entries 4407 Name Value Schema Version Description 4408 ipv4prefix 1 1.0 4409 ipv6prefix 2 1.0 4411 9.2.6. RIFT/common/IPv4PrefixType 4413 IPv4 prefix type 4415 9.2.6.1. Requested Entries 4417 Name Value Schema Version Description 4418 address 1 1.0 4419 prefixlen 2 1.0 4421 9.2.7. RIFT/common/IPv6PrefixType 4423 IPv6 prefix type 4425 9.2.7.1.
Requested Entries 4427 Name Value Schema Version Description 4428 address 1 1.0 4429 prefixlen 2 1.0 4431 9.2.8. RIFT/common/PrefixSequenceType 4433 sequence of a prefix when it moves 4435 9.2.8.1. Requested Entries 4437 Name Value Schema Description 4438 Version 4439 timestamp 1 1.0 4440 transactionid 2 1.0 transaction ID set by client, e.g. 4441 in 6LoWPAN 4443 9.2.9. RIFT/common/RouteType 4445 RIFT route types. 4447 @note: route types MUST be ordered by their preference: PGP 4448 prefixes are most preferred, attracting traffic north (towards the spine) 4449 and then south; normal prefixes attract traffic south (towards the 4450 leafs), i.e. a prefix in a NORTH PREFIX TIE is preferred over one in a SOUTH 4451 PREFIX TIE. 4453 @note: The only purpose of those values is to introduce an ordering; 4454 an implementation can choose internally any other values as 4455 long as the ordering is preserved. 4457 9.2.9.1. Requested Entries 4459 Name Value Schema Version Description 4460 Illegal 0 1.0 4461 RouteTypeMinValue 1 1.0 4462 Discard 2 1.0 4463 LocalPrefix 3 1.0 4464 SouthPGPPrefix 4 1.0 4465 NorthPGPPrefix 5 1.0 4466 NorthPrefix 6 1.0 4467 NorthExternalPrefix 7 1.0 4468 SouthPrefix 8 1.0 4469 SouthExternalPrefix 9 1.0 4470 NegativeSouthPrefix 10 1.0 4471 RouteTypeMaxValue 11 1.0 4473 9.2.10. RIFT/common/TIETypeType 4475 type of TIE. 4477 This enum indicates what TIE type the TIE is carrying. In case the 4478 value is not known to the receiver, the TIE is re-flooded the same way as prefix 4479 TIEs. This allows for future extensions of the protocol within the 4480 same major schema version with types opaque to some nodes, unless the flooding 4481 scope is not the same as for prefix TIEs, in which case a major version revision 4482 MUST be performed. 4484 9.2.10.1.
Requested Entries 4485 Name Value Schema Version Description 4486 Illegal 0 1.0 4487 TIETypeMinValue 1 1.0 4488 NodeTIEType 2 1.0 4489 PrefixTIEType 3 1.0 4490 PositiveDisaggregationPrefixTIEType 4 1.0 4491 NegativeDisaggregationPrefixTIEType 5 1.0 4492 PGPrefixTIEType 6 1.0 4493 KeyValueTIEType 7 1.0 4494 ExternalPrefixTIEType 8 1.0 4495 TIETypeMaxValue 9 1.0 4497 9.2.11. RIFT/common/TieDirectionType 4499 direction of a TIE 4501 9.2.11.1. Requested Entries 4503 Name Value Schema Version Description 4504 Illegal 0 1.0 4505 South 1 1.0 4506 North 2 1.0 4507 DirectionMaxValue 3 1.0 4509 9.2.12. RIFT/encoding/Community 4511 community 4513 9.2.12.1. Requested Entries 4515 Name Value Schema Version Description 4516 top 1 1.0 4517 bottom 2 1.0 4519 9.2.13. RIFT/encoding/KeyValueTIEElement 4521 generic key-value pairs 4523 9.2.13.1. Requested Entries 4525 Name Value Schema Description 4526 Version 4527 keyvalues 1 1.0 if the same key repeats in multiple TIEs of 4528 the same node or with different values, behavior 4529 is unspecified 4531 9.2.14. RIFT/encoding/LIEPacket 4533 RIFT LIE packet 4535 @note: this node's level is already included in the packet header 4537 9.2.14.1. Requested Entries 4539 Name Value Schema Description 4540 Version 4541 name 1 1.0 node or adjacency name 4542 local_id 2 1.0 local link ID 4543 flood_port 3 1.0 UDP port on which we can 4544 receive flooded TIEs 4545 link_mtu_size 4 1.0 layer 3 MTU, used to 4546 discover MTU mismatch 4547 link_bandwidth 5 1.0 local link bandwidth on the 4548 interface 4549 neighbor 6 1.0 reflects the neighbor once 4550 received to provide 3-way 4551 connectivity 4552 pod 7 1.0 node's PoD 4553 node_capabilities 10 1.0 node capabilities shown in 4554 the LIE. The capabilities 4555 MUST match the capabilities 4556 shown in the Node TIEs, 4557 otherwise the behavior is 4558 unspecified.
A node 4559 detecting the mismatch 4560 SHOULD generate a 4561 corresponding error 4562 link_capabilities 11 1.0 capabilities of this link 4563 holdtime 12 1.0 required holdtime of the 4564 adjacency, i.e. how much 4565 time MUST expire without a 4566 LIE for the adjacency to 4567 drop 4568 label 13 1.0 unsolicited, downstream 4569 assigned, locally 4570 significant label value for 4571 the adjacency 4572 not_a_ztp_offer 21 1.0 indicates that the level on 4573 the LIE MUST NOT be used to 4574 derive a ZTP level by the 4575 receiving node 4576 you_are_flood_repeater 22 1.0 indicates to the northbound 4577 neighbor that it should be 4578 reflooding this node's 4579 N-TIEs to achieve flood 4580 reduction and balancing for 4581 northbound flooding. To be 4582 ignored if received from a 4583 northbound adjacency 4584 you_are_sending_too_quickly 23 1.0 can be optionally set to 4585 indicate to the neighbor that 4586 packet losses are seen on 4587 reception based on packet 4588 numbers or that the rate is too 4589 high. The receiver SHOULD 4590 temporarily slow down 4591 flooding rates 4592 instance_name 24 1.0 instance name in case 4593 multiple RIFT instances are 4594 running on the same interface 4596 9.2.15. RIFT/encoding/LinkCapabilities 4598 link capabilities 4600 9.2.15.1. Requested Entries 4602 Name Value Schema Description 4603 Version 4604 bfd 1 1.0 indicates that the link's `local 4605 ID` can be used as its BFD 4606 discriminator and that the link 4607 supports BFD 4608 v4_forwarding_capable 2 1.0 indicates whether the interface 4609 will support v4 forwarding. This 4610 MUST be set to true when LIEs 4611 from a v4 address are sent and 4612 MAY be set to true in LIEs on a v6 4613 address. If v4 and v6 LIEs 4614 indicate contradicting 4615 information the behavior is 4616 unspecified. 4618 9.2.16. RIFT/encoding/LinkIDPair 4620 a LinkID pair describes one of the parallel links between two nodes 4622 9.2.16.1.
Requested Entries 4623 Name Value Schema Description 4624 Version 4625 local_id 1 1.0 node-wide unique value for 4626 the local link 4627 remote_id 2 1.0 received remote link ID for 4628 this link 4629 platform_interface_index 10 1.0 describes the local 4630 interface index of the link 4631 platform_interface_name 11 1.0 describes the local 4632 interface name 4633 trusted_outer_security_key 12 1.0 indication whether the link 4634 is secured, i.e. protected 4635 by outer key, absence of 4636 this element means no 4637 indication, undefined outer 4638 key means not secured 4640 9.2.17. RIFT/encoding/Neighbor 4642 neighbor structure 4644 9.2.17.1. Requested Entries 4646 Name Value Schema Version Description 4647 originator 1 1.0 system ID of the originator 4648 remote_id 2 1.0 ID of remote side of the link 4650 9.2.18. RIFT/encoding/NodeCapabilities 4652 capabilities the node supports. The schema may add to this field 4653 future capabilities to indicate whether it will support 4654 interpretation of future schema extensions on the same major 4655 revision. Such fields MUST be optional and have an implicit or 4656 explicit false default value. If a future capability changes route 4657 selection or generates blackholes if some nodes are not supporting it 4658 then a major version increment is unavoidable. 4660 9.2.18.1. Requested Entries 4662 Name Value Schema Description 4663 Version 4664 flood_reduction 1 1.0 can this node participate in 4665 flood reduction 4666 hierarchy_indications 2 1.0 does this node restrict itself to 4667 be top-of-fabric or leaf only (in 4668 ZTP) and does it support 4669 leaf-2-leaf procedures 4671 9.2.19. RIFT/encoding/NodeFlags 4673 Flags the node sets 4675 9.2.19.1. Requested Entries 4677 Name Value Schema Description 4678 Version 4679 overload 1 1.0 indicates that node is in overload, do not 4680 transit traffic through it 4682 9.2.20. RIFT/encoding/NodeNeighborsTIEElement 4684 neighbor of a node 4686 9.2.20.1. 
Requested Entries 4688 Name Value Schema Description 4689 Version 4690 level 1 1.0 level of neighbor 4691 cost 3 1.0 4692 link_ids 4 1.0 can carry description of multiple parallel 4693 links in a TIE 4694 bandwidth 5 1.0 total bandwidth to neighbor, this will 4695 normally be the sum of the bandwidths of all the 4696 parallel links. 4698 9.2.21. RIFT/encoding/NodeTIEElement 4700 Description of a node. 4702 It may occur multiple times in different TIEs, but if either the 4703 capabilities values do not match, the flags values do not match, or 4704 neighbors repeat with different values, 4706 the behavior is undefined and a warning SHOULD be generated. 4707 Neighbors can, however, be distributed across multiple TIEs only if the sets 4708 are disjoint. Miscablings SHOULD be repeated in every node TIE, 4709 otherwise the behavior is undefined. 4711 @note: observe that absence of fields implies defined defaults 4713 9.2.21.1. Requested Entries 4714 Name Value Schema Description 4715 Version 4716 level 1 1.0 level of the node 4717 neighbors 2 1.0 node's neighbors. If a neighbor systemID 4718 repeats in other node TIEs of the same node 4719 the behavior is undefined 4720 capabilities 3 1.0 capabilities of the node 4721 flags 4 1.0 flags of the node 4722 name 5 1.0 optional node name for easier 4723 operations 4724 pod 6 1.0 PoD to which the node belongs 4725 miscabled_links 10 1.0 if any local links are miscabled, the 4726 indication is flooded 4728 9.2.22. RIFT/encoding/PacketContent 4730 content of a RIFT packet 4732 9.2.22.1. Requested Entries 4734 Name Value Schema Version Description 4735 lie 1 1.0 4736 tide 2 1.0 4737 tire 3 1.0 4738 tie 4 1.0 4740 9.2.23. RIFT/encoding/PacketHeader 4742 common RIFT packet header 4744 9.2.23.1.
Requested Entries 4746 Name Value Schema Description 4747 Version 4748 major_version 1 1.0 major version type of protocol 4749 minor_version 2 1.0 minor version type of protocol 4750 sender 3 1.0 node sending the packet, in case of 4751 LIE/TIRE/TIDE also the originator of it 4752 level 4 1.0 level of the node sending the packet, 4753 required on everything except LIEs. Lack 4754 of presence on LIEs indicates 4755 UNDEFINED_LEVEL and is used in ZTP 4756 procedures. 4758 9.2.24. RIFT/encoding/PrefixAttributes 4760 9.2.24.1. Requested Entries 4762 Name Value Schema Description 4763 Version 4764 metric 2 1.0 distance of the prefix 4765 tags 3 1.0 generic unordered set of route tags, 4766 can be redistributed to other 4767 protocols or use within the context 4768 of real time analytics 4769 monotonic_clock 4 1.0 monotonic clock for mobile addresses 4770 loopback 6 1.0 indicates if the interface is a node 4771 loopback 4772 directly_attached 7 1.0 indicates that the prefix is directly 4773 attached, i.e. should be routed to 4774 even if the node is in overload. * 4775 from_link 10 1.0 in case of locally originated 4776 prefixes, i.e. interface addresses 4777 this can describe which link the 4778 address belongs to. 4780 9.2.25. RIFT/encoding/PrefixTIEElement 4782 TIE carrying prefixes 4784 9.2.25.1. Requested Entries 4786 Name Value Schema Description 4787 Version 4788 prefixes 1 1.0 prefixes with the associated attributes. if 4789 the same prefix repeats in multiple TIEs of 4790 same node behavior is unspecified 4792 9.2.26. RIFT/encoding/ProtocolPacket 4794 RIFT packet structure 4796 9.2.26.1. Requested Entries 4798 Name Value Schema Version Description 4799 header 1 1.0 4800 content 2 1.0 4802 9.2.27. RIFT/encoding/TIDEPacket 4804 TIDE with sorted TIE headers, if headers are unsorted, behavior is 4805 undefined 4807 9.2.27.1. 
Requested Entries 4809 Name Value Schema Version Description 4810 start_range 1 1.0 first TIE header in the TIDE packet 4811 end_range 2 1.0 last TIE header in the TIDE packet 4812 headers 3 1.0 _sorted_ list of headers 4814 9.2.28. RIFT/encoding/TIEElement 4816 single element in a TIE. The enum `common.TIETypeType` in the TIEID indicates 4817 which elements MUST be present in the TIEElement. In case of 4818 mismatch the unexpected elements MUST be ignored. In case an 4819 expected element is missing, an error MUST be reported and the TIE MUST 4820 be ignored. 4822 This type can be extended with new optional elements for new 4823 `common.TIETypeType` values without breaking the major version, but if it is 4824 necessary to understand whether all nodes support the new type, a node 4825 capability must be added as well. 4827 9.2.28.1. Requested Entries 4828 Name Valu Schema Description 4829 e Version 4830 node 1 1.0 used in case of enum common. 4831 TIETypeType.NodeTIEType 4832 prefixes 2 1.0 used in case of enum common. 4833 TIETypeType.PrefixTIEType 4834 positive_disaggregation_pre 3 1.0 positive prefixes (always 4835 fixes southbound) It MUST NOT be 4836 advertised within a North 4837 TIE and ignored otherwise 4838 negative_disaggregation_pre 4 1.0 transitive, negative 4839 fixes prefixes (always southbound) 4840 which MUST be aggregated and 4841 propagated according to the 4842 specification southwards 4843 towards lower levels to heal 4844 pathological upper level 4845 partitioning, otherwise 4846 blackholes may occur in 4847 multiplane fabrics. It MUST 4848 NOT be advertised within a 4849 North TIE. 4850 external_prefixes 5 1.0 externally reimported 4851 prefixes 4852 keyvalues 6 1.0 Key-Value store elements 4854 9.2.29. RIFT/encoding/TIEHeader 4856 Header of a TIE. 4858 @note: TIEID space is a total order achieved by comparing the 4859 elements in the sequence defined and comparing each value as an unsigned 4860 integer of according length.
4862 @note: After sequence number the lifetime received on the envelope 4863 must be used for comparison before further fields. 4865 @note: `origination_time` and `origination_lifetime` are disregarded 4866 for comparison purposes and carried purely for debugging/security 4867 purposes if present. 4869 9.2.29.1. Requested Entries 4870 Name Value Schema Description 4871 Version 4872 tieid 2 1.0 ID of the tie 4873 seq_nr 3 1.0 sequence number of the tie 4874 origination_time 10 1.0 absolute timestamp when the TIE 4875 was generated. This can be used on 4876 fabrics with synchronized clock to 4877 prevent lifetime modification 4878 attacks. 4879 origination_lifetime 12 1.0 original lifetime when the TIE was 4880 generated. This can be used on 4881 fabrics with synchronized clock to 4882 prevent lifetime modification 4883 attacks. 4885 9.2.30. RIFT/encoding/TIEHeaderWithLifeTime 4887 Header of a TIE as described in TIRE/TIDE. 4889 9.2.30.1. Requested Entries 4891 Name Value Schema Description 4892 Version 4893 header 1 1.0 4894 remaining_lifetime 2 1.0 remaining lifetime that expires down 4895 to 0 just like in ISIS. TIEs with 4896 lifetimes differing by less than 4897 `lifetime_diff2ignore` MUST be 4898 considered EQUAL. 4900 9.2.31. RIFT/encoding/TIEID 4902 ID of a TIE 4904 @note: TIEID space is a total order achieved by comparing the 4905 elements in sequence defined and comparing each value as an unsigned 4906 integer of according length. 4908 9.2.31.1. Requested Entries 4910 Name Value Schema Version Description 4911 direction 1 1.0 direction of TIE 4912 originator 2 1.0 indicates originator of the TIE 4913 tietype 3 1.0 type of the tie 4914 tie_nr 4 1.0 number of the tie 4916 9.2.32. RIFT/encoding/TIEPacket 4918 TIE packet 4920 9.2.32.1. Requested Entries 4922 Name Value Schema Version Description 4923 header 1 1.0 4924 element 2 1.0 4926 9.2.33. RIFT/encoding/TIREPacket 4928 TIRE packet 4930 9.2.33.1. 
Requested Entries 4932 Name Value Schema Version Description 4933 headers 1 1.0 4935 10. Acknowledgments 4937 A new routing protocol in its complexity is not a product of a parent 4938 but of a village, as the author list already shows. However, many 4939 more people provided input and fine-combed the specification based on 4940 their experience in design or implementation. This section makes 4941 an inadequate attempt at recording their contributions. 4943 Many thanks to Naiming Shen for some of the early discussions around 4944 the topic of using IGPs for routing in topologies related to Clos. 4945 Russ White is to be especially acknowledged for the key conversation on 4946 epistemology that allowed tying current asynchronous distributed 4947 systems theory results to the modern protocol design presented here. 4948 Adrian Farrel, Joel Halpern, Jeffrey Zhang, Krzysztof Szarkowicz, and 4949 Nagendra Kumar provided thoughtful comments that improved the 4950 readability of the document and found a good number of corners where 4951 the light failed to shine. Kris Price was the first to mention single 4952 router, single arm default considerations. Jeff Tantsura helped out 4953 with some initial thoughts on BFD interactions while Jeff Haas 4954 corrected several misconceptions about BFD's finer points. Artur 4955 Makutunowicz pointed out many possible improvements and acted as a 4956 sounding board in regard to the modern protocol implementation techniques 4957 RIFT is exploring. Barak Gafni was the first to clearly formalize the 4958 problem of partitioned spine and fallen leafs on a (clean) napkin in 4959 Singapore, which led to the very important part of the specification 4960 centered around multiple Top-of-Fabric planes and negative 4961 disaggregation. Igor Gashinsky and others shared many thoughts on 4962 problems encountered in design and operation of large-scale data 4963 center fabrics. Xu Benchong found a delicate error in the flooding 4964 procedures while implementing. 4966 11.
References 4968 11.1. Normative References 4970 [ISO10589] 4971 ISO "International Organization for Standardization", 4972 "Intermediate system to Intermediate system intra-domain 4973 routeing information exchange protocol for use in 4974 conjunction with the protocol for providing the 4975 connectionless-mode Network Service (ISO 8473), ISO/IEC 4976 10589:2002, Second Edition.", Nov 2002. 4978 [RFC1982] Elz, R. and R. Bush, "Serial Number Arithmetic", RFC 1982, 4979 DOI 10.17487/RFC1982, August 1996, 4980 . 4982 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 4983 Requirement Levels", BCP 14, RFC 2119, 4984 DOI 10.17487/RFC2119, March 1997, 4985 . 4987 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 4988 DOI 10.17487/RFC2328, April 1998, 4989 . 4991 [RFC2365] Meyer, D., "Administratively Scoped IP Multicast", BCP 23, 4992 RFC 2365, DOI 10.17487/RFC2365, July 1998, 4993 . 4995 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 4996 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 4997 DOI 10.17487/RFC4271, January 2006, 4998 . 5000 [RFC4291] Hinden, R. and S. Deering, "IP Version 6 Addressing 5001 Architecture", RFC 4291, DOI 10.17487/RFC4291, February 5002 2006, . 5004 [RFC5082] Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C. 5005 Pignataro, "The Generalized TTL Security Mechanism 5006 (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007, 5007 . 5009 [RFC5120] Przygienda, T., Shen, N., and N. Sheth, "M-ISIS: Multi 5010 Topology (MT) Routing in Intermediate System to 5011 Intermediate Systems (IS-ISs)", RFC 5120, 5012 DOI 10.17487/RFC5120, February 2008, 5013 . 5015 [RFC5303] Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way 5016 Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303, 5017 DOI 10.17487/RFC5303, October 2008, 5018 . 5020 [RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network 5021 Layer Reachability Information with an IPv6 Next Hop", 5022 RFC 5549, DOI 10.17487/RFC5549, May 2009, 5023 . 
5025 [RFC5709] Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M., 5026 Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic 5027 Authentication", RFC 5709, DOI 10.17487/RFC5709, October 5028 2009, . 5030 [RFC5881] Katz, D. and D. Ward, "Bidirectional Forwarding Detection 5031 (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881, 5032 DOI 10.17487/RFC5881, June 2010, 5033 . 5035 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 5036 "Network Time Protocol Version 4: Protocol and Algorithms 5037 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 5038 . 5040 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 5041 S. Ray, "North-Bound Distribution of Link-State and 5042 Traffic Engineering (TE) Information Using BGP", RFC 7752, 5043 DOI 10.17487/RFC7752, March 2016, 5044 . 5046 [RFC7987] Ginsberg, L., Wells, P., Decraene, B., Przygienda, T., and 5047 H. Gredler, "IS-IS Minimum Remaining Lifetime", RFC 7987, 5048 DOI 10.17487/RFC7987, October 2016, 5049 . 5051 [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 5052 (IPv6) Specification", STD 86, RFC 8200, 5053 DOI 10.17487/RFC8200, July 2017, 5054 . 5056 [RFC8202] Ginsberg, L., Previdi, S., and W. Henderickx, "IS-IS 5057 Multi-Instance", RFC 8202, DOI 10.17487/RFC8202, June 5058 2017, . 5060 [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., 5061 Decraene, B., Litkowski, S., and R. Shakir, "Segment 5062 Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, 5063 July 2018, . 5065 [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. 5066 Perkins, "Registration Extensions for IPv6 over Low-Power 5067 Wireless Personal Area Network (6LoWPAN) Neighbor 5068 Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, 5069 . 5071 [thrift] Apache Software Foundation, "Thrift Interface Description 5072 Language", . 5074 11.2. 
Informative References 5076 [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer 5077 Communication Environments", IEEE International Parallel & 5078 Distributed Processing Symposium, 2011. 5080 [DIJKSTRA] 5081 Dijkstra, E., "A Note on Two Problems in Connexion with 5082 Graphs", Journal Numer. Math. , 1959. 5084 [DOT] Ellson, J. and L. Koutsofios, "Graphviz: open source graph 5085 drawing tools", Springer-Verlag , 2001. 5087 [DYNAMO] De Candia et al., G., "Dynamo: amazon's highly available 5088 key-value store", ACM SIGOPS symposium on Operating 5089 systems principles (SOSP '07), 2007. 5091 [EPPSTEIN] 5092 Eppstein, D., "Finding the k-Shortest Paths", 1997. 5094 [EUI64] IEEE, "Guidelines for Use of Extended Unique Identifier 5095 (EUI), Organizationally Unique Identifier (OUI), and 5096 Company ID (CID)", IEEE EUI, 5097 . 5099 [FATTREE] Leiserson, C., "Fat-Trees: Universal Networks for 5100 Hardware-Efficient Supercomputing", 1985. 5102 [IEEEstd1588] 5103 IEEE, "IEEE Standard for a Precision Clock Synchronization 5104 Protocol for Networked Measurement and Control Systems", 5105 IEEE Standard 1588, 5106 . 5108 [IEEEstd8021AS] 5109 IEEE, "IEEE Standard for Local and Metropolitan Area 5110 Networks - Timing and Synchronization for Time-Sensitive 5111 Applications in Bridged Local Area Networks", 5112 IEEE Standard 802.1AS, 5113 . 5115 [ISO10589-Second-Edition] 5116 International Organization for Standardization, 5117 "Intermediate system to Intermediate system intra-domain 5118 routeing information exchange protocol for use in 5119 conjunction with the protocol for providing the 5120 connectionless-mode Network Service (ISO 8473)", Nov 2002. 5122 [MAKSIC2013] 5123 Maksic et al., N., "Improving Utilization of Data Center 5124 Networks", IEEE Communications Magazine, Nov 2013. 
5126 [RFC0826] Plummer, D., "An Ethernet Address Resolution Protocol: Or 5127 Converting Network Protocol Addresses to 48.bit Ethernet 5128 Address for Transmission on Ethernet Hardware", STD 37, 5129 RFC 826, DOI 10.17487/RFC0826, November 1982, 5130 . 5132 [RFC2131] Droms, R., "Dynamic Host Configuration Protocol", 5133 RFC 2131, DOI 10.17487/RFC2131, March 1997, 5134 . 5136 [RFC3626] Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link 5137 State Routing Protocol (OLSR)", RFC 3626, 5138 DOI 10.17487/RFC3626, October 2003, 5139 . 5141 [RFC4655] Farrel, A., Vasseur, J., and J. Ash, "A Path Computation 5142 Element (PCE)-Based Architecture", RFC 4655, 5143 DOI 10.17487/RFC4655, August 2006, 5144 . 5146 [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, 5147 "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, 5148 DOI 10.17487/RFC4861, September 2007, 5149 . 5151 [RFC4862] Thomson, S., Narten, T., and T. Jinmei, "IPv6 Stateless 5152 Address Autoconfiguration", RFC 4862, 5153 DOI 10.17487/RFC4862, September 2007, 5154 . 5156 [RFC6518] Lebovitz, G. and M. Bhatia, "Keying and Authentication for 5157 Routing Protocols (KARP) Design Guidelines", RFC 6518, 5158 DOI 10.17487/RFC6518, February 2012, 5159 . 5161 [RFC7855] Previdi, S., Ed., Filsfils, C., Ed., Decraene, B., 5162 Litkowski, S., Horneffer, M., and R. Shakir, "Source 5163 Packet Routing in Networking (SPRING) Problem Statement 5164 and Requirements", RFC 7855, DOI 10.17487/RFC7855, May 5165 2016, . 5167 [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of 5168 BGP for Routing in Large-Scale Data Centers", RFC 7938, 5169 DOI 10.17487/RFC7938, August 2016, 5170 . 5172 [RFC8415] Mrugalski, T., Siodelski, M., Volz, B., Yourtchenko, A., 5173 Richardson, M., Jiang, S., Lemon, T., and T. Winters, 5174 "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)", 5175 RFC 8415, DOI 10.17487/RFC8415, November 2018, 5176 . 5178 [VAHDAT08] 5179 Al-Fares, M., Loukissas, A., and A. 
Vahdat, "A Scalable, 5180 Commodity Data Center Network Architecture", SIGCOMM, 5181 2008.

5183 [Wikipedia] 5184 Wikipedia, 5185 "https://en.wikipedia.org/wiki/Serial_number_arithmetic", 5186 2016.

5188 Appendix A. Sequence Number Binary Arithmetic

5190 The only reasonable reference to a sequence number solution cleaner 5191 than [RFC1982] is given in [Wikipedia]. It basically converts the 5192 problem into two's complement arithmetic. Assuming straight two's 5193 complement subtraction on the bit-width of the sequence number, 5194 the according >: and =: relations are defined as:

5196 U_1, U_2 are 12-bit aligned unsigned version numbers

5198 D_f is ( U_1 - U_2 ) interpreted as two's complement signed 12-bits 5199 D_b is ( U_2 - U_1 ) interpreted as two's complement signed 12-bits

5201 U_1 >: U_2 IFF D_f > 0 AND D_b < 0 5202 U_1 =: U_2 IFF D_f = 0

5204 The >: relationship is anti-symmetric but not transitive. Observe that 5205 this leaves the case of the numbers having maximum two's complement 5206 distance, e.g. ( 0 and 0x800 ), undefined in our 12-bit case since 5207 D_f and D_b are both -0x800.

5209 A simple example of the relationship in case of 3-bit arithmetic 5210 follows as a table indicating the D_f/D_b signs and then the relationship 5211 of U_1 to U_2:

5213 U2 / U1 0 1 2 3 4 5 6 7 5214 0 +/+ +/- +/- +/- -/- -/+ -/+ -/+ 5215 1 -/+ +/+ +/- +/- +/- -/- -/+ -/+ 5216 2 -/+ -/+ +/+ +/- +/- +/- -/- -/+ 5217 3 -/+ -/+ -/+ +/+ +/- +/- +/- -/- 5218 4 -/- -/+ -/+ -/+ +/+ +/- +/- +/- 5219 5 +/- -/- -/+ -/+ -/+ +/+ +/- +/- 5220 6 +/- +/- -/- -/+ -/+ -/+ +/+ +/- 5221 7 +/- +/- +/- -/- -/+ -/+ -/+ +/+

5223 U2 / U1 0 1 2 3 4 5 6 7 5224 0 = > > > ? < < < 5225 1 < = > > > ? < < 5226 2 < < = > > > ? < 5227 3 < < < = > > > ? 5228 4 ? < < < = > > > 5229 5 > ? < < < = > > 5230 6 > > ? < < < = > 5231 7 > > > ? < < < =

5233 Appendix B. Information Elements Schema

5235 This section introduces the schema for information elements. The IDL 5236 is Thrift [thrift].
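As a non-normative illustration, the comparison rules of Appendix A can be sketched in Python (the function names are ours; `bits` defaults to the 12-bit case discussed above):

```python
def to_signed(value, bits):
    """Interpret an unsigned value as a two's complement signed number."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def serial_gt(u1, u2, bits=12):
    """U_1 >: U_2 IFF D_f > 0 AND D_b < 0."""
    d_f = to_signed(u1 - u2, bits)  # forward difference
    d_b = to_signed(u2 - u1, bits)  # backward difference
    return d_f > 0 and d_b < 0

def serial_eq(u1, u2, bits=12):
    """U_1 =: U_2 IFF D_f = 0."""
    return to_signed(u1 - u2, bits) == 0
```

With `bits=3` this reproduces the 3-bit table in Appendix A, including the undefined `?` entries at maximum two's complement distance, for which neither `serial_gt(a, b)` nor `serial_gt(b, a)` holds.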
5238 On schema changes that

5240 1. change field numbers or

5242 2. add new *required* fields or

5243 3. remove any fields or

5245 4. change lists into sets, unions into structures or

5247 5. change multiplicity of fields or

5249 6. change the name of any field or type or

5251 7. change datatypes of any field or

5253 8. add, change or remove a default value of any *existing* field 5254 or

5256 9. remove or change any defined constant or constant value or

5258 10. change any enumeration type except extending `common.TIETypeType` 5259 (use of enumeration types is generally discouraged)

5261 the major version of the schema MUST increase. All other changes MUST 5262 increase the minor version within the same major version.

5264 Observe however that introducing an optional field does not cause a 5265 major version increase even if the fields inside the structure are 5266 optional with defaults.

5268 All signed integers, as forced by Thrift [thrift] support, must be cast 5269 for internal purposes to equivalent unsigned values without 5270 discarding the signedness bit. An implementation SHOULD try to avoid 5271 using the signedness bit when generating values.

5273 The schema is normative.

5275 B.1. common.thrift

5277 /** 5278 Thrift file with common definitions for RIFT 5279 */

5281 /** @note MUST be interpreted in implementation as unsigned 64 bits. 5282 * The implementation SHOULD NOT use the MSB.
5283 */ 5284 typedef i64 SystemIDType 5285 typedef i32 IPv4Address 5286 /** this has to be of length long enough to accommodate a prefix */ 5287 typedef binary IPv6Address 5288 /** @note MUST be interpreted in implementation as unsigned */ 5289 typedef i16 UDPPortType 5290 /** @note MUST be interpreted in implementation as unsigned */ 5291 typedef i32 TIENrType 5292 /** @note MUST be interpreted in implementation as unsigned */ 5293 typedef i32 MTUSizeType 5294 /** @note MUST be interpreted in implementation as unsigned rolling over number */ 5295 typedef i16 SeqNrType 5296 /** @note MUST be interpreted in implementation as unsigned */ 5297 typedef i32 LifeTimeInSecType 5298 /** @note MUST be interpreted in implementation as unsigned */ 5299 typedef i8 LevelType 5300 /** optional, recommended monotonically increasing number _per packet type per adjacency_ 5301 that can be used to detect losses/misordering/restarts. 5302 This will be moved into the envelope in the future. 5303 @note MUST be interpreted in implementation as unsigned rolling over number */ 5304 typedef i16 PacketNumberType 5305 /** @note MUST be interpreted in implementation as unsigned */ 5306 typedef i32 PodType 5307 /** @note MUST be interpreted in implementation as unsigned. This is carried in the 5308 security envelope and MUST fit into 8 bits.
*/ 5309 typedef i8 VersionType 5310 /** @note MUST be interpreted in implementation as unsigned */ 5311 typedef i16 MinorVersionType 5312 /** @note MUST be interpreted in implementation as unsigned */ 5313 typedef i32 MetricType 5314 /** @note MUST be interpreted in implementation as unsigned and unstructured */ 5315 typedef i64 RouteTagType 5316 /** @note MUST be interpreted in implementation as unstructured label value */ 5317 typedef i32 LabelType 5318 /** @note MUST be interpreted in implementation as unsigned */ 5319 typedef i32 BandwithInMegaBitsType 5320 /** @note Key Value key ID type */ 5321 typedef string KeyIDType 5322 /** node local, unique identification for a link (interface/tunnel 5323 * etc. Basically anything RIFT runs on). This is kept 5324 * at 32 bits so it aligns with BFD [RFC5880] discriminator size. 5325 */ 5326 typedef i32 LinkIDType 5327 typedef string KeyNameType 5328 typedef i8 PrefixLenType 5329 /** timestamp in seconds since the epoch */ 5330 typedef i64 TimestampInSecsType 5331 /** security nonce. 5332 * @note MUST be interpreted in implementation as rolling over unsigned value */ 5333 typedef i16 NonceType 5334 /** LIE FSM holdtime type */ 5335 typedef i16 TimeIntervalInSecType 5336 /** Transaction ID type for prefix mobility as specified by RFC6550, value 5337 MUST be interpreted in implementation as unsigned */ 5339 typedef i8 PrefixTransactionIDType 5340 /** timestamp per IEEE 802.1AS, values MUST be interpreted in implementation as unsigned */ 5341 struct IEEE802_1ASTimeStampType { 5342 1: required i64 AS_sec; 5343 2: optional i32 AS_nsec; 5344 } 5345 /** generic counter type */ 5346 typedef i64 CounterType 5347 /** Platform Interface Index type, i.e. index of interface on hardware, can be used e.g. 
with 5348 RFC5837 */ 5349 typedef i32 PlatformInterfaceIndex

5351 /** flags indicating node's behavior in case of ZTP 5352 */ 5353 enum HierarchyIndications { 5354 /** forces level to `leaf_level` and enables according procedures */ 5355 leaf_only = 0, 5356 /** forces level to `leaf_level` and enables according procedures */ 5357 leaf_only_and_leaf_2_leaf_procedures = 1, 5358 /** forces level to `top_of_fabric` and enables according procedures */ 5359 top_of_fabric = 2, 5360 }

5362 const PacketNumberType undefined_packet_number = 0 5363 /** This MUST be used when the node is configured as top of fabric in ZTP. 5364 This is kept reasonably low to allow for fast ZTP convergence on 5365 failures. */ 5366 const LevelType top_of_fabric_level = 24 5367 /** default bandwidth on a link */ 5368 const BandwithInMegaBitsType default_bandwidth = 100 5369 /** fixed leaf level when ZTP is not used */ 5370 const LevelType leaf_level = 0 5371 const LevelType default_level = leaf_level 5372 const PodType default_pod = 0 5373 const LinkIDType undefined_linkid = 0

5375 /** default distance used */ 5376 const MetricType default_distance = 1 5377 /** any distance larger than this will be considered infinity */ 5378 const MetricType infinite_distance = 0x7FFFFFFF 5379 /** represents invalid distance */ 5380 const MetricType invalid_distance = 0 5381 const bool overload_default = false 5382 const bool flood_reduction_default = true 5383 /** default LIE FSM holddown time */ 5384 const TimeIntervalInSecType default_lie_holdtime = 3 5385 /** default ZTP FSM holddown time */ 5386 const TimeIntervalInSecType default_ztp_holdtime = 1 5387 /** by default LIE levels are ZTP offers */ 5388 const bool default_not_a_ztp_offer = false 5389 /** by default everyone is repeating flooding */ 5390 const bool default_you_are_flood_repeater = true 5391 /** 0 is illegal for SystemID */ 5392 const SystemIDType IllegalSystemID = 0 5393 /** empty set of nodes */ 5394 const set<SystemIDType> empty_set_of_nodeids = {} 5395 /** default
lifetime of TIE is one week */ 5396 const LifeTimeInSecType default_lifetime = 604800 5397 /** default lifetime when TIEs are purged is 5 minutes */ 5398 const LifeTimeInSecType purge_lifetime = 300 5399 /** round down interval when TIEs are sent with security hashes 5400 to prevent excessive computation. */ 5401 const LifeTimeInSecType rounddown_lifetime_interval = 60 5402 /** any `TIEHeader` that has a smaller lifetime difference 5403 than this constant is equal (if other fields equal). This 5404 constant MUST be larger than `purge_lifetime` to avoid 5405 retransmissions */ 5406 const LifeTimeInSecType lifetime_diff2ignore = 400

5408 /** default UDP port to run LIEs on */ 5409 const UDPPortType default_lie_udp_port = 914 5410 /** default UDP port to receive TIEs on, that can be peer specific */ 5411 const UDPPortType default_tie_udp_flood_port = 915

5413 /** default MTU link size to use */ 5414 const MTUSizeType default_mtu_size = 1400 5415 /** default link being BFD capable */ 5416 const bool bfd_default = true

5418 /** undefined nonce, equivalent to missing nonce */ 5419 const NonceType undefined_nonce = 0; 5420 /** outer security key id, MUST be interpreted in implementation as unsigned */ 5421 typedef i8 OuterSecurityKeyID 5422 /** security key id, MUST be interpreted in implementation as unsigned */ 5423 typedef i32 TIESecurityKeyID 5424 /** undefined key */ 5425 const TIESecurityKeyID undefined_securitykey_id = 0; 5426 /** Maximum delta (negative or positive) that a mirrored nonce can 5427 deviate from local value to be considered valid. If nonces are 5428 changed every minute on both sides this opens statistically 5429 a `maximum_valid_nonce_delta` minutes window of identical LIEs, 5430 TIE, TI(x)E replays.
5431 The interval cannot be too small since the LIE FSM may change 5432 states fairly quickly during ZTP without sending LIEs */ 5433 const i16 maximum_valid_nonce_delta = 5; 5434 /** direction of tie */ 5435 enum TieDirectionType { 5436 Illegal = 0, 5437 South = 1, 5438 North = 2, 5439 DirectionMaxValue = 3, 5440 }

5442 /** address family */ 5443 enum AddressFamilyType { 5444 Illegal = 0, 5445 AddressFamilyMinValue = 1, 5446 IPv4 = 2, 5447 IPv6 = 3, 5448 AddressFamilyMaxValue = 4, 5449 }

5451 /** IP v4 prefix type */ 5452 struct IPv4PrefixType { 5453 1: required IPv4Address address; 5454 2: required PrefixLenType prefixlen; 5455 }

5457 /** IP v6 prefix type */ 5458 struct IPv6PrefixType { 5459 1: required IPv6Address address; 5460 2: required PrefixLenType prefixlen; 5461 }

5463 /** IP address type */ 5464 union IPAddressType { 5465 1: optional IPv4Address ipv4address; 5466 2: optional IPv6Address ipv6address; 5467 }

5469 /** prefix representing reachability.

5471 @note: for interface 5472 addresses the protocol can propagate the address part beyond 5473 the subnet mask and on reachability computation that has to 5474 be normalized. The non-significant bits can be used for operational 5475 purposes. 5476 */ 5477 union IPPrefixType { 5478 1: optional IPv4PrefixType ipv4prefix; 5479 2: optional IPv6PrefixType ipv6prefix; 5480 } 5481 /** sequence of a prefix when it moves 5482 */ 5483 struct PrefixSequenceType { 5484 1: required IEEE802_1ASTimeStampType timestamp; 5485 /** transaction ID set by the client, e.g. in 6LoWPAN */ 5486 2: optional PrefixTransactionIDType transactionid; 5487 }

5489 /** type of TIE.

5491 This enum indicates what TIE type the TIE is carrying. 5492 In case the value is not known to the receiver, 5493 the TIE MUST be re-flooded the same way as prefix TIEs.
This allows for 5494 future extensions of the protocol within the same schema major 5495 version with types opaque to some nodes. If however the flooding scope is not 5496 the same as for prefix TIEs, a major version revision MUST 5497 be performed. 5498 */ 5499 enum TIETypeType { 5500 Illegal = 0, 5501 TIETypeMinValue = 1, 5502 /** first legal value */ 5503 NodeTIEType = 2, 5504 PrefixTIEType = 3, 5505 PositiveDisaggregationPrefixTIEType = 4, 5506 NegativeDisaggregationPrefixTIEType = 5, 5507 PGPrefixTIEType = 6, 5508 KeyValueTIEType = 7, 5509 ExternalPrefixTIEType = 8, 5510 TIETypeMaxValue = 9, 5511 }

5513 /** RIFT route types.

5515 @note: route types MUST be ordered by their preference: 5516 PGP prefixes are most preferred, attracting 5517 traffic north (towards the spine) and then south; 5518 normal prefixes attract traffic south (towards the leaves), 5519 i.e. a prefix in a NORTH PREFIX TIE is preferred over one in a SOUTH PREFIX TIE.

5521 @note: The only purpose of those values is to introduce an 5522 ordering whereas an implementation can choose internally 5523 any other values as long as the ordering is preserved 5524 */ 5525 enum RouteType { 5526 Illegal = 0, 5527 RouteTypeMinValue = 1, 5528 /** first legal value. */ 5529 /** discard routes are most preferred */ 5530 Discard = 2,

5532 /** local prefixes are directly attached prefixes on the 5533 * system such as e.g. interface routes. 5534 */ 5535 LocalPrefix = 3, 5536 /** advertised in S-TIEs */ 5537 SouthPGPPrefix = 4, 5538 /** advertised in N-TIEs */ 5539 NorthPGPPrefix = 5, 5540 /** advertised in N-TIEs */ 5541 NorthPrefix = 6, 5542 /** externally imported north */ 5543 NorthExternalPrefix = 7, 5544 /** advertised in S-TIEs, either normal prefix or positive disaggregation */ 5545 SouthPrefix = 8, 5546 /** externally imported south */ 5547 SouthExternalPrefix = 9, 5548 /** negative, transitive prefixes are least preferred */ 5549 NegativeSouthPrefix = 10, 5550 RouteTypeMaxValue = 11, 5551 }

5553 B.2.
encoding.thrift 5555 /** 5556 Thrift file for packet encodings for RIFT 5557 */ 5559 /** Represents protocol encoding schema major version */ 5560 const common.VersionType protocol_major_version = 1 5561 /** Represents protocol encoding schema minor version */ 5562 const common.MinorVersionType protocol_minor_version = 0 5564 /** common RIFT packet header */ 5565 struct PacketHeader { 5566 /** major version type of protocol */ 5567 1: required common.VersionType major_version = protocol_major_version; 5568 /** minor version type of protocol */ 5569 2: required common.VersionType minor_version = protocol_minor_version; 5570 /** node sending the packet, in case of LIE/TIRE/TIDE 5571 * also the originator of it */ 5572 3: required common.SystemIDType sender; 5573 /** level of the node sending the packet, required on everything except 5574 * LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL and is used 5575 * in ZTP procedures. 5576 */ 5577 4: optional common.LevelType level; 5578 } 5580 /** community */ 5581 struct Community { 5582 1: required i32 top; 5583 2: required i32 bottom; 5584 } 5586 /** neighbor structure */ 5587 struct Neighbor { 5588 /** system ID of the originator */ 5589 1: required common.SystemIDType originator; 5590 /** ID of remote side of the link */ 5591 2: required common.LinkIDType remote_id; 5592 } 5594 /** capabilities the node supports. The schema may add to this 5595 field future capabilities to indicate whether it will support 5596 interpretation of future schema extensions on the same major 5597 revision. Such fields MUST be optional and have an implicit or 5598 explicit false default value. If a future capability changes route 5599 selection or generates blackholes if some nodes are not supporting 5600 it then a major version increment is unavoidable. 
5601 */ 5602 struct NodeCapabilities { 5603 /** can this node participate in flood reduction */ 5604 1: optional bool flood_reduction = 5605 common.flood_reduction_default; 5606 /** does this node restrict itself to be top-of-fabric or 5607 leaf only (in ZTP) and does it support leaf-2-leaf procedures */ 5608 2: optional common.HierarchyIndications hierarchy_indications; 5609 }

5611 /** link capabilities */ 5612 struct LinkCapabilities { 5613 /** indicates that the link's `local ID` can be used as its BFD 5614 * discriminator and the link supports BFD */ 5615 1: optional bool bfd = 5616 common.bfd_default; 5617 /** indicates whether the interface will support v4 forwarding. This MUST 5618 * be set to true when LIEs from a v4 address are sent and MAY be set 5619 * to true in LIEs on v6 address. If v4 and v6 LIEs indicate contradicting 5620 * information the behavior is unspecified. */ 5621 2: optional bool v4_forwarding_capable = 5622 true; 5623 }

5625 /** RIFT LIE packet

5627 @note this node's level is already included on the packet header */ 5628 struct LIEPacket { 5629 /** node or adjacency name */ 5630 1: optional string name; 5631 /** local link ID */ 5632 2: required common.LinkIDType local_id; 5633 /** UDP port on which we can receive flooded TIEs */ 5634 3: required common.UDPPortType flood_port = 5635 common.default_tie_udp_flood_port; 5636 /** layer 3 MTU, used to discover MTU mismatch. */ 5637 4: optional common.MTUSizeType link_mtu_size = 5638 common.default_mtu_size; 5639 /** local link bandwidth on the interface */ 5640 5: optional common.BandwithInMegaBitsType link_bandwidth = 5641 common.default_bandwidth; 5642 /** reflects the neighbor once received to provide 5643 3-way connectivity */ 5644 6: optional Neighbor neighbor; 5645 /** node's PoD */ 5646 7: optional common.PodType pod = 5647 common.default_pod; 5648 /** node capabilities shown in the LIE.
The capabilities 5649 MUST match the capabilities shown in the Node TIEs, otherwise 5650 the behavior is unspecified. A node detecting the mismatch 5651 SHOULD generate a corresponding error */ 5652 10: optional NodeCapabilities node_capabilities; 5653 /** capabilities of this link */ 5654 11: optional LinkCapabilities link_capabilities; 5655 /** required holdtime of the adjacency, i.e. how much time 5656 MUST expire without a LIE for the adjacency to drop */ 5657 12: required common.TimeIntervalInSecType holdtime = 5658 common.default_lie_holdtime; 5659 /** unsolicited, downstream assigned locally significant label 5660 value for the adjacency */ 5661 13: optional common.LabelType label; 5662 /** indicates that the level on the LIE MUST NOT be used 5663 to derive a ZTP level by the receiving node */ 5664 21: optional bool not_a_ztp_offer = 5665 common.default_not_a_ztp_offer; 5666 /** indicates to the northbound neighbor that it should 5667 be reflooding this node's N-TIEs to achieve flood reduction and 5668 balancing for northbound flooding. To be ignored if received from a 5669 northbound adjacency */

5671 22: optional bool you_are_flood_repeater = 5672 common.default_you_are_flood_repeater; 5673 /** can be optionally set to indicate to the neighbor that packet losses are seen on 5674 reception based on packet numbers or the rate is too high.
The receiver SHOULD 5675 temporarily slow down flooding rates 5676 */ 5677 23: optional bool you_are_sending_too_quickly = 5678 false; 5679 /** instance name in case multiple RIFT instances are running on the same interface */ 5680 24: optional string instance_name; 5681 }

5683 /** LinkID pair describes one of the parallel links between two nodes */ 5684 struct LinkIDPair { 5685 /** node-wide unique value for the local link */ 5686 1: required common.LinkIDType local_id; 5687 /** received remote link ID for this link */ 5688 2: required common.LinkIDType remote_id;

5690 /** describes the local interface index of the link */ 5691 10: optional common.PlatformInterfaceIndex platform_interface_index; 5692 /** describes the local interface name */ 5693 11: optional string platform_interface_name; 5694 /** indication whether the link is secured, i.e. protected by outer key; absence 5695 of this element means no indication, undefined outer key means not secured */ 5696 12: optional common.OuterSecurityKeyID trusted_outer_security_key; 5697 }

5699 /** ID of a TIE

5701 @note: TIEID space is a total order achieved by comparing the elements 5702 in the sequence defined and comparing each value as an 5703 unsigned integer of the according length. 5704 */ 5705 struct TIEID { 5706 /** direction of TIE */ 5707 1: required common.TieDirectionType direction; 5708 /** indicates originator of the TIE */ 5709 2: required common.SystemIDType originator; 5710 /** type of the tie */ 5711 3: required common.TIETypeType tietype; 5712 /** number of the tie */ 5713 4: required common.TIENrType tie_nr; 5714 }

5716 /** Header of a TIE.

5718 @note: TIEID space is a total order achieved by comparing the elements 5719 in the sequence defined and comparing each value as an 5720 unsigned integer of the according length.

5722 @note: After the sequence number the lifetime received on the envelope 5723 must be used for comparison before further fields.
5725 @note: `origination_time` and `origination_lifetime` are disregarded 5726 for comparison purposes and carried purely for debugging/security 5727 purposes if present. 5728 */ 5729 struct TIEHeader { 5730 /** ID of the tie */ 5731 2: required TIEID tieid; 5732 /** sequence number of the tie */ 5733 3: required common.SeqNrType seq_nr;

5735 /** absolute timestamp when the TIE 5736 was generated. This can be used on fabrics with 5737 synchronized clock to prevent lifetime modification attacks. */ 5738 10: optional common.IEEE802_1ASTimeStampType origination_time; 5739 /** original lifetime when the TIE 5740 was generated. This can be used on fabrics with 5741 synchronized clock to prevent lifetime modification attacks. */ 5742 12: optional common.LifeTimeInSecType origination_lifetime; 5743 }

5745 /** Header of a TIE as described in TIRE/TIDE. 5746 */ 5747 struct TIEHeaderWithLifeTime { 5748 1: required TIEHeader header; 5749 /** remaining lifetime that expires down to 0 just like in ISIS. 5750 TIEs with lifetimes differing by less than `lifetime_diff2ignore` MUST 5751 be considered EQUAL. */ 5752 2: required common.LifeTimeInSecType remaining_lifetime; 5753 }

5755 /** TIDE with sorted TIE headers; if headers are unsorted, behavior is undefined */ 5756 struct TIDEPacket { 5757 /** first TIE header in the tide packet */ 5758 1: required TIEID start_range; 5759 /** last TIE header in the tide packet */ 5760 2: required TIEID end_range; 5761 /** _sorted_ list of headers */ 5762 3: required list<TIEHeaderWithLifeTime> headers; 5763 }

5765 /** TIRE packet */ 5766 struct TIREPacket { 5767 1: required set<TIEHeaderWithLifeTime> headers; 5768 }

5770 /** neighbor of a node */ 5771 struct NodeNeighborsTIEElement { 5772 /** level of neighbor */ 5773 1: required common.LevelType level; 5774 /** Cost to neighbor.
5776 @note: All parallel links to the same node 5777 incur the same cost; in case the neighbor has multiple 5778 parallel links at different costs, the largest distance 5779 (highest numerical value) MUST be advertised 5780 @note: any neighbor with cost <= 0 MUST be ignored in computations */ 5781 3: optional common.MetricType cost = common.default_distance; 5782 /** can carry description of multiple parallel links in a TIE */ 5783 4: optional set<LinkIDPair> link_ids;

5785 /** total bandwidth to neighbor; this will normally be the sum of the 5786 bandwidths of all the parallel links. */ 5787 5: optional common.BandwithInMegaBitsType bandwidth = 5788 common.default_bandwidth; 5789 }

5791 /** Flags the node sets */ 5792 struct NodeFlags { 5793 /** indicates that the node is in overload, do not transit traffic through it */ 5794 1: optional bool overload = common.overload_default; 5795 }

5797 /** Description of a node.

5799 It may occur multiple times in different TIEs but if either 5800 * capabilities values do not match or 5801 * flags values do not match or 5802 * neighbors repeat with different values

5804 the behavior is undefined and a warning SHOULD be generated. 5805 Neighbors can however be distributed across multiple TIEs if 5806 the sets are disjoint. Miscablings SHOULD be repeated in every 5807 node TIE, otherwise the behavior is undefined.

5809 @note: observe that absence of fields implies defined defaults 5810 */ 5811 struct NodeTIEElement { 5812 /** level of the node */ 5813 1: required common.LevelType level; 5814 /** node's neighbors.
If a neighbor systemID repeats in other node TIEs of the 5815 same node the behavior is undefined */ 5816 2: required map<common.SystemIDType, NodeNeighborsTIEElement> neighbors;

5818 /** capabilities of the node */ 5819 3: optional NodeCapabilities capabilities; 5820 /** flags of the node */ 5821 4: optional NodeFlags flags; 5822 /** optional node name for easier operations */ 5823 5: optional string name; 5824 /** PoD to which the node belongs */ 5825 6: optional common.PodType pod;

5827 /** if any local links are miscabled, the indication is flooded */ 5828 10: optional set<common.LinkIDType> miscabled_links;

5830 }

5832 struct PrefixAttributes { 5833 /** distance of the prefix */ 5834 2: required common.MetricType metric = common.default_distance; 5835 /** generic unordered set of route tags, can be redistributed to other protocols or used 5836 within the context of real time analytics */ 5837 3: optional set<common.RouteTagType> tags; 5838 /** monotonic clock for mobile addresses */ 5839 4: optional common.PrefixSequenceType monotonic_clock; 5840 /** indicates if the interface is a node loopback */ 5841 6: optional bool loopback = false; 5842 /** indicates that the prefix is directly attached, i.e. should be routed to even if 5843 the node is in overload. */ 5844 7: optional bool directly_attached = true;

5846 /** in case of locally originated prefixes, i.e. interface addresses, this can 5847 describe which link the address belongs to. */ 5848 10: optional common.LinkIDType from_link; 5849 }

5851 /** TIE carrying prefixes */ 5852 struct PrefixTIEElement { 5853 /** prefixes with the associated attributes. 5854 if the same prefix repeats in multiple TIEs of the same node 5855 behavior is unspecified */ 5856 1: required map<common.IPPrefixType, PrefixAttributes> prefixes; 5857 }

5859 /** Generic key value pairs */ 5860 struct KeyValueTIEElement { 5861 /** if the same key repeats in multiple TIEs of the same node 5862 or with different values, behavior is unspecified */

5864 1: required map<common.KeyIDType, string> keyvalues; 5865 }

5867 /** single element in a TIE.
enum `common.TIETypeType` 5868 in TIEID indicates which elements MUST be present 5869 in the TIEElement. In case of a mismatch the unexpected 5870 elements MUST be ignored. In case an expected 5871 element is missing, an error MUST be reported and the TIE 5872 MUST be ignored.

5874 This type can be extended with new optional elements 5875 for new `common.TIETypeType` values without breaking 5876 the major version, but if it is necessary to understand whether 5877 all nodes support the new type a node capability must 5878 be added as well. 5879 */ 5880 union TIEElement { 5881 /** used in case of enum common.TIETypeType.NodeTIEType */ 5882 1: optional NodeTIEElement node; 5883 /** used in case of enum common.TIETypeType.PrefixTIEType */ 5884 2: optional PrefixTIEElement prefixes; 5885 /** positive prefixes (always southbound). 5886 This MUST NOT be advertised within a North TIE and MUST be ignored otherwise 5887 */ 5888 3: optional PrefixTIEElement positive_disaggregation_prefixes; 5889 /** transitive, negative prefixes (always southbound) which 5890 MUST be aggregated and propagated 5891 according to the specification 5892 southwards towards lower levels to heal 5893 pathological upper level partitioning, otherwise 5894 blackholes may occur in multiplane fabrics. 5895 It MUST NOT be advertised within a North TIE.
5896 */ 5897 4: optional PrefixTIEElement negative_disaggregation_prefixes; 5898 /** externally reimported prefixes */ 5899 5: optional PrefixTIEElement external_prefixes; 5900 /** Key-Value store elements */ 5901 6: optional KeyValueTIEElement keyvalues; 5902 }

5904 /** TIE packet */ 5905 struct TIEPacket { 5906 1: required TIEHeader header; 5907 2: required TIEElement element; 5908 }

5910 /** content of a RIFT packet */ 5911 union PacketContent { 5912 1: optional LIEPacket lie; 5913 2: optional TIDEPacket tide; 5914 3: optional TIREPacket tire; 5915 4: optional TIEPacket tie; 5916 }

5918 /** RIFT packet structure */ 5919 struct ProtocolPacket { 5920 1: required PacketHeader header; 5921 2: required PacketContent content; 5922 }

5924 Appendix C. Finite State Machines and Precise Operational 5925 Specifications

5927 Some FSM figures are provided as [DOT] descriptions due to limitations 5928 of ASCII art.

5930 The On Entry action is performed every time, right before the 5931 according state is entered, i.e. after any transitions from the previous 5932 state.

5934 The On Exit action is performed every time, immediately when a state 5935 is exited, i.e. before any transitions towards the target state are 5936 performed.

5938 Any attempt to transition from a state towards another on reception 5939 of an event where no action is specified MUST be considered an 5940 unrecoverable error.

5942 The FSMs and procedures are NOT normative in the sense that an 5943 implementation MUST implement them literally (which would be 5944 overspecification) but an implementation MUST exhibit externally 5945 observable behavior that is identical to the execution of the 5946 specified FSMs.

5948 Where an FSM representation is inconvenient, i.e. the amount of 5949 procedures and kept state exceeds the number of transitions, we defer 5950 to a more procedural description of data structures.

5952 C.1. LIE FSM

5954 Initial state is `OneWay`.
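For orientation, the three LIE FSM states and a handful of representative transitions can be tabulated in Python; this sketch is ours, transcribed from the [DOT] figure of this section, and omits actions, on-entry/on-exit processing and most events:

```python
# States of the LIE FSM; the initial state is OneWay.
ONE_WAY, TWO_WAY, THREE_WAY = "OneWay", "TwoWay", "ThreeWay"

# A few representative transitions transcribed from the DOT description
# (illustrative only; the full FSM attaches actions to every edge).
TRANSITIONS = {
    (ONE_WAY, "NewNeighbor"): TWO_WAY,
    (ONE_WAY, "ValidReflection"): THREE_WAY,
    (TWO_WAY, "ValidReflection"): THREE_WAY,
    (TWO_WAY, "HoldtimeExpired"): ONE_WAY,
    (TWO_WAY, "MultipleNeighbors"): ONE_WAY,
    (THREE_WAY, "HoldtimeExpired"): ONE_WAY,
    (THREE_WAY, "MultipleNeighbors"): ONE_WAY,
    (THREE_WAY, "LevelChanged"): ONE_WAY,
}

def step(state, event):
    """Return the next state; events not listed here (e.g. TimerTick,
    LieRcvd, SendLie) are self-transitions in this simplified view."""
    return TRANSITIONS.get((state, event), state)
```

Note that in the normative FSMs an event without a specified action is an unrecoverable error; the default self-transition above is purely a simplification for illustration.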
5956 Event `MultipleNeighbors` occurs normally when more than two nodes 5957 see each other on the same link or a remote node is quickly 5958 reconfigured or rebooted without regressing to `OneWay` first. Each 5959 occurrence of the event SHOULD generate a clear, corresponding 5960 notification to help operational deployments. 5962 The machine sends LIEs on several transitions to accelerate adjacency 5963 bring-up without waiting for the timer tick. 5965 digraph Ga556dde74c30450aae125eaebc33bd57 { 5966 Nd16ab5092c6b421c88da482eb4ae36b6[label="ThreeWay"][shape="oval"]; 5967 N54edd2b9de7641688608f44fca346303[label="OneWay"][shape="oval"]; 5968 Nfeef2e6859ae4567bd7613a32cc28c0e[label="TwoWay"][shape="oval"]; 5969 N7f2bb2e04270458cb5c9bb56c4b96e23[label="Enter"][style="invis"][shape="plain"]; 5970 N292744a4097f492f8605c926b924616b[label="Enter"][style="dashed"][shape="plain"]; 5971 Nc48847ba98e348efb45f5b78f4a5c987[label="Exit"][style="invis"][shape="plain"]; 5972 Nd16ab5092c6b421c88da482eb4ae36b6 -> N54edd2b9de7641688608f44fca346303 5973 [label="|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|\n|MultipleNeighbors|"] 5974 [color="black"][arrowhead="normal" dir="both" arrowtail="none"]; 5975 Nd16ab5092c6b421c88da482eb4ae36b6 -> Nd16ab5092c6b421c88da482eb4ae36b6 5976 [label="|TimerTick|\n|LieRcvd|\n|SendLie|"][color="black"] 5977 [arrowhead="normal" dir="both" arrowtail="none"]; 5978 Nfeef2e6859ae4567bd7613a32cc28c0e -> Nfeef2e6859ae4567bd7613a32cc28c0e 5979 [label="|TimerTick|\n|LieRcvd|\n|SendLie|"][color="black"] 5980 [arrowhead="normal" dir="both" arrowtail="none"]; 5981 N54edd2b9de7641688608f44fca346303 -> Nd16ab5092c6b421c88da482eb4ae36b6 5982 [label="|ValidReflection|"][color="red"][arrowhead="normal" dir="both" arrowtail="none"]; 5983 Nd16ab5092c6b421c88da482eb4ae36b6 -> Nd16ab5092c6b421c88da482eb4ae36b6 5984 [label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"][color="blue"] 5985
[arrowhead="normal" dir="both" arrowtail="none"]; 5986 Nd16ab5092c6b421c88da482eb4ae36b6 -> Nd16ab5092c6b421c88da482eb4ae36b6 5987 [label="|ValidReflection|"][color="red"][arrowhead="normal" dir="both" arrowtail="none"]; 5988 Nfeef2e6859ae4567bd7613a32cc28c0e -> N54edd2b9de7641688608f44fca346303 5989 [label="|LevelChanged|"][color="blue"][arrowhead="normal" dir="both" arrowtail="none"]; 5990 Nfeef2e6859ae4567bd7613a32cc28c0e -> N54edd2b9de7641688608f44fca346303 5991 [label="|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|\n|MultipleNeighbors|"] 5992 [color="black"][arrowhead="normal" dir="both" arrowtail="none"]; 5993 Nfeef2e6859ae4567bd7613a32cc28c0e -> Nd16ab5092c6b421c88da482eb4ae36b6 5994 [label="|ValidReflection|"][color="red"][arrowhead="normal" dir="both" arrowtail="none"]; 5995 N54edd2b9de7641688608f44fca346303 -> N54edd2b9de7641688608f44fca346303 5996 [label="|TimerTick|\n|LieRcvd|\n|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|\n|SendLie|"] 5997 [color="black"][arrowhead="normal" dir="both" arrowtail="none"]; 5998 N292744a4097f492f8605c926b924616b -> N54edd2b9de7641688608f44fca346303 5999 [label=""][color="black"][arrowhead="normal" dir="both" arrowtail="none"]; 6000 Nd16ab5092c6b421c88da482eb4ae36b6 -> N54edd2b9de7641688608f44fca346303 6001 [label="|LevelChanged|"][color="blue"][arrowhead="normal" dir="both" arrowtail="none"]; 6002 N54edd2b9de7641688608f44fca346303 -> Nfeef2e6859ae4567bd7613a32cc28c0e 6003 [label="|NewNeighbor|"][color="black"][arrowhead="normal" dir="both" arrowtail="none"]; 6004 N54edd2b9de7641688608f44fca346303 -> N54edd2b9de7641688608f44fca346303 6005 [label="|LevelChanged|\n|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"] 6006 [color="blue"][arrowhead="normal" dir="both" arrowtail="none"]; 6007 Nfeef2e6859ae4567bd7613a32cc28c0e -> Nfeef2e6859ae4567bd7613a32cc28c0e 6008 
[label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"] 6009 [color="blue"][arrowhead="normal" dir="both" arrowtail="none"]; 6010 Nd16ab5092c6b421c88da482eb4ae36b6 -> Nfeef2e6859ae4567bd7613a32cc28c0e 6011 [label="|NeighborDroppedReflection|"] 6012 [color="red"][arrowhead="normal" dir="both" arrowtail="none"]; 6013 N54edd2b9de7641688608f44fca346303 -> N54edd2b9de7641688608f44fca346303 6014 [label="|NeighborDroppedReflection|"][color="red"] 6015 [arrowhead="normal" dir="both" arrowtail="none"]; 6016 } 6018 LIE FSM DOT 6020 .. To be updated .. 6022 LIE FSM Figure 6024 Events 6026 o TimerTick: one second timer tick 6028 o LevelChanged: node's level has been changed by ZTP or 6029 configuration 6031 o HALChanged: best HAL computed by ZTP has changed 6033 o HATChanged: HAT computed by ZTP has changed 6035 o HALSChanged: set of HAL offering systems computed by ZTP has 6036 changed 6038 o LieRcvd: received LIE 6040 o NewNeighbor: new neighbor parsed 6042 o ValidReflection: received own reflection from neighbor 6044 o NeighborDroppedReflection: lost previous own reflection from 6045 neighbor 6047 o NeighborChangedLevel: neighbor changed advertised level 6049 o NeighborChangedAddress: neighbor changed IP address 6051 o UnacceptableHeader: unacceptable header seen 6052 o MTUMismatch: MTU mismatched 6054 o PODMismatch: Unacceptable PoD seen 6056 o HoldtimeExpired: adjacency hold down expired 6058 o MultipleNeighbors: more than one neighbor seen on interface 6060 o SendLie: send a LIE out 6062 o UpdateZTPOffer: update this node's ZTP offer 6064 Actions 6066 on TimerTick in TwoWay finishes in TwoWay: PUSH SendLie event, if 6067 holdtime expired PUSH HoldtimeExpired event 6069 on HALChanged in TwoWay finishes in TwoWay: store new HAL 6071 on MTUMismatch in ThreeWay finishes in OneWay: no action 6073 on HALChanged in ThreeWay finishes in ThreeWay: store new HAL 6075 on ValidReflection in TwoWay finishes in ThreeWay: no action 6077 on ValidReflection in OneWay
finishes in ThreeWay: no action 6079 on NeighborDroppedReflection in ThreeWay finishes in TwoWay: no 6080 action 6082 on LieRcvd in ThreeWay finishes in ThreeWay: PROCESS_LIE 6084 on MultipleNeighbors in TwoWay finishes in OneWay: no action 6086 on UnacceptableHeader in ThreeWay finishes in OneWay: no action 6088 on MTUMismatch in TwoWay finishes in OneWay: no action 6090 on LevelChanged in OneWay finishes in OneWay: update level with 6091 event value, PUSH SendLie event 6093 on UnacceptableHeader in TwoWay finishes in OneWay: no action 6095 on HALSChanged in TwoWay finishes in TwoWay: store HALS 6097 on UpdateZTPOffer in TwoWay finishes in TwoWay: send offer to ZTP 6098 FSM 6099 on NeighborChangedLevel in TwoWay finishes in OneWay: no action 6101 on NewNeighbor in OneWay finishes in TwoWay: PUSH SendLie event 6103 on NeighborChangedAddress in ThreeWay finishes in OneWay: no 6104 action 6106 on HALChanged in OneWay finishes in OneWay: store new HAL 6108 on NeighborChangedLevel in OneWay finishes in OneWay: no action 6110 on HoldtimeExpired in TwoWay finishes in OneWay: no action 6112 on SendLie in TwoWay finishes in TwoWay: SEND_LIE 6114 on LevelChanged in TwoWay finishes in OneWay: update level with 6115 event value 6117 on NeighborChangedAddress in OneWay finishes in OneWay: no action 6119 on HATChanged in TwoWay finishes in TwoWay: store HAT 6121 on LieRcvd in TwoWay finishes in TwoWay: PROCESS_LIE 6123 on MultipleNeighbors in ThreeWay finishes in OneWay: no action 6125 on MTUMismatch in OneWay finishes in OneWay: no action 6127 on SendLie in OneWay finishes in OneWay: SEND_LIE 6129 on LieRcvd in OneWay finishes in OneWay: PROCESS_LIE 6131 on TimerTick in ThreeWay finishes in ThreeWay: PUSH SendLie event, 6132 if holdtime expired PUSH HoldtimeExpired event 6134 on TimerTick in OneWay finishes in OneWay: PUSH SendLie event 6136 on PODMismatch in ThreeWay finishes in OneWay: no action 6138 on LevelChanged in ThreeWay finishes in OneWay: update level with 6139 
event value 6141 on NeighborChangedLevel in ThreeWay finishes in OneWay: no action 6143 on UpdateZTPOffer in OneWay finishes in OneWay: send offer to ZTP 6144 FSM 6145 on UpdateZTPOffer in ThreeWay finishes in ThreeWay: send offer to 6146 ZTP FSM 6148 on HATChanged in OneWay finishes in OneWay: store HAT 6150 on HATChanged in ThreeWay finishes in ThreeWay: store HAT 6152 on HoldtimeExpired in OneWay finishes in OneWay: no action 6154 on UnacceptableHeader in OneWay finishes in OneWay: no action 6156 on PODMismatch in OneWay finishes in OneWay: no action 6158 on SendLie in ThreeWay finishes in ThreeWay: SEND_LIE 6160 on NeighborChangedAddress in TwoWay finishes in OneWay: no action 6162 on ValidReflection in ThreeWay finishes in ThreeWay: no action 6164 on HALSChanged in OneWay finishes in OneWay: store HALS 6166 on HoldtimeExpired in ThreeWay finishes in OneWay: no action 6168 on HALSChanged in ThreeWay finishes in ThreeWay: store HALS 6170 on NeighborDroppedReflection in OneWay finishes in OneWay: no 6171 action 6173 on PODMismatch in TwoWay finishes in OneWay: no action 6175 on Entry into OneWay: CLEANUP 6177 Following words are used for well known procedures: 6179 1. PUSH Event: pushes an event to be executed by the FSM upon exit 6180 of this action 6182 2. CLEANUP: neighbor MUST be reset to unknown 6184 3. SEND_LIE: create a new LIE packet 6186 1. reflecting the neighbor if known and valid and 6188 2. setting the necessary `not_a_ztp_offer` variable if level was 6189 derived from last known neighbor on this interface and 6191 3. setting `you_are_not_flood_repeater` to computed value 6193 4. PROCESS_LIE: 6195 1. if lie has wrong major version OR our own system ID or 6196 invalid system ID then CLEANUP else 6198 2. if lie has non matching MTUs then CLEANUP, PUSH 6199 UpdateZTPOffer, PUSH MTUMismatch else 6201 3. if PoD rules do not allow adjacency forming then CLEANUP, 6202 PUSH PODMismatch, PUSH MTUMismatch else 6204 4. 
if lie has undefined level OR my level is undefined OR this 6205 node is leaf and remote level lower than HAT OR (lie's level 6206 is not leaf AND its difference is more than one from my 6207 level) then CLEANUP, PUSH UpdateZTPOffer, PUSH 6208 UnacceptableHeader else 6210 5. PUSH UpdateZTPOffer, construct temporary new neighbor 6211 structure with values from lie, if no current neighbor exists 6212 then set neighbor to new neighbor, PUSH NewNeighbor event, 6213 CHECK_THREE_WAY else 6215 1. if current neighbor system ID differs from lie's system 6216 ID then PUSH MultipleNeighbors else 6218 2. if current neighbor stored level differs from lie's level 6219 then PUSH NeighborChangedLevel else 6221 3. if current neighbor stored IPv4/v6 address differs from 6222 lie's address then PUSH NeighborChangedAddress else 6224 4. if any of neighbor's flood address port, name, local 6225 linkid changed then PUSH NeighborChangedMinorFields and 6227 5. CHECK_THREE_WAY 6229 5. CHECK_THREE_WAY: if current state is one-way do nothing else 6231 1. if lie packet does not contain neighbor then if current state 6232 is three-way then PUSH NeighborDroppedReflection else 6234 2. if packet reflects this system's ID and local port and state 6235 is three-way then PUSH event ValidReflection else PUSH event 6236 MultipleNeighbors 6238 C.2. ZTP FSM 6240 Initial state is ComputeBestOffer. 
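The CHECK_THREE_WAY procedure used by the PROCESS_LIE descriptions in this appendix condenses into a few conditionals. The following non-normative Python sketch shows one reasonable reading of the text; `neighbor`, `originator` and `remote_id` are hypothetical field names standing in for the reflected-neighbor element of the received LIE.

```python
def check_three_way(state, lie, my_system_id, my_link_id):
    """Sketch of CHECK_THREE_WAY; returns the event to PUSH, or None.

    `lie.neighbor`, `originator` and `remote_id` are hypothetical
    field names chosen for this sketch.
    """
    if state == "OneWay":
        return None  # "if current state is one-way do nothing"
    if lie.neighbor is None:
        # LIE packet does not contain a neighbor element
        return "NeighborDroppedReflection" if state == "ThreeWay" else None
    if (lie.neighbor.originator == my_system_id
            and lie.neighbor.remote_id == my_link_id):
        return "ValidReflection"   # packet reflects this system
    return "MultipleNeighbors"
```

In PROCESS_LIE the returned event would be PUSHed onto the FSM's event queue and executed upon exit of the current action.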
6242 digraph Gd436cc3ced8c471eb30bd4f3ac946261 { 6243 N06108ba9ac894d988b3e4e8ea5ace007 6244 [label="Enter"] 6245 [style="invis"] 6246 [shape="plain"]; 6247 Na47ff5eac9aa4b2eaf12839af68aab1f 6248 [label="MultipleNeighborsWait"] 6249 [shape="oval"]; 6250 N57a829be68e2489d8dc6b84e10597d0b 6251 [label="OneWay"] 6252 [shape="oval"]; 6253 Na641d400819a468d987e31182cdb013e 6254 [label="ThreeWay"] 6255 [shape="oval"]; 6256 Necfbfc2d8e5b482682ee66e604450c7b 6257 [label="Enter"] 6258 [style="dashed"] 6259 [shape="plain"]; 6260 N16db54bf2c5d48f093ad6c18e70081ee 6261 [label="TwoWay"] 6262 [shape="oval"]; 6263 N1b89016876b44cc1b9c1e4a735769560 6264 [label="Exit"] 6265 [style="invis"] 6266 [shape="plain"]; 6267 N16db54bf2c5d48f093ad6c18e70081ee -> N57a829be68e2489d8dc6b84e10597d0b 6268 [label="|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|"] 6269 [color="black"] 6270 [arrowhead="normal" dir="both" arrowtail="none"]; 6271 N57a829be68e2489d8dc6b84e10597d0b -> N57a829be68e2489d8dc6b84e10597d0b 6272 [label="|NeighborDroppedReflection|"] 6273 [color="red"] 6274 [arrowhead="normal" dir="both" arrowtail="none"]; 6275 N57a829be68e2489d8dc6b84e10597d0b -> Na47ff5eac9aa4b2eaf12839af68aab1f 6276 [label="|MultipleNeighbors|"] 6277 [color="black"] 6278 [arrowhead="normal" dir="both" arrowtail="none"]; 6279 Necfbfc2d8e5b482682ee66e604450c7b -> N57a829be68e2489d8dc6b84e10597d0b 6280 [label=""] 6281 [color="black"] 6282 [arrowhead="normal" dir="both" arrowtail="none"]; 6283 N57a829be68e2489d8dc6b84e10597d0b -> N16db54bf2c5d48f093ad6c18e70081ee 6284 [label="|NewNeighbor|"] 6285 [color="black"] 6286 [arrowhead="normal" dir="both" arrowtail="none"]; 6287 Na641d400819a468d987e31182cdb013e -> Na47ff5eac9aa4b2eaf12839af68aab1f 6288 [label="|MultipleNeighbors|"] 6289 [color="black"] 6290 [arrowhead="normal" dir="both" arrowtail="none"]; 6291 N16db54bf2c5d48f093ad6c18e70081ee -> N16db54bf2c5d48f093ad6c18e70081ee 6292 
[label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"] 6293 [color="blue"] 6294 [arrowhead="normal" dir="both" arrowtail="none"]; 6295 Na641d400819a468d987e31182cdb013e -> N16db54bf2c5d48f093ad6c18e70081ee 6296 [label="|NeighborDroppedReflection|"] 6297 [color="red"] 6298 [arrowhead="normal" dir="both" arrowtail="none"]; 6299 Na47ff5eac9aa4b2eaf12839af68aab1f -> Na47ff5eac9aa4b2eaf12839af68aab1f 6300 [label="|TimerTick|\n|MultipleNeighbors|"] 6301 [color="black"] 6302 [arrowhead="normal" dir="both" arrowtail="none"]; 6303 N57a829be68e2489d8dc6b84e10597d0b -> N57a829be68e2489d8dc6b84e10597d0b 6304 [label="|LevelChanged|\n|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"] 6305 [color="blue"] 6306 [arrowhead="normal" dir="both" arrowtail="none"]; 6307 Na641d400819a468d987e31182cdb013e -> Na641d400819a468d987e31182cdb013e 6308 [label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"] 6309 [color="blue"] 6310 [arrowhead="normal" dir="both" arrowtail="none"]; 6311 Na641d400819a468d987e31182cdb013e -> N57a829be68e2489d8dc6b84e10597d0b 6312 [label="|NeighborChangedLevel|\n|NeighborChangedAddress|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|"] 6313 [color="black"] 6314 [arrowhead="normal" dir="both" arrowtail="none"]; 6315 Na47ff5eac9aa4b2eaf12839af68aab1f -> Na47ff5eac9aa4b2eaf12839af68aab1f 6316 [label="|HALChanged|\n|HATChanged|\n|HALSChanged|\n|UpdateZTPOffer|"] 6317 [color="blue"] 6318 [arrowhead="normal" dir="both" arrowtail="none"]; 6319 N16db54bf2c5d48f093ad6c18e70081ee -> N57a829be68e2489d8dc6b84e10597d0b 6320 [label="|LevelChanged|"] 6321 [color="blue"] 6322 [arrowhead="normal" dir="both" arrowtail="none"]; 6323 Na641d400819a468d987e31182cdb013e -> N57a829be68e2489d8dc6b84e10597d0b 6324 [label="|LevelChanged|"] 6325 [color="blue"] 6326 [arrowhead="normal" dir="both" arrowtail="none"]; 6327 N16db54bf2c5d48f093ad6c18e70081ee -> Na47ff5eac9aa4b2eaf12839af68aab1f 6328 [label="|MultipleNeighbors|"] 6329 
[color="black"] 6330 [arrowhead="normal" dir="both" arrowtail="none"]; 6331 Na47ff5eac9aa4b2eaf12839af68aab1f -> N57a829be68e2489d8dc6b84e10597d0b 6332 [label="|MultipleNeighborsDone|"] 6333 [color="black"] 6334 [arrowhead="normal" dir="both" arrowtail="none"]; 6335 N16db54bf2c5d48f093ad6c18e70081ee -> Na641d400819a468d987e31182cdb013e 6336 [label="|ValidReflection|"] 6337 [color="red"] 6338 [arrowhead="normal" dir="both" arrowtail="none"]; 6339 Na47ff5eac9aa4b2eaf12839af68aab1f -> N57a829be68e2489d8dc6b84e10597d0b 6340 [label="|LevelChanged|"] 6341 [color="blue"] 6342 [arrowhead="normal" dir="both" arrowtail="none"]; 6343 Na641d400819a468d987e31182cdb013e -> Na641d400819a468d987e31182cdb013e 6344 [label="|TimerTick|\n|LieRcvd|\n|SendLie|"] 6345 [color="black"] 6346 [arrowhead="normal" dir="both" arrowtail="none"]; 6347 N57a829be68e2489d8dc6b84e10597d0b -> N57a829be68e2489d8dc6b84e10597d0b 6348 [label="|TimerTick|\n|LieRcvd|\n|NeighborChangedLevel|\n|NeighborChangedAddress|\n|NeighborAddressAdded|\n|UnacceptableHeader|\n|MTUMismatch|\n|PODMismatch|\n|HoldtimeExpired|\n|SendLie|"] 6349 [color="black"] 6350 [arrowhead="normal" dir="both" arrowtail="none"]; 6351 N57a829be68e2489d8dc6b84e10597d0b -> Na641d400819a468d987e31182cdb013e 6352 [label="|ValidReflection|"] 6353 [color="red"] 6354 [arrowhead="normal" dir="both" arrowtail="none"]; 6355 N16db54bf2c5d48f093ad6c18e70081ee -> N16db54bf2c5d48f093ad6c18e70081ee 6356 [label="|TimerTick|\n|LieRcvd|\n|SendLie|"] 6357 [color="black"] 6358 [arrowhead="normal" dir="both" arrowtail="none"]; 6359 Na641d400819a468d987e31182cdb013e -> Na641d400819a468d987e31182cdb013e 6360 [label="|ValidReflection|"] 6361 [color="red"] 6362 [arrowhead="normal" dir="both" arrowtail="none"]; 6363 } 6365 ZTP FSM DOT 6367 Events 6369 o TimerTick: one second timer tick 6371 o LevelChanged: node's level has been changed by ZTP or 6372 configuration 6374 o HALChanged: best HAL computed by ZTP has changed 6376 o HATChanged: HAT computed by ZTP has
changed 6377 o HALSChanged: set of HAL offering systems computed by ZTP has 6378 changed 6380 o LieRcvd: received LIE 6382 o NewNeighbor: new neighbor parsed 6384 o ValidReflection: received own reflection from neighbor 6386 o NeighborDroppedReflection: lost previous own reflection from 6387 neighbor 6389 o NeighborChangedLevel: neighbor changed advertised level 6391 o NeighborChangedAddress: neighbor changed IP address 6393 o UnacceptableHeader: unacceptable header seen 6395 o MTUMismatch: MTU mismatched 6397 o PODMismatch: Unacceptable PoD seen 6399 o HoldtimeExpired: adjacency hold down expired 6401 o MultipleNeighbors: more than one neighbor seen on interface 6403 o MultipleNeighborsDone: cooldown for multiple neighbors expired 6405 o SendLie: send a LIE out 6407 o UpdateZTPOffer: update this node's ZTP offer 6409 Actions 6411 on MTUMismatch in OneWay finishes in OneWay: no action 6413 on HoldtimeExpired in OneWay finishes in OneWay: no action 6415 on LevelChanged in ThreeWay finishes in OneWay: update level with 6416 event value 6418 on MultipleNeighbors in MultipleNeighborsWait finishes in 6419 MultipleNeighborsWait: start multiple neighbors timer as 4 * 6420 DEFAULT_LIE_HOLDTIME 6422 on HALChanged in MultipleNeighborsWait finishes in 6423 MultipleNeighborsWait: store new HAL 6424 on NeighborChangedAddress in ThreeWay finishes in OneWay: no 6425 action 6427 on ValidReflection in OneWay finishes in ThreeWay: no action 6429 on MTUMismatch in TwoWay finishes in OneWay: no action 6431 on TimerTick in MultipleNeighborsWait finishes in 6432 MultipleNeighborsWait: decrement MultipleNeighbors timer, if 6433 expired PUSH MultipleNeighborsDone 6435 on MultipleNeighborsDone in MultipleNeighborsWait finishes in 6436 OneWay: no action 6439 on HATChanged in ThreeWay finishes in ThreeWay: store HAT 6441 on UpdateZTPOffer in TwoWay finishes in TwoWay: send offer to ZTP 6442 FSM 6444 on HALSChanged in TwoWay
finishes in TwoWay: store HALS 6446 on PODMismatch in TwoWay finishes in OneWay: no action 6448 on LieRcvd in TwoWay finishes in TwoWay: PROCESS_LIE 6450 on PODMismatch in ThreeWay finishes in OneWay: no action 6452 on TimerTick in TwoWay finishes in TwoWay: PUSH SendLie event, if 6453 holdtime expired PUSH HoldtimeExpired event 6455 on SendLie in TwoWay finishes in TwoWay: SEND_LIE 6457 on SendLie in OneWay finishes in OneWay: SEND_LIE 6459 on TimerTick in OneWay finishes in OneWay: PUSH SendLie event 6461 on HALChanged in OneWay finishes in OneWay: store new HAL 6463 on HALSChanged in ThreeWay finishes in ThreeWay: store HALS 6465 on NeighborChangedLevel in TwoWay finishes in OneWay: no action 6467 on PODMismatch in OneWay finishes in OneWay: no action 6469 on HoldtimeExpired in TwoWay finishes in OneWay: no action 6470 on TimerTick in ThreeWay finishes in ThreeWay: PUSH SendLie event, 6471 if holdtime expired PUSH HoldtimeExpired event 6473 on MultipleNeighbors in TwoWay finishes in MultipleNeighborsWait: 6474 start multiple neighbors timer as 4 * DEFAULT_LIE_HOLDTIME 6476 on UpdateZTPOffer in MultipleNeighborsWait finishes in 6477 MultipleNeighborsWait: send offer to ZTP FSM 6479 on LieRcvd in OneWay finishes in OneWay: PROCESS_LIE 6481 on LevelChanged in MultipleNeighborsWait finishes in OneWay: 6482 update level with event value 6484 on UpdateZTPOffer in ThreeWay finishes in ThreeWay: send offer to 6485 ZTP FSM 6487 on HALChanged in TwoWay finishes in TwoWay: store new HAL 6489 on UnacceptableHeader in OneWay finishes in OneWay: no action 6491 on HALSChanged in OneWay finishes in OneWay: store HALS 6493 on HALSChanged in MultipleNeighborsWait finishes in 6494 MultipleNeighborsWait: store HALS 6496 on SendLie in ThreeWay finishes in ThreeWay: SEND_LIE 6498 on MTUMismatch in ThreeWay finishes in OneWay: no action 6500 on HATChanged in MultipleNeighborsWait finishes in 6501 MultipleNeighborsWait: store HAT 6503 on NeighborChangedAddress in OneWay finishes in 
OneWay: no action 6505 on ValidReflection in TwoWay finishes in ThreeWay: no action 6507 on MultipleNeighbors in OneWay finishes in MultipleNeighborsWait: 6508 start multiple neighbors timer as 4 * DEFAULT_LIE_HOLDTIME 6510 on NeighborChangedLevel in OneWay finishes in OneWay: no action 6512 on HATChanged in OneWay finishes in OneWay: store HAT 6514 on NeighborDroppedReflection in OneWay finishes in OneWay: no 6515 action 6517 on HALChanged in ThreeWay finishes in ThreeWay: store new HAL 6518 on NeighborAddressAdded in OneWay finishes in OneWay: no action 6520 on NeighborChangedAddress in TwoWay finishes in OneWay: no action 6522 on LieRcvd in ThreeWay finishes in ThreeWay: PROCESS_LIE 6524 on UnacceptableHeader in TwoWay finishes in OneWay: no action 6526 on LevelChanged in TwoWay finishes in OneWay: update level with 6527 event value 6529 on HATChanged in TwoWay finishes in TwoWay: store HAT 6531 on UpdateZTPOffer in OneWay finishes in OneWay: send offer to ZTP 6532 FSM 6534 on ValidReflection in ThreeWay finishes in ThreeWay: no action 6536 on UnacceptableHeader in ThreeWay finishes in OneWay: no action 6538 on HoldtimeExpired in ThreeWay finishes in OneWay: no action 6540 on NeighborChangedLevel in ThreeWay finishes in OneWay: no action 6542 on LevelChanged in OneWay finishes in OneWay: update level with 6543 event value, PUSH SendLie event 6545 on NewNeighbor in OneWay finishes in TwoWay: PUSH SendLie event 6547 on NeighborDroppedReflection in ThreeWay finishes in TwoWay: no 6548 action 6550 on MultipleNeighbors in ThreeWay finishes in 6551 MultipleNeighborsWait: start multiple neighbors timer as 4 * 6552 DEFAULT_LIE_HOLDTIME 6554 on Entry into OneWay: CLEANUP 6556 Following words are used for well known procedures: 6558 1. PUSH Event: pushes an event to be executed by the FSM upon exit 6559 of this action 6561 2. CLEANUP: neighbor MUST be reset to unknown 6563 3. SEND_LIE: create a new LIE packet 6565 1. reflecting the neighbor if known and valid and 6566 2. 
setting the necessary `not_a_ztp_offer` variable if level was 6567 derived from last known neighbor on this interface and 6569 3. setting `you_are_not_flood_repeater` to computed value 6571 4. PROCESS_LIE: 6573 1. if lie has wrong major version OR our own system ID or 6574 invalid system ID then CLEANUP else 6576 2. if lie has non matching MTUs then CLEANUP, PUSH 6577 UpdateZTPOffer, PUSH MTUMismatch else 6579 3. if PoD rules do not allow adjacency forming then CLEANUP, 6580 PUSH PODMismatch, PUSH MTUMismatch else 6582 4. if lie has undefined level OR my level is undefined OR this 6583 node is leaf and remote level lower than HAT OR (lie's level 6584 is not leaf AND its difference is more than one from my 6585 level) then CLEANUP, PUSH UpdateZTPOffer, PUSH 6586 UnacceptableHeader else 6588 5. PUSH UpdateZTPOffer, construct temporary new neighbor 6589 structure with values from lie, if no current neighbor exists 6590 then set neighbor to new neighbor, PUSH NewNeighbor event, 6591 CHECK_THREE_WAY else 6593 1. if current neighbor system ID differs from lie's system 6594 ID then PUSH MultipleNeighbors else 6596 2. if current neighbor stored level differs from lie's level 6597 then PUSH NeighborChangedLevel else 6599 3. if current neighbor stored IPv4/v6 address differs from 6600 lie's address then PUSH NeighborChangedAddress else 6602 4. if any of neighbor's flood address port, name, local 6603 linkid changed then PUSH NeighborChangedMinorFields and 6605 5. CHECK_THREE_WAY 6607 5. CHECK_THREE_WAY: if current state is one-way do nothing else 6609 1. if lie packet does not contain neighbor then if current state 6610 is three-way then PUSH NeighborDroppedReflection else 6612 2. if packet reflects this system's ID and local port and state 6613 is three-way then PUSH event ValidReflection else PUSH event 6614 MultipleNeighbors 6616 C.3. 
Flooding Procedures 6618 Flooding Procedures are described in terms of a flooding state of an 6619 adjacency and resulting operations on it driven by packet arrivals. 6620 The FSM has basically a single state and is not well suited to 6621 represent the behavior. 6623 RIFT does not specify any kind of flood rate limiting since such 6624 specifications always assume particular points in available 6625 technology speeds and feeds and those points are shifting at a faster 6626 and faster rate (speed of light holding for the moment). The encoded 6627 packets provide hints to react accordingly to losses or overruns. 6629 Flooding of all corresponding topology exchange elements SHOULD be 6630 performed at the highest feasible rate whereas the rate of transmission 6631 MUST be throttled by reacting to adequate features of the system such 6632 as e.g. queue lengths or congestion indications in the protocol 6633 packets. 6635 C.3.1. FloodState Structure per Adjacency 6637 The structure contains conceptually the following elements. The word 6638 collection or queue indicates a set of elements that can be iterated: 6640 TIES_TX: Collection containing all the TIEs to transmit on the 6641 adjacency. 6643 TIES_ACK: Collection containing all the TIEs that have to be 6644 acknowledged on the adjacency. 6646 TIES_REQ: Collection containing all the TIE headers that have to be 6647 requested on the adjacency. 6649 TIES_RTX: Collection containing all TIEs that need retransmission 6650 with the corresponding time to retransmit. 6652 The following words are used for well known procedures operating on this 6653 structure: 6655 TIE Describes either a full RIFT TIE or correspondingly just the 6656 `TIEHeader` or `TIEID`. The intended meaning is unambiguously 6657 contained in the context of the algorithm. 6659 is_flood_reduced(TIE): returns whether a TIE can be flood reduced or 6660 not. 6662 is_tide_entry_filtered(TIE): returns whether a header should be 6663 propagated in TIDE according to flooding scopes.
6665 is_request_filtered(TIE): returns whether a TIE request should be 6666 propagated to neighbor or not according to flooding scopes. 6668 is_flood_filtered(TIE): returns whether a TIE should be flooded 6669 to a neighbor or not according to flooding scopes. 6671 try_to_transmit_tie(TIE): 6673 A. if not is_flood_filtered(TIE) then 6675 1. remove TIE from TIES_RTX if present 6677 2. if TIE" with same key on TIES_ACK then 6679 a. if TIE" same or newer than TIE do nothing else 6681 b. remove TIE" from TIES_ACK and add TIE to TIES_TX 6683 3. else insert TIE into TIES_TX 6685 ack_tie(TIE): remove TIE from all collections and then insert TIE 6686 into TIES_ACK. 6688 tie_been_acked(TIE): remove TIE from all collections. 6690 remove_from_all_queues(TIE): same as `tie_been_acked`. 6692 request_tie(TIE): if not is_request_filtered(TIE) then 6693 remove_from_all_queues(TIE) and add to TIES_REQ. 6695 move_to_rtx_list(TIE): remove TIE from TIES_TX and then add to 6696 TIES_RTX using TIE retransmission interval. 6698 clear_requests(TIEs): remove all TIEs from TIES_REQ. 6700 bump_own_tie(TIE): for self-originated TIE originate an empty one or 6701 re-generate it with a version number higher than the one in TIE. 6703 The collections SHOULD be served with the following priorities if the 6704 system cannot process all the collections in real time: 6706 Elements on TIES_ACK should be processed with highest priority 6707 TIES_TX 6709 TIES_REQ and TIES_RTX 6711 C.3.2. TIDEs 6713 `TIEID` and `TIEHeader` space forms a strict total order (modulo 6714 incomparable sequence numbers in the very unlikely event that can 6715 occur if a TIE is "stuck" in a part of a network while the originator 6716 reboots and reissues TIEs many times to the point its sequence# rolls 6717 over and forms incomparable distance to the "stuck" copy) which 6718 implies that a comparison relation is possible between two elements.
6719 With that it is implicitly possible to compare TIEs, TIEHeaders and 6720 TIEIDs to each other whereas the shortest viable key is always 6721 implied. 6723 When generating and sending TIDEs an implementation SHOULD ensure 6724 that enough bandwidth is left to send elements of the FloodState 6725 structure. 6727 C.3.2.1. TIDE Generation 6729 As given by the timer constant, periodically generate TIDEs by: 6731 NEXT_TIDE_ID: ID of next TIE to be sent in TIDE. 6733 TIDE_START: Begin of TIDE packet range. 6735 a. NEXT_TIDE_ID = MIN_TIEID 6737 b. while NEXT_TIDE_ID not equal to MAX_TIEID do 6739 1. TIDE_START = NEXT_TIDE_ID 6741 2. HEADERS = At most TIRDEs_PER_PKT headers in TIEDB starting at 6742 NEXT_TIDE_ID or higher that SHOULD be filtered by 6743 is_tide_entry_filtered and MUST either have a lifetime left > 6744 0 or have no content 6746 3. if HEADERS is empty then START = MIN_TIEID else START = first 6747 element in HEADERS 6749 4. if HEADERS' size less than TIRDEs_PER_PKT then END = 6750 MAX_TIEID else END = last element in HEADERS 6752 5. send sorted HEADERS as TIDE setting START and END as its 6753 range 6755 6. NEXT_TIDE_ID = END 6757 The constant `TIRDEs_PER_PKT` SHOULD be computed and used by the 6758 implementation to limit the number of TIE headers per TIDE so the 6759 sent TIDE PDU does not exceed interface MTU. 6761 TIDE PDUs SHOULD be spaced on sending to prevent packet drops. 6763 C.3.2.2. TIDE Processing 6765 On reception of TIDEs the following processing is performed: 6767 TXKEYS: Collection of TIE Headers to be sent after processing of 6768 the packet 6770 REQKEYS: Collection of TIEIDs to be requested after processing of 6771 the packet 6773 CLEARKEYS: Collection of TIEIDs to be removed from flood state 6774 queues 6776 LASTPROCESSED: Last processed TIEID in TIDE 6778 DBTIE: TIE in the LSDB if found 6780 a. LASTPROCESSED = TIDE.start_range 6782 b. for every HEADER in TIDE do 6784 1. DBTIE = find HEADER in current LSDB 6786 2.
if HEADER < LASTPROCESSED then report error and reset 6787 adjacency and return 6789 3. put all TIEs in LSDB where (TIE.HEADER > LASTPROCESSED and 6790 TIE.HEADER < HEADER) into TXKEYS 6792 4. LASTPROCESSED = HEADER 6794 5. if DBTIE not found then 6796 I) if originator is this node then bump_own_tie 6798 II) else put HEADER into REQKEYS 6800 6. if DBTIE.HEADER < HEADER then 6802 I) if originator is this node then bump_own_tie else 6803 i. if this is an N-TIE header from a northbound 6804 neighbor then override DBTIE in LSDB with HEADER 6806 ii. else put HEADER into REQKEYS 6808 7. if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS 6810 8. if DBTIE.HEADER = HEADER then 6812 I) if DBTIE has content already then put DBTIE.HEADER 6813 into CLEARKEYS 6815 II) else put HEADER into REQKEYS 6817 c. put all TIEs in LSDB where (TIE.HEADER > LASTPROCESSED and 6818 TIE.HEADER <= TIDE.end_range) into TXKEYS 6820 d. for all TIEs in TXKEYS try_to_transmit_tie(TIE) 6822 e. for all TIEs in REQKEYS request_tie(TIE) 6824 f. for all TIEs in CLEARKEYS remove_from_all_queues(TIE) 6826 C.3.3. TIREs 6828 C.3.3.1. TIRE Generation 6830 There is not much to say here. Elements from both TIES_REQ and 6831 TIES_ACK MUST be collected and sent out as fast as feasible as TIREs. 6832 When sending TIREs with elements from TIES_REQ the `lifetime` field 6833 MUST be set to 0 to force reflooding from the neighbor even if the 6834 TIEs seem to be the same. 6836 C.3.3.2. TIRE Processing 6838 On reception of TIREs the following processing is performed: 6840 TXKEYS: Collection of TIE Headers to be sent after processing of 6841 the packet 6843 REQKEYS: Collection of TIEIDs to be requested after processing of 6844 the packet 6846 ACKKEYS: Collection of TIEIDs that have been acked 6848 DBTIE: TIE in the LSDB if found 6850 a. for every HEADER in TIRE do 6851 1. DBTIE = find HEADER in current LSDB 6853 2. if DBTIE not found then do nothing 6855 3. if DBTIE.HEADER < HEADER then put HEADER into REQKEYS 6857 4.
              if DBTIE.HEADER > HEADER then put DBTIE.HEADER into
              TXKEYS

          5.  if DBTIE.HEADER = HEADER then put DBTIE.HEADER into
              ACKKEYS

      b.  for all TIEs in TXKEYS try_to_transmit_tie(TIE)

      c.  for all TIEs in REQKEYS request_tie(TIE)

      d.  for all TIEs in ACKKEYS tie_been_acked(TIE)

C.3.4.  TIEs Processing on Flood State Adjacency

   On reception of TIEs the following processing is performed:

      ACKTIE:  TIE to acknowledge

      TXTIE:  TIE to transmit

      DBTIE:  TIE in the LSDB, if found

      a.  DBTIE = find TIE in current LSDB

      b.  if DBTIE not found then

          1.  if originator is this node then bump_own_tie with a
              short remaining lifetime

          2.  else insert TIE into LSDB and ACKTIE = TIE

          else

          1.  if DBTIE.HEADER = TIE.HEADER then

              i.   if DBTIE has content already then ACKTIE = TIE

              ii.  else process like the "DBTIE.HEADER < TIE.HEADER"
                   case

          2.  if DBTIE.HEADER < TIE.HEADER then

              i.   if originator is this node then bump_own_tie

              ii.  else insert TIE into LSDB and ACKTIE = TIE

          3.  if DBTIE.HEADER > TIE.HEADER then

              i.   if DBTIE has content already then TXTIE = DBTIE

              ii.  else ACKTIE = DBTIE

      c.  if TXTIE is set then try_to_transmit_tie(TXTIE)

      d.  if ACKTIE is set then ack_tie(ACKTIE)

C.3.5.  TIEs Processing When LSDB Received Newer Version on Other
        Adjacencies

   The Link State Database can be considered a switchboard that does
   not need any flooding procedures of its own but can be handed new
   versions of TIEs by a peer.  Consequently, a peer receives from the
   LSDB newer versions of TIEs received over other adjacencies and
   processes them (without any filtering) just like TIEs received from
   its remote peer.  This publisher model can be implemented in many
   ways.

C.3.6.  Sending TIEs

   On a periodic basis, all TIEs with remaining lifetime > 0 MUST be
   sent out on the adjacency, removed from the TIES_TX list, and
   requeued onto the TIES_RTX list.

Appendix D.  Constants

D.1.
Configurable Protocol Constants

   This section gathers constants that are provided in the schema files
   and in the document.

   +----------------+--------------+-----------------------------------+
   |                | Type         | Value                             |
   +----------------+--------------+-----------------------------------+
   | LIE IPv4       | Default      | 224.0.0.120 or all-rift-routers   |
   | Multicast      | Value,       | to be assigned in IPv4 Multicast  |
   | Address        | Configurable | Address Space Registry in Local   |
   |                |              | Network Control Block             |
   +----------------+--------------+-----------------------------------+
   | LIE IPv6       | Default      | FF02::A1F7 or all-rift-routers to |
   | Multicast      | Value,       | be assigned in IPv6 Multicast     |
   | Address        | Configurable | Address Assignments               |
   +----------------+--------------+-----------------------------------+
   | LIE            | Default      | 914                               |
   | Destination    | Value,       |                                   |
   | Port           | Configurable |                                   |
   +----------------+--------------+-----------------------------------+
   | Level value    | Constant     | 24                                |
   | for            |              |                                   |
   | TOP_OF_FABRIC  |              |                                   |
   | flag           |              |                                   |
   +----------------+--------------+-----------------------------------+
   | Default LIE    | Default      | 3 seconds                         |
   | Holdtime       | Value,       |                                   |
   |                | Configurable |                                   |
   +----------------+--------------+-----------------------------------+
   | TIE            | Default      | 1 second                          |
   | Retransmission | Value        |                                   |
   | Interval       |              |                                   |
   +----------------+--------------+-----------------------------------+
   | TIDE           | Default      | 5 seconds                         |
   | Generation     | Value,       |                                   |
   | Interval       | Configurable |                                   |
   +----------------+--------------+-----------------------------------+
   | MIN_TIEID      | Constant     | TIE Key with minimal values:      |
   | signifies      |              | TIEID(originator=0,               |
   | start of TIDEs |              | tietype=TIETypeMinValue,          |
   |                |              | tie_nr=0, direction=South)        |
   +----------------+--------------+-----------------------------------+
   | MAX_TIEID      | Constant     | TIE Key with maximal values:      |
   | signifies end  |              | TIEID(originator=MAX_UINT64,      |
   | of TIDEs       |              | tietype=TIETypeMaxValue,          |
   |                |              | tie_nr=MAX_UINT64,                |
   |                |              | direction=North)                  |
   +----------------+--------------+-----------------------------------+

                         Table 6: all_constants

Authors' Addresses

   Tony Przygienda (editor)
   Juniper
   1137 Innovation Way
   Sunnyvale, CA
   USA

   Email: prz@juniper.net

   Alankar Sharma
   Comcast
   1800 Bishops Gate Blvd
   Mount Laurel, NJ  08054
   US

   Email: Alankar_Sharma@comcast.com

   Pascal Thubert
   Cisco Systems, Inc
   Building D
   45 Allee des Ormes - BP1200
   MOUGINS - Sophia Antipolis  06254
   FRANCE

   Phone: +33 497 23 26 34
   Email: pthubert@cisco.com

   Bruno Rijsman
   Individual

   Email: fl0w@yandex-team.ru

   Dmitry Afanasiev
   Yandex

   Email: fl0w@yandex-team.ru
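   As a non-normative illustration of the appendices above: the per-TIE
   decision in Appendix C.3.4 (acknowledge, store, or re-flood) reduces
   to a version comparison against the LSDB copy, and the MIN_TIEID /
   MAX_TIEID constants of Table 6 bound a total order over TIE keys
   that TIDE ranges rely on.  The Python sketch below is illustrative
   only: it assumes TIE headers are versioned solely by `seq_nr`, that
   stored TIEs always carry content, and that the received TIE was not
   originated by the processing node; the names `TIEHeader` and
   `process_received_tie` are invented here and are not schema
   identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Illustrative only: a TIEID ordered lexicographically as
# (direction, originator, tietype, tie_nr), as TIDE ranges assume.
SOUTH, NORTH = 0, 1
MAX_UINT64 = 2**64 - 1
MIN_TIEID = (SOUTH, 0, 0, 0)                             # start of TIDEs
MAX_TIEID = (NORTH, MAX_UINT64, MAX_UINT64, MAX_UINT64)  # end of TIDEs


@dataclass(frozen=True)
class TIEHeader:
    tie_id: Tuple[int, int, int, int]  # (direction, originator, tietype, tie_nr)
    seq_nr: int                        # assumed sole version discriminator


def process_received_tie(
    lsdb: Dict[tuple, TIEHeader], rx: TIEHeader
) -> Tuple[Optional[TIEHeader], Optional[TIEHeader]]:
    """Return (ACKTIE, TXTIE) following the Appendix C.3.4 rules for a
    TIE this node did not originate, assuming stored TIEs carry content."""
    dbtie = lsdb.get(rx.tie_id)
    if dbtie is None or dbtie.seq_nr < rx.seq_nr:
        lsdb[rx.tie_id] = rx   # unknown or newer version: store and acknowledge
        return rx, None
    if dbtie.seq_nr == rx.seq_nr:
        return rx, None        # same version: just acknowledge
    return None, dbtie         # our copy is newer: flood it back to the peer
```

   Note how the "flood it back" branch mirrors step b.3.i: a peer
   advertising a stale version is corrected by retransmission rather
   than by an explicit error, which is what lets the protocol converge
   without negotiating database state up front.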