RIFT Working Group                                     A. Przygienda, Ed.
Internet-Draft                                                    Juniper
Intended status: Standards Track                                A. Sharma
Expires: September 11, 2020                                       Comcast
                                                               P. Thubert
                                                                    Cisco
                                                               B. Rijsman
                                                               Individual
                                                             D. Afanasiev
                                                                   Yandex
                                                           March 10, 2020

                       RIFT: Routing in Fat Trees
                        draft-ietf-rift-rift-11

Abstract

   This document defines a specialized, dynamic routing protocol for
   Clos and fat-tree network topologies, optimized towards minimization
   of configuration and operational complexity.
   The protocol

   o  deals with fully automated construction of fat-tree topologies
      based on detection of links, without any configuration,

   o  minimizes the amount of routing state held at each level,

   o  automatically prunes and load balances topology flooding
      exchanges over a sufficient subset of links,

   o  supports automatic disaggregation of prefixes on link and node
      failures to prevent black-holing and suboptimal routing,

   o  allows traffic steering and re-routing policies,

   o  allows loop-free non-ECMP forwarding,

   o  automatically re-balances traffic towards the spines based on
      available bandwidth and, finally,

   o  provides mechanisms to synchronize a limited key-value data-store
      that can be used after protocol convergence to e.g. bootstrap
      higher levels of functionality on nodes.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 11, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1. Authors
   2. Introduction
      2.1. Requirements Language
   3. Reference Frame
      3.1. Terminology
      3.2. Topology
   4. RIFT: Routing in Fat Trees
      4.1. Overview
           4.1.1. Properties
           4.1.2. Generalized Topology View
                  4.1.2.1. Terminology
                  4.1.2.2. Clos as Crossed Crossbars
           4.1.3. Fallen Leaf Problem
           4.1.4. Discovering Fallen Leaves
           4.1.5. Addressing the Fallen Leaves Problem
      4.2. Specification
           4.2.1. Transport
           4.2.2. Link (Neighbor) Discovery (LIE Exchange)
                  4.2.2.1. LIE FSM
           4.2.3. Topology Exchange (TIE Exchange)
                  4.2.3.1. Topology Information Elements
                  4.2.3.2. South- and Northbound Representation
                  4.2.3.3. Flooding
                  4.2.3.4. TIE Flooding Scopes
                  4.2.3.5. 'Flood Only Node TIEs' Bit
                  4.2.3.6. Initial and Periodic Database Synchronization
                  4.2.3.7. Purging and Roll-Overs
                  4.2.3.8. Southbound Default Route Origination
                  4.2.3.9. Northbound TIE Flooding Reduction
                  4.2.3.10. Special Considerations
           4.2.4. Reachability Computation
                  4.2.4.1. Northbound SPF
                  4.2.4.2. Southbound SPF
                  4.2.4.3. East-West Forwarding Within a non-ToF Level
                  4.2.4.4. East-West Links Within ToF Level
           4.2.5. Automatic Disaggregation on Link & Node Failures
                  4.2.5.1. Positive, Non-transitive Disaggregation
                  4.2.5.2. Negative, Transitive Disaggregation for
                           Fallen Leaves
           4.2.6. Attaching Prefixes
           4.2.7. Optional Zero Touch Provisioning (ZTP)
                  4.2.7.1. Terminology
                  4.2.7.2. Automatic SystemID Selection
                  4.2.7.3. Generic Fabric Example
                  4.2.7.4. Level Determination Procedure
                  4.2.7.5. ZTP FSM
                  4.2.7.6. Resulting Topologies
           4.2.8. Stability Considerations
      4.3. Further Mechanisms
           4.3.1. Overload Bit
           4.3.2. Optimized Route Computation on Leaves
           4.3.3. Mobility
                  4.3.3.1. Clock Comparison
                  4.3.3.2. Interaction between Time Stamps and Sequence
                           Counters
                  4.3.3.3. Anycast vs. Unicast
                  4.3.3.4. Overlays and Signaling
           4.3.4. Key/Value Store
                  4.3.4.1. Southbound
                  4.3.4.2. Northbound
           4.3.5. Interactions with BFD
           4.3.6. Fabric Bandwidth Balancing
                  4.3.6.1. Northbound Direction
                  4.3.6.2. Southbound Direction
           4.3.7. Label Binding
           4.3.8. Leaf to Leaf Procedures
           4.3.9. Address Family and Multi Topology Considerations
           4.3.10. Reachability of Internal Nodes in the Fabric
           4.3.11. One-Hop Healing of Levels with East-West Links
      4.4. Security
           4.4.1. Security Model
           4.4.2. Security Mechanisms
           4.4.3. Security Envelope
           4.4.4. Weak Nonces
           4.4.5. Lifetime
           4.4.6. Key Management
           4.4.7. Security Association Changes
   5. Examples
      5.1. Normal Operation
      5.2. Leaf Link Failure
      5.3. Partitioned Fabric
      5.4. Northbound Partitioned Router and Optional East-West Links
   6. Implementation and Operation: Further Details
      6.1. Considerations for Leaf-Only Implementation
      6.2. Considerations for Spine Implementation
      6.3. Adaptations to Other Proposed Data Center Topologies
      6.4. Originating Non-Default Route Southbound
   7. Security Considerations
      7.1. General
      7.2. ZTP
      7.3. Lifetime
      7.4. Packet Number
      7.5. Outer Fingerprint Attacks
      7.6. TIE Origin Fingerprint DoS Attacks
      7.7. Host Implementations
   8. IANA Considerations
      8.1. Requested Multicast and Port Numbers
      8.2. Requested Registries with Suggested Values
           8.2.1. Registry RIFT_v4/common/AddressFamilyType
                  8.2.1.1. Requested Entries
           8.2.2. Registry RIFT_v4/common/HierarchyIndications
                  8.2.2.1. Requested Entries
           8.2.3. Registry RIFT_v4/common/IEEE802_1ASTimeStampType
                  8.2.3.1. Requested Entries
           8.2.4. Registry RIFT_v4/common/IPAddressType
                  8.2.4.1. Requested Entries
           8.2.5. Registry RIFT_v4/common/IPPrefixType
                  8.2.5.1. Requested Entries
           8.2.6. Registry RIFT_v4/common/IPv4PrefixType
                  8.2.6.1. Requested Entries
           8.2.7. Registry RIFT_v4/common/IPv6PrefixType
                  8.2.7.1. Requested Entries
           8.2.8. Registry RIFT_v4/common/PrefixSequenceType
                  8.2.8.1. Requested Entries
           8.2.9. Registry RIFT_v4/common/RouteType
                  8.2.9.1. Requested Entries
           8.2.10. Registry RIFT_v4/common/TIETypeType
                  8.2.10.1. Requested Entries
           8.2.11. Registry RIFT_v4/common/TieDirectionType
                  8.2.11.1. Requested Entries
           8.2.12. Registry RIFT_v4/encoding/Community
                  8.2.12.1. Requested Entries
           8.2.13. Registry RIFT_v4/encoding/KeyValueTIEElement
                  8.2.13.1. Requested Entries
           8.2.14. Registry RIFT_v4/encoding/LIEPacket
                  8.2.14.1. Requested Entries
           8.2.15. Registry RIFT_v4/encoding/LinkCapabilities
                  8.2.15.1. Requested Entries
           8.2.16. Registry RIFT_v4/encoding/LinkIDPair
                  8.2.16.1. Requested Entries
           8.2.17. Registry RIFT_v4/encoding/Neighbor
                  8.2.17.1. Requested Entries
           8.2.18. Registry RIFT_v4/encoding/NodeCapabilities
                  8.2.18.1. Requested Entries
           8.2.19. Registry RIFT_v4/encoding/NodeFlags
                  8.2.19.1. Requested Entries
           8.2.20. Registry RIFT_v4/encoding/NodeNeighborsTIEElement
                  8.2.20.1. Requested Entries
           8.2.21. Registry RIFT_v4/encoding/NodeTIEElement
                  8.2.21.1. Requested Entries
           8.2.22. Registry RIFT_v4/encoding/PacketContent
                  8.2.22.1. Requested Entries
           8.2.23. Registry RIFT_v4/encoding/PacketHeader
                  8.2.23.1. Requested Entries
           8.2.24. Registry RIFT_v4/encoding/PrefixAttributes
                  8.2.24.1. Requested Entries
           8.2.25. Registry RIFT_v4/encoding/PrefixTIEElement
                  8.2.25.1. Requested Entries
           8.2.26. Registry RIFT_v4/encoding/ProtocolPacket
                  8.2.26.1. Requested Entries
           8.2.27. Registry RIFT_v4/encoding/TIDEPacket
                  8.2.27.1. Requested Entries
           8.2.28. Registry RIFT_v4/encoding/TIEElement
                  8.2.28.1. Requested Entries
           8.2.29. Registry RIFT_v4/encoding/TIEHeader
                  8.2.29.1. Requested Entries
           8.2.30. Registry RIFT_v4/encoding/TIEHeaderWithLifeTime
                  8.2.30.1. Requested Entries
           8.2.31. Registry RIFT_v4/encoding/TIEID
                  8.2.31.1. Requested Entries
           8.2.32. Registry RIFT_v4/encoding/TIEPacket
                  8.2.32.1. Requested Entries
           8.2.33. Registry RIFT_v4/encoding/TIREPacket
                  8.2.33.1. Requested Entries
   9. Acknowledgments
   10. References
      10.1. Normative References
      10.2. Informative References
   Appendix A. Sequence Number Binary Arithmetic
   Appendix B. Information Elements Schema
      B.1. common.thrift
      B.2. encoding.thrift
   Appendix C. Constants
      C.1. Configurable Protocol Constants
   Authors' Addresses

1. Authors

   This work is a product of a number of individuals, all of whom are to
   be considered major contributors, independent of whether their names
   made it into the limited boilerplate authors' list or not.

   Tony Przygienda, Ed. | Alankar Sharma | Pascal Thubert
   Juniper Networks     | Comcast        | Cisco

   Bruno Rijsman        | Ilya Vershkov  | Dmitry Afanasiev
   Individual           | Mellanox       | Yandex

   Don Fedyk            | Alia Atlas     | John Drake
   Individual           | Individual     | Juniper

                        Table 1: RIFT Authors

2. Introduction

   Clos [CLOS] and Fat-Tree [FATTREE] topologies have gained prominence
   in today's networking, primarily as a result of the paradigm shift
   towards a centralized data-center based architecture that is poised
   to deliver a majority of computation and storage services in the
   future.  Today's routing protocols were originally geared towards
   networks with irregular topologies and a low degree of connectivity,
   but since they were the only available options, several attempts have
   been made to apply those protocols to Clos.  Most successfully, BGP
   [RFC4271] [RFC7938] has been extended for this purpose, not so much
   due to its inherent suitability but rather because of the perceived
   capability to easily modify BGP and the inherent difficulties of
   link-state [DIJKSTRA] based protocols in optimizing topology exchange
   and converging quickly in large-scale, densely meshed topologies.
   The incumbent protocols normally presuppose extensive configuration
   or provisioning during bring-up and re-dimensioning.  This tends to
   be viable only for a set of organizations with according networking
   operation skills and budgets.
   For many IP fabric builders a desirable protocol would be one that
   auto-configures itself and deals with failures and mis-configurations
   with a minimum of human intervention.  Such a solution would allow
   local IP fabric bandwidth to be consumed in a 'standard component'
   fashion, i.e. provisioned much faster and operated at much lower cost
   than today, much like compute or storage is consumed already.

   Looking at the problem through the lens of data center requirements,
   RIFT addresses challenges in IP fabric routing not through an
   incremental modification of either link-state (distributed
   computation) or distance-vector (diffused computation) protocols, but
   rather through a mixture of both, colloquially best described as
   "link-state towards the spine" and "distance vector towards the
   leaves".  In other words, the "bottom" levels flood their link-state
   information in the "northern" direction, while each node under normal
   conditions generates a "default route" and floods it in the
   "southern" direction.  This type of protocol naturally allows for
   highly desirable aggregation.  Alas, such aggregation could blackhole
   traffic in cases of misconfiguration or while failures are being
   resolved, or even cause partial network partitioning, and this has to
   be addressed by an adequate mechanism.  The approach RIFT takes is
   described in Section 4.2.5 and is basically based on automatic,
   sufficient disaggregation of prefixes in case of link and node
   failures.

   For the visually oriented reader, Figure 1 presents a first level
   simplified view of the resulting information and routes on a RIFT
   fabric.  The top of the fabric holds in its link-state database the
   nodes below it and the routes to them.  The second row of the
   database table indicates that partial information about other nodes
   in the same level is available as well.
   The details of how this is achieved will be postponed for the moment.
   Looking at the "bottom" of the fabric, the leaves, we see that the
   topology is basically empty and that, under normal conditions, the
   leaves only hold a load balanced default route to the next level.

   The balance of this document details a dedicated IP fabric routing
   protocol, fills in the specification details and ultimately includes
   resulting security considerations.

   .             [A,B,C,D]
   .             [E]
   .             +-----+      +-----+
   .             |  E  |      |  F  |   A/32 @ [C,D]
   .             +-+-+-+      +-+-+-+   B/32 @ [C,D]
   .               | |          | |     C/32 @ C
   .               | |    +-----+ |     D/32 @ D
   .               | |    |       |
   .               | +------+     |
   .               |      | |     |
   . [A,B]       +-+---+  | |  +---+-+   [A,B]
   . [D]         |  C  +--+ +-+  D  |   [C]
   .             +-+-+-+      +-+-+-+
   . 0/0 @ [E,F]   | |          | |     0/0 @ [E,F]
   . A/32 @ A      | |    +-----+ |     A/32 @ A
   . B/32 @ B      | |    |       |     B/32 @ B
   .               | +------+     |
   .               |      | |     |
   .             +-+---+  | |  +---+-+
   .             |  A  +--+ +-+  B  |
   . 0/0 @ [C,D] +-----+      +-----+   0/0 @ [C,D]

             Figure 1: RIFT Information Distribution

2.1. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [RFC2119]
   [RFC8174] when, and only when, they appear in all capitals, as shown
   here.

3. Reference Frame

3.1. Terminology

   This section presents the terminology used in this document.  It is
   assumed that the reader is thoroughly familiar with the terms and
   concepts used in OSPF [RFC2328] and IS-IS [ISO10589-Second-Edition]
   [ISO10589], as well as the according graph-theoretical concepts of
   shortest path first (SPF) [DIJKSTRA] computation and DAGs.

   Crossbar: Physical arrangement of ports in a switching matrix without
      implying any further scheduling or buffering disciplines.
   Clos/Fat Tree: This document uses the terms Clos and Fat Tree
      interchangeably; both always refer to a folded spine-and-leaf
      topology with possibly multiple Points of Delivery (PoDs) and one
      or multiple Top of Fabric (ToF) planes.  Several modifications
      such as leaf-2-leaf shortcuts and multiple level shortcuts are
      possible and described further in the document.

   Directed Acyclic Graph (DAG): A finite directed graph with no
      directed cycles (loops).  If the links in a Clos are considered as
      either all directed towards the top or vice versa, each of the two
      resulting graphs is a DAG.

   Folded Spine-and-Leaf: In case Clos fabric input and output stages
      are analogous, the fabric can be "folded" to build a "superspine"
      or top, which we will call Top of Fabric (ToF) in this document.

   Level: Clos and Fat Tree networks are topologically partially ordered
      graphs, and 'level' denotes the set of nodes at the same height in
      such a network, where the bottom level (leaf) is the level with
      the lowest value.  A node has links to nodes one level down and/or
      one level up.  Under some circumstances, a node may have links to
      nodes at the same level.  As a footnote: Clos terminology often
      uses the concept of "stage", but due to the folded nature of the
      Fat Tree we do not use it, to prevent misunderstandings.

   Superspine vs. Aggregation and Spine vs. Edge/Leaf: Traditional level
      names in 5-stage folded Clos fabrics for Level 2, 1 and 0
      respectively.  We normalize this language to talk about
      top-of-fabric (ToF), top-of-pod (ToP) and leaves.

   Zero Touch Provisioning (ZTP): Optional RIFT mechanism which allows
      node levels to be derived automatically based on minimum
      configuration (only the ToF property has to be provisioned on the
      according nodes).

   Point of Delivery (PoD): A self-contained vertical slice or subset of
      a Clos or Fat Tree network, containing normally only level 0 and
      level 1 nodes.
      A node in a PoD communicates with nodes in other PoDs via the
      Top-of-Fabric.  We number PoDs to distinguish them and use PoD #0
      to denote an "undefined" PoD.

   Top of PoD (ToP): The set of nodes that provide intra-PoD
      communication and have northbound adjacencies outside of the PoD,
      i.e. are at the "top" of the PoD.

   Top of Fabric (ToF): The set of nodes that provide inter-PoD
      communication and have no northbound adjacencies, i.e. are at the
      "very top" of the fabric.  ToF nodes do not belong to any PoD and
      are assigned the "undefined" PoD value to indicate the equivalent
      of "any" PoD.

   Spine: Any nodes north of leaves and south of top-of-fabric nodes.
      Multiple layers of spines in a PoD are possible.

   Leaf: A node without southbound adjacencies.  Its level is 0 (except
      in cases where it is deriving its level via ZTP and is running
      without LEAF_ONLY, which will be explained in Section 4.2.7).

   Top-of-fabric Plane or Partition: In large fabrics top-of-fabric
      switches may not have enough ports to aggregate all the switches
      south of them, and with that the ToF is 'split' into multiple
      independent planes.  The Introduction and Section 4.1.2 explain
      the concept in more detail.  A plane is a subset of ToF nodes that
      see each other through south reflection or E-W links.

   Radix: The radix of a switch is the number of switching ports it
      provides; it is sometimes called fanout as well.

   North Radix: Ports cabled northbound to higher level nodes.

   South Radix: Ports cabled southbound to lower level nodes.

   South/Southbound and North/Northbound (Direction): When describing
      protocol elements and procedures, we will use the directionality
      of the compass in different situations, i.e. 'south' or
      'southbound' means moving towards the bottom of the Clos or Fat
      Tree network and 'north' or 'northbound' means moving towards the
      top of the Clos or Fat Tree network.
   Northbound Link: A link to a node one level up, or in other words,
      one level further north.

   Southbound Link: A link to a node one level down, or in other words,
      one level further south.

   East-West Link: A link between two nodes at the same level.  East-
      West links are normally not part of Clos or "fat-tree" topologies.

   Leaf shortcuts (L2L): East-West links at the leaf level; they will
      need to be differentiated from East-West links at other levels.

   Routing on the host (RotH): Modern data center architecture variant
      where servers/leaves are multi-homed and consequently participate
      in routing.

   Northbound representation: Subset of topology information flooded
      towards higher levels of the fabric.

   Southbound representation: Subset of topology information sent
      towards a lower level.

   South Reflection: Often abbreviated just as "reflection", it defines
      a mechanism where South Node TIEs are "reflected" from the level
      south back up north to allow nodes in the same level without E-W
      links to "see" each other's node TIEs.

   TIE: This is an acronym for a "Topology Information Element".  TIEs
      are exchanged between RIFT nodes to describe parts of a network
      such as links and address prefixes, in a fashion similar to IS-IS
      LSPs or OSPF LSAs.  A TIE always has a direction and a type.  We
      will talk about North TIEs (sometimes abbreviated as N-TIEs) when
      talking about TIEs in the northbound representation, and South
      TIEs (sometimes abbreviated as S-TIEs) for the southbound
      equivalent.  TIEs have different types such as node and prefix
      TIEs.

   Node TIE: This is an acronym for a "Node Topology Information
      Element", which contains all adjacencies the node discovered and
      information about the node itself.  A Node TIE should NOT be
      confused with a North TIE since "node" defines the type of TIE
      rather than its direction.
   Prefix TIE: This is an acronym for a "Prefix Topology Information
      Element", which contains all prefixes directly attached to this
      node in case of a North TIE, and in case of a South TIE the
      necessary default routes the node advertises southbound.

   Key Value TIE: A South TIE that is carrying a set of key value pairs
      [DYNAMO].  It can be used to distribute information in the
      southbound direction within the protocol.

   TIDE: Topology Information Description Element, equivalent to a CSNP
      in IS-IS.

   TIRE: Topology Information Request Element, equivalent to a PSNP in
      IS-IS.  It can both confirm received and request missing TIEs.

   De-aggregation/Disaggregation: Process in which a node decides to
      advertise more specific prefixes southwards, either positively to
      attract the corresponding traffic, or negatively to repel it.
      Disaggregation is performed to prevent black-holing and suboptimal
      routing to the more specific prefixes.

   LIE: This is an acronym for a "Link Information Element", largely
      equivalent to HELLOs in IGPs; LIEs are exchanged over all the
      links between systems running RIFT to form three-way adjacencies.

   Flood Repeater (FR): A node can designate one or more northbound
      neighbor nodes to be flood repeaters.  The flood repeaters are
      responsible for flooding northbound TIEs further north.  They are
      similar to MPRs (multipoint relays) in OLSR.  The document
      sometimes calls them flood leaders as well.

   Bandwidth Adjusted Distance (BAD): Each RIFT node can calculate the
      amount of northbound bandwidth available towards a node compared
      to other nodes at the same level, and can modify the route
      distance accordingly to allow the lower level to adjust its load
      balancing towards the spines.

   Overloaded: Applies to a node advertising the `overload` attribute as
      set.  The semantics closely follow the meaning of the same
      attribute in [ISO10589-Second-Edition].
537 Interface: A layer 3 entity over which RIFT control packets are 538 exchanged. 540 Three-Way Adjacency: RIFT tries to form a unique adjacency over an 541 interface and exchange local configuration and necessary ZTP 542 information. An adjacency is only advertised in node TIEs and 543 used for computations after it has achieved three-way state, i.e. both 544 routers reflected each other in LIEs including relevant security 545 information. LIEs before three-way state is reached may already 546 carry ZTP-related information. 548 Bi-directional Adjacency: A bidirectional adjacency is an adjacency 549 where the nodes on both sides of the adjacency have advertised it in 550 their node TIEs with the correct levels and system IDs. Bi- 551 directionality is used in different algorithms to check whether 552 the link should be included. 554 Neighbor: Once a three-way adjacency has been formed, a neighborship 555 relationship contains the neighbor's properties. Multiple 556 adjacencies can be formed to a remote node via parallel interfaces 557 but such adjacencies do NOT share a neighbor structure. Saying 558 "neighbor" is thus equivalent to saying "a three-way adjacency". 560 Cost: The term signifies the weighted distance between two 561 neighbors. 563 Distance: Sum of costs (bound by infinite distance) between two 564 nodes. 566 Shortest-Path First (SPF): A well-known graph algorithm attributed 567 to Dijkstra that establishes a tree of shortest paths from a 568 source to destinations on the graph. We use the SPF acronym due to 569 its familiarity as a general term for the node reachability 570 calculations RIFT can employ to ultimately calculate routes, of 571 which the Dijkstra algorithm is one. 573 North SPF (N-SPF): A reachability calculation that is progressing 574 northbound, for example an SPF that is using South Node TIEs only. 575 Normally it progresses a single hop only and installs default 576 routes.
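Since the SPF entries above lean on the classic shortest-path computation, a minimal, generic Dijkstra over a cost-weighted adjacency map may serve as a reminder. This sketch is not RIFT-specific; node names and the graph are illustrative.

```python
import heapq

def spf(graph, source):
    """graph: {node: {neighbor: cost}}; returns {node: distance} from source."""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# A tiny two-level fabric: each leaf reaches the other at distance 2
# via either spine, which is the basis for using all feasible paths.
fabric = {"leaf111": {"spine111": 1, "spine112": 1},
          "spine111": {"leaf111": 1, "leaf112": 1},
          "spine112": {"leaf111": 1, "leaf112": 1},
          "leaf112": {"spine111": 1, "spine112": 1}}
```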
578 South SPF (S-SPF): A reachability calculation that is progressing 579 southbound, as example SPF that is using North Node TIEs only. 581 Security Envelope RIFT packets are flooded within an authenticated 582 security envelope that allows to protect the integrity of 583 information a node accepts. 585 3.2. Topology 586 ^ N +--------+ +--------+ 587 Level 2 | |ToF 21| |ToF 22| 588 E <-*-> W ++-+--+-++ ++-+--+-++ 589 | | | | | | | | | 590 S v P111/2 P121/2 | | | | 591 ^ ^ ^ ^ | | | | 592 | | | | | | | | 593 +--------------+ | +-----------+ | | | +---------------+ 594 | | | | | | | | 595 South +-----------------------------+ | | ^ 596 | | | | | | | All TIEs 597 0/0 0/0 0/0 +-----------------------------+ | 598 v v v | | | | | 599 | | +-+ +<-0/0----------+ | | 600 | | | | | | | | 601 +-+----++ optional +-+----++ ++----+-+ ++-----++ 602 Level 1 | | E/W link | | | | | | 603 |Spin111+----------+Spin112| |Spin121| |Spin122| 604 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 605 | | | South | | | | 606 | +---0/0--->-----+ 0/0 | +----------------+ | 607 0/0 | | | | | | | 608 | +---<-0/0-----+ | v | +--------------+ | | 609 v | | | | | | | 610 +-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 611 Level 0 | | (L2L) | | | | | | 612 |Leaf111+~~~~~~~~~~+Leaf112| |Leaf121| |Leaf122| 613 +-+-----+ +-+---+-+ +--+--+-+ +-+-----+ 614 + + \ / + + 615 Prefix111 Prefix112 \ / Prefix121 Prefix122 616 multi-homed 617 Prefix 618 +---------- PoD 1 ---------+ +---------- PoD 2 ---------+ 620 Figure 2: A Three Level Spine-and-Leaf Topology 621 .+--------+ +--------+ +--------+ +--------+ 622 .|ToF A1| |ToF B1| |ToF B2| |ToF A2| 623 .++-+-----+ ++-+-----+ ++-+-----+ ++-+-----+ 624 . | | | | | | | | 625 . | | | | | +---------------+ 626 . | | | | | | | | 627 . | | | +-------------------------+ | 628 . | | | | | | | | 629 . | +-----------------------+ | | | | 630 . | | | | | | | | 631 . | | +---------+ | +---------+ | | 632 . | | | | | | | | 633 . | +---------------------------------+ | | 634 . 
| | | | | | | | 635 .++-+-----+ ++-+-----+ +--+-+---+ +----+-+-+ 636 .|Spine111| |Spine112| |Spine121| |Spine122| 637 .+-+---+--+ ++----+--+ +-+---+--+ ++---+---+ 638 . | | | | | | | | 639 . | +--------+ | | +--------+ | 640 . | | | | | | | | 641 . | -------+ | | | +------+ | | 642 . | | | | | | | | 643 .+-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 644 .|Leaf111| |Leaf112| |Leaf121| |Leaf122| 645 .+-------+ +-------+ +-------+ +-------+ 647 Figure 3: Topology with Multiple Planes 649 We will use the topology in Figure 2 (commonly called a fat tree/network 650 in modern IP fabric considerations [VAHDAT08], as a homonym to the 651 original definition of the term [FATTREE]) in all further 652 considerations. This figure depicts a generic "single plane fat- 653 tree" and the concepts explained using three levels apply by 654 induction to further levels and higher degrees of connectivity. 655 Further, this document will also deal with designs that provide only 656 sparser connectivity and "partitioned spines" as shown in Figure 3 657 and explained further in Section 4.1.2. 659 4. RIFT: Routing in Fat Trees 661 We present here a detailed outline of a protocol optimized for 662 Routing in Fat Trees (RIFT) that in most abstract terms has many 663 properties of a modified link-state protocol 664 [RFC2328][ISO10589-Second-Edition] when distributing information 665 northbound and of a distance vector protocol [RFC4271] when distributing 666 information southbound. While this is an unusual combination, it 667 does quite naturally exhibit the desirable properties we seek. 669 4.1. Overview 671 4.1.1. Properties 673 The most singular property of RIFT is that it floods flat link-state 674 information northbound only so that each level obtains the full 675 topology of the levels south of it. Link-State information is, with some 676 exceptions, never flooded East-West or back South again.
Exceptions, 677 like south reflection, are explained in detail in Section 4.2.5.1 and 678 east-west flooding at the ToF level in multi-plane fabrics is outlined in 679 Section 4.1.2. In the southbound direction, the protocol operates like a 680 "fully summarizing, unidirectional" path vector protocol, or rather a 681 distance vector with implicit split horizon. Routing information, 682 normally just the default route, propagates one hop south and is 're- 683 advertised' by nodes at the next lower level. However, RIFT uses 684 flooding in the southern direction as well to avoid the overhead of 685 building an update per adjacency. We omit describing the East-West 686 direction for the moment. 688 Those information flow constraints create not only an anisotropic 689 protocol (i.e. the information is not distributed "evenly" but is 690 "clumped" and summarized along the N-S gradient) but also a "smooth" 691 information propagation where nodes do not receive the same 692 information from multiple directions at the same time. Normally, 693 accepting the same reachability on any link, without understanding 694 its topological significance, forces tie-breaking on some kind of 695 distance metric. And such tie-breaking ultimately leads, in hop-by- 696 hop forwarding, to shortest paths only. In contrast to that, RIFT, 697 under normal conditions, does not need to tie-break the same reachability 698 information from multiple directions. Its computation principles 699 (the south forwarding direction is always preferred) lead to valley-free 700 forwarding behavior. And since valley-free routing is loop-free, it 701 can use all feasible paths, which is another highly desirable property 702 if available bandwidth should be utilized to the maximum extent 703 possible. 705 To account for the "northern" and the "southern" information split, 706 the link state database is partitioned accordingly into "north 707 representation" and "south representation" TIEs.
In the simplest terms, 708 the North TIEs contain a link state topology description of the lower 709 levels and the South TIEs simply carry default routes towards the 710 level above. This oversimplified view will be refined gradually in 711 the following sections while introducing protocol procedures and state 712 machines at the same time. 714 4.1.2. Generalized Topology View 716 This section will shed some light on the topologies RIFT addresses, 717 including multi-plane fabrics and their implications. Readers that 718 are only interested in single plane designs, i.e. all top-of-fabric 719 nodes being topologically equal and initially connected to all the 720 switches at the level below them, can skip the rest of Section 4.1.2 721 and the resulting Section 4.2.5.2 as well. 723 It is quite difficult to visualize multi-plane designs, which are 724 effectively multi-dimensional switching matrices. To cope with that, 725 we will introduce a methodology allowing us to depict the 726 connectivity in two-dimensional pictures. Further, we will leverage 727 the fact that we are dealing basically with stacked crossbar fabrics 728 where ports align "on top of each other" in a regular fashion. 730 A word of caution to the reader; at this point it should be observed 731 that the language used to describe Clos variations, especially in 732 multi-plane designs, varies widely between sources. This description 733 follows the terminology introduced in Section 3.1. Familiarity with it 734 is necessary to be able to follow the rest of this section 735 correctly. 737 4.1.2.1. Terminology 739 This section describes the terminology and acronyms used in the rest 740 of the text. 742 P: Denotes the number of PoDs in a topology. 744 S: Denotes the number of ToF nodes in a topology. 746 K: Denotes the number of ports in the radix of a switch pointing north or 747 south. Further, K_LEAF denotes the number of ports pointing south, 748 i.e.
towards leaves, and K_TOP the number of ports pointing north 749 towards a higher spine level. To simplify the visual aids, 750 notations and further considerations, K will be mostly set to 751 Radix/2. 753 ToF Plane: Set of ToFs that are aware of each other by means of 754 south reflection. We number planes by capital letters, e.g. 755 plane A. 757 N: Denotes the number of independent ToF planes in a topology. 759 R: Denotes a redundancy factor, i.e. the number of connections a spine 760 has towards a ToF plane. In a single plane design K_TOP is equal to 761 R. 763 Fallen Leaf: A fallen leaf in a plane Z is a switch that lost all 764 connectivity northbound to Z. 766 4.1.2.2. Clos as Crossed Crossbars 768 The typical topology for which RIFT is defined is built of P number 769 of PoDs and connected together by S number of ToF nodes. A PoD node 770 has K number of ports (also called Radix). We consider half of them 771 (K=Radix/2) as connecting host devices from the south, and the other 772 half connecting to interleaved PoD Top-Level switches to the north. 773 Ratio K can be chosen differently without loss of generality when 774 port speeds differ or the fabric is oversubscribed but K=Radix/2 allows 775 for a more readable representation whereby there are as many ports 776 facing north as south on any intermediate node. We hence represent a node 777 in a schematic fashion with ports "sticking out" to its north 778 and south rather than by the usual real-world front faceplate designs 779 of the day. 781 Figure 4 provides a view of a leaf node as seen from the north, i.e. 782 showing ports that connect northbound. For lack of a better symbol, 783 we have chosen to use the "o" as the ASCII visualisation of a single 784 port. In this example, K_LEAF has 6 ports. Observe that the number 785 of PoDs is not related to Radix unless the ToF Nodes are constrained 786 to be the same as the PoD nodes in a particular deployment.
788 Top view 789 +---+ 790 | | 791 | o | e.g., Radix = 12, K_LEAF = 6 792 | | 793 | o | 794 | | ------------------------- 795 | o ------- Physical Port (Ethernet) ----+ 796 | | ------------------------- | 797 | o | | 798 | | | 799 | o | | 800 | | | 801 | o | | 802 | | | 803 +---+ | 805 || || || || || || || 806 +----+ +------------------------------------------------+ 807 | | | | 808 +----+ +------------------------------------------------+ 809 || || || || || || || 810 Side views 812 Figure 4: A Leaf Node, K_LEAF=6 814 The Radix of a PoD's top node may be different from that of the leaf 815 node, though more often than not the same type of node is used for 816 both, effectively forming a square (K*K). In the general case, we could 817 have switches with K_TOP southern ports on nodes at the top of the 818 PoD which are not necessarily the same as K_LEAF. For instance, in 819 the representations below, we pick a 6 port K_LEAF and an 8 port 820 K_TOP. In order to form a crossbar, we need K_TOP Leaf Nodes as 821 illustrated in Figure 5. 823 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 824 | | | | | | | | | | | | | | | | 825 | o | | o | | o | | o | | o | | o | | o | | o | 826 | | | | | | | | | | | | | | | | 827 | o | | o | | o | | o | | o | | o | | o | | o | 828 | | | | | | | | | | | | | | | | 829 | o | | o | | o | | o | | o | | o | | o | | o | 830 | | | | | | | | | | | | | | | | 831 | o | | o | | o | | o | | o | | o | | o | | o | 832 | | | | | | | | | | | | | | | | 833 | o | | o | | o | | o | | o | | o | | o | | o | 834 | | | | | | | | | | | | | | | | 835 | o | | o | | o | | o | | o | | o | | o | | o | 836 | | | | | | | | | | | | | | | | 837 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 839 Figure 5: Southern View of a PoD, K_TOP=8 841 As further visualized in Figure 6, the K_TOP Leaf Nodes are fully 842 interconnected with the K_LEAF PoD-top nodes, providing connectivity 843 that can be represented as a crossbar when "looked at" from the 844 north.
The result is that, in the absence of a failure, a packet 845 entering the PoD from the north on any port can be routed to any port 846 in the south of the PoD and vice versa. And that is precisely why it 847 makes sense to talk about a "switching matrix". 849 E<-*->W 851 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 852 | | | | | | | | | | | | | | | | 853 +--------------------------------------------------------+ 854 | o o o o o o o o | 855 +--------------------------------------------------------+ 856 +--------------------------------------------------------+ 857 | o o o o o o o o | 858 +--------------------------------------------------------+ 859 +--------------------------------------------------------+ 860 | o o o o o o o o | 861 +--------------------------------------------------------+ 862 +--------------------------------------------------------+ 863 | o o o o o o o o | 864 +--------------------------------------------------------+ 865 +--------------------------------------------------------+ 866 | o o o o o o o o |<-+ 867 +--------------------------------------------------------+ | 868 +--------------------------------------------------------+ | 869 | o o o o o o o o | | 870 +--------------------------------------------------------+ | 871 | | | | | | | | | | | | | | | | | 872 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 873 ^ | 874 | | 875 | ---------- --------------------- | 876 +----- Leaf Node PoD top Node (Spine) --+ 877 ---------- --------------------- 879 Figure 6: Northern View of a PoD's Spines, K_TOP=8 881 Side views of this PoD is illustrated in Figure 7 and Figure 8. 
883 Connecting to Spine 885 || || || || || || || || 886 +----------------------------------------------------------------+ N 887 | PoD top Node seen sideways | ^ 888 +----------------------------------------------------------------+ | 889 || || || || || || || || * 890 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 891 | | | | | | | | | | | | | | | | v 892 +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ S 893 || || || || || || || || 895 Connecting to Client nodes 897 Figure 7: Side View of a PoD, K_TOP=8, K_LEAF=6 899 Connecting to Spine 901 || || || || || || 902 +----+ +----+ +----+ +----+ +----+ +----+ N 903 | | | | | | | | | | | PoD top Nodes ^ 904 +----+ +----+ +----+ +----+ +----+ +----+ | 905 || || || || || || * 906 +------------------------------------------------+ | 907 | Leaf seen sideways | v 908 +------------------------------------------------+ S 909 || || || || || || 911 Connecting to Client nodes 913 Figure 8: Other Side View of a PoD, K_TOP=8, K_LEAF=6, 90o turn in 914 E-W Plane 916 As a next step, let us observe that the resulting PoD can be abstracted 917 as a bigger node with a radix K_POD = K_TOP * K_LEAF, and the 918 design can recurse. 920 It is critical at this point that, before progressing further, 921 the concept and the picture of "crossed crossbars" are clear; else 922 the following considerations might be difficult to comprehend. 924 To continue, the PoDs are interconnected with each other through a 925 Top-of-Fabric (ToF) node at the very top or the north edge of the 926 fabric. The resulting ToF is NOT partitioned if, and only if (IFF), 927 every PoD top level node (spine) is connected to every ToF Node. 929 This topology is also referred to as a single plane configuration and 930 is quite popular due to its simplicity.
In order to reach a 1:1 931 connectivity ratio between the ToF and the leaves, it results that 932 there are K_TOP ToF nodes, because each port of a ToP node connects 933 to a different ToF node, and K_LEAF ToP nodes for the same reason. 934 Consequently, it will take (P * K_LEAF) ports on a ToF node to 935 connect to each of the K_LEAF ToP nodes of the P PoDs, as shown in 936 Figure 9. 938 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] <-----+ 939 | | | | | | | | | 940 [=================================] | ----------- 941 | | | | | | | | +----- Top-of-Fabric 942 [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] +----- Node -------+ 943 | ----------- | 944 | v 945 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ <-----+ +-+ 946 | | | | | | | | | | | | | | | | | | 947 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 948 [ |o| |o| |o| |o| |o| |o| |o| |o| ] ------------------------- | | 949 [ |o| |o| |o| |o| |o| |o| |o| |o<--- Physical Port (Ethernet) | | 950 [ |o| |o| |o| |o| |o| |o| |o| |o| ] ------------------------- | | 951 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 952 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 953 | | | | | | | | | | | | | | | | | | 954 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | 955 [ |o| |o| |o| |o| |o| |o| |o| |o| ] -------------- | | 956 [ |o| |o| |o| |o| |o| |o| |o| |o| ] <--- PoD top level | | 957 [ |o| |o| |o| |o| |o| |o| |o| |o| ] node (Spine) ---+ | | 958 [ |o| |o| |o| |o| |o| |o| |o| |o| ] -------------- | | | 959 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | | 960 | | | | | | | | | | | | | | | | -+ +- +-+ v | | 961 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ ]--| | 962 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | ----- | --| |--[ ]--| | 963 [ |o| |o| |o| |o| |o| |o| |o| |o| ] +--- PoD ---+ --| |--[ ]--| | 964 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | ----- | --| |--[ ]--| | 965 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ ]--| | 966 [ |o| |o| |o| |o| |o| |o| |o| |o| ] | | --| |--[ ]--| | 967 | | | | | | | | | | | | | | | | -+ +- +-+ | | 968 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ 970 Figure 9: Fabric Spines and 
TOFs in Single Plane Design, 3 PoDs 972 The top view can be collapsed into a third dimension where the hidden 973 depth index is representing the PoD number. We can then show one PoD 974 as a class of PoDs and hence save one dimension in our 975 representation. The Spine Node expands in the depth and the vertical 976 dimensions, whereas the PoD top level Nodes are constrained, in 977 horizontal dimension. A port in the 2-D representation represents 978 effectively the class of all the ports at the same position in all 979 the PoDs that are projected in its position along the depth axis. 980 This is shown in Figure 10. 982 / / / / / / / / / / / / / / / / 983 / / / / / / / / / / / / / / / / 984 / / / / / / / / / / / / / / / / 985 / / / / / / / / / / / / / / / / ] 986 +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-+ ]] 987 | | | | | | | | | | | | | | | | ] --------------------------- 988 [ |o| |o| |o| |o| |o| |o| |o| |o| ] <-- PoD top level node (Spine) 989 [ |o| |o| |o| |o| |o| |o| |o| |o| ] --------------------------- 990 [ |o| |o| |o| |o| |o| |o| |o| |o| ]]]] 991 [ |o| |o| |o| |o| |o| |o| |o| |o| ]]] ^^ 992 [ |o| |o| |o| |o| |o| |o| |o| |o| ]] // PoDs 993 [ |o| |o| |o| |o| |o| |o| |o| |o| ] // (in depth) 994 | |/| |/| |/| |/| |/| |/| |/| |/ // 995 +-+ +-+ +-+/+-+/+-+ +-+ +-+ +-+ // 996 ^ 997 | ---------------- 998 +----- Top-of-Fabric Node 999 ---------------- 1001 Figure 10: Collapsed Northern View of a Fabric for Any Number of PoDs 1003 As simple as single plane deployment is it introduces a limit due to 1004 the bound on the available radix of the ToF nodes that has to be at 1005 least P * K_LEAF. Nevertheless, we will see that a distinct 1006 advantage of a connected or non-partitioned Top-of-Fabric is that all 1007 failures can be resolved by simple, non-transitive, positive 1008 disaggregation (i.e. 
nodes advertising more specific prefixes together with 1009 the default route to the level below them, which is however not propagated 1010 further down the fabric) as described in Section 4.2.5.1. In other 1011 words, non-partitioned ToF nodes can always reach nodes below or 1012 withdraw the routes from PoDs they cannot reach unambiguously. And 1013 with this, positive disaggregation can heal all failures and still 1014 allow all the ToF nodes to see each other via south reflection. 1015 Disaggregation will be explained in further detail in Section 4.2.5. 1017 In order to scale beyond the "single plane limit", the Top-of-Fabric 1018 can be partitioned into N identically wired planes, where N 1019 is an integer divisor of K_LEAF. The 1:1 ratio and the desired 1020 symmetry are still served, this time with (K_TOP * N) ToF nodes, each 1021 of (P * K_LEAF / N) ports. N=1 represents a non-partitioned Spine 1022 and N=K_LEAF is a maximally partitioned Spine. Further, if R is any 1023 integer divisor of K_LEAF, then N=K_LEAF/R is a feasible number of 1024 planes and R a redundancy factor. It proves convenient for 1025 deployments to use a radix for the leaf nodes that is a power of 2 so 1026 they can pick a number of planes that is a lower power of 2. The 1027 example in Figure 11 splits the Spine into 2 planes with a redundancy 1028 factor R=3, meaning that there are 3 non-intersecting paths between 1029 any leaf node and any ToF node. A ToF node must have, in this case, 1030 at least 3*P ports, and be directly connected to 3 of the 6 PoD-ToP 1031 nodes (spines) in each PoD.
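The dimensioning arithmetic above can be restated as a short sketch. The formulas are taken directly from this section; the function and variable names are illustrative.

```python
def tof_dimensions(P, K_LEAF, K_TOP, R):
    """Given P PoDs, half-radices K_LEAF (south) and K_TOP (north), and a
    redundancy factor R that divides K_LEAF, return the number of planes N,
    the total count of ToF nodes, and the ports needed per ToF node."""
    assert K_LEAF % R == 0, "R must be an integer divisor of K_LEAF"
    N = K_LEAF // R                  # number of independent ToF planes
    tof_nodes = K_TOP * N            # (K_TOP * N) ToF nodes in total
    ports_per_tof = P * K_LEAF // N  # (P * K_LEAF / N) = P * R ports each
    return N, tof_nodes, ports_per_tof

# Figure 11's example: K_LEAF=6 split into N=2 planes gives R=3, so a ToF
# node needs at least 3*P ports. R=K_LEAF recovers the single plane case.
```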
1033 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1034 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1035 | | o | | o | | o | | o | | o | | o | | o | | o | | 1036 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1037 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1038 | | o | | o | | o | | o | | o | | o | | o | | o | | 1039 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1040 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1041 | | o | | o | | o | | o | | o | | o | | o | | o | | 1042 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1043 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1045 Plane 1 1046 ----------- . ------------ . ------------ . ------------ . -------- 1047 Plane 2 1049 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1050 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1051 | | o | | o | | o | | o | | o | | o | | o | | o | | 1052 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1053 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1054 | | o | | o | | o | | o | | o | | o | | o | | o | | 1055 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1056 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1057 | | o | | o | | o | | o | | o | | o | | o | | o | | 1058 +-| |--| |--| |--| |--| |--| |--| |--| |-+ 1059 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ 1060 ^ 1061 | 1062 | ---------------- 1063 +----- Top-of-Fabric node 1064 "across" depth 1065 ---------------- 1067 Figure 11: Northern View of a Multi-Plane ToF Level, K_LEAF=6, N=2 1069 At the extreme end of the spectrum it is even possible to fully 1070 partition the spine with N = K_LEAF and R=1, while maintaining 1071 connectivity between each leaf node and each Top-of-Fabric node. In 1072 that case the ToF node connects to a single Port per PoD, so it 1073 appears as a single port in the projected view represented in 1074 Figure 12. The number of ports required on the Spine Node is more or 1075 equal to P, the number of PoDs. 
1077 Plane 1 1078 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ -+ 1079 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1080 | | o | | o | | o | | o | | o | | o | | o | | o | | | 1081 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1082 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1083 ----------- . ------------------- . ------------ . -------- | 1084 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1085 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1086 | | o | | o | | o | | o | | o | | o | | o | | o | | | 1087 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1088 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1089 ----------- . ------------ . ---- . ------------ . -------- | 1090 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1091 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1092 | | o | | o | | o | | o | | o | | o | | o | | o | | | 1093 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | 1094 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | 1095 ----------- . ------------ . ------------------- . -------- +<-+ 1096 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1097 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1098 | | o | | o | | o | | o | | o | | o | | o | | o | | | | 1099 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1100 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1101 ----------- . ------------ . ------------ . ---- . -------- | | 1102 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1103 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1104 | | o | | o | | o | | o | | o | | o | | o | | o | | | | 1105 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1106 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1107 ----------- . ------------ . ------------ . 
--------------- | | 1108 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ | | 1109 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1110 | | o | | o | | o | | o | | o | | o | | o | | o | | | | 1111 +-| |--| |--| |--| |--| |--| |--| |--| |-+ | | 1112 +---+ +---+ +---+ +---+ +---+ +---+ +---+ +---+ -+ | 1113 Plane 6 ^ | 1114 | | 1115 | ---------------- ------------- | 1116 +----- ToF Node Class of PoDs ---+ 1117 ---------------- ------------- 1119 Figure 12: Northern View of a Maximally Partitioned ToF Level, R=1 1121 4.1.3. Fallen Leaf Problem 1123 As mentioned earlier, RIFT exhibits an anisotropic behaviour tailored 1124 for fabrics with a North / South orientation and a high level of 1125 interleaved paths. A non-partitioned fabric makes a total loss of 1126 connectivity between a Top-of-Fabric node at the north and a leaf 1127 node at the south a very rare but possible occurrence that is fully 1128 healed by positive disaggregation as described in Section 4.2.5.1. 1129 In large fabrics, or fabrics built from switches with a low radix, the 1130 ToF often ends up being partitioned into planes, which makes it more 1131 likely that a given leaf is reachable from only a subset of the ToF 1132 nodes. This makes some further considerations 1133 necessary. 1135 We define a "Fallen Leaf" as a leaf that can be reached by only a 1136 subset, but not all, of the Top-of-Fabric nodes due to missing 1137 connectivity. If R is the redundancy factor, then it takes at least 1138 R breakages to reach a "Fallen Leaf" situation. 1140 In a maximally partitioned fabric, the redundancy factor is R=1, so 1141 any breakage in the fabric may cause one or more fallen leaves. 1142 However, not all cases require disaggregation. The following cases 1143 do not require particular action in such a scenario: 1145 If a southern link on a leaf node goes down, then connectivity to 1146 any node attached to the leaf is lost.
There is no need to 1147 disaggregate since the connectivity is lost from all spine nodes 1148 to the leaf nodes in the same fashion. 1150 If a leaf node goes down, then connectivity 1151 through that leaf is lost for all nodes. There is no need to 1152 disaggregate since the connectivity to this leaf is lost for all 1153 spine nodes in the same fashion. 1155 If a ToF Node goes down, then northern traffic towards it is 1156 routed via alternate ToF nodes in the same plane and there is no 1157 need to disaggregate routes. 1159 In a general manner, the mechanism of non-transitive positive 1160 disaggregation is sufficient when the disaggregating ToF nodes 1161 collectively connect to all the ToP nodes in the broken plane. This 1162 happens in the following case: 1164 If the last northern link from a ToP node to a ToF 1165 node goes down, then the fallen leaf problem affects only that ToF 1166 node, and the connectivity to all the nodes in the PoD is lost 1167 from that ToF node. This can be observed by the other ToF nodes 1168 within the plane where the ToP node is located and positively 1169 disaggregated within that plane. 1171 On the other hand, there is a need to disaggregate the routes to 1172 Fallen Leaves in a transitive fashion, all the way to the other 1173 leaves, in the following cases: 1175 o If the last northern link from a leaf node within 1176 a plane (there is only one such link in a maximally partitioned 1177 fabric) goes down, then connectivity to all unicast prefixes 1178 attached to the leaf node is lost within the plane where the link 1179 is located. Southern Reflection by a leaf node, e.g., between ToP 1180 nodes if the PoD has only 2 levels, happens in between planes, 1181 allowing the ToP nodes to detect the problem within the PoD where 1182 it occurs and positively disaggregate.
The breakage can be 1183 observed by the ToF nodes in the same plane through the North 1184 flooding of TIEs from the ToP nodes. The ToF nodes however need 1185 to be aware of all the affected prefixes for the negative, 1186 possibly transitive, disaggregation to be fully effective (i.e. a 1187 node advertising in the control plane that it cannot reach a certain 1188 prefix more specific than the default, where such disaggregation must 1189 in extreme conditions propagate further down southbound). The 1190 problem can also be observed by the ToF nodes in the other planes 1191 through the flooding of North TIEs from the affected leaf nodes, 1192 together with non-node North TIEs which indicate the affected 1193 prefixes. To be effective in that case, the positive 1194 disaggregation must reach down to the nodes that make the plane 1195 selection, which are typically the ingress leaf nodes. The 1196 information is not useful for routing in the intermediate levels. 1198 o If a ToP node in a maximally partitioned fabric - 1199 in which case it is the only ToP node serving the plane in that 1200 PoD - goes down, then the connectivity to all the nodes in the PoD 1201 is lost within the plane where the ToP node is located. 1202 Consequently, all leaves of the PoD fall in this plane. Since the 1203 Southern Reflection between the ToF nodes happens only within a 1204 plane, ToF nodes in other planes cannot discover fallen leaves in 1205 a different plane. They also cannot determine beyond their local 1206 plane whether a leaf node that was initially reachable has become 1207 unreachable. As the breakage can be observed by the ToF nodes in 1208 the plane where the breakage happened, the ToF nodes in the plane 1209 need to be aware of all the affected prefixes for the negative 1210 disaggregation to be fully effective.
The problem can also be 1211 observed by the ToF nodes in the other planes through the flooding 1212 of North TIEs from the affected leaf nodes, if there are only 3 1213 levels and the ToP nodes are directly connected to the leaf nodes, 1214 and then again it can only be effective if it is propagated 1215 transitively to the leaf, and is useless above that level. 1217 For the sake of easy comprehension, let us roll the abstractions back 1218 into a simple example and observe that in Figure 3 the loss of link 1219 Spine 122 to Leaf 122 will make Leaf 122 a fallen leaf for Top-of- 1220 Fabric plane B. Worse, if the cabling was never present in the first 1221 place, plane B will not even be able to know that such a fallen leaf 1222 exists. Hence partitioning without further treatment results in two 1223 grave problems: 1225 o Leaf 111 trying to route to Leaf 122 MUST choose Spine 111 in 1226 plane A as its next hop since plane B will inevitably blackhole 1227 the packet when forwarding using default routes or do excessive 1228 bow tying. This information must be in its routing table. 1230 o Any kind of "flooding" or distance vector trying to deal with the 1231 problem by distributing host routes will be able to converge only 1232 using paths through leaves. The flooding of information on Leaf 1233 122 would have to go up to Top-of-Fabric A and then "loopback" 1234 over other leaves to ToF B, leading in extreme cases to traffic for 1235 Leaf 122, when presented to plane B, taking an "inverted fabric" 1236 path where leaves start to serve as ToFs, at least for the 1237 duration of a protocol's convergence. 1239 4.1.4. Discovering Fallen Leaves 1241 As illustrated later, and without further proof, the way to deal with 1242 fallen leaves in multi-plane designs, when aggregation is used, is 1243 that RIFT requires all the ToF nodes to share the same north topology 1244 database.
This happens naturally in single plane design by the means 1245 of northbound flooding and south reflection but needs additional 1246 considerations in multi-plane fabrics. To satisfy this RIFT, in 1247 multi-plane designs, relies at the ToF level on ring interconnection 1248 of switches in multiple planes. Other solutions are possible but 1249 they either need more cabling or end up having much longer flooding 1250 paths and/or single points of failure. 1252 In detail, by reserving two ports on each Top-of-Fabric node it is 1253 possible to connect them together by interplane bi-directional rings 1254 as illustrated in Figure 13. The rings will be used to exchange full 1255 north topology information between planes. All ToFs having same 1256 north topology allows by the means of transitive, negative 1257 disaggregation described in Section 4.2.5.2 to efficiently fix any 1258 possible fallen leaf scenario. Somewhat as a side-effect, the 1259 exchange of information fulfills the ask to present full view of the 1260 fabric topology at the Top-of-Fabric level, without the need to 1261 collate it from multiple points by additional complexity of 1262 technologies like [RFC7752]. 1264 +---+ +---+ +---+ +---+ +---+ +---+ +--------+ 1265 | | | | | | | | | | | | | | 1266 | | | | | | | | 1267 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1268 +-| |--| |--| |--| |--| |--| |--| |-+ | 1269 | | o | | o | | o | | o | | o | | o | | o | | | Plane A 1270 +-| |--| |--| |--| |--| |--| |--| |-+ | 1271 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1272 | | | | | | | | 1273 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1274 +-| |--| |--| |--| |--| |--| |--| |-+ | 1275 | | o | | o | | o | | o | | o | | o | | o | | | Plane B 1276 +-| |--| |--| |--| |--| |--| |--| |-+ | 1277 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1278 | | | | | | | | 1279 ... 
| 1280 | | | | | | | | 1281 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1282 +-| |--| |--| |--| |--| |--| |--| |-+ | 1283 | | o | | o | | o | | o | | o | | o | | o | | | Plane X 1284 +-| |--| |--| |--| |--| |--| |--| |-+ | 1285 +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ +-o-+ | 1286 | | | | | | | | 1287 | | | | | | | | | | | | | | 1288 +---+ +---+ +---+ +---+ +---+ +---+ +--------+ 1289 Rings 1 2 3 4 5 6 7 1291 Figure 13: Connecting Top-of-Fabric Nodes Across Planes by Rings 1293 4.1.5. Addressing the Fallen Leaves Problem 1295 One consequence of the "Fallen Leaf" problem is that some prefixes 1296 attached to the fallen leaf become unreachable from some of the ToF 1297 nodes. RIFT proposes two methods to address this issue, the positive 1298 and the negative disaggregation. Both methods flood South TIEs to 1299 advertise the impacted prefix(es). 1301 When used for the operation of disaggregation, a positive South TIE, 1302 as usual, indicates reachability to a prefix of given length and all 1303 addresses subsumed by it. In contrast, a negative route 1304 advertisement indicates that the origin cannot route to the 1305 advertised prefix. 1307 The positive disaggregation is originated by a router that can still 1308 reach the advertised prefix, and the operation is not transitive. In 1309 other words, the receiver does not generate its own flooding south as 1310 a consequence of receiving positive disaggregation advertisements 1311 from a higher level node. The effect of a positive disaggregation is 1312 that the traffic to the impacted prefix will follow the longest match 1313 and will be limited to the northbound routers that advertised the 1314 more specific route. 1316 In contrast, the negative disaggregation can be transitive, and is 1317 propagated south when all the possible routes have been advertised as 1318 negative exceptions. 
A negative route advertisement is only 1319 actionable when the negative prefix is aggregated by a positive route 1320 advertisement for a shorter prefix. In such a case, the negative 1321 advertisement "punches out a hole" in the positive route in the 1322 routing table, making the positive prefix reachable through the 1323 originator with the special consideration of the negative prefix 1324 removing certain next hop neighbors. 1326 When the ToF is not partitioned, the collective southern flooding of 1327 the positive disaggregation by the ToF nodes that can still reach the 1328 impacted prefix is in general enough to cover all the switches at the 1329 next level south, typically the ToP nodes. If all those switches are 1330 aware of the disaggregation, they collectively create a ceiling that 1331 intercepts all the traffic north and forwards it to the ToF nodes 1332 that advertised the more specific route. In that case, the positive 1333 disaggregation alone is sufficient to solve the fallen leaf problem. 1335 On the other hand, when the fabric is partitioned in planes, the 1336 positive disaggregation from ToF nodes in different planes does not 1337 reach the ToP switches in the affected plane and cannot solve the 1338 fallen leaf problem. In other words, a breakage in a plane can 1339 only be solved in that plane. Also, the selection of the plane for a 1340 packet typically occurs at the leaf level and the disaggregation must 1341 be transitive and reach all the leaves. In that case, the negative 1342 disaggregation is necessary. The details on the RIFT approach to 1343 deal with fallen leaves in an optimal way are specified in 1344 Section 4.2.5.2. 1346 4.2. Specification 1348 This section specifies the protocol in a normative fashion by either 1349 prescriptive procedures or behavior defined by Finite State Machines 1350 (FSM). 1352 Some FSM figures are provided as [DOT] descriptions due to limitations 1353 of ASCII art.
1355 "On Entry" actions on FSM state are performed every time and right 1356 before the according state is entered, i.e. after any transitions 1357 from previous state. 1359 "On Exit" actions are performed every time and immediately when a 1360 state is exited, i.e. before any transitions towards target state are 1361 performed. 1363 Any attempt to transition from a state towards another on reception 1364 of an event where no action is specified MUST be considered an 1365 unrecoverable error. 1367 The FSMs and procedures are normative in the sense that an 1368 implementation MUST implement them either literally or an 1369 implementation MUST exhibit externally observable behavior that is 1370 identical to the execution of the specified FSMs. 1372 Where a FSM representation is inconvenient, i.e. the amount of 1373 procedures and kept state exceeds the amount of transitions, we defer 1374 to a more procedural description on data structures. 1376 4.2.1. Transport 1378 All packet formats are defined in Thrift [thrift] models in 1379 Appendix B. 1381 The serialized model is carried in an envelope within a UDP frame 1382 that provides security and allows validation/modification of several 1383 important fields without de-serialization for performance and 1384 security reasons. 1386 4.2.2. Link (Neighbor) Discovery (LIE Exchange) 1388 RIFT LIE exchange auto-discovers neighbors, negotiates ZTP parameters 1389 and discovers miscablings. It uses a three-way handshake mechanism 1390 which is a cleaned up version of [RFC5303]. Observe that for easier 1391 comprehension the terminology of one/two and three-way states does 1392 NOT align with OSPF or ISIS FSMs albeit they use roughly same 1393 mechanisms. The formation progresses under normal conditions from 1394 one-way to two-way and then three-way state at which point it is 1395 ready to exchange TIEs per Section 4.2.3. 
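The one-way/two-way/three-way progression described above can be sketched as follows. This is an illustrative model only, not the normative LIE FSM of Section 4.2.2.1; the class, field, and state names are invented for the example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LIE:
    """Minimal stand-in for a LIE packet (illustrative fields only)."""
    system_id: int
    level: int
    reflected_system_id: Optional[int] = None  # neighbor we currently see, if any

class Adjacency:
    """Illustrative one-way -> two-way -> three-way handshake progression."""

    def __init__(self, my_system_id: int, my_level: int):
        self.my_system_id = my_system_id
        self.my_level = my_level
        self.neighbor: Optional[LIE] = None
        self.state = "OneWay"

    def receive_lie(self, lie: LIE) -> None:
        if lie.system_id == self.my_system_id:
            return  # own system ID; normatively this would trigger CLEANUP
        self.neighbor = lie
        if lie.reflected_system_id == self.my_system_id:
            self.state = "ThreeWay"  # we see our own reflection
        else:
            self.state = "TwoWay"    # neighbor seen, reflection still pending

    def send_lie(self) -> LIE:
        # Reflect the neighbor if known so that it can reach three-way too.
        reflected = self.neighbor.system_id if self.neighbor else None
        return LIE(self.my_system_id, self.my_level, reflected)
```

Two such adjacencies exchanging LIEs converge to `ThreeWay` after three packets, mirroring the cleaned-up [RFC5303]-style handshake.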
LIE exchange happens over a configured or otherwise well-known, 1398 administratively locally scoped IPv4 multicast address 1399 [RFC2365] and/or the link-local multicast scope [RFC4291] for IPv6 1400 [RFC8200], using a configured or otherwise well-known destination 1401 UDP port defined in Appendix C.1. LIEs SHOULD be sent with an IPv4 1402 Time to Live (TTL) / IPv6 Hop Limit (HL) of 1 to prevent RIFT 1403 information reaching beyond a single L3 next-hop in the topology. 1404 LIEs SHOULD be sent with network control precedence. 1406 The originating port of the LIE has no further significance other than 1407 identifying the origination point. LIEs are exchanged over all links 1408 running RIFT. 1410 An implementation MAY listen and send LIEs on IPv4 and/or IPv6 1411 multicast addresses. A node MUST NOT originate LIEs on an address 1412 family if it does not process received LIEs on that family. LIEs on the 1413 same link are considered part of the same negotiation independent of 1414 the address family they arrive on. Observe further that the LIE 1415 source address may not identify the peer uniquely in unnumbered or 1416 link-local address cases so the response transmission MUST occur over 1417 the same interface the LIEs have been received on. A node MAY use 1418 any of the adjacency's source addresses it saw in LIEs on the 1419 specific interface during adjacency formation to send TIEs. That 1420 implies that an implementation MUST be ready to accept TIEs on all 1421 addresses it used as the source of LIE frames. 1423 A three-way adjacency over any address family implies support for 1424 IPv4 forwarding if the `v4_forwarding_capable` flag is set to true 1425 and a node can use [RFC5549] type of forwarding in such a situation. 1426 It is expected that the whole fabric supports the same type of 1427 forwarding of address families on all the links.
Operation of a 1428 fabric where only some of the links are supporting forwarding on an 1429 address family and others do not is outside the scope of this 1430 specification. 1432 The protocol does NOT support selective disabling of address 1433 families, disabling v4 forwarding capability or any local address 1434 changes in three-way state, i.e. if a link has entered three-way IPv4 1435 and/or IPv6 with a neighbor on an adjacency and it wants to stop 1436 supporting one of the families or change any of its local addresses 1437 or stop v4 forwarding, it has to tear down and rebuild the adjacency. 1438 It also has to remove any information it stored about the adjacency 1439 such as LIE source addresses seen. 1441 Unless ZTP as described in Section 4.2.7 is used, each node is 1442 provisioned with the level at which it is operating. It MAY be also 1443 provisioned with its PoD. If any of those values is undefined, then 1444 accordingly a default level and/or an "undefined" PoD are assumed. 1445 This means that leaves do not need to be configured at all if initial 1446 configuration values are all left at "undefined" value. Nodes above 1447 ToP MUST remain at "any" PoD value which has the same value as 1448 "undefined" PoD. This information is propagated in the LIEs 1449 exchanged. 1451 Further definitions of leaf flags are found in Section 4.2.7 given 1452 they have implications in terms of level and adjacency forming here. 1454 A node tries to form a three-way adjacency if and only if 1456 1. the node is in the same PoD or either the node or the neighbor 1457 advertises "undefined/any" PoD membership (PoD# = 0) AND 1459 2. the neighboring node is running the same MAJOR schema version AND 1461 3. the neighbor is not member of some PoD while the node has a 1462 northbound adjacency already joining another PoD AND 1464 4. the neighboring node uses a valid System ID AND 1466 5. the neighboring node uses a different System ID than the node 1467 itself 1469 6. 
the advertised MTUs match on both sides AND 1471 7. both nodes advertise defined level values AND 1473 8. [ 1475 i) the node is at level 0 and has no three way adjacencies 1476 already to nodes at Highest Adjacency Three-Way level (HAT as 1477 defined later in Section 4.2.7.1) with level different than 1478 the adjacent node OR 1480 ii) the node is not at level 0 and the neighboring node is at 1481 level 0 OR 1483 iii) both nodes are at level 0 AND both indicate support for 1484 Section 4.3.8 OR 1486 iv) neither node is at level 0 and the neighboring node is at 1487 most one level away 1489 ]. 1491 The rules checking PoD numbering MAY be optionally disregarded by a 1492 node if PoD detection is undesirable or has to be ignored. This will 1493 not affect the correctness of the protocol except preventing 1494 detection of certain miscabling cases. 1496 A node configured with "undefined" PoD membership MUST, after 1497 building first northbound three way adjacencies to a node being in a 1498 defined PoD, advertise that PoD as part of its LIEs. In case that 1499 adjacency is lost, from all available northbound three way 1500 adjacencies the node with the highest System ID and defined PoD is 1501 chosen. That way the northmost defined PoD value (normally the ToP 1502 nodes) can diffuse southbound towards the leaves "forcing" the PoD 1503 value on any node with "undefined" PoD. 1505 LIEs arriving with IPv4 Time to Live (TTL) / IPv6 Hop Limit (HL) 1506 larger than 1 MUST be ignored. 1508 A node SHOULD NOT send out LIEs without defined level in the header 1509 but in certain scenarios it may be beneficial for trouble-shooting 1510 purposes. 1512 4.2.2.1. LIE FSM 1514 This section specifies the precise, normative LIE FSM and can be 1515 omitted unless the reader is pursuing an implementation of the 1516 protocol. 1518 Initial state is `OneWay`. 
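The adjacency forming conditions of Section 4.2.2 (items 1 through 8 above) can be condensed into an illustrative predicate. The attribute names below are assumptions for the example, not the normative schema, and rule 3 (PoD consistency with existing northbound adjacencies) is omitted because it requires adjacency state beyond the two LIEs:

```python
from types import SimpleNamespace

LEAF_LEVEL = 0
UNDEFINED_POD = 0  # "undefined/any" PoD value

def can_form_three_way(node, nbr) -> bool:
    """Sketch of the adjacency forming rules; node/nbr carry the
    illustrative attributes pod, major_version, system_id, mtu,
    level, leaf2leaf and hat (Highest Adjacency Three-Way)."""
    if not (node.pod == nbr.pod
            or UNDEFINED_POD in (node.pod, nbr.pod)):          # rule 1
        return False
    if node.major_version != nbr.major_version:                # rule 2
        return False
    if nbr.system_id == 0 or nbr.system_id == node.system_id:  # rules 4, 5
        return False
    if node.mtu != nbr.mtu:                                    # rule 6
        return False
    if node.level is None or nbr.level is None:                # rule 7
        return False
    # rule 8: acceptable level relationships
    if node.level == LEAF_LEVEL and nbr.level == LEAF_LEVEL:
        return node.leaf2leaf and nbr.leaf2leaf                # 8.iii
    if node.level == LEAF_LEVEL:
        return node.hat is None or node.hat == nbr.level       # 8.i (simplified)
    if nbr.level == LEAF_LEVEL:
        return True                                            # 8.ii
    return abs(node.level - nbr.level) <= 1                    # 8.iv
```

For instance, a level 1 spine and a level 2 ToF with matching MTU and schema version satisfy the predicate, while nodes more than one level apart (neither being leaf) do not.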
1520 Event `MultipleNeighbors` occurs normally when more than two nodes 1521 see each other on the same link or a remote node is quickly 1522 reconfigured or rebooted without regressing to `OneWay` first. Each 1523 occurrence of the event SHOULD generate a clear, according 1524 notification to help operational deployments. 1526 The machine sends LIEs on several transitions to accelerate adjacency 1527 bring-up without waiting for the timer tic. 1529 Enter 1530 | 1531 V 1532 +-----------+ 1533 | OneWay |<----+ 1534 | | | HALChanged [StoreHAL] 1535 | Entry: | | HALSChanged [StoreHALS] 1536 | [CleanUp] | | HATChanged [StoreHAT] 1537 | | | HoldTimerExpired [-] 1538 | | | InstanceNameMismatch [-] 1539 | | | LevelChanged [UpdateLevel, PUSH SendLie] 1540 | | | LieReceived [ProcessLIE] 1541 | | | MTUMismatch [-] 1542 | | | NeighborAddressAdded [-] 1543 | | | NeighborChangedAddress [-] 1544 | | | NeighborChangedLevel [-] 1545 | | | NeighborChangedMinorFields [-] 1546 | | | NeighborDroppedReflection [-] 1547 | | | PODMismatch [-] 1548 | | | SendLIE [SendLIE] 1549 | | | TimerTick [PUSH SendLIE] 1550 | | | UnacceptableHeader 1551 | | | UpdateZTPOffer [SendOfferToZTPFSM] 1552 | |-----+ 1553 | | 1554 | |<--------------------- (ThreeWay) 1555 | |---------------------> 1556 | | ValidReflection [-] 1557 | | 1558 | |---------------------> (Multiple 1559 | | MultipleNeighbors Neighbors 1560 +-----------+ [StartMulNeighTimer] Wait) 1561 ^ | 1562 | | 1563 | | NewNeighbor [PUSH SendLIE] 1564 | V 1565 (TwoWay) 1567 LIE FSM 1569 (OneWay) 1570 | ^ 1571 | | HoldTimeExpired [-] 1572 | | InstanceNameMismatch [-] 1573 | | LevelChanged [StoreLevel] 1574 | | MTUMismatch [-] 1575 | | NeighborChangedAddress [-] 1576 | | NeighborChangedLevel [-] 1577 | | PODMismatch [-] 1578 | | UnacceptableHeader [-] 1579 V | 1580 +-----------+ 1581 | TwoWay |<----+ 1582 | | | HALChanged [StoreHAL] 1583 | | | HALSChanged [StoreHALS] 1584 | | | HATChanged [StoreHAT] 1585 | | | LevelChanged [StoreLevel] 1586 | | | 
LIERcvd [ProcessLIE] 1587 | | | SendLIE [SendLIE] 1588 | | | TimerTick [PUSH SendLIE, 1589 | | | IF HoldTimer expired 1590 | | | PUSH HoldTimerExpired] 1591 | | | UpdateZTPOffer [SendOfferToZTPFSM] 1592 | |-----+ 1593 | | 1594 | |<---------------------- 1595 | |----------------------> (Multiple 1596 | | NewNeighbor Neighbors 1597 | | [StartMulNeighTimer] Wait) 1598 | | MultipleNeighbors 1599 +-----------+ [StartMulNeighTimer] 1600 ^ | 1601 | | ValidReflection [-] 1602 | V 1603 (ThreeWay) 1605 LIE FSM (continued) 1607 (TwoWay) (OneWay) 1608 ^ | ^ 1609 | | | HoldTimerExpired [-] 1610 | | | InstanceNameMismatch [-] 1611 | | | LevelChanged [UpdateLevel] 1612 | | | MTUMismatch [-] 1613 | | | NeighborChangedAddress [-] 1614 | | | NeighborChangedLevel [-] 1615 NeighborDropped- | | | PODMismatch [-] 1616 Reflection [-] | | | UnacceptableHeader [-] 1617 | V | 1618 +-----------+ | 1619 | ThreeWay |-----+ 1620 | | 1621 | |<----+ 1622 | | | HALChanged [StoreHAL] 1623 | | | HALSChanged [StoreHALS] 1624 | | | HATChanged [StoreHAT] 1625 | | | LieReceived [ProcessLIE] 1626 | | | SendLIE [SendLIE] 1627 | | | TimerTick [PUSH SendLie, 1628 | | | IF HoldTimer expired 1629 | | | PUSH HoldTimerExpired] 1630 | | | UpdateZTPOffer [SendOfferToZTPFSM] 1631 | | | ValidReflection [-] 1632 | |-----+ 1633 | |----------------------> (Multiple 1634 | | MultipleNeighbors Neighbors 1635 +-----------+ [StartMulNeighTimer] Wait) 1637 LIE FSM (continued) 1639 (TwoWay) (ThreeWay) 1640 | | 1641 V V 1642 +------------+ 1643 | Multiple |<----+ 1644 | Neighbors | | HALChanged [StoreHAL] 1645 | Wait | | HALSChanged [StoreHALS] 1646 | | | HATChanged [StoreHAT] 1647 | | | MultipleNeighbors 1648 | | | [StartMultipleNeighborsTimer] 1649 | | | TimerTick [IF MulNeighTimer expired 1650 | | | PUSH MultipleNeighborsDone] 1651 | | | UpdateZTPOffer [SendOfferToZTP] 1652 | |-----+ 1653 | | 1654 | |<--------------------------- 1655 | |---------------------------> (OneWay) 1656 | | LevelChanged [StoreLevel] 1657 
+------------+ MultipleNeighborsDone [-] 1659 LIE FSM (continued) 1661 Events 1663 o TimerTick: one second timer tick 1665 o LevelChanged: node's level has been changed by ZTP or 1666 configuration 1668 o HALChanged: best HAL computed by ZTP has changed 1670 o HATChanged: HAT computed by ZTP has changed 1672 o HALSChanged: set of HAL offering systems computed by ZTP has 1673 changed 1675 o LieRcvd: received LIE 1677 o NewNeighbor: new neighbor parsed 1679 o ValidReflection: received own reflection from neighbor 1681 o NeighborDroppedReflection: lost previous own reflection from 1682 neighbor 1684 o NeighborChangedLevel: neighbor changed advertised level 1686 o NeighborChangedAddress: neighbor changed IP address 1687 o UnacceptableHeader: unacceptable header seen 1689 o MTUMismatch: MTU mismatched 1691 o InstanceNameMismatch: instance name mismatched 1693 o PODMismatch: unacceptable PoD seen 1695 o HoldtimeExpired: adjacency hold down expired 1697 o MultipleNeighbors: more than one neighbor seen on the interface 1699 o MultipleNeighborsDone: cooldown for multiple neighbors expired 1701 o SendLie: send a LIE out 1703 o UpdateZTPOffer: update this node's ZTP offer 1705 Actions 1707 on MultipleNeighbors in OneWay finishes in MultipleNeighborsWait: 1708 start multiple neighbors timer as 4 * DEFAULT_LIE_HOLDTIME 1710 on NeighborDroppedReflection in ThreeWay finishes in TwoWay: no 1711 action 1713 on NeighborDroppedReflection in OneWay finishes in OneWay: no 1714 action 1716 on PODMismatch in TwoWay finishes in OneWay: no action 1718 on NewNeighbor in TwoWay finishes in MultipleNeighborsWait: PUSH 1719 SendLie event 1721 on LieRcvd in OneWay finishes in OneWay: PROCESS_LIE 1723 on UnacceptableHeader in ThreeWay finishes in OneWay: no action 1725 on UpdateZTPOffer in TwoWay finishes in TwoWay: send offer to ZTP 1726 FSM 1728 on NeighborChangedAddress in ThreeWay finishes in OneWay: no 1729 action 1731 on HALChanged in MultipleNeighborsWait finishes in 1732 MultipleNeighborsWait:
store new HAL 1734 on NeighborChangedAddress in TwoWay finishes in OneWay: no action 1735 on MultipleNeighbors in TwoWay finishes in MultipleNeighborsWait: 1736 start multiple neighbors timer as 4 * DEFAULT_LIE_HOLDTIME 1738 on LevelChanged in ThreeWay finishes in OneWay: update level with 1739 event value 1741 on LieRcvd in ThreeWay finishes in ThreeWay: PROCESS_LIE 1743 on ValidReflection in OneWay finishes in ThreeWay: no action 1745 on NeighborChangedLevel in TwoWay finishes in OneWay: no action 1747 on MultipleNeighbors in ThreeWay finishes in 1748 MultipleNeighborsWait: start multiple neighbors timer as 4 * 1749 DEFAULT_LIE_HOLDTIME 1751 on InstanceNameMismatch in OneWay finishes in OneWay: no action 1753 on NewNeighbor in OneWay finishes in TwoWay: PUSH SendLie event 1755 on UpdateZTPOffer in OneWay finishes in OneWay: send offer to ZTP 1756 FSM 1758 on UpdateZTPOffer in ThreeWay finishes in ThreeWay: send offer to 1759 ZTP FSM 1761 on MTUMismatch in ThreeWay finishes in OneWay: no action 1763 on TimerTick in OneWay finishes in OneWay: PUSH SendLie event 1765 on SendLie in TwoWay finishes in TwoWay: SEND_LIE 1767 on ValidReflection in ThreeWay finishes in ThreeWay: no action 1769 on InstanceNameMismatch in TwoWay finishes in OneWay: no action 1771 on HoldtimeExpired in OneWay finishes in OneWay: no action 1773 on TimerTick in ThreeWay finishes in ThreeWay: PUSH SendLie event, 1774 if holdtime expired PUSH HoldtimeExpired event 1776 on HALChanged in TwoWay finishes in TwoWay: store new HAL 1778 on HoldtimeExpired in ThreeWay finishes in OneWay: no action 1780 on HALSChanged in TwoWay finishes in TwoWay: store HALS 1782 on HALSChanged in ThreeWay finishes in ThreeWay: store HALS 1783 on ValidReflection in TwoWay finishes in ThreeWay: no action 1785 on MultipleNeighborsDone in MultipleNeighborsWait finishes in 1786 OneWay: no action 1788 on NeighborAddressAdded in OneWay finishes in OneWay: no action 1790 on TimerTick in MultipleNeighborsWait finishes in 1791 
MultipleNeighborsWait: decrement MultipleNeighbors timer, if 1792 expired PUSH MultipleNeighborsDone 1794 on MTUMismatch in OneWay finishes in OneWay: no action 1796 on MultipleNeighbors in MultipleNeighborsWait finishes in 1797 MultipleNeighborsWait: start multiple neighbors timer as 4 * 1798 DEFAULT_LIE_HOLDTIME 1800 on LieRcvd in TwoWay finishes in TwoWay: PROCESS_LIE 1802 on HATChanged in MultipleNeighborsWait finishes in 1803 MultipleNeighborsWait: store HAT 1805 on HoldtimeExpired in TwoWay finishes in OneWay: no action 1807 on NeighborChangedLevel in ThreeWay finishes in OneWay: no action 1809 on LevelChanged in OneWay finishes in OneWay: update level with 1810 event value, PUSH SendLie event 1812 on SendLie in OneWay finishes in OneWay: SEND_LIE 1814 on HATChanged in OneWay finishes in OneWay: store HAT 1816 on LevelChanged in TwoWay finishes in TwoWay: update level with 1817 event value 1819 on HATChanged in TwoWay finishes in TwoWay: store HAT 1821 on PODMismatch in ThreeWay finishes in OneWay: no action 1823 on LevelChanged in MultipleNeighborsWait finishes in OneWay: 1824 update level with event value 1826 on UnacceptableHeader in TwoWay finishes in OneWay: no action 1828 on NeighborChangedLevel in OneWay finishes in OneWay: no action 1830 on InstanceNameMismatch in ThreeWay finishes in OneWay: no action 1831 on HATChanged in ThreeWay finishes in ThreeWay: store HAT 1833 on HALChanged in OneWay finishes in OneWay: store new HAL 1835 on UnacceptableHeader in OneWay finishes in OneWay: no action 1837 on HALChanged in ThreeWay finishes in ThreeWay: store new HAL 1839 on UpdateZTPOffer in MultipleNeighborsWait finishes in 1840 MultipleNeighborsWait: send offer to ZTP FSM 1842 on NeighborChangedMinorFields in OneWay finishes in OneWay: no 1843 action 1845 on NeighborChangedAddress in OneWay finishes in OneWay: no action 1847 on MTUMismatch in TwoWay finishes in OneWay: no action 1849 on PODMismatch in OneWay finishes in OneWay: no action 1851 on SendLie in 
ThreeWay finishes in ThreeWay: SEND_LIE 1853 on TimerTick in TwoWay finishes in TwoWay: PUSH SendLie event, if 1854 holdtime expired PUSH HoldtimeExpired event 1856 on HALSChanged in OneWay finishes in OneWay: store HALS 1858 on HALSChanged in MultipleNeighborsWait finishes in 1859 MultipleNeighborsWait: store HALS 1861 on Entry into OneWay: CLEANUP 1863 The following words are used for well-known procedures: 1865 1. PUSH Event: pushes an event to be executed by the FSM upon exit 1866 of this action 1868 2. CLEANUP: neighbor MUST be reset to unknown 1870 3. SEND_LIE: create a new LIE packet 1872 1. reflecting the neighbor if known and valid and 1874 2. setting the necessary `not_a_ztp_offer` variable if level was 1875 derived from the last known neighbor on this interface and 1877 3. setting `you_are_not_flood_repeater` to computed value 1879 4. PROCESS_LIE: 1881 1. if lie has a wrong major version OR carries our own system ID OR an 1882 invalid system ID then CLEANUP else 1884 2. if lie has non-matching MTUs then CLEANUP, PUSH 1885 UpdateZTPOffer, PUSH MTUMismatch else 1887 3. if PoD rules do not allow adjacency forming then CLEANUP, 1888 PUSH PODMismatch else 1890 4. if lie has undefined level OR my level is undefined OR this 1891 node is leaf and remote level lower than HAT OR (lie's level 1892 is not leaf AND its difference is more than one from my 1893 level) then CLEANUP, PUSH UpdateZTPOffer, PUSH 1894 UnacceptableHeader else 1896 5. PUSH UpdateZTPOffer, construct temporary new neighbor 1897 structure with values from lie, if no current neighbor exists 1898 then set neighbor to new neighbor, PUSH NewNeighbor event, 1899 CHECK_THREE_WAY else 1901 1. if current neighbor system ID differs from lie's system 1902 ID then PUSH MultipleNeighbors else 1904 2. if current neighbor stored level differs from lie's level 1905 then PUSH NeighborChangedLevel else 1907 3.
if current neighbor stored IPv4/v6 address differs from 1908 lie's address then PUSH NeighborChangedAddress else 1910 4. if any of neighbor's flood address port, name, local 1911 linkid changed then PUSH NeighborChangedMinorFields and 1913 5. CHECK_THREE_WAY 1915 5. CHECK_THREE_WAY: if current state is one-way do nothing else 1917 1. if lie packet does not contain neighbor then if current state 1918 is three-way then PUSH NeighborDroppedReflection else 1920 2. if packet reflects this system's ID and local port and state 1921 is three-way then PUSH event ValidReflection else PUSH event 1922 MultipleNeighbors 1924 4.2.3. Topology Exchange (TIE Exchange) 1926 4.2.3.1. Topology Information Elements 1928 Topology and reachability information in RIFT is conveyed by the 1929 means of TIEs which have a good amount of commonalities with LSAs in 1930 OSPF. 1932 The TIE exchange mechanism uses the port indicated by each node in 1933 the LIE exchange and the interface on which the adjacency has been 1934 formed as destination. It SHOULD use a TTL of 1 as well and set 1935 internetwork control precedence on according packets. 1937 TIEs contain sequence numbers, lifetimes and a type. Each type has 1938 ample identifying number space and information is spread across 1939 possibly many TIEs of a certain type by the means of a hash function 1940 that a node or deployment can individually determine. One extreme 1941 design choice is a prefix per TIE, which leads to more BGP-like 1942 behavior where small increments are only advertised on route changes, 1943 vs. deploying with dense prefix packing into few TIEs, leading to a more 1944 traditional IGP trade-off with fewer TIEs. An implementation may 1945 even rehash the prefix-to-TIE mapping at any time at the cost of 1946 a significant amount of re-advertisements of TIEs. 1948 More information about the TIE structure can be found in the schema 1949 in Appendix B. 1951 4.2.3.2.
South- and Northbound Representation 1953 A central concept of RIFT is that each node represents itself 1954 differently depending on the direction in which it is advertising 1955 information. More precisely, a spine node represents two different 1956 databases over its adjacencies depending whether it advertises TIEs 1957 to the north or to the south/sideways. We call those differing TIE 1958 databases either south- or northbound (South TIEs and North TIEs) 1959 depending on the direction of distribution. 1961 The North TIEs hold all of the node's adjacencies and local prefixes 1962 while the South TIEs hold only all of the node's adjacencies, the 1963 default prefix with necessary disaggregated prefixes and local 1964 prefixes. We will explain this in detail further in Section 4.2.5. 1966 The TIE types are mostly symmetric in both directions and Table 2 1967 provides a quick reference to main TIE types including direction and 1968 their function. 1970 +--------------------+----------------------------------------------+ 1971 | TIE-Type | Content | 1972 +--------------------+----------------------------------------------+ 1973 | Node North TIE | node properties and adjacencies | 1974 +--------------------+----------------------------------------------+ 1975 | Node South TIE | same content as node North TIE | 1976 +--------------------+----------------------------------------------+ 1977 | Prefix North TIE | contains nodes' directly reachable prefixes | 1978 +--------------------+----------------------------------------------+ 1979 | Prefix South TIE | contains originated defaults and directly | 1980 | | reachable prefixes | 1981 +--------------------+----------------------------------------------+ 1982 | Positive | contains disaggregated prefixes | 1983 | Disaggregation | | 1984 | South TIE | | 1985 +--------------------+----------------------------------------------+ 1986 | Negative | contains special, negatively disaggregated | 1987 | Disaggregation | prefixes 
to support multi-plane designs | 1988 | South TIE | | 1989 +--------------------+----------------------------------------------+ 1990 | External Prefix | contains external prefixes | 1991 | North TIE | | 1992 +--------------------+----------------------------------------------+ 1993 | Key-Value North | contains node's northbound KVs | 1994 | TIE | | 1995 +--------------------+----------------------------------------------+ 1996 | Key-Value South | contains node's southbound KVs | 1997 | TIE | | 1998 +--------------------+----------------------------------------------+ 2000 Table 2: TIE Types 2002 As an example illustrating databases holding both representations, 2003 consider the topology in Figure 2 with the optional link between 2004 spine 111 and spine 112 (so that the flooding on an East-West link 2005 can be shown). This example assumes unnumbered interfaces. First, 2006 here are the TIEs generated by some nodes. For simplicity, the 2007 key-value elements which may be included in their South TIEs or North 2008 TIEs are not shown.
2010 ToF 21 South TIEs: 2011 Node South TIE: 2012 NodeElement(level=2, neighbors((Spine 111, level 1, cost 1), 2013 (Spine 112, level 1, cost 1), (Spine 121, level 1, cost 1), 2014 (Spine 122, level 1, cost 1))) 2015 Prefix South TIE: 2016 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 2018 Spine 111 South TIEs: 2019 Node South TIE: 2020 NodeElement(level=1, neighbors((ToF 21, level 2, cost 1, 2021 links(...)), 2022 (ToF 22, level 2, cost 1, links(...)), 2023 (Spine 112, level 1, cost 1, links(...)), 2024 (Leaf111, level 0, cost 1, links(...)), 2025 (Leaf112, level 0, cost 1, links(...)))) 2026 Prefix South TIE: 2027 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 2029 Spine 111 North TIEs: 2030 Node North TIE: 2031 NodeElement(level=1, 2032 neighbors((ToF 21, level 2, cost 1, links(...)), 2033 (ToF 22, level 2, cost 1, links(...)), 2034 (Spine 112, level 1, cost 1, links(...)), 2035 (Leaf111, level 0, cost 1, links(...)), 2036 (Leaf112, level 0, cost 1, links(...)))) 2037 Prefix North TIE: 2038 NorthPrefixesElement(prefixes(Spine 111.loopback) 2040 Spine 121 South TIEs: 2041 Node South TIE: 2042 NodeElement(level=1, neighbors((ToF 21,level 2,cost 1), 2043 (ToF 22, level 2, cost 1), (Leaf121, level 0, cost 1), 2044 (Leaf122, level 0, cost 1))) 2045 Prefix South TIE: 2046 SouthPrefixesElement(prefixes(0/0, cost 1), (::/0, cost 1)) 2048 Spine 121 North TIEs: 2049 Node North TIE: 2050 NodeElement(level=1, 2051 neighbors((ToF 21, level 2, cost 1, links(...)), 2052 (ToF 22, level 2, cost 1, links(...)), 2053 (Leaf121, level 0, cost 1, links(...)), 2054 (Leaf122, level 0, cost 1, links(...)))) 2055 Prefix North TIE: 2056 NorthPrefixesElement(prefixes(Spine 121.loopback) 2058 Leaf112 North TIEs: 2059 Node North TIE: 2060 NodeElement(level=0, 2061 neighbors((Spine 111, level 1, cost 1, links(...)), 2062 (Spine 112, level 1, cost 1, links(...)))) 2063 Prefix North TIE: 2064 NorthPrefixesElement(prefixes(Leaf112.loopback, Prefix112, 2065 Prefix_MH)) 
          Figure 14: Example TIEs Generated in a 2 Level
                     Spine-and-Leaf Topology

It may not be immediately obvious why the node South TIEs contain all
the adjacencies of the corresponding node.  This is necessary for the
algorithms given in Section 4.2.3.9 and Section 4.3.6.

4.2.3.3.  Flooding

The mechanism used to distribute TIEs is the well-known (albeit
modified in several respects to take advantage of fat tree topology)
flooding mechanism used by today's link-state protocols.  Although
flooding is initially more demanding to implement, it avoids many
problems of the update style used in diffused computations such as
distance-vector protocols.  Since flooding tends to present an
unscalable burden in large, densely meshed topologies (fat trees
being unfortunately such a topology), Section 4.2.3.9 provides a
close-to-optimal global flood reduction and load balancing
optimization.

As described before, TIEs themselves are transported over UDP with
the ports indicated in the LIE exchanges and using the destination
address on which the LIE adjacency has been formed.  For unnumbered
IPv4 interfaces the same considerations apply as in the equivalent
OSPF case.

4.2.3.3.1.  Normative Flooding Procedures

On reception of a TIE with an undefined level value in the packet
header the node SHOULD issue a warning and indiscriminately discard
the packet.

This section specifies the precise, normative flooding mechanism and
can be omitted unless the reader is pursuing an implementation of the
protocol.

Flooding procedures are described in terms of the flooding state of
an adjacency and the resulting operations on it driven by packet
arrivals.  The FSM itself has basically just a single state and is
not well suited to represent the behavior.
An implementation MUST behave on the wire in the same way as the
normative procedures provided in this section.

RIFT does not specify any kind of flood rate limiting since such
specifications always assume particular points in available
technology speeds and feeds, and those points are shifting at a
faster and faster rate (the speed of light holding for the moment).
The encoded packets provide hints to react accordingly to losses or
overruns.

Flooding of all according topology exchange elements SHOULD be
performed at the highest feasible rate, whereas the rate of
transmission MUST be throttled by reacting to adequate features of
the system such as, e.g., queue lengths or congestion indications in
the protocol packets.

A node SHOULD NOT send out any topology information elements if the
adjacency is not in a "three-way" state.  No further tightening of
this rule is possible due to possible link buffering and re-ordering
of LIEs and TIEs/TIDEs/TIREs.

A node MUST drop any received TIEs/TIDEs/TIREs unless it is in
three-way state.

TIDEs and TIREs MUST NOT be re-flooded the way TIEs of other nodes
are; they MUST always be generated by the node itself and cross only
to the neighboring node.

4.2.3.3.1.1.  FloodState Structure per Adjacency

The structure conceptually contains the following elements.  The word
collection or queue indicates a set of elements that can be iterated
over:

TIES_TX:  Collection containing all the TIEs to transmit on the
   adjacency.

TIES_ACK:  Collection containing all the TIEs that have to be
   acknowledged on the adjacency.

TIES_REQ:  Collection containing all the TIE headers that have to be
   requested on the adjacency.

TIES_RTX:  Collection containing all TIEs that need retransmission
   with the according time to retransmit.
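As a non-normative illustration, the four collections above can be
sketched as a small data structure.  Everything below is a sketch
under the assumption that TIE identifiers are opaque, hashable keys;
the class and method names are hypothetical, not part of the
protocol:

```python
# Non-normative sketch of the per-adjacency FloodState structure.
# TIE identifiers are treated as opaque, hashable keys.
from dataclasses import dataclass, field

@dataclass
class FloodState:
    ties_tx: dict = field(default_factory=dict)   # TIEs to transmit
    ties_ack: dict = field(default_factory=dict)  # TIEs awaiting acknowledgement
    ties_req: dict = field(default_factory=dict)  # TIE headers to request
    ties_rtx: dict = field(default_factory=dict)  # TIEs scheduled for retransmission

    def _all_queues(self):
        return (self.ties_tx, self.ties_ack, self.ties_req, self.ties_rtx)

    def remove_from_all_queues(self, tie_id):
        # remove the TIE from every collection
        for queue in self._all_queues():
            queue.pop(tie_id, None)

    def ack_tie(self, tie_id, tie):
        # remove the TIE from all collections, then queue the acknowledgement
        self.remove_from_all_queues(tie_id)
        self.ties_ack[tie_id] = tie
```

The normative operations on this structure follow in the text; only
the two simplest ones are shown here.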
The following words are used for well-known procedures operating on
this structure:

TIE:  Describes either a full RIFT TIE or, accordingly, just the
   `TIEHeader` or `TIEID`.  The according meaning is unambiguous in
   the context of the algorithm.

is_flood_reduced(TIE):  returns whether a TIE can be flood reduced
   or not.

is_tide_entry_filtered(TIE):  returns whether a header should be
   propagated in a TIDE according to flooding scopes.

is_request_filtered(TIE):  returns whether a TIE request should be
   propagated to the neighbor or not according to flooding scopes.

is_flood_filtered(TIE):  returns whether a TIE should be flooded to
   the neighbor or not according to flooding scopes.

try_to_transmit_tie(TIE):

   A.  if not is_flood_filtered(TIE) then

       1.  remove TIE from TIES_RTX if present

       2.  if TIE" with same key is present on TIES_ACK then

           a.  if TIE" is the same as or newer than TIE do nothing else

           b.  remove TIE" from TIES_ACK and add TIE to TIES_TX

       3.  else insert TIE into TIES_TX

ack_tie(TIE):  remove TIE from all collections and then insert TIE
   into TIES_ACK.

tie_been_acked(TIE):  remove TIE from all collections.

remove_from_all_queues(TIE):  same as `tie_been_acked`.

request_tie(TIE):  if not is_request_filtered(TIE) then
   remove_from_all_queues(TIE) and add to TIES_REQ.

move_to_rtx_list(TIE):  remove TIE from TIES_TX and then add it to
   TIES_RTX using the TIE retransmission interval.

clear_requests(TIEs):  remove all TIEs from TIES_REQ.

bump_own_tie(TIE):  for a self-originated TIE, originate an empty
   version or re-generate it with a version number higher than the
   one in TIE.

The collections SHOULD be served with the following priorities if the
system cannot process all the collections in real time:

   Elements on TIES_ACK should be processed with highest priority

   TIES_TX

   TIES_REQ and TIES_RTX

4.2.3.3.1.2.
TIDEs

`TIEID` and `TIEHeader` space forms a strict total order (modulo
incomparable sequence numbers in the very unlikely event that can
occur if a TIE is "stuck" in a part of a network while the originator
reboots and reissues TIEs many times to the point its sequence number
rolls over and forms an incomparable distance to the "stuck" copy),
which implies that a comparison relation is possible between two
elements.  With that it is implicitly possible to compare TIEs,
TIEHeaders and TIEIDs to each other, whereas the shortest viable key
is always implied.

When generating and sending TIDEs an implementation SHOULD ensure
that enough bandwidth is left to send elements of the FloodState
structure.

4.2.3.3.1.2.1.  TIDE Generation

As given by the timer constant, periodically generate TIDEs using:

   NEXT_TIDE_ID:  ID of the next TIE to be sent in a TIDE.

   TIDE_START:  Begin of the TIDE packet range.

a.  NEXT_TIDE_ID = MIN_TIEID

b.  while NEXT_TIDE_ID not equal to MAX_TIEID do

    1.  TIDE_START = NEXT_TIDE_ID

    2.  HEADERS = at most TIRDEs_PER_PKT headers in TIEDB starting at
        NEXT_TIDE_ID or higher that SHOULD be filtered by
        is_tide_entry_filtered and MUST either have a lifetime left >
        0 or have no content

    3.  if HEADERS is empty then START = MIN_TIEID else START = first
        element in HEADERS

    4.  if HEADERS' size is less than TIRDEs_PER_PKT then END =
        MAX_TIEID else END = last element in HEADERS

    5.  send sorted HEADERS as TIDE setting START and END as its
        range

    6.  NEXT_TIDE_ID = END

The constant `TIRDEs_PER_PKT` SHOULD be computed and used by the
implementation to limit the number of TIE headers per TIDE so the
sent TIDE PDU does not exceed the interface MTU.

TIDE PDUs SHOULD be spaced on sending to prevent packet drops.

4.2.3.3.1.2.2.
TIDE Processing

On reception of TIDEs the following processing is performed:

   TXKEYS:  Collection of TIE headers to be sent after processing of
      the packet

   REQKEYS:  Collection of TIEIDs to be requested after processing of
      the packet

   CLEARKEYS:  Collection of TIEIDs to be removed from flood state
      queues

   LASTPROCESSED:  Last processed TIEID in the TIDE

   DBTIE:  TIE in the LSDB if found

a.  LASTPROCESSED = TIDE.start_range

b.  for every HEADER in TIDE do

    1.  DBTIE = find HEADER in current LSDB

    2.  if HEADER < LASTPROCESSED then report error, reset the
        adjacency and return

    3.  put all TIEs in the LSDB where (TIE.HEADER > LASTPROCESSED
        and TIE.HEADER < HEADER) into TXKEYS

    4.  LASTPROCESSED = HEADER

    5.  if DBTIE not found then

        I)   if originator is this node then bump_own_tie

        II)  else put HEADER into REQKEYS

    6.  if DBTIE.HEADER < HEADER then

        I)   if originator is this node then bump_own_tie else

             i.   if this is a North TIE header from a northbound
                  neighbor then override DBTIE in LSDB with HEADER

             ii.  else put HEADER into REQKEYS

    7.  if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS

    8.  if DBTIE.HEADER = HEADER then

        I)   if DBTIE has content already then put DBTIE.HEADER into
             CLEARKEYS

        II)  else put HEADER into REQKEYS

c.  put all TIEs in the LSDB where (TIE.HEADER > LASTPROCESSED and
    TIE.HEADER <= TIDE.end_range) into TXKEYS

d.  for all TIEs in TXKEYS try_to_transmit_tie(TIE)

e.  for all TIEs in REQKEYS request_tie(TIE)

f.  for all TIEs in CLEARKEYS remove_from_all_queues(TIE)

4.2.3.3.1.3.  TIREs

4.2.3.3.1.3.1.  TIRE Generation

There is not much to say here.  Elements from both TIES_REQ and
TIES_ACK MUST be collected and sent out as fast as feasible as TIREs.
When sending TIREs with elements from TIES_REQ the `lifetime` field
MUST be set to 0 to force reflooding from the neighbor even if the
TIEs seem to be the same.

4.2.3.3.1.3.2.  TIRE Processing

On reception of TIREs the following processing is performed:

   TXKEYS:  Collection of TIE headers to be sent after processing of
      the packet

   REQKEYS:  Collection of TIEIDs to be requested after processing of
      the packet

   ACKKEYS:  Collection of TIEIDs that have been acked

   DBTIE:  TIE in the LSDB if found

a.  for every HEADER in TIRE do

    1.  DBTIE = find HEADER in current LSDB

    2.  if DBTIE not found then do nothing

    3.  if DBTIE.HEADER < HEADER then put HEADER into REQKEYS

    4.  if DBTIE.HEADER > HEADER then put DBTIE.HEADER into TXKEYS

    5.  if DBTIE.HEADER = HEADER then put DBTIE.HEADER into ACKKEYS

b.  for all TIEs in TXKEYS try_to_transmit_tie(TIE)

c.  for all TIEs in REQKEYS request_tie(TIE)

d.  for all TIEs in ACKKEYS tie_been_acked(TIE)

4.2.3.3.1.4.  TIEs Processing on Flood State Adjacency

On reception of TIEs the following processing is performed:

   ACKTIE:  TIE to acknowledge

   TXTIE:  TIE to transmit

   DBTIE:  TIE in the LSDB if found

a.  DBTIE = find TIE in current LSDB

b.  if DBTIE not found then

    1.  if originator is this node then bump_own_tie with a short
        remaining lifetime

    2.  else insert TIE into LSDB and ACKTIE = TIE

    else

    1.  if DBTIE.HEADER = TIE.HEADER then

        i.   if DBTIE has content already then ACKTIE = TIE

        ii.  else process like the "DBTIE.HEADER < TIE.HEADER" case

    2.  if DBTIE.HEADER < TIE.HEADER then

        i.   if originator is this node then bump_own_tie

        ii.  else insert TIE into LSDB and ACKTIE = TIE

    3.  if DBTIE.HEADER > TIE.HEADER then

        i.   if DBTIE has content already then TXTIE = DBTIE

        ii.  else ACKTIE = DBTIE

c.  if TXTIE is set then try_to_transmit_tie(TXTIE)

d.
if ACKTIE is set then ack_tie(ACKTIE)

4.2.3.3.1.5.  TIEs Processing When LSDB Received Newer Version on
              Other Adjacencies

The Link State Database can be considered a switchboard that does not
need any flooding procedures but can be given new versions of TIEs by
a peer.  Consequently, a peer receives from the LSDB newer versions
of TIEs received by other peers and processes them (without any
filtering) just like TIEs received from its remote peer.  This
publisher model can be implemented in many ways.

4.2.3.3.1.6.  Sending TIEs

On a periodic basis all TIEs with lifetime left > 0 MUST be sent out
on the adjacency, removed from the TIES_TX list and requeued onto the
TIES_RTX list.

4.2.3.4.  TIE Flooding Scopes

In a somewhat analogous fashion to link-local, area and domain
flooding scopes, RIFT defines several complex "flooding scopes"
depending on the direction and type of TIE propagated.

Every North TIE is flooded northbound, providing a node at a given
level with the complete topology of the Clos or Fat Tree network that
is reachable southwards of it, including all specific prefixes.  This
means that a packet received from a node at the same or lower level
whose destination is covered by one of those specific prefixes will
be routed directly towards the node advertising that prefix rather
than sending the packet to a node at a higher level.

A node's Node South TIEs, consisting of all of the node's
adjacencies, and its prefix South TIEs, limited to those related to
the default IP prefix and disaggregated prefixes, are flooded
southbound in order to allow the nodes one level down to see the
connectivity of the higher level as well as reachability to the rest
of the fabric.
In order to allow an E-W disconnected node in a given level to
receive the South TIEs of other nodes at its level, every *NODE*
South TIE is "reflected" northbound to the level from which it was
received.  It should be noted that East-West links are included in
South TIE flooding (except at ToF level); those TIEs need to be
flooded to satisfy the algorithms in Section 4.2.4.  In that way
nodes at the same level can learn about each other without a lower
level, e.g. in case of the leaf level.  The precise, normative
flooding scopes are given in Table 3.  Those rules govern as well
what SHOULD be included in TIDEs on the adjacency.  Again, East-West
flooding scopes are identical to South flooding scopes except in the
case of ToF East-West links (rings), which are basically performing
northbound flooding.

Node South TIE "south reflection" allows supporting positive
disaggregation on failures described in Section 4.2.5 and the
flooding reduction in Section 4.2.3.9.
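As a non-normative illustration of the scope rules for *node* South
TIEs, they can be condensed into a predicate; the helper below is
hypothetical and only sketches the normative rules under the
assumption of comparable integer levels:

```python
# Sketch of the flooding decision for a received *node* South TIE,
# following the scope rules described above (illustrative only).
def flood_node_south_tie(direction: str, originator_level: int,
                         my_level: int, i_am_tof: bool = False) -> bool:
    if direction == "south":
        # flood if the level of the originator equals this node's level
        return originator_level == my_level
    if direction == "north":
        # "south reflection": flood if the originator is higher than this node
        return originator_level > my_level
    if direction == "east-west":
        # flood only if this node is not ToF
        return not i_am_tof
    return False
```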
+-----------+---------------------+----------------+-----------------+
| Type /    | South               | North          | East-West       |
| Direction |                     |                |                 |
+-----------+---------------------+----------------+-----------------+
| node      | flood if level of   | flood if level | flood only if   |
| South TIE | originator is equal | of originator  | this node       |
|           | to this node        | is higher than | is not ToF      |
|           |                     | this node      |                 |
+-----------+---------------------+----------------+-----------------+
| non-node  | flood self-         | flood only if  | flood only if   |
| South TIE | originated only     | neighbor is    | self-originated |
|           |                     | originator of  | and this node   |
|           |                     | TIE            | is not ToF      |
+-----------+---------------------+----------------+-----------------+
| all North | never flood         | flood always   | flood only if   |
| TIEs      |                     |                | this node is    |
|           |                     |                | ToF             |
+-----------+---------------------+----------------+-----------------+
| TIDE      | include at least    | include at     | if this node is |
|           | all non-self        | least all node | ToF then        |
|           | originated North    | South TIEs and | include all     |
|           | TIE headers and     | all South TIEs | North TIEs,     |
|           | self-originated     | originated by  | otherwise only  |
|           | South TIE headers   | peer and       | self-originated |
|           | and                 | all North TIEs | TIEs            |
|           | node South TIEs of  |                |                 |
|           | nodes at same       |                |                 |
|           | level               |                |                 |
+-----------+---------------------+----------------+-----------------+
| TIRE as   | request all North   | request all    | if this node is |
| Request   | TIEs and all peer's | South TIEs     | ToF then apply  |
|           | self-originated     |                | North scope     |
|           | TIEs and            |                | rules,          |
|           | all node South TIEs |                | otherwise South |
|           |                     |                | scope rules     |
+-----------+---------------------+----------------+-----------------+
| TIRE as   | Ack all received    | Ack all        | Ack all         |
| Ack       | TIEs                | received TIEs  | received TIEs   |
+-----------+---------------------+----------------+-----------------+

                  Table 3: Normative Flooding Scopes

If the TIDE includes additional TIE headers besides the ones
specified, the receiving neighbor MUST apply the according filter to
the received TIDE strictly and MUST NOT request the extra TIE headers
that were not allowed by the flooding scope rules in its direction.

As an example to illustrate these rules, consider using the topology
in Figure 2, with the optional link between Spine 111 and Spine 112,
and the associated TIEs given in Figure 14.  The flooding from
particular nodes of the TIEs is given in Table 4.

+-----------+----------+--------------------------------------------+
| Router    | Neighbor | TIEs                                       |
| floods to |          |                                            |
+-----------+----------+--------------------------------------------+
| Leaf111   | Spine    | Leaf111 North TIEs, Spine 111 node South   |
|           | 112      | TIE                                        |
| Leaf111   | Spine    | Leaf111 North TIEs, Spine 112 node South   |
|           | 111      | TIE                                        |
|           |          |                                            |
| Spine 111 | Leaf111  | Spine 111 South TIEs                       |
| Spine 111 | Leaf112  | Spine 111 South TIEs                       |
| Spine 111 | Spine    | Spine 111 South TIEs                       |
|           | 112      |                                            |
| Spine 111 | ToF 21   | Spine 111 North TIEs, Leaf111              |
|           |          | North TIEs, Leaf112 North TIEs, ToF 22     |
|           |          | node South TIE                             |
| Spine 111 | ToF 22   | Spine 111 North TIEs, Leaf111              |
|           |          | North TIEs, Leaf112 North TIEs, ToF 21     |
|           |          | node South TIE                             |
|           |          |                                            |
| ...       | ...      | ...                                        |
| ToF 21    | Spine    | ToF 21 South TIEs                          |
|           | 111      |                                            |
| ToF 21    | Spine    | ToF 21 South TIEs                          |
|           | 112      |                                            |
| ToF 21    | Spine    | ToF 21 South TIEs                          |
|           | 121      |                                            |
| ToF 21    | Spine    | ToF 21 South TIEs                          |
|           | 122      |                                            |
| ...       | ...      | ...                                        |
+-----------+----------+--------------------------------------------+

           Table 4: Flooding some TIEs from example topology

4.2.3.5.
'Flood Only Node TIEs' Bit

RIFT includes an optional ECN mechanism to prevent "flooding inrush"
on restart or bring-up with many southbound neighbors.  A node MAY
set on its LIEs the according bit to indicate to the neighbor that it
should temporarily flood node TIEs only to it.  It SHOULD set the bit
only in the southbound direction.  The receiving node SHOULD
accommodate the request to lessen the flooding load on the affected
node if south of the sender and SHOULD ignore the bit if northbound.

Obviously this mechanism is most useful in the southbound direction.
The distribution of node TIEs guarantees correct behavior of
algorithms like disaggregation or default route origination.
Furthermore, the use of this bit presents an inherent trade-off
between processing load and convergence speed, since suppressing
flooding of northbound prefixes from neighbors will lead to
blackholes.

4.2.3.6.  Initial and Periodic Database Synchronization

The initial exchange of RIFT is modeled after IS-IS with the TIDE
being equivalent to a CSNP and the TIRE playing the role of a PSNP.
The content of TIDEs and TIREs is governed by Table 3.

4.2.3.7.  Purging and Roll-Overs

When a node exits the network, if "unpurged", residual stale TIEs may
exist in the network until their lifetimes expire (which in case of
RIFT is by default a rather long period to prevent ongoing
re-origination of TIEs in very large topologies).  RIFT does not,
however, have a "purging mechanism" in the traditional sense based on
sending specialized "purge" packets.  In other routing protocols such
a mechanism has proven to be complex and fragile based on many years
of experience.  RIFT simply issues a new, empty version of the TIE
with a short lifetime and relies on each node to age out and delete
such TIE copies independently.
Abundant amounts of memory are available today even on low-end
platforms and hence keeping those relatively short-lived extra copies
for a while is acceptable.  The information will age out and in the
meantime all computations will deliver correct results if a node
leaves the network, due to the new information distributed by its
adjacent nodes breaking bi-directional connectivity checks in
different computations.

Once a RIFT node issues a TIE with an ID, it SHOULD preserve the ID
as long as feasible (also when the protocol restarts), even if the
TIE loses all content.  The re-advertisement of the empty TIE
fulfills the purpose of purging any information advertised in
previous versions.  The originator is free to not re-originate the
according empty TIE again or to originate an empty TIE with a
relatively short lifetime to prevent a large number of long-lived
empty stubs polluting the network.  Each node MUST timeout and clean
up the according empty TIEs independently.

Upon restart a node MUST, as any link-state implementation, be
prepared to receive TIEs with its own system ID and supersede them
with equivalent, newly generated, empty TIEs with a higher sequence
number.  As above, the lifetime can be relatively short since it only
needs to exceed the necessary propagation and processing delay by all
the nodes that are within the TIE's flooding scope.

TIE sequence numbers are rolled over using the method described in
Appendix A.  The first sequence number of any spontaneously
originated TIE (i.e. not originated to override a detected older copy
in the network) MUST be a reasonably unpredictable random number in
the interval [0, 2^30-1], which will prevent otherwise identical TIE
headers from remaining "stuck" in the network with content different
from the TIE originated after reboot.
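A minimal sketch of the rule above, assuming Python's `secrets`
module as the source of unpredictability (the function name is
hypothetical):

```python
# Pick the first sequence number of a spontaneously originated TIE:
# a reasonably unpredictable random number in [0, 2^30 - 1].
import secrets

def initial_tie_seq_nr() -> int:
    return secrets.randbelow(2 ** 30)
```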
In traditional link-state protocols this is delegated to a 16-bit
checksum on packet content.  RIFT avoids this design due to the CPU
burden presented by the computation of such checksums and the
additional complications tied to the fact that the checksum must be
"patched" into the packet after the computation, a difficult
proposition in hand-crafted binary formats already and highly
incompatible with model-based, serialized formats.  The sequence
number space is hence consciously chosen to be 64 bits wide to make
the occurrence of a TIE with the same sequence number but different
content as unlikely as or even more unlikely than with the checksum
method.  To emulate the "checksum behavior" an implementation could
e.g. choose to compute a 64-bit checksum over the packet content and
use that as the first sequence number after reboot.

4.2.3.8.  Southbound Default Route Origination

Under certain conditions nodes issue a default route in their South
Prefix TIEs with costs as computed in Section 4.3.6.1.

A node X that

1.  is NOT overloaded AND

2.  has southbound or East-West adjacencies

originates in its south prefix TIE such a default route IIF

1.  all other nodes at X's level are overloaded OR

2.  all other nodes at X's level have NO northbound adjacencies OR

3.  X has computed reachability to a default route during N-SPF.

The term "all other nodes at X's level" describes obviously just the
nodes at the same level in the PoD with a viable lower level
(otherwise the node South TIEs cannot be reflected and the nodes in
e.g. PoD 1 and PoD 2 are "invisible" to each other).

A node originating a southbound default route MUST install a default
discard route if it did not compute a default route during N-SPF.

4.2.3.9.
Northbound TIE Flooding Reduction

Section 1.4 of the Optimized Link State Routing Protocol [RFC3626]
(OLSR) introduces the concept of a "multipoint relay" (MPR) that
minimizes the overhead of flooding messages in the network by
reducing redundant retransmissions in the same region.

A similar technique is applied to RIFT to control northbound
flooding.  Important observations first:

1.  a node MUST flood self-originated North TIEs to all the reachable
    nodes at the level above, which we call the node's "parents";

2.  it is typically not necessary that all parents reflood the North
    TIEs to achieve a complete flooding of all the reachable nodes
    two levels above, which we choose to call the node's
    "grandparents";

3.  to control the volume of its flooding two hops north and yet keep
    it robust enough, it is advantageous for a node to select a
    subset of its parents as "Flood Repeaters" (FRs), which combined
    together deliver two or more copies of its flooding to all of its
    parents, i.e. the originating node's grandparents;

4.  nodes at the same level do NOT have to agree on a specific
    algorithm to select the FRs, but overall load balancing should be
    achieved so that different nodes at the same level tend to select
    different parents as FRs;

5.  there are usually many solutions to the problem of finding a set
    of FRs for a given node; the problem of finding the minimal set
    is (similar to) an NP-complete problem and a globally optimal set
    may not be the minimal one if load balancing with other nodes is
    an important consideration;

6.  it is expected that there will often be sets of equivalent nodes
    at a level L, defined as having a common set of parents at L+1.
    Applying this observation at both L and L+1, an algorithm may
    attempt to split the larger problem into a sum of smaller
    separate problems;

7.
it is another expectation that there will be, from time to time, a
    broken link between a parent and a grandparent, and in that case
    the parent is probably a poor FR due to its lower reliability.
    An algorithm may attempt to eliminate parents with broken
    northbound adjacencies first in order to reduce the number of
    FRs.  Albeit it could be argued that relying on higher-fanout FRs
    will slow flooding due to higher replication load, the
    reliability of an FR's links seems to be the more pressing
    concern.

In a fully connected Clos Network, this means that a node selects one
arbitrary parent as FR and then a second one for redundancy.  The
computation can be kept relatively simple and completely distributed
without any need for synchronization amongst nodes.  In a "PoD"
structure, where the level L+2 is partitioned in silos of equivalent
grandparents that are only reachable from respective parents, this
means treating each silo as a fully connected Clos Network and
solving the problem within the silo.

In terms of signaling, a node has enough information to select its
set of FRs; this information is derived from the node's parents' Node
South TIEs, which indicate the parent's reachable northbound
adjacencies to its own parents, i.e. the node's grandparents.  A node
may send a LIE to a northbound neighbor with the optional boolean
field `you_are_flood_repeater` set to false, to indicate that the
northbound neighbor is not a flood repeater for the node that sent
the LIE.  In that case the northbound neighbor SHOULD NOT reflood
northbound TIEs received from the node that sent the LIE.  If
`you_are_flood_repeater` is absent or set to true, then the
northbound neighbor is a flood repeater for the node that sent the
LIE and MUST reflood northbound TIEs received from that node.
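The `you_are_flood_repeater` semantics above reduce to a simple
predicate; in this sketch only the field name comes from the
specification, the helper itself is hypothetical:

```python
from typing import Optional

# Decide whether a node is a flood repeater for a southbound neighbor,
# based on the optional boolean carried in that neighbor's last LIE.
# Only an explicit False revokes the status; absent (None) or True
# means the status is granted and refolooded TIEs MUST be sent.
def is_flood_repeater_for(you_are_flood_repeater: Optional[bool]) -> bool:
    return you_are_flood_repeater is not False
```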
This specification proposes a simple default algorithm that SHOULD be
implemented and used by default on every RIFT node.

o  let |NA(Node) be the set of northbound adjacencies of node Node
   and CN(Node) be the cardinality of |NA(Node);

o  let |SA(Node) be the set of southbound adjacencies of node Node
   and CS(Node) be the cardinality of |SA(Node);

o  let |P(Node) be the set of node Node's parents;

o  let |G(Node) be the set of node Node's grandparents.  Observe
   that |G(Node) = |P(|P(Node));

o  let N be the child node at level L computing a set of FRs;

o  let P be a node at level L+1 and a parent node of N, i.e.
   bi-directionally reachable over adjacency ADJ(N, P);

o  let G be a grandparent node of N, reachable transitively via a
   parent P over adjacencies ADJ(N, P) and ADJ(P, G).  Observe that N
   does not have enough information to check bidirectional
   reachability of ADJ(P, G);

o  let R be a redundancy constant integer; a value of 2 or higher for
   R is RECOMMENDED;

o  let S be a similarity constant integer; a value in range 0 .. 2
   for S is RECOMMENDED, the value of 1 SHOULD be used.  Two
   cardinalities are considered equivalent if their absolute
   difference is less than or equal to S, i.e. |a-b| <= S;

o  let RND be a 64-bit random number generated by the system once on
   startup.

The algorithm consists of the following steps:

1.  Derive a 64-bit number by XOR'ing N's system ID with RND.

2.  Derive a 16-bit pseudo-random unsigned integer PR(N) from the
    resulting 64-bit number by splitting it in 16-bit-long words W1,
    W2, W3, W4 (where W1 are the least significant 16 bits of the
    64-bit number, and W4 are the most significant 16 bits) and then
    XOR'ing the circularly shifted resulting words together:

    A.  (W1<<1) xor (W2<<2) xor (W3<<3) xor (W4<<4);

    where << is the circular shift operator.

3.
Sort the parents by decreasing number of northbound adjacencies
    (using decreasing system ID of the parent as tie-breaker):
    sort |P(N) by decreasing CN(P), for all P in |P(N), as ordered
    array |A(N)

4.  Partition |A(N) in subarrays |A_k(N) of parents with equivalent
    cardinality of northbound adjacencies (in other words with an
    equivalent number of grandparents they can reach):

    A.  set k=0; // k is the ID of the subarray

    B.  set i=0;

    C.  while i < CN(N) do

        i)    set j=i;

        ii)   while i < CN(N) and CN(|A(N)[j]) - CN(|A(N)[i]) <= S do

              a.  place |A(N)[i] in |A_k(N) // abstract action,
                  maybe noop

              b.  set i=i+1;

        iii)  /* At this point j is the index in |A(N) of the first
              member of |A_k(N) and (i-j) is C_k(N) defined as the
              cardinality of |A_k(N) */

              set k=k+1;

    /* At this point k is the total number of subarrays, initialized
    for the shuffling operation below */

5.  Shuffle each subarray |A_k(N) of cardinality C_k(N) individually
    within |A(N) using the Durstenfeld variation of the Fisher-Yates
    algorithm that depends on N's system ID:

    A.  while k > 0 do

        i)   for i from C_k(N)-1 to 1 decrementing by 1 do

             a.  set j to PR(N) modulo i;

             b.  exchange |A_k[j] and |A_k[i];

        ii)  set k=k-1;

6.  For each grandparent G, initialize a counter c(G) with the number
    of its southbound adjacencies to elected flood repeaters (which
    is initially zero):

    A.  for each G in |G(N) set c(G) = 0;

7.  Finally keep as FRs only parents that are needed to maintain the
    number of adjacencies between the FRs and any grandparent G equal
    to or above the redundancy constant R:

    A.  for each P in reshuffled |A(N);

        i)  if there exists an adjacency ADJ(P, G) in |NA(P) such
            that c(G) < R then

            a.  place P in FR set;

            b.  for all adjacencies ADJ(P, G') in |NA(P) increment
                c(G')

    B.
If any c(G) is still < R, it was not possible to elect a set
        of FRs that covers all grandparents with redundancy R

Additional rules for flooding reduction:

1.  The algorithm MUST be re-evaluated by a node on every change of
    local adjacencies or on reception of a parent South TIE with
    changed adjacencies.  A node MAY apply a hysteresis to prevent an
    excessive amount of computation during periods of network
    instability, just as in the case of reachability computation.

2.  Upon a change of the flood repeater set, a node SHOULD send out
    LIEs that grant flood repeater status to newly promoted nodes
    before it sends LIEs that revoke the status from the nodes that
    have been newly demoted.  This is done to prevent transient
    behavior where the full coverage of grandparents is not
    guaranteed.  Such a condition is sometimes unavoidable in case of
    lost LIEs but it will correct itself, though at a possible
    transient hit in flooding propagation speed.

3.  A node MUST always flood its self-originated TIEs.

4.  A node receiving a TIE originated by a node for which it is not a
    flood repeater SHOULD NOT reflood such TIEs to its neighbors
    except for the rules in Paragraph 6.

5.  The indication of flood reduction capability MUST be carried in
    the node TIEs and MAY be used to optimize the algorithm to
    account for nodes that will flood regardless.

6.  A node generates TIDEs as usual, but when receiving TIREs or
    TIDEs resulting in requests for a TIE of which the newest
    received copy came on an adjacency where the node was not flood
    repeater, it SHOULD ignore such requests on the first and only
    the first request.  Normally, the nodes that received the TIEs as
    flood repeaters should satisfy the requesting node and with that
    no further TIREs for such TIEs will be generated.
Otherwise, the next set of 2878 TIDEs and TIREs MUST lead to flooding independent of the flood 2879 repeater status. This solves a very difficult incast problem on 2880 nodes restarting with a very wide fanout, especially northbound. 2881 To retrieve the full database such nodes often end up processing many 2882 in-rushing copies, whereas this approach load-balances the 2883 incoming database between adjacent nodes; the flood repeaters 2884 should guarantee that two copies are sent by different nodes to 2885 protect against any losses. 2887 4.2.3.10. Special Considerations 2889 First, due to the distributed, asynchronous nature of ZTP, it can 2890 create temporary convergence anomalies where nodes at higher levels 2891 of the fabric temporarily see themselves at a lower level than they belong. 2892 Since flooding can begin before ZTP is "finished", and in fact must do 2893 so given there is no global termination criterion, information may end 2894 up in the wrong levels. A special clause when changing level takes care 2895 of that. 2897 More difficult is a condition where a node (e.g. a leaf) floods a TIE 2898 north towards its grandparent, then its parent reboots, in effect 2899 partitioning the grandparent from the leaf directly, and then the leaf 2900 itself reboots. That can leave the grandparent holding the "primary 2901 copy" of the leaf's TIE. Normally this condition is resolved easily 2902 by the leaf re-originating its TIE with a higher sequence number than 2903 it sees in northbound TIEs; here, however, when the parent comes back 2904 it won't be able to obtain the leaf's North TIE from the grandparent 2905 easily, and with that the leaf may not issue a TIE with a higher 2906 sequence number that can reach the grandparent for a long time. 2907 Flooding procedures are extended to deal with the problem by 2908 means of special clauses that override the database of a lower level 2909 with headers of newer TIEs seen in TIDEs coming from the north. 2911 4.2.4.
Reachability Computation 2913 A node has three possible sources of relevant information for 2914 reachability computation. A node knows the full topology south of it 2915 from the received North Node TIEs or alternately north of it from the 2916 South Node TIEs. A node has the set of prefixes with their 2917 associated distances and bandwidths from corresponding prefix TIEs. 2919 To compute prefix reachability, a node runs conceptually a northbound 2920 and a southbound SPF. We call these N-SPF and S-SPF, denoting the 2921 direction in which the computation front is progressing. 2923 Since neither computation can "loop", it is possible to compute non- 2924 equal-cost or even k-shortest paths [EPPSTEIN] and "saturate" the 2925 fabric to the extent desired but we use simple, familiar SPF 2926 algorithms and concepts here as examples due to their prevalence in 2927 today's routing. 2929 4.2.4.1. Northbound SPF 2931 N-SPF MUST use exclusively northbound and East-West adjacencies in 2932 the computing node's node North TIEs (since if the node is a leaf it 2933 may not have generated a node South TIE) when starting SPF. Observe 2934 that N-SPF is really just a one hop variety since Node South TIEs are 2935 not re-flooded southbound beyond a single level (or East-West) and 2936 with that the computation cannot progress beyond adjacent nodes. 2938 Once progressing, the computation uses the next higher level's node South 2939 TIEs to find corresponding adjacencies to verify backlink connectivity. 2940 Just as in case of IS-IS or OSPF, two unidirectional links MUST be 2941 associated together to confirm bidirectional connectivity. 2942 Particular care MUST be taken that the Node TIEs contain not only 2943 the correct system IDs but matching levels as well. 2945 A default route found when crossing an E-W link SHOULD be used if and only if 2947 1. the node itself does NOT have any northbound adjacencies AND 2949 2.
the adjacent node has one or more northbound adjacencies 2951 This rule forms a "one-hop default route split-horizon" and prevents 2952 looping over default routes while allowing for "one-hop protection" 2953 of nodes that lost all northbound adjacencies, except at Top-of-Fabric 2954 where the links are used exclusively to flood topology information in 2955 multi-plane designs. 2957 Other south prefixes found when crossing an E-W link MAY be used if and only if 2959 1. no north neighbors are advertising the same or a supersuming non- 2960 default prefix AND 2962 2. the node does not originate a non-default supersuming prefix 2963 itself. 2965 i.e. the E-W link can be used as a gateway of last resort for a 2966 specific prefix only. Using south prefixes across an E-W link can be 2967 beneficial, e.g. on automatic de-aggregation in pathological fabric 2968 partitioning scenarios. 2970 A detailed example can be found in Section 5.4. 2972 4.2.4.2. Southbound SPF 2974 S-SPF MUST use exclusively the southbound adjacencies in the node 2975 South TIEs, i.e. it progresses towards nodes at lower levels. Observe 2976 that E-W adjacencies are NEVER used in the computation. This 2977 enforces the requirement that a packet traversing in a southbound 2978 direction must never change its direction. 2980 S-SPF MUST use northbound adjacencies in node North TIEs to verify 2981 backlink connectivity by checking for the presence of the link besides the 2982 correct SystemID and level. 2984 4.2.4.3. East-West Forwarding Within a non-ToF Level 2986 Using south prefixes over horizontal links MAY occur if the N-SPF 2987 includes East-West adjacencies in the computation. It can protect 2988 against pathological fabric partitioning cases that leave only paths 2989 to destinations that would necessitate multiple changes of forwarding 2990 direction between north and south. 2992 4.2.4.4.
East-West Links Within ToF Level 2994 E-W ToF links behave in terms of flooding scopes defined in 2995 Section 4.2.3.4 like northbound links and MUST be used exclusively 2996 for control plane information flooding. Even though a ToF node could 2997 be tempted to use those links during southbound SPF and carry traffic 2998 over them, this MUST NOT be attempted since it may lead, e.g. in 2999 anycast cases, to routing loops. An implementation MAY try to resolve 3000 the looping problem by following on the ring strictly tie-broken 3001 shortest-paths only but the details are outside this specification. 3002 And even then, the problem of proper capacity provisioning of such 3003 links when they become traffic-bearing in case of failures is vexing. 3005 4.2.5. Automatic Disaggregation on Link & Node Failures 3007 4.2.5.1. Positive, Non-transitive Disaggregation 3009 Under normal circumstances, a node's South TIEs contain just the 3010 adjacencies and a default route. However, if a node detects that its 3011 default IP prefix covers one or more prefixes that are reachable 3012 through it but not through one or more other nodes at the same level, 3013 then it MUST explicitly advertise those prefixes in a South TIE. 3014 Otherwise, some percentage of the northbound traffic for those 3015 prefixes would be sent to nodes without corresponding reachability, 3016 causing it to be black-holed. Even when not black-holing, the 3017 resulting forwarding could 'backhaul' packets through the higher 3018 level spines, clearly an undesirable condition affecting the blocking 3019 probabilities of the fabric. 3021 We refer to the process of advertising additional prefixes southbound 3022 as 'positive de-aggregation' or 'positive dis-aggregation'. Such 3023 dis-aggregation is non-transitive, i.e. its effects are always 3024 contained to a single level of the fabric only.
Naturally, multiple 3025 node or link failures can lead to several independent instances of 3026 positive dis-aggregation necessary to prevent looping or bow-tying 3027 the fabric. 3029 A node determines the set of prefixes needing de-aggregation using 3030 the following steps: 3032 1. A DAG computation in the southern direction is performed first, 3033 i.e. the North TIEs are used to find all prefixes it can reach 3034 and the set of next-hops in the lower level for each of them. 3035 Such a computation can be easily performed on a fat tree by e.g. 3036 setting all link costs in the southern direction to 1 and all 3037 northern directions to infinity. We term the set of those 3038 prefixes |R, and for each prefix, r, in |R, we define its set of 3039 next-hops to be |H(r). 3041 2. The node uses reflected South TIEs to find all nodes at the same 3042 level in the same PoD and the set of southbound adjacencies for 3043 each. The set of nodes at the same level is termed |N and for 3044 each node, n, in |N, we define its set of southbound adjacencies 3045 to be |A(n). 3047 3. For a given r, if the intersection of |H(r) and |A(n), for any n, 3048 is null then that prefix r must be explicitly advertised by the 3049 node in a South TIE. 3051 4. An identical set of de-aggregated prefixes is flooded on each of the 3052 node's southbound adjacencies. In accordance with the normal 3053 flooding rules for a South TIE, a node at the lower level that 3054 receives this South TIE SHOULD NOT propagate it south-bound or 3055 reflect the disaggregated prefixes back over its adjacencies to 3056 nodes at the level from which it was received. 3058 To summarize the above in simplest terms: if a node detects that its 3059 default route encompasses prefixes for which one of the other nodes 3060 in its level has no possible next-hops in the level below, it has to 3061 disaggregate them to prevent black-holing or suboptimal routing through 3062 such nodes.
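The per-prefix check in steps 1-3 above can be sketched as follows. This is only an illustrative sketch, not part of the specification; the dictionaries `H` and `A` stand in for the sets |H(r) and |A(n) and all names are hypothetical:

```python
def prefixes_to_disaggregate(H, A):
    """Sketch of positive disaggregation detection (steps 1-3).

    H: maps each reachable prefix r to its set of southbound
       next-hop nodes |H(r).
    A: maps each same-level node n to its set of southbound
       adjacencies |A(n).
    Returns the prefixes this node must advertise in a South TIE.
    """
    result = set()
    for r, next_hops in H.items():
        # If any same-level node has no southbound adjacency that can
        # reach r (empty intersection), traffic sent via that node
        # would be black-holed, so r must be disaggregated.
        if any(not (next_hops & adjacencies) for adjacencies in A.values()):
            result.add(r)
    return result
```

For example, if prefix `2001:db8::/48` is reachable only via `leaf1` but a same-level node's only southbound adjacency is `leaf2`, that prefix gets disaggregated while a prefix reachable via both leaves does not.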
Hence a node X needs to determine if it can reach a 3063 different set of south neighbors than other nodes at the same level, 3064 which are connected to it via at least one common south neighbor. If 3065 it can, then prefix disaggregation may be required. If it can't, 3066 then no prefix disaggregation is needed. An example of 3067 disaggregation is provided in Section 5.3. 3069 A possible algorithm is described last: 3071 1. Create partial_neighbors = (empty), a set of neighbors with 3072 partial connectivity to the node X's level from X's perspective. 3073 Each entry in the set is a south neighbor of X and a list of 3074 nodes of X.level that can't reach that neighbor. 3076 2. A node X determines its set of southbound neighbors 3077 X.south_neighbors. 3079 3. For each South TIE X holds that originated from a node Y 3080 at X.level, if Y.south_neighbors is not the same as 3081 X.south_neighbors but the nodes share at least one southern 3082 neighbor, for each neighbor N in X.south_neighbors but not in 3083 Y.south_neighbors, add (N, (Y)) to partial_neighbors if N isn't 3084 there or add Y to the list for N. 3086 4. If partial_neighbors is empty, then node X does not disaggregate 3087 any prefixes. If node X is advertising disaggregated prefixes in 3088 its South TIE, X SHOULD remove them and re-advertise its 3089 corresponding South TIEs. 3091 A node X computes reachability to all nodes below it based upon the 3092 received North TIEs first. This results in a set of routes, each 3093 categorized by (prefix, path_distance, next-hop set). Alternately, 3094 for clarity in the following procedure, these can be organized by 3095 next-hop set as ( (next-hops), {(prefix, path_distance)}). If 3096 partial_neighbors isn't empty, then the following procedure describes 3097 how to identify prefixes to disaggregate.
3099 disaggregated_prefixes = { empty } 3100 nodes_same_level = { empty } 3101 for each South TIE 3102 if (South TIE.level == X.level and 3103 South TIE.originator shares at least one S-neighbor with X) 3104 add South TIE.originator to nodes_same_level 3105 end if 3106 end for 3108 for each next-hop-set NHS 3109 isolated_nodes = nodes_same_level 3110 for each NH in NHS 3111 if NH in partial_neighbors 3112 isolated_nodes = 3113 intersection(isolated_nodes, 3114 partial_neighbors[NH].nodes) 3115 end if 3116 end for 3118 if isolated_nodes is not empty 3119 for each prefix using NHS 3120 add (prefix, distance) to disaggregated_prefixes 3121 end for 3122 end if 3123 end for 3125 copy disaggregated_prefixes to X's South TIE 3126 if X's South TIE is different 3127 schedule South TIE for flooding 3128 end if 3130 Figure 15: Computation of Disaggregated Prefixes 3132 Each disaggregated prefix is sent with the corresponding path_distance. 3133 This allows a node to send the same South TIE to each south neighbor. 3134 The south neighbor which is connected to that prefix will thus have a 3135 shorter path. 3137 Finally, to summarize the less obvious points partially omitted in 3138 the algorithms to keep them more tractable: 3140 1. all neighbor relationships MUST perform backlink checks. 3142 2. overload bits as introduced in Section 4.3.1 have to be respected 3143 during the computation. 3145 3. all the lower level nodes are flooded the same disaggregated 3146 prefixes since we don't want to build a South TIE per node and 3147 complicate things unnecessarily. The lower level node that can 3148 compute a southbound route to the prefix will prefer it to the 3149 disaggregated route anyway based on route preference rules. 3151 4. positively disaggregated prefixes do NOT have to propagate to 3152 lower levels. With that the disturbance in terms of new flooding 3153 is contained to a single level experiencing failures. 3155 5.
disaggregated prefix South TIEs are not "reflected" by the lower 3156 level, i.e. nodes within same level do NOT need to be aware 3157 which node computed the need for disaggregation. 3159 6. The fabric still supports maximum load balancing properties 3160 while not sending traffic northbound unless necessary. 3162 When positive disaggregation is triggered, the nodes may, due to the very 3163 stable but un-synchronized nature of the algorithm, 3164 issue the necessary disaggregated prefixes at different points in 3165 time. This can lead for a short time to an "incast" behavior where 3166 the first advertising router, by the nature of longest prefix 3167 match, will attract all the traffic. An implementation MAY hence 3168 choose different strategies to address this behavior if needed. 3170 To close this section it is worth observing that in a single plane 3171 ToF this disaggregation prevents blackholing up to (K_LEAF * P) link 3172 failures in terms of Section 4.1.2 or, in other terms, it takes at 3173 minimum that many link failures to partition the ToF into multiple 3174 planes. 3176 4.2.5.2. Negative, Transitive Disaggregation for Fallen Leaves 3178 As explained in Section 4.1.3, failures in multi-plane Top-of-Fabric 3179 or more than (K_LEAF * P) links failing in single plane design can 3180 generate fallen leaves. Such a scenario cannot be addressed by 3181 positive disaggregation only and needs a further mechanism. 3183 4.2.5.2.1. Cabling of Multiple Top-of-Fabric Planes 3185 Let us return in this section to designs with multiple planes as 3186 shown in Figure 3. Figure 16 highlights how the ToF is cabled in 3187 case of two planes by means of dual-rings to distribute all the 3188 North TIEs within both planes.
For people familiar with traditional 3189 link-state routing protocols, the ToF level can be considered equivalent 3190 to area 0 in OSPF or level-2 in ISIS, which need to be "connected" as 3191 well for the protocol to operate correctly. 3193 . ++==========++ ++==========++ 3194 . II II II II 3195 .+----++--+ +----++--+ +----++--+ +----++--+ 3196 .|ToF A1| |ToF B1| |ToF B2| |ToF A2| 3197 .++-+-++--+ ++-+-++--+ ++-+-++--+ ++-+-++--+ 3198 . | | II | | II | | II | | II 3199 . | | ++==========++ | | ++==========++ 3200 . | | | | | | | | 3201 . 3202 . ~~~ Highlighted ToF of the previous multi-plane figure ~~ 3204 Figure 16: Topologically Connected Planes 3206 As described in Section 4.1.3, failures in multi-plane fabrics can 3207 lead to blackholes which normal positive disaggregation cannot fix. 3208 The mechanism of negative, transitive disaggregation incorporated in 3209 RIFT provides the corresponding solution. 3211 4.2.5.2.2. Transitive Advertisement of Negative Disaggregates 3213 A ToF node that discovers that it cannot reach a fallen leaf 3214 disaggregates all the prefixes of such leaves. It uses for that 3215 purpose negative prefix South TIEs that are, as usual, flooded 3216 southwards with the scope defined in Section 4.2.3.4. 3218 Transitively, a node explicitly loses connectivity to a prefix when 3219 none of its children advertises it and when the prefix is negatively 3220 disaggregated by all of its parents. When that happens, the node 3221 originates the negative prefix further down south. Since the 3222 mechanism applies recursively south, the negative prefix may propagate 3223 transitively all the way down to the leaf. This is necessary since 3224 leaves connected to multiple planes by means of disjoint paths may 3225 have to choose the correct plane already at the very bottom of the 3226 fabric to make sure that they don't send traffic towards another leaf 3227 using a plane where it is "fallen", at which point a blackhole is 3228 unavoidable.
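The transitive trigger just described can be sketched as a small predicate. This is a hypothetical illustration only (names and data shapes are not from the schema): a node propagates a negative prefix south exactly when no child advertises it positively and every parent disaggregated it negatively:

```python
def must_propagate_negative(prefix, children_prefixes, parent_negatives):
    """Sketch of the transitive negative disaggregation trigger.

    children_prefixes: child node -> set of prefixes it advertises
                       positively northbound.
    parent_negatives:  parent node -> set of prefixes it negatively
                       disaggregated southbound.
    """
    # No child provides positive reachability for the prefix ...
    no_child_has_it = all(
        prefix not in advertised for advertised in children_prefixes.values())
    # ... and every parent negatively disaggregated it.
    all_parents_negative = bool(parent_negatives) and all(
        prefix in negatives for negatives in parent_negatives.values())
    return no_child_has_it and all_parents_negative
```

With two parents both advertising negative `2001:db8::/32` and no child advertising it, the node originates the negative prefix further south; as soon as one parent withdraws its negative advertisement, the predicate flips and the node withdraws its own.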
3230 When the connectivity is restored, a node that disaggregated a prefix 3231 withdraws the negative disaggregation by the usual mechanism of re- 3232 advertising TIEs omitting the negative prefix. 3234 4.2.5.2.3. Computation of Negative Disaggregates 3236 The document has so far omitted the description of the computation 3237 necessary to generate the correct set of negative prefixes. Negative 3238 prefixes can in fact be advertised due to two different triggers. We 3239 describe them consecutively. 3241 The first origination reason is a computation that uses all the node 3242 North TIEs to build the set of all reachable nodes by reachability 3243 computation over the complete graph, including ToF links. The 3244 computation uses the node itself as root. This is compared with the 3245 result of the normal southbound SPF as described in Section 4.2.4.2. 3246 The difference is the set of fallen leaves; all their attached prefixes 3247 are advertised as negative prefixes southbound if the node does not 3248 see the prefix as reachable within the southbound SPF. 3250 The second mechanism hinges on understanding how the negative 3251 prefixes are used within the computation as described in Figure 17. 3252 When attaching the negative prefixes, at a certain point in time a 3253 negative prefix may find itself with all the viable nodes from the 3254 shorter match nexthop being pruned. In other words, all its 3255 northbound neighbors provided a negative prefix advertisement. This 3256 is the trigger to advertise this negative prefix transitively south, 3257 normally caused by the node being in a plane where the prefix 3258 belongs to a fabric leaf that has "fallen" in this plane. Obviously, 3259 when one of the northbound switches withdraws its negative 3260 advertisement, the node has to withdraw its transitively provided 3261 negative prefix as well. 3263 4.2.6.
Attaching Prefixes 3265 After SPF is run, it is necessary to attach the resulting 3266 reachability information in the form of prefixes. For S-SPF, prefixes 3267 from a North TIE are attached to the originating node with that 3268 node's next-hop set and a distance equal to the prefix's cost plus 3269 the node's minimized path distance. The RIFT route database, a set 3270 of (prefix, prefix-type, attributes, path_distance, next-hop set), 3271 accumulates these results. 3273 In the case of N-SPF, prefixes from each South TIE need to also be added 3274 to the RIFT route database. The N-SPF is really just a stub so the 3275 computing node needs simply to determine, for each prefix in a South 3276 TIE that originated from an adjacent node, what next-hops to use to 3277 reach that node. Since there may be parallel links, the next-hops to 3278 use can be a set; presence of the computing node in the associated 3279 Node South TIE is sufficient to verify that at least one link has 3280 bidirectional connectivity. The set of minimum cost next-hops from 3281 the computing node X to the originating adjacent node is determined. 3283 Each prefix has its cost adjusted before being added into the RIFT 3284 route database. The cost of the prefix is set to the cost received 3285 plus the cost of the minimum distance next-hop to that neighbor while 3286 taking into account its attributes such as mobility per 3287 Section 4.3.3. Then each prefix can be added into the RIFT route 3288 database with the next-hop set; ties are broken based upon type first, 3289 then distance, and further on `PrefixAttributes`, and only the best 3290 combination is used for forwarding. RIFT route preferences are 3291 normalized by the corresponding Thrift [thrift] model type.
3293 An example implementation for node X follows: 3295 for each South TIE 3296 if South TIE.level > X.level 3297 next_hop_set = set of minimum cost links to the 3298 South TIE.originator 3299 next_hop_cost = minimum cost link to 3300 South TIE.originator 3301 end if 3302 for each prefix P in the South TIE 3303 P.cost = P.cost + next_hop_cost 3304 if P not in route_database: 3305 add (P, P.cost, P.type, 3306 P.attributes, next_hop_set) to route_database 3307 end if 3308 if (P in route_database): 3309 if route_database[P].cost > P.cost or 3310 route_database[P].type > P.type: 3311 update route_database[P] with (P, P.type, P.cost, 3312 P.attributes, 3313 next_hop_set) 3314 else if route_database[P].cost == P.cost and 3315 route_database[P].type == P.type: 3316 update route_database[P] with (P, P.type, 3317 P.cost, P.attributes, 3318 merge(next_hop_set, route_database[P].next_hop_set)) 3319 else 3320 // Not preferred route so ignore 3321 end if 3322 end if 3323 end for 3324 end for 3326 Figure 17: Adding Routes from South TIE Positive and Negative 3327 Prefixes 3329 After the positive prefixes are attached and tie-broken, negative 3330 prefixes are attached and used in the case of the northbound computation, 3331 ideally from the shortest length to the longest. The nexthop 3332 adjacencies for a negative prefix are inherited from the longest 3333 positive prefix that aggregates it, and subsequently adjacencies to 3334 nodes that advertised negative for this prefix are removed. 3336 The rule of inheritance MUST be maintained when the nexthop list for 3337 a prefix is modified, as the modification may affect the entries for 3338 matching negative prefixes of immediately longer prefix length. For 3339 instance, if a nexthop is added, then by inheritance it must be added 3340 to all the negative routes of immediately longer prefix length unless 3341 it is pruned due to a negative advertisement for the same next hop.
3342 Similarly, if a nexthop is deleted for a given prefix, then it is 3343 deleted for all the immediately aggregated negative routes. This 3344 will recurse in the case of nested negative prefix aggregations. 3346 The rule of inheritance must also be maintained when a new prefix of 3347 intermediate length is inserted, or when the immediately aggregating 3348 prefix is deleted from the routing table, making an even shorter 3349 aggregating prefix the one from which the negative routes now inherit 3350 their adjacencies. As the aggregating prefix changes, all the 3351 negative routes must be recomputed, and then again the process may 3352 recurse in case of nested negative prefix aggregations. 3354 Although these operations can be computationally expensive, the 3355 overall load on devices in the network is low because these 3356 computations are not run very often, as positive route advertisements 3357 are always preferred over negative ones. This prevents recursion in 3358 most cases because positive reachability information never inherits 3359 next hops. 3361 To make the negative disaggregation less abstract and provide an 3362 example let us consider a ToP node T1 with 4 ToF parents S1..S4 as 3363 represented in Figure 18: 3365 +----+ +----+ +----+ +----+ N 3366 | S1 | | S2 | | S3 | | S4 | ^ 3367 +----+ +----+ +----+ +----+ W< + >E 3368 | | | | v 3369 |+--------+ | | S 3370 ||+-----------------+ | 3371 |||+----------------+---------+ 3372 |||| 3373 +----+ 3374 | T1 | 3375 +----+ 3377 Figure 18: A ToP Node with 4 Parents 3379 If all ToF nodes can reach all the prefixes in the network, then with 3380 RIFT they will normally advertise a default route south.
An 3381 abstract Routing Information Base (RIB), more commonly known as a 3382 routing table, stores all types of maintained routes including the 3383 negative ones and "tie-breaks" for the best one, whereas an abstract 3384 Forwarding table (FIB) retains only the ultimately computed 3385 "positive" routing instructions. In T1, those tables would look as 3386 illustrated in Figure 19: 3388 +---------+ 3389 | Default | 3390 +---------+ 3391 | 3392 | +--------+ 3393 +---> | Via S1 | 3394 | +--------+ 3395 | 3396 | +--------+ 3397 +---> | Via S2 | 3398 | +--------+ 3399 | 3400 | +--------+ 3401 +---> | Via S3 | 3402 | +--------+ 3403 | 3404 | +--------+ 3405 +---> | Via S4 | 3406 +--------+ 3408 Figure 19: Abstract RIB 3410 In case T1 receives a negative advertisement for prefix 2001:db8::/32 3411 from S1, a negative route is stored in the RIB (indicated by a ~ 3412 sign), while the more specific routes to the complementing ToF nodes 3413 are installed in the FIB. RIB and FIB in T1 now look as illustrated in 3414 Figure 20 and Figure 21, respectively: 3416 +---------+ +-----------------+ 3417 | Default | <-------------- | ~2001:db8::/32 | 3418 +---------+ +-----------------+ 3419 | | 3420 | +--------+ | +--------+ 3421 +---> | Via S1 | +---> | Via S1 | 3422 | +--------+ +--------+ 3423 | 3424 | +--------+ 3425 +---> | Via S2 | 3426 | +--------+ 3427 | 3428 | +--------+ 3429 +---> | Via S3 | 3430 | +--------+ 3431 | 3432 | +--------+ 3433 +---> | Via S4 | 3434 +--------+ 3436 Figure 20: Abstract RIB after Negative 2001:db8::/32 from S1 3438 The negative 2001:db8::/32 prefix entry inherits from ::/0, so the 3439 positive more specific routes are the complements to S1 in the set of 3440 next-hops for the default route. That entry is composed of S2, S3, 3441 and S4, or, in other words, it uses all entries of the default route 3442 with a "hole punched" for S1 into them.
These are the next hops that 3443 are still available to reach 2001:db8::/32, now that S1 advertised 3444 that it will not forward 2001:db8::/32 anymore. Ultimately, those 3445 resulting next-hops are installed in the FIB for the more specific route 3446 to 2001:db8::/32 as illustrated below: 3448 +---------+ +---------------+ 3449 | Default | | 2001:db8::/32 | 3450 +---------+ +---------------+ 3451 | | 3452 | +--------+ | 3453 +---> | Via S1 | | 3454 | +--------+ | 3455 | | 3456 | +--------+ | +--------+ 3457 +---> | Via S2 | +---> | Via S2 | 3458 | +--------+ | +--------+ 3459 | | 3460 | +--------+ | +--------+ 3461 +---> | Via S3 | +---> | Via S3 | 3462 | +--------+ | +--------+ 3463 | | 3464 | +--------+ | +--------+ 3465 +---> | Via S4 | +---> | Via S4 | 3466 +--------+ +--------+ 3468 Figure 21: Abstract FIB after Negative 2001:db8::/32 from S1 3470 To illustrate matters further let us consider T1 receiving a negative 3471 advertisement for prefix 2001:db8:1::/48 from S2, which is stored in 3472 the RIB again. After the update, the RIB in T1 is illustrated in 3473 Figure 22: 3475 +---------+ +----------------+ +------------------+ 3476 | Default | <----- | ~2001:db8::/32 | <------ | ~2001:db8:1::/48 | 3477 +---------+ +----------------+ +------------------+ 3478 | | | 3479 | +--------+ | +--------+ | 3480 +---> | Via S1 | +---> | Via S1 | | 3481 | +--------+ +--------+ | 3482 | | 3483 | +--------+ | +--------+ 3484 +---> | Via S2 | +---> | Via S2 | 3485 | +--------+ +--------+ 3486 | 3487 | +--------+ 3488 +---> | Via S3 | 3489 | +--------+ 3490 | 3491 | +--------+ 3492 +---> | Via S4 | 3493 +--------+ 3495 Figure 22: Abstract RIB after Negative 2001:db8:1::/48 from S2 3497 Negative 2001:db8:1::/48 inherits from 2001:db8::/32 now, so the 3498 positive more specific routes are the complements to S2 in the set of 3499 next hops for 2001:db8::/32, which are S3 and S4, or, in other words, 3500 all entries of the parent with the negative holes "punched in" again.
3501 After the update, the FIB in T1 shows as illustrated in Figure 23: 3503 +---------+ +---------------+ +-----------------+ 3504 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 3505 +---------+ +---------------+ +-----------------+ 3506 | | | 3507 | +--------+ | | 3508 +---> | Via S1 | | | 3509 | +--------+ | | 3510 | | | 3511 | +--------+ | +--------+ | 3512 +---> | Via S2 | +---> | Via S2 | | 3513 | +--------+ | +--------+ | 3514 | | | 3515 | +--------+ | +--------+ | +--------+ 3516 +---> | Via S3 | +---> | Via S3 | +---> | Via S3 | 3517 | +--------+ | +--------+ | +--------+ 3518 | | | 3519 | +--------+ | +--------+ | +--------+ 3520 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 3521 +--------+ +--------+ +--------+ 3523 Figure 23: Abstract FIB after Negative 2001:db8:1::/48 from S2 3525 Further, let us say that S3 stops advertising its service as default 3526 gateway. The entry is removed from the RIB as usual. In order to update 3527 the FIB, it is necessary to eliminate the FIB entry for the default 3528 route, as well as all the FIB entries that were created for negative 3529 routes pointing to the RIB entry being removed (::/0). This is done 3530 recursively for 2001:db8::/32 and then for 2001:db8:1::/48. The 3531 related FIB entries via S3 are removed, as illustrated in Figure 24.
3533 +---------+ +---------------+ +-----------------+ 3534 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 3535 +---------+ +---------------+ +-----------------+ 3536 | | | 3537 | +--------+ | | 3538 +---> | Via S1 | | | 3539 | +--------+ | | 3540 | | | 3541 | +--------+ | +--------+ | 3542 +---> | Via S2 | +---> | Via S2 | | 3543 | +--------+ | +--------+ | 3544 | | | 3545 | | | 3546 | | | 3547 | | | 3548 | | | 3549 | +--------+ | +--------+ | +--------+ 3550 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 3551 +--------+ +--------+ +--------+ 3553 Figure 24: Abstract FIB after Loss of S3 3555 Say that at that time, S4 would also disaggregate prefix 3556 2001:db8:1::/48. This would mean that the FIB entry for 3557 2001:db8:1::/48 becomes a discard route, and that would be the signal 3558 for T1 to disaggregate prefix 2001:db8:1::/48 negatively in a 3559 transitive fashion with its own children. 3561 Finally, let us look at the case where S3 becomes available again as 3562 a default gateway, and a negative advertisement is received from S4 3563 about prefix 2001:db8:2::/48 as opposed to 2001:db8:1::/48. Again, a 3564 negative route is stored in the RIB, and the more specific routes to 3565 the complementing ToF nodes are installed in the FIB. Since 3566 2001:db8:2::/48 inherits from 2001:db8::/32, the positive FIB routes 3567 are chosen by removing S4 from S2, S3, S4.
The abstract FIB in T1 3568 now shows as illustrated in Figure 25: 3570 +-----------------+ 3571 | 2001:db8:2::/48 | 3572 +-----------------+ 3573 | 3574 +---------+ +---------------+ +-----------------+ 3575 | Default | | 2001:db8::/32 | | 2001:db8:1::/48 | 3576 +---------+ +---------------+ +-----------------+ 3577 | | | | 3578 | +--------+ | | | +--------+ 3579 +---> | Via S1 | | | +---> | Via S2 | 3580 | +--------+ | | | +--------+ 3581 | | | | 3582 | +--------+ | +--------+ | | +--------+ 3583 +---> | Via S2 | +---> | Via S2 | | +---> | Via S3 | 3584 | +--------+ | +--------+ | +--------+ 3585 | | | 3586 | +--------+ | +--------+ | +--------+ 3587 +---> | Via S3 | +---> | Via S3 | +---> | Via S3 | 3588 | +--------+ | +--------+ | +--------+ 3589 | | | 3590 | +--------+ | +--------+ | +--------+ 3591 +---> | Via S4 | +---> | Via S4 | +---> | Via S4 | 3592 +--------+ +--------+ +--------+ 3594 Figure 25: Abstract FIB after Negative 2001:db8:2::/48 from S4 3596 4.2.7. Optional Zero Touch Provisioning (ZTP) 3598 Each RIFT node can operate in zero touch provisioning (ZTP) mode, 3599 i.e. it has no configuration (unless it is a Top-of-Fabric at the top 3600 of the topology, or it must operate in the topology as a leaf and/or 3601 support leaf-2-leaf procedures) and it will fully configure itself 3602 after being attached to the topology. Configured nodes and nodes 3603 operating in ZTP can be mixed and will form a valid topology if 3604 achievable. 3606 The derivation of the level of each node happens based on offers 3607 received from its neighbors whereas each node (with the possible 3608 exception of configured leaves) tries to attach at the highest 3609 possible point in the fabric. This guarantees that even if the 3610 diffusion front reaches a node from "below" faster than from "above", 3611 it will greedily abandon an already negotiated level derived from nodes 3612 topologically below it and properly peer with nodes above.
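As a rough illustration of this greedy attachment (a sketch only; the precise derivation rules, including which offers are valid, are given in the following sections), a node can be thought of as re-deriving its level as one below the highest level currently offered by any neighbor, so a later, higher offer simply displaces a level previously derived from nodes below:

```python
def derive_level(offered_levels):
    """Hypothetical helper: derive a node's level from neighbor offers.

    Offers of level 0 cannot seed derivation; with no usable offer the
    level stays undefined (None). Re-running this whenever offers
    change makes the node 'greedily' attach as high as possible.
    """
    usable = [level for level in offered_levels if level > 0]
    return max(usable) - 1 if usable else None
```

For instance, a node that first derived level 1 from an offer of 2 would, upon receiving an offer of 24 from above, abandon the old value and re-derive level 23.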
The fabric is very consciously numbered from the top to allow for PoDs of different heights and to minimize the amount of provisioning necessary, in this case just a TOP_OF_FABRIC flag on every node at the top of the fabric.

This section describes the necessary concepts and procedures for ZTP operation.

4.2.7.1. Terminology

The interdependencies between the different flags and the configured level can be somewhat vexing at first and it may take multiple reads of the glossary to comprehend them.

Automatic Level Derivation: Procedures which allow nodes without a configured level to derive it automatically. Only applied if CONFIGURED_LEVEL is undefined.

UNDEFINED_LEVEL: A "null" value that indicates that the level has not been determined and has not been configured. Schemas normally indicate that by a missing optional value without an available defined default.

LEAF_ONLY: An optional configuration flag that can be configured on a node to make sure it never leaves the "bottom of the hierarchy". The TOP_OF_FABRIC flag and CONFIGURED_LEVEL cannot be defined at the same time as this flag. It implies a CONFIGURED_LEVEL value of 0.

TOP_OF_FABRIC flag: Configuration flag that MUST be provided to all Top-of-Fabric nodes. LEAF_ONLY and CONFIGURED_LEVEL cannot be defined at the same time as this flag. It implies a CONFIGURED_LEVEL value. In fact, it is basically a shortcut for configuring the same level at all Top-of-Fabric nodes, which is unavoidable since an initial 'seed' is needed for other ZTP nodes to derive their level in the topology. The flag plays an important role in fabrics with multiple planes to enable successful negative disaggregation (Section 4.2.5.2).

CONFIGURED_LEVEL: A level value provided manually. When this is defined (i.e. it is not an UNDEFINED_LEVEL) the node is not participating in ZTP.
TOP_OF_FABRIC flag is ignored when this 3655 value is defined. LEAF_ONLY can be set only if this value is 3656 undefined or set to 0. 3658 DERIVED_LEVEL: Level value computed via automatic level derivation 3659 when CONFIGURED_LEVEL is equal to UNDEFINED_LEVEL. 3661 LEAF_2_LEAF: An optional flag that can be configured on a node to 3662 make sure it supports procedures defined in Section 4.3.8. In a 3663 strict sense it is a capability that implies LEAF_ONLY and the 3664 according restrictions. TOP_OF_FABRIC flag is ignored when set at 3665 the same time as this flag. 3667 LEVEL_VALUE: In ZTP case the original definition of "level" in 3668 Section 3.1 is both extended and relaxed. First, level is defined 3669 now as LEVEL_VALUE and is the first defined value of 3670 CONFIGURED_LEVEL followed by DERIVED_LEVEL. Second, it is 3671 possible for nodes to be more than one level apart to form 3672 adjacencies if any of the nodes is at least LEAF_ONLY. 3674 Valid Offered Level (VOL): A neighbor's level received on a valid 3675 LIE (i.e. passing all checks for adjacency formation while 3676 disregarding all clauses involving level values) persisting for 3677 the duration of the holdtime interval on the LIE. Observe that 3678 offers from nodes offering level value of 0 do not constitute VOLs 3679 (since no valid DERIVED_LEVEL can be obtained from those and 3680 consequently `not_a_ztp_offer` MUST be ignored). Offers from LIEs 3681 with `not_a_ztp_offer` being true are not VOLs either. If a node 3682 maintains parallel adjacencies to the neighbor, VOL on each 3683 adjacency is considered as equivalent, i.e. the newest VOL from 3684 any such adjacency updates the VOL received from the same node. 3686 Highest Available Level (HAL): Highest defined level value seen from 3687 all VOLs received. 3689 Highest Available Level Systems (HALS): Set of nodes offering HAL 3690 VOLs. 
Highest Adjacency Three Way (HAT): Highest neighbor level of all the formed three-way adjacencies for the node.

4.2.7.2. Automatic SystemID Selection

RIFT nodes require a 64 bit SystemID which SHOULD be derived in EUI-64 MA-L format according to [EUI64]. The organizationally governed portion of this ID (24 bits) can be used to generate multiple IDs if required to indicate more than one RIFT instance.

As a matter of operational concern, the router MUST ensure that such an identifier does not change frequently (or at least not without sending all its TIEs with fairly short lifetimes) since otherwise the network may be left with large amounts of stale TIEs in other nodes (though this is not necessarily a serious problem if the procedures described in Section 7 are implemented).

4.2.7.3. Generic Fabric Example

ZTP forces us to think about miscabled or unusually cabled fabrics and how such a topology can be forced into the "lattice" structure which a fabric represents (with further restrictions). Let us consider a necessary and sufficient physical cabling in Figure 26. We assume all nodes are in the same PoD.

. +---+
. | A | s = TOP_OF_FABRIC
. | s | l = LEAF_ONLY
. ++-++ l2l = LEAF_2_LEAF
. | |
. +--+ +--+
. | |
. +--++ ++--+
. | E | | F |
. | +-+ | +-----------+
. ++--+ | ++-++ |
. | | | | |
. | +-------+ | |
. | | | | |
. | | +----+ | |
. | | | | |
. ++-++ ++-++ |
. | I +-----+ J | |
. | | | +-+ |
. ++-++ +--++ | |
. | | | | |
. +---------+ | +------+ |
. | | | | |
. +-----------------+ | |
. | | | | |
. ++-++ ++-++ |
. | X +-----+ Y +-+
. |l2l| | l |
. +---+ +---+

Figure 26: Generic ZTP Cabling Considerations

First, we must anchor the "top" of the cabling and that is what the TOP_OF_FABRIC flag at node A is for.
Then things look smooth until we have to decide whether node Y is at the same level as I and J (and, as a consequence, X is south of it) or at the same level as X. This is unresolvable here until we "nail down the bottom" of the topology. To achieve that, this example uses the leaf flags on X and Y. If Y did not carry a leaf flag, it would elect the highest level offered and end up at the same level as I and J.

4.2.7.4. Level Determination Procedure

A node starting up with UNDEFINED_VALUE (i.e. without a CONFIGURED_LEVEL or any leaf or TOP_OF_FABRIC flag) MUST follow these additional procedures:

1. It advertises its LEVEL_VALUE on all LIEs (observe that this can be UNDEFINED_LEVEL which in terms of the schema is simply an omitted optional value).

2. It computes HAL as the numerically highest available level in all VOLs.

3. It then chooses MAX(HAL-1, 0) as its DERIVED_LEVEL. The node then starts to advertise this derived level.

4. A node that lost all adjacencies with a HAL value MUST hold down computation of a new DERIVED_LEVEL for a short period of time unless it has no VOLs from southbound adjacencies. After the holddown expires, it MUST discard all received offers, recompute DERIVED_LEVEL and announce it to all neighbors.

5. A node MUST reset any adjacency that has changed the level it is offering and is in three-way state.

6. A node that changed its defined level value MUST readvertise its own TIEs (since the new `PacketHeader` will contain a different level than before). The sequence number of each TIE MUST be increased.

7. After a level has been derived the node MUST set `not_a_ztp_offer` on LIEs towards all systems offering a VOL for HAL.

8.
A node that changed its level SHOULD flush the TIEs of all other nodes from its link state database, otherwise stale information may persist on "direction reversal", i.e. when nodes that seemed south are now north or east-west. This will not prevent the correct operation of the protocol but could be slightly confusing operationally.

A node starting with LEVEL_VALUE being 0 (i.e. it assumes a leaf function by being configured with the appropriate flags or has a CONFIGURED_LEVEL of 0) MUST follow these additional procedures:

1. It computes HAT per the procedures above but does NOT use it to compute DERIVED_LEVEL. HAT is used to limit adjacency formation per Section 4.2.2.

It MAY also follow modified procedures:

1. It may pick a different strategy to choose a VOL, e.g. use the VOL value offered by the highest number of neighbors. Such strategies are only possible because the node always remains "at the bottom of the fabric", while a node at another layer could "invert" the fabric by picking its preferred VOL in a different fashion than always trying to achieve the highest viable level.

4.2.7.5. ZTP FSM

This section specifies the precise, normative ZTP FSM and can be omitted unless the reader is pursuing an implementation of the protocol.

The initial state is ComputeBestOffer.
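Before the normative FSM, the level derivation of Section 4.2.7.4 can be sketched in a few lines. This is a non-normative illustration; the function names and the `vols` mapping (neighbor SystemID to offered level, with level-0 offers and `not_a_ztp_offer` LIEs already filtered out, since those do not constitute VOLs) are assumptions of this sketch, not part of the specification:

```python
from typing import Optional

def derive_level(vols: dict) -> Optional[int]:
    """Derive a node's level from its Valid Offered Levels (VOLs).

    `vols` maps neighbor SystemID -> offered level. Offers of level 0
    and offers with `not_a_ztp_offer` set are assumed to have been
    filtered out already, as they do not constitute VOLs.
    """
    if not vols:
        return None  # UNDEFINED_LEVEL: no VOLs to derive from
    hal = max(vols.values())   # HAL: highest level among all VOLs
    return max(hal - 1, 0)     # DERIVED_LEVEL = MAX(HAL-1, 0)

def level_value(configured: Optional[int],
                derived: Optional[int]) -> Optional[int]:
    """LEVEL_VALUE: first defined of CONFIGURED_LEVEL, DERIVED_LEVEL."""
    return configured if configured is not None else derived
```

For example, a node seeing a single VOL of 24 (a Top-of-Fabric neighbor) derives level 23, while a configured level always wins over any derived one.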
3825 Enter 3826 | 3827 v 3828 +------------------+ 3829 | ComputeBestOffer | 3830 | |<----+ 3831 | Entry: | | BetterHAL [LEVEL_COMPUTE] 3832 | [LEVEL_COMPUTE] | | BetterHAT [LEVEL_COMPUTE] 3833 | | | ChangeLocalConfiguredLevel [StoreConfigLevel, 3834 | | | LEVEL_COMPUTE] 3835 | | | ChangeLocalHierarchyIndications 3836 | | | [StoreLeafFlags, 3837 | | | LEVEL_COMPUTE] 3838 | | | LostHAT [LEVEL_COMPUTE] 3839 | | | NeighborOffer [IF NoLevelOffered 3840 | | | THEN REMOVE_OFFER 3841 | | | ELSE IF OfferedLevel > Leaf 3842 | | | THEN UPDATE_OFFER 3843 | | | ELSE REMOVE_OFFER 3844 | | | ShortTic [RemoveExpiredOffers] 3845 | |-----+ 3846 | | 3847 | |<--------------------- 3848 | |---------------------> (UpdatingClients) 3849 | | ComputationDone [-] 3850 +------------------+ 3851 ^ | 3852 | | LostHAL [IF AnySouthBoundAdjacenciesPresent 3853 | | THEN UpdateHoldDownTimerToNormalValue 3854 | | ELSE FireHoldDownTimerImmediately] 3855 | V 3856 (HoldingDown) 3858 ZTP FSM FSM 3860 (ComputeBestOffer) 3861 | ^ 3862 | | ChangeLocalConfiguredLevel [StoreConfiguredLevel] 3863 | | ChangeLocalHierarchyIndications [StoreLeafFlags] 3864 | | HoldDownExpired [PURGE_OFFERS] 3865 V | 3866 +------------------+ 3867 | HoldingDown | 3868 | |<----+ 3869 | | | BetterHAL [-] 3870 | | | BetterHAT [-] 3871 | | | ComputationDone [-] 3872 | | | LostHAL [-] 3873 | | | LostHat [-] 3874 | | | NeighborOffer [IF NoLevelOffered 3875 | | | THEN REMOVE_OFFER 3876 | | | ELSE IF OfferedLevel > Leaf 3877 | | | THEN UPDATE_OFFER 3878 | | | ELSE REMOVE_OFFER 3879 | | | ShortTic [RemoveExpiredOffers, 3880 | | | IF HoldDownTimer expired 3881 | | | THEN PUSH HoldDownExpired] 3882 | |-----+ 3883 +------------------+ 3884 ^ 3885 | 3886 (UpdatingClients) 3888 ZTP FSM FSM (continued) 3890 (ComputeBestOffer) 3891 | ^ 3892 | | BetterHAL [-] 3893 | | BetterHAT [-] 3894 | | LostHAT [-] 3895 | | ChangeLocalHierarchyIndications [StoreLeafFlags] 3896 | | ChangeLocalConfiguredLevel [StoreConfigLevel] 3897 V | 3898 
+------------------+ 3899 | UpdatingClients | 3900 | |<----+ 3901 | Entry: | | 3902 | [UpdateAllLIE- | | NeighborOffer [IF NoLevelOffered 3903 | FSMsWith- | | THEN REMOVE_OFFER 3904 | Computation- | | ELSE IF OfferedLevel > Leaf 3905 | Results] | | THEN UPDATE_OFFER 3906 | | | ELSE REMOVE_OFFER 3907 | | | ShortTic [RemoveExpiredOffers] 3908 | |-----+ 3909 +------------------+ 3910 | 3911 | LostHAL [IF AnySouthBoundAdjacenciesPresent 3912 | THEN UpdateHoldDownTimerToNormalValue 3913 | ELSE FireHoldDownTimerImmediately] 3914 V 3915 (HoldingDown) 3917 ZTP FSM FSM (continued) 3919 Events 3921 o ChangeLocalHierarchyIndications: node locally configured with new 3922 leaf flags 3924 o ChangeLocalConfiguredLevel: node locally configured with a defined 3925 level 3927 o NeighborOffer: a new neighbor offer with optional level and 3928 neighbor state 3930 o BetterHAL: better HAL computed internally 3932 o BetterHAT: better HAT computed internally 3934 o LostHAL: lost last HAL in computation 3936 o LostHAT: lost HAT in computation 3937 o ComputationDone: computation performed 3939 o HoldDownExpired: holddown expired 3941 o ShortTic: one second timer tick, to be ignored if transition does 3942 not exist 3944 Actions 3946 on ShortTic in HoldingDown finishes in HoldingDown: remove expired 3947 offers and if holddown timer expired PUSH_EVENT HoldDownExpired 3949 on ShortTic in ComputeBestOffer finishes in ComputeBestOffer: 3950 remove expired offers 3952 on HoldDownExpired in HoldingDown finishes in ComputeBestOffer: 3953 PURGE_OFFERS 3955 on ChangeLocalConfiguredLevel in HoldingDown finishes in 3956 ComputeBestOffer: store configured level 3958 on ShortTic in UpdatingClients finishes in UpdatingClients: remove 3959 expired offers 3961 on BetterHAT in ComputeBestOffer finishes in ComputeBestOffer: 3962 LEVEL_COMPUTE 3964 on BetterHAL in HoldingDown finishes in HoldingDown: no action 3966 on ChangeLocalHierarchyIndications in HoldingDown finishes in 3967 ComputeBestOffer: store 
leaf flags 3969 on BetterHAT in UpdatingClients finishes in ComputeBestOffer: no 3970 action 3972 on BetterHAL in UpdatingClients finishes in ComputeBestOffer: no 3973 action 3975 on ChangeLocalHierarchyIndications in UpdatingClients finishes in 3976 ComputeBestOffer: store leaf flags 3978 on LostHAL in HoldingDown finishes in HoldingDown: 3980 on LostHAT in ComputeBestOffer finishes in ComputeBestOffer: 3981 LEVEL_COMPUTE 3983 on LostHAT in HoldingDown finishes in HoldingDown: no action 3984 on BetterHAT in HoldingDown finishes in HoldingDown: no action 3986 on NeighborOffer in UpdatingClients finishes in UpdatingClients: 3988 if no level offered then REMOVE_OFFER 3990 else 3992 if offered level > leaf then UPDATE_OFFER 3994 else REMOVE_OFFER 3996 on LostHAL in ComputeBestOffer finishes in HoldingDown: if any 3997 southbound adjacencies present then update holddown timer to 3998 normal duration else fire holddown timer immediately 4000 on LostHAL in UpdatingClients finishes in HoldingDown: if any 4001 southbound adjacencies present then update holddown timer to 4002 normal duration else fire holddown timer immediately 4004 on ComputationDone in ComputeBestOffer finishes in 4005 UpdatingClients: no action 4007 on LostHAT in UpdatingClients finishes in ComputeBestOffer: no 4008 action 4010 on ComputationDone in HoldingDown finishes in HoldingDown: 4012 on ChangeLocalConfiguredLevel in ComputeBestOffer finishes in 4013 ComputeBestOffer: store configured level and LEVEL_COMPUTE 4015 on ChangeLocalConfiguredLevel in UpdatingClients finishes in 4016 ComputeBestOffer: store configured level 4018 on NeighborOffer in ComputeBestOffer finishes in ComputeBestOffer: 4020 if no level offered then REMOVE_OFFER 4022 else 4024 if offered level > leaf then UPDATE_OFFER 4026 else REMOVE_OFFER 4028 on NeighborOffer in HoldingDown finishes in HoldingDown: 4030 if no level offered then REMOVE_OFFER 4031 else 4033 if offered level > leaf then UPDATE_OFFER 4035 else REMOVE_OFFER 4037 on 
ChangeLocalHierarchyIndications in ComputeBestOffer finishes in ComputeBestOffer: store leaf flags and LEVEL_COMPUTE

on BetterHAL in ComputeBestOffer finishes in ComputeBestOffer: LEVEL_COMPUTE

on Entry into UpdatingClients: update all LIE FSMs with computation results

on Entry into ComputeBestOffer: LEVEL_COMPUTE

The following words are used for well-known procedures:

1. PUSH Event: pushes an event to be executed by the FSM upon exit of this action

2. COMPARE_OFFERS: checks whether based on current offers and held last results the events BetterHAL/LostHAL/BetterHAT/LostHAT are necessary and returns them

3. UPDATE_OFFER: store the current offer with the adjacency holdtime as lifetime and COMPARE_OFFERS, then PUSH according events

4. LEVEL_COMPUTE: compute the best offered or configured level and HAL/HAT, if anything changed PUSH ComputationDone

5. REMOVE_OFFER: remove the according offer and COMPARE_OFFERS, PUSH according events

6. PURGE_OFFERS: REMOVE_OFFER for all held offers, COMPARE_OFFERS, PUSH according events

4.2.7.6. Resulting Topologies

The procedures defined in Section 4.2.7.4 will lead to the RIFT topology and levels depicted in Figure 27.

. +---+
. | As|
. | 24|
. ++-++
. | |
. +--+ +--+
. | |
. +--++ ++--+
. | E | | F |
. | 23+-+ | 23+-----------+
. ++--+ | ++-++ |
. | | | | |
. | +-------+ | |
. | | | | |
. | | +----+ | |
. | | | | |
. ++-++ ++-++ |
. | I +-----+ J | |
. | 22| | 22| |
. ++--+ +--++ |
. | | |
. +---------+ | |
. | | |
. ++-++ +---+ |
. | X | | Y +-+
. | 0 | | 0 |
. +---+ +---+

Figure 27: Generic ZTP Topology Autoconfigured

If the LEAF_ONLY restriction on Y were removed, however, the outcome would be very different and result in Figure 28.
This demonstrates that autoconfiguration makes miscabling detection hard and can therefore lead to undesirable effects in cases where leaves are not "nailed down" by the accordingly configured flags and are arbitrarily cabled.

A node MAY analyze the outstanding level offers on its interfaces and generate warnings when its internal ruleset flags a possible miscabling. As an example, when a node sees ZTP level offers that differ by more than one level from its chosen level (with proper accounting for leaves being at level 0), this can indicate miscabling.

. +---+
. | As|
. | 24|
. ++-++
. | |
. +--+ +--+
. | |
. +--++ ++--+
. | E | | F |
. | 23+-+ | 23+-------+
. ++--+ | ++-++ |
. | | | | |
. | +-------+ | |
. | | | | |
. | | +----+ | |
. | | | | |
. ++-++ ++-++ +-+-+
. | I +-----+ J +-----+ Y |
. | 22| | 22| | 22|
. ++-++ +--++ ++-++
. | | | | |
. | +-----------------+ |
. | | |
. +---------+ | |
. | | |
. ++-++ |
. | X +--------+
. | 0 |
. +---+

Figure 28: Generic ZTP Topology Autoconfigured

4.2.8. Stability Considerations

The autoconfiguration mechanism computes a global maximum of levels by diffusion. The achieved equilibrium can be disturbed massively by all nodes with the highest level either leaving or entering the domain (with some finer distinctions not explained further). It is therefore recommended that each node be multi-homed towards nodes with respective HAL offerings. Fortunately, this is the natural state of things for the topology variants considered in RIFT.

4.3. Further Mechanisms

4.3.1. Overload Bit

The overload bit MUST be respected by all necessary SPF computations.
4164 A node with the overload bit set SHOULD advertise all locally hosted 4165 prefixes both northbound and southbound, all other southbound 4166 prefixes SHOULD NOT be advertised. 4168 Leaf nodes SHOULD set the overload bit on all originated Node TIEs. 4169 If spine nodes were to forward traffic not intended for the local 4170 node, the leaf node would not be able to prevent routing/forwarding 4171 loops as it does not have the necessary topology information to do 4172 so. 4174 4.3.2. Optimized Route Computation on Leaves 4176 Leaf nodes only have visibility to directly connected nodes and 4177 therefore are not required to run "full" SPF computations. Instead, 4178 prefixes from neighboring nodes can be gathered to run a "partial" 4179 SPF computation in order to build the routing table. 4181 Leaf nodes SHOULD only hold their own N-TIEs, and in cases of L2L 4182 implementations, the N-TIEs of their East/West neighbors. Leaf nodes 4183 MUST hold all S-TIEs from their neighbors. 4185 Normally, a full network graph is created based on local N-TIEs and 4186 remote S-TIEs that it receives from neighbors, at which time, 4187 necessary SPF computations are performed. Instead, leaf nodes can 4188 simply compute the minimum cost and next-hop set of each leaf 4189 neighbor by examining its local adjacencies. Associated N-TIEs are 4190 used to determine bi-directionality and derive the next-hop set. 4191 Cost is then derived from the minimum cost of the local adjacency to 4192 the neighbor and the prefix cost. 4194 Leaf nodes would then attach necessary prefixes as described in 4195 Section 4.2.6. 4197 4.3.3. Mobility 4199 The RIFT control plane MUST maintain the real time status of every 4200 prefix, to which port it is attached, and to which leaf node that 4201 port belongs. This is still true in cases of IP mobility where the 4202 point of attachment may change several times a second. 
There are two classic approaches to explicitly maintain this information:

timestamp: With this method, the infrastructure SHOULD record the precise time at which the movement is observed. One key advantage of this technique is that it has no dependency on the mobile device. One drawback is that the infrastructure MUST be precisely synchronized in order to be able to compare timestamps as the points of attachment change. This could be accomplished by utilizing Precision Time Protocol (PTP) IEEE Std. 1588 [IEEEstd1588] or 802.1AS [IEEEstd8021AS] which is designed for bridged LANs. Both the precision of the synchronization protocol and the resolution of the timestamp must beat the highest possible roaming time on the fabric. Another drawback is that the presence of a mobile device may only be observed asynchronously, such as when it starts using an IP protocol like ARP [RFC0826], IPv6 Neighbor Discovery [RFC4861], IPv6 Stateless Address Configuration [RFC4862], DHCP [RFC2131], or DHCPv6 [RFC8415].

sequence counter: With this method, a mobile device notifies its point of attachment on arrival with a sequence counter that is incremented upon each movement. On the positive side, this method does not have a dependency on a precise sense of time, since the sequence of movements is kept in order by the mobile device. The disadvantage of this approach is that the protocols a mobile device may use to register its presence with the leaf node generally lack the capability to provide a sequence counter. Well-known issues with sequence counters such as wrapping and comparison rules MUST be addressed properly. Sequence numbers MUST be compared by a single homogenous source to make operation feasible; sequence number comparison from multiple heterogeneous sources would be extremely difficult to implement.
4237 RIFT supports a hybrid approach by using an optional 4238 'PrefixSequenceType' attribute (that we also call a 'monotonic 4239 clock') that consists of a timestamp and optional sequence number 4240 field. When this attribute is present (observe that per data schema 4241 the attribute itself is optional but in case it is included the 4242 'timestamp' field is required): 4244 o The leaf node MAY advertise a timestamp of the latest sighting of 4245 a prefix, e.g., by snooping IP protocols or the node using the 4246 time at which it advertised the prefix. RIFT transports the 4247 timestamp within the desired prefix North TIEs as 802.1AS 4248 timestamp. 4250 o RIFT MAY interoperate with "Registration Extensions for 6LoWPAN 4251 Neighbor Discovery" [RFC8505], which provides a method for 4252 registering a prefix with a sequence number called a Transaction 4253 ID (TID). In such cases, RIFT SHOULD transport the derived TID 4254 without modification. 4256 o RIFT also defines an abstract negative clock (ASNC) (also called 4257 an 'undefined' clock). ASNC MUST be considered older than any 4258 other defined clock. By default, when a node receives a prefix 4259 North TIE that does not contain a 'PrefixSequenceType' attribute, 4260 it MUST interpret the absence as ASNC. 4262 o Any prefix present on the fabric in multiple nodes that has the 4263 `same` clock is considered as anycast. 4265 o RIFT specification assumes that all nodes are being synchronized 4266 to at least 200 milliseconds of precision. This is achievable 4267 through the use of NTP [RFC5905]. An implementation MAY provide a 4268 way to reconfigure a domain to a different value, we call this 4269 variable MAXIMUM_CLOCK_DELTA. 4271 4.3.3.1. Clock Comparison 4273 All monotonic clock values MUST be compared to each other using the 4274 following rules: 4276 1. ASNC is older than any other value except ASNC AND 4278 2. 
Clocks with timestamps differing by more than MAXIMUM_CLOCK_DELTA are comparable by using the timestamps only AND

3. Clocks with timestamps differing by less than MAXIMUM_CLOCK_DELTA are comparable by using their TIDs only AND

4. An undefined TID is always older than any other TID AND

5. TIDs are compared using the rules of [RFC8505].

4.3.3.2. Interaction between Time Stamps and Sequence Counters

For attachment changes that occur less frequently (e.g. once per second), the timestamp that the RIFT infrastructure captures should be enough to determine the most current discovery. If the point of attachment changes faster than the maximum drift of the timestamping mechanism (i.e. MAXIMUM_CLOCK_DELTA), then a sequence number SHOULD be used to enable the necessary precision to determine currency.

The sequence counter in [RFC8505] is encoded as one octet and wraps around per the rules in Appendix A.

Within the resolution of MAXIMUM_CLOCK_DELTA, sequence counter values captured during two sequential iterations of the same timestamp SHOULD be comparable. This means that with default values, a node may move up to 127 times in a 200 millisecond period and the clocks will remain comparable. This allows the RIFT infrastructure to explicitly assert the most up-to-date advertisement.

4.3.3.3. Anycast vs. Unicast

A unicast prefix can be attached to at most one leaf, whereas an anycast prefix may be reachable via more than one leaf.

If a monotonic clock attribute is provided on the prefix, then the prefix with the `newest` clock value is strictly preferred. An anycast prefix either does not carry a clock or all its clock attributes MUST be the same under the rules of Section 4.3.3.1.

Observe that in mobility events it is important that the leaf re-floods the absence of the prefix that moved away as quickly as possible.
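The clock comparison rules of Section 4.3.3.1 can be sketched as follows. This is an illustrative approximation only: a clock is represented as `None` (ASNC) or as a `(timestamp, tid)` tuple, and the one-octet TID comparison is simplified serial-number arithmetic rather than the normative rules of [RFC8505] and Appendix A:

```python
MAXIMUM_CLOCK_DELTA = 0.2  # seconds; assumed default synchronization precision

def newer_clock(a, b):
    """Return the newer of two monotonic clocks.

    A clock is None (ASNC, older than anything defined) or a
    (timestamp, tid) tuple where tid may itself be None (undefined).
    """
    if a is None:          # ASNC is older than any defined clock
        return b
    if b is None:
        return a
    (ts_a, tid_a), (ts_b, tid_b) = a, b
    if abs(ts_a - ts_b) > MAXIMUM_CLOCK_DELTA:
        return a if ts_a > ts_b else b     # timestamps decide
    if tid_a is None:                       # undefined TID is always older
        return b
    if tid_b is None:
        return a
    # simplified one-octet serial-number comparison (wraps modulo 256)
    return a if tid_a != tid_b and (tid_a - tid_b) % 256 < 128 else b
```

Note how a TID of 1 correctly beats a TID of 255 within MAXIMUM_CLOCK_DELTA, since the counter is assumed to have wrapped.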
Observe further that without support for [RFC8505], movements on the fabric within intervals smaller than 100 msec will be seen as anycast.

4.3.3.4. Overlays and Signaling

RIFT is agnostic to any overlay technologies and their associated control and transport protocols that run on top of it (e.g. VXLAN). It is expected that leaf nodes and possibly Top-of-Fabric nodes can perform the necessary data plane encapsulation.

In the context of mobility, overlays provide another possible solution to avoid injecting mobile prefixes into the fabric as well as improving scalability of the deployment. It makes sense to consider overlays for mobility solutions in IP fabrics. As an example, a mobility protocol such as LISP may inform the ingress leaf of the location of the egress leaf in real time.

Another possibility is to consider mobility as an underlay service and support it in RIFT to an extent. The load on the fabric obviously grows with the amount of mobility, since a move forces flooding and computation on all nodes in the scope of the move, so tunneling from the leaf to the Top-of-Fabric may be desired to speed up convergence times.

4.3.4. Key/Value Store

4.3.4.1. Southbound

RIFT supports the southbound distribution of key-value pairs that can be used to distribute information to facilitate higher levels of functionality (e.g. distribution of configuration information). KV South TIEs may arrive from multiple nodes and therefore a node MUST execute the following tie-breaking rules for each key:

1. Only KV TIEs received from nodes to which a bi-directional adjacency exists MUST be considered.

2. For valid KV South TIEs that contain the same key, the value within the South TIE with the highest level will be preferred. If the levels are identical, the highest originating system ID will be preferred.
In the case of overlapping keys in the winning South TIE, the behavior is undefined.

Consider that if a node goes down, nodes south of it will lose the associated adjacencies, causing them to disregard the corresponding KVs. New KV South TIEs are advertised to prevent stale information being used by nodes that are farther south. KV advertisements southbound are not a result of independent computation by every node over the same set of South TIEs, but a diffused computation.

4.3.4.2. Northbound

Certain use cases necessitate distribution of essential KV information that is generated by the leaves in the northbound direction. Such information is flooded in KV North TIEs. Since the originator of the KV North TIEs is preserved during flooding, overlapping keys MAY be used. However, to avoid further protocol complexity, the same tie-breaking rules as used in southbound distribution SHOULD be used.

4.3.5. Interactions with BFD

RIFT MAY incorporate BFD [RFC5881] to react quickly to link failures. In such cases, the following procedures are introduced:

After RIFT three-way hello adjacency convergence, a BFD session MAY be formed automatically between the RIFT endpoints without further configuration, using the exchanged discriminators. The capability of the remote side to support BFD is carried in the LIEs.

In case an established BFD session goes Down after it was Up, the RIFT adjacency SHOULD be re-initialized and subsequently restarted from Init once a consecutive BFD Up is seen.

In case of parallel links between nodes, each link MAY run its own independent BFD session or they MAY share a session.

If link identifiers or BFD capabilities change, both the LIE and any BFD sessions SHOULD be brought down and back up again. In case only the advertised capabilities change, the node MAY choose to persist the BFD session.
4403 Multiple RIFT instances MAY choose to share a single BFD session, 4404 in such cases the behavior for which discriminators are used is 4405 undefined. However, RIFT MAY advertise the same link ID for the 4406 same interface in multiple instances to "share" discriminators. 4408 BFD TTL follows [RFC5082]. 4410 4.3.6. Fabric Bandwidth Balancing 4412 A well understood problem in fabrics is that in case of link 4413 failures, it would be ideal to rebalance how much traffic is sent to 4414 switches in the next level based on available ingress and egress 4415 bandwidth. 4417 RIFT supports a very light weight mechanism that can deal with the 4418 problem in an approximate way based on the fact that RIFT is loop- 4419 free. 4421 4.3.6.1. Northbound Direction 4423 Every RIFT node SHOULD compute the amount of northbound bandwidth 4424 available through neighbors at higher level and modify distance 4425 received on default route from this neighbor. Default routes with 4426 differing distances SHOULD be used to support weighted ECMP 4427 forwarding. We call such a distance Bandwidth Adjusted Distance or 4428 BAD. This is best illustrated by a simple example. 4430 . 100 x 100 100 MBits 4431 . | x | | 4432 . +-+---+-+ +-+---+-+ 4433 . | | | | 4434 . |Spin111| |Spin112| 4435 . +-+---+++ ++----+++ 4436 . |x || || || 4437 . || |+---------------+ || 4438 . || +---------------+| || 4439 . || || || || 4440 . || || || || 4441 . -----All Links 10 MBit------- 4442 . || || || || 4443 . || || || || 4444 . || +------------+| || || 4445 . || |+------------+ || || 4446 . |x || || || 4447 . +-+---+++ +--++-+++ 4448 . | | | | 4449 . |Leaf111| |Leaf112| 4450 . +-------+ +-------+ 4452 Figure 29: Balancing Bandwidth 4454 Figure 29 depicts an example topology where links between leaf and 4455 spine nodes are 10 MBit/s and links from spine nodes northbound are 4456 100 MBit/s. 
Consider a parallel link failure between Leaf 111 and 4457 Spine 111 and as a result, Leaf 111 wants to forward more traffic 4458 toward Spine 112. Additionally, we consider an uplink failure on 4459 Spine 111. 4461 The local modification of the received default route distance from 4462 the upper level is achieved by running a relatively simple algorithm 4463 where the bandwidth is weighted exponentially, while the distance on 4464 the default route represents a multiplier for the bandwidth weight 4465 for easy operational adjustments. 4467 On a node L, use Node TIEs to compute 3 values for each 4468 non-overloaded northbound neighbor N: 4470 L_N_u: as sum of the bandwidth available to N 4472 N_u: as sum of the uplink bandwidth available on N 4474 T_N_u: as sum of L_N_u * OVERSUBSCRIPTION_CONSTANT + N_u 4476 For all T_N_u determine the corresponding M_N_u as 4477 log_2(next_power_2(T_N_u)) and determine MAX_M_N_u as the maximum value 4478 of all such M_N_u values. 4480 For each advertised default route from a node N modify the advertised 4481 distance D to BAD = D * (1 + MAX_M_N_u - M_N_u) and use BAD instead 4482 of distance D to weight balance default forwarding towards N. 4484 For the example above, a simple table of values will help in 4485 understanding the concept. We assume that all default route 4486 distances are advertised with D=1 and that OVERSUBSCRIPTION_CONSTANT 4487 = 1.
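The computation above can be sketched as follows. This is a non-normative Python illustration; the bandwidth figures (in MBits) are taken from the Figure 29 example after the described failures, and the function and variable names are purely illustrative:

```python
OVERSUBSCRIPTION_CONSTANT = 1

def next_power_2(x):
    # Smallest power of 2 greater than or equal to x.
    return 1 << (x - 1).bit_length() if x > 1 else 1

def bad_per_neighbor(neighbors, d=1):
    """neighbors maps neighbor name N -> (L_N_u, N_u): the bandwidth from
    the local node to N and the uplink bandwidth available on N.
    Returns the Bandwidth Adjusted Distance (BAD) toward each neighbor."""
    # M_N_u = log_2(next_power_2(L_N_u * OVERSUBSCRIPTION_CONSTANT + N_u))
    m = {n: next_power_2(l * OVERSUBSCRIPTION_CONSTANT + u).bit_length() - 1
         for n, (l, u) in neighbors.items()}
    max_m = max(m.values())
    # BAD = D * (1 + MAX_M_N_u - M_N_u)
    return {n: d * (1 + max_m - m[n]) for n in m}

# Leaf 111's view: one surviving 10 MBit link to Spine 111 (100 MBit of
# uplink left on it) and two 10 MBit links to Spine 112 (200 MBit uplink).
print(bad_per_neighbor({"Spine 111": (10, 100), "Spine 112": (20, 200)}))
# {'Spine 111': 2, 'Spine 112': 1}
```

The printed values match the Leaf111 rows of the BAD table below; the Leaf112 rows follow from `{"Spine 111": (20, 100), "Spine 112": (20, 200)}`.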
4489 +---------+-----------+-------+-------+-----+ 4490 | Node | N | T_N_u | M_N_u | BAD | 4491 +---------+-----------+-------+-------+-----+ 4492 | Leaf111 | Spine 111 | 110 | 7 | 2 | 4493 +---------+-----------+-------+-------+-----+ 4494 | Leaf111 | Spine 112 | 220 | 8 | 1 | 4495 +---------+-----------+-------+-------+-----+ 4496 | Leaf112 | Spine 111 | 120 | 7 | 2 | 4497 +---------+-----------+-------+-------+-----+ 4498 | Leaf112 | Spine 112 | 220 | 8 | 1 | 4499 +---------+-----------+-------+-------+-----+ 4501 Table 5: BAD Computation 4503 If a calculation produces a result exceeding the range of the type, 4504 e.g. bandwidth, the result is set to the highest possible value for 4505 that type. 4507 BAD SHOULD be only computed for default routes. A node MAY compute 4508 and use BAD for any disaggregated prefixes or other RIFT routes. A 4509 node MAY use a different algorithm to weight northbound traffic based 4510 on bandwidth. If a different algorithm is used, its successful 4511 behavior MUST NOT depend on uniformity of algorithm or 4512 synchronization of BAD computations across the fabric. E.g. it is 4513 conceivable that leaves could use real time link loads gathered by 4514 analytics to change the amount of traffic assigned to each default 4515 route next hop. 4517 Furthermore, a change in available bandwidth will only affect, at 4518 most, two levels down in the fabric, i.e. the blast radius of 4519 bandwidth adjustments is constrained no matter the fabric's height. 4521 4.3.6.2. Southbound Direction 4523 Due to its loop free nature, during South SPF, a node MAY account for 4524 maximum available bandwidth on nodes in lower levels and modify the 4525 amount of traffic offered to the next level's southbound nodes. It 4526 is worth considering that such computations may be more effective if 4527 standardized, but do not have to be. 
As long as a packet continues 4528 to flow southbound, it will take some viable, loop-free path to reach 4529 its destination. 4531 4.3.7. Label Binding 4533 A node MAY advertise in its LIEs a locally significant, downstream 4534 assigned, interface-specific label. One use of such a label is a 4535 hop-by-hop encapsulation allowing forwarding planes to be easily 4536 distinguished among multiple RIFT instances. 4538 4.3.8. Leaf to Leaf Procedures 4540 RIFT implementations SHOULD support special East-West adjacencies 4541 between leaf nodes. Leaf nodes supporting these procedures MUST: 4543 advertise the LEAF_2_LEAF flag in its node capabilities AND 4545 set the overload bit on all leaf's node TIEs AND 4547 flood only a node's own north and south TIEs over E-W leaf 4548 adjacencies AND 4550 always use E-W leaf adjacency in all SPF computations AND 4552 install a discard route for any advertised aggregate routes in a 4553 leaf's TIE AND 4555 never form southbound adjacencies. 4557 This will allow the E-W leaf nodes to exchange traffic strictly for 4558 the prefixes advertised in each other's north prefix TIEs (since the 4559 southbound computation will find the reverse direction in the other 4560 node's TIE and install its north prefixes). 4562 4.3.9. Address Family and Multi Topology Considerations 4564 Multi-Topology (MT)[RFC5120] and Multi-Instance (MI)[RFC8202] 4565 concepts are used today in link-state routing protocols to support 4566 several domains on the same physical topology. RIFT supports this 4567 capability by carrying transport ports in the LIE protocol exchanges. 4569 Multiplexing of LIEs can be achieved by either choosing varying 4570 multicast addresses or ports on the same address. 4572 BFD interactions in Section 4.3.5 are implementation dependent when 4573 multiple RIFT instances run on the same link. 4575 4.3.10.
Reachability of Internal Nodes in the Fabric 4577 RIFT does not require that nodes have reachable addresses in the 4578 fabric, though it is clearly desirable for operational purposes. 4579 Under normal operating conditions this can be easily achieved by 4580 injecting the node's loopback address into North and South Prefix 4581 TIEs or other implementation specific mechanisms. 4583 Special considerations arise when a node loses all northbound 4584 adjacencies, but is not at the top of the fabric. These are outside 4585 the scope of this document and could be discussed in a separate 4586 document. 4588 4.3.11. One-Hop Healing of Levels with East-West Links 4590 Based on the rules defined in Section 4.2.4, Section 4.2.3.8 and 4591 given the presence of E-W links, RIFT can provide one-hop protection 4592 for nodes that lost all their northbound links. This can also be 4593 applied to multi-plane designs where complex link set failures occur 4594 at the Top-of-Fabric when links are exclusively used for flooding 4595 topology information. Section 5.4 outlines this behavior. 4597 4.4. Security 4599 4.4.1. Security Model 4601 An inherent property of any security and ZTP architecture is the 4602 resulting trade-off in regard to integrity verification of the 4603 information distributed through the fabric vs. provisioning and auto- 4604 configuration requirements. At a minimum the security of an 4605 established adjacency should be ensured. The stricter the security 4606 model, the more provisioning must take over the role of ZTP. 4608 RIFT supports the following security models to allow for flexible 4609 control by the operator. 4611 o The most security conscious operators may choose to have control 4612 over which ports interconnect between a given pair of nodes; we 4613 call this the "Port-Association Model" (PAM). This is achievable 4614 by configuring each pair of directly connected ports with a 4615 designated shared key or public/private key pair.
4617 o In physically secure data center locations, operators may choose 4618 to control connectivity between entire nodes; we call this the 4619 "Node-Association Model" (NAM). A benefit of this model is that 4620 it allows for simplified port sparing. 4622 o In the most relaxed environments, an operator may only choose to 4623 control which nodes join a particular fabric. We call this the 4624 "Fabric-Association Model" (FAM). This is achievable by using a 4625 single shared secret across the entire fabric. Such flexibility 4626 makes sense when we consider servers as leaf devices, which are 4627 replaced more often than network nodes. In addition, this model 4628 allows for simplified node sparing. 4630 o These models may be mixed throughout the fabric depending upon 4631 security requirements at various levels of the fabric and 4632 willingness to accept increased provisioning complexity. 4634 In order to support the cases mentioned above, RIFT implementations 4635 support, through operator control, mechanisms that allow for: 4637 a. specification of the appropriate level in the fabric, 4639 b. discovery and reporting of missing connections, 4641 c. discovery and reporting of unexpected connections while 4642 preventing them from forming insecure adjacencies. 4644 Operators may only choose to configure the level of each node, but 4645 not explicitly configure which connections are allowed. In this 4646 case, RIFT will only allow adjacencies to establish between nodes 4647 that are in adjacent levels. Operators with the lowest security 4648 requirements may not use any configuration to specify which 4649 connections are allowed. Nodes in such fabrics could rely fully on 4650 ZTP and only establish adjacencies between nodes in adjacent 4651 levels. Figure 30 illustrates inherent tradeoffs between the 4652 different security models. 4654 Some level of link quality verification may be required prior to an 4655 adjacency being used for forwarding.
For example, an implementation 4656 may require that a BFD session comes up before advertising the 4657 adjacency. 4659 For the cases outlined above, RIFT has two approaches to enforce that 4660 a local port is connected to the correct port on the correct remote 4661 node. One approach is to piggy-back on RIFT's authentication 4662 mechanism. Assuming the provisioning model (e.g. the YANG model) is 4663 flexible enough, operators can choose to provision a unique 4664 authentication key for: 4666 a. each pair of ports in "port-association model" or 4668 b. each pair of switches in "node-association model" or 4670 c. each pair of levels or 4672 d. the entire fabric in "fabric-association model". 4674 The other approach is to rely on the system-id, port-id and level 4675 fields in the LIE message to validate an adjacency against the 4676 expected cabling topology, and optionally introduce some new rules in 4677 the FSM to allow the adjacency to come up if the expectations are 4678 met. 4680 ^ /\ | 4681 /|\ / \ | 4682 | / \ | 4683 | / PAM \ | 4684 Increasing / \ Increasing 4685 Integrity +----------+ Flexibility 4686 & / NAM \ & 4687 Increasing +--------------+ Less 4688 Provisioning / FAM \ Configuration 4689 | +------------------+ | 4690 | / Level Provisioning \ | 4691 | +----------------------+ \|/ 4692 | / Zero Configuration \ v 4693 +--------------------------+ 4695 Figure 30: Security Model 4697 4.4.2. Security Mechanisms 4699 RIFT Security goals are to ensure: 4701 1. authentication 4703 2. message integrity 4705 3. the prevention of replay attacks 4707 4. low processing overhead 4709 5. efficient messaging 4710 Message confidentiality is a non-goal. 4712 The model in the previous section allows a range of security key 4713 types that are analogous to the various security association models. 4714 PAM and NAM allow security associations at the port or node level 4715 using symmetric or asymmetric keys that are pre-installed. 
FAM 4716 argues for security associations to be applied only at a group level 4717 or to be refined once the topology has been established. RIFT does 4718 not specify how security keys are installed or updated, though it 4719 does specify how the key can be used to achieve security goals. 4721 The protocol has provisions for "weak" nonces to prevent replay 4722 attacks and includes authentication mechanisms comparable to 4723 [RFC5709] and [RFC7987]. 4725 4.4.3. Security Envelope 4727 RIFT MUST be carried in a mandatory secure envelope illustrated in 4728 Figure 31. Any value in the packet following a security fingerprint 4729 MUST be used only after the appropriate fingerprint has been 4730 validated. 4732 Local configuration MAY allow for the envelope's integrity checks to 4733 be skipped. 4735 0 1 2 3 4736 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 4738 UDP Header: 4739 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4740 | Source Port | RIFT destination port | 4741 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4742 | UDP Length | UDP Checksum | 4743 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4745 Outer Security Envelope Header: 4746 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4747 | RIFT MAGIC | Packet Number | 4748 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4749 | Reserved | RIFT Major | Outer Key ID | Fingerprint | 4750 | | Version | | Length | 4751 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4752 | | 4753 ~ Security Fingerprint covers all following content ~ 4754 | | 4755 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4756 | Weak Nonce Local | Weak Nonce Remote | 4757 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4758 | Remaining TIE Lifetime (all 1s in case of LIE) | 4759 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4761 TIE Origin Security Envelope 
Header: 4762 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4763 | TIE Origin Key ID | Fingerprint | 4764 | | Length | 4765 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4766 | | 4767 ~ Security Fingerprint covers all following content ~ 4768 | | 4769 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4771 Serialized RIFT Model Object 4772 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4773 | | 4774 ~ Serialized RIFT Model Object ~ 4775 | | 4776 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 4778 Figure 31: Security Envelope 4780 RIFT MAGIC: 16 bits. Constant value of 0xA1F7 that allows RIFT 4781 packets to be classified independently of the UDP port used. 4783 Packet Number: 16 bits. An optional, per-packet-type, monotonically 4784 growing number rolling over using sequence number arithmetic 4785 defined in Appendix A. A node SHOULD correctly set the number on 4786 subsequent packets or otherwise MUST set the value to 4787 `undefined_packet_number` as provided in the schema. This number 4788 can be used to detect losses and misordering in flooding for 4789 either operational purposes or in an implementation to adjust 4790 flooding behavior to current link or buffer quality. This number 4791 MUST NOT be used to discard or validate the correctness of 4792 packets. 4794 RIFT Major Version: 8 bits. It allows checking whether protocol 4795 versions are compatible, i.e. if the serialized object can be 4796 decoded at all. An implementation MUST drop packets with 4797 unexpected values and MAY report a problem. 4799 Outer Key ID: 8 bits to allow key rollovers. This implies key type 4800 and algorithm. Value 0 means that no valid fingerprint was 4801 computed. This key ID scope is local to the nodes on both ends of 4802 the adjacency. 4804 TIE Origin Key ID: 24 bits. This implies key type and used 4805 algorithm. Value 0 means that no valid fingerprint was computed.
4806 This key ID scope is global to the RIFT instance since it implies 4807 the originator of the TIE so the contained object does not have to 4808 be de-serialized to obtain it. 4810 Length of Fingerprint: 8 bits. Length in 32-bit multiples of the 4811 following fingerprint (not including lifetime or weak nonces). It 4812 allows the structure to be navigated when an unknown key type is 4813 present. To clarify, a common corner case when this value is set 4814 to 0 is when it signifies an empty (0 bytes long) security 4815 fingerprint. 4817 Security Fingerprint: 32 bits * Length of Fingerprint. This is a 4818 signature that is computed over all data following after it. If 4819 the significant bits of the fingerprint are fewer than the 32-bit 4820 padded length, then the significant bits MUST be left aligned and the 4821 remaining bits on the right padded with 0s. When using PKI, the 4822 node originating the security fingerprint uses its private key to 4823 create the signature. The original packet can then be verified 4824 provided the public key is shared and current. 4826 Remaining TIE Lifetime: 32 bits. In case of anything but TIEs, this 4827 field MUST be set to all ones and the Origin Security Envelope Header 4828 MUST NOT be present in the packet. For TIEs this field represents 4829 the remaining lifetime of the TIE and the Origin Security Envelope 4830 Header MUST be present in the packet. The value in the serialized 4831 model object MUST be ignored. 4833 Weak Nonce Local: 16 bits. Local Weak Nonce of the adjacency as 4834 advertised in LIEs. 4836 Weak Nonce Remote: 16 bits. Remote Weak Nonce of the adjacency as 4837 received in LIEs. 4839 TIE Origin Security Envelope Header: It MUST be present if and only 4840 if the Remaining TIE Lifetime field is NOT all ones. It carries 4841 the originator's key ID and corresponding fingerprint of the 4842 object to protect the TIE from modification during flooding.
This 4843 ensures origin validation and integrity (but does not provide 4844 validation of a chain of trust). 4846 Observe that due to the schema migration rules per Appendix B the 4847 contained model can always be decoded if the major version matches 4848 and the envelope integrity has been validated. Consequently, a 4849 description of the TIE is available to flood it properly including 4850 unknown TIE types. 4852 4.4.4. Weak Nonces 4854 The protocol uses two 16 bit nonces to salt generated signatures. We 4855 use the term "nonce" a bit loosely since RIFT nonces are not being 4856 changed in every packet as is common in cryptography. For efficiency 4857 purposes they are changed at a high enough frequency to dwarf 4858 practical replay attack attempts. Therefore, we call them "weak" 4859 nonces. 4861 Any implementation including RIFT security MUST generate and wrap 4862 around local nonces properly. When a nonce increment leads to the 4863 `undefined_nonce` value, the value MUST be incremented again 4864 immediately. All implementations MUST reflect the neighbor's nonces. 4865 An implementation SHOULD increment a chosen nonce on every LIE FSM 4866 transition that ends up in a different state from the previous one and 4867 MUST increment its nonce at least every 5 minutes (such 4868 considerations allow for efficient implementations without opening a 4869 significant security risk). When flooding TIEs, the implementation 4870 MUST use recent (i.e. within allowed difference) nonces reflected in 4871 the LIE exchange. The schema specifies the maximum allowable nonce 4872 value difference on a packet compared to reflected nonces in the 4873 LIEs. Any packet received with nonces deviating more than the 4874 allowed delta MUST be discarded without further computation of 4875 signatures to prevent computation load attacks.
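A minimal sketch of the nonce rules above, written in Python under two stated assumptions: that `undefined_nonce` encodes as 0 and that the maximum allowed nonce delta is 5 (the authoritative values are defined in the schema, not here):

```python
UNDEFINED_NONCE = 0   # assumed encoding of `undefined_nonce`
MAX_NONCE_DELTA = 5   # illustrative stand-in for the schema's allowed difference

def increment_nonce(nonce):
    # Advance a 16-bit weak nonce, wrapping around and skipping the
    # undefined value (it MUST be incremented again immediately).
    nonce = (nonce + 1) & 0xFFFF
    if nonce == UNDEFINED_NONCE:
        nonce = (nonce + 1) & 0xFFFF
    return nonce

def nonce_delta(a, b):
    # Distance between two 16-bit nonces using sequence number arithmetic,
    # i.e. the shorter way around the 2^16 circle.
    return min((a - b) & 0xFFFF, (b - a) & 0xFFFF)

def accept_nonce(received, reflected):
    # Packets whose nonce deviates more than the allowed delta from the
    # nonce reflected in LIEs are discarded before signature computation.
    return nonce_delta(received, reflected) <= MAX_NONCE_DELTA
```

Note how the delta computation naturally handles wrap-around: a received nonce of 2 compared against a reflected nonce of 0xFFFE is only 4 steps away.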
4877 In cases where a secure implementation does not receive signatures or 4878 receives undefined nonces from a neighbor (indicating that it does 4879 not support or verify signatures), it is a matter of local policy as 4880 to how those packets are treated. A secure implementation MAY refuse 4881 forming an adjacency with an implementation that is not advertising 4882 signatures or valid nonces, or it MAY continue signing local packets 4883 while accepting a neighbor's packets without further security 4884 validation. 4886 As a necessary exception, an implementation MUST advertise the remote 4887 nonce value as `undefined_nonce` when the FSM is not in two-way or 4888 three-way state and accept an `undefined_nonce` for its local nonce 4889 value on packets in any other state than three-way. 4891 As an optional optimization, an implementation MAY send one LIE with 4892 the previously negotiated neighbor's nonce to try to speed up a 4893 neighbor's transition from three-way to one-way and MUST revert to 4894 sending `undefined_nonce` after that. 4896 4.4.5. Lifetime 4898 Protecting flooding lifetime may lead to an excessive number of 4899 security fingerprint computations. To avoid this, the application 4900 generating the fingerprints for advertised TIEs MAY round the value 4901 down to the next `rounddown_lifetime_interval`. Such an optimization 4902 may not be feasible in the presence of security hashes over advancing 4903 weak nonces. 4905 4.4.6. Key Management 4907 As outlined in Section 7, either a private shared key or a 4908 public/private key pair is used to authenticate the adjacency. Both 4909 the key distribution and key synchronization methods are out of scope 4910 for this document. Both nodes in the adjacency MUST share the same 4911 keys, key type, and algorithm for a given key ID. Mismatched keys 4912 will not inter-operate as their security envelopes will be 4913 unverifiable. 4915 Key roll-over while the adjacency is active MAY be supported.
The 4916 specific mechanism is well documented in [RFC6518]. 4918 4.4.7. Security Association Changes 4920 There is no mechanism to convert a security envelope for the same key 4921 ID from one algorithm to another once the envelope is operational. 4922 The recommended procedure to change to a new algorithm is to take the 4923 adjacency down, make the necessary changes, and bring the adjacency 4924 back up. Obviously, an implementation MAY choose to stop verifying 4925 the security envelope for the duration of the algorithm change to keep the 4926 adjacency up but since this introduces a security vulnerability 4927 window, such a roll-over is NOT RECOMMENDED. 4929 5. Examples 4931 5.1. Normal Operation 4933 ^ N +--------+ +--------+ 4934 Level 2 | |ToF 21| |ToF 22| 4935 E <-*-> W ++-+--+-++ ++-+--+-++ 4936 | | | | | | | | | 4937 S v P111/2 |P121/2 | | | | 4938 ^ ^ ^ ^ | | | | 4939 | | | | | | | | 4940 +--------------+ | +-----------+ | | | +---------------+ 4941 | | | | | | | | 4942 South +-----------------------------+ | | ^ 4943 | | | | | | | All TIEs 4944 0/0 0/0 0/0 +-----------------------------+ | 4945 v v v | | | | | 4946 | | +-+ +<-0/0----------+ | | 4947 | | | | | | | | 4948 +-+----++ +-+----++ ++----+-+ ++-----++ 4949 Level 1 | | | | | | | | 4950 |Spin111| |Spin112| |Spin121| |Spin122| 4951 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 4952 | | | South | | | | 4953 | +---0/0--->-----+ 0/0 | +----------------+ | 4954 0/0 | | | | | | | 4955 | +---<-0/0-----+ | v | +--------------+ | | 4956 v | | | | | | | 4957 +-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 4958 Level 0 | | | | | | | | 4959 |Leaf111| |Leaf112| |Leaf121| |Leaf122| 4960 +-+-----+ +-+---+-+ +--+--+-+ +-+-----+ 4961 + + \ / + + 4962 Prefix111 Prefix112 \ / Prefix121 Prefix122 4963 multi-homed 4964 Prefix 4965 +---------- PoD 1 ---------+ +---------- PoD 2 ---------+ 4967 Figure 32: Normal Case Topology 4969 This section describes RIFT deployment in the example topology given in 4970 Figure 32 without any node or
link failures. We disregard flooding 4971 reduction for simplicity's sake and compress the node names in some 4972 cases to fit them into the picture better. 4974 First, the following bi-directional adjacencies will be established: 4976 1. ToF 21 (PoD 0) to Spine 111, Spine 112, Spine 121, and Spine 122 4978 2. ToF 22 (PoD 0) to Spine 111, Spine 112, Spine 121, and Spine 122 4980 3. Spine 111 to Leaf 111, Leaf 112 4982 4. Spine 112 to Leaf 111, Leaf 112 4984 5. Spine 121 to Leaf 121, Leaf 122 4986 6. Spine 122 to Leaf 121, Leaf 122 4988 Leaf 111 and Leaf 112 originate N-TIEs for Prefix 111 and Prefix 112 4989 (respectively) to both Spine 111 and Spine 112 (Leaf 112 also 4990 originates an N-TIE for the multi-homed prefix). Spine 111 and Spine 4991 112 will then originate their own N-TIEs, as well as flood the N-TIEs 4992 received from Leaf 111 and Leaf 112 to both ToF 21 and ToF 22. 4994 Similarly, Leaf 121 and Leaf 122 originate North TIEs for Prefix 121 4995 and Prefix 122 (respectively) to Spine 121 and Spine 122 (Leaf 121 4996 also originates a North TIE for the multi-homed prefix). Spine 121 4997 and Spine 122 will then originate their own North TIEs, as well as 4998 flood the North TIEs received from Leaf 121 and Leaf 122 to both ToF 4999 21 and ToF 22. 5001 Spines hold only North TIEs of level 0 for their PoD, while leaves 5002 only hold their own North TIEs. At this point, both ToF 21 and 5003 ToF 22 (as well as any northbound connected controllers) would have 5004 the complete network topology. 5006 ToF 21 and ToF 22 would then originate and flood South TIEs 5007 containing any established adjacencies and a default IP route to all 5008 spines. Spine 111, Spine 112, Spine 121, and Spine 122 will reflect 5009 all Node South TIEs received from ToF 21 to ToF 22, and all Node 5010 South TIEs from ToF 22 to ToF 21. South TIEs will not be re- 5011 propagated southbound.
5013 South TIEs containing a default IP route are then originated by both 5014 Spine 111 and Spine 112 toward Leaf 111 and Leaf 112. Similarly, 5015 South TIEs containing a default IP route are originated by Spine 121 5016 and Spine 122 toward Leaf 121 and Leaf 122. 5018 At this point IP connectivity across the maximum number of viable paths 5019 has been established for all leaves, with routing information 5020 constrained to only the minimum amount that allows for normal 5021 operation and redundancy. 5023 5.2. Leaf Link Failure 5025 . | | | | 5026 .+-+---+-+ +-+---+-+ 5027 .| | | | 5028 .|Spin111| |Spin112| 5029 .+-+---+-+ ++----+-+ 5030 . | | | | 5031 . | +---------------+ X 5032 . | | | X Failure 5033 . | +-------------+ | X 5034 . | | | | 5035 .+-+---+-+ +--+--+-+ 5036 .| | | | 5037 .|Leaf111| |Leaf112| 5038 .+-------+ +-------+ 5039 . + + 5040 . Prefix111 Prefix112 5042 Figure 33: Single Leaf Link Failure 5044 In the event of a link failure between Spine 112 and Leaf 112, both 5045 nodes will originate new Node TIEs that contain their connected 5046 adjacencies, except for the one that just failed. Leaf 112 will send 5047 a Node North TIE to Spine 111. Spine 112 will send a Node North TIE 5048 to ToF 21 and ToF 22 as well as a new Node South TIE to Leaf 111 that 5049 will be reflected to Spine 111. Necessary SPF recomputation will 5050 occur, resulting in Spine 112 no longer being in the forwarding path 5051 for Prefix 112. 5053 Spine 111 will also disaggregate Prefix 112 by sending a new Prefix 5054 South TIE to Leaf 111 and Leaf 112. Though we cover disaggregation 5055 in more detail in the following section, it is worth mentioning in 5056 this example as it further illustrates RIFT's blackhole mitigation 5057 mechanism. Consider that Leaf 111 has yet to receive the more 5058 specific (disaggregated) route from Spine 111.
In such a scenario, 5059 traffic from Leaf 111 toward Prefix 112 may still use Spine 112's 5060 default route, causing it to traverse ToF 21 and ToF 22 back down via 5061 Spine 111. While this behavior is suboptimal, it is transient in 5062 nature and preferred to black-holing traffic. 5064 5.3. Partitioned Fabric 5066 +--------+ +--------+ 5067 Level 2 |ToF 21| |ToF 22| 5068 ++-+--+-++ ++-+--+-++ 5069 | | | | | | | | 5070 | | | | | | | 0/0 5071 | | | | | | | | 5072 | | | | | | | | 5073 +--------------+ | +--- XXXXXX + | | | +---------------+ 5074 | | | | | | | | 5075 | +-----------------------------+ | | | 5076 0/0 | | | | | | | 5077 | 0/0 0/0 +- XXXXXXXXXXXXXXXXXXXXXXXXX -+ | 5078 | 1.1/16 | | | | | | 5079 | | +-+ +-0/0-----------+ | | 5080 | | | 1.1/16 | | | | 5081 +-+----++ +-+-----+ ++-----0/0 ++----0/0 5082 Level 1 | | | | | 1.1/16 | 1.1/16 5083 |Spin111| |Spin112| |Spin121| |Spin122| 5084 +-+---+-+ ++----+-+ +-+---+-+ ++---+--+ 5085 | | | | | | | | 5086 | +---------------+ | | +----------------+ | 5087 | | | | | | | | 5088 | +-------------+ | | | +--------------+ | | 5089 | | | | | | | | 5090 +-+---+-+ +--+--+-+ +-+---+-+ +---+-+-+ 5091 Level 0 | | | | | | | | 5092 |Leaf111| |Leaf112| |Leaf121| |Leaf122| 5093 +-+-----+ ++------+ +-----+-+ +-+-----+ 5094 + + + + 5095 Prefix111 Prefix112 Prefix121 Prefix122 5096 1.1/16 5098 Figure 34: Fabric Partition 5100 Figure 34 shows one of the more catastrophic scenarios where ToF 21 is 5101 completely severed from access to Prefix 121 due to a double link 5102 failure. If only default routes existed, this would result in 50% of 5103 traffic from Leaf 111 and Leaf 112 toward Prefix 121 being black- 5104 holed. 5106 The mechanism to resolve this scenario hinges on ToF 21's South TIEs 5107 being reflected from Spine 111 and Spine 112 to ToF 22.
Once ToF 22 5108 sees that Prefix 121 cannot be reached from ToF 21, it will begin to 5109 disaggregate Prefix 121 by advertising a more specific route (1.1/16) 5110 along with the default IP prefix route to all spines (ToF 21 still 5111 only sends a default route). The result is Spine 111 and Spine 112 5112 using the more specific route to Prefix 121 via ToF 22. All other 5113 prefixes continue to use the default IP prefix route toward both ToF 5114 21 and ToF 22. 5116 The more specific route for Prefix 121 being advertised by ToF 22 5117 does not need to be propagated further south to the leaves, as they 5118 do not benefit from this information. Spine 111 and Spine 112 are 5119 only required to reflect the new South Node TIEs received from ToF 22 5120 to ToF 21. In short, only the relevant nodes receive the relevant 5121 updates, thereby restricting the failure to only the partitioned 5122 level rather than burdening the whole fabric with the flooding and 5123 recomputation of the new topology information. 5125 To finish our example, the following list shows the sets computed by ToF 5126 22 using notation introduced in Section 4.2.5: 5128 |R = Prefix 111, Prefix 112, Prefix 121, Prefix 122 5130 |H (for r=Prefix 111) = Spine 111, Spine 112 5132 |H (for r=Prefix 112) = Spine 111, Spine 112 5134 |H (for r=Prefix 121) = Spine 121, Spine 122 5136 |H (for r=Prefix 122) = Spine 121, Spine 122 5138 |A (for ToF 21) = Spine 111, Spine 112 5140 With that and |H (for r=Prefix 121) and |H (for r=Prefix 122) being 5141 disjoint from |A (for ToF 21), ToF 22 will originate a South TIE 5142 with Prefix 121 and Prefix 122, which will be flooded to all spines. 5144 5.4. Northbound Partitioned Router and Optional East-West Links 5145 . + + + 5146 . X N1 | N2 | N3 5147 . X | | 5148 .+--+----+ +--+----+ +--+-----+ 5149 .| |0/0> <0/0| |0/0> <0/0| | 5150 .| A01 +----------+ A02 +----------+ A03 | Level 1 5151 .++-+-+--+ ++--+--++ +---+-+-++ 5152 . | | | | | | | | | 5153 .
| | +----------------------------------+ | | | 5154 . | | | | | | | | | 5155 . | +-------------+ | | | +--------------+ | 5156 . | | | | | | | | | 5157 . | +----------------+ | +-----------------+ | 5158 . | | | | | | | | | 5159 . | | +------------------------------------+ | | 5160 . | | | | | | | | | 5161 .++-+-+--+ | +---+---+ | +-+---+-++ 5162 .| | +-+ +-+ | | 5163 .| L01 | | L02 | | L03 | Level 0 5164 .+-------+ +-------+ +--------+ 5166 Figure 35: North Partitioned Router 5168 Figure 35 shows a part of a fabric where level 1 is horizontally 5169 connected and A01 lost its only northbound adjacency. Based on N-SPF 5170 rules in Section 4.2.4.1, A01 will compute northbound reachability by 5171 using the link A01 to A02. A02, however, will NOT use this link 5172 during N-SPF. The result is A01 utilizing the horizontal link for 5173 default route advertisement and unidirectional routing. 5175 Furthermore, if A02 also loses its only northbound adjacency (N2), 5176 the situation evolves. A01 will no longer have northbound 5177 reachability while it sees A03's northbound adjacencies in South Node 5178 TIEs reflected by nodes south of it. As a result, A01 will no longer 5179 advertise its default route in accordance with Section 4.2.3.8. 5181 6. Implementation and Operation: Further Details 5183 6.1. Considerations for Leaf-Only Implementation 5185 RIFT can be, and is intended to be, stretched to the lowest level in the 5186 IP fabric to integrate ToRs or even servers. Since those entities 5187 would run as leaves only, it is worth observing that a leaf-only 5188 version is significantly simpler to implement and requires far fewer 5189 resources: 5191 1. Leaf nodes only need to maintain a multipath default route under 5192 normal circumstances. However, in cases of catastrophic 5193 partitioning, leaf nodes SHOULD be capable of accommodating all 5194 the leaf routes in their own PoD to prevent black-holing. 5196 2.
Leaf nodes hold only their own North TIEs and South TIEs of Level 5197 1 nodes they are connected to. 5199 3. Leaf nodes do not have to support any type of de-aggregation 5200 computation or propagation. 5202 4. Leaf nodes are not required to support the overload bit. 5204 5. Leaf nodes do not need to originate S-TIEs unless optional leaf- 5205 2-leaf features are desired. 5207 6.2. Considerations for Spine Implementation 5209 Spine nodes will never act as Top of Fabric, and are therefore not 5210 required to run a full RIFT implementation. Specifically, spines do 5211 not need to perform negative disaggregation computation other than 5212 respecting northbound disaggregation advertised from the north. 5214 6.3. Adaptations to Other Proposed Data Center Topologies 5216 . +-----+ +-----+ 5217 . | | | | 5218 .+-+ S0 | | S1 | 5219 .| ++---++ ++---++ 5220 .| | | | | 5221 .| | +------------+ | 5222 .| | | +------------+ | 5223 .| | | | | 5224 .| ++-+--+ +--+-++ 5225 .| | | | | 5226 .| | A0 | | A1 | 5227 .| +-+--++ ++---++ 5228 .| | | | | 5229 .| | +------------+ | 5230 .| | +-----------+ | | 5231 .| | | | | 5232 .| +-+-+-+ +--+-++ 5233 .+-+ | | | 5234 . | L0 | | L1 | 5235 . +-----+ +-----+ 5237 Figure 36: Level Shortcut 5239 RIFT is not strictly limited to Clos topologies. The protocol only 5240 requires a sense of "compass rose directionality" either achieved 5241 through configuration or derivation of levels. So, conceptually, 5242 leaf-2-leaf links and even shortcuts between levels could be 5243 included. Figure 36 depicts an example of a shortcut between levels. 5244 In this example, sub-optimal routing will occur when traffic is sent 5245 from L0 to L1 via S0's default route and back down through A0 or A1. 5246 In order to ensure that only default routes from A0 or A1 are used, 5247 all leaves would be required to install each other's routes.
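For illustration only, the longest-prefix-match behavior that motivates installing each other's routes can be sketched in Python. The prefixes and next-hop names below are hypothetical, chosen merely to mirror Figure 36; this is not part of the protocol.

```python
import ipaddress

# Hypothetical RIB of leaf L0: a multipath default learned from S0 (the
# shortcut) as well as A0/A1, plus L1's specific prefix learned via A0/A1.
rib = {
    ipaddress.ip_network("0.0.0.0/0"): ["S0", "A0", "A1"],
    ipaddress.ip_network("198.51.100.0/24"): ["A0", "A1"],  # L1's prefix
}

def lookup(destination: str):
    """Longest-prefix match: the most specific covering route wins."""
    dst = ipaddress.ip_address(destination)
    best = max((net for net in rib if dst in net),
               key=lambda net: net.prefixlen)
    return rib[best]

# With L1's specific route installed, L0 forwards via A0/A1 only and
# never takes the sub-optimal detour through S0:
print(lookup("198.51.100.7"))  # ['A0', 'A1']
```

Without the /24 entry, the same lookup would fall back to the multipath default and could select the S0 shortcut, producing the sub-optimal path described above.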
5249 While various technical and operational challenges may require the 5250 use of such modifications, discussion of those topics is outside the 5251 scope of this document. 5253 6.4. Originating Non-Default Route Southbound 5255 An implementation MAY choose to originate more specific prefixes (P') 5256 southbound instead of only the default route (as described in 5257 Section 4.2.3.8). In such a scenario, all addresses carried within 5258 the RIFT domain MUST be contained within P'. 5260 7. Security Considerations 5262 7.1. General 5264 One can consider attack vectors where a router may reboot many times 5265 while changing its system ID and pollute the network with many stale 5266 TIEs, or where TIEs are sent with very long lifetimes and not cleaned up 5267 when the routes vanish. Those attack vectors are not unique to RIFT. 5268 Given the large memory footprints available today, those attacks should be 5269 relatively benign. Otherwise, a node SHOULD implement a strategy of 5270 discarding the contents of all TIEs that were not present in the SPF tree 5271 over a certain, configurable period of time. Since the protocol, 5272 like all modern link-state protocols, is self-stabilizing and will 5273 advertise the presence of such TIEs to its neighbors, they can be re- 5274 requested if a computation finds that it sees an adjacency 5275 formed towards the system ID of the discarded TIEs. 5277 7.2. ZTP 5279 Section 4.2.7 presents many attack vectors in untrusted environments, 5280 starting with nodes that oscillate their level offers to the 5281 possibility of nodes offering a three-way adjacency with the highest 5282 possible level value and a very long holdtime, trying to put themselves 5283 "on top of the lattice" and thereby gaining access to the 5284 whole southbound topology. Session authentication mechanisms are 5285 necessary in environments where this is possible and RIFT provides 5286 the security envelope to ensure this if so desired. 5288 7.3.
Lifetime 5290 Traditional IGP protocols are vulnerable to lifetime modification and 5291 replay attacks that can be somewhat mitigated by using techniques 5292 like [RFC7987]. RIFT removes this attack vector by protecting the 5293 lifetime behind a signature computed over it and an additional nonce 5294 combination. This makes even the replay attack window very small and, 5295 for practical purposes, irrelevant, since the lifetime cannot be 5296 artificially shortened by the attacker. 5298 7.4. Packet Number 5300 The optional packet number is carried in the security envelope without 5301 any encryption protection and is hence vulnerable to replay and 5302 modification attacks. Contrary to nonces, this number must change on 5303 every packet and would present a very high cryptographic load if 5304 signed. The attack vector the packet number presents is relatively 5305 benign. Changing the packet number by a man-in-the-middle attack 5306 will only affect operational validation tools and possibly some 5307 performance optimizations on flooding. It is expected that an 5308 implementation detecting too many "fake losses" or "misorderings" due 5309 to an attack on the packet number would simply suppress its further 5310 processing. 5312 7.5. Outer Fingerprint Attacks 5314 A node observing a conversation on the wire can try to inject LIE 5315 packets using the outer key ID. However, it cannot generate valid hashes 5316 if it changes the integrity of the message, so the only possible 5317 attack is DoS due to excessive LIE validation. 5319 A node can try to replay previous LIEs with changed state that it 5320 recorded, but the attack is hard to replicate since the nonce 5321 combination must match the ongoing exchange. It is then limited to a 5322 single flap only, since both nodes will advance their nonces if 5323 the adjacency state changed. Even in the most unlikely case, the 5324 attack length is limited due to both sides periodically increasing 5325 their nonces. 5327 7.6.
TIE Origin Fingerprint DoS Attacks 5329 A compromised node can attempt to generate "fake TIEs" using other 5330 nodes' TIE origin key identifiers. Although the ultimate validation of 5331 the origin fingerprint will fail in such scenarios and not progress 5332 further than the immediately peering nodes, the resulting denial-of- 5333 service attack seems unavoidable since the TIE origin key ID is only 5334 protected by the node assumed here to be compromised. 5336 7.7. Host Implementations 5338 It can be reasonably expected that with the proliferation of RotH 5339 servers, rather than dedicated networking devices, servers will 5340 represent a significant number of RIFT devices. Given their normally 5341 far wider software envelope and the access granted to them, such servers 5342 are also far more likely to be compromised and present an attack 5343 vector on the protocol. Hijacking of prefixes to attract traffic is 5344 a trust problem and cannot be addressed within the protocol if the 5345 trust model is breached, i.e. the server presents valid credentials 5346 to form an adjacency and issue TIEs. However, in a more devious way, 5347 the servers can present DoS (or even DDoS) vectors by issuing too 5348 many LIE packets, flooding large amounts of North TIEs, and attempting 5349 similar resource-overrun attacks. A prudent implementation forming 5350 adjacencies to leaves should implement appropriate threshold 5351 mechanisms and raise warnings when, e.g., a leaf is advertising an 5352 excessive number of TIEs. 5354 8. IANA Considerations 5356 This specification requests multicast address assignments and 5357 standard port numbers. Additionally, registries for the schema are 5358 requested, and suggested values are provided that reflect the numbers 5359 allocated in the given schema. 5361 8.1.
Requested Multicast and Port Numbers 5363 This document requests the allocation in the 'IPv4 Multicast Address 5364 Space' registry of the suggested value 224.0.0.120 as 5365 'ALL_V4_RIFT_ROUTERS' and in the 'IPv6 Multicast Address Space' 5366 registry of the suggested value FF02::A1F7 as 'ALL_V6_RIFT_ROUTERS'. 5368 This document requests the allocation in the 'Service Name and Transport 5369 Protocol Port Number Registry' of the suggested value 5370 914 over UDP for 'RIFT_LIES_PORT' and the suggested value 915 for 5371 'RIFT_TIES_PORT'. 5373 8.2. Requested Registries with Suggested Values 5375 This section requests registries that help govern the schema via the 5376 usual IANA registry procedures. A top-level 'RIFT' registry should 5377 hold the registries requested in the following sections with 5378 their pre-defined values. IANA is requested to store the schema 5379 version introducing the allocated value as well as, optionally, its 5380 description when present. This will allow assigning different values 5381 to an entry depending on the schema version. Alternatively, IANA is 5382 requested to consider a root RIFT/3 registry to store RIFT schema 5383 major version 3 values and may be requested in the future to create a 5384 RIFT/4 registry under that. In any case, IANA is requested to store 5385 the schema version in the entries since that will allow distinguishing 5386 between minor versions in the same major schema version. 5387 All values not suggested are to be considered `Unassigned`. The range 5388 of every registry is a 16-bit integer. Allocation of new values is 5389 always performed via the `Expert Review` action. 5391 8.2.1. Registry RIFT_v4/common/AddressFamilyType 5393 Address family type. 5395 8.2.1.1. Requested Entries 5397 Name Value Schema Version Description 5398 Illegal 0 4.0 5399 AddressFamilyMinValue 1 4.0 5400 IPv4 2 4.0 5401 IPv6 3 4.0 5402 AddressFamilyMaxValue 4 4.0 5404 8.2.2.
Registry RIFT_v4/common/HierarchyIndications 5406 Flags indicating node configuration in case of ZTP. 5408 8.2.2.1. Requested Entries 5410 Name Value Schema Version Description 5411 leaf_only 0 4.0 5412 leaf_only_and_leaf_2_leaf_procedures 1 4.0 5413 top_of_fabric 2 4.0 5415 8.2.3. Registry RIFT_v4/common/IEEE802_1ASTimeStampType 5417 Timestamp per IEEE 802.1AS, all values MUST be interpreted in 5418 implementation as unsigned. 5420 8.2.3.1. Requested Entries 5422 Name Value Schema Version Description 5423 AS_sec 1 4.0 5424 AS_nsec 2 4.0 5426 8.2.4. Registry RIFT_v4/common/IPAddressType 5428 IP address type. 5430 8.2.4.1. Requested Entries 5432 Name Value Schema Version Description 5433 ipv4address 1 4.0 Content is IPv4 5434 ipv6address 2 4.0 Content is IPv6 5436 8.2.5. Registry RIFT_v4/common/IPPrefixType 5438 Prefix advertisement. 5440 @note: for interface addresses the protocol can propagate the address 5441 part beyond the subnet mask and on reachability computation that has 5442 to be normalized. The non-significant bits can be used for 5443 operational purposes. 5445 8.2.5.1. Requested Entries 5447 Name Value Schema Version Description 5448 ipv4prefix 1 4.0 5449 ipv6prefix 2 4.0 5451 8.2.6. Registry RIFT_v4/common/IPv4PrefixType 5453 IPv4 prefix type. 5455 8.2.6.1. Requested Entries 5457 Name Value Schema Version Description 5458 address 1 4.0 5459 prefixlen 2 4.0 5461 8.2.7. Registry RIFT_v4/common/IPv6PrefixType 5463 IPv6 prefix type. 5465 8.2.7.1. Requested Entries 5467 Name Value Schema Version Description 5468 address 1 4.0 5469 prefixlen 2 4.0 5471 8.2.8. Registry RIFT_v4/common/PrefixSequenceType 5473 Sequence of a prefix in case of move. 5475 8.2.8.1. Requested Entries 5477 Name Value Schema Description 5478 Version 5479 timestamp 1 4.0 5480 transactionid 2 4.0 Transaction ID set by client in e.g. 5481 in 6LoWPAN. 5483 8.2.9. Registry RIFT_v4/common/RouteType 5485 RIFT route types. 
5487 @note: route types MUST be ordered by their preference. PGP 5488 prefixes are the most preferred, attracting traffic north (towards the spine) 5489 and then south; normal prefixes attract traffic south (towards the 5490 leaves), i.e. a prefix in a NORTH PREFIX TIE is preferred over one in a SOUTH 5491 PREFIX TIE. 5493 @note: The only purpose of those values is to introduce an ordering, 5494 whereas an implementation can internally choose any other values as 5495 long as the ordering is preserved. 5497 8.2.9.1. Requested Entries 5499 Name Value Schema Version Description 5500 Illegal 0 4.0 5501 RouteTypeMinValue 1 4.0 5502 Discard 2 4.0 5503 LocalPrefix 3 4.0 5504 SouthPGPPrefix 4 4.0 5505 NorthPGPPrefix 5 4.0 5506 NorthPrefix 6 4.0 5507 NorthExternalPrefix 7 4.0 5508 SouthPrefix 8 4.0 5509 SouthExternalPrefix 9 4.0 5510 NegativeSouthPrefix 10 4.0 5511 RouteTypeMaxValue 11 4.0 5513 8.2.10. Registry RIFT_v4/common/TIETypeType 5515 Type of TIE. 5517 This enum indicates what TIE type the TIE is carrying. In case the 5518 value is not known to the receiver, the TIE MUST be re-flooded. This 5519 allows for future extensions of the protocol within the same major 5520 schema with types opaque to some nodes UNLESS the flooding scope is 5521 not the same as for a prefix TIE, in which case a major version revision MUST be 5522 performed. 5524 8.2.10.1. Requested Entries 5526 Name Value Schema Description 5527 Version 5528 Illegal 0 4.0 5529 TIETypeMinValue 1 4.0 5530 NodeTIEType 2 4.0 5531 PrefixTIEType 3 4.0 5532 PositiveDisaggregationPrefixTIEType 4 4.0 5533 NegativeDisaggregationPrefixTIEType 5 4.0 5534 PGPrefixTIEType 6 4.0 5535 KeyValueTIEType 7 4.0 5536 ExternalPrefixTIEType 8 4.0 5537 PositiveExternalDisaggregationPrefixTIEType 9 4.0 5538 TIETypeMaxValue 10 4.0 5540 8.2.11. Registry RIFT_v4/common/TieDirectionType 5542 Direction of TIEs. 5544 8.2.11.1.
Requested Entries 5546 Name Value Schema Version Description 5547 Illegal 0 4.0 5548 South 1 4.0 5549 North 2 4.0 5550 DirectionMaxValue 3 4.0 5552 8.2.12. Registry RIFT_v4/encoding/Community 5554 Prefix community. 5556 8.2.12.1. Requested Entries 5558 Name Value Schema Version Description 5559 top 1 4.0 Higher order bits 5560 bottom 2 4.0 Lower order bits 5562 8.2.13. Registry RIFT_v4/encoding/KeyValueTIEElement 5564 Generic key value pairs. 5566 8.2.13.1. Requested Entries 5568 Name Value Schema Version Description 5569 keyvalues 1 4.0 5571 8.2.14. Registry RIFT_v4/encoding/LIEPacket 5573 RIFT LIE Packet. 5575 @note: this node's level is already included in the packet header 5577 8.2.14.1. Requested Entries 5579 Name Value Schema Description 5580 Version 5581 name 1 4.0 Node or adjacency name. 5582 local_id 2 4.0 Local link ID. 5583 flood_port 3 4.0 UDP port on which we can 5584 receive flooded TIEs. 5585 link_mtu_size 4 4.0 Layer 3 MTU, used to 5586 discover mismatch. 5587 link_bandwidth 5 4.0 Local link bandwidth on 5588 the interface. 5589 neighbor 6 4.0 Reflects the neighbor once 5590 received to provide 5591 3-way connectivity. 5592 pod 7 4.0 Node's PoD. 5593 node_capabilities 10 4.0 Node capabilities shown in 5594 the LIE. The capabilities 5595 MUST match the capabilities 5596 shown in the Node TIEs, 5597 otherwise the behavior 5598 is unspecified. A node 5599 detecting the mismatch 5600 SHOULD generate an 5601 according error. 5602 link_capabilities 11 4.0 Capabilities of this link. 5603 holdtime 12 4.0 Required holdtime of the 5604 adjacency, i.e. how much 5605 time MUST expire 5606 without a LIE for the 5607 adjacency to drop. 5608 label 13 4.0 Unsolicited, downstream 5609 assigned, locally 5610 significant label value 5611 for the adjacency. 5612 not_a_ztp_offer 21 4.0 Indicates that the level 5613 on the LIE MUST NOT be used 5614 to derive a ZTP level by 5615 the receiving node.
5616 you_are_flood_repeater 22 4.0 Indicates to northbound 5617 neighbor that it should 5618 be reflooding this node's 5619 North TIEs to achieve flood 5620 reduction and balancing 5621 for northbound flooding. To 5622 be ignored if received 5623 from a northbound 5624 adjacency. 5625 you_are_sending_too_quickly 23 4.0 Can be optionally set to 5626 indicate to neighbor that 5627 packet losses are seen 5628 on reception based on 5629 packet numbers or the rate 5630 is too high. The 5631 receiver SHOULD temporarily 5632 slow down flooding 5633 rates. 5634 instance_name 24 4.0 Instance name in case 5635 multiple RIFT instances 5636 running on same 5637 interface. 5639 8.2.15. Registry RIFT_v4/encoding/LinkCapabilities 5641 Link capabilities. 5643 8.2.15.1. Requested Entries 5645 Name Value Schema Description 5646 Version 5647 bfd 1 4.0 Indicates that the link is 5648 supporting BFD. 5649 v4_forwarding_capable 2 4.0 Indicates whether the interface 5650 will support v4 forwarding. 5652 8.2.16. Registry RIFT_v4/encoding/LinkIDPair 5654 LinkID pair describes one of parallel links between two nodes. 5656 8.2.16.1. Requested Entries 5657 Name Value Schema Description 5658 Version 5659 local_id 1 4.0 Node-wide unique value for 5660 the local link. 5661 remote_id 2 4.0 Received remote link ID for 5662 this link. 5663 platform_interface_index 10 4.0 Describes the local 5664 interface index of the link. 5665 platform_interface_name 11 4.0 Describes the local 5666 interface name. 5667 trusted_outer_security_key 12 4.0 Indication whether the link 5668 is secured, i.e. protected 5669 by outer key, absence of 5670 this element means no 5671 indication, undefined 5672 outer key means not secured. 5673 bfd_up 13 4.0 Indication whether the link 5674 is protected by established 5675 BFD session. 5677 8.2.17. Registry RIFT_v4/encoding/Neighbor 5679 Neighbor structure. 5681 8.2.17.1. 
Requested Entries 5683 Name Value Schema Version Description 5684 originator 1 4.0 System ID of the originator. 5685 remote_id 2 4.0 ID of the remote side of the link. 5687 8.2.18. Registry RIFT_v4/encoding/NodeCapabilities 5689 Capabilities the node supports. 5691 @note: The schema may add to this field future capabilities to 5692 indicate whether it will support interpretation of future schema 5693 extensions on the same major revision. Such fields MUST be optional 5694 and have an implicit or explicit false default value. If a future 5695 capability changes route selection or generates blackholes if some 5696 nodes are not supporting it, then a major version increment is 5697 unavoidable. 5699 8.2.18.1. Requested Entries 5700 Name Value Schema Description 5701 Version 5702 protocol_minor_version 1 4.0 Must advertise the supported minor 5703 version dialect that way. 5704 flood_reduction 2 4.0 Can this node participate in 5705 flood reduction. 5706 hierarchy_indications 3 4.0 Does this node restrict itself 5707 to be top-of-fabric or leaf 5708 only (in ZTP) and does it 5709 support leaf-2-leaf 5710 procedures. 5712 8.2.19. Registry RIFT_v4/encoding/NodeFlags 5714 Indication flags of the node. 5716 8.2.19.1. Requested Entries 5718 Name Value Schema Description 5719 Version 5720 overload 1 4.0 Indicates that the node is in overload; do not 5721 transit traffic through it. 5723 8.2.20. Registry RIFT_v4/encoding/NodeNeighborsTIEElement 5725 Neighbor of a node. 5727 8.2.20.1. Requested Entries 5729 Name Value Schema Description 5730 Version 5731 level 1 4.0 Level of the neighbor. 5732 cost 3 4.0 Cost to the neighbor. 5733 link_ids 4 4.0 Can carry descriptions of multiple parallel 5734 links in a TIE. 5735 bandwidth 5 4.0 Total bandwidth to the neighbor; this will 5736 normally be the sum of the bandwidths of all the 5737 parallel links. 5739 8.2.21. Registry RIFT_v4/encoding/NodeTIEElement 5741 Description of a node.
5743 It may occur multiple times in different TIEs, but if either 5745 capabilities values do not match or 5747 flags values do not match or 5748 neighbors repeat with different values 5750 the behavior is undefined and a warning SHOULD be generated. 5751 Neighbors can, however, be distributed across multiple TIEs as long as the sets 5752 are disjoint. Miscablings SHOULD be repeated in every node TIE, 5753 otherwise the behavior is undefined. 5755 @note: Observe that absence of fields implies defined defaults. 5757 8.2.21.1. Requested Entries 5759 Name Value Schema Description 5760 Version 5761 level 1 4.0 Level of the node. 5762 neighbors 2 4.0 Node's neighbors. If a neighbor systemID 5763 repeats in other node TIEs of the same 5764 node, the behavior is undefined. 5765 capabilities 3 4.0 Capabilities of the node. 5766 flags 4 4.0 Flags of the node. 5767 name 5 4.0 Optional node name for easier 5768 operations. 5769 pod 6 4.0 PoD to which the node belongs. 5770 miscabled_links 10 4.0 If any local links are miscabled, the 5771 indication is flooded. 5773 8.2.22. Registry RIFT_v4/encoding/PacketContent 5775 Content of a RIFT packet. 5777 8.2.22.1. Requested Entries 5779 Name Value Schema Version Description 5780 lie 1 4.0 5781 tide 2 4.0 5782 tire 3 4.0 5783 tie 4 4.0 5785 8.2.23. Registry RIFT_v4/encoding/PacketHeader 5787 Common RIFT packet header. 5789 8.2.23.1. Requested Entries 5790 Name Value Schema Description 5791 Version 5792 major_version 1 4.0 Major version of the protocol. 5793 minor_version 2 4.0 Minor version of the protocol. 5794 sender 3 4.0 Node sending the packet; in case of 5795 LIE/TIRE/TIDE also the originator of 5796 it. 5797 level 4 4.0 Level of the node sending the packet, 5798 required on everything except LIEs. 5799 Lack of presence on LIEs indicates 5800 UNDEFINED_LEVEL and is used in ZTP 5801 procedures. 5803 8.2.24. Registry RIFT_v4/encoding/PrefixAttributes 5805 Attributes of a prefix. 5807 8.2.24.1.
Requested Entries 5809 Name Value Schema Description 5810 Version 5811 metric 2 4.0 Distance of the prefix. 5812 tags 3 4.0 Generic unordered set of route tags; 5813 can be redistributed to other 5814 protocols or used within the context 5815 of real-time analytics. 5816 monotonic_clock 4 4.0 Monotonic clock for mobile 5817 addresses. 5818 loopback 6 4.0 Indicates if the interface is a node 5819 loopback. 5820 directly_attached 7 4.0 Indicates that the prefix is 5821 directly attached, i.e. should be 5822 routed to even if the node is in 5823 overload. 5824 from_link 10 4.0 In case of locally originated 5825 prefixes, i.e. interface 5826 addresses, this can describe which 5827 link the address belongs to. 5829 8.2.25. Registry RIFT_v4/encoding/PrefixTIEElement 5831 TIE carrying prefixes. 5833 8.2.25.1. Requested Entries 5834 Name Value Schema Description 5835 Version 5836 prefixes 1 4.0 Prefixes with the associated attributes. 5837 If the same prefix repeats in multiple TIEs of the 5838 same node, behavior is unspecified. 5840 8.2.26. Registry RIFT_v4/encoding/ProtocolPacket 5842 RIFT packet structure. 5844 8.2.26.1. Requested Entries 5846 Name Value Schema Version Description 5847 header 1 4.0 5848 content 2 4.0 5850 8.2.27. Registry RIFT_v4/encoding/TIDEPacket 5852 TIDE with sorted TIE headers; if headers are unsorted, behavior is 5853 undefined. 5855 8.2.27.1. Requested Entries 5857 Name Value Schema Version Description 5858 start_range 1 4.0 First TIE header in the tide 5859 packet. 5860 end_range 2 4.0 Last TIE header in the tide packet. 5861 headers 3 4.0 _Sorted_ list of headers. 5863 8.2.28. Registry RIFT_v4/encoding/TIEElement 5865 Single element in a TIE. 5867 Schema enum `common.TIETypeType` in the TIEID indicates which elements 5868 MUST be present in the TIEElement. In case of a mismatch, the 5869 unexpected elements MUST be ignored. If an expected 5870 element is missing, an error MUST be reported and the TIE MUST be 5871 ignored.
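The element-presence rule above can be sketched as follows. This is a minimal Python illustration; the type-to-element mapping simply restates the registry entries of this section, and the function name is hypothetical rather than part of the schema.

```python
# Illustrative mapping from TIE type to the element that MUST be present,
# following the TIEElement registry entries in this section.
EXPECTED = {
    "NodeTIEType": "node",
    "PrefixTIEType": "prefixes",
    "PositiveDisaggregationPrefixTIEType": "positive_disaggregation_prefixes",
    "NegativeDisaggregationPrefixTIEType": "negative_disaggregation_prefixes",
    "ExternalPrefixTIEType": "external_prefixes",
    "PositiveExternalDisaggregationPrefixTIEType":
        "positive_external_disaggregation_prefixes",
    "KeyValueTIEType": "keyvalues",
}

def validate_tie_element(tietype, element):
    """Return the expected element for the given TIE type.

    Unexpected elements present in the TIEElement are simply ignored;
    a missing expected element makes the whole TIE invalid.
    """
    expected = EXPECTED[tietype]
    if expected not in element:
        raise ValueError(f"TIE of type {tietype} lacks '{expected}'")
    return element[expected]
```

A caller would report the raised error and drop the TIE, while any extra keys in `element` are silently disregarded, matching the MUST-ignore rule above.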
5873 This type can be extended with new optional elements for new 5874 `common.TIETypeType` values without breaking the major version, but if it is 5875 necessary to understand whether all nodes support the new type, a node 5876 capability must be added as well. 5878 8.2.28.1. Requested Entries 5880 Name Value Schema Version Description 5883 node 1 4.0 Used in case of enum 5884 common.TIETypeType.NodeTIEType. 5886 prefixes 2 4.0 Used in case of enum 5887 common.TIETypeType.PrefixTIEType. 5889 positive_disaggregation_prefixes 3 4.0 Positive prefixes (always 5890 southbound). It MUST 5891 NOT be advertised within a 5892 North TIE and ignored 5893 otherwise. 5894 negative_disaggregation_prefixes 5 4.0 Transitive, negative 5895 prefixes (always 5896 southbound) which MUST 5897 be aggregated and 5898 propagated according 5899 to the specification 5900 southwards towards lower 5901 levels to heal 5902 pathological upper-level 5903 partitioning, otherwise 5904 blackholes may occur in 5905 multiplane fabrics. It 5906 MUST NOT be advertised 5907 within a North TIE. 5908 external_prefixes 6 4.0 Externally reimported 5909 prefixes. 5910 positive_external_disaggregation_prefixes 7 4.0 Positive external 5911 disaggregated prefixes 5912 (always southbound). 5913 It MUST NOT be advertised 5914 within a North TIE and 5915 ignored otherwise. 5916 keyvalues 9 4.0 Key-Value store elements. 5918 8.2.29. Registry RIFT_v4/encoding/TIEHeader 5920 Header of a TIE. 5922 @note: TIEID space is a total order achieved by comparing the 5923 elements in the sequence defined and comparing each value as an unsigned 5924 integer of according length. 5926 @note: After the sequence number, the lifetime received on the envelope 5927 must be used for comparison before further fields. 5929 @note: `origination_time` and `origination_lifetime` are disregarded 5930 for comparison purposes and carried purely for debugging/security 5931 purposes if present. 5933 8.2.29.1.
Requested Entries 5935 Name Value Schema Description 5936 Version 5937 tieid 2 4.0 ID of the tie. 5938 seq_nr 3 4.0 Sequence number of the tie. 5939 origination_time 10 4.0 Absolute timestamp when the TIE 5940 was generated. This can be used on 5941 fabrics with synchronized 5942 clock to prevent lifetime 5943 modification attacks. 5944 origination_lifetime 12 4.0 Original lifetime when the TIE 5945 was generated. This can be used on 5946 fabrics with synchronized 5947 clock to prevent lifetime 5948 modification attacks. 5950 8.2.30. Registry RIFT_v4/encoding/TIEHeaderWithLifeTime 5952 Header of a TIE as described in TIRE/TIDE. 5954 8.2.30.1. Requested Entries 5956 Name Value Schema Description 5957 Version 5958 header 1 4.0 5959 remaining_lifetime 2 4.0 Remaining lifetime that expires 5960 down to 0 just like in ISIS. 5961 TIEs with lifetimes differing by 5962 less than `lifetime_diff2ignore` 5963 MUST be considered EQUAL. 5965 8.2.31. Registry RIFT_v4/encoding/TIEID 5967 ID of a TIE. 5969 @note: TIEID space is a total order achieved by comparing the 5970 elements in sequence defined and comparing each value as an unsigned 5971 integer of according length. 5973 8.2.31.1. Requested Entries 5975 Name Value Schema Version Description 5976 direction 1 4.0 direction of TIE 5977 originator 2 4.0 indicates originator of the TIE 5978 tietype 3 4.0 type of the tie 5979 tie_nr 4 4.0 number of the tie 5981 8.2.32. Registry RIFT_v4/encoding/TIEPacket 5983 TIE packet 5985 8.2.32.1. Requested Entries 5987 Name Value Schema Version Description 5988 header 1 4.0 5989 element 2 4.0 5991 8.2.33. Registry RIFT_v4/encoding/TIREPacket 5993 TIRE packet 5995 8.2.33.1. Requested Entries 5997 Name Value Schema Version Description 5998 headers 1 4.0 6000 9. Acknowledgments 6002 A new routing protocol in its complexity is not a product of a parent 6003 but of a village as the author list shows already. 
However, many 6004 more people provided input and fine-combed the specification based on 6005 their experience in the design, implementation, or application of 6006 protocols in IP fabrics. This section makes an inadequate 6007 attempt at recording their contributions. 6009 Many thanks to Naiming Shen for some of the early discussions around 6010 the topic of using IGPs for routing in topologies related to Clos. 6011 Russ White is to be especially acknowledged for the key conversation on 6012 epistemology that allowed tying current asynchronous distributed 6013 systems theory results to the modern protocol design presented in this 6014 scope. Adrian Farrel, Joel Halpern, Jeffrey Zhang, Krzysztof 6015 Szarkowicz, Nagendra Kumar, Melchior Aelmans, Kaushal Tank, Will 6016 Jones, Moin Ahmed and Jordan Head provided thoughtful comments that 6017 improved the readability of the document and found a good amount of 6018 corners where the light failed to shine. Kris Price was the first to 6019 mention single router, single arm default considerations. Jeff 6020 Tantsura helped out with some initial thoughts on BFD interactions 6021 while Jeff Haas corrected several misconceptions about BFD's finer 6022 points. Artur Makutunowicz pointed out many possible improvements 6023 and acted as a sounding board in regard to the modern protocol 6024 implementation techniques RIFT is exploring. Barak Gafni was the first 6025 to clearly formalize the problem of a partitioned spine and fallen leaves 6026 on a (clean) napkin in Singapore, which led to the very important part 6027 of the specification centered around multiple Top-of-Fabric planes 6028 and negative disaggregation. Igor Gashinsky and others shared many 6029 thoughts on problems encountered in the design and operation of large- 6030 scale data center fabrics. Xu Benchong found a delicate error in the 6031 flooding procedures and a schema datatype size mismatch.
6033 Last but not least, Alvaro Retana guided the undertaking by asking 6034 many necessary procedural and technical questions which not only 6035 improved the content but also laid out the track towards 6036 publication. 6038 10. References 6040 10.1. Normative References 6042 [EUI64] IEEE, "Guidelines for Use of Extended Unique Identifier 6043 (EUI), Organizationally Unique Identifier (OUI), and 6044 Company ID (CID)", IEEE EUI, 6045 . 6047 [ISO10589] 6048 ISO "International Organization for Standardization", 6049 "Intermediate system to Intermediate system intra-domain 6050 routeing information exchange protocol for use in 6051 conjunction with the protocol for providing the 6052 connectionless-mode Network Service (ISO 8473), ISO/IEC 6053 10589:2002, Second Edition.", Nov 2002. 6055 [RFC1982] Elz, R. and R. Bush, "Serial Number Arithmetic", RFC 1982, 6056 DOI 10.17487/RFC1982, August 1996, 6057 . 6059 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 6060 DOI 10.17487/RFC2328, April 1998, 6061 . 6063 [RFC2365] Meyer, D., "Administratively Scoped IP Multicast", BCP 23, 6064 RFC 2365, DOI 10.17487/RFC2365, July 1998, 6065 . 6067 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 6068 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 6069 DOI 10.17487/RFC4271, January 2006, 6070 . 6072 [RFC4291] Hinden, R. and S. Deering, "IP Version 6 Addressing 6073 Architecture", RFC 4291, DOI 10.17487/RFC4291, February 6074 2006, . 6076 [RFC5082] Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C. 6077 Pignataro, "The Generalized TTL Security Mechanism 6078 (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007, 6079 . 6081 [RFC5120] Przygienda, T., Shen, N., and N. Sheth, "M-ISIS: Multi 6082 Topology (MT) Routing in Intermediate System to 6083 Intermediate Systems (IS-ISs)", RFC 5120, 6084 DOI 10.17487/RFC5120, February 2008, 6085 . 6087 [RFC5303] Katz, D., Saluja, R., and D.
Eastlake 3rd, "Three-Way 6088 Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303, 6089 DOI 10.17487/RFC5303, October 2008, 6090 . 6092 [RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network 6093 Layer Reachability Information with an IPv6 Next Hop", 6094 RFC 5549, DOI 10.17487/RFC5549, May 2009, 6095 . 6097 [RFC5709] Bhatia, M., Manral, V., Fanto, M., White, R., Barnes, M., 6098 Li, T., and R. Atkinson, "OSPFv2 HMAC-SHA Cryptographic 6099 Authentication", RFC 5709, DOI 10.17487/RFC5709, October 6100 2009, . 6102 [RFC5881] Katz, D. and D. Ward, "Bidirectional Forwarding Detection 6103 (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881, 6104 DOI 10.17487/RFC5881, June 2010, 6105 . 6107 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 6108 "Network Time Protocol Version 4: Protocol and Algorithms 6109 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 6110 . 6112 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 6113 S. Ray, "North-Bound Distribution of Link-State and 6114 Traffic Engineering (TE) Information Using BGP", RFC 7752, 6115 DOI 10.17487/RFC7752, March 2016, 6116 . 6118 [RFC7987] Ginsberg, L., Wells, P., Decraene, B., Przygienda, T., and 6119 H. Gredler, "IS-IS Minimum Remaining Lifetime", RFC 7987, 6120 DOI 10.17487/RFC7987, October 2016, 6121 . 6123 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 6124 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 6125 May 2017, . 6127 [RFC8200] Deering, S. and R. Hinden, "Internet Protocol, Version 6 6128 (IPv6) Specification", STD 86, RFC 8200, 6129 DOI 10.17487/RFC8200, July 2017, 6130 . 6132 [RFC8202] Ginsberg, L., Previdi, S., and W. Henderickx, "IS-IS 6133 Multi-Instance", RFC 8202, DOI 10.17487/RFC8202, June 6134 2017, . 6136 [RFC8505] Thubert, P., Ed., Nordmark, E., Chakrabarti, S., and C. 
6137 Perkins, "Registration Extensions for IPv6 over Low-Power 6138 Wireless Personal Area Network (6LoWPAN) Neighbor 6139 Discovery", RFC 8505, DOI 10.17487/RFC8505, November 2018, 6140 . 6142 [thrift] Apache Software Foundation, "Thrift Interface Description 6143 Language", . 6145 10.2. Informative References 6147 [CLOS] Yuan, X., "On Nonblocking Folded-Clos Networks in Computer 6148 Communication Environments", IEEE International Parallel & 6149 Distributed Processing Symposium, 2011. 6151 [DIJKSTRA] 6152 Dijkstra, E., "A Note on Two Problems in Connexion with 6153 Graphs", Journal Numer. Math. , 1959. 6155 [DOT] Ellson, J. and L. Koutsofios, "Graphviz: open source graph 6156 drawing tools", Springer-Verlag , 2001. 6158 [DYNAMO] De Candia et al., G., "Dynamo: amazon's highly available 6159 key-value store", ACM SIGOPS symposium on Operating 6160 systems principles (SOSP '07), 2007. 6162 [EPPSTEIN] 6163 Eppstein, D., "Finding the k-Shortest Paths", 1997. 6165 [FATTREE] Leiserson, C., "Fat-Trees: Universal Networks for 6166 Hardware-Efficient Supercomputing", 1985. 6168 [IEEEstd1588] 6169 IEEE, "IEEE Standard for a Precision Clock Synchronization 6170 Protocol for Networked Measurement and Control Systems", 6171 IEEE Standard 1588, 6172 . 6174 [IEEEstd8021AS] 6175 IEEE, "IEEE Standard for Local and Metropolitan Area 6176 Networks - Timing and Synchronization for Time-Sensitive 6177 Applications in Bridged Local Area Networks", 6178 IEEE Standard 802.1AS, 6179 . 6181 [ISO10589-Second-Edition] 6182 International Organization for Standardization, 6183 "Intermediate system to Intermediate system intra-domain 6184 routeing information exchange protocol for use in 6185 conjunction with the protocol for providing the 6186 connectionless-mode Network Service (ISO 8473)", Nov 2002. 
6188 [RFC0826] Plummer, D., "An Ethernet Address Resolution Protocol: Or 6189 Converting Network Protocol Addresses to 48.bit Ethernet 6190 Address for Transmission on Ethernet Hardware", STD 37, 6191 RFC 826, DOI 10.17487/RFC0826, November 1982, 6192 . 6194 [RFC2131] Droms, R., "Dynamic Host Configuration Protocol", 6195 RFC 2131, DOI 10.17487/RFC2131, March 1997, 6196 . 6198 [RFC3626] Clausen, T., Ed. and P. Jacquet, Ed., "Optimized Link 6199 State Routing Protocol (OLSR)", RFC 3626, 6200 DOI 10.17487/RFC3626, October 2003, 6201 . 6203 [RFC4861] Narten, T., Nordmark, E., Simpson, W., and H. Soliman, 6204 "Neighbor Discovery for IP version 6 (IPv6)", RFC 4861, 6205 DOI 10.17487/RFC4861, September 2007, 6206 . 6208 [RFC4862] Thomson, S., Narten, T., and T. Jinmei, "IPv6 Stateless 6209 Address Autoconfiguration", RFC 4862, 6210 DOI 10.17487/RFC4862, September 2007, 6211 . 6213 [RFC6518] Lebovitz, G. and M. Bhatia, "Keying and Authentication for 6214 Routing Protocols (KARP) Design Guidelines", RFC 6518, 6215 DOI 10.17487/RFC6518, February 2012, 6216 . 6218 [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of 6219 BGP for Routing in Large-Scale Data Centers", RFC 7938, 6220 DOI 10.17487/RFC7938, August 2016, 6221 . 6223 [RFC8415] Mrugalski, T., Siodelski, M., Volz, B., Yourtchenko, A., 6224 Richardson, M., Jiang, S., Lemon, T., and T. Winters, 6225 "Dynamic Host Configuration Protocol for IPv6 (DHCPv6)", 6226 RFC 8415, DOI 10.17487/RFC8415, November 2018, 6227 . 6229 [VAHDAT08] 6230 Al-Fares, M., Loukissas, A., and A. Vahdat, "A Scalable, 6231 Commodity Data Center Network Architecture", SIGCOMM , 6232 2008. 6234 [Wikipedia] 6235 Wikipedia, "Serial number arithmetic", 6236 <https://en.wikipedia.org/wiki/Serial_number_arithmetic>, 6237 2016. 6239 Appendix A. Sequence Number Binary Arithmetic 6241 The only reasonable reference to a sequence 6242 number solution cleaner than [RFC1982] is given in [Wikipedia]. It basically converts the 6243 problem into two's complement arithmetic.
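As a non-normative illustration of that conversion (the helper names below are ours, not part of any schema; the formal >: and =: definitions follow), the comparison can be sketched in Python:

```python
def to_signed(value, bits):
    """Interpret an unsigned value as a two's complement signed integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def seq_compare(u1, u2, bits=12):
    """Compare two unsigned sequence numbers of the given bit-width.

    Returns '>', '<', '=' or '?', where '?' marks the undefined case of
    operands exactly half the number space apart."""
    d_f = to_signed(u1 - u2, bits)  # forward difference, D_f
    d_b = to_signed(u2 - u1, bits)  # backward difference, D_b
    if d_f == 0:
        return '='
    if d_f > 0 and d_b < 0:
        return '>'
    if d_f < 0 and d_b > 0:
        return '<'
    return '?'
```

For 3-bit arithmetic this reproduces the relationship table shown at the end of this appendix, e.g. `seq_compare(4, 0, 3)` yields `'?'`; in the 12-bit case `seq_compare(0, 0x800)` likewise yields `'?'`.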
Assuming straight two's 6244 complement subtraction on the bit-width of the sequence number, the 6245 according >: and =: relations are defined as: 6247 U_1, U_2 are 12-bit unsigned sequence numbers 6249 D_f is ( U_1 - U_2 ) interpreted as a two's complement signed 12-bit value 6250 D_b is ( U_2 - U_1 ) interpreted as a two's complement signed 12-bit value 6252 U_1 >: U_2 IFF D_f > 0 AND D_b < 0 6253 U_1 =: U_2 IFF D_f = 0 6255 The >: relationship is anti-symmetric but not transitive. Observe 6256 that this leaves >: of the numbers having maximum two's complement 6257 distance, e.g. ( 0 and 0x800 ) undefined in our 12-bit case since 6258 D_f and D_b are both -0x800. 6260 A simple example of the relationship in case of 3-bit arithmetic 6261 follows as a table indicating D_f/D_b values and then the relationship 6262 of U_1 to U_2: 6264 U2 / U1 0 1 2 3 4 5 6 7 6265 0 +/+ +/- +/- +/- -/- -/+ -/+ -/+ 6266 1 -/+ +/+ +/- +/- +/- -/- -/+ -/+ 6267 2 -/+ -/+ +/+ +/- +/- +/- -/- -/+ 6268 3 -/+ -/+ -/+ +/+ +/- +/- +/- -/- 6269 4 -/- -/+ -/+ -/+ +/+ +/- +/- +/- 6270 5 +/- -/- -/+ -/+ -/+ +/+ +/- +/- 6271 6 +/- +/- -/- -/+ -/+ -/+ +/+ +/- 6272 7 +/- +/- +/- -/- -/+ -/+ -/+ +/+ 6274 U2 / U1 0 1 2 3 4 5 6 7 6275 0 = > > > ? < < < 6276 1 < = > > > ? < < 6277 2 < < = > > > ? < 6278 3 < < < = > > > ? 6279 4 ? < < < = > > > 6280 5 > ? < < < = > > 6281 6 > > ? < < < = > 6282 7 > > > ? < < < = 6284 Appendix B. Information Elements Schema 6286 This section introduces the schema for information elements. The IDL 6287 is Thrift [thrift]. 6289 On schema changes that 6291 1. change field numbers or 6293 2. add new *required* fields or 6294 3. remove any fields or 6296 4. change lists into sets, unions into structures or 6298 5. change multiplicity of fields or 6300 6. change the name of any field or type or 6302 7. change data types of any field or 6304 8. add, change or remove a default value of any *existing* field 6305 or 6307 9. remove or change any defined constant or constant value or 6309 10.
change any enumeration type except extending 6310 `common.TIETypeType` (use of enumeration types is generally 6311 discouraged) 6313 the major version of the schema MUST increase. All other changes MUST 6314 increase the minor version within the same major version. 6316 The above set of rules guarantees that every decoder can process 6317 serialized content generated by a higher minor version of the schema 6318 and with that the protocol can progress without a 'fork-lift'. 6319 Additionally, based on the propagated minor version in encoded 6320 content and added optional node capabilities, new TIE types or even 6321 de-facto mandatory fields can be introduced without progressing the 6322 major version, albeit only nodes supporting such new extensions would 6323 decode them. Given the model is encoded at the source and never re- 6324 encoded, flooding through nodes not understanding any new extensions 6325 will preserve the according fields. 6327 Content serialized using a major version X is NOT expected to be 6328 decodable by any implementation using a decoder for a model with a 6329 major version lower than X. 6331 Observe especially that introducing an optional field does not cause 6332 a major version increase even if the fields inside the structure are 6333 optional with defaults. 6335 All signed integers, as forced by Thrift [thrift] support, must be cast 6336 for internal purposes to equivalent unsigned values without 6337 discarding the signedness bit. An implementation SHOULD try to avoid 6338 using the signedness bit when generating values. 6340 The schema is normative. 6342 B.1. common.thrift 6344 /** 6345 Thrift file with common definitions for RIFT 6346 */ 6348 /** @note MUST be interpreted in implementation as unsigned 64 bits. 6349 * The implementation SHOULD NOT use the MSB.
6350 */ 6351 typedef i64 SystemIDType 6352 typedef i32 IPv4Address 6353 /** this has to be long enough to accommodate a prefix */ 6354 typedef binary IPv6Address 6355 /** @note MUST be interpreted in implementation as unsigned */ 6356 typedef i16 UDPPortType 6357 /** @note MUST be interpreted in implementation as unsigned */ 6358 typedef i32 TIENrType 6359 /** @note MUST be interpreted in implementation as unsigned */ 6360 typedef i32 MTUSizeType 6361 /** @note MUST be interpreted in implementation as unsigned 6362 rolling over number */ 6363 typedef i64 SeqNrType 6364 /** @note MUST be interpreted in implementation as unsigned */ 6365 typedef i32 LifeTimeInSecType 6366 /** @note MUST be interpreted in implementation as unsigned */ 6367 typedef i8 LevelType 6368 /** optional, recommended monotonically increasing number 6369 _per packet type per adjacency_ 6370 that can be used to detect losses/misordering/restarts. 6371 @note MUST be interpreted in implementation as unsigned 6372 rolling over number */ 6373 typedef i16 PacketNumberType 6374 /** @note MUST be interpreted in implementation as unsigned */ 6375 typedef i32 PodType 6376 /** @note MUST be interpreted in implementation as unsigned. 6377 This is carried in the 6378 security envelope and MUST fit into 8 bits.
*/ 6379 typedef i8 VersionType 6380 /** @note MUST be interpreted in implementation as unsigned */ 6381 typedef i16 MinorVersionType 6382 /** @note MUST be interpreted in implementation as unsigned */ 6383 typedef i32 MetricType 6384 /** @note MUST be interpreted in implementation as unsigned 6385 and unstructured */ 6386 typedef i64 RouteTagType 6387 /** @note MUST be interpreted in implementation as unstructured 6388 label value */ 6389 typedef i32 LabelType 6390 /** @note MUST be interpreted in implementation as unsigned */ 6391 typedef i32 BandwithInMegaBitsType 6392 /** @note Key Value key ID type */ 6393 typedef string KeyIDType 6394 /** node local, unique identification for a link (interface/tunnel 6395 * etc. Basically anything RIFT runs on). This is kept 6396 * at 32 bits so it aligns with BFD [RFC5880] discriminator size. 6397 */ 6398 typedef i32 LinkIDType 6399 typedef string KeyNameType 6400 typedef i8 PrefixLenType 6401 /** timestamp in seconds since the epoch */ 6402 typedef i64 TimestampInSecsType 6403 /** security nonce. 6404 @note MUST be interpreted in implementation as rolling 6405 over unsigned value */ 6406 typedef i16 NonceType 6407 /** LIE FSM holdtime type */ 6408 typedef i16 TimeIntervalInSecType 6409 /** Transaction ID type for prefix mobility as specified by RFC6550, 6410 value MUST be interpreted in implementation as unsigned */ 6411 typedef i8 PrefixTransactionIDType 6412 /** Timestamp per IEEE 802.1AS, all values MUST be interpreted in 6413 implementation as unsigned. */ 6414 struct IEEE802_1ASTimeStampType { 6415 1: required i64 AS_sec; 6416 2: optional i32 AS_nsec; 6417 } 6418 /** generic counter type */ 6419 typedef i64 CounterType 6420 /** Platform Interface Index type, i.e. index of interface on hardware, 6421 can be used e.g. with RFC5837 */ 6422 typedef i32 PlatformInterfaceIndex 6424 /** Flags indicating node configuration in case of ZTP. 
6425 */ 6426 enum HierarchyIndications { 6427 /** forces level to `leaf_level` and enables according procedures */ 6428 leaf_only = 0, 6429 /** forces level to `leaf_level` and enables according procedures */ 6430 leaf_only_and_leaf_2_leaf_procedures = 1, 6431 /** forces level to `top_of_fabric` and enables according 6432 procedures */ 6433 top_of_fabric = 2, 6434 } 6435 const PacketNumberType undefined_packet_number = 0 6436 /** This MUST be used when a node is configured as top of fabric in ZTP. 6437 This is kept reasonably low to allow for fast ZTP convergence on 6438 failures. */ 6439 const LevelType top_of_fabric_level = 24 6440 /** default bandwidth on a link */ 6441 const BandwithInMegaBitsType default_bandwidth = 100 6442 /** fixed leaf level when ZTP is not used */ 6443 const LevelType leaf_level = 0 6444 const LevelType default_level = leaf_level 6445 const PodType default_pod = 0 6446 const LinkIDType undefined_linkid = 0 6448 /** default distance used */ 6449 const MetricType default_distance = 1 6450 /** any distance larger than this will be considered infinity */ 6451 const MetricType infinite_distance = 0x7FFFFFFF 6452 /** represents invalid distance */ 6453 const MetricType invalid_distance = 0 6454 const bool overload_default = false 6455 const bool flood_reduction_default = true 6456 /** default LIE FSM holddown time */ 6457 const TimeIntervalInSecType default_lie_holdtime = 3 6458 /** default ZTP FSM holddown time */ 6459 const TimeIntervalInSecType default_ztp_holdtime = 1 6460 /** by default LIE levels are ZTP offers */ 6461 const bool default_not_a_ztp_offer = false 6462 /** by default everyone is repeating flooding */ 6463 const bool default_you_are_flood_repeater = true 6464 /** 0 is illegal for SystemID */ 6465 const SystemIDType IllegalSystemID = 0 6466 /** empty set of nodes */ 6467 const set<SystemIDType> empty_set_of_nodeids = {} 6468 /** default lifetime of TIE is one week */ 6469 const LifeTimeInSecType default_lifetime = 604800 6470 /** default
lifetime when TIEs are purged is 5 minutes */ 6471 const LifeTimeInSecType purge_lifetime = 300 6472 /** round down interval when TIEs are sent with security hashes 6473 to prevent excessive computation. */ 6474 const LifeTimeInSecType rounddown_lifetime_interval = 60 6475 /** any `TieHeader` that has a smaller lifetime difference 6476 than this constant is equal (if other fields equal). This 6477 constant MUST be larger than `purge_lifetime` to avoid 6478 retransmissions */ 6479 const LifeTimeInSecType lifetime_diff2ignore = 400 6481 /** default UDP port to run LIEs on */ 6482 const UDPPortType default_lie_udp_port = 914 6483 /** default UDP port to receive TIEs on, that can be peer specific */ 6484 const UDPPortType default_tie_udp_flood_port = 915 6486 /** default MTU link size to use */ 6487 const MTUSizeType default_mtu_size = 1400 6488 /** default link being BFD capable */ 6489 const bool bfd_default = true 6491 /** undefined nonce, equivalent to missing nonce */ 6492 const NonceType undefined_nonce = 0; 6493 /** outer security key id, MUST be interpreted in implementation 6494 as unsigned */ 6495 typedef i8 OuterSecurityKeyID 6496 /** security key id, MUST be interpreted in implementation 6497 as unsigned */ 6498 typedef i32 TIESecurityKeyID 6499 /** undefined key */ 6500 const TIESecurityKeyID undefined_securitykey_id = 0; 6501 /** Maximum delta (negative or positive) that a mirrored nonce can 6502 deviate from local value to be considered valid. If nonces are 6503 changed every minute on both sides this opens statistically 6504 a `maximum_valid_nonce_delta` minutes window of identical LIEs, 6505 TIE, TI(x)E replays. 6506 The interval cannot be too small since LIE FSM may change 6507 states fairly quickly during ZTP without sending LIEs */ 6508 const i16 maximum_valid_nonce_delta = 5; 6510 /** Direction of TIEs.
*/ 6511 enum TieDirectionType { 6512 Illegal = 0, 6513 South = 1, 6514 North = 2, 6515 DirectionMaxValue = 3, 6516 } 6518 /** Address family type. */ 6519 enum AddressFamilyType { 6520 Illegal = 0, 6521 AddressFamilyMinValue = 1, 6522 IPv4 = 2, 6523 IPv6 = 3, 6524 AddressFamilyMaxValue = 4, 6525 } 6527 /** IPv4 prefix type. */ 6528 struct IPv4PrefixType { 6529 1: required IPv4Address address; 6530 2: required PrefixLenType prefixlen; 6532 } 6534 /** IPv6 prefix type. */ 6535 struct IPv6PrefixType { 6536 1: required IPv6Address address; 6537 2: required PrefixLenType prefixlen; 6538 } 6540 /** IP address type. */ 6541 union IPAddressType { 6542 /** Content is IPv4 */ 6543 1: optional IPv4Address ipv4address; 6544 /** Content is IPv6 */ 6545 2: optional IPv6Address ipv6address; 6546 } 6548 /** Prefix advertisement. 6550 @note: for interface 6551 addresses the protocol can propagate the address part beyond 6552 the subnet mask and on reachability computation that has to 6553 be normalized. The non-significant bits can be used 6554 for operational purposes. 6555 */ 6556 union IPPrefixType { 6557 1: optional IPv4PrefixType ipv4prefix; 6558 2: optional IPv6PrefixType ipv6prefix; 6559 } 6561 /** Sequence of a prefix in case of move. 6562 */ 6563 struct PrefixSequenceType { 6564 1: required IEEE802_1ASTimeStampType timestamp; 6565 /** Transaction ID set by client in e.g. in 6LoWPAN. */ 6566 2: optional PrefixTransactionIDType transactionid; 6567 } 6569 /** Type of TIE. 6571 This enum indicates what TIE type the TIE is carrying. 6572 In case the value is not known to the receiver, 6573 the TIE MUST be re-flooded. This allows for 6574 future extensions of the protocol within the same major schema 6575 with types opaque to some nodes UNLESS the flooding scope is not 6576 the same as prefix TIE, then a major version revision MUST 6577 be performed. 
6578 */ 6579 enum TIETypeType { 6580 Illegal = 0, 6581 TIETypeMinValue = 1, 6582 /** first legal value */ 6583 NodeTIEType = 2, 6584 PrefixTIEType = 3, 6585 PositiveDisaggregationPrefixTIEType = 4, 6586 NegativeDisaggregationPrefixTIEType = 5, 6587 PGPrefixTIEType = 6, 6588 KeyValueTIEType = 7, 6589 ExternalPrefixTIEType = 8, 6590 PositiveExternalDisaggregationPrefixTIEType = 9, 6591 TIETypeMaxValue = 10, 6592 } 6594 /** RIFT route types. 6596 @note: route types MUST be ordered on their preference: 6597 PGP prefixes are most preferred, attracting 6598 traffic north (towards spine) and then south; 6599 normal prefixes attract traffic south 6600 (towards leafs), i.e. a prefix in a NORTH PREFIX TIE 6601 is preferred over one in a SOUTH PREFIX TIE. 6603 @note: The only purpose of those values is to introduce an 6604 ordering whereas an implementation can choose internally 6605 any other values as long as the ordering is preserved 6606 */ 6607 enum RouteType { 6608 Illegal = 0, 6609 RouteTypeMinValue = 1, 6610 /** First legal value. */ 6611 /** Discard routes are most preferred */ 6612 Discard = 2, 6614 /** Local prefixes are directly attached prefixes on the 6615 * system such as e.g. interface routes. 6616 */ 6617 LocalPrefix = 3, 6618 /** Advertised in S-TIEs */ 6619 SouthPGPPrefix = 4, 6620 /** Advertised in N-TIEs */ 6621 NorthPGPPrefix = 5, 6622 /** Advertised in N-TIEs */ 6623 NorthPrefix = 6, 6624 /** Externally imported north */ 6625 NorthExternalPrefix = 7, 6626 /** Advertised in S-TIEs, either normal prefix or positive 6627 disaggregation */ 6629 SouthPrefix = 8, 6630 /** Externally imported south */ 6631 SouthExternalPrefix = 9, 6632 /** Negative, transitive prefixes are least preferred */ 6633 NegativeSouthPrefix = 10, 6634 RouteTypeMaxValue = 11, 6635 } 6637 B.2.
encoding.thrift 6639 /** 6640 Thrift file for packet encodings for RIFT 6641 */ 6643 include "common.thrift" 6645 /** Represents protocol encoding schema major version */ 6646 const common.VersionType protocol_major_version = 4 6647 /** Represents protocol encoding schema minor version */ 6648 const common.MinorVersionType protocol_minor_version = 0 6650 /** Common RIFT packet header. */ 6651 struct PacketHeader { 6652 /** Major version of protocol. */ 6653 1: required common.VersionType major_version = 6654 protocol_major_version; 6655 /** Minor version of protocol. */ 6656 2: required common.MinorVersionType minor_version = 6657 protocol_minor_version; 6658 /** Node sending the packet, in case of LIE/TIRE/TIDE 6659 also the originator of it. */ 6660 3: required common.SystemIDType sender; 6661 /** Level of the node sending the packet, required on everything 6662 except LIEs. Lack of presence on LIEs indicates UNDEFINED_LEVEL 6663 and is used in ZTP procedures. 6664 */ 6665 4: optional common.LevelType level; 6666 } 6668 /** Prefix community. */ 6669 struct Community { 6670 /** Higher order bits */ 6671 1: required i32 top; 6672 /** Lower order bits */ 6673 2: required i32 bottom; 6674 } 6676 /** Neighbor structure. */ 6677 struct Neighbor { 6678 /** System ID of the originator. */ 6679 1: required common.SystemIDType originator; 6680 /** ID of remote side of the link. */ 6681 2: required common.LinkIDType remote_id; 6682 } 6684 /** Capabilities the node supports. 6686 @note: The schema may add to this 6687 field future capabilities to indicate whether it will support 6688 interpretation of future schema extensions on the same major 6689 revision. Such fields MUST be optional and have an implicit or 6690 explicit false default value. If a future capability changes route 6691 selection or generates blackholes if some nodes are not supporting 6692 it then a major version increment is unavoidable. 
6693 */ 6694 struct NodeCapabilities { 6695 /** Must advertise supported minor version dialect that way. */ 6696 1: required common.MinorVersionType protocol_minor_version = 6697 protocol_minor_version; 6698 /** Can this node participate in flood reduction. */ 6699 2: optional bool flood_reduction = 6700 common.flood_reduction_default; 6701 /** Does this node restrict itself to be top-of-fabric or 6702 leaf only (in ZTP) and does it support leaf-2-leaf 6703 procedures. */ 6704 3: optional common.HierarchyIndications hierarchy_indications; 6705 } 6707 /** Link capabilities. */ 6708 struct LinkCapabilities { 6709 /** Indicates that the link is supporting BFD. */ 6710 1: optional bool bfd = 6711 common.bfd_default; 6712 /** Indicates whether the interface will support v4 forwarding. 6714 @note: This MUST be set to true when LIEs from a v4 address are 6715 sent and MAY be set to true in LIEs on v6 address. If v4 6716 and v6 LIEs indicate contradicting information the 6717 behavior is unspecified. */ 6718 2: optional bool v4_forwarding_capable = 6719 true; 6720 } 6721 /** RIFT LIE Packet. 6723 @note: this node's level is already included on the packet header 6724 */ 6725 struct LIEPacket { 6726 /** Node or adjacency name. */ 6727 1: optional string name; 6728 /** Local link ID. */ 6729 2: required common.LinkIDType local_id; 6730 /** UDP port to which we can receive flooded TIEs. */ 6731 3: required common.UDPPortType flood_port = 6732 common.default_tie_udp_flood_port; 6733 /** Layer 3 MTU, used to discover MTU mismatch. */ 6734 4: optional common.MTUSizeType link_mtu_size = 6735 common.default_mtu_size; 6736 /** Local link bandwidth on the interface. */ 6737 5: optional common.BandwithInMegaBitsType 6738 link_bandwidth = common.default_bandwidth; 6739 /** Reflects the neighbor once received to provide 6740 3-way connectivity. */ 6741 6: optional Neighbor neighbor; 6742 /** Node's PoD.
*/ 6743 7: optional common.PodType pod = 6744 common.default_pod; 6745 /** Node capabilities shown in LIE. The capabilities 6746 MUST match the capabilities shown in the Node TIEs, otherwise 6747 the behavior is unspecified. A node detecting the mismatch 6748 SHOULD generate an according error. */ 6749 10: required NodeCapabilities node_capabilities; 6750 /** Capabilities of this link. */ 6751 11: optional LinkCapabilities link_capabilities; 6752 /** Required holdtime of the adjacency, i.e. how much time 6753 MUST expire without LIE for the adjacency to drop. */ 6754 12: required common.TimeIntervalInSecType 6755 holdtime = common.default_lie_holdtime; 6756 /** Unsolicited, downstream assigned locally significant label 6757 value for the adjacency. */ 6758 13: optional common.LabelType label; 6759 /** Indicates that the level on the LIE MUST NOT be used 6760 to derive a ZTP level by the receiving node. */ 6761 21: optional bool not_a_ztp_offer = 6762 common.default_not_a_ztp_offer; 6763 /** Indicates to northbound neighbor that it should 6764 be reflooding this node's N-TIEs to achieve flood reduction and 6765 balancing for northbound flooding. To be ignored if received 6766 from a northbound adjacency. */ 6767 22: optional bool you_are_flood_repeater = 6768 common.default_you_are_flood_repeater; 6770 /** Can be optionally set to indicate to neighbor that packet losses 6771 are seen on reception based on packet numbers or the rate is 6772 too high. The receiver SHOULD temporarily slow down 6773 flooding rates. 6774 */ 6775 23: optional bool you_are_sending_too_quickly = 6776 false; 6777 /** Instance name in case multiple RIFT instances are running on the 6778 same interface. */ 6779 24: optional string instance_name; 6780 } 6782 /** LinkID pair describes one of parallel links between two nodes. */ 6783 struct LinkIDPair { 6784 /** Node-wide unique value for the local link. */ 6785 1: required common.LinkIDType local_id; 6786 /** Received remote link ID for this link.
*/ 6787 2: required common.LinkIDType remote_id; 6789 /** Describes the local interface index of the link. */ 6790 10: optional common.PlatformInterfaceIndex platform_interface_index; 6791 /** Describes the local interface name. */ 6792 11: optional string platform_interface_name; 6793 /** Indication whether the link is secured, i.e. protected by 6794 outer key, absence of this element means no indication, 6795 undefined outer key means not secured. */ 6796 12: optional common.OuterSecurityKeyID 6797 trusted_outer_security_key; 6798 /** Indication whether the link is protected by established 6799 BFD session. */ 6800 13: optional bool bfd_up; 6801 } 6803 /** ID of a TIE. 6805 @note: TIEID space is a total order achieved by comparing 6806 the elements in sequence defined and comparing each 6807 value as an unsigned integer of according length. 6808 */ 6809 struct TIEID { 6810 /** direction of TIE */ 6811 1: required common.TieDirectionType direction; 6812 /** indicates originator of the TIE */ 6813 2: required common.SystemIDType originator; 6814 /** type of the tie */ 6815 3: required common.TIETypeType tietype; 6816 /** number of the tie */ 6817 4: required common.TIENrType tie_nr; 6819 } 6821 /** Header of a TIE. 6823 @note: TIEID space is a total order achieved by comparing 6824 the elements in sequence defined and comparing each 6825 value as an unsigned integer of according length. 6827 @note: After sequence number the lifetime received on the envelope 6828 must be used for comparison before further fields. 6830 @note: `origination_time` and `origination_lifetime` are disregarded 6831 for comparison purposes and carried purely for 6832 debugging/security purposes if present. 6833 */ 6834 struct TIEHeader { 6835 /** ID of the tie. */ 6836 2: required TIEID tieid; 6837 /** Sequence number of the tie. */ 6838 3: required common.SeqNrType seq_nr; 6840 /** Absolute timestamp when the TIE 6841 was generated. 
This can be used on fabrics with 6842 synchronized clock to prevent lifetime modification attacks. */ 6843 10: optional common.IEEE802_1ASTimeStampType origination_time; 6844 /** Original lifetime when the TIE 6845 was generated. This can be used on fabrics with 6846 synchronized clock to prevent lifetime modification attacks. */ 6847 12: optional common.LifeTimeInSecType origination_lifetime; 6848 } 6850 /** Header of a TIE as described in TIRE/TIDE. 6851 */ 6852 struct TIEHeaderWithLifeTime { 6853 1: required TIEHeader header; 6854 /** Remaining lifetime that expires down to 0 just like in ISIS. 6855 TIEs with lifetimes differing by less than 6856 `lifetime_diff2ignore` MUST be considered EQUAL. */ 6857 2: required common.LifeTimeInSecType remaining_lifetime; 6858 } 6860 /** TIDE with sorted TIE headers, if headers are unsorted, behavior 6861 is undefined. */ 6862 struct TIDEPacket { 6863 /** First TIE header in the tide packet. */ 6864 1: required TIEID start_range; 6865 /** Last TIE header in the tide packet. */ 6866 2: required TIEID end_range; 6867 /** _Sorted_ list of headers. */ 6868 3: required list<TIEHeaderWithLifeTime> headers; 6869 } 6871 /** TIRE packet */ 6872 struct TIREPacket { 6873 1: required set<TIEHeaderWithLifeTime> headers; 6874 } 6876 /** neighbor of a node */ 6877 struct NodeNeighborsTIEElement { 6878 /** level of neighbor */ 6879 1: required common.LevelType level; 6880 /** Cost to neighbor. 6882 @note: All parallel links to same node 6883 incur same cost, in case the neighbor has multiple 6884 parallel links at different cost, the largest distance 6885 (highest numerical value) MUST be advertised. 6887 @note: any neighbor with cost <= 0 MUST be ignored 6888 in computations */ 6889 3: optional common.MetricType cost 6890 = common.default_distance; 6891 /** can carry description of multiple parallel links in a TIE */ 6892 4: optional set<common.LinkIDType> link_ids; 6894 /** total bandwidth to neighbor, this will normally be the sum of the 6895 bandwidths of all the parallel links.
*/ 6896 5: optional common.BandwithInMegaBitsType 6897 bandwidth = common.default_bandwidth; 6898 } 6900 /** Indication flags of the node. */ 6901 struct NodeFlags { 6902 /** Indicates that node is in overload, do not transit traffic 6903 through it. */ 6904 1: optional bool overload = common.overload_default; 6905 } 6907 /** Description of a node. 6909 It may occur multiple times in different TIEs but if either 6910 6911 capabilities values do not match or 6912 flags values do not match or 6913 neighbors repeat with different values 6914 6916 the behavior is undefined and a warning SHOULD be generated. 6917 Neighbors can, however, be distributed across multiple TIEs if 6918 the sets are disjoint. Miscablings SHOULD be repeated in every 6919 node TIE, otherwise the behavior is undefined. 6921 @note: Observe that absence of fields implies defined defaults. 6922 */ 6923 struct NodeTIEElement { 6924 /** Level of the node. */ 6925 1: required common.LevelType level; 6926 /** Node's neighbors. If neighbor systemID repeats in other 6927 node TIEs of same node the behavior is undefined. */ 6928 2: required map<common.SystemIDType, NodeNeighborsTIEElement> neighbors; 6930 /** Capabilities of the node. */ 6931 3: required NodeCapabilities capabilities; 6932 /** Flags of the node. */ 6933 4: optional NodeFlags flags; 6934 /** Optional node name for easier operations. */ 6935 5: optional string name; 6936 /** PoD to which the node belongs. */ 6937 6: optional common.PodType pod; 6938 /** optional startup time of the node */ 6939 7: optional common.TimestampInSecsType startup_time; 6941 /** If any local links are miscabled, the indication is flooded. */ 6942 10: optional set<common.LinkIDType> miscabled_links; 6944 } 6946 /** Attributes of a prefix. */ 6947 struct PrefixAttributes { 6948 /** Distance of the prefix. */ 6949 2: required common.MetricType metric 6950 = common.default_distance; 6951 /** Generic unordered set of route tags, can be redistributed 6952 to other protocols or used within the context of real time 6953 analytics.
*/ 6954 3: optional set<common.RouteTagType> tags; 6955 /** Monotonic clock for mobile addresses. */ 6956 4: optional common.PrefixSequenceType monotonic_clock; 6957 /** Indicates if the interface is a node loopback. */ 6958 6: optional bool loopback = false; 6959 /** Indicates that the prefix is directly attached, i.e. should be 6960 routed to even if the node is in overload. */ 6961 7: optional bool directly_attached = true; 6963 /** In case of locally originated prefixes, i.e. interface 6964 addresses this can describe which link the address 6965 belongs to. */ 6966 10: optional common.LinkIDType from_link; 6967 } 6969 /** TIE carrying prefixes */ 6970 struct PrefixTIEElement { 6971 /** Prefixes with the associated attributes. 6972 If the same prefix repeats in multiple TIEs of same node 6973 behavior is unspecified. */ 6974 1: required map<common.IPPrefixType, PrefixAttributes> prefixes; 6975 } 6977 /** Generic key value pairs. */ 6978 struct KeyValueTIEElement { 6979 /** @note: if the same key repeats in multiple TIEs of same node 6980 or with different values, behavior is unspecified */ 6981 1: required map<common.KeyIDType, string> keyvalues; 6982 } 6984 /** Single element in a TIE. 6986 Schema enum `common.TIETypeType` 6987 in TIEID indicates which elements MUST be present 6988 in the TIEElement. In case of mismatch the unexpected 6989 elements MUST be ignored. In case an expected 6990 element is missing, an error MUST be reported and the TIE 6991 MUST be ignored. 6993 This type can be extended with new optional elements 6994 for new `common.TIETypeType` values without breaking 6995 the major version but if it is necessary to understand whether 6996 all nodes support the new type a node capability must 6997 be added as well. 6998 */ 6999 union TIEElement { 7000 /** Used in case of enum common.TIETypeType.NodeTIEType. */ 7001 1: optional NodeTIEElement node; 7002 /** Used in case of enum common.TIETypeType.PrefixTIEType. */ 7003 2: optional PrefixTIEElement prefixes; 7004 /** Positive prefixes (always southbound).
           It MUST NOT be advertised within a North TIE and
           MUST be ignored otherwise.
       */
       3: optional PrefixTIEElement positive_disaggregation_prefixes;
       /** Transitive, negative prefixes (always southbound) which
           MUST be aggregated and propagated according to the
           specification southwards towards lower levels to heal
           pathological upper level partitioning, otherwise
           blackholes may occur in multiplane fabrics.
           It MUST NOT be advertised within a North TIE.
       */
       5: optional PrefixTIEElement negative_disaggregation_prefixes;
       /** Externally reimported prefixes. */
       6: optional PrefixTIEElement external_prefixes;
       /** Positive external disaggregated prefixes (always
           southbound).
           It MUST NOT be advertised within a North TIE and
           MUST be ignored otherwise.
       */
       7: optional PrefixTIEElement
              positive_external_disaggregation_prefixes;
       /** Key-Value store elements. */
       9: optional KeyValueTIEElement keyvalues;
   }

   /** TIE packet */
   struct TIEPacket {
       1: required TIEHeader header;
       2: required TIEElement element;
   }

   /** Content of a RIFT packet. */
   union PacketContent {
       1: optional LIEPacket lie;
       2: optional TIDEPacket tide;
       3: optional TIREPacket tire;
       4: optional TIEPacket tie;
   }

   /** RIFT packet structure. */
   struct ProtocolPacket {
       1: required PacketHeader header;
       2: required PacketContent content;
   }

Appendix C.  Constants

C.1.  Configurable Protocol Constants

   This section gathers the constants that are provided in the schema
   files and in the document.
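   The MIN_TIEID and MAX_TIEID constants below bound the total order
   on the TIEID space, so a TIDE can describe a contiguous range of
   TIE identifiers.  As a non-normative illustration only, the
   following Python sketch models such range checks; the field
   comparison order (direction, originator, tietype, tie_nr) and the
   numeric bounds for the TIE type are assumptions for this example.

   ```python
   # Non-normative sketch of TIEID ordering as used to delimit TIDE
   # ranges.  Field names and comparison order are assumptions based
   # on this appendix, not an implementation of the specification.
   from collections import namedtuple

   TIEID = namedtuple("TIEID", "direction originator tietype tie_nr")

   SOUTH, NORTH = 0, 1                    # assumed encoding
   TIETYPE_MIN, TIETYPE_MAX = 0, 2**16 - 1  # assumed bounds
   MAX_UINT64 = 2**64 - 1

   MIN_TIEID = TIEID(SOUTH, 0, TIETYPE_MIN, 0)
   MAX_TIEID = TIEID(NORTH, MAX_UINT64, TIETYPE_MAX, MAX_UINT64)

   def tieid_key(t):
       """Comparison key: direction, then originator, type, number."""
       return (t.direction, t.originator, t.tietype, t.tie_nr)

   def in_tide_range(tid, start=MIN_TIEID, end=MAX_TIEID):
       """True if `tid` falls within the [start, end] range of a TIDE."""
       return tieid_key(start) <= tieid_key(tid) <= tieid_key(end)
   ```

   With these bounds, a TIDE covering [MIN_TIEID, MAX_TIEID] spans
   every possible TIEID, which is why the two constants mark the
   start and end of TIDEs.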
   +----------------+--------------+-----------------------------------+
   |                | Type         | Value                             |
   +----------------+--------------+-----------------------------------+
   | LIE IPv4       | Default      | 224.0.0.120 or all-rift-routers   |
   | Multicast      | Value,       | to be assigned in IPv4            |
   | Address        | Configurable | Multicast Address Space Registry  |
   |                |              | in Local Network Control Block    |
   +----------------+--------------+-----------------------------------+
   | LIE IPv6       | Default      | FF02::A1F7 or all-rift-routers to |
   | Multicast      | Value,       | be assigned in IPv6 Multicast     |
   | Address        | Configurable | Address Assignments               |
   +----------------+--------------+-----------------------------------+
   | LIE            | Default      | 914                               |
   | Destination    | Value,       |                                   |
   | Port           | Configurable |                                   |
   +----------------+--------------+-----------------------------------+
   | Level value    | Constant     | 24                                |
   | for            |              |                                   |
   | TOP_OF_FABRIC  |              |                                   |
   | flag           |              |                                   |
   +----------------+--------------+-----------------------------------+
   | Default LIE    | Default      | 3 seconds                         |
   | Holdtime       | Value,       |                                   |
   |                | Configurable |                                   |
   +----------------+--------------+-----------------------------------+
   | TIE            | Default      | 1 second                          |
   | Retransmission | Value        |                                   |
   | Interval       |              |                                   |
   +----------------+--------------+-----------------------------------+
   | TIDE           | Default      | 5 seconds                         |
   | Generation     | Value,       |                                   |
   | Interval       | Configurable |                                   |
   +----------------+--------------+-----------------------------------+
   | MIN_TIEID      | Constant     | TIE Key with minimal values:      |
   | signifies      |              | TIEID(originator=0,               |
   | start of TIDEs |              | tietype=TIETypeMinValue,          |
   |                |              | tie_nr=0, direction=South)        |
   +----------------+--------------+-----------------------------------+
   | MAX_TIEID      | Constant     | TIE Key with maximal values:      |
   | signifies end  |              | TIEID(originator=MAX_UINT64,      |
   | of TIDEs       |              | tietype=TIETypeMaxValue,          |
   |                |              | tie_nr=MAX_UINT64,                |
   |                |              | direction=North)                  |
   +----------------+--------------+-----------------------------------+

                          Table 6: all_constants

Authors' Addresses

   Tony Przygienda (editor)
   Juniper
   1137 Innovation Way
   Sunnyvale, CA
   USA

   Email: prz@juniper.net

   Alankar Sharma
   Comcast
   1800 Bishops Gate Blvd
   Mount Laurel, NJ  08054
   US

   Email: Alankar_Sharma@comcast.com

   Pascal Thubert
   Cisco Systems, Inc
   Building D
   45 Allee des Ormes - BP1200
   MOUGINS - Sophia Antipolis  06254
   FRANCE

   Phone: +33 497 23 26 34
   Email: pthubert@cisco.com

   Bruno Rijsman
   Individual

   Email: brunorijsman@gmail.com

   Dmitry Afanasiev
   Yandex

   Email: fl0w@yandex-team.ru