Network Working Group                                      R. White, Ed.
Internet-Draft                                             S. Zandi, Ed.
Intended status: Informational                                  LinkedIn
Expires: May 9, 2019                                    November 5, 2018

                      IS-IS Support for Openfabric
                        draft-white-openfabric-07

Abstract

   Spine and leaf topologies are widely used in hyperscale and cloud
   scale networks. In most of these networks, configuration is
   automated, but difficult, and topology information is extracted
   through broad based connections.
   Policy is often integrated into the
   control plane, as well, making configuration, management, and
   troubleshooting difficult. Openfabric is an adaptation of an
   existing, widely deployed link state protocol, Intermediate System to
   Intermediate System (IS-IS), that is designed to:

   o  Provide a full view of the topology from a single point in the
      network to simplify operations

   o  Minimize configuration of each Intermediate System (IS) (also
      called a router or switch) in the network

   o  Optimize the operation of IS-IS within a spine and leaf fabric to
      enable scaling

   This document begins with an overview of openfabric, including a
   description of what may be removed from IS-IS to enable scaling. The
   document then describes an optimized adjacency formation process; an
   optimized flooding scheme; some thoughts on the operation of
   openfabric, metrics, and aggregation; and finally a description of
   the changes to the IS-IS protocol required for openfabric.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 9, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Goals
     1.2.  Contributors
     1.3.  Simplification
     1.4.  Additions and Requirements
     1.5.  Sample Network
   2.  Modified Adjacency Formation
     2.1.  Level 2 Adjacencies Only
     2.2.  Point-to-point Adjacencies
     2.3.  Three Way Handshake Support
     2.4.  Adjacency Formation Optimization
   3.  Advertisement of Reachability Information
   4.  Determining and Advertising Location on the Fabric
   5.  Flooding Optimization
     5.1.  Flooding Failures
   6.  Other Optimizations
     6.1.  Transit Link Reachability
     6.2.  Transiting T0 Intermediate Systems
   7.  Openfabric and Route Aggregation
   8.  Security Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Flooding Optimization Operation
   Appendix B.  Fabric Location Calculation
   Authors' Addresses

1.  Introduction

1.1.  Goals

   Spine and leaf fabrics are often used in large scale data centers; in
   this application, they are commonly called a fabric because of their
   regular structure and predictable forwarding and convergence
   properties. This document describes modifications to the IS-IS
   protocol to enable it to run efficiently on a large scale spine and
   leaf fabric, openfabric. The goals of this control plane are:

   o  Provide a full view of the topology from a single point in the
      network to simplify operations

   o  Minimize configuration of each IS in the network

   o  Optimize the operation of IS-IS within a spine and leaf fabric to
      enable scaling

1.2.  Contributors

   The following people have contributed to this draft: Nikos
   Triantafillis (reflected flooding optimization), Ivan Pepelnjak
   (fabric locality calculation modifications), Christian Franke (fabric
   locality calculation modification), Hannes Gredler (do not reflood
   optimizations), Les Ginsberg (capabilities encoding, circuit local
   reflooding), Naiming Shen (capabilities encoding, circuit local
   reflooding), Uma Chunduri (failure mode suggestions, flooding), Nick
   Russo, and Rodny Molina.

   See [RFC5449], [RFC5614], and [RFC7182] for similar solutions in the
   Mobile Ad Hoc Networking (MANET) solution space.

1.3.  Simplification

   In building any scalable system, it is often best to begin by
   removing what is not needed. In this spirit, openfabric
   implementations MAY remove the following from IS-IS:

   o  External metrics.
      There is no need for external metrics in large
      scale spine and leaf fabrics; it is assumed that metrics will be
      properly configured by the operator to account for the correct
      order of route preference at any route redistribution point.

   o  Tags and traffic engineering processing. Openfabric is only
      designed to provide topology and reachability information. It is
      not designed to provide for traffic engineering, route preference
      through tags, or other policy mechanisms. It is assumed that all
      routing policy will be provided through an overlay system which
      communicates directly with each IS in the fabric, such as PCEP
      [RFC5440] or I2RS [RFC7921]. Traffic engineering is assumed to be
      provided through Segment Routing (SR)
      [I-D.ietf-spring-segment-routing].

1.4.  Additions and Requirements

   To create a scalable link state fabric, openfabric includes the
   following:

   o  A slightly modified adjacency formation process.

   o  A mechanism for determining the tier within a spine and leaf
      fabric in which the IS is located.

   o  A mechanism that reduces flooding to the minimum possible, while
      still ensuring complete database synchronization among the
      intermediate systems within the fabric.

   Three general requirements are placed here; more specific
   requirements are considered in the following sections. Openfabric
   implementations:

   o  MUST support [RFC5301] and enable hostname advertisement by
      default if a hostname is configured on the intermediate system.

   o  SHOULD support [RFC6232], purge originator identification for
      IS-IS.

   o  MUST NOT be mixed with standard IS-IS implementations in
      operational deployments. Openfabric and standard IS-IS
      implementations SHOULD be treated as two separate protocols.

1.5.  Sample Network

   The following spine and leaf fabric will be used to describe these
   modifications.
   +----+ +----+ +----+ +----+ +----+ +----+
   | 1A | | 1B | | 1C | | 1D | | 1E | | 1F | (T0)
   +----+ +----+ +----+ +----+ +----+ +----+

   +----+ +----+ +----+ +----+ +----+ +----+
   | 2A | | 2B | | 2C | | 2D | | 2E | | 2F | (T1)
   +----+ +----+ +----+ +----+ +----+ +----+

   +----+ +----+ +----+ +----+ +----+ +----+
   | 3A | | 3B | | 3C | | 3D | | 3E | | 3F | (T2)
   +----+ +----+ +----+ +----+ +----+ +----+

   +----+ +----+ +----+ +----+ +----+ +----+
   | 4A | | 4B | | 4C | | 4D | | 4E | | 4F | (T1)
   +----+ +----+ +----+ +----+ +----+ +----+

   +----+ +----+ +----+ +----+ +----+ +----+
   | 5A | | 5B | | 5C | | 5D | | 5E | | 5F | (T0)
   +----+ +----+ +----+ +----+ +----+ +----+

                                Figure 1

   To reduce confusion (spine and leaf fabrics are difficult to draw in
   plain text art), this diagram does not contain the connections
   between devices. The reader should assume that each device in a
   given layer is connected to every device in the layer above it. For
   instance:

   o  5A is connected to 4A, 4B, 4C, 4D, 4E, and 4F

   o  5B is connected to 4A, 4B, 4C, 4D, 4E, and 4F

   o  4A is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and
      5F

   o  4B is connected to 3A, 3B, 3C, 3D, 3E, 3F, 5A, 5B, 5C, 5D, 5E, and
      5F

   o  etc.

   The tiers or stages of the fabric are also marked for easier
   reference. T0 devices are assumed to be connected to application
   servers; that is, they are Top of Rack (ToR) intermediate systems.
   The remaining tiers, T1 and T2, are connected only to the fabric
   itself. Note there are no "cross links," or "east west" links in the
   illustrated fabric. The fabric locality detection mechanism
   described here will not work if there are cross links running
   east/west through the fabric. Locality detection may be possible in
   such a fabric; this is an area for further study.

2.  Modified Adjacency Formation

   Because Openfabric operates in a tightly controlled data center
   environment, various modifications can be made to the IS-IS neighbor
   formation process to increase efficiency and simplify the protocol.
   Specifically, Openfabric implementations SHOULD support [RFC3719],
   section 4, hello padding for IS-IS. Variable hello padding SHOULD
   NOT be used, as data center fabrics are built using high speed links
   on which padded hellos will have little performance impact. Further
   modifications to the neighbor formation process are considered in the
   following sections.

2.1.  Level 2 Adjacencies Only

   Openfabric is designed to work in a single flooding domain over a
   single data center fabric at the scale of thousands of routers with
   hundreds of thousands of routes (so a moderate scale in router and
   route count terms). Because of the way Openfabric optimizes
   operation in this environment, it is neither necessary nor desirable
   to build multiple flooding domains. For instance, the flooding
   optimizations described later in this document require a full view of
   the topology, as does any proposed overlay to inject policy into the
   forwarding plane. In light of this, the following changes SHOULD be
   made to IS-IS implementations to support Openfabric:

   o  IIH PDU 17 (level 2 point-to-point circuit hello) should be the
      only IIH PDU type transmitted (see section 9.7 of ISO 10589)

   o  In IIH PDU 17 (level 2 point-to-point circuit hello), the Circuit
      Type field should be set to 2 (see section 9.7 of ISO 10589)

   o  Support for IIH PDU 15 (level 1 broadcast hello) should be removed
      (see section 9.5 of ISO 10589)

   o  Support for IIH PDU 16 (level 2 broadcast hello) should be removed
      (see section 9.6 of ISO 10589)

2.2.  Point-to-point Adjacencies

   Data center network fabrics only contain point-to-point links;
   because of this, there is no reason to support any broadcast link
   types, nor to support the Designated Intermediate System processing,
   including pseudonode creation. In light of this, processing related
   to sections 7.2.3 (broadcast networks), 7.3.8 (generation of level 1
   pseudonode LSPs), 7.3.10 (generation of level 2 pseudonode LSPs), and
   8.4.5 (LAN designated intermediate systems) in [ISO10589] SHOULD be
   removed.

2.3.  Three Way Handshake Support

   It is important that two way connectivity be established before
   synchronizing the link state database, or routing through a link in a
   data center fabric. To reject optical failures that cause a one way
   connection between two routers, Openfabric must support the three way
   handshake mechanism described in [RFC5303].

2.4.  Adjacency Formation Optimization

   While adjacency formation is not considered particularly burdensome
   in IS-IS, it may still be useful to reduce the amount of state
   transferred across the network when connecting a new IS to the
   fabric. In its simplest form, the process is:

   o  An IS connected to the fabric will send hellos on all links.

   o  The IS will only complete the three-way handshake with one newly
      discovered neighbor; this would normally be the first neighbor
      which sends the newly connected intermediate system's ID back in
      the three-way handshake process.

   o  The IS will complete its database exchange with this one newly
      adjacent neighbor.

   o  Once this process is completed, the IS will continue processing
      the remaining neighbors as normal.

   o  If synchronization is not achieved within twice the dead timer on
      the local interface, the newly connected IS will repeat this
      process with the second neighbor with which it forms a three-way
      adjacency.
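   The selection logic in the list above can be sketched in a few lines
   of Python. This is only an illustration under stated assumptions:
   the function name initial_sync, the sync_seconds callback (which
   stands in for "how long the database exchange with this neighbor
   took, or None if it never completed"), and the neighbor names are
   all hypothetical and do not come from this document or from any
   IS-IS implementation.

```python
from typing import Callable, List, Optional, Tuple


def initial_sync(
    handshake_order: List[str],
    sync_seconds: Callable[[str], Optional[float]],
    dead_timer: float,
) -> Tuple[Optional[str], List[str]]:
    """Pick the one neighbor used for the initial full database exchange.

    handshake_order lists neighbors in the order they echoed the new
    IS's ID in the three-way handshake. sync_seconds is a hypothetical
    hook returning the time the exchange took, or None on failure.
    """
    # The text allows twice the dead timer before moving on.
    deadline = 2 * dead_timer
    for neighbor in handshake_order:
        elapsed = sync_seconds(neighbor)
        if elapsed is not None and elapsed <= deadline:
            # Everyone else is processed "as normal" afterwards.
            remaining = [n for n in handshake_order if n != neighbor]
            return neighbor, remaining
    # No neighbor synchronized in time; nothing was selected.
    return None, list(handshake_order)


# Example: 4C answers first but never finishes synchronizing.
times = {"4C": None, "4A": 3.0, "4B": 2.5}
chosen, remaining = initial_sync(["4C", "4A", "4B"], times.get, 30.0)
```

   In this example the new IS falls back from 4C to 4A, and both 4C and
   4B are left for normal adjacency processing once the initial
   exchange has completed.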
   This process allows each IS newly added to the fabric to exchange a
   full table once; only a minimal amount of information will be
   transferred with the remaining neighbors to reach full
   synchronization.

   Any such optimization is bound to present a tradeoff between several
   factors; the mechanism described here slightly increases the amount
   of time required to form adjacencies in order to reduce the total
   state carried across the network. An alternative mechanism could
   provide a better balance between the amount of information carried
   across the network for initial synchronization and the time required
   to synchronize a new IS. For instance, an IS could choose to
   synchronize its database with two or three adjacent intermediate
   systems, which could speed the synchronization process up at the cost
   of carrying additional data on the network. A locally determined
   balance between the speed of synchronization and the amount of data
   carried on the network can be achieved by adjusting the number of
   adjacent intermediate systems the newly attached IS synchronizes
   with.

3.  Advertisement of Reachability Information

   IS-IS describes the topology in two different sets of TLVs; the first
   describes the set of neighbors connected to an IS, the second
   describes the set of reachable destinations connected to an IS.
   There are two different forms of both of these descriptions, one of
   which carries what are widely called narrow metrics, the other of
   which carries what are widely called wide metrics. In a tightly
   controlled data center fabric implementation, such as the ones
   Openfabric is designed to support, no IS that supports narrow metrics
   will ever be deployed or supported; hence there is no reason to
   support any metric type other than wide metrics.
   o  The Level 2 Link State PDU (type 20 in section 9.9 of [ISO10589])
      and the scoped flooding PDU (type 10 in section 3.1 of [RFC7356])
      SHOULD be the only PDU types used to carry link state information
      in an Openfabric implementation

   o  Processing related to the Level 1 Link State PDU (type 18) MAY be
      removed from Openfabric implementations (see section 9.8 of
      [ISO10589])

   o  Neighbor reachability MUST be carried in TLV type 22 (see section
      3 of [RFC5305])

   o  IPv4 reachability SHOULD be carried in TLV type 135 (see section 4
      of [RFC5305]), or TLV type 235 for multitopology implementations
      (see [RFC5120])

   o  IPv6 reachability SHOULD be carried in TLV type 236 (see
      [RFC5308]), or TLV type 237 for multitopology implementations (see
      [RFC5120])

   o  Processing related to the neighbor reachability TLV (type 2, see
      sections 9.8 and 9.9 of [ISO10589]) SHOULD be removed

   o  Processing related to the narrow metric IP reachability TLV (types
      128 and 130) SHOULD be removed

   Further, if segment routing support is desired, Openfabric MAY
   support the Prefix Segment Identifier sub-TLV and other TLVs as
   required in [I-D.ietf-isis-segment-routing-extensions].

4.  Determining and Advertising Location on the Fabric

   The tier to which an IS is connected is useful to enable
   autoconfiguration of intermediate systems connected to the fabric and
   to reduce flooding. Once the tier of an intermediate system within
   the fabric has been determined, it MUST be advertised using the 4 bit
   Tier field described in section 3.3 of
   [I-D.shen-isis-spine-leaf-ext]. This section describes a method of
   calculating the tier number, assuming the tier numbers rise in value
   from the edge of the fabric.

   This method begins with two of the T0 intermediate systems
   advertising their location in the fabric.
   This information can be obtained in either of two ways:

   o  Two T0 intermediate systems are manually configured to advertise
      0x00 in their IS reachability tier sub-TLV, indicating they are at
      the edge of the fabric (a ToR IS).

   o  The T0 intermediate systems detect they are T0 through the
      presence of connected hosts (i.e. through a request for address
      assignment or some other means). If such detection is used, and
      the IS determines it is located at T0, it should advertise 0x00 in
      its IS reachability tier sub-TLV.

   If the first method is used, the two T0 routers MUST be "maximally
   separated" on the fabric. They must be a maximal number of hops
   apart, or rather they MUST NOT be connected to the same T1 device as
   their "upstream" towards the superspines in a 5 ary fabric.

   The second method above SHOULD be used with care, as it may not be
   secure, and it may not work in all data center environments. For
   instance, if a host is mistakenly (or intentionally, as a form of
   attack) attached to a spine IS, or a request for address assignment
   is transmitted to a spine IS during the bootup phase of the device or
   fabric, it is possible to cause a spine IS to advertise itself as a
   T0. Unless the autodetection of the T0 devices is secured, the
   manual mechanism SHOULD be used (configuring at least one T0 device
   manually).
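   The tier calculation that this section goes on to describe (an SPF
   with every link cost set to 1, run first from the local IS and then
   from the farthest T0) can be sketched in Python as follows. This is
   a minimal illustration, not a reference implementation: the names
   hop_counts, tier_of, and sample_fabric are hypothetical, the
   topology is the fully meshed five-stage fabric of Figure 1, and a
   breadth-first search is used because it gives the same result as
   SPF when all link costs are 1.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set

Graph = Dict[str, Set[str]]


def hop_counts(graph: Graph, root: str) -> Dict[str, int]:
    """Unit-cost SPF from root, implemented as a breadth-first search."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def tier_of(graph: Graph, local: str,
            known_t0: Iterable[str]) -> Optional[int]:
    """Return RD - LD for the local IS, or None if the method aborts."""
    local_dist = hop_counts(graph, local)
    reachable_t0 = [t for t in known_t0 if t in local_dist]
    if len(reachable_t0) < 2:
        return None  # fewer than two T0 in the SPT: abort
    # Node A is the farthest advertised T0; LD is the cost to reach it.
    node_a = max(reachable_t0, key=lambda t: local_dist[t])
    ld = local_dist[node_a]
    # RD is the cost from A to the farthest IS in A's own SPT (node B).
    rd = max(hop_counts(graph, node_a).values())
    return rd - ld


def sample_fabric() -> Graph:
    """Figure 1: each device connects to every device in adjacent rows."""
    rows, cols = "12345", "ABCDEF"
    graph: Graph = {r + c: set() for r in rows for c in cols}
    for upper, lower in zip(rows, rows[1:]):
        for a in cols:
            for b in cols:
                graph[upper + a].add(lower + b)
                graph[lower + b].add(upper + a)
    return graph


fabric = sample_fabric()
tiers = {node: tier_of(fabric, node, ["5A", "1C"]) for node in fabric}
```

   With 5A and 1C configured as T0, this sketch assigns tier 0 to rows
   1 and 5, tier 1 to rows 2 and 4, and tier 2 to row 3, matching the
   labels in Figure 1.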
   Given the correct configuration of two T0 devices, maximally spaced
   on the fabric, the remaining intermediate systems calculate their
   tier number as follows:

   o  The local IS calculates an SPT (using SPF) setting the cost of
      every link to 1; this effectively calculates a topology only view
      of the network, without considering any configured link costs

   o  Ensure that at least two T0 are in the calculated SPT; otherwise
      abort

   o  Find the farthest T0; call this node A and set LD to the cost of
      reaching it; the "farthest T0" is the T0 with the largest metric,
      that is, the greatest distance from the local calculating node

   o  Calculate an SPT (using SPF) from the perspective of A (above)
      setting the cost of every link to 1

   o  Find the farthest IS in A's SPT; call this node B and set RD to
      the cost from A to B

   o  Calculate the tier number of the local IS by subtracting LD from
      RD

   In the example network, assume 5A and 1C are manually configured as
   T0, and are advertising their tier numbers. From here:

   o  From 1A the path to 5A is 4 hops; this is LD

   o  Run SPF from the perspective of 5A with all link metrics set to 1

   o  From 5A the path length to 1C is 4; this is RD

   o  RD - LD is 0 at 1A, so 1A is T0, or a ToR

   This process will work for any spine and leaf fabric without "cross
   links."

5.  Flooding Optimization

   Flooding is perhaps the most challenging scaling issue for a link
   state protocol running on a dense, large scale fabric. To reduce the
   flooding of link state information in the form of Link State Protocol
   Data Units (LSPs), Openfabric takes advantage of information already
   available in the link state protocol, the list of the local
   intermediate system's neighbors' neighbors, and the fabric locality
   computed above.
   The following tables are required to compute a set
   of reflooders:

   o  Neighbor List (NL): The set of neighbors

   o  Neighbors' Neighbors (NN) list: The set of neighbors' neighbors;
      this can be calculated by running SPF truncated to two hops

   o  Do Not Reflood (DNR) list: The set of neighbors that should
      receive LSPs (or fragments) but should not reflood them

   o  Reflood (RF) list: The set of neighbors that should flood LSPs (or
      fragments) to their adjacent neighbors to ensure synchronization

   NL is set to contain all neighbors, sorted deterministically (for
   instance, from the highest IS identifier to the lowest). All
   intermediate systems within a single fabric SHOULD use the same
   mechanism for sorting the NL list. NN is set to contain all
   neighbors' neighbors, or all intermediate systems that are two hops
   away, as determined by performing a truncated SPF. The DNR and RF
   tables are initially empty. To begin, the following steps are taken
   to reduce the size of NN and NL:

   o  Move any IS in NL with its tier (or fabric location) set to T0 to
      DNR

   o  Remove all intermediate systems from NL and NN that are on the
      shortest path to the IS that originated the LSP

   Then, for every IS in NL:

   o  If the current entry in NL is connected to any entries in NN:

      *  Move the IS to RF

      *  Remove the intermediate systems connected to the IS from NN

   o  Else move the IS to DNR

   The calculation terminates when the NL is empty.

   When flooding, LSPs transmitted to adjacent neighbors on the RF list
   will be transmitted normally. Adjacent intermediate systems on this
   list will reflood received LSPs into the next stage of the topology,
   ensuring database synchronization. LSPs transmitted to adjacent
   neighbors on the DNR list, however, MUST be transmitted using a
   circuit scope PDU as described in [RFC7356].

5.1.  Flooding Failures

   It is possible in some failure modes for flooding to be incomplete
   because of the flooding optimizations outlined. Specifically, if a
   reflooder fails, or is somehow disconnected from all the links across
   which it should be reflooding, it is possible an LSP is only
   partially flooded through the fabric. To prevent such situations,
   any IS receiving an LSP transmitted using DNR SHOULD:

   o  Set a short timer; the default should be less than one second

   o  When the timer expires, send a Complete Sequence Number Packet
      (CSNP) to all neighbors

   o  Process any Partial Sequence Number Packets (PSNPs) as required to
      resynchronize

   o  If a resynchronization is required, notify the network operator
      through a network management system

6.  Other Optimizations

6.1.  Transit Link Reachability

   In order to reduce the amount of control plane state carried on large
   scale spine and leaf fabrics, openfabric implementations SHOULD NOT
   advertise reachability for transit links. These links MAY remain
   unnumbered, as IS-IS does not require layer 3 IP addresses to
   operate. Each IS SHOULD be configured with a single loopback
   address, which is assigned an IPv6 address, to provide reachability
   to intermediate systems which make up the fabric.

   [RFC3277] SHOULD be supported on devices supporting openfabric with
   unnumbered interfaces in order to support traceability and network
   management.

6.2.  Transiting T0 Intermediate Systems

   In data center fabrics, ToR intermediate systems SHOULD NOT be used
   to transit between two T1 (or above) spine intermediate systems. The
   simplest way to prevent this is to set the overload bit [RFC3277] for
   all the LSPs originated from T0 intermediate systems.
   However, this
   solution would have the unfortunate side effect of causing all
   reachability beyond any T0 IS to have the same metric, and many
   implementations treat a set overload bit as a metric of 0xFFFF in
   calculating the Shortest Path Tree (SPT). This document proposes an
   alternate solution which preserves the leaf node metric, while still
   avoiding transiting T0 intermediate systems.

   Specifically, all T0 intermediate systems SHOULD advertise their
   metric to reach any T1 adjacent neighbor with a cost of 0xFFE. T1
   intermediate systems, on the other hand, will advertise T0
   intermediate systems with the actual interface cost used to reach the
   T0 IS. Hence, links connecting T0 and T1 intermediate systems will
   be advertised with an asymmetric cost that discourages transiting T0
   intermediate systems, while leaving reachability to the destinations
   attached to T0 devices the same.

7.  Openfabric and Route Aggregation

   While schemes may be designed so reachability information can be
   aggregated in Openfabric deployments, this is not a recommended
   configuration.

8.  Security Considerations

   This document outlines modifications to the IS-IS protocol for
   operation on large scale data center fabrics. While it does add new
   TLVs, and some local processing changes, it does not add any new
   security vulnerabilities to the operation of IS-IS. However,
   openfabric implementations SHOULD implement IS-IS cryptographic
   authentication, as described in [RFC5304], and should enable other
   security measures in accordance with best common practices for the
   IS-IS protocol.

   If T0 intermediate systems are auto-detected using information
   outside Openfabric, it is possible to attack the calculations used
   for flooding reduction and auto-configuration of intermediate
   systems.
   For instance, if a request for an address pool is used as an
   indicator of an attached host, and hence receiving such a request
   causes an intermediate system to advertise itself as T0, it is
   possible for an attacker (or a simple mistake) to cause
   auto-configuration to fail. Any such auto-detection mechanisms
   SHOULD be secured using appropriate techniques, as described by any
   protocols or mechanisms used.

9.  References

9.1.  Normative References

   [I-D.shen-isis-spine-leaf-ext]
              Shen, N., Ginsberg, L., and S. Thyamagundalu, "IS-IS
              Routing for Spine-Leaf Topology",
              draft-shen-isis-spine-leaf-ext-07 (work in progress),
              October 2018.

   [ISO10589]
              International Organization for Standardization,
              "Intermediate system to Intermediate system intra-domain
              routeing information exchange protocol for use in
              conjunction with the protocol for providing the
              connectionless-mode Network Service (ISO 8473)",
              ISO/IEC 10589:2002, Second Edition, Nov 2002.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
              DOI 10.17487/RFC2629, June 1999.

   [RFC5120]  Przygienda, T., Shen, N., and N. Sheth, "M-ISIS: Multi
              Topology (MT) Routing in Intermediate System to
              Intermediate Systems (IS-ISs)", RFC 5120,
              DOI 10.17487/RFC5120, February 2008.

   [RFC5301]  McPherson, D. and N. Shen, "Dynamic Hostname Exchange
              Mechanism for IS-IS", RFC 5301, DOI 10.17487/RFC5301,
              October 2008.

   [RFC5303]  Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way
              Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303,
              DOI 10.17487/RFC5303, October 2008.

   [RFC5305]  Li, T. and H. Smit, "IS-IS Extensions for Traffic
              Engineering", RFC 5305, DOI 10.17487/RFC5305, October
              2008.
   [RFC5308]  Hopps, C., "Routing IPv6 with IS-IS", RFC 5308,
              DOI 10.17487/RFC5308, October 2008.

   [RFC5309]  Shen, N., Ed. and A. Zinin, Ed., "Point-to-Point Operation
              over LAN in Link State Routing Protocols", RFC 5309,
              DOI 10.17487/RFC5309, October 2008.

   [RFC5311]  McPherson, D., Ed., Ginsberg, L., Previdi, S., and M.
              Shand, "Simplified Extension of Link State PDU (LSP) Space
              for IS-IS", RFC 5311, DOI 10.17487/RFC5311, February 2009.

   [RFC5316]  Chen, M., Zhang, R., and X. Duan, "ISIS Extensions in
              Support of Inter-Autonomous System (AS) MPLS and GMPLS
              Traffic Engineering", RFC 5316, DOI 10.17487/RFC5316,
              December 2008.

   [RFC7356]  Ginsberg, L., Previdi, S., and Y. Yang, "IS-IS Flooding
              Scope Link State PDUs (LSPs)", RFC 7356,
              DOI 10.17487/RFC7356, September 2014.

   [RFC7981]  Ginsberg, L., Previdi, S., and M. Chen, "IS-IS Extensions
              for Advertising Router Information", RFC 7981,
              DOI 10.17487/RFC7981, October 2016.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.2.  Informative References

   [I-D.ietf-isis-segment-routing-extensions]
              Previdi, S., Ginsberg, L., Filsfils, C., Bashandy, A.,
              Gredler, H., Litkowski, S., Decraene, B., and J. Tantsura,
              "IS-IS Extensions for Segment Routing",
              draft-ietf-isis-segment-routing-extensions-19 (work in
              progress), July 2018.

   [I-D.ietf-spring-segment-routing]
              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing
              Architecture", draft-ietf-spring-segment-routing-15 (work
              in progress), January 2018.

   [RFC3277]  McPherson, D., "Intermediate System to Intermediate System
              (IS-IS) Transient Blackhole Avoidance", RFC 3277,
              DOI 10.17487/RFC3277, April 2002.
696    [RFC3719]  Parker, J., Ed., "Recommendations for Interoperable
697               Networks using Intermediate System to Intermediate System
698               (IS-IS)", RFC 3719, DOI 10.17487/RFC3719, February 2004.

701    [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
702               Border Gateway Protocol 4 (BGP-4)", RFC 4271,
703               DOI 10.17487/RFC4271, January 2006.

706    [RFC5304]  Li, T. and R. Atkinson, "IS-IS Cryptographic
707               Authentication", RFC 5304, DOI 10.17487/RFC5304, October
708               2008.

710    [RFC5440]  Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation
711               Element (PCE) Communication Protocol (PCEP)", RFC 5440,
712               DOI 10.17487/RFC5440, March 2009.

715    [RFC5449]  Baccelli, E., Jacquet, P., Nguyen, D., and T. Clausen,
716               "OSPF Multipoint Relay (MPR) Extension for Ad Hoc
717               Networks", RFC 5449, DOI 10.17487/RFC5449, February 2009.

720    [RFC5614]  Ogier, R. and P. Spagnolo, "Mobile Ad Hoc Network (MANET)
721               Extension of OSPF Using Connected Dominating Set (CDS)
722               Flooding", RFC 5614, DOI 10.17487/RFC5614, August 2009.

725    [RFC5820]  Roy, A., Ed. and M. Chandra, Ed., "Extensions to OSPF to
726               Support Mobile Ad Hoc Networking", RFC 5820,
727               DOI 10.17487/RFC5820, March 2010.

730    [RFC5837]  Atlas, A., Ed., Bonica, R., Ed., Pignataro, C., Ed., Shen,
731               N., and JR. Rivers, "Extending ICMP for Interface and
732               Next-Hop Identification", RFC 5837, DOI 10.17487/RFC5837,
733               April 2010.

735    [RFC6232]  Wei, F., Qin, Y., Li, Z., Li, T., and J. Dong, "Purge
736               Originator Identification TLV for IS-IS", RFC 6232,
737               DOI 10.17487/RFC6232, May 2011.

740    [RFC7182]  Herberg, U., Clausen, T., and C. Dearlove, "Integrity
741               Check Value and Timestamp TLV Definitions for Mobile Ad
742               Hoc Networks (MANETs)", RFC 7182, DOI 10.17487/RFC7182,
743               April 2014.

745    [RFC7921]  Atlas, A., Halpern, J., Hares, S., Ward, D., and T.
746               Nadeau, "An Architecture for the Interface to the Routing
747               System", RFC 7921, DOI 10.17487/RFC7921, June 2016.
750 Appendix A.  Flooding Optimization Operation

752    Recent testing has shown that flooding is largely a "non-issue" in
753    terms of scaling when using high speed links connecting intermediate
754    systems with reasonable processing power and memory.  However,
755    testing has also shown that flooding will impact convergence speed
756    even in such environments, and flooding optimization has a major
757    impact on the performance of a link state protocol in resource
758    constrained environments.  Some thoughts on flooding optimization in
759    general, and the flooding optimization contained in this document,
760    follow.

762    There are two general classes of flooding optimization available for
763    link state protocols.  The first class of optimization relies on a
764    centralized service or server to gather the link state information
765    and redistribute it back into the intermediate systems making up the
766    fabric.  Such solutions are attractive in many, but not all,
767    environments; hence these systems complement, rather than compete
768    with, the system described here.  Systems relying on a service or
769    server necessarily also rely on connectivity to that service or
770    server, either through an out-of-band network or connectivity through
771    the fabric itself.  Because of this, these mechanisms do not apply to
772    all deployments; some deployments require underlying reachability
773    regardless of connectivity to an outside service or server.

775    The second class is a fully distributed system that floods the
776    minimal amount of information possible to every intermediate
777    system.  The system described in this draft is an example of such
778    a system.  Again, there are many ways to accomplish this, but
779    simplicity is a primary goal of the system described in this
780    draft.

782    The system described here divides the work into two parts:
783    forward and reverse optimization.
       The forward optimization begins by
784    finding the set of intermediate systems two hops away from the
785    flooding device, and choosing a subset of connected neighbors that
786    will successfully reach this entire set of intermediate systems, as
787    shown in the diagram below.

789               G
790               |
791    A     B    C--+
792    |     |    |  |
793    +--D--+    E  H
794       |       |  |
795       +----F--+--+

797                                 Figure 2

799    If F is flooding some piece of information, then it will find the
800    entire set of intermediate systems within two hops by discovering its
801    neighbors and their neighbors from the local LSDB.  This will include
802    A, B, C, D, E, and H, but not G.  From this set, F can determine that
803    D can reach A and B, while a single flood to either E or H will
804    reach C.  Hence F can reach the entire set by flooding to D and to
805    either E or H.  Suppose F chooses to flood to D and E normally.
806    Because H still needs to receive this new LSP (or fragment), but
807    does not need to reflood to C, F can send the LSP to H using link
808    local signaling.  In this case, H will receive and process the new
809    LSP, but not reflood it.

810    Rather than carrying the necessary information through hello
811    extensions, as is done in [RFC5820], the neighbors are allowed to
812    complete initial synchronization, and then a truncated shortest path
813    tree is built to determine the "two hop neighborhood."  This has the
814    advantage of using mechanisms already used in IS-IS, rather than
815    adding new processes.  The risk with this process is that any LSPs
816    flooded through the network before this initial calculation takes
817    place will be flooded suboptimally.  This "two hop neighborhood"
818    process has been used in OSPF deployments for a number of years, and
819    has proven stable in practice.

821    Rather than setting a timer for reflooding, the implementation
822    described here uses the ability of IS-IS to describe the entire
823    database using a CSNP to ensure flooding is successful.  This adds some small
This adds some small 824 amount of overhead, so there is some balance between optimal flooding 825 and ensuring flooding is complete. 827 The reverse optimization is simpler. It relies on the observation 828 that any intermediate system between the local IS and the origin of 829 the LSP, other than in the case of floods removing an LSP from the 830 shared LSDB, should have already received a copy of the LSP. For 831 instance, if F originates an LSP in the figure above, and E refloods 832 the LSP to C, C does not need to reflood back to F if F is on its 833 shortest path tree towards F. It is obvious this is not a "perfect" 834 optimization. A perfect optimization would block flooding back along 835 a directed acyclic graph towards the originator. Using the SPT, 836 however, is a quick way to reduce flooding without performing more 837 calculations. 839 The combination of these two optimizations have been seen, in 840 testing, to reduce the number of copies any IS receives from the tens 841 to precisely one. 843 Appendix B. Fabric Location Calculation 845 Determining the location of a device in a symmetric topology is quite 846 challenging. The authors of this draft worked through a number of 847 possible solutions to this problem, each of which was found to either 848 not work in some topology, or was found to be liable to unacceptable 849 errors. 
       For instance:

851    o  Method 1:

853       *  Calculate the maximum distance through the fabric, and the
854          distance from one of those points to the local intermediate
855          system

857       *  This works in a five stage Clos spine and leaf, but not in a
858          three stage, nor in some other five stage spine and leaf
859          fabrics, such as the common butterfly or Benes fabric

861    o  Method 2:

863       *  Manually mark one edge leaf node in the fabric as T0

865       *  Calculate maximum distance through the fabric from this point

867       *  Calculate local position based on this maximum distance and
868          the distance to the single marked device

870       *  This works in three and five stage Clos fabrics, but does not
871          work from every location in other spine and leaf fabrics, such
872          as the common butterfly or Benes fabric

874    In the end, marking two devices located as far from one another
875    topologically as possible provides the anchor points necessary to
876    calculate the total distance through the fabric, and then from those
877    points to the location of the calculating device.

879    The information obtained in this way can also be combined with other
880    forms of location calculation, such as whether a device requesting an
881    address through some mechanism is attached to the local device, or
882    other indications of fabric locality.  It is generally true that
883    having more than one method to determine fabric location will be
884    better than any single method, to account for errors, failures, and
885    other problems that can arise with any mechanism.

887 Authors' Addresses

889    Russ White (editor)
890    LinkedIn

892    Email: russ@riw.us

894    Shawn Zandi (editor)
895    LinkedIn

897    Email: szandi@linkedin.com
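
   As an illustration of the forward optimization described in
   Appendix A, the following is a minimal sketch, not the authors'
   implementation.  It assumes the LSDB has already been reduced to an
   unweighted adjacency map, and it uses a greedy selection to choose
   the reflooding neighbors; the draft does not specify how the subset
   is chosen, so that choice is an assumption.  The names
   build_adjacency, forward_flood_sets, and FIGURE_2 are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Adjacency = Dict[str, Set[str]]


def build_adjacency(links: Iterable[Tuple[str, str]]) -> Adjacency:
    """Build an undirected adjacency map from point-to-point links."""
    adj: Adjacency = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def forward_flood_sets(adj: Adjacency, local: str) -> Tuple[Set[str], Set[str]]:
    """Return (neighbors flooded normally, neighbors flooded link-locally)."""
    neighbors = adj[local]
    # Systems exactly two hops away: neighbors of neighbors, excluding
    # the local system and its direct neighbors.
    two_hop: Set[str] = set()
    for n in neighbors:
        two_hop |= adj[n]
    two_hop -= neighbors | {local}

    uncovered = set(two_hop)
    reflooders: Set[str] = set()
    while uncovered:
        # Greedy step (an assumption, not taken from the draft): pick the
        # neighbor reaching the most still-uncovered systems; sorted()
        # makes ties deterministic (alphabetical).
        candidates = sorted(neighbors - reflooders)
        best = max(candidates, key=lambda n: len(adj[n] & uncovered))
        reflooders.add(best)
        uncovered -= adj[best]
    # Neighbors not chosen still receive the LSP, but are not asked to
    # reflood it (link local signaling in the draft's terms).
    return reflooders, neighbors - reflooders


# Topology of Figure 2, transcribed as links.
FIGURE_2 = build_adjacency([
    ("A", "D"), ("B", "D"), ("D", "F"), ("E", "F"),
    ("H", "F"), ("C", "E"), ("C", "H"), ("G", "C"),
])

normal, link_local = forward_flood_sets(FIGURE_2, "F")
print(sorted(normal), sorted(link_local))  # ['D', 'E'] ['H']
```

   With F as the flooding system this reproduces the example in
   Appendix A: D and E are flooded normally and H receives the LSP
   without reflooding it.  The tie between E and H is broken
   alphabetically here; either choice covers C.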
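
   The two-anchor calculation described in Appendix B can be sketched
   as follows.  This is a hedged illustration only: it assumes hop
   count as the distance metric, an adjacency map already extracted
   from the LSDB, and two manually marked T0 anchors.  The appendix
   states which distances are calculated but not how they map to a
   tier, so the tier_guess rule below (total distance minus the larger
   anchor distance) is an assumption that happens to hold for the
   small symmetric five stage example shown; it is not taken from this
   draft.  All function and node names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Adjacency = Dict[str, Set[str]]


def build_adjacency(links: Iterable[Tuple[str, str]]) -> Adjacency:
    """Build an undirected adjacency map from point-to-point links."""
    adj: Adjacency = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def hop_distances(adj: Adjacency, source: str) -> Dict[str, int]:
    """Breadth-first search giving the hop count from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for peer in adj[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def fabric_location(adj: Adjacency, anchor_a: str, anchor_b: str,
                    local: str) -> Dict[str, int]:
    """Distances used to place `local` relative to two T0 anchors."""
    from_a = hop_distances(adj, anchor_a)
    from_b = hop_distances(adj, anchor_b)
    total = from_a[anchor_b]  # total distance through the fabric
    to_a, to_b = from_a[local], from_b[local]
    return {
        "total": total,
        "to_a": to_a,
        "to_b": to_b,
        # Illustrative assumption only: tier 0 is the edge, and the tier
        # rises as the larger anchor distance shrinks.
        "tier_guess": total - max(to_a, to_b),
    }


# Hypothetical five stage fabric: leaves L1-L4, pod spines S1-S2, and
# one top-level spine X.  L1 and L3 are the marked anchors.
FABRIC = build_adjacency([
    ("L1", "S1"), ("L2", "S1"), ("L3", "S2"),
    ("L4", "S2"), ("S1", "X"), ("S2", "X"),
])

print(fabric_location(FABRIC, "L1", "L3", "X"))
# {'total': 4, 'to_a': 2, 'to_b': 2, 'tier_guess': 2}
```

   Using two anchors lets the sketch tell L2 (distances 2 and 4) from
   X (distances 2 and 2), which a single marked device at L1 could not
   do, since both are two hops from it.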