idnits 2.17.1 draft-ietf-lsvr-bgp-spf-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to contain a disclaimer for pre-RFC5378 work, but was first submitted on or after 10 November 2008. The disclaimer is usually necessary only for documents that revise or obsolete older RFCs, and that take significant amounts of text from those RFCs. If you can contact all authors of the source material and they are willing to grant the BCP78 rights to the IETF Trust, you can and should remove the disclaimer. Otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (September 27, 2018) is 2030 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC2328' is mentioned on line 831, but not defined == Missing Reference: 'RFC5286' is mentioned on line 865, but not defined == Missing Reference: 'RFC4456' is mentioned on line 835, but not defined == Missing Reference: 'RFC4915' is mentioned on line 860, but not defined == Missing Reference: 'RFC5549' is mentioned on line 870, but not defined ** Obsolete undefined reference: RFC 5549 (Obsoleted by RFC 8950) == Missing Reference: 'RFC4790' is mentioned on line 855, but not defined == Missing Reference: 'RFC5880' is mentioned on line 875, but not defined == Missing Reference: 'I-D.ietf-lsvr-applicability' is mentioned on line 825, but not defined == Missing Reference: 'RFC4760' is mentioned on line 850, but not defined == Missing Reference: 'RFC4750' is mentioned on line 845, but not defined == Missing Reference: 'RFC4724' is mentioned on line 840, but not defined == Outdated reference: A later version (-19) exists of draft-ietf-idr-bgpls-segment-routing-epe-15 ** Obsolete normative reference: RFC 7752 (Obsoleted by RFC 9552) ** Downref: Normative reference to an Informational RFC: RFC 7938 Summary: 3 errors (**), 0 flaws (~~), 14 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group K. Patel 3 Internet-Draft Arrcus, Inc. 4 Intended status: Standards Track A. Lindem 5 Expires: March 31, 2019 Cisco Systems 6 S. Zandi 7 Linkedin 8 W. 
Henderickx 9 Nokia 10 September 27, 2018 12 Shortest Path Routing Extensions for BGP Protocol 13 draft-ietf-lsvr-bgp-spf-03.txt 15 Abstract 17 Many Massively Scaled Data Centers (MSDCs) have converged on 18 simplified layer 3 routing. Furthermore, requirements for 19 operational simplicity have led many of these MSDCs to converge on 20 BGP as their single routing protocol for both their fabric routing 21 and their Data Center Interconnect (DCI) routing. This document 22 describes a solution which leverages BGP Link-State distribution and 23 the Shortest Path First (SPF) algorithm similar to Interior Gateway 24 Protocols (IGPs) such as OSPF. 26 Status of This Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at http://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on March 31, 2019. 43 Copyright Notice 45 Copyright (c) 2018 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents 50 (http://trustee.ietf.org/license-info) in effect on the date of 51 publication of this document. Please review these documents 52 carefully, as they describe your rights and restrictions with respect 53 to this document.
Code Components extracted from this document must 54 include Simplified BSD License text as described in Section 4.e of 55 the Trust Legal Provisions and are provided without warranty as 56 described in the Simplified BSD License. 58 This document may contain material from IETF Documents or IETF 59 Contributions published or made publicly available before November 60 10, 2008. The person(s) controlling the copyright in some of this 61 material may not have granted the IETF Trust the right to allow 62 modifications of such material outside the IETF Standards Process. 63 Without obtaining an adequate license from the person(s) controlling 64 the copyright in such materials, this document may not be modified 65 outside the IETF Standards Process, and derivative works of it may 66 not be created outside the IETF Standards Process, except to format 67 it for publication as an RFC or to translate it into languages other 68 than English.
70 Table of Contents
72 1. Introduction
73 1.1. BGP Shortest Path First (SPF) Motivation
74 1.2. Requirements Language
75 2. BGP Peering Models
76 2.1. BGP Single-Hop Peering on Network Node Connections
77 2.2. BGP Peering Between Directly Connected Network Nodes
78 2.3. BGP Peering in Route-Reflector or Controller Topology
79 3. BGP-LS Shortest Path Routing (SPF) SAFI
80 4. Extensions to BGP-LS
81 4.1. Node NLRI Usage and Modifications
82 4.2. Link NLRI Usage
83 4.2.1. BGP-LS Link NLRI Attribute Prefix-Length TLVs
84 4.3. Prefix NLRI Usage
85 4.4. BGP-LS Attribute Sequence-Number TLV
86 5. Decision Process with SPF Algorithm
87 5.1. Phase-1 BGP NLRI Selection
88 5.2. Dual Stack Support
89 5.3. SPF Calculation based on BGP-LS NLRI
90 5.4. NEXT_HOP Manipulation
91 5.5. IPv4/IPv6 Unicast Address Family Interaction
92 5.6. NLRI Advertisement and Convergence
93 5.7. Error Handling
94 6. IANA Considerations
95 7. Security Considerations
96 8. Management Considerations
97 8.1. Configuration
98 8.2. Operational Data
99 9. Acknowledgements
100 10. Contributors
101 11. References
102 11.1. Normative References
103 11.2. Informative References
104 Authors' Addresses
106 1. Introduction 108 Many Massively Scaled Data Centers (MSDCs) have converged on 109 simplified layer 3 routing. Furthermore, requirements for 110 operational simplicity have led many of these MSDCs to converge on 111 BGP [RFC4271] as their single routing protocol for both their fabric 112 routing and their Data Center Interconnect (DCI) routing. 113 Requirements and procedures for using BGP are described in [RFC7938]. 114 This document describes an alternative solution which leverages BGP- 115 LS [RFC7752] and the Shortest Path First algorithm similar to 116 Interior Gateway Protocols (IGPs) such as OSPF [RFC2328].
118 [RFC4271] defines the Decision Process that is used to select routes 119 for subsequent advertisement by applying the policies in the local 120 Policy Information Base (PIB) to the routes stored in its Adj-RIBs- 121 In. The output of the Decision Process is the set of routes that are 122 announced by a BGP speaker to its peers. These selected routes are 123 stored by a BGP speaker in the speaker's Adj-RIBs-Out according to 124 policy. 126 [RFC7752] describes a mechanism by which link-state and TE 127 information can be collected from networks and shared with external 128 components using BGP. This is achieved by defining NLRI advertised 129 within the BGP-LS/BGP-LS-SPF AFI/SAFI. The BGP-LS extensions defined 130 in [RFC7752] make use of the Decision Process defined in [RFC4271]. 132 This document augments [RFC7752] by replacing its use of the existing 133 Decision Process. Rather than reusing the BGP-LS SAFI, the BGP-LS- 134 SPF SAFI is introduced to ensure backward compatibility. The Phase 1 135 and 2 decision functions of the Decision Process are replaced with 136 the Shortest Path First (SPF) algorithm, also known as Dijkstra's 137 algorithm. The Phase 3 decision function is also simplified since it 138 is no longer dependent on the previous phases. This solution combines 139 the benefits of both BGP and SPF-based IGPs. These include TCP-based 140 flow-control, no periodic link-state refresh, and completely 141 incremental NLRI advertisement. These advantages can reduce the 142 overhead in MSDCs where there is a high degree of Equal Cost Multi- 143 Path (ECMP) and the topology is very stable. Additionally, using an 144 SPF-based computation can support fast convergence and the 145 computation of Loop-Free Alternatives (LFAs) [RFC5286] in the event 146 of link failures. Furthermore, a BGP-based solution lends itself to 147 multiple peering models including those incorporating route- 148 reflectors [RFC4456] or controllers.
150 Support for Multiple Topology Routing (MTR) as described in [RFC4915] 151 is an area for further study dependent on deployment requirements. 153 1.1. BGP Shortest Path First (SPF) Motivation 155 Given that [RFC7938] already describes how BGP could be used as the 156 sole routing protocol in an MSDC, one might question the motivation 157 for defining an alternate BGP deployment model when a mature solution 158 exists. For both alternatives, BGP offers the operational benefits 159 of a single routing protocol. However, BGP SPF offers some unique 160 advantages above and beyond standard BGP distance-vector routing. 162 A primary advantage is that all BGP speakers in the BGP SPF routing 163 domain will have a complete view of the topology. This will allow 164 support for ECMP, IP fast-reroute (e.g., Loop-Free Alternatives), 165 Shared Risk Link Groups (SRLGs), and other routing enhancements 166 without advertisement of additional BGP paths or other extensions. In 167 short, the advantages of an IGP such as OSPF [RFC2328] are made available in 168 BGP. 170 With the simplified BGP decision process as defined in Section 5.1, 171 NLRI changes can be disseminated throughout the BGP routing domain 172 much more rapidly (equivalent to IGPs with the proper 173 implementation). 175 Another primary advantage is a potential reduction in NLRI 176 advertisement. With standard BGP distance-vector routing, a single 177 link failure may impact hundreds or thousands of prefixes and result in the 178 withdrawal or re-advertisement of the attendant NLRI. With BGP SPF, 179 only the BGP speakers corresponding to the link NLRI need withdraw 180 the corresponding BGP-LS Link NLRI. This advantage will contribute 181 to both faster convergence and better scaling.
183 With controller and route-reflector peering models, BGP SPF 184 advertisement and distributed computation require a minimal number of 185 sessions and copies of the NLRI since only the latest version of the 186 NLRI from the originator is required. Given that verification of the 187 adjacencies is done outside of BGP (see Section 2), each BGP speaker 188 will only need as many sessions and copies of the NLRI as required 189 for redundancy (e.g., one for the SPF computation and another for 190 backup). Functions such as Optimized Route Reflection (ORR) are 191 supported without extension by virtue of the primary advantages. 193 Additionally, a controller could inject topology that is learned 194 outside the BGP routing domain. 196 Given that controllers are already consuming BGP-LS NLRI [RFC7752], 197 reusing BGP-LS NLRI for BGP SPF leverages the existing controller 198 implementations. 200 Another potential advantage of BGP SPF is that both IPv6 and IPv4 can 201 be supported in the same address family using the same topology. 202 Although not described in this version of the document, multi- 203 topology extensions can be used to support separate IPv4, IPv6, 204 unicast, and multicast topologies while sharing the same NLRI. 206 Finally, the BGP SPF topology can be used as an underlay for other 207 BGP address families (using the existing model) and realize all the 208 above advantages. A simplified peering model using IPv6 link-local 209 addresses as next-hops can be deployed similar to [RFC5549]. 211 1.2. Requirements Language 213 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 214 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 215 "OPTIONAL" in this document are to be interpreted as described in BCP 216 14 [RFC2119] [RFC8174] when, and only when, they appear in all 217 capitals, as shown here. 219 2.
BGP Peering Models 221 Depending on the requirements, scaling, and capabilities of the BGP 222 speakers, various peering models are supported. The only requirement 223 is that all BGP speakers in the BGP SPF routing domain receive link- 224 state NLRI on a timely basis, run an SPF calculation, and update 225 their data plane appropriately. The content of the Link NLRI is 226 described in Section 4.2. 228 2.1. BGP Single-Hop Peering on Network Node Connections 230 The simplest peering model is the one described in Section 5.2.1 of 231 [RFC7938]. In this model, EBGP single-hop sessions are established 232 over direct point-to-point links interconnecting the SPF domain 233 nodes. For the purposes of BGP SPF, Link NLRI is only advertised if 234 a single-hop BGP session has been established and the Link-State/SPF 235 address family capability has been exchanged [RFC4790] on the 236 corresponding session. If the session goes down, the corresponding 237 Link NLRI will be withdrawn. Topologically, this would be equivalent 238 to the peering model in [RFC7938] where there is a BGP session on 239 every link in the data center switch fabric. 241 2.2. BGP Peering Between Directly Connected Network Nodes 243 In this model, BGP speakers peer with all directly connected network 244 nodes but the sessions may be multi-hop and the direct connection 245 discovery and liveness detection for those connections are 246 independent of the BGP protocol. How this is accomplished is outside 247 the scope of this document. Consequently, there will be a single 248 session even if there are multiple direct connections between BGP 249 speakers. For the purposes of BGP SPF, Link NLRI is advertised as 250 long as a BGP session has been established, the Link-State/SPF 251 address family capability has been exchanged [RFC4790], and the 252 corresponding link is up and considered operational.
253 This is much like the previous peering model, except that peering is on a 254 single loopback address and the switch fabric links can be 255 unnumbered. However, there will be the same number of sessions as 256 with the previous peering model unless there are parallel links 257 between switches in the fabric. 259 2.3. BGP Peering in Route-Reflector or Controller Topology 261 In this model, BGP speakers peer solely with one or more Route 262 Reflectors [RFC4456] or controllers. As in the previous model, 263 direct connection discovery and liveness detection for those 264 connections are done outside the BGP protocol. More specifically, 265 liveness detection is done using the BFD protocol described in 266 [RFC5880]. For the purposes of BGP SPF, Link NLRI is advertised as 267 long as the corresponding link is up and considered operational. 269 This peering model, known as sparse peering, allows for far fewer 270 BGP sessions and, consequently, fewer instances of the same NLRI received 271 from multiple peers. It is discussed in greater detail in 272 [I-D.ietf-lsvr-applicability]. 274 3. BGP-LS Shortest Path Routing (SPF) SAFI 276 In order to replace the Phase 1 and 2 decision functions of the 277 existing Decision Process with an SPF-based Decision Process and 278 streamline the Phase 3 decision functions in a backward compatible 279 manner, this draft introduces the BGP-LS-SPF SAFI for BGP-LS SPF 280 operation. The BGP-LS-SPF (AFI 16388 / SAFI TBD1) [RFC4790] is 281 allocated by IANA as specified in Section 6. A BGP speaker using 282 the BGP-LS SPF extensions described herein MUST exchange the AFI/SAFI 283 using Multiprotocol Extensions Capability Code [RFC4760] with other 284 BGP speakers in the SPF routing domain. 286 4. Extensions to BGP-LS 288 [RFC7752] describes a mechanism by which link-state and TE 289 information can be collected from networks and shared with external 290 components using the BGP protocol.
It describes both the definition of 291 BGP-LS NLRI that describes links, nodes, and prefixes comprising IGP 292 link-state information and the definition of a BGP path attribute 293 (BGP-LS attribute) that carries link, node, and prefix properties and 294 attributes, such as the link and prefix metric or auxiliary Router- 295 IDs of nodes, etc. 297 The BGP protocol will be used in the Protocol-ID field specified in 298 table 1 of [I-D.ietf-idr-bgpls-segment-routing-epe]. The local and 299 remote node descriptors for all NLRI will be the BGP Router-ID (TLV 300 516) and either the AS Number (TLV 512) [RFC7752] or the BGP 301 Confederation Member (TLV 517) [RFC8402]. However, if the BGP 302 Router-ID is known to be unique within the BGP Routing domain, it can 303 be used as the sole descriptor. 305 4.1. Node NLRI Usage and Modifications 307 The SPF capability is a new Node Attribute TLV that will be added to 308 those defined in table 7 of [RFC7752]. The new attribute TLV will 309 only be applicable when BGP is specified in the Node NLRI Protocol ID 310 field. The TBD TLV type will be defined by IANA. The new Node 311 Attribute TLV will contain a single-octet SPF algorithm as defined in 312 [RFC8402].
314      0                   1                   2                   3
315      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
316     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
317     |             Type              |            Length             |
318     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
319     | SPF Algorithm |
320     +-+-+-+-+-+-+-+-+
322 The SPF Algorithm may take the following values: 324 0 - Normal Shortest Path First (SPF) algorithm based on link 325 metric. This is the standard shortest path algorithm as 326 computed by the IGP protocol. Consistent with the deployed 327 practice for link-state protocols, Algorithm 0 permits any 328 node to overwrite the SPF path with a different path based on 329 its local policy. 330 1 - Strict Shortest Path First (SPF) algorithm based on link 331 metric.
The algorithm is identical to Algorithm 0 but Algorithm 332 1 requires that all nodes along the path honor the SPF 333 routing decision. Local policy at the node claiming support for 334 Algorithm 1 MUST NOT alter the SPF paths computed by Algorithm 1. 336 Note that the Strict Shortest Path First (SPF) algorithm is 337 defined in the IGP algorithm registry, but its usage is restricted to 338 [I-D.ietf-idr-bgpls-segment-routing-epe]. Hence, its usage for BGP- 339 LS SPF is out of scope. 341 When computing the SPF for a given BGP routing domain, only BGP nodes 342 advertising the SPF capability attribute will be included in the 343 Shortest Path Tree (SPT). 345 4.2. Link NLRI Usage 347 The criteria for advertisement of Link NLRI are discussed in 348 Section 2. 350 Link NLRI is advertised with local and remote node descriptors as 351 described above and unique link identifiers dependent on the 352 addressing. For IPv4 links, the link's local IPv4 (TLV 259) and 353 remote IPv4 (TLV 260) addresses will be used. For IPv6 links, the 354 local IPv6 (TLV 261) and remote IPv6 (TLV 262) addresses will be 355 used. For unnumbered links, the link local/remote identifiers (TLV 356 258) will be used. For links having both IPv4 and IPv6 357 addresses, both sets of descriptors may be included in the same Link 358 NLRI. The link identifiers are described in table 5 of [RFC7752]. 360 The link IGP metric attribute TLV (TLV 1095) as well as any others 361 required for non-SPF purposes SHOULD be advertised. Algorithms such 362 as setting the metric inversely to the link speed as done in the OSPF 363 MIB [RFC4750] MAY be supported. However, this is beyond the scope of 364 this document. 366 4.2.1. BGP-LS Link NLRI Attribute Prefix-Length TLVs 368 Two BGP-LS Attribute TLVs to BGP-LS Link NLRI are defined to 369 advertise the prefix length associated with the IPv4 and IPv6 link 370 prefixes.
The prefix length is used for the optional installation of 371 prefixes corresponding to Link NLRI as defined in Section 5.3.
373      0                   1                   2                   3
374      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
375     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
376     |     TBD IPv4 or IPv6 Type     |            Length             |
377     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
378     | Prefix-Length |
379     +-+-+-+-+-+-+-+-+
381 Prefix-length - A one-octet length restricted to 1-32 for IPv4 382 Link NLRI endpoint prefixes and 1-128 for IPv6 383 Link NLRI endpoint prefixes. 385 4.3. Prefix NLRI Usage 387 Prefix NLRI is advertised with a local node descriptor as described 388 above and the prefix and length used as the descriptors (TLV 265) as 389 described in [RFC7752]. The prefix metric attribute TLV (TLV 1155) 390 as well as any others required for non-SPF purposes SHOULD be 391 advertised. For loopback prefixes, the metric should be 0. For non- 392 loopback prefixes, the setting of the metric is a local matter and 393 beyond the scope of this document. 395 4.4. BGP-LS Attribute Sequence-Number TLV 397 A new BGP-LS Attribute TLV to BGP-LS NLRI types is defined to assure 398 the most recent version of a given NLRI is used in the SPF 399 computation. The TBD TLV type will be defined by IANA. The new BGP- 400 LS Attribute TLV will contain an 8-octet sequence number. The usage 401 of the Sequence Number TLV is described in Section 5.1.
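As an illustration only, the Sequence-Number TLV introduced above (its layout is shown in the figure that follows) could be packed and parsed as in the Python sketch below. The sketch assumes the usual BGP-LS TLV header of a 2-octet type and a 2-octet length [RFC7752]. The type code 0xFFFF is a placeholder chosen for this example, since the real value is still TBD, and the function names are hypothetical.

```python
import struct

# Placeholder only: the real TLV type is "TBD" and will be assigned by IANA.
SEQUENCE_NUMBER_TLV_TYPE = 0xFFFF


def encode_sequence_tlv(seq, tlv_type=SEQUENCE_NUMBER_TLV_TYPE):
    """Pack a 64-bit sequence number as Type (2) | Length (2) | Value (8)."""
    if not 0 <= seq < 2 ** 64:
        raise ValueError("sequence number must fit in 64 bits")
    # "!" selects network byte order with no padding.
    return struct.pack("!HHQ", tlv_type, 8, seq)


def decode_sequence_tlv(data):
    """Return (type, sequence number) from a 12-octet Sequence-Number TLV."""
    tlv_type, length, seq = struct.unpack("!HHQ", data)
    if length != 8:
        raise ValueError("Sequence-Number TLV value must be 8 octets")
    return tlv_type, seq
```

Treating the value as a single unsigned 64-bit integer keeps comparison between two versions of the same NLRI to one integer comparison.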
403      0                   1                   2                   3
404      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
405     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
406     |             Type              |            Length             |
407     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
408     |             Sequence Number (High-Order 32 Bits)              |
409     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
410     |              Sequence Number (Low-Order 32 Bits)              |
411     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
413 Sequence Number 415 The 64-bit strictly increasing sequence number is incremented for 416 every version of BGP-LS NLRI originated. BGP speakers implementing 417 this specification MUST use available mechanisms to preserve the 418 sequence number's strictly increasing property for the deployed life 419 of the BGP speaker (including cold restarts). One mechanism for 420 accomplishing this would be to use the high-order 32 bits of the 421 sequence number as a wrap/boot count that is incremented any time the 422 BGP router loses its sequence number state or the low-order 32 bits 423 wrap. 425 When incrementing the sequence number for each self-originated NLRI, 426 the sequence number should be treated as an unsigned 64-bit value. 427 If the lower-order 32-bit value wraps, the higher-order 32-bit value 428 should be incremented and saved in non-volatile storage. If by some 429 chance the BGP Speaker is deployed long enough that there is a 430 possibility that the 64-bit sequence number may wrap or a BGP Speaker 431 completely loses its sequence number state (e.g., the BGP speaker 432 hardware is replaced or experiences a cold-start), the phase 1 433 decision function (see Section 5.1) rules will ensure convergence, 434 albeit not immediately. 436 5. Decision Process with SPF Algorithm 438 The Decision Process described in [RFC4271] takes place in three 439 distinct phases.
The Phase 1 decision function of the Decision 440 Process is responsible for calculating the degree of preference for 441 each route received from a BGP speaker's peer. The Phase 2 decision 442 function is invoked on completion of the Phase 1 decision function 443 and is responsible for choosing the best route out of all those 444 available for each distinct destination, and for installing each 445 chosen route into the Loc-RIB. The combination of the Phase 1 and 2 446 decision functions is characterized as a Path Vector algorithm. 448 The SPF-based Decision Process replaces the BGP best-path Decision 449 Process described in [RFC4271]. This process starts with selecting 450 only those Node NLRI whose SPF capability TLV matches the local 451 BGP speaker's SPF capability TLV value. Since Link-State NLRI always 452 contains the local descriptor [RFC7752], it will only be originated 453 by a single BGP speaker in the BGP routing domain. These selected 454 Node NLRI and their Link/Prefix NLRI are used to build a directed 455 graph during the SPF computation. The best paths for BGP prefixes 456 are installed as a result of the SPF process. 458 When BGP-LS-SPF NLRI is received, all that is required is to 459 determine whether it is the best-path by examining the Node-ID and 460 sequence number as described in Section 5.1. If the received best- 461 path NLRI has changed, it will be advertised to other BGP-LS-SPF 462 peers. If the attributes have changed (other than the sequence 463 number), a BGP SPF calculation will be scheduled. However, a changed 464 NLRI MAY be advertised to other peers almost immediately and 465 propagation of changes can approach IGP convergence times. To 466 accomplish this, the MinRouteAdvertisementIntervalTimer and 467 MinASOriginationIntervalTimer [RFC4271] are not applicable to the 468 BGP-LS-SPF SAFI. Rather, SPF calculations SHOULD be triggered and 469 dampened consistent with the SPF backoff algorithm specified in 470 [RFC8405].
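The receive-side handling described above, combined with the three selection rules of Section 5.1, can be sketched in Python as follows. This is a simplified illustration, not a definitive implementation: the ReceivedNlri record, its field names, and the process() helper are hypothetical, Adj-RIB-In bookkeeping is omitted, and a newly received copy is compared only against the currently selected copy of the same NLRI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReceivedNlri:
    key: str               # hypothetical NLRI key (type plus descriptors)
    originator_id: int     # BGP Router ID from the NLRI node descriptors
    peer_id: int           # BGP Router ID of the speaker it was received from
    seq: Optional[int]     # Sequence-Number TLV value, None if absent
    attrs: tuple = ()      # remaining BGP-LS attribute TLVs


def preference(n):
    """Order copies of one NLRI by rules 1-3 of Section 5.1."""
    from_originator = n.peer_id == n.originator_id      # rule 1
    has_seq = n.seq is not None                          # rule 2
    return (from_originator, has_seq, n.seq if has_seq else 0, n.peer_id)


def process(best, new):
    """Return (selected, readvertise, schedule_spf) for a received copy."""
    if best is not None and preference(new) <= preference(best):
        return best, False, False          # existing copy stays selected
    # An SPF run is needed only if something besides the sequence number changed.
    attrs_changed = best is None or new.attrs != best.attrs
    return new, True, attrs_changed
```

Encoding the rules as a tuple lets Python's lexicographic comparison apply them in order: originator first, then sequence number, then Router ID.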
472 The Phase 3 decision function of the Decision Process [RFC4271] is 473 also simplified since under normal SPF operation, a BGP speaker would 474 advertise the NLRI selected for the SPF to all BGP peers with the 475 BGP-LS/BGP-LS-SPF AFI/SAFI. Application of policy would not be 476 prevented; however, its effect on the best-path process would be limited, as 477 the SPF relies solely on link metrics. 479 5.1. Phase-1 BGP NLRI Selection 481 The rules for NLRI selection are greatly simplified from [RFC4271]. 483 1. If the NLRI is received from the BGP speaker originating the NLRI 484 (as determined by comparing the BGP Router ID in the NLRI Node 485 identifiers with the BGP speaker Router ID), then it is preferred 486 over the same NLRI from non-originators. This rule will assure 487 that stale NLRI is updated even if a BGP-LS router loses its 488 sequence number state due to a cold-start. 490 2. If the Sequence-Number TLV is present in the BGP-LS Attribute, 491 then the NLRI with the most recent, i.e., highest sequence number 492 is selected. BGP-LS NLRI with a Sequence-Number TLV will be 493 considered more recent than NLRI without a BGP-LS Attribute or a 494 BGP-LS Attribute that doesn't include the Sequence-Number TLV. 496 3. The final tie-breaker is the NLRI from the BGP Speaker with the 497 numerically largest BGP Router ID. 499 When a BGP speaker completely loses its sequence number state, e.g., 500 due to a cold start, or in the unlikely event that the 501 sequence number wraps, the BGP routing domain will still converge. 502 This is due to the fact that BGP speakers adjacent to the router will 503 always accept self-originated NLRI from the associated speaker as 504 more recent (rule # 1). When a BGP speaker reestablishes a connection 505 with its peers, any existing session will be taken down, and stale 506 NLRI will be replaced by the new NLRI and 507 discarded independent of whether or not BGP graceful restart is 508 deployed [RFC4724].
The adjacent BGP speakers will update their NLRI 509 advertisements in turn until the BGP routing domain has converged. 511 The modified SPF Decision Process performs an SPF calculation rooted 512 at the BGP speaker using the metrics from Link and Prefix NLRI 513 Attribute TLVs [RFC7752]. As a result, any attributes that would 514 influence the Decision Process defined in [RFC4271], such as the ORIGIN, 515 MULTI_EXIT_DISC, and LOCAL_PREF attributes, are ignored by the SPF 516 algorithm. Furthermore, the NEXT_HOP attribute value is preserved 517 but otherwise ignored during the SPF or best-path calculation. 519 5.2. Dual Stack Support 521 The SPF-based decision process operates on Node, Link, and Prefix 522 NLRIs that support both IPv4 and IPv6 addresses. Whether to run a 523 single SPF instance or multiple SPF instances for separate AFs is a 524 local implementation matter. Normally, IPv4 next-hops are 525 calculated for IPv4 prefixes and IPv6 next-hops are calculated for 526 IPv6 prefixes. However, an interesting use-case is deployment of 527 [RFC5549] where IPv6 next-hops are calculated for both IPv4 and IPv6 528 prefixes. As stated in Section 1, support for Multiple Topology 529 Routing (MTR) is an area for future study. 531 5.3. SPF Calculation based on BGP-LS NLRI 533 This section details the BGP-LS SPF local routing information base 534 (RIB) calculation. The router will use BGP-LS Node, Link, and Prefix 535 NLRI to populate the local RIB using the following algorithm. This 536 calculation yields the set of intra-area routes associated with the 537 BGP-LS domain. A router calculates the shortest-path tree using 538 itself as the root. Variations and optimizations of the algorithm 539 are valid as long as it yields the same set of routes. The algorithm 540 below supports Equal Cost Multi-Path (ECMP) routes. Weighted Unequal 541 Cost Multi-Path routes are out of scope. The organization of this section 542 owes heavily to section 16 of [RFC2328].
544 The following abstract data structures are defined in order to 545 specify the algorithm. 547 o Local Route Information Base (RIB) - This abstract structure contains 548 reachability information (i.e., next hops) for all prefixes (both 549 IPv4 and IPv6) as well as the Node NLRI reachability. 550 Implementations may choose to implement this as separate RIBs for 551 each address family and/or Node NLRI. 553 o Link State NLRI Database (LSNDB) - Database of BGP-LS NLRI that 554 facilitates access to all Node, Link, and Prefix NLRI as well as 555 all the Link and Prefix NLRI corresponding to a given Node NLRI. 556 Other optimizations, such as resolving bi-directional connectivity 557 associations between Link NLRI, are possible but out of scope of this 558 document. 560 o Candidate List - This is a list of candidate Node NLRI with the 561 lowest cost Node NLRI at the front of the list. It is typically 562 implemented as a heap but other concrete data structures have also 563 been used. 565 The algorithm is comprised of the steps below: 567 1. The current local RIB is invalidated. The local RIB is built 568 again from scratch. The existing routing entries are preserved 569 for comparison to determine changes that need to be installed in 570 the global RIB. 572 2. The computing router's Node NLRI is installed in the local RIB 573 with a cost of 0 and as the sole entry in the candidate list. 575 3. The Node NLRI with the lowest cost is removed from the candidate 576 list for processing. The Node corresponding to this NLRI will be 577 referred to as the Current Node. If the candidate list is empty, 578 the SPF calculation has completed and the algorithm proceeds to 579 step 6. 581 4. All the Prefix NLRI with the same Node Identifiers as the Current 582 Node will be considered for installation. The cost for each 583 prefix is the metric advertised in the Prefix NLRI added to the 584 cost to reach the Current Node.
586 * If the prefix is not in the local RIB, the prefix is installed 587 and will inherit the Current Node's next hops. 589 * If the prefix is in the local RIB and the cost is greater than 590 the current route's metric, the Prefix NLRI does not 591 contribute to the route and is ignored. 593 * If the prefix is in the local RIB and the cost is less than 594 the current route's metric, the Prefix is installed with the 595 Current Node's next-hops replacing the local RIB route's next-hops 596 and the metric being updated. 598 * If the prefix is in the local RIB and the cost is the same as the 599 current route's metric, the Prefix is installed with the 600 Current Node's next-hops being merged with the local RIB route's 601 next-hops. 603 5. All the Link NLRI with the same Node Identifiers as the Current 604 Node will be considered for installation. Each link will be 605 examined and will be referred to in the following text as the 606 Current Link. The cost of the Current Link is the advertised 607 metric in the Link NLRI added to the cost to reach the Current 608 Node. 610 * Optionally, the prefix(es) associated with the Current Link 611 are installed into the local RIB using the same rules as were 612 used for Prefix NLRI in the previous steps. 614 * The Current Link's endpoint Node NLRI is accessed (i.e., the 615 Node NLRI with the same Node identifiers as the Link 616 endpoint). If it exists, it will be referred to as the 617 Endpoint Node NLRI and the algorithm will proceed as follows: 619 + All the Link NLRI corresponding to the Endpoint Node NLRI will 620 be searched for a back-link NLRI pointing to the Current 621 Node. Both the Node identifiers and the Link endpoint 622 identifiers in the Endpoint Node's Link NLRI must match for 623 a Link NLRI to qualify as the back-link. If there is no such Link NLRI 624 corresponding to the Endpoint Node NLRI, the Endpoint Node 625 NLRI fails the bi-directional connectivity test and is not 626 processed further.
628 + If the Endpoint Node NLRI is not on the candidate list, it 629 is inserted based on the link cost and BGP Identifier (the 630 latter being used as a tie-breaker). 632 + If the Endpoint Node NLRI is already on the candidate list 633 with a lower cost, it need not be inserted again. 635 + If the Endpoint Node NLRI is already on the candidate list 636 with a higher cost, it must be removed and reinserted with 637 a lower cost. 639 * Return to step 3 to process the next lowest cost Node NLRI on 640 the candidate list. 642 6. The local RIB is examined and changes (adds, deletes, 643 modifications) are installed into the global RIB. 645 5.4. NEXT_HOP Manipulation 647 A BGP speaker that supports SPF extensions MAY interact with peers 648 that don't support SPF extensions. If the BGP-LS address family is 649 advertised to a peer not supporting the SPF extensions described 650 herein, then the BGP speaker MUST conform to the NEXT_HOP rules 651 specified in [RFC4271] when announcing the Link-State address family 652 routes to those peers. 654 All BGP peers that support SPF extensions would locally compute the 655 Loc-RIB next-hops as a result of the SPF process. Consequently, the 656 NEXT_HOP attribute is always ignored on receipt. However, BGP 657 speakers SHOULD set the NEXT_HOP address according to the NEXT_HOP 658 attribute rules specified in [RFC4271]. 660 5.5. IPv4/IPv6 Unicast Address Family Interaction 662 While the BGP-LS SPF address family and the IPv4/IPv6 unicast address 663 families install routes into the same device routing tables, they 664 will operate independently much the same as OSPF and IS-IS would 665 operate today (i.e., "Ships-in-the-Night" mode). There will be no 666 implicit route redistribution between the BGP address families. 
667 However, implementation-specific redistribution mechanisms SHOULD be 668 made available with the restriction that redistribution of BGP-LS SPF 669 routes into the IPv4 address family applies only to IPv4 routes and 670 redistribution of BGP-LS SPF routes into the IPv6 address family 671 applies only to IPv6 routes. 673 Given that SPF algorithms are based on the assumption that 674 all routers in the routing domain calculate precisely the same 675 SPF tree and install the same set of routes, it is RECOMMENDED that 676 BGP-LS SPF IPv4/IPv6 routes be given priority by default when 677 installed into their respective RIBs. In common implementations the 678 prioritization is governed by route preference or administrative 679 distance with lower being more preferred. 681 5.6. NLRI Advertisement and Convergence 683 A local failure will prevent a link from being used in the SPF 684 calculation due to the IGP bi-directional connectivity requirement. 685 Consequently, local link failures should always be given priority 686 over updates (e.g., withdrawing all routes learned on a session) in 687 order to ensure the highest priority propagation and optimal 688 convergence. 690 Delaying the withdrawal of non-local routes is an area for further 691 study as more IGP-like mechanisms would be required to prevent usage 692 of stale NLRI. 694 5.7. Error Handling 696 When a BGP speaker receives a BGP Update containing a malformed SPF 697 Capability TLV in the Node NLRI BGP-LS Attribute [RFC7752], it MUST 698 ignore the received TLV and the Node NLRI and not pass it to other 699 BGP peers as specified in [RFC7606]. When discarding a Node NLRI 700 with a malformed TLV, a BGP speaker SHOULD log an error for further 701 analysis. 703 6. IANA Considerations 705 This document defines an AFI/SAFI for BGP-LS SPF operation and 706 requests IANA to assign the BGP-LS/BGP-LS-SPF (AFI 16388 / SAFI TBD1) 707 as described in [RFC4750].
709 This document also defines four attribute TLVs for BGP-LS NLRI. We 710 request IANA to assign TLVs for the SPF capability, Sequence Number, 711 IPv4 Link Prefix-Length, and IPv6 Link Prefix-Length from the "BGP-LS 712 Node Descriptor, Link Descriptor, Prefix Descriptor, and Attribute 713 TLVs" Registry. 715 7. Security Considerations 717 This extension to BGP does not change the underlying security issues 718 inherent in the existing [RFC4271], [RFC4724], and [RFC7752]. 720 8. Management Considerations 722 This section includes unique management considerations for the BGP-LS 723 SPF address family. 725 8.1. Configuration 727 In addition to configuration of the BGP-LS SPF address family, 728 implementations SHOULD support the configuration of the 729 INITIAL_SPF_DELAY, SHORT_SPF_DELAY, LONG_SPF_DELAY, TIME_TO_LEARN, 730 and HOLDDOWN_INTERVAL as documented in [RFC8405]. 732 8.2. Operational Data 734 In order to troubleshoot SPF issues, implementations SHOULD support 735 an SPF log including entries for previous SPF computations. Each SPF 736 log entry would include the BGP-LS NLRI triggering the SPF, SPF 737 scheduled time, SPF start time, SPF end time, and SPF type if 738 different types of SPF are supported. Since the size of the log will 739 be finite, implementations SHOULD also maintain counters for the 740 total number of SPF computations of each type and the total number of 741 SPF triggering events. Additionally, to troubleshoot SPF scheduling 742 and backoff [RFC8405], the current SPF backoff state, remaining time-to-learn, 743 remaining holddown, last trigger event time, last SPF time, 744 and next SPF time should be available. 746 9. Acknowledgements 748 The authors would like to thank Sue Hares, Jorge Rabadan, Boris 749 Hassanov, Dan Frost, and Fred Baker for their review and comments. 751 10. Contributors 753 In addition to the authors listed on the front page, the following 754 co-authors have contributed to the document.
756 Derek Yeung 757 Arrcus, Inc. 758 derek@arrcus.com 760 Gunter Van De Velde 761 Nokia 762 gunter.van_de_velde@nokia.com 764 Abhay Roy 765 Cisco Systems 766 akr@cisco.com 768 Venu Venugopal 769 Cisco Systems 770 venuv@cisco.com 772 11. References 774 11.1. Normative References 776 [I-D.ietf-idr-bgpls-segment-routing-epe] 777 Previdi, S., Filsfils, C., Patel, K., Ray, S., and J. 778 Dong, "BGP-LS extensions for Segment Routing BGP Egress 779 Peer Engineering", draft-ietf-idr-bgpls-segment-routing-epe-15 780 (work in progress), March 2018. 782 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 783 Requirement Levels", BCP 14, RFC 2119, 784 DOI 10.17487/RFC2119, March 1997. 787 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 788 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 789 DOI 10.17487/RFC4271, January 2006. 792 [RFC7606] Chen, E., Ed., Scudder, J., Ed., Mohapatra, P., and K. 793 Patel, "Revised Error Handling for BGP UPDATE Messages", 794 RFC 7606, DOI 10.17487/RFC7606, August 2015. 797 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 798 S. Ray, "North-Bound Distribution of Link-State and 799 Traffic Engineering (TE) Information Using BGP", RFC 7752, 800 DOI 10.17487/RFC7752, March 2016. 803 [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of 804 BGP for Routing in Large-Scale Data Centers", RFC 7938, 805 DOI 10.17487/RFC7938, August 2016. 808 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 809 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 810 May 2017. 812 [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., 813 Decraene, B., Litkowski, S., and R. Shakir, "Segment 814 Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, 815 July 2018. 817 [RFC8405] Decraene, B., Litkowski, S., Gredler, H., Lindem, A., 818 Francois, P., and C.
Bowers, "Shortest Path First (SPF) 819 Back-Off Delay Algorithm for Link-State IGPs", RFC 8405, 820 DOI 10.17487/RFC8405, June 2018. 823 11.2. Informative References 825 [I-D.ietf-lsvr-applicability] 826 Patel, K., Lindem, A., Zandi, S., and G. Dawra, "Usage and 827 Applicability of Link State Vector Routing in Data 828 Centers", draft-ietf-lsvr-applicability-00 (work in 829 progress), July 2018. 831 [RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328, 832 DOI 10.17487/RFC2328, April 1998. 835 [RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route 836 Reflection: An Alternative to Full Mesh Internal BGP 837 (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006. 840 [RFC4724] Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y. 841 Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724, 842 DOI 10.17487/RFC4724, January 2007. 845 [RFC4750] Joyal, D., Ed., Galecki, P., Ed., Giacalone, S., Ed., 846 Coltun, R., and F. Baker, "OSPF Version 2 Management 847 Information Base", RFC 4750, DOI 10.17487/RFC4750, 848 December 2006. 850 [RFC4760] Bates, T., Chandra, R., Katz, D., and Y. Rekhter, 851 "Multiprotocol Extensions for BGP-4", RFC 4760, 852 DOI 10.17487/RFC4760, January 2007. 855 [RFC4790] Newman, C., Duerst, M., and A. Gulbrandsen, "Internet 856 Application Protocol Collation Registry", RFC 4790, 857 DOI 10.17487/RFC4790, March 2007. 860 [RFC4915] Psenak, P., Mirtorabi, S., Roy, A., Nguyen, L., and P. 861 Pillay-Esnault, "Multi-Topology (MT) Routing in OSPF", 862 RFC 4915, DOI 10.17487/RFC4915, June 2007. 865 [RFC5286] Atlas, A., Ed. and A. Zinin, Ed., "Basic Specification for 866 IP Fast Reroute: Loop-Free Alternates", RFC 5286, 867 DOI 10.17487/RFC5286, September 2008. 870 [RFC5549] Le Faucheur, F. and E. Rosen, "Advertising IPv4 Network 871 Layer Reachability Information with an IPv6 Next Hop", 872 RFC 5549, DOI 10.17487/RFC5549, May 2009. 875 [RFC5880] Katz, D. and D.
Ward, "Bidirectional Forwarding Detection 876 (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010. 879 Authors' Addresses 881 Keyur Patel 882 Arrcus, Inc. 884 Email: keyur@arrcus.com 886 Acee Lindem 887 Cisco Systems 888 301 Midenhall Way 889 Cary, NC 27513 890 USA 892 Email: acee@cisco.com 894 Shawn Zandi 895 Linkedin 896 222 2nd Street 897 San Francisco, CA 94105 898 USA 900 Email: szandi@linkedin.com 902 Wim Henderickx 903 Nokia 904 Antwerp 905 Belgium 907 Email: wim.henderickx@nokia.com