Network Working Group                                  A. Bashandy, Ed.
Internet Draft                                   Individual Contributor
Intended status: Informational                               C. Filsfils
Expires: February 2021                                     Cisco Systems
                                                            P. Mohapatra
                                                         Sproute Networks
                                                          August 11, 2020

                  BGP Prefix Independent Convergence
                    draft-ietf-rtgwg-bgp-pic-12.txt

Abstract

   In a network comprising thousands of iBGP peers exchanging millions
   of routes, many routes are reachable via more than one next-hop.
   Given the large scaling targets, it is desirable to restore traffic
   after a failure in a time period that does not depend on the number
   of BGP prefixes. This document proposes an architecture by which
   traffic can be re-routed to ECMP or pre-calculated backup paths in
   a timeframe that does not depend on the number of BGP prefixes. The
   objective is achieved by organizing the forwarding data structures
   in a hierarchical manner and sharing forwarding elements among the
   maximum possible number of routes. The proposed technique achieves
   prefix independent convergence while ensuring incremental
   deployment, complete automation, and zero management and
   provisioning effort. Note that the benefits of BGP-PIC hinge on the
   existence of more than one path, whether as ECMP or primary-backup.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."
46 The list of current Internet-Drafts can be accessed at 47 http://www.ietf.org/ietf/1id-abstracts.txt 48 The list of Internet-Draft Shadow Directories can be accessed at 49 http://www.ietf.org/shadow.html 51 This Internet-Draft will expire on February 11, 2021. 53 Copyright Notice 55 Copyright (c) 2020 IETF Trust and the persons identified as the 56 document authors. All rights reserved. 58 This document is subject to BCP 78 and the IETF Trust's Legal 59 Provisions Relating to IETF Documents 60 (http://trustee.ietf.org/license-info) in effect on the date of 61 publication of this document. Please review these documents 62 carefully, as they describe your rights and restrictions with 63 respect to this document. Code Components extracted from this 64 document must include Simplified BSD License text as described in 65 Section 4.e of the Trust Legal Provisions and are provided without 66 warranty as described in the Simplified BSD License. 68 Table of Contents 70 1. Introduction...................................................3 71 1.1. Terminology...............................................3 72 2. Overview.......................................................5 73 2.1. Dependency................................................6 74 2.1.1. Hierarchical Hardware FIB............................6 75 2.1.2. Availability of more than one BGP next-hops..........6 76 2.2. BGP-PIC Illustration......................................7 77 3. Constructing the Shared Hierarchical Forwarding Chain..........9 78 3.1. Constructing the BGP-PIC forwarding Chain.................9 79 3.2. Example: Primary-Backup Path Scenario....................10 80 4. Forwarding Behavior...........................................11 81 5. Handling Platforms with Limited Levels of Hierarchy...........12 82 5.1. Flattening the Forwarding Chain..........................12 83 5.2. Example: Flattening a forwarding chain...................14 84 6. Forwarding Chain Adjustment at a Failure......................21 85 6.1. BGP-PIC core.............................................22 86 6.2. BGP-PIC edge.............................................23 87 6.2.1. Adjusting forwarding Chain in egress node failure...23 88 6.2.2. Adjusting Forwarding Chain on PE-CE link Failure....23 89 6.3. Handling Failures for Flattened Forwarding Chains........24 90 7. Properties....................................................25 91 7.1. Coverage.................................................25 92 7.1.1. A remote failure on the path to a BGP next-hop......25 93 7.1.2. A local failure on the path to a BGP next-hop.......25 94 7.1.3. A remote iBGP next-hop fails........................26 95 7.1.4. A local eBGP next-hop fails.........................26 97 7.2. Performance..............................................26 98 7.3. Automated................................................27 99 7.4. Incremental Deployment...................................27 100 8. Security Considerations.......................................27 101 9. IANA Considerations...........................................27 102 10. Conclusions..................................................27 103 11. References...................................................28 104 11.1. Normative References....................................28 105 11.2. Informative References..................................28 106 12. Acknowledgments..............................................29 107 Appendix A. Perspective..........................................30 109 1. 
Introduction

   As a path vector protocol, BGP propagates reachability serially.
   Hence BGP convergence speed is limited by the time taken to
   serially propagate reachability information from the point of
   failure to the device that must re-converge. BGP speakers exchange
   reachability information about prefixes [1][2] and, for labeled
   address families, namely AFI/SAFI 1/4, 2/4, 1/128, and 2/128, an
   edge router assigns local labels to prefixes and associates the
   local label with each advertised prefix, such as L3VPN [7], 6PE
   [8], and Softwire [6], using the BGP labeled unicast technique [3].
   A BGP speaker then applies the path selection steps to choose the
   best path. In modern networks, it is not uncommon to have a prefix
   reachable via multiple edge routers. In addition to proprietary
   techniques, multiple techniques have been proposed to allow BGP to
   advertise more than one path for a given prefix [5][10][11],
   whether in the form of equal cost multipath or primary-backup.
   Another common and widely deployed scenario is L3VPN with
   multi-homed VPN sites with unique Route Distinguishers. It is
   advantageous to utilize the commonality among paths used by NLRIs
   to significantly improve convergence in case of topology
   modifications.

   This document proposes a hierarchical and shared forwarding chain
   organization that allows traffic to be restored to a pre-calculated
   alternative equal cost primary path or backup path in a time
   period that does not depend on the number of BGP prefixes. The
   technique relies on internal router behavior that is completely
   transparent to the operator and can be incrementally deployed and
   enabled with zero operator intervention.

1.1. Terminology

   This section defines the terms used in this document. For ease of
   use, we will use terms similar to those used by L3VPN [7].

   o BGP prefix: A prefix P/m (of any AFI/SAFI) that a BGP speaker
     has a path for.

   o IGP prefix: A prefix P/m (of any AFI/SAFI) that is learnt via an
     Interior Gateway Protocol, such as OSPF or ISIS. The prefix may
     be learnt directly through the IGP or redistributed from other
     protocol(s).

   o CE: An external router through which an egress PE can reach a
     prefix P/m.

   o Egress PE, "ePE": A BGP speaker that learns about a prefix
     through an eBGP peer and chooses that eBGP peer as the next-hop
     for that prefix.

   o Ingress PE, "iPE": A BGP speaker that learns about a prefix
     through an iBGP peer and chooses an egress PE as the next-hop
     for the prefix.

   o Path: The next-hop in a sequence of nodes starting from the
     current node and ending with the destination node or network
     identified by the prefix. The nodes may not be directly
     connected.

   o Recursive path: A path consisting only of the IP address of the
     next-hop without the outgoing interface. Subsequent lookups are
     necessary to determine the outgoing interface and a directly
     connected next-hop.

   o Non-recursive path: A path consisting of the IP address of a
     directly connected next-hop and the outgoing interface.

   o Primary path: A recursive or non-recursive path that can be
     used all the time as long as a walk starting from this path can
     end at an adjacency. A prefix can have more than one primary
     path.

   o Backup path: A recursive or non-recursive path that can be used
     only after some or all primary paths become unreachable.

   o Leaf: A container data structure for a prefix or local label.
     Alternatively, it is the data structure that contains prefix-
     specific information.

   o IP leaf: The leaf corresponding to an IPv4 or IPv6 prefix.

   o Label leaf: The leaf corresponding to a locally allocated label,
     such as the VPN label on an egress PE [7].

   o Pathlist: An array of paths used by one or more prefixes to
     forward traffic to destination(s) covered by an IP prefix. Each
     path in the pathlist carries its "path-index", which identifies
     its position in the array of paths. In general, the value of the
     "path-index" stored in a path does not necessarily have the same
     value as the location of the path in the pathlist. For example,
     the 3rd path may carry a path-index value of 1.

   o A pathlist may contain a mix of primary and backup paths.

   o OutLabel-List: Each labeled prefix is associated with an
     OutLabel-List. The OutLabel-List is an array of one or more
     outgoing labels and/or label actions where each label or label
     action has a 1-to-1 correspondence to a path in the pathlist.
     Label actions are: push the label, pop the label, swap the
     incoming label with the label in the OutLabel-List entry, or
     don't push anything at all in the case of "unlabeled". The
     prefix may be an IGP or BGP prefix.

   o Adjacency: The layer 2 encapsulation leading to the layer 3
     directly connected next-hop.

   o Dependency: An object X is said to be a dependent or child of
     object Y if there is at least one forwarding chain where the
     forwarding engine must visit the object X before visiting the
     object Y in order to forward a packet. Note that if object X is
     a child of object Y, then Y cannot be deleted unless object X
     is no longer a dependent/child of object Y.

   o Route: A prefix with one or more paths associated with it.
     Hence the minimum set of objects needed to construct a route is
     a leaf and a pathlist.

2. Overview

   The idea of BGP-PIC is based on two pillars:

   o A shared hierarchical forwarding chain: It is not uncommon to
     see multiple destinations reachable via the same list of next-
     hops. Instead of having a separate list of next-hops for each
     destination, all destinations sharing the same list of next-hops
     can point to a single copy of this list, thereby allowing fast
     convergence by making changes to a single shared list of next-
     hops rather than to a possibly large number of destinations.
     Because paths in a pathlist may be recursive, a hierarchy is
     formed between the pathlist and the resolving prefix whereby the
     pathlist depends on the resolving prefix.

   o A forwarding plane that supports multiple levels of indirection:
     A forwarding chain that starts with a destination and ends with
     an outgoing interface is not a simple flat structure. Instead, a
     forwarding entry is constructed via multiple levels of
     dependency. A BGP NLRI uses a recursive next-hop, which in turn
     resolves via an IGP next-hop, which in turn resolves via an
     adjacency consisting of one or more outgoing interface(s) and
     next-hop(s). A minimal data-structure sketch illustrating this
     organization follows this list.
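   As an illustration of the two pillars, the following sketch (in
   Python, with purely hypothetical class and field names such as
   Leaf, Pathlist, Path, and Adjacency) shows one possible way to
   model the sharing and hierarchy described above, using the
   addresses and labels that appear later in the example of Section
   2.2. It is a sketch under these assumptions, not a prescribed data
   model.

      # Illustrative sketch only: class and field names are
      # hypothetical, not taken from any particular implementation.

      class Adjacency:
          def __init__(self, interface, l2_rewrite):
              self.interface = interface      # outgoing interface
              self.l2_rewrite = l2_rewrite    # layer 2 encapsulation

      class Path:
          def __init__(self, path_index, parent, nexthop=None):
              self.path_index = path_index    # selects the leaf's out-label
              self.parent = parent            # resolving leaf or adjacency
              self.nexthop = nexthop          # set for recursive paths

      class Pathlist:
          def __init__(self, paths):
              self.paths = paths
              self.dependents = []            # leaves sharing this pathlist

      class Leaf:
          def __init__(self, prefix, pathlist, out_labels):
              self.prefix = prefix
              self.pathlist = pathlist
              self.out_labels = out_labels    # OutLabel-List (by path-index)
              pathlist.dependents.append(self)

      # IGP level: one pathlist with two ECMP non-recursive paths.
      adj1 = Adjacency("I1", "IGP-NH1")
      adj2 = Adjacency("I2", "IGP-NH2")
      igp_pl = Pathlist([Path(0, adj1), Path(1, adj2)])
      igp_ip1 = Leaf("192.0.2.1/32", igp_pl, ["IGP-L11", "IGP-L12"])
      igp_ip2 = Leaf("192.0.2.2/32", igp_pl, ["IGP-L21", "IGP-L22"])

      # BGP level: both VPN leaves share one pathlist whose recursive
      # paths resolve via the IGP leaves, forming the hierarchy.
      bgp_pl = Pathlist([Path(0, igp_ip1, nexthop="192.0.2.1"),
                         Path(1, igp_ip2, nexthop="192.0.2.2")])
      vpn_ip1 = Leaf("198.51.100.0/24", bgp_pl, ["VPN-L11", "VPN-L21"])
      vpn_ip2 = Leaf("203.0.113.0/24", bgp_pl, ["VPN-L12", "VPN-L22"])

   Because both VPN leaves point to the same shared pathlist object, a
   change to that single object re-routes every dependent prefix at
   once.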
251 Designing a forwarding plane that constructs multi-level forwarding 252 chains with maximal sharing of forwarding objects allows rerouting a 253 large number of destinations by modifying a small number of objects 254 thereby achieving convergence in a time frame that does not depend 255 on the number of destinations. For example, if the IGP prefix that 256 resolves a recursive next-hop is updated there is no need to update 257 the possibly large number of BGP NLRIs that use this recursive next- 258 hop. 260 2.1. Dependency 262 This section describes the required functionalities in the 263 forwarding and control planes to support BGP-PIC described in this 264 document. 266 2.1.1. Hierarchical Hardware FIB 268 BGP-PIC requires a hierarchical hardware FIB support: for each BGP 269 forwarded packet, a BGP leaf is looked up, then a BGP Pathlist is 270 consulted, then an IGP Pathlist, then an Adjacency. 272 An alternative method consists in "flattening" the dependencies when 273 programming the BGP destinations into HW FIB resulting in 274 potentially eliminating both the BGP Path-List and IGP Path-List 275 consultation. Such an approach decreases the number of memory 276 lookup's per forwarding operation at the expense of HW FIB memory 277 increase (flattening means less sharing hence duplication), loss of 278 ECMP properties (flattening means less pathlist entropy) and loss of 279 BGP-PIC properties. 281 2.1.2. Availability of more than one BGP next-hops 283 When the primary BGP next-hop fails, BGP-PIC depends on the 284 availability of one or more pre-computed and pre-installed secondary 285 BGP next-hop(s) in the BGP Pathlist. 287 The existence of a secondary next-hop is clear for the following 288 reason: a service caring for network availability will require two 289 disjoint network connections hence two BGP next-hops. 291 The BGP distribution of secondary next-hops is available thanks to 292 the following BGP mechanisms: Add-Path [10], BGP Best-External [5], 293 diverse path [11], and the frequent use in VPN deployments of 294 different VPN RD's per PE. It is noteworthy to mention that the 295 availability of another BGP path does not mean that all failure 296 scenarios can be covered by simply forwarding traffic to the 297 available secondary path. The discussion of how to cover various 298 failure scenarios is beyond the scope of this document. 300 2.2. BGP-PIC Illustration 302 To illustrate the two pillars above as well as the platform 303 dependency, we will use an example of a simple multihomed L3VPN [7] 304 prefix in a BGP-free core running LDP [4] or segment routing over 305 MPLS forwarding plane [13]. 307 +--------------------------------+ 308 | | 309 | ePE2 (IGP-IP1 192.0.2.1, Loopback) 310 | | \ 311 | | \ 312 | | \ 313 iPE | CE....VRF "Blue", ASnum 65000 314 | | / (VPN-IP1 198.51.100.0/24) 315 | | / (VPN-IP2 203.0.113.0/24) 316 | LDP/Segment-Routing Core | / 317 | ePE1 (IGP-IP2 192.0.2.2, Loopback) 318 | | 319 +--------------------------------+ 320 Figure 1 VPN prefix reachable via multiple PEs 322 Referring to Figure 1, suppose the iPE (the ingress PE) receives 323 NLRIs for the VPN prefixes VPN-IP1 and VPN-IP2 from two egress PEs, 324 ePE1 and ePE2 with next-hop BGP-NH1 and BGP-NH2, respectively. 325 Assume that ePE1 advertise the VPN labels VPN-L11 and VPN-L12 while 326 ePE2 advertise the VPN labels VPN-L21 and VPN-L22 for VPN-IP1 and 327 VPN-IP2, respectively. 
Suppose that BGP-NH1 and BGP-NH2 are resolved 328 via the IGP prefixes IGP-IP1 and IGP-IP2, where each happen to have 329 2 ECMP paths with IGP-NH1 and IGP-NH2 reachable via the interfaces 330 I1 and I2, respectively. Suppose that local labels (whether LDP [4] 331 or segment routing [13]) on the downstream LSRs for IGP-IP1 are IGP- 332 L11 and IGP-L12 while for IGP-IP2 are IGP-L21 and IGP-L22. As such, 333 the routing table at iPE is as follows: 335 65000: 198.51.100.0/24 336 via ePE1 (192.0.2.1), VPN Label: VPN-L11 337 via ePE2 (192.0.2.2), VPN Label: VPN-L21 339 65000: 203.0.113.0/24 340 via ePE1 (192.0.2.1), VPN Label: VPN-L12 341 via ePE2 (192.0.2.2), VPN Label: VPN-L22 343 192.0.2.1/32 344 via Core, Label: IGP-L11 345 via Core, Label: IGP-L12 347 192.0.2.2/32 348 via Core, Label: IGP-L21 349 via Core, Label: IGP-L22 351 Based on the above routing table, a hierarchical forwarding chain 352 can be constructed as shown in Figure 2. 354 IP Leaf: Pathlist: IP Leaf: Pathlist: 355 -------- +-------+ -------- +----------+ 356 VPN-IP1-->|BGP-NH1|-->IGP-IP1(BGP NH1)--->|IGP NH1,I1|--->Adjacency1 357 | |BGP-NH2|-->.... | |IGP NH2,I2|--->Adjacency2 358 | +-------+ | +----------+ 359 | | 360 | | 361 v v 362 OutLabel-List: OutLabel-List: 363 +----------------------+ +----------------------+ 364 |VPN-L11 (VPN-IP1, NH1)| |IGP-L11 (IGP-IP1, NH1)| 365 |VPN-L21 (VPN-IP1, NH2)| |IGP-L12 (IGP-IP1, NH2)| 366 +----------------------+ +----------------------+ 368 Figure 2 Shared Hierarchical Forwarding Chain at iPE 370 The forwarding chain depicted in Figure 2 illustrates the first 371 pillar, which is sharing and hierarchy. We can see that the BGP 372 pathlist consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs 373 reachable via ePE1 and ePE2. As such, it is possible to make changes 374 to the pathlist without having to make changes to the NLRIs. For 375 example, if BGP-NH2 becomes unreachable, there is no need to modify 376 any of the possibly large number of NLRIs. Instead only the shared 377 pathlist needs to be modified. Likewise, due to the hierarchical 378 structure of the forwarding chain, it is possible to make 379 modifications to the IGP routes without having to make any changes 380 to the BGP NLRIs. For example, if the interface "I2" goes down, only 381 the shared IGP pathlist needs to be updated, but none of the IGP 382 prefixes sharing the IGP pathlist nor the BGP NLRIs using the IGP 383 prefixes for resolution need to be modified. 385 Figure 2 can also be used to illustrate the second BGP-PIC pillar. 386 Having a deep forwarding chain such as the one illustrated in Figure 387 2 requires a forwarding plane that is capable of accessing multiple 388 levels of indirection in order to calculate the outgoing 389 interface(s) and next-hops(s). While a deeper forwarding chain 390 minimizes the re-convergence time on topology change, there will 391 always exist platforms with limited capabilities and hence imposing 392 a limit on the depth of the forwarding chain. Section 5 describes 393 how to gracefully trade off convergence speed with the number of 394 hierarchical levels to support platforms with different 395 capabilities. 397 3. Constructing the Shared Hierarchical Forwarding Chain 399 Constructing the forwarding chain is an application of the two 400 pillars described in Section 2. This section describes how to 401 construct the forwarding chain in hierarchical shared manner. 403 3.1. Constructing the BGP-PIC forwarding Chain 405 The whole process starts when BGP downloads a prefix to FIB. 
The 406 prefix contains one or more outgoing paths. For certain labeled 407 prefixes, such as VPN [7] prefixes, each path may be associated with 408 an outgoing label and the prefix itself may be assigned a local 409 label. The list of outgoing paths defines a pathlist. If such 410 pathlist does not already exist, then FIB creates a new pathlist, 411 otherwise the existing pathlist is used. The BGP prefix is added as 412 a dependent of the pathlist. 414 The previous step constructs the upper part of the hierarchical 415 forwarding chain. The forwarding chain is completed by resolving the 416 paths of the pathlist. A BGP path usually consists of a next-hop. 417 The next-hop is resolved by finding a matching IGP prefix. 419 The end result is a hierarchical shared forwarding chain where the 420 BGP pathlist is shared by all BGP prefixes that use the same list of 421 paths and the IGP prefix is shared by all pathlists that have a path 422 resolving via that IGP prefix. It is noteworthy to mention that the 423 forwarding chain is constructed without any operator intervention at 424 all. 426 The remainder of this section goes over an example to illustrate the 427 applicability of BGP-PIC in a primary-backup path scenario. 429 3.2. Example: Primary-Backup Path Scenario 431 Consider the egress PE ePE1 in the case of the multi-homed VPN 432 prefixes in the BGP-free core depicted in Figure 1. Suppose ePE1 433 determines that the primary path is the external path but the backup 434 path is the iBGP path to the other PE ePE2 with next-hop BGP-NH2. 435 ePE2 constructs the forwarding chain depicted in Figure 3. We are 436 only showing a single VPN prefix for simplicity. But all prefixes 437 that are multihomed to ePE1 and ePE2 share the BGP pathlist. 439 BGP OutLabel Array 440 VPN-L11 +---------+ 441 (Label-leaf)---+---->|Unlabeled| 442 | +---------+ 443 | | VPN-L21 | 444 | | (swap) | 445 | +---------+ 446 | 447 | BGP Pathlist 448 | +------------+ Connected route 449 | | CE-NH |------>(to the CE) 450 | |path-index=0| 451 | +------------+ 452 | | VPN-NH2 | 453 VPN-IP1 -----+------------------>| (backup) |------>IGP Leaf 454 (IP prefix leaf) |path-index=1| (Towards ePE2) 455 | +------------+ 456 | 457 | BGP OutLabel Array 458 | +---------+ 459 +------------->|Unlabeled| 460 +---------+ 461 | VPN-L21 | 462 | (push) | 463 +---------+ 465 Figure 3 : VPN Prefix Forwarding Chain with eiBGP paths on egress PE 467 The example depicted in Figure 3 differs from the example in Figure 468 2 in two main aspects. First, as long as the primary path towards 469 the CE (external path) is useable, it will be the only path used for 470 forwarding while the OutLabel-List contains both the unlabeled label 471 (primary path) and the VPN label (backup path) advertised by the 472 backup path ePE2. The second aspect is presence of the label leaf 473 corresponding to the VPN prefix. This label leaf is used to match 474 VPN traffic arriving from the core. Note that the label leaf shares 475 the pathlist with the IP prefix. 477 4. Forwarding Behavior 479 This section explains how the forwarding plane uses the hierarchical 480 shared forwarding chain to forward a packet. 482 When a packet arrives at a router, it matches a leaf. A labeled 483 packet matches a label leaf while an IP packet matches an IP prefix 484 leaf. The forwarding engines walks the forwarding chain starting 485 from the leaf until the walk terminates on an adjacency. Thus when a 486 packet arrives, the chain is walked as follows: 488 1. 
Lookup the leaf based on the destination address or the label at
      the top of the packet.

   2. Retrieve the parent pathlist of the leaf.

   3. Pick the outgoing path "Pi" from the list of resolved paths in
      the pathlist. The method by which the outgoing path is picked is
      beyond the scope of this document (e.g. a flow-preserving hash
      exploiting entropy within the MPLS stack and IP header). Let the
      "path-index" of the outgoing path "Pi" be "j".

   4. If the prefix is labeled, use the "path-index" "j" to retrieve
      the jth label "Lj" stored in the jth entry of the OutLabel-List
      and apply the label action of that label on the packet (e.g. for
      the VPN label on the ingress PE, the label action is "push"). As
      mentioned in Section 1.1, the value of the "path-index" stored
      in a path does not necessarily equal the location of the path in
      the pathlist.

   5. Move to the parent of the chosen path "Pi".

   6. If the chosen path "Pi" is recursive, move to its parent prefix
      and go to step 2.

   7. If the chosen path is non-recursive, move to its parent
      adjacency and go to the next step.

   8. Encapsulate the packet in the layer 2 encapsulation specified by
      the adjacency and send the packet out.

   Let's apply the above forwarding steps to the forwarding chain
   depicted in Figure 2 in Section 2. Suppose a packet arrives at the
   ingress PE iPE from an external neighbor. Assume the packet matches
   the VPN prefix VPN-IP1. While walking the forwarding chain, the
   forwarding engine applies a hashing algorithm to choose the path;
   suppose the hashing at the BGP level yields path 0 while the
   hashing at the IGP level yields path 1. In that case, the packet
   will be sent out of interface I2 with the label stack
   "IGP-L12,VPN-L11".
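   The following sketch (Python, with a hypothetical object layout and
   names; the per-pathlist choice function stands in for the flow-
   preserving hash of step 3) shows one way the walk above could be
   expressed. It is illustrative only and mirrors the Figure 2
   example.

      # Illustrative walk of the hierarchical chain (steps 1-8 above).

      def walk(leaf, pick):
          """Return (adjacency, label_stack) for a packet matching leaf."""
          labels = []
          while True:
              paths = [p for p in leaf["pathlist"] if p["resolved"]]
              path = pick(paths)                      # step 3
              j = path["path_index"]
              if leaf.get("out_labels"):              # step 4 (e.g. push)
                  labels.append(leaf["out_labels"][j])
              parent = path["parent"]                 # step 5
              if path["recursive"]:                   # step 6: recurse
                  leaf = parent
              else:                                   # steps 7-8
                  return parent, labels

      # Data mirroring Figure 2: VPN-IP1 -> BGP pathlist -> IGP leaf.
      igp_leaf = {
          "pathlist": [
              {"path_index": 0, "recursive": False, "resolved": True,
               "parent": {"interface": "I1"}},
              {"path_index": 1, "recursive": False, "resolved": True,
               "parent": {"interface": "I2"}}],
          "out_labels": ["IGP-L11", "IGP-L12"]}
      vpn_ip1 = {
          "pathlist": [
              {"path_index": 0, "recursive": True, "resolved": True,
               "parent": igp_leaf},
              {"path_index": 1, "recursive": True, "resolved": True,
               "parent": igp_leaf}],   # 2nd next-hop shown resolving via
          "out_labels": ["VPN-L11", "VPN-L21"]}   # the same IGP leaf for
                                                  # brevity

      # BGP hash picks path 0, IGP hash picks path 1, as in the text:
      picks = iter([0, 1])
      adj, stack = walk(vpn_ip1, lambda paths: paths[next(picks)])
      # adj["interface"] == "I2"
      # stack == ["VPN-L11", "IGP-L12"]  (last entry = top of stack)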
5. Handling Platforms with Limited Levels of Hierarchy

   This section describes the construction of the forwarding chain if
   a platform does not support the number of recursion levels required
   to resolve the NLRIs. There are two main design objectives:

   o Being able to reduce the number of hierarchical levels from any
     arbitrary value to a smaller arbitrary value that can be
     supported by the forwarding engine.

   o Minimal modifications to the forwarding algorithm due to such
     reduction.

5.1. Flattening the Forwarding Chain

   Let's consider a pathlist associated with the leaf "R1" consisting
   of the list of paths <P1, P2, ..., Pi, ..., Pn>. Assume that the
   leaf "R1" has an OutLabel-list <L1, L2, ..., Li, ..., Ln>. Suppose
   the path "Pi" is a recursive path that resolves via a prefix
   represented by the leaf "R2". The leaf "R2" itself points to a
   pathlist consisting of the paths <Q1, Q2, ..., Qm>.

   If the platform supports the number of hierarchy levels of the
   forwarding chain, then a packet that uses the path "Pi" will be
   forwarded as follows:

   1. The forwarding engine is now at leaf "R1".

   2. So it moves to its parent pathlist, which contains the list
      <P1, P2, ..., Pi, ..., Pn>.

   3. The forwarding engine applies a hashing algorithm and picks the
      path "Pi". So now the forwarding engine is at the path "Pi".

   4. The forwarding engine retrieves the label "Li" from the
      OutLabel-list attached to the leaf "R1" and applies the label
      action.

   5. The path "Pi" uses the leaf "R2".

   6. The forwarding engine walks forward to the leaf "R2" for
      resolution.

   7. The forwarding plane performs a hash to pick a path among the
      pathlist of the leaf "R2", which is <Q1, Q2, ..., Qm>.

   8. Suppose the forwarding engine picks the path "Qj".

   9. Now the forwarding engine continues the walk to the parent of
      "Qj".

   Suppose the platform cannot support the number of hierarchy levels
   in the forwarding chain. FIB needs to reduce the number of
   hierarchy levels. The idea of reducing the number of hierarchy
   levels is to "flatten" two chain levels into a single level. The
   "flattening" steps are as follows:

   1. FIB wants to reduce the number of levels used by "Pi" by 1.

   2. FIB walks to the parent of "Pi", which is the leaf "R2".

   3. FIB extracts the parent pathlist of the leaf "R2", which is
      <Q1, Q2, ..., Qm>.

   4. FIB also extracts the OutLabel-list associated with the leaf
      "R2", denoted OutLabel-list(R2).

   5. FIB replaces the path "Pi" with the list of paths
      <Q1, Q2, ..., Qm>.

   6. Hence the pathlist now becomes
      <P1, P2, ..., P(i-1), Q1, Q2, ..., Qm, P(i+1), ..., Pn>.

   7. The path-index stored inside the locations "Q1", "Q2", ...,
      "Qm" must all be "i" because the index "i" refers to the label
      "Li" associated with leaf "R1".

   8. FIB attaches an OutLabel-list to the new pathlist whose entries
      corresponding to "Q1", "Q2", ..., "Qm" are taken from
      OutLabel-list(R2). The size of the label list associated with
      the flattened pathlist equals the size of the pathlist. Hence
      there is a 1-1 mapping between every path in the "flattened"
      pathlist and the OutLabel-list associated with it.

   It is noteworthy to mention that the labels in the OutLabel-list
   associated with the "flattened" pathlist may be stored in the same
   memory location as the path itself to avoid an additional memory
   access. But that is an implementation detail that is beyond the
   scope of this document.

   The same steps can be applied to all recursive paths in the
   pathlist so that all paths are "flattened", thereby reducing the
   number of hierarchical levels by one. Note that "flattening" a
   pathlist pulls in all paths of the parent paths, a desired feature
   to utilize all ECMP/UCMP paths at all levels. A platform that has a
   limit on the number of paths in a pathlist for any given leaf may
   choose to reduce the number of paths using methods that are beyond
   the scope of this document.

   The steps can be recursively applied to other paths at the same
   level or other levels to reduce the number of hierarchical levels
   to an arbitrary value so as to accommodate the capability of the
   forwarding engine.

   Because a flattened pathlist may have an associated OutLabel-list,
   the forwarding behavior has to be slightly modified. The
   modification is done by adding the following step right after step
   4 in Section 4:

   5. If there is an OutLabel-list associated with the pathlist, then
      if the path "Pi" is chosen by the hashing algorithm, retrieve
      the label at location "i" in that OutLabel-list and apply the
      label action of that label on the packet.

   In the next subsection, we apply the steps in this subsection to a
   sample scenario.

5.2. Example: Flattening a forwarding chain

   This example uses a case of inter-AS option C [7] where there are 3
   levels of hierarchy. Figure 4 illustrates the sample topology. To
   force 3 levels of hierarchy, the ASBRs in the ingress domain
   (domain 1) advertise the core routers of the egress domain (domain
   2) to the ingress PE (iPE) via BGP-LU [3] instead of redistributing
   them into the IGP of domain 1. The end result is that the ingress
   PE (iPE) has 2 levels of recursion for the VPN prefixes VPN-IP1 and
   VPN-IP2.
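   As a companion to the flattening steps in Section 5.1, the sketch
   below (Python, with hypothetical names such as flatten_one_level,
   r1_paths, and r2) shows one way the index and label bookkeeping of
   steps 5 through 8 could be performed. It is a sketch under these
   assumptions rather than a prescribed implementation; the topology
   in Figure 4 below then exercises the same idea on a concrete
   inter-AS network.

      # Sketch of the flattening steps in Section 5.1 (hypothetical
      # dict-based objects). Flattened entries keep the path-index of
      # the recursive path they replace (step 7) and carry their own
      # out-label taken from OutLabel-list(R2) (step 8).

      def flatten_one_level(r1_paths, i, r2):
          """Replace the recursive path at position i of r1_paths with
          the paths of its resolving leaf r2. Returns the flattened
          pathlist and the OutLabel-list that parallels it."""
          flat_paths, flat_labels = [], []
          for pos, path in enumerate(r1_paths):
              if pos != i:
                  flat_paths.append(dict(path))
                  flat_labels.append(None)   # nothing extra to push
                  continue
              pairs = zip(r2["pathlist"], r2["out_labels"])
              for q, q_label in pairs:
                  q_flat = dict(q)
                  q_flat["path_index"] = path["path_index"]  # step 7
                  flat_paths.append(q_flat)
                  flat_labels.append(q_label)                # step 8
          return flat_paths, flat_labels

      # R2: an egress PE reachable via two ASBRs with BGP-LU labels.
      r2 = {"pathlist": [{"path_index": 0, "parent": "ASBR11"},
                         {"path_index": 1, "parent": "ASBR12"}],
            "out_labels": ["LASBR111", "LASBR121"]}

      # R1's pathlist: path 0 non-recursive, path 1 resolves via R2.
      r1_paths = [{"path_index": 0, "parent": "CE"},
                  {"path_index": 1, "parent": r2}]

      flat_paths, flat_labels = flatten_one_level(r1_paths, 1, r2)
      # flat_paths has 3 entries; the last two both carry path-index 1
      # (selecting R1's second leaf label), while flat_labels supplies
      # LASBR111/LASBR121 for the flattened entries.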
653 Domain 1 Domain 2 654 +-------------+ +-------------+ 655 | | | | 656 | LDP/SR Core | | LDP/SR core | 657 | | | | 658 | (192.0.2.4) | | 659 | ASBR11-------ASBR21........ePE1(192.0.2.1) 660 | | \ / | . . |\ 661 | | \ / | . . | \ 662 | | \ / | . . | \ 663 | | \/ | .. | \VPN-IP1(198.51.100.0/24) 664 | | /\ | . . | /VRF "Blue" ASn: 65000 665 | | / \ | . . | / 666 | | / \ | . . | / 667 | | / \ | . . |/ 668 iPE ASBR12-------ASBR22........ePE2 (192.0.2.2) 669 | (192.0.2.5) | |\ 670 | | | | \ 671 | | | | \ 672 | | | | \VRF "Blue" ASn: 65000 673 | | | | /VPN-IP2(203.0.113.0/24) 674 | | | | / 675 | | | | / 676 | | | |/ 677 | ASBR13-------ASBR23........ePE3(192.0.2.3) 678 | (192.0.2.6) | | 679 | | | | 680 | | | | 681 +-------------+ +-------------+ 682 <=========== <========= <============ 683 Advertise ePEx Advertise Redistribute 684 Using iBGP-LU ePEx Using IGP into 685 eBGP-LU BGP 687 Figure 4 : Sample 3-level hierarchy topology 689 We will make the following assumptions about connectivity 691 o In "domain 2", both ASBR21 and ASBR22 can reach both ePE1 and 692 ePE2 using the same distance. 694 o In "domain 2", only ASBR23 can reach ePE3. 696 o In "domain 1", iPE (the ingress PE) can reach ASBR11, ASBR12, and 697 ASBR13 via IGP using the same distance. 699 We will make the following assumptions about the labels 700 o The VPN labels advertised by ePE1 and ePE2 for prefix VPN-IP1 are 701 VPN-L11 and VPN-L21, respectively. 703 o The VPN labels advertised by ePE2 and ePE3 for prefix VPN-IP2 are 704 VPN-L22 and VPN-L32, respectively. 706 o The labels advertised by ASBR11 to iPE using BGP-LU [3] for the 707 egress PEs ePE1 and ePE2 are LASBR111(ePE1) and LASBR112(ePE2), 708 respectively. 710 o The labels advertised by ASBR12 to iPE using BGP-LU [3] for the 711 egress PEs ePE1 and ePE2 are LASBR121(ePE1) and LASBR122(ePE2), 712 respectively. 714 o The label advertised by ASBR13 to iPE using BGP-LU [3] for the 715 egress PE ePE3 is LASBR13(ePE3). 717 o The IGP labels advertised by the next hops directly connected to 718 iPE towards ASBR11, ASBR12, and ASBR13 in the core of domain 1 719 are IGP-L11, IGP-L12, and IGP-L13, respectively. 721 o Both the routers ASBR21 and ASBR22 of Domain 2 advertise the same 722 label LASBR21 and LASBR22 to the egress PEs ePE1 and ePE2, 723 respectively, to the routers ASBR11 and ASBR22 of Domain 1. 725 o The router ASBR23 of Domain 2 advertises the label LASBR23 for 726 the egress PE ePE3 to the router ASBR13 of Domain 1. 728 Based on these connectivity assumptions and the topology in Figure 729 4, the routing table on iPE is 730 65000: 198.51.100.0/24 731 via ePE1 (192.0.2.1), VPN Label: VPN-L11 732 via ePE2 (192.0.2.2), VPN Label: VPN-L21 733 65000: 203.0.113.0/24 734 via ePE1 (192.0.2.2), VPN Label: VPN-L22 735 via ePE2 (192.0.2.3), VPN Label: VPN-L32 737 192.0.2.1/32 (ePE1) 738 Via ASBR11, BGP-LU Label: LASBR111(ePE1) 739 Via ASBR12, BGP-LU Label: LASBR121(ePE1) 740 192.0.2.2/32 (ePE2) 741 Via ASBR11, BGP-LU Label: LASBR112(ePE2) 742 Via ASBR12, BGP-LU Label: LASBR122(ePE2) 743 192.0.2.3/32 (ePE3) 744 Via ASBR13, BGP-LU Label: LASBR13(ePE3) 746 192.0.2.4/32 (ASBR11) 747 via Core, Label: IGP-L11 748 192.0.2.5/32 (ASBR12) 749 via Core, Label: IGP-L12 750 192.0.2.6/32 (ASBR13) 751 via Core, Label: IGP-L13 753 The diagram in Figure 5 illustrates the forwarding chain in iPE 754 assuming that the forwarding hardware in iPE supports 3 levels of 755 hierarchy. 
The leaves corresponding to the ABSRs on domain 1 756 (ASBR11, ASBR12, and ASBR13) are at the bottom of the hierarchy. 757 There are few important points: 759 o Because the hardware supports the required depth of hierarchy, 760 the sizes of a pathlist equal the size of the label list 761 associated with the leaves using this pathlist. 763 o The index inside the pathlist entry indicates the label that will 764 be picked from the Outlabel-List associated with the child leaf 765 if that path is chosen by the forwarding engine hashing function. 767 Outlabel-List Outlabel-List 768 For VPN-IP1 For VPN-IP2 769 +------------+ +--------+ +-------+ +------------+ 770 | VPN-L11 |<---| VPN-IP1| |VPN-IP2|-->| VPN-L22 | 771 +------------+ +---+----+ +---+---+ +------------+ 772 | VPN-L21 | | | | VPN-L32 | 773 +------------+ | | +------------+ 774 | | 775 V V 776 +---+---+ +---+---+ 777 | 0 | 1 | | 0 | 1 | 778 +-|-+-\-+ +-/-+-\-+ 779 | \ / \ 780 | \ / \ 781 | \ / \ 782 | \ / \ 783 v \ / \ 784 +-----+ +-----+ +-----+ 785 +----+ ePE1| |ePE2 +-----+ | ePE3+-----+ 786 | +--+--+ +-----+ | +--+--+ | 787 v | / v | v 788 +--------------+ | / +--------------+ | +-------------+ 789 |LASBR111(ePE1)| | / |LASBR112(ePE2)| | |LASBR13(ePE3)| 790 +--------------+ | / +--------------+ | +-------------+ 791 |LASBR121(ePE1)| | / |LASBR122(ePE2)| | Outlabel-List 792 +--------------+ | / +--------------+ | For ePE3 793 Outlabel-List | / Outlabel-List | 794 For ePE1 | / For ePE2 | 795 | / | 796 | / | 797 | / | 798 v / v 799 +---+---+ Shared Pathlist +---+ Pathlist 800 | 0 | 1 | For ePE1 and ePE2 | 0 | For ePE3 801 +-|-+-\-+ +-|-+ 802 | \ | 803 | \ | 804 | \ | 805 | \ | 806 v \ v 807 +------+ +------+ +------+ 808 +---+ASBR11| |ASBR12+--+ |ASBR13+---+ 809 | +------+ +------+ | +------+ | 810 v v v 811 +-------+ +-------+ +-------+ 812 |IGP-L11| |IGP-L12| |IGP-L13| 813 +-------+ +-------+ +-------+ 815 Figure 5 : Forwarding Chain for hardware supporting 3 Levels 817 Now suppose the hardware on iPE (the ingress PE) supports 2 levels 818 of hierarchy only. In that case, the 3-levels forwarding chain in 819 Figure 5 needs to be "flattened" into 2 levels only. 821 Outlabel-List Outlabel-List 822 For VPN-IP1 For VPN-IP2 823 +------------+ +-------+ +-------+ +------------+ 824 | VPN-L11 |<---|VPN-IP1| | VPN-IP2|--->| VPN-L22 | 825 +------------+ +---+---+ +---+---+ +------------+ 826 | VPN-L21 | | | | VPN-L32 | 827 +------------+ | | +------------+ 828 | | 829 | | 830 | | 831 Flattened | | Flattened 832 pathlist V V pathlist 833 +===+===+ +===+===+===+ +==============+ 834 +--------+ 0 | 1 | | 0 | 0 | 1 +---->|LASBR112(ePE2)| 835 | +=|=+=\=+ +=/=+=/=+=\=+ +==============+ 836 v | \ / / \ |LASBR122(ePE2)| 837 +==============+ | \ +-----+ / \ +==============+ 838 |LASBR111(ePE1)| | \/ / \ |LASBR13(ePE3) | 839 +==============+ | /\ / \ +==============+ 840 |LASBR121(ePE1)| | / \ / \ 841 +==============+ | / \ / \ 842 | / \ / \ 843 | / + + \ 844 | + | | \ 845 | | | | \ 846 v v v v \ 847 +------+ +------+ +------+ 848 +----|ASBR11| |ASBR12+---+ |ASBR13+---+ 849 | +------+ +------+ | +------+ | 850 v v v 851 +-------+ +-------+ +-------+ 852 |IGP-L11| |IGP-L12| |IGP-L13| 853 +-------+ +-------+ +-------+ 855 Figure 6 : Flattening 3 levels to 2 levels of Hierarchy on iPE 857 Figure 6 represents one way to "flatten" a 3 levels hierarchy into 858 two levels. There are few important points: 860 o As mentioned in Section 5.1 a flattened pathlist may have label 861 lists associated with them. 
The size of the label list associated 862 with a flattened pathlist equals the size of the pathlist. Hence 863 it is possible that an implementation includes these label lists 864 in the flattened pathlist itself. 866 o Again as mentioned in Section 5.1, the size of a flattened 867 pathlist may not be equal to the size of the OutLabel-lists of 868 leaves using the flattened pathlist. So the indices inside a 869 flattened pathlist still indicate the label index in the 870 Outlabel-Lists of the leaves using that pathlist. Because the 871 size of the flattened pathlist may be different from the size of 872 the OutLabel-lists of the leaves, the indices may be repeated. 874 o Let's take a look at the flattened pathlist used by the prefix 875 "VPN-IP2", The pathlist associated with the prefix "VPN-IP2" has 876 three entries. 878 o The first and second entry have index "0". This is because 879 both entries correspond to ePE2. Hence when hashing performed 880 by the forwarding engine results in using first or the second 881 entry in the pathlist, the forwarding engine will pick the 882 correct VPN label "VPN-L22", which is the label advertised by 883 ePE2 for the prefix "VPN-IP2". 885 o The third entry has the index "1". This is because the third 886 entry corresponds to ePE3. Hence when the hashing is performed 887 by the forwarding engine results in using the third entry in 888 the flattened pathlist, the forwarding engine will pick the 889 correct VPN label "VPN-L32", which is the label advertised by 890 "ePE3" for the prefix "VPN-IP2". 892 Now let's try and apply the forwarding steps in Section 4 together 893 with the additional step in Section 5.1 to the flattened forwarding 894 chain illustrated in Figure 6. 896 o Suppose a packet arrives at "iPE" and matches the VPN prefix 897 "VPN-IP2". 899 o The forwarding engine walks to the parent of the "VPN-IP2", which 900 is the flattened pathlist and applies a hashing algorithm to pick 901 a path. 903 o Suppose the hashing by the forwarding engine picks the second 904 path in the flattened pathlist associated with the leaf "VPN- 905 IP2". 907 o Because the second path has the index "0", the label "VPN-L22" is 908 pushed on the packet. 910 o Next the forwarding engine picks the second label from the 911 Outlabel-Array associated with the flattened pathlist. Hence the 912 next label that is pushed is "LASBR122(ePE2)". 914 o The forwarding engine now moves to the parent of the flattened 915 pathlist corresponding to the second path. The parent is the IGP 916 label leaf corresponding to "ASBR12". 918 o So the packet is forwarded towards the ASBR "ASBR12" and the IGP 919 label at the top will be "L12". 921 Based on the above steps, a packet arriving at iPE and destined to 922 the prefix VPN-L22 reaches its destination as follows: 924 o iPE sends the packet along the shortest path towards ASBR12 with 925 the following label stack starting from the top: {L12, 926 LASBR122(ePE2), VPN-L22}. 928 o The penultimate hop of ASBR12 pops the top label "L12". Hence the 929 packet arrives at ASBR12 with the label stack {LASBR122(ePE2), 930 VPN-L22} where "LASBR12(ePE2)" is the top label. 932 o ASBR12 swaps "LASBR122(ePE2)" with the label "LASBR22(ePE2)", 933 which is the label advertised by ASBR22 for the ePE2 (the egress 934 PE). 936 o ASBR22 receives the packet with "LASBR22(ePE2)" at the top. 
o Hence ASBR22 swaps "LASBR22(ePE2)" with the IGP label for ePE2
     advertised by the next-hop towards ePE2 in domain 2, and sends
     the packet along the shortest path towards ePE2.

   o The penultimate hop of ePE2 pops the top label. Hence ePE2
     receives the packet with the label VPN-L22 at the top.

   o ePE2 pops "VPN-L22" and sends the packet as a pure IP packet
     towards the destination VPN-IP2.

6. Forwarding Chain Adjustment at a Failure

   The hierarchical and shared structure of the forwarding chain
   explained in the previous sections allows modifying a small number
   of forwarding chain objects to re-route traffic to a pre-calculated
   equal-cost or backup path without the need to modify the possibly
   very large number of BGP prefixes. In this section, we go over
   various core and edge failure scenarios to illustrate how the FIB
   manager can utilize the forwarding chain structure to achieve BGP
   prefix independent convergence.

6.1. BGP-PIC core

   This section describes the adjustments to the forwarding chain when
   a core link or node fails but the BGP next-hop remains reachable.

   There are two cases: remote link failure and attached link failure.
   Node failures are treated as link failures.

   When a remote link or node fails, IGP on the ingress PE receives
   advertisements indicating a topology change, so IGP re-converges to
   either find a new next-hop and/or outgoing interface or remove the
   path completely from the IGP prefix used to resolve BGP next-hops.
   IGP and/or LDP download the modified IGP leaves with modified
   outgoing labels for a labeled core.

   When a local link fails, the FIB manager detects the failure almost
   immediately. The FIB manager marks the impacted path(s) as unusable
   so that only usable paths are used to forward packets. Hence only
   IGP pathlists with paths using the failed local link need to be
   modified. All other pathlists are not impacted. Note that in this
   particular case there is actually no need even to backwalk to the
   IGP leaves to adjust the OutLabel-Lists, because FIB can rely on
   the path-index stored in the usable paths in the pathlist to pick
   the right label.

   It is noteworthy to mention that the FIB manager modifies the
   forwarding chain starting from the IGP leaves only; BGP pathlists
   and leaves are not modified. Hence traffic restoration occurs
   within the time frame of IGP convergence, and, for local link
   failure, assuming a backup path has been precomputed, within the
   timeframe of local detection (e.g. 50ms). Examples of solutions
   that pre-compute backup paths are IP FRR [15], remote LFA [16],
   TI-LFA [14], and MRT [17], or an eBGP path having a backup path
   [9].

   Let's apply the procedure mentioned in this subsection to the
   forwarding chain depicted in Figure 2. Suppose a remote link
   failure occurs and impacts the first ECMP IGP path to the remote
   BGP next-hop. Upon IGP convergence, the IGP pathlist used by the
   BGP next-hop is updated to reflect the new topology (one path
   instead of two). As soon as the IGP convergence is effective for
   the BGP next-hop entry, the new forwarding state is immediately
   available to all dependent BGP prefixes. The same behavior would
   occur if the failure was local, such as an interface going down. As
   soon as the IGP convergence is complete for the BGP next-hop IGP
   route, all its depending BGP routes benefit from the new path. In
   fact, upon local failure, if LFA protection is enabled for the IGP
   route to the BGP next-hop and a backup path was pre-computed and
   installed in the pathlist, the LFA backup path is immediately
   activated upon the local interface failure (e.g. sub-50msec), and
   thus protection benefits all the depending BGP traffic through the
   hierarchical forwarding dependency between the routes.
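   To make the repair step concrete, the sketch below (Python,
   hypothetical names and layout) shows how marking paths in a single
   shared pathlist repairs every dependent prefix at once, as in the
   local link failure case above. It is illustrative only.

      # Illustrative sketch: repairing a failure by touching only the
      # shared pathlist.

      def mark_paths_via(pathlist, failed_parent):
          """Mark every path that resolves via failed_parent unusable.
          Every leaf depending on this pathlist is repaired by this one
          update; no dependent leaf is visited."""
          for path in pathlist["paths"]:
              if path["parent"] == failed_parent:
                  path["resolved"] = False

      def resolved_paths(pathlist):
          """Forwarding only considers resolved paths. A surviving path
          keeps its original path-index, so the correct entry of each
          leaf's OutLabel-List is still selected."""
          return [p for p in pathlist["paths"] if p["resolved"]]

      # Shared IGP pathlist of Figure 2, used by both IGP leaves:
      igp_pl = {"paths": [
                   {"path_index": 0, "parent": "I1", "resolved": True},
                   {"path_index": 1, "parent": "I2", "resolved": True}],
                "dependents": ["192.0.2.1/32", "192.0.2.2/32"]}

      mark_paths_via(igp_pl, "I2")       # local interface I2 goes down
      survivors = resolved_paths(igp_pl) # -> only the path via I1
      # Only this shared pathlist was modified; the IGP leaves, the BGP
      # pathlists, and the (possibly millions of) BGP leaves resolving
      # through them are untouched.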
6.2. BGP-PIC edge

   This section describes the adjustments to the forwarding chains as
   a result of an edge node or edge link failure.

6.2.1. Adjusting forwarding Chain in egress node failure

   When an edge node fails, IGP on the neighboring core nodes sends
   route updates indicating that the edge node is no longer reachable.
   IGP running on the iBGP peers instructs FIB to remove the IP and
   label leaves corresponding to the failed edge node. So the FIB
   manager performs the following steps:

   o FIB manager deletes the IGP leaf corresponding to the failed edge
     node.

   o FIB manager backwalks to all dependent BGP pathlists and marks
     the path using the deleted IGP leaf as unresolved.

   o Note that there is no need to modify the possibly large number of
     BGP leaves because each path in the pathlist carries its path-
     index and hence the correct outgoing label will be picked.
     Consider for example the forwarding chain depicted in Figure 2.
     If the 1st BGP path becomes unresolved, then the forwarding
     engine will only use the second path for forwarding. Yet the
     path-index of that single resolved path will still be 1 and hence
     the label VPN-L21 will be pushed.

6.2.2. Adjusting Forwarding Chain on PE-CE link Failure

   Suppose the link between an edge router and its external peer
   fails. There are two scenarios: (1) the edge node attached to the
   failed link performs next-hop self, and (2) the edge node attached
   to the failed link advertises the IP address of the failed link as
   the next-hop attribute to its iBGP peers.

   In the first case, the rest of the iBGP peers will remain unaware
   of the link failure and will continue to forward traffic to the
   edge node until the edge node attached to the failed link withdraws
   the BGP prefixes. If the destination prefixes are multi-homed to
   another iBGP peer, say ePE2, then the FIB manager on the edge
   router detecting the link failure applies the following steps:

   o FIB manager backwalks to the BGP pathlists and marks the path
     through the failed link to the external peer as unresolved.

   o Hence traffic will be forwarded using the backup path towards
     ePE2.

   o For labeled traffic:

     o The OutLabel-List attached to the BGP leaf already contains an
       entry corresponding to the backup path.

     o The label entry in the OutLabel-List corresponding to the
       internal path to the backup egress PE has a swap action to the
       label advertised by the backup egress PE.

     o For an arriving labeled packet (e.g. VPN), the top label is
       swapped with the label advertised by the backup egress PE and
       the packet is sent towards that backup egress PE.

   o For unlabeled traffic, packets are simply redirected towards the
     backup egress PE.

   In the second case, where the edge router uses the IP address of
   the failed link as the BGP next-hop, the edge router will still
   perform the previous steps.
But, unlike the case of next-hop self, IGP on the edge node attached
   to the failed link informs the rest of the iBGP peers that the IP
   address of the failed link is no longer reachable. Hence the FIB
   manager on the iBGP peers will delete the IGP leaf corresponding to
   the IP prefix of the failed link. The behavior of the iBGP peers
   will be identical to the case of edge node failure outlined in
   Section 6.2.1.

   It is noteworthy to mention that because the edge link failure is
   local to the edge router, sub-50 msec convergence can be achieved
   as described in [9].

   Let's apply the case of next-hop self to the forwarding chain
   depicted in Figure 3. After failure of the link between ePE1 and
   the CE, the forwarding engine will route traffic arriving from the
   core towards VPN-NH2 with path-index=1. A packet arriving from the
   core will contain the label VPN-L11 at the top. The label VPN-L11
   is swapped with the label VPN-L21 and the packet is forwarded
   towards ePE2.

6.3. Handling Failures for Flattened Forwarding Chains

   As explained in Section 5, if a platform cannot support the native
   number of hierarchy levels of a recursive forwarding chain, the
   instantiated forwarding chain is constructed by flattening two or
   more levels. Hence the 3-level chain in Figure 5 is flattened into
   the 2-level chain in Figure 6.

   While reducing the benefits of BGP-PIC, flattening a hierarchy into
   a shallower one does not always result in a complete loss of those
   benefits. To illustrate this fact, suppose ASBR12 is no longer
   reachable in domain 1. If the platform supports the full hierarchy
   depth, the forwarding chain is the one depicted in Figure 5 and
   hence the FIB manager needs to backwalk one level to the pathlist
   shared by "ePE1" and "ePE2" and adjust it. If the platform supports
   only 2 levels of hierarchy, then a usable forwarding chain is the
   one depicted in Figure 6. In that case, if ASBR12 is no longer
   reachable, the FIB manager has to backwalk to the two flattened
   pathlists and update both of them.

   The main observation is that the loss of convergence speed due to
   the loss of hierarchy depth depends on the structure of the
   forwarding chain itself. To illustrate this fact, let's take two
   extremes. Suppose the forwarding objects in level i+1 depend on the
   forwarding objects in level i. If every object in level i+1 depends
   on a separate object in level i, then flattening level i into level
   i+1 will not result in a loss of convergence speed. Now let's take
   the other extreme. Suppose "n" objects in level i+1 depend on 1
   object in level i. Now suppose FIB flattens level i into level i+1.
   If a topology change results in modifying the single object in
   level i, then FIB has to backwalk and modify "n" objects in the
   flattened level, thereby losing all the benefit of BGP-PIC.
   Experience shows that flattening forwarding chains usually results
   in a moderate loss of BGP-PIC benefits. Further analysis is needed
   to corroborate and quantify this statement.

7. Properties

7.1. Coverage

   All the possible failures, except CE node failure, are covered,
   whether they impact a local or remote IGP path or a local or remote
   BGP next-hop, as described in Section 6.
This section provides
   details for each failure and how the hierarchical and shared FIB
   structure proposed in this document allows recovery that does not
   depend on the number of BGP prefixes.

7.1.1. A remote failure on the path to a BGP next-hop

   Upon IGP convergence, the IGP leaf for the BGP next-hop is updated
   and all the depending BGP routes leverage the new IGP forwarding
   state immediately. Details of this behavior can be found in Section
   6.1.

   This BGP resiliency property only depends on IGP convergence and is
   independent of the number of BGP prefixes impacted.

7.1.2. A local failure on the path to a BGP next-hop

   Upon LFA protection, the IGP leaf for the BGP next-hop is updated
   to use the precomputed LFA backup path and all the depending BGP
   routes leverage this LFA protection. Details of this behavior can
   be found in Section 6.1.

   This BGP resiliency property only depends on LFA protection and is
   independent of the number of BGP prefixes impacted.

7.1.3. A remote iBGP next-hop fails

   Upon IGP convergence, the IGP leaf for the BGP next-hop is deleted
   and all the depending BGP pathlists are updated to either use the
   remaining ECMP BGP best-paths or, if none remains available, to
   activate precomputed backups. Details about this behavior can be
   found in Section 6.2.1.

   This BGP resiliency property only depends on IGP convergence and is
   independent of the number of BGP prefixes impacted.

7.1.4. A local eBGP next-hop fails

   Upon local link failure detection, the adjacency to the BGP next-
   hop is deleted and all the depending BGP pathlists are updated to
   either use the remaining ECMP BGP best-paths or, if none remains
   available, to activate precomputed backups. Details about this
   behavior can be found in Section 6.2.2.

   This BGP resiliency property only depends on local link failure
   detection and is independent of the number of BGP prefixes
   impacted.

7.2. Performance

   When the failure is local (a local IGP next-hop failure or a local
   eBGP next-hop failure), a pre-computed and pre-installed backup is
   activated by a local-protection mechanism that does not depend on
   the number of BGP destinations impacted by the failure. Sub-50msec
   restoration is thus possible even if millions of BGP routes are
   impacted.

   When the failure is remote (a remote IGP failure not impacting the
   BGP next-hop or a remote BGP next-hop failure), an alternate path
   is activated upon IGP convergence. All the impacted BGP
   destinations benefit from a working alternate path as soon as the
   IGP convergence occurs for their impacted BGP next-hop, even if
   millions of BGP routes are impacted.

   Appendix A puts the BGP-PIC benefits in perspective by providing
   some results using actual numbers.

7.3. Automated

   The BGP-PIC solution does not require any operator involvement. The
   process is entirely automated as part of the FIB implementation.

   The salient points enabling this automation are:

   o Extension of the BGP best path computation to compute more than
     one primary ([10] and [11]) or backup BGP next-hop ([5] and
     [12]).

   o Sharing of BGP pathlists across BGP destinations with the same
     primary and backup BGP next-hops.

   o Hierarchical indirection and dependency between the BGP pathlist
     and the IGP pathlist.

7.4.
Incremental Deployment 1221 As soon as one router supports BGP-PIC solution, it benefits from 1222 all its benefits without any requirement for other routers to 1223 support BGP-PIC. 1225 8. Security Considerations 1227 The behavior described in this document is internal functionality 1228 to a router that result in significant improvement to convergence 1229 time as well as reduction in CPU and memory used by FIB while not 1230 showing change in basic routing and forwarding functionality. As 1231 such no additional security risk is introduced by using the 1232 mechanisms proposed in this document. 1234 9. IANA Considerations 1236 No requirements for IANA 1238 10. Conclusions 1240 This document proposes a hierarchical and shared forwarding chain 1241 structure that allows achieving BGP prefix independent 1242 convergence, and in the case of locally detected failures, sub-50 1243 msec convergence. A router can construct the forwarding chains in 1244 a completely transparent manner with zero operator intervention 1245 thereby supporting smooth and incremental deployment. 1247 11. References 1249 11.1. Normative References 1251 [1] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway Protocol 1252 4 (BGP-4), RFC 4271, January 2006 1254 [2] Bates, T., Chandra, R., Katz, D., and Rekhter Y., 1255 "Multiprotocol Extensions for BGP", RFC 4760, January 2007 1257 [3] Y. Rekhter and E. Rosen, " Carrying Label Information in BGP- 1258 4", RFC 8277, October 2017 1260 [4] Andersson, L., Minei, I., and B. Thomas, "LDP Specification", 1261 RFC 5036, October 2007 1263 11.2. Informative References 1265 [5] Marques,P., Fernando, R., Chen, E, Mohapatra, P., Gredler, H., 1266 "Advertisement of the best external route in BGP", draft-ietf- 1267 idr-best-external-05.txt, January 2012. 1269 [6] Wu, J., Cui, Y., Metz, C., and E. Rosen, "Softwire Mesh 1270 Framework", RFC 5565, June 2009. 1272 [7] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 1273 Networks (VPNs)", RFC 4364, February 2006. 1275 [8] De Clercq, J. , Ooms, D., Prevost, S., Le Faucheur, F., 1276 "Connecting IPv6 Islands over IPv4 MPLS Using IPv6 Provider 1277 Edge Routers (6PE)", RFC 4798, February 2007 1279 [9] O. Bonaventure, C. Filsfils, and P. Francois. "Achieving sub- 1280 50 milliseconds recovery upon bgp peering link failures, " 1281 IEEE/ACM Transactions on Networking, 15(5):1123-1135, 2007 1283 [10] D. Walton, A. Retana, E. Chen, J. Scudder, "Advertisement of 1284 Multiple Paths in BGP", RFC 7911, July 2016 1286 [11] R. Raszuk, R. Fernando, K. Patel, D. McPherson, K. Kumaki, 1287 "Distribution of diverse BGP paths", RFC 6774, November 2012 1289 [12] P. Mohapatra, R. Fernando, C. Filsfils, and R. Raszuk, "Fast 1290 Connectivity Restoration Using BGP Add-path", draft-pmohapat- 1291 idr-fast-conn-restore-03, Jan 2013 1293 [13] A. Bashandy, C. Filsfils, S. Previdi, B. Decraene, S. 1294 Litkowski, M. Horneffer, R. Shakir, "Segment Routing with MPLS 1295 data plane", RFC 8660, December 2019 1297 [14] S. Litkowski, A. Bashandy, C. Filsfils, B. Decraene, P. 1298 Francois, D. Voyer, F. Clad, P. Camarillo" Topology 1299 Independent Fast Reroute using Segment Routing", draft-ietf- 1300 rtgwg-segment-routing-ti-lfa-02 (work in progress), January 1301 2020 1303 [15] M. Shand and S. Bryant, "IP Fast Reroute Framework", RFC 5714, 1304 January 2010 1306 [16] S. Bryant, C. Filsfils, S. Previdi, M. Shand, N So, " Remote 1307 Loop-Free Alternate (LFA) Fast Reroute (FRR)", RFC 7490 April 1308 2015 1310 [17] A. Atlas, C. Bowers, G. 
Enyedi, "An Architecture for IP/LDP
        Fast-Reroute Using Maximally Redundant Trees", RFC 7812, June
        2016

12. Acknowledgments

   Special thanks to Neeraj Malhotra and Yuri Tsier for the valuable
   help.

   Special thanks to Bruno Decraene for the valuable comments.

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Ahmed Bashandy
   Individual Contributor
   Email: abashandy.ietf@gmail.com

   Clarence Filsfils
   Cisco Systems
   Brussels, Belgium
   Email: cfilsfil@cisco.com

   Prodosh Mohapatra
   Sproute Networks
   Email: mpradosh@yahoo.com

Appendix A. Perspective

   The following table puts the BGP-PIC benefits in perspective,
   assuming:

   o 1M impacted BGP prefixes

   o IGP convergence ~ 500 msec

   o local protection ~ 50 msec

   o FIB update per BGP destination ~ 100 usec conservative,
     ~ 10 usec optimistic

   o BGP convergence per BGP destination ~ 200 usec conservative,
     ~ 100 usec optimistic

                          Without PIC        With PIC

   Local IGP Failure      10 to 100 sec      50 msec

   Local BGP Failure      100 to 200 sec     50 msec

   Remote IGP Failure     10 to 100 sec      500 msec

   Remote BGP Failure     100 to 200 sec     500 msec

   Upon local IGP next-hop failure or remote IGP next-hop failure, the
   existing primary BGP next-hop is intact and usable; hence the
   resiliency only depends on the ability of the FIB mechanism to
   reflect the new path to the BGP next-hop to the depending BGP
   destinations. Without BGP-PIC, a conservative back-of-the-envelope
   estimation for this FIB update is 100 usec per BGP destination. An
   optimistic estimation is 10 usec per entry.

   Upon local BGP next-hop failure or remote BGP next-hop failure,
   without the BGP-PIC mechanism, a new BGP best path needs to be
   recomputed and new updates need to be sent to peers. This depends
   on BGP processing time that will be shared between best-path
   computation, RIB update, and peer update. A conservative back-of-
   the-envelope estimation for this is 200 usec per BGP destination.
   An optimistic estimation is 100 usec per entry.
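   The figures in the table above follow directly from the stated
   assumptions; the short calculation below (Python, purely
   illustrative) reproduces them.

      # Back-of-the-envelope reproduction of the table above.
      prefixes = 1_000_000

      fib_update = (10e-6, 100e-6)   # sec/destination (optimistic, conservative)
      bgp_update = (100e-6, 200e-6)  # sec/destination (optimistic, conservative)

      # Without PIC, restoration scales with the number of prefixes:
      igp_failure_no_pic = [t * prefixes for t in fib_update]  # [10.0, 100.0] sec
      bgp_failure_no_pic = [t * prefixes for t in bgp_update]  # [100.0, 200.0] sec

      # With PIC, restoration depends only on detection/convergence:
      local_with_pic = 0.050    # local protection ~ 50 msec
      remote_with_pic = 0.500   # IGP convergence ~ 500 msec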