Network Working Group                                   A. Bashandy, Ed.
Internet Draft                                    Individual Contributor
Intended status: Informational                               C. Filsfils
Expires: March 2022                                        Cisco Systems
                                                            P. Mohapatra
                                                        Sproute Networks
                                                      September 26, 2021


                   BGP Prefix Independent Convergence
                    draft-ietf-rtgwg-bgp-pic-16.txt

Abstract

   In a network comprising thousands of BGP peers exchanging millions of
   routes, many routes are reachable via more than one next-hop. Given
   the large scaling targets, it is desirable to restore traffic after
   a failure in a time period that does not depend on the number of BGP
   prefixes. This document proposes an architecture by which traffic
   can be re-routed to ECMP or pre-calculated backup paths in a
   timeframe that does not depend on the number of BGP prefixes. The
   objective is achieved by organizing the forwarding data structures
   in a hierarchical manner and sharing forwarding elements among the
   maximum possible number of routes. The proposed technique achieves
   prefix independent convergence while ensuring incremental
   deployment, complete automation, and zero management and
   provisioning effort. Note that the benefits of BGP Prefix
   Independent Convergence (BGP-PIC) hinge on the existence of more
   than one path, whether as ECMP or primary-backup.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on March 26, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Terminology
   2. Overview
      2.1. Dependency
         2.1.1. Hierarchical Hardware FIB (Forwarding Information Base)
         2.1.2. Availability of more than one BGP next-hops
      2.2. BGP-PIC Illustration
   3. Constructing the Shared Hierarchical Forwarding Chain
      3.1. Constructing the BGP-PIC forwarding Chain
      3.2. Example: Primary-Backup Path Scenario
   4. Forwarding Behavior
   5. Handling Platforms with Limited Levels of Hierarchy
      5.1. Flattening the Forwarding Chain
      5.2. Example: Flattening a forwarding chain
   6. Forwarding Chain Adjustment at a Failure
      6.1. BGP-PIC core
      6.2. BGP-PIC edge
         6.2.1. Adjusting forwarding Chain in egress node failure
         6.2.2. Adjusting Forwarding Chain on PE-CE link Failure
      6.3. Handling Failures for Flattened Forwarding Chains
   7. Properties
      7.1. Coverage
         7.1.1. A remote failure on the path to a BGP next-hop
         7.1.2. A local failure on the path to a BGP next-hop
         7.1.3. A remote iBGP next-hop fails
         7.1.4. A local eBGP next-hop fails
      7.2. Performance
      7.3. Automated
      7.4. Incremental Deployment
   8. Security Considerations
   9. IANA Considerations
   10. Conclusions
   11. References
      11.1. Normative References
      11.2. Informative References
   12. Acknowledgments
   Appendix A. Perspective

1. Introduction

   BGP speakers exchange reachability information about prefixes
   [1][2]. For labeled address families, namely AFI/SAFI 1/4, 2/4,
   1/128, and 2/128, an edge router assigns local labels to prefixes
   and associates the local label with each advertised prefix, using
   technologies such as L3VPN [9], 6PE [10], and Softwire [8] with the
   BGP labeled unicast (BGP-LU) technique [3]. A BGP speaker then
   applies the path selection steps to choose the best path. In modern
   networks, it is not uncommon to have a prefix reachable via multiple
   edge routers. In addition to proprietary techniques, multiple
   techniques have been proposed to allow BGP to advertise more than
   one path for a given prefix [7][12][13], whether in the form of
   equal cost multipath or primary-backup. Another common and widely
   deployed scenario is L3VPN with multi-homed VPN sites with unique
   Route Distinguishers. It is advantageous to utilize the commonality
   among paths used by NLRIs [1] to significantly improve convergence
   in case of topology modifications.

   This document proposes a hierarchical and shared forwarding chain
   organization that allows traffic to be restored to a pre-calculated
   alternative equal-cost primary path or backup path in a time period
   that does not depend on the number of BGP prefixes. The technique
   relies on internal router behavior that is completely transparent to
   the operator and can be incrementally deployed and enabled with zero
   operator intervention. In other words, once it is implemented and
   deployed on a router, nothing is required from the operator to make
   it work. Note that this document describes a FIB architecture that
   can be implemented in hardware, software, or both.

1.1. Terminology

   This section defines the terms used in this document. For ease of
   use, we will use terms similar to those used by L3VPN [9].

   o  BGP prefix: A prefix P/m (of any AFI/SAFI) that a BGP speaker
      has a path for.

   o  IGP prefix: A prefix P/m (of any AFI/SAFI) that is learnt via
      an Interior Gateway Protocol (IGP), such as OSPF or IS-IS. The
      prefix may be learnt directly through the IGP or redistributed
      from other protocol(s).

   o  CE [7]: An external router through which an egress PE can reach
      a prefix P/m.

   o  Egress PE [7], "ePE": A BGP speaker that learns about a prefix
      through an eBGP peer and chooses that eBGP peer as the next-hop
      for that prefix.

   o  Ingress PE, "iPE": A BGP speaker that learns about a prefix
      through an iBGP peer and chooses an egress PE as the next-hop
      for the prefix.

   o  Path: The next-hop in a sequence of nodes starting from the
      current node and ending with the destination node or network
      identified by the prefix. The nodes may not be directly
      connected.

   o  Recursive path: A path consisting only of the IP address of the
      next-hop without the outgoing interface. Subsequent lookups are
      necessary to determine the outgoing interface and a directly
      connected next-hop.

   o  Non-recursive path: A path consisting of the IP address of a
      directly connected next-hop and outgoing interface.

   o  Adjacency: The layer 2 encapsulation leading to the layer 3
      directly connected next-hop.

   o  Primary path: A recursive or non-recursive path that can be
      used all the time as long as a walk starting from this path can
      end at an adjacency. A prefix can have more than one primary
      path.

   o  Backup path: A recursive or non-recursive path that can be used
      only after some or all primary paths become unreachable.

   o  Leaf: A container data structure for a prefix or local label.
      Alternatively, it is the data structure that contains
      prefix-specific information.

   o  IP leaf: The leaf corresponding to an IPv4 or IPv6 prefix.

   o  Label leaf: The leaf corresponding to a locally allocated label
      such as the VPN label on an egress PE [9].

   o  Pathlist: An array of paths used by one or more prefixes to
      forward traffic to destination(s) covered by an IP prefix. Each
      path in the pathlist carries its "path-index" that identifies
      its position in the array of paths. In general, the value of the
      "path-index" stored in a path is not necessarily the same as the
      position of the path in the pathlist. For example, the 3rd path
      may carry a path-index value of 1. A pathlist may contain a mix
      of primary and backup paths.

   o  OutLabel-List: Each labeled prefix is associated with an
      OutLabel-List. The OutLabel-List is an array of one or more
      outgoing labels and/or label actions where each label or label
      action has 1-to-1 correspondence to a path in the pathlist.
      Label actions [6] are: push the label, pop the label, swap the
      incoming label with the label in the OutLabel-List entry, or
      don't push anything at all in case of "unlabeled". The prefix
      may be an IGP or BGP prefix.

   o  Forwarding chain: A compound data structure consisting of
      multiple connected blocks that a forwarding engine walks one
      block at a time to forward the packet out of an interface.
      Section 2.2 explains an example of a forwarding chain.
      Subsequent sections provide additional examples.

   o  Dependency: An object X is said to be a dependent or child of
      object Y if there is at least one forwarding chain where the
      forwarding engine must visit the object X before visiting the
      object Y in order to forward a packet. Note that if object X is
      a child of object Y, then Y cannot be deleted unless object X
      is no longer a dependent/child of object Y.

   o  Route: A prefix with one or more paths associated with it. The
      minimum set of objects needed to construct a route is a leaf
      and a pathlist.

2. Overview

   The idea of BGP-PIC is based on two pillars:

   o  A shared hierarchical forwarding chain: It is not uncommon to
      see multiple destinations reachable via the same list of
      next-hops. Instead of having a separate list of next-hops for
      each destination, all destinations sharing the same list of
      next-hops can point to a single copy of this list, thereby
      allowing fast convergence by making changes to a single shared
      list of next-hops rather than to a possibly large number of
      destinations. Because paths in a pathlist may be recursive, a
      hierarchy is formed between the pathlist and the resolving
      prefix, whereby the pathlist depends on the resolving prefix.

   o  A forwarding plane that supports multiple levels of indirection:
      A forwarding chain that starts with a destination and ends with
      an outgoing interface is not a simple flat structure. Instead, a
      forwarding entry is constructed via multiple levels of
      dependency. A BGP NLRI uses a recursive next-hop, which in turn
      resolves via an IGP next-hop, which in turn resolves via an
      adjacency consisting of one or more outgoing interface(s) and
      next-hop(s).

   Designing a forwarding plane that constructs multi-level forwarding
   chains with maximal sharing of forwarding objects allows rerouting a
   large number of destinations by modifying a small number of objects,
   thereby achieving convergence in a time frame that does not depend
   on the number of destinations. For example, if the IGP prefix that
   resolves a recursive next-hop is updated, there is no need to update
   the possibly large number of BGP NLRIs that use this recursive
   next-hop.

2.1. Dependency

   This section describes the functionality required in the forwarding
   and control planes to support BGP-PIC as described in this document.

2.1.1. Hierarchical Hardware FIB (Forwarding Information Base)

   BGP-PIC requires hierarchical hardware FIB support: for each BGP
   forwarded packet, a BGP leaf is looked up, then a BGP Pathlist is
   consulted, then an IGP Pathlist, then an Adjacency.

   An alternative method consists in "flattening" the dependencies when
   programming the BGP destinations into the HW FIB, potentially
   eliminating both the BGP Pathlist and IGP Pathlist consultation.
   Such an approach decreases the number of memory lookups per
   forwarding operation at the expense of increased HW FIB memory
   (flattening means less sharing, thereby more duplication), loss of
   ECMP properties (flattening means less pathlist entropy), and loss
   of BGP-PIC properties.

2.1.2. Availability of more than one BGP next-hops

   When the primary BGP next-hop fails, BGP-PIC depends on the
   availability of one or more pre-computed and pre-installed secondary
   BGP next-hop(s) in the BGP Pathlist.

   The existence of a secondary next-hop is clearly required for the
   following reason: a service caring for network availability will
   require two disjoint network connections, resulting in two BGP
   next-hops.

   The BGP distribution of secondary next-hops is available thanks to
   the following BGP mechanisms: Add-Path [12], BGP Best-External [7],
   diverse path [13], and the frequent use in VPN deployments of
   different VPN RDs per PE. Another option to learn multiple BGP
   next-hops/paths is simply to receive iBGP paths from multiple BGP
   route reflectors (RRs), each selecting a different path as best.
   Note that the availability of another BGP path does not mean that
   all failure scenarios can be covered by simply forwarding traffic to
   the available secondary path. The discussion of how to cover various
   failure scenarios is beyond the scope of this document.

2.2. BGP-PIC Illustration

   To illustrate the two pillars above as well as the platform
   dependency, we will use an example of a simple multihomed L3VPN [9]
   prefix in a BGP-free core running LDP [4] or segment routing over
   the MPLS forwarding plane [5].

   +--------------------------------+
   |                                |
   |                               ePE2 (IGP-IP2 192.0.2.2, Loopback)
   |                                |  \
   |                                |   \
   |                                |    \
  iPE                               |    CE....VRF "Blue", ASnum 65000
   |                                |    /  (VPN-IP1 198.51.100.0/24)
   |                                |   /   (VPN-IP2 203.0.113.0/24)
   |  LDP/Segment-Routing Core      |  /
   |                               ePE1 (IGP-IP1 192.0.2.1, Loopback)
   |                                |
   +--------------------------------+

           Figure 1 VPN prefix reachable via multiple PEs

   Referring to Figure 1, suppose the iPE (the ingress PE) receives
   NLRIs for the VPN prefixes VPN-IP1 and VPN-IP2 from two egress PEs,
   ePE1 and ePE2, with next-hop BGP-NH1 and BGP-NH2, respectively.
   Assume that ePE1 advertises the VPN labels VPN-L11 and VPN-L12 while
   ePE2 advertises the VPN labels VPN-L21 and VPN-L22 for VPN-IP1 and
   VPN-IP2, respectively. Suppose that BGP-NH1 and BGP-NH2 are resolved
   via the IGP prefixes IGP-IP1 and IGP-IP2, each of which happens to
   have 2 equal cost paths with IGP-NH1 and IGP-NH2 reachable via the
   interfaces I1 and I2, respectively. Suppose that the local labels
   (whether LDP [4] or segment routing [5]) on the downstream LSRs [4]
   for IGP-IP1 are IGP-L11 and IGP-L12, while for IGP-IP2 they are
   IGP-L21 and IGP-L22. As such, the routing table at iPE is as
   follows:

      65000: 198.51.100.0/24
         via ePE1 (192.0.2.1), VPN Label: VPN-L11
         via ePE2 (192.0.2.2), VPN Label: VPN-L21

      65000: 203.0.113.0/24
         via ePE1 (192.0.2.1), VPN Label: VPN-L12
         via ePE2 (192.0.2.2), VPN Label: VPN-L22

      192.0.2.1/32
         via I1, Label: IGP-L11
         via I2, Label: IGP-L12

      192.0.2.2/32
         via I1, Label: IGP-L21
         via I2, Label: IGP-L22

   Based on the above routing table, a hierarchical forwarding chain
   can be constructed as shown in Figure 2.

   IP Leaf:  Pathlist:   IP Leaf:            Pathlist:
   --------  +-------+   --------            +----------+
   VPN-IP1-->|BGP-NH1|-->IGP-IP1(BGP NH1)--->|IGP-NH1,I1|--->Adjacency1
      |      |BGP-NH2|-->....    |           |IGP-NH2,I2|--->Adjacency2
      |      +-------+           |           +----------+
      |                          |
      |                          |
      v                          v
   OutLabel-List:             OutLabel-List:
   +----------------------+   +----------------------+
   |VPN-L11 (VPN-IP1, NH1)|   |IGP-L11 (IGP-IP1, NH1)|
   |VPN-L21 (VPN-IP1, NH2)|   |IGP-L12 (IGP-IP1, NH2)|
   +----------------------+   +----------------------+

        Figure 2 Shared Hierarchical Forwarding Chain at iPE

   The forwarding chain depicted in Figure 2 illustrates the first
   pillar, which is sharing and hierarchy. We can see that the BGP
   pathlist consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs
   reachable via ePE1 and ePE2. As such, it is possible to make changes
   to the pathlist without having to make changes to the NLRIs. For
   example, if BGP-NH2 becomes unreachable, there is no need to modify
   any of the possibly large number of NLRIs. Instead, only the shared
   pathlist needs to be modified. Likewise, due to the hierarchical
   structure of the forwarding chain, it is possible to make
   modifications to the IGP routes without having to make any changes
   to the BGP NLRIs. For example, if the interface "I2" goes down, only
   the shared IGP pathlist needs to be updated, but none of the IGP
   prefixes sharing the IGP pathlist nor the BGP NLRIs using the IGP
   prefixes for resolution need to be modified.

   Figure 2 can also be used to illustrate the second BGP-PIC pillar.
   Having a deep forwarding chain such as the one illustrated in Figure
   2 requires a forwarding plane that is capable of accessing multiple
   levels of indirection in order to calculate the outgoing
   interface(s) and next-hop(s).
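
   As an illustration only, the chain of Figure 2 can be modelled with
   a few small objects. The Python sketch below is not taken from any
   implementation: the class names (Adjacency, Path, Pathlist, Leaf)
   and the walk() helper are assumptions made for this example. It
   builds the chain from the routing table above, walks it through both
   levels of indirection for a fixed choice of path at each level, and
   then removes BGP-NH2 from the shared pathlist without touching any
   leaf.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Adjacency:
    interface: str   # outgoing interface, e.g. "I1"
    next_hop: str    # directly connected next-hop, e.g. "IGP-NH1"


@dataclass
class Path:
    path_index: int                        # index into the leaf's OutLabel-List
    parent_leaf: Optional["Leaf"] = None   # set for a recursive path
    adjacency: Optional[Adjacency] = None  # set for a non-recursive path


@dataclass
class Pathlist:
    paths: List[Path] = field(default_factory=list)


@dataclass
class Leaf:
    name: str
    pathlist: Pathlist      # may be shared with other leaves
    out_labels: List[str]   # OutLabel-List, indexed by path_index


def walk(leaf, choices):
    """Walk the chain from `leaf`; `choices` is the path picked at each level.

    Returns (interface, label stack with the outermost label first).
    """
    stack = []
    for pick in choices:
        path = leaf.pathlist.paths[pick]
        stack.insert(0, leaf.out_labels[path.path_index])  # push this level's label
        if path.adjacency is not None:
            return path.adjacency.interface, stack
        leaf = path.parent_leaf
    raise ValueError("walk did not reach an adjacency")


# IGP level: one pathlist shared by both IGP prefixes.
igp_pathlist = Pathlist([Path(0, adjacency=Adjacency("I1", "IGP-NH1")),
                         Path(1, adjacency=Adjacency("I2", "IGP-NH2"))])
igp_ip1 = Leaf("IGP-IP1", igp_pathlist, ["IGP-L11", "IGP-L12"])
igp_ip2 = Leaf("IGP-IP2", igp_pathlist, ["IGP-L21", "IGP-L22"])

# BGP level: one pathlist shared by both VPN prefixes.
bgp_pathlist = Pathlist([Path(0, parent_leaf=igp_ip1),
                         Path(1, parent_leaf=igp_ip2)])
vpn_ip1 = Leaf("VPN-IP1", bgp_pathlist, ["VPN-L11", "VPN-L21"])
vpn_ip2 = Leaf("VPN-IP2", bgp_pathlist, ["VPN-L12", "VPN-L22"])

before = walk(vpn_ip1, [0, 1])
print(before)   # ('I2', ['IGP-L12', 'VPN-L11'])

# BGP-NH2 becomes unreachable: one change to the shared pathlist,
# no change to VPN-IP1 or VPN-IP2.
del bgp_pathlist.paths[1]
after = walk(vpn_ip2, [0, 0])
print(after)    # ('I1', ['IGP-L11', 'VPN-L12'])
```

   The single deletion affects every leaf that points at the shared
   pathlist, which is the property the first pillar relies on.
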
   While a deeper forwarding chain minimizes the re-convergence time on
   topology change, there will always exist platforms with limited
   capabilities, which impose a limit on the depth of the forwarding
   chain. Section 5 describes how to gracefully trade off convergence
   speed with the number of hierarchical levels to support platforms
   with different capabilities.

   An equivalent example using IPv6 addresses is the following:

      65000: 2001:DB8:1::/48
         via ePE1 (2001:DB8:192::1), VPN Label: VPN6-L11
         via ePE2 (2001:DB8:192::2), VPN Label: VPN6-L21

      65000: 2001:DB8:2::/48
         via ePE1 (2001:DB8:192::1), VPN Label: VPN6-L12
         via ePE2 (2001:DB8:192::2), VPN Label: VPN6-L22

      2001:DB8:192::1/128
         via Core, Label: IGP6-L11
         via Core, Label: IGP6-L12

      2001:DB8:192::2/128
         via Core, Label: IGP6-L21
         via Core, Label: IGP6-L22

   The same hierarchical forwarding chain described above can be
   constructed for IPv6 addresses/prefixes.

3. Constructing the Shared Hierarchical Forwarding Chain

   Constructing the forwarding chain is an application of the two
   pillars described in Section 2. This section describes how to
   construct the forwarding chain in a hierarchical shared manner.

3.1. Constructing the BGP-PIC forwarding Chain

   The whole process starts when BGP downloads a prefix to the FIB. The
   prefix contains one or more outgoing paths. For certain labeled
   prefixes, such as VPN [9] prefixes, each path may be associated with
   an outgoing label and the prefix itself may be assigned a local
   label. The list of outgoing paths defines a pathlist. If such a
   pathlist does not already exist, the FIB manager (the software or
   hardware entity responsible for managing the FIB) creates a new
   pathlist; otherwise, the existing pathlist is used. The BGP prefix
   is added as a dependent of the pathlist.

   The previous step constructs the upper part of the hierarchical
   forwarding chain.
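
   The create-or-reuse step can be sketched in Python as follows. This
   is a minimal illustration under stated assumptions, not the method
   of any particular implementation: the FibManager class, its
   download() method, and the use of the ordered tuple of next-hops as
   the lookup key are all hypothetical.

```python
from typing import Dict, List, Tuple


class Pathlist:
    def __init__(self, next_hops: Tuple[str, ...]) -> None:
        self.next_hops = list(next_hops)
        self.dependents: List[str] = []   # prefixes that use this pathlist


class FibManager:
    """Hypothetical FIB manager that shares pathlists between prefixes."""

    def __init__(self) -> None:
        self._pathlists: Dict[Tuple[str, ...], Pathlist] = {}
        self.leaves: Dict[str, Pathlist] = {}

    def download(self, prefix: str, next_hops: List[str]) -> Pathlist:
        key = tuple(next_hops)
        pathlist = self._pathlists.get(key)
        if pathlist is None:                 # no such pathlist yet: create it
            pathlist = Pathlist(key)
            self._pathlists[key] = pathlist
        pathlist.dependents.append(prefix)   # the prefix becomes a dependent
        self.leaves[prefix] = pathlist
        return pathlist


fib = FibManager()
a = fib.download("198.51.100.0/24", ["192.0.2.1", "192.0.2.2"])
b = fib.download("203.0.113.0/24", ["192.0.2.1", "192.0.2.2"])
print(a is b)         # True: both prefixes point at one pathlist
print(a.dependents)   # ['198.51.100.0/24', '203.0.113.0/24']
```

   Keying on the ordered list of next-hops is only one possible choice;
   the point of the sketch is that a second prefix with the same paths
   adds a dependent rather than a second pathlist.
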
   The forwarding chain is completed by resolving the paths of the
   pathlist. A BGP path usually consists of a next-hop. The next-hop is
   resolved by finding a matching IGP prefix.

   The end result is a hierarchical shared forwarding chain where the
   BGP pathlist is shared by all BGP prefixes that use the same list of
   paths and the IGP prefix is shared by all pathlists that have a path
   resolving via that IGP prefix. Note that the forwarding chain is
   constructed without any operator intervention at all.

   The remainder of this section goes over an example to illustrate the
   applicability of BGP-PIC in a primary-backup path scenario.

3.2. Example: Primary-Backup Path Scenario

   Consider the egress PE ePE1 in the case of the multi-homed VPN
   prefixes in the BGP-free core depicted in Figure 1. Suppose ePE1
   determines that the primary path is the external path but the backup
   path is the iBGP path to the other PE, ePE2, with next-hop BGP-NH2.
   ePE1 constructs the forwarding chain depicted in Figure 3. Only a
   single VPN prefix is shown for simplicity, but all prefixes that are
   multihomed to ePE1 and ePE2 share the BGP pathlist.

                     BGP OutLabel-List
    VPN-L11             +---------+
   (Label-leaf)---+---->|Unlabeled|
                  |     +---------+
                  v     | VPN-L21 |
                  |     | (swap)  |
                  |     +---------+
                  |
                  |                    BGP Pathlist
                  |                   +------------+    Connected route
                  |                   |   CE-NH    |------>(to the CE)
                  |                   |path-index=0|
                  |                   +------------+
                  |                   |  VPN-NH2   |
   VPN-IP1 -------+------------------>|  (backup)  |------>IGP Leaf
   (IP leaf)      |                   |path-index=1|    (Towards ePE2)
                  |                   +------------+
                  |
                  |           BGP OutLabel-List
                  |              +---------+
                  +------------->|Unlabeled|
                                 +---------+
                                 | VPN-L21 |
                                 | (push)  |
                                 +---------+

   Figure 3 : VPN Prefix Forwarding Chain with eiBGP paths on egress PE

   The example depicted in Figure 3 differs from the example in Figure
   2 in two main aspects.
   First, as long as the primary path towards the CE (external path) is
   usable, it will be the only path used for forwarding, while the
   OutLabel-List contains both the unlabeled label (primary path) and
   the VPN label (backup path) advertised by the backup path ePE2. The
   second aspect is the presence of the label leaf corresponding to the
   VPN prefix. This label leaf is used to match VPN traffic arriving
   from the core. Note that the label leaf shares the pathlist with the
   IP prefix.

4. Forwarding Behavior

   This section explains how the forwarding plane uses the hierarchical
   shared forwarding chain to forward a packet.

   When a packet arrives at a router, it matches a leaf. A labeled
   packet matches a label leaf while an IP packet matches an IP leaf.
   The forwarding engine walks the forwarding chain starting from the
   leaf until the walk terminates on an adjacency. Thus when a packet
   arrives, the chain is walked as follows:

   1. Look up the leaf based on the destination address or the label at
      the top of the packet.

   2. Retrieve the parent pathlist of the leaf.

   3. Pick the outgoing path "Pi" from the list of resolved paths in
      the pathlist. The method by which the outgoing path is picked is
      beyond the scope of this document (e.g. flow-preserving hash
      exploiting entropy within the MPLS stack and IP header). Let the
      "path-index" of the outgoing path "Pi" be "j".

   4. If the prefix is labeled, use the "path-index" "j" to retrieve
      the jth label "Lj" stored in the jth entry of the OutLabel-List
      and apply the label action of the label on the packet (e.g. for
      a VPN label on the ingress PE, the label action is "push"). As
      mentioned in Section 1.1, the value of the "path-index" stored
      in a path is not necessarily the same as the position of the
      path in the pathlist.

   5. Move to the parent of the chosen path "Pi".

   6. If the chosen path "Pi" is recursive, move to its parent prefix
      and go to step 2.

   7. If the chosen path is non-recursive, move to its parent
      adjacency. Otherwise go to the next step.

   8. Encapsulate the packet in the layer 2 encapsulation specified by
      the adjacency and send the packet out.

   Let's apply the above forwarding steps to the forwarding chain
   depicted in Figure 2 in Section 2.2. Suppose a packet arrives at the
   ingress PE iPE from an external neighbor. Assume the packet matches
   the VPN prefix VPN-IP1. While walking the forwarding chain, the
   forwarding engine applies a hashing algorithm to choose the path;
   suppose the hashing at the BGP level yields path 0 while the hashing
   at the IGP level yields path 1. In that case, the packet will be
   sent out of interface I2 with the label stack "IGP-L12,VPN-L11".

5. Handling Platforms with Limited Levels of Hierarchy

   This section describes the construction of the forwarding chain if a
   platform does not support the number of recursion levels required to
   resolve the NLRIs. There are two main design objectives.

   o  Being able to reduce the number of hierarchical levels from any
      arbitrary value to a smaller arbitrary value that can be
      supported by the forwarding engine.

   o  Minimal modifications to the forwarding algorithm due to such
      reduction.

5.1. Flattening the Forwarding Chain

   Let's consider a pathlist associated with the leaf "R1" consisting
   of the list of paths <P1, P2,..., Pn>. Assume that the leaf "R1" has
   an OutLabel-list <L1, L2,..., Ln>. Suppose the path Pi is a
   recursive path that resolves via a prefix represented by the leaf
   "R2". The leaf "R2" itself is pointing to a pathlist consisting of
   the paths <Q1, Q2,..., Qm>.

   If the platform supports the number of hierarchy levels of the
   forwarding chain, then a packet that uses the path "Pi" will be
   forwarded as follows:

   1. The forwarding engine is now at leaf "R1".

   2. So it moves to its parent pathlist, which contains the list
      <P1, P2,..., Pn>.

   3. The forwarding engine applies a hashing algorithm and picks the
      path "Pi". So now the forwarding engine is at the path "Pi".

   4. The forwarding engine retrieves the label "Li" from the
      OutLabel-list attached to the leaf "R1" and applies the label
      action.

   5. The path "Pi" uses the leaf "R2".

   6. The forwarding engine walks forward to the leaf "R2" for
      resolution.

   7. The forwarding plane performs a hash to pick a path among the
      pathlist of the leaf "R2", which is <Q1, Q2,..., Qm>.

   8. Suppose the forwarding engine picks the path "Qj".

   9. Now the forwarding engine continues the walk to the parent of
      "Qj".

   Suppose the platform cannot support the number of hierarchy levels
   in the forwarding chain. The FIB manager needs to reduce the number
   of hierarchy levels when programming the forwarding chain in the
   FIB. The idea of reducing the number of hierarchy levels is to
   "flatten" two chain levels into a single level. The "flattening"
   steps are as follows:

   1. FIB manager wants to reduce the number of levels used by "Pi" by
      1.

   2. FIB manager walks to the parent of "Pi", which is the leaf "R2".

   3. FIB manager extracts the parent pathlist of the leaf "R2", which
      is <Q1, Q2,..., Qm>.

   4. FIB manager also extracts the OutLabel-list(R2) associated with
      the leaf "R2". Remember that OutLabel-list(R2) =
      <L'1, L'2,..., L'm>.

   5. FIB manager replaces the path "Pi" with the list of paths
      <Q1, Q2,..., Qm>.

   6. Hence the pathlist <P1, P2,..., Pn> now becomes
      <P1, P2,..., Pi-1, Q1, Q2,..., Qm, Pi+1,..., Pn>.

   7. The path index stored inside the locations "Q1", "Q2", ..., "Qm"
      must all be "i" because the index "i" refers to the label "Li"
      associated with leaf "R1".

   8. FIB manager attaches an OutLabel-list with the new pathlist as
      follows: <Unlabeled,..., Unlabeled, L'1, L'2,..., L'm,
      Unlabeled,..., Unlabeled>. The size of the label list associated
      with the flattened pathlist equals the size of the pathlist.
      Thus there is a 1-1 mapping between every path in the
      "flattened" pathlist and the OutLabel-list associated with it.

   Note that the labels in the OutLabel-list associated with the
   "flattened" pathlist may be stored in the same memory location as
   the path itself to avoid an additional memory access. But that is an
   implementation detail that is beyond the scope of this document.

   The same steps can be applied to all paths in the pathlist
   <P1, P2,..., Pn> so that all paths are "flattened", thereby reducing
   the number of hierarchical levels by one. Note that "flattening" a
   pathlist pulls in all paths of the parent paths, a desired feature
   to utilize all ECMP/UCMP paths at all levels. A platform that has a
   limit on the number of paths in a pathlist for any given leaf may
   choose to reduce the number of paths using methods that are beyond
   the scope of this document.

   The steps can be recursively applied to other paths at the same
   level or other levels to recursively reduce the number of
   hierarchical levels to an arbitrary value so as to accommodate the
   capability of the forwarding engine.

   Because a flattened pathlist may have an associated OutLabel-list,
   the forwarding behavior has to be slightly modified. The
   modification is done by adding the following step right after step 4
   in Section 4.

   5. If there is an OutLabel-list associated with the pathlist, then
      if the path "Pi" is chosen by the hashing algorithm, retrieve the
      label at location "i" in that OutLabel-list and apply the label
      action of that label on the packet.

   In the next subsection, we apply the steps in this subsection to an
   example scenario.

5.2. Example: Flattening a forwarding chain

   This example uses a case of inter-AS option C [9] where there are 3
   levels of hierarchy. Figure 4 illustrates the sample topology.
   To force 3 levels of hierarchy, the ASBRs [9] on the ingress domain
   (domain 1) advertise the core routers of the egress domain (domain
   2) to the ingress PE (iPE) via BGP-LU [3] instead of redistributing
   them into the IGP of domain 1. The end result is that the ingress
   PE (iPE) has 2 levels of recursion for the VPN prefixes VPN-IP1 and
   VPN-IP2.

       Domain 1                 Domain 2
   +-------------+          +-------------+
   |             |          |             |
   | LDP/SR Core |          | LDP/SR core |
   |             |          |             |
   |     (192.0.2.4)        |             |
   |         ASBR11-------ASBR21........ePE1(192.0.2.1)
   |             | \      / |   .      .  |\
   |             |  \    /  |    .    .   | \
   |             |   \  /   |     .  .    |  \
   |             |    \/    |      ..     |   \VPN-IP1(198.51.100.0/24)
   |             |    /\    |     .  .    |   /VRF "Blue" ASn: 65000
   |             |   /  \   |    .    .   |  /
   |             |  /    \  |   .      .  | /
   |             | /      \ |  .        . |/
  iPE        ASBR12-------ASBR22........ePE2 (192.0.2.2)
   |     (192.0.2.5)        |             |\
   |             |          |             | \
   |             |          |             |  \
   |             |          |             |   \VRF "Blue" ASn: 65000
   |             |          |             |   /VPN-IP2(203.0.113.0/24)
   |             |          |             |  /
   |             |          |             | /
   |             |          |             |/
   |         ASBR13-------ASBR23........ePE3(192.0.2.3)
   |     (192.0.2.6)        |             |
   |             |          |             |
   |             |          |             |
   +-------------+          +-------------+
   <===========   <=========   <============
   Advertise ePEx  Advertise    Redistribute
   Using iBGP-LU   ePEx Using   IGP into
                   eBGP-LU      BGP

            Figure 4 : Sample 3-level hierarchy topology

   We will make the following assumptions about connectivity:

   o In "domain 2", both ASBR21 and ASBR22 can reach both ePE1 and
     ePE2 using the same distance.

   o In "domain 2", only ASBR23 can reach ePE3.

   o In "domain 1", iPE (the ingress PE) can reach ASBR11, ASBR12, and
     ASBR13 via IGP using the same distance.

   We will make the following assumptions about the labels:

   o The VPN labels advertised by ePE1 and ePE2 for prefix VPN-IP1 are
     VPN-L11 and VPN-L21, respectively.

   o The VPN labels advertised by ePE2 and ePE3 for prefix VPN-IP2 are
     VPN-L22 and VPN-L32, respectively.

   o The labels advertised by ASBR11 to iPE using BGP-LU [3] for the
     egress PEs ePE1 and ePE2 are LASBR111(ePE1) and LASBR112(ePE2),
     respectively.

   o The labels advertised by ASBR12 to iPE using BGP-LU [3] for the
     egress PEs ePE1 and ePE2 are LASBR121(ePE1) and LASBR122(ePE2),
     respectively.

   o The label advertised by ASBR13 to iPE using BGP-LU [3] for the
     egress PE ePE3 is LASBR13(ePE3).

   o The IGP labels advertised by the next hops directly connected to
     iPE towards ASBR11, ASBR12, and ASBR13 in the core of domain 1
     are IGP-L11, IGP-L12, and IGP-L13, respectively.

   o Both routers ASBR21 and ASBR22 of Domain 2 advertise the same
     labels, LASBR21 and LASBR22, for the egress PEs ePE1 and ePE2,
     respectively, to the routers ASBR11 and ASBR12 of Domain 1.

   o The router ASBR23 of Domain 2 advertises the label LASBR23 for
     the egress PE ePE3 to the router ASBR13 of Domain 1.

   Based on these connectivity assumptions and the topology in Figure
   4, the routing table on iPE is

      65000: 198.51.100.0/24
         via ePE1 (192.0.2.1), VPN Label: VPN-L11
         via ePE2 (192.0.2.2), VPN Label: VPN-L21
      65000: 203.0.113.0/24
         via ePE2 (192.0.2.2), VPN Label: VPN-L22
         via ePE3 (192.0.2.3), VPN Label: VPN-L32

      192.0.2.1/32 (ePE1)
         Via ASBR11, BGP-LU Label: LASBR111(ePE1)
         Via ASBR12, BGP-LU Label: LASBR121(ePE1)
      192.0.2.2/32 (ePE2)
         Via ASBR11, BGP-LU Label: LASBR112(ePE2)
         Via ASBR12, BGP-LU Label: LASBR122(ePE2)
      192.0.2.3/32 (ePE3)
         Via ASBR13, BGP-LU Label: LASBR13(ePE3)

      192.0.2.4/32 (ASBR11)
         via Core, Label: IGP-L11
      192.0.2.5/32 (ASBR12)
         via Core, Label: IGP-L12
      192.0.2.6/32 (ASBR13)
         via Core, Label: IGP-L13

   The diagram in Figure 5 illustrates the forwarding chain in iPE
   assuming that the forwarding hardware in iPE supports 3 levels of
   hierarchy.
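
   As a rough illustration of the three-level recursion implied by the
   routing table on iPE, the following Python sketch resolves a VPN
   prefix down to a label stack. It is only a sketch under stated
   assumptions: the dictionary and function names are hypothetical,
   and the hashing decision at each level is replaced by an explicit
   path index supplied by the caller.

```python
# Hypothetical model of the iPE routing table; names are illustrative only.
# Level 1: VPN prefix -> list of (egress PE, VPN label)
VPN_ROUTES = {
    "VPN-IP1": [("ePE1", "VPN-L11"), ("ePE2", "VPN-L21")],
    "VPN-IP2": [("ePE2", "VPN-L22"), ("ePE3", "VPN-L32")],
}
# Level 2: egress PE -> list of (ASBR, BGP-LU label)
BGP_LU_ROUTES = {
    "ePE1": [("ASBR11", "LASBR111(ePE1)"), ("ASBR12", "LASBR121(ePE1)")],
    "ePE2": [("ASBR11", "LASBR112(ePE2)"), ("ASBR12", "LASBR122(ePE2)")],
    "ePE3": [("ASBR13", "LASBR13(ePE3)")],
}
# Level 3: ASBR -> IGP label towards it
IGP_ROUTES = {"ASBR11": "IGP-L11", "ASBR12": "IGP-L12", "ASBR13": "IGP-L13"}


def label_stack(prefix, vpn_choice, lu_choice):
    """Return the label stack, top label first, for the chosen paths.

    vpn_choice and lu_choice stand in for the hash result at the first
    and second level of the hierarchy (an assumption of this sketch).
    """
    epe, vpn_label = VPN_ROUTES[prefix][vpn_choice]
    asbr, lu_label = BGP_LU_ROUTES[epe][lu_choice]
    return [IGP_ROUTES[asbr], lu_label, vpn_label]


# VPN-IP2 via ePE2 (first VPN path) and ASBR12 (second BGP-LU path).
stack = label_stack("VPN-IP2", vpn_choice=0, lu_choice=1)
print(stack)
```

   Each level contributes exactly one label, which is why a chain with
   3 levels of hierarchy yields a stack of three labels on iPE.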
The leaves corresponding to the ASBRs on domain 1 790 (ASBR11, ASBR12, and ASBR13) are at the bottom of the hierarchy. 791 There are few important points: 793 o Because the hardware supports the required depth of hierarchy, 794 the sizes of a pathlist equal the size of the label list 795 associated with the leaves using this pathlist. 797 o The index inside the pathlist entry indicates the label that will 798 be picked from the OutLabel-List associated with the child leaf 799 if that path is chosen by the forwarding engine hashing function. 801 OutLabel-List OutLabel-List 802 For VPN-IP1 For VPN-IP2 803 +------------+ +--------+ +-------+ +------------+ 804 | VPN-L11 |<---| VPN-IP1| |VPN-IP2|-->| VPN-L22 | 805 +------------+ +---+----+ +---+---+ +------------+ 806 | VPN-L21 | | | | VPN-L32 | 807 +------------+ | | +------------+ 808 | | 809 V V 810 +---+---+ +---+---+ 811 | 0 | 1 | | 0 | 1 | 812 +-|-+-\-+ +-/-+-\-+ 813 | \ / \ 814 | \ / \ 815 | \ / \ 816 | \ / \ 817 v \ / \ 818 +-----+ +-----+ +-----+ 819 +----+ ePE1| |ePE2 +-----+ | ePE3+-----+ 820 | +--+--+ +-----+ | +--+--+ | 821 v | / v | v 822 +--------------+ | / +--------------+ | +-------------+ 823 |LASBR111(ePE1)| | / |LASBR112(ePE2)| | |LASBR13(ePE3)| 824 +--------------+ | / +--------------+ | +-------------+ 825 |LASBR121(ePE1)| | / |LASBR122(ePE2)| | OutLabel-List 826 +--------------+ | / +--------------+ | For ePE3 827 OutLabel-List | / OutLabel-List | 828 For ePE1 | / For ePE2 | 829 | / | 830 | / | 831 | / | 832 v / v 833 +---+---+ Shared Pathlist +---+ Pathlist 834 | 0 | 1 | For ePE1 and ePE2 | 0 | For ePE3 835 +-|-+-\-+ +-|-+ 836 | \ | 837 | \ | 838 | \ | 839 | \ | 840 v \ v 841 +------+ +------+ +------+ 842 +---+ASBR11| |ASBR12+--+ |ASBR13+---+ 843 | +------+ +------+ | +------+ | 844 v v v 845 +-------+ +-------+ +-------+ 846 |IGP-L11| |IGP-L12| |IGP-L13| 847 +-------+ +-------+ +-------+ 849 Figure 5 : Forwarding Chain for hardware supporting 3 Levels 851 Now suppose the hardware on iPE (the 
ingress PE) supports 2 levels 852 of hierarchy only. In that case, the 3-levels forwarding chain in 853 Figure 5 needs to be "flattened" into 2 levels only. 855 OutLabel-List OutLabel-List 856 For VPN-IP1 For VPN-IP2 857 +------------+ +-------+ +-------+ +------------+ 858 | VPN-L11 |<---|VPN-IP1| | VPN-IP2|--->| VPN-L22 | 859 +------------+ +---+---+ +---+---+ +------------+ 860 | VPN-L21 | | | | VPN-L32 | 861 +------------+ | | +------------+ 862 | | 863 | | 864 | | 865 Flattened | | Flattened 866 pathlist V V pathlist 867 +===+===+ +===+===+===+ +==============+ 868 +--------+ 0 | 1 | | 0 | 0 | 1 +---->|LASBR112(ePE2)| 869 | +=|=+=\=+ +=/=+=/=+=\=+ +==============+ 870 v | \ / / \ |LASBR122(ePE2)| 871 +==============+ | \ +-----+ / \ +==============+ 872 |LASBR111(ePE1)| | \/ / \ |LASBR13(ePE3) | 873 +==============+ | /\ / \ +==============+ 874 |LASBR121(ePE1)| | / \ / \ 875 +==============+ | / \ / \ 876 | / \ / \ 877 | / + + \ 878 | + | | \ 879 | | | | \ 880 v v v v \ 881 +------+ +------+ +------+ 882 +----|ASBR11| |ASBR12+---+ |ASBR13+---+ 883 | +------+ +------+ | +------+ | 884 v v v 885 +-------+ +-------+ +-------+ 886 |IGP-L11| |IGP-L12| |IGP-L13| 887 +-------+ +-------+ +-------+ 889 Figure 6 : Flattening 3 levels to 2 levels of Hierarchy on iPE 891 Figure 6 represents one way to "flatten" a 3 levels hierarchy into 892 two levels. There are few important points: 894 o As mentioned in Section 5.1 a flattened pathlist may have label 895 lists associated with them. The size of the label list associated 896 with a flattened pathlist equals the size of the pathlist. Hence 897 it is possible that an implementation includes these label lists 898 in the flattened pathlist itself. 900 o Again as mentioned in Section 5.1, the size of a flattened 901 pathlist may not be equal to the size of the OutLabel-lists of 902 leaves using the flattened pathlist. 
     So the indices inside a flattened pathlist still indicate the
     label index in the OutLabel-Lists of the leaves using that
     pathlist. Because the size of the flattened pathlist may be
     different from the size of the OutLabel-lists of the leaves, the
     indices may be repeated.

   o Let's take a look at the flattened pathlist used by the prefix
     "VPN-IP2". The pathlist associated with the prefix "VPN-IP2" has
     three entries.

      o The first and second entries have index "0". This is because
        both entries correspond to ePE2. Thus when the hashing
        performed by the forwarding engine results in using the first
        or the second entry in the pathlist, the forwarding engine
        will pick the correct VPN label "VPN-L22", which is the label
        advertised by ePE2 for the prefix "VPN-IP2".

      o The third entry has the index "1". This is because the third
        entry corresponds to ePE3. Thus when the hashing performed by
        the forwarding engine results in using the third entry in the
        flattened pathlist, the forwarding engine will pick the
        correct VPN label "VPN-L32", which is the label advertised by
        "ePE3" for the prefix "VPN-IP2".

   Now let's apply the forwarding steps in Section 4 together with the
   additional step in Section 5.1 to the flattened forwarding chain
   illustrated in Figure 6.

   o Suppose a packet arrives at "iPE" and matches the VPN prefix
     "VPN-IP2".

   o The forwarding engine walks to the parent of "VPN-IP2", which is
     the flattened pathlist, and applies a hashing algorithm to pick
     a path.

   o Suppose the hashing by the forwarding engine picks the second
     path in the flattened pathlist associated with the leaf
     "VPN-IP2".

   o Because the second path has the index "0", the label "VPN-L22" is
     pushed on the packet.

   o Next the forwarding engine picks the second label from the
     OutLabel-List associated with the flattened pathlist, resulting
     in "LASBR122(ePE2)" being the next pushed label.

   o The forwarding engine now moves to the parent of the flattened
     pathlist corresponding to the second path. The parent is the IGP
     label leaf corresponding to "ASBR12".

   o So the packet is forwarded towards the ASBR "ASBR12" and the IGP
     label at the top will be "IGP-L12".

   Based on the above steps, a packet arriving at iPE and destined to
   the prefix VPN-IP2 reaches its destination as follows:

   o iPE sends the packet along the shortest path towards ASBR12 with
     the following label stack starting from the top: {IGP-L12,
     LASBR122(ePE2), VPN-L22}.

   o The penultimate hop of ASBR12 pops the top label "IGP-L12". Hence
     the packet arrives at ASBR12 with the remaining label stack
     {LASBR122(ePE2), VPN-L22} where "LASBR122(ePE2)" is the top
     label.

   o ASBR12 swaps "LASBR122(ePE2)" with the label "LASBR22(ePE2)",
     which is the label advertised by ASBR22 for ePE2 (the egress
     PE).

   o ASBR22 receives the packet with "LASBR22(ePE2)" at the top.

   o Hence ASBR22 swaps "LASBR22(ePE2)" with the IGP label for ePE2
     advertised by the next-hop towards ePE2 in domain 2, and sends
     the packet along the shortest path towards ePE2.

   o The penultimate hop of ePE2 pops the top label. Hence ePE2
     receives the packet with VPN-L22 as the top label.

   o ePE2 pops "VPN-L22" and sends the packet as a pure IP packet
     towards the destination VPN-IP2.

6. Forwarding Chain Adjustment at a Failure

   The hierarchical and shared structure of the forwarding chain
   explained in the previous section allows modifying a small number of
   forwarding chain objects to re-route traffic to a pre-calculated
   equal-cost or backup path without the need to modify the possibly
   very large number of BGP prefixes.
   In this section, we go over various core and edge failure scenarios
   to illustrate how FIB manager can utilize the forwarding chain
   structure to achieve BGP prefix independent convergence.

6.1. BGP-PIC core

   This section describes the adjustments to the forwarding chain when
   a core link or node fails but the BGP next-hop remains reachable.

   There are two cases: remote link failure and attached link failure.
   Node failures are treated as link failures.

   When a remote link or node fails, IGP on the ingress PE receives an
   advertisement indicating a topology change, so IGP re-converges to
   either find a new next-hop and/or outgoing interface or remove the
   path completely from the IGP prefix used to resolve BGP next-hops.
   IGP and/or LDP download the modified IGP leaves with modified
   outgoing labels for labeled core.

   When a local link fails, FIB manager detects the failure almost
   immediately. The FIB manager marks the impacted path(s) as unusable
   so that only usable paths are used to forward packets. Hence only
   IGP pathlists with paths using the failed local link need to be
   modified. All other pathlists are not impacted. Note that in this
   particular case there is actually no need even to backwalk to IGP
   leaves to adjust the OutLabel-Lists because FIB can rely on the
   path-index stored in the usable paths in the pathlist to pick the
   right label.

   Note that because FIB manager modifies the forwarding chain
   starting from the IGP leaves only, BGP pathlists and leaves are not
   modified. Hence traffic restoration occurs within the time frame of
   IGP convergence, and, for local link failure, assuming a backup
   path has been precomputed, within the timeframe of local detection
   (e.g. 50ms).
   Examples of solutions that pre-compute backup paths are IP FRR
   [16], remote LFA [17], Ti-LFA [15], and MRT [18], or an eBGP path
   having a backup path [11].

   Let's apply the procedure mentioned in this subsection to the
   forwarding chain depicted in Figure 2. Suppose a remote link failure
   occurs and impacts the first ECMP IGP path to the remote BGP
   next-hop. Upon IGP convergence, the IGP pathlist used by the BGP
   next-hop is updated to reflect the new topology (one path instead
   of two). As soon as the IGP convergence is effective for the BGP
   next-hop entry, the new forwarding state is immediately available
   to all dependent BGP prefixes. The same behavior would occur if the
   failure was local, such as an interface going down. As soon as the
   IGP convergence is complete for the BGP next-hop IGP route, all its
   dependent BGP routes benefit from the new path. In fact, if LFA
   protection is enabled for the IGP route to the BGP next-hop and a
   backup path was pre-computed and installed in the pathlist, then
   upon the local interface failure the LFA backup path is immediately
   activated (e.g. sub-50msec), and thus protection benefits all the
   dependent BGP traffic through the hierarchical forwarding
   dependency between the routes.

6.2. BGP-PIC edge

   This section describes the adjustments to the forwarding chains as a
   result of edge node or edge link failure.

6.2.1. Adjusting forwarding Chain in egress node failure

   When an edge node fails, IGP on neighboring core nodes sends route
   updates indicating that the edge node is no longer reachable. IGP
   running on the iPE instructs FIB to remove the IP and label leaves
   corresponding to the failed edge node from FIB.
   So FIB manager performs the following steps:

   o FIB manager deletes the IGP leaf corresponding to the failed edge
     node.

   o FIB manager backwalks to all dependent BGP pathlists and marks
     the path using the deleted IGP leaf as unresolved.

   o Note that there is no need to modify the possibly large number of
     BGP leaves because each path in the pathlist carries its path
     index and hence the correct outgoing label will be picked.
     Consider for example the forwarding chain depicted in Figure 2.
     If the 1st BGP path becomes unresolved, then the forwarding
     engine will only use the second path for forwarding. Yet the path
     index of that single resolved path will still be 1 and hence the
     label VPN-L21 will be pushed.

6.2.2. Adjusting Forwarding Chain on PE-CE link Failure

   Suppose the link between an edge router and its external peer fails.
   There are two scenarios: (1) the edge node attached to the failed
   link performs next-hop self (where BGP advertises the IP address of
   its own loopback as next-hop) and (2) the edge node attached to the
   failure advertises the IP address of the failed link as the next-hop
   attribute to its iBGP peers.

   In the first case, the rest of the iBGP peers will remain unaware of
   the link failure and will continue to forward traffic to the edge
   node until the edge node attached to the failed link withdraws the
   BGP prefixes. If the destination prefixes are multi-homed to another
   iBGP peer, say ePE2, then FIB manager on the edge router detecting
   the link failure applies the following steps (see Figure 3):

   o FIB manager backwalks to the BGP pathlists and marks the path
     through the failed link to the external peer as unresolved.

   o Hence traffic will be forwarded using the backup path towards
     ePE2.

   o For labeled traffic:

      o The OutLabel-List attached to the BGP leaf already contains
        an entry corresponding to the backup path.

      o The label entry in the OutLabel-List corresponding to the
        internal path to the backup egress PE has a swap action to
        the label advertised by the backup egress PE.

      o For an arriving labeled packet (e.g. VPN), the top label is
        swapped with the label advertised by the backup egress PE and
        the packet is sent towards that backup egress PE.

   o For unlabeled traffic, packets are simply redirected towards the
     backup egress PE.

   In the second case, where the edge router uses the IP address of the
   failed link as the BGP next-hop, the edge router will still perform
   the previous steps. But, unlike the case of next-hop self, IGP on
   the failed edge node informs the rest of the iBGP peers that the IP
   address of the failed link is no longer reachable. Hence the FIB
   manager on the iBGP peers will delete the IGP leaf corresponding to
   the IP prefix of the failed link. The behavior of the iBGP peers
   will be identical to the case of edge node failure outlined in
   Section 6.2.1.

   Note that because the edge link failure is local to the edge
   router, sub-50 msec convergence can be achieved as described in
   [11].

   Let's apply the case of next-hop self to the forwarding chain
   depicted in Figure 3. After failure of the link between ePE1 and CE,
   the forwarding engine will route traffic arriving from the core
   towards VPN-NH2 with path-index=1. A packet arriving from the core
   will contain the label VPN-L11 at the top. The label VPN-L11 is
   swapped with the label VPN-L21 and the packet is forwarded towards
   ePE2.

6.3. Handling Failures for Flattened Forwarding Chains

   As explained in Section 5, if a platform cannot support the native
   number of hierarchy levels of a recursive forwarding chain, the
   instantiated forwarding chain is constructed by flattening two or
   more levels. Hence the 3-level chain in Figure 5 is flattened into
   the 2-level chain in Figure 6.

   While reducing the benefits of BGP-PIC, flattening one hierarchy
   into a shallower hierarchy does not always result in a complete loss
   of the benefits of BGP-PIC. To illustrate this fact, suppose ASBR12
   is no longer reachable in domain 1. If the platform supports the
   full hierarchy depth, the forwarding chain is the one depicted in
   Figure 5 and hence the FIB manager needs to backwalk one level to
   the pathlist shared by "ePE1" and "ePE2" and adjust it. If the
   platform supports 2 levels of hierarchy, then a usable forwarding
   chain is the one depicted in Figure 6. In that case, if ASBR12 is no
   longer reachable, the FIB manager has to backwalk to the two
   flattened pathlists and update both of them.

   The main observation is that the loss of convergence speed due to
   the loss of hierarchy depth depends on the structure of the
   forwarding chain itself. To illustrate this fact, let's take two
   extremes. Suppose the forwarding objects in level i+1 depend on the
   forwarding objects in level i. If every object on level i+1 depends
   on a separate object in level i, then flattening level i into level
   i+1 will not result in loss of convergence speed. Now let's take the
   other extreme. Suppose "n" objects in level i+1 depend on 1 object
   in level i. Now suppose FIB flattens level i into level i+1.
   If a topology change results in modifying the single object in
   level i, then FIB has to backwalk and modify "n" objects in the
   flattened level, thereby losing all the benefit of BGP-PIC.
   Experience shows that flattening forwarding chains usually results
   in moderate loss of BGP-PIC benefits. Further analysis is needed to
   corroborate and quantify this statement.

7. Properties

7.1. Coverage

   All the possible failures, except CE node failure, are covered,
   whether they impact a local or remote IGP path or a local or remote
   BGP next-hop as described in Section 6. This section provides
   details for each failure and how the hierarchical and shared FIB
   structure proposed in this document allows recovery that does not
   depend on the number of BGP prefixes.

7.1.1. A remote failure on the path to a BGP next-hop

   Upon IGP convergence, the IGP leaf for the BGP next-hop is updated
   and all the dependent BGP routes leverage the new IGP forwarding
   state immediately. Details of this behavior can be found in Section
   6.1.

   This BGP resiliency property only depends on IGP convergence and is
   independent of the number of BGP prefixes impacted.

7.1.2. A local failure on the path to a BGP next-hop

   Upon LFA protection, the IGP leaf for the BGP next-hop is updated to
   use the precomputed LFA backup path and all the dependent BGP routes
   leverage this LFA protection. Details of this behavior can be found
   in Section 6.1.

   This BGP resiliency property only depends on LFA protection and is
   independent of the number of BGP prefixes impacted.

7.1.3. A remote iBGP next-hop fails

   Upon IGP convergence, the IGP leaf for the BGP next-hop is deleted
   and all the depending BGP Path-Lists are updated to either use the
   remaining ECMP BGP best-paths or, if none remains available, to
   activate precomputed backups.
   Details about this behavior can be found in Section 6.2.1.

   This BGP resiliency property only depends on IGP convergence and is
   independent of the number of BGP prefixes impacted.

7.1.4. A local eBGP next-hop fails

   Upon local link failure detection, the adjacency to the BGP next-hop
   is deleted and all the depending BGP pathlists are updated to either
   use the remaining ECMP BGP best-paths or, if none remains available,
   to activate precomputed backups. Details about this behavior can be
   found in Section 6.2.2.

   This BGP resiliency property only depends on local link failure
   detection and is independent of the number of BGP prefixes impacted.

7.2. Performance

   When the failure is local (a local IGP next-hop failure or a local
   eBGP next-hop failure), a pre-computed and pre-installed backup is
   activated by a local-protection mechanism that does not depend on
   the number of BGP destinations impacted by the failure. Sub-50msec
   restoration is thus possible even if millions of BGP routes are
   impacted.

   When the failure is remote (a remote IGP failure not impacting the
   BGP next-hop or a remote BGP next-hop failure), an alternate path is
   activated upon IGP convergence. All the impacted BGP destinations
   benefit from a working alternate path as soon as the IGP convergence
   occurs for their impacted BGP next-hop even if millions of BGP
   routes are impacted.

   Appendix A puts the BGP-PIC benefits in perspective by providing
   some results using actual numbers.

7.3. Automated

   The BGP-PIC solution does not require any operator involvement. The
   process is entirely automated as part of the FIB implementation.

   The salient points enabling this automation are:

   o Extension of the BGP Best Path to compute more than one primary
     ([12] and [13]) or backup BGP next-hop ([7] and [14]).

   o Sharing of BGP Path-list across BGP destinations with the same
     primary and backup BGP next-hop.

   o Hierarchical indirection and dependency between BGP pathlist and
     IGP pathlist.

7.4. Incremental Deployment

   As soon as one router supports the BGP-PIC solution, it gains all
   of its benefits (most notably convergence that does not depend on
   the number of prefixes) without any requirement for other routers
   to support BGP-PIC.

8. Security Considerations

   The behavior described in this document is internal functionality
   to a router that results in significant improvement to convergence
   time as well as reduction in CPU and memory used by FIB while not
   showing change in basic routing and forwarding functionality. As
   such, no additional security risk is introduced by using the
   mechanisms proposed in this document.

9. IANA Considerations

   No requirements for IANA.

10. Conclusions

   This document proposes a hierarchical and shared forwarding chain
   structure that allows achieving BGP prefix independent
   convergence, and in the case of locally detected failures, sub-50
   msec convergence. A router can construct the forwarding chains in
   a completely transparent manner with zero operator intervention,
   thereby supporting smooth and incremental deployment.

11. References

11.1. Normative References

   [1]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway Protocol
        4 (BGP-4)", RFC 4271, January 2006

   [2]  Bates, T., Chandra, R., Katz, D., and Rekhter Y.,
        "Multiprotocol Extensions for BGP", RFC 4760, January 2007

   [3]  E. Rosen, "Carrying Label Information in BGP-4", RFC 8277,
        October 2017

   [4]  Andersson, L., Minei, I., and B. Thomas, "LDP Specification",
        RFC 5036, October 2007

   [5]  A. Bashandy, C. Filsfils, S. Previdi, B. Decraene, S.
        Litkowski, M. Horneffer, R.
        Shakir, "Segment Routing with MPLS data plane", RFC 8660,
        December 2019

   [6]  E. Rosen, A. Viswanathan, R. Callon, "Multiprotocol Label
        Switching Architecture", RFC 3031, January 2001

11.2. Informative References

   [7]  Marques, P., Fernando, R., Chen, E, Mohapatra, P., Gredler, H.,
        "Advertisement of the best external route in BGP",
        draft-ietf-idr-best-external-05.txt, January 2012.

   [8]  Wu, J., Cui, Y., Metz, C., and E. Rosen, "Softwire Mesh
        Framework", RFC 5565, June 2009.

   [9]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
        Networks (VPNs)", RFC 4364, February 2006.

   [10] De Clercq, J., Ooms, D., Prevost, S., Le Faucheur, F.,
        "Connecting IPv6 Islands over IPv4 MPLS Using IPv6 Provider
        Edge Routers (6PE)", RFC 4798, February 2007

   [11] O. Bonaventure, C. Filsfils, and P. Francois, "Achieving
        sub-50 milliseconds recovery upon bgp peering link failures",
        IEEE/ACM Transactions on Networking, 15(5):1123-1135, 2007

   [12] D. Walton, A. Retana, E. Chen, J. Scudder, "Advertisement of
        Multiple Paths in BGP", RFC 7911, July 2016

   [13] R. Raszuk, R. Fernando, K. Patel, D. McPherson, K. Kumaki,
        "Distribution of diverse BGP paths", RFC 6774, November 2012

   [14] P. Mohapatra, R. Fernando, C. Filsfils, and R. Raszuk, "Fast
        Connectivity Restoration Using BGP Add-path",
        draft-pmohapat-idr-fast-conn-restore-03, Jan 2013

   [15] S. Litkowski, A. Bashandy, C. Filsfils, P. Francois, B.
        Decraene, D. Voyer, "Topology Independent Fast Reroute using
        Segment Routing", draft-ietf-rtgwg-segment-routing-ti-lfa-07
        (work in progress), June 2021

   [16] M. Shand and S. Bryant, "IP Fast Reroute Framework", RFC 5714,
        January 2010

   [17] S. Bryant, C. Filsfils, S. Previdi, M. Shand, N So, "Remote
        Loop-Free Alternate (LFA) Fast Reroute (FRR)", RFC 7490, April
        2015

   [18] A. Atlas, C. Bowers, G.
        Enyedi, "An Architecture for IP/LDP Fast-Reroute Using
        Maximally Redundant Trees", RFC 7812, June 2016

12. Acknowledgments

   Special thanks to Neeraj Malhotra and Yuri Tsier for the valuable
   help.

   Special thanks to Bruno Decraene, Theresa Enghardt, Ines Robles,
   and Luc Andre Burdet for the valuable comments.

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Ahmed Bashandy
   Individual Contributor
   Email: abashandy.ietf@gmail.com

   Clarence Filsfils
   Cisco Systems
   Brussels, Belgium
   Email: cfilsfil@cisco.com

   Prodosh Mohapatra
   Sproute Networks
   Email: mpradosh@yahoo.com

Appendix A. Perspective

   The following table puts the BGP-PIC benefits in perspective
   assuming

   o 1M impacted BGP prefixes

   o IGP convergence ~ 500 msec

   o local protection ~ 50msec

   o FIB Update per BGP destination ~ 100usec conservative,
                                    ~ 10usec optimistic

   o BGP Convergence per BGP destination ~ 200usec conservative,
                                         ~ 100usec optimistic

                            Without PIC        With PIC

   Local IGP Failure        10 to 100sec       50msec

   Local BGP Failure        100 to 200sec      50msec

   Remote IGP Failure       10 to 100sec       500msec

   Remote BGP Failure       100 to 200sec      500msec

   Upon local IGP next-hop failure or remote IGP next-hop failure, the
   existing primary BGP next-hop is intact and usable, hence the
   resiliency only depends on the ability of the FIB mechanism to
   reflect the new path to the BGP next-hop to the depending BGP
   destinations. Without BGP-PIC, a conservative back-of-the-envelope
   estimation for this FIB update is 100usec per BGP destination. An
   optimistic estimation is 10usec per entry.

   Upon local BGP next-hop failure or remote BGP next-hop failure,
   without the BGP-PIC mechanism, a new BGP Best-Path needs to be
   recomputed and new updates need to be sent to peers.
   This depends on BGP processing time that will be shared between
   best-path computation, RIB update and peer update. A conservative
   back-of-the-envelope estimation for this is 200usec per BGP
   destination. An optimistic estimation is 100usec per entry.
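
   As a hedged illustration of how the "Without PIC" column follows
   from the assumptions listed in this appendix, the arithmetic can be
   sketched in Python. The variable and function names are
   hypothetical. The sketch ignores failure detection and IGP
   convergence time in the "Without PIC" estimate, and it reads the
   fourth table row as the remote BGP next-hop failure case discussed
   in the text.

```python
# Assumed inputs, copied from the list of assumptions in this appendix.
PREFIXES = 1_000_000             # impacted BGP prefixes
IGP_CONVERGENCE_S = 0.5          # ~500 msec
LOCAL_PROTECTION_S = 0.05        # ~50 msec
FIB_UPDATE_US = (10, 100)        # optimistic, conservative (usec/destination)
BGP_CONVERGENCE_US = (100, 200)  # optimistic, conservative (usec/destination)


def without_pic(per_prefix_us):
    """Seconds needed when the work grows linearly with the prefix count."""
    return tuple(PREFIXES * us / 1_000_000 for us in per_prefix_us)


# failure type -> ((optimistic, conservative) without PIC, with PIC), seconds
table = {
    "Local IGP Failure": (without_pic(FIB_UPDATE_US), LOCAL_PROTECTION_S),
    "Local BGP Failure": (without_pic(BGP_CONVERGENCE_US), LOCAL_PROTECTION_S),
    "Remote IGP Failure": (without_pic(FIB_UPDATE_US), IGP_CONVERGENCE_S),
    "Remote BGP Failure": (without_pic(BGP_CONVERGENCE_US), IGP_CONVERGENCE_S),
}

for failure, ((low, high), with_pic) in table.items():
    print(f"{failure:20s} {low:.0f} to {high:.0f} sec   {with_pic * 1000:.0f} msec")
```

   The "With PIC" figures do not contain the prefix count at all,
   which is the prefix independence that the document aims for.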