Network Working Group                                   A. Bashandy, Ed.
Internet Draft                                               C. Filsfils
Intended status: Informational                              Cisco Systems
Expires: May 2017                                            P. Mohapatra
                                                         Sproute Networks
                                                         November 22, 2016

                   BGP Prefix Independent Convergence
                     draft-ietf-rtgwg-bgp-pic-03.txt

Abstract

In a network comprising thousands of iBGP peers exchanging millions of
routes, many routes are reachable via more than one next-hop. Given the
large scaling targets, it is desirable to restore traffic after a
failure in a time period that does not depend on the number of BGP
prefixes. This document proposes an architecture by which traffic can
be re-routed to ECMP or pre-calculated backup paths in a timeframe that
does not depend on the number of BGP prefixes. The objective is
achieved by organizing the forwarding data structures in a hierarchical
manner and sharing forwarding elements among the maximum possible
number of routes.
The proposed technique achieves prefix independent convergence while
ensuring incremental deployment, complete automation, and zero
management and provisioning effort. Note that the benefits of BGP-PIC
hinge on the existence of more than one path, whether as ECMP or
primary-backup.

Status of this Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November 10,
2008. The person(s) controlling the copyright in some of this material
may not have granted the IETF Trust the right to allow modifications of
such material outside the IETF Standards Process. Without obtaining an
adequate license from the person(s) controlling the copyright in such
materials, this document may not be modified outside the IETF Standards
Process, and derivative works of it may not be created outside the IETF
Standards Process, except to format it for publication as an RFC or to
translate it into languages other than English.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This Internet-Draft will expire on May 22, 2017.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents carefully,
as they describe your rights and restrictions with respect to this
document. Code Components extracted from this document must include
Simplified BSD License text as described in Section 4.e of the Trust
Legal Provisions and are provided without warranty as described in the
Simplified BSD License.

Table of Contents

1. Introduction...................................................3
   1.1. Conventions used in this document.........................4
   1.2. Terminology...............................................4
2. Overview.......................................................6
   2.1. Dependency................................................6
      2.1.1. Hierarchical Hardware FIB............................6
      2.1.2. Availability of more than one primary or secondary BGP
             next-hops............................................7
   2.2. BGP-PIC Illustration......................................7
3. Constructing the Shared Hierarchical Forwarding Chain..........9
   3.1. Constructing the BGP-PIC forwarding Chain.................9
   3.2. Example: Primary-Backup Path Scenario....................10
4.
Forwarding Behavior...........................................11 96 5. Handling Platforms with Limited Levels of Hierarchy...........12 97 5.1. Flattening the Forwarding Chain..........................12 98 5.2. Example: Flattening a forwarding chain...................14 99 6. Forwarding Chain Adjustment at a Failure......................21 100 6.1. BGP-PIC core.............................................22 101 6.2. BGP-PIC edge.............................................23 102 6.2.1. Adjusting forwarding Chain in egress node failure...23 103 6.2.2. Adjusting Forwarding Chain on PE-CE link Failure....23 104 6.3. Handling Failures for Flattened Forwarding Chains........24 105 7. Properties....................................................25 106 7.1. Coverage.................................................25 107 7.1.1. A remote failure on the path to a BGP next-hop......25 108 7.1.2. A local failure on the path to a BGP next-hop.......25 109 7.1.3. A remote iBGP next-hop fails........................26 110 7.1.4. A local eBGP next-hop fails.........................26 111 7.2. Performance..............................................26 112 7.3. Automated................................................27 113 7.4. Incremental Deployment...................................27 114 8. Security Considerations.......................................27 115 9. IANA Considerations...........................................27 116 10. Conclusions..................................................27 117 11. References...................................................28 118 11.1. Normative References....................................28 119 11.2. Informative References..................................28 120 12. Acknowledgments..............................................29 121 Appendix A. Perspective..........................................30 123 1. Introduction 125 As a path vector protocol, BGP propagates reachability serially. 126 Hence BGP convergence speed is limited by the time taken to 127 serially propagate reachability information from the point of 128 failure to the device that must re-converge. BGP speakers exchange 129 reachability information about prefixes[2][3] and, for labeled 130 address families, namely AFI/SAFI 1/4, 2/4, 1/128, and 2/128, an 131 edge router assigns local labels to prefixes and associates the 132 local label with each advertised prefix such as L3VPN [8], 6PE 133 [9], and Softwire [7] using BGP label unicast technique[4]. A BGP 134 speaker then applies the path selection steps to choose the best 135 path. In modern networks, it is not uncommon to have a prefix 136 reachable via multiple edge routers. In addition to proprietary 137 techniques, multiple techniques have been proposed to allow for 138 BGP to advertise more than one path for a given prefix 139 [6][11][12], whether in the form of equal cost multipath or 140 primary-backup. Another common and widely deployed scenario is 141 L3VPN with multi-homed VPN sites with unique Route Distinguisher. 142 It is advantageous to utilize the commonality among paths used by 143 NLRIs to significantly improve convergence in case of topology 144 modifications. 146 This document proposes a hierarchical and shared forwarding chain 147 organization that allows traffic to be restored to pre-calculated 148 alternative equal cost primary path or backup path in a time 149 period that does not depend on the number of BGP prefixes. 
The technique relies on internal router behavior that is completely
transparent to the operator and can be incrementally deployed and
enabled with zero operator intervention.

1.1. Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [1].

In this document, these words will appear with that interpretation only
when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC 2119 significance.

1.2. Terminology

This section defines the terms used in this document. For ease of use,
we will use terms similar to those used by L3VPN [8].

o BGP prefix: A prefix P/m (of any AFI/SAFI) that a BGP speaker has a
  path for.

o IGP prefix: A prefix P/m (of any AFI/SAFI) that is learnt via an
  Interior Gateway Protocol, such as OSPF or IS-IS. The prefix may be
  learnt directly through the IGP or redistributed from other
  protocol(s).

o CE: An external router through which an egress PE can reach a prefix
  P/m.

o Ingress PE, "iPE": A BGP speaker that learns about a prefix through
  an iBGP peer and chooses an egress PE as the next-hop for the prefix.

o Path: The next-hop in a sequence of nodes starting from the current
  node and ending with the destination node or network identified by
  the prefix. The nodes may not be directly connected.

o Recursive path: A path consisting only of the IP address of the
  next-hop without the outgoing interface. Subsequent lookups are
  necessary to determine the outgoing interface and a directly
  connected next-hop.

o Non-recursive path: A path consisting of the IP address of a directly
  connected next-hop and an outgoing interface.

o Primary path: A recursive or non-recursive path that can be used all
  the time as long as a walk starting from this path can end at an
  adjacency. A prefix can have more than one primary path.

o Backup path: A recursive or non-recursive path that can be used only
  after some or all primary paths become unreachable.

o Leaf: A container data structure for a prefix or local label.
  Alternatively, it is the data structure that contains prefix-specific
  information.

o IP leaf: The leaf corresponding to an IPv4 or IPv6 prefix.

o Label leaf: The leaf corresponding to a locally allocated label, such
  as the VPN label on an egress PE [8].

o Pathlist: An array of paths used by one or more prefixes to forward
  traffic to the destination(s) covered by an IP prefix. Each path in
  the pathlist carries its "path-index", which identifies its position
  in the array of paths. In general, the value of the "path-index"
  stored in a path does not necessarily have the same value as the
  location of the path in the pathlist. For example, the 3rd path may
  carry a path-index value of 1.

o A pathlist may contain a mix of primary and backup paths.

o OutLabel-List: Each labeled prefix is associated with an
  OutLabel-List. The OutLabel-List is an array of one or more outgoing
  labels and/or label actions where each label or label action has a
  1-to-1 correspondence to a path in the pathlist. Label actions are:
  push the label, pop the label, swap the incoming label with the label
  in the OutLabel-List entry, or don't push anything at all in the case
  of "unlabeled".
The prefix may be an IGP or BGP prefix.

o Adjacency: The layer 2 encapsulation leading to the layer 3 directly
  connected next-hop.

o Dependency: An object X is said to be a dependent or child of object
  Y if there is at least one forwarding chain where the forwarding
  engine must visit the object X before visiting the object Y in order
  to forward a packet. Note that if object X is a child of object Y,
  then Y cannot be deleted unless object X is no longer a
  dependent/child of object Y.

o Route: A prefix with one or more paths associated with it. Hence the
  minimum set of objects needed to construct a route is a leaf and a
  pathlist.

2. Overview

The idea of BGP-PIC is based on two pillars:

o A shared hierarchical forwarding chain: It is not uncommon to see
  multiple destinations reachable via the same list of next-hops.
  Instead of having a separate list of next-hops for each destination,
  all destinations sharing the same list of next-hops can point to a
  single copy of this list, thereby allowing fast convergence by making
  changes to a single shared list of next-hops rather than to a
  possibly large number of destinations. Because paths in a pathlist
  may be recursive, a hierarchy is formed between the pathlist and the
  resolving prefix, whereby the pathlist depends on the resolving
  prefix.

o A forwarding plane that supports multiple levels of indirection: A
  forwarding chain that starts with a destination and ends with an
  outgoing interface is not a simple flat structure. Instead, a
  forwarding entry is constructed via multiple levels of dependency. A
  BGP NLRI uses a recursive next-hop, which in turn resolves via an IGP
  next-hop, which in turn resolves via an adjacency consisting of one
  or more outgoing interface(s) and next-hop(s).

Designing a forwarding plane that constructs multi-level forwarding
chains with maximal sharing of forwarding objects allows rerouting a
large number of destinations by modifying a small number of objects,
thereby achieving convergence in a time frame that does not depend on
the number of destinations. For example, if the IGP prefix that
resolves a recursive next-hop is updated, there is no need to update
the possibly large number of BGP NLRIs that use this recursive
next-hop.

2.1. Dependency

This section describes the required functionality in the forwarding and
control planes to support BGP-PIC as described in this document.

2.1.1. Hierarchical Hardware FIB

BGP PIC requires hierarchical hardware FIB support: for each BGP
forwarded packet, a BGP leaf is looked up, then a BGP pathlist is
consulted, then an IGP pathlist, then an adjacency.

An alternative method consists of "flattening" the dependencies when
programming the BGP destinations into the HW FIB, potentially
eliminating both the BGP pathlist and IGP pathlist consultation. Such
an approach decreases the number of memory lookups per forwarding
operation at the expense of HW FIB memory increase (flattening means
less sharing, hence duplication), loss of ECMP properties (flattening
means less pathlist entropy), and loss of BGP PIC properties.

2.1.2. Availability of more than one primary or secondary BGP next-hops

When the primary BGP next-hop fails, BGP PIC depends on the
availability of a pre-computed and pre-installed secondary BGP next-hop
in the BGP pathlist.
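To make the hierarchy and sharing described above concrete, the
following minimal sketch (Python, purely illustrative; the class names
Path, Pathlist, and Leaf and all values are ours and not part of any
implementation) shows a pathlist holding a primary and a pre-installed
backup BGP next-hop that is shared by two prefix leaves. Failing the
primary next-hop is a single update to the shared pathlist, independent
of the number of prefixes that depend on it.

   # Minimal sketch (illustrative only): a shared pathlist with a primary
   # and a pre-installed backup BGP next-hop.
   class Path:
       def __init__(self, next_hop, path_index, backup=False):
           self.next_hop = next_hop      # BGP next-hop (recursive)
           self.path_index = path_index  # selects the leaf's outgoing label
           self.backup = backup
           self.usable = True

   class Pathlist:
       """Shared by every leaf that uses the same set of next-hops."""
       def __init__(self, paths):
           self.paths = paths
       def usable_paths(self):
           primary = [p for p in self.paths if p.usable and not p.backup]
           return primary or [p for p in self.paths if p.usable and p.backup]

   class Leaf:
       """BGP prefix leaf pointing to a shared pathlist."""
       def __init__(self, prefix, pathlist, out_labels):
           self.prefix, self.pathlist, self.out_labels = prefix, pathlist, out_labels

   pl = Pathlist([Path("BGP-NH1", 0), Path("BGP-NH2", 1, backup=True)])
   leaves = [Leaf("VPN-IP1", pl, ["VPN-L11", "VPN-L21"]),
             Leaf("VPN-IP2", pl, ["VPN-L12", "VPN-L22"])]
   pl.paths[0].usable = False    # the primary BGP next-hop fails
   # Every dependent leaf now resolves via the backup; no per-prefix work:
   assert all(l.pathlist.usable_paths()[0].next_hop == "BGP-NH2" for l in leaves)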
The existence of a secondary next-hop is clear for the following
reason: a service caring for network availability will require two
disjoint network connections, hence two BGP next-hops.

The BGP distribution of the secondary next-hop is available thanks to
the following BGP mechanisms: Add-Path [11], BGP Best-External [6],
diverse path [12], and the frequent use in VPN deployments of different
VPN RDs per PE. Note that the availability of another BGP path does not
mean that all failure scenarios can be covered by simply forwarding
traffic to the available secondary path. The discussion of how to cover
various failure scenarios is beyond the scope of this document.

2.2. BGP-PIC Illustration

To illustrate the two pillars above as well as the platform dependency,
we will use an example of a simple multihomed L3VPN [8] prefix in a
BGP-free core running LDP [5] or segment routing over an MPLS
forwarding plane [14].

   +--------------------------------+
   |                                |
   |                              ePE1 (IGP-IP1 192.0.2.1, Loopback)
   |                                |  \
   |                                |   \
   |                                |    \
  iPE                               |     CE....VRF "Blue", ASnum 65000
   |                                |    /      (VPN-IP1 11.1.1.0/24)
   |                                |   /       (VPN-IP2 11.1.2.0/24)
   |  LDP/Segment-Routing Core      |  /
   |                              ePE2 (IGP-IP2 192.0.2.2, Loopback)
   |                                |
   +--------------------------------+
        Figure 1 VPN prefix reachable via multiple PEs

Referring to Figure 1, suppose the iPE (the ingress PE) receives NLRIs
for the VPN prefixes VPN-IP1 and VPN-IP2 from two egress PEs, ePE1 and
ePE2, with next-hops BGP-NH1 and BGP-NH2, respectively. Assume that
ePE1 advertises the VPN labels VPN-L11 and VPN-L12 while ePE2
advertises the VPN labels VPN-L21 and VPN-L22 for VPN-IP1 and VPN-IP2,
respectively. Suppose that BGP-NH1 and BGP-NH2 are resolved via the IGP
prefixes IGP-IP1 and IGP-IP2, where each happens to have 2 ECMP paths
with IGP-NH1 and IGP-NH2 reachable via the interfaces I1 and I2,
respectively. Suppose that the local labels (whether LDP [5] or segment
routing [14]) on the downstream LSRs for IGP-IP1 are IGP-L11 and
IGP-L12 while for IGP-IP2 they are IGP-L21 and IGP-L22. As such, the
routing table at iPE is as follows:

   65000:11.1.1.0/24
       via ePE1 (192.0.2.1), VPN Label: VPN-L11
       via ePE2 (192.0.2.2), VPN Label: VPN-L21

   65000:11.1.2.0/24
       via ePE1 (192.0.2.1), VPN Label: VPN-L12
       via ePE2 (192.0.2.2), VPN Label: VPN-L22

   192.0.2.1/32
       via Core, Label: IGP-L11
       via Core, Label: IGP-L12

   192.0.2.2/32
       via Core, Label: IGP-L21
       via Core, Label: IGP-L22

Based on the above routing table, a hierarchical forwarding chain can
be constructed as shown in Figure 2.

   IP Leaf:   Pathlist:                 IP Leaf:       Pathlist:
   --------   +-------+                 --------       +----------+
   VPN-IP1--->|BGP-NH1|-->IGP-IP1(BGP NH1)------------>|IGP NH1,I1|--->Adjacency1
      |       |BGP-NH2|-->....             |           |IGP NH2,I2|--->Adjacency2
      |       +-------+                    |           +----------+
      |                                    |
      v                                    v
   OutLabel-List:                       OutLabel-List:
   +----------------------+             +----------------------+
   |VPN-L11 (VPN-IP1, NH1)|             |IGP-L11 (IGP-IP1, NH1)|
   |VPN-L21 (VPN-IP1, NH2)|             |IGP-L12 (IGP-IP1, NH2)|
   +----------------------+             +----------------------+

        Figure 2 Shared Hierarchical Forwarding Chain at iPE

The forwarding chain depicted in Figure 2 illustrates the first pillar,
which is sharing and hierarchy. We can see that the BGP pathlist
consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs reachable via
ePE1 and ePE2.
As such, it is possible to make changes to the pathlist without having
to make changes to the NLRIs. For example, if BGP-NH2 becomes
unreachable, there is no need to modify any of the possibly large
number of NLRIs. Instead, only the shared pathlist needs to be
modified. Likewise, due to the hierarchical structure of the forwarding
chain, it is possible to make modifications to the IGP routes without
having to make any changes to the BGP NLRIs. For example, if the
interface "I2" goes down, only the shared IGP pathlist needs to be
updated, but none of the IGP prefixes sharing the IGP pathlist nor the
BGP NLRIs using the IGP prefixes for resolution need to be modified.

Figure 2 can also be used to illustrate the second BGP-PIC pillar.
Having a deep forwarding chain such as the one illustrated in Figure 2
requires a forwarding plane that is capable of accessing multiple
levels of indirection in order to calculate the outgoing interface(s)
and next-hop(s). While a deeper forwarding chain minimizes the
re-convergence time on a topology change, there will always exist
platforms with limited capabilities, which impose a limit on the depth
of the forwarding chain. Section 5 describes how to gracefully trade
off convergence speed against the number of hierarchical levels to
support platforms with different capabilities.

3. Constructing the Shared Hierarchical Forwarding Chain

Constructing the forwarding chain is an application of the two pillars
described in Section 2. This section describes how to construct the
forwarding chain in a hierarchical, shared manner.

3.1. Constructing the BGP-PIC forwarding Chain

The whole process starts when BGP downloads a prefix to FIB. The prefix
contains one or more outgoing paths. For certain labeled prefixes, such
as VPN [8] prefixes, each path may be associated with an outgoing label
and the prefix itself may be assigned a local label. The list of
outgoing paths defines a pathlist. If such a pathlist does not already
exist, then FIB creates a new pathlist; otherwise, the existing
pathlist is used. The BGP prefix is added as a dependent of the
pathlist.

The previous step constructs the upper part of the hierarchical
forwarding chain. The forwarding chain is completed by resolving the
paths of the pathlist. A BGP path usually consists of a next-hop. The
next-hop is resolved by finding a matching IGP prefix.

The end result is a hierarchical shared forwarding chain where the BGP
pathlist is shared by all BGP prefixes that use the same list of paths
and the IGP prefix is shared by all pathlists that have a path
resolving via that IGP prefix. Note that the forwarding chain is
constructed without any operator intervention at all.

The remainder of this section goes over an example to illustrate the
applicability of BGP-PIC in a primary-backup path scenario.

3.2. Example: Primary-Backup Path Scenario

Consider the egress PE ePE1 in the case of the multi-homed VPN prefixes
in the BGP-free core depicted in Figure 1. Suppose ePE1 determines that
the primary path is the external path but the backup path is the iBGP
path to the other PE ePE2 with next-hop BGP-NH2. ePE1 constructs the
forwarding chain depicted in Figure 3. We are only showing a single VPN
prefix for simplicity.
But all prefixes 457 that are multihomed to ePE1 and ePE2 share the BGP pathlist. 459 BGP OutLabel Array 460 VPN-L11 +---------+ 461 (Label-leaf)---+---->|Unlabeled| 462 | +---------+ 463 | | VPN-L21 | 464 | | (swap) | 465 | +---------+ 466 | 467 | BGP Pathlist 468 | +------------+ Connected route 469 | | CE-NH |------>(to the CE) 470 | |path-index=0| 471 | +------------+ 472 | | VPN-NH2 | 473 VPN-IP1 -----+------------------>| (backup) |------>IGP Leaf 474 (IP prefix leaf) |path-index=1| (Towards ePE2) 475 | +------------+ 476 | 477 | BGP OutLabel Array 478 | +---------+ 479 +------------->|Unlabeled| 480 +---------+ 481 | VPN-L21 | 482 | (push) | 483 +---------+ 485 Figure 3 : VPN Prefix Forwarding Chain with eiBGP paths on egress PE 487 The example depicted in Figure 3 differs from the example in Figure 488 2 in two main aspects. First, as long as the primary path towards 489 the CE (external path) is useable, it will be the only path used for 490 forwarding while the OutLabel-List contains both the unlabeled label 491 (primary path) and the VPN label (backup path) advertised by the 492 backup path ePE2. The second aspect is presence of the label leaf 493 corresponding to the VPN prefix. This label leaf is used to match 494 VPN traffic arriving from the core. Note that the label leaf shares 495 the pathlist with the IP prefix. 497 4. Forwarding Behavior 499 This section explains how the forwarding plane uses the hierarchical 500 shared forwarding chain to forward a packet. 502 When a packet arrives at a router, it matches a leaf. A labeled 503 packet matches a label leaf while an IP packet matches an IP prefix 504 leaf. The forwarding engines walks the forwarding chain starting 505 from the leaf until the walk terminates on an adjacency. Thus when a 506 packet arrives, the chain is walked as follows: 508 1. Lookup the leaf based on the destination address or the label at 509 the top of the packet 511 2. Retrieve the parent pathlist of the leaf 513 3. Pick the outgoing path "Pi" from the list of resolved paths in 514 the pathlist. The method by which the outgoing path is picked is 515 beyond the scope of this document (e.g. flow-preserving hash 516 exploiting entropy within the MPLS stack and IP header). Let the 517 "path-index" of the outgoing path "Pi" be "j". 519 4. If the prefix is labeled, use the "path-index" "j" to retrieve 520 the jth label "Lj" stored the jth entry in the OutLabel-List and 521 apply the label action of the label on the packet (e.g. for VPN 522 label on the ingress PE, the label action is "push"). As 523 mentioned in Section 1.2, the value of the "path-index" stored 524 in path may not necessarily be the same value of the location of 525 the path in the pathlist. 527 5. Move to the parent of the chosen path "Pi" 529 6. If the chosen path "Pi" is recursive, move to its parent prefix 530 and go to step 2 532 7. If the chosen path is non-recursive move to its parent adjacency. 533 Otherwise go to the next step. 535 8. Encapsulate the packet in the layer string specified by the 536 adjacency and send the packet out. 538 Let's apply the above forwarding steps to the forwarding chain 539 depicted in Figure 2 in Section 2. Suppose a packet arrives at 540 ingress PE iPE from an external neighbor. Assume the packet matches 541 the VPN prefix VPN-IP1. While walking the forwarding chain, the 542 forwarding engine applies a hashing algorithm to choose the path and 543 the hashing at the BGP level yields path 0 while the hashing at the 544 IGP level yields path 1. 
In that case, the packet will be sent out of interface I2 with the
label stack "IGP-L12,VPN-L11".

5. Handling Platforms with Limited Levels of Hierarchy

This section describes the construction of the forwarding chain if a
platform does not support the number of recursion levels required to
resolve the NLRIs. There are two main design objectives:

o Being able to reduce the number of hierarchical levels from any
  arbitrary value to a smaller arbitrary value that can be supported by
  the forwarding engine

o Minimal modifications to the forwarding algorithm due to such
  reduction.

5.1. Flattening the Forwarding Chain

Let's consider a pathlist associated with the leaf "R1" consisting of
the list of paths <P1, P2, ..., Pn>. Assume that the leaf "R1" has an
OutLabel-list <L1, L2, ..., Ln>. Suppose the path "Pi" is a recursive
path that resolves via a prefix represented by the leaf "R2". The leaf
"R2" itself is pointing to a pathlist consisting of the paths
<Q1, Q2, ..., Qm> and has an OutLabel-list
OutLabel-list(R2) = <V1, V2, ..., Vm>.

If the platform supports the number of hierarchy levels of the
forwarding chain, then a packet that uses the path "Pi" will be
forwarded as follows:

1. The forwarding engine is now at leaf "R1"

2. So it moves to its parent pathlist, which contains the list
   <P1, P2, ..., Pn>.

3. The forwarding engine applies a hashing algorithm and picks the path
   "Pi". So now the forwarding engine is at the path "Pi"

4. The forwarding engine retrieves the label "Li" from the
   OutLabel-list attached to the leaf "R1" and applies the label action

5. The path "Pi" uses the leaf "R2"

6. The forwarding engine walks forward to the leaf "R2" for resolution

7. The forwarding plane performs a hash to pick a path among the
   pathlist of the leaf "R2", which is <Q1, Q2, ..., Qm>

8. Suppose the forwarding engine picks the path "Qj"

9. Now the forwarding engine continues the walk to the parent of "Qj"

Suppose the platform cannot support the number of hierarchy levels in
the forwarding chain. FIB needs to reduce the number of hierarchy
levels. The idea of reducing the number of hierarchy levels is to
"flatten" two chain levels into a single level. The "flattening" steps
are as follows:

1. FIB wants to reduce the number of levels used by "Pi" by 1

2. FIB walks to the parent of "Pi", which is the leaf "R2"

3. FIB extracts the parent pathlist of the leaf "R2", which is
   <Q1, Q2, ..., Qm>

4. FIB also extracts the OutLabel-list(R2) associated with the leaf
   "R2". Remember that OutLabel-list(R2) = <V1, V2, ..., Vm>

5. FIB replaces the path "Pi" with the list of paths <Q1, Q2, ..., Qm>

6. Hence the pathlist now becomes
   <P1, ..., P(i-1), Q1, Q2, ..., Qm, P(i+1), ..., Pn>

7. The path index stored inside the locations "Q1", "Q2", ..., "Qm"
   must all be "i" because the index "i" refers to the label "Li"
   associated with leaf "R1"

8. FIB attaches an OutLabel-list to the new pathlist such that the
   entry corresponding to each path "Qk" carries the label "Vk" taken
   from OutLabel-list(R2). The size of the label list associated with
   the flattened pathlist equals the size of the pathlist. Hence there
   is a 1-1 mapping between every path in the "flattened" pathlist and
   the OutLabel-list associated with it.

Note that the labels in the OutLabel-list associated with the
"flattened" pathlist may be stored in the same memory location as the
path itself to avoid additional memory access. But that is an
implementation detail that is beyond the scope of this document.
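The following short sketch (Python, purely illustrative and not part of
the specification; the function and variable names are ours) mirrors
flattening steps 1-8 for a single recursive path: the path "Pi" of leaf
"R1" is replaced by the paths of the leaf "R2" it resolves via, the
labels from OutLabel-list(R2) are attached to the flattened pathlist,
and the path-index "i" is copied into every path that replaced "Pi".

   # Illustrative sketch of the flattening steps above (not normative).
   def flatten_path(r1_paths, i, r2_paths, r2_out_labels):
       """Return (flattened_paths, pathlist_out_labels, path_indices)."""
       assert len(r2_paths) == len(r2_out_labels)
       flattened, labels, indices = [], [], []
       for k, p in enumerate(r1_paths):
           if k != i:
               flattened.append(p)      # non-flattened paths kept as-is
               labels.append(None)      # no pathlist-level label for them
               indices.append(k)        # path-index still selects R1's label
           else:
               for q, v in zip(r2_paths, r2_out_labels):
                   flattened.append(q)  # Pi replaced by Q1..Qm
                   labels.append(v)     # label Vk from OutLabel-list(R2)
                   indices.append(i)    # all carry index "i" (label Li of R1)
       return flattened, labels, indices

   # Example: R1 has <P1, P2, P3>; P2 resolves via R2 with <Q1, Q2>.
   paths, out_labels, idx = flatten_path(
       ["P1", "P2", "P3"], 1, ["Q1", "Q2"], ["V1", "V2"])
   # paths      == ['P1', 'Q1', 'Q2', 'P3']
   # out_labels == [None, 'V1', 'V2', None]
   # idx        == [0, 1, 1, 2]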
636 The same steps can be applied to all paths in the pathlist so that all paths are "flattened" thereby reducing the 638 number of hierarchical levels by one. Note that that "flattening" a 639 pathlist pulls in all paths of the parent paths, a desired feature 640 to utilize all ECMP/UCMP paths at all levels. A platform that has a 641 limit on the number of paths in a pathlist for any given leaf may 642 choose to reduce the number paths using methods that are beyond the 643 scope of this document. 645 The steps can be recursively applied to other paths at the same 646 levels or other levels to recursively reduce the number of 647 hierarchical levels to an arbitrary value so as to accommodate the 648 capability of the forwarding engine. 650 Because a flattened pathlist may have an associated OutLabel-list 651 the forwarding behavior has to be slightly modified. The 652 modification is done by adding the following step right after step 4 653 in Section 4. 655 5. If there is an OutLabel-list associated with the pathlist, then 656 if the path "Pi" is chosen by the hashing algorithm, retrieve the 657 label at location "i" in that OutLabel-list and apply the label 658 action of that label on the packet 660 In the next subsection, we apply the steps in this subsection to a 661 sample scenario. 663 5.2. Example: Flattening a forwarding chain 665 This example uses a case of inter-AS option C [8] where there are 3 666 levels of hierarchy. Figure 4 illustrates the sample topology. To 667 force 3 levels of hierarchy, the ASBRs on the ingress domain (domain 668 1) advertise the core routers of the egress domain (domain 2) to the 669 ingress PE (iPE) via BGP-LU [4] instead of redistributing them into 670 the IGP of domain 1. The end result is that the ingress PE (iPE) has 671 2 levels of recursion for the VPN prefix VPN-IP1 and VPN2-IP2. 673 Domain 1 Domain 2 674 +-------------+ +-------------+ 675 | | | | 676 | LDP/SR Core | | LDP/SR core | 677 | | | | 678 | (192.0.1.1) | | 679 | ASBR11---------ASBR21........ePE1(192.0.2.1) 680 | | \ / | . . |\ 681 | | \ / | . . | \ 682 | | \ / | . . | \ 683 | | \/ | .. | \VPN-IP1 (11.1.1.0/24) 684 | | /\ | . . | /VRF "Blue" ASn: 65000 685 | | / \ | . . | / 686 | | / \ | . . | / 687 | | / \ | . . |/ 688 iPE ASBR12---------ASBR22........ePE2 (192.0.2.2) 689 | (192.0.1.2) | |\ 690 | | | | \ 691 | | | | \ 692 | | | | \VRF "Blue" ASn: 65000 693 | | | | /VPN-IP2 (11.1.2.0/24) 694 | | | | / 695 | | | | / 696 | | | |/ 697 | ASBR13---------ASBR23........ePE3(192.0.2.3) 698 | (192.0.1.3) | | 699 | | | | 700 | | | | 701 +-------------+ +-------------+ 702 <============ <========= <============ 703 Advertise ePEx Advertise Redistribute 704 Using iBGP-LU ePEx Using IGP into 705 eBGP-LU BGP 707 Figure 4 : Sample 3-level hierarchy topology 709 We will make the following assumptions about connectivity 711 o In "domain 2", both ASBR21 and ASBR22 can reach both ePE1 and 712 ePE2 using the same distance 714 o In "domain 2", only ASBR23 can reach ePE3 716 o In "domain 1", iPE (the ingress PE) can reach ASBR11, ASBR12, and 717 ASBR13 via IGP using the same distance. 
We will make the following assumptions about the labels:

o The VPN labels advertised by ePE1 and ePE2 for prefix VPN-IP1 are
  VPN-L11 and VPN-L21, respectively

o The VPN labels advertised by ePE2 and ePE3 for prefix VPN-IP2 are
  VPN-L22 and VPN-L32, respectively

o The labels advertised by ASBR11 to iPE using BGP-LU [4] for the
  egress PEs ePE1 and ePE2 are LASBR11(ePE1) and LASBR11(ePE2),
  respectively.

o The labels advertised by ASBR12 to iPE using BGP-LU [4] for the
  egress PEs ePE1 and ePE2 are LASBR12(ePE1) and LASBR12(ePE2),
  respectively

o The label advertised by ASBR13 to iPE using BGP-LU [4] for the egress
  PE ePE3 is LASBR13(ePE3)

o The IGP labels advertised by the next hops directly connected to iPE
  towards ASBR11, ASBR12, and ASBR13 in the core of domain 1 are
  IGP-L11, IGP-L12, and IGP-L13, respectively.

Based on these assumptions and the topology in Figure 4, the routing
table on iPE is:

   65000:11.1.1.0/24
       via ePE1 (192.0.2.1), VPN Label: VPN-L11
       via ePE2 (192.0.2.2), VPN Label: VPN-L21
   65000:11.1.2.0/24
       via ePE2 (192.0.2.2), VPN Label: VPN-L22
       via ePE3 (192.0.2.3), VPN Label: VPN-L32

   192.0.2.1/32 (ePE1)
       Via ASBR11, BGP-LU Label: LASBR11(ePE1)
       Via ASBR12, BGP-LU Label: LASBR12(ePE1)
   192.0.2.2/32 (ePE2)
       Via ASBR11, BGP-LU Label: LASBR11(ePE2)
       Via ASBR12, BGP-LU Label: LASBR12(ePE2)
   192.0.2.3/32 (ePE3)
       Via ASBR13, BGP-LU Label: LASBR13(ePE3)

   192.0.1.1/32 (ASBR11)
       via Core, Label: IGP-L11
   192.0.1.2/32 (ASBR12)
       via Core, Label: IGP-L12
   192.0.1.3/32 (ASBR13)
       via Core, Label: IGP-L13

The diagram in Figure 5 illustrates the forwarding chain in iPE,
assuming that the forwarding hardware in iPE supports 3 levels of
hierarchy. The leaves corresponding to the ASBRs in domain 1 (ASBR11,
ASBR12, and ASBR13) are at the bottom of the hierarchy. There are a few
important points:

o Because the hardware supports the required depth of hierarchy, the
  size of a pathlist equals the size of the OutLabel-List associated
  with each leaf using this pathlist

o The index inside the pathlist entry indicates the label that will be
  picked from the OutLabel-List associated with the child leaf if that
  path is chosen by the forwarding engine hashing function.
781 Outlabel-List Outlabel-List 782 For VPN-IP1 For VPN-IP2 783 +------------+ +--------+ +-------+ +------------+ 784 | VPN-L11 |<---| VPN-IP1| |VPN-IP2|-->| VPN-L22 | 785 +------------+ +---+----+ +---+---+ +------------+ 786 | VPN-L21 | | | | VPN-L32 | 787 +------------+ | | +------------+ 788 | | 789 V V 790 +---+---+ +---+---+ 791 | 0 | 1 | | 0 | 1 | 792 +-|-+-\-+ +-/-+-\-+ 793 | \ / \ 794 | \ / \ 795 | \ / \ 796 | \ / \ 797 v \ / \ 798 +-----+ +-----+ +-----+ 799 +----+ ePE1| |ePE2 +-----+ | ePE3+-----+ 800 | +--+--+ +-----+ | +--+--+ | 801 v | / v | v 802 +-------------+ | / +-------------+ | +-------------+ 803 |LASBR11(ePE1)| | / |LASBR11(ePE2)| | |LASBR13(ePE3)| 804 +-------------+ | / +-------------+ | +-------------+ 805 |LASBR12(ePE1)| | / |LASBR12(ePE2)| | Outlabel-List 806 +-------------+ | / +-------------+ | For ePE3 807 Outlabel-List | / Outlabel-List | 808 For ePE1 | / For ePE2 | 809 | / | 810 | / | 811 | / | 812 v / v 813 +---+---+ Shared Pathlist +---+ Pathlist 814 | 0 | 1 | For ePE1 and ePE2 | 0 | For ePE3 815 +-|-+-\-+ +-|-+ 816 | \ | 817 | \ | 818 | \ | 819 | \ | 820 v \ v 821 +------+ +------+ +------+ 822 +---+ASBR11| |ASBR12+--+ |ASBR13+---+ 823 | +------+ +------+ | +------+ | 824 v v v 825 +-------+ +-------+ +-------+ 826 |IGP-L11| |IGP-L12| |IGP-L13| 827 +-------+ +-------+ +-------+ 829 Figure 5 : Forwarding Chain for hardware supporting 3 Levels 831 Now suppose the hardware on iPE (the ingress PE) supports 2 levels 832 of hierarchy only. In that case, the 3-levels forwarding chain in 833 Figure 5 needs to be "flattened" into 2 levels only. 835 Outlabel-List Outlabel-List 836 For VPN-IP1 For VPN-IP2 837 +------------+ +-------+ +-------+ +------------+ 838 | VPN-L11 |<---|VPN-IP1| | VPN-IP2|--->| VPN-L22 | 839 +------------+ +---+---+ +---+---+ +------------+ 840 | VPN-L21 | | | | VPN-L32 | 841 +------------+ | | +------------+ 842 | | 843 | | 844 | | 845 Flattened | | Flattened 846 pathlist V V pathlist 847 +===+===+ +===+===+===+ +=============+ 848 +--------+ 0 | 1 | | 0 | 0 | 1 +---->|LASBR11(ePE2)| 849 | +=|=+=\=+ +=/=+=/=+=\=+ +=============+ 850 v | \ / / \ |LASBR12(ePE2)| 851 +=============+ | \ +-----+ / \ +=============+ 852 |LASBR11(ePE1)| | \/ / \ |LASBR13(ePE3)| 853 +=============+ | /\ / \ +=============+ 854 |LASBR12(ePE1)| | / \ / \ 855 +=============+ | / \ / \ 856 | / \ / \ 857 | / + + \ 858 | + | | \ 859 | | | | \ 860 v v v v \ 861 +------+ +------+ +------+ 862 +----|ASBR11| |ASBR12+---+ |ASBR13+---+ 863 | +------+ +------+ | +------+ | 864 v v v 865 +-------+ +-------+ +-------+ 866 |IGP-L11| |IGP-L12| |IGP-L13| 867 +-------+ +-------+ +-------+ 869 Figure 6 : Flattening 3 levels to 2 levels of Hierarchy on iPE 871 Figure 6 represents one way to "flatten" a 3 levels hierarchy into 872 two levels. There are few important points: 874 o As mentioned in Section 5.1 a flattened pathlist may have label 875 lists associated with them. The size of the label list associated 876 with a flattened pathlist equals the size of the pathlist. Hence 877 it is possible that an implementation includes these label lists 878 in the flattened pathlist itself 880 o Again as mentioned in Section 5.1, the size of a flattened 881 pathlist may not be equal to the size of the OutLabel-lists of 882 leaves using the flattened pathlist. So the indices inside a 883 flattened pathlist still indicate the label index in the 884 Outlabel-Lists of the leaves using that pathlist. 
Because the size of the flattened pathlist may be different from the
size of the OutLabel-lists of the leaves, the indices may be repeated.

o Let's take a look at the flattened pathlist used by the prefix
  "VPN-IP2". The pathlist associated with the prefix "VPN-IP2" has
  three entries.

   o The first and second entries have index "0". This is because both
     entries correspond to ePE2. Hence when the hashing performed by
     the forwarding engine results in using the first or the second
     entry in the pathlist, the forwarding engine will pick the correct
     VPN label "VPN-L22", which is the label advertised by ePE2 for the
     prefix "VPN-IP2"

   o The third entry has the index "1". This is because the third entry
     corresponds to ePE3. Hence when the hashing performed by the
     forwarding engine results in using the third entry in the
     flattened pathlist, the forwarding engine will pick the correct
     VPN label "VPN-L32", which is the label advertised by ePE3 for the
     prefix "VPN-IP2"

Now let's try and apply the forwarding steps in Section 4 together with
the additional step in Section 5.1 to the flattened forwarding chain
illustrated in Figure 6.

o Suppose a packet arrives at "iPE" and matches the VPN prefix
  "VPN-IP2"

o The forwarding engine walks to the parent of "VPN-IP2", which is the
  flattened pathlist, and applies a hashing algorithm to pick a path

o Suppose the hashing by the forwarding engine picks the second path in
  the flattened pathlist associated with the leaf "VPN-IP2".

o Because the second path has the index "0", the label "VPN-L22" is
  pushed on the packet

o Next the forwarding engine picks the second label from the
  OutLabel-List associated with the flattened pathlist. Hence the next
  label that is pushed is "LASBR12(ePE2)"

o The forwarding engine now moves to the parent of the flattened
  pathlist corresponding to the second path. The parent is the IGP
  label leaf corresponding to "ASBR12"

o So the packet is forwarded towards the ASBR "ASBR12" and the IGP
  label at the top will be "IGP-L12"

Based on the above steps, a packet arriving at iPE and destined to the
prefix VPN-IP2 reaches its destination as follows:

o iPE sends the packet along the shortest path towards ASBR12 with the
  following label stack starting from the top: {IGP-L12,
  LASBR12(ePE2), VPN-L22}.

o The penultimate hop of ASBR12 pops the top label "IGP-L12". Hence the
  packet arrives at ASBR12 with the label stack {LASBR12(ePE2),
  VPN-L22} where "LASBR12(ePE2)" is the top label.

o ASBR12 swaps "LASBR12(ePE2)" with the label "LASBR22(ePE2)", which is
  the label advertised by ASBR22 for ePE2 (the egress PE).

o ASBR22 receives the packet with "LASBR22(ePE2)" at the top.

o Hence ASBR22 swaps "LASBR22(ePE2)" with the IGP label for ePE2
  advertised by the next-hop towards ePE2 in domain 2, and sends the
  packet along the shortest path towards ePE2.

o The penultimate hop of ePE2 pops the top label. Hence ePE2 receives
  the packet with the label VPN-L22 at the top.

o ePE2 pops "VPN-L22" and sends the packet as a pure IP packet towards
  the destination VPN-IP2.
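As a compact recap of the walk just described, the sketch below
(Python, purely illustrative; the dictionaries simply transcribe the
Figure 6 example and the data layout is hypothetical) computes the
label stack imposed by iPE when the hashing function selects the second
entry of the flattened pathlist.

   # Illustrative sketch of forwarding on the flattened 2-level chain
   # of Figure 6 (not normative).
   vpn_ip2 = {
       "out_labels": ["VPN-L22", "VPN-L32"],              # leaf OutLabel-List
       "pathlist": {
           "paths":      ["ASBR11", "ASBR12", "ASBR13"],  # flattened paths
           "path_index": [0, 0, 1],                       # index into leaf labels
           "out_labels": ["LASBR11(ePE2)", "LASBR12(ePE2)", "LASBR13(ePE3)"],
       },
   }
   igp_labels = {"ASBR11": "IGP-L11", "ASBR12": "IGP-L12", "ASBR13": "IGP-L13"}

   def label_stack(leaf, chosen):
       """Label stack (top first) for the chosen flattened pathlist entry."""
       pl = leaf["pathlist"]
       vpn_label = leaf["out_labels"][pl["path_index"][chosen]]
       lu_label  = pl["out_labels"][chosen]
       igp_label = igp_labels[pl["paths"][chosen]]
       return [igp_label, lu_label, vpn_label]

   print(label_stack(vpn_ip2, 1))
   # ['IGP-L12', 'LASBR12(ePE2)', 'VPN-L22']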
6. Forwarding Chain Adjustment at a Failure

The hierarchical and shared structure of the forwarding chain explained
in the previous section allows modifying a small number of forwarding
chain objects to re-route traffic to a pre-calculated equal-cost or
backup path without the need to modify the possibly very large number
of BGP prefixes. In this section, we go over various core and edge
failure scenarios to illustrate how the FIB manager can utilize the
forwarding chain structure to achieve BGP prefix independent
convergence.

6.1. BGP-PIC core

This section describes the adjustments to the forwarding chain when a
core link or node fails but the BGP next-hop remains reachable.

There are two cases: remote link failure and attached link failure.
Node failures are treated as link failures.

When a remote link or node fails, the IGP on the ingress PE receives an
advertisement indicating a topology change, so the IGP re-converges to
either find a new next-hop and/or outgoing interface or remove the path
completely from the IGP prefix used to resolve BGP next-hops. IGP
and/or LDP download the modified IGP leaves with modified outgoing
labels for a labeled core.

When a local link fails, the FIB manager detects the failure almost
immediately. The FIB manager marks the impacted path(s) as unusable so
that only usable paths are used to forward packets. Hence only IGP
pathlists with paths using the failed local link need to be modified.
All other pathlists are not impacted. Note that in this particular case
there is actually no need even to backwalk to the IGP leaves to adjust
the OutLabel-Lists, because FIB can rely on the path-index stored in
the usable paths in the pathlist to pick the right label.

Note that because the FIB manager modifies the forwarding chain
starting from the IGP leaves only, BGP pathlists and leaves are not
modified. Hence traffic restoration occurs within the time frame of IGP
convergence, and, for a local link failure, assuming a backup path has
been precomputed, within the timeframe of local detection (e.g. 50ms).
Examples of solutions that pre-compute backup paths are IP FRR [16],
remote LFA [17], Ti-LFA [15], and MRT [18], or an eBGP path acting as a
backup path [10].

Let's apply the procedure mentioned in this subsection to the
forwarding chain depicted in Figure 2. Suppose a remote link failure
occurs and impacts the first ECMP IGP path to the remote BGP next-hop.
Upon IGP convergence, the IGP pathlist used by the BGP next-hop is
updated to reflect the new topology (one path instead of two). As soon
as the IGP convergence is effective for the BGP next-hop entry, the new
forwarding state is immediately available to all dependent BGP
prefixes. The same behavior would occur if the failure was local, such
as an interface going down. As soon as the IGP convergence is complete
for the BGP next-hop IGP route, all its depending BGP routes benefit
from the new path. In fact, upon local failure, if LFA protection is
enabled for the IGP route to the BGP next-hop and a backup path was
pre-computed and installed in the pathlist, the LFA backup path is
immediately activated upon the local interface failure (e.g.
sub-50msec) and thus protection benefits all the depending BGP traffic
through the hierarchical forwarding dependency between the routes.
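As a rough illustration of the BGP-PIC core behavior just described
(Python, purely illustrative; the data layout and names are ours), the
sketch below marks the paths using a failed local interface as unusable
in the shared IGP pathlist. The cost is proportional to the number of
IGP pathlists touched and is independent of the number of BGP prefixes
that resolve through them.

   # Illustrative sketch (not normative): BGP-PIC core on local link failure.
   igp_pathlist = {                   # shared by the IGP leaves (e.g. IGP-IP1)
       "paths": [
           {"nh": "IGP-NH1", "intf": "I1", "usable": True},
           {"nh": "IGP-NH2", "intf": "I2", "usable": True},
       ]
   }

   def on_local_link_failure(failed_intf, igp_pathlists):
       """Mark every path using the failed interface unusable; BGP leaves
       and BGP pathlists are never visited."""
       for pl in igp_pathlists:
           for p in pl["paths"]:
               if p["intf"] == failed_intf:
                   p["usable"] = False

   on_local_link_failure("I2", [igp_pathlist])
   # Only the path via I1 remains usable; all dependent BGP prefixes
   # immediately use it through the shared forwarding chain.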
6.2. BGP-PIC edge

This section describes the adjustments to the forwarding chains as a
result of edge node or edge link failure.

6.2.1. Adjusting forwarding Chain in egress node failure

When an edge node fails, the IGP on the neighboring core nodes sends
route updates indicating that the edge node is no longer reachable. The
IGP running on the iBGP peers instructs FIB to remove the IP and label
leaves corresponding to the failed edge node from FIB. So the FIB
manager performs the following steps:

o The FIB manager deletes the IGP leaf corresponding to the failed edge
  node

o The FIB manager backwalks to all dependent BGP pathlists and marks
  the path using the deleted IGP leaf as unresolved

o Note that there is no need to modify the possibly large number of BGP
  leaves, because each path in the pathlist carries its path index and
  hence the correct outgoing label will be picked. Consider for example
  the forwarding chain depicted in Figure 2. If the 1st BGP path
  becomes unresolved, then the forwarding engine will only use the
  second path for forwarding. Yet the path index of that single
  resolved path will still be 1 and hence the label VPN-L21 will be
  pushed.

6.2.2. Adjusting Forwarding Chain on PE-CE link Failure

Suppose the link between an edge router and its external peer fails.
There are two scenarios: (1) the edge node attached to the failed link
performs next-hop self and (2) the edge node attached to the failed
link advertises the IP address of the failed link as the next-hop
attribute to its iBGP peers.

In the first case, the rest of the iBGP peers will remain unaware of
the link failure and will continue to forward traffic to the edge node
until the edge node attached to the failed link withdraws the BGP
prefixes. If the destination prefixes are multi-homed to another iBGP
peer, say ePE2, then the FIB manager on the edge router detecting the
link failure applies the following steps:

o The FIB manager backwalks to the BGP pathlists and marks the path
  through the failed link to the external peer as unresolved

o Hence traffic will be forwarded using the backup path towards ePE2

o For labeled traffic:

   o The OutLabel-List attached to the BGP leaf already contains an
     entry corresponding to the backup path.

   o The label entry in the OutLabel-List corresponding to the internal
     path to the backup egress PE has a swap action to the label
     advertised by the backup egress PE

   o For an arriving labeled packet (e.g. VPN), the top label is
     swapped with the label advertised by the backup egress PE and the
     packet is sent towards that backup egress PE

o For unlabeled traffic, packets are simply redirected towards the
  backup egress PE.

In the second case, where the edge router uses the IP address of the
failed link as the BGP next-hop, the edge router will still perform the
previous steps. But, unlike the case of next-hop self, the IGP on the
edge node with the failed link informs the rest of the iBGP peers that
the IP address of the failed link is no longer reachable. Hence the FIB
manager on the iBGP peers will delete the IGP leaf corresponding to the
IP prefix of the failed link. The behavior of the iBGP peers will be
identical to the case of edge node failure outlined in Section 6.2.1.
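The first case above can be summarized with a small sketch (Python,
purely illustrative; the structure mirrors the Figure 3 scenario that
is worked through after this note, and all names are hypothetical):
once the external path is marked unresolved, the label leaf's
path-index selects the backup entry of the OutLabel-List, whose action
is a swap to the label advertised by the backup egress PE.

   # Illustrative sketch (not normative): egress PE after a PE-CE link
   # failure with next-hop self.
   vpn_l11_leaf = {
       "pathlist": [
           {"to": "CE",      "path_index": 0, "resolved": True},  # primary
           {"to": "VPN-NH2", "path_index": 1, "resolved": True},  # backup (ePE2)
       ],
       "out_labels": [("unlabeled", None), ("swap", "VPN-L21")],
   }

   def forward_from_core(leaf):
       path = next(p for p in leaf["pathlist"] if p["resolved"])
       action, label = leaf["out_labels"][path["path_index"]]
       return path["to"], action, label

   vpn_l11_leaf["pathlist"][0]["resolved"] = False   # PE-CE link fails
   # forward_from_core(vpn_l11_leaf) -> ('VPN-NH2', 'swap', 'VPN-L21')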
Note that because the edge link failure is local to the edge router,
sub-50 msec convergence can be achieved as described in [10].

Let's try to apply the case of next-hop self to the forwarding chain
depicted in Figure 3. After failure of the link between ePE1 and the
CE, the forwarding engine will route traffic arriving from the core
towards VPN-NH2 with path-index=1. A packet arriving from the core will
contain the label VPN-L11 at the top. The label VPN-L11 is swapped with
the label VPN-L21 and the packet is forwarded towards ePE2.

6.3. Handling Failures for Flattened Forwarding Chains

As explained in Section 5, if a platform cannot support the native
number of hierarchy levels of a recursive forwarding chain, the
instantiated forwarding chain is constructed by flattening two or more
levels. Hence the 3-level chain in Figure 5 is flattened into the
2-level chain in Figure 6.

While reducing the benefits of BGP-PIC, flattening one hierarchy into a
shallower hierarchy does not always result in a complete loss of the
benefits of BGP-PIC. To illustrate this fact, suppose ASBR12 is no
longer reachable in domain 1. If the platform supports the full
hierarchy depth, the forwarding chain is the one depicted in Figure 5,
and hence the FIB manager needs to backwalk one level to the pathlist
shared by "ePE1" and "ePE2" and adjust it. If the platform supports 2
levels of hierarchy, then a usable forwarding chain is the one depicted
in Figure 6. In that case, if ASBR12 is no longer reachable, the FIB
manager has to backwalk to the two flattened pathlists and update both
of them.

The main observation is that the loss of convergence speed due to the
loss of hierarchy depth depends on the structure of the forwarding
chain itself. To illustrate this fact, let's take two extremes. Suppose
the forwarding objects in level i+1 depend on the forwarding objects in
level i. If every object in level i+1 depends on a separate object in
level i, then flattening level i into level i+1 will not result in loss
of convergence speed. Now let's take the other extreme. Suppose "n"
objects in level i+1 depend on 1 object in level i. Now suppose FIB
flattens level i into level i+1. If a topology change results in
modifying the single object in level i, then FIB has to backwalk and
modify "n" objects in the flattened level, thereby losing all the
benefit of BGP-PIC. Experience shows that flattening forwarding chains
usually results in a moderate loss of BGP-PIC benefits. Further
analysis is needed to corroborate and quantify this statement.

7. Properties

7.1. Coverage

All the possible failures, except CE node failure, are covered, whether
they impact a local or remote IGP path or a local or remote BGP
next-hop, as described in Section 6. This section provides details for
each failure and how the hierarchical and shared FIB structure proposed
in this document allows recovery that does not depend on the number of
BGP prefixes.

7.1.1. A remote failure on the path to a BGP next-hop

Upon IGP convergence, the IGP leaf for the BGP next-hop is updated and
all the depending BGP routes leverage the new IGP forwarding state
immediately. Details of this behavior can be found in Section 6.1.
This BGP resiliency property only depends on IGP convergence and is
independent of the number of BGP prefixes impacted.

7.1.2. A local failure on the path to a BGP next-hop

Upon LFA protection, the IGP leaf for the BGP next-hop is updated to
use the precomputed LFA backup path and all the depending BGP routes
leverage this LFA protection. Details of this behavior can be found in
Section 6.1.

This BGP resiliency property only depends on LFA protection and is
independent of the number of BGP prefixes impacted.

7.1.3. A remote iBGP next-hop fails

Upon IGP convergence, the IGP leaf for the BGP next-hop is deleted and
all the depending BGP pathlists are updated to either use the remaining
ECMP BGP best-paths or, if none remains available, to activate
precomputed backups. Details about this behavior can be found in
Section 6.2.1.

This BGP resiliency property only depends on IGP convergence and is
independent of the number of BGP prefixes impacted.

7.1.4. A local eBGP next-hop fails

Upon local link failure detection, the adjacency to the BGP next-hop is
deleted and all the depending BGP pathlists are updated to either use
the remaining ECMP BGP best-paths or, if none remains available, to
activate precomputed backups. Details about this behavior can be found
in Section 6.2.2.

This BGP resiliency property only depends on local link failure
detection and is independent of the number of BGP prefixes impacted.

7.2. Performance

When the failure is local (a local IGP next-hop failure or a local eBGP
next-hop failure), a pre-computed and pre-installed backup is activated
by a local-protection mechanism that does not depend on the number of
BGP destinations impacted by the failure. Sub-50msec recovery is thus
possible even if millions of BGP routes are impacted.

When the failure is remote (a remote IGP failure not impacting the BGP
next-hop or a remote BGP next-hop failure), an alternate path is
activated upon IGP convergence. All the impacted BGP destinations
benefit from a working alternate path as soon as the IGP convergence
occurs for their impacted BGP next-hop, even if millions of BGP routes
are impacted.

Appendix A puts the BGP PIC benefits in perspective by providing some
results using actual numbers.

7.3. Automated

The BGP PIC solution does not require any operator involvement. The
process is entirely automated as part of the FIB implementation.

The salient points enabling this automation are:

o Extension of the BGP Best Path computation to more than one primary
  ([11] and [12]) or backup BGP next-hop ([6] and [13]).

o Sharing of BGP pathlists across BGP destinations with the same
  primary and backup BGP next-hops

o Hierarchical indirection and dependency between BGP pathlists and IGP
  pathlists

7.4. Incremental Deployment

As soon as one router supports the BGP PIC solution, it gains all of
the solution's benefits without any requirement for other routers to
support BGP PIC.

8. Security Considerations

The behavior described in this document is functionality internal to a
router. It results in a significant improvement in convergence time as
well as a reduction in the CPU and memory used by FIB, without changing
basic routing and forwarding functionality.
As 1244 such no additional security risk is introduced by using the 1245 mechanisms proposed in this document. 1247 9. IANA Considerations 1249 No requirements for IANA 1251 10. Conclusions 1253 This document proposes a hierarchical and shared forwarding chain 1254 structure that allows achieving BGP prefix independent 1255 convergence, and in the case of locally detected failures, sub-50 1256 msec convergence. A router can construct the forwarding chains in 1257 a completely transparent manner with zero operator intervention 1258 thereby supporting smooth and incremental deployment. 1260 11. References 1262 11.1. Normative References 1264 [1] Bradner, S., "Key words for use in RFCs to Indicate 1265 Requirement Levels", BCP 14, RFC 2119, March 1997. 1267 [2] Rekhter, Y., Li, T., and S. Hares, "A Border Gateway Protocol 1268 4 (BGP-4), RFC 4271, January 2006 1270 [3] Bates, T., Chandra, R., Katz, D., and Rekhter Y., 1271 "Multiprotocol Extensions for BGP", RFC 4760, January 2007 1273 [4] Y. Rekhter and E. Rosen, " Carrying Label Information in BGP- 1274 4", RFC 3107, May 2001 1276 [5] Andersson, L., Minei, I., and B. Thomas, "LDP Specification", 1277 RFC 5036, October 2007 1279 11.2. Informative References 1281 [6] Marques,P., Fernando, R., Chen, E, Mohapatra, P., Gredler, H., 1282 "Advertisement of the best external route in BGP", draft-ietf- 1283 idr-best-external-05.txt, January 2012. 1285 [7] Wu, J., Cui, Y., Metz, C., and E. Rosen, "Softwire Mesh 1286 Framework", RFC 5565, June 2009. 1288 [8] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 1289 Networks (VPNs)", RFC 4364, February 2006. 1291 [9] De Clercq, J. , Ooms, D., Prevost, S., Le Faucheur, F., 1292 "Connecting IPv6 Islands over IPv4 MPLS Using IPv6 Provider 1293 Edge Routers (6PE)", RFC 4798, February 2007 1295 [10] O. Bonaventure, C. Filsfils, and P. Francois. "Achieving sub- 1296 50 milliseconds recovery upon bgp peering link failures, " 1297 IEEE/ACM Transactions on Networking, 15(5):1123-1135, 2007 1299 [11] D. Walton, A. Retana, E. Chen, J. Scudder, "Advertisement of 1300 Multiple Paths in BGP", draft-ietf-idr-add-paths-12.txt, 1301 November 2015 1303 [12] R. Raszuk, R. Fernando, K. Patel, D. McPherson, K. Kumaki, 1304 "Distribution of diverse BGP paths", RFC 6774, November 2012 1306 [13] P. Mohapatra, R. Fernando, C. Filsfils, and R. Raszuk, "Fast 1307 Connectivity Restoration Using BGP Add-path", draft-pmohapat- 1308 idr-fast-conn-restore-03, Jan 2013 1310 [14] C. Filsfils, S. Previdi, A. Bashandy, B. Decraene, S. 1311 Litkowski, M. Horneffer, R. Shakir, J. Tansura, E. Crabbe 1312 "Segment Routing with MPLS data plane", draft-ietf-spring- 1313 segment-routing-mpls-02 (work in progress), October 2015 1315 [15] C. Filsfils, S. Previdi, A. Bashandy, B. Decraene, " Topology 1316 Independent Fast Reroute using Segment Routing", draft- 1317 francois-spring-segment-routing-ti-lfa-02 (work in progress), 1318 August 2015 1320 [16] M. Shand and S. Bryant, "IP Fast Reroute Framework", RFC 5714, 1321 January 2010 1323 [17] S. Bryant, C. Filsfils, S. Previdi, M. Shand, N So, " Remote 1324 Loop-Free Alternate (LFA) Fast Reroute (FRR)", RFC 7490 April 1325 2015 1327 [18] A. Atlas, C. Bowers, G. Enyedi, " An Architecture for IP/LDP 1328 Fast-Reroute Using Maximally Redundant Trees", draft-ietf- 1329 rtgwg-mrt-frr-architecture-10 (work in progress), February 1330 2016 1332 12. 
Acknowledgments

Special thanks to Neeraj Malhotra and Yuri Tsier for the valuable help.

Special thanks to Bruno Decraene for the valuable comments.

Authors' Addresses

Ahmed Bashandy
Cisco Systems
170 West Tasman Dr, San Jose, CA 95134, USA
Email: bashandy@cisco.com

Clarence Filsfils
Cisco Systems
Brussels, Belgium
Email: cfilsfil@cisco.com

Prodosh Mohapatra
Sproute Networks
Email: mpradosh@yahoo.com

Appendix A. Perspective

The following table puts the BGP PIC benefits in perspective assuming:

o 1M impacted BGP prefixes

o IGP convergence ~ 500 msec

o Local protection ~ 50 msec

o FIB update per BGP destination ~ 100 usec conservative, ~ 10 usec
  optimistic

o BGP convergence per BGP destination ~ 200 usec conservative, ~ 100
  usec optimistic

                          Without PIC       With PIC

   Local IGP Failure      10 to 100 sec     50 msec

   Local BGP Failure      100 to 200 sec    50 msec

   Remote IGP Failure     10 to 100 sec     500 msec

   Remote BGP Failure     100 to 200 sec    500 msec

Upon local IGP next-hop failure or remote IGP next-hop failure, the
existing primary BGP next-hop is intact and usable; hence the
resiliency only depends on the ability of the FIB mechanism to reflect
the new path to the BGP next-hop to the depending BGP destinations.
Without BGP PIC, a conservative back-of-the-envelope estimation for
this FIB update is 100 usec per BGP destination. An optimistic
estimation is 10 usec per entry.

Upon local BGP next-hop failure or remote BGP next-hop failure, without
the BGP PIC mechanism, a new BGP best path needs to be recomputed and
new updates need to be sent to peers. This depends on BGP processing
time that will be shared between best-path computation, RIB update, and
peer update. A conservative back-of-the-envelope estimation for this is
200 usec per BGP destination. An optimistic estimation is 100 usec per
entry.
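For clarity, the "Without PIC" figures in the table above follow
directly from multiplying the per-destination estimates by the 1M
impacted prefixes; the short illustrative calculation below (Python)
reproduces them.

   # Back-of-the-envelope arithmetic behind the table above (illustrative).
   prefixes = 1_000_000
   fib_update   = (10e-6, 100e-6)    # sec per destination (optimistic, conservative)
   bgp_converge = (100e-6, 200e-6)   # sec per destination (optimistic, conservative)

   print("IGP next-hop failure, no PIC: %.0f to %.0f sec"
         % (prefixes * fib_update[0], prefixes * fib_update[1]))     # 10 to 100 sec
   print("BGP next-hop failure, no PIC: %.0f to %.0f sec"
         % (prefixes * bgp_converge[0], prefixes * bgp_converge[1])) # 100 to 200 sec
   # With PIC, recovery is ~50 msec (local protection) or ~500 msec
   # (IGP convergence), independent of the number of prefixes.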