idnits 2.17.1

draft-rtgwg-bgp-pic-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 143 has weird spacing: '... be inter...'

  == Line 573 has weird spacing: '...al peer as un...'

  == The document seems to contain a disclaimer for pre-RFC5378 work, but was
     first submitted on or after 10 November 2008.  The disclaimer is usually
     necessary only for documents that revise or obsolete older RFCs, and that
     take significant amounts of text from those RFCs.  If you can contact all
     authors of the source material and they are willing to grant the BCP78
     rights to the IETF Trust, you can and should remove the disclaimer.
     Otherwise, the disclaimer is needed and you can ignore this comment.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 21, 2013) is 3812 days in the past.  Is this
     intentional?
  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: 'TBD' on line 636

  == Outdated reference: A later version (-05) exists of
     draft-ietf-idr-best-external-04

  == Outdated reference: A later version (-15) exists of
     draft-ietf-idr-add-paths-07

  == Outdated reference: A later version (-03) exists of
     draft-pmohapat-idr-fast-conn-restore-02

     Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1     Network Working Group                                  A. Bashandy, Ed.
2     Internet Draft                                              C. Filsfils
3     Intended status: Informational                            Cisco Systems
4     Expires: April 2014                                        P. Mohapatra
5                                                            Cumulus Networks
6                                                            October 21, 2013
7                      BGP Prefix Independent Convergence
8                          draft-rtgwg-bgp-pic-02.txt

10    Abstract

12       In a network comprising thousands of iBGP peers exchanging millions
13       of routes, many routes are reachable via more than one path. Given
14       the large scaling targets, it is desirable to restore traffic after
15       failure in a time period that does not depend on the number of BGP
16       prefixes. In this document we propose a technique by which traffic
17       can be re-routed to ECMP or pre-calculated backup paths in a
18       timeframe that does not depend on the number of BGP prefixes. The
19       objective is achieved through organizing the forwarding chains in a
20       hierarchical manner and sharing forwarding elements among the maximum
21       possible number of routes. The proposed technique achieves prefix
22       independent convergence while ensuring incremental deployment,
23       complete transparency and automation, and zero management and
24       provisioning effort.

26    Status of this Memo

28       This Internet-Draft is submitted in full conformance with the
29       provisions of BCP 78 and BCP 79.
31       This document may contain material from IETF Documents or IETF
32       Contributions published or made publicly available before November
33       10, 2008. The person(s) controlling the copyright in some of this
34       material may not have granted the IETF Trust the right to allow
35       modifications of such material outside the IETF Standards Process.
36       Without obtaining an adequate license from the person(s)
37       controlling the copyright in such materials, this document may not
38       be modified outside the IETF Standards Process, and derivative
39       works of it may not be created outside the IETF Standards Process,
40       except to format it for publication as an RFC or to translate it
41       into languages other than English.

43       Internet-Drafts are working documents of the Internet Engineering
44       Task Force (IETF), its areas, and its working groups. Note that
45       other groups may also distribute working documents as Internet-
46       Drafts.

48       Internet-Drafts are draft documents valid for a maximum of six
49       months and may be updated, replaced, or obsoleted by other
50       documents at any time. It is inappropriate to use Internet-Drafts
51       as reference material or to cite them other than as "work in
52       progress."

54       The list of current Internet-Drafts can be accessed at
55       http://www.ietf.org/ietf/1id-abstracts.txt

57       The list of Internet-Draft Shadow Directories can be accessed at
58       http://www.ietf.org/shadow.html

60       This Internet-Draft will expire on April 21, 2014.

62    Copyright Notice

64       Copyright (c) 2013 IETF Trust and the persons identified as the
65       document authors. All rights reserved.

67       This document is subject to BCP 78 and the IETF Trust's Legal
68       Provisions Relating to IETF Documents
69       (http://trustee.ietf.org/license-info) in effect on the date of
70       publication of this document. Please review these documents
71       carefully, as they describe your rights and restrictions with
72       respect to this document.
         Code Components extracted from this
73       document must include Simplified BSD License text as described in
74       Section 4.e of the Trust Legal Provisions and are provided without
75       warranty as described in the Simplified BSD License.

77    Table of Contents

79       1. Introduction
80          1.1. Conventions used in this document
81          1.2. Terminology
82       2. Constructing the Shared Hierarchical Forwarding Chain
83          2.1. Databases
84          2.2. Constructing the forwarding chain from a downloaded route
85          2.3. Examples
86             2.3.1. Example 1: Forwarding Chain for iBGP ECMP
87             2.3.2. Example 2: Primary Backup Paths
88       3. Forwarding Behavior
89       4. Forwarding Chain Adjustment at a Failure
90          4.1. BGP-PIC core
91          4.2. BGP-PIC edge
92             4.2.1. Adjusting forwarding Chain in egress node failure
93             4.2.2. Adjusting Forwarding Chain on PE-CE link Failure
94             4.2.3. Loop Avoidance using Special Label (backup/repair
95             label)
96       5. Properties
97       6. Dependency
98       7. Security Considerations
99       8. IANA Considerations
100      9. Conclusions
101      10. References
102         10.1. Normative References
103         10.2. Informative References
104      11.
         Acknowledgments
105      Appendix A. Modification History
106            A.1.1. Changes from Version 01
107            A.1.2. Changes from Version 00

109   1. Introduction

111      As a path vector protocol, BGP is inherently slow due to the
112      serial nature of reachability propagation. BGP speakers exchange
113      reachability information about prefixes [2][3] and, for labeled
114      address families, namely AFI/SAFI 1/4, 2/4, 1/128, and 2/128, an
115      edge router assigns local labels to prefixes and associates the
116      local label with each advertised prefix such as L3VPN [6], 6PE
117      [7], and Softwire [5]. A BGP speaker then applies the path
118      selection steps to choose the best path. In modern networks, it is
119      not uncommon to have a prefix reachable via multiple edge routers.
120      In addition to proprietary techniques, multiple techniques have
121      been proposed to allow for more than one path for a given prefix
122      [4][9][10], whether in the form of equal cost multipath or
123      primary-backup. Another more common and widely deployed scenario
124      is L3VPN with multi-homed VPN sites.

126      This document proposes a hierarchical and shared forwarding chain
127      organization that allows traffic to be restored to a pre-calculated
128      alternative equal-cost primary path or backup path in a time
129      period that does not depend on the number of BGP prefixes. The
130      technique relies on internal router behavior that is completely
131      transparent to the operator and can be incrementally deployed and
132      enabled with zero operator intervention.

134   1.1. Conventions used in this document

136      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
137      NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
138      in this document are to be interpreted as described in RFC-2119
139      [1].
141      In this document, these words will appear with that interpretation
142      only when in ALL CAPS. Lower case uses of these words are not to
143      be interpreted as carrying RFC-2119 significance.

145   1.2. Terminology

147      This section defines the terms used in this document. For ease of
148      use, we will use terms similar to those used by L3VPN [6].

150      o  BGP prefix: It is a prefix P/m (of any AFI/SAFI) that a BGP
151         speaker has a path for.

153      o  IGP prefix: It is a prefix P/m (of any AFI/SAFI) that is learnt
154         via an Interior Gateway Protocol, such as OSPF or IS-IS.
155         The prefix may be learnt directly through the IGP or
156         redistributed from other protocol(s).

158      o  CE: It is an external router through which an egress PE can
159         reach a prefix P/m.

161      o  Ingress PE, "iPE": It is a BGP speaker that learns about a
162         prefix through another IBGP peer and chooses that IBGP peer as
163         the next-hop for the prefix.

165      o  Path: It is the next-hop in a sequence of unique connected
166         nodes starting from the current node and ending with the
167         destination node or network identified by the prefix.

169      o  Recursive path: It is a path consisting only of the IP address
170         of the next-hop without the outgoing interface. Subsequent
171         lookups are needed to determine the outgoing interface.

173      o  Non-recursive path: It is a path consisting of the IP address
174         of the next-hop and one outgoing interface.

176      o  Primary path: It is a recursive or non-recursive path that can
177         be used all the time. A prefix can have more than one primary
178         path.

180      o  Backup path: It is a recursive or non-recursive path that can
181         be used only after some or all primary paths become unreachable.

183      o  Leaf: A leaf is a container data structure for a prefix or local
184         label. Alternatively, it is the data structure that contains
185         prefix specific information.

187      o  IP leaf: It is the leaf corresponding to an IPv4 or IPv6 prefix.

189      o  Label leaf:
         It is the leaf corresponding to a locally allocated
190         label such as the VPN label on an egress PE [6].

192      o  Pathlist: It is an array of paths used by one or more prefixes to
193         forward traffic to destination(s) covered by an IP prefix. Each
194         path in the pathlist carries its "path-index" that identifies
195         its position in the array of paths. A pathlist may contain a
196         mix of primary and backup paths.

198      o  OutLabel-Array: Each labeled prefix is associated with an
199         OutLabel-Array. The OutLabel-Array is a list of one or more
200         outgoing labels and/or label actions where each label or label
201         action has 1-to-1 correspondence to a path in the pathlist. The
202         number of entries in the OutLabel-Array is identical to the
203         number of paths in the pathlist and the ith outlabel entry is
204         associated with the path whose path-index is "i". Label actions
205         are: push the label, pop the label, or swap the incoming label
206         with the outlabel. The prefix may be an IGP or BGP prefix.

208      o  Adjacency: It is the layer 2 encapsulation leading to the layer
209         3 directly connected next-hop.

211      o  Dependency: An object X is said to be a dependent or child of
212         object Y if object Y cannot be deleted unless object X is no
213         longer a dependent/child of object Y.

215      o  Route: It is a prefix with one or more paths associated with
216         it. Hence the minimum set of objects needed to construct a
217         route is a leaf and a pathlist.

219   2. Constructing the Shared Hierarchical Forwarding Chain

221   2.1. Databases

223      The Forwarding Information Base (FIB) on a router maintains 3 basic
224      databases:

226      o  Pathlist-DB: A pathlist is uniquely identified by the list of
227         paths. The Pathlist-DB contains the set of all shared pathlists.

229      o  Leaf-DB: A leaf is uniquely identified by the prefix or the label.

231      o  Adjacency-DB: An adjacency is uniquely identified by the outgoing
232         layer 3 interface and the IP address of the next-hop directly
233         connected to the layer 3 interface.
         The Adjacency-DB contains the
234         list of all adjacencies.

236   2.2. Constructing the forwarding chain from a downloaded route

238      1. A prefix with a list of paths is downloaded to FIB from BGP. For
239         labeled prefixes, an OutLabel-Array and possibly a local label
240         (e.g. for a VPN [6] prefix on an egress PE) are also downloaded.

242      2. If the prefix does not exist, construct a new IP leaf from the
243         downloaded prefix. If a local label is allocated, construct a
244         label leaf from the local label.

246      3. Construct an OutLabel-Array and attach the OutLabel-Array to the
247         IP and label leaf.

249      4. The list of paths attached to the route is looked up in the
250         pathlist-DB.

252      5. If a pathlist PL is found

254         a. Retrieve the pathlist

256      6. Else

258         a. Construct a new pathlist

260         b. Insert the new pathlist in the pathlist-DB

262         c. Resolve the paths of the pathlist as follows

264         d. Recursive path:

266            i. Lookup the next-hop in the leaf-DB

268            ii. If a leaf with at least one reachable path is found, add
269                the path to the dependency list of the leaf

271            iii. Otherwise the path remains unresolved and cannot be used
272                 for forwarding

274         e. Non-recursive path:

276            i. Lookup the next-hop and outgoing interface in the
277               adjacency-DB

279            ii. If an adjacency is found, add the path to the dependency
280                list of the adjacency

282            iii. Otherwise, create a new adjacency and add the path to
283                 its dependency list

285      7. Attach the leaf(s) as (a) dependent(s) of the pathlist
286      As a result of the above steps, a forwarding chain starting with a
287      leaf and ending with one or more adjacencies is constructed. It is
288      worth noting that the forwarding chain is constructed
289      without any operator intervention at all.

291   2.3. Examples

293      This section outlines two examples that we will use for illustration
294      for the rest of the document. The examples use a standard multihomed
295      VPN [6] prefix in a BGP-free core running LDP. The topology is
296      depicted in Figure 1.
298       +-----------------------------------+
299       |                                   |
300       |           LDP Core                |
301       |                                   |
302       |                                 ePE2
303       |                                   |\
304       |                                   | \
305       |                                   |  \
306       |                                   |   \
307      iPE                                  |    CE.......VRF "Blue"
308       |                                   |   /   (VPN-P1)
309       |                                   |  /    (VPN-P2)
310       |                                   | /
311       |                                   |/
312       |                                 ePE1
313       |                                   |
314       |                                   |
315       |                                   |
316       +-----------------------------------+
317          Figure 1 VPN prefix reachable via multiple PEs

319      The first example is an illustration of ECMP while the second
320      example is an illustration of primary-backup paths.

322   2.3.1. Example 1: Forwarding Chain for iBGP ECMP

324      Consider the case of the ingress PE (iPE) in the multi-homed VPN
325      prefixes depicted in Figure 1. Suppose the iPE receives route
326      advertisements for the VPN prefixes VPN-P1 and VPN-P2 from two
327      egress PEs, ePE1 and ePE2 with next-hop BGP-NH1 and BGP-NH2,
328      respectively. Assume that ePE1 advertises the VPN labels VPN-L11 and
329      VPN-L12 while ePE2 advertises the VPN labels VPN-L21 and VPN-L22 for
330      VPN-P1 and VPN-P2, respectively. Suppose that BGP-NH1 and BGP-NH2
331      are resolved via the IGP prefixes IGP-P1 and IGP-P2, which also
332      happen to have 2 ECMP paths with IGP-NH1 and IGP-NH2 reachable via
333      the interfaces I1 and I2. Suppose that LDP on the downstream LSRs
334      for IGP-P1 and IGP-P2 assigns the LDP labels LDP-L1 and LDP-L2 to
335      the prefixes IGP-P1 and IGP-P2. The forwarding chain on the ingress
336      PE "iPE" for the VPN prefixes is depicted in Figure 2.
338                BGP OutLabel Array
339                    +---------+
340                    | VPN-L11 |
341               +--->+---------+
342               |    | VPN-L21 |
343               |    +---------+            IGP OutLabel Array
344               |                                 +---------+
345               |                                 | LDP-L11 |
346               |                             +-->+---------+
347               |                             |   | LDP-L21 |
348   VPN-P1------+                             |   +---------+
349               |                             |
350               |                             |
351               |                  IGP-P1-----+
352               |                     ^       |
353               |                     |       |
354               V                     |       V   IGP Pathlist
355            +--------+               | +-------------+
356            |BGP-NH1 |---------------+ | IGP-NH1, I1 |------>adj1
357     BGP    +--------+                 +-------------+
358   Pathlist |BGP-NH2 |----+            | IGP-NH2, I2 |------>adj2
359            +--------+    |            +-------------+
360               ^          |                  ^
361               |          |                  |
362               |          |                  |
363               |       IGP-P2----------------+
364               |          |
365               |          |
366   VPN-P2------+          |    +---------+
367               |          |    | LDP-L12 |
368               |          +--->+---------+
369               |               | LDP-L22 |
370               |               +---------+
371               |    +---------+    IGP OutLabel Array
372               |    | VPN-L12 |
373               +--->+---------+
374                    | VPN-L22 |
375                    +---------+
376                BGP OutLabel Array

378        Figure 2 Forwarding Chain for VPN Prefixes with iBGP ECMP

380      The structure depicted in Figure 2 illustrates the two important
381      properties discussed in this memo: sharing and hierarchy. We can
382      see that both the BGP and IGP pathlists are shared among
383      multiple BGP and IGP prefixes, respectively. At the same time, the
384      forwarding chain objects depend on each other in a child-parent
385      relation instead of being collapsed into a single level.

387   2.3.2. Example 2: Primary Backup Paths

389      Consider the egress PE ePE1 in the case of the multi-homed VPN
390      prefixes in the BGP-free LDP core depicted in Figure 1. Suppose ePE1
391      determines that the primary path is the external path but the backup
392      path is the iBGP path to the other PE ePE2 with next-hop BGP-NH2.
393      ePE1 constructs the forwarding chain depicted in Figure 3. We are
394      only showing a single VPN prefix for simplicity.
         But all prefixes
395      that are multihomed to ePE1 and ePE2 share the BGP pathlist.

397                     BGP OutLabel Array
398     VPN-L11            +---------+
399   (Label-leaf)---+---->|Unlabeled|
400                  |     +---------+
401                  |     | VPN-L21 |
402                  |     | (swap)  |
403                  |     +---------+
404                  |         ^
405                  |         |        BGP Pathlist
406                  |         |       +------------+  Connected route
407                  |         |       |   CE-NH    |------>(to the CE)
408                  |         |       |path-index=0|
409                  |         |       +------------+
410                  V         |       |  VPN-NH2   |
411   VPN-P1 ------------------+------>|  (backup)  |------>IGP Leaf
412   (IP prefix leaf)                 |path-index=1|    (Towards ePE2)
413                                    +-----+------+

415     Figure 3 VPN Prefix Forwarding Chain with eiBGP paths on egress PE

417      The example depicted in Figure 3 differs from the example in Figure
418      2 in two main aspects. First, as long as the primary path towards the
419      CE (external path) is usable, it will be the only path used for
420      forwarding while the OutLabel-Array contains both the unlabeled
421      label (primary path) and the VPN label (backup path) advertised by
422      the backup path ePE2. The second aspect is the presence of the label
423      leaf corresponding to the VPN prefix. This label leaf is used to
424      match VPN traffic arriving from the core. Note that the label leaf
425      shares the OutLabel-Array and the pathlist with the IP prefix.

427   3. Forwarding Behavior

429      When a packet arrives, it matches a leaf. A labeled packet matches a
430      label leaf while an IP packet matches an IP prefix leaf. The
431      forwarding engine walks the forwarding chain starting from the leaf
432      until the walk terminates on an adjacency. Thus when a packet
433      arrives, the chain is walked as follows:

435      1. Lookup the leaf based on the destination address or the label at
436         the top of the packet

438      2. Retrieve the parent pathlist of the leaf

440      3. Pick the outgoing path from the list of resolved paths in the
441         pathlist. The method by which the outgoing path is picked is
442         beyond the scope of this document (e.g. flow-preserving hash
443         exploiting entropy within the MPLS stack and IP header).
            Let the
444         "path-index" of the outgoing path be "i".

446      4. If the prefix is labeled, use the "path-index" "i" to retrieve
447         the ith label "Li" stored in the ith entry in the OutLabel-Array and
448         apply the label action of the label on the packet (e.g. for VPN
449         label on the ingress PE, the label action is "push").

451      5. Move to the parent of the chosen path "i"

453      6. If the chosen path "i" is recursive, move to its parent prefix
454         and go to step 2

456      7. If the chosen path "i" is non-recursive, move to its parent
457         adjacency

459      8. Encapsulate the packet in the L2 string specified by the
460         adjacency and send the packet out.

462      Let's apply the above forwarding steps to the example described
463      in Figure 2 Section 2.3.1. Suppose a packet arrives at ingress PE
464      iPE from an external neighbor. Assume the packet matches the VPN
465      prefix VPN-P1. While walking the forwarding chain, the forwarding
466      engine applies a hashing algorithm to choose the path and the hashing
467      at the BGP level yields path 0 while the hashing at the IGP level
468      yields path 1. In that case, the packet will be sent out of
469      interface I2 with the label stack "LDP-L21,VPN-L11".

471   4. Forwarding Chain Adjustment at a Failure

473      The hierarchical and shared structure of the forwarding chain
474      explained in Section 2 allows modifying a small number of
475      forwarding chain objects to re-route traffic to a pre-calculated
476      equal-cost or backup path without the need to modify the possibly
477      very large number of BGP prefixes. In this section, we go over
478      various core and edge failure scenarios to illustrate how the FIB
479      manager can utilize the forwarding chain structure to achieve prefix
480      independent convergence.

482   4.1. BGP-PIC core

484      This section describes the adjustments to the forwarding chain when
485      a core link or node fails but the BGP next-hop remains reachable.

487      There are two cases: remote link failure and attached link failure.
488      Node failures are treated as link failures.

490      When a remote link or node fails, IGP receives an advertisement
491      indicating a topology change so IGP re-converges to either find a
492      new next-hop and outgoing interface or remove the path completely
493      from the IGP prefix used to resolve BGP next-hops. IGP and LDP
494      download the modified IGP leaves with modified outgoing labels for
495      a labeled core. The FIB manager modifies the existing IGP leaf by
496      executing the steps outlined in Section 2.2.

498      When a local link fails, the FIB manager detects the failure almost
499      immediately. The FIB manager marks the impacted path(s) as unusable
500      so that only usable paths are used to forward packets. Note that in
501      this particular case there is actually no need even to backwalk to
502      IGP leaves to adjust the OutLabel-Arrays because FIB can rely on the
503      path-index stored in the usable paths in the loadinfo to pick the
504      right label.

506      It is worth noting that because the FIB manager modifies the
507      forwarding chain starting from the IGP leaves only, BGP pathlists
508      and leaves are not modified. Hence traffic restoration occurs within
509      the time frame of IGP convergence, and, for local link failure,
510      within the timeframe of local detection. Thus it is possible to
511      achieve sub-50 msec convergence as described in [8] for local link
512      failure.

514      Let's apply the procedure to the forwarding chain depicted in Figure
515      2 Section 2.3.1. Suppose a remote link failure occurs and impacts
516      the first ECMP IGP path to the remote BGP nhop. Upon IGP
517      convergence, the IGP pathlist of the BGP nhop is updated to reflect
518      the new topology (one path instead of two). As soon as the IGP
519      convergence is effective for the BGP nhop entry, the new forwarding
520      state is immediately available to all dependent BGP prefixes. The
521      same behavior would occur if the failure was local such as an
522      interface going down.
         As soon as the IGP convergence is complete for
523      the BGP nhop IGP route, all its depending BGP routes benefit from
524      the new path. In fact, upon local failure, if LFA protection is
525      enabled for the IGP route to the BGP nhop and a backup path was pre-
526      computed and installed in the pathlist, upon the local interface
527      failure, the LFA backup path is immediately activated (sub-50msec)
528      and thus protection benefits all the depending BGP traffic through
529      the hierarchical forwarding dependency between the routes.

531   4.2. BGP-PIC edge

533      This section describes the adjustments to the forwarding chains as a
534      result of edge node or edge link failure.

536   4.2.1. Adjusting forwarding Chain in egress node failure

538      When an edge node fails, IGP on neighboring core nodes sends route
539      updates indicating that the edge node is no longer reachable. IGP
540      running on the iBGP peers instructs FIB to remove the IP and label
541      leaves corresponding to the failed edge node from FIB. So the FIB
542      manager performs the following steps:

544      o  FIB manager deletes the IGP leaf corresponding to the failed edge
545         node

547      o  FIB manager backwalks to all dependent BGP pathlists and marks
548         the path using the deleted IGP leaf as unresolved

550      o  Note that there is no need to modify BGP leaves because each path
551         in the pathlist carries its path index and hence the correct
552         outgoing label will be picked. So for example in the forwarding
553         chain depicted in Figure 2, if the 1st path becomes unresolved,
554         then the forwarding engine will only use the second path for
555         forwarding. Yet the path-index of that single resolved path will
556         still be 1 and hence the label VPN-L21 or VPN-L22 will be pushed

558   4.2.2. Adjusting Forwarding Chain on PE-CE link Failure

560      Suppose the link between an edge router and its external peer fails.
561      There are two scenarios: (1) the edge node attached to the failed
562      link performs next-hop self and (2) the edge node attached to the
563      failure advertises the IP address of the failed link as the next-hop
564      attribute to its iBGP peers.

566      In the first case, the rest of the iBGP peers will remain unaware of
567      the link failure and will continue to forward traffic to the edge node
568      until the edge node attached to the failed link withdraws the BGP
569      prefixes. If the destination prefixes are multi-homed to another
570      iBGP peer, say ePE2, then the FIB manager on the edge router detecting
571      the link failure performs the following tasks:
572      o  FIB manager backwalks to the BGP pathlists and marks the path
573         through the failed link to the external peer as unresolved

575      o  Hence traffic will be forwarded using the backup path towards ePE2

577      o  For labeled traffic

579         o  The OutLabel-Array attached to the BGP leaves already
580            contains an entry corresponding to the path towards ePE2.

582         o  The label entry in OutLabel-Arrays corresponding to the
583            internal path to ePE2 has swap action and the label
584            advertised by ePE2

586         o  For an arriving labeled packet (e.g. VPN), the top label is
587            swapped with the label advertised by ePE2

589      o  For unlabeled traffic, packets are simply redirected towards ePE2

591      In the second case where the edge router uses the IP address of the
592      failed link as the BGP next-hop, the edge router will still perform
593      the previous steps. But, unlike the case of next-hop self, IGP on the
594      failed edge node informs the rest of the iBGP peers that the IP address
595      of the failed link is no longer reachable. Hence the FIB manager on
596      iBGP peers will delete the IGP leaf corresponding to the IP prefix
597      of the failed link. The behavior of the iBGP peers will be identical
598      to the case of edge node failure outlined in Section 4.2.1.
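
As a minimal sketch of the next-hop-self case above, the following Python models the egress PE's shared BGP pathlist and the OutLabel-Array of Figure 3. The class and function names (Path, Pathlist, pick_path, on_pe_ce_link_failure) are hypothetical and do not come from any real FIB implementation, and path selection is reduced to "first usable candidate" where a real forwarding engine would hash. The point is only that the repair is one flag change on the shared pathlist, after which the path-index selects the pre-installed swap entry.

```python
from dataclasses import dataclass


@dataclass
class Path:
    next_hop: str
    path_index: int        # position in the pathlist; also indexes the OutLabel-Array
    backup: bool = False
    resolved: bool = True


@dataclass
class Pathlist:
    paths: list

    def pick_path(self):
        """Prefer resolved primary paths; fall back to resolved backup paths."""
        primaries = [p for p in self.paths if p.resolved and not p.backup]
        backups = [p for p in self.paths if p.resolved and p.backup]
        candidates = primaries or backups
        if not candidates:
            raise LookupError("no usable path")
        return candidates[0]  # simplification: a real FIB would hash here


def on_pe_ce_link_failure(pathlist, failed_next_hop):
    """Result of the backwalk: mark the path through the failed link unresolved."""
    for p in pathlist.paths:
        if p.next_hop == failed_next_hop:
            p.resolved = False


# Pathlist shared by every prefix multihomed to ePE1 and ePE2 (Figure 3).
bgp_pathlist = Pathlist([Path("CE-NH", 0), Path("VPN-NH2", 1, backup=True)])
# OutLabel-Array for VPN-P1: entry i corresponds to path-index i.
out_labels = [("unlabeled", None), ("swap", "VPN-L21")]

before = bgp_pathlist.pick_path()
on_pe_ce_link_failure(bgp_pathlist, "CE-NH")
after = bgp_pathlist.pick_path()
action_after = out_labels[after.path_index]
print(before.next_hop, "->", after.next_hop, action_after)
```

No leaf and no OutLabel-Array is rewritten in this sketch; only the `resolved` flag of one path changes, which is why the repair time does not grow with the number of prefixes sharing the pathlist.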
600      It is worth noting that because the edge link failure is
601      local to the edge router, sub-50 msec convergence can be achieved as
602      described in [8].

604      Let's try to apply the case of next-hop self to the forwarding chain
605      depicted in Figure 3. After failure of the link between ePE1 and CE,
606      the forwarding engine will route traffic arriving from the core
607      towards VPN-NH2 with path-index=1. A packet arriving from the core
608      will contain the label VPN-L11 at top. The label VPN-L11 is swapped
609      with the label VPN-L21 and the packet is forwarded towards ePE2.

611   4.2.3. Loop Avoidance using Special Label (backup/repair label)

613      The adjustment of the forwarding chain for edge link failure as
614      specified in Section 4.2.2 can lead to loops in the following
615      scenarios:

617      o  Unlabeled traffic when the iBGP and eBGP paths are treated as
618         ECMP

620      o  Unlabeled traffic if there is an AS-wide single best path such as
621         the case where the MED or LOCAL_PREF [2] is used to determine the
622         best path

624      o  Labeled and unlabeled traffic if the edge link failure was due to
625         an external peer failure and the external peer is common to both
626         edge nodes. This scenario results in edge link failure on both
627         iBGP peers and may result in a mutual loop.

629      This section proposes advertising a special label as a path
630      attribute to avoid the possibility of looping. When an edge router
631      has an external path, whether this path is the BGP best path [2] or
632      not [4][10][9], the edge router associates a non-transitive path
633      attribute containing a backup/repair label. The semantics of the
634      backup/repair label is as follows: A packet arriving with the
635      backup/repair label at the top MUST either be sent outside the AS
636      or dropped. Details for backup/repair label can be found in [TBD].

638   5.
      Properties

640   5.1 Coverage

642      All the possible failures are covered, whether they impact a local
643      or remote IGP path or a local or remote BGP nhop as described in
644      Section 4. This section provides details for each failure and how
645      the hierarchical and shared FIB structure proposed in this document
646      allows recovery that does not depend on the number of BGP prefixes.

648   5.1.1 A remote failure on the path to a BGP nhop

650      Upon IGP convergence, the IGP leaf for the BGP nhop is updated
651      and all the depending BGP routes leverage the new
652      IGP forwarding state immediately.

654      This BGP resiliency property only depends on IGP convergence and is
655      independent of the number of BGP prefixes impacted.

657   5.1.2 A local failure on the path to a BGP nhop

659      Upon LFA protection, the IGP leaf for the BGP nhop is updated to use
660      the precomputed LFA backup path and all the depending BGP routes
661      leverage this LFA protection.

663      This BGP resiliency property only depends on LFA protection and is
664      independent of the number of BGP prefixes impacted.

666   5.1.3 A remote iBGP nhop fails

668      Upon IGP convergence, the IGP leaf for the BGP nhop is deleted and
669      all the depending BGP Path-Lists are updated to either use the
670      remaining ECMP BGP best-paths or, if none remains available, to
671      activate precomputed backups.

673      This BGP resiliency property only depends on IGP convergence and is
674      independent of the number of BGP prefixes impacted.

676   5.1.4 A local eBGP nhop fails

678      Upon local link failure detection, the adjacency to the BGP nhop is
679      deleted and all the depending BGP Path-Lists are updated to either
680      use the remaining ECMP BGP best-paths or, if none remains available,
681      to activate precomputed backups.

683      This BGP resiliency property only depends on local link failure
684      detection and is independent of the number of BGP prefixes impacted.
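
What the four cases above have in common is that the repair touches a shared object rather than each BGP leaf. The Python sketch below illustrates this under simplifying assumptions: pathlists are de-duplicated in a dictionary keyed by their tuple of next-hops (in the spirit of the Pathlist-DB of Section 2.1), and a failure is repaired by writing to the shared pathlist only. All names (SharedPathlist, PathlistDB, install_route, fail_next_hop) and the per-prefix label strings are hypothetical, chosen for illustration.

```python
class SharedPathlist:
    def __init__(self, next_hops):
        self.next_hops = tuple(next_hops)
        self.resolved = [True] * len(self.next_hops)
        self.leaves = []                  # dependent BGP leaves (prefixes)

    def usable_indexes(self):
        """Path-indexes of the paths that can still forward traffic."""
        return [i for i, ok in enumerate(self.resolved) if ok]


class PathlistDB:
    """A pathlist is uniquely identified by its list of paths."""

    def __init__(self):
        self._by_paths = {}

    def get_or_create(self, next_hops):
        key = tuple(next_hops)
        if key not in self._by_paths:
            self._by_paths[key] = SharedPathlist(key)
        return self._by_paths[key]

    def __len__(self):
        return len(self._by_paths)

    def fail_next_hop(self, next_hop):
        """Mark every path via next_hop unresolved; return the number of writes."""
        writes = 0
        for pl in self._by_paths.values():
            for i, nh in enumerate(pl.next_hops):
                if nh == next_hop and pl.resolved[i]:
                    pl.resolved[i] = False
                    writes += 1
        return writes


def install_route(db, leaf_db, prefix, next_hops, out_labels):
    """Attach a leaf to the (possibly pre-existing) shared pathlist."""
    pl = db.get_or_create(next_hops)
    leaf_db[prefix] = {"pathlist": pl, "out_labels": list(out_labels)}
    pl.leaves.append(prefix)


db, leaf_db = PathlistDB(), {}
for n in range(10_000):
    # Hypothetical label names: one VPN label per egress PE per prefix.
    install_route(db, leaf_db, f"VPN-P{n}", ["BGP-NH1", "BGP-NH2"],
                  [f"VPN-L1-{n}", f"VPN-L2-{n}"])

writes = db.fail_next_hop("BGP-NH1")      # e.g. the first egress PE fails
leaf = leaf_db["VPN-P42"]
surviving = leaf["pathlist"].usable_indexes()
label = leaf["out_labels"][surviving[0]]
print(len(leaf_db), "prefixes,", len(db), "pathlist,", writes, "write ->", label)
```

Because each leaf keeps its full OutLabel-Array and the surviving path keeps its path-index, the leaves need no update: the same single write repairs 10,000 prefixes here and would repair a million in the same way.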
686   5.2 Performance

688      When the failure is local (a local IGP nhop failure or a local eBGP
689      nhop failure), a pre-computed and pre-installed backup is activated
690      by a local-protection mechanism that does not depend on the number
691      of BGP destinations impacted by the failure. Sub-50msec is thus
692      possible even if millions of BGP routes are impacted.

694      When the failure is remote (a remote IGP failure not impacting the
695      BGP nhop or a remote BGP nhop failure), an alternate path is
696      activated upon IGP convergence. All the impacted BGP destinations
697      benefit from a working alternate path as soon as the IGP convergence
698      occurs for their impacted BGP nhop even if millions of BGP routes
699      are impacted.

701   5.2.1 Perspective

703      The following table puts the BGP PIC benefits in perspective
704      assuming

706      o  1M impacted BGP prefixes

708      o  IGP convergence ~ 500 msec

710      o  local protection ~ 50msec

712      o  FIB Update per BGP destination ~ 100usec conservative,

714         ~ 10usec optimistic

716      o  BGP Convergence per BGP destination ~ 200usec conservative,

718         ~ 100usec optimistic

720                              Without PIC        With PIC

722      Local IGP Failure       10 to 100sec       50msec

724      Local BGP Failure       100 to 200sec      50msec

726      Remote IGP Failure      10 to 100sec       500msec

728      Remote BGP Failure      100 to 200sec      500msec

730      Upon local IGP nhop failure or remote IGP nhop failure, the existing
731      primary BGP nhop is intact and usable hence the resiliency only
732      depends on the ability of the FIB mechanism to reflect the new path
733      to the BGP nhop to the depending BGP destinations. Without BGP PIC,
734      a conservative back-of-the-envelope estimation for this FIB update
735      is 100usec per BGP destination. An optimistic estimation is 10usec
736      per entry.

738      Upon local BGP nhop failure or remote BGP nhop failure, without the
739      BGP PIC mechanism, a new BGP Best-Path needs to be recomputed and
740      new updates need to be sent to peers.
   This depends on BGP processing time that will be shared between
   best-path computation, RIB update and peer update. A conservative
   back-of-the-envelope estimation for this is 200usec per BGP
   destination. An optimistic estimation is 100usec per entry. With 1M
   impacted prefixes, this yields 100 to 200sec.

5.3 Automated

   The BGP PIC solution does not require any operator involvement. The
   process is entirely automated as part of the FIB implementation.

   The salient points enabling this automation are:

   o  Extension of the BGP Best Path to compute a backup BGP nhop [11]

   o  Sharing of the BGP Path-List across BGP destinations with the
      same primary and backup BGP nhop

   o  Hierarchical indirection and dependency between the BGP Path-List
      and the IGP Path-List

5.4 Incremental Deployment

   As soon as a router supports the BGP PIC solution, that router gains
   all of its benefits without any requirement for other routers to
   support BGP PIC.

6. Dependency

   This section describes the functionality required in the forwarding
   and control planes to support BGP PIC as described in this document.

6.1 Hierarchical Hardware FIB

   BGP PIC requires hierarchical hardware FIB support: for each BGP
   forwarded packet, a BGP leaf is looked up, then a BGP Path-List is
   consulted, then an IGP Path-List, then an Adjacency.

   An alternative method consists of "flattening" the dependencies when
   programming the BGP destinations into the HW FIB, potentially
   eliminating both the BGP Path-List and IGP Path-List consultation.
   Such an approach decreases the number of memory lookups per
   forwarding operation at the expense of increased HW FIB memory
   (flattening means less sharing, hence duplication), loss of ECMP
   properties (flattening means less path-list entropy), and loss of
   BGP PIC properties.
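   The trade-off between the hierarchical lookup chain and flattening
   can be sketched in Python as follows. This is an illustration under
   stated assumptions, not a description of any hardware: the table
   names, the path-list identifier "PL1", the nodes PE1/PE2 and the
   adjacencies adj-A/adj-B are hypothetical, and applying the per-entry
   FIB update cost of Section 5.2.1 to a flattened FIB is an assumption
   of this sketch.

```python
# Hierarchical tables: each level only points at the next one.
bgp_leaves = {f"prefix-{i}": "PL1" for i in range(1000)}  # prefix -> Path-List
bgp_path_lists = {"PL1": ["PE1", "PE2"]}            # primary first, then backup
igp_path_lists = {"PE1": "adj-A", "PE2": "adj-B"}   # BGP nhop -> adjacency


def lookup_hierarchical(prefix):
    """BGP leaf, then BGP Path-List, then IGP Path-List, then adjacency."""
    nhop = bgp_path_lists[bgp_leaves[prefix]][0]
    return igp_path_lists[nhop]


def flatten():
    """Resolve every prefix straight to its adjacency (flattened FIB)."""
    return {prefix: lookup_hierarchical(prefix) for prefix in bgp_leaves}


def repair_hierarchical(path_list_id):
    """Promote the backup nhop by editing the one shared path-list."""
    bgp_path_lists[path_list_id].pop(0)
    return 1  # number of objects rewritten


def repair_flat(flat, old_adj, new_adj):
    """Rewrite every flattened entry that pointed at the failed path."""
    writes = 0
    for prefix, adj in flat.items():
        if adj == old_adj:
            flat[prefix] = new_adj
            writes += 1
    return writes


def flat_update_seconds(num_prefixes, usec_per_entry):
    """Total repair time if every entry must be rewritten one by one."""
    return num_prefixes * usec_per_entry / 1_000_000


flat = flatten()                          # snapshot taken before the failure
hier_writes = repair_hierarchical("PL1")  # one shared object is edited
flat_writes = repair_flat(flat, "adj-A", "adj-B")  # one write per prefix
```

   Both structures forward identically after the repair, but the
   flattened table needed one write per prefix. Under the per-entry
   figures of Section 5.2.1, flat_update_seconds(1_000_000, 10) and
   flat_update_seconds(1_000_000, 100) bracket the 10 to 100sec range
   quoted there.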
6.2 Availability of a secondary BGP next-hop

   When the primary BGP nhop fails, BGP PIC depends on the availability
   of a pre-computed and pre-installed secondary BGP nhop in the BGP
   Path-List.

   A secondary next-hop can be expected to exist for the following
   reason: a service that cares about network availability requires two
   disjoint network connections, hence two BGP nhops.

   The BGP distribution of the secondary next-hop is simple thanks to
   the following BGP mechanisms: Add-Path [9], BGP Best-External [4],
   diverse path [10], and the frequent use in VPN deployments of
   different VPN RDs per PE.

6.3 Pre-Computation of a secondary BGP nhop

   [11] describes how a secondary BGP nhop can be precomputed on a per
   BGP destination basis.

7. Security Considerations

   No additional security risk is introduced by using the mechanisms
   proposed in this document.

8. IANA Considerations

   This document has no requirements for IANA.

9. Conclusions

   This document proposes a hierarchical and shared forwarding chain
   structure that allows achieving prefix independent convergence and,
   in the case of locally detected failures, sub-50msec convergence. A
   router can construct the forwarding chains in a completely
   transparent manner with zero operator intervention. The structure
   also supports incremental deployment.

10. References

10.1. Normative References

   [1]  Bradner, S., "Key words for use in RFCs to Indicate
        Requirement Levels", BCP 14, RFC 2119, March 1997.
   [2]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway Protocol
        4 (BGP-4)", RFC 4271, January 2006.
   [3]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
        "Multiprotocol Extensions for BGP", RFC 4760, January 2007.

10.2. Informative References

   [4]  Marques, P., Fernando, R., Chen, E., Mohapatra, P., and H.
        Gredler, "Advertisement of the best external route in BGP",
        draft-ietf-idr-best-external-04.txt, April 2011.
   [5]  Wu, J., Cui, Y., Metz, C., and E. Rosen, "Softwire Mesh
        Framework", RFC 5565, June 2009.
   [6]  Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
        Networks (VPNs)", RFC 4364, February 2006.
   [7]  De Clercq, J., Ooms, D., Prevost, S., and F. Le Faucheur,
        "Connecting IPv6 Islands over IPv4 MPLS Using IPv6 Provider
        Edge Routers (6PE)", RFC 4798, February 2007.
   [8]  Bonaventure, O., Filsfils, C., and P. Francois, "Achieving
        sub-50 milliseconds recovery upon BGP peering link failures",
        IEEE/ACM Transactions on Networking, 15(5):1123-1135, 2007.
   [9]  Walton, D., Chen, E., Retana, A., and J. Scudder,
        "Advertisement of Multiple Paths in BGP",
        draft-ietf-idr-add-paths-07.txt, June 2012.
   [10] Raszuk, R., Fernando, R., Patel, K., McPherson, D., and K.
        Kumaki, "Distribution of diverse BGP paths",
        draft-ietf-grow-diverse-bgp-path-dist-08.txt, July 2012.
   [11] Mohapatra, P., Fernando, R., Filsfils, C., and R. Raszuk, "Fast
        Connectivity Restoration Using BGP Add-path",
        draft-pmohapat-idr-fast-conn-restore-02, October 2011.

11. Acknowledgments

   Special thanks to Neeraj Malhotra and Yuri Tsier for their valuable
   help.

   This document was prepared using 2-Word-v2.0.template.dot.

Appendix A. Modification History

A.1.1. Changes from Version 01

   Some editorial corrections.

A.1.2. Changes from Version 00

   A few editorial corrections.

Authors' Addresses

   Ahmed Bashandy
   Cisco Systems
   170 West Tasman Dr, San Jose, CA 95134
   Email: bashandy@cisco.com

   Clarence Filsfils
   Cisco Systems
   Brussels, Belgium
   Email: cfilsfil@cisco.com

   Prodosh Mohapatra
   Cumulus Networks
   Email: pmohapat@cumulusnetworks.com