idnits 2.17.1 draft-francis-intra-va-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** The document seems to lack a License Notice according IETF Trust Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009 Section 6.b -- however, there's a paragraph with a matching beginning. Boilerplate error? (You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Feb 2009 rather than one of the newer Notices. See https://trustee.ietf.org/license-info/.) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: 1. A VA router must never FIB-install a sub-prefix route for which there is no tunnel to the BGP NEXT_HOP address. This is to prevent a loop whereby the APR forwards the packet hop-by-hop towards the next hop, but a router on the path that has FIB-suppressed the sub-prefix forwards it back to the APR. 
If there is an alternate route to the sub-prefix for which there is a tunnel, then that route should be selected, even if it is less attractive according to the normal BGP best path selection algorithm. 2. If the router is an APR, a route for every sub-prefix within the VP MUST be FIB-installed (subject to the above limitation that there be a tunnel). 3. If a non-APR router has a sub-prefix route that does not fall within any VP (as determined by the VP-List), then the route must be installed. This may occur because the ISP hasn't defined a VP covering that prefix, for instance during an incremental deployment buildup. 4. If a non-APR router does not have a route for a known VP, then it MAY or MAY NOT install sub-prefixes within that VP. Whether or not it does is up to the vendor and the network operator. One approach is to never install such sub-prefixes, on the assumption that the network operator will engineer his network so that this rarely if ever happens. 5. Another approach is to have routers install such sub-prefixes, but taking care not to do so if the missing VP route is a transient condition. For instance, if the router is booting up, and simply has not yet received all of its routes, then it can reasonably expect to receive a VP route soon and so SHOULD NOT install the sub-prefixes. On the other hand, if a continuously operating router had only a single remaining route for the VP, and that route is withdrawn, then the router might not expect to receive a replacement VP route soon and so SHOULD install the sub-prefixes. Obviously a router can't predict the future with certainty, so the following algorithm might be a useful way to manage whether or not to install sub-prefixes for a non-existing VP route: * Define a timer MISSING_VP_TIMER, set for a relatively short time (say 10 seconds or so). 
* Start the timer when either: 1) the last VP route is withdrawn, or 2) there are initially neither VP routes nor sub-prefix routes, and the first sub-prefix route is received. * When the timer expires, install sub-prefix routes. Note, however, that optional routes may first need to be removed from the FIB to make room for the new sub-prefix routes. If even after removing optional routes there is no room in the FIB for sub-prefix routes, then they should remain suppressed. In other words, sub-prefix entries required by virtue of being an APR take priority over sub-prefix entries required by virtue of not having a VP route. 6. All other sub-prefix routes MAY be suppressed. Such "optional" sub-prefixes that are nevertheless installed are referred to as popular prefixes. -- The document date (April 24, 2009) is 5475 days in the past. Is this intentional? Checking references for intended status: Best Current Practice ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC2328' is defined on line 1141, but no explicit reference was found in the text == Unused Reference: 'RFC3107' is defined on line 1147, but no explicit reference was found in the text ** Obsolete normative reference: RFC 3107 (Obsoleted by RFC 8277) Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group P. Francis 3 Internet-Draft MPI-SWS 4 Intended status: BCP X. Xu 5 Expires: October 26, 2009 Huawei 6 H. Ballani 7 Cornell U. 8 D. Jen 9 UCLA 10 R. Raszuk 11 Self 12 L. 
Zhang 13 UCLA 14 April 24, 2009 16 FIB Suppression with Virtual Aggregation 17 draft-francis-intra-va-01.txt 19 Status of this Memo 21 This Internet-Draft is submitted to IETF in full conformance with the 22 provisions of BCP 78 and BCP 79. 24 Internet-Drafts are working documents of the Internet Engineering 25 Task Force (IETF), its areas, and its working groups. Note that 26 other groups may also distribute working documents as Internet- 27 Drafts. 29 Internet-Drafts are draft documents valid for a maximum of six months 30 and may be updated, replaced, or obsoleted by other documents at any 31 time. It is inappropriate to use Internet-Drafts as reference 32 material or to cite them other than as "work in progress." 34 The list of current Internet-Drafts can be accessed at 35 http://www.ietf.org/ietf/1id-abstracts.txt. 37 The list of Internet-Draft Shadow Directories can be accessed at 38 http://www.ietf.org/shadow.html. 40 This Internet-Draft will expire on October 26, 2009. 42 Copyright Notice 44 Copyright (c) 2009 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents in effect on the date of 49 publication of this document (http://trustee.ietf.org/license-info). 50 Please review these documents carefully, as they describe your rights 51 and restrictions with respect to this document. 53 Abstract 55 The continued growth in the Default Free Routing Table (DFRT) 56 stresses the global routing system in a number of ways. One of the 57 most costly stresses is FIB size: ISPs often must upgrade router 58 hardware simply because the FIB has run out of space, and router 59 vendors must design routers that have adequate FIB. FIB suppression 60 is an approach to relieving stress on the FIB by NOT loading selected 61 RIB entries into the FIB. 
Virtual Aggregation (VA) allows ISPs to 62 shrink the FIBs of any and all routers, easily by an order of 63 magnitude with negligible increase in path length and load. FIB 64 suppression can be deployed autonomously by an ISP (cooperation between ISPs 65 is not required), and can co-exist with legacy routers in the ISP.
67 Table of Contents
69 1. Introduction
70   1.1. Scope of this Document
71   1.2. Requirements notation
72   1.3. Terminology
73   1.4. Temporary Sections
74     1.4.1. Document revisions
75 2. Overview of Virtual Aggregation (VA)
76   2.1. Mix of legacy and VA routers
77   2.2. Summary of Tunnels and Paths
78 3. Specification of VA
79   3.1. Requirements for VA
80   3.2. VA Operation
81     3.2.1. Legacy Routers
82     3.2.2. Advertising and Handling Virtual Prefixes (VP)
83     3.2.3. Border VA Routers
84     3.2.4. Advertising and Handling Sub-Prefixes
85     3.2.5. Suppressing FIB Sub-prefix Routes
86     3.2.6. Core-Edge Operation
87   3.3. Requirements Discussion
88     3.3.1. Response to router failure
89     3.3.2. Traffic Engineering
90     3.3.3. Incremental and safe deploy and start-up
91     3.3.4. VA security
92   3.4. New Configuration
93 4. IANA Considerations
94 5. Security Considerations
95   5.1. Properly Configured VA
96   5.2. Mis-configured VA
97 6. Acknowledgements
98 7. References
99   7.1. Normative References
100   7.2. Informative References
101 Authors' Addresses
103 1. Introduction 105 ISPs today manage constant DFRT growth in a number of ways. Most 106 commonly, ISPs will upgrade their router hardware before DFRT growth 107 outstrips the size of the FIB. In cases where an ISP wants to 108 continue to use routers whose FIBs are not large enough, it may 109 deploy them at edge locations where a full DFRT is not needed, for 110 instance at the customer interface. Packets for which there is no 111 route are defaulted to a "core" infrastructure that does contain the 112 full DFRT. While this helps, it cannot be used for all edge routers, 113 for instance those that interface with other ISPs. Alternatively, 114 some lower-tier ISPs may simply ignore some routes, for instance 115 /24's that fall within the aggregate of another route. 117 FIB Suppression is an approach to shrinking FIB size that requires no 118 changes to BGP, no changes to packet forwarding mechanisms in 119 routers, and relatively minor changes to control mechanisms in 120 routers and configuration of those mechanisms. The core idea behind 121 FIB suppression is to run BGP as normal, and in particular to not 122 shrink the RIB, but rather to not load certain RIB entries into the 123 FIB, for instance by not committing them to the Routing Table.
This 124 approach minimizes changes to routers, and in particular is simpler 125 than more general routing architectures that try to shrink both RIB 126 and FIB. With FIB suppression, there are no changes to BGP per se. 127 The BGP decision process does not change. The selected AS-path does 128 not change, and except on rare occasion the exit router does not 129 change. ISPs can deploy FIB suppression autonomously and with no 130 coordination with neighbor ASes. 132 This document describes an approach to FIB suppression called 133 "Virtual Aggregation" (VA). VA operates by organizing the IP (v4 or 134 v6) address space into Virtual Prefixes (VP), and using tunnels to 135 aggregate the (regular) sub-prefixes within each VP. The decrease in 136 FIB size can be dramatic, easily 5x or 10x with only a slight path 137 length and router load increase [nsdi09]. The VPs can be organized 138 such that all routers in an ISP see FIB size decrease, or in such a 139 way that "core" routers keep the full FIB, and "edge" routers have 140 almost no FIB (i.e. by defining a VP of 0/0). 142 1.1. Scope of this Document 144 The scope of this document is limited to Intra-domain VA operation. 145 In other words, the case where a single ISP autonomously operates VA 146 internally without any coordination with neighboring ISPs. 148 Note that this document assumes that the VA "domain" (i.e. the unit 149 of autonomy) is the AS (that is, different ASes run VA independently 150 and without coordination). For the remainder of this document, the 151 terms ISP, AS, and domain are used interchangeably. 153 This document applies equally to IPv4 and IPv6. 155 VA may operate with a mix of upgraded routers and legacy routers. 156 There are no topological restrictions placed on the mix of routers. 157 In order to avoid loops between upgraded and legacy routers, however, 158 legacy routers must be able to terminate tunnels. 
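As an informal illustration of the VP concept introduced above (it is not part of the specification), the following Python sketch shows how a sub-prefix could be matched to the Virtual Prefix that covers it. The function names covering_vp and group_by_vp, the example prefixes, and the choice of the most specific VP when VPs overlap are all assumptions made for illustration; the document itself does not prescribe any of them.

```python
import ipaddress


def covering_vp(sub_prefix, vp_list):
    """Return the most specific VP in vp_list that contains sub_prefix.

    Returns None when no VP covers the sub-prefix. Picking the most
    specific VP for overlapping VPs is an assumption of this sketch.
    """
    sub = ipaddress.ip_network(sub_prefix)
    best = None
    for vp in vp_list:
        net = ipaddress.ip_network(vp)
        if net.version != sub.version:
            continue  # never compare IPv4 prefixes against IPv6 prefixes
        if sub.subnet_of(net) and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best


def group_by_vp(sub_prefixes, vp_list):
    """Group sub-prefixes by covering VP; uncovered ones go under the key None."""
    groups = {}
    for prefix in sub_prefixes:
        vp = covering_vp(prefix, vp_list)
        key = str(vp) if vp is not None else None
        groups.setdefault(key, []).append(prefix)
    return groups


if __name__ == "__main__":
    # Hypothetical VP-List and sub-prefixes, chosen only for the example.
    vp_list = ["20.0.0.0/7", "24.0.0.0/8"]
    subs = ["20.5.0.0/16", "21.1.2.0/24", "24.9.0.0/16", "99.1.0.0/16"]
    for vp, members in group_by_vp(subs, vp_list).items():
        print(vp, members)
```

A sub-prefix that falls under the key None corresponds to the case where the ISP has not defined a VP covering that part of the address space, for instance during incremental deployment.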
160 This document is agnostic about what type of tunnel may be used for 161 VA, and does not specify a tunnel type per se. Rather, it refers 162 generically to tunnels and specifies the minimum set of requirements 163 that a given tunnel type must satisfy. Separate documents are used 164 to specify the operation of VA for specific tunnel types. 166 1.2. Requirements notation 168 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 169 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 170 document are to be interpreted as described in [RFC2119]. 172 1.3. Terminology 174 Aggregation Point Router (APR): An Aggregation Point Router (APR) is 175 a router that aggregates a Virtual Prefix (VP) by installing 176 routes (into the FIB) for all of the sub-prefixes within the VP. 177 APRs advertise the VP to other routers with BGP. For each sub- 178 prefix within the VP, APRs have a tunnel from themselves to the 179 remote ASBR (Autonomous System Border Router) where packets for 180 that prefix should be delivered. 181 Install and Suppress: The terms "install" and "suppress" are used to 182 describe whether a RIB entry has been loaded or not loaded into 183 the FIB (or, equivalently, the Routing Table). In other words, 184 the phrase "install a route" means "install a route into the FIB", 185 and the phrase "suppress a route" means "do not install a route 186 into the FIB". 187 Legacy Router: A router that does not run VA, and has no knowledge 188 of VA. Legacy routers, however, must be able to terminate 189 tunnels. (If a Legacy router cannot terminate tunnels, then any 190 routes that are reached via that router must be installed in all 191 FIBs.) 192 non-APR Router: In discussing VPs, it is often necessary to 193 distinguish between routers that are APRs for that VP, and routers 194 that are not APRs for that VP (but of course may be APRs for other 195 VPs not under discussion). 
In these cases, the term "APR" is 196 taken to mean "a VA router that is an APR for the given VP", and 197 the term "non-APR" is taken to mean "a VA router that is not an 198 APR for the given VP". The term non-APR router is not used to 199 refer to legacy routers. 200 Popular Prefix: A popular prefix is a sub-prefix that is installed 201 in a router in addition to the sub-prefixes it holds by virtue of 202 being an Aggregation Point Router. The popular prefix allows 203 packets to follow the shortest path. Note that different routers 204 do not need to have the same set of popular prefixes. 205 Routing Table: The term Routing Table is defined here the same way 206 as in Section 3.2 of [RFC4271]: "Routing information that the BGP 207 speaker uses to forward packets (or to construct the forwarding 208 table used for packet forwarding) is maintained in the Routing 209 Table." As such, FIB Suppression can be achieved by not 210 installing a route into the Routing Table. 211 Routing Information Base (RIB): The term RIB is used loosely 212 in this document to refer either to the Loc-RIB (as used in 213 [RFC4271]), or to the combined Adj-RIBs-In, the Loc-RIB, and the 214 Adj-RIBs-Out. 215 Sub-Prefix: A regular (physically aggregatable) prefix. These are 216 equivalent to the prefixes that would normally comprise the DFRT 217 in the absence of VA. A VA router will contain a sub-prefix entry 218 either because the sub-prefix falls within a virtual prefix for 219 which the router is an APR, or because the sub-prefix is installed 220 as a popular prefix. Legacy routers hold the same sub-prefixes 221 they hold today. 222 Tunnel: VA can use a variety of tunnel types: MPLS LSPs, IP-in-IP, 223 GRE, L2TP, and so on. This document does not describe how any 224 given tunnel information is conveyed: that is left for companion 225 documents. This document uses the term tunnel to refer to any 226 appropriate tunnel type.
227 VA router: A router that operates Virtual Aggregation according to 228 this document. 229 Virtual Prefix (VP): A Virtual Prefix (VP) is a prefix used to 230 aggregate its contained regular prefixes (sub-prefixes). A VP is 231 not physically aggregatable, and so it is aggregated at APRs 232 through the use of tunnels. 233 VP-List: A list of defined VPs. All routers must agree on the 234 contents of this list (which is statically configured into every 235 VA router). 237 1.4. Temporary Sections 239 This section contains temporary information, and will be removed in 240 the final version. 242 1.4.1. Document revisions 244 This document was previously published as 245 draft-francis-idr-intra-va-01.txt. 247 1.4.1.1. Revisions from the 00 version (of 248 draft-francis-intra-va-00.txt) 250 Added additional authors (Jen, Raszuk, Zhang), to reflect primary 251 contributors moving forward. In addition, a number of minor 252 clarifications were made. 254 1.4.1.2. Revisions from the 01 version (of 255 draft-francis-idr-intra-va-01.txt) 257 1. Changed file name from draft-francis-idr-intra-va to 258 draft-francis-intra-va. 259 2. Restructured the document to make the edge suppression mode a 260 specific sub-case of VA rather than a separate mode of operation. 261 This includes modifying the title of the draft. 262 3. Removed MPLS tunneling details so that specific tunneling 263 approaches can be described in separate documents. 265 1.4.1.3. Revisions from 00 version 267 o Changed intended document type from STD to BCP, as per advice from 268 Dublin IDR meeting. 269 o Cleaned up the MPLS language, and specified that the full-address 270 routes to remote ASBRs must be imported into OSPF (Section 3.2.3). 271 As per Daniel Ginsburg's email 272 http://www.ietf.org/mail-archive/web/idr/current/msg02933.html. 273 o Clarified that legacy routers must run MPLS. As per Daniel 274 Ginsburg's email 275 http://www.ietf.org/mail-archive/web/idr/current/msg02935.html.
276 o Fixed LOCAL_PREF bug. As per Daniel Ginsburg's email 277 http://www.ietf.org/mail-archive/web/idr/current/msg02940.html. 278 o Removed the need for the extended communities attribute on VP 279 routes, and added the requirement that all VA routers be 280 statically configured with the complete list of VPs. As per 281 Daniel Ginsburg's emails 282 http://www.ietf.org/mail-archive/web/idr/current/msg02940.html and 283 http://www.ietf.org/mail-archive/web/idr/current/msg02958.html. 284 In addition, the procedure for adding, deleting, splitting, and 285 merging VPs was added. As part of this, the possibility of having 286 overlapping VPs was added. 287 o Added the special case of a core-edge topology with default routes 288 to the edge as suggested by Robert Raszuk in email 289 http://www.ietf.org/mail-archive/web/idr/current/msg02948.html. 290 Note that this altered the structure and even title of the 291 document. 292 o Clarified that FIB suppression can be achieved by not loading 293 entries into the Routing Table, as suggested by Rajiv Asati in 294 email 295 http://www.ietf.org/mail-archive/web/idr/current/msg03019.html. 297 2. Overview of Virtual Aggregation (VA) 299 For descriptive simplicity, this section starts by describing VA 300 assuming that there are no legacy routers in the domain. Section 2.1 301 overviews the additional functions required by VA routers to 302 accommodate legacy routers. 304 A key concept behind VA is to operate BGP as normal, and in 305 particular to populate the RIB with the full DFRT, but to suppress 306 many or most prefixes from being loaded into the FIB. By populating 307 the RIB as normal, we avoid any changes to BGP, and changes to router 308 operation are relatively minor. The basic idea behind VA is quite 309 simple. The address space is partitioned into large prefixes --- 310 larger than any aggregatable prefix in use today. These prefixes are 311 called virtual prefixes (VP). 
Different VPs do not need to be the 312 same size. They may be a mix of /6, /7, /8 (for IPv4), and so on. 313 Indeed, an ISP can define a single /0 VP, and use it for a core/edge 314 type of configuration (commonly seen today). That is, the core 315 routers would maintain full FIBs, and edge routers could maintain 316 default routes to the core routers, and suppress as much of the FIB 317 as they wish. Each ISP can independently select the size of its VPs. 319 VPs are not themselves topologically aggregatable. VA makes the VPs 320 aggregatable through the use of tunnels, as follows. Associated with 321 each VP are one or more "Aggregation Point Routers" (APR). An APR 322 (for a given VP) is a router that installs routes for all sub- 323 prefixes (i.e. real physically aggregatable prefixes) within the VP. 324 By "install routes" here, we mean: 326 1. The route for each of the sub-prefixes is loaded into the FIB, 327 and 328 2. there is a tunnel from the APR to the BGP NEXT_HOP for the route. 330 The APR originates a BGP route to the VP. This route is distributed 331 within the domain, but not outside the domain. With this structure 332 in place, a packet transiting the ISP goes from the ingress router to 333 the APR (possibly via a tunnel), and then from the APR to the BGP 334 NEXT_HOP router via a tunnel. 336 Normally the BGP NEXT_HOP is the remote ASBR. In this case, even 337 though the remote ASBR is the tunnel endpoint, the tunnel header is 338 stripped by the local ASBR before the packet is delivered to the 339 remote ASBR. In other words, the remote ASBR sees a normal IP 340 packet, and is completely unaware of the existence of VA in the 341 neighboring ISP. The exception to this is legacy local ASBR routers. 343 In this case, the legacy router is the BGP NEXT_HOP, and packets are 344 tunneled to the legacy router, which then uses a FIB lookup to 345 deliver the packet to the appropriate remote ASBR. 
This applies only 346 to legacy routers that can convey tunnel parameters and detunnel 347 packets. 349 Note that the AS-path is not affected at all by VA. This means among 350 other things that AS-level policies are not affected by VA. The 351 packet may not, however, follow the shortest path within the ISP 352 (where shortest path is defined here as the path that would have been 353 taken if VA were not operating), because the APR may not be on the 354 shortest path between the ingress and egress routers. When this 355 happens, the packet experiences additional latency and creates extra 356 load (by virtue of taking more hops than it otherwise would have). 357 Note also that, with VA, a packet may occasionally take a different 358 exit point than it otherwise would have. 360 VA can avoid traversing the APR for selected routes by installing 361 these routes in non-APR routers. In other words, even if an ingress 362 router is not an APR for a given sub-prefix, it may install that 363 sub-prefix into its FIB. Packets in this case are tunneled directly from 364 the ingress to the BGP NEXT_HOP. These extra routes are called 365 "Popular Prefixes", and are typically installed for policy reasons 366 (e.g. customer routes are always installed), or for sub-prefixes that 367 carry a high volume of traffic (Section 3.2.5.1). Different routers 368 may have different popular prefixes. As such, an ISP may assign 369 popular prefixes per router, per POP, or uniformly across the ISP. A 370 given router may have zero popular prefixes, or the majority of its 371 FIB may consist of popular prefixes. The effectiveness of popular 372 prefixes in reducing traffic load relies on the fact that traffic 373 volumes follow something like a power-law distribution: i.e. that 90% 374 of traffic is destined to 10% of the destinations.
Internet traffic 375 measurement studies over the years have consistently shown that 376 traffic patterns follow this distribution, though there is no 377 guarantee that they always will. 379 Note that for routing to work properly, every packet must sooner or 380 later reach a router that has installed a sub-prefix route that 381 matches the packet. This would obviously be the case for a given 382 sub-prefix if every router has installed a route for that sub-prefix 383 (which of course is the situation in the absence of VA). If this is 384 not the case, then there must be at least one Aggregation Point 385 Router (APR) for the sub-prefix's virtual prefix (VP). Ideally, 386 every POP contains at least two APRs for every virtual prefix. By 387 having APRs in every POP, the latency imposed by routing to the APR 388 is minimal (the extra hop is within the POP). By having more than 389 one APR, there is a redundant APR should one fail. In practice it is 390 often not possible to have an APR for every VP in every POP. This is 391 because some POPs may have only one or a few routers, and therefore 392 there may not be enough cumulative FIB space in the POP to hold 393 every sub-prefix. Note that any router ("edge", "core", etc.) may be 394 an APR. 396 2.1. Mix of legacy and VA routers 398 It is important that an ISP be able to operate with a mix of "VA 399 routers" (routers upgraded to operate VA as described in this 400 document) and "legacy routers". This allows ISPs to deploy VA in an 401 incremental fashion and to continue to use routers that for whatever 402 reason cannot be upgraded. This document allows such a mix, and 403 indeed places no topological restrictions on that mix.
It does, 404 however, require that legacy routers (and VA routers for that matter) 405 are able to forward already-tunneled packets, are able to serve as 406 tunnel endpoints, and are able to participate in distribution of 407 tunnel information required to establish themselves as tunnel 408 endpoints. (This is listed as Requirement R5 in the companion 409 tunneling documents.) Depending on the tunnel type, legacy routers 410 may also be able to generate tunneled packets, though this is an 411 optional requirement. (This is listed as Requirement R4 in the 412 companion tunneling documents.) Legacy routers must use their own 413 address as the BGP NEXT_HOP, and must FIB-install routes for which 414 they are the BGP NEXT_HOP. 416 2.2. Summary of Tunnels and Paths 418 To summarize, the following tunnels are created:
420 1. From all VA routers to all BGP NEXT_HOP addresses (where the BGP 421 NEXT_HOP address is either an APR, a legacy router, or the remote 422 ASBR neighbor of a VA router). Note that this is listed as 423 Requirement R3 in the companion tunneling documents.
424 2. Optionally, from all legacy routers to all BGP NEXT_HOP 425 addresses.
426 There are a number of possible paths that packets may take through an 427 ISP, summarized in the following diagram. Here, "VA" is a VA router, 428 "LR" is a legacy router, the symbol "==>" represents a tunneled 429 packet (through zero or more routers), "-->" represents an untunneled 430 packet, and "(pop)" represents stripping the tunnel header. The 431 symbol "::>" represents the portion of the path where although the 432 tunnel is targeted to the receiving node, the outer header has been 433 stripped. (Note that the remote ASBR may actually be a legacy router 434 or a VA router---it doesn't matter (and isn't known) to the ISP.)
435                                      Egress
436                                      Router
437    Ingress   Some        APR         (Local      Remote
438    Router    Router      Router      ASBR)       ASBR
439    -------   ------      ------      ------      --------
440 1. VA===================>VA=========>VA(pop)::::>LR
442 2. VA===================>VA=========>LR--------->LR
444 3. VA===============================>VA(pop)::::>LR
446 4. VA===============================>LR--------->LR
448    (The following two exist in the case where legacy routers
449    can initiate tunneled packets.)
451 5. LR===============================>VA(pop)::::>LR
453 6. LR===============================>LR--------->LR
455    (The following two exist in the case where legacy routers
456    cannot initiate tunneled packets.)
458 7. LR------->VA (remaining paths as in 1 to 4 above)
460 8. LR------->LR--------------------->LR--------->LR
462 The first and second paths represent the case where the ingress 463 router does not have a popular prefix for the destination, and must 464 tunnel the packet to an APR. The third and fourth paths represent 465 the case where the ingress router does have a popular prefix for the 466 destination, and so tunnels the packet directly to the egress. The 467 fifth and sixth paths are similar, but where the ingress is a legacy 468 router that can initiate tunneled packets, and effectively has the 469 popular prefix by virtue of holding the entire DFRT. (Note that some 470 ISPs have only partial RIBs in their customer-facing edge routers, 471 and default route to a router that holds the full DFRT. This case is 472 not shown here.) Finally, paths 7 and 8 represent the case where 473 legacy routers cannot initiate a tunneled packet. 475 VA prevents the routing loops that might otherwise occur when VA 476 routers and legacy routers are mixed. The trick is avoiding the case 477 where a legacy router is forwarding packets towards the BGP NEXT_HOP, 478 while a VA router is forwarding packets towards the APR, with each 479 router thinking that the other is on the shortest path to their 480 respective targets. 482 In the first four types of path, the loop is avoided because tunnels 483 are used all the way to the egress.
As a result, there is never an 484 opportunity for a legacy router to try to route based on the 485 destination address unless the legacy router is the egress, in which 486 case it forwards the packet to the remote ASBR. 488 In the 5th and 6th cases, the ingress is a legacy router, but this 489 router can initiate tunnels and has the full FIB, and so simply 490 tunnels the packet to the egress router. 492 In the 7th and 8th cases, the legacy ingress cannot initiate tunnels, 493 and so forwards the packet hop-by-hop towards the BGP NEXT_HOP. The 494 packet will work its way towards the egress router, and will either 495 progress through a series of legacy routers (in which case the IGP 496 prevents loops), or it will eventually reach a VA router, after which 497 it will take tunnels as in the 1st through 4th cases. 499 3. Specification of VA 501 This section describes in detail how to operate VA. It starts with a 502 brief discussion of requirements, followed by a specification of 503 router support for VA. 505 3.1. Requirements for VA 507 While the core requirement is of course to be able to manage FIB 508 size, this must be done in a way that:
509 o is robust to router failure,
510 o allows for traffic engineering,
511 o allows for existing inter-domain routing policies,
512 o operates in a predictable manner and is therefore possible to 513 test, debug, and reason about performance (i.e. establish SLAs),
514 o can be safely installed, tested, and started up,
515 o can be configured and reconfigured without service interruption,
516 o can be incrementally deployed, and in particular can be operated 517 in an AS with a mix of VA-capable and legacy routers,
518 o accommodates existing security mechanisms such as ingress 519 filtering and DoS defense,
520 o does not introduce significant new security vulnerabilities.
521 In short, operation of VA must not significantly affect the way ISPs 522 operate their networks today.
   Section 3.3 discusses the extent to which these requirements are met
   by the design presented in Section 3.2.

3.2. VA Operation

   In this section, the detailed operation of VA is specified.

3.2.1. Legacy Routers

   VA can operate with a mix of VA and legacy routers. To avoid the
   types of loops described in Section 2.2, however, legacy routers MUST
   satisfy the following requirements:

   1. When forwarding externally-received routes over iBGP, the BGP
      NEXT_HOP attribute MUST be set to the legacy router itself.
   2. Legacy routers MUST be able to detunnel packets addressed to
      themselves at the BGP NEXT_HOP address. They MUST also be able
      to convey the tunnel information needed by other routers to
      initiate tunneled packets to them. This is listed as
      "Requirement R1" in the companion tunneling documents. If a
      legacy router cannot detunnel and convey tunnel parameters, then
      the AS cannot use VA.
   3. Legacy routers MUST be able to forward all tunneled packets.
   4. Every legacy router MUST hold its complete FIB. (Note, of
      course, that this FIB does not necessarily need to contain the
      full DFRT. This might be the case, for instance, if the router
      is an edge router that defaults to a core router.)

   As long as legacy routers participate in tunneling as described
   above, there are no topological restrictions on the legacy routers.
   They may be freely mixed with VA routers without the possibility of
   forming sustained loops (Section 2.2).

3.2.2. Advertising and Handling Virtual Prefixes (VP)

3.2.2.1. Distinguishing VPs from Sub-prefixes

   VA routers must be able to distinguish VPs from sub-prefixes. This
   is primarily in order to know which routes to install. In
   particular, non-APR routers must know which prefixes are VPs before
   they receive routes for those VPs, for instance when they first boot
   up.
   This is in order to avoid the situation where they unnecessarily
   start filling their FIB with routes that they ultimately don't need
   to install (Section 3.2.5). This leads to the following requirement:

   It MUST be possible to statically configure the complete list of VPs
   into all VA routers. This list is known as the VP-List.

3.2.2.2. Limitations on Virtual Prefixes

   From the point of view of best-match routing semantics, VPs are
   treated identically to any other prefix. In other words, if the
   longest matching prefix is a VP, then the packet is routed towards
   the VP. If a packet matching a VP reaches an Aggregation Point
   Router (APR) for that VP, and the APR does not have a better matching
   route, then the packet is discarded by the APR (just as a router that
   originates any prefix will discard a packet that does not have a
   better match).

   The overall semantics of VPs, however, are subtly different from
   those of real prefixes (well, maybe not so subtly). Without VA, when
   a router originates a route for a (real) prefix, the expectation is
   that the addresses within the prefix are within the originating AS
   (or a customer of the AS). For VPs, this is not the case. APRs
   originate VPs whose sub-prefixes exist in different ASes. Because of
   this, it is important that VPs not be advertised across AS
   boundaries.

   It is up to individual domains to define their own VPs. VPs MUST be
   "larger" (span a larger address space) than any real sub-prefix. If
   a VP is smaller than a real prefix, then packets that match the real
   prefix will nevertheless be routed to an APR owning the VP, at which
   point the packet will be dropped if it does not match a sub-prefix
   within the VP (Section 5).

   (Note that, in principle, there are cases where a VP could be smaller
   than a real prefix. This is where the egress router to the real
   prefix is a VA router.
   In this case, the APR could theoretically tunnel the packet to the
   appropriate remote ASBR, which would then forward the packet
   correctly. On the other hand, if the egress router is a legacy
   router, then the APR could not tunnel matching packets to the
   egress. This is because the egress would view the VP as a better
   match, and would loop the packet back to the APR. For this reason we
   require that VPs be larger than any real prefixes, and that APRs
   never install prefixes larger than a VP in their FIBs.)

   It is valid for a VP to be a subset of another VP. For example, 20/7
   and 20/8 can both be VPs. In fact, this capability is necessary for
   "splitting" a VP without temporarily increasing the FIB size in any
   router (Section 3.2.2.5).

3.2.2.3. Aggregation Point Routers (APR)

   Any router may be configured as an Aggregation Point Router (APR) for
   one or more Virtual Prefixes (VP). For each VP for which a router is
   an APR, the router does the following:

   1. The APR MUST originate a BGP route to the VP [RFC4271]. In this
      route, the NLRI are all of the VPs for which the router is an
      APR. This is true even for VPs that are a subset of another VP.
      The ORIGIN is set to INCOMPLETE (value 2), the AS number of the
      APR's AS is used in the AS_PATH, and the BGP NEXT_HOP is set to
      the address of the APR. The ATOMIC_AGGREGATE and AGGREGATOR
      attributes are not included.
   2. The APR MUST attach a NO_EXPORT Communities Attribute [RFC1997]
      to the route.
   3. The APR MUST be able to detunnel packets addressed to itself at
      its BGP NEXT_HOP address. It MUST also be able to convey the
      tunnel information needed by other routers to initiate tunneled
      packets to it (Requirement R1).
   4. If a packet is received at the APR whose best match is the VP
      (i.e. it matches the VP but not any sub-prefixes within the VP),
      then the packet MUST be discarded (see Section 3.2.2.2).
      This can be accomplished by never installing a prefix larger than
      the VP into the FIB, or by installing the VP as a route to
      /dev/null.

3.2.2.3.1. Selecting APRs

   An ISP is free to select APRs however it chooses. The details of
   this are outside the scope of this document. Nevertheless, a few
   comments are made here. In general, APRs should be selected such
   that the distance to the nearest APR for any VP is small, ideally
   within the same POP. Depending on the number of routers in a POP,
   and the sizes of the FIBs in the routers relative to the DFRT size,
   it may not be possible for all VPs to be represented in a given POP.
   In addition, there should be multiple APRs for each VP, again ideally
   in each POP, so that the failure of one does not unduly disrupt
   traffic.

   APRs may be (and probably should be) statically assigned. They may
   also, however, be dynamically assigned, for instance in response to
   APR failure. For instance, each router may be assigned as a backup
   APR for some other APR. If the other APR crashes (as indicated by
   the withdrawal of its routes to its VPs), the backup APR can install
   the appropriate sub-prefixes and advertise the VP as specified above.
   Note that doing so may require it to first remove some popular
   prefixes from its FIB to make room.

   Note that, although VPs MUST be larger than real prefixes, there is
   intentionally no mechanism designed to automatically ensure that this
   is the case. Such a mechanism would be dangerous. For instance, if
   an ISP somewhere advertised a very large prefix (a /4, say), then
   this would cause APRs to throw out all VPs that are smaller than
   this. For this reason, VPs must be set through static configuration
   only.

3.2.2.4. Non-APR Routers

   A non-APR router MUST install at least the following routes:

   1. Routes to VPs (identifiable using the VP-List).
   2.
      Routes to the largest of any prefixes that contain a given VP.
      (Note that although this is not supposed to happen, if it does
      the non-APR should install it, with the effect that any addresses
      in the prefix not covered by VPs will be routed outside the
      domain.)
   3. Routes to all prefixes that contain an address that is in part of
      the address space for which no VP is defined (i.e. as is done
      today without VA).

   If the non-APR has a tunnel to the BGP NEXT_HOP of any such route, it
   MUST use the tunnel to forward packets to the BGP NEXT_HOP.

   When an APR fails, routers MUST select another APR to send packets to
   (if there is one). This happens, however, through normal internal
   BGP convergence mechanisms. Note that it is strongly recommended
   that routers keep at least two VP routes in their RIB at all times.
   The main reason is that if the currently used VP route is withdrawn,
   the second VP route can be immediately installed, and the issue of
   whether to temporarily install sub-prefixes in the FIB is avoided
   (Section 3.2.5). Another reason is that the IGP can be used to even
   more quickly detect that the APR has crashed, again allowing the
   second VP route to be immediately installed.

3.2.2.5. Adding and Deleting VPs

   An ISP may from time to time wish to reconfigure its VP-List. There
   are a number of reasons for this. For instance, early in its
   deployment an ISP may configure one or a small number of VPs in order
   to test VA. As the ISP gets more confident with VA, it may increase
   the number of VPs. Or, an ISP may start with a small number of large
   VPs (e.g. /4s or even one /0), and over time move to a larger number
   of smaller VPs in order to save even more FIB. In this case, the ISP
   will need to "split" a VP. Finally, since the address space is not
   uniformly populated with prefixes, the ISP may want to change the
   size of VPs in order to balance FIB size across routers.
   This can involve both splitting and merging VPs. Of course, an ISP
   MUST be able to modify its VP-List without 1) interrupting service to
   any destinations, or 2) temporarily increasing the size of any FIB
   (i.e. the FIB size during the change is no bigger than its size
   either before or after the change).

   Adding a VP is straightforward. The first step is to configure the
   APRs for the VP. This causes the APRs to originate routes for the
   VP. Non-APR routers will install this route according to the rules
   in Section 3.2.2.4 even though they do not yet recognize that the
   prefix is a VP. Subsequently the VP is added to the VP-List of
   non-APR routers. The non-APR routers can then start suppressing the
   sub-prefixes with no loss of service.

   To delete a VP, the process is reversed. First, the VP is removed
   from the VP-Lists of non-APRs. This causes the non-APRs to install
   the sub-prefixes. After all sub-prefixes have been installed, the VP
   may be removed from the APRs.

   In many cases, it is desirable to split a VP. For instance, consider
   the case where two routers, Ra and Rb, are APRs for the same prefix.
   It would be possible to shrink the FIB in both routers by splitting
   the VP into two VPs (e.g. split one /6 into two /7s), and assigning
   each router to one of the VPs. While this could in theory be done by
   first deleting the larger VP, and then adding the smaller VPs, doing
   so would temporarily increase the FIB size in non-APRs, which may not
   have adequate space for such an increase. For this reason, we allow
   overlapping VPs.

   To split a VP, first the two smaller VPs are added to the VP-Lists of
   all non-APR routers (in addition to the larger superset VP). Next,
   the smaller VPs are added to the selected APRs (which may or may not
   be APRs for the larger VP).
   Because the smaller VPs are a better match than the larger VP, this
   will cause the non-APR routers to forward packets to the APRs for the
   smaller VPs. Next, the larger VP can be removed from the VP-Lists of
   all non-APR routers. Finally, the larger VP can be removed from its
   APRs.

   To merge two VPs, the new larger VP is configured in all non-APRs.
   This has no effect on FIB size or APR selection, since the smaller
   VPs are better matches. Next the larger VP is configured in its
   selected APRs. Next the smaller VPs are deleted from all non-APRs.
   Finally, the smaller VPs are deleted from their corresponding APRs.

3.2.3. Border VA Routers

   VA routers that are border routers MUST do the following: When
   forwarding externally-received routes over iBGP, the BGP NEXT_HOP
   attribute MUST be set to the remote ASBR. They MUST establish
   tunnels that have the following properties (Requirement R2 in
   companion documents):

   1. The tunnel target must be the remote ASBR BGP NEXT_HOP address.
      In other words, the target address used by other routers in the
      domain for tunneling packets is the remote ASBR address.
   2. The border router must detunnel the packet before forwarding the
      packet to the remote ASBR. In other words, the remote ASBR
      receives a normal untunneled packet identical to the packet it
      would receive without VA.
   3. The border router must be able to forward the packet without a
      FIB lookup. In other words, the tunnel information itself
      contains all the information needed by the border router to know
      which remote ASBR should receive the packet.

   Note that there are a number of ways the above tunnel can be created,
   as documented separately. For instance, the tag on an MPLS LSP could
   identify the remote ASBR, and the border router could use what is
   effectively penultimate hop popping to deliver the packet.
   Or, GRE could be used whereby the outer IP header addresses the
   border router, and the GRE key value identifies the remote ASBR.

3.2.4. Advertising and Handling Sub-Prefixes

   Sub-prefixes are advertised and handled by BGP as normal. VA does
   not affect this behavior. The only difference in the handling of
   sub-prefixes is that they might not be installed in the FIB, as
   described in Section 3.2.5.

   In those cases where the route is installed, packets forwarded to
   prefixes external to the AS MUST be transmitted via the tunnel
   established as described in Section 3.2.3.

3.2.5. Suppressing FIB Sub-prefix Routes

   Any route not for a known VP (i.e. not in the VP-List) is taken to be
   a sub-prefix. The following rules are used to determine if a
   sub-prefix route can be suppressed.

   1. A VA router must never FIB-install a sub-prefix route for which
      there is no tunnel to the BGP NEXT_HOP address. This is to
      prevent a loop whereby the APR forwards the packet hop-by-hop
      towards the next hop, but a router on the path that has
      FIB-suppressed the sub-prefix forwards it back to the APR. If
      there is an alternate route to the sub-prefix for which there is
      a tunnel, then that route should be selected, even if it is less
      attractive according to the normal BGP best path selection
      algorithm.
   2. If the router is an APR, a route for every sub-prefix within the
      VP MUST be FIB-installed (subject to the above limitation that
      there be a tunnel).
   3. If a non-APR router has a sub-prefix route that does not fall
      within any VP (as determined by the VP-List), then the route must
      be installed. This may occur because the ISP hasn't defined a VP
      covering that prefix, for instance during an incremental
      deployment buildup.
   4. If a non-APR router does not have a route for a known VP, then it
      MAY install sub-prefixes within that VP, but is not required to.
      Whether or not it does is up to the vendor and the network
      operator. One approach is to never install such sub-prefixes, on
      the assumption that the network operator will engineer the
      network so that this rarely if ever happens.
   5. Another approach is to have routers install such sub-prefixes,
      while taking care not to do so if the missing VP route is a
      transient condition. For instance, if the router is booting up,
      and simply has not yet received all of its routes, then it can
      reasonably expect to receive a VP route soon and so SHOULD NOT
      install the sub-prefixes. On the other hand, if a continuously
      operating router had only a single remaining route for the VP,
      and that route is withdrawn, then the router might not expect to
      receive a replacement VP route soon and so SHOULD install the
      sub-prefixes. Obviously a router can't predict the future with
      certainty, so the following algorithm might be a useful way to
      manage whether or not to install sub-prefixes for a non-existing
      VP route:
      * Define a timer MISSING_VP_TIMER, set for a relatively short
        time (say 10 seconds or so).
      * Start the timer when either: 1) the last VP route is
        withdrawn, or 2) there are initially neither VP routes nor
        sub-prefix routes, and the first sub-prefix route is received.
      * When the timer expires, install sub-prefix routes. Note,
        however, that optional routes may first need to be removed
        from the FIB to make room for the new sub-prefix routes. If
        even after removing optional routes there is no room in the
        FIB for sub-prefix routes, then they should remain suppressed.
        In other words, sub-prefix entries required by virtue of being
        an APR take priority over sub-prefix entries required by
        virtue of not having a VP route.
   6. All other sub-prefix routes MAY be suppressed. Such "optional"
      sub-prefixes that are nevertheless installed are referred to as
      popular prefixes.
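
   As an illustration only, and not as part of this specification, the
   rules of Section 3.2.5 might be evaluated for a single sub-prefix
   route roughly as in the Python sketch below. The names Router,
   fib_decision, and missing_vp_timer_expired are hypothetical. The
   sketch assumes IPv4 prefixes only, assumes the caller passes only
   routes that are not themselves in the VP-List, and treats a VP route
   as missing only when no VP covering the sub-prefix has a route (an
   assumption made here to handle overlapping VPs). It does not model
   FIB capacity or the selection of an alternate tunneled route under
   rule 1.

```python
import ipaddress
from dataclasses import dataclass, field

INSTALL, SUPPRESS, OPTIONAL = "install", "suppress", "optional"


@dataclass
class Router:
    """Hypothetical per-router state; these names are not from the draft."""
    vp_list: list                                # configured VP-List (IPv4Network)
    apr_for: set = field(default_factory=set)    # VPs this router is an APR for
    vp_routes: set = field(default_factory=set)  # VPs with a route in the RIB
    missing_vp_timer_expired: bool = False       # stands in for MISSING_VP_TIMER


def fib_decision(router, sub_prefix, has_tunnel):
    """Classify one sub-prefix route (a route not itself in the VP-List)."""
    sub = ipaddress.ip_network(sub_prefix)
    # Rule 1: never install without a tunnel to the BGP NEXT_HOP.
    if not has_tunnel:
        return SUPPRESS
    covering = [vp for vp in router.vp_list if sub.subnet_of(vp)]
    # Rule 2: an APR installs every sub-prefix within its VPs.
    if any(vp in router.apr_for for vp in covering):
        return INSTALL
    # Rule 3: sub-prefixes outside every VP are installed, as done today.
    if not covering:
        return INSTALL
    # Rules 4 and 5: no route for any covering VP, so wait for the timer.
    if not any(vp in router.vp_routes for vp in covering):
        return INSTALL if router.missing_vp_timer_expired else SUPPRESS
    # Rule 6: everything else is optional (a candidate popular prefix).
    return OPTIONAL
```

   For example, with 20.0.0.0/8 in the VP-List and a route for that VP
   in the RIB, a non-APR router with a tunnel to the next hop would
   classify 20.1.0.0/16 as optional, leaving the choice to the popular
   prefix policy of Section 3.2.5.1.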
3.2.5.1. Selecting Popular Prefixes

   Individual routers may independently choose which sub-prefixes are
   popular prefixes. There is no need for different routers to install
   the same sub-prefixes. There is therefore significant leeway as to
   how routers select popular prefixes. As a general rule, routers
   should fill the FIB as much as possible, because the cost of doing so
   is relatively small, and more FIB entries lead to fewer packets
   taking a longer path. Broadly speaking, an ISP may choose to fill
   the FIB by making routers APRs for as many VPs as possible, or by
   assigning relatively few APRs and instead filling the FIB with
   popular prefixes. Several basic approaches to selecting popular
   prefixes are outlined here. Router vendors are free to implement
   whatever approaches they want.

   1. Policy-based: The simplest approach for network administrators is
      to have broad policies that routers use to determine which
      sub-prefixes are designated as popular. An obvious policy would
      be a "customer routes" policy, whereby all customer routes are
      installed (as identified for instance by appropriate community
      attribute tags). Another policy would be for a router to install
      prefixes originated by specific ASes. For instance, two ISPs
      could mutually agree to install each other's originated prefixes.
      A third policy might be to install prefixes with the shortest
      AS-path.
   2. Static list: Another approach would be to configure static lists
      of specific prefixes to install. For instance, prefixes
      associated with an SLA might be configured. Or, a list of
      prefixes for the most popular websites might be installed.
   3. High-volume prefixes: By installing high-volume prefixes as
      popular prefixes, the latency and load associated with the longer
      path required by VA is minimized.
      One approach would be for an ISP to measure its traffic volume
      over time (days or a few weeks), and statically configure
      high-volume prefixes as popular prefixes. There is strong
      evidence that prefixes that are high-volume tend to remain
      high-volume over multi-day or multi-week timeframes (though not
      necessarily at short timeframes like minutes or seconds).
      High-volume prefixes may also be installed dynamically. In other
      words, a router measures its own traffic volumes, and installs
      and removes popular prefixes in response to short term traffic
      load. The downside of this approach is that it complicates
      debugging network problems. If packets are being dropped
      somewhere in the network, it is more difficult to find out where
      if the selected path can change dynamically.

3.2.6. Core-Edge Operation

   A common style of router deployment in ISPs is the "core-edge"
   deployment, whereby there is a core of high-capacity routers
   surrounded by potentially lower-capacity "edge" routers that may not
   carry the whole DFRT, and which default route to a core router. VA
   can support this style of configuration by effectively defining a
   single VP as 0/0, and by defining core routers to be APRs for 0/0.
   This results in core routers maintaining full FIBs, and edge routers
   having potentially extremely small FIBs. The advantage of using VA
   to support core-edge topologies is that, with VA, any edge router,
   including those peering with other ISPs, can have a small FIB. Today
   such routers must maintain the full DFRT in order to peer.

   Vendors may wish to facilitate configuration of a core-edge style of
   VA for their customers that already use a core-edge topology. In
   other words, a vendor may wish to simplify the VA configuration task
   so that a customer merely needs to configure which of its routers are
   core and which are edge, and the appropriate VA configuration, i.e.
   the VP-List, tunnels, and popular prefixes, is automatically done
   "under the hood" so to speak. Note that, under a core-edge
   configuration, it isn't strictly speaking necessary for core routers
   to advertise the 0/0 VP within BGP. Rather, edge routers could rely
   on their default route to a core router.

3.3. Requirements Discussion

   This section describes the extent to which VA satisfies the list of
   requirements given in Section 3.1.

3.3.1. Response to router failure

   VA introduces a new failure mode in the form of Aggregation Point
   Router (APR) failure. There are two basic approaches to protecting
   against APR failure: static APR redundancy and dynamic APR
   assignment (see Section 3.2.2.3.1). In static APR redundancy, enough
   APRs are assigned for each Virtual Prefix (VP) so that if one goes
   down, there are others to absorb its load. Failover to a static
   redundant APR is automatic with existing BGP mechanisms. If an APR
   crashes, BGP will cause packets to be routed to the next nearest APR.
   Nevertheless, there are three concerns here: convergence time, load
   increase at the redundant APR, and latency increase for diverted
   flows.

   Regarding convergence time, note that, while fast-reroute mechanisms
   apply to the rerouting of packets to a given APR or egress router,
   they don't apply to APR failure. Convergence time was discussed in
   Section 3.2.2.4, which suggested that BGP convergence times are
   likely to be adequate, and that if they are not, IGP mechanisms may
   be used.

   Regarding load increase, in general this is relatively small. This
   is because substantial reductions in FIB size can be achieved with
   almost negligible increase in load. For instance, [nsdi09] shows
   that a 5x reduction in FIB size yields a less than one percent
   increase in load overall.
   Given this, depending on the configuration of redundant APRs, failure
   of one APR increases the load of its backups by only a few percent.
   This is well within the variation seen in normal traffic loads.

   Regarding latency increase, some flows may see a significant increase
   in delay (and, specifically, an increase that puts them outside of
   their SLA boundaries). Normally a redundant APR would be placed
   within the same POP, and so increased latency would be minimal
   (assuming that load is also quite small, and so there is no
   significant queuing delay). It is not always possible, however, to
   have an APR for every VP within every POP, much less a redundant APR
   within every POP, and so sometimes failure of an APR will result in
   significant latency increases for a small fraction of traffic.

3.3.2. Traffic Engineering

   VA complicates traffic engineering because the placement of APRs and
   selection of popular prefixes influences how packets flow. (Though
   to repeat, increased load is likely to be minimal, and so the effect
   on traffic engineering should not be great in any event.) Since the
   majority of packets may be forwarded by popular prefixes (and
   therefore follow the shortest path), it is particularly important
   that popular prefixes be selected appropriately. As discussed in
   Section 3.2.5.1, there are static and dynamic approaches to this.
   [nsdi09] shows that high-volume prefixes tend to stay high-volume for
   many days, and so a static strategy is probably adequate. VA can
   operate correctly using either RSVP-TE [RFC3209] or LDP to establish
   tunnels.

3.3.3. Incremental and safe deployment and start-up

   It must be possible to install and configure VA in a safe and
   incremental fashion, as well as start it up when routers reboot.
   This document allows for a mixture of VA and legacy routers, allows a
   fraction or all of the address space to fall within virtual prefixes,
   and allows different routers to suppress different FIB entries
   (including none at all). As a result, it is generally possible to
   deploy and test VA in an incremental fashion.

3.3.4. VA security

   Regarding ingress filtering, because in VA the RIB is effectively
   unchanged, routers contain the same information they have today for
   installing ingress filters [RFC2827]. Presumably, installing an
   ingress filter in the FIB takes up some memory space. Since ingress
   filtering is most effective at the "edge" of the network (i.e. at the
   customer interface), the number of FIB entries for ingress filtering
   should remain relatively small, equal to the number of prefixes
   owned by the customer. Whether this is true in all cases remains for
   further study.

   Regarding DoS attacks, there are two issues that need to be
   considered. First, does VA result in new types of DoS attacks?
   Second, does VA make it more difficult to deploy DoS defense systems?

   Regarding the first issue, one possibility is that an attacker
   targets a given router by flooding the network with traffic to
   prefixes that are not popular, and for which that router is an APR.
   This would cause a disproportionate amount of traffic to be forwarded
   to the APR(s). While it is up to individual ISPs to decide if this
   attack is a concern, it does not strike the authors that this attack
   is likely to significantly worsen the DoS problem.

   Regarding DoS defense system deployment, more input about specific
   systems is needed. It is the authors' understanding, however, that
   at least some of these systems use dynamically established Routing
   Table entries to divert victims' traffic into LSPs that carry the
   traffic to scrubbers.
   The expectation is that this mechanism simply over-rides whatever
   route is in place (with or without VA), and so the operation of VA
   should not limit the deployment of these types of DoS defense
   systems. Nevertheless, more study is needed here.

3.4. New Configuration

   VA places new configuration requirements on ISP administrators.
   Namely, the administrator must:

   1. Select VPs, and configure the VP-List into all VA routers. As a
      general rule, having a larger number of relatively small VPs
      gives administrators the most flexibility in terms of filling
      available FIB with sub-prefixes, and in terms of balancing load
      across routers. Once an administrator has selected a VP-List, it
      is just as easy to configure routers with a large list as a small
      list. We can expect network operator groups like NANOG to
      compile good VP-Lists that ISPs can then adopt. A good list
      would be one where the number of VPs is relatively large, say 100
      or so (noting again that each VP must be larger than any real
      prefix, per Section 3.2.2.2), and the number of sub-prefixes
      within each VP is roughly the same.
   2. Select and configure APRs. There are three primary
      considerations here. First, there must be enough APRs to handle
      reasonable APR failure scenarios. Second, APR assignment should
      not result in router overload. Third, particularly long paths
      should be avoided. Ideally there should be two APRs for each VP
      within each POP, but this may not be possible for small POPs.
      Failing this, there should be at least two APRs in each
      geographical region, so as to minimize path length increase.
      Routers should have the appropriate counters to allow
      administrators to know the volume of APR traffic each router is
      handling so as to adjust load by adding or removing APR
      assignments.
   3. Select and configure Popular Prefixes or Popular Prefix policies.
      There are two general goals here. The first is to minimize load
      overall by minimizing the number of packets that take longer
      paths. The second is to ensure that specific selected prefixes
      don't have overly long paths. These goals must be weighed
      against the administrative overhead of configuring potentially
      thousands of popular prefixes. As one example, a small ISP may
      wish to keep it simple by doing nothing more than indicating that
      customer routes should be installed. In this case, the
      administrator could otherwise assign as many APRs as possible
      while leaving enough FIB space for customer routes. As another
      example, a large ISP could build a management system that takes
      into consideration the traffic matrix, customer SLAs, robustness
      requirements, FIB sizes, topology, and router capacity, and
      periodically automatically computes APR and popular prefix
      assignments.

4. IANA Considerations

   There are no IANA considerations.

5. Security Considerations

   We consider the security implications of VA under two scenarios, one
   where VA is configured and operated correctly, and one where it is
   mis-configured. A cornerstone of VA operation is that the basic
   behavior of BGP doesn't change, especially inter-domain. Among other
   things, this makes it easier to reason about security.

5.1. Properly Configured VA

   If VA is configured and operated properly, then the external behavior
   of an AS does not change. The same upstream ASes are selected, and
   the same prefixes and AS-paths are advertised. Therefore, a properly
   configured VA domain has no security impact on other domains.

   This document discusses intra-domain security concerns in
   Section 3.3.4, which argues that any new security concerns appear to
   be relatively minor.

   If another ISP starts advertising a prefix that is larger than a
   given VP, this prefix will be ignored by APRs that have a VP that
   falls within the larger prefix (Section 3.2.2.3). As a result,
   packets that might otherwise have been routed to the new larger
   prefix will be dropped at the APRs. Note that the trend in the
   Internet is towards large prefixes being broken up into smaller ones,
   not the reverse. Therefore, such a larger prefix is likely to be
   invalid. If it is determined without a doubt that the larger prefix
   is valid, then the ISP will have to reconfigure its VPs.

5.2. Mis-configured VA

   VA introduces the possibility that a VP is advertised outside of an
   AS. This in fact should be a low probability event, but it is
   considered here nonetheless.

   If an AS leaks a large VP (i.e. larger than any real prefixes), then
   the impact is minimal. Smaller prefixes will be preferred because of
   best-match semantics, and so the only impact is that packets that
   otherwise have no matching routes will be sent to the misbehaving AS
   and dropped there. If an AS leaks a small VP (i.e. smaller than a
   real prefix), then packets matching that VP will be hijacked by the
   misbehaving AS and dropped. This can happen with or without VA, and
   so doesn't represent a new security problem per se.

6. Acknowledgements

   The authors would like to acknowledge the efforts of Xinyang Zhang
   and Jia Wang, who worked on CRIO (Core Router Integrated Overlay), an
   early inter-domain variant of FIB suppression, and the efforts of
   Hitesh Ballani and Tuan Cao, who worked on the configuration-only
   variant of VA that works with legacy routers. We would also like to
   thank Scott Brim, Daniel Ginsburg, and Rajiv Asati for their helpful
   comments.
   In particular, Daniel's comments significantly simplified the spec
   (eliminating the need for a new External Communities Attribute).

7. References

7.1. Normative References

   [RFC1997]  Chandrasekeran, R., Traina, P., and T. Li, "BGP
              Communities Attribute", RFC 1997, August 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [RFC2827]  Ferguson, P. and D. Senie, "Network Ingress Filtering:
              Defeating Denial of Service Attacks which employ IP Source
              Address Spoofing", BCP 38, RFC 2827, May 2000.

   [RFC3107]  Rekhter, Y. and E. Rosen, "Carrying Label Information in
              BGP-4", RFC 3107, May 2001.

   [RFC3209]  Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V.,
              and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP
              Tunnels", RFC 3209, December 2001.

   [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
              Protocol 4 (BGP-4)", RFC 4271, January 2006.

7.2. Informative References

   [nsdi09]   Ballani, H., Francis, P., Cao, T., and J. Wang, "Making
              Routers Last Longer with ViAggre", ACM Usenix NSDI 2009,
              http://www.usenix.org/events/nsdi09/tech/full_papers/ballani/ballani.pdf,
              April 2009.

Authors' Addresses

   Paul Francis
   Max Planck Institute for Software Systems
   Gottlieb-Daimler-Strasse
   Kaiserslautern 67633
   Germany

   Phone: +49 631 930 39600
   Email: francis@mpi-sws.org

   Xiaohu Xu
   Huawei Technologies
   No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District
   Beijing, Beijing 100085
   P.R.China

   Phone: +86 10 82836073
   Email: xuxh@huawei.com

   Hitesh Ballani
   Cornell University
   4130 Upson Hall
   Ithaca, NY 14853
   US

   Phone: +1 607 279 6780
   Email: hitesh@cs.cornell.edu

   Dan Jen
   UCLA
   4805 Boelter Hall
   Los Angeles, CA 90095
   US

   Phone:
   Email: jenster@cs.ucla.edu

   Robert Raszuk
   Self

   Phone:
   Email: robert@raszuk.net

   Lixia Zhang
   UCLA
   3713 Boelter Hall
   Los Angeles, CA 90095
   US

   Phone:
   Email: lixia@cs.ucla.edu