idnits 2.17.1 draft-ietf-grow-va-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** The document seems to lack a License Notice according IETF Trust Provisions of 28 Dec 2009, Section 6.b.i or Provisions of 12 Sep 2009 Section 6.b -- however, there's a paragraph with a matching beginning. Boilerplate error? (You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Feb 2009 rather than one of the newer Notices. See https://trustee.ietf.org/license-info/.) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: 2. If the router is an APR, a route for every sub-prefix within the VP MUST be FIB-installed (subject to the above limitation that there be a tunnel). 3. If a non-APR router has a sub-prefix route that does not fall within any VP (as determined by the VP-List), then the route must be installed. 
This may occur because the ISP hasn't defined a VP covering that prefix, for instance during an incremental deployment buildup. 4. If a non-APR router does not have a route for a known VP, then it MAY or MAY NOT install sub-prefixes within that VP. Whether or not it does is up to the vendor and the network operator. One approach is to never install such sub-prefixes, on the assumption that the network operator will engineer his network so that this rarely if ever happens. 5. Another approach is to have routers install such sub-prefixes, but taking care not to do so if the missing VP route is a transient condition. For instance, if the router is booting up, and simply has not yet received all of its routes, then it can reasonably expect to receive a VP route soon and so SHOULD NOT install the sub-prefixes. On the other hand, if a continuously operating router had only a single remaining route for the VP, and that route is withdrawn, then the router might not expect to receive a replacement VP route soon and so SHOULD install the sub-prefixes. Obviously a router can't predict the future with certainty, so the following algorithm might be a useful way to manage whether or not to install sub-prefixes for a non-existing VP route: * Define a timer MISSING_VP_TIMER, set for a relatively short time (say 10 seconds or so). * Start the timer when either: 1) the last VP route is withdrawn, or 2) there are initially neither VP routes nor sub-prefix routes, and the first sub-prefix route is received. * When the timer expires, install sub-prefix routes. Note, however, that optional routes may first need to be removed from the FIB to make room for the new sub-prefix routes. If even after removing optional routes there is no room in the FIB for sub-prefix routes, then they should remain suppressed. In other words, sub-prefix entries required by virtue of being an APR take priority over sub-prefix entries required by virtue of not having a VP route. 6. 
All other sub-prefix routes MAY be suppressed. Such "optional" sub-prefixes that are nevertheless installed are referred to as popular prefixes. -- The document date (October 25, 2009) is 5290 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'RFC2328' is defined on line 1151, but no explicit reference was found in the text == Unused Reference: 'RFC3107' is defined on line 1157, but no explicit reference was found in the text ** Obsolete normative reference: RFC 3107 (Obsoleted by RFC 8277) ** Obsolete normative reference: RFC 4601 (Obsoleted by RFC 7761) Summary: 3 errors (**), 0 flaws (~~), 4 warnings (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group P. Francis 3 Internet-Draft MPI-SWS 4 Intended status: Informational X. Xu 5 Expires: April 28, 2010 Huawei 6 H. Ballani 7 Cornell U. 8 D. Jen 9 UCLA 10 R. Raszuk 11 Self 12 L. Zhang 13 UCLA 14 October 25, 2009 16 FIB Suppression with Virtual Aggregation 17 draft-ietf-grow-va-01.txt 19 Status of this Memo 21 This Internet-Draft is submitted to IETF in full conformance with the 22 provisions of BCP 78 and BCP 79. 24 Internet-Drafts are working documents of the Internet Engineering 25 Task Force (IETF), its areas, and its working groups. Note that 26 other groups may also distribute working documents as Internet- 27 Drafts. 29 Internet-Drafts are draft documents valid for a maximum of six months 30 and may be updated, replaced, or obsoleted by other documents at any 31 time. It is inappropriate to use Internet-Drafts as reference 32 material or to cite them other than as "work in progress." 34 The list of current Internet-Drafts can be accessed at 35 http://www.ietf.org/ietf/1id-abstracts.txt. 
37 The list of Internet-Draft Shadow Directories can be accessed at 38 http://www.ietf.org/shadow.html. 40 This Internet-Draft will expire on April 28, 2010. 42 Copyright Notice 44 Copyright (c) 2009 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents in effect on the date of 49 publication of this document (http://trustee.ietf.org/license-info). 50 Please review these documents carefully, as they describe your rights 51 and restrictions with respect to this document. 53 Abstract 55 The continued growth in the Default Free Routing Table (DFRT) 56 stresses the global routing system in a number of ways. One of the 57 most costly stresses is FIB size: ISPs often must upgrade router 58 hardware simply because the FIB has run out of space, and router 59 vendors must design routers that have adequate FIB. FIB suppression 60 is an approach to relieving stress on the FIB by NOT loading selected 61 RIB entries into the FIB. Virtual Aggregation (VA) allows ISPs to 62 shrink the FIBs of any and all routers, easily by an order of 63 magnitude with negligible increase in path length and load. FIB 64 suppression can be deployed autonomously by an ISP (cooperation between ISPs 65 is not required), and can co-exist with legacy routers in the ISP. 67 Table of Contents 69 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 70 1.1. Scope of this Document . . . . . . . . . . . . . . . . . . 4 71 1.2. Requirements notation . . . . . . . . . . . . . . . . . . 5 72 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 73 1.4. Temporary Sections . . . . . . . . . . . . . . . . . . . . 6 74 1.4.1. Document revisions . . . . . . . . . . . . . . . . . . 6 75 2. Overview of Virtual Aggregation (VA) . . . . . . . . . . . . . 8 76 2.1. Mix of legacy and VA routers . . . . . . . . . . . . . . . 10 77 2.2. Summary of Tunnels and Paths . . 
. . . . . . . . . . . . . 10 78 3. Specification of VA . . . . . . . . . . . . . . . . . . . . . 12 79 3.1. Requirements for VA . . . . . . . . . . . . . . . . . . . 12 80 3.2. VA Operation . . . . . . . . . . . . . . . . . . . . . . . 13 81 3.2.1. Legacy Routers . . . . . . . . . . . . . . . . . . . . 13 82 3.2.2. Advertising and Handling Virtual Prefixes (VP) . . . . 13 83 3.2.3. Border VA Routers . . . . . . . . . . . . . . . . . . 17 84 3.2.4. Advertising and Handling Sub-Prefixes . . . . . . . . 18 85 3.2.5. Suppressing FIB Sub-prefix Routes . . . . . . . . . . 18 86 3.2.6. Core-Edge Operation . . . . . . . . . . . . . . . . . 20 87 3.3. Requirements Discussion . . . . . . . . . . . . . . . . . 21 88 3.3.1. Response to router failure . . . . . . . . . . . . . . 21 89 3.3.2. Traffic Engineering . . . . . . . . . . . . . . . . . 22 90 3.3.3. Incremental and safe deploy and start-up . . . . . . . 22 91 3.3.4. VA security . . . . . . . . . . . . . . . . . . . . . 22 92 3.4. New Configuration . . . . . . . . . . . . . . . . . . . . 23 93 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 24 94 5. Security Considerations . . . . . . . . . . . . . . . . . . . 24 95 5.1. Properly Configured VA . . . . . . . . . . . . . . . . . . 24 96 5.2. Mis-configured VA . . . . . . . . . . . . . . . . . . . . 25 97 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 25 98 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 25 99 7.1. Normative References . . . . . . . . . . . . . . . . . . . 25 100 7.2. Informative References . . . . . . . . . . . . . . . . . . 26 101 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 26 103 1. Introduction 105 ISPs today manage constant DFRT growth in a number of ways. Most 106 commonly, ISPs will upgrade their router hardware before DFRT growth 107 outstrips the size of the FIB. 
In cases where an ISP wants to 108 continue to use routers whose FIBs are not large enough, it may 109 deploy them at edge locations where a full DFRT is not needed, for 110 instance at the customer interface. Packets for which there is no 111 route are defaulted to a "core" infrastructure that does contain the 112 full DFRT. While this helps, it cannot be used for all edge routers, 113 for instance those that interface with other ISPs. Alternatively, 114 some lower-tier ISPs may simply ignore some routes, for instance 115 /24's that fall within the aggregate of another route. 117 FIB Suppression is an approach to shrinking FIB size that requires no 118 changes to BGP, no changes to packet forwarding mechanisms in 119 routers, and relatively minor changes to control mechanisms in 120 routers and configuration of those mechanisms. The core idea behind 121 FIB suppression is to run BGP as normal, and in particular to not 122 shrink the RIB, but rather to not load certain RIB entries into the 123 FIB. This approach minimizes changes to routers, and in particular 124 is simpler than more general routing architectures that try to shrink 125 both RIB and FIB. With FIB suppression, there are no changes to BGP 126 per se. The BGP decision process does not change. The selected AS- 127 path does not change, and except on rare occasion the exit router 128 does not change. ISPs can deploy FIB suppression autonomously and 129 with no coordination with neighbor ASes. 131 This document describes an approach to FIB suppression called 132 "Virtual Aggregation" (VA). VA operates by organizing the IP (v4 or 133 v6) address space into Virtual Prefixes (VP), and using tunnels to 134 aggregate the (regular) sub-prefixes within each VP. The decrease in 135 FIB size can be dramatic, easily 5x or 10x with only a slight path 136 length and router load increase [nsdi09]. 
The VPs can be organized 137 such that all routers in an ISP see FIB size decrease, or in such a 138 way that "core" routers keep the full FIB, and "edge" routers have 139 almost no FIB (i.e. by defining a VP of 0/0). 141 1.1. Scope of this Document 143 The scope of this document is limited to Intra-domain VA operation. 144 In other words, the case where a single ISP autonomously operates VA 145 internally without any coordination with neighboring ISPs. 147 Note that this document assumes that the VA "domain" (i.e. the unit 148 of autonomy) is the AS (that is, different ASes run VA independently 149 and without coordination). For the remainder of this document, the 150 terms ISP, AS, and domain are used interchangeably. 152 This document applies equally to IPv4 and IPv6. 154 VA may operate with a mix of upgraded routers and legacy routers. 155 There are no topological restrictions placed on the mix of routers. 156 In order to avoid loops between upgraded and legacy routers, however, 157 legacy routers must be able to terminate tunnels. 159 This document is agnostic about what type of tunnel may be used for 160 VA, and does not specify a tunnel type per se. Rather, it refers 161 generically to tunnels and specifies the minimum set of requirements 162 that a given tunnel type must satisfy. Separate documents are used 163 to specify the operation of VA for specific tunnel types. 165 1.2. Requirements notation 167 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 168 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 169 document are to be interpreted as described in [RFC2119]. 171 1.3. Terminology 173 Aggregation Point Router (APR): An Aggregation Point Router (APR) is 174 a router that aggregates a Virtual Prefix (VP) by installing 175 routes (into the FIB) for all of the sub-prefixes within the VP. 176 APRs advertise the VP to other routers with BGP. 
For each sub- 177 prefix within the VP, APRs have a tunnel from themselves to the 178 remote ASBR (Autonomous System Border Router) where packets for 179 that prefix should be delivered. 180 Install and Suppress: The terms "install" and "suppress" are used to 181 describe whether a RIB entry has been loaded or not loaded into 182 the FIB. In other words, the phrase "install a route" means 183 "install a route into the FIB", and the phrase "suppress a route" 184 means "do not install a route into the FIB". 185 Legacy Router: A router that does not run VA, and has no knowledge 186 of VA. Legacy routers, however, must be able to terminate 187 tunnels. (If a Legacy router cannot terminate tunnels, then any 188 routes that are reached via that router must be installed in all 189 FIBs.) 190 non-APR Router: In discussing VPs, it is often necessary to 191 distinguish between routers that are APRs for that VP, and routers 192 that are not APRs for that VP (but of course may be APRs for other 193 VPs not under discussion). In these cases, the term "APR" is 194 taken to mean "a VA router that is an APR for the given VP", and 195 the term "non-APR" is taken to mean "a VA router that is not an 196 APR for the given VP". The term non-APR router is not used to 197 refer to legacy routers. 199 Popular Prefix: A popular prefix is a sub-prefix that is installed 200 in a router in addition to the sub-prefixes it holds by virtue of 201 being an Aggregation Point Router. The popular prefix allows 202 packets to follow the shortest path. Note that different routers 203 do not need to have the same set of popular prefixes. 204 Routing Information Base (RIB): The term RIB is used loosely 205 in this document to refer either to the Loc-RIB (as used in 206 [RFC4271]), or to the combined Adj-RIBs-In, the Loc-RIB, and the 207 Adj-RIBs-Out. 208 Sub-Prefix: A regular (physically aggregatable) prefix. 
These are 209 equivalent to the prefixes that would normally comprise the DFRT 210 in the absence of VA. A VA router will contain a sub-prefix entry 211 either because the sub-prefix falls within a virtual prefix for 212 which the router is an APR, or because the sub-prefix is installed 213 as a popular prefix. Legacy routers hold the same sub-prefixes 214 they hold today. 215 Tunnel: VA can use a variety of tunnel types: MPLS LSPs, IP-in-IP, 216 GRE, L2TP, and so on. This document does not describe how any 217 given tunnel information is conveyed: that is left for companion 218 documents. This document uses the term tunnel to refer to any 219 appropriate tunnel type. 220 VA router: A router that operates Virtual Aggregation according to 221 this document. 222 Virtual Prefix (VP): A Virtual Prefix (VP) is a prefix used to 223 aggregate its contained regular prefixes (sub-prefixes). A VP is 224 not physically aggregatable, and so it is aggregated at APRs 225 through the use of tunnels. 226 VP-List: A list of defined VPs. All routers must agree on the 227 contents of this list (which is statically configured into every 228 VA router). 230 1.4. Temporary Sections 232 This section contains temporary information, and will be removed in 233 the final version. 235 1.4.1. Document revisions 237 This document was previously published as both 238 draft-francis-idr-intra-va-01.txt and draft-francis-intra-va-01.txt. 240 1.4.1.1. Revisions from the 00 version of draft-ietf-grow-va-00 242 Removed the notion that FIB suppression can be done by suppressing 243 entries from the Routing Table (as defined in Section 3.2 of 244 [RFC4271]), an idea that was introduced in the second version of the 245 draft. Suppressing from the Routing Table breaks PIM-SM, which 246 relies on the contents of the Routing Table to produce its forwarding 247 table. 249 1.4.1.2. 
Revisions from the 00 version (of 250 draft-francis-intra-va-00.txt) 252 Added additional authors (Jen, Raszuk, Zhang), to reflect primary 253 contributors moving forwards. In addition, a number of minor 254 clarifications were made. 256 1.4.1.3. Revisions from the 01 version (of 257 draft-francis-idr-intra-va-01.txt) 259 1. Changed file name from draft-francis-idr-intra-va to 260 draft-francis-intra-va. 261 2. Restructured the document to make the edge suppression mode a 262 specific sub-case of VA rather than a separate mode of operation. 263 This includes modifying the title of the draft. 264 3. Removed MPLS tunneling details so that specific tunneling 265 approaches can be described in separate documents. 267 1.4.1.4. Revisions from 00 version 269 o Changed intended document type from STD to BCP, as per advice from 270 Dublin IDR meeting. 271 o Cleaned up the MPLS language, and specified that the full-address 272 routes to remote ASBRs must be imported into OSPF (Section 3.2.3). 273 As per Daniel Ginsburg's email 274 http://www.ietf.org/mail-archive/web/idr/current/msg02933.html. 275 o Clarified that legacy routers must run MPLS. As per Daniel 276 Ginsburg's email 277 http://www.ietf.org/mail-archive/web/idr/current/msg02935.html. 278 o Fixed LOCAL_PREF bug. As per Daniel Ginsburg's email 279 http://www.ietf.org/mail-archive/web/idr/current/msg02940.html. 280 o Removed the need for the extended communities attribute on VP 281 routes, and added the requirement that all VA routers be 282 statically configured with the complete list of VPs. As per 283 Daniel Ginsburg's emails 284 http://www.ietf.org/mail-archive/web/idr/current/msg02940.html and 285 http://www.ietf.org/mail-archive/web/idr/current/msg02958.html. 286 In addition, the procedure for adding, deleting, splitting, and 287 merging VPs was added. As part of this, the possibility of having 288 overlapping VPs was added. 
289 o Added the special case of a core-edge topology with default routes 290 to the edge as suggested by Robert Raszuk in email 291 http://www.ietf.org/mail-archive/web/idr/current/msg02948.html. 292 Note that this altered the structure and even title of the 293 document. 295 o Clarified that FIB suppression can be achieved by not loading 296 entries into the Routing Table, as suggested by Rajiv Asati in 297 email 298 http://www.ietf.org/mail-archive/web/idr/current/msg03019.html. 300 2. Overview of Virtual Aggregation (VA) 302 For descriptive simplicity, this section starts by describing VA 303 assuming that there are no legacy routers in the domain. Section 2.1 304 overviews the additional functions required by VA routers to 305 accommodate legacy routers. 307 A key concept behind VA is to operate BGP as normal, and in 308 particular to populate the RIB with the full DFRT, but to suppress 309 many or most prefixes from being loaded into the FIB. By populating 310 the RIB as normal, we avoid any changes to BGP, and changes to router 311 operation are relatively minor. The basic idea behind VA is quite 312 simple. The address space is partitioned into large prefixes --- 313 larger than any aggregatable prefix in use today. These prefixes are 314 called virtual prefixes (VP). Different VPs do not need to be the 315 same size. They may be a mix of /6, /7, /8 (for IPv4), and so on. 316 Indeed, an ISP can define a single /0 VP, and use it for a core/edge 317 type of configuration (commonly seen today). That is, the core 318 routers would maintain full FIBs, and edge routers could maintain 319 default routes to the core routers, and suppress as much of the FIB 320 as they wish. Each ISP can independently select the size of its VPs. 322 VPs are not themselves topologically aggregatable. VA makes the VPs 323 aggregatable through the use of tunnels, as follows. Associated with 324 each VP are one or more "Aggregation Point Routers" (APR). 
An APR 325 (for a given VP) is a router that installs routes for all sub- 326 prefixes (i.e. real physically aggregatable prefixes) within the VP. 327 By "install routes" here, we mean: 329 1. the route for each of the sub-prefixes is loaded into the FIB, 330 and 331 2. there is a tunnel from the APR to the BGP NEXT_HOP for the route. 333 The APR originates a BGP route to the VP. This route is distributed 334 within the domain, but not outside the domain. With this structure 335 in place, a packet transiting the ISP goes from the ingress router to 336 the APR (possibly via a tunnel), and then from the APR to the BGP 337 NEXT_HOP router via a tunnel. 339 Normally the BGP NEXT_HOP is the remote ASBR. In this case, even 340 though the remote ASBR is the tunnel endpoint, the tunnel header is 341 stripped by the local ASBR before the packet is delivered to the 342 remote ASBR. In other words, the remote ASBR sees a normal IP 343 packet, and is completely unaware of the existence of VA in the 344 neighboring ISP. The exception to this is legacy local ASBR routers. 345 In this case, the legacy router is the BGP NEXT_HOP, and packets are 346 tunneled to the legacy router, which then uses a FIB lookup to 347 deliver the packet to the appropriate remote ASBR. This applies only 348 to legacy routers that can convey tunnel parameters and detunnel 349 packets. 351 Note that the AS-path is not affected at all by VA. This means among 352 other things that AS-level policies are not affected by VA. The 353 packet may not, however, follow the shortest path within the ISP 354 (where shortest path is defined here as the path that would have been 355 taken if VA were not operating), because the APR may not be on the 356 shortest path between the ingress and egress routers. When this 357 happens, the packet experiences additional latency and creates extra 358 load (by virtue of taking more hops than it otherwise would have). 
359 Note also that, with VA, a packet may occasionally take a different 360 exit point than it otherwise would have. 362 VA can avoid traversing the APR for selected routes by installing 363 these routes in non-APR routers. In other words, even if an ingress 364 router is not an APR for a given sub-prefix, it may install that sub- 365 prefix into its FIB. Packets in this case are tunneled directly from 366 the ingress to the BGP NEXT_HOP. These extra routes are called 367 "Popular Prefixes", and are typically installed for policy reasons 368 (i.e. customer routes are always installed), or for sub-prefixes that 369 carry a high volume of traffic (Section 3.2.5.1). Different routers 370 may have different popular prefixes. As such, an ISP may assign 371 popular prefixes per router, per POP, or uniformly across the ISP. A 372 given router may have zero popular prefixes, or the majority of its 373 FIB may consist of popular prefixes. The effectiveness of popular 374 prefixes to reduce traffic load relies on the fact that traffic 375 volumes follow something like a power-law distribution: i.e. that 90% 376 of traffic is destined to 10% of the destinations. Internet traffic 377 measurement studies over the years have consistently shown that 378 traffic patterns follow this distribution, though there is no 379 guarantee that they always will. 381 Note that for routing to work properly, every packet must sooner or 382 later reach a router that has installed a sub-prefix route that 383 matches the packet. This would obviously be the case for a given 384 sub-prefix if every router has installed a route for that sub-prefix 385 (which of course is the situation in the absence of VA). If this is 386 not the case, then there must be at least one Aggregation Point 387 Router (APR) for the sub-prefix's virtual prefix (VP). Ideally, 388 every POP contains at least two APRs for every virtual prefix. 
By 389 having APRs in every POP, the latency imposed by routing to the APR 390 is minimal (the extra hop is within the POP). By having more than 391 one APR, there is a redundant APR should one fail. In practice it is 392 often not possible to have an APR for every VP in every POP. This is 393 because some POPs may have only one or a few routers, and therefore 394 there may not be enough cumulative FIB space in the POP to hold 395 every sub-prefix. Note that any router ("edge", "core", etc.) may be 396 an APR. 398 It is important that both the contents of BGP RIBs and the 399 contents of the Routing Table (as defined in Section 3.2 of 400 [RFC4271]) not be modified by VA (other than the introduction of 401 routes to VPs). This is because PIM-SM [RFC4601] relies on the 402 contents of the Routing Table to build its own trees and forwarding 403 table. Therefore, FIB suppression must take place between the 404 Routing Table and the actual FIB(s). 406 2.1. Mix of legacy and VA routers 408 It is important that an ISP be able to operate with a mix of "VA 409 routers" (routers upgraded to operate VA as described in this 410 document) and "legacy routers". This allows ISPs to deploy VA in an 411 incremental fashion and to continue to use routers that for whatever 412 reason cannot be upgraded. This document allows such a mix, and 413 indeed places no topological restrictions on that mix. It does, 414 however, require that legacy routers (and VA routers for that matter) 415 are able to forward already-tunneled packets, are able to serve as 416 tunnel endpoints, and are able to participate in distribution of 417 tunnel information required to establish themselves as tunnel 418 endpoints. (This is listed as Requirement R5 in the companion 419 tunneling documents.) Depending on the tunnel type, legacy routers 420 may also be able to generate tunneled packets, though this is an 421 optional requirement. 
(This is listed as Requirement R4 in the 422 companion tunneling documents.) Legacy routers must use their own 423 address as the BGP NEXT_HOP, and must FIB-install routes for which 424 they are the BGP NEXT_HOP. 426 2.2. Summary of Tunnels and Paths 428 To summarize, the following tunnels are created: 430 1. From all VA routers to all BGP NEXT_HOP addresses (where the BGP 431 NEXT_HOP address is either an APR, a legacy router, or the remote 432 ASBR neighbor of a VA router). Note that this is listed as 433 Requirement R3 in the companion tunneling documents. 434 2. Optionally, from all legacy routers to all BGP NEXT_HOP 435 addresses. 436 There are a number of possible paths that packets may take through an 437 ISP, summarized in the following diagram. Here, "VA" is a VA router, 438 "LR" is a legacy router, the symbol "==>" represents a tunneled 439 packet (through zero or more routers), "-->" represents an untunneled 440 packet, and "(pop)" represents stripping the tunnel header. The 441 symbol "::>" represents the portion of the path where although the 442 tunnel is targeted to the receiving node, the outer header has been 443 stripped. (Note that the remote ASBR may actually be a legacy router 444 or a VA router---it doesn't matter (and isn't known) to the ISP.) 446 Egress 447 Router 448 Ingress Some APR (Local Remote 449 Router Router Router ASBR) ASBR 450 ------- ------ ------ ------ -------- 451 1. VA===================>VA=========>VA(pop)::::>LR 453 2. VA===================>VA=========>LR--------->LR 455 3. VA===============================>VA(pop)::::>LR 457 4. VA===============================>LR--------->LR 459 (The following two exist in the case where legacy routers 460 can initiate tunneled packets.) 462 5. LR===============================>VA(pop)::::>LR 464 6. LR===============================>LR--------->LR 466 (The following two exist in the case where legacy routers 467 cannot initiate tunneled packets.) 469 7. 
LR------->VA (remaining paths as in 1 to 4 above) 471 8. LR------->LR--------------------->LR--------->LR 473 The first and second paths represent the case where the ingress 474 router does not have a popular prefix for the destination, and must 475 tunnel the packet to an APR. The third and fourth paths represent 476 the case where the ingress router does have a popular prefix for the 477 destination, and so tunnels the packet directly to the egress. The 478 fifth and sixth paths are similar, but where the ingress is a legacy 479 router that can initiate tunneled packets, and effectively has the 480 popular prefix by virtue of holding the entire DFRT. (Note that some 481 ISPs have only partial RIBs in their customer-facing edge routers, 482 and default route to a router that holds the full DFRT. This case is 483 not shown here.) Finally, paths 7 and 8 represent the case where 484 legacy routers cannot initiate a tunneled packet. 486 VA prevents the routing loops that might otherwise occur when VA 487 routers and legacy routers are mixed. The trick is avoiding the case 488 where a legacy router is forwarding packets towards the BGP NEXT_HOP, 489 while a VA router is forwarding packets towards the APR, with each 490 router thinking that the other is on the shortest path to their 491 respective targets. 493 In the first four types of path, the loop is avoided because tunnels 494 are used all the way to the egress. As a result, there is never an 495 opportunity for a legacy router to try to route based on the 496 destination address unless the legacy router is the egress, in which 497 case it forwards the packet to the remote ASBR. 499 In the 5th and 6th cases, the ingress is a legacy router, but this 500 router can initiate tunnels and has the full FIB, and so simply 501 tunnels the packet to the egress router. 503 In the 7th and 8th cases, the legacy ingress cannot initiate tunnels, 504 and so forwards the packet hop-by-hop towards the BGP NEXT_HOP. 
The 505 packet will work its way towards the egress router, and will either 506 progress through a series of legacy routers (in which case the IGP 507 prevents loops), or it will eventually reach a VA router, after which 508 it will take tunnels as in the 1st and 2nd cases. 510 3. Specification of VA 512 This section describes in detail how to operate VA. It starts with a 513 brief discussion of requirements, followed by a specification of 514 router support for VA. 516 3.1. Requirements for VA 518 While the core requirement is of course to be able to manage FIB 519 size, this must be done in a way that: 520 o is robust to router failure, 521 o allows for traffic engineering, 522 o allows for existing inter-domain routing policies, 523 o operates in a predictable manner and is therefore possible to 524 test, debug, and reason about performance (i.e. establish SLAs), 525 o can be safely installed, tested, and started up, 526 o Can be configured and reconfigured without service interruption, 527 o can be incrementally deployed, and in particular can be operated 528 in an AS with a mix of VA-capable and legacy routers, 529 o accommodates existing security mechanisms such as ingress 530 filtering and DoS defense, 532 o does not introduce significant new security vulnerabilities. 533 In short, operation of VA must not significantly affect the way ISPs 534 operate their networks today. Section 3.3 discusses the extent to 535 which these requirements are met by the design presented in 536 Section 3.2. 538 3.2. VA Operation 540 In this section, the detailed operation of VA is specified. 542 3.2.1. Legacy Routers 544 VA can operate with a mix of VA and legacy routers. To avoid the 545 types of loops described in Section 2.2, however, legacy routers MUST 546 satisfy the following requirements: 548 1. When forwarding externally-received routes over iBGP, the BGP 549 NEXT_HOP attribute MUST be set to the legacy router itself. 550 2. 
       Legacy routers MUST be able to detunnel packets addressed to
       themselves at the BGP NEXT_HOP address.  They MUST also be able
       to convey the tunnel information needed by other routers to
       initiate tunneled packets to them.  This is listed as
       "Requirement R1" in the companion tunneling documents.  If a
       legacy router cannot detunnel and convey tunnel parameters, then
       the AS cannot use VA.
   3.  Legacy routers MUST be able to forward all tunneled packets.
   4.  Every legacy router MUST hold its complete FIB.  (Note, of
       course, that this FIB does not necessarily need to contain the
       full DFRT.  This might be the case, for instance, if the router
       is an edge router that defaults to a core router.)

   As long as legacy routers participate in tunneling as described
   above, there are no topological restrictions on the legacy routers.
   They may be freely mixed with VA routers without the possibility of
   forming sustained loops (Section 2.2).

3.2.2.  Advertising and Handling Virtual Prefixes (VP)

3.2.2.1.  Distinguishing VPs from Sub-prefixes

   VA routers must be able to distinguish VPs from sub-prefixes.  This
   is primarily in order to know which routes to install.  In
   particular, non-APR routers must know which prefixes are VPs before
   they receive routes for those VPs, for instance when they first boot
   up.  Otherwise they would start filling their FIBs with routes that
   they ultimately do not need to install (Section 3.2.5).  This leads
   to the following requirement:

   It MUST be possible to statically configure the complete list of VPs
   into all VA routers.  This list is known as the VP-List.

3.2.2.2.  Limitations on Virtual Prefixes

   From the point of view of best-match routing semantics, VPs are
   treated identically to any other prefix.  In other words, if the
   longest matching prefix is a VP, then the packet is routed towards
   the VP.
   If a packet matching a VP reaches an Aggregation Point Router (APR)
   for that VP, and the APR does not have a better matching route, then
   the packet is discarded by the APR (just as a router that originates
   any prefix will discard a packet that does not have a better match).

   The overall semantics of VPs, however, are subtly different from
   those of real prefixes (well, maybe not so subtly).  Without VA, when
   a router originates a route for a (real) prefix, the expectation is
   that the addresses within the prefix are within the originating AS
   (or a customer of the AS).  For VPs, this is not the case.  APRs
   originate VPs whose sub-prefixes exist in different ASes.  Because of
   this, it is important that VPs not be advertised across AS
   boundaries.

   It is up to individual domains to define their own VPs.  VPs MUST be
   "larger" (span a larger address space) than any real sub-prefix.  If
   a VP is smaller than a real prefix, then packets that match the real
   prefix will nevertheless be routed to an APR owning the VP, at which
   point the packet will be dropped if it does not match a sub-prefix
   within the VP (Section 5).

   (Note that, in principle, there are cases where a VP could be smaller
   than a real prefix.  This is where the egress router to the real
   prefix is a VA router.  In this case, the APR could theoretically
   tunnel the packet to the appropriate remote ASBR, which would then
   forward the packet correctly.  On the other hand, if the egress
   router is a legacy router, then the APR could not tunnel matching
   packets to the egress.  This is because the egress would view the VP
   as a better match, and would loop the packet back to the APR.  For
   this reason we require that VPs be larger than any real prefixes, and
   that APRs never install prefixes larger than a VP in their FIBs.)

   It is valid for a VP to be a subset of another VP.  For example, 20/7
   and 20/8 can both be VPs.
   In fact, this capability is necessary for "splitting" a VP without
   temporarily increasing the FIB size in any router (Section 3.2.2.5).

3.2.2.3.  Aggregation Point Routers (APR)

   Any router may be configured as an Aggregation Point Router (APR) for
   one or more Virtual Prefixes (VP).  For each VP for which a router is
   an APR, the router does the following:

   1.  The APR MUST originate a BGP route to the VP [RFC4271].  In this
       route, the NLRI are all of the VPs for which the router is an
       APR.  This is true even for VPs that are a subset of another VP.
       The ORIGIN is set to INCOMPLETE (value 2), the AS number of the
       APR's AS is used in the AS_PATH, and the BGP NEXT_HOP is set to
       the address of the APR.  The ATOMIC_AGGREGATE and AGGREGATOR
       attributes are not included.
   2.  The APR MUST attach a NO_EXPORT Communities Attribute [RFC1997]
       to the route.
   3.  The APR MUST be able to detunnel packets addressed to itself at
       its BGP NEXT_HOP address.  It MUST also be able to convey the
       tunnel information needed by other routers to initiate tunneled
       packets to it (Requirement R1).
   4.  If a packet is received at the APR whose best match is the VP
       (i.e. it matches the VP but not any sub-prefixes within the VP),
       then the packet MUST be discarded (see Section 3.2.2.2).  This
       can be accomplished by never installing a prefix larger than the
       VP into the FIB, or by installing the VP as a route to /dev/null.

3.2.2.3.1.  Selecting APRs

   An ISP is free to select APRs however it chooses.  The details of
   this are outside the scope of this document.  Nevertheless, a few
   comments are made here.  In general, APRs should be selected such
   that the distance to the nearest APR for any VP is small, ideally
   within the same POP.
   Depending on the number of routers in a POP, and the sizes of the
   FIBs in the routers relative to the DFRT size, it may not be possible
   for all VPs to be represented in a given POP.  In addition, there
   should be multiple APRs for each VP, again ideally in each POP, so
   that the failure of one does not unduly disrupt traffic.

   APRs may be (and probably should be) statically assigned.  They may
   also, however, be dynamically assigned, for instance in response to
   APR failure.  For instance, each router may be assigned as a backup
   APR for some other APR.  If the other APR crashes (as indicated by
   the withdrawal of its routes to its VPs), the backup APR can install
   the appropriate sub-prefixes and advertise the VP as specified above.
   Note that doing so may require it to first remove some popular
   prefixes from its FIB to make room.

   Note that, although VPs MUST be larger than real prefixes, there is
   intentionally no mechanism designed to automatically ensure that this
   is the case.  Such a mechanism would be dangerous.  For instance, if
   an ISP somewhere advertised a very large prefix (a /4, say), then
   this would cause APRs to throw out all VPs that are smaller than
   this.  For this reason, VPs must be set through static configuration
   only.

3.2.2.4.  Non-APR Routers

   A non-APR router MUST install at least the following routes:

   1.  Routes to VPs (identifiable using the VP-List).
   2.  Routes to the largest of any prefixes that contain a given VP.
       (Note that although this is not supposed to happen, if it does
       the non-APR should install it, with the effect that any addresses
       in the prefix not covered by VPs will be routed outside the
       domain.)
   3.  Routes to all prefixes that contain an address that is in part of
       the address space for which no VP is defined (i.e. as is done
       today without VA).
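   The three rules above can be illustrated with a short, non-normative
   sketch.  The function names (uncovered, required_routes) and the
   representation of the RIB as a plain list of IPv4 prefixes are
   assumptions made for illustration only; they are not part of this
   specification, and a real router would operate on its own RIB
   structures.

```python
import ipaddress


def uncovered(prefix, vp_list):
    """Return the parts of `prefix` that no VP in `vp_list` covers."""
    remaining = [prefix]
    for vp in vp_list:
        next_remaining = []
        for net in remaining:
            if net.subnet_of(vp):
                continue  # this piece lies entirely inside the VP
            if vp.subnet_of(net):
                # carve the VP out of this piece, keep the rest
                next_remaining.extend(net.address_exclude(vp))
            else:
                next_remaining.append(net)  # disjoint from this VP
        remaining = next_remaining
    return remaining


def required_routes(rib_prefixes, vp_list):
    """Prefixes a non-APR router must install (hypothetical helper)."""
    vps = set(vp_list)
    required = set()
    for p in rib_prefixes:
        if p in vps:
            required.add(p)  # rule 1: routes to VPs
        elif uncovered(p, vp_list):
            required.add(p)  # rule 3: address space with no VP defined
    for vp in vp_list:
        # rule 2: the largest prefix that contains a given VP
        covering = [p for p in rib_prefixes if p != vp and vp.subnet_of(p)]
        if covering:
            required.add(min(covering, key=lambda p: p.prefixlen))
    return required


vp_list = [ipaddress.ip_network("20.0.0.0/7")]
rib = [
    ipaddress.ip_network("20.0.0.0/7"),    # the VP route itself
    ipaddress.ip_network("20.1.0.0/16"),   # sub-prefix inside the VP
    ipaddress.ip_network("192.0.2.0/24"),  # outside every VP
]
routes = required_routes(rib, vp_list)
```

   In this example the sub-prefix 20.1.0.0/16 is not in the required
   set, so it is a candidate for FIB suppression under the rules of
   Section 3.2.5.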

   If the non-APR has a tunnel to the BGP NEXT_HOP of any such route, it
   MUST use the tunnel to forward packets to the BGP NEXT_HOP.

   When an APR fails, routers MUST select another APR to send packets to
   (if there is one).  This happens, however, through normal internal
   BGP convergence mechanisms.  Note that it is strongly recommended
   that routers keep at least two VP routes in their RIB at all times.
   The main reason is that if the currently used VP route is withdrawn,
   the second VP route can be immediately installed, and the issue of
   whether to temporarily install sub-prefixes in the FIB is avoided
   (Section 3.2.5).  Another reason is that the IGP can be used to even
   more quickly detect that the APR has crashed, again allowing the
   second VP route to be immediately installed.

3.2.2.5.  Adding and Deleting VPs

   An ISP may from time to time wish to reconfigure its VP-List.  There
   are a number of reasons for this.  For instance, early in its
   deployment an ISP may configure one or a small number of VPs in order
   to test VA.  As the ISP gets more confident with VA, it may increase
   the number of VPs.  Or, an ISP may start with a small number of large
   VPs (e.g. /4s or even one /0), and over time move to a larger number
   of smaller VPs in order to save even more FIB.  In this case, the ISP
   will need to "split" a VP.  Finally, since the address space is not
   uniformly populated with prefixes, the ISP may want to change the
   size of VPs in order to balance FIB size across routers.  This can
   involve both splitting and merging VPs.  Of course, an ISP MUST be
   able to modify its VP-List without 1) interrupting service to any
   destinations, or 2) temporarily increasing the size of any FIB (i.e.
   the FIB size during the change must be no bigger than its size either
   before or after the change).

   Adding a VP is straightforward.  The first step is to configure the
   APRs for the VP.
   This causes the APRs to originate routes for the VP.  Non-APR routers
   will install this route according to the rules in Section 3.2.2.4
   even though they do not yet recognize that the prefix is a VP.
   Subsequently the VP is added to the VP-List of non-APR routers.  The
   non-APR routers can then start suppressing the sub-prefixes with no
   loss of service.

   To delete a VP, the process is reversed.  First, the VP is removed
   from the VP-Lists of non-APRs.  This causes the non-APRs to install
   the sub-prefixes.  After all sub-prefixes have been installed, the VP
   may be removed from the APRs.

   In many cases, it is desirable to split a VP.  For instance, consider
   the case where two routers, Ra and Rb, are APRs for the same prefix.
   It would be possible to shrink the FIB in both routers by splitting
   the VP into two VPs (e.g. split one /6 into two /7s), and assigning
   each router to one of the VPs.  While this could in theory be done by
   first deleting the larger VP, and then adding the smaller VPs, doing
   so would temporarily increase the FIB size in non-APRs, which may not
   have adequate space for such an increase.  For this reason, we allow
   overlapping VPs.

   To split a VP, first the two smaller VPs are added to the VP-Lists of
   all non-APR routers (in addition to the larger superset VP).  Next,
   the smaller VPs are added to the selected APRs (which may or may not
   be APRs for the larger VP).  Because the smaller VPs are a better
   match than the larger VP, this will cause the non-APR routers to
   forward packets to the APRs for the smaller VPs.  Next, the larger VP
   can be removed from the VP-Lists of all non-APR routers.  Finally,
   the larger VP can be removed from its APRs.

   To merge two VPs, the new larger VP is configured in all non-APRs.
   This has no effect on FIB size or APR selection, since the smaller
   VPs are better matches.
   Next the larger VP is configured in its selected APRs.  Next the
   smaller VPs are deleted from all non-APRs.  Finally, the smaller VPs
   are deleted from their corresponding APRs.

3.2.3.  Border VA Routers

   VA routers that are border routers MUST do the following: When
   forwarding externally-received routes over iBGP, the BGP NEXT_HOP
   attribute MUST be set to the remote ASBR.  They MUST establish
   tunnels that have the following properties (Requirement R2 in
   companion documents):

   1.  The tunnel target must be the remote ASBR BGP NEXT_HOP address.
       In other words, the target address used by other routers in the
       domain for tunneling packets is the remote ASBR address.
   2.  The border router must detunnel the packet before forwarding the
       packet to the remote ASBR.  In other words, the remote ASBR
       receives a normal untunneled packet identical to the packet it
       would receive without VA.
   3.  The border router must be able to forward the packet without a
       FIB lookup.  In other words, the tunnel information itself
       contains all the information needed by the border router to know
       which remote ASBR should receive the packet.

   Note that there are a number of ways the above tunnel can be created,
   as documented separately.  For instance, the tag on an MPLS LSP could
   identify the remote ASBR, and the border router could use what is
   effectively penultimate hop popping to deliver the packet.  Or, GRE
   could be used whereby the outer IP header addresses the border
   router, and the GRE key value identifies the remote ASBR.

3.2.4.  Advertising and Handling Sub-Prefixes

   Sub-prefixes are advertised and handled by BGP as normal.  VA does
   not affect this behavior.  The only difference in the handling of
   sub-prefixes is that they might not be installed in the FIB, as
   described in Section 3.2.5.
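   As a non-normative illustration of the GRE variant mentioned in
   Section 3.2.3, the sketch below shows a border router delivering a
   tunneled packet using only the tunnel key, with no FIB lookup on the
   inner destination.  The class and field names (TunneledPacket,
   BorderRouter, key_to_asbr) and the addresses are hypothetical; the
   actual encodings are defined in the companion tunneling documents.

```python
from dataclasses import dataclass


@dataclass
class TunneledPacket:
    outer_dst: str  # outer IP header: address of the border router
    gre_key: int    # tunnel key: identifies the remote ASBR
    inner: bytes    # the original, untunneled packet


class BorderRouter:
    def __init__(self, address, key_to_asbr):
        self.address = address
        # Assumed to be populated when the tunnels are established.
        self.key_to_asbr = dict(key_to_asbr)

    def forward(self, pkt):
        """Detunnel `pkt` and return (remote ASBR, inner packet)."""
        if pkt.outer_dst != self.address:
            raise ValueError("packet is not tunneled to this router")
        # Property 3: the key alone selects the remote ASBR, so the
        # inner destination address is never looked up in the FIB.
        remote_asbr = self.key_to_asbr[pkt.gre_key]
        # Property 2: only the inner packet is handed to the remote ASBR.
        return remote_asbr, pkt.inner


border = BorderRouter("10.0.0.1", {7: "192.0.2.1"})
result = border.forward(TunneledPacket("10.0.0.1", 7, b"payload"))
```

   Because forwarding depends only on the key, the border router itself
   may suppress the sub-prefix in its FIB and still deliver the packet.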

   In those cases where the route is installed, packets forwarded to
   prefixes external to the AS MUST be transmitted via the tunnel
   established as described in Section 3.2.3.

3.2.5.  Suppressing FIB Sub-prefix Routes

   Any route not for a known VP (i.e. not in the VP-List) is taken to be
   a sub-prefix.  The following rules are used to determine if a sub-
   prefix route can be suppressed.

   1.  A VA router must never FIB-install a sub-prefix route for which
       there is no tunnel to the BGP NEXT_HOP address.  This is to
       prevent a loop whereby the APR forwards the packet hop-by-hop
       towards the next hop, but a router on the path that has FIB-
       suppressed the sub-prefix forwards it back to the APR.  If there
       is an alternate route to the sub-prefix for which there is a
       tunnel, then that route should be selected, even if it is less
       attractive according to the normal BGP best path selection
       algorithm.
   2.  If the router is an APR, a route for every sub-prefix within the
       VP MUST be FIB-installed (subject to the above limitation that
       there be a tunnel).
   3.  If a non-APR router has a sub-prefix route that does not fall
       within any VP (as determined by the VP-List), then the route must
       be installed.  This may occur because the ISP hasn't defined a VP
       covering that prefix, for instance during an incremental
       deployment buildup.
   4.  If a non-APR router does not have a route for a known VP, then it
       MAY install sub-prefixes within that VP, but is not required to.
       Whether or not it does is up to the vendor and the network
       operator.  One approach is to never install such sub-prefixes, on
       the assumption that the network operator will engineer the
       network so that this rarely if ever happens.
   5.  Another approach is to have routers install such sub-prefixes,
       while taking care not to do so if the missing VP route is a
       transient condition.
       For instance, if the router is booting up, and simply has not
       yet received all of its routes, then it can reasonably expect to
       receive a VP route soon and so SHOULD NOT install the
       sub-prefixes.  On the other hand, if a continuously operating
       router had only a single remaining route for the VP, and that
       route is withdrawn, then the router might not expect to receive
       a replacement VP route soon and so SHOULD install the
       sub-prefixes.  Obviously a router can't predict the future with
       certainty, so the following algorithm might be a useful way to
       manage whether or not to install sub-prefixes for a missing VP
       route:

       *  Define a timer MISSING_VP_TIMER, set for a relatively short
          time (say 10 seconds or so).
       *  Start the timer when either: 1) the last VP route is
          withdrawn, or 2) there are initially neither VP routes nor
          sub-prefix routes, and the first sub-prefix route is received.
       *  When the timer expires, install sub-prefix routes.  Note,
          however, that optional routes may first need to be removed
          from the FIB to make room for the new sub-prefix routes.  If
          even after removing optional routes there is no room in the
          FIB for sub-prefix routes, then they should remain suppressed.
          In other words, sub-prefix entries required by virtue of being
          an APR take priority over sub-prefix entries required by
          virtue of not having a VP route.
   6.  All other sub-prefix routes MAY be suppressed.  Such "optional"
       sub-prefixes that are nevertheless installed are referred to as
       popular prefixes.

3.2.5.1.  Selecting Popular Prefixes

   Individual routers may independently choose which sub-prefixes are
   popular prefixes.  There is no need for different routers to install
   the same sub-prefixes.  There is therefore significant leeway as to
   how routers select popular prefixes.
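
   As an aside on rule 5 of Section 3.2.5, the MISSING_VP_TIMER
   procedure might be sketched as follows.  This is a non-normative
   illustration for a single VP.  The class name VpState, the explicit
   "now" argument, the "fib_has_room" flag, and the choice to cancel the
   timer and resume suppression when a VP route arrives are all
   assumptions made for the sketch, not requirements of this document.

```python
MISSING_VP_TIMER = 10.0  # seconds; the text suggests "10 seconds or so"


class VpState:
    """Tracks one VP on a non-APR router (hypothetical structure)."""

    def __init__(self):
        self.vp_routes = 0
        self.subprefix_routes = 0
        self.deadline = None  # absolute expiry time, or None if idle
        self.subprefixes_installed = False

    def on_vp_route_added(self):
        self.vp_routes += 1
        # Assumption: a VP route cancels the timer and lets the router
        # go back to suppressing the sub-prefixes.
        self.deadline = None
        self.subprefixes_installed = False

    def on_vp_route_withdrawn(self, now):
        self.vp_routes -= 1
        if self.vp_routes == 0:
            # Trigger 1: the last VP route was withdrawn.
            self.deadline = now + MISSING_VP_TIMER

    def on_subprefix_route_received(self, now):
        first = self.vp_routes == 0 and self.subprefix_routes == 0
        self.subprefix_routes += 1
        if first and self.deadline is None:
            # Trigger 2: no routes at all, first sub-prefix arrives.
            self.deadline = now + MISSING_VP_TIMER

    def tick(self, now, fib_has_room):
        """Run periodically; returns True if sub-prefixes are installed."""
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            # Install only if the VP is still missing and the FIB has
            # room after optional routes have been removed.
            if self.vp_routes == 0 and fib_has_room:
                self.subprefixes_installed = True
        return self.subprefixes_installed
```

   For example, a router that boots, receives a sub-prefix at time 0,
   and then receives the VP route before time 10 never installs the
   sub-prefixes, whereas a router whose last VP route is withdrawn
   installs them once the timer expires, provided the FIB has room.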
   As a general rule, routers should fill the FIB as much as possible,
   because the cost of doing so is relatively small, and more FIB
   entries lead to fewer packets taking a longer path.  Broadly
   speaking, an ISP may choose to fill the FIB by making routers APRs
   for as many VPs as possible, or by assigning relatively few APRs and
   instead filling the FIB with popular prefixes.  Several basic
   approaches to selecting popular prefixes are outlined here.  Router
   vendors are free to implement whatever approaches they want.

   1.  Policy-based: The simplest approach for network administrators is
       to have broad policies that routers use to determine which sub-
       prefixes are designated as popular.  An obvious policy would be a
       "customer routes" policy, whereby all customer routes are
       installed (as identified for instance by appropriate community
       attribute tags).  Another policy would be for a router to install
       prefixes originated by specific ASes.  For instance, two ISPs
       could mutually agree to install each other's originated prefixes.
       A third policy might be to install prefixes with the shortest AS-
       path.
   2.  Static list: Another approach would be to configure static lists
       of specific prefixes to install.  For instance, prefixes
       associated with an SLA might be configured.  Or, a list of
       prefixes for the most popular websites might be installed.
   3.  High-volume prefixes: By installing high-volume prefixes as
       popular prefixes, the latency and load associated with the longer
       path required by VA is minimized.  One approach would be for an
       ISP to measure its traffic volume over time (days or a few
       weeks), and statically configure high-volume prefixes as popular
       prefixes.  There is strong evidence that prefixes that are high-
       volume tend to remain high-volume over multi-day or multi-week
       timeframes (though not necessarily at short timeframes like
       minutes or seconds).
       High-volume prefixes may also be installed dynamically.  In
       other words, a router measures its own traffic volumes, and
       installs and removes popular prefixes in response to short term
       traffic load.  The downside of this approach is that it
       complicates debugging network problems.  If packets are being
       dropped somewhere in the network, it is more difficult to find
       out where if the selected path can change dynamically.

3.2.6.  Core-Edge Operation

   A common style of router deployment in ISPs is the "core-edge"
   deployment, whereby there is a core of high-capacity routers
   surrounded by potentially lower-capacity "edge" routers that may not
   carry the whole DFRT, and which default route to a core router.  VA
   can support this style of configuration by effectively defining a
   single VP as 0/0, and by defining core routers to be APRs for 0/0.
   This results in core routers maintaining full FIBs, and edge routers
   having potentially extremely small FIBs.  The advantage of using VA
   to support core-edge topologies is that, with VA, any edge router,
   including those peering with other ISPs, can have a small FIB.  Today
   such routers must maintain the full DFRT in order to peer.

   Vendors may wish to facilitate configuration of a core-edge style of
   VA for their customers that already use a core-edge topology.  In
   other words, a vendor may wish to simplify the VA configuration task
   so that a customer merely needs to configure which of its routers are
   core and which are edge, and the appropriate VA configuration, i.e.
   the VP-List, tunnels, and popular prefixes, is automatically done
   "under the hood" so to speak.  Note that, under a core-edge
   configuration, it isn't strictly speaking necessary for core routers
   to advertise the 0/0 VP within BGP.  Rather, edge routers could rely
   on their default route to a core router.

3.3.
Requirements Discussion

   This section describes the extent to which VA satisfies the list of
   requirements given in Section 3.1.

3.3.1.  Response to router failure

   VA introduces a new failure mode in the form of Aggregation Point
   Router (APR) failure.  There are two basic approaches to protecting
   against APR failure: static APR redundancy and dynamic APR
   assignment (see Section 3.2.2.3.1).  In static APR redundancy, enough
   APRs are assigned for each Virtual Prefix (VP) so that if one goes
   down, there are others to absorb its load.  Failover to a static
   redundant APR is automatic with existing BGP mechanisms.  If an APR
   crashes, BGP will cause packets to be routed to the next nearest APR.
   Nevertheless, there are three concerns here: convergence time, load
   increase at the redundant APR, and latency increase for diverted
   flows.

   Regarding convergence time, note that, while fast-reroute mechanisms
   apply to the rerouting of packets to a given APR or egress router,
   they don't apply to APR failure.  Convergence time was discussed in
   Section 3.2.2.4, which suggested that it is likely that BGP
   convergence times will be adequate, and if not the IGP mechanisms may
   be used.

   Regarding load increase, in general this is relatively small.  This
   is because substantial reductions in FIB size can be achieved with
   almost negligible increase in load.  For instance, [nsdi09] shows
   that a 5x reduction in FIB size yields a less than one percent
   increase in load overall.  Given this, depending on the configuration
   of redundant APRs, failure of one APR increases the load of its
   backups by only a few percent.  This is well within the variation
   seen in normal traffic loads.

   Regarding latency increase, some flows may see a significant increase
   in delay (and, specifically, an increase that puts them outside of
   their SLA boundary).
   Normally a redundant APR would be placed within the same POP, and so
   increased latency would be minimal (assuming that load is also quite
   small, and so there is no significant queuing delay).  It is not
   always possible, however, to have an APR for every VP within every
   POP, much less a redundant APR within every POP, and so sometimes
   failure of an APR will result in significant latency increases for a
   small fraction of traffic.

3.3.2.  Traffic Engineering

   VA complicates traffic engineering because the placement of APRs and
   selection of popular prefixes influences how packets flow.  (Though
   to repeat, increased load is likely to be minimal, and so the effect
   on traffic engineering should not be great in any event.)  Since the
   majority of packets may be forwarded by popular prefixes (and
   therefore follow the shortest path), it is particularly important
   that popular prefixes be selected appropriately.  As discussed in
   Section 3.2.5.1, there are static and dynamic approaches to this.
   [nsdi09] shows that high-volume prefixes tend to stay high-volume for
   many days, and so a static strategy is probably adequate.  VA can
   operate correctly using either RSVP-TE [RFC3209] or LDP to establish
   tunnels.

3.3.3.  Incremental and safe deployment and start-up

   It must be possible to install and configure VA in a safe and
   incremental fashion, as well as start it up when routers reboot.
   This document allows for a mixture of VA and legacy routers, allows a
   fraction or all of the address space to fall within virtual prefixes,
   and allows different routers to suppress different FIB entries
   (including none at all).  As a result, it is generally possible to
   deploy and test VA in an incremental fashion.

3.3.4.
VA security

   Regarding ingress filtering, because in VA the RIB is effectively
   unchanged, routers contain the same information they have today for
   installing ingress filters [RFC2827].  Presumably, installing an
   ingress filter in the FIB takes up some memory space.  Since ingress
   filtering is most effective at the "edge" of the network (i.e. at the
   customer interface), the number of FIB entries for ingress filtering
   should remain relatively small, equal to the number of prefixes owned
   by the customer.  Whether this is true in all cases remains for
   further study.

   Regarding DoS attacks, there are two issues that need to be
   considered.  First, does VA result in new types of DoS attacks?
   Second, does VA make it more difficult to deploy DoS defense systems?
   Regarding the first issue, one possibility is that an attacker
   targets a given router by flooding the network with traffic to
   prefixes that are not popular, and for which that router is an APR.
   This would cause a disproportionate amount of traffic to be forwarded
   to the APR(s).  While it is up to individual ISPs to decide if this
   attack is a concern, it does not strike the authors that this attack
   is likely to significantly worsen the DoS problem.

   Regarding DoS defense system deployment, more input about specific
   systems is needed.  It is the authors' understanding, however, that
   at least some of these systems use dynamically established Routing
   Table entries to divert victims' traffic into LSPs that carry the
   traffic to scrubbers.  The expectation is that this mechanism simply
   overrides whatever route is in place (with or without VA), and so the
   operation of VA should not limit the deployment of these types of DoS
   defense systems.  Nevertheless, more study is needed here.

3.4.  New Configuration

   VA places new configuration requirements on ISP administrators.
   Namely, the administrator must:

   1.  Select VPs, and configure the VP-List into all VA routers.  As a
       general rule, having a larger number of relatively small VPs
       gives administrators the most flexibility in terms of filling
       available FIB with sub-prefixes, and in terms of balancing load
       across routers.  Once an administrator has selected a VP-List, it
       is just as easy to configure routers with a large list as a small
       list.  We can expect network operator groups like NANOG to
       compile good VP-Lists that ISPs can then adopt.  A good list
       would be one where the number of VPs is relatively large, say 100
       or so (noting again that each VP must be larger than any real
       prefix, see Section 3.2.2.2), and the number of sub-prefixes
       within each VP is roughly the same.
   2.  Select and configure APRs.  There are three primary
       considerations here.  First, there must be enough APRs to handle
       reasonable APR failure scenarios.  Second, APR assignment should
       not result in router overload.  Third, particularly long paths
       should be avoided.  Ideally there should be two APRs for each VP
       within each PoP, but this may not be possible for small PoPs.
       Failing this, there should be at least two APRs in each
       geographical region, so as to minimize path length increase.
       Routers should have the appropriate counters to allow
       administrators to know the volume of APR traffic each router is
       handling so as to adjust load by adding or removing APR
       assignments.
   3.  Select and configure Popular Prefixes or Popular Prefix policies.
       There are two general goals here.  The first is to minimize load
       overall by minimizing the number of packets that take longer
       paths.  The second is to ensure that specific selected prefixes
       don't have overly long paths.  These goals must be weighed
       against the administrative overhead of configuring potentially
       thousands of popular prefixes.
       As one example, a small ISP may wish to keep it simple by doing
       nothing more than indicating that customer routes should be
       installed.  In this case, the administrator could otherwise
       assign as many APRs as possible while leaving enough FIB space
       for customer routes.  As another example, a large ISP could build
       a management system that takes into consideration the traffic
       matrix, customer SLAs, robustness requirements, FIB sizes,
       topology, and router capacity, and periodically automatically
       computes APR and popular prefix assignments.

4.  IANA Considerations

   There are no IANA considerations.

5.  Security Considerations

   We consider the security implications of VA under two scenarios, one
   where VA is configured and operated correctly, and one where it is
   mis-configured.  A cornerstone of VA operation is that the basic
   behavior of BGP doesn't change, especially inter-domain.  Among other
   things, this makes it easier to reason about security.

5.1.  Properly Configured VA

   If VA is configured and operated properly, then the external behavior
   of an AS does not change.  The same upstream ASes are selected, and
   the same prefixes and AS-paths are advertised.  Therefore, a properly
   configured VA domain has no security impact on other domains.

   This document discusses intra-domain security concerns in
   Section 3.3.4, which argues that any new security concerns appear to
   be relatively minor.

   If another ISP starts advertising a prefix that is larger than a
   given VP, this prefix will be ignored by APRs that have a VP that
   falls within the larger prefix (Section 3.2.2.3).  As a result,
   packets that might otherwise have been routed to the new larger
   prefix will be dropped at the APRs.  Note that the trend in the
   Internet is towards large prefixes being broken up into smaller ones,
   not the reverse.
   Therefore, such a larger prefix is likely to be invalid.  If it is
   determined without a doubt that the larger prefix is valid, then the
   ISP will have to reconfigure its VPs.

5.2.  Mis-configured VA

   VA introduces the possibility that a VP is advertised outside of an
   AS.  This should in fact be a low probability event, but it is
   considered here nonetheless.

   If an AS leaks a large VP (i.e. larger than any real prefixes), then
   the impact is minimal.  Smaller prefixes will be preferred because of
   best-match semantics, and so the only impact is that packets that
   otherwise have no matching routes will be sent to the misbehaving AS
   and dropped there.  If an AS leaks a small VP (i.e. smaller than a
   real prefix), then packets to that AS will be hijacked by the
   misbehaving AS and dropped.  This can happen with or without VA, and
   so doesn't represent a new security problem per se.

6.  Acknowledgements

   The authors would like to acknowledge the efforts of Xinyang Zhang
   and Jia Wang, who worked on CRIO (Core Router Integrated Overlay), an
   early inter-domain variant of FIB suppression, and the efforts of
   Hitesh Ballani and Tuan Cao, who worked on the configuration-only
   variant of VA that works with legacy routers.  We would also like to
   thank Scott Brim, Daniel Ginsburg, and Rajiv Asati for their helpful
   comments.  In particular, Daniel's comments significantly simplified
   the spec (eliminating the need for a new External Communities
   Attribute).

7.  References

7.1.  Normative References

   [RFC1997]  Chandrasekeran, R., Traina, P., and T. Li, "BGP
              Communities Attribute", RFC 1997, August 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

   [RFC2827]  Ferguson, P. and D.
              Senie, "Network Ingress Filtering: Defeating Denial of
              Service Attacks which employ IP Source Address Spoofing",
              BCP 38, RFC 2827, May 2000.

   [RFC3107]  Rekhter, Y. and E. Rosen, "Carrying Label Information in
              BGP-4", RFC 3107, May 2001.

   [RFC3209]  Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V.,
              and G. Swallow, "RSVP-TE: Extensions to RSVP for LSP
              Tunnels", RFC 3209, December 2001.

   [RFC4271]  Rekhter, Y., Li, T., and S. Hares, "A Border Gateway
              Protocol 4 (BGP-4)", RFC 4271, January 2006.

   [RFC4601]  Fenner, B., Handley, M., Holbrook, H., and I. Kouvelas,
              "Protocol Independent Multicast - Sparse Mode (PIM-SM):
              Protocol Specification (Revised)", RFC 4601, August 2006.

7.2.  Informative References

   [nsdi09]   Ballani, H., Francis, P., Cao, T., and J. Wang, "Making
              Routers Last Longer with ViAggre", ACM Usenix NSDI 2009,
              http://www.usenix.org/events/nsdi09/tech/full_papers/
              ballani/ballani.pdf, April 2009.

Authors' Addresses

   Paul Francis
   Max Planck Institute for Software Systems
   Gottlieb-Daimler-Strasse
   Kaiserslautern 67633
   Germany

   Phone: +49 631 930 39600
   Email: francis@mpi-sws.org

   Xiaohu Xu
   Huawei Technologies
   No.3 Xinxi Rd., Shang-Di Information Industry Base, Hai-Dian District
   Beijing, Beijing 100085
   P.R.China

   Phone: +86 10 82836073
   Email: xuxh@huawei.com

   Hitesh Ballani
   Cornell University
   4130 Upson Hall
   Ithaca, NY 14853
   US

   Phone: +1 607 279 6780
   Email: hitesh@cs.cornell.edu

   Dan Jen
   UCLA
   4805 Boelter Hall
   Los Angeles, CA 90095
   US

   Phone:
   Email: jenster@cs.ucla.edu

   Robert Raszuk
   Self

   Phone:
   Email: robert@raszuk.net

   Lixia Zhang
   UCLA
   3713 Boelter Hall
   Los Angeles, CA 90095
   US

   Phone:
   Email: lixia@cs.ucla.edu