Multi6 Working Group                                Ramakrishna Gummadi
Internet-Draft                                        September 2, 2001
Expires: March 2, 2002

A Proposal for Scalable Network-level Multihoming
draft-ramki-multi6-nlmp-00.txt

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on March 1, 2002.

Copyright Notice

Copyright (C) The Internet Society (2001). All Rights Reserved.

Abstract

This document proposes two extensions to BGP4+ to provide scalable multihoming in IPv6. The first is a new well-known and mandatory "TunneL (TL)" BGP path attribute that allows a multihomed enterprise to maintain global connectivity in the presence of network outages within a direct provider Autonomous System (AS) without adding any global forwarding overhead. The second is a new well-known and mandatory "Selective Announcement (SA)" BGP community attribute that allows the multihomed enterprise to recover from failures in distant AS's by explicitly specifying the list of AS's to which the site's reachability must be advertised. These two extensions form the core of the "Network Layer Multihoming Protocol (NLMP)" described in this draft. NLMP is designed to be useful to both multihomed providers and multihomed enterprises.

1. Introduction

The main goal of NLMP is to restore the global routing scalability lost to the current practice [1] of multihoming enterprises and providers, while, at the same time, preserving all multihoming benefits [1]. NLMP is a network-layer protocol that affects routing and forwarding at routers, but does not require end host support. The forwarding addition is a check for possible IP-in-IP encapsulation [5] before the packet is delivered to the next hop. The routing change is the addition of a new BGP path attribute and a new BGP community attribute. While NLMP takes advantage of the aggressive aggregation [11] encouraged in IPv6, it is otherwise independent of IPv6. NLMP can be used by both multihomed enterprises as defined in [18], and multihomed providers (providers who connect to more than one upstream provider).

As described in [12], recent analysis of BGP routing tables shows that the pre-CIDR exponential growth in the number of globally visible routes has resumed. Such unconstrained growth decreases overall routing stability, increases route convergence times, and stresses packet forwarding and route processing resources at routers. An increasing tendency of enterprises and providers to multihome has been identified as a significant contributor to this spurt in routing table growth. As analyzed in [14], the driving force behind multihoming seems to be customers' desire to achieve resiliency at the network edge by obtaining service from multiple providers, who currently have very little leeway for over-engineering their networks.

Currently, each multihomed enterprise/provider introduces a single non-aggregable prefix into the default-free zone (DFZ). This is clearly an unscalable approach. It can also be directly self-defeating because slower BGP convergence means that the multihomer's overall network availability is decreased. For example, a recent study [13] points out that 80% of route withdrawals take more than a minute.
This means that the process of failover alone brings down availability by 0.0002% (one minute out of the roughly 525,600 minutes in a year), assuming an extremely conservative estimate of one network failure per year. Even if such overall availability metrics seem acceptable, the catch is that one minute is long enough to disrupt transport layer survivability of most applications. Clearly, alternate techniques for multihoming must be explored.

This document makes two concrete proposals for robustly enhancing global routing scalability and a multihomed enterprise's availability. In NLMP, a multihomed enterprise has multiple provider-aggregatable address prefixes that are inherited from each provider's address space. Consequently, NLMP scales better than the practice of announcing non-aggregable prefixes into the default-free zone (DFZ). While it is straightforward to achieve scalability this way, without additional work at routers and/or hosts, almost all traditional multihoming benefits [1] are seriously affected. In particular, resiliency to network failures and transport-layer survivability are lost, and load-sharing and performance become complicated. The goal of NLMP is to resolve these issues by altering only the routing layer.

2. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC2119 [4].

Within the context of this document, and only within this context, we (re)define and use the following terms.

Access Router (AR): The router within an ISP that a site connects to. AR's are typically located in an ISP's PoP (Point-of-presence).

Border Router (BR): Either an exchange router or a transit router.

Direct AS: The AS (ISP) directly connected to a site, and providing transit to the site.

Distant/Remote AS: All AS's other than direct AS's.
Enterprise: From [18], an entity autonomously operating a network using TCP/IP and, in particular, determining the addressing plan and address assignments within that network. For our purposes, an enterprise need NOT be an AS.

Exchange Router (XR): The router that connects an ISP to other ISPs at an exchange point.

Multihomer: A multihomed enterprise or provider.

Peering: Interconnection between two ISPs that do not have a provider-subscriber relationship. Each ISP is required to accept and possibly re-advertise the other's prefixes only to its own subscribers. The two ISPs are called peers.

Protection Domain (PD): The set of routers and links whose failures a redundancy scheme is resilient against.

Provider: An AS providing transit (an ISP).

Site: Synonymous with enterprise.

Subscriber: Either a site or a provider receiving transit.

Transit: Interconnection between two ISPs or between an ISP and a site that share a provider-subscriber relationship. The provider is required to accept and re-advertise the subscriber's prefixes, possibly after aggregation, to its peers, subscribers, and any upstream providers.

Transit Router (TR): An ISP's router that connects the ISP to an upstream transit provider.

v4-multihoming: Multihoming in IPv4 as done today. This approach and its limitations are described in [1].

3. Basic Approach

We first examine what addressing and routing mechanism a conventional multihomed site uses, and how this has to be modified for scalable multihoming in IPv6.

               |-----------|
               |    DFZ    |
               |-----------|
      (TLA1) ISP1         ISP2 (TLA2)
              /|\         /|\
               |           |
               |           |
      (NLA1) ISP4         ISP3 (NLA2)
              /|\         /|\
                \         /
                 \_______/
                  | EBR |
                   -----
                   \   /
                    Site
                     |
                     H

                  Figure 1

The figure above shows a multihomed site connected to two providers ISP3 and ISP4, who, in turn, get transit from ISP2 and ISP1 respectively. In v4-multihoming, the site gets either a single portable address block from a registry, or a single provider-aggregatable (PA) address block from the address space allocated to one of its providers.
The enterprise's border router (EBR) then announces the site's prefix to both upstream providers using BGP4. Such a prefix must be announced separately and carried unaggregated in all BGP routing tables. To achieve full redundancy and transport-layer survivability, EBR MUST also receive all address prefixes reachable through both ISP3 and ISP4. This way, an outbound packet can be sent using a different ISP if the destination becomes unreachable through one ISP. As described in [8], such a multihoming architecture limits the benefits of CIDR aggregation.

To achieve high aggregation, in NLMP, the site MUST use two separate prefixes, one each from the PA address space of the two ISPs. Under normal (no failure) circumstances, EBR MUST NOT announce the address prefix inherited from one ISP to the other. This means each ISP is required to announce only a single aggregated prefix globally. Such an architecture clearly ensures global routing scalability.

One approach towards restoring resiliency and transport layer survivability is to make changes to hosts' application and/or protocol stacks. In particular, transport layers at end hosts must be equipped with the ability to work with changing addresses. There are, however, many unresolved security and performance issues in this approach. In particular, this approach opens up new possibilities for packet hijacking and Denial-of-Service (DoS) unless address change is carefully implemented. In many cases (when clients are anonymous, or with busy servers), implementing full redundancy and transport layer survivability seems to require an in-band globally scalable authentication mechanism, which is unavailable now (and is not likely to appear in the near future).
Further, if we do not want to make applications aware of all multihoming details, implementing even simple out-of-band redundancy and security mechanisms (such as DNS lookups for learning/verifying alternate destination addresses) at the transport layer and below is a layer violation that may have wider implications. The problem becomes worse for multicast, especially reliable multicast and source-specific multicast (SSM), where the overhead of active path management (even if invoked only upon detecting data loss) may be prohibitively expensive because of the scalability issue with per-receiver reverse direction traffic that has to be either ack'ed by the source or nack'ed by the network (in terms of destination unreachable messages).

An alternate approach is to handle these multihoming issues completely at the network layer by making suitable alterations to routers and routing protocols. The main goal is to extend routing to provide high availability (HA) semantics to all prefixes assigned to a site without incurring high global routing and forwarding overhead. Since the strategy of multiple PA addresses already minimizes routing and forwarding state carried under normal (no failure) conditions, the challenge now is to reduce and/or eliminate state injected to deal with failures.

One such approach is described in RFC2260 [2]. To maintain connectivity in the presence of failures in the nodes and links connecting a site to its ISPs, RFC2260 recommends two strategies. The first strategy is to make EBR in Figure 1 originate a global route for a "failed prefix" through the reachable ISP. By scoping such advertisements using BGP communities [6], one can carefully control the AS boundaries where the prefix is leaked. An alternate strategy that eliminates this leakage is to use packet encapsulation between EBR and the ISP border router to which communication has been lost.
EBR's address reachable via the second ISP forms the tunnel destination for inbound packets, and the tunnel source for outbound packets. Note that the reachable ISP that carries tunnel traffic need not involve itself in the tunnel setup or maintenance in any way. The main drawback of both approaches is that they only deal with a restricted set of failures.

NLMP extends these two ideas to cover a broader set of failure scenarios. The main goal of NLMP is to maintain reachability between any AS and all of a site's prefixes as long as there is a feasible path. NLMP has two algorithms/routing extensions to repair paths. They are called "Selective Announcement (SA)" and "Tunneling", and are analogous to the ones in RFC2260. The set of failures that a site can tolerate using SA is the most general, and the same as what can be tolerated using conventional multihoming, but with a lower global routing and forwarding overhead under both normal and failure conditions. On the other hand, tunneling provides faster protection against a more restricted set of failures with no global forwarding overhead, and very little routing overhead. In particular, tunneling deals with failures that arise in a direct AS, while SA deals with failures that occur in distant AS's. In other words, the protection domain of tunneling is the direct AS, while the protection domain of SA includes all AS's.

The resiliency requirements in the current multihoming requirement specification [3] are "mostly" confined to the direct AS, and can be handled by tunneling alone. (We say "mostly" because BGP peer resets due to problems in a provider's peer network can only be handled with SA.) While v4-multihoming uses a single mechanism for protecting against both kinds of failures, NLMP permits EBR to determine quickly and reliably (by routing instead of pinging) whether the problem is local (in the upstream provider's network) or remote, and, thus, decide whether tunneling or SA should be used.
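The choice between the two repair mechanisms can be sketched as follows. This is only an illustration under stated assumptions: the `Failure` record, the `choose_repair` helper, and the AS numbers are hypothetical names that do not appear in NLMP, and the sketch assumes EBR has already learned from routing in which AS the failure lies.

```python
from dataclasses import dataclass
from enum import Enum


class Repair(Enum):
    TUNNELING = "tunneling"
    SELECTIVE_ANNOUNCEMENT = "selective announcement"


@dataclass(frozen=True)
class Failure:
    # AS number in which the failure was located. Assumption: EBR knows
    # this from routing information rather than from pinging.
    failed_as: int


def choose_repair(failure: Failure, direct_ases: set[int]) -> Repair:
    """Pick the default repair mechanism for a located failure.

    The protection domain of tunneling is the direct AS, so failures
    there default to tunneling; failures in any other (distant) AS
    fall to selective announcement.
    """
    if failure.failed_as in direct_ases:
        return Repair.TUNNELING
    return Repair.SELECTIVE_ANNOUNCEMENT
```

For a site whose direct providers are the (hypothetical) private AS numbers 64512 and 64513, `choose_repair(Failure(64512), {64512, 64513})` selects tunneling, while a failure located in any other AS selects SA.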
Since SA establishes more optimal routes, it can outperform tunneling if failures in a direct AS persist over longer time scales. Tunneling SHOULD be used by default for direct AS failures, and, depending on policy, SA MAY be used to establish better paths, during which period tunneling continues to carry traffic on relatively sub-optimal paths. Since a route for the failed prefix exists in remote routers under either condition (see Section 4 for routes maintained during tunneling), incoming traffic can be atomically shifted from tunneling to the new route established by SA. There are inter-provider policy issues in both cases, which are discussed in Sections 4.1 and 6. Both algorithms provide redundancy along with transport-layer survivability.

4. The Tunneling Algorithm and Protocol

A variety of problems can occur in the path between a source and destination, but the redundancy portion of NLMP corrects only those failures that would otherwise manifest as prefix reachability problems. In particular, congestion, host overload, and intra-AS failures that heal through intra-AS routing do not trigger NLMP's path repair algorithms. This is broadly the approach taken by v4-multihoming as well.

The basic idea behind tunneling is that as long as all border routers (BR's) in a direct provider's network know a working site prefix, HA with respect to the BR provider's prefix is achieved. The provider's backbone and access routers need not be aware of tunneling at all. Normally, incoming packets are sent through the provider's access router (AR) that the multihomed site is connected to. RFC2260 describes how AR can tunnel packets if the direct link to EBR becomes unavailable. However, AR itself may become inaccessible to some BR's in the provider network due to AR crashes, network partitions (such as a cut in the fiber connecting an exchange router to the provider backbone, provider-wide backbone failures, etc.), or configuration problems.
In such cases, tunneling ensures that every border router will still be able to accept packets meant for the unreachable prefix, and deliver them through a reachable prefix of a different provider.

Every BR detects unreachable intra-AS prefixes through interior routing protocols, just as today. Such unreachable prefixes indicate network partitioning. For our purposes, network partitions are disconnected islands of an AS with at least one BR to serve as an exit/entrance into networks of peers and/or higher providers. Instead of completely withdrawing such unreachable prefixes within the AS from external BGP peers, a BR in a network partition attaches a new well-known mandatory "TL" path attribute to BGP update messages announcing withdrawn routes comprising these unreachable prefixes. Note that all unreachable prefixes, including those assigned to multihomed sites, are marked with this attribute. The semantics of this attribute are that routes marked with it get the lowest preference in BGP's Decision Process [17] at a router in an external AS, but are still retained in global forwarding tables if no better route to the unreachable prefixes exists.

Such withdrawals represent a temporary phenomenon of network partitioning. The unreachable prefixes comprising withdrawn routes tend to exhibit good aggregation properties because partitions are typically contiguous. Thus, the advantage of the "TL" attribute is that a BR need not separately advertise reachability through tunnels to individual multihomed site prefixes that may be sparsely distributed in the unreachable partitions, and a single entry in external routers for the entire unreachable partition suffices. When the BR receives packets meant for an unreachable multihomed site from external BGP peers, it can use the site's EBR address from the alternate provider as the tunneled packet's destination address.
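The preference semantics of the "TL" attribute can be illustrated with a small sketch. It is not the BGP Decision Process of [17], which has many more tie-breakers; the `Route` record, its `local_pref` field, and the `best_route` helper are hypothetical names introduced only to show that a TL-marked route ranks last but is still used when nothing better exists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str        # destination prefix, e.g. "2001:db8:1::/48"
    next_hop: str      # hypothetical identifier of the advertising peer
    local_pref: int    # ordinary preference; higher is better
    tl: bool = False   # True if the update carried the TL attribute


def best_route(candidates: list[Route]) -> Optional[Route]:
    """Return the preferred route among candidates for one prefix.

    TL-marked routes get the lowest preference: one is selected only
    when no unmarked route exists, which keeps the prefix in the
    forwarding table during a partition.
    """
    if not candidates:
        return None
    # Unmarked routes sort first (False < True), then higher local_pref.
    return min(candidates, key=lambda r: (r.tl, -r.local_pref))
```

In this sketch an unmarked route always wins over a TL-marked one regardless of `local_pref`, and a TL-marked route is returned only when it is the sole candidate.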
BR can be statically configured with this alternate EBR address, and the address would be loaded into BR's local routing tables when the intra-AS routing withdraws the direct EBR address. Instead of statically configuring BR with alternate prefixes of all of the AS's multihomed customers, mechanisms such as tagging routes advertised by interior routing protocols with alternate site prefixes, or periodic DNS lookups, can be used. In any case, these alternate addresses do not appear in the local routing table under normal conditions, and are not propagated outside the BR. Since the BR cannot aggregate destination addresses for tunnel packets, a local routing table entry would have to be created for every unreachable EBR directly served by the provider. Since this may present a local scalability problem, a BR may instead create a forwarding entry on demand, only after it sees a packet addressed to the site.

These tunnels are stateless in that EBR does not need to know that a BR has created a local tunnel entry. Instead, if it finds a packet addressed to its alternate address and marked as a tunneled packet, it simply strips off the tunnel header before injecting the packet into the site. (The site's outgoing packets need not be tunneled because EBR knows from BGP the ISP through which the destination can be reached.)

There is no additional global forwarding state introduced by tunneling or multihoming because external routers that do not know how to reach more than one AS partition would carry a single aggregated route to a reachable partition. AS's that use conventional BGP withdraw semantics in a non-multihoming world would have to maintain a forwarding entry as well, the difference being that the forwarding entry is only for reachable prefixes. On the other hand, routers that learn reachable routes for multiple partitions would either carry more specific routes separately for each of these partitions, or fewer covering routes, as decided by BGP's Decision Algorithm [17].
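The BR and EBR behaviour described in this section can be sketched as follows. This is a toy model under several assumptions: the `Packet`, `BorderRouter`, and `ebr_receive` names are hypothetical, encapsulation is modelled as simple nesting rather than the IPv6 tunneling of [5], and the map handed to the BR is assumed to list only sites that interior routing currently reports as unreachable.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    dst: str                          # IPv6 destination address
    payload: bytes = b""
    inner: Optional["Packet"] = None  # set when this is a tunnel wrapper


class BorderRouter:
    def __init__(self, alternate_ebr: dict[str, str]):
        # Unreachable site prefix -> the site's EBR address taken from
        # the alternate provider's address space (statically configured
        # here; the text also allows IGP tagging or DNS lookups).
        self._alternate = {
            ipaddress.ip_network(p): addr for p, addr in alternate_ebr.items()
        }
        # Local tunnel entries, created on demand only.
        self.tunnel_entries: dict = {}

    def forward(self, pkt: Packet) -> Packet:
        dst = ipaddress.ip_address(pkt.dst)
        for prefix, ebr_addr in self._alternate.items():
            if dst in prefix:
                # The first packet for the site creates the local entry.
                self.tunnel_entries.setdefault(prefix, ebr_addr)
                return Packet(dst=ebr_addr, inner=pkt)
        return pkt  # not for an unreachable multihomed site


def ebr_receive(pkt: Packet, alternate_addr: str) -> Packet:
    """Strip the tunnel header if the packet was tunneled to EBR."""
    if pkt.inner is not None and pkt.dst == alternate_addr:
        return pkt.inner
    return pkt
```

Note that `ebr_receive` keeps no per-tunnel state, which mirrors the stateless property claimed above: EBR only inspects the packet it is handed.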
There is, however, an overhead in terms of carrying a withdrawn route marked with a "TL" attribute in an Adj-RIBs-In base, but this overhead is expected to be insignificant, both because of aggregation, and because it is transient. After partitions merge, BR resumes EBGP NLRI advertisement for the merged prefixes, and these updates purge previous withdrawals from the Adj-RIBs-In base if the withdrawn prefixes are contained in the new routes.

Tunneling can be used even if multiple simultaneous failures within an AS create multiple partitions. For maintaining connectivity between single-homed subscribers in one partition and multihomed subscribers in another partition of the same AS, BR's must also advertise indirectly reachable routes for multihomed sites in the unreachable partition into interior routing protocols with a preference lower than that used for advertising directly reachable routes.

As an example of the above approach, a border router at an exchange point that has lost its connectivity to the rest of the ISP network can still tunnel packets addressed to unreachable multihomed sites through common exchange backhaul after receiving them from other exchange ISPs that have no other route to the partitioned ISP. No multihomed traffic is lost, and no additional global forwarding state is introduced.

4.1 Tunneling Issues

Looping: When multiple direct providers experience failures, packets may bounce between two providers. This can be prevented by making sure that BR does not create a local forwarding entry to tunnel packets to prefixes whose withdrawals are accompanied by a "TL" attribute. Alternately, the tunnel encapsulation limit option [5] can be used to limit tunnel nesting, but this decreases overall robustness, and increases packet overhead.

Path MTU: A tunneled packet is larger by the size of the tunnel header (40 bytes).
This may change the path MTU, causing an ICMPv6 Packet Too Big message to be generated, and forcing the source to do path MTU discovery. The IPv6 Path MTU Discovery specification [15] requires an IPv6 host to cope with changes in Path MTU, which can happen whenever routing changes paths. If alternate addressing schemes like GSE [16] are used, packet encapsulation is unnecessary because a BR can rewrite the packet's destination routing goop.

Policy: It must be noted that even though one ISP's BR originates tunnel traffic, the destination is the site prefix assigned by another ISP providing direct transit to the site. Typically, even if the two ISPs are competitors, the service agreement between an ISP and a customer obliges the ISP to carry the customer's traffic even if it was originated within a competing provider's network. This means that it is reasonable to expect tunneled traffic, which is generated only during failure, to be delivered to the site. Such traffic consumes extra resources, but these are accounted for at a different level, such as traffic carried between two peer ISPs, or between a customer and provider ISP.

5. The Selective Announcement Algorithm and Protocol

While tunneling can be used to recover from failures in a direct AS, failures in a distant AS cannot be tolerated using this mechanism. This is partly because of administrative and trust limitations, and partly because debugging complexity increases rapidly. Consider the example topology in the figure below:

          H1                      H2
          |                       |
      //--+--\\               //--+--\\
      |  AS5  |               |  AS1  |
      |       |               |       |
      \\--+--//               \\--+--//
          |<------transit-------->|
      //--+--\\               //--+--\\
      |  AS2  |   partition   |  AS2  |
      |       |               |       |
      \\--+--//               \\--+--//
          |                       |
          |<------ peering ------>|
      //--+--\\               //--+--\\
      |  AS4  |               |  AS3  |
      |       |               |       |
      \\--\--//               \\-//--//
           \                    /
   transit-->\                /<--transit
              \              /
                    EBR

In this case, H1 cannot reach the site using addresses inherited from AS3 as the destination, and likewise H2 cannot reach it using addresses inherited from AS4. Tunneling will not work here because AS4 has only a peering relationship with AS2.
The problem also persists if partitioned AS2 is replaced with two distinct AS's. In both cases, EBR uses a selective announcement mechanism to inject reachability information about itself into the distant AS's.

We assume below that bi-directional BGP peering is set up at all BGP speakers. This is a valid assumption when traffic has to flow both ways, even if peers do asymmetric routing. EBR first determines if any destination prefixes can be reached from one provider but not from the other. It can do this by taking the set difference between the prefixes heard from the two providers. Any withdrawn routes carrying the "TL" attribute are treated as unreachable for this purpose. For each destination prefix d_pfx that EBR determines is reachable through one site prefix s_pfx1 but not through the other site prefix s_pfx2, EBR originates a route advertisement for s_pfx2, and tags it with a selective announcement (SA) community attribute that contains a list of AS's from the AS_PATH attribute of the original route that provided the destination prefix d_pfx.

The effect of the SA community attribute is to restrict the distribution of the non-aggregable site prefix s_pfx2 to select AS's that would otherwise not know how to forward packets addressed to s_pfx2. Since AS_PATH typically contains a sequence of AS's through which a forward route is propagated, reverse route distribution can be controlled by initially setting the SA attribute to the reverse AS_PATH list, followed by each AS in the list announcing the route only to the next AS member in the list. Routes tagged with the SA attribute MUST NOT be propagated to any AS not in the SA list. Further, such routes have a higher priority than those marked with TL, but a lower priority than those learnt through normal BGP messages. Such routes are deleted whenever a route with a covering prefix and no SA attribute is announced or withdrawn.

6. NLMP and Multihoming Requirements

NLMP satisfies scalability, redundancy, and transport-layer survivability requirements, as described above. NLMP provides only a coarse control for load-sharing incoming traffic, not unlike v4-multihoming today. For outbound traffic, however, NLMP can achieve good load-sharing assuming a provider's AR is configured to accept traffic with all valid site prefixes as source addresses. With regard to performance, to avoid congested links between direct providers for incoming traffic, the site may advertise all valid site prefixes into each provider. This approach introduces extra prefixes into direct providers, but does not impact global routing. v4-multihoming's site policy configurations can be used in NLMP to provide the same level of policy control to the multihomed site. While we believe that NLMP is fairly simple, there may still be some inter-provider policy issues with carrying tunnel traffic. SA, in turn, scales better than current v4-multihoming but shares its policy issues with regard to carrying alternate prefix announcements.

7. Security

NLMP does not introduce any new security problems not already existing in current routing. In particular, it relies on the hop-by-hop co-ordination used by routing. Any cryptographic techniques used to protect the integrity of routing messages can be used by NLMP as well.

References

[1] Abley, J., Black, B., Gill, V., "IPv4 Multihoming Motivation, Practices and Limitations (work-in-progress)", I-D draft-ietf-multi6-v4-multihoming-00, June 2001.

[2] Bates, T., Rekhter, Y., "Scalable Support for Multi-homed Multi-provider Connectivity", RFC2260, January 1998.

[3] Black, B., Gill, V., Abley, J., "Requirements for IP Multihoming Architectures (work-in-progress)", I-D draft-ietf-multi6-v4-multihoming-01, June 2001.

[4] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", RFC2119, March 1997.
[5] Conta, A., Deering, S., "Generic Packet Tunneling in IPv6 Specification", RFC2473, December 1998.

[6] Chandra, P., Traina, P., Li, T., "BGP Communities Attribute", RFC1997, August 1996.

[7] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC2460, December 1998.

[8] Fuller, V., Li, T., Yu, J. and K. Varadhan, "Classless Inter-Domain Routing (CIDR): An Address Assignment and Aggregation Strategy", RFC1519, September 1993.

[9] Gilligan, R., Thomson, S., Bound, J. and W. Stevens, "Basic Socket Interface Extensions for IPv6", RFC2553, March 1999.

[10] Hinden, R. and S. Deering, "IP Version 6 Addressing Architecture", RFC2373, July 1998.

[11] Hinden, R., O'Dell, M. and S. Deering, "An IPv6 Aggregatable Global Unicast Address Format", RFC2374, July 1998.

[12] Huston, G., "Analyzing the Internet's BGP Routing Table", January 2001.

[13] Huston, G., "BGP Issues", Presentation, March 2001.

[14] Huston, G., "The Unreliable Internet", Broadband Satellite Column, May 2001.

[15] McCann, J., Deering, S., Mogul, J., "Path MTU Discovery for IP version 6", RFC1981, August 1996.

[16] O'Dell, M., "GSE - an alternate addressing architecture for IPv6", I-D draft-ietf-ipngwg-gseaddr-00.txt, February 1997.

[17] Rekhter, Y., Li, T., "A Border Gateway Protocol 4 (BGP-4)", RFC1771, March 1995.

[18] Rekhter, Y., Moskowitz, B., Karrenberg, D., de Groot, G., Lear, E., "Address Allocation for Private Internets", RFC1918, February 1996.

Author's Address

Ramakrishna Gummadi
475 Soda Hall
UC Berkeley

Email: ramki@cs.berkeley.edu