[lisp] ALT structure, robustness and the long-path problem
Robin Whittle <rw@firstpr.com.au> Mon, 07 December 2009 02:49 UTC
Return-Path: <rw@firstpr.com.au>
X-Original-To: lisp@core3.amsl.com
Delivered-To: lisp@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 268573A697F for <lisp@core3.amsl.com>; Sun, 6 Dec 2009 18:49:05 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.403
X-Spam-Level:
X-Spam-Status: No, score=-1.403 tagged_above=-999 required=5 tests=[AWL=0.177, BAYES_00=-2.599, HELO_EQ_AU=0.377, HOST_EQ_AU=0.327, SARE_MILLIONSOF=0.315]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id GgHMkY93u+Yn for <lisp@core3.amsl.com>; Sun, 6 Dec 2009 18:49:03 -0800 (PST)
Received: from gair.firstpr.com.au (gair.firstpr.com.au [150.101.162.123]) by core3.amsl.com (Postfix) with ESMTP id 04D0428C0D7 for <lisp@ietf.org>; Sun, 6 Dec 2009 18:49:01 -0800 (PST)
Received: from [10.0.0.6] (wira.firstpr.com.au [10.0.0.6]) by gair.firstpr.com.au (Postfix) with ESMTP id AFF0C175A4F; Mon, 7 Dec 2009 13:48:49 +1100 (EST)
Message-ID: <4B1C6D19.4080404@firstpr.com.au>
Date: Mon, 07 Dec 2009 13:48:57 +1100
From: Robin Whittle <rw@firstpr.com.au>
Organization: First Principles
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: lisp@ietf.org
References: <mailman.32.1260043203.5501.lisp@ietf.org> <4eb512450912060728l4b0c36d4neaa6ebcc2626db5c@mail.gmail.com> <4B1BE583.2010705@ac.upc.edu>
In-Reply-To: <4B1BE583.2010705@ac.upc.edu>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Cc: jnc@mercury.lcs.mit.edu, Charrie Sun <charriesun@gmail.com>
Subject: [lisp] ALT structure, robustness and the long-path problem
X-BeenThere: lisp@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: List for the discussion of the Locator/ID Separation Protocol <lisp.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/lisp>, <mailto:lisp-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/lisp>
List-Post: <mailto:lisp@ietf.org>
List-Help: <mailto:lisp-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/lisp>, <mailto:lisp-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 07 Dec 2009 02:49:05 -0000
Short version: How can the ALT structure scale to 10^8, 10^9 or 10^10 EIDs with minimal delay times and robustness against single points of failure? If it can't, then there's no advantage in using ALT compared to the simpler NERD, which does not delay packets - or APT or Ivip, which delay initial packets by at most a few tens of milliseconds. In the thread "Re: [lisp] lisp Digest, Vol 13, Issue 13", Florin Coras wrote, quoting Letong (Charrie Sun): >> Florin Coras and his colleagues have done a very good simulation on this; >> i'm thinking that this is the IPv4 scene, an ITR can actually contains all >> the pointer information of EID mapping, i.e. we needn't the first level of >> ALT nodes since there are only 2^8 nodes in the second level. And these >> second-level nodes can be a full mesh. > The issue with this approach is that you need to keep hundreds of GRE > tunnels and BGP sessions. Not having practical experience with such > large networks I wouldn't know what the upper bound is. Either way, > already the second level routers aggregate thousands of prefixes from > lower level ones and these values might not be sustainable. The trivial > solution to this would be to use more levels which would result in > higher latencies. Yes - in order to maintain some reasonable limit on the number of GRE tunnels and therefore the number of BGP neighbours for each ALT router, there needs to be a limit to how much aggregation takes place in each level of the ALT structure. >> But what if EID is based on IPv6? In >> my design in the proposal there can be 5 layers and node degree up to 2^24. >> Well this means that a longest map request path would be of 8 hops. Based on >> the assumption that average delay between two random nodes be 115ms, this >> means a 1s delay around. This is critical for ITRs for an ITR is expected to >> (at least in LISP?) buffer the unsolved packets, especially in the >> high-speed network. So additional mechanism need to be digged out, as well >> as the careful design of the ALT architecture. >> Right? > As far as I know packets are dropped at the ITR when there is no mapping > for a certain destination and the host applications are expected cope > with this. With ALT, the initial packets in a new communication flow (to a destination EID the ITR has no mapping for) will either be dropped or delayed. I understand they are dropped in the current implementation, since it is not known how long it will be before the map reply arrives - and by then it may be more disruptive to send a delayed packet than to drop it. Therefore, when an application starts sending packets to an IP address it has not used before, if its ITR has no mapping, the application will frequently receive no response. It may retry, or it may try another IP address, so potentially restarting the process. In a system such as LISP-NERD, there is no delay since the ITRs have the full mapping database. With APT or Ivip, the delay would typically be in the millisecond or a few tens of millisecond range, since the mapping queries go to local full database query servers (in the same ISP or end-user network). So there is little chance of lost query or response packets and the response will normally arrive very quickly. Such delays are probably small enough to be of no practical significance to the application or end-user, so the ITR would buffer the packet and send it milliseconds later, once the mapping reply arrives. At some level of the ALT structure all the routers are full meshed. Using an IPv4 example, in the simplest possible ALT structure, and meshing at the /8 level, there would be 224 routers at this level, each with 223 GRE tunnels and neighbours at that level. Assuming an even distribution of EIDs over the 224 /8 prefixes, and assuming no correlation between EID and the part of the ALT structure the requesting ITR connects to, then each one of these routers has to handle most of the map requests for a single /8 of EID space. For instance, the single ALT router which aggregates 55.0.0.0/8 handles the map requests from all ITRs apart from the small (1/224) subset which connect to the ALT network somewhere in the 55.0.0.0 inverted tree below this ALT router. To minimise the number of GRE tunnels and BGP neighbours at the highest level of the ALT network, we would raise the aggregation level - such as to /7 or /6. However, this doubles or quadruples the traffic handled by each of the now smaller number of top level routers. It looks highly unscalable to concentrate about 1/224 of the world's map requests through a single router. So the solution is to either lower the level of full aggregation: /9, means each router has 511 GRE tunnels and BGP neighbours or to retain the /8 full mesh, and implement some kind of meshiness below that level. However, this leads to only a partial set of connections and to a huge increase in the number of GRE tunnels and BGP peers for routers at lower levels. While in practice there would probably be some centralization of the physical locations of the highest level ALT routers, such as in North America, to be properly scalable the ALT design needs to cope with the ALT routers at all levels being pretty much randomly distributed around the Earth, or at least those parts of the Earth with Internet infrastructure. This is because the organisations which run them, each somehow being responsible for a part of the ALT address space, should be assumed to have their facilities pretty much randomly distributed around the Net. So a GRE tunnel between one ALT router and the next - either a mesh across the same level, or a link from one level to the another - would physically be carried via multiple Internet routers and data links. This would often be a long physical length, involving a dozen or so routers and one or more trans-oceanic cables. Just because the ALT network in its simplest form can be depicted by a neat diagram, doesn't mean that the path-lengths from one ALT router to the next will be short. Typically, they will be long, because typically, the whole purpose of LISP or the other core-edge separation schemes is to make it easy to use EID address space anywhere in the Net. So the LISP-ALT design must assume that ALT routers, or the locations of the authoritative query servers (ETRs for non-mobile LISP) are going to be randomly physically distributed. A simple arrangement for ALT in IPv4 would be to assume that EIDs are no longer than /24 and that the ALT network is fully meshed at /8. This leaves 16 bits to span. If we set a limit of 16 GRE tunnels to the ALT routers in the next lowest level, and one to the ALT router in the level above, then we get a network like this: L0 224 ALT routers each for a /8 L1 224 x 16 = 3584 ALT routers, each for a /12. One set of 16 under each of the L0 routers. L2 3584 x 16 = 57344 ALT routers, each for a /16. One set of 16 under each of the L1 routers. L3 57344 x 16 = 458752 ALT routers, each for a /20. One set of 16 under each of the L2 routers. Each such router connects via a tunnel to up to 16 ETRs, each of which could be anywhere in the world and each of which is the authoritative query server for a /24 EID. In reality, not every part of the hierarchy would be populated, since only a substantial subset of the address space will be used for EIDs. Also, some EIDs will be shorter than /24, so there will be less EIDs than depicted above, requiring fewer ALT routers. However, the ALT network should scale to cope well with hundreds of millions, to billions, of EIDs. In IPv4 this means most of them will be /30, /31 or /32. An ALT router doesn't need to maintain expensive GRE tunnels and of course not BGP links to each ETR - and the ALT routers will probably link to Map Servers which themselves link to multiple ETRs, so this does somewhat reduce the number of ALT routers needed at the lower levels of the inverted tree (those with numerically higher L numbers). But ALT needs to scale to hundreds of millions, and ideally billions of EIDs. If ALT can't scale to this, then there's no point to it at all. The sole advantage of the ALT (or CONS or TRRP) decentralised, global, mapping query server system is that no single device needs to handle the whole mapping database. I argue that even the largest conceivable mapping database, of 10 billion EIDs each with 100 bytes or so, is perfectly feasible to transport the database to, and store it within, many single devices. This is a terabyte and it fits on a consumer hard drive today. APT and Ivip involve hundreds of thousands of local query servers which can do this. Even if it is argued that it is impossible or undesirable to have any single device, much less tens of thousands of them, store the entire mapping database of 10^7 EIDs, it is still true that APT needs to scale to some largish number of EIDs, such as 100M to 1B. If LISP is not required to scale to such numbers of EIDs, then it would make sense to use NERD, since an individual ITR can easily store 100M EIDs. The bandwidth requirements for updating the ITR would not be a significant problem in the scheme of things by the time such numbers of EIDs are used. So let's assume at least 10^8 EIDs, and ideally 10^9 or 10^10 - and consider the structure of the ALT network in IPv6. I assume the longest EID would be a /64. Wherever the top level of the ALT hierarchy is, such as /8 or /12, this is a lot of bits to span. Many parts of the address space will not be EIDs, but the parts which are will collectively contain vast numbers of EIDs. Due to the portability which is a requirement of any scalable mapping solution, we must design for a case where there is little or no discernible relationship between the EID address and the topological location of the ETR which is authoritative for this EID's mapping. So the first level of aggregation - whether it is Map Servers under ALT routers directly, or ALT routers linking directly to ETRs - will involve tunnels or some other kind of communication session leading to ETRs all over the world. There is unlikely to be much geographic or topological closeness between ALT routers at different levels, so each GRE tunnels which querys ascend and descend up and down the hierarchy must be assumed to be often quite long, and potentially involve one or more long transoceanic cable links. With a full scale IPv6 deployment, such as 10^9 EIDs, the simplest form of ALT hierarchy is going to have many levels - considering whatever limits there are on how many GRE tunnels and BGP neighbours each ALT router can have. Since each of these tunnels is sometimes - often - geographically and topographically long, it is clear that the total path taken by many mapping requests is going to be physically very long indeed. Most requests will ascend to the top level, traversing as many GRE tunnels and ALT routers as there are in the hierarchy. A few will not need to ascend so high before they find an ALT router which aggregates the destination EID. Then, the request needs to descend down the hierarchy with a similar number of GRE tunnels and ALT routers. Both up and down, each GRE tunnel involves typically many Internet routers and long-distance data-links. Clearly there are problems with scaling the ALT network to the sizes which are required to give it any advantage over the simpler, faster and more robust alternatives of NERD, APT or Ivip. When the ALT network has so many levels, there are going to be serious problems with the time taken for the request to ascend and descend the hierarchy, and likewise serious problems with query packets being dropped. But the above discussion is for a strictly aggregated ALT network - the simplest possible arrangement which does the job and respects the limits of each router in terms of GRE tunnels and BGP neighbours. How is the above simple tree-like structure going to be elaborated upon so the system continues to work when one or more ALT routers, or their GRE tunnels, suffers transient or lasting failure? The ALT designers have never proposed how they would solve these problems. I think ALT or any other global query server approach is a non-starter because the delays it causes will make it impossible to have the scheme adopted widely enough on a voluntary basis. The initial packet delay problem, due to the global nature of the query system made much worse by the "long-path" problem of GRE links often being geographically very long, means that ALT fails to meet constraint 9 of: http://www.firstpr.com.au/ip/ivip/RRG-2009/constraints/ If there were no competing alternative, violating this constraint might not be such a problem. However, NERD, APT and Ivip meet all the constraints. In particularly, they meet 9 because they do not significantly delay any packets, including initial packets. >> As ALT nodes are communicated on an overlay and use tunnels, can i discard >> the actual delays between two nodes and use a more general value to measure >> the reponse delay, that is, the measurement would be solely bounded with the >> hops they experience. > Not sure I understand this correctly but given the variability of the > delays we got, between some ms to 2.5s, (or if you exclude the top and > bottom 10% values 400ms to 800ms) I'm not sure if that would be wise. > On the other hand, if you are reassured (by feedback on the list) that > the hierarchy would be built from grounds up and the ALT routers at a > certain level in the ALT hierarchy would have a similar geographical > distance (i.e. would be geographically close) to those that aggregate > their announced prefixes I suppose your approximation would work. It can't be assumed that ALT routers at a certain level would all be in the same geographical location. The whole idea of a routing scaling solution is to make EID space equally usable in any location in the Net. This means there are no particular locations where an ALT router could be close to the ETRs which are authoritative for the EIDs whose space the router is aggregating. You could put all the ALT routers for the 55.0.0.0/8 section of EID space close together, such as in Shanghai, Los Angeles or London. That would work fine between these routers, but that means they can't be close to most of the locations where the ETRs are located which are authoritative for the millions of EIDs in this range. Nor does it mean that the top router in this branch is going to be close to the other /8 ALT routers at the top level. Before anyone thinks they can simulate the ALT network, they need to decide its structure, for a given number of EIDs, in either IPv4 or IPv6. This is all guesswork, as far as I know, since the ALT designers have never given detailed examples of how a full scale production ALT network would be structured. The only advantages of ALT over NERD, APT or Ivip its that it can: 1 - Theoretically scale to any number of EIDs, since no one device is required to store the entire mapping database. 2 - Enable each EID to change its mapping, and to change its authoritative ETR (which is the same thing in LISP), as often as required, without requiring this information to be propagated very far. For NERD, APT and Ivip, mapping changes need to propagated to every ITR (NERD) or every local query server (APT or Ivip). With ALT, the ETR changes only need to be propagated to the lowest level ALT router, or to the Map Server - whichever is the lowest level. With ALT, the mapping changes are not propagated anywhere, since the mapping is internal state of each ETR, and nothing in the ALT network, including the Map Servers (AFAIK) caches the mapping. (However, maybe the changes need to be propagated to some ITRs which recently requested it or which for some other reason are identified as needing the update.) So the ALT network needs to scale well to 10^8, 10^9 or 10^10 EIDs, while minimising delays and remaining robust in the event of link and router failure. No-one has ever explained how this can be done. - Robin Further discussion and a diagram of the "long-path problem" is at: http://www.firstpr.com.au/ip/ivip/lisp-links/#critiques http://www.antd.nist.gov/~ksriram/strong_aggregation.png
- Re: [lisp] lisp Digest, Vol 13, Issue 13 Charrie Sun
- Re: [lisp] lisp Digest, Vol 13, Issue 13 Florin Coras
- [lisp] ALT structure, robustness and the long-pat… Robin Whittle
- Re: [lisp] ALT structure, robustness and the long… Robin Whittle
- Re: [lisp] ALT structure, robustness and the long… Florin Coras
- Re: [lisp] ALT structure, robustness and the long… Sam Hartman
- Re: [lisp] ALT structure, robustness and the long… Florin Coras
- [lisp] Gleaning and research assumptions Sam Hartman
- [lisp] Fwd: ALT structure, robustness and the lon… Charrie Sun
- Re: [lisp] ALT structure, robustness and the long… Charrie Sun
- Re: [lisp] Gleaning and research assumptions Charrie Sun
- Re: [lisp] ALT structure, robustness and the long… Robin Whittle
- Re: [lisp] Gleaning and research assumptions Robin Whittle
- Re: [lisp] Gleaning and research assumptions Sam Hartman
- Re: [lisp] Gleaning and research assumptions Sam Hartman
- Re: [lisp] Gleaning and research assumptions Charrie Sun