[lisp] ALT structure, robustness and the long-path problem

Robin Whittle <rw@firstpr.com.au> Mon, 07 December 2009 02:49 UTC

Return-Path: <rw@firstpr.com.au>
X-Original-To: lisp@core3.amsl.com
Delivered-To: lisp@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 268573A697F for <lisp@core3.amsl.com>; Sun, 6 Dec 2009 18:49:05 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.403
X-Spam-Level:
X-Spam-Status: No, score=-1.403 tagged_above=-999 required=5 tests=[AWL=0.177, BAYES_00=-2.599, HELO_EQ_AU=0.377, HOST_EQ_AU=0.327, SARE_MILLIONSOF=0.315]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id GgHMkY93u+Yn for <lisp@core3.amsl.com>; Sun, 6 Dec 2009 18:49:03 -0800 (PST)
Received: from gair.firstpr.com.au (gair.firstpr.com.au [150.101.162.123]) by core3.amsl.com (Postfix) with ESMTP id 04D0428C0D7 for <lisp@ietf.org>; Sun, 6 Dec 2009 18:49:01 -0800 (PST)
Received: from [10.0.0.6] (wira.firstpr.com.au [10.0.0.6]) by gair.firstpr.com.au (Postfix) with ESMTP id AFF0C175A4F; Mon, 7 Dec 2009 13:48:49 +1100 (EST)
Message-ID: <4B1C6D19.4080404@firstpr.com.au>
Date: Mon, 07 Dec 2009 13:48:57 +1100
From: Robin Whittle <rw@firstpr.com.au>
Organization: First Principles
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
MIME-Version: 1.0
To: lisp@ietf.org
References: <mailman.32.1260043203.5501.lisp@ietf.org> <4eb512450912060728l4b0c36d4neaa6ebcc2626db5c@mail.gmail.com> <4B1BE583.2010705@ac.upc.edu>
In-Reply-To: <4B1BE583.2010705@ac.upc.edu>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Cc: jnc@mercury.lcs.mit.edu, Charrie Sun <charriesun@gmail.com>
Subject: [lisp] ALT structure, robustness and the long-path problem
X-BeenThere: lisp@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: List for the discussion of the Locator/ID Separation Protocol <lisp.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/lisp>, <mailto:lisp-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/lisp>
List-Post: <mailto:lisp@ietf.org>
List-Help: <mailto:lisp-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/lisp>, <mailto:lisp-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 07 Dec 2009 02:49:05 -0000

Short version:   How can the ALT structure scale to 10^8, 10^9 or
                 10^10 EIDs with minimal delay times and robustness
                 against single points of failure?

                 If it can't, then there's no advantage in using ALT
                 compared to the simpler NERD, which does not delay
                 packets - or APT or Ivip, which delay initial
                 packets by at most a few tens of milliseconds.


In the thread "Re: [lisp] lisp Digest, Vol 13, Issue 13", Florin
Coras wrote, quoting Letong (Charrie Sun):

>> Florin Coras and his colleagues have done a very good simulation on this;
>> i'm thinking that this is the IPv4 scene, an ITR can actually contains all
>> the pointer information of EID mapping, i.e. we needn't the first level of
>> ALT nodes since there are only 2^8 nodes in the second level. And these
>> second-level nodes can be a full mesh. 

> The issue with this approach is that you need to keep hundreds of GRE
> tunnels and BGP sessions. Not having practical experience with such
> large networks I wouldn't know what the upper bound is. Either way,
> already the second level routers aggregate thousands of prefixes from
> lower level ones and these values might not be sustainable. The trivial
> solution to this would be to use more levels which would result in
> higher latencies.

Yes - in order to maintain some reasonable limit on the number of GRE
tunnels and therefore the number of BGP neighbours for each ALT
router, there needs to be a limit to how much aggregation takes place
in each level of the ALT structure.


>> But what if EID is based on IPv6? In
>> my design in the proposal there can be 5 layers and node degree up to 2^24.
>> Well this means that a longest map request path would be of 8 hops. Based on
>> the assumption that average delay between two random nodes be 115ms, this
>> means a 1s delay around. This is critical for ITRs for an ITR is expected to
>> (at least in LISP?) buffer the unsolved packets, especially in the
>> high-speed network. So additional mechanism need to be digged out, as well
>> as the careful design of the ALT architecture.
>> Right?

> As far as I know packets are dropped at the ITR when there is no mapping
> for a certain destination and the host applications are expected cope
> with this.

With ALT, the initial packets in a new communication flow (to a
destination EID the ITR has no mapping for) will either be dropped or
delayed.  I understand they are dropped in the current
implementation, since it is not known how long it will be before the
map reply arrives - and by then it may be more disruptive to send a
delayed packet than to drop it.

Therefore, when an application starts sending packets to an IP
address it has not used before, if its ITR has no mapping, the
application will frequently receive no response.  It may retry, or it
may try another IP address, so potentially restarting the process.

In a system such as LISP-NERD, there is no delay since the ITRs have
the full mapping database.

With APT or Ivip, the delay would typically be a few milliseconds to
a few tens of milliseconds, since the mapping queries go to local
full-database query servers (in the same ISP or end-user network).
So there is little chance of lost query or response packets, and the
response will normally arrive very quickly.  Such delays are probably
small enough to be of no practical significance to the application or
end-user, so the ITR would buffer the packet and send it milliseconds
later, once the mapping reply arrives.


At some level of the ALT structure, all the routers are fully meshed.

Using an IPv4 example, in the simplest possible ALT structure, and
meshing at the /8 level, there would be 224 routers at this level,
each with 223 GRE tunnels and neighbours at that level.

Assuming an even distribution of EIDs over the 224 /8 prefixes, and
assuming no correlation between EID and the part of the ALT structure
the requesting ITR connects to, then each one of these routers has to
handle most of the map requests for a single /8 of EID space.

For instance, the single ALT router which aggregates 55.0.0.0/8
handles the map requests from all ITRs apart from the small (1/224)
subset which connect to the ALT network somewhere in the 55.0.0.0
inverted tree below this ALT router.

To minimise the number of GRE tunnels and BGP neighbours at the
highest level of the ALT network, we would raise the aggregation
level - such as to /7 or /6.  However, this doubles or quadruples the
traffic handled by each of the now smaller number of top level routers.

It looks highly unscalable to concentrate about 1/224 of the world's
map requests through a single router.

So the solution is to either lower the level of full aggregation:

   /9 means each of the 448 routers at that level has 447 GRE
   tunnels and BGP neighbours

or to retain the /8 full mesh, and implement some kind of meshiness
below that level.  However, this leads to only a partial set of
connections and to a huge increase in the number of GRE tunnels and
BGP peers for routers at lower levels.

While in practice there would probably be some centralization of the
physical locations of the highest level ALT routers, such as in North
America, to be properly scalable the ALT design needs to cope with
the ALT routers at all levels being pretty much randomly distributed
around the Earth, or at least those parts of the Earth with Internet
infrastructure.

This is because the organisations which run them, each somehow being
responsible for a part of the ALT address space, should be assumed to
have their facilities pretty much randomly distributed around the Net.

So a GRE tunnel between one ALT router and the next - either a mesh
across the same level, or a link from one level to another -
would physically be carried via multiple Internet routers and data
links.  This would often be a long physical length, involving a dozen
or so routers and one or more trans-oceanic cables.

Just because the ALT network in its simplest form can be depicted by
a neat diagram, doesn't mean that the path-lengths from one ALT
router to the next will be short.  Typically, they will be long,
because typically, the whole purpose of LISP or the other core-edge
separation schemes is to make it easy to use EID address space
anywhere in the Net.  So the LISP-ALT design must assume that ALT
routers, or the locations of the authoritative query servers (ETRs
for non-mobile LISP) are going to be randomly physically distributed.

A simple arrangement for ALT in IPv4 would be to assume that EIDs are
no longer than /24 and that the ALT network is fully meshed at /8.
This leaves 16 bits to span.  If we set a limit of 16 GRE tunnels to
the ALT routers in the next lowest level, and one to the ALT router
in the level above, then we get a network like this:

   L0    224 ALT routers each for a /8

   L1    224 x 16 = 3584 ALT routers, each for a /12.
         One set of 16 under each of the L0 routers.

   L2    3584 x 16 = 57344 ALT routers, each for a /16.
         One set of 16 under each of the L1 routers.

   L3    57344 x 16 = 917504 ALT routers, each for a /20.
         One set of 16 under each of the L2 routers.
         Each such router connects via a tunnel to up to
         16 ETRs, each of which could be anywhere in the
         world and each of which is the authoritative
         query server for a /24 EID.
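This level structure can be generated mechanically - a sketch
assuming the 224 /8s and the fan-out of 16 (4 address bits) per level
used above:

```python
# Sketch of the IPv4 ALT hierarchy described above: 224 top-level /8
# routers, a fan-out of 16 (4 address bits) per level, down to /24 EIDs.

routers = 224
for level, plen in enumerate(range(8, 24, 4)):
    print(f"L{level}: {routers} ALT routers, each for a /{plen}")
    routers *= 16                 # 16 tunnels down to the next level
print(f"... and up to {routers} /24 EIDs served by ETRs under L3")
```

As a cross-check, 224 x 16^4 = 14,680,064, which is exactly the
number of /24s in 224 /8s - the fan-out of 16 spans the space with no
slack.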

In reality, not every part of the hierarchy would be populated, since
only a substantial subset of the address space will be used for EIDs.

Also, some EIDs will be shorter than /24, so there will be fewer EIDs
than depicted above, requiring fewer ALT routers.

However, the ALT network should scale to cope well with hundreds of
millions, to billions, of EIDs.  In IPv4 this means most of them will
be /30, /31 or /32.

An ALT router doesn't need to maintain expensive GRE tunnels - much
less BGP sessions - to each ETR, and the ALT routers will probably
link to Map Servers which themselves link to multiple ETRs, so this
does somewhat reduce the number of ALT routers needed at the lower
levels of the inverted tree (those with numerically higher L numbers).

But ALT needs to scale to hundreds of millions, and ideally billions
of EIDs.  If ALT can't scale to this, then there's no point to it at
all.  The sole advantage of the ALT (or CONS or TRRP) decentralised,
global, mapping query server system is that no single device needs to
handle the whole mapping database.

I argue that even the largest conceivable mapping database - 10
billion EIDs of 100 bytes or so each - could feasibly be transported
to, and stored within, many individual devices.  This is a terabyte,
which fits on a consumer hard drive today.  APT and Ivip involve
hundreds of thousands of local query servers which can do this.
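The arithmetic behind this claim is simple - a back-of-envelope
sketch using the 10 billion EIDs and roughly 100 bytes per mapping
entry assumed above:

```python
# Back-of-envelope check: can one device hold the whole mapping database?
eids = 10 ** 10               # 10 billion EIDs, the largest case considered
bytes_per_eid = 100           # rough per-entry mapping size from the text
total = eids * bytes_per_eid
print(total / 10 ** 12, "TB")   # 1.0 TB - fits on a consumer hard drive
```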

Even if it is argued that it is impossible or undesirable to have any
single device, much less tens of thousands of them, store the entire
mapping database of 10^10 EIDs, it is still true that ALT needs to
scale to some largish number of EIDs, such as 100M to 1B.

If LISP is not required to scale to such numbers of EIDs, then it
would make sense to use NERD, since an individual ITR can easily
store 100M EIDs.  The bandwidth requirements for updating the ITR
would not be a significant problem in the scheme of things by the
time such numbers of EIDs are used.

So let's assume at least 10^8 EIDs, and ideally 10^9 or 10^10 - and
consider the structure of the ALT network in IPv6.

I assume the longest EID would be a /64.  Wherever the top level of
the ALT hierarchy is, such as /8 or /12, this is a lot of bits to
span.  Many parts of the address space will not be EIDs, but the
parts which are will collectively contain vast numbers of EIDs.

Due to the portability which is a requirement of any scalable mapping
solution, we must design for a case where there is little or no
discernible relationship between the EID address and the topological
location of the ETR which is authoritative for this EID's mapping.

So the first level of aggregation - whether it is Map Servers under
ALT routers directly, or ALT routers linking directly to ETRs - will
involve tunnels or some other kind of communication session leading
to ETRs all over the world.

There is unlikely to be much geographic or topological closeness
between ALT routers at different levels, so each GRE tunnel which
queries ascend and descend through must be assumed to be often quite
long, potentially involving one or more long transoceanic cable links.

With a full scale IPv6 deployment, such as 10^9 EIDs, the simplest
form of ALT hierarchy is going to have many levels - considering
whatever limits there are on how many GRE tunnels and BGP neighbours
each ALT router can have.
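To see why many levels are needed, a sketch of the arithmetic - the
/8 top level and the limit of 16 tunnels per router are illustrative
assumptions, not fixed parameters of the ALT design:

```python
import math

# How many ALT levels does IPv6 need?  Illustrative assumptions:
# top-level full mesh at /8, longest EID at /64, and a limit of
# 16 GRE tunnels (4 address bits) from each router to the level below.
top_prefix, eid_prefix, bits_per_level = 8, 64, 4

levels = math.ceil((eid_prefix - top_prefix) / bits_per_level)
print(levels)   # 14 levels below the top-level mesh
```

Even with a much more generous fan-out, such as 256 tunnels (8 bits)
per router, the hierarchy still needs 7 levels below the top mesh.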

Since each of these tunnels is sometimes - often - geographically and
topologically long, it is clear that the total path taken by many
mapping requests is going to be physically very long indeed.

Most requests will ascend to the top level, traversing as many GRE
tunnels and ALT routers as there are in the hierarchy.  A few will
not need to ascend so high before they find an ALT router which
aggregates the destination EID.  Then the request needs to descend
the hierarchy through a similar number of GRE tunnels and ALT
routers.  Both up and down, each GRE tunnel typically involves many
Internet routers and long-distance data-links.
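A rough delay estimate for such a path, using the 115 ms average
delay between two random nodes assumed in the quoted message - the
hierarchy depth here is illustrative:

```python
# Rough map-request delay: ascend to the top of the hierarchy, then
# descend a similar number of levels.  Each hop is one GRE tunnel
# between two more-or-less randomly located ALT routers.
levels = 5                    # illustrative hierarchy depth
delay_per_hop_ms = 115        # average random-node delay, per the quote
round_trip_hops = 2 * levels  # up the tree, then down the other side
print(round_trip_hops * delay_per_hop_ms, "ms")   # 1150 ms for the request
```

A second or more just for the Map Request to reach the authoritative
ETR - before the Map Reply even starts its journey back - is the
scale of delay at issue here.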

Clearly there are problems with scaling the ALT network to the sizes
which are required to give it any advantage over the simpler, faster
and more robust alternatives of NERD, APT or Ivip.

When the ALT network has so many levels, there are going to be
serious problems with the time taken for the request to ascend and
descend the hierarchy, and likewise serious problems with query
packets being dropped.

But the above discussion is for a strictly aggregated ALT network -
the simplest possible arrangement which does the job and respects the
limits of each router in terms of GRE tunnels and BGP neighbours.

How is the above simple tree-like structure going to be elaborated
upon so the system continues to work when one or more ALT routers, or
their GRE tunnels, suffer transient or lasting failure?

The ALT designers have never proposed how they would solve these
problems.

I think ALT or any other global query server approach is a
non-starter because the delays it causes will make it impossible to
have the scheme adopted widely enough on a voluntary basis.  The
initial packet delay problem, due to the global nature of the query
system, made much worse by the "long-path" problem of GRE links often
being geographically very long, means that ALT fails to meet
constraint 9 of:

  http://www.firstpr.com.au/ip/ivip/RRG-2009/constraints/

If there were no competing alternative, violating this constraint
might not be such a problem.  However, NERD, APT and Ivip meet all
the constraints.  In particular, they meet constraint 9 because they
do not significantly delay any packets, including initial packets.


>> As ALT nodes are communicated on an overlay and use tunnels, can i discard
>> the actual delays between two nodes and use a more general value to measure
>> the reponse delay, that is, the measurement would be solely bounded with the
>> hops they experience.

> Not sure I understand this correctly but given the variability of the
> delays we got, between some ms to 2.5s, (or if you exclude the top and
> bottom 10% values 400ms to 800ms) I'm not sure if that would be wise.
> On the other hand, if you are reassured (by feedback on the list) that
> the hierarchy would be built from grounds up and the ALT routers at a
> certain level in the ALT hierarchy would have a similar geographical
> distance (i.e. would be geographically close) to those that aggregate
> their announced prefixes I suppose your approximation would work.

It can't be assumed that ALT routers at a certain level would all be
in the same geographical location.  The whole idea of a routing
scaling solution is to make EID space equally usable in any location
in the Net.  This means there are no particular locations where an
ALT router could be close to the ETRs which are authoritative for the
EIDs whose space the router is aggregating.

You could put all the ALT routers for the 55.0.0.0/8 section of EID
space close together, such as in Shanghai, Los Angeles or London.
That would work fine between these routers, but that means they can't
be close to most of the locations where the ETRs are located which
are authoritative for the millions of EIDs in this range.

Nor does it mean that the top router in this branch is going to be
close to the other /8 ALT routers at the top level.

Before anyone thinks they can simulate the ALT network, they need to
decide its structure, for a given number of EIDs, in either IPv4 or
IPv6.  This is all guesswork, as far as I know, since the ALT
designers have never given detailed examples of how a full scale
production ALT network would be structured.

The only advantages of ALT over NERD, APT or Ivip are that it can:

  1 - Theoretically scale to any number of EIDs, since no one device
      is required to store the entire mapping database.

  2 - Enable each EID to change its mapping, and to change its
      authoritative ETR (which is the same thing in LISP), as often
      as required, without requiring this information to be
      propagated very far.  For NERD, APT and Ivip, mapping
      changes need to be propagated to every ITR (NERD) or every local
      query server (APT or Ivip).

      With ALT, the ETR changes only need to be propagated to the
      lowest level ALT router, or to the Map Server - whichever is
      the lowest level.

      With ALT, the mapping changes are not propagated anywhere,
      since the mapping is internal state of each ETR, and nothing
      in the ALT network, including the Map Servers (AFAIK) caches
      the mapping.  (However, maybe the changes need to be
      propagated to some ITRs which recently requested it or which
      for some other reason are identified as needing the update.)

So the ALT network needs to scale well to 10^8, 10^9 or 10^10 EIDs,
while minimising delays and remaining robust in the event of link and
router failure.  No-one has ever explained how this can be done.

 - Robin


Further discussion and a diagram of the "long-path problem" is at:

  http://www.firstpr.com.au/ip/ivip/lisp-links/#critiques
  http://www.antd.nist.gov/~ksriram/strong_aggregation.png