LC Comments on "considerations" scaling analysis
Appendix A of the "considerations" draft purports to provide an analysis
showing that BGP C-multicast routing provides a much more efficient solution
for PE-PE communication than does PE-PE PIM. The analysis is not based on
the "hard state vs. soft state" issues, but on the amount of state that
needs to be maintained, and the number of control messages that need to be
sent and/or received, in order to join or leave a tree.
The analysis does not seem correct to me.
Let's look at a particular VPN, which is attached to n PE routers, PE1, ..,
PEn. Suppose PE1 is attached to a site containing multicast source S.
Let's also suppose that each other PE has received a PIM Join(S,G) message
over at least one VRF interface.
Using PE-PE PIM over an MI-PMSI, how many control messages over the MI-PMSI
are needed to enable PE2, ..., PEn to join the (S,G) tree via PE1? Answer:
1. PE2, say, sends a PIM Join(S,G) over the MI-PMSI. PE1 receives this
message, and as a result may send a PIM Join(S,G) out one of its VRF
interfaces. PE3, ..., PEn also receive PE2's Join(S,G), and as a result
they do not send Joins themselves.
Appendix A says that the messaging cost of additional PEs after the first
joining an (S,G) tree is the same as the cost of the first PE joining the
tree. This is false. The messaging cost of each additional PE to join the
tree is 0.
Appendix A, noting the fact that the first Join must be received by every PE
in the VPN, states that the message processing overhead is O(# PEs in the
VPN). That is not really a meaningful measure of anything. The messaging
itself is O(1), and there is no router whose processing increases as O(# PEs
in the VPN). In each PE router, the message processing per tree per VPN
that is due to PE-PE interactions is O(1).
It would be more illuminating to take the PE-CE interactions into account as
well. In PIM, whether a given interface is p2p or multiaccess, only a
single Join(S,G) message is needed to enable all the PEs (except the
upstream PE, of course) on that interface to join the (S,G) tree. At a
given node, the PIM messaging overhead per tree is actually proportional to
the number of interfaces, not to the number of PIM adjacencies. (Excluding
Hellos, of course, just as Appendix A does.) That's why the PE-PE overhead
per VPN tends to be dwarfed by the PE-CE overhead; for a given VPN, a PE may
have lots of VRF interfaces, but it only has one PE-PE interface. In many
cases, the PE-PE overhead is just in the noise.
Anyway, let's look at the case where BGP C-multicast routing is used. Now
each PE joining an (S,G) tree must send a C-multicast Source Tree Join to
each RR. If n-1 PEs want to join the tree (with the nth PE being the
upstream PE), the number of C-multicast Source Tree Joins sent is n-1 times
the number of RRs (most likely 2*(n-1)). In PIM, this would have been done with
only one message; it is BGP that makes the number of messages needed
proportional to the number of PEs in the VPN. The reason is that there is
no way in BGP to prevent each PE from issuing a Join.
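
To make the comparison concrete, here is a rough sketch that counts the
Join messages in each scheme for a single (S,G) tree. The function names
are hypothetical, the PIM count assumes ideal Join Suppression (no Joins
lost or sent simultaneously), and the default of two RRs is only an
assumption about a typical deployment.

```python
def pim_join_messages(num_pes: int) -> int:
    """Joins sent over the MI-PMSI so that all downstream PEs join one tree.

    Assumes ideal Join Suppression: the first downstream PE sends a single
    Join, every other PE sees it and stays silent.
    """
    if num_pes < 1:
        raise ValueError("a VPN needs at least one PE")
    # With only the upstream PE present, nobody needs to send a Join.
    return 1 if num_pes > 1 else 0


def bgp_join_messages(num_pes: int, num_rrs: int = 2) -> int:
    """C-multicast Source Tree Joins sent by the downstream PEs to the RRs.

    Each of the num_pes - 1 joining PEs sends one Join to every RR, because
    BGP offers no way to suppress a PE's Join.
    """
    if num_pes < 1 or num_rrs < 0:
        raise ValueError("invalid PE or RR count")
    return (num_pes - 1) * num_rrs


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, pim_join_messages(n), bgp_join_messages(n))
```

The PIM count stays constant as n grows, while the BGP count grows
linearly with the number of PEs in the VPN.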
PIM Join Suppression is a big savings when receivers are widely dispersed
among the sites; if each PE is attached to a site with receivers, the need
to process someone else's Join is a good tradeoff against the need to send
your own Join. If receivers are concentrated in a few sites, this tradeoff
is not so good. But many multicast applications have the model of a few
sources with a lot of widely dispersed receivers.
Appendix A is also incorrect about the overhead related to PIM Prune(S,G)
messages.
> By default when PIM LAN procedures are used, when a PE Prunes
> itself from a multicast tree, all other PEs check their own state
> to known if they are on the tree, in which case they send a PIM
> Join message to override the Prune. The "did the last receiver
> leave?" question is thus implicitly replied to by all PE routers,
> for each PIM Prune message.
>
> We can see that answering the "last receiver leaves" question is a
> significant proportion of the work that the C-multicast routing
> building block has to make
I don't see how this conclusion follows. If a PE has "Upstream Join (S,G)
state" for the logical interface connecting the PEs, then it always has a
timer that it uses to refresh the Join. If the PE sees a Join(S,G) from
another PE, it restarts the timer (causing Join Suppression). If the PE
sees a Prune(S,G) from another PE, it sets the timer to a shorter
pre-computed randomized value. If, before that timer expires, it sees a
Join(S,G) from another PE, it resets the timer and sets it back to its usual
value. Appendix A would lead one to believe that a Prune(S,G) immediately
causes lots of Join(S,G) messages to be sent; this is not true.
It takes only one message to do a Prune, and one message to override it.
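
The timer behavior described above can be sketched as a small state
machine. This is only an illustration of the description in this message,
not an implementation of the PIM specification: the class and method names
are hypothetical, and the 60 second refresh interval and 2.5 second
override interval are assumed values chosen for the example.

```python
import random


class UpstreamJoinState:
    """Upstream Join(S,G) state on the PE-PE interface (illustrative only)."""

    def __init__(self, refresh_interval=60.0, override_interval=2.5, rng=None):
        self.refresh_interval = refresh_interval    # assumed usual value
        self.override_interval = override_interval  # assumed upper bound
        self.rng = rng or random.Random()
        self.timer = refresh_interval  # seconds until we send our own Join
        self.joins_sent = 0

    def on_join_seen(self):
        # Another PE's Join refreshes the tree for us: restart the timer
        # at its usual value (Join Suppression).
        self.timer = self.refresh_interval

    def on_prune_seen(self):
        # Another PE's Prune: shorten the timer to a small randomized value
        # so that one of the remaining PEs overrides the Prune.
        override = self.rng.uniform(0.0, self.override_interval)
        self.timer = min(self.timer, override)

    def advance(self, seconds):
        # Advance simulated time; handles at most one expiry per call.
        self.timer -= seconds
        if self.timer <= 0:
            self.joins_sent += 1  # send our own Join(S,G)
            self.timer = self.refresh_interval
```

A PE that sees a Prune and then an overriding Join from some other PE
before its shortened timer expires sends nothing at all; only the PE whose
randomized timer fires first sends the override.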
In BGP, if PE2 decides to prune itself from the (S,G) tree, it has to send
each of the RRs a message withdrawing its Source Tree Join C-multicast route
for (S,G). Then the RR has to use its BGP decision process to determine
whether there is another Source Tree Join C-multicast route for the same
(S,G). Then the RR has to distribute the latter route. The number of
messages is no fewer.
I think the distinction that Appendix A is really trying to get at is the
following.
Consider a given interface of a given node. Suppose that interface appears
either as the upstream interface or as a downstream interface of n trees.
Then let's say that interface contains n "branches". The total amount of
state that PIM needs to maintain is roughly proportional to the sum,
over all the node's interfaces, of the number of branches.
Suppose that in a given PE, for a given VPN, there are i VRF interfaces, and
that the average number of branches per interface is m. For each VPN, there
is also a single PE-PE interface (a multiaccess interface to which all the
PEs of the VPN belong). Let's suppose that this link contains k branches,
where j of these branches belong to trees that the PE is a member of, and
k-j belong to other trees in the VPN.
If PE-PE PIM is used, the amount of state is proportional to i*m+k. If BGP
C-multicast routing is used, the amount of state is proportional to i*m+j.
The incremental state needed by PIM is thus proportional to k-j. For the
state savings to be significant, one must assume not only that k is much
larger than j, but also that k-j is significant when compared to i*m.
Basically this means that you need to assume that each PE has receivers for
only a small number of the VPN's trees, or else that there are a small
number of VRF interfaces per VPN. Whether or not these assumptions are
accurate depends of course on the set of multicast applications being used
by the customers.
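
As a rough numerical illustration of the formulas above, the following
sketch computes both state figures and the fraction of state saved. The
function names and the sample values of i, m, k and j are made up for the
example; they are not taken from any measurement.

```python
def pim_state(i: int, m: float, k: int) -> float:
    """State with PE-PE PIM: proportional to i*m + k."""
    return i * m + k


def bgp_state(i: int, m: float, j: int) -> float:
    """State with BGP C-multicast routing: proportional to i*m + j."""
    return i * m + j


def relative_saving(i: int, m: float, k: int, j: int) -> float:
    """Fraction of the PIM state saved by BGP: (k - j) / (i*m + k)."""
    if not 0 <= j <= k:
        raise ValueError("j must satisfy 0 <= j <= k")
    total = pim_state(i, m, k)
    return (k - j) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical PE: 20 VRF interfaces, 50 branches each on average,
    # 200 branches on the PE-PE interface, 20 of them for trees the PE is on.
    print(pim_state(20, 50, 200))            # 1200
    print(bgp_state(20, 50, 20))             # 1020
    print(relative_saving(20, 50, 200, 20))  # 0.15
```

With these sample numbers k is ten times j, yet the saving is only 15%,
because i*m dominates the total.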
There are also a couple of important factors which Appendix A has omitted:
- If you are using two different protocols to maintain each multicast state,
then the total amount of state is effectively doubled. This will
generally result in both a memory cost and a CPU cost. While this is a
"mere factor of two," perhaps it deserves a mention in any document that
compares a control plane that uses two protocols (e.g. PIM for PE-CE, BGP
for PE-PE) to a control plane that uses only one.
- For sparse mode, which is the type of multicast most commonly found in the
enterprises that buy the VPN service, there is quite a bit of additional
messaging in BGP that has not been considered.
When PE1 receives a C-multicast Source Tree Join for a sparse mode group,
it has to generate a Source Active A-D Route. It needs to send a BGP
update for this route to each RR. Each RR needs to send it to each of
  PE2, ..., PEn. So we have more messaging which is O(# of PEs in VPN).
Needless to say, PIM also has sparse mode overhead which hasn't been
considered, but the point is that BGP overhead is not O(1).
- One might get the impression that the BGP scheme eliminates any need for
the C-PIM state machine to maintain states for the PMSIs that connect the
PEs over the backbone. This is not true; PEs must maintain Prune(S,G,rpt)
for the PMSIs. PEs receiving multicast data over a PMSI must also do the
RPF interface check for the arriving data, considering the PMSI as the
"input interface". The backbone does not become transparent to the C-PIM
interfaces just because PIM control packets do not flow over the PMSIs.
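
The Source Active A-D route messaging mentioned in the sparse mode item
above can be counted with a short sketch. The function name is
hypothetical, and the count assumes that each RR reflects the route only
to the other PEs of the VPN (not to other RRs or back to PE1) and that
there are two RRs by default.

```python
def source_active_updates(num_pes: int, num_rrs: int = 2) -> int:
    """BGP updates needed to distribute one Source Active A-D route.

    PE1 sends one update to each RR, and each RR sends one update to each
    of the other num_pes - 1 PEs in the VPN.
    """
    if num_pes < 1 or num_rrs < 0:
        raise ValueError("invalid PE or RR count")
    to_rrs = num_rrs
    from_rrs = num_rrs * (num_pes - 1)
    return to_rrs + from_rrs


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, source_active_updates(n))
```

Under these assumptions the total is the number of RRs times the number
of PEs, so it grows linearly with the size of the VPN.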
Since some of the conclusions of the draft depend on the analysis in
Appendix A, I think those conclusions need to be removed until the
underlying analysis is corrected.
If and when a corrected version of the analysis is available, it should be
last called by the PIM WG, which is where one might expect to find the most
expertise in PIM.