
LC Comments on "considerations" scaling analysis



Appendix A of the "considerations" draft purports to provide an analysis
showing that BGP C-multicast routing provides a much more efficient solution
for PE-PE communication than does PE-PE PIM.  The analysis is not based on
the "hard  state vs.  soft  state" issues, but  on the amount of  state that
needs to be  maintained, and the number of control messages  that need to be
sent and/or received, in order to join or leave a tree.

The analysis does not seem correct to me.

Let's look at a particular VPN, which is attached to n PE routers, PE1, ...,
PEn.  Suppose PE1 is attached to a site containing multicast source S.
Let's also suppose that each other PE has received a PIM Join(S,G) message
over at least one VRF interface.

Using PE-PE PIM over an MI-PMSI,  how many control messages over the MI-PMSI
are needed to enable PE2, ..., PEn  to join the (S,G) tree via PE1?  Answer:
1.  PE2,  say, sends a  PIM Join(S,G) over  the MI-PMSI.  PE1  receives this
message,  and as  a result  may send  a  PIM Join(S,G)  out one  of its  VRF
interfaces.  PE3, ..., PEn also receive PE2's Join(S,G), and as a result
they do not send Joins themselves.

Appendix A  says that the messaging  cost of additional PEs  after the first
joining an (S,G)  tree is the same as  the cost of the first  PE joining the
tree.  This is false.  The messaging  cost of each additional PE to join the
tree is 0.
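
To make the count concrete, here is a minimal sketch of the PIM case.  It
assumes idealized Join Suppression: the first Join(S,G) on the MI-PMSI is
seen by every other PE before that PE's own join timer fires.  The function
name is hypothetical, chosen only for this illustration.

```python
def pim_joins_on_mi_pmsi(num_pes: int) -> int:
    """Count Join(S,G) messages sent over the MI-PMSI when PE2..PEn all
    want to join the (S,G) tree via PE1.

    Assumption (idealized): the first Join is seen by all PEs before any
    other PE's join timer fires, so every later Join is suppressed.
    """
    downstream_pes = range(2, num_pes + 1)  # PE2 .. PEn
    joins_sent = 0
    join_seen_on_pmsi = False
    for _pe in downstream_pes:
        if join_seen_on_pmsi:
            continue  # suppressed: someone else's Join already covers us
        joins_sent += 1
        join_seen_on_pmsi = True
    return joins_sent


if __name__ == "__main__":
    for n in (2, 10, 1000):
        print(n, pim_joins_on_mi_pmsi(n))
```

Under that assumption the result does not depend on n: the first joiner
costs one message and each additional joiner costs none.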

Appendix A, noting the fact that the first Join must be received by every PE
in the  VPN, states that the message  processing overhead is O(#  PEs in the
VPN).  That is  not really a meaningful measure  of anything.  The messaging
itself is O(1), and there is no router whose processing increases as O(# PEs
in the  VPN).  In each  PE router, the  message processing per tree  per VPN
that is due to PE-PE interactions is O(1).

It would be more illuminating to take the PE-CE interactions into account as
well.   In PIM,  whether a  given interface  is p2p  or multiaccess,  only a
single  Join(S,G)  message is  needed  to enable  all  the  PEs (except  the
upstream PE,  of course)  on that interface  to join  the (S,G) tree.   At a
given node, the PIM messaging  overhead per tree is actually proportional to
the number of interfaces, not  to the number of PIM adjacencies.  (Excluding
Hellos, of course, just as Appendix  A does.)  That's why the PE-PE overhead
per VPN tends to be dwarfed by the PE-CE overhead; for a given VPN, a PE may
have lots of  VRF interfaces, but it only has one  PE-PE interface.  In many
cases, the PE-PE overhead is just in the noise.

Anyway, let's look at the case where BGP C-multicast routing is used.  Now
each PE joining an (S,G) tree must send a C-multicast Source Tree Join to
each RR.  If n-1 PEs want to join the tree (with the nth PE being the
upstream PE), the number of C-multicast Source Tree Joins sent is n-1 times
the number of RRs (most likely 2*(n-1)).  In PIM, this would have been done
with only one message; it is BGP that makes the number of messages needed
proportional to the number of PEs in the VPN.  The reason is that there is
no way in BGP to prevent each PE from issuing a Join.
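
A rough count of the BGP case, as a minimal sketch: each of the n-1 joining
PEs sends one C-multicast Source Tree Join to every RR, which is on the
order of n times the number of RRs.  The function name and the default of
two RRs are assumptions made for this illustration.

```python
def bgp_source_tree_joins(num_pes: int, num_rrs: int = 2) -> int:
    """Count C-multicast Source Tree Joins sent by PEs to RRs.

    num_pes counts all PEs in the VPN, including the upstream PE, which
    does not send a Join for this tree.
    """
    joining_pes = max(num_pes - 1, 0)
    return joining_pes * num_rrs


if __name__ == "__main__":
    for n in (2, 10, 1000):
        print(n, bgp_source_tree_joins(n))
```

Unlike the PIM case, this grows linearly with the number of PEs, because
nothing suppresses the per-PE Join.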

PIM Join  Suppression is a big  savings when receivers  are widely dispersed
among the sites; if  each PE is attached to a site  with receivers, the need
to process someone  else's Join is a good tradeoff against  the need to send
your own Join.  If receivers are  concentrated in a few sites, this tradeoff
is not  so good.  But  many multicast applications  have the model of  a few
sources with a lot of widely dispersed receivers.

Appendix A  is also incorrect about  the overhead related  to PIM Prune(S,G)
messages.

>      By default when PIM LAN procedures are used, when a PE Prunes
>      itself from a multicast tree, all other PEs check their own state
>      to known if they are on the tree, in which case they send a PIM
>      Join message to override the Prune.  The "did the last receiver
>      leave?" question is thus implicitly replied to by all PE routers,
>      for each PIM Prune message.
>
>   We can see that answering the "last receiver leaves" question is a
>   significant proportion of the work that the C-multicast routing
>   building block has to make

I don't see  how this conclusion follows.  If a PE  has "Upstream Join (S,G)
state" for  the logical interface connecting  the PEs, then it  always has a
timer that  it uses to refresh  the Join.  If  the PE sees a  Join(S,G) from
another PE,  it restarts  the timer (causing  Join Suppression).  If  the PE
sees  a  Prune(S,G)  from  another  PE,  it sets  the  timer  to  a  shorter
pre-computed randomized  value.  If,  before that timer  expires, it  sees a
Join(S,G) from another PE, it resets the timer and sets it back to its usual
value.  Appendix A  would lead one to believe  that a Prune(S,G) immediately
causes lots of Join(S,G) messages to be sent; this is not true.  

It takes only one message to do a Prune, and one message to override it.
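
The timer behavior described above can be sketched as a toy model.  This is
not an implementation of the PIM state machine; the class and function
names are hypothetical, the timer values (60 s refresh, up to 2.5 s
randomized override delay) are illustrative placeholders, and the model
assumes zero propagation delay, so the first override Join is seen by all
other PEs before their own shortened timers expire.

```python
import random


class UpstreamJoinTimer:
    """Toy model of one PE's (S,G) join timer on the PE-PE interface."""

    def __init__(self, rng, periodic=60.0, override_max=2.5):
        self.rng = rng
        self.periodic = periodic          # usual refresh interval (assumed)
        self.override_max = override_max  # max override delay (assumed)
        self.timer = periodic
        self.joins_sent = 0

    def see_join(self):
        # Someone else's Join: restart the timer at its usual value
        # (this is what produces Join Suppression).
        self.timer = self.periodic

    def see_prune(self):
        # Someone else's Prune: shorten the timer to a randomized value.
        self.timer = min(self.timer, self.rng.uniform(0.0, self.override_max))

    def expire(self):
        # Timer ran out: send our own Join and restart the timer.
        self.joins_sent += 1
        self.timer = self.periodic


def override_joins_after_prune(num_remaining_pes, seed=0):
    """Count Joins sent by the PEs still on the tree after one Prune."""
    rng = random.Random(seed)
    pes = [UpstreamJoinTimer(rng) for _ in range(num_remaining_pes)]
    if not pes:
        return 0
    for pe in pes:
        pe.see_prune()
    first = min(pes, key=lambda pe: pe.timer)  # shortest timer fires first
    first.expire()
    for pe in pes:
        if pe is not first:
            pe.see_join()  # the override Join suppresses everyone else
    return sum(pe.joins_sent for pe in pes)


if __name__ == "__main__":
    print(override_joins_after_prune(50))
```

In this model a Prune triggers a single override Join no matter how many
PEs remain on the tree.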

In BGP, if PE2  decides to prune itself from the (S,G)  tree, it has to send
each of the RRs a message withdrawing its Source Tree Join C-multicast route
for (S,G).   Then the RR  has to use  its BGP decision process  to determine
whether there  is another  Source Tree Join  C-multicast route for  the same
(S,G).   Then the  RR has  to distribute  the latter  route.  The  number of
messages is no fewer.

I think the  distinction that Appendix A  is really trying to get  at is the
following.

Consider a given interface of a given node.  Suppose that interface appears
either as the upstream interface or as a downstream interface of n trees.
Then let's say that interface contains n "branches".  The total amount of
state that PIM needs to maintain is roughly proportional to the sum, over
all the node's interfaces, of the number of branches.

Suppose that in a given PE, for a given VPN, there are i VRF interfaces, and
that the average number of branches per interface is m.  For each VPN, there
is also a  single PE-PE interface (a multiaccess interface  to which all the
PEs of the  VPN belong).  Let's suppose that this  link contains k branches,
where j of  these branches belong to trees  that the PE is a  member of, and
k-j belong to other trees in the VPN.

If PE-PE PIM is used, the amount  of state is proportional to i*m+k.  If BGP
C-multicast routing is  used, the amount of state  is proportional to i*m+j.
The incremental  state needed by PIM  is thus proportional to  k-j.  For the
state savings  to be significant,  one must assume  not only that k  is much
larger than j, but also that k-j is significant when compared to i*m.
Basically this means that you need  to assume that each PE has receivers for
only a  small number  of the  VPN's trees, or  else that  there are  a small
number  of VRF interfaces  per VPN.   Whether or  not these  assumptions are
accurate depends of  course on the set of  multicast applications being used
by the customers.
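
A small worked example of the comparison, using the symbols i, m, k and j
as defined above.  The numbers are hypothetical and chosen only to show how
the extra PIM state (k-j) compares to the PE-CE state (i*m); the function
name is likewise an assumption.

```python
def pe_state(i: int, m: int, k: int, j: int) -> dict:
    """Compare per-VPN state (in "branches") at one PE.

    i: number of VRF interfaces
    m: average number of branches per VRF interface
    k: branches on the PE-PE interface, over all trees in the VPN
    j: those of the k branches that belong to trees this PE is on
    """
    if not 0 <= j <= k:
        raise ValueError("need 0 <= j <= k")
    pim = i * m + k  # PE-PE PIM: state for every branch on the PE-PE link
    bgp = i * m + j  # BGP C-multicast: only trees this PE has joined
    extra = pim - bgp  # equals k - j
    fraction = extra / pim if pim else 0.0
    return {"pim": pim, "bgp": bgp, "extra": extra, "extra_fraction": fraction}


if __name__ == "__main__":
    # Hypothetical PE: 20 VRF interfaces, 50 branches each, 200 trees in
    # the VPN, of which this PE is on 150.
    print(pe_state(i=20, m=50, k=200, j=150))
    # pim = 1200, bgp = 1150, extra = 50
```

With these made-up values the incremental PIM state is 50 branches out of
1200, which illustrates why k-j has to be large relative to i*m before the
savings matter.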

There are also a couple of important factors which Appendix A has omitted:

- If you are using two different protocols to maintain each multicast state,
  then  the  total  amount  of  state is  effectively  doubled.   This  will
  generally result in  both a memory cost  and a CPU cost.  While  this is a
  "mere factor of  two," perhaps it deserves a mention  in any document that
  compares a control plane that uses two protocols (e.g.  PIM for PE-CE, BGP
  for PE-PE) to a control plane that uses only one.

- For sparse mode, which is the type of multicast most commonly found in the
  enterprises that buy  the VPN service, there is quite  a bit of additional
  messaging in BGP that has not been considered.

  When PE1 receives a C-multicast Source  Tree Join for a sparse mode group,
  it has  to generate a  Source Active  A-D Route.  It  needs to send  a BGP
  update for  this route to each  RR.  Each RR needs  to send it  to each of
  PE2, ..., PEn.  So we have more messaging which is O(# of PEs in VPN).
  Needless  to say,  PIM also  has sparse  mode overhead  which  hasn't been
  considered, but the point is that BGP overhead is not O(1).

- One might get  the impression that the BGP scheme  eliminates any need for
  the C-PIM state machine to maintain  states for the PMSIs that connect the
  PEs over the backbone.  This is not true; PEs must maintain Prune(S,G,rpt)
  for the PMSIs.  PEs receiving multicast  data over a PMSI must also do the
  RPF interface  check for  the arriving data,  considering the PMSI  as the
  "input interface".  The backbone does  not become transparent to the C-PIM
  interfaces just because PIM control packets do not flow over the PMSIs.
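
For the sparse mode item above, the Source Active A-D route messaging can
be counted with a minimal sketch: PE1 sends one update to each RR, and each
RR sends one update to each of the other PEs.  The function name and the
default of two RRs are assumptions for illustration.

```python
def source_active_ad_updates(num_pes: int, num_rrs: int = 2) -> int:
    """Count BGP updates needed to distribute one Source Active A-D route.

    num_pes counts all PEs in the VPN, including PE1, which originates
    the route.
    """
    from_pe1_to_rrs = num_rrs
    from_rrs_to_pes = num_rrs * max(num_pes - 1, 0)
    return from_pe1_to_rrs + from_rrs_to_pes


if __name__ == "__main__":
    for n in (2, 10, 1000):
        print(n, source_active_ad_updates(n))
```

The total works out to the number of RRs times the number of PEs, so it
grows linearly with the size of the VPN.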

Since  some of  the  conclusions of  the  draft depend  on  the analysis  in
Appendix  A,  I  think  those  conclusions  need to  be  removed  until  the
underlying analysis is corrected.

If and when  a corrected version of the analysis is  available, it should be
last called by the PIM WG, which  is where one might expect to find the most
expertise in PIM.