
Re: LC Comments on "considerations" scaling analysis



Hi Eric, working group,

Only *one* of your comments is backed by a precise quote (it is discussed below, and I happen to agree that a better formulation can be found, though it doesn't show the analysis to be incorrect).

But for the other (numerous) comments you give no quotes, which makes it hard to identify what exactly would be incorrect in Appendix A in your opinion. This leaves your claim that the analysis is incorrect very weakly founded.

These other comments are discussed below, but let me highlight the following:
* some comments are just plain wrong or misleading, possibly due to a misunderstanding of the hypotheses made and of what is counted (see below)
* some comments are not related to what is looked at in Appendix A (such as CE-PE scaling)
* some comments relate to more complex matters and are debatable, but as far as I can tell they challenge neither the conclusions in Appendix A nor the conclusions drawn elsewhere in the document

Reading between the lines of your comments, it seems to me that you may simply disagree with how the amount of processing is counted. You seem to prefer counting the "number of messages" rather than "how many times a message is processed", which is what we do in this analysis. Merely counting the number of messages would certainly be very favorable to PIM in the considered scenario, since PIM uses messages that are processed by all the PEs in the mVPN, but it would not reflect the load that multicast routing puts on the different pieces of equipment for the considered options. What is funny is that, for BGP S-A A-D routes, you seem to prefer counting "how many times a message is processed" rather than the "number of messages".

What we can do is insist on why we count "how many times a message is processed", to make sure that the reader doesn't miss that point. And we can make sure that in the different places we get the wording right to avoid misunderstandings (e.g. I've spotted a table legend which is unclear).

In any case, in all of your comments, I don't find anything solid that questions the conclusions or the correctness of the analysis in Appendix A, and none of the few changes suggested appear to me as changing the conclusions reused elsewhere in the document.

Please find below more detailed answers...



11/06/2009 17:52, Eric Rosen:
[..]
  
The analysis does not seem correct to me.

I'm not saying that there is not a single mistake; there might be. This document was written by humans.
But if you think you have found an error, please quote it precisely and explain.

Using PE-PE PIM over an MI-PMSI,  how many control messages over the MI-PMSI
are needed to enable PE2, ..., PEn  to join the (S,G) tree via PE1?  Answer:
1.  PE2,  say, sends a  PIM Join(S,G) over  the MI-PMSI.  PE1  receives this
message,  and as  a result  may send  a  PIM Join(S,G)  out one  of its  VRF
interfaces.  PE3,  ..., PEn also received  PE2's Join(S,G), and  as a result
they do not send Joins themselves.
  
Here you are just re-explaining to us what Join suppression is. I don't know whether that is useful.
Throughout your comments you insist a lot on how great Join suppression is. Join suppression is certainly a useful mechanism, and the analysis in Appendix A does not forget to take it into account. So I don't think that this comment calls the analysis into question.



Appendix A  says that the messaging  cost of additional PEs  after the first
joining an (S,G)  tree is the same as  the cost of the first  PE joining the
tree.  This is false.  The messaging  cost of each additional PE to join the
tree is 0.
  
No, when a PE Joins the tree it sends a Join(S,G).  There is no "join suppression" effect for the first Join (no suppression for the first join, just read p.55 of RFC4601). This is true for each PE. The first Join of each additional PE is received by all the PEs in the mVPN. Thus all the PEs of the mVPN process one Join for each PE joined to (S,G).

So, if you increase the number of PEs joined to (S,G) by one, the number of messages processed by each PE increases by *one*.

And it is the same if you count the number of Join(S,G) that have been sent: also increased by *one*, not by zero.

Sorry, no free beer. ;-)
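
The counting argument above can be sketched as a toy simulation. It assumes, as in this reply, that the first Join of each PE is never suppressed and that every Join sent on the MI-PMSI is counted as processed once by every PE in the mVPN. The function name and the PE counts are hypothetical, and Join refreshes are ignored.

```python
def simulate_first_joins(num_mvpn_pes, num_joined_pes):
    """Toy model: each joining PE sends one unsuppressed first Join(S,G).

    Assumption (order-of-magnitude accounting): every Join sent on the
    MI-PMSI is counted as processed once by every PE in the mVPN.
    Returns (joins_sent, per_pe_processed_counts).
    """
    processed = [0] * num_mvpn_pes
    joins_sent = 0
    for _ in range(num_joined_pes):
        joins_sent += 1                      # first Join is always sent
        for pe in range(num_mvpn_pes):
            processed[pe] += 1               # every PE parses this Join
    return joins_sent, processed


# Hypothetical mVPN of 10 PEs: compare 3 joined PEs with 4 joined PEs.
sent_3, processed_3 = simulate_first_joins(10, 3)
sent_4, processed_4 = simulate_first_joins(10, 4)
print("extra Joins sent:", sent_4 - sent_3)
print("extra Joins processed per PE:", processed_4[0] - processed_3[0])
```

Under these assumptions, adding one joined PE raises both the number of Joins sent and the per-PE processing count by one, which is the point made above.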




Appendix A, noting the fact that the first Join must be received by every PE
in the  VPN, states that the message  processing overhead is O(#  PEs in the
VPN). 
  That is  not really a meaningful measure  of anything.  The messaging
itself is O(1), and 
"messaging" is vague: are you talking about "how many times a message is sent" or "how many times a message is processed"?

You possibly challenge the fact that we count the total number of messages processed on all equipment of a kind, rather than merely looking at one piece of equipment. If so, refer to my comment on this in the introduction of this mail.

there is no router whose processing increases as O(# PEs
in the  VPN). 
No, and we don't assert this, AFAIK.
What is true is that the total number of messages processed by each PE in the mVPN increases as O(#R_PEs), the number of PEs that join the considered stream.


 In each  PE router, the  message processing per tree  per VPN
that is due to PE-PE interactions is O(1).
  

That statement is bogus. For each PE that joins a stream and later leaves it, the first Join and the Prune are always sent, and both are processed by each PE in the mVPN. So each PE obviously processes 2 messages for each PE that joins and leaves the stream (ignoring refreshes).

The amount of processing depends on the number of PEs joined to a stream and is not in O(1).





It would be more illuminating to take the PE-CE interactions into account as
well.   In PIM,  whether a  given interface  is p2p  or multiaccess,  only a
single  Join(S,G)  message is  needed  to enable  all  the  PEs (except  the
upstream PE,  of course)  on that interface  to join  the (S,G) tree.   At a
given node, the PIM messaging  overhead per tree is actually proportional to
the number of interfaces, not  to the number of PIM adjacencies.  (Excluding
Hellos, of course, just as Appendix  A does.)  That's why the PE-PE overhead
per VPN tends to be dwarfed by the PE-CE overhead; for a given VPN, a PE may
have lots of  VRF interfaces, but it only has one  PE-PE interface.  In many
cases, the PE-PE overhead is just in the noise.  
This statement about PE-CE is not related to anything in Appendix A, AFAICT.

(and again, you do not take into account that the first Join is always sent (RFC4601) and that a Prune is always sent by a PE that leaves the stream; thus the amount of processing is certainly not independent of the number of neighbors on an interface, as soon as they join a stream.)



Anyway, let's look  at the case where BGP C-multicast  routing is used.  Now
each PE  joining an (S,G) tree must  send a C-multicast Source  Tree Join to
each  RR.  If n-1  PEs want  to join  the tree  (with the  nth PE  being the
upstream PE),  the number of C-multicast  Source Tree Joins sent  is n times
the number of RRs (most likely 2*n).  In PIM, this would have been done with
only  one message;  it  is BGP  that  makes the  number  of messages  needed
proportional to the number  of PEs in the VPN.  The reason  is that there is
no way in BGP to prevent each PE from issuing a Join.
  
You can choose to count the "number of messages" instead of the number of times "some message is processed by a node". But Appendix A counts the number of times "some message is processed by a node" (see my comment in the intro of this mail). You are not challenging the correctness of the analysis in Appendix A here.


PIM Join  Suppression is a big  savings when receivers  are widely dispersed
among the sites; if  each PE is attached to a site  with receivers, the need
to process someone  else's Join is a good tradeoff against  the need to send
your own Join.  If receivers are  concentrated in a few sites, this tradeoff
is not  so good.  But  many multicast applications  have the model of  a few
sources with a lot of widely dispersed receivers.
  
As said above, Join suppression effects are accounted for in the analysis.

Let's get to the point: how is the statement above supposed to challenge Appendix A?
(please quote the exact paragraph/line/counting that you disagree with)


Appendix A  is also incorrect about  the overhead related  to PIM Prune(S,G)
messages.
  

Given what I already read, I feel that you just can't get away with such a blunt affirmation.

The basic scenario considered, for what matters to Prunes, is that n PEs that had joined a stream now leave it, one by one.
- each joined PE, when it leaves the stream (because JoinDesired became false when the last CE left), sends a Prune
- each Prune is processed by each PE of the VPN
So, just considering the above, how many times did we have "a Prune is processed"?
Answer: #PEs x #R_PEs

The conclusions are simply drawn from such basic arithmetic. The numbers for the amount of processing for Prunes are nothing surprising, and are (in order of magnitude) similar to the numbers for Joins.




     By default when PIM LAN procedures are used, when a PE Prunes
     itself from a multicast tree, all other PEs check their own state
     to known if they are on the tree, in which case they send a PIM
     Join message to override the Prune.  The "did the last receiver
     leave?" question is thus implicitly replied to by all PE routers,
     for each PIM Prune message.

  We can see that answering the "last receiver leaves" question is a
  significant proportion of the work that the C-multicast routing
  building block has to make
    

I don't see  how this conclusion follows.  If a PE  has "Upstream Join (S,G)
state" for  the logical interface connecting  the PEs, then it  always has a
timer that  it uses to refresh  the Join.  If  the PE sees a  Join(S,G) from
another PE,  it restarts  the timer (causing  Join Suppression).  If  the PE
sees  a  Prune(S,G)  from  another  PE,  it sets  the  timer  to  a  shorter
pre-computed randomized  value.  If,  before that timer  expires, it  sees a
Join(S,G) from another PE, it resets the timer and sets it back to its usual
value.  
  
It takes only one message to do a Prune, and one message to override it.
  
True, but everyone knows this. Everyone also knows that this PIM Prune message is processed by all the PEs: they all have to parse it, look up whether they have corresponding state, and reset a timer. If none of the PEs has state, nobody sends a Join to override; this "non-response" is the result of the collective work of all PEs, which have each checked that they did not have matching state.

Only two routers did send a message, but every PE worked !

This is what Appendix A refers to by saying :
 
     The "did the last receiver leave?" question is thus *implicitly* replied to
     by all PE routers, for each PIM Prune message.
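
The "collective work" behind a single Prune can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name is hypothetical, the 2.5 second upper bound on the randomized override timer is an illustrative value, and refreshes and message loss are ignored.

```python
import random


def process_prune(all_pes, pes_with_state, pruning_pe, rng):
    """Toy model of one Prune(S,G) seen on the PE-PE multiaccess interface.

    Every other PE parses the Prune and looks up matching state. PEs that
    have state arm a randomized override timer (upper bound is an assumed,
    illustrative value). The PE whose timer fires first sends the
    overriding Join; the others then suppress their own.
    Returns (state_lookups, overriding_pe_or_None).
    """
    lookups = 0
    override_timers = {}
    for pe in all_pes:
        if pe == pruning_pe:
            continue
        lookups += 1                          # parse + state lookup
        if pe in pes_with_state:
            override_timers[pe] = rng.uniform(0.0, 2.5)
    if not override_timers:
        return lookups, None                  # collective "non-response"
    overrider = min(override_timers, key=override_timers.get)
    return lookups, overrider


rng = random.Random(0)
pes = list(range(10))                         # hypothetical 10-PE mVPN
print(process_prune(pes, {3, 7}, pruning_pe=0, rng=rng))
print(process_prune(pes, set(), pruning_pe=0, rng=rng))
```

In both calls at most two messages cross the wire (the Prune and possibly one overriding Join), yet all nine other PEs performed a lookup.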

> Appendix A  would lead one to believe  that a Prune(S,G) immediately
> causes lots of Join(S,G) messages to be sent; this is not true.  

We'll try and improve the sentence to avoid the misunderstanding, insisting on the collective work to produce a non-response in due time, letting the upstream node deduce that it can prune traffic.


In BGP, if PE2  decides to prune itself from the (S,G)  tree, it has to send
each of the RRs a message withdrawing its Source Tree Join C-multicast route
for (S,G).   Then the RR  has to use  its BGP decision process  to determine
whether there  is another  Source Tree Join  C-multicast route for  the same
(S,G).   Then the  RR has  to distribute  the latter  route.  The  number of
messages is no fewer.
  

What is counted in Appendix A is the amount of routing processing done by the routing equipment, not the number of messages. And in the example you take, the number of times "some message is processed by some equipment" is higher in the PIM case than in the BGP case. This comment does not expose an error in Appendix A.



I think the  distinction that Appendix A  is really trying to get  at is the
following.

Consider a given interface of  a given node.  Suppose that interface appears
either as  the upstream interface or  as a downstream interface  of n trees.
Then let's  say that interface contains  n "branches".  The  total amount of
state that  that PIM needs to  maintain is roughly proportional  to the sum,
over all the node's interfaces, of the number of branches.
  

(you omit the processing of prunes and first joins on each branch! )


Suppose that in a given PE, for a given VPN, there are i VRF interfaces, and
that the average number of branches per interface is m.  For each VPN, there
is also a  single PE-PE interface (a multiaccess interface  to which all the
PEs of the  VPN belong).  Let's suppose that this  link contains k branches,
where j of  these branches belong to trees  that the PE is a  member of, and
k-j belong to other trees in the VPN.

If PE-PE PIM is used, the amount  of state is proportional to i*m+k.  If BGP
C-multicast routing is  used, the amount of state  is proportional to i*m+j.
The incremental  state needed by PIM  is thus proportional to  k-j.  For the
state savings  to be significant,  one must assume  not only that k  is much
larger  than j,  but also  that k-j  is significiant  when compared  to i*m.
Basically this means that you need  to assume that each PE has receivers for
only a  small number  of the  VPN's trees, or  else that  there are  a small
number  of VRF interfaces  per VPN.   Whether or  not these  assumptions are
accurate depends of  course on the set of  multicast applications being used
by the customers.
  

I don't get your point and I don't see why with PE-PE PIM "the amount of state is proportional to i*m+k", since a PIM PE has to maintain state only for streams it is joined to, thus rather i*m+j (it has to process messages for joins for streams it is not joined to, but no state to maintain, afaik).

But anyway, the conclusion on state maintenance in Appendix A is that, in order of magnitude, the amount of state is the same for all approaches. This does not go against the conclusion above, right?
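
To make the two readings of the state formulas concrete, here is a small sketch. The symbols i, m, k and j are those of the quoted text; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def pim_state_as_quoted(i, m, k):
    """PE-PE PIM state as claimed in the quoted comment: i*m + k."""
    return i * m + k


def pim_state_as_in_reply(i, m, j):
    """PE-PE PIM state as argued in this reply: state is kept only for
    the j trees the PE has joined, hence i*m + j."""
    return i * m + j


def bgp_state(i, m, j):
    """BGP C-multicast state, per the quoted comment: i*m + j."""
    return i * m + j


# Hypothetical values: 5 VRF interfaces, 20 branches each on average,
# 50 branches on the PE-PE interface, 10 of them on trees this PE joined.
i, m, k, j = 5, 20, 50, 10
print("PIM (quoted comment):", pim_state_as_quoted(i, m, k))
print("PIM (this reply):    ", pim_state_as_in_reply(i, m, j))
print("BGP:                 ", bgp_state(i, m, j))
print("k - j versus i*m:    ", k - j, "vs", i * m)
```

Either way, all three quantities are dominated by the i*m term unless k-j is large compared to i*m, which is consistent with an "order of magnitude is the same" conclusion.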

(please be more explicit on what in Appendix A you challenge as "incorrect")

There are also a couple of important factors which Appendix A has omitted:

- If you are using two different protocols to maintain each multicast state,
  then  the  total  amount  of  state is  effectively  doubled.   This  will
  generally result in  both a memory cost  and a CPU cost.  While  this is a
  "mere factor of  two," perhaps it deserves a mention  in any document that
  compares a control plane that uses two protocols (e.g.  PIM for PE-CE, BGP
  for PE-PE) to a control plane that uses only one.
 
This was pointed out by Maria and already fixed in -03.
(state maintenance in A.3, multiplied by 2; it is a linear factor not changing the conclusions of Appendix A in O(x))


- For sparse mode, which is the type of multicast most commonly found in the
  enterprises that buy  the VPN service, there is quite  a bit of additional
  messaging in BGP that has not been considered.

  
> When PE1 receives a C-multicast Source  Tree Join for a sparse mode
> group,it has  to generate a  Source Active  A-D Route.  It  needs to
> send  a BGP update for  this route to each  RR.  Each RR needs  to
> send it  to each of PE2, ..., PE3.   So we have more messaging which
> is O( # of  PEs in VPN). Needless  to say,  PIM also  has sparse
> mode overhead  which  hasn't been considered, but the point is that
> BGP overhead is not O(1).


A few comments:
- let me remind you that, because the PIM and BGP procedures are significantly different in a non-SSM scenario, the comparison for that scenario is complex and simply wasn't done; the SPT/SSM part is common, and seems to me enough to guide the comparison
- let me highlight that, this time, you choose to count the number of times a message is processed, not how many messages are sent; you seem to change what you like to count depending on what you want to show :)
- you highlight above a scenario with a total amount of processing across all nodes in O(#mVPN_PE) and say that "the BGP overhead is not O(1)": this is correct, but not significantly different from what we have in the base scenario of Appendix A, for which the total amount of processing across all nodes is O(#R_PE) for BGP and O(#mVPN_PE x #R_PE) for PIM

So, while it is conceivable to complement the analysis with a non-SSM scenario along the lines of the SSM scenario already considered, we have not yet seen a compelling argument to do so, since the change does not seem to fundamentally affect the comparison between PIM and BGP.
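
The orders of magnitude quoted in the comments above can be sketched numerically. This is an illustration only: the function names are hypothetical, and the per-Join constant used for BGP is an assumed placeholder, not a figure taken from Appendix A. Only the growth rates matter here.

```python
def pim_total_processing(num_mvpn_pes, num_joined_pes):
    """O(#mVPN_PE x #R_PE): each joined PE sends one first Join, and each
    Join is counted as processed by every PE in the mVPN."""
    return num_mvpn_pes * num_joined_pes


def bgp_total_processing(num_joined_pes, per_join_cost=3):
    """O(#R_PE): each C-multicast route is touched by a bounded number of
    nodes. per_join_cost is a hypothetical constant for illustration."""
    return per_join_cost * num_joined_pes


# Hypothetical mVPN sizes, with half of the PEs joined to the stream.
for n in (10, 100, 1000):
    r = n // 2
    print(n, r, pim_total_processing(n, r), bgp_total_processing(r))
```

With these assumptions the PIM total grows with the product of the two PE counts while the BGP total grows linearly in the number of joined PEs, whatever the exact constant.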


Now, there is another thing briefly looked at in the very end of Appendix A: the case of a more dynamic situation where PEs join and leave many times. In the non-SSM case, as you mention, BGP would also have to produce the S-A A-D routes, and they would have to be processed by all PEs. But when a given downstream PE joins the stream, the upstream PE advertises a new S-A A-D route only if no PE was already joined to the stream. The impact on BGP thus depends a lot on the Join/Prune dynamics... As you say, the BGP overhead in total amount of processing is not O(1) in that case, but depends on the number of PEs. But this is true for PIM too, except that for PIM we don't need to make strong hypotheses on the dynamics to know the number of times a message is processed (in the dynamic case, PIM loses the gains of Join suppression and pays the full price of having each message processed by every PE in the mVPN).

So well, if you can explain that such a case would expose conclusions impacting what is in the document today, then it could be worth including.


- One might get  the impression that the BGP scheme  eliminates any need for
  the C-PIM state machine to maintain  states for the PMSIs that connect the
  PEs over the backbone.  This is not true; 
We don't state anything that contradicts the above, AFAIK.
Taking the exact amount/volume of state information into account would be doable, but as far as I can tell it would not change the orders of magnitude.

  PEs must maintain Prune(S,G,rpt)
  for the PMSIs.  PEs receiving multicast  data over a PMSI must also do the
  RPF interface  check for  the arriving data,  considering the PMSI  as the
  "input interface".  The backbone does  not become transparent to the C-PIM
  interfaces just because PIM control packets do not flow over the PMSIs.
  
The above is not related to the routing/processing load, hence not to Appendix A.

Since  some of  the  conclusions of  the  draft depend  on  the analysis  in
Appendix  A,  I  think  those  conclusions  need to  be  removed  until  the
underlying analysis is corrected.
  

Your comments will be useful to improve some points in  Appendix A (thank you), but nothing that changes the conclusions significantly enough to remove stuff elsewhere in the document. Or if you really have such a claim, I think that to take a decision we would need you to make it more explicit what you would remove.

If and when  a corrected version of the analysis is  available, it should be
last called by the PIM WG, which  is where one might expect to find the most
expertise in PIM.  
And a last call by IDR too for the BGP part ?

Let's be serious: first, let's see whether a "correction" is needed, and defer this question until you actually show how the analysis would be incorrect!

Cheers,

-Thomas