Benchmarking Methodology Working Group (BMWG)
Thursday, July 23, 2015, 1300-1500 CEDT, Afternoon Session I, Karlin III
OPS bmwg
Remote Participation:
http://www.ietf.org/meeting/93/index.html
http://www.ietf.org/meeting/93/remote-participation.html
Minute Takers: Marius Georgescu, Jacob Rapp
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

0. Agenda Bashing
No agenda bashing.

1a. New Charter and Milestones (Chairs)
No charter update.

1b. WG Status (Chairs)
No comments or questions.

2. Data Center Benchmarking Proposal
Presenter: Jacob Rapp
https://datatracker.ietf.org/doc/draft-ietf-bmwg-dcbench-terminology/
https://datatracker.ietf.org/doc/draft-ietf-bmwg-dcbench-methodology/

Scott Bradner: What do you say about how to report the repeatability? Is there something in there to say report it as error bars, or something else? Tell us a consistent way to report on it.
Marius Georgescu: May I suggest an error from the average, the standard error or the standard deviation?
Al Morton: Any other comments from people who have read the draft?
Sarah Banks: I think the draft has progressed nicely. I appreciate the effort, thank you.

3. IPv6 Neighbor Discovery
Presenter: Ron Bonica
http://tools.ietf.org/html/draft-ietf-bmwg-ipv6-nd

Scott: Wouldn't you also want to record when you get more Neighbour Discovery (ND) packets?
Ron: You could record when you start sending NDs from the device under test (DUT).
Scott: It's not when you start, but when the DUT starts.
Ron: So, you might want to record when the DUT sends its Neighbour Solicitations (NS). That's an interesting thing to record.
Scott: That might mean that with some experience you'll be able to shorten the test.
Scott: That's a neat design. One thing I worry about, something I've observed in the past, is that the cache for something like ND is part of general memory, rather than something dedicated to a particular task, and therefore the amount of storage varies depending on what else is running, so your memory is not going to be a constant.
Ron: Actually, when the cache is general memory you tend to get that first behaviour, and that's a bad thing.
Scott: There's a nuance there. It could be a hang, it could be a block, not a crash. So it looks like a crash, but if you back off on traffic, it starts up again. I've seen both.
Ron: And actually there are two flavours of crashes: either the whole device crashes or just processes crash.
Ramki Krishnan: Just a related question. There are other solutions around rate limiting the ND messages. I don't know if it's standardized.
Ron: It's true, you could rate limit the ND messages. Here we intentionally do not. We are creating messages at a rate so slow that they shouldn't be rate limited. I want to make the cache get too big.
Ramki: If there are such commercial solutions, then maybe that can be a consideration. If there are rate limiters, what will happen? Is it still so bad, does it still cause exhaustion?
Ron: Here is a possibility I have not thought of. Let's say n is very, very large. And let's say you are going to send an NS for each stream at one half the stale entry time. It might be that all your NSs are not getting through because you're rate limiting them. We need to figure out if there's any rate limiting going on. That might be hard to do, because systems do not report that.
Ramki: That's correct.
Ron: I'll put it in the draft, and somehow we have to figure out if ND messages are getting rate limited.
Ramki: That would be perfect.
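For illustration only: a minimal sketch of the pacing arithmetic behind Ron's concern, using hypothetical values for the number of streams, the stale-entry timer, and the DUT's ND rate limit (none of these values come from the draft or the discussion):

```python
# Hypothetical back-of-the-envelope check: would per-stream NS refreshes
# exceed an assumed ND rate limit on the DUT? All numbers are assumptions
# chosen only to illustrate the arithmetic discussed above.

def ns_refresh_rate(num_streams: int, stale_entry_time_s: float) -> float:
    """Aggregate NS rate if each stream is refreshed at half the stale-entry time."""
    refresh_interval_s = stale_entry_time_s / 2.0
    return num_streams / refresh_interval_s  # NS messages per second

num_streams = 50_000           # hypothetical "n is very, very large"
stale_entry_time_s = 30.0      # hypothetical stale-entry timer
assumed_rate_limit_pps = 1000  # hypothetical DUT ND rate limit

rate = ns_refresh_rate(num_streams, stale_entry_time_s)
print(f"Aggregate NS rate: {rate:.0f} pps")
if rate > assumed_rate_limit_pps:
    print("Some NS messages may be silently rate limited; results would be suspect.")
```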
Scott: The other thing that could come in is prioritization issues. For example, if data is prioritized over control traffic. That sometimes happens.
Ron: That's why we said that the interfaces have to be so big that prioritization doesn't make a difference.
Scott: If the number of streams gets to be very large, the actual data rate can be pretty high.
Ron: As long as we make the interfaces between the tester and the DUT large enough, prioritization would not make a difference, since congestion doesn't happen.
Scott: It depends on the size of the buffer.
Scott: We're getting into religion here. This is a functionality test, not a performance test.
Ron: You have a point. We might want to drop this test altogether?
Scott: I think it's a tremendously useful piece of information, but I don't think it fits in this working group.
Ron: Everybody agreed?
Scott: The AD is the one that needs to know.
Ron: I'll cover for the AD and agree with you that we should drop the test.

4. VNF and Infrastructure Benchmarking Considerations
Presenter: Al Morton
https://datatracker.ietf.org/doc/draft-ietf-bmwg-virtual-net/

Ramki: Should we also include verification of configuration across all the elements?
Al: Can you say more about that?
Ramki: You have a policy at the top level which is pushed towards the end device. So, is it actually really programmed in the router? It could be verification of the software component or going down to the hardware component, and this recursively applies to each level of the hierarchy.
Al: I think that's more of a functional test. Would you tend to agree? Our charter is to keep focus on performance characteristics.
Ramki: It is indeed verifying the function, but the verification aspect itself can impact performance. This is one of the aspects we are trying to address in the Network Function Virtualization Research Group (NFVRG): how to make it performance efficient, so it doesn't affect normal operations.
Al: Let's think about how it can be included. There's a possibility there. If it impacts performance, there is a ray of hope here. Maybe include the performance cost of verification.
Maryam: About the footprint of the Virtual Network Function (VNF) in its deployment, is the CPU utilization or memory utilization tracked in the metrics?
Al: How do we track the footprint of each VNF in its deployment? I want to think about that some more.
Scott: I guess this goes back to the discussion of whether this is a black box (you're looking at bits flowing in and out) or you've got some hook inside. Historically, this group has been more on the bits-flowing-in-and-out side, rather than having a management interface telling you how it's feeling. Not unimportant, but a secondary effect.
Al: That's right, Scott. We've considered these metrics in the OPNFV work, and what we agreed upon was collecting them as auxiliary metrics, to help better understand the performance measurements. So I think what Maryam is suggesting will be in that category.
[DidNotGetTheName from Ericsson]: I believe the DUT will behave like a different VNF when given a different CPU, memory or storage assignment. So maybe we have to characterize this. I think this is very interesting if you consider the big picture, from the orchestration point of view. The consumer or customer of such benchmarking can be a high-level entity. You can regard it as a black box, but you can consider different setups under which you test.
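For illustration only: a minimal sketch of collecting the kind of auxiliary CPU/memory metrics Al and Maryam discuss, alongside a black-box trial. It assumes a Linux host with the third-party psutil library available; the sampling interval and durations are arbitrary:

```python
import time
import psutil  # assumed available; any resource-sampling tool would do

def sample_host_footprint(duration_s: float, interval_s: float = 1.0):
    """Collect CPU and memory utilization samples while a trial is running."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append({
            "cpu_percent": psutil.cpu_percent(interval=interval_s),
            "mem_percent": psutil.virtual_memory().percent,
        })
    return samples

# Hypothetical usage: run this sampling in parallel with the traffic trial,
# then report the footprint next to (not instead of) the black-box results.
footprint = sample_host_footprint(duration_s=10.0)
print("avg CPU %:", sum(s["cpu_percent"] for s in footprint) / len(footprint))
```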
Sarah: One of the biggest questions we get asked is whether that changes during creation or during tear-down, because I need to know if the resources I allocated when I set it up were adequate.
Scott: You find out if they're adequate because you see what happens. That's exactly the right point. You configure 10 IP filters and see what happens, or you configure 12 IP filters and see what happens. So it's from the configuration side, not the observation or reporting side.

5. IPv6 Transition Benchmarking
Presenter: Marius Georgescu
http://tools.ietf.org/html/draft-georgescu-bmwg-ipv6-tran-tech-benchmarking-01

Al: In all the diagrams I've looked at, there's a CE and a PE on the customer edge. And that's why just talking about this PE is ambiguous.
Marius: I think that a terminology section would clarify things. The diagram itself is an oversimplification that helps classify the transition technologies.
Scott: I think we have talked about it before. What happens if there is an imbalance in performance between translating in one direction or the other?
Marius: I added some recommendations for translation-based devices to use the first test setup to measure any performance difference between the two translating directions.
Scott: You should stress that in this configuration you can't tell exactly what caused a performance issue, something generic in the box or the translation in one direction or the other. So, both testing methodologies should be supported.
Marius: I definitely agree, and maybe I didn't make it clear enough in the draft. But for encapsulation I guess you would agree that would not be possible. I am not sure if that can be done from a black-box perspective.
Scott: You may just not be able to tell whether it's encapsulation or decapsulation which is the limiting factor. That should be mentioned.
Al: You cannot tell where the problem is, where the bottleneck is.
Marius: I guess that's one of the limitations of black-box testing.
Al: Just looking at the performance of the implementation beyond the traditional MTU size, it seems to be increasing a bit.
Marius: I would see it as normal or expected. But we have to look at it further; a white/grey-box analysis should help.
Al: Expressing the higher percentile for PDV is recommended. It's like a pseudo-range: you can see the variance of the results but not be bothered by outliers. The two-sided IPDV is a little more difficult.
Marius: I agree. I hope there are some opinions in the audience. The proposal so far is expressing the minimum, average and maximum. Histograms, like Jacob was saying, would be an alternative. But considering also the scalability scores, I think summarization is necessary.
Kostas Pentikousis: Clarification question, why average and not median?
Marius: That is a good question. I think the probability distribution can be an indication of which of the two to use.
Kostas: Median has an advantage. It's a real number, something that was actually recorded. Average reflects no real sample value. And if you have an average close to max, I'm not sure about the information it gives you, compared to a median close to max.
Sarah: With a big sample set, I don't think the average is that useless. I think it comes down to how many data points you have.
Kostas: I think the median is more conservative.
Marius: Statisticians have decided that the question can be answered by looking at the probability distribution. If the distribution is normal, the average and the median are the same. But then again, it's rare that you get a perfectly normal distribution.
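For illustration only: a small worked example of the summarization trade-off being discussed, using made-up delay-variation samples (the values are hypothetical, chosen to show how the mean, median and a high percentile diverge on a skewed sample):

```python
import statistics

# Hypothetical PDV samples in milliseconds: mostly small, with a few large
# outliers, i.e. the kind of skewed sample the discussion is about.
pdv_ms = [0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.5, 0.3, 6.0, 9.5]

mean = statistics.mean(pdv_ms)      # pulled upward by the outliers
median = statistics.median(pdv_ms)  # close to an actually observed "typical" value
p90 = sorted(pdv_ms)[int(0.9 * (len(pdv_ms) - 1))]  # crude 90th percentile

print(f"mean   = {mean:.2f} ms")
print(f"median = {median:.2f} ms")
print(f"p90    = {p90:.2f} ms")
```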
Al: Let's defer the statistical discussion to the list.
Marius: We could actually add both. But that would overcomplicate the performance report, and the score calculation for scalability.
Kostas: Just to clarify, I'm not saying you should add the median. I'm saying you should have one or the other and explain why you chose one over the other.
Marius: We might cite some of the testing we did to serve as the basis for choosing one over the other.
Scott: A description of how the dynamic routing should be done would be helpful. Unless you're using RIP, the setup can get complicated. Some minor wording about how to do that wouldn't be bad.
Marius: I'll look into how to improve that.
Kaname Nishizuka: I support this draft, and I think that adding DNS resolution text would make the draft more useful. You should consider that whether the DNS cache is within the PE or outside it can affect the performance measurement. Also, I think the PE term is a little bit confusing.
Marius: I agree, the PE terminology is confusing. I'll do something about that. Also, thank you very much for being proactive about the DNS resolution performance proposal.
Marius: How ready is the draft for adoption?
Al: How many people have read the draft? ... 5 people and me, that's pretty good. I think we can ask the group if there is interest to take this draft on as a working group item. Please hum for yes.
[some people humming]
Al: Please hum if you are opposed.
[no humming]
Al: That sounds like good support in the meeting. Obviously, this gets put on the list. I think that's a good step forward, Marius, thank you.
Marius: Thank you very much.

6. Benchmarking Methodology for SDN Controller Performance
Presenter: Sarah Banks
http://tools.ietf.org/html/draft-bhuvan-bmwg-of-controller-benchmarking-01

Scott: [unrelated to the draft presented] I looked at the document you mentioned that has been sitting in the RFC Editor queue for some time. There's a hanging reference that is mentioned in an aside, which could be moved to informational. The title in the text is also incorrect; it's the title of the previous MD5 document that's been superseded by the TCP authentication option.
Sarah: Thank you. I will work with the authors to strongly suggest informative instead of normative. It makes lots of sense.
Al: How many people have read the old version of the draft? ... It looks like 5 people. Should we adopt this draft as a working group draft? Please hum for yes.
[some people humming]
Al: Those opposed, please hum.
[no humming]
Al: There was a clear bias towards adoption. But we'll also take that one to the list.
Sarah: Thank you.

7. Benchmarking Virtual Switches in OPNFV
Presenter: Maryam Tahhan
http://tools.ietf.org/html/draft-vsperf-bmwg-vswitch-opnfv-00.txt

Scott: The ID name of the draft is not compliant. It should be draft--...
Al: It was really the project that was represented here, but I agree.
Scott: The IETF is people, not projects or companies.
Al: Point taken.
Scott: I think there's a bit of confusion between maximum forwarding rate and throughput for the 72-hour test. Normally we define throughput as the maximum forwarding rate without errors. That wouldn't be something to measure over 72 hours. It doesn't sound quite right. If the term "throughput" is used, then it should be the one BMWG is using, the one defined in RFC 2544.
Al: Maybe we should look at the RFC 2889 maximum forwarding rate and also consider loss.
Al: As authors, we're interested in feedback about the draft.
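For illustration only: a minimal sketch of the RFC 2544-style throughput search Scott refers to (the highest rate with zero loss), as distinct from a long-duration soak. The send_at_rate hook, trial duration and resolution are hypothetical placeholders, not anything defined in the draft:

```python
def rfc2544_throughput(max_rate_pps: float, send_at_rate, trial_s: float = 60.0,
                       resolution_pps: float = 1000.0) -> float:
    """Binary search for the highest rate with zero frame loss (RFC 2544 style).

    send_at_rate(rate_pps, trial_s) is a hypothetical traffic-generator hook
    that returns the number of frames lost during the trial.
    """
    lo, hi = 0.0, max_rate_pps
    while hi - lo > resolution_pps:
        mid = (lo + hi) / 2.0
        if send_at_rate(mid, trial_s) == 0:
            lo = mid   # no loss: try a higher rate
        else:
            hi = mid   # loss seen: back off
    return lo

# Hypothetical DUT model for demonstration only: drops frames above 800k pps.
fake_dut = lambda rate, trial_s: 0 if rate <= 800_000 else 1
print(rfc2544_throughput(1_000_000, fake_dut))

# By contrast, a 72-hour soak measures behaviour over time at a chosen rate,
# which is closer to a maximum-forwarding-rate or stability measurement than
# to the single-number throughput searched for above.
```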
Kostas: This is very timely work. I've just sent an e-mail with the two drafts to the Unify project, on behalf of which I'm here.
Maryam: Excellent. Thank you very much.
Marius: I think this is very important work, as virtualization represents the future in my opinion. I have one question: why 72 hours?
Maryam: A lot of manual testing and reading of test reports conducted by many companies suggested 72 hours as a baseline for testing a virtualized environment. Initially we thought 24 hours would be enough, but we needed to validate over a longer period of time. So for our soak tests the minimum is 72 hours. But there are other tests which are not soak tests.
Marius: I'm not sure I understand, is that one iteration? 72 hours would not be an incentive to start testing.
Maryam: We run multiple iterations to determine a rate and then we soak at that rate.
Al: Do folks want to see us continue this work here?
[many people nodding :) and someone saying yes]

8. Benchmarking Methodology for Virtualization Network Performance
Speaker: Rong Gu
https://tools.ietf.org/html/draft-huang-bmwg-virtual-network-performance-01

Jacob: I read through the draft. You seem to also mention VXLAN, SDN and some other things. I feel like there should be a draft that addresses only the vSwitch performance.
Rong: I understand. There is no content about VXLAN in the updated version.
Sarah: We heard from Maryam and Al today, and there is a draft coming in on the other side. I did see some notes about how these things play together or not. So, I would encourage the authors of the two drafts to communicate. If there is a redundancy it should be discussed. Can we get an update in Yokohama?
Al: I think the best way to think about this is that the drafts have different approaches. The OPNFV draft is referencing existing specifications, while Rong's work is preparing new procedures. I think the drafts can be complementary, but we still need to stay coordinated. We are using a physical tester (the rabbit ears) while Rong's team is investigating the possibility of using a virtual tester as well.
Maryam: We might end up referencing some of this work, because we have scenarios like physical-to-virtual and virtual-to-physical where we would need to measure the performance. But there's one scenario which I didn't see here today, virtual-to-virtual. That is something that maybe we should address, or you should address, or maybe collaborate on.
Sarah: Let's make sure the people in the working group know that the two drafts are not redundant, which both teams are clearly indicating.

- End of session -