Benchmarking Methodology Working Group (BMWG)
Thursday, July 23, 2015, 1300-1500 CEDT, Afternoon Session I, Karlin III
OPS bmwg
Remote Participation:
http://www.ietf.org/meeting/93/index.html
http://www.ietf.org/meeting/93/remote-participation.html
Minute Takers: Marius Georgescu, Jacob Rapp
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

0. Agenda Bashing
No agenda bashing.

1a. New Charter and Milestones (Chairs)
No charter update.

1b. WG Status (Chairs)
No comments or questions.

2. Data Center Benchmarking Proposal
Presenter: Jacob Rapp
https://datatracker.ietf.org/doc/draft-ietf-bmwg-dcbench-terminology/
https://datatracker.ietf.org/doc/draft-ietf-bmwg-dcbench-methodology/

Scott Bradner: What do you say about how to report the repeatability? Is there something in there to say report it as error bars, or something else? Tell us a consistent way to report on it.
Marius Georgescu: May I suggest an error from the average, the standard error or the standard deviation?
Al Morton: Any other comments from people who have read the draft?
Sarah Banks: I think the draft has progressed nicely. I appreciate the effort, thank you.

3. IPv6 Neighbor Discovery
Presenter: Ron Bonica
http://tools.ietf.org/html/draft-ietf-bmwg-ipv6-nd

Scott: Wouldn't you also want to record when you get more Neighbour Discovery (ND) packets?
Ron: You could record when you start sending NDs from the device under test (DUT).
Scott: It's not when you start, but when the DUT starts.
Ron: So, you might want to record when the DUT sends its Neighbour Solicitations (NS). That's an interesting thing to record.
Scott: That might mean that with some experience you'll be able to shorten the test.
Scott: That's a neat design. One thing I worry about, something I've observed in the past, is that the cache for something like ND is part of general memory, rather than something dedicated to a particular task, and therefore the amount of storage varies depending on what else is running, so your memory is not going to be a constant.
Ron: Actually, when the cache is general memory you tend to get that first behaviour, and that's a bad thing.
Scott: There's a nuance there. It could be a hang, it could be a block, not a crash. So it looks like a crash, but if you back off on traffic, it starts up again. I've seen both.
Ron: And actually there are two flavours of crashes: either the whole device crashes or just processes crash.
Ramki Krishnan: Just a related question. There are other solutions around rate limiting the ND messages. I don't know if it's standardized.
Ron: It's true, you could rate limit the ND messages. Here we intentionally do not. We are creating messages at a rate so slow that they shouldn't be rate limited. I want to make the cache get too big.
Ramki: If there are such commercial solutions, then maybe that can be a consideration. If there are rate limiters, what will happen? Is it still so bad, does it still cause exhaustion?
Ron: Here is a possibility I have not thought of. Let's say n is very, very large. And let's say you are going to send an NS for each stream at one half the stale entry time. It might be that all your NSs are not getting through because you're rate limiting them. We need to figure out if there's any rate limiting going on. That might be hard to do, because systems do not report that.
Ramki: That's correct.
Ron: I'll put it in the draft, and somehow we have to figure out if ND messages are getting rate limited.
Ramki: That would be perfect.
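For illustration only: a minimal sketch of the pacing arithmetic behind Ron's concern, using hypothetical values for the number of streams, the stale-entry timer, and the DUT's ND rate limit (none of these values come from the draft or the discussion):

```python
# Hypothetical back-of-the-envelope check: would per-stream NS refreshes
# exceed an assumed ND rate limit on the DUT? All numbers are assumptions
# chosen only to illustrate the arithmetic discussed above.

def ns_refresh_rate(num_streams: int, stale_entry_time_s: float) -> float:
    """Aggregate NS rate if each stream is refreshed at half the stale-entry time."""
    refresh_interval_s = stale_entry_time_s / 2.0
    return num_streams / refresh_interval_s  # NS messages per second

num_streams = 50_000           # hypothetical "n is very, very large"
stale_entry_time_s = 30.0      # hypothetical stale-entry timer
assumed_rate_limit_pps = 1000  # hypothetical DUT ND rate limit

rate = ns_refresh_rate(num_streams, stale_entry_time_s)
print(f"Aggregate NS rate: {rate:.0f} pps")
if rate > assumed_rate_limit_pps:
    print("Some NS messages may be silently rate limited; results would be suspect.")
```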
Scott: The other thing that could come in is prioritization issues. For example, if data is prioritized over control traffic. That sometimes happens.
Ron: That's why we said that the interfaces have to be so big that prioritization doesn't make a difference.
Scott: If the number of streams gets to be very large, the actual data rate can be pretty high.
Ron: As long as we make the interfaces between the tester and the DUT large enough, prioritization would not make a difference, since congestion doesn't happen.
Scott: It depends on the size of the buffer.
Scott: We're getting into religion here. This is a functionality test, not a performance test.
Ron: You have a point. We might want to drop this test altogether?
Scott: I think it's a tremendously useful piece of information, but I don't think it fits in this working group.
Ron: Everybody agreed?
Scott: The AD is the one that needs to know.
Ron: I'll cover for the AD and agree with you that we should drop the test.

4. VNF and Infrastructure Benchmarking Considerations
Presenter: Al Morton
https://datatracker.ietf.org/doc/draft-ietf-bmwg-virtual-net/

Ramki: Should we also include verification of configuration across all the elements?
Al: Can you say more about that?
Ramki: You have a policy at the top level which is pushed towards the end device. So, is it actually really programmed in the router? It could be verification of the software component or going down to the hardware component, and this recursively applies to each level of the hierarchy.
Al: I think that's more of a functional test. Would you tend to agree? Our charter is to keep focus on performance characteristics.
Ramki: It is indeed verifying the function, but the verification aspect itself can impact performance. This is one of the aspects we are trying to address in the Network Function Virtualization Research Group (NFVRG): how to make it performance efficient, so it doesn't affect normal operations.
Al: Let's think about how it can be included. There's a possibility there. If it impacts performance, there is a ray of hope here. Maybe include the performance cost of verification.
Maryam: About the footprint of the Virtual Network Function (VNF) in its deployment, is the CPU utilization or memory utilization tracked in the metrics?
Al: How do we track the footprint of each VNF in its deployment? I want to think about that some more.
Scott: I guess this goes back to the discussion of whether this is a black box (you're looking at bits flowing in and out) or you've got some hook inside. Historically, this group has been more on the bits-flowing-in-and-out side, rather than having a management interface telling you how it's feeling. Not unimportant, but a secondary effect.
Al: That's right, Scott. We've considered these metrics in the OPNFV work, and what we agreed upon was collecting them as auxiliary metrics, to help better understand the performance measurements. So I think what Maryam is suggesting will be in that category.
[DidNotGetTheName from Ericsson]: I believe the DUT will behave like a different VNF when given a different CPU, memory or storage assignment. So maybe we have to characterize this. I think this is very interesting if you consider the big picture, from the orchestration point of view. The consumer or customer of such benchmarking can be a high-level entity. You can regard it as a black box, but you can consider different setups under which you test.
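For illustration only: a minimal sketch of collecting the kind of auxiliary CPU/memory metrics Al and Maryam discuss, alongside a black-box trial. It assumes a Linux host with the third-party psutil library available; the sampling interval and durations are arbitrary:

```python
import time
import psutil  # assumed available; any resource-sampling tool would do

def sample_host_footprint(duration_s: float, interval_s: float = 1.0):
    """Collect CPU and memory utilization samples while a trial is running."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append({
            "cpu_percent": psutil.cpu_percent(interval=interval_s),
            "mem_percent": psutil.virtual_memory().percent,
        })
    return samples

# Hypothetical usage: run this sampling in parallel with the traffic trial,
# then report the footprint next to (not instead of) the black-box results.
footprint = sample_host_footprint(duration_s=10.0)
print("avg CPU %:", sum(s["cpu_percent"] for s in footprint) / len(footprint))
```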
Sarah: One of the biggest questions we get asked is whether that changes during creation or during tear-down, because I need to know if the resources I allocated when I set it up were adequate.
Scott: You find out if they're adequate because you see what happens. That's exactly the right point. You configure 10 IP filters and see what happens, or you configure 12 IP filters and see what happens. So it's from the configuration side, not the observation or reporting side.

5. IPv6 Transition Benchmarking
Presenter: Marius Georgescu
http://tools.ietf.org/html/draft-georgescu-bmwg-ipv6-tran-tech-benchmarking-01

Al: In all the diagrams I've looked at, there's a CE and a PE on the customer edge. And that's why just talking about this PE is ambiguous.
Marius: I think that a terminology section would clarify things. The diagram itself is an oversimplification that helps classify the transition technologies.
Scott: I think we have talked about it before. What happens if there is an imbalance in performance between translating in one direction or the other?
Marius: I added some recommendations for translation-based devices to use the first test setup to measure any performance difference between the two translating directions.
Scott: You should stress that in this configuration you can't tell exactly what caused a performance issue, something generic in the box or the translation in one direction or the other. So, both testing methodologies should be supported.
Marius: I definitely agree, and maybe I didn't make it clear enough in the draft. But for encapsulation I guess you would agree that would not be possible. I am not sure if that can be done from a black-box perspective.
Scott: You may just not be able to tell whether it's encapsulation or decapsulation which is the limiting factor. That should be mentioned.
Al: You cannot tell where the problem is, where the bottleneck is.
Marius: I guess that's one of the limitations of black-box testing.
Al: Just looking at the performance of the implementation beyond the traditional MTU size, it seems to be increasing a bit.
Marius: I would see it as normal or expected. But we have to look at it further; a white/grey-box analysis should help.
Al: Expressing the higher percentile for PDV is recommended. It's like a pseudo-range: you can see the variance of the results but not be bothered by outliers. The two-sided IPDV is a little more difficult.
Marius: I agree. I hope there are some opinions in the audience. The proposal so far is expressing the minimum, average and maximum. Histograms, like Jacob was saying, would be an alternative. But considering also the scalability scores, I think summarization is necessary.
Kostas Pentikousis: Clarification question, why average and not median?
Marius: That is a good question. I think the probability distribution can be an indication of which of the two to use.
Kostas: Median has an advantage. It's a real number, something that was actually recorded. Average reflects no real sample value. And if you have an average close to max, I'm not sure about the information it gives you, compared to a median close to max.
Sarah: With a big sample set, I don't think the average is that useless. I think it comes down to how many data points you have.
Kostas: I think the median is more conservative.
Marius: Statisticians have decided that the question can be answered by looking at the probability distribution. If the distribution is normal, the average and the median are the same. But then again, it's rare that you get a perfectly normal distribution.
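For illustration only: a small worked example of the summarization trade-off being discussed, using made-up delay-variation samples (the values are hypothetical, chosen to show how the mean, median and a high percentile diverge on a skewed sample):

```python
import statistics

# Hypothetical PDV samples in milliseconds: mostly small, with a few large
# outliers, i.e. the kind of skewed sample the discussion is about.
pdv_ms = [0.2, 0.3, 0.2, 0.4, 0.3, 0.2, 0.5, 0.3, 6.0, 9.5]

mean = statistics.mean(pdv_ms)      # pulled upward by the outliers
median = statistics.median(pdv_ms)  # close to an actually observed "typical" value
p90 = sorted(pdv_ms)[int(0.9 * (len(pdv_ms) - 1))]  # crude 90th percentile

print(f"mean   = {mean:.2f} ms")
print(f"median = {median:.2f} ms")
print(f"p90    = {p90:.2f} ms")
```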
Al: Let's defer the statistical discussion to the list.
Marius: We could actually add both. But that would overcomplicate the performance report, and the score calculation for scalability.
Kostas: Just to clarify, I'm not saying you should add the median. I'm saying you should have one or the other and explain why you chose one over the other.
Marius: We might cite some of the testing we did to serve as the basis for choosing one over the other.
Scott: A description of how the dynamic routing should be done would be helpful. Unless you're using RIP, the setup can get complicated. Some minor wording about how to do that wouldn't be bad.
Marius: I'll look into how to improve that.
Kaname Nishizuka: I support this draft, and I think that adding DNS resolution text would make the draft more useful. You should consider that whether the DNS cache is within the PE or outside it can affect the performance measurement. Also, I think the PE term is a little bit confusing.
Marius: I agree, the PE terminology is confusing. I'll do something about that. Also, thank you very much for being proactive about the DNS resolution performance proposal.
Marius: How ready is the draft for adoption?
Al: How many people have read the draft? ... 5 people and me, that's pretty good. I think we can ask the group if there is interest to take this draft on as a working group item. Please hum for yes.
[some people humming]
Al: Please hum if you are opposed.
[no humming]
Al: That sounds like good support in the meeting. Obviously, this gets put on the list. I think that's a good step forward, Marius, thank you.
Marius: Thank you very much.

6. Benchmarking Methodology for SDN Controller Performance
Presenter: Sarah Banks
http://tools.ietf.org/html/draft-bhuvan-bmwg-of-controller-benchmarking-01

Scott: [unrelated to the draft presented] I looked at the document you mentioned that has been sitting in the RFC Editor queue for some time. There's a hanging reference that is mentioned in an aside, which could be moved to informational. The title in the text is also incorrect; it's the title of the previous MD5 document that's been superseded by the TCP authentication option.
Sarah: Thank you. I will work with the authors to strongly suggest informative instead of normative. It makes lots of sense.
Al: How many people have read the old version of the draft? ... It looks like 5 people. Should we adopt this draft as a working group draft? Please hum for yes.
[some people humming]
Al: Those opposed, please hum.
[no humming]
Al: There was a clear bias towards adoption. But we'll also take that one to the list.
Sarah: Thank you.

7. Benchmarking Virtual Switches in OPNFV
Presenter: Maryam Tahhan
http://tools.ietf.org/html/draft-vsperf-bmwg-vswitch-opnfv-00.txt

Scott: The ID name of the draft is not compliant. It should be draft--...
Al: It was really the project that was represented here, but I agree.
Scott: The IETF is people, not projects or companies.
Al: Point taken.
Scott: I think there's a bit of confusion between maximum forwarding rate and throughput for the 72-hour test. Normally we define throughput as the maximum forwarding rate without errors. That wouldn't be something to measure over 72 hours. It doesn't sound quite right. If the term "throughput" is used, then it should be the one BMWG is using, the one defined in RFC 2544.
Al: Maybe we should look at the RFC 2889 maximum forwarding rate and also consider loss.
Al: As authors, we're interested in feedback about the draft.
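For illustration only: a minimal sketch of the RFC 2544-style throughput search Scott refers to (the highest rate with zero loss), as distinct from a long-duration soak. The send_at_rate hook, trial duration and resolution are hypothetical placeholders, not anything defined in the draft:

```python
def rfc2544_throughput(max_rate_pps: float, send_at_rate, trial_s: float = 60.0,
                       resolution_pps: float = 1000.0) -> float:
    """Binary search for the highest rate with zero frame loss (RFC 2544 style).

    send_at_rate(rate_pps, trial_s) is a hypothetical traffic-generator hook
    that returns the number of frames lost during the trial.
    """
    lo, hi = 0.0, max_rate_pps
    while hi - lo > resolution_pps:
        mid = (lo + hi) / 2.0
        if send_at_rate(mid, trial_s) == 0:
            lo = mid   # no loss: try a higher rate
        else:
            hi = mid   # loss seen: back off
    return lo

# Hypothetical DUT model for demonstration only: drops frames above 800k pps.
fake_dut = lambda rate, trial_s: 0 if rate <= 800_000 else 1
print(rfc2544_throughput(1_000_000, fake_dut))

# By contrast, a 72-hour soak measures behaviour over time at a chosen rate,
# which is closer to a maximum-forwarding-rate or stability measurement than
# to the single-number throughput searched for above.
```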
Kostas: This is very timely work. I've just sent an e-mail with the two drafts to the Unify project, on behalf of which I'm here.
Maryam: Excellent. Thank you very much.
Marius: I think this is very important work, as virtualization represents the future in my opinion. I have one question: why 72 hours?
Maryam: A lot of manual testing and reading of test reports conducted by many companies suggested 72 hours as a baseline for testing a virtualized environment. Initially we thought 24 hours would be enough, but we needed to validate over a longer period of time. So for our soak tests the minimum is 72 hours. But there are other tests which are not soak tests.
Marius: I'm not sure I understand, is that one iteration? 72 hours would not be an incentive to start testing.
Maryam: We run multiple iterations to determine a rate and then we soak at that rate.
Al: Do folks want to see us continue this work here?
[many people nodding :) and someone saying yes]

8. Benchmarking Methodology for Virtualization Network Performance
Speaker: Rong Gu
https://tools.ietf.org/html/draft-huang-bmwg-virtual-network-performance-01

Jacob: I read through the draft. You seem to also mention VXLAN, SDN and some other things. I feel like there should be a draft that addresses only the vSwitch performance.
Rong: I understand. There is no content about VXLAN in the updated version.
Sarah: We heard from Maryam and Al today, and there is a draft coming in on the other side. I did see some notes about how these things play together or not. So, I would encourage the authors of the two drafts to communicate. If there is a redundancy it should be discussed. Can we get an update in Yokohama?
Al: I think the best way to think about this is that the drafts have different approaches. The OPNFV draft is referencing existing specifications, while Rong's work is preparing new procedures. I think the drafts can be complementary, but we still need to stay coordinated. We are using a physical tester (the rabbit ears) while Rong's team is investigating the possibility of using a virtual tester as well.
Maryam: We might end up referencing some of this work, because we have scenarios like physical-to-virtual and virtual-to-physical where we would need to measure the performance. But there's one scenario which I didn't see here today, virtual-to-virtual. That is something that maybe we should address, or you should address, or maybe collaborate on.
Sarah: Let's make sure the people in the working group know that the two drafts are not redundant, which both teams are clearly indicating.

- End of session -