Benchmarking Methodology Working Group (BMWG)
IETF 92 * Tuesday, March 24, 2015 * Morning Session I
Royal * OPS * bmwg
Chairs: Al Morton, Sarah Banks
Minutes takers: Marius Georgescu, Bill Cerveny

*** NOTE: Action Items (AIs) have been denoted in BOLD text.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

0. Agenda Bashing
- A couple of people are attending BMWG for the first time; Al welcomes them.
- No bashes to the agenda.

1a. New Charter and Milestones (Chairs)
- Al asks for major issues with the traffic management draft; he declares working group consensus.
- Al has submitted the shepherd document for the traffic management draft.
- DC draft adopted; lots of support observed.

1b. WG Status (Chairs)

2. Traffic Management Benchmarking
draft-ietf-bmwg-traffic-management-03
Presenter: Barry Constantine
- Results of Third WGLC, overview of the comments received
  https://tools.ietf.org/html/draft-ietf-bmwg-traffic-management-03
- Barry and Ram Krishnan present; Barry presenting.
- Al is about to push the button on submitting the document; asks for comments; there are none. Al thanks the authors for their efforts.
- Jacob Rapp: There are other tools. Barry: He'd like to add those.
- Al: Best for all would be for Jacob to add this to the list. This would be very good background. I'd like to add this to the supplementary site.
- Marius: There is another tool of interest, called D-ITG; Sarah said she will add that as an action item.
- Scott Bradner: How long is the test run?
- Barry: Minimum of 60 seconds.
- Scott:
- Marius: My experience is with 10-20 iterations; 10 might be a good number.
- Also, wouldn't expressing the error for the number of iterations be beneficial? (See the trial-aggregation sketch after item 4 below.)
- Scott: The number of tests depends on duration. I would do 100 or 200 tests if the test is only 10 seconds.
- An -04 draft will be created, updated with the number of trials.

3. Software Upgrade Benchmarking document
draft-ietf-bmwg-issu-meth-00
Presenter: Sarah Banks
- WG Adoption, discussion, etc.
- Good feedback so far.
- Updated draft has been posted.
- Sarah: We're looking for support and more people interested in the draft.
- We have 2 or 3 companies looking at this.
- Al: If you have a Linux Foundation login, there is active discussion there regarding software upgrades in virtualization environments.

4. Data Center Benchmarking Proposal
draft-dcbench-def-02.txt & draft-bmwg-dcbench-methodology-03.txt
Presenter: Jacob Rapp
- Successful call for adoption!
- Identified 16 people who have reviewed the draft.
- Al: Green areas have been commented on. Open for comments on everything.
- Marius: You don't indicate the number of repetitions; no exact number is specified. (See the sketch after this section.)
- Jacob: OK, we don't yet specify how many repetitions.
- Scott: Encourage repetitions; you can get unreliable results if you don't.
- Al (as participant): We ended up with a refined definition of jitter / PDV (Packet Delay Variation).
- I'd like to see some comparison between the virtual and physical world. You want to pick the one that you want to know about.
- This is headed in the right direction; we need review across the whole spectrum of things.
- Ram: We are not debating anything at layer 2-3-4, but we are talking about virtualization. How about the NFV infrastructure?
- Jacob: We wanted to refine the physical stuff, so we remained focused on the physical tests.
- Scott: Something you might consider: one of the little crises we talk about in the IETF is buffer bloat.
- Buffers that are too big can make things worse.
- There should be some sort of indication that a bigger buffer is not necessarily a better buffer.
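
A minimal sketch of how results from repeated trials might be aggregated and the error across repetitions expressed, as raised in the iteration discussions under items 2 and 4 above. The trial values, the trial count, and the metric (throughput in frames per second) are hypothetical, not figures from the meeting.

    import statistics

    # Hypothetical per-trial throughput results (frames/sec); values are illustrative only.
    trials = [948200, 951100, 949800, 950400, 947900, 952300, 950900, 949100, 951700, 950000]

    mean = statistics.mean(trials)
    stdev = statistics.stdev(trials)           # sample standard deviation across trials
    stderr = stdev / (len(trials) ** 0.5)      # standard error of the mean
    ci95 = 1.96 * stderr                       # ~95% interval, normal approximation

    print(f"{len(trials)} trials: mean {mean:.0f} fps +/- {ci95:.0f} fps (95% CI)")

With more trials (e.g. Scott's 100-200 for short runs), the reported interval narrows roughly with the square root of the trial count, which is one way to state the trade-off between trial duration and trial count.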
5. IPv6 Neighbor Discovery
draft-cerveny-bmwg-ipv6-nd-06
Presenter: Bill Cerveny
Call for Adoption!
- Scott: I am against the use of the word "problems". Benchmarking problems is a funky concept.
- Can you explain more about the small stuff? If you test with ND working and it overflows, does the traffic stop through that?
- Ron: Can I answer that question? When the neighbor cache is full, a couple of things can happen. One is that the kernel can crash; that's probably the worst behavior. The next is that the device is trying to create an entry, but it can't. Another possibility is you have available addresses but they can't refresh. One of these behaviors should be prioritized. The question is: is the behavior recommended by RFC 6583 really happening?
- Scott: I think you're not really testing flows. Maybe you should not call it testing flows.
- What is the ND time-out?
- Bill: It is around 45 seconds and it varies every 2-4 hours.
- Scott: That's going to mean long test runs. You might want to reword it. You're not really testing for flows; you're testing for failures of devices, which will cause failures of flows.
- Ron: What we're really benchmarking is the neighbor cache behavior.
- Bill: So, benchmarking IPv6 neighbor cache behavior.
- Al: We have enough support for adoption.
- Change document name.

Returning proposals:

6. IPv6 Transition Benchmarking
Presenter: Marius Georgescu
Many comments addressed on the list.
http://tools.ietf.org/html/draft-georgescu-bmwg-ipv6-tran-tech-benchmarking-00
was: http://tools.ietf.org/html/draft-georgescu-ipv6-transition-tech-benchmarking-00
- Scott: I always use dynamic routing. I would recommend simple setups.
- Al: I think you're headed in the right direction ... comment about jitter yet to be addressed.
- Al: It comes down to what you want to learn about the measurements you are making. The one main difference is that delay ... There are lots of circumstances in live networks where it may be easier to measure ... It comes down to what you want to know. People ask about how much delay ... What do we want to know from our delay variation measurement?
- Scott: For inter-packet delays there are cases where management is interested in inter-packet jitter. It is helpful; in the real world PDV is the most useful. (See the sketch after this section.)
- Al: I don't want to exclude one, but want to have preferred measurements that will answer your question.
- Jacob: You should have a recommended value for delay variation. We should make sure we have a recommended value for benchmarking purposes.
- Scott: The only thing you don't get is the concern that packets are too close together for the next device; but this is a very narrow case.
- Marius: Is this document likely to be adopted at some point?
- Al: There's a possibility that it could be a working group document, but it's not currently in the charter. These are very important technologies.
- Scott: My opinion is it's stuff we should adopt.
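
A minimal illustration of the two delay-variation formulations contrasted above: inter-packet delay variation (differences between consecutive packets) versus packet delay variation relative to the minimum observed delay, in the spirit of RFC 5481. The delay values are hypothetical.

    # Hypothetical one-way delays (ms) for consecutive packets of one stream.
    delays = [10.2, 10.5, 12.1, 10.3, 10.4, 15.0, 10.2, 10.6]

    # Inter-packet delay variation (IPDV): difference between consecutive packets' delays.
    ipdv = [b - a for a, b in zip(delays, delays[1:])]

    # Packet delay variation (PDV): each packet's delay relative to the minimum observed
    # delay (the formulation RFC 5481 associates with the term PDV).
    d_min = min(delays)
    pdv = [d - d_min for d in delays]

    print("IPDV samples:", [round(v, 1) for v in ipdv])
    print("PDV samples: ", [round(v, 1) for v in pdv])
    print("max PDV:", round(max(pdv), 1))

Which summary answers the question at hand (e.g. a high percentile of PDV versus the spread of IPDV) is exactly the "what do you want to know" point made in the discussion above.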
7. VNF and Infrastructure Benchmarking Considerations
Presenter: Al Morton
Discussion on Test/DUT Interaction, new metrics, scalability in the matrix
https://tools.ietf.org/html/draft-morton-bmwg-virtual-net-03.txt
- Scott: Why are you so focused on COTS (commercial off-the-shelf)?
- Al: It's the mantra of the network function virtualization world. It's more non-specialized computing than commercial off-the-shelf.
- Scott: Why does the BMWG care?
- Al: If you have to configure commercial off-the-shelf systems with multiple components, different systems with different components will perform differently. I believe that's an important test parameter to report.
- Joel: I think it is an externality.
- Scott: What is different in the test if it is a COTS system or something else?
- Al: What's different is only in the reporting. There are more variables to report on now. There are more things to be aware of now and to report on now.
- Sarah: I think it might make sense to call out COTS vs black-box.
- Scott: I would remove the concept of COTS from this; you are doing yourself a disservice by making this distinction.
- Ram: Hardware and software are delivered as a whole package. Talking about NFV (Network Function Virtualization), you want to make the separation between the two. Maybe that's a way to proceed.
- Al: It doesn't change the test, it changes what you report about the test.
- Scott: If you're not reporting on what you're testing, you're making a mistake.
- Al: I am shocked and scared by the variations of Intel Xeon processors.
- Jacob: Maybe what you're getting at is a question of repeatability. It may be the case that the test is unrepeatable because you need a specific combination of hardware and software.
- Scott: Maybe you just want to make it clear that the device should be fully described. (See the sketch after this section.)
- Al: If there is something that people discover for their platform of choice, it would be useful to collect this information. But we're focusing on black-box measurement.
- Barry: I want to mention concurrency. Considering that many virtual network devices (e.g. routers, firewalls) can run on the same hardware, I think it might take things a long way to help the community understand how to report on that, how to build the test configuration, and how to define the level of concurrency. It's a whole new mix.
- Bhuvan: I think it's worthwhile to propose metrics specific to the virtual world.
- Scott: I think this stuff is tremendously important. You don't have to be discouraged by the challenges.
- Ashish: About competing DNS, is there enough given on defining the environment in which DNS can be tested?
- Al: I think that would be a challenge, but we may get some help from some of the open source projects. One of the things we might look at would be the variation across the differences between them. However, I would be reluctant to produce a static document for what would be the standard server architecture, since things change so fast.
- Scott: I actually don't see the point here in comparing between environments.
- Ram: Power consumption could be another parameter.
- Al: Should that be an ongoing power measurement during the test or some sort of server specification (e.g. maximum power)?
- Ram: Another parameter can be dynamic power consumption: check if the power is exceeding the threshold. I think that would be useful. Other useful info can be CPU utilization, DRAM utilization, etc.
- Al: These are actually internal measurements, which can be taken in an operational environment. We are currently emphasizing the black-box measurements.
- Ram: Separate drafts might be a good idea in the context of different VNFs.
- Al: This is something that we've written down quickly, for which other people are going to prepare individual benchmarking work.
- The specialized things need to be done one-by-one. The more physical things we have, the easier it is to characterize. "Corral" is a good word.
- Jacob: It is useful to talk about how things scale, such as how firewalls scale.
- Al: Scale testing is one of the most important topics as well.
- Scott: What you describe as accuracy sounds like conformance testing. The wording is important.
- Al: Correctness of outcome is shared between speed and reliability.
- Scott: Such as not recording VMs with errors.
- Al: If we can't see them, we can't report them.
- From slides: OPNFV (Open Platform for NFV) -- Al's comment: they have IPPM in mind ...
- Sarah asked about support for the document; there seemed to be support; Sarah to ask for adoption on the list.
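
A hypothetical sketch of the kind of platform description a VNF benchmark report might capture so that the device under test is "fully described", per the discussion above. The field names and values are illustrative only, not a reporting format defined by the draft.

    # Hypothetical fields a VNF benchmark report might record; names and values are
    # illustrative, not a defined reporting format.
    dut_environment = {
        "hardware": {
            "cpu_model": "example 16-core x86_64",
            "sockets": 2,
            "memory_gb": 128,
            "nics": ["2 x 10GbE"],
        },
        "software": {
            "hypervisor": "example-hypervisor 1.0",
            "host_os": "example-linux 4.x",
            "vnf_under_test": "example-vrouter 2.3",
            "vcpus_allocated": 4,
            "concurrent_vnfs": 1,   # concurrency level, per Barry's point above
        },
    }

    for layer, details in dut_environment.items():
        print(layer + ":", details)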

8. Benchmarking Methodology for SDN Controller Performance
Presenter: Bhuvan Vengainathan
Revised draft, comments on the list
http://tools.ietf.org/html/draft-bhuvan-bmwg-of-controller-benchmarking-01
- Jacob: Is this OpenFlow specific and not just SDN specific?
- Bhuvan: This is intended to be generic to SDN controllers.
- Sarah: This was a change from the first revision.
- Jacob: I suggest scoping this as OpenFlow if you are discussing OpenFlow, to reduce variability.
- Sarah: Please take a look at the definition of the SDN controller in the draft and let's continue the discussion.
- Al: Within the SDN research group there is standard terminology, and Bhuvan has adopted this terminology.
- Scott: Need to clarify where traffic is coming from and where it is going.
- Sarah: I agree, this is something that needs to be cleaned up.
- Ram: While keeping this generic, different controller applications might need different recommendations.
- Bhuvan: The base functionality is the same even if the applications vary. This question was partially addressed on the mailing list. Maybe extension drafts can cover more specific applications.
- Ashish: Exception handling; why are you testing this?
- Bhuvan: We are trying to measure deviation from the baseline performance.
- Ashish: It would seem there is an almost infinite number of variations when benchmarking exception handling. Too many possibilities.
- Bhuvan: I understand your point. We are examining the robustness.
- Sarah: Let's take this to the list.
- 4-5 people read the draft; about the same number are in favour of adoption.
- Al: We're on the fence about adoption. Let's take this to the list.

LAST. AOB
########################################################################
#######################################