Benchmarking Methodology Working Group (BMWG)
IETF 92 * Tuesday, March 24, 2015 * Morning Session I
Royal * OPS * bmwg
Chairs: Al Morton, Sarah Banks
Minutes takers: Marius Georgescu, Bill Cerveny

*** NOTE: Action Items (AIs) have been denoted in BOLD text.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

0. Agenda Bashing
- A couple of people are attending BMWG for the first time; Al welcomes them.
- No bashes to the agenda.

1a. New Charter and Milestones (Chairs)
- Al asks for major issues with the traffic management draft; he declares working group consensus.
- Al has submitted the shepherd document for the traffic management draft.
- DC draft adopted; lots of support observed.

1b. WG Status (Chairs)

2. Traffic Management Benchmarking
draft-ietf-bmwg-traffic-management-03
Presenter: Barry Constantine
- Results of Third WGLC, overview of the comments received
  https://tools.ietf.org/html/draft-ietf-bmwg-traffic-management-03
- Barry and Ram Krishnan present; Barry presenting.
- Al is about to push the button on submitting the document; asks for comments; there are none. Al thanks the authors for their efforts.
- Jacob Rapp: There are other tools. Barry: He'd like to add those.
- Al: Best for all would be for Jacob to add this to the list. This would be very good background. I'd like to add this to the supplementary site.
- Marius: There is another tool of interest, called D-ITG; Sarah said she will add that as an action item.
- Scott Bradner: How long is the test run?
- Barry: Minimum of 60 seconds.
- Scott:
- Marius: My experience is with 10-20 iterations; 10 might be a good number.
- Also, wouldn't expressing the error for the number of iterations be beneficial? (See the trial-aggregation sketch after item 4 below.)
- Scott: The number of tests depends on duration. I would do 100 or 200 tests if the test is only 10 seconds.
- An -04 draft will be created, updated with the number of trials.

3. Software Upgrade Benchmarking document
draft-ietf-bmwg-issu-meth-00
Presenter: Sarah Banks
- WG Adoption, discussion, etc.
- Good feedback so far.
- Updated draft has been posted.
- Sarah: We're looking for support and more people interested in the draft.
- We have 2 or 3 companies looking at this.
- Al: If you have a Linux Foundation login, there is active discussion there regarding software upgrades in virtualization environments.

4. Data Center Benchmarking Proposal
draft-dcbench-def-02.txt & draft-bmwg-dcbench-methodology-03.txt
Presenter: Jacob Rapp
- Successful call for adoption!
- Identified 16 people who have reviewed the draft.
- Al: Green areas have been commented on. Open for comments on everything.
- Marius: You don't indicate the number of repetitions; no exact number is specified. (See the sketch after this section.)
- Jacob: OK, we don't yet specify how many repetitions.
- Scott: Encourage repetitions; you can get unreliable results if you don't.
- Al (as participant): We ended up with a refined definition of jitter / PDV (Packet Delay Variation).
- I'd like to see some comparison between the virtual and physical world. You want to pick the one that you want to know about.
- This is headed in the right direction; we need review across the whole spectrum of things.
- Ram: We are not debating anything at layer 2-3-4, but we are talking about virtualization. How about the NFV infrastructure?
- Jacob: We wanted to refine the physical stuff, so we remained focused on the physical tests.
- Scott: Something you might consider: one of the little crises we talk about in the IETF is buffer bloat.
- Buffers that are too big can make things worse.
- There should be some sort of indication that a bigger buffer is not necessarily a better buffer.
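
A minimal sketch of how results from repeated trials might be aggregated and the error across repetitions expressed, as raised in the iteration discussions under items 2 and 4 above. The trial values, the trial count, and the metric (throughput in frames per second) are hypothetical, not figures from the meeting.

    import statistics

    # Hypothetical per-trial throughput results (frames/sec); values are illustrative only.
    trials = [948200, 951100, 949800, 950400, 947900, 952300, 950900, 949100, 951700, 950000]

    mean = statistics.mean(trials)
    stdev = statistics.stdev(trials)           # sample standard deviation across trials
    stderr = stdev / (len(trials) ** 0.5)      # standard error of the mean
    ci95 = 1.96 * stderr                       # ~95% interval, normal approximation

    print(f"{len(trials)} trials: mean {mean:.0f} fps +/- {ci95:.0f} fps (95% CI)")

With more trials (e.g. Scott's 100-200 for short runs), the reported interval narrows roughly with the square root of the trial count, which is one way to state the trade-off between trial duration and trial count.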
5. IPv6 Neighbor Discovery
draft-cerveny-bmwg-ipv6-nd-06
Presenter: Bill Cerveny
Call for Adoption!
- Scott: I am against the use of the word "problems". Benchmarking problems is a funky concept.
- Can you explain more about the small stuff? If you test with ND working and it overflows, does the traffic stop through that?
- Ron: Can I answer that question? When the neighbor cache is full, a couple of things can happen. One is that the kernel can crash; that's probably the worst behavior. The next is that the device is trying to create an entry, but it can't. Another possibility is you have available addresses but they can't refresh. One of these behaviors should be prioritized. The question is: is the behavior recommended by RFC 6583 really happening?
- Scott: I think you're not really testing flows. Maybe you should not call it testing flows.
- What is the ND time-out?
- Bill: It is around 45 seconds and it varies every 2-4 hours.
- Scott: That's going to mean long test runs. You might want to reword it. You're not really testing for flows; you're testing for failures of devices, which will cause failures of flows.
- Ron: What we're really benchmarking is the neighbor cache behavior.
- Bill: So, benchmarking IPv6 neighbor cache behavior.
- Al: We have enough support for adoption.
- Change document name.

Returning proposals:

6. IPv6 Transition Benchmarking
Presenter: Marius Georgescu
Many comments addressed on the list.
http://tools.ietf.org/html/draft-georgescu-bmwg-ipv6-tran-tech-benchmarking-00
was: http://tools.ietf.org/html/draft-georgescu-ipv6-transition-tech-benchmarking-00
- Scott: I always use dynamic routing. I would recommend simple setups.
- Al: I think you're headed in the right direction ... comment about jitter yet to be addressed.
- Al: It comes down to what you want to learn about the measurements you are making. The one main difference is that delay ... There are lots of circumstances in live networks where it may be easier to measure ... It comes down to what you want to know. People ask about how much delay ... What do we want to know from our delay variation measurement?
- Scott: For inter-packet delays there are cases where management is interested in inter-packet jitter. It is helpful; in the real world PDV is the most useful. (See the sketch after this section.)
- Al: I don't want to exclude one, but want to have preferred measurements that will answer your question.
- Jacob: You should have a recommended value for delay variation. We should make sure we have a recommended value for benchmarking purposes.
- Scott: The only thing you don't get is the concern that packets are too close together for the next device; but this is a very narrow case.
- Marius: Is this document likely to be adopted at some point?
- Al: There's a possibility that it could be a working group document, but it's not currently in the charter. These are very important technologies.
- Scott: My opinion is it's stuff we should adopt.
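
A minimal illustration of the two delay-variation formulations contrasted above: inter-packet delay variation (differences between consecutive packets) versus packet delay variation relative to the minimum observed delay, in the spirit of RFC 5481. The delay values are hypothetical.

    # Hypothetical one-way delays (ms) for consecutive packets of one stream.
    delays = [10.2, 10.5, 12.1, 10.3, 10.4, 15.0, 10.2, 10.6]

    # Inter-packet delay variation (IPDV): difference between consecutive packets' delays.
    ipdv = [b - a for a, b in zip(delays, delays[1:])]

    # Packet delay variation (PDV): each packet's delay relative to the minimum observed
    # delay (the formulation RFC 5481 associates with the term PDV).
    d_min = min(delays)
    pdv = [d - d_min for d in delays]

    print("IPDV samples:", [round(v, 1) for v in ipdv])
    print("PDV samples: ", [round(v, 1) for v in pdv])
    print("max PDV:", round(max(pdv), 1))

Which summary answers the question at hand (e.g. a high percentile of PDV versus the spread of IPDV) is exactly the "what do you want to know" point made in the discussion above.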
7. VNF and Infrastructure Benchmarking Considerations
Presenter: Al Morton
Discussion on Test/DUT Interaction, new metrics, scalability in the matrix
https://tools.ietf.org/html/draft-morton-bmwg-virtual-net-03.txt
- Scott: Why are you so focused on COTS (commercial off-the-shelf)?
- Al: It's the mantra of the network function virtualization world. It's more non-specialized computing than commercial off-the-shelf.
- Scott: Why does the BMWG care?
- Al: If you have to configure commercial off-the-shelf systems with multiple components, different systems with different components will perform differently. I believe that's an important test parameter to report.
- Joel: I think it is an externality.
- Scott: What is different in the test if it is a COTS system or something else?
- Al: What's different is only in the reporting. There are more variables to report on now. There are more things to be aware of now and to report on now.
- Sarah: I think it might make sense to call out COTS vs black-box.
- Scott: I would remove the concept of COTS from this; you are doing yourself a disservice by making this distinction.
- Ram: Hardware and software are delivered as a whole package. Talking about NFV (Network Function Virtualization), you want to make the separation between the two. Maybe that's a way to proceed.
- Al: It doesn't change the test, it changes what you report about the test.
- Scott: If you're not reporting on what you're testing, you're making a mistake.
- Al: I am shocked and scared by the variations of Intel Xeon processors.
- Jacob: Maybe what you're getting at is a question of repeatability. It may be the case that the test is unrepeatable because you need a specific combination of hardware and software.
- Scott: Maybe you just want to make it clear that the device should be fully described. (See the sketch after this section.)
- Al: If there is something that people discover for their platform of choice, it would be useful to collect this information. But we're focusing on black-box measurement.
- Barry: I want to mention concurrency. Considering that many virtual network devices (e.g. routers, firewalls) can run on the same hardware, I think it might take things a long way to help the community understand how to report on that, how to build the test configuration, and how to define the level of concurrency. It's a whole new mix.
- Bhuvan: I think it's worthwhile to propose metrics specific to the virtual world.
- Scott: I think this stuff is tremendously important. You don't have to be discouraged by the challenges.
- Ashish: About competing DNS, is there enough given on defining the environment in which DNS can be tested?
- Al: I think that would be a challenge, but we may get some help from some of the open source projects. One of the things we might look at would be the variation across the differences between them. However, I would be reluctant to produce a static document for what would be the standard server architecture, since things change so fast.
- Scott: I actually don't see the point here in comparing between environments.
- Ram: Power consumption could be another parameter.
- Al: Should that be an ongoing power measurement during the test or some sort of server specification (e.g. maximum power)?
- Ram: Another parameter can be dynamic power consumption: check if the power is exceeding the threshold. I think that would be useful. Other useful info can be CPU utilization, DRAM utilization, etc.
- Al: These are actually internal measurements, which can be taken in an operational environment. We are currently emphasizing the black-box measurements.
- Ram: Separate drafts might be a good idea in the context of different VNFs.
- Al: This is something that we've written down quickly, for which other people are going to prepare individual benchmarking work.
- The specialized things need to be done one-by-one. The more physical things we have, the easier it is to characterize. "Corral" is a good word.
- Jacob: It is useful to talk about how things scale, such as how firewalls scale.
- Al: Scale testing is one of the most important topics as well.
- Scott: What you describe as accuracy sounds like conformance testing. The wording is important.
- Al: Correctness of outcome is shared between speed and reliability.
- Scott: Such as not recording VMs with errors.
- Al: If we can't see them, we can't report them.
- From slides: OPNFV (Open Platform for NFV) -- Al's comment: they have IPPM in mind ...
- Sarah asked about support for the document; there seemed to be support; Sarah to ask for adoption on the list.
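
A hypothetical sketch of the kind of platform description a VNF benchmark report might capture so that the device under test is "fully described", per the discussion above. The field names and values are illustrative only, not a reporting format defined by the draft.

    # Hypothetical fields a VNF benchmark report might record; names and values are
    # illustrative, not a defined reporting format.
    dut_environment = {
        "hardware": {
            "cpu_model": "example 16-core x86_64",
            "sockets": 2,
            "memory_gb": 128,
            "nics": ["2 x 10GbE"],
        },
        "software": {
            "hypervisor": "example-hypervisor 1.0",
            "host_os": "example-linux 4.x",
            "vnf_under_test": "example-vrouter 2.3",
            "vcpus_allocated": 4,
            "concurrent_vnfs": 1,   # concurrency level, per Barry's point above
        },
    }

    for layer, details in dut_environment.items():
        print(layer + ":", details)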

8. Benchmarking Methodology for SDN Controller Performance
Presenter: Bhuvan Vengainathan
Revised draft, comments on the list
http://tools.ietf.org/html/draft-bhuvan-bmwg-of-controller-benchmarking-01
- Jacob: Is this OpenFlow specific and not just SDN specific?
- Bhuvan: This is intended to be generic to SDN controllers.
- Sarah: This was a change from the first revision.
- Jacob: I suggest scoping this as OpenFlow if you are discussing OpenFlow, to reduce variability.
- Sarah: Please take a look at the definition of the SDN controller in the draft and let's continue the discussion.
- Al: Within the SDN research group there is standard terminology, and Bhuvan has adopted this terminology.
- Scott: Need to clarify where traffic is coming from and where it is going.
- Sarah: I agree, this is something that needs to be cleaned up.
- Ram: While keeping this generic, different controller applications might need different recommendations.
- Bhuvan: The base functionality is the same even if the applications vary. This question was partially addressed on the mailing list. Maybe extension drafts can cover more specific applications.
- Ashish: Exception handling; why are you testing this?
- Bhuvan: We are trying to measure deviation from the baseline performance.
- Ashish: It would seem there is an almost infinite number of variations when benchmarking exception handling. Too many possibilities.
- Bhuvan: I understand your point. We are examining the robustness.
- Sarah: Let's take this to the list.
- 4-5 people read the draft; about the same number are in favour of adoption.
- Al: We're on the fence about adoption. Let's take this to the list.

LAST. AOB
########################################################################
#######################################