Considerations for Benchmarking Virtual Network Functions
and Their Infrastructure

AT&T Labs
200 Laurel Avenue South
Middletown, NJ 07748
USA
+1 732 420 1571
+1 732 368 1192
acmorton@att.com
http://home.comcast.net/~acmacm/

Abstract

The Benchmarking Methodology Working Group (BMWG) has traditionally
conducted laboratory characterization of dedicated physical
implementations of internetworking functions. This memo investigates
additional considerations when network functions are virtualized and
performed in general purpose hardware.

See the new version history section for updates.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.

Introduction

The Benchmarking Methodology Working Group (BMWG) has traditionally
conducted laboratory characterization of dedicated physical
implementations of internetworking functions (or physical network
functions, PNFs). The Black-box Benchmarks of Throughput, Latency,
Forwarding Rates, and others have served our industry for many years,
and are the cornerstones of the work.

An emerging set of service provider and vendor development goals is
to reduce costs while increasing flexibility of network devices, and
drastically accelerate their deployment. Network Function Virtualization
(NFV) has the promise to achieve these goals, and therefore has garnered
much attention. It now seems certain that some network functions will be
virtualized following the success of cloud computing and virtual
desktops supported by sufficient network path capacity, performance, and
widespread deployment; many of the same techniques will help achieve
NFV.

In the context of Virtualized Network Functions (VNF), the supporting
Infrastructure requires general-purpose computing systems, storage
systems, networking systems, virtualization support systems (such as
hypervisors), and management systems for the virtual and physical
resources. There will be many potential suppliers of Infrastructure
systems and significant flexibility in configuring the systems for best
performance. There are also many potential suppliers of VNFs, adding to
the combinations possible in this environment. The separation of
hardware and software suppliers has a profound implication on
benchmarking activities: much more of the internal configuration of the
black-box device under test (DUT) must now be specified and reported
with the results, to foster both repeatability and comparison testing at
a later time.Consider the following User Story as further background and
motivation:"I'm designing and building my NFV Infrastructure platform. The first
steps were easy because I had a small number of categories of VNFs to
support and the VNF vendor gave HW recommendations that I followed. Now
I need to deploy more VNFs from new vendors, and there are different
hardware recommendations. How well will the new VNFs perform on my
existing hardware? Which among several new VNFs in a given category are
most efficient in terms of capacity they deliver? And, when I operate
multiple categories of VNFs (and PNFs) *concurrently* on a hardware
platform such that they share resources, what are the new performance
limits, and what are the software design choices I can make to optimize
my chosen hardware platform? Conversely, what hardware platform upgrades
should I pursue to increase the capacity of these concurrently operating
VNFs?"See http://www.etsi.org/technologies-clusters/technologies/nfv for
more background, for example, the white papers there may be a useful
starting place. The Performance and Portability Best Practices are particularly relevant to BMWG. There are
documents available in the Open Area
http://docbox.etsi.org/ISG/NFV/Open/Latest_Drafts/ including drafts
describing Infrastructure aspects and service quality.

BMWG will consider the new topic of Virtual Network Functions and
related Infrastructure to ensure that common issues are recognized from
the start, using background materials from industry and SDOs (e.g.,
IETF, ETSI NFV).

This memo investigates additional methodological considerations
necessary when benchmarking VNFs instantiated and hosted in
general-purpose hardware, using bare-metal hypervisors or other
isolation environments such as Linux containers. An essential
consideration is benchmarking physical and virtual network functions in
the same way when possible, thereby allowing direct comparison.
Benchmarking combinations of physical and virtual devices and
functions in a System Under Test should also be considered.

A clearly related goal: the benchmarks for the capacity of a
general-purpose platform to host a plurality of VNF instances should be
investigated. Existing networking technology benchmarks will also be
considered for adaptation to NFV and closely associated
technologies.

A non-goal is any overlap with traditional computer benchmark
development and their specific metrics (SPECmark suites such as
SPECCPU).

A continued non-goal is any form of architecture development related
to NFV and associated technologies in BMWG, consistent with all
chartered work since BMWG began in 1989.

This section lists the new considerations which must be addressed to
benchmark VNF(s) and their supporting infrastructure. The System Under
Test (SUT) is composed of the hardware platform components, the VNFs
installed, and many other supporting systems. It is critical to document
all aspects of the SUT to foster repeatability.

New Hardware components will become part of the test set-up:

o  High volume server platforms (general-purpose, possibly with
   virtual technology enhancements).

o  Storage systems with large capacity, high speed, and high
   reliability.

o  Network Interface ports specially designed for efficient service
   of many virtual NICs.

o  High capacity Ethernet Switches.

The components above are subjects for development of specialized
benchmarks which are focused on the special demands of network
function deployment.

Labs conducting comparisons of different VNFs may be able to use
the same hardware platform over many studies, until the steady march
of innovations overtakes their capabilities (as happens with the lab's
traffic generation and testing devices today).

It will be necessary to configure and document the settings for the
entire general-purpose platform to ensure repeatability and foster
future comparisons, including but clearly not limited to the
following:

o  number of server blades (shelf occupation)

o  CPUs

o  caches

o  memory

o  storage system

o  I/O

as well as configurations that support the devices which host the
VNF itself:

o  Hypervisor (or other forms of virtual function hosting)

o  Virtual Machine (VM)

o  Infrastructure Virtual Network (which interconnects Virtual
   Machines with physical network interfaces, or with each other
   through virtual switches, for example)

and finally, the VNF itself, with items such as:

o  specific function being implemented in VNF

o  reserved resources for each function (e.g., CPU pinning and
   Non-Uniform Memory Access, NUMA node assignment)

o  number of VNFs (or sub-VNF components, each with its own VM) in
   the service function chain (see section 1.1 of for a definition
   of service function chain)

o  number of physical interfaces and links transited in the service
   function chain
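As a non-normative illustration, the sketch below (in Python) shows
one way such configuration settings could be captured in a
machine-readable record and reported alongside the results. The field
names and values are assumptions for illustration only, not a
proposed schema.

   # Illustrative (hypothetical) SUT configuration record; the field
   # names and values are examples only, not a required schema.
   import json

   sut_config = {
       "hardware": {
           "server_blades": 4,  # shelf occupation
           "cpu": {"sockets": 2, "cores_per_socket": 12},
           "cache_l3_mib": 30,
           "memory_gib": 256,
           "storage": {"type": "ssd", "capacity_tib": 4},
           "io": {"nics": 4, "nic_speed_gbps": 10},
       },
       "virtualization": {
           "hypervisor": "example-hypervisor 1.0",
           "vm": {"vcpus": 4, "memory_gib": 8},
           "infrastructure_virtual_network": "virtual switch",
       },
       "vnf": {
           "function": "example virtual router",
           "cpu_pinning": True,
           "numa_node": 0,
           "vnfs_in_service_function_chain": 2,
           "physical_interfaces_in_chain": 2,
       },
   }

   # Emit the record so it can accompany machine-readable results.
   print(json.dumps(sut_config, indent=2))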
In the physical device benchmarking context, most of the
corresponding infrastructure configuration choices were determined by
the vendor. Although the platform itself is now one of the
configuration variables, it is important to maintain emphasis on the
networking benchmarks and capture the platform variables as input
factors.

The concept of characterizing performance at capacity limits may
change. For example:

o  It may be more representative of system capacity to characterize
   the case where Virtual Machines (VM, hosting the VNF) are
   operating at 50% Utilization, and therefore sharing the "real"
   processing power across many VMs.

o  Another important case stems from the need for partitioning
   functions. A noisy neighbor (VM hosting a VNF in an infinite loop)
   would ideally be isolated, and the performance of other VMs would
   continue according to their specifications.

o  System errors will likely occur as transients, implying a
   distribution of performance characteristics with a long tail (like
   latency), leading to the need for longer-term tests of each set of
   configuration and test parameters (see the sketch following this
   list).

o  The desire for elasticity and flexibility among network functions
   will include tests where there is constant flux in the number of
   VM instances, the resources the VMs require, and the
   set-up/tear-down of network paths that support VM connectivity.
   Requests for and instantiation of new VMs, along with releases of
   VMs hosting VNFs that are no longer needed, would be a normal
   operational condition. In other words, benchmarking should include
   scenarios with production life cycle management of VMs and their
   VNFs and network connectivity in progress, including VNF scaling
   up/down operations, as well as static configurations.

o  All physical things can fail, and benchmarking efforts can also
   examine recovery aided by the virtual architecture with different
   approaches to resiliency.

o  The sheer number of test conditions and configuration combinations
   encourages increased efficiency, including automated testing
   arrangements, combination sub-sampling through an understanding of
   inter-relationships, and machine-readable test results.
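As a non-normative illustration of the long-tail point above, the
sketch below summarizes a long latency run with tail percentiles; the
samples and the percentile choices are assumptions for illustration
only.

   # Minimal sketch: summarize a long latency run with tail
   # percentiles; samples and percentile choices are illustrative.

   def percentile(sorted_samples, p):
       """Nearest-rank percentile, p in 0..100."""
       rank = int(round(p / 100.0 * (len(sorted_samples) - 1)))
       return sorted_samples[min(rank, len(sorted_samples) - 1)]

   latency_ms = sorted(
       [1.1, 1.2, 1.1, 1.3, 9.8, 1.2, 1.1, 45.0, 1.2, 1.3])

   for p in (50, 99, 99.9):
       print(p, "th percentile latency:",
             percentile(latency_ms, p), "ms")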
Since many components of the new NFV Infrastructure are virtual,
test set-up design must have prior knowledge of
interactions/dependencies within the various resource domains in the
System Under Test (SUT). For example, a virtual machine performing
the role of a traditional tester function, such as generating and/or
receiving traffic, should avoid sharing any SUT resources with the
Device Under Test (DUT). Otherwise, the results will have unexpected
dependencies not encountered in physical device benchmarking.

Note: The term "tester" has traditionally referred to devices
dedicated to testing in BMWG literature. In this new context, "tester"
additionally refers to functions dedicated to testing, which may be
either virtual or physical. "Tester" has never referred to the
individuals performing the tests.

The shared-resource aspect of test design remains one of the critical
challenges to overcome in a way that produces useful results.
Benchmarking set-ups may designate isolated resources for the DUT and
other critical support components (such as the host/kernel) as the
first baseline step, and then add other loading processes. The added
complexity of each set-up leads to shared-resource testing scenarios,
where the characteristics of the competing load (in terms of memory,
storage, and CPU utilization) will directly affect the benchmarking
results (and the variability of the results), but the results should
reconcile with the baseline.
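A minimal, non-normative sketch of this reconciliation step follows;
the metric names, values, and the 10% tolerance are assumptions for
illustration only.

   # Minimal sketch: compare shared-resource results against the
   # isolated baseline and flag benchmarks that do not reconcile.
   # Metric names, values, and tolerance are illustrative only.

   baseline = {"throughput_mpps": 2.00, "mean_latency_us": 80.0}
   with_competing_load = {"throughput_mpps": 1.45,
                          "mean_latency_us": 210.0}

   TOLERANCE = 0.10  # fraction of baseline considered "reconciled"

   for metric, base_value in baseline.items():
       loaded_value = with_competing_load[metric]
       change = (loaded_value - base_value) / base_value
       status = "reconciles" if abs(change) <= TOLERANCE \
           else "investigate"
       print(metric, "baseline:", base_value, "loaded:", loaded_value,
             "change: {:+.1%}".format(change), "->", status)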
The physical test device remains a solid foundation to compare with
results using combinations of physical and virtual test functions, or
results using only virtual testers when necessary to assess virtual
interfaces and other virtual functions.

This section discusses considerations related to Benchmarks
applicable to VNFs and their associated technologies.

In order to compare the performance of VNFs and system
implementations with their physical counterparts, identical benchmarks
must be used. Since BMWG has already developed specifications for many
network functions, there will be re-use of existing benchmarks through
references, while allowing for the possibility of benchmark curation
during development of new methodologies. Consideration should be given
to quantifying the number of parallel VNFs required to achieve
comparable scale/capacity with a given physical device, or whether
some limit of scale was reached before the VNFs could achieve the
comparable level. Again, implementations based on different
hypervisors or other virtual function hosting remain critical factors
in performance assessment.

When the network functions under test are based on Open Source
code, there may be a tendency to rely on internal measurements to some
extent, especially when the externally-observable phenomena only
support an inference of internal events (such as routing protocol
convergence observed in the dataplane). Examples include CPU/Core
utilization, Network utilization, Storage utilization, and Memory
Committed/used. These "white-box" metrics provide one view of the
resource footprint of a VNF. Note: The resource utilization metrics
do not easily match the 3x4 Matrix.
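As a non-normative illustration, the sketch below samples such
white-box utilization metrics on the host while an external
(black-box) trial runs. It assumes the Python psutil package is
available on the host; the sampling cadence and duration are
arbitrary choices.

   # Minimal sketch: sample host resource-footprint metrics while an
   # external trial runs. Assumes the psutil package is installed;
   # 1-second cadence and 60 samples are illustrative choices.
   import time
   import psutil

   samples = []
   for _ in range(60):
       samples.append({
           "timestamp": time.time(),
           "cpu_percent": psutil.cpu_percent(interval=1),
           "memory_percent": psutil.virtual_memory().percent,
           "net_bytes_sent": psutil.net_io_counters().bytes_sent,
           "net_bytes_recv": psutil.net_io_counters().bytes_recv,
       })

   # These readings are auxiliary metrics only; they accompany, and
   # do not replace, the externally observed benchmark results.
   print("collected", len(samples), "utilization samples")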
However, external observations remain essential as the basis for
Benchmarks. Internal observations with fixed specification and
interpretation may be provided in parallel (as auxiliary metrics), to
assist the development of operations procedures when the technology
is deployed, for example. Internal metrics and measurements from Open
Source implementations may be the only direct source of performance
results in a desired dimension, but corroborating external
observations are still required to assure that the integrity of
measurement discipline was maintained for all reported results.

A related aspect of benchmark development is where the scope
includes multiple approaches to a common function under the same
benchmark. For example, there are many ways to arrange for activation
of a network path between interface points and the activation times
can be compared if the start-to-stop activation interval has a generic
and unambiguous definition. Thus, generic benchmark definitions are
preferred over technology/protocol specific definitions where
possible.

There will be new classes of benchmarks needed for network design
and assistance when developing operational practices (possibly
automated management and orchestration of deployment scale). Examples
follow in the paragraphs below, many of which are prompted by the
goals of increased elasticity and flexibility of the network
functions, along with accelerated deployment times.

o  Time to deploy VNFs: In cases where the general-purpose hardware
   is already deployed and ready for service, it is valuable to know
   the response time when a management system is tasked with
   "standing-up" 100's of virtual machines and the VNFs they will
   host (see the sketch following this list).

o  Time to migrate VNFs: In cases where a rack or shelf of hardware
   must be removed from active service, it is valuable to know the
   response time when a management system is tasked with "migrating"
   some number of virtual machines and the VNFs they currently host
   to alternate hardware that will remain in-service.

o  Time to create a virtual network in the general-purpose
   infrastructure: This is a somewhat simplified version of existing
   benchmarks for convergence time, in that the process is initiated
   by a request from (centralized or distributed) control, rather
   than inferred from network events (link failure). The successful
   response time would remain dependent on dataplane observations to
   confirm that the network is ready to perform.

o  Effect of verification measurements on performance: A complete
   VNF, or something as simple as a new policy in a VNF, is
   implemented. The action to verify instantiation of the VNF or
   policy could affect performance during normal operation.
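A minimal, non-normative sketch of the "Time to deploy VNFs" trial
above follows. The management-system calls (request_vnf_instance,
is_in_service) are hypothetical placeholders, not a real interface.

   # Minimal sketch of a "time to deploy VNFs" trial: request N
   # VM/VNF instantiations and record the interval until all are
   # confirmed in-service. The two helper functions stand in for a
   # hypothetical management-system API.
   import time

   N_INSTANCES = 100

   def request_vnf_instance(index):
       """Placeholder for a management-system instantiation request."""
       return "vnf-%d" % index

   def is_in_service(instance_id):
       """Placeholder for a readiness check (e.g., dataplane probe)."""
       return True

   start = time.monotonic()
   instances = [request_vnf_instance(i) for i in range(N_INSTANCES)]

   while not all(is_in_service(i) for i in instances):
       time.sleep(1)  # poll until every instance is confirmed ready

   elapsed = time.monotonic() - start
   print("time to deploy %d VNF instances: %.1f s"
         % (N_INSTANCES, elapsed))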
Also, it appears to be valuable to measure traditional packet
transfer performance metrics during the assessment of traditional and
new benchmarks, including metrics that may be used to support service
engineering such as the Spatial Composition metrics found in .
Examples include Mean one-way delay in section 4.1 of , Packet Delay
Variation (PDV) in , and Packet Reordering .

It can be useful to organize benchmarks according to their
applicable life cycle stage and the performance criteria they were
designed to assess. The table below provides a way to organize
benchmarks such that there is a clear indication of coverage for the
intersection of life cycle stages and performance criteria.
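One possible form of the coverage table, assuming the three
performance criteria are Speed, Accuracy, and Reliability:

                  |  SPEED  |  ACCURACY  |  RELIABILITY  |
   ---------------+---------+------------+---------------+
   Activation     |         |            |               |
   ---------------+---------+------------+---------------+
   Operation      |         |            |               |
   ---------------+---------+------------+---------------+
   De-activation  |         |            |               |
   ---------------+---------+------------+---------------+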
For example, the "Time to deploy VNFs" benchmark described above
would be placed in the intersection of Activation and Speed, making
it clear that there are other potential performance criteria to
benchmark, such as the "percentage of unsuccessful VM/VNF stand-ups"
in a set of 100 attempts. This example emphasizes that the Activation
and De-activation life cycle stages are key areas for NFV and related
infrastructure, and encourage expansion beyond traditional benchmarks
for normal operation. Thus, reviewing the benchmark coverage using
this table (sometimes called the 3x3 matrix) can be a worthwhile
exercise in BMWG.

In one of the first applications of the 3x3 matrix in BMWG , we
discovered
that metrics on measured size, capacity, or scale do not easily match
one of the three columns above. Following discussion, this was
resolved in two ways:

o  Add a column, Scale, for use when categorizing and assessing the
   coverage of benchmarks (without measured results). Examples of
   this use are found in and . This is the 3x4 Matrix.

o  If using the matrix to report results in an organized way, keep
   size, capacity, and scale metrics separate from the 3x3 matrix and
   incorporate them in the report with other qualifications of the
   results.

Note: The resource utilization (e.g., CPU) metrics do not fit in the
Matrix. They are not benchmarks, and omitting them confirms their
status as auxiliary metrics. Resource assignments are configuration
parameters, and these are reported separately.

This approach encourages use of the 3x3 matrix to organize reports
of results, where the capacity at which the various metrics were
measured could be included in the title of the matrix (and results for
multiple capacities would result in separate 3x3 matrices, if there
were sufficient measurements/results to organize in that way).

For example, results for each VM and VNF could appear in the 3x3
matrix, organized to illustrate resource occupation (CPU Cores) in a
particular physical computing system, as shown below.
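One possible organization, sketched with assumed core assignments
(the specific assignments are for illustration only):

   Core 1         : VNF#1 -> 3x3 results matrix for VNF#1
   Cores 2 - 5    : VNF#2 -> 3x3 results matrix for VNF#2
   Core 6 (shared): VNF#3, VNF#4, VNF#5
                          -> one 3x3 results matrix per VNF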
The combination of tables above could be built incrementally,
beginning with VNF#1 and one Core, then adding VNFs according to
their supporting core assignments. X-Y plots of critical benchmarks
would also provide insight into the effect of increased HW
utilization. All VNFs might be of the same type, or to match a
production environment there could be VNFs of multiple types and
categories. In this figure, VNFs #3-#5 are assumed to require small
CPU resources, while VNF#2 requires 4 cores to perform its function.

Although there is incomplete work to benchmark physical network
function power consumption in a meaningful way, the desire to measure
the physical infrastructure supporting the virtual functions only adds
to the need. Both maximum power consumption and dynamic power
consumption (with varying load) would be useful. The IPMI standard
has been implemented by many manufacturers,
and supports measurement of instantaneous energy consumption.

To assess the instantaneous energy consumption of virtual resources,
it may be possible to estimate the value using an overall metric
based on utilization readings, according to .
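As a non-normative illustration, the sketch below applies a simple
linear utilization-based model between measured idle and full-load
platform power; the model and its coefficients are assumptions for
illustration, not the method of any particular reference.

   # Minimal sketch: apportion measured platform power to a virtual
   # resource using a simple linear utilization model. The idle/max
   # power readings (e.g., obtained via IPMI) and the utilization
   # shares are illustrative assumptions.

   P_IDLE_WATTS = 120.0  # measured platform power at idle
   P_MAX_WATTS = 300.0   # measured platform power at full load

   def platform_power(utilization):
       """Estimate platform power at overall utilization (0..1)."""
       return P_IDLE_WATTS + (P_MAX_WATTS - P_IDLE_WATTS) * utilization

   # Example: overall utilization 0.6, VNF#1 contributing half of it.
   overall_util = 0.6
   vnf1_share = 0.5
   dynamic_power = platform_power(overall_util) - P_IDLE_WATTS
   print("estimated power attributable to VNF#1: %.1f W"
         % (vnf1_share * dynamic_power))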
Security Considerations

Benchmarking activities as described in this memo are limited to
technology characterization of a Device Under Test/System Under Test
(DUT/SUT) using controlled stimuli in a laboratory environment, with
dedicated address space and the constraints specified in the sections
above.

The benchmarking network topology will be an independent test setup
and MUST NOT be connected to devices that may forward the test
traffic into a production network, or misroute traffic to the test
management network.

Further, benchmarking is performed on a "black-box" basis, relying
solely on measurements observable external to the DUT/SUT.

Special capabilities SHOULD NOT exist in the DUT/SUT specifically for
benchmarking purposes. Any implications for network security arising
from the DUT/SUT SHOULD be identical in the lab and in production
networks.

IANA Considerations

No IANA Action is requested at this time.

Acknowledgements

The author acknowledges an encouraging conversation on this topic
with Mukhtiar Shaikh and Ramki Krishnan in November 2013. Bhavani Parise
and Ilya Varlashkin have provided useful suggestions to expand these
considerations. Bhuvaneswaran Vengainathan has already tried the 3x3
matrix with SDN controller draft, and contributed to many discussions.
Scott Bradner quickly pointed out shared resource dependencies in an
early vSwitch measurement proposal, and the topic was included here as a
key consideration. Further development was encouraged by Barry
Constantine's comments following the IETF-92 BMWG session: the session
itself was an affirmation for this memo. There have been many
interesting contributions from Maryam Tahhan, Marius Georgescu, Jacob
Rapp, Saurabh Chattopadhyay, and others.

Version History

(This section should be removed by the RFC Editor.)

Version 02:

o  New version history section.

o  Added Memory in section 3.2, configuration.

o  Updated ACKs and References.

Version 01:

o  Addressed Ramki Krishnan's comments on section 4.5, power; see
   that section (7/27 message to the list).

o  Addressed Saurabh Chattopadhyay's 7/24 comments on VNF resources
   and other resource conditions and their effect on benchmarking;
   see section 3.4.

o  Addressed Marius Georgescu's 7/17 comments on the list (sections
   4.3 and 4.4).

AND, comments from the extended discussion during the IETF-93 BMWG
session:

o  Section 4.2: VNF footprint and auxiliary metrics (Maryam Tahhan)

o  Section 4.3: Verification effect metrics (Ramki Krishnan)

o  Section 4.4: Auxiliary metrics in the Matrix (Maryam Tahhan,
   Scott Bradner, others)

References

   Network Function Virtualization: Performance and Portability Best
   Practices

   Intelligent Platform Management Interface, v2.0 with latest Errata