Network Working Group                                          T. Daede
Internet-Draft                                                  Mozilla
Intended status: Informational                                A. Norkin
Expires: September 16, 2016                                     Netflix
                                                          March 15, 2016

             Video Codec Testing and Quality Measurement
                      draft-ietf-netvc-testing-02

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec specified at the IETF.  This covers subjective and
   objective tests, test conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   This Internet-Draft will expire on September 16, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Subjective viewing test
     2.3.  Expert viewing
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  Fast Multi-Scale SSIM
     3.7.  CIEDE2000
     3.8.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  Bjontegaard
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency CQP
       5.3.3.  Low Latency CQP
       5.3.4.  Unconstrained High Latency
       5.3.5.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  Informative References
   Authors' Addresses

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how tests are
   to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes
   should provide the results of the appropriate test when proposing
   codec modifications.

2.  Subjective quality tests

   Subjective testing is the preferred method of testing video
   codecs.

   Because the IETF does not have testing resources of its own, it
   has to rely on the resources of its participants.  For this
   reason, even if the group agrees that a particular test is
   important, if no one volunteers to do it, or if volunteers do not
   complete it in a timely fashion, then that test should be
   discarded.  This ensures that only important tests are done, in
   particular the tests that are important to participants.

2.1.  Still Image Pair Comparison

   A simple way to determine the superiority of one compressed image
   over another is to visually compare the two compressed images, and
   have the viewer judge which one has higher quality.  This is
   mainly used for rapid comparisons during development.  For this
   test, the two compressed images should have similar compressed
   file sizes, with one image being no more than 5% larger than the
   other.  In addition, at least 5 different images should be
   compared.
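   As an illustration of the file size constraint above, the
   following non-normative Python sketch checks whether two
   compressed images are close enough in size to be fairly compared.
   The 5% threshold comes from this section; the helper name and the
   file paths are hypothetical.

      import os

      def sizes_comparable(path_a, path_b, tolerance=0.05):
          # True if neither file is more than 5% larger than the other.
          size_a = os.path.getsize(path_a)
          size_b = os.path.getsize(path_b)
          larger = max(size_a, size_b)
          smaller = min(size_a, size_b)
          return larger <= (1.0 + tolerance) * smaller

      # Example (hypothetical paths):
      # sizes_comparable("codec_a/image01.bin", "codec_b/image01.bin")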
2.2.  Subjective viewing test

   A subjective viewing test is the preferred method of evaluating
   quality.  The subjective test should be performed either by
   showing the video sequences consecutively on one screen or by
   showing them on two screens located side by side.  The testing
   procedure should normally follow the rules described in [BT500]
   and be performed with non-expert test subjects.  The result of the
   test can be (depending on the test procedure) mean opinion scores
   (MOS) or differential mean opinion scores (DMOS).  Normally,
   confidence intervals are also calculated to judge whether the
   difference between two encodings is statistically significant.

2.3.  Expert viewing

   An expert viewing test can be performed when an answer to a
   specific question is needed.  An example of such a test is one in
   which video coding experts evaluate a particular problem, such as
   comparing the results of two de-ringing filters.  Depending on
   what information is sought, the appropriate test procedure can be
   chosen.

3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala
   tools repository [DAALA-GIT].

   Unless otherwise specified, all of the metrics described below
   apply only to the luma plane, individually by frame.  When applied
   to a video, the scores of each frame are averaged to create the
   final score.

   Codecs are allowed to use downsampling internally, but must
   include a normative upsampler, so that the metrics are run at the
   same resolution as the source video.  In addition, some metrics,
   such as PSNR and FASTSSIM, behave poorly on downsampled images, so
   it must be noted in test results if downsampling is in effect.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean squared error (MSE) or its
   square root (RMSE).  The formula used is:

      20 * log10 ( MAX / RMSE )

   or, equivalently:

      10 * log10 ( MAX^2 / MSE )

   where MAX is the maximum pixel value of the format (for example,
   255 for 8-bit video) and the error is computed over all the pixels
   in the video, which is the method used in the dump_psnr.c
   reference implementation.

   This metric may be applied to both the luma and chroma planes,
   with all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per frame, and then the values
   averaged together.  This is reported in the same way as overall
   PSNR.
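   The following non-normative Python sketch illustrates the
   difference between overall PSNR (Section 3.1), where the squared
   error is accumulated over every pixel of the video before the
   logarithm is taken, and frame-averaged PSNR (Section 3.2), where a
   PSNR value is computed for each frame and the values are then
   averaged.  It assumes 8-bit luma frames supplied as NumPy arrays
   and sequences that are not identical (MSE > 0); the normative
   implementation remains dump_psnr.c in [DAALA-GIT].

      import numpy as np

      MAX = 255.0  # assumes 8-bit video

      def overall_psnr(ref_frames, test_frames):
          # Accumulate squared error over all pixels in the video.
          squared_error = 0.0
          pixels = 0
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              squared_error += np.sum(diff * diff)
              pixels += diff.size
          mse = squared_error / pixels
          return 10.0 * np.log10(MAX * MAX / mse)

      def frame_averaged_psnr(ref_frames, test_frames):
          # Compute PSNR independently for each frame, then average.
          scores = []
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              mse = np.mean(diff * diff)
              scores.append(10.0 * np.log10(MAX * MAX / mse))
          return float(np.mean(scores))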
3.3.  PSNR-HVS-M

   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients.  Several different sets of weights have been
   considered [PSNRHVS].  The weights used by the dump_pnsrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The original paper produces scores ranging between
   0 and 1.

   To make the metric appear more linear on BD-rate curves, the score
   is converted into a nonlinear decibel scale:

      -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].

3.6.  Fast Multi-Scale SSIM

   Fast MS-SSIM is a modified implementation of MS-SSIM which
   operates on a limited number of scales and with modified weights
   [FASTSSIM].  The final score is converted to decibels in the same
   manner as SSIM.

3.7.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score that takes into account all three
   color planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.

3.8.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms and fusing them with a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features for
   the SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in
   decibels.  If the quality metric does not natively report quality
   in decibels, it should be converted as described in the previous
   section.

4.2.  Bjontegaard

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric
   score ranges are not static; they are calculated either from a
   range of bitrates of the reference codec, or from quantizers of a
   third, anchor codec.  Given a reference codec, a test codec, and
   ranges, BD-rate values are calculated as follows (a non-normative
   sketch of the procedure follows the list):

   o  Rate/distortion points are calculated for the reference and
      test codecs.  There need to be enough points so that at least
      four points lie within the quality range.

   o  The rates are converted into log-rates.

   o  A piecewise cubic Hermite interpolating polynomial is fit to
      the points for each codec to produce functions of distortion in
      terms of log-rate.

   o  Metric score ranges are computed:

      *  If using a bitrate range, metric score ranges are computed
         by converting the rate bounds into log-rates and then
         looking up scores of the reference codec using the
         interpolating polynomial.

      *  If using a quantizer range, a third, anchor codec is used to
         generate metric scores for the quantizer bounds.  The anchor
         codec makes the range immune to quantizer changes.

   o  The log-rate is numerically integrated over the metric range
      for each curve.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.
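   The following non-normative Python sketch computes a BD-rate value
   following the procedure above, using SciPy's piecewise cubic
   Hermite interpolation.  For simplicity it fits log-rate directly
   as a function of the metric score, rather than inverting a
   distortion-versus-log-rate fit, which assumes the rate/distortion
   points are monotonic in quality.  The function name, argument
   layout, and the (bitrate, score) tuple representation are
   illustrative assumptions; the score range is chosen as described
   in Section 4.3.

      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_points, test_points, score_range):
          # ref_points and test_points are lists of (bitrate, score)
          # pairs, one per quantizer or target rate.  The result is
          # the average percent rate difference of the test codec
          # relative to the reference codec over the (low, high) score
          # range; negative values mean the test codec saves rate.
          low, high = score_range

          def average_log_rate(points):
              # Fit a piecewise cubic Hermite interpolating polynomial
              # of log-rate as a function of metric score (scores must
              # be strictly increasing after sorting), then integrate
              # over the metric score range.
              points = sorted(points, key=lambda p: p[1])
              scores = [p[1] for p in points]
              log_rates = np.log([p[0] for p in points])
              curve = PchipInterpolator(scores, log_rates)
              return curve.integrate(low, high) / (high - low)

          # Convert the average log-rates back into linear rate and
          # report the percent difference from reference to test.
          return (np.exp(average_log_rate(test_points) -
                         average_log_rate(ref_points)) - 1.0) * 100.0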
4.3.  Ranges

   For all tests described in this document, quantizers of an anchor
   codec are used to determine the quality ranges.  The anchor codec
   used for ranges is libvpx 1.5.0, run with VP9 and the High Latency
   CQP settings.  The quality range used is that achieved between
   cq-level 20 and 60.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large
   amount of content on the internet needs to be recompressed at
   least once, so some sources of this nature are useful.  The
   encoder should run at the same bit depth as the original source.
   In addition, metrics need to support operation at high bit depth.
   If one or more codecs in a comparison do not support high bit
   depth, sources need to be converted once before entering the
   encoder.

5.2.  Test Sets

   Sources are divided into several categories to test the different
   scenarios the codec will be required to operate in.  For easier
   comparison, all videos in each set should have the same color
   subsampling, the same resolution, and the same number of frames.
   In addition, all test videos must be publicly available for
   testing use, to allow for reproducibility of results.  All current
   test sets are available for download [TESTSEQUENCES].

   o  Still images are useful when comparing intra coding
      performance.  Xiph.org provides four sets of lossless,
      one-megapixel images that have been converted into YUV 4:2:0
      format:

      *  subset1 (50 images)

      *  subset2 (50 images)

      *  subset3 (1000 images)

      *  subset4 (1000 images)

   o  video-hd-3, a set that consists of 1920x1080 clips from
      [DERFVIDEO] (1500 frames total)

   o  vc-360p-1, a low quality video conferencing set (2700 frames
      total)

   o  vc-720p-1, a high quality video conferencing set (2750 frames
      total)

   o  netflix-4k-1, a cinematic 4K video test set (2280 frames total)

   o  netflix-2k-1, a 2K scaled version of netflix-4k-1 (2280 frames
      total)

   o  twitch-1, a game sequence set (2280 frames total)

5.3.  Operating Points

   Four operating modes are defined.  High latency is intended for on
   demand streaming, one-to-many live streaming, and stored video.
   Low latency is intended for videoconferencing and remote access.
   Both of these modes come in CQP (constant quantization parameter)
   and unconstrained variants.  When testing still image sets, such
   as subset1, high latency CQP mode should be used.

5.3.1.  Common settings

   Encoders should be configured to their best settings when being
   compared against each other:

   o  vp10: -codec=vp10 -ivf -frame-parallel=0 -tile-columns=0
      -cpu-used=0 -threads=1

5.3.2.  High Latency CQP

   High Latency CQP is used for evaluating incremental changes to a
   codec.  It should not be used to compare unrelated codecs to each
   other.  It allows codec features with intrinsic frame delay.  A
   sketch of sweeping the quantizer parameter x to produce
   rate/distortion points follows the configuration list.

   o  daala: -v=x -b 2

   o  vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

   o  vp10: -end-usage=q -cq-level=x -lag-in-frames=25
      -auto-alt-ref=2
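   As an illustration of how rate/distortion points might be gathered
   for the High Latency CQP operating point above, the following
   non-normative Python sketch sweeps the quantizer placeholder "x"
   over a set of values.  The vpxenc binary name (assuming a libvpx
   build with VP10 support), the double-dash spelling of the options,
   the quantizer values, and the output file names are assumptions
   for illustration only.

      import subprocess

      QUANTIZERS = [20, 32, 43, 55, 63]  # example values only

      def encode_high_latency_cqp(source_y4m, q):
          # Flags follow Sections 5.3.1 and 5.3.2, with the
          # placeholder "x" replaced by the quantizer value under
          # test.
          out = "out-q%d.ivf" % q
          cmd = ["vpxenc", "--codec=vp10", "--ivf",
                 "--frame-parallel=0", "--tile-columns=0",
                 "--cpu-used=0", "--threads=1",
                 "--end-usage=q", "--cq-level=%d" % q,
                 "--lag-in-frames=25", "--auto-alt-ref=2",
                 "-o", out, source_y4m]
          subprocess.check_call(cmd)
          return out

      # for q in QUANTIZERS:
      #     encode_high_latency_cqp("clip.y4m", q)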
5.3.3.  Low Latency CQP

   Low Latency CQP is used for evaluating incremental changes to a
   codec.  It should not be used to compare unrelated codecs to each
   other.  It requires the codec to be set for zero intrinsic frame
   delay.

   o  daala: -v=x

   o  vp10: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available,
   using the mode that will provide the best quality per bitrate (VBR
   or constant quality mode).  Lookahead and/or two-pass encoding are
   allowed, if supported.  One parameter is provided to adjust
   bitrate, but the units are arbitrary.  Example configurations
   follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  vp10: -end-usage=q -cq-level=x -lag-in-frames=25
      -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available,
   using the mode that will provide the best quality per bitrate (VBR
   or constant quality mode), but no frame delay, buffering, or
   lookahead is allowed.  One parameter is provided to adjust
   bitrate, but the units are arbitrary.  Example configurations
   follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  vp10: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist to automate the
   process of objective comparisons.  The Compare-Codecs tool allows
   BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate the various metrics used.  In
   addition, these scripts can be run automatically on distributed
   computers for fast results, with the AreWeCompressedYet tool
   [AWCY].  Because of computational constraints, several levels of
   testing are specified.

6.1.  Regression tests

   Regression tests are run on a small number of short sequences.
   The regression tests should include a number of different test
   conditions.  The purpose of regression tests is to ensure that bug
   fixes (and similar patches) do not negatively affect performance.
   The anchor in regression tests is the previous revision of the
   codec in source control.  Regression tests are run on the
   following sets, in both high and low latency CQP modes:

   o  vc-720p-1

   o  netflix-2k-1

6.2.  Objective performance tests

   Changes that are expected to affect the quality of the encode or
   the bitstream should run an objective performance test.  The
   performance tests should be run on a wider number of sequences.
   If the option for the objective performance test is chosen, wide
   range and full length simulations are run on the site and the
   results (including all the objective metrics) are generated.
   Objective performance tests are run on the following sets, in both
   high and low latency CQP modes (a sketch of a driver for this
   set-by-mode matrix follows the list):

   o  video-hd-3

   o  netflix-2k-1

   o  netflix-4k-1

   o  vc-720p-1

   o  vc-360p-1

   o  twitch-1
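   The set-by-mode structure shared by the regression tests
   (Section 6.1) and the objective performance tests above could be
   driven by a small script such as the following non-normative
   Python sketch.  The encode_and_measure and bd_rate callables (the
   latter as sketched after the procedure in Section 4.2), the codec
   identifiers, the mode names, and the score range are assumptions
   supplied by the caller, not part of this document.

      MODES = ["high-latency-cqp", "low-latency-cqp"]

      def run_tests(test_sets, anchor_codec, test_codec, score_range,
                    encode_and_measure, bd_rate):
          # encode_and_measure(codec, test_set, mode) is expected to
          # return a list of (bitrate, score) pairs; for regression
          # tests, the anchor is the previous revision of the codec in
          # source control.
          results = {}
          for test_set in test_sets:
              for mode in MODES:
                  ref = encode_and_measure(anchor_codec, test_set, mode)
                  test = encode_and_measure(test_codec, test_set, mode)
                  results[(test_set, mode)] = bd_rate(ref, test,
                                                      score_range)
          return results

      # Regression tests (Section 6.1), as an example:
      # run_tests(["vc-720p-1", "netflix-2k-1"], ...)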
6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2015.

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012.

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012.

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015.

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015.

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d.

   [FASTSSIM] Chen, M. and A. Bovik, "Fast structural similarity
              index algorithm", 2010.

   [L1100]    Bossen, F., "Common test conditions and software
              reference configurations", JCTVC-L1100, 2013.

   [MSSSIM]   Wang, Z., Simoncelli, E., and A. Bovik, "Multi-Scale
              Structural Similarity for Image Quality Assessment",
              n.d.

   [PSNRHVS]  Egiazarian, K., Astola, J., Ponomarenko, N., Lukin, V.,
              Battisti, F., and M. Carli, "A New Full-Reference
              Quality Metrics Based on HVS", 2002.

   [SSIM]     Wang, Z., Bovik, A., Sheikh, H., and E. Simoncelli,
              "Image Quality Assessment: From Error Visibility to
              Structural Similarity", 2004.

   [STEAM]    Valve Corporation, "Steam Hardware & Software Survey:
              June 2015", June 2015.

   [TESTSEQUENCES]
              Daede, T., "Test Sets", n.d.

   [VMAF]     Aaron, A., Li, Z., Manohara, M., Lin, J., Wu, E., and
              C. Kuo, "Challenges in cloud based ingest and encoding
              for high quality streaming media", 2015.

Authors' Addresses

   Thomas Daede
   Mozilla

   Email: tdaede@mozilla.com

   Andrey Norkin
   Netflix

   Email: anorkin@netflix.com