Network Working Group                                          T. Daede
Internet-Draft                                                  Mozilla
Intended status: Informational                                A. Norkin
Expires: January 3, 2019                                         Netflix
                                                          I. Brailovskiy
                                                           Amazon Lab126
                                                            July 02, 2018

             Video Codec Testing and Quality Measurement
                      draft-ietf-netvc-testing-07

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Subjective quality tests . . . . . . . . . . . . . . . . . .   3
     2.1.  Still Image Pair Comparison  . . . . . . . . . . . . . .   3
     2.2.  Video Pair Comparison  . . . . . . . . . . . . . . . . .   4
     2.3.  Mean Opinion Score . . . . . . . . . . . . . . . . . . .   4
   3.  Objective Metrics  . . . . . . . . . . . . . . . . . . . . .   5
     3.1.  Overall PSNR . . . . . . . . . . . . . . . . . . . . . .   5
     3.2.  Frame-averaged PSNR  . . . . . . . . . . . . . . . . . .   5
     3.3.  PSNR-HVS-M . . . . . . . . . . . . . . . . . . . . . . .   5
     3.4.  SSIM . . . . . . . . . . . . . . . . . . . . . . . . . .   6
     3.5.  Multi-Scale SSIM . . . . . . . . . . . . . . . . . . . .   6
     3.6.  CIEDE2000  . . . . . . . . . . . . . . . . . . . . . . .   6
     3.7.  VMAF . . . . . . . . . . . . . . . . . . . . . . . . . .   6
   4.  Comparing and Interpreting Results . . . . . . . . . . . . .   7
     4.1.  Graphing . . . . . . . . . . . . . . . . . . . . . . . .   7
     4.2.  BD-Rate  . . . . . . . . . . . . . . . . . . . . . . . .   7
     4.3.  Ranges . . . . . . . . . . . . . . . . . . . . . . . . .   8
   5.  Test Sequences . . . . . . . . . . . . . . . . . . . . . . .   8
     5.1.  Sources  . . . . . . . . . . . . . . . . . . . . . . . .   8
     5.2.  Test Sets  . . . . . . . . . . . . . . . . . . . . . . .   8
       5.2.1.  regression-1 . . . . . . . . . . . . . . . . . . . .   8
       5.2.2.  objective-2-slow . . . . . . . . . . . . . . . . . .   9
       5.2.3.  objective-2-fast . . . . . . . . . . . . . . . . . .  12
       5.2.4.  objective-1.1  . . . . . . . . . . . . . . . . . . .  14
       5.2.5.  objective-1-fast . . . . . . . . . . . . . . . . . .  17
     5.3.  Operating Points . . . . . . . . . . . . . . . . . . . .  19
       5.3.1.  Common settings  . . . . . . . . . . . . . . . . . .  19
       5.3.2.  High Latency CQP . . . . . . . . . . . . . . . . . .  19
       5.3.3.  Low Latency CQP  . . . . . . . . . . . . . . . . . .  19
       5.3.4.  Unconstrained High Latency . . . . . . . . . . . . .  20
       5.3.5.  Unconstrained Low Latency  . . . . . . . . . . . . .  20
   6.  Automation . . . . . . . . . . . . . . . . . . . . . . . . .  20
     6.1.  Regression tests . . . . . . . . . . . . . . . . . . . .  21
     6.2.  Objective performance tests  . . . . . . . . . . . . . .  21
     6.3.  Periodic tests . . . . . . . . . . . . . . . . . . . . .  22
   7.  Informative References . . . . . . . . . . . . . . . . . . .  22
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  23

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing is the preferable method of testing video codecs.

   Subjective testing results take priority over objective testing
   results, when available.
   Subjective testing is recommended especially when taking advantage
   of psychovisual effects that may not be well represented by
   objective metrics, or when different objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.

   Testing relies on the resources of participants.  For this reason,
   even if the group agrees that a particular test is important, if no
   one volunteers to do it, or if volunteers do not complete it in a
   timely fashion, then that test should be discarded.  This ensures
   that only important tests are done - in particular, the tests that
   are important to participants.

   Subjective tests should use the same operating points as the
   objective tests.

2.1.  Still Image Pair Comparison

   A simple way to determine the superiority of one compressed image
   is to visually compare two compressed images, and have the viewer
   judge which one has higher quality.  For example, this test may be
   suitable for an intra de-ringing filter, but not for a new inter
   prediction mode.  For this test, the two compressed images should
   have similar compressed file sizes, with one image being no more
   than 5% larger than the other.  In addition, at least 5 different
   images should be compared.

   Once testing is complete, a p-value can be computed using the
   binomial test.  A significant result should have a resulting
   p-value less than or equal to 0.05.  For example:

      p_value = binom_test(a, a+b)

   where a is the number of votes for one image, b is the number of
   votes for the second image, and binom_test(x, y) computes a p-value
   from the binomial distribution with x observed successes, y total
   trials, and expected probability 0.5.

   If ties are allowed to be reported, then the equation is modified:

      p_value = binom_test(a + floor(t/2), a+b+t)

   where t is the number of tie votes.

   Still image pair comparison is used for rapid comparisons during
   development - the viewer may be either a developer or a user, for
   example.  As the results are only relative, it is effective even
   with an inconsistent viewing environment.  Because this test only
   uses still images (keyframes), it is only suitable for changes with
   similar or no effect on inter frames.

2.2.  Video Pair Comparison

   The still image pair comparison method can be modified to also
   compare videos.  This is necessary when making changes with
   temporal effects, such as changes to inter-frame prediction.  Video
   pair comparisons follow the same procedure as still images.  Videos
   used for testing should be limited to 10 seconds in length, and can
   be rewatched an unlimited number of times.

2.3.  Mean Opinion Score

   A Mean Opinion Score (MOS) viewing test is the preferred method of
   evaluating quality.  The subjective test should be performed either
   by showing the video sequences consecutively on one screen or by
   showing them simultaneously on two screens located side by side.
   The testing procedure should normally follow the rules described in
   [BT500] and be performed with non-expert test subjects.  The result
   of the test will be (depending on the test procedure) mean opinion
   scores (MOS) or differential mean opinion scores (DMOS).
   Confidence intervals are also calculated to judge whether the
   difference between two encodings is statistically significant.  In
   certain cases, a viewing test with expert test subjects can be
   performed, for example if a test evaluates technologies with
   similar performance with respect to a particular artifact (e.g.
   loop filters or motion prediction).  Unlike pair comparisons, a MOS
   test requires a consistent testing environment.  This means that
   for large scale or distributed tests, pair comparisons are
   preferred.
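   The statistics above can be illustrated with a short Python sketch.
   This is an informative aid only, not part of the testing procedure;
   it assumes NumPy and SciPy, with SciPy's binomtest playing the role
   of the binom_test function referenced in Section 2.1:

      import math
      import numpy as np
      from scipy import stats

      def pair_comparison_p_value(a, b, t=0):
          # Binomial test for a pair comparison (Section 2.1): a and b
          # are the vote counts for the two encodings, t is the number
          # of tie votes.
          successes = a + t // 2
          trials = a + b + t
          return stats.binomtest(successes, trials, p=0.5).pvalue

      def mos_confidence_interval(scores, confidence=0.95):
          # Mean opinion score and confidence interval half-width
          # (Section 2.3), using a Student's t critical value.
          scores = np.asarray(scores, dtype=float)
          mean = float(scores.mean())
          sem = scores.std(ddof=1) / math.sqrt(len(scores))
          t_crit = stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
          return mean, float(t_crit * sem)

      # Example: 10 votes for encoding A, 2 for B, no ties.
      print(pair_comparison_p_value(10, 2))   # ~0.039, significant
      print(mos_confidence_interval([4, 5, 4, 3, 4, 5, 4, 4]))

   The returned confidence interval half-width can be compared against
   the difference between two MOS values to judge significance, as
   described above.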
3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below only
   apply to the luma plane, individually by frame.  When applied to a
   video, the scores of each frame are averaged to create the final
   score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean square error (MSE), or its
   square root (RMSE).  The formula used is:

      20 * log10 ( MAX / RMSE )

   or, equivalently:

      10 * log10 ( MAX^2 / MSE )

   where MAX is the maximum possible pixel value and the error is
   computed over all the pixels in the video, which is the method used
   in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per-frame, and then the values averaged
   together.  This is reported in the same way as overall PSNR.

3.3.  PSNR-HVS-M

   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients.  Several different sets of weights have been
   considered [PSNRHVS].  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The original paper produces scores ranging between 0
   and 1.

   To linearize the metric for BD-Rate computation, the score is
   converted into a nonlinear decibel scale:

      -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].  The metric score is converted to decibels in the same
   way as SSIM.

3.6.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.
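   As an informal illustration of the conventions above (overall
   versus frame-averaged pooling, and the decibel conversion used for
   SSIM and Multi-Scale SSIM), a Python sketch follows.  It assumes
   8-bit planes held in NumPy arrays; the C tools in the Daala
   repository [DAALA-GIT] remain the exact implementations:

      import numpy as np

      def overall_psnr(ref_frames, test_frames, max_value=255):
          # Overall PSNR (Section 3.1): MSE is pooled over every pixel
          # in the video before conversion to decibels.
          se, count = 0.0, 0
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              se += np.sum(diff * diff)
              count += diff.size
          mse = se / count
          return 10.0 * np.log10(max_value ** 2 / mse)

      def frame_averaged_psnr(ref_frames, test_frames, max_value=255):
          # Frame-averaged PSNR (Section 3.2): per-frame PSNR values
          # are averaged to produce the final score.
          scores = []
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              mse = np.mean(diff * diff)
              scores.append(10.0 * np.log10(max_value ** 2 / mse))
          return float(np.mean(scores))

      def ssim_to_db(ssim_score):
          # Decibel conversion for SSIM and Multi-Scale SSIM scores in
          # [0, 1), used before BD-rate computation (Sections 3.4-3.5).
          return -10.0 * np.log10(1.0 - ssim_score)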
3.7.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms, and fusing them using a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features to the
   SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

   The quality score from VMAF is used directly to calculate BD-Rate,
   without any conversions.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.

4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and test codec, BD-rate
   values are calculated as follows:

   o  Rate/distortion points are calculated for the reference and test
      codec.

      *  At least four points must be computed.  These points should
         be the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic Hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the two curves, bound by the chosen
         quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.

4.3.  Ranges

   For individual feature changes in libaom or libvpx, the overlap BD-
   Rate method with quantizers 20, 32, 43, and 55 must be used.

   For the final evaluation described in
   [I-D.ietf-netvc-requirements], the quantizers used are 20, 24, 28,
   32, 36, 39, 43, 47, 51, and 55.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.
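   The BD-rate procedure of Section 4.2 can be sketched in Python as
   follows.  This is an informative outline only, not a substitute for
   the automation described in Section 6; it assumes SciPy's PCHIP
   interpolator and distinct metric scores at each rate point:

      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_rates, ref_scores, test_rates, test_scores,
                  samples=1000):
          # Average percent rate difference of the test codec relative
          # to the reference, over the overlapping metric-score range.
          def fit(rates, scores):
              scores = np.asarray(scores, dtype=float)
              log_rates = np.log(np.asarray(rates, dtype=float))
              order = np.argsort(scores)  # PCHIP needs increasing x
              interp = PchipInterpolator(scores[order], log_rates[order])
              return interp, scores.min(), scores.max()

          ref_fn, ref_lo, ref_hi = fit(ref_rates, ref_scores)
          test_fn, test_lo, test_hi = fit(test_rates, test_scores)

          # Overlap of the two curves, bound by the chosen points.
          lo, hi = max(ref_lo, test_lo), min(ref_hi, test_hi)
          grid = np.linspace(lo, hi, samples)
          step = (hi - lo) / (samples - 1)

          def integrate(fn):
              # Trapezoidal integration of log-rate over the range.
              y = fn(grid)
              return step * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

          avg_log_ref = integrate(ref_fn) / (hi - lo)
          avg_log_test = integrate(test_fn) / (hi - lo)

          # Convert back to linear rate; percent difference from the
          # reference to the test codec.
          return (np.exp(avg_log_test - avg_log_ref) - 1.0) * 100.0

   A negative return value indicates that the test codec reaches the
   same metric scores at a lower average rate than the reference.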
In addition, metrics need 354 to support operation at high bit depth. If one or more codecs in a 355 comparison do not support high bit depth, sources need to be 356 converted once before entering the encoder. 358 5.2. Test Sets 360 Sources are divided into several categories to test different 361 scenarios the codec will be required to operate in. For easier 362 comparison, all videos in each set should have the same color 363 subsampling, same resolution, and same number of frames. In 364 addition, all test videos must be publicly available for testing use, 365 to allow for reproducibility of results. All current test sets are 366 available for download [TESTSEQUENCES]. 368 Test sequences should be downloaded in whole. They should not be 369 recreated from the original sources. 371 5.2.1. regression-1 373 This test set is used for basic regression testing. It contains a 374 very small number of clips. 376 o kirlandvga (640x360, 8bit, 4:2:0, 300 frames) 378 o FourPeople (1280x720, 8bit, 4:2:0, 60 frames) 380 o Narrarator (4096x2160, 10bit, 4:2:0, 15 frames) 382 o CSGO (1920x1080, 8bit, 4:4:4 60 frames) 384 5.2.2. objective-2-slow 386 This test set is a comprehensive test set, grouped by resolution. 387 These test clips were created from originals at [TESTSEQUENCES]. 388 They have been scaled and cropped to match the resolution of their 389 category. This test set requires compiling with high bit depth 390 support. 392 4096x2160, 4:2:0, 60 frames: 394 o Netflix_BarScene_4096x2160_60fps_10bit_420_60f 396 o Netflix_BoxingPractice_4096x2160_60fps_10bit_420_60f 398 o Netflix_Dancers_4096x2160_60fps_10bit_420_60f 400 o Netflix_Narrator_4096x2160_60fps_10bit_420_60f 402 o Netflix_RitualDance_4096x2160_60fps_10bit_420_60f 404 o Netflix_ToddlerFountain_4096x2160_60fps_10bit_420_60f 406 o Netflix_WindAndNature_4096x2160_60fps_10bit_420_60f 408 o street_hdr_amazon_2160p 410 1920x1080, 4:2:0, 60 frames: 412 o aspen_1080p_60f 414 o crowd_run_1080p50_60f 416 o ducks_take_off_1080p50_60f 418 o guitar_hdr_amazon_1080p 420 o life_1080p30_60f 422 o Netflix_Aerial_1920x1080_60fps_8bit_420_60f 423 o Netflix_Boat_1920x1080_60fps_8bit_420_60f 425 o Netflix_Crosswalk_1920x1080_60fps_8bit_420_60f 427 o Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f 429 o Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f 431 o Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f 433 o Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f 435 o old_town_cross_1080p50_60f 437 o pan_hdr_amazon_1080p 439 o park_joy_1080p50_60f 441 o pedestrian_area_1080p25_60f 443 o rush_field_cuts_1080p_60f 445 o rush_hour_1080p25_60f 447 o seaplane_hdr_amazon_1080p 449 o station2_1080p25_60f 451 o touchdown_pass_1080p_60f 453 1280x720, 4:2:0, 120 frames: 455 o boat_hdr_amazon_720p 457 o dark720p_120f 459 o FourPeople_1280x720_60_120f 461 o gipsrestat720p_120f 463 o Johnny_1280x720_60_120f 465 o KristenAndSara_1280x720_60_120f 467 o Netflix_DinnerScene_1280x720_60fps_8bit_420_120f 469 o Netflix_DrivingPOV_1280x720_60fps_8bit_420_120f 470 o Netflix_FoodMarket2_1280x720_60fps_8bit_420_120f 472 o Netflix_RollerCoaster_1280x720_60fps_8bit_420_120f 474 o Netflix_Tango_1280x720_60fps_8bit_420_120f 476 o rain_hdr_amazon_720p 478 o vidyo1_720p_60fps_120f 480 o vidyo3_720p_60fps_120f 482 o vidyo4_720p_60fps_120f 484 640x360, 4:2:0, 120 frames: 486 o blue_sky_360p_120f 488 o controlled_burn_640x360_120f 490 o desktop2360p_120f 492 o kirland360p_120f 494 o mmstationary360p_120f 496 o niklas360p_120f 498 o rain2_hdr_amazon_360p 500 o red_kayak_360p_120f 502 o 
riverbed_360p25_120f

   o  shields2_640x360_120f

   o  snow_mnt_640x360_120f

   o  speed_bag_640x360_120f

   o  stockholm_640x360_120f

   o  tacomanarrows360p_120f

   o  thaloundeskmtg360p_120f

   o  water_hdr_amazon_360p

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqhighway_240p_120f

   o  bqzoom_240p_120f

   o  chairlift_240p_120f

   o  dirtbike_240p_120f

   o  mozzoom_240p_120f

   1920x1080, 4:4:4 or 4:2:0, 60 frames:

   o  CSGO_60f.y4m

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  EuroTruckSimulator2_60f.y4m

   o  Hearthstone_60f.y4m

   o  wikipedia_420.y4m

   o  pvq_slideshow.y4m

5.2.3.  objective-2-fast

   This test set is a strict subset of objective-2-slow.  It is
   designed for faster runtime.  This test set requires compiling with
   high bit depth support.

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f

   o  ducks_take_off_1080p50_60f

   o  life_1080p30_60f

   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f

   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f

   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f

   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f

   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f

   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f

   o  rush_hour_1080p25_60f

   o  seaplane_hdr_amazon_1080p

   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p

   o  dark720p_120f

   o  gipsrestat720p_120f

   o  KristenAndSara_1280x720_60_120f

   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_60f

   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_60f

   o  vidyo1_720p_60fps_120f

   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f

   o  controlled_burn_640x360_120f

   o  kirland360p_120f

   o  niklas360p_120f

   o  rain2_hdr_amazon_360p

   o  red_kayak_360p_120f

   o  riverbed_360p25_120f

   o  shields2_640x360_120f

   o  speed_bag_640x360_120f

   o  thaloundeskmtg360p_120f

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqzoom_240p_120f

   o  dirtbike_240p_120f

   1920x1080, 4:2:0, 60 frames:

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  wikipedia_420.y4m

5.2.4.  objective-1.1

   This test set is an old version of objective-2-slow.
641 4096x2160, 10bit, 4:2:0, 60 frames: 643 o Aerial (start frame 600) 645 o BarScene (start frame 120) 647 o Boat (start frame 0) 649 o BoxingPractice (start frame 0) 651 o Crosswalk (start frame 0) 653 o Dancers (start frame 120) 655 o FoodMarket 657 o Narrator 658 o PierSeaside 660 o RitualDance 662 o SquareAndTimelapse 664 o ToddlerFountain (start frame 120) 666 o TunnelFlag 668 o WindAndNature (start frame 120) 670 1920x1080, 8bit, 4:4:4, 60 frames: 672 o CSGO 674 o DOTA2 676 o EuroTruckSimulator2 678 o Hearthstone 680 o MINECRAFT 682 o STARCRAFT 684 o wikipedia 686 o pvq_slideshow 688 1920x1080, 8bit, 4:2:0, 60 frames: 690 o ducks_take_off 692 o life 694 o aspen 696 o crowd_run 698 o old_town_cross 700 o park_joy 702 o pedestrian_area 704 o rush_field_cuts 705 o rush_hour 707 o station2 709 o touchdown_pass 711 1280x720, 8bit, 4:2:0, 60 frames: 713 o Netflix_FoodMarket2 715 o Netflix_Tango 717 o DrivingPOV (start frame 120) 719 o DinnerScene (start frame 120) 721 o RollerCoaster (start frame 600) 723 o FourPeople 725 o Johnny 727 o KristenAndSara 729 o vidyo1 731 o vidyo3 733 o vidyo4 735 o dark720p 737 o gipsrecmotion720p 739 o gipsrestat720p 741 o controlled_burn 743 o stockholm 745 o speed_bag 747 o snow_mnt 749 o shields 751 640x360, 8bit, 4:2:0, 60 frames: 753 o red_kayak 755 o blue_sky 757 o riverbed 759 o thaloundeskmtgvga 761 o kirlandvga 763 o tacomanarrowsvga 765 o tacomascmvvga 767 o desktop2360p 769 o mmmovingvga 771 o mmstationaryvga 773 o niklasvga 775 5.2.5. objective-1-fast 777 This is an old version of objective-2-fast. 779 1920x1080, 8bit, 4:2:0, 60 frames: 781 o Aerial (start frame 600) 783 o Boat (start frame 0) 785 o Crosswalk (start frame 0) 787 o FoodMarket 789 o PierSeaside 791 o SquareAndTimelapse 793 o TunnelFlag 795 1920x1080, 8bit, 4:2:0, 60 frames: 797 o CSGO 799 o EuroTruckSimulator2 800 o MINECRAFT 802 o wikipedia 804 1920x1080, 8bit, 4:2:0, 60 frames: 806 o ducks_take_off 808 o aspen 810 o old_town_cross 812 o pedestrian_area 814 o rush_hour 816 o touchdown_pass 818 1280x720, 8bit, 4:2:0, 60 frames: 820 o Netflix_FoodMarket2 822 o DrivingPOV (start frame 120) 824 o RollerCoaster (start frame 600) 826 o Johnny 828 o vidyo1 830 o vidyo4 832 o gipsrecmotion720p 834 o speed_bag 836 o shields 838 640x360, 8bit, 4:2:0, 60 frames: 840 o red_kayak 842 o riverbed 844 o kirlandvga 846 o tacomascmvvga 847 o mmmovingvga 849 o niklasvga 851 5.3. Operating Points 853 Four operating modes are defined. High latency is intended for on 854 demand streaming, one-to-many live streaming, and stored video. Low 855 latency is intended for videoconferencing and remote access. Both of 856 these modes come in CQP and unconstrained variants. When testing 857 still image sets, such as subset1, high latency CQP mode should be 858 used. 860 5.3.1. Common settings 862 Encoders should be configured to their best settings when being 863 compared against each other: 865 o av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0 -cpu-used=0 866 -threads=1 868 5.3.2. High Latency CQP 870 High Latency CQP is used for evaluating incremental changes to a 871 codec. This method is well suited to compare codecs with similar 872 coding tools. It allows codec features with intrinsic frame delay. 874 o daala: -v=x -b 2 876 o vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2 878 o av1: -end-usage=q -cq-level=x -auto-alt-ref=2 880 5.3.3. Low Latency CQP 882 Low Latency CQP is used for evaluating incremental changes to a 883 codec. 
   This method is well suited to compare codecs with similar coding
   tools.  It requires the codec to be set for zero intrinsic frame
   delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist in order to automate
   the process of objective comparisons.  The Compare-Codecs tool
   allows BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate the various metrics used.  In
   addition, these scripts can be run automatically utilizing
   distributed computers for fast results, with rd_tool [RD_TOOL].
   This tool can be run via a web interface called AreWeCompressedYet
   [AWCY], or locally.

   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests run on a small number of short sequences -
   regression-1.  The regression tests should include a variety of
   test conditions.  The purpose of regression tests is to ensure that
   bug fixes (and similar patches) do not negatively affect
   performance.  The anchor in regression tests is the previous
   revision of the codec in source control.  Regression tests are run
   on both high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of encode or
   bitstream should run an objective performance test.  The
   performance tests should be run on a wider number of sequences.
   The following data should be reported:

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (objective-1-fast)

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-Rate score, in percent, for each clip.

      *  The average of all BD-Rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-Rate scores for all videos in all
         categories.

   Normally, the encoder should always be run at the slowest, highest
   quality speed setting (cpu-used=0 in the case of AV1 and VP9).
   However, to reduce computation time, both the reference and changed
   encoders can be built with some options disabled.  For AV1,
   -disable-ext_partition and -disable-ext_partition_types can be
   passed to the configure script to substantially speed up encoding,
   but the usage of these options must be reported in the test
   results.
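   A minimal sketch of the reporting step above follows.  It is
   illustrative only; the per-clip BD-rate values and the clip-to-
   category mapping are assumed to come from the automation described
   in this section (e.g. rd_tool):

      from collections import defaultdict

      def summarize_bd_rates(per_clip, categories):
          # per_clip maps clip name -> BD-rate percentage; categories
          # maps clip name -> resolution category (e.g. "1920x1080").
          by_category = defaultdict(list)
          for clip, score in per_clip.items():
              by_category[categories[clip]].append(score)

          report = {"clips": dict(per_clip)}
          # Average of all BD-rate scores, equally weighted, for each
          # resolution category in the test set.
          report["categories"] = {
              cat: sum(s) / len(s) for cat, s in by_category.items()
          }
          # Average of all BD-rate scores for all clips in all
          # categories.
          report["overall"] = sum(per_clip.values()) / len(per_clip)
          return report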
6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016, .

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012, .

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012, .

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015, .

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015, .

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d., .

   [I-D.ietf-netvc-requirements]
              Filippov, A., Norkin, A., and j.
              jose.roberto.alvarez@huawei.com, "