Network Working Group                                           T. Daede
Internet-Draft                                                    Mozilla
Intended status: Informational                                  A. Norkin
Expires: August 3, 2020                                           Netflix
                                                           I. Brailovskiy
                                                            Amazon Lab126
                                                         January 31, 2020

              Video Codec Testing and Quality Measurement
                       draft-ietf-netvc-testing-09

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 3, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Video Pair Comparison
     2.3.  Mean Opinion Score
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  CIEDE2000
     3.7.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  BD-Rate
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
       5.2.1.  regression-1
       5.2.2.  objective-2-slow
       5.2.3.  objective-2-fast
       5.2.4.  objective-1.1
       5.2.5.  objective-1-fast
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency CQP
       5.3.3.  Low Latency CQP
       5.3.4.  Unconstrained High Latency
       5.3.5.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  IANA Considerations
   8.  Security Considerations
   9.  Informative References
   Authors' Addresses

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing uses human viewers to rate and compare the
   quality of videos.  It is the preferred method of testing video
   codecs.

   Subjective testing results take priority over objective testing
   results, when available.  Subjective testing is recommended
   especially when taking advantage of psychovisual effects that may
   not be well represented by objective metrics, or when different
   objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.
   Testing relies on the resources of participants.  If a participant
   requires a subjective test for a particular feature or improvement,
   they are responsible for ensuring that resources are available.
   This ensures that only important tests are done; in particular, the
   tests that are important to participants.

   Subjective tests should use the same operating points as the
   objective tests.

2.1.  Still Image Pair Comparison

   A simple way to determine whether one compressed image is superior
   to another is to visually compare the two compressed images and
   have the viewer judge which one has higher quality.  For example,
   this test may be suitable for an intra de-ringing filter, but not
   for a new inter prediction mode.  For this test, the two compressed
   images should have similar compressed file sizes, with one image
   being no more than 5% larger than the other.  In addition, at least
   5 different images should be compared.

   Once testing is complete, a p-value can be computed using the
   binomial test.  A significant result should have a resulting
   p-value less than or equal to 0.05.  For example:

   p_value = binom_test(a,a+b)

   where a is the number of votes for one video, b is the number of
   votes for the second video, and binom_test(x,y) returns the p-value
   of a binomial test with x observed votes out of y total votes and
   an expected vote probability of 0.5.

   If ties are allowed to be reported, then the equation is modified:

   p_value = binom_test(a+floor(t/2),a+b+t)

   where t is the number of tie votes.

   Still image pair comparison is used for rapid comparisons during
   development - the viewer may be either a developer or a user, for
   example.  Because the results are only relative, it is effective
   even in an inconsistent viewing environment.  Because this test
   only uses still images (keyframes), it is only suitable for changes
   with similar or no effect on inter frames.
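   The p-value computation above can be scripted directly.  The
   following is a minimal sketch in Python, assuming SciPy is
   available; the helper name and the example vote counts are
   illustrative only and not part of this document:

      # Sketch of the pair-comparison significance test described above.
      # Assumes SciPy >= 1.7, where binomtest() returns a result object
      # with a .pvalue attribute.
      from math import floor
      from scipy.stats import binomtest

      def pair_comparison_p_value(a, b, ties=0):
          """p-value for a votes vs. b votes, with ties split evenly."""
          successes = a + floor(ties / 2)
          total = a + b + ties
          return binomtest(successes, total, p=0.5).pvalue

      # Hypothetical example: 17 votes for one encode, 6 for the other,
      # and 2 ties.
      p = pair_comparison_p_value(17, 6, ties=2)
      significant = p <= 0.05

   Whether a one-sided or two-sided test is appropriate depends on the
   hypothesis being tested; binomtest() defaults to a two-sided test.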
2.2.  Video Pair Comparison

   The still image pair comparison method can be modified to also
   compare videos.  This is necessary when making changes with
   temporal effects, such as changes to inter-frame prediction.  Video
   pair comparisons follow the same procedure as still images.  Videos
   used for testing should be limited to 10 seconds in length, and can
   be rewatched an unlimited number of times.

2.3.  Mean Opinion Score

   A Mean Opinion Score (MOS) viewing test is the preferred method of
   evaluating quality.  The subjective test should be performed either
   by showing the video sequences consecutively on one screen or by
   showing them simultaneously on two screens located side by side.
   The testing procedure should normally follow the rules described in
   [BT500] and be performed with non-expert test subjects.  The result
   of the test will be (depending on the test procedure) mean opinion
   scores (MOS) or differential mean opinion scores (DMOS).
   Confidence intervals are also calculated to judge whether the
   difference between two encodings is statistically significant.  In
   certain cases, a viewing test with expert test subjects can be
   performed, for example when a test must evaluate technologies with
   similar performance with respect to a particular artifact (e.g.
   loop filters or motion prediction).  Unlike pair comparisons, a MOS
   test requires a consistent testing environment.  This means that
   for large scale or distributed tests, pair comparisons are
   preferred.

3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below
   apply only to the luma plane, individually by frame.  When applied
   to a video, the scores of each frame are averaged to create the
   final score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean squared error (MSE), or its
   square root (RMSE).  The formula used is:

   20 * log10 ( MAX / RMSE )

   or, equivalently:

   10 * log10 ( MAX^2 / MSE )

   where the error is computed over all the pixels in the video, which
   is the method used in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per frame, and then the values averaged
   together.  This is reported in the same way as overall PSNR.

3.3.  PSNR-HVS-M

   The PSNR-HVS [PSNRHVS] metric performs a DCT transform of 8x8
   blocks of the image, weights the coefficients, and then calculates
   the PSNR of those coefficients.  Several different sets of weights
   have been considered.  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The method described in the original paper produces
   scores that range between 0 and 1.

   To linearize the metric for BD-Rate computation, the score is
   converted into a decibel scale:

   -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].  The metric score is converted to decibels in the same
   way as SSIM.

3.6.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.
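   As a summary of the score handling shared by the metrics above,
   here is a small illustrative Python sketch of the per-frame
   averaging (Section 3) and the decibel conversions used for PSNR
   (Section 3.1) and SSIM/MS-SSIM (Sections 3.4 and 3.5).  The
   function names are placeholders; the reference implementations are
   the C tools in the Daala repository [DAALA-GIT].

      import math

      def psnr_from_mse(mse, max_value=255.0):
          # Overall PSNR from a mean squared error (Section 3.1).
          # max_value would be 1023.0 for 10-bit content.
          return 10.0 * math.log10(max_value ** 2 / mse)

      def ssim_to_db(ssim):
          # Convert an SSIM or MS-SSIM score in [0, 1) to decibels
          # (Sections 3.4 and 3.5).
          return -10.0 * math.log10(1.0 - ssim)

      def frame_averaged(per_frame_scores):
          # Average per-frame scores into a final score (Section 3).
          return sum(per_frame_scores) / len(per_frame_scores)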
3.7.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms and fusing them using a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features for
   the SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

   The quality score from VMAF is used directly to calculate BD-Rate,
   without any conversions.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is shown on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.

4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and a test codec, BD-rate
   values are calculated as follows:

   o  Rate/distortion points are calculated for the reference and test
      codec.

      *  At least four points must be computed.  These points should
         be the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the two curves, bounded by the chosen
         quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.

4.3.  Ranges

   For individual feature changes in libaom or libvpx, the overlap BD-
   Rate method with quantizers 20, 32, 43, and 55 must be used.

   For the final evaluation described in
   [I-D.ietf-netvc-requirements], the quantizers used are 20, 24, 28,
   32, 36, 39, 43, 47, 51, and 55.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.  In addition, metrics
   need to support operation at high bit depth.  If one or more codecs
   in a comparison do not support high bit depth, sources need to be
   converted once before entering the encoder.

5.2.
Test Sets 362 Sources are divided into several categories to test different 363 scenarios the codec will be required to operate in. For easier 364 comparison, all videos in each set should have the same color 365 subsampling, same resolution, and same number of frames. In 366 addition, all test videos must be publicly available for testing use, 367 to allow for reproducibility of results. All current test sets are 368 available for download [TESTSEQUENCES]. 370 Test sequences should be downloaded in whole. They should not be 371 recreated from the original sources. 373 Each clip is labeled with its resolution, bit depth, color 374 subsampling, and length. 376 5.2.1. regression-1 378 This test set is used for basic regression testing. It contains a 379 very small number of clips. 381 o kirlandvga (640x360, 8bit, 4:2:0, 300 frames) 383 o FourPeople (1280x720, 8bit, 4:2:0, 60 frames) 385 o Narrarator (4096x2160, 10bit, 4:2:0, 15 frames) 387 o CSGO (1920x1080, 8bit, 4:4:4 60 frames) 389 5.2.2. objective-2-slow 391 This test set is a comprehensive test set, grouped by resolution. 392 These test clips were created from originals at [TESTSEQUENCES]. 393 They have been scaled and cropped to match the resolution of their 394 category. This test set requires a codec that supports both 8 and 10 395 bit video. 397 4096x2160, 4:2:0, 60 frames: 399 o Netflix_BarScene_4096x2160_60fps_10bit_420_60f 401 o Netflix_BoxingPractice_4096x2160_60fps_10bit_420_60f 403 o Netflix_Dancers_4096x2160_60fps_10bit_420_60f 405 o Netflix_Narrator_4096x2160_60fps_10bit_420_60f 407 o Netflix_RitualDance_4096x2160_60fps_10bit_420_60f 409 o Netflix_ToddlerFountain_4096x2160_60fps_10bit_420_60f 411 o Netflix_WindAndNature_4096x2160_60fps_10bit_420_60f 413 o street_hdr_amazon_2160p 415 1920x1080, 4:2:0, 60 frames: 417 o aspen_1080p_60f 419 o crowd_run_1080p50_60f 421 o ducks_take_off_1080p50_60f 423 o guitar_hdr_amazon_1080p 424 o life_1080p30_60f 426 o Netflix_Aerial_1920x1080_60fps_8bit_420_60f 428 o Netflix_Boat_1920x1080_60fps_8bit_420_60f 430 o Netflix_Crosswalk_1920x1080_60fps_8bit_420_60f 432 o Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f 434 o Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f 436 o Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f 438 o Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f 440 o old_town_cross_1080p50_60f 442 o pan_hdr_amazon_1080p 444 o park_joy_1080p50_60f 446 o pedestrian_area_1080p25_60f 448 o rush_field_cuts_1080p_60f 450 o rush_hour_1080p25_60f 452 o seaplane_hdr_amazon_1080p 454 o station2_1080p25_60f 456 o touchdown_pass_1080p_60f 458 1280x720, 4:2:0, 120 frames: 460 o boat_hdr_amazon_720p 462 o dark720p_120f 464 o FourPeople_1280x720_60_120f 466 o gipsrestat720p_120f 468 o Johnny_1280x720_60_120f 470 o KristenAndSara_1280x720_60_120f 471 o Netflix_DinnerScene_1280x720_60fps_8bit_420_120f 473 o Netflix_DrivingPOV_1280x720_60fps_8bit_420_120f 475 o Netflix_FoodMarket2_1280x720_60fps_8bit_420_120f 477 o Netflix_RollerCoaster_1280x720_60fps_8bit_420_120f 479 o Netflix_Tango_1280x720_60fps_8bit_420_120f 481 o rain_hdr_amazon_720p 483 o vidyo1_720p_60fps_120f 485 o vidyo3_720p_60fps_120f 487 o vidyo4_720p_60fps_120f 489 640x360, 4:2:0, 120 frames: 491 o blue_sky_360p_120f 493 o controlled_burn_640x360_120f 495 o desktop2360p_120f 497 o kirland360p_120f 499 o mmstationary360p_120f 501 o niklas360p_120f 503 o rain2_hdr_amazon_360p 505 o red_kayak_360p_120f 507 o riverbed_360p25_120f 509 o shields2_640x360_120f 511 o snow_mnt_640x360_120f 513 o speed_bag_640x360_120f 515 o 
stockholm_640x360_120f

   o  tacomanarrows360p_120f

   o  thaloundeskmtg360p_120f

   o  water_hdr_amazon_360p

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqhighway_240p_120f

   o  bqzoom_240p_120f

   o  chairlift_240p_120f

   o  dirtbike_240p_120f

   o  mozzoom_240p_120f

   1920x1080, 4:4:4 or 4:2:0, 60 frames:

   o  CSGO_60f.y4m

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  EuroTruckSimulator2_60f.y4m

   o  Hearthstone_60f.y4m

   o  wikipedia_420.y4m

   o  pvq_slideshow.y4m

5.2.3.  objective-2-fast

   This test set is a strict subset of objective-2-slow.  It is
   designed for faster runtime.  This test set requires a codec
   compiled with high bit depth support.

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f

   o  ducks_take_off_1080p50_60f

   o  life_1080p30_60f

   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f

   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f

   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f

   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f

   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f

   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f

   o  rush_hour_1080p25_60f

   o  seaplane_hdr_amazon_1080p

   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p

   o  dark720p_120f

   o  gipsrestat720p_120f

   o  KristenAndSara_1280x720_60_120f

   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_60f

   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_60f

   o  vidyo1_720p_60fps_120f

   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f

   o  controlled_burn_640x360_120f

   o  kirland360p_120f

   o  niklas360p_120f

   o  rain2_hdr_amazon_360p

   o  red_kayak_360p_120f

   o  riverbed_360p25_120f

   o  shields2_640x360_120f

   o  speed_bag_640x360_120f

   o  thaloundeskmtg360p_120f

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqzoom_240p_120f

   o  dirtbike_240p_120f

   1920x1080, 4:2:0, 60 frames:

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  wikipedia_420.y4m

5.2.4.  objective-1.1

   This test set is an old version of objective-2-slow.
646 4096x2160, 10bit, 4:2:0, 60 frames: 648 o Aerial (start frame 600) 650 o BarScene (start frame 120) 652 o Boat (start frame 0) 654 o BoxingPractice (start frame 0) 656 o Crosswalk (start frame 0) 658 o Dancers (start frame 120) 659 o FoodMarket 661 o Narrator 663 o PierSeaside 665 o RitualDance 667 o SquareAndTimelapse 669 o ToddlerFountain (start frame 120) 671 o TunnelFlag 673 o WindAndNature (start frame 120) 675 1920x1080, 8bit, 4:4:4, 60 frames: 677 o CSGO 679 o DOTA2 681 o EuroTruckSimulator2 683 o Hearthstone 685 o MINECRAFT 687 o STARCRAFT 689 o wikipedia 691 o pvq_slideshow 693 1920x1080, 8bit, 4:2:0, 60 frames: 695 o ducks_take_off 697 o life 699 o aspen 701 o crowd_run 703 o old_town_cross 705 o park_joy 706 o pedestrian_area 708 o rush_field_cuts 710 o rush_hour 712 o station2 714 o touchdown_pass 716 1280x720, 8bit, 4:2:0, 60 frames: 718 o Netflix_FoodMarket2 720 o Netflix_Tango 722 o DrivingPOV (start frame 120) 724 o DinnerScene (start frame 120) 726 o RollerCoaster (start frame 600) 728 o FourPeople 730 o Johnny 732 o KristenAndSara 734 o vidyo1 736 o vidyo3 738 o vidyo4 740 o dark720p 742 o gipsrecmotion720p 744 o gipsrestat720p 746 o controlled_burn 748 o stockholm 750 o speed_bag 752 o snow_mnt 753 o shields 755 640x360, 8bit, 4:2:0, 60 frames: 757 o red_kayak 759 o blue_sky 761 o riverbed 763 o thaloundeskmtgvga 765 o kirlandvga 767 o tacomanarrowsvga 769 o tacomascmvvga 771 o desktop2360p 773 o mmmovingvga 775 o mmstationaryvga 777 o niklasvga 779 5.2.5. objective-1-fast 781 This is an old version of objective-2-fast. 783 1920x1080, 8bit, 4:2:0, 60 frames: 785 o Aerial (start frame 600) 787 o Boat (start frame 0) 789 o Crosswalk (start frame 0) 791 o FoodMarket 793 o PierSeaside 795 o SquareAndTimelapse 797 o TunnelFlag 799 1920x1080, 8bit, 4:2:0, 60 frames: 801 o CSGO 803 o EuroTruckSimulator2 805 o MINECRAFT 807 o wikipedia 809 1920x1080, 8bit, 4:2:0, 60 frames: 811 o ducks_take_off 813 o aspen 815 o old_town_cross 817 o pedestrian_area 819 o rush_hour 821 o touchdown_pass 823 1280x720, 8bit, 4:2:0, 60 frames: 825 o Netflix_FoodMarket2 827 o DrivingPOV (start frame 120) 829 o RollerCoaster (start frame 600) 831 o Johnny 833 o vidyo1 835 o vidyo4 837 o gipsrecmotion720p 839 o speed_bag 841 o shields 843 640x360, 8bit, 4:2:0, 60 frames: 845 o red_kayak 847 o riverbed 848 o kirlandvga 850 o tacomascmvvga 852 o mmmovingvga 854 o niklasvga 856 5.3. Operating Points 858 Four operating modes are defined. High latency is intended for on 859 demand streaming, one-to-many live streaming, and stored video. Low 860 latency is intended for videoconferencing and remote access. Both of 861 these modes come in CQP (constant quantizer parameter) and 862 unconstrained variants. When testing still image sets, such as 863 subset1, high latency CQP mode should be used. 865 5.3.1. Common settings 867 Encoders should be configured to their best settings when being 868 compared against each other: 870 o av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0 -cpu-used=0 871 -threads=1 873 5.3.2. High Latency CQP 875 High Latency CQP is used for evaluating incremental changes to a 876 codec. This method is well suited to compare codecs with similar 877 coding tools. It allows codec features with intrinsic frame delay. 879 o daala: -v=x -b 2 881 o vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2 883 o av1: -end-usage=q -cq-level=x -auto-alt-ref=2 885 5.3.3. Low Latency CQP 887 Low Latency CQP is used for evaluating incremental changes to a 888 codec. 
This method is well suited to compare codecs with similar
   coding tools.  It requires the codec to be set for zero intrinsic
   frame delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist to automate the
   process of objective comparisons.  The Compare-Codecs tool allows
   BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate computation of the various
   metrics.  In addition, these scripts can be run automatically on
   distributed computers for fast results, using rd_tool [RD_TOOL].
   This tool can be run via a web interface called AreWeCompressedYet
   [AWCY], or locally.

   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests run on a small number of short sequences
   (regression-1).  The regression tests should include a variety of
   test conditions.  The purpose of regression tests is to ensure that
   bug fixes (and similar patches) do not negatively affect
   performance.  The anchor in regression tests is the previous
   revision of the codec in source control.  Regression tests are run
   in both high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of encoding or the
   bitstream should run an objective performance test.  The
   performance tests should be run on a wider number of sequences.
   The following data should be reported:

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (e.g. objective-1-fast).

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-Rate score, in percent, for each clip.

      *  The average of all BD-Rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-Rate scores for all videos in all
         categories.
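   As a point of reference for the per-clip BD-Rate scores listed
   above, the following is a rough Python sketch of the BD-rate
   computation described in Section 4.2, assuming NumPy and SciPy are
   available.  Function and variable names are illustrative only; the
   automation tools described above remain the usual way to produce
   the reported numbers.

      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_points, test_points):
          # ref_points and test_points are (bitrate, metric score)
          # pairs for the reference and test encoders at matching
          # quantizers.
          def fit(points):
              # Sort by metric score and fit log-rate as a function of
              # the score, per Section 4.2.
              scores, log_rates = zip(*sorted((s, np.log10(r))
                                              for r, s in points))
              return (PchipInterpolator(scores, log_rates),
                      min(scores), max(scores))

          ref_fn, ref_lo, ref_hi = fit(ref_points)
          test_fn, test_lo, test_hi = fit(test_points)

          # Overlap of the two metric ranges (same-codec comparison).
          lo, hi = max(ref_lo, test_lo), min(ref_hi, test_hi)
          samples = np.linspace(lo, hi, 1000)

          # Trapezoidal integration of log-rate over the metric range.
          avg_log_ref = np.trapz(ref_fn(samples), samples) / (hi - lo)
          avg_log_test = np.trapz(test_fn(samples), samples) / (hi - lo)

          # Convert back to linear rate and report the percent
          # difference from the reference to the test codec.
          return (10.0 ** (avg_log_test - avg_log_ref) - 1.0) * 100.0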
   Normally, the encoder should always be run at the slowest, highest
   quality speed setting (cpu-used=0 in the case of AV1 and VP9).
   However, when computation time is constrained, both the reference
   and changed encoder can be built with some options disabled.  For
   AV1, -disable-ext_partition and -disable-ext_partition_types can be
   passed to the configure script to substantially speed up encoding,
   but the usage of these options must be reported in the test
   results.

6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  IANA Considerations

   This document does not require any IANA actions.

8.  Security Considerations

   This document describes the methodologies and procedures for
   qualitative testing, and therefore does not itself have
   implications for network or decoder security.

9.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016.

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012.

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012.

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015.

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015.

   [I-D.ietf-netvc-requirements]
              Filippov, A., Norkin, A., and J. Alvarez, "Video Codec
              Requirements and Evaluation Methodology", draft-ietf-
              netvc-requirements-10 (work in progress), November 2019.

   [MSSSIM]   Wang, Z., Simoncelli, E., and A. Bovik, "Multi-Scale
              Structural Similarity for Image Quality Assessment",
              n.d.

   [PSNRHVS]  Egiazarian, K., Astola, J., Ponomarenko, N., Lukin, V.,
              Battisti, F., and M. Carli, "A New Full-Reference
              Quality Metrics Based on HVS", 2002.

   [RD_TOOL]  Xiph.Org, "rd_tool", 2016.

   [SSIM]     Wang, Z., Bovik, A., Sheikh, H., and E. Simoncelli,
              "Image Quality Assessment: From Error Visibility to
              Structural Similarity", 2004.

   [TESTSEQUENCES]
              Daede, T., "Test Sets", n.d.

   [VMAF]     Aaron, A., Li, Z., Manohara, M., Lin, J., Wu, E., and C.
              Kuo, "VMAF - Video Multi-Method Assessment Fusion",
              2015.

Authors' Addresses

   Thomas Daede
   Mozilla

   Email: tdaede@mozilla.com

   Andrey Norkin
   Netflix

   Email: anorkin@netflix.com

   Ilya Brailovskiy
   Amazon Lab126

   Email: brailovs@lab126.com