Network Working Group                                          T. Daede
Internet-Draft                                                   Mozilla
Intended status: Informational                                 A. Norkin
Expires: January 3, 2019                                         Netflix
                                                          I. Brailovskiy
                                                           Amazon Lab126
                                                           July 02, 2018


              Video Codec Testing and Quality Measurement
                      draft-ietf-netvc-testing-07

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Video Pair Comparison
     2.3.  Mean Opinion Score
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  CIEDE2000
     3.7.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  BD-Rate
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
       5.2.1.  regression-1
       5.2.2.  objective-2-slow
       5.2.3.  objective-2-fast
       5.2.4.  objective-1.1
       5.2.5.  objective-1-fast
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency CQP
       5.3.3.  Low Latency CQP
       5.3.4.  Unconstrained High Latency
       5.3.5.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  Informative References
   Authors' Addresses

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing is the preferred method of testing video codecs.

   Subjective testing results take priority over objective testing
   results, when available.  Subjective testing is recommended
   especially when taking advantage of psychovisual effects that may
   not be well represented by objective metrics, or when different
   objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.

   Testing relies on the resources of participants.  For this reason,
   even if the group agrees that a particular test is important, if no
   one volunteers to do it, or if volunteers do not complete it in a
   timely fashion, then that test should be discarded.  This ensures
   that only important tests are done - in particular, the tests that
   are important to participants.

   Subjective tests should use the same operating points as the
   objective tests.

2.1.  Still Image Pair Comparison

   A simple way to determine whether one compressed image is superior
   to another is to visually compare the two compressed images and
   have the viewer judge which one has higher quality.  For example,
   this test may be suitable for an intra de-ringing filter, but not
   for a new inter prediction mode.  For this test, the two compressed
   images should have similar compressed file sizes, with one image
   being no more than 5% larger than the other.  In addition, at least
   5 different images should be compared.

   Once testing is complete, a p-value can be computed using the
   binomial test.  A significant result should have a resulting
   p-value less than or equal to 0.05.  For example:

      p_value = binom_test(a, a+b)

   where a is the number of votes for one video, b is the number of
   votes for the second video, and binom_test(x, y) returns the
   p-value of a binomial test with x observed successes out of y total
   trials and expected probability 0.5.

   If ties are allowed to be reported, then the equation is modified:

      p_value = binom_test(a + floor(t/2), a+b+t)

   where t is the number of tie votes.

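   The computation can be sketched in Python, assuming a recent SciPy
   (scipy.stats.binomtest, which is two-sided by default); the helper
   name below is illustrative and not part of any testing tool:

      # Sketch of the pair-comparison p-value using SciPy's
      # binomial test.  The helper name is illustrative only.
      from scipy.stats import binomtest

      def pair_comparison_pvalue(a, b, t=0):
          # a, b: votes for each video; t: tie votes (optional).
          # Half of the ties (rounded down) are credited to one
          # side, and all ties count toward the number of trials.
          k = a + t // 2
          n = a + b + t
          return binomtest(k, n, p=0.5).pvalue

      # Example: 9 votes to 1 gives p ~= 0.021, a significant
      # result at the 0.05 threshold.
      print(pair_comparison_pvalue(9, 1))
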
   Still image pair comparison is used for rapid comparisons during
   development - the viewer may be either a developer or user, for
   example.  As the results are only relative, it is effective even
   with an inconsistent viewing environment.  Because this test only
   uses still images (keyframes), it is only suitable for changes with
   similar or no effect on inter frames.

2.2.  Video Pair Comparison

   The still image pair comparison method can be modified to also
   compare videos.  This is necessary when making changes with
   temporal effects, such as changes to inter-frame prediction.  Video
   pair comparisons follow the same procedure as still images.  Videos
   used for testing should be limited to 10 seconds in length, and can
   be rewatched an unlimited number of times.

2.3.  Mean Opinion Score

   A Mean Opinion Score (MOS) viewing test is the preferred method of
   evaluating quality.  The subjective test should be performed by
   showing the video sequences either consecutively on one screen or
   simultaneously on two screens located side by side.  The testing
   procedure should normally follow the rules described in [BT500] and
   be performed with non-expert test subjects.  The result of the test
   will be (depending on the test procedure) mean opinion scores (MOS)
   or differential mean opinion scores (DMOS).  Confidence intervals
   are also calculated to judge whether the difference between two
   encodings is statistically significant.  In certain cases, a
   viewing test with expert test subjects can be performed, for
   example if a test should evaluate technologies with similar
   performance with respect to a particular artifact (e.g. loop
   filters or motion prediction).  Unlike pair comparisons, a MOS test
   requires a consistent testing environment.  This means that for
   large scale or distributed tests, pair comparisons are preferred.

3.  Objective Metrics

   Objective metrics are used in place of subjective tests for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below
   apply only to the luma plane and are computed individually per
   frame.  When applied to a video, the scores of the individual
   frames are averaged to create the final score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean squared error (MSE), or its
   square root (RMSE).  The formula used is:

      20 * log10 ( MAX / RMSE )

   or, equivalently:

      10 * log10 ( MAX^2 / MSE )

   where the error is computed over all the pixels in the video.  This
   is the method used in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per frame, and then the values averaged
   together.  This is reported in the same way as overall PSNR.

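   The difference between the two pooling methods can be sketched in
   Python with NumPy (an informal illustration, not the dump_psnr.c
   reference implementation; the function names are ours, and the
   clips are assumed to differ so the error is nonzero):

      # Overall vs. frame-averaged PSNR for lists of same-sized
      # luma planes held as NumPy arrays.
      import numpy as np

      def overall_psnr(ref_frames, test_frames, max_val=255.0):
          # Pool the squared error over every pixel in the video
          # before taking the logarithm.
          se = sum(np.sum((r.astype(float) - t.astype(float)) ** 2)
                   for r, t in zip(ref_frames, test_frames))
          n = sum(r.size for r in ref_frames)
          return 10.0 * np.log10(max_val ** 2 / (se / n))

      def frame_averaged_psnr(ref_frames, test_frames,
                              max_val=255.0):
          # Compute PSNR per frame, then average the dB values.
          psnrs = [10.0 * np.log10(max_val ** 2 / np.mean(
                       (r.astype(float) - t.astype(float)) ** 2))
                   for r, t in zip(ref_frames, test_frames)]
          return float(np.mean(psnrs))
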
3.3.  PSNR-HVS-M

   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients.  Several different sets of weights have been
   considered [PSNRHVS].  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The original paper produces scores ranging between 0
   and 1.

   To linearize the metric for BD-rate computation, the score is
   converted to a decibel scale via the nonlinear mapping:

      -10 * log10 (1 - SSIM)

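   A one-line sketch of this conversion in Python (the function name
   is illustrative):

      # SSIM-to-decibel conversion used before BD-rate computation.
      import math

      def ssim_db(ssim_score):
          # A score of 1.0 (identical images) maps to infinite dB,
          # so callers should guard against perfect scores.
          return -10.0 * math.log10(1.0 - ssim_score)
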
3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].  The metric score is converted to decibels in the same
   way as SSIM.

3.6.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.

3.7.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms and fusing them with a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features for
   the SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

   The quality score from VMAF is used directly to calculate BD-rate,
   without any conversions.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is shown on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.

4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and a test codec, BD-rate
   values are calculated as follows (a code sketch of these steps
   follows the list):

   o  Rate/distortion points are calculated for the reference and test
      codec.

      *  At least four points must be computed.  These points should
         use the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic Hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the two curves, bounded by the chosen
         quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.

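   The steps above can be sketched in Python with SciPy's PCHIP
   interpolator.  This is an informal sketch, not a normative
   implementation: the function name is ours, and it assumes the two
   curves overlap and that metric scores increase with rate.

      # BD-rate sketch: average percent rate difference from the
      # reference codec to the test codec over the overlapping
      # metric-score range.
      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_rates, ref_scores, test_rates, test_scores,
                  samples=1000):
          ref_rates = np.asarray(ref_rates, dtype=float)
          ref_scores = np.asarray(ref_scores, dtype=float)
          test_rates = np.asarray(test_rates, dtype=float)
          test_scores = np.asarray(test_scores, dtype=float)
          # Fit log-rate as a function of metric score; PCHIP
          # needs strictly increasing scores, so sort each curve.
          r = np.argsort(ref_scores)
          t = np.argsort(test_scores)
          ref_fit = PchipInterpolator(ref_scores[r],
                                      np.log(ref_rates)[r])
          test_fit = PchipInterpolator(test_scores[t],
                                       np.log(test_rates)[t])
          # Overlap: the intersection of the two score ranges
          # (the same-codec case described above).
          lo = max(ref_scores.min(), test_scores.min())
          hi = min(ref_scores.max(), test_scores.max())
          s = np.linspace(lo, hi, samples)
          # Trapezoidal integration of log-rate over the overlap.
          ref_avg = np.trapz(ref_fit(s), s) / (hi - lo)
          test_avg = np.trapz(test_fit(s), s) / (hi - lo)
          # Back to linear rate; percent change vs. the reference.
          return (np.exp(test_avg - ref_avg) - 1.0) * 100.0
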
4.3.  Ranges

   For individual feature changes in libaom or libvpx, the overlap
   BD-rate method with quantizers 20, 32, 43, and 55 must be used.

   For the final evaluation described in
   [I-D.ietf-netvc-requirements], the quantizers used are 20, 24, 28,
   32, 36, 39, 43, 47, 51, and 55.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.  In addition, metrics
   need to support operation at high bit depth.  If one or more codecs
   in a comparison do not support high bit depth, sources need to be
   converted once before entering the encoder.

5.2.  Test Sets

   Sources are divided into several categories to test the different
   scenarios in which the codec will be required to operate.  For
   easier comparison, all videos in each set should have the same
   color subsampling, the same resolution, and the same number of
   frames.  In addition, all test videos must be publicly available
   for testing use, to allow for reproducibility of results.  All
   current test sets are available for download [TESTSEQUENCES].

   Test sequences should be downloaded in their entirety.  They should
   not be recreated from the original sources.

5.2.1.  regression-1

   This test set is used for basic regression testing.  It contains a
   very small number of clips.

   o  kirlandvga (640x360, 8bit, 4:2:0, 300 frames)

   o  FourPeople (1280x720, 8bit, 4:2:0, 60 frames)

   o  Narrarator (4096x2160, 10bit, 4:2:0, 15 frames)

   o  CSGO (1920x1080, 8bit, 4:4:4, 60 frames)

5.2.2.  objective-2-slow

   This test set is a comprehensive test set, grouped by resolution.
   These test clips were created from originals at [TESTSEQUENCES].
   They have been scaled and cropped to match the resolution of their
   category.  This test set requires compiling with high bit depth
   support.

   4096x2160, 4:2:0, 60 frames:

   o  Netflix_BarScene_4096x2160_60fps_10bit_420_60f

   o  Netflix_BoxingPractice_4096x2160_60fps_10bit_420_60f

   o  Netflix_Dancers_4096x2160_60fps_10bit_420_60f

   o  Netflix_Narrator_4096x2160_60fps_10bit_420_60f

   o  Netflix_RitualDance_4096x2160_60fps_10bit_420_60f

   o  Netflix_ToddlerFountain_4096x2160_60fps_10bit_420_60f

   o  Netflix_WindAndNature_4096x2160_60fps_10bit_420_60f

   o  street_hdr_amazon_2160p

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f

   o  crowd_run_1080p50_60f

   o  ducks_take_off_1080p50_60f

   o  guitar_hdr_amazon_1080p

   o  life_1080p30_60f

   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f

   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f

   o  Netflix_Crosswalk_1920x1080_60fps_8bit_420_60f

   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f

   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f

   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f

   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f

   o  old_town_cross_1080p50_60f

   o  pan_hdr_amazon_1080p

   o  park_joy_1080p50_60f

   o  pedestrian_area_1080p25_60f

   o  rush_field_cuts_1080p_60f

   o  rush_hour_1080p25_60f

   o  seaplane_hdr_amazon_1080p

   o  station2_1080p25_60f

   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p

   o  dark720p_120f

   o  FourPeople_1280x720_60_120f

   o  gipsrestat720p_120f

   o  Johnny_1280x720_60_120f

   o  KristenAndSara_1280x720_60_120f

   o  Netflix_DinnerScene_1280x720_60fps_8bit_420_120f

   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_120f

   o  Netflix_FoodMarket2_1280x720_60fps_8bit_420_120f

   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_120f

   o  Netflix_Tango_1280x720_60fps_8bit_420_120f

   o  rain_hdr_amazon_720p

   o  vidyo1_720p_60fps_120f

   o  vidyo3_720p_60fps_120f

   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f

   o  controlled_burn_640x360_120f

   o  desktop2360p_120f

   o  kirland360p_120f

   o  mmstationary360p_120f

   o  niklas360p_120f

   o  rain2_hdr_amazon_360p

   o  red_kayak_360p_120f

   o  riverbed_360p25_120f

   o  shields2_640x360_120f

   o  snow_mnt_640x360_120f

   o  speed_bag_640x360_120f

   o  stockholm_640x360_120f

   o  tacomanarrows360p_120f

   o  thaloundeskmtg360p_120f

   o  water_hdr_amazon_360p

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqhighway_240p_120f

   o  bqzoom_240p_120f

   o  chairlift_240p_120f

   o  dirtbike_240p_120f

   o  mozzoom_240p_120f

   1920x1080, 4:4:4 or 4:2:0, 60 frames:

   o  CSGO_60f.y4m

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  EuroTruckSimulator2_60f.y4m

   o  Hearthstone_60f.y4m

   o  wikipedia_420.y4m

   o  pvq_slideshow.y4m

5.2.3.  objective-2-fast

   This test set is a strict subset of objective-2-slow.  It is
   designed for faster runtime.  This test set requires compiling with
   high bit depth support.

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f

   o  ducks_take_off_1080p50_60f

   o  life_1080p30_60f

   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f

   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f

   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f

   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f

   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f

   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f

   o  rush_hour_1080p25_60f

   o  seaplane_hdr_amazon_1080p

   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p

   o  dark720p_120f

   o  gipsrestat720p_120f

   o  KristenAndSara_1280x720_60_120f

   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_60f

   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_60f

   o  vidyo1_720p_60fps_120f

   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f

   o  controlled_burn_640x360_120f

   o  kirland360p_120f

   o  niklas360p_120f

   o  rain2_hdr_amazon_360p

   o  red_kayak_360p_120f

   o  riverbed_360p25_120f

   o  shields2_640x360_120f

   o  speed_bag_640x360_120f

   o  thaloundeskmtg360p_120f

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqzoom_240p_120f

   o  dirtbike_240p_120f

   1920x1080, 4:2:0, 60 frames:

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  wikipedia_420.y4m

5.2.4.  objective-1.1

   This test set is an old version of objective-2-slow.

   4096x2160, 10bit, 4:2:0, 60 frames:

   o  Aerial (start frame 600)

   o  BarScene (start frame 120)

   o  Boat (start frame 0)

   o  BoxingPractice (start frame 0)

   o  Crosswalk (start frame 0)

   o  Dancers (start frame 120)

   o  FoodMarket

   o  Narrator

   o  PierSeaside

   o  RitualDance

   o  SquareAndTimelapse

   o  ToddlerFountain (start frame 120)

   o  TunnelFlag

   o  WindAndNature (start frame 120)

   1920x1080, 8bit, 4:4:4, 60 frames:

   o  CSGO

   o  DOTA2

   o  EuroTruckSimulator2

   o  Hearthstone

   o  MINECRAFT

   o  STARCRAFT

   o  wikipedia

   o  pvq_slideshow

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  ducks_take_off

   o  life

   o  aspen

   o  crowd_run

   o  old_town_cross

   o  park_joy

   o  pedestrian_area

   o  rush_field_cuts

   o  rush_hour

   o  station2

   o  touchdown_pass

   1280x720, 8bit, 4:2:0, 60 frames:

   o  Netflix_FoodMarket2

   o  Netflix_Tango

   o  DrivingPOV (start frame 120)

   o  DinnerScene (start frame 120)

   o  RollerCoaster (start frame 600)

   o  FourPeople

   o  Johnny

   o  KristenAndSara

   o  vidyo1

   o  vidyo3

   o  vidyo4

   o  dark720p

   o  gipsrecmotion720p

   o  gipsrestat720p

   o  controlled_burn

   o  stockholm

   o  speed_bag

   o  snow_mnt

   o  shields

   640x360, 8bit, 4:2:0, 60 frames:

   o  red_kayak

   o  blue_sky

   o  riverbed

   o  thaloundeskmtgvga

   o  kirlandvga

   o  tacomanarrowsvga

   o  tacomascmvvga

   o  desktop2360p

   o  mmmovingvga

   o  mmstationaryvga

   o  niklasvga

5.2.5.  objective-1-fast

   This is an old version of objective-2-fast.

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  Aerial (start frame 600)

   o  Boat (start frame 0)

   o  Crosswalk (start frame 0)

   o  FoodMarket

   o  PierSeaside

   o  SquareAndTimelapse

   o  TunnelFlag

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  CSGO

   o  EuroTruckSimulator2

   o  MINECRAFT

   o  wikipedia

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  ducks_take_off

   o  aspen

   o  old_town_cross

   o  pedestrian_area

   o  rush_hour

   o  touchdown_pass

   1280x720, 8bit, 4:2:0, 60 frames:

   o  Netflix_FoodMarket2

   o  DrivingPOV (start frame 120)

   o  RollerCoaster (start frame 600)

   o  Johnny

   o  vidyo1

   o  vidyo4

   o  gipsrecmotion720p

   o  speed_bag

   o  shields

   640x360, 8bit, 4:2:0, 60 frames:

   o  red_kayak

   o  riverbed

   o  kirlandvga

   o  tacomascmvvga

   o  mmmovingvga

   o  niklasvga

5.3.  Operating Points

   Four operating modes are defined.  High latency is intended for on-
   demand streaming, one-to-many live streaming, and stored video.
   Low latency is intended for videoconferencing and remote access.
   Both of these modes come in CQP (constant quantizer parameter) and
   unconstrained variants.  When testing still image sets, such as
   subset1, the high latency CQP mode should be used.

5.3.1.  Common settings

   Encoders should be configured to their best settings when being
   compared against each other:

   o  av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0
      -cpu-used=0 -threads=1

5.3.2.  High Latency CQP

   High Latency CQP is used for evaluating incremental changes to a
   codec.  This method is well suited to comparing codecs with similar
   coding tools.  It allows codec features with intrinsic frame delay.

   o  daala: -v=x -b 2

   o  vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

   o  av1: -end-usage=q -cq-level=x -auto-alt-ref=2

5.3.3.  Low Latency CQP

   Low Latency CQP is used for evaluating incremental changes to a
   codec.  This method is well suited to comparing codecs with similar
   coding tools.  It requires the codec to be set for zero intrinsic
   frame delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist to automate the
   process of objective comparisons.  The Compare-Codecs tool allows
   BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate the various metrics used.  In
   addition, these scripts can be run automatically, utilizing
   distributed computers for fast results, with rd_tool [RD_TOOL].
   This tool can be run via a web interface called AreWeCompressedYet
   [AWCY], or locally.

   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests run on a small number of short sequences, the
   regression-1 test set.  The regression tests should include a
   number of different test conditions.  The purpose of regression
   tests is to ensure that bug fixes (and similar patches) do not
   negatively affect performance.  The anchor in regression tests is
   the previous revision of the codec in source control.  Regression
   tests are run in both the high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of the encoded
   bitstream should be evaluated with an objective performance test.
   The performance tests should be run on a wider set of sequences.
   The following data should be reported (a sketch of the averaging
   follows the list):

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (e.g., objective-1-fast).

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-rate score, in percent, for each clip.

      *  The average of all BD-rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-rate scores for all videos in all
         categories.

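   The averaging can be sketched in Python (the data layout below is
   illustrative, not a required reporting format; the values are
   placeholders):

      # Per-category and overall means of per-clip BD-rate scores.
      # Layout: {category: {clip: bd_rate_percent}}.
      results = {
          "1080p": {"aspen": -1.2, "life": -0.8},
          "720p": {"johnny": -0.5, "vidyo1": -1.0},
      }

      per_category = {cat: sum(v.values()) / len(v)
                      for cat, v in results.items()}
      all_scores = [s for v in results.values()
                    for s in v.values()]
      overall = sum(all_scores) / len(all_scores)
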
   Normally, the encoder should always be run at its slowest, highest
   quality speed setting (cpu-used=0 in the case of AV1 and VP9).
   However, if computation time is constrained, both the reference and
   the changed encoder can be built with some options disabled.  For
   AV1, -disable-ext_partition and -disable-ext_partition_types can be
   passed to the configure script to substantially speed up encoding,
   but the usage of these options must be reported in the test
   results.

6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as detect potential regressions
   missed by other tests.

7.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016.

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012.

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012.

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015.

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015.

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d.

   [I-D.ietf-netvc-requirements]
              Filippov, A., Norkin, A., and J. Alvarez, "Video Codec
              Requirements and Evaluation Methodology", draft-ietf-
              netvc-requirements-02 (work in progress).