Network Working Group                                          T. Daede
Internet-Draft                                                  Mozilla
Intended status: Informational                                A. Norkin
Expires: January 3, 2019                                         Netflix
                                                          I. Brailovskiy
                                                           Amazon Lab126
                                                            July 02, 2018

             Video Codec Testing and Quality Measurement
                      draft-ietf-netvc-testing-07

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Subjective quality tests . . . . . . . . . . . . . . . . . .   3
     2.1.  Still Image Pair Comparison  . . . . . . . . . . . . . .   3
     2.2.  Video Pair Comparison  . . . . . . . . . . . . . . . . .   4
     2.3.  Mean Opinion Score . . . . . . . . . . . . . . . . . . .   4
   3.  Objective Metrics  . . . . . . . . . . . . . . . . . . . . .   5
     3.1.  Overall PSNR . . . . . . . . . . . . . . . . . . . . . .   5
     3.2.  Frame-averaged PSNR  . . . . . . . . . . . . . . . . . .   5
     3.3.  PSNR-HVS-M . . . . . . . . . . . . . . . . . . . . . . .   5
     3.4.  SSIM . . . . . . . . . . . . . . . . . . . . . . . . . .   6
     3.5.  Multi-Scale SSIM . . . . . . . . . . . . . . . . . . . .   6
     3.6.  CIEDE2000  . . . . . . . . . . . . . . . . . . . . . . .   6
     3.7.  VMAF . . . . . . . . . . . . . . . . . . . . . . . . . .   6
   4.  Comparing and Interpreting Results . . . . . . . . . . . . .   7
     4.1.  Graphing . . . . . . . . . . . . . . . . . . . . . . . .   7
     4.2.  BD-Rate  . . . . . . . . . . . . . . . . . . . . . . . .   7
     4.3.  Ranges . . . . . . . . . . . . . . . . . . . . . . . . .   8
   5.  Test Sequences . . . . . . . . . . . . . . . . . . . . . . .   8
     5.1.  Sources  . . . . . . . . . . . . . . . . . . . . . . . .   8
     5.2.  Test Sets  . . . . . . . . . . . . . . . . . . . . . . .   8
       5.2.1.  regression-1 . . . . . . . . . . . . . . . . . . . .   8
       5.2.2.  objective-2-slow . . . . . . . . . . . . . . . . . .   9
       5.2.3.  objective-2-fast . . . . . . . . . . . . . . . . . .  12
       5.2.4.  objective-1.1  . . . . . . . . . . . . . . . . . . .  14
       5.2.5.  objective-1-fast . . . . . . . . . . . . . . . . . .  17
     5.3.  Operating Points . . . . . . . . . . . . . . . . . . . .  19
       5.3.1.  Common settings  . . . . . . . . . . . . . . . . . .  19
       5.3.2.  High Latency CQP . . . . . . . . . . . . . . . . . .  19
       5.3.3.  Low Latency CQP  . . . . . . . . . . . . . . . . . .  19
       5.3.4.  Unconstrained High Latency . . . . . . . . . . . . .  20
       5.3.5.  Unconstrained Low Latency  . . . . . . . . . . . . .  20
   6.  Automation . . . . . . . . . . . . . . . . . . . . . . . . .  20
     6.1.  Regression tests . . . . . . . . . . . . . . . . . . . .  21
     6.2.  Objective performance tests  . . . . . . . . . . . . . .  21
     6.3.  Periodic tests . . . . . . . . . . . . . . . . . . . . .  22
   7.  Informative References . . . . . . . . . . . . . . . . . . .  22
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . .  23

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing is the preferable method of testing video codecs.

   Subjective testing results take priority over objective testing
   results, when available.
   Subjective testing is recommended especially when taking advantage
   of psychovisual effects that may not be well represented by
   objective metrics, or when different objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.

   Testing relies on the resources of participants.  For this reason,
   even if the group agrees that a particular test is important, if no
   one volunteers to do it, or if volunteers do not complete it in a
   timely fashion, then that test should be discarded.  This ensures
   that only important tests are done - in particular, the tests that
   are important to participants.

   Subjective tests should use the same operating points as the
   objective tests.

2.1.  Still Image Pair Comparison

   A simple way to determine the superiority of one compressed image
   is to visually compare two compressed images, and have the viewer
   judge which one has higher quality.  For example, this test may be
   suitable for an intra de-ringing filter, but not for a new inter
   prediction mode.  For this test, the two compressed images should
   have similar compressed file sizes, with one image being no more
   than 5% larger than the other.  In addition, at least 5 different
   images should be compared.

   Once testing is complete, a p-value can be computed using the
   binomial test.  A significant result should have a resulting
   p-value less than or equal to 0.05.  For example:

      p_value = binom_test(a, a+b)

   where a is the number of votes for one image, b is the number of
   votes for the second image, and binom_test(x, y) computes a p-value
   from the binomial distribution with x observed successes, y total
   trials, and expected probability 0.5.

   If ties are allowed to be reported, then the equation is modified:

      p_value = binom_test(a + floor(t/2), a+b+t)

   where t is the number of tie votes.

   Still image pair comparison is used for rapid comparisons during
   development - the viewer may be either a developer or a user, for
   example.  As the results are only relative, it is effective even
   with an inconsistent viewing environment.  Because this test only
   uses still images (keyframes), it is only suitable for changes with
   similar or no effect on inter frames.

2.2.  Video Pair Comparison

   The still image pair comparison method can be modified to also
   compare videos.  This is necessary when making changes with
   temporal effects, such as changes to inter-frame prediction.  Video
   pair comparisons follow the same procedure as still images.  Videos
   used for testing should be limited to 10 seconds in length, and can
   be rewatched an unlimited number of times.

2.3.  Mean Opinion Score

   A Mean Opinion Score (MOS) viewing test is the preferred method of
   evaluating quality.  The subjective test should be performed either
   by showing the video sequences consecutively on one screen or by
   showing them simultaneously on two screens located side by side.
   The testing procedure should normally follow the rules described in
   [BT500] and be performed with non-expert test subjects.  The result
   of the test will be (depending on the test procedure) mean opinion
   scores (MOS) or differential mean opinion scores (DMOS).
   Confidence intervals are also calculated to judge whether the
   difference between two encodings is statistically significant.  In
   certain cases, a viewing test with expert test subjects can be
   performed, for example if a test evaluates technologies with
   similar performance with respect to a particular artifact (e.g.
   loop filters or motion prediction).  Unlike pair comparisons, a MOS
   test requires a consistent testing environment.  This means that
   for large scale or distributed tests, pair comparisons are
   preferred.
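   The statistics above can be illustrated with a short Python sketch.
   This is an informative aid only, not part of the testing procedure;
   it assumes NumPy and SciPy, with SciPy's binomtest playing the role
   of the binom_test function referenced in Section 2.1:

      import math
      import numpy as np
      from scipy import stats

      def pair_comparison_p_value(a, b, t=0):
          # Binomial test for a pair comparison (Section 2.1): a and b
          # are the vote counts for the two encodings, t is the number
          # of tie votes.
          successes = a + t // 2
          trials = a + b + t
          return stats.binomtest(successes, trials, p=0.5).pvalue

      def mos_confidence_interval(scores, confidence=0.95):
          # Mean opinion score and confidence interval half-width
          # (Section 2.3), using a Student's t critical value.
          scores = np.asarray(scores, dtype=float)
          mean = float(scores.mean())
          sem = scores.std(ddof=1) / math.sqrt(len(scores))
          t_crit = stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
          return mean, float(t_crit * sem)

      # Example: 10 votes for encoding A, 2 for B, no ties.
      print(pair_comparison_p_value(10, 2))   # ~0.039, significant
      print(mos_confidence_interval([4, 5, 4, 3, 4, 5, 4, 4]))

   The returned confidence interval half-width can be compared against
   the difference between two MOS values to judge significance, as
   described above.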
3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below only
   apply to the luma plane, individually by frame.  When applied to a
   video, the scores of each frame are averaged to create the final
   score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean square error (MSE), or its
   square root (RMSE).  The formula used is:

      20 * log10 ( MAX / RMSE )

   or, equivalently:

      10 * log10 ( MAX^2 / MSE )

   where MAX is the maximum possible pixel value and the error is
   computed over all the pixels in the video, which is the method used
   in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per-frame, and then the values averaged
   together.  This is reported in the same way as overall PSNR.

3.3.  PSNR-HVS-M

   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients.  Several different sets of weights have been
   considered [PSNRHVS].  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The original paper produces scores ranging between 0
   and 1.

   To linearize the metric for BD-Rate computation, the score is
   converted into a nonlinear decibel scale:

      -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].  The metric score is converted to decibels in the same
   way as SSIM.

3.6.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.
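   As an informal illustration of the conventions above (overall
   versus frame-averaged pooling, and the decibel conversion used for
   SSIM and Multi-Scale SSIM), a Python sketch follows.  It assumes
   8-bit planes held in NumPy arrays; the C tools in the Daala
   repository [DAALA-GIT] remain the exact implementations:

      import numpy as np

      def overall_psnr(ref_frames, test_frames, max_value=255):
          # Overall PSNR (Section 3.1): MSE is pooled over every pixel
          # in the video before conversion to decibels.
          se, count = 0.0, 0
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              se += np.sum(diff * diff)
              count += diff.size
          mse = se / count
          return 10.0 * np.log10(max_value ** 2 / mse)

      def frame_averaged_psnr(ref_frames, test_frames, max_value=255):
          # Frame-averaged PSNR (Section 3.2): per-frame PSNR values
          # are averaged to produce the final score.
          scores = []
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              mse = np.mean(diff * diff)
              scores.append(10.0 * np.log10(max_value ** 2 / mse))
          return float(np.mean(scores))

      def ssim_to_db(ssim_score):
          # Decibel conversion for SSIM and Multi-Scale SSIM scores in
          # [0, 1), used before BD-rate computation (Sections 3.4-3.5).
          return -10.0 * np.log10(1.0 - ssim_score)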
3.7.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms, and fusing them using a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features to the
   SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

   The quality score from VMAF is used directly to calculate BD-Rate,
   without any conversions.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.

4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and test codec, BD-rate
   values are calculated as follows:

   o  Rate/distortion points are calculated for the reference and test
      codec.

      *  At least four points must be computed.  These points should
         be the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic Hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the two curves, bound by the chosen
         quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.

4.3.  Ranges

   For individual feature changes in libaom or libvpx, the overlap BD-
   Rate method with quantizers 20, 32, 43, and 55 must be used.

   For the final evaluation described in
   [I-D.ietf-netvc-requirements], the quantizers used are 20, 24, 28,
   32, 36, 39, 43, 47, 51, and 55.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.
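   The BD-rate procedure of Section 4.2 can be sketched in Python as
   follows.  This is an informative outline only, not a substitute for
   the automation described in Section 6; it assumes SciPy's PCHIP
   interpolator and distinct metric scores at each rate point:

      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_rates, ref_scores, test_rates, test_scores,
                  samples=1000):
          # Average percent rate difference of the test codec relative
          # to the reference, over the overlapping metric-score range.
          def fit(rates, scores):
              scores = np.asarray(scores, dtype=float)
              log_rates = np.log(np.asarray(rates, dtype=float))
              order = np.argsort(scores)  # PCHIP needs increasing x
              interp = PchipInterpolator(scores[order], log_rates[order])
              return interp, scores.min(), scores.max()

          ref_fn, ref_lo, ref_hi = fit(ref_rates, ref_scores)
          test_fn, test_lo, test_hi = fit(test_rates, test_scores)

          # Overlap of the two curves, bound by the chosen points.
          lo, hi = max(ref_lo, test_lo), min(ref_hi, test_hi)
          grid = np.linspace(lo, hi, samples)
          step = (hi - lo) / (samples - 1)

          def integrate(fn):
              # Trapezoidal integration of log-rate over the range.
              y = fn(grid)
              return step * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

          avg_log_ref = integrate(ref_fn) / (hi - lo)
          avg_log_test = integrate(test_fn) / (hi - lo)

          # Convert back to linear rate; percent difference from the
          # reference to the test codec.
          return (np.exp(avg_log_test - avg_log_ref) - 1.0) * 100.0

   A negative return value indicates that the test codec reaches the
   same metric scores at a lower average rate than the reference.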
In addition, metrics need 354 to support operation at high bit depth. If one or more codecs in a 355 comparison do not support high bit depth, sources need to be 356 converted once before entering the encoder. 358 5.2. Test Sets 360 Sources are divided into several categories to test different 361 scenarios the codec will be required to operate in. For easier 362 comparison, all videos in each set should have the same color 363 subsampling, same resolution, and same number of frames. In 364 addition, all test videos must be publicly available for testing use, 365 to allow for reproducibility of results. All current test sets are 366 available for download [TESTSEQUENCES]. 368 Test sequences should be downloaded in whole. They should not be 369 recreated from the original sources. 371 5.2.1. regression-1 373 This test set is used for basic regression testing. It contains a 374 very small number of clips. 376 o kirlandvga (640x360, 8bit, 4:2:0, 300 frames) 378 o FourPeople (1280x720, 8bit, 4:2:0, 60 frames) 380 o Narrarator (4096x2160, 10bit, 4:2:0, 15 frames) 382 o CSGO (1920x1080, 8bit, 4:4:4 60 frames) 384 5.2.2. objective-2-slow 386 This test set is a comprehensive test set, grouped by resolution. 387 These test clips were created from originals at [TESTSEQUENCES]. 388 They have been scaled and cropped to match the resolution of their 389 category. This test set requires compiling with high bit depth 390 support. 392 4096x2160, 4:2:0, 60 frames: 394 o Netflix_BarScene_4096x2160_60fps_10bit_420_60f 396 o Netflix_BoxingPractice_4096x2160_60fps_10bit_420_60f 398 o Netflix_Dancers_4096x2160_60fps_10bit_420_60f 400 o Netflix_Narrator_4096x2160_60fps_10bit_420_60f 402 o Netflix_RitualDance_4096x2160_60fps_10bit_420_60f 404 o Netflix_ToddlerFountain_4096x2160_60fps_10bit_420_60f 406 o Netflix_WindAndNature_4096x2160_60fps_10bit_420_60f 408 o street_hdr_amazon_2160p 410 1920x1080, 4:2:0, 60 frames: 412 o aspen_1080p_60f 414 o crowd_run_1080p50_60f 416 o ducks_take_off_1080p50_60f 418 o guitar_hdr_amazon_1080p 420 o life_1080p30_60f 422 o Netflix_Aerial_1920x1080_60fps_8bit_420_60f 423 o Netflix_Boat_1920x1080_60fps_8bit_420_60f 425 o Netflix_Crosswalk_1920x1080_60fps_8bit_420_60f 427 o Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f 429 o Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f 431 o Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f 433 o Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f 435 o old_town_cross_1080p50_60f 437 o pan_hdr_amazon_1080p 439 o park_joy_1080p50_60f 441 o pedestrian_area_1080p25_60f 443 o rush_field_cuts_1080p_60f 445 o rush_hour_1080p25_60f 447 o seaplane_hdr_amazon_1080p 449 o station2_1080p25_60f 451 o touchdown_pass_1080p_60f 453 1280x720, 4:2:0, 120 frames: 455 o boat_hdr_amazon_720p 457 o dark720p_120f 459 o FourPeople_1280x720_60_120f 461 o gipsrestat720p_120f 463 o Johnny_1280x720_60_120f 465 o KristenAndSara_1280x720_60_120f 467 o Netflix_DinnerScene_1280x720_60fps_8bit_420_120f 469 o Netflix_DrivingPOV_1280x720_60fps_8bit_420_120f 470 o Netflix_FoodMarket2_1280x720_60fps_8bit_420_120f 472 o Netflix_RollerCoaster_1280x720_60fps_8bit_420_120f 474 o Netflix_Tango_1280x720_60fps_8bit_420_120f 476 o rain_hdr_amazon_720p 478 o vidyo1_720p_60fps_120f 480 o vidyo3_720p_60fps_120f 482 o vidyo4_720p_60fps_120f 484 640x360, 4:2:0, 120 frames: 486 o blue_sky_360p_120f 488 o controlled_burn_640x360_120f 490 o desktop2360p_120f 492 o kirland360p_120f 494 o mmstationary360p_120f 496 o niklas360p_120f 498 o rain2_hdr_amazon_360p 500 o red_kayak_360p_120f 502 o 
riverbed_360p25_120f

   o  shields2_640x360_120f

   o  snow_mnt_640x360_120f

   o  speed_bag_640x360_120f

   o  stockholm_640x360_120f

   o  tacomanarrows360p_120f

   o  thaloundeskmtg360p_120f

   o  water_hdr_amazon_360p

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqhighway_240p_120f

   o  bqzoom_240p_120f

   o  chairlift_240p_120f

   o  dirtbike_240p_120f

   o  mozzoom_240p_120f

   1920x1080, 4:4:4 or 4:2:0, 60 frames:

   o  CSGO_60f.y4m

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  EuroTruckSimulator2_60f.y4m

   o  Hearthstone_60f.y4m

   o  wikipedia_420.y4m

   o  pvq_slideshow.y4m

5.2.3.  objective-2-fast

   This test set is a strict subset of objective-2-slow.  It is
   designed for faster runtime.  This test set requires compiling with
   high bit depth support.

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f

   o  ducks_take_off_1080p50_60f

   o  life_1080p30_60f

   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f

   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f

   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f

   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f

   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f

   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f

   o  rush_hour_1080p25_60f

   o  seaplane_hdr_amazon_1080p

   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p

   o  dark720p_120f

   o  gipsrestat720p_120f

   o  KristenAndSara_1280x720_60_120f

   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_60f

   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_60f

   o  vidyo1_720p_60fps_120f

   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f

   o  controlled_burn_640x360_120f

   o  kirland360p_120f

   o  niklas360p_120f

   o  rain2_hdr_amazon_360p

   o  red_kayak_360p_120f

   o  riverbed_360p25_120f

   o  shields2_640x360_120f

   o  speed_bag_640x360_120f

   o  thaloundeskmtg360p_120f

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqzoom_240p_120f

   o  dirtbike_240p_120f

   1920x1080, 4:2:0, 60 frames:

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  wikipedia_420.y4m

5.2.4.  objective-1.1

   This test set is an old version of objective-2-slow.
641 4096x2160, 10bit, 4:2:0, 60 frames: 643 o Aerial (start frame 600) 645 o BarScene (start frame 120) 647 o Boat (start frame 0) 649 o BoxingPractice (start frame 0) 651 o Crosswalk (start frame 0) 653 o Dancers (start frame 120) 655 o FoodMarket 657 o Narrator 658 o PierSeaside 660 o RitualDance 662 o SquareAndTimelapse 664 o ToddlerFountain (start frame 120) 666 o TunnelFlag 668 o WindAndNature (start frame 120) 670 1920x1080, 8bit, 4:4:4, 60 frames: 672 o CSGO 674 o DOTA2 676 o EuroTruckSimulator2 678 o Hearthstone 680 o MINECRAFT 682 o STARCRAFT 684 o wikipedia 686 o pvq_slideshow 688 1920x1080, 8bit, 4:2:0, 60 frames: 690 o ducks_take_off 692 o life 694 o aspen 696 o crowd_run 698 o old_town_cross 700 o park_joy 702 o pedestrian_area 704 o rush_field_cuts 705 o rush_hour 707 o station2 709 o touchdown_pass 711 1280x720, 8bit, 4:2:0, 60 frames: 713 o Netflix_FoodMarket2 715 o Netflix_Tango 717 o DrivingPOV (start frame 120) 719 o DinnerScene (start frame 120) 721 o RollerCoaster (start frame 600) 723 o FourPeople 725 o Johnny 727 o KristenAndSara 729 o vidyo1 731 o vidyo3 733 o vidyo4 735 o dark720p 737 o gipsrecmotion720p 739 o gipsrestat720p 741 o controlled_burn 743 o stockholm 745 o speed_bag 747 o snow_mnt 749 o shields 751 640x360, 8bit, 4:2:0, 60 frames: 753 o red_kayak 755 o blue_sky 757 o riverbed 759 o thaloundeskmtgvga 761 o kirlandvga 763 o tacomanarrowsvga 765 o tacomascmvvga 767 o desktop2360p 769 o mmmovingvga 771 o mmstationaryvga 773 o niklasvga 775 5.2.5. objective-1-fast 777 This is an old version of objective-2-fast. 779 1920x1080, 8bit, 4:2:0, 60 frames: 781 o Aerial (start frame 600) 783 o Boat (start frame 0) 785 o Crosswalk (start frame 0) 787 o FoodMarket 789 o PierSeaside 791 o SquareAndTimelapse 793 o TunnelFlag 795 1920x1080, 8bit, 4:2:0, 60 frames: 797 o CSGO 799 o EuroTruckSimulator2 800 o MINECRAFT 802 o wikipedia 804 1920x1080, 8bit, 4:2:0, 60 frames: 806 o ducks_take_off 808 o aspen 810 o old_town_cross 812 o pedestrian_area 814 o rush_hour 816 o touchdown_pass 818 1280x720, 8bit, 4:2:0, 60 frames: 820 o Netflix_FoodMarket2 822 o DrivingPOV (start frame 120) 824 o RollerCoaster (start frame 600) 826 o Johnny 828 o vidyo1 830 o vidyo4 832 o gipsrecmotion720p 834 o speed_bag 836 o shields 838 640x360, 8bit, 4:2:0, 60 frames: 840 o red_kayak 842 o riverbed 844 o kirlandvga 846 o tacomascmvvga 847 o mmmovingvga 849 o niklasvga 851 5.3. Operating Points 853 Four operating modes are defined. High latency is intended for on 854 demand streaming, one-to-many live streaming, and stored video. Low 855 latency is intended for videoconferencing and remote access. Both of 856 these modes come in CQP and unconstrained variants. When testing 857 still image sets, such as subset1, high latency CQP mode should be 858 used. 860 5.3.1. Common settings 862 Encoders should be configured to their best settings when being 863 compared against each other: 865 o av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0 -cpu-used=0 866 -threads=1 868 5.3.2. High Latency CQP 870 High Latency CQP is used for evaluating incremental changes to a 871 codec. This method is well suited to compare codecs with similar 872 coding tools. It allows codec features with intrinsic frame delay. 874 o daala: -v=x -b 2 876 o vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2 878 o av1: -end-usage=q -cq-level=x -auto-alt-ref=2 880 5.3.3. Low Latency CQP 882 Low Latency CQP is used for evaluating incremental changes to a 883 codec. 
   This method is well suited to compare codecs with similar coding
   tools.  It requires the codec to be set for zero intrinsic frame
   delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist in order to automate
   the process of objective comparisons.  The Compare-Codecs tool
   allows BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate the various metrics used.  In
   addition, these scripts can be run automatically utilizing
   distributed computers for fast results, with rd_tool [RD_TOOL].
   This tool can be run via a web interface called AreWeCompressedYet
   [AWCY], or locally.

   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests run on a small number of short sequences -
   regression-1.  The regression tests should include a variety of
   test conditions.  The purpose of regression tests is to ensure that
   bug fixes (and similar patches) do not negatively affect
   performance.  The anchor in regression tests is the previous
   revision of the codec in source control.  Regression tests are run
   on both high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of encode or
   bitstream should run an objective performance test.  The
   performance tests should be run on a wider number of sequences.
   The following data should be reported:

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (objective-1-fast)

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-Rate score, in percent, for each clip.

      *  The average of all BD-Rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-Rate scores for all videos in all
         categories.

   Normally, the encoder should always be run at the slowest, highest
   quality speed setting (cpu-used=0 in the case of AV1 and VP9).
   However, to reduce computation time, both the reference and changed
   encoders can be built with some options disabled.  For AV1,
   -disable-ext_partition and -disable-ext_partition_types can be
   passed to the configure script to substantially speed up encoding,
   but the usage of these options must be reported in the test
   results.
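   A minimal sketch of the reporting step above follows.  It is
   illustrative only; the per-clip BD-rate values and the clip-to-
   category mapping are assumed to come from the automation described
   in this section (e.g. rd_tool):

      from collections import defaultdict

      def summarize_bd_rates(per_clip, categories):
          # per_clip maps clip name -> BD-rate percentage; categories
          # maps clip name -> resolution category (e.g. "1920x1080").
          by_category = defaultdict(list)
          for clip, score in per_clip.items():
              by_category[categories[clip]].append(score)

          report = {"clips": dict(per_clip)}
          # Average of all BD-rate scores, equally weighted, for each
          # resolution category in the test set.
          report["categories"] = {
              cat: sum(s) / len(s) for cat, s in by_category.items()
          }
          # Average of all BD-rate scores for all clips in all
          # categories.
          report["overall"] = sum(per_clip.values()) / len(per_clip)
          return report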
6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016, .

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012, .

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012, .

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015, .

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015, .

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d., .

   [I-D.ietf-netvc-requirements]
              Filippov, A., Norkin, A., and j.
              jose.roberto.alvarez@huawei.com, "