Network Working Group                                           T. Daede
Internet-Draft                                                   Mozilla
Intended status: Informational                                 A. Norkin
Expires: September 28, 2017                                       Netflix
                                                           I. Brailovskiy
                                                            Amazon Lab126
                                                           March 27, 2017


              Video Codec Testing and Quality Measurement
                       draft-ietf-netvc-testing-05

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 28, 2017.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your
   rights and restrictions with respect to this document.  Code
   Components extracted from this document must include Simplified BSD
   License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Video Pair Comparison
     2.3.  Subjective viewing test
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  CIEDE2000
     3.7.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  BD-Rate
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
       5.2.1.  regression-1
       5.2.2.  objective-2-slow
       5.2.3.  objective-2-fast
       5.2.4.  objective-1.1
       5.2.5.  objective-1-fast
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency CQP
       5.3.3.  Low Latency CQP
       5.3.4.  Unconstrained High Latency
       5.3.5.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  Informative References
   Authors' Addresses

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing is the preferred method of testing video codecs.
   Subjective testing results take priority over objective testing
   results, when available.  Subjective testing is recommended
   especially when taking advantage of psychovisual effects that may
   not be well represented by objective metrics, or when different
   objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.

   Testing relies on the resources of participants.  For this reason,
   even if the group agrees that a particular test is important, if no
   one volunteers to do it, or if volunteers do not complete it in a
   timely fashion, then that test should be discarded.  This ensures
   that only important tests are done - in particular, the tests that
   are important to participants.

2.1.  Still Image Pair Comparison

   A simple way to determine whether one compressed image is superior
   to another is to visually compare the two compressed images and
   have the viewer judge which one has higher quality.  This is used
   for rapid comparisons during development - the viewer may be a
   developer or user, for example.  Because testing is done on still
   images (keyframes), this is only suitable for changes with similar
   or no effect on other frames.  For example, this test may be
   suitable for an intra de-ringing filter, but not for a new inter
   prediction mode.  For this test, the two compressed images should
   have similar compressed file sizes, with one image being no more
   than 5% larger than the other.  In addition, at least 5 different
   images should be compared.

2.2.  Video Pair Comparison

   Video comparisons are necessary when making changes with temporal
   effects, such as changes to inter-frame prediction.  Video pair
   comparisons follow the same procedure as still images.

2.3.  Subjective viewing test

   A subjective viewing test is the preferred method of evaluating
   quality.  The subjective test should be performed either by showing
   the video sequences consecutively on one screen or by showing them
   on two screens located side by side.  The testing procedure should
   normally follow rules described in [BT500] and be performed with
   non-expert test subjects.  The result of the test could be
   (depending on the test procedure) mean opinion scores (MOS) or
   differential mean opinion scores (DMOS).  Normally, confidence
   intervals are also calculated to judge whether the difference
   between two encodings is statistically significant.  In certain
   cases, a viewing test with expert test subjects can be performed,
   for example when the test is intended to evaluate technologies with
   similar performance with respect to a particular artifact (e.g.,
   loop filters or motion prediction).  Depending on the setup of the
   test, the output could be a MOS, a DMOS, or the percentage of
   experts who preferred one technology over the other.

3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below
   apply only to the luma plane, individually by frame.  When applied
   to a video, the scores of each frame are averaged to create the
   final score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.
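   As a non-normative illustration of the per-frame averaging
   convention described above, the following Python sketch applies an
   arbitrary single-frame metric to the luma plane of each frame and
   averages the per-frame scores.  The frame representation and the
   "frame_metric" function are assumptions made for the sketch, not
   requirements of this document.

      import numpy as np

      def sequence_score(ref_luma_frames, enc_luma_frames, frame_metric):
          # frame_metric is any per-frame scoring function (for example
          # PSNR or SSIM computed on the luma plane); frames are
          # assumed to be NumPy arrays of luma samples.
          scores = [frame_metric(ref, enc)
                    for ref, enc in zip(ref_luma_frames, enc_luma_frames)]
          # The final score for the sequence is the mean of the
          # per-frame scores.
          return float(np.mean(scores))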
3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from mean square error (MSE), or its square
   root (RMSE).  The formula used is:

      20 * log10 ( MAX / RMSE )

   or, equivalently:

      10 * log10 ( MAX^2 / MSE )

   where MAX is the maximum possible pixel value and the error is
   computed over all the pixels in the video, which is the method used
   in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per frame, and then the values averaged
   together.  This is reported in the same way as overall PSNR.

3.3.  PSNR-HVS-M

   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients.  Several different sets of weights have been
   considered [PSNRHVS].  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The original paper produces scores ranging between 0
   and 1.

   To linearize the metric for BD-Rate computation, the score is
   converted into a nonlinear decibel scale:

      -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].  The metric score is converted to decibels in the same
   way as SSIM.

3.6.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.

3.7.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms, and fusing them using a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features for
   the SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

   The quality score from VMAF is used directly to calculate BD-Rate,
   without any conversions.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.
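   The following Python fragment is a non-normative sketch of this
   conversion and of assembling the points for such a graph.  The
   bitrate/score pairs shown are hypothetical values, not
   measurements.

      import math

      def ssim_to_db(score):
          # SSIM and MS-SSIM scores in [0, 1) are mapped to a decibel
          # scale as described in Sections 3.4 and 3.5.
          return -10.0 * math.log10(1.0 - score)

      # Hypothetical (bitrate in kbps, SSIM score) pairs for one run.
      points = [(500, 0.950), (1000, 0.970), (2000, 0.985), (4000, 0.992)]

      # Bitrate stays linear on the X axis; the Y value is in decibels.
      rd_curve = [(rate, ssim_to_db(score)) for rate, score in points]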
4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and a test codec, BD-rate
   values are calculated as follows:

   o  Rate/distortion points are calculated for the reference and test
      codec.

      *  At least four points must be computed.  These points should
         use the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic Hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the metric ranges covered by the two
         curves, bounded by the chosen quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.

   A non-normative sketch of this computation follows Section 4.3.

4.3.  Ranges

   For individual feature changes in libaom or libvpx, the overlap BD-
   Rate method with quantizers 20, 32, 43, and 55 must be used.

   For the final evaluation described in
   [I-D.ietf-netvc-requirements], the quantizers used are 20, 24, 28,
   32, 36, 39, 43, 47, 51, and 55.
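   As a non-normative illustration of the overlap BD-rate procedure in
   Section 4.2, the sketch below fits a piecewise cubic Hermite
   interpolating polynomial with SciPy and integrates log-rate over
   the overlapping metric range.  The rate/score points would come
   from runs at the quantizers listed above (for example 20, 32, 43,
   and 55); the use of SciPy here is an implementation convenience,
   not a requirement of this document.

      import numpy as np
      from scipy.interpolate import PchipInterpolator
      from scipy.integrate import trapezoid

      def bd_rate_overlap(ref_points, test_points):
          # ref_points and test_points are lists of (bitrate, metric
          # score) pairs, one per quantizer, at least four per codec.
          def fit(points):
              points = sorted(points, key=lambda p: p[1])
              scores = np.array([score for _, score in points])
              log_rates = np.log10([rate for rate, _ in points])
              # Piecewise cubic Hermite interpolation of log-rate as a
              # function of the metric score.
              return (PchipInterpolator(scores, log_rates),
                      scores[0], scores[-1])

          ref_fit, ref_lo, ref_hi = fit(ref_points)
          test_fit, test_lo, test_hi = fit(test_points)

          # Overlap of the two metric score ranges, bounded by the
          # chosen quantizer points.
          lo, hi = max(ref_lo, test_lo), min(ref_hi, test_hi)

          # Trapezoidal integration of log-rate over the metric range,
          # using at least 1000 samples.
          samples = np.linspace(lo, hi, 1000)
          ref_avg_log = trapezoid(ref_fit(samples), samples) / (hi - lo)
          test_avg_log = trapezoid(test_fit(samples), samples) / (hi - lo)

          # Convert the average log-rates back to linear rate and
          # return the percent difference from the reference to the
          # test codec.
          ref_rate = 10.0 ** ref_avg_log
          test_rate = 10.0 ** test_avg_log
          return 100.0 * (test_rate - ref_rate) / ref_rate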
5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.  In addition, metrics
   need to support operation at high bit depth.  If one or more codecs
   in a comparison do not support high bit depth, sources need to be
   converted once before entering the encoder.

5.2.  Test Sets

   Sources are divided into several categories to test different
   scenarios the codec will be required to operate in.  For easier
   comparison, all videos in each set should have the same color
   subsampling, same resolution, and same number of frames.  In
   addition, all test videos must be publicly available for testing
   use, to allow for reproducibility of results.  All current test
   sets are available for download [TESTSEQUENCES].

   Test sequences should be downloaded in whole.  They should not be
   recreated from the original sources.

5.2.1.  regression-1

   This test set is used for basic regression testing.  It contains a
   very small number of clips.

   o  kirlandvga (640x360, 8bit, 4:2:0, 300 frames)
   o  FourPeople (1280x720, 8bit, 4:2:0, 60 frames)
   o  Narrarator (4096x2160, 10bit, 4:2:0, 15 frames)
   o  CSGO (1920x1080, 8bit, 4:4:4, 60 frames)

5.2.2.  objective-2-slow

   This test set is a comprehensive test set, grouped by resolution.
   These test clips were created from originals at [TESTSEQUENCES].
   They have been scaled and cropped to match the resolution of their
   category.  This test set requires compiling with high bit depth
   support.

   4096x2160, 4:2:0, 60 frames:

   o  Netflix_BarScene_4096x2160_60fps_10bit_420_60f
   o  Netflix_BoxingPractice_4096x2160_60fps_10bit_420_60f
   o  Netflix_Dancers_4096x2160_60fps_10bit_420_60f
   o  Netflix_Narrator_4096x2160_60fps_10bit_420_60f
   o  Netflix_RitualDance_4096x2160_60fps_10bit_420_60f
   o  Netflix_ToddlerFountain_4096x2160_60fps_10bit_420_60f
   o  Netflix_WindAndNature_4096x2160_60fps_10bit_420_60f
   o  street_hdr_amazon_2160p

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f
   o  crowd_run_1080p50_60f
   o  ducks_take_off_1080p50_60f
   o  guitar_hdr_amazon_1080p
   o  life_1080p30_60f
   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f
   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f
   o  Netflix_Crosswalk_1920x1080_60fps_8bit_420_60f
   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f
   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f
   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f
   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f
   o  old_town_cross_1080p50_60f
   o  pan_hdr_amazon_1080p
   o  park_joy_1080p50_60f
   o  pedestrian_area_1080p25_60f
   o  rush_field_cuts_1080p_60f
   o  rush_hour_1080p25_60f
   o  seaplane_hdr_amazon_1080p
   o  station2_1080p25_60f
   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p
   o  dark720p_120f
   o  FourPeople_1280x720_60_120f
   o  gipsrestat720p_120f
   o  Johnny_1280x720_60_120f
   o  KristenAndSara_1280x720_60_120f
   o  Netflix_DinnerScene_1280x720_60fps_8bit_420_120f
   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_120f
   o  Netflix_FoodMarket2_1280x720_60fps_8bit_420_120f
   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_120f
   o  Netflix_Tango_1280x720_60fps_8bit_420_120f
   o  rain_hdr_amazon_720p
   o  vidyo1_720p_60fps_120f
   o  vidyo3_720p_60fps_120f
   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f
   o  controlled_burn_640x360_120f
   o  desktop2360p_120f
   o  kirland360p_120f
   o  mmstationary360p_120f
   o  niklas360p_120f
   o  rain2_hdr_amazon_360p
   o  red_kayak_360p_120f
   o  riverbed_360p25_120f
   o  shields2_640x360_120f
   o  snow_mnt_640x360_120f
   o  speed_bag_640x360_120f
   o  stockholm_640x360_120f
   o  tacomanarrows360p_120f
   o  thaloundeskmtg360p_120f
   o  water_hdr_amazon_360p

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f
   o  bqhighway_240p_120f
   o  bqzoom_240p_120f
   o  chairlift_240p_120f
   o  dirtbike_240p_120f
   o  mozzoom_240p_120f

   1920x1080, 4:4:4 or 4:2:0, 60 frames:

   o  CSGO_60f.y4m
   o  DOTA2_60f_420.y4m
   o  MINECRAFT_60f_420.y4m
   o  STARCRAFT_60f_420.y4m
   o  EuroTruckSimulator2_60f.y4m
   o  Hearthstone_60f.y4m
   o  wikipedia_420.y4m
   o  pvq_slideshow.y4m

5.2.3.  objective-2-fast

   This test set is a strict subset of objective-2-slow.  It is
   designed for faster runtime.  This test set requires compiling with
   high bit depth support.
   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f
   o  ducks_take_off_1080p50_60f
   o  life_1080p30_60f
   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f
   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f
   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f
   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f
   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f
   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f
   o  rush_hour_1080p25_60f
   o  seaplane_hdr_amazon_1080p
   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p
   o  dark720p_120f
   o  gipsrestat720p_120f
   o  KristenAndSara_1280x720_60_120f
   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_60f
   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_60f
   o  vidyo1_720p_60fps_120f
   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f
   o  controlled_burn_640x360_120f
   o  kirland360p_120f
   o  niklas360p_120f
   o  rain2_hdr_amazon_360p
   o  red_kayak_360p_120f
   o  riverbed_360p25_120f
   o  shields2_640x360_120f
   o  speed_bag_640x360_120f
   o  thaloundeskmtg360p_120f

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f
   o  bqzoom_240p_120f
   o  dirtbike_240p_120f

   1920x1080, 4:2:0, 60 frames:

   o  DOTA2_60f_420.y4m
   o  MINECRAFT_60f_420.y4m
   o  STARCRAFT_60f_420.y4m
   o  wikipedia_420.y4m

5.2.4.  objective-1.1

   This test set is an old version of objective-2-slow.

   4096x2160, 10bit, 4:2:0, 60 frames:

   o  Aerial (start frame 600)
   o  BarScene (start frame 120)
   o  Boat (start frame 0)
   o  BoxingPractice (start frame 0)
   o  Crosswalk (start frame 0)
   o  Dancers (start frame 120)
   o  FoodMarket
   o  Narrator
   o  PierSeaside
   o  RitualDance
   o  SquareAndTimelapse
   o  ToddlerFountain (start frame 120)
   o  TunnelFlag
   o  WindAndNature (start frame 120)

   1920x1080, 8bit, 4:4:4, 60 frames:

   o  CSGO
   o  DOTA2
   o  EuroTruckSimulator2
   o  Hearthstone
   o  MINECRAFT
   o  STARCRAFT
   o  wikipedia
   o  pvq_slideshow

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  ducks_take_off
   o  life
   o  aspen
   o  crowd_run
   o  old_town_cross
   o  park_joy
   o  pedestrian_area
   o  rush_field_cuts
   o  rush_hour
   o  station2
   o  touchdown_pass

   1280x720, 8bit, 4:2:0, 60 frames:

   o  Netflix_FoodMarket2
   o  Netflix_Tango
   o  DrivingPOV (start frame 120)
   o  DinnerScene (start frame 120)
   o  RollerCoaster (start frame 600)
   o  FourPeople
   o  Johnny
   o  KristenAndSara
   o  vidyo1
   o  vidyo3
   o  vidyo4
   o  dark720p
   o  gipsrecmotion720p
   o  gipsrestat720p
   o  controlled_burn
   o  stockholm
   o  speed_bag
   o  snow_mnt
   o  shields

   640x360, 8bit, 4:2:0, 60 frames:

   o  red_kayak
   o  blue_sky
   o  riverbed
   o  thaloundeskmtgvga
   o  kirlandvga
   o  tacomanarrowsvga
   o  tacomascmvvga
   o  desktop2360p
   o  mmmovingvga
   o  mmstationaryvga
   o  niklasvga

5.2.5.  objective-1-fast

   This is an old version of objective-2-fast.
   1920x1080, 8bit, 4:2:0, 60 frames:

   o  Aerial (start frame 600)
   o  Boat (start frame 0)
   o  Crosswalk (start frame 0)
   o  FoodMarket
   o  PierSeaside
   o  SquareAndTimelapse
   o  TunnelFlag

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  CSGO
   o  EuroTruckSimulator2
   o  MINECRAFT
   o  wikipedia

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  ducks_take_off
   o  aspen
   o  old_town_cross
   o  pedestrian_area
   o  rush_hour
   o  touchdown_pass

   1280x720, 8bit, 4:2:0, 60 frames:

   o  Netflix_FoodMarket2
   o  DrivingPOV (start frame 120)
   o  RollerCoaster (start frame 600)
   o  Johnny
   o  vidyo1
   o  vidyo4
   o  gipsrecmotion720p
   o  speed_bag
   o  shields

   640x360, 8bit, 4:2:0, 60 frames:

   o  red_kayak
   o  riverbed
   o  kirlandvga
   o  tacomascmvvga
   o  mmmovingvga
   o  niklasvga

5.3.  Operating Points

   Four operating modes are defined.  High latency is intended for on
   demand streaming, one-to-many live streaming, and stored video.
   Low latency is intended for videoconferencing and remote access.
   Both of these modes come in CQP (constant quantization parameter)
   and unconstrained variants.  When testing still image sets, such as
   subset1, high latency CQP mode should be used.

5.3.1.  Common settings

   Encoders should be configured to their best settings when being
   compared against each other:

   o  av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0
      -cpu-used=0 -threads=1

5.3.2.  High Latency CQP

   High Latency CQP is used for evaluating incremental changes to a
   codec.  This method is well suited to comparing codecs with similar
   coding tools.  It allows codec features with intrinsic frame delay.

   o  daala: -v=x -b 2

   o  vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.3.  Low Latency CQP

   Low Latency CQP is used for evaluating incremental changes to a
   codec.  This method is well suited to comparing codecs with similar
   coding tools.  It requires the codec to be set for zero intrinsic
   frame delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist to automate the
   process of objective comparisons.  The Compare-Codecs tool allows
   BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate the computation of the various
   metrics.  In addition, these scripts can be run on distributed
   machines for fast results, using rd_tool [RD_TOOL].  This tool can
   be run via a web interface called AreWeCompressedYet [AWCY], or
   locally.
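   As a local, non-normative example of such automation, the following
   Python sketch encodes one clip from objective-2-fast at the
   quantizers listed in Section 4.3, using the common settings of
   Section 5.3.1 and the High Latency CQP settings of Section 5.3.2.
   The encoder binary name ("aomenc"), the double-dash option
   spellings, and the file names are assumptions of the sketch; an
   actual setup would also decode the output and run the metric tools
   described in Section 3.

      import subprocess

      ENCODER = "aomenc"   # assumed AV1 encoder binary name
      CLIP = "Netflix_Aerial_1920x1080_60fps_8bit_420_60f.y4m"
      QUANTIZERS = [20, 32, 43, 55]

      for q in QUANTIZERS:
          output = "aerial_q%d.ivf" % q
          subprocess.run(
              [ENCODER, CLIP, "-o", output,
               # Common settings (Section 5.3.1).
               "--codec=av1", "--ivf", "--frame-parallel=0",
               "--tile-columns=0", "--cpu-used=0", "--threads=1",
               # High Latency CQP settings (Section 5.3.2).
               "--end-usage=q", "--cq-level=%d" % q,
               "--lag-in-frames=25", "--auto-alt-ref=2"],
              check=True)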
   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests are run on a small number of short sequences - the
   regression-1 test set.  The regression tests should include a
   number of different test conditions.  The purpose of regression
   tests is to ensure that bug fixes (and similar patches) do not
   negatively affect performance.  The anchor in regression tests is
   the previous revision of the codec in source control.  Regression
   tests are run in both high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of the encoded
   output or the bitstream should be evaluated with an objective
   performance test.  The performance tests should be run on a larger
   number of sequences.  The following data should be reported:

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (objective-1).

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-Rate score, in percent, for each clip.

      *  The average of all BD-Rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-Rate scores for all videos in all
         categories.

   For non-tool contributions, the test set objective-1-fast can be
   substituted.

6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016, .

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012, .

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012, .

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015, .

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015, .

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d., .

   [FASTSSIM]
              Chen, M. and A. Bovik, "Fast structural similarity index
              algorithm", 2010, .

   [I-D.ietf-netvc-requirements]
              Filippov, A., Norkin, A., and j.
              jose.roberto.alvarez@huawei.com, "