Network Working Group                                           T. Daede
Internet-Draft                                                    Mozilla
Intended status: Informational                                  A. Norkin
Expires: August 3, 2020                                           Netflix
                                                           I. Brailovskiy
                                                            Amazon Lab126
                                                         January 31, 2020

              Video Codec Testing and Quality Measurement
                       draft-ietf-netvc-testing-09

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 3, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Video Pair Comparison
     2.3.  Mean Opinion Score
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  CIEDE2000
     3.7.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  BD-Rate
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
       5.2.1.  regression-1
       5.2.2.  objective-2-slow
       5.2.3.  objective-2-fast
       5.2.4.  objective-1.1
       5.2.5.  objective-1-fast
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency CQP
       5.3.3.  Low Latency CQP
       5.3.4.  Unconstrained High Latency
       5.3.5.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  IANA Considerations
   8.  Security Considerations
   9.  Informative References
   Authors' Addresses

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing uses human viewers to rate and compare the
   quality of videos.  It is the preferred method of testing video
   codecs.

   Subjective testing results take priority over objective testing
   results, when available.  Subjective testing is recommended
   especially when taking advantage of psychovisual effects that may
   not be well represented by objective metrics, or when different
   objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.
   Testing relies on the resources of participants.  If a participant
   requires a subjective test for a particular feature or improvement,
   they are responsible for ensuring that resources are available.
   This ensures that only important tests are done; in particular, the
   tests that are important to participants.

   Subjective tests should use the same operating points as the
   objective tests.

2.1.  Still Image Pair Comparison

   A simple way to determine whether one compressed image is superior
   to another is to visually compare the two compressed images and
   have the viewer judge which one has higher quality.  For example,
   this test may be suitable for an intra de-ringing filter, but not
   for a new inter prediction mode.  For this test, the two compressed
   images should have similar compressed file sizes, with one image
   being no more than 5% larger than the other.  In addition, at least
   5 different images should be compared.

   Once testing is complete, a p-value can be computed using the
   binomial test.  A significant result should have a resulting
   p-value less than or equal to 0.05.  For example:

   p_value = binom_test(a,a+b)

   where a is the number of votes for one video, b is the number of
   votes for the second video, and binom_test(x,y) returns the p-value
   of a binomial test with x observed votes out of y total votes and
   an expected vote probability of 0.5.

   If ties are allowed to be reported, then the equation is modified:

   p_value = binom_test(a+floor(t/2),a+b+t)

   where t is the number of tie votes.

   Still image pair comparison is used for rapid comparisons during
   development - the viewer may be either a developer or a user, for
   example.  Because the results are only relative, it is effective
   even in an inconsistent viewing environment.  Because this test
   only uses still images (keyframes), it is only suitable for changes
   with similar or no effect on inter frames.
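   The p-value computation above can be scripted directly.  The
   following is a minimal sketch in Python, assuming SciPy is
   available; the helper name and the example vote counts are
   illustrative only and not part of this document:

      # Sketch of the pair-comparison significance test described above.
      # Assumes SciPy >= 1.7, where binomtest() returns a result object
      # with a .pvalue attribute.
      from math import floor
      from scipy.stats import binomtest

      def pair_comparison_p_value(a, b, ties=0):
          """p-value for a votes vs. b votes, with ties split evenly."""
          successes = a + floor(ties / 2)
          total = a + b + ties
          return binomtest(successes, total, p=0.5).pvalue

      # Hypothetical example: 17 votes for one encode, 6 for the other,
      # and 2 ties.
      p = pair_comparison_p_value(17, 6, ties=2)
      significant = p <= 0.05

   Whether a one-sided or two-sided test is appropriate depends on the
   hypothesis being tested; binomtest() defaults to a two-sided test.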
2.2.  Video Pair Comparison

   The still image pair comparison method can be modified to also
   compare videos.  This is necessary when making changes with
   temporal effects, such as changes to inter-frame prediction.  Video
   pair comparisons follow the same procedure as still images.  Videos
   used for testing should be limited to 10 seconds in length, and can
   be rewatched an unlimited number of times.

2.3.  Mean Opinion Score

   A Mean Opinion Score (MOS) viewing test is the preferred method of
   evaluating quality.  The subjective test should be performed either
   by showing the video sequences consecutively on one screen or by
   showing them simultaneously on two screens located side by side.
   The testing procedure should normally follow the rules described in
   [BT500] and be performed with non-expert test subjects.  The result
   of the test will be (depending on the test procedure) mean opinion
   scores (MOS) or differential mean opinion scores (DMOS).
   Confidence intervals are also calculated to judge whether the
   difference between two encodings is statistically significant.  In
   certain cases, a viewing test with expert test subjects can be
   performed, for example when a test must evaluate technologies with
   similar performance with respect to a particular artifact (e.g.
   loop filters or motion prediction).  Unlike pair comparisons, a MOS
   test requires a consistent testing environment.  This means that
   for large scale or distributed tests, pair comparisons are
   preferred.

3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below
   apply only to the luma plane, individually by frame.  When applied
   to a video, the scores of each frame are averaged to create the
   final score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean squared error (MSE), or its
   square root (RMSE).  The formula used is:

   20 * log10 ( MAX / RMSE )

   or, equivalently:

   10 * log10 ( MAX^2 / MSE )

   where the error is computed over all the pixels in the video, which
   is the method used in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per frame, and then the values averaged
   together.  This is reported in the same way as overall PSNR.

3.3.  PSNR-HVS-M

   The PSNR-HVS [PSNRHVS] metric performs a DCT transform of 8x8
   blocks of the image, weights the coefficients, and then calculates
   the PSNR of those coefficients.  Several different sets of weights
   have been considered.  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The method described in the original paper produces
   scores that range between 0 and 1.

   To linearize the metric for BD-Rate computation, the score is
   converted into a decibel scale:

   -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].  The metric score is converted to decibels in the same
   way as SSIM.

3.6.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.
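   As a summary of the score handling shared by the metrics above,
   here is a small illustrative Python sketch of the per-frame
   averaging (Section 3) and the decibel conversions used for PSNR
   (Section 3.1) and SSIM/MS-SSIM (Sections 3.4 and 3.5).  The
   function names are placeholders; the reference implementations are
   the C tools in the Daala repository [DAALA-GIT].

      import math

      def psnr_from_mse(mse, max_value=255.0):
          # Overall PSNR from a mean squared error (Section 3.1).
          # max_value would be 1023.0 for 10-bit content.
          return 10.0 * math.log10(max_value ** 2 / mse)

      def ssim_to_db(ssim):
          # Convert an SSIM or MS-SSIM score in [0, 1) to decibels
          # (Sections 3.4 and 3.5).
          return -10.0 * math.log10(1.0 - ssim)

      def frame_averaged(per_frame_scores):
          # Average per-frame scores into a final score (Section 3).
          return sum(per_frame_scores) / len(per_frame_scores)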
3.7.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms and fusing them using a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features for
   the SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

   The quality score from VMAF is used directly to calculate BD-Rate,
   without any conversions.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is shown on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.

4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and a test codec, BD-rate
   values are calculated as follows:

   o  Rate/distortion points are calculated for the reference and test
      codec.

      *  At least four points must be computed.  These points should
         be the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the two curves, bounded by the chosen
         quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.

4.3.  Ranges

   For individual feature changes in libaom or libvpx, the overlap BD-
   Rate method with quantizers 20, 32, 43, and 55 must be used.

   For the final evaluation described in
   [I-D.ietf-netvc-requirements], the quantizers used are 20, 24, 28,
   32, 36, 39, 43, 47, 51, and 55.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.  In addition, metrics
   need to support operation at high bit depth.  If one or more codecs
   in a comparison do not support high bit depth, sources need to be
   converted once before entering the encoder.

5.2.
Test Sets 362 Sources are divided into several categories to test different 363 scenarios the codec will be required to operate in. For easier 364 comparison, all videos in each set should have the same color 365 subsampling, same resolution, and same number of frames. In 366 addition, all test videos must be publicly available for testing use, 367 to allow for reproducibility of results. All current test sets are 368 available for download [TESTSEQUENCES]. 370 Test sequences should be downloaded in whole. They should not be 371 recreated from the original sources. 373 Each clip is labeled with its resolution, bit depth, color 374 subsampling, and length. 376 5.2.1. regression-1 378 This test set is used for basic regression testing. It contains a 379 very small number of clips. 381 o kirlandvga (640x360, 8bit, 4:2:0, 300 frames) 383 o FourPeople (1280x720, 8bit, 4:2:0, 60 frames) 385 o Narrarator (4096x2160, 10bit, 4:2:0, 15 frames) 387 o CSGO (1920x1080, 8bit, 4:4:4 60 frames) 389 5.2.2. objective-2-slow 391 This test set is a comprehensive test set, grouped by resolution. 392 These test clips were created from originals at [TESTSEQUENCES]. 393 They have been scaled and cropped to match the resolution of their 394 category. This test set requires a codec that supports both 8 and 10 395 bit video. 397 4096x2160, 4:2:0, 60 frames: 399 o Netflix_BarScene_4096x2160_60fps_10bit_420_60f 401 o Netflix_BoxingPractice_4096x2160_60fps_10bit_420_60f 403 o Netflix_Dancers_4096x2160_60fps_10bit_420_60f 405 o Netflix_Narrator_4096x2160_60fps_10bit_420_60f 407 o Netflix_RitualDance_4096x2160_60fps_10bit_420_60f 409 o Netflix_ToddlerFountain_4096x2160_60fps_10bit_420_60f 411 o Netflix_WindAndNature_4096x2160_60fps_10bit_420_60f 413 o street_hdr_amazon_2160p 415 1920x1080, 4:2:0, 60 frames: 417 o aspen_1080p_60f 419 o crowd_run_1080p50_60f 421 o ducks_take_off_1080p50_60f 423 o guitar_hdr_amazon_1080p 424 o life_1080p30_60f 426 o Netflix_Aerial_1920x1080_60fps_8bit_420_60f 428 o Netflix_Boat_1920x1080_60fps_8bit_420_60f 430 o Netflix_Crosswalk_1920x1080_60fps_8bit_420_60f 432 o Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f 434 o Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f 436 o Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f 438 o Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f 440 o old_town_cross_1080p50_60f 442 o pan_hdr_amazon_1080p 444 o park_joy_1080p50_60f 446 o pedestrian_area_1080p25_60f 448 o rush_field_cuts_1080p_60f 450 o rush_hour_1080p25_60f 452 o seaplane_hdr_amazon_1080p 454 o station2_1080p25_60f 456 o touchdown_pass_1080p_60f 458 1280x720, 4:2:0, 120 frames: 460 o boat_hdr_amazon_720p 462 o dark720p_120f 464 o FourPeople_1280x720_60_120f 466 o gipsrestat720p_120f 468 o Johnny_1280x720_60_120f 470 o KristenAndSara_1280x720_60_120f 471 o Netflix_DinnerScene_1280x720_60fps_8bit_420_120f 473 o Netflix_DrivingPOV_1280x720_60fps_8bit_420_120f 475 o Netflix_FoodMarket2_1280x720_60fps_8bit_420_120f 477 o Netflix_RollerCoaster_1280x720_60fps_8bit_420_120f 479 o Netflix_Tango_1280x720_60fps_8bit_420_120f 481 o rain_hdr_amazon_720p 483 o vidyo1_720p_60fps_120f 485 o vidyo3_720p_60fps_120f 487 o vidyo4_720p_60fps_120f 489 640x360, 4:2:0, 120 frames: 491 o blue_sky_360p_120f 493 o controlled_burn_640x360_120f 495 o desktop2360p_120f 497 o kirland360p_120f 499 o mmstationary360p_120f 501 o niklas360p_120f 503 o rain2_hdr_amazon_360p 505 o red_kayak_360p_120f 507 o riverbed_360p25_120f 509 o shields2_640x360_120f 511 o snow_mnt_640x360_120f 513 o speed_bag_640x360_120f 515 o 
stockholm_640x360_120f

   o  tacomanarrows360p_120f

   o  thaloundeskmtg360p_120f

   o  water_hdr_amazon_360p

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqhighway_240p_120f

   o  bqzoom_240p_120f

   o  chairlift_240p_120f

   o  dirtbike_240p_120f

   o  mozzoom_240p_120f

   1920x1080, 4:4:4 or 4:2:0, 60 frames:

   o  CSGO_60f.y4m

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  EuroTruckSimulator2_60f.y4m

   o  Hearthstone_60f.y4m

   o  wikipedia_420.y4m

   o  pvq_slideshow.y4m

5.2.3.  objective-2-fast

   This test set is a strict subset of objective-2-slow.  It is
   designed for faster runtime.  This test set requires a codec
   compiled with high bit depth support.

   1920x1080, 4:2:0, 60 frames:

   o  aspen_1080p_60f

   o  ducks_take_off_1080p50_60f

   o  life_1080p30_60f

   o  Netflix_Aerial_1920x1080_60fps_8bit_420_60f

   o  Netflix_Boat_1920x1080_60fps_8bit_420_60f

   o  Netflix_FoodMarket_1920x1080_60fps_8bit_420_60f

   o  Netflix_PierSeaside_1920x1080_60fps_8bit_420_60f

   o  Netflix_SquareAndTimelapse_1920x1080_60fps_8bit_420_60f

   o  Netflix_TunnelFlag_1920x1080_60fps_8bit_420_60f

   o  rush_hour_1080p25_60f

   o  seaplane_hdr_amazon_1080p

   o  touchdown_pass_1080p_60f

   1280x720, 4:2:0, 120 frames:

   o  boat_hdr_amazon_720p

   o  dark720p_120f

   o  gipsrestat720p_120f

   o  KristenAndSara_1280x720_60_120f

   o  Netflix_DrivingPOV_1280x720_60fps_8bit_420_60f

   o  Netflix_RollerCoaster_1280x720_60fps_8bit_420_60f

   o  vidyo1_720p_60fps_120f

   o  vidyo4_720p_60fps_120f

   640x360, 4:2:0, 120 frames:

   o  blue_sky_360p_120f

   o  controlled_burn_640x360_120f

   o  kirland360p_120f

   o  niklas360p_120f

   o  rain2_hdr_amazon_360p

   o  red_kayak_360p_120f

   o  riverbed_360p25_120f

   o  shields2_640x360_120f

   o  speed_bag_640x360_120f

   o  thaloundeskmtg360p_120f

   426x240, 4:2:0, 120 frames:

   o  bqfree_240p_120f

   o  bqzoom_240p_120f

   o  dirtbike_240p_120f

   1920x1080, 4:2:0, 60 frames:

   o  DOTA2_60f_420.y4m

   o  MINECRAFT_60f_420.y4m

   o  STARCRAFT_60f_420.y4m

   o  wikipedia_420.y4m

5.2.4.  objective-1.1

   This test set is an old version of objective-2-slow.
646 4096x2160, 10bit, 4:2:0, 60 frames: 648 o Aerial (start frame 600) 650 o BarScene (start frame 120) 652 o Boat (start frame 0) 654 o BoxingPractice (start frame 0) 656 o Crosswalk (start frame 0) 658 o Dancers (start frame 120) 659 o FoodMarket 661 o Narrator 663 o PierSeaside 665 o RitualDance 667 o SquareAndTimelapse 669 o ToddlerFountain (start frame 120) 671 o TunnelFlag 673 o WindAndNature (start frame 120) 675 1920x1080, 8bit, 4:4:4, 60 frames: 677 o CSGO 679 o DOTA2 681 o EuroTruckSimulator2 683 o Hearthstone 685 o MINECRAFT 687 o STARCRAFT 689 o wikipedia 691 o pvq_slideshow 693 1920x1080, 8bit, 4:2:0, 60 frames: 695 o ducks_take_off 697 o life 699 o aspen 701 o crowd_run 703 o old_town_cross 705 o park_joy 706 o pedestrian_area 708 o rush_field_cuts 710 o rush_hour 712 o station2 714 o touchdown_pass 716 1280x720, 8bit, 4:2:0, 60 frames: 718 o Netflix_FoodMarket2 720 o Netflix_Tango 722 o DrivingPOV (start frame 120) 724 o DinnerScene (start frame 120) 726 o RollerCoaster (start frame 600) 728 o FourPeople 730 o Johnny 732 o KristenAndSara 734 o vidyo1 736 o vidyo3 738 o vidyo4 740 o dark720p 742 o gipsrecmotion720p 744 o gipsrestat720p 746 o controlled_burn 748 o stockholm 750 o speed_bag 752 o snow_mnt 753 o shields 755 640x360, 8bit, 4:2:0, 60 frames: 757 o red_kayak 759 o blue_sky 761 o riverbed 763 o thaloundeskmtgvga 765 o kirlandvga 767 o tacomanarrowsvga 769 o tacomascmvvga 771 o desktop2360p 773 o mmmovingvga 775 o mmstationaryvga 777 o niklasvga 779 5.2.5. objective-1-fast 781 This is an old version of objective-2-fast. 783 1920x1080, 8bit, 4:2:0, 60 frames: 785 o Aerial (start frame 600) 787 o Boat (start frame 0) 789 o Crosswalk (start frame 0) 791 o FoodMarket 793 o PierSeaside 795 o SquareAndTimelapse 797 o TunnelFlag 799 1920x1080, 8bit, 4:2:0, 60 frames: 801 o CSGO 803 o EuroTruckSimulator2 805 o MINECRAFT 807 o wikipedia 809 1920x1080, 8bit, 4:2:0, 60 frames: 811 o ducks_take_off 813 o aspen 815 o old_town_cross 817 o pedestrian_area 819 o rush_hour 821 o touchdown_pass 823 1280x720, 8bit, 4:2:0, 60 frames: 825 o Netflix_FoodMarket2 827 o DrivingPOV (start frame 120) 829 o RollerCoaster (start frame 600) 831 o Johnny 833 o vidyo1 835 o vidyo4 837 o gipsrecmotion720p 839 o speed_bag 841 o shields 843 640x360, 8bit, 4:2:0, 60 frames: 845 o red_kayak 847 o riverbed 848 o kirlandvga 850 o tacomascmvvga 852 o mmmovingvga 854 o niklasvga 856 5.3. Operating Points 858 Four operating modes are defined. High latency is intended for on 859 demand streaming, one-to-many live streaming, and stored video. Low 860 latency is intended for videoconferencing and remote access. Both of 861 these modes come in CQP (constant quantizer parameter) and 862 unconstrained variants. When testing still image sets, such as 863 subset1, high latency CQP mode should be used. 865 5.3.1. Common settings 867 Encoders should be configured to their best settings when being 868 compared against each other: 870 o av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0 -cpu-used=0 871 -threads=1 873 5.3.2. High Latency CQP 875 High Latency CQP is used for evaluating incremental changes to a 876 codec. This method is well suited to compare codecs with similar 877 coding tools. It allows codec features with intrinsic frame delay. 879 o daala: -v=x -b 2 881 o vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2 883 o av1: -end-usage=q -cq-level=x -auto-alt-ref=2 885 5.3.3. Low Latency CQP 887 Low Latency CQP is used for evaluating incremental changes to a 888 codec. 
This method is well suited to compare codecs with similar
   coding tools.  It requires the codec to be set for zero intrinsic
   frame delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist to automate the
   process of objective comparisons.  The Compare-Codecs tool allows
   BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate computation of the various
   metrics.  In addition, these scripts can be run automatically on
   distributed computers for fast results, using rd_tool [RD_TOOL].
   This tool can be run via a web interface called AreWeCompressedYet
   [AWCY], or locally.

   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests run on a small number of short sequences
   (regression-1).  The regression tests should include a variety of
   test conditions.  The purpose of regression tests is to ensure that
   bug fixes (and similar patches) do not negatively affect
   performance.  The anchor in regression tests is the previous
   revision of the codec in source control.  Regression tests are run
   in both high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of encoding or the
   bitstream should run an objective performance test.  The
   performance tests should be run on a wider number of sequences.
   The following data should be reported:

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (e.g. objective-1-fast).

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-Rate score, in percent, for each clip.

      *  The average of all BD-Rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-Rate scores for all videos in all
         categories.
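   As a point of reference for the per-clip BD-Rate scores listed
   above, the following is a rough Python sketch of the BD-rate
   computation described in Section 4.2, assuming NumPy and SciPy are
   available.  Function and variable names are illustrative only; the
   automation tools described above remain the usual way to produce
   the reported numbers.

      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_points, test_points):
          # ref_points and test_points are (bitrate, metric score)
          # pairs for the reference and test encoders at matching
          # quantizers.
          def fit(points):
              # Sort by metric score and fit log-rate as a function of
              # the score, per Section 4.2.
              scores, log_rates = zip(*sorted((s, np.log10(r))
                                              for r, s in points))
              return (PchipInterpolator(scores, log_rates),
                      min(scores), max(scores))

          ref_fn, ref_lo, ref_hi = fit(ref_points)
          test_fn, test_lo, test_hi = fit(test_points)

          # Overlap of the two metric ranges (same-codec comparison).
          lo, hi = max(ref_lo, test_lo), min(ref_hi, test_hi)
          samples = np.linspace(lo, hi, 1000)

          # Trapezoidal integration of log-rate over the metric range.
          avg_log_ref = np.trapz(ref_fn(samples), samples) / (hi - lo)
          avg_log_test = np.trapz(test_fn(samples), samples) / (hi - lo)

          # Convert back to linear rate and report the percent
          # difference from the reference to the test codec.
          return (10.0 ** (avg_log_test - avg_log_ref) - 1.0) * 100.0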
   Normally, the encoder should always be run at the slowest, highest
   quality speed setting (cpu-used=0 in the case of AV1 and VP9).
   However, when computation time is constrained, both the reference
   and changed encoder can be built with some options disabled.  For
   AV1, -disable-ext_partition and -disable-ext_partition_types can be
   passed to the configure script to substantially speed up encoding,
   but the usage of these options must be reported in the test
   results.

6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  IANA Considerations

   This document does not require any IANA actions.

8.  Security Considerations

   This document describes the methodologies and procedures for
   qualitative testing, and therefore does not itself have
   implications for network or decoder security.

9.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016.

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012.

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012.

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015.

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015.

   [I-D.ietf-netvc-requirements]
              Filippov, A., Norkin, A., and J. Alvarez, "Video Codec
              Requirements and Evaluation Methodology", draft-ietf-
              netvc-requirements-10 (work in progress), November 2019.

   [MSSSIM]   Wang, Z., Simoncelli, E., and A. Bovik, "Multi-Scale
              Structural Similarity for Image Quality Assessment",
              n.d.

   [PSNRHVS]  Egiazarian, K., Astola, J., Ponomarenko, N., Lukin, V.,
              Battisti, F., and M. Carli, "A New Full-Reference
              Quality Metrics Based on HVS", 2002.

   [RD_TOOL]  Xiph.Org, "rd_tool", 2016.

   [SSIM]     Wang, Z., Bovik, A., Sheikh, H., and E. Simoncelli,
              "Image Quality Assessment: From Error Visibility to
              Structural Similarity", 2004.

   [TESTSEQUENCES]
              Daede, T., "Test Sets", n.d.

   [VMAF]     Aaron, A., Li, Z., Manohara, M., Lin, J., Wu, E., and C.
              Kuo, "VMAF - Video Multi-Method Assessment Fusion",
              2015.

Authors' Addresses

   Thomas Daede
   Mozilla

   Email: tdaede@mozilla.com

   Andrey Norkin
   Netflix

   Email: anorkin@netflix.com

   Ilya Brailovskiy
   Amazon Lab126

   Email: brailovs@lab126.com