Network Working Group                                          T. Daede
Internet-Draft                                                   Mozilla
Intended status: Informational                                 A. Norkin
Expires: January 09, 2017                                        Netflix
                                                          I. Brailovskiy
                                                           Amazon Lab126
                                                           July 08, 2016

              Video Codec Testing and Quality Measurement
                       draft-ietf-netvc-testing-03

Abstract

   This document describes guidelines and procedures for evaluating a
   video codec.  This covers subjective and objective tests, test
   conditions, and materials used for the test.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 09, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Subjective quality tests
     2.1.  Still Image Pair Comparison
     2.2.  Video Pair Comparison
     2.3.  Subjective viewing test
   3.  Objective Metrics
     3.1.  Overall PSNR
     3.2.  Frame-averaged PSNR
     3.3.  PSNR-HVS-M
     3.4.  SSIM
     3.5.  Multi-Scale SSIM
     3.6.  Fast Multi-Scale SSIM
     3.7.  CIEDE2000
     3.8.  VMAF
   4.  Comparing and Interpreting Results
     4.1.  Graphing
     4.2.  BD-Rate
     4.3.  Ranges
   5.  Test Sequences
     5.1.  Sources
     5.2.  Test Sets
       5.2.1.  regression-1
       5.2.2.  objective-1
       5.2.3.  objective-1-fast
     5.3.  Operating Points
       5.3.1.  Common settings
       5.3.2.  High Latency CQP
       5.3.3.  Low Latency CQP
       5.3.4.  Unconstrained High Latency
       5.3.5.  Unconstrained Low Latency
   6.  Automation
     6.1.  Regression tests
     6.2.  Objective performance tests
     6.3.  Periodic tests
   7.  Informative References
   Authors' Addresses

1.  Introduction

   When developing a video codec, changes and additions to the codec
   need to be decided based on their performance tradeoffs.  In
   addition, measurements are needed to determine when the codec has
   met its performance goals.  This document specifies how the tests
   are to be carried out to ensure valid comparisons when evaluating
   changes under consideration.  Authors of features or changes should
   provide the results of the appropriate test when proposing codec
   modifications.

2.  Subjective quality tests

   Subjective testing is the preferred method of testing video codecs.
   Subjective testing results take priority over objective testing
   results, when available.
   Subjective testing is recommended especially when taking advantage
   of psychovisual effects that may not be well represented by
   objective metrics, or when different objective metrics disagree.

   Selection of a testing methodology depends on the feature being
   tested and the resources available.  Test methodologies are
   presented in order of increasing accuracy and cost.

   Testing relies on the resources of participants.  For this reason,
   even if the group agrees that a particular test is important, if no
   one volunteers to do it, or if volunteers do not complete it in a
   timely fashion, then that test should be discarded.  This ensures
   that only important tests are done - in particular, the tests that
   are important to participants.

2.1.  Still Image Pair Comparison

   A simple way to determine the superiority of one compressed image
   over another is to visually compare the two compressed images and
   have the viewer judge which one has higher quality.  This is used
   for rapid comparisons during development - the viewer may be a
   developer or user, for example.  Because testing is done on still
   images (keyframes), this is only suitable for changes with similar
   or no effect on other frames.  For example, this test may be
   suitable for an intra de-ringing filter, but not for a new inter
   prediction mode.  For this test, the two compressed images should
   have similar compressed file sizes, with one image being no more
   than 5% larger than the other.  In addition, at least 5 different
   images should be compared.

2.2.  Video Pair Comparison

   Video comparisons are necessary when making changes with temporal
   effects, such as changes to inter-frame prediction.  Video pair
   comparisons follow the same procedure as still image pair
   comparisons.

2.3.  Subjective viewing test

   A subjective viewing test is the preferred method of evaluating
   quality.  The test should be performed either by showing the video
   sequences consecutively on one screen or by showing them
   side-by-side on two screens.  The testing procedure should normally
   follow the rules described in [BT500] and be performed with
   non-expert test subjects.  Depending on the test procedure, the
   result of the test is either mean opinion scores (MOS) or
   differential mean opinion scores (DMOS).  Normally, confidence
   intervals are also calculated to judge whether the difference
   between two encodings is statistically significant.  In certain
   cases, a viewing test with expert test subjects can be performed,
   for example if the test is meant to evaluate technologies with
   similar performance with respect to a particular artifact (e.g.
   loop filters or motion prediction).  Depending on the setup of the
   test, the output could be a MOS, a DMOS, or the percentage of
   experts who preferred one technology over the other.

3.  Objective Metrics

   Objective metrics are used in place of subjective metrics for easy
   and repeatable experiments.  Most objective metrics have been
   designed to correlate with subjective scores.

   The following descriptions give an overview of the operation of
   each of the metrics.  Because implementation details can sometimes
   vary, the exact implementation is specified in C in the Daala tools
   repository [DAALA-GIT].  Implementations of metrics must directly
   support the input's resolution, bit depth, and sampling format.

   Unless otherwise specified, all of the metrics described below
   apply only to the luma plane, individually by frame.  When applied
   to a video, the scores of each frame are averaged to create the
   final score.

   Codecs must output the same resolution, bit depth, and sampling
   format as the input.

3.1.  Overall PSNR

   PSNR is a traditional signal quality metric, measured in decibels.
   It is directly derived from the mean squared error (MSE) or its
   square root (RMSE).  The formula used is:

      20 * log10 ( MAX / RMSE )

   or, equivalently:

      10 * log10 ( MAX^2 / MSE )

   where the error is computed over all the pixels in the video, which
   is the method used in the dump_psnr.c reference implementation.

   This metric may be applied to both the luma and chroma planes, with
   all planes reported separately.

3.2.  Frame-averaged PSNR

   PSNR can also be calculated per frame, and the values then averaged
   together.  This is reported in the same way as overall PSNR.
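   The following Python sketch illustrates the difference between the
   two PSNR variants above.  It is not the reference implementation
   (the normative code is the C in [DAALA-GIT]); it assumes 8-bit luma
   planes are already available as numpy arrays, so MAX is 255.

      import numpy as np

      def psnr(mse, max_value=255.0):
          # 10 * log10(MAX^2 / MSE), guarding against lossless frames
          if mse == 0:
              return float('inf')
          return 10 * np.log10(max_value ** 2 / mse)

      def overall_psnr(ref_frames, test_frames):
          # Pool squared error over every pixel of every frame, then
          # convert to decibels once (the dump_psnr.c approach).
          total_se = 0.0
          total_px = 0
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              total_se += np.sum(diff * diff)
              total_px += diff.size
          return psnr(total_se / total_px)

      def frame_averaged_psnr(ref_frames, test_frames):
          # Convert each frame's MSE to decibels first, then average
          # the per-frame scores.
          scores = []
          for ref, test in zip(ref_frames, test_frames):
              diff = ref.astype(np.float64) - test.astype(np.float64)
              scores.append(psnr(np.mean(diff * diff)))
          return float(np.mean(scores))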
3.3.  PSNR-HVS-M

   The PSNR-HVS metric performs a DCT transform of 8x8 blocks of the
   image, weights the coefficients, and then calculates the PSNR of
   those coefficients.  Several different sets of weights have been
   considered [PSNRHVS].  The weights used by the dump_psnrhvs.c tool
   in the Daala repository have been found to be the best match to
   real MOS scores.

3.4.  SSIM

   SSIM (Structural Similarity Image Metric) is a still image quality
   metric introduced in 2004 [SSIM].  It computes a score for each
   individual pixel, using a window of neighboring pixels.  These
   scores can then be averaged to produce a global score for the
   entire image.  The original paper produces scores ranging between 0
   and 1.

   For the metric to appear more linear on BD-rate curves, the score
   is converted into a nonlinear decibel scale:

      -10 * log10 (1 - SSIM)

3.5.  Multi-Scale SSIM

   Multi-Scale SSIM is SSIM extended to multiple window sizes
   [MSSSIM].

3.6.  Fast Multi-Scale SSIM

   Fast MS-SSIM is a modified implementation of MS-SSIM which operates
   on a limited number of scales and with modified weights [FASTSSIM].
   The final score is converted to decibels in the same manner as
   SSIM.

3.7.  CIEDE2000

   CIEDE2000 is a metric based on CIEDE color distances [CIEDE2000].
   It generates a single score taking into account all three color
   planes.  It does not take into consideration any structural
   similarity or other psychovisual effects.

3.8.  VMAF

   Video Multi-method Assessment Fusion (VMAF) is a full-reference
   perceptual video quality metric that aims to approximate human
   perception of video quality [VMAF].  This metric is focused on
   quality degradation due to compression and rescaling.  VMAF
   estimates the perceived quality score by computing scores from
   multiple quality assessment algorithms and fusing them using a
   support vector machine (SVM).  Currently, three image fidelity
   metrics and one temporal signal have been chosen as features for
   the SVM, namely Anti-noise SNR (ANSNR), Detail Loss Measure (DLM),
   Visual Information Fidelity (VIF), and the mean co-located pixel
   difference of a frame with respect to the previous frame.

4.  Comparing and Interpreting Results

4.1.  Graphing

   When displayed on a graph, bitrate is shown on the X axis, and the
   quality metric is shown on the Y axis.  For publication, the X axis
   should be linear.  The Y axis metric should be plotted in decibels.
   If the quality metric does not natively report quality in decibels,
   it should be converted as described in the previous section.
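   As an informal illustration of these plotting conventions, the
   following Python sketch plots one rate-distortion curve per codec,
   with a linear bitrate axis and a decibel quality axis.  The use of
   matplotlib and the dictionary of per-codec points are assumptions
   of this example, not requirements of this document.

      import matplotlib.pyplot as plt

      def plot_rd_curves(curves, metric_name="PSNR (dB)"):
          # curves: {"codec name": (bitrates_kbps, scores_db)} with
          # the metric already converted to decibels.
          for name, (rates, scores) in curves.items():
              plt.plot(rates, scores, marker="o", label=name)
          plt.xlabel("Bitrate (kbps)")   # linear X axis
          plt.ylabel(metric_name)        # metric in decibels on Y
          plt.legend()
          plt.grid(True)
          plt.savefig("rd_curve.png")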
4.2.  BD-Rate

   The Bjontegaard rate difference, also known as BD-rate, allows the
   measurement of the bitrate reduction offered by a codec or codec
   feature, while maintaining the same quality as measured by
   objective metrics.  The rate change is computed as the average
   percent difference in rate over a range of qualities.  Metric score
   ranges are not static - they are calculated either from a range of
   bitrates of the reference codec, or from quantizers of a third,
   anchor codec.  Given a reference codec and a test codec, BD-rate
   values are calculated as follows:

   o  Rate/distortion points are calculated for the reference and test
      codecs.

      *  At least four points must be computed.  These points should
         be the same quantizers when comparing two versions of the
         same codec.

      *  Additional points outside of the range should be discarded.

   o  The rates are converted into log-rates.

   o  A piecewise cubic Hermite interpolating polynomial is fit to the
      points for each codec to produce functions of log-rate in terms
      of distortion.

   o  Metric score ranges are computed:

      *  If comparing two versions of the same codec, the overlap is
         the intersection of the two curves, bounded by the chosen
         quantizer points.

      *  If comparing dissimilar codecs, a third anchor codec's metric
         scores at fixed quantizers are used directly as the bounds.

   o  The log-rate is numerically integrated over the metric range for
      each curve, using at least 1000 samples and trapezoidal
      integration.

   o  The resulting integrated log-rates are converted back into
      linear rate, and then the percent difference is calculated from
      the reference to the test codec.
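   A minimal Python sketch of the procedure above follows.  It is not
   the normative tool (the scripts in the Daala repository and rd_tool
   [RD_TOOL] are what is actually used); it assumes each codec is
   described by a list of (bitrate, metric score in decibels) pairs
   and uses SciPy's PCHIP interpolator with trapezoidal integration.

      import numpy as np
      from scipy.interpolate import PchipInterpolator

      def bd_rate(ref_points, test_points, samples=1000):
          # ref_points/test_points: lists of (bitrate, score_db)
          # pairs, at least four per codec.  Returns the average
          # percent rate difference of the test codec relative to
          # the reference codec.
          def fit(points):
              points = sorted(points, key=lambda p: p[1])
              scores = np.array([p[1] for p in points])
              log_rates = np.log(np.array([p[0] for p in points]))
              # log-rate as a function of distortion (metric score)
              return PchipInterpolator(scores, log_rates), scores

          ref_fn, ref_scores = fit(ref_points)
          test_fn, test_scores = fit(test_points)

          # Overlap of the two curves, bounded by the chosen points.
          lo = max(ref_scores.min(), test_scores.min())
          hi = min(ref_scores.max(), test_scores.max())
          grid = np.linspace(lo, hi, samples)

          # Average log-rate over the metric range for each codec.
          ref_avg = np.trapz(ref_fn(grid), grid) / (hi - lo)
          test_avg = np.trapz(test_fn(grid), grid) / (hi - lo)

          # Back to linear rate, then percent difference.
          return (np.exp(test_avg) - np.exp(ref_avg)) \
              / np.exp(ref_avg) * 100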
4.3.  Ranges

   For all tests described in this document, the anchor codec used for
   ranges is libvpx 1.5.0 run with VP9 and High Latency CQP settings.
   The quality range used is that achieved between cq-level 20 and 55.
   For testing changes to libvpx or libaom, the anchor does not need
   to be used.

5.  Test Sequences

5.1.  Sources

   Lossless test clips are preferred for most tests, because the
   structure of compression artifacts in already-compressed clips may
   introduce extra noise in the test results.  However, a large amount
   of content on the internet needs to be recompressed at least once,
   so some sources of this nature are useful.  The encoder should run
   at the same bit depth as the original source.  In addition, metrics
   need to support operation at high bit depth.  If one or more codecs
   in a comparison do not support high bit depth, sources need to be
   converted once before entering the encoder.

5.2.  Test Sets

   Sources are divided into several categories to test different
   scenarios the codec will be required to operate in.  For easier
   comparison, all videos in each set should have the same color
   subsampling, same resolution, and same number of frames.  In
   addition, all test videos must be publicly available for testing
   use, to allow for reproducibility of results.  All current test
   sets are available for download [TESTSEQUENCES].

   Test sequences should be downloaded in whole.  They should not be
   recreated from the original sources.

5.2.1.  regression-1

   This test set is used for basic regression testing.  It contains a
   very small number of clips.

   o  kirlandvga (640x360, 8bit, 4:2:0, 300 frames)

   o  FourPeople (1280x720, 8bit, 4:2:0, 60 frames)

   o  Narrator (4096x2160, 10bit, 4:2:0, 15 frames)

   o  CSGO (1920x1080, 8bit, 4:4:4, 60 frames)

5.2.2.  objective-1

   This test set is a comprehensive test set, grouped by resolution.
   These test clips were created from originals at [TESTSEQUENCES].
   They have been scaled and cropped to match the resolution of their
   category.  Other deviations are noted in parentheses.

   4096x2160, 10bit, 4:2:0, 60 frames:

   o  Aerial (start frame 600)

   o  BarScene (start frame 120)

   o  Boat (start frame 0)

   o  BoxingPractice (start frame 0)

   o  Crosswalk (start frame 0)

   o  Dancers (start frame 120)

   o  FoodMarket

   o  Narrator

   o  PierSeaside

   o  RitualDance

   o  SquareAndTimelapse

   o  ToddlerFountain (start frame 120)

   o  TunnelFlag

   o  WindAndNature (start frame 120)

   1920x1080, 8bit, 4:4:4, 60 frames:

   o  CSGO

   o  DOTA2

   o  EuroTruckSimulator2

   o  Hearthstone

   o  MINECRAFT

   o  STARCRAFT

   o  wikipedia

   o  pvq_slideshow

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  ducks_take_off

   o  life

   o  aspen

   o  crowd_run

   o  old_town_cross

   o  park_joy

   o  pedestrian_area

   o  rush_field_cuts

   o  rush_hour

   o  station2

   o  touchdown_pass

   1280x720, 8bit, 4:2:0, 60 frames:

   o  Netflix_FoodMarket2

   o  Netflix_Tango

   o  DrivingPOV (start frame 120)

   o  DinnerScene (start frame 120)

   o  RollerCoaster (start frame 600)

   o  FourPeople

   o  Johnny

   o  KristenAndSara

   o  vidyo1

   o  vidyo3

   o  vidyo4

   o  dark720p

   o  gipsrecmotion720p

   o  gipsrestat720p

   o  controlled_burn

   o  stockholm

   o  speed_bag

   o  snow_mnt

   o  shields

   640x360, 8bit, 4:2:0, 60 frames:

   o  red_kayak

   o  blue_sky

   o  riverbed

   o  thaloundeskmtgvga

   o  kirlandvga

   o  tacomanarrowsvga

   o  tacomascmvvga

   o  desktop2360p

   o  mmmovingvga

   o  mmstationaryvga

   o  niklasvga

5.2.3.  objective-1-fast

   This test set is based on objective-1, but requires much less
   computation.  It is intended to be a predictor for the results from
   objective-1.

   2048x1080, 8bit, 4:2:0, 60 frames:

   o  Aerial (start frame 600)

   o  Boat (start frame 0)

   o  Crosswalk (start frame 0)

   o  FoodMarket

   o  PierSeaside

   o  SquareAndTimelapse

   o  TunnelFlag

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  CSGO

   o  EuroTruckSimulator2

   o  MINECRAFT

   o  wikipedia

   1920x1080, 8bit, 4:2:0, 60 frames:

   o  ducks_take_off

   o  aspen

   o  old_town_cross

   o  pedestrian_area

   o  rush_hour

   o  touchdown_pass

   1280x720, 8bit, 4:2:0, 60 frames:

   o  Netflix_FoodMarket2

   o  DrivingPOV (start frame 120)

   o  RollerCoaster (start frame 600)

   o  Johnny

   o  vidyo1

   o  vidyo4

   o  gipsrecmotion720p

   o  speed_bag

   o  shields

   640x360, 8bit, 4:2:0, 60 frames:

   o  red_kayak

   o  riverbed

   o  kirlandvga

   o  tacomascmvvga

   o  mmmovingvga

   o  niklasvga
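   Because each resolution category is expected to be homogeneous, it
   can be useful to sanity-check a downloaded category.  The following
   Python sketch checks that all .y4m clips in one local directory
   (the directory name is hypothetical) share the same resolution and
   chroma subsampling; frame counting is omitted for brevity.

      import glob, os, sys

      def check_test_set(directory):
          # Read the YUV4MPEG2 header of each clip and collect its
          # width (W), height (H), and chroma tag (C, default 420).
          headers = {}
          for path in sorted(glob.glob(os.path.join(directory,
                                                    "*.y4m"))):
              with open(path, "rb") as f:
                  tags = f.readline().decode("ascii",
                                              "replace").split()[1:]
              params = {t[0]: t[1:] for t in tags}
              headers[os.path.basename(path)] = (params.get("W"),
                                                 params.get("H"),
                                                 params.get("C",
                                                            "420"))
          if len(set(headers.values())) > 1:
              sys.exit("Inconsistent clips: %r" % headers)

      # Hypothetical local directory holding one category's clips:
      check_test_set("objective-1-fast-640x360/")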
5.3.  Operating Points

   Four operating modes are defined.  High latency is intended for on
   demand streaming, one-to-many live streaming, and stored video.
   Low latency is intended for videoconferencing and remote access.
   Both of these modes come in CQP (constant quantizer) and
   unconstrained variants.  When testing still image sets, such as
   subset1, high latency CQP mode should be used.

5.3.1.  Common settings

   Encoders should be configured to their best settings when being
   compared against each other:

   o  av1: -codec=av1 -ivf -frame-parallel=0 -tile-columns=0
      -cpu-used=0 -threads=1

5.3.2.  High Latency CQP

   High Latency CQP is used for evaluating incremental changes to a
   codec.  This method is well suited to comparing codecs with similar
   coding tools.  It allows codec features with intrinsic frame delay.

   o  daala: -v=x -b 2

   o  vp9: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.3.  Low Latency CQP

   Low Latency CQP is used for evaluating incremental changes to a
   codec.  This method is well suited to comparing codecs with similar
   coding tools.  It requires the codec to be set for zero intrinsic
   frame delay.

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0

5.3.4.  Unconstrained High Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode).  Lookahead and/or two-pass are allowed, if
   supported.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x

   o  x265: -crf=x

   o  daala: -v=x -b 2

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=25 -auto-alt-ref=2

5.3.5.  Unconstrained Low Latency

   The encoder should be run at the best quality mode available, using
   the mode that will provide the best quality per bitrate (VBR or
   constant quality mode), but no frame delay, buffering, or lookahead
   is allowed.  One parameter is provided to adjust bitrate, but the
   units are arbitrary.  Example configurations follow:

   o  x264: -crf=x -tune zerolatency

   o  x265: -crf=x -tune zerolatency

   o  daala: -v=x

   o  av1: -end-usage=q -cq-level=x -lag-in-frames=0
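   As an example of how an operating point might be driven from a
   script, the following Python sketch runs the av1 High Latency CQP
   configuration listed above over a sweep of quantizers.  The binary
   name "aomenc", the "-o" output option, the clip filename, and the
   quantizer values are assumptions of this sketch; the exact flag
   spelling should be checked against the encoder build in use.

      import subprocess

      def encode_high_latency_cqp(aomenc, y4m_in, ivf_out, cq_level):
          # Flags follow the common settings and High Latency CQP
          # configuration listed above.
          cmd = [aomenc,
                 "-codec=av1", "-ivf", "-frame-parallel=0",
                 "-tile-columns=0", "-cpu-used=0", "-threads=1",
                 "-end-usage=q", "-cq-level=%d" % cq_level,
                 "-lag-in-frames=25", "-auto-alt-ref=2",
                 "-o", ivf_out, y4m_in]
          subprocess.check_call(cmd)

      # Example sweep; at least four points are required for BD-rate
      # (Section 4.2), here chosen within the 20-55 anchor range.
      for q in (20, 32, 43, 55):
          encode_high_latency_cqp("aomenc", "FourPeople.y4m",
                                  "FourPeople_q%d.ivf" % q, q)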
6.  Automation

   Frequent objective comparisons are extremely beneficial while
   developing a new codec.  Several tools exist to automate the
   process of objective comparisons.  The Compare-Codecs tool allows
   BD-rate curves to be generated for a wide variety of codecs
   [COMPARECODECS].  The Daala source repository contains a set of
   scripts that can be used to automate the various metrics used.  In
   addition, these scripts can be run automatically on distributed
   computers for fast results, with rd_tool [RD_TOOL].  This tool can
   be run via a web interface called AreWeCompressedYet [AWCY], or
   locally.

   Because of computational constraints, several levels of testing are
   specified.

6.1.  Regression tests

   Regression tests are run on a small number of short sequences, the
   regression-1 test set.  The regression tests should include a
   number of various test conditions.  The purpose of regression tests
   is to ensure that bug fixes (and similar patches) do not negatively
   affect performance.  The anchor in regression tests is the previous
   revision of the codec in source control.  Regression tests are run
   in both high and low latency CQP modes.

6.2.  Objective performance tests

   Changes that are expected to affect the quality of the encode or
   the bitstream should run an objective performance test.  The
   performance tests should be run on a wider number of sequences.
   The following data should be reported:

   o  Identifying information for the encoder used, such as the git
      commit hash.

   o  Command line options to the encoder, configure script, and
      anything else necessary to replicate the experiment.

   o  The name of the test set run (objective-1).

   o  For both high and low latency CQP modes, and for each objective
      metric:

      *  The BD-rate score, in percent, for each clip.

      *  The average of all BD-rate scores, equally weighted, for each
         resolution category in the test set.

      *  The average of all BD-rate scores for all videos in all
         categories.

   For non-tool contributions, the test set objective-1-fast can be
   substituted.

6.3.  Periodic tests

   Periodic tests are run on a wide range of bitrates in order to
   gauge progress over time, as well as to detect potential
   regressions missed by other tests.

7.  Informative References

   [AWCY]     Xiph.Org, "Are We Compressed Yet?", 2016.

   [BT500]    ITU-R, "Recommendation ITU-R BT.500-13", 2012.

   [CIEDE2000]
              Yang, Y., Ming, J., and N. Yu, "Color Image Quality
              Assessment Based on CIEDE2000", 2012.

   [COMPARECODECS]
              Alvestrand, H., "Compare Codecs", 2015.

   [DAALA-GIT]
              Xiph.Org, "Daala Git Repository", 2015.

   [DERFVIDEO]
              Terriberry, T., "Xiph.org Video Test Media", n.d.

   [FASTSSIM]
              Chen, M. and A. Bovik, "Fast structural similarity index
              algorithm", 2010.

   [L1100]    Bossen, F., "Common test conditions and software
              reference configurations", JCTVC L1100, 2013.

   [MSSSIM]   Wang, Z., Simoncelli, E., and A. Bovik, "Multi-Scale
              Structural Similarity for Image Quality Assessment",
              n.d.

   [PSNRHVS]  Egiazarian, K., Astola, J., Ponomarenko, N., Lukin, V.,
              Battisti, F., and M. Carli, "A New Full-Reference
              Quality Metrics Based on HVS", 2002.

   [RD_TOOL]  Xiph.Org, "rd_tool", 2016.

   [SSIM]     Wang, Z., Bovik, A., Sheikh, H., and E. Simoncelli,
              "Image Quality Assessment: From Error Visibility to
              Structural Similarity", 2004.

   [STEAM]    Valve Corporation, "Steam Hardware & Software Survey:
              June 2015", June 2015.

   [TESTSEQUENCES]
              Daede, T., "Test Sets", n.d.

   [VMAF]     Aaron, A., Li, Z., Manohara, M., Lin, J., Wu, E., and C.
              Kuo, "VMAF - Video Multi-Method Assessment Fusion",
              2015.

Authors' Addresses

   Thomas Daede
   Mozilla

   Email: tdaede@mozilla.com

   Andrey Norkin
   Netflix

   Email: anorkin@netflix.com

   Ilya Brailovskiy
   Amazon Lab126

   Email: brailovs@lab126.com