Network Working Group                                          JM. Valin
Internet-Draft                                                   Mozilla
Intended status: Standards Track                            June 9, 2015
Expires: December 11, 2015

             Pyramid Vector Quantization for Video Coding
                       draft-valin-netvc-pvq-00

Abstract

   This document proposes applying pyramid vector quantization (PVQ) to
   video coding.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 11, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may not be modified, and derivative works of it may not
   be created, and it may not be published except as an Internet-Draft.

Table of Contents

   1.  Introduction
   2.  Gain-Shape Coding and Activity Masking
   3.  Householder Reflection
   4.  Angle-Based Encoding
   5.  Bi-prediction
   6.  Coefficient Coding
   7.  Development Repository
   8.  IANA Considerations
   9.  Security Considerations
   10. Acknowledgements
   11. Informative References
   Author's Address

1.  Introduction

   This draft describes a proposal for adapting the energy conservation
   principle of the Opus codec [RFC6716] to video coding, based on a
   pyramid vector quantizer (PVQ) [Pyramid-VQ].  One potential advantage
   of conserving the energy of the AC coefficients in video coding is
   preserving textures rather than low-passing them.  Also, by
   introducing a fixed-resolution PVQ-type quantizer, we automatically
   gain a simple activity masking model.

   The main challenge of adapting this scheme to video is that we have a
   good prediction (the reference frame), so we are essentially starting
   from a point that is already on the PVQ hyper-sphere, rather than at
   the origin as in CELT.  Other challenges are the introduction of a
   quantization matrix and the fact that we want the reference (motion-
   predicted) data to correspond exactly to one of the entries in our
   codebook.  This proposal is described in greater detail in
   [Perceptual-VQ], as well as in a demo [PVQ-demo].

2.  Gain-Shape Coding and Activity Masking

   The main idea behind the proposed video coding scheme is to code
   groups of DCT coefficients as a scalar gain and a unit-norm "shape"
   vector.  A block's AC coefficients may all be part of the same group,
   or may be divided by frequency (e.g., by octave) and/or by
   directionality (horizontal vs. vertical).

   It is desirable for a single quality parameter to control the
   resolution of both the gain and the shape.  Ideally, that quality
   parameter should also take into account activity masking, that is,
   the fact that the eye is less sensitive to regions of an image that
   have more detail.
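   As a minimal sketch of the gain-shape split described above (an
   illustration only, assuming a nonzero band of AC coefficients):

```python
import numpy as np

def gain_shape(x_d):
    """Split a band of (pre-normalization) AC coefficients into a
    scalar gain and a unit-norm shape vector.  Assumes the band is
    nonzero."""
    g = float(np.sqrt(x_d @ x_d))  # band gain g = sqrt(x_d^T x_d)
    return g, x_d / g              # unit-norm shape x = x_d / ||x_d||

g, shape = gain_shape(np.array([3.0, 4.0]))
# g is 5.0 and shape is [0.6, 0.8], a unit-norm vector
```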
   According to Jason Garrett-Glaser, the perceptual analysis in the
   x264 encoder uses a resolution proportional to the variance of the AC
   coefficients raised to the power a, with a=0.173.  For gain-shape
   quantization, this is equivalent to using a resolution of g^(2a),
   where g is the gain.  We can derive a scalar quantizer that follows
   this resolution:

                       g = Q_g*gamma^b ,

   where gamma is the gain quantization index, b = 1/(1-2*a), and Q_g is
   the gain resolution and main quality parameter.

   An important aspect of the current proposal is the use of prediction.
   In the case of the gain, there is usually a significant correlation
   with the gain of neighboring blocks.  One way to predict the gain of
   a block is to compute the gain of the coefficients obtained through
   intra or inter prediction.  Another way is to use the encoded gain of
   the neighboring blocks to explicitly predict the gain of the current
   block.

3.  Householder Reflection

   Let vector x_d denote the (pre-normalization) DCT band to be coded in
   the current block and let vector r_d denote the corresponding
   reference (based on intra prediction or motion compensation).  The
   encoder computes and encodes the "band gain" g = sqrt(x_d^T x_d).
   The normalized band is computed as

                              x_d
                       x = --------- ,
                           || x_d ||

   with the normalized reference vector r similarly computed based on
   r_d.  The encoder then finds the position and sign of the largest
   component in vector r:

                       m = argmax_i | r_i |
                       s = sign(r_m)

   and computes the Householder reflection that reflects r to -s e_m,
   where e_m is a unit vector that points in the direction of dimension
   m.  The reflection vector is given by

                       v = r + s e_m .

   The encoder reflects the normalized band to find the unit-norm vector

                              v^T x
                   z = x - 2 ------- v .
                              v^T v

   The closer the current band is to the reference band, the closer z
This can be represented either as an angle, or as a 149 coordinate on a projected pyramid. 151 4. Angle-Based Encoding 153 Assuming no quantization, the similarity can be represented by the 154 angle 156 theta = arccos(-s z_m) . 158 If theta is quantized and transmitted to the decoder, then z can be 159 reconstructed as 161 z = -s cos(theta) e_m + sin(theta) z_r , 163 where z_r is a unit vector based on z that excludes dimension m. 165 The vector z_r can be quantized using PVQ. Let y be a vector of 166 integers that satisfies 168 sum_i(|y[i]|) = K , 170 with K determined in advance, then the PVQ search finds the vector y 171 that maximizes y^T z_r / (y^T y) . The quantized version of z_r is 173 y 174 z_rq = ------- . 175 || y || 177 If we assume that MSE is a good criterion for optimizing the 178 resolution, then the angle quantization resolution should be 179 (roughly) 181 dg 1 b 182 Q_theta = ---------*----- = ------ . 183 d(gamma) g gamma 185 To derive the optimal K we need to consider the normalized distortion 186 for a Laplace-distributed variable found experimentally to be 187 approximately 188 (N-1)^2 + C*(N-1) 189 D_p = ----------------- , 190 24*K^2 192 with C ~= 4.2. The distortion due to the gain is 194 b^2*Q_g^2*gamma^(2*b-2) 195 D_g = ----------------------- . 196 12 198 Since PVQ codes N-2 degrees of freedom, its distortion should also be 199 (N-2) times the gain distortion, which eventually leads us to the 200 optimal number of pulses 202 gamma*sin(theta) / N + C - 2 \ 203 K = ---------------- sqrt | --------- | . 204 b \ 2 / 206 The value of K does not need to be coded because all the variables it 207 depends on are known to the decoder. However, because Q_theta 208 depends on the gain, this can lead to unacceptable loss propagation 209 behavior in the case where inter prediction is used for the gain. 210 This problem can be worked around by making the approximation 211 sin(theta)~=theta. 
   With this approximation, K depends only on the theta quantization
   index, with no dependency on the gain.  Alternatively, instead of
   quantizing theta, we can quantize sin(theta), which also removes the
   dependency on the gain.  In the general case, we quantize f(theta)
   and then assume that sin(theta) ~= f(theta).  A possible choice of
   f(theta) is a quadratic function of the form

               f(theta) = a1*theta - a2*theta^2 ,

   where a1 and a2 are two constants satisfying the constraint that
   f(pi/2) = pi/2.  The value of f(theta) can also be predicted, but in
   cases where we care about error propagation, it should only be
   predicted from information coded in the current frame.

5.  Bi-prediction

   We can use this scheme for bi-prediction by introducing a second
   theta parameter.  For the case of two (normalized) reference frames
   r1 and r2, we introduce s1 = (r1+r2)/2 and s2 = (r1-r2)/2.  We start
   by using s1 as a reference, apply the Householder reflection to both
   x and s2, and evaluate theta1.  From there, we derive a second
   Householder reflection from the reflected version of s2 and apply it
   to z.  The result is that the theta2 parameter controls how the
   current image compares to the two reference images.  It should even
   be possible to use this in the case of fades, using two references
   that are before the frame being encoded.

6.  Coefficient Coding

   Encoding coefficients quantized with PVQ differs from encoding
   scalar-quantized coefficients in that the sum of the coefficients'
   magnitudes is known (equal to K).  It is possible to take advantage
   of the known K value either by modeling the distribution of
   coefficient magnitudes or by modeling the zero runs.
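   One simple consequence of the known sum can be sketched as follows:
   a toy magnitude decoder can shrink each symbol's range as pulses are
   used up and stop as soon as the budget is exhausted.  Here
   read_symbol is a hypothetical entropy-decoder callback, not Daala's
   API:

```python
def decode_magnitudes(read_symbol, N, K):
    """Toy decoder for PVQ magnitudes: because the magnitudes must sum
    to K, the remaining pulse budget bounds every symbol, the tail
    needs no bits once the budget hits zero, and the last position is
    fully determined."""
    y = [0] * N
    remaining = K
    for n in range(N):
        if remaining == 0:
            break                      # all remaining coefficients are zero
        if n == N - 1:
            y[n] = remaining           # last position: no bits needed
            break
        y[n] = read_symbol(max_value=remaining)  # symbol range shrinks
        remaining -= y[n]
    return y

vals = iter([2, 0, 1])                 # stub symbol stream for demonstration
y = decode_magnitudes(lambda max_value: next(vals), N=4, K=5)
# y == [2, 0, 1, 2]; the last magnitude was inferred, not read
```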
   In the case of magnitude modeling, the expectation of the magnitude
   of coefficient n is modeled as

                                  K_n
               E(|y_n|) = alpha * ----- ,
                                  N - n

   where K_n is the number of pulses left after encoding coefficients
   from 0 to n-1 and alpha depends on the distribution of the
   coefficients.  For run-length modeling, the expectation of the
   position of the next non-zero coefficient is given by

                                 N - n
               E(|run|) = beta * ----- ,
                                  K_n

   where beta also models the coefficient distribution.

7.  Development Repository

   The algorithms in this proposal are being developed as part of
   Xiph.Org's Daala project.  The code is available in the Daala git
   repository; see the Daala project page for more information.

8.  IANA Considerations

   This document makes no request of IANA.

9.  Security Considerations

   This draft has no security considerations.

10.  Acknowledgements

   Thanks to Jason Garrett-Glaser, Timothy Terriberry, Greg Maxwell,
   and Nathan Egge for their contributions to this document.

11.  Informative References

   [Perceptual-VQ]
              Valin, JM. and TB. Terriberry, "Perceptual Vector
              Quantization for Video Coding", Proceedings of SPIE Visual
              Information Processing and Communication, February 2015.

   [PVQ-demo]
              Valin, JM., "Daala: Perceptual Vector Quantization (PVQ)",
              November 2014.

   [Pyramid-VQ]
              Fischer, T., "A Pyramid Vector Quantizer", IEEE Trans. on
              Information Theory, Vol. 32, pp. 568-583, July 1986.

   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of the
              Opus Audio Codec", RFC 6716, September 2012.

Author's Address

   Jean-Marc Valin
   Mozilla
   331 E. Evelyn Avenue
   Mountain View, CA 94041
   USA

   Email: jmvalin@jmvalin.ca