Network Working Group                                          JM. Valin
Internet-Draft                                                   Mozilla
Intended status: Standards Track                            July 4, 2014
Expires: January 5, 2015

             Pyramid Vector Quantization for Video Coding
                    draft-valin-videocodec-pvq-01

Abstract

   This document proposes applying pyramid vector quantization (PVQ) to
   video coding.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 5, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may not be modified, and derivative works of it may not
   be created, and it may not be published except as an Internet-Draft.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Gain-Shape Coding and Activity Masking
   4.  Householder Reflection
   5.  Angle-Based Encoding
   6.  Bi-prediction
   7.  Coefficient coding
   8.  Development Repository
   9.  IANA Considerations
   10. Security Considerations
   11. Acknowledgements
   12. References
       12.1.  Normative References
       12.2.  Informative References
       12.3.  URIs
   Author's Address

1.  Introduction

   This draft describes a proposal for adapting the Opus [RFC6716]
   energy conservation principle to video coding based on a pyramid
   vector quantizer (PVQ) [PVQ].  One potential advantage of conserving
   the energy of the AC coefficients in video coding is that textures
   are preserved rather than low-passed.  Also, by introducing a fixed-
   resolution PVQ-type quantizer, we automatically gain a simple
   activity masking model.

   The main challenge of adapting this scheme to video is that we have
   a good prediction (the reference frame), so we are essentially
   starting from a point that is already on the PVQ hyper-sphere,
   rather than at the origin as in CELT.  Other challenges are the
   introduction of a quantization matrix and the fact that we want the
   reference (motion-predicted) data to correspond exactly to one of
   the entries in our codebook.
2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Gain-Shape Coding and Activity Masking

   The main idea behind the proposed video coding scheme is to code
   groups of DCT coefficients as a scalar gain and a unit-norm "shape"
   vector.  A block's AC coefficients may all be part of the same
   group, or they may be divided by frequency (e.g. by octave) and/or
   by directionality (horizontal vs. vertical).

   It is desirable for a single quality parameter to control the
   resolution of both the gain and the shape.  Ideally, that quality
   parameter should also take into account activity masking, that is,
   the fact that the eye is less sensitive to regions of an image that
   have more detail.  According to Jason Garrett-Glaser, the perceptual
   analysis in the x264 encoder uses a resolution proportional to the
   variance of the AC coefficients raised to the power a, with a=0.173.
   For gain-shape quantization, this is equivalent to using a
   resolution of g^(2a), where g is the gain.  We can derive a scalar
   quantizer that follows this resolution:

                   b
      g = Q_g gamma  ,

   where gamma is the gain quantization index, b = 1/(1-2*a), and Q_g
   is the gain resolution and the main quality parameter.

   An important aspect of the current proposal is the use of
   prediction.  In the case of the gain, there is usually a significant
   correlation with the gains of neighboring blocks.  One way to
   predict the gain of a block is to compute the gain of the
   coefficients obtained through intra or inter prediction.  Another
   way is to use the encoded gains of the neighboring blocks to
   explicitly predict the gain of the current block.
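   As a concrete, non-normative illustration, the gain quantizer above
   can be sketched in a few lines of Python.  Only the relation
   g = Q_g * gamma^b with b = 1/(1-2*a) and a = 0.173 comes from the
   text; the value of Q_g and the test gain below are arbitrary
   assumptions.

```python
import math

# Non-normative sketch of the gain quantizer described above.
# a = 0.173 is the activity-masking exponent cited from x264;
# b = 1/(1 - 2*a).  Q_g (the quality parameter) is an arbitrary
# illustrative value, not taken from the draft.
A = 0.173
B = 1.0 / (1.0 - 2.0 * A)

def quantize_gain(g, q_g):
    """Invert g = Q_g * gamma^b to get the gain index gamma."""
    return int(round((g / q_g) ** (1.0 / B)))

def dequantize_gain(gamma, q_g):
    """Reconstruct the gain: g = Q_g * gamma^b."""
    return q_g * gamma ** B

q_g = 4.0                            # assumed quality parameter
gamma = quantize_gain(100.0, q_g)    # index transmitted to the decoder
g_hat = dequantize_gain(gamma, q_g)  # reconstructed gain
```

   Because b > 1, the quantizer's step size grows with the gain, which
   is the activity-masking behavior described above.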
4.  Householder Reflection

   Let vector x_d denote the (pre-normalization) DCT band to be coded
   in the current block and let vector r_d denote the corresponding
   reference (based on intra prediction or motion compensation).  The
   encoder computes and encodes the "band gain" g = sqrt(x_d^T x_d).
   The normalized band is computed as

             x_d
      x = --------- ,
          || x_d ||

   with the normalized reference r similarly computed based on r_d.
   The encoder then finds the position and sign of the maximum value
   in r:

      m = argmax_i | r_i |
      s = sign(r_m)

   and computes the Householder reflection that reflects r to -s e_m.
   The reflection vector is given by

      v = r + s e_m .

   The encoder reflects the normalized band to find the unit-norm
   vector

                 v^T x
      z = x - 2 ------- v .
                 v^T v

   The closer the current band is to the reference band, the closer z
   is to -s e_m.  This can be represented either as an angle, or as a
   coordinate on a projected pyramid.

5.  Angle-Based Encoding

   Assuming no quantization, the similarity can be represented by the
   angle

      theta = arccos(-s z_m) .

   If theta is quantized and transmitted to the decoder, then z can be
   reconstructed as

      z = -s cos(theta) e_m + sin(theta) z_r ,

   where z_r is a unit vector based on z that excludes dimension m.

   The vector z_r can be quantized using PVQ.  Let y be a vector of
   integers that satisfies

      sum_i(|y[i]|) = K ,

   with K determined in advance.  The PVQ search then finds the vector
   y that maximizes y^T z_r / sqrt(y^T y).  The quantized version of
   z_r is

                y
      z_rq = ------- .
             || y ||

   If we assume that MSE is a good criterion for optimizing the
   resolution, then the angle quantization resolution should be
   (roughly)

                  dg       1       b
      Q_theta = -------- * --- = ----- .
                d(gamma)    g    gamma
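   The reflection and search steps described above can be sketched in
   plain Python.  This is a non-normative illustration: the greedy
   one-pulse-at-a-time search below is a simple stand-in for an
   optimized PVQ search, and all function and variable names are ours.

```python
import math

def normalize(v):
    """x = x_d / ||x_d||"""
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]

def householder_reflect(x, r):
    """Reflect the normalized band x with the Householder reflection
    that maps the normalized reference r to -s*e_m."""
    m = max(range(len(r)), key=lambda i: abs(r[i]))
    s = 1.0 if r[m] >= 0 else -1.0
    v = list(r)
    v[m] += s                               # v = r + s*e_m
    vv = sum(t * t for t in v)              # v^T v
    vx = sum(a * b for a, b in zip(v, x))   # v^T x
    z = [a - 2.0 * vx / vv * b for a, b in zip(x, v)]
    return z, m, s

def pvq_search(z, k):
    """Greedy search for an integer vector y with sum(|y|) = K that
    maximizes y^T z / sqrt(y^T y); a sketch, not the codec's search."""
    y = [0] * len(z)
    for _ in range(k):
        best_i, best_sgn, best_cost = 0, 1, -1.0
        for i in range(len(z)):
            sgn = 1 if z[i] >= 0 else -1
            cand = list(y)
            cand[i] += sgn                  # try adding one pulse here
            yz = sum(a * b for a, b in zip(cand, z))
            yy = sum(t * t for t in cand)
            if yz * yz / yy > best_cost:
                best_i, best_sgn, best_cost = i, sgn, yz * yz / yy
        y[best_i] += best_sgn
    return y
```

   A convenient sanity check is that reflecting the reference onto
   itself yields exactly -s*e_m.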
   To derive the optimal K, we need to consider the normalized
   distortion for a Laplace-distributed variable, found experimentally
   to be approximately

            (N-1)^2 + C*(N-1)
      D_p = ----------------- ,
                 24*K^2

   with C ~= 4.2.  The distortion due to the gain is

            b^2 * Q_g^2 * gamma^(2*b-2)
      D_g = --------------------------- .
                        12

   Since PVQ codes N-2 degrees of freedom, its distortion should also
   be (N-2) times the gain distortion, which eventually leads us to the
   optimal number of pulses

          gamma*sin(theta)        / N + C - 2 \
      K = ---------------- * sqrt | --------- | .
                 b                \     2     /

   The value of K does not need to be coded because all the variables
   it depends on are known to the decoder.  However, because Q_theta
   depends on the gain, this can lead to unacceptable loss propagation
   behavior in the case where inter prediction is used for the gain.
   This problem can be worked around by making the approximation
   sin(theta) ~= theta.  With this approximation, gamma*sin(theta)/b
   is simply equal to the theta quantization index, with no dependency
   on the gain.  Alternatively, instead of quantizing theta, we can
   quantize sin(theta), which also removes the dependency on the gain.
   In the general case, we quantize f(theta) and then assume that
   sin(theta) ~= f(theta).  A possible choice of f(theta) is a
   quadratic function of the form

                                      2
      f(theta) = a1 theta - a2 theta  ,

   where a1 and a2 are two constants satisfying the constraint that
   f(pi/2) = pi/2.  The value of f(theta) can also be predicted, but in
   cases where we care about error propagation, it should only be
   predicted from information coded in the current frame.

6.  Bi-prediction

   We can use this scheme for bi-prediction by introducing a second
   theta parameter.  For the case of two (normalized) reference frames
   r1 and r2, we introduce s1 = (r1+r2)/2 and s2 = (r1-r2)/2.
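   As a small non-normative sketch (names ours), the construction of s1
   and s2 looks like the following.  Note that because r1 and r2 are
   both unit-norm, s1 and s2 are orthogonal:
   s1^T s2 = (r1^T r1 - r2^T r2)/4 = 0.

```python
import math

def normalize(v):
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]

def biprediction_basis(r1_d, r2_d):
    """Build the s1/s2 pair from two (pre-normalization) references.
    Non-normative sketch; names are illustrative only."""
    r1 = normalize(r1_d)
    r2 = normalize(r2_d)
    s1 = [(a + b) / 2.0 for a, b in zip(r1, r2)]  # s1 = (r1+r2)/2
    s2 = [(a - b) / 2.0 for a, b in zip(r1, r2)]  # s2 = (r1-r2)/2
    return s1, s2
```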
   We start by using s1 as a reference, apply the Householder
   reflection to both x and s2, and evaluate theta1.  From there, we
   derive a second Householder reflection from the reflected version of
   s2 and apply it to z.  The result is that the theta2 parameter
   controls how the current image compares to the two reference images.
   It should even be possible to use this in the case of fades, using
   two references that are before the frame being encoded.

7.  Coefficient coding

   Encoding coefficients quantized with PVQ differs from encoding
   scalar-quantized coefficients in that the sum of the coefficients'
   magnitudes is known (equal to K).  It is possible to take advantage
   of the known K value either by modeling the distribution of
   coefficient magnitudes or by modeling the zero runs.  In the case of
   magnitude modeling, the expectation of the magnitude of coefficient
   n is modeled as

                           K_n
      E(|y_n|) = alpha * ----- ,
                         N - n

   where K_n is the number of pulses left after encoding coefficients
   from 0 to n-1 and alpha depends on the distribution of the
   coefficients.  For run-length modeling, the expectation of the
   position of the next non-zero coefficient is given by

                         N - n
      E(|run|) = beta * ----- ,
                          K_n

   where beta also models the coefficient distribution.

8.  Development Repository

   The algorithms in this proposal are being developed as part of
   Xiph.Org's Daala project.  The code is available in the Daala git
   repository at [1].  See [2] for more information.

9.  IANA Considerations

   This document makes no request of IANA.

10.  Security Considerations

   This draft has no security considerations.

11.  Acknowledgements

   Thanks to Jason Garrett-Glaser, Timothy Terriberry, Greg Maxwell,
   and Nathan Egge for their contributions to this document.

12.  References
12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2.  Informative References

   [PVQ]      Fischer, T., "A Pyramid Vector Quantizer", IEEE Trans. on
              Information Theory, Vol. 32, pp. 568-583, July 1986.

   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of
              the Opus Audio Codec", RFC 6716, September 2012.

12.3.  URIs

   [1] https://git.xiph.org/daala.git

   [2] https://xiph.org/daala/

Author's Address

   Jean-Marc Valin
   Mozilla
   331 E. Evelyn Avenue
   Mountain View, CA 94041
   USA

   Email: jmvalin@jmvalin.ca