Network Working Group                                          JM. Valin
Internet-Draft                                                   Mozilla
Intended status: Standards Track                            July 4, 2014
Expires: January 5, 2015

             Pyramid Vector Quantization for Video Coding
                    draft-valin-videocodec-pvq-01

Abstract

   This document proposes applying pyramid vector quantization (PVQ) to
   video coding.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 5, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may not be modified, and derivative works of it may not
   be created, and it may not be published except as an Internet-Draft.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Gain-Shape Coding and Activity Masking
   4.  Householder Reflection
   5.  Angle-Based Encoding
   6.  Bi-prediction
   7.  Coefficient coding
   8.  Development Repository
   9.  IANA Considerations
   10. Security Considerations
   11. Acknowledgements
   12. References
       12.1.  Normative References
       12.2.  Informative References
       12.3.  URIs
   Author's Address

1.  Introduction

   This draft describes a proposal for adapting the Opus [RFC6716]
   energy conservation principle to video coding based on a pyramid
   vector quantizer (PVQ) [PVQ].  One potential advantage of conserving
   the energy of the AC coefficients in video coding is that textures
   are preserved rather than low-passed.  Also, by introducing a fixed-
   resolution PVQ-type quantizer, we automatically gain a simple
   activity masking model.

   The main challenge of adapting this scheme to video is that we have
   a good prediction (the reference frame), so we are essentially
   starting from a point that is already on the PVQ hyper-sphere,
   rather than at the origin as in CELT.  Other challenges are the
   introduction of a quantization matrix and the fact that we want the
   reference (motion-predicted) data to correspond exactly to one of
   the entries in our codebook.
2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3.  Gain-Shape Coding and Activity Masking

   The main idea behind the proposed video coding scheme is to code
   groups of DCT coefficients as a scalar gain and a unit-norm "shape"
   vector.  A block's AC coefficients may all be part of the same
   group, or they may be divided by frequency (e.g. by octave) and/or
   by directionality (horizontal vs. vertical).

   It is desirable for a single quality parameter to control the
   resolution of both the gain and the shape.  Ideally, that quality
   parameter should also take into account activity masking, that is,
   the fact that the eye is less sensitive to regions of an image that
   have more detail.  According to Jason Garrett-Glaser, the perceptual
   analysis in the x264 encoder uses a resolution proportional to the
   variance of the AC coefficients raised to the power a, with a=0.173.
   For gain-shape quantization, this is equivalent to using a
   resolution of g^(2a), where g is the gain.  We can derive a scalar
   quantizer that follows this resolution:

                   b
      g = Q_g gamma  ,

   where gamma is the gain quantization index, b = 1/(1-2*a), and Q_g
   is the gain resolution and the main quality parameter.

   An important aspect of the current proposal is the use of
   prediction.  In the case of the gain, there is usually a significant
   correlation with the gains of neighboring blocks.  One way to
   predict the gain of a block is to compute the gain of the
   coefficients obtained through intra or inter prediction.  Another
   way is to use the encoded gains of the neighboring blocks to
   explicitly predict the gain of the current block.
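   As a concrete, non-normative illustration, the gain quantizer above
   can be sketched in a few lines of Python.  Only the relation
   g = Q_g * gamma^b with b = 1/(1-2*a) and a = 0.173 comes from the
   text; the value of Q_g and the test gain below are arbitrary
   assumptions.

```python
import math

# Non-normative sketch of the gain quantizer described above.
# a = 0.173 is the activity-masking exponent cited from x264;
# b = 1/(1 - 2*a).  Q_g (the quality parameter) is an arbitrary
# illustrative value, not taken from the draft.
A = 0.173
B = 1.0 / (1.0 - 2.0 * A)

def quantize_gain(g, q_g):
    """Invert g = Q_g * gamma^b to get the gain index gamma."""
    return int(round((g / q_g) ** (1.0 / B)))

def dequantize_gain(gamma, q_g):
    """Reconstruct the gain: g = Q_g * gamma^b."""
    return q_g * gamma ** B

q_g = 4.0                            # assumed quality parameter
gamma = quantize_gain(100.0, q_g)    # index transmitted to the decoder
g_hat = dequantize_gain(gamma, q_g)  # reconstructed gain
```

   Because b > 1, the quantizer's step size grows with the gain, which
   is the activity-masking behavior described above.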
4.  Householder Reflection

   Let vector x_d denote the (pre-normalization) DCT band to be coded
   in the current block and let vector r_d denote the corresponding
   reference (based on intra prediction or motion compensation).  The
   encoder computes and encodes the "band gain" g = sqrt(x_d^T x_d).
   The normalized band is computed as

             x_d
      x = --------- ,
          || x_d ||

   with the normalized reference r similarly computed based on r_d.
   The encoder then finds the position and sign of the maximum value
   in r:

      m = argmax_i | r_i |
      s = sign(r_m)

   and computes the Householder reflection that reflects r to -s e_m.
   The reflection vector is given by

      v = r + s e_m .

   The encoder reflects the normalized band to find the unit-norm
   vector

                 v^T x
      z = x - 2 ------- v .
                 v^T v

   The closer the current band is to the reference band, the closer z
   is to -s e_m.  This can be represented either as an angle, or as a
   coordinate on a projected pyramid.

5.  Angle-Based Encoding

   Assuming no quantization, the similarity can be represented by the
   angle

      theta = arccos(-s z_m) .

   If theta is quantized and transmitted to the decoder, then z can be
   reconstructed as

      z = -s cos(theta) e_m + sin(theta) z_r ,

   where z_r is a unit vector based on z that excludes dimension m.

   The vector z_r can be quantized using PVQ.  Let y be a vector of
   integers that satisfies

      sum_i(|y[i]|) = K ,

   with K determined in advance.  The PVQ search then finds the vector
   y that maximizes y^T z_r / sqrt(y^T y).  The quantized version of
   z_r is

                y
      z_rq = ------- .
             || y ||

   If we assume that MSE is a good criterion for optimizing the
   resolution, then the angle quantization resolution should be
   (roughly)

                  dg       1       b
      Q_theta = -------- * --- = ----- .
                d(gamma)    g    gamma
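   The reflection and search steps described above can be sketched in
   plain Python.  This is a non-normative illustration: the greedy
   one-pulse-at-a-time search below is a simple stand-in for an
   optimized PVQ search, and all function and variable names are ours.

```python
import math

def normalize(v):
    """x = x_d / ||x_d||"""
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]

def householder_reflect(x, r):
    """Reflect the normalized band x with the Householder reflection
    that maps the normalized reference r to -s*e_m."""
    m = max(range(len(r)), key=lambda i: abs(r[i]))
    s = 1.0 if r[m] >= 0 else -1.0
    v = list(r)
    v[m] += s                               # v = r + s*e_m
    vv = sum(t * t for t in v)              # v^T v
    vx = sum(a * b for a, b in zip(v, x))   # v^T x
    z = [a - 2.0 * vx / vv * b for a, b in zip(x, v)]
    return z, m, s

def pvq_search(z, k):
    """Greedy search for an integer vector y with sum(|y|) = K that
    maximizes y^T z / sqrt(y^T y); a sketch, not the codec's search."""
    y = [0] * len(z)
    for _ in range(k):
        best_i, best_sgn, best_cost = 0, 1, -1.0
        for i in range(len(z)):
            sgn = 1 if z[i] >= 0 else -1
            cand = list(y)
            cand[i] += sgn                  # try adding one pulse here
            yz = sum(a * b for a, b in zip(cand, z))
            yy = sum(t * t for t in cand)
            if yz * yz / yy > best_cost:
                best_i, best_sgn, best_cost = i, sgn, yz * yz / yy
        y[best_i] += best_sgn
    return y
```

   A convenient sanity check is that reflecting the reference onto
   itself yields exactly -s*e_m.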
   To derive the optimal K, we need to consider the normalized
   distortion for a Laplace-distributed variable, found experimentally
   to be approximately

            (N-1)^2 + C*(N-1)
      D_p = ----------------- ,
                 24*K^2

   with C ~= 4.2.  The distortion due to the gain is

            b^2 * Q_g^2 * gamma^(2*b-2)
      D_g = --------------------------- .
                        12

   Since PVQ codes N-2 degrees of freedom, its distortion should also
   be (N-2) times the gain distortion, which eventually leads us to the
   optimal number of pulses

          gamma*sin(theta)        / N + C - 2 \
      K = ---------------- * sqrt | --------- | .
                 b                \     2     /

   The value of K does not need to be coded because all the variables
   it depends on are known to the decoder.  However, because Q_theta
   depends on the gain, this can lead to unacceptable loss propagation
   behavior in the case where inter prediction is used for the gain.
   This problem can be worked around by making the approximation
   sin(theta) ~= theta.  With this approximation, gamma*sin(theta)/b
   is simply equal to the theta quantization index, with no dependency
   on the gain.  Alternatively, instead of quantizing theta, we can
   quantize sin(theta), which also removes the dependency on the gain.
   In the general case, we quantize f(theta) and then assume that
   sin(theta) ~= f(theta).  A possible choice of f(theta) is a
   quadratic function of the form

                                      2
      f(theta) = a1 theta - a2 theta  ,

   where a1 and a2 are two constants satisfying the constraint that
   f(pi/2) = pi/2.  The value of f(theta) can also be predicted, but in
   cases where we care about error propagation, it should only be
   predicted from information coded in the current frame.

6.  Bi-prediction

   We can use this scheme for bi-prediction by introducing a second
   theta parameter.  For the case of two (normalized) reference frames
   r1 and r2, we introduce s1 = (r1+r2)/2 and s2 = (r1-r2)/2.
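   As a small non-normative sketch (names ours), the construction of s1
   and s2 looks like the following.  Note that because r1 and r2 are
   both unit-norm, s1 and s2 are orthogonal:
   s1^T s2 = (r1^T r1 - r2^T r2)/4 = 0.

```python
import math

def normalize(v):
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]

def biprediction_basis(r1_d, r2_d):
    """Build the s1/s2 pair from two (pre-normalization) references.
    Non-normative sketch; names are illustrative only."""
    r1 = normalize(r1_d)
    r2 = normalize(r2_d)
    s1 = [(a + b) / 2.0 for a, b in zip(r1, r2)]  # s1 = (r1+r2)/2
    s2 = [(a - b) / 2.0 for a, b in zip(r1, r2)]  # s2 = (r1-r2)/2
    return s1, s2
```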
   We start by using s1 as a reference, apply the Householder
   reflection to both x and s2, and evaluate theta1.  From there, we
   derive a second Householder reflection from the reflected version of
   s2 and apply it to z.  The result is that the theta2 parameter
   controls how the current image compares to the two reference images.
   It should even be possible to use this in the case of fades, using
   two references that are before the frame being encoded.

7.  Coefficient coding

   Encoding coefficients quantized with PVQ differs from encoding
   scalar-quantized coefficients in that the sum of the coefficients'
   magnitudes is known (equal to K).  It is possible to take advantage
   of the known K value either by modeling the distribution of
   coefficient magnitudes or by modeling the zero runs.  In the case of
   magnitude modeling, the expectation of the magnitude of coefficient
   n is modeled as

                           K_n
      E(|y_n|) = alpha * ----- ,
                         N - n

   where K_n is the number of pulses left after encoding coefficients
   from 0 to n-1 and alpha depends on the distribution of the
   coefficients.  For run-length modeling, the expectation of the
   position of the next non-zero coefficient is given by

                         N - n
      E(|run|) = beta * ----- ,
                          K_n

   where beta also models the coefficient distribution.

8.  Development Repository

   The algorithms in this proposal are being developed as part of
   Xiph.Org's Daala project.  The code is available in the Daala git
   repository at [1].  See [2] for more information.

9.  IANA Considerations

   This document makes no request of IANA.

10.  Security Considerations

   This draft has no security considerations.

11.  Acknowledgements

   Thanks to Jason Garrett-Glaser, Timothy Terriberry, Greg Maxwell,
   and Nathan Egge for their contributions to this document.

12.  References
12.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

12.2.  Informative References

   [PVQ]      Fischer, T., "A Pyramid Vector Quantizer", IEEE Trans. on
              Information Theory, Vol. 32, pp. 568-583, July 1986.

   [RFC6716]  Valin, JM., Vos, K., and T. Terriberry, "Definition of
              the Opus Audio Codec", RFC 6716, September 2012.

12.3.  URIs

   [1] https://git.xiph.org/daala.git

   [2] https://xiph.org/daala/

Author's Address

   Jean-Marc Valin
   Mozilla
   331 E. Evelyn Avenue
   Mountain View, CA 94041
   USA

   Email: jmvalin@jmvalin.ca