netvc                                                      T. Terriberry
Internet-Draft                                                   N. Egge
Intended status: Informational                       Mozilla Corporation
Expires: May 4, 2017                                    October 31, 2016

             Coding Tools for a Next Generation Video Codec
                  draft-terriberry-netvc-codingtools-01

Abstract

   This document proposes a number of coding tools that could be incorporated into a next-generation video codec.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
Please review these documents 41 carefully, as they describe your rights and restrictions with respect 42 to this document. Code Components extracted from this document must 43 include Simplified BSD License text as described in Section 4.e of 44 the Trust Legal Provisions and are provided without warranty as 45 described in the Simplified BSD License. 47 Table of Contents 49 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 50 2. Entropy Coding . . . . . . . . . . . . . . . . . . . . . . . 3 51 2.1. Non-binary Arithmetic Coding . . . . . . . . . . . . . . 4 52 2.2. Non-binary Context Modeling . . . . . . . . . . . . . . . 5 53 2.3. Dyadic Adaptation . . . . . . . . . . . . . . . . . . . . 6 54 2.4. Simplified Partition Function . . . . . . . . . . . . . . 9 55 2.4.1. Reduced-Overhead Partition Function . . . . . . . . . 10 56 2.5. Context Adaptation . . . . . . . . . . . . . . . . . . . 11 57 2.5.1. Implicit Adaptation . . . . . . . . . . . . . . . . . 11 58 2.5.2. Explicit Adaptation . . . . . . . . . . . . . . . . . 12 59 2.5.3. Early Adaptation . . . . . . . . . . . . . . . . . . 12 60 2.6. Simple Experiment . . . . . . . . . . . . . . . . . . . . 13 61 3. Reversible Integer Transforms . . . . . . . . . . . . . . . . 14 62 3.1. Lifting Steps . . . . . . . . . . . . . . . . . . . . . . 14 63 3.2. 4-Point Transform . . . . . . . . . . . . . . . . . . . . 17 64 3.3. Larger Transforms . . . . . . . . . . . . . . . . . . . . 20 65 3.4. Walsh-Hadamard Transforms . . . . . . . . . . . . . . . . 20 66 4. Development Repository . . . . . . . . . . . . . . . . . . . 22 67 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 68 6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 22 69 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 70 7.1. Informative References . . . . . . . . . . . . . . . . . 22 71 7.2. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 24 72 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 74 1. Introduction 76 One of the biggest contributing factors to the success of the 77 Internet is that the underlying protocols are implementable on a 78 royalty-free basis. This allows them to be implemented widely and 79 easily distributed by application developers, service operators, and 80 end users, without asking for permission. In order to produce a 81 next-generation video codec that is competitive with the best patent- 82 encumbered standards, yet avoids patents which are not available on 83 an open-source compatible, royalty-free basis, we must use old coding 84 tools in new ways and develop new coding tools. This draft documents 85 some of the tools we have been working on for inclusion in such a 86 codec. This is early work, and the performance of some of these 87 tools (especially in relation to other approaches) is not yet fully 88 known. Nevertheless, it still serves to outline some possibilities 89 an eventual working group, if formed, could consider. 91 2. Entropy Coding 93 The basic theory of entropy coding was well-established by the late 94 1970's [Pas76]. Modern video codecs have focused on Huffman codes 95 (or "Variable-Length Codes"/VLCs) and binary arithmetic coding. 96 Huffman codes are limited in the amount of compression they can 97 provide and the design flexibility they allow, but as each code word 98 consists of an integer number of bits, their implementation 99 complexity is very low, so they were provided at least as an option 100 in every video codec up through H.264. 
Arithmetic coding, on the 101 other hand, uses code words that can take up fractional parts of a 102 bit, and are more complex to implement. However, the prevalence of 103 cheap, H.264 High Profile hardware, which requires support for 104 arithmetic coding, shows that it is no longer so expensive that a 105 fallback VLC-based approach is required. Having a single entropy- 106 coding method simplifies both up-front design costs and 107 interoperability. 109 However, the primary limitation of arithmetic coding is that it is an 110 inherently serial operation. A given symbol cannot be decoded until 111 the previous symbol is decoded, because the bits (if any) that are 112 output depend on the exact state of the decoder at the time it is 113 decoded. This means that a hardware implementation must run at a 114 sufficiently high clock rate to be able to decode all of the symbols 115 in a frame. Higher clock rates lead to increased power consumption, 116 and in some cases the entropy coding is actually becoming the 117 limiting factor in these designs. 119 As fabrication processes improve, implementers are very willing to 120 trade increased gate count for lower clock speeds. So far, most 121 approaches to allowing parallel entropy coding have focused on 122 splitting the encoded symbols into multiple streams that can be 123 decoded independently. This "independence" requirement has a non- 124 negligible impact on compression, parallelizability, or both. For 125 example, H.264 can split frames into "slices" which might cover only 126 a small subset of the blocks in the frame. In order to allow 127 decoding these slices independently, they cannot use context 128 information from blocks in other slices (harming compression). Those 129 contexts must adapt rapidly to account for the generally small number 130 of symbols available for learning probabilities (also harming 131 compression). In some cases the number of contexts must be reduced 132 to ensure enough symbols are coded in each context to usefully learn 133 probabilities at all (once more, harming compression). Furthermore, 134 an encoder must specially format the stream to use multiple slices 135 per frame to allow any parallel entropy decoding at all. Encoders 136 rarely have enough information to evaluate this "compression 137 efficiency" vs. "parallelizability" trade-off, since they don't 138 generally know the limitations of the decoders for which they are 139 encoding. That means there will be many files or streams which could 140 have been decoded if they were encoded with different options, but 141 which a given decoder cannot decode because of bad choices made by 142 the encoder (at least from the perspective of that decoder). The 143 same set of drawbacks apply to the DCT token partitions in 144 VP8 [RFC6386]. 146 2.1. Non-binary Arithmetic Coding 148 Instead, we propose a very different approach: use non-binary 149 arithmetic coding. In binary arithmetic coding, each decoded symbol 150 has one of two possible values: 0 or 1. The original arithmetic 151 coding algorithms allow a symbol to take on any number of possible 152 values, and allow the size of that alphabet to change with each 153 symbol coded. Reasonable values of N (for example, N <= 16) offer 154 the potential for a decent throughput increase for a reasonable 155 increase in gate count for hardware implementations. 157 Binary coding allows a number of computational simplifications. 
   For example, for each coded symbol, the set of valid code points is partitioned in two, and the decoded value is determined by finding the partition in which the actual code point that was received lies.  This can be determined by computing a single partition value (in both the encoder and decoder) and (in the decoder) doing a single comparison.  A non-binary arithmetic coder partitions the set of valid code points into multiple pieces (one for each possible value of the coded symbol).  This requires the encoder to compute two partition values, in general (for both the upper and lower bound of the symbol to encode).  The decoder, on the other hand, must search the partitions for the one that contains the received code point.  This requires computing at least O(log N) partition values.

   However, coding a parameter with N possible values with a binary arithmetic coder requires O(log N) symbols in the worst case (the only case that matters for hardware design).  Hence, this does not represent any actual savings (indeed, it represents an increase in the number of partition values computed by the encoder).  In addition, there are a number of overheads that are per-symbol, rather than per-value.  For example, renormalization (which enlarges the set of valid code points after partitioning has reduced it too much), carry propagation (to deal with the case where the high and low ends of a partition straddle a bit boundary), etc., are all performed on a symbol-by-symbol basis.  Since a non-binary arithmetic coder codes a given set of values with fewer symbols than a binary one, it incurs these per-symbol overheads less often.  This suggests that a non-binary arithmetic coder can actually be more efficient than a binary one.

2.2.  Non-binary Context Modeling

   The other aspect that binary coding simplifies is probability modeling.  In arithmetic coding, the sizes of the sets the code points are partitioned into are (roughly) proportional to the probability of each possible symbol value.  Estimating these probabilities is part of the coding process, though it can be cleanly separated from the task of actually producing the coded bits.  In a binary arithmetic coder, this requires estimating the probability of only one of the two possible values (since the total probability is 1.0).  This is often done with a simple table lookup that maps the old probability and the most recently decoded symbol to a new probability to use for the next symbol in the current context.  The trade-off, of course, is that non-binary symbols must be "binarized" into a series of bits, and a context (with an associated probability) chosen for each one.

   In a non-binary arithmetic coder, the decoder must compute at least O(log N) cumulative probabilities (one for each partition value it needs).  Because these probabilities are usually not estimated directly in "cumulative" form, this can require computing (N - 1) non-cumulative probability values.  Unless N is very small, these cannot be updated with a single table lookup.  The normal approach is to use "frequency counts".  Define the frequency of value k to be

      f[k] = A*(the number of times k has been observed) + B

   where A and B are parameters (usually A=2 and B=1 for a traditional Krichevsky-Trofimov estimator).
   The resulting probability, p[k], is given by

           N-1
           __
      ft = \   f[k]
           /_
           k=0

             f[k]
      p[k] = ----
              ft

   When ft grows too large, the frequencies are rescaled (e.g., halved, rounding up to prevent reduction of a probability to 0).

   When ft is not a power of two, partitioning the code points requires actual divisions (see [RFC6716] Section 4.1 for one detailed example of exactly how this is done).  These divisions are acceptable in an audio codec like Opus [RFC6716], which only has to code a few hundreds of these symbols per second.  But video requires hundreds of thousands of symbols per second, at a minimum, and divisions are still very expensive to implement in hardware.

   There are two possible approaches to this.  One is to come up with a replacement for frequency counts that produces probabilities that sum to a power of two.  Some possibilities, which can be applied individually or in combination:

   1.  Use probabilities that are fixed for the duration of a frame.  This is the approach taken by VP8, for example, even though it uses a binary arithmetic coder.  In fact, it is possible to convert many of VP8's existing binary-alphabet probabilities into probabilities for non-binary alphabets, an approach that is used in the experiment presented at the end of this section.

   2.  Use parametric distributions.  For example, DCT coefficient magnitudes usually have an approximately exponential distribution.  This distribution can be characterized by a single parameter, e.g., the expected value.  The expected value is trivial to update after decoding a coefficient.  For example

          E[x[n+1]] = E[x[n]] + floor(C*(x[n] - E[x[n]]))

       produces an exponential moving average with a decay factor of (1 - C).  For a choice of C that is a negative power of two (e.g., 1/16 or 1/32 or similar), this can be implemented with two adds and a shift.  Given this expected value, the actual distribution to use can be obtained from a small set of pre-computed distributions via a lookup table.  Linear interpolation between these pre-computed values can improve accuracy, at the cost of O(N) computations, but if N is kept small this is trivially parallelizable, in SIMD or otherwise.

   3.  Change the frequency count update mechanism so that ft is constant.  This approach is described in the next section.

2.3.  Dyadic Adaptation

   The goal with context adaptation using dyadic probabilities is to maintain the invariant that the probabilities all sum to a power of two before and after adaptation.  This can be achieved with a special update function that blends the cumulative probabilities of the current context with a cumulative distribution function where the coded symbol has probability 1.
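   As an illustration, the following is a minimal C sketch of such an update (the function and variable names are ours, not from any existing implementation).  It transcribes the two-case update rule given in the worked example below, reading "i <= coded symbol" as comparing the 1-based CDF index i against the 0-based value of the coded symbol, which is the reading that matches the example.  Run on the example CDF below with a rate of 1/2^16, it reproduces the updated CDF shown in the example.

      #include <stdio.h>

      /* Division that rounds toward negative infinity, since the update
         rule is specified in terms of floor(). */
      static int floordiv(int a, int b) {
        return a/b - (a%b != 0 && (a^b) < 0);
      }

      /* Update the cumulative frequencies fl[1..M] (with fl[M] == ft, ft a
         power of two) after coding the 0-based symbol k, using adaptation
         rate 1/2^r. */
      static void dyadic_cdf_update(int *fl, int M, int ft, int k, int r) {
        int i;
        for (i = 1; i <= M; i++) {
          if (i <= k) fl[i] -= floordiv(fl[i] + (1 << r) - i - 1, 1 << r);
          else fl[i] -= floordiv(fl[i] + M - i - ft, 1 << r);
        }
      }

      int main(void) {
        /* The example context below: ft = 16, M = 8, coded symbol 3. */
        int fl[9] = {0, 2, 4, 7, 8, 9, 12, 14, 16};
        int i;
        dyadic_cdf_update(fl, 8, 16, 3, 16);
        for (i = 1; i <= 8; i++) printf("%d ", fl[i]);  /* 1 3 6 9 10 13 15 16 */
        printf("\n");
        return 0;
      }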
   Suppose we have a model for a given context that codes 8 symbols with the following probabilities:

   +------+------+------+------+------+------+------+------+
   | p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7] |
   +------+------+------+------+------+------+------+------+
   | 1/8  | 1/8  | 3/16 | 1/16 | 1/16 | 3/16 | 1/8  | 1/8  |
   +------+------+------+------+------+------+------+------+

   Then the cumulative distribution function is:

   [Figure: staircase plot of the CDF, rising from 0 to 1 in steps of p[0] through p[7] at bins fl[1] through fl[8].]

   Suppose we code symbol 3 and wish to update the context model so that this symbol is now more likely.  This can be done by blending the CDF for the current context with a CDF that has symbol 3 with likelihood 1.

   [Figure: plot of the target CDF, which is 0 for bins fl[1] through fl[3] and 1 for bins fl[4] through fl[8].]

   Given an adaptation rate g between 0 and 1, and assuming ft = 2^4 = 16, what we are computing is:

   +------+------+------+------+------+------+------+------+
   |   2  |   4  |   7  |   8  |   9  |  12  |  14  |  16  |  * (1 - g)
   +------+------+------+------+------+------+------+------+

                                 +

   +------+------+------+------+------+------+------+------+
   |   0  |   0  |   0  |  16  |  16  |  16  |  16  |  16  |  * g
   +------+------+------+------+------+------+------+------+

   In order to prevent the probability of any one symbol from going to zero, the blending functions above and below the coded symbol are adjusted so that no adjacent cumulative probabilities are the same.

   Let M be the alphabet size and 1/2^r be the adaptation rate:

           ( fl[i] - floor((fl[i] + 2^r - i - 1)/2^r),  i <= coded symbol
   fl[i] = <
           ( fl[i] - floor((fl[i] + M - i - ft)/2^r),   i >  coded symbol

   Applying these formulas to the example CDF where M = 8 with adaptation rate 1/2^16 gives the updated CDF:

   +------+------+------+------+------+------+------+------+
   |   1  |   3  |   6  |   9  |  10  |  13  |  15  |  16  |
   +------+------+------+------+------+------+------+------+

   Looking at the graph of the CDF we see that the likelihood for symbol 3 has gone up from 1/16 to 3/16, dropping the likelihood of all other symbols to make room.

   [Figure: staircase plot of the updated CDF over bins fl[1] through fl[8], with a visibly larger step at symbol 3.]

2.4.  Simplified Partition Function

   Rather than changing the context modeling, the other approach is to change the function used to partition the set of valid code points so that it does not need a division, even when ft is not a power of two.  Let the range of valid code points in the current arithmetic coder state be [L, L + R), where L is the lower bound of the range and R is the number of valid code points.  Assume that ft <= R < 2*ft (this is easy to enforce with the normal rescaling operations used with frequency counts).
   Then one possible partition function is

      r[k] = fl[k] + min(fl[k], R - ft)

   so that the new range after coding symbol k is [L + r[k], L + r[k+1]).

   This is a variation of the partition function proposed by [SM98].  The size of the new partition (r[k+1] - r[k]) is no longer truly proportional to R*p[k].  This partition function counts values of fl[k] smaller than R - ft double compared to values larger than R - ft.  This over-estimates the probability of symbols at the start of the alphabet and underestimates the probability of symbols at the end of the alphabet.  The amount of the range allocated to a symbol can be off by up to a factor of 2 compared to its fair share, implying a peak error as large as one bit per symbol.  However, if the probabilities are accurate and the symbols being coded are independent, the average inefficiency introduced will be as low as log2(log2(e)*2/e) ~= 0.0861 bits per symbol.  This error can, of course, be reduced by coding fewer symbols with larger alphabets.  In practice the overhead is roughly equal to the overhead introduced by other approximate arithmetic coders like H.264's CABAC.  However, probabilities near one-half tend to have the most overhead.  In fact, probabilities in the range of 40% to 60% for a binary symbol may not be worth modeling, since the compression gains may be entirely countered by the added overhead, making it cheaper and faster to code such values as raw bits.  This problem is partially alleviated by using larger alphabets.

2.4.1.  Reduced-Overhead Partition Function

   A slightly more complicated partition function can reduce the overhead while still avoiding the division.  This is done by splitting things into two cases:

   Case 1: R - ft > ft - (R - ft)

      That is, we have more values that are double-counted than single-counted.  In this case, we still double-count the first 2*R - 3*ft values, but after that we alternate between single-counting and double-counting for the rest.

   Case 2: R - ft < ft - (R - ft)

      That is, we have more values that are single-counted than double-counted.  In this case, we alternate between single-counting and double-counting for the first 2*(R - ft) values, and single-count the rest.

   For two equiprobable symbols in different places in the alphabet, this reduces the maximum ratio of over-estimation to under-estimation from 2:1 for the previous partition function to either 4:3 or 3:2 (for each of the two cases above, respectively), assuming symbol probabilities significantly greater than the minimum possible.  That reduces the worst-case per-symbol overhead from 1 bit to 0.58 bits.

   The resulting reduced-overhead partition function is

      e    = max(2*R - 3*ft, 0)
      r[k] = fl[k] + min(fl[k], e) + min(max(fl[k] - e, 0) >> 1, R - ft)

   Here, e is a value that is greater than 0 in case 1, and 0 in case 2.  This function is about three times as expensive to evaluate as the high-overhead version, but still an order of magnitude cheaper than a division, since it is composed of very simple operations.

   In practice it reduces the overhead by about 0.3% of the total bitrate.  It also tends to produce R values with a more uniform distribution compared to the high-overhead version, which tends to have peaks in the distribution of R at specific values (see [SM98] for a discussion of this effect).
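   The following is a small C sketch of both partition functions (the names are ours; this is not code from any particular implementation).  It assumes, as above, cumulative frequencies with 0 <= fl[k] <= ft and a range size R satisfying ft <= R < 2*ft.

      #include <stdint.h>

      /* Simple partition function from Section 2.4:
         r[k] = fl[k] + min(fl[k], R - ft). */
      static uint32_t part_simple(uint32_t fl_k, uint32_t R, uint32_t ft) {
        uint32_t d = R - ft;
        return fl_k + (fl_k < d ? fl_k : d);
      }

      /* Reduced-overhead partition function from Section 2.4.1:
         e    = max(2*R - 3*ft, 0)
         r[k] = fl[k] + min(fl[k], e) + min(max(fl[k] - e, 0) >> 1, R - ft). */
      static uint32_t part_reduced(uint32_t fl_k, uint32_t R, uint32_t ft) {
        uint32_t d = R - ft;
        uint32_t e = 2*R > 3*ft ? 2*R - 3*ft : 0;
        uint32_t a = fl_k < e ? fl_k : e;            /* min(fl[k], e) */
        uint32_t b = (fl_k > e ? fl_k - e : 0) >> 1; /* max(fl[k]-e, 0) >> 1 */
        return fl_k + a + (b < d ? b : d);
      }

   An encoder evaluates the chosen function at fl[k] and fl[k+1] to obtain the bounds of the coded symbol's sub-range; a decoder searches for the partition containing the received code point.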
   Overall, it makes it more likely that the compression gains from probabilities near one-half are not eliminated by the approximation overhead, increasing the number of symbols that can be usefully modeled.  It is an open question whether or not these benefits are worth the increase in computational complexity.

2.5.  Context Adaptation

   The dyadic adaptation scheme described in Section 2.3 implements a low-complexity IIR filter for the steady-state case where we only want to adapt the context CDF as fast as the 1/2^r adaptation rate.  In many cases, for example when coding symbols at the start of a video frame, only a limited number of symbols have been seen per context.  Using this steady-state adaptation scheme risks adapting too slowly and spending too many bits to code symbols with incorrect probability estimates.  In other video codecs, this problem is reduced by either implicitly or explicitly allowing for mechanisms to set the initial probability models for a given context.

2.5.1.  Implicit Adaptation

   One implicit way to use default probabilities is to simply require, as a normative part of the decoder, that some specific CDFs are used to initialize each context.  A representative set of inputs is run through the encoder and a frequency-based probability model is computed and reloaded at the start of every frame.  This has the advantage of having zero bitstream overhead and is optimal for certain stationary symbols.  However, for other non-stationary symbols, or highly content-dependent contexts where the sample input is not representative, this can be worse than starting with a flat distribution as it now takes even longer to adapt to the steady state.  Moreover, the amount of hardware area required to store initial probability tables for each context goes up with the number of contexts in the codec.

   Another implicit way to deal with poor initial probabilities is through backward adaptation based on the probability estimates from the previous frame.  After decoding a frame, the adapted CDFs for each context are simply kept as-is and not reset to their defaults.  This has the advantage of having no bitstream overhead, and of tracking certain content types closely, since we expect frames with similar content at similar rates to have well-correlated CDFs.  However, this only works when we know there will be no bitstream errors due to the transport layer, e.g., TCP or HTTP.  In low-delay use cases (video on demand, live streaming, video conferencing), implicit backwards adaptation is avoided as it risks desynchronizing the entropy decoder state and permanently losing the video stream.

2.5.2.  Explicit Adaptation

   For codecs that include the ability to update the probability models in the bitstream, it is possible to explicitly signal a starting CDF.  The previously described implicit backwards adaptation is now possible by simply explicitly coding a probability update for each frame.  However, the cost of signaling the updated CDF must be overcome by the savings from coding with the updated CDF.  Blindly updating all contexts per frame may work at high rates, where the size of the CDFs is small relative to the coded symbol data.  However, at low rates, the benefit of using more accurate CDFs is quickly overcome by the cost of coding them, which increases with the number of contexts.
   More sophisticated encoders can compute the cost of coding a probability update for a given context, and compare it to the size reduction achieved by coding symbols with this context.  Here all symbols for a given frame (or tile) are buffered and not serialized by the entropy coder until the end of the frame (or tile) is reached.  Once the end of the entropy segment has been reached, the cost in bits for coding symbols with both the default probabilities and the proposed updated probabilities can be measured and compared.  However, note that with the symbols already buffered, rather than consider the context probabilities from the previous frame, a simple frequency-based probability model can be computed and measured.  Because this probability model is computed based on the symbols we are about to code, this technique is called forward adaptation.  If the cost in bits to signal and code with this new probability model is less than that of using the default, then it is used.  This has the advantage of only ever coding a probability update if it is an improvement and producing a bitstream that is robust to errors, but requires an entire entropy segment's worth of symbols be cached.

2.5.3.  Early Adaptation

   We would like to take advantage of the low-cost multi-symbol CDF adaptation described in Section 2.3 in the broadest set of use cases.  This means the initial probability adaptation scheme should support low-delay, error-resilient streams that can be efficiently implemented in both hardware and software.  We propose an early adaptation scheme that supports this goal.

   At the beginning of a frame (or tile), all CDFs are initialized to a flat distribution.  For a given multi-symbol context with M potential symbols, assume that the initial dyadic CDF is initialized so that each symbol has probability 1/M.  For the first M coded symbols, the CDF is updated as follows:

      a[c,M] = ft/(M + c)

              ( fl[i] - floor((fl[i] - i)*a/ft),           i <= coded symbol
      fl[i] = <
              ( fl[i] - floor((fl[i] + M - i - ft)*a/ft),  i >  coded symbol

   where c goes from 0 to M-1 and is the running count of the number of symbols coded with this CDF.  Note that for a fixed CDF precision (ft is always a power of two) and a maximum number of possible symbols M, the values of a[c,M] can be stored in an M*(M+1)/2 element table, which is 136 entries when M = 16.

2.6.  Simple Experiment

   As a simple experiment to validate the non-binary approach, we compared a non-binary arithmetic coder to the VP8 (binary) entropy coder.  This was done by instrumenting vp8_treed_read() in libvpx to dump out the symbol decoded and the associated probabilities used to decode it.  This data only includes macroblock mode and motion vector information, as the DCT token data is decoded with custom inline functions, and not vp8_treed_read().  This data is available at [1].  It includes 1,019,670 values encoded using 2,125,995 binary symbols (or 2.08 symbols per value).  We expect that with a conscious effort to group symbols during the codec design, this average could easily be increased.

   We then implemented both the regular VP8 entropy decoder (in plain C, using all of the optimizations available in libvpx at the time) and a multisymbol entropy decoder (also in plain C, using similar optimizations), which encodes each value with a single symbol.
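   As an illustration only (this is not the code from [2]), the following C sketch shows the kind of linear search such a decoder can use to find the coded symbol: it walks the cumulative frequencies, applying the simple partition function of Section 2.4, until it finds the partition containing the received code point.

      #include <stdint.h>

      /* Given cumulative frequencies fl[0..N] (fl[0] == 0, fl[N] == ft),
         the current range size R (ft <= R < 2*ft), and the offset u of the
         received code point within the range (0 <= u < R), return the
         decoded symbol k and its sub-range [*lo, *hi). */
      static int cdf_search(const uint32_t *fl, int N, uint32_t R,
                            uint32_t ft, uint32_t u,
                            uint32_t *lo, uint32_t *hi) {
        uint32_t d = R - ft;
        uint32_t r_lo = 0;
        uint32_t r_hi = R;
        int k;
        for (k = 0; k < N; k++) {
          /* Simple partition function from Section 2.4. */
          r_hi = fl[k + 1] + (fl[k + 1] < d ? fl[k + 1] : d);
          if (u < r_hi) break;
          r_lo = r_hi;
        }
        *lo = r_lo;
        *hi = r_hi;
        return k;
      }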
For 587 the decoder partition search in the non-binary decoder, we used a 588 simple for loop (O(N) worst-case), even though this could be made 589 constant-time and branchless with a few SIMD instructions such as (on 590 x86) PCMPGTW, PACKUSWB, and PMOVMASKB followed by BSR. The source 591 code for both implementations is available at [2] (compile with 592 -DEC_BINARY for the binary version and -DEC_MULTISYM for the non- 593 binary version). 595 The test simply loads the tokens, and then loops 1024 times encoding 596 them using the probabilities provided, and then decoding them. The 597 loop was added to reduce the impact of the overhead of loading the 598 data, which is implemented very inefficiently. The total runtime on 599 a Core i7 from 2010 is 53.735 seconds for the binary version, and 600 27.937 seconds for the non-binary version, or a 1.92x improvement. 601 This is very nearly equal to the number of symbols per value in the 602 binary coder, suggesting that the per-symbol overheads account for 603 the vast majority of the computation time in this implementation. 605 3. Reversible Integer Transforms 607 Integer transforms in image and video coding date back to at least 608 1969 [PKA69]. Although standards such as MPEG2 and MPEG4 Part 2 609 allow some flexibility in the transform implementation, 610 implementations were subject to drift and error accumulation, and 611 encoders had to impose special macroblock refresh requirements to 612 avoid these problems, not always successfully. As transforms in 613 modern codecs only account for on the order of 10% of the total 614 decoder complexity, and, with the use of weighted prediction with 615 gains greater than unity and intra prediction, are far more 616 susceptible to drift and error accumulation, it no longer makes sense 617 to allow a non-exact transform specification. 619 However, it is also possible to make such transforms "reversible", in 620 the sense that applying the inverse transform to the result of the 621 forward transform gives back the original input values, exactly. 622 This gives a lossy codec, which normally quantizes the coefficients 623 before feeding them into the inverse transform, the ability to scale 624 all the way to lossless compression without requiring any new coding 625 tools. This approach has been used successfully by JPEG XR, for 626 example [TSSRM08]. 628 Such reversible transforms can be constructed using "lifting steps", 629 a series of shear operations that can represent any set of plane 630 rotations, and thus any orthogonal transform. This approach dates 631 back to at least 1992 [BE92], which used it to implement a four-point 632 1-D Discrete Cosine Transform (DCT). Their implementation requires 633 6 multiplications, 10 additions, 2 shifts, and 2 negations, and 634 produces output that is a factor of sqrt(2) larger than the 635 orthonormal version of the transform. The expansion of the dynamic 636 range directly translates into more bits to code for lossless 637 compression. Because the least significant bits are usually very 638 nearly random noise, this scaling increases the coding cost by 639 approximately half a bit per sample. 641 3.1. 
Lifting Steps

   To demonstrate the idea of lifting steps, consider the two-point transform

      [ y0 ]             [  1   1 ] [ x0 ]
      [    ] = sqrt(1/2) [         ] [    ]
      [ y1 ]             [ -1   1 ] [ x1 ]

   This can be implemented up to scale via

      y0 = x0 + x1

      y1 = 2*x1 - y0

   and reversed via

      x1 = (y0 + y1) >> 1

      x0 = y0 - x1

   Both y0 and y1 are too large by a factor of sqrt(2), however.

   It is also possible to implement any rotation by an angle t, including the orthonormal scale factor, by decomposing it into three steps:

                cos(t) - 1
      u0 = x0 + ---------- * x1
                  sin(t)

      y1 = x1 + sin(t)*u0

                cos(t) - 1
      y0 = u0 + ---------- * y1
                  sin(t)

   By letting t=-pi/4, we get an implementation of the first transform that includes the scaling factor.  To get an integer approximation of this transform, we need only replace the transcendental constants by fixed-point approximations:

      u0 = x0 + ((27*x1 + 32) >> 6)

      y1 = x1 - ((45*u0 + 32) >> 6)

      y0 = u0 + ((27*y1 + 32) >> 6)

   This approximation is still perfectly reversible:

      u0 = y0 - ((27*y1 + 32) >> 6)

      x1 = y1 + ((45*u0 + 32) >> 6)

      x0 = u0 - ((27*x1 + 32) >> 6)

   Each of the three steps can be implemented using just two ARM instructions, with constants that have up to 14 bits of precision (though using fewer bits allows more efficient hardware implementations, at a small cost in coding gain).  However, it is still much more complex than the first approach.

   We can get a compromise with a slight modification:

      y0 = x0 + x1

      y1 = x1 - (y0 >> 1)

   This still only implements the original orthonormal transform up to scale.  The y0 coefficient is too large by a factor of sqrt(2) as before, but y1 is now too small by a factor of sqrt(2).  If our goal is simply to (optionally quantize) and code the result, this is good enough.  The different scale factors can be incorporated into the quantization matrix in the lossy case, and the total expansion is roughly equivalent to that of the orthonormal transform in the lossless case.  Plus, we can perform each step with just one ARM instruction.

   However, if instead we want to apply additional transformations to the data, or use the result to predict other data, it becomes much more convenient to have uniformly scaled outputs.  For a two-point transform, there is little we can do to improve on the three-multiplications approach above.  However, for a four-point transform, we can use the last approach and arrange multiple transform stages such that the "too large" and "too small" scaling factors cancel out, producing a result that has the true, uniform, orthonormal scaling.  To do this, we need one more tool, which implements the following transform:

      [ y0 ]             [ cos(t)  -sin(t) ] [ 1  0 ] [ x0 ]
      [    ] = sqrt(1/2) [                 ] [      ] [    ]
      [ y1 ]             [ sin(t)   cos(t) ] [ 0  2 ] [ x1 ]

   This takes unevenly scaled inputs, rescales them, and then rotates them.  Like an ordinary rotation, it can be reduced to three lifting steps:

                2*cos(t) - sqrt(2)
      u0 = x0 + ------------------ * x1
                      sin(t)

      y1 = x1 + sqrt(1/2)*sin(t)*u0

                cos(t) - sqrt(2)
      y0 = u0 + ---------------- * y1
                     sin(t)

   As before, the transcendental constants may be replaced by fixed-point approximations without harming the reversibility property.

3.2.
4-Point Transform

   Using the tools from the previous section, we can design a reversible integer four-point DCT approximation with uniform, orthonormal scaling.  This requires 3 multiplies, 9 additions, and 2 shifts (not counting the shift and rounding offset used in the fixed-point multiplies, as these are built into the multiplier).  This is significantly cheaper than the [BE92] approach, and the output scaling is smaller by a factor of sqrt(2), saving half a bit per sample in the lossless case.  By comparison, the four-point forward DCT approximation used in VP9, which is not reversible, uses 6 multiplies, 6 additions, and 2 shifts (counting shifts and rounding offsets which cannot be merged into a single multiply instruction on ARM).  Four of its multipliers also require 28-bit accumulators, whereas this proposal can use much smaller multipliers without giving up the reversibility property.  The total dynamic range expansion is 1 bit: inputs in the range [-256,255) produce transformed values in the range [-512,510).  This is the smallest dynamic range expansion possible for any reversible transform constructed from mostly-linear operations.  It is possible to make reversible orthogonal transforms with no dynamic range expansion by using "piecewise-linear" rotations [SLD04], but each step requires a large number of operations in a software implementation.

   Pseudo-code for the forward transform follows:

   Input:  x0, x1, x2, x3
   Output: y0, y1, y2, y3
   /* Rotate (x3, x0) by -pi/4, asymmetrically scaled output. */
   t3  = x0 - x3
   t0  = x0 - (t3 >> 1)
   /* Rotate (x1, x2) by pi/4, asymmetrically scaled output. */
   t2  = x1 + x2
   t2h = t2 >> 1
   t1  = t2h - x2
   /* Rotate (t2, t0) by -pi/4, asymmetrically scaled input. */
   y0  = t0 + t2h
   y2  = y0 - t2
   /* Rotate (t3, t1) by 3*pi/8, asymmetrically scaled input. */
   t3  = t3 - (45*t1 + 32 >> 6)
   y1  = t1 + (21*t3 + 16 >> 5)
   y3  = t3 - (71*y1 + 32 >> 6)

   Even though there are three asymmetrically scaled rotations by pi/4, by careful arrangement we can share one of the shift operations (to help software implementations: shifts by a constant are basically free in hardware).  This technique can be used to even greater effect in larger transforms.

   The inverse transform is constructed by simply undoing each step in turn:

   Input:  y0, y1, y2, y3
   Output: x0, x1, x2, x3
   /* Rotate (y3, y1) by -3*pi/8, asymmetrically scaled output. */
   t3  = y3 + (71*y1 + 32 >> 6)
   t1  = y1 - (21*t3 + 16 >> 5)
   t3  = t3 + (45*t1 + 32 >> 6)
   /* Rotate (y2, y0) by pi/4, asymmetrically scaled output. */
   t2  = y0 - y2
   t2h = t2 >> 1
   t0  = y0 - t2h
   /* Rotate (t1, t2) by -pi/4, asymmetrically scaled input. */
   x2  = t2h - t1
   x1  = t2 - x2
   /* Rotate (x3, x0) by pi/4, asymmetrically scaled input. */
   x0  = t0 - (t3 >> 1)
   x3  = x0 - t3

   Although the right shifts make this transform non-linear, we can compute "basis functions" for it by sending a vector through it with a single value set to a large constant (256 was used here), and the rest of the values set to zero.
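   As a small C sketch of this procedure (not code from the development repository), the forward transform above can be applied to scaled unit vectors; dividing the outputs by 256 recovers the approximate basis functions.  The code assumes arithmetic right shifts of negative values, as the pseudo-code above does.  Its output matches the basis listed below for the reversible, integer DCT.

      #include <stdio.h>

      /* Forward 4-point reversible DCT approximation, transcribed from
         the pseudo-code above. */
      static void fdct4(const int x[4], int y[4]) {
        int t0, t1, t2, t2h, t3;
        t3 = x[0] - x[3];
        t0 = x[0] - (t3 >> 1);
        t2 = x[1] + x[2];
        t2h = t2 >> 1;
        t1 = t2h - x[2];
        y[0] = t0 + t2h;
        y[2] = y[0] - t2;
        t3 = t3 - ((45*t1 + 32) >> 6);
        y[1] = t1 + ((21*t3 + 16) >> 5);
        y[3] = t3 - ((71*y[1] + 32) >> 6);
      }

      int main(void) {
        int i, j;
        /* Send 256 times each unit vector through the transform and scale
           the results back down to estimate the basis functions. */
        for (j = 0; j < 4; j++) {
          for (i = 0; i < 4; i++) {
            int x[4] = {0, 0, 0, 0};
            int y[4];
            x[i] = 256;
            fdct4(x, y);
            printf(" %8.5f", y[j]/256.0);
          }
          printf("\n");
        }
        return 0;
      }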
   The true basis functions for a four-point DCT (up to five digits) are

   [ y0 ]   [ 0.50000  0.50000  0.50000  0.50000 ] [ x0 ]
   [ y1 ] = [ 0.65328  0.27060 -0.27060 -0.65328 ] [ x1 ]
   [ y2 ]   [ 0.50000 -0.50000 -0.50000  0.50000 ] [ x2 ]
   [ y3 ]   [ 0.27060 -0.65328  0.65328 -0.27060 ] [ x3 ]

   The corresponding basis functions for our reversible, integer DCT, computed using the approximation described above, are

   [ y0 ]   [ 0.50000  0.50000  0.50000  0.50000 ] [ x0 ]
   [ y1 ] = [ 0.65625  0.26953 -0.26953 -0.65625 ] [ x1 ]
   [ y2 ]   [ 0.50000 -0.50000 -0.50000  0.50000 ] [ x2 ]
   [ y3 ]   [ 0.27344 -0.65234  0.65234 -0.27344 ] [ x3 ]

   The mean squared error (MSE) of the output, compared to a true DCT, can be computed with some assumptions about the input signal.  Let G be the true DCT basis and G' be the basis for our integer approximation (computed as described above).  Then the error in the transformed results is

      e = G.x - G'.x = (G - G').x = D.x

   where D = (G - G').  The MSE is then [Que98]

      (1/N) * E[e^T.e] = (1/N) * E[x^T.D^T.D.x]
                       = (1/N) * E[tr(D.x.x^T.D^T)]
                       = (1/N) * E[tr(D.Rxx.D^T)]

   where Rxx is the autocorrelation matrix of the input signal.  Assuming the input is a zero-mean, first-order autoregressive (AR(1)) process gives an autocorrelation matrix of

      Rxx[i,j] = rho^|i - j|

   for some correlation coefficient rho.  A value of rho = 0.95 is typical for image compression applications.  Smaller values are more normal for motion-compensated frame differences, but this makes surprisingly little difference in transform design.  Using the above procedure, the theoretical MSE of this approximation is 1.230E-6, which is below the level of the truncation error introduced by the right shift operations.  This suggests the dynamic range of the input would have to be more than 20 bits before it became worthwhile to increase the precision of the constants used in the multiplications to improve accuracy, though it may be worth using more precision to reduce bias.

3.3.  Larger Transforms

   The same techniques can be applied to construct a reversible eight-point DCT approximation with uniform, orthonormal scaling using 15 multiplies, 31 additions, and 5 shifts.  It is possible to reduce this to 11 multiplies and 29 additions, which is the minimum number of multiplies possible for an eight-point DCT with uniform scaling [LLM89], by introducing a scaling factor of sqrt(2), but this harms lossless performance.  The dynamic range expansion is 1.5 bits (again the smallest possible), and the MSE is 1.592E-06.  By comparison, the eight-point transform in VP9 uses 12 multiplications, 32 additions, and 6 shifts.

   Similarly, we have constructed a reversible sixteen-point DCT approximation with uniform, orthonormal scaling using 33 multiplies, 83 additions, and 16 shifts.  This is just 2 multiplies and 2 additions more than the (non-reversible, non-integer, but uniformly scaled) factorization in [LLM89].  By comparison, the sixteen-point transform in VP9 uses 44 multiplies, 88 additions, and 18 shifts.  The dynamic range expansion is only 2 bits (again the smallest possible), and the MSE is 1.495E-5.

   We also have a reversible 32-point DCT approximation with uniform, orthonormal scaling using 87 multiplies, 215 additions, and 38 shifts.
   By comparison, the 32-point transform in VP9 uses 116 multiplies, 194 additions, and 66 shifts.  Our dynamic range expansion is still the minimal 2.5 bits, and the MSE is 8.006E-05.

   Code for all of these transforms is available in the development repository listed in Section 4.

3.4.  Walsh-Hadamard Transforms

   These techniques can also be applied to constructing Walsh-Hadamard Transforms, another useful transform family that is cheaper to implement than the DCT (since it requires no multiplications at all).  The WHT has many applications as a cheap way to approximately change the time and frequency resolution of a set of data (either individual bands, as in the Opus audio codec, or whole blocks).  VP9 uses it as a reversible transform with uniform, orthonormal scaling for lossless coding in place of its DCT, which does not have these properties.

   Applying a 2x2 WHT to a block of 2x2 inputs involves running a 2-point WHT on the rows, and then another 2-point WHT on the columns.  The basis functions for the 2-point WHT are, up to scaling, [1, 1] and [1, -1].  The four variations of a two-step lifter given in Section 3.1 are exactly the lifting steps needed to implement a 2x2 WHT: two stages that produce asymmetrically scaled outputs followed by two stages that consume asymmetrically scaled inputs.

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   /* Transform rows */
   t1  = x00 - x01
   t0  = x00 - (t1 >> 1)    /* == (x00 + x01)/2 */
   t2  = x10 + x11
   t3  = (t2 >> 1) - x11    /* == (x10 - x11)/2 */
   /* Transform columns */
   y00 = t0 + (t2 >> 1)     /* == (x00 + x01 + x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y11 = (t1 >> 1) - t3     /* == (x00 - x01 - x10 + x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   By simply re-ordering the operations, we can see that there are two shifts that may be shared between the two stages:

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   t1  = x00 - x01
   t2  = x10 + x11
   t0  = x00 - (t1 >> 1)    /* == (x00 + x01)/2 */
   y00 = t0 + (t2 >> 1)     /* == (x00 + x01 + x10 + x11)/2 */
   t3  = (t2 >> 1) - x11    /* == (x10 - x11)/2 */
   y11 = (t1 >> 1) - t3     /* == (x00 - x01 - x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   By eliminating the double-negation of x11 and re-ordering the additions to it, we can see even more operations in common:

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   t1  = x00 - x01
   t2  = x10 + x11
   t0  = x00 - (t1 >> 1)    /* == (x00 + x01)/2 */
   y00 = t0 + (t2 >> 1)     /* == (x00 + x01 + x10 + x11)/2 */
   t3  = x11 + (t1 >> 1)    /* == x11 + (x00 - x01)/2 */
   y11 = t3 - (t2 >> 1)     /* == (x00 - x01 - x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   Simplifying further, the whole transform may be computed with just 7 additions and 1 shift:

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   t1  = x00 - x01
   t2  = x10 + x11
   t4  = (t2 - t1) >> 1     /* == (-x00 + x01 + x10 + x11)/2 */
   y00 = x00 + t4           /* == (x00 + x01 + x10 + x11)/2 */
   y11 = x11 - t4           /* == (x00 - x01 - x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   This is a significant savings over other approaches described in the literature, which require 8
additions, 2 shifts, and 989 1 negation [FOIK99] (37.5% more operations), or 10 additions, 990 1 shift, and 2 negations [TSSRM08] (62.5% more operations). The same 991 operations can be applied to compute a 4-point WHT in one dimension. 992 This implementation is used in this way in VP9's lossless mode. 993 Since larger WHTs may be trivially factored into multiple smaller 994 WHTs, the same approach can implement a reversible, orthonormally 995 scaled WHT of any size (2**N)x(2**M), so long as (N + M) is even. 997 4. Development Repository 999 The tools presented here were developed as part of Xiph.Org's Daala 1000 project. They are available, along with many others in greater and 1001 lesser states of maturity, in the Daala git repository at [3]. See 1002 [4] for more information. 1004 5. IANA Considerations 1006 This document has no actions for IANA. 1008 6. Acknowledgments 1010 Thanks to Nathan Egge, Gregory Maxwell, and Jean-Marc Valin for their 1011 assistance in the implementation and experimentation, and in 1012 preparing this draft. 1014 7. References 1016 7.1. Informative References 1018 [RFC6386] Bankoski, J., Koleszar, J., Quillio, L., Salonen, J., 1019 Wilkins, P., and Y. Xu, "VP8 Data Format and Decoding 1020 Guide", RFC 6386, DOI 10.17487/RFC6386, November 2011, 1021 . 1023 [RFC6716] Valin, JM., Vos, K., and T. Terriberry, "Definition of the 1024 Opus Audio Codec", RFC 6716, DOI 10.17487/RFC6716, 1025 September 2012, . 1027 [BE92] Bruekers, F. and A. van den Enden, "New Networks for 1028 Perfect Inversion and Perfect Reconstruction", IEEE 1029 Journal on Selected Areas in Communication 10(1):129--137, 1030 January 1992. 1032 [FOIK99] Fukuma, S., Oyama, K., Iwahashi, M., and N. Kambayashi, 1033 "Lossless 8-point Fast Discrete Cosine Transform Using 1034 Lossless Hadamard Transform", Technical Report The 1035 Institute of Electronics, Information, and Communication 1036 Engineers of Japan, October 1999. 1038 [LLM89] Loeffler, C., Ligtenberg, A., and G. Moschytz, "Practical 1039 Fast 1-D DCT Algorithms with 11 Multiplications", Proc. 1040 Acoustics, Speech, and Signal Processing (ICASSP'89) vol. 1041 2, pp. 988--991, May 1989. 1043 [Pas76] Pasco, R., "Source Coding Algorithms for Fast Data 1044 Compression", Ph.D. Thesis Dept. of Electrical 1045 Engineering, Stanford University, May 1976. 1047 [PKA69] Pratt, W., Kane, J., and H. Andrews, "Hadamard Transform 1048 Image Coding", Proc. IEEE 57(1):58--68, Jan 1969. 1050 [Que98] de Queiroz, R., "On Unitary Transform Approximations", 1051 IEEE Signal Processing Letters 5(2):46--47, Feb 1998. 1053 [SLD04] Senecal, J., Lindstrom, P., and M. Duchaineau, "An 1054 Improved N-Bit to N-Bit Reversible Haar-Like Transform", 1055 Proc. of the 12th Pacific Conference on Computer Graphics 1056 and Applications (PG'04) pp. 371--380, October 2004. 1058 [SM98] Stuiver, L. and A. Moffat, "Piecewise Integer Mapping for 1059 Arithmetic Coding", Proc. of the 17th IEEE Data 1060 Compression Conference (DCC'98) pp. 1--10, March/April 1061 1998. 1063 [TSSRM08] Tu, C., Srinivasan, S., Sullivan, G., Regunathan, S., and 1064 H. Malvar, "Low-complexity Hierarchical Lapped Transform 1065 for Lossy-to-Lossless Image Coding in JPEG XR/HD Photo", 1066 Applications of Digital Image Processing XXXI vol 7073, 1067 August 2008. 1069 7.2. 
URIs 1071 [1] https://people.xiph.org/~tterribe/daala/ec_test0/ec_tokens.txt 1073 [2] https://people.xiph.org/~tterribe/daala/ec_test0/ec_test.c 1075 [3] https://git.xiph.org/daala.git 1077 [4] https://xiph.org/daala/ 1079 Authors' Addresses 1081 Timothy B. Terriberry 1082 Mozilla Corporation 1083 331 E. Evelyn Avenue 1084 Mountain View, CA 94041 1085 USA 1087 Phone: +1 650 903-0800 1088 Email: tterribe@xiph.org 1090 Nathan E. Egge 1091 Mozilla Corporation 1092 331 E. Evelyn Avenue 1093 Mountain View, CA 94041 1094 USA 1096 Phone: +1 650 903-0800 1097 Email: negge@xiph.org