netvc                                                      T. Terriberry
Internet-Draft                                                   N. Egge
Intended status: Informational                       Mozilla Corporation
Expires: May 4, 2017                                    October 31, 2016

             Coding Tools for a Next Generation Video Codec
                  draft-terriberry-netvc-codingtools-01

Abstract

   This document proposes a number of coding tools that could be incorporated into a next-generation video codec.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).  Note that other groups may also distribute working documents as Internet-Drafts.  The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document.
Please review these documents 41 carefully, as they describe your rights and restrictions with respect 42 to this document. Code Components extracted from this document must 43 include Simplified BSD License text as described in Section 4.e of 44 the Trust Legal Provisions and are provided without warranty as 45 described in the Simplified BSD License. 47 Table of Contents 49 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 50 2. Entropy Coding . . . . . . . . . . . . . . . . . . . . . . . 3 51 2.1. Non-binary Arithmetic Coding . . . . . . . . . . . . . . 4 52 2.2. Non-binary Context Modeling . . . . . . . . . . . . . . . 5 53 2.3. Dyadic Adaptation . . . . . . . . . . . . . . . . . . . . 6 54 2.4. Simplified Partition Function . . . . . . . . . . . . . . 9 55 2.4.1. Reduced-Overhead Partition Function . . . . . . . . . 10 56 2.5. Context Adaptation . . . . . . . . . . . . . . . . . . . 11 57 2.5.1. Implicit Adaptation . . . . . . . . . . . . . . . . . 11 58 2.5.2. Explicit Adaptation . . . . . . . . . . . . . . . . . 12 59 2.5.3. Early Adaptation . . . . . . . . . . . . . . . . . . 12 60 2.6. Simple Experiment . . . . . . . . . . . . . . . . . . . . 13 61 3. Reversible Integer Transforms . . . . . . . . . . . . . . . . 14 62 3.1. Lifting Steps . . . . . . . . . . . . . . . . . . . . . . 14 63 3.2. 4-Point Transform . . . . . . . . . . . . . . . . . . . . 17 64 3.3. Larger Transforms . . . . . . . . . . . . . . . . . . . . 20 65 3.4. Walsh-Hadamard Transforms . . . . . . . . . . . . . . . . 20 66 4. Development Repository . . . . . . . . . . . . . . . . . . . 22 67 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 68 6. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 22 69 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 70 7.1. Informative References . . . . . . . . . . . . . . . . . 22 71 7.2. URIs . . . . . . . . . . . . . . . . . . . . . . . . . . 24 72 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 74 1. Introduction 76 One of the biggest contributing factors to the success of the 77 Internet is that the underlying protocols are implementable on a 78 royalty-free basis. This allows them to be implemented widely and 79 easily distributed by application developers, service operators, and 80 end users, without asking for permission. In order to produce a 81 next-generation video codec that is competitive with the best patent- 82 encumbered standards, yet avoids patents which are not available on 83 an open-source compatible, royalty-free basis, we must use old coding 84 tools in new ways and develop new coding tools. This draft documents 85 some of the tools we have been working on for inclusion in such a 86 codec. This is early work, and the performance of some of these 87 tools (especially in relation to other approaches) is not yet fully 88 known. Nevertheless, it still serves to outline some possibilities 89 an eventual working group, if formed, could consider. 91 2. Entropy Coding 93 The basic theory of entropy coding was well-established by the late 94 1970's [Pas76]. Modern video codecs have focused on Huffman codes 95 (or "Variable-Length Codes"/VLCs) and binary arithmetic coding. 96 Huffman codes are limited in the amount of compression they can 97 provide and the design flexibility they allow, but as each code word 98 consists of an integer number of bits, their implementation 99 complexity is very low, so they were provided at least as an option 100 in every video codec up through H.264. 
Arithmetic coding, on the 101 other hand, uses code words that can take up fractional parts of a 102 bit, and are more complex to implement. However, the prevalence of 103 cheap, H.264 High Profile hardware, which requires support for 104 arithmetic coding, shows that it is no longer so expensive that a 105 fallback VLC-based approach is required. Having a single entropy- 106 coding method simplifies both up-front design costs and 107 interoperability. 109 However, the primary limitation of arithmetic coding is that it is an 110 inherently serial operation. A given symbol cannot be decoded until 111 the previous symbol is decoded, because the bits (if any) that are 112 output depend on the exact state of the decoder at the time it is 113 decoded. This means that a hardware implementation must run at a 114 sufficiently high clock rate to be able to decode all of the symbols 115 in a frame. Higher clock rates lead to increased power consumption, 116 and in some cases the entropy coding is actually becoming the 117 limiting factor in these designs. 119 As fabrication processes improve, implementers are very willing to 120 trade increased gate count for lower clock speeds. So far, most 121 approaches to allowing parallel entropy coding have focused on 122 splitting the encoded symbols into multiple streams that can be 123 decoded independently. This "independence" requirement has a non- 124 negligible impact on compression, parallelizability, or both. For 125 example, H.264 can split frames into "slices" which might cover only 126 a small subset of the blocks in the frame. In order to allow 127 decoding these slices independently, they cannot use context 128 information from blocks in other slices (harming compression). Those 129 contexts must adapt rapidly to account for the generally small number 130 of symbols available for learning probabilities (also harming 131 compression). In some cases the number of contexts must be reduced 132 to ensure enough symbols are coded in each context to usefully learn 133 probabilities at all (once more, harming compression). Furthermore, 134 an encoder must specially format the stream to use multiple slices 135 per frame to allow any parallel entropy decoding at all. Encoders 136 rarely have enough information to evaluate this "compression 137 efficiency" vs. "parallelizability" trade-off, since they don't 138 generally know the limitations of the decoders for which they are 139 encoding. That means there will be many files or streams which could 140 have been decoded if they were encoded with different options, but 141 which a given decoder cannot decode because of bad choices made by 142 the encoder (at least from the perspective of that decoder). The 143 same set of drawbacks apply to the DCT token partitions in 144 VP8 [RFC6386]. 146 2.1. Non-binary Arithmetic Coding 148 Instead, we propose a very different approach: use non-binary 149 arithmetic coding. In binary arithmetic coding, each decoded symbol 150 has one of two possible values: 0 or 1. The original arithmetic 151 coding algorithms allow a symbol to take on any number of possible 152 values, and allow the size of that alphabet to change with each 153 symbol coded. Reasonable values of N (for example, N <= 16) offer 154 the potential for a decent throughput increase for a reasonable 155 increase in gate count for hardware implementations. 157 Binary coding allows a number of computational simplifications. 
   For example, for each coded symbol, the set of valid code points is partitioned in two, and the decoded value is determined by finding the partition in which the actual code point that was received lies.  This can be determined by computing a single partition value (in both the encoder and decoder) and (in the decoder) doing a single comparison.  A non-binary arithmetic coder partitions the set of valid code points into multiple pieces (one for each possible value of the coded symbol).  This requires the encoder to compute two partition values, in general (for both the upper and lower bound of the symbol to encode).  The decoder, on the other hand, must search the partitions for the one that contains the received code point.  This requires computing at least O(log N) partition values.

   However, coding a parameter with N possible values with a binary arithmetic coder requires O(log N) symbols in the worst case (the only case that matters for hardware design).  Hence, this does not represent any actual savings (indeed, it represents an increase in the number of partition values computed by the encoder).  In addition, there are a number of overheads that are per-symbol, rather than per-value.  For example, renormalization (which enlarges the set of valid code points after partitioning has reduced it too much), carry propagation (to deal with the case where the high and low ends of a partition straddle a bit boundary), etc., are all performed on a symbol-by-symbol basis.  Since a non-binary arithmetic coder codes a given set of values with fewer symbols than a binary one, it incurs these per-symbol overheads less often.  This suggests that a non-binary arithmetic coder can actually be more efficient than a binary one.

2.2.  Non-binary Context Modeling

   The other aspect that binary coding simplifies is probability modeling.  In arithmetic coding, the sizes of the sets the code points are partitioned into are (roughly) proportional to the probability of each possible symbol value.  Estimating these probabilities is part of the coding process, though it can be cleanly separated from the task of actually producing the coded bits.  In a binary arithmetic coder, this requires estimating the probability of only one of the two possible values (since the total probability is 1.0).  This is often done with a simple table lookup that maps the old probability and the most recently decoded symbol to a new probability to use for the next symbol in the current context.  The trade-off, of course, is that non-binary symbols must be "binarized" into a series of bits, and a context (with an associated probability) chosen for each one.

   In a non-binary arithmetic coder, the decoder must compute at least O(log N) cumulative probabilities (one for each partition value it needs).  Because these probabilities are usually not estimated directly in "cumulative" form, this can require computing (N - 1) non-cumulative probability values.  Unless N is very small, these cannot be updated with a single table lookup.  The normal approach is to use "frequency counts".  Define the frequency of value k to be

      f[k] = A*(the number of times k has been observed) + B

   where A and B are parameters (usually A=2 and B=1 for a traditional Krichevsky-Trofimov estimator).
   The resulting probability, p[k], is given by

           N-1
           __
      ft = \   f[k]
           /_
           k=0

             f[k]
      p[k] = ----
              ft

   When ft grows too large, the frequencies are rescaled (e.g., halved, rounding up to prevent reduction of a probability to 0).

   When ft is not a power of two, partitioning the code points requires actual divisions (see [RFC6716] Section 4.1 for one detailed example of exactly how this is done).  These divisions are acceptable in an audio codec like Opus [RFC6716], which only has to code a few hundreds of these symbols per second.  But video requires hundreds of thousands of symbols per second, at a minimum, and divisions are still very expensive to implement in hardware.

   There are two possible approaches to this.  One is to come up with a replacement for frequency counts that produces probabilities that sum to a power of two.  Some possibilities, which can be applied individually or in combination:

   1.  Use probabilities that are fixed for the duration of a frame.  This is the approach taken by VP8, for example, even though it uses a binary arithmetic coder.  In fact, it is possible to convert many of VP8's existing binary-alphabet probabilities into probabilities for non-binary alphabets, an approach that is used in the experiment presented at the end of this section.

   2.  Use parametric distributions.  For example, DCT coefficient magnitudes usually have an approximately exponential distribution.  This distribution can be characterized by a single parameter, e.g., the expected value.  The expected value is trivial to update after decoding a coefficient.  For example

          E[x[n+1]] = E[x[n]] + floor(C*(x[n] - E[x[n]]))

       produces an exponential moving average with a decay factor of (1 - C).  For a choice of C that is a negative power of two (e.g., 1/16 or 1/32 or similar), this can be implemented with two adds and a shift.  Given this expected value, the actual distribution to use can be obtained from a small set of pre-computed distributions via a lookup table.  Linear interpolation between these pre-computed values can improve accuracy, at the cost of O(N) computations, but if N is kept small this is trivially parallelizable, in SIMD or otherwise.

   3.  Change the frequency count update mechanism so that ft is constant.  This approach is described in the next section.

2.3.  Dyadic Adaptation

   The goal with context adaptation using dyadic probabilities is to maintain the invariant that the probabilities all sum to a power of two before and after adaptation.  This can be achieved with a special update function that blends the cumulative probabilities of the current context with a cumulative distribution function where the coded symbol has probability 1.
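   As an illustration, the following is a minimal C sketch of such an update (the function and variable names are ours, not from any existing implementation).  It transcribes the two-case update rule given in the worked example below, reading "i <= coded symbol" as comparing the 1-based CDF index i against the 0-based value of the coded symbol, which is the reading that matches the example.  Run on the example CDF below with a rate of 1/2^16, it reproduces the updated CDF shown in the example.

      #include <stdio.h>

      /* Division that rounds toward negative infinity, since the update
         rule is specified in terms of floor(). */
      static int floordiv(int a, int b) {
        return a/b - (a%b != 0 && (a^b) < 0);
      }

      /* Update the cumulative frequencies fl[1..M] (with fl[M] == ft, ft a
         power of two) after coding the 0-based symbol k, using adaptation
         rate 1/2^r. */
      static void dyadic_cdf_update(int *fl, int M, int ft, int k, int r) {
        int i;
        for (i = 1; i <= M; i++) {
          if (i <= k) fl[i] -= floordiv(fl[i] + (1 << r) - i - 1, 1 << r);
          else fl[i] -= floordiv(fl[i] + M - i - ft, 1 << r);
        }
      }

      int main(void) {
        /* The example context below: ft = 16, M = 8, coded symbol 3. */
        int fl[9] = {0, 2, 4, 7, 8, 9, 12, 14, 16};
        int i;
        dyadic_cdf_update(fl, 8, 16, 3, 16);
        for (i = 1; i <= 8; i++) printf("%d ", fl[i]);  /* 1 3 6 9 10 13 15 16 */
        printf("\n");
        return 0;
      }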
   Suppose we have a model for a given context that codes 8 symbols with the following probabilities:

   +------+------+------+------+------+------+------+------+
   | p[0] | p[1] | p[2] | p[3] | p[4] | p[5] | p[6] | p[7] |
   +------+------+------+------+------+------+------+------+
   | 1/8  | 1/8  | 3/16 | 1/16 | 1/16 | 3/16 | 1/8  | 1/8  |
   +------+------+------+------+------+------+------+------+

   Then the cumulative distribution function is:

   [Figure: staircase plot of the CDF, rising from 0 to 1 in steps of p[0] through p[7] at bins fl[1] through fl[8].]

   Suppose we code symbol 3 and wish to update the context model so that this symbol is now more likely.  This can be done by blending the CDF for the current context with a CDF that has symbol 3 with likelihood 1.

   [Figure: plot of the target CDF, which is 0 for bins fl[1] through fl[3] and 1 for bins fl[4] through fl[8].]

   Given an adaptation rate g between 0 and 1, and assuming ft = 2^4 = 16, what we are computing is:

   +------+------+------+------+------+------+------+------+
   |   2  |   4  |   7  |   8  |   9  |  12  |  14  |  16  |  * (1 - g)
   +------+------+------+------+------+------+------+------+

                                 +

   +------+------+------+------+------+------+------+------+
   |   0  |   0  |   0  |  16  |  16  |  16  |  16  |  16  |  * g
   +------+------+------+------+------+------+------+------+

   In order to prevent the probability of any one symbol from going to zero, the blending functions above and below the coded symbol are adjusted so that no adjacent cumulative probabilities are the same.

   Let M be the alphabet size and 1/2^r be the adaptation rate:

           ( fl[i] - floor((fl[i] + 2^r - i - 1)/2^r),  i <= coded symbol
   fl[i] = <
           ( fl[i] - floor((fl[i] + M - i - ft)/2^r),   i >  coded symbol

   Applying these formulas to the example CDF where M = 8 with adaptation rate 1/2^16 gives the updated CDF:

   +------+------+------+------+------+------+------+------+
   |   1  |   3  |   6  |   9  |  10  |  13  |  15  |  16  |
   +------+------+------+------+------+------+------+------+

   Looking at the graph of the CDF we see that the likelihood for symbol 3 has gone up from 1/16 to 3/16, dropping the likelihood of all other symbols to make room.

   [Figure: staircase plot of the updated CDF over bins fl[1] through fl[8], with a visibly larger step at symbol 3.]

2.4.  Simplified Partition Function

   Rather than changing the context modeling, the other approach is to change the function used to partition the set of valid code points so that it does not need a division, even when ft is not a power of two.  Let the range of valid code points in the current arithmetic coder state be [L, L + R), where L is the lower bound of the range and R is the number of valid code points.  Assume that ft <= R < 2*ft (this is easy to enforce with the normal rescaling operations used with frequency counts).
   Then one possible partition function is

      r[k] = fl[k] + min(fl[k], R - ft)

   so that the new range after coding symbol k is [L + r[k], L + r[k+1]).

   This is a variation of the partition function proposed by [SM98].  The size of the new partition (r[k+1] - r[k]) is no longer truly proportional to R*p[k].  This partition function counts values of fl[k] smaller than R - ft double compared to values larger than R - ft.  This over-estimates the probability of symbols at the start of the alphabet and underestimates the probability of symbols at the end of the alphabet.  The amount of the range allocated to a symbol can be off by up to a factor of 2 compared to its fair share, implying a peak error as large as one bit per symbol.  However, if the probabilities are accurate and the symbols being coded are independent, the average inefficiency introduced will be as low as log2(log2(e)*2/e) ~= 0.0861 bits per symbol.  This error can, of course, be reduced by coding fewer symbols with larger alphabets.  In practice the overhead is roughly equal to the overhead introduced by other approximate arithmetic coders like H.264's CABAC.  However, probabilities near one-half tend to have the most overhead.  In fact, probabilities in the range of 40% to 60% for a binary symbol may not be worth modeling, since the compression gains may be entirely countered by the added overhead, making it cheaper and faster to code such values as raw bits.  This problem is partially alleviated by using larger alphabets.

2.4.1.  Reduced-Overhead Partition Function

   A slightly more complicated partition function can reduce the overhead while still avoiding the division.  This is done by splitting things into two cases:

   Case 1: R - ft > ft - (R - ft)

      That is, we have more values that are double-counted than single-counted.  In this case, we still double-count the first 2*R - 3*ft values, but after that we alternate between single-counting and double-counting for the rest.

   Case 2: R - ft < ft - (R - ft)

      That is, we have more values that are single-counted than double-counted.  In this case, we alternate between single-counting and double-counting for the first 2*(R - ft) values, and single-count the rest.

   For two equiprobable symbols in different places in the alphabet, this reduces the maximum ratio of over-estimation to under-estimation from 2:1 for the previous partition function to either 4:3 or 3:2 (for each of the two cases above, respectively), assuming symbol probabilities significantly greater than the minimum possible.  That reduces the worst-case per-symbol overhead from 1 bit to 0.58 bits.

   The resulting reduced-overhead partition function is

      e    = max(2*R - 3*ft, 0)
      r[k] = fl[k] + min(fl[k], e) + min(max(fl[k] - e, 0) >> 1, R - ft)

   Here, e is a value that is greater than 0 in case 1, and 0 in case 2.  This function is about three times as expensive to evaluate as the high-overhead version, but still an order of magnitude cheaper than a division, since it is composed of very simple operations.

   In practice it reduces the overhead by about 0.3% of the total bitrate.  It also tends to produce R values with a more uniform distribution compared to the high-overhead version, which tends to have peaks in the distribution of R at specific values (see [SM98] for a discussion of this effect).
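   The following is a small C sketch of both partition functions (the names are ours; this is not code from any particular implementation).  It assumes, as above, cumulative frequencies with 0 <= fl[k] <= ft and a range size R satisfying ft <= R < 2*ft.

      #include <stdint.h>

      /* Simple partition function from Section 2.4:
         r[k] = fl[k] + min(fl[k], R - ft). */
      static uint32_t part_simple(uint32_t fl_k, uint32_t R, uint32_t ft) {
        uint32_t d = R - ft;
        return fl_k + (fl_k < d ? fl_k : d);
      }

      /* Reduced-overhead partition function from Section 2.4.1:
         e    = max(2*R - 3*ft, 0)
         r[k] = fl[k] + min(fl[k], e) + min(max(fl[k] - e, 0) >> 1, R - ft). */
      static uint32_t part_reduced(uint32_t fl_k, uint32_t R, uint32_t ft) {
        uint32_t d = R - ft;
        uint32_t e = 2*R > 3*ft ? 2*R - 3*ft : 0;
        uint32_t a = fl_k < e ? fl_k : e;            /* min(fl[k], e) */
        uint32_t b = (fl_k > e ? fl_k - e : 0) >> 1; /* max(fl[k]-e, 0) >> 1 */
        return fl_k + a + (b < d ? b : d);
      }

   An encoder evaluates the chosen function at fl[k] and fl[k+1] to obtain the bounds of the coded symbol's sub-range; a decoder searches for the partition containing the received code point.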
   Overall, it makes it more likely that the compression gains from probabilities near one-half are not eliminated by the approximation overhead, increasing the number of symbols that can be usefully modeled.  It is an open question whether or not these benefits are worth the increase in computational complexity.

2.5.  Context Adaptation

   The dyadic adaptation scheme described in Section 2.3 implements a low-complexity IIR filter for the steady-state case where we only want to adapt the context CDF as fast as the 1/2^r adaptation rate.  In many cases, for example when coding symbols at the start of a video frame, only a limited number of symbols have been seen per context.  Using this steady-state adaptation scheme risks adapting too slowly and spending too many bits to code symbols with incorrect probability estimates.  In other video codecs, this problem is reduced by either implicitly or explicitly allowing for mechanisms to set the initial probability models for a given context.

2.5.1.  Implicit Adaptation

   One implicit way to use default probabilities is to simply require, as a normative part of the decoder, that some specific CDFs are used to initialize each context.  A representative set of inputs is run through the encoder and a frequency-based probability model is computed and reloaded at the start of every frame.  This has the advantage of having zero bitstream overhead and is optimal for certain stationary symbols.  However, for other non-stationary symbols, or highly content-dependent contexts where the sample input is not representative, this can be worse than starting with a flat distribution as it now takes even longer to adapt to the steady state.  Moreover, the amount of hardware area required to store initial probability tables for each context goes up with the number of contexts in the codec.

   Another implicit way to deal with poor initial probabilities is through backward adaptation based on the probability estimates from the previous frame.  After decoding a frame, the adapted CDFs for each context are simply kept as-is and not reset to their defaults.  This has the advantage of having no bitstream overhead, and of tracking certain content types closely, since we expect frames with similar content at similar rates to have well-correlated CDFs.  However, this only works when we know there will be no bitstream errors due to the transport layer, e.g., TCP or HTTP.  In low-delay use cases (video on demand, live streaming, video conferencing), implicit backwards adaptation is avoided as it risks desynchronizing the entropy decoder state and permanently losing the video stream.

2.5.2.  Explicit Adaptation

   For codecs that include the ability to update the probability models in the bitstream, it is possible to explicitly signal a starting CDF.  The previously described implicit backwards adaptation is now possible by simply explicitly coding a probability update for each frame.  However, the cost of signaling the updated CDF must be overcome by the savings from coding with the updated CDF.  Blindly updating all contexts per frame may work at high rates, where the size of the CDFs is small relative to the coded symbol data.  However, at low rates, the benefit of using more accurate CDFs is quickly overcome by the cost of coding them, which increases with the number of contexts.
   More sophisticated encoders can compute the cost of coding a probability update for a given context, and compare it to the size reduction achieved by coding symbols with this context.  Here all symbols for a given frame (or tile) are buffered and not serialized by the entropy coder until the end of the frame (or tile) is reached.  Once the end of the entropy segment has been reached, the cost in bits for coding symbols with both the default probabilities and the proposed updated probabilities can be measured and compared.  However, note that with the symbols already buffered, rather than consider the context probabilities from the previous frame, a simple frequency-based probability model can be computed and measured.  Because this probability model is computed based on the symbols we are about to code, this technique is called forward adaptation.  If the cost in bits to signal and code with this new probability model is less than that of using the default, then it is used.  This has the advantage of only ever coding a probability update if it is an improvement and producing a bitstream that is robust to errors, but requires an entire entropy segment's worth of symbols be cached.

2.5.3.  Early Adaptation

   We would like to take advantage of the low-cost multi-symbol CDF adaptation described in Section 2.3 in the broadest set of use cases.  This means the initial probability adaptation scheme should support low-delay, error-resilient streams that can be efficiently implemented in both hardware and software.  We propose an early adaptation scheme that supports this goal.

   At the beginning of a frame (or tile), all CDFs are initialized to a flat distribution.  For a given multi-symbol context with M potential symbols, assume that the initial dyadic CDF is initialized so that each symbol has probability 1/M.  For the first M coded symbols, the CDF is updated as follows:

      a[c,M] = ft/(M + c)

              ( fl[i] - floor((fl[i] - i)*a/ft),           i <= coded symbol
      fl[i] = <
              ( fl[i] - floor((fl[i] + M - i - ft)*a/ft),  i >  coded symbol

   where c goes from 0 to M-1 and is the running count of the number of symbols coded with this CDF.  Note that for a fixed CDF precision (ft is always a power of two) and a maximum number of possible symbols M, the values of a[c,M] can be stored in an M*(M+1)/2 element table, which is 136 entries when M = 16.

2.6.  Simple Experiment

   As a simple experiment to validate the non-binary approach, we compared a non-binary arithmetic coder to the VP8 (binary) entropy coder.  This was done by instrumenting vp8_treed_read() in libvpx to dump out the symbol decoded and the associated probabilities used to decode it.  This data only includes macroblock mode and motion vector information, as the DCT token data is decoded with custom inline functions, and not vp8_treed_read().  This data is available at [1].  It includes 1,019,670 values encoded using 2,125,995 binary symbols (or 2.08 symbols per value).  We expect that with a conscious effort to group symbols during the codec design, this average could easily be increased.

   We then implemented both the regular VP8 entropy decoder (in plain C, using all of the optimizations available in libvpx at the time) and a multisymbol entropy decoder (also in plain C, using similar optimizations), which encodes each value with a single symbol.
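   As an illustration only (this is not the code from [2]), the following C sketch shows the kind of linear search such a decoder can use to find the coded symbol: it walks the cumulative frequencies, applying the simple partition function of Section 2.4, until it finds the partition containing the received code point.

      #include <stdint.h>

      /* Given cumulative frequencies fl[0..N] (fl[0] == 0, fl[N] == ft),
         the current range size R (ft <= R < 2*ft), and the offset u of the
         received code point within the range (0 <= u < R), return the
         decoded symbol k and its sub-range [*lo, *hi). */
      static int cdf_search(const uint32_t *fl, int N, uint32_t R,
                            uint32_t ft, uint32_t u,
                            uint32_t *lo, uint32_t *hi) {
        uint32_t d = R - ft;
        uint32_t r_lo = 0;
        uint32_t r_hi = R;
        int k;
        for (k = 0; k < N; k++) {
          /* Simple partition function from Section 2.4. */
          r_hi = fl[k + 1] + (fl[k + 1] < d ? fl[k + 1] : d);
          if (u < r_hi) break;
          r_lo = r_hi;
        }
        *lo = r_lo;
        *hi = r_hi;
        return k;
      }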
For 587 the decoder partition search in the non-binary decoder, we used a 588 simple for loop (O(N) worst-case), even though this could be made 589 constant-time and branchless with a few SIMD instructions such as (on 590 x86) PCMPGTW, PACKUSWB, and PMOVMASKB followed by BSR. The source 591 code for both implementations is available at [2] (compile with 592 -DEC_BINARY for the binary version and -DEC_MULTISYM for the non- 593 binary version). 595 The test simply loads the tokens, and then loops 1024 times encoding 596 them using the probabilities provided, and then decoding them. The 597 loop was added to reduce the impact of the overhead of loading the 598 data, which is implemented very inefficiently. The total runtime on 599 a Core i7 from 2010 is 53.735 seconds for the binary version, and 600 27.937 seconds for the non-binary version, or a 1.92x improvement. 601 This is very nearly equal to the number of symbols per value in the 602 binary coder, suggesting that the per-symbol overheads account for 603 the vast majority of the computation time in this implementation. 605 3. Reversible Integer Transforms 607 Integer transforms in image and video coding date back to at least 608 1969 [PKA69]. Although standards such as MPEG2 and MPEG4 Part 2 609 allow some flexibility in the transform implementation, 610 implementations were subject to drift and error accumulation, and 611 encoders had to impose special macroblock refresh requirements to 612 avoid these problems, not always successfully. As transforms in 613 modern codecs only account for on the order of 10% of the total 614 decoder complexity, and, with the use of weighted prediction with 615 gains greater than unity and intra prediction, are far more 616 susceptible to drift and error accumulation, it no longer makes sense 617 to allow a non-exact transform specification. 619 However, it is also possible to make such transforms "reversible", in 620 the sense that applying the inverse transform to the result of the 621 forward transform gives back the original input values, exactly. 622 This gives a lossy codec, which normally quantizes the coefficients 623 before feeding them into the inverse transform, the ability to scale 624 all the way to lossless compression without requiring any new coding 625 tools. This approach has been used successfully by JPEG XR, for 626 example [TSSRM08]. 628 Such reversible transforms can be constructed using "lifting steps", 629 a series of shear operations that can represent any set of plane 630 rotations, and thus any orthogonal transform. This approach dates 631 back to at least 1992 [BE92], which used it to implement a four-point 632 1-D Discrete Cosine Transform (DCT). Their implementation requires 633 6 multiplications, 10 additions, 2 shifts, and 2 negations, and 634 produces output that is a factor of sqrt(2) larger than the 635 orthonormal version of the transform. The expansion of the dynamic 636 range directly translates into more bits to code for lossless 637 compression. Because the least significant bits are usually very 638 nearly random noise, this scaling increases the coding cost by 639 approximately half a bit per sample. 641 3.1. 
Lifting Steps

   To demonstrate the idea of lifting steps, consider the two-point transform

      [ y0 ]             [  1   1 ] [ x0 ]
      [    ] = sqrt(1/2) [         ] [    ]
      [ y1 ]             [ -1   1 ] [ x1 ]

   This can be implemented up to scale via

      y0 = x0 + x1

      y1 = 2*x1 - y0

   and reversed via

      x1 = (y0 + y1) >> 1

      x0 = y0 - x1

   Both y0 and y1 are too large by a factor of sqrt(2), however.

   It is also possible to implement any rotation by an angle t, including the orthonormal scale factor, by decomposing it into three steps:

                cos(t) - 1
      u0 = x0 + ---------- * x1
                  sin(t)

      y1 = x1 + sin(t)*u0

                cos(t) - 1
      y0 = u0 + ---------- * y1
                  sin(t)

   By letting t=-pi/4, we get an implementation of the first transform that includes the scaling factor.  To get an integer approximation of this transform, we need only replace the transcendental constants by fixed-point approximations:

      u0 = x0 + ((27*x1 + 32) >> 6)

      y1 = x1 - ((45*u0 + 32) >> 6)

      y0 = u0 + ((27*y1 + 32) >> 6)

   This approximation is still perfectly reversible:

      u0 = y0 - ((27*y1 + 32) >> 6)

      x1 = y1 + ((45*u0 + 32) >> 6)

      x0 = u0 - ((27*x1 + 32) >> 6)

   Each of the three steps can be implemented using just two ARM instructions, with constants that have up to 14 bits of precision (though using fewer bits allows more efficient hardware implementations, at a small cost in coding gain).  However, it is still much more complex than the first approach.

   We can get a compromise with a slight modification:

      y0 = x0 + x1

      y1 = x1 - (y0 >> 1)

   This still only implements the original orthonormal transform up to scale.  The y0 coefficient is too large by a factor of sqrt(2) as before, but y1 is now too small by a factor of sqrt(2).  If our goal is simply to (optionally quantize) and code the result, this is good enough.  The different scale factors can be incorporated into the quantization matrix in the lossy case, and the total expansion is roughly equivalent to that of the orthonormal transform in the lossless case.  Plus, we can perform each step with just one ARM instruction.

   However, if instead we want to apply additional transformations to the data, or use the result to predict other data, it becomes much more convenient to have uniformly scaled outputs.  For a two-point transform, there is little we can do to improve on the three-multiplications approach above.  However, for a four-point transform, we can use the last approach and arrange multiple transform stages such that the "too large" and "too small" scaling factors cancel out, producing a result that has the true, uniform, orthonormal scaling.  To do this, we need one more tool, which implements the following transform:

      [ y0 ]             [ cos(t)  -sin(t) ] [ 1  0 ] [ x0 ]
      [    ] = sqrt(1/2) [                 ] [      ] [    ]
      [ y1 ]             [ sin(t)   cos(t) ] [ 0  2 ] [ x1 ]

   This takes unevenly scaled inputs, rescales them, and then rotates them.  Like an ordinary rotation, it can be reduced to three lifting steps:

                2*cos(t) - sqrt(2)
      u0 = x0 + ------------------ * x1
                      sin(t)

      y1 = x1 + sqrt(1/2)*sin(t)*u0

                cos(t) - sqrt(2)
      y0 = u0 + ---------------- * y1
                     sin(t)

   As before, the transcendental constants may be replaced by fixed-point approximations without harming the reversibility property.

3.2.
4-Point Transform

   Using the tools from the previous section, we can design a reversible integer four-point DCT approximation with uniform, orthonormal scaling.  This requires 3 multiplies, 9 additions, and 2 shifts (not counting the shift and rounding offset used in the fixed-point multiplies, as these are built into the multiplier).  This is significantly cheaper than the [BE92] approach, and the output scaling is smaller by a factor of sqrt(2), saving half a bit per sample in the lossless case.  By comparison, the four-point forward DCT approximation used in VP9, which is not reversible, uses 6 multiplies, 6 additions, and 2 shifts (counting shifts and rounding offsets which cannot be merged into a single multiply instruction on ARM).  Four of its multipliers also require 28-bit accumulators, whereas this proposal can use much smaller multipliers without giving up the reversibility property.  The total dynamic range expansion is 1 bit: inputs in the range [-256,255) produce transformed values in the range [-512,510).  This is the smallest dynamic range expansion possible for any reversible transform constructed from mostly-linear operations.  It is possible to make reversible orthogonal transforms with no dynamic range expansion by using "piecewise-linear" rotations [SLD04], but each step requires a large number of operations in a software implementation.

   Pseudo-code for the forward transform follows:

   Input:  x0, x1, x2, x3
   Output: y0, y1, y2, y3
   /* Rotate (x3, x0) by -pi/4, asymmetrically scaled output. */
   t3  = x0 - x3
   t0  = x0 - (t3 >> 1)
   /* Rotate (x1, x2) by pi/4, asymmetrically scaled output. */
   t2  = x1 + x2
   t2h = t2 >> 1
   t1  = t2h - x2
   /* Rotate (t2, t0) by -pi/4, asymmetrically scaled input. */
   y0  = t0 + t2h
   y2  = y0 - t2
   /* Rotate (t3, t1) by 3*pi/8, asymmetrically scaled input. */
   t3  = t3 - (45*t1 + 32 >> 6)
   y1  = t1 + (21*t3 + 16 >> 5)
   y3  = t3 - (71*y1 + 32 >> 6)

   Even though there are three asymmetrically scaled rotations by pi/4, by careful arrangement we can share one of the shift operations (to help software implementations: shifts by a constant are basically free in hardware).  This technique can be used to even greater effect in larger transforms.

   The inverse transform is constructed by simply undoing each step in turn:

   Input:  y0, y1, y2, y3
   Output: x0, x1, x2, x3
   /* Rotate (y3, y1) by -3*pi/8, asymmetrically scaled output. */
   t3  = y3 + (71*y1 + 32 >> 6)
   t1  = y1 - (21*t3 + 16 >> 5)
   t3  = t3 + (45*t1 + 32 >> 6)
   /* Rotate (y2, y0) by pi/4, asymmetrically scaled output. */
   t2  = y0 - y2
   t2h = t2 >> 1
   t0  = y0 - t2h
   /* Rotate (t1, t2) by -pi/4, asymmetrically scaled input. */
   x2  = t2h - t1
   x1  = t2 - x2
   /* Rotate (x3, x0) by pi/4, asymmetrically scaled input. */
   x0  = t0 - (t3 >> 1)
   x3  = x0 - t3

   Although the right shifts make this transform non-linear, we can compute "basis functions" for it by sending a vector through it with a single value set to a large constant (256 was used here), and the rest of the values set to zero.
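   As a small C sketch of this procedure (not code from the development repository), the forward transform above can be applied to scaled unit vectors; dividing the outputs by 256 recovers the approximate basis functions.  The code assumes arithmetic right shifts of negative values, as the pseudo-code above does.  Its output matches the basis listed below for the reversible, integer DCT.

      #include <stdio.h>

      /* Forward 4-point reversible DCT approximation, transcribed from
         the pseudo-code above. */
      static void fdct4(const int x[4], int y[4]) {
        int t0, t1, t2, t2h, t3;
        t3 = x[0] - x[3];
        t0 = x[0] - (t3 >> 1);
        t2 = x[1] + x[2];
        t2h = t2 >> 1;
        t1 = t2h - x[2];
        y[0] = t0 + t2h;
        y[2] = y[0] - t2;
        t3 = t3 - ((45*t1 + 32) >> 6);
        y[1] = t1 + ((21*t3 + 16) >> 5);
        y[3] = t3 - ((71*y[1] + 32) >> 6);
      }

      int main(void) {
        int i, j;
        /* Send 256 times each unit vector through the transform and scale
           the results back down to estimate the basis functions. */
        for (j = 0; j < 4; j++) {
          for (i = 0; i < 4; i++) {
            int x[4] = {0, 0, 0, 0};
            int y[4];
            x[i] = 256;
            fdct4(x, y);
            printf(" %8.5f", y[j]/256.0);
          }
          printf("\n");
        }
        return 0;
      }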
   The true basis functions for a four-point DCT (up to five digits) are

   [ y0 ]   [ 0.50000  0.50000  0.50000  0.50000 ] [ x0 ]
   [ y1 ] = [ 0.65328  0.27060 -0.27060 -0.65328 ] [ x1 ]
   [ y2 ]   [ 0.50000 -0.50000 -0.50000  0.50000 ] [ x2 ]
   [ y3 ]   [ 0.27060 -0.65328  0.65328 -0.27060 ] [ x3 ]

   The corresponding basis functions for our reversible, integer DCT, computed using the approximation described above, are

   [ y0 ]   [ 0.50000  0.50000  0.50000  0.50000 ] [ x0 ]
   [ y1 ] = [ 0.65625  0.26953 -0.26953 -0.65625 ] [ x1 ]
   [ y2 ]   [ 0.50000 -0.50000 -0.50000  0.50000 ] [ x2 ]
   [ y3 ]   [ 0.27344 -0.65234  0.65234 -0.27344 ] [ x3 ]

   The mean squared error (MSE) of the output, compared to a true DCT, can be computed with some assumptions about the input signal.  Let G be the true DCT basis and G' be the basis for our integer approximation (computed as described above).  Then the error in the transformed results is

      e = G.x - G'.x = (G - G').x = D.x

   where D = (G - G').  The MSE is then [Que98]

      (1/N) * E[e^T.e] = (1/N) * E[x^T.D^T.D.x]
                       = (1/N) * E[tr(D.x.x^T.D^T)]
                       = (1/N) * E[tr(D.Rxx.D^T)]

   where Rxx is the autocorrelation matrix of the input signal.  Assuming the input is a zero-mean, first-order autoregressive (AR(1)) process gives an autocorrelation matrix of

      Rxx[i,j] = rho^|i - j|

   for some correlation coefficient rho.  A value of rho = 0.95 is typical for image compression applications.  Smaller values are more normal for motion-compensated frame differences, but this makes surprisingly little difference in transform design.  Using the above procedure, the theoretical MSE of this approximation is 1.230E-6, which is below the level of the truncation error introduced by the right shift operations.  This suggests the dynamic range of the input would have to be more than 20 bits before it became worthwhile to increase the precision of the constants used in the multiplications to improve accuracy, though it may be worth using more precision to reduce bias.

3.3.  Larger Transforms

   The same techniques can be applied to construct a reversible eight-point DCT approximation with uniform, orthonormal scaling using 15 multiplies, 31 additions, and 5 shifts.  It is possible to reduce this to 11 multiplies and 29 additions, which is the minimum number of multiplies possible for an eight-point DCT with uniform scaling [LLM89], by introducing a scaling factor of sqrt(2), but this harms lossless performance.  The dynamic range expansion is 1.5 bits (again the smallest possible), and the MSE is 1.592E-06.  By comparison, the eight-point transform in VP9 uses 12 multiplications, 32 additions, and 6 shifts.

   Similarly, we have constructed a reversible sixteen-point DCT approximation with uniform, orthonormal scaling using 33 multiplies, 83 additions, and 16 shifts.  This is just 2 multiplies and 2 additions more than the (non-reversible, non-integer, but uniformly scaled) factorization in [LLM89].  By comparison, the sixteen-point transform in VP9 uses 44 multiplies, 88 additions, and 18 shifts.  The dynamic range expansion is only 2 bits (again the smallest possible), and the MSE is 1.495E-5.

   We also have a reversible 32-point DCT approximation with uniform, orthonormal scaling using 87 multiplies, 215 additions, and 38 shifts.
   By comparison, the 32-point transform in VP9 uses 116 multiplies, 194 additions, and 66 shifts.  Our dynamic range expansion is still the minimal 2.5 bits, and the MSE is 8.006E-05.

   Code for all of these transforms is available in the development repository listed in Section 4.

3.4.  Walsh-Hadamard Transforms

   These techniques can also be applied to constructing Walsh-Hadamard Transforms, another useful transform family that is cheaper to implement than the DCT (since it requires no multiplications at all).  The WHT has many applications as a cheap way to approximately change the time and frequency resolution of a set of data (either individual bands, as in the Opus audio codec, or whole blocks).  VP9 uses it as a reversible transform with uniform, orthonormal scaling for lossless coding in place of its DCT, which does not have these properties.

   Applying a 2x2 WHT to a block of 2x2 inputs involves running a 2-point WHT on the rows, and then another 2-point WHT on the columns.  The basis functions for the 2-point WHT are, up to scaling, [1, 1] and [1, -1].  The four variations of a two-step lifter given in Section 3.1 are exactly the lifting steps needed to implement a 2x2 WHT: two stages that produce asymmetrically scaled outputs followed by two stages that consume asymmetrically scaled inputs.

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   /* Transform rows */
   t1  = x00 - x01
   t0  = x00 - (t1 >> 1)    /* == (x00 + x01)/2 */
   t2  = x10 + x11
   t3  = (t2 >> 1) - x11    /* == (x10 - x11)/2 */
   /* Transform columns */
   y00 = t0 + (t2 >> 1)     /* == (x00 + x01 + x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y11 = (t1 >> 1) - t3     /* == (x00 - x01 - x10 + x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   By simply re-ordering the operations, we can see that there are two shifts that may be shared between the two stages:

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   t1  = x00 - x01
   t2  = x10 + x11
   t0  = x00 - (t1 >> 1)    /* == (x00 + x01)/2 */
   y00 = t0 + (t2 >> 1)     /* == (x00 + x01 + x10 + x11)/2 */
   t3  = (t2 >> 1) - x11    /* == (x10 - x11)/2 */
   y11 = (t1 >> 1) - t3     /* == (x00 - x01 - x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   By eliminating the double-negation of x11 and re-ordering the additions to it, we can see even more operations in common:

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   t1  = x00 - x01
   t2  = x10 + x11
   t0  = x00 - (t1 >> 1)    /* == (x00 + x01)/2 */
   y00 = t0 + (t2 >> 1)     /* == (x00 + x01 + x10 + x11)/2 */
   t3  = x11 + (t1 >> 1)    /* == x11 + (x00 - x01)/2 */
   y11 = t3 - (t2 >> 1)     /* == (x00 - x01 - x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   Simplifying further, the whole transform may be computed with just 7 additions and 1 shift:

   Input:  x00, x01, x10, x11
   Output: y00, y01, y10, y11
   t1  = x00 - x01
   t2  = x10 + x11
   t4  = (t2 - t1) >> 1     /* == (-x00 + x01 + x10 + x11)/2 */
   y00 = x00 + t4           /* == (x00 + x01 + x10 + x11)/2 */
   y11 = x11 - t4           /* == (x00 - x01 - x10 + x11)/2 */
   y10 = y00 - t2           /* == (x00 + x01 - x10 - x11)/2 */
   y01 = t1 - y11           /* == (x00 - x01 + x10 - x11)/2 */

   This is a significant savings over other approaches described in the literature, which require 8
additions, 2 shifts, and 989 1 negation [FOIK99] (37.5% more operations), or 10 additions, 990 1 shift, and 2 negations [TSSRM08] (62.5% more operations). The same 991 operations can be applied to compute a 4-point WHT in one dimension. 992 This implementation is used in this way in VP9's lossless mode. 993 Since larger WHTs may be trivially factored into multiple smaller 994 WHTs, the same approach can implement a reversible, orthonormally 995 scaled WHT of any size (2**N)x(2**M), so long as (N + M) is even. 997 4. Development Repository 999 The tools presented here were developed as part of Xiph.Org's Daala 1000 project. They are available, along with many others in greater and 1001 lesser states of maturity, in the Daala git repository at [3]. See 1002 [4] for more information. 1004 5. IANA Considerations 1006 This document has no actions for IANA. 1008 6. Acknowledgments 1010 Thanks to Nathan Egge, Gregory Maxwell, and Jean-Marc Valin for their 1011 assistance in the implementation and experimentation, and in 1012 preparing this draft. 1014 7. References 1016 7.1. Informative References 1018 [RFC6386] Bankoski, J., Koleszar, J., Quillio, L., Salonen, J., 1019 Wilkins, P., and Y. Xu, "VP8 Data Format and Decoding 1020 Guide", RFC 6386, DOI 10.17487/RFC6386, November 2011, 1021 . 1023 [RFC6716] Valin, JM., Vos, K., and T. Terriberry, "Definition of the 1024 Opus Audio Codec", RFC 6716, DOI 10.17487/RFC6716, 1025 September 2012, . 1027 [BE92] Bruekers, F. and A. van den Enden, "New Networks for 1028 Perfect Inversion and Perfect Reconstruction", IEEE 1029 Journal on Selected Areas in Communication 10(1):129--137, 1030 January 1992. 1032 [FOIK99] Fukuma, S., Oyama, K., Iwahashi, M., and N. Kambayashi, 1033 "Lossless 8-point Fast Discrete Cosine Transform Using 1034 Lossless Hadamard Transform", Technical Report The 1035 Institute of Electronics, Information, and Communication 1036 Engineers of Japan, October 1999. 1038 [LLM89] Loeffler, C., Ligtenberg, A., and G. Moschytz, "Practical 1039 Fast 1-D DCT Algorithms with 11 Multiplications", Proc. 1040 Acoustics, Speech, and Signal Processing (ICASSP'89) vol. 1041 2, pp. 988--991, May 1989. 1043 [Pas76] Pasco, R., "Source Coding Algorithms for Fast Data 1044 Compression", Ph.D. Thesis Dept. of Electrical 1045 Engineering, Stanford University, May 1976. 1047 [PKA69] Pratt, W., Kane, J., and H. Andrews, "Hadamard Transform 1048 Image Coding", Proc. IEEE 57(1):58--68, Jan 1969. 1050 [Que98] de Queiroz, R., "On Unitary Transform Approximations", 1051 IEEE Signal Processing Letters 5(2):46--47, Feb 1998. 1053 [SLD04] Senecal, J., Lindstrom, P., and M. Duchaineau, "An 1054 Improved N-Bit to N-Bit Reversible Haar-Like Transform", 1055 Proc. of the 12th Pacific Conference on Computer Graphics 1056 and Applications (PG'04) pp. 371--380, October 2004. 1058 [SM98] Stuiver, L. and A. Moffat, "Piecewise Integer Mapping for 1059 Arithmetic Coding", Proc. of the 17th IEEE Data 1060 Compression Conference (DCC'98) pp. 1--10, March/April 1061 1998. 1063 [TSSRM08] Tu, C., Srinivasan, S., Sullivan, G., Regunathan, S., and 1064 H. Malvar, "Low-complexity Hierarchical Lapped Transform 1065 for Lossy-to-Lossless Image Coding in JPEG XR/HD Photo", 1066 Applications of Digital Image Processing XXXI vol 7073, 1067 August 2008. 1069 7.2. 
URIs 1071 [1] https://people.xiph.org/~tterribe/daala/ec_test0/ec_tokens.txt 1073 [2] https://people.xiph.org/~tterribe/daala/ec_test0/ec_test.c 1075 [3] https://git.xiph.org/daala.git 1077 [4] https://xiph.org/daala/ 1079 Authors' Addresses 1081 Timothy B. Terriberry 1082 Mozilla Corporation 1083 331 E. Evelyn Avenue 1084 Mountain View, CA 94041 1085 USA 1087 Phone: +1 650 903-0800 1088 Email: tterribe@xiph.org 1090 Nathan E. Egge 1091 Mozilla Corporation 1092 331 E. Evelyn Avenue 1093 Mountain View, CA 94041 1094 USA 1096 Phone: +1 650 903-0800 1097 Email: negge@xiph.org