< draft-alakuijala-brotli-08.txt   draft-alakuijala-brotli-11.txt >
Network Working Group J. Alakuijala Network Working Group J. Alakuijala
Internet-Draft Z. Szabadka Internet-Draft Z. Szabadka
Intended Status: Informational Google, Inc Intended Status: Informational Google, Inc
Expires: June 10, 2016 December 2015 Expires: November 24, 2016 May 2016
Brotli Compressed Data Format Brotli Compressed Data Format
draft-alakuijala-brotli-08 draft-alakuijala-brotli-11
Abstract Abstract
This specification defines a lossless compressed data format that This specification defines a lossless compressed data format that
compresses data using a combination of the LZ77 algorithm and Huffman compresses data using a combination of the LZ77 algorithm and Huffman
coding, with efficiency comparable to the best currently available coding, with efficiency comparable to the best currently available
general-purpose compression methods. general-purpose compression methods.
Status of this Memo Status of this Memo
skipping to change at page 1, line 33 skipping to change at page 1, line 33
Internet-Drafts are working documents of the Internet Engineering Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet- working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/. Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress." material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 10, 2016. This Internet-Draft will expire on November 24, 2016.
Copyright Notice Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License. described in the Simplified BSD License.
Table of Contents Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2. Intended audience . . . . . . . . . . . . . . . . . . . . 3 1.2. Intended audience . . . . . . . . . . . . . . . . . . . . 3
1.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4. Compliance . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4. Compliance . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5. Definitions of terms and conventions used . . . . . . . . 4 1.5. Definitions of terms and conventions used . . . . . . . . 4
1.5.1. Packing into bytes . . . . . . . . . . . . . . . . . 4 1.5.1. Packing into bytes . . . . . . . . . . . . . . . . . 5
2. Compressed representation overview . . . . . . . . . . . . . . 5 2. Compressed representation overview . . . . . . . . . . . . . . 6
3. Compressed representation of prefix codes . . . . . . . . . . . 9 3. Compressed representation of prefix codes . . . . . . . . . . 10
3.1. Introduction to prefix coding . . . . . . . . . . . . . . 9 3.1. Introduction to prefix coding . . . . . . . . . . . . . . 10
3.2. Use of prefix coding in the brotli format . . . . . . . . 10 3.2. Use of prefix coding in the brotli format . . . . . . . . 11
3.3. Alphabet sizes . . . . . . . . . . . . . . . . . . . . . 12 3.3. Alphabet sizes . . . . . . . . . . . . . . . . . . . . . 12
3.4. Simple prefix codes . . . . . . . . . . . . . . . . . . . 13 3.4. Simple prefix codes . . . . . . . . . . . . . . . . . . . 13
3.5. Complex prefix codes . . . . . . . . . . . . . . . . . . 14 3.5. Complex prefix codes . . . . . . . . . . . . . . . . . . 14
4. Encoding of distances . . . . . . . . . . . . . . . . . . . . 16 4. Encoding of distances . . . . . . . . . . . . . . . . . . . . 17
5. Encoding of literal insertion lengths and copy lengths . . . . 18 5. Encoding of literal insertion lengths and copy lengths . . . . 18
6. Encoding of block switch commands . . . . . . . . . . . . . . 20 6. Encoding of block switch commands . . . . . . . . . . . . . . 20
7. Context modeling . . . . . . . . . . . . . . . . . . . . . . . 22 7. Context modeling . . . . . . . . . . . . . . . . . . . . . . . 22
7.1. Context modes and context ID lookup for literals . . . . 22 7.1. Context modes and context ID lookup for literals . . . . 22
7.2. Context ID for distances . . . . . . . . . . . . . . . . 24 7.2. Context ID for distances . . . . . . . . . . . . . . . . 24
7.3. Encoding of the context map . . . . . . . . . . . . . . . 24 7.3. Encoding of the context map . . . . . . . . . . . . . . . 24
8. Static dictionary . . . . . . . . . . . . . . . . . . . . . . 26 8. Static dictionary . . . . . . . . . . . . . . . . . . . . . . 26
9. Compressed data format . . . . . . . . . . . . . . . . . . . . 28 9. Compressed data format . . . . . . . . . . . . . . . . . . . . 29
9.1. Format of the stream header . . . . . . . . . . . . . . . 29 9.1. Format of the stream header . . . . . . . . . . . . . . . 29
9.2. Format of the meta-block header . . . . . . . . . . . . . 29 9.2. Format of the meta-block header . . . . . . . . . . . . . 30
9.3. Format of the meta-block data . . . . . . . . . . . . . . 32 9.3. Format of the meta-block data . . . . . . . . . . . . . . 32
10. Decoding algorithm . . . . . . . . . . . . . . . . . . . . . 33 10. Decoding algorithm . . . . . . . . . . . . . . . . . . . . . 34
11. Security Considerations . . . . . . . . . . . . . . . . . . . 36 11. Considerations for compressor implementations . . . . . . . . 36
12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 36 11.1. Trivial compressor . . . . . . . . . . . . . . . . . . . 36
13. Informative References . . . . . . . . . . . . . . . . . . . 36 11.2. Aligning compressed meta-blocks to byte boundaries . . . 37
14. Source code . . . . . . . . . . . . . . . . . . . . . . . . . 37 11.3. Creating self-contained parts within the compressed data 37
15. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 37 12. Security Considerations . . . . . . . . . . . . . . . . . . . 38
Appendix A. Static dictionary data . . . . . . . . . . . . . . . 37 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40
Appendix B. List of word transformations . . . . . . . . . . . . 117 14. Informative References . . . . . . . . . . . . . . . . . . . 40
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 120 15. Source code . . . . . . . . . . . . . . . . . . . . . . . . . 40
16. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 41
Appendix A. Static dictionary data . . . . . . . . . . . . . . . 41
Appendix B. List of word transformations . . . . . . . . . . . . 121
Appendix C. Computing CRC-32 check values . . . . . . . . . . . 124
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 124
1. Introduction 1. Introduction
1.1. Purpose 1.1. Purpose
The purpose of this specification is to define a lossless compressed The purpose of this specification is to define a lossless compressed
data format that: data format that:
* Is independent of CPU type, operating system, file system, * Is independent of CPU type, operating system, file system,
and character set, and hence can be used for interchange; and character set, and hence can be used for interchange;
skipping to change at page 3, line 30 skipping to change at page 3, line 30
best currently available general-purpose compression methods, best currently available general-purpose compression methods,
and in particular, considerably better than the gzip program; and in particular, considerably better than the gzip program;
* Decompresses much faster than current LZMA implementations. * Decompresses much faster than current LZMA implementations.
The data format defined by this specification does not attempt to: The data format defined by this specification does not attempt to:
* Allow random access to compressed data; * Allow random access to compressed data;
* Compress specialized data (e.g., raster graphics) as well * Compress specialized data (e.g., raster graphics) as well
as the best currently available specialized algorithms. as the best currently available specialized algorithms.
This document is the authoritative specification of the brotli
compressed data format. It defines the set of valid brotli compressed
data streams and a decoder algorithm that produces the uncompressed
data stream from a valid brotli compressed data stream.
1.2. Intended audience 1.2. Intended audience
This specification is intended for use by software implementers to This specification is intended for use by software implementers to
compress data into and/or decompress data from the brotli format. compress data into and/or decompress data from the brotli format.
The text of the specification assumes a basic background in The text of the specification assumes a basic background in
programming at the level of bits and other primitive data programming at the level of bits and other primitive data
representations. Familiarity with the technique of Huffman coding is representations. Familiarity with the technique of Huffman coding is
helpful but not required. helpful but not required.
skipping to change at page 5, line 26 skipping to change at page 5, line 38
starting with the least-significant bit of the data starting with the least-significant bit of the data
element. These are referred to here as integer values element. These are referred to here as integer values
and are considered unsigned. and are considered unsigned.
* Prefix codes are packed starting with the most-significant * Prefix codes are packed starting with the most-significant
bit of the code. bit of the code.
In other words, if one were to print out the compressed data as a In other words, if one were to print out the compressed data as a
sequence of bytes, starting with the first byte at the *right* margin sequence of bytes, starting with the first byte at the *right* margin
and proceeding to the *left*, with the most-significant bit of each and proceeding to the *left*, with the most-significant bit of each
byte on the left as usual, one would be able to parse the result from byte on the left as usual, one would be able to parse the result from
right to left, with fixed-width elements in the correct MSB-to-LSB right to left, with fixed-width elements in the correct msb-to-lsb
order and prefix codes in bit-reversed order (i.e., with the first order and prefix codes in bit-reversed order (i.e., with the first
bit of the code in the relative LSB position). bit of the code in the relative lsb position).
As an example, consider packing the following data elements into a
sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2,
prefix code 110, prefix code 10, 12-bit integer value 3628.
byte 2 byte 1 byte 0
+--------+--------+--------+
|11100010|11000101|10010110|
+--------+--------+--------+
^ ^ ^ ^ ^
| | | | |
| | | | +------ integer value 6
| | | +---------- integer value 2
| | +-------------- prefix code 110
| +---------------- prefix code 10
+----------------------------- integer value 3628
2. Compressed representation overview 2. Compressed representation overview
A compressed data set consists of a header and a series of meta- A compressed data set consists of a header and a series of meta-
blocks. Each meta-block decompresses to a sequence of 0 to 16,777,216 blocks. Each meta-block decompresses to a sequence of 0 to 16,777,216
(16 MiB) uncompressed bytes. The final uncompressed data is the (16 MiB) uncompressed bytes. The final uncompressed data is the
concatenation of the uncompressed sequences from each meta-block. concatenation of the uncompressed sequences from each meta-block.
The header contains the size of the sliding window that was used The header contains the size of the sliding window that was used
during compression. The decompressor must retain at least that during compression. The decompressor must retain at least that
skipping to change at page 14, line 36 skipping to change at page 14, line 46
or 1, 2, 3, 3 (tree-select bit 1). or 1, 2, 3, 3 (tree-select bit 1).
3.5. Complex prefix codes 3.5. Complex prefix codes
A complex prefix code is a canonical prefix code, defined by the A complex prefix code is a canonical prefix code, defined by the
sequence of code lengths, as discussed in Section 3.2., above. For sequence of code lengths, as discussed in Section 3.2., above. For
even greater compactness, the code length sequences themselves are even greater compactness, the code length sequences themselves are
compressed using a prefix code. The alphabet for code lengths is as compressed using a prefix code. The alphabet for code lengths is as
follows: follows:
0 - 15: Represent code lengths of 0 - 15 0..15: Represent code lengths of 0..15
16: Copy the previous non-zero code length 3 - 6 times 16: Copy the previous non-zero code length 3..6 times
The next 2 bits indicate repeat length The next 2 bits indicate repeat length
(0 = 3, ... , 3 = 6) (0 = 3, ... , 3 = 6)
If this is the first code length, or all previous If this is the first code length, or all previous
code lengths are zero, a code length of 8 is code lengths are zero, a code length of 8 is
repeated 3 - 6 times repeated 3..6 times
A repeated code length code of 16 modifies the A repeated code length code of 16 modifies the
repeat count of the previous one as follows: repeat count of the previous one as follows:
repeat count = (4 * (repeat count - 2)) + repeat count = (4 * (repeat count - 2)) +
(3 - 6 on the next 2 bits) (3..6 on the next 2 bits)
Example: Codes 7, 16 (+2 bits 11), 16 (+2 bits 10) Example: Codes 7, 16 (+2 bits 11), 16 (+2 bits 10)
will expand to 22 code lengths of 7 will expand to 22 code lengths of 7
(1 + 4 * (6 - 2) + 5) (1 + 4 * (6 - 2) + 5)
17: Repeat a code length of 0 for 3 - 10 times. 17: Repeat a code length of 0 for 3..10 times.
(3 bits of length) (3 bits of length)
A repeated code length code of 17 modifies the A repeated code length code of 17 modifies the
repeat count of the previous one as follows: repeat count of the previous one as follows:
repeat count = (8 * (repeat count - 2)) + repeat count = (8 * (repeat count - 2)) +
(3 - 10 on the next 3 bits) (3..10 on the next 3 bits)
Note that a code of 16 that follows an immediately preceding 16 Note that a code of 16 that follows an immediately preceding 16
modifies the previous repeat count, which becomes the new repeat modifies the previous repeat count, which becomes the new repeat
count. The same is true for a 17 following a 17. A sequence of three count. The same is true for a 17 following a 17. A sequence of three
or more 16 codes in a row or three of more 17 codes in a row is or more 16 codes in a row or three of more 17 codes in a row is
possible, modifying the count each time. Only the final repeat count possible, modifying the count each time. Only the final repeat count
is used. The modification only applies if the same code follows. A 16 is used. The modification only applies if the same code follows. A 16
repeat does not modify an immediately preceding 17 count nor vice repeat does not modify an immediately preceding 17 count nor vice
versa. versa.
skipping to change at page 15, line 51 skipping to change at page 16, line 14
Code lengths for symbols in the code length alphabet given Code lengths for symbols in the code length alphabet given
just above, in the order: 1, 2, 3, 4, 0, 5, 17, 6, 16, 7, just above, in the order: 1, 2, 3, 4, 0, 5, 17, 6, 16, 7,
8, 9, 10, 11, 12, 13, 14, 15. If HSKIP is 2, then the 8, 9, 10, 11, 12, 13, 14, 15. If HSKIP is 2, then the
code lengths for symbols 1 and 2 are zero, and the first code lengths for symbols 1 and 2 are zero, and the first
code length is for symbol 3. If HSKIP is 3, then the code code length is for symbol 3. If HSKIP is 3, then the code
length for symbol 3 is also zero, and the first code length length for symbol 3 is also zero, and the first code length
is for symbol 4. is for symbol 4.
The code lengths of code length symbols are between 0 and The code lengths of code length symbols are between 0 and
5, and they are represented with 2 - 4 bits according to 5, and they are represented with 2..4 bits according to
the variable length code above. A code length of 0 means the variable length code above. A code length of 0 means
the corresponding code length symbol is not used. the corresponding code length symbol is not used.
If HSKIP is 2 or 3, a respective number of leading code If HSKIP is 2 or 3, a respective number of leading code
lengths are implicit zeros and are not present in the lengths are implicit zeros and are not present in the
code lengths sequence above. code lengths sequence above.
If there are at least two non-zero code lengths, any If there are at least two non-zero code lengths, any
trailing zero code lengths are omitted, i.e. the last trailing zero code lengths are omitted, i.e. the last
code length in the sequence must be non-zero. In this code length in the sequence must be non-zero. In this
skipping to change at page 17, line 6 skipping to change at page 17, line 16
As described in Section 2., one component of a compressed meta-block As described in Section 2., one component of a compressed meta-block
is a sequence of backward distances. In this section we provide the is a sequence of backward distances. In this section we provide the
details to the encoding of distances. details to the encoding of distances.
Each distance in the compressed data part of a meta-block is Each distance in the compressed data part of a meta-block is
represented with a pair <distance code, extra bits>. The distance represented with a pair <distance code, extra bits>. The distance
code and the extra bits are encoded back-to-back, the distance code code and the extra bits are encoded back-to-back, the distance code
is encoded using a prefix code over the distance alphabet, while the is encoded using a prefix code over the distance alphabet, while the
extra bits value is encoded as a fixed-width integer value. The extra bits value is encoded as a fixed-width integer value. The
number of extra bits can be 0 - 24, and it is dependent on the number of extra bits can be 0..24, and it is dependent on the
distance code. distance code.
To convert a distance code and associated extra bits to a backward To convert a distance code and associated extra bits to a backward
distance, we need the sequence of past distances and two additional distance, we need the sequence of past distances and two additional
parameters, the number of "postfix bits", denoted by NPOSTFIX (0..3), parameters, the number of "postfix bits", denoted by NPOSTFIX (0..3),
and the number of direct distance codes, denoted by NDIRECT (0..120). and the number of direct distance codes, denoted by NDIRECT (0..120).
Both of these parameters are encoded in the meta-block header. We Both of these parameters are encoded in the meta-block header. We
will also use the following derived parameter: will also use the following derived parameter:
POSTFIX_MASK = (1 << NPOSTFIX) - 1 POSTFIX_MASK = (1 << NPOSTFIX) - 1
skipping to change at page 18, line 43 skipping to change at page 19, line 4
Each <insertion length, copy length> pair in the compressed data part Each <insertion length, copy length> pair in the compressed data part
of a meta-block is represented with the following triplet: of a meta-block is represented with the following triplet:
<insert-and-copy length code, insert extra bits, copy extra bits> <insert-and-copy length code, insert extra bits, copy extra bits>
The insert-and-copy length code, the insert extra bits, and the copy The insert-and-copy length code, the insert extra bits, and the copy
extra bits are encoded back-to-back, the insert-and-copy length code extra bits are encoded back-to-back, the insert-and-copy length code
is encoded using a prefix code over the insert-and-copy length code is encoded using a prefix code over the insert-and-copy length code
alphabet, while the extra bits values are encoded as fixed-width alphabet, while the extra bits values are encoded as fixed-width
integer values. The number of insert and copy extra bits can be 0 - integer values. The number of insert and copy extra bits can be
24, and they are dependent on the insert-and-copy length code. 0..24, and they are dependent on the insert-and-copy length code.
Some of the insert-and-copy length codes also express the fact that Some of the insert-and-copy length codes also express the fact that
the distance symbol of the distance in the same command is 0, i.e. the distance symbol of the distance in the same command is 0, i.e.
the distance component of the command is the same as that of the the distance component of the command is the same as that of the
previous command. In this case, the distance code and extra bits for previous command. In this case, the distance code and extra bits for
the distance are omitted from the compressed data stream. the distance are omitted from the compressed data stream.
We describe the insert-and-copy length code alphabet in terms of the We describe the insert-and-copy length code alphabet in terms of the
(not directly used) insert length code and copy length code (not directly used) insert length code and copy length code
alphabets. The symbols of the insert length code alphabet, along with alphabets. The symbols of the insert length code alphabet, along with
the number of insert extra bits, and the range of the insert lengths the number of insert extra bits, and the range of the insert lengths
are as follows: are as follows:
Extra Extra Extra Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- ------- ---- ---- ------- ---- ---- -------
0 0 0 8 2 10-13 16 6 130-193 0 0 0 8 2 10..13 16 6 130..193
1 0 1 9 2 14-17 17 7 194-321 1 0 1 9 2 14..17 17 7 194..321
2 0 2 10 3 18-25 18 8 322-577 2 0 2 10 3 18..25 18 8 322..577
3 0 3 11 3 26-33 19 9 578-1089 3 0 3 11 3 26..33 19 9 578..1089
4 0 4 12 4 34-49 20 10 1090-2113 4 0 4 12 4 34..49 20 10 1090..2113
5 0 5 13 4 50-65 21 12 2114-6209 5 0 5 13 4 50..65 21 12 2114..6209
6 1 6,7 14 5 66-97 22 14 6210-22593 6 1 6,7 14 5 66..97 22 14 6210..22593
7 1 8,9 15 5 98-129 23 24 22594-16799809 7 1 8,9 15 5 98..129 23 24 22594..16799809
The symbols of the copy length code alphabet, along with the number The symbols of the copy length code alphabet, along with the number
of copy extra bits, and the range of copy lengths are as follows: of copy extra bits, and the range of copy lengths are as follows:
Extra Extra Extra Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- ------- ---- ---- ------- ---- ---- -------
0 0 2 8 1 10,11 16 5 70-101 0 0 2 8 1 10,11 16 5 70..101
1 0 3 9 1 12,13 17 5 102-133 1 0 3 9 1 12,13 17 5 102..133
2 0 4 10 2 14-17 18 6 134-197 2 0 4 10 2 14..17 18 6 134..197
3 0 5 11 2 18-21 19 7 198-325 3 0 5 11 2 18..21 19 7 198..325
4 0 6 12 3 22-29 20 8 326-581 4 0 6 12 3 22..29 20 8 326..581
5 0 7 13 3 30-37 21 9 582-1093 5 0 7 13 3 30..37 21 9 582..1093
6 0 8 14 4 38-53 22 10 1094-2117 6 0 8 14 4 38..53 22 10 1094..2117
7 0 9 15 4 54-69 23 24 2118-16779333 7 0 9 15 4 54..69 23 24 2118..16779333
To convert an insert-and-copy length code to an insert length code To convert an insert-and-copy length code to an insert length code
and a copy length code, the following table can be used: and a copy length code, the following table can be used:
Insert Insert
length Copy length code length Copy length code
code 0-7 8-15 16-23 code 0..7 8..15 16..23
+---------+---------+ +----------+----------+
| | | | | |
0-7 | 0-63 | 64-127 | <--- distance symbol 0 0..7 | 0..63 | 64..127 | <--- distance symbol 0
| | | | | |
+---------+---------+---------+ +----------+----------+----------+
| | | | | | | |
0-7 | 128-191 | 192-255 | 384-447 | 0..7 | 128..191 | 192..255 | 384..447 |
| | | | | | | |
+---------+---------+---------+ +----------+----------+----------+
| | | | | | | |
8-15 | 256-319 | 320-383 | 512-575 | 8..15 | 256..319 | 320..383 | 512..575 |
| | | | | | | |
+---------+---------+---------+ +----------+----------+----------+
| | | | | | | |
16-23 | 448-511 | 576-639 | 640-703 | 16..23 | 448..511 | 576..639 | 640..703 |
| | | | | | | |
+---------+---------+---------+ +----------+----------+----------+
First, look up the cell with the 64 value range containing the First, look up the cell with the 64 value range containing the
insert-and-copy length code, this gives the insert length code and insert-and-copy length code, this gives the insert length code and
the copy length code ranges, both 8 values long. The copy length the copy length code ranges, both 8 values long. The copy length
code within its range is determined by bits 0-2 (counted from the code within its range is determined by bits 0..2 (counted from the
LSB) of the insert-and-copy length code. The insert length code lsb) of the insert-and-copy length code. The insert length code
within its range is determined by bits 3-5 (counted from the LSB) of within its range is determined by bits 3..5 (counted from the lsb) of
the insert-and-copy length code. Given the insert length and copy the insert-and-copy length code. Given the insert length and copy
length codes, the actual insert and copy lengths can be obtained by length codes, the actual insert and copy lengths can be obtained by
reading the number of extra bits given by the tables above. reading the number of extra bits given by the tables above.
If the insert-and-copy length code is between 0 and 127, the distance If the insert-and-copy length code is between 0 and 127, the distance
code of the command is set to zero (the last distance reused). code of the command is set to zero (the last distance reused).
6. Encoding of block switch commands 6. Encoding of block switch commands
As described in Section 2., a block-switch command is a pair <block As described in Section 2., a block-switch command is a pair <block
skipping to change at page 21, line 5 skipping to change at page 21, line 5
particular block category. particular block category.
Each block type in the compressed data is represented with a block Each block type in the compressed data is represented with a block
type code, encoded using a prefix code over the block type code type code, encoded using a prefix code over the block type code
alphabet. A block type symbol 0 means that the new block type is the alphabet. A block type symbol 0 means that the new block type is the
same as the type of the previous block from the same block category, same as the type of the previous block from the same block category,
i.e. the block type that preceded the current type, while a block i.e. the block type that preceded the current type, while a block
type symbol 1 means that the new block type equals the current block type symbol 1 means that the new block type equals the current block
type plus one. If the current block type is the maximal possible, type plus one. If the current block type is the maximal possible,
then a block type symbol of 1 results in wrapping to a new block type then a block type symbol of 1 results in wrapping to a new block type
of 0. Block type symbols 2 - 257 represent block types 0 - 255 of 0. Block type symbols 2..257 represent block types 0..255
respectively. The previous and current block types are initialized to respectively. The previous and current block types are initialized to
1 and 0, respectively, at the end of the meta-block header. 1 and 0, respectively, at the end of the meta-block header.
Since the first block type of each block category is 0, the block Since the first block type of each block category is 0, the block
type of the first block-switch command is not encoded in the type of the first block-switch command is not encoded in the
compressed data. If a block category has only one block type, the compressed data. If a block category has only one block type, the
block count of the first block-switch command is also omitted from block count of the first block-switch command is also omitted from
the compressed data, otherwise it is encoded in the meta-block the compressed data, otherwise it is encoded in the meta-block
header. header.
skipping to change at page 21, line 37 skipping to change at page 21, line 37
[0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it [0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it
follows that the alphabet size of literal, insert-and-copy length, follows that the alphabet size of literal, insert-and-copy length,
and distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and and distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and
NBLTYPESD + 2, respectively. NBLTYPESD + 2, respectively.
Each block count in the compressed data is represented with a pair Each block count in the compressed data is represented with a pair
<block count code, extra bits>. The block count code and the extra <block count code, extra bits>. The block count code and the extra
bits are encoded back-to-back, the block count code is encoded using bits are encoded back-to-back, the block count code is encoded using
a prefix code over the block count code alphabet, while the extra a prefix code over the block count code alphabet, while the extra
bits value is encoded as a fixed-width integer value. The number of bits value is encoded as a fixed-width integer value. The number of
extra bits can be 0 - 24, and it is dependent on the block count extra bits can be 0..24, and it is dependent on the block count code.
code. The symbols of the block count code alphabet, along with the The symbols of the block count code alphabet, along with the number
number of extra bits, and the range of block counts are as follows: of extra bits, and the range of block counts are as follows:
Extra Extra Extra Extra Extra Extra
Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths Code Bits Lengths
---- ---- ------ ---- ---- ------- ---- ---- ------- ---- ---- ------- ---- ---- ------- ---- ---- -------
0 2 1-4 9 4 65-80 18 7 369-496 0 2 1..4 9 4 65..80 18 7 369..496
1 2 5-8 10 4 81-96 19 8 497-752 1 2 5..8 10 4 81..96 19 8 497..752
2 2 9-12 11 4 97-112 20 9 753-1264 2 2 9..12 11 4 97..112 20 9 753..1264
3 2 13-16 12 5 113-144 21 10 1265-2288 3 2 13..16 12 5 113..144 21 10 1265..2288
4 3 17-24 13 5 145-176 22 11 2289-4336 4 3 17..24 13 5 145..176 22 11 2289..4336
5 3 25-32 14 5 177-208 23 12 4337-8432 5 3 25..32 14 5 177..208 23 12 4337..8432
6 3 33-40 15 5 209-240 24 13 8433-16624 6 3 33..40 15 5 209..240 24 13 8433..16624
7 3 41-48 16 6 241-304 25 24 16625-16793840 7 3 41..48 16 6 241..304 25 24 16625..16793840
8 4 49-64 17 6 305-368 8 4 49..64 17 6 305..368
The first block-switch command of each block category is special in The first block-switch command of each block category is special in
the sense that it is encoded in the meta-block header, and as the sense that it is encoded in the meta-block header, and as
described earlier, the block type code is omitted since it is an described earlier, the block type code is omitted since it is an
implicit zero. implicit zero.
7. Context modeling 7. Context modeling
As described in Section 2., the prefix tree used to encode a literal As described in Section 2., the prefix tree used to encode a literal
byte or a distance code depends on the block type and the context ID. byte or a distance code depends on the block type and the context ID.
skipping to change at page 24, line 13 skipping to change at page 24, line 13
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7
The lengths and the zlib CRC-32 (ITU-T Recommendation V.42) check The lengths and the CRC-32 check values (see Appendix C.) of each of
values of each of these tables as a sequence of bytes are as follows: these tables as a sequence of bytes are as follows:
Table Length CRC-32 Table Length CRC-32
----- ------ ------ ----- ------ ------
Lut0 256 0x8e91efb7 Lut0 256 0x8e91efb7
Lut1 256 0xd01a32f4 Lut1 256 0xd01a32f4
Lut2 256 0x0dd7a0d6 Lut2 256 0x0dd7a0d6
Given p1 is the last uncompressed byte and p2 is the second-to-last Given p1 is the last uncompressed byte and p2 is the second-to-last
uncompressed byte, the context IDs can be computed as follows: uncompressed byte, the context IDs can be computed as follows:
skipping to change at page 25, line 42 skipping to change at page 25, line 42
times, read RLEMAX bits for repeat length times, read RLEMAX bits for repeat length
RLEMAX + 1: value 1 RLEMAX + 1: value 1
... ...
RLEMAX + NTREES - 1: value NTREES - 1 RLEMAX + NTREES - 1: value NTREES - 1
If RLEMAX = 0, the run length coding is not used, and the symbols of If RLEMAX = 0, the run length coding is not used, and the symbols of
the alphabet are directly the values in the context map. We can now the alphabet are directly the values in the context map. We can now
define the format of the context map (the same format is used for define the format of the context map (the same format is used for
literal and distance context maps): literal and distance context maps):
1-5 bits: RLEMAX, 0 is encoded with one 0 bit, and values 1..5 bits: RLEMAX, 0 is encoded with one 0 bit, and values
1 - 16 are encoded with bit pattern xxxx1 (so 01001 1..16 are encoded with bit pattern xxxx1 (so 01001
is 5) is 5)
Prefix code with alphabet size NTREES + RLEMAX Prefix code with alphabet size NTREES + RLEMAX
Context map size values encoded with the above prefix code Context map size values encoded with the above prefix code
and run length coding for zero values. If a run length and run length coding for zero values. If a run length
would result in more lengths in total than the size of would result in more lengths in total than the size of
the context map, then the stream should be rejected as the context map, then the stream should be rejected as
invalid. invalid.
1 bit: IMTF bit, if set, we do an inverse move-to-front 1 bit: IMTF bit, if set, we do an inverse move-to-front
skipping to change at page 27, line 45 skipping to change at page 27, line 45
greater than 120, or the length is smaller than 4 or greater than 24, greater than 120, or the length is smaller than 4 or greater than 24,
then the compressed stream should be rejected as invalid. then the compressed stream should be rejected as invalid.
Each word transformation has the following form: Each word transformation has the following form:
transform_i(word) = prefix_i + T_i(word) + suffix_i transform_i(word) = prefix_i + T_i(word) + suffix_i
where the _i subscript denotes the transform_id above. Each T_i is where the _i subscript denotes the transform_id above. Each T_i is
one of the following 21 elementary transforms: one of the following 21 elementary transforms:
Identity, UppercaseFirst, UppercaseAll, Identity, FermentFirst, FermentAll,
OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9 OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9
The form of these elementary transforms is as follows: The form of these elementary transforms is as follows:
Identity(word) = word Identity(word) = word
UppercaseFirst(word) = first UTF-8 character of word upper-cased FermentFirst(word) = see below
UppercaseAll(word) = all UTF-8 characters of word upper-cased FermentAll(word) = see below
OmitFirstk(word) = the last (length(word) - k) bytes of word, or OmitFirstk(word) = the last (length(word) - k) bytes of word, or
empty string if length(word) < k empty string if length(word) < k
OmitLastk(word) = the first (length(word) - k) bytes of word, or OmitLastk(word) = the first (length(word) - k) bytes of word, or
empty string if length(word) < k empty string if length(word) < k
For the purposes of UppercaseAll, word is parsed into UTF-8 We define the FermentFirst and FermentAll transforms used in this
characters and converted to upper-case by taking 1 - 3 bytes at a specification by the following C language functions:
time, using the algorithm below:
i = 0 int Ferment(uint8_t* word, int word_len, int pos) {
while i < length(word): if (word[pos] < 192) {
if word[i] < 192: if (word[pos] >= 97 and word[pos] <= 122) {
if word[i] >= 97 and word[i] <= 122: word[pos] = word[pos] ^ 32;
word[i] = word[i] ^ 32 }
i = i + 1 return 1;
else if word[i] < 224: } else if (word[pos] < 224) {
if i + 1 < length(word): if (pos + 1 < word_len) {
word[i + 1] = word[i + 1] ^ 32 word[pos + 1] = word[pos + 1] ^ 32;
i = i + 2 }
else: return 2;
if i + 2 < length(word): } else {
word[i + 2] = word[i + 2] ^ 5 if (pos + 2 < word_len) {
i = i + 3 word[pos + 2] = word[pos + 2] ^ 5;
}
return 3;
}
}
For UppercaseFirst, the same algorithm is used, but the loop is void FermentFirst(uint8_t* word, int word_len) {
executed only once. if (word_len > 0) {
Ferment(word, word_len, 0);
}
}
void FermentAll(uint8_t* word, int word_len) {
int i = 0;
while (i < word_len) {
i += Ferment(word, word_len, i);
}
}
Appendix B. contains the list of transformations by specifying the Appendix B. contains the list of transformations by specifying the
prefix, elementary transform and suffix components of each of them. prefix, elementary transform and suffix components of each of them.
Note that the OmitFirst8 elementary transform is not used in the list Note that the OmitFirst8 elementary transform is not used in the list
of transformations. The strings in Appendix B. are in C string format of transformations. The strings in Appendix B. are in C string format
with respect to escape (backslash) characters. with respect to escape (backslash) characters.
The maximum number of additional bytes that a transform may add to a The maximum number of additional bytes that a transform may add to a
base word is 13. Since the largest base word is 24 bytes long, a base word is 13. Since the largest base word is 24 bytes long, a
buffer of 38 bytes is sufficient to store any transformed words buffer of 38 bytes is sufficient to store any transformed words
skipping to change at page 29, line 9 skipping to change at page 29, line 22
9. Compressed data format 9. Compressed data format
In this section we describe the format of the compressed data set in In this section we describe the format of the compressed data set in
terms of the format of the individual data items described in the terms of the format of the individual data items described in the
previous sections. previous sections.
9.1. Format of the stream header 9.1. Format of the stream header
The stream header has only the following one field: The stream header has only the following one field:
1-7 bits: WBITS, a value in the range 10 - 24, encoded with 1..7 bits: WBITS, a value in the range 10..24, encoded with
the following variable length code (as it appears in the following variable length code (as it appears in
the compressed data, where the bits are parsed from the compressed data, where the bits are parsed from
right to left): right to left):
Value Bit Pattern Value Bit Pattern
----- ----------- ----- -----------
10 0100001 10 0100001
11 0110001 11 0110001
12 1000001 12 1000001
13 1010001 13 1010001
14 1100001 14 1100001
15 1110001 15 1110001
16 0 16 0
skipping to change at page 30, line 34 skipping to change at page 30, line 49
length length
MSKIPBYTES x 8 bits: MSKIPLEN - 1, where MSKIPLEN is MSKIPBYTES x 8 bits: MSKIPLEN - 1, where MSKIPLEN is
the number of metadata bytes; this field is the number of metadata bytes; this field is
only present if MSKIPBYTES is positive, only present if MSKIPBYTES is positive,
otherwise MSKIPLEN is 0 (if MSKIPBYTES is otherwise MSKIPLEN is 0 (if MSKIPBYTES is
greater than 1, and the last byte is all greater than 1, and the last byte is all
zeros, then the stream should be rejected zeros, then the stream should be rejected
as invalid) as invalid)
0 - 7 bits: fill bits until the next byte boundary, 0..7 bits: fill bits until the next byte boundary,
must be all zeros must be all zeros
MSKIPLEN bytes of metadata, not part of the MSKIPLEN bytes of metadata, not part of the
uncompressed data or the sliding window uncompressed data or the sliding window
MNIBBLES x 4 bits: MLEN - 1, where MLEN is the length MNIBBLES x 4 bits: MLEN - 1, where MLEN is the length
of the meta-block uncompressed data in bytes (if of the meta-block uncompressed data in bytes (if
MNIBBLES is greater than 4, and the last MNIBBLES is greater than 4, and the last
nibble is all zeros, then the stream should be nibble is all zeros, then the stream should be
rejected as invalid) rejected as invalid)
1 bit: ISUNCOMPRESSED, if set to 1, any bits of compressed 1 bit: ISUNCOMPRESSED, if set to 1, any bits of compressed
data up to the next byte boundary are ignored, and data up to the next byte boundary are ignored, and
the rest of the meta-block contains MLEN bytes of the rest of the meta-block contains MLEN bytes of
literal data; this field is only present if the literal data; this field is only present if the
ISLAST bit is not set (if the ignored bits are not ISLAST bit is not set (if the ignored bits are not
all zeros, the stream should be rejected as invalid) all zeros, the stream should be rejected as invalid)
1-11 bits: NBLTYPESL, # of literal block types, encoded with 1..11 bits: NBLTYPESL, # of literal block types, encoded with
the following variable length code (as it appears in the following variable length code (as it appears in
the compressed data, where the bits are parsed from the compressed data, where the bits are parsed from
right to left, so 0110111 has the value 12): right to left, so 0110111 has the value 12):
Value Bit Pattern Value Bit Pattern
----- ----------- ----- -----------
1 0 1 0
2 0001 2 0001
3-4 x0011 3..4 x0011
5-8 xx0101 5..8 xx0101
9-16 xxx0111 9..16 xxx0111
17-32 xxxx1001 17..32 xxxx1001
33-64 xxxxx1011 33..64 xxxxx1011
65-128 xxxxxx1101 65..128 xxxxxx1101
129-256 xxxxxxx1111 129..256 xxxxxxx1111
Prefix code over the block type code alphabet for Prefix code over the block type code alphabet for
literal block types, appears only if NBLTYPESL >= 2 literal block types, appears only if NBLTYPESL >= 2
Prefix code over the block count code alphabet for Prefix code over the block count code alphabet for
literal block counts, appears only if NBLTYPESL >= 2 literal block counts, appears only if NBLTYPESL >= 2
Block count code + extra bits for first literal Block count code + extra bits for first literal
block count, appears only if NBLTYPESL >= 2 block count, appears only if NBLTYPESL >= 2
1-11 bits: NBLTYPESI, # of insert-and-copy block types, encoded 1..11 bits: NBLTYPESI, # of insert-and-copy block types, encoded
with the same variable length code as above with the same variable length code as above
Prefix code over the block type code alphabet for Prefix code over the block type code alphabet for
insert-and-copy block types, appears only if NBLTYPESI >= 2 insert-and-copy block types, appears only if NBLTYPESI >= 2
Prefix code over the block count code alphabet for Prefix code over the block count code alphabet for
insert-and-copy block counts, appears only if NBLTYPESI >= 2 insert-and-copy block counts, appears only if NBLTYPESI >= 2
Block count code + extra bits for first insert-and-copy Block count code + extra bits for first insert-and-copy
block count, appears only if NBLTYPESI >= 2 block count, appears only if NBLTYPESI >= 2
1-11 bits: NBLTYPESD, # of distance block types, encoded 1..11 bits: NBLTYPESD, # of distance block types, encoded
with the same variable length code as above with the same variable length code as above
Prefix code over the block type code alphabet for Prefix code over the block type code alphabet for
distance block types, appears only if NBLTYPESD >= 2 distance block types, appears only if NBLTYPESD >= 2
Prefix code over the block count code alphabet for Prefix code over the block count code alphabet for
distance block counts, appears only if NBLTYPESD >= 2 distance block counts, appears only if NBLTYPESD >= 2
Block count code + extra bits for first distance Block count code + extra bits for first distance
block count, appears only if NBLTYPESD >= 2 block count, appears only if NBLTYPESD >= 2
2 bits: NPOSTFIX, parameter used in the distance coding 2 bits: NPOSTFIX, parameter used in the distance coding
4 bits: four most-significant bits of NDIRECT, to get the 4 bits: four most-significant bits of NDIRECT, to get the
actual value of the parameter NDIRECT, left-shift actual value of the parameter NDIRECT, left-shift
this four-bit number by NPOSTFIX bits this four-bit number by NPOSTFIX bits
NBLTYPESL x 2 bits: context mode for each literal block type NBLTYPESL x 2 bits: context mode for each literal block type
1-11 bits: NTREESL, # of literal prefix trees, encoded 1..11 bits: NTREESL, # of literal prefix trees, encoded
with the same variable length code as NBLTYPESL with the same variable length code as NBLTYPESL
Literal context map, encoded as described in Section 7.3., Literal context map, encoded as described in Section 7.3.,
appears only if NTREESL >= 2, otherwise the context map appears only if NTREESL >= 2, otherwise the context map
has only zero values has only zero values
1-11 bits: NTREESD, # of distance prefix trees, encoded 1..11 bits: NTREESD, # of distance prefix trees, encoded
with the same variable length code as NBLTYPESD with the same variable length code as NBLTYPESD
Distance context map, encoded as described in Section 7.3., Distance context map, encoded as described in Section 7.3.,
appears only if NTREESD >= 2, otherwise the context map appears only if NTREESD >= 2, otherwise the context map
has only zero values has only zero values
NTREESL prefix codes for literals NTREESL prefix codes for literals
NBLTYPESI prefix codes for insert-and-copy lengths NBLTYPESI prefix codes for insert-and-copy lengths
NTREESD prefix codes for distances NTREESD prefix codes for distances
skipping to change at page 36, line 15 skipping to change at page 36, line 30
Note that a duplicated string reference may refer to a string in a Note that a duplicated string reference may refer to a string in a
previous meta-block, i.e. the backward distance may cross one or more previous meta-block, i.e. the backward distance may cross one or more
meta-block boundaries. However a backward copy distance will not meta-block boundaries. However a backward copy distance will not
refer past the beginning of the uncompressed stream or the window refer past the beginning of the uncompressed stream or the window
size; any such distance is interpreted as a reference to a static size; any such distance is interpreted as a reference to a static
dictionary word. Also note that the referenced string may overlap the dictionary word. Also note that the referenced string may overlap the
current position, for example, if the last 2 bytes decoded have current position, for example, if the last 2 bytes decoded have
values X and Y, a string reference with <length = 5, distance = 2> values X and Y, a string reference with <length = 5, distance = 2>
adds X,Y,X,Y,X to the uncompressed stream. adds X,Y,X,Y,X to the uncompressed stream.
11. Security Considerations 11. Considerations for compressor implementations
Since the intent of this document is to define the brotli compressed
data format without reference to any particular compression
algorithm, the material in this section is not part of the definition
of the format, and a compressor need not follow it in order to be
compliant.
11.1. Trivial compressor
In this section we present a very simple algorithm that produces a
valid brotli stream representing an arbitrary sequence of
uncompressed bytes in the form of the following C++ language
function.
string BrotliCompressTrivial(const string& u) {
if (u.empty()) {
return string(1, 6);
}
int i;
string c;
c.append(1, 12);
for (i = 0; i + 65535 < u.size(); i += 65536) {
c.append(1, 248);
c.append(1, 255);
c.append(1, 15);
c.append(&u[i], 65536);
}
if (i < u.size()) {
int r = u.size() - i - 1;
c.append(1, (r & 31) << 3);
c.append(1, r >> 5);
c.append(1, 8 + (r >> 13));
c.append(&u[i], r + 1);
}
c.append(1, 3);
return c;
}
Note that this simple algorithm does not actually compress data, that
is, the brotli representation will always be bigger than the
original, but it shows that every sequence of N uncompressed bytes
can be represented with a valid brotli stream that is not longer than
N + (3 * (N >> 16) + 5) bytes.
11.2. Aligning compressed meta-blocks to byte boundaries
As described in Section 9., only those meta-blocks that immediately
follow an uncompressed meta-block or a metadata meta-block are
guaranteed to start on a byte boundary. In some applications, it
might be required that every non-metadata meta-block starts on a byte
boundary. This can be achieved by appending an empty metadata meta-
block after every non-metadata meta-block that does not end on a byte
boundary.
11.3. Creating self-contained parts within the compressed data
In some encoder implementations it might be required to make a
sequence of bytes within a brotli stream self-contained, that is,
such that they can be decompressed independently from previous parts
of the compressed data. This is a useful feature for three reasons.
First, if a large compressed file is damaged, it is possible to
recover some of the file after the damage. Second, it is useful when
doing differential transfer of compressed data. If a sequence of
uncompressed bytes is unchanged and compressed independently from
previous data, then the compressed representation may also be
unchanged and can therefore be transferred very cheaply. Third, if
sequences of uncompressed bytes are compressed independently, it
allows for parallel compression of these byte sequences within the
same file, in addition to parallel compression of multiple files.
Given two sequences of uncompressed bytes, U0 and U1, we will now
describe how to create two sequences of compressed bytes, C0 and C1,
such that the concatenation of C0 and C1 is a valid brotli stream,
and that C0 and C1 (together with the first byte of C0 that contains
the window size) can be decompressed independently from each other to
U0 and U1.
When compressing the byte sequence U0 to produce C0, we can use any
compressor that works on the complete set of uncompressed bytes U0,
with the following two changes. First, the ISLAST bit of the last
meta-block of C0 must not be set. Second, C0 must end at a byte-
boundary, which can be ensured by appending an empty metadata meta-
block to it, as in Section 11.2.
When compressing the byte sequence U1 to produce C1, we can use any
compressor that starts a new meta-block at the beginning of U1 within
the U0+U1 input stream, with the following two changes. First,
backward distances in C1 must not refer to static dictionary words or
uncompressed bytes in U0. Even if a sequence of bytes in U1 would
match a static dictionary word, or a sequence of bytes that overlaps
U0, the compressor must represent this sequence of bytes with a
combination of literal insertions and backward references to bytes in
U1 instead. Second, the ring buffer of last four distances must be
replenished first with distances in C1 before using it to encode
other distances in C1. Note that both compressors producing C0 and C1
have to use the same window size, but the stream header is emitted
only by the compressor that produces C0.
Note that this method can be easily generalized to more than two
sequences of uncompressed bytes.
12. Security Considerations
As with any compressed file formats, decompressor implementations As with any compressed file formats, decompressor implementations
should handle all compressed data byte sequences, not only those that should handle all compressed data byte sequences, not only those that
conform to this specification, where non-conformant compressed data conform to this specification, where non-conformant compressed data
sequences should be discarded. A possible attack against a system sequences should be rejected as invalid.
containing a decompressor implementation (e.g. a web browser) is to
exploit a buffer overflow caused by an invalid compressed data.
Therefore decompressor implementations should perform bounds-checking
for each memory access that result from values decoded from the
compressed stream.
12. IANA Considerations A possible attack against a system containing a decompressor
implementation (e.g. a web browser) is to exploit a buffer overflow
triggered by invalid compressed data. Therefore decompressor
implementations should perform bounds-checking for each memory access
that result from values decoded from the compressed stream and
derivatives therof.
Another possible attack against a system containing a decompressor
implementation is to provide it (either valid or invalid) compressed
data that can make the decompressor system's resource consumption
(cpu, memory, or storage) to be disproportionately large compared to
the size of the compressed data. In addition to the size of the
compressed data, the amount of cpu, memory and storage required to
decompress a single compressed meta-block within a brotli stream is
controlled by the following two paramters: the size of the
uncompressed meta-block, which is encoded at the start of the
compressed meta-block, and the size of the sliding window, which is
encoded at the start of the brotli stream. Decompressor
implementations in systems where memory or storage is constrained
should perform a sanity-check on these two parameters. The
uncompressed meta-block size that was decoded from the compressed
stream should be compared against either a hard limit, given by the
system's constraints or some expectation about the uncompressed data,
or against a certain multiple of the size of the compressed data. If
the uncompressed meta-block size is determined to be too high, the
compressed data should be rejected. Likewise, when the complete
uncompressed stream is kept in the system containing the decompressor
implementation, the total uncompressed size of the stream should be
checked before decompressing each additional meta-block. If the size
of the sliding window that was decoded from the start of the
compressed stream is greater than a certain soft limit, then the
decompressor implementation should, at first, allocate a smaller
sliding window that fits the first uncompressed meta-block, and
afterwards, before decompressing each additional meta-block, it
should increase the size of the sliding window until the sliding
window size specified in the compressed data is reached.
Correspondingly, possible attacks against a system containing a
compressor implementation (e.g. a web server) are to exploit a buffer
overflow or cause disproportionately large resource consumption by
providing e.g. uncompressible data. As described in Section 11.1.,
an output buffer of
S(N) = N + (3 * (N >> 16) + 5)
bytes is sufficient to hold a valid compressed brotli stream
representing an arbitrary sequence of N uncompressed bytes.
Therefore compressor implementations should allocate at least S(N)
bytes of output buffer before compressing N bytes of data with
unknown compressibility and should perform bounds-checking for each
write into this output buffer. If their output buffer is full,
compresor implementations should revert to the trivial compression
algorithm described in Section 11.1. The resourse consumption of a
compressor implementation for a particular input data depends mostly
on the algorithm used to find backward matches and on the algorithm
used to construct context maps and prefix codes and only to a lesser
extent on the input data itself. If the system containing a
compressor implementation is overloaded, a possible way to reduce
resource usage is to switch to more simple algorithms for backward
reference search and prefix code construction, or to fall back to the
trivial compression algorithm described in Section 11.1.
A possible attack against a system that sends compressed data over an
encrypted channel is the following. An attacker who can repeatedly
mix arbitrary (attacker-supplied) data with secret data (passwords,
cookies) and observe the length of the ciphertext can potentially
reconstruct the secret data. To protect against this kind of attack,
applications should not mix sensitive data with non-sensitive,
potentially attacker-supplied data in the same compressed stream.
13. IANA Considerations
The "HTTP Content Coding Registry" has been updated with the The "HTTP Content Coding Registry" has been updated with the
registration below: registration below:
+-------+-------------------------------------+------------+ +-------+-------------------------------------+------------+
| Name | Description | Reference | | Name | Description | Reference |
+-------+-------------------------------------+------------+ +-------+-------------------------------------+------------+
| br | Brotli Compressed Data Format | RFCXXXX | | br | Brotli Compressed Data Format | RFCXXXX |
+-------+-------------------------------------+------------+ +-------+-------------------------------------+------------+
13. Informative References 14. Informative References
[HUFFMAN] Huffman, D. A., "A Method for the Construction of Minimum [HUFFMAN] Huffman, D. A., "A Method for the Construction of Minimum
Redundancy Codes", Proceedings of the Institute of Radio Redundancy Codes", Proceedings of the Institute of Radio
Engineers, September 1952, Volume 40, Number 9, pp. Engineers, September 1952, Volume 40, Number 9, pp.
1098-1101. 1098-1101.
[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential [LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential
Data Compression", IEEE Transactions on Information Data Compression", IEEE Transactions on Information
Theory, Vol. 23, No. 3, pp. 337-343. Theory, Vol. 23, No. 3, pp. 337-343.
[RFC1951] Deutsch, P., "DEFLATE Compressed Data Format Specification [RFC1951] Deutsch, P., "DEFLATE Compressed Data Format Specification
version 1.3", RFC 1951, Aladdin Enterprises, May 1996. version 1.3", RFC 1951, Aladdin Enterprises, May 1996.
http://www.ietf.org/rfc/rfc1951.txt http://www.ietf.org/rfc/rfc1951.txt
[WOFF2] Levantovsky, V. (ed.), Levien, R. (ed.), "WOFF File Format [WOFF2] Levantovsky, V. (ed.), Levien, R. (ed.), "WOFF File Format
2.0", W3C WebFonts Working Group, 2.0", W3C WebFonts Working Group,
http://www.w3.org/TR/WOFF2/ http://www.w3.org/TR/WOFF2/
14. Source code 15. Source code
Source code for a C language implementation of a brotli compliant Source code for a C language implementation of a brotli compliant
decompressor and a C++ language implementation of a compressor is decompressor and a C++ language implementation of a compressor is
available in the brotli open-source project: available in the brotli open-source project:
https://github.com/google/brotli https://github.com/google/brotli
15. Acknowledgments 16. Acknowledgments
The authors would like to thank Mark Adler, Robert Obryk, Thomas The authors would like to thank Mark Adler, Robert Obryk, Thomas
Pickert, and Joe Tsai for providing helpful review comments, Pickert, and Joe Tsai for providing helpful review comments,
validating the specification by writing an independent decompressor, validating the specification by writing an independent decompressor,
and suggesting improvements to the format and the text of the and suggesting improvements to the format and the text of the
specification. specification.
Appendix A. Static dictionary data Appendix A. Static dictionary data
The hexadecimal form of the DICT array is the following, where the The hexadecimal form of the DICT array is the following, where the
length is 122,784 bytes and the zlib CRC-32 of the byte sequence is length is 122,784 bytes and the CRC-32 of the byte sequence is
0x5136cb04. 0x5136cb04.
74696d65646f776e6c6966656c6566746261636b636f64656461746173686f77 74696d65646f776e6c6966656c6566746261636b636f64656461746173686f77
6f6e6c7973697465636974796f70656e6a7573746c696b6566726565776f726b 6f6e6c7973697465636974796f70656e6a7573746c696b6566726565776f726b
74657874796561726f766572626f64796c6f7665666f726d626f6f6b706c6179 74657874796561726f766572626f64796c6f7665666f726d626f6f6b706c6179
6c6976656c696e6568656c70686f6d65736964656d6f7265776f72646c6f6e67 6c6976656c696e6568656c70686f6d65736964656d6f7265776f72646c6f6e67
7468656d7669657766696e64706167656461797366756c6c686561647465726d 7468656d7669657766696e64706167656461797366756c6c686561647465726d
656163686172656166726f6d747275656d61726b61626c6575706f6e68696768 656163686172656166726f6d747275656d61726b61626c6575706f6e68696768
646174656c616e646e6577736576656e6e65787463617365626f7468706f7374 646174656c616e646e6577736576656e6e65787463617365626f7468706f7374
757365646d61646568616e6468657265776861746e616d654c696e6b626c6f67 757365646d61646568616e6468657265776861746e616d654c696e6b626c6f67
skipping to change at page 117, line 44 skipping to change at page 121, line 33
Appendix B. List of word transformations Appendix B. List of word transformations
The string literals are in C format, with respect to the use of The string literals are in C format, with respect to the use of
backslash escape characters. backslash escape characters.
In order to generate a length and check value, the transforms can be In order to generate a length and check value, the transforms can be
converted to a series of bytes, where each transform is the prefix converted to a series of bytes, where each transform is the prefix
sequence of bytes plus a terminating zero byte, a single byte value sequence of bytes plus a terminating zero byte, a single byte value
identifying the transform, and the suffix sequence of bytes plus a identifying the transform, and the suffix sequence of bytes plus a
terminating zero. The value for the transforms are 0 for Identity, 1 for terminating zero. The value for the transforms are 0 for Identity, 1 for
UppercaseFirst, 2 for UppercaseAll, 3 to 11 for OmitFirst1 to FermentFirst, 2 for FermentAll, 3 to 11 for OmitFirst1 to OmitFirst9,
OmitFirst9, and 12 to 20 for OmitLast1 to OmitLast9. The byte sequences and 12 to 20 for OmitLast1 to OmitLast9. The byte sequences that
that represent the 121 transforms are then concatenated to a single represent the 121 transforms are then concatenated to a single sequence
sequence of bytes. The length of that sequence is 648 bytes, and the of bytes. The length of that sequence is 648 bytes, and the CRC-32 is
zlib CRC is 0x3d965f81. 0x3d965f81.
ID Prefix Transform Suffix ID Prefix Transform Suffix
-- ------ --------- ------ -- ------ --------- ------
0 "" Identity "" 0 "" Identity ""
1 "" Identity " " 1 "" Identity " "
2 " " Identity " " 2 " " Identity " "
3 "" OmitFirst1 "" 3 "" OmitFirst1 ""
4 "" UppercaseFirst " " 4 "" FermentFirst " "
5 "" Identity " the " 5 "" Identity " the "
6 " " Identity "" 6 " " Identity ""
7 "s " Identity " " 7 "s " Identity " "
8 "" Identity " of " 8 "" Identity " of "
9 "" UppercaseFirst "" 9 "" FermentFirst ""
10 "" Identity " and " 10 "" Identity " and "
11 "" OmitFirst2 "" 11 "" OmitFirst2 ""
12 "" OmitLast1 "" 12 "" OmitLast1 ""
13 ", " Identity " " 13 ", " Identity " "
14 "" Identity ", " 14 "" Identity ", "
15 " " UppercaseFirst " " 15 " " FermentFirst " "
16 "" Identity " in " 16 "" Identity " in "
17 "" Identity " to " 17 "" Identity " to "
18 "e " Identity " " 18 "e " Identity " "
19 "" Identity "\"" 19 "" Identity "\""
20 "" Identity "." 20 "" Identity "."
21 "" Identity "\">" 21 "" Identity "\">"
22 "" Identity "\n" 22 "" Identity "\n"
23 "" OmitLast3 "" 23 "" OmitLast3 ""
24 "" Identity "]" 24 "" Identity "]"
25 "" Identity " for " 25 "" Identity " for "
26 "" OmitFirst3 "" 26 "" OmitFirst3 ""
27 "" OmitLast2 "" 27 "" OmitLast2 ""
28 "" Identity " a " 28 "" Identity " a "
29 "" Identity " that " 29 "" Identity " that "
30 " " UppercaseFirst "" 30 " " FermentFirst ""
31 "" Identity ". " 31 "" Identity ". "
32 "." Identity "" 32 "." Identity ""
33 " " Identity ", " 33 " " Identity ", "
34 "" OmitFirst4 "" 34 "" OmitFirst4 ""
35 "" Identity " with " 35 "" Identity " with "
36 "" Identity "'" 36 "" Identity "'"
37 "" Identity " from " 37 "" Identity " from "
38 "" Identity " by " 38 "" Identity " by "
39 "" OmitFirst5 "" 39 "" OmitFirst5 ""
40 "" OmitFirst6 "" 40 "" OmitFirst6 ""
41 " the " Identity "" 41 " the " Identity ""
42 "" OmitLast4 "" 42 "" OmitLast4 ""
43 "" Identity ". The " 43 "" Identity ". The "
44 "" UppercaseAll "" 44 "" FermentAll ""
45 "" Identity " on " 45 "" Identity " on "
46 "" Identity " as " 46 "" Identity " as "
47 "" Identity " is " 47 "" Identity " is "
48 "" OmitLast7 "" 48 "" OmitLast7 ""
49 "" OmitLast1 "ing " 49 "" OmitLast1 "ing "
50 "" Identity "\n\t" 50 "" Identity "\n\t"
51 "" Identity ":" 51 "" Identity ":"
52 " " Identity ". " 52 " " Identity ". "
53 "" Identity "ed " 53 "" Identity "ed "
54 "" OmitFirst9 "" 54 "" OmitFirst9 ""
55 "" OmitFirst7 "" 55 "" OmitFirst7 ""
56 "" OmitLast6 "" 56 "" OmitLast6 ""
57 "" Identity "(" 57 "" Identity "("
58 "" UppercaseFirst ", " 58 "" FermentFirst ", "
59 "" OmitLast8 "" 59 "" OmitLast8 ""
60 "" Identity " at " 60 "" Identity " at "
61 "" Identity "ly " 61 "" Identity "ly "
62 " the " Identity " of " 62 " the " Identity " of "
63 "" OmitLast5 "" 63 "" OmitLast5 ""
64 "" OmitLast9 "" 64 "" OmitLast9 ""
65 " " UppercaseFirst ", " 65 " " FermentFirst ", "
66 "" UppercaseFirst "\"" 66 "" FermentFirst "\""
67 "." Identity "(" 67 "." Identity "("
68 "" UppercaseAll " " 68 "" FermentAll " "
69 "" UppercaseFirst "\">" 69 "" FermentFirst "\">"
70 "" Identity "=\"" 70 "" Identity "=\""
71 " " Identity "." 71 " " Identity "."
72 ".com/" Identity "" 72 ".com/" Identity ""
73 " the " Identity " of the " 73 " the " Identity " of the "
74 "" UppercaseFirst "'" 74 "" FermentFirst "'"
75 "" Identity ". This " 75 "" Identity ". This "
76 "" Identity "," 76 "" Identity ","
77 "." Identity " " 77 "." Identity " "
78 "" UppercaseFirst "(" 78 "" FermentFirst "("
79 "" UppercaseFirst "." 79 "" FermentFirst "."
80 "" Identity " not " 80 "" Identity " not "
81 " " Identity "=\"" 81 " " Identity "=\""
82 "" Identity "er " 82 "" Identity "er "
83 " " UppercaseAll " " 83 " " FermentAll " "
84 "" Identity "al " 84 "" Identity "al "
85 " " UppercaseAll "" 85 " " FermentAll ""
86 "" Identity "='" 86 "" Identity "='"
87 "" UppercaseAll "\"" 87 "" FermentAll "\""
88 "" UppercaseFirst ". " 88 "" FermentFirst ". "
89 " " Identity "(" 89 " " Identity "("
90 "" Identity "ful " 90 "" Identity "ful "
91 " " UppercaseFirst ". " 91 " " FermentFirst ". "
92 "" Identity "ive " 92 "" Identity "ive "
93 "" Identity "less " 93 "" Identity "less "
94 "" UppercaseAll "'" 94 "" FermentAll "'"
95 "" Identity "est " 95 "" Identity "est "
96 " " UppercaseFirst "." 96 " " FermentFirst "."
97 "" UppercaseAll "\">" 97 "" FermentAll "\">"
98 " " Identity "='" 98 " " Identity "='"
99 "" UppercaseFirst "," 99 "" FermentFirst ","
100 "" Identity "ize " 100 "" Identity "ize "
101 "" UppercaseAll "." 101 "" FermentAll "."
102 "\xc2\xa0" Identity "" 102 "\xc2\xa0" Identity ""
103 " " Identity "," 103 " " Identity ","
104 "" UppercaseFirst "=\"" 104 "" FermentFirst "=\""
105 "" UppercaseAll "=\"" 105 "" FermentAll "=\""
106 "" Identity "ous " 106 "" Identity "ous "
107 "" UppercaseAll ", " 107 "" FermentAll ", "
108 "" UppercaseFirst "='" 108 "" FermentFirst "='"
109 " " UppercaseFirst "," 109 " " FermentFirst ","
110 " " UppercaseAll "=\"" 110 " " FermentAll "=\""
111 " " UppercaseAll ", " 111 " " FermentAll ", "
112 "" UppercaseAll "," 112 "" FermentAll ","
113 "" UppercaseAll "(" 113 "" FermentAll "("
114 "" UppercaseAll ". " 114 "" FermentAll ". "
115 " " UppercaseAll "." 115 " " FermentAll "."
116 "" UppercaseAll "='" 116 "" FermentAll "='"
117 " " UppercaseAll ". " 117 " " FermentAll ". "
118 " " UppercaseFirst "=\"" 118 " " FermentFirst "=\""
119 " " UppercaseAll "='" 119 " " FermentAll "='"
120 " " UppercaseFirst "='" 120 " " FermentFirst "='"
Appendix C. Computing CRC-32 check values
For the purpose of this specification, we define the CRC-32 check value
of a byte sequence with the following C language function:
uint32_t CRC32(const uint8_t* v, const int len) {
const uint32_t poly = 0xedb88320UL;
uint32_t crc, c;
int i, k;
crc = 0xffffffffUL;
for (i = 0; i < len; ++i) {
c = (crc ^ v[i]) & 0xff;
for (k = 0; k < 8; k++) c = c & 1 ? poly ^ (c >> 1) : c >> 1;
crc = c ^ (crc >> 8);
}
return crc ^ 0xffffffffUL;
}
Authors' Addresses Authors' Addresses
Jyrki Alakuijala Jyrki Alakuijala
Google, Inc Google, Inc
Email: jyrki@google.com Email: jyrki@google.com
Zoltan Szabadka Zoltan Szabadka
Google, Inc Google, Inc
 End of changes. 72 change blocks. 
198 lines changed or deleted 422 lines changed or added

This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/