idnits 2.17.1 

draft-vandevenne-shared-brotli-format-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RFC7932]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (Aug 2020) is 1340 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '31' on line 489


     Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                      J. Alakuijala
3	Internet-Draft                                                  T. Duong
4	Intended Status: Informational                            E. Kliuchnikov
5	Expires: Feb 04, 2021                                           R. Obryk
6	                                                             Z. Szabadka
7	                                                           L. Vandevenne
8	                                                             Google, Inc
9	                                                                Aug 2020

11	                  Shared Brotli Compressed Data Format
12	                draft-vandevenne-shared-brotli-format-06

14	Abstract

16	   This specification defines a data format for shared brotli
17	   compression, which adds support for shared dictionaries, large
18	   window, patching and a container format to brotli [RFC7932].

20	   Shared dictionaries and large window support allow significant
21	   compression gains compared to regular brotli, and patching allows
22	   smaller patches of binary files.

24	Status of this Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79. Internet-Drafts are working
28	   documents of the Internet Engineering Task Force (IETF). Note that
29	   other groups may also distribute working documents as Internet-
30	   Drafts. The list of current Internet-Drafts is at
31	   http://datatracker.ietf.org/drafts/current.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time. It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on Feb 04, 2021.

40	Copyright Notice

42	   Copyright (c) 2018 IETF Trust and the persons identified as the
43	   document authors. All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (http://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document. Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document. Code Components extracted from this document must
51	   include Simplified BSD License text as described in Section 4.e of
52	   the Trust Legal Provisions and are provided without warranty as
53	   described in the Simplified BSD License.

55	Table of Contents

57	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . 3
58	      1.1.  Purpose  . . . . . . . . . . . . . . . . . . . . . . . . . 3
59	      1.2.  Intended audience  . . . . . . . . . . . . . . . . . . . . 3
60	      1.3.  Scope  . . . . . . . . . . . . . . . . . . . . . . . . . . 3
61	      1.4.  Compliance . . . . . . . . . . . . . . . . . . . . . . . . 4
62	      1.5.  Definitions of terms and conventions used  . . . . . . . . 4
63	         1.5.1.  Packing into bytes  . . . . . . . . . . . . . . . . . 4
64	   2.  Shared Brotli Overview  . . . . . . . . . . . . . . . . . . . . 5
65	   3.  Shared Dictionaries . . . . . . . . . . . . . . . . . . . . . . 6
66	      3.1.  Custom Static Dictionaries . . . . . . . . . . . . . . . . 6
67	         3.1.1.  Transform Operations  . . . . . . . . . . . . . . . . 7
68	      3.2.  LZ77 Dictionaries  . . . . . . . . . . . . . . . . . . . . 9
69	   4.  Varint Encoding  . . . . . . . . . . . . . . . . . . . . . . . 10
70	   5.  Shared Dictionary Stream . . . . . . . . . . . . . . . . . . . 10
71	   6.  Large Window Brotli Compressed Data Stream . . . . . . . . . . 13
72	   7.  Patching Format Compressed Data Stream . . . . . . . . . . . . 13
73	   8.  Shared Brotli Compressed Data Stream . . . . . . . . . . . . . 13
74	   9.  Shared Brotli Framing Format Stream  . . . . . . . . . . . . . 14
75	      9.1.  Main Format . . . . . . . . . . . . . . . . . . . . . . . 14
76	      9.2.  Chunk Format  . . . . . . . . . . . . . . . . . . . . . . 14
77	      9.3.  Metadata Format . . . . . . . . . . . . . . . . . . . . . 17
78	      9.4.  Chunk Specifications  . . . . . . . . . . . . . . . . . . 18
79	         9.4.1.  Padding Chunk (Type 0) . . . . . . . . . . . . . . . 18
80	         9.4.2.  Metadata Chunk (Type 1)  . . . . . . . . . . . . . . 18
81	         9.4.3.  Data Chunk (Type 2)  . . . . . . . . . . . . . . . . 19
82	         9.4.4.  First Partial Data Chunk (Type 3)  . . . . . . . . . 19
83	         9.4.5.  Middle Partial Data Chunk (Type 4) . . . . . . . . . 19
84	         9.4.6.  Last Partial Data Chunk (Type 5) . . . . . . . . . . 20
85	         9.4.7.  Footer Metadata Chunk (Type 6) . . . . . . . . . . . 20
86	         9.4.8.  Global Metadata Chunk (Type 7) . . . . . . . . . . . 20
87	         9.4.9.  Repeat Metadata Chunk (Type 8) . . . . . . . . . . . 21
88	         9.4.10.  Central Directory Chunk (Type 9)  . . . . . . . . . 22
89	         9.4.11.  Final Footer Chunk (Type 10)  . . . . . . . . . . . 22
90	   10.  Security Considerations . . . . . . . . . . . . . . . . . . . 23
91	   11.  IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25
92	   12.  Informative References  . . . . . . . . . . . . . . . . . . . 25
93	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 25

95	1. Introduction

97	1.1. Purpose

99	   The purpose of this specification is to extend the brotli compressed
100	   data format format ([RFC7932]) with new abilities that allow further
101	   compression gains:

103	      * Shared dictionaries allow a static shared context between
104	        encoder and decoder for significant compression gains.

106	      * Large window brotli allows much larger back reference distances
107	        to give compression gains for files over 16MiB.

109	      * Patching allows to create smaller patches of binary files

111	      * The framing format is a container format that allows to store
112	        multiple resources, refer to dictionaries, enable patching and
113	        other filters that improve compression.

115	   This document is the authoritative specification of shared brotli
116	   data formats and the backwards compatible changes to brotli, and
117	   defines:

119	      * The data format of serialized shared dictionaries

121	      * The data format of the framing format

123	      * The encoding of window bits and distances for large window
124	        brotli in the brotli data format

126	      * The encoding of shared dictionary references in the brotli data
127	        format

129	      * The data format for patching with brotli

131	1.2. Intended audience

133	   This specification is intended for use by software implementers to
134	   compress data into and/or decompress data from the shared brotli
135	   dictionary format.

137	   The text of the specification assumes a basic background in
138	   programming at the level of bits and other primitive data
139	   representations. Familiarity with the technique of LZ77 coding is
140	   helpful but not required.

142	1.3. Scope
143	   This specification defines a data format for shared brotli
144	   compression, which adds support for dictionaries and extended
145	   features to brotli [RFC7932].

147	1.4. Compliance

149	   Unless otherwise indicated below, a compliant decompressor must be
150	   able to accept and decompress any data set that conforms to all the
151	   specifications presented here. A compliant compressor must produce
152	   data sets that conform to all the specifications presented here.

154	1.5. Definitions of terms and conventions used

156	   Byte: 8 bits stored or transmitted as a unit (same as an octet).  For
157	   this specification, a byte is exactly 8 bits, even on machines that
158	   store a character on a number of bits different from eight.  See
159	   below for the numbering of bits within a byte.

161	   String: a sequence of arbitrary bytes.

163	   Bytes stored within a computer do not have a "bit order", since they
164	   are always treated as a unit. However, a byte considered as an
165	   integer between 0 and 255 does have a most- and least-significant
166	   bit, and since we write numbers with the most-significant digit on
167	   the left, we also write bytes with the most-significant bit on the
168	   left. In the diagrams below, we number the bits of a byte so that bit
169	   0 is the least-significant bit, i.e., the bits are numbered:

171	      +--------+
172	      |76543210|
173	      +--------+

175	   Within a computer, a number may occupy multiple bytes. All multi-byte
176	   numbers in the format described here are unsigned and stored with the
177	   least-significant byte first (at the lower memory address).  For
178	   example, the decimal 16-bit number 520 is stored as:

180	      0        1
181	      +--------+--------+
182	      |00001000|00000010|
183	      +--------+--------+
184	      ^        ^
185	      |        |
186	      |        + more significant byte = 2 x 256
187	      + less significant byte = 8

189	1.5.1. Packing into bytes
190	   This document does not address the issue of the order in which bits
191	   of a byte are transmitted on a bit-sequential medium, since the final
192	   data format described here is byte- rather than bit-oriented.
193	   However, we describe the compressed block format below as a sequence
194	   of data elements of various bit lengths, not a sequence of bytes. We
195	   must therefore specify how to pack these data elements into bytes to
196	   form the final compressed byte sequence:

198	      * Data elements are packed into bytes in order of
199	        increasing bit number within the byte, i.e., starting
200	        with the least-significant bit of the byte.
201	      * Data elements other than prefix codes are packed
202	        starting with the least-significant bit of the data
203	        element. These are referred to here as integer values
204	        and are considered unsigned.
205	      * Prefix codes are packed starting with the most-significant
206	        bit of the code.

208	   In other words, if one were to print out the compressed data as a
209	   sequence of bytes, starting with the first byte at the *right* margin
210	   and proceeding to the *left*, with the most-significant bit of each
211	   byte on the left as usual, one would be able to parse the result from
212	   right to left, with fixed-width elements in the correct MSB-to-LSB
213	   order and prefix codes in bit-reversed order (i.e., with the first
214	   bit of the code in the relative LSB position).

216	   As an example, consider packing the following data elements into a
217	   sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2,
218	   prefix code 110, prefix code 10, 12-bit integer value 3628.

220	        byte 2   byte 1   byte 0
221	      +--------+--------+--------+
222	      |11100010|11000101|10010110|
223	      +--------+--------+--------+
224	       ^            ^ ^   ^   ^
225	       |            | |   |   |
226	       |            | |   |   +------ integer value 6
227	       |            | |   +---------- integer value 2
228	       |            | +-------------- prefix code 110
229	       |            +---------------- prefix code 10
230	       +----------------------------- integer value 3628

232	2. Shared Brotli Overview

234	   Shared brotli extends brotli [RFC7932] with support for shared
235	   dictionaries, larger LZ77 window and a framing format.

237	3. Shared Dictionaries

239	   A shared dictionary is a piece of data shared by a compressor and
240	   decompressor. The compressor can take advantage of the dictionary
241	   context to encode the input in a more compact manner. The compressor
242	   and the decompressor must use exactly the same dictionary. A shared
243	   dictionary is specially useful to compress short input sequences.

245	   A shared brotli dictionary can use two methods of sharing context:

247	      * An LZ77 dictionary. The encoder and decoder could refer
248	        to a given sequence of bytes. Multiple LZ77 dictionaries
249	        can be set.

251	      * A custom static dictionary: a word list with transforms. The
252	        encoder and decoder will replace the static dictionary data
253	        with the data in the shared dictionary. The original static
254	        dictionary is described in Section 8 in [RFC7932]. The original
255	        data from Appendix A and Appendix B of [RFC7932] will be
256	        replaced. In addition, it is possible to dynamically switch
257	        this dictionary based on the data compression context, and/or
258	        to include a reference to the original dictionary in the custom
259	        dictionary.

261	   If no shared dictionary is set the decoder behaves the same as in
262	   [RFC7932] on a brotli stream.

264	   If a shared dictionary is set, then it can set any of: LZ77
265	   dictionaries, overriding static dictionary words, and/or overriding
266	   transforms.

268	3.1. Custom Static Dictionaries

270	   If a custom word list is set, then the following behavior of the RFC
271	   7932 decoder [RFC7932] is overridden:

273	      Instead of the Static Dictionary Data from Appendix A
274	      of [RFC7932], one or more word lists from the custom static
275	      dictionary data are used.

277	      Instead of NDBITS at the end of Appendix A, a custom
278	      SIZE_BITS_BY_LENGTH per custom word list is used.

280	      The copy length for a static dictionary reference must be
281	      between 4 and 31 and may not be a value for which
282	      SIZE_BITS_BY_LENGTH of this dictionary is 0.

284	   If a custom transforms list is set without context dependency, then
285	   the following behavior of the RFC 7932 decoder [RFC7932] is
286	   overridden:

288	      The "List of Word Transformations" from Appendix B is
289	      overridden by one or more lists of custom prefixes, suffixes and
290	      transform operations.

292	      The transform_id must be smaller than the number of transforms
293	      given in the custom transforms list.

295	   If the dictionary is context dependent, it includes a lookup table of
296	   64 word list and transform list combinations. When resolving a static
297	   dictionary word, the decoder computes the literal context id, as in
298	   section 7.1. of [RFC7932]. The literal context id is used as index in
299	   the lookup tables to select the word list and transforms to use. If
300	   the dictionary is not context dependent, this id is implicitely 0
301	   instead.

303	   If a distance goes beyond the dictionary for the current id and
304	   multiple word list / transform list combinations are defined, then a
305	   next dictionary is used in the following order: if not context
306	   dependent, the same order as defined in the shared dictionary. If
307	   context dependent, the index matching the current context is used
308	   first, the same order as defined in the shared dictionary excluding
309	   the current context are used next.

311	3.1.1. Transform Operations

313	   A shared dictionary may include custom word transformations, to
314	   replace those specified in Section 8 and Appendix B of [RFC7932].  A
315	   transform consists of a possible prefix, a transform operation, for
316	   some operations a parameter, and a possible suffix. In the shared
317	   dictionary format, the transform operation is represented by a
318	   numerical ID, listed in the table below.

320	     ID     Operation
321	     --     ---------
322	      0     Identity
323	      1     OmitLast1
324	      2     OmitLast2
325	      3     OmitLast3
326	      4     OmitLast4
327	      5     OmitLast5
328	      6     OmitLast6
329	      7     OmitLast7
330	      8     OmitLast8
331	      9     OmitLast9

333	     10     FermentFirst
334	     11     FermentAll
335	     12     OmitFirst1
336	     13     OmitFirst2
337	     14     OmitFirst3
338	     15     OmitFirst4
339	     16     OmitFirst5
340	     17     OmitFirst6
341	     18     OmitFirst7
342	     19     OmitFirst8
343	     20     OmitFirst9
344	     21     ShiftFirst (by PARAMETER)
345	     22     ShiftAll (by PARAMETER)

347	   Operations 0 to 20 are specified in Section 8 in [RFC7932].
348	   ShiftFirst and ShiftAll transform specifically encoded SCALARs.

350	   A SCALAR is a 7-, 11-, 16- or 21-bit unsigned integer encoded with 1,
351	   2, 3 or 4 bytes respectively with following bit contents:

353	      7-bit SCALAR:
354	      +--------+
355	      |0sssssss|
356	      +--------+

358	      11-bit SCALAR:
359	      +--------+--------+
360	      |110sssss|XXssssss|
361	      +--------+--------+

363	      16-bit SCALAR:
364	      +--------+--------+--------+
365	      |1110ssss|XXssssss|XXssssss|
366	      +--------+--------+--------+

368	      21-bit SCALAR:
369	      +--------+--------+--------+--------+
370	      |11110sss|XXssssss|XXssssss|XXssssss|
371	      +--------+--------+--------+--------+

373	   Given the input bytes matching SCALAR encoding pattern, the SCALAR
374	   value is obtained by concatenation of the "s" bits, with the most
375	   significant bits coming from the earliest byte. The "X" bits could
376	   have arbitrary value.

378	   An ADDEND is defined as the result of limited sign extension of
379	   16-bit unsigned PARAMETER:

381	      At first the PARAMETER is zero-extended to 32 bits. After this,
382	      if the resulting value is greater or equal than 0x8000,
383	      then 0xFF0000 is added.

385	   ShiftAll starts at the beginning of the word and repetitively applies
386	   the following transform until the whole word is transformed:

388	      If the next untransformed byte matches the first byte of the 7-,
389	      11-, 16- or 21-bit SCALAR pattern, then:

391	         If the untransformed part of the word is not long enough to
392	         match the whole SCALAR pattern, then the whole word is
393	         marked as transformed.

395	         Otherwise, let SHIFTED be the sum of the ADDEND and the
396	         encoded SCALAR. The lowest bits from SHIFTED
397	         are written back into the corresponding "s" bits. The "0",
398	         "1" and "X" bits remain unchanged. Next, 1, 2, 3 or
399	         4 not transformed bytes marked as transformed, according to
400	         the SCALAR pattern length.

402	      Otherwise, the next untransformed byte is marked as transformed.

404	   ShiftFirst applies the same transform as ShiftAll, but does not
405	   iterate.

407	3.2. LZ77 Dictionaries

409	   If an LZ77 dictionary is set, then the decoder treats this as a
410	   regular LZ77 copy, but behaves as if the bytes of this dictionary are
411	   accessible as the uncompressed bytes outside of the regular LZ77
412	   window for backwards references.

414	   Let LZ77_DICTIONARY_LENGTH be the length of the LZ77 dictionary.
415	   Then word_id, described in Section 8 in [RFC7932], is redefined as:

417	   word_id = distance - (max allowed distance + 1 +
418	   LZ77_DICTIONARY_LENGTH)

420	   For the case when LZ77_DICTIONARY_LENGTH is 0, word_id matches the
421	   [RFC7932] definition.

423	   Let dictionary_address be

425	   LZ77_DICTIONARY_LENGTH + max allowed distance - distance

427	   Then distance values of <length, distance> pairs [RFC7932] in range
428	   (max allowed distance + 1)..(LZ77_DICTIONARY_LENGTH + max allowed
429	   distance) are interpreted as references starting in the LZ77
430	   dictionary at the byte at dictionary_address. If length is longer
431	   than (LZ77_DICTIONARY_LENGTH - dictionary_address), then the
432	   reference continues to copy (length - LZ77_DICTIONARY_LENGTH +
433	   dictionary_address) bytes from the regular LZ77 window starting at
434	   the beginning.

436	4. Varint Encoding

438	   A varint is encoded in base 128 in one ore more bytes as follows:

440	      +--------+--------+             +--------+
441	      |1xxxxxxx|1xxxxxxx| {0-8 times} |0xxxxxxx|
442	      +--------+--------+             +--------+

444	   where the "x" bits of the first byte are the least significant bits
445	   of the value and the "x" bits of the last byte are the most
446	   significant bits of the value. The last byte must have its MSB set to
447	   0, all other bytes to 1 to indicate there is a next byte.

449	   The maximum allowed amount of bits to read is 63 bits, if the 9th
450	   byte is present and has its MSB set then the stream must be
451	   considered as invalid.

453	5. Shared Dictionary Stream

455	   The shared dictionary stream encodes a custom dictionary for brotli
456	   including custom words and/or custom transformations. A shared
457	   dictionary may appear standalone or as contents of a resource in a
458	   framing format container.

460	   A compliant shared brotli dictionary stream must have the following
461	   format:

463	      2 bytes: file signature, in hexadecimal the bytes 91, 0.

465	      varint: LZ77_DICTIONARY_LENGTH, number of bytes for a LZ77
466	              dictionary, or 0 if there is none.
467	              The maximum allowed value is the maximum possible sliding
468	              window size of brotli or of large window brotli.

470	      LZ77_DICTIONARY_LENGTH bytes: contents of the LZ77 dictionary.

472	      1 byte: NUM_CUSTOM_WORD_LISTS, may have value 0 to 64

474	      NUM_CUSTOM_WORD_LISTS times a word list, with the following
475	              format for each word list:

477	         28 bytes: SIZE_BITS_BY_LENGTH, array of 28 unsigned 8-bit
478	                   integers, indexed by word lengths 4 to 31. The value
479	                   represents log2(number of words of this length),
480	                   with the exception of 0 meaning 0 words of this
481	                   length. The max allowed length value is 15 bits.
482	                   OFFSETS_BY_LENGTH is computed from this as
483	                   OFFSETS_BY_LENGTH[i + 1] = OFFSETS_BY_LENGTH[i] +
484	                   (SIZE_BITS_BY_LENGTH[i] ? (i <<
485	                   SIZE_BITS_BY_LENGTH[i]) : 0)

487	         N bytes: words dictionary data, where N is
488	                  OFFSETS_BY_LENGTH[31] + (SIZE_BITS_BY_LENGTH[31] ?
489	                  (31 << SIZE_BITS_BY_LENGTH[31]) : 0), first all the
490	                  words of shortest length, then all words of the next
491	                  length, and so on, where for each length there are
492	                  either 0 or a positive power of two amount of words.

494	      1 byte: NUM_CUSTOM_TRANSFORM_LISTS, may have value 0 to 64

496	      NUM_CUSTOM_TRANSFORM_LISTS times a transform list, with the
497	              following format for each transform list:

499	            2 bytes: PREFIX_SUFFIX_LENGTH, the length of prefix/suffix
500	                     data. Must be at least 1 because the list must
501	                     always end with a zero-length stringlet even
502	                     if empty.

504	            NUM_PREFIX_SUFFIX times: prefix/suffix stringlet.
505	               NUM_PREFIX_SUFFIX is the amount of stringlets parsed and
506	               must be in range 1..256.

508	               1 byte: STRING_LENGTH, the length of the entry contents.
509	                       0 for the last (terminating) entry of the
510	                       transform list. For other entries STRING_LENGTH
511	                       must be in range 1..255. The 0 entry must be
512	                       present and must be the last byte of the
513	                       PREFIX_SUFFIX_LENGTH bytes of prefix/suffix
514	                       data, else the stream must be rejected as
515	                       invalid.

517	               STRING_LENGTH bytes: contents of the prefix/suffix.

519	         1 byte: NTRANSFORMS, amount of transformation triplets.

521	         NTRANSFORMS times: data for each transform:

523	            1 byte: index of prefix in prefix/suffix data;
524	                    must be less than NUM_PREFIX_SUFFIX.

526	            1 byte: index of suffix in prefix/suffix data;
527	                    must be less than NUM_PREFIX_SUFFIX.

529	            1 byte: operation index, must be an index in the table of
530	                    operations listed in the chapter
531	                    "Transform Operations".

533	         If and only if at least one transform has operation index
534	         ShiftFirst or ShiftAll:

536	            NTRANSFORMS times:

538	               2 bytes: parameters for the transform. If the transform
539	                        does not have type ShiftFirst or ShiftAll, the
540	                        value must be 0. ShiftFirst and ShiftAll
541	                        interpret these bytes as an unsigned 16-bit
542	                        integer.

544	      if NUM_CUSTOM_WORD_LISTS > 0 or NUM_CUSTOM_TRANSFORM_LISTS > 0
545	         (else implicitly NUM_DICTIONARIES is 1 and points to the
546	         brotli built-in and there is no context map)

548	         1 byte: NUM_DICTIONARIES, may have value 1 to 64. Each
549	                 dictionary is a combination of a word list and a
550	                 transform list. Each next dictionary is used when the
551	                 distance goes beyond the previous. If a CONTEXT_MAP is
552	                 enabled, then the dictionary matching the context is
553	                 moved to the front in the order for this context.

555	         NUM_DICTIONARIES times: the DICTIONARY_MAP:

557	            1 byte: index into a custom word list, or value
558	                    NUM_CUSTOM_WORD_LISTS to indicate to use the brotli
559	                    [RFC7932] built-in default word list

561	            1 byte: index into a custom transform list, or value
562	                    NUM_CUSTOM_TRANSFORM_LISTS to indicate to use the
563	                    brotli [RFC7932] built-in default transform list

565	         1 byte: CONTEXT_ENABLED, if 0 there is no context map, if 1 a
566	                 context map used to select the dictionary is encoded
567	                 below

569	         If CONTEXT_ENABLED is 1, a context map for the 64 brotli
570	                 [RFC7932] literals contexts:

572	            64 bytes: CONTEXT_MAP, index into the DICTIONARY_MAP for
573	                      the first dictionary to use for this context

575	6. Large Window Brotli Compressed Data Stream

577	   Large window brotli allows a sliding window beyond the 24-bit maximum
578	   of regular brotli [RFC7932].

580	   The compressed data stream is backwards compatible to brotli
581	   [RFC7932], and may optionally have the following differences:

583	      Encoding of WBITS in the stream header: the following new
584	         pattern of 14 bits is supported:

586	         8 bits: value 00010001, to indicate a large window
587	                 brotli stream

589	         6 bits: WBITS, must have value in range 10 to 62

591	      Distance alphabet: if the stream is a large window brotli
592	         stream, the maximum number of extra bits is 62 and the
593	         theoretical maximum size of the distance alphabet is
594	         (16 + NDIRECT + (124 << NPOSTFIX)). This overrides the
595	         value for the distance alphabet size given in chapter
596	         3.3. of [RFC7932] and affects the amount of bits in the
597	         encoding of the Simple Prefix Code for distances as
598	         described in chapter 3.4 of [RFC7932].
599	         An additional limitation to distances, despite the
600	         large allowed alphabet size, is that the alphabet is
601	         not allowed to contain a distance symbol able to represent
602	         a distance larger than ((1 << 63) - 4) when its extra
603	         bits have their maximum value. It depends on NPOSTFIX
604	         and NDIRECT when this can occur.

606	   A decoder that does not support 64-bit integers may reject a stream
607	   if WBITS is higher than 30 or a distance symbol from the distance
608	   alphabet is able to encode a distance larger than 2147483644.

610	7. Patching Format Compressed Data Stream

612	   TBD

614	8. Shared Brotli Compressed Data Stream

616	   The format of a shared brotli compressed data stream without framing
617	   format is backwards compatible with brotli [RFC7932], with the
618	   following optional differences:

620	      *) LZ77 dictionaries as described above are supported
621	      *) Custom static dictionaries replacing or extending the static
622	         dictionary of brotli [RFC7932] with different words or
623	         transforms are supported

625	      *) The stream may have the format of regular brotli [RFC7932],
626	         or the format of large window brotli as described in section
627	         6, or the format of the patching stream described in
628	         section 7

630	9. Shared Brotli Framing Format Stream

632	   A compliant shared brotli framing format stream has the format
633	   described below.

635	9.1. Main Format

637	     4 bytes: file signature, in hexadecimal the bytes 91, 0a, 42, 52.
638	              The first byte contains the invalid WBITS combination for
639	              brotli [RFC7932] and large window brotli.

641	     1 byte: container flags, 8 bits with meanings:

643	        bit 0 and 1: version indicator, must be 00

645	        bit 2: if 0, the file contains no final footer, may not contain
646	           any metadata chunks, may not contain a central directory,
647	           and may encode only a single resource (using one or more
648	           data chunks). If 1, the file may contain one or more
649	           resources, metadata, central directory, and must contain a
650	           final footer.

652	     multiple times: a chunk, each with the format specified in section
653	                     9.2

655	9.2. Chunk Format

657	      varint: length of this chunk excluding this varint but
658	              including all next header bytes and data. If the value
659	              is 0, then the chunk type byte is not present and the
660	              chunk type is assumed to be 0.

662	      1  byte: CHUNK_TYPE
663	               0: padding chunk
664	               1: metadata chunk
665	               2: data chunk
666	               3: first partial data chunk
667	               4: middle partial data chunk
668	               5: last partial data chunk
669	               6: footer metadata chunk
670	               7: global metadata chunk
671	               8: repeat metadata chunk
672	               9: central directory chunk
673	               10: final footer

675	      if CHUNK_TYPE is not padding chunk, central directory or final
676	      footer:

678	         1 byte: CODEC:

680	            0: uncompressed

682	            1: keep decoder

684	            2: brotli

686	            3: shared brotli

688	            4: JPEG XL lossless JPEG1 recompression [JPEGXL]

690	      if CODEC is not "uncompressed":

692	         varint: uncompressed size in bytes of the data contained
693	                 within the compressed stream

695	      if CODEC is "shared brotli"

697	         1 byte: amount of dictionary references. Multiple dictionary
698	                 references are possible with the following
699	                 restrictions: there can be maximum 1 serialized
700	                 dictionary, maximum 1 patching file, and maximum 15
701	                 prefix dictionaries (a serialized dictionary may
702	                 already contain one of those, and a patching file
703	                 also takes up a prefix dictionary). Circular
704	                 references are not allowed (any dictionary reference
705	                 that directly or indirectly uses this chunk itself
706	                 as dictionary).

708	         per dictionary reference:

710	            1 byte: flags:

712	               bit 0 and 1: dictionary source:

714	                  00: Internal dictionary reference to a full resource
715	                      by pointer, which can span one or more chunks.
716	                      Must point to a full data chunk or a first
717	                      partial data chunk.

719	                  01: Internal dictionary reference to single chunk
720	                      contents by pointer. May point to any chunk with
721	                      contenst (data or metadata). If partial data
722	                      chunk, only this part is the dictionary. In this
723	                      case, the dictionary type is not allowed to be a
724	                      serialised dictionary.

726	                  10: Reference to a dictionary by hash code of a
727	                      resource. The dictionary can come from an
728	                      external source such as a different container.
729	                      The user of the decoder must be able to provide
730	                      the dictionary contents given its hash code (even
731	                      if it comes from this container itself), or treat
732	                      it as an error when the user does not have it
733	                      available.

735	                  11: invalid bit combination

737	               bit 2 and 3: dictionary type:

739	                  00: prefix dictionary, set in front of the sliding
740	                      window

742	                  01: serialized dictionary in the shared brotli
743	                      format as specified in section 5.

745	                  10: file to apply patching algorithm to. The
746	                      compressed stream then has the format specified
747	                      in section 7.

749	                  11: invalid bit combination

751	               bit 4-7: must be 0

753	            if hash-based:

755	               1 byte: type of hash used. Only supported value: 3,
756	                       indicating 256-bit Highwayhash.

758	               32 bytes: 256-bit Highwayhash checksum to refer to
759	                         dictionary.

761	            if pointer based: varint encoded pointer to its
762	               chunk in this container. The chunk must come earlier
763	               in the container than the current chunk.

765	      X bytes: extra header bytes, depending on CHUNK_TYPE. If present,
766	               they are specified in the subsequent chapters.

768	         remaining bytes: the chunk contents. The uncompressed data
769	                          in the chunk content depends on CHUNK_TYPE
770	                          and is specified in the subsequent sections.
771	                          The compressed data has following
772	                          format depending on CODEC:

774	            *) uncompressed: the raw bytes

776	            *) if "keep decoder", the continuation of the compressed
777	               stream which was interrupted at the end of the previous
778	               chunk. The decoder from the previous chunk must be used
779	               and its state it had at the end of the previous chunk
780	               must be kept at the start of the decoding of this chunk.

782	            *) brotli: the bytes are in brotli format
783	                       [RFC7932]

785	            *) shared brotli: the bytes are in the
786	                              shared brotli format specified in section
787	                              8

789	9.3. Metadata Format

791	   All the metadata chunk types use the following format for the
792	   uncompressed content:

794	      Per field:

796	         2 bytes: code to identify this metadata field. This must be
797	                  two lowercase or two uppercase alpha ascii
798	                  characters. If the decoder encounters a lowercase
799	                  field that it does not recognise for the current
800	                  chunk type, non-ascii characters or non-alpha
801	                  characters, the decoder must reject the data stream
802	                  as invalid. Uppercase codes may be used for custom
803	                  user metadata and can be ignored by a compliant
804	                  decoder.

806	         varint: length of the content of this field in bytes,
807	                 excluding the code bytes and this varint

809	         N bytes: the contents of this field

811	   The last field is reached when the chunk content end is reached. If
812	   the length of the last field does not end at the same byte as the end
813	   of the uncompressed content of the chunk, the decoder must reject the
814	   data stream as invalid.

816	9.4. Chunk Specifications

818	9.4.1. Padding Chunk (Type 0)

820	   All bytes in this chunk must be zero, except for the initial varint
821	   that specifies the remaining chunk length.

823	   Since the varint itself takes up bytes as well, when the goal is to
824	   introduce an amount of padding bytes, the dependence of the length of
825	   the varint on the value it encodes must be taken into account.

827	   A single byte varint with value 0 is a padding chunk of length 1.
828	   For more padding, use higher varint values. Do not use multiple
829	   shorter padding chunks, since this is slower to decode.

831	9.4.2. Metadata Chunk (Type 1)

833	   This chunk contains metadata that applies to the resource whose
834	   beginning is encoded in the subsequent data chunk or first partial
835	   data chunk.

837	   The contents of this chunk follows the format described in chapter
838	   9.3.

840	   The following field types are recognised:

842	      id: name field. May appear 0 or 1 times. Has the following
843	          format:

845	         N bytes: name in UTF-8 encoding, length determined by the
846	                  field length. Treated generically but may be used as
847	                  filename. If used as filename, forward slashes '/'
848	                  should be used as directory separator, relative paths
849	                  should be used and filenames ending in a slash with
850	                  0-length content in the matching data chunk should be
851	                  treated as an empty directory.

853	      mt: modification type. May appear 0 or 1 times. Has the following
854	          format:

856	         8 bytes: microseconds since epoch, as a little endian signed
857	                  twos complement 64-bit integer

859	      custom user field: any two uppercase ASCII characters.

861	9.4.3. Data Chunk (Type 2)

863	   A data chunk contains the actual data of a resource.

865	   This chunk has the following extra header bytes:

867	      1 byte: flags:

869	        bit 0: if true, indicates this is not a resource that should be
870	               output implicitly as part of extracting resources from
871	               this container. Instead, it may be referred to only
872	               explicitly, e.g. as a dictionary reference by hash code
873	               or offset. This flag should be set for data used as
874	               dictionary to improve compression of actual resources.

876	        bit 1: if true, hash code is given

878	        bits 2-7: must be zero

880	      if hash code is given:

882	        1 byte: type of hash used. Only supported value: 3,
883	                indicating 256-bit Highwayhash.

885	        32 bytes: 256-bit Highwayhash checksum of the uncompressed
886	                  data

888	   The uncompressed content bytes of this chunk are the actual data of
889	   the resource.

891	9.4.4. First Partial Data Chunk (Type 3)

893	   This chunk contains partial data of a resource. This is the first
894	   chunk in a series containing the entire data of the resource.

896	   The format of this chunk is the same as the format of a Data Chunk
897	   (chapter 9.4.3) except for the differences noted below.

899	   The second bit of flags must be set to 0 and no hash code given.

901	   The uncompressed data size is only of this part of the resource, not
902	   of the full resource.

904	9.4.5. Middle Partial Data Chunk (Type 4)

906	   This chunk contains partial data of a resource, and is neither the
907	   first nor the last part of the full resource.

909	   The format of this chunk is the same as the format of a Data Chunk
910	   (chapter 9.4.3) except for the differences noted below.

912	   The first and second bits of flags must be set to 0.

914	   The uncompressed data size is only of this part of the resource, not
915	   of the full resource.

917	9.4.6. Last Partial Data Chunk (Type 5)

919	   This chunk contains the final piece of partial data of a resource.

921	   The format of this chunk is the same as the format of a Data Chunk
922	   (chapter 9.4.3) except for the differences noted below.

924	   The first bit of the flags must be set to 0.

926	   If a hash code is given, the hash code of the full resource
927	   (concatenated from all previous chunks and this chunk) is given in
928	   this chunk.

930	   The uncompressed data size is only of this part of the resource, not
931	   of the full resource.

933	   The type of this chunk indicates that there are no further chunk
934	   encoding this resource, so the full resource is now known.

936	9.4.7. Footer Metadata Chunk (Type 6)

938	   This metadata applies to the resource whose encoding ended in the
939	   preceding data chunk or last partial data chunk.

941	   The contents of this chunk follows the format described in chapter
942	   9.3.

944	   There are no lowercase field types defined for footer metadata.
945	   Uppercase field types can be used as custom user data.

947	9.4.8. Global Metadata Chunk (Type 7)

949	   This metadata applies to the whole container instead of a single
950	   resource.

952	   The contents of this chunk follows the format described in chapter
953	   9.3.

955	   There are no lowercase field types defined for footer metadata.
956	   Uppercase field types can be used as custom user data.

958	9.4.9. Repeat Metadata Chunk (Type 8)

960	   These chunks optionally repeat metadata that is interleaved between
961	   data chunks. To use these chunks, it is necessary to also read
962	   additional information, such as pointers to the original chunks, from
963	   the central directory.

965	   The contents of this chunk follows the format described in chapter
966	   9.3.

968	   This chunk has an extra header byte:

970	      1 byte: chunk type of repeated chunk (metadata chunk
971	              or footer metadata chunk)

973	   This set of chunks must follow the following restrictions:

975	      It is optional whether or not repeat metadata chunks are
976	         present.

978	      If they are present, then they must be present for all
979	         metadata chunks and footer metadata chunks.

981	      There may be only 1 repeat metadata chunk per repeated metadata
982	         chunk.

984	      They must appear in the same order as the chunks appear in the
985	         container, which is also the same order as listed in the
986	         central directory.

988	      Compression of these chunks is allowed, however it is not allowed
989	         to use any internal dictionary except an earlier repeat
990	         metadata chunk of this series, and it is not allowed for a
991	         metadata chunk to keep the decoder state if the previous chunk
992	         is not a repeat metadata chunk. That is, the series of
993	         metadata chunks must be decompressible without using other
994	         chunks of the framing format file.

996	   The fields contained in this metadata chunk must follow the following
997	   restrictions:

999	      If a field is present, it must
1000	         exactly match the corresponding field of the copied chunk.

1002	      It is allowed to leave out a field that is present
1003	         in the copied chunk.

1005	      If a field is present, then it must be present in *all* other
1006	         repeat metadata chunks when the copied chunk contains this
1007	         field. In other words, if you know you can get the name field
1008	         from a repeat chunk, you know that you will be able to get all
1009	         names of all resources from all repeat chunks.

1011	9.4.10. Central Directory Chunk (Type 9)

1013	   The central directory chunk, along with the repeat metadata chunks,
1014	   allow to quickly find and list compressed resources in the container
1015	   file.

1017	   The central directory chunk is always uncompressed and does not have
1018	   the codec byte. It instead has the following format:

1020	     varint: pointer into the file where the repeat metadata chunks are
1021	             located, or 0 if they are not present

1023	     per chunk listed:

1025	        varint: pointer into the file where this chunk begins

1027	        varint: amount of header bytes N used below

1029	        N bytes: copy of all the header bytes of the pointed at chunk,
1030	                 including total size, chunk type byte, codec,
1031	                 uncompressed size, dictionary references, X extra
1032	                 header bytes. The content is not repeated here.

1034	   The last listed chunk is reached when the end of the contents of the
1035	   central directory are reached. If the end does not match the last
1036	   byte of the central directory, the decoder must reject the data
1037	   stream as invalid.

1039	   If present, the central directory must list all data and metadata
1040	   chunks of all types.

1042	9.4.11. Final Footer Chunk (Type 10)

1044	   Chunk that closes the file, only present if in the initial container
1045	   header flags bit 2 was set.

1047	   This chunk has the following content, always uncompressed:

1049	      reversed varint: size of this entire framing format file,
1050	                       including these bytes themselves, or 0 if this
1051	                       size is not given

1053	      reversed varint: pointer to the start of the central directory,
1054	                       or 0 if there is none

1056	   A reversed varint has the same format as a varint, but has its bytes
1057	   in reversed order and is designed to be parsed from end of file
1058	   towards the beginning.

1060	   9.4.12. Chunk ordering

1062	   The chunk ordering must follow the rules described below, if the
1063	   decoder sees otherwise, it must reject the data stream as invalid.

1065	      Padding chunks may be inserted anywhere, even between chunks for
1066	      which the rules below say no other chunk types may come in
1067	      between.

1069	      Metadata chunks must come immediately before the Data chunks of
1070	      the resource they apply to.

1072	      Footer metadata chunks must come immediately after the Data
1073	      chunks of the resource they apply to.

1075	      There may be only 0 or 1 metadata chunks per resource.

1077	      There may be only 0 or 1 footer metadata chunks per resource.

1079	      A resource must exist out of either 1 data chunk, or 1 first
1080	      partial data chunk, 0 or more middle partial data
1081	      chunks, and 1 last partial data chunk, in that order.

1083	      Repeat metadata chunks must follow the rules of section 9.4.9.

1085	      There may be only 0 or 1 central directory chunks.

1087	      If bit 2 of the container flags is set, there may be only a
1088	      single resource, no metadata chunks of any type, no central
1089	      directory, and no final footer.

1091	      If bit 2 of the container flags is not set, there must be exactly
1092	      1 final footer chunk and it must be the last chunk in the file.

1094	10. Security Considerations

1096	   The security considerations for brotli [RFC7932] apply to shared
1097	   brotli as well.

1099	   In addition, the same considerations apply to the decoding of new
1100	   file format streams for shared brotli, including shared dictionaries,
1101	   the framing format and the shared brotli format.

1103	   The dictionary must be treated with the same security precautions as
1104	   the content, because a change to the dictionary can result in a
1105	   change to the decompressed content.

1107	   The CRIME attack shows that it's a bad idea to compress data from
1108	   mixed (e.g. public and private) sources -- the data sources include
1109	   not only the compressed data but also the dictionaries. For example,
1110	   if you compress secret cookies using a public-data-only dictionary,
1111	   you still leak information about the cookies.

1113	   Not only can the dictionary reveal information about the compressed
1114	   data, but vice versa, data compressed with the dictionary can reveal
1115	   the contents of the dictionary when an adversary can control parts of
1116	   data to compress and see the compressed size. On the other hand, if
1117	   the adversary can control the dictionary, the adversary can learn
1118	   information about the compressed data.

1120	   The most robust defense against CRIME is not to compress private data
1121	   (e.g., sensitive headers like cookies or any content with PII).  The
1122	   challenge has been to identify secrets within a vast amount of to be
1123	   compressed data. Cloudflare uses a regular expression [CLOUDFLARE].
1124	   Another idea is to extend existing web template systems (e.g., Soy
1125	   [SOY]) to allow developers to mark secrets that must not be
1126	   compressed.

1128	   A less robust idea, but easier to implement, is to randomize the
1129	   compression algorithm, i.e., adding randomly generated padding,
1130	   varying the compression ratio, etc. The tricky part is to find the
1131	   right balance between cost and security, i.e., on one hand we don't
1132	   want to add too much padding because it adds a cost to data, on the
1133	   other hand we don't want to add too little because the adversary can
1134	   detect a small amount of padding with traffic analysis.

1136	   Another defense in addition is to not use dictionaries for cross-
1137	   domain requests, and only use shared brotli for the response when the
1138	   origin is the same as where the content is hosted (using CORS).  This
1139	   prevents an adversary to use a private dictionary with user secrets
1140	   to compress content hosted on the adversary's origin.  It also helps
1141	   prevent CRIME attacks that try to benefit from a public dictionary by
1142	   preventing data compression with dictionaries for requests that do
1143	   not originate from the host itself.

1145	   The content of the dictionary itself should not be affected by
1146	   external users, allowing adversaries to control the dictionary allows
1147	   a form of chosen plaintext attack. Instead, only base the dictionary
1148	   on content you control or generic large scale content such as a
1149	   spoken language, and update the dictionary with large time intervals
1150	   (days, not seconds) to prevent fast probing.

1152	11. IANA Considerations

1154	   The "HTTP Content Coding Registry" has been updated with the
1155	   registration below:

1157	      +-------+-------------------------------------+------------+
1158	      | Name  | Description                         | Reference  |
1159	      +-------+-------------------------------------+------------+
1160	      | sbr   | Shared Brotli Compressed Data Format| RFCXXXX    |
1161	      +-------+-------------------------------------+------------+

1163	12. Informative References

1165	   [RFC7932]  Alakuijala, J., Szabadka, Z., "Brotli Compressed Data
1166	              Format", RFC 7932, Google, Inc., July 2016.
1167	              http://www.ietf.org/rfc/rfc7932.txt

1169	   [JPEGXL]  Rhatushnyak, A., Wassenberg, J., Sneyers, J., Alakuijala,
1170	             J., Vandevenne, L., Versari, L., Obryk, R., Szabadka, Z.,
1171	             Kliuchnikov, E., Comsa, I, Potempa, K., Bruse, M.,
1172	             Firsching, M., Khasanova, R., van Asseldonk, R., Boukortt,
1173	             S., Gomez, S., Fischbacher, T., "JPEG XL Image Coding
1174	             System", August 2019.  https://arxiv.org/abs/1908.03565

1176	   [CLOUDFLARE]  https://blog.cloudflare.com/a-solution-to-compression-
1177	                 oracles-on-the-web/

1179	   [SOY]  https://developers.google.com/closure/templates/

1181	Authors' Addresses

1183	   Jyrki Alakuijala
1184	   Google, Inc.

1186	   Email: jyrki@google.com

1188	   Thai Duong
1189	   Google, Inc.

1191	   Email: thaidn@google.com

1193	   Evgenii Kliuchnikov
1194	   Google, Inc.

1196	   Email: eustas@google.com

1198	   Robert Obryk
1199	   Google, Inc.

1201	   Email: robryk@google.com

1203	   Zoltan Szabadka
1204	   Google, Inc.

1206	   Email: szabadka@google.com

1208	   Lode Vandevenne (editor)
1209	   Google, Inc.

1211	   Email: lode@google.com