idnits 2.17.1 draft-vandevenne-shared-brotli-format-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([RFC7932]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (Jul 2019) is 1719 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '31' on line 489 Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Alakuijala 3 Internet-Draft T. Duong 4 Intended Status: Informational E. Kliuchnikov 5 Expires: Jan 23, 2020 R. Obryk 6 Z. Szabadka 7 L. Vandevenne 8 Google, Inc 9 Jul 2019 11 Shared Brotli Compressed Data Format 12 draft-vandevenne-shared-brotli-format-03 14 Abstract 16 This specification defines a data format for shared brotli 17 compression, which adds support for shared dictionaries, large 18 window, patching and a container format to brotli [RFC7932]. 20 Shared dictionaries and large window support allow significant 21 compression gains compared to regular brotli, and patching allows 22 smaller patches of binary files. 24 Status of this Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. Internet-Drafts are working 28 documents of the Internet Engineering Task Force (IETF). Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. The list of current Internet-Drafts is at 31 http://datatracker.ietf.org/drafts/current. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on Jan 23, 2020. 40 Copyright Notice 42 Copyright (c) 2018 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 1.2. Intended audience . . . . . . . . . . . . . . . . . . . . 3 60 1.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 3 61 1.4. Compliance . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.5. Definitions of terms and conventions used . . . . . . . . 4 63 1.5.1. Packing into bytes . . . . . . . . . . . . . . . . . 4 64 2. Shared Brotli Overview . . . . . . . . . . . . . . . . . . . . 5 65 3. Shared Dictionaries . . . . . . . . . . . . . . . . . . . . . . 6 66 3.1. Custom Static Dictionaries . . . . . . . . . . . . . . . . 6 67 3.1.1. Transform Operations . . . . . . . . . . . . . . . . 7 68 3.2. LZ77 Dictionaries . . . . . . . . . . . . . . . . . . . . 9 69 4. Varint Encoding . . . . . . . . . . . . . . . . . . . . . . . 10 70 5. Shared Dictionary Stream . . . . . . . . . . . . . . . . . . . 10 71 6. Large Window Brotli Compressed Data Stream . . . . . . . . . . 13 72 7. Patching Format Compressed Data Stream . . . . . . . . . . . . 13 73 8. Shared Brotli Compressed Data Stream . . . . . . . . . . . . . 13 74 9. Shared Brotli Framing Format Stream . . . . . . . . . . . . . 14 75 9.1. Main Format . . . . . . . . . . . . . . . . . . . . . . . 14 76 9.2. Chunk Format . . . . . . . . . . . . . . . . . . . . . . 14 77 9.3. Metadata Format . . . . . . . . . . . . . . . . . . . . . 17 78 9.4. Chunk Specifications . . . . . . . . . . . . . . . . . . 18 79 9.4.1. Padding Chunk (Type 0) . . . . . . . . . . . . . . . 18 80 9.4.2. Metadata Chunk (Type 1) . . . . . . . . . . . . . . 18 81 9.4.3. Data Chunk (Type 2) . . . . . . . . . . . . . . . . 18 82 9.4.4. First Partial Data Chunk (Type 3) . . . . . . . . . 19 83 9.4.5. Middle Partial Data Chunk (Type 4) . . . . . . . . . 19 84 9.4.6. Last Partial Data Chunk (Type 5) . . . . . . . . . . 20 85 9.4.7. Footer Metadata Chunk (Type 6) . . . . . . . . . . . 20 86 9.4.8. Global Metadata Chunk (Type 7) . . . . . . . . . . . 20 87 9.4.9. Repeat Metadata Chunk (Type 8) . . . . . . . . . . . 20 88 9.4.10. Central Directory Chunk (Type 9) . . . . . . . . . 22 89 9.4.11. Final Footer Chunk (Type 10) . . . . . . . . . . . 22 90 10. Security Considerations . . . . . . . . . . . . . . . . . . . 23 91 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 24 92 12. Informative References . . . . . . . . . . . . . . . . . . . 25 93 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 25 95 1. Introduction 97 1.1. Purpose 99 The purpose of this specification is to extend the brotli compressed 100 data format format ([RFC7932]) with new abilities that allow further 101 compression gains: 103 * Shared dictionaries allow a static shared context between 104 encoder and decoder for significant compression gains. 106 * Large window brotli allows much larger back reference distances 107 to give compression gains for files over 16MiB. 109 * Patching allows to create smaller patches of binary files 111 * The framing format is a container format that allows to store 112 multiple resources, refer to dictionaries, enable patching and 113 other filters that improve compression. 115 This document is the authoritative specification of shared brotli 116 data formats and the backwards compatible changes to brotli, and 117 defines: 119 * The data format of serialized shared dictionaries 121 * The data format of the framing format 123 * The encoding of window bits and distances for large window 124 brotli in the brotli data format 126 * The encoding of shared dictionary references in the brotli data 127 format 129 * The data format for patching with brotli 131 1.2. Intended audience 133 This specification is intended for use by software implementers to 134 compress data into and/or decompress data from the shared brotli 135 dictionary format. 137 The text of the specification assumes a basic background in 138 programming at the level of bits and other primitive data 139 representations. Familiarity with the technique of LZ77 coding is 140 helpful but not required. 142 1.3. Scope 143 This specification defines a data format for shared brotli 144 compression, which adds support for dictionaries and extended 145 features to brotli [RFC7932]. 147 1.4. Compliance 149 Unless otherwise indicated below, a compliant decompressor must be 150 able to accept and decompress any data set that conforms to all the 151 specifications presented here. A compliant compressor must produce 152 data sets that conform to all the specifications presented here. 154 1.5. Definitions of terms and conventions used 156 Byte: 8 bits stored or transmitted as a unit (same as an octet). For 157 this specification, a byte is exactly 8 bits, even on machines that 158 store a character on a number of bits different from eight. See 159 below for the numbering of bits within a byte. 161 String: a sequence of arbitrary bytes. 163 Bytes stored within a computer do not have a "bit order", since they 164 are always treated as a unit. However, a byte considered as an 165 integer between 0 and 255 does have a most- and least-significant 166 bit, and since we write numbers with the most-significant digit on 167 the left, we also write bytes with the most-significant bit on the 168 left. In the diagrams below, we number the bits of a byte so that bit 169 0 is the least-significant bit, i.e., the bits are numbered: 171 +--------+ 172 |76543210| 173 +--------+ 175 Within a computer, a number may occupy multiple bytes. All multi-byte 176 numbers in the format described here are unsigned and stored with the 177 least-significant byte first (at the lower memory address). For 178 example, the decimal 16-bit number 520 is stored as: 180 0 1 181 +--------+--------+ 182 |00001000|00000010| 183 +--------+--------+ 184 ^ ^ 185 | | 186 | + more significant byte = 2 x 256 187 + less significant byte = 8 189 1.5.1. Packing into bytes 190 This document does not address the issue of the order in which bits 191 of a byte are transmitted on a bit-sequential medium, since the final 192 data format described here is byte- rather than bit-oriented. 193 However, we describe the compressed block format below as a sequence 194 of data elements of various bit lengths, not a sequence of bytes. We 195 must therefore specify how to pack these data elements into bytes to 196 form the final compressed byte sequence: 198 * Data elements are packed into bytes in order of 199 increasing bit number within the byte, i.e., starting 200 with the least-significant bit of the byte. 201 * Data elements other than prefix codes are packed 202 starting with the least-significant bit of the data 203 element. These are referred to here as integer values 204 and are considered unsigned. 205 * Prefix codes are packed starting with the most-significant 206 bit of the code. 208 In other words, if one were to print out the compressed data as a 209 sequence of bytes, starting with the first byte at the *right* margin 210 and proceeding to the *left*, with the most-significant bit of each 211 byte on the left as usual, one would be able to parse the result from 212 right to left, with fixed-width elements in the correct MSB-to-LSB 213 order and prefix codes in bit-reversed order (i.e., with the first 214 bit of the code in the relative LSB position). 216 As an example, consider packing the following data elements into a 217 sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2, 218 prefix code 110, prefix code 10, 12-bit integer value 3628. 220 byte 2 byte 1 byte 0 221 +--------+--------+--------+ 222 |11100010|11000101|10010110| 223 +--------+--------+--------+ 224 ^ ^ ^ ^ ^ 225 | | | | | 226 | | | | +------ integer value 6 227 | | | +---------- integer value 2 228 | | +-------------- prefix code 110 229 | +---------------- prefix code 10 230 +----------------------------- integer value 3628 232 2. Shared Brotli Overview 234 Shared brotli extends brotli [RFC7932] with support for shared 235 dictionaries, larger LZ77 window and a framing format. 237 3. Shared Dictionaries 239 A shared dictionary is a piece of data shared by a compressor and 240 decompressor. The compressor can take advantage of the dictionary 241 context to encode the input in a more compact manner. The compressor 242 and the decompressor must use exactly the same dictionary. A shared 243 dictionary is specially useful to compress short input sequences. 245 A shared brotli dictionary can use two methods of sharing context: 247 * An LZ77 dictionary. The encoder and decoder could refer 248 to a given sequence of bytes. Multiple LZ77 dictionaries 249 can be set. 251 * A custom static dictionary: a word list with transforms. The 252 encoder and decoder will replace the static dictionary data 253 with the data in the shared dictionary. The original static 254 dictionary is described in Section 8 in [RFC7932]. The original 255 data from Appendix A and Appendix B of [RFC7932] will be 256 replaced. In addition, it is possible to dynamically switch 257 this dictionary based on the data compression context, and/or 258 to include a reference to the original dictionary in the custom 259 dictionary. 261 If no shared dictionary is set the decoder behaves the same as in 262 [RFC7932] on a brotli stream. 264 If a shared dictionary is set, then it can set any of: LZ77 265 dictionaries, overriding static dictionary words, and/or overriding 266 transforms. 268 3.1. Custom Static Dictionaries 270 If a custom word list is set, then the following behavior of the RFC 271 7932 decoder [RFC7932] is overridden: 273 Instead of the Static Dictionary Data from Appendix A 274 of [RFC7932], one or more word lists from the custom static 275 dictionary data are used. 277 Instead of NDBITS at the end of Appendix A, a custom 278 SIZE_BITS_BY_LENGTH per custom word list is used. 280 The copy length for a static dictionary reference must be 281 between 4 and 31 and may not be a value for which 282 SIZE_BITS_BY_LENGTH of this dictionary is 0. 284 If a custom transforms list is set without context dependency, then 285 the following behavior of the RFC 7932 decoder [RFC7932] is 286 overridden: 288 The "List of Word Transformations" from Appendix B is 289 overridden by one or more lists of custom prefixes, suffixes and 290 transform operations. 292 The transform_id must be smaller than the number of transforms 293 given in the custom transforms list. 295 If the dictionary is context dependent, it includes a lookup table of 296 64 word list and transform list combinations. When resolving a static 297 dictionary word, the decoder computes the literal context id, as in 298 section 7.1. of [RFC7932]. The literal context id is used as index in 299 the lookup tables to select the word list and transforms to use. If 300 the dictionary is not context dependent, this id is implicitely 0 301 instead. 303 If a distance goes beyond the dictionary for the current id and 304 multiple word list / transform list combinations are defined, then a 305 next dictionary is used in the following order: if not context 306 dependent, the same order as defined in the shared dictionary. If 307 context dependent, the index matching the current context is used 308 first, the same order as defined in the shared dictionary excluding 309 the current context are used next. 311 3.1.1. Transform Operations 313 A shared dictionary may include custom word transformations, to 314 replace those specified in Section 8 and Appendix B of [RFC7932]. A 315 transform consists of a possible prefix, a transform operation, for 316 some operations a parameter, and a possible suffix. In the shared 317 dictionary format, the transform operation is represented by a 318 numerical ID, listed in the table below. 320 ID Operation 321 -- --------- 322 0 Identity 323 1 OmitLast1 324 2 OmitLast2 325 3 OmitLast3 326 4 OmitLast4 327 5 OmitLast5 328 6 OmitLast6 329 7 OmitLast7 330 8 OmitLast8 331 9 OmitLast9 333 10 FermentFirst 334 11 FermentAll 335 12 OmitFirst1 336 13 OmitFirst2 337 14 OmitFirst3 338 15 OmitFirst4 339 16 OmitFirst5 340 17 OmitFirst6 341 18 OmitFirst7 342 19 OmitFirst8 343 20 OmitFirst9 344 21 ShiftFirst (by PARAMETER) 345 22 ShiftAll (by PARAMETER) 347 Operations 0 to 20 are specified in Section 8 in [RFC7932]. 348 ShiftFirst and ShiftAll transform specifically encoded SCALARs. 350 A SCALAR is a 7-, 11-, 16- or 21-bit unsigned integer encoded with 1, 351 2, 3 or 4 bytes respectively with following bit contents: 353 7-bit SCALAR: 354 +--------+ 355 |0sssssss| 356 +--------+ 358 11-bit SCALAR: 359 +--------+--------+ 360 |110sssss|XXssssss| 361 +--------+--------+ 363 16-bit SCALAR: 364 +--------+--------+--------+ 365 |1110ssss|XXssssss|XXssssss| 366 +--------+--------+--------+ 368 21-bit SCALAR: 369 +--------+--------+--------+--------+ 370 |11110sss|XXssssss|XXssssss|XXssssss| 371 +--------+--------+--------+--------+ 373 Given the input bytes matching SCALAR encoding pattern, the SCALAR 374 value is obtained by concatenation of the "s" bits, with the most 375 significant bits coming from the earliest byte. The "X" bits could 376 have arbitrary value. 378 An ADDEND is defined as the result of limited sign extension of 379 16-bit unsigned PARAMETER: 381 At first the PARAMETER is zero-extended to 32 bits. After this, 382 if the resulting value is greater or equal than 0x8000, 383 then 0xFF0000 is added. 385 ShiftAll starts at the beginning of the word and repetitively applies 386 the following transform until the whole word is transformed: 388 If the next untransformed byte matches the first byte of the 7-, 389 11-, 16- or 21-bit SCALAR pattern, then: 391 If the untransformed part of the word is not long enough to 392 match the whole SCALAR pattern, then the whole word is 393 marked as transformed. 395 Otherwise, let SHIFTED be the sum of the ADDEND and the 396 encoded SCALAR. The lowest bits from SHIFTED 397 are written back into the corresponding "s" bits. The "0", 398 "1" and "X" bits remain unchanged. Next, 1, 2, 3 or 399 4 not transformed bytes marked as transformed, according to 400 the SCALAR pattern length. 402 Otherwise, the next untransformed byte is marked as transformed. 404 ShiftFirst applies the same transform as ShiftAll, but does not 405 iterate. 407 3.2. LZ77 Dictionaries 409 If an LZ77 dictionary is set, then the decoder treats this as a 410 regular LZ77 copy, but behaves as if the bytes of this dictionary are 411 accessible as the uncompressed bytes outside of the regular LZ77 412 window for backwards references. 414 Let LZ77_DICTIONARY_LENGTH be the length of the LZ77 dictionary. 415 Then word_id, described in Section 8 in [RFC7932], is redefined as: 417 word_id = distance - (max allowed distance + 1 + 418 LZ77_DICTIONARY_LENGTH) 420 For the case when LZ77_DICTIONARY_LENGTH is 0, word_id matches the 421 [RFC7932] definition. 423 Let dictionary_address be 425 LZ77_DICTIONARY_LENGTH + max allowed distance - distance 427 Then distance values of pairs [RFC7932] in range 428 (max allowed distance + 1)..(LZ77_DICTIONARY_LENGTH + max allowed 429 distance) are interpreted as references starting in the LZ77 430 dictionary at the byte at dictionary_address. If length is longer 431 than (LZ77_DICTIONARY_LENGTH - dictionary_address), then the 432 reference continues to copy (length - LZ77_DICTIONARY_LENGTH + 433 dictionary_address) bytes from the regular LZ77 window starting at 434 the beginning. 436 4. Varint Encoding 438 A varint is encoded in base 128 in one ore more bytes as follows: 440 +--------+--------+ +--------+ 441 |1xxxxxxx|1xxxxxxx| {0-8 times} |0xxxxxxx| 442 +--------+--------+ +--------+ 444 where the "x" bits of the first byte are the least significant bits 445 of the value and the "x" bits of the last byte are the most 446 significant bits of the value. The last byte must have its MSB set to 447 0, all other bytes to 1 to indicate there is a next byte. 449 The maximum allowed amount of bits to read is 63 bits, if the 9th 450 byte is present and has its MSB set then the stream must be 451 considered as invalid. 453 5. Shared Dictionary Stream 455 The shared dictionary stream encodes a custom dictionary for brotli 456 including custom words and/or custom transformations. A shared 457 dictionary may appear standalone or as contents of a resource in a 458 framing format container. 460 A compliant shared brotli dictionary stream must have the following 461 format: 463 2 bytes: file signature, in hexadecimal the bytes 91, 0. 465 varint: LZ77_DICTIONARY_LENGTH, number of bytes for a LZ77 466 dictionary, or 0 if there is none. 467 The maximum allowed value is the maximum possible sliding 468 window size of brotli or of large window brotli. 470 LZ77_DICTIONARY_LENGTH bytes: contents of the LZ77 dictionary. 472 1 byte: NUM_CUSTOM_WORD_LISTS, may have value 0 to 64 474 NUM_CUSTOM_WORD_LISTS times a word list, with the following 475 format for each word list: 477 28 bytes: SIZE_BITS_BY_LENGTH, array of 28 unsigned 8-bit 478 integers, indexed by word lengths 4 to 31. The value 479 represents log2(number of words of this length), 480 with the exception of 0 meaning 0 words of this 481 length. The max allowed length value is 15 bits. 482 OFFSETS_BY_LENGTH is computed from this as 483 OFFSETS_BY_LENGTH[i + 1] = OFFSETS_BY_LENGTH[i] + 484 (SIZE_BITS_BY_LENGTH[i] ? (i << 485 SIZE_BITS_BY_LENGTH[i]) : 0) 487 N bytes: words dictionary data, where N is 488 OFFSETS_BY_LENGTH[31] + (SIZE_BITS_BY_LENGTH[31] ? 489 (31 << SIZE_BITS_BY_LENGTH[31]) : 0), first all the 490 words of shortest length, then all words of the next 491 length, and so on, where for each length there are 492 either 0 or a positive power of two amount of words. 494 1 byte: NUM_CUSTOM_TRANSFORM_LISTS, may have value 0 to 64 496 NUM_CUSTOM_TRANSFORM_LISTS times a transform list, with the 497 following format for each transform list: 499 2 bytes: PREFIX_SUFFIX_LENGTH, the length of prefix/suffix 500 data. Must be at least 1 because the list must 501 always end with a zero-length stringlet even 502 if empty. 504 NUM_PREFIX_SUFFIX times: prefix/suffix stringlet. 505 NUM_PREFIX_SUFFIX is the amount of stringlets parsed and 506 must be in range 1..256. 508 1 byte: STRING_LENGTH, the length of the entry contents. 509 0 for the last (terminating) entry of the 510 transform list. For other entries STRING_LENGTH 511 must be in range 1..255. The 0 entry must be 512 present and must be the last byte of the 513 PREFIX_SUFFIX_LENGTH bytes of prefix/suffix 514 data, else the stream must be rejected as 515 invalid. 517 STRING_LENGTH bytes: contents of the prefix/suffix. 519 1 byte: NTRANSFORMS, amount of transformation triplets. 521 NTRANSFORMS times: data for each transform: 523 1 byte: index of prefix in prefix/suffix data; 524 must be less than NUM_PREFIX_SUFFIX. 526 1 byte: index of suffix in prefix/suffix data; 527 must be less than NUM_PREFIX_SUFFIX. 529 1 byte: operation index, must be an index in the table of 530 operations listed in the chapter 531 "Transform Operations". 533 If and only if at least one transform has operation index 534 ShiftFirst or ShiftAll: 536 NTRANSFORMS times: 538 2 bytes: parameters for the transform. If the transform 539 does not have type ShiftFirst or ShiftAll, the 540 value must be 0. ShiftFirst and ShiftAll 541 interpret these bytes as an unsigned 16-bit 542 integer. 544 if NUM_CUSTOM_WORD_LISTS > 0 or NUM_CUSTOM_TRANSFORM_LISTS > 0 545 (else implicitly NUM_DICTIONARIES is 1 and points to the 546 brotli built-in and there is no context map) 548 1 byte: NUM_DICTIONARIES, may have value 1 to 64. Each 549 dictionary is a combination of a word list and a 550 transform list. Each next dictionary is used when the 551 distance goes beyond the previous. If a CONTEXT_MAP is 552 enabled, then the dictionary matching the context is 553 moved to the front in the order for this context. 555 NUM_DICTIONARIES times: the DICTIONARY_MAP: 557 1 byte: index into a custom word list, or value 558 NUM_CUSTOM_WORD_LISTS to indicate to use the brotli 559 [RFC7932] built-in default word list 561 1 byte: index into a custom transform list, or value 562 NUM_CUSTOM_TRANSFORM_LISTS to indicate to use the 563 brotli [RFC7932] built-in default transform list 565 1 byte: CONTEXT_ENABLED, if 0 there is no context map, if 1 a 566 context map used to select the dictionary is encoded 567 below 569 If CONTEXT_ENABLED is 1, a context map for the 64 brotli 570 [RFC7932] literals contexts: 572 64 bytes: CONTEXT_MAP, index into the DICTIONARY_MAP for 573 the first dictionary to use for this context 575 6. Large Window Brotli Compressed Data Stream 577 Large window brotli allows a sliding window beyond the 24-bit maximum 578 of regular brotli [RFC7932]. 580 The compressed data stream is backwards compatible to brotli 581 [RFC7932], and may optionally have the following differences: 583 Encoding of WBITS in the stream header: the following new 584 pattern of 14 bits is supported: 586 8 bits: value 00010001, to indicate a large window 587 brotli stream 589 6 bits: WBITS, must have value in range 10 to 62 591 Distance alphabet: if the stream is a large window brotli 592 stream, the maximum number of extra bits is 62 and the 593 theoretical maximum size of the distance alphabet is 594 (16 + NDIRECT + (124 << NPOSTFIX)). This overrides the 595 value for the distance alphabet size given in chapter 596 3.3. of [RFC7932] and affects the amount of bits in the 597 encoding of the Simple Prefix Code for distances as 598 described in chapter 3.4 of [RFC7932]. 599 An additional limitation to distances, despite the 600 large allowed alphabet size, is that the alphabet is 601 not allowed to contain a distance symbol able to represent 602 a distance larger than ((1 << 63) - 4) when its extra 603 bits have their maximum value. It depends on NPOSTFIX 604 and NDIRECT when this can occur. 606 A decoder that does not support 64-bit integers may reject a stream 607 if WBITS is higher than 30 or a distance symbol from the distance 608 alphabet is able to encode a distance larger than 2147483644. 610 7. Patching Format Compressed Data Stream 612 TBD 614 8. Shared Brotli Compressed Data Stream 616 The format of a shared brotli compressed data stream without framing 617 format is backwards compatible with brotli [RFC7932], with the 618 following optional differences: 620 *) LZ77 dictionaries as described above are supported 621 *) Custom static dictionaries replacing or extending the static 622 dictionary of brotli [RFC7932] with different words or 623 transforms are supported 625 *) The stream may have the format of regular brotli [RFC7932], 626 or the format of large window brotli as described in section 627 6, or the format of the patching stream described in 628 section 7 630 9. Shared Brotli Framing Format Stream 632 A compliant shared brotli framing format stream has the format 633 described below. 635 9.1. Main Format 637 4 bytes: file signature, in hexadecimal the bytes 91, 0a, 42, 52. 638 The first byte contains the invalid WBITS combination for 639 brotli [RFC7932] and large window brotli. 641 1 byte: container flags, 8 bits with meanings: 643 bit 0 and 1: version indicator, must be 00 645 bit 2: if 0, the file contains no final footer, may not contain 646 any metadata chunks, may not contain a central directory, 647 and may encode only a single resource (using one or more 648 data chunks). If 1, the file may contain one or more 649 resources, metadata, central directory, and must contain a 650 final footer. 652 multiple times: a chunk, each with the format specified in section 653 9.2 655 9.2. Chunk Format 657 varint: length of this chunk excluding this varint but 658 including all next header bytes and data. If the value 659 is 0, then the chunk type byte is not present and the 660 chunk type is assumed to be 0. 662 1 byte: CHUNK_TYPE 663 0: padding chunk 664 1: metadata chunk 665 2: data chunk 666 3: first partial data chunk 667 4: middle partial data chunk 668 5: last partial data chunk 669 6: footer metadata chunk 670 7: global metadata chunk 671 8: repeat metadata chunk 672 9: central directory chunk 673 10: final footer 675 if CHUNK_TYPE is not padding chunk, central directory or final 676 footer: 678 1 byte: CODEC: 680 0: uncompressed 682 1: keep decoder 684 2: brotli 686 3: shared brotli 688 if CODEC is not "uncompressed": 690 varint: uncompressed size in bytes of the data contained 691 within the compressed stream 693 if CODEC is "shared brotli" 695 1 byte: amount of dictionary references. Multiple dictionary 696 references are possible with the following 697 restrictions: there can be maximum 1 serialized 698 dictionary, maximum 1 patching file, and maximum 15 699 prefix dictionaries (a serialized dictionary may 700 already contain one of those, and a patching file 701 also takes up a prefix dictionary). Circular 702 references are not allowed (any dictionary reference 703 that directly or indirectly uses this chunk itself 704 as dictionary). 706 per dictionary reference: 708 1 byte: flags: 710 bit 0 and 1: dictionary source: 712 00: Internal dictionary reference to a full resource 713 by pointer, which can span one or more chunks. 714 Must point to a full data chunk or a first 715 partial data chunk. 717 01: Internal dictionary reference to single chunk 718 contents by pointer. May point to any chunk with 719 contenst (data or metadata). If partial data 720 chunk, only this part is the dictionary. In this 721 case, the dictionary type is not allowed to be a 722 serialised dictionary. 724 10: Reference to a dictionary by hash code of a 725 resource. The dictionary can come from an 726 external source such as a different container. 727 The user of the decoder must be able to provide 728 the dictionary contents given its hash code (even 729 if it comes from this container itself), or treat 730 it as an error when the user does not have it 731 available. 733 11: invalid bit combination 735 bit 2 and 3: dictionary type: 737 00: prefix dictionary, set in front of the sliding 738 window 740 01: serialized dictionary in the shared brotli 741 format as specified in section 5. 743 10: file to apply patching algorithm to. The 744 compressed stream then has the format specified 745 in section 7. 747 11: invalid bit combination 749 bit 4-7: must be 0 751 if hash-based: 753 1 byte: type of hash used. Only supported value: 3, 754 indicating 256-bit Highwayhash. 756 32 bytes: 256-bit Highwayhash checksum checksum to refer 757 to dictionary. 759 if pointer based: varint encoded pointer to its 760 chunk in this container. The chunk must come earlier 761 in the container than the current chunk. 763 X bytes: extra header bytes, depending on CHUNK_TYPE. If present, 764 they are specified in the subsequent chapters. 766 remaining bytes: the chunk contents. The uncompressed data 767 in the chunk content depends on CHUNK_TYPE 768 and is specified in the subsequent sections. 769 The compressed data has following 770 format depending on CODEC: 772 *) uncompressed: the raw bytes 774 *) if "keep decoder", the continuation of the compressed 775 stream which was interrupted at the end of the previous 776 chunk. The decoder from the previous chunk must be used 777 and its state it had at the end of the previous chunk 778 must be kept at the start of the decoding of this chunk. 780 *) brotli: the bytes are in brotli format 781 [RFC7932] 783 *) shared brotli: the bytes are in the 784 shared brotli format specified in section 785 8 787 9.3. Metadata Format 789 All the metadata chunk types use the following format for the 790 uncompressed content: 792 Per field: 794 2 bytes: code to identify this metadata field. This must be 795 two lowercase or two uppercase alpha ascii 796 characters. If the decoder encounters a lowercase 797 field that it does not recognise for the current 798 chunk type, non-ascii characters or non-alpha 799 characters, the decoder must reject the data stream 800 as invalid. Uppercase codes may be used for custom 801 user metadata and can be ignored by a compliant 802 decoder. 804 varint: length of the content of this field in bytes, 805 excluding the code bytes and this varint 807 N bytes: the contents of this field 809 The last field is reached when the chunk content end is reached. If 810 the length of the last field does not end at the same byte as the end 811 of the uncompressed content of the chunk, the decoder must reject the 812 data stream as invalid. 814 9.4. Chunk Specifications 816 9.4.1. Padding Chunk (Type 0) 818 All bytes in this chunk must be zero, except for the initial varint 819 that specifies the remaining chunk length. 821 Since the varint itself takes up bytes as well, when the goal is to 822 introduce an amount of padding bytes, the dependence of the length of 823 the varint on the value it encodes must be taken into account. 825 A single byte varint with value 0 is a padding chunk of length 1. 826 For more padding, use higher varint values. Do not use multiple 827 shorter padding chunks, since this is slower to decode. 829 9.4.2. Metadata Chunk (Type 1) 831 This chunk contains metadata that applies to the resource whose 832 beginning is encoded in the subsequent data chunk or first partial 833 data chunk. 835 The contents of this chunk follows the format described in chapter 836 9.3. 838 The following field types are recognised: 840 id: name field. May appear 0 or 1 times. Has the following 841 format: 843 N bytes: name in UTF-8 encoding, length determined by the 844 field length. Treated generically but may be used as 845 filename. If used as filename, forward slashes '/' 846 should be used as directory separator, relative paths 847 should be used and filenames ending in a slash with 848 0-length content in the matching data chunk should be 849 treated as an empty directory. 851 mt: modification type. May appear 0 or 1 times. Has the following 852 format: 854 8 bytes: microseconds since epoch, as a little endian signed 855 twos complement 64-bit integer 857 custom user field: any two uppercase ASCII characters. 859 9.4.3. Data Chunk (Type 2) 861 A data chunk contains the actual data of a resource. 863 This chunk has the following extra header bytes: 865 1 byte: flags: 867 bit 0: if true, indicates this is not a resource that should be 868 output implicitly as part of extracting resources from 869 this container. Instead, it may be referred to only 870 explicitly, e.g. as a dictionary reference by hash code 871 or offset. This flag should be set for data used as 872 dictionary to improve compression of actual resources. 874 bit 1: if true, hash code is given 876 bits 2-7: must be zero 878 if hash code is given: 880 1 byte: type of hash used. Only supported value: 3, 881 indicating 256-bit Highwayhash. 883 32 bytes: 256-bit Highwayhash checksum checksum of the 884 uncompressed data 886 The uncompressed content bytes of this chunk are the actual data of 887 the resource. 889 9.4.4. First Partial Data Chunk (Type 3) 891 This chunk contains partial data of a resource. This is the first 892 chunk in a series containing the entire data of the resource. 894 The format of this chunk is the same as the format of a Data Chunk 895 (chapter 9.4.3) except for the differences noted below. 897 The second bit of flags must be set to 0 and no hash code given. 899 The uncompressed data size is only of this part of the resource, not 900 of the full resource. 902 9.4.5. Middle Partial Data Chunk (Type 4) 904 This chunk contains partial data of a resource, and is neither the 905 first nor the last part of the full resource. 907 The format of this chunk is the same as the format of a Data Chunk 908 (chapter 9.4.3) except for the differences noted below. 910 The first and second bits of flags must be set to 0. 912 The uncompressed data size is only of this part of the resource, not 913 of the full resource. 915 9.4.6. Last Partial Data Chunk (Type 5) 917 This chunk contains the final piece of partial data of a resource. 919 The format of this chunk is the same as the format of a Data Chunk 920 (chapter 9.4.3) except for the differences noted below. 922 The first bit of the flags must be set to 0. 924 If a hash code is given, the hash code of the full resource 925 (concatenated from all previous chunks and this chunk) is given in 926 this chunk. 928 The uncompressed data size is only of this part of the resource, not 929 of the full resource. 931 The type of this chunk indicates that there are no further chunk 932 encoding this resource, so the full resource is now known. 934 9.4.7. Footer Metadata Chunk (Type 6) 936 This metadata applies to the resource whose encoding ended in the 937 preceding data chunk or last partial data chunk. 939 The contents of this chunk follows the format described in chapter 940 9.3. 942 There are no lowercase field types defined for footer metadata. 943 Uppercase field types can be used as custom user data. 945 9.4.8. Global Metadata Chunk (Type 7) 947 This metadata applies to the whole container instead of a single 948 resource. 950 The contents of this chunk follows the format described in chapter 951 9.3. 953 There are no lowercase field types defined for footer metadata. 954 Uppercase field types can be used as custom user data. 956 9.4.9. Repeat Metadata Chunk (Type 8) 958 These chunks optionally repeat metadata that is interleaved between 959 data chunks. To use these chunks, it is necessary to also read 960 additional information, such as pointers to the original chunks, from 961 the central directory. 963 The contents of this chunk follows the format described in chapter 964 9.3. 966 This chunk has an extra header byte: 968 1 byte: chunk type of repeated chunk (metadata chunk 969 or footer metadata chunk) 971 This set of chunks must follow the following restrictions: 973 It is optional whether or not repeat metadata chunks are 974 present. 976 If they are present, then they must be present for all 977 metadata chunks and footer metadata chunks. 979 There may be only 1 repeat metadata chunk per repeated metadata 980 chunk. 982 They must appear in the same order as the chunks appear in the 983 container, which is also the same order as listed in the 984 central directory. 986 Compression of these chunks is allowed, however it is not allowed 987 to use any internal dictionary except an earlier repeat 988 metadata chunk of this series, and it is not allowed for a 989 metadata chunk to keep the decoder state if the previous chunk 990 is not a repeat metadata chunk. That is, the series of 991 metadata chunks must be decompressible without using other 992 chunks of the framing format file. 994 The fields contained in this metadata chunk must follow the following 995 restrictions: 997 If a field is present, it must 998 exactly match the corresponding field of the copied chunk. 1000 It is allowed to leave out a field that is present 1001 in the copied chunk. 1003 If a field is present, then it must be present in *all* other 1004 repeat metadata chunks when the copied chunk contains this 1005 field. In other words, if you know you can get the name field 1006 from a repeat chunk, you know that you will be able to get all 1007 names of all resources from all repeat chunks. 1009 9.4.10. Central Directory Chunk (Type 9) 1011 The central directory chunk, along with the repeat metadata chunks, 1012 allow to quickly find and list compressed resources in the container 1013 file. 1015 The central directory chunk is always uncompressed and does not have 1016 the codec byte. It instead has the following format: 1018 varint: pointer into the file where the repeat metadata chunks are 1019 located, or 0 if they are not present 1021 per chunk listed: 1023 varint: pointer into the file where this chunk begins 1025 varint: amount of header bytes N used below 1027 N bytes: copy of all the header bytes of the pointed at chunk, 1028 including total size, chunk type byte, codec, 1029 uncompressed size, dictionary references, X extra 1030 header bytes. The content is not repeated here. 1032 The last listed chunk is reached when the end of the contents of the 1033 central directory are reached. If the end does not match the last 1034 byte of the central directory, the decoder must reject the data 1035 stream as invalid. 1037 If present, the central directory must list all data and metadata 1038 chunks of all types. 1040 9.4.11. Final Footer Chunk (Type 10) 1042 Chunk that closes the file, only present if in the initial container 1043 header flags bit 2 was set. 1045 This chunk has the following content, always uncompressed: 1047 reversed varint: size of this entire framing format file, 1048 including these bytes themselves, or 0 if this 1049 size is not given 1051 reversed varint: pointer to the start of the central directory, 1052 or 0 if there is none 1054 A reversed varint has the same format as a varint, but has its bytes 1055 in reversed order and is designed to be parsed from end of file 1056 towards the beginning. 1058 9.4.12. Chunk ordering 1060 The chunk ordering must follow the rules described below, if the 1061 decoder sees otherwise, it must reject the data stream as invalid. 1063 Padding chunks may be inserted anywhere, even between chunks for 1064 which the rules below say no other chunk types may come in 1065 between. 1067 Metadata chunks must come immediately before the Data chunks of 1068 the resource they apply to. 1070 Footer metadata chunks must come immediately after the Data 1071 chunks of the resource they apply to. 1073 There may be only 0 or 1 metadata chunks per resource. 1075 There may be only 0 or 1 footer metadata chunks per resource. 1077 A resource must exist out of either 1 data chunk, or 1 first 1078 partial data chunk, 0 or more middle partial data 1079 chunks, and 1 last partial data chunk, in that order. 1081 Repeat metadata chunks must follow the rules of section 9.4.9. 1083 There may be only 0 or 1 central directory chunks. 1085 If bit 2 of the container flags is set, there may be only a 1086 single resource, no metadata chunks of any type, no central 1087 directory, and no final footer. 1089 If bit 2 of the container flags is not set, there must be exactly 1090 1 final footer chunk and it must be the last chunk in the file. 1092 10. Security Considerations 1094 The security considerations for brotli [RFC7932] apply to shared 1095 brotli as well. 1097 In addition, the same considerations apply to the decoding of new 1098 file format streams for shared brotli, including shared dictionaries, 1099 the framing format and the shared brotli format. 1101 The dictionary must be treated with the same security precautions as 1102 the content, because a change to the dictionary can result in a 1103 change to the decompressed content. 1105 The CRIME attack shows that it's a bad idea to compress data from 1106 mixed (e.g. public and private) sources -- the data sources include 1107 not only the compressed data but also the dictionaries. For example, 1108 if you compress secret cookies using a public-data-only dictionary, 1109 you still leak information about the cookies. 1111 Not only can the dictionary reveal information about the compressed 1112 data, but vice versa, data compressed with the dictionary can reveal 1113 the contents of the dictionary when an adversary can control parts of 1114 data to compress and see the compressed size. On the other hand, if 1115 the adversary can control the dictionary, the adversary can learn 1116 information about the compressed data. 1118 The most robust defense against CRIME is not to compress private data 1119 (e.g., sensitive headers like cookies or any content with PII). The 1120 challenge has been to identify secrets within a vast amount of to be 1121 compressed data. Cloudflare uses a regular expression [CLOUDFLARE]. 1122 Another idea is to extend existing web template systems (e.g., Soy 1123 [SOY]) to allow developers to mark secrets that must not be 1124 compressed. 1126 A less robust idea, but easier to implement, is to randomize the 1127 compression algorithm, i.e., adding randomly generated padding, 1128 varying the compression ratio, etc. The tricky part is to find the 1129 right balance between cost and security, i.e., on one hand we don't 1130 want to add too much padding because it adds a cost to data, on the 1131 other hand we don't want to add too little because the adversary can 1132 detect a small amount of padding with traffic analysis. 1134 Another defense in addition is to not use dictionaries for cross- 1135 domain requests, and only use shared brotli for the response when the 1136 origin is the same as where the content is hosted (using CORS). This 1137 prevents an adversary to use a private dictionary with user secrets 1138 to compress content hosted on the adversary's origin. It also helps 1139 prevent CRIME attacks that try to benefit from a public dictionary by 1140 preventing data compression with dictionaries for requests that do 1141 not originate from the host itself. 1143 The content of the dictionary itself should not be affected by 1144 external users, allowing adversaries to control the dictionary allows 1145 a form of chosen plaintext attack. Instead, only base the dictionary 1146 on content you control or generic large scale content such as a 1147 spoken language, and update the dictionary with large time intervals 1148 (days, not seconds) to prevent fast probing. 1150 11. IANA Considerations 1152 The "HTTP Content Coding Registry" has been updated with the 1153 registration below: 1155 +-------+-------------------------------------+------------+ 1156 | Name | Description | Reference | 1157 +-------+-------------------------------------+------------+ 1158 | sbr | Shared Brotli Compressed Data Format| RFCXXXX | 1159 +-------+-------------------------------------+------------+ 1161 12. Informative References 1163 [RFC7932] Alakuijala, J., Szabadka, Z., "Brotli Compressed Data 1164 Format", RFC 7932, Google, Inc., July 2016. 1165 http://www.ietf.org/rfc/rfc7932.txt 1167 [CLOUDFLARE] https://blog.cloudflare.com/a-solution-to-compression- 1168 oracles-on-the-web/ 1170 [SOY] https://developers.google.com/closure/templates/ 1172 Authors' Addresses 1174 Jyrki Alakuijala 1175 Google, Inc. 1177 Email: jyrki@google.com 1179 Thai Duong 1180 Google, Inc. 1182 Email: thaidn@google.com 1184 Evgenii Kliuchnikov 1185 Google, Inc. 1187 Email: eustas@google.com 1189 Robert Obryk 1190 Google, Inc. 1192 Email: robryk@google.com 1194 Zoltan Szabadka 1195 Google, Inc. 1197 Email: szabadka@google.com 1199 Lode Vandevenne (editor) 1200 Google, Inc. 1202 Email: lode@google.com