idnits 2.17.1 draft-vandevenne-shared-brotli-format-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([RFC7932]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (Aug 2020) is 1340 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '31' on line 489 Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Alakuijala 3 Internet-Draft T. Duong 4 Intended Status: Informational E. Kliuchnikov 5 Expires: Feb 04, 2021 R. Obryk 6 Z. Szabadka 7 L. Vandevenne 8 Google, Inc 9 Aug 2020 11 Shared Brotli Compressed Data Format 12 draft-vandevenne-shared-brotli-format-06 14 Abstract 16 This specification defines a data format for shared brotli 17 compression, which adds support for shared dictionaries, large 18 window, patching and a container format to brotli [RFC7932]. 20 Shared dictionaries and large window support allow significant 21 compression gains compared to regular brotli, and patching allows 22 smaller patches of binary files. 24 Status of this Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. Internet-Drafts are working 28 documents of the Internet Engineering Task Force (IETF). Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. The list of current Internet-Drafts is at 31 http://datatracker.ietf.org/drafts/current. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on Feb 04, 2021. 40 Copyright Notice 42 Copyright (c) 2018 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 1.2. Intended audience . . . . . . . . . . . . . . . . . . . . 3 60 1.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 3 61 1.4. Compliance . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.5. Definitions of terms and conventions used . . . . . . . . 4 63 1.5.1. Packing into bytes . . . . . . . . . . . . . . . . . 4 64 2. Shared Brotli Overview . . . . . . . . . . . . . . . . . . . . 5 65 3. Shared Dictionaries . . . . . . . . . . . . . . . . . . . . . . 6 66 3.1. Custom Static Dictionaries . . . . . . . . . . . . . . . . 6 67 3.1.1. Transform Operations . . . . . . . . . . . . . . . . 7 68 3.2. LZ77 Dictionaries . . . . . . . . . . . . . . . . . . . . 9 69 4. Varint Encoding . . . . . . . . . . . . . . . . . . . . . . . 10 70 5. Shared Dictionary Stream . . . . . . . . . . . . . . . . . . . 10 71 6. Large Window Brotli Compressed Data Stream . . . . . . . . . . 13 72 7. Patching Format Compressed Data Stream . . . . . . . . . . . . 13 73 8. Shared Brotli Compressed Data Stream . . . . . . . . . . . . . 13 74 9. Shared Brotli Framing Format Stream . . . . . . . . . . . . . 14 75 9.1. Main Format . . . . . . . . . . . . . . . . . . . . . . . 14 76 9.2. Chunk Format . . . . . . . . . . . . . . . . . . . . . . 14 77 9.3. Metadata Format . . . . . . . . . . . . . . . . . . . . . 17 78 9.4. Chunk Specifications . . . . . . . . . . . . . . . . . . 18 79 9.4.1. Padding Chunk (Type 0) . . . . . . . . . . . . . . . 18 80 9.4.2. Metadata Chunk (Type 1) . . . . . . . . . . . . . . 18 81 9.4.3. Data Chunk (Type 2) . . . . . . . . . . . . . . . . 19 82 9.4.4. First Partial Data Chunk (Type 3) . . . . . . . . . 19 83 9.4.5. Middle Partial Data Chunk (Type 4) . . . . . . . . . 19 84 9.4.6. Last Partial Data Chunk (Type 5) . . . . . . . . . . 20 85 9.4.7. Footer Metadata Chunk (Type 6) . . . . . . . . . . . 20 86 9.4.8. Global Metadata Chunk (Type 7) . . . . . . . . . . . 20 87 9.4.9. Repeat Metadata Chunk (Type 8) . . . . . . . . . . . 21 88 9.4.10. Central Directory Chunk (Type 9) . . . . . . . . . 22 89 9.4.11. Final Footer Chunk (Type 10) . . . . . . . . . . . 22 90 10. Security Considerations . . . . . . . . . . . . . . . . . . . 23 91 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 25 92 12. Informative References . . . . . . . . . . . . . . . . . . . 25 93 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 25 95 1. Introduction 97 1.1. Purpose 99 The purpose of this specification is to extend the brotli compressed 100 data format format ([RFC7932]) with new abilities that allow further 101 compression gains: 103 * Shared dictionaries allow a static shared context between 104 encoder and decoder for significant compression gains. 106 * Large window brotli allows much larger back reference distances 107 to give compression gains for files over 16MiB. 109 * Patching allows to create smaller patches of binary files 111 * The framing format is a container format that allows to store 112 multiple resources, refer to dictionaries, enable patching and 113 other filters that improve compression. 115 This document is the authoritative specification of shared brotli 116 data formats and the backwards compatible changes to brotli, and 117 defines: 119 * The data format of serialized shared dictionaries 121 * The data format of the framing format 123 * The encoding of window bits and distances for large window 124 brotli in the brotli data format 126 * The encoding of shared dictionary references in the brotli data 127 format 129 * The data format for patching with brotli 131 1.2. Intended audience 133 This specification is intended for use by software implementers to 134 compress data into and/or decompress data from the shared brotli 135 dictionary format. 137 The text of the specification assumes a basic background in 138 programming at the level of bits and other primitive data 139 representations. Familiarity with the technique of LZ77 coding is 140 helpful but not required. 142 1.3. Scope 143 This specification defines a data format for shared brotli 144 compression, which adds support for dictionaries and extended 145 features to brotli [RFC7932]. 147 1.4. Compliance 149 Unless otherwise indicated below, a compliant decompressor must be 150 able to accept and decompress any data set that conforms to all the 151 specifications presented here. A compliant compressor must produce 152 data sets that conform to all the specifications presented here. 154 1.5. Definitions of terms and conventions used 156 Byte: 8 bits stored or transmitted as a unit (same as an octet). For 157 this specification, a byte is exactly 8 bits, even on machines that 158 store a character on a number of bits different from eight. See 159 below for the numbering of bits within a byte. 161 String: a sequence of arbitrary bytes. 163 Bytes stored within a computer do not have a "bit order", since they 164 are always treated as a unit. However, a byte considered as an 165 integer between 0 and 255 does have a most- and least-significant 166 bit, and since we write numbers with the most-significant digit on 167 the left, we also write bytes with the most-significant bit on the 168 left. In the diagrams below, we number the bits of a byte so that bit 169 0 is the least-significant bit, i.e., the bits are numbered: 171 +--------+ 172 |76543210| 173 +--------+ 175 Within a computer, a number may occupy multiple bytes. All multi-byte 176 numbers in the format described here are unsigned and stored with the 177 least-significant byte first (at the lower memory address). For 178 example, the decimal 16-bit number 520 is stored as: 180 0 1 181 +--------+--------+ 182 |00001000|00000010| 183 +--------+--------+ 184 ^ ^ 185 | | 186 | + more significant byte = 2 x 256 187 + less significant byte = 8 189 1.5.1. Packing into bytes 190 This document does not address the issue of the order in which bits 191 of a byte are transmitted on a bit-sequential medium, since the final 192 data format described here is byte- rather than bit-oriented. 193 However, we describe the compressed block format below as a sequence 194 of data elements of various bit lengths, not a sequence of bytes. We 195 must therefore specify how to pack these data elements into bytes to 196 form the final compressed byte sequence: 198 * Data elements are packed into bytes in order of 199 increasing bit number within the byte, i.e., starting 200 with the least-significant bit of the byte. 201 * Data elements other than prefix codes are packed 202 starting with the least-significant bit of the data 203 element. These are referred to here as integer values 204 and are considered unsigned. 205 * Prefix codes are packed starting with the most-significant 206 bit of the code. 208 In other words, if one were to print out the compressed data as a 209 sequence of bytes, starting with the first byte at the *right* margin 210 and proceeding to the *left*, with the most-significant bit of each 211 byte on the left as usual, one would be able to parse the result from 212 right to left, with fixed-width elements in the correct MSB-to-LSB 213 order and prefix codes in bit-reversed order (i.e., with the first 214 bit of the code in the relative LSB position). 216 As an example, consider packing the following data elements into a 217 sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2, 218 prefix code 110, prefix code 10, 12-bit integer value 3628. 220 byte 2 byte 1 byte 0 221 +--------+--------+--------+ 222 |11100010|11000101|10010110| 223 +--------+--------+--------+ 224 ^ ^ ^ ^ ^ 225 | | | | | 226 | | | | +------ integer value 6 227 | | | +---------- integer value 2 228 | | +-------------- prefix code 110 229 | +---------------- prefix code 10 230 +----------------------------- integer value 3628 232 2. Shared Brotli Overview 234 Shared brotli extends brotli [RFC7932] with support for shared 235 dictionaries, larger LZ77 window and a framing format. 237 3. Shared Dictionaries 239 A shared dictionary is a piece of data shared by a compressor and 240 decompressor. The compressor can take advantage of the dictionary 241 context to encode the input in a more compact manner. The compressor 242 and the decompressor must use exactly the same dictionary. A shared 243 dictionary is specially useful to compress short input sequences. 245 A shared brotli dictionary can use two methods of sharing context: 247 * An LZ77 dictionary. The encoder and decoder could refer 248 to a given sequence of bytes. Multiple LZ77 dictionaries 249 can be set. 251 * A custom static dictionary: a word list with transforms. The 252 encoder and decoder will replace the static dictionary data 253 with the data in the shared dictionary. The original static 254 dictionary is described in Section 8 in [RFC7932]. The original 255 data from Appendix A and Appendix B of [RFC7932] will be 256 replaced. In addition, it is possible to dynamically switch 257 this dictionary based on the data compression context, and/or 258 to include a reference to the original dictionary in the custom 259 dictionary. 261 If no shared dictionary is set the decoder behaves the same as in 262 [RFC7932] on a brotli stream. 264 If a shared dictionary is set, then it can set any of: LZ77 265 dictionaries, overriding static dictionary words, and/or overriding 266 transforms. 268 3.1. Custom Static Dictionaries 270 If a custom word list is set, then the following behavior of the RFC 271 7932 decoder [RFC7932] is overridden: 273 Instead of the Static Dictionary Data from Appendix A 274 of [RFC7932], one or more word lists from the custom static 275 dictionary data are used. 277 Instead of NDBITS at the end of Appendix A, a custom 278 SIZE_BITS_BY_LENGTH per custom word list is used. 280 The copy length for a static dictionary reference must be 281 between 4 and 31 and may not be a value for which 282 SIZE_BITS_BY_LENGTH of this dictionary is 0. 284 If a custom transforms list is set without context dependency, then 285 the following behavior of the RFC 7932 decoder [RFC7932] is 286 overridden: 288 The "List of Word Transformations" from Appendix B is 289 overridden by one or more lists of custom prefixes, suffixes and 290 transform operations. 292 The transform_id must be smaller than the number of transforms 293 given in the custom transforms list. 295 If the dictionary is context dependent, it includes a lookup table of 296 64 word list and transform list combinations. When resolving a static 297 dictionary word, the decoder computes the literal context id, as in 298 section 7.1. of [RFC7932]. The literal context id is used as index in 299 the lookup tables to select the word list and transforms to use. If 300 the dictionary is not context dependent, this id is implicitely 0 301 instead. 303 If a distance goes beyond the dictionary for the current id and 304 multiple word list / transform list combinations are defined, then a 305 next dictionary is used in the following order: if not context 306 dependent, the same order as defined in the shared dictionary. If 307 context dependent, the index matching the current context is used 308 first, the same order as defined in the shared dictionary excluding 309 the current context are used next. 311 3.1.1. Transform Operations 313 A shared dictionary may include custom word transformations, to 314 replace those specified in Section 8 and Appendix B of [RFC7932]. A 315 transform consists of a possible prefix, a transform operation, for 316 some operations a parameter, and a possible suffix. In the shared 317 dictionary format, the transform operation is represented by a 318 numerical ID, listed in the table below. 320 ID Operation 321 -- --------- 322 0 Identity 323 1 OmitLast1 324 2 OmitLast2 325 3 OmitLast3 326 4 OmitLast4 327 5 OmitLast5 328 6 OmitLast6 329 7 OmitLast7 330 8 OmitLast8 331 9 OmitLast9 333 10 FermentFirst 334 11 FermentAll 335 12 OmitFirst1 336 13 OmitFirst2 337 14 OmitFirst3 338 15 OmitFirst4 339 16 OmitFirst5 340 17 OmitFirst6 341 18 OmitFirst7 342 19 OmitFirst8 343 20 OmitFirst9 344 21 ShiftFirst (by PARAMETER) 345 22 ShiftAll (by PARAMETER) 347 Operations 0 to 20 are specified in Section 8 in [RFC7932]. 348 ShiftFirst and ShiftAll transform specifically encoded SCALARs. 350 A SCALAR is a 7-, 11-, 16- or 21-bit unsigned integer encoded with 1, 351 2, 3 or 4 bytes respectively with following bit contents: 353 7-bit SCALAR: 354 +--------+ 355 |0sssssss| 356 +--------+ 358 11-bit SCALAR: 359 +--------+--------+ 360 |110sssss|XXssssss| 361 +--------+--------+ 363 16-bit SCALAR: 364 +--------+--------+--------+ 365 |1110ssss|XXssssss|XXssssss| 366 +--------+--------+--------+ 368 21-bit SCALAR: 369 +--------+--------+--------+--------+ 370 |11110sss|XXssssss|XXssssss|XXssssss| 371 +--------+--------+--------+--------+ 373 Given the input bytes matching SCALAR encoding pattern, the SCALAR 374 value is obtained by concatenation of the "s" bits, with the most 375 significant bits coming from the earliest byte. The "X" bits could 376 have arbitrary value. 378 An ADDEND is defined as the result of limited sign extension of 379 16-bit unsigned PARAMETER: 381 At first the PARAMETER is zero-extended to 32 bits. After this, 382 if the resulting value is greater or equal than 0x8000, 383 then 0xFF0000 is added. 385 ShiftAll starts at the beginning of the word and repetitively applies 386 the following transform until the whole word is transformed: 388 If the next untransformed byte matches the first byte of the 7-, 389 11-, 16- or 21-bit SCALAR pattern, then: 391 If the untransformed part of the word is not long enough to 392 match the whole SCALAR pattern, then the whole word is 393 marked as transformed. 395 Otherwise, let SHIFTED be the sum of the ADDEND and the 396 encoded SCALAR. The lowest bits from SHIFTED 397 are written back into the corresponding "s" bits. The "0", 398 "1" and "X" bits remain unchanged. Next, 1, 2, 3 or 399 4 not transformed bytes marked as transformed, according to 400 the SCALAR pattern length. 402 Otherwise, the next untransformed byte is marked as transformed. 404 ShiftFirst applies the same transform as ShiftAll, but does not 405 iterate. 407 3.2. LZ77 Dictionaries 409 If an LZ77 dictionary is set, then the decoder treats this as a 410 regular LZ77 copy, but behaves as if the bytes of this dictionary are 411 accessible as the uncompressed bytes outside of the regular LZ77 412 window for backwards references. 414 Let LZ77_DICTIONARY_LENGTH be the length of the LZ77 dictionary. 415 Then word_id, described in Section 8 in [RFC7932], is redefined as: 417 word_id = distance - (max allowed distance + 1 + 418 LZ77_DICTIONARY_LENGTH) 420 For the case when LZ77_DICTIONARY_LENGTH is 0, word_id matches the 421 [RFC7932] definition. 423 Let dictionary_address be 425 LZ77_DICTIONARY_LENGTH + max allowed distance - distance 427 Then distance values of pairs [RFC7932] in range 428 (max allowed distance + 1)..(LZ77_DICTIONARY_LENGTH + max allowed 429 distance) are interpreted as references starting in the LZ77 430 dictionary at the byte at dictionary_address. If length is longer 431 than (LZ77_DICTIONARY_LENGTH - dictionary_address), then the 432 reference continues to copy (length - LZ77_DICTIONARY_LENGTH + 433 dictionary_address) bytes from the regular LZ77 window starting at 434 the beginning. 436 4. Varint Encoding 438 A varint is encoded in base 128 in one ore more bytes as follows: 440 +--------+--------+ +--------+ 441 |1xxxxxxx|1xxxxxxx| {0-8 times} |0xxxxxxx| 442 +--------+--------+ +--------+ 444 where the "x" bits of the first byte are the least significant bits 445 of the value and the "x" bits of the last byte are the most 446 significant bits of the value. The last byte must have its MSB set to 447 0, all other bytes to 1 to indicate there is a next byte. 449 The maximum allowed amount of bits to read is 63 bits, if the 9th 450 byte is present and has its MSB set then the stream must be 451 considered as invalid. 453 5. Shared Dictionary Stream 455 The shared dictionary stream encodes a custom dictionary for brotli 456 including custom words and/or custom transformations. A shared 457 dictionary may appear standalone or as contents of a resource in a 458 framing format container. 460 A compliant shared brotli dictionary stream must have the following 461 format: 463 2 bytes: file signature, in hexadecimal the bytes 91, 0. 465 varint: LZ77_DICTIONARY_LENGTH, number of bytes for a LZ77 466 dictionary, or 0 if there is none. 467 The maximum allowed value is the maximum possible sliding 468 window size of brotli or of large window brotli. 470 LZ77_DICTIONARY_LENGTH bytes: contents of the LZ77 dictionary. 472 1 byte: NUM_CUSTOM_WORD_LISTS, may have value 0 to 64 474 NUM_CUSTOM_WORD_LISTS times a word list, with the following 475 format for each word list: 477 28 bytes: SIZE_BITS_BY_LENGTH, array of 28 unsigned 8-bit 478 integers, indexed by word lengths 4 to 31. The value 479 represents log2(number of words of this length), 480 with the exception of 0 meaning 0 words of this 481 length. The max allowed length value is 15 bits. 482 OFFSETS_BY_LENGTH is computed from this as 483 OFFSETS_BY_LENGTH[i + 1] = OFFSETS_BY_LENGTH[i] + 484 (SIZE_BITS_BY_LENGTH[i] ? (i << 485 SIZE_BITS_BY_LENGTH[i]) : 0) 487 N bytes: words dictionary data, where N is 488 OFFSETS_BY_LENGTH[31] + (SIZE_BITS_BY_LENGTH[31] ? 489 (31 << SIZE_BITS_BY_LENGTH[31]) : 0), first all the 490 words of shortest length, then all words of the next 491 length, and so on, where for each length there are 492 either 0 or a positive power of two amount of words. 494 1 byte: NUM_CUSTOM_TRANSFORM_LISTS, may have value 0 to 64 496 NUM_CUSTOM_TRANSFORM_LISTS times a transform list, with the 497 following format for each transform list: 499 2 bytes: PREFIX_SUFFIX_LENGTH, the length of prefix/suffix 500 data. Must be at least 1 because the list must 501 always end with a zero-length stringlet even 502 if empty. 504 NUM_PREFIX_SUFFIX times: prefix/suffix stringlet. 505 NUM_PREFIX_SUFFIX is the amount of stringlets parsed and 506 must be in range 1..256. 508 1 byte: STRING_LENGTH, the length of the entry contents. 509 0 for the last (terminating) entry of the 510 transform list. For other entries STRING_LENGTH 511 must be in range 1..255. The 0 entry must be 512 present and must be the last byte of the 513 PREFIX_SUFFIX_LENGTH bytes of prefix/suffix 514 data, else the stream must be rejected as 515 invalid. 517 STRING_LENGTH bytes: contents of the prefix/suffix. 519 1 byte: NTRANSFORMS, amount of transformation triplets. 521 NTRANSFORMS times: data for each transform: 523 1 byte: index of prefix in prefix/suffix data; 524 must be less than NUM_PREFIX_SUFFIX. 526 1 byte: index of suffix in prefix/suffix data; 527 must be less than NUM_PREFIX_SUFFIX. 529 1 byte: operation index, must be an index in the table of 530 operations listed in the chapter 531 "Transform Operations". 533 If and only if at least one transform has operation index 534 ShiftFirst or ShiftAll: 536 NTRANSFORMS times: 538 2 bytes: parameters for the transform. If the transform 539 does not have type ShiftFirst or ShiftAll, the 540 value must be 0. ShiftFirst and ShiftAll 541 interpret these bytes as an unsigned 16-bit 542 integer. 544 if NUM_CUSTOM_WORD_LISTS > 0 or NUM_CUSTOM_TRANSFORM_LISTS > 0 545 (else implicitly NUM_DICTIONARIES is 1 and points to the 546 brotli built-in and there is no context map) 548 1 byte: NUM_DICTIONARIES, may have value 1 to 64. Each 549 dictionary is a combination of a word list and a 550 transform list. Each next dictionary is used when the 551 distance goes beyond the previous. If a CONTEXT_MAP is 552 enabled, then the dictionary matching the context is 553 moved to the front in the order for this context. 555 NUM_DICTIONARIES times: the DICTIONARY_MAP: 557 1 byte: index into a custom word list, or value 558 NUM_CUSTOM_WORD_LISTS to indicate to use the brotli 559 [RFC7932] built-in default word list 561 1 byte: index into a custom transform list, or value 562 NUM_CUSTOM_TRANSFORM_LISTS to indicate to use the 563 brotli [RFC7932] built-in default transform list 565 1 byte: CONTEXT_ENABLED, if 0 there is no context map, if 1 a 566 context map used to select the dictionary is encoded 567 below 569 If CONTEXT_ENABLED is 1, a context map for the 64 brotli 570 [RFC7932] literals contexts: 572 64 bytes: CONTEXT_MAP, index into the DICTIONARY_MAP for 573 the first dictionary to use for this context 575 6. Large Window Brotli Compressed Data Stream 577 Large window brotli allows a sliding window beyond the 24-bit maximum 578 of regular brotli [RFC7932]. 580 The compressed data stream is backwards compatible to brotli 581 [RFC7932], and may optionally have the following differences: 583 Encoding of WBITS in the stream header: the following new 584 pattern of 14 bits is supported: 586 8 bits: value 00010001, to indicate a large window 587 brotli stream 589 6 bits: WBITS, must have value in range 10 to 62 591 Distance alphabet: if the stream is a large window brotli 592 stream, the maximum number of extra bits is 62 and the 593 theoretical maximum size of the distance alphabet is 594 (16 + NDIRECT + (124 << NPOSTFIX)). This overrides the 595 value for the distance alphabet size given in chapter 596 3.3. of [RFC7932] and affects the amount of bits in the 597 encoding of the Simple Prefix Code for distances as 598 described in chapter 3.4 of [RFC7932]. 599 An additional limitation to distances, despite the 600 large allowed alphabet size, is that the alphabet is 601 not allowed to contain a distance symbol able to represent 602 a distance larger than ((1 << 63) - 4) when its extra 603 bits have their maximum value. It depends on NPOSTFIX 604 and NDIRECT when this can occur. 606 A decoder that does not support 64-bit integers may reject a stream 607 if WBITS is higher than 30 or a distance symbol from the distance 608 alphabet is able to encode a distance larger than 2147483644. 610 7. Patching Format Compressed Data Stream 612 TBD 614 8. Shared Brotli Compressed Data Stream 616 The format of a shared brotli compressed data stream without framing 617 format is backwards compatible with brotli [RFC7932], with the 618 following optional differences: 620 *) LZ77 dictionaries as described above are supported 621 *) Custom static dictionaries replacing or extending the static 622 dictionary of brotli [RFC7932] with different words or 623 transforms are supported 625 *) The stream may have the format of regular brotli [RFC7932], 626 or the format of large window brotli as described in section 627 6, or the format of the patching stream described in 628 section 7 630 9. Shared Brotli Framing Format Stream 632 A compliant shared brotli framing format stream has the format 633 described below. 635 9.1. Main Format 637 4 bytes: file signature, in hexadecimal the bytes 91, 0a, 42, 52. 638 The first byte contains the invalid WBITS combination for 639 brotli [RFC7932] and large window brotli. 641 1 byte: container flags, 8 bits with meanings: 643 bit 0 and 1: version indicator, must be 00 645 bit 2: if 0, the file contains no final footer, may not contain 646 any metadata chunks, may not contain a central directory, 647 and may encode only a single resource (using one or more 648 data chunks). If 1, the file may contain one or more 649 resources, metadata, central directory, and must contain a 650 final footer. 652 multiple times: a chunk, each with the format specified in section 653 9.2 655 9.2. Chunk Format 657 varint: length of this chunk excluding this varint but 658 including all next header bytes and data. If the value 659 is 0, then the chunk type byte is not present and the 660 chunk type is assumed to be 0. 662 1 byte: CHUNK_TYPE 663 0: padding chunk 664 1: metadata chunk 665 2: data chunk 666 3: first partial data chunk 667 4: middle partial data chunk 668 5: last partial data chunk 669 6: footer metadata chunk 670 7: global metadata chunk 671 8: repeat metadata chunk 672 9: central directory chunk 673 10: final footer 675 if CHUNK_TYPE is not padding chunk, central directory or final 676 footer: 678 1 byte: CODEC: 680 0: uncompressed 682 1: keep decoder 684 2: brotli 686 3: shared brotli 688 4: JPEG XL lossless JPEG1 recompression [JPEGXL] 690 if CODEC is not "uncompressed": 692 varint: uncompressed size in bytes of the data contained 693 within the compressed stream 695 if CODEC is "shared brotli" 697 1 byte: amount of dictionary references. Multiple dictionary 698 references are possible with the following 699 restrictions: there can be maximum 1 serialized 700 dictionary, maximum 1 patching file, and maximum 15 701 prefix dictionaries (a serialized dictionary may 702 already contain one of those, and a patching file 703 also takes up a prefix dictionary). Circular 704 references are not allowed (any dictionary reference 705 that directly or indirectly uses this chunk itself 706 as dictionary). 708 per dictionary reference: 710 1 byte: flags: 712 bit 0 and 1: dictionary source: 714 00: Internal dictionary reference to a full resource 715 by pointer, which can span one or more chunks. 716 Must point to a full data chunk or a first 717 partial data chunk. 719 01: Internal dictionary reference to single chunk 720 contents by pointer. May point to any chunk with 721 contenst (data or metadata). If partial data 722 chunk, only this part is the dictionary. In this 723 case, the dictionary type is not allowed to be a 724 serialised dictionary. 726 10: Reference to a dictionary by hash code of a 727 resource. The dictionary can come from an 728 external source such as a different container. 729 The user of the decoder must be able to provide 730 the dictionary contents given its hash code (even 731 if it comes from this container itself), or treat 732 it as an error when the user does not have it 733 available. 735 11: invalid bit combination 737 bit 2 and 3: dictionary type: 739 00: prefix dictionary, set in front of the sliding 740 window 742 01: serialized dictionary in the shared brotli 743 format as specified in section 5. 745 10: file to apply patching algorithm to. The 746 compressed stream then has the format specified 747 in section 7. 749 11: invalid bit combination 751 bit 4-7: must be 0 753 if hash-based: 755 1 byte: type of hash used. Only supported value: 3, 756 indicating 256-bit Highwayhash. 758 32 bytes: 256-bit Highwayhash checksum to refer to 759 dictionary. 761 if pointer based: varint encoded pointer to its 762 chunk in this container. The chunk must come earlier 763 in the container than the current chunk. 765 X bytes: extra header bytes, depending on CHUNK_TYPE. If present, 766 they are specified in the subsequent chapters. 768 remaining bytes: the chunk contents. The uncompressed data 769 in the chunk content depends on CHUNK_TYPE 770 and is specified in the subsequent sections. 771 The compressed data has following 772 format depending on CODEC: 774 *) uncompressed: the raw bytes 776 *) if "keep decoder", the continuation of the compressed 777 stream which was interrupted at the end of the previous 778 chunk. The decoder from the previous chunk must be used 779 and its state it had at the end of the previous chunk 780 must be kept at the start of the decoding of this chunk. 782 *) brotli: the bytes are in brotli format 783 [RFC7932] 785 *) shared brotli: the bytes are in the 786 shared brotli format specified in section 787 8 789 9.3. Metadata Format 791 All the metadata chunk types use the following format for the 792 uncompressed content: 794 Per field: 796 2 bytes: code to identify this metadata field. This must be 797 two lowercase or two uppercase alpha ascii 798 characters. If the decoder encounters a lowercase 799 field that it does not recognise for the current 800 chunk type, non-ascii characters or non-alpha 801 characters, the decoder must reject the data stream 802 as invalid. Uppercase codes may be used for custom 803 user metadata and can be ignored by a compliant 804 decoder. 806 varint: length of the content of this field in bytes, 807 excluding the code bytes and this varint 809 N bytes: the contents of this field 811 The last field is reached when the chunk content end is reached. If 812 the length of the last field does not end at the same byte as the end 813 of the uncompressed content of the chunk, the decoder must reject the 814 data stream as invalid. 816 9.4. Chunk Specifications 818 9.4.1. Padding Chunk (Type 0) 820 All bytes in this chunk must be zero, except for the initial varint 821 that specifies the remaining chunk length. 823 Since the varint itself takes up bytes as well, when the goal is to 824 introduce an amount of padding bytes, the dependence of the length of 825 the varint on the value it encodes must be taken into account. 827 A single byte varint with value 0 is a padding chunk of length 1. 828 For more padding, use higher varint values. Do not use multiple 829 shorter padding chunks, since this is slower to decode. 831 9.4.2. Metadata Chunk (Type 1) 833 This chunk contains metadata that applies to the resource whose 834 beginning is encoded in the subsequent data chunk or first partial 835 data chunk. 837 The contents of this chunk follows the format described in chapter 838 9.3. 840 The following field types are recognised: 842 id: name field. May appear 0 or 1 times. Has the following 843 format: 845 N bytes: name in UTF-8 encoding, length determined by the 846 field length. Treated generically but may be used as 847 filename. If used as filename, forward slashes '/' 848 should be used as directory separator, relative paths 849 should be used and filenames ending in a slash with 850 0-length content in the matching data chunk should be 851 treated as an empty directory. 853 mt: modification type. May appear 0 or 1 times. Has the following 854 format: 856 8 bytes: microseconds since epoch, as a little endian signed 857 twos complement 64-bit integer 859 custom user field: any two uppercase ASCII characters. 861 9.4.3. Data Chunk (Type 2) 863 A data chunk contains the actual data of a resource. 865 This chunk has the following extra header bytes: 867 1 byte: flags: 869 bit 0: if true, indicates this is not a resource that should be 870 output implicitly as part of extracting resources from 871 this container. Instead, it may be referred to only 872 explicitly, e.g. as a dictionary reference by hash code 873 or offset. This flag should be set for data used as 874 dictionary to improve compression of actual resources. 876 bit 1: if true, hash code is given 878 bits 2-7: must be zero 880 if hash code is given: 882 1 byte: type of hash used. Only supported value: 3, 883 indicating 256-bit Highwayhash. 885 32 bytes: 256-bit Highwayhash checksum of the uncompressed 886 data 888 The uncompressed content bytes of this chunk are the actual data of 889 the resource. 891 9.4.4. First Partial Data Chunk (Type 3) 893 This chunk contains partial data of a resource. This is the first 894 chunk in a series containing the entire data of the resource. 896 The format of this chunk is the same as the format of a Data Chunk 897 (chapter 9.4.3) except for the differences noted below. 899 The second bit of flags must be set to 0 and no hash code given. 901 The uncompressed data size is only of this part of the resource, not 902 of the full resource. 904 9.4.5. Middle Partial Data Chunk (Type 4) 906 This chunk contains partial data of a resource, and is neither the 907 first nor the last part of the full resource. 909 The format of this chunk is the same as the format of a Data Chunk 910 (chapter 9.4.3) except for the differences noted below. 912 The first and second bits of flags must be set to 0. 914 The uncompressed data size is only of this part of the resource, not 915 of the full resource. 917 9.4.6. Last Partial Data Chunk (Type 5) 919 This chunk contains the final piece of partial data of a resource. 921 The format of this chunk is the same as the format of a Data Chunk 922 (chapter 9.4.3) except for the differences noted below. 924 The first bit of the flags must be set to 0. 926 If a hash code is given, the hash code of the full resource 927 (concatenated from all previous chunks and this chunk) is given in 928 this chunk. 930 The uncompressed data size is only of this part of the resource, not 931 of the full resource. 933 The type of this chunk indicates that there are no further chunk 934 encoding this resource, so the full resource is now known. 936 9.4.7. Footer Metadata Chunk (Type 6) 938 This metadata applies to the resource whose encoding ended in the 939 preceding data chunk or last partial data chunk. 941 The contents of this chunk follows the format described in chapter 942 9.3. 944 There are no lowercase field types defined for footer metadata. 945 Uppercase field types can be used as custom user data. 947 9.4.8. Global Metadata Chunk (Type 7) 949 This metadata applies to the whole container instead of a single 950 resource. 952 The contents of this chunk follows the format described in chapter 953 9.3. 955 There are no lowercase field types defined for footer metadata. 956 Uppercase field types can be used as custom user data. 958 9.4.9. Repeat Metadata Chunk (Type 8) 960 These chunks optionally repeat metadata that is interleaved between 961 data chunks. To use these chunks, it is necessary to also read 962 additional information, such as pointers to the original chunks, from 963 the central directory. 965 The contents of this chunk follows the format described in chapter 966 9.3. 968 This chunk has an extra header byte: 970 1 byte: chunk type of repeated chunk (metadata chunk 971 or footer metadata chunk) 973 This set of chunks must follow the following restrictions: 975 It is optional whether or not repeat metadata chunks are 976 present. 978 If they are present, then they must be present for all 979 metadata chunks and footer metadata chunks. 981 There may be only 1 repeat metadata chunk per repeated metadata 982 chunk. 984 They must appear in the same order as the chunks appear in the 985 container, which is also the same order as listed in the 986 central directory. 988 Compression of these chunks is allowed, however it is not allowed 989 to use any internal dictionary except an earlier repeat 990 metadata chunk of this series, and it is not allowed for a 991 metadata chunk to keep the decoder state if the previous chunk 992 is not a repeat metadata chunk. That is, the series of 993 metadata chunks must be decompressible without using other 994 chunks of the framing format file. 996 The fields contained in this metadata chunk must follow the following 997 restrictions: 999 If a field is present, it must 1000 exactly match the corresponding field of the copied chunk. 1002 It is allowed to leave out a field that is present 1003 in the copied chunk. 1005 If a field is present, then it must be present in *all* other 1006 repeat metadata chunks when the copied chunk contains this 1007 field. In other words, if you know you can get the name field 1008 from a repeat chunk, you know that you will be able to get all 1009 names of all resources from all repeat chunks. 1011 9.4.10. Central Directory Chunk (Type 9) 1013 The central directory chunk, along with the repeat metadata chunks, 1014 allow to quickly find and list compressed resources in the container 1015 file. 1017 The central directory chunk is always uncompressed and does not have 1018 the codec byte. It instead has the following format: 1020 varint: pointer into the file where the repeat metadata chunks are 1021 located, or 0 if they are not present 1023 per chunk listed: 1025 varint: pointer into the file where this chunk begins 1027 varint: amount of header bytes N used below 1029 N bytes: copy of all the header bytes of the pointed at chunk, 1030 including total size, chunk type byte, codec, 1031 uncompressed size, dictionary references, X extra 1032 header bytes. The content is not repeated here. 1034 The last listed chunk is reached when the end of the contents of the 1035 central directory are reached. If the end does not match the last 1036 byte of the central directory, the decoder must reject the data 1037 stream as invalid. 1039 If present, the central directory must list all data and metadata 1040 chunks of all types. 1042 9.4.11. Final Footer Chunk (Type 10) 1044 Chunk that closes the file, only present if in the initial container 1045 header flags bit 2 was set. 1047 This chunk has the following content, always uncompressed: 1049 reversed varint: size of this entire framing format file, 1050 including these bytes themselves, or 0 if this 1051 size is not given 1053 reversed varint: pointer to the start of the central directory, 1054 or 0 if there is none 1056 A reversed varint has the same format as a varint, but has its bytes 1057 in reversed order and is designed to be parsed from end of file 1058 towards the beginning. 1060 9.4.12. Chunk ordering 1062 The chunk ordering must follow the rules described below, if the 1063 decoder sees otherwise, it must reject the data stream as invalid. 1065 Padding chunks may be inserted anywhere, even between chunks for 1066 which the rules below say no other chunk types may come in 1067 between. 1069 Metadata chunks must come immediately before the Data chunks of 1070 the resource they apply to. 1072 Footer metadata chunks must come immediately after the Data 1073 chunks of the resource they apply to. 1075 There may be only 0 or 1 metadata chunks per resource. 1077 There may be only 0 or 1 footer metadata chunks per resource. 1079 A resource must exist out of either 1 data chunk, or 1 first 1080 partial data chunk, 0 or more middle partial data 1081 chunks, and 1 last partial data chunk, in that order. 1083 Repeat metadata chunks must follow the rules of section 9.4.9. 1085 There may be only 0 or 1 central directory chunks. 1087 If bit 2 of the container flags is set, there may be only a 1088 single resource, no metadata chunks of any type, no central 1089 directory, and no final footer. 1091 If bit 2 of the container flags is not set, there must be exactly 1092 1 final footer chunk and it must be the last chunk in the file. 1094 10. Security Considerations 1096 The security considerations for brotli [RFC7932] apply to shared 1097 brotli as well. 1099 In addition, the same considerations apply to the decoding of new 1100 file format streams for shared brotli, including shared dictionaries, 1101 the framing format and the shared brotli format. 1103 The dictionary must be treated with the same security precautions as 1104 the content, because a change to the dictionary can result in a 1105 change to the decompressed content. 1107 The CRIME attack shows that it's a bad idea to compress data from 1108 mixed (e.g. public and private) sources -- the data sources include 1109 not only the compressed data but also the dictionaries. For example, 1110 if you compress secret cookies using a public-data-only dictionary, 1111 you still leak information about the cookies. 1113 Not only can the dictionary reveal information about the compressed 1114 data, but vice versa, data compressed with the dictionary can reveal 1115 the contents of the dictionary when an adversary can control parts of 1116 data to compress and see the compressed size. On the other hand, if 1117 the adversary can control the dictionary, the adversary can learn 1118 information about the compressed data. 1120 The most robust defense against CRIME is not to compress private data 1121 (e.g., sensitive headers like cookies or any content with PII). The 1122 challenge has been to identify secrets within a vast amount of to be 1123 compressed data. Cloudflare uses a regular expression [CLOUDFLARE]. 1124 Another idea is to extend existing web template systems (e.g., Soy 1125 [SOY]) to allow developers to mark secrets that must not be 1126 compressed. 1128 A less robust idea, but easier to implement, is to randomize the 1129 compression algorithm, i.e., adding randomly generated padding, 1130 varying the compression ratio, etc. The tricky part is to find the 1131 right balance between cost and security, i.e., on one hand we don't 1132 want to add too much padding because it adds a cost to data, on the 1133 other hand we don't want to add too little because the adversary can 1134 detect a small amount of padding with traffic analysis. 1136 Another defense in addition is to not use dictionaries for cross- 1137 domain requests, and only use shared brotli for the response when the 1138 origin is the same as where the content is hosted (using CORS). This 1139 prevents an adversary to use a private dictionary with user secrets 1140 to compress content hosted on the adversary's origin. It also helps 1141 prevent CRIME attacks that try to benefit from a public dictionary by 1142 preventing data compression with dictionaries for requests that do 1143 not originate from the host itself. 1145 The content of the dictionary itself should not be affected by 1146 external users, allowing adversaries to control the dictionary allows 1147 a form of chosen plaintext attack. Instead, only base the dictionary 1148 on content you control or generic large scale content such as a 1149 spoken language, and update the dictionary with large time intervals 1150 (days, not seconds) to prevent fast probing. 1152 11. IANA Considerations 1154 The "HTTP Content Coding Registry" has been updated with the 1155 registration below: 1157 +-------+-------------------------------------+------------+ 1158 | Name | Description | Reference | 1159 +-------+-------------------------------------+------------+ 1160 | sbr | Shared Brotli Compressed Data Format| RFCXXXX | 1161 +-------+-------------------------------------+------------+ 1163 12. Informative References 1165 [RFC7932] Alakuijala, J., Szabadka, Z., "Brotli Compressed Data 1166 Format", RFC 7932, Google, Inc., July 2016. 1167 http://www.ietf.org/rfc/rfc7932.txt 1169 [JPEGXL] Rhatushnyak, A., Wassenberg, J., Sneyers, J., Alakuijala, 1170 J., Vandevenne, L., Versari, L., Obryk, R., Szabadka, Z., 1171 Kliuchnikov, E., Comsa, I, Potempa, K., Bruse, M., 1172 Firsching, M., Khasanova, R., van Asseldonk, R., Boukortt, 1173 S., Gomez, S., Fischbacher, T., "JPEG XL Image Coding 1174 System", August 2019. https://arxiv.org/abs/1908.03565 1176 [CLOUDFLARE] https://blog.cloudflare.com/a-solution-to-compression- 1177 oracles-on-the-web/ 1179 [SOY] https://developers.google.com/closure/templates/ 1181 Authors' Addresses 1183 Jyrki Alakuijala 1184 Google, Inc. 1186 Email: jyrki@google.com 1188 Thai Duong 1189 Google, Inc. 1191 Email: thaidn@google.com 1193 Evgenii Kliuchnikov 1194 Google, Inc. 1196 Email: eustas@google.com 1198 Robert Obryk 1199 Google, Inc. 1201 Email: robryk@google.com 1203 Zoltan Szabadka 1204 Google, Inc. 1206 Email: szabadka@google.com 1208 Lode Vandevenne (editor) 1209 Google, Inc. 1211 Email: lode@google.com