idnits 2.17.1 draft-vandevenne-shared-brotli-format-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([RFC7932]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (Aug 2018) is 2074 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '31' on line 489 Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group J. Alakuijala 3 Internet-Draft T. Duong 4 Intended Status: Informational E. Kliuchnikov 5 Expires: Feb 04, 2019 R. Obryk 6 Z. Szabadka 7 L. Vandevenne 8 Google, Inc 9 Aug 2018 11 Shared Brotli Compressed Data Format 12 draft-vandevenne-shared-brotli-format-01 14 Abstract 16 This specification defines a data format for shared brotli 17 compression, which adds support for shared dictionaries, large 18 window, patching and a container format to brotli [RFC7932]. 20 Shared dictionaries and large window support allow significant 21 compression gains compared to regular brotli, and patching allows 22 smaller patches of binary files. 24 Status of this Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. Internet-Drafts are working 28 documents of the Internet Engineering Task Force (IETF). Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. The list of current Internet-Drafts is at 31 http://datatracker.ietf.org/drafts/current. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on Feb 04, 2019. 40 Copyright Notice 42 Copyright (c) 2018 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 1.2. Intended audience . . . . . . . . . . . . . . . . . . . . 3 60 1.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 3 61 1.4. Compliance . . . . . . . . . . . . . . . . . . . . . . . . 4 62 1.5. Definitions of terms and conventions used . . . . . . . . 4 63 1.5.1. Packing into bytes . . . . . . . . . . . . . . . . . 4 64 2. Shared Brotli Overview . . . . . . . . . . . . . . . . . . . . 5 65 3. Shared Dictionaries . . . . . . . . . . . . . . . . . . . . . . 6 66 3.1. Custom Static Dictionaries . . . . . . . . . . . . . . . . 6 67 3.1.1. Transform Operations . . . . . . . . . . . . . . . . 7 68 3.2. LZ77 Dictionaries . . . . . . . . . . . . . . . . . . . . 9 69 4. Varint Encoding . . . . . . . . . . . . . . . . . . . . . . . 10 70 5. Shared Dictionary Stream . . . . . . . . . . . . . . . . . . . 10 71 6. Large Window Brotli Compressed Data Stream . . . . . . . . . . 13 72 7. Patching Format Compressed Data Stream . . . . . . . . . . . . 13 73 8. Shared Brotli Compressed Data Stream . . . . . . . . . . . . . 13 74 9. Shared Brotli Framing Format Stream . . . . . . . . . . . . . 14 75 9.1. Main Format . . . . . . . . . . . . . . . . . . . . . . . 14 76 9.2. Chunk Format . . . . . . . . . . . . . . . . . . . . . . 14 77 9.3. Metadata Format . . . . . . . . . . . . . . . . . . . . . 16 78 9.4. Chunk Specifications . . . . . . . . . . . . . . . . . . 17 79 9.4.1. Padding Chunk . . . . . . . . . . . . . . . . . . . 17 80 9.4.2. Metadata Chunk . . . . . . . . . . . . . . . . . . . 17 81 9.4.3. Data Chunk . . . . . . . . . . . . . . . . . . . . . 18 82 9.4.4. First Partial Data Chunk . . . . . . . . . . . . . . 18 83 9.4.5. Middle Partial Data Chunk . . . . . . . . . . . . . 19 84 9.4.6. Last Partial Data Chunk . . . . . . . . . . . . . . 19 85 9.4.7. Footer Metadata Chunk . . . . . . . . . . . . . . . 19 86 9.4.8. Global Metadata Chunk . . . . . . . . . . . . . . . 19 87 9.4.9. Repeat Metadata Chunk . . . . . . . . . . . . . . . 19 88 9.4.10. Central Directory Chunk . . . . . . . . . . . . . . 20 89 9.4.11. Final Footer Chunk . . . . . . . . . . . . . . . . 21 90 10. Security Considerations . . . . . . . . . . . . . . . . . . . 22 91 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 23 92 12. Informative References . . . . . . . . . . . . . . . . . . . 24 93 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 24 95 1. Introduction 97 1.1. Purpose 99 The purpose of this specification is to extend the brotli compressed 100 data format format ([RFC7932]) with new abilities that allow further 101 compression gains: 103 * Shared dictionaries allow a static shared context between 104 encoder and decoder for significant compression gains. 106 * Large window brotli allows much larger back reference distances 107 to give compression gains for files over 16MiB. 109 * Patching allows to create smaller patches of binary files 111 * The framing format is a container format that allows to store 112 multiple resources, refer to dictionaries, enable patching and 113 other filters that improve compression. 115 This document is the authoritative specification of shared brotli 116 data formats and the backwards compatible changes to brotli, and 117 defines: 119 * The data format of serialized shared dictionaries 121 * The data format of the framing format 123 * The encoding of window bits and distances for large window 124 brotli in the brotli data format 126 * The encoding of shared dictionary references in the brotli data 127 format 129 * The data format for patching with brotli 131 1.2. Intended audience 133 This specification is intended for use by software implementers to 134 compress data into and/or decompress data from the shared brotli 135 dictionary format. 137 The text of the specification assumes a basic background in 138 programming at the level of bits and other primitive data 139 representations. Familiarity with the technique of LZ77 coding is 140 helpful but not required. 142 1.3. Scope 143 This specification defines a data format for shared brotli 144 compression, which adds support for dictionaries and extended 145 features to brotli [RFC7932]. 147 1.4. Compliance 149 Unless otherwise indicated below, a compliant decompressor must be 150 able to accept and decompress any data set that conforms to all the 151 specifications presented here. A compliant compressor must produce 152 data sets that conform to all the specifications presented here. 154 1.5. Definitions of terms and conventions used 156 Byte: 8 bits stored or transmitted as a unit (same as an octet). For 157 this specification, a byte is exactly 8 bits, even on machines that 158 store a character on a number of bits different from eight. See 159 below for the numbering of bits within a byte. 161 String: a sequence of arbitrary bytes. 163 Bytes stored within a computer do not have a "bit order", since they 164 are always treated as a unit. However, a byte considered as an 165 integer between 0 and 255 does have a most- and least-significant 166 bit, and since we write numbers with the most-significant digit on 167 the left, we also write bytes with the most-significant bit on the 168 left. In the diagrams below, we number the bits of a byte so that bit 169 0 is the least-significant bit, i.e., the bits are numbered: 171 +--------+ 172 |76543210| 173 +--------+ 175 Within a computer, a number may occupy multiple bytes. All multi-byte 176 numbers in the format described here are unsigned and stored with the 177 least-significant byte first (at the lower memory address). For 178 example, the decimal 16-bit number 520 is stored as: 180 0 1 181 +--------+--------+ 182 |00001000|00000010| 183 +--------+--------+ 184 ^ ^ 185 | | 186 | + more significant byte = 2 x 256 187 + less significant byte = 8 189 1.5.1. Packing into bytes 190 This document does not address the issue of the order in which bits 191 of a byte are transmitted on a bit-sequential medium, since the final 192 data format described here is byte- rather than bit-oriented. 193 However, we describe the compressed block format below as a sequence 194 of data elements of various bit lengths, not a sequence of bytes. We 195 must therefore specify how to pack these data elements into bytes to 196 form the final compressed byte sequence: 198 * Data elements are packed into bytes in order of 199 increasing bit number within the byte, i.e., starting 200 with the least-significant bit of the byte. 201 * Data elements other than prefix codes are packed 202 starting with the least-significant bit of the data 203 element. These are referred to here as integer values 204 and are considered unsigned. 205 * Prefix codes are packed starting with the most-significant 206 bit of the code. 208 In other words, if one were to print out the compressed data as a 209 sequence of bytes, starting with the first byte at the *right* margin 210 and proceeding to the *left*, with the most-significant bit of each 211 byte on the left as usual, one would be able to parse the result from 212 right to left, with fixed-width elements in the correct MSB-to-LSB 213 order and prefix codes in bit-reversed order (i.e., with the first 214 bit of the code in the relative LSB position). 216 As an example, consider packing the following data elements into a 217 sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2, 218 prefix code 110, prefix code 10, 12-bit integer value 3628. 220 byte 2 byte 1 byte 0 221 +--------+--------+--------+ 222 |11100010|11000101|10010110| 223 +--------+--------+--------+ 224 ^ ^ ^ ^ ^ 225 | | | | | 226 | | | | +------ integer value 6 227 | | | +---------- integer value 2 228 | | +-------------- prefix code 110 229 | +---------------- prefix code 10 230 +----------------------------- integer value 3628 232 2. Shared Brotli Overview 234 Shared brotli extends brotli [RFC7932] with support for shared 235 dictionaries, larger LZ77 window and a framing format. 237 3. Shared Dictionaries 239 A shared dictionary is a piece of data shared by a compressor and 240 decompressor. The compressor can take advantage of the dictionary 241 context to encode the input in a more compact manner. The compressor 242 and the decompressor must use exactly the same dictionary. A shared 243 dictionary is specially useful to compress short input sequences. 245 A shared brotli dictionary can use two methods of sharing context: 247 * An LZ77 dictionary. The encoder and decoder could refer 248 to a given sequence of bytes. Multiple LZ77 dictionaries 249 can be set. 251 * A custom static dictionary: a word list with transforms. The 252 encoder and decoder will replace the static dictionary data 253 with the data in the shared dictionary. The original static 254 dictionary is described in Section 8 in [RFC7932]. The original 255 data from Appendix A and Appendix B of [RFC7932] will be 256 replaced. In addition, it is possible to dynamically switch 257 this dictionary based on the data compression context, and/or 258 to include a reference to the original dictionary in the custom 259 dictionary. 261 If no shared dictionary is set the decoder behaves the same as in 262 [RFC7932] on a brotli stream. 264 If a shared dictionary is set, then it can set any of: LZ77 265 dictionaries, overriding static dictionary words, and/or overriding 266 transforms. 268 3.1. Custom Static Dictionaries 270 If a custom word list is set, then the following behavior of the RFC 271 7932 decoder [RFC7932] is overridden: 273 Instead of the Static Dictionary Data from Appendix A 274 of [RFC7932], one or more word lists from the custom static 275 dictionary data are used. 277 Instead of NDBITS at the end of Appendix A, a custom 278 SIZE_BITS_BY_LENGTH per custom word list is used. 280 The copy length for a static dictionary reference must be 281 between 4 and 31 and may not be a value for which 282 SIZE_BITS_BY_LENGTH of this dictionary is 0. 284 If a custom transforms list is set without context dependency, then 285 the following behavior of the RFC 7932 decoder [RFC7932] is 286 overridden: 288 The "List of Word Transformations" from Appendix B is 289 overridden by one or more lists of custom prefixes, suffixes and 290 transform operations. 292 The transform_id must be smaller than the number of transforms 293 given in the custom transforms list. 295 If the dictionary is context dependent, it includes a lookup table of 296 64 word list and transform list combinations. When resolving a static 297 dictionary word, the decoder computes the literal context id, as in 298 section 7.1. of [RFC7932]. The literal context id is used as index in 299 the lookup tables to select the word list and transforms to use. If 300 the dictionary is not context dependent, this id is implicitely 0 301 instead. 303 If a distance goes beyond the dictionary for the current id and 304 multiple word list / transform list combinations are defined, then a 305 next dictionary is used in the following order: if not context 306 dependent, the same order as defined in the shared dictionary. If 307 context dependent, the index matching the current context is used 308 first, the same order as defined in the shared dictionary excluding 309 the current context are used next. 311 3.1.1. Transform Operations 313 A shared dictionary may include custom word transformations, to 314 replace those specified in Section 8 and Appendix B of [RFC7932]. A 315 transform consists of a possible prefix, a transform operation, for 316 some operations a parameter, and a possible suffix. In the shared 317 dictionary format, the transform operation is represented by a 318 numerical ID, listed in the table below. 320 ID Operation 321 -- --------- 322 0 Identity 323 1 OmitLast1 324 2 OmitLast2 325 3 OmitLast3 326 4 OmitLast4 327 5 OmitLast5 328 6 OmitLast6 329 7 OmitLast7 330 8 OmitLast8 331 9 OmitLast9 333 10 FermentFirst 334 11 FermentAll 335 12 OmitFirst1 336 13 OmitFirst2 337 14 OmitFirst3 338 15 OmitFirst4 339 16 OmitFirst5 340 17 OmitFirst6 341 18 OmitFirst7 342 19 OmitFirst8 343 20 OmitFirst9 344 21 ShiftFirst (by PARAMETER) 345 22 ShiftAll (by PARAMETER) 347 Operations 0 to 20 are specified in Section 8 in [RFC7932]. 348 ShiftFirst and ShiftAll transform specifically encoded SCALARs. 350 A SCALAR is a 7-, 11-, 16- or 21-bit unsigned integer encoded with 1, 351 2, 3 or 4 bytes respectively with following bit contents: 353 7-bit SCALAR: 354 +--------+ 355 |0sssssss| 356 +--------+ 358 11-bit SCALAR: 359 +--------+--------+ 360 |110sssss|XXssssss| 361 +--------+--------+ 363 16-bit SCALAR: 364 +--------+--------+--------+ 365 |1110ssss|XXssssss|XXssssss| 366 +--------+--------+--------+ 368 21-bit SCALAR: 369 +--------+--------+--------+--------+ 370 |11110sss|XXssssss|XXssssss|XXssssss| 371 +--------+--------+--------+--------+ 373 Given the input bytes matching SCALAR encoding pattern, the SCALAR 374 value is obtained by concatenation of the "s" bits, with the most 375 significant bits coming from the earliest byte. The "X" bits could 376 have arbitrary value. 378 An ADDEND is defined as the result of limited sign extension of 379 16-bit unsigned PARAMETER: 381 At first the PARAMETER is zero-extended to 32 bits. After this, 382 if the resulting value is greater or equal than 0x8000, 383 then 0xFF0000 is added. 385 ShiftAll starts at the beginning of the word and repetitively applies 386 the following transform until the whole word is transformed: 388 If the next untransformed byte matches the first byte of the 7-, 389 11-, 16- or 21-bit SCALAR pattern, then: 391 If the untransformed part of the word is not long enough to 392 match the whole SCALAR pattern, then the whole word is 393 marked as transformed. 395 Otherwise, let SHIFTED be the sum of the ADDEND and the 396 encoded SCALAR. The lowest bits from SHIFTED 397 are written back into the corresponding "s" bits. The "0", 398 "1" and "X" bits remain unchanged. Next, 1, 2, 3 or 399 4 not transformed bytes marked as transformed, according to 400 the SCALAR pattern length. 402 Otherwise, the next untransformed byte is marked as transformed. 404 ShiftFirst applies the same transform as ShiftAll, but does not 405 iterate. 407 3.2. LZ77 Dictionaries 409 If an LZ77 dictionary is set, then the decoder treats this as a 410 regular LZ77 copy, but behaves as if the bytes of this dictionary are 411 accessible as the uncompressed bytes outside of the regular LZ77 412 window for backwards references. 414 Let LZ77_DICTIONARY_LENGTH be the length of the LZ77 dictionary. 415 Then word_id, described in Section 8 in [RFC7932], is redefined as: 417 word_id = distance - (max allowed distance + 1 + 418 LZ77_DICTIONARY_LENGTH) 420 For the case when LZ77_DICTIONARY_LENGTH is 0, word_id matches the 421 [RFC7932] definition. 423 Let dictionary_address be 425 LZ77_DICTIONARY_LENGTH + max allowed distance - distance 427 Then distance values of pairs [RFC7932] in range 428 (max allowed distance + 1)..(LZ77_DICTIONARY_LENGTH + max allowed 429 distance) are interpreted as references starting in the LZ77 430 dictionary at the byte at dictionary_address. If length is longer 431 than (LZ77_DICTIONARY_LENGTH - dictionary_address), then the 432 reference continues to copy (length - LZ77_DICTIONARY_LENGTH + 433 dictionary_address) bytes from the regular LZ77 window starting at 434 the beginning. 436 4. Varint Encoding 438 A varint is encoded in base 128 in one ore more bytes as follows: 440 +--------+--------+ +--------+ 441 |1xxxxxxx|1xxxxxxx| {0-8 times} |0xxxxxxx| 442 +--------+--------+ +--------+ 444 where the "x" bits of the first byte are the least significant bits 445 of the value and the "x" bits of the last byte are the most 446 significant bits of the value. The last byte must have its MSB set to 447 0, all other bytes to 1 to indicate there is a next byte. 449 The maximum allowed amount of bits to read is 63 bits, if the 9th 450 byte is present and has its MSB set then the stream must be 451 considered as invalid. 453 5. Shared Dictionary Stream 455 The shared dictionary stream encodes a custom dictionary for brotli 456 including custom words and/or custom transformations. A shared 457 dictionary may appear standalone or as contents of a resource in a 458 framing format container. 460 A compliant shared brotli dictionary stream must have the following 461 format: 463 2 bytes: file signature, in hexadecimal the bytes 91, 0. 465 varint: LZ77_DICTIONARY_LENGTH, number of bytes for a LZ77 466 dictionary, or 0 if there is none. 467 The maximum allowed value is the maximum possible sliding 468 window size of brotli or of large window brotli. 470 LZ77_DICTIONARY_LENGTH bytes: contents of the LZ77 dictionary. 472 1 byte: NUM_CUSTOM_WORD_LISTS, may have value 0 to 64 474 NUM_CUSTOM_WORD_LISTS times a word list, with the following 475 format for each word list: 477 28 bytes: SIZE_BITS_BY_LENGTH, array of 28 unsigned 8-bit 478 integers, indexed by word lengths 4 to 31. The value 479 represents log2(number of words of this length), 480 with the exception of 0 meaning 0 words of this 481 length. The max allowed length value is 15 bits. 482 OFFSETS_BY_LENGTH is computed from this as 483 OFFSETS_BY_LENGTH[i + 1] = OFFSETS_BY_LENGTH[i] + 484 (SIZE_BITS_BY_LENGTH[i] ? (i << 485 SIZE_BITS_BY_LENGTH[i]) : 0) 487 N bytes: words dictionary data, where N is 488 OFFSETS_BY_LENGTH[31] + (SIZE_BITS_BY_LENGTH[31] ? 489 (31 << SIZE_BITS_BY_LENGTH[31]) : 0), first all the 490 words of shortest length, then all words of the next 491 length, and so on, where for each length there are 492 either 0 or a positive power of two amount of words. 494 1 byte: NUM_CUSTOM_TRANSFORM_LISTS, may have value 0 to 64 496 NUM_CUSTOM_TRANSFORM_LISTS times a transform list, with the 497 following format for each transform list: 499 2 bytes: PREFIX_SUFFIX_LENGTH, the length of prefix/suffix 500 data. Must be at least 1 because the list must 501 always end with a zero-length stringlet even 502 if empty. 504 NUM_PREFIX_SUFFIX times: prefix/suffix stringlet. 505 NUM_PREFIX_SUFFIX is the amount of stringlets parsed and 506 must be in range 1..256. 508 1 byte: STRING_LENGTH, the length of the entry contents. 509 0 for the last (terminating) entry of the 510 transform list. For other entries STRING_LENGTH 511 must be in range 1..255. The 0 entry must be 512 present and must be the last byte of the 513 PREFIX_SUFFIX_LENGTH bytes of prefix/suffix 514 data, else the stream must be rejected as 515 invalid. 517 STRING_LENGTH bytes: contents of the prefix/suffix. 519 1 byte: NTRANSFORMS, amount of transformation triplets. 521 NTRANSFORMS times: data for each transform: 523 1 byte: index of prefix in prefix/suffix data; 524 must be less than NUM_PREFIX_SUFFIX. 526 1 byte: index of suffix in prefix/suffix data; 527 must be less than NUM_PREFIX_SUFFIX. 529 1 byte: operation index, must be an index in the table of 530 operations listed in the chapter 531 "Transform Operations". 533 If and only if at least one transform has operation index 534 ShiftFirst or ShiftAll: 536 NTRANSFORMS times: 538 2 bytes: parameters for the transform. If the transform 539 does not have type ShiftFirst or ShiftAll, the 540 value must be 0. ShiftFirst and ShiftAll 541 interpret these bytes as an unsigned 16-bit 542 integer. 544 if NUM_CUSTOM_WORD_LISTS > 0 or NUM_CUSTOM_TRANSFORM_LISTS > 0 545 (else implicitly NUM_DICTIONARIES is 1 and points to the 546 brotli built-in and there is no context map) 548 1 byte: NUM_DICTIONARIES, may have value 1 to 64. Each 549 dictionary is a combination of a word list and a 550 transform list. Each next dictionary is used when the 551 distance goes beyond the previous. If a CONTEXT_MAP is 552 enabled, then the dictionary matching the context is 553 moved to the front in the order for this context. 555 NUM_DICTIONARIES times: the DICTIONARY_MAP: 557 1 byte: index into a custom word list, or value 558 NUM_CUSTOM_WORD_LISTS to indicate to use the brotli 559 [RFC7932] built-in default word list 561 1 byte: index into a custom transform list, or value 562 NUM_CUSTOM_TRANSFORM_LISTS to indicate to use the 563 brotli [RFC7932] built-in default transform list 565 1 byte: CONTEXT_ENABLED, if 0 there is no context map, if 1 a 566 context map used to select the dictionary is encoded 567 below 569 If CONTEXT_ENABLED is 1, a context map for the 64 brotli 570 [RFC7932] literals contexts: 572 64 bytes: CONTEXT_MAP, index into the DICTIONARY_MAP for 573 the first dictionary to use for this context 575 6. Large Window Brotli Compressed Data Stream 577 Large window brotli allows a sliding window beyond the 24-bit maximum 578 of regular brotli [RFC7932]. 580 The compressed data stream is backwards compatible to brotli 581 [RFC7932], and may optionally have the following differences: 583 Encoding of WBITS in the stream header: the following new 584 pattern of 14 bits is supported: 586 8 bits: value 00010001, to indicate a large window 587 brotli stream 589 6 bits: WBITS, must have value in range 10 to 62 591 Distance alphabet: if the stream is a large window brotli 592 stream, the maximum number of extra bits is 62 and the 593 theoretical maximum size of the distance alphabet is 594 (16 + NDIRECT + (62 << (NPOSTFIX + 1))). However the 595 alphabet is not allowed to contain a distance symbol 596 able to represent a distance larger than 9223372036854775804 597 when its extra bits have their maximum value. 599 A decoder that does not support 64-bit integers may reject a stream 600 if WBITS is higher than 30 or a distance symbol from the distance 601 alphabet is able to encode a distance larger than 2147483644. 603 7. Patching Format Compressed Data Stream 605 TBD 607 8. Shared Brotli Compressed Data Stream 609 The format of a shared brotli compressed data stream without framing 610 format is backwards compatible with brotli [RFC7932], with the 611 following optional differences: 613 *) LZ77 dictionaries as described above are supported 615 *) Custom static dictionaries replacing or extending the static 616 dictionary of brotli [RFC7932] with different words or 617 transforms are supported 619 *) The stream may have the format of regular brotli [RFC7932], 620 or the format of large window brotli as described in section 621 6, or the format of the patching stream described in 622 section 7 624 9. Shared Brotli Framing Format Stream 626 A compliant shared brotli framing format stream has the format 627 described below. 629 9.1. Main Format 631 4 bytes: file signature, in hexadecimal the bytes 91, 0a, 42, 52. 632 The first byte contains the invalid WBITS combination for 633 brotli [RFC7932] and large window brotli. 635 1 byte: container flags, 8 bits with meanings: 637 bit 0 and 1: version indicator, must be 00 639 bit 2: if 0, the file contains no final footer, may not contain 640 any metadata chunks, may not contain a central directory, 641 and may encode only a single resource (using one or more 642 data chunks). If 1, the file may contain one or more 643 resources, metadata, central directory, and must contain a 644 final footer. 646 multiple times: a chunk, each with the format specified in section 647 9.2 649 9.2. Chunk Format 651 varint: length of this chunk excluding this varint but 652 including all next header bytes and data. If the value 653 is 0, then the chunk type byte is not present and the 654 chunk type is assumed to be 0. 656 1 byte: CHUNK_TYPE 657 0: padding chunk 658 1: metadata chunk 659 2: data chunk 660 3: first partial data chunk 661 4: middle partial data chunk 662 5: last partial data chunk 663 6: footer metadata chunk 664 7: global metadata chunk 665 8: repeat metadata chunk 666 9: central directory chunk 667 10: final footer 669 if CHUNK_TYPE is not padding chunk or final footer: 671 1 byte: CODEC: 673 0: uncompressed 675 1: keep decoder 677 2: brotli 679 3: shared brotli 681 if CODEC is not "uncompressed": 683 varint: uncompressed size in bytes of the data contained 684 within the compressed stream 686 if CODEC is "shared brotli" 688 1 byte: amount of dictionary references. Multiple dictionary 689 references are possible with the following 690 restrictions: there can be maximum 1 serialized 691 dictionary, maximum 1 patching file, and maximum 15 692 prefix dictionaries (a serialized dictionary may 693 already contain one of those, and a patching file 694 also takes up a prefix dictionary) 696 per dictionary reference: 698 1 byte: flags: 700 bit 0: if false, internal dictionary reference (by 701 pointer to resource), if true, dictionary 702 reference by hash code (could be internal or 703 external) 705 bit 1 and 2: dictionary type: 707 00: prefix dictionary, set in front of the sliding 708 window 710 01: serialized dictionary in the shared brotli 711 format as specified in section 5. 713 10: file to apply patching algorithm to. The 714 compressed stream then has the format specified 715 in section 7. 717 11: invalid bit combination 719 bit 3-7: must be 0 721 if hash-based: 4+32 bytes "hw3-" and 256-bit Highwayhash 722 checksum checksum to refer to dictionary 724 if internal index based: varint encoded pointer to its 725 entry in this container. The entry must be a data 726 chunk. 728 X bytes: extra header bytes, depending on CHUNK_TYPE. If present, 729 they are specified in the subsequent chapters. 731 remaining bytes: the chunk contents. The uncompressed data 732 in the chunk content depends on CHUNK_TYPE 733 and is specified in the subsequent sections. 734 The compressed data has following 735 format depending on CODEC: 737 *) uncompressed: the raw bytes 739 *) if "keep decoder", the continuation of the compressed 740 stream which was interrupted at the end of the previous 741 chunk. The decoder from the previous chunk must be used 742 and its state it had at the end of the previous chunk 743 must be kept at the start of the decoding of this chunk. 745 *) brotli: the bytes are in brotli format 746 [RFC7932] 748 *) shared brotli: the bytes are in the 749 shared brotli format specified in section 750 8 752 9.3. Metadata Format 754 All the metadata chunk types use the following format for the 755 uncompressed content: 757 Per field: 759 2 bytes: code to identify this metadata field. This must be 760 two lowercase or two uppercase alpha ascii 761 characters. If the decoder encounters a lowercase 762 field that it does not recognise for the current 763 chunk type, non-ascii characters or non-alpha 764 characters, the decoder must reject the data stream 765 as invalid. Uppercase codes may be used for custom 766 user metadata and can be ignored by a compliant 767 decoder. 769 varint: length of the content of this field in bytes, 770 excluding the code bytes and this varint 772 N bytes: the contents of this field 774 The last field is reached when the chunk content end is reached. If 775 the length of the last field does not end at the same byte as the end 776 of the uncompressed content of the chunk, the decoder must reject the 777 data stream as invalid. 779 9.4. Chunk Specifications 781 9.4.1. Padding Chunk 783 All bytes in this chunk must be zero, except for the initial varint 784 that specifies the remaining chunk length. 786 Since the varint itself takes up bytes as well, when the goal is to 787 introduce an amount of padding bytes, the dependence of the length of 788 the varint on the value it encodes must be taken into account. 790 A single byte varint with value 0 is a padding chunk of length 1. 791 For more padding, use higher varint values. Do not use multiple 792 shorter padding chunks, since this is slower to decode. 794 9.4.2. Metadata Chunk 796 This chunk contains metadata that applies to the resource whose 797 beginning is encoded in the subsequent data chunk or first partial 798 data chunk. 800 The contents of this chunk follows the format described in chapter 801 9.3. 803 The following field types are recognised: 805 id: name field. May appear 0 or 1 times. Has the following 806 format: 808 N bytes: name in UTF-8 encoding, length determined by the 809 field length. Treated generically but may be used as 810 filename. If used as filename, forward slashes '/' 811 should be used as directory separator, relative paths 812 should be used and filenames ending in a slash with 813 0-length content in the matching data chunk should be 814 treated as an empty directory. 816 mt: modification type. May appear 0 or 1 times. Has the following 817 format: 819 8 bytes: microseconds since epoch, as a little endian signed 820 twos complement 64-bit integer 822 custom user field: any two uppercase ASCII characters. 824 9.4.3. Data Chunk 826 A data chunk contains the actual data of a resource. 828 This chunk has the following extra header bytes: 830 1 byte: flags: 832 bit 0: if true, indicates this is not a resource that should be 833 output implicitly as part of extracting resources from 834 this container. Instead, it may be referred to only 835 explicitly, e.g. as a dictionary reference by hash code 836 or offset. This flag should be set for data used as 837 dictionary to improve compression of actual resources. 839 bit 1: if true, hash code is given 841 bits 2-7: must be zero 843 if hash code is given: 845 36 bytes: "hw3-" plus 32 bytes highwayhash-256 checkum of the 846 uncompressed data 848 The uncompressed content bytes of this chunk are the actual data of 849 the resource. 851 9.4.4. First Partial Data Chunk 853 This chunk contains partial data of a resource. This is the first 854 chunk in a series containing the entire data of the resource. 856 The format of this chunk is the same as the format of a Data Chunk 857 (chapter 9.4.3) 859 The uncompressed data size and hash code given by this chunk are only 860 of this part of the resource, not of the full resource. 862 9.4.5. Middle Partial Data Chunk 864 This chunk contains partial data of a resource, and is neither the 865 first nor the last part of the full resource. 867 The format of this chunk is the same as the format of a First Data 868 Chunk (chapter 9.4.4), except bit 0 of the flags must be 0. Only the 869 first chunk specifies the "no output" flag. 871 9.4.6. Last Partial Data Chunk 873 This chunk contains the final piece of partial data of a resource. 875 The format is the same as for Partial Data Chunk, Middle (chapter 876 9.4.5). 878 The type of this chunk indicates that there are no further chunk 879 encoding this resource, so the full resource is now known. 881 9.4.7. Footer Metadata Chunk 883 This metadata applies to the resource whose encoding ended in the 884 preceding data chunk or last partial data chunk. 886 The contents of this chunk follows the format described in chapter 887 9.3. 889 There are no lowercase field types defined for footer metadata. 890 Uppercase field types can be used as custom user data. 892 9.4.8. Global Metadata Chunk 894 This metadata applies to the whole container instead of a single 895 resource. 897 The contents of this chunk follows the format described in chapter 898 9.3. 900 There are no lowercase field types defined for footer metadata. 901 Uppercase field types can be used as custom user data. 903 9.4.9. Repeat Metadata Chunk 905 These chunks optionally repeat metadata that is interleaved between 906 data chunks. To use these chunks, it is necessary to also read 907 additional information, such as pointers to the original chunks, from 908 the central directory. 910 The contents of this chunk follows the format described in chapter 911 9.3. 913 This chunk has an extra header byte: 915 1 byte: chunk type of repeated chunk (metadata chunk 916 or footer metadata chunk) 918 This set of chunks must follow the following restrictions: 920 It is optional whether or not repeat metadata chunks are 921 present. 923 If they are present, then they must be present for all 924 metadata chunks and footer metadata chunks. 926 There may be only 1 repeat metadata chunk per repeated metadata 927 chunk. 929 They must appear in the same order as the chunks appear in the 930 container, which is also the same order as listed in the 931 central directory. 933 The fields contained in this metadata chunk must follow the following 934 restrictions: 936 If a field is present, it must 937 exactly match the corresponding field of the copied chunk. 939 It is allowed to leave out a field that is present 940 in the copied chunk. 942 If a field is present, then it must be present in *all* other 943 repeat metadata chunks when the copied chunk contains this 944 field. In other words, if you know you can get the name field 945 from a repeat chunk, you know that you will be able to get all 946 names of all resources from all repeat chunks. 948 9.4.10. Central Directory Chunk 950 The central directory chunk, along with the repeat metadata chunks, 951 allow to quickly find and list compressed resources in the container 952 file. 954 The contents of the central directory are as follows: 956 varint: pointer into the file where the repeat metadata chunks are 957 located, or 0 if they are not present 959 per chunk listed: 961 varint: pointer into the file where this chunk begins 963 varint: amount of header bytes N used below 965 N bytes: copy of all the header bytes of the pointed at chunk, 966 including total size, chunk type byte, X extra header 967 bytes, codec, dictionary references, uncompressed 968 size. The content is not repeated here. 970 The last listed chunk is reached when the end of the contents of the 971 central directory are reached. If the end does not match the last 972 byte of the central directory, the decoder must reject the data 973 stream as invalid. 975 If present, the central directory must list all chunks of type data, 976 first partial data, middle partial data, last partial data, metadata 977 and footer metadata. 979 9.4.11. Final Footer Chunk 981 Chunk that closes the file, only present if in the initial container 982 header flags bit 2 was set. 984 This chunk has the following content, always uncompressed: 986 reversed varint: size of this entire framing format file, 987 including these bytes themselves, or 0 if this 988 size is not given 990 reversed varint: pointer to the start of the central directory, 991 or 0 if there is none 993 A reversed varint has the same format as a varint, but has its bytes 994 in reversed order and is designed to be parsed from end of file 995 towards the beginning. 997 9.4.12. Chunk ordering 999 The chunk ordering must follow the rules described below, if the 1000 decoder sees otherwise, it must reject the data stream as invalid. 1002 Padding chunks may be inserted anywhere, even between chunks for 1003 which the rules below say no other chunk types may come in 1004 between. 1006 Metadata chunks must come immediately before the Data chunks of 1007 the resource they apply to. 1009 Footer metadata chunks must come immediately after the Data 1010 chunks of the resource they apply to. 1012 There may be only 0 or 1 metadata chunks per resource. 1014 There may be only 0 or 1 footer metadata chunks per resource. 1016 A resource must exist out of either 1 data chunk, or 1 first 1017 partial data chunk, 0 or more middle partial data 1018 chunks, and 1 last partial data chunk, in that order. 1020 Repeat metadata chunks must follow the rules of section 9.4.9. 1022 There may be only 0 or 1 central directory chunks. 1024 If bit 2 of the container flags is set, there may be only a 1025 single resource, no metadata chunks of any type, no central 1026 directory, and no final footer. 1028 If bit 2 of the container flags is not set, there must be exactly 1029 1 final footer chunk and it must be the last chunk in the file. 1031 10. Security Considerations 1033 The security considerations for brotli [RFC7932] apply to shared 1034 brotli as well. 1036 In addition, the same considerations apply to the decoding of new 1037 file format streams for shared brotli, including shared dictionaries, 1038 the framing format and the shared brotli format. 1040 The dictionary must be treated with the same security precautions as 1041 the content, because a change to the dictionary can result in a 1042 change to the decompressed content. 1044 The CRIME attack shows that it's a bad idea to compress data from 1045 mixed (e.g. public and private) sources -- the data sources include 1046 not only the compressed data but also the dictionaries. For example, 1047 if you compress secret cookies using a public-data-only dictionary, 1048 you still leak information about the cookies. 1050 Not only can the dictionary reveal information about the compressed 1051 data, but vice versa, data compressed with the dictionary can reveal 1052 the contents of the dictionary when an adversary can control parts of 1053 data to compress and see the compressed size. On the other hand, if 1054 the adversary can control the dictionary, the adversary can learn 1055 information about the compressed data. 1057 The most robust defense against CRIME is not to compress private data 1058 (e.g., sensitive headers like cookies or any content with PII). The 1059 challenge has been to identify secrets within a vast amount of to be 1060 compressed data. Cloudflare uses a regular expression [CLOUDFLARE]. 1061 Another idea is to extend existing web template systems (e.g., Soy 1062 [SOY]) to allow developers to mark secrets that must not be 1063 compressed. 1065 A less robust idea, but easier to implement, is to randomize the 1066 compression algorithm, i.e., adding randomly generated padding, 1067 varying the compression ratio, etc. The tricky part is to find the 1068 right balance between cost and security, i.e., on one hand we don't 1069 want to add too much padding because it adds a cost to data, on the 1070 other hand we don't want to add too little because the adversary can 1071 detect a small amount of padding with traffic analysis. 1073 Another defense in addition is to not use dictionaries for cross- 1074 domain requests, and only use shared brotli for the response when the 1075 origin is the same as where the content is hosted (using CORS). This 1076 prevents an adversary to use a private dictionary with user secrets 1077 to compress content hosted on the adversary's origin. It also helps 1078 prevent CRIME attacks that try to benefit from a public dictionary by 1079 preventing data compression with dictionaries for requests that do 1080 not originate from the host itself. 1082 The content of the dictionary itself should not be affected by 1083 external users, allowing adversaries to control the dictionary allows 1084 a form of chosen plaintext attack. Instead, only base the dictionary 1085 on content you control or generic large scale content such as a 1086 spoken language, and update the dictionary with large time intervals 1087 (days, not seconds) to prevent fast probing. 1089 11. IANA Considerations 1091 The "HTTP Content Coding Registry" has been updated with the 1092 registration below: 1094 +-------+-------------------------------------+------------+ 1095 | Name | Description | Reference | 1096 +-------+-------------------------------------+------------+ 1097 | sbr | Shared Brotli Compressed Data Format| RFCXXXX | 1098 +-------+-------------------------------------+------------+ 1100 12. Informative References 1102 [RFC7932] Alakuijala, J., Szabadka, Z., "Brotli Compressed Data 1103 Format", RFC 7932, Google, Inc., July 2016. 1104 http://www.ietf.org/rfc/rfc7932.txt 1106 [CLOUDFLARE] https://blog.cloudflare.com/a-solution-to-compression- 1107 oracles-on-the-web/ 1109 [SOY] https://developers.google.com/closure/templates/ 1111 Authors' Addresses 1113 Jyrki Alakuijala 1114 Google, Inc. 1116 Email: jyrki@google.com 1118 Thai Duong 1119 Google, Inc. 1121 Email: thaidn@google.com 1123 Evgenii Kliuchnikov 1124 Google, Inc. 1126 Email: eustas@google.com 1128 Robert Obryk 1129 Google, Inc. 1131 Email: robryk@google.com 1133 Zoltan Szabadka 1134 Google, Inc. 1136 Email: szabadka@google.com 1138 Lode Vandevenne (editor) 1139 Google, Inc. 1141 Email: lode@google.com