idnits 2.17.1 draft-kucherawy-dispatch-zstd-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (July 15, 2018) is 2112 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '36' on line 1089 -- Looks like a reference, but probably isn't: '53' on line 1099 -- Looks like a reference, but probably isn't: '29' on line 1116 -- Looks like a reference, but probably isn't: '5' on line 1481 -- Looks like a reference, but probably isn't: '0' on line 1499 -- Looks like a reference, but probably isn't: '1' on line 1499 Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Y. Collet 3 Internet-Draft M. Kucherawy, Ed. 4 Intended status: Informational Facebook 5 Expires: January 16, 2019 July 15, 2018 7 Zstandard Compression and The application/zstd Media Type 8 draft-kucherawy-dispatch-zstd-03 10 Abstract 12 Zstandard, or "zstd" (pronounced "zee standard"), is a data 13 compression mechanism. This document describes the mechanism, and 14 registers a media type and content encoding to be used when 15 transporting zstd-compressed content via Multipurpose Internet Mail 16 Extensions (MIME). 18 Despite use of the word "standard" as part of its name, readers are 19 advised that this document is not an Internet Standards Track 20 specification, and is being published for informational purposes 21 only. 23 Status of This Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79. 28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at http://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on January 16, 2019. 40 Copyright Notice 42 Copyright (c) 2018 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 4 60 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . . 5 61 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . . 5 62 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . . 6 63 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . . 11 64 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 12 65 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . . 26 66 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . . 27 67 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . . 27 68 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . . 28 69 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 70 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 29 71 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . . 32 72 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . . 32 73 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 34 74 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . . 35 75 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 35 76 4.2.2. Huffman-coded Streams . . . . . . . . . . . . . . . . 36 77 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 37 78 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 79 6.1. The 'application/zstd' Media Type . . . . . . . . . . . . 39 80 6.2. Content Encoding . . . . . . . . . . . . . . . . . . . . . 40 81 6.3. Dictionaries . . . . . . . . . . . . . . . . . . . . . . . 40 82 7. Security Considerations . . . . . . . . . . . . . . . . . . . 40 83 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 41 84 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 42 85 9.1. Normative References . . . . . . . . . . . . . . . . . . . 42 86 9.2. Informative References . . . . . . . . . . . . . . . . . . 42 87 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 43 88 Appendix B. Decoding Tables for Predefined Codes . . . . . . . . 43 89 B.1. Literal Length Code Table . . . . . . . . . . . . . . . . 44 90 B.2. Match Length Code Table . . . . . . . . . . . . . . . . . 46 91 B.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 49 93 1. Introduction 95 Zstandard, or "zstd" (pronounced "zee standard") is a data 96 compression mechanism, akin to gzip [RFC1952]. 98 Despite use of the word "standard" as part of its name, readers are 99 advised that this document is not an Internet Standards Track 100 specification, and is being published for informational purposes 101 only. 103 This document describes the Zstandard format. Also, to enable the 104 transport of a data object compressed with Zstandard, this document 105 registers a media type that can be used to identify such content when 106 it is used in a payload encoded using Multipurpose Internet Mail 107 Extensions (MIME). 109 2. Definitions 111 Some terms used elsewhere in this document are defined here for 112 clarity. 114 uncompressed: Describes an arbitrary set of bytes in their original 115 form, prior to being subjected to compression. 117 compress, compression: The act of processing a set of bytes via the 118 compression mechanism described here. 120 compressed: Describes the result of passing a set of bytes through 121 this mechanism. The original input has thus been compressed. 123 decompress, decompression: The act of processing a set of bytes 124 through the inverse of the compression mechanism described here, 125 in an attempt to recover the original set of bytes prior to 126 compression. 128 decompressed: Describes the result of passing a set of bytes through 129 the reverse of this mechanism. When this is successful, the 130 decompressed payload and the uncompressed payload are 131 indistinguishable. 133 encode: The process of translating data from one form to another; 134 this may include compression or it may refer to other translations 135 done as part of this specification. 137 decode: The reverse of "encode"; describes a process of reversing a 138 prior encoding to recover the original content. 140 frame: Content compressed by Zstandard is transformed into a 141 Zstandard frame. Multiple frames can be appended into a single 142 file or stream. A frame is completely independent, has a defined 143 beginning and end, and a set of parameters which tells the decoder 144 how to decompress it. 146 block: A frame encapsulates one or multiple blocks. Each block can 147 be compressed or not, and has a guaranteed maximum content size, 148 which depends on frame parameters. Unlike frames, each block 149 depends on previous blocks for proper decoding. However, each 150 block can be decompressed without waiting for its successor, 151 allowing streaming operations. 153 natural order: A sequence or ordering of objects or values that is 154 typical of that type of object or value. A set of unique 155 integers, for example, is in "natural order" if when progressing 156 from one element in the set or sequence to the next, there is 157 never a decrease in value. 159 The naming convention for identifiers within the specification is 160 Mixed_Case_With_Underscores. Identifiers inside square brackets 161 indicate the identifier is optional in the presented context. 163 3. Compression Algorithm 165 This section describes the Zstandard algorithm. 167 The purpose of this document is to define a lossless compressed data 168 format, that is independent of CPU type, operating system, file 169 system and character set, and is suitable for file compression, pipe 170 and streaming compression, using the Zstandard algorithm. The text 171 of the specification assumes a basic background in programming at the 172 level of bits and other primitive data representations. 174 The data can be produced or consumed, even for an arbitrarily long 175 sequentially presented input data stream, using only an a priori 176 bounded amount of intermediate storage, and hence can be used in data 177 communications. The format uses the Zstandard compression method, 178 and optional xxHash-64 checksum method [XXHASH], for detection of 179 data corruption. 181 The data format defined by this specification does not attempt to 182 allow random access to compressed data. 184 Unless otherwise indicated below, a compliant compressor must produce 185 data sets that conform to the specifications presented here. 186 However, it does not need to support all options. 188 A compliant decompressor must be able to decompress at least one 189 working set of parameters that conforms to the specifications 190 presented here. It may also ignore informative fields, such as the 191 checksum. Whenever it does not support a parameter defined in the 192 compressed stream, it must produce a non-ambiguous error code and 193 associated error message explaining which parameter is unsupported. 195 This specification is intended for use by implementers of software to 196 compress data into Zstandard format and/or decompress data from 197 Zstandard format. The Zstandard format is supported by an open 198 source reference implementation, written in portable C, and available 199 at [ZSTD]. 201 3.1. Frames 203 Zstandard compressed data is made up of one or more frames. Each 204 frame is independent and can be decompressed independently of other 205 frames. The decompressed content of multiple concatenated frames is 206 the concatenation of each frame's decompressed content. 208 There are two frame formats defined for Zstandard: Zstandard frames 209 and Skippable frames. Zstandard frames contain compressed data, 210 while skippable frames contain custom user metadata. 212 3.1.1. Zstandard Frames 214 The structure of a single Zstandard frame is as follows: 216 +--------------------+------------+ 217 | Magic_Number | 4 bytes | 218 +--------------------+------------+ 219 | Frame_Header | 2-14 bytes | 220 +--------------------+------------+ 221 | Data_Block | n bytes | 222 +--------------------+------------+ 223 | [More Data_Blocks] | | 224 +--------------------+------------+ 225 | [Content_Checksum] | 0-4 bytes | 226 +--------------------+------------+ 228 Magic_Number: Four bytes, little-endian format. Value: 0xFD2FB528. 230 Frame_Header: Two to 14 bytes, detailed in Section 3.1.1.1. 232 Data_Block: Detailed in Section 3.1.1.2. This is where data 233 appears. 235 Content_Checksum: An optional 32-bit checksum, only present if 236 Content_Checksum_Flag is set. The content checksum is the result 237 of the XXH64() hash function [XXHASH] digesting the original 238 (decoded) data as input, and a seed of zero. The low four bytes 239 of the checksum are stored in little-endian format. 241 The magic number was selected to be less probable to find at the 242 beginning of an arbitrary file. It avoids trivial patterns (0x00, 243 0xFF, repeated bytes, increasing bytes, etc.), contains byte values 244 outside of ASCII range, and doesn't map into UTF-8 space, all of 245 which reduce the likelihood of its appearance at the top of a text 246 file. 248 3.1.1.1. Frame Header 250 The frame header has a variable size, with a minimum of two bytes and 251 up to 14 bytes depending on optional parameters. The structure of 252 Frame_Header is as follows: 254 +-------------------------+-----------+ 255 | Frame_Header_Descriptor | 1 byte | 256 +-------------------------+-----------+ 257 | [Window_Descriptor] | 0-1 byte | 258 +-------------------------+-----------+ 259 | [Dictionary_ID] | 0-4 bytes | 260 +-------------------------+-----------+ 261 | [Frame_Content_Size] | 0-8 bytes | 262 +-------------------------+-----------+ 264 3.1.1.1.1. Frame_Header_Descriptor 266 The first header's byte is called the Frame_Header_Descriptor. It 267 describes which other fields are present. Decoding this byte is 268 enough to tell the size of Frame_Header. 270 +------------+-------------------------+ 271 | Bit Number | Field Name | 272 +------------+-------------------------+ 273 | 7-6 | Frame_Content_Size_Flag | 274 +------------+-------------------------+ 275 | 5 | Single_Segment_Flag | 276 +------------+-------------------------+ 277 | 4 | (unused) | 278 +------------+-------------------------+ 279 | 3 | (reserved) | 280 +------------+-------------------------+ 281 | 2 | Content_Checksum_Flag | 282 +------------+-------------------------+ 283 | 1-0 | Dictionary_ID_Flag | 284 +------------+-------------------------+ 286 In this table, bit 7 is the highest bit, while bit 0 is the lowest 287 one. 289 3.1.1.1.1.1. Frame_Content_Size_Flag 291 This is a two-bit flag (equivalent to Frame_Header_Descriptor right- 292 shifted six bits) specifying whether Frame_Content_Size (the 293 decompressed data size) is provided within the header. Flag_Value 294 provides FCS_Field_Size, which is the number of bytes used by 295 Frame_Content_Size according to the following table: 297 +----------------+--------+---+---+---+ 298 | Flag_Value | 0 | 1 | 2 | 3 | 299 +----------------+--------+---+---+---+ 300 | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | 301 +----------------+--------+---+---+---+ 303 When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag: 304 If Single_Segment_Flag is set, FCS_Field_Size is 1. Otherwise, 305 FCS_Field_Size is 0; Frame_Content_Size is not provided. 307 3.1.1.1.1.2. Single_Segment_Flag 309 If this flag is set, data must be regenerated within a single 310 continuous memory segment. 312 In this case, Window_Descriptor byte is skipped, but 313 Frame_Content_Size is necessarily present. As a consequence, the 314 decoder must allocate a memory segment of size equal or larger than 315 Frame_Content_Size. 317 In order to protect the decoder from unreasonable memory 318 requirements, a decoder is allowed to reject a compressed frame that 319 requests a memory size beyond the decoder's authorized range. 321 For broader compatibility, decoders are recommended to support memory 322 sizes of at least 8 MB. This is only a recommendation; each decoder 323 is free to support higher or lower limits, depending on local 324 limitations. 326 3.1.1.1.1.3. Unused Bit 328 A decoder compliant with this specification version shall not 329 interpret this bit. It might be used in a future version, to signal 330 a property which is not mandatory to properly decode the frame. An 331 encoder compliant with this specification must set this bit to zero. 333 3.1.1.1.1.4. Reserved Bit 335 This bit is reserved for some future feature. Its value must be 336 zero. A decoder compliant with this specification version must 337 ensure it is not set. This bit may be used in a future revision, to 338 signal a feature that must be interpreted to decode the frame 339 correctly. 341 3.1.1.1.1.5. Content_Checksum_Flag 343 If this flag is set, a 32-bits Content_Checksum will be present at 344 the frame's end. See the description of Content_Checksum above. 346 3.1.1.1.1.6. Dictionary_ID_Flag 348 This is a two-bit flag (= Frame_Header_Descriptor & 0x3) indicating 349 whether a dictionary ID is provided within the header. It also 350 specifies the size of this field as DID_Field_Size: 352 +----------------+---+---+---+---+ 353 | Flag_Value | 0 | 1 | 2 | 3 | 354 +----------------+---+---+---+---+ 355 | DID_Field_Size | 0 | 1 | 2 | 4 | 356 +----------------+---+---+---+---+ 358 3.1.1.1.2. Window Descriptor 360 Provides guarantees on minimum memory buffer required to decompress a 361 frame. This information is important for decoders to allocate enough 362 memory. 364 The Window_Descriptor byte is optional. When Single_Segment_Flag is 365 set, Window_Descriptor is not present. In this case, Window_Size is 366 Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16 367 ExaBytes). 369 +-------------+----------+----------+ 370 | Bit numbers | 7-3 | 2-0 | 371 +-------------+----------+----------+ 372 | Field name | Exponent | Mantissa | 373 +-------------+----------+----------+ 375 The minimum memory buffer size is called Window_Size. It is 376 described by the following formulae: 378 windowLog = 10 + Exponent; 379 windowBase = 1 << windowLog; 380 windowAdd = (windowBase / 8) * Mantissa; 381 Window_Size = windowBase + windowAdd; 383 The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41) 384 + 7*(1<<38) bytes, which is 3.75 TB. 386 In general, larger Window_Size values tend to improve compression 387 ratio, but at the cost of increased memory usage. 389 To properly decode compressed data, a decoder will need to allocate a 390 buffer of at least Window_Size bytes. 392 In order to protect decoders from unreasonable memory requirements, a 393 decoder is allowed to reject a compressed frame which requests a 394 memory size beyond decoder's authorized range. 396 For improved interoperability, it's recommended for decoders to 397 support values of Window_Size up to 8 MB, and for encoders not to 398 generate frames requiring a Window_Size larger than 8 MB. It's 399 merely a recommendation though, and decoders are free to support 400 larger or lower limits, depending on local limitations. 402 3.1.1.1.3. Dictionary_ID 404 This is a variable size field, which contains the ID of the 405 dictionary required to properly decode the frame. This field is 406 optional. When it's not present, it's up to the decoder to know 407 which dictionary to use. 409 Dictionary_ID field size is provided by DID_Field_Size. 410 DID_Field_Size is directly derived from the value of 411 Dictionary_ID_Flag. One byte can represent an ID 0-255; two bytes 412 can represent an ID 0-65535; four bytes can represent an ID 413 0-4294967295. Format is little-endian. 415 It is permitted to represent a small ID (for example 13) with a large 416 four-byte dictionary ID, even if it is less efficient. 418 Within private environments, any dictionary ID can be used. However, 419 for frames and dictionaries distributed in public space, 420 Dictionary_ID must be attributed carefully. The following ranges are 421 reserved for use only with dictionaries that have been registered 422 with IANA (see Section 6.3): 424 low range: <= 32767 426 high range: >= (1 << 31) 428 Any other value for Dictionary_ID can be used by private arrangement 429 between participants. 431 Any payload presented for decompression that references an 432 unregistered reserved dictionary ID results in an error. 434 3.1.1.1.4. Frame Content Size 436 This is the original (uncompressed) size. This information is 437 optional. Frame_Content_Size uses a variable number of bytes, 438 provided by FCS_Field_Size. FCS_Field_Size is provided by the value 439 of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not 440 present), 1, 2, 4 or 8 bytes. 442 +----------------+--------------+ 443 | FCS Field Size | Range | 444 +----------------+--------------+ 445 | 0 | unknown | 446 +----------------+--------------+ 447 | 1 | 0 - 255 | 448 +----------------+--------------+ 449 | 2 | 256 - 65791 | 450 +----------------+--------------+ 451 | 4 | 0 - 2^32 - 1 | 452 +----------------+--------------+ 453 | 8 | 0 - 2^64 - 1 | 454 +----------------+--------------+ 456 Frame_Content_Size format is little-endian. When FCS_Field_Size is 457 1, 4 or 8 bytes, the value is read directly. When FCS_Field_Size is 458 2, the offset of 256 is added. It's allowed to represent a small 459 size (for example 18) using any compatible variant. 461 3.1.1.2. Blocks 463 After Magic_Number and Frame_Header, there are some number of blocks. 464 Each frame must have at least one block, but there is no upper limit 465 on the number of blocks per frame. 467 The structure of a block is as follows: 469 +--------------+---------------+ 470 | Block_Header | Block_Content | 471 +--------------+---------------+ 472 | 3 bytes | n bytes | 473 +--------------+---------------+ 475 Block_Header uses three bytes, written using little-endian 476 convention. It contains three fields: 478 +------------+------------+------------+ 479 | Last_Block | Block_Type | Block_Size | 480 +------------+------------+------------+ 481 | bit 0 | bits 1-2 | bits 3-23 | 482 +------------+------------+------------+ 484 3.1.1.2.1. Last_Block 486 The lowest bit (Last_Block) signals whether this block is the last 487 one. The frame will end after this last block. It may be followed 488 by an optional Content_Checksum (see Section 3.1.1). 490 3.1.1.2.2. Block_Type 492 The next two bits represent the Block_Type. There are four block 493 types: 495 +-----------+------------------+ 496 | Value | Block_Type | 497 +-----------+------------------+ 498 | 0 | Raw_Block | 499 +-----------+------------------+ 500 | 1 | RLE_Block | 501 +-----------+------------------+ 502 | 2 | Compressed_Block | 503 +-----------+------------------+ 504 | 3 | Reserved | 505 +-----------+------------------+ 507 Raw_Block: This is an uncompressed block. Block_Content contains 508 Block_Size bytes. 510 RLE_Block: This is a single byte, repeated Block_Size times. 511 Block_Content consists of a single byte. On the decompression 512 side, this byte must be repeated Block_Size times. 514 Compressed_Block: This is a compressed block as described in 515 Section 3.1.1.3. Block_Size is the length of Block_Content, 516 namely the compressed data. The decompressed size is not known, 517 but its maximum possible value is guaranteed (see below). 519 Reserved: This is not a block. This value cannot be used with the 520 current specification. If such a value is present, it is 521 considered to be corrupt data. 523 3.1.1.2.3. Block_Size 525 The upper 21 bits of Block_Header represent the Block_Size. 526 Block_Size is the size of the block excluding the header. A block 527 can contain any number of bytes (even zero), up to 528 Block_Maximum_Decompressed_Size, which is the smallest of: 530 o Window_Size 532 o 128 KB 534 A Compressed_Block has the extra restriction that Block_Size is 535 always strictly less than the decompressed size. If this condition 536 cannot be respected, the block must be sent uncompressed instead 537 (i.e., treated as a Raw_Block). 539 3.1.1.3. Compressed Blocks 541 To decompress a compressed block, the compressed size must be 542 provided from Block_Size field within Block_Header. 544 A compressed block consists of two sections: a Literals Section 545 (Section 3.1.1.3.1) and a Sequences_Section (Section 3.1.1.3.2). The 546 results of the two sections are then combined to produce the 547 decompressed data in Sequence Execution (Section 3.1.1.4). 549 To decode a compressed block, the following elements are necessary: 551 o Previous decoded data, up to a distance of Window_Size, or the 552 beginning of the Frame, whichever is smaller. Single_Segment_Flag 553 will be set in the latter case. 555 o List of "recent offsets" from the previous Compressed_Block. 557 o The previous Huffman tree, required by Treeless_Literals_Block 558 type. 560 o Previous FSE decoding tables, required by Repeat_Mode, for each 561 symbol type (literals lengths, match lengths, offsets). 563 Note that decoding tables are not always from the previous 564 Compressed_Block: 566 o Every decoding table can come from a dictionary. 568 o The Huffman tree comes from the previous 569 Compressed_Literals_Block. 571 3.1.1.3.1. Literals_Section_Header 573 All literals are regrouped in the first part of the block. They can 574 be decoded first, and then copied during Sequence Execution (see 575 Section 3.1.1.4), or they can be decoded on the flow during Sequence 576 Execution. 578 Literals can be stored uncompressed or compressed using Huffman 579 prefix codes. When compressed, an optional tree description can be 580 present, followed by one or four streams. 582 +----------------------------+ 583 | Literals_Section_Header | 584 +----------------------------+ 585 | [Huffman_Tree_Description] | 586 +----------------------------+ 587 | [Jump_Table] | 588 +----------------------------+ 589 | Stream_1 | 590 +----------------------------+ 591 | [Stream_2] | 592 +----------------------------+ 593 | [Stream_3] | 594 +----------------------------+ 595 | [Stream_4] | 596 +----------------------------+ 598 3.1.1.3.1.1. Literals_Section_Header 600 This field describes how literals are packed. It's a byte-aligned 601 variable-size bitfield, ranging from one to five bytes, using little- 602 endian convention. 604 +---------------------+-----------+ 605 | Literals_Block_Type | 2 bits | 606 +---------------------+-----------+ 607 | Size_Format | 1-2 bits | 608 +---------------------+-----------+ 609 | Regenerated_Size | 5-20 bits | 610 +---------------------+-----------+ 611 | [Compressed_Size] | 0-18 bits | 612 +---------------------+-----------+ 614 In this representation, bits at the top are the lowest bits. 616 The Literals_Block_Type field uses the two lowest bits of the first 617 byte, describing four different block types: 619 +---------------------------+-------+ 620 | Literals_Block_Type | Value | 621 +---------------------------+-------+ 622 | Raw_Literals_Block | 0 | 623 +---------------------------+-------+ 624 | RLE_Literals_Block | 1 | 625 +---------------------------+-------+ 626 | Compressed_Literals_Block | 2 | 627 +---------------------------+-------+ 628 | Treeless_Literals_Block | 3 | 629 +---------------------------+-------+ 631 Raw_Literals_Block: Literals are stored uncompressed. 632 Literals_Section_Content is Regenerated_Size. 634 RLE_Literals_Block: Literals consist of a single byte value repeated 635 Regenerated_Size times. Literals_Section_Content is one. 637 Compressed_Literals_Block: This is a standard Huffman-compressed 638 block, starting with a Huffman tree description. See details 639 below. Literals_Section_Content is Compressed_Size. 641 Treeless_Literals_Block: This is a Huffman-compressed block, using 642 the Huffman tree from the previous Compressed_Literals_Block, or a 643 dictionary if there is no previous Huffman-compressed literals 644 block. Huffman_Tree_Description will be skipped. Note that if 645 this mode is triggered without any previous Huffman-table in the 646 frame (or dictionary, per Section 5), this should be treated as 647 data corruption. Literals_Section_Content is Compressed_Size. 649 The Size_Format is divided into two families: 651 o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary 652 to decode Regenerated_Size. There is no Compressed_Size field. 654 o For Compressed_Block and Treeless_Literals_Block, it's required to 655 decode both Compressed_Size and Regenerated_Size (the decompressed 656 size). It's also necessary to decode the number of streams (1 or 657 4). 659 For values spanning several bytes, the convention is little-endian. 661 Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2 662 bits. Its value is (Literals_Section_Header[0]>>2) & 0x3. 664 Size_Format == 00 or 10: Size_Format uses one bit. Regenerated_Size 665 uses five bits (value 0-31). Literals_Section_Header uses one 666 byte. Regenerated_Size = Literal_Section_Header[0]>>3. 668 Size_Format == 01: Size_Format uses two bits. Regenerated_Size uses 669 12 bits (values 0-4095). Literals_Section_Header uses two bytes. 670 Regenerated_Size = (Literals_Section_Header[0]>>4) + 671 (Literals_Section_Header[1]<<4). 673 Size_Format == 11: Size_Format uses two bits. Regenerated_Size uses 674 20 bits (values 0-1048575). Literals_Section_Header uses three 675 bytes. Regenerated_Size = (Literals_Section_Header[0]>>4) + 676 (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12) 678 Only Stream_1 is present for these cases. Note that it is permitted 679 to represent a short value (for example 13) using a long format, even 680 if it's less efficient. 682 Size_Format for Compressed_Literals_Block and Treeless_Literals_Block 683 always uses two bits. 685 Size_Format == 00: A single stream. Both Regenerated_Size and 686 Compressed_Size use ten bits (values 0-1023). 687 Literals_Section_Header uses three bytes. 689 Size_Format == 01: Four streams. Both Regenerated_Size and 690 Compressed_Size use ten bits (values 0-1023). 691 Literals_Section_Header uses three bytes. 693 Size_Format == 10: Four streams. Both Regenerated_Size and 694 Compressed_Size use 14 bits (values 0-16383). 695 Literals_Section_Header uses four bytes. 697 Size_Format == 11: Four streams. Both Regenerated_Size and 698 Compressed_Size use 18 bits (values 0-262143). 699 Literals_Section_Header uses five bytes. 701 Both the Compressed_Size and Regenerated_Size fields follow little- 702 endian convention. Note that Compressed_Size includes the size of 703 the Huffman_Tree_Description when it is present. 705 3.1.1.3.1.2. Raw_Literals_Block 707 The data in Stream_1 is Regenerated_Size bytes long. It contains the 708 raw literals data to be used during Sequence Execution 709 (Section 3.1.1.3.2). 711 3.1.1.3.1.3. RLE_Literals_Block 713 Stream_1 consists of a single byte which should be repeated 714 Regenerated_Size times to generate the decoded literals. 716 3.1.1.3.1.4. Compressed_Literals_Block and Treeless_Literals_Block 718 Both of these modes contain Huffman encoded data. For 719 Treeless_Literals_Block the Huffman table comes the previously 720 compressed literals block, or from a dictionary. (see Section 5). 722 3.1.1.3.1.5. Huffman_Tree_Description 724 This section is only present when the Literals_Block_Type type is 725 Compressed_Literals_Block (2). The format of 726 Huffman_Tree_Description can be found in Section 4.2.1. The size of 727 Huffman_Tree_Description is determined during the decoding process. 728 It must be used to determine where streams begin. 730 Total_Streams_Size = Compressed_Size 731 - Huffman_Tree_Description_Size 733 3.1.1.3.1.6. Jump_Table 735 The Jump_Table is only present when there are four Huffman-coded 736 streams. 738 (Reminder: Huffman compressed data consists of either one or four 739 Huffman-coded streams.) 741 If only one stream is present, it is a single bitstream occupying the 742 entire remaining portion of the literals block, encoded as described 743 within Section 4.2.2. 745 If there are four streams, Literals_Section_Header only provides 746 enough information to know the decompressed and compressed sizes of 747 all four streams combined. The decompressed size of each stream is 748 equal to (Regenerated_Size+3)/4, except for the last stream which may 749 be up to three bytes smaller, to reach a total decompressed size as 750 specified in Regenerated_Size. 752 The compressed size of each stream is provided explicitly in the 753 Jump_Table. The Jump_Table is six bytes long and consists of three 754 two-byte little-endian fields, describing the compressed sizes of the 755 first three streams. Stream4_Size is computed from 756 Total_Streams_Size minus sizes of other streams. 758 Stream4_Size = Total_Streams_Size - 6 759 - Stream1_Size - Stream2_Size 760 - Stream3_Size 762 Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds 763 Total_Streams_Size, the data are considered corrupted. 765 Each of these four bitstreams is then decoded independently as a 766 Huffman-Coded stream, as described in Section 4.2.2. 768 3.1.1.3.2. Sequences_Section 770 A compressed block is a succession of sequences. A sequence is a 771 literal copy command, followed by a match copy command. A literal 772 copy command specifies a length. It is the number of bytes to be 773 copied (or extracted) from the Literals Section. A match copy 774 command specifies an offset and a length. 776 When all sequences are decoded, if there are literals left in the 777 literal section, these bytes are added at the end of the block. 779 This is described in more detail in Section 3.1.1.4. 781 The Sequences_Section regroups all symbols required to decode 782 commands. There are three symbol types: literals lengths, offsets, 783 and match lengths. They are encoded together, interleaved, in a 784 single "bitstream". 786 The Sequences_Section starts by a header, followed by optional 787 probability tables for each symbol type, followed by the bitstream. 789 Sequences_Section_Header 790 [Literals_Length_Table] 791 [Offset_Table] 792 [Match_Length_Table] 793 bitStream 795 To decode the Sequences_Section, it's necessary to know its size. 796 This size is deduced from the literals section size: 797 Sequences_Section_Size = Block_Size - Literals_Section_Header - 798 Literals_Section_Content 800 3.1.1.3.2.1. Sequences_Section_Header 802 This header consists of two items: 804 o Number_of_Sequences 806 o Symbol_Compression_Modes 808 Number_of_Sequences is a variable size field using between one and 809 three bytes. If the first byte is "byte0": 811 o if (byte0 == 0): there are no sequences. The sequence section 812 stops here. Decompressed content is defined entirely as Literals 813 Section content. The FSE tables used in Repeat_Mode are not 814 updated. 816 o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte. 818 o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) + 819 byte1. Uses 2 bytes. 821 o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) + 822 0x7F00. Uses 3 bytes. 824 Symbol_Compression_Modes is a single byte, defining the compression 825 mode of each symbol type. 827 +------------+----------------------+ 828 | Bit Number | Field Name | 829 +------------+----------------------+ 830 | 7-6 | Literal_Lengths_Mode | 831 +------------+----------------------+ 832 | 5-4 | Offsets_Mode | 833 +------------+----------------------+ 834 | 3-2 | Match_Lengths_Mode | 835 +------------+----------------------+ 836 | 1-0 | Reserved | 837 +------------+----------------------+ 839 The last field, Reserved, must be all zeroes. 841 Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define 842 the Compression_Mode of literals lengths, offsets, and match lengths 843 symbols respectively. They follow the same enumeration: 845 +-------+---------------------+ 846 | Value | Compression_Mode | 847 +-------+---------------------+ 848 | 0 | Predefined_Mode | 849 +-------+---------------------+ 850 | 1 | RLE_Mode | 851 +-------+---------------------+ 852 | 2 | FSE_Compressed_Mode | 853 +-------+---------------------+ 854 | 3 | Repeat_Mode | 855 +-------+---------------------+ 857 Predefined_Mode: A predefined FSE (see Section 4.1) distribution 858 table is used, defined in Section 3.1.1.3.2.2. No distribution 859 table will be present. 861 RLE_Mode: The table description consists of a single byte, which 862 contains the symbol's value. THis symbol will be used for all 863 sequences. 865 FSE_Compressed_Mode: Standard FSE compression. A distribution table 866 will be present. The format of this distribution table is 867 described in Section 4.1.1. Note that the maximum allowed 868 accuracy log for literals length and match length tables is 9, and 869 the maximum accuracy log for the offsets table is 8. This mode 870 must not be used when only one symbol is present; RLE_Mode should 871 be used instead (although any other mode will work). 873 Repeat_Mode: The table used in the previous Compressed_Block with 874 Number_Of_Sequences > 0 will be used again, or if this is the 875 first block, the table in the dictionary will be used. Note that 876 this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the 877 same symbol will be repeated. It also includes Predefined_Mode, 878 in which case Repeat_Mode will have the same outcome as 879 Predefined_Mode. No distribution table will be present. If this 880 mode is used without any previous sequence table in the frame (or 881 dictionary; see Section 5) to repeat, this should be treated as 882 corruption. 884 3.1.1.3.2.1.1. Sequence Codes for Lengths and Offsets 886 Each symbol is a code in its own context, which specifies Baseline 887 and Number_of_Bits to add. Codes are FSE compressed, and interleaved 888 with raw additional bits in the same bitstream. 890 Literals length codes are values ranging from 0 to 35 inclusive. 891 They define lengths from 0 to 131071 bytes. The literals length is 892 equal to the decoded Baseline plus the result of reading 893 Number_of_Bits bits from the bitstream, as a little-endian value. 895 +----------------------+----------+----------------+ 896 | Literals_Length_Code | Baseline | Number_of_Bits | 897 +----------------------+----------+----------------+ 898 | 0-15 | length | 0 | 899 +----------------------+----------+----------------+ 900 | 16 | 16 | 1 | 901 +----------------------+----------+----------------+ 902 | 17 | 18 | 1 | 903 +----------------------+----------+----------------+ 904 | 18 | 20 | 1 | 905 +----------------------+----------+----------------+ 906 | 19 | 22 | 1 | 907 +----------------------+----------+----------------+ 908 | 20 | 24 | 2 | 909 +----------------------+----------+----------------+ 910 | 21 | 28 | 2 | 911 +----------------------+----------+----------------+ 912 | 22 | 32 | 3 | 913 +----------------------+----------+----------------+ 914 | 23 | 40 | 3 | 915 +----------------------+----------+----------------+ 916 | 24 | 48 | 4 | 917 +----------------------+----------+----------------+ 918 | 25 | 64 | 6 | 919 +----------------------+----------+----------------+ 920 | 26 | 128 | 7 | 921 +----------------------+----------+----------------+ 922 | 27 | 256 | 8 | 923 +----------------------+----------+----------------+ 924 | 28 | 512 | 9 | 925 +----------------------+----------+----------------+ 926 | 29 | 1024 | 10 | 927 +----------------------+----------+----------------+ 928 | 30 | 2048 | 11 | 929 +----------------------+----------+----------------+ 930 | 31 | 4096 | 12 | 931 +----------------------+----------+----------------+ 932 | 32 | 8192 | 13 | 933 +----------------------+----------+----------------+ 934 | 33 | 16384 | 14 | 935 +----------------------+----------+----------------+ 936 | 34 | 32768 | 15 | 937 +----------------------+----------+----------------+ 938 | 35 | 65536 | 16 | 939 +----------------------+----------+----------------+ 941 Match length codes are values ranging from 0 to 52 included. They 942 define lengths from 3 to 131074 bytes. The match length is equal to 943 the decoded Baseline plus the result of reading Number_of_Bits bits 944 from the bitstream, as a little-endian value. 946 +-------------------+-----------------------+----------------+ 947 | Match_Length_Code | Baseline | Number_of_Bits | 948 +-------------------+-----------------------+----------------+ 949 | 0-31 | Match_Length_Code + 3 | 0 | 950 +-------------------+-----------------------+----------------+ 951 | 32 | 35 | 1 | 952 +-------------------+-----------------------+----------------+ 953 | 33 | 37 | 1 | 954 +-------------------+-----------------------+----------------+ 955 | 34 | 39 | 1 | 956 +-------------------+-----------------------+----------------+ 957 | 35 | 41 | 1 | 958 +-------------------+-----------------------+----------------+ 959 | 36 | 43 | 2 | 960 +-------------------+-----------------------+----------------+ 961 | 37 | 47 | 2 | 962 +-------------------+-----------------------+----------------+ 963 | 38 | 51 | 3 | 964 +-------------------+-----------------------+----------------+ 965 | 39 | 59 | 3 | 966 +-------------------+-----------------------+----------------+ 967 | 40 | 67 | 4 | 968 +-------------------+-----------------------+----------------+ 969 | 41 | 83 | 4 | 970 +-------------------+-----------------------+----------------+ 971 | 42 | 99 | 5 | 972 +-------------------+-----------------------+----------------+ 973 | 43 | 131 | 7 | 974 +-------------------+-----------------------+----------------+ 975 | 44 | 259 | 8 | 976 +-------------------+-----------------------+----------------+ 977 | 45 | 515 | 9 | 978 +-------------------+-----------------------+----------------+ 979 | 46 | 1027 | 10 | 980 +-------------------+-----------------------+----------------+ 981 | 47 | 2051 | 11 | 982 +-------------------+-----------------------+----------------+ 983 | 48 | 4099 | 12 | 984 +-------------------+-----------------------+----------------+ 985 | 49 | 8195 | 13 | 986 +-------------------+-----------------------+----------------+ 987 | 50 | 16387 | 14 | 988 +-------------------+-----------------------+----------------+ 989 | 51 | 32771 | 15 | 990 +-------------------+-----------------------+----------------+ 991 | 52 | 65539 | 16 | 992 +-------------------+-----------------------+----------------+ 994 Offset codes are values ranging from 0 to N. 996 A decoder is free to limit its maximum supported value for N. Support 997 for values of at least 22 is recommended. At the time of this 998 writing, the reference decoder supports a maximum N value of 31. 1000 An offset code is also the number of additional bits to read in 1001 little-endian fashion, and can be translated into an Offset_Value 1002 using the following formulas: 1004 Offset_Value = (1 << offsetCode) + readNBits(offsetCode); 1005 if (Offset_Value > 3) Offset = Offset_Value - 3; 1007 This means that maximum Offset_Value is (2^(N+1))-1, supporting back- 1008 reference distance up to (2^(N+1))-4, but is limited by the maximum 1009 back-reference distance (see Section 3.1.1.1.2). 1011 Offset_Value from 1 to 3 are special: they define "repeat codes". 1012 This is described in more detail in Section 3.1.1.5. 1014 3.1.1.3.2.1.2. Decoding Sequences 1016 FSE bitstreams are read in reverse direction than written. In zstd, 1017 the compressor writes bits forward into a block and the decompressor 1018 must read the bitstream backwards. 1020 To find the start of the bitstream it is therefore necessary to know 1021 the offset of the last byte of the block which can be found by 1022 counting Block_Size bytes after the block header. 1024 After writing the last bit containing information, the compressor 1025 writes a single 1-bit and then fills the byte with 0-7 zero bits of 1026 padding. The last byte of the compressed bitstream cannot be zero 1027 for that reason. 1029 When decompressing, the last byte containing the padding is the first 1030 byte to read. The decompressor needs to skip 0-7 initial zero bits 1031 until the first one bit occurs. Afterwards, the useful part of the 1032 bitstream begins. 1034 FSE decoding requires a 'state' to be carried from symbol to symbol. 1035 For more explanation on FSE decoding, see Section 4.1. 1037 For sequence decoding, a separate state keeps track of each literal 1038 lengths, offsets, and match lengths symbols. Some FSE primitives are 1039 also used. For more details on the operation of these primitives, 1040 see Section 4.1. 1042 The bitstream starts with initial FSE state values, each using the 1043 required number of bits in their respective accuracy, decoded 1044 previously from their normalized distribution. It starts with 1045 Literals_Length_State, followed by Offset_State, and finally 1046 Match_Length_State. 1048 Note that all values are read backward, so the 'start' of the 1049 bitstream is at the highest position in memory, immediately before 1050 the last one bit for padding. 1052 After decoding the starting states, a single sequence is decoded 1053 Number_Of_Sequences times. These sequences are decoded in order from 1054 first to last. Since the compressor writes the bitstream in the 1055 forward direction, this means the compressor must encode the 1056 sequences starting with the last one and ending with the first. 1058 For each of the symbol types, the FSE state can be used to determine 1059 the appropriate code. The code then defines the baseline and number 1060 of bits to read for each type. The description of the codes for how 1061 to determine these values can be found in Section 3.1.1.3.2.1. 1063 Decoding starts by reading the Number_of_Bits required to decode 1064 Offset. It then does the same for Match_Length, and then for 1065 Literals_Length. This sequence is then used for sequence execution 1066 (see Section 3.1.1.4). 1068 If it is not the last sequence in the block, the next operation is to 1069 update states. Using the rules pre-calculated in the decoding 1070 tables, Literals_Length_State is updated, followed by 1071 Match_Length_State, and then Offset_State. See Section 4.1 for 1072 details on how to update states from the bitstream. 1074 This operation will be repeated Number_of_Sequences times. At the 1075 end, the bitstream shall be entirely consumed, otherwise the 1076 bitstream is considered corrupted. 1078 3.1.1.3.2.2. Default Distributions 1080 If Predefined_Mode is selected for a symbol type, its FSE decoding 1081 table is generated from a predefined distribution table defined here. 1082 For details on how to convert this distribution into a decoding 1083 table, see Section 4.1. 1085 3.1.1.3.2.2.1. Literals Length 1087 The decoding table uses an accuracy log of 6 bits (64 states). 1089 short literalsLength_defaultDistribution[36] = 1090 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1091 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1092 -1,-1,-1,-1 1093 }; 1095 3.1.1.3.2.2.2. Match Length 1097 The decoding table uses an accuracy log of 6 bits (64 states). 1099 short matchLengths_defaultDistribution[53] = 1100 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1101 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1102 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1103 -1,-1,-1,-1,-1 1104 }; 1106 3.1.1.3.2.2.3. Offset Codes 1108 The decoding table uses an accuracy log of 5 bits (32 states), and 1109 supports a maximum N value of 28, allowing offset values up to 1110 536,870,908. 1112 If any sequence in the compressed block requires a larger offset than 1113 this, it's not possible to use the default distribution to represent 1114 it. 1116 short offsetCodes_defaultDistribution[29] = 1117 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1118 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 1119 }; 1121 3.1.1.4. Sequence Execution 1123 Once literals and sequences have been decoded, they are combined to 1124 produce the decoded content of a block. 1126 Each sequence consists of a tuple of (literals_length, offset_value, 1127 match_length), decoded as described in the Sequences_Section 1128 (Section 3.1.1.3.2). To execute a sequence, first copy 1129 literals_length bytes from the literals section to the output. 1131 Then match_length bytes are copied from previous decoded data. The 1132 offset to copy from is determined by offset_value: 1134 o if Offset_Value > 3, then the offset is Offset_Value - 3; 1135 o if Offset_Value is from 1-3, the offset is a special repeat offset 1136 value. See Section 3.1.1.5 for how the offset is determined in 1137 this case. 1139 The offset is defined as from the current position (after copying the 1140 literals), so an offset of 6 and a match length of 3 means that 3 1141 bytes should be copied from 6 bytes back. Note that all offsets 1142 leading to previously decoded data must be smaller than Window_Size 1143 defined in Frame_Header_Descriptor (Section 3.1.1.1.1). 1145 3.1.1.5. Repeat Offsets 1147 As seen above, the first three values define a repeated offset and we 1148 will call them Repeated_Offset1, Repeated_Offset2, and 1149 Repeated_Offset3. They are sorted in recency order, with 1150 Repeated_Offset1 meaning "most recent one". 1152 If offset_value is 1, then the offset used is Repeated_Offset1, etc. 1154 There is one exception: When the current sequence's literals_length 1155 is 0, repeated offsets are shifted by one, so an offset_value of 1 1156 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3, 1157 and an offset_value of 3 means Repeated_Offset1 - 1_byte. 1159 For the first block, the starting offset history is populated with 1160 the following values: Repeated_Offset1 (1), Repeatead_Offset2 (4), 1161 and Repeatead_Offset3 (8), unless a dictionary is used, in which case 1162 they come from the dictionary. 1164 Then each block gets its starting offset history from the ending 1165 values of the most recent Compressed_Block. Note that blocks that 1166 are not Compressed_Block are skipped; they do not contribute to 1167 offset history. 1169 The newest offset takes the lead in offset history, shifting others 1170 back (up to its previous place if it was already present). This 1171 means that when Repeated_Offset1 (most recent) is used, history is 1172 unmodified. When Repeated_Offset2 is used, it is swapped with 1173 Repeated_Offset1. If any other offset is used, it becomes 1174 Repeated_Offset1 and the rest are shifted back by one. 1176 3.1.2. Skippable Frames 1178 +--------------+------------+-----------+ 1179 | Magic_Number | Frame_Size | User_Data | 1180 +--------------+------------+-----------+ 1181 | 4 bytes | 4 bytes | n bytes | 1182 +--------------+------------+-----------+ 1184 Skippable frames allow the insertion of user-defined metadata into a 1185 flow of concatenated frames. 1187 Skippable frames defined in this specification are compatible with 1188 skippable frames in [LZ4]. 1190 From a compliant decoder perspective, skippable frames simply need to 1191 be skipped, and their content ignored, resuming decoding after the 1192 skippable frame. 1194 It should be noted that a skippable frame can be used to watermark a 1195 stream of concatenated frames embedding any kind of tracking 1196 information (even just a UUID). Users wary of such possibility 1197 should scan the stream of concatenated frames in an attempt to detect 1198 such frame for analysis or removal. 1200 The fields are: 1202 Magic_Number: Four bytes, little-endian format. Value: 0x184D2A5?, 1203 which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values 1204 are valid to identify a skippable frame. This specification does 1205 not detail any specific tagging methods for skippable frames. 1207 Frame_Size: This is the size, in bytes, of the following User_Data 1208 (without including the magic number nor the size field itself). 1209 This field is represented using four bytes, little-endian format, 1210 unsigned 32-bits. This means User_Data can't be bigger than 1211 (2^32-1) bytes. 1213 User_Data: This field can be anything. Data will just be skipped by 1214 the decoder. 1216 4. Entropy Encoding 1218 Two types of entropy encoding are used by the Zstandard format: FSE, 1219 and Huffman coding. Huffman is used to compress literals, while FSE 1220 is used for all other symbols (Literals_Length_Code, 1221 Match_Length_Code, offset codes) and to compress Huffman headers. 1223 4.1. FSE 1225 FSE, short for Finite State Entropy, is an entropy codec based on 1226 [ANS]. FSE encoding/decoding involves a state that is carried over 1227 between symbols, so decoding must be done in the opposite direction 1228 as encoding. Therefore, all FSE bitstreams are read from end to 1229 beginning. Note that the order of the bits in the stream is not 1230 reversed; they are simply read in the reverse order from which they 1231 were written. 1233 For additional details on FSE, see Finite State Entropy [FSE]. 1235 FSE decoding involves a decoding table that has a power of two size, 1236 and contains three elements: Symbol, Num_Bits, and Baseline. The 1237 base two logarithm of the table size is its Accuracy_Log. An FSE 1238 state value represents an index in this table. 1240 To obtain the initial state value, consume Accuracy_Log bits from the 1241 stream as a little-endian value. The next symbol in the stream is 1242 the Symbol indicated in the table for that state. To obtain the next 1243 state value, the decoder should consume Num_Bits bits from the stream 1244 as a little-endian value and add it to Baseline. 1246 4.1.1. FSE Table Description 1248 To decode FSE streams, it is necessary to construct the decoding 1249 table. The Zstandard format encodes FSE table descriptions as 1250 described here. 1252 An FSE distribution table describes the probabilities of all symbols 1253 from 0 to the last present one (included) on a normalized scale of (1 1254 << Accuracy_Log). Note that there must be two or more symbols with 1255 nonzero probability. 1257 A bitstream is read forward, in little-endian fashion. It is not 1258 necessary to know its exact size, since the size will be discovered 1259 and reported by the decoding process. The bitstream starts by 1260 reporting on which scale it operates. If low4bits designates the 1261 lowest four bits of the first byte, then Accuracy_Log = low4bits + 5. 1263 This is followed by each symbol value, from 0 to the last present 1264 one. The number of bits used by each field is variable and depends 1265 on: 1267 Remaining probabilities + 1: For example, presuming an Accuracy_Log 1268 of 8, and presuming 100 probabilities points have already been 1269 distributed, the decoder may read any value from 0 to (256 - 100 + 1270 1) == 157, inclusive. Therefore, it must read log2sup(157) == 8 1271 bits. 1273 Value decoded: Small values use one fewer bit. For example, 1274 presuming values from 0 to 157 (inclusive) are possible, 255 - 157 1275 = 98 values are remaining in an 8-bit field. The first 98 values 1276 (hence from 0 to 97) use only 7 bits, and values from 98 to 157 1277 use 8 bits. This is achieved through this scheme: 1279 +------------+---------------+-----------+ 1280 | Value read | Value decoded | Bits used | 1281 +------------+---------------+-----------+ 1282 | 0 - 97 | 0 - 97 | 7 | 1283 +------------+---------------+-----------+ 1284 | 98 - 127 | 98 - 127 | 8 | 1285 +------------+---------------+-----------+ 1286 | 128 - 225 | 0 - 97 | 7 | 1287 +------------+---------------+-----------+ 1288 | 226 - 255 | 128 - 157 | 8 | 1289 +------------+---------------+-----------+ 1291 Symbol probabilities are read one by one, in order. The probability 1292 is obtained from Value decoded using the formula P = Value - 1. This 1293 means the value 0 becomes the negative probability -1. This is a 1294 special probability that means "less than 1". Its effect on the 1295 distribution table is described below. For the purpose of 1296 calculating total allocated probability points, it counts as 1. 1298 When a symbol has a probability of zero, it is followed by a 2-bit 1299 repeat flag. This repeat flag tells how many probabilities of zeroes 1300 follow the current one. It provides a number ranging from 0 to 3. 1301 If it is a 3, another 2-bit repeat flag follows, and so on. 1303 When the last symbol reaches a cumulated total of (1 << 1304 Accuracy_Log), decoding is complete. If the last symbol makes the 1305 cumulated total go above (1 << Accuracy_Log), distribution is 1306 considered corrupted. 1308 Finally, the decoder can tell how many bytes were used in this 1309 process, and how many symbols are present. The bitstream consumes a 1310 round number of bytes. Any remaining bit within the last byte is 1311 simply unused. 1313 The distribution of normalized probabilities is enough to create a 1314 unique decoding table. The table has a size of (1 << Accuracy_Log). 1315 Each cell describes the symbol decoded, and instructions to get the 1316 next state. 1318 Symbols are scanned in their natural order for "less than 1" 1319 probabilities as described above. Symbols with this probability are 1320 being attributed a single cell, starting from the end of the table 1321 and retreating. These symbols define a full state reset, reading 1322 Accuracy_Log bits. 1324 All remaining symbols are allocated in their natural order. Starting 1325 from symbol 0 and table position 0, each symbol gets allocated as 1326 many cells as its probability. Cell allocation is spread, not 1327 linear; each successor position follows this rule: 1329 position += (tableSize >> 1) + (tableSize >> 3) + 3; 1330 position &= tableSize - 1; 1332 A position is skipped if it is already occupied by a "less than 1" 1333 probability symbol. Position does not reset between symbols; it 1334 simply iterates through each position in the table, switching to the 1335 next symbol when enough states have been allocated to the current 1336 one. 1338 The result is a list of state values. Each state will decode the 1339 current symbol. 1341 To get the Number_of_Bits and Baseline required for the next state, 1342 it is first necessary to sort all states in their natural order. The 1343 lower states will need one more bit than higher ones. The process is 1344 repeated for each symbol. 1346 For example, presuming a symbol has a probability of 5, it receives 1347 five state values. States are sorted in natural order. The next 1348 power of two is 8. The space of probabilities is divided into 8 1349 equal parts. Presuming the Accuracy_Log is 7, this defines 128 1350 states, and each share (divided by 8) is 16 in size. In order to 1351 reach 8, 8 - 5 = 3 lowest states will count "double", doubling the 1352 number of shares (32 in width), requiring one more bit in the 1353 process. 1355 Baseline is assigned starting from the higher states using fewer 1356 bits, and proceeding naturally, then resuming at the first state, 1357 each taking its allocated width from Baseline. 1359 +----------------+-------+-------+--------+------+-------+ 1360 | state order | 0 | 1 | 2 | 3 | 4 | 1361 +----------------+-------+-------+--------+------+-------+ 1362 | width | 32 | 32 | 32 | 16 | 16 | 1363 +----------------+-------+-------+--------+------+-------+ 1364 | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | 1365 +----------------+-------+-------+--------+------+-------+ 1366 | range number | 2 | 4 | 6 | 0 | 1 | 1367 +----------------+-------+-------+--------+------+-------+ 1368 | Baseline | 32 | 64 | 96 | 0 | 16 | 1369 +----------------+-------+-------+--------+------+-------+ 1370 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1371 +----------------+-------+-------+--------+------+-------+ 1373 The next state is determined from the current state by reading the 1374 required Number_of_Bits, and adding the specified Baseline. 1376 See Appendix B for the results of this process applied to the default 1377 distributions. 1379 4.2. Huffman Coding 1381 Zstandard Huffman-coded streams are read backwards, similar to the 1382 FSE bitstreams. Therefore, to find the start of the bitstream, it is 1383 necessary to know the offset of the last byte of the Huffman-coded 1384 stream. 1386 After writing the last bit containing information, the compressor 1387 writes a single 1-bit and then fills the byte with 0-7 0 bits of 1388 padding. The last byte of the compressed bitstream cannot be 0 for 1389 that reason. 1391 When decompressing, the last byte containing the padding is the first 1392 byte to read. The decompressor needs to skip 0-7 initial 0-bits and 1393 the first 1-bit that occurs. Afterwards, the useful part of the 1394 bitstream begins. 1396 The bitstream contains Huffman-coded symbols in little-endian order, 1397 with the codes defined by the method below. 1399 4.2.1. Huffman Tree Description 1401 Prefix coding represents symbols from an a priori known alphabet by 1402 bit sequences (codewords), one codeword for each symbol, in a manner 1403 such that different symbols may be represented by bit sequences of 1404 different lengths, but a parser can always parse an encoded string 1405 unambiguously symbol-by-symbol. 1407 Given an alphabet with known symbol frequencies, the Huffman 1408 algorithm allows the construction of an optimal prefix code using the 1409 fewest bits of any possible prefix codes for that alphabet. 1411 The prefix code must not exceed a maximum code length. More bits 1412 improve accuracy but yield a larger header size, and require more 1413 memory or more complex decoding operations. This specification 1414 limits the maximum code length to 11 bits. 1416 All literal values from zero (included) to the last present one 1417 (excluded) are represented by Weight with values from 0 to 1418 Max_Number_of_Bits. Transformation from Weight to Number_of_Bits 1419 follows this pseudocode: 1421 if Weight == 0 1422 Number_of_Bits = 0 1423 else 1424 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1426 The last symbol's Weight is deduced from previously decoded ones, by 1427 completing to the nearest power of 2. This power of 2 gives 1428 Max_Number_of_Bits, the depth of the current tree. 1430 For example, presume the following Huffman tree must be described: 1432 +---------------+----------------+ 1433 | literal value | Number_of_Bits | 1434 +---------------+----------------+ 1435 | 0 | 1 | 1436 +---------------+----------------+ 1437 | 1 | 2 | 1438 +---------------+----------------+ 1439 | 2 | 3 | 1440 +---------------+----------------+ 1441 | 3 | 0 | 1442 +---------------+----------------+ 1443 | 4 | 4 | 1444 +---------------+----------------+ 1445 | 5 | 4 | 1446 +---------------+----------------+ 1448 The tree depth is four, since its longest element uses four bits. 1449 (The longest elements are those with the smallest frequencies.) 1450 Value 5 will not be listed as it can be determined from the values 1451 for 0-4, nor will values above 5 as they are all 0. Values from 0 to 1452 4 will be listed using Weight instead of Number_of_Bits. The 1453 pseudocode to determine Weight is: 1455 if Number_of_Bits == 0 1456 Weight = 0 1457 else 1458 Weight = Max_Number_of_Bits + 1 - Number_of_Bits 1460 It gives the following series of weights: 1462 +---------------+--------+ 1463 | literal value | Weight | 1464 +---------------+--------+ 1465 | 0 | 4 | 1466 +---------------+--------+ 1467 | 1 | 3 | 1468 +---------------+--------+ 1469 | 2 | 2 | 1470 +---------------+--------+ 1471 | 3 | 0 | 1472 +---------------+--------+ 1473 | 4 | 1 | 1474 +---------------+--------+ 1476 The decoder will do the inverse operation: having collected weights 1477 of literals from 0 to 4, it knows the last literal, 5, is present 1478 with a non-zero weight. The weight of 5 can be determined by 1479 advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 1480 0's) is 15. The nearest power of 2 is 16. Therefore, 1481 Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1. 1483 4.2.1.1. Huffman Tree Header 1485 This is a single byte value (0-255), which describes how to the 1486 series of weights is encoded. 1488 headerByte < 128: The series of weights is compressed using FSE (see 1489 below). The length of the FSE-compressed series is equal to 1490 headerByte (0-127). 1492 headerByte >= 128: This is a direct representation, where each 1493 Weight is written directly as a four-bit field (0-15). They are 1494 encoded forward, two weights to a byte with the first weight 1495 taking the top four bits and the second taking the bottom four 1496 (e.g. the following operations could be used to read the weights: 1498 Weight[0] = (Byte[0] >> 4) 1499 Weight[1] = (Byte[0] & 0xf), 1500 etc. 1502 The full representation occupies ceiling(Number_of_Symbols/2) 1503 bytes, meaning it uses only full bytes even if Number_of_Symbols 1504 is odd. Number_of_Symbols = headerByte - 127. Note that maximum 1505 Number_of_Symbols is 255 - 127 = 128. If any literal has a value 1506 over 128, raw header mode is not possible and it is necessary to 1507 use FSE compression. 1509 4.2.1.2. FSE Compression of Huffman Weights 1511 In this case, the series of Huffman weights is compressed using FSE 1512 compression. It is a single bitstream with two interleaved states, 1513 sharing a single distribution table. 1515 To decode an FSE bitstream, it is necessary to know its compressed 1516 size. Compressed size is provided by headerByte. It's also 1517 necessary to know its maximum possible decompressed size, which is 1518 255, since literal values span from 0 to 255, and the last symbol's 1519 weight is not represented. 1521 An FSE bitstream starts by a header, describing probabilities 1522 distribution. It will create a Decoding Table. For a list of 1523 Huffman weights, the maximum accuracy log is 6 bits. For more 1524 description see Section 4.1.1. 1526 The Huffman header compression uses two states, which share the same 1527 FSE distribution table. The first state (State1) encodes the even 1528 indexed symbols, and the second (State2) encodes the odd indexes. 1529 State1 is initialized first, and then State2, and they take turns 1530 decoding a single symbol and updating their state. For more details 1531 on these FSE operations, see the FSE section. 1533 The number of symbols to decode is determined by tracking the 1534 bitStream overflow condition: If updating state after decoding a 1535 symbol would require more bits than remain in the stream, it is 1536 assumed that extra bits are zero. Then, symbols for each of the 1537 final states are decoded and the process is complete. 1539 4.2.1.3. Conversion from Weights to Huffman Prefix Codes 1541 All present symbols will now have a Weight value. It is possible to 1542 transform weights into Number_of_Bits, using this formula: 1544 if Weight > 0 1545 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1546 else 1547 Number_of_Bits = 0 1549 Symbols are sorted by Weight. Within the same Weight, symbols keep 1550 natural sequential order. Symbols with a Weight of zero are removed. 1551 Then, starting from lowest weight, prefix codes are distributed in 1552 sequential order. 1554 For example, assume the following list of weights has been decoded: 1556 +---------+--------+ 1557 | Literal | Weight | 1558 +---------+--------+ 1559 | 0 | 4 | 1560 +---------+--------+ 1561 | 1 | 3 | 1562 +---------+--------+ 1563 | 2 | 2 | 1564 +---------+--------+ 1565 | 3 | 0 | 1566 +---------+--------+ 1567 | 4 | 1 | 1568 +---------+--------+ 1569 | 5 | 1 | 1570 +---------+--------+ 1572 Sorted by weight and then the natural sequential order, yielding the 1573 following distribution: 1575 +---------+--------+----------------+--------------+ 1576 | Literal | Weight | Number_Of_Bits | prefix codes | 1577 +---------+--------+----------------|--------------+ 1578 | 3 | 0 | 0 | N/A | 1579 +---------+--------+----------------|--------------+ 1580 | 4 | 1 | 4 | 0000 | 1581 +---------+--------+----------------|--------------+ 1582 | 5 | 1 | 4 | 0001 | 1583 +---------+--------+----------------|--------------+ 1584 | 2 | 2 | 3 | 001 | 1585 +---------+--------+----------------|--------------+ 1586 | 1 | 3 | 2 | 01 | 1587 +---------+--------+----------------|--------------+ 1588 | 0 | 4 | 1 | 1 | 1589 +---------+--------+----------------|--------------+ 1591 4.2.2. Huffman-coded Streams 1593 Given a Huffman decoding table, it is possible to decode a Huffman- 1594 coded stream. 1596 Each bitstream must be read backward, that is starting from the end 1597 up to the beginning. Therefore, it is necessary to know the size of 1598 each bitstream. 1600 It is also necessary to know exactly which bit is the latest. This 1601 is detected by a final bit flag: the highest bit of latest byte is a 1602 final-bit-flag. Consequently, a last byte of 0 is not possible. And 1603 the final-bit-flag itself is not part of the useful bitstream. 1605 Hence, the last byte contains between 0 and 7 useful bits. 1607 Starting from the end, it is possible to read the bitstream in a 1608 little-endian fashion, keeping track of already used bits. Since the 1609 bitstream is encoded in reverse order, starting from the end, read 1610 symbols in forward order. 1612 For example, if the literal sequence "0145" was encoded using above 1613 prefix code, it would be encoded (in reverse order) as: 1615 +---------+----------+ 1616 | Symbol | Encoding | 1617 +---------+----------+ 1618 | 5 | 0000 | 1619 +---------+----------+ 1620 | 4 | 0001 | 1621 +---------+----------+ 1622 | 1 | 01 | 1623 +---------+----------+ 1624 | 0 | 1 | 1625 +---------+----------+ 1626 | Padding | 00001 | 1627 +---------+----------+ 1629 This results in the following two-byte bitstream: 1631 00010000 00001101 1633 Here is an alternative representation with the symbol codes separated 1634 by underscores: 1636 0001_0000 00001_1_01 1638 Reading the highest Max_Number_of_Bits bits, it's possible to compare 1639 the extracted value to the decoding table, determining the symbol to 1640 decode and number of bits to discard. 1642 The process continues up to reading the required number of symbols 1643 per stream. If a bitstream is not entirely and exactly consumed, 1644 hence reaching exactly its beginning position with all bits consumed, 1645 the decoding process is considered faulty. 1647 5. Dictionary Format 1649 Zstandard is compatible with "raw content" dictionaries, free of any 1650 format restriction, except that they must be at least eight bytes. 1651 These dictionaries function as if they were just the Content part of 1652 a formatted dictionary. 1654 However, dictionaries created by "zstd --train" in the reference 1655 implementation follow a specific format, described here. 1657 Dictionaries are not included in the compressed content, but rather 1658 are provided out-of-band. That is, the Dictionary_ID identifies 1659 which should be used, but this specification does not describe the 1660 mechanism by which the dictionary is obtained prior to use during 1661 compression or decompression. 1663 A dictionary has a size, defined either by a buffer limit or a file 1664 size. The general format is: 1666 +--------------+---------------+----------------+---------+ 1667 | Magic_Number | Dictionary_ID | Entropy_Tables | Content | 1668 +--------------+---------------+----------------+---------+ 1670 Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format 1672 Dictionary_ID: 4 bytes, stored in little-endian format. 1673 Dictionary_ID can be any value, except 0 (which means no 1674 Dictionary_ID). It is used by decoders to check if they use the 1675 correct dictionary. If the frame is going to be distributed in a 1676 private environment, any Dictionary_ID can be used. However, for 1677 public distribution of compressed frames, the following ranges are 1678 reserved and shall not be used: 1680 - low range : <= 32767 1681 - high range : >= (2^31) 1683 Entropy_Tables: Follow the same format as the tables in compressed 1684 blocks. See the relevant FSE and Huffman sections for how to 1685 decode these tables. They are stored in following order: Huffman 1686 tables for literals, FSE table for offsets, FSE table for match 1687 lengths, and FSE table for literals lengths. These tables 1688 populate the Repeat Stats literals mode and Repeat distribution 1689 mode for sequence decoding. It is finally followed by 3 offset 1690 values, populating repeat offsets (instead of using {1,4,8}), 1691 stored in order, 4-bytes little-endian each, for a total of 12 1692 bytes. Each repeat offset must have a value less than the 1693 dictionary size. 1695 Content: The rest of the dictionary is its content. The content act 1696 as a "past" in front of data to compress or decompress, so it can 1697 be referenced in sequence commands. As long as the amount of data 1698 decoded from this frame is less than or equal to Window_Size, 1699 sequence commands may specify offsets longer than the total length 1700 of decoded output so far to reference back to the dictionary, even 1701 parts of the dictionary with offsets larger than Window_Size. 1703 After the total output has surpassed Window_Size, however, this is 1704 no longer allowed and the dictionary is no longer accessible. 1706 6. IANA Considerations 1708 This document contains two registration actions for IANA. 1710 6.1. The 'application/zstd' Media Type 1712 The 'application/zstd' media type identifies a block of data that is 1713 compressed using zstd compression. The data is a stream of bytes as 1714 described in this document. IANA is requested to add the following 1715 to the Media Types registry: 1717 Type name: application 1719 Subtype name: zstd 1721 Required parameters: N/A 1723 Optional parameters: N/A 1725 Encoding considerations: binary 1727 Security considerations: See Section 7 of [this document] 1729 Interoperability considerations: N/A 1731 Published specification: [this document] 1733 Applications that use this media type: anywhere data size is an 1734 issue 1736 Additional information: 1738 Magic number(s): 4 Bytes, little-endian format. Value : 1739 0xFD2FB528 1741 File extension(s): zstd 1743 Macintosh file type code(s): N/A 1745 For further information: See [ZSTD] 1747 Intended usage: common 1748 Restrictions on usage: N/A 1750 Author: Murray S. Kucherawy 1752 Change Controller: IETF 1754 Provisional registration: yes 1756 6.2. Content Encoding 1758 IANA is requested to add the following entry to the HTTP Content 1759 Coding Parameters subregistry within the Hypertext Transfer Protocol 1760 (HTTP) registry: 1762 Name: zstd 1764 Description: A stream of bytes compressed using the Zstandard 1765 protocol 1767 Pointer to specification text: [this document] 1769 6.3. Dictionaries 1771 Work in progress incluces development of dictionaries that will 1772 optimize compression and decompression of particular types of data. 1773 Specification of such dictionaries for public use will necessitate 1774 registration of a code point from the reserved range described in 1775 Section 3.1.1.1.3 and its association with a specific dictionary. 1777 However, there are at present no such dictionaries published for 1778 public use, so this document makes no immediate request of IANA to 1779 create such a registry. 1781 7. Security Considerations 1783 Any data compression method involves the reduction of redundancy in 1784 the data. Zstandard is no exception, and the usual precautions 1785 apply. 1787 One should never compress together a message whose content must 1788 remain secret with a message generated by a third party. This can be 1789 used to guess the content of the secret message through analysis of 1790 entropy reduction. This was demonstrated in the [CRIME] attack, for 1791 example. 1793 A decoder has to demonstrate capabilities to detect and prevent any 1794 kind of data tampering in the compressed frame from triggering system 1795 faults, such as reading or writing beyond allowed memory ranges. 1797 This can be guaranteed either by the implementation language, or by 1798 careful bound checkings. Of particular note is the encoding of 1799 Number_of_Sequences values that cause the decoder to read into the 1800 block header (and beyond), as well as the indication of a 1801 Frame_Content_Size that is smaller than the actual decompressed data, 1802 in an attempt to trigger a buffer overflow. It is highly recommended 1803 to fuzz-test (i.e., provide invalid, unexpected, or random input and 1804 verify safe operation of) decoder implementations to test and harden 1805 their capability to detect bad frames and deal with them without any 1806 adverse system side-effect. 1808 An attacker may provide correctly formed compressed frames with 1809 unreasonable memory requirements. A decoder must always control 1810 memory requirements and enforce some (system-specific) limits in 1811 order to protect memory usage from such scenarios. 1813 Compression can be optimized by training a dictionary on a variety of 1814 related content payloads. This dictionary must then be available at 1815 the decoder for decompression of the payload to be possible. While 1816 this document does not specify how to acquire a dictionary for a 1817 given compressed payload, it is worth noting that third-party 1818 dictionaries may interact unexpectedly with a decoder, leading to 1819 possible memory or other resource exhaustion attacks. We expect such 1820 topics to be discussed in further detail in the Security 1821 Considerations section of a forthcoming RFC for dictionary 1822 acquisition and transmission, but highlight this issue now out of an 1823 abundance of caution. 1825 As discussed in Section 3.1.2, it is possible to store arbitrary user 1826 metadata in skippable frames. While such frames are ignored during 1827 decompression of the data, they can be used as a watermark to track 1828 the path of the compressed payload. 1830 8. Implementation Status 1832 Source code for a C language implementation of a "Zstandard" 1833 compliant library is available at [ZSTD-GITHUB]. This implementation 1834 is considered to be the reference implementation and is production 1835 ready, implementing the full range of the specification. It is 1836 tested against security hazards, and widely deployed within Facebook 1837 infrastructure. 1839 The reference version is speed optimised and highly portable. It has 1840 been proven to run safely on multiple architectures (x86, x64, ARM, 1841 MIPS, PowerPC, IA64) featuring 32 or 64-bits addressing schemes, 1842 little or big endian storage scheme, a number of different operating 1843 systems, UNIX (including Linux, BSD, OS-X and Solaris), and Windows, 1844 and a number of compilers (gcc, clang, visual, icc). 1846 [RFC EDITOR: Please remove the remainder of this section prior to 1847 publication.] 1849 The C reference version is also used to bind into multiple languages, 1850 a partial list of which (~20 of them) is being maintained at 1851 [ZSTD-OTHER]. 1853 The reference repository also contains an independently developed 1854 educational decoder, by Sean Purcell, created from the Zstandard 1855 format specification and built for clarity to help third party 1856 implementers. This is available at [ZSTD-EDU]. 1858 A specific version has been created for integration into the Linux 1859 kernel in order to provide compatibility with relevant memory 1860 restrictions. It was released in version 4.14 of the kernel. See 1861 [ZSTD-LINUX]. 1863 A Java native implementation of the decoder has been developed and 1864 open-sourced by the Presto team. This is available at [ZSTD-JAVA]. 1866 As of early July 2017, we are aware of one other decoder 1867 implementation in assembler, two full codec hardware implementations 1868 (programmable and ASIC) being actively developed, and a third one 1869 being evaluated. We are not permitted to disclose them at this 1870 stage. 1872 The popular UNIX command line HTTP client "curl" has expressed intent 1873 to support zstd in a future release. 1875 9. References 1877 9.1. Normative References 1879 [XXHASH] "XXHASH Algorithm", 2017, . 1881 [ZSTD] "Zstandard - Real-time data compression algorithm", 1882 2017, . 1884 9.2. Informative References 1886 [ANS] "Asymmetric Numeral Systems: Entropy Coding Combining 1887 Speed of Huffman Coding with Compression Rate of 1888 Arithmetic Coding", 2017, 1889 . 1891 [CRIME] "Compression Ratio Info-leak Made Easy", 2017, 1892 . 1894 [FSE] "Finite State Entropy", 2017, 1895 . 1897 [LZ4] "LZ4 Frame Format Description", 2017, . 1901 [RFC1952] Deutsch, P., "GZIP file format specification version 1902 4.3", RFC 1952, DOI 10.17487/RFC1952, May 1996, 1903 . 1905 [ZSTD-EDU] "Zstandard Educational Decoder", 2017, . 1909 [ZSTD-GITHUB] "Zstandard Github Repository", 2017, 1910 . 1912 [ZSTD-JAVA] "Zstandard Github Repository", 2017, . 1916 [ZSTD-LINUX] "Zstandard Github Repository", 2017, . 1920 [ZSTD-OTHER] "Zstandard Language Bindings", 2017, 1921 . 1923 Appendix A. Acknowledgments 1925 zstd was developed by Yann Collet. 1927 Bobo Bose-Kolanu, Felix Handte, Kyle Nekritz, Nick Terrell, and David 1928 Schleimer provided helpful feedback during the development of this 1929 document. 1931 Appendix B. Decoding Tables for Predefined Codes 1933 This appendix contains FSE decoding tables for the predefined literal 1934 length, match length, and offset codes. The tables have been 1935 constructed using the algorithm as given above in chapter "from 1936 normalized distribution to decoding tables". The tables here can be 1937 used as examples to crosscheck that an implementation build its 1938 decoding tables correctly. 1940 B.1. Literal Length Code Table 1942 +-------+--------+----------------+------+ 1943 | State | Symbol | Number_Of_Bits | Base | 1944 +-------+--------+----------------+------+ 1945 | 0 | 0 | 0 | 0 | 1946 +-------+--------+----------------+------+ 1947 | 0 | 0 | 4 | 0 | 1948 +-------+--------+----------------+------+ 1949 | 1 | 0 | 4 | 16 | 1950 +-------+--------+----------------+------+ 1951 | 2 | 1 | 5 | 32 | 1952 +-------+--------+----------------+------+ 1953 | 3 | 3 | 5 | 0 | 1954 +-------+--------+----------------+------+ 1955 | 4 | 4 | 5 | 0 | 1956 +-------+--------+----------------+------+ 1957 | 5 | 6 | 5 | 0 | 1958 +-------+--------+----------------+------+ 1959 | 6 | 7 | 5 | 0 | 1960 +-------+--------+----------------+------+ 1961 | 7 | 9 | 5 | 0 | 1962 +-------+--------+----------------+------+ 1963 | 8 | 10 | 5 | 0 | 1964 +-------+--------+----------------+------+ 1965 | 9 | 12 | 5 | 0 | 1966 +-------+--------+----------------+------+ 1967 | 10 | 14 | 6 | 0 | 1968 +-------+--------+----------------+------+ 1969 | 11 | 16 | 5 | 0 | 1970 +-------+--------+----------------+------+ 1971 | 12 | 18 | 5 | 0 | 1972 +-------+--------+----------------+------+ 1973 | 13 | 19 | 5 | 0 | 1974 +-------+--------+----------------+------+ 1975 | 14 | 21 | 5 | 0 | 1976 +-------+--------+----------------+------+ 1977 | 15 | 22 | 5 | 0 | 1978 +-------+--------+----------------+------+ 1979 | 16 | 24 | 5 | 0 | 1980 +-------+--------+----------------+------+ 1981 | 17 | 25 | 5 | 32 | 1982 +-------+--------+----------------+------+ 1983 | 18 | 26 | 5 | 0 | 1984 +-------+--------+----------------+------+ 1985 | 19 | 27 | 6 | 0 | 1986 +-------+--------+----------------+------+ 1987 | 20 | 29 | 6 | 0 | 1988 +-------+--------+----------------+------+ 1989 | 21 | 31 | 6 | 0 | 1990 +-------+--------+----------------+------+ 1991 | 22 | 0 | 4 | 32 | 1992 +-------+--------+----------------+------+ 1993 | 23 | 1 | 4 | 0 | 1994 +-------+--------+----------------+------+ 1995 | 24 | 2 | 5 | 0 | 1996 +-------+--------+----------------+------+ 1997 | 25 | 4 | 5 | 32 | 1998 +-------+--------+----------------+------+ 1999 | 26 | 5 | 5 | 0 | 2000 +-------+--------+----------------+------+ 2001 | 27 | 7 | 5 | 32 | 2002 +-------+--------+----------------+------+ 2003 | 28 | 8 | 5 | 0 | 2004 +-------+--------+----------------+------+ 2005 | 29 | 10 | 5 | 32 | 2006 +-------+--------+----------------+------+ 2007 | 30 | 11 | 5 | 0 | 2008 +-------+--------+----------------+------+ 2009 | 31 | 13 | 6 | 0 | 2010 +-------+--------+----------------+------+ 2011 | 32 | 16 | 5 | 32 | 2012 +-------+--------+----------------+------+ 2013 | 33 | 17 | 5 | 0 | 2014 +-------+--------+----------------+------+ 2015 | 34 | 19 | 5 | 32 | 2016 +-------+--------+----------------+------+ 2017 | 35 | 20 | 5 | 0 | 2018 +-------+--------+----------------+------+ 2019 | 36 | 22 | 5 | 32 | 2020 +-------+--------+----------------+------+ 2021 | 37 | 23 | 5 | 0 | 2022 +-------+--------+----------------+------+ 2023 | 38 | 25 | 4 | 0 | 2024 +-------+--------+----------------+------+ 2025 | 39 | 25 | 4 | 16 | 2026 +-------+--------+----------------+------+ 2027 | 40 | 26 | 5 | 32 | 2028 +-------+--------+----------------+------+ 2029 | 41 | 28 | 6 | 0 | 2030 +-------+--------+----------------+------+ 2031 | 42 | 30 | 6 | 0 | 2032 +-------+--------+----------------+------+ 2033 | 43 | 0 | 4 | 48 | 2034 +-------+--------+----------------+------+ 2035 | 44 | 1 | 4 | 16 | 2036 +-------+--------+----------------+------+ 2037 | 45 | 2 | 5 | 32 | 2038 +-------+--------+----------------+------+ 2039 | 46 | 3 | 5 | 32 | 2040 +-------+--------+----------------+------+ 2041 | 47 | 5 | 5 | 32 | 2042 +-------+--------+----------------+------+ 2043 | 48 | 6 | 5 | 32 | 2044 +-------+--------+----------------+------+ 2045 | 49 | 8 | 5 | 32 | 2046 +-------+--------+----------------+------+ 2047 | 50 | 9 | 5 | 32 | 2048 +-------+--------+----------------+------+ 2049 | 51 | 11 | 5 | 32 | 2050 +-------+--------+----------------+------+ 2051 | 52 | 12 | 5 | 32 | 2052 +-------+--------+----------------+------+ 2053 | 53 | 15 | 6 | 0 | 2054 +-------+--------+----------------+------+ 2055 | 54 | 17 | 5 | 32 | 2056 +-------+--------+----------------+------+ 2057 | 55 | 18 | 5 | 32 | 2058 +-------+--------+----------------+------+ 2059 | 56 | 20 | 5 | 32 | 2060 +-------+--------+----------------+------+ 2061 | 57 | 21 | 5 | 32 | 2062 +-------+--------+----------------+------+ 2063 | 58 | 23 | 5 | 32 | 2064 +-------+--------+----------------+------+ 2065 | 59 | 24 | 5 | 32 | 2066 +-------+--------+----------------+------+ 2067 | 60 | 35 | 6 | 0 | 2068 +-------+--------+----------------+------+ 2069 | 61 | 34 | 6 | 0 | 2070 +-------+--------+----------------+------+ 2071 | 62 | 33 | 6 | 0 | 2072 +-------+--------+----------------+------+ 2073 | 63 | 32 | 6 | 0 | 2074 +-------+--------+----------------+------+ 2076 B.2. Match Length Code Table 2078 +-------+--------+----------------+------+ 2079 | State | Symbol | Number_Of_Bits | Base | 2080 +-------+--------+----------------+------+ 2081 | 0 | 0 | 0 | 0 | 2082 +-------+--------+----------------+------+ 2083 | 0 | 0 | 6 | 0 | 2084 +-------+--------+----------------+------+ 2085 | 1 | 1 | 4 | 0 | 2086 +-------+--------+----------------+------+ 2087 | 2 | 2 | 5 | 32 | 2088 +-------+--------+----------------+------+ 2089 | 3 | 3 | 5 | 0 | 2090 +-------+--------+----------------+------+ 2091 | 4 | 5 | 5 | 0 | 2092 +-------+--------+----------------+------+ 2093 | 5 | 6 | 5 | 0 | 2094 +-------+--------+----------------+------+ 2095 | 6 | 8 | 5 | 0 | 2096 +-------+--------+----------------+------+ 2097 | 7 | 10 | 6 | 0 | 2098 +-------+--------+----------------+------+ 2099 | 8 | 13 | 6 | 0 | 2100 +-------+--------+----------------+------+ 2101 | 9 | 16 | 6 | 0 | 2102 +-------+--------+----------------+------+ 2103 | 10 | 19 | 6 | 0 | 2104 +-------+--------+----------------+------+ 2105 | 11 | 22 | 6 | 0 | 2106 +-------+--------+----------------+------+ 2107 | 12 | 25 | 6 | 0 | 2108 +-------+--------+----------------+------+ 2109 | 13 | 28 | 6 | 0 | 2110 +-------+--------+----------------+------+ 2111 | 14 | 31 | 6 | 0 | 2112 +-------+--------+----------------+------+ 2113 | 15 | 33 | 6 | 0 | 2114 +-------+--------+----------------+------+ 2115 | 16 | 35 | 6 | 0 | 2116 +-------+--------+----------------+------+ 2117 | 17 | 37 | 6 | 0 | 2118 +-------+--------+----------------+------+ 2119 | 18 | 39 | 6 | 0 | 2120 +-------+--------+----------------+------+ 2121 | 19 | 41 | 6 | 0 | 2122 +-------+--------+----------------+------+ 2123 | 20 | 43 | 6 | 0 | 2124 +-------+--------+----------------+------+ 2125 | 21 | 45 | 6 | 0 | 2126 +-------+--------+----------------+------+ 2127 | 22 | 1 | 4 | 16 | 2128 +-------+--------+----------------+------+ 2129 | 23 | 2 | 4 | 0 | 2130 +-------+--------+----------------+------+ 2131 | 24 | 3 | 5 | 32 | 2132 +-------+--------+----------------+------+ 2133 | 25 | 4 | 5 | 0 | 2134 +-------+--------+----------------+------+ 2135 | 26 | 6 | 5 | 32 | 2136 +-------+--------+----------------+------+ 2137 | 27 | 7 | 5 | 0 | 2138 +-------+--------+----------------+------+ 2139 | 28 | 9 | 6 | 0 | 2140 +-------+--------+----------------+------+ 2141 | 29 | 12 | 6 | 0 | 2142 +-------+--------+----------------+------+ 2143 | 30 | 15 | 6 | 0 | 2144 +-------+--------+----------------+------+ 2145 | 31 | 18 | 6 | 0 | 2146 +-------+--------+----------------+------+ 2147 | 32 | 21 | 6 | 0 | 2148 +-------+--------+----------------+------+ 2149 | 33 | 24 | 6 | 0 | 2150 +-------+--------+----------------+------+ 2151 | 34 | 27 | 6 | 0 | 2152 +-------+--------+----------------+------+ 2153 | 35 | 30 | 6 | 0 | 2154 +-------+--------+----------------+------+ 2155 | 36 | 32 | 6 | 0 | 2156 +-------+--------+----------------+------+ 2157 | 37 | 34 | 6 | 0 | 2158 +-------+--------+----------------+------+ 2159 | 38 | 36 | 6 | 0 | 2160 +-------+--------+----------------+------+ 2161 | 39 | 38 | 6 | 0 | 2162 +-------+--------+----------------+------+ 2163 | 40 | 40 | 6 | 0 | 2164 +-------+--------+----------------+------+ 2165 | 41 | 42 | 6 | 0 | 2166 +-------+--------+----------------+------+ 2167 | 42 | 44 | 6 | 0 | 2168 +-------+--------+----------------+------+ 2169 | 43 | 1 | 4 | 32 | 2170 +-------+--------+----------------+------+ 2171 | 44 | 1 | 4 | 48 | 2172 +-------+--------+----------------+------+ 2173 | 45 | 2 | 4 | 16 | 2174 +-------+--------+----------------+------+ 2175 | 46 | 4 | 5 | 32 | 2176 +-------+--------+----------------+------+ 2177 | 47 | 5 | 5 | 32 | 2178 +-------+--------+----------------+------+ 2179 | 48 | 7 | 5 | 32 | 2180 +-------+--------+----------------+------+ 2181 | 49 | 8 | 5 | 32 | 2182 +-------+--------+----------------+------+ 2183 | 50 | 11 | 6 | 0 | 2184 +-------+--------+----------------+------+ 2185 | 51 | 14 | 6 | 0 | 2186 +-------+--------+----------------+------+ 2187 | 52 | 17 | 6 | 0 | 2188 +-------+--------+----------------+------+ 2189 | 53 | 20 | 6 | 0 | 2190 +-------+--------+----------------+------+ 2191 | 54 | 23 | 6 | 0 | 2192 +-------+--------+----------------+------+ 2193 | 55 | 26 | 6 | 0 | 2194 +-------+--------+----------------+------+ 2195 | 56 | 29 | 6 | 0 | 2196 +-------+--------+----------------+------+ 2197 | 57 | 52 | 6 | 0 | 2198 +-------+--------+----------------+------+ 2199 | 58 | 51 | 6 | 0 | 2200 +-------+--------+----------------+------+ 2201 | 59 | 50 | 6 | 0 | 2202 +-------+--------+----------------+------+ 2203 | 60 | 49 | 6 | 0 | 2204 +-------+--------+----------------+------+ 2205 | 61 | 48 | 6 | 0 | 2206 +-------+--------+----------------+------+ 2207 | 62 | 47 | 6 | 0 | 2208 +-------+--------+----------------+------+ 2209 | 63 | 46 | 6 | 0 | 2210 +-------+--------+----------------+------+ 2212 B.3. Offset Code Table 2214 +-------+--------+----------------+------+ 2215 | State | Symbol | Number_Of_Bits | Base | 2216 +-------+--------+----------------+------+ 2217 | 0 | 0 | 0 | 0 | 2218 +-------+--------+----------------+------+ 2219 | 0 | 0 | 5 | 0 | 2220 +-------+--------+----------------+------+ 2221 | 1 | 6 | 4 | 0 | 2222 +-------+--------+----------------+------+ 2223 | 2 | 9 | 5 | 0 | 2224 +-------+--------+----------------+------+ 2225 | 3 | 15 | 5 | 0 | 2226 +-------+--------+----------------+------+ 2227 | 4 | 21 | 5 | 0 | 2228 +-------+--------+----------------+------+ 2229 | 5 | 3 | 5 | 0 | 2230 +-------+--------+----------------+------+ 2231 | 6 | 7 | 4 | 0 | 2232 +-------+--------+----------------+------+ 2233 | 7 | 12 | 5 | 0 | 2234 +-------+--------+----------------+------+ 2235 | 8 | 18 | 5 | 0 | 2236 +-------+--------+----------------+------+ 2237 | 9 | 23 | 5 | 0 | 2238 +-------+--------+----------------+------+ 2239 | 10 | 5 | 5 | 0 | 2240 +-------+--------+----------------+------+ 2241 | 11 | 8 | 4 | 0 | 2242 +-------+--------+----------------+------+ 2243 | 12 | 14 | 5 | 0 | 2244 +-------+--------+----------------+------+ 2245 | 13 | 20 | 5 | 0 | 2246 +-------+--------+----------------+------+ 2247 | 14 | 2 | 5 | 0 | 2248 +-------+--------+----------------+------+ 2249 | 15 | 7 | 4 | 16 | 2250 +-------+--------+----------------+------+ 2251 | 16 | 11 | 5 | 0 | 2252 +-------+--------+----------------+------+ 2253 | 17 | 17 | 5 | 0 | 2254 +-------+--------+----------------+------+ 2255 | 18 | 22 | 5 | 0 | 2256 +-------+--------+----------------+------+ 2257 | 19 | 4 | 5 | 0 | 2258 +-------+--------+----------------+------+ 2259 | 20 | 8 | 4 | 16 | 2260 +-------+--------+----------------+------+ 2261 | 21 | 13 | 5 | 0 | 2262 +-------+--------+----------------+------+ 2263 | 22 | 19 | 5 | 0 | 2264 +-------+--------+----------------+------+ 2265 | 23 | 1 | 5 | 0 | 2266 +-------+--------+----------------+------+ 2267 | 24 | 6 | 4 | 16 | 2268 +-------+--------+----------------+------+ 2269 | 25 | 10 | 5 | 0 | 2270 +-------+--------+----------------+------+ 2271 | 26 | 16 | 5 | 0 | 2272 +-------+--------+----------------+------+ 2273 | 27 | 28 | 5 | 0 | 2274 +-------+--------+----------------+------+ 2275 | 28 | 27 | 5 | 0 | 2276 +-------+--------+----------------+------+ 2277 | 29 | 26 | 5 | 0 | 2278 +-------+--------+----------------+------+ 2279 | 30 | 25 | 5 | 0 | 2280 +-------+--------+----------------+------+ 2281 | 31 | 24 | 5 | 0 | 2282 +-------+--------+----------------+------+ 2284 Authors' Addresses 2286 Yann Collet 2287 Facebook 2288 1 Hacker Way 2289 Menlo Park, CA 94025 2290 United States 2292 EMail: cyan@fb.com 2294 Murray S. Kucherawy (editor) 2295 Facebook 2296 1 Hacker Way 2297 Menlo Park, CA 94025 2298 United States 2300 EMail: msk@fb.com