idnits 2.17.1 draft-kucherawy-rfc8478bis-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 1727: '... type registered here SHOULD NOT use a...' RFC 2119 keyword, line 1808: '...cified for +zstd SHOULD be as specifie...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (November 16, 2019) is 1594 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '36' on line 1106 -- Looks like a reference, but probably isn't: '53' on line 1116 -- Looks like a reference, but probably isn't: '29' on line 1133 -- Looks like a reference, but probably isn't: '5' on line 1497 -- Looks like a reference, but probably isn't: '0' on line 1515 -- Looks like a reference, but probably isn't: '1' on line 1515 Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Y. Collet 3 Internet-Draft M. Kucherawy, Ed. 4 Intended status: Informational Facebook 5 Expires: May 19, 2020 November 16, 2019 7 Zstandard Compression and the application/zstd Media Type 8 draft-kucherawy-rfc8478bis-01 10 Abstract 12 Zstandard, or "zstd" (pronounced "zee standard"), is a data 13 compression mechanism. This document describes the mechanism and 14 registers a media type and content encoding to be used when 15 transporting zstd-compressed content via Multipurpose Internet Mail 16 Extensions (MIME). 18 Despite use of the word "standard" as part of its name, readers are 19 advised that this document is not an Internet Standards Track 20 specification; it is being published for informational purposes only. 22 Status of this Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on May 19, 2020. 39 Copyright Notice 41 Copyright (c) 2019 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3 58 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 4 59 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . . 5 60 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . . 5 61 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . . 6 62 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . . 10 63 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 12 64 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . . 26 65 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . . 27 66 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . . 27 67 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . . 28 68 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 69 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 29 70 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . . 32 71 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . . 32 72 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 34 73 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . . 35 74 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 35 75 4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 36 76 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 37 77 6. Use of Dictionaries . . . . . . . . . . . . . . . . . . . . . 39 78 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 79 7.1. The 'application/zstd' Media Type . . . . . . . . . . . . 39 80 7.2. Content Encoding . . . . . . . . . . . . . . . . . . . . . 40 81 7.3. Structured Syntax Suffix . . . . . . . . . . . . . . . . . 40 82 7.4. Dictionaries . . . . . . . . . . . . . . . . . . . . . . . 41 83 8. Security Considerations . . . . . . . . . . . . . . . . . . . 41 84 9. Implementation Status . . . . . . . . . . . . . . . . . . . . 42 85 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 43 86 10.1. Normative References . . . . . . . . . . . . . . . . . . . 43 87 10.2. Informative References . . . . . . . . . . . . . . . . . . 43 88 Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 43 89 A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 43 90 A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 46 91 A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 49 92 Appendix B. Changes Since RFC8478 . . . . . . . . . . . . . . . . 51 93 Appendix C. Acknowledgments . . . . . . . . . . . . . . . . . . . 51 94 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 51 96 1. Introduction 98 Zstandard, or "zstd" (pronounced "zee standard"), is a data 99 compression mechanism, akin to gzip [RFC1952]. 101 Despite use of the word "standard" as part of its name, readers are 102 advised that this document is not an Internet Standards Track 103 specification; it is being published for informational purposes only. 105 This document describes the Zstandard format. Also, to enable the 106 transport of a data object compressed with Zstandard, this document 107 registers a media type that can be used to identify such content when 108 it is used in a payload encoded using Multipurpose Internet Mail 109 Extensions (MIME). 111 2. Definitions 113 Some terms used elsewhere in this document are defined here for 114 clarity. 116 uncompressed: Describes an arbitrary set of bytes in their original 117 form, prior to being subjected to compression. 119 compress, compression: The act of processing a set of bytes via the 120 compression mechanism described here. 122 compressed: Describes the result of passing a set of bytes through 123 this mechanism. The original input has thus been compressed. 125 decompress, decompression: The act of processing a set of bytes 126 through the inverse of the compression mechanism described here, 127 in an attempt to recover the original set of bytes prior to 128 compression. 130 decompressed: Describes the result of passing a set of bytes through 131 the reverse of this mechanism. When this is successful, the 132 decompressed payload and the uncompressed payload are 133 indistinguishable. 135 encode: The process of translating data from one form to another; 136 this may include compression or it may refer to other translations 137 done as part of this specification. 139 decode: The reverse of "encode"; describes a process of reversing a 140 prior encoding to recover the original content. 142 frame: Content compressed by Zstandard is transformed into a 143 Zstandard frame. Multiple frames can be appended into a single 144 file or stream. A frame is completely independent, has a defined 145 beginning and end, and has a set of parameters that tells the 146 decoder how to decompress it. 148 block: A frame encapsulates one or multiple blocks. Each block 149 contains arbitrary content, which is described by its header, and 150 has a guaranteed maximum content size that depends upon frame 151 parameters. Unlike frames, each block depends on previous blocks 152 for proper decoding. However, each block can be decompressed 153 without waiting for its successor, allowing streaming operations. 155 natural order: A sequence or ordering of objects or values that is 156 typical of that type of object or value. A set of unique 157 integers, for example, is in "natural order" if when progressing 158 from one element in the set or sequence to the next, there is 159 never a decrease in value. 161 The naming convention for identifiers within the specification is 162 Mixed_Case_With_Underscores. Identifiers inside square brackets 163 indicate that the identifier is optional in the presented context. 165 3. Compression Algorithm 167 This section describes the Zstandard algorithm. 169 The purpose of this document is to define a lossless compressed data 170 format that is a) independent of the CPU type, operating system, file 171 system, and character set and b) is suitable for file compression and 172 pipe and streaming compression, using the Zstandard algorithm. The 173 text of the specification assumes a basic background in programming 174 at the level of bits and other primitive data representations. 176 The data can be produced or consumed, even for an arbitrarily long 177 sequentially presented input data stream, using only an a priori 178 bounded amount of intermediate storage, and hence can be used in data 179 communications. The format uses the Zstandard compression method, 180 and an optional xxHash-64 checksum method [XXHASH], for detection of 181 data corruption. 183 The data format defined by this specification does not attempt to 184 allow random access to compressed data. 186 Unless otherwise indicated below, a compliant compressor must produce 187 data sets that conform to the specifications presented here. 188 However, it does not need to support all options. 190 A compliant decompressor must be able to decompress at least one 191 working set of parameters that conforms to the specifications 192 presented here. It may also ignore informative fields, such as the 193 checksum. Whenever it does not support a parameter defined in the 194 compressed stream, it must produce a non-ambiguous error code and 195 associated error message explaining which parameter is unsupported. 197 This specification is intended for use by implementers of software to 198 compress data into Zstandard format and/or decompress data from 199 Zstandard format. The Zstandard format is supported by an open 200 source reference implementation, written in portable C, and available 201 at [ZSTD]. 203 3.1. Frames 205 Zstandard compressed data is made up of one or more frames. Each 206 frame is independent and can be decompressed independently of other 207 frames. The decompressed content of multiple concatenated frames is 208 the concatenation of each frame's decompressed content. 210 There are two frame formats defined for Zstandard: Zstandard frames 211 and skippable frames. Zstandard frames contain compressed data, 212 while skippable frames contain custom user metadata. 214 3.1.1. Zstandard Frames 216 The structure of a single Zstandard frame is as follows: 218 +--------------------+------------+ 219 | Magic_Number | 4 bytes | 220 +--------------------+------------+ 221 | Frame_Header | 2-14 bytes | 222 +--------------------+------------+ 223 | Data_Block | n bytes | 224 +--------------------+------------+ 225 | [More Data_Blocks] | | 226 +--------------------+------------+ 227 | [Content_Checksum] | 0-4 bytes | 228 +--------------------+------------+ 230 Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528. 232 Frame_Header: 2 to 14 bytes, detailed in Section 3.1.1.1. 234 Data_Block: Detailed in Section 3.1.1.2. This is where data 235 appears. 237 Content_Checksum: An optional 32-bit checksum, only present if 238 Content_Checksum_Flag is set. The content checksum is the result 239 of the XXH64() hash function [XXHASH] digesting the original 240 (decoded) data as input, and a seed of zero. The low 4 bytes of 241 the checksum are stored in little-endian format. 243 The magic number was selected to be less probable to find at the 244 beginning of an arbitrary file. It avoids trivial patterns (0x00, 245 0xFF, repeated bytes, increasing bytes, etc.), contains byte values 246 outside of ASCII range, and doesn't map into UTF-8 space, all of 247 which reduce the likelihood of its appearance at the top of a text 248 file. 250 3.1.1.1. Frame Header 252 The frame header has a variable size, with a minimum of 2 bytes and 253 up to 14 bytes depending on optional parameters. The structure of 254 Frame_Header is as follows: 256 +-------------------------+-----------+ 257 | Frame_Header_Descriptor | 1 byte | 258 +-------------------------+-----------+ 259 | [Window_Descriptor] | 0-1 byte | 260 +-------------------------+-----------+ 261 | [Dictionary_ID] | 0-4 bytes | 262 +-------------------------+-----------+ 263 | [Frame_Content_Size] | 0-8 bytes | 264 +-------------------------+-----------+ 266 3.1.1.1.1. Frame_Header_Descriptor 268 The first header's byte is called the Frame_Header_Descriptor. It 269 describes which other fields are present. Decoding this byte is 270 enough to tell the size of Frame_Header. 272 +------------+-------------------------+ 273 | Bit Number | Field Name | 274 +------------+-------------------------+ 275 | 7-6 | Frame_Content_Size_Flag | 276 +------------+-------------------------+ 277 | 5 | Single_Segment_Flag | 278 +------------+-------------------------+ 279 | 4 | (unused) | 280 +------------+-------------------------+ 281 | 3 | (reserved) | 282 +------------+-------------------------+ 283 | 2 | Content_Checksum_Flag | 284 +------------+-------------------------+ 285 | 1-0 | Dictionary_ID_Flag | 286 +------------+-------------------------+ 288 In this table, bit 7 is the highest bit, while bit 0 is the lowest 289 one. 291 3.1.1.1.1.1. Frame_Content_Size_Flag 293 This is a 2-bit flag (equivalent to Frame_Header_Descriptor right- 294 shifted 6 bits) specifying whether Frame_Content_Size (the 295 decompressed data size) is provided within the header. Flag_Value 296 provides FCS_Field_Size, which is the number of bytes used by 297 Frame_Content_Size according to the following table: 299 +----------------+--------+---+---+---+ 300 | Flag_Value | 0 | 1 | 2 | 3 | 301 +----------------+--------+---+---+---+ 302 | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | 303 +----------------+--------+---+---+---+ 305 When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag: 306 If Single_Segment_Flag is set, FCS_Field_Size is 1. Otherwise, 307 FCS_Field_Size is 0; Frame_Content_Size is not provided. 309 3.1.1.1.1.2. Single_Segment_Flag 311 If this flag is set, data must be regenerated within a single 312 continuous memory segment. 314 In this case, Window_Descriptor byte is skipped, but 315 Frame_Content_Size is necessarily present. As a consequence, the 316 decoder must allocate a memory segment of size equal or larger than 317 Frame_Content_Size. 319 In order to protect the decoder from unreasonable memory 320 requirements, a decoder is allowed to reject a compressed frame that 321 requests a memory size beyond the decoder's authorized range. 323 For broader compatibility, decoders are recommended to support memory 324 sizes of at least 8 MB. This is only a recommendation; each decoder 325 is free to support higher or lower limits, depending on local 326 limitations. 328 3.1.1.1.1.3. Unused Bit 330 A decoder compliant with this specification version shall not 331 interpret this bit. It might be used in a future version, to signal 332 a property that is not mandatory to properly decode the frame. An 333 encoder compliant with this specification must set this bit to zero. 335 3.1.1.1.1.4. Reserved Bit 337 This bit is reserved for some future feature. Its value must be 338 zero. A decoder compliant with this specification version must 339 ensure it is not set. This bit may be used in a future revision, to 340 signal a feature that must be interpreted to decode the frame 341 correctly. 343 3.1.1.1.1.5. Content_Checksum_Flag 345 If this flag is set, a 32-bit Content_Checksum will be present at the 346 frame's end. See the description of Content_Checksum above. 348 3.1.1.1.1.6. Dictionary_ID_Flag 350 This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating 351 whether a dictionary ID is provided within the header. It also 352 specifies the size of this field as DID_Field_Size: 354 +----------------+---+---+---+---+ 355 | Flag_Value | 0 | 1 | 2 | 3 | 356 +----------------+---+---+---+---+ 357 | DID_Field_Size | 0 | 1 | 2 | 4 | 358 +----------------+---+---+---+---+ 360 3.1.1.1.2. Window Descriptor 362 This provides guarantees about the minimum memory buffer required to 363 decompress a frame. This information is important for decoders to 364 allocate enough memory. 366 The Window_Descriptor byte is optional. When Single_Segment_Flag is 367 set, Window_Descriptor is not present. In this case, Window_Size is 368 Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16 369 ExaBytes). 371 +------------+----------+----------+ 372 | Bit Number | 7-3 | 2-0 | 373 +------------+----------+----------+ 374 | Field Name | Exponent | Mantissa | 375 +------------+----------+----------+ 377 The minimum memory buffer size is called Window_Size. It is 378 described by the following formulae: 380 windowLog = 10 + Exponent; 381 windowBase = 1 << windowLog; 382 windowAdd = (windowBase / 8) * Mantissa; 383 Window_Size = windowBase + windowAdd; 385 The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41) 386 + 7*(1<<38) bytes, which is 3.75 TB. 388 In general, larger Window_Size values tend to improve the compression 389 ratio, but at the cost of increased memory usage. 391 To properly decode compressed data, a decoder will need to allocate a 392 buffer of at least Window_Size bytes. 394 In order to protect decoders from unreasonable memory requirements, a 395 decoder is allowed to reject a compressed frame that requests a 396 memory size beyond decoder's authorized range. 398 For improved interoperability, it's recommended for decoders to 399 support values of Window_Size up to 8 MB and for encoders not to 400 generate frames requiring a Window_Size larger than 8 MB. It's 401 merely a recommendation though, and decoders are free to support 402 larger or lower limits, depending on local limitations. 404 3.1.1.1.3. Dictionary_ID 406 This is a variable size field, which contains the ID of the 407 dictionary required to properly decode the frame. This field is 408 optional. When it's not present, it's up to the decoder to know 409 which dictionary to use. 411 Dictionary_ID field size is provided by DID_Field_Size. 412 DID_Field_Size is directly derived from the value of 413 Dictionary_ID_Flag. One byte can represent an ID 0-255; 2 bytes can 414 represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295. 415 Format is little-endian. 417 It is permitted to represent a small ID (for example, 13) with a 418 large 4-byte dictionary ID, even if it is less efficient. 420 Within private environments, any dictionary ID can be used. However, 421 for frames and dictionaries distributed in public space, 422 Dictionary_ID must be attributed carefully. The following ranges are 423 reserved for use only with dictionaries that have been registered 424 with IANA (see Section 7.4): 425 low range: <= 32767 426 high range: >= (1 << 31) 428 Any other value for Dictionary_ID can be used by private arrangement 429 between participants. 431 Any payload presented for decompression that references an 432 unregistered reserved dictionary ID results in an error. 434 3.1.1.1.4. Frame Content Size 436 This is the original (uncompressed) size. This information is 437 optional. Frame_Content_Size uses a variable number of bytes, 438 provided by FCS_Field_Size. FCS_Field_Size is provided by the value 439 of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not 440 present), 1, 2, 4, or 8 bytes. 442 +----------------+--------------+ 443 | FCS Field Size | Range | 444 +----------------+--------------+ 445 | 0 | unknown | 446 +----------------+--------------+ 447 | 1 | 0 - 255 | 448 +----------------+--------------+ 449 | 2 | 256 - 65791 | 450 +----------------+--------------+ 451 | 4 | 0 - 2^32 - 1 | 452 +----------------+--------------+ 453 | 8 | 0 - 2^64 - 1 | 454 +----------------+--------------+ 456 Frame_Content_Size format is little-endian. When FCS_Field_Size is 457 1, 4, or 8 bytes, the value is read directly. When FCS_Field_Size is 458 2, the offset of 256 is added. It's allowed to represent a small 459 size (for example 18) using any compatible variant. 461 3.1.1.2. Blocks 463 After Magic_Number and Frame_Header, there are some number of blocks. 464 Each frame must have at least 1 block, but there is no upper limit on 465 the number of blocks per frame. 467 The structure of a block is as follows: 469 +--------------+---------------+ 470 | Block_Header | Block_Content | 471 +--------------+---------------+ 472 | 3 bytes | n bytes | 473 +--------------+---------------+ 475 Block_Header uses 3 bytes, written using little-endian convention. 476 It contains three fields: 478 +------------+------------+------------+ 479 | Last_Block | Block_Type | Block_Size | 480 +------------+------------+------------+ 481 | bit 0 | bits 1-2 | bits 3-23 | 482 +------------+------------+------------+ 484 3.1.1.2.1. Last_Block 486 The lowest bit (Last_Block) signals whether this block is the last 487 one. The frame will end after this last block. It may be followed 488 by an optional Content_Checksum (see Section 3.1.1). 490 3.1.1.2.2. Block_Type 492 The next 2 bits represent the Block_Type. There are four block 493 types: 495 +-----------+------------------+ 496 | Value | Block_Type | 497 +-----------+------------------+ 498 | 0 | Raw_Block | 499 +-----------+------------------+ 500 | 1 | RLE_Block | 501 +-----------+------------------+ 502 | 2 | Compressed_Block | 503 +-----------+------------------+ 504 | 3 | Reserved | 505 +-----------+------------------+ 507 Raw_Block: This is an uncompressed block. Block_Content contains 508 Block_Size bytes. 510 RLE_Block: This is a single byte, repeated Block_Size times. 511 Block_Content consists of a single byte. On the decompression 512 side, this byte must be repeated Block_Size times. 514 Compressed_Block: This is a compressed block as described in 515 Section 3.1.1.3. Block_Size is the length of Block_Content, 516 namely the compressed data. The decompressed size is not known, 517 but its maximum possible value is guaranteed (see below). 519 Reserved: This is not a block. This value cannot be used with the 520 current specification. If such a value is present, it is 521 considered to be corrupt data. 523 3.1.1.2.3. Block_Size 525 The upper 21 bits of Block_Header represent the Block_Size. 527 When Block_Type is Compressed_Block or Raw_Block, Block_Size is the 528 size of Block_Content (hence excluding Block_Header). 530 When Block_Type is RLE_Block, since Block_Content's size is always 1, 531 Block_Size represents the number of times this byte must be repeated. 533 Block_Size is limited by Block_Maximum_Size (see below). 535 3.1.1.2.4. Block_Content and Block_Maximum_Size 537 The size of Block_Content is limited by Block_Maximum_Size, which is 538 the smallest of: 540 o Window_Size 542 o 128 KB 544 Block_Maximum_Size is constant for a given frame. This maximum is 545 applicable to both the decompressed size and the compressed size of 546 any block in the frame. 548 The reasoning for this limit is that a decoder can read this 549 information at the beginning of a frame and use it to allocate 550 buffers. The guarantees on the size of blocks ensure that the 551 buffers will be large enough for any following block of the valid 552 frame. 554 3.1.1.3. Compressed Blocks 556 To decompress a compressed block, the compressed size must be 557 provided from the Block_Size field within Block_Header. 559 A compressed block consists of two sections: a Literals Section 560 (Section 3.1.1.3.1) and a Sequences_Section (Section 3.1.1.3.2). The 561 results of the two sections are then combined to produce the 562 decompressed data in Sequence Execution (Section 3.1.1.4). 564 To decode a compressed block, the following elements are necessary: 566 o Previous decoded data, up to a distance of Window_Size, or the 567 beginning of the Frame, whichever is smaller. Single_Segment_Flag 568 will be set in the latter case. 570 o List of "recent offsets" from the previous Compressed_Block. 572 o The previous Huffman tree, required by Treeless_Literals_Block 573 type. 575 o Previous Finite State Entropy (FSE) decoding tables, required by 576 Repeat_Mode, for each symbol type (literals lengths, match 577 lengths, offsets). 579 Note that decoding tables are not always from the previous 580 Compressed_Block: 582 o Every decoding table can come from a dictionary. 584 o The Huffman tree comes from the previous 585 Compressed_Literals_Block. 587 3.1.1.3.1. Literals_Section_Header 589 All literals are regrouped in the first part of the block. They can 590 be decoded first and then copied during Sequence Execution (see 591 Section 3.1.1.4), or they can be decoded on the flow during Sequence 592 Execution. 594 Literals can be stored uncompressed or compressed using Huffman 595 prefix codes. When compressed, an optional tree description can be 596 present, followed by 1 or 4 streams. 598 +----------------------------+ 599 | Literals_Section_Header | 600 +----------------------------+ 601 | [Huffman_Tree_Description] | 602 +----------------------------+ 603 | [Jump_Table] | 604 +----------------------------+ 605 | Stream_1 | 606 +----------------------------+ 607 | [Stream_2] | 608 +----------------------------+ 609 | [Stream_3] | 610 +----------------------------+ 611 | [Stream_4] | 612 +----------------------------+ 614 3.1.1.3.1.1. Literals_Section_Header 616 This field describes how literals are packed. It's a byte-aligned 617 variable-size bit field, ranging from 1 to 5 bytes, using little- 618 endian convention. 620 +---------------------+-----------+ 621 | Literals_Block_Type | 2 bits | 622 +---------------------+-----------+ 623 | Size_Format | 1-2 bits | 624 +---------------------+-----------+ 625 | Regenerated_Size | 5-20 bits | 626 +---------------------+-----------+ 627 | [Compressed_Size] | 0-18 bits | 628 +---------------------+-----------+ 630 In this representation, bits at the top are the lowest bits. 632 The Literals_Block_Type field uses the two lowest bits of the first 633 byte, describing four different block types: 635 +---------------------------+-------+ 636 | Literals_Block_Type | Value | 637 +---------------------------+-------+ 638 | Raw_Literals_Block | 0 | 639 +---------------------------+-------+ 640 | RLE_Literals_Block | 1 | 641 +---------------------------+-------+ 642 | Compressed_Literals_Block | 2 | 643 +---------------------------+-------+ 644 | Treeless_Literals_Block | 3 | 645 +---------------------------+-------+ 647 Raw_Literals_Block: Literals are stored uncompressed. 648 Literals_Section_Content is Regenerated_Size. 650 RLE_Literals_Block: Literals consist of a single-byte value repeated 651 Regenerated_Size times. Literals_Section_Content is 1. 653 Compressed_Literals_Block: This is a standard Huffman-compressed 654 block, starting with a Huffman tree description. See details 655 below. Literals_Section_Content is Compressed_Size. 657 Treeless_Literals_Block: This is a Huffman-compressed block, using 658 the Huffman tree from the previous Compressed_Literals_Block, or a 659 dictionary if there is no previous Huffman-compressed literals 660 block. Huffman_Tree_Description will be skipped. Note that if 661 this mode is triggered without any previous Huffman-table in the 662 frame (or dictionary, per Section 5), it should be treated as data 663 corruption. Literals_Section_Content is Compressed_Size. 665 The Size_Format is divided into two families: 667 o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary 668 to decode Regenerated_Size. There is no Compressed_Size field. 670 o For Compressed_Block and Treeless_Literals_Block, it's required to 671 decode both Compressed_Size and Regenerated_Size (the decompressed 672 size). It's also necessary to decode the number of streams (1 or 673 4). 675 For values spanning several bytes, the convention is little endian. 677 Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2 678 bits. Its value is (Literals_Section_Header[0]>>2) & 0x3. 680 Size_Format == 00 or 10: Size_Format uses 1 bit. Regenerated_Size 681 uses 5 bits (value 0-31). Literals_Section_Header uses 1 byte. 682 Regenerated_Size = Literal_Section_Header[0]>>3. 684 Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses 685 12 bits (values 0-4095). Literals_Section_Header uses 2 bytes. 686 Regenerated_Size = (Literals_Section_Header[0]>>4) + 687 (Literals_Section_Header[1]<<4). 689 Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses 690 20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes. 691 Regenerated_Size = (Literals_Section_Header[0]>>4) + 692 (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12) 694 Only Stream_1 is present for these cases. Note that it is permitted 695 to represent a short value (for example, 13) using a long format, 696 even if it's less efficient. 698 Size_Format for Compressed_Literals_Block and Treeless_Literals_Block 699 always uses 2 bits. 701 Size_Format == 00: A single stream. Both Regenerated_Size and 702 Compressed_Size use 10 bits (values 0-1023). 703 Literals_Section_Header uses 3 bytes. 705 Size_Format == 01: 4 streams. Both Regenerated_Size and 706 Compressed_Size use 10 bits (values 0-1023). 707 Literals_Section_Header uses 3 bytes. 709 Size_Format == 10: 4 streams. Both Regenerated_Size and 710 Compressed_Size use 14 bits (values 0-16383). 711 Literals_Section_Header uses 4 bytes. 713 Size_Format == 11: 4 streams. Both Regenerated_Size and 714 Compressed_Size use 18 bits (values 0-262143). 715 Literals_Section_Header uses 5 bytes. 717 Both the Compressed_Size and Regenerated_Size fields follow little- 718 endian convention. Note that Compressed_Size includes the size of 719 the Huffman_Tree_Description when it is present. 721 3.1.1.3.1.2. Raw_Literals_Block 723 The data in Stream_1 is Regenerated_Size bytes long. It contains the 724 raw literals data to be used during Sequence Execution 725 (Section 3.1.1.3.2). 727 3.1.1.3.1.3. RLE_Literals_Block 729 Stream_1 consists of a single byte that should be repeated 730 Regenerated_Size times to generate the decoded literals. 732 3.1.1.3.1.4. Compressed_Literals_Block and Treeless_Literals_Block 734 Both of these modes contain Huffman-encoded data. For 735 Treeless_Literals_Block, the Huffman table comes from the previously 736 compressed literals block, or from a dictionary; see Section 5. 738 3.1.1.3.1.5. Huffman_Tree_Description 740 This section is only present when the Literals_Block_Type type is 741 Compressed_Literals_Block (2). The format of 742 Huffman_Tree_Description can be found in Section 4.2.1. The size of 743 Huffman_Tree_Description is determined during the decoding process. 744 It must be used to determine where streams begin. 746 Total_Streams_Size = Compressed_Size 747 - Huffman_Tree_Description_Size 749 3.1.1.3.1.6. Jump_Table 751 The Jump_Table is only present when there are 4 Huffman-coded 752 streams. 754 (Reminder: Huffman-compressed data consists of either 1 or 4 Huffman- 755 coded streams.) 757 If only 1 stream is present, it is a single bitstream occupying the 758 entire remaining portion of the literals block, encoded as described 759 within Section 4.2.2. 761 If there are 4 streams, Literals_Section_Header only provides enough 762 information to know the decompressed and compressed sizes of all 4 763 streams combined. The decompressed size of each stream is equal to 764 (Regenerated_Size+3)/4, except for the last stream, which may be up 765 to 3 bytes smaller, to reach a total decompressed size as specified 766 in Regenerated_Size. 768 The compressed size of each stream is provided explicitly in the 769 Jump_Table. The Jump_Table is 6 bytes long and consists of three 770 2-byte little-endian fields, describing the compressed sizes of the 771 first 3 streams. Stream4_Size is computed from Total_Streams_Size 772 minus sizes of other streams. 774 Stream4_Size = Total_Streams_Size - 6 775 - Stream1_Size - Stream2_Size 776 - Stream3_Size 778 Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds 779 Total_Streams_Size, the data are considered corrupted. 781 Each of these 4 bitstreams is then decoded independently as a 782 Huffman-Coded stream, as described in Section 4.2.2. 784 3.1.1.3.2. Sequences_Section 786 A compressed block is a succession of sequences. A sequence is a 787 literal copy command, followed by a match copy command. A literal 788 copy command specifies a length. It is the number of bytes to be 789 copied (or extracted) from the Literals Section. A match copy 790 command specifies an offset and a length. 792 When all sequences are decoded, if there are literals left in the 793 literals section, these bytes are added at the end of the block. 795 This is described in more detail in Section 3.1.1.4. 797 The Sequences_Section regroups all symbols required to decode 798 commands. There are three symbol types: literals lengths, offsets, 799 and match lengths. They are encoded together, interleaved, in a 800 single "bitstream". 802 The Sequences_Section starts by a header, followed by optional 803 probability tables for each symbol type, followed by the bitstream. 805 Sequences_Section_Header 806 [Literals_Length_Table] 807 [Offset_Table] 808 [Match_Length_Table] 809 bitStream 811 To decode the Sequences_Section, it's necessary to know its size. 812 This size is deduced from the size of the Literals_Section: 813 Sequences_Section_Size = Block_Size - Literals_Section_Header - 814 Literals_Section_Content 816 3.1.1.3.2.1. Sequences_Section_Header 818 This header consists of two items: 820 o Number_of_Sequences 822 o Symbol_Compression_Modes 824 Number_of_Sequences is a variable size field using between 1 and 3 825 bytes. If the first byte is "byte0": 827 o if (byte0 == 0): there are no sequences. The sequence section 828 stops here. Decompressed content is defined entirely as Literals 829 Section content. The FSE tables used in Repeat_Mode are not 830 updated. 832 o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte. 834 o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) + 835 byte1. Uses 2 bytes. 837 o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) + 838 0x7F00. Uses 3 bytes. 840 Symbol_Compression_Modes is a single byte, defining the compression 841 mode of each symbol type. 843 +-------------+----------------------+ 844 | Bit Number | Field Name | 845 +-------------+----------------------+ 846 | 7-6 | Literal_Lengths_Mode | 847 +-------------+----------------------+ 848 | 5-4 | Offsets_Mode | 849 +-------------+----------------------+ 850 | 3-2 | Match_Lengths_Mode | 851 +-------------+----------------------+ 852 | 1-0 | Reserved | 853 +-------------+----------------------+ 855 The last field, Reserved, must be all zeroes. 857 Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define 858 the Compression_Mode of literals lengths, offsets, and match lengths 859 symbols, respectively. They follow the same enumeration: 861 +-------+---------------------+ 862 | Value | Compression_Mode | 863 +-------+---------------------+ 864 | 0 | Predefined_Mode | 865 +-------+---------------------+ 866 | 1 | RLE_Mode | 867 +-------+---------------------+ 868 | 2 | FSE_Compressed_Mode | 869 +-------+---------------------+ 870 | 3 | Repeat_Mode | 871 +-------+---------------------+ 873 Predefined_Mode: A predefined FSE (see Section 4.1) distribution 874 table is used, as defined in Section 3.1.1.3.2.2. No distribution 875 table will be present. 877 RLE_Mode: The table description consists of a single byte, which 878 contains the symbol's value. This symbol will be used for all 879 sequences. 881 FSE_Compressed_Mode: Standard FSE compression. A distribution table 882 will be present. The format of this distribution table is 883 described in Section 4.1.1. Note that the maximum allowed 884 accuracy log for literals length and match length tables is 9, and 885 the maximum accuracy log for the offsets table is 8. This mode 886 must not be used when only one symbol is present; RLE_Mode should 887 be used instead (although any other mode will work). 889 Repeat_Mode: The table used in the previous Compressed_Block with 890 Number_Of_Sequences > 0 will be used again, or if this is the 891 first block, the table in the dictionary will be used. Note that 892 this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the 893 same symbol will be repeated. It also includes Predefined_Mode, 894 in which case Repeat_Mode will have the same outcome as 895 Predefined_Mode. No distribution table will be present. If this 896 mode is used without any previous sequence table in the frame (or 897 dictionary; see Section 5) to repeat, this should be treated as 898 corruption. 900 3.1.1.3.2.1.1. Sequence Codes for Lengths and Offsets 902 Each symbol is a code in its own context, which specifies Baseline 903 and Number_of_Bits to add. Codes are FSE compressed and interleaved 904 with raw additional bits in the same bitstream. 906 Literals length codes are values ranging from 0 to 35 inclusive. 907 They define lengths from 0 to 131071 bytes. The literals length is 908 equal to the decoded Baseline plus the result of reading 909 Number_of_Bits bits from the bitstream, as a little-endian value. 911 +----------------------+----------+----------------+ 912 | Literals_Length_Code | Baseline | Number_of_Bits | 913 +----------------------+----------+----------------+ 914 | 0-15 | length | 0 | 915 +----------------------+----------+----------------+ 916 | 16 | 16 | 1 | 917 +----------------------+----------+----------------+ 918 | 17 | 18 | 1 | 919 +----------------------+----------+----------------+ 920 | 18 | 20 | 1 | 921 +----------------------+----------+----------------+ 922 | 19 | 22 | 1 | 923 +----------------------+----------+----------------+ 924 | 20 | 24 | 2 | 925 +----------------------+----------+----------------+ 926 | 21 | 28 | 2 | 927 +----------------------+----------+----------------+ 928 | 22 | 32 | 3 | 929 +----------------------+----------+----------------+ 930 | 23 | 40 | 3 | 931 +----------------------+----------+----------------+ 932 | 24 | 48 | 4 | 933 +----------------------+----------+----------------+ 934 | 25 | 64 | 6 | 935 +----------------------+----------+----------------+ 936 | 26 | 128 | 7 | 937 +----------------------+----------+----------------+ 938 | 27 | 256 | 8 | 939 +----------------------+----------+----------------+ 940 | 28 | 512 | 9 | 941 +----------------------+----------+----------------+ 942 | 29 | 1024 | 10 | 943 +----------------------+----------+----------------+ 944 | 30 | 2048 | 11 | 945 +----------------------+----------+----------------+ 946 | 31 | 4096 | 12 | 947 +----------------------+----------+----------------+ 948 | 32 | 8192 | 13 | 949 +----------------------+----------+----------------+ 950 | 33 | 16384 | 14 | 951 +----------------------+----------+----------------+ 952 | 34 | 32768 | 15 | 953 +----------------------+----------+----------------+ 954 | 35 | 65536 | 16 | 955 +----------------------+----------+----------------+ 957 Match length codes are values ranging from 0 to 52 inclusive. They 958 define lengths from 3 to 131074 bytes. The match length is equal to 959 the decoded Baseline plus the result of reading Number_of_Bits bits 960 from the bitstream, as a little-endian value. 962 +-------------------+-----------------------+----------------+ 963 | Match_Length_Code | Baseline | Number_of_Bits | 964 +-------------------+-----------------------+----------------+ 965 | 0-31 | Match_Length_Code + 3 | 0 | 966 +-------------------+-----------------------+----------------+ 967 | 32 | 35 | 1 | 968 +-------------------+-----------------------+----------------+ 969 | 33 | 37 | 1 | 970 +-------------------+-----------------------+----------------+ 971 | 34 | 39 | 1 | 972 +-------------------+-----------------------+----------------+ 973 | 35 | 41 | 1 | 974 +-------------------+-----------------------+----------------+ 975 | 36 | 43 | 2 | 976 +-------------------+-----------------------+----------------+ 977 | 37 | 47 | 2 | 978 +-------------------+-----------------------+----------------+ 979 | 38 | 51 | 3 | 980 +-------------------+-----------------------+----------------+ 981 | 39 | 59 | 3 | 982 +-------------------+-----------------------+----------------+ 983 | 40 | 67 | 4 | 984 +-------------------+-----------------------+----------------+ 985 | 41 | 83 | 4 | 986 +-------------------+-----------------------+----------------+ 987 | 42 | 99 | 5 | 988 +-------------------+-----------------------+----------------+ 989 | 43 | 131 | 7 | 990 +-------------------+-----------------------+----------------+ 991 | 44 | 259 | 8 | 992 +-------------------+-----------------------+----------------+ 993 | 45 | 515 | 9 | 994 +-------------------+-----------------------+----------------+ 995 | 46 | 1027 | 10 | 996 +-------------------+-----------------------+----------------+ 997 | 47 | 2051 | 11 | 998 +-------------------+-----------------------+----------------+ 999 | 48 | 4099 | 12 | 1000 +-------------------+-----------------------+----------------+ 1001 | 49 | 8195 | 13 | 1002 +-------------------+-----------------------+----------------+ 1003 | 50 | 16387 | 14 | 1004 +-------------------+-----------------------+----------------+ 1005 | 51 | 32771 | 15 | 1006 +-------------------+-----------------------+----------------+ 1007 | 52 | 65539 | 16 | 1008 +-------------------+-----------------------+----------------+ 1010 Offset codes are values ranging from 0 to N. 1012 A decoder is free to limit its maximum supported value for N. Support 1013 for values of at least 22 is recommended. At the time of this 1014 writing, the reference decoder supports a maximum N value of 31. 1016 An offset code is also the number of additional bits to read in 1017 little-endian fashion and can be translated into an Offset_Value 1018 using the following formulas: 1020 Offset_Value = (1 << offsetCode) + readNBits(offsetCode); 1021 if (Offset_Value > 3) Offset = Offset_Value - 3; 1023 This means that maximum Offset_Value is (2^(N+1))-1, supporting back- 1024 reference distance up to (2^(N+1))-4, but it is limited by the 1025 maximum back-reference distance (see Section 3.1.1.1.2). 1027 Offset_Value from 1 to 3 are special: they define "repeat codes". 1028 This is described in more detail in Section 3.1.1.5. 1030 3.1.1.3.2.1.2. Decoding Sequences 1032 FSE bitstreams are read in reverse of the direction they are written. 1033 In zstd, the compressor writes bits forward into a block, and the 1034 decompressor must read the bitstream backwards. 1036 To find the start of the bitstream, it is therefore necessary to know 1037 the offset of the last byte of the block, which can be found by 1038 counting Block_Size bytes after the block header. 1040 After writing the last bit containing information, the compressor 1041 writes a single 1 bit and then fills the byte with 0-7 zero bits of 1042 padding. The last byte of the compressed bitstream cannot be zero 1043 for that reason. 1045 When decompressing, the last byte containing the padding is the first 1046 byte to read. The decompressor needs to skip 0-7 initial zero bits 1047 until the first 1 bit occurs. Afterwards, the useful part of the 1048 bitstream begins. 1050 FSE decoding requires a 'state' to be carried from symbol to symbol. 1051 For more explanation on FSE decoding, see Section 4.1. 1053 For sequence decoding, a separate state keeps track of each literal 1054 lengths, offsets, and match lengths symbols. Some FSE primitives are 1055 also used. For more details on the operation of these primitives, 1056 see Section 4.1. 1058 The bitstream starts with initial FSE state values, each using the 1059 required number of bits in their respective accuracy, decoded 1060 previously from their normalized distribution. It starts with 1061 Literals_Length_State, followed by Offset_State, and finally 1062 Match_Length_State. 1064 Note that all values are read backward, so the 'start' of the 1065 bitstream is at the highest position in memory, immediately before 1066 the last 1 bit for padding. 1068 After decoding the starting states, a single sequence is decoded 1069 Number_Of_Sequences times. These sequences are decoded in order from 1070 first to last. Since the compressor writes the bitstream in the 1071 forward direction, this means the compressor must encode the 1072 sequences starting with the last one and ending with the first. 1074 For each of the symbol types, the FSE state can be used to determine 1075 the appropriate code. The code then defines the Baseline and 1076 Number_of_Bits to read for each type. The description of the codes 1077 for how to determine these values can be found in 1078 Section 3.1.1.3.2.1. 1080 Decoding starts by reading the Number_of_Bits required to decode 1081 offset. It does the same for Match_Length and then for 1082 Literals_Length. This sequence is then used for Sequence Execution 1083 (see Section 3.1.1.4). 1085 If it is not the last sequence in the block, the next operation is to 1086 update states. Using the rules pre-calculated in the decoding 1087 tables, Literals_Length_State is updated, followed by 1088 Match_Length_State, and then Offset_State. See Section 4.1 for 1089 details on how to update states from the bitstream. 1091 This operation will be repeated Number_of_Sequences times. At the 1092 end, the bitstream shall be entirely consumed; otherwise, the 1093 bitstream is considered corrupted. 1095 3.1.1.3.2.2. Default Distributions 1097 If Predefined_Mode is selected for a symbol type, its FSE decoding 1098 table is generated from a predefined distribution table defined here. 1099 For details on how to convert this distribution into a decoding 1100 table, see Section 4.1. 1102 3.1.1.3.2.2.1. Literals Length 1104 The decoding table uses an accuracy log of 6 bits (64 states). 1106 short literalsLength_defaultDistribution[36] = 1107 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1108 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1109 -1,-1,-1,-1 1110 }; 1112 3.1.1.3.2.2.2. Match Length 1114 The decoding table uses an accuracy log of 6 bits (64 states). 1116 short matchLengths_defaultDistribution[53] = 1117 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1118 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1119 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1120 -1,-1,-1,-1,-1 1121 }; 1123 3.1.1.3.2.2.3. Offset Codes 1125 The decoding table uses an accuracy log of 5 bits (32 states), and 1126 supports a maximum N value of 28, allowing offset values up to 1127 536,870,908. 1129 If any sequence in the compressed block requires a larger offset than 1130 this, it's not possible to use the default distribution to represent 1131 it. 1133 short offsetCodes_defaultDistribution[29] = 1134 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1135 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 1136 }; 1138 3.1.1.4. Sequence Execution 1140 Once literals and sequences have been decoded, they are combined to 1141 produce the decoded content of a block. 1143 Each sequence consists of a tuple of (literals_length, offset_value, 1144 match_length), decoded as described in the Sequences_Section 1145 (Section 3.1.1.3.2). To execute a sequence, first copy 1146 literals_length bytes from the decoded literals to the output. 1148 Then, match_length bytes are copied from previous decoded data. The 1149 offset to copy from is determined by offset_value: 1151 o if Offset_Value > 3, then the offset is Offset_Value - 3; 1152 o if Offset_Value is from 1-3, the offset is a special repeat offset 1153 value. See Section 3.1.1.5 for how the offset is determined in 1154 this case. 1156 The offset is defined as from the current position (after copying the 1157 literals), so an offset of 6 and a match length of 3 means that 3 1158 bytes should be copied from 6 bytes back. Note that all offsets 1159 leading to previously decoded data must be smaller than Window_Size 1160 defined in Frame_Header_Descriptor (Section 3.1.1.1.1). 1162 3.1.1.5. Repeat Offsets 1164 As seen above, the first three values define a repeated offset; we 1165 will call them Repeated_Offset1, Repeated_Offset2, and 1166 Repeated_Offset3. They are sorted in recency order, with 1167 Repeated_Offset1 meaning "most recent one". 1169 If offset_value is 1, then the offset used is Repeated_Offset1, etc. 1171 There is one exception: When the current sequence's literals_length 1172 is 0, repeated offsets are shifted by 1, so an offset_value of 1 1173 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3, 1174 and an offset_value of 3 means Repeated_Offset1 - 1_byte. 1176 For the first block, the starting offset history is populated with 1177 the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and 1178 Repeated_Offset3 (8), unless a dictionary is used, in which case they 1179 come from the dictionary. 1181 Then each block gets its starting offset history from the ending 1182 values of the most recent Compressed_Block. Note that blocks that 1183 are not Compressed_Block are skipped; they do not contribute to 1184 offset history. 1186 The newest offset takes the lead in offset history, shifting others 1187 back (up to its previous place if it was already present). This 1188 means that when Repeated_Offset1 (most recent) is used, history is 1189 unmodified. When Repeated_Offset2 is used, it is swapped with 1190 Repeated_Offset1. If any other offset is used, it becomes 1191 Repeated_Offset1, and the rest are shifted back by 1. 1193 3.1.2. Skippable Frames 1195 +--------------+------------+-----------+ 1196 | Magic_Number | Frame_Size | User_Data | 1197 +--------------+------------+-----------+ 1198 | 4 bytes | 4 bytes | n bytes | 1199 +--------------+------------+-----------+ 1201 Skippable frames allow the insertion of user-defined metadata into a 1202 flow of concatenated frames. 1204 Skippable frames defined in this specification are compatible with 1205 skippable frames in [LZ4]. 1207 From a compliant decoder perspective, skippable frames simply need to 1208 be skipped, and their content ignored, resuming decoding after the 1209 skippable frame. 1211 It should be noted that a skippable frame can be used to watermark a 1212 stream of concatenated frames embedding any kind of tracking 1213 information (even just a Universally Unique Identifier (UUID)). 1214 Users wary of such possibility should scan the stream of concatenated 1215 frames in an attempt to detect such frames for analysis or removal. 1217 The fields are: 1219 Magic_Number: 4 bytes, little-endian format. Value: 0x184D2A5?, 1220 which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values 1221 are valid to identify a skippable frame. This specification does 1222 not detail any specific tagging methods for skippable frames. 1224 Frame_Size: This is the size, in bytes, of the following User_Data 1225 (without including the magic number nor the size field itself). 1226 This field is represented using 4 bytes, little-endian format, 1227 unsigned 32 bits. This means User_Data can't be bigger than 1228 (2^32-1) bytes. 1230 User_Data: This field can be anything. Data will just be skipped by 1231 the decoder. 1233 4. Entropy Encoding 1235 Two types of entropy encoding are used by the Zstandard format: FSE 1236 and Huffman coding. Huffman is used to compress literals, while FSE 1237 is used for all other symbols (Literals_Length_Code, 1238 Match_Length_Code, and offset codes) and to compress Huffman headers. 1240 4.1. FSE 1242 FSE, short for Finite State Entropy, is an entropy codec based on 1243 [ANS]. FSE encoding/decoding involves a state that is carried over 1244 between symbols, so decoding must be done in the opposite direction 1245 as encoding. Therefore, all FSE bitstreams are read from end to 1246 beginning. Note that the order of the bits in the stream is not 1247 reversed; they are simply read in the reverse order from which they 1248 were written. 1250 For additional details on FSE, see Finite State Entropy [FSE]. 1252 FSE decoding involves a decoding table that has a power of 2 size and 1253 contains three elements: Symbol, Num_Bits, and Baseline. The base 2 1254 logarithm of the table size is its Accuracy_Log. An FSE state value 1255 represents an index in this table. 1257 To obtain the initial state value, consume Accuracy_Log bits from the 1258 stream as a little-endian value. The next symbol in the stream is 1259 the Symbol indicated in the table for that state. To obtain the next 1260 state value, the decoder should consume Num_Bits bits from the stream 1261 as a little-endian value and add it to Baseline. 1263 4.1.1. FSE Table Description 1265 To decode FSE streams, it is necessary to construct the decoding 1266 table. The Zstandard format encodes FSE table descriptions as 1267 described here. 1269 An FSE distribution table describes the probabilities of all symbols 1270 from 0 to the last present one (included) on a normalized scale of 1271 (1 << Accuracy_Log). Note that there must be two or more symbols 1272 with non-zero probability. 1274 A bitstream is read forward, in little-endian fashion. It is not 1275 necessary to know its exact size, since the size will be discovered 1276 and reported by the decoding process. The bitstream starts by 1277 reporting on which scale it operates. If low4bits designates the 1278 lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5. 1280 This is followed by each symbol value, from 0 to the last present 1281 one. The number of bits used by each field is variable and depends 1282 on: 1284 Remaining probabilities + 1: For example, presuming an Accuracy_Log 1285 of 8, and presuming 100 probabilities points have already been 1286 distributed, the decoder may read any value from 0 to (256 - 100 + 1287 1) == 157, inclusive. Therefore, it must read log2sup(157) == 8 1288 bits. 1290 Value decoded: Small values use 1 fewer bit. For example, presuming 1291 values from 0 to 157 (inclusive) are possible, 255 - 157 = 98 1292 values are remaining in an 8-bit field. The first 98 values 1293 (hence from 0 to 97) use only 7 bits, and values from 98 to 157 1294 use 8 bits. This is achieved through this scheme: 1296 +------------+---------------+-----------+ 1297 | Value Read | Value Decoded | Bits Used | 1298 +------------+---------------+-----------+ 1299 | 0 - 97 | 0 - 97 | 7 | 1300 +------------+---------------+-----------+ 1301 | 98 - 127 | 98 - 127 | 8 | 1302 +------------+---------------+-----------+ 1303 | 128 - 225 | 0 - 97 | 7 | 1304 +------------+---------------+-----------+ 1305 | 226 - 255 | 128 - 157 | 8 | 1306 +------------+---------------+-----------+ 1308 Symbol probabilities are read one by one, in order. The probability 1309 is obtained from Value decoded using the formula P = Value - 1. This 1310 means the value 0 becomes the negative probability -1. This is a 1311 special probability that means "less than 1". Its effect on the 1312 distribution table is described below. For the purpose of 1313 calculating total allocated probability points, it counts as 1. 1315 When a symbol has a probability of zero, it is followed by a 2-bit 1316 repeat flag. This repeat flag tells how many probabilities of zeroes 1317 follow the current one. It provides a number ranging from 0 to 3. 1318 If it is a 3, another 2-bit repeat flag follows, and so on. 1320 When the last symbol reaches a cumulated total of 1321 (1 << Accuracy_Log), decoding is complete. If the last symbol makes 1322 the cumulated total go above (1 << Accuracy_Log), distribution is 1323 considered corrupted. 1325 Finally, the decoder can tell how many bytes were used in this 1326 process and how many symbols are present. The bitstream consumes a 1327 round number of bytes. Any remaining bit within the last byte is 1328 simply unused. 1330 The distribution of normalized probabilities is enough to create a 1331 unique decoding table. The table has a size of (1 << Accuracy_Log). 1332 Each cell describes the symbol decoded and instructions to get the 1333 next state. 1335 Symbols are scanned in their natural order for "less than 1" 1336 probabilities as described above. Symbols with this probability are 1337 being attributed a single cell, starting from the end of the table 1338 and retreating. These symbols define a full state reset, reading 1339 Accuracy_Log bits. 1341 All remaining symbols are allocated in their natural order. Starting 1342 from symbol 0 and table position 0, each symbol gets allocated as 1343 many cells as its probability. Cell allocation is spread, not 1344 linear; each successor position follows this rule: 1346 position += (tableSize >> 1) + (tableSize >> 3) + 3; 1347 position &= tableSize - 1; 1349 A position is skipped if it is already occupied by a "less than 1" 1350 probability symbol. Position does not reset between symbols; it 1351 simply iterates through each position in the table, switching to the 1352 next symbol when enough states have been allocated to the current 1353 one. 1355 The result is a list of state values. Each state will decode the 1356 current symbol. 1358 To get the Number_of_Bits and Baseline required for the next state, 1359 it is first necessary to sort all states in their natural order. The 1360 lower states will need 1 more bit than higher ones. The process is 1361 repeated for each symbol. 1363 For example, presuming a symbol has a probability of 5, it receives 1364 five state values. States are sorted in natural order. The next 1365 power of 2 is 8. The space of probabilities is divided into 8 equal 1366 parts. Presuming the Accuracy_Log is 7, this defines 128 states, and 1367 each share (divided by 8) is 16 in size. In order to reach 8, 8 - 5 1368 = 3 lowest states will count "double", doubling the number of shares 1369 (32 in width), requiring 1 more bit in the process. 1371 Baseline is assigned starting from the higher states using fewer 1372 bits, and proceeding naturally, then resuming at the first state, 1373 each taking its allocated width from Baseline. 1375 +----------------+-------+-------+--------+------+-------+ 1376 | state order | 0 | 1 | 2 | 3 | 4 | 1377 +----------------+-------+-------+--------+------+-------+ 1378 | width | 32 | 32 | 32 | 16 | 16 | 1379 +----------------+-------+-------+--------+------+-------+ 1380 | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | 1381 +----------------+-------+-------+--------+------+-------+ 1382 | range number | 2 | 4 | 6 | 0 | 1 | 1383 +----------------+-------+-------+--------+------+-------+ 1384 | Baseline | 32 | 64 | 96 | 0 | 16 | 1385 +----------------+-------+-------+--------+------+-------+ 1386 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1387 +----------------+-------+-------+--------+------+-------+ 1389 The next state is determined from the current state by reading the 1390 required Number_of_Bits and adding the specified Baseline. 1392 See Appendix A for the results of this process that are applied to 1393 the default distributions. 1395 4.2. Huffman Coding 1397 Zstandard Huffman-coded streams are read backwards, similar to the 1398 FSE bitstreams. Therefore, to find the start of the bitstream, it is 1399 necessary to know the offset of the last byte of the Huffman-coded 1400 stream. 1402 After writing the last bit containing information, the compressor 1403 writes a single 1 bit and then fills the byte with 0-7 0 bits of 1404 padding. The last byte of the compressed bitstream cannot be 0 for 1405 that reason. 1407 When decompressing, the last byte containing the padding is the first 1408 byte to read. The decompressor needs to skip 0-7 initial 0 bits and 1409 the first 1 bit that occurs. Afterwards, the useful part of the 1410 bitstream begins. 1412 The bitstream contains Huffman-coded symbols in little-endian order, 1413 with the codes defined by the method below. 1415 4.2.1. Huffman Tree Description 1417 Prefix coding represents symbols from an a priori known alphabet by 1418 bit sequences (codewords), one codeword for each symbol, in a manner 1419 such that different symbols may be represented by bit sequences of 1420 different lengths, but a parser can always parse an encoded string 1421 unambiguously symbol by symbol. 1423 Given an alphabet with known symbol frequencies, the Huffman 1424 algorithm allows the construction of an optimal prefix code using the 1425 fewest bits of any possible prefix codes for that alphabet. 1427 The prefix code must not exceed a maximum code length. More bits 1428 improve accuracy but yield a larger header size and require more 1429 memory or more complex decoding operations. This specification 1430 limits the maximum code length to 11 bits. 1432 All literal values from zero (included) to the last present one 1433 (excluded) are represented by Weight with values from 0 to 1434 Max_Number_of_Bits. Transformation from Weight to Number_of_Bits 1435 follows this pseudocode: 1437 if Weight == 0 1438 Number_of_Bits = 0 1439 else 1440 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1442 The last symbol's Weight is deduced from previously decoded ones, by 1443 completing to the nearest power of 2. This power of 2 gives 1444 Max_Number_of_Bits the depth of the current tree. 1446 For example, presume the following Huffman tree must be described: 1448 +---------------+----------------+ 1449 | Literal Value | Number_of_Bits | 1450 +---------------+----------------+ 1451 | 0 | 1 | 1452 +---------------+----------------+ 1453 | 1 | 2 | 1454 +---------------+----------------+ 1455 | 2 | 3 | 1456 +---------------+----------------+ 1457 | 3 | 0 | 1458 +---------------+----------------+ 1459 | 4 | 4 | 1460 +---------------+----------------+ 1461 | 5 | 4 | 1462 +---------------+----------------+ 1464 The tree depth is 4, since its longest element uses 4 bits. (The 1465 longest elements are those with the smallest frequencies.) Value 5 1466 will not be listed as it can be determined from the values for 0-4, 1467 nor will values above 5 as they are all 0. Values from 0 to 4 will 1468 be listed using Weight instead of Number_of_Bits. The pseudocode to 1469 determine Weight is: 1471 if Number_of_Bits == 0 1472 Weight = 0 1473 else 1474 Weight = Max_Number_of_Bits + 1 - Number_of_Bits 1476 It gives the following series of weights: 1478 +---------------+--------+ 1479 | Literal Value | Weight | 1480 +---------------+--------+ 1481 | 0 | 4 | 1482 +---------------+--------+ 1483 | 1 | 3 | 1484 +---------------+--------+ 1485 | 2 | 2 | 1486 +---------------+--------+ 1487 | 3 | 0 | 1488 +---------------+--------+ 1489 | 4 | 1 | 1490 +---------------+--------+ 1492 The decoder will do the inverse operation: having collected weights 1493 of literals from 0 to 4, it knows the last literal, 5, is present 1494 with a non-zero Weight. The Weight of 5 can be determined by 1495 advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 1496 0's) is 15. The nearest power of 2 is 16. Therefore, 1497 Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1. 1499 4.2.1.1. Huffman Tree Header 1501 This is a single byte value (0-255), which describes how the series 1502 of weights is encoded. 1504 headerByte < 128: The series of weights is compressed using FSE (see 1505 below). The length of the FSE-compressed series is equal to 1506 headerByte (0-127). 1508 headerByte >= 128: This is a direct representation, where each 1509 Weight is written directly as a 4-bit field (0-15). They are 1510 encoded forward, 2 weights to a byte with the first weight taking 1511 the top 4 bits and the second taking the bottom 4; for example, 1512 the following operations could be used to read the weights: 1514 Weight[0] = (Byte[0] >> 4) 1515 Weight[1] = (Byte[0] & 0xf), 1516 etc. 1518 The full representation occupies ceiling(Number_of_Symbols/2) 1519 bytes, meaning it uses only full bytes even if Number_of_Symbols 1520 is odd. Number_of_Symbols = headerByte - 127. Note that maximum 1521 Number_of_Symbols is 255 - 127 = 128. If any literal has a value 1522 over 128, raw header mode is not possible, and it is necessary to 1523 use FSE compression. 1525 4.2.1.2. FSE Compression of Huffman Weights 1527 In this case, the series of Huffman weights is compressed using FSE 1528 compression. It is a single bitstream with two interleaved states, 1529 sharing a single distribution table. 1531 To decode an FSE bitstream, it is necessary to know its compressed 1532 size. Compressed size is provided by headerByte. It's also 1533 necessary to know its maximum possible decompressed size, which is 1534 255, since literal values span from 0 to 255, and the last symbol's 1535 Weight is not represented. 1537 An FSE bitstream starts by a header, describing probabilities 1538 distribution. It will create a decoding table. For a list of 1539 Huffman weights, the maximum accuracy log is 6 bits. For more 1540 details, see Section 4.1.1. 1542 The Huffman header compression uses two states, which share the same 1543 FSE distribution table. The first state (State1) encodes the even- 1544 numbered index symbols, and the second (State2) encodes the odd- 1545 numbered index symbols. State1 is initialized first, and then 1546 State2, and they take turns decoding a single symbol and updating 1547 their state. For more details on these FSE operations, see 1548 Section 4.1. 1550 The number of symbols to be decoded is determined by tracking the 1551 bitStream overflow condition: If updating state after decoding a 1552 symbol would require more bits than remain in the stream, it is 1553 assumed that extra bits are zero. Then, symbols for each of the 1554 final states are decoded and the process is complete. 1556 4.2.1.3. Conversion from Weights to Huffman Prefix Codes 1558 All present symbols will now have a Weight value. It is possible to 1559 transform weights into Number_of_Bits, using this formula: 1561 if Weight > 0 1562 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1563 else 1564 Number_of_Bits = 0 1566 Symbols are sorted by Weight. Within the same Weight, symbols keep 1567 natural sequential order. Symbols with a Weight of zero are removed. 1568 Then, starting from the lowest Weight, prefix codes are distributed 1569 in sequential order. 1571 For example, assume the following list of weights has been decoded: 1573 +---------+--------+ 1574 | Literal | Weight | 1575 +---------+--------+ 1576 | 0 | 4 | 1577 +---------+--------+ 1578 | 1 | 3 | 1579 +---------+--------+ 1580 | 2 | 2 | 1581 +---------+--------+ 1582 | 3 | 0 | 1583 +---------+--------+ 1584 | 4 | 1 | 1585 +---------+--------+ 1586 | 5 | 1 | 1587 +---------+--------+ 1589 Sorting by weight and then the natural sequential order yields the 1590 following distribution: 1592 +---------+--------+----------------+--------------+ 1593 | Literal | Weight | Number_Of_Bits | Prefix Codes | 1594 +---------+--------+----------------|--------------+ 1595 | 3 | 0 | 0 | N/A | 1596 +---------+--------+----------------|--------------+ 1597 | 4 | 1 | 4 | 0000 | 1598 +---------+--------+----------------|--------------+ 1599 | 5 | 1 | 4 | 0001 | 1600 +---------+--------+----------------|--------------+ 1601 | 2 | 2 | 3 | 001 | 1602 +---------+--------+----------------|--------------+ 1603 | 1 | 3 | 2 | 01 | 1604 +---------+--------+----------------|--------------+ 1605 | 0 | 4 | 1 | 1 | 1606 +---------+--------+----------------|--------------+ 1608 4.2.2. Huffman-Coded Streams 1610 Given a Huffman decoding table, it is possible to decode a Huffman- 1611 coded stream. 1613 Each bitstream must be read backward, which starts from the end and 1614 goes up to the beginning. Therefore, it is necessary to know the 1615 size of each bitstream. 1617 It is also necessary to know exactly which bit is the last. This is 1618 detected by a final bit flag: the highest bit of the last byte is a 1619 final-bit-flag. Consequently, a last byte of 0 is not possible. And 1620 the final-bit-flag itself is not part of the useful bitstream. 1622 Hence, the last byte contains between 0 and 7 useful bits. 1624 Starting from the end, it is possible to read the bitstream in a 1625 little-endian fashion, keeping track of already used bits. Since the 1626 bitstream is encoded in reverse order, starting from the end, read 1627 symbols in forward order. 1629 For example, if the literal sequence "0145" was encoded using the 1630 above prefix code, it would be encoded (in reverse order) as: 1632 +---------+----------+ 1633 | Symbol | Encoding | 1634 +---------+----------+ 1635 | 5 | 0000 | 1636 +---------+----------+ 1637 | 4 | 0001 | 1638 +---------+----------+ 1639 | 1 | 01 | 1640 +---------+----------+ 1641 | 0 | 1 | 1642 +---------+----------+ 1643 | Padding | 00001 | 1644 +---------+----------+ 1646 This results in the following 2-byte bitstream: 1648 00010000 00001101 1650 Here is an alternative representation with the symbol codes separated 1651 by underscores: 1653 0001_0000 00001_1_01 1655 Reading the highest Max_Number_of_Bits bits, it's possible to compare 1656 the extracted value to the decoding table, determining the symbol to 1657 decode and number of bits to discard. 1659 The process continues reading up to the required number of symbols 1660 per stream. If a bitstream is not entirely and exactly consumed, 1661 hence reaching exactly its beginning position with all bits consumed, 1662 the decoding process is considered faulty. 1664 5. Dictionary Format 1666 Zstandard is compatible with "raw content" dictionaries, free of any 1667 format restriction, except that they must be at least 8 bytes. These 1668 dictionaries function as if they were just the content part of a 1669 formatted dictionary. 1671 However, dictionaries created by "zstd --train" in the reference 1672 implementation follow a specific format, described here. 1674 Dictionaries are not included in the compressed content but rather 1675 are provided out of band. That is, the Dictionary_ID identifies 1676 which should be used, but this specification does not describe the 1677 mechanism by which the dictionary is obtained prior to use during 1678 compression or decompression. 1680 A dictionary has a size, defined either by a buffer limit or a file 1681 size. The general format is: 1683 +--------------+---------------+----------------+---------+ 1684 | Magic_Number | Dictionary_ID | Entropy_Tables | Content | 1685 +--------------+---------------+----------------+---------+ 1687 Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format. 1689 Dictionary_ID: 4 bytes, stored in little-endian format. 1690 Dictionary_ID can be any value, except 0 (which means no 1691 Dictionary_ID). It is used by decoders to check if they use the 1692 correct dictionary. If the frame is going to be distributed in a 1693 private environment, any Dictionary_ID can be used. However, for 1694 public distribution of compressed frames, the following ranges are 1695 reserved and shall not be used: 1696 low range: <= 32767 1697 high range: >= (2^31) 1699 Entropy_Tables: Follow the same format as the tables in compressed 1700 blocks. See the relevant FSE and Huffman sections for how to 1701 decode these tables. They are stored in the following order: 1702 Huffman table for literals, FSE table for offsets, FSE table for 1703 match lengths, and FSE table for literals lengths. These tables 1704 populate the Repeat Stats literals mode and Repeat distribution 1705 mode for sequence decoding. It is finally followed by 3 offset 1706 values, populating repeat offsets (instead of using {1,4,8}), 1707 stored in order, 4-bytes little-endian each, for a total of 12 1708 bytes. Each repeat offset must have a value less than the 1709 dictionary size. 1711 Content: The rest of the dictionary is its content. The content 1712 acts as a "past" in front of data to be compressed or 1713 decompressed, so it can be referenced in sequence commands. As 1714 long as the amount of data decoded from this frame is less than or 1715 equal to Window_Size, sequence commands may specify offsets longer 1716 than the total length of decoded output so far to reference back 1717 to the dictionary, even parts of the dictionary with offsets 1718 larger than Window_Size. After the total output has surpassed 1719 Window_Size, however, this is no longer allowed, and the 1720 dictionary is no longer accessible. 1722 6. Use of Dictionaries 1724 Provisioning for use of dictionaries with zstd is planned for future 1725 work. To ensure compatibility with the future specification of use 1726 of dictionaries with zstd payloads, especially with MIME, content 1727 encoded with the media type registered here SHOULD NOT use a 1728 dictionary. The exception to this requirement might be a private 1729 dictionary negotiaton that is not part of this specification. 1731 7. IANA Considerations 1733 IANA has made three registrations, as described below. 1735 7.1. The 'application/zstd' Media Type 1737 The 'application/zstd' media type identifies a block of data that is 1738 compressed using zstd compression. The data is a stream of bytes as 1739 described in this document. IANA has added the following to the 1740 "Media Types" registry: 1742 Type name: application 1744 Subtype name: zstd 1746 Required parameters: N/A 1748 Optional parameters: N/A 1750 Encoding considerations: binary 1752 Security considerations: See Section 8 of [this document] 1754 Interoperability considerations: N/A 1756 Published specification: [this document] 1758 Applications that use this media type: anywhere data size is an 1759 issue 1761 Additional information: 1763 Magic number(s): 4 bytes, little-endian format. Value: 1764 0xFD2FB528 1766 File extension(s): zst 1768 Macintosh file type code(s): N/A 1770 For further information: See [ZSTD] 1772 Intended usage: common 1774 Restrictions on usage: N/A 1776 Author: Murray S. Kucherawy 1778 Change Controller: IETF 1780 Provisional registration: no 1782 7.2. Content Encoding 1784 IANA has added the following entry to the "HTTP Content Coding 1785 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" 1786 registry: 1788 Name: zstd 1790 Description: A stream of bytes compressed using the Zstandard 1791 protocol 1793 Pointer to specification text: [this document] 1795 7.3. Structured Syntax Suffix 1797 IANA is requested to register the following into the Structured 1798 Syntax Suffix registry: 1800 Name: Zstandard 1802 +suffix: +zstd 1804 Encoding Considerations: binary 1805 Interoperability Considerations: N/A 1807 Fragment Identifier Considerations: The syntax and semantics of 1808 fragment identifiers specified for +zstd SHOULD be as specified 1809 for "application/zstd". 1811 Security Considerations: See Section 7 of [this document]. 1813 Contact: See Section 7 of [this document]. 1815 Author/Change Controller: See Section 7 of [this document]. 1817 7.4. Dictionaries 1819 Work in progress includes development of dictionaries that will 1820 optimize compression and decompression of particular types of data. 1821 Specification of such dictionaries for public use will necessitate 1822 registration of a code point from the reserved range described in 1823 Section 3.1.1.1.3 and its association with a specific dictionary. 1825 However, there are at present no such dictionaries published for 1826 public use, so this document makes no immediate request of IANA to 1827 create such a registry. 1829 8. Security Considerations 1831 Any data compression method involves the reduction of redundancy in 1832 the data. Zstandard is no exception, and the usual precautions 1833 apply. 1835 One should never compress a message whose content must remain secret 1836 with a message generated by a third party. Such a compression can be 1837 used to guess the content of the secret message through analysis of 1838 entropy reduction. This was demonstrated in the Compression Ratio 1839 Info-leak Made Easy (CRIME) attack [CRIME], for example. 1841 A decoder has to demonstrate capabilities to detect and prevent any 1842 kind of data tampering in the compressed frame from triggering system 1843 faults, such as reading or writing beyond allowed memory ranges. 1844 This can be guaranteed by either the implementation language or 1845 careful bound checkings. Of particular note is the encoding of 1846 Number_of_Sequences values that cause the decoder to read into the 1847 block header (and beyond), as well as the indication of a 1848 Frame_Content_Size that is smaller than the actual decompressed data, 1849 in an attempt to trigger a buffer overflow. It is highly recommended 1850 to fuzz-test (i.e., provide invalid, unexpected, or random input and 1851 verify safe operation of) decoder implementations to test and harden 1852 their capability to detect bad frames and deal with them without any 1853 adverse system side effect. 1855 An attacker may provide correctly formed compressed frames with 1856 unreasonable memory requirements. A decoder must always control 1857 memory requirements and enforce some (system-specific) limits in 1858 order to protect memory usage from such scenarios. 1860 Compression can be optimized by training a dictionary on a variety of 1861 related content payloads. This dictionary must then be available at 1862 the decoder for decompression of the payload to be possible. While 1863 this document does not specify how to acquire a dictionary for a 1864 given compressed payload, it is worth noting that third-party 1865 dictionaries may interact unexpectedly with a decoder, leading to 1866 possible memory or other resource exhaustion attacks. We expect such 1867 topics to be discussed in further detail in the Security 1868 Considerations section of a forthcoming RFC for dictionary 1869 acquisition and transmission, but highlight this issue now out of an 1870 abundance of caution. 1872 As discussed in Section 3.1.2, it is possible to store arbitrary user 1873 metadata in skippable frames. While such frames are ignored during 1874 decompression of the data, they can be used as a watermark to track 1875 the path of the compressed payload. 1877 9. Implementation Status 1879 Source code for a C language implementation of a Zstandard-compliant 1880 library is available at [ZSTD-GITHUB]. This implementation is 1881 considered to be the reference implementation and is production 1882 ready; it implements the full range of the specification. It is 1883 routinely tested against security hazards and widely deployed within 1884 Facebook infrastructure. 1886 The reference version is optimized for speed and is highly portable. 1887 It has been proven to run safely on multiple architectures (e.g., 1888 x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit 1889 addressing schemes, a little- or big-endian storage scheme, a number 1890 of different operating systems (e.g., UNIX (including Linux, BSD, 1891 OS-X, and Solaris) and Windows), and a number of compilers (e.g., 1892 gcc, clang, visual, and icc). 1894 10. References 1895 10.1. Normative References 1897 [ZSTD] "Zstandard", 2017, . 1899 10.2. Informative References 1901 [ANS] Duda, J., "Asymmetric numeral systems: entropy coding 1902 combining speed of Huffman coding with compression rate of 1903 arithmetic coding", January 2014, 1904 . 1906 [CRIME] "CRIME", June 2018, . 1909 [FSE] "FiniteStateEntropy", June 2018, 1910 . 1912 [LZ4] "LZ4 Frame Format Description", January 2018, . 1915 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", 1916 RFC 1952, DOI 10.17487/RFC1952, May 1996, 1917 . 1919 [XXHASH] "XXHASH Algorithm", 2017, . 1921 [ZSTD-GITHUB] 1922 "zstd", August 2018, . 1924 Appendix A. Decoding Tables for Predefined Codes 1926 This appendix contains FSE decoding tables for the predefined literal 1927 length, match length, and offset codes. The tables have been 1928 constructed using the algorithm as given above in Section 4.1.1. The 1929 tables here can be used as examples to crosscheck that an 1930 implementation has built its decoding tables correctly. 1932 A.1. Literal Length Code Table 1934 +-------+--------+----------------+------+ 1935 | State | Symbol | Number_Of_Bits | Base | 1936 +-------+--------+----------------+------+ 1937 | 0 | 0 | 0 | 0 | 1938 +-------+--------+----------------+------+ 1939 | 0 | 0 | 4 | 0 | 1940 +-------+--------+----------------+------+ 1941 | 1 | 0 | 4 | 16 | 1942 +-------+--------+----------------+------+ 1943 | 2 | 1 | 5 | 32 | 1944 +-------+--------+----------------+------+ 1945 | 3 | 3 | 5 | 0 | 1946 +-------+--------+----------------+------+ 1947 | 4 | 4 | 5 | 0 | 1948 +-------+--------+----------------+------+ 1949 | 5 | 6 | 5 | 0 | 1950 +-------+--------+----------------+------+ 1951 | 6 | 7 | 5 | 0 | 1952 +-------+--------+----------------+------+ 1953 | 7 | 9 | 5 | 0 | 1954 +-------+--------+----------------+------+ 1955 | 8 | 10 | 5 | 0 | 1956 +-------+--------+----------------+------+ 1957 | 9 | 12 | 5 | 0 | 1958 +-------+--------+----------------+------+ 1959 | 10 | 14 | 6 | 0 | 1960 +-------+--------+----------------+------+ 1961 | 11 | 16 | 5 | 0 | 1962 +-------+--------+----------------+------+ 1963 | 12 | 18 | 5 | 0 | 1964 +-------+--------+----------------+------+ 1965 | 13 | 19 | 5 | 0 | 1966 +-------+--------+----------------+------+ 1967 | 14 | 21 | 5 | 0 | 1968 +-------+--------+----------------+------+ 1969 | 15 | 22 | 5 | 0 | 1970 +-------+--------+----------------+------+ 1971 | 16 | 24 | 5 | 0 | 1972 +-------+--------+----------------+------+ 1973 | 17 | 25 | 5 | 32 | 1974 +-------+--------+----------------+------+ 1975 | 18 | 26 | 5 | 0 | 1976 +-------+--------+----------------+------+ 1977 | 19 | 27 | 6 | 0 | 1978 +-------+--------+----------------+------+ 1979 | 20 | 29 | 6 | 0 | 1980 +-------+--------+----------------+------+ 1981 | 21 | 31 | 6 | 0 | 1982 +-------+--------+----------------+------+ 1983 | 22 | 0 | 4 | 32 | 1984 +-------+--------+----------------+------+ 1985 | 23 | 1 | 4 | 0 | 1986 +-------+--------+----------------+------+ 1987 | 24 | 2 | 5 | 0 | 1988 +-------+--------+----------------+------+ 1989 | 25 | 4 | 5 | 32 | 1990 +-------+--------+----------------+------+ 1991 | 26 | 5 | 5 | 0 | 1992 +-------+--------+----------------+------+ 1993 | 27 | 7 | 5 | 32 | 1994 +-------+--------+----------------+------+ 1995 | 28 | 8 | 5 | 0 | 1996 +-------+--------+----------------+------+ 1997 | 29 | 10 | 5 | 32 | 1998 +-------+--------+----------------+------+ 1999 | 30 | 11 | 5 | 0 | 2000 +-------+--------+----------------+------+ 2001 | 31 | 13 | 6 | 0 | 2002 +-------+--------+----------------+------+ 2003 | 32 | 16 | 5 | 32 | 2004 +-------+--------+----------------+------+ 2005 | 33 | 17 | 5 | 0 | 2006 +-------+--------+----------------+------+ 2007 | 34 | 19 | 5 | 32 | 2008 +-------+--------+----------------+------+ 2009 | 35 | 20 | 5 | 0 | 2010 +-------+--------+----------------+------+ 2011 | 36 | 22 | 5 | 32 | 2012 +-------+--------+----------------+------+ 2013 | 37 | 23 | 5 | 0 | 2014 +-------+--------+----------------+------+ 2015 | 38 | 25 | 4 | 0 | 2016 +-------+--------+----------------+------+ 2017 | 39 | 25 | 4 | 16 | 2018 +-------+--------+----------------+------+ 2019 | 40 | 26 | 5 | 32 | 2020 +-------+--------+----------------+------+ 2021 | 41 | 28 | 6 | 0 | 2022 +-------+--------+----------------+------+ 2023 | 42 | 30 | 6 | 0 | 2024 +-------+--------+----------------+------+ 2025 | 43 | 0 | 4 | 48 | 2026 +-------+--------+----------------+------+ 2027 | 44 | 1 | 4 | 16 | 2028 +-------+--------+----------------+------+ 2029 | 45 | 2 | 5 | 32 | 2030 +-------+--------+----------------+------+ 2031 | 46 | 3 | 5 | 32 | 2032 +-------+--------+----------------+------+ 2033 | 47 | 5 | 5 | 32 | 2034 +-------+--------+----------------+------+ 2035 | 48 | 6 | 5 | 32 | 2036 +-------+--------+----------------+------+ 2037 | 49 | 8 | 5 | 32 | 2038 +-------+--------+----------------+------+ 2039 | 50 | 9 | 5 | 32 | 2040 +-------+--------+----------------+------+ 2041 | 51 | 11 | 5 | 32 | 2042 +-------+--------+----------------+------+ 2043 | 52 | 12 | 5 | 32 | 2044 +-------+--------+----------------+------+ 2045 | 53 | 15 | 6 | 0 | 2046 +-------+--------+----------------+------+ 2047 | 54 | 17 | 5 | 32 | 2048 +-------+--------+----------------+------+ 2049 | 55 | 18 | 5 | 32 | 2050 +-------+--------+----------------+------+ 2051 | 56 | 20 | 5 | 32 | 2052 +-------+--------+----------------+------+ 2053 | 57 | 21 | 5 | 32 | 2054 +-------+--------+----------------+------+ 2055 | 58 | 23 | 5 | 32 | 2056 +-------+--------+----------------+------+ 2057 | 59 | 24 | 5 | 32 | 2058 +-------+--------+----------------+------+ 2059 | 60 | 35 | 6 | 0 | 2060 +-------+--------+----------------+------+ 2061 | 61 | 34 | 6 | 0 | 2062 +-------+--------+----------------+------+ 2063 | 62 | 33 | 6 | 0 | 2064 +-------+--------+----------------+------+ 2065 | 63 | 32 | 6 | 0 | 2066 +-------+--------+----------------+------+ 2068 A.2. Match Length Code Table 2070 +-------+--------+----------------+------+ 2071 | State | Symbol | Number_Of_Bits | Base | 2072 +-------+--------+----------------+------+ 2073 | 0 | 0 | 0 | 0 | 2074 +-------+--------+----------------+------+ 2075 | 0 | 0 | 6 | 0 | 2076 +-------+--------+----------------+------+ 2077 | 1 | 1 | 4 | 0 | 2078 +-------+--------+----------------+------+ 2079 | 2 | 2 | 5 | 32 | 2080 +-------+--------+----------------+------+ 2081 | 3 | 3 | 5 | 0 | 2082 +-------+--------+----------------+------+ 2083 | 4 | 5 | 5 | 0 | 2084 +-------+--------+----------------+------+ 2085 | 5 | 6 | 5 | 0 | 2086 +-------+--------+----------------+------+ 2087 | 6 | 8 | 5 | 0 | 2088 +-------+--------+----------------+------+ 2089 | 7 | 10 | 6 | 0 | 2090 +-------+--------+----------------+------+ 2091 | 8 | 13 | 6 | 0 | 2092 +-------+--------+----------------+------+ 2093 | 9 | 16 | 6 | 0 | 2094 +-------+--------+----------------+------+ 2095 | 10 | 19 | 6 | 0 | 2096 +-------+--------+----------------+------+ 2097 | 11 | 22 | 6 | 0 | 2098 +-------+--------+----------------+------+ 2099 | 12 | 25 | 6 | 0 | 2100 +-------+--------+----------------+------+ 2101 | 13 | 28 | 6 | 0 | 2102 +-------+--------+----------------+------+ 2103 | 14 | 31 | 6 | 0 | 2104 +-------+--------+----------------+------+ 2105 | 15 | 33 | 6 | 0 | 2106 +-------+--------+----------------+------+ 2107 | 16 | 35 | 6 | 0 | 2108 +-------+--------+----------------+------+ 2109 | 17 | 37 | 6 | 0 | 2110 +-------+--------+----------------+------+ 2111 | 18 | 39 | 6 | 0 | 2112 +-------+--------+----------------+------+ 2113 | 19 | 41 | 6 | 0 | 2114 +-------+--------+----------------+------+ 2115 | 20 | 43 | 6 | 0 | 2116 +-------+--------+----------------+------+ 2117 | 21 | 45 | 6 | 0 | 2118 +-------+--------+----------------+------+ 2119 | 22 | 1 | 4 | 16 | 2120 +-------+--------+----------------+------+ 2121 | 23 | 2 | 4 | 0 | 2122 +-------+--------+----------------+------+ 2123 | 24 | 3 | 5 | 32 | 2124 +-------+--------+----------------+------+ 2125 | 25 | 4 | 5 | 0 | 2126 +-------+--------+----------------+------+ 2127 | 26 | 6 | 5 | 32 | 2128 +-------+--------+----------------+------+ 2129 | 27 | 7 | 5 | 0 | 2130 +-------+--------+----------------+------+ 2131 | 28 | 9 | 6 | 0 | 2132 +-------+--------+----------------+------+ 2133 | 29 | 12 | 6 | 0 | 2134 +-------+--------+----------------+------+ 2135 | 30 | 15 | 6 | 0 | 2136 +-------+--------+----------------+------+ 2137 | 31 | 18 | 6 | 0 | 2138 +-------+--------+----------------+------+ 2139 | 32 | 21 | 6 | 0 | 2140 +-------+--------+----------------+------+ 2141 | 33 | 24 | 6 | 0 | 2142 +-------+--------+----------------+------+ 2143 | 34 | 27 | 6 | 0 | 2144 +-------+--------+----------------+------+ 2145 | 35 | 30 | 6 | 0 | 2146 +-------+--------+----------------+------+ 2147 | 36 | 32 | 6 | 0 | 2148 +-------+--------+----------------+------+ 2149 | 37 | 34 | 6 | 0 | 2150 +-------+--------+----------------+------+ 2151 | 38 | 36 | 6 | 0 | 2152 +-------+--------+----------------+------+ 2153 | 39 | 38 | 6 | 0 | 2154 +-------+--------+----------------+------+ 2155 | 40 | 40 | 6 | 0 | 2156 +-------+--------+----------------+------+ 2157 | 41 | 42 | 6 | 0 | 2158 +-------+--------+----------------+------+ 2159 | 42 | 44 | 6 | 0 | 2160 +-------+--------+----------------+------+ 2161 | 43 | 1 | 4 | 32 | 2162 +-------+--------+----------------+------+ 2163 | 44 | 1 | 4 | 48 | 2164 +-------+--------+----------------+------+ 2165 | 45 | 2 | 4 | 16 | 2166 +-------+--------+----------------+------+ 2167 | 46 | 4 | 5 | 32 | 2168 +-------+--------+----------------+------+ 2169 | 47 | 5 | 5 | 32 | 2170 +-------+--------+----------------+------+ 2171 | 48 | 7 | 5 | 32 | 2172 +-------+--------+----------------+------+ 2173 | 49 | 8 | 5 | 32 | 2174 +-------+--------+----------------+------+ 2175 | 50 | 11 | 6 | 0 | 2176 +-------+--------+----------------+------+ 2177 | 51 | 14 | 6 | 0 | 2178 +-------+--------+----------------+------+ 2179 | 52 | 17 | 6 | 0 | 2180 +-------+--------+----------------+------+ 2181 | 53 | 20 | 6 | 0 | 2182 +-------+--------+----------------+------+ 2183 | 54 | 23 | 6 | 0 | 2184 +-------+--------+----------------+------+ 2185 | 55 | 26 | 6 | 0 | 2186 +-------+--------+----------------+------+ 2187 | 56 | 29 | 6 | 0 | 2188 +-------+--------+----------------+------+ 2189 | 57 | 52 | 6 | 0 | 2190 +-------+--------+----------------+------+ 2191 | 58 | 51 | 6 | 0 | 2192 +-------+--------+----------------+------+ 2193 | 59 | 50 | 6 | 0 | 2194 +-------+--------+----------------+------+ 2195 | 60 | 49 | 6 | 0 | 2196 +-------+--------+----------------+------+ 2197 | 61 | 48 | 6 | 0 | 2198 +-------+--------+----------------+------+ 2199 | 62 | 47 | 6 | 0 | 2200 +-------+--------+----------------+------+ 2201 | 63 | 46 | 6 | 0 | 2202 +-------+--------+----------------+------+ 2204 A.3. Offset Code Table 2206 +-------+--------+----------------+------+ 2207 | State | Symbol | Number_Of_Bits | Base | 2208 +-------+--------+----------------+------+ 2209 | 0 | 0 | 0 | 0 | 2210 +-------+--------+----------------+------+ 2211 | 0 | 0 | 5 | 0 | 2212 +-------+--------+----------------+------+ 2213 | 1 | 6 | 4 | 0 | 2214 +-------+--------+----------------+------+ 2215 | 2 | 9 | 5 | 0 | 2216 +-------+--------+----------------+------+ 2217 | 3 | 15 | 5 | 0 | 2218 +-------+--------+----------------+------+ 2219 | 4 | 21 | 5 | 0 | 2220 +-------+--------+----------------+------+ 2221 | 5 | 3 | 5 | 0 | 2222 +-------+--------+----------------+------+ 2223 | 6 | 7 | 4 | 0 | 2224 +-------+--------+----------------+------+ 2225 | 7 | 12 | 5 | 0 | 2226 +-------+--------+----------------+------+ 2227 | 8 | 18 | 5 | 0 | 2228 +-------+--------+----------------+------+ 2229 | 9 | 23 | 5 | 0 | 2230 +-------+--------+----------------+------+ 2231 | 10 | 5 | 5 | 0 | 2232 +-------+--------+----------------+------+ 2233 | 11 | 8 | 4 | 0 | 2234 +-------+--------+----------------+------+ 2235 | 12 | 14 | 5 | 0 | 2236 +-------+--------+----------------+------+ 2237 | 13 | 20 | 5 | 0 | 2238 +-------+--------+----------------+------+ 2239 | 14 | 2 | 5 | 0 | 2240 +-------+--------+----------------+------+ 2241 | 15 | 7 | 4 | 16 | 2242 +-------+--------+----------------+------+ 2243 | 16 | 11 | 5 | 0 | 2244 +-------+--------+----------------+------+ 2245 | 17 | 17 | 5 | 0 | 2246 +-------+--------+----------------+------+ 2247 | 18 | 22 | 5 | 0 | 2248 +-------+--------+----------------+------+ 2249 | 19 | 4 | 5 | 0 | 2250 +-------+--------+----------------+------+ 2251 | 20 | 8 | 4 | 16 | 2252 +-------+--------+----------------+------+ 2253 | 21 | 13 | 5 | 0 | 2254 +-------+--------+----------------+------+ 2255 | 22 | 19 | 5 | 0 | 2256 +-------+--------+----------------+------+ 2257 | 23 | 1 | 5 | 0 | 2258 +-------+--------+----------------+------+ 2259 | 24 | 6 | 4 | 16 | 2260 +-------+--------+----------------+------+ 2261 | 25 | 10 | 5 | 0 | 2262 +-------+--------+----------------+------+ 2263 | 26 | 16 | 5 | 0 | 2264 +-------+--------+----------------+------+ 2265 | 27 | 28 | 5 | 0 | 2266 +-------+--------+----------------+------+ 2267 | 28 | 27 | 5 | 0 | 2268 +-------+--------+----------------+------+ 2269 | 29 | 26 | 5 | 0 | 2270 +-------+--------+----------------+------+ 2271 | 30 | 25 | 5 | 0 | 2272 +-------+--------+----------------+------+ 2273 | 31 | 24 | 5 | 0 | 2274 +-------+--------+----------------+------+ 2276 Appendix B. Changes Since RFC8478 2278 The following are the changes in this document relative to RFC 8478: 2280 o Apply erratum #5786. 2282 o Clarify forward compatibility regarding dictionaries. 2284 o Clarify application of Block_Maximum_Size. 2286 o Add structured media type suffix registration. 2288 Appendix C. Acknowledgments 2290 zstd was developed by Yann Collet. 2292 Felix Handte and Nick Terrell provided feedback that went into this 2293 revision. 2295 Authors' Addresses 2297 Yann Collet 2298 Facebook 2299 1 Hacker Way 2300 Menlo Park, CA 94025 2301 United States of America 2303 Email: cyan@fb.com 2305 Murray S. Kucherawy (editor) 2306 Facebook 2307 1 Hacker Way 2308 Menlo Park, CA 94025 2309 United States of America 2311 Email: msk@fb.com