idnits 2.17.1 draft-kucherawy-rfc8478bis-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (December 20, 2019) is 1589 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '36' on line 1109 -- Looks like a reference, but probably isn't: '53' on line 1119 -- Looks like a reference, but probably isn't: '29' on line 1136 -- Looks like a reference, but probably isn't: '5' on line 1500 -- Looks like a reference, but probably isn't: '0' on line 1518 -- Looks like a reference, but probably isn't: '1' on line 1518 Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Y. Collet 3 Internet-Draft M. Kucherawy, Ed. 4 Obsoletes: 8478 (if approved) Facebook 5 Intended status: Informational December 20, 2019 6 Expires: June 22, 2020 8 Zstandard Compression and the application/zstd Media Type 9 draft-kucherawy-rfc8478bis-03 11 Abstract 13 Zstandard, or "zstd" (pronounced "zee standard"), is a data 14 compression mechanism. This document describes the mechanism and 15 registers a media type and content encoding to be used when 16 transporting zstd-compressed content via Multipurpose Internet Mail 17 Extensions (MIME). 19 Despite use of the word "standard" as part of its name, readers are 20 advised that this document is not an Internet Standards Track 21 specification; it is being published for informational purposes only. 23 This document replaces and obsoletes RFC 8478. 25 Status of this Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at http://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on June 22, 2020. 42 Copyright Notice 44 Copyright (c) 2019 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 60 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 61 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 5 62 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . . 6 63 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . . 6 64 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . . 7 65 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . . 11 66 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 13 67 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . . 27 68 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . . 28 69 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . . 28 70 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . . 29 71 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 72 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 30 73 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . . 33 74 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . . 33 75 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 35 76 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . . 36 77 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 36 78 4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 37 79 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 38 80 6. Use of Dictionaries . . . . . . . . . . . . . . . . . . . . . 40 81 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40 82 7.1. The 'application/zstd' Media Type . . . . . . . . . . . . 40 83 7.2. Content Encoding . . . . . . . . . . . . . . . . . . . . . 41 84 7.3. Structured Syntax Suffix . . . . . . . . . . . . . . . . . 41 85 7.4. Dictionaries . . . . . . . . . . . . . . . . . . . . . . . 42 86 8. Security Considerations . . . . . . . . . . . . . . . . . . . 42 87 9. Implementation Status . . . . . . . . . . . . . . . . . . . . 43 88 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 44 89 10.1. Normative References . . . . . . . . . . . . . . . . . . . 44 90 10.2. Informative References . . . . . . . . . . . . . . . . . . 44 91 Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 44 92 A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 44 93 A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 47 94 A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 50 95 Appendix B. Changes Since RFC8478 . . . . . . . . . . . . . . . . 52 96 Appendix C. Acknowledgments . . . . . . . . . . . . . . . . . . . 52 97 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 52 99 1. Introduction 101 Zstandard, or "zstd" (pronounced "zee standard"), is a data 102 compression mechanism, akin to gzip [RFC1952]. 104 Despite use of the word "standard" as part of its name, readers are 105 advised that this document is not an Internet Standards Track 106 specification; it is being published for informational purposes only. 108 This document describes the Zstandard format. Also, to enable the 109 transport of a data object compressed with Zstandard, this document 110 registers a media type that can be used to identify such content when 111 it is used in a payload encoded using Multipurpose Internet Mail 112 Extensions (MIME). 114 2. Definitions 116 Some terms used elsewhere in this document are defined here for 117 clarity. 119 uncompressed: Describes an arbitrary set of bytes in their original 120 form, prior to being subjected to compression. 122 compress, compression: The act of processing a set of bytes via the 123 compression mechanism described here. 125 compressed: Describes the result of passing a set of bytes through 126 this mechanism. The original input has thus been compressed. 128 decompress, decompression: The act of processing a set of bytes 129 through the inverse of the compression mechanism described here, 130 in an attempt to recover the original set of bytes prior to 131 compression. 133 decompressed: Describes the result of passing a set of bytes through 134 the reverse of this mechanism. When this is successful, the 135 decompressed payload and the uncompressed payload are 136 indistinguishable. 138 encode: The process of translating data from one form to another; 139 this may include compression or it may refer to other translations 140 done as part of this specification. 142 decode: The reverse of "encode"; describes a process of reversing a 143 prior encoding to recover the original content. 145 frame: Content compressed by Zstandard is transformed into a 146 Zstandard frame. Multiple frames can be appended into a single 147 file or stream. A frame is completely independent, has a defined 148 beginning and end, and has a set of parameters that tells the 149 decoder how to decompress it. 151 block: A frame encapsulates one or multiple blocks. Each block 152 contains arbitrary content, which is described by its header, and 153 has a guaranteed maximum content size that depends upon frame 154 parameters. Unlike frames, each block depends on previous blocks 155 for proper decoding. However, each block can be decompressed 156 without waiting for its successor, allowing streaming operations. 158 natural order: A sequence or ordering of objects or values that is 159 typical of that type of object or value. A set of unique 160 integers, for example, is in "natural order" if when progressing 161 from one element in the set or sequence to the next, there is 162 never a decrease in value. 164 The naming convention for identifiers within the specification is 165 Mixed_Case_With_Underscores. Identifiers inside square brackets 166 indicate that the identifier is optional in the presented context. 168 3. Compression Algorithm 170 This section describes the Zstandard algorithm. 172 The purpose of this document is to define a lossless compressed data 173 format that is a) independent of the CPU type, operating system, file 174 system, and character set and b) is suitable for file compression and 175 pipe and streaming compression, using the Zstandard algorithm. The 176 text of the specification assumes a basic background in programming 177 at the level of bits and other primitive data representations. 179 The data can be produced or consumed, even for an arbitrarily long 180 sequentially presented input data stream, using only an a priori 181 bounded amount of intermediate storage, and hence can be used in data 182 communications. The format uses the Zstandard compression method, 183 and an optional xxHash-64 checksum method [XXHASH], for detection of 184 data corruption. 186 The data format defined by this specification does not attempt to 187 allow random access to compressed data. 189 Unless otherwise indicated below, a compliant compressor must produce 190 data sets that conform to the specifications presented here. 191 However, it does not need to support all options. 193 A compliant decompressor must be able to decompress at least one 194 working set of parameters that conforms to the specifications 195 presented here. It may also ignore informative fields, such as the 196 checksum. Whenever it does not support a parameter defined in the 197 compressed stream, it must produce a non-ambiguous error code and 198 associated error message explaining which parameter is unsupported. 200 This specification is intended for use by implementers of software to 201 compress data into Zstandard format and/or decompress data from 202 Zstandard format. The Zstandard format is supported by an open 203 source reference implementation, written in portable C, and available 204 at [ZSTD]. 206 3.1. Frames 208 Zstandard compressed data is made up of one or more frames. Each 209 frame is independent and can be decompressed independently of other 210 frames. The decompressed content of multiple concatenated frames is 211 the concatenation of each frame's decompressed content. 213 There are two frame formats defined for Zstandard: Zstandard frames 214 and skippable frames. Zstandard frames contain compressed data, 215 while skippable frames contain custom user metadata. 217 3.1.1. Zstandard Frames 219 The structure of a single Zstandard frame is as follows: 221 +--------------------+------------+ 222 | Magic_Number | 4 bytes | 223 +--------------------+------------+ 224 | Frame_Header | 2-14 bytes | 225 +--------------------+------------+ 226 | Data_Block | n bytes | 227 +--------------------+------------+ 228 | [More Data_Blocks] | | 229 +--------------------+------------+ 230 | [Content_Checksum] | 0-4 bytes | 231 +--------------------+------------+ 233 Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528. 235 Frame_Header: 2 to 14 bytes, detailed in Section 3.1.1.1. 237 Data_Block: Detailed in Section 3.1.1.2. This is where data 238 appears. 240 Content_Checksum: An optional 32-bit checksum, only present if 241 Content_Checksum_Flag is set. The content checksum is the result 242 of the XXH64() hash function [XXHASH] digesting the original 243 (decoded) data as input, and a seed of zero. The low 4 bytes of 244 the checksum are stored in little-endian format. 246 The magic number was selected to be less probable to find at the 247 beginning of an arbitrary file. It avoids trivial patterns (0x00, 248 0xFF, repeated bytes, increasing bytes, etc.), contains byte values 249 outside of ASCII range, and doesn't map into UTF-8 space, all of 250 which reduce the likelihood of its appearance at the top of a text 251 file. 253 3.1.1.1. Frame Header 255 The frame header has a variable size, with a minimum of 2 bytes and 256 up to 14 bytes depending on optional parameters. The structure of 257 Frame_Header is as follows: 259 +-------------------------+-----------+ 260 | Frame_Header_Descriptor | 1 byte | 261 +-------------------------+-----------+ 262 | [Window_Descriptor] | 0-1 byte | 263 +-------------------------+-----------+ 264 | [Dictionary_ID] | 0-4 bytes | 265 +-------------------------+-----------+ 266 | [Frame_Content_Size] | 0-8 bytes | 267 +-------------------------+-----------+ 269 3.1.1.1.1. Frame_Header_Descriptor 271 The first header's byte is called the Frame_Header_Descriptor. It 272 describes which other fields are present. Decoding this byte is 273 enough to tell the size of Frame_Header. 275 +------------+-------------------------+ 276 | Bit Number | Field Name | 277 +------------+-------------------------+ 278 | 7-6 | Frame_Content_Size_Flag | 279 +------------+-------------------------+ 280 | 5 | Single_Segment_Flag | 281 +------------+-------------------------+ 282 | 4 | (unused) | 283 +------------+-------------------------+ 284 | 3 | (reserved) | 285 +------------+-------------------------+ 286 | 2 | Content_Checksum_Flag | 287 +------------+-------------------------+ 288 | 1-0 | Dictionary_ID_Flag | 289 +------------+-------------------------+ 291 In this table, bit 7 is the highest bit, while bit 0 is the lowest 292 one. 294 3.1.1.1.1.1. Frame_Content_Size_Flag 296 This is a 2-bit flag (equivalent to Frame_Header_Descriptor right- 297 shifted 6 bits) specifying whether Frame_Content_Size (the 298 decompressed data size) is provided within the header. Flag_Value 299 provides FCS_Field_Size, which is the number of bytes used by 300 Frame_Content_Size according to the following table: 302 +----------------+--------+---+---+---+ 303 | Flag_Value | 0 | 1 | 2 | 3 | 304 +----------------+--------+---+---+---+ 305 | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | 306 +----------------+--------+---+---+---+ 308 When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag: 309 If Single_Segment_Flag is set, FCS_Field_Size is 1. Otherwise, 310 FCS_Field_Size is 0; Frame_Content_Size is not provided. 312 3.1.1.1.1.2. Single_Segment_Flag 314 If this flag is set, data must be regenerated within a single 315 continuous memory segment. 317 In this case, Window_Descriptor byte is skipped, but 318 Frame_Content_Size is necessarily present. As a consequence, the 319 decoder must allocate a memory segment of size equal or larger than 320 Frame_Content_Size. 322 In order to protect the decoder from unreasonable memory 323 requirements, a decoder is allowed to reject a compressed frame that 324 requests a memory size beyond the decoder's authorized range. 326 For broader compatibility, decoders are recommended to support memory 327 sizes of at least 8 MB. This is only a recommendation; each decoder 328 is free to support higher or lower limits, depending on local 329 limitations. 331 3.1.1.1.1.3. Unused Bit 333 A decoder compliant with this specification version shall not 334 interpret this bit. It might be used in a future version, to signal 335 a property that is not mandatory to properly decode the frame. An 336 encoder compliant with this specification must set this bit to zero. 338 3.1.1.1.1.4. Reserved Bit 340 This bit is reserved for some future feature. Its value must be 341 zero. A decoder compliant with this specification version must 342 ensure it is not set. This bit may be used in a future revision, to 343 signal a feature that must be interpreted to decode the frame 344 correctly. 346 3.1.1.1.1.5. Content_Checksum_Flag 348 If this flag is set, a 32-bit Content_Checksum will be present at the 349 frame's end. See the description of Content_Checksum above. 351 3.1.1.1.1.6. Dictionary_ID_Flag 353 This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating 354 whether a dictionary ID is provided within the header. It also 355 specifies the size of this field as DID_Field_Size: 357 +----------------+---+---+---+---+ 358 | Flag_Value | 0 | 1 | 2 | 3 | 359 +----------------+---+---+---+---+ 360 | DID_Field_Size | 0 | 1 | 2 | 4 | 361 +----------------+---+---+---+---+ 363 3.1.1.1.2. Window Descriptor 365 This provides guarantees about the minimum memory buffer required to 366 decompress a frame. This information is important for decoders to 367 allocate enough memory. 369 The Window_Descriptor byte is optional. When Single_Segment_Flag is 370 set, Window_Descriptor is not present. In this case, Window_Size is 371 Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16 372 ExaBytes). 374 +------------+----------+----------+ 375 | Bit Number | 7-3 | 2-0 | 376 +------------+----------+----------+ 377 | Field Name | Exponent | Mantissa | 378 +------------+----------+----------+ 380 The minimum memory buffer size is called Window_Size. It is 381 described by the following formulae: 383 windowLog = 10 + Exponent; 384 windowBase = 1 << windowLog; 385 windowAdd = (windowBase / 8) * Mantissa; 386 Window_Size = windowBase + windowAdd; 388 The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41) 389 + 7*(1<<38) bytes, which is 3.75 TB. 391 In general, larger Window_Size values tend to improve the compression 392 ratio, but at the cost of increased memory usage. 394 To properly decode compressed data, a decoder will need to allocate a 395 buffer of at least Window_Size bytes. 397 In order to protect decoders from unreasonable memory requirements, a 398 decoder is allowed to reject a compressed frame that requests a 399 memory size beyond decoder's authorized range. 401 For improved interoperability, it's recommended for decoders to 402 support values of Window_Size up to 8 MB and for encoders not to 403 generate frames requiring a Window_Size larger than 8 MB. It's 404 merely a recommendation though, and decoders are free to support 405 larger or lower limits, depending on local limitations. 407 3.1.1.1.3. Dictionary_ID 409 This is a variable size field, which contains the ID of the 410 dictionary required to properly decode the frame. This field is 411 optional. When it's not present, it's up to the decoder to know 412 which dictionary to use. 414 Dictionary_ID field size is provided by DID_Field_Size. 415 DID_Field_Size is directly derived from the value of 416 Dictionary_ID_Flag. One byte can represent an ID 0-255; 2 bytes can 417 represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295. 418 Format is little-endian. 420 It is permitted to represent a small ID (for example, 13) with a 421 large 4-byte dictionary ID, even if it is less efficient. 423 Within private environments, any dictionary ID can be used. However, 424 for frames and dictionaries distributed in public space, 425 Dictionary_ID must be attributed carefully. The following ranges are 426 reserved for use only with dictionaries that have been registered 427 with IANA (see Section 7.4): 428 low range: <= 32767 429 high range: >= (1 << 31) 431 Any other value for Dictionary_ID can be used by private arrangement 432 between participants. 434 Any payload presented for decompression that references an 435 unregistered reserved dictionary ID results in an error. 437 3.1.1.1.4. Frame Content Size 439 This is the original (uncompressed) size. This information is 440 optional. Frame_Content_Size uses a variable number of bytes, 441 provided by FCS_Field_Size. FCS_Field_Size is provided by the value 442 of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not 443 present), 1, 2, 4, or 8 bytes. 445 +----------------+--------------+ 446 | FCS Field Size | Range | 447 +----------------+--------------+ 448 | 0 | unknown | 449 +----------------+--------------+ 450 | 1 | 0 - 255 | 451 +----------------+--------------+ 452 | 2 | 256 - 65791 | 453 +----------------+--------------+ 454 | 4 | 0 - 2^32 - 1 | 455 +----------------+--------------+ 456 | 8 | 0 - 2^64 - 1 | 457 +----------------+--------------+ 459 Frame_Content_Size format is little-endian. When FCS_Field_Size is 460 1, 4, or 8 bytes, the value is read directly. When FCS_Field_Size is 461 2, the offset of 256 is added. It's allowed to represent a small 462 size (for example 18) using any compatible variant. 464 3.1.1.2. Blocks 466 After Magic_Number and Frame_Header, there are some number of blocks. 467 Each frame must have at least 1 block, but there is no upper limit on 468 the number of blocks per frame. 470 The structure of a block is as follows: 472 +--------------+---------------+ 473 | Block_Header | Block_Content | 474 +--------------+---------------+ 475 | 3 bytes | n bytes | 476 +--------------+---------------+ 478 Block_Header uses 3 bytes, written using little-endian convention. 479 It contains three fields: 481 +------------+------------+------------+ 482 | Last_Block | Block_Type | Block_Size | 483 +------------+------------+------------+ 484 | bit 0 | bits 1-2 | bits 3-23 | 485 +------------+------------+------------+ 487 3.1.1.2.1. Last_Block 489 The lowest bit (Last_Block) signals whether this block is the last 490 one. The frame will end after this last block. It may be followed 491 by an optional Content_Checksum (see Section 3.1.1). 493 3.1.1.2.2. Block_Type 495 The next 2 bits represent the Block_Type. There are four block 496 types: 498 +-----------+------------------+ 499 | Value | Block_Type | 500 +-----------+------------------+ 501 | 0 | Raw_Block | 502 +-----------+------------------+ 503 | 1 | RLE_Block | 504 +-----------+------------------+ 505 | 2 | Compressed_Block | 506 +-----------+------------------+ 507 | 3 | Reserved | 508 +-----------+------------------+ 510 Raw_Block: This is an uncompressed block. Block_Content contains 511 Block_Size bytes. 513 RLE_Block: This is a single byte, repeated Block_Size times. 514 Block_Content consists of a single byte. On the decompression 515 side, this byte must be repeated Block_Size times. 517 Compressed_Block: This is a compressed block as described in 518 Section 3.1.1.3. Block_Size is the length of Block_Content, 519 namely the compressed data. The decompressed size is not known, 520 but its maximum possible value is guaranteed (see below). 522 Reserved: This is not a block. This value cannot be used with the 523 current specification. If such a value is present, it is 524 considered to be corrupt data. 526 3.1.1.2.3. Block_Size 528 The upper 21 bits of Block_Header represent the Block_Size. 530 When Block_Type is Compressed_Block or Raw_Block, Block_Size is the 531 size of Block_Content (hence excluding Block_Header). 533 When Block_Type is RLE_Block, since Block_Content's size is always 1, 534 Block_Size represents the number of times this byte must be repeated. 536 Block_Size is limited by Block_Maximum_Size (see below). 538 3.1.1.2.4. Block_Content and Block_Maximum_Size 540 The size of Block_Content is limited by Block_Maximum_Size, which is 541 the smallest of: 543 o Window_Size 545 o 128 KB 547 Block_Maximum_Size is constant for a given frame. This maximum is 548 applicable to both the decompressed size and the compressed size of 549 any block in the frame. 551 The reasoning for this limit is that a decoder can read this 552 information at the beginning of a frame and use it to allocate 553 buffers. The guarantees on the size of blocks ensure that the 554 buffers will be large enough for any following block of the valid 555 frame. 557 3.1.1.3. Compressed Blocks 559 To decompress a compressed block, the compressed size must be 560 provided from the Block_Size field within Block_Header. 562 A compressed block consists of two sections: a Literals Section 563 (Section 3.1.1.3.1) and a Sequences_Section (Section 3.1.1.3.2). The 564 results of the two sections are then combined to produce the 565 decompressed data in Sequence Execution (Section 3.1.1.4). 567 To decode a compressed block, the following elements are necessary: 569 o Previous decoded data, up to a distance of Window_Size, or the 570 beginning of the Frame, whichever is smaller. Single_Segment_Flag 571 will be set in the latter case. 573 o List of "recent offsets" from the previous Compressed_Block. 575 o The previous Huffman tree, required by Treeless_Literals_Block 576 type. 578 o Previous Finite State Entropy (FSE) decoding tables, required by 579 Repeat_Mode, for each symbol type (literals lengths, match 580 lengths, offsets). 582 Note that decoding tables are not always from the previous 583 Compressed_Block: 585 o Every decoding table can come from a dictionary. 587 o The Huffman tree comes from the previous 588 Compressed_Literals_Block. 590 3.1.1.3.1. Literals_Section_Header 592 All literals are regrouped in the first part of the block. They can 593 be decoded first and then copied during Sequence Execution (see 594 Section 3.1.1.4), or they can be decoded on the flow during Sequence 595 Execution. 597 Literals can be stored uncompressed or compressed using Huffman 598 prefix codes. When compressed, an optional tree description can be 599 present, followed by 1 or 4 streams. 601 +----------------------------+ 602 | Literals_Section_Header | 603 +----------------------------+ 604 | [Huffman_Tree_Description] | 605 +----------------------------+ 606 | [Jump_Table] | 607 +----------------------------+ 608 | Stream_1 | 609 +----------------------------+ 610 | [Stream_2] | 611 +----------------------------+ 612 | [Stream_3] | 613 +----------------------------+ 614 | [Stream_4] | 615 +----------------------------+ 617 3.1.1.3.1.1. Literals_Section_Header 619 This field describes how literals are packed. It's a byte-aligned 620 variable-size bit field, ranging from 1 to 5 bytes, using little- 621 endian convention. 623 +---------------------+-----------+ 624 | Literals_Block_Type | 2 bits | 625 +---------------------+-----------+ 626 | Size_Format | 1-2 bits | 627 +---------------------+-----------+ 628 | Regenerated_Size | 5-20 bits | 629 +---------------------+-----------+ 630 | [Compressed_Size] | 0-18 bits | 631 +---------------------+-----------+ 633 In this representation, bits at the top are the lowest bits. 635 The Literals_Block_Type field uses the two lowest bits of the first 636 byte, describing four different block types: 638 +---------------------------+-------+ 639 | Literals_Block_Type | Value | 640 +---------------------------+-------+ 641 | Raw_Literals_Block | 0 | 642 +---------------------------+-------+ 643 | RLE_Literals_Block | 1 | 644 +---------------------------+-------+ 645 | Compressed_Literals_Block | 2 | 646 +---------------------------+-------+ 647 | Treeless_Literals_Block | 3 | 648 +---------------------------+-------+ 650 Raw_Literals_Block: Literals are stored uncompressed. 651 Literals_Section_Content is Regenerated_Size. 653 RLE_Literals_Block: Literals consist of a single-byte value repeated 654 Regenerated_Size times. Literals_Section_Content is 1. 656 Compressed_Literals_Block: This is a standard Huffman-compressed 657 block, starting with a Huffman tree description. See details 658 below. Literals_Section_Content is Compressed_Size. 660 Treeless_Literals_Block: This is a Huffman-compressed block, using 661 the Huffman tree from the previous Compressed_Literals_Block, or a 662 dictionary if there is no previous Huffman-compressed literals 663 block. Huffman_Tree_Description will be skipped. Note that if 664 this mode is triggered without any previous Huffman-table in the 665 frame (or dictionary, per Section 5), it should be treated as data 666 corruption. Literals_Section_Content is Compressed_Size. 668 The Size_Format is divided into two families: 670 o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary 671 to decode Regenerated_Size. There is no Compressed_Size field. 673 o For Compressed_Block and Treeless_Literals_Block, it's required to 674 decode both Compressed_Size and Regenerated_Size (the decompressed 675 size). It's also necessary to decode the number of streams (1 or 676 4). 678 For values spanning several bytes, the convention is little endian. 680 Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2 681 bits. Its value is (Literals_Section_Header[0]>>2) & 0x3. 683 Size_Format == 00 or 10: Size_Format uses 1 bit. Regenerated_Size 684 uses 5 bits (value 0-31). Literals_Section_Header uses 1 byte. 685 Regenerated_Size = Literal_Section_Header[0]>>3. 687 Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses 688 12 bits (values 0-4095). Literals_Section_Header uses 2 bytes. 689 Regenerated_Size = (Literals_Section_Header[0]>>4) + 690 (Literals_Section_Header[1]<<4). 692 Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses 693 20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes. 694 Regenerated_Size = (Literals_Section_Header[0]>>4) + 695 (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12) 697 Only Stream_1 is present for these cases. Note that it is permitted 698 to represent a short value (for example, 13) using a long format, 699 even if it's less efficient. 701 Size_Format for Compressed_Literals_Block and Treeless_Literals_Block 702 always uses 2 bits. 704 Size_Format == 00: A single stream. Both Regenerated_Size and 705 Compressed_Size use 10 bits (values 0-1023). 706 Literals_Section_Header uses 3 bytes. 708 Size_Format == 01: 4 streams. Both Regenerated_Size and 709 Compressed_Size use 10 bits (values 0-1023). 710 Literals_Section_Header uses 3 bytes. 712 Size_Format == 10: 4 streams. Both Regenerated_Size and 713 Compressed_Size use 14 bits (values 0-16383). 714 Literals_Section_Header uses 4 bytes. 716 Size_Format == 11: 4 streams. Both Regenerated_Size and 717 Compressed_Size use 18 bits (values 0-262143). 718 Literals_Section_Header uses 5 bytes. 720 Both the Compressed_Size and Regenerated_Size fields follow little- 721 endian convention. Note that Compressed_Size includes the size of 722 the Huffman_Tree_Description when it is present. 724 3.1.1.3.1.2. Raw_Literals_Block 726 The data in Stream_1 is Regenerated_Size bytes long. It contains the 727 raw literals data to be used during Sequence Execution 728 (Section 3.1.1.3.2). 730 3.1.1.3.1.3. RLE_Literals_Block 732 Stream_1 consists of a single byte that should be repeated 733 Regenerated_Size times to generate the decoded literals. 735 3.1.1.3.1.4. Compressed_Literals_Block and Treeless_Literals_Block 737 Both of these modes contain Huffman-encoded data. For 738 Treeless_Literals_Block, the Huffman table comes from the previously 739 compressed literals block, or from a dictionary; see Section 5. 741 3.1.1.3.1.5. Huffman_Tree_Description 743 This section is only present when the Literals_Block_Type type is 744 Compressed_Literals_Block (2). The format of 745 Huffman_Tree_Description can be found in Section 4.2.1. The size of 746 Huffman_Tree_Description is determined during the decoding process. 747 It must be used to determine where streams begin. 749 Total_Streams_Size = Compressed_Size 750 - Huffman_Tree_Description_Size 752 3.1.1.3.1.6. Jump_Table 754 The Jump_Table is only present when there are 4 Huffman-coded 755 streams. 757 (Reminder: Huffman-compressed data consists of either 1 or 4 Huffman- 758 coded streams.) 760 If only 1 stream is present, it is a single bitstream occupying the 761 entire remaining portion of the literals block, encoded as described 762 within Section 4.2.2. 764 If there are 4 streams, Literals_Section_Header only provides enough 765 information to know the decompressed and compressed sizes of all 4 766 streams combined. The decompressed size of each stream is equal to 767 (Regenerated_Size+3)/4, except for the last stream, which may be up 768 to 3 bytes smaller, to reach a total decompressed size as specified 769 in Regenerated_Size. 771 The compressed size of each stream is provided explicitly in the 772 Jump_Table. The Jump_Table is 6 bytes long and consists of three 773 2-byte little-endian fields, describing the compressed sizes of the 774 first 3 streams. Stream4_Size is computed from Total_Streams_Size 775 minus sizes of other streams. 777 Stream4_Size = Total_Streams_Size - 6 778 - Stream1_Size - Stream2_Size 779 - Stream3_Size 781 Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds 782 Total_Streams_Size, the data are considered corrupted. 784 Each of these 4 bitstreams is then decoded independently as a 785 Huffman-Coded stream, as described in Section 4.2.2. 787 3.1.1.3.2. Sequences_Section 789 A compressed block is a succession of sequences. A sequence is a 790 literal copy command, followed by a match copy command. A literal 791 copy command specifies a length. It is the number of bytes to be 792 copied (or extracted) from the Literals Section. A match copy 793 command specifies an offset and a length. 795 When all sequences are decoded, if there are literals left in the 796 literals section, these bytes are added at the end of the block. 798 This is described in more detail in Section 3.1.1.4. 800 The Sequences_Section regroups all symbols required to decode 801 commands. There are three symbol types: literals lengths, offsets, 802 and match lengths. They are encoded together, interleaved, in a 803 single "bitstream". 805 The Sequences_Section starts by a header, followed by optional 806 probability tables for each symbol type, followed by the bitstream. 808 Sequences_Section_Header 809 [Literals_Length_Table] 810 [Offset_Table] 811 [Match_Length_Table] 812 bitStream 814 To decode the Sequences_Section, it's necessary to know its size. 815 This size is deduced from the size of the Literals_Section: 816 Sequences_Section_Size = Block_Size - Literals_Section_Header - 817 Literals_Section_Content 819 3.1.1.3.2.1. Sequences_Section_Header 821 This header consists of two items: 823 o Number_of_Sequences 825 o Symbol_Compression_Modes 827 Number_of_Sequences is a variable size field using between 1 and 3 828 bytes. If the first byte is "byte0": 830 o if (byte0 == 0): there are no sequences. The sequence section 831 stops here. Decompressed content is defined entirely as Literals 832 Section content. The FSE tables used in Repeat_Mode are not 833 updated. 835 o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte. 837 o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) + 838 byte1. Uses 2 bytes. 840 o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) + 841 0x7F00. Uses 3 bytes. 843 Symbol_Compression_Modes is a single byte, defining the compression 844 mode of each symbol type. 846 +-------------+----------------------+ 847 | Bit Number | Field Name | 848 +-------------+----------------------+ 849 | 7-6 | Literal_Lengths_Mode | 850 +-------------+----------------------+ 851 | 5-4 | Offsets_Mode | 852 +-------------+----------------------+ 853 | 3-2 | Match_Lengths_Mode | 854 +-------------+----------------------+ 855 | 1-0 | Reserved | 856 +-------------+----------------------+ 858 The last field, Reserved, must be all zeroes. 860 Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define 861 the Compression_Mode of literals lengths, offsets, and match lengths 862 symbols, respectively. They follow the same enumeration: 864 +-------+---------------------+ 865 | Value | Compression_Mode | 866 +-------+---------------------+ 867 | 0 | Predefined_Mode | 868 +-------+---------------------+ 869 | 1 | RLE_Mode | 870 +-------+---------------------+ 871 | 2 | FSE_Compressed_Mode | 872 +-------+---------------------+ 873 | 3 | Repeat_Mode | 874 +-------+---------------------+ 876 Predefined_Mode: A predefined FSE (see Section 4.1) distribution 877 table is used, as defined in Section 3.1.1.3.2.2. No distribution 878 table will be present. 880 RLE_Mode: The table description consists of a single byte, which 881 contains the symbol's value. This symbol will be used for all 882 sequences. 884 FSE_Compressed_Mode: Standard FSE compression. A distribution table 885 will be present. The format of this distribution table is 886 described in Section 4.1.1. Note that the maximum allowed 887 accuracy log for literals length and match length tables is 9, and 888 the maximum accuracy log for the offsets table is 8. This mode 889 must not be used when only one symbol is present; RLE_Mode should 890 be used instead (although any other mode will work). 892 Repeat_Mode: The table used in the previous Compressed_Block with 893 Number_Of_Sequences > 0 will be used again, or if this is the 894 first block, the table in the dictionary will be used. Note that 895 this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the 896 same symbol will be repeated. It also includes Predefined_Mode, 897 in which case Repeat_Mode will have the same outcome as 898 Predefined_Mode. No distribution table will be present. If this 899 mode is used without any previous sequence table in the frame (or 900 dictionary; see Section 5) to repeat, this should be treated as 901 corruption. 903 3.1.1.3.2.1.1. Sequence Codes for Lengths and Offsets 905 Each symbol is a code in its own context, which specifies Baseline 906 and Number_of_Bits to add. Codes are FSE compressed and interleaved 907 with raw additional bits in the same bitstream. 909 Literals length codes are values ranging from 0 to 35 inclusive. 910 They define lengths from 0 to 131071 bytes. The literals length is 911 equal to the decoded Baseline plus the result of reading 912 Number_of_Bits bits from the bitstream, as a little-endian value. 914 +----------------------+----------+----------------+ 915 | Literals_Length_Code | Baseline | Number_of_Bits | 916 +----------------------+----------+----------------+ 917 | 0-15 | length | 0 | 918 +----------------------+----------+----------------+ 919 | 16 | 16 | 1 | 920 +----------------------+----------+----------------+ 921 | 17 | 18 | 1 | 922 +----------------------+----------+----------------+ 923 | 18 | 20 | 1 | 924 +----------------------+----------+----------------+ 925 | 19 | 22 | 1 | 926 +----------------------+----------+----------------+ 927 | 20 | 24 | 2 | 928 +----------------------+----------+----------------+ 929 | 21 | 28 | 2 | 930 +----------------------+----------+----------------+ 931 | 22 | 32 | 3 | 932 +----------------------+----------+----------------+ 933 | 23 | 40 | 3 | 934 +----------------------+----------+----------------+ 935 | 24 | 48 | 4 | 936 +----------------------+----------+----------------+ 937 | 25 | 64 | 6 | 938 +----------------------+----------+----------------+ 939 | 26 | 128 | 7 | 940 +----------------------+----------+----------------+ 941 | 27 | 256 | 8 | 942 +----------------------+----------+----------------+ 943 | 28 | 512 | 9 | 944 +----------------------+----------+----------------+ 945 | 29 | 1024 | 10 | 946 +----------------------+----------+----------------+ 947 | 30 | 2048 | 11 | 948 +----------------------+----------+----------------+ 949 | 31 | 4096 | 12 | 950 +----------------------+----------+----------------+ 951 | 32 | 8192 | 13 | 952 +----------------------+----------+----------------+ 953 | 33 | 16384 | 14 | 954 +----------------------+----------+----------------+ 955 | 34 | 32768 | 15 | 956 +----------------------+----------+----------------+ 957 | 35 | 65536 | 16 | 958 +----------------------+----------+----------------+ 960 Match length codes are values ranging from 0 to 52 inclusive. They 961 define lengths from 3 to 131074 bytes. The match length is equal to 962 the decoded Baseline plus the result of reading Number_of_Bits bits 963 from the bitstream, as a little-endian value. 965 +-------------------+-----------------------+----------------+ 966 | Match_Length_Code | Baseline | Number_of_Bits | 967 +-------------------+-----------------------+----------------+ 968 | 0-31 | Match_Length_Code + 3 | 0 | 969 +-------------------+-----------------------+----------------+ 970 | 32 | 35 | 1 | 971 +-------------------+-----------------------+----------------+ 972 | 33 | 37 | 1 | 973 +-------------------+-----------------------+----------------+ 974 | 34 | 39 | 1 | 975 +-------------------+-----------------------+----------------+ 976 | 35 | 41 | 1 | 977 +-------------------+-----------------------+----------------+ 978 | 36 | 43 | 2 | 979 +-------------------+-----------------------+----------------+ 980 | 37 | 47 | 2 | 981 +-------------------+-----------------------+----------------+ 982 | 38 | 51 | 3 | 983 +-------------------+-----------------------+----------------+ 984 | 39 | 59 | 3 | 985 +-------------------+-----------------------+----------------+ 986 | 40 | 67 | 4 | 987 +-------------------+-----------------------+----------------+ 988 | 41 | 83 | 4 | 989 +-------------------+-----------------------+----------------+ 990 | 42 | 99 | 5 | 991 +-------------------+-----------------------+----------------+ 992 | 43 | 131 | 7 | 993 +-------------------+-----------------------+----------------+ 994 | 44 | 259 | 8 | 995 +-------------------+-----------------------+----------------+ 996 | 45 | 515 | 9 | 997 +-------------------+-----------------------+----------------+ 998 | 46 | 1027 | 10 | 999 +-------------------+-----------------------+----------------+ 1000 | 47 | 2051 | 11 | 1001 +-------------------+-----------------------+----------------+ 1002 | 48 | 4099 | 12 | 1003 +-------------------+-----------------------+----------------+ 1004 | 49 | 8195 | 13 | 1005 +-------------------+-----------------------+----------------+ 1006 | 50 | 16387 | 14 | 1007 +-------------------+-----------------------+----------------+ 1008 | 51 | 32771 | 15 | 1009 +-------------------+-----------------------+----------------+ 1010 | 52 | 65539 | 16 | 1011 +-------------------+-----------------------+----------------+ 1013 Offset codes are values ranging from 0 to N. 1015 A decoder is free to limit its maximum supported value for N. Support 1016 for values of at least 22 is recommended. At the time of this 1017 writing, the reference decoder supports a maximum N value of 31. 1019 An offset code is also the number of additional bits to read in 1020 little-endian fashion and can be translated into an Offset_Value 1021 using the following formulas: 1023 Offset_Value = (1 << offsetCode) + readNBits(offsetCode); 1024 if (Offset_Value > 3) Offset = Offset_Value - 3; 1026 This means that maximum Offset_Value is (2^(N+1))-1, supporting back- 1027 reference distance up to (2^(N+1))-4, but it is limited by the 1028 maximum back-reference distance (see Section 3.1.1.1.2). 1030 Offset_Value from 1 to 3 are special: they define "repeat codes". 1031 This is described in more detail in Section 3.1.1.5. 1033 3.1.1.3.2.1.2. Decoding Sequences 1035 FSE bitstreams are read in reverse of the direction they are written. 1036 In zstd, the compressor writes bits forward into a block, and the 1037 decompressor must read the bitstream backwards. 1039 To find the start of the bitstream, it is therefore necessary to know 1040 the offset of the last byte of the block, which can be found by 1041 counting Block_Size bytes after the block header. 1043 After writing the last bit containing information, the compressor 1044 writes a single 1 bit and then fills the byte with 0-7 zero bits of 1045 padding. The last byte of the compressed bitstream cannot be zero 1046 for that reason. 1048 When decompressing, the last byte containing the padding is the first 1049 byte to read. The decompressor needs to skip 0-7 initial zero bits 1050 until the first 1 bit occurs. Afterwards, the useful part of the 1051 bitstream begins. 1053 FSE decoding requires a 'state' to be carried from symbol to symbol. 1054 For more explanation on FSE decoding, see Section 4.1. 1056 For sequence decoding, a separate state keeps track of each literal 1057 lengths, offsets, and match lengths symbols. Some FSE primitives are 1058 also used. For more details on the operation of these primitives, 1059 see Section 4.1. 1061 The bitstream starts with initial FSE state values, each using the 1062 required number of bits in their respective accuracy, decoded 1063 previously from their normalized distribution. It starts with 1064 Literals_Length_State, followed by Offset_State, and finally 1065 Match_Length_State. 1067 Note that all values are read backward, so the 'start' of the 1068 bitstream is at the highest position in memory, immediately before 1069 the last 1 bit for padding. 1071 After decoding the starting states, a single sequence is decoded 1072 Number_Of_Sequences times. These sequences are decoded in order from 1073 first to last. Since the compressor writes the bitstream in the 1074 forward direction, this means the compressor must encode the 1075 sequences starting with the last one and ending with the first. 1077 For each of the symbol types, the FSE state can be used to determine 1078 the appropriate code. The code then defines the Baseline and 1079 Number_of_Bits to read for each type. The description of the codes 1080 for how to determine these values can be found in 1081 Section 3.1.1.3.2.1. 1083 Decoding starts by reading the Number_of_Bits required to decode 1084 offset. It does the same for Match_Length and then for 1085 Literals_Length. This sequence is then used for Sequence Execution 1086 (see Section 3.1.1.4). 1088 If it is not the last sequence in the block, the next operation is to 1089 update states. Using the rules pre-calculated in the decoding 1090 tables, Literals_Length_State is updated, followed by 1091 Match_Length_State, and then Offset_State. See Section 4.1 for 1092 details on how to update states from the bitstream. 1094 This operation will be repeated Number_of_Sequences times. At the 1095 end, the bitstream shall be entirely consumed; otherwise, the 1096 bitstream is considered corrupted. 1098 3.1.1.3.2.2. Default Distributions 1100 If Predefined_Mode is selected for a symbol type, its FSE decoding 1101 table is generated from a predefined distribution table defined here. 1102 For details on how to convert this distribution into a decoding 1103 table, see Section 4.1. 1105 3.1.1.3.2.2.1. Literals Length 1107 The decoding table uses an accuracy log of 6 bits (64 states). 1109 short literalsLength_defaultDistribution[36] = 1110 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1111 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1112 -1,-1,-1,-1 1113 }; 1115 3.1.1.3.2.2.2. Match Length 1117 The decoding table uses an accuracy log of 6 bits (64 states). 1119 short matchLengths_defaultDistribution[53] = 1120 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1121 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1122 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1123 -1,-1,-1,-1,-1 1124 }; 1126 3.1.1.3.2.2.3. Offset Codes 1128 The decoding table uses an accuracy log of 5 bits (32 states), and 1129 supports a maximum N value of 28, allowing offset values up to 1130 536,870,908. 1132 If any sequence in the compressed block requires a larger offset than 1133 this, it's not possible to use the default distribution to represent 1134 it. 1136 short offsetCodes_defaultDistribution[29] = 1137 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1138 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 1139 }; 1141 3.1.1.4. Sequence Execution 1143 Once literals and sequences have been decoded, they are combined to 1144 produce the decoded content of a block. 1146 Each sequence consists of a tuple of (literals_length, offset_value, 1147 match_length), decoded as described in the Sequences_Section 1148 (Section 3.1.1.3.2). To execute a sequence, first copy 1149 literals_length bytes from the decoded literals to the output. 1151 Then, match_length bytes are copied from previous decoded data. The 1152 offset to copy from is determined by offset_value: 1154 o if Offset_Value > 3, then the offset is Offset_Value - 3; 1155 o if Offset_Value is from 1-3, the offset is a special repeat offset 1156 value. See Section 3.1.1.5 for how the offset is determined in 1157 this case. 1159 The offset is defined as from the current position (after copying the 1160 literals), so an offset of 6 and a match length of 3 means that 3 1161 bytes should be copied from 6 bytes back. Note that all offsets 1162 leading to previously decoded data must be smaller than Window_Size 1163 defined in Frame_Header_Descriptor (Section 3.1.1.1.1). 1165 3.1.1.5. Repeat Offsets 1167 As seen above, the first three values define a repeated offset; we 1168 will call them Repeated_Offset1, Repeated_Offset2, and 1169 Repeated_Offset3. They are sorted in recency order, with 1170 Repeated_Offset1 meaning "most recent one". 1172 If offset_value is 1, then the offset used is Repeated_Offset1, etc. 1174 There is one exception: When the current sequence's literals_length 1175 is 0, repeated offsets are shifted by 1, so an offset_value of 1 1176 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3, 1177 and an offset_value of 3 means Repeated_Offset1 - 1_byte. 1179 For the first block, the starting offset history is populated with 1180 the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and 1181 Repeated_Offset3 (8), unless a dictionary is used, in which case they 1182 come from the dictionary. 1184 Then each block gets its starting offset history from the ending 1185 values of the most recent Compressed_Block. Note that blocks that 1186 are not Compressed_Block are skipped; they do not contribute to 1187 offset history. 1189 The newest offset takes the lead in offset history, shifting others 1190 back (up to its previous place if it was already present). This 1191 means that when Repeated_Offset1 (most recent) is used, history is 1192 unmodified. When Repeated_Offset2 is used, it is swapped with 1193 Repeated_Offset1. If any other offset is used, it becomes 1194 Repeated_Offset1, and the rest are shifted back by 1. 1196 3.1.2. Skippable Frames 1198 +--------------+------------+-----------+ 1199 | Magic_Number | Frame_Size | User_Data | 1200 +--------------+------------+-----------+ 1201 | 4 bytes | 4 bytes | n bytes | 1202 +--------------+------------+-----------+ 1204 Skippable frames allow the insertion of user-defined metadata into a 1205 flow of concatenated frames. 1207 Skippable frames defined in this specification are compatible with 1208 skippable frames in [LZ4]. 1210 From a compliant decoder perspective, skippable frames simply need to 1211 be skipped, and their content ignored, resuming decoding after the 1212 skippable frame. 1214 It should be noted that a skippable frame can be used to watermark a 1215 stream of concatenated frames embedding any kind of tracking 1216 information (even just a Universally Unique Identifier (UUID)). 1217 Users wary of such possibility should scan the stream of concatenated 1218 frames in an attempt to detect such frames for analysis or removal. 1220 The fields are: 1222 Magic_Number: 4 bytes, little-endian format. Value: 0x184D2A5?, 1223 which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values 1224 are valid to identify a skippable frame. This specification does 1225 not detail any specific tagging methods for skippable frames. 1227 Frame_Size: This is the size, in bytes, of the following User_Data 1228 (without including the magic number nor the size field itself). 1229 This field is represented using 4 bytes, little-endian format, 1230 unsigned 32 bits. This means User_Data can't be bigger than 1231 (2^32-1) bytes. 1233 User_Data: This field can be anything. Data will just be skipped by 1234 the decoder. 1236 4. Entropy Encoding 1238 Two types of entropy encoding are used by the Zstandard format: FSE 1239 and Huffman coding. Huffman is used to compress literals, while FSE 1240 is used for all other symbols (Literals_Length_Code, 1241 Match_Length_Code, and offset codes) and to compress Huffman headers. 1243 4.1. FSE 1245 FSE, short for Finite State Entropy, is an entropy codec based on 1246 [ANS]. FSE encoding/decoding involves a state that is carried over 1247 between symbols, so decoding must be done in the opposite direction 1248 as encoding. Therefore, all FSE bitstreams are read from end to 1249 beginning. Note that the order of the bits in the stream is not 1250 reversed; they are simply read in the reverse order from which they 1251 were written. 1253 For additional details on FSE, see Finite State Entropy [FSE]. 1255 FSE decoding involves a decoding table that has a power of 2 size and 1256 contains three elements: Symbol, Num_Bits, and Baseline. The base 2 1257 logarithm of the table size is its Accuracy_Log. An FSE state value 1258 represents an index in this table. 1260 To obtain the initial state value, consume Accuracy_Log bits from the 1261 stream as a little-endian value. The next symbol in the stream is 1262 the Symbol indicated in the table for that state. To obtain the next 1263 state value, the decoder should consume Num_Bits bits from the stream 1264 as a little-endian value and add it to Baseline. 1266 4.1.1. FSE Table Description 1268 To decode FSE streams, it is necessary to construct the decoding 1269 table. The Zstandard format encodes FSE table descriptions as 1270 described here. 1272 An FSE distribution table describes the probabilities of all symbols 1273 from 0 to the last present one (included) on a normalized scale of 1274 (1 << Accuracy_Log). Note that there must be two or more symbols 1275 with non-zero probability. 1277 A bitstream is read forward, in little-endian fashion. It is not 1278 necessary to know its exact size, since the size will be discovered 1279 and reported by the decoding process. The bitstream starts by 1280 reporting on which scale it operates. If low4bits designates the 1281 lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5. 1283 This is followed by each symbol value, from 0 to the last present 1284 one. The number of bits used by each field is variable and depends 1285 on: 1287 Remaining probabilities + 1: For example, presuming an Accuracy_Log 1288 of 8, and presuming 100 probabilities points have already been 1289 distributed, the decoder may read any value from 0 to (256 - 100 + 1290 1) == 157, inclusive. Therefore, it must read log2sup(157) == 8 1291 bits. 1293 Value decoded: Small values use 1 fewer bit. For example, presuming 1294 values from 0 to 157 (inclusive) are possible, 255 - 157 = 98 1295 values are remaining in an 8-bit field. The first 98 values 1296 (hence from 0 to 97) use only 7 bits, and values from 98 to 157 1297 use 8 bits. This is achieved through this scheme: 1299 +------------+---------------+-----------+ 1300 | Value Read | Value Decoded | Bits Used | 1301 +------------+---------------+-----------+ 1302 | 0 - 97 | 0 - 97 | 7 | 1303 +------------+---------------+-----------+ 1304 | 98 - 127 | 98 - 127 | 8 | 1305 +------------+---------------+-----------+ 1306 | 128 - 225 | 0 - 97 | 7 | 1307 +------------+---------------+-----------+ 1308 | 226 - 255 | 128 - 157 | 8 | 1309 +------------+---------------+-----------+ 1311 Symbol probabilities are read one by one, in order. The probability 1312 is obtained from Value decoded using the formula P = Value - 1. This 1313 means the value 0 becomes the negative probability -1. This is a 1314 special probability that means "less than 1". Its effect on the 1315 distribution table is described below. For the purpose of 1316 calculating total allocated probability points, it counts as 1. 1318 When a symbol has a probability of zero, it is followed by a 2-bit 1319 repeat flag. This repeat flag tells how many probabilities of zeroes 1320 follow the current one. It provides a number ranging from 0 to 3. 1321 If it is a 3, another 2-bit repeat flag follows, and so on. 1323 When the last symbol reaches a cumulated total of 1324 (1 << Accuracy_Log), decoding is complete. If the last symbol makes 1325 the cumulated total go above (1 << Accuracy_Log), distribution is 1326 considered corrupted. 1328 Finally, the decoder can tell how many bytes were used in this 1329 process and how many symbols are present. The bitstream consumes a 1330 round number of bytes. Any remaining bit within the last byte is 1331 simply unused. 1333 The distribution of normalized probabilities is enough to create a 1334 unique decoding table. The table has a size of (1 << Accuracy_Log). 1335 Each cell describes the symbol decoded and instructions to get the 1336 next state. 1338 Symbols are scanned in their natural order for "less than 1" 1339 probabilities as described above. Symbols with this probability are 1340 being attributed a single cell, starting from the end of the table 1341 and retreating. These symbols define a full state reset, reading 1342 Accuracy_Log bits. 1344 All remaining symbols are allocated in their natural order. Starting 1345 from symbol 0 and table position 0, each symbol gets allocated as 1346 many cells as its probability. Cell allocation is spread, not 1347 linear; each successor position follows this rule: 1349 position += (tableSize >> 1) + (tableSize >> 3) + 3; 1350 position &= tableSize - 1; 1352 A position is skipped if it is already occupied by a "less than 1" 1353 probability symbol. Position does not reset between symbols; it 1354 simply iterates through each position in the table, switching to the 1355 next symbol when enough states have been allocated to the current 1356 one. 1358 The result is a list of state values. Each state will decode the 1359 current symbol. 1361 To get the Number_of_Bits and Baseline required for the next state, 1362 it is first necessary to sort all states in their natural order. The 1363 lower states will need 1 more bit than higher ones. The process is 1364 repeated for each symbol. 1366 For example, presuming a symbol has a probability of 5, it receives 1367 five state values. States are sorted in natural order. The next 1368 power of 2 is 8. The space of probabilities is divided into 8 equal 1369 parts. Presuming the Accuracy_Log is 7, this defines 128 states, and 1370 each share (divided by 8) is 16 in size. In order to reach 8, 8 - 5 1371 = 3 lowest states will count "double", doubling the number of shares 1372 (32 in width), requiring 1 more bit in the process. 1374 Baseline is assigned starting from the higher states using fewer 1375 bits, and proceeding naturally, then resuming at the first state, 1376 each taking its allocated width from Baseline. 1378 +----------------+-------+-------+--------+------+-------+ 1379 | state order | 0 | 1 | 2 | 3 | 4 | 1380 +----------------+-------+-------+--------+------+-------+ 1381 | width | 32 | 32 | 32 | 16 | 16 | 1382 +----------------+-------+-------+--------+------+-------+ 1383 | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | 1384 +----------------+-------+-------+--------+------+-------+ 1385 | range number | 2 | 4 | 6 | 0 | 1 | 1386 +----------------+-------+-------+--------+------+-------+ 1387 | Baseline | 32 | 64 | 96 | 0 | 16 | 1388 +----------------+-------+-------+--------+------+-------+ 1389 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1390 +----------------+-------+-------+--------+------+-------+ 1392 The next state is determined from the current state by reading the 1393 required Number_of_Bits and adding the specified Baseline. 1395 See Appendix A for the results of this process that are applied to 1396 the default distributions. 1398 4.2. Huffman Coding 1400 Zstandard Huffman-coded streams are read backwards, similar to the 1401 FSE bitstreams. Therefore, to find the start of the bitstream, it is 1402 necessary to know the offset of the last byte of the Huffman-coded 1403 stream. 1405 After writing the last bit containing information, the compressor 1406 writes a single 1 bit and then fills the byte with 0-7 0 bits of 1407 padding. The last byte of the compressed bitstream cannot be 0 for 1408 that reason. 1410 When decompressing, the last byte containing the padding is the first 1411 byte to read. The decompressor needs to skip 0-7 initial 0 bits and 1412 the first 1 bit that occurs. Afterwards, the useful part of the 1413 bitstream begins. 1415 The bitstream contains Huffman-coded symbols in little-endian order, 1416 with the codes defined by the method below. 1418 4.2.1. Huffman Tree Description 1420 Prefix coding represents symbols from an a priori known alphabet by 1421 bit sequences (codewords), one codeword for each symbol, in a manner 1422 such that different symbols may be represented by bit sequences of 1423 different lengths, but a parser can always parse an encoded string 1424 unambiguously symbol by symbol. 1426 Given an alphabet with known symbol frequencies, the Huffman 1427 algorithm allows the construction of an optimal prefix code using the 1428 fewest bits of any possible prefix codes for that alphabet. 1430 The prefix code must not exceed a maximum code length. More bits 1431 improve accuracy but yield a larger header size and require more 1432 memory or more complex decoding operations. This specification 1433 limits the maximum code length to 11 bits. 1435 All literal values from zero (included) to the last present one 1436 (excluded) are represented by Weight with values from 0 to 1437 Max_Number_of_Bits. Transformation from Weight to Number_of_Bits 1438 follows this pseudocode: 1440 if Weight == 0 1441 Number_of_Bits = 0 1442 else 1443 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1445 The last symbol's Weight is deduced from previously decoded ones, by 1446 completing to the nearest power of 2. This power of 2 gives 1447 Max_Number_of_Bits the depth of the current tree. 1449 For example, presume the following Huffman tree must be described: 1451 +---------------+----------------+ 1452 | Literal Value | Number_of_Bits | 1453 +---------------+----------------+ 1454 | 0 | 1 | 1455 +---------------+----------------+ 1456 | 1 | 2 | 1457 +---------------+----------------+ 1458 | 2 | 3 | 1459 +---------------+----------------+ 1460 | 3 | 0 | 1461 +---------------+----------------+ 1462 | 4 | 4 | 1463 +---------------+----------------+ 1464 | 5 | 4 | 1465 +---------------+----------------+ 1467 The tree depth is 4, since its longest element uses 4 bits. (The 1468 longest elements are those with the smallest frequencies.) Value 5 1469 will not be listed as it can be determined from the values for 0-4, 1470 nor will values above 5 as they are all 0. Values from 0 to 4 will 1471 be listed using Weight instead of Number_of_Bits. The pseudocode to 1472 determine Weight is: 1474 if Number_of_Bits == 0 1475 Weight = 0 1476 else 1477 Weight = Max_Number_of_Bits + 1 - Number_of_Bits 1479 It gives the following series of weights: 1481 +---------------+--------+ 1482 | Literal Value | Weight | 1483 +---------------+--------+ 1484 | 0 | 4 | 1485 +---------------+--------+ 1486 | 1 | 3 | 1487 +---------------+--------+ 1488 | 2 | 2 | 1489 +---------------+--------+ 1490 | 3 | 0 | 1491 +---------------+--------+ 1492 | 4 | 1 | 1493 +---------------+--------+ 1495 The decoder will do the inverse operation: having collected weights 1496 of literals from 0 to 4, it knows the last literal, 5, is present 1497 with a non-zero Weight. The Weight of 5 can be determined by 1498 advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 1499 0's) is 15. The nearest power of 2 is 16. Therefore, 1500 Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1. 1502 4.2.1.1. Huffman Tree Header 1504 This is a single byte value (0-255), which describes how the series 1505 of weights is encoded. 1507 headerByte < 128: The series of weights is compressed using FSE (see 1508 below). The length of the FSE-compressed series is equal to 1509 headerByte (0-127). 1511 headerByte >= 128: This is a direct representation, where each 1512 Weight is written directly as a 4-bit field (0-15). They are 1513 encoded forward, 2 weights to a byte with the first weight taking 1514 the top 4 bits and the second taking the bottom 4; for example, 1515 the following operations could be used to read the weights: 1517 Weight[0] = (Byte[0] >> 4) 1518 Weight[1] = (Byte[0] & 0xf), 1519 etc. 1521 The full representation occupies ceiling(Number_of_Symbols/2) 1522 bytes, meaning it uses only full bytes even if Number_of_Symbols 1523 is odd. Number_of_Symbols = headerByte - 127. Note that maximum 1524 Number_of_Symbols is 255 - 127 = 128. If any literal has a value 1525 over 128, raw header mode is not possible, and it is necessary to 1526 use FSE compression. 1528 4.2.1.2. FSE Compression of Huffman Weights 1530 In this case, the series of Huffman weights is compressed using FSE 1531 compression. It is a single bitstream with two interleaved states, 1532 sharing a single distribution table. 1534 To decode an FSE bitstream, it is necessary to know its compressed 1535 size. Compressed size is provided by headerByte. It's also 1536 necessary to know its maximum possible decompressed size, which is 1537 255, since literal values span from 0 to 255, and the last symbol's 1538 Weight is not represented. 1540 An FSE bitstream starts by a header, describing probabilities 1541 distribution. It will create a decoding table. For a list of 1542 Huffman weights, the maximum accuracy log is 6 bits. For more 1543 details, see Section 4.1.1. 1545 The Huffman header compression uses two states, which share the same 1546 FSE distribution table. The first state (State1) encodes the even- 1547 numbered index symbols, and the second (State2) encodes the odd- 1548 numbered index symbols. State1 is initialized first, and then 1549 State2, and they take turns decoding a single symbol and updating 1550 their state. For more details on these FSE operations, see 1551 Section 4.1. 1553 The number of symbols to be decoded is determined by tracking the 1554 bitStream overflow condition: If updating state after decoding a 1555 symbol would require more bits than remain in the stream, it is 1556 assumed that extra bits are zero. Then, symbols for each of the 1557 final states are decoded and the process is complete. 1559 4.2.1.3. Conversion from Weights to Huffman Prefix Codes 1561 All present symbols will now have a Weight value. It is possible to 1562 transform weights into Number_of_Bits, using this formula: 1564 if Weight > 0 1565 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1566 else 1567 Number_of_Bits = 0 1569 Symbols are sorted by Weight. Within the same Weight, symbols keep 1570 natural sequential order. Symbols with a Weight of zero are removed. 1571 Then, starting from the lowest Weight, prefix codes are distributed 1572 in sequential order. 1574 For example, assume the following list of weights has been decoded: 1576 +---------+--------+ 1577 | Literal | Weight | 1578 +---------+--------+ 1579 | 0 | 4 | 1580 +---------+--------+ 1581 | 1 | 3 | 1582 +---------+--------+ 1583 | 2 | 2 | 1584 +---------+--------+ 1585 | 3 | 0 | 1586 +---------+--------+ 1587 | 4 | 1 | 1588 +---------+--------+ 1589 | 5 | 1 | 1590 +---------+--------+ 1592 Sorting by weight and then the natural sequential order yields the 1593 following distribution: 1595 +---------+--------+----------------+--------------+ 1596 | Literal | Weight | Number_Of_Bits | Prefix Codes | 1597 +---------+--------+----------------|--------------+ 1598 | 3 | 0 | 0 | N/A | 1599 +---------+--------+----------------|--------------+ 1600 | 4 | 1 | 4 | 0000 | 1601 +---------+--------+----------------|--------------+ 1602 | 5 | 1 | 4 | 0001 | 1603 +---------+--------+----------------|--------------+ 1604 | 2 | 2 | 3 | 001 | 1605 +---------+--------+----------------|--------------+ 1606 | 1 | 3 | 2 | 01 | 1607 +---------+--------+----------------|--------------+ 1608 | 0 | 4 | 1 | 1 | 1609 +---------+--------+----------------|--------------+ 1611 4.2.2. Huffman-Coded Streams 1613 Given a Huffman decoding table, it is possible to decode a Huffman- 1614 coded stream. 1616 Each bitstream must be read backward, which starts from the end and 1617 goes up to the beginning. Therefore, it is necessary to know the 1618 size of each bitstream. 1620 It is also necessary to know exactly which bit is the last. This is 1621 detected by a final bit flag: the highest bit of the last byte is a 1622 final-bit-flag. Consequently, a last byte of 0 is not possible. And 1623 the final-bit-flag itself is not part of the useful bitstream. 1625 Hence, the last byte contains between 0 and 7 useful bits. 1627 Starting from the end, it is possible to read the bitstream in a 1628 little-endian fashion, keeping track of already used bits. Since the 1629 bitstream is encoded in reverse order, starting from the end, read 1630 symbols in forward order. 1632 For example, if the literal sequence "0145" was encoded using the 1633 above prefix code, it would be encoded (in reverse order) as: 1635 +---------+----------+ 1636 | Symbol | Encoding | 1637 +---------+----------+ 1638 | 5 | 0000 | 1639 +---------+----------+ 1640 | 4 | 0001 | 1641 +---------+----------+ 1642 | 1 | 01 | 1643 +---------+----------+ 1644 | 0 | 1 | 1645 +---------+----------+ 1646 | Padding | 00001 | 1647 +---------+----------+ 1649 This results in the following 2-byte bitstream: 1651 00010000 00001101 1653 Here is an alternative representation with the symbol codes separated 1654 by underscores: 1656 0001_0000 00001_1_01 1658 Reading the highest Max_Number_of_Bits bits, it's possible to compare 1659 the extracted value to the decoding table, determining the symbol to 1660 decode and number of bits to discard. 1662 The process continues reading up to the required number of symbols 1663 per stream. If a bitstream is not entirely and exactly consumed, 1664 hence reaching exactly its beginning position with all bits consumed, 1665 the decoding process is considered faulty. 1667 5. Dictionary Format 1669 Zstandard is compatible with "raw content" dictionaries, free of any 1670 format restriction, except that they must be at least 8 bytes. These 1671 dictionaries function as if they were just the content part of a 1672 formatted dictionary. 1674 However, dictionaries created by "zstd --train" in the reference 1675 implementation follow a specific format, described here. 1677 Dictionaries are not included in the compressed content but rather 1678 are provided out of band. That is, the Dictionary_ID identifies 1679 which should be used, but this specification does not describe the 1680 mechanism by which the dictionary is obtained prior to use during 1681 compression or decompression. 1683 A dictionary has a size, defined either by a buffer limit or a file 1684 size. The general format is: 1686 +--------------+---------------+----------------+---------+ 1687 | Magic_Number | Dictionary_ID | Entropy_Tables | Content | 1688 +--------------+---------------+----------------+---------+ 1690 Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format. 1692 Dictionary_ID: 4 bytes, stored in little-endian format. 1693 Dictionary_ID can be any value, except 0 (which means no 1694 Dictionary_ID). It is used by decoders to check if they use the 1695 correct dictionary. If the frame is going to be distributed in a 1696 private environment, any Dictionary_ID can be used. However, for 1697 public distribution of compressed frames, the following ranges are 1698 reserved and shall not be used: 1699 low range: <= 32767 1700 high range: >= (2^31) 1702 Entropy_Tables: Follow the same format as the tables in compressed 1703 blocks. See the relevant FSE and Huffman sections for how to 1704 decode these tables. They are stored in the following order: 1705 Huffman table for literals, FSE table for offsets, FSE table for 1706 match lengths, and FSE table for literals lengths. These tables 1707 populate the Repeat Stats literals mode and Repeat distribution 1708 mode for sequence decoding. It is finally followed by 3 offset 1709 values, populating repeat offsets (instead of using {1,4,8}), 1710 stored in order, 4-bytes little-endian each, for a total of 12 1711 bytes. Each repeat offset must have a value less than the 1712 dictionary size. 1714 Content: The rest of the dictionary is its content. The content 1715 acts as a "past" in front of data to be compressed or 1716 decompressed, so it can be referenced in sequence commands. As 1717 long as the amount of data decoded from this frame is less than or 1718 equal to Window_Size, sequence commands may specify offsets longer 1719 than the total length of decoded output so far to reference back 1720 to the dictionary, even parts of the dictionary with offsets 1721 larger than Window_Size. After the total output has surpassed 1722 Window_Size, however, this is no longer allowed, and the 1723 dictionary is no longer accessible. 1725 6. Use of Dictionaries 1727 Provisioning for use of dictionaries with zstd is planned for future 1728 work. To ensure compatibility with the future specification of use 1729 of dictionaries with zstd payloads, especially with MIME, content 1730 encoded with the media type registered here should not use a 1731 dictionary. The exception to this requirement might be a private 1732 dictionary negotiaton that is not part of this specification. 1734 7. IANA Considerations 1736 IANA has made three registrations, as described below. 1738 7.1. The 'application/zstd' Media Type 1740 The 'application/zstd' media type identifies a block of data that is 1741 compressed using zstd compression. The data is a stream of bytes as 1742 described in this document. IANA has added the following to the 1743 "Media Types" registry: 1745 Type name: application 1747 Subtype name: zstd 1749 Required parameters: N/A 1751 Optional parameters: N/A 1753 Encoding considerations: binary 1755 Security considerations: See Section 8 of [this document] 1757 Interoperability considerations: N/A 1759 Published specification: [this document] 1761 Applications that use this media type: anywhere data size is an 1762 issue 1764 Additional information: 1766 Magic number(s): 4 bytes, little-endian format. Value: 1767 0xFD2FB528 1769 File extension(s): zst 1771 Macintosh file type code(s): N/A 1773 For further information: See [ZSTD] 1775 Intended usage: common 1777 Restrictions on usage: N/A 1779 Author: Murray S. Kucherawy 1781 Change Controller: IETF 1783 Provisional registration: no 1785 7.2. Content Encoding 1787 IANA has added the following entry to the "HTTP Content Coding 1788 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" 1789 registry: 1791 Name: zstd 1793 Description: A stream of bytes compressed using the Zstandard 1794 protocol 1796 Pointer to specification text: [this document] 1798 7.3. Structured Syntax Suffix 1800 IANA is requested to register the following into the Structured 1801 Syntax Suffix registry: 1803 Name: Zstandard 1805 +suffix: +zstd 1807 Encoding Considerations: binary 1808 Interoperability Considerations: N/A 1810 Fragment Identifier Considerations: The syntax and semantics of 1811 fragment identifiers specified for +zstd should be as specified 1812 for "application/zstd". 1814 Security Considerations: See Section 7 of [this document]. 1816 Contact: See Section 7 of [this document]. 1818 Author/Change Controller: See Section 7 of [this document]. 1820 7.4. Dictionaries 1822 Work in progress includes development of dictionaries that will 1823 optimize compression and decompression of particular types of data. 1824 Specification of such dictionaries for public use will necessitate 1825 registration of a code point from the reserved range described in 1826 Section 3.1.1.1.3 and its association with a specific dictionary. 1828 However, there are at present no such dictionaries published for 1829 public use, so this document makes no immediate request of IANA to 1830 create such a registry. 1832 8. Security Considerations 1834 Any data compression method involves the reduction of redundancy in 1835 the data. Zstandard is no exception, and the usual precautions 1836 apply. 1838 One should never compress a message whose content must remain secret 1839 with a message generated by a third party. Such a compression can be 1840 used to guess the content of the secret message through analysis of 1841 entropy reduction. This was demonstrated in the Compression Ratio 1842 Info-leak Made Easy (CRIME) attack [CRIME], for example. 1844 A decoder has to demonstrate capabilities to detect and prevent any 1845 kind of data tampering in the compressed frame from triggering system 1846 faults, such as reading or writing beyond allowed memory ranges. 1847 This can be guaranteed by either the implementation language or 1848 careful bound checkings. Of particular note is the encoding of 1849 Number_of_Sequences values that cause the decoder to read into the 1850 block header (and beyond), as well as the indication of a 1851 Frame_Content_Size that is smaller than the actual decompressed data, 1852 in an attempt to trigger a buffer overflow. It is highly recommended 1853 to fuzz-test (i.e., provide invalid, unexpected, or random input and 1854 verify safe operation of) decoder implementations to test and harden 1855 their capability to detect bad frames and deal with them without any 1856 adverse system side effect. 1858 An attacker may provide correctly formed compressed frames with 1859 unreasonable memory requirements. A decoder must always control 1860 memory requirements and enforce some (system-specific) limits in 1861 order to protect memory usage from such scenarios. 1863 Compression can be optimized by training a dictionary on a variety of 1864 related content payloads. This dictionary must then be available at 1865 the decoder for decompression of the payload to be possible. While 1866 this document does not specify how to acquire a dictionary for a 1867 given compressed payload, it is worth noting that third-party 1868 dictionaries may interact unexpectedly with a decoder, leading to 1869 possible memory or other resource exhaustion attacks. We expect such 1870 topics to be discussed in further detail in the Security 1871 Considerations section of a forthcoming RFC for dictionary 1872 acquisition and transmission, but highlight this issue now out of an 1873 abundance of caution. 1875 As discussed in Section 3.1.2, it is possible to store arbitrary user 1876 metadata in skippable frames. While such frames are ignored during 1877 decompression of the data, they can be used as a watermark to track 1878 the path of the compressed payload. 1880 9. Implementation Status 1882 Source code for a C language implementation of a Zstandard-compliant 1883 library is available at [ZSTD-GITHUB]. This implementation is 1884 considered to be the reference implementation and is production 1885 ready; it implements the full range of the specification. It is 1886 routinely tested against security hazards and widely deployed within 1887 Facebook infrastructure. 1889 The reference version is optimized for speed and is highly portable. 1890 It has been proven to run safely on multiple architectures (e.g., 1891 x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit 1892 addressing schemes, a little- or big-endian storage scheme, a number 1893 of different operating systems (e.g., UNIX (including Linux, BSD, 1894 OS-X, and Solaris) and Windows), and a number of compilers (e.g., 1895 gcc, clang, visual, and icc). 1897 10. References 1898 10.1. Normative References 1900 [ZSTD] "Zstandard", 2017, . 1902 10.2. Informative References 1904 [ANS] Duda, J., "Asymmetric numeral systems: entropy coding 1905 combining speed of Huffman coding with compression rate of 1906 arithmetic coding", January 2014, 1907 . 1909 [CRIME] "CRIME", June 2018, . 1912 [FSE] "FiniteStateEntropy", June 2018, 1913 . 1915 [LZ4] "LZ4 Frame Format Description", January 2018, . 1918 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", 1919 RFC 1952, DOI 10.17487/RFC1952, May 1996, 1920 . 1922 [XXHASH] "XXHASH Algorithm", 2017, . 1924 [ZSTD-GITHUB] 1925 "zstd", August 2018, . 1927 Appendix A. Decoding Tables for Predefined Codes 1929 This appendix contains FSE decoding tables for the predefined literal 1930 length, match length, and offset codes. The tables have been 1931 constructed using the algorithm as given above in Section 4.1.1. The 1932 tables here can be used as examples to crosscheck that an 1933 implementation has built its decoding tables correctly. 1935 A.1. Literal Length Code Table 1937 +-------+--------+----------------+------+ 1938 | State | Symbol | Number_Of_Bits | Base | 1939 +-------+--------+----------------+------+ 1940 | 0 | 0 | 0 | 0 | 1941 +-------+--------+----------------+------+ 1942 | 0 | 0 | 4 | 0 | 1943 +-------+--------+----------------+------+ 1944 | 1 | 0 | 4 | 16 | 1945 +-------+--------+----------------+------+ 1946 | 2 | 1 | 5 | 32 | 1947 +-------+--------+----------------+------+ 1948 | 3 | 3 | 5 | 0 | 1949 +-------+--------+----------------+------+ 1950 | 4 | 4 | 5 | 0 | 1951 +-------+--------+----------------+------+ 1952 | 5 | 6 | 5 | 0 | 1953 +-------+--------+----------------+------+ 1954 | 6 | 7 | 5 | 0 | 1955 +-------+--------+----------------+------+ 1956 | 7 | 9 | 5 | 0 | 1957 +-------+--------+----------------+------+ 1958 | 8 | 10 | 5 | 0 | 1959 +-------+--------+----------------+------+ 1960 | 9 | 12 | 5 | 0 | 1961 +-------+--------+----------------+------+ 1962 | 10 | 14 | 6 | 0 | 1963 +-------+--------+----------------+------+ 1964 | 11 | 16 | 5 | 0 | 1965 +-------+--------+----------------+------+ 1966 | 12 | 18 | 5 | 0 | 1967 +-------+--------+----------------+------+ 1968 | 13 | 19 | 5 | 0 | 1969 +-------+--------+----------------+------+ 1970 | 14 | 21 | 5 | 0 | 1971 +-------+--------+----------------+------+ 1972 | 15 | 22 | 5 | 0 | 1973 +-------+--------+----------------+------+ 1974 | 16 | 24 | 5 | 0 | 1975 +-------+--------+----------------+------+ 1976 | 17 | 25 | 5 | 32 | 1977 +-------+--------+----------------+------+ 1978 | 18 | 26 | 5 | 0 | 1979 +-------+--------+----------------+------+ 1980 | 19 | 27 | 6 | 0 | 1981 +-------+--------+----------------+------+ 1982 | 20 | 29 | 6 | 0 | 1983 +-------+--------+----------------+------+ 1984 | 21 | 31 | 6 | 0 | 1985 +-------+--------+----------------+------+ 1986 | 22 | 0 | 4 | 32 | 1987 +-------+--------+----------------+------+ 1988 | 23 | 1 | 4 | 0 | 1989 +-------+--------+----------------+------+ 1990 | 24 | 2 | 5 | 0 | 1991 +-------+--------+----------------+------+ 1992 | 25 | 4 | 5 | 32 | 1993 +-------+--------+----------------+------+ 1994 | 26 | 5 | 5 | 0 | 1995 +-------+--------+----------------+------+ 1996 | 27 | 7 | 5 | 32 | 1997 +-------+--------+----------------+------+ 1998 | 28 | 8 | 5 | 0 | 1999 +-------+--------+----------------+------+ 2000 | 29 | 10 | 5 | 32 | 2001 +-------+--------+----------------+------+ 2002 | 30 | 11 | 5 | 0 | 2003 +-------+--------+----------------+------+ 2004 | 31 | 13 | 6 | 0 | 2005 +-------+--------+----------------+------+ 2006 | 32 | 16 | 5 | 32 | 2007 +-------+--------+----------------+------+ 2008 | 33 | 17 | 5 | 0 | 2009 +-------+--------+----------------+------+ 2010 | 34 | 19 | 5 | 32 | 2011 +-------+--------+----------------+------+ 2012 | 35 | 20 | 5 | 0 | 2013 +-------+--------+----------------+------+ 2014 | 36 | 22 | 5 | 32 | 2015 +-------+--------+----------------+------+ 2016 | 37 | 23 | 5 | 0 | 2017 +-------+--------+----------------+------+ 2018 | 38 | 25 | 4 | 0 | 2019 +-------+--------+----------------+------+ 2020 | 39 | 25 | 4 | 16 | 2021 +-------+--------+----------------+------+ 2022 | 40 | 26 | 5 | 32 | 2023 +-------+--------+----------------+------+ 2024 | 41 | 28 | 6 | 0 | 2025 +-------+--------+----------------+------+ 2026 | 42 | 30 | 6 | 0 | 2027 +-------+--------+----------------+------+ 2028 | 43 | 0 | 4 | 48 | 2029 +-------+--------+----------------+------+ 2030 | 44 | 1 | 4 | 16 | 2031 +-------+--------+----------------+------+ 2032 | 45 | 2 | 5 | 32 | 2033 +-------+--------+----------------+------+ 2034 | 46 | 3 | 5 | 32 | 2035 +-------+--------+----------------+------+ 2036 | 47 | 5 | 5 | 32 | 2037 +-------+--------+----------------+------+ 2038 | 48 | 6 | 5 | 32 | 2039 +-------+--------+----------------+------+ 2040 | 49 | 8 | 5 | 32 | 2041 +-------+--------+----------------+------+ 2042 | 50 | 9 | 5 | 32 | 2043 +-------+--------+----------------+------+ 2044 | 51 | 11 | 5 | 32 | 2045 +-------+--------+----------------+------+ 2046 | 52 | 12 | 5 | 32 | 2047 +-------+--------+----------------+------+ 2048 | 53 | 15 | 6 | 0 | 2049 +-------+--------+----------------+------+ 2050 | 54 | 17 | 5 | 32 | 2051 +-------+--------+----------------+------+ 2052 | 55 | 18 | 5 | 32 | 2053 +-------+--------+----------------+------+ 2054 | 56 | 20 | 5 | 32 | 2055 +-------+--------+----------------+------+ 2056 | 57 | 21 | 5 | 32 | 2057 +-------+--------+----------------+------+ 2058 | 58 | 23 | 5 | 32 | 2059 +-------+--------+----------------+------+ 2060 | 59 | 24 | 5 | 32 | 2061 +-------+--------+----------------+------+ 2062 | 60 | 35 | 6 | 0 | 2063 +-------+--------+----------------+------+ 2064 | 61 | 34 | 6 | 0 | 2065 +-------+--------+----------------+------+ 2066 | 62 | 33 | 6 | 0 | 2067 +-------+--------+----------------+------+ 2068 | 63 | 32 | 6 | 0 | 2069 +-------+--------+----------------+------+ 2071 A.2. Match Length Code Table 2073 +-------+--------+----------------+------+ 2074 | State | Symbol | Number_Of_Bits | Base | 2075 +-------+--------+----------------+------+ 2076 | 0 | 0 | 0 | 0 | 2077 +-------+--------+----------------+------+ 2078 | 0 | 0 | 6 | 0 | 2079 +-------+--------+----------------+------+ 2080 | 1 | 1 | 4 | 0 | 2081 +-------+--------+----------------+------+ 2082 | 2 | 2 | 5 | 32 | 2083 +-------+--------+----------------+------+ 2084 | 3 | 3 | 5 | 0 | 2085 +-------+--------+----------------+------+ 2086 | 4 | 5 | 5 | 0 | 2087 +-------+--------+----------------+------+ 2088 | 5 | 6 | 5 | 0 | 2089 +-------+--------+----------------+------+ 2090 | 6 | 8 | 5 | 0 | 2091 +-------+--------+----------------+------+ 2092 | 7 | 10 | 6 | 0 | 2093 +-------+--------+----------------+------+ 2094 | 8 | 13 | 6 | 0 | 2095 +-------+--------+----------------+------+ 2096 | 9 | 16 | 6 | 0 | 2097 +-------+--------+----------------+------+ 2098 | 10 | 19 | 6 | 0 | 2099 +-------+--------+----------------+------+ 2100 | 11 | 22 | 6 | 0 | 2101 +-------+--------+----------------+------+ 2102 | 12 | 25 | 6 | 0 | 2103 +-------+--------+----------------+------+ 2104 | 13 | 28 | 6 | 0 | 2105 +-------+--------+----------------+------+ 2106 | 14 | 31 | 6 | 0 | 2107 +-------+--------+----------------+------+ 2108 | 15 | 33 | 6 | 0 | 2109 +-------+--------+----------------+------+ 2110 | 16 | 35 | 6 | 0 | 2111 +-------+--------+----------------+------+ 2112 | 17 | 37 | 6 | 0 | 2113 +-------+--------+----------------+------+ 2114 | 18 | 39 | 6 | 0 | 2115 +-------+--------+----------------+------+ 2116 | 19 | 41 | 6 | 0 | 2117 +-------+--------+----------------+------+ 2118 | 20 | 43 | 6 | 0 | 2119 +-------+--------+----------------+------+ 2120 | 21 | 45 | 6 | 0 | 2121 +-------+--------+----------------+------+ 2122 | 22 | 1 | 4 | 16 | 2123 +-------+--------+----------------+------+ 2124 | 23 | 2 | 4 | 0 | 2125 +-------+--------+----------------+------+ 2126 | 24 | 3 | 5 | 32 | 2127 +-------+--------+----------------+------+ 2128 | 25 | 4 | 5 | 0 | 2129 +-------+--------+----------------+------+ 2130 | 26 | 6 | 5 | 32 | 2131 +-------+--------+----------------+------+ 2132 | 27 | 7 | 5 | 0 | 2133 +-------+--------+----------------+------+ 2134 | 28 | 9 | 6 | 0 | 2135 +-------+--------+----------------+------+ 2136 | 29 | 12 | 6 | 0 | 2137 +-------+--------+----------------+------+ 2138 | 30 | 15 | 6 | 0 | 2139 +-------+--------+----------------+------+ 2140 | 31 | 18 | 6 | 0 | 2141 +-------+--------+----------------+------+ 2142 | 32 | 21 | 6 | 0 | 2143 +-------+--------+----------------+------+ 2144 | 33 | 24 | 6 | 0 | 2145 +-------+--------+----------------+------+ 2146 | 34 | 27 | 6 | 0 | 2147 +-------+--------+----------------+------+ 2148 | 35 | 30 | 6 | 0 | 2149 +-------+--------+----------------+------+ 2150 | 36 | 32 | 6 | 0 | 2151 +-------+--------+----------------+------+ 2152 | 37 | 34 | 6 | 0 | 2153 +-------+--------+----------------+------+ 2154 | 38 | 36 | 6 | 0 | 2155 +-------+--------+----------------+------+ 2156 | 39 | 38 | 6 | 0 | 2157 +-------+--------+----------------+------+ 2158 | 40 | 40 | 6 | 0 | 2159 +-------+--------+----------------+------+ 2160 | 41 | 42 | 6 | 0 | 2161 +-------+--------+----------------+------+ 2162 | 42 | 44 | 6 | 0 | 2163 +-------+--------+----------------+------+ 2164 | 43 | 1 | 4 | 32 | 2165 +-------+--------+----------------+------+ 2166 | 44 | 1 | 4 | 48 | 2167 +-------+--------+----------------+------+ 2168 | 45 | 2 | 4 | 16 | 2169 +-------+--------+----------------+------+ 2170 | 46 | 4 | 5 | 32 | 2171 +-------+--------+----------------+------+ 2172 | 47 | 5 | 5 | 32 | 2173 +-------+--------+----------------+------+ 2174 | 48 | 7 | 5 | 32 | 2175 +-------+--------+----------------+------+ 2176 | 49 | 8 | 5 | 32 | 2177 +-------+--------+----------------+------+ 2178 | 50 | 11 | 6 | 0 | 2179 +-------+--------+----------------+------+ 2180 | 51 | 14 | 6 | 0 | 2181 +-------+--------+----------------+------+ 2182 | 52 | 17 | 6 | 0 | 2183 +-------+--------+----------------+------+ 2184 | 53 | 20 | 6 | 0 | 2185 +-------+--------+----------------+------+ 2186 | 54 | 23 | 6 | 0 | 2187 +-------+--------+----------------+------+ 2188 | 55 | 26 | 6 | 0 | 2189 +-------+--------+----------------+------+ 2190 | 56 | 29 | 6 | 0 | 2191 +-------+--------+----------------+------+ 2192 | 57 | 52 | 6 | 0 | 2193 +-------+--------+----------------+------+ 2194 | 58 | 51 | 6 | 0 | 2195 +-------+--------+----------------+------+ 2196 | 59 | 50 | 6 | 0 | 2197 +-------+--------+----------------+------+ 2198 | 60 | 49 | 6 | 0 | 2199 +-------+--------+----------------+------+ 2200 | 61 | 48 | 6 | 0 | 2201 +-------+--------+----------------+------+ 2202 | 62 | 47 | 6 | 0 | 2203 +-------+--------+----------------+------+ 2204 | 63 | 46 | 6 | 0 | 2205 +-------+--------+----------------+------+ 2207 A.3. Offset Code Table 2209 +-------+--------+----------------+------+ 2210 | State | Symbol | Number_Of_Bits | Base | 2211 +-------+--------+----------------+------+ 2212 | 0 | 0 | 0 | 0 | 2213 +-------+--------+----------------+------+ 2214 | 0 | 0 | 5 | 0 | 2215 +-------+--------+----------------+------+ 2216 | 1 | 6 | 4 | 0 | 2217 +-------+--------+----------------+------+ 2218 | 2 | 9 | 5 | 0 | 2219 +-------+--------+----------------+------+ 2220 | 3 | 15 | 5 | 0 | 2221 +-------+--------+----------------+------+ 2222 | 4 | 21 | 5 | 0 | 2223 +-------+--------+----------------+------+ 2224 | 5 | 3 | 5 | 0 | 2225 +-------+--------+----------------+------+ 2226 | 6 | 7 | 4 | 0 | 2227 +-------+--------+----------------+------+ 2228 | 7 | 12 | 5 | 0 | 2229 +-------+--------+----------------+------+ 2230 | 8 | 18 | 5 | 0 | 2231 +-------+--------+----------------+------+ 2232 | 9 | 23 | 5 | 0 | 2233 +-------+--------+----------------+------+ 2234 | 10 | 5 | 5 | 0 | 2235 +-------+--------+----------------+------+ 2236 | 11 | 8 | 4 | 0 | 2237 +-------+--------+----------------+------+ 2238 | 12 | 14 | 5 | 0 | 2239 +-------+--------+----------------+------+ 2240 | 13 | 20 | 5 | 0 | 2241 +-------+--------+----------------+------+ 2242 | 14 | 2 | 5 | 0 | 2243 +-------+--------+----------------+------+ 2244 | 15 | 7 | 4 | 16 | 2245 +-------+--------+----------------+------+ 2246 | 16 | 11 | 5 | 0 | 2247 +-------+--------+----------------+------+ 2248 | 17 | 17 | 5 | 0 | 2249 +-------+--------+----------------+------+ 2250 | 18 | 22 | 5 | 0 | 2251 +-------+--------+----------------+------+ 2252 | 19 | 4 | 5 | 0 | 2253 +-------+--------+----------------+------+ 2254 | 20 | 8 | 4 | 16 | 2255 +-------+--------+----------------+------+ 2256 | 21 | 13 | 5 | 0 | 2257 +-------+--------+----------------+------+ 2258 | 22 | 19 | 5 | 0 | 2259 +-------+--------+----------------+------+ 2260 | 23 | 1 | 5 | 0 | 2261 +-------+--------+----------------+------+ 2262 | 24 | 6 | 4 | 16 | 2263 +-------+--------+----------------+------+ 2264 | 25 | 10 | 5 | 0 | 2265 +-------+--------+----------------+------+ 2266 | 26 | 16 | 5 | 0 | 2267 +-------+--------+----------------+------+ 2268 | 27 | 28 | 5 | 0 | 2269 +-------+--------+----------------+------+ 2270 | 28 | 27 | 5 | 0 | 2271 +-------+--------+----------------+------+ 2272 | 29 | 26 | 5 | 0 | 2273 +-------+--------+----------------+------+ 2274 | 30 | 25 | 5 | 0 | 2275 +-------+--------+----------------+------+ 2276 | 31 | 24 | 5 | 0 | 2277 +-------+--------+----------------+------+ 2279 Appendix B. Changes Since RFC8478 2281 The following are the changes in this document relative to RFC 8478: 2283 o Apply erratum #5786. 2285 o Clarify forward compatibility regarding dictionaries. 2287 o Clarify application of Block_Maximum_Size. 2289 o Add structured media type suffix registration. 2291 Appendix C. Acknowledgments 2293 zstd was developed by Yann Collet. 2295 Felix Handte and Nick Terrell provided feedback that went into this 2296 revision. 2298 Authors' Addresses 2300 Yann Collet 2301 Facebook 2302 1 Hacker Way 2303 Menlo Park, CA 94025 2304 United States of America 2306 Email: cyan@fb.com 2308 Murray S. Kucherawy (editor) 2309 Facebook 2310 1 Hacker Way 2311 Menlo Park, CA 94025 2312 United States of America 2314 Email: msk@fb.com