idnits 2.17.1 draft-kucherawy-rfc8478bis-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (December 18, 2020) is 1222 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '36' on line 1116 -- Looks like a reference, but probably isn't: '53' on line 1126 -- Looks like a reference, but probably isn't: '29' on line 1143 -- Looks like a reference, but probably isn't: '5' on line 1546 -- Looks like a reference, but probably isn't: '0' on line 1564 -- Looks like a reference, but probably isn't: '1' on line 1564 Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Y. Collet 3 Internet-Draft M. Kucherawy, Ed. 4 Obsoletes: 8478 (if approved) Facebook 5 Intended status: Informational December 18, 2020 6 Expires: June 21, 2021 8 Zstandard Compression and the application/zstd Media Type 9 draft-kucherawy-rfc8478bis-06 11 Abstract 13 Zstandard, or "zstd" (pronounced "zee standard"), is a data 14 compression mechanism. This document describes the mechanism and 15 registers a media type and content encoding to be used when 16 transporting zstd-compressed content via Multipurpose Internet Mail 17 Extensions (MIME). It also registers a corresponding media type, 18 content encoding, and structured syntax suffix. 20 Despite use of the word "standard" as part of its name, readers are 21 advised that this document is not an Internet Standards Track 22 specification; it is being published for informational purposes only. 24 This document replaces and obsoletes RFC 8478. 26 Status of this Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at http://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on June 21, 2021. 43 Copyright Notice 45 Copyright (c) 2020 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents 50 (http://trustee.ietf.org/license-info) in effect on the date of 51 publication of this document. Please review these documents 52 carefully, as they describe your rights and restrictions with respect 53 to this document. Code Components extracted from this document must 54 include Simplified BSD License text as described in Section 4.e of 55 the Trust Legal Provisions and are provided without warranty as 56 described in the Simplified BSD License. 58 Table of Contents 60 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 61 2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4 62 3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 5 63 3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . . 6 64 3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . . 6 65 3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . . 7 66 3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . . 12 67 3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 14 68 3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . . 27 69 3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . . 28 70 3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . . 29 71 4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . . 30 72 4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 73 4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 31 74 4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . . 33 75 4.2.1. Huffman Tree Description . . . . . . . . . . . . . . . 34 76 4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 36 77 4.2.1.2. FSE Compression of Huffman Weights . . . . . . . . 36 78 4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 37 79 4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 38 80 5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 39 81 6. Use of Dictionaries . . . . . . . . . . . . . . . . . . . . . 40 82 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 41 83 7.1. The 'application/zstd' Media Type . . . . . . . . . . . . 41 84 7.2. Content Encoding . . . . . . . . . . . . . . . . . . . . . 42 85 7.3. Structured Syntax Suffix . . . . . . . . . . . . . . . . . 42 86 7.4. Dictionaries . . . . . . . . . . . . . . . . . . . . . . . 43 87 8. Security Considerations . . . . . . . . . . . . . . . . . . . 43 88 9. Implementation Status . . . . . . . . . . . . . . . . . . . . 44 89 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 45 90 10.1. Normative References . . . . . . . . . . . . . . . . . . . 45 91 10.2. Informative References . . . . . . . . . . . . . . . . . . 45 92 Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 45 93 A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 45 94 A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 48 95 A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 51 96 Appendix B. Changes Since RFC8478 . . . . . . . . . . . . . . . . 53 97 Appendix C. Acknowledgments . . . . . . . . . . . . . . . . . . . 53 98 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 53 100 1. Introduction 102 Zstandard, or "zstd" (pronounced "zee standard"), is a data 103 compression mechanism, akin to gzip [RFC1952]. 105 Despite use of the word "standard" as part of its name, readers are 106 advised that this document is not an Internet Standards Track 107 specification; it is being published for informational purposes only. 109 This document describes the Zstandard format. Also, to enable the 110 transport of a data object compressed with Zstandard, this document 111 registers a media type that can be used to identify such content when 112 it is used in a payload encoded using Multipurpose Internet Mail 113 Extensions (MIME). 115 2. Definitions 117 Some terms used elsewhere in this document are defined here for 118 clarity. 120 uncompressed: Describes an arbitrary set of bytes in their original 121 form, prior to being subjected to compression. 123 compress, compression: The act of processing a set of bytes via the 124 compression mechanism described here. 126 compressed: Describes the result of passing a set of bytes through 127 this mechanism. The original input has thus been compressed. 129 decompress, decompression: The act of processing a set of bytes 130 through the inverse of the compression mechanism described here, 131 in an attempt to recover the original set of bytes prior to 132 compression. 134 decompressed: Describes the result of passing a set of bytes through 135 the reverse of this mechanism. When this is successful, the 136 decompressed payload and the uncompressed payload are 137 indistinguishable. 139 encode: The process of translating data from one form to another; 140 this may include compression or it may refer to other translations 141 done as part of this specification. 143 decode: The reverse of "encode"; describes a process of reversing a 144 prior encoding to recover the original content. 146 frame: Content compressed by Zstandard is transformed into a 147 Zstandard frame. Multiple frames can be appended into a single 148 file or stream. A frame is completely independent, has a defined 149 beginning and end, and has a set of parameters that tells the 150 decoder how to decompress it. 152 block: A frame encapsulates one or multiple blocks. Each block 153 contains arbitrary content, which is described by its header, and 154 has a guaranteed maximum content size that depends upon frame 155 parameters. Unlike frames, each block depends on previous blocks 156 for proper decoding. However, each block can be decompressed 157 without waiting for its successor, allowing streaming operations. 159 natural order: A sequence or ordering of objects or values that is 160 typical of that type of object or value. A set of unique 161 integers, for example, is in "natural order" if when progressing 162 from one element in the set or sequence to the next, there is 163 never a decrease in value. 165 The naming convention for identifiers within the specification is 166 Mixed_Case_With_Underscores. Identifiers inside square brackets 167 indicate that the identifier is optional in the presented context. 169 3. Compression Algorithm 171 This section describes the Zstandard algorithm. 173 The purpose of this document is to define a lossless compressed data 174 format that is a) independent of the CPU type, operating system, file 175 system, and character set and b) is suitable for file compression and 176 pipe and streaming compression, using the Zstandard algorithm. The 177 text of the specification assumes a basic background in programming 178 at the level of bits and other primitive data representations. 180 The data can be produced or consumed, even for an arbitrarily long 181 sequentially presented input data stream, using only an a priori 182 bounded amount of intermediate storage, and hence can be used in data 183 communications. The format uses the Zstandard compression method, 184 and an optional xxHash-64 checksum method [XXHASH], for detection of 185 data corruption. 187 The data format defined by this specification does not attempt to 188 allow random access to compressed data. 190 Unless otherwise indicated below, a compliant compressor must produce 191 data sets that conform to the specifications presented here. 192 However, it does not need to support all options. 194 A compliant decompressor must be able to decompress at least one 195 working set of parameters that conforms to the specifications 196 presented here. It may also ignore informative fields, such as the 197 checksum. Whenever it does not support a parameter defined in the 198 compressed stream, it must produce a non-ambiguous error code and 199 associated error message explaining which parameter is unsupported. 201 This specification is intended for use by implementers of software to 202 compress data into Zstandard format and/or decompress data from 203 Zstandard format. The Zstandard format is supported by an open 204 source reference implementation, written in portable C, and available 205 at [ZSTD]. 207 3.1. Frames 209 Zstandard compressed data is made up of one or more frames. Each 210 frame is independent and can be decompressed independently of other 211 frames. The decompressed content of multiple concatenated frames is 212 the concatenation of each frame's decompressed content. 214 There are two frame formats defined for Zstandard: Zstandard frames 215 and skippable frames. Zstandard frames contain compressed data, 216 while skippable frames contain custom user metadata. 218 3.1.1. Zstandard Frames 220 The structure of a single Zstandard frame is as follows: 222 +--------------------+------------+ 223 | Magic_Number | 4 bytes | 224 +--------------------+------------+ 225 | Frame_Header | 2-14 bytes | 226 +--------------------+------------+ 227 | Data_Block | n bytes | 228 +--------------------+------------+ 229 | [More Data_Blocks] | | 230 +--------------------+------------+ 231 | [Content_Checksum] | 4 bytes | 232 +--------------------+------------+ 234 Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528. 236 Frame_Header: 2 to 14 bytes, detailed in Section 3.1.1.1. 238 Data_Block: Detailed in Section 3.1.1.2. This is where data 239 appears. 241 Content_Checksum: An optional 32-bit checksum, only present if 242 Content_Checksum_Flag is set. The content checksum is the result 243 of the XXH64() hash function [XXHASH] digesting the original 244 (decoded) data as input, and a seed of zero. The low 4 bytes of 245 the checksum are stored in little-endian format. 247 The magic number was selected to be less probable to find at the 248 beginning of an arbitrary file. It avoids trivial patterns (0x00, 249 0xFF, repeated bytes, increasing bytes, etc.), contains byte values 250 outside of ASCII range, and doesn't map into UTF-8 space, all of 251 which reduce the likelihood of its appearance at the top of a text 252 file. 254 3.1.1.1. Frame Header 256 The frame header has a variable size, with a minimum of 2 bytes and 257 up to 14 bytes depending on optional parameters. The structure of 258 Frame_Header is as follows: 260 +-------------------------+-----------+ 261 | Frame_Header_Descriptor | 1 byte | 262 +-------------------------+-----------+ 263 | [Window_Descriptor] | 0-1 byte | 264 +-------------------------+-----------+ 265 | [Dictionary_ID] | 0-4 bytes | 266 +-------------------------+-----------+ 267 | [Frame_Content_Size] | 0-8 bytes | 268 +-------------------------+-----------+ 270 3.1.1.1.1. Frame_Header_Descriptor 272 The first header's byte is called the Frame_Header_Descriptor. It 273 describes which other fields are present. Decoding this byte is 274 enough to tell the size of Frame_Header. 276 +------------+-------------------------+ 277 | Bit Number | Field Name | 278 +------------+-------------------------+ 279 | 7-6 | Frame_Content_Size_Flag | 280 +------------+-------------------------+ 281 | 5 | Single_Segment_Flag | 282 +------------+-------------------------+ 283 | 4 | (unused) | 284 +------------+-------------------------+ 285 | 3 | (reserved) | 286 +------------+-------------------------+ 287 | 2 | Content_Checksum_Flag | 288 +------------+-------------------------+ 289 | 1-0 | Dictionary_ID_Flag | 290 +------------+-------------------------+ 292 In this table, bit 7 is the highest bit, while bit 0 is the lowest 293 one. 295 3.1.1.1.1.1. Frame_Content_Size_Flag 297 This is a 2-bit flag (equivalent to Frame_Header_Descriptor right- 298 shifted 6 bits) specifying whether Frame_Content_Size (the 299 decompressed data size) is provided within the header. 300 Frame_Content_Size_Flag provides FCS_Field_Size, which is the number 301 of bytes used by Frame_Content_Size according to the following table: 303 +-------------------------+--------+---+---+---+ 304 | Frame_Content_Size_Flag | 0 | 1 | 2 | 3 | 305 +-------------------------+--------+---+---+---+ 306 | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | 307 +-------------------------+--------+---+---+---+ 309 When Frame_Content_Size_Flag is 0, FCS_Field_Size depends on 310 Single_Segment_Flag: If Single_Segment_Flag is set, FCS_Field_Size is 311 1. Otherwise, FCS_Field_Size is 0; Frame_Content_Size is not 312 provided. 314 3.1.1.1.1.2. Single_Segment_Flag 316 If this flag is set, data must be regenerated within a single 317 continuous memory segment. 319 In this case, Window_Descriptor byte is skipped, but 320 Frame_Content_Size is necessarily present. As a consequence, the 321 decoder must allocate a memory segment of size equal or larger than 322 Frame_Content_Size. 324 In order to protect the decoder from unreasonable memory 325 requirements, a decoder is allowed to reject a compressed frame that 326 requests a memory size beyond the decoder's authorized range. 328 For broader compatibility, decoders are recommended to support memory 329 sizes of at least 8 MB. This is only a recommendation; each decoder 330 is free to support higher or lower limits, depending on local 331 limitations. 333 3.1.1.1.1.3. Unused Bit 335 A decoder compliant with this specification version shall not 336 interpret this bit. It might be used in a future version, to signal 337 a property that is not mandatory to properly decode the frame. An 338 encoder compliant with this specification must set this bit to zero. 340 3.1.1.1.1.4. Reserved Bit 342 This bit is reserved for some future feature. Its value must be 343 zero. A decoder compliant with this specification version must 344 ensure it is not set. This bit may be used in a future revision, to 345 signal a feature that must be interpreted to decode the frame 346 correctly. 348 3.1.1.1.1.5. Content_Checksum_Flag 350 If this flag is set, a 32-bit Content_Checksum will be present at the 351 frame's end. See the description of Content_Checksum above. 353 3.1.1.1.1.6. Dictionary_ID_Flag 355 This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating 356 whether a dictionary ID is provided within the header. It also 357 specifies the size of this field as DID_Field_Size: 359 +--------------------+---+---+---+---+ 360 | Dictionary_ID_Flag | 0 | 1 | 2 | 3 | 361 +--------------------+---+---+---+---+ 362 | DID_Field_Size | 0 | 1 | 2 | 4 | 363 +--------------------+---+---+---+---+ 365 3.1.1.1.2. Window Descriptor 367 This provides guarantees about the minimum memory buffer required to 368 decompress a frame. This information is important for decoders to 369 allocate enough memory. 371 The Window_Descriptor byte is optional. When Single_Segment_Flag is 372 set, Window_Descriptor is not present. In this case, Window_Size is 373 Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16 374 ExaBytes). 376 +------------+----------+----------+ 377 | Bit Number | 7-3 | 2-0 | 378 +------------+----------+----------+ 379 | Field Name | Exponent | Mantissa | 380 +------------+----------+----------+ 382 The minimum memory buffer size is called Window_Size. It is 383 described by the following formulae: 385 windowLog = 10 + Exponent; 386 windowBase = 1 << windowLog; 387 windowAdd = (windowBase / 8) * Mantissa; 388 Window_Size = windowBase + windowAdd; 390 The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41) 391 + 7*(1<<38) bytes, which is 3.75 TB. 393 In general, larger Window_Size values tend to improve the compression 394 ratio, but at the cost of increased memory usage. 396 To properly decode compressed data, a decoder will need to allocate a 397 buffer of at least Window_Size bytes. 399 In order to protect decoders from unreasonable memory requirements, a 400 decoder is allowed to reject a compressed frame that requests a 401 memory size beyond decoder's authorized range. 403 For improved interoperability, it's recommended for decoders to 404 support values of Window_Size up to 8 MB and for encoders not to 405 generate frames requiring a Window_Size larger than 8 MB. It's 406 merely a recommendation though, and decoders are free to support 407 larger or lower limits, depending on local limitations. 409 3.1.1.1.3. Dictionary_ID 411 This is a variable size field, which contains the ID of the 412 dictionary required to properly decode the frame. This field is 413 optional. When it's not present, it's up to the decoder to know 414 which dictionary to use. 416 Dictionary_ID field size is provided by DID_Field_Size. 417 DID_Field_Size is directly derived from the value of 418 Dictionary_ID_Flag. One byte can represent an ID 0-255; 2 bytes can 419 represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295. 421 Format is little-endian. 423 It is permitted to represent a small ID (for example, 13) with a 424 large 4-byte dictionary ID, even if it is less efficient. 426 Within private environments, any dictionary ID can be used. However, 427 for frames and dictionaries distributed in public space, 428 Dictionary_ID must be attributed carefully. The following ranges are 429 reserved for use only with dictionaries that have been registered 430 with IANA (see Section 7.4): 431 low range: <= 32767 432 high range: >= (1 << 31) 434 Any other value for Dictionary_ID can be used by private arrangement 435 between participants. 437 Any payload presented for decompression that references an 438 unregistered reserved dictionary ID results in an error. 440 3.1.1.1.4. Frame Content Size 442 This is the original (uncompressed) size. This information is 443 optional. Frame_Content_Size uses a variable number of bytes, 444 provided by FCS_Field_Size. FCS_Field_Size is provided by the value 445 of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not 446 present), 1, 2, 4, or 8 bytes. 448 +----------------+--------------+ 449 | FCS Field Size | Range | 450 +----------------+--------------+ 451 | 0 | unknown | 452 +----------------+--------------+ 453 | 1 | 0 - 255 | 454 +----------------+--------------+ 455 | 2 | 256 - 65791 | 456 +----------------+--------------+ 457 | 4 | 0 - 2^32 - 1 | 458 +----------------+--------------+ 459 | 8 | 0 - 2^64 - 1 | 460 +----------------+--------------+ 462 Frame_Content_Size format is little-endian. When FCS_Field_Size is 463 1, 4, or 8 bytes, the value is read directly. When FCS_Field_Size is 464 2, the offset of 256 is added. It's allowed to represent a small 465 size (for example 18) using any compatible variant. 467 3.1.1.2. Blocks 469 After Magic_Number and Frame_Header, there are some number of blocks. 470 Each frame must have at least 1 block, but there is no upper limit on 471 the number of blocks per frame. 473 The structure of a block is as follows: 475 +--------------+---------------+ 476 | Block_Header | Block_Content | 477 +--------------+---------------+ 478 | 3 bytes | n bytes | 479 +--------------+---------------+ 481 Block_Header uses 3 bytes, written using little-endian convention. 482 It contains three fields: 484 +------------+------------+------------+ 485 | Last_Block | Block_Type | Block_Size | 486 +------------+------------+------------+ 487 | bit 0 | bits 1-2 | bits 3-23 | 488 +------------+------------+------------+ 490 3.1.1.2.1. Last_Block 492 The lowest bit (Last_Block) signals whether this block is the last 493 one. The frame will end after this last block. It may be followed 494 by an optional Content_Checksum (see Section 3.1.1). 496 3.1.1.2.2. Block_Type 498 The next 2 bits represent the Block_Type. There are four block 499 types: 501 +-----------+------------------+ 502 | Value | Block_Type | 503 +-----------+------------------+ 504 | 0 | Raw_Block | 505 +-----------+------------------+ 506 | 1 | RLE_Block | 507 +-----------+------------------+ 508 | 2 | Compressed_Block | 509 +-----------+------------------+ 510 | 3 | Reserved | 511 +-----------+------------------+ 513 Raw_Block: This is an uncompressed block. Block_Content contains 514 Block_Size bytes. 516 RLE_Block: This is a single byte, repeated Block_Size times. 517 Block_Content consists of a single byte. On the decompression 518 side, this byte must be repeated Block_Size times. 520 Compressed_Block: This is a compressed block as described in 521 Section 3.1.1.3. Block_Size is the length of Block_Content, 522 namely the compressed data. The decompressed size is not known, 523 but its maximum possible value is guaranteed (see below). 525 Reserved: This is not a block. This value cannot be used with the 526 current specification. If such a value is present, it is 527 considered to be corrupt data, and a compliant decoder must reject 528 it. 530 3.1.1.2.3. Block_Size 532 The upper 21 bits of Block_Header represent the Block_Size. 534 When Block_Type is Compressed_Block or Raw_Block, Block_Size is the 535 size of Block_Content (hence excluding Block_Header). 537 When Block_Type is RLE_Block, since Block_Content's size is always 1, 538 Block_Size represents the number of times this byte must be repeated. 540 Block_Size is limited by Block_Maximum_Size (see below). 542 3.1.1.2.4. Block_Content and Block_Maximum_Size 544 The size of Block_Content is limited by Block_Maximum_Size, which is 545 the smallest of: 547 o Window_Size 549 o 128 KB 551 Block_Maximum_Size is constant for a given frame. This maximum is 552 applicable to both the decompressed size and the compressed size of 553 any block in the frame. 555 The reasoning for this limit is that a decoder can read this 556 information at the beginning of a frame and use it to allocate 557 buffers. The guarantees on the size of blocks ensure that the 558 buffers will be large enough for any following block of the valid 559 frame. 561 If the compressed block is larger than the uncompressed one, sending 562 the uncompressed block (i.e., a Raw_Block) is recommended instead. 564 3.1.1.3. Compressed Blocks 566 To decompress a compressed block, the compressed size must be 567 provided from the Block_Size field within Block_Header. 569 A compressed block consists of two sections: a Literals Section 570 (Section 3.1.1.3.1) and a Sequences_Section (Section 3.1.1.3.2). The 571 results of the two sections are then combined to produce the 572 decompressed data in Sequence Execution (Section 3.1.1.4). 574 To decode a compressed block, the following elements are necessary: 576 o Previous decoded data, up to a distance of Window_Size, or the 577 beginning of the Frame, whichever is smaller. Single_Segment_Flag 578 will be set in the latter case. 580 o List of "recent offsets" from the previous Compressed_Block. 582 o The previous Huffman tree, required by Treeless_Literals_Block 583 type. 585 o Previous Finite State Entropy (FSE) decoding tables, required by 586 Repeat_Mode, for each symbol type (literals lengths, match 587 lengths, offsets). 589 Note that decoding tables are not always from the previous 590 Compressed_Block: 592 o Every decoding table can come from a dictionary. 594 o The Huffman tree comes from the previous 595 Compressed_Literals_Block. 597 3.1.1.3.1. Literals_Section_Header 599 All literals are regrouped in the first part of the block. They can 600 be decoded first and then copied during Sequence Execution (see 601 Section 3.1.1.4), or they can be decoded on the flow during Sequence 602 Execution. 604 Literals can be stored uncompressed or compressed using Huffman 605 prefix codes. When compressed, an optional tree description can be 606 present, followed by 1 or 4 streams. 608 +----------------------------+ 609 | Literals_Section_Header | 610 +----------------------------+ 611 | [Huffman_Tree_Description] | 612 +----------------------------+ 613 | [Jump_Table] | 614 +----------------------------+ 615 | Stream_1 | 616 +----------------------------+ 617 | [Stream_2] | 618 +----------------------------+ 619 | [Stream_3] | 620 +----------------------------+ 621 | [Stream_4] | 622 +----------------------------+ 624 3.1.1.3.1.1. Literals_Section_Header 626 This field describes how literals are packed. It's a byte-aligned 627 variable-size bit field, ranging from 1 to 5 bytes, using little- 628 endian convention. 630 +---------------------+-----------+ 631 | Literals_Block_Type | 2 bits | 632 +---------------------+-----------+ 633 | Size_Format | 1-2 bits | 634 +---------------------+-----------+ 635 | Regenerated_Size | 5-20 bits | 636 +---------------------+-----------+ 637 | [Compressed_Size] | 0-18 bits | 638 +---------------------+-----------+ 640 In this representation, bits at the top are the lowest bits. 642 The Literals_Block_Type field uses the two lowest bits of the first 643 byte, describing four different block types: 645 +---------------------------+-------+ 646 | Literals_Block_Type | Value | 647 +---------------------------+-------+ 648 | Raw_Literals_Block | 0 | 649 +---------------------------+-------+ 650 | RLE_Literals_Block | 1 | 651 +---------------------------+-------+ 652 | Compressed_Literals_Block | 2 | 653 +---------------------------+-------+ 654 | Treeless_Literals_Block | 3 | 655 +---------------------------+-------+ 657 Raw_Literals_Block: Literals are stored uncompressed. 658 Literals_Section_Content is Regenerated_Size. 660 RLE_Literals_Block: Literals consist of a single-byte value repeated 661 Regenerated_Size times. Literals_Section_Content is 1. 663 Compressed_Literals_Block: This is a standard Huffman-compressed 664 block, starting with a Huffman tree description. See details 665 below. Literals_Section_Content is Compressed_Size. 667 Treeless_Literals_Block: This is a Huffman-compressed block, using 668 the Huffman tree from the previous Compressed_Literals_Block, or a 669 dictionary if there is no previous Huffman-compressed literals 670 block. Huffman_Tree_Description will be skipped. Note that if 671 this mode is triggered without any previous Huffman-table in the 672 frame (or dictionary, per Section 5), it should be treated as data 673 corruption. Literals_Section_Content is Compressed_Size. 675 The Size_Format is divided into two families: 677 o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary 678 to decode Regenerated_Size. There is no Compressed_Size field. 680 o For Compressed_Block and Treeless_Literals_Block, it's required to 681 decode both Compressed_Size and Regenerated_Size (the decompressed 682 size). It's also necessary to decode the number of streams (1 or 683 4). 685 For values spanning several bytes, the convention is little endian. 687 Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2 688 bits. Its value is (Literals_Section_Header[0]>>2) & 0x3. 690 Size_Format == 00 or 10: Size_Format uses 1 bit. Regenerated_Size 691 uses 5 bits (value 0-31). Literals_Section_Header uses 1 byte. 692 Regenerated_Size = Literal_Section_Header[0]>>3. 694 Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses 695 12 bits (values 0-4095). Literals_Section_Header uses 2 bytes. 696 Regenerated_Size = (Literals_Section_Header[0]>>4) + 697 (Literals_Section_Header[1]<<4). 699 Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses 700 20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes. 701 Regenerated_Size = (Literals_Section_Header[0]>>4) + 702 (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12) 704 Only Stream_1 is present for these cases. Note that it is permitted 705 to represent a short value (for example, 13) using a long format, 706 even if it's less efficient. 708 Size_Format for Compressed_Literals_Block and Treeless_Literals_Block 709 always uses 2 bits. 711 Size_Format == 00: A single stream. Both Regenerated_Size and 712 Compressed_Size use 10 bits (values 0-1023). 713 Literals_Section_Header uses 3 bytes. 715 Size_Format == 01: 4 streams. Both Regenerated_Size and 716 Compressed_Size use 10 bits (values 0-1023). 717 Literals_Section_Header uses 3 bytes. 719 Size_Format == 10: 4 streams. Both Regenerated_Size and 720 Compressed_Size use 14 bits (values 0-16383). 721 Literals_Section_Header uses 4 bytes. 723 Size_Format == 11: 4 streams. Both Regenerated_Size and 724 Compressed_Size use 18 bits (values 0-262143). 725 Literals_Section_Header uses 5 bytes. 727 Both the Compressed_Size and Regenerated_Size fields follow little- 728 endian convention. Note that Compressed_Size includes the size of 729 the Huffman_Tree_Description when it is present. 731 3.1.1.3.1.2. Raw_Literals_Block 733 The data in Stream_1 is Regenerated_Size bytes long. It contains the 734 raw literals data to be used during Sequence Execution 735 (Section 3.1.1.3.2). 737 3.1.1.3.1.3. RLE_Literals_Block 739 Stream_1 consists of a single byte that should be repeated 740 Regenerated_Size times to generate the decoded literals. 742 3.1.1.3.1.4. Compressed_Literals_Block and Treeless_Literals_Block 744 Both of these modes contain Huffman-encoded data. For 745 Treeless_Literals_Block, the Huffman table comes from the previously 746 compressed literals block, or from a dictionary; see Section 5. 748 3.1.1.3.1.5. Huffman_Tree_Description 750 This section is only present when the Literals_Block_Type type is 751 Compressed_Literals_Block (2). The format of 752 Huffman_Tree_Description can be found in Section 4.2.1. The size of 753 Huffman_Tree_Description is determined during the decoding process. 754 It must be used to determine where streams begin. 756 Total_Streams_Size = Compressed_Size 757 - Huffman_Tree_Description_Size 759 3.1.1.3.1.6. Jump_Table 761 The Jump_Table is only present when there are 4 Huffman-coded 762 streams. 764 (Reminder: Huffman-compressed data consists of either 1 or 4 Huffman- 765 coded streams.) 767 If only 1 stream is present, it is a single bitstream occupying the 768 entire remaining portion of the literals block, encoded as described 769 within Section 4.2.2. 771 If there are 4 streams, Literals_Section_Header only provides enough 772 information to know the decompressed and compressed sizes of all 4 773 streams combined. The decompressed size of each stream is equal to 774 (Regenerated_Size+3)/4, except for the last stream, which may be up 775 to 3 bytes smaller, to reach a total decompressed size as specified 776 in Regenerated_Size. 778 The compressed size of each stream is provided explicitly in the 779 Jump_Table. The Jump_Table is 6 bytes long and consists of three 780 2-byte little-endian fields, describing the compressed sizes of the 781 first 3 streams. Stream4_Size is computed from Total_Streams_Size 782 minus sizes of other streams. 784 Stream4_Size = Total_Streams_Size - 6 785 - Stream1_Size - Stream2_Size 786 - Stream3_Size 788 Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds 789 Total_Streams_Size, the data are considered corrupted. 791 Each of these 4 bitstreams is then decoded independently as a 792 Huffman-Coded stream, as described in Section 4.2.2. 794 3.1.1.3.2. Sequences_Section 796 A compressed block is a succession of sequences. A sequence is a 797 literal copy command, followed by a match copy command. A literal 798 copy command specifies a length. It is the number of bytes to be 799 copied (or extracted) from the Literals Section. A match copy 800 command specifies an offset and a length. 802 When all sequences are decoded, if there are literals left in the 803 literals section, these bytes are added at the end of the block. 805 This is described in more detail in Section 3.1.1.4. 807 The Sequences_Section regroups all symbols required to decode 808 commands. There are three symbol types: literals lengths, offsets, 809 and match lengths. They are encoded together, interleaved, in a 810 single "bitstream". 812 The Sequences_Section starts by a header, followed by optional 813 probability tables for each symbol type, followed by the bitstream. 815 Sequences_Section_Header 816 [Literals_Length_Table] 817 [Offset_Table] 818 [Match_Length_Table] 819 bitStream 821 To decode the Sequences_Section, it's necessary to know its size. 822 This size is deduced from the size of the Literals_Section: 823 Sequences_Section_Size = Block_Size - Literals_Section_Header - 824 Literals_Section_Content 826 3.1.1.3.2.1. Sequences_Section_Header 828 This header consists of two items: 830 o Number_of_Sequences 832 o Symbol_Compression_Modes 834 Number_of_Sequences is a variable size field using between 1 and 3 835 bytes. If the first byte is "byte0": 837 o if (byte0 == 0): there are no sequences. The sequence section 838 stops here. Decompressed content is defined entirely as Literals 839 Section content. The FSE tables used in Repeat_Mode are not 840 updated. 842 o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte. 844 o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) + 845 byte1. Uses 2 bytes. 847 o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) + 848 0x7F00. Uses 3 bytes. 850 Symbol_Compression_Modes is a single byte, defining the compression 851 mode of each symbol type. 853 +-------------+----------------------+ 854 | Bit Number | Field Name | 855 +-------------+----------------------+ 856 | 7-6 | Literal_Lengths_Mode | 857 +-------------+----------------------+ 858 | 5-4 | Offsets_Mode | 859 +-------------+----------------------+ 860 | 3-2 | Match_Lengths_Mode | 861 +-------------+----------------------+ 862 | 1-0 | Reserved | 863 +-------------+----------------------+ 865 The last field, Reserved, must be all zeroes. 867 Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define 868 the Compression_Mode of literals lengths, offsets, and match lengths 869 symbols, respectively. They follow the same enumeration: 871 +-------+---------------------+ 872 | Value | Compression_Mode | 873 +-------+---------------------+ 874 | 0 | Predefined_Mode | 875 +-------+---------------------+ 876 | 1 | RLE_Mode | 877 +-------+---------------------+ 878 | 2 | FSE_Compressed_Mode | 879 +-------+---------------------+ 880 | 3 | Repeat_Mode | 881 +-------+---------------------+ 883 Predefined_Mode: A predefined FSE (see Section 4.1) distribution 884 table is used, as defined in Section 3.1.1.3.2.2. No distribution 885 table will be present. 887 RLE_Mode: The table description consists of a single byte, which 888 contains the symbol's value. This symbol will be used for all 889 sequences. 891 FSE_Compressed_Mode: Standard FSE compression. A distribution table 892 will be present. The format of this distribution table is 893 described in Section 4.1.1. Note that the maximum allowed 894 accuracy log for literals length and match length tables is 9, and 895 the maximum accuracy log for the offsets table is 8. This mode 896 must not be used when only one symbol is present; RLE_Mode should 897 be used instead (although any other mode will work). 899 Repeat_Mode: The table used in the previous Compressed_Block with 900 Number_Of_Sequences > 0 will be used again, or if this is the 901 first block, the table in the dictionary will be used. Note that 902 this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the 903 same symbol will be repeated. It also includes Predefined_Mode, 904 in which case Repeat_Mode will have the same outcome as 905 Predefined_Mode. No distribution table will be present. If this 906 mode is used without any previous sequence table in the frame (or 907 dictionary; see Section 5) to repeat, this should be treated as 908 corruption. 910 3.1.1.3.2.1.1. Sequence Codes for Lengths and Offsets 912 Each symbol is a code in its own context, which specifies Baseline 913 and Number_of_Bits to add. Codes are FSE compressed and interleaved 914 with raw additional bits in the same bitstream. 916 Literals length codes are values ranging from 0 to 35 inclusive. 917 They define lengths from 0 to 131071 bytes. The literals length is 918 equal to the decoded Baseline plus the result of reading 919 Number_of_Bits bits from the bitstream, as a little-endian value. 921 +----------------------+----------+----------------+ 922 | Literals_Length_Code | Baseline | Number_of_Bits | 923 +----------------------+----------+----------------+ 924 | 0-15 | length | 0 | 925 +----------------------+----------+----------------+ 926 | 16 | 16 | 1 | 927 +----------------------+----------+----------------+ 928 | 17 | 18 | 1 | 929 +----------------------+----------+----------------+ 930 | 18 | 20 | 1 | 931 +----------------------+----------+----------------+ 932 | 19 | 22 | 1 | 933 +----------------------+----------+----------------+ 934 | 20 | 24 | 2 | 935 +----------------------+----------+----------------+ 936 | 21 | 28 | 2 | 937 +----------------------+----------+----------------+ 938 | 22 | 32 | 3 | 939 +----------------------+----------+----------------+ 940 | 23 | 40 | 3 | 941 +----------------------+----------+----------------+ 942 | 24 | 48 | 4 | 943 +----------------------+----------+----------------+ 944 | 25 | 64 | 6 | 945 +----------------------+----------+----------------+ 946 | 26 | 128 | 7 | 947 +----------------------+----------+----------------+ 948 | 27 | 256 | 8 | 949 +----------------------+----------+----------------+ 950 | 28 | 512 | 9 | 951 +----------------------+----------+----------------+ 952 | 29 | 1024 | 10 | 953 +----------------------+----------+----------------+ 954 | 30 | 2048 | 11 | 955 +----------------------+----------+----------------+ 956 | 31 | 4096 | 12 | 957 +----------------------+----------+----------------+ 958 | 32 | 8192 | 13 | 959 +----------------------+----------+----------------+ 960 | 33 | 16384 | 14 | 961 +----------------------+----------+----------------+ 962 | 34 | 32768 | 15 | 963 +----------------------+----------+----------------+ 964 | 35 | 65536 | 16 | 965 +----------------------+----------+----------------+ 967 Match length codes are values ranging from 0 to 52 inclusive. They 968 define lengths from 3 to 131074 bytes. The match length is equal to 969 the decoded Baseline plus the result of reading Number_of_Bits bits 970 from the bitstream, as a little-endian value. 972 +-------------------+-----------------------+----------------+ 973 | Match_Length_Code | Baseline | Number_of_Bits | 974 +-------------------+-----------------------+----------------+ 975 | 0-31 | Match_Length_Code + 3 | 0 | 976 +-------------------+-----------------------+----------------+ 977 | 32 | 35 | 1 | 978 +-------------------+-----------------------+----------------+ 979 | 33 | 37 | 1 | 980 +-------------------+-----------------------+----------------+ 981 | 34 | 39 | 1 | 982 +-------------------+-----------------------+----------------+ 983 | 35 | 41 | 1 | 984 +-------------------+-----------------------+----------------+ 985 | 36 | 43 | 2 | 986 +-------------------+-----------------------+----------------+ 987 | 37 | 47 | 2 | 988 +-------------------+-----------------------+----------------+ 989 | 38 | 51 | 3 | 990 +-------------------+-----------------------+----------------+ 991 | 39 | 59 | 3 | 992 +-------------------+-----------------------+----------------+ 993 | 40 | 67 | 4 | 994 +-------------------+-----------------------+----------------+ 995 | 41 | 83 | 4 | 996 +-------------------+-----------------------+----------------+ 997 | 42 | 99 | 5 | 998 +-------------------+-----------------------+----------------+ 999 | 43 | 131 | 7 | 1000 +-------------------+-----------------------+----------------+ 1001 | 44 | 259 | 8 | 1002 +-------------------+-----------------------+----------------+ 1003 | 45 | 515 | 9 | 1004 +-------------------+-----------------------+----------------+ 1005 | 46 | 1027 | 10 | 1006 +-------------------+-----------------------+----------------+ 1007 | 47 | 2051 | 11 | 1008 +-------------------+-----------------------+----------------+ 1009 | 48 | 4099 | 12 | 1010 +-------------------+-----------------------+----------------+ 1011 | 49 | 8195 | 13 | 1012 +-------------------+-----------------------+----------------+ 1013 | 50 | 16387 | 14 | 1014 +-------------------+-----------------------+----------------+ 1015 | 51 | 32771 | 15 | 1016 +-------------------+-----------------------+----------------+ 1017 | 52 | 65539 | 16 | 1018 +-------------------+-----------------------+----------------+ 1020 Offset codes are values ranging from 0 to N. 1022 A decoder is free to limit its maximum supported value for N. Support 1023 for values of at least 22 is recommended. At the time of this 1024 writing, the reference decoder supports a maximum N value of 31. 1026 An offset code is also the number of additional bits to read in 1027 little-endian fashion and can be translated into an Offset_Value 1028 using the following formulas: 1030 Offset_Value = (1 << offsetCode) + readNBits(offsetCode); 1031 if (Offset_Value > 3) Offset = Offset_Value - 3; 1033 This means that maximum Offset_Value is (2^(N+1))-1, supporting back- 1034 reference distance up to (2^(N+1))-4, but it is limited by the 1035 maximum back-reference distance (see Section 3.1.1.1.2). 1037 Offset_Value from 1 to 3 are special: they define "repeat codes". 1038 This is described in more detail in Section 3.1.1.5. 1040 3.1.1.3.2.1.2. Decoding Sequences 1042 FSE bitstreams are read in reverse of the direction they are written. 1043 In zstd, the compressor writes bits forward into a block, and the 1044 decompressor must read the bitstream backwards. 1046 To find the start of the bitstream, it is therefore necessary to know 1047 the offset of the last byte of the block, which can be found by 1048 counting Block_Size bytes after the block header. 1050 After writing the last bit containing information, the compressor 1051 writes a single 1 bit and then fills the rest of the byte with zero 1052 bits. The last byte of the compressed bitstream cannot be zero for 1053 that reason. 1055 When decompressing, the last byte containing the padding is the first 1056 byte to read. The decompressor needs to skip the up to 7 bits of 1057 0-padding as well as the the first 1 bit that occurs. Afterwards, 1058 the useful part of the bitstream begins. 1060 FSE decoding requires a 'state' to be carried from symbol to symbol. 1061 For more explanation on FSE decoding, see Section 4.1. 1063 For sequence decoding, a separate state keeps track of each literal 1064 lengths, offsets, and match lengths symbols. Some FSE primitives are 1065 also used. For more details on the operation of these primitives, 1066 see Section 4.1. 1068 The bitstream starts with initial FSE state values, each using the 1069 required number of bits in their respective accuracy, decoded 1070 previously from their normalized distribution. It starts with 1071 Literals_Length_State, followed by Offset_State, and finally 1072 Match_Length_State. 1074 Note that all values are read backward, so the 'start' of the 1075 bitstream is at the highest position in memory, immediately before 1076 the last 1 bit for padding. 1078 After decoding the starting states, a single sequence is decoded 1079 Number_Of_Sequences times. These sequences are decoded in order from 1080 first to last. Since the compressor writes the bitstream in the 1081 forward direction, this means the compressor must encode the 1082 sequences starting with the last one and ending with the first. 1084 For each of the symbol types, the FSE state can be used to determine 1085 the appropriate code. The code then defines the Baseline and 1086 Number_of_Bits to read for each type. The description of the codes 1087 for how to determine these values can be found in 1088 Section 3.1.1.3.2.1. 1090 Decoding starts by reading the Number_of_Bits required to decode 1091 offset. It does the same for Match_Length and then for 1092 Literals_Length. This sequence is then used for Sequence Execution 1093 (see Section 3.1.1.4). 1095 If it is not the last sequence in the block, the next operation is to 1096 update states. Using the rules pre-calculated in the decoding 1097 tables, Literals_Length_State is updated, followed by 1098 Match_Length_State, and then Offset_State. See Section 4.1 for 1099 details on how to update states from the bitstream. 1101 This operation will be repeated Number_of_Sequences times. At the 1102 end, the bitstream shall be entirely consumed; otherwise, the 1103 bitstream is considered corrupted. 1105 3.1.1.3.2.2. Default Distributions 1107 If Predefined_Mode is selected for a symbol type, its FSE decoding 1108 table is generated from a predefined distribution table defined here. 1109 For details on how to convert this distribution into a decoding 1110 table, see Section 4.1. 1112 3.1.1.3.2.2.1. Literals Length 1114 The decoding table uses an accuracy log of 6 bits (64 states). 1116 short literalsLength_defaultDistribution[36] = 1117 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1118 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 1119 -1,-1,-1,-1 1120 }; 1122 3.1.1.3.2.2.2. Match Length 1124 The decoding table uses an accuracy log of 6 bits (64 states). 1126 short matchLengths_defaultDistribution[53] = 1127 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1128 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1129 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 1130 -1,-1,-1,-1,-1 1131 }; 1133 3.1.1.3.2.2.3. Offset Codes 1135 The decoding table uses an accuracy log of 5 bits (32 states), and 1136 supports a maximum N value of 28, allowing offset values up to 1137 536,870,908. 1139 If any sequence in the compressed block requires a larger offset than 1140 this, it's not possible to use the default distribution to represent 1141 it. 1143 short offsetCodes_defaultDistribution[29] = 1144 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1145 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 1146 }; 1148 3.1.1.4. Sequence Execution 1150 Once literals and sequences have been decoded, they are combined to 1151 produce the decoded content of a block. 1153 Each sequence consists of a tuple of (literals_length, offset_value, 1154 match_length), decoded as described in the Sequences_Section 1155 (Section 3.1.1.3.2). To execute a sequence, first copy 1156 literals_length bytes from the decoded literals to the output. 1158 Then, match_length bytes are copied from previous decoded data. The 1159 offset to copy from is determined by offset_value: 1161 o if Offset_Value > 3, then the offset is Offset_Value - 3; 1162 o if Offset_Value is from 1-3, the offset is a special repeat offset 1163 value. See Section 3.1.1.5 for how the offset is determined in 1164 this case. 1166 The offset is defined as from the current position (after copying the 1167 literals), so an offset of 6 and a match length of 3 means that 3 1168 bytes should be copied from 6 bytes back. Note that all offsets 1169 leading to previously decoded data must be smaller than Window_Size 1170 defined in Frame_Header_Descriptor (Section 3.1.1.1.1). 1172 3.1.1.5. Repeat Offsets 1174 As seen above, the first three values define a repeated offset; we 1175 will call them Repeated_Offset1, Repeated_Offset2, and 1176 Repeated_Offset3. They are sorted in recency order, with 1177 Repeated_Offset1 meaning "most recent one". 1179 If offset_value is 1, then the offset used is Repeated_Offset1, etc. 1181 There is one exception: When the current sequence's literals_length 1182 is 0, repeated offsets are shifted by 1, so an offset_value of 1 1183 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3, 1184 and an offset_value of 3 means Repeated_Offset1 - 1_byte. 1186 For the first block, the starting offset history is populated with 1187 the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and 1188 Repeated_Offset3 (8), unless a dictionary is used, in which case they 1189 come from the dictionary. 1191 Then each block gets its starting offset history from the ending 1192 values of the most recent Compressed_Block. Note that blocks that 1193 are not Compressed_Block are skipped; they do not contribute to 1194 offset history. 1196 During the execution of the sequences of a Compressed_Block, the 1197 Repeated_Offsets' values are kept up to date, so that they always 1198 represent the three most-recently used offsets. In order to achieve 1199 that, they are updated after executing each sequence in the following 1200 way: 1202 When the sequence's offset_value does not refer to one of the 1203 Repeated_Offsets -- when it has value greater than 3, or when it has 1204 value 3 and the sequence's literals_length is zero -- the 1205 Repeated_Offsets' values are shifted back one, and Repeated_Offset1 1206 takes on the value of the just-used offset. 1208 Otherwise, when the sequence's offset_value refers to one of the 1209 Repeated_Offsets -- when it has value 1 or 2, or when it has value 3 1210 and the sequence's literals_length is non-zero -- the 1211 Repeated_Offsets are re-ordered so that Repeated_Offset1 takes on the 1212 value of the used Repeated_Offset, and the existing values are pushed 1213 back from the first Repeated_Offset through to the Repeated_Offset 1214 selected by the offset_value. This effectively performs a single- 1215 stepped wrapping rotation of the values of these offsets, so that 1216 their order again reflects the recency of their use. 1218 The following table shows the values of the Repeated_Offsets as a 1219 series of sequences are applied to them: 1221 Repeated_Offset1 | Repeated_Offset2 1222 literals_length | | | Repeated_Offset3 1223 offset_value | | | | | Comment 1224 -------------|-----|------|------|------|------------------------ 1225 | | 1 | 4 | 8 | starting values 1226 1114 | 11 | 1111 | 1 | 4 | non-repeat 1227 1 | 22 | 1111 | 1 | 4 | repeat 1; no change 1228 2225 | 22 | 2222 | 1111 | 1 | non-repeat 1229 1114 | 111 | 1111 | 2222 | 1111 | non-repeat 1230 3336 | 33 | 3333 | 1111 | 2222 | non-repeat 1231 2 | 22 | 1111 | 3333 | 2222 | repeat 2; swap 1 & 2 1232 3 | 33 | 2222 | 1111 | 3333 | repeat 3; rotate 3 to 1 1233 3 | 0 | 2221 | 2222 | 1111 | insert resolved offset 1234 1 | 0 | 2222 | 2221 | 3333 | repeat 2 1236 3.1.2. Skippable Frames 1238 +--------------+------------+-----------+ 1239 | Magic_Number | Frame_Size | User_Data | 1240 +--------------+------------+-----------+ 1241 | 4 bytes | 4 bytes | n bytes | 1242 +--------------+------------+-----------+ 1244 Skippable frames allow the insertion of user-defined metadata into a 1245 flow of concatenated frames. 1247 Skippable frames defined in this specification are compatible with 1248 skippable frames in [LZ4]. 1250 From a compliant decoder perspective, skippable frames simply need to 1251 be skipped, and their content ignored, resuming decoding after the 1252 skippable frame. 1254 It should be noted that a skippable frame can be used to watermark a 1255 stream of concatenated frames embedding any kind of tracking 1256 information (even just a Universally Unique Identifier (UUID)). 1258 Users wary of such possibility should scan the stream of concatenated 1259 frames in an attempt to detect such frames for analysis or removal. 1261 The fields are: 1263 Magic_Number: 4 bytes, little-endian format. Value: 0x184D2A5?, 1264 which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values 1265 are valid to identify a skippable frame. This specification does 1266 not detail any specific tagging methods for skippable frames. 1268 Frame_Size: This is the size, in bytes, of the following User_Data 1269 (without including the magic number nor the size field itself). 1270 This field is represented using 4 bytes, little-endian format, 1271 unsigned 32 bits. This means User_Data can't be bigger than 1272 (2^32-1) bytes. 1274 User_Data: This field can be anything. Data will just be skipped by 1275 the decoder. 1277 4. Entropy Encoding 1279 Two types of entropy encoding are used by the Zstandard format: FSE 1280 and Huffman coding. Huffman is used to compress literals, while FSE 1281 is used for all other symbols (Literals_Length_Code, 1282 Match_Length_Code, and offset codes) and to compress Huffman headers. 1284 4.1. FSE 1286 FSE, short for Finite State Entropy, is an entropy codec based on 1287 [ANS]. FSE encoding/decoding involves a state that is carried over 1288 between symbols, so decoding must be done in the opposite direction 1289 as encoding. Therefore, all FSE bitstreams are read from end to 1290 beginning. Note that the order of the bits in the stream is not 1291 reversed; they are simply read in the reverse order from which they 1292 were written. 1294 For additional details on FSE, see Finite State Entropy [FSE]. 1296 FSE decoding involves a decoding table that has a power of 2 size and 1297 contains three elements: Symbol, Num_Bits, and Baseline. The base 2 1298 logarithm of the table size is its Accuracy_Log. An FSE state value 1299 represents an index in this table. 1301 To obtain the initial state value, consume Accuracy_Log bits from the 1302 stream as a little-endian value. The next symbol in the stream is 1303 the Symbol indicated in the table for that state. To obtain the next 1304 state value, the decoder should consume Num_Bits bits from the stream 1305 as a little-endian value and add it to Baseline. 1307 4.1.1. FSE Table Description 1309 To decode FSE streams, it is necessary to construct the decoding 1310 table. The Zstandard format encodes FSE table descriptions as 1311 described here. 1313 An FSE distribution table describes the probabilities of all symbols 1314 from 0 to the last present one (included) on a normalized scale of 1315 (1 << Accuracy_Log). Note that there must be two or more symbols 1316 with non-zero probability. 1318 A bitstream is read forward, in little-endian fashion. It is not 1319 necessary to know its exact size, since the size will be discovered 1320 and reported by the decoding process. The bitstream starts by 1321 reporting on which scale it operates. If low4bits designates the 1322 lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5. 1324 This is followed by each symbol value, from 0 to the last present 1325 one. The number of bits used by each field is variable and depends 1326 on: 1328 Remaining probabilities + 1: For example, presuming an Accuracy_Log 1329 of 8, and presuming 100 probabilities points have already been 1330 distributed, the decoder may read any value from 0 to (256 - 100 + 1331 1) == 157, inclusive. Therefore, it must read log2sup(157) == 8 1332 bits. 1334 Value decoded: Small values use 1 fewer bit. For example, presuming 1335 values from 0 to 157 (inclusive) are possible, 255 - 157 = 98 1336 values are remaining in an 8-bit field. The first 98 values 1337 (hence from 0 to 97) use only 7 bits, and values from 98 to 157 1338 use 8 bits. This is achieved through this scheme: 1340 +------------+---------------+-----------+ 1341 | Value Read | Value Decoded | Bits Used | 1342 +------------+---------------+-----------+ 1343 | 0 - 97 | 0 - 97 | 7 | 1344 +------------+---------------+-----------+ 1345 | 98 - 127 | 98 - 127 | 8 | 1346 +------------+---------------+-----------+ 1347 | 128 - 225 | 0 - 97 | 7 | 1348 +------------+---------------+-----------+ 1349 | 226 - 255 | 128 - 157 | 8 | 1350 +------------+---------------+-----------+ 1352 Symbol probabilities are read one by one, in order. The probability 1353 is obtained from Value decoded using the formula P = Value - 1. This 1354 means the value 0 becomes the negative probability -1. This is a 1355 special probability that means "less than 1". Its effect on the 1356 distribution table is described below. For the purpose of 1357 calculating total allocated probability points, it counts as 1. 1359 When a symbol has a probability of zero, it is followed by a 2-bit 1360 repeat flag. This repeat flag tells how many probabilities of zeroes 1361 follow the current one. It provides a number ranging from 0 to 3. 1362 If it is a 3, another 2-bit repeat flag follows, and so on. 1364 When the last symbol reaches a cumulated total of 1365 (1 << Accuracy_Log), decoding is complete. If the last symbol makes 1366 the cumulated total go above (1 << Accuracy_Log), distribution is 1367 considered corrupted. 1369 Finally, the decoder can tell how many bytes were used in this 1370 process and how many symbols are present. The bitstream consumes a 1371 round number of bytes. Any remaining bit within the last byte is 1372 simply unused. 1374 The context in which the table is to be used specifies an expected 1375 number of symbols. That expected number of symbols never exceeds 1376 256. If the number of symbols decoded is not equal to the expected, 1377 the header should be considered corrupt. 1379 The distribution of normalized probabilities is enough to create a 1380 unique decoding table. The table has a size of (1 << Accuracy_Log). 1381 Each cell describes the symbol decoded and instructions to get the 1382 next state. 1384 Symbols are scanned in their natural order for "less than 1" 1385 probabilities as described above. Symbols with this probability are 1386 being attributed a single cell, starting from the end of the table 1387 and retreating. These symbols define a full state reset, reading 1388 Accuracy_Log bits. 1390 All remaining symbols are allocated in their natural order. Starting 1391 from symbol 0 and table position 0, each symbol gets allocated as 1392 many cells as its probability. Cell allocation is spread, not 1393 linear; each successor position follows this rule: 1395 position += (tableSize >> 1) + (tableSize >> 3) + 3; 1396 position &= tableSize - 1; 1398 A position is skipped if it is already occupied by a "less than 1" 1399 probability symbol. Position does not reset between symbols; it 1400 simply iterates through each position in the table, switching to the 1401 next symbol when enough states have been allocated to the current 1402 one. 1404 The result is a list of state values. Each state will decode the 1405 current symbol. 1407 To get the Number_of_Bits and Baseline required for the next state, 1408 it is first necessary to sort all states in their natural order. The 1409 lower states will need 1 more bit than higher ones. The process is 1410 repeated for each symbol. 1412 For example, presuming a symbol has a probability of 5, it receives 1413 five state values. States are sorted in natural order. The next 1414 power of 2 is 8. The space of probabilities is divided into 8 equal 1415 parts. Presuming the Accuracy_Log is 7, this defines 128 states, and 1416 each share (divided by 8) is 16 in size. In order to reach 8, 8 - 5 1417 = 3 lowest states will count "double", doubling the number of shares 1418 (32 in width), requiring 1 more bit in the process. 1420 Baseline is assigned starting from the higher states using fewer 1421 bits, and proceeding naturally, then resuming at the first state, 1422 each taking its allocated width from Baseline. 1424 +----------------+-------+-------+--------+------+-------+ 1425 | state order | 0 | 1 | 2 | 3 | 4 | 1426 +----------------+-------+-------+--------+------+-------+ 1427 | width | 32 | 32 | 32 | 16 | 16 | 1428 +----------------+-------+-------+--------+------+-------+ 1429 | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | 1430 +----------------+-------+-------+--------+------+-------+ 1431 | range number | 2 | 4 | 6 | 0 | 1 | 1432 +----------------+-------+-------+--------+------+-------+ 1433 | Baseline | 32 | 64 | 96 | 0 | 16 | 1434 +----------------+-------+-------+--------+------+-------+ 1435 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1436 +----------------+-------+-------+--------+------+-------+ 1438 The next state is determined from the current state by reading the 1439 required Number_of_Bits and adding the specified Baseline. 1441 See Appendix A for the results of this process that are applied to 1442 the default distributions. 1444 4.2. Huffman Coding 1446 Zstandard Huffman-coded streams are read backwards, similar to the 1447 FSE bitstreams. Therefore, to find the start of the bitstream, it is 1448 necessary to know the offset of the last byte of the Huffman-coded 1449 stream. 1451 After writing the last bit containing information, the compressor 1452 writes a single 1 bit and then fills the rest of the byte with 0 1453 bits. The last byte of the compressed bitstream cannot be 0 for that 1454 reason. 1456 When decompressing, the last byte containing the padding is the first 1457 byte to read. The decompressor needs to skip the up to 7 bits of 1458 0-padding as well as the first 1 bit that occurs. Afterwards, the 1459 useful part of the bitstream begins. 1461 The bitstream contains Huffman-coded symbols in little-endian order, 1462 with the codes defined by the method below. 1464 4.2.1. Huffman Tree Description 1466 Prefix coding represents symbols from an a priori known alphabet by 1467 bit sequences (codewords), one codeword for each symbol, in a manner 1468 such that different symbols may be represented by bit sequences of 1469 different lengths, but a parser can always parse an encoded string 1470 unambiguously symbol by symbol. 1472 Given an alphabet with known symbol frequencies, the Huffman 1473 algorithm allows the construction of an optimal prefix code using the 1474 fewest bits of any possible prefix codes for that alphabet. 1476 The prefix code must not exceed a maximum code length. More bits 1477 improve accuracy but yield a larger header size and require more 1478 memory or more complex decoding operations. This specification 1479 limits the maximum code length to 11 bits. 1481 All literal values from zero (included) to the last present one 1482 (excluded) are represented by Weight with values from 0 to 1483 Max_Number_of_Bits. Transformation from Weight to Number_of_Bits 1484 follows this pseudocode: 1486 if Weight == 0 1487 Number_of_Bits = 0 1488 else 1489 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1491 The last symbol's Weight is deduced from previously decoded ones, by 1492 completing to the nearest power of 2. This power of 2 gives 1493 Max_Number_of_Bits the depth of the current tree. 1495 For example, presume the following Huffman tree must be described: 1497 +---------------+----------------+ 1498 | Literal Value | Number_of_Bits | 1499 +---------------+----------------+ 1500 | 0 | 1 | 1501 +---------------+----------------+ 1502 | 1 | 2 | 1503 +---------------+----------------+ 1504 | 2 | 3 | 1505 +---------------+----------------+ 1506 | 3 | 0 | 1507 +---------------+----------------+ 1508 | 4 | 4 | 1509 +---------------+----------------+ 1510 | 5 | 4 | 1511 +---------------+----------------+ 1513 The tree depth is 4, since its longest element uses 4 bits. (The 1514 longest elements are those with the smallest frequencies.) Value 5 1515 will not be listed as it can be determined from the values for 0-4, 1516 nor will values above 5 as they are all 0. Values from 0 to 4 will 1517 be listed using Weight instead of Number_of_Bits. The pseudocode to 1518 determine Weight is: 1520 if Number_of_Bits == 0 1521 Weight = 0 1522 else 1523 Weight = Max_Number_of_Bits + 1 - Number_of_Bits 1525 It gives the following series of weights: 1527 +---------------+--------+ 1528 | Literal Value | Weight | 1529 +---------------+--------+ 1530 | 0 | 4 | 1531 +---------------+--------+ 1532 | 1 | 3 | 1533 +---------------+--------+ 1534 | 2 | 2 | 1535 +---------------+--------+ 1536 | 3 | 0 | 1537 +---------------+--------+ 1538 | 4 | 1 | 1539 +---------------+--------+ 1541 The decoder will do the inverse operation: having collected weights 1542 of literals from 0 to 4, it knows the last literal, 5, is present 1543 with a non-zero Weight. The Weight of 5 can be determined by 1544 advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 1545 0's) is 15. The nearest power of 2 is 16. Therefore, 1546 Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1. 1548 4.2.1.1. Huffman Tree Header 1550 This is a single byte value (0-255), which describes how the series 1551 of weights is encoded. 1553 headerByte < 128: The series of weights is compressed using FSE (see 1554 below). The length of the FSE-compressed series is equal to 1555 headerByte (0-127). 1557 headerByte >= 128: This is a direct representation, where each 1558 Weight is written directly as a 4-bit field (0-15). They are 1559 encoded forward, 2 weights to a byte with the first weight taking 1560 the top 4 bits and the second taking the bottom 4; for example, 1561 the following operations could be used to read the weights: 1563 Weight[0] = (Byte[0] >> 4) 1564 Weight[1] = (Byte[0] & 0xf), 1565 etc. 1567 The full representation occupies ceiling(Number_of_Symbols/2) 1568 bytes, meaning it uses only full bytes even if Number_of_Symbols 1569 is odd. Number_of_Symbols = headerByte - 127. Note that maximum 1570 Number_of_Symbols is 255 - 127 = 128. If any literal has a value 1571 over 128, raw header mode is not possible, and it is necessary to 1572 use FSE compression. 1574 4.2.1.2. FSE Compression of Huffman Weights 1576 In this case, the series of Huffman weights is compressed using FSE 1577 compression. It is a single bitstream with two interleaved states, 1578 sharing a single distribution table. 1580 To decode an FSE bitstream, it is necessary to know its compressed 1581 size. Compressed size is provided by headerByte. It's also 1582 necessary to know its maximum possible decompressed size, which is 1583 255, since literal values span from 0 to 255, and the last symbol's 1584 Weight is not represented. 1586 An FSE bitstream starts by a header, describing probabilities 1587 distribution. It will create a decoding table. For a list of 1588 Huffman weights, the maximum accuracy log is 6 bits. For more 1589 details, see Section 4.1.1. 1591 The Huffman header compression uses two states, which share the same 1592 FSE distribution table. The first state (State1) encodes the even- 1593 numbered index symbols, and the second (State2) encodes the odd- 1594 numbered index symbols. State1 is initialized first, and then 1595 State2, and they take turns decoding a single symbol and updating 1596 their state. For more details on these FSE operations, see 1597 Section 4.1. 1599 The number of symbols to be decoded is determined by tracking the 1600 bitStream overflow condition: If updating state after decoding a 1601 symbol would require more bits than remain in the stream, it is 1602 assumed that extra bits are zero. Then, symbols for each of the 1603 final states are decoded and the process is complete. 1605 4.2.1.3. Conversion from Weights to Huffman Prefix Codes 1607 All present symbols will now have a Weight value. It is possible to 1608 transform weights into Number_of_Bits, using this formula: 1610 if Weight > 0 1611 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1612 else 1613 Number_of_Bits = 0 1615 Symbols are sorted by Weight. Within the same Weight, symbols keep 1616 natural sequential order. Symbols with a Weight of zero are removed. 1617 Then, starting from the lowest Weight, prefix codes are distributed 1618 in sequential order. 1620 For example, assume the following list of weights has been decoded: 1622 +---------+--------+ 1623 | Literal | Weight | 1624 +---------+--------+ 1625 | 0 | 4 | 1626 +---------+--------+ 1627 | 1 | 3 | 1628 +---------+--------+ 1629 | 2 | 2 | 1630 +---------+--------+ 1631 | 3 | 0 | 1632 +---------+--------+ 1633 | 4 | 1 | 1634 +---------+--------+ 1635 | 5 | 1 | 1636 +---------+--------+ 1638 Sorting by weight and then the natural sequential order yields the 1639 following distribution: 1641 +---------+--------+----------------+--------------+ 1642 | Literal | Weight | Number_Of_Bits | Prefix Codes | 1643 +---------+--------+----------------|--------------+ 1644 | 3 | 0 | 0 | N/A | 1645 +---------+--------+----------------|--------------+ 1646 | 4 | 1 | 4 | 0000 | 1647 +---------+--------+----------------|--------------+ 1648 | 5 | 1 | 4 | 0001 | 1649 +---------+--------+----------------|--------------+ 1650 | 2 | 2 | 3 | 001 | 1651 +---------+--------+----------------|--------------+ 1652 | 1 | 3 | 2 | 01 | 1653 +---------+--------+----------------|--------------+ 1654 | 0 | 4 | 1 | 1 | 1655 +---------+--------+----------------|--------------+ 1657 4.2.2. Huffman-Coded Streams 1659 Given a Huffman decoding table, it is possible to decode a Huffman- 1660 coded stream. 1662 Each bitstream must be read backward, which starts from the end and 1663 goes up to the beginning. Therefore, it is necessary to know the 1664 size of each bitstream. 1666 It is also necessary to know exactly which bit is the last. This is 1667 detected by a final bit flag: the highest bit of the last byte is a 1668 final-bit-flag. Consequently, a last byte of 0 is not possible. And 1669 the final-bit-flag itself is not part of the useful bitstream. 1670 Hence, the last byte contains between 0 and 7 useful bits. 1672 Starting from the end, it is possible to read the bitstream in a 1673 little-endian fashion, keeping track of already used bits. Since the 1674 bitstream is encoded in reverse order, starting from the end, read 1675 symbols in forward order. 1677 For example, if the literal sequence "0145" was encoded using the 1678 above prefix code, it would be encoded (in reverse order) as: 1680 +---------+----------+ 1681 | Symbol | Encoding | 1682 +---------+----------+ 1683 | 5 | 0000 | 1684 +---------+----------+ 1685 | 4 | 0001 | 1686 +---------+----------+ 1687 | 1 | 01 | 1688 +---------+----------+ 1689 | 0 | 1 | 1690 +---------+----------+ 1691 | Padding | 00001 | 1692 +---------+----------+ 1694 This results in the following 2-byte bitstream: 1696 00010000 00001101 1698 Here is an alternative representation with the symbol codes separated 1699 by underscores: 1701 0001_0000 00001_1_01 1703 Reading the highest Max_Number_of_Bits bits, it's possible to compare 1704 the extracted value to the decoding table, determining the symbol to 1705 decode and number of bits to discard. 1707 The process continues reading up to the required number of symbols 1708 per stream. If a bitstream is not entirely and exactly consumed, 1709 hence reaching exactly its beginning position with all bits consumed, 1710 the decoding process is considered faulty. 1712 5. Dictionary Format 1714 Zstandard is compatible with "raw content" dictionaries, free of any 1715 format restriction, except that they must be at least 8 bytes. These 1716 dictionaries function as if they were just the content part of a 1717 formatted dictionary. 1719 However, dictionaries created by "zstd --train" in the reference 1720 implementation follow a specific format, described here. 1722 Dictionaries are not included in the compressed content but rather 1723 are provided out of band. That is, the Dictionary_ID identifies 1724 which should be used, but this specification does not describe the 1725 mechanism by which the dictionary is obtained prior to use during 1726 compression or decompression. 1728 A dictionary has a size, defined either by a buffer limit or a file 1729 size. The general format is: 1731 +--------------+---------------+----------------+---------+ 1732 | Magic_Number | Dictionary_ID | Entropy_Tables | Content | 1733 +--------------+---------------+----------------+---------+ 1735 Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format. 1737 Dictionary_ID: 4 bytes, stored in little-endian format. 1738 Dictionary_ID can be any value, except 0 (which means no 1739 Dictionary_ID). It is used by decoders to check if they use the 1740 correct dictionary. If the frame is going to be distributed in a 1741 private environment, any Dictionary_ID can be used. However, for 1742 public distribution of compressed frames, the following ranges are 1743 reserved and shall not be used: 1744 low range: <= 32767 1745 high range: >= (2^31) 1747 Entropy_Tables: Follow the same format as the tables in compressed 1748 blocks. See the relevant FSE and Huffman sections for how to 1749 decode these tables. They are stored in the following order: 1750 Huffman table for literals, FSE table for offsets, FSE table for 1751 match lengths, and FSE table for literals lengths. These tables 1752 populate the Repeat Stats literals mode and Repeat distribution 1753 mode for sequence decoding. It is finally followed by 3 offset 1754 values, populating repeat offsets (instead of using {1,4,8}), 1755 stored in order, 4-bytes little-endian each, for a total of 12 1756 bytes. Each repeat offset must have a value less than the 1757 dictionary size. 1759 Content: The rest of the dictionary is its content. The content 1760 acts as a "past" in front of data to be compressed or 1761 decompressed, so it can be referenced in sequence commands. As 1762 long as the amount of data decoded from this frame is less than or 1763 equal to Window_Size, sequence commands may specify offsets longer 1764 than the total length of decoded output so far to reference back 1765 to the dictionary, even parts of the dictionary with offsets 1766 larger than Window_Size. After the total output has surpassed 1767 Window_Size, however, this is no longer allowed, and the 1768 dictionary is no longer accessible. 1770 6. Use of Dictionaries 1772 Provisioning for use of dictionaries with zstd is being explored. 1773 See, for example, [DICT-SEC]. The likely outcome will be a registry 1774 of well-tested dictionaries optimized for different use cases and 1775 identifiers for each, possibly with a private negotiation mechanism 1776 for use of unregistered dictionaries. 1778 To ensure compatibility with the future specification of use of 1779 dictionaries with zstd payloads, especially with MIME, content 1780 encoded with the media type registered here should not use a 1781 dictionary. The exception to this requirement might be a private 1782 dictionary negotiaton, suggested above, which is not part of this 1783 specification. 1785 7. IANA Considerations 1787 IANA has updated two previously existing registrations and made one 1788 new registration as described below. 1790 7.1. The 'application/zstd' Media Type 1792 The 'application/zstd' media type identifies a block of data that is 1793 compressed using zstd compression. The data is a stream of bytes as 1794 described in this document. IANA has added the following to the 1795 "Media Types" registry: 1797 Type name: application 1799 Subtype name: zstd 1801 Required parameters: N/A 1803 Optional parameters: N/A 1805 Encoding considerations: binary 1807 Security considerations: See Section 8 of [this document] 1809 Interoperability considerations: N/A 1811 Published specification: [this document] 1813 Applications that use this media type: anywhere data size is an 1814 issue 1816 Fragment identifier considerations: No fragment identifiers are 1817 defined for this type. 1819 Additional information: 1821 Magic number(s): 4 bytes, little-endian format. Value: 1822 0xFD2FB528 1824 File extension(s): zst 1826 Macintosh file type code(s): N/A 1828 For further information: See [ZSTD] 1830 Intended usage: common 1832 Restrictions on usage: N/A 1834 Author: Murray S. Kucherawy 1836 Change Controller: IETF 1838 Provisional registration: no 1840 7.2. Content Encoding 1842 IANA has added the following entry to the "HTTP Content Coding 1843 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" 1844 registry: 1846 Name: zstd 1848 Description: A stream of bytes compressed using the Zstandard 1849 protocol 1851 Pointer to specification text: [this document] 1853 7.3. Structured Syntax Suffix 1855 IANA is requested to register the following into the Structured 1856 Syntax Suffix registry: 1858 Name: Zstandard 1860 +suffix: +zstd 1862 Encoding Considerations: binary 1863 Interoperability Considerations: N/A 1865 Fragment Identifier Considerations: The syntax and semantics of 1866 fragment identifiers specified for +zstd should be as specified 1867 for "application/zstd". 1869 Security Considerations: See Section 8 of [this document]. 1871 Contact: Refer to the author for the application/zstd media type. 1873 Author/Change Controller: IETF 1875 7.4. Dictionaries 1877 Work in progress includes development of dictionaries that will 1878 optimize compression and decompression of particular types of data. 1879 Specification of such dictionaries for public use will necessitate 1880 registration of a code point from the reserved range described in 1881 Section 3.1.1.1.3 and its association with a specific dictionary. 1883 However, there are at present no such dictionaries published for 1884 public use, so this document makes no immediate request of IANA to 1885 create such a registry. 1887 8. Security Considerations 1889 Any data compression method involves the reduction of redundancy in 1890 the data. Zstandard is no exception, and the usual precautions 1891 apply. 1893 One should never compress a message whose content must remain secret 1894 with a message generated by a third party. Such a compression can be 1895 used to guess the content of the secret message through analysis of 1896 entropy reduction. This was demonstrated in the Compression Ratio 1897 Info-leak Made Easy (CRIME) attack [CRIME], for example. 1899 A decoder has to demonstrate capabilities to detect and prevent any 1900 kind of data tampering in the compressed frame from triggering system 1901 faults, such as reading or writing beyond allowed memory ranges. 1902 This can be guaranteed by either the implementation language or 1903 careful bound checkings. Of particular note is the encoding of 1904 Number_of_Sequences values that cause the decoder to read into the 1905 block header (and beyond), as well as the indication of a 1906 Frame_Content_Size that is smaller than the actual decompressed data, 1907 in an attempt to trigger a buffer overflow. It is highly recommended 1908 to fuzz-test (i.e., provide invalid, unexpected, or random input and 1909 verify safe operation of) decoder implementations to test and harden 1910 their capability to detect bad frames and deal with them without any 1911 adverse system side effect. 1913 An attacker may provide correctly formed compressed frames with 1914 unreasonable memory requirements. A decoder must always control 1915 memory requirements and enforce some (system-specific) limits in 1916 order to protect memory usage from such scenarios. 1918 Compression can be optimized by training a dictionary on a variety of 1919 related content payloads. This dictionary must then be available at 1920 the decoder for decompression of the payload to be possible. While 1921 this document does not specify how to acquire a dictionary for a 1922 given compressed payload, it is worth noting that third-party 1923 dictionaries may interact unexpectedly with a decoder, leading to 1924 possible memory or other resource exhaustion attacks. We expect such 1925 topics to be discussed in further detail in the Security 1926 Considerations section of a forthcoming RFC for dictionary 1927 acquisition and transmission, but highlight this issue now out of an 1928 abundance of caution. 1930 As discussed in Section 3.1.2, it is possible to store arbitrary user 1931 metadata in skippable frames. While such frames are ignored during 1932 decompression of the data, they can be used as a watermark to track 1933 the path of the compressed payload. 1935 9. Implementation Status 1937 Source code for a C language implementation of a Zstandard-compliant 1938 library is available at [ZSTD-GITHUB]. This implementation is 1939 considered to be the reference implementation and is production 1940 ready; it implements the full range of the specification. It is 1941 routinely tested against security hazards and widely deployed within 1942 Facebook infrastructure. 1944 The reference version is optimized for speed and is highly portable. 1945 It has been proven to run safely on multiple architectures (e.g., 1946 x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit 1947 addressing schemes, a little- or big-endian storage scheme, a number 1948 of different operating systems (e.g., UNIX (including Linux, BSD, 1949 OS-X, and Solaris) and Windows), and a number of compilers (e.g., 1950 gcc, clang, visual, and icc). 1952 A comprehensive and current list of known implementations can be 1953 found at [ZSTD]. 1955 10. References 1956 10.1. Normative References 1958 [ZSTD] "Zstandard", 2017, . 1960 10.2. Informative References 1962 [ANS] Duda, J., "Asymmetric numeral systems: entropy coding 1963 combining speed of Huffman coding with compression rate of 1964 arithmetic coding", January 2014, 1965 . 1967 [CRIME] "CRIME", June 2018, . 1970 [DICT-SEC] 1971 Handte, W., "Security Considerations Regarding Compression 1972 Dictionaries", (work in 1973 progress) draft-handte-httpbis-dict-sec, October 2019. 1975 [FSE] "FiniteStateEntropy", June 2018, 1976 . 1978 [LZ4] "LZ4 Frame Format Description", January 2018, . 1981 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", 1982 RFC 1952, DOI 10.17487/RFC1952, May 1996, 1983 . 1985 [XXHASH] "XXHASH Algorithm", 2017, . 1987 [ZSTD-GITHUB] 1988 "zstd", August 2018, . 1990 Appendix A. Decoding Tables for Predefined Codes 1992 This appendix contains FSE decoding tables for the predefined literal 1993 length, match length, and offset codes. The tables have been 1994 constructed using the algorithm as given above in Section 4.1.1. The 1995 tables here can be used as examples to crosscheck that an 1996 implementation has built its decoding tables correctly. 1998 A.1. Literal Length Code Table 2000 +-------+--------+----------------+------+ 2001 | State | Symbol | Number_Of_Bits | Base | 2002 +-------+--------+----------------+------+ 2003 | 0 | 0 | 0 | 0 | 2004 +-------+--------+----------------+------+ 2005 | 0 | 0 | 4 | 0 | 2006 +-------+--------+----------------+------+ 2007 | 1 | 0 | 4 | 16 | 2008 +-------+--------+----------------+------+ 2009 | 2 | 1 | 5 | 32 | 2010 +-------+--------+----------------+------+ 2011 | 3 | 3 | 5 | 0 | 2012 +-------+--------+----------------+------+ 2013 | 4 | 4 | 5 | 0 | 2014 +-------+--------+----------------+------+ 2015 | 5 | 6 | 5 | 0 | 2016 +-------+--------+----------------+------+ 2017 | 6 | 7 | 5 | 0 | 2018 +-------+--------+----------------+------+ 2019 | 7 | 9 | 5 | 0 | 2020 +-------+--------+----------------+------+ 2021 | 8 | 10 | 5 | 0 | 2022 +-------+--------+----------------+------+ 2023 | 9 | 12 | 5 | 0 | 2024 +-------+--------+----------------+------+ 2025 | 10 | 14 | 6 | 0 | 2026 +-------+--------+----------------+------+ 2027 | 11 | 16 | 5 | 0 | 2028 +-------+--------+----------------+------+ 2029 | 12 | 18 | 5 | 0 | 2030 +-------+--------+----------------+------+ 2031 | 13 | 19 | 5 | 0 | 2032 +-------+--------+----------------+------+ 2033 | 14 | 21 | 5 | 0 | 2034 +-------+--------+----------------+------+ 2035 | 15 | 22 | 5 | 0 | 2036 +-------+--------+----------------+------+ 2037 | 16 | 24 | 5 | 0 | 2038 +-------+--------+----------------+------+ 2039 | 17 | 25 | 5 | 32 | 2040 +-------+--------+----------------+------+ 2041 | 18 | 26 | 5 | 0 | 2042 +-------+--------+----------------+------+ 2043 | 19 | 27 | 6 | 0 | 2044 +-------+--------+----------------+------+ 2045 | 20 | 29 | 6 | 0 | 2046 +-------+--------+----------------+------+ 2047 | 21 | 31 | 6 | 0 | 2048 +-------+--------+----------------+------+ 2049 | 22 | 0 | 4 | 32 | 2050 +-------+--------+----------------+------+ 2051 | 23 | 1 | 4 | 0 | 2052 +-------+--------+----------------+------+ 2053 | 24 | 2 | 5 | 0 | 2054 +-------+--------+----------------+------+ 2055 | 25 | 4 | 5 | 32 | 2056 +-------+--------+----------------+------+ 2057 | 26 | 5 | 5 | 0 | 2058 +-------+--------+----------------+------+ 2059 | 27 | 7 | 5 | 32 | 2060 +-------+--------+----------------+------+ 2061 | 28 | 8 | 5 | 0 | 2062 +-------+--------+----------------+------+ 2063 | 29 | 10 | 5 | 32 | 2064 +-------+--------+----------------+------+ 2065 | 30 | 11 | 5 | 0 | 2066 +-------+--------+----------------+------+ 2067 | 31 | 13 | 6 | 0 | 2068 +-------+--------+----------------+------+ 2069 | 32 | 16 | 5 | 32 | 2070 +-------+--------+----------------+------+ 2071 | 33 | 17 | 5 | 0 | 2072 +-------+--------+----------------+------+ 2073 | 34 | 19 | 5 | 32 | 2074 +-------+--------+----------------+------+ 2075 | 35 | 20 | 5 | 0 | 2076 +-------+--------+----------------+------+ 2077 | 36 | 22 | 5 | 32 | 2078 +-------+--------+----------------+------+ 2079 | 37 | 23 | 5 | 0 | 2080 +-------+--------+----------------+------+ 2081 | 38 | 25 | 4 | 0 | 2082 +-------+--------+----------------+------+ 2083 | 39 | 25 | 4 | 16 | 2084 +-------+--------+----------------+------+ 2085 | 40 | 26 | 5 | 32 | 2086 +-------+--------+----------------+------+ 2087 | 41 | 28 | 6 | 0 | 2088 +-------+--------+----------------+------+ 2089 | 42 | 30 | 6 | 0 | 2090 +-------+--------+----------------+------+ 2091 | 43 | 0 | 4 | 48 | 2092 +-------+--------+----------------+------+ 2093 | 44 | 1 | 4 | 16 | 2094 +-------+--------+----------------+------+ 2095 | 45 | 2 | 5 | 32 | 2096 +-------+--------+----------------+------+ 2097 | 46 | 3 | 5 | 32 | 2098 +-------+--------+----------------+------+ 2099 | 47 | 5 | 5 | 32 | 2100 +-------+--------+----------------+------+ 2101 | 48 | 6 | 5 | 32 | 2102 +-------+--------+----------------+------+ 2103 | 49 | 8 | 5 | 32 | 2104 +-------+--------+----------------+------+ 2105 | 50 | 9 | 5 | 32 | 2106 +-------+--------+----------------+------+ 2107 | 51 | 11 | 5 | 32 | 2108 +-------+--------+----------------+------+ 2109 | 52 | 12 | 5 | 32 | 2110 +-------+--------+----------------+------+ 2111 | 53 | 15 | 6 | 0 | 2112 +-------+--------+----------------+------+ 2113 | 54 | 17 | 5 | 32 | 2114 +-------+--------+----------------+------+ 2115 | 55 | 18 | 5 | 32 | 2116 +-------+--------+----------------+------+ 2117 | 56 | 20 | 5 | 32 | 2118 +-------+--------+----------------+------+ 2119 | 57 | 21 | 5 | 32 | 2120 +-------+--------+----------------+------+ 2121 | 58 | 23 | 5 | 32 | 2122 +-------+--------+----------------+------+ 2123 | 59 | 24 | 5 | 32 | 2124 +-------+--------+----------------+------+ 2125 | 60 | 35 | 6 | 0 | 2126 +-------+--------+----------------+------+ 2127 | 61 | 34 | 6 | 0 | 2128 +-------+--------+----------------+------+ 2129 | 62 | 33 | 6 | 0 | 2130 +-------+--------+----------------+------+ 2131 | 63 | 32 | 6 | 0 | 2132 +-------+--------+----------------+------+ 2134 A.2. Match Length Code Table 2136 +-------+--------+----------------+------+ 2137 | State | Symbol | Number_Of_Bits | Base | 2138 +-------+--------+----------------+------+ 2139 | 0 | 0 | 0 | 0 | 2140 +-------+--------+----------------+------+ 2141 | 0 | 0 | 6 | 0 | 2142 +-------+--------+----------------+------+ 2143 | 1 | 1 | 4 | 0 | 2144 +-------+--------+----------------+------+ 2145 | 2 | 2 | 5 | 32 | 2146 +-------+--------+----------------+------+ 2147 | 3 | 3 | 5 | 0 | 2148 +-------+--------+----------------+------+ 2149 | 4 | 5 | 5 | 0 | 2150 +-------+--------+----------------+------+ 2151 | 5 | 6 | 5 | 0 | 2152 +-------+--------+----------------+------+ 2153 | 6 | 8 | 5 | 0 | 2154 +-------+--------+----------------+------+ 2155 | 7 | 10 | 6 | 0 | 2156 +-------+--------+----------------+------+ 2157 | 8 | 13 | 6 | 0 | 2158 +-------+--------+----------------+------+ 2159 | 9 | 16 | 6 | 0 | 2160 +-------+--------+----------------+------+ 2161 | 10 | 19 | 6 | 0 | 2162 +-------+--------+----------------+------+ 2163 | 11 | 22 | 6 | 0 | 2164 +-------+--------+----------------+------+ 2165 | 12 | 25 | 6 | 0 | 2166 +-------+--------+----------------+------+ 2167 | 13 | 28 | 6 | 0 | 2168 +-------+--------+----------------+------+ 2169 | 14 | 31 | 6 | 0 | 2170 +-------+--------+----------------+------+ 2171 | 15 | 33 | 6 | 0 | 2172 +-------+--------+----------------+------+ 2173 | 16 | 35 | 6 | 0 | 2174 +-------+--------+----------------+------+ 2175 | 17 | 37 | 6 | 0 | 2176 +-------+--------+----------------+------+ 2177 | 18 | 39 | 6 | 0 | 2178 +-------+--------+----------------+------+ 2179 | 19 | 41 | 6 | 0 | 2180 +-------+--------+----------------+------+ 2181 | 20 | 43 | 6 | 0 | 2182 +-------+--------+----------------+------+ 2183 | 21 | 45 | 6 | 0 | 2184 +-------+--------+----------------+------+ 2185 | 22 | 1 | 4 | 16 | 2186 +-------+--------+----------------+------+ 2187 | 23 | 2 | 4 | 0 | 2188 +-------+--------+----------------+------+ 2189 | 24 | 3 | 5 | 32 | 2190 +-------+--------+----------------+------+ 2191 | 25 | 4 | 5 | 0 | 2192 +-------+--------+----------------+------+ 2193 | 26 | 6 | 5 | 32 | 2194 +-------+--------+----------------+------+ 2195 | 27 | 7 | 5 | 0 | 2196 +-------+--------+----------------+------+ 2197 | 28 | 9 | 6 | 0 | 2198 +-------+--------+----------------+------+ 2199 | 29 | 12 | 6 | 0 | 2200 +-------+--------+----------------+------+ 2201 | 30 | 15 | 6 | 0 | 2202 +-------+--------+----------------+------+ 2203 | 31 | 18 | 6 | 0 | 2204 +-------+--------+----------------+------+ 2205 | 32 | 21 | 6 | 0 | 2206 +-------+--------+----------------+------+ 2207 | 33 | 24 | 6 | 0 | 2208 +-------+--------+----------------+------+ 2209 | 34 | 27 | 6 | 0 | 2210 +-------+--------+----------------+------+ 2211 | 35 | 30 | 6 | 0 | 2212 +-------+--------+----------------+------+ 2213 | 36 | 32 | 6 | 0 | 2214 +-------+--------+----------------+------+ 2215 | 37 | 34 | 6 | 0 | 2216 +-------+--------+----------------+------+ 2217 | 38 | 36 | 6 | 0 | 2218 +-------+--------+----------------+------+ 2219 | 39 | 38 | 6 | 0 | 2220 +-------+--------+----------------+------+ 2221 | 40 | 40 | 6 | 0 | 2222 +-------+--------+----------------+------+ 2223 | 41 | 42 | 6 | 0 | 2224 +-------+--------+----------------+------+ 2225 | 42 | 44 | 6 | 0 | 2226 +-------+--------+----------------+------+ 2227 | 43 | 1 | 4 | 32 | 2228 +-------+--------+----------------+------+ 2229 | 44 | 1 | 4 | 48 | 2230 +-------+--------+----------------+------+ 2231 | 45 | 2 | 4 | 16 | 2232 +-------+--------+----------------+------+ 2233 | 46 | 4 | 5 | 32 | 2234 +-------+--------+----------------+------+ 2235 | 47 | 5 | 5 | 32 | 2236 +-------+--------+----------------+------+ 2237 | 48 | 7 | 5 | 32 | 2238 +-------+--------+----------------+------+ 2239 | 49 | 8 | 5 | 32 | 2240 +-------+--------+----------------+------+ 2241 | 50 | 11 | 6 | 0 | 2242 +-------+--------+----------------+------+ 2243 | 51 | 14 | 6 | 0 | 2244 +-------+--------+----------------+------+ 2245 | 52 | 17 | 6 | 0 | 2246 +-------+--------+----------------+------+ 2247 | 53 | 20 | 6 | 0 | 2248 +-------+--------+----------------+------+ 2249 | 54 | 23 | 6 | 0 | 2250 +-------+--------+----------------+------+ 2251 | 55 | 26 | 6 | 0 | 2252 +-------+--------+----------------+------+ 2253 | 56 | 29 | 6 | 0 | 2254 +-------+--------+----------------+------+ 2255 | 57 | 52 | 6 | 0 | 2256 +-------+--------+----------------+------+ 2257 | 58 | 51 | 6 | 0 | 2258 +-------+--------+----------------+------+ 2259 | 59 | 50 | 6 | 0 | 2260 +-------+--------+----------------+------+ 2261 | 60 | 49 | 6 | 0 | 2262 +-------+--------+----------------+------+ 2263 | 61 | 48 | 6 | 0 | 2264 +-------+--------+----------------+------+ 2265 | 62 | 47 | 6 | 0 | 2266 +-------+--------+----------------+------+ 2267 | 63 | 46 | 6 | 0 | 2268 +-------+--------+----------------+------+ 2270 A.3. Offset Code Table 2272 +-------+--------+----------------+------+ 2273 | State | Symbol | Number_Of_Bits | Base | 2274 +-------+--------+----------------+------+ 2275 | 0 | 0 | 0 | 0 | 2276 +-------+--------+----------------+------+ 2277 | 0 | 0 | 5 | 0 | 2278 +-------+--------+----------------+------+ 2279 | 1 | 6 | 4 | 0 | 2280 +-------+--------+----------------+------+ 2281 | 2 | 9 | 5 | 0 | 2282 +-------+--------+----------------+------+ 2283 | 3 | 15 | 5 | 0 | 2284 +-------+--------+----------------+------+ 2285 | 4 | 21 | 5 | 0 | 2286 +-------+--------+----------------+------+ 2287 | 5 | 3 | 5 | 0 | 2288 +-------+--------+----------------+------+ 2289 | 6 | 7 | 4 | 0 | 2290 +-------+--------+----------------+------+ 2291 | 7 | 12 | 5 | 0 | 2292 +-------+--------+----------------+------+ 2293 | 8 | 18 | 5 | 0 | 2294 +-------+--------+----------------+------+ 2295 | 9 | 23 | 5 | 0 | 2296 +-------+--------+----------------+------+ 2297 | 10 | 5 | 5 | 0 | 2298 +-------+--------+----------------+------+ 2299 | 11 | 8 | 4 | 0 | 2300 +-------+--------+----------------+------+ 2301 | 12 | 14 | 5 | 0 | 2302 +-------+--------+----------------+------+ 2303 | 13 | 20 | 5 | 0 | 2304 +-------+--------+----------------+------+ 2305 | 14 | 2 | 5 | 0 | 2306 +-------+--------+----------------+------+ 2307 | 15 | 7 | 4 | 16 | 2308 +-------+--------+----------------+------+ 2309 | 16 | 11 | 5 | 0 | 2310 +-------+--------+----------------+------+ 2311 | 17 | 17 | 5 | 0 | 2312 +-------+--------+----------------+------+ 2313 | 18 | 22 | 5 | 0 | 2314 +-------+--------+----------------+------+ 2315 | 19 | 4 | 5 | 0 | 2316 +-------+--------+----------------+------+ 2317 | 20 | 8 | 4 | 16 | 2318 +-------+--------+----------------+------+ 2319 | 21 | 13 | 5 | 0 | 2320 +-------+--------+----------------+------+ 2321 | 22 | 19 | 5 | 0 | 2322 +-------+--------+----------------+------+ 2323 | 23 | 1 | 5 | 0 | 2324 +-------+--------+----------------+------+ 2325 | 24 | 6 | 4 | 16 | 2326 +-------+--------+----------------+------+ 2327 | 25 | 10 | 5 | 0 | 2328 +-------+--------+----------------+------+ 2329 | 26 | 16 | 5 | 0 | 2330 +-------+--------+----------------+------+ 2331 | 27 | 28 | 5 | 0 | 2332 +-------+--------+----------------+------+ 2333 | 28 | 27 | 5 | 0 | 2334 +-------+--------+----------------+------+ 2335 | 29 | 26 | 5 | 0 | 2336 +-------+--------+----------------+------+ 2337 | 30 | 25 | 5 | 0 | 2338 +-------+--------+----------------+------+ 2339 | 31 | 24 | 5 | 0 | 2340 +-------+--------+----------------+------+ 2342 Appendix B. Changes Since RFC8478 2344 The following are the changes in this document relative to RFC 8478: 2346 o Apply errata #5786 and #6303. 2348 o Clarify forward compatibility regarding dictionaries. 2350 o Clarify application of Block_Maximum_Size. 2352 o Add structured media type suffix registration. 2354 o Clarify that the content checksum is always 4 bytes. 2356 o Clarify handling of reserved and corrupt inputs. 2358 o Add fragment identifier considerations to the media type 2359 registration. 2361 Appendix C. Acknowledgments 2363 zstd was developed by Yann Collet. 2365 Felix Handte and Nick Terrell provided feedback that went into this 2366 revision and RFC 8478. RFC 8478 also received contributions from 2367 Bobo Bose-Kolanu, Kyle Nekritz, and David Schleimer. 2369 Authors' Addresses 2371 Yann Collet 2372 Facebook 2373 1 Hacker Way 2374 Menlo Park, CA 94025 2375 United States of America 2377 Email: cyan@fb.com 2378 Murray S. Kucherawy (editor) 2379 Facebook 2380 1 Hacker Way 2381 Menlo Park, CA 94025 2382 United States of America 2384 Email: msk@fb.com