idnits 2.17.1 draft-kucherawy-dispatch-zstd-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (September 25, 2017) is 2403 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'Stream 2' is mentioned on line 455, but not defined == Missing Reference: 'Stream 3' is mentioned on line 457, but not defined == Missing Reference: 'Stream 4' is mentioned on line 459, but not defined -- Looks like a reference, but probably isn't: '36' on line 935 -- Looks like a reference, but probably isn't: '53' on line 945 -- Looks like a reference, but probably isn't: '29' on line 962 -- Looks like a reference, but probably isn't: '5' on line 1305 -- Looks like a reference, but probably isn't: '0' on line 1319 -- Looks like a reference, but probably isn't: '1' on line 1319 -- Possible downref: Non-RFC (?) normative reference: ref. 'ZSTD' Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 8 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Y. Collet 3 Internet-Draft M. Kucherawy, Ed. 4 Intended status: Standards Track Facebook 5 Expires: March 29, 2018 September 25, 2017 7 Zstandard Compression and The application/zstd Media Type 8 draft-kucherawy-dispatch-zstd-00 10 Abstract 12 Zstandard, or "zstd" (pronounced "zee standard"), is a data 13 compression mechanism. This document describes the mechanism, and 14 registers a media type to be used when transporting zstd-compressed 15 via Multipurpose Internet Mail Extensions (MIME). 17 Status of This Memo 19 This Internet-Draft is submitted in full conformance with the 20 provisions of BCP 78 and BCP 79. 22 Internet-Drafts are working documents of the Internet Engineering 23 Task Force (IETF). Note that other groups may also distribute 24 working documents as Internet-Drafts. The list of current Internet- 25 Drafts is at http://datatracker.ietf.org/drafts/current/. 27 Internet-Drafts are draft documents valid for a maximum of six months 28 and may be updated, replaced, or obsoleted by other documents at any 29 time. It is inappropriate to use Internet-Drafts as reference 30 material or to cite them other than as "work in progress." 32 This Internet-Draft will expire on March 29, 2018. 34 Copyright Notice 36 Copyright (c) 2017 IETF Trust and the persons identified as the 37 document authors. All rights reserved. 39 This document is subject to BCP 78 and the IETF Trust's Legal 40 Provisions Relating to IETF Documents 41 (http://trustee.ietf.org/license-info) in effect on the date of 42 publication of this document. Please review these documents 43 carefully, as they describe your rights and restrictions with respect 44 to this document. Code Components extracted from this document must 45 include Simplified BSD License text as described in Section 4.e of 46 the Trust Legal Provisions and are provided without warranty as 47 described in the Simplified BSD License. 49 Table of Contents 51 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 52 2. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 3 53 2.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . . 3 54 2.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . . 3 55 2.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . . 4 56 2.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . . 8 57 2.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 10 58 2.2. Sequence Execution . . . . . . . . . . . . . . . . . . . . 22 59 2.2.1. Repeat Offsets . . . . . . . . . . . . . . . . . . . . 23 60 2.3. Skippable Frames . . . . . . . . . . . . . . . . . . . . . 23 61 2.4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . 24 62 2.4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . 24 63 2.4.1.1. FSE Table Description . . . . . . . . . . . . . . 25 64 2.4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . 27 65 2.4.2.1. Huffman Tree Description . . . . . . . . . . . . . 28 66 2.4.2.2. Huffman-coded Streams . . . . . . . . . . . . . . 32 67 2.5. Dictionary Format . . . . . . . . . . . . . . . . . . . . 33 68 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 34 69 3.1. The 'application/zstd' Media Type . . . . . . . . . . . . 34 70 3.2. Content Encoding . . . . . . . . . . . . . . . . . . . . . 35 71 4. Security Considerations . . . . . . . . . . . . . . . . . . . 36 72 5. Implementation Status . . . . . . . . . . . . . . . . . . . . 36 73 6. References . . . . . . . . . . . . . . . . . . . . . . . . . . 37 74 6.1. Normative References . . . . . . . . . . . . . . . . . . . 37 75 6.2. Informative References . . . . . . . . . . . . . . . . . . 37 76 Appendix A. Acknowledgments . . . . . . . . . . . . . . . . . . . 38 77 Appendix B. Decoding Tables for Predefined Codes . . . . . . . . 38 78 B.1. Literal Length Code Table . . . . . . . . . . . . . . . . 38 79 B.2. Match Length Code Table . . . . . . . . . . . . . . . . . 41 80 B.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 44 82 1. Introduction 84 Zstandard, or "zstd" (pronounced "zee standard") is a data 85 compression mechanism, akin to gzip [RFC1952]. 87 This document describes the Zstandard format. Also, to enable the 88 transport of a data object compressed with Zstandard, this document 89 registers a media type that can be used to identify such content when 90 it is used in a payload encoded using Multipurpose Internet Mail 91 Extensions (MIME). 93 2. Compression Algorithm 95 This section describes the Zstandard algorithm. 97 2.1. Frames 99 Zstandard compressed data is made of up one or more frames. Each 100 frame is independent and can be decompressed indepedently of other 101 frames. The decompressed content of multiple concatenated frames is 102 the concatenation of each frame's decompressed content. 104 There are two frame formats defined for Zstandard: Zstandard frames 105 and Skippable frames. Zstandard frames contain compressed data, 106 while skippable frames contain no data and can be used for metadata. 108 2.1.1. Zstandard Frames 110 The structure of a single Zstandard frame is as follows: 112 +--------------------+------------+ 113 | Magic_Number | 4 bytes | 114 +--------------------+------------+ 115 | Frame_Header | 2-14 bytes | 116 +--------------------+------------+ 117 | Data_Block | n bytes | 118 +--------------------+------------+ 119 | [More Data Blocks] | | 120 +--------------------+------------+ 121 | [Content Checksum] | 0-4 bytes | 122 +--------------------+------------+ 124 Magic_Number: Four bytes, little-endian format. Value: 0xFD2FB528 126 Frame_Header: Two to 14 bytes, detailed in Section 2.1.1.1 127 Data_Block: Detailed in Section 2.1.1.3. This is where compressed 128 data appears. 130 Content_Checksum: An optional 32-bit checksum, only present if the 131 Content_Checksum_flag is set. The content checksum is the result 132 of the xxh64() hash function [XXHASH] digesting the origina 133 (decoded) data as input, and a seed of zero. The low four bytes 134 of the checksum are stored in little-endian format. 136 2.1.1.1. Frame Header 138 The frame header has a variable size, with a minimum of two bytes and 139 up to 14 bytes depending on optional parameters. The structure of 140 Frame_Header is as follows: 142 +-------------------------+-----------+ 143 | Frame_Header_Descriptor | 1 byte | 144 +-------------------------+-----------+ 145 | [Window_Descriptor] | 0-1 byte | 146 +-------------------------+-----------+ 147 | [Dictionary_ID] | 0-4 bytes | 148 +-------------------------+-----------+ 149 | [Frame_Content_Size] | 0-8 bytes | 150 +-------------------------+-----------+ 152 2.1.1.1.1. Frame Header Descrptor 154 The first header's byte is called the Frame Header Descriptor. It 155 describes which other fields are present. Decoding this byte is 156 enough to tell the size of Frame_Header. 158 +------------+-------------------------+ 159 | Bit Number | Field Name | 160 +------------+-------------------------+ 161 | 7-6 | Frame Content Size Flag | 162 +------------+-------------------------+ 163 | 5 | Single Segment Flag | 164 +------------+-------------------------+ 165 | 4 | (unused) | 166 +------------+-------------------------+ 167 | 3 | (reserved) | 168 +------------+-------------------------+ 169 | 2 | Content Checksum Flag | 170 +------------+-------------------------+ 171 | 1-0 | Dictionary ID Flag | 172 +------------+-------------------------+ 174 In this table, bit 7 is the highest bit, while bit 0 is the lowest 175 one. 177 2.1.1.1.1.1. Frame_Content_Size_Flag 179 This is a two-bit flag (equivalent to Frame_Header_Descriptor left- 180 shifted six bits) specifying whether Frame_Content_Size (the 181 decompressed data size) is provided within the header. Flag_Value 182 provides FCS_Field_Size, which is the number of bytes used by 183 Frame_Content_Size according to the following table: 185 +----------------+--------+---+---+---+ 186 | Flag_Value | 0 | 1 | 2 | 3 | 187 +----------------+--------+---+---+---+ 188 | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 | 189 +----------------+--------+---+---+---+ 191 When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag: 192 If Single_Segment_flag is set, Field_Size is 1. Otherwise, 193 Field_Size is 0; Frame_Content_Size is not provided. 195 2.1.1.1.1.2. Single_Segment_flag 197 If this flag is set, data must be regenerated within a single 198 continuous memory segment. 200 In this case, Window_Descriptor byte is skipped, but 201 Frame_Content_Size is necessarily present. As a consequence, the 202 decoder must allocate a memory segment of size equal or bigger than 203 Frame_Content_Size. 205 In order to protect the decoder from unreasonable memory 206 requirements, a decoder is allowed to reject a compressed frame that 207 requests a memory size beyond the decoder's authorized range. 209 For broader compatibility, decoders are recommended to support memory 210 sizes of at least 8 MB. This is only a recommendation; each decoder 211 is free to support higher or lower limits, depending on local 212 limitations. 214 2.1.1.1.1.3. Unused Bit 216 The value of this bit should be set to zero. A decoder compliant 217 with this specification version shall not interpret it. It might be 218 used in a future version, to signal a property which is not mandatory 219 to properly decode the frame. 221 2.1.1.1.1.4. Reserved Bit 223 This bit is reserved for some future feature. Its value must be 224 zero. A decoder compliant with this specification version must 225 ensure it is not set. This bit may be used in a future revision, to 226 signal a feature that must be interpreted to decode the frame 227 correctly. 229 2.1.1.1.1.5. Content_Checksum_Flag 231 If this flag is set, a 32-bits Content_Checksum will be present at 232 the frame's end. See the description of Content_Checksum above. 234 2.1.1.1.1.6. Dictionary_ID_Flag 236 This is a two-bit flag (= FHD & 3) indicating whether a dictionary ID 237 is provided within the header. It also specifies the size of this 238 field as Field_Size: 240 +------------+---+---+---+---+ 241 | Flag_Value | 0 | 1 | 2 | 3 | 242 +------------+---+---+---+---+ 243 | Field_Size | 0 | 1 | 2 | 4 | 244 +------------+---+---+---+---+ 246 2.1.1.1.2. Window Descriptor 248 Provides guarantees on minimum memory buffer required to decompress a 249 frame. This information is important for decoders to allocate enough 250 memory. 252 The Window_Descriptor byte is optional. When Single_Segment_flag is 253 set, Window_Descriptor is not present. In this case, Window_Size is 254 Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16 255 ExaBytes). 257 +-------------+----------+----------+ 258 | Bit numbers | 7-3 | 2-0 | 259 +-------------+----------+----------+ 260 | Field name | Exponent | Mantissa | 261 +-------------+----------+----------+ 263 The minimum memory buffer size is called Window_Size. It is 264 described by the following formulae: 266 windowLog = 10 + Exponent; 267 windowBase = 1 << windowLog; 268 windowAdd = (windowBase / 8) * Mantissa; 269 Window_Size = windowBase + windowAdd; 271 The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41) 272 + 7*(1<<38) bytes, which is 3.75 TB. 274 To properly decode compressed data, a decoder will need to allocate a 275 buffer of at least Window_Size bytes. 277 In order to protect decoders from unreasonable memory requirements, a 278 decoder is allowed to reject a compressed frame which requests a 279 memory size beyond decoder's authorized range. 281 For improved interoperability, decoders are recommended to be 282 compatible with Window_Size >= 8 MB, and encoders are recommended to 283 not request more than 8 MB. It's merely a recommendation though, and 284 decoders are free to support larger or lower limits, depending on 285 local limitations. 287 2.1.1.1.3. Dictionary ID 289 This is a variable size field, which contains the ID of the 290 dictionary required to properly decode the frame. This field is 291 optional. When it's not present, it's up to the decoder to make sure 292 it uses the correct dictionary. 294 Field size depends on Dictionary_ID_flag. One byte can represent an 295 ID 0-255; two bytes can represent an ID 0-65535; four bytes can 296 represent an ID 0-4294967295. Format is little-endian. 298 It is permitted to represent a small ID (for example 13) with a large 299 four-byte dictionary ID, even if it is less efficient. 301 If the frame is going to be distributed in a private environment, any 302 dictionary ID can be used. However, for public distribution of 303 compressed frames using a dictionary, the following ranges are 304 reserved and shall not be used: 306 low range: <= 32767 308 high range: >= (1 << 31) 310 2.1.1.1.4. Frame Content Size 312 This is the original (uncompressed) size. This information is 313 optional. Frame_Content_Size uses a variable number of bytes, 314 provided by FCS_Field_Size. FCS_Field_Size is provided by the value 315 of Frame_Content_Size_flag. FCS_Field_Size can be equal to 0 (not 316 present), 1, 2, 4 or 8 bytes. 318 +----------------+--------------+ 319 | FCS Field Size | Range | 320 +----------------+--------------+ 321 | 0 | unknown | 322 +----------------+--------------+ 323 | 1 | 0 - 255 | 324 +----------------+--------------+ 325 | 2 | 256 - 65791 | 326 +----------------+--------------+ 327 | 4 | 0 - 2^32 - 1 | 328 +----------------+--------------+ 329 | 8 | 0 - 2^64 - 1 | 330 +----------------+--------------+ 332 Frame_Content_Size format is little-endian. When FCS_Field_Size is 333 1, 4 or 8 bytes, the value is read directly. When FCS_Field_Size is 334 2, the offset of 256 is added. It's allowed to represent a small 335 size (for example 18) using any compatible variant. 337 2.1.1.2. Blocks 339 After Magic_Number and Frame_Header, there are some number of blocks. 340 Each frame must have at least one block, but there is no upper limit 341 on the number of blocks per frame. 343 The structure of a block is as follows: 345 +--------------+---------------+ 346 | Block_Header | Block_Content | 347 +--------------+---------------+ 348 | 3 bytes | n bytes | 349 +--------------+---------------+ 351 Block_Header uses three bytes, written using little-endian 352 convention. It contains three fields: 354 +------------+------------+------------+ 355 | Last_Block | Block_Type | Block_Size | 356 +------------+------------+------------+ 357 | bit 0 | bits 1-2 | bits 3-23 | 358 +------------+------------+------------+ 360 2.1.1.2.1. Last_Block 362 The lowest bit signals if this block is the last one. The frame will 363 end after this last block. It may be followed by an optional 364 Content_Checksum (see Section 2.1.1). 366 2.1.1.2.2. Block_Type 368 The next two bits represent the Block_Type. There are four block 369 types: 371 +-----------+------------------+ 372 | Value | Block_Type | 373 +-----------+------------------+ 374 | 0 | Raw_Block | 375 +-----------+------------------+ 376 | 1 | RLE_Block | 377 +-----------+------------------+ 378 | 2 | Compressed_Block | 379 +-----------+------------------+ 380 | 3 | Reserved | 381 +-----------+------------------+ 383 Raw_Block: This is an uncompressed block. Block_Content contains 384 Block_Size bytes. 386 RLE_Block: This is a single byte, repeated Block_Size times. 387 Block_Content consists of a single byte. On the decompression 388 side, this byte must be repeated Block_Size times. 390 Compressed_Block: This is a compressed block as described in 391 Section 2.1.1.3. Block_Size is the length of Block_Content, 392 namely the compressed data. The decompressed size is not known, 393 but its maximum possible value is guaranteed (see below). 395 Reserved: This is not a block. This value cannot be used with the 396 current specification. 398 2.1.1.2.3. Block_Size 400 The upper 21 bits of Block_Header represent the Block_Size. Block 401 sizes must respect a few rules: 403 o for Compressed_Block, Block_Size is always strictly less than 404 decompressed size; 406 o block decompressed size is always <= Window_Size; 408 o block decompressed size is always <= 128 KB. 410 A block can contain any number of bytes (even zero), up to 411 Block_Maximum_Decompressed_Size, which is the smallest of: 413 o Window_Size 415 o 128 KB 417 2.1.1.3. Compressed Blocks 419 To decompress a compressed block, the compressed size must be 420 provided from Block_Size field within Block_Header. 422 A compressed block consists of two sections: a Literals Section 423 (Section 2.1.1.3.1) and a Sequences Section (Section 2.1.1.3.2). The 424 results of the two sections are then combined to produce the 425 decompressed data in Sequence Execution (Section 2.2). 427 To decode a compressed block, the following elements are necessary: 429 o Previous decoded data, up to a distance of Window_Size, or all 430 previously decoded data when Single_Segment_flag is set. 432 o List of "recent offsets" from previous Compressed_Block. 434 o Decoding tables of previous Compressed_Block for each symbol type 435 (literals, literals lengths, match lengths, offsets). 437 2.1.1.3.1. Literals Section 439 All literals are regrouped in the first part of the block. They can 440 be decoded first, and then copied during Sequence Execution (see 441 Section 2.2), or they can be decoded on the flow during Sequence 442 Execution. 444 Literals can be stored uncompressed or compressed using Huffman 445 prefix codes. When compressed, an optional tree description can be 446 present, followed by one or four streams. 448 +----------------------------+ 449 | Literals_Section_Header | 450 +----------------------------+ 451 | [Huffman_Tree_Description] | 452 +----------------------------+ 453 | Stream 1 | 454 +----------------------------+ 455 | [Stream 2] | 456 +----------------------------+ 457 | [Stream 3] | 458 +----------------------------+ 459 | [Stream 4] | 460 +----------------------------+ 462 2.1.1.3.1.1. Literals_Section_Header 464 This field describes how literals are packed. It's a byte-aligned 465 variable-size bitfield, ranging from one to five bytes, using little- 466 endian convention. 468 +---------------------+-----------+ 469 | Literals_Block_Type | 2 bits | 470 +---------------------+-----------+ 471 | Size_Format | 1-2 bits | 472 +---------------------+-----------+ 473 | Regenerated_Size | 5-20 bits | 474 +---------------------+-----------+ 475 | [Compressed_Size] | 0-18 bits | 476 +---------------------+-----------+ 478 In this representation, bits at the top are the lowest bits. 480 The Literals_Block_Type field uses the two lowest bits of the first 481 byte, describing four different block types: 483 +---------------------------+-------+ 484 | Literals_Block_Type | Value | 485 +---------------------------+-------+ 486 | Raw_Literals_Block | 0 | 487 +---------------------------+-------+ 488 | RLE_Literals_Block | 1 | 489 +---------------------------+-------+ 490 | Compressed_Literals_Block | 2 | 491 +---------------------------+-------+ 492 | Treeless_Literals_Block | 3 | 493 +---------------------------+-------+ 495 Raw_Literals_Block: Literals are stored uncompressed. 497 RLE_Literals_Block: Literals consist of a single byte value repeated 498 Regenerated_Size times. 500 Compressed_Literals_Block: This is a standard Huffman-compressed 501 block, starting with a Huffman tree description. See details 502 below. 504 Treeless_Literals_Block: This is a Huffman-compressed block, using 505 Huffman tree from previous Huffman-compressed literals block. 506 Huffman_Tree_Description will be skipped. Note that if this mode 507 is triggered without any previous Huffman-table in the frame (or 508 dictionary, per Section 2.5), this should be treated as data 509 corruption. 511 The Size_Format is divided into two families: 513 o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary 514 to decode Regenerated_Size. There is no Compressed_Size field. 516 o For Compressed_Block and Treeless_Literals_Block, it's required to 517 decode both Compressed_Size and Regenerated_Size (the decompressed 518 size). It's also necessary to decode the number of streams (1 or 519 4). 521 For values spanning several bytes, convention is little-endian. 523 Size_Format for Raw_Literals_Block and RLE_Literals_Block: 525 Value ?0: Size_Format uses one bit. Regenerated_Size uses five bits 526 (value 0-31). Literals_Section_Header has one byte. 527 Regenerated_Size = Header[0]>>3. 529 Value 01: Size_Format uses two bits. Regenerated_Size uses 12 bits 530 (values 0-4095). Literals_Section_Header has two bytes. 531 Regenerated_Size = (Header[0]>>4) + (Header[1]<<4). 533 Value 11: Size_Format uses two bits. Regenerated_Size uses 20 bits 534 (values 0-1048575). Literals_Section_Header has three bytes. 535 Regenerated_Size = (Header[0]>>4) + (Header[1]<<4) + 536 (Header[2]<<12) 538 Only Stream1 is present for these cases. Note that it is permitted 539 to represent a short value (for example 13) using a long format, even 540 if it's less efficient. 542 Size_Format for Compressed_Literals_Block and 543 Treeless_Literals_Block: 545 Value 00: A single stream. Both Regenerated_Size and 546 Compressed_Size use ten bits (values 0-1023). 547 Literals_Section_Header has three bytes. 549 Value 01: Four streams. Both Regenerated_Size and Compressed_Size 550 use ten bits (values 0-1023). Literals_Section_Header has three 551 bytes. 553 Value 10: Four streams. Both Regenerated_Size and Compressed_Size 554 use 14 bits (values 0-16383). Literals_Section_Header has four 555 bytes. 557 Value 11: Four streams. Both Regenerated_Size and Compressed_Size 558 use 18 bits (values 0-262143). Literals_Section_Header has five 559 bytes. 561 Both the Compressed_Size and Regenerated_Size fields follow little- 562 endian convention. Note that Compressed_Size includes the size of 563 the Huffman Tree description when it is present. 565 2.1.1.3.1.2. Raw Literals Block 567 The data in Stream1 is Regenerated_Size bytes long. It contains the 568 raw literals data to be used during Sequence Execution 569 (Section 2.1.1.3.2). 571 2.1.1.3.1.3. RLE Literals Block 573 Stream1 consists of a single byte which should be repeated 574 Regenerated_Size times to generate the decoded literals. 576 2.1.1.3.1.4. Compressed Literals Block and Treeless Literals Block 578 Both of these modes contain Huffman encoded data. 579 Treeless_Literals_Block does not have a Huffman_Tree_Description. 581 2.1.1.3.1.4.1. Huffman_Tree_Description 583 This section is only present when Literals_Block_Type type is 584 Compressed_Literals_Block (2). The format of the Huffman tree 585 description can be found in Section 2.4.2.1. The size of 586 Huffman_Tree_Description is determined during the decoding process. 587 It must be used to determine where streams begin. It is always true 588 that: 590 Total_Streams_Size = Compressed_Size 591 - Huffman_Tree_Description_Size 593 For Treeless_Literals_Block, the Huffman table comes from previously 594 compressed literals block. 596 Huffman compressed data consists of either one or four Huffman-coded 597 streams. 599 If only one stream is present, it is a single bitstream occupying the 600 entire remaining portion of the literals block, encoded as described 601 within Section 2.4.2.2. 603 If there are four streams, the literals section header only provides 604 enough information to know the decompressed and compressed sizes of 605 all four streams combined. The decompressed size of each stream is 606 equal to (Regenerated_Size+3)/4, except for the last stream which may 607 be up to three bytes smaller, to reach a total decompressed size as 608 specified in Regenerated_Size. 610 The compressed size of each stream is provided explicitly: the first 611 six bytes of the compressed data consist of three two-byte little- 612 endian fields, describing the compressed sizes of the first three 613 streams. Stream4_Size is computed from Total_Streams_Size minus 614 sizes of other streams. 616 Stream4_Size = Total_Streams_Size - 6 617 - Stream1_Size - Stream2_Size 618 - Stream3_Size 620 Note that Total_Streams_Size can be smaller than Compressed_Size in 621 the header, because Compressed_Size also contains 622 Huffman_Tree_Description_Size when it is present. 624 Each of these four bitstreams is then decoded independently as a 625 Huffman-Coded stream, as described in Section 2.4.2.2. 627 2.1.1.3.2. Sequences Section 629 A compressed block is a succession of sequences. A sequence is a 630 literal copy command, followed by a match copy command. A literal 631 copy command specifies a length. It is the number of bytes to be 632 copied (or extracted) from the Literals Section. A match copy 633 command specifies an offset and a length. 635 When all sequences are decoded, if there are literals left in the 636 literal section, these bytes are added at the end of the block. 638 This is described in more detail in Section 2.2. 640 The Sequences_Section regroups all symbols required to decode 641 commands. There are three symbol types: literals lengths, offsets, 642 and match lengths. They are encoded together, interleaved, in a 643 single "bitstream". 645 The Sequences_Section starts by a header, followed by optional 646 probability tables for each symbol type, followed by the bitstream. 648 Sequences_Section_Header 649 [Literals_Length_Table] 650 [Offset_Table] 651 [Match_Length_Table] 652 bitStream 654 To decode the Sequences_Section, it's necessary to know its size. 655 This size is deduced from Block_Size - Literals_Section_Size. 657 2.1.1.3.2.1. Sequences_Section_Header 659 This header consists of two items: 661 o Number_of_Sequences 663 o Symbol_Compression_Modes 665 Number_of_Sequences is a variable size field using between one and 666 three bytes. If the first byte is "byte0": 668 o if (byte0 == 0): there are no sequences. The sequence section 669 stops here. Decompressed content is defined entirely as Literals 670 Section content. 672 o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte. 674 o if (byte0 < 255): Number_of_Sequences = ((byte0-128) << 8) + 675 byte1. Uses 2 bytes. 677 o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2<<8) + 678 0x7F00. Uses 3 bytes. 680 Symbol_Compression_Modes is a single byte, defining the compression 681 mode of each symbol type. 683 +------------+----------------------+ 684 | Bit Number | Field Name | 685 +------------+----------------------+ 686 | 7-6 | Literal_Lengths_Mode | 687 +------------+----------------------+ 688 | 5-4 | Offsets_Mode | 689 +------------+----------------------+ 690 | 3-2 | Match_Lengths_Mode | 691 +------------+----------------------+ 692 | 1-0 | Reserved | 693 +------------+----------------------+ 695 The last field, Reserved, must be all zeroes. 697 Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define 698 the Compression_Mode of literals lengths, offsets, and match lengths 699 symbols respectively. They follow the same enumeration: 701 +-------+---------------------+ 702 | Value | Compression_Mode | 703 +-------+---------------------+ 704 | 0 | Predefined_Mode | 705 +-------+---------------------+ 706 | 1 | RLE_Mode | 707 +-------+---------------------+ 708 | 2 | FSE_Compressed_Mode | 709 +-------+---------------------+ 710 | 3 | Repeat_Mode | 711 +-------+---------------------+ 713 Predefined_Mode: A predefined FSE distribution table is used, 714 defined in Section 2.1.1.3.2.2. No distribution table will be 715 present. 717 RLE_Mode: The table description consists of a single byte. This 718 code will be repeated for all sequences. 720 Repeat_Mode: The table used in the previous compressed block will be 721 used again. No distribution table will be present. Note that 722 this includes RLE mode, so if Repeat_Mode follows RLE_Mode, the 723 same symbol will be repeated. If this mode is used without any 724 previous sequence table in the frame (or dictionary; see 725 Section 2.5) to repeat, this should be treated as corruption. 727 FSE_Compressed_Mode: Standard FSE compression. A distribution table 728 will be present. The format of this distribution table is 729 described in Section 2.4.1.1. Note that the maximum allowed 730 accuracy log for literals length and match length tables is 9, and 731 the maximum accuracy log for the offsets table is 8. 733 Each symbol is a code in its own context, which specifies Baseline 734 and Number_of_Bits to add. Codes are FSE compressed, and interleaved 735 with raw additional bits in the same bitstream. 737 Literals length codes are values ranging from 0 to 35 inclusive. 738 They define lengths from 0 to 131071 bytes. The literals length is 739 equal to the decoded Baseline plus the result of reading 740 Number_of_Bits bits from the bitstream, as a little-endian value. 742 +----------------------+----------+----------------+ 743 | Literals_Length_Code | Baseline | Number_of_Bits | 744 +----------------------+----------+----------------+ 745 | 0-15 | length | 0 | 746 +----------------------+----------+----------------+ 747 | 16 | 16 | 1 | 748 +----------------------+----------+----------------+ 749 | 17 | 18 | 1 | 750 +----------------------+----------+----------------+ 751 | 18 | 20 | 1 | 752 +----------------------+----------+----------------+ 753 | 19 | 22 | 1 | 754 +----------------------+----------+----------------+ 755 | 20 | 24 | 2 | 756 +----------------------+----------+----------------+ 757 | 21 | 28 | 2 | 758 +----------------------+----------+----------------+ 759 | 22 | 32 | 3 | 760 +----------------------+----------+----------------+ 761 | 23 | 40 | 3 | 762 +----------------------+----------+----------------+ 763 | 24 | 48 | 4 | 764 +----------------------+----------+----------------+ 765 | 25 | 64 | 6 | 766 +----------------------+----------+----------------+ 767 | 26 | 128 | 7 | 768 +----------------------+----------+----------------+ 769 | 27 | 256 | 8 | 770 +----------------------+----------+----------------+ 771 | 28 | 512 | 9 | 772 +----------------------+----------+----------------+ 773 | 29 | 1024 | 10 | 774 +----------------------+----------+----------------+ 775 | 30 | 2048 | 11 | 776 +----------------------+----------+----------------+ 777 | 31 | 4096 | 12 | 778 +----------------------+----------+----------------+ 779 | 32 | 8192 | 13 | 780 +----------------------+----------+----------------+ 781 | 33 | 16384 | 14 | 782 +----------------------+----------+----------------+ 783 | 34 | 32768 | 15 | 784 +----------------------+----------+----------------+ 785 | 35 | 65536 | 16 | 786 +----------------------+----------+----------------+ 788 Match length codes are values ranging from 0 to 52 included. They 789 define lengths from 3 to 131074 bytes. The match length is equal to 790 the decoded Baseline plus the result of reading Number_of_Bits bits 791 from the bitstream, as a little-endian value. 793 +-------------------+----------+----------------+ 794 | Match_Length_Code | Baseline | Number_of_Bits | 795 +-------------------+----------+----------------+ 796 | 0-31 | length | 0 | 797 +-------------------+----------+----------------+ 798 | 32 | 35 | 1 | 799 +-------------------+----------+----------------+ 800 | 33 | 37 | 1 | 801 +-------------------+----------+----------------+ 802 | 34 | 39 | 1 | 803 +-------------------+----------+----------------+ 804 | 35 | 41 | 1 | 805 +-------------------+----------+----------------+ 806 | 36 | 43 | 2 | 807 +-------------------+----------+----------------+ 808 | 37 | 47 | 2 | 809 +-------------------+----------+----------------+ 810 | 38 | 51 | 3 | 811 +-------------------+----------+----------------+ 812 | 39 | 59 | 3 | 813 +-------------------+----------+----------------+ 814 | 40 | 67 | 4 | 815 +-------------------+----------+----------------+ 816 | 41 | 83 | 4 | 817 +-------------------+----------+----------------+ 818 | 42 | 99 | 5 | 819 +-------------------+----------+----------------+ 820 | 43 | 131 | 7 | 821 +-------------------+----------+----------------+ 822 | 44 | 259 | 8 | 823 +-------------------+----------+----------------+ 824 | 45 | 515 | 9 | 825 +-------------------+----------+----------------+ 826 | 46 | 1027 | 10 | 827 +-------------------+----------+----------------+ 828 | 47 | 2051 | 11 | 829 +-------------------+----------+----------------+ 830 | 48 | 4099 | 12 | 831 +-------------------+----------+----------------+ 832 | 49 | 8195 | 13 | 833 +-------------------+----------+----------------+ 834 | 50 | 16387 | 14 | 835 +-------------------+----------+----------------+ 836 | 51 | 32771 | 15 | 837 +-------------------+----------+----------------+ 838 | 52 | 65539 | 16 | 839 +-------------------+----------+----------------+ 841 Offset codes are values ranging from 0 to N. 843 A decoder is free to limit its maximum supported value for N. Support 844 for values of at least 22 is recommended. At the time of this 845 writing, the reference decoder supports a maximum N value of 28 in 846 64-bits mode. 848 An offset code is also the number of additional bits to read in 849 little-endian fashion, and can be translated into an Offset_Value 850 using the following formulas: 852 Offset_Value = (1 << offsetCode) + readNBits(offsetCode); 853 if (Offset_Value > 3) offset = Offset_Value - 3; 855 This means that maximum Offset_Value is (2^(N+1))-1 and it supports 856 back-reference distance up to (2^(N+1))-4 but is limited by maximum 857 back-reference distance (see Section 2.1.1.1.2). 859 Offset_Value from 1 to 3 are special: they define "repeat codes". 860 This is described in more detail in Repeat Offsets. 862 FSE bitstreams are read in reverse direction than written. In zstd, 863 the compressor writes bits forward into a block and the decompressor 864 must read the bitstream backwards. 866 To find the start of the bitstream it is therefore necessary to know 867 the offset of the last byte of the block which can be found by 868 counting Block_Size bytes after the block header. 870 After writing the last bit containing information, the compressor 871 writes a single 1-bit and then fills the byte with 0-7 zero bits of 872 padding. The last byte of the compressed bitstream cannot be zero 873 for that reason. 875 When decompressing, the last byte containing the padding is the first 876 byte to read. The decompressor needs to skip 0-7 initial zero bits 877 until the first one bit occurs. Afterwards, the useful part of the 878 bitstream begins. 880 FSE decoding requires a 'state' to be carried from symbol to symbol. 881 For more explanation on FSE decoding, see Section 2.4.1. 883 For sequence decoding, a separate state keeps track of each literal 884 lengths, offsets, and match lengths symbols. Some FSE primitives are 885 also used. For more details on the operation of these primitives, 886 see Section 2.4.1. 888 The bitstream starts with initial FSE state values, each using the 889 required number of bits in their respective accuracy, decoded 890 previously from their normalized distribution. It starts with 891 Literals_Length_State, followed by Offset_State, and finally 892 Match_Length_State. 894 Note that all values are read backward, so the 'start' of the 895 bitstream is at the highest position in memory, immediately before 896 the last one bit for padding. 898 After decoding the starting states, a single sequence is decoded 899 Number_Of_Sequences times. These sequences are decoded in order from 900 first to last. Since the compressor writes the bitstream in the 901 forward direction, this means the compressor must encode the 902 sequences starting with the last one and ending with the first. 904 For each of the symbol types, the FSE state can be used to determine 905 the appropriate code. The code then defines the baseline and number 906 of bits to read for each type. The description of the codes for how 907 to determine these values was presented earlier. 909 Decoding starts by reading the Number_of_Bits required to decode 910 Offset. It then does the same for Match_Length, and then for 911 Literals_Length. This sequence is then used for sequence execution 912 (see Section 2.2). 914 If it is not the last sequence in the block, the next operation is to 915 update states. Using the rules pre-calculated in the decoding 916 tables, Literals_Length_State is updated, followed by 917 Match_Length_State, and then Offset_State. See Section 2.4.1 for 918 details on how to update states from the bitstream. 920 This operation will be repeated Number_of_Sequences times. At the 921 end, the bitstream shall be entirely consumed, otherwise the 922 bitstream is considered corrupted. 924 2.1.1.3.2.2. Default Distributions 926 If Predefined_Mode is selected for a symbol type, its FSE decoding 927 table is generated from a predefined distribution table defined here. 928 For details on how to convert this distribution into a decoding 929 table, see Section 2.4.1. 931 2.1.1.3.2.2.1. Literals Length 933 The decoding table uses an accuracy log of 6 bits (64 states). 935 short literalsLength_defaultDistribution[36] = 936 { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 937 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 938 -1,-1,-1,-1 939 }; 941 2.1.1.3.2.2.2. Match Length 943 The decoding table uses an accuracy log of 6 bits (64 states). 945 short matchLengths_defaultDistribution[53] = 946 { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 947 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 948 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 949 -1,-1,-1,-1,-1 950 }; 952 2.1.1.3.2.2.3. Offset Codes 954 The decoding table uses an accuracy log of 5 bits (32 states), and 955 supports a maximum N value of 28, allowing offset values up to 956 536,870,908. 958 If any sequence in the compressed block requires a larger offset than 959 this, it's not possible to use the default distribution to represent 960 it. 962 short offsetCodes_defaultDistribution[29] = 963 { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 964 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 965 }; 967 2.2. Sequence Execution 969 Once literals and sequences have been decoded, they are combined to 970 produce the decoded content of a block. 972 Each sequence consists of a tuple of (literals_length, offset_value, 973 match_length), decoded as described in the Sequences Section 974 (Section 2.1.1.3.2). To execute a sequence, first copy 975 literals_length bytes from the literals section to the output. 977 Then match_length bytes are copied from previous decoded data. The 978 offset to copy from is determined by offset_value: 980 o if offset_value > 3, then the offset is offset_value - 3; 981 o if offset_value is from 1-3, the offset is a special repeat offset 982 value. See Section 2.2.1 for how the offset is determined in this 983 case. 985 The offset is defined as from the current position, so an offset of 6 986 and a match length of 3 means that 3 bytes should be copied from 6 987 bytes back. Note that all offsets leading to previously decoded data 988 must be smaller than Window_Size defined in Frame_Header_Descriptor 989 (Section 2.1.1.1.1). 991 2.2.1. Repeat Offsets 993 As seen above, the first three values define a repeated offset and we 994 will call them Repeated_Offset1, Repeated_Offset2, and 995 Repeated_Offset3. They are sorted in recency order, with 996 Repeated_Offset1 meaning "most recent one". 998 If offset_value is 1, then the offset used is Repeated_Offset1, etc. 1000 There is one exception: When the current sequence's literals_length 1001 is 0, repeated offsets are shifted by one, so an offset_value of 1 1002 means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3, 1003 and an offset_value of 3 means Repeated_Offset1 - 1_byte. 1005 For the first block, the starting offset history is populated with 1006 the following values : 1, 4 and 8 (in order). 1008 Then each block gets its starting offset history from the ending 1009 values of the most recent Compressed_Block. Note that blocks that 1010 are not Compressed_Block are skipped; they do not contribute to 1011 offset history. 1013 The newest offset takes the lead in offset history, shifting others 1014 back (up to its previous place if it was already present). This 1015 means that when Repeated_Offset1 (most recent) is used, history is 1016 unmodified. When Repeated_Offset2 is used, it is swapped with 1017 Repeated_Offset1. If any other offset is used, it becomes 1018 Repeated_Offset1 and the rest are shifted back by one. 1020 2.3. Skippable Frames 1022 +--------------+------------+-----------+ 1023 | Magic_Number | Frame_Size | User_Data | 1024 +--------------+------------+-----------+ 1025 | 4 bytes | 4 bytes | n bytes | 1026 +--------------+------------+-----------+ 1028 Skippable frames allow the insertion of user-defined data into a flow 1029 of concatenated frames. Its design is pretty straightforward, with 1030 the sole objective to allow the decoder to quickly skip over user- 1031 defined data and continue decoding. 1033 Skippable frames defined in this specification are compatible with 1034 skippable frames in [LZ4]. 1036 The fields are: 1038 Magic_Number: Four bytes, little-endian format. Value: 0x184D2A5?, 1039 which means any value from 0x184D2A50 to 0x184D2A5F. All 16 values 1040 are valid to identify a skippable frame. 1042 Frame_Size: This is the size, in bytes, of the following User_Data 1043 (without including the magic number nor the size field itself). 1044 This field is represented using four bytes, little-endian format, 1045 unsigned 32-bits. This means User_Data can't be bigger than 1046 (2^32-1) bytes. 1048 User_Data: This field can be anything. Data will just be skipped by 1049 the decoder. 1051 2.4. Entropy Encoding 1053 Two types of entropy encoding are used by the Zstandard format: FSE, 1054 and Huffman coding. 1056 2.4.1. FSE 1058 FSE, short for Finite State Entropy, is an entropy codec based on 1059 [ANS]. FSE encoding/decoding involves a state that is carried over 1060 between symbols, so decoding must be done in the opposite direction 1061 as encoding. Therefore, all FSE bitstreams are read from end to 1062 beginning. 1064 For additional details on FSE, see Finite State Entropy [FSE]. 1066 FSE decoding involves a decoding table that has a power of two size, 1067 and contains three elements: Symbol, Num_Bits, and Baseline. The 1068 base two logarithm of the table size is its Accuracy_Log. The FSE 1069 state represents an index in this table. 1071 To obtain the initial state value, consume Accuracy_Log bits from the 1072 stream as a little-endian value. The next symbol in the stream is 1073 the Symbol indicated in the table for that state. To obtain the next 1074 state value, the decoder should consume Num_Bits bits from the stream 1075 as a little-endian value and add it to Baseline. 1077 2.4.1.1. FSE Table Description 1079 To decode FSE streams, it is necessary to construct the decoding 1080 table. The Zstandard format encodes FSE table descriptions as 1081 described here. 1083 An FSE distribution table describes the probabilities of all symbols 1084 from 0 to the last present one (included) on a normalized scale of (1 1085 << Accuracy_Log), meaning a binary 1 left-shifted Accuracy_Log bits. 1087 A bitstream is read forward, in little-endian fashion. It is not 1088 necessary to know its exact size, since the size will be discovered 1089 and reported by the decoding process. The bitstream starts by 1090 reporting on which scale it operates. Note that Accuracy_Log = 1091 low4bits + 5. 1093 This is followed by each symbol value, from 0 to the last present 1094 one. The number of bits used by each field is variable and depends 1095 on: 1097 Remaining probabilities + 1: For example, presuming an Accuracy_Log 1098 of 8, and presuming 100 probabilities points have already been 1099 distributed, the decoder may read any value from 0 to (255 - 100 + 1100 1) == 156, inclusive. Therefore, it must read log2sup(156) == 8 1101 bits. 1103 Value decoded: Small values use one less bit. For example, 1104 presuming values from 0 to 156 (inclusive) are possible, 255 - 156 1105 = 99 values are remaining in an 8-bits field. The first 99 values 1106 (hence from 0 to 98) use only 7 bits, and values from 99 to 156 1107 use 8 bits. This is achieved through this scheme: 1109 +------------+---------------+-----------+ 1110 | Value read | Value decoded | Bits used | 1111 +------------+---------------+-----------+ 1112 | 0 - 98 | 0 - 98 | 7 | 1113 +------------+---------------+-----------+ 1114 | 99 - 127 | 99 - 127 | 8 | 1115 +------------+---------------+-----------+ 1116 | 128 - 226 | 0 - 98 | 7 | 1117 +------------+---------------+-----------+ 1118 | 227 - 255 | 128 - 156 | 8 | 1119 +------------+---------------+-----------+ 1121 Symbol probabilities are read one by one, in order. The probability 1122 is obtained from Value decoded using the formula P = Value - 1. This 1123 means the value 0 becomes the negative probability -1. This is a 1124 special probability that means "less than 1". Its effect on the 1125 distribution table is described below. For the purpose of 1126 calculating total allocated probability points, it counts as 1. 1128 When a symbol has a probability of zero, it is followed by a 2-bit 1129 repeat flag. This repeat flag tells how many probabilities of zeroes 1130 follow the current one. It provides a number ranging from 0 to 3. 1131 If it is a 3, another 2-bit repeat flag follows, and so on. 1133 When the last symbol reaches a cumulated total of (1 << 1134 Accuracy_Log), decoding is complete. If the last symbol makes the 1135 cumulated total go above (1 << Accuracy_Log), distribution is 1136 considered corrupted. 1138 Finally, the decoder can tell how many bytes were used in this 1139 process, and how many symbols are present. The bitstream consumes a 1140 round number of bytes. Any remaining bit within the last byte is 1141 simply unused. 1143 The distribution of normalized probabilities is enough to create a 1144 unique decoding table. The table has a size of (1 << Accuracy_Log). 1145 Each cell describes the symbol decoded, and instructions to get the 1146 next state. 1148 Symbols are scanned in their natural order for "less than 1" 1149 probabilities as described above. Symbols with this probability are 1150 being attributed a single cell, starting from the end of the table. 1151 These symbols define a full state reset, reading Accuracy_Log bits. 1153 All remaining symbols are sorted in their natural order. Starting 1154 from symbol 0 and table position 0, each symbol gets attributed as 1155 many cells as its probability. Cell allocation is non-linear linear; 1156 each successor position follow this rule: 1158 position += (tableSize >> 1) + (tableSize >> 3) + 3; 1159 position &= tableSize - 1; 1161 A position is skipped if it is already occupied by a "less than 1" 1162 probability symbol. Position does not reset between symbols; it 1163 simply iterates through each position in the table, switching to the 1164 next symbol when enough states have been allocated to the current 1165 one. 1167 The result is a list of state values. Each state will decode the 1168 current symbol. 1170 To get the Number_of_Bits and Baseline required for the next state, 1171 it is first necessary to sort all states in their natural order. The 1172 lower states will need one more bit than higher ones. 1174 For example, presuming a symbol has a probability of 5, it receives 1175 five state values. States are sorted in natural order. The next 1176 power of two is 8. The space of probabilities is divided into 8 1177 equal parts. Presuming the Accuracy_Log is 7, this defines 128 1178 states, and each share (divided by 8) is 16 in size. In order to 1179 reach 8, 8 - 5 = 3 lowest states will count "double", doubling the 1180 number of shares, requiring one more bit in the process. 1182 Numbering starts from higher states using fewer bits. 1184 +----------------+-------+-------+--------+------+-------+ 1185 | state order | 0 | 1 | 2 | 3 | 4 | 1186 +----------------+-------+-------+--------+------+-------+ 1187 | width | 32 | 32 | 32 | 16 | 16 | 1188 +----------------+-------+-------+--------+------+-------+ 1189 | Number_of_Bits | 5 | 5 | 5 | 4 | 4 | 1190 +----------------+-------+-------+--------+------+-------+ 1191 | range number | 2 | 4 | 6 | 0 | 1 | 1192 +----------------+-------+-------+--------+------+-------+ 1193 | Baseline | 32 | 64 | 96 | 0 | 16 | 1194 +----------------+-------+-------+--------+------+-------+ 1195 | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1196 +----------------+-------+-------+--------+------+-------+ 1198 The next state is determined from the current state by reading the 1199 required Number_of_Bits, and adding the specified Baseline. 1201 See Appendix B for the results of this process applied to the default 1202 distributions. 1204 2.4.2. Huffman Coding 1206 Zstandard Huffman-coded streams are read backwards, similar to the 1207 FSE bitstreams. Therefore, to find the start of the bitstream, it is 1208 necessary to know the offset of the last byte of the Huffman-coded 1209 stream. 1211 After writing the last bit containing information, the compressor 1212 writes a single 1-bit and then fills the byte with 0-7 0 bits of 1213 padding. The last byte of the compressed bitstream cannot be 0 for 1214 that reason. 1216 When decompressing, the last byte containing the padding is the first 1217 byte to read. The decompressor needs to skip 0-7 initial 0-bits and 1218 the first 1-bit lt occurs. Afterwards, the useful part of the 1219 bitstream begins. 1221 The bitstream contains Huffman-coded symbols in little-endian order, 1222 with the codes defined by the method below. 1224 2.4.2.1. Huffman Tree Description 1226 Prefix coding represents symbols from an a priori known alphabet by 1227 bit sequences (codewords), one codeword for each symbol, in a manner 1228 such that different symbols may be represented by bit sequences of 1229 different lengths, but a parser can always parse an encoded string 1230 unambiguously symbol-by-symbol. 1232 Given an alphabet with known symbol frequencies, the Huffman 1233 algorithm allows the construction of an optimal prefix code using the 1234 fewest bits of any possible prefix codes for that alphabet. 1236 The prefix code must not exceed a maximum code length. More bits 1237 improve accuracy but yield a larger header size, and require more 1238 memory or more complex decoding operations. This specification 1239 limits the maximum code length to 11 bits. 1241 All literal values from zero (included) to the last present one 1242 (excluded) are represented by Weight with values from 0 to 1243 Max_Number_of_Bits. Transformation from Weight to Number_of_Bits 1244 follows this pseudocode: 1246 if Weight == 0 1247 Number_of_Bits = 0 1248 else 1249 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1251 The last symbol's Weight is deduced from previously decoded ones, by 1252 completing to the nearest power of 2. This power of 2 gives 1253 Max_Number_of_Bits, the depth of the current tree. 1255 For example, presume the following Huffman tree must be described: 1257 +----------------+---------+ 1258 | Number_of_Bits | literal | 1259 +----------------+---------+ 1260 | 1 | 0 | 1261 +----------------+---------+ 1262 | 2 | 1 | 1263 +----------------+---------+ 1264 | 3 | 2 | 1265 +----------------+---------+ 1266 | 0 | 3 | 1267 +----------------+---------+ 1268 | 4 | 4 | 1269 +----------------+---------+ 1270 | 4 | 5 | 1271 +----------------+---------+ 1273 The tree depth is 4, since its smallest element uses 4 bits. Value 5 1274 will not be listed as it can be determined from the values for 0-4, 1275 nor will values above 5 as they are all 0. Values from 0 to 4 will 1276 be listed using Weight instead of Number_of_Bits. The pseudocode to 1277 determine Weight is: 1279 if Number_of_Bits == 0 1280 Weight = 0 1281 else 1282 Weight = Max_Number_of_Bits + 1 - Number_of_Bits 1284 It gives the following series of weights: 1286 +---------+--------+ 1287 | literal | Weight | 1288 +---------+--------+ 1289 | 0 | 4 | 1290 +---------+--------+ 1291 | 1 | 3 | 1292 +---------+--------+ 1293 | 2 | 2 | 1294 +---------+--------+ 1295 | 3 | 0 | 1296 +---------+--------+ 1297 | 4 | 1 | 1298 +---------+--------+ 1300 The decoder will do the inverse operation: having collected weights 1301 of literals from 0 to 4, it knows the last literal, 5, is present 1302 with a non-zero weight. The weight of 5 can be determined by 1303 advancing to the next power of 2. The sum of 2^(Weight-1) (excluding 1304 0's) is 15. The nearest power of 2 is 16. Therefore, 1305 Max_Number_of_Bits = 4 and Weight[5] = 1. 1307 2.4.2.1.1. Huffman Tree Header 1309 This is a single byte value (0-255), which describes how to decode 1310 the list of weights. 1312 headerByte >= 128: This is a direct representation, where each 1313 Weight is written directly as a 4-bit field (0-15). They are 1314 encoded forward, two weights to a byte with the first weight 1315 taking the top four bits and the second taking the bottom four 1316 (e.g. the following operations could be used to read the weights: 1318 Weight[0] = (Byte[0] >> 4) 1319 Weight[1] = (Byte[0] & 0xf), 1320 etc. 1322 The full representation occupies (Number_of_Symbols+1)/2 bytes, 1323 meaning it uses a last full byte even if Number_of_Symbols is odd. 1324 Number_of_Symbols is equal to headerByte - 127. Note that maximum 1325 Number_of_Symbols is 255-127 = 128. A larger series must 1326 necessarily use FSE compression. 1328 headerByte < 128: The series of weights is compressed by FSE. The 1329 length of the FSE-compressed series is equal to this value 1330 (0-127). 1332 2.4.2.1.2. FSE Compression of Huffman Weights 1334 In this case, the series of Huffman weights is compressed using FSE 1335 compression. It is a single bitstream with two interleaved states, 1336 sharing a single distribution table. 1338 To decode an FSE bitstream, it is necessary to know its compressed 1339 size. Compressed size is provided by headerByte. It's also 1340 necessary to know its maximum possible decompressed size, which is 1341 255, since literal values span from 0 to 255, and the last symbol's 1342 weight is not represented. 1344 An FSE bitstream starts by a header, describing probabilities 1345 distribution. It will create a Decoding Table. For a list of 1346 Huffman weights, the maximum accuracy log is 7 bits. For more 1347 description see Section 2.4.1.1. 1349 The Huffman header compression uses two states, which share the same 1350 FSE distribution table. The first state (State1) encodes the even 1351 indexed symbols, and the second (State2) encodes the odd indexes. 1352 State1 is initialized first, and then State2, and they take turns 1353 decoding a single symbol and updating their state. For more details 1354 on these FSE operations, see the FSE section. 1356 The number of symbols to decode is determined by tracking the 1357 bitStream overflow condition: If updating state after decoding a 1358 symbol would require more bits than remain in the stream, it is 1359 assumed that extra bits are zero. Then, the symbols for each of the 1360 I final states are decoded and the process is complete. 1362 2.4.2.1.3. Conversion from Weights to Huffman Prefix Codes 1364 All present symbols will now have a Weight value. It is possible to 1365 transform weights into Number_of_Bits, using this formula: 1367 if Number_of_Bits != 0 1368 Number_of_Bits = Max_Number_of_Bits + 1 - Weight 1370 Symbols are sorted by Weight. Within same Weight, symbols keep 1371 natural order. Symbols with a Weight of zero are removed. Then, 1372 starting from lowest weight, prefix codes are distributed in order. 1374 For example, assume the following list of weights has been decoded: 1376 +---------+--------+ 1377 | Literal | Weight | 1378 +---------+--------+ 1379 | 0 | 4 | 1380 +---------+--------+ 1381 | 1 | 3 | 1382 +---------+--------+ 1383 | 2 | 2 | 1384 +---------+--------+ 1385 | 3 | 0 | 1386 +---------+--------+ 1387 | 4 | 1 | 1388 +---------+--------+ 1389 | 5 | 1 | 1390 +---------+--------+ 1392 Sorted by weight and then the natural order, yielding the following 1393 distribution: 1395 +---------+--------+----------------+--------------+ 1396 | Literal | Weight | Number_Of_Bits | prefix codes | 1397 +---------+--------+----------------|--------------+ 1398 | 3 | 0 | 0 | N/A | 1399 +---------+--------+----------------|--------------+ 1400 | 4 | 1 | 4 | 0000 | 1401 +---------+--------+----------------|--------------+ 1402 | 5 | 1 | 4 | 0001 | 1403 +---------+--------+----------------|--------------+ 1404 | 2 | 2 | 3 | 001 | 1405 +---------+--------+----------------|--------------+ 1406 | 1 | 3 | 2 | 01 | 1407 +---------+--------+----------------|--------------+ 1408 | 0 | 4 | 1 | 1 | 1409 +---------+--------+----------------|--------------+ 1411 2.4.2.2. Huffman-coded Streams 1413 Given a Huffman decoding table, it is possible to decode a Huffman- 1414 coded stream. 1416 Each bitstream must be read backward, that is starting from the end 1417 up to the beginning. Therefore, it is necessary to know the size of 1418 each bitstream. 1420 It is also necessary to know exactly which bit is the latest. This 1421 is detected by a final bit flag: the highest bit of latest byte is a 1422 final-bit-flag. Consequently, a last byte of 0 is not possible. And 1423 the final-bit-flag itself is not part of the useful bitstream. 1424 Hence, the last byte contains between 0 and 7 useful bits. 1426 Starting from the end, it is possible to read the bitstream in a 1427 little-endian fashion, keeping track of already used bits. Since the 1428 bitstream is encoded in reverse order, starting from the end, read 1429 symbols in forward order. 1431 For example, if the literal sequence "0145" was encoded using above 1432 prefix code, it would be encoded (in reverse order) as: 1434 +---------+----------+ 1435 | Symbol | Encoding | 1436 +---------+----------+ 1437 | 5 | 0000 | 1438 +---------+----------+ 1439 | 4 | 0001 | 1440 +---------+----------+ 1441 | 1 | 01 | 1442 +---------+----------+ 1443 | 0 | 1 | 1444 +---------+----------+ 1445 | Padding | 00001 | 1446 +---------+----------+ 1448 This results in the following two-byte bitstream: 1450 00010000 00001101 1452 Here is an alternative representation with the symbol codes separated 1453 by underscores: 1455 0001_0000 00001_1_01 1457 Reading the highest Max_Number_of_Bits bits, it's possible to compare 1458 the extracted value to the decoding table, determining the symbol to 1459 decode and number of bits to discard. 1461 The process continues up to reading the required number of symbols 1462 per stream. If a bitstream is not entirely and exactly consumed, 1463 hence reaching exactly its beginning position with all bits consumed, 1464 the decoding process is considered faulty. 1466 2.5. Dictionary Format 1468 Zstandard is compatible with "raw content" dictionaries, free of any 1469 format restriction, except that they must be at least eight bytes. 1470 These dictionaries function as if they were just the Content part of 1471 a formatted dictionary. 1473 However, dictionaries created by "zstd --train" in the reference 1474 implementation follow a specific format, described here. 1476 A dictionary has a size, defined either by a buffer limit or a file 1477 size. The general format is: 1479 +--------------+---------------+----------------+---------+ 1480 | Magic_Number | Dictionary_ID | Entropy_Tables | Content | 1481 +--------------+---------------+----------------+---------+ 1483 Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format 1485 Dictionary_ID: 4 bytes, stored in little-endian format. 1486 Dictionary_ID can be any value, except 0 (which means no 1487 Dictionary_ID). It is used by decoders to check if they use the 1488 correct dictionary. If the frame is going to be distributed in a 1489 private environment, any Dictionary_ID can be used. However, for 1490 public distribution of compressed frames, the following ranges are 1491 reserved and shall not be used: 1493 - low range : <= 32767 1494 - high range : >= (2^31) 1496 Entropy_Tables: Following the same format as the tables in 1497 compressed blocks. See the relevant FSE and Huffman sections for 1498 how to decode these tables. They are stored in following order: 1499 Huffman tables for literals, FSE table for offsets, FSE table for 1500 match lengths, and FSE table for literals lengths. These tables 1501 populate the Repeat Stats literals mode and Repeat distribution 1502 mode for sequence decoding. It is finally followed by 3 offset 1503 values, populating recent offsets (instead of using {1,4,8}), 1504 stored in order, 4-bytes little-endian each, for a total of 12 1505 bytes. Each recent offset must have a value less than the 1506 dictionary size. 1508 Content: The rest of the dictionary is its content. The content act 1509 as a "past" in front of data to compress or decompress, so it can 1510 be referenced in sequence commands. As long as the amount of data 1511 decoded from this frame is less than or equal to Window_Size, 1512 sequence commands may specify offsets longer than the total length 1513 of decoded output so far to reference back to the dictionary. 1514 After the total output has surpassed Window_Size however, this is 1515 no longer allowed and the dictionary is no longer accessible. 1517 3. IANA Considerations 1519 This document contains two registration actions for IANA. 1521 3.1. The 'application/zstd' Media Type 1523 The 'application/zstd' media type identifies a block of data that is 1524 compressed using zstd compression. The data is a stream of bytes as 1525 described in this document. IANA is requested to add the following 1526 to the Media Types registry: 1528 Type name: application 1530 Subtype name: zstd 1532 Required parameters: N/A 1534 Optional parameters: N/A 1536 Encoding considerations: binary 1538 Security considerations: See Section 4 1540 Interoperability considerations: N/A 1542 Published specification: [ZSTD] 1544 Applications that use this media type: anywhere data size is an 1545 issue 1547 Additional information: 1549 Magic number(s): 4 Bytes, little-endian format. Value : 1550 0xFD2FB528 1552 File extension(s): zstd 1554 Macintosh file type code(s): N/A 1556 For further information: See [ZSTD] 1558 Intended usage: common 1560 Restrictions on usage: N/A 1562 Author: Murray S. Kucherawy 1564 Change Controller: IETF 1566 Provisional registration: yes 1568 3.2. Content Encoding 1570 IANA is requested to add the following entry to the HTTP Content 1571 Coding Parameters subregistry within the Hypertext Transfer Protocol 1572 (HTTP) registry: 1574 Name: zstd 1576 Description: A stream of bytes compressed using the Zstandard 1577 protocol 1579 Pointer to specification text: [this document] 1581 4. Security Considerations 1583 Any data compression method involves the reduction of redundancy in 1584 the data. Zstandard is no exception, and the usual precautions 1585 apply. 1587 One should never compress together a message whose content must 1588 remain secret with a message under control of a third party. This 1589 can be used to guess the content of the secret message through 1590 analysis of entropy reduction. This was demonstrated in the [CRIME] 1591 attack for example. 1593 A decoder has to demonstrate capabilities to detect and prevent any 1594 kind of data tampering in the compressed frame from triggering system 1595 faults, such as reading or writing beyond allowed memory ranges. 1596 This can be guaranteed either by the implementation language, or by 1597 careful bound checkings. It is highly recommended to fuzz-test 1598 decoder implementations to test and harden their capability to detect 1599 bad frames and deal with them without any system side-effect. 1601 An attacker may provide correctly formed compressed frames with 1602 unreasonable memory requirements. A decoder must always control 1603 memory requirements and enforce some (system-specific) limits in 1604 order to protect memory usage from such scenarios. 1606 5. Implementation Status 1608 [RFC EDITOR: Please remove this section prior to publication.] 1610 Source code for a C language implementation of a "Zstandard" 1611 compliant library is available at [ZSTD-GITHUB]. This implementation 1612 is production ready, implementing the full range of the 1613 specification. It is tested against security hazards, and widely 1614 deployed within Facebook infrastructure. 1616 The reference version is speed optimised and highly portable. It has 1617 been proven to run safely on multiple architectures (x86, x64, ARM, 1618 MIPS, PowerPC, IA64) featuring 32 or 64-bits addressing schemes, 1619 little or big endian storage scheme, a number of different operating 1620 systems, UNIX (including Linux, BSD, OS-X and Solaris), and Windows, 1621 and a number of compilers (gcc, clang, visual, icc). 1623 The C reference version is also used to bind into multiple languages, 1624 a partial list of which (~20 of them) is being maintained at 1625 [ZSTD-OTHER]. 1627 The reference repository also contains an independently developed 1628 educational decoder, by Sean Purcell, created from the Zstandard 1629 format specification and built for clarity to help third party 1630 implementers. This is available at [ZSTD-EDU]. 1632 A specific version has been created for integration into the Linux 1633 kernel in order to provide compatibility with relevant memory 1634 restrictions. It was released in version 4.14 of the kernel. See 1635 [ZSTD-LINUX]. 1637 A Java native implementation of the decoder has been developed and 1638 open-sourced by the Presto team. This is available at [ZSTD-JAVA]. 1640 As of early July 2017, we are aware of one other decoder 1641 implementation in assembler, two full codec hardware implementations 1642 (programmable and ASIC) being actively developed, and a third one 1643 being evaluated. We are not permitted to disclose them at this 1644 stage. 1646 6. References 1648 6.1. Normative References 1650 [ZSTD] "Zstandard - Real-time data compression algorithm", 1651 2017, . 1653 6.2. Informative References 1655 [ANS] "Asymmetric Numeral Systems: Entropy Coding Combining 1656 Speed of Huffman Coding with Compression Rate of 1657 Arithmetic Coding", 2017, 1658 . 1660 [CRIME] "Compression Ratio Info-leak Made Easy", 2017, 1661 . 1663 [FSE] "Finite State Entropy", 2017, 1664 . 1666 [LZ4] "LZ4 Frame Format Description", 2017, . 1670 [RFC1952] Deutsch, P., "GZIP file format specification version 1671 4.3", RFC 1952, DOI 10.17487/RFC1952, May 1996, 1672 . 1674 [XXHASH] "XXHASH Algorithm", 2017, . 1676 [ZSTD-EDU] "Zstandard Educational Decoder", 2017, . 1680 [ZSTD-GITHUB] "Zstandard Github Repository", 2017, 1681 . 1683 [ZSTD-JAVA] "Zstandard Github Repository", 2017, . 1687 [ZSTD-LINUX] "Zstandard Github Repository", 2017, . 1691 [ZSTD-OTHER] "Zstandard Language Bindings", 2017, 1692 . 1694 Appendix A. Acknowledgments 1696 zstd was developed by Yann Collet. 1698 Appendix B. Decoding Tables for Predefined Codes 1700 This appendix contains FSE decoding tables for the predefined literal 1701 length, match length, and offset codes. The tables have been 1702 constructed using the algorithm as given above in chapter "from 1703 normalized distribution to decoding tables". The tables here can be 1704 used as examples to crosscheck that an implementation build its 1705 decoding tables correctly. 1707 B.1. Literal Length Code Table 1709 +-------+--------+----------------+------+ 1710 | State | Symbol | Number_Of_Bits | Base | 1711 +-------+--------+----------------+------+ 1712 | 0 | 0 | 0 | 0 | 1713 +-------+--------+----------------+------+ 1714 | 0 | 0 | 4 | 0 | 1715 +-------+--------+----------------+------+ 1716 | 1 | 0 | 4 | 16 | 1717 +-------+--------+----------------+------+ 1718 | 2 | 1 | 5 | 32 | 1719 +-------+--------+----------------+------+ 1720 | 3 | 3 | 5 | 0 | 1721 +-------+--------+----------------+------+ 1722 | 4 | 4 | 5 | 0 | 1723 +-------+--------+----------------+------+ 1724 | 5 | 6 | 5 | 0 | 1725 +-------+--------+----------------+------+ 1726 | 6 | 7 | 5 | 0 | 1727 +-------+--------+----------------+------+ 1728 | 7 | 9 | 5 | 0 | 1729 +-------+--------+----------------+------+ 1730 | 8 | 10 | 5 | 0 | 1731 +-------+--------+----------------+------+ 1732 | 9 | 12 | 5 | 0 | 1733 +-------+--------+----------------+------+ 1734 | 10 | 14 | 6 | 0 | 1735 +-------+--------+----------------+------+ 1736 | 11 | 16 | 5 | 0 | 1737 +-------+--------+----------------+------+ 1738 | 12 | 18 | 5 | 0 | 1739 +-------+--------+----------------+------+ 1740 | 13 | 19 | 5 | 0 | 1741 +-------+--------+----------------+------+ 1742 | 14 | 21 | 5 | 0 | 1743 +-------+--------+----------------+------+ 1744 | 15 | 22 | 5 | 0 | 1745 +-------+--------+----------------+------+ 1746 | 16 | 24 | 5 | 0 | 1747 +-------+--------+----------------+------+ 1748 | 17 | 25 | 5 | 32 | 1749 +-------+--------+----------------+------+ 1750 | 18 | 26 | 5 | 0 | 1751 +-------+--------+----------------+------+ 1752 | 19 | 27 | 6 | 0 | 1753 +-------+--------+----------------+------+ 1754 | 20 | 29 | 6 | 0 | 1755 +-------+--------+----------------+------+ 1756 | 21 | 31 | 6 | 0 | 1757 +-------+--------+----------------+------+ 1758 | 22 | 0 | 4 | 32 | 1759 +-------+--------+----------------+------+ 1760 | 23 | 1 | 4 | 0 | 1761 +-------+--------+----------------+------+ 1762 | 24 | 2 | 5 | 0 | 1763 +-------+--------+----------------+------+ 1764 | 25 | 4 | 5 | 32 | 1765 +-------+--------+----------------+------+ 1766 | 26 | 5 | 5 | 0 | 1767 +-------+--------+----------------+------+ 1768 | 27 | 7 | 5 | 32 | 1769 +-------+--------+----------------+------+ 1770 | 28 | 8 | 5 | 0 | 1771 +-------+--------+----------------+------+ 1772 | 29 | 10 | 5 | 32 | 1773 +-------+--------+----------------+------+ 1774 | 30 | 11 | 5 | 0 | 1775 +-------+--------+----------------+------+ 1776 | 31 | 13 | 6 | 0 | 1777 +-------+--------+----------------+------+ 1778 | 32 | 16 | 5 | 32 | 1779 +-------+--------+----------------+------+ 1780 | 33 | 17 | 5 | 0 | 1781 +-------+--------+----------------+------+ 1782 | 34 | 19 | 5 | 32 | 1783 +-------+--------+----------------+------+ 1784 | 35 | 20 | 5 | 0 | 1785 +-------+--------+----------------+------+ 1786 | 36 | 22 | 5 | 32 | 1787 +-------+--------+----------------+------+ 1788 | 37 | 23 | 5 | 0 | 1789 +-------+--------+----------------+------+ 1790 | 38 | 25 | 4 | 0 | 1791 +-------+--------+----------------+------+ 1792 | 39 | 25 | 4 | 16 | 1793 +-------+--------+----------------+------+ 1794 | 40 | 26 | 5 | 32 | 1795 +-------+--------+----------------+------+ 1796 | 41 | 28 | 6 | 0 | 1797 +-------+--------+----------------+------+ 1798 | 42 | 30 | 6 | 0 | 1799 +-------+--------+----------------+------+ 1800 | 43 | 0 | 4 | 48 | 1801 +-------+--------+----------------+------+ 1802 | 44 | 1 | 4 | 16 | 1803 +-------+--------+----------------+------+ 1804 | 45 | 2 | 5 | 32 | 1805 +-------+--------+----------------+------+ 1806 | 46 | 3 | 5 | 32 | 1807 +-------+--------+----------------+------+ 1808 | 47 | 5 | 5 | 32 | 1809 +-------+--------+----------------+------+ 1810 | 48 | 6 | 5 | 32 | 1811 +-------+--------+----------------+------+ 1812 | 49 | 8 | 5 | 32 | 1813 +-------+--------+----------------+------+ 1814 | 50 | 9 | 5 | 32 | 1815 +-------+--------+----------------+------+ 1816 | 51 | 11 | 5 | 32 | 1817 +-------+--------+----------------+------+ 1818 | 52 | 12 | 5 | 32 | 1819 +-------+--------+----------------+------+ 1820 | 53 | 15 | 6 | 0 | 1821 +-------+--------+----------------+------+ 1822 | 54 | 17 | 5 | 32 | 1823 +-------+--------+----------------+------+ 1824 | 55 | 18 | 5 | 32 | 1825 +-------+--------+----------------+------+ 1826 | 56 | 20 | 5 | 32 | 1827 +-------+--------+----------------+------+ 1828 | 57 | 21 | 5 | 32 | 1829 +-------+--------+----------------+------+ 1830 | 58 | 23 | 5 | 32 | 1831 +-------+--------+----------------+------+ 1832 | 59 | 24 | 5 | 32 | 1833 +-------+--------+----------------+------+ 1834 | 60 | 35 | 6 | 0 | 1835 +-------+--------+----------------+------+ 1836 | 61 | 34 | 6 | 0 | 1837 +-------+--------+----------------+------+ 1838 | 62 | 33 | 6 | 0 | 1839 +-------+--------+----------------+------+ 1840 | 63 | 32 | 6 | 0 | 1841 +-------+--------+----------------+------+ 1843 B.2. Match Length Code Table 1845 +-------+--------+----------------+------+ 1846 | State | Symbol | Number_Of_Bits | Base | 1847 +-------+--------+----------------+------+ 1848 | 0 | 0 | 0 | 0 | 1849 +-------+--------+----------------+------+ 1850 | 0 | 0 | 6 | 0 | 1851 +-------+--------+----------------+------+ 1852 | 1 | 1 | 4 | 0 | 1853 +-------+--------+----------------+------+ 1854 | 2 | 2 | 5 | 32 | 1855 +-------+--------+----------------+------+ 1856 | 3 | 3 | 5 | 0 | 1857 +-------+--------+----------------+------+ 1858 | 4 | 5 | 5 | 0 | 1859 +-------+--------+----------------+------+ 1860 | 5 | 6 | 5 | 0 | 1861 +-------+--------+----------------+------+ 1862 | 6 | 8 | 5 | 0 | 1863 +-------+--------+----------------+------+ 1864 | 7 | 10 | 6 | 0 | 1865 +-------+--------+----------------+------+ 1866 | 8 | 13 | 6 | 0 | 1867 +-------+--------+----------------+------+ 1868 | 9 | 16 | 6 | 0 | 1869 +-------+--------+----------------+------+ 1870 | 10 | 19 | 6 | 0 | 1871 +-------+--------+----------------+------+ 1872 | 11 | 22 | 6 | 0 | 1873 +-------+--------+----------------+------+ 1874 | 12 | 25 | 6 | 0 | 1875 +-------+--------+----------------+------+ 1876 | 13 | 28 | 6 | 0 | 1877 +-------+--------+----------------+------+ 1878 | 14 | 31 | 6 | 0 | 1879 +-------+--------+----------------+------+ 1880 | 15 | 33 | 6 | 0 | 1881 +-------+--------+----------------+------+ 1882 | 16 | 35 | 6 | 0 | 1883 +-------+--------+----------------+------+ 1884 | 17 | 37 | 6 | 0 | 1885 +-------+--------+----------------+------+ 1886 | 18 | 39 | 6 | 0 | 1887 +-------+--------+----------------+------+ 1888 | 19 | 41 | 6 | 0 | 1889 +-------+--------+----------------+------+ 1890 | 20 | 43 | 6 | 0 | 1891 +-------+--------+----------------+------+ 1892 | 21 | 45 | 6 | 0 | 1893 +-------+--------+----------------+------+ 1894 | 22 | 1 | 4 | 16 | 1895 +-------+--------+----------------+------+ 1896 | 23 | 2 | 4 | 0 | 1897 +-------+--------+----------------+------+ 1898 | 24 | 3 | 5 | 32 | 1899 +-------+--------+----------------+------+ 1900 | 25 | 4 | 5 | 0 | 1901 +-------+--------+----------------+------+ 1902 | 26 | 6 | 5 | 32 | 1903 +-------+--------+----------------+------+ 1904 | 27 | 7 | 5 | 0 | 1905 +-------+--------+----------------+------+ 1906 | 28 | 9 | 6 | 0 | 1907 +-------+--------+----------------+------+ 1908 | 29 | 12 | 6 | 0 | 1909 +-------+--------+----------------+------+ 1910 | 30 | 15 | 6 | 0 | 1911 +-------+--------+----------------+------+ 1912 | 31 | 18 | 6 | 0 | 1913 +-------+--------+----------------+------+ 1914 | 32 | 21 | 6 | 0 | 1915 +-------+--------+----------------+------+ 1916 | 33 | 24 | 6 | 0 | 1917 +-------+--------+----------------+------+ 1918 | 34 | 27 | 6 | 0 | 1919 +-------+--------+----------------+------+ 1920 | 35 | 30 | 6 | 0 | 1921 +-------+--------+----------------+------+ 1922 | 36 | 32 | 6 | 0 | 1923 +-------+--------+----------------+------+ 1924 | 37 | 34 | 6 | 0 | 1925 +-------+--------+----------------+------+ 1926 | 38 | 36 | 6 | 0 | 1927 +-------+--------+----------------+------+ 1928 | 39 | 38 | 6 | 0 | 1929 +-------+--------+----------------+------+ 1930 | 40 | 40 | 6 | 0 | 1931 +-------+--------+----------------+------+ 1932 | 41 | 42 | 6 | 0 | 1933 +-------+--------+----------------+------+ 1934 | 42 | 44 | 6 | 0 | 1935 +-------+--------+----------------+------+ 1936 | 43 | 1 | 4 | 32 | 1937 +-------+--------+----------------+------+ 1938 | 44 | 1 | 4 | 48 | 1939 +-------+--------+----------------+------+ 1940 | 45 | 2 | 4 | 16 | 1941 +-------+--------+----------------+------+ 1942 | 46 | 4 | 5 | 32 | 1943 +-------+--------+----------------+------+ 1944 | 47 | 5 | 5 | 32 | 1945 +-------+--------+----------------+------+ 1946 | 48 | 7 | 5 | 32 | 1947 +-------+--------+----------------+------+ 1948 | 49 | 8 | 5 | 32 | 1949 +-------+--------+----------------+------+ 1950 | 50 | 11 | 6 | 0 | 1951 +-------+--------+----------------+------+ 1952 | 51 | 14 | 6 | 0 | 1953 +-------+--------+----------------+------+ 1954 | 52 | 17 | 6 | 0 | 1955 +-------+--------+----------------+------+ 1956 | 53 | 20 | 6 | 0 | 1957 +-------+--------+----------------+------+ 1958 | 54 | 23 | 6 | 0 | 1959 +-------+--------+----------------+------+ 1960 | 55 | 26 | 6 | 0 | 1961 +-------+--------+----------------+------+ 1962 | 56 | 29 | 6 | 0 | 1963 +-------+--------+----------------+------+ 1964 | 57 | 52 | 6 | 0 | 1965 +-------+--------+----------------+------+ 1966 | 58 | 51 | 6 | 0 | 1967 +-------+--------+----------------+------+ 1968 | 59 | 50 | 6 | 0 | 1969 +-------+--------+----------------+------+ 1970 | 60 | 49 | 6 | 0 | 1971 +-------+--------+----------------+------+ 1972 | 61 | 48 | 6 | 0 | 1973 +-------+--------+----------------+------+ 1974 | 62 | 47 | 6 | 0 | 1975 +-------+--------+----------------+------+ 1976 | 63 | 46 | 6 | 0 | 1977 +-------+--------+----------------+------+ 1979 B.3. Offset Code Table 1981 +-------+--------+----------------+------+ 1982 | State | Symbol | Number_Of_Bits | Base | 1983 +-------+--------+----------------+------+ 1984 | 0 | 0 | 0 | 0 | 1985 +-------+--------+----------------+------+ 1986 | 0 | 0 | 5 | 0 | 1987 +-------+--------+----------------+------+ 1988 | 1 | 6 | 4 | 0 | 1989 +-------+--------+----------------+------+ 1990 | 2 | 9 | 5 | 0 | 1991 +-------+--------+----------------+------+ 1992 | 3 | 15 | 5 | 0 | 1993 +-------+--------+----------------+------+ 1994 | 4 | 21 | 5 | 0 | 1995 +-------+--------+----------------+------+ 1996 | 5 | 3 | 5 | 0 | 1997 +-------+--------+----------------+------+ 1998 | 6 | 7 | 4 | 0 | 1999 +-------+--------+----------------+------+ 2000 | 7 | 12 | 5 | 0 | 2001 +-------+--------+----------------+------+ 2002 | 8 | 18 | 5 | 0 | 2003 +-------+--------+----------------+------+ 2004 | 9 | 23 | 5 | 0 | 2005 +-------+--------+----------------+------+ 2006 | 10 | 5 | 5 | 0 | 2007 +-------+--------+----------------+------+ 2008 | 11 | 8 | 4 | 0 | 2009 +-------+--------+----------------+------+ 2010 | 12 | 14 | 5 | 0 | 2011 +-------+--------+----------------+------+ 2012 | 13 | 20 | 5 | 0 | 2013 +-------+--------+----------------+------+ 2014 | 14 | 2 | 5 | 0 | 2015 +-------+--------+----------------+------+ 2016 | 15 | 7 | 4 | 16 | 2017 +-------+--------+----------------+------+ 2018 | 16 | 11 | 5 | 0 | 2019 +-------+--------+----------------+------+ 2020 | 17 | 17 | 5 | 0 | 2021 +-------+--------+----------------+------+ 2022 | 18 | 22 | 5 | 0 | 2023 +-------+--------+----------------+------+ 2024 | 19 | 4 | 5 | 0 | 2025 +-------+--------+----------------+------+ 2026 | 20 | 8 | 4 | 16 | 2027 +-------+--------+----------------+------+ 2028 | 21 | 13 | 5 | 0 | 2029 +-------+--------+----------------+------+ 2030 | 22 | 19 | 5 | 0 | 2031 +-------+--------+----------------+------+ 2032 | 23 | 1 | 5 | 0 | 2033 +-------+--------+----------------+------+ 2034 | 24 | 6 | 4 | 16 | 2035 +-------+--------+----------------+------+ 2036 | 25 | 10 | 5 | 0 | 2037 +-------+--------+----------------+------+ 2038 | 26 | 16 | 5 | 0 | 2039 +-------+--------+----------------+------+ 2040 | 27 | 28 | 5 | 0 | 2041 +-------+--------+----------------+------+ 2042 | 28 | 27 | 5 | 0 | 2043 +-------+--------+----------------+------+ 2044 | 29 | 26 | 5 | 0 | 2045 +-------+--------+----------------+------+ 2046 | 30 | 25 | 5 | 0 | 2047 +-------+--------+----------------+------+ 2048 | 31 | 24 | 5 | 0 | 2049 +-------+--------+----------------+------+ 2051 Authors' Addresses 2053 Yann Collet 2054 Facebook 2055 1 Hacker Way 2056 Menlo Park, CA 94025 2057 United States 2059 EMail: cyan@fb.com 2061 Murray S. Kucherawy (editor) 2062 Facebook 2063 1 Hacker Way 2064 Menlo Park, CA 94025 2065 United States 2067 EMail: msk@fb.com