idnits 2.17.1 draft-diaz-lzip-05.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 271 has weird spacing: '...literal lit...' == Line 279 has weird spacing: '...0 + len rep2...' == Line 281 has weird spacing: '...1 + len rep3...' == Line 293 has weird spacing: '... 3 bits len...' == Line 294 has weird spacing: '... 8 bits len...' == (7 more instances...) -- The document date (April 2022) is 736 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force (IETF) A. Diaz 3 INTERNET-DRAFT draft-diaz-lzip-05 GNU Project 4 Category: Informational April 2022 5 Expiration date: 2022-10-26 7 Lzip Compressed Format and the 'application/lzip' Media Type 9 Abstract 11 Lzip is a lossless compressed data format designed for data sharing, 12 long-term archiving, and parallel compression/decompression. Lzip 13 uses a simplified form of the LZMA stream format and provides a 3 14 factor integrity checking to maximize interoperability and optimize 15 safety. Lzip can achieve higher compression ratios than gzip. This 16 document describes the lzip format and registers a media type, a 17 content encoding, and a structured syntax suffix to be used when 18 transporting lzip-compressed content via MIME or HTTP. 20 Status of This Memo 22 This document is not an Internet Standards Track specification; it is 23 published for informational purposes. 25 This document is a product of the Internet Engineering Task Force 26 (IETF). It represents the consensus of the IETF community. It has 27 received public review and has been approved for publication by the 28 Internet Engineering Steering Group (IESG). Not all documents 29 approved by the IESG are candidates for any level of Internet 30 Standard; see Section 2 of RFC 7841. 32 Information about the current status of this document, any errata, 33 and how to provide feedback on it may be obtained at 34 http://www.rfc-editor.org/info/rfc. 36 Comments are solicited and should be addressed to the lzip's mailing 37 list at lzip-bug@nongnu.org and/or the author. 39 Copyright Notice 41 Copyright (c) 2022 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Internet-Draft Boilerplate 56 This Internet-Draft is submitted in full conformance with the 57 provisions of BCP 78 and BCP 79. Internet-Drafts are working 58 documents of the Internet Engineering Task Force (IETF). Note that 59 other groups may also distribute working documents as 60 Internet-Drafts. The list of current Internet-Drafts is at 61 http://datatracker.ietf.org/drafts/current. 63 Internet-Drafts are draft documents valid for a maximum of six months 64 and may be updated, replaced, or obsoleted by other documents at any 65 time. It is inappropriate to use Internet-Drafts as reference 66 material or to cite them other than as "work in progress." 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 71 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . 3 72 1.2. Compliance . . . . . . . . . . . . . . . . . . . . . . . 4 73 2. File Format . . . . . . . . . . . . . . . . . . . . . . . . . 4 74 3. Format of the LZMA stream in lzip files . . . . . . . . . . . 5 75 3.1. What is coded . . . . . . . . . . . . . . . . . . . . . 6 76 3.2. The coding contexts . . . . . . . . . . . . . . . . . . 8 77 3.3. The range decoder . . . . . . . . . . . . . . . . . . . 10 78 3.4. Decoding and verifying the LZMA stream . . . . . . . . . 10 79 3.5. Compression . . . . . . . . . . . . . . . . . . . . . . 10 80 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 81 4.1. The 'application/lzip' Media Type . . . . . . . . . . . 11 82 4.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 12 83 4.3. Structured Syntax Suffix . . . . . . . . . . . . . . . . 12 84 5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 85 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 86 6.1. Normative References . . . . . . . . . . . . . . . . . . 13 87 6.2. Informative References . . . . . . . . . . . . . . . . . 13 88 Appendix A. Reference Source Code . . . . . . . . . . . . . . . 14 89 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 23 90 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 23 92 1. Introduction 94 Lzip is a lossless compressed data format similar to gzip [RFC1952]. 95 Lzip is designed for data sharing, long-term archiving, parallel 96 compression/decompression, and limited random access to the data. 97 Lzip can achieve higher compression ratios than gzip. 99 Lzip uses a simplified form of the Lempel-Ziv-Markov chain-Algorithm 100 (LZMA) stream format and provides a 3 factor integrity checking to 101 maximize interoperability and optimize safety. The original LZMA 102 algorithm and stream format were developed by Igor Pavlov. 104 LZMA is much like deflate (the algorithm of gzip) with two main 105 differences that account for its higher compression ratio. First, 106 LZMA can use a dictionary size thousands of times larger than 107 deflate. Second, LZMA uses a range encoder as its last stage instead 108 of the less efficient (but faster) Huffman coding used by deflate. 110 1.1. Purpose 112 The purpose of this document is to define a lossless compressed data 113 format that is a) independent of the CPU type, operating system, file 114 system, and character set and b) suitable for file compression and 115 pipe and streaming compression, using the LZMA algorithm. The text 116 of the specification assumes a basic background in programming at the 117 level of bits and other primitive data representations. 119 The data can be produced or consumed, even for an arbitrarily long 120 sequentially presented input data stream, using only an a priori 121 bounded amount of intermediate storage, and hence can be used in data 122 communications. 124 The data format defined by this specification allows efficient 125 parallel compression/decompression, and random access to blocks of 126 compressed data by means of multimember files and a distributed 127 index. 129 This specification is intended for use by implementors of software to 130 compress data into lzip format and/or decompress data from lzip 131 format. The lzip format is supported by one free software reference 132 implementation (the lzip tool) written in portable C++ (C++11), and 133 by several free software implementations written in portable C (C99), 134 all of them available at [LZIP]. The reference implementation has 135 been in stable status since 2009, and is widely deployed. 137 Also, to enable the transport of a data object compressed with lzip, 138 this document registers a media type, a content encoding, and a 139 structured syntax suffix that can be used to identify such content 140 when it is used in a payload encoded using Multipurpose Internet Mail 141 Extensions (MIME) or Hypertext Transfer Protocol (HTTP). 143 1.2. Compliance 145 A compliant decompressor must be able to accept and decompress any 146 file that conforms to all the specifications presented here; a 147 compliant compressor must produce files that conform to all the 148 specifications presented here. 150 2. File Format 152 Perfection is reached, not when there is no longer anything to add, 153 but when there is no longer anything to take away. 154 -- Antoine de Saint-Exupery 156 In the diagram below, a box like this: 158 +---+ 159 | | <-- the vertical bars might be missing 160 +---+ 162 represents one byte; a box like this: 164 +==============+ 165 | | 166 +==============+ 168 represents a variable number of bytes. 170 A lzip file consists of a series of independent "members" (compressed 171 data sets). The members simply appear one after another in the file, 172 with no additional information before, between, or after them. Each 173 member can encode in compressed form up to 16 EiB - 1 byte of 174 uncompressed data. The size of a multimember file is unlimited. 176 Each member has the following structure: 178 +-+-+-+-+---+---+===========+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 179 | ID |VN |DS |LZMA stream| CRC32 | Data size | Member size | 180 +-+-+-+-+---+---+===========+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 182 All multibyte values are stored in little endian order. 184 ID string (the "magic" bytes) 185 A four byte string, identifying the lzip format, with the value 186 "LZIP" (0x4C, 0x5A, 0x49, 0x50). 188 VN (version number, 1 byte) 189 Just in case something needs to be modified in the future. 1 for 190 now. 192 DS (coded dictionary size, 1 byte) 193 The dictionary size is calculated by taking a power of 2 (the base 194 size) and subtracting from it a fraction between 0/16 and 7/16 of 195 the base size. 196 Bits 4-0 contain the base 2 logarithm of the base size (12 to 29). 197 Bits 7-5 contain the numerator of the fraction (0 to 7) to 198 subtract from the base size to obtain the dictionary size. 199 Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB 200 Valid values for dictionary size range from 4 KiB to 512 MiB. 202 LZMA stream 203 The LZMA stream, finished by an "End Of Stream" marker. Uses 204 default values for encoder properties. See Section 3 for a 205 complete description. 207 CRC32 (4 bytes) 208 Cyclic Redundancy Check (CRC) of the original uncompressed data. 210 Data size (8 bytes) 211 Size of the original uncompressed data. 213 Member size (8 bytes) 214 Total size of the member, including header and trailer. This 215 field acts as a distributed index, allows the verification of 216 stream integrity, and facilitates the safe recovery of undamaged 217 members from multimember files. Member size should be limited to 218 2 PiB to prevent the data size field from overflowing. 220 3. Format of the LZMA stream in lzip files 222 The LZMA algorithm has three parameters, called "special LZMA 223 properties", to adjust it for some kinds of binary data. These 224 parameters are: 'literal_context_bits' (with a default value of 3), 225 'literal_pos_state_bits' (with a default value of 0), and 226 'pos_state_bits' (with a default value of 2). As a general purpose 227 compressor, lzip only uses the default values for these parameters. 228 In particular 'literal_pos_state_bits' has been optimized away and 229 does not even appear in the code. 231 Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker 232 (the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with 233 the 'member size' field in the member trailer allows the verification 234 of stream integrity. The EOS marker is the only marker allowed in 235 lzip files. The LZMA stream in lzip files always has these two 236 features (default properties and EOS marker) and is referred to in 237 this document as LZMA-302eos. This simplified form of the LZMA 238 stream format has been chosen to maximize interoperability and 239 safety. 241 The second stage of LZMA is a range encoder that uses a different 242 probability model for each type of symbol: distances, lengths, 243 literal bytes, etc. Range encoding conceptually encodes all the 244 symbols of the message into one number. Unlike Huffman coding, which 245 assigns to each symbol a bit-pattern and concatenates all the 246 bit-patterns together, range encoding can compress one symbol to less 247 than one bit. Therefore the compressed data produced by a range 248 encoder can't be split in pieces that could be described 249 individually. 251 It seems that the only way of describing the LZMA-302eos stream is to 252 describe the algorithm that decodes it. And given the many details 253 about the range decoder that need to be described accurately, the 254 source code of a real decompressor seems the only appropriate 255 reference to use. 257 What follows is a description of the decoding algorithm for 258 LZMA-302eos streams using as reference the source code of "lzd", an 259 educational decompressor for lzip files which can be downloaded from 260 the lzip download directory. Lzd is written in C++11 and its source 261 code is included in appendix A. 263 3.1. What is coded 265 The LZMA stream includes literals, matches, and repeated matches 266 (matches reusing a recently used distance). There are 7 different 267 coding sequences: 269 Bit sequence Name Description 270 ----------------------- -------- -------------------------------- 271 0 + byte literal literal byte 272 1 + 0 + len + dis match distance-length pair 273 1 + 1 + 0 + 0 shortrep 1 byte match at latest used 274 distance 275 1 + 1 + 0 + 1 + len rep0 len bytes match at latest used 276 distance 277 1 + 1 + 1 + 0 + len rep1 len bytes match at second latest 278 used distance 279 1 + 1 + 1 + 1 + 0 + len rep2 len bytes match at third latest 280 used distance 281 1 + 1 + 1 + 1 + 1 + len rep3 len bytes match at fourth latest 282 used distance 284 In the following tables, multibit sequences are coded in normal 285 order, from most significant bit (MSB) to least significant bit 286 (LSB), except where noted otherwise. 288 Lengths (the 'len' in the table above) are coded as follows: 290 Bit sequence Description 291 -------------- ---------------------- 292 0 + 3 bits lengths from 2 to 9 293 1 + 0 + 3 bits lengths from 10 to 17 294 1 + 1 + 8 bits lengths from 18 to 273 296 The coding of distances is a little more complicated, so I'll begin 297 by explaining a simpler version of the encoding. 299 Imagine you need to encode a number from 0 to 2^32 - 1, and you want 300 to do it in a way that produces shorter codes for the smaller 301 numbers. You may first encode the position of the most significant 302 bit that is set to 1, which you may find by making a bit scan from 303 the left (from the MSB). A position of 0 means that the number is 0 304 (no bit is set), 1 means the LSB is the first bit set (the number is 305 1), and 32 means the MSB is set (i.e., the number is >= 0x80000000). 306 Then, if the position is >= 2, you encode the remaining position - 1 307 bits. Let's call these bits "direct bits" because they are coded 308 directly by value instead of indirectly by position. 310 The inconvenient of this simple method is that it needs 6 bits to 311 encode the position, but it just uses 33 of the 64 possible values, 312 wasting almost half of the codes. 314 The intelligent trick of LZMA is that it encodes in what it calls a 315 "slot" the position of the most significant bit set, along with the 316 value of the next bit, using the same 6 bits that would take to 317 encode the position alone. This seems to need 66 slots (twice the 318 number of positions), but for positions 0 and 1 there is no next bit, 319 so the number of slots needed is 64 (0 to 63). 321 The 6 bits representing this "slot number" are then context-coded. 322 If the distance is >= 4, the remaining bits are encoded as follows. 323 'direct_bits' is the amount of remaining bits (from 1 to 30) needed 324 to form a complete distance, and is calculated as (slot >> 1) - 1. 325 If a distance needs 6 or more direct_bits, the last 4 bits are 326 encoded separately. The last piece (all the direct_bits for 327 distances 4 to 127, or the last 4 bits for distances >= 128) is 328 context-coded in reverse order (from LSB to MSB). For distances 329 >= 128, the 'direct_bits - 4' part is encoded with fixed 0.5 330 probability. 332 Bit sequence Description 333 --------------------------------- ------------------------------ 334 slot distances from 0 to 3 335 slot + direct_bits distances from 4 to 127 336 slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1 338 3.2. The coding contexts 340 These contexts ('Bit_model' in the source), are integers or arrays of 341 integers representing the probability of the corresponding bit being 342 0. 344 The indices used in these arrays are: 346 state 347 A state machine ('State' in the source) with 12 states (0 to 11), 348 coding the latest 2 to 4 types of sequences processed. The 349 initial state is 0. 351 pos_state 352 Value of the 2 least significant bits of the current position in 353 the decoded data. 355 literal_state 356 Value of the 3 most significant bits of the latest byte decoded. 358 len_state 359 Coded value of the current match length (length - 2), with a 360 maximum of 3. The resulting value is in the range 0 to 3. 362 The types of previous sequences corresponding to each state are shown 363 in the following table. '!literal' is any sequence except a literal 364 byte. 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The 365 last type in each line is the most recent. 367 State Types of previous sequences 368 ----- --------------------------------------------- 369 0 literal, literal, literal 370 1 match, literal, literal 371 2 rep or (!literal, shortrep), literal, literal 372 3 literal, shortrep, literal, literal 373 4 match, literal 374 5 rep or (!literal, shortrep), literal 375 6 literal, shortrep, literal 376 7 literal, match 377 8 literal, rep 378 9 literal, shortrep 379 10 !literal, match 380 11 !literal, (rep or shortrep) 381 The contexts for decoding the type of coding sequence are: 383 Name Indices Used when 384 -------- ---------------- ------------------- 385 bm_match state, pos_state sequence start 386 bm_rep state after sequence 1 387 bm_rep0 state after sequence 11 388 bm_rep1 state after sequence 111 389 bm_rep2 state after sequence 1111 390 bm_len state, pos_state after sequence 110 392 The contexts for decoding distances are: 394 Name Indices Used when 395 ----------- ------------------- --------------------------- 396 bm_dis_slot len_state, bit tree distance start 397 bm_dis reverse bit tree after slots 4 to 13 398 bm_align reverse bit tree for distances >= 128, after 399 fixed probability bits 401 There are two separate sets of contexts for lengths ('Len_model' in 402 the source). One for normal matches, the other for repeated matches. 403 The contexts in each Len_model are (see 'decode_len' in the source): 405 Name Indices Used when 406 ------- ------------------- ----------------- 407 choice1 none length start 408 choice2 none after sequence 1 409 bm_low pos_state, bit tree after sequence 0 410 bm_mid pos_state, bit tree after sequence 10 411 bm_high bit tree after sequence 11 413 The context array 'bm_literal' is special. In principle it acts as a 414 normal bit tree context, the one selected by 'literal_state'. But if 415 the previous decoded byte was not a literal, two other bit tree 416 contexts are used depending on the value of each bit in 'match_byte' 417 (the byte at the latest used distance), until a bit is decoded that 418 is different from its corresponding bit in 'match_byte'. After the 419 first difference is found, the rest of the byte is decoded using the 420 normal bit tree context. (See 'decode_matched' in the source). 422 3.3. The range decoder 424 The LZMA stream is consumed one byte at a time by the range decoder. 425 (See 'normalize' in the source). Every byte consumed produces a 426 variable number of decoded bits, depending on how well these bits 427 agree with their context. (See 'decode_bit' in the source). 429 The range decoder state consists of two unsigned 32-bit variables: 430 'range' (representing the most significant part of the range size not 431 yet decoded) and 'code' (representing the current point within 432 'range'). 'range' is initialized to 2^32 - 1, and 'code' is 433 initialized to 0. 435 The range encoder produces a first 0 byte that must be ignored by the 436 range decoder. This is done by shifting 5 bytes in the 437 initialization of 'code' instead of 4. (See the 'Range_decoder' 438 constructor in the source). 440 3.4. Decoding and verifying the LZMA stream 442 After decoding the member header and obtaining the dictionary size, 443 the range decoder is initialized and then the LZMA decoder enters a 444 loop (see 'decode_member' in the source) where it invokes the range 445 decoder with the appropriate contexts to decode the different coding 446 sequences (matches, repeated matches, and literal bytes), until the 447 "End Of Stream" marker is decoded. 449 Once the "End Of Stream" marker has been decoded, the decompressor 450 must read and decode the member trailer, and verify that the three 451 integrity factors stored there (CRC, data size, and member size) 452 match those computed from the data. 454 3.5. Compression 456 Compression consists in describing the uncompressed data as a 457 succession of coding sequences from the set shown in Section 3.1, and 458 then encoding them using a range encoder. The fast encoder in the 459 reference implementation shows how this can be done in almost the 460 simplest way possible; issuing the longest match found, or a literal 461 byte if no match is found, and repeating until all the data have been 462 compressed. More sophisticated choosing of the coding sequences may 463 achieve higher compression ratios. 465 4. IANA Considerations 467 IANA is asked to make two registrations, as described below. 469 4.1. The 'application/lzip' Media Type 471 The 'application/lzip' media type identifies a block of data that is 472 compressed using lzip compression. The data are a stream of bytes as 473 described in this document. IANA is asked to add the following entry 474 to the standards tree of the "Media Types" registry: 476 Type name: application 478 Subtype name: lzip 480 Required parameters: N/A 482 Optional parameters: N/A 484 Encoding considerations: binary 486 Security considerations: See Section 5 of this RFC 488 Interoperability considerations: N/A 490 Published specification: this RFC 492 Applications that use this media type: 493 Any application where data size is an issue 495 Fragment Identifier Considerations: N/A 497 Restrictions on Usage: N/A 499 Provisional Registrations: N/A 501 Additional information: 503 Deprecated alias names for this type: application/x-lzip 505 Magic number(s): First 4 bytes (0x4C, 0x5A, 0x49, 0x50) 507 File extension(s): .lz .tlz 509 Macintosh File Type Code(s): N/A 511 Object Identifier(s) or OID(s): N/A 513 Intended Usage: COMMON 515 Other Information & Comments: See [LZIP] 517 Author: Antonio Diaz Diaz 519 Change Controller: IETF 521 4.2. Content Encoding 523 IANA is asked to add the following entry to the "HTTP Content Coding 524 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" 525 registry: 527 Name: lzip 529 Description: A stream of bytes compressed using lzip compression 531 Pointer to specification text: this RFC 533 4.3. Structured Syntax Suffix 535 IANA is asked to add the following entry to the "Structured Syntax 536 Suffix" registry: 538 Name: lzip 540 +suffix: +lzip 542 References: this RFC 544 Encoding Considerations: binary 546 Interoperability Considerations: N/A 548 Fragment Identifier Considerations: 549 The syntax and semantics of fragment identifiers specified for 550 +lzip should be as specified for 'application/lzip'. 551 (At publication of this document, there is no fragment 552 identification syntax defined for 'application/lzip'.) 554 Security Considerations: See Section 5 of this RFC 556 Contact: See Author's Address of this RFC 558 Author/Change Controller: IETF 560 5. Security Considerations 562 Lzip is a compressed format. Decompressing lzip data may expand them 563 to a size more than 7000 times larger, risking an out-of-memory or 564 out-of-disc-space condition. Since both the gzip and lzip formats 565 contain a single data stream, any steps already taken to avoid such 566 attacks on application/gzip should also work on application/lzip. 567 One possible measure, already implemented in some applications, is to 568 set a limit to the decompressed size and stop the decompression or 569 warn the user if the limit is surpassed. 571 This media type does not employ any kind of "active content", but it 572 can be used to compress any other media type (for example 573 application/postscript) which could then be interpreted by the 574 application. 576 A lzip stream does not store any metadata. Any data appended to the 577 end of the stream is easily detected, and an error can be signaled. 578 It is not apparent how this media type could be used to help violate 579 a recipient's privacy. 581 The lzip media type does not need itself any external security 582 mechanisms. But again, it can be used to compress other media types 583 requiring them (for example a media type defined for storage of 584 sensitive medical information). 586 Because of the nature of the decoding algorithm used by lzip, it is 587 easy to protect the decoder from invalid memory accesses caused by 588 corruption in the input data (intentional or not). Size limits need 589 to be checked at just one place in the decompressor (the decoding of 590 rep0) to prevent buffer overflows. This inherent safety has been 591 extensively tested in the reference implementation. 593 The 'data size' field in the lzip trailer can be faked to be smaller 594 than the actual decompressed data in an attempt to trigger a buffer 595 overflow. This is not a problem for most lzip decompressors 596 (including the reference implementation) because they use 'data size' 597 only to verify the size of the data produced, as an error detection 598 measure. However, any application that tries to decompress a whole 599 member in memory must not trust 'data size' and must always avoid 600 decompressing beyond the end of the buffer because, even if 'data 601 size' is correct, decompressing corrupt data may produce more 602 decompressed data than expected, and may cause a buffer overflow. 604 6. References 606 6.1. Normative References 608 [LZIP] Diaz, A., "Lzip", . 610 6.2. Informative References 612 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", 613 RFC 1952, DOI 10.17487/RFC1952, May 1996, 614 . 616 Appendix A. Reference Source Code 618


619	/* Lzd - Educational decompressor for the lzip format
620	   Copyright (C) 2013-2022 Antonio Diaz Diaz.

622	   This program is free software. Redistribution and use in source and
623	   binary forms, with or without modification, are permitted provided
624	   that the following conditions are met:

626	   1. Redistributions of source code must retain the above copyright
627	   notice, this list of conditions, and the following disclaimer.

629	   2. Redistributions in binary form must reproduce the above copyright
630	   notice, this list of conditions, and the following disclaimer in the
631	   documentation and/or other materials provided with the distribution.

633	   This program is distributed in the hope that it will be useful,
634	   but WITHOUT ANY WARRANTY; without even the implied warranty of
635	   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
636	*/
637	/*
638	   Exit status: 0 for a normal exit, 1 for environmental problems
639	   (file not found, invalid flags, I/O errors, etc), 2 to indicate a
640	   corrupt or invalid input file.
641	*/

643	#include 
644	#include 
645	#include 
646	#include 
647	#include 
648	#include 
649	#include 
650	#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
651	#include 
652	#include 
653	#endif
654	#define PROGVERSION "1.2"

656	class State
657	  {
658	  int st;

660	public:
661	  enum { states = 12 };
662	  State() : st( 0 ) {}
663	  int operator()() const { return st; }
664	  bool is_char() const { return st < 7; }
665	  void set_char()
666	    {
667	    const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
668	    st = next[st];
669	    }
670	  void set_match()     { st = ( st < 7 ) ? 7 : 10; }
671	  void set_rep()       { st = ( st < 7 ) ? 8 : 11; }
672	  void set_short_rep() { st = ( st < 7 ) ? 9 : 11; }
673	  };

675	enum {
676	  min_dictionary_size = 1 << 12,
677	  max_dictionary_size = 1 << 29,
678	  literal_context_bits = 3,
679	  literal_pos_state_bits = 0,                           // not used
680	  pos_state_bits = 2,
681	  pos_states = 1 << pos_state_bits,
682	  pos_state_mask = pos_states - 1,

684	  len_states = 4,
685	  dis_slot_bits = 6,
686	  start_dis_model = 4,
687	  end_dis_model = 14,
688	  modeled_distances = 1 << ( end_dis_model / 2 ),       // 128
689	  dis_align_bits = 4,
690	  dis_align_size = 1 << dis_align_bits,

692	  len_low_bits = 3,
693	  len_mid_bits = 3,
694	  len_high_bits = 8,
695	  len_low_symbols = 1 << len_low_bits,
696	  len_mid_symbols = 1 << len_mid_bits,
697	  len_high_symbols = 1 << len_high_bits,
698	  max_len_symbols = len_low_symbols+len_mid_symbols+len_high_symbols,

700	  min_match_len = 2,                                    // must be 2

702	  bit_model_move_bits = 5,
703	  bit_model_total_bits = 11,
704	  bit_model_total = 1 << bit_model_total_bits };

706	struct Bit_model
707	  {
708	  int probability;
709	  Bit_model() : probability( bit_model_total / 2 ) {}
710	  };

712	struct Len_model
713	  {
714	  Bit_model choice1;
715	  Bit_model choice2;
716	  Bit_model bm_low[pos_states][len_low_symbols];
717	  Bit_model bm_mid[pos_states][len_mid_symbols];
718	  Bit_model bm_high[len_high_symbols];
719	  };

721	class CRC32
722	  {
723	  uint32_t data[256];           // Table of CRCs of all 8-bit messages.

725	public:
726	  CRC32()
727	    {
728	    for( unsigned n = 0; n < 256; ++n )
729	      {
730	      unsigned c = n;
731	      for( int k = 0; k < 8; ++k )
732	        { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; }
733	      data[n] = c;
734	      }
735	    }

737	  void update_buf( uint32_t & crc, const uint8_t * const buffer,
738	                   const int size ) const
739	    {
740	    for( int i = 0; i < size; ++i )
741	      crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 );
742	    }
743	  };

745	const CRC32 crc32;

747	typedef uint8_t Lzip_header[6];         // 0-3 magic bytes
748	                                        //   4 version
749	                                        //   5 coded dictionary size
750	typedef uint8_t Lzip_trailer[20];
751	                       //  0-3  CRC32 of the uncompressed data
752	                       //  4-11 size of the uncompressed data
753	                       // 12-19 member size including header and trailer

755	class Range_decoder
756	  {
757	  unsigned long long member_pos;
758	  uint32_t code;
759	  uint32_t range;

761	public:
762	  Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU )
763	    {
764	    for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte();
765	    }

767	  uint8_t get_byte() { ++member_pos; return std::getc( stdin ); }
768	  unsigned long long member_position() const { return member_pos; }

770	  unsigned decode( const int num_bits )
771	    {
772	    unsigned symbol = 0;
773	    for( int i = num_bits; i > 0; --i )
774	      {
775	      range >>= 1;
776	      symbol <<= 1;
777	      if( code >= range ) { code -= range; symbol |= 1; }
778	      if( range <= 0x00FFFFFFU )                        // normalize
779	        { range <<= 8; code = ( code << 8 ) | get_byte(); }
780	      }
781	    return symbol;
782	    }

784	  unsigned decode_bit( Bit_model & bm )
785	    {
786	    unsigned symbol;
787	    const uint32_t bound =
788	      ( range >> bit_model_total_bits ) * bm.probability;
789	    if( code < bound )
790	      {
791	      range = bound;
792	      bm.probability +=
793	        ( bit_model_total - bm.probability ) >> bit_model_move_bits;
794	      symbol = 0;
795	      }
796	    else
797	      {
798	      range -= bound;
799	      code -= bound;
800	      bm.probability -= bm.probability >> bit_model_move_bits;
801	      symbol = 1;
802	      }
803	    if( range <= 0x00FFFFFFU )                          // normalize
804	      { range <<= 8; code = ( code << 8 ) | get_byte(); }
805	    return symbol;
806	    }

808	  unsigned decode_tree( Bit_model bm[], const int num_bits )
809	    {
810	    unsigned symbol = 1;
811	    for( int i = 0; i < num_bits; ++i )
812	      symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
813	    return symbol - ( 1 << num_bits );
814	    }

816	  unsigned decode_tree_reversed( Bit_model bm[], const int num_bits )
817	    {
818	    unsigned symbol = decode_tree( bm, num_bits );
819	    unsigned reversed_symbol = 0;
820	    for( int i = 0; i < num_bits; ++i )
821	      {
822	      reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 );
823	      symbol >>= 1;
824	      }
825	    return reversed_symbol;
826	    }

828	  unsigned decode_matched( Bit_model bm[], const unsigned match_byte )
829	    {
830	    unsigned symbol = 1;
831	    for( int i = 7; i >= 0; --i )
832	      {
833	      const unsigned match_bit = ( match_byte >> i ) & 1;
834	      const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100]);
835	      symbol = ( symbol << 1 ) | bit;
836	      if( match_bit != bit )
837	        {
838	        while( symbol < 0x100 )
839	          symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
840	        break;
841	        }
842	      }
843	    return symbol & 0xFF;
844	    }

846	  unsigned decode_len( Len_model & lm, const int pos_state )
847	    {
848	    if( decode_bit( lm.choice1 ) == 0 )
849	      return decode_tree( lm.bm_low[pos_state], len_low_bits );
850	    if( decode_bit( lm.choice2 ) == 0 )
851	      return len_low_symbols +
852	             decode_tree( lm.bm_mid[pos_state], len_mid_bits );
853	    return len_low_symbols + len_mid_symbols +
854	           decode_tree( lm.bm_high, len_high_bits );
855	    }
856	  };

858	class LZ_decoder
859	  {
860	  unsigned long long partial_data_pos;
861	  Range_decoder rdec;
862	  const unsigned dictionary_size;
863	  uint8_t * const buffer;       // output buffer
864	  unsigned pos;                 // current pos in buffer
865	  unsigned stream_pos;          // first byte not yet written to stdout
866	  uint32_t crc_;
867	  bool pos_wrapped;

869	  void flush_data();

871	  uint8_t peek( const unsigned distance ) const
872	    {
873	    if( pos > distance ) return buffer[pos - distance - 1];
874	    if( pos_wrapped ) return buffer[dictionary_size+pos-distance-1];
875	    return 0;                   // prev_byte of first byte
876	    }

878	  void put_byte( const uint8_t b )
879	    {
880	    buffer[pos] = b;
881	    if( ++pos >= dictionary_size ) flush_data();
882	    }

884	public:
885	  explicit LZ_decoder( const unsigned dict_size )
886	    :
887	    partial_data_pos( 0 ),
888	    dictionary_size( dict_size ),
889	    buffer( new uint8_t[dictionary_size] ),
890	    pos( 0 ),
891	    stream_pos( 0 ),
892	    crc_( 0xFFFFFFFFU ),
893	    pos_wrapped( false )
894	    {}

896	  ~LZ_decoder() { delete[] buffer; }

898	  unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; }
899	  unsigned long long data_position() const
900	    { return partial_data_pos + pos; }
901	  uint8_t get_byte() { return rdec.get_byte(); }
902	  unsigned long long member_position() const
903	    { return rdec.member_position(); }

905	  bool decode_member();
906	  };

908	void LZ_decoder::flush_data()
909	  {
910	  if( pos > stream_pos )
911	    {
912	    const unsigned size = pos - stream_pos;
913	    crc32.update_buf( crc_, buffer + stream_pos, size );
914	    if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size )
915	      { std::fprintf(stderr, "Write error: %s\n", std::strerror(errno));
916	        std::exit( 1 ); }
917	    if( pos >= dictionary_size )
918	      { partial_data_pos += pos; pos = 0; pos_wrapped = true; }
919	    stream_pos = pos;
920	    }
921	  }

923	bool LZ_decoder::decode_member()        // Returns false if error
924	  {
925	  Bit_model bm_literal[1<> (8 - literal_context_bits);
951	      Bit_model * const bm = bm_literal[literal_state];
952	      if( state.is_char() )
953	        put_byte( rdec.decode_tree( bm, 8 ) );
954	      else
955	        put_byte( rdec.decode_matched( bm, peek( rep0 ) ) );
956	      state.set_char();
957	      continue;
958	      }
959	    // match or repeated match
960	    int len;
961	    if( rdec.decode_bit( bm_rep[state()] ) != 0 )             // 2nd bit
962	      {
963	      if( rdec.decode_bit( bm_rep0[state()] ) == 0 )          // 3rd bit
964	        {
965	        if( rdec.decode_bit(bm_len[state()][pos_state]) == 0 )// 4th bit
966	          { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; }
967	        }
968	      else
969	        {
970	        unsigned distance;
971	        if( rdec.decode_bit( bm_rep1[state()] ) == 0 )        // 4th bit
972	          distance = rep1;
973	        else
974	          {
975	          if( rdec.decode_bit( bm_rep2[state()] ) == 0 )      // 5th bit
976	            distance = rep2;
977	          else
978	            { distance = rep3; rep3 = rep2; }
979	          rep2 = rep1;
980	          }
981	        rep1 = rep0;
982	        rep0 = distance;
983	        }
984	      state.set_rep();
985	      len = min_match_len + rdec.decode_len( rep_len_model, pos_state );
986	      }
987	    else                                        // match
988	      {
989	      rep3 = rep2; rep2 = rep1; rep1 = rep0;
990	      len = min_match_len + rdec.decode_len(match_len_model, pos_state);
991	      const int len_state = std::min( len-min_match_len, len_states-1 );
992	      rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits );
993	      if( rep0 >= start_dis_model )
994	        {
995	        const unsigned dis_slot = rep0;
996	        const int direct_bits = ( dis_slot >> 1 ) - 1;
997	        rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
998	        if( dis_slot < end_dis_model )
999	          rep0 += rdec.decode_tree_reversed( bm_dis + (rep0 - dis_slot),
1000	                                             direct_bits );
1001	        else
1002	          {
1003	          rep0 +=
1004	            rdec.decode(direct_bits - dis_align_bits) << dis_align_bits;
1005	          rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits );
1006	          if( rep0 == 0xFFFFFFFFU )             // marker found
1007	            {
1008	            flush_data();
1009	            return ( len == min_match_len );    // End Of Stream marker
1010	            }
1011	          }
1012	        }
1013	      state.set_match();
1014	      if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) )
1015	        { flush_data(); return false; }
1016	      }

1018	    for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) );
1019	    }
1020	  flush_data();
1021	  return false;
1022	  }

1024	int main( const int argc, const char * const argv[] )
1025	  {
1026	  if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) )
1027	    {
1028	    std::printf(
1029	      "Lzd %s - Educational decompressor for the lzip format.\n"
1030	      "Study the source to learn how a lzip decompressor works.\n"
1031	      "See the lzip manual for an explanation of the code.\n"
1032	      "\nUsage: %s [-d] < file.lz > file\n"
1033	      "Lzd decompresses from standard input to standard output.\n"
1034	      "\nCopyright (C) 2022 Antonio Diaz Diaz.\n"
1035	      "License 2-clause BSD.\n"
1036	      "This is free software: you are free to change and redistribute "
1037	      "it.\nThere is NO WARRANTY, to the extent permitted by law.\n"
1038	      "Report bugs to lzip-bug@nongnu.org\n"
1039	      "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n",
1040	      PROGVERSION, argv[0] );
1041	    return 0;
1042	    }

1044	#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
1045	  setmode( STDIN_FILENO, O_BINARY );
1046	  setmode( STDOUT_FILENO, O_BINARY );
1047	#endif

1049	  for( bool first_member = true; ; first_member = false )
1050	    {
1051	    Lzip_header header;                         // verify header
1052	    for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin );
1053	    if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01",5 ) != 0 )
1054	      {
1055	      if( first_member )
1056	        { std::fputs( "Bad magic number (file not in lzip format).\n",
1057	                      stderr ); return 2; }
1058	      break;                                    // ignore trailing data
1059	      }
1060	    unsigned dict_size = 1 << ( header[5] & 0x1F );
1061	    dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 );
1062	    if( dict_sizemax_dictionary_size )
1063	      { std::fputs( "Invalid dictionary size in member header.\n",
1064	                    stderr ); return 2; }

1066	    LZ_decoder decoder( dict_size );            // decode LZMA stream
1067	    if( !decoder.decode_member() )
1068	      { std::fputs( "Data error\n", stderr ); return 2; }

1070	    Lzip_trailer trailer;                       // verify trailer
1071	    for( int i = 0; i < 20; ++i ) trailer[i] = decoder.get_byte();
1072	    int retval = 0;
1073	    unsigned crc = 0;
1074	    for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i];
1075	    if( crc != decoder.crc() )
1076	      { std::fputs( "CRC mismatch\n", stderr ); retval = 2; }

1078	    unsigned long long data_size = 0;
1079	    for( int i = 11; i >= 4; --i )
1080	      data_size = ( data_size << 8 ) + trailer[i];
1081	    if( data_size != decoder.data_position() )
1082	      { std::fputs( "Data size mismatch\n", stderr ); retval = 2; }

1084	    unsigned long long member_size = 0;
1085	    for( int i = 19; i >= 12; --i )
1086	      member_size = ( member_size << 8 ) + trailer[i];
1087	    if( member_size != decoder.member_position() )
1088	      { std::fputs( "Member size mismatch\n", stderr ); retval = 2; }
1089	    if( retval ) return retval;
1090	    }

1092	  if( std::fclose( stdout ) != 0 )
1093	    { std::fprintf( stderr, "Error closing stdout: %s\n",
1094	                    std::strerror( errno ) ); return 1; }
1095	  return 0;
1096	  }
1097	

1099	Acknowledgements

1101	   The ideas embodied in lzip are due to (at least) the following
1102	   people: Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey
1103	   Markov (for the definition of Markov chains), G.N.N. Martin (for the
1104	   definition of range encoding), and Igor Pavlov (for putting all the
1105	   above together in LZMA).

1107	Author's Address

1109	   Antonio Diaz Diaz
1110	   GNU Project
1111	   Email: antonio@gnu.org

1113	Expiration date: 2022-10-26