idnits 2.17.1 draft-diaz-lzip-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 269 has weird spacing: '...literal lit...' == Line 277 has weird spacing: '...0 + len rep2...' == Line 279 has weird spacing: '...1 + len rep3...' == Line 291 has weird spacing: '... 3 bits len...' == Line 292 has weird spacing: '... 8 bits len...' == (7 more instances...) -- The document date (October 2020) is 1288 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force (IETF) A. Diaz 3 INTERNET-DRAFT draft-diaz-lzip-02 GNU Project 4 Category: Informational October 2020 5 Expiration date: 2021-04-26 7 Lzip Compressed Format and the application/lzip Media Type 9 Abstract 11 Lzip is a lossless compressed data format designed for data sharing, 12 long-term archiving, and parallel compression/decompression. Lzip 13 uses a simplified form of the Lempel-Ziv-Markov chain-Algorithm 14 (LZMA) stream format, chosen to maximize safety and interoperability. 15 Lzip can achieve higher compression ratios than gzip. This document 16 describes the lzip format and registers a media type and content 17 encoding to be used when transporting lzip-compressed content via 18 Multipurpose Internet Mail Extensions (MIME). 20 Status of This Memo 22 This document is not an Internet Standards Track specification; it is 23 published for informational purposes. 25 This document is a product of the Internet Engineering Task Force 26 (IETF). It represents the consensus of the IETF community. It has 27 received public review and has been approved for publication by the 28 Internet Engineering Steering Group (IESG). Not all documents 29 approved by the IESG are candidates for any level of Internet 30 Standard; see Section 2 of RFC 7841. 32 Information about the current status of this document, any errata, 33 and how to provide feedback on it may be obtained at 34 http://www.rfc-editor.org/info/rfc. 36 Comments are solicited and should be addressed to the lzip's mailing 37 list at lzip-bug@nongnu.org and/or the author. 39 Copyright Notice 41 Copyright (c) 2020 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Internet-Draft Boilerplate 56 This Internet-Draft is submitted in full conformance with the 57 provisions of BCP 78 and BCP 79. Internet-Drafts are working 58 documents of the Internet Engineering Task Force (IETF). Note that 59 other groups may also distribute working documents as 60 Internet-Drafts. The list of current Internet-Drafts is at 61 http://datatracker.ietf.org/drafts/current. 63 Internet-Drafts are draft documents valid for a maximum of six months 64 and may be updated, replaced, or obsoleted by other documents at any 65 time. It is inappropriate to use Internet-Drafts as reference 66 material or to cite them other than as "work in progress." 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 71 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . 3 72 1.2. Compliance . . . . . . . . . . . . . . . . . . . . . . . 4 73 2. File Format . . . . . . . . . . . . . . . . . . . . . . . . . 4 74 3. Format of the LZMA stream in lzip files . . . . . . . . . . . 5 75 3.1. What is coded . . . . . . . . . . . . . . . . . . . . . 6 76 3.2. The coding contexts . . . . . . . . . . . . . . . . . . 8 77 3.3. The range decoder . . . . . . . . . . . . . . . . . . . 10 78 3.4. Decoding and verifying the LZMA stream . . . . . . . . . 10 79 3.5. Compression . . . . . . . . . . . . . . . . . . . . . . 10 80 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 81 4.1. The 'application/lzip' Media Type . . . . . . . . . . . 11 82 4.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 12 83 5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 84 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 85 6.1. Normative References . . . . . . . . . . . . . . . . . . 13 86 6.2. Informative References . . . . . . . . . . . . . . . . . 13 87 Appendix A. Reference Source Code . . . . . . . . . . . . . . . 13 88 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 23 89 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 23 91 1. Introduction 93 Lzip is a lossless compressed data format similar to gzip [RFC1952]. 94 Lzip is designed for data sharing, long-term archiving, parallel 95 compression/decompression, and limited random access to the data. 96 Lzip can achieve higher compression ratios than gzip. 98 Lzip uses a simplified form of the Lempel-Ziv-Markov chain-Algorithm 99 (LZMA) stream format, chosen to maximize safety and interoperability. 100 The original LZMA algorithm and stream format were developed by Igor 101 Pavlov. 103 LZMA is much like deflate (the algorithm of gzip) with two main 104 differences that account for its higher compression ratio. First, 105 LZMA can use a dictionary size thousands of times larger than 106 deflate. Second, LZMA uses a range encoder as its last stage instead 107 of the less efficient (but faster) Huffman coding used by deflate. 109 1.1. Purpose 111 The purpose of this document is to define a lossless compressed data 112 format that is a) independent of the CPU type, operating system, file 113 system, and character set and b) is suitable for file compression and 114 pipe and streaming compression, using the LZMA algorithm. The text 115 of the specification assumes a basic background in programming at the 116 level of bits and other primitive data representations. 118 The data can be produced or consumed, even for an arbitrarily long 119 sequentially presented input data stream, using only an a priori 120 bounded amount of intermediate storage, and hence can be used in data 121 communications. 123 The data format defined by this specification allows efficient 124 parallel compression/decompression, and random access to blocks of 125 compressed data by means of multimember files and a distributed 126 index. 128 This specification is intended for use by implementors of software to 129 compress data into lzip format and/or decompress data from lzip 130 format. The lzip format is supported by one free software reference 131 implementation (the lzip tool) written in portable C++ (C++11), and 132 by several free software implementations written in portable C (C99), 133 all of them available at [LZIP]. The reference implementation has 134 been in stable status since 2009, and is widely deployed. 136 Also, to enable the transport of a data object compressed with lzip, 137 this document registers a media type that can be used to identify 138 such content when it is used in a payload encoded using Multipurpose 139 Internet Mail Extensions (MIME). 141 1.2. Compliance 143 Unless otherwise indicated below, a compliant decompressor must be 144 able to accept and decompress any file that conforms to all the 145 specifications presented here; a compliant compressor must produce 146 files that conform to all the specifications presented here. 148 2. File Format 150 Perfection is reached, not when there is no longer anything to add, 151 but when there is no longer anything to take away. 152 -- Antoine de Saint-Exupery 154 In the diagram below, a box like this: 156 +---+ 157 | | <-- the vertical bars might be missing 158 +---+ 160 represents one byte; a box like this: 162 +==============+ 163 | | 164 +==============+ 166 represents a variable number of bytes. 168 A lzip file consists of a series of "members" (compressed data sets). 169 The members simply appear one after another in the file, with no 170 additional information before, between, or after them. 172 Each member has the following structure: 174 +--+--+--+--+----+----+=============+ 175 | ID string | VN | DS | LZMA stream | (more-->) 176 +--+--+--+--+----+----+=============+ 178 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 179 | CRC32 | Data size | Member size | 180 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 182 All multibyte values are stored in little endian order. 184 ID string (the "magic" bytes) 185 A four byte string, identifying the lzip format, with the value 186 "LZIP" (0x4C, 0x5A, 0x49, 0x50). 188 VN (version number, 1 byte) 189 Just in case something needs to be modified in the future. 1 for 190 now. 192 DS (coded dictionary size, 1 byte) 193 The dictionary size is calculated by taking a power of 2 (the base 194 size) and subtracting from it a fraction between 0/16 and 7/16 of 195 the base size. 196 Bits 4-0 contain the base 2 logarithm of the base size (12 to 29). 197 Bits 7-5 contain the numerator of the fraction (0 to 7) to 198 subtract from the base size to obtain the dictionary size. 199 Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB 200 Valid values for dictionary size range from 4 KiB to 512 MiB. 202 LZMA stream 203 The LZMA stream, finished by an end of stream marker. Uses 204 default values for encoder properties (see Section 3 for a 205 complete description). 207 CRC32 (4 bytes) 208 Cyclic Redundancy Check (CRC) of the uncompressed original data. 210 Data size (8 bytes) 211 Size of the uncompressed original data. 213 Member size (8 bytes) 214 Total size of the member, including header and trailer. This 215 field acts as a distributed index, allows the verification of 216 stream integrity, and facilitates safe recovery of undamaged 217 members from multimember files. 219 3. Format of the LZMA stream in lzip files 221 The LZMA algorithm has three parameters, called "special LZMA 222 properties", to adjust it for some kinds of binary data. These 223 parameters are; 'literal_context_bits' (with a default value of 3), 224 'literal_pos_state_bits' (with a default value of 0), and 225 'pos_state_bits' (with a default value of 2). As a general purpose 226 compressor, lzip only uses the default values for these parameters. 227 In particular 'literal_pos_state_bits' has been optimized away and 228 does not even appear in the code. 230 Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker 231 (the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with 232 the 'member size' field in the member trailer allows the verification 233 of stream integrity. The LZMA stream in lzip files always has these 234 two features (default properties and EOS marker) and is referred to 235 in this document as LZMA-302eos. The EOS marker is the only marker 236 allowed in lzip files. This variant of the LZMA stream format has 237 been chosen to maximize safety and interoperability. 239 The second stage of LZMA is a range encoder that uses a different 240 probability model for each type of symbol; distances, lengths, 241 literal bytes, etc. Range encoding conceptually encodes all the 242 symbols of the message into one number. Unlike Huffman coding, which 243 assigns to each symbol a bit-pattern and concatenates all the 244 bit-patterns together, range encoding can compress one symbol to less 245 than one bit. Therefore the compressed data produced by a range 246 encoder can't be split in pieces that could be described 247 individually. 249 It seems that the only way of describing the LZMA-302eos stream is 250 describing the algorithm that decodes it. And given the many details 251 about the range decoder that need to be described accurately, the 252 source code of a real decoder seems the only appropriate reference to 253 use. 255 What follows is a description of the decoding algorithm for 256 LZMA-302eos streams using as reference the source code of "lzd", an 257 educational decompressor for lzip files which can be downloaded from 258 the lzip download directory. Lzd is written in C++11 and its source 259 code is included in appendix A. 261 3.1. What is coded 263 The LZMA stream includes literals, matches, and repeated matches 264 (matches reusing a recently used distance). There are 7 different 265 coding sequences: 267 Bit sequence Name Description 268 ----------------------- -------- -------------------------------- 269 0 + byte literal literal byte 270 1 + 0 + len + dis match distance-length pair 271 1 + 1 + 0 + 0 shortrep 1 byte match at latest used 272 distance 273 1 + 1 + 0 + 1 + len rep0 len bytes match at latest used 274 distance 275 1 + 1 + 1 + 0 + len rep1 len bytes match at second latest 276 used distance 277 1 + 1 + 1 + 1 + 0 + len rep2 len bytes match at third latest 278 used distance 279 1 + 1 + 1 + 1 + 1 + len rep3 len bytes match at fourth latest 280 used distance 282 In the following tables, multibit sequences are coded in normal 283 order, from most significant bit (MSB) to least significant bit 284 (LSB), except where noted otherwise. 286 Lengths (the 'len' in the table above) are coded as follows: 288 Bit sequence Description 289 -------------- ---------------------- 290 0 + 3 bits lengths from 2 to 9 291 1 + 0 + 3 bits lengths from 10 to 17 292 1 + 1 + 8 bits lengths from 18 to 273 294 The coding of distances is a little more complicated, so I'll begin 295 explaining a simpler version of the encoding. 297 Imagine you need to encode a number from 0 to 2^32 - 1, and you want 298 to do it in a way that produces shorter codes for the smaller 299 numbers. You may first encode the position of the most significant 300 bit that is set to 1, which you may find by making a bit scan from 301 the left (from the MSB). A position of 0 means that the number is 0 302 (no bit is set), 1 means the LSB is the first bit set (the number is 303 1), and 32 means the MSB is set (i.e., the number is >= 0x80000000). 304 Then, if the position is >= 2, you encode the remaining position - 1 305 bits. Let's call these bits "direct bits" because they are coded 306 directly by value instead of indirectly by position. 308 The inconvenient of this simple method is that it needs 6 bits to 309 encode the position, but it just uses 33 of the 64 possible values, 310 wasting almost half of the codes. 312 The intelligent trick of LZMA is that it encodes in what it calls a 313 "slot" the position of the most significant bit set, along with the 314 value of the next bit, using the same 6 bits that would take to 315 encode the position alone. This seems to need 66 slots (twice the 316 number of positions), but for positions 0 and 1 there is no next bit, 317 so the number of slots needed is 64 (0 to 63). 319 The 6 bits representing this "slot number" are then context-coded. 320 If the distance is >= 4, the remaining bits are encoded as follows. 321 'direct_bits' is the amount of remaining bits (from 1 to 30) needed 322 to form a complete distance, and is calculated as (slot >> 1) - 1. 323 If a distance needs 6 or more direct_bits, the last 4 bits are 324 encoded separately. The last piece (all the direct_bits for 325 distances 4 to 127, or the last 4 bits for distances >= 128) is 326 context-coded in reverse order (from LSB to MSB). For distances 327 >= 128, the 'direct_bits - 4' part is encoded with fixed 0.5 328 probability. 330 Bit sequence Description 331 --------------------------------- ------------------------------ 332 slot distances from 0 to 3 333 slot + direct_bits distances from 4 to 127 334 slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1 336 3.2. The coding contexts 338 These contexts ('Bit_model' in the source), are integers or arrays of 339 integers representing the probability of the corresponding bit being 340 0. 342 The indices used in these arrays are: 344 state 345 A state machine ('State' in the source) with 12 states (0 to 11), 346 coding the latest 2 to 4 types of sequences processed. The 347 initial state is 0. 349 pos_state 350 Value of the 2 least significant bits of the current position in 351 the decoded data. 353 literal_state 354 Value of the 3 most significant bits of the latest byte decoded. 356 len_state 357 Coded value of the current match length (length - 2), with a 358 maximum of 3. The resulting value is in the range 0 to 3. 360 In the following table, '!literal' is any sequence except a literal 361 byte. 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The 362 types of previous sequences corresponding to each state are: 364 State Types of previous sequences 365 ----- --------------------------------------------- 366 0 literal, literal, literal 367 1 match, literal, literal 368 2 rep or (!literal, shortrep), literal, literal 369 3 literal, shortrep, literal, literal 370 4 match, literal 371 5 rep or (!literal, shortrep), literal 372 6 literal, shortrep, literal 373 7 literal, match 374 8 literal, rep 375 9 literal, shortrep 376 10 !literal, match 377 11 !literal, (rep or shortrep) 378 The contexts for decoding the type of coding sequence are: 380 Name Indices Used when 381 -------- ---------------- ------------------- 382 bm_match state, pos_state sequence start 383 bm_rep state after sequence 1 384 bm_rep0 state after sequence 11 385 bm_rep1 state after sequence 111 386 bm_rep2 state after sequence 1111 387 bm_len state, pos_state after sequence 110 389 The contexts for decoding distances are: 391 Name Indices Used when 392 ----------- ------------------- --------------------------- 393 bm_dis_slot len_state, bit tree distance start 394 bm_dis reverse bit tree after slots 4 to 13 395 bm_align reverse bit tree for distances >= 128, after 396 fixed probability bits 398 There are two separate sets of contexts for lengths ('Len_model' in 399 the source). One for normal matches, the other for repeated matches. 400 The contexts in each Len_model are (see 'decode_len' in the source): 402 Name Indices Used when 403 ------- ------------------- ----------------- 404 choice1 none length start 405 choice2 none after sequence 1 406 bm_low pos_state, bit tree after sequence 0 407 bm_mid pos_state, bit tree after sequence 10 408 bm_high bit tree after sequence 11 410 The context array 'bm_literal' is special. In principle it acts as a 411 normal bit tree context, the one selected by 'literal_state'. But if 412 the previous decoded byte was not a literal, two other bit tree 413 contexts are used depending on the value of each bit in 'match_byte' 414 (the byte at the latest used distance), until a bit is decoded that 415 is different from its corresponding bit in 'match_byte'. After the 416 first difference is found, the rest of the byte is decoded using the 417 normal bit tree context. (See 'decode_matched' in the source). 419 3.3. The range decoder 421 The LZMA stream is consumed one byte at a time by the range decoder. 422 (See 'normalize' in the source). Every byte consumed produces a 423 variable number of decoded bits, depending on how well these bits 424 agree with their context. (See 'decode_bit' in the source). 426 The range decoder state consists of two unsigned 32-bit variables; 427 'range' (representing the most significant part of the range size not 428 yet decoded), and 'code' (representing the current point within 429 'range'). 'range' is initialized to 2^32 - 1, and 'code' is 430 initialized to 0. 432 The range encoder produces a first 0 byte that must be ignored by the 433 range decoder. This is done by shifting 5 bytes in the 434 initialization of 'code' instead of 4. (See the 'Range_decoder' 435 constructor in the source). 437 3.4. Decoding and verifying the LZMA stream 439 After decoding the member header and obtaining the dictionary size, 440 the range decoder is initialized and then the LZMA decoder enters a 441 loop (see 'decode_member' in the source) where it invokes the range 442 decoder with the appropriate contexts to decode the different coding 443 sequences (matches, repeated matches, and literal bytes), until the 444 "End Of Stream" marker is decoded. 446 Once the "End Of Stream" marker has been decoded, the decompressor 447 must read and decode the member trailer, and verify that the three 448 integrity factors (CRC, data size, and member size) match those 449 calculated by the LZMA decoder. 451 3.5. Compression 453 Compression consists in describing the uncompressed data as a 454 succession of coding sequences from the set shown in Section 3.1, and 455 then encoding them using a range encoder. The fast encoder in the 456 reference implementation shows how this can be done in almost the 457 simplest way possible; issuing the longest match found, or a literal 458 byte if no match is found, and repeating until all the data have been 459 compressed. More sophisticated choosing of the coding sequences may 460 achieve higher compression ratios. 462 4. IANA Considerations 464 IANA is asked to make two registrations, as described below. 466 4.1. The 'application/lzip' Media Type 468 The 'application/lzip' media type identifies a block of data that is 469 compressed using lzip compression. The data is a stream of bytes as 470 described in this document. IANA is asked to add the following entry 471 to the standards tree of the "Media Types" registry: 473 Type name: application 475 Subtype name: lzip 477 Required parameters: N/A 479 Optional parameters: N/A 481 Encoding considerations: binary 483 Security considerations: See Section 5 of this RFC 485 Interoperability considerations: N/A 487 Published specification: this RFC 489 Applications that use this media type: 490 Any application where data size is an issue 492 Fragment Identifier Considerations: N/A 494 Restrictions on Usage: N/A 496 Provisional Registrations: N/A 498 Additional information: 500 Deprecated alias names for this type: application/x-lzip 502 Magic number(s): First 4 bytes (0x4C, 0x5A, 0x49, 0x50) 504 File extension(s): .lz .tlz 506 Macintosh File Type Code(s): N/A 508 Object Identifier(s) or OID(s): N/A 510 Intended Usage: COMMON 512 Other Information & Comments: See [LZIP] 514 Author: Antonio Diaz Diaz 516 Change Controller: IETF 518 4.2. Content Encoding 520 IANA is asked to add the following entry to the "HTTP Content Coding 521 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" 522 registry: 524 Name: lzip 526 Description: A stream of bytes compressed using lzip compression 528 Pointer to specification text: this RFC 530 5. Security Considerations 532 Lzip is a compressed format. Decompression of the data may expand it 533 to a size more than 7000 times larger, risking an out-of-memory or 534 out-of-disc-space condition. Since both the gzip and lzip formats 535 contain a single data stream, any steps already taken to avoid such 536 attacks on application/gzip should also work on application/lzip. 537 One possible measure, already implemented in some applications, is to 538 set a limit to the decompressed size and stop the decompression or 539 warn the user if the limit is surpassed. 541 This media type does not employ any kind of "active content", but it 542 can be used to compress any other media type (for example 543 application/postscript) which could then be interpreted by the 544 application. 546 A lzip stream does not store any metadata. Any data appended to the 547 end of the stream is easily detected, and an error can be signaled. 548 It is not apparent how this media type could be used to help violate 549 a recipient's privacy. 551 The lzip media type does not need itself any external security 552 mechanisms. But again, it can be used to compress other media types 553 requiring them (for example a media type defined for storage of 554 sensitive medical information). 556 Because of the nature of the decoding algorithm used by lzip, it is 557 easy to protect the decoder from invalid memory accesses caused by 558 corruption in the input data (intentional or not). Size limits need 559 to be checked at just one place in the decompressor (the decoding of 560 rep0) to prevent buffer overflows. This inherent safety has been 561 extensively tested in the reference implementation and makes possible 562 the unique data recovery capabilities provided by lziprecover. 564 The lzip data stream does not contain any size fields that could be 565 faked to be smaller than the actual decompressed data in an attempt 566 to trigger a buffer overflow. (The 'data size' field in the lzip 567 trailer is only used to verify the size of the data produced as an 568 error detection measure). 570 6. References 572 6.1. Normative References 574 [LZIP] Diaz, A., "Lzip", . 576 6.2. Informative References 578 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", 579 RFC 1952, DOI 10.17487/RFC1952, May 1996, 580 . 582 Appendix A. Reference Source Code 584


585	/* Lzd - Educational decompressor for the lzip format
586	   Copyright (C) 2013-2020 Antonio Diaz Diaz.

588	   This program is free software. Redistribution and use in source and
589	   binary forms, with or without modification, are permitted provided
590	   that the following conditions are met:

592	   1. Redistributions of source code must retain the above copyright
593	   notice, this list of conditions, and the following disclaimer.

595	   2. Redistributions in binary form must reproduce the above copyright
596	   notice, this list of conditions, and the following disclaimer in the
597	   documentation and/or other materials provided with the distribution.

599	   This program is distributed in the hope that it will be useful,
600	   but WITHOUT ANY WARRANTY; without even the implied warranty of
601	   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
602	*/
603	/*
604	   Exit status: 0 for a normal exit, 1 for environmental problems
605	   (file not found, invalid flags, I/O errors, etc), 2 to indicate a
606	   corrupt or invalid input file.
607	*/
608	#include 
609	#include 
610	#include 
611	#include 
612	#include 
613	#include 
614	#include 
615	#if defined(__MSVCRT__) || defined(__OS2__) || defined(__DJGPP__)
616	#include 
617	#include 
618	#endif
619	#define PROGVERSION "1.2"

621	class State
622	  {
623	  int st;

625	public:
626	  enum { states = 12 };
627	  State() : st( 0 ) {}
628	  int operator()() const { return st; }
629	  bool is_char() const { return st < 7; }

631	  void set_char()
632	    {
633	    const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
634	    st = next[st];
635	    }
636	  void set_match()     { st = ( st < 7 ) ? 7 : 10; }
637	  void set_rep()       { st = ( st < 7 ) ? 8 : 11; }
638	  void set_short_rep() { st = ( st < 7 ) ? 9 : 11; }
639	  };

641	enum {
642	  min_dictionary_size = 1 << 12,
643	  max_dictionary_size = 1 << 29,
644	  literal_context_bits = 3,
645	  literal_pos_state_bits = 0,                           // not used
646	  pos_state_bits = 2,
647	  pos_states = 1 << pos_state_bits,
648	  pos_state_mask = pos_states - 1,

650	  len_states = 4,
651	  dis_slot_bits = 6,
652	  start_dis_model = 4,
653	  end_dis_model = 14,
654	  modeled_distances = 1 << ( end_dis_model / 2 ),       // 128
655	  dis_align_bits = 4,
656	  dis_align_size = 1 << dis_align_bits,
657	  len_low_bits = 3,
658	  len_mid_bits = 3,
659	  len_high_bits = 8,
660	  len_low_symbols = 1 << len_low_bits,
661	  len_mid_symbols = 1 << len_mid_bits,
662	  len_high_symbols = 1 << len_high_bits,
663	  max_len_symbols = len_low_symbols+len_mid_symbols+len_high_symbols,

665	  min_match_len = 2,                                    // must be 2

667	  bit_model_move_bits = 5,
668	  bit_model_total_bits = 11,
669	  bit_model_total = 1 << bit_model_total_bits };

671	struct Bit_model
672	  {
673	  int probability;
674	  Bit_model() : probability( bit_model_total / 2 ) {}
675	  };

677	struct Len_model
678	  {
679	  Bit_model choice1;
680	  Bit_model choice2;
681	  Bit_model bm_low[pos_states][len_low_symbols];
682	  Bit_model bm_mid[pos_states][len_mid_symbols];
683	  Bit_model bm_high[len_high_symbols];
684	  };

686	class CRC32
687	  {
688	  uint32_t data[256];           // Table of CRCs of all 8-bit messages.

690	public:
691	  CRC32()
692	    {
693	    for( unsigned n = 0; n < 256; ++n )
694	      {
695	      unsigned c = n;
696	      for( int k = 0; k < 8; ++k )
697	        { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; }
698	      data[n] = c;
699	      }
700	    }

702	  void update_buf( uint32_t & crc, const uint8_t * const buffer,
703	                   const int size ) const
704	    {
705	    for( int i = 0; i < size; ++i )
706	      crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 );

708	    }
709	  };

711	const CRC32 crc32;

713	typedef uint8_t Lzip_header[6];         // 0-3 magic bytes
714	                                        //   4 version
715	                                        //   5 coded dictionary size
716	typedef uint8_t Lzip_trailer[20];
717	                       //  0-3  CRC32 of the uncompressed data
718	                       //  4-11 size of the uncompressed data
719	                       // 12-19 member size including header and trailer

721	class Range_decoder
722	  {
723	  unsigned long long member_pos;
724	  uint32_t code;
725	  uint32_t range;

727	public:
728	  Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU )
729	    {
730	    for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte();
731	    }

733	  uint8_t get_byte() { ++member_pos; return std::getc( stdin ); }
734	  unsigned long long member_position() const { return member_pos; }

736	  unsigned decode( const int num_bits )
737	    {
738	    unsigned symbol = 0;
739	    for( int i = num_bits; i > 0; --i )
740	      {
741	      range >>= 1;
742	      symbol <<= 1;
743	      if( code >= range ) { code -= range; symbol |= 1; }
744	      if( range <= 0x00FFFFFFU )                        // normalize
745	        { range <<= 8; code = ( code << 8 ) | get_byte(); }
746	      }
747	    return symbol;
748	    }

750	  unsigned decode_bit( Bit_model & bm )
751	    {
752	    unsigned symbol;
753	    const uint32_t bound =
754	      ( range >> bit_model_total_bits ) * bm.probability;
755	    if( code < bound )
756	      {
757	      range = bound;
758	      bm.probability +=
759	        ( bit_model_total - bm.probability ) >> bit_model_move_bits;
760	      symbol = 0;
761	      }
762	    else
763	      {
764	      range -= bound;
765	      code -= bound;
766	      bm.probability -= bm.probability >> bit_model_move_bits;
767	      symbol = 1;
768	      }
769	    if( range <= 0x00FFFFFFU )                          // normalize
770	      { range <<= 8; code = ( code << 8 ) | get_byte(); }
771	    return symbol;
772	    }

774	  unsigned decode_tree( Bit_model bm[], const int num_bits )
775	    {
776	    unsigned symbol = 1;
777	    for( int i = 0; i < num_bits; ++i )
778	      symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
779	    return symbol - ( 1 << num_bits );
780	    }

782	  unsigned decode_tree_reversed( Bit_model bm[], const int num_bits )
783	    {
784	    unsigned symbol = decode_tree( bm, num_bits );
785	    unsigned reversed_symbol = 0;
786	    for( int i = 0; i < num_bits; ++i )
787	      {
788	      reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 );
789	      symbol >>= 1;
790	      }
791	    return reversed_symbol;
792	    }

794	  unsigned decode_matched( Bit_model bm[], const unsigned match_byte )
795	    {
796	    unsigned symbol = 1;
797	    for( int i = 7; i >= 0; --i )
798	      {
799	      const unsigned match_bit = ( match_byte >> i ) & 1;
800	      const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100]);
801	      symbol = ( symbol << 1 ) | bit;
802	      if( match_bit != bit )
803	        {
804	        while( symbol < 0x100 )
805	          symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
806	        break;
807	        }
808	      }

810	    return symbol & 0xFF;
811	    }

813	  unsigned decode_len( Len_model & lm, const int pos_state )
814	    {
815	    if( decode_bit( lm.choice1 ) == 0 )
816	      return decode_tree( lm.bm_low[pos_state], len_low_bits );
817	    if( decode_bit( lm.choice2 ) == 0 )
818	      return len_low_symbols +
819	             decode_tree( lm.bm_mid[pos_state], len_mid_bits );
820	    return len_low_symbols + len_mid_symbols +
821	           decode_tree( lm.bm_high, len_high_bits );
822	    }
823	  };

825	class LZ_decoder
826	  {
827	  unsigned long long partial_data_pos;
828	  Range_decoder rdec;
829	  const unsigned dictionary_size;
830	  uint8_t * const buffer;       // output buffer
831	  unsigned pos;                 // current pos in buffer
832	  unsigned stream_pos;          // first byte not yet written to stdout
833	  uint32_t crc_;
834	  bool pos_wrapped;

836	  void flush_data();

838	  uint8_t peek( const unsigned distance ) const
839	    {
840	    if( pos > distance ) return buffer[pos - distance - 1];
841	    if( pos_wrapped ) return buffer[dictionary_size+pos-distance-1];
842	    return 0;                   // prev_byte of first byte
843	    }

845	  void put_byte( const uint8_t b )
846	    {
847	    buffer[pos] = b;
848	    if( ++pos >= dictionary_size ) flush_data();
849	    }

851	public:
852	  explicit LZ_decoder( const unsigned dict_size )
853	    :
854	    partial_data_pos( 0 ),
855	    dictionary_size( dict_size ),
856	    buffer( new uint8_t[dictionary_size] ),
857	    pos( 0 ),
858	    stream_pos( 0 ),
859	    crc_( 0xFFFFFFFFU ),
860	    pos_wrapped( false )
861	    {}

863	  ~LZ_decoder() { delete[] buffer; }

865	  unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; }
866	  unsigned long long data_position() const
867	    { return partial_data_pos + pos; }
868	  uint8_t get_byte() { return rdec.get_byte(); }
869	  unsigned long long member_position() const
870	    { return rdec.member_position(); }

872	  bool decode_member();
873	  };

875	void LZ_decoder::flush_data()
876	  {
877	  if( pos > stream_pos )
878	    {
879	    const unsigned size = pos - stream_pos;
880	    crc32.update_buf( crc_, buffer + stream_pos, size );
881	    if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size )
882	      { std::fprintf(stderr, "Write error: %s\n", std::strerror(errno));
883	        std::exit( 1 ); }
884	    if( pos >= dictionary_size )
885	      { partial_data_pos += pos; pos = 0; pos_wrapped = true; }
886	    stream_pos = pos;
887	    }
888	  }

890	bool LZ_decoder::decode_member()        // Returns false if error
891	  {
892	  Bit_model bm_literal[1<> (8 - literal_context_bits);
918	      Bit_model * const bm = bm_literal[literal_state];
919	      if( state.is_char() )
920	        put_byte( rdec.decode_tree( bm, 8 ) );
921	      else
922	        put_byte( rdec.decode_matched( bm, peek( rep0 ) ) );
923	      state.set_char();
924	      continue;
925	      }
926	    // match or repeated match
927	    int len;
928	    if( rdec.decode_bit( bm_rep[state()] ) != 0 )             // 2nd bit
929	      {
930	      if( rdec.decode_bit( bm_rep0[state()] ) == 0 )          // 3rd bit
931	        {
932	        if( rdec.decode_bit(bm_len[state()][pos_state]) == 0 )// 4th bit
933	          { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; }
934	        }
935	      else
936	        {
937	        unsigned distance;
938	        if( rdec.decode_bit( bm_rep1[state()] ) == 0 )        // 4th bit
939	          distance = rep1;
940	        else
941	          {
942	          if( rdec.decode_bit( bm_rep2[state()] ) == 0 )      // 5th bit
943	            distance = rep2;
944	          else
945	            { distance = rep3; rep3 = rep2; }
946	          rep2 = rep1;
947	          }
948	        rep1 = rep0;
949	        rep0 = distance;
950	        }
951	      state.set_rep();
952	      len = min_match_len + rdec.decode_len( rep_len_model, pos_state );
953	      }
954	    else                                        // match
955	      {
956	      rep3 = rep2; rep2 = rep1; rep1 = rep0;
957	      len = min_match_len + rdec.decode_len(match_len_model, pos_state);
958	      const int len_state = std::min( len-min_match_len, len_states-1 );
959	      rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits );
960	      if( rep0 >= start_dis_model )
961	        {
962	        const unsigned dis_slot = rep0;
963	        const int direct_bits = ( dis_slot >> 1 ) - 1;
964	        rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
965	        if( dis_slot < end_dis_model )
966	          rep0 += rdec.decode_tree_reversed( bm_dis + (rep0 - dis_slot),
967	                                             direct_bits );
968	        else
969	          {
970	          rep0 +=
971	            rdec.decode(direct_bits - dis_align_bits) << dis_align_bits;
972	          rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits );
973	          if( rep0 == 0xFFFFFFFFU )             // marker found
974	            {
975	            flush_data();
976	            return ( len == min_match_len );    // End Of Stream marker
977	            }
978	          }
979	        }
980	      state.set_match();
981	      if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) )
982	        { flush_data(); return false; }
983	      }
984	    for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) );
985	    }
986	  flush_data();
987	  return false;
988	  }

990	int main( const int argc, const char * const argv[] )
991	  {
992	  if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) )
993	    {
994	    std::printf(
995	      "Lzd %s - Educational decompressor for the lzip format.\n"
996	      "Study the source to learn how a lzip decompressor works.\n"
997	      "See the lzip manual for an explanation of the code.\n"
998	      "\nUsage: %s [-d] < file.lz > file\n"
999	      "Lzd decompresses from standard input to standard output.\n"
1000	      "\nCopyright (C) 2020 Antonio Diaz Diaz.\n"
1001	      "License 2-clause BSD.\n"
1002	      "This is free software: you are free to change and redistribute "
1003	      "it.\nThere is NO WARRANTY, to the extent permitted by law.\n"
1004	      "Report bugs to lzip-bug@nongnu.org\n"
1005	      "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n",
1006	      PROGVERSION, argv[0] );
1007	    return 0;
1008	    }

1010	#if defined(__MSVCRT__) || defined(__OS2__) || defined(__DJGPP__)
1011	  setmode( STDIN_FILENO, O_BINARY );
1012	  setmode( STDOUT_FILENO, O_BINARY );
1013	#endif

1015	  for( bool first_member = true; ; first_member = false )
1016	    {
1017	    Lzip_header header;                         // verify header
1018	    for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin );
1019	    if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01",5 ) != 0 )
1020	      {
1021	      if( first_member )
1022	        { std::fputs( "Bad magic number (file not in lzip format).\n",
1023	                      stderr ); return 2; }
1024	      break;                                    // ignore trailing data
1025	      }
1026	    unsigned dict_size = 1 << ( header[5] & 0x1F );
1027	    dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 );
1028	    if( dict_sizemax_dictionary_size )
1029	      { std::fputs( "Invalid dictionary size in member header.\n",
1030	                    stderr ); return 2; }

1032	    LZ_decoder decoder( dict_size );            // decode LZMA stream
1033	    if( !decoder.decode_member() )
1034	      { std::fputs( "Data error\n", stderr ); return 2; }

1036	    Lzip_trailer trailer;                       // verify trailer
1037	    for( int i = 0; i < 20; ++i ) trailer[i] = decoder.get_byte();
1038	    int retval = 0;
1039	    unsigned crc = 0;
1040	    for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i];
1041	    if( crc != decoder.crc() )
1042	      { std::fputs( "CRC mismatch\n", stderr ); retval = 2; }

1044	    unsigned long long data_size = 0;
1045	    for( int i = 11; i >= 4; --i )
1046	      data_size = ( data_size << 8 ) + trailer[i];
1047	    if( data_size != decoder.data_position() )
1048	      { std::fputs( "Data size mismatch\n", stderr ); retval = 2; }

1050	    unsigned long long member_size = 0;
1051	    for( int i = 19; i >= 12; --i )
1052	      member_size = ( member_size << 8 ) + trailer[i];
1053	    if( member_size != decoder.member_position() )
1054	      { std::fputs( "Member size mismatch\n", stderr ); retval = 2; }
1055	    if( retval ) return retval;
1056	    }

1058	  if( std::fclose( stdout ) != 0 )
1059	    { std::fprintf( stderr, "Error closing stdout: %s\n",
1060	                    std::strerror( errno ) ); return 1; }

1062	  return 0;
1063	  }
1064	

1066	Acknowledgments

1068	   The ideas embodied in lzip are due to (at least) the following
1069	   people: Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey
1070	   Markov (for the definition of Markov chains), G.N.N. Martin (for the
1071	   definition of range encoding), and Igor Pavlov (for putting all the
1072	   above together in LZMA).

1074	Author's Address

1076	   Antonio Diaz Diaz
1077	   GNU Project
1078	   Email: antonio@gnu.org

1080	Expiration date: 2021-04-26