idnits 2.17.1 draft-diaz-lzip-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 269 has weird spacing: '...literal lit...' == Line 277 has weird spacing: '...0 + len rep2...' == Line 279 has weird spacing: '...1 + len rep3...' == Line 291 has weird spacing: '... 3 bits len...' == Line 292 has weird spacing: '... 8 bits len...' == (7 more instances...) -- The document date (April 2020) is 1470 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- No issues found here. Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet Engineering Task Force (IETF) A. Diaz 3 INTERNET-DRAFT draft-diaz-lzip-01 GNU Project 4 Category: Informational April 2020 5 Expiration date: 2020-10-26 7 Lzip Compressed Format and the application/lzip Media Type 9 Abstract 11 Lzip is a lossless compressed data format designed for data sharing, 12 long-term archiving, and parallel (de)compression. Lzip uses a 13 simplified form of the Lempel-Ziv-Markov chain-Algorithm (LZMA) 14 stream format, chosen to maximize safety and interoperability. Lzip 15 can achieve higher compression ratios than gzip. This document 16 describes the lzip format and registers a media type and content 17 encoding to be used when transporting lzip-compressed content via 18 Multipurpose Internet Mail Extensions (MIME). 20 Status of This Memo 22 This document is not an Internet Standards Track specification; it is 23 published for informational purposes. 25 This document is a product of the Internet Engineering Task Force 26 (IETF). It represents the consensus of the IETF community. It has 27 received public review and has been approved for publication by the 28 Internet Engineering Steering Group (IESG). Not all documents 29 approved by the IESG are candidates for any level of Internet 30 Standard; see Section 2 of RFC 7841. 32 Information about the current status of this document, any errata, 33 and how to provide feedback on it may be obtained at 34 http://www.rfc-editor.org/info/rfc. 36 Comments are solicited and should be addressed to the lzip's mailing 37 list at lzip-bug@nongnu.org and/or the author. 39 Copyright Notice 41 Copyright (c) 2020 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Internet-Draft Boilerplate 56 This Internet-Draft is submitted in full conformance with the 57 provisions of BCP 78 and BCP 79. Internet-Drafts are working 58 documents of the Internet Engineering Task Force (IETF). Note that 59 other groups may also distribute working documents as 60 Internet-Drafts. The list of current Internet-Drafts is at 61 http://datatracker.ietf.org/drafts/current. 63 Internet-Drafts are draft documents valid for a maximum of six months 64 and may be updated, replaced, or obsoleted by other documents at any 65 time. It is inappropriate to use Internet-Drafts as reference 66 material or to cite them other than as "work in progress." 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 71 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . 3 72 1.2. Compliance . . . . . . . . . . . . . . . . . . . . . . . 4 73 2. File Format . . . . . . . . . . . . . . . . . . . . . . . . . 4 74 3. Format of the LZMA stream in lzip files . . . . . . . . . . . 5 75 3.1. What is coded . . . . . . . . . . . . . . . . . . . . . 6 76 3.2. The coding contexts . . . . . . . . . . . . . . . . . . 8 77 3.3. The range decoder . . . . . . . . . . . . . . . . . . . 10 78 3.4. Decoding and verifying the LZMA stream . . . . . . . . . 10 79 3.5. Compression . . . . . . . . . . . . . . . . . . . . . . 10 80 4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10 81 4.1. The 'application/lzip' Media Type . . . . . . . . . . . 11 82 4.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 12 83 5. Security Considerations . . . . . . . . . . . . . . . . . . . 12 84 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 13 85 6.1. Normative References . . . . . . . . . . . . . . . . . . 13 86 6.2. Informative References . . . . . . . . . . . . . . . . . 13 87 Appendix A. Reference Source Code . . . . . . . . . . . . . . . 13 88 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 23 89 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . 23 91 1. Introduction 93 Lzip is a lossless compressed data format similar to gzip [RFC1952]. 94 Lzip is designed for data sharing, long-term archiving, parallel 95 (de)compression, and limited random access to the data. Lzip can 96 achieve higher compression ratios than gzip. 98 Lzip uses a simplified form of the Lempel-Ziv-Markov chain-Algorithm 99 (LZMA) stream format, chosen to maximize safety and interoperability. 100 The original LZMA algorithm and stream format were developed by Igor 101 Pavlov. 103 LZMA is much like deflate (the algorithm of gzip) with two main 104 differences that account for its higher compression ratio. First, 105 LZMA can use a dictionary size thousands of times larger than 106 deflate. Second, LZMA uses a range encoder as its last stage instead 107 of the less efficient (but faster) Huffman coding used by deflate. 109 1.1. Purpose 111 The purpose of this document is to define a lossless compressed data 112 format that is a) independent of the CPU type, operating system, file 113 system, and character set and b) is suitable for file compression and 114 pipe and streaming compression, using the LZMA algorithm. The text 115 of the specification assumes a basic background in programming at the 116 level of bits and other primitive data representations. 118 The data can be produced or consumed, even for an arbitrarily long 119 sequentially presented input data stream, using only an a priori 120 bounded amount of intermediate storage, and hence can be used in data 121 communications. 123 The data format defined by this specification allows efficient 124 parallel compression/decompression, and random access to blocks of 125 compressed data by means of multimember files and a distributed 126 index. 128 This specification is intended for use by implementors of software to 129 compress data into lzip format and/or decompress data from lzip 130 format. The lzip format is supported by one free software reference 131 implementation (the lzip tool) written in portable C++ (C++11), and 132 by several free software implementations written in portable C (C99), 133 all of them available at [LZIP]. The reference implementation has 134 been in stable status since 2009, and is widely deployed. 136 Also, to enable the transport of a data object compressed with lzip, 137 this document registers a media type that can be used to identify 138 such content when it is used in a payload encoded using Multipurpose 139 Internet Mail Extensions (MIME). 141 1.2. Compliance 143 Unless otherwise indicated below, a compliant decompressor must be 144 able to accept and decompress any file that conforms to all the 145 specifications presented here; a compliant compressor must produce 146 files that conform to all the specifications presented here. 148 2. File Format 150 Perfection is reached, not when there is no longer anything to add, 151 but when there is no longer anything to take away. 152 -- Antoine de Saint-Exupery 154 In the diagram below, a box like this: 156 +---+ 157 | | <-- the vertical bars might be missing 158 +---+ 160 represents one byte; a box like this: 162 +==============+ 163 | | 164 +==============+ 166 represents a variable number of bytes. 168 A lzip file consists of a series of "members" (compressed data sets). 169 The members simply appear one after another in the file, with no 170 additional information before, between, or after them. 172 Each member has the following structure: 174 +--+--+--+--+----+----+=============+ 175 | ID string | VN | DS | LZMA stream | (more-->) 176 +--+--+--+--+----+----+=============+ 178 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 179 | CRC32 | Data size | Member size | 180 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 182 All multibyte values are stored in little endian order. 184 ID string (the "magic" bytes) 185 A four byte string, identifying the lzip format, with the value 186 "LZIP" (0x4C, 0x5A, 0x49, 0x50). 188 VN (version number, 1 byte) 189 Just in case something needs to be modified in the future. 1 for 190 now. 192 DS (coded dictionary size, 1 byte) 193 The dictionary size is calculated by taking a power of 2 (the base 194 size) and subtracting from it a fraction between 0/16 and 7/16 of 195 the base size. 196 Bits 4-0 contain the base 2 logarithm of the base size (12 to 29). 197 Bits 7-5 contain the numerator of the fraction (0 to 7) to 198 subtract from the base size to obtain the dictionary size. 199 Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB 200 Valid values for dictionary size range from 4 KiB to 512 MiB. 202 LZMA stream 203 The LZMA stream, finished by an end of stream marker. Uses 204 default values for encoder properties (see Section 3 for a 205 complete description). 207 CRC32 (4 bytes) 208 Cyclic Redundancy Check (CRC) of the uncompressed original data. 210 Data size (8 bytes) 211 Size of the uncompressed original data. 213 Member size (8 bytes) 214 Total size of the member, including header and trailer. This 215 field acts as a distributed index, allows the verification of 216 stream integrity, and facilitates safe recovery of undamaged 217 members from multimember files. 219 3. Format of the LZMA stream in lzip files 221 The LZMA algorithm has three parameters, called "special LZMA 222 properties", to adjust it for some kinds of binary data. These 223 parameters are; 'literal_context_bits' (with a default value of 3), 224 'literal_pos_state_bits' (with a default value of 0), and 225 'pos_state_bits' (with a default value of 2). As a general purpose 226 compressor, lzip only uses the default values for these parameters. 227 In particular 'literal_pos_state_bits' has been optimized away and 228 does not even appear in the code. 230 Lzip also finishes the LZMA stream with an "End Of Stream" (EOS) 231 marker (the distance-length pair 0xFFFFFFFFU, 2), which in 232 conjunction with the "member size" field in the member trailer allows 233 the verification of stream integrity. The LZMA stream in lzip files 234 always has these two features (default properties and EOS marker) and 235 is referred to in this document as LZMA-302eos. This variant of the 236 LZMA stream format has been chosen to maximize safety and 237 interoperability. 239 The second stage of LZMA is a range encoder that uses a different 240 probability model for each type of symbol; distances, lengths, 241 literal bytes, etc. Range encoding conceptually encodes all the 242 symbols of the message into one number. Unlike Huffman coding, which 243 assigns to each symbol a bit-pattern and concatenates all the 244 bit-patterns together, range encoding can compress one symbol to less 245 than one bit. Therefore the compressed data produced by a range 246 encoder can't be split in pieces that could be individually 247 described. 249 It seems that the only way of describing the LZMA-302eos stream is 250 describing the algorithm that decodes it. And given the many details 251 about the range decoder that need to be described accurately, the 252 source code of a real decoder seems the only appropriate reference to 253 use. 255 What follows is a description of the decoding algorithm for 256 LZMA-302eos streams using as reference the source code of "lzd", an 257 educational decompressor for lzip files which can be downloaded from 258 the lzip download directory. Lzd is written in C++11 and its source 259 code is included in appendix A. 261 3.1. What is coded 263 The LZMA stream includes literals, matches, and repeated matches 264 (matches reusing a recently used distance). There are 7 different 265 coding sequences: 267 Bit sequence Name Description 268 ----------------------- -------- -------------------------------- 269 0 + byte literal literal byte 270 1 + 0 + len + dis match distance-length pair 271 1 + 1 + 0 + 0 shortrep 1 byte match at latest used 272 distance 273 1 + 1 + 0 + 1 + len rep0 len bytes match at latest used 274 distance 275 1 + 1 + 1 + 0 + len rep1 len bytes match at second latest 276 used distance 277 1 + 1 + 1 + 1 + 0 + len rep2 len bytes match at third latest 278 used distance 279 1 + 1 + 1 + 1 + 1 + len rep3 len bytes match at fourth latest 280 used distance 282 In the following tables, multibit sequences are coded in normal 283 order, from most significant bit (MSB) to least significant bit 284 (LSB), except where noted otherwise. 286 Lengths (the 'len' in the table above) are coded as follows: 288 Bit sequence Description 289 -------------- ---------------------- 290 0 + 3 bits lengths from 2 to 9 291 1 + 0 + 3 bits lengths from 10 to 17 292 1 + 1 + 8 bits lengths from 18 to 273 294 The coding of distances is a little more complicated, so I'll begin 295 explaining a simpler version of the encoding. 297 Imagine you need to code a number from 0 to 2^32 - 1, and you want to 298 do it in a way that produces shorter codes for the smaller numbers. 299 You may first encode the position of the most significant bit that is 300 set to 1, which you may find by making a bit scan from the left (from 301 the MSB). A position of 0 means that the number is 0 (no bit is 302 set), 1 means the LSB is the first bit set (the number is 1), and 32 303 means the MSB is set (i.e., the number is >= 0x80000000). Then, if 304 the position is >= 2, you encode the remaining position - 1 bits. 305 Let's call these bits "direct_bits" because they are coded directly 306 by value instead of indirectly by position. 308 The inconvenient of this simple method is that it needs 6 bits to 309 encode the position, but it just uses 33 of the 64 possible values, 310 wasting almost half of the codes. 312 The intelligent trick of LZMA is that it encodes in what it calls a 313 "slot" the position of the most significant bit set, along with the 314 value of the next bit, using the same 6 bits that would take to 315 encode the position alone. This seems to need 66 slots (twice the 316 number of positions), but for positions 0 and 1 there is no next bit, 317 so the number of slots needed is 64 (0 to 63). 319 The 6 bits representing this "slot number" are then context-coded. 320 If the distance is >= 4, the remaining bits are coded as follows. 321 'direct_bits' is the amount of remaining bits (from 1 to 30) needed 322 to form a complete distance, and is calculated as (slot >> 1) - 1. 323 If a distance needs 6 or more direct_bits, the last 4 bits are coded 324 separately. The last piece (all the direct_bits for distances 4 to 325 127, or the last 4 bits for distances >= 128) is context-coded in 326 reverse order (from LSB to MSB). For distances >= 128, the 327 'direct_bits - 4' part is coded with fixed 0.5 probability. 329 Bit sequence Description 330 --------------------------------- ------------------------------ 331 slot distances from 0 to 3 332 slot + direct_bits distances from 4 to 127 333 slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1 335 3.2. The coding contexts 337 These contexts ('Bit_model' in the source), are integers or arrays of 338 integers representing the probability of the corresponding bit being 339 0. 341 The indices used in these arrays are: 343 state 344 A state machine ('State' in the source) with 12 states (0 to 11), 345 coding the latest 2 to 4 types of sequences processed. The 346 initial state is 0. 348 pos_state 349 Value of the 2 least significant bits of the current position in 350 the decoded data. 352 literal_state 353 Value of the 3 most significant bits of the latest byte decoded. 355 len_state 356 Coded value of length (length - 2), with a maximum of 3. The 357 resulting value is in the range 0 to 3. 359 In the following table, '!literal' is any sequence except a literal 360 byte. 'rep' is any one of 'rep0', 'rep1', 'rep2' or 'rep3'. The 361 types of previous sequences corresponding to each state are: 363 State Types of previous sequences 364 ----- --------------------------------------------- 365 0 literal, literal, literal 366 1 match, literal, literal 367 2 rep or (!literal, shortrep), literal, literal 368 3 literal, shortrep, literal, literal 369 4 match, literal 370 5 rep or (!literal, shortrep), literal 371 6 literal, shortrep, literal 372 7 literal, match 373 8 literal, rep 374 9 literal, shortrep 375 10 !literal, match 376 11 !literal, (rep or shortrep) 377 The contexts for decoding the type of coding sequence are: 379 Name Indices Used when 380 -------- ---------------- ------------------- 381 bm_match state, pos_state sequence start 382 bm_rep state after sequence 1 383 bm_rep0 state after sequence 11 384 bm_rep1 state after sequence 111 385 bm_rep2 state after sequence 1111 386 bm_len state, pos_state after sequence 110 388 The contexts for decoding distances are: 390 Name Indices Used when 391 ----------- ------------------- --------------------------- 392 bm_dis_slot len_state, bit tree distance start 393 bm_dis reverse bit tree after slots 4 to 13 394 bm_align reverse bit tree for distances >= 128, after 395 fixed probability bits 397 There are two separate sets of contexts for lengths ('Len_model' in 398 the source). One for normal matches, the other for repeated matches. 399 The contexts in each Len_model are (see 'decode_len' in the source): 401 Name Indices Used when 402 ------- ------------------- ----------------- 403 choice1 none length start 404 choice2 none after sequence 1 405 bm_low pos_state, bit tree after sequence 0 406 bm_mid pos_state, bit tree after sequence 10 407 bm_high bit tree after sequence 11 409 The context array 'bm_literal' is special. In principle it acts as a 410 normal bit tree context, the one selected by 'literal_state'. But if 411 the previous decoded byte was not a literal, two other bit tree 412 contexts are used depending on the value of each bit in 'match_byte' 413 (the byte at the latest used distance), until a bit is decoded that 414 is different from its corresponding bit in 'match_byte'. After the 415 first difference is found, the rest of the byte is decoded using the 416 normal bit tree context. (See 'decode_matched' in the source). 418 3.3. The range decoder 420 The LZMA stream is consumed one byte at a time by the range decoder. 421 (See 'normalize' in the source). Every byte consumed produces a 422 variable number of decoded bits, depending on how well these bits 423 agree with their context. (See 'decode_bit' in the source). 425 The range decoder state consists of two unsigned 32-bit variables; 426 'range' (representing the most significant part of the range size not 427 yet decoded), and 'code' (representing the current point within 428 'range'). 'range' is initialized to (2^32 - 1), and 'code' is 429 initialized to 0. 431 The range encoder produces a first 0 byte that must be ignored by the 432 range decoder. This is done by shifting 5 bytes in the 433 initialization of 'code' instead of 4. (See the 'Range_decoder' 434 constructor in the source). 436 3.4. Decoding and verifying the LZMA stream 438 After decoding the member header and obtaining the dictionary size, 439 the range decoder is initialized and then the LZMA decoder enters a 440 loop (see 'decode_member' in the source) where it invokes the range 441 decoder with the appropriate contexts to decode the different coding 442 sequences (matches, repeated matches, and literal bytes), until the 443 "End Of Stream" marker is decoded. 445 Once the "End Of Stream" marker has been decoded, the decompressor 446 must read and decode the member trailer, and verify that the three 447 integrity factors (CRC, data size, and member size) match those 448 calculated by the LZMA decoder. 450 3.5. Compression 452 Compression consists in describing the uncompressed data as a 453 succession of coding sequences from the set shown in Section 3.1, and 454 then encoding them using a range encoder. The fast encoder in the 455 lzip reference implementation [LZIP] shows how this can be done in 456 almost the simplest way possible; issuing the longest match found, or 457 a literal byte if no match is found, and repeating until all the data 458 have been compressed. More sophisticated choosing of the coding 459 sequences may achieve higher compression ratios. 461 4. IANA Considerations 463 IANA is asked to make two registrations, as described below. 465 4.1. The 'application/lzip' Media Type 467 The 'application/lzip' media type identifies a block of data that is 468 compressed using lzip compression. The data is a stream of bytes as 469 described in this document. IANA is asked to add the following entry 470 to the standards tree of the "Media Types" registry: 472 Type name: application 474 Subtype name: lzip 476 Required parameters: N/A 478 Optional parameters: N/A 480 Encoding considerations: binary 482 Security considerations: See Section 5 of this RFC 484 Interoperability considerations: N/A 486 Published specification: this RFC 488 Applications that use this media type: 489 Any application where data size is an issue 491 Fragment Identifier Considerations: N/A 493 Restrictions on Usage: N/A 495 Provisional Registrations: N/A 497 Additional information: 499 Deprecated alias names for this type: application/x-lzip 501 Magic number(s): First 4 bytes (0x4C, 0x5A, 0x49, 0x50) 503 File extension(s): .lz .tlz 505 Macintosh File Type Code(s): N/A 507 Object Identifier(s) or OID(s): N/A 509 Intended Usage: COMMON 511 Other Information & Comments: See [LZIP] 513 Author: Antonio Diaz Diaz 515 Change Controller: IETF 517 4.2. Content Encoding 519 IANA is asked to add the following entry to the "HTTP Content Coding 520 Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters" 521 registry: 523 Name: lzip 525 Description: A stream of bytes compressed using lzip compression 527 Pointer to specification text: this RFC 529 5. Security Considerations 531 Lzip is a compressed format. Decompression of the data may expand it 532 to a size more than 7000 times larger, risking an out-of-memory or 533 out-of-disc-space condition. Since both the gzip and lzip formats 534 contain a single data stream, any steps already taken to avoid such 535 attacks on application/gzip should also work on application/lzip. 536 One possible measure, already implemented in some applications, is to 537 set a limit to the decompressed size and stop the decompression or 538 warn the user if the limit is surpassed. 540 This media type does not employ any kind of "active content", but it 541 can be used to compress any other media type (for example 542 application/postscript) which could then be interpreted by the 543 application. 545 A lzip stream does not store any metadata. Any data appended to the 546 end of the stream is easily detected, and an error can be signaled. 547 It is not apparent how this media type could be used to help violate 548 a recipient's privacy. 550 The lzip media type does not need itself any external security 551 mechanisms. But again, it can be used to compress other media types 552 requiring them (for example a media type defined for storage of 553 sensitive medical information). 555 Because of the nature of the decoding algorithm used by lzip, it is 556 easy to protect the decoder from invalid memory accesses caused by 557 corruption in the input data (intentional or not). Size limits need 558 to be checked at just one place in the decompressor (the decoding of 559 rep0) to prevent buffer overflows. This has been extensively tested 560 with a specialized tool (unzcrash) distributed with the lziprecover 561 tool. 563 The lzip data stream does not contain any size fields that can be 564 faked to be smaller than the actual decompressed data in an attempt 565 to trigger a buffer overflow. (The 'Data size' field in the lzip 566 trailer is only used to verify the size of the data produced as an 567 error detection measure). 569 6. References 571 6.1. Normative References 573 [LZIP] Diaz, A., "Lzip", . 575 6.2. Informative References 577 [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", 578 RFC 1952, DOI 10.17487/RFC1952, May 1996, 579 . 581 Appendix A. Reference Source Code 583 584 /* Lzd - Educational decompressor for the lzip format 585 Copyright (C) 2013-2020 Antonio Diaz Diaz. 587 This program is free software. Redistribution and use in source and 588 binary forms, with or without modification, are permitted provided 589 that the following conditions are met: 591 1. Redistributions of source code must retain the above copyright 592 notice, this list of conditions, and the following disclaimer. 594 2. Redistributions in binary form must reproduce the above copyright 595 notice, this list of conditions, and the following disclaimer in the 596 documentation and/or other materials provided with the distribution. 598 This program is distributed in the hope that it will be useful, 599 but WITHOUT ANY WARRANTY; without even the implied warranty of 600 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 601 */ 602 /* 603 Exit status: 0 for a normal exit, 1 for environmental problems 604 (file not found, invalid flags, I/O errors, etc), 2 to indicate a 605 corrupt or invalid input file. 606 */ 607 #include 608 #include 609 #include 610 #include 611 #include 612 #include 613 #include 614 #if defined(__MSVCRT__) || defined(__OS2__) || defined(__DJGPP__) 615 #include 616 #include 617 #endif 618 #define PROGVERSION "1.2" 620 class State 621 { 622 int st; 624 public: 625 enum { states = 12 }; 626 State() : st( 0 ) {} 627 int operator()() const { return st; } 628 bool is_char() const { return st < 7; } 630 void set_char() 631 { 632 const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 }; 633 st = next[st]; 634 } 635 void set_match() { st = ( st < 7 ) ? 7 : 10; } 636 void set_rep() { st = ( st < 7 ) ? 8 : 11; } 637 void set_short_rep() { st = ( st < 7 ) ? 9 : 11; } 638 }; 640 enum { 641 min_dictionary_size = 1 << 12, 642 max_dictionary_size = 1 << 29, 643 literal_context_bits = 3, 644 literal_pos_state_bits = 0, // not used 645 pos_state_bits = 2, 646 pos_states = 1 << pos_state_bits, 647 pos_state_mask = pos_states - 1, 649 len_states = 4, 650 dis_slot_bits = 6, 651 start_dis_model = 4, 652 end_dis_model = 14, 653 modeled_distances = 1 << ( end_dis_model / 2 ), // 128 654 dis_align_bits = 4, 655 dis_align_size = 1 << dis_align_bits, 656 len_low_bits = 3, 657 len_mid_bits = 3, 658 len_high_bits = 8, 659 len_low_symbols = 1 << len_low_bits, 660 len_mid_symbols = 1 << len_mid_bits, 661 len_high_symbols = 1 << len_high_bits, 662 max_len_symbols = len_low_symbols+len_mid_symbols+len_high_symbols, 664 min_match_len = 2, // must be 2 666 bit_model_move_bits = 5, 667 bit_model_total_bits = 11, 668 bit_model_total = 1 << bit_model_total_bits }; 670 struct Bit_model 671 { 672 int probability; 673 Bit_model() : probability( bit_model_total / 2 ) {} 674 }; 676 struct Len_model 677 { 678 Bit_model choice1; 679 Bit_model choice2; 680 Bit_model bm_low[pos_states][len_low_symbols]; 681 Bit_model bm_mid[pos_states][len_mid_symbols]; 682 Bit_model bm_high[len_high_symbols]; 683 }; 685 class CRC32 686 { 687 uint32_t data[256]; // Table of CRCs of all 8-bit messages. 689 public: 690 CRC32() 691 { 692 for( unsigned n = 0; n < 256; ++n ) 693 { 694 unsigned c = n; 695 for( int k = 0; k < 8; ++k ) 696 { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; } 697 data[n] = c; 698 } 699 } 701 void update_buf( uint32_t & crc, const uint8_t * const buffer, 702 const int size ) const 703 { 704 for( int i = 0; i < size; ++i ) 705 crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 ); 707 } 708 }; 710 const CRC32 crc32; 712 typedef uint8_t Lzip_header[6]; // 0-3 magic bytes 713 // 4 version 714 // 5 coded_dict_size 715 typedef uint8_t Lzip_trailer[20]; 716 // 0-3 CRC32 of the uncompressed data 717 // 4-11 size of the uncompressed data 718 // 12-19 member size including header and trailer 720 class Range_decoder 721 { 722 unsigned long long member_pos; 723 uint32_t code; 724 uint32_t range; 726 public: 727 Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU ) 728 { 729 for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte(); 730 } 732 uint8_t get_byte() { ++member_pos; return std::getc( stdin ); } 733 unsigned long long member_position() const { return member_pos; } 735 unsigned decode( const int num_bits ) 736 { 737 unsigned symbol = 0; 738 for( int i = num_bits; i > 0; --i ) 739 { 740 range >>= 1; 741 symbol <<= 1; 742 if( code >= range ) { code -= range; symbol |= 1; } 743 if( range <= 0x00FFFFFFU ) // normalize 744 { range <<= 8; code = ( code << 8 ) | get_byte(); } 745 } 746 return symbol; 747 } 749 unsigned decode_bit( Bit_model & bm ) 750 { 751 unsigned symbol; 752 const uint32_t bound = 753 ( range >> bit_model_total_bits ) * bm.probability; 754 if( code < bound ) 755 { 756 range = bound; 757 bm.probability += 758 ( bit_model_total - bm.probability ) >> bit_model_move_bits; 759 symbol = 0; 760 } 761 else 762 { 763 range -= bound; 764 code -= bound; 765 bm.probability -= bm.probability >> bit_model_move_bits; 766 symbol = 1; 767 } 768 if( range <= 0x00FFFFFFU ) // normalize 769 { range <<= 8; code = ( code << 8 ) | get_byte(); } 770 return symbol; 771 } 773 unsigned decode_tree( Bit_model bm[], const int num_bits ) 774 { 775 unsigned symbol = 1; 776 for( int i = 0; i < num_bits; ++i ) 777 symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); 778 return symbol - ( 1 << num_bits ); 779 } 781 unsigned decode_tree_reversed( Bit_model bm[], const int num_bits ) 782 { 783 unsigned symbol = decode_tree( bm, num_bits ); 784 unsigned reversed_symbol = 0; 785 for( int i = 0; i < num_bits; ++i ) 786 { 787 reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 ); 788 symbol >>= 1; 789 } 790 return reversed_symbol; 791 } 793 unsigned decode_matched( Bit_model bm[], const unsigned match_byte ) 794 { 795 unsigned symbol = 1; 796 for( int i = 7; i >= 0; --i ) 797 { 798 const unsigned match_bit = ( match_byte >> i ) & 1; 799 const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100]); 800 symbol = ( symbol << 1 ) | bit; 801 if( match_bit != bit ) 802 { 803 while( symbol < 0x100 ) 804 symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); 805 break; 806 } 807 } 809 return symbol & 0xFF; 810 } 812 unsigned decode_len( Len_model & lm, const int pos_state ) 813 { 814 if( decode_bit( lm.choice1 ) == 0 ) 815 return decode_tree( lm.bm_low[pos_state], len_low_bits ); 816 if( decode_bit( lm.choice2 ) == 0 ) 817 return len_low_symbols + 818 decode_tree( lm.bm_mid[pos_state], len_mid_bits ); 819 return len_low_symbols + len_mid_symbols + 820 decode_tree( lm.bm_high, len_high_bits ); 821 } 822 }; 824 class LZ_decoder 825 { 826 unsigned long long partial_data_pos; 827 Range_decoder rdec; 828 const unsigned dictionary_size; 829 uint8_t * const buffer; // output buffer 830 unsigned pos; // current pos in buffer 831 unsigned stream_pos; // first byte not yet written to stdout 832 uint32_t crc_; 833 bool pos_wrapped; 835 void flush_data(); 837 uint8_t peek( const unsigned distance ) const 838 { 839 if( pos > distance ) return buffer[pos - distance - 1]; 840 if( pos_wrapped ) return buffer[dictionary_size+pos-distance-1]; 841 return 0; // prev_byte of first byte 842 } 844 void put_byte( const uint8_t b ) 845 { 846 buffer[pos] = b; 847 if( ++pos >= dictionary_size ) flush_data(); 848 } 850 public: 851 explicit LZ_decoder( const unsigned dict_size ) 852 : 853 partial_data_pos( 0 ), 854 dictionary_size( dict_size ), 855 buffer( new uint8_t[dictionary_size] ), 856 pos( 0 ), 857 stream_pos( 0 ), 858 crc_( 0xFFFFFFFFU ), 859 pos_wrapped( false ) 860 {} 862 ~LZ_decoder() { delete[] buffer; } 864 unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; } 865 unsigned long long data_position() const 866 { return partial_data_pos + pos; } 867 uint8_t get_byte() { return rdec.get_byte(); } 868 unsigned long long member_position() const 869 { return rdec.member_position(); } 871 bool decode_member(); 872 }; 874 void LZ_decoder::flush_data() 875 { 876 if( pos > stream_pos ) 877 { 878 const unsigned size = pos - stream_pos; 879 crc32.update_buf( crc_, buffer + stream_pos, size ); 880 if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) 881 { std::fprintf(stderr, "Write error: %s\n", std::strerror(errno)); 882 std::exit( 1 ); } 883 if( pos >= dictionary_size ) 884 { partial_data_pos += pos; pos = 0; pos_wrapped = true; } 885 stream_pos = pos; 886 } 887 } 889 bool LZ_decoder::decode_member() // Returns false if error 890 { 891 Bit_model bm_literal[1<> (8 - literal_context_bits); 917 Bit_model * const bm = bm_literal[literal_state]; 918 if( state.is_char() ) 919 put_byte( rdec.decode_tree( bm, 8 ) ); 920 else 921 put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); 922 state.set_char(); 923 continue; 924 } 925 // match or repeated match 926 int len; 927 if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit 928 { 929 if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit 930 { 931 if( rdec.decode_bit(bm_len[state()][pos_state]) == 0 )// 4th bit 932 { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; } 933 } 934 else 935 { 936 unsigned distance; 937 if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit 938 distance = rep1; 939 else 940 { 941 if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit 942 distance = rep2; 943 else 944 { distance = rep3; rep3 = rep2; } 945 rep2 = rep1; 946 } 947 rep1 = rep0; 948 rep0 = distance; 949 } 950 state.set_rep(); 951 len = min_match_len + rdec.decode_len( rep_len_model, pos_state ); 952 } 953 else // match 954 { 955 rep3 = rep2; rep2 = rep1; rep1 = rep0; 956 len = min_match_len + rdec.decode_len(match_len_model, pos_state); 957 const int len_state = std::min( len-min_match_len, len_states-1 ); 958 rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits ); 959 if( rep0 >= start_dis_model ) 960 { 961 const unsigned dis_slot = rep0; 962 const int direct_bits = ( dis_slot >> 1 ) - 1; 963 rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits; 964 if( dis_slot < end_dis_model ) 965 rep0 += rdec.decode_tree_reversed( bm_dis + (rep0 - dis_slot), 966 direct_bits ); 967 else 968 { 969 rep0 += 970 rdec.decode(direct_bits - dis_align_bits) << dis_align_bits; 971 rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits ); 972 if( rep0 == 0xFFFFFFFFU ) // marker found 973 { 974 flush_data(); 975 return ( len == min_match_len ); // End Of Stream marker 976 } 977 } 978 } 979 state.set_match(); 980 if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) ) 981 { flush_data(); return false; } 982 } 983 for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) ); 984 } 985 flush_data(); 986 return false; 987 } 989 int main( const int argc, const char * const argv[] ) 990 { 991 if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) ) 992 { 993 std::printf( 994 "Lzd %s - Educational decompressor for the lzip format.\n" 995 "Study the source to learn how a lzip decompressor works.\n" 996 "See the lzip manual for an explanation of the code.\n" 997 "\nUsage: %s [-d] < file.lz > file\n" 998 "Lzd decompresses from standard input to standard output.\n" 999 "\nCopyright (C) 2020 Antonio Diaz Diaz.\n" 1000 "This is free software: you are free to change and redistribute " 1001 "it.\nThere is NO WARRANTY, to the extent permitted by law.\n" 1002 "Report bugs to lzip-bug@nongnu.org\n" 1003 "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n", 1004 PROGVERSION, argv[0] ); 1005 return 0; 1006 } 1008 #if defined(__MSVCRT__) || defined(__OS2__) || defined(__DJGPP__) 1009 setmode( STDIN_FILENO, O_BINARY ); 1010 setmode( STDOUT_FILENO, O_BINARY ); 1011 #endif 1013 for( bool first_member = true; ; first_member = false ) 1014 { 1015 Lzip_header header; // verify header 1016 for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin ); 1017 if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01",5 ) != 0 ) 1018 { 1019 if( first_member ) 1020 { std::fputs( "Bad magic number (file not in lzip format).\n", 1021 stderr ); return 2; } 1022 break; // ignore trailing data 1023 } 1024 unsigned dict_size = 1 << ( header[5] & 0x1F ); 1025 dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 ); 1026 if( dict_sizemax_dictionary_size ) 1027 { std::fputs( "Invalid dictionary size in member header.\n", 1028 stderr ); return 2; } 1030 LZ_decoder decoder( dict_size ); // decode LZMA stream 1031 if( !decoder.decode_member() ) 1032 { std::fputs( "Data error\n", stderr ); return 2; } 1034 Lzip_trailer trailer; // verify trailer 1035 for( int i = 0; i < 20; ++i ) trailer[i] = decoder.get_byte(); 1036 int retval = 0; 1037 unsigned crc = 0; 1038 for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i]; 1039 if( crc != decoder.crc() ) 1040 { std::fputs( "CRC mismatch\n", stderr ); retval = 2; } 1042 unsigned long long data_size = 0; 1043 for( int i = 11; i >= 4; --i ) 1044 data_size = ( data_size << 8 ) + trailer[i]; 1045 if( data_size != decoder.data_position() ) 1046 { std::fputs( "Data size mismatch\n", stderr ); retval = 2; } 1048 unsigned long long member_size = 0; 1049 for( int i = 19; i >= 12; --i ) 1050 member_size = ( member_size << 8 ) + trailer[i]; 1051 if( member_size != decoder.member_position() ) 1052 { std::fputs( "Member size mismatch\n", stderr ); retval = 2; } 1053 if( retval ) return retval; 1054 } 1056 if( std::fclose( stdout ) != 0 ) 1057 { std::fprintf( stderr, "Error closing stdout: %s\n", 1058 std::strerror( errno ) ); return 1; } 1060 return 0; 1061 } 1062 1064 Acknowledgments 1066 The ideas embodied in lzip are due to (at least) the following 1067 people: Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey 1068 Markov (for the definition of Markov chains), G.N.N. Martin (for the 1069 definition of range encoding), and Igor Pavlov (for putting all the 1070 above together in LZMA). 1072 Author's Address 1074 Antonio Diaz Diaz 1075 GNU Project 1076 Email: antonio@gnu.org 1078 Expiration date: 2020-10-26