| < draft-diaz-lzip-04.txt | draft-diaz-lzip-05.txt > | |||
|---|---|---|---|---|
| Internet Engineering Task Force (IETF) A. Diaz | Internet Engineering Task Force (IETF) A. Diaz | |||
| INTERNET-DRAFT draft-diaz-lzip-04 GNU Project | INTERNET-DRAFT draft-diaz-lzip-05 GNU Project | |||
| Category: Informational October 2021 | Category: Informational April 2022 | |||
| Expiration date: 2022-04-26 | Expiration date: 2022-10-26 | |||
| Lzip Compressed Format and the 'application/lzip' Media Type | Lzip Compressed Format and the 'application/lzip' Media Type | |||
| Abstract | Abstract | |||
| Lzip is a lossless compressed data format designed for data sharing, | Lzip is a lossless compressed data format designed for data sharing, | |||
| long-term archiving, and parallel compression/decompression. Lzip | long-term archiving, and parallel compression/decompression. Lzip | |||
| uses a simplified form of the LZMA stream format and provides a 3 | uses a simplified form of the LZMA stream format and provides a 3 | |||
| factor integrity checking to maximize interoperability and optimize | factor integrity checking to maximize interoperability and optimize | |||
| safety. Lzip can achieve higher compression ratios than gzip. This | safety. Lzip can achieve higher compression ratios than gzip. This | |||
| skipping to change at page 2, line 7 | skipping to change at page 2, line 7 | |||
| Information about the current status of this document, any errata, | Information about the current status of this document, any errata, | |||
| and how to provide feedback on it may be obtained at | and how to provide feedback on it may be obtained at | |||
| http://www.rfc-editor.org/info/rfc<rfc-no>. | http://www.rfc-editor.org/info/rfc<rfc-no>. | |||
| Comments are solicited and should be addressed to the lzip's mailing | Comments are solicited and should be addressed to the lzip's mailing | |||
| list at lzip-bug@nongnu.org and/or the author. | list at lzip-bug@nongnu.org and/or the author. | |||
| Copyright Notice | Copyright Notice | |||
| Copyright (c) 2021 IETF Trust and the persons identified as the | Copyright (c) 2022 IETF Trust and the persons identified as the | |||
| document authors. All rights reserved. | document authors. All rights reserved. | |||
| This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
| Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
| (http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
| publication of this document. Please review these documents | publication of this document. Please review these documents | |||
| carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
| to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
| include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
| the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
| skipping to change at page 5, line 37 | skipping to change at page 5, line 37 | |||
| Total size of the member, including header and trailer. This | Total size of the member, including header and trailer. This | |||
| field acts as a distributed index, allows the verification of | field acts as a distributed index, allows the verification of | |||
| stream integrity, and facilitates the safe recovery of undamaged | stream integrity, and facilitates the safe recovery of undamaged | |||
| members from multimember files. Member size should be limited to | members from multimember files. Member size should be limited to | |||
| 2 PiB to prevent the data size field from overflowing. | 2 PiB to prevent the data size field from overflowing. | |||
| 3. Format of the LZMA stream in lzip files | 3. Format of the LZMA stream in lzip files | |||
| The LZMA algorithm has three parameters, called "special LZMA | The LZMA algorithm has three parameters, called "special LZMA | |||
| properties", to adjust it for some kinds of binary data. These | properties", to adjust it for some kinds of binary data. These | |||
| parameters are; 'literal_context_bits' (with a default value of 3), | parameters are: 'literal_context_bits' (with a default value of 3), | |||
| 'literal_pos_state_bits' (with a default value of 0), and | 'literal_pos_state_bits' (with a default value of 0), and | |||
| 'pos_state_bits' (with a default value of 2). As a general purpose | 'pos_state_bits' (with a default value of 2). As a general purpose | |||
| compressor, lzip only uses the default values for these parameters. | compressor, lzip only uses the default values for these parameters. | |||
| In particular 'literal_pos_state_bits' has been optimized away and | In particular 'literal_pos_state_bits' has been optimized away and | |||
| does not even appear in the code. | does not even appear in the code. | |||
| Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker | Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker | |||
| (the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with | (the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with | |||
| the 'member size' field in the member trailer allows the verification | the 'member size' field in the member trailer allows the verification | |||
| of stream integrity. The EOS marker is the only marker allowed in | of stream integrity. The EOS marker is the only marker allowed in | |||
| lzip files. The LZMA stream in lzip files always has these two | lzip files. The LZMA stream in lzip files always has these two | |||
| features (default properties and EOS marker) and is referred to in | features (default properties and EOS marker) and is referred to in | |||
| this document as LZMA-302eos. This simplified form of the LZMA | this document as LZMA-302eos. This simplified form of the LZMA | |||
| stream format has been chosen to maximize interoperability and | stream format has been chosen to maximize interoperability and | |||
| safety. | safety. | |||
| The second stage of LZMA is a range encoder that uses a different | The second stage of LZMA is a range encoder that uses a different | |||
| probability model for each type of symbol; distances, lengths, | probability model for each type of symbol: distances, lengths, | |||
| literal bytes, etc. Range encoding conceptually encodes all the | literal bytes, etc. Range encoding conceptually encodes all the | |||
| symbols of the message into one number. Unlike Huffman coding, which | symbols of the message into one number. Unlike Huffman coding, which | |||
| assigns to each symbol a bit-pattern and concatenates all the | assigns to each symbol a bit-pattern and concatenates all the | |||
| bit-patterns together, range encoding can compress one symbol to less | bit-patterns together, range encoding can compress one symbol to less | |||
| than one bit. Therefore the compressed data produced by a range | than one bit. Therefore the compressed data produced by a range | |||
| encoder can't be split in pieces that could be described | encoder can't be split in pieces that could be described | |||
| individually. | individually. | |||
| It seems that the only way of describing the LZMA-302eos stream is to | It seems that the only way of describing the LZMA-302eos stream is to | |||
| describe the algorithm that decodes it. And given the many details | describe the algorithm that decodes it. And given the many details | |||
| skipping to change at page 8, line 29 | skipping to change at page 8, line 29 | |||
| Value of the 2 least significant bits of the current position in | Value of the 2 least significant bits of the current position in | |||
| the decoded data. | the decoded data. | |||
| literal_state | literal_state | |||
| Value of the 3 most significant bits of the latest byte decoded. | Value of the 3 most significant bits of the latest byte decoded. | |||
| len_state | len_state | |||
| Coded value of the current match length (length - 2), with a | Coded value of the current match length (length - 2), with a | |||
| maximum of 3. The resulting value is in the range 0 to 3. | maximum of 3. The resulting value is in the range 0 to 3. | |||
| In the following table, '!literal' is any sequence except a literal | The types of previous sequences corresponding to each state are shown | |||
| | in the following table. '!literal' is any sequence except a literal | |||
| byte. 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The | byte. 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The | |||
| types of previous sequences corresponding to each state are: | last type in each line is the most recent. | |||
| State Types of previous sequences | State Types of previous sequences | |||
| ----- --------------------------------------------- | ----- --------------------------------------------- | |||
| 0 literal, literal, literal | 0 literal, literal, literal | |||
| 1 match, literal, literal | 1 match, literal, literal | |||
| 2 rep or (!literal, shortrep), literal, literal | 2 rep or (!literal, shortrep), literal, literal | |||
| 3 literal, shortrep, literal, literal | 3 literal, shortrep, literal, literal | |||
| 4 match, literal | 4 match, literal | |||
| 5 rep or (!literal, shortrep), literal | 5 rep or (!literal, shortrep), literal | |||
| 6 literal, shortrep, literal | 6 literal, shortrep, literal | |||
| skipping to change at page 10, line 12 | skipping to change at page 10, line 12 | |||
| first difference is found, the rest of the byte is decoded using the | first difference is found, the rest of the byte is decoded using the | |||
| normal bit tree context. (See 'decode_matched' in the source). | normal bit tree context. (See 'decode_matched' in the source). | |||
| 3.3. The range decoder | 3.3. The range decoder | |||
| The LZMA stream is consumed one byte at a time by the range decoder. | The LZMA stream is consumed one byte at a time by the range decoder. | |||
| (See 'normalize' in the source). Every byte consumed produces a | (See 'normalize' in the source). Every byte consumed produces a | |||
| variable number of decoded bits, depending on how well these bits | variable number of decoded bits, depending on how well these bits | |||
| agree with their context. (See 'decode_bit' in the source). | agree with their context. (See 'decode_bit' in the source). | |||
| The range decoder state consists of two unsigned 32-bit variables; | The range decoder state consists of two unsigned 32-bit variables: | |||
| 'range' (representing the most significant part of the range size not | 'range' (representing the most significant part of the range size not | |||
| yet decoded), and 'code' (representing the current point within | yet decoded) and 'code' (representing the current point within | |||
| 'range'). 'range' is initialized to 2^32 - 1, and 'code' is | 'range'). 'range' is initialized to 2^32 - 1, and 'code' is | |||
| initialized to 0. | initialized to 0. | |||
| The range encoder produces a first 0 byte that must be ignored by the | The range encoder produces a first 0 byte that must be ignored by the | |||
| range decoder. This is done by shifting 5 bytes in the | range decoder. This is done by shifting 5 bytes in the | |||
| initialization of 'code' instead of 4. (See the 'Range_decoder' | initialization of 'code' instead of 4. (See the 'Range_decoder' | |||
| constructor in the source). | constructor in the source). | |||
| 3.4. Decoding and verifying the LZMA stream | 3.4. Decoding and verifying the LZMA stream | |||
| skipping to change at page 14, line 9 | skipping to change at page 14, line 9 | |||
| 6.2. Informative References | 6.2. Informative References | |||
| [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", | [RFC1952] Deutsch, P., "GZIP file format specification version 4.3", | |||
| RFC 1952, DOI 10.17487/RFC1952, May 1996, | RFC 1952, DOI 10.17487/RFC1952, May 1996, | |||
| <http://www.rfc-editor.org/info/rfc1952>. | <http://www.rfc-editor.org/info/rfc1952>. | |||
| Appendix A. Reference Source Code | Appendix A. Reference Source Code | |||
| <CODE BEGINS> | <CODE BEGINS> | |||
| /* Lzd - Educational decompressor for the lzip format | /* Lzd - Educational decompressor for the lzip format | |||
| Copyright (C) 2013-2021 Antonio Diaz Diaz. | Copyright (C) 2013-2022 Antonio Diaz Diaz. | |||
| This program is free software. Redistribution and use in source and | This program is free software. Redistribution and use in source and | |||
| binary forms, with or without modification, are permitted provided | binary forms, with or without modification, are permitted provided | |||
| that the following conditions are met: | that the following conditions are met: | |||
| 1. Redistributions of source code must retain the above copyright | 1. Redistributions of source code must retain the above copyright | |||
| notice, this list of conditions, and the following disclaimer. | notice, this list of conditions, and the following disclaimer. | |||
| 2. Redistributions in binary form must reproduce the above copyright | 2. Redistributions in binary form must reproduce the above copyright | |||
| notice, this list of conditions, and the following disclaimer in the | notice, this list of conditions, and the following disclaimer in the | |||
| skipping to change at page 22, line 21 | skipping to change at page 22, line 21 | |||
| int main( const int argc, const char * const argv[] ) | int main( const int argc, const char * const argv[] ) | |||
| { | { | |||
| if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) ) | if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) ) | |||
| { | { | |||
| std::printf( | std::printf( | |||
| "Lzd %s - Educational decompressor for the lzip format.\n" | "Lzd %s - Educational decompressor for the lzip format.\n" | |||
| "Study the source to learn how a lzip decompressor works.\n" | "Study the source to learn how a lzip decompressor works.\n" | |||
| "See the lzip manual for an explanation of the code.\n" | "See the lzip manual for an explanation of the code.\n" | |||
| "\nUsage: %s [-d] < file.lz > file\n" | "\nUsage: %s [-d] < file.lz > file\n" | |||
| "Lzd decompresses from standard input to standard output.\n" | "Lzd decompresses from standard input to standard output.\n" | |||
| "\nCopyright (C) 2021 Antonio Diaz Diaz.\n" | "\nCopyright (C) 2022 Antonio Diaz Diaz.\n" | |||
| "License 2-clause BSD.\n" | "License 2-clause BSD.\n" | |||
| "This is free software: you are free to change and redistribute " | "This is free software: you are free to change and redistribute " | |||
| "it.\nThere is NO WARRANTY, to the extent permitted by law.\n" | "it.\nThere is NO WARRANTY, to the extent permitted by law.\n" | |||
| "Report bugs to lzip-bug@nongnu.org\n" | "Report bugs to lzip-bug@nongnu.org\n" | |||
| "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n", | "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n", | |||
| PROGVERSION, argv[0] ); | PROGVERSION, argv[0] ); | |||
| return 0; | return 0; | |||
| } | } | |||
| #if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ | #if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ | |||
| skipping to change at page 23, line 49 | skipping to change at page 23, line 49 | |||
| Markov (for the definition of Markov chains), G.N.N. Martin (for the | Markov (for the definition of Markov chains), G.N.N. Martin (for the | |||
| definition of range encoding), and Igor Pavlov (for putting all the | definition of range encoding), and Igor Pavlov (for putting all the | |||
| above together in LZMA). | above together in LZMA). | |||
| Author's Address | Author's Address | |||
| Antonio Diaz Diaz | Antonio Diaz Diaz | |||
| GNU Project | GNU Project | |||
| Email: antonio@gnu.org | Email: antonio@gnu.org | |||
| Expiration date: 2022-04-26 | Expiration date: 2022-10-26 | |||
| End of changes. 11 change blocks. | ||||
| 12 lines changed or deleted | 13 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ | ||||
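
The context values and the range decoder initialization quoted in the diff above (the definitions of 'pos_state', 'literal_state', and 'len_state', and Section 3.3) can be illustrated with a short C++ sketch. This is not the Lzd reference code. The names `pos_state_of`, `literal_state_of`, `len_state_of`, and `Range_decoder_sketch` are hypothetical, the input is taken from a byte vector rather than standard input, and the 0xFF padding returned past the end of the input is an assumption of this sketch only.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Value of the 2 least significant bits of the current position
// in the decoded data (pos_state_bits = 2).
inline int pos_state_of( const uint64_t position )
  { return static_cast< int >( position & 3 ); }

// Value of the 3 most significant bits of the latest byte decoded
// (literal_context_bits = 3).
inline int literal_state_of( const uint8_t prev_byte )
  { return prev_byte >> 5; }

// Coded match length (length - 2), capped at 3, so the result is 0 to 3.
// Assumes len >= 2, the minimum match length implied by "length - 2".
inline int len_state_of( const int len )
  { return std::min( len - 2, 3 ); }

// Hypothetical stand-in for the range decoder state described in Section 3.3.
class Range_decoder_sketch
  {
  const std::vector< uint8_t > & data;
  std::size_t pos;
  uint32_t code_;
  uint32_t range_;

  // Assumption of this sketch: pad with 0xFF once the input is exhausted.
  uint8_t get_byte()
    { return ( pos < data.size() ) ? data[pos++] : 0xFF; }

public:
  explicit Range_decoder_sketch( const std::vector< uint8_t > & d )
    : data( d ), pos( 0 ), code_( 0 ), range_( 0xFFFFFFFFU )  // 2^32 - 1
    {
    // Shift in 5 bytes instead of 4. The first byte produced by the range
    // encoder is shifted out of the 32-bit variable and thereby ignored.
    for( int i = 0; i < 5; ++i ) code_ = ( code_ << 8 ) | get_byte();
    }

  uint32_t code() const { return code_; }
  uint32_t range() const { return range_; }
  std::size_t bytes_consumed() const { return pos; }
  };
```

For example, constructing a `Range_decoder_sketch` over the bytes 0x00 0x12 0x34 0x56 0x78 leaves 'code' holding 0x12345678 and 'range' holding 2^32 - 1, with five bytes consumed; the leading byte has no effect on the state.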