draft-diaz-lzip-04.txt   draft-diaz-lzip-05.txt
Internet Engineering Task Force (IETF) A. Diaz Internet Engineering Task Force (IETF) A. Diaz
INTERNET-DRAFT draft-diaz-lzip-04 GNU Project INTERNET-DRAFT draft-diaz-lzip-05 GNU Project
Category: Informational October 2021 Category: Informational April 2022
Expiration date: 2022-04-26 Expiration date: 2022-10-26
Lzip Compressed Format and the 'application/lzip' Media Type Lzip Compressed Format and the 'application/lzip' Media Type
Abstract Abstract
Lzip is a lossless compressed data format designed for data sharing, Lzip is a lossless compressed data format designed for data sharing,
long-term archiving, and parallel compression/decompression. Lzip long-term archiving, and parallel compression/decompression. Lzip
uses a simplified form of the LZMA stream format and provides a 3 uses a simplified form of the LZMA stream format and provides a 3
factor integrity checking to maximize interoperability and optimize factor integrity checking to maximize interoperability and optimize
safety. Lzip can achieve higher compression ratios than gzip. This safety. Lzip can achieve higher compression ratios than gzip. This
skipping to change at page 2, line 7 skipping to change at page 2, line 7
Information about the current status of this document, any errata, Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc<rfc-no>. http://www.rfc-editor.org/info/rfc<rfc-no>.
Comments are solicited and should be addressed to the lzip's mailing Comments are solicited and should be addressed to the lzip's mailing
list at lzip-bug@nongnu.org and/or the author. list at lzip-bug@nongnu.org and/or the author.
Copyright Notice Copyright Notice
Copyright (c) 2021 IETF Trust and the persons identified as the Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved. document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of (http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as the Trust Legal Provisions and are provided without warranty as
skipping to change at page 5, line 37 skipping to change at page 5, line 37
Total size of the member, including header and trailer. This Total size of the member, including header and trailer. This
field acts as a distributed index, allows the verification of field acts as a distributed index, allows the verification of
stream integrity, and facilitates the safe recovery of undamaged stream integrity, and facilitates the safe recovery of undamaged
members from multimember files. Member size should be limited to members from multimember files. Member size should be limited to
2 PiB to prevent the data size field from overflowing. 2 PiB to prevent the data size field from overflowing.
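As an illustration of how the 'member size' field can support integrity verification, the following C++ sketch reads the field and compares it with the number of bytes actually consumed for the member. It assumes a 20-byte trailer whose last 8 bytes hold the member size in little-endian order; that layout is not shown in this excerpt. The names 'read_le64' and 'member_size_ok' are hypothetical and are not taken from the reference source.

```cpp
#include <stdint.h>

// Assumed layout (not stated in the text above): 20-byte trailer with
// the member size stored in bytes 12..19, least significant byte first.
enum { trailer_size = 20, member_size_offset = 12 };

// Decode an unsigned 64-bit little-endian value from 8 bytes.
inline uint64_t read_le64( const uint8_t * const p )
  {
  uint64_t value = 0;
  for( int i = 7; i >= 0; --i ) value = ( value << 8 ) | p[i];
  return value;
  }

// Return true if the size stored in the trailer matches the number of
// bytes consumed for this member (header + LZMA stream + trailer).
inline bool member_size_ok( const uint8_t * const trailer,
                            const uint64_t bytes_consumed )
  {
  return read_le64( trailer + member_size_offset ) == bytes_consumed;
  }
```

A decoder that counts the bytes it reads for each member could call 'member_size_ok' after reading the trailer and report a corrupt member when the check fails.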
3. Format of the LZMA stream in lzip files 3. Format of the LZMA stream in lzip files
The LZMA algorithm has three parameters, called "special LZMA The LZMA algorithm has three parameters, called "special LZMA
properties", to adjust it for some kinds of binary data. These properties", to adjust it for some kinds of binary data. These
parameters are; 'literal_context_bits' (with a default value of 3), parameters are: 'literal_context_bits' (with a default value of 3),
'literal_pos_state_bits' (with a default value of 0), and 'literal_pos_state_bits' (with a default value of 0), and
'pos_state_bits' (with a default value of 2). As a general purpose 'pos_state_bits' (with a default value of 2). As a general purpose
compressor, lzip only uses the default values for these parameters. compressor, lzip only uses the default values for these parameters.
In particular 'literal_pos_state_bits' has been optimized away and In particular 'literal_pos_state_bits' has been optimized away and
does not even appear in the code. does not even appear in the code.
Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker
(the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with (the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with
the 'member size' field in the member trailer allows the verification the 'member size' field in the member trailer allows the verification
of stream integrity. The EOS marker is the only marker allowed in of stream integrity. The EOS marker is the only marker allowed in
lzip files. The LZMA stream in lzip files always has these two lzip files. The LZMA stream in lzip files always has these two
features (default properties and EOS marker) and is referred to in features (default properties and EOS marker) and is referred to in
this document as LZMA-302eos. This simplified form of the LZMA this document as LZMA-302eos. This simplified form of the LZMA
stream format has been chosen to maximize interoperability and stream format has been chosen to maximize interoperability and
safety. safety.
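A minimal sketch of how a decoder might recognize the EOS marker described above. Only the pair (0xFFFFFFFFU, 2) comes from the text; the function names are hypothetical, and the way the reference decoder represents distances internally may differ.

```cpp
#include <stdint.h>

enum { min_match_length = 2 };	// length carried by the EOS marker

// Return true if a decoded distance-length pair is the "End Of Stream"
// marker (distance 0xFFFFFFFF, length 2).
inline bool is_eos_marker( const uint32_t distance, const int length )
  {
  return distance == 0xFFFFFFFFU && length == min_match_length;
  }

// Return true if the pair uses the marker distance but is not the EOS
// marker. Since EOS is the only marker allowed in lzip files, a decoder
// would treat such a pair as an error.
inline bool is_invalid_marker( const uint32_t distance, const int length )
  {
  return distance == 0xFFFFFFFFU && length != min_match_length;
  }
```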
The second stage of LZMA is a range encoder that uses a different The second stage of LZMA is a range encoder that uses a different
probability model for each type of symbol; distances, lengths, probability model for each type of symbol: distances, lengths,
literal bytes, etc. Range encoding conceptually encodes all the literal bytes, etc. Range encoding conceptually encodes all the
symbols of the message into one number. Unlike Huffman coding, which symbols of the message into one number. Unlike Huffman coding, which
assigns to each symbol a bit-pattern and concatenates all the assigns to each symbol a bit-pattern and concatenates all the
bit-patterns together, range encoding can compress one symbol to less bit-patterns together, range encoding can compress one symbol to less
than one bit. Therefore the compressed data produced by a range than one bit. Therefore the compressed data produced by a range
encoder can't be split in pieces that could be described encoder can't be split in pieces that could be described
individually. individually.
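To make the "less than one bit" claim concrete, the following sketch computes the ideal cost of coding one symbol from its modeled probability. It uses the standard information-theoretic bound, -log2(p), for illustration only; it is not part of the lzip format, and the name 'ideal_bit_cost' is hypothetical.

```cpp
#include <cmath>

// Ideal cost in bits of coding a symbol whose modeled probability is p
// (0 < p <= 1). A range coder approaches this cost; a Huffman code can
// never spend less than one whole bit per symbol.
inline double ideal_bit_cost( const double p )
  {
  return -std::log2( p );
  }
```

A bit predicted with probability 0.5 costs exactly 1 bit, while a bit predicted with probability 0.9 costs about 0.15 bits. Fractional costs like this are why the output of a range encoder cannot be divided into per-symbol pieces.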
It seems that the only way of describing the LZMA-302eos stream is to It seems that the only way of describing the LZMA-302eos stream is to
describe the algorithm that decodes it. And given the many details describe the algorithm that decodes it. And given the many details
skipping to change at page 8, line 29 skipping to change at page 8, line 29
Value of the 2 least significant bits of the current position in Value of the 2 least significant bits of the current position in
the decoded data. the decoded data.
literal_state literal_state
Value of the 3 most significant bits of the latest byte decoded. Value of the 3 most significant bits of the latest byte decoded.
len_state len_state
Coded value of the current match length (length - 2), with a Coded value of the current match length (length - 2), with a
maximum of 3. The resulting value is in the range 0 to 3. maximum of 3. The resulting value is in the range 0 to 3.
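The three context values defined above can be sketched in C++ as follows. The function names mirror the context names used in the text; the constants 'min_match_length' and 'len_states' and the function signatures are assumptions made for this example.

```cpp
#include <algorithm>
#include <stdint.h>

enum { min_match_length = 2, len_states = 4 };

// Value of the 2 least significant bits of the current position in the
// decoded data.
inline int pos_state( const unsigned long long data_position )
  {
  return (int)( data_position & 3 );
  }

// Value of the 3 most significant bits of the latest byte decoded.
inline int literal_state( const uint8_t prev_byte )
  {
  return prev_byte >> 5;
  }

// Coded value of the match length (length - 2), with a maximum of 3.
inline int len_state( const int length )
  {
  return std::min( length - min_match_length, len_states - 1 );
  }
```

For example, a match of length 2 gives a 'len_state' of 0, and every match of length 5 or more gives 3.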
In the following table, '!literal' is any sequence except a literal The types of previous sequences corresponding to each state are shown
in the following table. '!literal' is any sequence except a literal
byte. 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The byte. 'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The
types of previous sequences corresponding to each state are: last type in each line is the most recent.
State Types of previous sequences State Types of previous sequences
----- --------------------------------------------- ----- ---------------------------------------------
0 literal, literal, literal 0 literal, literal, literal
1 match, literal, literal 1 match, literal, literal
2 rep or (!literal, shortrep), literal, literal 2 rep or (!literal, shortrep), literal, literal
3 literal, shortrep, literal, literal 3 literal, shortrep, literal, literal
4 match, literal 4 match, literal
5 rep or (!literal, shortrep), literal 5 rep or (!literal, shortrep), literal
6 literal, shortrep, literal 6 literal, shortrep, literal
skipping to change at page 10, line 12 skipping to change at page 10, line 12
first difference is found, the rest of the byte is decoded using the first difference is found, the rest of the byte is decoded using the
normal bit tree context. (See 'decode_matched' in the source). normal bit tree context. (See 'decode_matched' in the source).
3.3. The range decoder 3.3. The range decoder
The LZMA stream is consumed one byte at a time by the range decoder. The LZMA stream is consumed one byte at a time by the range decoder.
(See 'normalize' in the source). Every byte consumed produces a (See 'normalize' in the source). Every byte consumed produces a
variable number of decoded bits, depending on how well these bits variable number of decoded bits, depending on how well these bits
agree with their context. (See 'decode_bit' in the source). agree with their context. (See 'decode_bit' in the source).
The range decoder state consists of two unsigned 32-bit variables; The range decoder state consists of two unsigned 32-bit variables:
'range' (representing the most significant part of the range size not 'range' (representing the most significant part of the range size not
yet decoded), and 'code' (representing the current point within yet decoded) and 'code' (representing the current point within
'range'). 'range' is initialized to 2^32 - 1, and 'code' is 'range'). 'range' is initialized to 2^32 - 1, and 'code' is
initialized to 0. initialized to 0.
The range encoder produces a first 0 byte that must be ignored by the The range encoder produces a first 0 byte that must be ignored by the
range decoder. This is done by shifting 5 bytes in the range decoder. This is done by shifting 5 bytes in the
initialization of 'code' instead of 4. (See the 'Range_decoder' initialization of 'code' instead of 4. (See the 'Range_decoder'
constructor in the source). constructor in the source).
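The initialization described above can be sketched as follows. This is a simplified illustration that reads from a memory buffer instead of standard input; the class name, the buffer interface, and the accessor functions are hypothetical. The threshold used in 'normalize' is an assumption based on the usual LZMA range decoder and is not stated in this excerpt.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical range decoder reading from a memory buffer.
// Bytes past the end of the buffer read as 0.
class Range_decoder_sketch
  {
  const uint8_t * const buffer;
  const size_t size;
  size_t pos;
  uint32_t code;
  uint32_t range;

  uint8_t get_byte() { return ( pos < size ) ? buffer[pos++] : 0; }

public:
  Range_decoder_sketch( const uint8_t * const buf, const size_t sz )
    : buffer( buf ), size( sz ), pos( 0 ), code( 0 ), range( 0xFFFFFFFFU )
    {
    // Shift in 5 bytes. 'code' is 32 bits wide, so the first byte
    // (the 0 byte produced by the encoder) is shifted out and ignored.
    for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte();
    }

  // Consume one more byte when the top 8 bits of 'range' are exhausted
  // (assumed threshold, see the lead-in).
  void normalize()
    {
    if( range <= 0x00FFFFFFU )
      { range <<= 8; code = ( code << 8 ) | get_byte(); }
    }

  uint32_t get_code() const { return code; }
  uint32_t get_range() const { return range; }
  size_t bytes_consumed() const { return pos; }
  };
```

Constructing the sketch over the bytes 00 12 34 56 78 leaves 'code' equal to 0x12345678 after consuming 5 bytes, which shows how the leading 0 byte is dropped without a separate skip step.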
3.4. Decoding and verifying the LZMA stream 3.4. Decoding and verifying the LZMA stream
skipping to change at page 14, line 9 skipping to change at page 14, line 9
6.2. Informative References 6.2. Informative References
[RFC1952] Deutsch, P., "GZIP file format specification version 4.3", [RFC1952] Deutsch, P., "GZIP file format specification version 4.3",
RFC 1952, DOI 10.17487/RFC1952, May 1996, RFC 1952, DOI 10.17487/RFC1952, May 1996,
<http://www.rfc-editor.org/info/rfc1952>. <http://www.rfc-editor.org/info/rfc1952>.
Appendix A. Reference Source Code Appendix A. Reference Source Code
<CODE BEGINS> <CODE BEGINS>
/* Lzd - Educational decompressor for the lzip format /* Lzd - Educational decompressor for the lzip format
Copyright (C) 2013-2021 Antonio Diaz Diaz. Copyright (C) 2013-2022 Antonio Diaz Diaz.
This program is free software. Redistribution and use in source and This program is free software. Redistribution and use in source and
binary forms, with or without modification, are permitted provided binary forms, with or without modification, are permitted provided
that the following conditions are met: that the following conditions are met:
1. Redistributions of source code must retain the above copyright 1. Redistributions of source code must retain the above copyright
notice, this list of conditions, and the following disclaimer. notice, this list of conditions, and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright 2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions, and the following disclaimer in the notice, this list of conditions, and the following disclaimer in the
skipping to change at page 22, line 21 skipping to change at page 22, line 21
int main( const int argc, const char * const argv[] ) int main( const int argc, const char * const argv[] )
{ {
if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) ) if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) )
{ {
std::printf( std::printf(
"Lzd %s - Educational decompressor for the lzip format.\n" "Lzd %s - Educational decompressor for the lzip format.\n"
"Study the source to learn how a lzip decompressor works.\n" "Study the source to learn how a lzip decompressor works.\n"
"See the lzip manual for an explanation of the code.\n" "See the lzip manual for an explanation of the code.\n"
"\nUsage: %s [-d] < file.lz > file\n" "\nUsage: %s [-d] < file.lz > file\n"
"Lzd decompresses from standard input to standard output.\n" "Lzd decompresses from standard input to standard output.\n"
"\nCopyright (C) 2021 Antonio Diaz Diaz.\n" "\nCopyright (C) 2022 Antonio Diaz Diaz.\n"
"License 2-clause BSD.\n" "License 2-clause BSD.\n"
"This is free software: you are free to change and redistribute " "This is free software: you are free to change and redistribute "
"it.\nThere is NO WARRANTY, to the extent permitted by law.\n" "it.\nThere is NO WARRANTY, to the extent permitted by law.\n"
"Report bugs to lzip-bug@nongnu.org\n" "Report bugs to lzip-bug@nongnu.org\n"
"Lzd home page: http://www.nongnu.org/lzip/lzd.html\n", "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n",
PROGVERSION, argv[0] ); PROGVERSION, argv[0] );
return 0; return 0;
} }
#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ #if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
skipping to change at page 23, line 49 skipping to change at page 23, line 49
Markov (for the definition of Markov chains), G.N.N. Martin (for the Markov (for the definition of Markov chains), G.N.N. Martin (for the
definition of range encoding), and Igor Pavlov (for putting all the definition of range encoding), and Igor Pavlov (for putting all the
above together in LZMA). above together in LZMA).
Author's Address Author's Address
Antonio Diaz Diaz Antonio Diaz Diaz
GNU Project GNU Project
Email: antonio@gnu.org Email: antonio@gnu.org
Expiration date: 2022-04-26 Expiration date: 2022-10-26
 End of changes. 11 change blocks. 
12 lines changed or deleted 13 lines changed or added

This html diff was produced by rfcdiff 1.48.