idnits 2.17.1 

draft-deutsch-deflate-spec-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-24) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 14
     longer pages, the longest (page 2) being 59 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (12 Feb 1996) is 10299 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'I' on line 351

  -- Looks like a reference, but probably isn't: 'N' on line 392

  == Missing Reference: '0' is mentioned on line 360, but not defined

  == Unused Reference: '4' is defined on line 722, but no explicit reference
     was found in the text

  == Unused Reference: '5' is defined on line 725, but no explicit reference
     was found in the text

  == Unused Reference: '6' is defined on line 728, but no explicit reference
     was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  -- Possible downref: Non-RFC (?) normative reference: ref. '4'

  -- Possible downref: Non-RFC (?) normative reference: ref. '5'

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'


     Summary: 8 errors (**), 0 flaws (~~), 6 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	INTERNET-DRAFT                                             L. P. Deutsch
3	DEFLATE 1.3                                          Aladdin Enterprises
4	Expires: 17 Aug 1996                                         12 Feb 1996

6	        DEFLATE Compressed Data Format Specification version 1.3

8	File draft-deutsch-deflate-spec-01.txt

10	Status of this Memo

12	   This document is an Internet-Draft.  Internet-Drafts are working
13	   documents of the Internet Engineering Task Force (IETF), its areas,
14	   and its working groups.  Note that other groups may also distribute
15	   working documents as Internet-Drafts.

17	   Internet-Drafts are draft documents valid for a maximum of six months
18	   and may be updated, replaced, or obsoleted by other documents at any
19	   time.  It is inappropriate to use Internet-Drafts as reference
20	   material or to cite them other than as ``work in progress.''

22	   To learn the current status of any Internet-Draft, please check the
23	   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
24	   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
25	   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
26	   ftp.isi.edu (US West Coast).

28	   Distribution of this memo is unlimited.

30	 Notices

32	   Copyright (C) 1996 L. Peter Deutsch

34	   Permission is granted to copy and distribute this document for any
35	   purpose and without charge, including translations into other
36	   languages and incorporation into compilations, provided that it is
37	   copied as a whole (including the copyright notice and this notice)
38	   and with no changes.

40	Abstract

42	   This specification defines a lossless compressed data format that
43	   compresses data using a combination of the LZ77 algorithm and Huffman
44	   coding, with efficiency comparable to the best currently available
45	   general-purpose compression methods.  The data can be produced or
46	   consumed, even for an arbitrarily long sequentially presented input
47	   data stream, using only an a priori bounded amount of intermediate
48	   storage.  The format can be implemented readily in a manner not
49	   covered by patents.

51	Deutsch                                                        [Page  1]
52	Table of Contents

54	   1. Introduction ................................................... 2
55	      1.1 Purpose .................................................... 2
56	      1.2 Intended audience .......................................... 3
57	      1.3 Scope ...................................................... 3
58	      1.4 Compliance ................................................. 3
59	      1.5  Definitions of terms and conventions used ................. 3
60	      1.6 Changes from previous versions ............................. 4
61	   2. Compressed representation overview ............................. 4
62	   3. Detailed specification ......................................... 4
63	      3.1 Overall conventions ........................................ 4
64	          3.1.1. Packing into bytes .................................. 5
65	      3.2 Compressed block format .................................... 6
66	          3.2.1. Synopsis of prefix and Huffman coding ............... 6
67	          3.2.2. Use of Huffman coding in the 'deflate' format ....... 7
68	          3.2.3. Details of block format ............................. 8
69	          3.2.4. Non-compressed blocks (BTYPE=00) ................... 10
70	          3.2.5. Compressed blocks (length and distance codes) ...... 10
71	          3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 11
72	          3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 11
73	      3.3 Compliance ................................................ 13
74	   4. Compression algorithm details ................................. 13
75	   5. References .................................................... 14
76	   6. Security considerations ....................................... 14
77	   7. Source code ................................................... 15
78	   8. Acknowledgements .............................................. 15
79	   9. Author's address .............................................. 15

81	1. Introduction

83	   1.1. Purpose

85	      The purpose of this specification is to define a lossless
86	      compressed data format that:

88	          o Is independent of CPU type, operating system, file system,
89	            and character set, and hence can be used for interchange;
90	          o Can be produced or consumed, even for an arbitrarily long
91	            sequentially presented input data stream, using only an a
92	            priori bounded amount of intermediate storage, and hence can
93	            be used in data communications or similar structures such as
94	            Unix filters;
95	          o Compresses data with efficiency comparable to the best
96	            currently available general-purpose compression methods, and
97	            in particular considerably better than the 'compress'
98	            program;
99	          o Can be implemented readily in a manner not covered by
100	            patents, and hence can be practiced freely;
101	          o Is compatible with the file format produced by the current
102	            widely used gzip utility, in that conforming decompressors
103	            will be able to read data produced by the existing gzip
106	            compressor.

108	      The data format defined by this specification does not attempt to:

110	          o Allow random access to compressed data;
111	          o Compress specialized data (e.g., raster graphics) as well as
112	            the best currently available specialized algorithms.

114	      A simple counting argument shows that no lossless compression
115	      algorithm can compress every possible input data set.  For the
116	      format defined here, the worst case expansion is 5 bytes per 32K-
117	      byte block, i.e., a size increase of 0.015% for large data sets.
118	      English text usually compresses by a factor of 2.5 to 3;
119	      executable files usually compress somewhat less; graphical data
120	      such as raster images may compress much more.

122	   1.2. Intended audience

124	      This specification is intended for use by implementors of software
125	      to compress data into 'deflate' format and/or decompress data from
126	      'deflate' format.

128	      The text of the specification assumes a basic background in
129	      programming at the level of bits and other primitive data
130	      representations.  Familiarity with the technique of Huffman coding
131	      is helpful but not required.

133	   1.3. Scope

135	      The specification specifies a method for representing a sequence
136	      of bytes as a (usually shorter) sequence of bits, and a method for
137	      packing the latter bit sequence into bytes.

139	   1.4. Compliance

141	      Unless otherwise indicated below, a compliant decompressor must be
142	      able to accept and decompress any data set that conforms to all
143	      the specifications presented here; a compliant compressor must
144	      produce data sets that conform to all the specifications presented
145	      here.

147	   1.5.  Definitions of terms and conventions used

149	      byte: 8 bits stored or transmitted as a unit (same as an octet).
150	      (For this specification, a byte is exactly 8 bits, even on
151	      machines which store a character on a number of bits different
152	      from 8.) See Section 3.1, below, for the numbering of bits within
153	      a byte.

155	      string: a sequence of arbitrary bytes.

157	   1.6. Changes from previous versions

160	      There have been no technical changes to the deflate format since
161	      version 1.1 of this specification.  In version 1.2, some
162	      terminology was changed.  Version 1.3 is a conversion of the
163	      specification to Internet Draft style.

165	2. Compressed representation overview

167	   A compressed data set consists of a series of blocks, corresponding
168	   to successive blocks of input data.  The block sizes are arbitrary,
169	   except that non-compressible blocks are limited to 65,535 bytes.

171	   Each block is compressed using a combination of the LZ77 algorithm
172	   and Huffman coding. The Huffman trees for each block are independent
173	   of those for previous or subsequent blocks; the LZ77 algorithm may
174	   use a reference to a duplicated string occurring in a previous block,
175	   up to 32K input bytes before.

177	   Each block consists of two parts: a pair of Huffman code trees that
178	   describe the representation of the compressed data part, and a
179	   compressed data part.  (The Huffman trees themselves are compressed
180	   using Huffman encoding.)  The compressed data consists of a series of
181	   elements of two types: literal bytes (of strings that have not been
182	   detected as duplicated within the previous 32K input bytes), and
183	   pointers to duplicated strings, where a pointer is represented as a
184	   pair <length, backward distance>.  The representation used in the
185	   'deflate' format limits distances to 32K bytes and lengths to 258
186	   bytes, but does not limit the size of a block, except for
187	   uncompressible blocks, which are limited as noted above.

189	   Each type of value (literals, distances, and lengths) in the
190	   compressed data is represented using a Huffman code, using one code
191	   tree for literals and lengths and a separate code tree for distances.
192	   The code trees for each block appear in a compact form just before
193	   the compressed data for that block.

195	3. Detailed specification

197	   3.1. Overall conventions

	      In the diagrams below, a box like this:

199	         +---+
200	         |   | <-- the vertical bars might be missing
201	         +---+

203	      represents one byte; a box like this:

205	         +==============+
206	         |              |
207	         +==============+

209	      represents a variable number of bytes.

211	      Bytes stored within a computer do not have a 'bit order', since
214	      they are always treated as a unit.  However, a byte considered as
215	      an integer between 0 and 255 does have a most- and least-
216	      significant bit, and since we write numbers with the most-
217	      significant digit on the left, we also write bytes with the most-
218	      significant bit on the left.  In the diagrams below, we number the
219	      bits of a byte so that bit 0 is the least-significant bit, i.e.,
220	      the bits are numbered:

222	         +--------+
223	         |76543210|
224	         +--------+

226	      Within a computer, a number may occupy multiple bytes.  All
227	      multi-byte numbers in the format described here are stored with
228	      the least-significant byte first (at the lower memory address).
229	      For example, the decimal number 520 is stored as:

231	             0        1
232	         +--------+--------+
233	         |00001000|00000010|
234	         +--------+--------+
235	          ^        ^
236	          |        |
237	          |        + more significant byte = 2 x 256
238	          + less significant byte = 8
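
	      As a minimal sketch of the byte order described above, the C
	      helper below reads a two-byte number stored least-significant
	      byte first.  The name get_u16_le is an assumption made for
	      illustration; it does not appear in this specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit number stored least-significant byte first.
 * get_u16_le is a hypothetical helper name, not part of the format. */
static uint16_t get_u16_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```

	      Applied to the two bytes 00001000 and 00000010 from the diagram,
	      this helper yields 8 + 2 x 256 = 520.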

240	      3.1.1. Packing into bytes

242	         This document does not address the issue of the order in which
243	         bits of a byte are transmitted on a bit-sequential medium,
244	         since the final data format described here is byte- rather than
245	         bit-oriented.  However, we describe the compressed block format
246	         in Section 3.2, below, as a sequence of data elements of
247	         various bit lengths, not a sequence of bytes.  We must
248	         therefore specify how to pack these data elements into bytes to
249	         form the final compressed byte sequence:

251	             o Data elements are packed into bytes in order of
252	               increasing bit number within the byte, i.e., starting
253	               with the least-significant bit of the byte.
254	             o Data elements other than Huffman codes are packed
255	               starting with the least-significant bit of the data
256	               element.
257	             o Huffman codes are packed starting with the most-
258	               significant bit of the code.

260	         In other words, if one were to print out the compressed data as
261	         a sequence of bytes, starting with the first byte at the
262	         *right* margin and proceeding to the *left*, with the most-
263	         significant bit of each byte on the left as usual, one would be
264	         able to parse the result from right to left, with fixed-width
265	         elements in the correct MSB-to-LSB order and Huffman codes in
268	         bit-reversed order (i.e., with the first bit of the code in the
269	         relative LSB position).
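
	         The packing rules above can be sketched as a small bit reader
	         in C.  The struct bitreader type and the function names are
	         assumptions made for illustration, and the sketch reads from
	         an in-memory buffer rather than a stream.  It shows only the
	         bit order: fixed-width elements arrive least-significant bit
	         first, while the bits of a Huffman code arrive
	         most-significant bit first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader over an in-memory buffer. */
struct bitreader {
    const uint8_t *data;
    size_t len;   /* number of bytes in data */
    size_t pos;   /* bit position counted from the start of data */
};

/* Return the next bit; bit 0 (the LSB) of each byte comes first. */
static unsigned next_bit(struct bitreader *br)
{
    assert(br->pos < br->len * 8);
    unsigned bit = (br->data[br->pos >> 3] >> (br->pos & 7)) & 1u;
    br->pos++;
    return bit;
}

/* Fixed-width element: the first bit read is the element's LSB. */
static unsigned read_bits(struct bitreader *br, unsigned n)
{
    unsigned v = 0;
    for (unsigned i = 0; i < n; i++)
        v |= next_bit(br) << i;
    return v;
}

/* Huffman code bits: the first bit read is the code's MSB. */
static unsigned read_code_bits(struct bitreader *br, unsigned n)
{
    unsigned v = 0;
    for (unsigned i = 0; i < n; i++)
        v = (v << 1) | next_bit(br);
    return v;
}
```

	         For the single byte 00001101, reading a 3-bit fixed-width
	         element gives 5 (bits 0 to 2), and reading the next 3 bits as
	         a Huffman code gives the code 100 (bits 3 to 5, in the order
	         read).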

271	   3.2. Compressed block format

273	      3.2.1. Synopsis of prefix and Huffman coding

275	         Prefix coding represents symbols from an a priori known
276	         alphabet by bit sequences (codes), one code for each symbol, in
277	         a manner such that different symbols may be represented by bit
278	         sequences of different lengths, but a parser can always parse
279	         an encoded string unambiguously symbol-by-symbol.

281	         We define a prefix code in terms of a binary tree in which the
282	         two edges descending from each non-leaf node are labeled 0 and
283	         1 and in which the leaf nodes correspond one-for-one with (are
284	         labeled with) the symbols of the alphabet; then the code for a
285	         symbol is the sequence of 0's and 1's on the edges leading from
286	         the root to the leaf labeled with that symbol.  For example:

288	                          /\              Symbol    Code
289	                         0  1             ------    ----
290	                        /    \                A      00
291	                       /\     B               B       1
292	                      0  1                    C     011
293	                     /    \                   D     010
294	                    A     /\
295	                         0  1
296	                        /    \
297	                       D      C

299	         A parser can decode the next symbol from an encoded input
300	         stream by walking down the tree from the root, at each step
301	         choosing the edge corresponding to the next input bit.
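
	         The tree walk can be sketched in C as follows.  The node
	         layout, the array encoding of the example tree, and the use
	         of a string of '0' and '1' characters as input are all
	         assumptions made for illustration; a real decoder would read
	         bits from the compressed data instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout: a leaf has symbol != 0; an internal node
 * holds the array indexes of its 0-child and 1-child. */
struct node {
    char symbol;   /* 0 for internal nodes */
    int child[2];
};

/* The example tree: A = 00, B = 1, C = 011, D = 010. */
static const struct node example_tree[] = {
    { 0,   {1, 2} },   /* 0: root */
    { 0,   {3, 4} },   /* 1: reached by "0" */
    { 'B', {0, 0} },   /* 2: "1" */
    { 'A', {0, 0} },   /* 3: "00" */
    { 0,   {5, 6} },   /* 4: reached by "01" */
    { 'D', {0, 0} },   /* 5: "010" */
    { 'C', {0, 0} },   /* 6: "011" */
};

/* Decode one symbol from bits starting at *pos, advancing *pos past
 * the bits consumed. */
static char decode_symbol(const struct node *tree, const char *bits,
                          size_t *pos)
{
    int n = 0;
    while (tree[n].symbol == 0) {
        assert(bits[*pos] == '0' || bits[*pos] == '1');
        n = tree[n].child[bits[*pos] - '0'];
        (*pos)++;
    }
    return tree[n].symbol;
}
```

	         Called four times on the input "001011010", this sketch
	         returns A, B, C, and D in turn and consumes all 9 bits.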

303	         Given an alphabet with known symbol frequencies, the Huffman
304	         algorithm allows the construction of an optimal prefix code
305	         (one which represents strings with those symbol frequencies
306	         using the fewest bits of any possible prefix codes for that
307	         alphabet).  Such a code is called a Huffman code.  (See
308	         reference [1] in Section 5, References, for additional
309	         information on Huffman codes.)

311	         Note that in the 'deflate' format, the Huffman codes for the
312	         various alphabets must not exceed certain maximum code lengths.
313	         This constraint complicates the algorithm for computing code
314	         lengths from symbol frequencies.  Again, see Section 5,
315	         References, for details.

317	      3.2.2. Use of Huffman coding in the 'deflate' format

319	         The Huffman codes used for each alphabet in the 'deflate'
322	         format have two additional rules:

324	             o All codes of a given bit length have lexicographically
325	               consecutive values, in the same order as the symbols they
326	               represent;

328	             o Shorter codes lexicographically precede longer codes.

330	         We could recode the example above to follow these rules as
331	         follows, assuming that the order of the alphabet is ABCD:

333	            Symbol  Code
334	            ------  ----
335	            A       10
336	            B       0
337	            C       110
338	            D       111

340	         I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
341	         lexicographically consecutive.

343	         Given these rules, we can define the Huffman code for an alphabet
344	         just by giving the bit lengths of the codes for each symbol of
345	         the alphabet in order; this is sufficient to determine the
346	         actual codes.  In our example, the code is completely defined
347	         by the sequence of bit lengths (2, 1, 3, 3).  The following
348	         algorithm generates the codes as integers, intended to be read
349	         from most- to least-significant bit.  The code lengths are
350	         initially in tree[I].Len; the codes are produced in
351	         tree[I].Code.

353	         1)  Count the number of codes for each code length.  Let
354	         bl_count[N] be the number of codes of length N, N >= 1.

356	         2)  Find the numerical value of the smallest code for each code
357	         length:

359	                code = 0;
360	                bl_count[0] = 0;
361	                for (bits = 1; bits <= MAX_BITS; bits++) {
362	                    code = (code + bl_count[bits-1]) << 1;
363	                    next_code[bits] = code;
364	                }

366	         3)  Assign numerical values to all codes, using consecutive
367	         values for all codes of the same length with the base values
368	         determined at step 2. Codes that are never used (which have a
369	         bit length of zero) must not be assigned a value.

371	                for (n = 0;  n <= max_code; n++) {
372	                    len = tree[n].Len;
373	                    if (len == 0) continue;
376	                    tree[n].Code = next_code[len]++;
377	                }

379	         Example:

381	         Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
382	         3, 2, 4, 4).  After step 1, we have:

384	            N      bl_count[N]
385	            -      -----------
386	            2      1
387	            3      5
388	            4      2

390	         Step 2 computes the following next_code values:

392	            N      next_code[N]
393	            -      ------------
394	            1      0
395	            2      0
396	            3      2
397	            4      14

399	         Step 3 produces the following code values:

401	            Symbol Length   Code
402	            ------ ------   ----
403	            A       3        010
404	            B       3        011
405	            C       3        100
406	            D       3        101
407	            E       3        110
408	            F       2         00
409	            G       4       1110
410	            H       4       1111
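
	         Steps 1 through 3 can be combined into one C function, as
	         sketched below.  The function name assign_codes and the use
	         of plain arrays in place of tree[n].Len and tree[n].Code are
	         assumptions made for illustration.  MAX_BITS is set to 15,
	         the largest code length the code length alphabet can
	         represent.

```c
#include <assert.h>

#define MAX_BITS 15

/* Assign codes from code lengths.  lens[n] is the bit length for
 * symbol n (0 means the symbol is unused); codes[n] receives the code
 * value, to be read from most- to least-significant bit. */
static void assign_codes(const unsigned char *lens, unsigned *codes,
                         int nsyms)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned code = 0;

    /* Step 1: count the number of codes of each length. */
    for (int n = 0; n < nsyms; n++) {
        assert(lens[n] <= MAX_BITS);
        bl_count[lens[n]]++;
    }

    /* Step 2: find the smallest code value for each length. */
    bl_count[0] = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: consecutive values for codes of the same length;
     * unused symbols (length 0) are not assigned a value. */
    for (int n = 0; n < nsyms; n++) {
        if (lens[n] != 0)
            codes[n] = next_code[lens[n]]++;
    }
}
```

	         With the bit lengths (3, 3, 3, 3, 3, 2, 4, 4) from the
	         example, the sketch produces the values 2, 3, 4, 5, 6, 0, 14,
	         and 15, which match the codes in the table above.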

412	      3.2.3. Details of block format

414	         Each block of compressed data begins with 3 header bits
415	         containing the following data:

417	            first bit       BFINAL
418	            next 2 bits     BTYPE

420	         Note that the header bits do not necessarily begin on a byte
421	         boundary, since a block does not necessarily occupy an integral
422	         number of bytes.

424	         BFINAL is set iff this is the last block of the data set.

427	         BTYPE specifies how the data are compressed, as follows:

429	            00 - no compression
430	            01 - compressed with fixed Huffman codes
431	            10 - compressed with dynamic Huffman codes
432	            11 - reserved (error)
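
	         As a sketch, the three header bits can be split in C as shown
	         below.  The function name and the convention of passing the
	         bits as an integer whose bit 0 is the first bit read are
	         assumptions made for illustration.  The sketch also assumes
	         that BTYPE, like other data elements that are not Huffman
	         codes, is packed starting with its least-significant bit (see
	         Section 3.1.1).

```c
#include <assert.h>

/* Split the 3 header bits of a block into BFINAL and BTYPE.  hdr holds
 * the bits in the order read: the first bit read is bit 0.  Returns 0
 * on success, or -1 for the reserved BTYPE value 11. */
static int parse_block_header(unsigned hdr, int *bfinal, int *btype)
{
    assert(hdr < 8u);
    *bfinal = (int)(hdr & 1u);
    *btype = (int)((hdr >> 1) & 3u);
    return (*btype == 3) ? -1 : 0;
}
```

	         Under these assumptions, the bits 1, 0, 1 in the order read
	         describe a final block (BFINAL set) with BTYPE=10.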

434	         The only difference between the two compressed cases is how the
435	         Huffman codes for the literal/length and distance alphabets are
436	         defined.

438	         In all cases, the decoding algorithm for the actual data is as
439	         follows:

441	            do
442	               read block header from input stream.
443	               if stored with no compression
444	                  skip any remaining bits in current partially
445	                     processed byte
446	                  read LEN and NLEN (see next section)
447	                  copy LEN bytes of data to output
448	               otherwise
449	                  if compressed with dynamic Huffman codes
450	                     read representation of code trees (see
451	                        subsection below)
452	                  loop (until end of block code recognized)
453	                     decode literal/length value from input stream
454	                     if value < 256
455	                        copy value (literal byte) to output stream
456	                     otherwise
457	                        if value = end of block (256)
458	                           break from loop
459	                        otherwise (value = 257..285)
460	                           decode distance from input stream

462	                           move backwards distance bytes in the output
463	                           stream, and copy length bytes from this
464	                           position to the output stream.
465	                  end loop
466	            while not last block

468	         Note that a duplicated string reference may refer to a string
469	         in a previous block; i.e., the backward distance may cross one
470	         or more block boundaries.  However, a distance cannot refer
471	         past the beginning of the output stream.  (An application
472	         using a preset dictionary might discard part of the output
473	         stream; a distance can refer to that part of the output stream
474	         anyway.)  Note also that the referenced string may overlap the
475	         current position; for example, if the last 2 bytes decoded
476	         have values X and Y, a string reference with <length = 5,
477	         distance = 2> adds X,Y,X,Y,X to the output stream.
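
	         The overlapping case can be sketched in C as a byte-by-byte
	         copy.  The function name copy_match and the in-memory output
	         buffer are assumptions made for illustration.  The copy runs
	         front to back so that bytes written earlier in the same copy
	         are read again when the distance is less than the length; a
	         block copy such as memcpy does not guarantee this for
	         overlapping ranges.

```c
#include <assert.h>
#include <stddef.h>

/* Append length bytes to out, copied from distance bytes back.
 * out_len is the number of bytes already written; the caller is
 * assumed to have room for out_len + length bytes.  Returns the new
 * output length. */
static size_t copy_match(unsigned char *out, size_t out_len,
                         size_t length, size_t distance)
{
    /* A distance cannot refer past the beginning of the output. */
    assert(distance >= 1 && distance <= out_len);
    for (size_t i = 0; i < length; i++) {
        out[out_len] = out[out_len - distance];
        out_len++;
    }
    return out_len;
}
```

	         Starting from the two bytes X and Y, a call with length 5 and
	         distance 2 leaves X,Y,X,Y,X,Y,X in the buffer.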

480	         We now specify each compression method in turn.

482	      3.2.4. Non-compressed blocks (BTYPE=00)

484	         Any bits of input up to the next byte boundary are ignored.
485	         The rest of the block consists of the following information:

487	              0   1   2   3   4...
488	            +---+---+---+---+=================================+
489	            |  LEN  | NLEN  |... LEN bytes of literal data...|
490	            +---+---+---+---+=================================+

492	         LEN is the number of data bytes in the block.  NLEN is the
493	         one's complement of LEN.
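
	         A decompressor can use NLEN as a consistency check on LEN.
	         The C sketch below assumes that the padding bits have already
	         been skipped and that the four header bytes are available in
	         memory; the function name and return convention are
	         assumptions made for illustration.  Both fields are read
	         least-significant byte first, as described in Section 3.1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parse the LEN/NLEN fields of a non-compressed block.  p points at
 * the first byte after the skipped padding bits.  Returns 0 and sets
 * *len on success, or -1 if NLEN is not the one's complement of LEN. */
static int parse_stored_header(const uint8_t *p, uint16_t *len)
{
    assert(p != NULL && len != NULL);
    uint16_t l  = (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
    uint16_t nl = (uint16_t)(p[2] | ((uint16_t)p[3] << 8));
    if ((uint16_t)~l != nl)
        return -1;
    *len = l;
    return 0;
}
```

	         For example, the bytes 05 00 FA FF (hexadecimal) describe a
	         block of 5 literal bytes, while 05 00 00 00 fails the check.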

495	      3.2.5. Compressed blocks (length and distance codes)

497	         As noted above, encoded data blocks in the 'deflate' format
498	         consist of sequences of symbols drawn from three conceptually
499	         distinct alphabets: either literal bytes, from the alphabet of
500	         byte values (0..255), or <length, backward distance> pairs,
501	         where the length is drawn from (3..258) and the distance is
502	         drawn from (1..32,768).  In fact, the literal and length
503	         alphabets are merged into a single alphabet (0..285), where
504	         values 0..255 represent literal bytes, the value 256 indicates
505	         end-of-block, and values 257..285 represent length codes
506	         (possibly in conjunction with extra bits following the symbol
507	         code) as follows:

509	                 Extra               Extra               Extra
510	            Code Bits Length(s) Code Bits Length(s) Code Bits Length(s)
511	            ---- ---- ------     ---- ---- -------   ---- ---- -------
512	             257   0     3       267   1   15,16     277   4   67-82
513	             258   0     4       268   1   17,18     278   4   83-98
514	             259   0     5       269   2   19-22     279   4   99-114
515	             260   0     6       270   2   23-26     280   4  115-130
516	             261   0     7       271   2   27-30     281   5  131-162
517	             262   0     8       272   2   31-34     282   5  163-194
518	             263   0     9       273   3   35-42     283   5  195-226
519	             264   0    10       274   3   43-50     284   5  227-257
520	             265   1  11,12      275   3   51-58     285   0    258
521	             266   1  13,14      276   3   59-66

523	         The extra bits should be interpreted as a machine integer
524	         stored with the most-significant bit first, e.g., bits 1110
527	         represent the value 14.

529	                  Extra           Extra               Extra
530	             Code Bits Dist  Code Bits   Dist     Code Bits Distance
531	             ---- ---- ----  ---- ----  ------    ---- ---- --------
532	               0   0    1     10   4     33-48    20    9   1025-1536
533	               1   0    2     11   4     49-64    21    9   1537-2048
534	               2   0    3     12   5     65-96    22   10   2049-3072
535	               3   0    4     13   5     97-128   23   10   3073-4096
536	               4   1   5,6    14   6    129-192   24   11   4097-6144
537	               5   1   7,8    15   6    193-256   25   11   6145-8192
538	               6   2   9-12   16   7    257-384   26   12  8193-12288
539	               7   2  13-16   17   7    385-512   27   12 12289-16384
540	               8   3  17-24   18   8    513-768   28   13 16385-24576
541	               9   3  25-32   19   8   769-1024   29   13 24577-32768
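
	         One way to apply the two tables above is to store, for each
	         code, the smallest value it represents and its number of
	         extra bits, and then add the value of the extra bits.  The C
	         sketch below takes this approach; the array and function
	         names are assumptions made for illustration, and the extra
	         bits are assumed to have been read into an integer already.

```c
#include <assert.h>

/* Smallest length and extra-bit count for length codes 257..285. */
static const unsigned short len_base[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258 };
static const unsigned char len_extra[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0 };

/* Smallest distance and extra-bit count for distance codes 0..29. */
static const unsigned short dist_base[30] = {
    1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
    257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
    8193, 12289, 16385, 24577 };
static const unsigned char dist_extra[30] = {
    0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
    7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13 };

/* Length for a length code plus the value of its extra bits.  The
 * table lists code 284 as 227-257; this sketch does not reject the
 * remaining extra-bit value 31 for that code. */
static unsigned match_length(unsigned code, unsigned extra)
{
    assert(code >= 257 && code <= 285);
    assert(extra < (1u << len_extra[code - 257]));
    return len_base[code - 257] + extra;
}

/* Distance for a distance code plus the value of its extra bits. */
static unsigned match_distance(unsigned code, unsigned extra)
{
    assert(code <= 29);
    assert(extra < (1u << dist_extra[code]));
    return dist_base[code] + extra;
}
```

	         For example, length code 266 with extra bit value 1 gives a
	         length of 14, and distance code 29 with extra bits value 8191
	         gives the maximum distance of 32768.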

543	      3.2.6. Compression with fixed Huffman codes (BTYPE=01)

545	         The Huffman codes for the two alphabets are fixed, and are not
546	         represented explicitly in the data.  The Huffman code lengths
547	         for the literal/length alphabet are:

549	                   Lit Value    Bits        Codes
550	                   ---------    ----        -----
551	                     0 - 143     8          00110000 through
552	                                            10111111
553	                   144 - 255     9          110010000 through
554	                                            111111111
555	                   256 - 279     7          0000000 through
556	                                            0010111
557	                   280 - 287     8          11000000 through
558	                                            11000111

560	         The code lengths are sufficient to generate the actual codes,
561	         as described above; we show the codes in the table for added
562	         clarity.  Literal/length values 286-287 will never actually
563	         occur in the compressed data, but participate in the code
564	         construction.
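         The claim that the code lengths alone determine the codes can be
         checked with a short Python sketch.  It assumes the usual canonical
         assignment (shorter codes first, then in symbol order within each
         length); the function name canonical_codes and the two constants
         are hypothetical.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes (as integers) from code lengths.

    A length of 0 means the symbol is unused; its entry is None.
    """
    max_len = max(lengths)
    # Count how many codes there are of each length.
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    # Find the numerically smallest code of each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive codes in symbol order within each length.
    codes = []
    for n in lengths:
        if n == 0:
            codes.append(None)
        else:
            codes.append(next_code[n])
            next_code[n] += 1
    return codes


# Code lengths of the fixed literal/length alphabet, from the table above.
FIXED_LITLEN_LENGTHS = [8] * 144 + [9] * 112 + [7] * 24 + [8] * 8
FIXED_LITLEN_CODES = canonical_codes(FIXED_LITLEN_LENGTHS)
```

         Formatting FIXED_LITLEN_CODES[sym] in binary, padded to
         FIXED_LITLEN_LENGTHS[sym] digits, should reproduce the Codes
         column of the table.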

566	         Distance codes 0-31 are represented by (fixed-length) 5-bit
567	         codes, with possible additional bits as shown in the table in
568	         Paragraph 3.2.5, above.  Note that distance codes 30-31 will
569	         never actually occur in the compressed data.

571	      3.2.7. Compression with dynamic Huffman codes (BTYPE=10)

573	         The Huffman codes for the two alphabets appear in the block
574	         immediately after the header bits and before the actual
575	         compressed data, first the literal/length code and then the
576	         distance code.  Each code is defined by a sequence of code
577	         lengths, as discussed in Paragraph 3.2.2, above.  For even
578	         greater compactness, the code length sequences themselves are
581	         compressed using a Huffman code.  The alphabet for code lengths
582	         is as follows:

584	               0 - 15: Represent code lengths of 0 - 15
585	                   16: Copy the previous code length 3 - 6 times.
586	                       The next 2 bits indicate repeat length
587	                             (0 = 3, ... , 3 = 6)
588	                          Example:  Codes 8, 16 (+2 bits 11),
589	                                    16 (+2 bits 10) will expand to
590	                                    12 code lengths of 8 (1 + 6 + 5)
591	                   17: Repeat a code length of 0 for 3 - 10 times.
592	                       (3 bits of length)
593	                   18: Repeat a code length of 0 for 11 - 138 times
594	                       (7 bits of length)
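         The run-length symbols above can be expanded as in the following
         Python sketch.  It assumes the symbols have already been Huffman
         decoded and paired with the integer value of any extra bits; the
         function name expand_code_lengths is hypothetical.

```python
def expand_code_lengths(symbols):
    """Expand (symbol, extra) pairs of the code length alphabet.

    `extra` is the value of the extra bits following symbols 16, 17
    and 18; it is ignored for symbols 0-15.
    """
    lengths = []
    for sym, extra in symbols:
        if 0 <= sym <= 15:
            lengths.append(sym)
        elif sym == 16:
            # Copy the previous code length 3-6 times (2 extra bits).
            if not lengths:
                raise ValueError("symbol 16 has no previous length to copy")
            if not 0 <= extra <= 3:
                raise ValueError("symbol 16 takes 2 extra bits")
            lengths.extend([lengths[-1]] * (3 + extra))
        elif sym == 17:
            # Repeat a code length of 0 for 3-10 times (3 extra bits).
            if not 0 <= extra <= 7:
                raise ValueError("symbol 17 takes 3 extra bits")
            lengths.extend([0] * (3 + extra))
        elif sym == 18:
            # Repeat a code length of 0 for 11-138 times (7 extra bits).
            if not 0 <= extra <= 127:
                raise ValueError("symbol 18 takes 7 extra bits")
            lengths.extend([0] * (11 + extra))
        else:
            raise ValueError("symbol outside the code length alphabet")
    return lengths
```

         Applied to the example above, expand_code_lengths([(8, 0),
         (16, 3), (16, 2)]) yields 12 code lengths of 8.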

596	         A code length of 0 indicates that the corresponding symbol in
597	         the literal/length or distance alphabet will not occur in the
598	         block, and should not participate in the Huffman code
599	         construction algorithm given earlier.  If only one distance
600	         code is used, it is encoded using one bit, not zero bits; in
601	         this case there is a single code length of one, with one unused
602	         code.  One distance code of zero bits means that there are no
603	         distance codes used at all (the data is all literals).

605	         We can now define the format of the block:

607	               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
608	               5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
609	               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)

611	               (HCLEN + 4) x 3 bits: code lengths for the code length
612	                  alphabet given just above, in the order: 16, 17, 18,
613	                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

615	                  These code lengths are interpreted as 3-bit integers
616	                  (0-7); as above, a code length of 0 means the
617	                  corresponding symbol (literal/length or distance code
618	                  length) is not used.

620	               HLIT + 257 code lengths for the literal/length alphabet,
621	                  encoded using the code length Huffman code

623	               HDIST + 1 code lengths for the distance alphabet,
624	                  encoded using the code length Huffman code

626	               The actual compressed data of the block,
627	                  encoded using the literal/length and distance Huffman
628	                  codes

630	               The literal/length symbol 256 (end of data),
631	                  encoded using the literal/length Huffman code

634	         The code length repeat codes can cross from HLIT + 257 to the
635	         HDIST + 1 code lengths.  In other words, all code lengths form
636	         a single sequence of HLIT + HDIST + 258 values.
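         Two bookkeeping steps of this block format can be sketched in
         Python: placing the (HCLEN + 4) transmitted values into symbol
         order, and splitting the single decoded sequence into its two
         alphabets.  The function names are hypothetical, bit-level reading
         is left out, and the sketch assumes that code length symbols whose
         lengths are not transmitted are treated as having length 0.

```python
# Order in which code lengths for the code length alphabet are sent.
CLEN_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2,
              14, 1, 15]


def code_length_code_lengths(hclen, raw):
    """Place the (HCLEN + 4) transmitted 3-bit values at symbols 0-18."""
    if not 0 <= hclen <= 15:
        raise ValueError("HCLEN is a 4-bit field")
    if len(raw) != hclen + 4:
        raise ValueError("expected HCLEN + 4 values")
    lengths = [0] * 19  # assumption: symbols not transmitted have length 0
    for sym, value in zip(CLEN_ORDER, raw):
        if not 0 <= value <= 7:
            raise ValueError("each value is a 3-bit integer")
        lengths[sym] = value
    return lengths


def split_code_lengths(hlit, hdist, lengths):
    """Split the HLIT + HDIST + 258 values into the two alphabets."""
    n_lit = hlit + 257
    n_dist = hdist + 1
    if len(lengths) != n_lit + n_dist:
        raise ValueError("expected HLIT + HDIST + 258 code lengths")
    return lengths[:n_lit], lengths[n_lit:]
```

         Splitting only after the whole sequence has been expanded is what
         allows a repeat code to run across the boundary between the two
         alphabets.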

638	   3.3. Compliance

640	      A compressor may further limit the ranges of values specified in
641	      the previous section and still be compliant; for example, it may
642	      limit the range of backward pointers to some value smaller than
643	      32K.  Similarly, a compressor may limit the size of blocks so that
644	      a compressible block fits in memory.

646	      A compliant decompressor must accept the full range of possible
647	      values defined in the previous section, and must accept blocks of
648	      arbitrary size.

650	4. Compression algorithm details

652	   While it is the intent of this document to define the 'deflate'
653	   compressed data format without reference to any particular
654	   compression algorithm, the format is related to the compressed
655	   formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below).
656	   Since many variations of LZ77 are patented, it is strongly
657	   recommended that the implementor of a compressor follow the general
658	   algorithm presented here, which is known not to be patented per se.
659	   The material in this section is not part of the definition of the
660	   specification, and a compressor need not follow it in order to be
661	   compliant.

663	   The compressor terminates a block when it determines that starting a
664	   new block with fresh trees would be useful, or when the block fills
665	   the compressor's block buffer.

667	   The compressor uses a chained hash table to find duplicated strings,
668	   using a hash function that operates on 3-byte sequences.  At any
669	   given point during compression, let XYZ be the next 3 input bytes to
670	   be examined (not necessarily all different, of course).  First, the
671	   compressor examines the hash chain for XYZ.  If the chain is empty,
672	   the compressor simply writes out X as a literal byte and advances one
673	   byte in the input.  If the hash chain is not empty, indicating that
674	   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
675	   same hash function value) has occurred recently, the compressor
676	   compares all strings on the XYZ hash chain with the actual input data
677	   sequence starting at the current point, and selects the longest
678	   match.

680	   The compressor searches the hash chains starting with the most recent
681	   strings, to favor small distances and thus take advantage of the
682	   Huffman encoding.  The hash chains are singly linked. There are no
683	   deletions from the hash chains; the algorithm simply discards matches
684	   that are too old.  To avoid a worst-case situation, very long hash
685	   chains are arbitrarily truncated at a certain length, determined by a
688	   run-time parameter.
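   The hash chain search described above might look like the following
   Python sketch.  It is not the zlib implementation: a dictionary keyed
   by the 3 bytes themselves stands in for the hash function (so there are
   no collisions), the names insert_string, longest_match, greedy_tokens
   and max_chain are hypothetical, and the output is a list of literal
   byte values and (length, distance) pairs rather than encoded bits.

```python
MIN_MATCH = 3         # the chains are keyed on 3-byte sequences
MAX_MATCH = 258       # longest match length in the format
WINDOW_SIZE = 32768   # largest distance in the format


def insert_string(data, pos, head, prev):
    """Link position `pos` at the front of the chain for its 3 bytes."""
    if pos + MIN_MATCH <= len(data):
        key = data[pos:pos + MIN_MATCH]
        prev[pos] = head.get(key, -1)
        head[key] = pos


def match_length(data, cand, pos):
    """Count matching bytes between an earlier position and `pos`."""
    n = 0
    while (n < MAX_MATCH and pos + n < len(data)
           and data[cand + n] == data[pos + n]):
        n += 1
    return n


def longest_match(data, pos, head, prev, max_chain=128):
    """Walk the chain, most recent string first; return (length, dist)."""
    best_len, best_dist = 0, 0
    cand = head.get(data[pos:pos + MIN_MATCH], -1)
    steps = 0
    while cand >= 0 and steps < max_chain:   # truncate very long chains
        if pos - cand > WINDOW_SIZE:
            break                            # too old, and so is the rest
        n = match_length(data, cand, pos)
        if n > best_len:                     # ties keep the smaller distance
            best_len, best_dist = n, pos - cand
        cand = prev[cand]
        steps += 1
    return best_len, best_dist


def greedy_tokens(data, max_chain=128):
    """Emit literals and (length, distance) pairs without lazy matching."""
    head, prev = {}, [-1] * len(data)
    tokens, pos = [], 0
    while pos < len(data):
        length, dist = longest_match(data, pos, head, prev, max_chain)
        if length >= MIN_MATCH:
            tokens.append((length, dist))
            step = length
        else:
            tokens.append(data[pos])         # literal byte value
            step = 1
        for p in range(pos, pos + step):
            insert_string(data, p, head, prev)
        pos += step
    return tokens
```

   Nothing is ever deleted from head or prev in this sketch; strings that
   are too far back are skipped when the chain is walked.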

690	   To improve overall compression, the compressor optionally defers the
691	   selection of matches ("lazy matching"): after a match of length N has
692	   been found, the compressor searches for a longer match starting at
693	   the next input byte.  If it finds a longer match, it truncates the
694	   previous match to a length of one (thus producing a single literal
695	   byte) and then emits the longer match.  Otherwise, it emits the
696	   original match, and, as described above, advances N bytes before
697	   continuing.
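   The effect of lazy matching can be seen in a small Python sketch.  To
   keep it self-contained, a brute-force search (find_match) stands in for
   the hash chain search; all function names are hypothetical.  A deferred
   longer match is simply picked up on the next pass of the loop, where it
   may be deferred again in favor of an even longer one.

```python
MIN_MATCH, MAX_MATCH, WINDOW_SIZE = 3, 258, 32768


def find_match(data, pos):
    """Brute-force stand-in for the hash chain search: (length, dist)."""
    best_len, best_dist = 0, 0
    oldest = max(pos - WINDOW_SIZE, 0)
    for cand in range(pos - 1, oldest - 1, -1):   # most recent first
        n = 0
        while (n < MAX_MATCH and pos + n < len(data)
               and data[cand + n] == data[pos + n]):
            n += 1
        if n > best_len:
            best_len, best_dist = n, pos - cand
    if best_len < MIN_MATCH:
        return 0, 0
    return best_len, best_dist


def lazy_tokens(data):
    """Emit literals and (length, distance) pairs with lazy matching."""
    tokens, pos = [], 0
    while pos < len(data):
        length, dist = find_match(data, pos)
        if length == 0:
            tokens.append(data[pos])
            pos += 1
            continue
        next_length, _ = find_match(data, pos + 1)
        if next_length > length:
            # Truncate the current match to a single literal byte; the
            # longer match is found again on the next pass.
            tokens.append(data[pos])
            pos += 1
        else:
            tokens.append((length, dist))
            pos += length
    return tokens


def expand_tokens(tokens):
    """Rebuild the input from tokens, to check the sketch round-trips."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            length, dist = tok
            for _ in range(length):
                out.append(out[-dist])   # byte-by-byte, so overlap works
        else:
            out.append(tok)
    return bytes(out)
```

   On the input b"abcXbcdeYabcde", the final "abcde" has a 3-byte match
   for "abc" at its first byte and a 4-byte match for "bcde" at the next
   one, so the sketch emits the literal "a" followed by the 4-byte match.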

699	   Run-time parameters also control this "lazy match" procedure.  If
700	   compression ratio is most important, the compressor attempts a
701	   complete second search regardless of the length of the first match.
702	   In the normal case, if the current match is "long enough", the
703	   compressor reduces the search for a longer match, thus speeding up
704	   the process.  If speed is most important, the compressor inserts new
705	   strings in the hash table only when no match was found, or when the
706	   match is not "too long".  This degrades the compression ratio but
707	   saves time since there are both fewer insertions and fewer searches.

709	5. References

711	   [1] Huffman, D. A., "A Method for the Construction of Minimum
712	   Redundancy Codes", Proceedings of the Institute of Radio Engineers,
713	   September 1952, Volume 40, Number 9, pp. 1098-1101.

715	   [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
716	   Compression", IEEE Transactions on Information Theory, Vol. 23, No.
717	   3, pp. 337-343.

719	   [3] Gailly, J.-L., and Adler, M., zlib documentation and sources,
720	   available in ftp.uu.net:/pub/archiving/zip/doc/zlib*

722	   [4] Gailly, J.-L., and Adler, M., gzip documentation and sources,
723	   available in prep.ai.mit.edu:/pub/gnu/gzip-*.tar

725	   [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
726	   encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.

728	   [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes",
729	   Comm. ACM, 33,4, April 1990, pp. 449-459.

731	6. Security considerations

733	   Any data compression method involves the reduction of redundancy in
734	   the data.  Consequently, any corruption of the data is likely to have
735	   severe effects and be difficult to correct.  Uncompressed text, on
736	   the other hand, will probably still be readable despite the presence
737	   of some corrupted bytes.

739	   It is recommended that systems using this data format provide some
742	   means of validating the integrity of the compressed data.  See
743	   reference [3], for example.
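   One possible way to follow this recommendation is sketched below using
   Python's standard zlib module.  The container layout (a raw 'deflate'
   stream followed by a 4-byte big-endian CRC-32 of the uncompressed data)
   and the names pack and unpack are hypothetical and serve only as an
   illustration; they are not a format defined by this specification.

```python
import zlib


def pack(payload):
    """Compress to a raw 'deflate' stream and append a CRC-32."""
    # A negative window-bits value asks zlib for a raw deflate stream
    # with no wrapper of its own.
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    stream = comp.compress(payload) + comp.flush()
    return stream + zlib.crc32(payload).to_bytes(4, "big")


def unpack(blob):
    """Decompress and verify; raise ValueError on corrupted data."""
    stream, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    try:
        payload = zlib.decompress(stream, -15)
    except zlib.error as exc:
        raise ValueError("corrupt deflate stream") from exc
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload
```

   A check value computed over the uncompressed data detects corruption
   that still happens to decode as a well-formed stream.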

745	7. Source code

747	   Source code for a C language implementation of a 'deflate' compliant
748	   compressor and decompressor is available within the zlib package at
749	   ftp.uu.net:/pub/archiving/zip/zlib/zlib*.

751	8. Acknowledgements

753	   Trademarks cited in this document are the property of their
754	   respective owners.

756	   Phil Katz designed the deflate format.  Jean-Loup Gailly and Mark
757	   Adler wrote the related software described in this specification.
758	   Glenn Randers-Pehrson converted this document to Internet Draft and
759	   HTML format.

761	9. Author's address

763	   L. Peter Deutsch

765	      Aladdin Enterprises
766	      203 Santa Margarita Ave.
767	      Menlo Park, CA 94025

769	      Phone: (415) 322-0103 (AM only)
770	      FAX:   (415) 322-1734
771	      EMail: <ghost@aladdin.com>

773	   Questions about the technical content of this specification can be
774	   sent by email to

776	      Jean-loup Gailly <gzip@prep.ai.mit.edu> and
777	      Mark Adler <madler@alumni.caltech.edu>

779	   Editorial comments on this specification can be sent by email to

781	      L. Peter Deutsch <ghost@aladdin.com> and
782	      Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
