idnits 2.17.1 

draft-deutsch-deflate-spec-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in
     this document.

     Expected boilerplate is as follows today (2024-04-26) according to
     https://trustee.ietf.org/license-info :

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.a:
        This Internet-Draft is submitted in full conformance with the provisions
        of BCP 78 and BCP 79.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2:
        Copyright (c) 2024 IETF Trust and the persons identified as the document
        authors.  All rights reserved.

     IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3:
        This document is subject to BCP 78 and the IETF Trust's Legal Provisions
        Relating to IETF Documents
        (https://trustee.ietf.org/license-info) in effect on the date of
        publication of this document.  Please review these documents
        carefully, as they describe your rights and restrictions with
        respect to this document.  Code Components extracted from this
        document must include Simplified BSD License text as described in
        Section 4.e of the Trust Legal Provisions and are provided
        without warranty as described in the Simplified BSD License.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document seems to lack a 1id_guidelines paragraph about
     Internet-Drafts being working documents. 

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     current Internet-Drafts. 

  ** The document seems to lack a 1id_guidelines paragraph about the list of
     Shadow Directories. 

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 15
     longer pages, the longest (page 2) being 59 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (21 Mar 1996) is 10263 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'I' on line 353

  -- Looks like a reference, but probably isn't: 'N' on line 395

  == Missing Reference: '0' is mentioned on line 363, but not defined

  == Unused Reference: '4' is defined on line 725, but no explicit reference
     was found in the text

  == Unused Reference: '5' is defined on line 728, but no explicit reference
     was found in the text

  == Unused Reference: '6' is defined on line 731, but no explicit reference
     was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  -- Possible downref: Non-RFC (?) normative reference: ref. '4'

  -- Possible downref: Non-RFC (?) normative reference: ref. '5'

  -- Possible downref: Non-RFC (?) normative reference: ref. '6'


     Summary: 8 errors (**), 0 flaws (~~), 6 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	INTERNET-DRAFT                                          L. Peter Deutsch
2	DEFLATE 1.3                                          Aladdin Enterprises
3	Expires: 26 Sep 1996                                         21 Mar 1996

5	        DEFLATE Compressed Data Format Specification version 1.3

7	File draft-deutsch-deflate-spec-03.txt

9	Status of this Memo

11	   This document is an Internet-Draft.  Internet-Drafts are working
12	   documents of the Internet Engineering Task Force (IETF), its areas,
13	   and its working groups.  Note that other groups may also distribute
14	   working documents as Internet-Drafts.

16	   Internet-Drafts are draft documents valid for a maximum of six months
17	   and may be updated, replaced, or obsoleted by other documents at any
18	   time.  It is inappropriate to use Internet-Drafts as reference
19	   material or to cite them other than as ``work in progress.''

21	   To learn the current status of any Internet-Draft, please check the
22	   ``1id-abstracts.txt'' listing contained in the Internet-Drafts
23	   Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe),
24	   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or
25	   ftp.isi.edu (US West Coast).

27	   Distribution of this memo is unlimited.

29	   A pointer to the latest version of this and related documentation in
30	   HTML format can be found at the URL
31	   <ftp://ftp.uu.net/pub/graphics/png/documents/zlib/zdoc-index.html>.

33	Notices

35	   Copyright (c) 1996 L. Peter Deutsch

37	   Permission is granted to copy and distribute this document for any
38	   purpose and without charge, including translations into other
39	   languages and incorporation into compilations, provided that the
40	   copyright notice and this notice are preserved, and that any
41	   substantive changes or deletions from the original are clearly
42	   marked.

44	Deutsch                                                        [Page  1]
45	Abstract

47	   This specification defines a lossless compressed data format that
48	   compresses data using a combination of the LZ77 algorithm and Huffman
49	   coding, with efficiency comparable to the best currently available
50	   general-purpose compression methods.  The data can be produced or
51	   consumed, even for an arbitrarily long sequentially presented input
52	   data stream, using only an a priori bounded amount of intermediate
53	   storage.  The format can be implemented readily in a manner not
54	   covered by patents.

56	Table of Contents

58	   1. Introduction ................................................... 2
59	      1.1. Purpose ................................................... 3
60	      1.2. Intended audience ......................................... 3
61	      1.3. Scope ..................................................... 3
62	      1.4. Compliance ................................................ 4
63	      1.5.  Definitions of terms and conventions used ................ 4
64	      1.6. Changes from previous versions ............................ 4
65	   2. Compressed representation overview ............................. 4
66	   3. Detailed specification ......................................... 5
67	      3.1. Overall conventions ....................................... 5
68	          3.1.1. Packing into bytes .................................. 5
69	      3.2. Compressed block format ................................... 6
70	          3.2.1. Synopsis of prefix and Huffman coding ............... 6
71	          3.2.2. Use of Huffman coding in the 'deflate' format ....... 7
72	          3.2.3. Details of block format ............................. 9
73	          3.2.4. Non-compressed blocks (BTYPE=00) ................... 10
74	          3.2.5. Compressed blocks (length and distance codes) ...... 11
75	          3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 11
76	          3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 12
77	      3.3. Compliance ............................................... 13
78	   4. Compression algorithm details ................................. 14
79	   5. References .................................................... 15
80	   6. Security considerations ....................................... 15
81	   7. Source code ................................................... 15
82	   8. Acknowledgements .............................................. 16
83	   9. Author's address .............................................. 16

85	1. Introduction

88	   1.1. Purpose

90	      The purpose of this specification is to define a lossless
91	      compressed data format that:

93	          * Is independent of CPU type, operating system, file system,
94	            and character set, and hence can be used for interchange;
95	          * Can be produced or consumed, even for an arbitrarily long
96	            sequentially presented input data stream, using only an a
97	            priori bounded amount of intermediate storage, and hence can
98	            be used in data communications or similar structures such as
99	            Unix filters;
100	          * Compresses data with efficiency comparable to the best
101	            currently available general-purpose compression methods, and
102	            in particular considerably better than the 'compress'
103	            program;
104	          * Can be implemented readily in a manner not covered by
105	            patents, and hence can be practiced freely;
106	          * Is compatible with the file format produced by the current
107	            widely used gzip utility, in that conforming decompressors
108	            will be able to read data produced by the existing gzip
109	            compressor.

111	      The data format defined by this specification does not attempt to:

113	          * Allow random access to compressed data;
114	          * Compress specialized data (e.g., raster graphics) as well as
115	            the best currently available specialized algorithms.

117	      A simple counting argument shows that no lossless compression
118	      algorithm can compress every possible input data set.  For the
119	      format defined here, the worst case expansion is 5 bytes per 32K-
120	      byte block, i.e., a size increase of 0.015% for large data sets.
121	      English text usually compresses by a factor of 2.5 to 3;
122	      executable files usually compress somewhat less; graphical data
123	      such as raster images may compress much more.

125	   1.2. Intended audience

127	      This specification is intended for use by implementors of software
128	      to compress data into 'deflate' format and/or decompress data from
129	      'deflate' format.

131	      The text of the specification assumes a basic background in
132	      programming at the level of bits and other primitive data
133	      representations.  Familiarity with the technique of Huffman coding
134	      is helpful but not required.

136	   1.3. Scope

138	      The specification specifies a method for representing a sequence
139	      of bytes as a (usually shorter) sequence of bits, and a method for
142	      packing the latter bit sequence into bytes.

144	   1.4. Compliance

146	      Unless otherwise indicated below, a compliant decompressor must be
147	      able to accept and decompress any data set that conforms to all
148	      the specifications presented here; a compliant compressor must
149	      produce data sets that conform to all the specifications presented
150	      here.

152	   1.5.  Definitions of terms and conventions used

154	      Byte: 8 bits stored or transmitted as a unit (same as an octet).
155	      For this specification, a byte is exactly 8 bits, even on machines
156	      that store a character in a number of bits other than eight.
157	      See below for the numbering of bits within a byte.

159	      String: a sequence of arbitrary bytes.

161	   1.6. Changes from previous versions

163	      There have been no technical changes to the deflate format since
164	      version 1.1 of this specification.  In version 1.2, some
165	      terminology was changed.  Version 1.3 is a conversion of the
166	      specification to Internet Draft style.

168	2. Compressed representation overview

170	   A compressed data set consists of a series of blocks, corresponding
171	   to successive blocks of input data.  The block sizes are arbitrary,
172	   except that non-compressible blocks are limited to 65,535 bytes.

174	   Each block is compressed using a combination of the LZ77 algorithm
175	   and Huffman coding. The Huffman trees for each block are independent
176	   of those for previous or subsequent blocks; the LZ77 algorithm may
177	   use a reference to a duplicated string occurring in a previous block,
178	   up to 32K input bytes before.

180	   Each block consists of two parts: a pair of Huffman code trees that
181	   describe the representation of the compressed data part, and a
182	   compressed data part.  (The Huffman trees themselves are compressed
183	   using Huffman encoding.)  The compressed data consists of a series of
184	   elements of two types: literal bytes (of strings that have not been
185	   detected as duplicated within the previous 32K input bytes), and
186	   pointers to duplicated strings, where a pointer is represented as a
187	   pair <length, backward distance>.  The representation used in the
188	   'deflate' format limits distances to 32K bytes and lengths to 258
189	   bytes, but does not limit the size of a block, except for
190	   non-compressible blocks, which are limited as noted above.

192	   Each type of value (literals, distances, and lengths) in the
193	   compressed data is represented using a Huffman code, using one code
196	   tree for literals and lengths and a separate code tree for distances.
197	   The code trees for each block appear in a compact form just before
198	   the compressed data for that block.

200	3. Detailed specification

202	   3.1. Overall conventions

	      In the diagrams below, a box like this:

204	         +---+
205	         |   | <-- the vertical bars might be missing
206	         +---+

208	      represents one byte; a box like this:

210	         +==============+
211	         |              |
212	         +==============+

214	      represents a variable number of bytes.

216	      Bytes stored within a computer do not have a 'bit order', since
217	      they are always treated as a unit.  However, a byte considered as
218	      an integer between 0 and 255 does have a most- and least-
219	      significant bit, and since we write numbers with the most-
220	      significant digit on the left, we also write bytes with the most-
221	      significant bit on the left.  In the diagrams below, we number the
222	      bits of a byte so that bit 0 is the least-significant bit, i.e.,
223	      the bits are numbered:

225	         +--------+
226	         |76543210|
227	         +--------+

229	      Within a computer, a number may occupy multiple bytes.  All
230	      multi-byte numbers in the format described here are stored with
231	      the least-significant byte first (at the lower memory address).
232	      For example, the decimal number 520 is stored as:

234	             0        1
235	         +--------+--------+
236	         |00001000|00000010|
237	         +--------+--------+
238	          ^        ^
239	          |        |
240	          |        + more significant byte = 2 x 256
241	          + less significant byte = 8

243	      3.1.1. Packing into bytes

245	         This document does not address the issue of the order in which
246	         bits of a byte are transmitted on a bit-sequential medium,
247	         since the final data format described here is byte- rather than
250	         bit-oriented.  However, we describe the compressed block format
251	         below as a sequence of data elements of various bit
252	         lengths, not a sequence of bytes.  We must therefore specify
253	         how to pack these data elements into bytes to form the final
254	         compressed byte sequence:

256	             * Data elements are packed into bytes in order of
257	               increasing bit number within the byte, i.e., starting
258	               with the least-significant bit of the byte.
259	             * Data elements other than Huffman codes are packed
260	               starting with the least-significant bit of the data
261	               element.
262	             * Huffman codes are packed starting with the most-
263	               significant bit of the code.

265	         In other words, if one were to print out the compressed data as
266	         a sequence of bytes, starting with the first byte at the
267	         *right* margin and proceeding to the *left*, with the most-
268	         significant bit of each byte on the left as usual, one would be
269	         able to parse the result from right to left, with fixed-width
270	         elements in the correct MSB-to-LSB order and Huffman codes in
271	         bit-reversed order (i.e., with the first bit of the code in the
272	         relative LSB position).
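
	         The packing rules above can be sketched as a small bit writer.
	         This is a minimal illustration under stated assumptions: the
	         names bit_writer, put_bit, put_bits, and put_huffman are
	         hypothetical, and the caller is assumed to supply a
	         zero-filled buffer that is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer; buf must be zero-filled by the caller. */
struct bit_writer {
    uint8_t *buf;     /* output bytes */
    size_t   bitpos;  /* number of bits written so far */
};

/* Append one bit at the next free position, lowest bit number first. */
static void put_bit(struct bit_writer *w, unsigned bit)
{
    w->buf[w->bitpos >> 3] |= (uint8_t)((bit & 1u) << (w->bitpos & 7));
    w->bitpos++;
}

/* Data element other than a Huffman code: least-significant bit first. */
static void put_bits(struct bit_writer *w, unsigned value, int nbits)
{
    assert(nbits >= 0 && nbits <= 16);
    for (int i = 0; i < nbits; i++)
        put_bit(w, value >> i);
}

/* Huffman code: most-significant bit of the code first. */
static void put_huffman(struct bit_writer *w, unsigned code, int nbits)
{
    assert(nbits >= 1 && nbits <= 15);
    for (int i = nbits - 1; i >= 0; i--)
        put_bit(w, code >> i);
}
```

	         For example, writing the 3-bit data element 5 (binary 101)
	         followed by the 3-bit Huffman code 110 fills bits 0 through 5
	         of the first byte with 1,0,1,1,1,0, giving the byte value
	         0x1D.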

274	   3.2. Compressed block format

276	      3.2.1. Synopsis of prefix and Huffman coding

278	         Prefix coding represents symbols from an a priori known
279	         alphabet by bit sequences (codes), one code for each symbol, in
280	         a manner such that different symbols may be represented by bit
281	         sequences of different lengths, but a parser can always parse
282	         an encoded string unambiguously symbol-by-symbol.

284	         We define a prefix code in terms of a binary tree in which the
285	         two edges descending from each non-leaf node are labeled 0 and
286	         1 and in which the leaf nodes correspond one-for-one with (are
287	         labeled with) the symbols of the alphabet; then the code for a
288	         symbol is the sequence of 0's and 1's on the edges leading from
289	         the root to the leaf labeled with that symbol.  For example:

291	                          /\              Symbol    Code
292	                         0  1             ------    ----
293	                        /    \                A      00
294	                       /\     B               B       1
295	                      0  1                    C     011
296	                     /    \                   D     010
297	                    A     /\
298	                         0  1
299	                        /    \
300	                       D      C

303	         A parser can decode the next symbol from an encoded input
304	         stream by walking down the tree from the root, at each step
305	         choosing the edge corresponding to the next input bit.
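
	         The tree walk can be sketched for the example code above
	         (A = 00, B = 1, C = 011, D = 010).  The node layout and the
	         function name decode_bits are assumptions made for
	         illustration, and the input is given as a string of '0' and
	         '1' characters rather than as packed bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout: a leaf has symbol != 0; otherwise child[b]
   is the index of the node reached along the edge labeled b. */
struct node {
    char symbol;
    int  child[2];
};

/* The example tree: A = 00, B = 1, C = 011, D = 010. */
static const struct node example_tree[] = {
    { 0,   { 1, 2 } },  /* 0: root        */
    { 0,   { 3, 4 } },  /* 1: after "0"   */
    { 'B', { 0, 0 } },  /* 2: leaf "1"    */
    { 'A', { 0, 0 } },  /* 3: leaf "00"   */
    { 0,   { 5, 6 } },  /* 4: after "01"  */
    { 'D', { 0, 0 } },  /* 5: leaf "010"  */
    { 'C', { 0, 0 } },  /* 6: leaf "011"  */
};

/* Decode a string of '0'/'1' characters into out and return the number
   of symbols decoded.  Any character other than '1' is treated as 0;
   bits left over after the last complete code are ignored. */
static size_t decode_bits(const char *bits, char *out, size_t outcap)
{
    size_t n = 0;
    int cur = 0;                        /* start at the root */
    assert(bits != NULL && out != NULL);
    for (; *bits != '\0' && n < outcap; bits++) {
        cur = example_tree[cur].child[*bits == '1'];
        if (example_tree[cur].symbol != 0) {
            out[n++] = example_tree[cur].symbol;
            cur = 0;                    /* back to the root */
        }
    }
    return n;
}
```

	         Decoding the bit string 100011010 with this sketch yields the
	         symbols B, A, C, D.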

307	         Given an alphabet with known symbol frequencies, the Huffman
308	         algorithm allows the construction of an optimal prefix code
309	         (one which represents strings with those symbol frequencies
310	         using the fewest bits of any possible prefix codes for that
311	         alphabet).  Such a code is called a Huffman code.  (See
312	         reference [1] in Chapter 5, References, for additional
313	         information on Huffman codes.)

315	         Note that in the 'deflate' format, the Huffman codes for the
316	         various alphabets must not exceed certain maximum code lengths.
317	         This constraint complicates the algorithm for computing code
318	         lengths from symbol frequencies.  Again, see Chapter 5,
319	         References, for details.

321	      3.2.2. Use of Huffman coding in the 'deflate' format

323	         The Huffman codes used for each alphabet in the 'deflate'
324	         format have two additional rules:

326	             * All codes of a given bit length have lexicographically
327	               consecutive values, in the same order as the symbols they
328	               represent;

330	             * Shorter codes lexicographically precede longer codes.

332	         We could recode the example above to follow these rules as
333	         follows, assuming that the order of the alphabet is ABCD:

335	            Symbol  Code
336	            ------  ----
337	            A       10
338	            B       0
339	            C       110
340	            D       111

342	         I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
343	         lexicographically consecutive.

345	         Given these rules, we can define the Huffman code for an alphabet
346	         just by giving the bit lengths of the codes for each symbol of
347	         the alphabet in order; this is sufficient to determine the
348	         actual codes.  In our example, the code is completely defined
349	         by the sequence of bit lengths (2, 1, 3, 3).  The following
350	         algorithm generates the codes as integers, intended to be read
351	         from most- to least-significant bit.  The code lengths are
352	         initially in tree[I].Len; the codes are produced in
353	         tree[I].Code.

356	         1)  Count the number of codes for each code length.  Let
357	         bl_count[N] be the number of codes of length N, N >= 1.

359	         2)  Find the numerical value of the smallest code for each code
360	         length:

362	                code = 0;
363	                bl_count[0] = 0;
364	                for (bits = 1; bits <= MAX_BITS; bits++) {
365	                    code = (code + bl_count[bits-1]) << 1;
366	                    next_code[bits] = code;
367	                }

369	         3)  Assign numerical values to all codes, using consecutive
370	         values for all codes of the same length with the base values
371	         determined at step 2. Codes that are never used (which have a
372	         bit length of zero) must not be assigned a value.

374	                for (n = 0;  n <= max_code; n++) {
375	                    len = tree[n].Len;
376	                    if (len != 0) {
377	                        tree[n].Code = next_code[len];
378	                        next_code[len]++;
379	                    }
380	                }

382	         Example:

384	         Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
385	         3, 2, 4, 4).  After step 1, we have:

387	            N      bl_count[N]
388	            -      -----------
389	            2      1
390	            3      5
391	            4      2

393	         Step 2 computes the following next_code values:

395	            N      next_code[N]
396	            -      ------------
397	            1      0
398	            2      0
399	            3      2
400	            4      14

402	         Step 3 produces the following code values:

405	            Symbol Length   Code
406	            ------ ------   ----
407	            A       3        010
408	            B       3        011
409	            C       3        100
410	            D       3        101
411	            E       3        110
412	            F       2         00
413	            G       4       1110
414	            H       4       1111
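
	         Steps 1 through 3 can be combined into one self-contained C
	         function.  This is a sketch, not a reference implementation:
	         the function name build_codes and the use of plain arrays in
	         place of tree[].Len and tree[].Code are assumptions, as is
	         MAX_BITS = 15, chosen because the code length alphabet in
	         Paragraph 3.2.7 covers lengths 0 - 15.

```c
#include <assert.h>

#define MAX_BITS 15   /* assumed maximum code length */

/* lens[i] is the code length of symbol i (0 = unused); codes[i] receives
   the code value, to be read from most- to least-significant bit.
   Entries for unused symbols are left untouched. */
static void build_codes(const unsigned char *lens, int nsyms,
                        unsigned short *codes)
{
    unsigned bl_count[MAX_BITS + 1] = { 0 };
    unsigned next_code[MAX_BITS + 1] = { 0 };
    unsigned code = 0;

    /* Step 1: count the number of codes of each length. */
    for (int i = 0; i < nsyms; i++) {
        assert(lens[i] <= MAX_BITS);
        bl_count[lens[i]]++;
    }
    bl_count[0] = 0;

    /* Step 2: find the smallest code value for each length. */
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: assign consecutive values within each length. */
    for (int i = 0; i < nsyms; i++) {
        if (lens[i] != 0)
            codes[i] = (unsigned short)next_code[lens[i]]++;
    }
}
```

	         Called with the lengths (3, 3, 3, 3, 3, 2, 4, 4), the sketch
	         produces the values 2, 3, 4, 5, 6, 0, 14, 15, which are the
	         codes in the table above read as integers.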

416	      3.2.3. Details of block format

418	         Each block of compressed data begins with 3 header bits
419	         containing the following data:

421	            first bit       BFINAL
422	            next 2 bits     BTYPE

424	         Note that the header bits do not necessarily begin on a byte
425	         boundary, since a block does not necessarily occupy an integral
426	         number of bytes.

428	         BFINAL is set iff this is the last block of the data set.

430	         BTYPE specifies how the data are compressed, as follows:

432	            00 - no compression
433	            01 - compressed with fixed Huffman codes
434	            10 - compressed with dynamic Huffman codes
435	            11 - reserved (error)

437	         The only difference between the two compressed cases is how the
438	         Huffman codes for the literal/length and distance alphabets are
439	         defined.

441	         In all cases, the decoding algorithm for the actual data is as
442	         follows:

445	            do
446	               read block header from input stream.
447	               if stored with no compression
448	                  skip any remaining bits in current partially
449	                     processed byte
450	                  read LEN and NLEN (see next section)
451	                  copy LEN bytes of data to output
452	               otherwise
453	                  if compressed with dynamic Huffman codes
454	                     read representation of code trees (see
455	                        subsection below)
456	                  loop (until end of block code recognized)
457	                     decode literal/length value from input stream
458	                     if value < 256
459	                        copy value (literal byte) to output stream
460	                     otherwise
461	                        if value = end of block (256)
462	                           break from loop
463	                        otherwise (value = 257..285)
464	                           decode distance from input stream

466	                           move backwards distance bytes in the output
467	                           stream, and copy length bytes from this
468	                           position to the output stream.
469	                  end loop
470	            while not last block

472	         Note that a duplicated string reference may refer to a string
473	         in a previous block; i.e., the backward distance may cross one
474	         or more block boundaries.  However, a distance cannot refer past
475	         the beginning of the output stream.  (An application using a
476	         preset dictionary might discard part of the output stream; a
477	         distance can refer to that part of the output stream anyway.)
478	         Note also that the referenced string may overlap the current
479	         position; for example, if the last 2 bytes decoded have values
480	         X and Y, a string reference with <length = 5, distance = 2>
481	         adds X,Y,X,Y,X to the output stream.
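
	         The overlapping case works naturally if the copy proceeds one
	         byte at a time from the lowest address, as in this sketch.
	         The function name copy_match and the flat output buffer are
	         assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Append `length` bytes to out, each copied from `distance` bytes back.
   `pos` is the number of bytes already in out; returns the new count.
   The caller must ensure out has room for pos + length bytes. */
static size_t copy_match(unsigned char *out, size_t pos,
                         size_t length, size_t distance)
{
    /* A distance cannot refer past the beginning of the output. */
    assert(distance >= 1 && distance <= pos);
    for (size_t i = 0; i < length; i++) {
        out[pos] = out[pos - distance];  /* may read a byte just written */
        pos++;
    }
    return pos;
}
```

	         A block copy such as memcpy or memmove is not a drop-in
	         replacement here, because when length exceeds distance the
	         later source bytes are produced by the copy itself.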

483	         We now specify each compression method in turn.

485	      3.2.4. Non-compressed blocks (BTYPE=00)

487	         Any bits of input up to the next byte boundary are ignored.
488	         The rest of the block consists of the following information:

490	              0   1   2   3   4...
491	            +---+---+---+---+=================================+
492	            |  LEN  | NLEN  |... LEN bytes of literal data...|
493	            +---+---+---+---+=================================+

495	         LEN is the number of data bytes in the block.  NLEN is the
496	         one's complement of LEN.

499	      3.2.5. Compressed blocks (length and distance codes)

501	         As noted above, encoded data blocks in the 'deflate' format
502	         consist of sequences of symbols drawn from three conceptually
503	         distinct alphabets: either literal bytes, from the alphabet of
504	         byte values (0..255), or <length, backward distance> pairs,
505	         where the length is drawn from (3..258) and the distance is
506	         drawn from (1..32,768).  In fact, the literal and length
507	         alphabets are merged into a single alphabet (0..285), where
508	         values 0..255 represent literal bytes, the value 256 indicates
509	         end-of-block, and values 257..285 represent length codes
510	         (possibly in conjunction with extra bits following the symbol
511	         code) as follows:

513	                 Extra               Extra               Extra
514	            Code Bits Length(s) Code Bits Length(s) Code Bits Length(s)
515	            ---- ---- ------     ---- ---- -------   ---- ---- -------
516	             257   0     3       267   1   15,16     277   4   67-82
517	             258   0     4       268   1   17,18     278   4   83-98
518	             259   0     5       269   2   19-22     279   4   99-114
519	             260   0     6       270   2   23-26     280   4  115-130
520	             261   0     7       271   2   27-30     281   5  131-162
521	             262   0     8       272   2   31-34     282   5  163-194
522	             263   0     9       273   3   35-42     283   5  195-226
523	             264   0    10       274   3   43-50     284   5  227-257
524	             265   1  11,12      275   3   51-58     285   0    258
525	             266   1  13,14      276   3   59-66

527	         The extra bits should be interpreted as a machine integer
528	         stored with the most-significant bit first, e.g., bits 1110
529	         represent the value 14.

531	                  Extra           Extra               Extra
532	             Code Bits Dist  Code Bits   Dist     Code Bits Distance
533	             ---- ---- ----  ---- ----  ------    ---- ---- --------
534	               0   0    1     10   4     33-48    20    9   1025-1536
535	               1   0    2     11   4     49-64    21    9   1537-2048
536	               2   0    3     12   5     65-96    22   10   2049-3072
537	               3   0    4     13   5     97-128   23   10   3073-4096
538	               4   1   5,6    14   6    129-192   24   11   4097-6144
539	               5   1   7,8    15   6    193-256   25   11   6145-8192
540	               6   2   9-12   16   7    257-384   26   12  8193-12288
541	               7   2  13-16   17   7    385-512   27   12 12289-16384
542	               8   3  17-24   18   8    513-768   28   13 16385-24576
543	               9   3  25-32   19   8   769-1024   29   13 24577-32768
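
	         Both tables reduce to a base value plus the value of the extra
	         bits.  The sketch below transcribes the first value of each
	         range and the extra-bit counts from the tables above; the
	         array and function names are hypothetical, and the extra-bits
	         value is assumed to have been read from the input already.

```c
#include <assert.h>

/* First length of each range for codes 257..285, and extra-bit counts. */
static const unsigned short length_base[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};
static const unsigned char length_extra[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
    3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0
};

/* First distance of each range for codes 0..29, and extra-bit counts. */
static const unsigned short dist_base[30] = {
    1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
    257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
    8193, 12289, 16385, 24577
};
static const unsigned char dist_extra[30] = {
    0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,
    7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13
};

/* Length for literal/length symbol 257..285 and its extra-bits value. */
static unsigned match_length(unsigned symbol, unsigned extra)
{
    assert(symbol >= 257 && symbol <= 285);
    assert(extra < (1u << length_extra[symbol - 257]));
    return length_base[symbol - 257] + extra;
}

/* Distance for distance code 0..29 and its extra-bits value. */
static unsigned match_distance(unsigned code, unsigned extra)
{
    assert(code <= 29);
    assert(extra < (1u << dist_extra[code]));
    return dist_base[code] + extra;
}
```

	         Note that the table assigns code 284 the range 227-257 only,
	         with length 258 carried by code 285; this sketch does not
	         reject code 284 with an extra-bits value of 31.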

545	      3.2.6. Compression with fixed Huffman codes (BTYPE=01)

547	         The Huffman codes for the two alphabets are fixed, and are not
548	         represented explicitly in the data.  The Huffman code lengths
549	         for the literal/length alphabet are:

552	                   Lit Value    Bits        Codes
553	                   ---------    ----        -----
554	                     0 - 143     8          00110000 through
555	                                            10111111
556	                   144 - 255     9          110010000 through
557	                                            111111111
558	                   256 - 279     7          0000000 through
559	                                            0010111
560	                   280 - 287     8          11000000 through
561	                                            11000111

563	         The code lengths are sufficient to generate the actual codes,
564	         as described above; we show the codes in the table for added
565	         clarity.  Literal/length values 286-287 will never actually
566	         occur in the compressed data, but participate in the code
567	         construction.
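As a check on the table above, the fixed codes can be regenerated from
the code lengths alone. The sketch below assumes the usual canonical
assignment (count the codes of each length, compute the first code of
each length, then assign consecutive values in symbol order); the
function name canonical_codes is hypothetical.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes; returns {symbol: bit string}."""
    max_len = max(lengths)
    # Count how many symbols use each code length (length 0 = unused).
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    # Compute the numerically smallest code for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Hand out consecutive codes within each length, in symbol order.
    codes = {}
    for symbol, n in enumerate(lengths):
        if n:
            codes[symbol] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes


# Code lengths of the fixed literal/length alphabet (symbols 0-287).
FIXED_LENGTHS = [8] * 144 + [9] * 112 + [7] * 24 + [8] * 8
FIXED_CODES = canonical_codes(FIXED_LENGTHS)
```

FIXED_CODES[0] and FIXED_CODES[256], for instance, can be compared
against the first entries of the 8-bit and 7-bit rows of the table.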

569	         Distance codes 0-31 are represented by (fixed-length) 5-bit
570	         codes, with possible additional bits as shown in the table in
571	         Paragraph 3.2.5, above.  Note that distance codes 30-31 will
572	         never actually occur in the compressed data.

574	      3.2.7. Compression with dynamic Huffman codes (BTYPE=10)

576	         The Huffman codes for the two alphabets appear in the block
577	         immediately after the header bits and before the actual
578	         compressed data, first the literal/length code and then the
579	         distance code.  Each code is defined by a sequence of code
580	         lengths, as discussed in Paragraph 3.2.2, above.  For even
581	         greater compactness, the code length sequences themselves are
582	         compressed using a Huffman code.  The alphabet for code lengths
583	         is as follows:

585	               0 - 15: Represent code lengths of 0 - 15
586	                   16: Copy the previous code length 3 - 6 times.
587	                       The next 2 bits indicate repeat length
588	                             (0 = 3, ... , 3 = 6)
589	                          Example:  Codes 8, 16 (+2 bits 11),
590	                                    16 (+2 bits 10) will expand to
591	                                    12 code lengths of 8 (1 + 6 + 5)
592	                   17: Repeat a code length of 0 for 3 - 10 times.
593	                       (3 bits of length)
594	                   18: Repeat a code length of 0 for 11 - 138 times.
595	                       (7 bits of length)
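The code length alphabet above can be expanded as sketched below. The
function name expand_code_lengths is hypothetical, and the input is
assumed to be a list of (symbol, extra_value) pairs that a Huffman
decoder has already produced; extra_value is ignored for symbols 0-15.

```python
def expand_code_lengths(symbols):
    """Expand code length symbols 0-18 into a flat list of code lengths."""
    lengths = []
    for symbol, extra in symbols:
        if 0 <= symbol <= 15:
            lengths.append(symbol)            # a literal code length
        elif symbol == 16:
            if not lengths:
                raise ValueError("symbol 16 has no previous code length")
            if not 0 <= extra <= 3:
                raise ValueError("symbol 16 takes 2 extra bits")
            lengths.extend([lengths[-1]] * (3 + extra))   # 3 - 6 copies
        elif symbol == 17:
            if not 0 <= extra <= 7:
                raise ValueError("symbol 17 takes 3 extra bits")
            lengths.extend([0] * (3 + extra))             # 3 - 10 zeros
        elif symbol == 18:
            if not 0 <= extra <= 127:
                raise ValueError("symbol 18 takes 7 extra bits")
            lengths.extend([0] * (11 + extra))            # 11 - 138 zeros
        else:
            raise ValueError("invalid code length symbol: %r" % (symbol,))
    return lengths
```

The example from the list, codes 8, 16 (+2 bits 11), 16 (+2 bits 10),
corresponds to expand_code_lengths([(8, 0), (16, 3), (16, 2)]).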

597	         A code length of 0 indicates that the corresponding symbol in
598	         the literal/length or distance alphabet will not occur in the
599	         block, and should not participate in the Huffman code
600	         construction algorithm given earlier.  If only one distance
601	         code is used, it is encoded using one bit, not zero bits; in
602	         this case there is a single code length of one, with one unused
603	         code.  One distance code of zero bits means that there are no
606	         distance codes used at all (the data is all literals).

608	         We can now define the format of the block:

610	               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
611	               5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
612	               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)

614	               (HCLEN + 4) x 3 bits: code lengths for the code length
615	                  alphabet given just above, in the order: 16, 17, 18,
616	                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

618	                  These code lengths are interpreted as 3-bit integers
619	                  (0-7); as above, a code length of 0 means the
620	                  corresponding symbol (literal/length or distance code
621	                  length) is not used.

623	               HLIT + 257 code lengths for the literal/length alphabet,
624	                  encoded using the code length Huffman code

626	               HDIST + 1 code lengths for the distance alphabet,
627	                  encoded using the code length Huffman code

629	               The actual compressed data of the block,
630	                  encoded using the literal/length and distance Huffman
631	                  codes

633	               The literal/length symbol 256 (end of data),
634	                  encoded using the literal/length Huffman code
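The first fields of this block format could be read as in the sketch
below. BitReader and read_dynamic_header are hypothetical names. The
sketch assumes that header fields are packed starting from the
least-significant bit of each byte, and that the reader is already
positioned just after the block header bits.

```python
# Order in which the code length code lengths appear in the block.
CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]


class BitReader:
    """Reads bit fields from bytes, least-significant bit first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # absolute bit position in the input

    def read(self, nbits):
        value = 0
        for i in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (self.pos & 7)) & 1
            value |= bit << i
            self.pos += 1
        return value


def read_dynamic_header(reader):
    """Return (literal/length count, distance count, code length code lengths)."""
    n_lit = reader.read(5) + 257    # HLIT + 257
    n_dist = reader.read(5) + 1     # HDIST + 1
    n_clen = reader.read(4) + 4     # HCLEN + 4
    cl_lengths = [0] * 19           # symbols not transmitted stay at 0
    for i in range(n_clen):
        cl_lengths[CL_ORDER[i]] = reader.read(3)
    return n_lit, n_dist, cl_lengths
```

The permuted order places the symbols expected to be used most often
first, so that a small HCLEN can omit the trailing, unused ones.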

636	         The code length repeat codes can cross from HLIT + 257 to the
637	         HDIST + 1 code lengths.  In other words, all code lengths form
638	         a single sequence of HLIT + HDIST + 258 values.
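Because the code lengths form one sequence, a decoder can expand all
of them first and split the result afterwards. A minimal sketch,
assuming the lengths have already been expanded into a flat list; the
name split_code_lengths is hypothetical:

```python
def split_code_lengths(all_lengths, hlit, hdist):
    """Split the single code length sequence into the two alphabets.

    hlit and hdist are the raw 5-bit field values from the block.
    """
    n_lit = hlit + 257
    n_dist = hdist + 1
    if len(all_lengths) != n_lit + n_dist:   # HLIT + HDIST + 258 values
        raise ValueError("expected %d code lengths, got %d"
                         % (n_lit + n_dist, len(all_lengths)))
    return all_lengths[:n_lit], all_lengths[n_lit:]


# A run of four 5s (for example a 5 followed by one repeat code) that
# starts in the literal/length part and ends in the distance part.
example = [8] * 255 + [5] * 4 + [0]
lit_lengths, dist_lengths = split_code_lengths(example, 0, 2)
```

Here lit_lengths ends with two 5s and dist_lengths begins with the
remaining two, which is the boundary crossing described above.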

640	   3.3. Compliance

642	      A compressor may limit further the ranges of values specified in
643	      the previous section and still be compliant; for example, it may
644	      limit the range of backward pointers to some value smaller than
645	      32K.  Similarly, a compressor may limit the size of blocks so that
646	      a compressible block fits in memory.

648	      A compliant decompressor must accept the full range of possible
649	      values defined in the previous section, and must accept blocks of
650	      arbitrary size.

653	4. Compression algorithm details

655	   While it is the intent of this document to define the 'deflate'
656	   compressed data format without reference to any particular
657	   compression algorithm, the format is related to the compressed
658	   formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below).
659	   Since many variations of LZ77 are patented, it is strongly
660	   recommended that the implementor of a compressor follow the general
661	   algorithm presented here, which is known not to be patented per se.
662	   The material in this section is not part of the definition of the
663	   specification per se, and a compressor need not follow it in order to
664	   be compliant.

666	   The compressor terminates a block when it determines that starting a
667	   new block with fresh trees would be useful, or when the block fills
668	   the compressor's block buffer.

670	   The compressor uses a chained hash table to find duplicated strings,
671	   using a hash function that operates on 3-byte sequences.  At any
672	   given point during compression, let XYZ be the next 3 input bytes to
673	   be examined (not necessarily all different, of course).  First, the
674	   compressor examines the hash chain for XYZ.  If the chain is empty,
675	   the compressor simply writes out X as a literal byte and advances one
676	   byte in the input.  If the hash chain is not empty, indicating that
677	   the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
678	   same hash function value) has occurred recently, the compressor
679	   compares all strings on the XYZ hash chain with the actual input data
680	   sequence starting at the current point, and selects the longest
681	   match.

683	   The compressor searches the hash chains starting with the most recent
684	   strings, to favor small distances and thus take advantage of the
685	   Huffman encoding.  The hash chains are singly linked. There are no
686	   deletions from the hash chains; the algorithm simply discards matches
687	   that are too old.  To avoid a worst-case situation, very long hash
688	   chains are arbitrarily truncated at a certain length, determined by a
689	   run-time parameter.
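The search described above can be sketched as follows. This is an
illustration only, not the code of any particular compressor. It keys
a dictionary directly on the 3-byte sequence instead of using a hash
function, so hash collisions cannot occur, and it keeps each chain as
a list scanned from its most recent end. All names are hypothetical;
max_chain stands in for the run-time chain-length parameter.

```python
WINDOW = 32768          # largest backward distance
MIN_MATCH = 3
MAX_MATCH = 258


def insert_string(data, pos, chains):
    """Record position pos under the 3 bytes that start there."""
    if pos + MIN_MATCH <= len(data):
        chains.setdefault(data[pos:pos + 3], []).append(pos)


def longest_match(data, pos, chains, max_chain):
    """Return (length, distance) of the best earlier match, or (0, 0)."""
    if pos + MIN_MATCH > len(data):
        return 0, 0
    best_len, best_dist = 0, 0
    limit = min(MAX_MATCH, len(data) - pos)
    candidates = chains.get(data[pos:pos + 3], [])
    # Most recent strings first; look at no more than max_chain of them.
    for cand in reversed(candidates[-max_chain:]):
        if pos - cand > WINDOW:
            break                       # this and all older ones are too old
        n = 0
        while n < limit and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:                # ties keep the smaller distance
            best_len, best_dist = n, pos - cand
    return best_len, best_dist


def tokenize(data, max_chain=32):
    """Greedy parse of bytes into literals (ints) and (length, distance)."""
    chains, tokens, pos = {}, [], 0
    while pos < len(data):
        length, dist = longest_match(data, pos, chains, max_chain)
        if length >= MIN_MATCH:
            tokens.append((length, dist))
            for p in range(pos, pos + length):
                insert_string(data, p, chains)
            pos += length
        else:
            tokens.append(data[pos])
            insert_string(data, pos, chains)
            pos += 1
    return tokens
```

For the input b"abcabcabc" this produces three literals followed by
one match of length 6 at distance 3; the match overlaps the bytes it
is producing, which the format permits.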

691	   To improve overall compression, the compressor optionally defers the
692	   selection of matches ("lazy matching"): after a match of length N has
693	   been found, the compressor searches for a longer match starting at
694	   the next input byte.  If it finds a longer match, it truncates the
695	   previous match to a length of one (thus producing a single literal
696	   byte) and then emits the longer match.  Otherwise, it emits the
697	   original match, and, as described above, advances N bytes before
698	   continuing.
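A minimal sketch of this deferred selection follows. It defers at most
once per match, as the paragraph above describes, and omits the
run-time tuning discussed below. As a simplification it keys a
dictionary on the 3-byte sequence itself rather than on a hash value.
All names are hypothetical.

```python
WINDOW, MIN_MATCH, MAX_MATCH = 32768, 3, 258


def insert_string(data, pos, chains):
    """Record position pos under the 3 bytes that start there."""
    if pos + MIN_MATCH <= len(data):
        chains.setdefault(data[pos:pos + 3], []).append(pos)


def longest_match(data, pos, chains, max_chain):
    """Return (length, distance) of the best earlier match, or (0, 0)."""
    if pos + MIN_MATCH > len(data):
        return 0, 0
    best_len, best_dist = 0, 0
    limit = min(MAX_MATCH, len(data) - pos)
    for cand in reversed(chains.get(data[pos:pos + 3], [])[-max_chain:]):
        if pos - cand > WINDOW:
            break
        n = 0
        while n < limit and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_dist = n, pos - cand
    return best_len, best_dist


def tokenize_lazy(data, max_chain=32):
    """Parse bytes into literals (ints) and (length, distance) pairs."""
    chains, tokens, pos = {}, [], 0
    while pos < len(data):
        length, dist = longest_match(data, pos, chains, max_chain)
        if length < MIN_MATCH:
            tokens.append(data[pos])
            insert_string(data, pos, chains)
            pos += 1
            continue
        # A match of length N was found; look one byte ahead for a longer one.
        insert_string(data, pos, chains)
        first_uninserted = pos + 1
        next_len, next_dist = longest_match(data, pos + 1, chains, max_chain)
        if next_len > length:
            tokens.append(data[pos])    # previous match shrinks to a literal
            pos += 1
            length, dist = next_len, next_dist
        tokens.append((length, dist))
        for p in range(first_uninserted, pos + length):
            insert_string(data, p, chains)
        pos += length
    return tokens
```

On b"abcXbcdeYabcde" a greedy parse would take the 3-byte match "abc"
at the second "a"; this sketch instead emits "a" as a literal and then
the 4-byte match "bcde".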

700	   Run-time parameters also control this "lazy match" procedure.  If
701	   compression ratio is most important, the compressor attempts a
702	   complete second search regardless of the length of the first match.
703	   In the normal case, if the current match is "long enough", the
704	   compressor reduces the search for a longer match, thus speeding up
707	   the process.  If speed is most important, the compressor inserts new
708	   strings in the hash table only when no match was found, or when the
709	   match is not "too long".  This degrades the compression ratio but
710	   saves time since there are both fewer insertions and fewer searches.

712	5. References

714	   [1] Huffman, D. A., "A Method for the Construction of Minimum
715	       Redundancy Codes", Proceedings of the Institute of Radio
716	       Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.

718	   [2] Ziv, J., and Lempel, A., "A Universal Algorithm for Sequential
719	       Data Compression", IEEE Transactions on Information Theory,
720	       May 1977, Vol. 23, No. 3, pp. 337-343.

722	   [3] Gailly, J.-L., and Adler, M., zlib documentation and sources,
723	       available in ftp.uu.net:/pub/archiving/zip/doc/zlib*

725	   [4] Gailly, J.-L., and Adler, M., gzip documentation and sources,
726	       available in prep.ai.mit.edu:/pub/gnu/gzip-*.tar

728	   [5] Schwartz, E. S., and Kallick, B., "Generating a canonical prefix
729	       encoding", Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.

731	   [6] Hirschberg, D. S., and Lelewer, D. A., "Efficient decoding of
732	       prefix codes", Comm. ACM, 33,4 (Apr. 1990), pp. 449-459.

734	6. Security considerations

736	   Any data compression method involves the reduction of redundancy in
737	   the data.  Consequently, any corruption of the data is likely to have
738	   severe effects and be difficult to correct.  Uncompressed text, on
739	   the other hand, will probably still be readable despite the presence
740	   of some corrupted bytes.

742	   It is recommended that systems using this data format provide some
743	   means of validating the integrity of the compressed data.  See
744	   reference [3], for example.
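One possible way to provide such validation is to carry a checksum of
the uncompressed data next to the compressed stream. The sketch below
uses Python's standard zlib module to produce raw 'deflate' data
(a negative wbits value) and CRC-32 as the check. The function names
pack and unpack are hypothetical, and the choice of CRC-32 is an
assumption made for illustration only.

```python
import zlib


def pack(data):
    """Return (raw deflate stream, CRC-32 of the uncompressed data)."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15: no wrapper
    body = compressor.compress(data) + compressor.flush()
    return body, zlib.crc32(data)


def unpack(body, expected_crc):
    """Decompress a raw deflate stream and verify its checksum."""
    data = zlib.decompress(body, -15)
    if zlib.crc32(data) != expected_crc:
        raise ValueError("integrity check failed")
    return data
```

A caller would store or transmit the checksum together with the
compressed stream and reject the data if unpack raises an error.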

746	7. Source code

748	   Source code for a C language implementation of a 'deflate' compliant
749	   compressor and decompressor is available within the zlib package at
750	   ftp.uu.net:/pub/archiving/zip/zlib/zlib*.

753	8. Acknowledgements

755	   Trademarks cited in this document are the property of their
756	   respective owners.

758	   Phil Katz designed the deflate format.  Jean-loup Gailly and Mark
759	   Adler wrote the related software described in this specification.
760	   Glenn Randers-Pehrson converted this document to Internet Draft and
761	   HTML format.

763	9. Author's address

765	   L. Peter Deutsch

767	      Aladdin Enterprises
768	      203 Santa Margarita Ave.
769	      Menlo Park, CA 94025

771	      Phone: (415) 322-0103 (AM only)
772	      FAX:   (415) 322-1734
773	      EMail: <ghost@aladdin.com>

775	   Questions about the technical content of this specification can be
776	   sent by email to

778	      Jean-loup Gailly <gzip@prep.ai.mit.edu> and
779	      Mark Adler <madler@alumni.caltech.edu>

781	   Editorial comments on this specification can be sent by email to

783	      L. Peter Deutsch <ghost@aladdin.com> and
784	      Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
