Network Working Group Richard Price, Siemens/Roke Manor INTERNET-DRAFT Jonathan Rosenberg, dynamicsoft Expires: May 2002 Abigail Surtees, Siemens/Roke Manor Mark A West, Siemens/Roke Manor Lawrence Conroy, Siemens/Roke Manor 14 November, 2001 Universal Decompression Algorithm Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of [RFC-2026]. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This document is a submission to the IETF ROHC WG. Comments should be directed to the mailing list of ROHC, rohc@cdt.luth.se. Abstract This specification defines a "universal decompressor" capable of interoperating with a wide range of compression algorithms. Using the basic techniques of Huffman compression and LZ77-style string substitution, the decompressor can be configured to understand the output of many well-known compressors including [DEFLATE] and [LZW]. Price et al. [PAGE 1] INTERNET-DRAFT Universal Decompressor 14 November, 2001 Revision history Changes from New COPY-LITERAL and COPY-OFFSET tokens added to reduce complexity for LZ77-based algorithms. COMPARE token modified to allow it to be used without a SWITCH token. HUFFMAN token modified to require fewer parameters. Support added for literal parameters (see Section 4.3). Example token sets added for decompression of LZ77, [DEFLATE] and [LZW] compressed messages. Table of contents 1. Introduction.................................................3 2. Terminology..................................................3 3. Requirements on underlying transport protocol................3 4. Description of the universal decompressor....................4 4.1. Structure of universal decompressor dictionary.............5 4.2. Important entries in the byte buffer.......................5 4.3. Token parameters...........................................5 4.4. Decompressor actions upon receiving a compressed message...6 4.5. Decompression failure......................................7 5. Library of tokens............................................8 5.1. COPY.......................................................9 5.2. ADD / SUBTRACT.............................................10 5.3. LSHIFT / RSHIFT............................................10 5.4. COMPARE....................................................11 5.5. SWITCH.....................................................11 5.6. CALL / RETURN..............................................11 5.7. HUFFMAN....................................................12 6. Security considerations......................................14 7. References...................................................15 8. Authors' addresses...........................................15 A. Example sets of tokens.......................................16 A.1. Mnemonic language..........................................16 A.2. Example token set for simple LZ77 decompression............17 A.3. Example token set for DEFLATE decompression................17 A.4. Example token set for LZW decompression....................20 Price et al. [PAGE 2] INTERNET-DRAFT Universal Decompressor 14 November, 2001 1. Introduction This draft introduces the concept of a "universal decompressor". The goal of the document is to standardize a decompressor capable of interoperating with a wide range of compression algorithms. Consequently this draft describes the decompressor operation only, i.e. the actions which the decompressor takes upon receiving a certain instruction from the compressor. 2. Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC-2119]. Byte buffer The universal decompressor maintains a byte buffer containing any previously received text strings that might be useful for future compression. Token A token is an instruction transmitted from the compressor to the decompressor. 3. Requirements on underlying transport protocol The universal decompressor takes as its input a sequence of compressed messages, which are processed and then outputted as a sequence of uncompressed messages. This chapter lists the requirements on the transport protocol used to carry compressed messages to the universal decompressor. Note that the universal decompressor outputs an uncompressed message when it encounters an explicit end-of-message character. Consequently there is no need for one uncompressed message to correspond to exactly one compressed message. Two or more compressed messages can be sent to reconstruct a single uncompressed message, which is very useful for segmenting compressed messages that are larger than the MTU of the transport protocol. Conversely, one compressed message can be sent to reconstruct several uncompressed messages. In particular, messages can be successfully decompressed even when the transport protocol provides data as a byte stream with no framing. Note however that the universal decompressor can make use of previously received messages to improve the overall compression ratio (see [V-J] for an example of a compression algorithm which uses previously received messages in this manner). Price et al. [PAGE 3] INTERNET-DRAFT Universal Decompressor 14 November, 2001 Consequently, the main requirement on the underlying protocol is that it MUST ensure that the contents of the decompressor byte buffer are as expected by the next message to be decompressed. This requirement can be supported in a number of ways. If the underlying transport protocol is reliable (for example TCP or [SCTP]) then the compressed messages are provided to the universal decompressor in the correct order and free from bit errors. In this case the byte buffer is automatically updated correctly between messages. If the underlying transport protocol is unreliable (for example UDP) then the byte buffer MUST be reset to a known state between compressed messages. This ensures that messages are not compressed relative to text that has not yet been received by the universal decompressor. The "known state" of the byte buffer can be negotiated a-priori: for example a set of tokens can be downloaded from the compressor to the decompressor when the universal decompressor is being set up. All messages are then compressed relative to these tokens. Alternatively the transport protocol can indicate to the decompressor how the byte buffer should be set up before each message is decompressed. 4. Description of the universal decompressor An important feature of the universal decompressor is that it can interoperate with a wide range of compression algorithms. The precise method for compression is left as an implementation decision, and in fact the standard decompressor can interoperate with any of the following classes of algorithm: * Generic text compressor (for example [DEFLATE] or a similar algorithm). * Protocol-aware compressor offering excellent performance for one type of text message (for example the text messages generated by [SIP]). * Hybrid compressor with similar performance to [DEFLATE] for generic text strings and superior performance for certain types of text message. The choice of which tokens to send to the decompressor is left as a local implementation decision at the compressor. The only requirement is that of transparency, i.e. the compressor MUST NOT send tokens which cause the decompressor to incorrectly decompress a given message. Note however that it is perfectly acceptable for the compressor to send tokens which update the byte buffer at the decompressor, but which cause no decompressed message to be outputted. Indeed, this is a useful technique for pre-populating the dictionary with well-known text strings. Price et al. [PAGE 4] INTERNET-DRAFT Universal Decompressor 14 November, 2001 4.1. Structure of universal decompressor dictionary The universal decompressor dictionary consists of a simple byte buffer designed to hold the current uncompressed message, the current compressed message, and any other previously received text strings that might be useful for future compression. The size buffer_size of the byte buffer can be negotiated by an externally defined mechanism (e.g. by the underlying protocol used to transport the compressed messages). Entries in the byte buffer are referred to as buffer[n] where 0 =< n < buffer_size. As all of the tokens currently use 2-byte indices into the byte buffer, the maximum size of the buffer is 64K. 4.2. Important entries in the byte buffer The first few bytes in the byte buffer are used to store some important 2-byte integers. These integers are given the following names: Position in buffer: Name: 0 - 1 first_token 2 - 3 uncompressed_start 4 - 5 uncompressed_end 6 - 7 circular_buffer 8 - 9 compressed_start 10 - 11 compressed_end 12 - 13 compressed_pointer 14 - 15 stack_free 16 - 17 stack[0] 18 - 19 stack[1] 20 - 21 stack[2] : : The MSBs of the integer are always stored before the LSBs. So, for example, the MSBs of first_token are stored in buffer[0] whilst the LSBs are stored in buffer[1]. The use of each integer is described in the following sections of the draft. 4.3. Token parameters Each of the tokens is followed by 0 or more bytes containing the parameters required by the token. At present all parameters are stored as 2-byte integers with MSBs stored in the byte preceding the LSBs in the byte buffer. The most significant bit of the 2-byte integer has a special meaning: it is used to determine whether the parameter is a literal value or an index pointing to a literal value. Price et al. [PAGE 5] INTERNET-DRAFT Universal Decompressor 14 November, 2001 If the most significant bit is 0 then the 2-byte integer is interpreted as a literal value from 0 to 32767. If the most significant bit is 1 then the 2-byte integer is interpreted as an index from 0 to 32767 (the most significant bit itself is ignored). This index points to the location in the buffer containing the literal value of the parameter (MSBs stored in buffer[index] and LSBs stored in buffer[index + 1]). If the index references a location beyond the size of the byte buffer then a bad compressed message has been received and decompression failure occurs (see Section 4.5.). 4.4. Decompressor actions upon receiving a compressed message When the universal decompressor is initialized all entries in the byte buffer are set to 0. Upon receiving a compressed message, the decompressor strips off the underlying protocol header and then performs the following actions: 1.) The message is copied directly into the byte buffer beginning at the byte specified in compressed_start. The underlying protocol MUST NOT pass a compressed message of more than 1460 bytes to the universal decompressor. If a larger compressed message is received, the underlying protocol passes only the first 1460 bytes to the decompressor, and provides additional working memory to store the bytes that are not currently being decompressed. The remainder of the compressed message is passed to the decompressor in blocks of 1460 bytes (or less if it is the last block in the compressed message). The decompressor MUST NOT concatenate two messages to form a single compressed message. This is because compressed messages are typically padded with trailing zero bits so that they are a whole number of bytes long. Concatenating two messages would cause these padding bits to be incorrectly interpreted as compressed data. Note that the buffer is circular, so once a byte is copied into buffer[buffer_size - 1], the next byte is copied into buffer[circular_buffer]. The parameter circular_buffer (see Section 4.2) can be set to prevent the first part of the buffer from being overwritten by new messages. Typically this area of the buffer is used to hold important tokens and text strings that should be kept from one compressed message to the next. If circular_buffer lies beyond the size of the byte buffer then decompression failure occurs (see Section 4.5). After the message has been copied into the byte buffer, the position of the last byte in the compressed message is copied into compressed_end. Also, the value in compressed_start is copied into compressed_pointer. 2.) Next, the tokens contained within the byte buffer are executed beginning at the byte specified in first_token. The tokens are Price et al. [PAGE 6] INTERNET-DRAFT Universal Decompressor 14 November, 2001 executed consecutively unless indicated explicitly (for example when the decompressor encounters a SWITCH token). If the next token to be executed lies outside the byte buffer then decompression failure occurs (see Section 4.5). 3.) The decompressor stops token execution when it reaches buffer[0] or buffer[1]. Depending on which buffer entry is reached, the following actions are then taken: If the decompressor reaches buffer[0], instead of executing the token contained within buffer[0] it stops token execution and outputs the uncompressed message. The location of the uncompressed message is from uncompressed_start up to, but not including uncompressed_end. After the uncompressed message has been outputted, the value in uncompressed_end is then copied into uncompressed_start. If uncompressed_start = uncompressed_end then no uncompressed message is outputted. As stated before the buffer is circular, so once a byte is copied from buffer[buffer_size - 1], the next byte is copied from buffer[circular_buffer]. If either uncompressed_start or uncompressed_end lie outside the circular buffer then decompression failure occurs. If buffer[1] is reached then token execution stops but no uncompressed message is outputted. When the next compressed message becomes available, the universal decompressor continues at Step 1.) above. Note that if the underlying transport protocol does not provide reliable, in-order message delivery then the contents of the byte buffer MUST be reset to the state expected by the next compressed message. This state can be negotiated a-priori, or the transport protocol can indicate to the decompressor how the byte buffer should be set up before the next message is decompressed. 4.5. Decompression failure If the compressed messages received by the decompressor are corrupted (either accidentally or maliciously) then one of three possibilities might occur: * A decompressed message is outputted that is incorrect. * A token is encountered that cannot be processed successfully by the decompressor (for example a RETURN token when no CALL token has previously been encountered). * The decompressor never finishes decompressing a message. To counter the first possibility the underlying protocol SHOULD include a checksum to verify either the compressed message or the uncompressed message. If a message fails the checksum then "decompression failure" has occurred. The decompressor does not Price et al. [PAGE 7] INTERNET-DRAFT Universal Decompressor 14 November, 2001 output an uncompressed message, and ignores any future compressed message until the byte buffer is reset. If a token is encountered that cannot be successfully processed then decompression failure occurs automatically. To counter the third possibility, decompression failure SHOULD also occur after a certain number of tokens have been processed for a given compressed message. The maximum number of tokens to process is currently left as an implementation decision (but might in future be negotiated). 5. Library of tokens The universal decompressor currently understands twelve types of token, chosen to support the widest possible range of compression algorithms with the minimum possible overhead. All tokens are stored as a single byte to indicate the token type, followed by 0 or more bytes containing the parameters required by the token. At present all parameters are 2-byte integers with MSBs stored before LSBs. For example, the COPY token is followed by three parameters as shown below: COPY (position, length, destination) In the byte buffer a COPY token is stored as the following 7 bytes: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0 0 0 0 0 0 0 0| Position MSB | Position LSB | Length MSB | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Length LSB |Destination MSB|Destination LSB| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Twelve token types are currently available to the universal decompressor. The following table lists the different token types and the byte values used to transmit the tokens to the decompressor: Token type: Corresponding byte value: COPY 0 COPY-LITERAL 1 COPY-OFFSET 2 ADD 3 SUBTRACT 4 LSHIFT 5 RSHIFT 6 COMPARE 7 SWITCH 8 CALL 9 RETURN 10 HUFFMAN 11 Price et al. [PAGE 8] INTERNET-DRAFT Universal Decompressor 14 November, 2001 Each token is explained in more detail below: 5.1. COPY The COPY token instructs the decompressor to copy a string of bytes from one part of the byte buffer to another. A COPY token is stored in the byte buffer as 7 consecutive bytes as follows: COPY (position, length, destination) As with all tokens currently defined, the COPY token translates into a single byte for the token type followed by 2 bytes for each of the parameters. The meaning of the three parameters is explained below: Position: 2-byte integer indicating the location of the first byte in the string to be copied. Length: 2-byte integer indicating the number of bytes to be copied. Destination: 2-byte integer indicating the location to which the first byte in the string will be copied. Note that the copying function is performed on a byte-by-byte basis, with the position parameter indicating the first byte to be copied. In particular, some of the later bytes to be copied may themselves have been written into the byte buffer by the COPY token currently being executed. Equally, it is possible for a COPY token to overwrite itself or its parameters. If this occurs then the COPY token MUST continue to execute as if the parameters were still in place in the byte buffer. If the source or destination of the next byte to be copied is larger than (buffer_size - 1) then sufficient multiples of (buffer_size - circular_buffer) are subtracted until it is not. So for example, once a byte is copied into buffer[buffer_size - 1] the next byte is copied into buffer[circular_buffer]. If circular_buffer equals or exceeds buffer_size then a bad compressed message has been received and decompression failure occurs (see Section 4.5). A modified version of the COPY token is given below: COPY-LITERAL (position, length, destination) The COPY-LITERAL token is identical to COPY except that after copying, the destination parameter is replaced with the value (destination + length). If the destination parameter is a literal value then it is updated directly. If the destination parameter is an index then the literal value it references is updated instead. Price et al. [PAGE 9] INTERNET-DRAFT Universal Decompressor 14 November, 2001 As above, if (destination + length) is larger than (buffer_size - 1) then sufficient multiples of (buffer_size - circular_buffer) are subtracted until it is not. A further version of the COPY-LITERAL token is given below: COPY-OFFSET (offset, length, destination) The COPY-OFFSET token is identical to COPY-LITERAL except that an offset parameter is given instead of a position parameter. The two parameters are related by position = (destination - offset). If (destination - offset) does not lie between circular_buffer and (buffer_size - 1) inclusive then sufficient multiples of (buffer_size - circular_buffer) are added or subtracted until it does. 5.2. ADD / SUBTRACT The ADD token instructs the decompressor to add two 2-byte integers (addition performed modulo 2^16) and to store the result in the location of the first parameter. The format of the ADD token is given below: ADD (parameter_1, parameter_2) Note that as per Section 4.3, depending on how the MSB is set the parameters can be interpreted as literal values or indices to literal values. If parameter_1 is a literal value then it is overwritten with the result of the addition. If parameter_1 is an index to a literal value then the literal value is overwritten (not parameter_1 itself). The SUBTRACT token is the same as the ADD token except that the second integer is subtracted from the first (subtraction performed modulo 2^16): SUBTRACT (parameter_1, parameter_2) 5.3. LSHIFT / RSHIFT The LSHIFT token instructs the decompressor to left shift a 2-byte value by the specified number of bits: LSHIFT (parameter, no_of_bits) The value to be shifted is stored in parameter, and the number of bits to shift is stored in no_of_bits. As usual both can be stored either as a literal value or as an index to a literal value. If the value of no_of_bits does not lie between 0 and 15 inclusive then a bad compressed message has been received and decompression failure occurs. Price et al. [PAGE 10] INTERNET-DRAFT Universal Decompressor 14 November, 2001 The RSHIFT token is the same as the LSHIFT token except that the bits are right shifted: RSHIFT (parameter, no_of_bits) 5.4. COMPARE The COMPARE token instructs the decompressor to compare two 2-byte values and then to jump to one of three specified indices depending on the result. COMPARE (parameter_1, parameter_2, index_1, index_2, index_3) If the parameter_1 < parameter_2 then the decompressor continues token execution at the byte position specified by index 1. If parameter_1 = parameter_2 then it jumps to index_2. If parameter_1 > parameter_2 then it jumps to index_3. If an index is specified which is beyond the size of the byte buffer, a bad compressed message has been received and decompression failure occurs. 5.5. SWITCH The SWITCH token performs a conditional jump based on the value of its first parameter. SWITCH (j, index_0, index_1, ... , index_n-1) When a SWITCH token is encountered the decompressor reads the value of j. It then continues token execution at the byte position specified by index j. If j specifies an index which is beyond the size of the byte buffer, a bad compressed message has been received and decompression failure occurs. 5.6. CALL / RETURN The CALL and RETURN tokens provide support for compression algorithms with a nested structure. CALL (index) RETURN When the decompressor reaches a CALL token, it finds the byte position of the token immediately following the CALL token and copies this 2-byte integer into stack[stack_free] ready for later retrieval. It then increases stack_free by 1 and continues token execution at the byte position specified by index. Price et al. [PAGE 11] INTERNET-DRAFT Universal Decompressor 14 November, 2001 When the decompressor reaches a RETURN token it decreases stack_free by 1, and then continues token execution at the byte position stored in stack[stack_free]. If stack_free ever becomes more than buffer_size - 1 or less than 0 then a bad compressed message has been received and decompression failure occurs (see Section 4.5). 5.7. HUFFMAN The HUFFMAN token maps a shorthand Huffman code onto its uncompressed equivalent. The format of a HUFFMAN token is as follows: HUFFMAN (position, bit_offset, destination, n, bits_1, uncomp_1, code_1, bits_2, uncomp_2, code_2, ... , bits_n-1, uncomp_n-1, code_n-1, bits_n, uncomp_n) The HUFFMAN token is followed by four mandatory parameters plus n additional sets of parameters. Every set contains three parameters except the nth set, which contains two. The nth set of parameters omits the parameter code_n, as the decompressor can always work out what the correct value of code_n should be. The meaning of the four mandatory parameters is explained below: Position: Indicates the byte location of the Huffman code to be decompressed. Bit Offset: Indicates the bit offset at which the Huffman code begins within the byte specified above. Destination: Indicates the location to which the uncompressed value will be copied. n: Indicates the number of additional sets of parameters that follow. If n = 0 then the HUFFMAN token is ignored by the decompressor. The remaining parameters specify the actual Huffman codes and their uncompressed equivalents. Each set of 3 parameters specifies a block of Huffman codes with the same length and (when treated as integers) with values that increase by 1 for each additional Huffman code. The precise meaning of the parameters is given below: bits_j: Indicates the additional length in bits of the Huffman codes in block j, compared to the Huffman codes in block j-1. Note that the total length of the Huffman codes in set j is (bits_1 + bits_2 + ... + bits_j). uncomp_j: Indicates the uncompressed value of the first Huffman code in set j. Price et al. [PAGE 12] INTERNET-DRAFT Universal Decompressor 14 November, 2001 code_j: Indicates the compressed value (when read as an integer) of the last Huffman code in set j. Note that the compressed value of the first Huffman code in set j is not specified explicitly, but instead is taken to be 0 when j = 1 or (2^bits_j) x (1 + code_j-1) when j > 1. This simply means that the Huffman code is 1 greater than the largest Huffman code in set j-1, but padded with enough zeroes to give it the correct length in bits. For example, suppose that bits_1 = 2, uncomp_1 = 15 and code_1 = 1. This defines a set of 2 Huffman codes (00 and 01) with corresponding values 15 and 16. Suppose also that bits_2 = 1, uncomp_2 = 4 and code_2 = 7. This defines an additional set of 4 Huffman codes (100, 101, 110 and 111) with corresponding values 4, 5, 6 and 7. Huffman code: Uncompressed value: 00 15 01 16 100 4 101 5 110 6 111 7 The motivation for downloading Huffman codes to the decompressor in this form is that it is very easy to convert a compressed Huffman code into its uncompressed equivalent. This can be achieved by taking the following steps: 1.) Set j = 1, set code_0 = 65535 and set code_n = 65535. 2.) Read a total of (bits_1 + bits_2 + ... + bits_j) bits starting from the specified position and bit_offset. Interpret these bits as an integer H, with the first bit to be read as the MSB and the last bit to be read as the LSB. 3.) If (H > code_j) then set j = j + 1 and goto 2. 4.) Output (H + uncomp_j - (2^bits_j) x (1 + code_j-1)), with the arithmetic operations calculated modulo 2^16. Note that as the HUFFMAN token reads individual bits from within a byte, to avoid ambiguity it is necessary to define the order in which these bits are read. This draft specifies that a bit offset of 0 indicates the least significant bit of a byte, whilst a bit offset of 7 indicates the most significant bit. For example, suppose that an 8-bit Huffman code begins at byte position 0 and bit offset 2. In this case the 8 bits of the Huffman code can be found in the following locations (Bit 0 is the first bit in the Huffman code and Bit 7 is the last bit): Price et al. [PAGE 13] INTERNET-DRAFT Universal Decompressor 14 November, 2001 MSB LSB MSB LSB +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |5 4 3 2 1 0 | 7 6| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Byte 0 Byte 1 If the parameter bit offset does not take a value between 0 and 7 inclusive then a bad compressed message has been received and decompression failure occurs (see Section 4.5.). Decompression failure also occurs if a total of more than 16 bits are read from the specified position and bit offset. 6. Security considerations Note that this initial version raises the issues, without detailing a specific solution to resolve them. Introduction: In effect, a compressed message is a program; the tokens are instructions that are executed by the decompressor. This presents particular risks to the operation of the node on which the decompressor is running. This is the case for any compression algorithm, and so affects the operation of a node using this one. Transport Mechanism Requirements: The algorithm itself has no security issues, but does place requirements on any transport mechanism used to deliver the messages. If such requirements are not met, then the operation of the system can be exposed to attack ranging from indirect reduction in service, direct denial of service, and modification of subsequent messages. An initial list of requirements on the transport mechanism is: - messages should be delivered reliably to avoid corruption - for some uses of this algorithm, this corruption may have longer lasting effects. - messages should be not be duplicated - to ensure that operations implied by those messages are executed once only. - potentially, the message source should be identified and/or validated - to restrict possible insinuation of "attack" messages by third parties and to allow "blacklisting" of individual sources that "behave badly". Price et al. [PAGE 14] INTERNET-DRAFT Universal Decompressor 14 November, 2001 7. References [DEFLATE] "DEFLATE Compressed Data Format Specification version 1.3", RFC 1951, P. Deutsch, May 1996 [V-J] "Compressing TCP/IP Headers for Low-Speed Serial Links", V. Jacobson, Internet Engineering Task Force, February 1990 [LZW] "LZW Data Compression", Mark Nelson, Dr. Dobb's Journal, October 1989 [SCTP] "Stream Control Transmission Protocol", Stewart et al, RFC 2960, Internet Engineering Task Force, October 2000 [SIP] "SIP: Session Initiation Protocol", Handley et al, RFC 2543, Internet Engineering Task Force, March 1999 8. Authors' addresses Richard Price Tel: +44 1794 833681 Email: richard.price@roke.co.uk Abigail Surtees Tel: +44 1794 833131 Email: abigail.surtees@roke.co.uk Mark A West Tel: +44 1794 833311 Email: mark.a.west@roke.co.uk Lawrence Conroy Tel: +44 1794 833666 Email: lwc@roke.co.uk Roke Manor Research Ltd Romsey, Hants, SO51 0ZN United Kingdom Jonathan Rosenberg dynamicsoft 72 Eagle Rock Avenue First Floor East Hanover, NJ 07936 Email: jdrosen@dynamicsoft.com Price et al. [PAGE 15] INTERNET-DRAFT Universal Decompressor 14 November, 2001 Appendix A. Example sets of tokens This appendix gives three example sets of tokens which can be downloaded to the universal decompressor. The first set of tokens can be used to decompress a very simple variant of LZ77, which illustrates how the different token types are intended to be used. The next two examples show how the universal decompressor can be configured to understand the output of existing compression algorthms. One set of tokens allows the decompressor to understand [DEFLATE] compressed messages, whilst the other set allows the decompressor to understand [LZW] compressed messages. A.1. Mnemonic language Presenting the example set of tokens as a set of bytes would not be very informative, so they are written instead using a simple mnemonic language. Most importantly, the language allows the parameters of a token to be specified as text references rather than as 2-byte integers. The mnemonic language is translated into bytes as follows: Tokens: Token names are given in capitals. Replace each name with the corresponding 1-byte value as per Chapter 5. Labels: Label names are given as a colon followed by lowercase text. They are deleted when converting the mnemonics to bytes. References: These are indicated by a label name without the colon. Replace each reference with a 2-byte value specifying the location of the byte immediately following the label. Moreover if the reference is preceded by a "$" symbol then it is interpreted as an pointer rather than a literal value, and hence the MSB of the 2-byte value is set to 1. .short: Each decimal number following ".short" is interpreted as a 2-byte integer with MSBs preceding LSBs. Unless otherwise specified, this is the default. .byte: Each decimal number following ".byte" is interpreted as a single byte. Whitespace: All whitespace, brackets and commas just delimit the tokens. Delete. Comments: These are indicated by a semicolon and continue to the end of the line. Delete. Price et al. [PAGE 16] INTERNET-DRAFT Universal Decompressor 14 November, 2001 A.2. Example token set for simple LZ77 decompression The first example gives the tokens required to support a very simple LZ77-based algorithm. The decompressor is instructed to interpret a compressed message as a set of 4-byte characters, where each character contains a 2-byte position integer followed by a 2-byte length integer. Taken together these integers point to a previously received text string in the byte buffer, which is then copied to the end of the uncompressed message. Since the compressor can only send references to strings already present in the byte buffer, before the first message is decompressed the buffer must be initialized with a static dictionary containing the 256 ASCII characters. :first_token unpack_static_dictionary :uncompressed_start 1792 :uncompressed_end 2048 :circular_buffer 2048 :compressed_start compressed_message_start :compressed_end 0 :compressed_pointer 0 :bit_offset 0 :position 0 :length 0 :token_start next_character :unpack_static_dictionary COPY-LITERAL (17, 1, $uncompressed_start) ADD ($position, 1) COMPARE ($position, 256, unpack_static_dictionary, continue, 0) ; The above tokens are used to initialize the static dictionary. :continue COPY (token_start, 2, first_token) :next_character HUFFMAN ($compressed_pointer, $bit_offset, position, 1, 16, 0) HUFFMAN ($compressed_pointer, $bit_offset, length, 1, 16, 0) COPY-LITERAL ($position, $length, $uncompressed_end) COMPARE ($compressed_pointer, $compressed_end, next_character, 0, 0) :compressed_message_start A.3. Example token set for DEFLATE decompression This example gives the tokens required to understand [DEFLATE] compressed messages as defined in RFC 1951. Note that the example Price et al. [PAGE 17] INTERNET-DRAFT Universal Decompressor 14 November, 2001 only covers static Huffman codes (Block Type 01); dynamic Huffman codes will be added in a future version. Note also that the order in which the HUFFMAN token decompresses bits within a byte differs from that specified in RFC 1951. An alternative version of the HUFFMAN token may be introduced to support the exact bit order of [DEFLATE] in future (although this is not an urgent issue, as the order in which bits are decompressed does not affect the overall compression ratio). :first_token next_character :uncompressed_start 2048 :uncompressed_end 2048 :circular_buffer 2048 :compressed_start compressed_message_start :compressed_end 0 :compressed_pointer 0 :bit_offset 0 :index 0 :extra_length_bits 0 :length_value 0 :extra_distance_bits 0 :distance_value 0 :next_character HUFFMAN ($compressed_pointer, $bit_offset, index, 4, 7, 16428, 23, 1, 0, 191, 0, 16452, 199, 1, 144) ; The above HUFFMAN token decompresses a Huffman code from the ; length/literal alphabet. COMPARE ($index, 16428, literal, 0, length) ; The COMPARE token is used to determine whether the decoded ; character should be interpreted as a length value, a literal value ; or an end-of-block character. In the latter case the decompressor ; stops and outputs the uncompressed message. :literal COPY-LITERAL (17, 1, $uncompressed_end) COMPARE ($compressed_pointer, $compressed_end, next_character, 1, 1) ; If the decoded character is to be interpreted as a literal value ; then it is copied directly to the uncompressed message. The ; decompressor then checks whether more compressed characters are ; available. If it has run out then it pauses and waits for some ; more. :length LSHIFT ($index, 2) Price et al. [PAGE 18] INTERNET-DRAFT Universal Decompressor 14 November, 2001 COPY ($index, 4, extra_length_bits) ; If the decoded character is to be interpreted as a length value ; then as specified in [DEFLATE], the decompressor must first add a ; certain number of additional bits from the compressed message to ; this length value. HUFFMAN ($compressed_pointer, $bit_offset, extra_length_bits, 1, $extra_length_bits, 0) ; The above HUFFMAN token obtains the correct number of additional ; length bits from the compressed message. ADD ($length_value, $extra_length_bits) :distance HUFFMAN ($compressed_pointer, $bit_offset, index, 1, 5, 74) ; In [DEFLATE] a length code is always followed by a distance code. ; The above HUFFMAN token decompresses this character. LSHIFT ($index, 2) COPY ($index, 4, extra_distance_bits) HUFFMAN ($compressed_pointer, $bit_offset, extra_distance_bits, 1, $extra_distance_bits, 0) ; The above HUFFMAN token obtains the correct number of additional ; distance bits from the compressed message. ADD ($distance_value, $extra_distance_bits) COPY-OFFSET ($distance_value, $length_value, $uncompressed_end) ; The text string specified by the length and distance codes is then ; copied to the end of the uncompressed message. COMPARE ($compressed_pointer, $compressed_end, next_character, 1, 1) .byte 0 0 0 .short ; 3 bytes worth of padding :length_table 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 1 11 1 13 1 15 1 17 2 19 2 23 2 27 2 31 3 35 3 43 3 51 3 59 4 67 4 83 4 99 4 115 5 131 5 163 5 195 5 227 0 258 ; This is the length table from Page 11 of [DEFLATE]. Price et al. [PAGE 19] INTERNET-DRAFT Universal Decompressor 14 November, 2001 :distance_table 0 1 0 2 0 3 0 4 1 5 1 7 2 9 2 13 3 17 3 25 4 33 4 49 5 65 5 97 6 129 6 193 7 257 7 385 8 513 8 767 9 1025 9 1537 10 2049 10 3073 11 4097 11 6145 12 8193 12 12289 13 16385 13 24577 ; This is the distance table from Page 11 of [DEFLATE]. :compressed_message_start ; The compressed messages are stored in the byte buffer following ; the above set of tokens. A.4. Example token set for LZW decompression This example gives the tokens required to understand [LZW] compressed messages. This particular variant of LZW uses a codebook containing up to 1024 entries, and so each compressed character is a 10-bit value referencing one of the entries in the codebook. :first_token unpack_static_dictionary :uncompressed_start 5888 :uncompressed_end 1 :circular_buffer 6144 :compressed_start compressed_message_start :compressed_end 0 :compressed_pointer 0 :bit_offset 0 :index 0 :position_value 0 :length_value 0 :codebook_next 2048 :token_start next_character :unpack_static_dictionary COPY-LITERAL (uncompressed_start, 4, $codebook_next) COPY-LITERAL (17, 1, $uncompressed_start) ADD ($index, 1) COMPARE ($index, 256, unpack_static_dictionary, continue, 0) ; The above tokens are used to initialize the first 256 entries in ; the LZW codebook with single ASCII characters. :continue Price et al. [PAGE 20] INTERNET-DRAFT Universal Decompressor 14 November, 2001 COPY (circular_buffer, 2, uncompressed_end) COPY (token_start, 2, first_token) :next_character HUFFMAN ($compressed_pointer, $bit_offset, index, 1, 10, 512) ; The above HUFFMAN token extracts 10 bits from the compressed ; message. LSHIFT ($index, 2) COPY ($index, 4, position_value) ; The contents of the corresponding codebook entry is retrieved by ; the above COPY token. COPY-LITERAL (uncompressed_end, 2, $codebook_next) COPY-LITERAL ($position_value, $length_value, $uncompressed_end) ; The above COPY-LITERAL token appends the selected text string to ; the end of the uncompressed message. ADD ($length_value, 1) COPY-LITERAL (length_value, 2, $codebook_next) ; As per the LZW algorithm, a new entry is added to the codebook ; containing the text string copied to the end of the uncompressed ; message, plus the next character in the uncompressed message. COMPARE ($compressed_pointer, $compressed_end, next_character, 0, 0) :compressed_message_start Price et al. [PAGE 21]