idnits 2.17.1 draft-ietf-rohc-sigcomp-algorithm-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-24) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 20 longer pages, the longest (page 2) being 59 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** The abstract seems to contain references ([DEFLATE], [LZW]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- Couldn't find a document date in the document -- date freshness check skipped. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC-2026' is mentioned on line 17, but not defined == Missing Reference: 'PAGE 1' is mentioned on line 44, but not defined == Missing Reference: 'PAGE 2' is mentioned on line 90, but not defined == Missing Reference: 'RFC-2119' is mentioned on line 106, but not defined == Missing Reference: 'PAGE 3' is mentioned on line 144, but not defined == Missing Reference: 'PAGE 4' is mentioned on line 200, but not defined -- Looks like a reference, but probably isn't: '0' on line 322 -- Looks like a reference, but probably isn't: '1' on line 336 -- Looks like a reference, but probably isn't: '2' on line 234 == Missing Reference: 'PAGE 5' is mentioned on line 255, but not defined == Missing Reference: 'PAGE 6' is mentioned on line 311, but not defined == Missing Reference: 'PAGE 7' is mentioned on line 367, but not defined == Missing Reference: 'PAGE 8' is mentioned on line 423, but not defined == Missing Reference: 'PAGE 9' is mentioned on line 479, but not defined == Missing Reference: 'PAGE 10' is mentioned on line 535, but not defined == Missing Reference: 'PAGE 11' is mentioned on line 588, but not defined == Missing Reference: 'PAGE 12' is mentioned on line 643, but not defined == Missing Reference: 'PAGE 13' is mentioned on line 698, but not defined == Missing Reference: 'PAGE 14' is mentioned on line 750, but not defined == Missing Reference: 'PAGE 15' is mentioned on line 794, but not defined == Missing Reference: 'PAGE 16' is mentioned on line 847, but not defined == Missing Reference: 'PAGE 17' is mentioned on line 902, but not defined == Missing Reference: 'PAGE 18' is mentioned on line 958, but not defined == Missing Reference: 'PAGE 19' is mentioned on line 1014, but not defined == Missing Reference: 'PAGE 20' is mentioned on line 1069, but not defined == Missing Reference: 'PAGE 21' is mentioned on line 1103, but not defined ** Downref: Normative reference to an Informational RFC: RFC 1951 (ref. 'DEFLATE') -- Possible downref: Non-RFC (?) normative reference: ref. 'V-J' -- Possible downref: Non-RFC (?) normative reference: ref. 'LZW' ** Obsolete normative reference: RFC 2960 (ref. 'SCTP') (Obsoleted by RFC 4960) ** Obsolete normative reference: RFC 2543 (ref. 'SIP') (Obsoleted by RFC 3261, RFC 3262, RFC 3263, RFC 3264, RFC 3265) Summary: 7 errors (**), 0 flaws (~~), 25 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group Richard Price, Siemens/Roke Manor 3 INTERNET-DRAFT Jonathan Rosenberg, dynamicsoft 4 Expires: May 2002 Abigail Surtees, Siemens/Roke Manor 5 Mark A West, Siemens/Roke Manor 6 Lawrence Conroy, Siemens/Roke Manor 8 14 November, 2001 10 Universal Decompression Algorithm 11 13 Status of this Memo 15 This document is an Internet-Draft and is in full conformance with 16 all provisions of Section 10 of [RFC-2026]. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that other 20 groups may also distribute working documents as Internet-Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as "work in progress." 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This document is a submission to the IETF ROHC WG. Comments should be 34 directed to the mailing list of ROHC, rohc@cdt.luth.se. 36 Abstract 38 This specification defines a "universal decompressor" capable of 39 interoperating with a wide range of compression algorithms. Using the 40 basic techniques of Huffman compression and LZ77-style string 41 substitution, the decompressor can be configured to understand the 42 output of many well-known compressors including [DEFLATE] and [LZW]. 44 Price et al. [PAGE 1] 45 Revision history 47 Changes from 49 New COPY-LITERAL and COPY-OFFSET tokens added to reduce complexity 50 for LZ77-based algorithms. 52 COMPARE token modified to allow it to be used without a SWITCH 53 token. 55 HUFFMAN token modified to require fewer parameters. 57 Support added for literal parameters (see Section 4.3). 59 Example token sets added for decompression of LZ77, [DEFLATE] and 60 [LZW] compressed messages. 62 Table of contents 64 1. Introduction.................................................3 65 2. Terminology..................................................3 66 3. Requirements on underlying transport protocol................3 67 4. Description of the universal decompressor....................4 68 4.1. Structure of universal decompressor dictionary.............5 69 4.2. Important entries in the byte buffer.......................5 70 4.3. Token parameters...........................................5 71 4.4. Decompressor actions upon receiving a compressed message...6 72 4.5. Decompression failure......................................7 73 5. Library of tokens............................................8 74 5.1. COPY.......................................................9 75 5.2. ADD / SUBTRACT.............................................10 76 5.3. LSHIFT / RSHIFT............................................10 77 5.4. COMPARE....................................................11 78 5.5. SWITCH.....................................................11 79 5.6. CALL / RETURN..............................................11 80 5.7. HUFFMAN....................................................12 81 6. Security considerations......................................14 82 7. References...................................................15 83 8. Authors' addresses...........................................15 84 A. Example sets of tokens.......................................16 85 A.1. Mnemonic language..........................................16 86 A.2. Example token set for simple LZ77 decompression............17 87 A.3. Example token set for DEFLATE decompression................17 88 A.4. Example token set for LZW decompression....................20 90 Price et al. [PAGE 2] 91 1. Introduction 93 This draft introduces the concept of a "universal decompressor". 95 The goal of the document is to standardize a decompressor capable of 96 interoperating with a wide range of compression algorithms. 97 Consequently this draft describes the decompressor operation only, 98 i.e. the actions which the decompressor takes upon receiving a 99 certain instruction from the compressor. 101 2. Terminology 103 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 104 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 105 document are to be interpreted as described in [RFC-2119]. 107 Byte buffer 109 The universal decompressor maintains a byte buffer containing any 110 previously received text strings that might be useful for future 111 compression. 113 Token 115 A token is an instruction transmitted from the compressor to the 116 decompressor. 118 3. Requirements on underlying transport protocol 120 The universal decompressor takes as its input a sequence of 121 compressed messages, which are processed and then outputted as a 122 sequence of uncompressed messages. This chapter lists the 123 requirements on the transport protocol used to carry compressed 124 messages to the universal decompressor. 126 Note that the universal decompressor outputs an uncompressed message 127 when it encounters an explicit end-of-message character. 128 Consequently there is no need for one uncompressed message to 129 correspond to exactly one compressed message. Two or more compressed 130 messages can be sent to reconstruct a single uncompressed message, 131 which is very useful for segmenting compressed messages that are 132 larger than the MTU of the transport protocol. 134 Conversely, one compressed message can be sent to reconstruct 135 several uncompressed messages. In particular, messages can be 136 successfully decompressed even when the transport protocol provides 137 data as a byte stream with no framing. 139 Note however that the universal decompressor can make use of 140 previously received messages to improve the overall compression 141 ratio (see [V-J] for an example of a compression algorithm which 142 uses previously received messages in this manner). 144 Price et al. [PAGE 3] 145 Consequently, the main requirement on the underlying protocol is 146 that it MUST ensure that the contents of the decompressor byte 147 buffer are as expected by the next message to be decompressed. 149 This requirement can be supported in a number of ways. If the 150 underlying transport protocol is reliable (for example TCP or 151 [SCTP]) then the compressed messages are provided to the universal 152 decompressor in the correct order and free from bit errors. In this 153 case the byte buffer is automatically updated correctly between 154 messages. 156 If the underlying transport protocol is unreliable (for example UDP) 157 then the byte buffer MUST be reset to a known state between 158 compressed messages. This ensures that messages are not compressed 159 relative to text that has not yet been received by the universal 160 decompressor. 162 The "known state" of the byte buffer can be negotiated a-priori: for 163 example a set of tokens can be downloaded from the compressor to the 164 decompressor when the universal decompressor is being set up. All 165 messages are then compressed relative to these tokens. Alternatively 166 the transport protocol can indicate to the decompressor how the byte 167 buffer should be set up before each message is decompressed. 169 4. Description of the universal decompressor 171 An important feature of the universal decompressor is that it can 172 interoperate with a wide range of compression algorithms. The 173 precise method for compression is left as an implementation 174 decision, and in fact the standard decompressor can interoperate 175 with any of the following classes of algorithm: 177 * Generic text compressor (for example [DEFLATE] or a similar 178 algorithm). 180 * Protocol-aware compressor offering excellent performance for 181 one type of text message (for example the text messages 182 generated by [SIP]). 184 * Hybrid compressor with similar performance to [DEFLATE] for 185 generic text strings and superior performance for certain types 186 of text message. 188 The choice of which tokens to send to the decompressor is left as a 189 local implementation decision at the compressor. The only 190 requirement is that of transparency, i.e. the compressor MUST NOT 191 send tokens which cause the decompressor to incorrectly decompress a 192 given message. 194 Note however that it is perfectly acceptable for the compressor to 195 send tokens which update the byte buffer at the decompressor, but 196 which cause no decompressed message to be outputted. Indeed, this is 197 a useful technique for pre-populating the dictionary with well-known 198 text strings. 200 Price et al. [PAGE 4] 201 4.1. Structure of universal decompressor dictionary 203 The universal decompressor dictionary consists of a simple byte 204 buffer designed to hold the current uncompressed message, the 205 current compressed message, and any other previously received text 206 strings that might be useful for future compression. 208 The size buffer_size of the byte buffer can be negotiated by an 209 externally defined mechanism (e.g. by the underlying protocol used 210 to transport the compressed messages). Entries in the byte buffer 211 are referred to as buffer[n] where 0 =< n < buffer_size. 213 As all of the tokens currently use 2-byte indices into the byte 214 buffer, the maximum size of the buffer is 64K. 216 4.2. Important entries in the byte buffer 218 The first few bytes in the byte buffer are used to store some 219 important 2-byte integers. These integers are given the following 220 names: 222 Position in buffer: Name: 224 0 - 1 first_token 225 2 - 3 uncompressed_start 226 4 - 5 uncompressed_end 227 6 - 7 circular_buffer 228 8 - 9 compressed_start 229 10 - 11 compressed_end 230 12 - 13 compressed_pointer 231 14 - 15 stack_free 232 16 - 17 stack[0] 233 18 - 19 stack[1] 234 20 - 21 stack[2] 235 : : 237 The MSBs of the integer are always stored before the LSBs. So, for 238 example, the MSBs of first_token are stored in buffer[0] whilst the 239 LSBs are stored in buffer[1]. 241 The use of each integer is described in the following sections of 242 the draft. 244 4.3. Token parameters 246 Each of the tokens is followed by 0 or more bytes containing the 247 parameters required by the token. At present all parameters are 248 stored as 2-byte integers with MSBs stored in the byte preceding the 249 LSBs in the byte buffer. 251 The most significant bit of the 2-byte integer has a special 252 meaning: it is used to determine whether the parameter is a literal 253 value or an index pointing to a literal value. 255 Price et al. [PAGE 5] 256 If the most significant bit is 0 then the 2-byte integer is 257 interpreted as a literal value from 0 to 32767. 259 If the most significant bit is 1 then the 2-byte integer is 260 interpreted as an index from 0 to 32767 (the most significant bit 261 itself is ignored). This index points to the location in the buffer 262 containing the literal value of the parameter (MSBs stored in 263 buffer[index] and LSBs stored in buffer[index + 1]). If the index 264 references a location beyond the size of the byte buffer then a bad 265 compressed message has been received and decompression failure 266 occurs (see Section 4.5.). 268 4.4. Decompressor actions upon receiving a compressed message 270 When the universal decompressor is initialized all entries in the 271 byte buffer are set to 0. Upon receiving a compressed message, the 272 decompressor strips off the underlying protocol header and then 273 performs the following actions: 275 1.) The message is copied directly into the byte buffer beginning 276 at the byte specified in compressed_start. 278 The underlying protocol MUST NOT pass a compressed message of more 279 than 1460 bytes to the universal decompressor. If a larger 280 compressed message is received, the underlying protocol passes only 281 the first 1460 bytes to the decompressor, and provides additional 282 working memory to store the bytes that are not currently being 283 decompressed. The remainder of the compressed message is passed to 284 the decompressor in blocks of 1460 bytes (or less if it is the last 285 block in the compressed message). 287 The decompressor MUST NOT concatenate two messages to form a single 288 compressed message. This is because compressed messages are 289 typically padded with trailing zero bits so that they are a whole 290 number of bytes long. Concatenating two messages would cause these 291 padding bits to be incorrectly interpreted as compressed data. 293 Note that the buffer is circular, so once a byte is copied into 294 buffer[buffer_size - 1], the next byte is copied into 295 buffer[circular_buffer]. The parameter circular_buffer (see Section 296 4.2) can be set to prevent the first part of the buffer from being 297 overwritten by new messages. Typically this area of the buffer is 298 used to hold important tokens and text strings that should be kept 299 from one compressed message to the next. If circular_buffer lies 300 beyond the size of the byte buffer then decompression failure occurs 301 (see Section 4.5). 303 After the message has been copied into the byte buffer, the position 304 of the last byte in the compressed message is copied into 305 compressed_end. Also, the value in compressed_start is copied into 306 compressed_pointer. 308 2.) Next, the tokens contained within the byte buffer are executed 309 beginning at the byte specified in first_token. The tokens are 311 Price et al. [PAGE 6] 312 executed consecutively unless indicated explicitly (for example when 313 the decompressor encounters a SWITCH token). If the next token to be 314 executed lies outside the byte buffer then decompression failure 315 occurs (see Section 4.5). 317 3.) The decompressor stops token execution when it reaches 318 buffer[0] or buffer[1]. Depending on which buffer entry is reached, 319 the following actions are then taken: 321 If the decompressor reaches buffer[0], instead of executing the 322 token contained within buffer[0] it stops token execution and 323 outputs the uncompressed message. The location of the uncompressed 324 message is from uncompressed_start up to, but not including 325 uncompressed_end. After the uncompressed message has been outputted, 326 the value in uncompressed_end is then copied into 327 uncompressed_start. If uncompressed_start = uncompressed_end then no 328 uncompressed message is outputted. 330 As stated before the buffer is circular, so once a byte is copied 331 from buffer[buffer_size - 1], the next byte is copied from 332 buffer[circular_buffer]. If either uncompressed_start or 333 uncompressed_end lie outside the circular buffer then decompression 334 failure occurs. 336 If buffer[1] is reached then token execution stops but no 337 uncompressed message is outputted. 339 When the next compressed message becomes available, the universal 340 decompressor continues at Step 1.) above. Note that if the 341 underlying transport protocol does not provide reliable, in-order 342 message delivery then the contents of the byte buffer MUST be reset 343 to the state expected by the next compressed message. This state can 344 be negotiated a-priori, or the transport protocol can indicate to 345 the decompressor how the byte buffer should be set up before the 346 next message is decompressed. 348 4.5. Decompression failure 350 If the compressed messages received by the decompressor are 351 corrupted (either accidentally or maliciously) then one of three 352 possibilities might occur: 354 * A decompressed message is outputted that is incorrect. 356 * A token is encountered that cannot be processed successfully by 357 the decompressor (for example a RETURN token when no CALL token 358 has previously been encountered). 360 * The decompressor never finishes decompressing a message. 362 To counter the first possibility the underlying protocol SHOULD 363 include a checksum to verify either the compressed message or the 364 uncompressed message. If a message fails the checksum then 365 "decompression failure" has occurred. The decompressor does not 367 Price et al. [PAGE 7] 368 output an uncompressed message, and ignores any future compressed 369 message until the byte buffer is reset. 371 If a token is encountered that cannot be successfully processed then 372 decompression failure occurs automatically. 374 To counter the third possibility, decompression failure SHOULD also 375 occur after a certain number of tokens have been processed for a 376 given compressed message. The maximum number of tokens to process is 377 currently left as an implementation decision (but might in future be 378 negotiated). 380 5. Library of tokens 382 The universal decompressor currently understands twelve types of 383 token, chosen to support the widest possible range of compression 384 algorithms with the minimum possible overhead. 386 All tokens are stored as a single byte to indicate the token type, 387 followed by 0 or more bytes containing the parameters required by 388 the token. At present all parameters are 2-byte integers with MSBs 389 stored before LSBs. For example, the COPY token is followed by three 390 parameters as shown below: 392 COPY (position, length, destination) 394 In the byte buffer a COPY token is stored as the following 7 bytes: 396 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 397 |0 0 0 0 0 0 0 0| Position MSB | Position LSB | Length MSB | 398 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 400 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 401 | Length LSB |Destination MSB|Destination LSB| 402 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 404 Twelve token types are currently available to the universal 405 decompressor. The following table lists the different token types 406 and the byte values used to transmit the tokens to the decompressor: 408 Token type: Corresponding byte value: 410 COPY 0 411 COPY-LITERAL 1 412 COPY-OFFSET 2 413 ADD 3 414 SUBTRACT 4 415 LSHIFT 5 416 RSHIFT 6 417 COMPARE 7 418 SWITCH 8 419 CALL 9 420 RETURN 10 421 HUFFMAN 11 423 Price et al. [PAGE 8] 424 Each token is explained in more detail below: 426 5.1. COPY 428 The COPY token instructs the decompressor to copy a string of bytes 429 from one part of the byte buffer to another. 431 A COPY token is stored in the byte buffer as 7 consecutive bytes as 432 follows: 434 COPY (position, length, destination) 436 As with all tokens currently defined, the COPY token translates into 437 a single byte for the token type followed by 2 bytes for each of the 438 parameters. 440 The meaning of the three parameters is explained below: 442 Position: 2-byte integer indicating the location of the first 443 byte in the string to be copied. 445 Length: 2-byte integer indicating the number of bytes to be 446 copied. 448 Destination: 2-byte integer indicating the location to which the 449 first byte in the string will be copied. 451 Note that the copying function is performed on a byte-by-byte basis, 452 with the position parameter indicating the first byte to be copied. 453 In particular, some of the later bytes to be copied may themselves 454 have been written into the byte buffer by the COPY token currently 455 being executed. 457 Equally, it is possible for a COPY token to overwrite itself or its 458 parameters. If this occurs then the COPY token MUST continue to 459 execute as if the parameters were still in place in the byte buffer. 461 If the source or destination of the next byte to be copied is larger 462 than (buffer_size - 1) then sufficient multiples of (buffer_size - 463 circular_buffer) are subtracted until it is not. So for example, 464 once a byte is copied into buffer[buffer_size - 1] the next byte is 465 copied into buffer[circular_buffer]. If circular_buffer equals or 466 exceeds buffer_size then a bad compressed message has been received 467 and decompression failure occurs (see Section 4.5). 469 A modified version of the COPY token is given below: 471 COPY-LITERAL (position, length, destination) 473 The COPY-LITERAL token is identical to COPY except that after 474 copying, the destination parameter is replaced with the value 475 (destination + length). If the destination parameter is a literal 476 value then it is updated directly. If the destination parameter is 477 an index then the literal value it references is updated instead. 479 Price et al. [PAGE 9] 480 As above, if (destination + length) is larger than (buffer_size - 1) 481 then sufficient multiples of (buffer_size - circular_buffer) are 482 subtracted until it is not. 484 A further version of the COPY-LITERAL token is given below: 486 COPY-OFFSET (offset, length, destination) 488 The COPY-OFFSET token is identical to COPY-LITERAL except that an 489 offset parameter is given instead of a position parameter. The two 490 parameters are related by position = (destination - offset). 492 If (destination - offset) does not lie between circular_buffer and 493 (buffer_size - 1) inclusive then sufficient multiples of 494 (buffer_size - circular_buffer) are added or subtracted until it 495 does. 497 5.2. ADD / SUBTRACT 499 The ADD token instructs the decompressor to add two 2-byte integers 500 (addition performed modulo 2^16) and to store the result in the 501 location of the first parameter. 503 The format of the ADD token is given below: 505 ADD (parameter_1, parameter_2) 507 Note that as per Section 4.3, depending on how the MSB is set the 508 parameters can be interpreted as literal values or indices to 509 literal values. If parameter_1 is a literal value then it is 510 overwritten with the result of the addition. If parameter_1 is an 511 index to a literal value then the literal value is overwritten (not 512 parameter_1 itself). 514 The SUBTRACT token is the same as the ADD token except that the 515 second integer is subtracted from the first (subtraction performed 516 modulo 2^16): 518 SUBTRACT (parameter_1, parameter_2) 520 5.3. LSHIFT / RSHIFT 522 The LSHIFT token instructs the decompressor to left shift a 2-byte 523 value by the specified number of bits: 525 LSHIFT (parameter, no_of_bits) 527 The value to be shifted is stored in parameter, and the number of 528 bits to shift is stored in no_of_bits. As usual both can be stored 529 either as a literal value or as an index to a literal value. 531 If the value of no_of_bits does not lie between 0 and 15 inclusive 532 then a bad compressed message has been received and decompression 533 failure occurs. 535 Price et al. [PAGE 10] 536 The RSHIFT token is the same as the LSHIFT token except that the 537 bits are right shifted: 539 RSHIFT (parameter, no_of_bits) 541 5.4. COMPARE 543 The COMPARE token instructs the decompressor to compare two 2-byte 544 values and then to jump to one of three specified indices depending 545 on the result. 547 COMPARE (parameter_1, parameter_2, index_1, index_2, index_3) 549 If the parameter_1 < parameter_2 then the decompressor continues 550 token execution at the byte position specified by index 1. If 551 parameter_1 = parameter_2 then it jumps to index_2. If parameter_1 > 552 parameter_2 then it jumps to index_3. 554 If an index is specified which is beyond the size of the byte 555 buffer, a bad compressed message has been received and decompression 556 failure occurs. 558 5.5. SWITCH 560 The SWITCH token performs a conditional jump based on the value of 561 its first parameter. 563 SWITCH (j, index_0, index_1, ... , index_n-1) 565 When a SWITCH token is encountered the decompressor reads the value 566 of j. It then continues token execution at the byte position 567 specified by index j. 569 If j specifies an index which is beyond the size of the byte buffer, 570 a bad compressed message has been received and decompression failure 571 occurs. 573 5.6. CALL / RETURN 575 The CALL and RETURN tokens provide support for compression 576 algorithms with a nested structure. 578 CALL (index) 580 RETURN 582 When the decompressor reaches a CALL token, it finds the byte 583 position of the token immediately following the CALL token and 584 copies this 2-byte integer into stack[stack_free] ready for later 585 retrieval. It then increases stack_free by 1 and continues token 586 execution at the byte position specified by index. 588 Price et al. [PAGE 11] 589 When the decompressor reaches a RETURN token it decreases stack_free 590 by 1, and then continues token execution at the byte position stored 591 in stack[stack_free]. 593 If stack_free ever becomes more than buffer_size - 1 or less than 0 594 then a bad compressed message has been received and decompression 595 failure occurs (see Section 4.5). 597 5.7. HUFFMAN 599 The HUFFMAN token maps a shorthand Huffman code onto its 600 uncompressed equivalent. 602 The format of a HUFFMAN token is as follows: 604 HUFFMAN (position, bit_offset, destination, n, bits_1, uncomp_1, 605 code_1, bits_2, uncomp_2, code_2, ... , bits_n-1, uncomp_n-1, 606 code_n-1, bits_n, uncomp_n) 608 The HUFFMAN token is followed by four mandatory parameters plus n 609 additional sets of parameters. Every set contains three parameters 610 except the nth set, which contains two. The nth set of parameters 611 omits the parameter code_n, as the decompressor can always work out 612 what the correct value of code_n should be. 614 The meaning of the four mandatory parameters is explained below: 616 Position: Indicates the byte location of the Huffman code to be 617 decompressed. 619 Bit Offset: Indicates the bit offset at which the Huffman code 620 begins within the byte specified above. 622 Destination: Indicates the location to which the uncompressed value 623 will be copied. 625 n: Indicates the number of additional sets of parameters 626 that follow. If n = 0 then the HUFFMAN token is ignored 627 by the decompressor. 629 The remaining parameters specify the actual Huffman codes and their 630 uncompressed equivalents. Each set of 3 parameters specifies a block 631 of Huffman codes with the same length and (when treated as integers) 632 with values that increase by 1 for each additional Huffman code. The 633 precise meaning of the parameters is given below: 635 bits_j: Indicates the additional length in bits of the Huffman 636 codes in block j, compared to the Huffman codes in 637 block j-1. Note that the total length of the Huffman 638 codes in set j is (bits_1 + bits_2 + ... + bits_j). 640 uncomp_j: Indicates the uncompressed value of the first Huffman 641 code in set j. 643 Price et al. [PAGE 12] 644 code_j: Indicates the compressed value (when read as an 645 integer) of the last Huffman code in set j. 647 Note that the compressed value of the first Huffman code in set j is 648 not specified explicitly, but instead is taken to be 0 when j = 1 or 649 (2^bits_j) x (1 + code_j-1) when j > 1. This simply means that the 650 Huffman code is 1 greater than the largest Huffman code in set j-1, 651 but padded with enough zeroes to give it the correct length in bits. 653 For example, suppose that bits_1 = 2, uncomp_1 = 15 and code_1 = 1. 654 This defines a set of 2 Huffman codes (00 and 01) with corresponding 655 values 15 and 16. 657 Suppose also that bits_2 = 1, uncomp_2 = 4 and code_2 = 7. This 658 defines an additional set of 4 Huffman codes (100, 101, 110 and 111) 659 with corresponding values 4, 5, 6 and 7. 661 Huffman code: Uncompressed value: 663 00 15 664 01 16 665 100 4 666 101 5 667 110 6 668 111 7 670 The motivation for downloading Huffman codes to the decompressor in 671 this form is that it is very easy to convert a compressed Huffman 672 code into its uncompressed equivalent. This can be achieved by 673 taking the following steps: 675 1.) Set j = 1, set code_0 = 65535 and set code_n = 65535. 677 2.) Read a total of (bits_1 + bits_2 + ... + bits_j) bits starting 678 from the specified position and bit_offset. Interpret these bits as 679 an integer H, with the first bit to be read as the MSB and the last 680 bit to be read as the LSB. 682 3.) If (H > code_j) then set j = j + 1 and goto 2. 684 4.) Output (H + uncomp_j - (2^bits_j) x (1 + code_j-1)), with the 685 arithmetic operations calculated modulo 2^16. 687 Note that as the HUFFMAN token reads individual bits from within a 688 byte, to avoid ambiguity it is necessary to define the order in 689 which these bits are read. This draft specifies that a bit offset of 690 0 indicates the least significant bit of a byte, whilst a bit offset 691 of 7 indicates the most significant bit. 693 For example, suppose that an 8-bit Huffman code begins at byte 694 position 0 and bit offset 2. In this case the 8 bits of the Huffman 695 code can be found in the following locations (Bit 0 is the first bit 696 in the Huffman code and Bit 7 is the last bit): 698 Price et al. [PAGE 13] 699 MSB LSB MSB LSB 701 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 702 |5 4 3 2 1 0 | 7 6| 703 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 705 Byte 0 Byte 1 707 If the parameter bit offset does not take a value between 0 and 7 708 inclusive then a bad compressed message has been received and 709 decompression failure occurs (see Section 4.5.). Decompression 710 failure also occurs if a total of more than 16 bits are read from 711 the specified position and bit offset. 713 6. Security considerations 715 Note that this initial version raises the issues, without detailing 716 a specific solution to resolve them. 718 Introduction: 720 In effect, a compressed message is a program; the tokens are 721 instructions that are executed by the decompressor. This presents 722 particular risks to the operation of the node on which the 723 decompressor is running. This is the case for any compression 724 algorithm, and so affects the operation of a node using this one. 726 Transport Mechanism Requirements: 728 The algorithm itself has no security issues, but does place 729 requirements on any transport mechanism used to deliver the 730 messages. 732 If such requirements are not met, then the operation of the system 733 can be exposed to attack ranging from indirect reduction in service, 734 direct denial of service, and modification of subsequent messages. 736 An initial list of requirements on the transport mechanism is: 738 - messages should be delivered reliably to avoid corruption - for 739 some uses of this algorithm, this corruption may have longer lasting 740 effects. 742 - messages should be not be duplicated - to ensure that operations 743 implied by those messages are executed once only. 745 - potentially, the message source should be identified and/or 746 validated - to restrict possible insinuation of "attack" messages by 747 third parties and to allow "blacklisting" of individual sources that 748 "behave badly". 750 Price et al. [PAGE 14] 751 7. References 753 [DEFLATE] "DEFLATE Compressed Data Format Specification version 754 1.3", RFC 1951, P. Deutsch, May 1996 756 [V-J] "Compressing TCP/IP Headers for Low-Speed Serial 757 Links", V. Jacobson, Internet Engineering Task Force, 758 February 1990 760 [LZW] "LZW Data Compression", Mark Nelson, Dr. Dobb's 761 Journal, October 1989 763 [SCTP] "Stream Control Transmission Protocol", Stewart et al, 764 RFC 2960, Internet Engineering Task Force, October 2000 766 [SIP] "SIP: Session Initiation Protocol", Handley et al, 767 RFC 2543, Internet Engineering Task Force, March 1999 769 8. Authors' addresses 771 Richard Price Tel: +44 1794 833681 772 Email: richard.price@roke.co.uk 774 Abigail Surtees Tel: +44 1794 833131 775 Email: abigail.surtees@roke.co.uk 777 Mark A West Tel: +44 1794 833311 778 Email: mark.a.west@roke.co.uk 780 Lawrence Conroy Tel: +44 1794 833666 781 Email: lwc@roke.co.uk 783 Roke Manor Research Ltd 784 Romsey, Hants, SO51 0ZN 785 United Kingdom 787 Jonathan Rosenberg 788 dynamicsoft 789 72 Eagle Rock Avenue 790 First Floor 791 East Hanover, NJ 07936 792 Email: jdrosen@dynamicsoft.com 794 Price et al. [PAGE 15] 795 Appendix A. Example sets of tokens 797 This appendix gives three example sets of tokens which can be 798 downloaded to the universal decompressor. The first set of tokens 799 can be used to decompress a very simple variant of LZ77, which 800 illustrates how the different token types are intended to be used. 802 The next two examples show how the universal decompressor can be 803 configured to understand the output of existing compression 804 algorthms. One set of tokens allows the decompressor to understand 805 [DEFLATE] compressed messages, whilst the other set allows the 806 decompressor to understand [LZW] compressed messages. 808 A.1. Mnemonic language 810 Presenting the example set of tokens as a set of bytes would not be 811 very informative, so they are written instead using a simple 812 mnemonic language. Most importantly, the language allows the 813 parameters of a token to be specified as text references rather than 814 as 2-byte integers. 816 The mnemonic language is translated into bytes as follows: 818 Tokens: Token names are given in capitals. Replace each 819 name with the corresponding 1-byte value as per 820 Chapter 5. 822 Labels: Label names are given as a colon followed by lowercase 823 text. They are deleted when converting the mnemonics to 824 bytes. 826 References: These are indicated by a label name without the 827 colon. Replace each reference with a 2-byte value 828 specifying the location of the byte immediately 829 following the label. Moreover if the reference is 830 preceded by a "$" symbol then it is interpreted as an 831 pointer rather than a literal value, and hence the MSB 832 of the 2-byte value is set to 1. 834 .short: Each decimal number following ".short" is interpreted as 835 a 2-byte integer with MSBs preceding LSBs. Unless 836 otherwise specified, this is the default. 838 .byte: Each decimal number following ".byte" is interpreted as 839 a single byte. 841 Whitespace: All whitespace, brackets and commas just delimit the 842 tokens. Delete. 844 Comments: These are indicated by a semicolon and continue 845 to the end of the line. Delete. 847 Price et al. [PAGE 16] 848 A.2. Example token set for simple LZ77 decompression 850 The first example gives the tokens required to support a very simple 851 LZ77-based algorithm. The decompressor is instructed to interpret a 852 compressed message as a set of 4-byte characters, where each 853 character contains a 2-byte position integer followed by a 2-byte 854 length integer. Taken together these integers point to a previously 855 received text string in the byte buffer, which is then copied to the 856 end of the uncompressed message. 858 Since the compressor can only send references to strings already 859 present in the byte buffer, before the first message is decompressed 860 the buffer must be initialized with a static dictionary containing 861 the 256 ASCII characters. 863 :first_token unpack_static_dictionary 864 :uncompressed_start 1792 865 :uncompressed_end 2048 866 :circular_buffer 2048 867 :compressed_start compressed_message_start 868 :compressed_end 0 869 :compressed_pointer 0 871 :bit_offset 0 872 :position 0 873 :length 0 874 :token_start next_character 876 :unpack_static_dictionary 878 COPY-LITERAL (17, 1, $uncompressed_start) 879 ADD ($position, 1) 880 COMPARE ($position, 256, unpack_static_dictionary, continue, 0) 882 ; The above tokens are used to initialize the static dictionary. 884 :continue 886 COPY (token_start, 2, first_token) 888 :next_character 890 HUFFMAN ($compressed_pointer, $bit_offset, position, 1, 16, 0) 891 HUFFMAN ($compressed_pointer, $bit_offset, length, 1, 16, 0) 892 COPY-LITERAL ($position, $length, $uncompressed_end) 893 COMPARE ($compressed_pointer, $compressed_end, next_character, 0, 0) 895 :compressed_message_start 897 A.3. Example token set for DEFLATE decompression 899 This example gives the tokens required to understand [DEFLATE] 900 compressed messages as defined in RFC 1951. Note that the example 902 Price et al. [PAGE 17] 903 only covers static Huffman codes (Block Type 01); dynamic Huffman 904 codes will be added in a future version. 906 Note also that the order in which the HUFFMAN token decompresses 907 bits within a byte differs from that specified in RFC 1951. An 908 alternative version of the HUFFMAN token may be introduced to 909 support the exact bit order of [DEFLATE] in future (although this is 910 not an urgent issue, as the order in which bits are decompressed 911 does not affect the overall compression ratio). 913 :first_token next_character 914 :uncompressed_start 2048 915 :uncompressed_end 2048 916 :circular_buffer 2048 917 :compressed_start compressed_message_start 918 :compressed_end 0 919 :compressed_pointer 0 921 :bit_offset 0 922 :index 0 923 :extra_length_bits 0 924 :length_value 0 925 :extra_distance_bits 0 926 :distance_value 0 928 :next_character 930 HUFFMAN ($compressed_pointer, $bit_offset, index, 4, 7, 16428, 23, 931 1, 0, 191, 0, 16452, 199, 1, 144) 933 ; The above HUFFMAN token decompresses a Huffman code from the 934 ; length/literal alphabet. 936 COMPARE ($index, 16428, literal, 0, length) 938 ; The COMPARE token is used to determine whether the decoded 939 ; character should be interpreted as a length value, a literal value 940 ; or an end-of-block character. In the latter case the decompressor 941 ; stops and outputs the uncompressed message. 943 :literal 945 COPY-LITERAL (17, 1, $uncompressed_end) 946 COMPARE ($compressed_pointer, $compressed_end, next_character, 1, 1) 948 ; If the decoded character is to be interpreted as a literal value 949 ; then it is copied directly to the uncompressed message. The 950 ; decompressor then checks whether more compressed characters are 951 ; available. If it has run out then it pauses and waits for some 952 ; more. 954 :length 956 LSHIFT ($index, 2) 958 Price et al. [PAGE 18] 959 COPY ($index, 4, extra_length_bits) 961 ; If the decoded character is to be interpreted as a length value 962 ; then as specified in [DEFLATE], the decompressor must first add a 963 ; certain number of additional bits from the compressed message to 964 ; this length value. 966 HUFFMAN ($compressed_pointer, $bit_offset, extra_length_bits, 1, 967 $extra_length_bits, 0) 969 ; The above HUFFMAN token obtains the correct number of additional 970 ; length bits from the compressed message. 972 ADD ($length_value, $extra_length_bits) 974 :distance 976 HUFFMAN ($compressed_pointer, $bit_offset, index, 1, 5, 74) 978 ; In [DEFLATE] a length code is always followed by a distance code. 979 ; The above HUFFMAN token decompresses this character. 981 LSHIFT ($index, 2) 982 COPY ($index, 4, extra_distance_bits) 983 HUFFMAN ($compressed_pointer, $bit_offset, extra_distance_bits, 1, 984 $extra_distance_bits, 0) 986 ; The above HUFFMAN token obtains the correct number of additional 987 ; distance bits from the compressed message. 989 ADD ($distance_value, $extra_distance_bits) 990 COPY-OFFSET ($distance_value, $length_value, $uncompressed_end) 992 ; The text string specified by the length and distance codes is then 993 ; copied to the end of the uncompressed message. 995 COMPARE ($compressed_pointer, $compressed_end, next_character, 1, 1) 997 .byte 0 0 0 .short ; 3 bytes worth of padding 999 :length_table 1001 0 3 0 4 0 5 1002 0 6 0 7 0 8 1003 0 9 0 10 1 11 1004 1 13 1 15 1 17 1005 2 19 2 23 2 27 1006 2 31 3 35 3 43 1007 3 51 3 59 4 67 1008 4 83 4 99 4 115 1009 5 131 5 163 5 195 1010 5 227 0 258 1012 ; This is the length table from Page 11 of [DEFLATE]. 1014 Price et al. [PAGE 19] 1015 :distance_table 1017 0 1 0 2 0 3 1018 0 4 1 5 1 7 1019 2 9 2 13 3 17 1020 3 25 4 33 4 49 1021 5 65 5 97 6 129 1022 6 193 7 257 7 385 1023 8 513 8 767 9 1025 1024 9 1537 10 2049 10 3073 1025 11 4097 11 6145 12 8193 1026 12 12289 13 16385 13 24577 1028 ; This is the distance table from Page 11 of [DEFLATE]. 1030 :compressed_message_start 1032 ; The compressed messages are stored in the byte buffer following 1033 ; the above set of tokens. 1035 A.4. Example token set for LZW decompression 1037 This example gives the tokens required to understand [LZW] 1038 compressed messages. This particular variant of LZW uses a codebook 1039 containing up to 1024 entries, and so each compressed character is a 1040 10-bit value referencing one of the entries in the codebook. 1042 :first_token unpack_static_dictionary 1043 :uncompressed_start 5888 1044 :uncompressed_end 1 1045 :circular_buffer 6144 1046 :compressed_start compressed_message_start 1047 :compressed_end 0 1048 :compressed_pointer 0 1050 :bit_offset 0 1051 :index 0 1052 :position_value 0 1053 :length_value 0 1054 :codebook_next 2048 1055 :token_start next_character 1057 :unpack_static_dictionary 1059 COPY-LITERAL (uncompressed_start, 4, $codebook_next) 1060 COPY-LITERAL (17, 1, $uncompressed_start) 1061 ADD ($index, 1) 1062 COMPARE ($index, 256, unpack_static_dictionary, continue, 0) 1064 ; The above tokens are used to initialize the first 256 entries in 1065 ; the LZW codebook with single ASCII characters. 1067 :continue 1069 Price et al. [PAGE 20] 1070 COPY (circular_buffer, 2, uncompressed_end) 1071 COPY (token_start, 2, first_token) 1073 :next_character 1075 HUFFMAN ($compressed_pointer, $bit_offset, index, 1, 10, 512) 1077 ; The above HUFFMAN token extracts 10 bits from the compressed 1078 ; message. 1080 LSHIFT ($index, 2) 1081 COPY ($index, 4, position_value) 1083 ; The contents of the corresponding codebook entry is retrieved by 1084 ; the above COPY token. 1086 COPY-LITERAL (uncompressed_end, 2, $codebook_next) 1087 COPY-LITERAL ($position_value, $length_value, $uncompressed_end) 1089 ; The above COPY-LITERAL token appends the selected text string to 1090 ; the end of the uncompressed message. 1092 ADD ($length_value, 1) 1093 COPY-LITERAL (length_value, 2, $codebook_next) 1095 ; As per the LZW algorithm, a new entry is added to the codebook 1096 ; containing the text string copied to the end of the uncompressed 1097 ; message, plus the next character in the uncompressed message. 1099 COMPARE ($compressed_pointer, $compressed_end, next_character, 0, 0) 1101 :compressed_message_start 1103 Price et al. [PAGE 21]