Network Services Research Lab David Korn and Kiem-Phong Vo AT&T Labs Submission: February 23, 1999 Expiration: August 23, 1999 The VCDIFF Generic Differencing and Compression Data Format Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This memo describes a general and efficient data format suitable for encoding compressed and/or differencing data so that they can be easily transported among computers. Table of Contents: 1. Executive Summary ............................................ 1 2. Delta Instructions ........................................... 3 3. Vcdiff Encoding Format ....................................... 4 4. Instruction Code Tables ...................................... 10 5. Compression and Differencing Algorithms ...................... 15 6. Summary ...................................................... 19 ACKNOWLEDGEMENTS ............................................. 19 REFERENCES ................................................... 19 AUTHOR'S ADDRESS ............................................. 20 1. Executive Summary Storage and transmission of files and file versions can be greatly improved via the use of compression and differencing techniques. Since files are often transported across machines with distinct architectures and performance characteristics, encoded data should be in a form that is portable and easy to decode. This document describes Vcdiff, a new encoding format that satisfies these requirements. The format enables the construction of a single efficient decoder that is independent of encoding algorithms. Data differencing is the process of computing a compact and invertible RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 encoding of a target file given a source file. Data compression is similar but without the use of source data. The UNIX utilities diff, compress, and gzip are well-known examples of data differencing and compression tools. For data differencing, the computed encoding is called a delta file, and, for data compression, it is called a compressed file. Delta and compressed files are good for storage and transmission because they are often much smaller than the original data files. Data differencing and data compression are traditionally treated as distinct types of data processing. However, compression can be thought of as a special case of differencing in which the source data is empty. This blending of differencing and compression was first introduced in the Vdelta tool [1] by Korn and Vo. The basic idea was to unify the string parsing scheme used in the Lempel-Ziv'77 style compressors [2], and the block-move technique of Tichy [3]. Loosely speaking, this works as follows: 1. Concatenate source and target data. 2. Process the resulting data from left to right using a parsing scheme similar to LZ'77. 3. Start to output when the target data is reached. Parsing can be implemented on top of various string matching algorithms, for example using suffix trees [4], each with different space and time performance characteristics. Vdelta uses an efficient string matching algorithm previously shown to compare well against other well-known techniques as shown in [5]. However, even with the relatively efficient memory usage for Vdelta, memory requirement for string matching can be prohibitive when processing large files. To handle this, data compressors and differencers typically partition input files into chunks called windows and process each window separately. Except for unpublished work by Vo, little has been done on designing effective windowing schemes. Current techniques, including Vdelta, simply use windows with corresponding addresses across the source and target files. Such a pair of corresponding windows also often have the same size. From the point of view of data transporting and data storage, it is desirable to remove all issues concerning string matching and windowing algorithms from the decoding phase. This would enable client-server applications in which a server may serve multiple clients whose computing characteristics are unknown to it. However, all current differencing and compressing tools, including Vdelta, fell short in this aspect since their data encoding formats depend on both the string matching and windowing algorithms that they use. To solve this problem, we propose here a new data encoding format, Vcdiff, for encoding delta and compressed files. Vcdiff achieves the below characteristics: 1. Algorithm genericity: Both the data format and the decoding algorithm are independent of string matching algorithms or windowing schemes. 2. Decoding efficiency: A target file can be reconstructed in time proportional to its size and the space used to do so is proportional to the maximal window size. 3. Output compactness: A delta or compressed file is compactly represented. 4. Data portability: The encoding format is free from byte order and word size issues. -2- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 The Vcdiff data format and the algorithms for decoding data shall be described next. Since Vcdiff treats compression as a special case of differencing, we shall use the terms "delta file" to indicate the compressed output for both cases. 2. Delta Instructions A large target file is partitioned into sections called windows, each of which is processed separately. Each target window T with length n may be compared against some source data segment S with length m bytes. A source data segment may come from the already processed part of the target file or it may come the source file, if one is given. It is assumed that there is sufficient memory so that both T and S can be treated as strings stored in main memory. For the purpose of addressing bytes in a source segment and target window, it is convenient to think of S and T as substrings of a superstring U formed by concatenating T to S like this: s[0]s[1]...s[m-1]t[0]t[1]...t[n-1] An index in S or T is referred to by its location in U. For example, index i in T will be referred to as m+i. Note also that indices start from zero. The instructions to encode and direct the reconstruction of a target window are called delta instructions. There are three types: 1. ADD: This instruction type requires two arguments, a size s and a sequence of s bytes to be copied to reconstruct the target window. 2. COPY: This instruction type requires two arguments, a size s and an address p in the string U. The arguments specify the substring of U that must be copied. Note that such a substring must be entirely contained in one of S or T. 3. RUN: This instruction type requires two arguments, a size s and a byte c that will be repeated s times. For example, given the source and target strings a b c d e f g h i j k l m n o p a b c d w x y z e f g h e f g h e f g h e f g h e f g h the target string can be constructed from the source string with the following instructions: COPY 4, 0 ADD 4, w x y z COPY 4, 4 COPY 16, 24 Note that the third COPY instruction copies data from T itself since address 24 is position 8 in T. Note also that parts of the data to be copied by this instruction will be reconstructed during the copy operation. Thus, periodic sequences, i.e., sequences with regularly repeated subsequences, can be efficiently encoded. Although a run of -3- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 the same byte can be thought of as a periodic sequence with period 1, we provide a separate RUN instruction because for certain types of files such as executable code, runs of the same bytes are frequent and can benefit from having a compact representation. 3. Vcdiff Encoding Format Algorithms will be presented in ANSI-C. To ease the presentation, error checking will be generally omitted. Certain elementary functions shall be assumed without formal description. For example, get_byte() reads single byte from the delta file. Other like functions will be mentioned during the course of the presentation. Data will be encoded in byte units. For portability, control data generated by Vcdiff shall be limited to the lower eight bits of a byte even on machines with larger bytes. Bits shall be named from right to left so that the first bit has lowest value, 1, and the eight bit has value (1<<7) or 128. 3.1 Delta File Layout A large target file is typically processed in non-overlapping sections called target windows. We shall assert that each window is at most 2^31 in length. A target window may be processed against some source data segment. For compression, a source segment would be from some earlier part of the target file while, for differencing, it may also be from the source file. The following is the layout of a delta file. Header First window Second window .... The delta file starts with a 4-byte header identifying the file type, 0xd6, 0xc3, 0xc4, 0. The first three bytes are the ASCII letters 'V', 'C' and 'D' or-ed with the eighth bit. The fourth byte is a version number, currently set to zero. Following the 4-byte header are a number of sections, each of which encodes a target window. During target file reconstruction, windows are processed in the order that they appear in the delta file. Each window section starts with the below header data: 1. The number of bytes used to encode this window. This includes all header data. If this value is zero, the reconstruction ends. The encoding of the value uses four bytes based on the portable format to encode integers in the range [0,2^31] (Section 3.2). 2. The size of the target window is encoded as a variable-sized unsigned integer (Section 3.2). 3. An indicator byte whose low four bits signal the type of source segment and the high four bits signal any change in the instruction code table. The bits are interpreted as follows: -4- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 a. The low four bits of the indicator byte are 0 for no source segment, 1 for the source segment from the source file, and 2 for from the target file. When there is a source segment, next in the delta file are two unsigned integers for the size of this segment and its position in the respective file. b. The high four bits of the indicator byte after right-shifting 4 bits are 0 for no instruction code table change, 1 for switching to a code table already defined, and 2 for defining a new code table. When the indicator bits are non-zero, next in the delta stream will be a byte indicating the index of the new table. If the indicator bits have value 2, next will be the data defining the new table. Note that the initial code table always has index 0. Section 4 discusses instruction tables in detail. Below is the algorithm to decode a target file: 1. Vcdcode_t *Tab; 2. get_byte(); get_byte(); get_byte(); get_byte(); 3. for(;;) 4. { if((n_head = get_rint(1<<31)) == 0 || iseof()) 5. break; 6. n_tar = get_uint(); 7. if(((indicator = get_byte())&0xf) > 0) 8. { n_src = get_uint(); 9. p_src = get_uint(); 10. src = get_source(indicator&0xf, n_src, p_src); 11. } 12. else n_src = p_src = 0; 13. if((indicator >>= 4) > 0) 14. Tab = get_codetab(indicator); 15. win_decode(Tab, src,n_src, tar,n_tar); 16. put_target(tar,n_tar); 17. } Line 1 declares Tab which points to the current instruction code table. Line 2 reads the four-byte header. Line 4 reads the number of bytes for the encoding of this window. Line 6 read the size of a target window to be reconstructed. Lines 7-12 obtain the source data segment. The function get_source() constructs the source data segment which could be from the source file or from the target file depending on the value of i. Lines 13-14 process data to define a new instruction code table, if any. The function get_codetab() will be discussed in Section 4.2. Lines 15-16 call win_decode() to decode the target window, then call put_target() to write it to the target file. The elementary functions get_source() and put_target() shall be assumed without description. 3.2 Integer Encoding Formats -5- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 It is important that sizes of delta instructions be encoded compactly. Vcdiff encodes these values as unsigned integers using the variable sized portable format that was introduced in the Sfio library [6]. Each unsigned integer is encoded as a sequence of bytes. The eighth bit of a byte is the continuation bit, and, when on, indicates that the next byte is part the same integer. Below are the algorithms to encode and decode unsigned integers. The type uchar_t is the C type unsigned char and int_t is the largest supported integer type. We assume that all values fit into int_t. For absolute data portability, any encoded value should be small enough to fit the int_t of a decoding machine. For example, this means that a file with size larger than 2^32 will not be decodable on a machine whose integers are restricted to 32 bits. This is reasonable since such a machine does not support large files anyway. 1. #define MORE_SHIFT 7 2. #define MORE_BIT (1 << MORE_SHIFT) 3. int put_uint(int_t v) 4. { uchar_t c[2*sizeof(int_t)], *s; 5. s = &c[sizeof(c)-1]; 6. *s = v & (MORE_BIT-1); 7. while((v >>= MORE_SHIFT) != 0) 8. *--s = (v & (MORE_BIT-1)) | MORE_BIT; 9. put_data(s, (&c[sizeof(c)]) - s); 10. return &c[sizeof(c)]-s; 11. } 12. int_t get_uint() 13. { int_t b, v; 14. for(v = 0;; ) 15. { b = get_byte(); 16. v = (v << MORE_SHIFT) | (b & (MORE_BIT-1)); 17. if(!(b & MORE_BIT) ) 18. return v; 18. } 19. } Line 9 calls put_data() to write the encoded integer to the delta file. This elementary function shall be assumed without description. +---------------------+ | 11100000 | 00111001 | +---------------------+ byte 0 byte 1 Table 1. Table 1 shows the variable sized encoding of 12345. The bytes in the encoding are presented in binary to make clear the use of the continuation bit. Omitting the continuation bit, the encoding of 12345 as an unsigned integer uses two bytes with values 96 and 57. -6- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 The address of a COPY instruction is an unsigned integers with a known range, i.e., it cannot be larger than the current position. Such a value can be encoded more compactly than as done in the Sfio scheme. For example, if a value is known to be in the range 0-255, it can be encoded with a single byte. On the other hand, if the same value happens to be larger than 127, it would require two bytes to encode in the Sfio scheme. Below are the algorithms to encode integers with known ranges. 1. #define BYTE_SHIFT 8 2. #define BYTE_MASK ((1 << BYTE_SHIFT) - 1) 3. int put_rint(int_t v, int_t range) 4. { uchar_t c[sizeof(int_t)], *s; 5. s = &c[sizeof(c)-1]; 6. *s = v & BYTE_MASK; 7. while((range >>= BYTE_SHIFT) != 0) 8. { *--s = v & BYTE_MASK; 9. v >>= BYTE_SHIFT; 10. } 11. put_data(s, (&c[sizeof(c)]) - s); 12. return (&c[sizeof(c)]) - s; 13. } 14. int_t get_rint(int_t range) 15. { int_t v; 16. for(v = 0; range > 0; range >>= BYTE_SHIFT) 17. v = (v << BYTE_SHIFT) | get_byte(); 18. return v; 19. } 3.3 Delta Instruction Coding Format The delta instructions are encoded in a stream of control bytes and associated data. Each control byte is an index into an instruction code table of 256 entries whose type, Vcdcode_t, is defined below: 1. typedef struct _vcdinst_s 2. { unsigned char type; /* instruction type */ 3. unsigned char size; /* >0 if size is coded here */ 4. unsigned char mode; /* address mode for COPYs */ 5. } Vcdinst_t; 6. typedef struct _vcdcode_s 7. { Vcdinst_t inst1; /* first instruction */ 8. Vcdinst_t inst2; /* second instruction */ 9. } Vcdcode_t; Thus, each control byte may identify up to two instructions. Below are the different instruction types: 1. #define VCD_NOOP 0 /* not an instruction */ 2. #define VCD_ADD 1 /* an ADD instruction */ 3. #define VCD_RUN 2 /* a RUN instruction */ 4. #define VCD_COPY 3 /* a COPY instruction */ -7- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Each instruction has a size n of the data involved. If the field Vcdinst_t.size is non-zero, it is the value for n. Otherwise, n is encoded next in the delta file as an unsigned integer. Following this is respectively the n bytes of data for an ADD instruction, or the byte to be repeated for a RUN instruction, or the copying address for a COPY instruction. The COPY address can be encoded in different ways. The field Vcdinst_t.mode defines the address encoding mode: 1. #define VCD_SELF 0 /* coded as itself */ 2. #define VCD_HERE 1 /* coded off current position */ 3. #define VCD_SAME 2 /* index into "same" cache */ 4. #define VCD_NEAR 3 /* index into "near" cache */ The values VCD_SAME and VCD_NEAR indicate two address caching methods designed to take advantage of the fact that different versions of a data source often arise from small modifications. As such, successive copying addresses tend to have addresses that are either identical or close to one another. An address A of a COPY instruction is coded as follows: 1. Let H be the current location of target data to be processed. If H-A is in the range [0,255], the address mode is VCD_HERE and A is coded in a single byte with value H-A. 2. 256 addresses are kept in a "same" cache. The low eight bits of A are used to index the cache. If A matches this address, then the mode is VCD_SAME and A is coded in a single byte representing the matching index. 3. 4 addresses are kept in a "near" cache. This cache is updated in a round robin manner. A is checked against each element E of the cache. If A-E is in the range [-127,128], the address mode is VCD_NEAR plus the index of E while A is coded in a single byte with value A-E+127. 4. If none of the above cases apply, the address mode is VCD_SELF and A is coded as an integer in the range [0-H] where H is the current location. Below are the algorithms to initialize and update address caches. 1. typedef struct _cache_s 2. { int n; 3. int near[4]; 4. int same[256]; 5. } Cache_t; 6. cache_init(Cache_t* ka) 7. { int i; 8. for(i = 0; i < 4; ++i) 9. ka->near[i] = 0; 10. ka->n = 0; 11. for(i = 0; i < 256; ++i) 12. ka->same[i] = 0; 13. } 14. cache_update(Cache_t* ka, int_t addr) 15. { ka->near[ka->n] = addr; 16. if((ka->n += 1) >= 4) 17. ka->n = 0; 18. ka->same[addr&255] = addr; 19. } -8- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Below is the function to encode a COPY address: 1. int addr_encode(Cache_t* ka, int addr, int here, int* best) 2. { int i, d, mode = -1; 3. if((d = (here-addr)) < 256) 4. { *best = d; 5. mode = VCD_HERE; 6. } 7. else if(ka->same[d = addr&255] == addr) 8. { *best = d; 9. mode = VCD_SAME; 10. } 11. else 12. { for(i = 0; i < 4; ++i) 13. { if((d = addr - ka->k_near[i]) >= -127 && d <= 128) 14. { *best = d + 127; 15. mode = VCD_NEAR+i; 16. } 17. } 18. } 19. if(mode < 0) 20. { *best = addr; 21. mode = VCD_SELF; 22. } 23. cache_update(ka,addr); 24. return mode; 25. } Lines 3-6 check for coding against the current location. Lines 7-10 check for coding against the same[] cache. Lines 12-17 check for coding against the near[] cache. Lines 19-22 encode the address as itself if no other mode was chosen. Lines 23-24 update the cache, then return the selected mode. Below is the function to decode a COPY address: 1. int addr_decode(Cache_t* ka, int here, int type) 2. { int addr; 3. if(type == VCD_SELF) 4. addr = get_rint(here); 5. else if(type == VCD_HERE) 6. addr = here - get_byte(); 7. else if(type == VCD_SAME) 8. addr = ka->same[get_byte()]; 9. else addr = ka->near[type] + get_byte() - 127; 10. cache_update(ka, addr); 11. return addr; 12. } -9- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 3.4 Decoding A Target Window The algorithm to decode a target window is as follows: 1. win_decode(Vcdcode_t* tab,uchar_t* src,int nsrc,uchar_t* tar,int ntar) 2. { int_t here, size, addr, i; 3. Cache_t ka; 4. Vcdcode_t *code; 5. Vcdinst_t *inst; 6. cache_init(&ka); 7. for(here = 0; here < n_tar; ) 8. { code = &tab[get_byte()]; 9. for(i = 1; i <= 2; ++i) 10. { inst = i == 1 ? &code->inst1 : &code->inst2; 11. if(inst->type == VCD_NOOP) 12. continue; 13. if((size = inst->size) == 0) 14. size = get_uint(); 15. if(inst->type == VCD_ADD) 16. { for(; size > 0; --size) 17. tar[here++] = get_byte(); 18. } 19. else if(inst->type == VCD_RUN) 20. { int c = get_byte(); 21. for(; size > 0; --size) 22. tar[here++] = c; 23. } 24. else if(inst->type == VCD_COPY) 25. { uchar_t* from; 26. addr = addr_decode(&ka, here, inst->mode); 27. from = addr < nsrc ? src+addr : tar+addr-nsrc; 28. for(; size > 0; --size) 29. tar[here++] = *from++; 30. } 31. } 32. } 33. } Line 1 declares the formal arguments to win_decode(). The argument "tab" defines the instruction code table. Line 6 initializes the address caches. Line 8 reads a control byte and gets the corresponding code. Lines 9-32 process the pair of delta instructions defined by the code. 4. Instruction Code Tables Each target window is based on some instruction code table. Delta file compactness can be enhanced by tailoring such code tables to particular data characteristics. Vcdiff provides support for up to 256 instruction tables with indices in the range [0,255]. However, the first 8 indices [0,7] are reserved by Vcdiff for certain predefined code tables and should not be reset by applications. -10- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 4.1 Decoding a Code Table in the Delta File To output a table to a delta file, it is necessary to encode it in a portable form. Further, it is desirable to do that compactly. To accomplish this, we first represent a table as a string. As each instruction consists of 3 bytes and each table entry codes 2 instructions, the total size needed is 6*256 or 1536 bytes. Two functions tab2str() and str2tab() are used for converting between a table to a string and vice versa. Below is the description of tab2str(). str2tab() is just as straightforward so its description will be omitted. Note that bytes of the same type are grouped together. This helps with the step to compress the data. 1. tab2str(Vcdcode_t* tab, uchar_t data[6*256]) 2. { int i, n; 3. n = 0; 4. for(i = 0; i < 256; ++i) 5. data[n++] = tab[i].inst1.type; 6. for(i = 0; i < 256; ++i) 7. data[n++] = tab[i].inst2.type; 8. for(i = 0; i < 256; ++i) 9. data[n++] = tab[i].inst1.size; 10. for(i = 0; i < 256; ++i) 11. data[n++] = tab[i].inst2.size; 12. for(i = 0; i < 256; ++i) 13. data[n++] = tab[i].inst1.mode; 14. for(i = 0; i < 256; ++i) 15. data[n++] = tab[i].inst2.mode; 16. } Below is the function get_codetab() to get a new instruction code table to decode a target window: 1. Vcdcode_t *Code[256]; 2. Vcdcode_t* get_codetab(int indicator) 3. { int index; 4. uchar_t *diff, data[1536]; 5. index = get_byte(); 6. if(indicator == 2) 7. { diff = tab2str(Code[get_byte()]); 8. win_decode(Code[0], diff, 1536, data, 1536); 9. Code[index] = str2tab(data); 10. } 11. return Code[index]; 12. } Line 1 indicates that Vcdiff allows up to 256 code tables. Line 5 reads the index of the new code table. Lines 6-10 construct a new table from data in the delta stream. Such a table is always encoded using the reserved code table Code[0]. In addition, it is compared against a previously defined table to minimize the amount of data needed to transmit in the delta file. Line 7 reads the index of the table used to compare with the new table to construct the delta instructions. -11- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Line 8 calls win_decode() to recompute the string representation of the table. Line 9 sets the new table in the given index. Line 11 returns the code table. 4.2 Reserved Instruction Code Tables Using COPY instructions with small sizes sometimes adversely affect compactness. Thus, it is advantageous to restrict COPY instructions to some minimum sizes. Then, code tables can be designed to take advantage of this restriction to optimize delta file compactness. Vcdiff provides two reserved instruction code tables with indices 0 and 1 for minimium COPY sizes of 4 and 3 respectively. As noted earlier, the starting code table has index 0 so it is optimized for minimum COPY size of 4. Note also that even when a table optimized for a particular minimum COPY size, this does not mean that a COPY instruction with a smaller size cannot be encoded. It simply means that the size of such an instruction must be coded separately as an unsigned integer. TYPE SIZE MODE TYPE SIZE MODE INDEX ------------------------------------------------------------- 1. RUN 0 NOOP 0 2. ADD 0 NOOP 1 3. ADD [1,14] NOOP [2,15] 4. COPY 0 [0,6] NOOP [16,22] 5. COPY [M,M+10] [0,6] NOOP [23,99] 6. COPY 0 [0,6] ADD [1,4] [100,127] 7. COPY [M,M+3] [0,6] ADD [1,4] [128,239] 8. COPY [M,M+3] SELF COPY [M,M+3] SELF [240,255] ------------------------------------------------------------- Table 2. For a given minimum COPY size M, Table 2 shows how the 256 code values are partitioned into different categories of indices: 1. Index 0 defines a RUN instruction whose size is always coded separately as an unsigned integer. This is signified by having the "size" field being 0. 2. Index 1 defines an ADD instruction with size coded separately. 3. Indices 2-15 define 14 ADD instructions with sizes in the range [1,14]. 4. Indices 16-22 define 7 COPY instructions for the 7 different address encoding modes discusses in Section 3.3. The sizes of these instructions are coded separately. 5. Indices 23-99 define 77 COPY instructions using all 7 addressing modes and having sizes in the range [M,M+10]. 6. Indices 100-127 defines 28 pairs of COPY and ADD instructions. The COPY sizes are coded separately while the ADD sizes are in the range [1,4]. 7. Indices 128-239 define 112 pairs of COPY and ADD instructions whose sizes are in the ranges [M,M+3] and [1,4] respectively. 8. Finally, indices 240-255 defines 16 pairs of COPY and COPY instructions with sizes in the range [M,M+3] and both using the same VCD_SELF addressing mode. -12- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Next we present the algorithms to construct an instruction code table as described above. Where appropriate, the algorithms also construct the inverted tables usable for data encoding. Any part of a table not explicitly set will be assumed to be initialized to be either zero or the null pointer as the case warranted. 1. mk_codetab(Vcdcode_t tab[256], int add[16], int copy[15][7], 2. int copyadd[8][7][5], int copycopy[8][8], int M) 3. { 4. tab[0].inst1.type = VCD_RUN; 5. tab[0].inst1.size = 0; 6. tab[0].inst2.type = VCD_NOOP; 7. mk_add(tab, add); 8. mk_copy(tab, copy, M); 9. mk_copyadd(tab, copyadd, M); 10. mk_copycopy(tab, copycopy, M); 11. } Lines 1-2 declare the arguments for mk_codetab(). The inverted tables add[], copy[][], copyadd[][][], and copycopy[][] are designed to be indexed by sizes and address modes. The argument M is the minimum COPY size used for the table. Lines 4-6 construct the RUN instruction. Lines 7-10 calls various subroutines to construct different parts of the instruction table. Below is the algorithm to construct the single ADD instructions: 1. mk_add(Vcdcode_t tab[256], int add[16]) 2. { int s, n; 3. tab[1].inst1.type = VCD_ADD; 4. tab[1].inst1.size = 0; 5. tab[1].inst2.type = VCD_NOOP; 6. add[0] = 1; 7. for(n = 2, s = 1; s <= 14; ++s, ++n) 8. { tab[n].inst1.type = VCD_ADD; 9. tab[n].inst1.size = s; 10. tab[n].inst2.type = VCD_NOOP; 11. add[s] = n; 12. } 13. } Lines 3-6 construct the ADD instruction with size coded separately. Location 0 in the inverted table add[] points to this code. Lines 7-12 construct the ADD instructions with size in the range [1-14] as discussed earlier. -13- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Below is the function to construct the single COPY instructions: 1. mk_copy(Vcdcode_t tab[256], int copy[15][7], int M) 2. { int s, m, n; 3. for(n = 16, m = 0; m < 7; ++m, ++n) 4. { tab[n].inst1.type = VCD_COPY; 5. tab[n].inst1.size = 0; 6. tab[n].inst1.mode = m; 7. tab[n].inst2.type = VCD_NOOP; 8. copy[0][m] = n; 9. } 10. for(s = M; s <= M+10; ++s) 11. { for(m = 0; m < 7; ++m, ++n) 12. { tab[n].inst1.type = VCD_COPY; 13. tab[n].inst1.size = s; 14. tab[n].inst1.mode = m; 15. tab[n].inst2.type = VCD_NOOP; 16. copy[s][m] = n; 17. } 18. } 20. } Lines 3-9 construct the 7 COPY instructions with sizes coded separately. Lines 10-18 construct the 77 COPY instructions with sizes in the range [M,M+10]. The below function constructs the pairs of COPY/ADD instructions: 1. mk_copyadd(Vcdcode_t tab[256], int copyadd[8][7][5], int M) 2. { int s, m, a, n; 3. for(n = 100, m = 0; m < 7; ++m) 4. { for(a = 1; a <= 4; ++a, ++n) 5. { tab[n].inst1.type = VCD_COPY; 6. tab[n].inst1.size = 0; 7. tab[n].inst1.mode = m; 8. tab[n].inst2.type = VCD_ADD; 9. tab[n].inst2.size = a; 10. copyadd[0][m][a] = n; 11. } 12. } 13. for(s = M; s <= M+3; ++s) 14. { for(m = 0; m < 7; ++m) 15. { for(a = 1; a <= 4; ++a, ++n) 16. { tab[n].inst1.type = VCD_COPY; 17. tab[n].inst1.size = s; 18. tab[n].inst1.mode = m; 19. tab[n].inst2.type = VCD_ADD; 20. tab[n].inst2.size = a; 21. copyadd[s][m][a] = n; 22. } 23. } 24. } 25. } -14- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Finally, the algorithm to construct the 16 COPY/COPY pairs: 1. mk_copycopy(Vcdcode_t tab[256], int copycopy[8][8], int M) 2. { int s, c, n; 3. for(n = 240, s = M; s <= M+3; ++s) 4. { for(c = M; c <= M+3; ++c, ++n) 5. { tab[n].inst1.type = VCD_COPY; 6. tab[n].inst1.size = s; 7. tab[n].inst1.mode = VCD_SELF; 8. tab[n].inst2.type = VCD_COPY; 9. tab[n].inst2.size = c; 10. tab[n].inst2.mode = VCD_SELF; 11. copycopy[s][c] = n; 12. } 13. } 14. } 5. Compression and Differencing Algorithms Sections 3 and 4 describe in full how to reconstruct a target file from data in a delta file. In this section, we discuss briefly general algorithms for compressors and differencers. Two abstract functions shall be assumed: win_match(): This function computes a source segment, if any, to be compared with a target window. It returns 0, 1 or 2 for no source segment, source segment from the source file, or source segment from the target file. str_match(): This function computes a string in either the source segment or the target window that matches with a prefix of the target data being processed. Such a matched string can be encoded as a COPY instruction. The performance of an encoder in terms of time as well as output compactness depends on what algorithms are used to implement the two window and string matching functions. However, as shown earlier, the reconstruction of a target file is completely independent of such choices. Below is the algorithm for processing a target file: 1. put_byte(0xd6); put_byte(0xc3); put_byte(0xc4); put_byte(0); 2. for(p_tar = 0; ; p_tar += n_tar) 3. { if((n_tar = get_target(&tar)) <= 0) 4. break; 5. n_delta = put_rint(0, (1<<31)); 6. n_delta += put_uint(n_tar); 7. type = win_match(tar, n_tar, p_tar, &src, &n_src, &p_src); 8. n_delta += put_byte(type); 9. if(type > 0) 10. { n_delta += put_uint(n_src); 11. n_delta += put_uint(p_src); 12. } 13. n_delta += win_encode(src,n_src, tar,n_tar); 14. put_delsize(n_delta); 15. } -15- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Line 1 outputs the four header bytes. Lines 3-4 get a target window. If there is none, the loop is terminated. Line 5 outputs a place holder for the total size of the delta code used for this window. Line 6 outputs the size of the target window. Line 7 calls win_match() to compute a source segment if any. Note that the current position in the target file, p_tar, is passed to win_match() so that it can ensure that a source segment from the target file will come entirely before p_tar. Line 8 outputs the indicator byte for the window. Here, we assume that the default instruction code table with index 0 is used. Lines 9-12 output the size and position of the source segment as needed. Line 13 calls win_encode() to encode the target window. Line 14 overwrites the value written out on Line 5 with the actual size of the delta. The function put_delsize() is straightforward so its description will be omitted. Below is the function to encode a target window: 1. int win_encode(uchar_t* src, int n_src, uchar_t* tar, int n_tar) 2. { int add, here, addr, s, size = 0; 3. Cache_t ka; 4. cache_init(&ka); 5. for(here = 0; add = -1; here < n_tar; ) 6. { if( (s = chk_run(tar, n_tar, here)) > 0) 7. addr = -1; 8. else s = str_match(src,n_src,tar,n_tar,here,&addr); 9. if(s > 0) 10. { if(add >= 0) 11. size += put_add(here-add, tar+here-add); 12. if(addr < 0) 13. size += put_run(s, tar[here]); 14. else size += put_copy(&ka, s, addr, here); 15. add = -1; 16. here += s; 17. } 18. else 19. { if(add < 0) 20. add = here; 21. here += 1; 22. } 23. } 24. if(add >= 0) 25. size += put_add(here-add, tar+here-add); 26. else size += put_copy(0,0,0,0); 27. return size; 28. } Line 4 initializes the address caches. Lines 6-8 check to see if there is a RUN or a COPY. In the latter case, str_match() returns the address of the COPY in addr. The function chk_run() is straightforward to implement so its description will be omitted. -16- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Lines 9-17 output the appropriate delta instruction. Lines 19-22 get executed if there is no run or match. In this case, data is collected for an ADD instruction. Lines 24-25 output the last unmatched part of the target window in an ADD instruction. Line 26 forces the last saved COPY instruction (if any) to be output. See the function put_copy() below. Line 27 returns the number of bytes used for encoding the data. To finish up the encoding algorithm, we need to describe the functions put_add(), put_run() and put_copy(). We shall assume that the default code instruction table 0 is used. The below code declares the tables and define a few convenience macro functions. 1. Vcdcode_t Tab[256]; 2. Vcdcode_t *Add[16]; 3. Vcdcode_t *Copy[15][7]; 4. Vcdcode_t *Copyadd[8][7][5]; 5. Vcdcode_t *Copycopy[8][8]; 6. #define INDEXADD(a) ((a >= 1 && a <= 14) ? a : 0) 7. #define INDEXCOPY(c) ((c >= 4 && c <= 14) ? c : 0) 8. #define INDEXCOPYADD(c) ((c >= 4 && c <= 7) ? c : 0) 9. #define ISTINYADD(a) ((a >= 1 && a <= 4) ? 1 : 0) 10. #define ISTINYCOPY(c) ((c >= 4 && c <= 7) ? 1 : 0) Lines 6-8 define functions to compute the index into the tables Add, Copy, and Copyadd for some given ADD or COPY size. Lines 9-10 define functions to test if a given size is small. These are used to determine if certain pair of instructions could be coded using a single byte code. Below is the function to output a COPY instruction. 1. int Size, Addr, Here, Mode; 2. int put_copy(Cache_t* ka, int size, int addr, int here) 3. { int code, mode = 0, best = 0, ndel = 0; 4. if(size > 0) 5. mode = addr_encode(ka, addr, here, &best); 6. if(Size > 0) 7. { if(Mode == mode && mode == VCD_SELF && 8. ISTINYCOPY(Size) && ISTINYCOPY(size) ) 9. { code = Copycopy[Size][size]; 10. ndel += put_byte(code); 11. ndel += put_rint(Addr, Here); 12. ndel += put_rint(addr, here); 13. Size = -1; 14. return ndel; 15. } 16. code = Copy[INDEXCOPY(Size)][Mode]; 17. ndel += put_byte(code); 18. if(Tab[code].inst1.size == 0) 19. ndel += put_uint(Size); 20. if(Mode == VDC_SELF) 21. ndel += put_rint(Addr,Here); -17- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 22. else ndel += put_byte(Addr); 23. } 24. Size = size; 25. Addr = best; 26. Mode = mode; 27. Here = here; 28. return ndel; 29. } Line 1 defines variables to save a COPY instruction so that it may be coded jointly with another instruction later. Lines 4-5 compute the mode and value to encode the given COPY address if there is one (i.e., "size" is positive). Lines 6-23 process a previously saved COPY instruction. Lines 7-15 output a code that encode both COPY instructions. Lines 16-22 output the previously saved COPY instruction singly. Lines 24-27 save the current COPY instruction if it has not already output as discussed. Below is the function to output an ADD instruction: 1. int put_add(int size, uchar_t* data) 2. { int code, ndel = 0; 3. if(Size > 0) 4. { if(ISTINYADD(size)) 5. { code = Copyadd[INDEXCOPYADD(Size)][Mode][size]; 6. ndel += put_byte(code); 7. if(Tab[code].inst1.size == 0) 8. ndel += put_uint(Size); 9. if(Mode == VCD_SELF) 10. ndel += put_rint(Addr,Here); 11. else ndel += put_byte(Addr); 12. for(; size > 0; --size, ++data) 13. ndel += put_byte(*data); 14. return ndel; 15. } 16. else ndel += put_copy(0,0,0,0); 17. } 18. code = Add[INDEXADD(size)]; 19. ndel += put_byte(code); 20. if(Tab[code].inst1.size == 0) 21. ndel += put_uint(size); 22. ndel += size; 23. for(; size > 0; --size, ++data) 24. put_byte(*data); 25. return ndel; 26. } Lines 3-17 consider the case when there is a previously saved COPY instruction. In that case, if the ADD size is small, a single code is output for both COPY and ADD instructions. Otherwise, the COPY instruction is output by itself. Lines 18-24 output an ADD instruction by itself. -18- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 Below is the function to output a RUN instruction: 1. int put_run(int size, int byte) 2. { int ndel = 0; 3. if(Size > 0) 4. ndel += put_copy(0,0,0,0); 5. ndel += put_byte(0); 6. ndel += put_uint(size); 7. ndel += put_byte(byte); 8. return ndel; 9. } Lines 3-4 output a previously saved COPY instruction, if any. Lines 5-7 output the RUN instruction. 6. Summary We have described a general and portable data format for compression differencing. Reference [7] shows that the format is compact. For compression, it is better than Unix compress and close to gzip. For differencing, it is better than all known methods. The decoding algorithms run in linear time and require working space proportional to window sizes. This makes it suitable for data communication among computers. ACKNOWLEDGEMENTS Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff who provided much encouragement to publicize Vcdiff. REFERENCES [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression, Practical Reusable Unix Software, Editor B. Krishnamurthy, John Wiley & Sons, Inc., 1995. [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3):337-343, May 1977. [3] W. Tichy, The String-to-String Correction Problem with Block Moves, ACM Transactions on Computer Systems, 2(4):309-321, November 1984. [4] E.M. McCreight, A Space-Economical Suffix Tree Construction Algorithm, Journal of the ACM, 23:262-272, 1976. [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms, IEEE Software Configuration and Maintenance Workshop, 1996. [6] G.S. Fowler, D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Accepted for publication in Software Practice & Experience, 1999. [7] D.G. Korn and K.P. Vo, A Generic Differencing and Compression Data Format, Technical Report AT&T Labs HA1630000-021899-02TM, February 1999. -19- RFC 0 VCDIFF Generic Differencing and Compression Data Fornmat Feb 1999 AUTHOR'S ADDRESS Kiem-Phong Vo (main contact) AT&T Labs, Room D223 180 Park Avenue Florham Park, NJ 07932 Phone: 973-360-8630 Email: kpv@research.att.com David G. Korn AT&T Labs, Room D237 180 Park Avenue Florham Park, NJ 07932 Phone: 973-360-8602 Email: dgk@research.att.com EXPIRATION DATE August 23, 1999 -20-