idnits 2.17.1 draft-morgan-ezflate-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There is 1 instance of too long lines in the document, the longest one being 2 characters in excess of 72. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 251: '... The compressor MUST always operates ...' RFC 2119 keyword, line 354: '... compressor SHOULD use non-compresse...' Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (June 2, 2014) is 3613 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Downref: Normative reference to an Informational RFC: RFC 1951 (ref. 'DEFLATE') -- Duplicate reference: RFC1951, mentioned in 'GZIP', was also mentioned in 'DEFLATE'. Summary: 4 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 HTTPbis K. Morgan 3 Internet-Draft C. Brunhuber 4 Intended status: Standards Track IAEA 5 Expires: December 4, 2014 June 2, 2014 7 EZFLATE: Token-based DEFLATE Compression 8 draft-morgan-ezflate-00 10 Abstract 12 This specification defines EZFLATE, a token-based DEFLATE compression 13 algorithm for secure compression within encrypted communication 14 channels. 16 Editorial Note (To be removed by RFC Editor) 18 Discussion of this draft takes place on the HTTPBIS working group 19 mailing list (ietf-http-wg@w3.org), which is archived at 20 . 22 Working Group information can be found at 23 ; 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at http://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on December 4, 2014. 42 Copyright Notice 44 Copyright (c) 2014 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (http://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 60 2. Compression Attacks . . . . . . . . . . . . . . . . . . . . . . 3 61 2.1. CRIME . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 62 2.2. BREACH . . . . . . . . . . . . . . . . . . . . . . . . . . 4 63 3. Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . 4 64 3.1. "Take a Bite out of CRIME" . . . . . . . . . . . . . . . . 4 65 3.2. Delimiters & Tokenization Methods . . . . . . . . . . . . . 5 66 3.2.1. Method 1: Keep Delimiters as Separate Tokens . . . . . 5 67 3.2.2. Method 2: Keep Delimiters with Preceding Token . . . . 5 68 3.2.3. Method 3: Keep Delimiters with Succeeding Token . . . . 6 69 3.3. Tokenization Rules . . . . . . . . . . . . . . . . . . . . 6 70 4. EZFLATE Compression Algorithm Details . . . . . . . . . . . . . 6 71 5. Security Considerations . . . . . . . . . . . . . . . . . . . . 8 72 6. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 9 73 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 9 74 7.1. Normative References . . . . . . . . . . . . . . . . . . . 9 75 7.2. Informative References . . . . . . . . . . . . . . . . . . 9 77 1. Introduction 79 DEFLATE [DEFLATE] is one of the most widely used compression 80 mechanisms. It forms the basis of ZIP [ZIP], GZIP [GZIP] and ZLIB 81 [ZLIB]. However, DEFLATE poses a security vulnerability when 82 messages comprised of secret and attacker-controlled data are 83 delivered through a secure communication channel, as demonstrated by 84 the CRIME [CRIME] and BREACH [BREACH] attacks. 86 A key observation of the CRIME and BREACH attacks is that they 87 exploited the DEFLATE _algorithm_ as described in RFC 1951 [DEFLATE], 88 not the DEFLATE _format_ (also described in RFC 1951). 90 The first paragraph of Section 4, "Compression algorithm details", in 91 RFC 1951 explicitly states that the "deflate" format is not tied to 92 any particular algorithm. To quote, it says, "While it is the intent 93 of this document to define the 'deflate' compressed data format 94 without reference to any particular compression algorithm, the format 95 is related to the compressed formats produced by LZ77 ... it is 96 strongly recommended that the implementor of a compressor follow the 97 general algorithm presented here ... [however] a compressor need not 98 follow it in order to be compliant [DEFLATE]." 100 This document describes EZFLATE, a token-based DEFLATE _algorithm_. 101 The key feature of the EZFLATE algorithm (and primary difference to 102 the algorithm described in RFC 1951), is that LZ77 tuples may only reference a duplicated token (octet string) 104 occurring in a previous block, up to 32K input bytes before, if the 105 current token exactly matches in length _and_ octet values. In other 106 words, LZ77 tuples must not reference partial matches, neither 107 smaller nor longer, occurring in a previous block. 109 2. Compression Attacks 111 2.1. CRIME 113 SPDY [SPDY] used DEFLATE [DEFLATE] to compress redundant HTTP 114 [HTTP-p1] headers. The CRIME [CRIME] attack exploited DEFLATE to 115 guess secret information contained in the HTTP request headers (e.g. 116 "Cookies"). Since DEFLATE looks for matches character-by-character, 117 the attackers were able to guess the secret one character at a time. 118 Each failed guess resulted in a compressed block of size n. Correct 119 guesses resulted in a compressed block of size m < n. For example, 120 consider the following sample HTTP requests (taken from [CRIME]): 122 Attacker guesses the first character is 'a': 124 GET /twid=a 125 Host: twitter.com 126 User-Agent: Chrome 127 Cookie: twid=secret 129 ... (more guesses) ... 131 Attacker guesses the first character is 's': 133 GET /twid=s 134 Host: twitter.com 135 User-Agent: Chrome 136 Cookie: twid=secret 138 After guessing 's' is the first character of the secret, the attacker 139 sees a corresponding increase in compression (i.e. decrease in 140 length) indicating the guess is correct. The procedure is repeated 141 until the entire secret is revealed. 143 Although this is a brute-force attack of sorts, the average number of 144 guesses required to extract a secret is considerably lower, and 145 therefore tractable, thanks to being able to guess character by 146 character. If the number of possible characters is 256, then on 147 average it will require 128 guesses per character to make a correct 148 guess. An n-character secret then only requires, on average, n * 128 149 guesses to discover the secret. For example, a 12-character secret 150 requires 12 * 128 = 1,536 guesses. In many cases the set of possible 151 characters is smaller than 256, reducing the search space even more! 153 2.2. BREACH 155 While CRIME exploited SPDY (and TLS compression too), BREACH [BREACH] 156 exploited compression of HTTP _response bodies_. The mechanism of 157 BREACH is similar to CRIME with slightly different details, therefore 158 a detailed review of BREACH will not be given here. 160 3. Tokenization 162 3.1. "Take a Bite out of CRIME" 164 Taking a "bite out of crime" requires more than a security dog named 165 McGruff [MCGRUFF]. EZFLATE fights CRIME by using token matching, 166 rather than character-by-character matching, to make the search space 167 for an attacker equivalent to a full brute-force attack. At that 168 point there are easier ways to execute the brute-force attack. 170 As stated in Section 1, the key feature of the the EZFLATE algorithm 171 (and primary difference to the algorithm described in RFC 1951), is 172 that LZ77 tuples may only reference a 173 duplicated token (octet string) if the current token and referenced 174 token exactly match in length _and_ octet values. By selecting 175 appropriate tokenization rules, potential attackers are forced to 176 guess n-character secrets n characters at a time rather than one 177 character at a time. Taking the example from Section 2.1, a 12- 178 character secret has a search space of 2^12 * 8 = 2^96 = 179 79,228,162,514,264,337,593,543,950,336 possible secrets (of course, 180 on average it would "only" take an attacker half that many guesses to 181 discover the secret by brute force). Even a Base64-encoded 12- 182 character secret has a search space of 64^12 = 183 4,722,366,482,869,645,213,696 possible secrets. 185 3.2. Delimiters & Tokenization Methods 187 The classic tokenization method splits an arbitrary-length octet 188 string by a subset of octets that act as "delimiters". The result is 189 typically a set of octet strings with the delimiter octets omitted. 191 In the likely case where full reconstruction of the original octet 192 string is desired, the delimiters obviously cannot be omitted. There 193 are three different choices, outlined below, for keeping the 194 delimiters. Each has their advantages and disadvantages. The best 195 method for a particular data type can be discovered by experimenting. 196 Note also that a sequence of consecutive delimiters are treated as a 197 single unit. 199 3.2.1. Method 1: Keep Delimiters as Separate Tokens 201 The first tokenization method simply keeps the delimiter(s) as 202 separate tokens. For example, consider the following string of ASCII 203 text to be tokenized by white space and puncuation. 205 ASCII string to be tokenized: 207 "The quick; brown fox." 209 Resulting token set after Method 1: 211 { "The", " ", "quick", "; ", "brown", " ", "fox", "." } 213 3.2.2. Method 2: Keep Delimiters with Preceding Token 215 The second tokenization method keeps the delimiter(s) with the token 216 which immediately precedes the delimiter(s). 218 Resulting token set after Method 2: 220 { "The ", "quick; ", "brown ", "fox." } 222 3.2.3. Method 3: Keep Delimiters with Succeeding Token 224 The third tokenization method keeps the delimiter(s) with the token 225 which immediately succeeds the delimiter(s). 227 Resulting token set after Method 2: 229 { "The", " quick", "; brown", " fox", "." } 231 3.3. Tokenization Rules 233 Note that EZFLATE is agnostic to tokenization methods and delimiter 234 sets. Appropriate tokenization rules and delimiter sets depend on 235 the type of data to be compressed. For a particular data type, a 236 global set of generally applicable tokenization rules ought to be 237 selected if at all possible. This makes it easier for decompressors 238 to verify, if desired, that the compressed data conforms to the rules 239 and delimiter sets. However, specialized rules might be necessary 240 for special cases, particularly cases which might negatively affect 241 security. The selection of appropriate rules and delimiter sets for 242 a particular data type is beyond the scope of this document. 244 4. EZFLATE Compression Algorithm Details 246 The EZFLATE algorithm, described in this Section, is simply an 247 alternate to the algorithm described in Section 4 of [DEFLATE], it is 248 not intended to replace that algorithm, which provides better 249 general-purpose compression. 251 The compressor MUST always operates on a single token (octet string). 252 The compressor uses a chained hash table to find duplicated tokens. 253 Any hash function may be used and the hash function may operate on 254 the entire octet string or any subset. The compressor also uses a 255 separate "length table" to track the token lengths for each position 256 in the input window. At any given point during compression, let T be 257 the next token at position P with length L (not necessarily all 258 different octets, of course). First, the compressor computes the 259 hash value of the token T and updates the hash table with the hash 260 value of T and updates the "length table" with L at position P. Next, 261 the compressor examines the hash chain for the token T. If the chain 262 is empty, the compressor simply emits the L literal octets of T. If 263 the hash chain is not empty, this indicates that the token T (or, if 264 there was a hash collision, some other octet string with the same 265 hash value) has occurred recently. The compressor follows the hash 266 chain to find the first entry within the current window which has 267 length L (stored in the length table) and is octet-for-octet 268 equivalent to the L octets in T. If an exact match is found, the 269 compressor may emit a LZ77 tuple, where 270 the length is L and the backward distance is the position of the 271 current token minus the position of the exactly matching token. If 272 no exact match is found, the compressor simply emits the L literal 273 octets of T. 275 Just as stated in Section 4 of [DEFLATE], "the compressor searches 276 the hash chains starting with the most recent [octet] strings, to 277 favor small distances and thus take advantage of the Huffman encoding 278 [HUFF]. The hash chains are singly linked. There are no deletions 279 from the hash chains; the algorithm simply discards matches that are 280 too old. To avoid a worst-case situation, very long hash chains are 281 arbitrarily truncated at a certain length, determined by a run-time 282 parameter." 284 To improve overall compression, the compressor may optionally defer 285 emitting a match in order to find a longer match that is a 286 consecutive sequence of matching tokens. After a set of one or more 287 matches with positions Pm ... Pn have been found for the current 288 token T1 at position P1 with length L1, the compressor may defer 289 emitting the match and examine the next token T2 to determine if that 290 token appears at any of the positions Pm + L1 ... Pn + L1, i.e. 291 directly after any of the matches for T1. It is critical to note 292 that the same rules apply for checking matches of T2 at positions Pm 293 + L1 ... Pn + L1 as if the compressor were finding matches for T2 294 using the hash chain. In particular the length of a given token at 295 position Pi + L1 must be equal to L2 _and_ be octet-for-octet 296 equivalent to the L2 octets in T2. The compressor reduces the set of 297 matching positions in Pm ... Pn to only include those which had an 298 exact match for T2 at a given position Pi + L1. If one or more 299 matches remains, the process is repeated until either the set of 300 deferred matches cannot be extended by the next token or the length 301 of the next token is such that the overall length of the match would 302 exceed the maximum length of a match (258 octets) allowed by the 303 DEFLATE format (see Section 3.2.5 of [DEFLATE]). Note that the 304 compressor must keep enough information about the currently deferred 305 match(es), that once the final match is found the total length and 306 backward distance are known or can be computed. Once the set of 307 matches cannot be extended any longer, the compressor returns to the 308 normal process of searching for matches using the hash table. 310 Just as stated in Section 4 of [DEFLATE], "the compressor terminates 311 a block when it determines that starting a new block with fresh trees 312 would be useful, or when the block size fills up the compressor's 313 block buffer." 314 A token T with length L less than the minimum length of a match (3 315 octets) allowed by the DEFLATE format (see Section 3.2.5 of 316 [DEFLATE]) is processed by simply updating the "length table" and 317 emiting the L literal octets. These short tokens should, however, 318 participate in extending deferred matches. The same matching rules 319 apply as for matching tokens in all other situations, namely the 320 length of the potentially matching token stored in the "length table" 321 must be equal to L and the L octets of the stored token must be 322 octet-for-octet equivalent to the L octets in T. 324 Note that the formatting rules and static/dynamic Huffman encoding 325 rules in RFC 1951 [DEFLATE], apply to any implementation of the 326 EZFLATE algorithm described here. 328 5. Security Considerations 330 Care must be taken when implementing deferred matching such that 331 checking for matches of the current token against the octets 332 immediately following the deferred set of matches not only checks for 333 octet equivalency, but length equivalency as well. If the length is 334 not also checked, an attacker would be able to perform character-by- 335 character guessing by injecting the token immediately preceding the 336 secret, which in many cases may be a known identifier. 338 [TODO: A lot of the remaining text is completely lifted from the 339 HPACK Internet Draft; re-write or give credit somehow.] 341 An EZFLATE compressor can still act as an oracle to an attacker 342 probing the compression context. However, the attacker can only find 343 out if the guess of the entire token is correct or not. 345 Still, an attacker could take advantage of this limited information 346 for breaking low-entropy secrets using a brute-force attack. A 347 server usually has some protections against such brute-force attack. 348 Here, the attack would target the client, where it would be harder to 349 detect. The attack would be even more dangerous if the attacker is 350 able to prevent the traffic generated by its brute-force attack from 351 reaching the server. 353 To offer protection against such type of attacks, users of an EZFLATE 354 compressor SHOULD use non-compressed DEFLATE blocks (see Section 355 3.2.4 of [DEFLATE]) disable compression of any low-entropy secrets 356 which could be put at risk by a brute-force attack. 358 There is currently no known threat taking advantage of the use of a 359 fixed Huffman encoding. A study has shown that using a fixed Huffman 360 encoding table created an information leakage, however this same 361 study concluded that an attacker could not take advantage of this 362 information leakage to recover any meaningful amount of information 363 (see [PETAL]). 365 6. Acknowledgements 367 Roberto Peon and Herve Ruellan. 369 7. References 371 7.1. Normative References 373 [DEFLATE] Deutsch, P., "DEFLATE Compressed Data Format Specification 374 version 1.3", RFC 1951, May 1996. 376 7.2. Informative References 378 [BREACH] Gluck, Y., "BREACH: REVIVING THE CRIME ATTACK", July 2013, 379 . 382 [CRIME] Rizzo, J. and T. Duong, "The CRIME Attack", 383 September 2012, . 388 [GZIP] Deutsch, P., "GZIP file format specification version 4.3", 389 RFC 1951, May 1996. 391 [HTTP-p1] Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer 392 Protocol (HTTP/1.1): Message Syntax and Routing", 393 draft-ietf-httpbis-p1-messaging-26 (work in progress), 394 February 2014. 396 [HUFF] Huffman, D., "A Method for the Construction of Minimum 397 Redundancy Codes", Proceedings of the Institute of Radio 398 Engineers Volume 40, Number 9, pp. 1098-1101, 399 September 1952, . 402 [MCGRUFF] McGruff, M., "About McGruff", 2014, 403 . 405 [PETAL] Tan, J. and J. Nahata, "PETAL: Preset Encoding Table 406 Information Leakage", April 2013, . 409 [SPDY] Belshe, M. and R. Peon, "SPDY Protocol", 410 draft-mbelshe-httpbis-spdy-00 (work in progress), 411 February 2012. 413 [ZIP] Katz, P., ".ZIP File Format Specification", 414 September 2012, 415 . 417 [ZLIB] Deutsch, P., "ZLIB Compressed Data Format Specification 418 version 3.3", RFC 1950, May 1996. 420 Authors' Addresses 422 Keith Shearl Morgan 423 International Atomic Energy Agency 425 EMail: k.morgan@iaea.org 427 Christoph Brunhuber 428 International Atomic Energy Agency 430 EMail: c.brunhuber@iaea.org