idnits 2.17.1 

draft-morgan-ezflate-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** There is 1 instance of too long lines in the document, the longest one
     being 2 characters in excess of 72.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords. 

     RFC 2119 keyword, line 251: '...  The compressor MUST always operates ...'
     RFC 2119 keyword, line 354: '...   compressor SHOULD use non-compresse...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 2, 2014) is 3613 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Downref: Normative reference to an Informational RFC: RFC 1951 (ref.
     'DEFLATE')

  -- Duplicate reference: RFC1951, mentioned in 'GZIP', was also mentioned in
     'DEFLATE'.


     Summary: 4 errors (**), 0 flaws (~~), 1 warning (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	HTTPbis                                                        K. Morgan
3	Internet-Draft                                              C. Brunhuber
4	Intended status: Standards Track                                    IAEA
5	Expires: December 4, 2014                                   June 2, 2014

7	                EZFLATE: Token-based DEFLATE Compression
8	                      draft-morgan-ezflate-00

10	Abstract

12	   This specification defines EZFLATE, a token-based DEFLATE compression
13	   algorithm for secure compression within encrypted communication
14	   channels.

16	Editorial Note (To be removed by RFC Editor)

18	   Discussion of this draft takes place on the HTTPBIS working group
19	   mailing list (ietf-http-wg@w3.org), which is archived at
20	   <http://lists.w3.org/Archives/Public/ietf-http-wg/>.

22	   Working Group information can be found at
23	   <http://tools.ietf.org/wg/httpbis/>;

25	Status of This Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at http://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on December 4, 2014.

42	Copyright Notice

44	   Copyright (c) 2014 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (http://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . 3
60	   2.  Compression Attacks . . . . . . . . . . . . . . . . . . . . . . 3
61	     2.1.  CRIME . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
62	     2.2.  BREACH  . . . . . . . . . . . . . . . . . . . . . . . . . . 4
63	   3.  Tokenization  . . . . . . . . . . . . . . . . . . . . . . . . . 4
64	     3.1.  "Take a Bite out of CRIME"  . . . . . . . . . . . . . . . . 4
65	     3.2.  Delimiters & Tokenization Methods . . . . . . . . . . . . . 5
66	       3.2.1.  Method 1: Keep Delimiters as Separate Tokens  . . . . . 5
67	       3.2.2.  Method 2: Keep Delimiters with Preceding Token  . . . . 5
68	       3.2.3.  Method 3: Keep Delimiters with Succeeding Token . . . . 6
69	     3.3.  Tokenization Rules  . . . . . . . . . . . . . . . . . . . . 6
70	   4.  EZFLATE Compression Algorithm Details . . . . . . . . . . . . . 6
71	   5.  Security Considerations . . . . . . . . . . . . . . . . . . . . 8
72	   6.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . 9
73	   7.  References  . . . . . . . . . . . . . . . . . . . . . . . . . . 9
74	     7.1.  Normative References  . . . . . . . . . . . . . . . . . . . 9
75	     7.2.  Informative References  . . . . . . . . . . . . . . . . . . 9

77	1.  Introduction

79	   DEFLATE [DEFLATE] is one of the most widely used compression
80	   mechanisms.  It forms the basis of ZIP [ZIP], GZIP [GZIP] and ZLIB
81	   [ZLIB].  However, DEFLATE poses a security vulnerability when
82	   messages comprised of secret and attacker-controlled data are
83	   delivered through a secure communication channel, as demonstrated by
84	   the CRIME [CRIME] and BREACH [BREACH] attacks.

86	   A key observation of the CRIME and BREACH attacks is that they
87	   exploited the DEFLATE _algorithm_ as described in RFC 1951 [DEFLATE],
88	   not the DEFLATE _format_ (also described in RFC 1951).

90	   The first paragraph of Section 4, "Compression algorithm details", in
91	   RFC 1951 explicitly states that the "deflate" format is not tied to
92	   any particular algorithm.  To quote, it says, "While it is the intent
93	   of this document to define the 'deflate' compressed data format
94	   without reference to any particular compression algorithm, the format
95	   is related to the compressed formats produced by LZ77 ... it is
96	   strongly recommended that the implementor of a compressor follow the
97	   general algorithm presented here ... [however] a compressor need not
98	   follow it in order to be compliant [DEFLATE]."

100	   This document describes EZFLATE, a token-based DEFLATE _algorithm_.
101	   The key feature of the EZFLATE algorithm (and primary difference to
102	   the algorithm described in RFC 1951), is that LZ77 <length, backward
103	   distance> tuples may only reference a duplicated token (octet string)
104	   occurring in a previous block, up to 32K input bytes before, if the
105	   current token exactly matches in length _and_ octet values.  In other
106	   words, LZ77 tuples must not reference partial matches, neither
107	   smaller nor longer, occurring in a previous block.

109	2.  Compression Attacks

111	2.1.  CRIME

113	   SPDY [SPDY] used DEFLATE [DEFLATE] to compress redundant HTTP
114	   [HTTP-p1] headers.  The CRIME [CRIME] attack exploited DEFLATE to
115	   guess secret information contained in the HTTP request headers (e.g.
116	   "Cookies").  Since DEFLATE looks for matches character-by-character,
117	   the attackers were able to guess the secret one character at a time.
118	   Each failed guess resulted in a compressed block of size n.  Correct
119	   guesses resulted in a compressed block of size m < n.  For example,
120	   consider the following sample HTTP requests (taken from [CRIME]):

122	   Attacker guesses the first character is 'a':

124	   GET /twid=a
125	   Host: twitter.com
126	   User-Agent: Chrome
127	   Cookie: twid=secret

129	   ... (more guesses) ...

131	   Attacker guesses the first character is 's':

133	   GET /twid=s
134	   Host: twitter.com
135	   User-Agent: Chrome
136	   Cookie: twid=secret

138	   After guessing 's' is the first character of the secret, the attacker
139	   sees a corresponding increase in compression (i.e. decrease in
140	   length) indicating the guess is correct.  The procedure is repeated
141	   until the entire secret is revealed.

143	   Although this is a brute-force attack of sorts, the average number of
144	   guesses required to extract a secret is considerably lower, and
145	   therefore tractable, thanks to being able to guess character by
146	   character.  If the number of possible characters is 256, then on
147	   average it will require 128 guesses per character to make a correct
148	   guess.  An n-character secret then only requires, on average, n * 128
149	   guesses to discover the secret.  For example, a 12-character secret
150	   requires 12 * 128 = 1,536 guesses.  In many cases the set of possible
151	   characters is smaller than 256, reducing the search space even more!

153	2.2.  BREACH

155	   While CRIME exploited SPDY (and TLS compression too), BREACH [BREACH]
156	   exploited compression of HTTP _response bodies_.  The mechanism of
157	   BREACH is similar to CRIME with slightly different details, therefore
158	   a detailed review of BREACH will not be given here.

160	3.  Tokenization

162	3.1.  "Take a Bite out of CRIME"

164	   Taking a "bite out of crime" requires more than a security dog named
165	   McGruff [MCGRUFF].  EZFLATE fights CRIME by using token matching,
166	   rather than character-by-character matching, to make the search space
167	   for an attacker equivalent to a full brute-force attack.  At that
168	   point there are easier ways to execute the brute-force attack.

170	   As stated in Section 1, the key feature of the the EZFLATE algorithm
171	   (and primary difference to the algorithm described in RFC 1951), is
172	   that LZ77 <length, backward distance> tuples may only reference a
173	   duplicated token (octet string) if the current token and referenced
174	   token exactly match in length _and_ octet values.  By selecting
175	   appropriate tokenization rules, potential attackers are forced to
176	   guess n-character secrets n characters at a time rather than one
177	   character at a time.  Taking the example from Section 2.1, a 12-
178	   character secret has a search space of 2^12 * 8 = 2^96 =
179	   79,228,162,514,264,337,593,543,950,336 possible secrets (of course,
180	   on average it would "only" take an attacker half that many guesses to
181	   discover the secret by brute force).  Even a Base64-encoded 12-
182	   character secret has a search space of 64^12 =
183	   4,722,366,482,869,645,213,696 possible secrets.

185	3.2.  Delimiters & Tokenization Methods

187	   The classic tokenization method splits an arbitrary-length octet
188	   string by a subset of octets that act as "delimiters".  The result is
189	   typically a set of octet strings with the delimiter octets omitted.

191	   In the likely case where full reconstruction of the original octet
192	   string is desired, the delimiters obviously cannot be omitted.  There
193	   are three different choices, outlined below, for keeping the
194	   delimiters.  Each has their advantages and disadvantages.  The best
195	   method for a particular data type can be discovered by experimenting.
196	   Note also that a sequence of consecutive delimiters are treated as a
197	   single unit.

199	3.2.1.  Method 1: Keep Delimiters as Separate Tokens

201	   The first tokenization method simply keeps the delimiter(s) as
202	   separate tokens.  For example, consider the following string of ASCII
203	   text to be tokenized by white space and puncuation.

205	   ASCII string to be tokenized:

207	                   "The quick; brown fox."

209	   Resulting token set after Method 1:

211	                   { "The", " ", "quick", "; ", "brown", " ", "fox", "." }

213	3.2.2.  Method 2: Keep Delimiters with Preceding Token

215	   The second tokenization method keeps the delimiter(s) with the token
216	   which immediately precedes the delimiter(s).

218	   Resulting token set after Method 2:

220	                   { "The ", "quick; ", "brown ", "fox." }

222	3.2.3.  Method 3: Keep Delimiters with Succeeding Token

224	   The third tokenization method keeps the delimiter(s) with the token
225	   which immediately succeeds the delimiter(s).

227	   Resulting token set after Method 2:

229	                   { "The", " quick", "; brown", " fox", "." }

231	3.3.  Tokenization Rules

233	   Note that EZFLATE is agnostic to tokenization methods and delimiter
234	   sets.  Appropriate tokenization rules and delimiter sets depend on
235	   the type of data to be compressed.  For a particular data type, a
236	   global set of generally applicable tokenization rules ought to be
237	   selected if at all possible.  This makes it easier for decompressors
238	   to verify, if desired, that the compressed data conforms to the rules
239	   and delimiter sets.  However, specialized rules might be necessary
240	   for special cases, particularly cases which might negatively affect
241	   security.  The selection of appropriate rules and delimiter sets for
242	   a particular data type is beyond the scope of this document.

244	4.  EZFLATE Compression Algorithm Details

246	   The EZFLATE algorithm, described in this Section, is simply an
247	   alternate to the algorithm described in Section 4 of [DEFLATE], it is
248	   not intended to replace that algorithm, which provides better
249	   general-purpose compression.

251	   The compressor MUST always operates on a single token (octet string).
252	   The compressor uses a chained hash table to find duplicated tokens.
253	   Any hash function may be used and the hash function may operate on
254	   the entire octet string or any subset.  The compressor also uses a
255	   separate "length table" to track the token lengths for each position
256	   in the input window.  At any given point during compression, let T be
257	   the next token at position P with length L (not necessarily all
258	   different octets, of course).  First, the compressor computes the
259	   hash value of the token T and updates the hash table with the hash
260	   value of T and updates the "length table" with L at position P. Next,
261	   the compressor examines the hash chain for the token T. If the chain
262	   is empty, the compressor simply emits the L literal octets of T. If
263	   the hash chain is not empty, this indicates that the token T (or, if
264	   there was a hash collision, some other octet string with the same
265	   hash value) has occurred recently.  The compressor follows the hash
266	   chain to find the first entry within the current window which has
267	   length L (stored in the length table) and is octet-for-octet
268	   equivalent to the L octets in T. If an exact match is found, the
269	   compressor may emit a LZ77 <length, backward distance> tuple, where
270	   the length is L and the backward distance is the position of the
271	   current token minus the position of the exactly matching token.  If
272	   no exact match is found, the compressor simply emits the L literal
273	   octets of T.

275	   Just as stated in Section 4 of [DEFLATE], "the compressor searches
276	   the hash chains starting with the most recent [octet] strings, to
277	   favor small distances and thus take advantage of the Huffman encoding
278	   [HUFF].  The hash chains are singly linked.  There are no deletions
279	   from the hash chains; the algorithm simply discards matches that are
280	   too old.  To avoid a worst-case situation, very long hash chains are
281	   arbitrarily truncated at a certain length, determined by a run-time
282	   parameter."

284	   To improve overall compression, the compressor may optionally defer
285	   emitting a match in order to find a longer match that is a
286	   consecutive sequence of matching tokens.  After a set of one or more
287	   matches with positions Pm ...  Pn have been found for the current
288	   token T1 at position P1 with length L1, the compressor may defer
289	   emitting the match and examine the next token T2 to determine if that
290	   token appears at any of the positions Pm + L1 ...  Pn + L1, i.e.
291	   directly after any of the matches for T1.  It is critical to note
292	   that the same rules apply for checking matches of T2 at positions Pm
293	   + L1 ...  Pn + L1 as if the compressor were finding matches for T2
294	   using the hash chain.  In particular the length of a given token at
295	   position Pi + L1 must be equal to L2 _and_ be octet-for-octet
296	   equivalent to the L2 octets in T2.  The compressor reduces the set of
297	   matching positions in Pm ...  Pn to only include those which had an
298	   exact match for T2 at a given position Pi + L1.  If one or more
299	   matches remains, the process is repeated until either the set of
300	   deferred matches cannot be extended by the next token or the length
301	   of the next token is such that the overall length of the match would
302	   exceed the maximum length of a match (258 octets) allowed by the
303	   DEFLATE format (see Section 3.2.5 of [DEFLATE]).  Note that the
304	   compressor must keep enough information about the currently deferred
305	   match(es), that once the final match is found the total length and
306	   backward distance are known or can be computed.  Once the set of
307	   matches cannot be extended any longer, the compressor returns to the
308	   normal process of searching for matches using the hash table.

310	   Just as stated in Section 4 of [DEFLATE], "the compressor terminates
311	   a block when it determines that starting a new block with fresh trees
312	   would be useful, or when the block size fills up the compressor's
313	   block buffer."
314	   A token T with length L less than the minimum length of a match (3
315	   octets) allowed by the DEFLATE format (see Section 3.2.5 of
316	   [DEFLATE]) is processed by simply updating the "length table" and
317	   emiting the L literal octets.  These short tokens should, however,
318	   participate in extending deferred matches.  The same matching rules
319	   apply as for matching tokens in all other situations, namely the
320	   length of the potentially matching token stored in the "length table"
321	   must be equal to L and the L octets of the stored token must be
322	   octet-for-octet equivalent to the L octets in T.

324	   Note that the formatting rules and static/dynamic Huffman encoding
325	   rules in RFC 1951 [DEFLATE], apply to any implementation of the
326	   EZFLATE algorithm described here.

328	5.  Security Considerations

330	   Care must be taken when implementing deferred matching such that
331	   checking for matches of the current token against the octets
332	   immediately following the deferred set of matches not only checks for
333	   octet equivalency, but length equivalency as well.  If the length is
334	   not also checked, an attacker would be able to perform character-by-
335	   character guessing by injecting the token immediately preceding the
336	   secret, which in many cases may be a known identifier.

338	   [TODO: A lot of the remaining text is completely lifted from the
339	   HPACK Internet Draft; re-write or give credit somehow.]

341	   An EZFLATE compressor can still act as an oracle to an attacker
342	   probing the compression context.  However, the attacker can only find
343	   out if the guess of the entire token is correct or not.

345	   Still, an attacker could take advantage of this limited information
346	   for breaking low-entropy secrets using a brute-force attack.  A
347	   server usually has some protections against such brute-force attack.
348	   Here, the attack would target the client, where it would be harder to
349	   detect.  The attack would be even more dangerous if the attacker is
350	   able to prevent the traffic generated by its brute-force attack from
351	   reaching the server.

353	   To offer protection against such type of attacks, users of an EZFLATE
354	   compressor SHOULD use non-compressed DEFLATE blocks (see Section
355	   3.2.4 of [DEFLATE]) disable compression of any low-entropy secrets
356	   which could be put at risk by a brute-force attack.

358	   There is currently no known threat taking advantage of the use of a
359	   fixed Huffman encoding.  A study has shown that using a fixed Huffman
360	   encoding table created an information leakage, however this same
361	   study concluded that an attacker could not take advantage of this
362	   information leakage to recover any meaningful amount of information
363	   (see [PETAL]).

365	6.  Acknowledgements

367	   Roberto Peon and Herve Ruellan.

369	7.  References

371	7.1.  Normative References

373	   [DEFLATE]  Deutsch, P., "DEFLATE Compressed Data Format Specification
374	              version 1.3", RFC 1951, May 1996.

376	7.2.  Informative References

378	   [BREACH]   Gluck, Y., "BREACH: REVIVING THE CRIME ATTACK", July 2013,
379	              <http://breachattack.com/resources/
380	              BREACH%20-%20SSL,%20gone%20in%2030%20seconds.pdf>.

382	   [CRIME]    Rizzo, J. and T. Duong, "The CRIME Attack",
383	              September 2012, <https://docs.google.com/a/twist.com/
384	              presentation/d/
385	              11eBmGiHbYcHR9gL5nDyZChu_-lCa2GizeuOfaLU2HOU/
386	              edit#slide=id.g1eb6c1b5_3_6>.

388	   [GZIP]     Deutsch, P., "GZIP file format specification version 4.3",
389	              RFC 1951, May 1996.

391	   [HTTP-p1]  Fielding, R., Ed. and J. Reschke, Ed., "Hypertext Transfer
392	              Protocol (HTTP/1.1): Message Syntax and Routing",
393	              draft-ietf-httpbis-p1-messaging-26 (work in progress),
394	              February 2014.

396	   [HUFF]     Huffman, D., "A Method for the Construction of Minimum
397	              Redundancy Codes", Proceedings of the Institute of Radio
398	              Engineers Volume 40, Number 9, pp. 1098-1101,
399	              September 1952, <http://ieeexplore.ieee.org/xpl/
400	              articleDetails.jsp?arnumber=4051119>.

402	   [MCGRUFF]  McGruff, M., "About McGruff", 2014,
403	              <http://www.ncpc.org/about/about-mcgruff>.

405	   [PETAL]    Tan, J. and J. Nahata, "PETAL: Preset Encoding Table
406	              Information Leakage", April 2013, <http://www.pdl.cmu.edu/
407	              PDL-FTP/associated/CMU-PDL-13-106.pdf>.

409	   [SPDY]     Belshe, M. and R. Peon, "SPDY Protocol",
410	              draft-mbelshe-httpbis-spdy-00 (work in progress),
411	              February 2012.

413	   [ZIP]      Katz, P., ".ZIP File Format Specification",
414	              September 2012,
415	              <http://www.pkware.com/documents/casestudies/APPNOTE.TXT>.

417	   [ZLIB]     Deutsch, P., "ZLIB Compressed Data Format Specification
418	              version 3.3", RFC 1950, May 1996.

420	Authors' Addresses

422	   Keith Shearl Morgan
423	   International Atomic Energy Agency

425	   EMail: k.morgan@iaea.org

427	   Christoph Brunhuber
428	   International Atomic Energy Agency

430	   EMail: c.brunhuber@iaea.org