

2	Working Group                                                   A. Barth
3	Internet-Draft                                             U.C. Berkeley
4	Expires: December 2, 2009                                     I. Hickson
5	                                                            Google, Inc.
6	                                                            May 31, 2009

8	                     Content-Type Processing Model
9	                       draft-abarth-mime-sniff-01

11	Status of this Memo

13	   This Internet-Draft is submitted to IETF in full conformance with the
14	   provisions of BCP 78 and BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time.  It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on December 2, 2009.

34	Copyright Notice

36	   Copyright (c) 2009 IETF Trust and the persons identified as the
37	   document authors.  All rights reserved.

39	   This document is subject to BCP 78 and the IETF Trust's Legal
40	   Provisions Relating to IETF Documents in effect on the date of
41	   publication of this document (http://trustee.ietf.org/license-info).
42	   Please review these documents carefully, as they describe your rights
43	   and restrictions with respect to this document.

45	Abstract

47	   Many web servers supply incorrect Content-Type headers with their
48	   HTTP responses.  In order to be compatible with these servers, user
49	   agents must consider the content of HTTP responses as well as the
50	   Content-Type header when determining the effective media type of the
51	   response.  This document describes an algorithm for determining the
52	   effective media type of HTTP responses that balances security and
53	   compatibility considerations.

55	Table of Contents

57	   1.  Introduction
58	   2.  Metadata
59	   3.  Web Pages
60	   4.  Text or Binary
61	   5.  Unknown Type
62	   6.  Image
63	   7.  Feed or HTML
64	   Authors' Addresses

66	1.  Introduction

68	   The HTTP Content-Type header indicates the media type of an HTTP
69	   response.  However, many HTTP servers supply a Content-Type that does
70	   not match the actual contents of the response.  Historically, web
71	   browsers have tolerated these servers by examining the content
72	   of HTTP responses in addition to the Content-Type header to determine
73	   the effective media type of the response.

75	   Without a clear specification of how to "sniff" the media type, each
76	   user agent implementor was forced to reverse engineer the behavior of
77	   the other user agents and to develop their own algorithm.  These
78	   divergent algorithms have led to a lack of interoperability between
79	   user agents and to security issues when the server intends an HTTP
80	   response to be interpreted as one media type but some user agents
81	   interpret the response as another media type.

83	   These security issues are most severe when an "honest" server lets
84	   potentially malicious users upload files and then serves the contents
85	   of those files with a low-privilege media type (such as text/plain or
86	   image/jpeg).  (Malicious servers, of course, can specify an arbitrary
87	   media type in the Content-Type header.)  In the absence of MIME
88	   sniffing, this user-generated content would not be interpreted as a
89	   high-privilege media type, such as text/html.  However, if a user
90	   agent does interpret a low-privilege media type, such as image/gif,
91	   as a high-privilege media type, such as text/html, the user agent has
92	   created a privilege escalation vulnerability in the server.  For
93	   example, a malicious user might be able to leverage content sniffing
94	   to mount a cross-site scripting attack by including JavaScript code
95	   in the uploaded file that a user agent treats as text/html.

97	   This document describes a content sniffing algorithm that carefully
98	   balances the compatibility needs of user agent implementors with the
99	   security constraints.  The algorithm has been constructed with
100	   reference to content sniffing algorithms present in popular user
101	   agents, an extensive database of existing web content, and metrics
102	   collected from implementations deployed to a sizable number of users.

104	   WARNING!  Whenever possible, user agents should avoid employing a
105	   content sniffing algorithm.  However, if the user agent does employ a
106	   content sniffing algorithm, it is imperative that the algorithm in
107	   this document be followed exactly.  When a user agent uses different
108	   heuristics for media type detection than the server expects, security
109	   problems can occur.  For example, if a server believes that the
110	   client will treat a contributed file as an image (and thus treat it
111	   as benign), but a user agent believes the content to be HTML (and
112	   thus privileged to execute any scripts contained therein), an
113	   attacker might be able to steal the user's authentication credentials
114	   and mount other cross-site scripting attacks.

116	2.  Metadata

118	   What explicit Content-Type metadata is associated with the resource
119	   (the resource's type information) depends on the protocol that was
120	   used to fetch the resource.

122	   For HTTP resources, only the last Content-Type HTTP header, if any,
123	   contributes any type information; the official type of the resource
124	   is then the value of that header, interpreted as described by the
125	   HTTP specifications.  If the Content-Type HTTP header is present but
126	   the value of the last such header cannot be interpreted as described
127	   by the HTTP specifications (e.g. because its value doesn't contain a
128	   U+002F SOLIDUS ('/') character), then the resource has no type
129	   information (even if there are multiple Content-Type HTTP headers and
130	   one of the other ones is syntactically correct).
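
   The HTTP rule above can be sketched as follows.  This is a minimal
   illustration, not a definitive implementation: the function name
   official_type_information and the (name, value) header list are
   assumptions made for the example, and the validity check is reduced
   to the single condition the text mentions (the presence of a "/").

```python
from typing import List, Optional, Tuple


def official_type_information(headers: List[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the last Content-Type header, or None.

    Hypothetical helper: only the last Content-Type header contributes
    type information. If that last value cannot be interpreted (here
    simplified to "contains no '/'"), the resource has no type
    information, even if an earlier header was well formed.
    """
    values = [value for (name, value) in headers
              if name.lower() == "content-type"]
    if not values:
        return None
    last = values[-1]
    # Simplified validity check (assumption): a real user agent would
    # apply the full HTTP media-type grammar here.
    if "/" not in last:
        return None
    return last
```

   Note that an earlier, syntactically correct header is never used as
   a fallback when the last one is malformed.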

132	   For resources fetched from the file system, user agents should use
133	   platform-specific conventions, e.g. operating system file extension/
134	   type mappings.

136	   File extensions MUST NOT be used for determining resource types for
137	   resources fetched over HTTP.

139	   For resources fetched over most other protocols, e.g.  FTP, there is
140	   no type information.

142	   The algorithm for extracting an encoding from a Content-Type, given a
143	   string s, is as follows.  It either returns an encoding or nothing.

145	   1.  Find the first seven characters in s that are an ASCII case-
146	       insensitive match for the word "charset".  If no such match is
147	       found, return nothing.

149	   2.  Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters
150	       that immediately follow the word 'charset' (there might not be
151	       any).

153	   3.  If the next character is not a U+003D EQUALS SIGN ('='), return
154	       nothing.

156	   4.  Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters
157	       that immediately follow the equals sign (there might not be any).

159	   5.  Process the next character as follows:

161	       *  If it is a U+0022 QUOTATION MARK ('"') and there is a later
162	          U+0022 QUOTATION MARK ('"') in s, or

164	       *  If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027
165	          APOSTROPHE ("'") in s

167	             Return the string between this character and the next
168	             earliest occurrence of this character.

170	       *  If it is an unmatched U+0022 QUOTATION MARK ('"'),

172	       *  If it is an unmatched U+0027 APOSTROPHE ("'"), or

174	       *  If there is no next character

176	             Return nothing.

178	       *  Otherwise

180	             Return the string from this character to the first U+0009,
181	             U+000A, U+000C, U+000D, U+0020, or U+003B character or the
182	             end of s, whichever comes first.

184	   Note: The above algorithm is a willful violation of the HTTP
185	   specification.  [RFC2616]
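
   The five extraction steps above can be sketched in Python as shown
   below.  The function name extract_encoding is an assumption made for
   this example, and "nothing" is represented by None.

```python
import re
from typing import Optional

# U+0009, U+000A, U+000C, U+000D, U+0020
_WHITESPACE = "\t\n\x0c\r "


def extract_encoding(s: str) -> Optional[str]:
    """Sketch of the charset extraction steps; returns None for "nothing"."""
    # Step 1: first ASCII case-insensitive match for "charset".
    match = re.search("charset", s, re.IGNORECASE | re.ASCII)
    if match is None:
        return None
    pos = match.end()
    # Step 2: skip whitespace after the word.
    while pos < len(s) and s[pos] in _WHITESPACE:
        pos += 1
    # Step 3: the next character has to be "=".
    if pos >= len(s) or s[pos] != "=":
        return None
    pos += 1
    # Step 4: skip whitespace after the equals sign.
    while pos < len(s) and s[pos] in _WHITESPACE:
        pos += 1
    # Step 5: no next character means nothing is returned.
    if pos >= len(s):
        return None
    ch = s[pos]
    if ch in "\"'":
        # Quoted value: needs a later matching quote, else nothing.
        end = s.find(ch, pos + 1)
        if end == -1:
            return None
        return s[pos + 1:end]
    # Unquoted value: runs to the first whitespace, ";", or end of s.
    end = pos
    while end < len(s) and s[end] not in _WHITESPACE + ";":
        end += 1
    return s[pos:end]
```

   For instance, under this sketch the input
   'text/html; CHARSET = "iso-8859-1"' yields "iso-8859-1", while an
   unterminated quote such as "charset='utf-8" yields nothing.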

187	3.  Web Pages

189	   The /sniffed type/ of a resource MUST be found as follows:

191	   1.  Let /official type/ be the type given by the Content-Type
192	       metadata for the resource, ignoring parameters.  Comparisons with
193	       this type, as defined by MIME specifications, are done in an
194	       ASCII case-insensitive manner.  [RFC2046]

196	   2.  If the user agent is configured to strictly obey Content-Type
197	       headers for this resource, then jump to the last step in this set
198	       of steps.

200	   3.  If the resource was fetched over an HTTP protocol and there is an
201	       HTTP Content-Type header and the value of the last such header
202	       has bytes that exactly match one of the following lines:

204	      +-------------------------------+--------------------------------+
205	      | Bytes in Hexadecimal          | Textual Representation         |
206	      +-------------------------------+--------------------------------+
207	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain                     |
208	      +-------------------------------+--------------------------------+
209	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=ISO-8859-1 |
210	      | 3b 20 63 68 61 72 73 65 74 3d |                                |
211	      | 49 53 4f 2d 38 38 35 39 2d 31 |                                |
212	      +-------------------------------+--------------------------------+
213	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=iso-8859-1 |
214	      | 3b 20 63 68 61 72 73 65 74 3d |                                |
215	      | 69 73 6f 2d 38 38 35 39 2d 31 |                                |
216	      +-------------------------------+--------------------------------+
217	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=UTF-8      |
218	      | 3b 20 63 68 61 72 73 65 74 3d |                                |
219	      | 55 54 46 2d 38                |                                |
220	      +-------------------------------+--------------------------------+

222	       ...then jump to the "text or binary" section below.

224	   4.  If there is no /official type/, jump to the "unknown type"
225	       section below.

227	   5.  If /official type/ is "unknown/unknown", "application/unknown",
228	       or "*/*", jump to the "unknown type" section below.

230	   6.  If /official type/ ends in "+xml", or if it is either "text/xml"
231	       or "application/xml", then the /sniffed type/ of the resource is
232	       /official type/; return that and abort these steps.

234	   7.  If /official type/ is an image type supported by the user agent
235	       (e.g. "image/png", "image/gif", "image/jpeg", etc.), then jump to
236	       the "image" section below, passing it the /official type/.

238	   8.  If /official type/ is "text/html", then jump to the feed or HTML
239	       section below.

241	   9.  The /sniffed type/ of the resource is /official type/.
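
   The nine steps above amount to a dispatch on the Content-Type
   value.  The sketch below is one hedged way to express that dispatch.
   The function name dispatch, the returned (action, type) tuple, the
   action labels, and the SUPPORTED_IMAGES set are all assumptions made
   for illustration.  Lowercasing stands in for the ASCII
   case-insensitive comparison of step 1.

```python
from typing import FrozenSet, Optional, Tuple

# Step 3: exact, byte-for-byte header values from the table.
TEXT_OR_BINARY_TRIGGERS = frozenset({
    b"text/plain",
    b"text/plain; charset=ISO-8859-1",
    b"text/plain; charset=iso-8859-1",
    b"text/plain; charset=UTF-8",
})
UNKNOWN_TYPES = frozenset({"unknown/unknown", "application/unknown", "*/*"})
# Assumption: stand-in for "image types supported by the user agent".
SUPPORTED_IMAGES = frozenset({"image/png", "image/gif", "image/jpeg"})


def dispatch(raw_value: Optional[bytes],
             over_http: bool = True,
             strict: bool = False,
             supported_images: FrozenSet[str] = SUPPORTED_IMAGES,
             ) -> Tuple[str, Optional[str]]:
    """Return (action, official type) for the last Content-Type value."""
    # Step 1: official type, ignoring parameters.
    official = None
    if raw_value is not None:
        essence = raw_value.split(b";", 1)[0].strip()
        if b"/" in essence:
            official = essence.decode("latin-1").lower()
    # Step 2: strict mode jumps straight to the last step.
    if strict:
        return ("done", official)
    # Step 3: the four exact text/plain values.
    if over_http and raw_value in TEXT_OR_BINARY_TRIGGERS:
        return ("text-or-binary", official)
    # Steps 4 and 5: missing or placeholder types.
    if official is None or official in UNKNOWN_TYPES:
        return ("unknown", official)
    # Step 6: XML types are never sniffed.
    if official.endswith("+xml") or official in ("text/xml", "application/xml"):
        return ("done", official)
    # Step 7: supported image types.
    if official in supported_images:
        return ("image", official)
    # Step 8: text/html may turn out to be a feed.
    if official == "text/html":
        return ("feed-or-html", official)
    # Step 9: otherwise trust the official type.
    return ("done", official)
```

   Because step 3 compares bytes exactly, a value such as
   "text/plain; charset=utf-8" (lowercase charset) does not trigger the
   text-or-binary check in this sketch and falls through to step 9.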

243	4.  Text or Binary

245	   1.  The user agent MAY wait for 512 or more bytes of the resource to
246	       be available.

248	   2.  Let n be the smaller of either 512 or the number of bytes already
249	       available.

251	   3.  If n is greater than or equal to 3, and the first 2 or 3 bytes of
252	       the resource match one of the following byte sequences:

254	                   +----------------------+--------------+
255	                   | Bytes in Hexadecimal | Description  |
256	                   +----------------------+--------------+
257	                   | FE FF                | UTF-16BE BOM |
258	                   | FF FE                | UTF-16LE BOM |
259	                   | EF BB BF             | UTF-8 BOM    |
260	                   +----------------------+--------------+

262	       ...then the /sniffed type/ of the resource is "text/plain".
263	       Abort these steps.

265	   4.  If none of the first n bytes of the resource are binary data
266	       bytes then the /sniffed type/ of the resource is "text/plain".
267	       Abort these steps.

269	                         +-------------------------+
270	                         | Binary Data Byte Ranges |
271	                         +-------------------------+
272	                         | 0x00 -- 0x08            |
273	                         | 0x0B                    |
274	                         | 0x0E -- 0x1A            |
275	                         | 0x1C -- 0x1F            |
276	                         +-------------------------+

278	   5.  If the first bytes of the resource match one of the byte
279	       sequences in the "pattern" column of the table in the unknown
280	       type section below, ignoring any rows whose cell in the
281	       "security" column says "scriptable" (or "n/a"), then the /sniffed
282	       type/ of the resource is the type given in the corresponding cell
283	       in the "sniffed type" column on that row; abort these steps.

285	          WARNING!  It is critical that this step not ever return a
286	          scriptable type (e.g. text/html), as otherwise that would
287	          allow a privilege escalation attack.

289	   6.  Otherwise, the /sniffed type/ of the resource is "application/
290	       octet-stream".
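
   The text-or-binary steps above can be sketched as follows.  The
   function names are assumptions made for the example.  Step 5 depends
   on the signature table of the unknown type section, so it is
   represented here by a hypothetical callback, match_safe_signature,
   which is expected to consult only rows that are not scriptable.

```python
from typing import Callable, Optional

# Step 3 table: UTF-16BE, UTF-16LE, and UTF-8 byte order marks.
BOMS = (b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf")


def is_binary_data_byte(b: int) -> bool:
    """True if b falls in one of the binary data byte ranges of step 4."""
    return (b <= 0x08 or b == 0x0B
            or 0x0E <= b <= 0x1A or 0x1C <= b <= 0x1F)


def sniff_text_or_binary(
    data: bytes,
    match_safe_signature: Callable[[bytes], Optional[str]] = lambda prefix: None,
) -> str:
    """Sketch of the text-or-binary steps for the bytes available so far."""
    # Step 2: n is the smaller of 512 and the number of available bytes.
    prefix = data[:512]
    n = len(prefix)
    # Step 3: a leading BOM means text/plain.
    if n >= 3 and prefix.startswith(BOMS):
        return "text/plain"
    # Step 4: no binary data bytes in the first n bytes means text/plain.
    if not any(is_binary_data_byte(b) for b in prefix):
        return "text/plain"
    # Step 5: non-scriptable signatures only (hypothetical callback).
    sniffed = match_safe_signature(prefix)
    if sniffed is not None:
        return sniffed
    # Step 6: everything else is treated as opaque binary.
    return "application/octet-stream"
```

   Note that 0x1B (ESC) is not a binary data byte, so text containing
   terminal escape sequences is still classified as text/plain here.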

292	5.  Unknown Type

294	   1.  The user agent MAY wait for 512 or more bytes of the resource to
295	       be available.

297	   2.  Let /stream length/ be the smaller of either 512 or the number of
298	       bytes already available.

300	   3.  For each row in the table below:

302	       *  If the row has no "WS" bytes:

304	          1.  Let /pattern length/ be the length of the pattern (number
305	              of bytes described by the cell in the second column of the
306	              row).

308	          2.  If /stream length/ is smaller than /pattern length/ then
309	              skip this row.

311	          3.  Apply the "and" operator to the first /pattern length/
312	              bytes of the resource and the given mask (the bytes in the
313	              cell of the first column of that row), and let the result
314	              be the data.

316	          4.  If the bytes of the data match the given pattern bytes
317	              exactly, then the /sniffed type/ of the resource is the
318	              type given in the cell of the third column in that row;
319	              abort these steps.

321	       *  If the row has a "WS" byte:

323	          1.  Let /index pattern/ be an index into the mask and pattern
324	              byte strings of the row.

326	          2.  Let /index stream/ be an index into the byte stream being
327	              examined.

329	          3.  Loop: If /index stream/ points beyond the end of the byte
330	              stream, then this row doesn't match, skip this row.

332	          4.  Examine the /index stream/th byte of the byte stream as
333	              follows:

335	              -  If the /index pattern/th byte of the pattern is a
336	                 normal hexadecimal byte and not a "WS" byte:

338	                    If the "and" operator, applied to the /index
339	                    stream/th byte of the stream and the /index
340	                    pattern/th byte of the mask, yields a value
341	                    different from the /index pattern/th byte of the
342	                    pattern, then skip this row.

344	                    Otherwise, increment /index pattern/ to the next
345	                    byte in the mask and pattern and /index stream/ to
346	                    the next byte in the byte stream.

348	              -  Otherwise, if the /index pattern/th byte of the pattern
349	                 is a "WS" byte:

351	                    "WS" means "whitespace", and allows insignificant
352	                    whitespace to be skipped when sniffing for a type
353	                    signature.

355	                    If the /index stream/th byte of the stream is one of
356	                    0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF),
357	                    0x0D (ASCII CR), or 0x20 (ASCII space), then
358	                    increment only the /index stream/ to the next byte
359	                    in the byte stream.

361	                    Otherwise, increment only the /index pattern/ to the
362	                    next byte in the mask and pattern.

364	          5.  If /index pattern/ does not point beyond the end of the
365	              mask and pattern byte strings, then jump back to the loop
366	              step in this algorithm.

368	          6.  Otherwise, the /sniffed type/ of the resource is the type
369	              given in the cell of the third column in that row; abort
370	              these steps.

372	   4.  If none of the first /stream length/ bytes of the resource are
373	       binary data bytes, then the /sniffed type/ of the resource is
374	       "text/plain".  Abort these steps.

376	   5.  Otherwise, the /sniffed type/ of the resource is "application/
377	       octet-stream".
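
   The per-row matching in step 3 can be sketched as shown below.  The
   function name row_matches and the use of None as the "WS" marker are
   assumptions made for the example.  The mask is assumed to carry one
   byte for every pattern entry, including "WS" entries, because a
   single index walks both strings.  The rows used in the usage note
   are illustrative and are not claimed to be rows of the table.

```python
from typing import Optional, Sequence

WS = None  # marker for a "WS" pattern entry (assumption of this sketch)
_WHITESPACE = b"\t\n\x0c\r "  # 0x09, 0x0A, 0x0C, 0x0D, 0x20


def row_matches(stream: bytes, mask: bytes,
                pattern: Sequence[Optional[int]]) -> bool:
    """Return True if one (mask, pattern) row matches the stream."""
    stream = stream[:512]  # stream length is capped at 512 bytes
    if WS not in pattern:
        # Row without "WS": mask the leading bytes and compare.
        if len(stream) < len(pattern):
            return False
        return all((stream[i] & mask[i]) == pattern[i]
                   for i in range(len(pattern)))
    index_pattern = 0
    index_stream = 0
    while index_pattern < len(pattern):
        if index_stream >= len(stream):
            return False  # ran out of bytes: this row does not match
        if pattern[index_pattern] is not WS:
            masked = stream[index_stream] & mask[index_pattern]
            if masked != pattern[index_pattern]:
                return False
            index_pattern += 1
            index_stream += 1
        elif stream[index_stream] in _WHITESPACE:
            index_stream += 1   # skip insignificant whitespace
        else:
            index_pattern += 1  # whitespace run is over
    return True
```

   With an illustrative row of mask FF FF DF DF DF DF and pattern
   WS 3C 48 54 4D 4C, the call
   row_matches(b"  \n<html>", bytes.fromhex("ffffdfdfdfdf"),
   [WS, 0x3C, 0x48, 0x54, 0x4D, 0x4C]) returns True: the leading
   whitespace is skipped and the 0xDF mask folds lowercase letters to
   uppercase before comparison.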

379	   The table used by the above algorithm is:

381	+-------------------+-------------------+-----------------+------------+
382	| Mask in Hex       | Pattern in Hex    | Sniffed Type    | Security   |
383	+-------------------+-------------------+-----------------+------------+
384	| FF FF FF DF DF DF | WS 3C 21 44 4F 43 | text/html       | Scriptable |
385	| DF DF DF DF FF DF | 54 59 50 45 20 48 |                 |            |
386	| DF DF DF          | 54 4D 4C          |                 |            |
387	| Comment: ""),
581	            then increase pos by 3 and jump back to the previous step
582	            (the step labeled loop start) in the overall algorithm in
583	            this section.

585	        3.  Otherwise, increase pos by 1.

587	        4.  Return to step 2 in these substeps.

589	   8.   If s[pos] equals 0x21 (ASCII "!"):

591	        1.  Increase pos by 1.

593	        2.  If s[pos] equals 0x3E, then increase pos by 1 and jump back
594	            to the step labeled loop start in the overall algorithm in
595	            this section.

597	        3.  Otherwise, return to step 1 in these substeps.

599	   9.   If s[pos] equals 0x3F (ASCII "?"):

601	        1.  Increase pos by 1.

603	        2.  If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively,
604	            then increase pos by 2 and jump back to the step labeled
605	            loop start in the overall algorithm in this section.

607	        3.  Otherwise, return to step 1 in these substeps.

609	   10.  Otherwise, if the bytes in s starting at pos match any of the
610	        sequences of bytes in the first column of the following table,
611	        then the user agent must follow the steps given in the
612	        corresponding cell in the second column of the same row.

614	 +----------------------+------------------------------------+---------+
615	 | Bytes in Hexadecimal | Requirement                        | Comment |
616	 +----------------------+------------------------------------+---------+
617	 | 72 73 73             | The /sniffed type/ of the resource | rss     |
618	 |                      | is "application/rss+xml"; abort    |         |
619	 |                      | these steps.                       |         |
620	 +----------------------+------------------------------------+---------+
621	 | 66 65 65 64          | The /sniffed type/ of the resource | feed    |
622	 |                      | is "application/atom+xml"; abort   |         |
623	 |                      | these steps.                       |         |
624	 +----------------------+------------------------------------+---------+
625	 | 72 64 66 3A 52 44 46 | Continue to the next step in this  | rdf:RDF |
626	 |                      | algorithm.                         |         |
627	 +----------------------+------------------------------------+---------+

629	        If none of the byte sequences above match the bytes in s
630	        starting at pos, then the /sniffed type/ of the resource is
631	        "text/html".  Abort these steps.

633	   11.  Otherwise, the /sniffed type/ of the resource is "text/html".
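
   Two of the steps above, the "!" skip in step 8 and the tag-name
   table in step 10, can be sketched as follows.  The function names
   are assumptions made for the example.  Returning None when the
   available bytes run out is also an assumption of this sketch; the
   caller is expected to decide what to do in that case.

```python
from typing import Optional

# Step 10 table: tag-name bytes and the resulting sniffed type.
FEED_ROOTS = (
    (b"rss", "application/rss+xml"),
    (b"feed", "application/atom+xml"),
)


def skip_markup_declaration(s: bytes, pos: int) -> Optional[int]:
    """Step 8: s[pos] is "!". Return the position just past the next ">".

    Returns None if the data ends first (assumption of this sketch).
    """
    while True:
        pos += 1
        if pos >= len(s):
            return None
        if s[pos] == 0x3E:  # ASCII ">"
            return pos + 1


def classify_root(s: bytes, pos: int) -> str:
    """Steps 10 and 11: classify by the tag name starting at s[pos]."""
    for name, media_type in FEED_ROOTS:
        if s.startswith(name, pos):
            return media_type
    # "rdf:RDF" continues to step 11, which yields text/html, as does
    # any tag name that matches no row of the table.
    return "text/html"
```

   For example, classify_root(b"<rss version", 1) gives
   "application/rss+xml", while classify_root(b"<html", 1) gives
   "text/html".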

635	   For efficiency reasons, implementations may wish to implement this
636	   algorithm and the algorithm for detecting the character encoding of
637	   HTML documents in parallel.

639	Authors' Addresses

641	   Adam Barth
642	   University of California, Berkeley

644	   Email: abarth@eecs.berkeley.edu
645	   URI:   http://www.adambarth.com/

647	   Ian Hickson
648	   Google, Inc.

650	   Email: ian@hixie.ch
651	   URI:   http://ln.hixie.ch/