idnits 2.17.1

draft-abarth-mime-sniff-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 248: '... The user agent MAY wait for 512 or m...'

     RFC 2119 keyword, line 297: '... The user agent MAY wait for 512 or m...'

     RFC 2119 keyword, line 534: '... The user agent MAY wait for 512 or m...'


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (September 29, 2009) is 5316 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC2616' is mentioned on line 188, but not defined

  ** Obsolete undefined reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
     RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  == Missing Reference: 'RFC2046' is mentioned on line 197, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 553

  -- Looks like a reference, but probably isn't: '1' on line 553

  -- Looks like a reference, but probably isn't: '2' on line 553

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'BarthCaballeroSong2009'


     Summary: 6 errors (**), 0 flaws (~~), 4 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Working Group                                                   A. Barth
3	Internet-Draft                                             U.C. Berkeley
4	Expires: April 2, 2010                                        I. Hickson
5	                                                            Google, Inc.
6	                                                      September 29, 2009

8	                     Content-Type Processing Model
9	                       draft-abarth-mime-sniff-03

11	Status of this Memo

13	   This Internet-Draft is submitted to IETF in full conformance with the
14	   provisions of BCP 78 and BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time.  It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on April 2, 2010.

34	Copyright Notice

36	   Copyright (c) 2009 IETF Trust and the persons identified as the
37	   document authors.  All rights reserved.

39	   This document is subject to BCP 78 and the IETF Trust's Legal
40	   Provisions Relating to IETF Documents in effect on the date of
41	   publication of this document (http://trustee.ietf.org/license-info).
42	   Please review these documents carefully, as they describe your rights
43	   and restrictions with respect to this document.

45	Abstract

47	   Many web servers supply incorrect Content-Type headers with their
48	   HTTP responses.  In order to be compatible with these servers, user
49	   agents consider the content of HTTP responses as well as the Content-
50	   Type header when determining the effective media type of the
51	   response.  This document describes an algorithm for determining the
52	   effective media type of HTTP responses that balances security and
53	   compatibility considerations.

55	Table of Contents

57	   1.  Introduction
58	   2.  Metadata
59	   3.  Web Pages
60	   4.  Text or Binary
61	   5.  Unknown Type
62	   6.  Image
63	   7.  Feed or HTML
64	   8.  References
65	   Authors' Addresses

67	1.  Introduction

69	   The HTTP Content-Type header indicates the media type of an HTTP
70	   response.  However, many HTTP servers supply a Content-Type that does
71	   not match the actual contents of the response.  Historically, web
72	   browsers have tolerated these servers by examining the content
73	   of HTTP responses in addition to the Content-Type header to determine
74	   the effective media type of the response.

76	   Without a clear specification of how to "sniff" the media type, each
77	   user agent implementor was forced to reverse engineer the behavior of
78	   the other user agents and to develop their own algorithm.  These
79	   divergent algorithms have led to a lack of interoperability between
80	   user agents and to security issues when the server intends an HTTP
81	   response to be interpreted as one media type but some user agents
82	   interpret the response as another media type.

84	   These security issues are most severe when an "honest" server lets
85	   potentially malicious users upload files and then serves the contents
86	   of those files with a low-privilege media type (such as text/plain or
87	   image/jpeg).  (Malicious servers, of course, can specify an arbitrary
88	   media type in the Content-Type header.)  In the absence of MIME
89	   sniffing, this user-generated content would not be interpreted as a
90	   high-privilege media type, such as text/html.  However, if a user
91	   agent does interpret a low-privilege media type, such as image/gif,
92	   as a high-privilege media type, such as text/html, the user agent has
93	   created a privilege escalation vulnerability in the server.  For
94	   example, a malicious user might be able to leverage content sniffing
95	   to mount a cross-site scripting attack by including JavaScript code
96	   in the uploaded file that a user agent treats as text/html.

98	   This document describes a content sniffing algorithm that carefully
99	   balances the compatibility needs of user agent implementors with the
100	   security constraints.  The algorithm has been constructed with
101	   reference to content sniffing algorithms present in popular user
102	   agents, an extensive database of existing web content, and metrics
103	   collected from implementations deployed to a sizable number of users
104	   [BarthCaballeroSong2009].

106	   WARNING!  Whenever possible, user agents should avoid employing a
107	   content sniffing algorithm.  However, if a user agent does employ a
108	   content sniffing algorithm, the user agent should use the algorithm
109	   in this document exactly, because using a different content sniffing
110	   algorithm than servers expect causes security problems.  For example,
111	   if a server believes that the client will treat a contributed file as
112	   an image (and thus treat it as benign), but a user agent believes the
113	   content to be HTML (and thus privileged to execute any scripts
114	   contained therein), an attacker might be able to steal the user's
115	   authentication credentials and mount other cross-site scripting
116	   attacks.

118	2.  Metadata

120	   The explicit Content-Type metadata associated with the resource (the
121	   resource's type information) depends on the protocol that was used to
122	   fetch the resource.

124	   For HTTP resources, only the last Content-Type HTTP header, if any,
125	   contributes any type information; the official type of the resource
126	   is then the value of that header, interpreted as described by the
127	   HTTP specifications.  If the Content-Type HTTP header is present but
128	   the value of the last such header cannot be interpreted as described
129	   by the HTTP specifications (e.g. because its value doesn't contain a
130	   U+002F SOLIDUS ('/') character), then the resource has no type
131	   information (even if there are multiple Content-Type HTTP headers and
132	   one of the other ones is syntactically correct).
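
   The "last header wins" rule above can be sketched as follows.  This
   is a minimal illustration, not part of the algorithm text: the
   function name official_type and the (name, value) list used to
   represent headers are assumptions, and the only well-formedness
   check performed is the solidus test that the paragraph gives as an
   example.

```python
from typing import List, Optional, Tuple


def official_type(headers: List[Tuple[str, str]]) -> Optional[str]:
    """Return the value of the last Content-Type header, or None.

    `headers` is a hypothetical list of (name, value) pairs in the
    order they were received.
    """
    values = [value for (name, value) in headers
              if name.lower() == "content-type"]
    if not values:
        return None  # no Content-Type header at all
    last = values[-1].strip()
    # Assumption: "cannot be interpreted" is approximated by the
    # draft's own example, a value with no U+002F SOLIDUS.
    if "/" not in last:
        return None  # earlier, well-formed headers are NOT consulted
    return last
```

   Note that a malformed final header yields no type information even
   when an earlier header was valid; the sketch never falls back to an
   earlier value.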

134	   For resources fetched from the file system, user agents should use
135	   platform-specific conventions, e.g. operating system file extension/
136	   type mappings.

138	      Note: It is essential that file extensions are not used for
139	      determining the media type for resources fetched over HTTP because
140	      file extensions can often be supplied by malicious parties.

142	   For resources fetched over most other protocols, e.g.  FTP, there is
143	   no type information.

145	   The algorithm for extracting an encoding from a Content-Type, given a
146	   string s, is as follows.  It either returns an encoding or nothing.

148	   1.  Find the first seven characters in s that are an ASCII case-
149	       insensitive match for the word "charset".  If no such match is
150	       found, return nothing.

152	   2.  Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters
153	       that immediately follow the word 'charset' (there might not be
154	       any).

156	   3.  If the next character is not a U+003D EQUALS SIGN ('='), return
157	       nothing.

159	   4.  Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters
160	       that immediately follow the equals sign (there might not be any).

162	   5.  Process the next character as follows:

164	       *  If it is a U+0022 QUOTATION MARK ('"') and there is a later
165	          U+0022 QUOTATION MARK ('"') in s, or

167	       *  If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027
168	          APOSTROPHE ("'") in s

170	             Return the string between this character and the next
171	             earliest occurrence of this character.

173	       *  If it is an unmatched U+0022 QUOTATION MARK ('"'),

175	       *  If it is an unmatched U+0027 APOSTROPHE ("'"), or

177	       *  If there is no next character

179	             Return nothing.

181	       *  Otherwise

183	             Return the string from this character to the first U+0009,
184	             U+000A, U+000C, U+000D, U+0020, or U+003B character or the
185	             end of s, whichever comes first.

187	   Note: The above algorithm is a willful violation of the HTTP
188	   specification.  [RFC2616]
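
   The five steps of the encoding extraction algorithm can be sketched
   as below.  This is an illustrative reading under stated assumptions:
   the function name extract_encoding is hypothetical, s is taken to be
   a Python str, and a string that ends immediately after "charset" is
   treated as having no equals sign.

```python
from typing import Optional

# U+0009, U+000A, U+000C, U+000D, U+0020
WHITESPACE = "\t\n\x0c\r "


def extract_encoding(s: str) -> Optional[str]:
    # Step 1: ASCII case-insensitive search.  Only A-Z are folded so
    # that indexes into the folded copy stay valid for s.
    folded = "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)
    pos = folded.find("charset")
    if pos == -1:
        return None
    pos += len("charset")
    # Step 2: skip whitespace after the word.
    while pos < len(s) and s[pos] in WHITESPACE:
        pos += 1
    # Step 3: require an equals sign.
    if pos >= len(s) or s[pos] != "=":
        return None
    pos += 1
    # Step 4: skip whitespace after the equals sign.
    while pos < len(s) and s[pos] in WHITESPACE:
        pos += 1
    # Step 5: quoted value, unmatched quote, no value, or bare token.
    if pos >= len(s):
        return None
    quote = s[pos]
    if quote in ('"', "'"):
        end = s.find(quote, pos + 1)
        return None if end == -1 else s[pos + 1:end]
    end = pos
    while end < len(s) and s[end] not in WHITESPACE + ";":
        end += 1
    return s[pos:end]
```

   For example, under this sketch the value 'text/html; CHARSET =
   "iso-8859-1"' yields "iso-8859-1", while a value with an unmatched
   quotation mark yields nothing.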

190	3.  Web Pages

192	   The /sniffed type/ of a resource is found as follows:

194	   1.  Let /official type/ be the type given by the Content-Type
195	       metadata for the resource, ignoring parameters.  Comparisons with
196	       this type, as defined by MIME specifications, are done in an
197	       ASCII case-insensitive manner.  [RFC2046]

199	   2.  If the user agent is configured to strictly obey Content-Type
200	       headers for this resource, then jump to the last step in this set
201	       of steps.

203	   3.  If the resource was fetched over an HTTP protocol and there is an
204	       HTTP Content-Type header and the value of the last such header
205	       has bytes that exactly match one of the following lines:

207	      +-------------------------------+--------------------------------+
208	      | Bytes in Hexadecimal          | Textual Representation         |
209	      +-------------------------------+--------------------------------+
210	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain                     |
211	      +-------------------------------+--------------------------------+
212	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=ISO-8859-1 |
213	      | 3b 20 63 68 61 72 73 65 74 3d |                                |
214	      | 49 53 4f 2d 38 38 35 39 2d 31 |                                |
215	      +-------------------------------+--------------------------------+
216	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=iso-8859-1 |
217	      | 3b 20 63 68 61 72 73 65 74 3d |                                |
218	      | 69 73 6f 2d 38 38 35 39 2d 31 |                                |
219	      +-------------------------------+--------------------------------+
220	      | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=UTF-8      |
221	      | 3b 20 63 68 61 72 73 65 74 3d |                                |
222	      | 55 54 46 2d 38                |                                |
223	      +-------------------------------+--------------------------------+

225	       ...then jump to the "text or binary" section below.

227	   4.  If there is no /official type/, jump to the unknown type step
228	       below.

230	   5.  If /official type/ is "unknown/unknown", "application/unknown",
231	       or "*/*", jump to the unknown type step below.

233	   6.  If /official type/ ends in "+xml", or if it is either "text/xml"
234	       or "application/xml", then the /sniffed type/ of the resource is
235	       /official type/; return that and abort these steps.

237	   7.  If /official type/ is an image type supported by the user agent
238	       (e.g. "image/png", "image/gif", "image/jpeg", etc), then jump to
239	       the "images" section below, passing it the /official type/.

241	   8.  If /official type/ is "text/html", then jump to the feed or HTML
242	       section below.

244	   9.  The /sniffed type/ of the resource is /official type/.
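
   The dispatch performed by the nine steps above can be sketched as
   follows.  The function name dispatch, its parameters, and the
   section labels it returns ("text-or-binary", "unknown-type",
   "image", "feed-or-html", "done") are all assumptions made for
   illustration; the sketch only decides which section to jump to and
   does not itself sniff any content.

```python
from typing import FrozenSet, Optional, Tuple

# Step 3: header values that must match byte for byte.
TEXT_OR_BINARY_VALUES = frozenset({
    b"text/plain",
    b"text/plain; charset=ISO-8859-1",
    b"text/plain; charset=iso-8859-1",
    b"text/plain; charset=UTF-8",
})
UNKNOWN_TYPES = frozenset({"unknown/unknown", "application/unknown", "*/*"})
# Assumption: the image types this hypothetical user agent supports.
DEFAULT_IMAGES = frozenset({"image/png", "image/gif", "image/jpeg"})


def dispatch(last_header: Optional[bytes],
             over_http: bool = True,
             strict: bool = False,
             supported_images: FrozenSet[str] = DEFAULT_IMAGES,
             ) -> Tuple[str, Optional[str]]:
    """Return (next section, official type) for a resource."""
    # Step 1: official type, parameters ignored, compared lowercased.
    official = None
    if last_header is not None:
        candidate = last_header.decode("latin-1").split(";", 1)[0]
        candidate = candidate.strip().lower()
        if "/" in candidate:
            official = candidate
    if strict:                                               # step 2
        return ("done", official)
    if over_http and last_header in TEXT_OR_BINARY_VALUES:   # step 3
        return ("text-or-binary", official)
    if official is None or official in UNKNOWN_TYPES:        # steps 4-5
        return ("unknown-type", official)
    if official.endswith("+xml") or official in ("text/xml",
                                                 "application/xml"):
        return ("done", official)                            # step 6
    if official in supported_images:                         # step 7
        return ("image", official)
    if official == "text/html":                              # step 8
        return ("feed-or-html", official)
    return ("done", official)                                # step 9
```

   Because step 3 is an exact byte comparison, a header such as
   "text/plain; charset=utf-8" (lowercase) does not trigger the text
   or binary check in this sketch and falls through to step 9.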

246	4.  Text or Binary

248	   1.  The user agent MAY wait for 512 or more bytes of the resource to
249	       be available.

251	   2.  Let n be the smaller of either 512 or the number of bytes already
252	       available.

254	   3.  If n is greater than or equal to 3, and the first 2 or 3 bytes of
255	       the resource match one of the following byte sequences:

257	                   +----------------------+--------------+
258	                   | Bytes in Hexadecimal | Description  |
259	                   +----------------------+--------------+
260	                   | FE FF                | UTF-16BE BOM |
261	                   | FF FE                | UTF-16LE BOM |
262	                   | EF BB BF             | UTF-8 BOM    |
263	                   +----------------------+--------------+

265	       ...then the /sniffed type/ of the resource is "text/plain".
266	       Abort these steps.

268	   4.  If none of the first n bytes of the resource are binary data
269	       bytes then the /sniffed type/ of the resource is "text/plain".
270	       Abort these steps.

272	                         +-------------------------+
273	                         | Binary Data Byte Ranges |
274	                         +-------------------------+
275	                         | 0x00 -- 0x08            |
276	                         | 0x0B                    |
277	                         | 0x0E -- 0x1A            |
278	                         | 0x1C -- 0x1F            |
279	                         +-------------------------+

281	   5.  If the first bytes of the resource match one of the byte
282	       sequences in the "pattern" column of the table in the unknown
283	       type section below, ignoring any rows whose cell in the
284	       "security" column says "scriptable" (or "n/a"), then the /sniffed
285	       type/ of the resource is the type given in the corresponding cell
286	       in the "sniffed type" column on that row; abort these steps.

288	          WARNING!  It is critical that this step not ever return a
289	          scriptable type (e.g. text/html), as otherwise that would
290	          allow a privilege escalation attack.

292	   6.  Otherwise, the /sniffed type/ of the resource is "application/
293	       octet-stream".
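
   The text or binary steps can be sketched as below.  The function
   name text_or_binary is hypothetical, and step 5 is represented by a
   caller-supplied function, match_safe_signature, which is assumed to
   consult only the non-scriptable rows of the table in the unknown
   type section and to return None when nothing matches.

```python
from typing import Callable, Optional

BOMS = (b"\xFE\xFF", b"\xFF\xFE", b"\xEF\xBB\xBF")

# 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B]
    + list(range(0x0E, 0x1B)) + list(range(0x1C, 0x20))
)


def text_or_binary(
    data: bytes,
    match_safe_signature: Callable[[bytes], Optional[str]] = lambda b: None,
) -> str:
    # Steps 1-2: examine at most the first 512 available bytes.
    prefix = data[:512]
    n = len(prefix)
    # Step 3: a byte order mark means text.
    if n >= 3 and prefix.startswith(BOMS):
        return "text/plain"
    # Step 4: no binary data bytes means text.
    if not any(byte in BINARY_DATA_BYTES for byte in prefix):
        return "text/plain"
    # Step 5: hypothetical lookup; must never return a scriptable type.
    sniffed = match_safe_signature(prefix)
    if sniffed is not None:
        return sniffed
    # Step 6.
    return "application/octet-stream"
```

   Note that 0x1B (ESC) is deliberately absent from the binary data
   byte ranges, so content containing only escape sequences is still
   treated as text in this sketch.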

295	5.  Unknown Type

297	   1.  The user agent MAY wait for 512 or more bytes of the resource to
298	       be available.

300	   2.  Let /stream length/ be the smaller of either 512 or the number of
301	       bytes already available.

303	   3.  For each row in the table below:

305	       *  If the row has no "WS" bytes:

307	          1.  Let /pattern length/ be the length of the pattern (number
308	              of bytes described by the cell in the second column of the
309	              row).

311	          2.  If /stream length/ is smaller than /pattern length/ then
312	              skip this row.

314	          3.  Apply the "and" operator to the first /pattern length/
315	              bytes of the resource and the given mask (the bytes in the
316	              cell of the first column of that row), and let the result
317	              be the data.

319	          4.  If the bytes of the data match the given pattern bytes
320	              exactly, then the /sniffed type/ of the resource is the
321	              type given in the cell of the third column in that row;
322	              abort these steps.

324	       *  If the row has a "WS" byte:

326	          1.  Let /index pattern/ be an index into the mask and pattern
327	              byte strings of the row.

329	          2.  Let /index stream/ be an index into the byte stream being
330	              examined.

332	          3.  Loop: If /index stream/ points beyond the end of the byte
333	              stream, then this row doesn't match; skip this row.

335	          4.  Examine the /index stream/th byte of the byte stream as
336	              follows:

338	              -  If the /index pattern/th byte of the pattern is a
339	                 normal hexadecimal byte and not a "WS" byte:

341	                    If the "and" operator, applied to the /index
342	                    stream/th byte of the stream and the /index
343	                    pattern/th byte of the mask, yields a value
344	                    different from the /index pattern/th byte of the
345	                    pattern, then skip this row.

347	                    Otherwise, increment /index pattern/ to the next
348	                    byte in the mask and pattern and /index stream/ to
349	                    the next byte in the byte stream.

351	              -  Otherwise, if the /index pattern/th byte of the pattern
352	                 is a "WS" byte:

354	                    "WS" means "whitespace", and allows insignificant
355	                    whitespace to be skipped when sniffing for a type
356	                    signature.

358	                    If the /index stream/th byte of the stream is one of
359	                    0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF),
360	                    0x0D (ASCII CR), or 0x20 (ASCII space), then
361	                    increment only the /index stream/ to the next byte
362	                    in the byte stream.

364	                    Otherwise, increment only the /index pattern/ to the
365	                    next byte in the mask and pattern.

367	          5.  If /index pattern/ does not point beyond the end of the
368	              mask and pattern byte strings, then jump back to the loop
369	              step in this algorithm.

371	          6.  Otherwise, the /sniffed type/ of the resource is the type
372	              given in the cell of the third column in that row; abort
373	              these steps.

375	   4.  If none of the first /stream length/ bytes of the resource are
376	       binary data bytes then the /sniffed type/ of the resource is
377	       "text/plain".  Abort these steps.

379	   5.  Otherwise, the /sniffed type/ of the resource is "application/
380	       octet-stream".
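
   The per-row matching in step 3 can be sketched as follows.  The
   function name row_matches and the use of None to stand for a "WS"
   pattern byte are assumptions made for illustration.  The example
   row is the text/html row (mask FF FF FF DF ..., pattern WS 3C 21 44
   ...) from the table in this section.

```python
from typing import List, Optional

WS = None  # hypothetical marker for a "WS" pattern byte
WHITESPACE_BYTES = frozenset({0x09, 0x0A, 0x0C, 0x0D, 0x20})


def row_matches(mask: bytes, pattern: List[Optional[int]],
                stream: bytes) -> bool:
    stream = stream[:512]  # /stream length/ is at most 512
    if WS not in pattern:
        # Rows without "WS": masked comparison of a fixed-length prefix.
        if len(stream) < len(pattern):
            return False
        return all((stream[i] & mask[i]) == pattern[i]
                   for i in range(len(pattern)))
    # Rows with "WS": walk the pattern and the stream independently.
    index_pattern = 0
    index_stream = 0
    while index_pattern < len(pattern):
        if index_stream >= len(stream):
            return False  # ran out of stream; this row does not match
        byte = stream[index_stream]
        if pattern[index_pattern] is not WS:
            if (byte & mask[index_pattern]) != pattern[index_pattern]:
                return False
            index_pattern += 1
            index_stream += 1
        elif byte in WHITESPACE_BYTES:
            index_stream += 1   # skip insignificant whitespace
        else:
            index_pattern += 1  # whitespace run is over
    return True


# The "<!DOCTYPE HTML" row; 0xDF masks make letters case-insensitive.
DOCTYPE_MASK = bytes.fromhex("FF FF FF DF DF DF DF DF DF DF FF DF DF DF DF")
DOCTYPE_PATTERN = [WS] + list(
    bytes.fromhex("3C 21 44 4F 43 54 59 50 45 20 48 54 4D 4C"))
```

   With these definitions, row_matches(DOCTYPE_MASK, DOCTYPE_PATTERN,
   b"  \n<!doctype html>") is true, because the leading whitespace is
   skipped and the 0xDF mask clears the ASCII case bit of each letter.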

382	   The table used by the above algorithm is:

384	+-------------------+-------------------+-----------------+------------+
385	| Mask in Hex       | Pattern in Hex    | Sniffed Type    | Security   |
386	+-------------------+-------------------+-----------------+------------+
387	| FF FF FF DF DF DF | WS 3C 21 44 4F 43 | text/html       | Scriptable |
388	| DF DF DF DF FF DF | 54 59 50 45 20 48 |                 |            |
389	| DF DF DF          | 54 4D 4C          |                 |            |
390	| Comment: ""),
580	            then increase pos by 3 and jump back to the previous step
581	            (the step labeled loop start) in the overall algorithm in
582	            this section.

584	        3.  Otherwise, increase pos by 1.

586	        4.  Return to step 2 in these substeps.

588	   8.   If s[pos] equals 0x21 (ASCII "!"):

590	        1.  Increase pos by 1.

592	        2.  If s[pos] equals 0x3E, then increase pos by 1 and jump back
593	            to the step labeled loop start in the overall algorithm in
594	            this section.

596	        3.  Otherwise, return to step 1 in these substeps.

598	   9.   If s[pos] equals 0x3F (ASCII "?"):

600	        1.  Increase pos by 1.

602	        2.  If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively,
603	            then increase pos by 1 and jump back to the step labeled
604	            loop start in the overall algorithm in this section.

606	        3.  Otherwise, return to step 1 in these substeps.

608	   10.  Otherwise, if the bytes in s starting at pos match any of the
609	        sequences of bytes in the first column of the following table,
610	        then the user agent must follow the steps given in the
611	        corresponding cell in the second column of the same row.

613	 +----------------------+------------------------------------+---------+
614	 | Bytes in Hexadecimal | Requirement                        | Comment |
615	 +----------------------+------------------------------------+---------+
616	 | 72 73 73             | The /sniffed type/ of the resource | rss     |
617	 |                      | is "application/rss+xml"; abort    |         |
618	 |                      | these steps.                       |         |
619	 +----------------------+------------------------------------+---------+
620	 | 66 65 65 64          | The /sniffed type/ of the resource | feed    |
621	 |                      | is "application/atom+xml"; abort   |         |
622	 |                      | these steps.                       |         |
623	 +----------------------+------------------------------------+---------+
624	 | 72 64 66 3A 52 44 46 | Continue to the next step in this  | rdf:RDF |
625	 |                      | algorithm.                         |         |
626	 +----------------------+------------------------------------+---------+

628	        If none of the byte sequences above match the bytes in s
629	        starting at pos, then the /sniffed type/ of the resource is
630	        "text/html".  Abort these steps.

632	   11.  Initialize /RDF flag/ to 0.

634	   12.  Initialize /RSS flag/ to 0.

636	   13.  If the bytes with positions pos to pos+23 in s are exactly equal
637	        to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x70, 0x75, 0x72,
638	        0x6C, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x72, 0x73, 0x73, 0x2F,
639	        0x31, 0x2E, 0x30, 0x2F respectively (ASCII for
640	        "http://purl.org/rss/1.0/"), then:

642	        1.  Increase pos by 23.

644	        2.  Set /RSS flag/ to 1.

646	   14.  If the bytes with positions pos to pos+42 in s are exactly equal
647	        to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x77, 0x77, 0x77,
648	        0x2E, 0x77, 0x33, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x31, 0x39,
649	        0x39, 0x39, 0x2F, 0x30, 0x32, 0x2F, 0x32, 0x32, 0x2D, 0x72,
650	        0x64, 0x66, 0x2D, 0x73, 0x79, 0x6E, 0x74, 0x61, 0x78, 0x2D,
651	        0x6E, 0x73, 0x23 respectively (ASCII for
652	        "http://www.w3.org/1999/02/22-rdf-syntax-ns#"), then:

654	        1.  Increase pos by 42.

656	        2.  Set /RDF flag/ to 1.

658	   15.  Increase pos by 1.

660	   16.  If /RDF flag/ is 1 and /RSS flag/ is 1, then the /sniffed type/
661	        of the resource is "application/rss+xml".  Abort these steps.

663	   17.  If pos points beyond the end of the byte stream s, then continue
664	        to step 19 of this algorithm.

666	   18.  Jump back to step 13 of this algorithm.

668	   19.  The /sniffed type/ of the resource is "text/html".
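
   Steps 11 through 19 above can be sketched as follows.  The function
   name sniff_rdf_feed is hypothetical; s and pos are assumed to be
   the byte stream and position established by the earlier steps of
   this section.  As in the steps, pos is advanced to the last byte of
   a matched namespace string and then by one more in step 15.

```python
RSS_NS = b"http://purl.org/rss/1.0/"                     # 24 bytes
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # 43 bytes


def sniff_rdf_feed(s: bytes, pos: int) -> str:
    rdf_flag = False  # step 11
    rss_flag = False  # step 12
    while True:
        # Step 13: RSS 1.0 namespace at pos.
        if s.startswith(RSS_NS, pos):
            pos += len(RSS_NS) - 1   # "Increase pos by 23"
            rss_flag = True
        # Step 14: RDF namespace at pos.
        if s.startswith(RDF_NS, pos):
            pos += len(RDF_NS) - 1   # "Increase pos by 42"
            rdf_flag = True
        pos += 1                     # step 15
        # Step 16: both namespaces seen means an RSS 1.0 feed.
        if rdf_flag and rss_flag:
            return "application/rss+xml"
        # Steps 17 and 19: end of stream without both namespaces.
        if pos >= len(s):
            return "text/html"
        # Step 18: loop back to step 13.
```

   The two namespaces may appear in either order; a document that
   mentions only one of them is still reported as "text/html" by this
   sketch.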

670	   For efficiency reasons, implementations may wish to implement this
671	   algorithm and the algorithm for detecting the character encoding of
672	   HTML documents in parallel.

674	8.  References

676	   [BarthCaballeroSong2009]
677	              Barth, A., Caballero, J., and D. Song, "Secure Content
678	              Sniffing for Web Browsers, or How to Stop Papers from
679	              Reviewing Themselves", 2009.

682	   TODO: * Transcribe the tables into C and auto generate the tables. *
683	   Investigate charset parsing.

685	Authors' Addresses

687	   Adam Barth
688	   University of California, Berkeley

690	   Email: abarth@eecs.berkeley.edu
691	   URI:   http://www.adambarth.com/

693	   Ian Hickson
694	   Google, Inc.

696	   Email: ian@hixie.ch
697	   URI:   http://ln.hixie.ch/