idnits 2.17.1 draft-abarth-mime-sniff-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** The document seems to lack a License Notice according IETF Trust Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009 Section 6.b -- however, there's a paragraph with a matching beginning. Boilerplate error? (You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Feb 2009 rather than one of the newer Notices. See https://trustee.ietf.org/license-info/.) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 248: '... The user agent MAY wait for 512 or m...' RFC 2119 keyword, line 297: '... The user agent MAY wait for 512 or m...' RFC 2119 keyword, line 534: '... The user agent MAY wait for 512 or m...' 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (September 29, 2009) is 5316 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '<CODE BEGINS>' and '<CODE ENDS>' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC2616' is mentioned on line 188, but not defined ** Obsolete undefined reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) == Missing Reference: 'RFC2046' is mentioned on line 197, but not defined -- Looks like a reference, but probably isn't: '0' on line 553 -- Looks like a reference, but probably isn't: '1' on line 553 -- Looks like a reference, but probably isn't: '2' on line 553 -- Possible downref: Non-RFC (?) normative reference: ref. 'BarthCaballeroSong2009' Summary: 6 errors (**), 0 flaws (~~), 4 warnings (==), 6 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Working Group A. Barth 3 Internet-Draft U.C. Berkeley 4 Expires: April 2, 2010 I. Hickson 5 Google, Inc. 6 September 29, 2009 8 Content-Type Processing Model 9 draft-abarth-mime-sniff-03 11 Status of this Memo 13 This Internet-Draft is submitted to IETF in full conformance with the 14 provisions of BCP 78 and BCP 79. 16 Internet-Drafts are working documents of the Internet Engineering 17 Task Force (IETF), its areas, and its working groups. Note that 18 other groups may also distribute working documents as Internet- 19 Drafts. 
21 Internet-Drafts are draft documents valid for a maximum of six months 22 and may be updated, replaced, or obsoleted by other documents at any 23 time. It is inappropriate to use Internet-Drafts as reference 24 material or to cite them other than as "work in progress." 26 The list of current Internet-Drafts can be accessed at 27 http://www.ietf.org/ietf/1id-abstracts.txt. 29 The list of Internet-Draft Shadow Directories can be accessed at 30 http://www.ietf.org/shadow.html. 32 This Internet-Draft will expire on April 2, 2010. 34 Copyright Notice 36 Copyright (c) 2009 IETF Trust and the persons identified as the 37 document authors. All rights reserved. 39 This document is subject to BCP 78 and the IETF Trust's Legal 40 Provisions Relating to IETF Documents in effect on the date of 41 publication of this document (http://trustee.ietf.org/license-info). 42 Please review these documents carefully, as they describe your rights 43 and restrictions with respect to this document. 45 Abstract 47 Many web servers supply incorrect Content-Type headers with their 48 HTTP responses. In order to be compatible with these servers, user 49 agents consider the content of HTTP responses as well as the Content- 50 Type header when determining the effective media type of the 51 response. This document describes an algorithm for determining the 52 effective media type of HTTP responses that balances security and 53 compatibility considerations. 55 Table of Contents 57 1. Introduction 58 2. Metadata 59 3. Web Pages 60 4. Text or Binary 61 5. Unknown Type 62 6. Image 63 7. Feed or HTML 64 8. References 65 Authors' Addresses 
67 1. Introduction 69 The HTTP Content-Type header indicates the media type of an HTTP 70 response. However, many HTTP servers supply a Content-Type that does 71 not match the actual contents of the response. Historically, web 72 browsers have tolerated these servers by examining the content 73 of HTTP responses in addition to the Content-Type header to determine 74 the effective media type of the response. 76 Without a clear specification of how to "sniff" the media type, each 77 user agent implementor was forced to reverse engineer the behavior of 78 the other user agents and to develop their own algorithm. These 79 divergent algorithms have led to a lack of interoperability between 80 user agents and to security issues when the server intends an HTTP 81 response to be interpreted as one media type but some user agents 82 interpret the response as another media type. 84 These security issues are most severe when an "honest" server lets 85 potentially malicious users upload files and then serves the contents 86 of those files with a low-privilege media type (such as text/plain or 87 image/jpeg). (Malicious servers, of course, can specify an arbitrary 88 media type in the Content-Type header.) In the absence of MIME 89 sniffing, this user-generated content would not be interpreted as a 90 high-privilege media type, such as text/html. However, if a user 91 agent does interpret a low-privilege media type, such as image/gif, 92 as a high-privilege media type, such as text/html, the user agent has 93 created a privilege escalation vulnerability in the server. For 94 example, a malicious user might be able to leverage content sniffing 95 to mount a cross-site scripting attack by including JavaScript code in 96 the uploaded file that a user agent treats as text/html. 
98 This document describes a content sniffing algorithm that carefully 99 balances the compatibility needs of user agent implementors with the 100 security constraints. The algorithm has been constructed with 101 reference to content sniffing algorithms present in popular user 102 agents, an extensive database of existing web content, and metrics 103 collected from implementations deployed to a sizable number of users 104 [BarthCaballeroSong2009]. 106 WARNING! Whenever possible, user agents should avoid employing a 107 content sniffing algorithm. However, if a user agent does employ a 108 content sniffing algorithm, the user agent should use the algorithm 109 in this document exactly because using a different content sniffing 110 algorithm than servers expect causes security problems. For example, 111 if a server believes that the client will treat a contributed file as 112 an image (and thus treat it as benign), but a user agent believes the 113 content to be HTML (and thus privileged to execute any scripts 114 contained therein), an attacker might be able to steal the user's 115 authentication credentials and mount other cross-site scripting 116 attacks. 118 2. Metadata 120 The explicit Content-Type metadata associated with the resource (the 121 resource's type information) depends on the protocol that was used to 122 fetch the resource. 124 For HTTP resources, only the last Content-Type HTTP header, if any, 125 contributes any type information; the official type of the resource 126 is then the value of that header, interpreted as described by the 127 HTTP specifications. If the Content-Type HTTP header is present but 128 the value of the last such header cannot be interpreted as described 129 by the HTTP specifications (e.g. because its value doesn't contain a 130 U+002F SOLIDUS ('/') character), then the resource has no type 131 information (even if there are multiple Content-Type HTTP headers and 132 one of the other ones is syntactically correct). 
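As a rough illustration of the HTTP rule just stated, the following Python sketch selects the last Content-Type header and reports no type information when that header's value lacks a U+002F SOLIDUS. The function name official_type_from_http, the representation of headers as (name, value) pairs, the stripping of parameters, and the reduction of "interpreted as described by the HTTP specifications" to a bare solidus check are all assumptions made for illustration; they are not defined by this document.

```python
from typing import Optional, Sequence, Tuple


def official_type_from_http(headers: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the type/subtype of the LAST Content-Type header, or None.

    Hypothetical helper: a real user agent would parse the value as the
    HTTP specifications require; here validity is approximated by the
    presence of a '/' in the part before any parameters.
    """
    last_value = None
    for name, value in headers:
        # Header names are compared case-insensitively.
        if name.strip().lower() == "content-type":
            last_value = value  # later headers override earlier ones

    if last_value is None:
        return None  # no Content-Type header at all

    # Drop parameters such as "; charset=UTF-8" (an assumption of this sketch).
    media_type = last_value.split(";", 1)[0].strip()

    # Only the last header counts: if it is malformed, earlier well-formed
    # headers are NOT used as a fallback.
    if "/" not in media_type:
        return None
    return media_type
```

Note the deliberate absence of a fallback: a well-formed earlier header followed by a malformed last header yields no type information, which sends the resource to the unknown type handling rather than trusting a header the server later overrode.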
134 For resources fetched from the file system, user agents should use 135 platform-specific conventions, e.g. operating system file extension/ 136 type mappings. 138 Note: It is essential that file extensions are not used for 139 determining the media type for resources fetched over HTTP because 140 file extensions can often be supplied by malicious parties. 142 For resources fetched over most other protocols, e.g. FTP, there is 143 no type information. 145 The algorithm for extracting an encoding from a Content-Type, given a 146 string s, is as follows. It either returns an encoding or nothing. 148 1. Find the first seven characters in s that are an ASCII case- 149 insensitive match for the word "charset". If no such match is 150 found, return nothing. 152 2. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters 153 that immediately follow the word 'charset' (there might not be 154 any). 156 3. If the next character is not a U+003D EQUALS SIGN ('='), return 157 nothing. 159 4. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters 160 that immediately follow the equals sign (there might not be any). 162 5. Process the next character as follows: 164 * If it is a U+0022 QUOTATION MARK ('"') and there is a later 165 U+0022 QUOTATION MARK ('"') in s, or 167 * If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 168 APOSTROPHE ("'") in s 170 Return the string between this character and the next 171 earliest occurrence of this character. 173 * If it is an unmatched U+0022 QUOTATION MARK ('"'), 175 * If it is an unmatched U+0027 APOSTROPHE ("'"), or 177 * If there is no next character 179 Return nothing. 181 * Otherwise 183 Return the string from this character to the first U+0009, 184 U+000A, U+000C, U+000D, U+0020, or U+003B character or the 185 end of s, whichever comes first. 187 Note: The above algorithm is a willful violation of the HTTP 188 specification. [RFC2616] 190 3. 
Web Pages 192 The /sniffed type/ of a resource is found as follows: 194 1. Let /official type/ be the type given by the Content-Type 195 metadata for the resource, ignoring parameters. Comparisons with 196 this type, as defined by MIME specifications, are done in an 197 ASCII case-insensitive manner. [RFC2046] 199 2. If the user agent is configured to strictly obey Content-Type 200 headers for this resource, then jump to the last step in this set 201 of steps. 203 3. If the resource was fetched over an HTTP protocol and there is an 204 HTTP Content-Type header and the value of the last such header 205 has bytes that exactly match one of the following lines: 207 +-------------------------------+--------------------------------+ 208 | Bytes in Hexadecimal | Textual Representation | 209 +-------------------------------+--------------------------------+ 210 | 74 65 78 74 2f 70 6c 61 69 6e | text/plain | 211 +-------------------------------+--------------------------------+ 212 | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=ISO-8859-1 | 213 | 3b 20 63 68 61 72 73 65 74 3d | | 214 | 49 53 4f 2d 38 38 35 39 2d 31 | | 215 +-------------------------------+--------------------------------+ 216 | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=iso-8859-1 | 217 | 3b 20 63 68 61 72 73 65 74 3d | | 218 | 69 73 6f 2d 38 38 35 39 2d 31 | | 219 +-------------------------------+--------------------------------+ 220 | 74 65 78 74 2f 70 6c 61 69 6e | text/plain; charset=UTF-8 | 221 | 3b 20 63 68 61 72 73 65 74 3d | | 222 | 55 54 46 2d 38 | | 223 +-------------------------------+--------------------------------+ 225 ...then jump to the "text or binary" section below. 227 4. If there is no /official type/, jump to the unknown type step 228 below. 230 5. If /official type/ is "unknown/unknown", "application/unknown", 231 or "*/*", jump to the unknown type step below. 233 6. 
If /official type/ ends in "+xml", or if it is either "text/xml" 234 or "application/xml", then the /sniffed type/ of the resource is 235 /official type/; return that and abort these steps. 237 7. If /official type/ is an image type supported by the user agent 238 (e.g. "image/png", "image/gif", "image/jpeg", etc), then jump to 239 the "images" section below, passing it the /official type/. 241 8. If /official type/ is "text/html", then jump to the feed or HTML 242 section below. 244 9. The /sniffed type/ of the resource is /official type/. 246 4. Text or Binary 248 1. The user agent MAY wait for 512 or more bytes of the resource to 249 be available. 251 2. Let n be the smaller of either 512 or the number of bytes already 252 available. 254 3. If n is greater than or equal to 3, and the first 2 or 3 bytes of 255 the resource match one of the following byte sequences: 257 +----------------------+--------------+ 258 | Bytes in Hexadecimal | Description | 259 +----------------------+--------------+ 260 | FE FF | UTF-16BE BOM | 261 | FF FE | UTF-16LE BOM | 262 | EF BB BF | UTF-8 BOM | 263 +----------------------+--------------+ 265 ...then the /sniffed type/ of the resource is "text/plain". 266 Abort these steps. 268 4. If none of the first n bytes of the resource are binary data 269 bytes then the /sniffed type/ of the resource is "text/plain". 270 Abort these steps. 272 +-------------------------+ 273 | Binary Data Byte Ranges | 274 +-------------------------+ 275 | 0x00 -- 0x08 | 276 | 0x0B | 277 | 0x0E -- 0x1A | 278 | 0x1C -- 0x1F | 279 +-------------------------+ 281 5. If the first bytes of the resource match one of the byte 282 sequences in the "pattern" column of the table in the unknown 283 type section below, ignoring any rows whose cell in the 284 "security" column says "scriptable" (or "n/a"), then the /sniffed 285 type/ of the resource is the type given in the corresponding cell 286 in the "sniffed type" column on that row; abort these steps. 288 WARNING! 
It is critical that this step never return a 289 scriptable type (e.g. text/html), as otherwise that would 290 allow a privilege escalation attack. 292 6. Otherwise, the /sniffed type/ of the resource is "application/ 293 octet-stream". 295 5. Unknown Type 297 1. The user agent MAY wait for 512 or more bytes of the resource to 298 be available. 300 2. Let /stream length/ be the smaller of either 512 or the number of 301 bytes already available. 303 3. For each row in the table below: 305 * If the row has no "WS" bytes: 307 1. Let /pattern length/ be the length of the pattern (number 308 of bytes described by the cell in the second column of the 309 row). 311 2. If /stream length/ is smaller than /pattern length/ then 312 skip this row. 314 3. Apply the "and" operator to the first /pattern length/ 315 bytes of the resource and the given mask (the bytes in the 316 cell of the first column of that row), and let the result be 317 the data. 319 4. If the bytes of the data match the given pattern bytes 320 exactly, then the /sniffed type/ of the resource is the 321 type given in the cell of the third column in that row; 322 abort these steps. 324 * If the row has a "WS" byte: 326 1. Let /index pattern/ be an index into the mask and pattern 327 byte strings of the row. 329 2. Let /index stream/ be an index into the byte stream being 330 examined. 332 3. Loop: If /index stream/ points beyond the end of the byte 333 stream, then this row doesn't match; skip this row. 335 4. Examine the /index stream/th byte of the byte stream as 336 follows: 338 - If the /index pattern/th byte of the pattern is a 339 normal hexadecimal byte and not a "WS" byte: 341 If the "and" operator, applied to the /index 342 stream/th byte of the stream and the /index 343 pattern/th byte of the mask, yields a value different 344 from the /index pattern/th byte of the pattern, then 345 skip this row. 
347 Otherwise, increment /index pattern/ to the next 348 byte in the mask and pattern and /index stream/ to 349 the next byte in the byte stream. 351 - Otherwise, if the /index pattern/th byte of the pattern 352 is a "WS" byte: 354 "WS" means "whitespace", and allows insignificant 355 whitespace to be skipped when sniffing for a type 356 signature. 358 If the /index stream/th byte of the stream is one of 359 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 360 0x0D (ASCII CR), or 0x20 (ASCII space), then 361 increment only the /index stream/ to the next byte 362 in the byte stream. 364 Otherwise, increment only the /index pattern/ to the 365 next byte in the mask and pattern. 367 5. If /index pattern/ does not point beyond the end of the 368 mask and pattern byte strings, then jump back to the loop 369 step in this algorithm. 371 6. Otherwise, the /sniffed type/ of the resource is the type 372 given in the cell of the third column in that row; abort 373 these steps. 375 4. If none of the first n bytes of the resource are binary data 376 bytes then the sniffed type of the resource is "text/plain". 377 Abort these steps. 379 5. Otherwise, the sniffed type of the resource is "application/ 380 octet-stream". 382 The table used by the above algorithm is: 384 +-------------------+-------------------+-----------------+------------+ 385 | Mask in Hex | Pattern in Hex | Sniffed Type | Security | 386 +-------------------+-------------------+-----------------+------------+ 387 | FF FF FF DF DF DF | WS 3C 21 44 4F 43 | text/html | Scriptable | 388 | DF DF DF DF FF DF | 54 59 50 45 20 48 | | | 389 | DF DF DF | 54 4D 4C | | | 390 | Comment: ""), 580 then increase pos by 3 and jump back to the previous step 581 (the step labeled loop start) in the overall algorithm in 582 this section. 584 3. Otherwise, increase pos by 1. 586 4. Return to step 2 in these substeps. 588 8. If s[pos] equals 0x21 (ASCII "!"): 590 1. Increase pos by 1. 592 2. 
If s[pos] equals 0x3E, then increase pos by 1 and jump back 593 to the step labeled loop start in the overall algorithm in 594 this section. 596 3. Otherwise, return to step 1 in these substeps. 598 9. If s[pos] equals 0x3F (ASCII "?"): 600 1. Increase pos by 1. 602 2. If s[pos] and s[pos+1] equal 0x3F and 0x3E respectively, 603 then increase pos by 1 and jump back to the step labeled 604 loop start in the overall algorithm in this section. 606 3. Otherwise, return to step 1 in these substeps. 608 10. Otherwise, if the bytes in s starting at pos match any of the 609 sequences of bytes in the first column of the following table, 610 then the user agent must follow the steps given in the 611 corresponding cell in the second column of the same row. 613 +----------------------+------------------------------------+---------+ 614 | Bytes in Hexadecimal | Requirement | Comment | 615 +----------------------+------------------------------------+---------+ 616 | 72 73 73 | The /sniffed type/ of the resource | rss | 617 | | is "application/rss+xml"; abort | | 618 | | these steps. | | 619 +----------------------+------------------------------------+---------+ 620 | 66 65 65 64 | The /sniffed type/ of the resource | feed | 621 | | is "application/atom+xml"; abort | | 622 | | these steps. | | 623 +----------------------+------------------------------------+---------+ 624 | 72 64 66 3A 52 44 46 | Continue to the next step in this | rdf:RDF | 625 | | algorithm. | | 626 +----------------------+------------------------------------+---------+ 628 If none of the byte sequences above match the bytes in s 629 starting at pos, then the /sniffed type/ of the resource is 630 "text/html". Abort these steps. 632 11. Initialize /RDF flag/ to 0. 634 12. Initialize /RSS flag/ to 0. 636 13. 
If the bytes with positions pos to pos+23 in s are exactly equal 637 to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x70, 0x75, 0x72, 638 0x6C, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x72, 0x73, 0x73, 0x2F, 639 0x31, 0x2E, 0x30, 0x2F respectively (ASCII for 640 "http://purl.org/rss/1.0/"), then: 642 1. Increase pos by 23. 644 2. Set /RSS flag/ to 1. 646 14. If the bytes with positions pos to pos+42 in s are exactly equal 647 to 0x68, 0x74, 0x74, 0x70, 0x3A, 0x2F, 0x2F, 0x77, 0x77, 0x77, 648 0x2E, 0x77, 0x33, 0x2E, 0x6F, 0x72, 0x67, 0x2F, 0x31, 0x39, 649 0x39, 0x39, 0x2F, 0x30, 0x32, 0x2F, 0x32, 0x32, 0x2D, 0x72, 650 0x64, 0x66, 0x2D, 0x73, 0x79, 0x6E, 0x74, 0x61, 0x78, 0x2D, 651 0x6E, 0x73, 0x23 respectively (ASCII for 652 "http://www.w3.org/1999/02/22-rdf-syntax-ns#"), then: 654 1. Increase pos by 42. 656 2. Set /RDF flag/ to 1. 658 15. Increase pos by 1. 660 16. If /RDF flag/ is 1 and /RSS flag/ is 1, then the /sniffed type/ 661 of the resource is "application/rss+xml". Abort these steps. 663 17. If pos points beyond the end of the byte stream s, then continue 664 to step 19 of this algorithm. 666 18. Jump back to step 13 of this algorithm. 668 19. The /sniffed type/ of the resource is "text/html". 670 For efficiency reasons, implementations may wish to implement this 671 algorithm and the algorithm for detecting the character encoding of 672 HTML documents in parallel. 674 8. References 676 [BarthCaballeroSong2009] 677 Barth, A., Caballero, J., and D. Song, "Secure Content 678 Sniffing for Web Browsers, or How to Stop Papers from 679 Reviewing Themselves", 2009, . 682 TODO: * Transcribe the tables into C and auto generate the tables. * 683 Investigate charset parsing. 685 Authors' Addresses 687 Adam Barth 688 University of California, Berkeley 690 Email: abarth@eecs.berkeley.edu 691 URI: http://www.adambarth.com/ 693 Ian Hickson 694 Google, Inc. 696 Email: ian@hixie.ch 697 URI: http://ln.hixie.ch/
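Steps 11 to 19 of the "Feed or HTML" section (the scan for the RSS 1.0 and RDF namespace strings) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name sniff_rdf_feed, the constant names, and the convention that the caller supplies the byte string s and the current position pos are all hypothetical.

```python
RSS_NS = b"http://purl.org/rss/1.0/"                     # 24 bytes
RDF_NS = b"http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # 43 bytes


def sniff_rdf_feed(s: bytes, pos: int) -> str:
    """Scan s from pos for both namespace strings (steps 11 to 19).

    Hypothetical helper: returns "application/rss+xml" once both the
    RSS 1.0 and RDF namespaces have been seen, otherwise "text/html".
    """
    rdf_flag = False  # step 11
    rss_flag = False  # step 12

    while True:
        # Step 13: the RSS namespace occupies positions pos..pos+23.
        if s.startswith(RSS_NS, pos):
            pos += len(RSS_NS) - 1  # increase pos by 23
            rss_flag = True

        # Step 14: the RDF namespace occupies positions pos..pos+42.
        if s.startswith(RDF_NS, pos):
            pos += len(RDF_NS) - 1  # increase pos by 42
            rdf_flag = True

        pos += 1  # step 15

        # Step 16: both namespaces seen, so treat the resource as a feed.
        if rdf_flag and rss_flag:
            return "application/rss+xml"

        # Steps 17 and 19: ran off the end of the stream without both flags.
        if pos >= len(s):
            return "text/html"
        # Step 18: otherwise loop back to step 13.
```

The sketch advances by the namespace length minus one and then by one more in step 15, mirroring the increments given in the steps, so the two namespaces may appear in either order.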