idnits 2.17.1
draft-masinter-mime-web-info-00.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
No issues found here.
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the IETF Trust and authors Copyright Line does not
match the current year
== Line 122 has weird spacing: '...tagging how t...'
== Line 125 has weird spacing: '...bagging how t...'
-- The document date (September 23, 2010) is 4935 days in the past. Is
this intentional?
Checking references for intended status: Informational
----------------------------------------------------------------------------
No issues found here.
Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Internet Engineering Task Force L. Masinter
3 Internet-Draft Adobe
4 Intended status: Informational September 23, 2010
5 Expires: March 27, 2011
7 Internet Media Types and the Web
8 draft-masinter-mime-web-info-00
10 Abstract
12 This document describes some of the ways in which parts of the MIME
13 system, originally designed for electronic mail, have been used in
14 the web, and some of the ways in which those uses have resulted in
15 difficulties. This informational document is intended as background
16 and justification for a companion Best Current Practice which makes
17 some changes to the registry of Internet Media Types and other
18 specifications and practices, in order to facilitate Web application
19 design and standardization.
21 Status of this Memo
23 This Internet-Draft is submitted in full conformance with the
24 provisions of BCP 78 and BCP 79.
26 Internet-Drafts are working documents of the Internet Engineering
27 Task Force (IETF). Note that other groups may also distribute
28 working documents as Internet-Drafts. The list of current Internet-
29 Drafts is at http://datatracker.ietf.org/drafts/current/.
31 Internet-Drafts are draft documents valid for a maximum of six months
32 and may be updated, replaced, or obsoleted by other documents at any
33 time. It is inappropriate to use Internet-Drafts as reference
34 material or to cite them other than as "work in progress."
36 This Internet-Draft will expire on March 27, 2011.
38 Copyright Notice
40 Copyright (c) 2010 IETF Trust and the persons identified as the
41 document authors. All rights reserved.
43 This document is subject to BCP 78 and the IETF Trust's Legal
44 Provisions Relating to IETF Documents
45 (http://trustee.ietf.org/license-info) in effect on the date of
46 publication of this document. Please review these documents
47 carefully, as they describe your rights and restrictions with respect
48 to this document. Code Components extracted from this document must
49 include Simplified BSD License text as described in Section 4.e of
50 the Trust Legal Provisions and are provided without warranty as
51 described in the Simplified BSD License.
53 Table of Contents
55 1. Introduction
56 2. History
57 2.1. Origins of MIME
58 2.2. Introducing MIME into the Web
59 2.3. Distributed Extensibility
60 3. Problems with application to the Web
61 3.1. Differences between email and web delivery
62 3.2. The Rules Weren't Quite Followed
63 3.3. Consequences
64 3.4. The Down Side of Extensibility
65 4. Additional considerations
66 4.1. There are related problems with charsets
67 4.2. Embedded, downloaded, launch independent application
68 4.3. Additional Use Cases: Polyglot and Multiview
69 4.4. Evolution, Versioning, Forking
70 4.5. Content Negotiation
71 4.6. Fragment identifiers
72 5. Where we need to go
73 6. Specific recommendations
74 6.1. Internet Media Type registration
75 6.2. Sniffing
76 6.3. Other specifications and BCPs
77 7. Acknowledgements
78 8. IANA Considerations
79 9. Security Considerations
80 10. Informative References
81 Author's Address
83 1. Introduction
85 This document was prompted by a set of discussions in the W3C
86 Technical Architecture Group about web architecture and the
87 difficulties surrounding evolution of the web, Internet Media Types,
88 multiple specifications for a single media type, and related
89 topics. The goal of the document is to prompt an evolution
90 within W3C and IETF in the use of MIME (and in particular Internet
91 Media Types) to fix some of the outstanding problems. This is an
92 initial version for review and update. The goal is first to survey
93 the current situation and then to make a set of recommendations for
94 the definition and use of MIME components (and specifically, Internet
95 Media Types and charset declarations) to facilitate their
96 standardization across Web and Web-related technologies with other
97 Internet applications. Discussion of this document is suggested on the
98 mailing list www-tag@w3.org, which is open for subscription to
99 all, with archives at http://lists.w3.org/Archives/Public/www-tag/.
101 2. History
103 2.1. Origins of MIME
105 MIME was invented originally for email, based on general principles
106 of 'messaging', a foundational architecture framework. The role of
107 MIME was to extend Internet email messaging from ASCII-only plain
108 text to include other character sets, images, rich documents, etc.
109 The basic architecture of complex content messaging is:
111 o Message sent from A to B.
113 o Message includes some data. Sender A includes standard 'headers'
114 telling recipient B enough information that recipient B knows how
115 sender A intends the message to be interpreted.
117 o Recipient B gets the message, interprets the headers for the data
118 and uses it as information on how to interpret the data.
120 MIME is a "tagging and bagging" specification:
122 tagging how to label content so the intent of how the content should
123 be interpreted is known
125 bagging how to wrap the content so the label is clear, or, if there
126 are multiple parts to a single message, how to combine them.
128 "MIME types" (renamed "Internet Media Types") were part of the
129 tagging -- a name space for describing how to initiate interpretation
130 of a message. The "Internet Media Type registry" (MIME type
131 registry) is where someone can tell the world what a particular label
132 means, as far as the sender's intent of how recipients should process
133 a message of that type, and a description, for senders, of a
134 recipient's capabilities.
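
As a rough illustration of "tagging and bagging", the sketch below uses Python's standard `email` package to label two renderings of the same content and wrap them in one container. The addresses and body text are placeholders invented for this example; it is one way to picture the idea, not a description of any particular mail system.

```python
from email.message import EmailMessage

# Build a message from A to B (addresses are illustrative placeholders).
msg = EmailMessage()
msg["From"] = "a@example.org"
msg["To"] = "b@example.org"
msg["Subject"] = "Tagging and bagging"

# Tagging: set_content() labels the body as text/plain.
msg.set_content("Hello in plain text.")

# Bagging: adding a second rendering converts the message into a
# multipart/alternative container that holds both labeled parts.
msg.add_alternative("<p>Hello in <b>HTML</b>.</p>", subtype="html")

# The recipient walks the container and reads the label on each part.
labels = [part.get_content_type() for part in msg.walk()]
print(labels)
```

The recipient never has to inspect the bytes of a part to decide how to treat it; the label on each part carries the sender's intent.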
136 2.2. Introducing MIME into the Web
138 The original World Wide Web (the 0.9 version of HTTP) didn't have
139 "tagging and bagging" -- everything sent via HTTP was assumed to be
140 HTML. However, at the time (early 1990s) other distributed
141 information access systems, including Gopher (distributed menu
142 system) and WAIS (remote access to document databases), were adding
143 capabilities for accessing many things other than text and hypertext,
144 and the WWW folks were considering type tagging. It was agreed that HTTP
145 should use MIME as the vocabulary for talking about file types and
146 character sets. The result was that HTTP 1.0 added the "content-
147 type" header, following (more or less) MIME. Later, for content
148 negotiation, additional uses of this technology (in 'Accept' headers)
149 were also added.
151 The differences between the use of Internet Media Types between email
152 and HTTP were minor:
154 o default charset
156 o requirement for CRLF in plain text.
158 These minor differences have caused a lot of trouble.
160 2.3. Distributed Extensibility
162 The real advantage of using Internet Media Types to label content
163 was that the web was no longer restricted to a single format. This
164 one addition meant expanding from Global Hypertext to Global
165 Hypermedia (as suggested in a 1992 email [connolly92]):
167 +-------------------------------------------------------------------+
168 | The Internet currently serves as the backbone for a global |
169 | hypertext. FTP and email provided a good start, and the gopher, |
170 | WWW, or WAIS clients and servers make wide area information |
171 | browsing simple. These systems even interoperate, with email |
172 | servers talking to FTP servers, WWW clients talking to gopher |
173 | servers, on and on. |
174 | This currently works quite well for text. But what should WWW |
175 | clients do as Gopher and WAIS servers begin to serve up pictures, |
176 | sounds, movies, spreadsheet templates, postscript files, etc.? |
177 | It would be a shame for each to adopt its own multimedia typing |
178 | system. |
179 | If they all adopt the MIME typing system (and as many other |
180 | features from MIME as are appropriate), we can step from global |
181 | hypertext to global hypermedia that much easier. |
182 +-------------------------------------------------------------------+
184 The fact that HTTP could reliably transport images of different
185 formats, for example, allowed NCSA to add <img> to HTML. MIME
186 allowed other document formats (Word, PDF, Postscript) and other
187 kinds of hypermedia, as well as other applications, to be part of the
188 web. MIME was arguably the most important extensibility mechanism in
189 the web.
191 3. Problems with application to the Web
193 Unfortunately, while the use of Internet Media Types for the web
194 added incredible power, several problems have arisen.
196 3.1. Differences between email and web delivery
198 Some of the differences between the application contexts of email and
199 web delivery determine different requirements:
201 o web "messages" are generally HTTP responses to a specific request;
202 this means you know more about the data before you receive it. In
203 particular, the data really does have a 'name' (mainly, the URL
204 used to access the data), while in messaging, the messages were
205 anonymous.
207 o You would like to know more about the content before you retrieve
208 it. The "tagging" is often not sufficient to know, for example,
209 "can I interpret this if I retrieve it", because of versioning,
210 capabilities, or dependencies on things like screen size or
211 interaction capabilities of the recipient.
213 o Some content isn't delivered over HTTP (files on local file
214 system), or there is no opportunity for tagging (data delivered
215 over FTP) and in those cases, some other ways are needed for
216 determining file type.
218 Operating systems were using, and continued to evolve to use,
219 different systems to determine the 'type' of something, different
220 from the MIME tagging and bagging:
222 o 'magic numbers': in many contexts, file types could be guessed
223 pretty reliably by looking for headers.
225 o Originally Mac OS had a 4 character 'file type' and another 4
226 character 'creator code' for file types.
228 o Windows evolved to use the "file extension" -- 3 letters (and then
229 more) at the end of the file name.
231 Information about these other ways of determining type (rather than
232 by the content-type label) was gathered for the Internet Media Type
233 registry; those registering types are encouraged to also describe
234 'magic numbers', Mac file type, and common file extensions. However,
235 since there was no formal use of that information, the quality of
236 that information in the registry is haphazard.
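
To make the 'magic number' idea concrete, here is a minimal sketch that guesses a type from the leading bytes of the data. The table holds a handful of widely known signatures; the names `MAGIC_NUMBERS` and `guess_type_from_magic` are hypothetical and do not come from the registry or from any browser.

```python
# A few well-known file signatures ("magic numbers"). A real table
# would be much larger; this one exists only for illustration.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def guess_type_from_magic(data: bytes,
                          default: str = "application/octet-stream") -> str:
    """Return the media type whose signature prefixes `data`, else `default`."""
    for signature, media_type in MAGIC_NUMBERS:
        if data.startswith(signature):
            return media_type
    return default


print(guess_type_from_magic(b"%PDF-1.4\n..."))
print(guess_type_from_magic(b"just some text"))
```

A guess of this kind is only as good as the table behind it, which is why the haphazard quality of the registered 'magic number' information matters.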
238 Finally, while tagging and bagging might be OK for unilaterally
239 initiated (one-way) messaging, on the web you might want to know
240 whether you could handle the data before reading it in and
241 interpreting it, and the Internet Media Types weren't enough to tell.
243 3.2. The Rules Weren't Quite Followed
245 The behavior of the community hasn't matched the expectations held
246 when the Internet Media Type registry was designed:
248 o Lots of file types aren't registered (no entry in IANA for file
249 types).
251 o For those that are, the registration is incomplete or incorrect
252 (people doing registration didn't understand 'magic number' or
253 other fields).
255 o The actual content deployed or created by deployed software
256 doesn't match the registration.
258 In particular, web implementations of Internet Media Types diverged
259 from expected behavior:
261 o Browser implementors would be liberal in what they accepted, and
262 use file extension and/or magic number or other 'sniffing'
263 techniques to decide file type, without assuming the content-type
264 label was authoritative. This was necessary anyway for files that
265 weren't delivered by HTTP.
267 o HTTP server implementors and administrators didn't supply ways of
268 easily associating the 'intended' file type label with the file,
269 resulting in files frequently being delivered with a label other
270 than the one they would have chosen if they'd thought about it,
271 and if browsers *had* assumed content-type was authoritative.
272 Some popular servers had default configuration files that treated
273 any unknown type as "text/plain" (plain text in ASCII). Since it
274 didn't matter (the browsers worked anyway), it was hard to get
275 this fixed.
277 Incorrect senders coupled with liberal readers wind up feeding a
278 self-reinforcing (vicious) cycle based on the robustness principle.
280 3.3. Consequences
282 The result, alas, is that the web is unreliable, in that
284 o servers sending responses to browsers don't have a good guarantee
285 that the browser won't "sniff" the content and decide to do
286 something other than treat it as it is labeled
288 o browsers receiving content don't have a good guarantee that the
289 content isn't mis-labeled
291 o intermediaries (gateways, proxies, caches, and other pieces of the
292 web infrastructure) don't have a good way of telling what the
293 conversation means.
295 This ambiguity and 'sniffing' also applies to packaged content in
296 webapps ('bagging' but using ZIP rather than MIME multipart). (NOTE:
297 NEEDS EXPANSION)
299 3.4. The Down Side of Extensibility
301 Extensibility adds great power, and allows the web to evolve without
302 committee approval of every extension. For some (those who want to
303 extend and their clients who want those extensions), this is power!
304 For others (those who are building web components or infrastructure),
305 extensibility is a drawback -- it adds to the unreliability and
306 variability of the web experience. When senders use extensions that
307 recipients aren't aware of, or implement incorrectly or incompletely,
308 communication often fails. With messaging, this is a serious
309 problem, although most 'rich text' documents are still delivered in
310 multiple forms (using multipart/alternative).
312 If your job is to support users of a popular browser, however, where
313 each user has installed a different configuration of file handlers
314 and extensibility mechanisms, MIME may appear to add unnecessary
315 complexity and variable experience for users of all but the most
316 popular types.
318 4. Additional considerations
320 This section notes some additional considerations.
322 4.1. There are related problems with charsets
324 MIME includes provisions not only for file 'types', but also,
325 importantly, the "character encoding" used by text types: for example,
326 simple US-ASCII, Western European ISO-8859-1, Unicode UTF-8. A
327 similar vicious cycle also happened with character set labels:
328 mislabeled content happily processed correctly by liberal browsers
329 encouraged more and more sites to proliferate text with mis-labeled
330 character sets, to the point where browsers feel they *have* to guess
331 the wrong label. (NEEDS EXPANSION)
333 There are sites that intentionally label content as iso-2022-jp or
334 euc-jp when it is in fact one of the Microsoft extension charsets
335 (e.g., for access to circled digits). This is an intentional misuse
336 of the definitions of the charsets themselves -- definitions which
337 originated at the national standards body level.
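
The effect of a wrong charset label can be shown in a few lines. This sketch uses a made-up sample string and only Python's built-in codecs; it shows what a recipient sees when it trusts a wrong label, and why strict decoding of mislabeled text can fail outright.

```python
text = "café"  # sample text, chosen only for illustration

# Content is really UTF-8 but is labeled ISO-8859-1: the recipient
# decodes without error and silently shows the wrong characters.
utf8_bytes = text.encode("utf-8")
mislabeled = utf8_bytes.decode("iso-8859-1")

# Content is really ISO-8859-1 but is labeled UTF-8: a strict decoder
# rejects it, so a recipient must fail, substitute, or guess.
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# A lenient recipient substitutes U+FFFD for the undecodable byte.
lenient = latin1_bytes.decode("utf-8", errors="replace")

print(repr(mislabeled), strict_decode_failed, repr(lenient))
```

Neither outcome is good for the user, which helps explain the pressure on browsers to guess.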
339 4.2. Embedded, downloaded, launch independent application
341 The type of a document might be determined not only for entire
342 documents ("HTML" vs. "Word" vs. "PDF"), but also for embedded
343 components of documents ("JPEG image" vs. "PNG image"). However, the
344 requirements and operational impact of MIME handling are likely
345 different for those two use cases.
347 4.3. Additional Use Cases: Polyglot and Multiview
349 There are some interesting additional use cases which add to the
350 design requirements:
352 o "Polyglot" documents: A 'polyglot' document is one which is some
353 data which can be treated as two different Internet Media Types,
354 in the case where the meaning of the data is the same. This is
355 part of a transition strategy to allow content providers (senders)
356 to manage, produce, store, deliver the same data, but with two
357 different labels, and have it work equivalently with two different
358 kinds of receivers (one of which knows one Internet Media Type,
359 and another which knows a second one). This use case was part of
360 the transition strategy from HTML to an XML-based XHTML, and also
361 a way for a single service to offer both HTML-based and XML-
362 based processing (e.g., same content useful for news articles and
363 web pages).
365 o "Multiview" documents: This use case seems similar but it's quite
366 different. In this case, the same data has very different meaning
367 when served as two different content-types, but that difference is
368 intentional; for example, the same data served as text/html is a
369 document, and served as an RDFa type is some specific data.
371 4.4. Evolution, Versioning, Forking
373 Formats and their specifications evolve over time -- sometimes
374 compatibly, sometimes not. It is part of the responsibility of the
375 designer of a new version of a file type to try to ensure both
376 forward and backward compatibility: new documents work reasonably
377 (with some fallback) with old viewers and old documents work
378 reasonably with new viewers. In some cases this is accomplished,
379 others not; in some cases, "works reasonably" is softened to "either
380 works reasonably or gives clear warning about nature of problem
381 (version mismatch)."
383 In MIME, the 'tag', the Internet Media Type, corresponds to the
384 versioned series. Internet Media Types do not identify a particular
385 version of a file format. Rather, the general idea is that the
386 Internet Media Type identifies the family, and also how you're
387 supposed to otherwise find version information on a per-format basis.
388 Many (most) file formats have an internal version indicator, with the
389 idea that you only need a new Internet Media Type to designate a
390 completely incompatible format. The notion of an "Internet Media
391 Type" is very coarse-grained. The general approach to this has been
392 that the actual Media Type includes provisions for version
393 indicator(s) embedded in the content itself to determine more
394 precisely the nature of how the data is to be interpreted. That is,
395 the message itself contains further information.
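
As one example of a version indicator embedded in the content, PDF files begin with a header of the form `%PDF-x.y`. The sketch below reads that indicator; the function name `pdf_version` is hypothetical, and a careful reader of real files would need to handle more cases than this.

```python
import re
from typing import Optional, Tuple

# A PDF file starts with "%PDF-" followed by major.minor digits.
_PDF_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from a leading PDF header, or None."""
    match = _PDF_HEADER.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(pdf_version(b"%PDF-1.7\n%more bytes"))
print(pdf_version(b"<html>not a pdf</html>"))
```

The media type `application/pdf` stays the same across these versions; only the content says which one is in use.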
397 Unfortunately, lots has gone wrong in this scenario as well:
398 processors that ignore version indicators encourage content creators
399 to be careless about supplying correct version indicators, leading to
400 lots of content with wrong version indicators.
402 Those updating an existing Internet Media Type registration to
403 account for new versions are admonished to not make previously
404 conforming documents non-conforming. This is harder to enforce than
405 would seem, because the previous specifications are not always
406 accurate to what the Internet Media Type was used for in practice.
408 (NOTE: MULTIPLE INCOMPATIBLE AUTHORITATIVE SPECS)
410 4.5. Content Negotiation
412 The general idea of content negotiation is that when party A
413 communicates with party B, and the message can be delivered in more
414 than one format (or version, or configuration), there can be some way
415 for A to communicate the available options to B, and for B to accept
416 an option or indicate preferences.
418 Content negotiation happens all over. When one fax machine chirps to
419 another when initially connecting, they are negotiating resolution,
420 compression methods and so forth. In Internet mail, which is a one-
421 way communication, the "negotiation" consists of the sender preparing
422 and sending multiple versions of the message, one in text/plain, one
423 in text/html, for example, in increasing order of sender preference.
424 The recipient then chooses the last version it can understand.
426 HTTP added "Accept" (based on Internet Media Types) and
427 "Accept-Language" to allow content negotiation in HTTP GET, and there
428 are other methods explained in the HTTP spec.
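
A simplified sketch of how a server might use an "Accept" header to pick among available representations follows. It handles only media ranges and `q` values, prefers the most specific matching range, and ignores other parameters; the function names are hypothetical and the logic is far less complete than what the HTTP spec describes.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept header into (media_range, q) pairs."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat an unparseable q as "not acceptable"
        prefs.append((media_range, q))
    return prefs


def _matches(media_range: str, media_type: str) -> bool:
    if media_range == "*/*":
        return True
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type == m_type and r_sub in ("*", m_sub)


def _specificity(media_range: str) -> int:
    if media_range == "*/*":
        return 0
    return 1 if media_range.endswith("/*") else 2


def choose(accept: str, available: List[str]) -> Optional[str]:
    """Pick the available type with the highest q, or None if none is acceptable."""
    prefs = parse_accept(accept)
    best, best_q = None, 0.0
    for media_type in available:
        candidates = [(_specificity(r), q) for r, q in prefs
                      if _matches(r, media_type)]
        if not candidates:
            continue
        _, q = max(candidates)  # the most specific matching range decides q
        if q > best_q:
            best, best_q = media_type, q
    return best


print(choose("text/html;q=0.8, application/xhtml+xml, */*;q=0.1",
             ["text/html", "application/xhtml+xml"]))
```

Note that the negotiation here is expressed purely in terms of Internet Media Types, so it can say nothing about versions or capabilities within a type.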
430 4.6. Fragment identifiers
432 The web added the notion of being able to address part of a content
433 and not the whole content by adding a 'fragment identifier' to the
434 URL that addressed the data. Of course, this made sense
435 for the original web with just HTML, but how would it apply to other
436 content? The URL spec glibly noted that "the definition of the
437 fragment identifier meaning depends on the Internet Media Type", but
438 unfortunately, few of the Internet Media Type definitions included
439 this information, and practices diverged greatly.
441 If the interpretation of fragment identifiers depends on the MIME
442 type, though, the same fragment identifier can mean different things
443 when content negotiation selects among several types for one URL.
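
The dependence of a fragment identifier on the media type can be sketched as follows. The URL is a placeholder, the function name `interpret_fragment` is hypothetical, and the only rule encoded is the familiar HTML one (the fragment names an element); everything else is deliberately left undefined to mirror the gap described above.

```python
from typing import Tuple
from urllib.parse import urldefrag


def interpret_fragment(url: str, media_type: str) -> Tuple[str, str]:
    """Split off the fragment and describe what it means for this type."""
    base, fragment = urldefrag(url)
    if not fragment:
        return base, "whole resource"
    if media_type == "text/html":
        return base, f"element with id '{fragment}'"
    # Assumption: this sketch knows no fragment rules for other types.
    return base, f"undefined for {media_type} in this sketch"


url = "http://example.org/report#summary"
print(interpret_fragment(url, "text/html"))
print(interpret_fragment(url, "application/pdf"))
```

The same URL yields two different answers depending on which representation the negotiation happens to select.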
445 5. Where we need to go
447 Many people are confused about the purpose of MIME in the web, its
448 uses, and the meaning of Internet Media Types. Many W3C specifications,
449 TAG findings, and Internet Media Type registrations make what are
450 (IMHO) incorrect assumptions about the meaning and purposes of an
451 Internet Media Type registration.
453 We need a clear direction on how to make the web more reliable, not
454 less. We need a realistic transition plan from the unreliable web to
455 the more reliable one. Part of this is to encourage senders (web
456 servers) to mean what they say, and encourage recipients (browsers)
457 to give preference to what the senders are sending.
459 We should try to create specifications for protocols and best
460 practices that will lead the web to more reliable and secure
461 communication. To this end, we give an overall architectural
462 approach to use of MIME, and then specific specifications for HTTP
463 clients and servers, Web browsers in general, and proxies and
464 intermediaries. These encourage behavior which, on the one hand,
465 continues to work with the already deployed infrastructure (of
466 servers, browsers, and intermediaries), and which, on the other hand,
467 improves the operability, reliability and security of the web if
468 followed.
470 NOTE: This section should be elaborated to include requirements for
471 changes to MIME and Internet Media Type registrations to improve the
472 situation.
474 6. Specific recommendations
476 NOTE: We should try to get agreement on the background, problem
477 statement and requirements, before sending out any more about
478 possible solutions. The intention is that recommendations for
479 changes to IETF-specified processes and registries would be moved
480 into a new BCP-track document.
482 However, the following is a partial list of documents that should be
483 reviewed and updated, or new documents written.
485 6.1. Internet Media Type registration
487 Update the Internet Media Type registration process (via a new IETF
488 BCP document):
490 o Allow commenting or easier update; not all Internet Media Type
491 owners need or have all the information the Internet needs. Wiki
492 for Internet Media Types as well as formal registry? Ability to
493 add comments about deployed senders, deployed content, deployed
494 receivers for new receivers or senders.
496 o Be clearer about relationship of 'magic numbers' to sniffing;
497 review Internet Media Types already registered and update.
499 o Be clearer about requiring Security Considerations to address
500 risks of sniffing
502 o Require definition of fragment identifier applicability.
503 o Ask the 'applications that use this type' section to be clearer
504 about whether the file type is suitable for embedding (plug-in) or
505 as a separate document with auto-launch (MIME handler), or should
506 always be downloaded.
508 o Be clearer about file extension use and relationship of file
509 extensions to MIME handlers
511 6.2. Sniffing
513 Various new specifications promote the use of 'sniffing' -- using the
514 content of the data to supplement or even override the declared
515 content-type or charset. Update these specifications:
517 o Sniffing uses MIME registry for 'magic numbers'
519 o All sniffing can be a privilege upgrade, if there is a buggy
520 recipient, although bugs can be fixed.
522 o discourage sniffing unless there is no type label:
524 * malformed content-type: error
526 * no knowledge that given content-type isn't better than guessed
527 content-type
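
One possible reading of the policy outlined in this list is sketched below: sniff only when no label is present, treat a malformed label as an error, and otherwise trust the declared type. All names are hypothetical, the sniffer is a toy passed in by the caller, and the sketch is an interpretation of the bullets, not a specified algorithm.

```python
import re
from typing import Callable, Optional

# HTTP "token" characters, used for the type and subtype of a media type.
_TOKEN = r"[!#$%&'*+.^_`|~0-9A-Za-z-]+"
_MEDIA_TYPE = re.compile(rf"^({_TOKEN})/({_TOKEN})\s*(;.*)?$")


class MalformedContentType(ValueError):
    """Raised when a Content-Type label is present but unparseable."""


def effective_type(declared: Optional[str], body: bytes,
                   sniff: Callable[[bytes], str]) -> str:
    """Decide how to treat `body`, preferring the sender's label."""
    if declared is None:
        # No type label at all: the only case where sniffing is used.
        return sniff(body)
    match = _MEDIA_TYPE.match(declared.strip())
    if match is None:
        # Malformed label: report an error instead of guessing.
        raise MalformedContentType(declared)
    # Well-formed label: trust it, even if the content looks different.
    return f"{match.group(1)}/{match.group(2)}".lower()


def toy_sniffer(body: bytes) -> str:
    """Stand-in sniffer for the example; recognizes only PDF."""
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    return "application/octet-stream"


print(effective_type("text/html; charset=utf-8", b"%PDF-1.4", toy_sniffer))
print(effective_type(None, b"%PDF-1.4", toy_sniffer))
```

Under this reading, a labeled response is never reinterpreted, so a server can rely on its label being honored.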
529 6.3. Other specifications and BCPs
531 o FTP specifications: do FTP clients also change rules about
532 guessing file types based on OS of FTP server?
534 o Update TAG finding on authoritative metadata: is it possible to
535 remove 'authority'?
537 o New: MIME and Internet Media Type section to WebArch, referencing
538 this memo
540 o New: Add W3C web architecture material on MIME in HTML to W3C
541 web site, referencing this memo
543 o Reconsider other extensibility mechanisms (namespaces, for
544 example): should they use MIME or something like it?
546 7. Acknowledgements
548 This document is the result of discussions among many individuals in
549 the IETF and W3C.
551 8. IANA Considerations
553 This memo includes no request to IANA.
555 9. Security Considerations
557 This document discusses some of the security issues resulting from
558 use (and mis-use) of MIME content types in the web.
560 10. Informative References
562 [connolly92]
563 Connolly, D., "Global Hypermedia", Oct 1992, .
567 Author's Address
569 Larry Masinter
570 Adobe
571 345 Park Ave.
572 San Jose, 95110
573 USA
575 Phone: +1 408 536 3024
576 Email: masinter@adobe.com
577 URI: http://larry.masinter.net