Re: [apps-discuss] Feedback about "Update to MIME regarding Charset Parameter Handling in Textual Media Types"

Ned Freed <ned.freed@mrochek.com> Thu, 23 February 2012 07:14 UTC

Message-id: <01OCB41ZCJES00ZUIL@mauve.mrochek.com>
Date: Wed, 22 Feb 2012 19:59:01 -0800
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Wed, 22 Feb 2012 15:25:02 +0200" <CAJQvAudekOKa2mzas-igD_6pa2je000Darin2HDNda-sk9TLCQ@mail.gmail.com>
References: <CAJQvAudekOKa2mzas-igD_6pa2je000Darin2HDNda-sk9TLCQ@mail.gmail.com>
To: Henri Sivonen <hsivonen@iki.fi>
Cc: Anne van Kesteren <annevk@opera.com>, apps-discuss@ietf.org

> In reference to
> https://svn.tools.ietf.org/svn/wg/appsawg/draft-ietf-appsawg-mime-default-charset/latest/draft-ietf-appsawg-mime-default-charset.html

> First of all, thank you for finally taking on the much-needed update
> to RFC 2046 rules.

> Unfortunately, the draft doesn't address the problem the right way in
> my opinion. Quotes from the draft.

> > Each subtype of the "text" media type which uses the "charset" parameter can define its own default value for the "charset" parameter, including absence of any default.

> Additionally, media types should be able to define circumstances where
> in-band indicators override the charset parameter even if the charset
> parameter is present.

That's a terrible way to do it - if the type is self-identifying in terms of
charset, a charset parameter should simply not be defined for the type -
exactly what the current specification says to do. The result is effectively
the same as what you're proposing since undefined parameters are supposed to be
ignored. The difference is your approach is effectively sanctioning active
mislabelling, and that's a road we've already explored way too much in an
informal fashion, especially in the CJK arena. Formalizing it is the last
thing we should be doing.

> In particular, media types should be allowed to override the charset
> parameter if the first two or three bytes of the payload look like an
> UTF-16 or UTF-8 BOM.

There are quite a few charsets in existence where it is perfectly permissible
for the first few bytes to match a BOM, except that it means something entirely
different.
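
To make the point about BOM-like bytes concrete, here is a minimal Python sketch. It is an illustration only: the three charsets listed are an assumption chosen for the example, not an exhaustive survey. It decodes the byte sequences that the BOM checks look for and shows that each one is ordinary, legal text in these single-byte charsets.

```python
# Byte sequences that the UTF-16 and UTF-8 BOM checks look for.
BOM_CANDIDATES = {
    "UTF-16BE BOM": b"\xfe\xff",
    "UTF-16LE BOM": b"\xff\xfe",
    "UTF-8 BOM": b"\xef\xbb\xbf",
}

# Illustrative single-byte charsets (an assumption made for this sketch).
CHARSETS = ["iso-8859-1", "windows-1252", "koi8-r"]


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under every charset in CHARSETS."""
    return {name: raw.decode(name) for name in CHARSETS}


for label, raw in BOM_CANDIDATES.items():
    for charset, text in decode_all(raw).items():
        # Every combination decodes without error to printable characters.
        print(f"{label} read as {charset}: {text!r}")
```

None of these decodes raises an error, so a receiver that trusts the first two or three bytes over a correct label would silently misread such content.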

> See:
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359
> https://bugzilla.mozilla.org/show_bug.cgi?id=716579
> https://bugzilla.mozilla.org/show_bug.cgi?id=687859

> > In order to improve interoperability with deployed agents, "text/*" media
> > type definitions SHOULD either a) specify that the "charset" parameter is not
> > used for the defined subtype, because the charset information is transported
> > inside the payload (as in "text/xml")

> This seems wrong. If the charset parameter is present, it has an
> effect for text/xml.

That's only because the definition of text/xml did it incorrectly. I'll grant
you, however, that the example is a little weird. What it should say is that
XML-based formats can and should self-identify the format. But this has no
technical effect on the specification.

> > or b) require explicit unconditional inclusion of the "charset" parameter
> > eliminating the need for a default value.

> This seems naïve. Formats need to specify what happens when a charset
> parameter is missing, since no matter how much the format says it's
> "required", the party sending data can omit the charset parameter.

Yep. They can also misspell the parameter name, misspell the charset name, use
the wrong type/subtype, misspell the header field name, omit the field
entirely, or any of a million other mistakes.

And when (not if) this sort of thing happens, the receiver can elect to pursue
whatever course of action it deems appropriate for invalid material. It can
reject it, refuse to show it, use a default, sniff it and make a guess, prompt
the user and ask what to do, whatever. Not allowing specification of a default
in the type registration has no impact at all on the options an implementation
has available. In fact it's the other way around - allowing specification of a
default *limits* implementation choices.

We learned long ago through multiple painful experiences that attempting to
specify the handling of invalid stuff is an enormous rathole that is best given
wide berth. And even in the cases where it is successful in producing some sort
of advice, or worse making what should be illegal legal, that advice or
approach is usually bound to a place and time and becomes the wrong advice or
approach at some point in the future. So, far from being naive, it is your
approach that has been shown, over and over, to be the naive one.

> > In accordance with option (a), above, "text/*" media types that can transport charset information inside the corresponding payloads, specifically including "text/html" and "text/xml", SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the charset parameter value and the value specified in the payload disagree.

> For backwards compatibility, pretty much every existing text/* type
> will have to violate this "SHOULD NOT".

Yep. That's the main reason why it needs to be a SHOULD.

> > New subtypes of the "text" media type, thus, SHOULD NOT define a default "charset" value. If there is a strong reason to do so despite this advice, they SHOULD use the "UTF-8" [RFC3629] charset as the default.

> Seems reasonable.

> > Specifications of how to specify the "charset" parameter, and what default value, if any, is used, are subtype-specific, NOT protocol-specific.

> Seems reasonable.

> > Protocols that use MIME, therefore, MUST NOT override default charset values for "text/*" media types to be different for their specific protocol. The protocol definitions MUST leave that to the subtype definitions.

> Seems reasonable.

> > The default charset parameter value for text/plain is unchanged from
> > [RFC2046] and remains as "US-ASCII".

Note that this has the effect of rendering any content that contains 8bit
as being invalid. Implementations are then free to choose how to handle
that, as above. Of course this leaves 7bit charsets in the lurch, notably
the iso-2022-* group, which is the main reason why I'm not happy with
the US-ASCII default, as I mention below.
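
The iso-2022-* problem can be shown with a short Python sketch. This is an illustration under stated assumptions: the sample text is arbitrary, and Python's built-in "iso-2022-jp" codec stands in for the whole iso-2022-* group. Because the encoded octets are all 7bit, reading them under a US-ASCII default raises no error at all; it just yields the wrong text.

```python
text = "日本語のテキスト"  # arbitrary sample text, chosen for this sketch

# ISO-2022-JP switches character sets with escape sequences and puts
# only 7bit octets on the wire.
wire = text.encode("iso-2022-jp")
is_7bit = all(octet < 0x80 for octet in wire)

# Decoding under the US-ASCII default therefore succeeds without error ...
as_ascii = wire.decode("us-ascii")

# ... but only the correct label recovers the original text.
round_trip = wire.decode("iso-2022-jp")

print(is_7bit)
print(repr(as_ascii))
print(as_ascii == text, round_trip == text)
```

So an 8bit check cannot flag unlabelled iso-2022-* content as invalid, which is why such content is left in the lurch by a US-ASCII default.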

> This is incompatible with reality. Web browsers, for instance, assume
> a configuration-dependent default (which correlates with browser
> localization) and may also (depending on configuration which, again,
> correlates with localization by default) perform a heuristic analysis
> on the payload.

Well, on this one you'll have to argue with someone else. I don't especially
like the approach in the draft, but I have been unable to come up with any
reasonable alternative. Your proposed alternative below is totally unworkable.

> I suggest specifying the following instead of sections 3 and 4 of the draft:

> 3. New rules for determining the character encoding for text/* media types

> Each text/* media type MUST specify an algorithm for establishing the
> character encoding of the entity body from the entity body (or
> preferably the first N bytes thereof, preferably with N = 1024), the
> charset parameter and Other Information. Other Information MAY include
> configuration, an encoding label supplied by the referrer, the
> previous encoding of an entity body retrieved from the same location
> or the encoding of the referrer. New text/* media types MUST not use
> Other Information in the algorithms they specify.
> New text/* media types SHOULD use the following algorithm:

> The character encoding is UTF-8. Terminate these steps.

This approach is ridiculous when you consider how general the usage of some
text types is and the close similarity of many charsets in common use. It is
not at all uncommon for there to be only one or two characters in a large
document that display incorrectly when the wrong charset is selected.
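
As a small illustration of how close common charsets are, the Python sketch below decodes the same octets under three related charsets and lists the positions where the results differ. The sample sentence and the choice of charsets are assumptions made for this example.

```python
# Sample sentence encoded in ISO-8859-15 (an arbitrary choice for the sketch).
original = "The café charges 5 € per crêpe."
raw = original.encode("iso-8859-15")

# Decode the same octets under closely related charsets.
decoded = {
    charset: raw.decode(charset)
    for charset in ("iso-8859-15", "iso-8859-1", "windows-1252")
}


def differing_positions(a: str, b: str) -> list:
    """Return the indexes at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


for charset, text in decoded.items():
    diffs = differing_positions(original, text)
    print(f"{charset}: {text!r} differs at {diffs}")
```

Here only the currency sign changes between the decodings; the accented letters come out identically, so a wrong guess is easy to miss in a long document.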

> 4. Determining the character encoding for text/plain

> If the first 2 octets of the entity body are 0xFE followed by 0xFF,
> the character encoding is big-endian UTF-16. Terminate these steps.

> If the first 2 octets of the entity body are 0xFF followed by 0xFE,
> the character encoding is little-endian UTF-16. Terminate these steps.

See my note above about BOMs.

> If the first 3 octets of the entity body are 0xEF followed by 0xBB
> followed by 0xBF, the character encoding is UTF-8. Terminate these
> steps.

I've checked and I have yet to receive a single message in utf-8 with a leading
BOM. But this sequence can show up in material in other charsets.

> If the value of the charset parameter is a ASCII case-insensitive[1]
> match for a label[2] of a supported encoding, the character encoding
> is the encoding whose label was matched. Terminate these steps.

So the label is secondary to sniffing. Totally unacceptable. It is also
unacceptable to ignore a valid label just because you don't support 
the encoding.
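
The ordering objected to here can be sketched in Python. This is a rough model of only the first four quoted steps, in the quoted order. The function name is hypothetical, and treating "supported" as "Python has a codec for the label" is an assumption; the proposal itself refers to the labels of the W3C encoding draft.

```python
import codecs
from typing import Optional


def proposed_text_plain_encoding(body: bytes, charset_label: Optional[str]) -> Optional[str]:
    """Model the first four steps of the quoted proposal, in the quoted order."""
    if body[:2] == b"\xfe\xff":
        return "utf-16-be"
    if body[:2] == b"\xff\xfe":
        return "utf-16-le"
    if body[:3] == b"\xef\xbb\xbf":
        return "utf-8"
    if charset_label is not None:
        try:
            # Assumption: "supported" means Python knows a codec for the label.
            return codecs.lookup(charset_label).name
        except LookupError:
            pass  # Unsupported label: the proposal falls through to later steps.
    return None  # Later steps (referrer, heuristics, locale) are not modelled.


# A body that is correctly labelled ISO-8859-1 and happens to start with "ÿþ".
body = "ÿþ is how this note begins".encode("iso-8859-1")
chosen = proposed_text_plain_encoding(body, "ISO-8859-1")
print(chosen)
```

In this model the valid label is never consulted for that body, and a valid label with no matching codec is skipped entirely, which are the two behaviours criticized above.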

> If the entity is being navigate to in a browsing context[3] and the
> previous document had the same origin[4] as the text/plain entity, the
> character encoding is the encoding of the referring document.
> Terminate these steps. (Disclaimer: I'm not 100% sure that this step
> is in the right order relative to the others.)

> Optional: If a heuristic detector recognizes the octets of the entity
> body as being encoding according to an encoding, the character
> encoding is that encoding. Terminate these steps. This step SHOULD NOT
> be implemented for locales where it has not been implemented
> traditionally.

> If the entity is being loaded into a nested browsing context that has
> the same origin as the parent browsing context, the encoding is the
> encoding of the document loaded in the parent browsing context.
> Terminate these steps.

> If the entity is being loaded into a browsing context and is being
> fetched from a location from which an entity has been loaded before
> and the previous character encoding has been cached, the character
> encoding is the cached encoding. Terminate these steps.

> If the entity is being loaded via a non-browsing context mechanism
> (such as XMLHttpRequest) that defines a fallback encoding, use that
> encoding. Terminate these steps.

> Otherwise, the character encoding is a configuration-dependent
> encoding. The default configuration SHOULD depend on the locale of the
> user agent according to the table given in step 8 in [5]. Terminate
> these steps.

Finally, I'd say the odds of any implementor following such a complex
set of rules are remote in the extreme.

> [1] http://www.whatwg.org/specs/web-apps/current-work/#ascii-case-insensitive
> [2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#concept-encoding-label
> [3] http://www.whatwg.org/specs/web-apps/current-work/#browsing-context
> [4] http://tools.ietf.org/html/rfc6454
> [5] http://www.whatwg.org/specs/web-apps/current-work/#determining-the-character-encoding

				Ned