[apps-discuss] Feedback about "Update to MIME regarding Charset Parameter Handling in Textual Media Types"

Henri Sivonen <hsivonen@iki.fi> Wed, 22 February 2012 13:25 UTC

From: Henri Sivonen <hsivonen@iki.fi>
To: apps-discuss@ietf.org
Cc: Anne van Kesteren <annevk@opera.com>

In reference to
https://svn.tools.ietf.org/svn/wg/appsawg/draft-ietf-appsawg-mime-default-charset/latest/draft-ietf-appsawg-mime-default-charset.html

First of all, thank you for finally taking on the much-needed update
to RFC 2046 rules.

Unfortunately, in my opinion the draft doesn't address the problem the
right way. My comments on quotes from the draft follow.

> Each subtype of the "text" media type which uses the "charset" parameter can define its own default value for the "charset" parameter, including absence of any default.

Additionally, media types should be able to define circumstances where
in-band indicators override the charset parameter even if the charset
parameter is present.

In particular, media types should be allowed to override the charset
parameter if the first two or three bytes of the payload look like a
UTF-16 or UTF-8 BOM.

See:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15359
https://bugzilla.mozilla.org/show_bug.cgi?id=716579
https://bugzilla.mozilla.org/show_bug.cgi?id=687859
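
As a rough illustration of such a BOM override (a sketch only, not
text proposed for the draft; the names sniff_bom and
effective_encoding are made up for this example), the in-band
indicator would win over a conflicting charset parameter like this:

```python
import codecs

# BOM signatures, checked against the first two or three octets.
_BOMS = (
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 0xFE 0xFF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 0xFF 0xFE
    (codecs.BOM_UTF8, "utf-8"),          # 0xEF 0xBB 0xBF
)


def sniff_bom(body):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return None


def effective_encoding(body, charset_param):
    """Hypothetical helper: a BOM, if present, overrides the parameter."""
    return sniff_bom(body) or charset_param


# The header claims ISO-8859-1, but the payload starts with a UTF-8 BOM.
print(effective_encoding(b"\xef\xbb\xbfhello", "iso-8859-1"))
```

Without a BOM the charset parameter is returned unchanged, so this
only affects payloads that carry the in-band indicator.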

> In order to improve interoperability with deployed agents, "text/*" media type definitions SHOULD either a) specify that the "charset" parameter is not used for the defined subtype, because the charset information is transported inside the payload (as in "text/xml")

This seems wrong. If the charset parameter is present, it has an
effect for text/xml.

> or b) require explicit unconditional inclusion of the "charset" parameter eliminating the need for a default value.

This seems naïve. Formats need to specify what happens when a charset
parameter is missing, since no matter how much the format says it's
"required", the party sending data can omit the charset parameter.
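
To illustrate why a receiver needs a defined answer for the missing
case, here is a minimal sketch using Python's standard email package
to read the parameter. The function name charset_or_fallback and the
"windows-1252" fallback are placeholders for whatever a given format
would specify, not anything taken from the draft:

```python
from email.message import Message


def charset_or_fallback(content_type, format_fallback):
    """Return the declared charset, or the format's fallback if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the sender left it out.
    declared = msg.get_content_charset()
    return declared if declared is not None else format_fallback


print(charset_or_fallback('text/plain; charset="UTF-8"', "windows-1252"))
print(charset_or_fallback("text/plain", "windows-1252"))
```

The second call is the case the draft's option (b) leaves undefined:
the sender omitted a "required" parameter and the receiver still has
to pick something.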

> In accordance with option (a), above, "text/*" media types that can transport charset information inside the corresponding payloads, specifically including "text/html" and "text/xml", SHOULD NOT specify the use of a "charset" parameter, nor any default value, in order to avoid conflicting interpretations should the charset parameter value and the value specified in the payload disagree.

For backwards compatibility, pretty much every existing text/* type
will have to violate this "SHOULD NOT".

> New subtypes of the "text" media type, thus, SHOULD NOT define a default "charset" value. If there is a strong reason to do so despite this advice, they SHOULD use the "UTF-8" [RFC3629] charset as the default.

Seems reasonable.

> Specifications of how to specify the "charset" parameter, and what default value, if any, is used, are subtype-specific, NOT protocol-specific.

Seems reasonable.

> Protocols that use MIME, therefore, MUST NOT override default charset values for "text/*" media types to be different for their specific protocol. The protocol definitions MUST leave that to the subtype definitions.

Seems reasonable.

> The default charset parameter value for text/plain is unchanged from [RFC2046] and remains as "US-ASCII".

This is incompatible with reality. Web browsers, for instance, assume
a configuration-dependent default (which correlates with browser
localization) and may also perform a heuristic analysis on the
payload, depending on configuration that, again, correlates with
localization by default.

I suggest specifying the following instead of sections 3 and 4 of the draft:

3. New rules for determining the character encoding for text/* media types

Each text/* media type MUST specify an algorithm for establishing the
character encoding of the entity body from the entity body (or
preferably the first N bytes thereof, preferably with N = 1024), the
charset parameter and Other Information. Other Information MAY include
configuration, an encoding label supplied by the referrer, the
previous encoding of an entity body retrieved from the same location
or the encoding of the referrer. New text/* media types MUST NOT use
Other Information in the algorithms they specify. New text/* media
types SHOULD use the following algorithm:

The character encoding is UTF-8. Terminate these steps.

4. Determining the character encoding for text/plain

If the first 2 octets of the entity body are 0xFE followed by 0xFF,
the character encoding is big-endian UTF-16. Terminate these steps.

If the first 2 octets of the entity body are 0xFF followed by 0xFE,
the character encoding is little-endian UTF-16. Terminate these steps.

If the first 3 octets of the entity body are 0xEF followed by 0xBB
followed by 0xBF, the character encoding is UTF-8. Terminate these
steps.

If the value of the charset parameter is an ASCII case-insensitive[1]
match for a label[2] of a supported encoding, the character encoding
is the encoding whose label was matched. Terminate these steps.

If the entity is being navigated to in a browsing context[3] and the
previous document had the same origin[4] as the text/plain entity, the
character encoding is the encoding of the referring document.
Terminate these steps. (Disclaimer: I'm not 100% sure that this step
is in the right order relative to the others.)

Optional: If a heuristic detector recognizes the octets of the entity
body as being encoded according to some encoding, the character
encoding is that encoding. Terminate these steps. This step SHOULD NOT
be implemented for locales where it has not been implemented
traditionally.

If the entity is being loaded into a nested browsing context that has
the same origin as the parent browsing context, the encoding is the
encoding of the document loaded in the parent browsing context.
Terminate these steps.

If the entity is being loaded into a browsing context and is being
fetched from a location from which an entity has been loaded before
and the previous character encoding has been cached, the character
encoding is the cached encoding. Terminate these steps.

If the entity is being loaded via a non-browsing context mechanism
(such as XMLHttpRequest) that defines a fallback encoding, use that
encoding. Terminate these steps.

Otherwise, the character encoding is a configuration-dependent
encoding. The default configuration SHOULD depend on the locale of the
user agent according to the table given in step 8 in [5]. Terminate
these steps.
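
The text/plain steps above could be sketched roughly as follows. This
is an illustration of the ordering only, under several assumptions:
LoadContext and its field names are invented here to stand in for the
"Other Information", label matching uses Python's codec registry
rather than the label table in [2], and the "windows-1252" default is
a placeholder rather than a value taken from the table in [5].

```python
import codecs
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoadContext:
    """Hypothetical bundle of the Other Information the steps consult."""
    same_origin_referrer_encoding: Optional[str] = None  # navigation
    detector: Optional[Callable[[bytes], Optional[str]]] = None
    same_origin_parent_encoding: Optional[str] = None    # nested context
    cached_encoding: Optional[str] = None                # same location
    api_fallback_encoding: Optional[str] = None          # e.g. XMLHttpRequest
    configured_default: str = "windows-1252"             # placeholder


def lookup_label(label):
    """Stand-in for label matching; case-insensitive via codecs.lookup."""
    if not label:
        return None
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


def text_plain_encoding(body, charset_param, ctx):
    # BOM steps come first and override everything else.
    if body.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if body.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # The charset parameter counts only if it names a supported encoding.
    declared = lookup_label(charset_param)
    if declared:
        return declared
    if ctx.same_origin_referrer_encoding:
        return ctx.same_origin_referrer_encoding
    if ctx.detector:  # the optional heuristic step
        guess = ctx.detector(body)
        if guess:
            return guess
    # Remaining fallbacks, in the order the steps list them.
    for candidate in (ctx.same_origin_parent_encoding,
                      ctx.cached_encoding,
                      ctx.api_fallback_encoding):
        if candidate:
            return candidate
    return ctx.configured_default


print(text_plain_encoding(b"\xff\xfeh\x00", "utf-8", LoadContext()))
```

Keeping the Other Information in one object makes it easy to see that
an algorithm for a new text/* type would take no such argument at all.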

[1] http://www.whatwg.org/specs/web-apps/current-work/#ascii-case-insensitive
[2] http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#concept-encoding-label
[3] http://www.whatwg.org/specs/web-apps/current-work/#browsing-context
[4] http://tools.ietf.org/html/rfc6454
[5] http://www.whatwg.org/specs/web-apps/current-work/#determining-the-character-encoding

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/