Re: [apps-discuss] I-D Action: draft-ietf-appsawg-xml-mediatypes-05.txt

Bjoern Hoehrmann <derhoermi@gmx.net> Tue, 19 November 2013 15:03 UTC

From: Bjoern Hoehrmann <derhoermi@gmx.net>
To: ht@inf.ed.ac.uk
Date: Tue, 19 Nov 2013 16:03:00 +0100
Message-ID: <q8nm895dap8iefa6srlf1k5787j8fuc6n4@hive.bjoern.hoehrmann.de>
References: <20131119120919.12901.59046.idtracker@ietfa.amsl.com> <f5b1u2cr365.fsf@troutbeck.inf.ed.ac.uk>
In-Reply-To: <f5b1u2cr365.fsf@troutbeck.inf.ed.ac.uk>
Cc: apps-discuss@ietf.org

* Henry S. Thompson wrote:
>My reasoning for not changing 3.6 in response to Bjoern Hoehrmann's
>objection to the treatment of a BOM as authoritative even in the
>presence of a charset parameter [1] is set out in [2].

I said that I think this is a serious and problematic change that needs
wide review prior to a two-week Last Call period and that the change is
fine by me provided there is evidence of wide review. It would be easy,
for instance, to ask on a set of mailing lists, say

  * ietf-charsets@iana.org
  * www-international@w3.org
  * unicode@unicode.org

for endorsements for one part of the proposal which would require XML
implementations to treat

  data:application/xml;charset=utf-32,%FF%FE%00%00...

as a malformed UTF-16 encoded document. If the question is not
misleading and there are endorsements by qualified individuals, then
I'd be satisfied that this part of the proposal has received adequate
review. You could also point me to messages in the archive of this
list where such individuals have provided a rationale for their
support of this change.
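
To make the ambiguity concrete, here is a minimal Python sketch. It is
not taken from the draft or from any XML parser; the sample document
"<a/>" and all variable names are assumptions made for illustration.
It only shows that the four octets FF FE 00 00 are at once a UTF-32LE
BOM and a UTF-16LE BOM followed by a zero code unit.

```python
import codecs
from urllib.parse import unquote_to_bytes

# First four octets of the data: URI payload quoted above.
prefix = unquote_to_bytes("%FF%FE%00%00")

# Hypothetical sample entity: "<a/>" in little-endian UTF-32 with a BOM.
payload = codecs.BOM_UTF32_LE + "<a/>".encode("utf-32-le")

# The same octets match both signatures: FF FE 00 00 (UTF-32LE BOM)
# begins with FF FE (UTF-16LE BOM).
is_utf32_bom = payload.startswith(codecs.BOM_UTF32_LE)
is_utf16_bom = payload.startswith(codecs.BOM_UTF16_LE)

# Honouring charset=utf-32 recovers the document text.
as_utf32 = payload.decode("utf-32")

# Treating FF FE as an authoritative UTF-16 BOM makes U+0000 the first
# character, and U+0000 is not a legal XML character.
as_utf16 = payload.decode("utf-16")

print(repr(as_utf32))
print(repr(as_utf16))
```

Under the first reading the entity is an ordinary XML document; under
the second, a conforming processor would have to reject it.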

I note that the "Changes from RFC 3023" appendix of the document does
not mention the changes in question, and you need a good grasp of the
issues to notice them even when reading the document carefully. If the
document had explained the changes where appropriate, for instance
with a note like the following for the issue above (which is just one
among several),

  NOTE: While appendix F.1 of the XML 1.0 Recommendation suggests that
  an initial byte sequence of FF FE 00 00 is indicative of the UTF-32
  character encoding, the rules of this specification require such a
  sequence to be interpreted as a UTF-16 Byte Order Mark followed by a
  U+0000 character, rendering the document malformed. It is therefore
  impossible to use UTF-32 in application/xml, ... entities; a charset
  parameter cannot be used to prevent misinterpretation of documents
  encoded in UTF-32 when using the media types defined in this memo.

then there might be a plausible claim that people knew of this change
and some of its implications and signed off on it. The document does
not have that, so I find it reasonable to believe that people are
unaware of these changes, have not considered their implications, and
would possibly object to them if they did. I cannot even tell myself
whether a note such as the one above accurately reflects what you mean
to propose.
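
The difference between the two precedence rules can be sketched as
follows. This is an illustration under stated assumptions, not the
draft's algorithm: the BOM table (UTF-8 and UTF-16 signatures only)
follows the reading given in the note above, and the function names
and the bom_wins switch are hypothetical.

```python
import codecs
from email.message import Message
from typing import Optional

# Assumed BOM table: UTF-8 and UTF-16 signatures only, per the reading
# of the draft in the note above (an assumption, not a quotation).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

def charset_param(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def sniff_bom(body: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None

def pick_encoding(content_type: str, body: bytes,
                  bom_wins: bool) -> Optional[str]:
    """Choose between the charset parameter and the BOM.

    bom_wins=True models a BOM-is-authoritative rule;
    bom_wins=False models a charset-is-authoritative rule.
    """
    charset = charset_param(content_type)
    bom = sniff_bom(body)
    if bom_wins:
        return bom or charset
    return charset or bom

# Hypothetical UTF-32LE entity labelled with charset=utf-32.
body = codecs.BOM_UTF32_LE + "<a/>".encode("utf-32-le")
header = "application/xml; charset=utf-32"
print(pick_encoding(header, body, bom_wins=True))
print(pick_encoding(header, body, bom_wins=False))
```

With bom_wins=True the label is overridden and the entity is read as
UTF-16; with bom_wins=False the charset parameter decides.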

I do not think it would be useful to continue the discussion between
only the two of us, and I find your misrepresentation of my position
most unhelpful. I do appreciate that you have now started to gather
some information on what running code actually does, but let's have a
look at your claims:

  Expat is a surprising case -- it provides a parameter which can be
  used to pass in a character encoding name, but it will ignore this if
  it detects a different encoding in its input byte-stream (tested via
  both the Perl and Python embeddings, and confirmed by examining the
  source).  So in fact expat does explicitly to treat a BOM as
  authoritative.

Well then let us examine the source:

  /* This is what detects the encoding.  ...
  */
  
  static int
  initScan(const ENCODING * const *encodingTable, 
  ...
      case 0xEFBB:
        /* Maybe a UTF-8 BOM (EF BB BF) */
        /* If there's an explicitly specified (external) encoding
           of ISO-8859-1 or some flavour of UTF-16
           and this is an external text entity,
           don't look for the BOM,
           because it might be a legal data.
        */

What does that suggest about expat always looking for a BOM that
overrides anything else? Or how about this claim:

  (I'm not ignoring your reminder that you have built an add-on to
  perl's HTML::Parser module which treats the charset as authoritative,
  but since that module does not qualify as a conformant XML parser in
  any case, it's not really relevant to 3023bis).

The W3C Markup Validator uses my HTML::Encoding module to detect the
character encoding of HTML and XHTML documents. XHTML documents use
whatever rules there are for XML documents to detect the encoding. It
is an implementation of "Given an HTTP response, what is the encoding
of the XML document in it?", which is what most of the draft is about.

Inconsistent, ad-hoc, and changing character encoding detection rules
have been a long-standing concern of mine, and I have tried to reach
out to others to improve the situation, e.g. in
http://www.unicode.org/mail-arch/unicode-ml/y2010-m10/0003.html
It is not too much to ask of the Applications Area Working Group to do
the same.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/