Re: [apps-discuss] I-D Action: draft-ietf-appsawg-xml-mediatypes-05.txt

Bjoern Hoehrmann <derhoermi@gmx.net> Tue, 19 November 2013 15:03 UTC

From: Bjoern Hoehrmann <derhoermi@gmx.net>
To: ht@inf.ed.ac.uk
Date: Tue, 19 Nov 2013 16:03:00 +0100
Message-ID: <q8nm895dap8iefa6srlf1k5787j8fuc6n4@hive.bjoern.hoehrmann.de>
References: <20131119120919.12901.59046.idtracker@ietfa.amsl.com> <f5b1u2cr365.fsf@troutbeck.inf.ed.ac.uk>
In-Reply-To: <f5b1u2cr365.fsf@troutbeck.inf.ed.ac.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 8bit
Cc: apps-discuss@ietf.org
Subject: Re: [apps-discuss] I-D Action: draft-ietf-appsawg-xml-mediatypes-05.txt
Precedence: list

* Henry S. Thompson wrote:
>My reasoning for not changing 3.6 in response to Bjoern Hoehrmann's
>objection to the treatment of a BOM as authoritative even in the
>presence of a charset parameter [1] is set out in [2].

I said that I think this is a serious and problematic change that needs
wide review prior to a two-week Last Call period and that the change is
fine by me provided there is evidence of wide review. It would be easy,
for instance, to ask on a set of mailing lists, say

  * ietf-charsets@iana.org
  * www-international@w3.org
  * unicode@unicode.org

for endorsements for one part of the proposal which would require XML
implementations to treat

  data:application/xml;charset=utf-32,%FF%FE%00%00...

as a malformed UTF-16 encoded document, and if the question is not
misleading and there are endorsements by qualified individuals, then
I'd be satisfied that this part of the proposal has received adequate
review. You could also point me to messages in the archive of this list
where such individuals have provided a rationale for their support of
this change.
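
To make the two readings of that byte sequence concrete, here is a minimal
sketch using only Python's standard codecs. The payload (the four bytes
above followed by "<a/>" in UTF-32LE) is a hypothetical stand-in for the
elided rest of the data: URI, and the code illustrates the ambiguity only;
it is not taken from the draft or from any XML parser.

```python
# Hypothetical payload: FF FE 00 00 followed by "<a/>" encoded as UTF-32LE.
payload = b"\xff\xfe\x00\x00" + "<a/>".encode("utf-32-le")

# Reading 1: FF FE 00 00 is a UTF-32 little-endian BOM.
# Python's "utf-32" codec consumes the BOM and decodes the rest.
as_utf32 = payload.decode("utf-32")

# Reading 2: FF FE is a UTF-16 little-endian BOM.
# Python's "utf-16" codec consumes FF FE; the next code unit is 00 00.
as_utf16 = payload.decode("utf-16")

print(repr(as_utf32))
print(repr(as_utf16[:2]))
```

Under the first reading the payload decodes to "<a/>"; under the second
the decoded text begins with U+0000, which is not a legal XML character.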

I note that the "Changes from RFC 3023" appendix of the document does
not mention the changes in question and you need to have a good grasp
of the issues to notice them when carefully reading the document. If
the document had carefully explained the changes where appropriate, for
instance with a note like the following for the issue above, which is
just one among several,

  NOTE: While appendix F.1 of the XML 1.0 Recommendation suggests that
  an initial byte sequence of FF FE 00 00 is indicative of the UTF-32
  character encoding, the rules of this specification require such a
  sequence to be interpreted as a UTF-16 Byte Order Mark followed by a
  U+0000 character, rendering the document malformed. It is therefore
  impossible to use UTF-32 in application/xml, ... entities; a charset
  parameter cannot be used to prevent misinterpretation of documents
  encoded in UTF-32 when using the media types defined in this memo.

then there might be a plausible claim that people knew of this change and
some of its implications and signed off on it. The document does not
have that, so I find it reasonable to believe that people are unaware
of these changes, have not considered their implications, and possibly
would object to them if they did. I can't even tell myself whether a
note such as the one above accurately reflects what you mean to propose.

I do not think it would be useful to continue the discussion between
only the two of us, and I find your misrepresentation of my position
most unhelpful. I do appreciate that you have now started to gather
some information on what running code actually does, but let's have a
look at your claims:

  Expat is a surprising case -- it provides a parameter which can be
  used to pass in a character encoding name, but it will ignore this if
  it detects a different encoding in its input byte-stream (tested via
  both the Perl and Python embeddings, and confirmed by examining the
  source).  So in fact expat does explicitly to treat a BOM as
  authoritative.

Well then let us examine the source:

  /* This is what detects the encoding.  ...
  */

  static int
  initScan(const ENCODING * const *encodingTable, 
  ...
      case 0xEFBB:
        /* Maybe a UTF-8 BOM (EF BB BF) */
        /* If there's an explicitly specified (external) encoding
           of ISO-8859-1 or some flavour of UTF-16
           and this is an external text entity,
           don't look for the BOM,
           because it might be a legal data.
        */

What does that suggest about expat always looking for a BOM that
overrides anything else? Or how about this claim:

  (I'm not ignoring your reminder that you have built an add-on to
  perl's HTML::Parser module which treats the charset as authoritative,
  but since that module does not qualify as a conformant XML parser in
  any case, it's not really relevant to 3023bis).

The W3C Markup Validator uses my HTML::Encoding module to detect the
character encoding of HTML and XHTML documents. XHTML documents use
whatever rules there are for XML documents to detect the encoding. It
is an implementation of "Given an HTTP response, what is the encoding
of the XML document in it", which is what most of the draft is about.
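
As a rough illustration of that question, the sketch below collects the two
pieces of evidence under discussion: the charset parameter of a
Content-Type header and any BOM at the start of the body. It is not how
HTML::Encoding or expat works; the function names are hypothetical, and it
deliberately does not decide which piece of evidence wins, since that is
the disputed point.

```python
import codecs
from email.message import Message

# BOM signatures; the 4-byte UTF-32 marks are tested before the
# 2-byte UTF-16 marks because FF FE is a prefix of FF FE 00 00.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def charset_param(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def bom_encoding(body):
    """Return the encoding suggested by a leading BOM, or None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return None


def encoding_evidence(content_type, body):
    """Report both signals without choosing between them."""
    return {"charset": charset_param(content_type), "bom": bom_encoding(body)}


# EF BB BF is a UTF-8 BOM, but it is also legal ISO-8859-1 data.
print(encoding_evidence('application/xml; charset="ISO-8859-1"',
                        b"\xef\xbb\xbf<a/>"))
```

The last call shows the case the expat comment quoted above worries about:
the two signals disagree, and the bytes EF BB BF are also valid content
under the declared ISO-8859-1 charset.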

Inconsistent, ad-hoc, and changing character encoding detection rules
have been a long-standing concern of mine, and I have tried to reach out
to others to improve the situation, e.g. in
http://www.unicode.org/mail-arch/unicode-ml/y2010-m10/0003.html
It is not too much to ask of the Applications Area Working Group to do
the same.
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
