Re: [Lucid] FW: [mark@macchiato.com: Re: Non-normalizable diacritics - new property]
John C Klensin <john-ietf@jck.com> Thu, 19 March 2015 08:07 UTC
Date: Thu, 19 Mar 2015 04:07:15 -0400
From: John C Klensin <john-ietf@jck.com>
To: Shawn Steele <Shawn.Steele@microsoft.com>
Archived-At: <http://mailarchive.ietf.org/arch/msg/lucid/zQypaW7l3D-TH4mz-393WTucPeU>
Cc: lucid@ietf.org, Andrew Sullivan <ajs@anvilwalrusden.com>
--On Thursday, March 19, 2015 04:31 +0000 Shawn Steele <Shawn.Steele@microsoft.com> wrote:

>>> No, even all NFC or NFKC would be 100% unique to the machine
>
>> This is either tautologically true, or false. Certainly we
>> learned with IDNA2003 that NFKC doesn't work, because while
>> it's good for increasing match probability the identifiers
>> aren't stable. So when they're handed around through
>> different environments, stuff happens that is bad.
>
> I said "to the machine". They're "just numbers", and the
> NFC/NFKC rules are mathematical. Yes, you do have to exclude
> unassigned code points as those don't have defined behavior,
> however "Assigned NFC or NFKC for defined code points" would
> be 100% unique to the machine.

Shawn,

A few observations, although we may need to agree to disagree.

First, saying "NFC/NFKC rules" tends to obscure the issue here. An important property of NFC/NFD is reversibility, i.e., a dual relationship in the mathematical sense. You may not be able to recover whatever the original form was, but you can go back and forth between the two normalized forms without loss of information. By contrast, NFKC (and NFKD) are potentially information-losing, and that loss is significant (although more so for some sets of compatibility equivalences than others).

>...
> I'm not trying to go from "this can't be perfect" to "we
> shouldn't try". I am trying to say that this is good enough.
> The additional cost of trying doesn't add value.

Sorry. I think you can argue that it doesn't add enough value to be worth the trouble. I disagree with that given an appropriate definition of the problem, and believe we don't agree on that definition. But "doesn't add value"... well, we disagree.

>...
>> I think speculating about the anthropological facts here is
>> going to lead us to grief. Let's stick with a domain of
>> discourse we know well.
>
> IDN is a sociological exercise.
> If the need was purely scientific/mathematical, then we'd only
> need a bunch of numbers or opaque IDs. In order to make much
> progress here I think we need to think about how people use
> them.

At one level, I agree about "sociological exercise". I would normally characterize it a bit differently, but let's accept that for the purposes of this note. I would say much the same thing, not just about IDNs, but about almost any human-readable identifier, noting that, as soon as a "bunch of numbers" is expressed as numerals in written forms, there are "sociological" issues of number base and the set of numerals to be used (in Unicode-land and modern writing systems, bound to script for everything but Arabic).

However, maybe there is a different way to characterize this issue that avoids the sociological exercise. I continue to accept that, at least in general, Unicode has a reasonable set of coding principles and that, in general, a reasonable set of decisions has been made given those principles. I firmly believe that a different coding system, with different principles (or priorities among principles), would simply trade one set of issues for another but might otherwise be equally "good".

However, many of those decisions -- about coding, not about characters -- do have alternatives. For example, instead of coding lower and upper case characters differently, one could have dealt with case distinctions by assigning a single code point to the case-insensitive abstract character and then using a qualifier to designate case (either always for one case, or only when anyone cared, for both -- two different coding style distinctions). Similarly, ZWJ and ZWNJ are, in some sense, indicators and artifacts of coding decisions, not "characters": one could have avoided them by assigning separate code points to the characters whose rendering they affect (note that there have been passionate arguments that just that should have been done with some Indic scripts).
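[Not part of the original message: a small sketch, using Python's standard unicodedata module, of why ZWJ/ZWNJ are coding artifacts that normalization cannot smooth over. U+200C (ZERO WIDTH NON-JOINER) has no decomposition of any kind, so strings that may render identically remain distinct under both NFC and NFKC; IDNA2008 deals with the joiner controls through contextual rules (CONTEXTJ in RFC 5892) rather than through normalization.]

```python
import unicodedata

# Illustrative pair: two strings differing only by an invisible
# ZERO WIDTH NON-JOINER (U+200C).
plain = "ab"
with_zwnj = "a\u200cb"

# U+200C has neither a canonical nor a compatibility decomposition,
# so NFC and NFKC both leave it in place: the strings stay distinct
# code point sequences "to the machine".
assert unicodedata.normalize("NFC", with_zwnj) == with_zwnj
assert unicodedata.normalize("NFKC", with_zwnj) == with_zwnj
assert with_zwnj != plain
```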
In a way, decisions about script boundaries are, themselves, coding decisions rather than inherent character properties but, having been vaguely involved with an attempt to define a universal character set that did not depend on such boundaries, that path seems to me to lead to madness. Unicode still could have defined its script boundaries differently, e.g., seeking a higher level of integration (or "unification") across the board but, while I think understanding that as a choice is helpful, pursuing it very far probably is not.

Modulo the script boundary issue, it is (approximately) possible to see all of the "confusion" and "too-similar appearance" problems as human perception issues involving the recipient/viewer (not really "sociological", but the distinction may not be important). Seen that way, the issue here is not about the viewer perception issue but about the coding one -- resolving differences in decisions about how something might have been coded (again and for convenience, within a script). If one goes back a few hundred years, the question might become not whether someone viewing the character would "see" the difference, but whether the calligrapher, or perhaps even someone trying to set a string in cold type, would see an important difference -- also, in a way, a coding decision problem.

Coming back to the issue that started this, had Unicode (followed by the IETF) not deprecated embedded code points to identify languages, U+08A1 could have been coded (using an odd notation whose intent should be obvious) as <lang:fula>U+0628 U+0654</lang> without loss of any phonetic or semantic information, and the justification for adding the code point at all would disappear (at least absent what we generically describe as "politics", in which case the justification for its not decomposing would disappear).
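[Not part of the original message: the U+08A1 situation can be observed directly with Python's unicodedata (assuming an interpreter whose Unicode database is 7.0 or later, i.e., Python 3.5+). ARABIC LETTER BEH WITH HAMZA ABOVE carries no decomposition, so no normalization form ever equates it with the sequence BEH + HAMZA ABOVE; contrast ALEF WITH HAMZA ABOVE, which does decompose and is therefore smoothed over by NFC/NFD.]

```python
import unicodedata

# U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE (added in Unicode 7.0)
# has no decomposition mapping at all...
precomposed = "\u08a1"
sequence = "\u0628\u0654"   # BEH + combining HAMZA ABOVE

assert unicodedata.decomposition(precomposed) == ""
# ...so normalization never equates it with the combining sequence:
assert unicodedata.normalize("NFC", sequence) != precomposed
assert unicodedata.normalize("NFD", precomposed) == precomposed

# Contrast: U+0623 ALEF WITH HAMZA ABOVE canonically decomposes to
# ALEF + HAMZA ABOVE, so NFC treats the sequence as equivalent.
assert unicodedata.normalize("NFC", "\u0627\u0654") == "\u0623"
```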
Similarly, a different CCS could have avoided at least the portion of the "Mathematical" collection that are ultimately Latin or Greek letters in special fonts by different coding conventions that would use the base character plus qualifiers for usage and/or type style. Another example lies in the collection of combining characters that can be used to form precomposed characters that don't decompose, not for, e.g., phonetic reasons but because the definitions of those combining characters don't contain quite enough information (see Section 3.3.2.3 of draft-klensin-idna-5892upd-unicode70-04.txt and its citations). At least in theory, Unicode could have chosen to assign code points to more precisely-defined (as to how they affect base characters) combining characters. A coding system with different principles might have used position and/or size indicator coding with similar effect (an approach that will probably be needed if Unicode ever takes on, e.g., Classic Mayan script).

Now, from that perspective, this issue is about smoothing over (by either some form of equivalence rules or exclusion (or non-inclusion)) differences among character code sequences that are the result of coding decisions (decisions that are at least semi-arbitrary because others could have been made with no "important" information loss -- and, yes, "important" is sometimes debatable without yet other coding decisions). For sequences that compose and decompose symmetrically, NFC (or NFD) normalization does the necessary job. IDNA2008 disallows those mathematical characters as a way to do a different part of the job without making non-reversible compatibility equivalences part of the standard.

As another coding decision matter, all of this would be significantly easier were Unicode consistent about its coding distinctions. Such consistency is likely impossible, at least given other decisions, but that doesn't mean it wouldn't be helpful.
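[Not part of the original message: a quick check, with Python's unicodedata, of the "Mathematical" collection point. NFKC folds a styled mathematical letter down to its plain Latin base, discarding the type-style distinction -- exactly the non-reversible compatibility equivalence the message says IDNA2008 avoids by disallowing such characters instead.]

```python
import unicodedata

# MATHEMATICAL BOLD CAPITAL A is, compatibility-wise, a styled
# LATIN CAPITAL LETTER A.
math_bold_a = "\U0001d400"

# NFC is reversible and leaves it alone; NFKC throws the type-style
# information away and cannot be undone.
assert unicodedata.normalize("NFC", math_bold_a) == math_bold_a
assert unicodedata.normalize("NFKC", math_bold_a) == "A"
```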
However, we have, instead, a combining sequence preference (with exceptions) for Latin but a precomposed character preference for Arabic. We have nearly all precomposed Latin characters decomposing into combining sequences, except for those involving some combining characters. Most European scripts code the abstract graphics in grapheme clusters, but East and South Asian ones use indicators like ZWJ and ZWNJ. There is a strict rule against assigning separate code points to typestyle distinctions, but an exception for some usage contexts such as mathematics and phonetic description. Unicode does not have indicator codes or separate code points for layout or presentation (leaving that to external markup), but such coding has proved necessary for writing systems that are primarily right-to-left and in such cases as non-breaking space. There are no language-dependent or pronunciation-dependent coding distinctions (see Section 2 of draft-klensin-idna-5892upd-unicode70-04.txt and/or Chapter 2 of Unicode 7.0) -- except where there are.

Again, I don't think any of those decisions are "wrong". But they are all problematic for the IETF's language-insensitive, fairly context-free, identifier comparison purposes. And they are, at least IMO, worth some effort because (again, independent of discussions about "confusion"), at least:

(i) We have already established the precedent of dealing with all of the important groups of coding artifacts we knew about when IDNA2008 was under consideration, by adopting normalization rules, DISALLOWing a lot of characters, and even developing special context-dependent rules for some of them.

(ii) When different input methods, using data entry devices that are indistinguishable to the user (e.g., the alphabetic key layouts on a German keyboard for Windows, Linux, and the Mac are the same), will produce different output (stored) strings for the same input, we are dealing with coding artifacts, not "visual confusion".
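[Not part of the original message: the coding-artifact point in (ii) can be checked with a few lines of Python using the standard unicodedata module. A precomposed "ö" key and a dead-key "o" + diaeresis produce different stored sequences, but because composition and decomposition are symmetric here, either NFC or NFD makes comparison insensitive to that coding choice.]

```python
import unicodedata

# One input method stores the precomposed LATIN SMALL LETTER O WITH
# DIAERESIS; a dead-key umlaut stores "o" + COMBINING DIAERESIS.
precomposed = "\u00f6"    # ö, one code point
dead_key = "o\u0308"      # o + combining diaeresis, two code points

assert precomposed != dead_key   # distinct "to the machine"

# The two codings are reversibly equivalent, so normalizing to
# either form eliminates the difference for comparison purposes.
assert unicodedata.normalize("NFC", dead_key) == precomposed
assert unicodedata.normalize("NFD", precomposed) == dead_key
```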
Whether the difference in internal coding is the decision of one system to prefer NFC and that of another to prefer NFD, or the result of one typist using an "ö" key and another deciding to type an "o" and a "dead key" umlaut, we have (and IMO, should have) comparison systems that eliminate those coding differences. This is, from that point of view, just a new set of coding decision differences that neither Unicode nor we arranged to compensate for earlier.

john