Re: [Ltru] going back to the roots to find a solution to "zh"
Not so sure that we will get accuracy; frankly I think the introduction of cmn, arb, etc. was well intentioned but, in retrospect, a mistake. There was no reason not to treat zh, ar, etc. just like we treated de. There is no substantive difference between zh:yue and de:gsw.
Given the existence and long usage of zh in IT, the introduction of cmn (etc.) will inevitably end up with mistagged data, mishandled lookups, and unsatisfied users, unless one follows this strategy: treat cmn as a synonym for zh unless you have a strong reason not to. (In fact, it is simplest to remap cmn-* to zh-* on input.)
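A minimal sketch of that input remapping, in Python. The function name remap_on_input and the table MACRO_REMAP are hypothetical, and only the cmn -> zh entry comes from this message; whether other codes belong in the table is a separate judgment call.

```python
# Hypothetical remap table. Only cmn -> zh is stated in the message above.
MACRO_REMAP = {"cmn": "zh"}


def remap_on_input(tag: str) -> str:
    """Rewrite the primary language subtag if it has a remap entry.

    Everything after the primary subtag (script, region, variants)
    is passed through unchanged.
    """
    subtags = tag.replace("_", "-").split("-")
    primary = subtags[0].lower()
    subtags[0] = MACRO_REMAP.get(primary, primary)
    return "-".join(subtags)


if __name__ == "__main__":
    for incoming in ("cmn", "cmn-Hans-CN", "zh-Hant", "de-CH"):
        print(incoming, "->", remap_on_input(incoming))
```

Doing this once at the input boundary means that stored data and lookups only ever see zh-*, so no later matching step needs to know about cmn.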
[Our best course of action for compatibility and interoperability would be to deprecate cmn, arb, and the few others in that section, but I doubt that we could get consensus on that, so I'd never brought it up.]
Mark
From earlier, for context:
The situation is messy, and part of the problem is that ISO 639 was
dramatically underspecified. My interpretation, before I got into this
arena, was that when the standard says X, it was short for "Standard X".
Thus it was perfectly reasonable to make the following change:
de - German (= Standard German, Hochdeutsch)
//later add
gsw - Swiss German
frk - Frankish German
sxu - Saxon German
...
The code de really meant Standard German, so adding a new code gsw is fine.
It is a different language, so we are not narrowing de. We would NOT
make de a macrolanguage and add gsw and then *hde (Standard German).
We recognize that past Swiss German content was tagged with de since
that was the 'best' available code, but we recommend that people switch.
I was really surprised to see a contrary tack taken with ar. I would have expected it to behave just like German.
ar - Arabic (= Standard Arabic)
//later add
pga - Sudanese Creole Arabic, etc.
But *don't* add arb = Standard Arabic, because we already have one (ar = ara).
Water under the bridge, but had this approach been taken with Arabic,
Chinese, and so on, users of the standard would not be faced with a
difficult situation.
On Tue, May 6, 2008 at 11:30 AM, Randy Presuhn <randy_presuhn at mindspring.com> wrote:
Hi -
As a technical contributor...
> From: "Peter Constable" <petercon at microsoft.com>
> To: "LTRU Working Group" <ltru at ietf.org>
> Sent: Tuesday, May 06, 2008 10:13 AM
> Subject: [Ltru] going back to the roots to find a solution to "zh"
...
> Of course, that's the general problem we're facing: we must find a solution
> to the dual usage of "zh" or abandon any possibility of allowing "zh" to have
> its generic meaning, yet existing usage seems to imply that the latter isn't
> an option, so a solution to the dual usage is essential - but we're at a loss
> as to how to solve it.
...
This points to the "soft underbelly" of "tag wisely" - the assumption that
the tagger can reasonably anticipate how the "consumers" of the tag
will want to use that information.
In retrospect, I think it would have been better to have taken the route of
zh -> some kind of Chinese, likely (but not guaranteed) to be Mandarin
zh-cmn -> Chinese, specifically Mandarin
ar -> some kind of Arabic, likely (but not guaranteed) to be Standard Arabic
ar-arb -> Arabic, specifically Standard Arabic
de -> some kind of German, overwhelmingly likely (but not guaranteed) to be Hochdeutsch
de-*hde -> German, specifically Hochdeutsch
Even recognizing that reasoning about "language X is some kind of Y" can be
horribly fuzzy, this would still be better aligned with a "principle of least
astonishment" for folks trying to understand the specification, trying to tag data,
or trying to formulate a query. Yes, huge amounts of data might end up being tagged
less precisely than we might like, but at least they'd still be tagged accurately.
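To illustrate the query side of this, here is a rough sketch of prefix matching in the style of RFC 4647 basic filtering. The message does not name a matching scheme, so that choice is an assumption, and basic_filter_match is a hypothetical helper name. Under the route described above, a query for "zh" still finds content tagged zh-cmn, whereas content tagged with a bare cmn is missed.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Prefix match in the style of RFC 4647 basic filtering.

    A range matches a tag if they are equal (ignoring case) or if the
    range is a prefix of the tag ending exactly at a '-' boundary.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


if __name__ == "__main__":
    for tag in ("zh", "zh-cmn", "cmn", "de-CH"):
        print("zh matches", tag, ":", basic_filter_match("zh", tag))
```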
Randy
--
Mark
_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www.ietf.org/mailman/listinfo/ltru
Note Well: Messages sent to this mailing list are the opinions
of the senders and do not imply endorsement by the IETF.