RE: [Ltru] Re: Proposed Text for Moving Forward



I agree with you 100%, Doug, because I think that we have it 180 degrees reversed.

The problem with default script, as you point out, is that it puts the registry maintainers in the position of having to adjudicate a language's script. "en" is a very poor test case because it is nowhere near an edge case.

The problem is that registering a default script for a language is like trying to prove a negative. Let's pick on Serbian or Azerbaijani for a minute. Let's say someone tried to register a default script for one of these. You could point to all of the sr-Cyrl or az-Latn texts you want. That doesn't prove the non-existence or non-importance of other scripts for that language. I'm concerned that later registrations might cause a need for general retagging and that is BAD.

For the registry process to have meaning, I think that we should stick to what is actually demonstrable: you can register the fact that a language is commonly written in two or more scripts because you can demonstrate that. This would make the rules and design simpler and *also* make the entries more robust. The rules would then be:

1. Script subtags SHOULD NOT be used to form a language tag unless they add some distinguishing information. For example, the 'Latn' subtag is generally unnecessary with the primary language 'en' because nearly all English documents are written in Latin script.

2. The script subtag SHOULD always be used to form a language tag when the script of the tagged content matches a 'Required_Script' field for the associated primary language. For example, you should use the 'Hant' script subtag to form the language tag "zh-Hant-TW", even though most Chinese language documents in Taiwan are written in the Traditional Chinese script. The 'Required_Script' field generally appears in primary languages that are written in more than one script and which might otherwise be ambiguous.
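A minimal sketch of how a tag generator might apply these two rules, in Python. The table contents and the function names are hypothetical: only the 'Required_Script' field name and the pairing of 'zh' with 'Hans' and 'Hant' come from this thread, and the 'sr' and 'az' entries are assumptions added for illustration. Because rule 1 depends on a judgment the registry would not record, the sketch leaves that judgment to the caller as a flag. The second function is a simplified left-prefix matcher of the kind discussed in the quoted message below, included only to show why suppressing 'Latn' keeps "en-US" matching.

```python
# Hypothetical excerpt of registry data. Only the 'Required_Script' field name
# and the zh -> Hans/Hant pairing come from this thread; the sr and az entries
# are illustrative assumptions.
REQUIRED_SCRIPT = {
    "zh": {"Hans", "Hant"},
    "sr": {"Cyrl", "Latn"},
    "az": {"Cyrl", "Latn"},
}


def form_tag(language, script=None, region=None, script_is_distinguishing=False):
    """Build a language tag, applying rules 1 and 2 to the script subtag."""
    subtags = [language]
    if script is not None:
        required = script in REQUIRED_SCRIPT.get(language, set())
        # Rule 2: keep a script listed in Required_Script for this language.
        # Rule 1: otherwise keep it only if the caller says it adds information.
        if required or script_is_distinguishing:
            subtags.append(script)
    if region is not None:
        subtags.append(region)
    return "-".join(subtags)


def left_prefix_match(language_range, tag):
    """True if the range equals the tag or is a prefix of it ending at a '-'."""
    language_range, tag = language_range.lower(), tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


print(form_tag("en", "Latn", "US"))   # en-US
print(form_tag("zh", "Hant", "TW"))   # zh-Hant-TW
print(form_tag("en", "Brai", "US", script_is_distinguishing=True))  # en-Brai-US
print(left_prefix_match("en-US", "en-Latn-US"))                     # False
print(left_prefix_match("en-US", form_tag("en", "Latn", "US")))     # True
```

Note that nothing in the sketch needs a default script: the generator only consults the list of scripts that are demonstrably in use for a language.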

IOW, I added script suppression as a sop to default script, a concept that is easy to understand and seems desirable, but which may be difficult to maintain in practice (see all the deprecated primary language registrations for examples of the registration process being used in unproductive ways). 

Best Regards,

Addison

Addison P. Phillips
Globalization Architect, Quest Software
Chair, W3C Internationalization Core Working Group

Internationalization is not a feature.
It is an architecture. 

> -----Original Message-----
> From: ltru-bounces at lists.ietf.org [mailto:ltru-bounces at lists.ietf.org] On
> Behalf Of Doug Ewell
> Sent: Sunday 17 April 2005 01:05
> To: LTRU Working Group
> Subject: [Ltru] Re: Proposed Text for Moving Forward
> 
> Addison Phillips <addison dot phillips at quest dot com> wrote:
> 
> > I agree, Mark, that the full effect can be achieved with only one
> > field and that your proposal is superior in a number of regards (fewer
> > moving parts, ease of maintenance, ease of application).
> >
> > I proposed two, though, for a reason. One of the objections was that
> > we didn't document when a particular script really ought to be used
> > (i.e. that you really should start to use zh-HanX-XX in preference to
> > zh-XX).
> 
> I know I said I would back off and not get involved in the
> default-script issue.  OK, so I lied.  Sorry about that.
> 
> AFAICT, this whole issue started with the concern that people would use
> a script subtag in cases where it was generally thought to be (a)
> unnecessary, because the intended script would be obvious, and (b)
> undesirable, because it would interfere with left-prefix (RFR) matching.
> 
> The standard example was "en-Latn-US."  The case was made that the
> overwhelming majority of written U.S. English text is written in the
> Latin script, so the added flexibility of being able to specify the
> script would be largely unnecessary, and in particular it would be
> overshadowed by the inability of existing left-prefix matching
> algorithms to match "en-Latn-US" with "en-US" (sometimes generalized to
> "broken backward compatibility").
> 
> This was the foundation of "default script": certain languages like
> English could be listed as having a default script of Latin, so that tag
> generators could avoid creating tags like "en-Latn" or "en-Latn-US"
> whose disadvantages would outweigh their advantages.
> 
> Of course, in certain circumstances you might have English written in
> Braille, or even in Cyrillic, and most if not all seemed to agree that
> in these rare circumstances it would be acceptable to generate
> "en-Brai-whatever" or "en-Cyrl-whatever."
> 
> The standard counterexamples were "zh-Hans" and "zh-Hant."  The case was
> made that Chinese is commonly written in both of these script variants,
> and it would often be beneficial to include script information, to the
> point of perhaps being more important than strict compatibility with
> left-prefix matching algorithms.  Languages like Chinese and Azerbaijani
> and Serbian, after all, were the major use cases for the introduction of
> script subtags in the first place.
> 
> So unlike "en-Latn", it would not be discouraged to write "zh-Hans" or
> "zh-Hant", or either of these followed by a region subtag.  Of course,
> just like English, Chinese could also be written in a less obvious
> script like Braille or Cyrillic, and so "zh-Brai" and "zh-Cyrl" ought to
> be allowed as well.
> 
> I concede that because of the stated predominance of processes that use
> left-prefix matching, it might be beneficial to define a default script
> for common languages that are written in a single script 99.9% of the
> time.  I still don't know where the authority comes from to decide which
> languages and which scripts get marked in this way -- definitely not
> from ISO or documented registrations or deterministic rules, like
> everything else in the registry -- but I assume that would be worked out
> in due course.
> 
> What I do NOT understand is how this has expanded to telling people when
> they SHOULD use script subtags, and how the set of allowable subtags
> should be limited in some way.
> 
> There may well be cases where "zh" or "zh-CN" or "zh-TW" is all that is
> needed, and there is certainly existing data that uses such tags.  I
> don't see any justification for discouraging such usage, even if we have
> defined a way to tag Chinese data more precisely.  Likewise, if there is
> no prohibition against writing "en-Brai" or "en-Cyrl", then I see no
> reason to prohibit or discourage "zh-Brai" or "zh-Cyrl" either.  A
> "required-script" field would do exactly this, by listing 'Hans' and
> 'Hant' but not others.
> 
> This is too prescriptive.  It tells people how they SHOULD tag data, not
> just in terms of "tag content wisely" or "don't be excessively precise,"
> but on a specific language-by-language basis.  It assumes, implicitly,
> that this group or ietf-languages has the expertise and authority to
> make this judgment.  Unlike default-script, required-script does nothing
> to solve the left-prefix matching problem, and as such, I don't think
> it's within the scope of the charter.
> 
> I propose the following:
> 
> 1.  An optional, informative default-script field that would suggest to
> tag generators that they not use that particular script subtag together
> with that particular language subtag.  This field could be added,
> changed, or removed at any time.  (It doesn't matter much what the field
> is called, and I renew my suggestion that we not try to inject too much
> deep meaning into the names of fields, or assume that users will derive
> deep meaning from them.)
> 
> 2.  NO requirement within the draft that tag generators "must not" use
> script subtags in any given scenario.  The text in draft-01 that
> discourages the use of a script subtag "unless it conveys additional
> information" should be adequate.
> 
> 3.  NO mechanism to tell tag generators that they "should" use a script
> subtag together with any particular language subtag, and *especially*
> not one that lists the "expected" script subtags while excluding others.
> If tag generators opt to create a tag such as "zh-TW" that "may be
> ambiguous without script information," that should be up to them.
> 
> -Doug Ewell
>  Fullerton, California
>  http://users.adelphia.net/~dewell/
> 
> 
> 
> _______________________________________________
> Ltru mailing list
> Ltru at lists.ietf.org
> https://www1.ietf.org/mailman/listinfo/ltru

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.