
Re: [Ltru] Consensus call: extlang



Peter's addressed some of the questions. Back to your original question.

For backward compatibility, we'll continue to represent Mandarin as "zh", Standard Arabic as "ar", and so on. Note that this is independent of whether extlang is used or not. That is, if extlang exists, we'll treat incoming "zh-cmn" as if it were "zh"; if it doesn't, we'll treat "cmn" as if it were "zh". And under either scenario it is conformant to tag Mandarin as 'zh'.
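As a rough illustration of that normalization (a minimal sketch, not our actual code; the mapping table and the function name are made up for this example, and "arb" is assumed here as the 639-3 code for Standard Arabic):

```python
# Hypothetical table: encompassed languages that have historically been
# tagged with the macrolanguage code. Only two entries, for illustration.
LEGACY_PRIMARY = {"cmn": "zh", "arb": "ar"}


def normalize(tag: str) -> str:
    """Fold 'zh-cmn...' (extlang form) and 'cmn...' (bare form) to 'zh...'."""
    subtags = tag.split("-")
    primary = subtags[0].lower()
    rest = subtags[1:]
    if rest and LEGACY_PRIMARY.get(rest[0].lower()) == primary:
        # extlang form: drop the extlang subtag, keep the legacy prefix
        rest = rest[1:]
    elif primary in LEGACY_PRIMARY:
        # bare 639-3 form: substitute the legacy primary subtag
        primary = LEGACY_PRIMARY[primary]
    return "-".join([primary] + rest)


print(normalize("zh-cmn"))   # zh
print(normalize("cmn-TW"))   # zh-TW
print(normalize("zh-yue"))   # zh-yue (left alone)
```

Either way the outcome is the same, which is why the choice between the two models doesn't change how we represent Mandarin.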

Why? Although 639-3 now specifies clearly that "de" means (for example) just Standard German, whereas "zh" means any Chinese, this clarity of specification was not present earlier. The code "zh" has been used in the past for Mandarin, overwhelmingly so; not just 99% or 99.9%, but many 9's. As you said, the tendency was to use illegal (or private use) codes for non-Mandarin content. All of our internal software and any external software that we talk to will expect Mandarin to be tagged as 'zh' for the foreseeable future. Of course, we recognize that others may end up using 'zh-cmn' / 'cmn', so we're prepared to deal with that.

Note also that really the whole premise of extlang is that 'zh' continues to normally map to Mandarin. After all, if 'zh' really meant that you were as likely to get Gan or Hakka as Mandarin, then having "zh-yue" in order to get some kind of automatic fallback wouldn't make any sense.

Other comments below.

On Thu, May 29, 2008 at 6:04 PM, Broome, Karen <Karen_Broome at spe.sony.com> wrote:
Mark,

One thing I think you aren't acknowledging is that "treat as synonyms" means something very different to the vast numbers of content creators who use this standard than it does the handful of search engines that use the fuzzy logic associated with companion standards. As you note in your document, "It is clear that companies like Google or Yahoo can work around the problems with extlang." How many other users need and can afford to implement the extended fallback and filtering logic? Enough that this logic should be the primary driver behind the chosen solution?

Before I spend too much time picking apart your lengthy screed involving a scenario where the BBC presents its web site in Sudanese Creole Arabic with rotating languages code logic for each day of the week ... (ahem) ... here's my real-world Chinese language list:

Chinese (Variant Unknown)
Chinese (Cantonese, Spoken)
Chinese (Cantonese, Written)
Chinese (Mandarin, Spoken)
Chinese (Mandarin, Spoken Taiwanese)
Chinese (Mandarin, Simplified)
Chinese (Mandarin, Traditional)
Chinese (Taiwanese, Spoken)
Chinese (Taiwanese, Written)

Sorry you consider it a screed. I realize that the emails have sometimes gotten heated -- email really is a poor substitute for audio discussions on controversial issues; I've seen many, many issues in Unicode and other standards flare for months in email, and be resolved in a few hours of discussion.

My real point is that if a query for 'ar' really means "give me any kind of Arabic", then a query for 'ar' would be almost meaningless, since it could return any of a number of mutually incomprehensible alternatives. Although 639-3 now defines it to be "any Arabic", in practice what users will expect to get back is Standard Arabic, and they would be unpleasantly surprised to get back other varieties. And our purpose should be to avoid our users' getting unpleasant surprises.
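To make that concrete, here is a small sketch of prefix-style filtering in the spirit of RFC 4647 basic filtering (the function name and the sample tags are assumptions for illustration; "arz" is assumed as the 639-3 code for Egyptian Arabic):

```python
def basic_filter(language_range, tags):
    """Return the tags that equal the range or extend it with '-subtag'."""
    prefix = language_range.lower()
    if prefix == "*":
        return list(tags)
    return [
        t for t in tags
        if t.lower() == prefix or t.lower().startswith(prefix + "-")
    ]


content = ["ar", "ar-EG", "ar-arz", "arz", "en"]
print(basic_filter("ar", content))  # ['ar', 'ar-EG', 'ar-arz']
```

Under the extlang form, the 'ar' query pulls in 'ar-arz' along with everything else; under the bare form, 'arz' is not returned at all. Neither outcome tells the user which items are Standard Arabic.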
 


(Apologies, this is hard to represent in ASCII. I have a mini-spreadsheet if someone wants it.)


   1             2             3           4
a. zh            zh            zh          zh
b. zh-yue        yue           yue         yue
c. zh-yue        yue           yue         yue
d. zh-cmn        cmn           zh          cmn
e. zh-cmn-TW     cmn-TW        zh-TW       cmn-TW
f. zh-cmn-Hans   cmn-Hans      zh-Hans     zh-Hans
g. zh-cmn-Hant   cmn-Hant      zh-Hant     zh-Hant
h. zh-min-nan    nan           nan         nan
i. zh-min-nan    nan           nan         nan

above modified slightly to add row references. 



* Option #1 (RFC 4646) contains the codes as I have them today.
Note that this is not actually RFC4646 conformant: zh-cmn-TW is not valid.

* Option #2 (RFC 4646bis) contains the codes if I choose to go against the grain and use "cmn".

* Option #3 (RFC 4646bis) treats "zh" and "cmn" as synonyms; avoids using "cmn" for compatibility.

* Option #4 (RFC 4646bis) contains the codes "cmn" for spoken context (where distinction is essential) and "zh" for written context.

Comments:

* Option #1 is unambiguous and shows that there is a relationship between these languages. It also preserves the legacy "zh" tag so developers that aren't hip to later versions of BCP 47 or 639-3 will have some idea what these tags mean. The tags are maybe longer than they need to be, but if I need a fixed-length tag, I can wait for 639-6. The languages may not be mutually intelligible in some contexts, but they are related.

* Option #2 is unambiguous, but Microsoft, Google, and Amazon won't be using the same tags for Chinese that I do. Even if I don't follow their lead, others likely will. This worries me. Also, the rules for #2 must include fuzzy guidelines such as, "use the 'zh' tag except when you think it's a bad idea" and "use the shortest tag except when you don't want to." This presents complications in trying to explain some sort of consistent method to the LTRU madness to others. Given this, I start to wish ISO 639-6 a safe and speedy passage.

* Option #3 is what I believe you might suggest, but for me, that's the worst list of all. There are five ambiguous "zh" categories on that list. It follows the "always use the shortest tag" rule and respects history, but it's useless to me from an identification perspective.

Your list is already ambiguous for columns 1 and 2; you are using "yue" for two different things (written and spoken). The only change it really makes is that you don't have a term for "any chinese".

RFC 4646 lacks terms for many, many combinations of things: a term for "any german" (including de, gsw, ...), "any french", "any scandinavian", or any one of the countless other possible sets of languages that people consider to be important for some particular purpose. That's why lists of languages are really the appropriate vehicle.



* Option #4 has three ambiguous tags and means I have to explain to people who aren't in this industry why I use different tags for the same language. This strategy is less ambiguous than #3, but I'm not sure I can explain it to other content creators, for the same reasons as #2, and it presents the spoken/written complication others may not want. In the long run, this seems messy and unclear enough that it will result in bad tagging.

* Options #2,3,4: In general, it worries me that RFC 4646bis offers so many "preferred" options for the same thing. I really can't see how this simplifies things for anyone.

I don't have a need for fuzzy fallback scenarios. I need precise tags and mostly simple lookup. I think if you take the fallback scenarios and absurdities out of the document you reference, I don't think there's much left.

The only purpose I have heard for extlang *is* for fallback; that's why the document goes into (painful) depth on that topic. For identification alone, "cmn" and "zh-cmn" really mean just the same thing. It is only in the context of matching (filtering and lookup) that they differ in semantics *because of their behavior*: where "cmn" means simply Mandarin, "zh-cmn" effectively means "Mandarin, but fall back to any Chinese".
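That difference in behavior can be sketched with a truncation-style lookup along the lines of RFC 4647 (a simplified sketch; the function name and sample lists are assumptions for illustration, and this is not a complete implementation of the RFC):

```python
def lookup(requested, available, default=None):
    """Truncate the requested tag from the end until something matches."""
    supported = {t.lower(): t for t in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in supported:
            return supported[candidate]
        subtags.pop()
        # A single-character subtag left dangling at the end goes too.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


available = ["zh", "en"]
print(lookup("zh-cmn", available))  # zh
print(lookup("cmn", available))     # None
```

The "zh-cmn" request reaches "zh" purely through truncation; the "cmn" request finds nothing unless the implementation carries extra knowledge that "cmn" is encompassed by "zh".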


Regards,

Karen Broome




>-----Original Message-----
>From: ltru-bounces at ietf.org [mailto:ltru-bounces at ietf.org] On Behalf
>Of Mark Davis
>Sent: Thursday, May 29, 2008 4:00 PM
>To: debbie at ictmarketing.co.uk
>Cc: LTRU Working Group
>Subject: Re: [Ltru] Consensus call: extlang
>
>What would be useful is to hear from the extlangistas what their
>concerns are specifically; many have not given reasons for favoring
>encompassed languages into extlang instead of into the primary
>language subtag. It would be useful for them to give the scenarios
>where they think extlang is an improvement. It would be useful to
>find out why they think the scenarios such as in
>http://docs.google.com/Doc?docid=dfqr8rd5_676kxxxjhd&hl=en are not a
>problem.
>
>Clearly people think that using the extlang model solves more
>problems than it causes, so it would be useful to examine specific
>cases and see if that is, in fact, true.
>
>
>Mark




--
Mark
_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www.ietf.org/mailman/listinfo/ltru

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.