Re: [Ltru] Consensus call: extlang



Mark Davis wrote:
> Peter's addressed some of the questions. Back to your original question.
>
> For backward compatibility, we'll continue to represent Mandarin as 
> "zh", Standard Arabic as "ar", and so on. Note that this is 
> independent of whether extlang is used or not. That is, if extlang 
> exists, we'll treat incoming "zh-cmn" as if it were "zh"; if it 
> doesn't, we'll treat "cmn" as if it were "zh". And under either 
> scenario it is conformant to tag Mandarin as 'zh'.
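
A minimal sketch of the normalization described above, assuming a small hand-written table of equivalents. The helper name `to_legacy_tag` and the table `LEGACY_EQUIVALENTS` are hypothetical and do not come from any implementation mentioned in this thread; "arb" is the ISO 639-3 code for Standard Arabic.

```python
# Hypothetical table: encompassed-language code -> legacy tag it is treated as.
LEGACY_EQUIVALENTS = {
    "cmn": "zh",  # Mandarin is still represented as "zh"
    "arb": "ar",  # Standard Arabic is still represented as "ar"
}


def to_legacy_tag(tag: str) -> str:
    """Map 'zh-cmn...' (extlang form) or 'cmn...' (bare form) to 'zh...'."""
    subtags = tag.split("-")
    primary = subtags[0].lower()
    rest = subtags[1:]
    if rest and LEGACY_EQUIVALENTS.get(rest[0].lower()) == primary:
        # Extlang form: drop the redundant second subtag ("zh-cmn" -> "zh").
        rest = rest[1:]
    elif primary in LEGACY_EQUIVALENTS:
        # Bare form: replace the primary subtag ("cmn" -> "zh").
        primary = LEGACY_EQUIVALENTS[primary]
    return "-".join([primary] + rest)


for incoming in ("zh-cmn", "cmn", "cmn-Hans", "zh-yue"):
    print(incoming, "->", to_legacy_tag(incoming))
```

Tags that are not in the table, such as "zh-yue", pass through unchanged, so the same function works whether or not extlang is adopted.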
>
> Why? Although 639-3 now specifies clearly that "de" means (for
> example) just Standard German while "zh" means any Chinese, this
> clarity of specification was not present earlier. The 
> code "zh" has been used in the past for Mandarin, overwhelmingly so; 
> not just 99% or 99.9%, but many 9's. As you said, the tendency was to 
> use illegal (or private-use) codes for non-Mandarin content. All of 
> our internal software and any external software that we talk to will 
> expect Mandarin to be tagged as 'zh' for the foreseeable future.  Of 
> course, we recognize that others may end up using 'zh-cmn' / 'cmn', so 
> we're prepared to deal with that.
>
>     Note also that really the whole premise of extlang is that 'zh'
>     continues to normally map to Mandarin. After all, if 'zh' really
>     meant that you were as likely to get Gan or Hakka as Mandarin,
>     then having "zh-yue" in order to get some kind of automatic
>     fallback wouldn't make any sense.
>
>
> Other comments below.
>
> On Thu, May 29, 2008 at 6:04 PM, Broome, Karen 
> <Karen_Broome at spe.sony.com> wrote:
>
>     Mark,
>
>     One thing I think you aren't acknowledging is that "treat as
>     synonyms" means something very different to the vast numbers of
>     content creators who use this standard than it does to the handful of
>     search engines that use the fuzzy logic associated with companion
>     standards. As you note in your document, "It is clear that
>     companies like Google or Yahoo can work around the problems with
>     extlang." How many other users need and can afford to implement
>     the extended fallback and filtering logic? Enough that this logic
>     should be the primary driver behind the chosen solution?
>
>     Before I spend too much time picking apart your lengthy screed
>     involving a scenario where the BBC presents its web site in
>     Sudanese Creole Arabic with rotating languages code logic for each
>     day of the week ... (ahem) ... here's my real-world Chinese
>     language list:
>
>     Chinese (Variant Unknown)
>     Chinese (Cantonese, Spoken)
>     Chinese (Cantonese, Written)
>     Chinese (Mandarin, Spoken)
>     Chinese (Mandarin, Spoken Taiwanese)
>     Chinese (Mandarin, Simplified)
>     Chinese (Mandarin, Traditional)
>     Chinese (Taiwanese, Spoken)
>     Chinese (Taiwanese, Written)
>
>
> Sorry you consider it a screed. I realize that the emails have 
> sometimes gotten heated -- email really is a poor substitute for audio 
> discussions in controversial issues; I've seen many, many issues in 
> Unicode and other standards flare for months in email, and be resolved 
> in a few hours of discussion.

Which makes me think that a few hours' meeting, maybe even face-to-face, 
between the two of you and potentially others, could be very helpful.

Felix

>
> My real point is that if a query for 'ar' really means "give me any 
> kind of Arabic", then a query for 'ar' would be almost meaningless, 
> since it could return any of a number of mutually incomprehensible 
> alternatives. Although 639-3 now defines it to be "any Arabic", in 
> practice what users will expect to get back is Standard Arabic, and 
> they would be unpleasantly surprised to get back other varieties. And 
> our purpose should be to avoid our users' getting unpleasant surprises.
>  
>
>
>
>     (Apologies, this is hard to represent in ASCII. I have a
>     mini-spreadsheet if someone wants it.)
>
>
>        1             2             3           4
>     a. zh            zh            zh          zh
>     b. zh-yue        yue           yue         yue
>     c. zh-yue        yue           yue         yue
>     d. zh-cmn        cmn           zh          cmn
>     e. zh-cmn-TW     cmn-TW        zh-TW       cmn-TW
>     f. zh-cmn-Hans   cmn-Hans      zh-Hans     zh-Hans
>     g. zh-cmn-Hant   cmn-Hant      zh-Hant     zh-Hant
>     h. zh-min-nan    nan           nan         nan
>     i. zh-min-nan    nan           nan         nan
>
>
> above modified slightly to add row references. 
>
>
>
>
>     * Option #1 (RFC 4646) contains the codes as I have them today.
>
> Note that this is not actually RFC4646 conformant: zh-cmn-TW is not valid.
>
>
>     * Option #2 (RFC 4646bis) contains the codes if I choose to go
>     against the grain and use "cmn".
>
>
>     * Option #3 (RFC 4646bis) treats "zh" and "cmn" as synonyms;
>     avoids using "cmn" for compatibility.
>     * Option #4 (RFC 4646bis) contains the codes "cmn" for spoken
>     context (where distinction is essential) and "zh" for written context.
>
>     Comments:
>
>     * Option #1 is unambiguous and shows that there is a relationship
>     between these languages. It also preserves the legacy "zh" tag so
>     developers that aren't hip to later versions of BCP 47 or 639-3
>     will have some idea what these tags mean. The tags are maybe
>     longer than they need to be, but if I need a fixed-length tag, I
>     can wait for 639-6. The languages may not be mutually intelligible
>     in some contexts, but they are related.
>
>     * Option #2 is unambiguous, but Microsoft, Google, and Amazon
>     won't be using the same tags for Chinese that I do. Even if I
>     don't follow their lead, others likely will. This worries me.
>     Also, the rules for #2 must include fuzzy guidelines such as, "use
>     the 'zh' tag except when you think it's a bad idea" and "use the
>     shortest tag except when you don't want to." This presents
>     complications in trying to explain some sort of consistent method
>     to the LTRU madness to others. Given this, I start to wish ISO
>     639-6 a safe and speedy passage.
>
>     * Option #3 is what I believe you might suggest, but for me,
>     that's the worst list of all. There are five ambiguous "zh"
>     categories on that list. It follows the "always use the shortest
>     tag" rule and respects history, but it's useless to me from an
>     identification perspective.
>
>
> Your list is already ambiguous for columns 1 and 2; you are using 
> "yue" for two different things (written and spoken). The only change 
> it really makes is that you don't have a term for "any chinese".
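
The point about tags shared between rows can be checked mechanically. The sketch below transcribes the table above into a dictionary and lists, for each option, the tags assigned to more than one row; the names `ROWS` and `shared_tags` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Rows a-i of the table above; each tuple holds the tags for options 1-4.
ROWS = {
    "a": ("zh", "zh", "zh", "zh"),
    "b": ("zh-yue", "yue", "yue", "yue"),
    "c": ("zh-yue", "yue", "yue", "yue"),
    "d": ("zh-cmn", "cmn", "zh", "cmn"),
    "e": ("zh-cmn-TW", "cmn-TW", "zh-TW", "cmn-TW"),
    "f": ("zh-cmn-Hans", "cmn-Hans", "zh-Hans", "zh-Hans"),
    "g": ("zh-cmn-Hant", "cmn-Hant", "zh-Hant", "zh-Hant"),
    "h": ("zh-min-nan", "nan", "nan", "nan"),
    "i": ("zh-min-nan", "nan", "nan", "nan"),
}


def shared_tags(option: int) -> dict:
    """Return {tag: [rows]} for tags used by more than one row of an option."""
    rows_by_tag = defaultdict(list)
    for row, tags in ROWS.items():
        rows_by_tag[tags[option - 1]].append(row)
    return {tag: rows for tag, rows in rows_by_tag.items() if len(rows) > 1}


for option in (1, 2, 3, 4):
    print(f"Option #{option}:", shared_tags(option))
```

Under this reading, every option reuses one tag for rows b and c and another for rows h and i, and option #3 additionally uses "zh" for both row a and row d.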
>
> RFC 4646 lacks terms for many, many combinations of things: a term for 
> "any german" (including de, gsw, ...), "any french", "any 
> scandinavian", or any one of the countless other possible sets of 
> languages that people consider to be important for some particular 
> purpose. That's why lists of languages are really the appropriate vehicle.
>
>
>
>     * Option #4 has three ambiguous tags and means I have to explain
>     to people who aren't in this industry about why I use different
>     tags for the same language. This strategy is less ambiguous than
>     #3, but I'm not sure I can explain it to other content creators,
>     for the same reasons as #2, and it presents a spoken/written
>     complication others may not want. In the long run, this seems
>     messy and unclear enough that it will result in bad tagging.
>
>     * Options #2,3,4: In general, it worries me that RFC 4646bis
>     offers so many "preferred" options for the same thing. I really
>     can't see how this simplifies things for anyone.
>
>     I don't have a need for fuzzy fallback scenarios. I need precise
>     tags and mostly simple lookup. I think if you take the fallback
>     scenarios and absurdities out of the document you reference, I
>     don't think there's much left.
>
>
> The only purpose I have heard for extlang *is* for fallback; that's 
> why the document goes into (painful) depth on that topic. For 
> identification alone, "zh" and "zh-cmn" really mean just the same 
> thing. It is only in the context of matching (filtering and lookup) 
> that they differ in semantics *because of their behavior*: where "cmn" 
> means simply Mandarin, "zh-cmn" effectively means "Mandarin but 
> fall back to any Chinese". 
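
The behavioral difference can be illustrated with a simplified truncation lookup in the style of the RFC 4647 lookup scheme. This is a sketch under stated assumptions, not a conformant matcher: it ignores the special handling of single-letter subtags, and the function name `lookup` is hypothetical.

```python
from typing import Iterable, Optional


def lookup(requested: str, available: Iterable[str]) -> Optional[str]:
    """Drop trailing subtags from the request until an available tag matches."""
    # Compare case-insensitively but return the tag as the caller spelled it.
    canonical = {tag.lower(): tag for tag in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in canonical:
            return canonical[candidate]
        subtags.pop()  # truncate from the right and retry
    return None


content = ["zh", "en"]
print(lookup("zh-cmn", content))  # truncates to "zh", which is available
print(lookup("cmn", content))     # nothing left to truncate to
```

With only "zh" and "en" available, a request for "zh-cmn" falls back to "zh", while a request for bare "cmn" finds nothing unless the matcher also knows that "cmn" is encompassed by "zh".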
>
>
>
>     Regards,
>
>     Karen Broome
>
>
>
>
>     >-----Original Message-----
>     >From: ltru-bounces at ietf.org [mailto:ltru-bounces at ietf.org] On Behalf
>     >Of Mark Davis
>     >Sent: Thursday, May 29, 2008 4:00 PM
>     >To: debbie at ictmarketing.co.uk
>     >Cc: LTRU Working Group
>     >Subject: Re: [Ltru] Consensus call: extlang
>     >
>     >What would be useful is to hear from the extlangistas what their
>     >concerns are specifically; many have not given reasons for favoring
>     >putting encompassed languages into extlang instead of into the primary
>     >language subtag. It would be useful for them to give the scenarios
>     >where they think extlang is an improvement. It would be useful to
>     >find out why they think the scenarios such as in
>     >http://docs.google.com/Doc?docid=dfqr8rd5_676kxxxjhd&hl=en are not a
>     >problem.
>     >
>     >Clearly people think that using the extlang model solves more
>     >problems than it causes, so it would be useful to examine specific
>     >cases and see if that is, in fact, true.
>     >
>     >
>     >Mark
>
>
>
>
> -- 
> Mark
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ltru mailing list
> Ltru at ietf.org
> https://www.ietf.org/mailman/listinfo/ltru
>   

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.