
[Ltru] my technical position on extlang



Here are some thoughts on extlang. The more readable version is at:

http://docs.google.com/Doc?docid=dfqr8rd5_676kxxxjhd&hl=en

Copied here for the archive:

Extlang


The argument for extlang is that it gives superior results, and is thus worth the complication of making some languages unavailable as the primary language subtag and usable only in a secondary position. I believe that when extlang is examined carefully by people who have implemented language tag lookup, it will be clear that on balance most people will be worse off than if we retain the structure of RFC 4646. We should not complicate that structure to the overall detriment of implementations, and we should not push encompassed languages into the inferior extlang position.


Process

I looked back over the emails, and people may not remember everything that was discussed in the emails around the time that we came to (rough) consensus on extlang. While Shawn claimed that the topic should be reopened -- well after the last call went out -- as far as I can tell there is in fact no new information being presented that was not available on the list 6 months ago.



Since this issue seems to be reopening (for no good reason that I can see), I put together some responses from past emails on the topic. They are not wonderfully organized; I just tried to cast them as Q/A pairs to make them a bit more readable. There is definitely some repetition that ought to be edited away. The tone may seem too harsh at times -- sometimes that was in the heat of the moment, and I haven't had time to moderate the text, so I apologize in advance for any offense I may give.

Q. Where would extlang make a difference?

A. As RFC 4647 describes, there are two main processes for matching using language tags: filtering and lookup. It is worth looking at how the two different macrolanguage models affect RFC 4647. If the primary reason cited for the extlang model is compatibility, we have to see what effect each model has on commonly used matching functions, which is where the rubber hits the road.

For filtering, extlang offers no particular advantage. Let's look at queries of "ar-ary" (Moroccan Arabic) vs "ary". In either case I need a way to match all and only Moroccan Arabic; I must not fall back to "ar". If I fall back to include all Arabic, the actual content that is in Moroccan Arabic would be completely and utterly swamped by Standard Arabic. So for filtering, the extlang model just gives us a more complicated syntax, with no benefit.
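To make the filtering point concrete, here is a minimal Python sketch of basic filtering as RFC 4647 describes it (section 3.3.1): a range matches a tag if it is equal to it, or is a prefix followed by "-". The function name and the two sample content lists are assumptions made up for illustration, not anything from the drafts.

```python
def basic_filter(language_range, tags):
    """Return the tags matched by a basic language range (RFC 4647, section 3.3.1)."""
    wanted = language_range.lower()
    matched = []
    for tag in tags:
        candidate = tag.lower()
        # Match on equality, or on a prefix that ends at a subtag boundary.
        if wanted == "*" or candidate == wanted or candidate.startswith(wanted + "-"):
            matched.append(tag)
    return matched


# Hypothetical content tags, written once in each proposed style.
plain_style = ["ar", "ar-EG", "ary", "ary-MA"]
extlang_style = ["ar", "ar-EG", "ar-ary", "ar-ary-MA"]

print(basic_filter("ary", plain_style))       # only Moroccan Arabic
print(basic_filter("ar-ary", extlang_style))  # only Moroccan Arabic
print(basic_filter("ar", extlang_style))      # everything, Moroccan Arabic swamped
```

Either style needs the specific range to get all and only Moroccan Arabic; the extlang form is just longer.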

The only possible advantage of extlang would be in Lookup.

Q. Isn't extlang better for lookup?

A. Take an example

Scenario 1. The user's browser has the proposed "zh-yue-Hant-US". My lookup falls back to zh, so I serve it up to the user. So even if the target of the match (zh) is not Cantonese, you want a fallback to zh. I'm guessing that you see this as better than if we defined the tag as "yue-Hant-US", since it gets to some fallback that the user is likely to understand. But I don't see this as much different than if we had fr-br-BE (meaning Breton, but fall back to French), or ro-mo (meaning Moldavian, but falling back to Romanian). And note that in the fallback, the script and region are completely lost.

Scenario 2. The user's browser has zh-cmn-Hant-US. In matching, we fall back to zh. Note that in the fallback, the script and region are completely lost. We have essentially just introduced a synonym for zh which causes fallback to lose information, for no good reason.
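The loss of script and region in both scenarios can be seen in a small Python sketch of RFC 4647 lookup by progressive truncation. The function names and the set of available content are assumptions for illustration, and case-insensitive comparison is left out to keep it short.

```python
def truncation_chain(tag):
    """List the progressively truncated forms tried by RFC 4647 lookup."""
    subtags = tag.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # RFC 4647 also drops a single-character subtag (such as "x") left at the end.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(tag, available):
    """Return the first truncated form that exists in the available content."""
    for candidate in truncation_chain(tag):
        if candidate in available:
            return candidate
    return None


# Hypothetical site that has Traditional Chinese content for the US.
available = {"zh", "zh-Hant", "zh-Hant-US"}

print(truncation_chain("zh-yue-Hant-US"))
print(lookup("zh-yue-Hant-US", available))  # lands on bare "zh"
print(lookup("zh-Hant-US", available))      # keeps script and region
```

With the extlang form, truncation has to discard "Hant" and "US" before it can ever reach "zh", so the better-matching "zh-Hant-US" content is never considered.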

Q. What's wrong with simply including extlang?


A. If we bake it in, then every simple algorithm will in practice automatically fall back from Cantonese to Mandarin, fall back from Dari to Persian, fall back from Khetrani to Lahnda -- and in doing so, strip the script and country information. That is, unless they fix the algorithm to pretend that the secondary language is in fact a primary language. So we are forcing people into a model that is often, or mostly, wrong.

If we supply the information in the registry, then implementations can choose whatever they think is appropriate, given the particular facts about languages and the particular needs of their applications without having to work around the extlang mechanism.

I'm reminded again of a similar case with C++. The assignment operator gets a default implementation. That must have seemed like a nice convenience for the user, but except for toy programs, it is always, always wrong. So supplying that default just means that people usually have to take extra steps to disable it, and prevent it from causing bugs in their programs. I'm worried about this being similar.

It is clear that companies like Google or Yahoo can work around the problems with extlang-- what I'm worried about are the people who don't have a lot of experience with these matters, and are just led down a garden path. We need to look long and hard at the experience of people who have had detailed implementation experience with filtering and matching these tags in production environments.

Q. Where are some cases where extlang works particularly badly?


A. Extlang plays especially badly in many cases. Suppose that we have macrolanguage m1, and microlanguages x1 and x2. By the design of ISO 639, we can't assume that a speaker of x1 can also speak x2 or vice versa. If a user has as accept language the list <x1-Ssss-Rr, en, fr>, it works fine without extlang: she gets the fallback

(A) Script fallback

  1. x1-Ssss-Rr
  2. x1-Ssss
  3. x1
  4. en
  5. fr

If she also speaks/reads x2, then she can specify <x1-Ssss-Rr, x2, en, fr> or <x1-Ssss-Rr, en, fr, x2>; that is, putting x2 in the list in the position she wants it. If x2 is the predominant microlanguage, meaning that m1 is essentially always assumed to be x2, then the priority list can be <x1-Ssss-Rr, en, fr, m1>, also wherever it belongs. Thus, the user gets something she can understand, based on the list she supplied.
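As a rough Python sketch of fallback (A): each entry in the priority list is expanded by plain truncation, and the user controls where x2 or m1 appears simply by where she puts it in the list. The function names and the "available" set are hypothetical, using the placeholder tags from the text.

```python
def fallback_chain(priority_list):
    """Expand each tag in a language priority list by simple truncation."""
    chain = []
    for tag in priority_list:
        subtags = tag.split("-")
        for n in range(len(subtags), 0, -1):
            chain.append("-".join(subtags[:n]))
    return chain


def first_match(priority_list, available):
    """Return the first chain entry that the (hypothetical) site has content for."""
    for candidate in fallback_chain(priority_list):
        if candidate in available:
            return candidate
    return None


# Hypothetical site with content in x2 (also served as m1) and in French.
available = {"x2", "m1", "fr"}

print(fallback_chain(["x1-Ssss-Rr", "en", "fr"]))
print(first_match(["x1-Ssss-Rr", "en", "fr"], available))        # she does not read x2
print(first_match(["x1-Ssss-Rr", "x2", "en", "fr"], available))  # she does, and says so
```

The user who cannot read x2 gets French; the user who can read it gets x2 because she asked for it, not because the tag structure decided for her.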

If we are using extlang, and x2 is the content for m1, then we get the fallback

(B) Script + extlang fallback
  1. m1-x1-Ssss-Rr
  2. m1-x1-Ssss
  3. m1-x1
  4. m1
  5. en
  6. fr

That has two problems: first, the script and region are lost. That can be fixed by hacking the fallback (although there is a lot of installed base that won't do this), to

(C) Hacked Script + extlang fallback

  1. m1-x1-Ssss-Rr
  2. m1-x1-Ssss
  3. m1-x1
  4. m1-Ssss-Rr
  5. m1-Ssss
  6. m1
  7. en
  8. fr

But even more importantly, we are disabling the user's explicit choice. If the user doesn't speak x2 (or whatever the content of m1 is), he's screwed. There is no way that he can indicate that he wants x1 but no other version of m1 because he can't understand them.

We'd have to change the fallback to be quite substantially different to get around this, with

(D) More Hacked Script + extlang fallback
  1. m1-x1-Ssss-Rr
  2. m1-x1-Ssss
  3. m1-x1
  4. en
  5. fr
  6. m1-Ssss-Rr
  7. m1-Ssss
  8. m1

This is, however, still only appropriate if it is likely that a user of x1 speaks whatever happens to be the content for m1. That is an extremely shaky assumption.
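For comparison, here is a sketch of what an implementation would have to do to produce the hacked fallbacks (C) and (D). It is only an illustration under my own assumptions: the function names, the `macro_of` table, and the `defer_macro` switch are invented, and the user's preference is written as the plain tag "x1-Ssss-Rr" with the macrolanguage looked up separately.

```python
def truncations(subtags):
    """Return the tag built from subtags, followed by each shorter truncation."""
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def extlang_chain(priority_list, macro_of, defer_macro):
    """Build fallback (C) when defer_macro is False, or (D) when it is True."""
    chain, deferred = [], []
    for tag in priority_list:
        language, *rest = tag.split("-")
        macro = macro_of.get(language)
        if macro is None:
            chain += truncations([language, *rest])
            continue
        # m1-x1-Ssss-Rr, m1-x1-Ssss, m1-x1 (stop before the bare macrolanguage)
        chain += truncations([macro, language, *rest])[:-1]
        # The "sideways" step: m1-Ssss-Rr, m1-Ssss, m1
        sideways = truncations([macro, *rest])
        if defer_macro:
            deferred += sideways
        else:
            chain += sideways
    return chain + deferred


macro_of = {"x1": "m1"}  # hypothetical macrolanguage table
prefs = ["x1-Ssss-Rr", "en", "fr"]

print(extlang_chain(prefs, macro_of, defer_macro=False))  # fallback (C)
print(extlang_chain(prefs, macro_of, defer_macro=True))   # fallback (D)
```

Note that neither variant is simple truncation any more, which was supposed to be the whole point of putting the macrolanguage first.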

If Peter Constable said that, for each and every macrolanguage on http://www.sil.org/iso639-3/macrolanguages.asp, there is at least one microlanguage that all speakers (or even most speakers) of each of the other microlanguages would understand, I'd say: fine, let's do extlang and incorporate that information into the registry, with the "default microlanguage" for each macrolanguage. Then, for example, implementers would know that, say, "fuf Pular" is understood by all the speakers of the microlanguages under "ful Fulah", so we can tell people to always have the content of "ful" be "fuf", and bake the macrolanguages in as extlang.

The point of the suggested text is that if your application wants to use macrolanguages to support extlang-equivalent fallback, there is nothing stopping you from doing so. If there are particular environments where an extlang-like fallback is right for a particular language community, it is simple to do. But we don't need to bake shaky assumptions into the structure of language tags.

Q. Isn't extlang just like script fallback?


A. The problem with extlang is that the fallback from encompassed language to macrolanguage is fundamentally different in kind than a fallback from region to script to base language. In the case of script, like uz-Arab and uz-Latn, or en-US vs en-GB, we really have variations on the same language, and fallback makes sense. We ordered the subtags so that it works optimally overall.

The encompassed languages, on the other hand, are not just dialects, not just variants. They are languages in their own right. Trying to insert them into the fallback process just screws things up, because they need a "sideways" matching not just simple truncation fallback. If you want to do any fallback with extlang, it would be to fall back from zh-yue-<other stuff> to zh-<other stuff>. That means that in order to do reasonable fallback, you can't just use truncation fallback anyway. So I see the situation this way:

  1. The only reason for adding the complication of the extlang mechanism is to make truncation fallback work better.
  2. Truncation fallback with extlang doesn't work better.
  3. So there is no need to make encompassed languages be "secondary" languages by making them be "secondary" subtags.

The goals of extlang are good, to make matching work better, but in practice it just makes things worse. [Speaking to those familiar with C++, it feels a bit like the default assignment operator in C++. Nice in theory, but in practice it gums things up more than it fixes, since once you are beyond very simple (toy) classes, the default is almost always wrong -- but because it is supplied behind your back you don't realize it.]

So instead of adding the extlang mechanism to RFC 4646, what we really need to do is to point people to how to handle yue and other encompassed languages along with mo/ro, tl/fil, and other edge cases in a reasonable way, by augmenting matching.

Q. Where might the macrolanguage be useful?

A. An implementation may choose to use that information in falling back from some encompassed languages to macro languages. For example, given the language priority list with Cantonese in Traditional Script as used in Hong Kong, followed by French ("yue-Hant-HK, fr"), the lookup could be the following:

1. yue-Hant-HK
2. yue-Hant
3. yue-HK
4. fr
5. implementation defined default:
  5a. zh-Hant-HK
  5b. zh-Hant
  5c. zh
  5d. en

Whether such fallback should be used -- and if so, the precise way in which such a fallback is done -- is application-dependent. Where it is very likely that the audience requesting Cantonese (as above) will accept and understand Mandarin (the predominant content for 'zh'), then this fallback might be useful. Where there is risk that the audience requesting Cantonese will not be conversant with Mandarin, and would prefer an alternative in the language priority list, it should be avoided. (This might be the case, for example, with audio using yue-Zxxx-US.)

Q. Why not have extlang for using macrolanguages if the suggested text adds macrolanguages back into the fallback chain?


A. The suggested text doesn't add it back in. It only says that IF an application wants to do extlang-equivalent fallback, the text in BCP 47 already allows for that.

We really have no idea whether using macro languages in the fallback chain is a good idea or not. Some people think it will be an advantage for a few specific examples that are cited. (Cantonese comes up, but when I tested some of the assumptions with Cantonese speakers, they didn't quite hold up.) But nobody has substantiated that it will give better results for all or most macrolanguages, nor given even a rough list of those for which it will be better. Nor has anyone effectively argued that the situation between yue and zh is substantially different than the situation between gsw and de, where we get along just fine without extlang.

So the suggested text just provides it as an option, and leaves it up to the application.

Q. How should we look at macrolanguages?


A. I think one of the things that we realized when looking at how this would work in practice is that we are better off if we treat macrolanguage as a piece of perhaps useful information for matching, but one that can be enhanced (that is, changed) over time as more information becomes available. Hard-coding it into extlang doesn't serve that purpose, and causes other problems, notably that the other fields are lost in fallback: if we had zh-yue-Hant, then by the time we get to zh in fallback, we've lost the Hant.

So that is the origin of the text we are proposing.

There are a number of edge cases such as deprecated codes, closely related languages, or practical fallbacks (eg if someone speaks X they are likely to speak Y, even if X and Y are not linguistically related) that are simply unsuited to hard-coding in the tag. If I hit a code like "iw-PL", I want to match that with "he-PL", not depend on some kind of fallback between "iw" and "he"; otherwise it loses information (the PL). We do provide that kind of information in the deprecated field, and with the macrolanguage field we would provide more (for example, it would make clear the relation between no, nb, and nn and how that could be used in matching). Other useful information is the scripts used with a language in practice; suppress script supplies just a little information, but doesn't tell me that Uzbek is customarily written with Arabic, Latin, or Cyrillic, but not with (say) Tagalog.
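The "iw-PL" case can be sketched in a few lines of Python: replace the deprecated language subtag by its preferred value and leave the rest of the tag alone, so the region survives. The table holds just three deprecated subtags and the function name is made up; a real implementation would load the mappings from the registry.

```python
# A few deprecated language subtags and their preferred values (partial table).
PREFERRED = {"iw": "he", "in": "id", "ji": "yi"}


def canonicalize(tag):
    """Swap a deprecated primary language subtag for its preferred value."""
    language, *rest = tag.split("-")
    language = PREFERRED.get(language.lower(), language)
    # Script, region, and any other subtags are carried over untouched.
    return "-".join([language, *rest])


print(canonicalize("iw-PL"))  # he-PL
print(canonicalize("he-PL"))  # he-PL
```

After this step "iw-PL" and "he-PL" compare equal, so no fallback between "iw" and "he" is needed and the PL is never dropped.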

The more information there is available, whether it be in the language subtag registry or somewhere else, the better a job people can do in dealing with some of the edge cases that turn up in matching.

Q. Doesn't the macrolanguage relationship uniquely define the best fallback?


A. Certain languages are closely related, and the lookup process may take that into account. Macrolanguage is just one factor that may (or might not) be useful. For example, since the tag "gsw-CH" (for Swiss German as used in Switzerland) was first available on 2006-12-08, Swiss German ("Schwyzerduetsch") text may have been tagged with "de-CH" instead. ISO 639 was not (and is still not) clear on whether "de" meant only High German or also included variants such as Low German. Thus Swiss German material may have been, and may still be, tagged with "de". Essentially all Swiss German speakers are comfortable in High German, so where Swiss German is not available, High German is a very good fallback. Thus when given the language priority list "gsw-CH, fr-CH", an implementation using lookup may augment the default values to also include the lookup of related values, such as the following search order:

1. gsw-CH
2. gsw
3. fr-CH // next language
4. fr
5. implementation defined default:
    5a. de-CH // special fallback from gsw-CH
    5b. de
    5c. en // root

In this way, other likely possibilities are tried before the final fallback to the root value. Note that typically the fallback to related languages should include the script and region codes if available.

In this way, the lookup process may take into account what languages people are likely to understand, given a language priority list. Similarly, the close relations between Romanian and Moldavian, Tagalog and Filipino, Serbo-Croatian and Croatian, and so on may all be useful in doing related language lookup.

This is not restricted to related languages. For example, a Breton speaker is very likely to also understand French. Thus, given the language priority list "br-FR, de", the implementation may choose to use the following lookup:

1. br-FR
2. br
3. de
4. implementation defined default:
    4a. fr-FR // special fallback from br (Breton)
    4b. fr
    4c. en
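Both search orders above can be produced by the same small routine. This is only a sketch under my own assumptions: the `RELATED` table, the function names, and the choice of "en" as root are application decisions, not anything defined by the registry or by RFC 4647.

```python
def truncations(tag):
    """Return the tag followed by each shorter truncation."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def search_order(priority_list, related, root="en"):
    """Truncate each priority entry, then try related languages, then the root."""
    order = []
    for tag in priority_list:
        order += truncations(tag)
    # Implementation-defined defaults: swap in the related language but keep
    # the script and region subtags the user asked for.
    for tag in priority_list:
        language, *rest = tag.split("-")
        if language in related:
            order += truncations("-".join([related[language], *rest]))
    order.append(root)
    # Drop duplicates while preserving the order of first appearance.
    return list(dict.fromkeys(order))


# Hypothetical application table: speakers of the key usually understand the value.
RELATED = {"gsw": "de", "br": "fr"}

print(search_order(["gsw-CH", "fr-CH"], RELATED))
print(search_order(["br-FR", "de"], RELATED))
```

Because the table is ordinary data owned by the application, an entry can be added, removed, or reordered as more is learned about a language community, without touching the tags themselves.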

Q. If I see "zh" and "cmn", I have no way of telling that they're related without looking at the registry, which basically means a hard-coded table. If "zh" is preferred, then I may want to move from "zh" to "cmn" or whatever.

A. You can't tell that from the registry either! Sometimes microlanguages are related in such a way as to be a good fallback, but usually they are not. The handful of actual cases where we think it might be a good idea are listed in the text around Table 8, *not* in the registry. Knowing that X is a macro/micro language does not necessarily mean -- and usually doesn't mean -- that you want to use it in fallbacks. If there is no predominant form, then it's a crapshoot as to whether the macro/microlanguage is a good fallback.

There is no special runtime information in the registry. When a new macro/microlanguage shows up in the registry *and* you support one of the pair, then it may be useful to review whether or not you want to add a fallback. Automatically updating fallbacks blindly according to the registry macro/microlanguages is not a good idea. Nor is it complete, since it misses tl/fil, ro/mo, and many others that are much higher frequency cases than most of the macro/microlanguages.

If an implementation provides a UI for selecting language priority lists, it may be better to give the user the option of having explicit fallbacks (such as from Cantonese to Mandarin or Tagalog to Filipino), rather than trying to guess the user's intent (and run the distinct risk of getting it wrong). For that purpose, when a user adds a language to the priority list, the UI may suggest macrolanguages, or other related languages, as additional fallbacks.

Q. In lookup, if there is a predominant form how is it best (in practice) to deal with the macrolanguage?

For most programs, I believe that treating them as synonyms is the right thing to do, and alternative approaches would be extremely counterproductive. And this goes for any of the macrolanguage cases where there is a predominant encompassed language with long usage in the computer industry.

Let's take a scenario where this is not done (a la Ewell). Suppose that a user picks Arabic as her Accept-Language in her browser. Any existing browser will represent that with "ar". Then she goes to the BBC site. The entire site is translated, not into standard Arabic, but into Sudanese Creole Arabic. The user complains, since she can't understand it, and the BBC responds that they are just following the standard to the letter and spirit: "ar" means any kind of Arabic whatsoever, and so in the interests of fairness, they pick a different encompassed language to serve up each day. They inform the user that it is her fault for using 'ar' if she really only wants Standard Arabic. So they have the following schedule:

Monday aao Algerian Saharan Arabic
Tuesday abh Tajiki Arabic
Wednesday abv Baharna Arabic
 ... acm Mesopotamian Arabic
  acq Ta'izzi-Adeni Arabic
  acw Hijazi Arabic
  acx Omani Arabic
  acy Cypriot Arabic
  adf Dhofari Arabic
  aeb Tunisian Arabic
  aec Saidi Arabic
  afb Gulf Arabic
  ajp South Levantine Arabic
  apc North Levantine Arabic
  apd Sudanese Arabic
  arb Standard Arabic
  arq Algerian Arabic
  ars Najdi Arabic
  ary Moroccan Arabic
  arz Egyptian Arabic
  auz Uzbeki Arabic
  avl Eastern Egyptian Bedawi Arabic
  ayh Hadrami Arabic
  ayl Libyan Arabic
  ayn Sanaani Arabic
  ayp North Mesopotamian Arabic
  bbz Babalia Creole Arabic
  pga Sudanese Creole Arabic
  shu Chadian Arabic
  ssh Shihhi Arabic


If 'ar' means any Arabic, without any preference, this would be a perfectly reasonable thing to do. But for users, it would hardly be satisfactory. And this would be a bizarrely stupid thing for the BBC to do.

Someone might respond that, well, everyone needs to convert over to cmn for Mandarin since it is now the Right Thing to Do. Even if that magically happened, it would take years, and during the transition we would get all kinds of screwups with different programs transitioning at different paces. And there isn't much magic around; it is very hard to get people to change infrastructure that works just fine -- you have to give them a compelling case for why users are served better by the change. And that would be a very hard sell, since there isn't any real advantage.

The right approach for the BBC is to treat a request 'ar' as a request for Standard Arabic, just as they have always done. Internally, that means treating 'ar' and 'arb', and any language tag that starts with them, as a request for Standard Arabic. That is, treating ar-EG and arb-EG, or ar-SA and arb-SA, or other combinations, as synonyms for the purpose of lookup. Now, this "treating as synonyms" could be done in different ways. One way is to mash on input; the other is to have a fancier fallback, eg arb-EG => ar-EG => arb => ar (mutatis mutandis, when starting with ar-EG).
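Here is a rough Python sketch of the two ways of "treating as synonyms" just described. The `PAIRS` table and both function names are assumptions for illustration; which pairs belong in such a table is exactly the kind of decision an application has to make for itself.

```python
# Hypothetical synonym pairs: macrolanguage <-> predominant encompassed language.
PAIRS = {"arb": "ar", "ar": "arb", "cmn": "zh", "zh": "cmn"}
MACRO = {"arb": "ar", "cmn": "zh"}


def mash(tag):
    """Option 1: mash on input, rewriting arb-* to ar-* and cmn-* to zh-*."""
    language, *rest = tag.split("-")
    return "-".join([MACRO.get(language, language), *rest])


def synonym_chain(tag):
    """Option 2: fancier fallback that tries the synonym at every truncation step."""
    language, *rest = tag.split("-")
    languages = [language]
    if language in PAIRS:
        languages.append(PAIRS[language])
    chain = []
    for n in range(len(rest), -1, -1):
        for lang in languages:
            chain.append("-".join([lang, *rest[:n]]))
    return chain


print(mash("arb-EG"))           # ar-EG
print(synonym_chain("arb-EG"))  # arb-EG, ar-EG, arb, ar
print(synonym_chain("ar-EG"))   # ar-EG, arb-EG, ar, arb
```

Mashing is simpler if all stored content can be normalized the same way; the chain is friendlier to content that is already tagged in a mix of both forms.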

Nor was this solved at all by extlang -- as we discussed at some length, it discarded all script and region info when falling back, and produces worse results in many cases, especially where the macrolanguage does not have a predominant form.

The "treating as synonyms" strategy is always going to be the right answer. There are undoubtedly scenarios where this strategy is not necessary, although I can't think of any off the top of my head.

Moreover, I think one of the more productive things we can do is to push for the incorporation of Language Priority Lists in any query-like protocols. That way I could say I'd like "ary, fr, ar" if my preferred ordering is Moroccan Arabic, then French, then as a last resort, Standard Arabic.

Q. How do I use fancier fallback for the predominant form?


Here is a more detailed case

  1. cmn-Hant-HK
  2. zh-Hant-HK
  3. cmn-Hant
  4. zh-Hant
  5. cmn-HK
  6. zh-HK
  7. cmn
  8. zh

Q. Haven't people always interpreted zh as meaning anything from Mandarin to Hakka to Min?


A. That's very unclear. People usually choose 'zh' not literally, but through a UI that shows a human readable form. So the question is, how many people have looked at interfaces that say simply "中文" and think
  1. "that could mean Mandarin but could also mean Hakka" vs
  2. "that means just Mandarin, they don't offer Hakka so I'll pick something else", vs
  3. "that means just Mandarin, they don't offer Hakka, but Mandarin is the closest to Hakka that is offered so I'll pick that."

In lookup on computer systems it is clear that nobody's expectation is that by picking 'zh', they will get Hakka. And anything but Mandarin is a vanishingly small percentage of tagged text; the same is true of 'ar'; anything but Standard Arabic is a vanishingly small percentage.

As Karen said on a related topic: "My experience is that the users who need to specify Cantonese most often make up an illegal tag. Not saying that's what we should recommend, but I believe my experience does not support the statement as worded."

Q. Isn't the written form of Cantonese the same as Mandarin?

A. No, no more than the "written form of Swiss German is the same as High German". What is true is that when the Swiss write, they write in High German; that's different. They are using different words than what is spoken, and very different syntax: "I bi doo gsy." => "Ich war hier."

Here is what I have on the subject from John Jenkins:

I believe you said that a Mandarin speaker can read written Cantonese, but will not understand everything (a bit like a Dane reading Swedish).

More like French and Spanish, actually. 

Some characters would not (normally) be used in Mandarin.

Most famously U+4E5C, the Cantonese for "what".

Some characters would have different meanings than in Mandarin

Best illustrated with U+4FC2, which means "to bind" in Mandarin but is frequently borrowed for the Cantonese word for "to be."

Some syntax would be different

Yes, but I can't think of any examples off the top of my head and the book I've got that lists the differences is successfully playing hide-and-seek at the moment.  There are actually not an awful lot of these.  The main differences between the two are phonetic and lexical.  The grammars are very similar.  

Can you point me to some web pages with written Cantonese that would demonstrate that to a Chinese reader?

Nothing better to start with than the Cantonese Wikipedia article on Cantonese:  <http://zh-yue.wikipedia.org/wiki/粵語>.  Similarly, <http://zh-yue.wikipedia.org/wiki/香港>, <http://zh-yue.wikipedia.org/wiki/Unicode>, and pretty much everything else in the Cantonese Wikipedia. 


I can give you the relative character frequencies in the Cantonese and traditional (or simplified Chinese) Wikipediae, if you like.  I've still got that data around somewhere.  

The thing you have to emphasize is the difference between what-Cantonese-speakers-generally-read-and-write, which is just Mandarin with Cantonese phonetics, and writing-down-what-Cantonese-speakers-actually-speak, which is what "written Cantonese" should be used to mean.  Unfortunately, not everybody groks this.  Fortunately, Wikipedia does.

Meanwhile, I quote from Stephen Matthews and Virginia Yip, _Cantonese: A Comprehensive Grammar_ (London: Routledge, 1994), pp. 5-6:

"Traditionally, Cantonese has been regarded as one of the many Chinese dialects. It does not have a standardized written form on a par with standard written Chinese.  No form of written Cantonese is taught in schools or used in academic settings in any Cantonese-speaking community.  When it comes to the written form, it is standard written Chinese that is taught and learnt.  For educated Cantonese speakers, standard written Chinese is the written form they use in most contexts.  However, in colloquial genres such as novels, popular magazines, newspaper gossip columns, informal personal communications, written Cantonese may be used.  When written Cantonese contains too many exclusively Cantonese words and expressions, non-Cantonese speakers may find it totally unintelligible."

Since they wrote, however, there's been a distinct upsurge in the use of written Cantonese.  (It's tied in with a kind of Hong Kongese pseudo-nationalism.)  It's still not exactly *common*, but it's a lot more common than it used to be.



--
Mark
_______________________________________________
Ltru mailing list
Ltru at ietf.org
https://www.ietf.org/mailman/listinfo/ltru

Note Well: Messages sent to this mailing list are the opinions of the senders and do not imply endorsement by the IETF.