RE: [Ltru] Macrolanguage and extlang

Mark, Thanks for this update. In reading this over (and trying to read between the lines) with an eye to implications for many African languages and macrolanguages, the mention of Romanian/Moldavian seems particularly relevant, as an example of cases "where the 'best fit' information is not contained in the language registry." This is an area where I hope that experts on African languages who are familiar with tagging issues can be organized to propose amendments to the system (i.e., ISO 639, which I realize is not the purview of this list, and the current RFCs).

There are also cases where macrolanguages are defined, but a still somewhat fluid situation with respect to standardization makes it problematic to define how they should be used (I've posted previously on some of the issues as I see them, both on this list and on ietf-languages, such as cases where the language tag may be less appropriate than the macrolanguage tag). Here too there seems to be a need for input by experts on African languages in discussions of tagging as well as in language planning.

In the meantime, I hope that the new wording can accommodate all such situations, especially for languages with fewer resources and emerging standards.

The fallback language issue (mentioned in the example re Breton) raises another question: can there be more than one fallback language? In the case of many of the cross-border languages in Africa (such as Hausa, Swahili, Wolof, Fula, Tsonga, Oshiwambo, etc.) this would be helpful.

Don

From: Mark Davis [mailto:mark.davis@icu-project.org] 
Sent: Friday, July 13, 2007 9:06 PM
To: LTRU Working Group
Subject: [Ltru] Macrolanguage and extlang

Addison and I have discussed the issue of extlang and Macrolanguages and are proposing the following text to replace the use of extlang.

[A new section called Macrolanguages: ]

The Macrolanguage field contains a primary language subtag that encompasses this subtag. That is, this language is a dialect or sub-language of the Macrolanguage, and is called an encompassed subtag. The Macrolanguage value is defined by ISO 639-3. The field can be useful to applications or users when selecting language tags or as additional metadata useful in matching. The Macrolanguage field can only occur in records of type 'language'. Only values assigned by ISO 639-3 will be considered for inclusion. Macrolanguage fields MAY be added via the normal registration process whenever ISO 639-3 defines new values. Macrolanguages are informational, and MAY be removed or changed if ISO 639-3 changes the values.

For example, the language subtags 'nb' (Norwegian Bokmal) and 'nn' (Norwegian Nynorsk) have a Macrolanguage entry of 'no' (Norwegian). For more information see [Choice].
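
As a rough illustration of how an application might read the proposed field, here is a minimal Python sketch. It assumes the registry's usual "Name: value" record layout with "%%" between records. The sample records are abbreviated and illustrative, not copied from the registry, and the helper names (parse_records, macrolanguage_map) are hypothetical. Folded continuation lines are not handled.

```python
# Abbreviated, illustrative records; a real registry record carries more fields.
SAMPLE_REGISTRY = """\
Type: language
Subtag: no
Description: Norwegian
%%
Type: language
Subtag: nb
Description: Norwegian Bokmal
Macrolanguage: no
%%
Type: language
Subtag: nn
Description: Norwegian Nynorsk
Macrolanguage: no
"""


def parse_records(text):
    """Split registry text on '%%' and turn each record into a dict."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            name, _, value = line.partition(":")
            record[name.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def macrolanguage_map(records):
    """Map each encompassed language subtag to its Macrolanguage subtag."""
    return {
        r["Subtag"]: r["Macrolanguage"]
        for r in records
        # The field can only occur in records of type 'language'.
        if r.get("Type") == "language" and "Macrolanguage" in r
    }


MACRO = macrolanguage_map(parse_records(SAMPLE_REGISTRY))
print(MACRO)  # {'nb': 'no', 'nn': 'no'}
```

The resulting table is purely informational, in line with the text above: 'no' itself has no entry, and an application is free to ignore the mapping.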

[A new section in tag choice (section 4.1), referenced from the above] 

Languages with a Macrolanguage field in the registry sometimes can be usefully referenced using their Macrolanguage. However, the Macrolanguage field doesn't define what the relationship is between the language subtag whose record it appears in and its encompassed language or languages. Nor does it define how the encompassed languages are related to one another. In some cases, the Macrolanguage has a standard form as well as a variety of less-common dialects. For example, the Macrolanguage 'ar' (Arabic) and the subtag 'arb' (Standard Arabic) generally describe the same language, with other subtags describing less-common local variations. In other cases there is no particular standard form and the encompassed subtags describe specific variations within the parent language.

Applications MAY use Macrolanguage information to improve matching or language negotiation. For example, the information that 'sr' and 'hr' share a Macrolanguage expresses a closer relation between those languages than between, say, 'sr' and 'mk' (Macedonian). It is valid to use either the encompassed language or its Macrolanguage to form language tags. However, many matching applications will not be aware of the relationship between the languages. Care in selecting which subtags are used is crucial to interoperability. In general, use the most specific tag. However, where the standard written form of an encompassed language is captured by the Macrolanguage, the Macrolanguage should still be used for written material.
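
One way a matcher could take the shared-Macrolanguage information into account is sketched below in Python. This is only an illustration: the three-level score and the function names are assumptions, not part of the proposed text, and the MACRO table is hand-written from the examples in this message rather than read from the registry.

```python
# Hand-written from the examples in this message (assumed values, not registry data).
MACRO = {"sr": "sh", "hr": "sh", "yue": "zh"}


def primary(tag):
    """Return the lowercased primary language subtag of a language tag."""
    return tag.split("-")[0].lower()


def relatedness(a, b, macro):
    """Score two tags: 2 = same primary language,
    1 = related through a Macrolanguage, 0 = no evidence in the table."""
    pa, pb = primary(a), primary(b)
    if pa == pb:
        return 2
    # A language without an entry stands for itself, so a Macrolanguage
    # also counts as related to the languages it encompasses.
    if macro.get(pa, pa) == macro.get(pb, pb):
        return 1
    return 0


print(relatedness("sr", "hr", MACRO))            # 1
print(relatedness("sr", "mk", MACRO))            # 0 ('mk' is Macedonian)
print(relatedness("sr-Latn", "sr-Cyrl", MACRO))  # 2
```

A score of 0 only means the table has nothing to say; as discussed further down, two languages can be close without sharing a Macrolanguage.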

In particular, Chinese language(s) and dialects call for special consideration. Because the written form is very similar for most languages having 'zh' as a Macrolanguage (and because historically subtags for the various sub-languages and dialects were not available), languages such as 'yue' (Cantonese) have usually used tags beginning with the subtag 'zh'. This past practice of tagging means that the use of Macrolanguage information is encouraged when searching for content or when providing fallbacks in language negotiation. For example, the information that 'yue' has a Macrolanguage of 'zh' could be used in the Lookup algorithm to fall back from a request for "yue-Hans-CN" to "zh-Hans-CN" without losing the script and region information (even though the user did not specify "zh-Hans-CN" in their language priority list).
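
The "yue-Hans-CN" to "zh-Hans-CN" fallback could look roughly like the following Python sketch. It is a simplified take on RFC 4647 Lookup (progressive truncation of each range), extended with one assumed step: after the truncations of the requested tag are exhausted, the primary subtag is swapped for its Macrolanguage and the truncation is repeated. The ordering of candidates and the function names are choices made for this sketch, not something the proposed text specifies.

```python
def truncations(tag):
    """Yield the tag, then progressively shorter prefixes (RFC 4647 Lookup style)."""
    subtags = tag.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A single-character subtag left at the end is dropped as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(priority_list, available, macro, default=None):
    """Return the first available tag matching the priority list, or default."""
    by_lower = {t.lower(): t for t in available}
    for requested in priority_list:
        candidates = list(truncations(requested))
        primary, _, rest = requested.partition("-")
        parent = macro.get(primary.lower())
        if parent:
            # Assumed extension: retry with the Macrolanguage as primary
            # subtag, keeping script and region subtags intact.
            swapped = parent + ("-" + rest if rest else "")
            candidates += list(truncations(swapped))
        for candidate in candidates:
            hit = by_lower.get(candidate.lower())
            if hit:
                return hit
    return default


macro = {"yue": "zh"}
print(lookup(["yue-Hans-CN"], ["en", "zh-Hans-CN"], macro))   # zh-Hans-CN
print(lookup(["yue-Hans-CN"], ["yue", "zh-Hans-CN"], macro))  # yue
```

In this sketch any 'yue' match, however truncated, is preferred over a 'zh' match. An implementation that values script and region more highly could interleave the two candidate lists instead.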

However, the Macrolanguage is only one of many additional pieces of information that can be used in matching languages. There are many other circumstances where the "best fit" information is not contained in the language registry. For example, the languages "ro" (Romanian) and "mo" (Moldavian) are very closely related, and so for searching it is often best to treat them as being the same. In other cases, the best fallback for a requested language may be a completely unrelated language, but one that a majority of speakers of the requested language may understand. For example, in a given application the best fallback for "br" (Breton) may be "fr" (French) -- rather than the more closely related "cy" (Welsh) -- because Breton readers are far more likely to be able to read French than Welsh.
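
Since this kind of "best fit" information lives outside the registry, an application would have to supply it itself. The Python sketch below shows one possible shape for that: a hypothetical, application-maintained table (EXTRA_FALLBACKS) keyed by primary language subtag. Each value is an ordered list, so more than one fallback can be given. The table contents are taken from the examples above ("br" is the ISO 639-1 code for Breton); nothing here is defined by the registry or by RFC 4647.

```python
# Hypothetical application-supplied table; not part of the language registry.
EXTRA_FALLBACKS = {
    "br": ["fr"],  # Breton readers are likely to read French
    "mo": ["ro"],  # treat Moldavian and Romanian as interchangeable
    "ro": ["mo"],
}


def fallback_chain(tag, extra):
    """Return the requested tag followed by any application-supplied fallbacks."""
    chain = [tag]
    primary = tag.split("-")[0].lower()
    for alternative in extra.get(primary, []):
        if alternative not in chain:
            chain.append(alternative)
    return chain


def choose(requested, available, extra):
    """Pick the first entry of the fallback chain that is available, else None."""
    for candidate in fallback_chain(requested, extra):
        if candidate in available:
            return candidate
    return None


print(choose("br", {"fr", "cy"}, EXTRA_FALLBACKS))  # fr
print(choose("mo", {"ro", "en"}, EXTRA_FALLBACKS))  # ro
```

Keeping the table separate from the Macrolanguage data makes it clear which fallbacks come from ISO 639-3 and which are local policy.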

For more information on matching, see [RFC 4647].  

[In the section talking about updates]

The Macrolanguage field is added whenever a language has a corresponding Macrolanguage in [ISO 639-3]. For example, 'sr' (Serbian) will have the Macrolanguage value 'sh' (Serbo-Croatian).

[Other changes]

[Search for instances of "Suppress-Script" (just as a place to find where field descriptions are) and add "Macrolanguage" where appropriate, e.g., in the "LANGUAGE SUBTAG REGISTRATION FORM"]

-- 
Mark