All,
I am back from a recent soggy camping trip and catching up with this thread. I have just now posted an editor's copy of the text that has been discussed on this thread to inter-locale. Note that I have made some minor edits to the proposed text (to ensure it matches the style of the document; uses RFC 2119 keywords properly; and is grammatically correct).
Here are the links:
Diff: http://tinyurl.com/cyrmju
HTML: http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-22-ed-md.html
TXT: http://www.inter-locale.com/ID/draft-ietf-ltru-4646bis-222-ed-mt.txt
Here is the proposed text for Section 4.5:
--
4.5. Canonicalization of Language Tags
Since a particular language tag is sometimes used by many processes,
language tags SHOULD always be created or generated in a canonical
form.
There are two canonical forms for language tags: the 'default'
canonical form contains no extended language subtags, while the
'extlang' canonical form contains extended language subtags where
required. Normally, the 'default' canonicalization is preferred.
However, the 'extlang' canonical form can be useful in environments
where the presence of the enclosing primary language subtag is
considered beneficial to matching or selection (see Section 4.1.2).
A language tag is in a canonical form, either 'default' or 'extlang',
when the tag is well-formed according to the rules in Section 2.1 and
Section 2.2 and it has been canonicalized by applying each of the
following steps in order, using data from the IANA registry (see
Section 3.1):
1. Extension sequences are ordered into case-insensitive ASCII order
by singleton subtag.
* That is, the subtag sequence '-a-babble' comes before
'-b-warble'.
2. Redundant or grandfathered tags are replaced by their Preferred-
Value, if there is one.
* These items are either deprecated mappings created before the
adoption of this document (such as the mapping of "no-nyn" to
"nn" or "i-klingon" to "tlh") or are the result of later
registrations or additions to this document (for example, "zh-
hakka" was deprecated in favor of the ISO 639-3 code 'hak'
when this document was adopted).
* Note: The field-body of the Preferred-Value for grandfathered
and redundant tags is an "extended language range" ([RFC4647])
and might consist of more than one subtag.
3. Subtags are replaced by their Preferred-Value, if there is one.
For extended language subtags, the original primary language
subtag is also replaced if there is a primary language subtag in
the Preferred-Value.
* The field-body of the Preferred-Value for extlangs is an
"extended language range" and almost always consists of a
single, primary language subtag. For example, the subtag
sequence "zh-hak" (Chinese, Hakka) would be replaced with the
tag "hak" (Hakka).
* The field-body of the Preferred-Value for all other types of
subtags consists of a subtag of the same type. Most of these
non-extlang subtags are either Region subtags where the
country name or designation has changed or are clerical
corrections to ISO 639-1.
4. In the 'extlang' canonical form (but not the 'default' canonical
form), primary language subtags that are also extlang subtags are
prepended with the extlang's Prefix.
* For example, "hak-CN" (Hakka, China) has a primary language
subtag of 'hak', which also appears in the registry as an
'extlang' record with a Prefix 'zh' (Chinese). The 'extlang'
canonical form would be "zh-hak-CN" (Chinese, Hakka, China).
* Note that this step can restore a subtag that was removed by
the previous step.
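As an illustration only (not part of the proposed text), steps 2 through 4 can be sketched in Python. The three tables below are a hand-written stand-in for the IANA registry, holding only the records mentioned in this section; the names TAG_PREFERRED, REGION_PREFERRED, EXTLANG, and canonicalize are hypothetical, and a real implementation would load the full registry and cover every subtag type.

```python
# Hypothetical, hand-written excerpt standing in for the IANA registry.
# Only the records mentioned in this section are included.
TAG_PREFERRED = {"no-nyn": "nn", "i-klingon": "tlh", "zh-hakka": "hak"}
REGION_PREFERRED = {"bu": "MM"}
EXTLANG = {"hak": {"preferred": "hak", "prefix": "zh"}}


def canonicalize(tag: str, form: str = "default") -> str:
    """Apply steps 2-4 to a well-formed tag (sketch; step 1 is not handled)."""
    if form not in ("default", "extlang"):
        raise ValueError("form must be 'default' or 'extlang'")

    # Step 2: replace a whole grandfathered or redundant tag.
    tag = TAG_PREFERRED.get(tag.lower(), tag)
    subtags = tag.split("-")

    # Step 3 (extlang): the Preferred-Value also replaces the primary
    # language subtag, so "zh-hak" becomes "hak".
    if len(subtags) > 1:
        record = EXTLANG.get(subtags[1].lower())
        if record and subtags[0].lower() == record["prefix"]:
            subtags = [record["preferred"]] + subtags[2:]

    # Step 3 (other subtags): this sketch only knows deprecated regions.
    # Stop at the first singleton so extension and private use subtags
    # are left alone.
    for i in range(1, len(subtags)):
        if len(subtags[i]) == 1:
            break
        subtags[i] = REGION_PREFERRED.get(subtags[i].lower(), subtags[i])

    # Step 4: in the 'extlang' form, prepend the extlang's Prefix.
    if form == "extlang":
        record = EXTLANG.get(subtags[0].lower())
        if record:
            subtags.insert(0, record["prefix"])

    return "-".join(subtags)
```

Under these assumptions, canonicalize("zh-hak") yields "hak" in the 'default' form, while canonicalize("hak-CN", form="extlang") yields "zh-hak-CN", which shows step 4 restoring the subtag that step 3 removed.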
Example: The language tag "en-a-aaa-b-ccc-bbb-x-xyz" is in a
canonical form, while "en-b-ccc-bbb-a-aaa-X-xyz" is well-formed and
potentially valid (extensions 'a' and 'b' are not defined as of the
publication of this document) but not in a canonical form (the
extensions are not in alphabetical order).
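As an illustration only, the extension ordering in step 1 could be implemented along the following lines. The function name order_extensions is hypothetical, and the sketch assumes the input is already well-formed; it leaves the subtags inside each extension, and the private use sequence, untouched.

```python
def order_extensions(tag: str) -> str:
    """Order extension sequences by singleton, case-insensitively (sketch)."""
    subtags = tag.split("-")
    # A tag that is entirely private use ("x-...") has no extensions.
    if subtags[0].lower() == "x":
        return tag
    # Find the first singleton; everything before it is left as-is.
    start = next(
        (i for i, s in enumerate(subtags) if i > 0 and len(s) == 1), None
    )
    if start is None:
        return tag

    head, rest = subtags[:start], subtags[start:]
    extensions, private_use = [], []
    i = 0
    while i < len(rest):
        if rest[i].lower() == "x":
            # The private use sequence always stays last.
            private_use = rest[i:]
            break
        # Collect one extension: a singleton plus its following subtags.
        j = i + 1
        while j < len(rest) and len(rest[j]) != 1:
            j += 1
        extensions.append(rest[i:j])
        i = j

    extensions.sort(key=lambda seq: seq[0].lower())
    ordered = head + [s for seq in extensions for s in seq] + private_use
    return "-".join(ordered)
```

Applied to the second tag in the example above, order_extensions("en-b-ccc-bbb-a-aaa-X-xyz") returns "en-a-aaa-b-ccc-bbb-X-xyz"; the case of 'X' is preserved because ordering does not regularize case.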
Example: Although the tag "en-BU" (English as used in Burma)
maintains its validity, it is not in a canonical form because the
'BU' subtag has a canonical mapping to 'MM' (Myanmar).
Canonicalization of language tags does not imply anything about the
use of upper or lowercase letters when processing or comparing
subtags (as described in Section 2.1). All comparisons MUST be
performed in a case-insensitive manner.
When performing canonicalization of language tags, processors MAY
regularize the case of the subtags (that is, this process is
OPTIONAL), following the case used in the registry (see
Section 2.1.1).
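As an illustration only, optional case regularization might look like the sketch below. It assumes the convention I understand Section 2.1.1 to describe: all subtags are lowercase, except that two-letter subtags are uppercase and four-letter subtags are titlecase when they neither start the tag nor follow a singleton. The function name regularize_case is hypothetical.

```python
def regularize_case(tag: str) -> str:
    """Regularize subtag case by position and length (sketch)."""
    result = []
    after_singleton = False
    for i, subtag in enumerate(tag.split("-")):
        if i > 0 and not after_singleton and len(subtag) == 2:
            # Two-letter subtag in the body of the tag: uppercase.
            result.append(subtag.upper())
        elif i > 0 and not after_singleton and len(subtag) == 4:
            # Four-letter subtag in the body of the tag: titlecase.
            result.append(subtag.capitalize())
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            # Everything after the first singleton stays lowercase.
            after_singleton = True
    return "-".join(result)
```

For instance, regularize_case("EN-ca-X-CA") gives "en-CA-x-ca". Because all comparisons are case-insensitive, this changes only presentation and never the meaning of a tag.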
If more than one variant appears within a tag, processors MAY reorder
the variants to obtain better matching behavior or more consistent
presentation. Reordering of the variants SHOULD follow the
recommendations for variant ordering in Section 4.1.
If the field 'Deprecated' appears in a registry record without an
accompanying 'Preferred-Value' field, then that tag or subtag is
deprecated without a replacement. These values are canonical when
they appear in a language tag. However, tags that include these
values SHOULD NOT be selected by users or generated by
implementations.
An extension MUST define any relationships that exist between the
various subtags in the extension and thus MAY define an alternate
canonicalization scheme for the extension's subtags. Extensions MAY
define how the order of the extension's subtags is interpreted. For
example, an extension could define that its subtags are in canonical
order when the subtags are placed into ASCII order: that is, "en-a-
aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa". Another extension might
define that the order of the subtags influences their semantic
meaning (so that "en-b-ccc-bbb-aaa" has a different value from "en-b-
aaa-bbb-ccc"). However, extension specifications SHOULD be designed
so that they are tolerant of the typical processes described in
Section 3.7.
--
Addison Phillips
Globalization Architect -- Lab126
Internationalization is not a feature.
It is an architecture.