idnits 2.17.1
draft-ietf-ltru-matching-05.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5 on line 955.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 932.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 939.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 945.
** This document has an original RFC 3978 Section 5.4 Copyright Line,
instead of the newer IETF Trust Copyright according to RFC 4748.
** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
of the newer disclaimer which includes the IETF Trust according to RFC
4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
-- The draft header indicates that this document obsoletes RFC3066, but the
abstract doesn't seem to mention this, which it should.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (October 7, 2005) is 6776 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
== Unused Reference: 'ID.ietf-ltru-initial' is defined on line 795, but no
explicit reference was found in the text
== Unused Reference: 'RFC1327' is defined on line 800, but no explicit
reference was found in the text
== Unused Reference: 'RFC1521' is defined on line 803, but no explicit
reference was found in the text
== Unused Reference: 'RFC2028' is defined on line 808, but no explicit
reference was found in the text
== Unused Reference: 'RFC2231' is defined on line 815, but no explicit
reference was found in the text
== Unused Reference: 'RFC2396' is defined on line 824, but no explicit
reference was found in the text
== Unused Reference: 'RFC2434' is defined on line 828, but no explicit
reference was found in the text
== Unused Reference: 'RFC2860' is defined on line 836, but no explicit
reference was found in the text
== Unused Reference: 'RFC3629' is defined on line 840, but no explicit
reference was found in the text
== Unused Reference: 'ISO15924' is defined on line 851, but no explicit
reference was found in the text
== Unused Reference: 'ISO3166-1' is defined on line 855, but no explicit
reference was found in the text
== Unused Reference: 'ISO639-1' is defined on line 860, but no explicit
reference was found in the text
== Unused Reference: 'ISO639-2' is defined on line 864, but no explicit
reference was found in the text
== Unused Reference: 'RFC3339' is defined on line 877, but no explicit
reference was found in the text
-- Possible downref: Non-RFC (?) normative reference: ref.
'ID.ietf-ltru-initial'
** Obsolete normative reference: RFC 1327 (Obsoleted by RFC 2156)
** Obsolete normative reference: RFC 1521 (Obsoleted by RFC 2045, RFC 2046,
RFC 2047, RFC 2048, RFC 2049)
** Obsolete normative reference: RFC 2028 (Obsoleted by RFC 9281)
** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)
** Obsolete normative reference: RFC 2434 (Obsoleted by RFC 5226)
** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
RFC 7232, RFC 7233, RFC 7234, RFC 7235)
** Downref: Normative reference to an Informational RFC: RFC 2860
-- Obsolete informational reference (is this intentional?): RFC 1766
(Obsoleted by RFC 3066, RFC 3282)
-- Obsolete informational reference (is this intentional?): RFC 3066
(Obsoleted by RFC 4646, RFC 4647)
Summary: 10 errors (**), 0 flaws (~~), 17 warnings (==), 11 comments
(--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group A. Phillips, Ed.
3 Internet-Draft Quest Software
4 Obsoletes: 3066 (if approved) M. Davis, Ed.
5 Expires: April 10, 2006 IBM
6 October 7, 2005
8 Matching Tags for the Identification of Languages
9 draft-ietf-ltru-matching-05
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on April 10, 2006.
36 Copyright Notice
38 Copyright (C) The Internet Society (2005).
40 Abstract
42 This document describes different mechanisms for comparing, matching,
43 and evaluating language tags. Possible algorithms for language
44 negotiation and content selection are described.
46 Table of Contents
48 1. Introduction
49 2. The Language Range
50 2.1. Lists of Language Ranges
51 2.2. Basic Language Range
52 2.2.1. Matching
53 2.2.2. Lookup
54 2.3. Extended Language Range
55 2.3.1. Extended Range Matching
56 2.3.2. Extended Range Lookup
57 2.3.3. Distance Metric Scheme
58 2.4. Meaning of Language Tags and Ranges
59 2.5. Choosing Between Alternate Matching Schemes
60 2.6. Considerations for Private Use Subtags
61 2.7. Length Considerations in Matching
62 3. IANA Considerations
63 4. Changes
64 5. Security Considerations
65 6. Character Set Considerations
66 7. References
67 7.1. Normative References
68 7.2. Informative References
69 Appendix A. Acknowledgements
70 Authors' Addresses
71 Intellectual Property and Copyright Statements
73 1. Introduction
75 Human beings on our planet have, past and present, used a number of
76 languages. There are many reasons why one would want to identify the
77 language used when presenting or requesting information.
79 Information about a user's language preferences commonly needs to be
80 identified so that appropriate processing can be applied. For
81 example, the user's language preferences in a browser can be used to
82 select web pages appropriately. A choice of language preference can
83 also be used to select among tools (such as dictionaries) to assist
84 in the processing or understanding of content in different languages.
86 Given a set of language identifiers, such as those defined in [draft-
87 registry], various mechanisms can be envisioned for performing
88 language negotiation and tag matching. The suitability of a
89 particular mechanism to a particular application depends on the needs
90 of that application.
92 This document defines several mechanisms for matching and filtering
93 natural language content identified using Language Tags [draft-
94 registry]. It also defines the syntax (called a "language range")
95 associated with each of these mechanisms for specifying user language
96 preferences.
98 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
100 document are to be interpreted as described in [RFC2119].
102 2. The Language Range
104 Language Tags [draft-registry] are used to identify the language of
105 some information item or content. Applications that use language
106 tags are often faced with the problem of identifying sets of content
107 that share certain language attributes. For example, HTTP 1.1
108 [RFC2616] describes language ranges in its discussion of the Accept-
109 Language header (Section 14.4), which is used for selecting content
110 from servers based on the language of that content.
112 When selecting content according to its language, it is useful to
113 have a mechanism for identifying sets of language tags that share
114 specific attributes. This allows users to select or filter content
115 based on specific requirements. Such an identifier is called a
116 "Language Range".
118 2.1. Lists of Language Ranges
120 When users specify a language preference they often need to specify a
121 prioritized list of language ranges in order to best reflect their
122 language requirements for the matching operation. This is especially
123 true for speakers of minority languages. A speaker of Breton in
124 France, for example, may specify "br" followed by "fr", meaning that
125 if Breton is available, it is preferred, but otherwise French is the
126 best alternative. It can get more complex: a speaker may wish to
127 fall back from Skolt Sami to Northern Sami to Finnish.
129 A "Language Priority List" consists of a prioritized or weighted list
130 of language ranges. One well known example of such a list is the
131 "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
132 14.4) and RFC 3282 [RFC3282]. The various matching operations
133 described in this document include considerations for using a
134 language priority list.
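As an illustration, a language priority list can be derived from an
"Accept-Language" style value by ordering its ranges by weight. The
Python sketch below is not taken from any specification text: the
function name parse_priority_list is hypothetical, only the "q"
parameter is interpreted, malformed weights are ignored, and ranges
with a weight of zero are dropped on the assumption that they mean
"not acceptable".

```python
def parse_priority_list(header_value):
    """Return the language ranges of a weighted list, best first.

    Ranges without a "q" parameter get weight 1.0. Ranges with equal
    weight keep their original order.
    """
    entries = []
    for position, item in enumerate(header_value.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue  # skip empty list elements
        weight = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    weight = float(value)
                except ValueError:
                    pass  # malformed weight: keep the default
        if weight > 0:
            # Negative weight so that sorting puts the best range first.
            entries.append((-weight, position, language_range))
    return [language_range for _, _, language_range in sorted(entries)]


print(parse_priority_list("da, en-gb;q=0.8, en;q=0.7"))
```

The resulting list can be handed to any of the matching operations
described below, which consider the ranges in that order.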
136 2.2. Basic Language Range
138 A "Basic Language Range" identifies the set of content whose language
139 tags begin with the same sequence of subtags. A basic language range
140 is identified by its 'language-range' tag, by adapting the
141 ABNF[RFC2234bis] from HTTP/1.1 [RFC2616] :
143 language-range = language-tag / "*"
144 language-tag = 1*8alphanum *("-" 1*8alphanum)
145 alphanum = ALPHA / DIGIT
147 That is, a language-range has the same syntax as a language-tag or is
148 the single character "*". Basic Language Ranges imply that there is
149 a semantic relationship between language tags that share the same
150 prefix. While this is often the case, it is not always true and
151 users should note that the set of language tags that match a specific
152 language-range may not be mutually intelligible.
154 Basic language ranges were originally described in [RFC3066] and HTTP
155 1.1 [RFC2616] (where they are referred to as simply a "language
156 range").
158 Users SHOULD avoid subtags that add no distinguishing value to a
159 language range. For example, script subtags SHOULD NOT be used to
160 form a language range with language subtags which have a matching
161 Suppress-Script field in their registry record. Thus the language
162 range "en-Latn" is probably inappropriate for most applications
163 (because the vast majority of English documents are written in the Latin
164 script and thus the 'en' language subtag has a Suppress-Script field
165 for 'Latn' in the registry).
167 Language tags and thus language ranges are to be treated as case
168 insensitive: there exist conventions for the capitalization of some
169 of the subtags, but these MUST NOT be taken to carry meaning.
170 Matching of language tags to language ranges MUST be done in a case
171 insensitive manner.
173 When working with tags and ranges, note that extensions and most
174 private use subtags are generally orthogonal to language tag fallback
175 and users SHOULD avoid using these subtags in language ranges, since
176 they will often interfere with the selection of available language
177 content. Since these subtags are always at the end of the sequence
178 of subtags, they don't normally interfere with the use of prefixes
179 for matching in the schemes described below.
181 There are two matching schemes that are commonly associated with
182 basic language ranges: matching and lookup.
184 Note that neither matching nor lookup using basic language ranges
185 attempts to process the semantics of the tags or ranges in any way.
186 The language tag and language range are compared in a case
187 insensitive manner using basic string processing. The choice of
188 subtags in both the language tag and language range may therefore
189 affect the results produced.
191 2.2.1. Matching
193 Language tag matching is used to select all content that matches a
194 given prefix. In matching, the language range represents the least
195 specific tag which is an acceptable match and every piece of content
196 that matches is returned. If the language priority list contains
197 more than one range, the matches returned are typically ordered in
198 descending level of preference.
200 For example, if an application is applying a style to all content in
201 a document in a particular language, it might use language tag
202 matching to select the content to which the style is applied.
204 A language-range matches a language-tag if it exactly equals the tag,
205 or if it exactly equals a prefix of the tag such that the first
206 character following the prefix is "-". (That is, the language-range
207 "de-de" matches the language tag "de-DE-1996", but not the language
208 tag "de-Deva".)
210 The special range "*" matches any tag. A protocol which uses
211 language ranges MAY specify additional rules about the semantics of
212 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
213 languages not matched by any other range within an "Accept-Language"
214 header.
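The prefix rule above can be sketched in a few lines of Python. This
is a minimal illustration rather than a reference implementation: the
names basic_match and filter_tags are hypothetical, "*" is treated as
matching every tag, and protocol-specific rules for "*" (such as the
HTTP/1.1 rule mentioned above) are not modelled.

```python
def basic_match(language_range, language_tag):
    """Return True if a basic language range matches a language tag."""
    if language_range == "*":
        return True
    rng = language_range.lower()   # comparison is case insensitive
    tag = language_tag.lower()
    # Exact match, or a prefix that is followed by "-" in the tag.
    return tag == rng or tag.startswith(rng + "-")


def filter_tags(priority_list, tags):
    """Return every matching tag, ordered by the priority of its range."""
    result = []
    for language_range in priority_list:
        for tag in tags:
            if basic_match(language_range, tag) and tag not in result:
                result.append(tag)
    return result


print(basic_match("de-de", "de-DE-1996"))   # True
print(basic_match("de-de", "de-Deva"))      # False
```

Appending "-" before the prefix test is what prevents "de-de" from
matching "de-Deva".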
216 2.2.2. Lookup
218 Content lookup is used to select the single information item that
219 best matches the language priority list for a given request. In
220 lookup, each language range in the language priority list represents
221 the most specific tag which is an acceptable match; only the closest
222 matching item according to the user's priority is returned.
224 For example, if an application inserts some dynamic content into a
225 document, returning an empty string if there is no exact match is not
226 an option. Instead, the application "falls back" until it finds a
227 suitable piece of content to insert.
229 When performing lookup, the language range is progressively truncated
230 from the end until a matching piece of content is located. For
231 example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
232 would progressively search for content as shown below:
234 Range to match: zh-Hant-CN-x-wadegile
235 1. zh-Hant-CN-x-wadegile
236 2. zh-Hant-CN
237 3. zh-Hant
238 4. zh
239 5. (default content or the empty tag)
241 Figure 2: Default Fallback Pattern Example
243 This scheme allows some flexibility in finding content. When data
244 is not available at a specific level of tag granularity, or is
245 sparsely populated, it also typically provides better results than
246 falling back directly to the default language for the system or content.
248 When performing lookup using a language priority list, the
249 progressive search MUST proceed to consider each language range
250 before finding the default content or empty tag. For example, a
251 lookup for the list "fr-FR; zh-Hant" would search for content as follows:
252 1. fr-FR
253 2. fr
254 3. zh-Hant // next language
255 4. zh
256 5. (default content or the empty tag)
258 Figure 3: Lookup Using a Language Priority List
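The progressive truncation shown in the two figures can be sketched
as follows. The function names are hypothetical, and one detail is an
assumption inferred from Figure 2 rather than stated in the text:
when truncation would leave a single-character subtag (such as "x")
at the end of the range, that subtag is removed as well, which is why
"zh-Hant-CN-x" never appears in the search order.

```python
def truncate(language_range):
    """Drop the last subtag, plus a dangling single-character subtag."""
    subtags = language_range.split("-")[:-1]
    if subtags and len(subtags[-1]) == 1:
        subtags.pop()   # assumption: never search for "zh-Hant-CN-x"
    return "-".join(subtags)


def fallback_chain(language_range):
    """List the ranges searched for one entry, most specific first."""
    chain = []
    current = language_range
    while current:
        chain.append(current)
        current = truncate(current)
    return chain


def lookup(priority_list, available, default=None):
    """Return the single best available tag, or the default content tag."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:      # every range is tried ...
        for candidate in fallback_chain(language_range):
            hit = by_lower.get(candidate.lower())
            if hit is not None:
                return hit
    return default                            # ... before the default


print(fallback_chain("zh-Hant-CN-x-wadegile"))
print(lookup(["fr-FR", "zh-Hant"], ["zh", "en"]))
```

Only exact (case-insensitive) equality is tested at each step; the
fallback comes entirely from shortening the range.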
260 2.3. Extended Language Range
262 Prefix matching using a Basic Language Range, as described above, is
263 not always the most appropriate way to access the information
264 contained in language tags when selecting or filtering content. Some
265 applications might wish to define a more granular matching scheme and
266 such a matching scheme requires the ability to specify the various
267 attributes of a language tag in the language range. An extended
268 language range can be represented by the following ABNF:
269 extended-language-range = range ; a range
270 / privateuse ; private use tag
271 / grandfathered ; grandfathered registrations
273 range = (language
274 ["-" script]
275 ["-" region]
276 *("-" variant)
277 *("-" extension)
278 ["-" privateuse])
280 language = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
281 / 4ALPHA ; reserved for future use
282 / 5*8ALPHA ; registered language subtag
283 / "*" ; ... or wildcard
285 extlang = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
286 ; reserved for future use
287 ; wildcard can only appear
288 ; at the end
290 script = 4ALPHA ; ISO 15924 code
291 / "*" ; or wildcard
293 region = 2ALPHA ; ISO 3166 code
294 / 3DIGIT ; UN M.49 code
295 / "*" ; ... or wildcard
297 variant = 5*8alphanum ; registered variants
298 / (DIGIT 3alphanum) ;
299 / "*" ; ... or wildcard
301 extension = singleton *("-" (2*8alphanum)) [ "-*" ]
302 ; extension subtags
303 ; wildcard can only appear
304 ; at the end
306 singleton = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
307 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
308 ; Single letters: x/X is reserved for private use
310 privateuse = ("x"/"X") 1*("-" (1*8alphanum))
312 grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
313 ; grandfathered registration
314 ; Note: i is the only singleton
315 ; that starts a grandfathered tag
317 alphanum = (ALPHA / DIGIT) ; letters and numbers
319 In an extended language range, the identifier takes the form of a
320 series of subtags which must consist of well-formed subtags or the
321 special subtag "*". For example, the language range "en-*-US"
322 specifies a primary language of 'en', followed by any script subtag,
323 followed by the region subtag 'US'.
325 A field not present in the middle of an extended language range MAY
326 be treated as if the field contained a "*". For example, the range
327 "en-US" MAY be considered to be equivalent to the range "en-*-US".
328 This also means that multiple wildcards can be collapsed (so that
329 "en-*-*-US" is equivalent to "en-*-US").
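Collapsing adjacent wildcards is a purely textual step. A minimal
Python sketch, using the hypothetical name collapse_wildcards:

```python
def collapse_wildcards(extended_range):
    """Replace each run of adjacent "*" subtags with a single "*"."""
    collapsed = []
    for subtag in extended_range.split("-"):
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue   # drop a wildcard that directly follows another
        collapsed.append(subtag)
    return "-".join(collapsed)


print(collapse_wildcards("en-*-*-US"))   # en-*-US
```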
331 When working with tags and ranges users SHOULD note the following:
333 1. Private-use and Extension subtags are normally orthogonal to
334 language tag fallback. Implementations SHOULD ignore
335 unrecognized private-use and extension subtags when performing
336 language tag fallback. Since these subtags are always at the end
337 of the sequence of subtags, they don't normally interfere with
338 the use of prefixes for matching in the schemes described below.
340 2. Implementations that choose not to interpret one or more private-
341 use or extension subtags SHOULD NOT remove or modify these
342 extensions in content that they are processing. When a language
343 tag instance is to be used in a specific, known protocol, and is
344 not being passed through to other protocols, language tags MAY be
345 filtered to remove subtags and extensions that are not supported
346 by that protocol. Such filtering SHOULD be avoided, if possible,
347 since it removes information that might be relevant if services
348 on the other end of the protocol would make use of that
349 information.
351 3. Some applications of language tags might want or need to consider
352 extensions and private-use subtags when matching tags. If
353 extensions and private-use subtags are included in a matching or
354 filtering process that utilizes one of the schemes described
355 in this document, then the implementation SHOULD canonicalize the
356 language tags and/or ranges before performing the matching. Note
357 that language tag processors that claim to be "well-formed"
358 processors as defined in [draft-registry] generally fall into
359 this category.
361 There are several matching algorithms or schemes which can be applied
362 when matching extended language ranges to language tags.
364 2.3.1. Extended Range Matching
366 In extended range matching, each extended language range in the
367 language priority list is considered in turn, according to priority.
368 The subtags in each extended language range are compared to the
369 corresponding subtags in the language tag being examined. The subtag
370 from the range is considered to match if it exactly matches the
371 corresponding subtag in the tag or the range's subtag has the value
372 "*" (which matches all subtags, including the empty subtag).
373 Extended Range Matching is an extension of basic matching
374 (Section 2.2.1): the language range represents the least specific tag
375 which is an acceptable match.
377 Private use subtags MAY be specified in the language range and MUST
378 NOT be ignored when matching.
380 Subtags not specified, including those at the end of the language
381 range, are assigned the value "*". This makes each range into a
382 prefix much like that used in basic language range matching. For
383 example, the extended language range "zh-*-CN" matches all of the
384 following tags because the unspecified variant field is expanded to
385 "*":
387 zh-Hant-CN
388 zh-CN
390 zh-Hans-CN
392 zh-CN-x-wadegile
394 zh-Latn-CN-boont
396 zh-cmn-Hans-CN-x-wadegile
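One possible way to realize extended range matching is to sort the
subtags of both the range and the tag into fields and then compare
only the fields that the range actually specifies. The Python sketch
below makes several assumptions that go beyond the text: subtags are
classified by their shape alone (no registry is consulted),
grandfathered tags are not handled, extensions are skipped, and a
private use sequence in the range must be a prefix of the one in the
tag. The names parse_fields and extended_match are hypothetical.

```python
def parse_fields(value):
    """Classify the subtags of a tag or extended range by their shape."""
    subtags = value.lower().split("-")
    fields = {"language": subtags[0], "extlang": [], "script": "*",
              "region": "*", "variants": [], "privateuse": []}
    i = 1
    while i < len(subtags):
        sub = subtags[i]
        i += 1
        if sub == "*":
            continue                    # explicit wildcard: no constraint
        if sub == "x":                  # the rest is private use
            fields["privateuse"] = [s for s in subtags[i:] if s != "*"]
            break
        if len(sub) == 1:               # extension: skipped in this sketch
            while i < len(subtags) and len(subtags[i]) > 1:
                i += 1
        elif len(sub) == 3 and sub.isalpha():
            fields["extlang"].append(sub)
        elif len(sub) == 4 and sub.isalpha():
            fields["script"] = sub
        elif len(sub) == 2 or (len(sub) == 3 and sub.isdigit()):
            fields["region"] = sub
        else:
            fields["variants"].append(sub)
    return fields


def extended_match(extended_range, tag):
    """True if every field specified in the range is present in the tag."""
    r, t = parse_fields(extended_range), parse_fields(tag)
    for name in ("language", "script", "region"):
        if r[name] != "*" and r[name] != t[name]:
            return False
    for name in ("extlang", "variants"):
        if any(sub not in t[name] for sub in r[name]):
            return False
    wanted = r["privateuse"]            # private use is never ignored
    return t["privateuse"][:len(wanted)] == wanted


for tag in ("zh-Hant-CN", "zh-CN", "zh-cmn-Hans-CN-x-wadegile", "zh-Hant-TW"):
    print(tag, extended_match("zh-*-CN", tag))
```

Because unspecified fields never constrain the result, the range
behaves as a "least specific acceptable tag", in the same way as a
prefix does in basic matching.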
398 2.3.2. Extended Range Lookup
400 In extended range lookup, each extended language range in the
401 language priority list is considered in turn. The subtags in each
402 extended language range are compared to the corresponding subtags in
403 the language tag being examined. A subtag is considered to match if
404 it exactly matches the corresponding subtag in the tag or the range's
405 subtag has the value "*" (which matches all subtags, including the
406 empty subtag). Extended language range lookup is an extension of
407 basic lookup (Section 2.2.2): each language range represents the most
408 specific tag which will form an acceptable match. If no match is
409 found, the default content or content with the empty language tag is
410 usually returned (or the search can be considered to have failed).
412 Subtags not specified are assigned the value "*" prior to performing
413 tag matching. Unlike in extended range matching, however, fields at
414 the end of the range MUST NOT be expanded in this manner. For
415 example, "en-US" MUST NOT be considered to be the same as the range
416 "en-US-*". This allows ranges to be specific. The "*" wildcard MUST
417 be used at the end of the range to indicate that all tags with the
418 range as a prefix are allowable matches. That is, the range "zh-*"
419 matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
420 matches neither of those tags.
422 The wildcard "*" at the end of a range SHOULD be considered to match
423 any private use subtag sequences, making extended language range
424 lookup function exactly like extended range matching (Section 2.3.1).
426 By default all extensions and their subtags SHOULD be ignored for
427 extended language range lookup. Private use subtags MAY be specified
428 in the language range and MUST NOT be ignored when performing lookup.
429 The wildcard "*" at the end of a range SHOULD be considered to match
430 any private use subtag sequences in addition to variants.
432 For example, the range "*-US" matches all of the following tags:
434 en-US
435 en-Latn-US
437 en-US-r-extends (extensions are ignored)
439 fr-US
441 For example, the range "en-*-US" matches _none_ of the following
442 tags:
444 fr-US
446 en (missing region US)
448 en-Latn (missing region US)
450 en-Latn-US-scouse (variant field is present)
452 For example, the range "en-*" matches all of the following tags:
454 en-Latn
456 en-Latn-US
458 en-Latn-US-scouse
460 en-US
462 en-scouse
464 Note that the ability to be specific in extended range lookup can
465 make this matching scheme a more appropriate replacement for basic
466 matching than the extended range matching scheme.
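The difference between internal and trailing fields in extended range
lookup can be made concrete with a field-based sketch in Python. As
before, this is an illustration under stated assumptions, not a
normative algorithm: subtags are classified by shape only,
grandfathered tags are not handled, extensions are ignored, and the
names parse_fields, ORDER, is_set, and extended_lookup_match are
hypothetical.

```python
ORDER = ("language", "extlang", "script", "region", "variants", "privateuse")


def parse_fields(value):
    """Classify the subtags of a tag or extended range by their shape."""
    subtags = value.lower().split("-")
    fields = {"language": subtags[0], "extlang": [], "script": "*",
              "region": "*", "variants": [], "privateuse": []}
    i = 1
    while i < len(subtags):
        sub = subtags[i]
        i += 1
        if sub == "*":
            continue
        if sub == "x":                  # the rest is private use
            fields["privateuse"] = [s for s in subtags[i:] if s != "*"]
            break
        if len(sub) == 1:               # extensions are ignored
            while i < len(subtags) and len(subtags[i]) > 1:
                i += 1
        elif len(sub) == 3 and sub.isalpha():
            fields["extlang"].append(sub)
        elif len(sub) == 4 and sub.isalpha():
            fields["script"] = sub
        elif len(sub) == 2 or (len(sub) == 3 and sub.isdigit()):
            fields["region"] = sub
        else:
            fields["variants"].append(sub)
    return fields


def is_set(value):
    """A field counts as specified unless it is "*" or an empty list."""
    return value != "*" and value != []


def extended_lookup_match(extended_range, tag):
    r, t = parse_fields(extended_range), parse_fields(tag)
    open_ended = extended_range == "*" or extended_range.endswith("-*")
    specified = [pos for pos, name in enumerate(ORDER) if is_set(r[name])]
    last = specified[-1] if specified else -1
    for pos, name in enumerate(ORDER):
        if is_set(r[name]):
            if open_ended and pos == last and isinstance(r[name], list):
                if t[name][:len(r[name])] != r[name]:
                    return False        # trailing "*": prefix is enough
            elif r[name] != t[name]:
                return False
        elif pos > last and not open_ended and is_set(t[name]):
            return False                # trailing fields are not expanded
    return True


print(extended_lookup_match("en-*-US", "en-Latn-US-scouse"))   # False
print(extended_lookup_match("en-*", "en-Latn-US-scouse"))      # True
```

Fields that the range leaves unspecified before its last specified
field act as "*"; fields after it must be absent from the tag unless
the range ends in an explicit "*".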
468 2.3.3. Distance Metric Scheme
470 Both Basic and Extended Language Ranges produce simple boolean
471 matches. Some applications may benefit by providing an array of
472 results with different levels of matching, for example, sorting
473 results based on the overall "quality" of the match.
475 This type of matching is sometimes called a "distance metric". A
476 distance metric assigns a pair of language tags a numeric value
477 representing the 'distance' between the two. A distance of zero
478 means that they are identical, a small distance indicates that they
479 are very similar, and a large distance indicates that they are very
480 different. Using a distance metric, implementations can, for
481 example, allow users to select a threshold distance for a match to be
482 successful or a filter to be applied.
484 The first step in the process is to normalize the extended language
485 range and the language tags to be matched to it by canonicalizing
486 them, mapping grandfathered and obsolete tags into modern
487 equivalents.
489 The language range and the language tags are then transformed into
490 quintuples of elements of the form (language, script, country,
491 variant, extension). Any extended language subtags are considered
492 part of the language element; private use subtag sequences are
493 considered part of the language element if in the initial position in
494 the tag and part of the variant element if not. Language subtags
495 'und', 'mul', and the script subtag 'Zyyy' are converted to "*".
497 Missing components in the language-tag are set to "*"; thus a "*"
498 pattern becomes the quintuple ("*", "*", "*", "*", "*"). Missing
499 components in the extended language-range are handled similarly to
500 extended range lookup: missing internal subtags are expanded to "*".
501 Missing end subtags are expanded as the empty string. Thus a pattern
502 "en-US" becomes the quintuple ("en","*","US","","").
504 Here are some examples of language-tags and their quintuples:
506 en-US ("en","*","US","*","*")
508 sr-Latn ("sr","Latn","*","*","*")
510 zh-cmn-Hant ("zh-cmn","Hant","*","*","*")
512 x-foo ("x-foo","*","*","*","*")
514 en-x-foo ("en","*","*","x-foo","*")
516 i-default ("i-default","*","*","*","*")
518 sl-Latn-IT-rozaj ("sl","Latn","IT","rozaj","*")
520 zh-r-wadegile ("zh","*","*","*","r-wadegile") // hypothetical
522 Each language-range/language-tag pair being matched or filtered is
523 assigned a distance value, whereby small values indicate better
524 matches and large values indicate worse ones. The distance between
525 the pair is the sum of the distances for each of the corresponding
526 elements of the quintuple. If the elements are identical or one is
527 '*', then the distance value between them is zero. Otherwise, it is
528 given by the following table:
530 256 language mismatch
531 128 script mismatch
532 32 region mismatch
533 4 variant mismatch
534 1 extension mismatch
536 A value of 0 is a perfect match; 421 is no match at all. Different
537 threshold values might be appropriate for different applications and
538 implementations will probably allow users to choose the most
539 appropriate selection value, ranking the selections based on score.
541 Examples of various tags' distances from the range "en-US":
543 "fr" 256 (language mismatch, region match)
544 "en-GB" 32 (region mismatch)
545 "en-Latn-US" 0 (all fields match)
546 "en-Brai" 32 (region mismatch)
547 "en-US-x-foo" 4 (variant mismatch: range is the empty string)
548 "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
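The quintuple construction and the weighted sum can be sketched in
Python as follows. This is only an illustration under several
assumptions: subtags are classified by shape alone, the
canonicalization step is omitted, quintuples are held in lower case
so that comparison is case insensitive, and a range that ends in an
explicit "*" has its trailing elements set to "*" rather than the
empty string. The names WEIGHTS, quintuple, and distance are
hypothetical. The sketch applies the element rule literally, so a
field that is missing from the tag (and therefore "*") never adds to
the distance.

```python
WEIGHTS = (256, 128, 32, 4, 1)   # language, script, region, variant, extension


def quintuple(value, is_range=False):
    """Return (language, script, region, variant, extension), lower case."""
    subtags = value.lower().split("-")
    if len(subtags[0]) == 1 and subtags[0] != "*":
        language, rest = subtags, []        # e.g. "x-foo", "i-default"
    else:
        language, rest = subtags[:1], subtags[1:]
    script = region = None
    variant, extension = [], []
    i = 0
    while i < len(rest):
        sub = rest[i]
        if sub == "x":                      # private use joins the variant
            variant.extend(rest[i:])
            break
        if sub == "*":
            i += 1
        elif len(sub) == 1:                 # extension singleton and subtags
            j = i + 1
            while j < len(rest) and len(rest[j]) > 1:
                j += 1
            extension.extend(rest[i:j])
            i = j
        else:
            if len(sub) == 3 and sub.isalpha() and not (script or region or variant):
                language.append(sub)        # extended language subtag
            elif len(sub) == 4 and sub.isalpha():
                script = sub
            elif len(sub) == 2 or (len(sub) == 3 and sub.isdigit()):
                region = sub
            else:
                variant.append(sub)
            i += 1
    elements = ["-".join(language), script, region,
                "-".join(variant) or None, "-".join(extension) or None]
    if elements[0] in ("und", "mul"):
        elements[0] = "*"
    if elements[1] == "zyyy":
        elements[1] = "*"
    last = max(pos for pos, e in enumerate(elements) if e is not None)
    open_ended = not is_range or value.endswith("*")
    for pos, element in enumerate(elements):
        if element is None:                 # fill in the missing components
            elements[pos] = "*" if (pos < last or open_ended) else ""
    return tuple(elements)


def distance(language_range, tag):
    """Sum the weights of elements that differ and are not "*"."""
    r = quintuple(language_range, is_range=True)
    t = quintuple(tag)
    return sum(w for w, a, b in zip(WEIGHTS, r, t)
               if a != b and "*" not in (a, b))


print(quintuple("en-US", is_range=True))
print(distance("en-US", "en-US-x-foo"))
```

Sorting candidate tags by distance(range, tag) and discarding those
above a chosen threshold gives the ranked selection described above.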
550 Implementations may want to use more sophisticated weights that
551 depend on the values of the corresponding elements. For example,
552 depending on the domain, an implementation might give a small distance
553 to the difference between the language subtag 'no' and the closely
554 related language subtags 'nb' or 'nn'; or between the script subtags
555 'Kata' and 'Hira'; or between the region subtags 'US' and 'UM'.
557 2.4. Meaning of Language Tags and Ranges
559 A language tag defines a language as spoken (or written, signed or
560 otherwise signaled) by human beings for communication of information
561 to other human beings.
563 If a language tag B contains language tag A as a prefix, then B is
564 typically "narrower" or "more specific" than A. For example, "zh-
565 Hant-TW" is more specific than "zh-Hant".
567 This relationship is not guaranteed in all cases: specifically,
568 languages that begin with the same sequence of subtags are NOT
569 guaranteed to be mutually intelligible, although they might be.
571 For example, the tag "az" shares a prefix with both "az-Latn"
572 (Azerbaijani written using the Latin script) and "az-Cyrl"
573 (Azerbaijani written using the Cyrillic script). A person fluent in
574 one script might not be able to read the other, even though the text
575 might be otherwise identical. Content tagged as "az" most probably
576 is written in just one script and thus might not be intelligible to a
577 reader familiar with the other script.
579 Variant subtags in particular seem to represent specific divisions in
580 mutual understanding, since they often encode dialects or other
581 idiosyncratic variations within a language.
583 The relationship between the language tag and the information it
584 relates to is defined by the standard describing the context in which
585 it appears. Accordingly, this section can only give possible
586 examples of its usage.
588 o For a single information object, the associated language tags
589 might be interpreted as the set of languages that are necessary
590 for a complete comprehension of the complete object. Example:
591 Plain text documents.
593 o For an aggregation of information objects, the associated language
594 tags could be taken as the set of languages used inside components
595 of that aggregation. Examples: Document stores and libraries.
597 o For information objects whose purpose is to provide alternatives,
598 the associated language tags could be regarded as a hint that the
599 content is provided in several languages, and that one has to
600 inspect each of the alternatives in order to find its language or
601 languages. In this case, the presence of multiple tags might not
602 mean that one needs to be multi-lingual to get complete
603 understanding of the document. Example: MIME multipart/
604 alternative.
606 o In markup languages, such as HTML and XML, language information
607 can be added to each part of the document identified by the markup
608 structure (including the whole document itself). For example, one
609 could write <span lang="fr">C'est la vie.</span> inside a
610 Norwegian document; the Norwegian-speaking user could then access
611 a French-Norwegian dictionary to find out what the marked section
612 meant. If the user were listening to that document through a
613 speech synthesis interface, this information could be used to signal
614 the synthesizer to appropriately apply French text-to-speech
615 pronunciation rules to that span of text, instead of misapplying
616 the Norwegian rules.
618 2.5. Choosing Between Alternate Matching Schemes
620 Implementers are faced with the decision of what form of matching to
621 use in a specific application. An application can choose to
622 implement different styles of matching for different kinds of
623 processing.
625 The most basic choice is between schemes that produce an open-ended
626 set of content (a "matching" application) and those that usually
627 produce a single information item (a "lookup" application). Note
628 that lookup applications can produce multiple items, although they
629 usually produce only a single item for any given piece of content.
630 Lookup can also be used to order content: the later in the overall
631 fallback the content appears to match, the more distant the match.
633 Matching applications can produce an ordered or unordered set of
634 results. For example, applying formatting to a document based on the
635 language of specific pieces of content does not require the content
636 to be ordered. It is sufficient to know whether a specific piece of
637 content matches or does not match. A search application, on the
638 other hand, probably would put the results into a priority order.
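The difference between the two kinds of application can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a restatement of the matching schemes defined elsewhere in this document: filter_tags returns every available tag that the range is a prefix of, while lookup progressively shortens the range until one available tag matches. All function names and the default parameter are hypothetical, and comparison is assumed to be case-insensitive.

```python
from typing import List, Optional


def matches(range_: str, tag: str) -> bool:
    """Prefix match on subtag boundaries; "*" matches any tag."""
    r, t = range_.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def filter_tags(range_: str, tags: List[str]) -> List[str]:
    """'Matching' style: return an open-ended set of results."""
    return [t for t in tags if matches(range_, t)]


def lookup(range_: str, tags: List[str],
           default: Optional[str] = None) -> Optional[str]:
    """'Lookup' style: return a single best item, or `default`."""
    available = {t.lower(): t for t in tags}
    parts = range_.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in available:
            return available[candidate]
        parts.pop()  # fall back to the next, less specific range
    return default


if __name__ == "__main__":
    tags = ["zh-Hant-TW", "zh-Hant", "zh-Hans", "zh"]
    print(filter_tags("zh-Hant", tags))
    print(lookup("zh-Hant-TW", ["zh", "zh-Hant"]))
```

Neither function inspects the meaning of a subtag or consults the registry, which reflects the design goal stated later in this section.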
640 If a single item is to be chosen, it may sometimes be useful to apply
641 additional information, such as the most likely script used in the
642 language or region in question or the script used by other content
643 selected, in order to make a more "informed" choice.
645 The matching schemes in this document are designed so that
646 implementations do not have to examine the values of the subtags
647 supplied and, except for scored matching, they do not need access to
648 the Language Subtag Registry nor do they require the use of valid
649 subtags in language tags or ranges. This has great benefit for speed
650 and simplicity of implementation.
652 Implementations might also wish to use semantic information external
653 to the language tags when performing fallback. For example, the
654 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
655 Norwegian) might both be usefully matched to the more general subtag
656 'no' (Norwegian). Or an application might infer that content labeled
657 "zh-CN" is more likely to match the range "zh-Hans" than equivalent
658 content labeled "zh-TW".
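One way an application might add such external knowledge is to extend the ordinary fallback sequence with a small, application-supplied table. The Python sketch below is only an illustration: the EXTRA_FALLBACKS table and the fallback_chain function are hypothetical, and the table holds just the 'nn' and 'nb' example from the text. Tags are lowercased on the assumption that comparison is case-insensitive.

```python
from typing import Dict, List

# Hypothetical, application-supplied knowledge that is not derivable
# from the tags themselves: both Norwegian forms fall back to 'no'.
EXTRA_FALLBACKS: Dict[str, List[str]] = {
    "nn": ["no"],
    "nb": ["no"],
}


def fallback_chain(tag: str) -> List[str]:
    """Return the tag, its progressively shorter prefixes, and then any
    extra fallbacks registered for its primary language subtag."""
    parts = tag.lower().split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.extend(EXTRA_FALLBACKS.get(parts[0], []))
    return chain


if __name__ == "__main__":
    print(fallback_chain("nn-NO"))
```

Keeping the table separate from the matching code leaves the basic schemes independent of subtag values, with the semantic knowledge confined to data.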
660 2.6. Considerations for Private Use Subtags
662 Private-use subtags require private agreement between the parties
663 that intend to use or exchange language tags that use them, and great
664 caution SHOULD be used in employing them in content or protocols
665 intended for general use. Private-use subtags are simply useless for
666 information exchange without prior arrangement.
668 The value and semantic meaning of private-use tags and of the subtags
669 used within such a language tag are not defined. Matching private
670 use tags using language ranges or extended language ranges can result
671 in unpredictable content being returned.
673 2.7. Length Considerations in Matching
675 RFC 3066 [RFC3066] did not provide an upper limit on the size of
676 language tags or ranges. RFC 3066 did define the semantics of
677 particular subtags in such a way that most language tags or ranges
678 consisted of language and region subtags with a combined total length
679 of up to six characters. Larger tags and ranges (in terms of both
680 subtags and characters) did exist, however.
682 [draft-registry] also does not impose a fixed upper limit on the
683 number of subtags in a language tag or range (and thus an upper bound
684 on the size of either). The syntax in that document suggests that,
685 depending on the specific language or range of languages, more
686 subtags (and thus characters) are sometimes necessary as a result.
687 Length considerations and their impact on the selection and
688 processing of tags are described in Section 2.1.1 of that document.
690 A matching implementation MAY choose to limit the length of the
691 language tags or ranges used in matching. Any such limitation SHOULD
692 be clearly documented, and such documentation SHOULD include the
693 disposition of any longer tags or ranges (for example, whether an
694 error value is generated or the language tag or range is truncated).
695 If truncation is permitted, it MUST NOT permit a subtag to be divided,
696 since this changes the semantics of the subtag being matched and can
697 result in false positives or negatives.
699 Implementations that restrict storage SHOULD consider the impact of
700 tag or range truncation on the resulting matches. For example,
701 removing the "*" from the end of an extended language range (see
702 Section 2.3) can greatly modify the set of returned matches. A
703 protocol that allows tags or ranges to be truncated at an arbitrary
704 limit, without giving any indication of what that limit is, has the
705 potential for causing harm by changing the meaning of values in
706 substantial ways.
708 In practice, most tags do not require additional subtags or
709 substantially more characters. Additional subtags sometimes add
710 useful distinguishing information, but extraneous subtags interfere
711 with the meaning, understanding, and especially matching of language
712 tags. Since language tags or ranges MAY be truncated by an
713 application or protocol that limits storage, when choosing language
714 tags or ranges users and applications SHOULD avoid adding subtags
715 that add no distinguishing value. In particular, users and
716 implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
717 fields in the registry (defined in Section 3.6 of [draft-registry]):
718 these fields provide guidance on when specific additional subtags
719 SHOULD (and SHOULD NOT) be used.
721 Implementations MUST support a limit of at least 33 characters. This
722 limit includes at least one subtag of each non-extension, non-private
723 use type. When choosing a buffer limit, a length of at least 42
724 characters is strongly RECOMMENDED.
726 The practical limit on tags or ranges derived solely from registered
727 values is 42 characters. Implementations MUST be able to handle tags
728 and ranges of this length. Support for tags and ranges of at least
729 62 characters in length is RECOMMENDED. Implementations MAY support
730 longer values, including matching extensive sets of private use or
731 extension subtags.
733 Applications or protocols which have to truncate a tag MUST do so by
734 progressively removing subtags along with their preceding "-" from
735 the right side of the language tag until the tag is short enough for
736 the given buffer. If the resulting tag ends with a single-character
737 subtag, that subtag and its preceding "-" MUST also be removed. For
738 example:
740 Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
741 1. zh-Hant-CN-variant1-a-extend1-x-wadegile
742 2. zh-Hant-CN-variant1-a-extend1
743 3. zh-Hant-CN-variant1
744 4. zh-Hant-CN
745 5. zh-Hant
746 6. zh
748 Figure 7: Example of Tag Truncation
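The truncation rule illustrated in Figure 7 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names truncation_steps and truncate_tag and the max_len parameter are hypothetical. The sketch never removes the first subtag, so a tag whose first subtag alone exceeds the buffer is returned unchanged; how to treat that case is left to the application.

```python
from typing import List


def truncation_steps(tag: str) -> List[str]:
    """Return each successively shorter form of `tag`."""
    parts = tag.split("-")
    steps = []
    while len(parts) > 1:
        parts.pop()  # remove the rightmost subtag and its "-"
        # A trailing single-character subtag must be removed as well.
        if len(parts) > 1 and len(parts[-1]) == 1:
            parts.pop()
        steps.append("-".join(parts))
    return steps


def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten `tag` on subtag boundaries until it fits in `max_len`."""
    if len(tag) <= max_len:
        return tag
    for candidate in truncation_steps(tag):
        if len(candidate) <= max_len:
            return candidate
    # Nothing fits: return the first subtag alone (assumption; an
    # application might prefer to signal an error instead).
    return tag.split("-")[0]


if __name__ == "__main__":
    example = "zh-Hant-CN-variant1-a-extend1-x-wadegile-private1"
    for number, step in enumerate(truncation_steps(example), start=1):
        print(f"{number}. {step}")
```

Because whole subtags are removed, a subtag is never divided, and the singleton check prevents a result such as one ending in "-x".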
750 3. IANA Considerations
752 This document presents no new or existing considerations for IANA.
754 4. Changes
756 This is the first version of this document.
758 The following changes were put into this document since draft-03:
760 Modified the ABNF to match changes in [draft-registry]
761 (K.Karlsson)
763 Matched the references and reference formats to [draft-registry]
764 (K.Karlsson)
766 Various edits, additions, and emendations to deal with changes in
767 the Last Call of draft-registry as well as cleaning up the text.
769 5. Security Considerations
771 Language ranges used in content negotiation might be used to infer
772 the nationality of the sender, and thus identify potential targets
773 for surveillance. In addition, unique or highly unusual language
774 ranges or combinations of language ranges might be used to track
775 specific individuals' activities.
777 This is a special case of the general problem that anything you send
778 is visible to the receiving party. It is useful to be aware that
779 such concerns can exist in some cases.
781 The evaluation of the exact magnitude of the threat, and any possible
782 countermeasures, is left to each application protocol.
784 6. Character Set Considerations
786 Language tags permit only the characters A-Z, a-z, 0-9, and
787 HYPHEN-MINUS (%x2D); language ranges also use ASTERISK (%x2A). These
788 characters are present in most character sets, so presentation of
789 language tags should not present any character set issues.
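A character repertoire check of this kind is straightforward to sketch. The Python function below is illustrative only: its name and the allow_wildcard parameter are hypothetical. It tests the permitted characters and nothing else, so it does not establish that a value is a well-formed tag or range. The optional wildcard is included because language ranges elsewhere in this document use "*".

```python
import re

# Letters, digits, and HYPHEN-MINUS only.
_TAG_CHARS = re.compile(r"[A-Za-z0-9-]+")
# The same repertoire plus the "*" wildcard used in language ranges.
_RANGE_CHARS = re.compile(r"[A-Za-z0-9*-]+")


def uses_only_permitted_characters(value: str,
                                   allow_wildcard: bool = False) -> bool:
    """Return True if `value` is non-empty and contains only the
    permitted characters (character check only, not a syntax check)."""
    pattern = _RANGE_CHARS if allow_wildcard else _TAG_CHARS
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    print(uses_only_permitted_characters("zh-Hant-TW"))
    print(uses_only_permitted_characters("zh_Hant"))
```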
791 7. References
793 7.1. Normative References
795 [ID.ietf-ltru-initial]
796 Ewell, D., Ed., "Language Tags Initial Registry (work in
797 progress)", August 2005.
800 [RFC1327] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO
801 10021 and RFC 822", RFC 1327, May 1992.
803 [RFC1521] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet
804 Mail Extensions) Part One: Mechanisms for Specifying and
805 Describing the Format of Internet Message Bodies",
806 RFC 1521, September 1993.
808 [RFC2028] Hovey, R. and S. Bradner, "The Organizations Involved in
809 the IETF Standards Process", BCP 11, RFC 2028,
810 October 1996.
812 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
813 Requirement Levels", BCP 14, RFC 2119, March 1997.
815 [RFC2231] Freed, N. and K. Moore, "MIME Parameter Value and Encoded
816 Word Extensions: Character Sets, Languages, and
817 Continuations", RFC 2231, November 1997.
819 [RFC2234bis]
820 Crocker, D. and P. Overell, "Augmented BNF for Syntax
821 Specifications: ABNF", draft-crocker-abnf-rfc2234bis-00
822 (work in progress), March 2005.
824 [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
825 Resource Identifiers (URI): Generic Syntax", RFC 2396,
826 August 1998.
828 [RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an
829 IANA Considerations Section in RFCs", BCP 26, RFC 2434,
830 October 1998.
832 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
833 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
834 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
836 [RFC2860] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
837 Understanding Concerning the Technical Work of the
838 Internet Assigned Numbers Authority", RFC 2860, June 2000.
840 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
841 10646", STD 63, RFC 3629, November 2003.
843 [draft-registry]
844 Phillips, A., Ed. and M. Davis, Ed., "Tags for the
845 Identification of Languages (work in progress)",
846 August 2005.
849 7.2. Informative References
851 [ISO15924]
852 "ISO 15924:2004. Information and documentation -- Codes
853 for the representation of names of scripts", January 2004.
855 [ISO3166-1]
856 "ISO 3166-1:1997. Codes for the representation of names of
857 countries and their subdivisions -- Part 1: Country
858 codes", 1997.
860 [ISO639-1]
861 "ISO 639-1:2002. Codes for the representation of names of
862 languages -- Part 1: Alpha-2 code", 2002.
864 [ISO639-2]
865 "ISO 639-2:1998. Codes for the representation of names of
866 languages -- Part 2: Alpha-3 code, first edition", 1998.
868 [RFC1766] Alvestrand, H., "Tags for the Identification of
869 Languages", RFC 1766, March 1995.
871 [RFC3066] Alvestrand, H., "Tags for the Identification of
872 Languages", BCP 47, RFC 3066, January 2001.
874 [RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282,
875 May 2002.
877 [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet:
878 Timestamps", RFC 3339, July 2002.
880 [UN_M.49] Statistics Division, United Nations, "Standard Country or
881 Area Codes for Statistical Use", UN Standard Country or
882 Area Codes for Statistical Use, Revision 4 (United Nations
883 publication, Sales No. 98.XVII.9), June 1999.
885 Appendix A. Acknowledgements
887 Any list of contributors is bound to be incomplete; please regard the
888 following as only a selection from the group of people who have
889 contributed to make this document what it is today.
891 The contributors to [draft-registry], [RFC3066] and [RFC1766], each
892 of which is a precursor to this document, made enormous contributions
893 directly or indirectly to this document and are generally responsible
894 for the success of language tags.
896 The following people (in alphabetical order by family name)
897 contributed to this document:
899 Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Kent
900 Karlsson, Ira McDonald, M. Patton, Randy Presuhn and many, many
901 others.
903 Very special thanks must go to Harald Tveit Alvestrand, who
904 originated RFCs 1766 and 3066, and without whom this document would
905 not have been possible.
907 For this particular document, John Cowan originated the scheme
908 described in Section 2.3.3. Mark Davis originated the scheme
909 described in Section 2.2.2.
911 Authors' Addresses
913 Addison Phillips (editor)
914 Quest Software
916 Email: addison dot phillips at quest dot com
918 Mark Davis (editor)
919 IBM
921 Email: mark dot davis at ibm dot com
923 Intellectual Property Statement
925 The IETF takes no position regarding the validity or scope of any
926 Intellectual Property Rights or other rights that might be claimed to
927 pertain to the implementation or use of the technology described in
928 this document or the extent to which any license under such rights
929 might or might not be available; nor does it represent that it has
930 made any independent effort to identify any such rights. Information
931 on the procedures with respect to rights in RFC documents can be
932 found in BCP 78 and BCP 79.
934 Copies of IPR disclosures made to the IETF Secretariat and any
935 assurances of licenses to be made available, or the result of an
936 attempt made to obtain a general license or permission for the use of
937 such proprietary rights by implementers or users of this
938 specification can be obtained from the IETF on-line IPR repository at
939 http://www.ietf.org/ipr.
941 The IETF invites any interested party to bring to its attention any
942 copyrights, patents or patent applications, or other proprietary
943 rights that may cover technology that may be required to implement
944 this standard. Please address the information to the IETF at
945 ietf-ipr@ietf.org.
947 Disclaimer of Validity
949 This document and the information contained herein are provided on an
950 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
951 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
952 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
953 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
954 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
955 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
957 Copyright Statement
959 Copyright (C) The Internet Society (2005). This document is subject
960 to the rights, licenses and restrictions contained in BCP 78, and
961 except as set forth therein, the authors retain all their rights.
963 Acknowledgment
965 Funding for the RFC Editor function is currently provided by the
966 Internet Society.