idnits 2.17.1
draft-ietf-ltru-matching-09.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5 on line 1137.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1114.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1121.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1127.
** This document has an original RFC 3978 Section 5.4 Copyright Line,
instead of the newer IETF Trust Copyright according to RFC 4748.
** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
of the newer disclaimer which includes the IETF Trust according to RFC
4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (February 6, 2006) is 6653 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)
-- Obsolete informational reference (is this intentional?): RFC 1766
(Obsoleted by RFC 3066, RFC 3282)
-- Obsolete informational reference (is this intentional?): RFC 2616
(Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)
-- Obsolete informational reference (is this intentional?): RFC 3066
(Obsoleted by RFC 4646, RFC 4647)
Summary: 4 errors (**), 0 flaws (~~), 3 warnings (==), 10 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group A. Phillips, Ed.
3 Internet-Draft Yahoo! Inc
4 Obsoletes: 3066 (if approved) M. Davis, Ed.
5 Expires: August 10, 2006 Google
6 February 6, 2006
8 Matching of Language Tags
9 draft-ietf-ltru-matching-09
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on August 10, 2006.
36 Copyright Notice
38 Copyright (C) The Internet Society (2006).
40 Abstract
42 This document describes different mechanisms for comparing, matching,
43 and evaluating language tags. Possible algorithms for language
44 negotiation or content selection, filtering, and lookup are
45 described. This document, in combination with RFC 3066bis (replace
46 "3066bis" with the RFC number assigned to
47 draft-ietf-ltru-registry-14), replaces RFC 3066, which replaced RFC
48 1766.
50 Table of Contents
52 1. Introduction
53 2. The Language Range
54 2.1. Basic Language Range
55 2.2. Extended Language Range
56 2.3. The Language Priority List
57 3. Types of Matching
58 3.1. Choosing a Type of Matching
59 3.2. Filtering
60 3.2.1. Filtering with Basic Language Ranges
61 3.2.2. Filtering with Extended Language Ranges
62 3.2.3. Scored Filtering
63 3.3. Lookup
64 4. Other Considerations
65 4.1. Choosing Language Ranges
66 4.2. Meaning of Language Tags and Ranges
67 4.3. Considerations for Private Use Subtags
68 4.4. Length Considerations in Matching
69 5. IANA Considerations
70 6. Changes
71 7. Security Considerations
72 8. Character Set Considerations
73 9. References
74 9.1. Normative References
75 9.2. Informative References
76 Appendix A. Acknowledgements
77 Authors' Addresses
78 Intellectual Property and Copyright Statements
80 1. Introduction
82 Human beings on our planet have, past and present, used a number of
83 languages. There are many reasons why one would want to identify the
84 language used when presenting or requesting information.
86 Information about a user's language preferences commonly needs to be
87 identified so that appropriate processing can be applied. For
88 example, the user's language preferences in a browser can be used to
89 select web pages appropriately. Language preferences can also be
90 used to select among tools (such as dictionaries) to assist in the
91 processing or understanding of content in different languages.
93 Given a set of language identifiers, such as those defined in
94 [RFC3066bis], various mechanisms can be envisioned for performing
95 language negotiation and tag matching.
97 This document defines a syntax (called a language range (Section 2))
98 for specifying a user's language preferences, as well as several
99 schemes for selecting or filtering content by comparing language
100 ranges to the language tags [RFC3066bis] used to identify the natural
101 language of that content. Applications, protocols, or specifications
102 will have varying needs and requirements that affect the choice of a
103 suitable matching scheme. Depending on the choice of scheme, there
104 are various options left to the implementation. Protocols that
105 implement a matching scheme either need to specify each particular
106 choice or indicate the options that are left to the implementation to
107 decide.
109 This document is divided into three main sections. One describes how
110 to indicate a user's preferences using language ranges. Then a
111 section describes various schemes for matching these ranges to a set
112 of language tags in order to select specific content. There is also
113 a section that deals with various practical considerations that apply
114 to implementing and using these schemes.
116 This document, in combination with [RFC3066bis] (Ed.: replace
117 "3066bis" globally in this document with the RFC number assigned to
118 draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
119 [RFC1766].
121 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
122 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
123 document are to be interpreted as described in [RFC2119].
125 2. The Language Range
127 Language Tags [RFC3066bis] are used to identify the language of some
128 information item or content. Applications or protocols that use
129 language tags are often faced with the problem of identifying sets of
130 content that share certain language attributes. For example,
131 HTTP/1.1 [RFC2616] describes one such mechanism in its discussion of
132 the Accept-Language header (Section 14.4), which is used when
133 selecting content from servers based on the language of that content.
135 When selecting content according to its language, it is useful to
136 have a mechanism for identifying sets of language tags that share
137 specific attributes. This allows users to select or filter content
138 based on specific requirements. Such an identifier is called a
139 "Language Range".
141 Language ranges are similar in structure and content to language
142 tags: they consist of alphanumeric "subtags" separated by hyphens,
143 plus a special subtag consisting of the character "*" (%2A,
144 ASTERISK), which is used in ranges as a "wildcard", that is, a value
145 that matches any subtag.
147 Language tags and thus language ranges are to be treated as case-
148 insensitive: there exist conventions for the capitalization of some
149 of the subtags, but these MUST NOT be taken to carry meaning.
150 Matching of language tags to language ranges MUST be done in a case-
151 insensitive manner as well.
153 2.1. Basic Language Range
155 A "basic language range" identifies the set of content whose language
156 tags begin with the same sequence of subtags. Each range consists of
157 a sequence of alphanumeric subtags separated by hyphens. The basic
158 language range is defined by the following ABNF[RFC4234]:
160 language-range = language-tag / "*"
161 language-tag = 1*8alphanum *("-" 1*8alphanum)
162 alphanum = ALPHA / DIGIT
164 Basic language ranges (originally described by HTTP/1.1 [RFC2616] and
165 later [RFC3066]) have the same syntax as an [RFC3066] language tag or
166 are the single character "*". They differ from the language tags
167 defined in [RFC3066bis] only in that there is no requirement that
168 they be "well-formed" or be validated against the IANA Language
169 Subtag Registry (although such ill-formed ranges will probably not
170 match anything).
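
As an illustration only, the shape of a basic language range can be
checked with a regular expression. This is a minimal sketch that reads
the ABNF above as "one to eight alphanumerics per subtag, separated by
hyphens, or the single character '*'". The function name
is_basic_language_range is hypothetical and is not defined by this
document.

```python
import re

# One to eight ASCII letters or digits per subtag, hyphen-separated,
# or the lone wildcard "*". No registry validation is attempted.
_BASIC_RANGE = re.compile(r"\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_basic_language_range(text: str) -> bool:
    """Return True if text has the shape of a basic language range."""
    return _BASIC_RANGE.fullmatch(text) is not None
```

A range that passes this check is not necessarily well-formed or
registered; as noted above, such a range will probably just fail to
match anything.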
172 Use of a basic language range seems to imply that there is a semantic
173 relationship between language tags that share the same prefix. While
174 this is often the case, it is not always true and users should note
175 that the set of language tags that match a specific language-range
176 may not be mutually intelligible.
178 2.2. Extended Language Range
180 A Basic Language Range does not always provide the most appropriate
181 way to specify a user's preferences. Sometimes it is beneficial to
182 use a more fine-grained matching scheme that takes advantage of the
183 internal structure of language tags. This allows the user to
184 specify, for example, the value of a specific field in a language tag
185 or to indicate which values are of interest in filtering or selecting
186 the content.
188 In an extended language range, the identifier takes the form of a
189 series of subtags which MUST consist of well-formed subtags or the
190 special subtag "*". For example, the language range "en-*-US"
191 specifies a primary language of 'en', followed by any script subtag,
192 followed by the region subtag 'US'.
194 An extended language range can be represented by the following ABNF:
196 extended-language-range = range ; a range
197 / privateuse ; a private-use range
198 / grandfathered ; a grandfathered registration
200 range = (language
201 ["-" script]
202 ["-" region]
203 *("-" variant)
204 *("-" extension)
205 ["-" privateuse])
207 language = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
208 / 4ALPHA ; reserved for future use
209 / 5*8ALPHA ; registered language subtag
210 / "*" ; or wildcard
212 extlang = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
213 ; reserved for future use
214 ; wildcard can only appear
215 ; at the end
217 script = 4ALPHA ; ISO 15924 code
218 / "*" ; or wildcard
220 region = 2ALPHA ; ISO 3166 code
221 / 3DIGIT ; UN M.49 code
222 / "*" ; or wildcard
224 variant = 5*8alphanum ; registered variants
225 / (DIGIT 3alphanum) ;
226 / "*" ; or wildcard
228 extension = singleton *("-" (2*8alphanum)) [ "-*" ]
229 ; extension subtags
230 ; wildcard can only appear
231 ; at the end
233 singleton = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
234 ; single letters (except for "x") or digits
236 privateuse = "x" 1*("-" (1*8alphanum))
238 grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
239 ; grandfathered registration
240 ; Note: I is the only singleton
241 ; that starts a grandfathered tag
243 alphanum = (ALPHA / DIGIT) ; letters and numbers
244 A field not present in the middle of an extended language range is
245 treated as if the field contained a "*". Implementations that
246 normalize extended language ranges SHOULD expand missing fields to be
247 "*" so that the semantic meaning of the language range is clear to
248 the user. At the same time, multiple wildcards in a row are
249 redundant and implementations SHOULD collapse these to a single
250 wildcard when normalizing the range (for brevity). For example, both
251 the range "sl-nedis" and the range "sl-*-*-nedis" are equivalent to
252 and should be normalized as "sl-*-nedis".
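
The normalization described above might be sketched as follows. This
is a simplified illustration, not a complete implementation: it
guesses the field of each subtag from its shape alone (the helper
_rank and the function normalize_extended_range are hypothetical
names), it treats extended language subtags as part of the language,
and it copies extension and private-use sequences through unchanged
without expanding the fields before them.

```python
import re


def _rank(subtag: str, last_rank: int) -> int:
    """Guess a subtag's field from its shape (0=language .. 3=variant)."""
    if last_rank == 0 and re.fullmatch(r"[A-Za-z]{3}", subtag):
        return 0  # extended language subtag, counted with the language
    if re.fullmatch(r"[A-Za-z]{4}", subtag):
        return 1  # script
    if re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag):
        return 2  # region
    return 3      # variant


def normalize_extended_range(range_: str) -> str:
    """Expand missing middle fields to "*" and collapse repeated wildcards."""
    subtags = range_.split("-")
    if len(subtags[0]) == 1 and subtags[0] != "*":
        return range_  # private-use or grandfathered range: leave as is
    out = [subtags[0]]
    last_rank = 0
    for i, subtag in enumerate(subtags[1:], start=1):
        if len(subtag) == 1 and subtag != "*":
            # Simplification: extension and private-use sequences are
            # copied through verbatim.
            out.extend(subtags[i:])
            break
        if subtag == "*":
            if out[-1] != "*":      # collapse runs of wildcards
                out.append("*")
            continue
        rank = _rank(subtag, last_rank)
        if rank > last_rank + 1 and out[-1] != "*":
            out.append("*")         # at least one field was skipped
        out.append(subtag)
        last_rank = rank
    return "-".join(out)
```

With this sketch, both "sl-nedis" and "sl-*-*-nedis" normalize to
"sl-*-nedis", while "de-Latn-DE" is returned unchanged.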
254 2.3. The Language Priority List
256 When users specify a language preference they often need to specify a
257 prioritized list of language ranges in order to best reflect their
258 language preferences. This is especially true for speakers of
259 minority languages. A speaker of Breton in France, for example, may
260 specify "br" followed by "fr", meaning that if Breton is available,
261 it is preferred, but otherwise French is the best alternative. It
262 can get more complex: a speaker may wish to fall back from Skolt Sami
263 to Northern Sami to Finnish.
265 A "Language Priority List" is a prioritized or weighted list of
266 language ranges. One well known example of such a list is the
267 "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
268 14.4) and RFC 3282 [RFC3282]. A simple list of ranges, i.e. one that
269 contains no weighting information, is considered to be in descending
270 order of priority.
272 The various matching operations described in this document include
273 considerations for using a language priority list. This document
274 does not define any syntax for a language priority list; defining
275 such a syntax is the responsibility of the protocol, application, or
276 implementation that uses it. When given as examples in this
277 document, language priority lists will be shown as a quoted sequence
278 of ranges separated by semi-colons, like this: "en; fr; zh-Hant"
279 (which would be read as "English before French before Chinese as
280 written in the Traditional script").
282 3. Types of Matching
284 Matching language ranges to language tags can be done in a number of
285 different ways. This section describes several different matching
286 schemes, as well as the considerations for choosing between them.
287 Protocols and specifications SHOULD clearly indicate the particular
288 mechanism used in selecting or matching language tags.
290 There are two basic types of matching scheme: those that produce zero
291 or more information items (called "filtering") and those that produce
292 a single information item for a given request (called "lookup").
294 A key difference between these two types of matching scheme is that,
295 for filtering operations, the language ranges in the language priority
296 list represent the _least_ specific content one will accept as a match,
297 while for lookup operations they represent the _most_ specific content.
299 3.1. Choosing a Type of Matching
301 Applications, protocols, and specifications are faced with the
302 decision of what type of matching to use. Sometimes, different
303 styles of matching might be suited for different kinds of processing
304 within a particular application or protocol.
306 Language tag matching is a tool, and does not by itself specify a
307 complete procedure for the use of language tags. Such procedures are
308 intimately tied to the application protocol in which they occur.
309 When specifying a protocol operation using matching, the protocol
310 MUST specify:
312 o Which type(s) of language tag matching it uses
314 o Whether the operation returns a single result (lookup) or a
315 possibly empty set of results (filtering)
317 o For lookup, what the result is when no matching tag is found. For
318 instance, a protocol might result in failure of the operation, an
319 empty value, returning some protocol defined or implementation
320 defined default, or returning i-default [RFC2277].
322 Filtering can be used to produce a set of results (such as a
323 collection of documents). For example, if using a search engine, one
324 might use filtering to limit the results to documents written in
325 French. It can also be used when deciding whether to perform a
326 language-sensitive process on some content. For example, a process
327 might cause paragraphs whose language tag matched the language range
328 "nl" to be displayed in italics within a document.
330 This document describes four types of matching (three types of
331 filtering, plus the lookup scheme):
333 1. Basic Filtering (Section 3.2.1) is used to match content using
334 basic language ranges (Section 2.1).
336 2. Extended Range Filtering (Section 3.2.2) is used to match content
337 using extended language ranges (Section 2.2).
339 3. Scored Filtering (Section 3.2.3) produces an ordered set of
340 content using extended language ranges. It SHOULD be used when
341 the quality of the match within a specific language range is
342 important, as when presenting a list of documents resulting from
343 a search.
345 4. Lookup (Section 3.3) is used when each request needs to produce
346 _exactly_ one piece of content. For example, if a process were
347 to insert a human readable error message into a protocol header,
348 it might select the text based on the user's language preference.
349 Since it can return only one item, it must choose a single item
350 and it must return some item, even if no content matches the
351 language priority list supplied by the user.
353 Most types of matching in this document are designed so that
354 implementations are not required to validate or understand any of the
355 semantics of the subtags supplied and, except for scored filtering,
356 they do not need access to the IANA Language Subtag Registry (see
357 Section 3 in [RFC3066bis]). This simplifies and speeds the
358 performance of implementations.
360 Regardless of the matching scheme chosen, protocols and
361 implementations MAY canonicalize language tags and ranges by mapping
362 grandfathered and obsolete tags or subtags into modern equivalents.
363 If an implementation canonicalizes either ranges or tags, then the
364 implementation will require the IANA Language Subtag Registry
365 information for that purpose. Implementations MAY also use semantic
366 information external to the registry when matching tags. For
367 example, the primary language subtags 'nn' (Nynorsk Norwegian) and
368 'nb' (Bokmal Norwegian) might both be usefully matched to the more
369 general subtag 'no' (Norwegian). Or an implementation might infer
370 that content labeled "zh-CN" is more likely to match the range "zh-
371 Hans" than equivalent content labeled "zh-TW".
373 3.2. Filtering
375 Filtering is used to select the set of content that matches a given
376 language priority list. It is called "filtering" because this set of
377 content may contain no items at all or it may return an arbitrarily
378 large number of matching items: as many items as match the language
379 priority list, thus "filtering out" the non-matching items.
381 In filtering, the language range represents the _least_ specific
382 (that is, the fewest number of subtags) language tag which is an
383 acceptable match. That is, all of the language tags in the set of
384 filtered content will have an equal or greater number of subtags than
385 the language range. For example, if the language priority list
386 consists of the range "de-CH", one might see matching content with
387 the tag "de-CH-1996" but one will never see a match with the tag
388 "de".
390 If the language priority list (see Section 2.3) contains more than
391 one range, the content returned is typically ordered in descending
392 level of preference.
394 Some examples where filtering might be appropriate include:
396 o Applying a style to sections of a document in a particular set of
397 languages.
399 o Displaying the set of documents containing a particular set of
400 keywords written in a specific set of languages.
402 o Selecting all email items written in a specific set of languages.
404 Filtering can produce either an ordered or an unordered set of
405 results. For example, applying formatting to a document based on the
406 language of specific pieces of content does not require the content
407 to be ordered. It is sufficient to know whether a specific piece of
408 content is selected by the language priority list (or not). A search
409 application, on the other hand, probably would want to order the
410 results.
412 If an ordered set is desired, as described above, then the
413 application or protocol needs to determine the relative "quality" of
414 the match between different language tags and the language range.
416 This measurement is called a "distance metric". A distance metric
417 assigns a numeric value to the comparison of a language tag to a
418 language range that represents the 'distance' between the two. A
419 distance of zero means that they are identical, a small distance
420 indicates that they are very similar, and a large distance indicates
421 that they are very different. Using a distance metric,
422 implementations can, for example, allow users to select a threshold
423 distance for a match to be "successful" while filtering, or they
424 might use the numeric values to order the results.
426 3.2.1. Filtering with Basic Language Ranges
428 When filtering using basic language ranges, each basic language range
429 in the language priority list is considered in turn, according to
430 priority. A particular language tag matches a language range if it
431 exactly equals the tag, or if it exactly equals a prefix of the tag
432 such that the first character following the prefix is "-". (That is,
433 the language-range "de-de" matches the language tag "de-DE-1996", but
434 not the language tag "de-Deva".)
436 The special range "*" in a language priority list matches any tag. A
437 protocol which uses language ranges MAY specify additional rules
438 about the semantics of "*"; for instance, HTTP/1.1 [RFC2616]
439 specifies that the range "*" matches only languages not matched by
440 any other range within an "Accept-Language" header.
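
The prefix rule above is small enough to show directly. The following
is a minimal sketch; the names basic_match and basic_filter are
hypothetical, and the sketch gives "*" its plain "matches any tag"
meaning rather than the narrower HTTP/1.1 semantics.

```python
def basic_match(range_: str, tag: str) -> bool:
    """Case-insensitive match of a basic language range against a tag."""
    r, t = range_.lower(), tag.lower()
    # The range must equal the tag, or be a prefix followed by "-".
    return r == "*" or t == r or t.startswith(r + "-")


def basic_filter(priority_list, tags):
    """Return matching tags, ordered by the first range that matches them."""
    result = []
    for range_ in priority_list:        # ranges in priority order
        for tag in tags:
            if tag not in result and basic_match(range_, tag):
                result.append(tag)
    return result
```

For example, basic_match("de-de", "de-DE-1996") is true, while
basic_match("de-de", "de-Deva") is false because the character after
the prefix "de-de" is not "-".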
442 3.2.2. Filtering with Extended Language Ranges
444 When filtering using extended language ranges, each extended language
445 range in the language priority list is considered in turn, according
446 to priority. The subtags in each extended language range are
447 compared to the corresponding subtags in the language tag being
448 examined. The subtag from the range is considered to match if it
449 exactly matches the corresponding subtag in the tag or the range's
450 subtag has the value "*" (which matches all subtags, including the
451 empty subtag).
453 Subtags not specified, including those at the end of the language
454 range, are assigned the wildcard value "*". This makes each range
455 into a prefix much like that used in basic language range matching.
456 For example, the extended language range "de-*-DE" matches all of the
457 following tags, in part because the unspecified variant, extension,
458 and private-use subtags are expanded to "*":
460 de-DE
462 de-Latn-DE
464 de-Latf-DE
466 de-DE-x-goethe
468 de-Latn-DE-1996
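
One possible way to realize this comparison without parsing tags into
named fields is sketched below. It is an illustration under stated
assumptions, not the procedure defined by this section: a "*" or an
unspecified field in the range is handled by skipping over subtags in
the tag, and the rule that skipping stops at a single-character
subtag (so that a range never matches into an extension or
private-use sequence by accident) is an assumption added here. The
name extended_match is hypothetical.

```python
def extended_match(range_: str, tag: str) -> bool:
    """Case-insensitive match of an extended language range against a tag."""
    r = range_.lower().split("-")
    t = tag.lower().split("-")
    # The first subtag must match exactly unless the range starts with "*".
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1                 # wildcard: matches any (or no) subtag
        elif ti >= len(t):
            return False            # range wants more than the tag has
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False            # assumption: never skip past a singleton
        else:
            ti += 1                 # treat the tag's subtag as an unspecified field
    return True                     # unspecified trailing subtags act as "*"
```

Under these assumptions the range "de-*-DE" matches each of the five
tags listed above, and does not match "de" or "de-Deva".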
470 3.2.3. Scored Filtering
472 Both basic and extended language range filtering produce simple
473 boolean matches between a language range and a language tag.
475 Sometimes it may be useful to provide an array of results with
476 different levels of matching, for example, sorting results based on
477 the overall "quality" of the match. Scored (or "distance metric")
478 filtering provides a way to generate these quality values.
480 As with the other forms of filtering, the process considers each
481 language range in the language priority list in order of priority.
483 Each extended language range and language tag MUST first be
484 canonicalized by mapping grandfathered and obsolete tags into modern
485 equivalents. This requires the information in the IANA Language
486 Subtag Registry (see Section 3 of [RFC3066bis]).
488 The language range and each language tag it is to be compared to are
489 then transformed into a "quintuple" consisting of five "elements" in
490 the form (language, script, region, variant, extension).
492 Any extended language subtags are considered part of the language
493 "element". For example, the language element for the tag "zh-cmn-
494 Hans" would be "zh-cmn".
496 Private-use subtag sequences are considered part of the language
497 "element" if in the initial position in the tag and part of the
498 variant "element" if not. The different handling of private-use
499 sequences prevents a range such as "x-twain" from matching all
500 possible tags, while a range such as "en-US-x-twain" would closely
501 match nearly all tags for English as used in the United States.
503 Language subtags 'und', 'mul', and the script subtag 'Zyyy' are
504 converted to "*": these subtag values represent undetermined,
505 multiple, or private-use values which are consistent with the use of
506 the wildcard.
508 For language tags that have no script subtag but whose language
509 subtag's record in the IANA Language Subtag Registry contains the
510 field "Suppress-Script", the script element in the quintuple MUST be
511 set to the script subtag in the Suppress-Script field. This is
512 necessary because [RFC3066bis] strongly recommends that users not use
513 such script subtags to form language tags and this document (see Section 4.1)
514 recommends that users not use them to form ranges. Languages which
515 have a "Suppress-Script" field in the registry are predominantly
516 written in that single script, making the subtag redundant in forming
517 a language tag or range. Thus if the script were not expanded in
518 this manner, a range such as "de-DE" would produce a more-distant
519 score for content that happened to be labeled "de-Latn-DE" than users
520 would expect that it should.
522 Any remaining missing components in the language tag are set to "*";
523 thus an empty language tag becomes the quintuple ("*", "*", "*", "*",
524 "*"). Missing components in the language range are handled similarly
525 to extended range lookup: missing internal subtags are expanded to
526 "*". Missing end subtags are expanded as the empty string. Thus a
527 pattern "en-US" becomes the quintuple ("en","*","US","","").
529 Here are some examples of language tags, showing their quintuples as
530 both language tags and language ranges:
532 en-US
533 Tag: (en, *, US, *, *)
534 Range: (en, *, US, "", "")
536 sr-Latn
537 Tag: (sr, Latn, *, *, *)
538 Range: (sr, Latn, "", "", "")
540 zh-cmn-Hant
541 Tag: (zh-cmn, Hant, *, *, *)
542 Range: (zh-cmn, Hant, "", "", "")
544 x-foo
545 Tag: (x-foo, *, *, *, *)
546 Range: (x-foo, "", "", "", "")
548 en-x-foo
549 Tag: (en, *, *, x-foo, *)
550 Range: (en, *, *, x-foo, "")
552 i-default
553 Tag: (i-default, *, *, *, *)
554 Range: (i-default, "", "", "", "")
556 sl-Latn-IT-rozaj
557 Tag: (sl, Latn, IT, rozaj, *)
558 Range: (sl, Latn, IT, rozaj, "")
560 zh-r-wadegile (hypothetical)
561 Tag: (zh, *, *, *, r-wadegile)
562 Range: (zh, *, *, *, r-wadegile)
564 Figure 3: Examples of Distance Metric Quintuples
566 Each pair of quintuples being compared is assigned a distance value,
567 in which small values indicate better matches and large values
568 indicate worse ones. The distance between the pair is the sum of the
569 distances for each of the corresponding elements of the quintuple.
570 If the elements are identical or one is '*', then the distance value
571 between them is zero. Otherwise, it is given by the following table:
572 256 language mismatch
573 128 script mismatch
574 32 region mismatch
575 4 variant mismatch
576 1 extension mismatch
578 A value of 0 is a perfect match; 421 is no match at all. Different
579 threshold values might be appropriate for different applications or
580 protocols. Implementations will usually allow users to choose the
581 most appropriate selection value, ranking the matched items based on
582 score.
584 Examples of various tags' distances from the range "en-US":
586 "fr-FR" 288 (language & region mismatch)
587 "fr" 256 (language mismatch, region match)
588 "en-GB" 32 (region mismatch)
589 "en-Latn-US" 0 (all fields match)
590 "en-Brai" 0 (script matches the range's "*"; missing region is "*")
591 "en-US-x-foo" 4 (variant mismatch: range is the empty string)
592 "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
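
The per-element sum can be sketched in a few lines. This illustration
assumes the quintuples have already been built as described above
(canonicalization and Suppress-Script expansion are not shown), and
the names WEIGHTS and distance are hypothetical.

```python
# Mismatch weights for (language, script, region, variant, extension),
# taken from the table above.
WEIGHTS = (256, 128, 32, 4, 1)


def distance(tag_q, range_q) -> int:
    """Sum the mismatch weights between a tag quintuple and a range quintuple."""
    total = 0
    for weight, t, r in zip(WEIGHTS, tag_q, range_q):
        # Identical elements, or a wildcard on either side, cost nothing.
        if t == "*" or r == "*" or t.lower() == r.lower():
            continue
        total += weight
    return total


# The range "en-US" as a quintuple: missing end subtags are empty strings.
EN_US_RANGE = ("en", "*", "US", "", "")
```

For instance, distance(("en", "*", "GB", "*", "*"), EN_US_RANGE)
yields 32, a region mismatch only, and a tag quintuple that differs
from the range in all five elements yields the maximum of 421.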
594 Where a language priority list follows the syntax of the "Accept-
595 Language" header defined in [RFC2616] (see Section 14.4) and
596 [RFC3282], language ranges without a Q value are given values equal
597 to the value of the previous language range in the list (processing
598 from first to last). If the first language range has no Q value, it
599 is given a value of 1.0. Language ranges with Q values of zero are
600 removed. For example, "fr, en;q=0.5, de, it" becomes
601 "fr;q=1.0,en;q=0.5,de;q=0.5,it;q=0.5". The distance values given
602 above are then divided by the Q values. For example, if the
603 language tag "fr-FR" has a distance of 384 from a language range with
604 a Q value of 0.8, then the resulting distance is 480 (384 div 0.8).
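
The Q value handling just described might look like the following
sketch. The names fill_q_values and weighted_distance are
hypothetical. One point the text leaves open is assumed here: a range
that inherits a Q value of zero from the previous range is removed
along with it.

```python
def fill_q_values(header: str):
    """Parse e.g. 'fr, en;q=0.5, de, it' into a list of (range, q) pairs."""
    result = []
    previous_q = 1.0                    # default for a first range with no Q
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        range_ = parts[0]
        if not range_:
            continue
        q = previous_q                  # inherit from the previous range
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        previous_q = q
        if q > 0:                       # ranges with a Q value of zero are removed
            result.append((range_, q))
    return result


def weighted_distance(dist: float, q: float) -> float:
    """Divide a distance by the Q value of the range it was measured against."""
    return dist / q
```

Applied to "fr, en;q=0.5, de, it", this sketch assigns 1.0 to "fr" and
0.5 to each of "en", "de", and "it", in line with the example above.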
606 Implementations or protocols MAY use different weighting systems than
607 the ones described above, as long as the weightings and weighting
608 mechanisms are clearly specified. Thus, for example, an
609 implementation or protocol could give all language tags with missing
610 Q values a value of 1.0, or give the distance value 1000 to a
611 language mismatch. They MAY also use more sophisticated weights that
612 depend on the values of the corresponding elements. For example, an
613 implementation might give a small distance to the difference between
614 closely related subtags. Some examples of closely related subtags might be:
616 Language:
617 no (Norwegian)
618 nb (Bokmal Norwegian)
619 nn (Nynorsk Norwegian)
621 Script:
622 Kata (katakana)
623 Hira (hiragana)
625 Region:
626 US (United States of America)
627 UM (United States Minor Outlying Islands)
629 Figure 6: Examples of Closely Related Subtags
631 3.3. Lookup
633 Lookup is used to select the single information item that best
634 matches the language priority list for a given request. When
635 performing lookup, each language range in the language priority list
636 is considered in turn, according to priority. By contrast with
637 filtering, each language range represents the _most_ specific tag
638 that is an acceptable match. The first information item found with
639 a matching tag, according to the user's priority, is considered the
640 closest match and is the item returned. For example, if the language
641 range is "de-CH", one might expect to receive an information item
642 with the tag "de" but never one with the tag "de-CH-1996". Usually
643 if no content matches the request, a "default" item is returned.
645 For example, if an application inserts some dynamic content into a
646 document, returning an empty string if there is no exact match is not
647 an option. Instead, the application "falls back" until it finds a
648 suitable piece of content to insert. Other examples of lookup might
649 include:
651 o Selection of a template containing the text for an automated email
652 response.
654 o Selection of an item containing some text for inclusion in a
655 particular Web page.
657 o Selection of a string of text for inclusion in an error log.
659 In the lookup scheme, the language range is progressively truncated
660 from the end until a matching piece of content is located. For
661 example, starting with the range "zh-Hant-CN-x-private", the lookup
662 progressively searches for content as shown below:
664 Range to match: zh-Hant-CN-x-private
665 1. zh-Hant-CN-x-private
666 2. zh-Hant-CN
667 3. zh-Hant
668 4. zh
669 5. (default content or the empty tag)
671 Figure 7: Example of a Lookup Fallback Pattern
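The fallback pattern in Figure 7 could be generated along the following lines. This is only a sketch: the function name fallback_sequence is hypothetical, and the rule that a trailing single-character subtag (such as "x") is dropped together with the subtag it introduces is inferred from the figure, which goes directly from "zh-Hant-CN-x-private" to "zh-Hant-CN".

```python
def fallback_sequence(language_range):
    """Yield the ranges tried during lookup, most specific first."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        # Remove the last subtag.
        subtags.pop()
        # A single-character subtag left at the end (for example "x")
        # only introduces what was just removed, so drop it as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
```

Once the sequence is exhausted, the caller would move on to the default content (step 5 in the figure).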
673 This scheme allows some flexibility in finding content. For example,
674 it provides better results for cases in which data is not available
675 that exactly matches the user request than if the default language
676 for the system or content were returned immediately. Often, not
677 every level of tag granularity is available, or language content
678 is sparsely populated, so "falling back" through the
679 subtag sequence provides more opportunity to find a match between
680 available content and the user's request.
682 The default content is implementation defined. It might be content
683 with no language tag; might have an empty value (the built-in
684 attribute xml:lang in [XML10] permits the empty value); might be a
685 particular language designated for that bit of content; or it might
686 be content that is labeled with the tag "i-default" (see [RFC2277]).
687 When performing lookup using a language priority list, the
688 progressive search MUST proceed to consider each language range in
689 the list before finding the default content or empty tag.
691 One common way for an application or implementation to provide for
692 default content is to allow a specific language range to be set as
693 the default for a specific type of request. This language range is
694 then treated as if it were appended to the end of the language
695 priority list as a whole, rather than after each item in the language
696 priority list.
698 For example, if a particular user's language priority list were
699 "fr-FR, zh-Hant" and the program doing the matching had a default
700 language range of "ja-JP", the program would search for content as
701 follows:
702 1. fr-FR
703 2. fr
704 3. zh-Hant // next language
705 4. zh
706 5. (search for the default content)
707 a. ja-JP
708 b. ja
709 c. (implementation defined default)
711 Figure 8: Lookup Using a Language Priority List
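The search order in Figure 8 might be implemented roughly as shown here. The names lookup, available, default_range, and fallback_item are illustrative assumptions, as is the use of a dictionary keyed by language tag. Wildcard ranges and Q values are deliberately left out of this sketch.

```python
def lookup(priority_list, available, default_range=None, fallback_item=None):
    """Return the first item whose tag matches a (truncated) range.

    priority_list: language ranges, highest priority first.
    available: mapping from language tag to content item.
    default_range: treated as if appended to the end of the list.
    fallback_item: implementation-defined default content.
    """
    def candidates(language_range):
        subtags = language_range.split("-")
        while subtags:
            yield "-".join(subtags)
            subtags.pop()
            # Drop a dangling single-character subtag such as "x".
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()

    # Tags are compared case-insensitively.
    index = {tag.lower(): item for tag, item in available.items()}
    ranges = list(priority_list)
    if default_range is not None:
        ranges.append(default_range)
    for language_range in ranges:
        for candidate in candidates(language_range):
            if candidate.lower() in index:
                return index[candidate.lower()]
    return fallback_item
```

Note that the default range is consulted only after every range in the priority list has been fully truncated, which mirrors steps 1 through 5 of the figure.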
712 Implementations SHOULD ignore extensions and unrecognized private-use
713 subtags when performing lookup, since these subtags are usually
714 orthogonal to the user's request.
716 The special language range "*" matches any language tag. In the
717 lookup scheme, this range does not convey enough information by
718 itself to determine which content is most appropriate, since it
719 matches everything. If the language range "*" is the only one in the
720 language priority list, it matches the default content. If the
721 language range "*" is followed by other language ranges, it should be
722 skipped.
724 In some cases, the language priority list might contain one or more
725 extended language ranges (as, for example, when the same language
726 priority list is used as input for both lookup and filtering
727 operations). Wildcard values in an extended language range normally
728 match any value that occurs in that position in a language tag.
729 Since only one item can be returned for any given lookup request,
730 wildcards in a language range have to be processed in a consistent
731 manner or the same request will produce widely varying results.
732 Implementations that accept extended language ranges MUST define
733 which content is returned when more than one item matches the
734 extended language range.
736 For example, an implementation could return the matching content that
737 is first in ASCII-order. For example, if the language range were
738 "*-CH" and the set of content included "de-CH", "fr-CH", and "it-CH",
739 then the content labeled "de-CH" would be returned.
741 Implementations MAY also map extended language ranges to basic
742 language ranges: if the first subtag is a "*" then the entire range
743 is treated as "*" (which matches the default content), otherwise each
744 wildcard subtag is removed. For example, if the language range were
745 "en-*-US", then the range would be mapped to "en-US".
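The optional mapping from extended to basic language ranges is simple enough to show directly. The function name to_basic_range is hypothetical; the behavior follows the two rules stated in the preceding paragraph.

```python
def to_basic_range(extended_range):
    """Map an extended language range to a basic language range."""
    subtags = extended_range.split("-")
    # A leading wildcard turns the whole range into "*".
    if subtags[0] == "*":
        return "*"
    # Otherwise every wildcard subtag is removed.
    return "-".join(s for s in subtags if s != "*")
```

Under this mapping, "en-*-US" becomes "en-US", while "*-CH" collapses to "*" and therefore matches the default content.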
747 Where a language priority list contains Q values as in the syntax of
748 the "Accept-Language" header defined in [RFC2616] (see Section 14.4)
749 and [RFC3282], language tags without a Q value are given values equal
750 to the value of the previous language tag (processing from first to
751 last). If the first language tag has no Q value, it is given a value
752 of 1.0. Then language tags with zero Q values are removed. For
753 example, "fr, en;q=0.5, de, it" becomes "fr;q=1.0, en;q=0.5,
754 de;q=0.5, it;q=0.5". The language priority list is then sorted from
755 highest priority to lowest, whereby any two language tags with the
756 same Q values remain in the same order as in the original
757 language priority list. This list is then traversed as described
758 above in doing lookup.
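The ordering step can be sketched with a stable sort. The function name ordered_ranges is an assumption for illustration, and the parsing is deliberately simplistic (comma-separated items, with "q" as the only parameter examined). Python's list sort is stable even when reverse=True is given, so ranges with equal Q values keep their original relative order.

```python
def ordered_ranges(header):
    """Return language ranges sorted from highest to lowest Q value."""
    entries = []
    previous_q = 1.0  # default for a first range without a Q value
    for item in header.split(","):
        language_range, _, params = item.strip().partition(";")
        q = previous_q  # inherit from the previous range
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q" and value:
                q = float(value)
        previous_q = q
        if q > 0:  # zero Q values are removed
            entries.append((language_range.strip(), q))
    # Stable sort: ties remain in their original order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return [language_range for language_range, _ in entries]
```

For instance, "en;q=0.5, fr;q=0.9, de" would be traversed in the order fr, de, en, because "de" inherits the Q value 0.9 from "fr".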
760 Implementations or protocols MAY use different lookup mechanisms
761 than the ones described above, as long as those mechanisms
762 are clearly specified.
764 4. Other Considerations
766 When working with language ranges and matching schemes, there are
767 some additional points that may influence the choice of either.
769 4.1. Choosing Language Ranges
771 Users indicate their language preferences via the choice of a
772 language range or the list of language ranges in a language priority
773 list. The type of matching affects what the best choice is for a
774 given user.
776 Most matching schemes make no attempt to process the semantic meaning
777 of the subtags. The language range (or its subtags) is usually
778 compared in a case-insensitive manner to each language tag being
779 matched, using basic string processing.
781 Users SHOULD avoid subtags that add no distinguishing value to a
782 language range. Generally, the fewer subtags that appear in the
783 language range, the more content the range will match.
785 Most notably, script subtags SHOULD NOT be used to form a language
786 range in combination with language subtags that have a matching
787 Suppress-Script field in their registry entry. Thus the language
788 range "en-Latn" is probably inappropriate in most cases (because the
789 vast majority of English documents are written in the Latin script
790 and thus the 'en' language subtag has a Suppress-Script field for
791 'Latn' in the registry).
793 When working with tags and ranges, note that extensions and most
794 private-use subtags are orthogonal to language tag matching, in that
795 they specify additional attributes of the text not related to the
796 goals of most matching schemes. Users SHOULD avoid using these
797 subtags in language ranges, since they interfere with the selection
798 of available content. When used in language tags (as opposed to
799 ranges), these subtags normally do not interfere with filtering
800 (Section 3), since they appear at the end of the tag and will match
801 all prefixes.
803 When working with language tags and language ranges note that:
805 o Private-use and Extension subtags are normally orthogonal to
806 language tag fallback. Implementations or specifications that use
807 a lookup (Section 3.3) matching scheme often ignore unrecognized
808 private-use and extension subtags when performing language tag
809 fallback. In addition, since these subtags are always at the end
810 of the sequence of subtags, their use in language tags normally
811 doesn't interfere with the use of ranges that omit them in the
812 filtering (Section 3.2) matching schemes described above.
813 However, they do interfere with filtering when used in language
814 ranges and SHOULD be avoided in ranges as a result.
816 o Applications, specifications, or protocols that choose not to
817 interpret one or more private-use or extension subtags SHOULD NOT
818 remove or modify these extensions in content that they are
819 processing. When a language tag instance is to be used in a
820 specific, known protocol, and is not being passed through to other
821 protocols, language tags MAY be filtered to remove subtags and
822 extensions that are not supported by that protocol. Such
823 filtering SHOULD be avoided, if possible, since it removes
824 information that might be relevant to services on the other end of
825 the protocol that would make use of that information.
827 o Some applications of language tags might want or need to consider
828 extensions and private-use subtags when matching tags. If
829 extensions and private-use subtags are included in a matching or
830 filtering process that utilizes one of the schemes described in
831 this document, then the implementation SHOULD canonicalize the
832 language tags and/or ranges before performing the matching. Note
833 that language tag processors that claim to be "well-formed"
834 processors as defined in [RFC3066bis] generally fall into this
835 category.
837 4.2. Meaning of Language Tags and Ranges
839 Selecting content using language ranges requires some understanding
840 by users of what they are selecting. A language tag or range
841 identifies a language as spoken (or written, signed or otherwise
842 signaled) by human beings for communication of information to other
843 human beings.
845 If a language tag B contains language tag A as a prefix, then B is
846 typically "narrower" or "more specific" than A. For example, "zh-
847 Hant-TW" is more specific than "zh-Hant".
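As a small illustration of the prefix relationship, the check below treats tag B as more specific than tag A when A is a proper prefix of B that ends on a subtag boundary. The function name is_more_specific is hypothetical, and the subtag-boundary and case-insensitivity requirements are assumptions of this sketch rather than statements from the text.

```python
def is_more_specific(tag_b, tag_a):
    """Return True if tag_a is a proper subtag-wise prefix of tag_b."""
    a = tag_a.lower()
    b = tag_b.lower()
    # Appending "-" ensures the prefix ends on a subtag boundary,
    # so "zh" is not treated as a prefix of "zha".
    return b.startswith(a + "-")
```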
849 This relationship is not guaranteed in all cases: specifically,
850 languages that begin with the same sequence of subtags are NOT
851 guaranteed to be mutually intelligible, although they might be.
853 For example, the tag "az" shares a prefix with both "az-Latn"
854 (Azerbaijani written using the Latin script) and "az-Arab"
855 (Azerbaijani written using the Arabic script). A person fluent in
856 one script might not be able to read the other, even though the text
857 might be otherwise identical. Content tagged as "az" most probably
858 is written in just one script and thus might not be intelligible to a
859 reader familiar with the other script.
861 Variant subtags in particular seem to represent specific divisions in
862 mutual understanding, since they often encode dialects or other
863 idiosyncratic variations within a language. They also seem to
864 represent relatively low divisions with a high chance of at least
865 limited understanding, although this depends on the specific variant
866 in question.
868 The relationship between the language tag and the information it
869 relates to is defined by the standard describing the context in which
870 it appears. Accordingly, this section can only give possible
871 examples of its usage:
873 o For a single information object, the associated language tags
874 might be interpreted as the set of languages that are necessary
875 for a complete comprehension of the complete object. Example:
876 Plain text documents.
878 o For an aggregation of information objects, the associated language
879 tags could be taken as the set of languages used inside components
880 of that aggregation. Examples: Document stores and libraries.
882 o For information objects whose purpose is to provide alternatives,
883 the associated language tags could be regarded as a hint that the
884 content is provided in several languages, and that one has to
885 inspect each of the alternatives in order to find its language or
886 languages. In this case, the presence of multiple tags might not
887 mean that one needs to be multi-lingual to get complete
888 understanding of the document. Example: MIME multipart/
889 alternative.
891 o In markup languages, such as HTML and XML, language information
892 can be added to each part of the document identified by the markup
893 structure (including the whole document itself). For example, one
894 could write <span lang="fr">C'est la vie.</span> inside a
895 Norwegian document; the Norwegian-speaking user could then access
896 a French-Norwegian dictionary to find out what the marked section
897 meant. If the user were listening to that document through a
898 speech synthesis interface, this information could be used to signal
899 the synthesizer to appropriately apply French text-to-speech
900 pronunciation rules to that span of text, instead of misapplying
901 the Norwegian rules.
903 4.3. Considerations for Private Use Subtags
905 Private-use subtags require private agreement between the parties
906 that intend to use or exchange language tags that use them and great
907 caution SHOULD be used in employing them in content or protocols
908 intended for general use. Private-use subtags are simply useless for
909 information exchange without prior arrangement.
911 The value and semantic meaning of private-use tags and of the subtags
912 used within such a language tag are not defined. Matching private-
913 use tags using language ranges or extended language ranges can result
914 in unpredictable content being returned.
916 4.4. Length Considerations in Matching
918 RFC 3066 [RFC3066] did not provide an upper limit on the size of
919 language tags or ranges. RFC 3066 did define the semantics of
920 particular subtags in such a way that most language tags or ranges
921 consisted of language and region subtags with a combined total length
922 of up to six characters. Larger tags and ranges (in terms of both
923 subtags and characters) did exist, however.
925 [RFC3066bis] also does not impose a fixed upper limit on the number
926 of subtags in a language tag or range (and thus an upper bound on the
927 size of either). The syntax in that document suggests that,
928 depending on the specific language or range of languages, more
929 subtags (and thus characters) are sometimes necessary as a result.
930 Length considerations and their impact on the selection and
931 processing of tags are described in Section 2.1.1 of that document.
933 An application or protocol MAY choose to limit the length of the
934 language tags or ranges used in matching. Any such limitation SHOULD
935 be clearly documented, and such documentation SHOULD include the
936 disposition of any longer tags or ranges (for example, whether an
937 error value is generated or the language tag or range is truncated).
938 If truncation is permitted it MUST NOT permit a subtag to be divided,
939 since this changes the semantics of the subtag being matched and can
940 result in false positives or negatives.
942 Applications or protocols that restrict storage SHOULD consider the
943 impact of tag or range truncation on the resulting matches. For
944 example, removing the "*" from the end of an extended language range
945 (see Section 2.2) can greatly modify the set of returned matches. A
946 protocol that allows tags or ranges to be truncated at an arbitrary
947 limit, without giving any indication of what that limit is, has the
948 potential for causing harm by changing the meaning of values in
949 substantial ways.
951 In practice, most tags do not require additional subtags or
952 substantially more characters. Additional subtags sometimes add
953 useful distinguishing information, but extraneous subtags interfere
954 with the meaning, understanding, and especially matching of language
955 tags. Since language tags or ranges MAY be truncated by an
956 application or protocol that limits storage, when choosing language
957 tags or ranges users and applications SHOULD avoid adding subtags
958 that add no distinguishing value. In particular, users and
959 implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
960 fields in the registry (defined in Section 3.6 of [RFC3066bis]):
961 these fields provide guidance on when specific additional subtags
962 SHOULD (and SHOULD NOT) be used.
964 Implementations MUST support a limit of at least 33 characters. This
965 limit includes at least one subtag of each non-extension, non-private
966 use type. When choosing a buffer limit, a length of at least 42
967 characters is strongly RECOMMENDED.
969 The practical limit on tags or ranges derived solely from registered
970 values is 42 characters. Implementations MUST be able to handle tags
971 and ranges of this length. Support for tags and ranges of at least
972 62 characters in length is RECOMMENDED. Implementations MAY support
973 longer values, including matching extensive sets of private-use or
974 extension subtags.
976 Applications or protocols which have to truncate a tag MUST do so by
977 progressively removing subtags along with their preceding "-" from
978 the right side of the language tag until the tag is short enough for
979 the given buffer. If the resulting tag ends with a single-character
980 subtag, that subtag and its preceding "-" MUST also be removed. For
981 example:
983 Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
984 1. zh-Latn-CN-variant1-a-extend1-x-wadegile
985 2. zh-Latn-CN-variant1-a-extend1
986 3. zh-Latn-CN-variant1
987 4. zh-Latn-CN
988 5. zh-Latn
989 6. zh
991 Figure 9: Example of Tag Truncation
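The truncation rule illustrated in Figure 9 might look like the following in code. The function name truncate_tag and the max_length parameter are illustrative assumptions. As a further assumption, the sketch never removes the first subtag, so a tag whose primary subtag alone exceeds the limit is returned unshortened beyond that point.

```python
def truncate_tag(tag, max_length):
    """Shorten a language tag to at most max_length characters.

    Whole subtags are removed from the right; a subtag is never
    divided. A single-character subtag left at the end is removed
    together with its preceding hyphen.
    """
    subtags = tag.split("-")
    while len("-".join(subtags)) > max_length and len(subtags) > 1:
        subtags.pop()
        # Remove a dangling singleton such as "a" or "x".
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    return "-".join(subtags)
```

Applied to the 49-character tag in the figure, a limit of 42 produces step 1, a limit of 35 produces step 2, and a limit of 20 produces step 3.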
993 5. IANA Considerations
995 This document presents no new or existing considerations for IANA.
997 6. Changes
999 This is the first version of this document.
1001 The following changes were put into this document since draft-07:
1003 Added a mention of "*" to the Character Set Considerations section
1004 (D.Ewell)
1006 7. Security Considerations
1008 Language ranges used in content negotiation might be used to infer
1009 the nationality of the sender, and thus identify potential targets
1010 for surveillance. In addition, unique or highly unusual language
1011 ranges or combinations of language ranges might be used to track a
1012 specific individual's activities.
1014 This is a special case of the general problem that anything you send
1015 is visible to the receiving party. It is useful to be aware that
1016 such concerns can exist in some cases.
1018 The evaluation of the exact magnitude of the threat, and any possible
1019 countermeasures, is left to each application or protocol.
1021 8. Character Set Considerations
1023 Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-
1024 MINUS (%x2D). Language ranges also use the character ASTERISK
1025 (%x2A). These characters are present in most character sets, so
1026 presentation or exchange of language tags or ranges should not be
1027 constrained by character set issues.
1029 9. References
1031 9.1. Normative References
1033 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
1034 Requirement Levels", BCP 14, RFC 2119, March 1997.
1036 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
1037 Languages", BCP 18, RFC 2277, January 1998.
1039 [RFC3066bis]
1040 Phillips, A., Ed. and M. Davis, Ed., "Tags for the
1041 Identification of Languages", October 2005, .
1045 [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
1046 Specifications: ABNF", RFC 4234, October 2005.
1048 9.2. Informative References
1050 [RFC1766] Alvestrand, H., "Tags for the Identification of
1051 Languages", RFC 1766, March 1995.
1053 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1054 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1055 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
1057 [RFC3066] Alvestrand, H., "Tags for the Identification of
1058 Languages", BCP 47, RFC 3066, January 2001.
1060 [RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282,
1061 May 2002.
1063 [XML10] Bray (et al), T., "Extensible Markup Language (XML) 1.0",
1064 February 2004.
1066 Appendix A. Acknowledgements
1068 Any list of contributors is bound to be incomplete; please regard the
1069 following as only a selection from the group of people who have
1070 contributed to make this document what it is today.
1072 The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
1073 which is a precursor to this document, made enormous contributions
1074 directly or indirectly to this document and are generally responsible
1075 for the success of language tags.
1077 The following people (in alphabetical order by family name)
1078 contributed to this document:
1080 Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
1081 Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
1082 Patton, Randy Presuhn, Eric van der Poel, Markus Scherer, and many,
1083 many others.
1085 Very special thanks must go to Harald Tveit Alvestrand, who
1086 originated RFCs 1766 and 3066, and without whom this document would
1087 not have been possible.
1089 For this particular document, John Cowan originated the scheme
1090 described in Section 3.2.3. Mark Davis originated the scheme
1091 described in Section 3.3.
1093 Authors' Addresses
1095 Addison Phillips (editor)
1096 Yahoo! Inc
1098 Email: addison at inter dash locale dot com
1100 Mark Davis (editor)
1101 Google
1103 Email: mark dot davis at macchiato dot com
1105 Intellectual Property Statement
1107 The IETF takes no position regarding the validity or scope of any
1108 Intellectual Property Rights or other rights that might be claimed to
1109 pertain to the implementation or use of the technology described in
1110 this document or the extent to which any license under such rights
1111 might or might not be available; nor does it represent that it has
1112 made any independent effort to identify any such rights. Information
1113 on the procedures with respect to rights in RFC documents can be
1114 found in BCP 78 and BCP 79.
1116 Copies of IPR disclosures made to the IETF Secretariat and any
1117 assurances of licenses to be made available, or the result of an
1118 attempt made to obtain a general license or permission for the use of
1119 such proprietary rights by implementers or users of this
1120 specification can be obtained from the IETF on-line IPR repository at
1121 http://www.ietf.org/ipr.
1123 The IETF invites any interested party to bring to its attention any
1124 copyrights, patents or patent applications, or other proprietary
1125 rights that may cover technology that may be required to implement
1126 this standard. Please address the information to the IETF at
1127 ietf-ipr@ietf.org.
1129 Disclaimer of Validity
1131 This document and the information contained herein are provided on an
1132 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1133 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1134 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1135 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1136 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1137 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1139 Copyright Statement
1141 Copyright (C) The Internet Society (2006). This document is subject
1142 to the rights, licenses and restrictions contained in BCP 78, and
1143 except as set forth therein, the authors retain all their rights.
1145 Acknowledgment
1147 Funding for the RFC Editor function is currently provided by the
1148 Internet Society.