idnits 2.17.1
draft-ietf-ltru-matching-08.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5 on line 1098.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1075.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1082.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1088.
** This document has an original RFC 3978 Section 5.4 Copyright Line,
instead of the newer IETF Trust Copyright according to RFC 4748.
** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
of the newer disclaimer which includes the IETF Trust according to RFC
4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (December 7, 2005) is 6706 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)
-- Obsolete informational reference (is this intentional?): RFC 1766
(Obsoleted by RFC 3066, RFC 3282)
-- Obsolete informational reference (is this intentional?): RFC 2616
(Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)
-- Obsolete informational reference (is this intentional?): RFC 3066
(Obsoleted by RFC 4646, RFC 4647)
Summary: 4 errors (**), 0 flaws (~~), 3 warnings (==), 10 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group A. Phillips, Ed.
3 Internet-Draft Quest Software
4 Obsoletes: 3066 (if approved) M. Davis, Ed.
5 Expires: June 10, 2006 IBM
6 December 7, 2005
8 Matching of Language Tags
9 draft-ietf-ltru-matching-08
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on June 10, 2006.
36 Copyright Notice
38 Copyright (C) The Internet Society (2005).
40 Abstract
42 This document describes different mechanisms for comparing, matching,
43 and evaluating language tags. Possible algorithms for language
44 negotiation or content selection, filtering, and lookup are
45 described. This document, in combination with RFC 3066bis (replace
46 "3066bis" with the RFC number assigned to
47 draft-ietf-ltru-registry-14), replaces RFC 3066, which replaced RFC
48 1766.
50 Table of Contents
52 1. Introduction
53 2. The Language Range
54 2.1. Basic Language Range
55 2.2. Extended Language Range
56 2.3. The Language Priority List
57 3. Types of Matching
58 3.1. Choosing a Type of Matching
59 3.2. Filtering
60 3.2.1. Filtering with Basic Language Ranges
61 3.2.2. Filtering with Extended Language Ranges
62 3.2.3. Scored Filtering
63 3.3. Lookup
64 4. Other Considerations
65 4.1. Choosing Language Ranges
66 4.2. Meaning of Language Tags and Ranges
67 4.3. Considerations for Private Use Subtags
68 4.4. Length Considerations in Matching
69 5. IANA Considerations
70 6. Changes
71 7. Security Considerations
72 8. Character Set Considerations
73 9. References
74 9.1. Normative References
75 9.2. Informative References
76 Appendix A. Acknowledgements
77 Authors' Addresses
78 Intellectual Property and Copyright Statements
80 1. Introduction
82 Human beings on our planet have, past and present, used a number of
83 languages. There are many reasons why one would want to identify the
84 language used when presenting or requesting information.
86 Information about a user's language preferences commonly needs to be
87 identified so that appropriate processing can be applied. For
88 example, the user's language preferences in a browser can be used to
89 select web pages appropriately. Language preferences can also be
90 used to select among tools (such as dictionaries) to assist in the
91 processing or understanding of content in different languages.
93 Given a set of language identifiers, such as those defined in
94 [RFC3066bis], various mechanisms can be envisioned for performing
95 language negotiation and tag matching.
97 This document defines a syntax (called a language range (Section 2))
98 for specifying a user's language preferences, as well as several
99 schemes for selecting or filtering content by comparing language
100 ranges to the language tags [RFC3066bis] used to identify the natural
101 language of that content. Applications, protocols, or specifications
102 will have varying needs and requirements that affect the choice of a
103 suitable matching scheme. Depending on the choice of scheme, there
104 are various options left to the implementation. Protocols that
105 implement a matching scheme either need to choose a particular option
106 or indicate that the particular option is left to the specific
107 implementation to decide.
109 This document is divided into three main sections. One describes how
110 to indicate a user's preferences using language ranges. Then a
111 section describes various schemes for matching these ranges to a set
112 of language tags in order to select specific content. There is also
113 a section that deals with various practical considerations that apply
114 to implementing and using these schemes.
116 This document, in combination with [RFC3066bis] (Ed.: replace
117 "3066bis" globally in this document with the RFC number assigned to
118 draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
119 [RFC1766].
121 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
122 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
123 document are to be interpreted as described in [RFC2119].
125 2. The Language Range
127 Language Tags [RFC3066bis] are used to identify the language of some
128 information item or content. Applications or protocols that use
129 language tags are often faced with the problem of identifying sets of
130 content that share certain language attributes. For example,
131 HTTP/1.1 [RFC2616] describes one such mechanism in its discussion of
132 the Accept-Language header (Section 14.4), which is used when
133 selecting content from servers based on the language of that content.
135 When selecting content according to its language, it is useful to
136 have a mechanism for identifying sets of language tags that share
137 specific attributes. This allows users to select or filter content
138 based on specific requirements. Such an identifier is called a
139 "Language Range".
141 Language tags and thus language ranges are to be treated as case-
142 insensitive: there exist conventions for the capitalization of some
143 of the subtags, but these MUST NOT be taken to carry meaning.
144 Matching of language tags to language ranges MUST be done in a case-
145 insensitive manner as well.
147 2.1. Basic Language Range
149 A "basic language range" identifies the set of content whose language
150 tags begin with the same sequence of subtags. Each range consists of
151 a sequence of alphanumeric subtags separated by hyphens. The basic
152 language range is defined by the following ABNF [RFC4234]:
154 language-range = language-tag / "*"
155 language-tag = 1*8alphanum *("-" 1*8alphanum)
156 alphanum = ALPHA / DIGIT
158 Basic language ranges (originally described by HTTP/1.1 [RFC2616] and
159 later [RFC3066]) have the same syntax as an [RFC3066] language tag or
160 are the single character "*". They differ from the language tags
161 defined in [RFC3066bis] only in that there is no requirement that
162 they be "well-formed" or be validated against the IANA Language
163 Subtag Registry (although such ill-formed ranges will probably not
164 match anything).
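
As a rough illustration of the syntax check described above, the
following Python sketch tests whether a string has the shape of a
basic language range. It assumes the grammar is read as one to eight
alphanumeric characters per subtag, with subtags joined by hyphens,
or the single character "*". The name is_basic_language_range is
hypothetical and is not defined by this document.

```python
import re

# Assumed reading of the grammar: the lone wildcard "*", or hyphen-separated
# subtags of one to eight ASCII letters or digits each.
BASIC_RANGE = re.compile(r"\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_basic_language_range(text):
    """Return True if text is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None
```

The check is purely syntactic: a range such as "zz-zzzz" passes even
though it is unlikely to match any content.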
166 Use of a basic language range seems to imply that there is a semantic
167 relationship between language tags that share the same prefix. While
168 this is often the case, it is not always true and users should note
169 that the set of language tags that match a specific language-range
170 may not be mutually intelligible.
172 2.2. Extended Language Range
174 A Basic Language Range does not always provide the most appropriate
175 way to specify a user's preferences. Sometimes it is beneficial to
176 use a more fine-grained matching scheme that takes advantage of the
177 internal structure of language tags. This allows the user to
178 specify, for example, the value of a specific field in a language tag
179 or to indicate which values are of interest in filtering or selecting
180 the content.
182 In an extended language range, the identifier takes the form of a
183 series of subtags which MUST consist of well-formed subtags or the
184 special subtag "*". For example, the language range "en-*-US"
185 specifies a primary language of 'en', followed by any script subtag,
186 followed by the region subtag 'US'.
188 An extended language range can be represented by the following ABNF:
190 extended-language-range = range ; a range
191 / privateuse ; private-use tag
192 / grandfathered ; grandfathered registrations
194 range = (language
195 ["-" script]
196 ["-" region]
197 *("-" variant)
198 *("-" extension)
199 ["-" privateuse])
201 language = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
202 / 4ALPHA ; reserved for future use
203 / 5*8ALPHA ; registered language subtag
204 / "*" ; ... or wildcard
206 extlang = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
207 ; reserved for future use
208 ; wildcard can only appear
209 ; at the end
211 script = 4ALPHA ; ISO 15924 code
212 / "*" ; or wildcard
214 region = 2ALPHA ; ISO 3166 code
215 / 3DIGIT ; UN M.49 code
216 / "*" ; ... or wildcard
218 variant = 5*8alphanum ; registered variants
219 / (DIGIT 3alphanum) ;
220 / "*" ; ... or wildcard
222 extension = singleton *("-" (2*8alphanum)) [ "-*" ]
223 ; extension subtags
224 ; wildcard can only appear
225 ; at the end
227 singleton = "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
228 ; Single letters: x/X is reserved for private use
230 privateuse = ("x"/"X") 1*("-" (1*8alphanum))
232 grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
233 ; grandfathered registration
234 ; Note: I is the only singleton
235 ; that starts a grandfathered tag
237 alphanum = (ALPHA / DIGIT) ; letters and numbers
238 A field not present in the middle of an extended language range is
239 treated as if the field contained a "*". Implementations that
240 normalize extended language ranges SHOULD expand missing fields to be
241 "*" so that the semantic meaning of the language range is clear to
242 the user. At the same time, multiple wildcards in a row are
243 redundant and implementations SHOULD collapse these to a single
244 wildcard when normalizing the range (for brevity). For example, both
245 the range "sl-nedis" and the range "sl-*-*-nedis" are equivalent to
246 and should be normalized as "sl-*-nedis".
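
The wildcard-collapsing half of this normalization is simple enough
to sketch in Python. The function below (a hypothetical name, not
part of this document) only merges runs of "*"; it does not expand
missing fields, since that would require parsing the range into its
language, script, region, and variant fields.

```python
def collapse_wildcards(language_range):
    """Merge consecutive "*" subtags into a single "*"."""
    collapsed = []
    for subtag in language_range.split("-"):
        # Skip a wildcard that directly follows another wildcard.
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue
        collapsed.append(subtag)
    return "-".join(collapsed)
```

For instance, collapse_wildcards("sl-*-*-nedis") yields "sl-*-nedis".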
248 2.3. The Language Priority List
250 When users specify a language preference they often need to specify a
251 prioritized list of language ranges in order to best reflect their
252 language preferences. This is especially true for speakers of
253 minority languages. A speaker of Breton in France, for example, may
254 specify "br" followed by "fr", meaning that if Breton is available,
255 it is preferred, but otherwise French is the best alternative. It
256 can get more complex: a speaker may wish to fall back from Skolt Sami
257 to Northern Sami to Finnish.
259 A "Language Priority List" is a prioritized or weighted list of
260 language ranges. One well known example of such a list is the
261 "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
262 14.4) and RFC 3282 [RFC3282]. A simple list of ranges, i.e. one that
263 contains no weighting information, is considered to be in descending
264 order of priority.
266 The various matching operations described in this document include
267 considerations for using a language priority list. This document
268 does not define any syntax for a language priority list; defining
269 such a syntax is the responsibility of the protocol, application, or
270 implementation that uses it. When given as examples in this
271 document, language priority lists will be shown as a quoted sequence
272 of ranges separated by semi-colons, like this: "en; fr; zh-Hant"
273 (which would be read as "English before French before Chinese as
274 written in the Traditional script").
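
The example notation above can be read into an ordered list with a
few lines of Python. This is only a sketch for the semicolon-separated
form used in this document's examples; real protocols define their own
syntax, and the name parse_priority_list is hypothetical.

```python
def parse_priority_list(text):
    """Split "en; fr; zh-Hant" into ranges in descending priority."""
    # Empty entries (for example from a trailing semicolon) are dropped.
    return [item.strip() for item in text.split(";") if item.strip()]
```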
276 3. Types of Matching
278 Matching language ranges to language tags can be done in a number of
279 different ways. This section describes several different matching
280 schemes, as well as the considerations for choosing between them.
281 Protocols and specifications SHOULD clearly indicate the particular
282 mechanism used in selecting or matching language tags.
284 There are two basic types of matching scheme: those that produce zero
285 or more information items (called "filtering") and those that produce
286 a single information item for a given request (called "lookup").
288 A key difference between these two types of matching scheme is that
289 in filtering the ranges in the language priority list represent the
290 _least_ specific content one will accept as a match, while in lookup
291 the language ranges represent the _most_ specific content.
293 3.1. Choosing a Type of Matching
295 Applications, protocols, and specifications are faced with the
296 decision of what type of matching to use. Sometimes, different
297 styles of matching might be suited for different kinds of processing
298 within a particular application or protocol.
300 Language tag matching is a tool, and does not by itself specify a
301 complete procedure for the use of language tags. Such procedures are
302 intimately tied to the application protocol in which they occur.
303 When specifying a protocol operation using matching, the protocol
304 MUST specify:
306 o Which type(s) of language tag matching it uses
308 o Whether the operation returns a single result (lookup) or a
309 possibly empty set of results (filtering)
311 o For lookup, what the result is when no matching tag is found. For
312 instance, a protocol might result in failure of the operation, an
313 empty value, returning some protocol defined or implementation
314 defined default, or returning i-default [RFC2277].
316 Filtering can be used to produce a set of results (such as a
317 collection of documents). For example, if using a search engine, one
318 might use filtering to limit the results to documents written in
319 French. It can also be used when deciding whether to perform a
320 language-sensitive process on some content. For example, a process
321 might cause paragraphs whose language tag matched the language range
322 "nl" to be displayed in italics within a document.
324 This document describes four types of matching (three types of
325 filtering, plus the lookup scheme):
327 1. Basic Filtering (Section 3.2.1) is used to match content using
328 basic language ranges (Section 2.1).
330 2. Extended Range Filtering (Section 3.2.2) is used to match content
331 using extended language ranges (Section 2.2).
333 3. Scored Filtering (Section 3.2.3) produces an ordered set of
334 content using extended language ranges. It SHOULD be used when
335 the quality of the match within a specific language range is
336 important, as when presenting a list of documents resulting from
337 a search.
339 4. Lookup (Section 3.3) is used when each request needs to produce
340 _exactly_ one piece of content. For example, if a process were to
341 insert a human readable error message into a protocol header, it
342 might select the text based on the user's language preference.
343 Since it can return only one item, it must choose a single item
344 and it must return some item, even if no content matches the
345 language priority list supplied by the user.
347 Most types of matching in this document are designed so that
348 implementations are not required to validate or understand any of the
349 semantics of the subtags supplied and, except for scored filtering,
350 they do not need access to the IANA Language Subtag Registry (see
351 Section 3 in [RFC3066bis]). This simplifies and speeds the
352 performance of implementations.
354 If an implementation canonicalizes either ranges or tags, then the
355 implementation will require the IANA Language Subtag Registry
356 information for that purpose. Implementations MAY use semantic
357 information external to the registry when matching tags. For
358 example, the primary language subtags 'nn' (Nynorsk Norwegian) and
359 'nb' (Bokmal Norwegian) might both be usefully matched to the more
360 general subtag 'no' (Norwegian). Or an implementation might infer
361 that content labeled "zh-CN" is more likely to match the range "zh-
362 Hans" than equivalent content labeled "zh-TW".
364 3.2. Filtering
366 Filtering is used to select the set of content that matches a given
367 language priority list. It is called "filtering" because this set of
368 content may contain no items at all or it may return an arbitrarily
369 large number of matching items--as many as match the language range
370 used to specify the items, thus filtering out the non-matching
371 content.
373 In filtering, the language range represents the _least_ specific
374 (that is, the fewest number of subtags) language tag which is an
375 acceptable match. That is, all of the language tags in the set of
376 filtered content will have an equal or greater number of subtags than
377 the language range. For example, if the language priority list
378 consists of the range "de-CH", one might see matching content with
379 the tag "de-CH-1996" but one will never see a match with the tag
380 "de".
382 If the language priority list (see Section 2.3) contains more than
383 one range, the content returned is typically ordered in descending
384 level of preference.
386 Some examples where filtering might be appropriate include:
388 o Applying a style to sections of a document in a particular set of
389 languages.
391 o Displaying the set of documents containing a particular set of
392 keywords written in a specific set of languages.
394 o Selecting all email items written in a specific set of languages.
396 Filtering can produce either an ordered or an unordered set of
397 results. For example, applying formatting to a document based on the
398 language of specific pieces of content does not require the content
399 to be ordered. It is sufficient to know whether a specific piece of
400 content is selected by the language priority list (or not). A search
401 application, on the other hand, probably would want to order the
402 results.
404 If an ordered set is desired, as described above, then the
405 application or protocol needs to determine the relative "quality" of
406 the match between different language tags and the language range.
408 This measurement is called a "distance metric". A distance metric
409 assigns a numeric value to the comparison of a language tag to a
410 language range that represents the 'distance' between the two. A
411 distance of zero means that they are identical, a small distance
412 indicates that they are very similar, and a large distance indicates
413 that they are very different. Using a distance metric,
414 implementations can, for example, allow users to select a threshold
415 distance for a match to be "successful" while filtering, or they
416 might use the numeric values to order the results.
418 3.2.1. Filtering with Basic Language Ranges
420 When filtering using basic language ranges, each basic language range
421 in the language priority list is considered in turn, according to
422 priority. A language range matches a particular language tag if it
423 exactly equals the tag, or if it exactly equals a prefix of the tag
424 such that the first character following the prefix is "-". (That is,
425 the language-range "de-de" matches the language tag "de-DE-1996", but
426 not the language tag "de-Deva".)
428 The special range "*" in a language priority list matches any tag. A
429 protocol which uses language ranges MAY specify additional rules
430 about the semantics of "*"; for instance, HTTP/1.1 [RFC2616]
431 specifies that the range "*" matches only languages not matched by
432 any other range within an "Accept-Language" header.
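
A minimal Python sketch of basic filtering follows. It assumes the
plain reading in which "*" matches any tag (not the HTTP/1.1
refinement mentioned above), and it returns tags ordered by the
priority of the first range that selects them. The names basic_match
and basic_filter are hypothetical.

```python
def basic_match(language_range, tag):
    """Case-insensitive prefix match of a basic language range against a tag."""
    wanted, actual = language_range.lower(), tag.lower()
    # The prefix must end on a subtag boundary, hence the trailing hyphen.
    return wanted == "*" or actual == wanted or actual.startswith(wanted + "-")


def basic_filter(priority_list, tags):
    """Return the tags selected by any range, ordered by range priority."""
    selected = []
    for language_range in priority_list:
        for tag in tags:
            if tag not in selected and basic_match(language_range, tag):
                selected.append(tag)
    return selected
```

With this sketch, basic_match("de-de", "de-DE-1996") is true, while
basic_match("de-de", "de-Deva") is false because the character after
the prefix "de-De" is not a hyphen.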
434 3.2.2. Filtering with Extended Language Ranges
436 When filtering using extended language ranges, each extended language
437 range in the language priority list is considered in turn, according
438 to priority. The subtags in each extended language range are
439 compared to the corresponding subtags in the language tag being
440 examined. The subtag from the range is considered to match if it
441 exactly matches the corresponding subtag in the tag or the range's
442 subtag has the value "*" (which matches all subtags, including the
443 empty subtag).
445 Subtags not specified, including those at the end of the language
446 range, are assigned the wildcard value "*". This makes each range
447 into a prefix much like that used in basic language range matching.
448 For example, the extended language range "de-*-DE" matches all of the
449 following tags because the unspecified variant field is expanded to
450 "*":
452 de-DE
454 de-Latn-DE
456 de-Latf-DE
458 de-DE-x-goethe
460 de-Latn-DE-1996
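
One way to realize this behavior without parsing tags into fields is
to walk the subtags of the range and the tag side by side, letting
each "*" or omitted position absorb zero or more tag subtags. The
Python sketch below takes that approach, which resembles the extended
filtering procedure later published in RFC 4647; it is an assumption
that this walk is equivalent to the field-by-field comparison
described here, and the name extended_match is hypothetical.

```python
def extended_match(language_range, tag):
    """Return True if the extended language range selects the tag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtag must match exactly unless the range starts with "*".
    if range_subtags[0] not in ("*", tag_subtags[0]):
        return False
    position = 1
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard places no constraint of its own
        while True:
            if position >= len(tag_subtags):
                return False  # ran out of tag subtags before finding it
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break
            if len(current) == 1:
                return False  # do not search past an extension or "x"
    return True
```

Under this sketch the range "de-*-DE" selects each of the five tags
listed above, and rejects "de" and "de-Deva".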
462 3.2.3. Scored Filtering
464 Both basic and extended language range filtering produce simple
465 boolean matches between a language range and a language tag.
466 Sometimes it may be useful to provide an array of results with
467 different levels of matching, for example, sorting results based on
468 the overall "quality" of the match. Scored (or "distance metric")
469 filtering provides a way to generate these quality values.
471 As with the other forms of filtering, the process considers each
472 language range in the language priority list in order of priority.
474 Each extended language range and language tag MUST first be
475 canonicalized by mapping grandfathered and obsolete tags into modern
476 equivalents. This requires the information in the IANA Language
477 Subtag Registry (see Section 3 of [RFC3066bis]).
479 The language range and each language tag it is to be compared to are
480 then transformed into a "quintuple" consisting of five "elements" in
481 the form (language, script, country, variant, extension).
483 Any extended language subtags are considered part of the language
484 "element". For example, the language element for the tag "zh-cmn-
485 Hans" would be "zh-cmn".
487 Private-use subtag sequences are considered part of the language
488 "element" if in the initial position in the tag and part of the
489 variant "element" if not. The different handling of private-use
490 sequences prevents a range such as "x-twain" from matching all
491 possible tags, while a range such as "en-US-x-twain" would closely
492 match nearly all tags for English as used in the United States.
494 Language subtags 'und', 'mul', and the script subtag 'Zyyy' are
495 converted to "*": these subtag values represent undetermined,
496 multiple, or private-use values which are consistent with the use of
497 the wildcard.
499 For language tags that have no script subtag but whose language
500 subtag's record in the IANA Language Subtag Registry contains the
501 field "Suppress-Script", the script element in the quintuple MUST be
502 set to the script subtag in the Suppress-Script field. This is
503 necessary because [RFC3066bis] strongly recommends that users not use
504 this subtag to form language tags and this document recommends that
505 users not use it to form ranges. For example, if the script were
506 not expanded in this manner, a range such as "de-DE" would produce a
507 more-distant score for content that happened to be labeled
508 "de-Latn-DE" than users would expect. Note that
509 languages which have a "Suppress-Script" field in the registry are
510 predominantly written in a single script.
512 Any remaining missing components in the language tag are set to "*";
513 thus an empty language tag becomes the quintuple ("*", "*", "*", "*",
514 "*"). Missing components in the language range are handled similarly
515 to extended range lookup: missing internal subtags are expanded to
516 "*". Missing end subtags are expanded as the empty string. Thus a
517 pattern "en-US" becomes the quintuple ("en","*","US","","").
519 Here are some examples of language tags, showing their quintuples as
520 both language tags and language ranges:
522 en-US
523 Tag: (en, *, US, *, *)
524 Range: (en, *, US, "", "")
526 sr-Latn
527 Tag: (sr, Latn, *, *, *)
528 Range: (sr, Latn, "", "", "")
530 zh-cmn-Hant
531 Tag: (zh-cmn, Hant, *, *, *)
532 Range: (zh-cmn, Hant, "", "", "")
534 x-foo
535 Tag: (x-foo, *, *, *, *)
536 Range: (x-foo, "", "", "", "")
538 en-x-foo
539 Tag: (en, *, *, x-foo, *)
540 Range: (en, *, *, x-foo, "")
542 i-default
543 Tag: (i-default, *, *, *, *)
544 Range: (i-default, "", "", "", "")
546 sl-Latn-IT-rozaj
547 Tag: (sl, Latn, IT, rozaj, *)
548 Range: (sl, Latn, IT, rozaj, "")
550 zh-r-wadegile (hypothetical)
551 Tag: (zh, *, *, *, r-wadegile)
552 Range: (zh, *, *, *, r-wadegile)
554 Figure 3: Examples of Distance Metric Quintuples
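
The transformation into quintuples can be sketched in Python as
follows. This is an illustration under several assumptions: it
classifies subtags by length and character class only, it skips
canonicalization and Suppress-Script expansion (both need the
registry), it does not handle explicit "*" subtags inside a range,
and it joins multiple variants or a trailing private-use sequence
with hyphens. The name quintuple is hypothetical.

```python
import re


def quintuple(value, is_range=False):
    """Split a tag or range into (language, script, region, variant, extension)."""
    subtags = value.split("-")
    if subtags[0].lower() in ("x", "i"):
        # Initial private-use or grandfathered tags stay whole in "language".
        fields = [value, None, None, None, None]
    else:
        pos, language = 1, [subtags[0]]
        # Extended language subtags (up to three) belong to the language element.
        while (pos < len(subtags) and len(language) < 4
               and re.fullmatch(r"[A-Za-z]{3}", subtags[pos])):
            language.append(subtags[pos])
            pos += 1
        script = region = None
        if pos < len(subtags) and re.fullmatch(r"[A-Za-z]{4}", subtags[pos]):
            script, pos = subtags[pos], pos + 1
        if pos < len(subtags) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[pos]):
            region, pos = subtags[pos], pos + 1
        variants = []
        while pos < len(subtags) and re.fullmatch(
                r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtags[pos]):
            variants.append(subtags[pos])
            pos += 1
        extensions = []
        while pos < len(subtags) and subtags[pos].lower() != "x":
            extensions.append(subtags[pos])
            pos += 1
        variants.extend(subtags[pos:])  # non-initial private use joins the variant
        lang = "-".join(language)
        if lang == "" or lang.lower() in ("und", "mul"):
            lang = "*"
        if script is not None and script.lower() == "zyyy":
            script = "*"
        fields = [lang, script, region,
                  "-".join(variants) or None, "-".join(extensions) or None]
    if not is_range:
        return tuple("*" if f is None else f for f in fields)
    # Ranges: internal gaps become "*", trailing gaps become the empty string.
    last = max(i for i, f in enumerate(fields) if f is not None)
    return tuple(f if f is not None else ("*" if i < last else "")
                 for i, f in enumerate(fields))
```

Called with is_range=True, the sketch reproduces the "Range" rows of
the figure above; called without it, the "Tag" rows.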
556 Each pair of quintuples being compared is assigned a distance value,
557 in which small values indicate better matches and large values
558 indicate worse ones. The distance between the pair is the sum of the
559 distances for each of the corresponding elements of the quintuple.
560 If the elements are identical or one is '*', then the distance value
561 between them is zero. Otherwise, it is given by the following table:
563 256 language mismatch
564 128 script mismatch
565 32 region mismatch
566 4 variant mismatch
567 1 extension mismatch
569 A value of 0 is a perfect match; 421 is no match at all. Different
570 threshold values might be appropriate for different applications or
571 protocols. Implementations will usually allow users to choose the
572 most appropriate selection value, ranking the matched items based on
573 score.
575 Examples of various tags' distances from the range "en-US":
577 "fr-FR" 288 (language & region mismatch)
578 "fr" 256 (language mismatch, region match)
579 "en-GB" 32 (region mismatch)
580 "en-Latn-US" 0 (all fields match)
581 "en-Brai" 32 (region mismatch)
582 "en-US-x-foo" 4 (variant mismatch: range is the empty string)
583 "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
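
The summation rule can be sketched in Python as below. The sketch
takes quintuples that have already been built (for example, as shown
in Figure 3) and applies the weights from the table above with a
case-insensitive comparison. The names WEIGHTS and distance are
hypothetical.

```python
# Weights for (language, script, region, variant, extension) mismatches.
WEIGHTS = (256, 128, 32, 4, 1)


def distance(range_quintuple, tag_quintuple):
    """Sum the mismatch weights between a range quintuple and a tag quintuple."""
    total = 0
    for weight, wanted, actual in zip(WEIGHTS, range_quintuple, tag_quintuple):
        # Identical elements, or a wildcard on either side, cost nothing.
        if wanted == "*" or actual == "*" or wanted.lower() == actual.lower():
            continue
        total += weight
    return total


en_us_range = ("en", "*", "US", "", "")
print(distance(en_us_range, ("en", "*", "GB", "*", "*")))      # 32
print(distance(en_us_range, ("en", "*", "US", "x-foo", "*")))  # 4
```

An implementation that wants the more sophisticated weighting
mentioned in the note below could replace the fixed WEIGHTS lookup
with a per-element comparison function.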
585 Note: A variation of this algorithm might vary the scoring used
586 overall or for specific values. For example, sometimes it might make
587 sense to use more sophisticated weighting that depends on the values
588 of the corresponding elements. Thus, depending on the domain, an
589 implementation might assign a smaller distance to the difference
590 between closely related subtags (or treat certain values as equal).
591 Some examples of closely related subtags might be:
593 Language:
594 no (Norwegian)
595 nb (Bokmal Norwegian)
596 nn (Nynorsk Norwegian)
598 Script:
599 Kata (katakana)
600 Hira (hiragana)
602 Region:
603 US (United States of America)
604 UM (United States Minor Outlying Islands)
606 Figure 6: Examples of Closely Related Subtags
608 3.3. Lookup
610 Lookup is used to select the single information item that best
611 matches the language priority list for a given request. When
612 performing lookup, each language range in the language priority list
613 is considered in turn, according to priority. By contrast with
614 filtering, each language range represents the _most_ specific tag
615 which is an acceptable match. The first information item found with
616 a matching tag, according to the user's priority, is considered the
617 closest match and is the item returned. For example, if the language
618 range is "de-CH", one might expect to receive an information item
619 with the tag "de" but never one with the tag "de-CH-1996". Usually
620 if no content matches the request, a "default" item is returned.
622 For example, if an application inserts some dynamic content into a
623 document, returning an empty string if there is no exact match is not
624 an option. Instead, the application "falls back" until it finds a
625 suitable piece of content to insert. Other examples of lookup might
626 include:
628 o Selection of a template containing the text for an automated email
629 response.
631 o Selection of an item containing some text for inclusion in a
632 particular Web page.
634 o Selection of a string of text for inclusion in an error log.
636 In the lookup scheme, the language range is progressively truncated
637 from the end until a matching piece of content is located. For
638 example, starting with the range "zh-Hant-CN-x-private", the lookup
639 progressively searches for content as shown below:
641 Range to match: zh-Hant-CN-x-private
642 1. zh-Hant-CN-x-private
643 2. zh-Hant-CN
644 3. zh-Hant
645 4. zh
646 5. (default content or the empty tag)
648 Figure 7: Example of a Lookup Fallback Pattern
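The fallback pattern in Figure 7 can be sketched in Python as shown
below. This is only an illustration, not part of the specification;
the function name fallback_sequence is hypothetical. It assumes, as
Figure 7 suggests, that a single-character subtag left at the end of
the range (such as "x") is removed together with the subtag that
followed it.

```python
def fallback_sequence(language_range):
    """Return the ranges tried during lookup, most specific first.

    Hypothetical helper: the caller tries the default content after
    the returned list has been exhausted.
    """
    subtags = language_range.split("-")
    sequence = []
    while subtags:
        sequence.append("-".join(subtags))
        # Truncate from the end, one subtag at a time.
        subtags.pop()
        # Assumption: never stop on a dangling single-character subtag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return sequence


print(fallback_sequence("zh-Hant-CN-x-private"))
```

For the range in Figure 7 the list contains steps 1 through 4; step 5
(the default content) is left to the caller.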
650 This scheme allows some flexibility in finding content. When no
651 available content exactly matches the user's request, it provides
652 better results than returning the default language for the system
653 or content immediately. Content is often not available at every
654 level of tag granularity, or language content may be sparsely
655 populated, so "falling back" through the subtag sequence provides
656 more opportunity to find a match between available content and the
657 user's request.
659 The default content is implementation defined. It might be content
660 with no language tag; might have an empty value (the built-in
661 attribute xml:lang in [XML10] permits the empty value); might be a
662 particular language designated for that bit of content; or it might
663 be content that is labeled with the tag "i-default" (see [RFC2277]).
664 When performing lookup using a language priority list, the
665 progressive search MUST proceed to consider each language range in
666 the list before finding the default content or empty tag.
668 One common way for an application or implementation to provide for
669 default content is to allow a specific language range to be set as
670 the default for a specific type of request. This language range is
671 then treated as if it were appended to the end of the language
672 priority list as a whole, rather than after each item in the language
673 priority list.
675 For example, if a particular user's language priority list were
676 "fr-FR; zh-Hant" and the program doing the matching had a default
677 language range of "ja-JP", the program would search for content as
678 follows:
679 1. fr-FR
680 2. fr
681 3. zh-Hant // next language
682 4. zh
683 5. (search for the default content)
684 a. ja-JP
685 b. ja
686 c. (implementation defined default)
688 Figure 8: Lookup Using a Language Priority List
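The search order in Figure 8 might be implemented along the following
lines. This is a minimal sketch under stated assumptions: the names
lookup and fallback_sequence are hypothetical, the available content
is modeled as a plain list of language tags, and the default language
range is treated as if it were appended once to the end of the whole
priority list. The sketch does not strip extensions or unrecognized
private-use subtags beforehand.

```python
def fallback_sequence(language_range):
    """Ranges tried for one language range, most specific first."""
    subtags = language_range.split("-")
    sequence = []
    while subtags:
        sequence.append("-".join(subtags))
        subtags.pop()
        # Assumption: skip a dangling single-character subtag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return sequence


def lookup(priority_list, available, default_range=None):
    """Return the first matching tag from `available`, or None.

    None means the caller should fall back to its
    implementation-defined default content.
    """
    # Compare case-insensitively, but return the tag as stored.
    by_lower = {tag.lower(): tag for tag in available}
    ranges = list(priority_list)
    if default_range is not None:
        # The default is appended to the list as a whole, not
        # after each individual language range.
        ranges.append(default_range)
    for language_range in ranges:
        for candidate in fallback_sequence(language_range):
            match = by_lower.get(candidate.lower())
            if match is not None:
                return match
    return None


print(lookup(["fr-FR", "zh-Hant"], ["en", "ja", "zh"], "ja-JP"))
```

With the priority list "fr-FR; zh-Hant" and the default range "ja-JP",
content tagged "zh" is preferred over content tagged "ja", because
every range in the priority list is exhausted before the default is
considered.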
690 Implementations SHOULD ignore extensions and unrecognized private-use
691 subtags when performing lookup, since these subtags are usually
692 orthogonal to the user's request.
694 The special language range "*" matches any language tag. In the
695 lookup scheme, this range does not convey enough information by
696 itself to determine which content is most appropriate, since it
697 matches everything. If the language range "*" is the only one in the
698 language priority list, it matches the default content. If the
699 language range "*" is followed by other language ranges, it should be
700 skipped.
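One way to apply this rule is to remove "*" from the language
priority list before lookup begins, as in the sketch below. The
function name effective_ranges is hypothetical. The text above does
not say what to do with a "*" that follows other ranges; the sketch
assumes it is skipped as well, so that lookup ends at the default
content in either case.

```python
def effective_ranges(priority_list):
    """Drop "*" entries from a language priority list.

    An empty result means that only the default content remains
    to be considered.
    """
    return [r for r in priority_list if r != "*"]


print(effective_ranges(["*", "fr", "de-CH"]))
```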
702 In some cases, the language priority list might contain one or more
703 extended language ranges (as, for example, when the same language
704 priority list is used as input for both lookup and filtering
705 operations). Wildcard values in an extended language range normally
706 match any value that occurs in that position in a language tag.
707 Since only one item can be returned for any given lookup request,
708 wildcards in a language range have to be processed in a consistent
709 manner or the same request will produce widely varying results.
710 Implementations that accept extended language ranges MUST define
711 which content is returned when more than one item matches the
712 extended language range.
714 One option is for an implementation to return the matching content
715 that is first in ASCII order. For example, if the language range were
716 "*-CH" and the set of content included "de-CH", "fr-CH", and "it-CH",
717 then the content labeled "de-CH" would be returned.
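The ASCII-order rule could be sketched as follows. The function names
are hypothetical, and the matcher is deliberately simplified: it
compares subtags position by position and lets "*" match any value in
its position. It is not the extended filtering algorithm, and an
implementation is free to define a different rule.

```python
def matches_positionally(extended_range, tag):
    """Simplified match: "*" accepts any subtag in the same position."""
    range_subtags = extended_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    if len(range_subtags) > len(tag_subtags):
        return False
    return all(r == "*" or r == t
               for r, t in zip(range_subtags, tag_subtags))


def first_in_ascii_order(extended_range, available):
    """Return the matching tag that sorts first, or None."""
    matching = [t for t in available
                if matches_positionally(extended_range, t)]
    # min() compares the tags as stored; for ASCII-only tags this
    # is ASCII order.
    return min(matching) if matching else None


print(first_in_ascii_order("*-CH", ["it-CH", "fr-CH", "de-CH"]))
```

Because the choice depends only on the set of matching tags, the same
request always yields the same item regardless of the order in which
content is stored.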
719 Another way an implementation could address extended language ranges
720 would be to map them to basic language ranges: if the first subtag is
721 a "*" then the entire range is treated as "*" (which matches the
722 default content), otherwise the wildcard subtag is removed. For
723 example, if the language range were "en-*-US", then the range would
724 be mapped to "en-US".
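This mapping is small enough to sketch directly. The function name
to_basic_range is hypothetical; the behavior follows the two cases
described above.

```python
def to_basic_range(extended_range):
    """Map an extended language range to a basic language range."""
    subtags = extended_range.split("-")
    if subtags[0] == "*":
        # A leading wildcard turns the whole range into "*",
        # which matches the default content.
        return "*"
    # Otherwise every wildcard subtag is simply removed.
    return "-".join(s for s in subtags if s != "*")


print(to_basic_range("en-*-US"))
```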
726 4. Other Considerations
728 When working with language ranges and matching schemes, there are
729 some additional points that may influence the choice of either.
731 4.1. Choosing Language Ranges
733 Users indicate their language preferences via the choice of a
734 language range or the list of language ranges in a language priority
735 list. The type of matching affects what the best choice is for a
736 given user.
738 Most matching schemes make no attempt to process the semantic meaning
739 of the subtags. The language range (or its subtags) is usually
740 compared in a case-insensitive manner to each language tag being
741 matched, using basic string processing.
743 Users SHOULD avoid subtags that add no distinguishing value to a
744 language range. Generally, the fewer subtags that appear in the
745 language range, the more content the range will match.
747 Most notably, script subtags SHOULD NOT be used to form a language
748 range in combination with language subtags that have a matching
749 Suppress-Script field in their registry entry. Thus the language
750 range "en-Latn" is probably inappropriate in most cases (because the
751 vast majority of English documents are written in the Latin script
752 and thus the 'en' language subtag has a Suppress-Script field for
753 'Latn' in the registry).
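An application that builds language ranges on behalf of a user could
apply this advice automatically, as sketched below. The names and the
one-entry table are hypothetical: only the 'en' entry comes from the
text above, and a real implementation would load the Suppress-Script
values from the registry. The sketch also assumes that the script
subtag, when present, immediately follows the primary language
subtag.

```python
# Hypothetical excerpt of registry data (language -> Suppress-Script).
SUPPRESS_SCRIPT = {"en": "Latn"}


def drop_suppressed_script(language_range, suppress_script=SUPPRESS_SCRIPT):
    """Remove a script subtag that the registry marks as redundant."""
    subtags = language_range.split("-")
    if len(subtags) >= 2:
        suppressed = suppress_script.get(subtags[0].lower())
        # Subtags are compared case-insensitively.
        if suppressed is not None and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("en-Latn-US"))
```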
755 When working with tags and ranges, note that extensions and most
756 private-use subtags are orthogonal to language tag matching, in that
757 they specify additional attributes of the text not related to the
758 goals of most matching schemes. Users SHOULD avoid using these
759 subtags in language ranges, since they interfere with the selection
760 of available content. When used in language tags (as opposed to
761 ranges), these subtags normally do not interfere with filtering
762 (Section 3), since they appear at the end of the tag and will match
763 all prefixes.
765 When working with language tags and language ranges note that:
767 o Private-use and Extension subtags are normally orthogonal to
768 language tag fallback. Implementations or specifications that use
769 a lookup (Section 3.3) matching scheme often ignore unrecognized
770 private-use and extension subtags when performing language tag
771 fallback. In addition, since these subtags are always at the end
772 of the sequence of subtags, their use in language tags normally
773 doesn't interfere with the use of ranges that omit them in the
774 filtering (Section 3.2) matching schemes described above.
775 However, they do interfere with filtering when used in language
776 ranges and SHOULD be avoided in ranges as a result.
778 o Applications, specifications, or protocols that choose not to
779 interpret one or more private-use or extension subtags SHOULD NOT
780 remove or modify these extensions in content that they are
781 processing. When a language tag instance is to be used in a
782 specific, known protocol, and is not being passed through to other
783 protocols, language tags MAY be filtered to remove subtags and
784 extensions that are not supported by that protocol. Such
785 filtering SHOULD be avoided, if possible, since it removes
786 information that might be relevant to services on the other end of
787 the protocol that would make use of that information.
789 o Some applications of language tags might want or need to consider
790 extensions and private-use subtags when matching tags. If
791 extensions and private-use subtags are included in a matching or
792 filtering process that utilizes one of the schemes described in
793 this document, then the implementation SHOULD canonicalize the
794 language tags and/or ranges before performing the matching. Note
795 that language tag processors that claim to be "well-formed"
796 processors as defined in [RFC3066bis] generally fall into this
797 category.
799 4.2. Meaning of Language Tags and Ranges
801 Selecting content using language ranges requires some understanding
802 by users of what they are selecting. A language tag or range
803 identifies a language as spoken (or written, signed or otherwise
804 signaled) by human beings for communication of information to other
805 human beings.
807 If a language tag B contains language tag A as a prefix, then B is
808 typically "narrower" or "more specific" than A. For example,
809 "zh-Hant-TW" is more specific than "zh-Hant".
811 This relationship is not guaranteed in all cases: specifically,
812 languages that begin with the same sequence of subtags are NOT
813 guaranteed to be mutually intelligible, although they might be.
815 For example, the tag "az" shares a prefix with both "az-Latn"
816 (Azerbaijani written using the Latin script) and "az-Arab"
817 (Azerbaijani written using the Arabic script). A person fluent in
818 one script might not be able to read the other, even though the text
819 might be otherwise identical. Content tagged as "az" most probably
820 is written in just one script and thus might not be intelligible to a
821 reader familiar with the other script.
823 Variant subtags in particular seem to represent specific divisions in
824 mutual understanding, since they often encode dialects or other
825 idiosyncratic variations within a language. They also seem to
826 represent relatively low divisions with a high chance of at least
827 limited understanding, although this depends on the specific variant
828 in question.
830 The relationship between the language tag and the information it
831 relates to is defined by the standard describing the context in which
832 it appears. Accordingly, this section can only give possible
833 examples of its usage:
835 o For a single information object, the associated language tags
836 might be interpreted as the set of languages that are necessary
837 for a complete comprehension of the complete object. Example:
838 Plain text documents.
840 o For an aggregation of information objects, the associated language
841 tags could be taken as the set of languages used inside components
842 of that aggregation. Examples: Document stores and libraries.
844 o For information objects whose purpose is to provide alternatives,
845 the associated language tags could be regarded as a hint that the
846 content is provided in several languages, and that one has to
847 inspect each of the alternatives in order to find its language or
848 languages. In this case, the presence of multiple tags might not
849 mean that one needs to be multi-lingual to get complete
850 understanding of the document. Example: MIME multipart/
851 alternative.
853 o In markup languages, such as HTML and XML, language information
854 can be added to each part of the document identified by the markup
855 structure (including the whole document itself). For example, one
856 could write <span lang="fr">C'est la vie.</span> inside a
857 Norwegian document; the Norwegian-speaking user could then access
858 a French-Norwegian dictionary to find out what the marked section
859 meant. If the user were listening to that document through a
860 speech synthesis interface, this information could be used to signal
861 the synthesizer to appropriately apply French text-to-speech
862 pronunciation rules to that span of text, instead of misapplying
863 the Norwegian rules.
865 4.3. Considerations for Private Use Subtags
867 Private-use subtags require private agreement between the parties
868 that intend to use or exchange language tags that use them, and great
869 caution SHOULD be used in employing them in content or protocols
870 intended for general use. Private-use subtags are simply useless for
871 information exchange without prior arrangement.
873 The value and semantic meaning of private-use tags and of the subtags
874 used within such a language tag are not defined. Matching private-
875 use tags using language ranges or extended language ranges can result
876 in unpredictable content being returned.
878 4.4. Length Considerations in Matching
880 RFC 3066 [RFC3066] did not provide an upper limit on the size of
881 language tags or ranges. RFC 3066 did define the semantics of
882 particular subtags in such a way that most language tags or ranges
883 consisted of language and region subtags with a combined total length
884 of up to six characters. Larger tags and ranges (in terms of both
885 subtags and characters) did exist, however.
887 [RFC3066bis] also does not impose a fixed upper limit on the number
888 of subtags in a language tag or range (and thus an upper bound on the
889 size of either). The syntax in that document suggests that,
890 depending on the specific language or range of languages, more
891 subtags (and thus characters) are sometimes necessary as a result.
892 Length considerations and their impact on the selection and
893 processing of tags are described in Section 2.1.1 of that document.
895 An application or protocol MAY choose to limit the length of the
896 language tags or ranges used in matching. Any such limitation SHOULD
897 be clearly documented, and such documentation SHOULD include the
898 disposition of any longer tags or ranges (for example, whether an
899 error value is generated or the language tag or range is truncated).
900 If truncation is permitted it MUST NOT permit a subtag to be divided,
901 since this changes the semantics of the subtag being matched and can
902 result in false positives or negatives.
904 Applications or protocols that restrict storage SHOULD consider the
905 impact of tag or range truncation on the resulting matches. For
906 example, removing the "*" from the end of an extended language range
907 (see Section 2.2) can greatly modify the set of returned matches. A
908 protocol that allows tags or ranges to be truncated at an arbitrary
909 limit, without giving any indication of what that limit is, has the
910 potential for causing harm by changing the meaning of values in
911 substantial ways.
913 In practice, most tags do not require additional subtags or
914 substantially more characters. Additional subtags sometimes add
915 useful distinguishing information, but extraneous subtags interfere
916 with the meaning, understanding, and especially matching of language
917 tags. Since language tags or ranges MAY be truncated by an
918 application or protocol that limits storage, when choosing language
919 tags or ranges users and applications SHOULD avoid adding subtags
920 that add no distinguishing value. In particular, users and
921 implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
922 fields in the registry (defined in Section 3.6 of [RFC3066bis]):
923 these fields provide guidance on when specific additional subtags
924 SHOULD (and SHOULD NOT) be used.
926 Implementations MUST support a limit of at least 33 characters. This
927 limit includes at least one subtag of each non-extension, non-private
928 use type. When choosing a buffer limit, a length of at least 42
929 characters is strongly RECOMMENDED.
931 The practical limit on tags or ranges derived solely from registered
932 values is 42 characters. Implementations MUST be able to handle tags
933 and ranges of this length. Support for tags and ranges of at least
934 62 characters in length is RECOMMENDED. Implementations MAY support
935 longer values, including matching extensive sets of private-use or
936 extension subtags.
938 Applications or protocols which have to truncate a tag MUST do so by
939 progressively removing subtags along with their preceding "-" from
940 the right side of the language tag until the tag is short enough for
941 the given buffer. If the resulting tag ends with a single-character
942 subtag, that subtag and its preceding "-" MUST also be removed. For
943 example:
945 Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
946 1. zh-Latn-CN-variant1-a-extend1-x-wadegile
947 2. zh-Latn-CN-variant1-a-extend1
948 3. zh-Latn-CN-variant1
949 4. zh-Latn-CN
950 5. zh-Latn
951 6. zh
953 Figure 9: Example of Tag Truncation
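The truncation rule illustrated in Figure 9 can be sketched as
follows. The function name truncate_tag and the decision to raise an
error when even the first subtag does not fit are assumptions made
for this illustration; the text above only requires that a subtag
never be divided.

```python
def truncate_tag(tag, max_length):
    """Shorten `tag` to at most `max_length` characters.

    Only whole subtags are removed, from the right.
    """
    if len(tag) <= max_length:
        return tag
    subtags = tag.split("-")
    # Remove subtags (and their preceding "-") until the tag fits.
    while subtags and len("-".join(subtags)) > max_length:
        subtags.pop()
    # A trailing single-character subtag must be removed as well.
    while subtags and len(subtags[-1]) == 1:
        subtags.pop()
    if not subtags:
        # Assumption: report an error rather than divide a subtag.
        raise ValueError("no subtag fits in %d characters" % max_length)
    return "-".join(subtags)


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(tag, 31))
```

With a 31-character buffer, removing "private1" and "wadegile" would
leave a tag ending in "x", so that subtag is removed too and the
result is "zh-Latn-CN-variant1-a-extend1".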
955 5. IANA Considerations
957 This document presents no new or existing considerations for IANA.
959 6. Changes
961 This is the first version of this document.
963 The following changes were put into this document since draft-07:
965 Added a mention of "*" to the Character Set Considerations section
966 (D.Ewell)
968 7. Security Considerations
970 Language ranges used in content negotiation might be used to infer
971 the nationality of the sender, and thus identify potential targets
972 for surveillance. In addition, unique or highly unusual language
973 ranges or combinations of language ranges might be used to track a
974 specific individual's activities.
976 This is a special case of the general problem that anything you send
977 is visible to the receiving party. It is useful to be aware that
978 such concerns can exist in some cases.
980 The evaluation of the exact magnitude of the threat, and any possible
981 countermeasures, is left to each application or protocol.
983 8. Character Set Considerations
985 Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-
986 MINUS (%x2D). Language ranges also use the character ASTERISK
987 (%x2A). These characters are present in most character sets, so
988 presentation or exchange of language tags or ranges should not be
989 constrained by character set issues.
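A simple check of the permitted characters might look like the
sketch below. The function names are hypothetical, and the check
covers the character repertoire only; it does not validate the syntax
of a language tag or range.

```python
import re

# A-Z, a-z, 0-9 and HYPHEN-MINUS; ranges additionally allow ASTERISK.
_TAG_CHARS = re.compile(r"[A-Za-z0-9-]+")
_RANGE_CHARS = re.compile(r"[A-Za-z0-9*-]+")


def uses_tag_characters(value):
    """True if `value` contains only characters permitted in tags."""
    return _TAG_CHARS.fullmatch(value) is not None


def uses_range_characters(value):
    """True if `value` contains only characters permitted in ranges."""
    return _RANGE_CHARS.fullmatch(value) is not None


print(uses_tag_characters("zh-Hant-TW"), uses_range_characters("de-*-CH"))
```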
991 9. References
993 9.1. Normative References
995 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
996 Requirement Levels", BCP 14, RFC 2119, March 1997.
998 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
999 Languages", BCP 18, RFC 2277, January 1998.
1001 [RFC3066bis]
1002 Phillips, A., Ed. and M. Davis, Ed., "Tags for the
1003 Identification of Languages", October 2005, .
1007 [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
1008 Specifications: ABNF", RFC 4234, October 2005.
1010 9.2. Informative References
1012 [RFC1766] Alvestrand, H., "Tags for the Identification of
1013 Languages", RFC 1766, March 1995.
1015 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1016 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1017 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
1019 [RFC3066] Alvestrand, H., "Tags for the Identification of
1020 Languages", BCP 47, RFC 3066, January 2001.
1022 [RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282,
1023 May 2002.
1025 [XML10] Bray, T., et al., "Extensible Markup Language (XML) 1.0",
1026 February 2004.
1028 Appendix A. Acknowledgements
1030 Any list of contributors is bound to be incomplete; please regard the
1031 following as only a selection from the group of people who have
1032 contributed to make this document what it is today.
1034 The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
1035 which is a precursor to this document, made enormous contributions
1036 directly or indirectly to this document and are generally responsible
1037 for the success of language tags.
1039 The following people (in alphabetical order by family name)
1040 contributed to this document:
1042 Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
1043 Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
1044 Patton, Randy Presuhn, Eric van der Poel, and many, many others.
1046 Very special thanks must go to Harald Tveit Alvestrand, who
1047 originated RFCs 1766 and 3066, and without whom this document would
1048 not have been possible.
1050 For this particular document, John Cowan originated the scheme
1051 described in Section 3.2.3. Mark Davis originated the scheme
1052 described in Section 3.3.
1054 Authors' Addresses
1056 Addison Phillips (editor)
1057 Quest Software
1059 Email: addison dot phillips at quest dot com
1061 Mark Davis (editor)
1062 IBM
1064 Email: mark dot davis at ibm dot com
1066 Intellectual Property Statement
1068 The IETF takes no position regarding the validity or scope of any
1069 Intellectual Property Rights or other rights that might be claimed to
1070 pertain to the implementation or use of the technology described in
1071 this document or the extent to which any license under such rights
1072 might or might not be available; nor does it represent that it has
1073 made any independent effort to identify any such rights. Information
1074 on the procedures with respect to rights in RFC documents can be
1075 found in BCP 78 and BCP 79.
1077 Copies of IPR disclosures made to the IETF Secretariat and any
1078 assurances of licenses to be made available, or the result of an
1079 attempt made to obtain a general license or permission for the use of
1080 such proprietary rights by implementers or users of this
1081 specification can be obtained from the IETF on-line IPR repository at
1082 http://www.ietf.org/ipr.
1084 The IETF invites any interested party to bring to its attention any
1085 copyrights, patents or patent applications, or other proprietary
1086 rights that may cover technology that may be required to implement
1087 this standard. Please address the information to the IETF at
1088 ietf-ipr@ietf.org.
1090 Disclaimer of Validity
1092 This document and the information contained herein are provided on an
1093 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1094 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1095 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1096 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1097 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1098 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1100 Copyright Statement
1102 Copyright (C) The Internet Society (2005). This document is subject
1103 to the rights, licenses and restrictions contained in BCP 78, and
1104 except as set forth therein, the authors retain all their rights.
1106 Acknowledgment
1108 Funding for the RFC Editor function is currently provided by the
1109 Internet Society.