idnits 2.17.1
draft-ietf-ltru-matching-07.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5 on line 1026.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1003.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1010.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1016.
** This document has an original RFC 3978 Section 5.4 Copyright Line,
instead of the newer IETF Trust Copyright according to RFC 4748.
** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
of the newer disclaimer which includes the IETF Trust according to RFC
4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
== Line 699 has weird spacing: '...becomes en-US...'
== Line 700 has weird spacing: '...becomes en-La...'
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (November 18, 2005) is 6727 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
RFC 7232, RFC 7233, RFC 7234, RFC 7235)
** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)
-- Obsolete informational reference (is this intentional?): RFC 1766
(Obsoleted by RFC 3066, RFC 3282)
-- Obsolete informational reference (is this intentional?): RFC 3066
(Obsoleted by RFC 4646, RFC 4647)
Summary: 5 errors (**), 0 flaws (~~), 5 warnings (==), 9 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group A. Phillips, Ed.
3 Internet-Draft Quest Software
4 Obsoletes: 3066 (if approved) M. Davis, Ed.
5 Expires: May 22, 2006 IBM
6 November 18, 2005
8 Matching of Language Tags
9 draft-ietf-ltru-matching-07
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on May 22, 2006.
36 Copyright Notice
38 Copyright (C) The Internet Society (2005).
40 Abstract
42 This document describes different mechanisms for comparing, matching,
43 and evaluating language tags. Possible algorithms for language
44 negotiation and content selection are described. This document, in
45 combination with RFC 3066bis (replace "3066bis" with the RFC number
46 assigned to draft-ietf-ltru-registry-14), replaces RFC 3066, which
47 replaced RFC 1766.
49 Table of Contents
51 1. Introduction
52 2. The Language Range
53 2.1. Lists of Language Ranges
54 2.2. Basic Language Range
55 2.3. Extended Language Range
56 2.4. Choosing a Language Range
57 3. Types of Matching
58 3.1. Choosing a Type of Matching
59 3.2. Filtering
60 3.2.1. Filtering with Basic Language Ranges
61 3.2.2. Filtering with Extended Language Ranges
62 3.2.3. Scored Filtering
63 3.3. Lookup
64 4. Other Considerations
65 4.1. Meaning of Language Tags and Ranges
66 4.2. Considerations for Private Use Subtags
67 4.3. Length Considerations in Matching
68 5. IANA Considerations
69 6. Changes
70 7. Security Considerations
71 8. Character Set Considerations
72 9. References
73 9.1. Normative References
74 9.2. Informative References
75 Appendix A. Acknowledgements
76 Authors' Addresses
77 Intellectual Property and Copyright Statements
79 1. Introduction
81 Human beings on our planet have, past and present, used a number of
82 languages. There are many reasons why one would want to identify the
83 language used when presenting or requesting information.
85 Information about a user's language preferences commonly needs to be
86 identified so that appropriate processing can be applied. For
87 example, the user's language preferences in a browser can be used to
88 select web pages appropriately. Language preferences can also be
89 used to select among tools (such as dictionaries) to assist in the
90 processing or understanding of content in different languages.
92 Given a set of language identifiers, such as those defined in
93 [RFC3066bis], various mechanisms can be envisioned for performing
94 language negotiation and tag matching. Applications, protocols, or
95 specifications will have varying needs and requirements that affect
96 the choice of a suitable mechanism.
98 This document defines several mechanisms for matching, selecting, or
99 filtering content whose natural language is identified using Language
100 Tags [RFC3066bis], as well as the syntax (called a "language range")
101 associated with each of these mechanisms for specifying the user's
102 language preferences.
104 This document, in combination with [RFC3066bis] (replace "3066bis"
105 globally in this document with the RFC number assigned to
106 draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
107 [RFC1766].
109 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
110 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
111 document are to be interpreted as described in [RFC2119].
113 2. The Language Range
115 Language Tags [RFC3066bis] are used to identify the language of some
116 information item or content. Applications or protocols that use
117 language tags are often faced with the problem of identifying sets of
118 content that share certain language attributes. For example,
119 HTTP/1.1 [RFC2616] describes language ranges in its discussion of the
120 Accept-Language header (Section 14.4). These are to be used when
121 selecting content from servers based on the language of that content.
123 When selecting content according to its language, it is useful to
124 have a mechanism for identifying sets of language tags that share
125 specific attributes. This allows users to select or filter content
126 based on specific requirements. Such an identifier is called a
127 "Language Range".
129 Language tags and thus language ranges are to be treated as case
130 insensitive: there exist conventions for the capitalization of some
131 of the subtags, but these MUST NOT be taken to carry meaning.
132 Matching of language tags to language ranges MUST be done in a case
133 insensitive manner as well.
135 2.1. Lists of Language Ranges
137 When users specify a language preference they often need to specify a
138 prioritized list of language ranges in order to best reflect their
139 language preferences. This is especially true for speakers of
140 minority languages. A speaker of Breton in France, for example, may
141 specify "br" followed by "fr", meaning that if Breton is available,
142 it is preferred, but otherwise French is the best alternative. It
143 can get more complex: a speaker may wish to fall back from Skolt Sami
144 to Northern Sami to Finnish.
146 A "Language Priority List" consists of a prioritized or weighted list
147 of language ranges. One well known example of such a list is the
148 "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
149 14.4) and RFC 3282 [RFC3282].
151 The various matching operations described in this document include
152 considerations for using a language priority list. When given as
153 examples in this document, language priority lists will be shown as a
154 quoted sequence of ranges separated by semi-colons, like this: "en;
155 fr; zh-Hant" (which would be read as "English before French before
156 Chinese as written in the Traditional script").
158 2.2. Basic Language Range
160 A "Basic Language Range" identifies the set of content whose language
161 tags begin with the same sequence of subtags. A basic language range
162 is identified by its 'language range' tag, by adapting the
163 ABNF [RFC4234] from HTTP/1.1 [RFC2616]:
165 language-range = language-tag / "*"
166 language-tag = 1*8alphanum *("-" 1*8alphanum)
167 alphanum = ALPHA / DIGIT
169 That is, a language-range has the same syntax as a language-tag or is
170 the single character "*". Basic Language Ranges imply that there is
171 a semantic relationship between language tags that share the same
172 prefix. While this is often the case, it is not always true and
173 users should note that the set of language tags that match a specific
174 language-range may not be mutually intelligible.
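As a rough illustration of the syntax above, the following Python sketch checks whether a string has the shape of a basic language range. It assumes the intended grammar is one or more hyphen-separated subtags of one to eight alphanumeric characters, or the single character "*". The names BASIC_RANGE and is_basic_language_range are hypothetical and do not come from this document.

```python
import re

# Assumed shape: subtags of 1 to 8 alphanumerics joined by "-", or a lone "*".
BASIC_RANGE = re.compile(r"(?:\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*)")


def is_basic_language_range(text: str) -> bool:
    """Return True if text has the shape of a basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None
```

Under these assumptions "de-CH" and "*" are accepted, while "en-*-US" is rejected because a wildcard inside a range is only meaningful in an extended language range.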
176 Basic language ranges were originally described in [RFC3066] and
177 HTTP/1.1 [RFC2616] (where they are referred to as simply a "language
178 range").
180 2.3. Extended Language Range
182 A Basic Language Range does not always provide the most appropriate
183 way to specify a user's preferences. Sometimes it is beneficial to
184 use a more granular matching scheme that takes advantage of the
185 internal structure of language tags, by allowing the user to specify,
186 for example, the value of a specific field in a language tag or to
187 indicate which values are of interest in filtering or selecting the
188 content.
190 In an extended language range, the identifier takes the form of a
191 series of subtags which MUST consist of well-formed subtags or the
192 special subtag "*". For example, the language range "en-*-US"
193 specifies a primary language of 'en', followed by any script subtag,
194 followed by the region subtag 'US'.
196 An extended language range can be represented by the following ABNF:
197 extended-language-range = range ; a range
198 / privateuse ; private-use tag
199 / grandfathered ; grandfathered registrations
201 range = (language
202 ["-" script]
203 ["-" region]
204 *("-" variant)
205 *("-" extension)
206 ["-" privateuse])
208 language = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
209 / 4ALPHA ; reserved for future use
210 / 5*8ALPHA ; registered language subtag
211 / "*" ; ... or wildcard
213 extlang = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
214 ; reserved for future use
215 ; wildcard can only appear
216 ; at the end
218 script = 4ALPHA ; ISO 15924 code
219 / "*" ; or wildcard
221 region = 2ALPHA ; ISO 3166 code
222 / 3DIGIT ; UN M.49 code
223 / "*" ; ... or wildcard
225 variant = 5*8alphanum ; registered variants
226 / (DIGIT 3alphanum) ;
227 / "*" ; ... or wildcard
229 extension = singleton *("-" (2*8alphanum)) [ "-*" ]
230 ; extension subtags
231 ; wildcard can only appear
232 ; at the end
234 singleton = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
235 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
236 ; Single letters: x/X is reserved for private use
238 privateuse = ("x"/"X") 1*("-" (1*8alphanum))
240 grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
241 ; grandfathered registration
242 ; Note: I is the only singleton
243 ; that starts a grandfathered tag
245 alphanum = (ALPHA / DIGIT) ; letters and numbers
247 A field not present in the middle of an extended language range MAY
248 be treated as if the field contained a "*". For example, the range
249 "en-US" MAY be considered to be equivalent to the range "en-*-US".
250 This also means that multiple wildcards can be collapsed (so that
251 "en-*-*-US" is equivalent to "en-*-US").
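The wildcard collapsing described above can be sketched in Python as follows. This is only an illustration: the function name collapse_wildcards is hypothetical, and the sketch does not try to insert wildcards for fields that are absent from the range.

```python
def collapse_wildcards(language_range: str) -> str:
    """Collapse runs of adjacent "*" subtags into a single "*"."""
    collapsed = []
    for subtag in language_range.split("-"):
        # Skip a wildcard that directly follows another wildcard.
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue
        collapsed.append(subtag)
    return "-".join(collapsed)
```

With this sketch, "en-*-*-US" becomes "en-*-US", and a range without adjacent wildcards is returned unchanged.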
253 2.4. Choosing a Language Range
255 Users indicate their language preferences via the choice of a
256 language range or the set of language ranges in the language priority
257 list. The type of matching will affect what the best choice is for
258 given user. In addition, user's should be aware that, when working
259 with language ranges, most matching schemes make no attempt to
260 process the semantic meaning of the subtags. The language tag and
261 language range (or their subtags) are usually compared in a case
262 insensitive manner using basic string processing. Thus the choice of
263 subtags in both the language tag and language range may affect the
264 results produced.
266 Users SHOULD avoid subtags that add no distinguishing value to a
267 language range. For example, script subtags SHOULD NOT be used to
268 form a language range with language subtags which have a matching
269 Suppress-Script field in their registry record. Thus the language
270 range "en-Latn" is probably inappropriate in most cases (because the
271 vast majority of English documents are written in the Latin script
272 and thus the 'en' language subtag has a Suppress-Script field for
273 'Latn' in the registry).
275 When working with tags and ranges, note that extensions and most
276 private-use subtags are orthogonal to language tag fallback and users
277 SHOULD avoid using these subtags in language ranges, since they will
278 often interfere with the selection of available language content.
279 Since these subtags are always at the end of the sequence of subtags,
280 they don't normally interfere with the use of prefixes for the
281 filtering schemes described below in Section 3.
283 When working with tags and ranges users SHOULD note the following:
285 1. Private-use and Extension subtags are normally orthogonal to
286 language tag fallback. Implementations or specifications that
287 use a lookup (Section 3.3) matching scheme SHOULD ignore
288 unrecognized private-use and extension subtags when performing
289 language tag fallback. Since these subtags are always at the end
290 of the sequence of subtags, they don't normally interfere with
291 the use of prefixes for matching in the schemes described below.
293 2. Applications, specifications, or protocols that choose not to
294 interpret one or more private-use or extension subtags SHOULD NOT
295 remove or modify these extensions in content that they are
296 processing. When a language tag instance is to be used in a
297 specific, known protocol, and is not being passed through to
298 other protocols, language tags MAY be filtered to remove subtags
299 and extensions that are not supported by that protocol. Such
300 filtering SHOULD be avoided, if possible, since it removes
301 information that might be relevant if services on the other end
302 of the protocol would make use of that information.
304 3. Some applications of language tags might want or need to consider
305 extensions and private-use subtags when matching tags. If
306 extensions and private-use subtags are included in a matching or
307 filtering process that utilizes one of the schemes described
308 in this document, then the implementation SHOULD canonicalize the
309 language tags and/or ranges before performing the matching. Note
310 that language tag processors that claim to be "well-formed"
311 processors as defined in [RFC3066bis] generally fall into this
312 category.
314 3. Types of Matching
316 Matching language ranges to language tags can be done in a number of
317 different ways. This section describes the different types of
318 matching scheme, as well as the considerations for choosing between
319 them. Protocols and specifications SHOULD clearly indicate the
320 particular mechanism used in selecting or matching language tags.
322 There are two basic types of matching scheme: those that produce an
323 open-ended set of content (called "filtering") and those that produce
324 a single information item for a given request (called "lookup").
326 A key difference between these two types of matching scheme is that
327 the language range for filtering operations is always the _least_
328 specific tag one will accept as a match, while for lookup operations
329 the language range is always the _most_ specific tag.
331 3.1. Choosing a Type of Matching
333 Applications, protocols, and specifications are faced with the
334 decision of what type of matching to use. Sometimes, different
335 styles of matching might be suited for different kinds of processing
336 within a particular application or protocol.
338 Filtering can be used to produce a set of results (such as a
339 collection of documents). For example, if using a search engine, one
340 might use filtering to limit the results to documents written in
341 French. It can also be used when deciding whether to perform some
342 processing that is language sensitive on some content. For example,
343 a process might cause paragraphs whose language tag matched the
344 language range "nl" to be displayed in italics within a document.
346 This document describes three types of filtering:
348 1. Basic Filtering (Section 3.2.1) is used to match content using
349 basic language ranges (Section 2.2). It is compatible with
350 implementations that do not produce extended language ranges.
352 2. Extended Range Filtering (Section 3.2.2) is used to match content
353 using extended language ranges (Section 2.3). Newer
354 implementations SHOULD use this form of filtering in preference
355 to basic filtering.
357 3. Scored Filtering (Section 3.2.3) produces an ordered set of
358 content using either basic or extended language ranges. It
359 SHOULD be used when the quality of the match within a specific
360 language range is important, as when presenting a list of
361 documents resulting from a search.
363 Lookup (Section 3.3) is used when each request MUST produce exactly
364 one piece of content. For example, a Web server might use the
365 Accept-Language HTTP header to choose which language to return a
366 custom 404 page in: since it can return only one page, it must choose
367 a single item and it must return some item, even if no content
368 matches the language ranges supplied by the user.
370 Most types of matching in this document are designed so that
371 implementations do not have to examine the values of the subtags
372 supplied and, except for scored filtering, they do not need access to
373 the Language Subtag Registry nor do they require the use of valid
374 subtags in either language tags or language ranges. This has great
375 benefit for speed and simplicity of implementation.
377 Implementations might also wish to use semantic information external
378 to the language tags when performing fallback. For example, the
379 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
380 Norwegian) might both be usefully matched to the more general subtag
381 'no' (Norwegian). Or an implementation might infer that content
382 labeled "zh-CN" is more likely to match the range "zh-Hans" than
383 equivalent content labeled "zh-TW".
385 3.2. Filtering
387 Filtering is used to select the set of content that matches a given
388 prefix. It is called "filtering" because this set of content may
389 contain no items at all or it may return an arbitrary number of
390 matching items--as many as match the language range used to specify
391 the items, thus filtering out the non-matching content.
393 In filtering, the language range represents the _least_ specific tag
394 which is an acceptable match. That is, all of the language tags in
395 the set of filtered content will have an equal or greater number of
396 subtags than the language range. For example, if the language range
397 is "de-CH", one might see matching content with the tag "de-CH-1996"
398 but one will never see a match with the tag "de".
400 If the language priority list (see Section 2.1) contains more than
401 one range, the content returned is typically ordered in descending
402 level of preference.
404 Some examples where filtering might be appropriate include:
406 o Applying a style to sections of a document in a particular
407 language range.
409 o Displaying the set of documents containing a particular set of
410 keywords written in a specific language.
412 o Selecting all email items written in a specific range of languages.
414 Filtering can produce either an ordered or an unordered set of
415 results. For example, applying formatting to a document based on the
416 language of specific pieces of content does not require the content
417 to be ordered. It is sufficient to know whether a specific piece of
418 content matches or does not match. A search application, on the
419 other hand, probably would put the results into a priority order.
421 If an ordered set is desired, as described above, then the
422 application or protocol needs to determine the relative "quality" of
423 the match between different language tags and the language range.
425 This measurement is called a "distance metric". A distance metric
426 assigns a numeric value to the comparison of each language tag to a
427 language range and represents the 'distance' between the two. A
428 distance of zero means that they are identical, a small distance
429 indicates that they are very similar, and a large distance indicates
430 that they are very different. Using a distance metric,
431 implementations can, for example, allow users to select a threshold
432 distance for a match to be "successful" while filtering or it can use
433 the numeric value to order the results.
435 3.2.1. Filtering with Basic Language Ranges
437 When filtering using a basic language range, the language range
438 matches a language tag if it exactly equals the tag, or if it exactly
439 equals a prefix of the tag such that the first character following
440 the prefix is "-". (That is, the language-range "de-de" matches the
441 language tag "de-DE-1996", but not the language tag "de-Deva".)
443 The special range "*" matches any tag. A protocol which uses
444 language ranges MAY specify additional rules about the semantics of
445 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
446 languages not matched by any other range within an "Accept-Language"
447 header.
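The basic filtering rule can be sketched in Python as shown below. The function names basic_match and basic_filter are hypothetical. The sketch treats "*" as matching every tag and does not model protocol-specific rules such as the HTTP/1.1 treatment of "*".

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Return True if a basic language range matches a language tag."""
    wanted = language_range.lower()  # comparison is case insensitive
    actual = tag.lower()
    if wanted == "*":
        return True
    # Exact match, or a prefix that is followed immediately by "-".
    return actual == wanted or actual.startswith(wanted + "-")


def basic_filter(language_range: str, tags: list[str]) -> list[str]:
    """Return the tags matched by the range, in their original order."""
    return [tag for tag in tags if basic_match(language_range, tag)]
```

For example, basic_match("de-de", "de-DE-1996") is true, while basic_match("de-de", "de-Deva") is false because the character following the prefix is not "-".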
449 3.2.2. Filtering with Extended Language Ranges
451 In the Extended Range Matching scheme, each extended language range
452 in the language priority list is considered in turn, according to
453 priority. The subtags in each extended language range are compared
454 to the corresponding subtags in the language tag being examined. The
455 subtag from the range is considered to match if it exactly matches
456 the corresponding subtag in the tag or the range's subtag has the
457 value "*" (which matches all subtags, including the empty subtag).
458 Extended Range Matching is an extension of basic matching
459 (Section 3.2.1): the language range represents the least specific tag
460 which is an acceptable match.
462 Private-use subtags MAY be specified in the language range and MUST
463 NOT be ignored when matching.
465 Subtags not specified, including those at the end of the language
466 range, are assigned the value "*". This makes each range into a
467 prefix much like that used in basic language range matching. For
468 example, the extended language range "de-*-DE" matches all of the
469 following tags because the unspecified variant field is expanded to
470 "*":
472 de-DE
474 de-Latn-DE
476 de-Latf-DE
478 de-DE-x-goethe
480 de-Latn-DE-1996
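One way to implement this kind of matching without classifying subtags into fields is the subtag-walking procedure later published as Extended Filtering in RFC 4647, Section 3.3.2. The Python sketch below follows that procedure rather than the field-by-field comparison described in this section, so it should be read as a substitute technique that happens to give the same results for the examples above. The function name extended_match is hypothetical.

```python
def extended_match(language_range: str, tag: str) -> bool:
    """Return True if an extended language range matches a language tag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtags must agree unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    position = 1  # index of the next unexamined subtag in the tag
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard places no constraint on the tag
        while True:
            if position >= len(tag_subtags):
                return False  # tag ran out before the range was satisfied
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break
            if len(current) == 1:
                return False  # never skip past a singleton such as "x"
    return True
```

With this sketch the range "de-*-DE" matches "de-DE", "de-Latn-DE", "de-Latf-DE", "de-DE-x-goethe", and "de-Latn-DE-1996", and it does not match "de" or "de-x-DE".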
482 3.2.3. Scored Filtering
484 Both basic and extended language range filtering produce simple
485 boolean matches. Sometimes it may be beneficial to provide an array
486 of results with different levels of matching, for example, sorting
487 results based on the overall "quality" of the match. Scored (or
488 "distance metric") filtering provides a way to generate these quality
489 values.
491 First both the extended language range and the language tags to be
492 matched to it must be canonicalized by mapping grandfathered and
493 obsolete tags into modern equivalents.
495 The language range and the language tags are then transformed into
496 quintuples of elements of the form (language, script, region,
497 variant, extension). Any extended language subtags are considered
498 part of the language element; private-use subtag sequences are
499 considered part of the language element if in the initial position in
500 the tag and part of the variant element if not. Language subtags
501 'und', 'mul', and the script subtag 'Zyyy' are converted to "*".
503 Missing components in the language-tag are set to "*"; thus a "*"
504 pattern becomes the quintuple ("*", "*", "*", "*", "*"). Missing
505 components in the extended language-range are handled similarly to
506 extended range lookup: missing internal subtags are expanded to "*".
507 Missing end subtags are expanded as the empty string. Thus a pattern
508 "en-US" becomes the quintuple ("en","*","US","","").
510 Here are some examples of language tags, showing their quintuples as
511 both language tags and language ranges:
513 en-US
514 Tag: (en, *, US, *, *)
515 Range: (en, *, US, "", "")
517 sr-Latn
518 Tag: (sr, Latn, *, *, *)
519 Range: (sr, Latn, "", "", "")
521 zh-cmn-Hant
522 Tag: (zh-cmn, Hant, *, *, *)
523 Range: (zh-cmn, Hant, "", "", "")
525 x-foo
526 Tag: (x-foo, *, *, *, *)
527 Range: (x-foo, "", "", "", "")
529 en-x-foo
530 Tag: (en, *, *, x-foo, *)
531 Range: (en, *, *, x-foo, "")
533 i-default
534 Tag: (i-default, *, *, *, *)
535 Range: (i-default, "", "", "", "")
537 sl-Latn-IT-rozaj
538 Tag: (sl, Latn, IT, rozaj, *)
539 Range: (sl, Latn, IT, rozaj, "")
541 zh-r-wadegile (hypothetical)
542 Tag: (zh, *, *, *, r-wadegile)
543 Range: (zh, *, *, *, r-wadegile)
545 Figure 3: Examples of Distance Metric Quintuples
547 Each language-range/language-tag pair being compared is assigned a
548 distance value, whereby small values indicate better matches and
549 large values indicate worse ones. The distance between the pair is
550 the sum of the distances for each of the corresponding elements of
551 the quintuple. If the elements are identical or one is '*', then the
552 distance value between them is zero. Otherwise, it is given by the
553 following table:
555 256 language mismatch
556 128 script mismatch
557 32 region mismatch
558 4 variant mismatch
559 1 extension mismatch
561 A value of 0 is a perfect match; 421 is no match at all. Different
562 threshold values might be appropriate for different applications or
563 protocols. Implementations will usually allow users to choose the
564 most appropriate selection value, ranking the matched items based on
565 score.
567 Examples of various tags' distances from the range "en-US":
569 "fr-FR" 288 (language & region mismatch)
570 "fr" 256 (language mismatch, region match)
571 "en-GB" 32 (region mismatch)
572 "en-Latn-US" 0 (all fields match)
573 "en-Brai" 0 (missing region in the tag is treated as "*")
574 "en-US-x-foo" 4 (variant mismatch: range is the empty string)
575 "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
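The quintuple transformation and the distance sum can be sketched in Python as below. This is a simplified illustration under several assumptions: subtags are assigned to fields by their shape alone, no registry lookup or canonicalization of grandfathered and obsolete tags is performed, wildcard subtags inside a range are not handled (only a bare "*"), and results are lowercased because matching is case insensitive. The names WEIGHTS, quintuple, and distance are hypothetical.

```python
WEIGHTS = (256, 128, 32, 4, 1)  # language, script, region, variant, extension


def _split_fields(text):
    """Split a tag into five fields by subtag shape; absent fields are None."""
    subtags = text.lower().split("-")
    if subtags[0] in ("x", "i"):
        # Private-use or grandfathered tag: the whole tag is the language.
        return ["-".join(subtags), None, None, None, None]
    language, script, region = [subtags[0]], None, None
    variant, extension = [], []
    i = 1
    while i < len(subtags) and len(subtags[i]) == 3 and subtags[i].isalpha():
        language.append(subtags[i])  # extended language subtag
        i += 1
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        script = subtags[i]
        i += 1
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        region = subtags[i]
        i += 1
    while i < len(subtags) and len(subtags[i]) > 1:
        variant.append(subtags[i])
        i += 1
    while i < len(subtags) and subtags[i] != "x":
        extension.append(subtags[i])  # singleton and the subtags after it
        i += 1
    variant.extend(subtags[i:])  # a trailing private-use sequence, if any
    return ["-".join(language), script, region,
            "-".join(variant) or None, "-".join(extension) or None]


def quintuple(text, as_range=False):
    """Return the quintuple for a language tag or, if as_range, a range."""
    fields = _split_fields(text)
    if fields[0] in ("und", "mul"):
        fields[0] = "*"
    if fields[1] == "zyyy":
        fields[1] = "*"
    if not as_range:
        return tuple("*" if f is None else f for f in fields)
    # In a range, gaps before the last given field become "*", later ones "".
    last = max(i for i, f in enumerate(fields) if f is not None)
    return tuple(("*" if i < last else "") if f is None else f
                 for i, f in enumerate(fields))


def distance(language_range, tag):
    """Sum the weights of the fields that differ and are not wildcards."""
    pairs = zip(WEIGHTS, quintuple(language_range, as_range=True), quintuple(tag))
    return sum(w for w, r, t in pairs if r != t and "*" not in (r, t))
```

Applying the table literally, distance("en-US", "en-GB") is 32, distance("en-US", "en-Latn-US") is 0, and distance("en-US", "en-US-x-foo") is 4. A sorted list of candidate tags can then be produced by using distance as the sort key.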
577 Implementations or protocols sometimes might wish to use more
578 sophisticated weights that depend on the values of the corresponding
579 elements. For example, depending on the domain, an implementation
580 might give a small distance to the difference between closely
581 related subtags. Some examples of closely related subtags might be:
583 Language:
584 no (Norwegian)
585 nb (Bokmal Norwegian)
586 nn (Nynorsk Norwegian)
588 Script:
589 Kata (katakana)
590 Hira (hiragana)
592 Region:
593 US (United States of America)
594 UM (United States Minor Outlying Islands)
596 Figure 6: Examples of Closely Related Subtags
598 3.3. Lookup
600 Lookup is used to select the single information item that best
601 matches the language priority list for a given request. In lookup,
602 each language-range in the language priority list represents the
603 _most_ specific tag which is an acceptable match; only the closest
604 matching item according to the user's priority is returned. For
605 example, if the language range is "de-CH", one might expect to
606 receive an information item with the tag "de" but never one with the
607 tag "de-CH-1996". Usually if no content matches the request, a
608 "default" item is returned.
610 For example, if an application inserts some dynamic content into a
611 document, returning an empty string if there is no exact match is not
612 an option. Instead, the application "falls back" until it finds a
613 suitable piece of content to insert. Other examples of lookup might
614 include:
616 o Selection of a template containing the text for an automated email
617 response.
619 o Selection of a graphic containing text for inclusion in a
620 particular Web page.
622 o Selection of a string of text for inclusion in an error log.
624 In the Lookup scheme, the language-range is progressively truncated
625 from the end until a matching piece of content is located. For
626 example, starting with the range "zh-Hant-CN-x-private", the lookup
627 would progressively search for content as shown below:
629 Range to match: zh-Hant-CN-x-private
630 1. zh-Hant-CN-x-private
631 2. zh-Hant-CN
632 3. zh-Hant
633 4. zh
634 5. (default content or the empty tag)
636 Figure 7: Example of a Lookup Fallback Pattern
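The progressive truncation can be sketched in Python as follows. The sketch assumes, based on the fallback pattern shown in the figure, that a single-character subtag such as "x" left at the end after a truncation is removed as well, since it cannot end a tag on its own. The function name fallback_chain is hypothetical.

```python
def fallback_chain(language_range: str) -> list[str]:
    """List the ranges tried during lookup, from most to least specific."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()  # truncate from the end
        # Assumption: also drop a dangling singleton such as "x".
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain
```

For the range "zh-Hant-CN-x-private" this sketch yields "zh-Hant-CN-x-private", "zh-Hant-CN", "zh-Hant", and "zh", after which the caller would fall through to the default content.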
638 This scheme allows some flexibility in finding content. It also
639 typically provides better results when data is not available at a
640 specific level of tag granularity or is sparsely populated than would
641 be obtained by using the default language for the system or content.
643 The language range "*" matches any language tag. In the lookup
644 scheme, this language range does not convey enough information to
645 determine which content is most appropriate. If this language range
646 is the only one in the language priority list, it matches the default
647 content. If this language range is followed by other language
648 ranges, it should be skipped.
650 When performing lookup using a language priority list, the
651 progressive search MUST proceed to consider each language range
652 before finding the default content or empty tag. The default content
653 might be content with no language tag (or with an empty value, as
654 with xml:lang in the XML specification), or it might be a particular
655 language designated for that bit of content.
657 One common way to provide for default content is to allow a specific
658 language range to be set as the default for a specific type of
659 request. This language range is then treated as if it were appended
660 to the end of the language priority list, rather than after each item
661 in the language priority list.
663 For example, if a particular user's language priority list were
664 "fr-FR; zh-Hant" and the program doing the matching had a default
665 language range of "ja-JP", the program would search for content as
666 follows:
667 1. fr-FR
668 2. fr
669 3. zh-Hant // next language
670 4. zh
671 5. (return default content)
672 a. ja-JP
673 b. ja
674 c. (empty tag or other default content)
676 Figure 8: Lookup Using a Language Priority List
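
One possible way to implement the search shown in Figure 8 is sketched below in Python. The names lookup, fallback_chain, and default_range are hypothetical, and the set of available tags stands in for whatever content store an application actually uses. Comparing tags case-insensitively is an assumption of this sketch, as is the treatment of a "*" range that appears after other ranges.

```python
def fallback_chain(language_range):
    """Yield progressively shorter ranges, most specific first."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # Assumption: never stop on a dangling single-character subtag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(priority_list, available_tags, default_range=None):
    """Return the tag of the single best match, or None.

    None means "use the empty tag or other default content".
    """
    # The "*" range carries no preference in lookup, so it is skipped;
    # if it is the only range, the search falls through to the default.
    ranges = [r for r in priority_list if r != "*"]
    # The default range is consulted once, after the whole priority list,
    # not after each item in it.
    if default_range is not None:
        ranges.append(default_range)
    by_lower = {tag.lower(): tag for tag in available_tags}
    for language_range in ranges:
        for candidate in fallback_chain(language_range):
            match = by_lower.get(candidate.lower())
            if match is not None:
                return match
    return None


print(lookup(["fr-FR", "zh-Hant"], {"en", "ja", "zh"}, "ja-JP"))
print(lookup(["fr-FR", "zh-Hant"], {"en", "ja"}, "ja-JP"))
```

In the first call the search stops at step 4 of Figure 8 ("zh"); in the second it reaches the default range and stops at step 5b ("ja").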
678 In some cases, the language priority list might contain one or more
679 extended language ranges (as, for example, when the same language
680 priority list is used as input for both lookup and filtering
681 operations). Wildcard values in an extended language range are
682 supposed to match any value that occurs in that position in a
683 language tag. Since only one item can be returned for any given
684 lookup request, the wildcards must be processed in a predictable
685 manner (or the same request might produce widely varying results).
686 Thus, for each range in the language priority list, the following
687 rules must be applied to produce a basic language range for use in
688 the fallback mechanism:
690 1. If the first subtag in the extended language range is a "*" then
691 the entire range is converted to "*".
693 2. For each subsequent subtag, if the value is a "*" then that
694 subtag and its preceding hyphen are removed.
696 For example:
698 *-US becomes *
699 en-*-US becomes en-US
700 en-Latn-* becomes en-Latn
702 Figure 9: Transformation of Extended Language Ranges
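
The two transformation rules above are small enough to express directly. The following Python sketch (the function name to_basic_range is hypothetical) reproduces the three rows of Figure 9.

```python
def to_basic_range(extended_range):
    """Convert an extended language range to a basic one for lookup."""
    subtags = extended_range.split("-")
    # Rule 1: a leading wildcard collapses the whole range to "*".
    if subtags[0] == "*":
        return "*"
    # Rule 2: any later wildcard is removed along with its hyphen.
    return "-".join(subtag for subtag in subtags if subtag != "*")


for extended in ("*-US", "en-*-US", "en-Latn-*"):
    print(extended, "becomes", to_basic_range(extended))
```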
704 For the language priority list "*-US; fr-*-FR; zh-Hant", the fallback
705 pattern would be:
706 1. * (skipped)
707 2. fr-FR
708 3. fr
709 4. zh-Hant
710 5. zh
711 6. (default content)
713 Figure 10: Extended Language Range Fallback Example
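
Combining the wildcard transformation with progressive truncation gives the search order of Figure 10. The Python sketch below is one hedged way to derive that order; search_order and its helpers are hypothetical names, and an empty result is taken to mean that only the default content remains.

```python
def to_basic_range(extended_range):
    """Apply the two wildcard rules to obtain a basic language range."""
    subtags = extended_range.split("-")
    if subtags[0] == "*":
        return "*"
    return "-".join(subtag for subtag in subtags if subtag != "*")


def fallback_chain(language_range):
    """Yield progressively shorter ranges, most specific first."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # Assumption: never stop on a dangling single-character subtag.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def search_order(priority_list):
    """List every tag that lookup would try before the default content."""
    order = []
    for extended_range in priority_list:
        basic = to_basic_range(extended_range)
        if basic == "*":
            continue  # "*" is skipped, as in step 1 of Figure 10
        order.extend(fallback_chain(basic))
    return order


print(search_order(["*-US", "fr-*-FR", "zh-Hant"]))
```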
715 4. Other Considerations
717 When working with language ranges and matching schemes, there are
718 some additional points that may influence the choice of either.
720 4.1. Meaning of Language Tags and Ranges
722 Selecting content using language ranges requires some understanding
723 by users of what they are selecting. A language tag or range
724 identifies a language as spoken (or written, signed or otherwise
725 signaled) by human beings for communication of information to other
726 human beings.
728 If a language tag B contains language tag A as a prefix, then B is
729 typically "narrower" or "more specific" than A. For example,
730 "zh-Hant-TW" is more specific than "zh-Hant".
732 This relationship is not guaranteed in all cases: specifically,
733 languages that begin with the same sequence of subtags are NOT
734 guaranteed to be mutually intelligible, although they might be.
736 For example, the tag "az" shares a prefix with both "az-Latn"
737 (Azerbaijani written using the Latin script) and "az-Arab"
738 (Azerbaijani written using the Arabic script). A person fluent in
739 one script might not be able to read the other, even though the text
740 might be otherwise identical. Content tagged as "az" most probably
741 is written in just one script and thus might not be intelligible to a
742 reader familiar with the other script.
744 Variant subtags in particular seem to represent specific divisions in
745 mutual understanding, since they often encode dialects or other
746 idiosyncratic variations within a language.
748 The relationship between the language tag and the information it
749 relates to is defined by the standard describing the context in which
750 it appears. Accordingly, this section can only give possible
751 examples of its usage:
753 o For a single information object, the associated language tags
754 might be interpreted as the set of languages that are necessary
755 for a complete comprehension of the complete object. Example:
756 Plain text documents.
758 o For an aggregation of information objects, the associated language
759 tags could be taken as the set of languages used inside components
760 of that aggregation. Examples: Document stores and libraries.
762 o For information objects whose purpose is to provide alternatives,
763 the associated language tags could be regarded as a hint that the
764 content is provided in several languages, and that one has to
765 inspect each of the alternatives in order to find its language or
766 languages. In this case, the presence of multiple tags might not
767 mean that one needs to be multi-lingual to get complete
768 understanding of the document. Example: MIME
769 multipart/alternative.
771 o In markup languages, such as HTML and XML, language information
772 can be added to each part of the document identified by the markup
773 structure (including the whole document itself). For example, one
774 could write <span lang="fr">C'est la vie.</span> inside a
775 Norwegian document; the Norwegian-speaking user could then access
776 a French-Norwegian dictionary to find out what the marked section
777 meant. If the user were listening to that document through a
778 speech synthesis interface, this information could be used to signal
779 the synthesizer to appropriately apply French text-to-speech
780 pronunciation rules to that span of text, instead of misapplying
781 the Norwegian rules.
783 4.2. Considerations for Private Use Subtags
785 Private-use subtags require private agreement between the parties
786 that intend to use or exchange language tags that use them, and great
787 caution SHOULD be used in employing them in content or protocols
788 intended for general use. Private-use subtags are simply useless for
789 information exchange without prior arrangement.
791 The value and semantic meaning of private-use tags and of the subtags
792 used within such a language tag are not defined. Matching private-
793 use tags using language ranges or extended language ranges can result
794 in unpredictable content being returned.
796 4.3. Length Considerations in Matching
798 RFC 3066 [RFC3066] did not provide an upper limit on the size of
799 language tags or ranges. RFC 3066 did define the semantics of
800 particular subtags in such a way that most language tags or ranges
801 consisted of language and region subtags with a combined total length
802 of up to six characters. Larger tags and ranges (in terms of both
803 subtags and characters) did exist, however.
805 [RFC3066bis] also does not impose a fixed upper limit on the number
806 of subtags in a language tag or range (and thus an upper bound on the
807 size of either). The syntax in that document suggests that,
808 depending on the specific language or range of languages, more
809 subtags (and thus characters) are sometimes necessary as a result.
811 Length considerations and their impact on the selection and
812 processing of tags are described in Section 2.1.1 of that document.
814 An application or protocol MAY choose to limit the length of the
815 language tags or ranges used in matching. Any such limitation SHOULD
816 be clearly documented, and such documentation SHOULD include the
817 disposition of any longer tags or ranges (for example, whether an
818 error value is generated or the language tag or range is truncated).
819 If truncation is permitted it MUST NOT permit a subtag to be divided,
820 since this changes the semantics of the subtag being matched and can
821 result in false positives or negatives.
823 Applications or protocols that restrict storage SHOULD consider the
824 impact of tag or range truncation on the resulting matches. For
825 example, removing the "*" from the end of an extended language range
826 (see Section 2.3) can greatly modify the set of returned matches. A
827 protocol that allows tags or ranges to be truncated at an arbitrary
828 limit, without giving any indication of what that limit is, has the
829 potential for causing harm by changing the meaning of values in
830 substantial ways.
832 In practice, most tags do not require additional subtags or
833 substantially more characters. Additional subtags sometimes add
834 useful distinguishing information, but extraneous subtags interfere
835 with the meaning, understanding, and especially matching of language
836 tags. Since language tags or ranges MAY be truncated by an
837 application or protocol that limits storage, when choosing language
838 tags or ranges users and applications SHOULD avoid adding subtags
839 that add no distinguishing value. In particular, users and
840 implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
841 fields in the registry (defined in Section 3.6 of [RFC3066bis]):
842 these fields provide guidance on when specific additional subtags
843 SHOULD (and SHOULD NOT) be used.
845 Implementations MUST support a limit of at least 33 characters. This
846 limit includes at least one subtag of each non-extension, non-private
847 use type. When choosing a buffer limit, a length of at least 42
848 characters is strongly RECOMMENDED.
850 The practical limit on tags or ranges derived solely from registered
851 values is 42 characters. Implementations MUST be able to handle tags
852 and ranges of this length. Support for tags and ranges of at least
853 62 characters in length is RECOMMENDED. Implementations MAY support
854 longer values, including matching extensive sets of private-use or
855 extension subtags.
857 Applications or protocols which have to truncate a tag MUST do so by
858 progressively removing subtags along with their preceding "-" from
859 the right side of the language tag until the tag is short enough for
860 the given buffer. If the resulting tag ends with a single-character
861 subtag, that subtag and its preceding "-" MUST also be removed. For
862 example:
864 Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
865 1. zh-Latn-CN-variant1-a-extend1-x-wadegile
866 2. zh-Latn-CN-variant1-a-extend1
867 3. zh-Latn-CN-variant1
868 4. zh-Latn-CN
869 5. zh-Latn
870 6. zh
872 Figure 11: Example of Tag Truncation
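
The truncation rule can be sketched as follows in Python. The function name truncate_tag and the max_length parameter are hypothetical, and the sketch assumes that length is counted in characters of the tag as written. Whole subtags are removed from the right, so a subtag is never divided, and a single-character subtag is never left at the end.

```python
def truncate_tag(tag, max_length):
    """Shorten a language tag to fit a buffer of max_length characters."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_length:
        # Remove the rightmost subtag together with its preceding "-".
        subtags.pop()
        # A tag must not end in a single-character subtag such as "a"
        # or "x", so remove that too.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # An empty string means that nothing fits in the given buffer.
    return "-".join(subtags)


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
for limit in (42, 35, 20, 10):
    print(limit, truncate_tag(tag, limit))
```

With a limit of 35 characters, for example, removing "private1" is not enough, so "wadegile" is removed as well, which takes the now-trailing "x" with it and yields step 2 of Figure 11.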
874 5. IANA Considerations
876 This document presents no new or existing considerations for IANA.
878 6. Changes
880 This is the first version of this document.
882 The following changes were put into this document since draft-06:
884 Changed the document title from the unwieldy "Matching Tags for
885 the Identification of Languages" to "Matching Language Tags" (Ed.)
887 Fixed problems with the distance metric filtering scheme
888 (Section 3.2.3) examples (in which tags were expanded
889 incorrectly). (D.Ewell)
891 Moved the sentence "Protocols and specifications SHOULD clearly
892 indicate the particular mechanism used in selecting or matching
893 language tags." from the introduction (where there should not be
894 any normative language) to the start of Section 3. (A.Phillips)
896 Created Section 2.4 and moved text there (A.Phillips)
898 Modified the examples of closely related subtags in Section 3.2.3
899 to show what the examples mean (M.Duerst)
901 Various spelling and grammatical fixes (D.Ewell)
903 7. Security Considerations
905 Language ranges used in content negotiation might be used to infer
906 the nationality of the sender, and thus identify potential targets
907 for surveillance. In addition, unique or highly unusual language
908 ranges or combinations of language ranges might be used to track a
909 specific individual's activities.
911 This is a special case of the general problem that anything you send
912 is visible to the receiving party. It is useful to be aware that
913 such concerns can exist in some cases.
915 The evaluation of the exact magnitude of the threat, and any possible
916 countermeasures, is left to each application or protocol.
918 8. Character Set Considerations
920 The syntax of language tags and language ranges permits only the
921 characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters
922 are present in most character sets, so presentation of language tags
923 should not present any character set issues.
925 9. References
927 9.1. Normative References
929 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
930 Requirement Levels", BCP 14, RFC 2119, March 1997.
932 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
933 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
934 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
936 [RFC3066bis]
937 Phillips, A., Ed. and M. Davis, Ed., "Tags for the
938 Identification of Languages", October 2005.
942 [RFC4234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
943 Specifications: ABNF", RFC 4234, October 2005.
945 9.2. Informative References
947 [RFC1766] Alvestrand, H., "Tags for the Identification of
948 Languages", RFC 1766, March 1995.
950 [RFC3066] Alvestrand, H., "Tags for the Identification of
951 Languages", BCP 47, RFC 3066, January 2001.
953 [RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282,
954 May 2002.
956 Appendix A. Acknowledgements
958 Any list of contributors is bound to be incomplete; please regard the
959 following as only a selection from the group of people who have
960 contributed to make this document what it is today.
962 The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
963 which is a precursor to this document, made enormous contributions
964 directly or indirectly to this document and are generally responsible
965 for the success of language tags.
967 The following people (in alphabetical order by family name)
968 contributed to this document:
970 Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
971 Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
972 Patton, Randy Presuhn, Eric van der Poel, and many, many others.
974 Very special thanks must go to Harald Tveit Alvestrand, who
975 originated RFCs 1766 and 3066, and without whom this document would
976 not have been possible.
978 For this particular document, John Cowan originated the scheme
979 described in Section 3.2.3. Mark Davis originated the scheme
980 described in Section 3.3.
982 Authors' Addresses
984 Addison Phillips (editor)
985 Quest Software
987 Email: addison dot phillips at quest dot com
989 Mark Davis (editor)
990 IBM
992 Email: mark dot davis at ibm dot com
994 Intellectual Property Statement
996 The IETF takes no position regarding the validity or scope of any
997 Intellectual Property Rights or other rights that might be claimed to
998 pertain to the implementation or use of the technology described in
999 this document or the extent to which any license under such rights
1000 might or might not be available; nor does it represent that it has
1001 made any independent effort to identify any such rights. Information
1002 on the procedures with respect to rights in RFC documents can be
1003 found in BCP 78 and BCP 79.
1005 Copies of IPR disclosures made to the IETF Secretariat and any
1006 assurances of licenses to be made available, or the result of an
1007 attempt made to obtain a general license or permission for the use of
1008 such proprietary rights by implementers or users of this
1009 specification can be obtained from the IETF on-line IPR repository at
1010 http://www.ietf.org/ipr.
1012 The IETF invites any interested party to bring to its attention any
1013 copyrights, patents or patent applications, or other proprietary
1014 rights that may cover technology that may be required to implement
1015 this standard. Please address the information to the IETF at
1016 ietf-ipr@ietf.org.
1018 Disclaimer of Validity
1020 This document and the information contained herein are provided on an
1021 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1022 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1023 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1024 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1025 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1026 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1028 Copyright Statement
1030 Copyright (C) The Internet Society (2005). This document is subject
1031 to the rights, licenses and restrictions contained in BCP 78, and
1032 except as set forth therein, the authors retain all their rights.
1034 Acknowledgment
1036 Funding for the RFC Editor function is currently provided by the
1037 Internet Society.