Network Working Group                                   A. Phillips, Ed.
Internet-Draft                                                Yahoo! Inc
Obsoletes: 3066 (if approved)                              M. Davis, Ed.
Expires: August 10, 2006                                          Google
                                                        February 6, 2006


                        Matching of Language Tags
                       draft-ietf-ltru-matching-09

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on August 10, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document describes different mechanisms for comparing, matching,
   and evaluating language tags. Possible algorithms for language
   negotiation or content selection, filtering, and lookup are
   described. This document, in combination with RFC 3066bis (replace
   "3066bis" with the RFC number assigned to
   draft-ietf-ltru-registry-14), replaces RFC 3066, which replaced RFC
   1766.

Table of Contents

   1.  Introduction
   2.  The Language Range
     2.1.  Basic Language Range
     2.2.  Extended Language Range
     2.3.  The Language Priority List
   3.  Types of Matching
     3.1.  Choosing a Type of Matching
     3.2.  Filtering
       3.2.1.  Filtering with Basic Language Ranges
       3.2.2.  Filtering with Extended Language Ranges
       3.2.3.  Scored Filtering
     3.3.  Lookup
   4.  Other Considerations
     4.1.  Choosing Language Ranges
     4.2.  Meaning of Language Tags and Ranges
     4.3.  Considerations for Private Use Subtags
     4.4.  Length Considerations in Matching
   5.  IANA Considerations
   6.  Changes
   7.  Security Considerations
   8.  Character Set Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Human beings on our planet have, past and present, used a number of
   languages. There are many reasons why one would want to identify the
   language used when presenting or requesting information.

   Information about a user's language preferences commonly needs to be
   identified so that appropriate processing can be applied. For
   example, the user's language preferences in a browser can be used to
   select web pages appropriately. Language preferences can also be
   used to select among tools (such as dictionaries) to assist in the
   processing or understanding of content in different languages.

   Given a set of language identifiers, such as those defined in
   [RFC3066bis], various mechanisms can be envisioned for performing
   language negotiation and tag matching.

   This document defines a syntax (called a language range (Section 2))
   for specifying a user's language preferences, as well as several
   schemes for selecting or filtering content by comparing language
   ranges to the language tags [RFC3066bis] used to identify the natural
   language of that content. Applications, protocols, or specifications
   will have varying needs and requirements that affect the choice of a
   suitable matching scheme. Depending on the choice of scheme, there
   are various options left to the implementation. Protocols that
   implement a matching scheme either need to specify each particular
   choice or indicate the options that are left to the implementation to
   decide.

   This document is divided into three main sections. The first
   describes how to indicate a user's preferences using language ranges.
   The second describes various schemes for matching these ranges to a
   set of language tags in order to select specific content. The third
   deals with various practical considerations that apply to
   implementing and using these schemes.

   This document, in combination with [RFC3066bis] (Ed.: replace
   "3066bis" globally in this document with the RFC number assigned to
   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
   [RFC1766].

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  The Language Range

   Language Tags [RFC3066bis] are used to identify the language of some
   information item or content. Applications or protocols that use
   language tags are often faced with the problem of identifying sets of
   content that share certain language attributes.
   For example, HTTP/1.1 [RFC2616] describes one such mechanism in its
   discussion of the Accept-Language header (Section 14.4), which is
   used when selecting content from servers based on the language of
   that content.

   When selecting content according to its language, it is useful to
   have a mechanism for identifying sets of language tags that share
   specific attributes. This allows users to select or filter content
   based on specific requirements. Such an identifier is called a
   "Language Range".

   Language ranges are similar in structure and content to language
   tags: they consist of alphanumeric "subtags" separated by hyphens,
   plus a special subtag consisting of the character "*" (%x2A,
   ASTERISK), which is used in ranges as a "wildcard", that is, a value
   that matches any subtag.

   Language tags and thus language ranges are to be treated as
   case-insensitive: there exist conventions for the capitalization of
   some of the subtags, but these MUST NOT be taken to carry meaning.
   Matching of language tags to language ranges MUST be done in a
   case-insensitive manner as well.

2.1.  Basic Language Range

   A "basic language range" identifies the set of content whose language
   tags begin with the same sequence of subtags. Each range consists of
   a sequence of alphanumeric subtags separated by hyphens. The basic
   language range is defined by the following ABNF [RFC4234]:

   language-range = language-tag / "*"
   language-tag   = 1*8alphanum *("-" 1*8alphanum)
   alphanum       = ALPHA / DIGIT

   Basic language ranges (originally described by HTTP/1.1 [RFC2616] and
   later [RFC3066]) have the same syntax as an [RFC3066] language tag or
   are the single character "*".
   They differ from the language tags defined in [RFC3066bis] only in
   that there is no requirement that they be "well-formed" or be
   validated against the IANA Language Subtag Registry (although
   ill-formed ranges will probably not match anything).

   Use of a basic language range seems to imply that there is a semantic
   relationship between language tags that share the same prefix. While
   this is often the case, it is not always true, and users should note
   that the languages identified by the set of language tags matching a
   specific language range may not be mutually intelligible.

2.2.  Extended Language Range

   A basic language range does not always provide the most appropriate
   way to specify a user's preferences. Sometimes it is beneficial to
   use a more fine-grained matching scheme that takes advantage of the
   internal structure of language tags. This allows the user to
   specify, for example, the value of a specific field in a language tag
   or to indicate which values are of interest in filtering or selecting
   the content.

   In an extended language range, the identifier takes the form of a
   series of subtags, each of which MUST be either a well-formed subtag
   or the special subtag "*". For example, the language range "en-*-US"
   specifies a primary language of 'en', followed by any script subtag,
   followed by the region subtag 'US'.

   An extended language range can be represented by the following ABNF:

   extended-language-range = range            ; a range
                           / privateuse       ; a private-use range
                           / grandfathered    ; a grandfathered
                                              ; registration

   range         = (language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse])

   language      = (2*3ALPHA [ extlang ])     ; shortest ISO 639 code
                 / 4ALPHA                     ; reserved for future use
                 / 5*8ALPHA                   ; registered language subtag
                 / "*"                        ; or wildcard

   extlang       = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
                                              ; reserved for future use
                                              ; wildcard can only appear
                                              ; at the end

   script        = 4ALPHA                     ; ISO 15924 code
                 / "*"                        ; or wildcard

   region        = 2ALPHA                     ; ISO 3166 code
                 / 3DIGIT                     ; UN M.49 code
                 / "*"                        ; or wildcard

   variant       = 5*8alphanum                ; registered variants
                 / (DIGIT 3alphanum)
                 / "*"                        ; or wildcard

   extension     = singleton *("-" (2*8alphanum)) [ "-*" ]
                                              ; extension subtags
                                              ; wildcard can only appear
                                              ; at the end

   singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
                 ; single letters (except for "x") or digits

   privateuse    = "x" 1*("-" (1*8alphanum))

   grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
                                              ; grandfathered registration
                                              ; Note: I is the only singleton
                                              ; that starts a grandfathered tag

   alphanum      = (ALPHA / DIGIT)            ; letters and numbers

   A field not present in the middle of an extended language range is
   treated as if the field contained a "*". Implementations that
   normalize extended language ranges SHOULD expand missing fields to be
   "*" so that the semantic meaning of the language range is clear to
   the user. At the same time, multiple wildcards in a row are
   redundant and implementations SHOULD collapse these to a single
   wildcard when normalizing the range (for brevity). For example, both
   the range "sl-nedis" and the range "sl-*-*-nedis" are equivalent to,
   and should be normalized as, "sl-*-nedis".

2.3.  The Language Priority List

   When users specify a language preference they often need to specify a
   prioritized list of language ranges in order to best reflect their
   language preferences. This is especially true for speakers of
   minority languages. A speaker of Breton in France, for example, may
   specify "br" followed by "fr", meaning that if Breton is available,
   it is preferred, but otherwise French is the best alternative. It
   can get more complex: a speaker may wish to fall back from Skolt Sami
   to Northern Sami to Finnish.

   A "Language Priority List" is a prioritized or weighted list of
   language ranges. One well-known example of such a list is the
   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
   14.4) and RFC 3282 [RFC3282]. A simple list of ranges, i.e., one that
   contains no weighting information, is considered to be in descending
   order of priority.

   The various matching operations described in this document include
   considerations for using a language priority list. This document
   does not define any syntax for a language priority list; defining
   such a syntax is the responsibility of the protocol, application, or
   implementation that uses it. When given as examples in this
   document, language priority lists will be shown as a quoted sequence
   of ranges separated by semi-colons, like this: "en; fr; zh-Hant"
   (which would be read as "English before French before Chinese as
   written in the Traditional script").

3.  Types of Matching

   Matching language ranges to language tags can be done in a number of
   different ways. This section describes several different matching
   schemes, as well as the considerations for choosing between them.
   Protocols and specifications SHOULD clearly indicate the particular
   mechanism used in selecting or matching language tags.

   There are two basic types of matching scheme: those that produce zero
   or more information items (called "filtering") and those that produce
   a single information item for a given request (called "lookup").

   A key difference between these two types of matching scheme is that,
   for filtering operations, the language ranges in the language
   priority list represent the _least_ specific content one will accept
   as a match, while for lookup operations the language ranges represent
   the _most_ specific content.

3.1.  Choosing a Type of Matching

   Applications, protocols, and specifications are faced with the
   decision of what type of matching to use. Sometimes, different
   styles of matching might be suited for different kinds of processing
   within a particular application or protocol.

   Language tag matching is a tool, and does not by itself specify a
   complete procedure for the use of language tags. Such procedures are
   intimately tied to the application protocol in which they occur.
   When specifying a protocol operation using matching, the protocol
   MUST specify:

   o  Which type(s) of language tag matching it uses

   o  Whether the operation returns a single result (lookup) or a
      possibly empty set of results (filtering)

   o  For lookup, what the result is when no matching tag is found. For
      instance, a protocol might fail the operation, return an empty
      value, return some protocol-defined or implementation-defined
      default, or return i-default [RFC2277].

   Filtering can be used to produce a set of results (such as a
   collection of documents). For example, if using a search engine, one
   might use filtering to limit the results to documents written in
   French. It can also be used when deciding whether to perform a
   language-sensitive process on some content. For example, a process
   might cause paragraphs whose language tag matched the language range
   "nl" to be displayed in italics within a document.
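
   As a non-normative illustration of the "nl" example, the following
   Python sketch selects the paragraphs to italicize. It assumes the
   basic prefix rule that this document defines in Section 3.2.1; the
   function name basic_match and the sample paragraphs are hypothetical
   and are not part of this specification.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Return True if a basic language range matches a language tag."""
    rng = language_range.lower()      # matching is case-insensitive
    candidate = tag.lower()
    if rng == "*":                    # the wildcard range matches any tag
        return True
    # Match on equality, or on a prefix that ends at a subtag boundary.
    return candidate == rng or candidate.startswith(rng + "-")


# Hypothetical document content: (language tag, text) pairs.
paragraphs = [
    ("en", "Hello"),
    ("nl", "Hallo"),
    ("nl-BE", "Goeiedag"),
    ("nds", "Moin"),
]

# Paragraphs selected by the range "nl"; these would be italicized.
italic = [text for tag, text in paragraphs if basic_match("nl", tag)]
```

   Here the range "nl" selects the paragraphs tagged "nl" and "nl-BE"
   and leaves the others untouched; no ordering of the result is
   needed for this kind of processing.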

   This document describes four types of matching (three types of
   filtering, plus the lookup scheme):

   1.  Basic Filtering (Section 3.2.1) is used to match content using
       basic language ranges (Section 2.1).

   2.  Extended Range Filtering (Section 3.2.2) is used to match content
       using extended language ranges (Section 2.2).

   3.  Scored Filtering (Section 3.2.3) produces an ordered set of
       content using extended language ranges. It SHOULD be used when
       the quality of the match within a specific language range is
       important, as when presenting a list of documents resulting from
       a search.

   4.  Lookup (Section 3.3) is used when each request needs to produce
       _exactly_ one piece of content. For example, if a process were
       to insert a human-readable error message into a protocol header,
       it might select the text based on the user's language preference.
       Since it can return only one item, it must choose a single item
       and it must return some item, even if no content matches the
       language priority list supplied by the user.

   Most types of matching in this document are designed so that
   implementations are not required to validate or understand any of the
   semantics of the subtags supplied and, except for scored filtering,
   they do not need access to the IANA Language Subtag Registry (see
   Section 3 in [RFC3066bis]). This simplifies implementations and
   improves their performance.

   Regardless of the matching scheme chosen, protocols and
   implementations MAY canonicalize language tags and ranges by mapping
   grandfathered and obsolete tags or subtags into modern equivalents.
   If an implementation canonicalizes either ranges or tags, then the
   implementation will require the IANA Language Subtag Registry
   information for that purpose. Implementations MAY also use semantic
   information external to the registry when matching tags.
   For example, the primary language subtags 'nn' (Nynorsk Norwegian)
   and 'nb' (Bokmal Norwegian) might both be usefully matched to the
   more general subtag 'no' (Norwegian). Or an implementation might
   infer that content labeled "zh-CN" is more likely to match the range
   "zh-Hans" than equivalent content labeled "zh-TW".

3.2.  Filtering

   Filtering is used to select the set of content that matches a given
   language priority list. It is called "filtering" because this set of
   content may contain no items at all or it may contain an arbitrarily
   large number of matching items: as many items as match the language
   priority list, thus "filtering out" the non-matching items.

   In filtering, the language range represents the _least_ specific
   (that is, the one with the fewest subtags) language tag that is an
   acceptable match. That is, all of the language tags in the set of
   filtered content will have an equal or greater number of subtags than
   the language range. For example, if the language priority list
   consists of the range "de-CH", one might see matching content with
   the tag "de-CH-1996" but one will never see a match with the tag
   "de".

   If the language priority list (see Section 2.3) contains more than
   one range, the content returned is typically ordered in descending
   level of preference.

   Some examples where filtering might be appropriate include:

   o  Applying a style to sections of a document in a particular set of
      languages.

   o  Displaying the set of documents containing a particular set of
      keywords written in a specific set of languages.

   o  Selecting all email items written in a specific set of languages.

   Filtering can produce either an ordered or an unordered set of
   results. For example, applying formatting to a document based on the
   language of specific pieces of content does not require the content
   to be ordered.
   It is sufficient to know whether a specific piece of content is
   selected by the language priority list (or not). A search
   application, on the other hand, probably would want to order the
   results.

   If an ordered set is desired, as described above, then the
   application or protocol needs to determine the relative "quality" of
   the match between different language tags and the language range.

   This measurement is called a "distance metric". A distance metric
   assigns a numeric value to the comparison of a language tag to a
   language range that represents the 'distance' between the two. A
   distance of zero means that they are identical, a small distance
   indicates that they are very similar, and a large distance indicates
   that they are very different. Using a distance metric,
   implementations can, for example, allow users to select a threshold
   distance for a match to be "successful" while filtering, or they
   might use the numeric values to order the results.

3.2.1.  Filtering with Basic Language Ranges

   When filtering using basic language ranges, each basic language range
   in the language priority list is considered in turn, according to
   priority. A language range matches a particular language tag if, in
   a case-insensitive comparison, it exactly equals the tag, or if it
   exactly equals a prefix of the tag such that the first character
   following the prefix is "-". For example, the language range "de-de"
   matches the language tag "de-DE-1996", but not the language tag
   "de-Deva".

   The special range "*" in a language priority list matches any tag. A
   protocol which uses language ranges MAY specify additional rules
   about the semantics of "*"; for instance, HTTP/1.1 [RFC2616]
   specifies that the range "*" matches only languages not matched by
   any other range within an "Accept-Language" header.

3.2.2.  Filtering with Extended Language Ranges

   When filtering using extended language ranges, each extended language
   range in the language priority list is considered in turn, according
   to priority. The subtags in each extended language range are
   compared to the corresponding subtags in the language tag being
   examined. The subtag from the range is considered to match if it
   exactly matches the corresponding subtag in the tag or the range's
   subtag has the value "*" (which matches all subtags, including the
   empty subtag).

   Subtags not specified, including those at the end of the language
   range, are assigned the wildcard value "*". This makes each range
   into a prefix much like that used in basic language range matching.
   For example, the extended language range "de-*-DE" matches all of the
   following tags, in part because the unspecified variant, extension,
   and private-use subtags are expanded to "*":

      de-DE

      de-Latn-DE

      de-Latf-DE

      de-DE-x-goethe

      de-Latn-DE-1996

3.2.3.  Scored Filtering

   Both basic and extended language range filtering produce simple
   boolean matches between a language range and a language tag.

   Sometimes it may be useful to provide an array of results with
   different levels of matching, for example, sorting results based on
   the overall "quality" of the match. Scored (or "distance metric")
   filtering provides a way to generate these quality values.

   As with the other forms of filtering, the process considers each
   language range in the language priority list in order of priority.

   Each extended language range and language tag MUST first be
   canonicalized by mapping grandfathered and obsolete tags into modern
   equivalents. This requires the information in the IANA Language
   Subtag Registry (see Section 3 of [RFC3066bis]).
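
   The canonicalization step can be sketched, non-normatively, in
   Python as shown below. The two tables are a small hand-copied
   excerpt of registry mappings and the names PREFERRED_TAG,
   PREFERRED_SUBTAG, and canonicalize are hypothetical. The sketch
   assumes that only whole grandfathered tags and the primary language
   subtag need mapping; a real implementation would load every
   replacement value from the IANA Language Subtag Registry.

```python
# Illustrative excerpt only, not the full registry.
PREFERRED_TAG = {            # grandfathered tags with a modern equivalent
    "i-klingon": "tlh",
    "art-lojban": "jbo",
}
PREFERRED_SUBTAG = {         # obsolete primary language subtags
    "iw": "he",
    "in": "id",
    "ji": "yi",
}


def canonicalize(tag_or_range: str) -> str:
    """Map a tag or range to its modern equivalent, in lowercase."""
    lowered = tag_or_range.lower()    # comparison is case-insensitive
    if lowered in PREFERRED_TAG:      # the whole tag is grandfathered
        return PREFERRED_TAG[lowered]
    subtags = lowered.split("-")
    # Assumption: only the primary language subtag is remapped here.
    subtags[0] = PREFERRED_SUBTAG.get(subtags[0], subtags[0])
    return "-".join(subtags)
```

   Under these assumptions, "iw-IL" and "he-IL" canonicalize to the
   same value, so content labeled with the obsolete subtag is scored
   the same as content labeled with its replacement.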

   The language range and each language tag it is to be compared to are
   then transformed into a "quintuple" consisting of five "elements" in
   the form (language, script, region, variant, extension).

   Any extended language subtags are considered part of the language
   "element". For example, the language element for the tag
   "zh-cmn-Hans" would be "zh-cmn".

   Private-use subtag sequences are considered part of the language
   "element" if in the initial position in the tag and part of the
   variant "element" if not. The different handling of private-use
   sequences prevents a range such as "x-twain" from matching all
   possible tags, while a range such as "en-US-x-twain" would closely
   match nearly all tags for English as used in the United States.

   The language subtags 'und' and 'mul' and the script subtag 'Zyyy' are
   converted to "*": these subtag values represent undetermined,
   multiple, or private-use values which are consistent with the use of
   the wildcard.

   For language tags that have no script subtag but whose language
   subtag's record in the IANA Language Subtag Registry contains the
   field "Suppress-Script", the script element in the quintuple MUST be
   set to the script subtag in the Suppress-Script field. This is
   necessary because [RFC3066bis] strongly recommends that users not use
   such a script subtag to form language tags and this document (see
   Section 4.1) recommends that users not use it to form ranges.
   Languages which have a "Suppress-Script" field in the registry are
   predominantly written in that single script, making the subtag
   redundant in forming a language tag or range. Thus if the script
   were not expanded in this manner, a range such as "de-DE" would
   produce a more distant score for content that happened to be labeled
   "de-Latn-DE" than users would expect.

   Any remaining missing components in the language tag are set to "*";
   thus an empty language tag becomes the quintuple ("*", "*", "*", "*",
   "*"). Missing components in the language range are handled similarly
   to extended range lookup: missing internal subtags are expanded to
   "*". Missing end subtags are expanded as the empty string. Thus the
   range "en-US" becomes the quintuple ("en", "*", "US", "", "").

   Here are some examples of language tags, showing their quintuples as
   both language tags and language ranges:

   en-US
      Tag:   (en, *, US, *, *)
      Range: (en, *, US, "", "")

   sr-Latn
      Tag:   (sr, Latn, *, *, *)
      Range: (sr, Latn, "", "", "")

   zh-cmn-Hant
      Tag:   (zh-cmn, Hant, *, *, *)
      Range: (zh-cmn, Hant, "", "", "")

   x-foo
      Tag:   (x-foo, *, *, *, *)
      Range: (x-foo, "", "", "", "")

   en-x-foo
      Tag:   (en, *, *, x-foo, *)
      Range: (en, *, *, x-foo, "")

   i-default
      Tag:   (i-default, *, *, *, *)
      Range: (i-default, "", "", "", "")

   sl-Latn-IT-rozaj
      Tag:   (sl, Latn, IT, rozaj, *)
      Range: (sl, Latn, IT, rozaj, "")

   zh-r-wadegile (hypothetical)
      Tag:   (zh, *, *, *, r-wadegile)
      Range: (zh, *, *, *, r-wadegile)

            Figure 3: Examples of Distance Metric Quintuples

   Each pair of quintuples being compared is assigned a distance value,
   in which small values indicate better matches and large values
   indicate worse ones. The distance between the pair is the sum of the
   distances for each of the corresponding elements of the quintuple.
   If the elements are identical or either one is "*", then the distance
   value between them is zero. Otherwise, it is given by the following
   table:

      256  language mismatch
      128  script mismatch
       32  region mismatch
        4  variant mismatch
        1  extension mismatch

   A value of 0 is a perfect match; 421 (the sum of all five values) is
   no match at all. Different threshold values might be appropriate for
   different applications or protocols.
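
   The summation described above can be sketched, non-normatively, in
   Python. The sketch assumes that both quintuples have already been
   built as described in this section (canonicalized, with "*" and
   empty-string elements filled in); the names WEIGHTS and distance are
   hypothetical and are not defined by this document.

```python
# Per-element weights from the table above, in quintuple order:
# language, script, region, variant, extension.
WEIGHTS = (256, 128, 32, 4, 1)


def distance(range_q, tag_q):
    """Sum the per-element distances between two quintuples."""
    total = 0
    for weight, r, t in zip(WEIGHTS, range_q, tag_q):
        r, t = r.lower(), t.lower()       # comparison is case-insensitive
        # Identical elements, or a wildcard on either side, cost nothing.
        if r != t and "*" not in (r, t):
            total += weight
    return total


# Range "en-US": missing end subtags are the empty string.
en_us = ("en", "*", "US", "", "")
```

   With this range quintuple, the tag quintuple ("en", "*", "GB", "*",
   "*") scores 32 and ("fr", "*", "*", "*", "*") scores 256. An empty
   string in the range matches only a "*" or another empty string,
   which is why a tag that carries a variant or extension that the
   range lacks receives a small non-zero distance.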
   Implementations will usually allow users to choose the most
   appropriate selection value, ranking the matched items based on
   score.

   Examples of various tags' distances from the range "en-US":

   "fr-FR"             288  (language & region mismatch)
   "fr"                256  (language mismatch, region match)
   "en-GB"              32  (region mismatch)
   "en-Latn-US"          0  (all fields match)
   "en-Brai"             0  (range script and tag region are wildcards)
   "en-US-x-foo"         4  (variant mismatch: range is the empty string)
   "en-US-r-wadegile"    1  (extension mismatch: range is the empty
                            string)

   Where a language priority list follows the syntax of the
   "Accept-Language" header defined in [RFC2616] (see Section 14.4) and
   [RFC3282], language ranges without a Q value are given values equal
   to the value of the previous language range in the list (processing
   from first to last). If the first language range has no Q value, it
   is given a value of 1.0. Language ranges with Q values of zero are
   removed. For example, "fr, en;q=0.5, de, it" becomes
   "fr;q=1.0,en;q=0.5,de;q=0.5,it;q=0.5". The distance values given
   above are then divided by the Q values. For example, if the
   language tag "fr-FR" has a distance of 288 from a language range with
   a Q value of 0.8, then the resulting distance is 360 (288 div 0.8).

   Implementations or protocols MAY use different weighting systems than
   the ones described above, as long as the weightings and weighting
   mechanisms are clearly specified. Thus, for example, an
   implementation or protocol could give all language ranges with
   missing Q values a value of 1.0, or give the distance value 1000 to a
   language mismatch. They MAY also use more sophisticated weights that
   depend on the values of the corresponding elements. For example, an
   implementation might assign a small distance to the difference
   between closely related subtags.
        Some examples of closely related subtags might be:

616     Language:
617        no (Norwegian)
618        nb (Bokmal Norwegian)
619        nn (Nynorsk Norwegian)

621     Script:
622        Kata (katakana)
623        Hira (hiragana)

625     Region:
626        US (United States of America)
627        UM (United States Minor Outlying Islands)

629              Figure 6: Examples of Closely Related Subtags

631  3.3. Lookup

633     Lookup is used to select the single information item that best
634     matches the language priority list for a given request. When
635     performing lookup, each language range in the language priority list
636     is considered in turn, according to priority. By contrast with
637     filtering, each language range represents the _most_ specific tag
638     which is an acceptable match. The first information item found with
639     a matching tag, according to the user's priority, is considered the
640     closest match and is the item returned. For example, if the language
641     range is "de-CH", one might expect to receive an information item
642     with the tag "de" but never one with the tag "de-CH-1996". Usually
643     if no content matches the request, a "default" item is returned.

645     For example, if an application inserts some dynamic content into a
646     document, returning an empty string if there is no exact match is not
647     an option. Instead, the application "falls back" until it finds a
648     suitable piece of content to insert. Other examples of lookup might
649     include:

651     o  Selection of a template containing the text for an automated email
652        response.

654     o  Selection of an item containing some text for inclusion in a
655        particular Web page.

657     o  Selection of a string of text for inclusion in an error log.

659     In the lookup scheme, the language range is progressively truncated
660     from the end until a matching piece of content is located. For
661     example, starting with the range "zh-Hant-CN-x-private", the lookup
662     progressively searches for content as shown below:

664     Range to match: zh-Hant-CN-x-private
665     1. zh-Hant-CN-x-private
666     2. zh-Hant-CN
667     3. zh-Hant
668     4. zh
669     5. (default content or the empty tag)

671              Figure 7: Example of a Lookup Fallback Pattern

673     This scheme allows some flexibility in finding content. For example,
674     it provides better results for cases in which data is not available
675     that exactly matches the user request than if the default language
676     for the system or content were returned immediately. Not every
677     specific level of tag granularity is usually available or language
678     content may be sparsely populated, so "falling back" through the
679     subtag sequence provides more opportunity to find a match between
680     available content and the user's request.

682     The default content is implementation defined. It might be content
683     with no language tag; might have an empty value (the built-in
684     attribute xml:lang in [XML10] permits the empty value); might be a
685     particular language designated for that bit of content; or it might
686     be content that is labeled with the tag "i-default" (see [RFC2277]).
687     When performing lookup using a language priority list, the
688     progressive search MUST proceed to consider each language range in
689     the list before finding the default content or empty tag.

691     One common way for an application or implementation to provide for
692     default content is to allow a specific language range to be set as
693     the default for a specific type of request. This language range is
694     then treated as if it were appended to the end of the language
695     priority list as a whole, rather than after each item in the language
696     priority list.

698     For example, if a particular user's language priority list were
699     "fr-FR; zh-Hant" and the program doing the matching had a default
700     language range of "ja-JP", the program would search for content as
701     follows:

702     1. fr-FR
703     2. fr
704     3. zh-Hant // next language
705     4. zh
706     5. (search for the default content)
707        a. ja-JP
708        b. ja
709        c. (implementation defined default)

711             Figure 8: Lookup Using a Language Priority List

712     Implementations SHOULD ignore extensions and unrecognized private-use
713     subtags when performing lookup, since these subtags are usually
714     orthogonal to the user's request.

716     The special language range "*" matches any language tag. In the
717     lookup scheme, this range does not convey enough information by
718     itself to determine which content is most appropriate, since it
719     matches everything. If the language range "*" is the only one in the
720     language priority list, it matches the default content. If the
721     language range "*" is followed by other language ranges, it should be
722     skipped.

724     In some cases, the language priority list might contain one or more
725     extended language ranges (as, for example, when the same language
726     priority list is used as input for both lookup and filtering
727     operations). Wildcard values in an extended language range normally
728     match any value that occurs in that position in a language tag.
729     Since only one item can be returned for any given lookup request,
730     wildcards in a language range have to be processed in a consistent
731     manner or the same request will produce widely varying results.
732     Implementations that accept extended language ranges MUST define
733     which content is returned when more than one item matches the
734     extended language range.

736     For instance, an implementation could return the matching content
737     that is first in ASCII-order. For example, if the language range were
738     "*-CH" and the set of content included "de-CH", "fr-CH", and "it-CH",
739     then the content labeled "de-CH" would be returned.

741     Implementations MAY also map extended language ranges to basic
742     language ranges: if the first subtag is a "*" then the entire range
743     is treated as "*" (which matches the default content), otherwise each
744     wildcard subtag is removed.
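The lookup procedure described in this section can be sketched roughly as follows. This is a minimal illustration, not a normative algorithm: the function names (to_basic_range, fallback_chain, lookup) and the default_range parameter are hypothetical, a return value of None stands in for the implementation-defined default, and the removal of a trailing single-character subtag together with its payload is an assumption read off the fallback pattern in Figure 7. The sketch does not attempt to strip extensions or unrecognized private-use subtags beyond ordinary truncation.

```python
def to_basic_range(language_range):
    """Map an extended range to a basic one: a leading "*" collapses the
    whole range to "*"; any other wildcard subtag is simply dropped."""
    subtags = language_range.split("-")
    if subtags[0] == "*":
        return "*"
    return "-".join(s for s in subtags if s != "*")


def fallback_chain(language_range):
    """Yield the range itself, then progressively shorter prefixes."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # Assumption: a trailing single-character subtag such as "x"
        # cannot stand alone, so it is removed along with its payload.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(priority_list, available_tags, default_range=None):
    """Return the first available tag found, or None to signal that the
    implementation-defined default content should be used."""
    by_lowercase = {tag.lower(): tag for tag in available_tags}
    # "*" conveys no usable preference for lookup, so it is skipped.
    ranges = [r for r in map(to_basic_range, priority_list) if r != "*"]
    # The default range is consulted only after every listed range.
    if default_range is not None:
        ranges.append(default_range)
    for language_range in ranges:
        for candidate in fallback_chain(language_range):
            match = by_lowercase.get(candidate.lower())
            if match is not None:
                return match
    return None


print(list(fallback_chain("zh-Hant-CN-x-private")))
print(lookup(["fr-FR", "zh-Hant"], ["en", "ja", "zh"], default_range="ja-JP"))
```

With the sample data above, the second call tries "fr-FR", "fr", "zh-Hant", and then "zh", which is available, so the default range "ja-JP" is never consulted.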
        For example, if the language range were
745     "en-*-US", then the range would be mapped to "en-US".

747     Where a language priority list contains Q values as in the syntax of
748     the "Accept-Language" header defined in [RFC2616] (see Section 14.4)
749     and [RFC3282], language tags without a Q value are given values equal
750     to the value of the previous language tag (processing from first to
751     last). If the first language tag has no Q value, it is given a value
752     of 1.0. Then language tags with zero Q values are removed. For
753     example, "fr, en;q=0.5, de, it" becomes "fr;q=1.0, en;q=0.5,
754     de;q=0.5, it;q=0.5". The language priority list is then sorted from
755     highest priority to lowest, whereby any two language tags with the
756     same Q value remain in the same order as in the original
757     language priority list. This list is then traversed as described
758     above in doing lookup.

760     Implementations or protocols MAY use different lookup mechanisms
761     than the ones described above, as long as those mechanisms
762     are clearly specified.

764  4. Other Considerations

766     When working with language ranges and matching schemes, there are
767     some additional points that may influence the choice of either.

769  4.1. Choosing Language Ranges

771     Users indicate their language preferences via the choice of a
772     language range or the list of language ranges in a language priority
773     list. The type of matching affects what the best choice is for a
774     given user.

776     Most matching schemes make no attempt to process the semantic meaning
777     of the subtags. The language range (or its subtags) is usually
778     compared in a case-insensitive manner to each language tag being
779     matched, using basic string processing.

781     Users SHOULD avoid subtags that add no distinguishing value to a
782     language range. Generally, the fewer subtags that appear in the
783     language range, the more content the range will match.
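To make the last point concrete, the following sketch compares how much content several ranges select. It assumes the case-insensitive, prefix-based basic filtering defined earlier in this document; the function name basic_filter_matches and the sample content list are hypothetical and serve only as an illustration.

```python
def basic_filter_matches(language_range, tag):
    """Case-insensitive prefix match: the range equals the tag, or is a
    prefix of it that ends exactly at a "-" subtag boundary."""
    rng, candidate = language_range.lower(), tag.lower()
    return rng == "*" or candidate == rng or candidate.startswith(rng + "-")


# Illustrative set of tags attached to available content.
CONTENT = ["en", "en-US", "en-GB", "en-Latn-US", "fr-FR"]

for language_range in ("en", "en-US", "en-Latn"):
    selected = [t for t in CONTENT if basic_filter_matches(language_range, t)]
    print(language_range, "->", selected)
```

In this sample, "en" selects four items, while "en-US" and "en-Latn" each select only one, so each added subtag narrows the result.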

785     Most notably, script subtags SHOULD NOT be used to form a language
786     range in combination with language subtags that have a matching
787     Suppress-Script field in their registry entry. Thus the language
788     range "en-Latn" is probably inappropriate in most cases (because the
789     vast majority of English documents are written in the Latin script
790     and thus the 'en' language subtag has a Suppress-Script field for
791     'Latn' in the registry).

793     When working with tags and ranges, note that extensions and most
794     private-use subtags are orthogonal to language tag matching, in that
795     they specify additional attributes of the text not related to the
796     goals of most matching schemes. Users SHOULD avoid using these
797     subtags in language ranges, since they interfere with the selection
798     of available content. When used in language tags (as opposed to
799     ranges), these subtags normally do not interfere with filtering
800     (Section 3), since they appear at the end of the tag and will match
801     all prefixes.

803     When working with language tags and language ranges, note that:

805     o  Private-use and Extension subtags are normally orthogonal to
806        language tag fallback. Implementations or specifications that use
807        a lookup (Section 3.3) matching scheme often ignore unrecognized
808        private-use and extension subtags when performing language tag
809        fallback. In addition, since these subtags are always at the end
810        of the sequence of subtags, their use in language tags normally
811        doesn't interfere with the use of ranges that omit them in the
812        filtering (Section 3.2) matching schemes described above.
813        However, they do interfere with filtering when used in language
814        ranges and SHOULD be avoided in ranges as a result.

816     o  Applications, specifications, or protocols that choose not to
817        interpret one or more private-use or extension subtags SHOULD NOT
818        remove or modify these extensions in content that they are
819        processing. When a language tag instance is to be used in a
820        specific, known protocol, and is not being passed through to other
821        protocols, language tags MAY be filtered to remove subtags and
822        extensions that are not supported by that protocol. Such
823        filtering SHOULD be avoided, if possible, since it removes
824        information that might be relevant to services on the other end of
825        the protocol that would make use of that information.

827     o  Some applications of language tags might want or need to consider
828        extensions and private-use subtags when matching tags. If
829        extensions and private-use subtags are included in a matching or
830        filtering process that utilizes one of the schemes described in
831        this document, then the implementation SHOULD canonicalize the
832        language tags and/or ranges before performing the matching. Note
833        that language tag processors that claim to be "well-formed"
834        processors as defined in [RFC3066bis] generally fall into this
835        category.

837  4.2. Meaning of Language Tags and Ranges

839     Selecting content using language ranges requires some understanding
840     by users of what they are selecting. A language tag or range
841     identifies a language as spoken (or written, signed or otherwise
842     signaled) by human beings for communication of information to other
843     human beings.

845     If a language tag B contains language tag A as a prefix, then B is
846     typically "narrower" or "more specific" than A. For example, "zh-
847     Hant-TW" is more specific than "zh-Hant".

849     This relationship is not guaranteed in all cases: specifically,
850     languages that begin with the same sequence of subtags are NOT
851     guaranteed to be mutually intelligible, although they might be.

853     For example, the tag "az" shares a prefix with both "az-Latn"
854     (Azerbaijani written using the Latin script) and "az-Arab"
855     (Azerbaijani written using the Arabic script). A person fluent in
856     one script might not be able to read the other, even though the text
857     might be otherwise identical. Content tagged as "az" most probably
858     is written in just one script and thus might not be intelligible to a
859     reader familiar with the other script.

861     Variant subtags in particular seem to represent specific divisions in
862     mutual understanding, since they often encode dialects or other
863     idiosyncratic variations within a language. They also seem to
864     represent relatively low divisions with a high chance of at least
865     limited understanding, although this depends on the specific variant
866     in question.

868     The relationship between the language tag and the information it
869     relates to is defined by the standard describing the context in which
870     it appears. Accordingly, this section can only give possible
871     examples of its usage:

873     o  For a single information object, the associated language tags
874        might be interpreted as the set of languages that are necessary
875        for a complete comprehension of the complete object. Example:
876        Plain text documents.

878     o  For an aggregation of information objects, the associated language
879        tags could be taken as the set of languages used inside components
880        of that aggregation. Examples: Document stores and libraries.

882     o  For information objects whose purpose is to provide alternatives,
883        the associated language tags could be regarded as a hint that the
884        content is provided in several languages, and that one has to
885        inspect each of the alternatives in order to find its language or
886        languages. In this case, the presence of multiple tags might not
887        mean that one needs to be multi-lingual to get complete
888        understanding of the document. Example: MIME multipart/
889        alternative.

891     o  In markup languages, such as HTML and XML, language information
892        can be added to each part of the document identified by the markup
893        structure (including the whole document itself). For example, one
894        could write <span lang="fr">C'est la vie.</span> inside a
895        Norwegian document; the Norwegian-speaking user could then access
896        a French-Norwegian dictionary to find out what the marked section
897        meant. If the user were listening to that document through a
898        speech synthesis interface, this information could be used to signal
899        the synthesizer to appropriately apply French text-to-speech
900        pronunciation rules to that span of text, instead of misapplying
901        the Norwegian rules.

903  4.3. Considerations for Private Use Subtags

905     Private-use subtags require private agreement between the parties
906     that intend to use or exchange language tags that use them and great
907     caution SHOULD be used in employing them in content or protocols
908     intended for general use. Private-use subtags are simply useless for
909     information exchange without prior arrangement.

911     The value and semantic meaning of private-use tags and of the subtags
912     used within such a language tag are not defined. Matching private-
913     use tags using language ranges or extended language ranges can result
914     in unpredictable content being returned.

916  4.4. Length Considerations in Matching

918     RFC 3066 [RFC3066] did not provide an upper limit on the size of
919     language tags or ranges. RFC 3066 did define the semantics of
920     particular subtags in such a way that most language tags or ranges
921     consisted of language and region subtags with a combined total length
922     of up to six characters. Larger tags and ranges (in terms of both
923     subtags and characters) did exist, however.

925     [RFC3066bis] also does not impose a fixed upper limit on the number
926     of subtags in a language tag or range (and thus an upper bound on the
927     size of either). The syntax in that document suggests that,
928     depending on the specific language or range of languages, more
929     subtags (and thus characters) are sometimes necessary as a result.
930     Length considerations and their impact on the selection and
931     processing of tags are described in Section 2.1.1 of that document.

933     An application or protocol MAY choose to limit the length of the
934     language tags or ranges used in matching. Any such limitation SHOULD
935     be clearly documented, and such documentation SHOULD include the
936     disposition of any longer tags or ranges (for example, whether an
937     error value is generated or the language tag or range is truncated).
938     If truncation is permitted it MUST NOT permit a subtag to be divided,
939     since this changes the semantics of the subtag being matched and can
940     result in false positives or negatives.

942     Applications or protocols that restrict storage SHOULD consider the
943     impact of tag or range truncation on the resulting matches. For
944     example, removing the "*" from the end of an extended language range
945     (see Section 2.2) can greatly modify the set of returned matches. A
946     protocol that allows tags or ranges to be truncated at an arbitrary
947     limit, without giving any indication of what that limit is, has the
948     potential for causing harm by changing the meaning of values in
949     substantial ways.

951     In practice, most tags do not require additional subtags or
952     substantially more characters. Additional subtags sometimes add
953     useful distinguishing information, but extraneous subtags interfere
954     with the meaning, understanding, and especially matching of language
955     tags. Since language tags or ranges MAY be truncated by an
956     application or protocol that limits storage, when choosing language
957     tags or ranges users and applications SHOULD avoid adding subtags
958     that add no distinguishing value. In particular, users and
959     implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
960     fields in the registry (defined in Section 3.6 of [RFC3066bis]):
961     these fields provide guidance on when specific additional subtags
962     SHOULD (and SHOULD NOT) be used.
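Truncation that never divides a subtag can be sketched as below. The sketch follows the procedure this section gives for applications that have to truncate: whole subtags are removed from the right, and a single-character subtag left at the end is removed as well. The function name truncate_tag and the choice to raise ValueError when nothing usable remains are assumptions made for illustration only.

```python
def truncate_tag(tag, limit):
    """Shorten a language tag to at most `limit` characters by removing
    whole subtags from the right; a subtag is never cut in the middle."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > limit:
        subtags.pop()
        # A single-character subtag (for example "x" or "a") must not be
        # left dangling at the end, so it is removed as well.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    if not subtags:
        # Assumption: report an error rather than return an empty tag.
        raise ValueError(f"cannot fit {tag!r} into {limit} characters")
    return "-".join(subtags)


LONG_TAG = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(LONG_TAG, 42))  # zh-Latn-CN-variant1-a-extend1-x-wadegile
print(truncate_tag(LONG_TAG, 21))  # zh-Latn-CN-variant1
```

With a limit of 21, the intermediate result "zh-Latn-CN-variant1-a" would fit, but it ends in the single-character subtag "a", so that subtag is dropped too.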

964     Implementations MUST support a limit of at least 33 characters. This
965     limit includes at least one subtag of each non-extension, non-private
966     use type. When choosing a buffer limit, a length of at least 42
967     characters is strongly RECOMMENDED.

969     The practical limit on tags or ranges derived solely from registered
970     values is 42 characters. Implementations MUST be able to handle tags
971     and ranges of this length. Support for tags and ranges of at least
972     62 characters in length is RECOMMENDED. Implementations MAY support
973     longer values, including matching extensive sets of private-use or
974     extension subtags.

976     Applications or protocols which have to truncate a tag MUST do so by
977     progressively removing subtags along with their preceding "-" from
978     the right side of the language tag until the tag is short enough for
979     the given buffer. If the resulting tag ends with a single-character
980     subtag, that subtag and its preceding "-" MUST also be removed. For
981     example:

983     Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
984     1. zh-Latn-CN-variant1-a-extend1-x-wadegile
985     2. zh-Latn-CN-variant1-a-extend1
986     3. zh-Latn-CN-variant1
987     4. zh-Latn-CN
988     5. zh-Latn
989     6. zh

991                  Figure 9: Example of Tag Truncation

993  5. IANA Considerations

995     This document presents no new or existing considerations for IANA.

997  6. Changes

999     This is the first version of this document.

1001    The following changes were put into this document since draft-07:

1003       Added a mention of "*" to the Character Set Considerations section
1004       (D.Ewell)

1006 7. Security Considerations

1008    Language ranges used in content negotiation might be used to infer
1009    the nationality of the sender, and thus identify potential targets
1010    for surveillance. In addition, unique or highly unusual language
1011    ranges or combinations of language ranges might be used to track a
1012    specific individual's activities.

1014    This is a special case of the general problem that anything you send
1015    is visible to the receiving party. It is useful to be aware that
1016    such concerns can exist in some cases.

1018    The evaluation of the exact magnitude of the threat, and any possible
1019    countermeasures, is left to each application or protocol.

1021 8. Character Set Considerations

1023    Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-
1024    MINUS (%x2D). Language ranges also use the character ASTERISK
1025    (%x2A). These characters are present in most character sets, so
1026    presentation or exchange of language tags or ranges should not be
1027    constrained by character set issues.

1029 9. References

1031 9.1. Normative References

1033    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1034               Requirement Levels", BCP 14, RFC 2119, March 1997.

1036    [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1037               Languages", BCP 18, RFC 2277, January 1998.

1039    [RFC3066bis]
1040               Phillips, A., Ed. and M. Davis, Ed., "Tags for the
1041               Identification of Languages", October 2005.

1045    [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
1046               Specifications: ABNF", RFC 4234, October 2005.

1048 9.2. Informative References

1050    [RFC1766]  Alvestrand, H., "Tags for the Identification of
1051               Languages", RFC 1766, March 1995.

1053    [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
1054               Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
1055               Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

1057    [RFC3066]  Alvestrand, H., "Tags for the Identification of
1058               Languages", BCP 47, RFC 3066, January 2001.

1060    [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
1061               May 2002.

1063    [XML10]    Bray (et al), T., "Extensible Markup Language (XML) 1.0",
1064               02 2004.

1066 Appendix A. Acknowledgements

1068    Any list of contributors is bound to be incomplete; please regard the
1069    following as only a selection from the group of people who have
1070    contributed to make this document what it is today.

1072    The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
1073    which is a precursor to this document, made enormous contributions
1074    directly or indirectly to this document and are generally responsible
1075    for the success of language tags.

1077    The following people (in alphabetical order by family name)
1078    contributed to this document:

1080    Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
1081    Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
1082    Patton, Randy Presuhn, Eric van der Poel, Markus Scherer, and many,
1083    many others.

1085    Very special thanks must go to Harald Tveit Alvestrand, who
1086    originated RFCs 1766 and 3066, and without whom this document would
1087    not have been possible.

1089    For this particular document, John Cowan originated the scheme
1090    described in Section 3.2.3. Mark Davis originated the scheme
1091    described in Section 3.3.

1093 Authors' Addresses

1095    Addison Phillips (editor)
1096    Yahoo! Inc

1098    Email: addison at inter dash locale dot com

1100    Mark Davis (editor)
1101    Google

1103    Email: mark dot davis at macchiato dot com

1105 Intellectual Property Statement

1107    The IETF takes no position regarding the validity or scope of any
1108    Intellectual Property Rights or other rights that might be claimed to
1109    pertain to the implementation or use of the technology described in
1110    this document or the extent to which any license under such rights
1111    might or might not be available; nor does it represent that it has
1112    made any independent effort to identify any such rights. Information
1113    on the procedures with respect to rights in RFC documents can be
1114    found in BCP 78 and BCP 79.

1116    Copies of IPR disclosures made to the IETF Secretariat and any
1117    assurances of licenses to be made available, or the result of an
1118    attempt made to obtain a general license or permission for the use of
1119    such proprietary rights by implementers or users of this
1120    specification can be obtained from the IETF on-line IPR repository at
1121    http://www.ietf.org/ipr.

1123    The IETF invites any interested party to bring to its attention any
1124    copyrights, patents or patent applications, or other proprietary
1125    rights that may cover technology that may be required to implement
1126    this standard. Please address the information to the IETF at
1127    ietf-ipr@ietf.org.

1129 Disclaimer of Validity

1131    This document and the information contained herein are provided on an
1132    "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1133    OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1134    ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1135    INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1136    INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1137    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1139 Copyright Statement

1141    Copyright (C) The Internet Society (2006). This document is subject
1142    to the rights, licenses and restrictions contained in BCP 78, and
1143    except as set forth therein, the authors retain all their rights.

1145 Acknowledgment

1147    Funding for the RFC Editor function is currently provided by the
1148    Internet Society.