idnits 2.17.1 draft-ietf-ltru-matching-06.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 990. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 967. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 974. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 980. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 658 has weird spacing: '...becomes en-US...' == Line 659 has weird spacing: '...becomes en-La...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). 
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 16, 2005) is 6729 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234) -- Obsolete informational reference (is this intentional?): RFC 1766 (Obsoleted by RFC 3066, RFC 3282) -- Obsolete informational reference (is this intentional?): RFC 3066 (Obsoleted by RFC 4646, RFC 4647) Summary: 5 errors (**), 0 flaws (~~), 5 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group A. Phillips, Ed. 3 Internet-Draft Quest Software 4 Obsoletes: 3066 (if approved) M. Davis, Ed. 5 Expires: May 20, 2006 IBM 6 November 16, 2005 8 Matching Tags for the Identification of Languages 9 draft-ietf-ltru-matching-06 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 
18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on May 20, 2006. 36 Copyright Notice 38 Copyright (C) The Internet Society (2005). 40 Abstract 42 This document describes different mechanisms for comparing, matching, 43 and evaluating language tags. Possible algorithms for language 44 negotiation and content selection are described. This document, in 45 combination with RFC 3066bis (replace "3066bis" with the RFC number 46 assigned to draft-ietf-ltru-registry-14), replaces RFC 3066, which 47 replaced RFC 1766. 49 Table of Contents 51 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 52 2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4 53 2.1. Lists of Language Ranges . . . . . . . . . . . . . . . . . 4 54 2.2. Basic Language Range . . . . . . . . . . . . . . . . . . . 4 55 2.3. Extended Language Range . . . . . . . . . . . . . . . . . 5 56 3. Types of Matching . . . . . . . . . . . . . . . . . . . . . . 8 57 3.1. Choosing a Type of Matching . . . . . . . . . . . . . . . 8 58 3.2. Filtering . . . . . . . . . . . . . . . . . . . . . . . . 9 59 3.2.1. Filtering with Basic Language Ranges . . . . . . . . . 10 60 3.2.2. Filtering with Extended Language Ranges . . . . . . . 10 61 3.2.3. Distance Metric Filtering . . . . . . . . . . . . . . 11 62 3.3. Lookup . . . . . 
. . . . . . . . . . . . . . . . . . . . . 13 63 4. Other Considerations . . . . . . . . . . . . . . . . . . . . . 16 64 4.1. Meaning of Language Tags and Ranges . . . . . . . . . . . 16 65 4.2. Considerations for Private Use Subtags . . . . . . . . . . 17 66 4.3. Length Considerations in Matching . . . . . . . . . . . . 17 67 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20 68 6. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 69 7. Security Considerations . . . . . . . . . . . . . . . . . . . 22 70 8. Character Set Considerations . . . . . . . . . . . . . . . . . 23 71 9. References . . . . . . . . . . . . . . . . . . . . . . . . . . 24 72 9.1. Normative References . . . . . . . . . . . . . . . . . . . 24 73 9.2. Informative References . . . . . . . . . . . . . . . . . . 24 74 Appendix A. Acknowledgements . . . . . . . . . . . . . . . . . . 25 75 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 26 76 Intellectual Property and Copyright Statements . . . . . . . . . . 27 78 1. Introduction 80 Human beings on our planet have, past and present, used a number of 81 languages. There are many reasons why one would want to identify the 82 language used when presenting or requesting information. 84 Information about a user's language preferences commonly needs to be 85 identified so that appropriate processing can be applied. For 86 example, the user's language preferences in a browser can be used to 87 select web pages appropriately. Language preferences can also be 88 used to select among tools (such as dictionaries) to assist in the 89 processing or understanding of content in different languages. 91 Given a set of language identifiers, such as those defined in 92 [RFC3066bis], various mechanisms can be envisioned for performing 93 language negotiation and tag matching. Applications, protocols, or 94 specifications will have varying needs and requirements that will 95 affect the choice of a suitable mechanism. 
Protocols and 96 specifications SHOULD clearly indicate the particular mechanism used 97 in selecting or matching language tags. 99 This document defines several mechanisms for matching, selecting, or 100 filtering content whose natural language is identified using Language 101 Tags [RFC3066bis], as well as the syntax (called a "language range") 102 associated with each of these mechanisms for specifying the user's 103 language preferences. 105 This document, in combination with [RFC3066bis] (replace "3066bis" 106 globally in this document with the RFC number assigned to 107 draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced 108 [RFC1766]. 110 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 111 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 112 document are to be interpreted as described in [RFC2119]. 114 2. The Language Range 116 Language Tags [RFC3066bis] are used to identify the language of some 117 information item or content. Applications or protocols that use 118 language tags are often faced with the problem of identifying sets of 119 content that share certain language attributes. For example, HTTP 120 1.1 [RFC2616] describes language ranges in its discussion of the 121 Accept-Language header (Section 14.4), which is used for selecting 122 content from servers based on the language of that content. 124 When selecting content according to its language, it is useful to 125 have a mechanism for identifying sets of language tags that share 126 specific attributes. This allows users to select or filter content 127 based on specific requirements. Such an identifier is called a 128 "Language Range". 130 2.1. Lists of Language Ranges 132 When users specify a language preference they often need to specify a 133 prioritized list of language ranges in order to best reflect their 134 language requirements for the matching operation. This is especially 135 true for speakers of minority languages. 
A speaker of Breton in 136 France, for example, may specify "br" followed by "fr", meaning that 137 if Breton is available, it is preferred, but otherwise French is the 138 best alternative. It can get more complex: a speaker may wish to 139 fall back from Skolt Sami to Northern Sami to Finnish. 141 A "Language Priority List" consists of a prioritized or weighted list 142 of language ranges. One well known example of such a list is the 143 "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section 144 14.4) and RFC 3282 [RFC3282]. The various matching operations 145 described in this document include considerations for using a 146 language priority list. 148 2.2. Basic Language Range 150 A "Basic Language Range" identifies the set of content whose language 151 tags begin with the same sequence of subtags. A basic language range 152 is identified by its 'language-range' tag, by adapting the 153 ABNF [RFC4234] from HTTP/1.1 [RFC2616]: 155 language-range = language-tag / "*" 156 language-tag = 1*8[alphanum] *["-" 1*8alphanum] 157 alphanum = ALPHA / DIGIT 159 That is, a language-range has the same syntax as a language-tag or is 160 the single character "*". Basic Language Ranges imply that there is 161 a semantic relationship between language tags that share the same 162 prefix. While this is often the case, it is not always true and 163 users should note that the set of language tags that match a specific 164 language-range may not be mutually intelligible. 166 Basic language ranges were originally described in [RFC3066] and HTTP 167 1.1 [RFC2616] (where they are referred to as simply a "language 168 range"). 170 Users SHOULD avoid subtags that add no distinguishing value to a 171 language range. For example, script subtags SHOULD NOT be used to 172 form a language range with language subtags which have a matching 173 Suppress-Script field in their registry record. 
Thus the language 174 range "en-Latn" is probably inappropriate in most cases (because the 175 vast majority of English documents are written in the Latin script and 176 thus the 'en' language subtag has a Suppress-Script field for 'Latn' 177 in the registry). 179 Language tags and thus language ranges are to be treated as case 180 insensitive: there exist conventions for the capitalization of some 181 of the subtags, but these MUST NOT be taken to carry meaning. 182 Matching of language tags to language ranges MUST be done in a case 183 insensitive manner. 185 When working with tags and ranges, note that extensions and most 186 private use subtags are generally orthogonal to language tag fallback 187 and users SHOULD avoid using these subtags in language ranges, since 188 they will often interfere with the selection of available language 189 content. Since these subtags are always at the end of the sequence 190 of subtags, they don't normally interfere with the use of prefixes 191 for matching in the schemes described below. 193 Note that when working with basic language ranges, no attempt is made 194 to process the semantics of the tags or ranges in any way. The 195 language tag and language range are compared in a case insensitive 196 manner using basic string processing. Thus the choice of subtags in 197 both the language tag and language range may affect the results 198 produced. 200 2.3. Extended Language Range 202 A Basic Language Range does not always provide the most appropriate 203 way to specify a user's preferences. Sometimes it is beneficial to 204 define a more granular matching scheme that takes advantage of the 205 internal structure of language tags, by allowing the user to specify, 206 for example, the value of a specific field in a language tag or to 207 indicate which values are of interest in filtering or selecting the 208 content. 
210 In an extended language range, the identifier takes the form of a 211 series of subtags which must consist of well-formed subtags or the 212 special subtag "*". For example, the language range "en-*-US" 213 specifies a primary language of 'en', followed by any script subtag, 214 followed by the region subtag 'US'. 216 An extended language range can be represented by the following ABNF: 217 extended-language-range = range ; a range 218 / privateuse ; private use tag 219 / grandfathered ; grandfathered registrations 221 range = (language 222 ["-" script] 223 ["-" region] 224 *("-" variant) 225 *("-" extension) 226 ["-" privateuse]) 228 language = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code 229 / 4ALPHA ; reserved for future use 230 / 5*8ALPHA ; registered language subtag 231 / "*" ; ... or wildcard 233 extlang = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*")) 234 ; reserved for future use 235 ; wildcard can only appear 236 ; at the end 238 script = 4ALPHA ; ISO 15924 code 239 / "*" ; or wildcard 241 region = 2ALPHA ; ISO 3166 code 242 / 3DIGIT ; UN M.49 code 243 / "*" ; ... or wildcard 245 variant = 5*8alphanum ; registered variants 246 / (DIGIT 3alphanum) ; 247 / "*" ; ... or wildcard 249 extension = singleton *("-" (2*8alphanum)) [ "-*" ] 250 ; extension subtags 251 ; wildcard can only appear 252 ; at the end 254 singleton = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT 255 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9" 256 ; Single letters: x/X is reserved for private use 258 privateuse = ("x"/"X") 1*("-" (1*8alphanum)) 260 grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum)) 261 ; grandfathered registration 262 ; Note: i is the only singleton 263 ; that starts a grandfathered tag 265 alphanum = (ALPHA / DIGIT) ; letters and numbers 267 A field not present in the middle of an extended language range MAY 268 be treated as if the field contained a "*". For example, the range 269 "en-US" MAY be considered to be equivalent to the range "en-*-US". 
270 This also means that multiple wildcards can be collapsed (so that 271 "en-*-*-US" is equivalent to "en-*-US"). 273 When working with tags and ranges, users SHOULD note the following: 275 1. Private-use and Extension subtags are normally orthogonal to 276 language tag fallback. Implementations or specifications that 277 use a lookup (Section 3.3) matching scheme SHOULD ignore 278 unrecognized private-use and extension subtags when performing 279 language tag fallback. Since these subtags are always at the end 280 of the sequence of subtags, they don't normally interfere with 281 the use of prefixes for matching in the schemes described below. 283 2. Applications, specifications, or protocols that choose not to 284 interpret one or more private-use or extension subtags SHOULD NOT 285 remove or modify these extensions in content that they are 286 processing. When a language tag instance is to be used in a 287 specific, known protocol, and is not being passed through to 288 other protocols, language tags MAY be filtered to remove subtags 289 and extensions that are not supported by that protocol. Such 290 filtering SHOULD be avoided, if possible, since it removes 291 information that might be relevant if services on the other end 292 of the protocol would make use of that information. 294 3. Some applications of language tags might want or need to consider 295 extensions and private-use subtags when matching tags. If 296 extensions and private-use subtags are included in a matching or 297 filtering process that utilizes one of the schemes described 298 in this document, then the implementation SHOULD canonicalize the 299 language tags and/or ranges before performing the matching. Note 300 that language tag processors that claim to be "well-formed" 301 processors as defined in [RFC3066bis] generally fall into this 302 category. 304 There are several matching algorithms or schemes which can be applied 305 when matching extended language ranges to language tags. 
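The wildcard-collapsing rule stated earlier in this section (that "en-*-*-US" is equivalent to "en-*-US") can be illustrated with a minimal sketch. The function name collapse_wildcards is an assumption introduced here for illustration and is not defined by this document. The sketch only merges adjacent "*" subtags; it does not validate the range against the ABNF.

```python
def collapse_wildcards(language_range: str) -> str:
    """Merge runs of adjacent "*" subtags into a single "*" (hypothetical helper)."""
    collapsed = []
    for subtag in language_range.split("-"):
        # Skip a wildcard that directly follows another wildcard.
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue
        collapsed.append(subtag)
    return "-".join(collapsed)


if __name__ == "__main__":
    print(collapse_wildcards("en-*-*-US"))
```

Ranges without adjacent wildcards, such as "en-US" or "en-*-US", are returned unchanged.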
307 3. Types of Matching 309 Matching language ranges to language tags can be done in a number of 310 different ways. This section describes the different types of 311 matching scheme, as well as the considerations for choosing between 312 them. 314 There are two basic types of matching scheme: those that produce an 315 open-ended set of content (called "filtering") and those that produce 316 a single information item for a given request (called "lookup"). 318 A key difference between these two types of matching scheme is that 319 the language range for filtering operations is always the _least_ 320 specific tag one will accept as a match, while for lookup operations 321 the language range is always the _most_ specific tag. 323 3.1. Choosing a Type of Matching 325 Applications, protocols, and specifications are faced with the 326 decision of what type of matching to use. Sometimes, different 327 styles of matching might be suited for different kinds of processing 328 within a particular application or protocol. 330 Filtering can be used to produce a set of results (such as a 331 collection of documents). For example, if using a search engine, one 332 might use filtering to limit the results to documents written in 333 French. It can also be used when deciding whether to perform some 334 processing that is language sensitive on some content. For example, 335 a process might cause paragraphs whose language tag matched the 336 language range "nl" to be displayed in italics within a document. 338 This document describes three types of filtering: 340 1. Basic Filtering (Section 3.2.1) is used to match content using 341 basic language ranges (Section 2.2). It is compatible with 342 implementations that do not produce extended language ranges. 344 2. Extended Range Filtering (Section 3.2.2) is used to match content 345 using extended language ranges (Section 2.3). Newer implementations 346 SHOULD use this form of filtering in preference to basic 347 filtering. 349 3. 
Scored Filtering (Section 3.2.3) produces an ordered set of 350 content using either basic or extended language ranges. It 351 should be used when the quality of the match within a specific 352 language range is important, as when presenting a list of 353 documents resulting from a search. 355 Lookup (Section 3.3) is used when each request MUST produce exactly 356 one piece of content. For example, a Web server might use the 357 Accept-Language HTTP header to choose which language to return a 358 custom 404 page in: since it can return only one page, it must choose 359 a single item and it must return some item, even if no content 360 matches the language ranges supplied by the user. 362 Most types of matching in this document are designed so that 363 implementations do not have to examine the values of the subtags 364 supplied and, except for scored filtering, they do not need access to 365 the Language Subtag Registry nor do they require the use of valid 366 subtags in either language tags or language ranges. This has great 367 benefit for speed and simplicity of implementation. 369 Implementations might also wish to use semantic information external 370 to the language tags when performing fallback. For example, the 371 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal 372 Norwegian) might both be usefully matched to the more general subtag 373 'no' (Norwegian). Or an implementation might infer that content 374 labeled "zh-CN" is more likely to match the range "zh-Hans" than 375 equivalent content labeled "zh-TW". 377 3.2. Filtering 379 Filtering is used to select the set of content that matches a given 380 prefix. It is called "filtering" because this set of content may 381 contain no items at all or it may return an arbitrary number of 382 matching items--as many as match the language range used to specify 383 the items, thus filtering out the non-matching content. 
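As a minimal sketch of prefix-based filtering, the following applies the basic language range rule given in Section 3.2.1: a range matches a tag if the two are equal, or if the range is a prefix of the tag and the next character is "-", compared case insensitively. The names basic_match and basic_filter are assumptions introduced for illustration and are not part of this document.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Return True if a basic language range matches a language tag."""
    if language_range == "*":
        return True  # the special range "*" matches any tag
    r, t = language_range.lower(), tag.lower()
    # Exact match, or a prefix match that ends on a subtag boundary.
    return t == r or t.startswith(r + "-")


def basic_filter(language_range: str, tags: list[str]) -> list[str]:
    """Keep only the tags matched by the range; the result may be empty."""
    return [tag for tag in tags if basic_match(language_range, tag)]


if __name__ == "__main__":
    print(basic_filter("de-de", ["de-DE-1996", "de-Deva", "de", "de-DE"]))
```

Requiring the "-" boundary is what keeps the range "de-de" from matching the tag "de-Deva".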
385 In filtering, the language range represents the _least_ specific tag 386 which is an acceptable match. That is, all of the language tags in 387 the set of filtered content will have an equal or greater number of 388 subtags than the language range. For example, if the language range 389 is "de-CH", one might see matching content with the tag "de-CH-1996" 390 but one will never see a match with the tag "de". 392 If the language priority list (see Section 2.1) contains more than 393 one range, the content returned is typically ordered in descending 394 level of preference. 396 Some examples where filtering might be appropriate include: 398 o Applying a style to sections of a document in a particular 399 language range. 401 o Displaying the set of documents containing a particular set of 402 keywords written in a specific language. 404 o Selecting all email items written in a specific range of languages. 406 Filtering can produce either an ordered or an unordered set of results. 407 For example, applying formatting to a document based on the language 408 of specific pieces of content does not require the content to be 409 ordered. It is sufficient to know whether a specific piece of 410 content matches or does not match. A search application, on the 411 other hand, probably would put the results into a priority order. 413 If an ordered set is desired, as described above, then the 414 application or protocol needs to determine the relative "quality" of 415 the match between different language tags and the language range. 417 This measurement is called a "distance metric". A distance metric 418 assigns a numeric value to the comparison of each language tag to a 419 language range and represents the 'distance' between the two. A 420 distance of zero means that they are identical, a small distance 421 indicates that they are very similar, and a large distance indicates 422 that they are very different. 
Using a distance metric, 423 implementations can, for example, allow users to select a threshold 424 distance for a match to be "successful" while filtering or it can use 425 the numeric value to order the results. 427 3.2.1. Filtering with Basic Language Ranges 429 When filtering using a basic language range, the language range 430 matches a language tag if it exactly equals the tag, or if it exactly 431 equals a prefix of the tag such that the first character following 432 the prefix is "-". (That is, the language-range "de-de" matches the 433 language tag "de-DE-1996", but not the language tag "de-Deva".) 435 The special range "*" matches any tag. A protocol which uses 436 language ranges MAY specify additional rules about the semantics of 437 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only 438 languages not matched by any other range within an "Accept-Language" 439 header. 441 3.2.2. Filtering with Extended Language Ranges 443 In the Extended Range Matching scheme, each extended language range 444 in the language priority list is considered in turn, according to 445 priority. The subtags in each extended language range are compared 446 to the corresponding subtags in the language tag being examined. The 447 subtag from the range is considered to match if it exactly matches 448 the corresponding subtag in the tag or the range's subtag has the 449 value "*" (which matches all subtags, including the empty subtag). 450 Extended Range Matching is an extension of basic matching 451 (Section 3.2.1): the language range represents the least specific tag 452 which is an acceptable match. 454 Private use subtags MAY be specified in the language range and MUST 455 NOT be ignored when matching. 457 Subtags not specified, including those at the end of the language 458 range, are assigned the value "*". This makes each range into a 459 prefix much like that used in basic language range matching. 
For 460 example, the extended language range "zh-*-CN" matches all of the 461 following tags because the unspecified variant field is expanded to 462 "*": 464 zh-Hant-CN 466 zh-CN 468 zh-Hans-CN 470 zh-CN-x-wadegile 472 zh-Latn-CN-boont 474 zh-cmn-Hans-CN-x-private 476 3.2.3. Distance Metric Filtering 478 Both basic and extended language range filtering produce simple 479 boolean matches. Sometimes it may be beneficial to provide an array 480 of results with different levels of matching, for example, sorting 481 results based on the overall "quality" of the match. Distance metric 482 filtering provides a way to generate these quality values. 484 First both the extended language range and the language tags to be 485 matched to it must be canonicalized by mapping grandfathered and 486 obsolete tags into modern equivalents. 488 The language range and the language tags are then transformed into 489 quintuples of elements of the form (language, script, country, 490 variant, extension). Any extended language subtags are considered 491 part of the language element; private use subtag sequences are 492 considered part of the language element if in the initial position in 493 the tag and part of the variant element if not. Language subtags 494 'und', 'mul', and the script subtag 'Zyyy' are converted to "*". 496 Missing components in the language-tag are set to "*"; thus a "*" 497 pattern becomes the quintuple ("*", "*", "*", "*", "*"). Missing 498 components in the extended language-range are handled similarly to 499 extended range lookup: missing internal subtags are expanded to "*". 501 Missing end subtags are expanded as the empty string. Thus a pattern 502 "en-US" becomes the quintuple ("en","*","US","",""). 
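The two expansion rules above differ only in how missing components are filled, which the following minimal sketch illustrates. It assumes that the tag or range has already been split into named components by some other means; the FIELDS tuple, the dictionary input, and the names tag_quintuple and range_quintuple are assumptions introduced for illustration and are not defined by this document. The sketch does no parsing or canonicalization.

```python
FIELDS = ("language", "script", "region", "variant", "extension")


def tag_quintuple(fields: dict) -> tuple:
    """Language tag: every missing component becomes "*"."""
    return tuple(fields.get(name, "*") for name in FIELDS)


def range_quintuple(fields: dict) -> tuple:
    """Language range: missing internal components become "*",
    missing trailing components become the empty string."""
    present = [i for i, name in enumerate(FIELDS) if name in fields]
    last = max(present) if present else -1
    return tuple(
        fields[name] if name in fields else ("*" if i < last else "")
        for i, name in enumerate(FIELDS)
    )


if __name__ == "__main__":
    en_us = {"language": "en", "region": "US"}
    print(tag_quintuple(en_us))
    print(range_quintuple(en_us))
```

For the components of "en-US", the tag form fills every gap with "*", while the range form yields "*" for the missing script and empty strings for the missing variant and extension.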
504 Here are some examples of language-tags and their quintuples: 506 en-US ("en","*","US","*","*") 508 sr-Latn ("sr","Latn","*","*","*") 510 zh-cmn-Hant ("zh-cmn","Hant","*","*","*") 512 x-foo ("x-foo","*","*","*","*") 514 en-x-foo ("en","*","*","x-foo","*") 516 i-default ("i-default","*","*","*","*") 518 sl-Latn-IT-rozaj ("sl","Latn","IT","rozaj","*") 520 zh-r-wadegile ("zh","*","*","*","r-wadegile") // hypothetical 522 Each language-range/language-tag pair being compared is assigned a 523 distance value, whereby small values indicate better matches and 524 large values indicate worse ones. The distance between the pair is 525 the sum of the distances for each of the corresponding elements of 526 the quintuple. If the elements are identical or one is '*', then the 527 distance value between them is zero. Otherwise, it is given by the 528 following table: 529 256 language mismatch 530 128 script mismatch 531 32 region mismatch 532 4 variant mismatch 533 1 extension mismatch 535 A value of 0 is a perfect match; 421 is no match at all. Different 536 threshold values might be appropriate for different applications or 537 protocols. Implementations will usually allow users to choose the 538 most appropriate selection value, ranking the matched items based on 539 score. 541 Examples of various tags' distances from the range "en-US": 543 "fr" 256 (language mismatch, region match) 544 "en-GB" 32 (region mismatch) 545 "en-Latn-US" 0 (all fields match) 546 "en-Brai" 32 (region mismatch) 547 "en-US-x-foo" 4 (variant mismatch: range is the empty string) 548 "en-US-r-wadegile" 1 (extension mismatch: range is the empty string) 549 Implementations or protocols sometimes might wish to use more 550 sophisticated weights that depend on the values of the corresponding 551 elements. 
For example, depending on the domain, an implementation 552 might give a small distance to the difference between the language 553 subtag 'no' and the closely related language subtags 'nb' or 'nn'; or 554 between the script subtags 'Kata' and 'Hira'; or between the region 555 subtags 'US' and 'UM'. 557 3.3. Lookup 559 Lookup is used to select the single information item that best 560 matches the language priority list for a given request. In lookup, 561 each language range in the language priority list represents the 562 _most_ specific tag which is an acceptable match; only the closest 563 matching item according to the user's priority is returned. For 564 example, if the language range is "de-CH", one might expect to 565 receive an information item with the tag "de" but never one with the 566 tag "de-CH-1996". Usually, if no content matches the request, a 567 "default" item is returned. 569 For example, if an application inserts some dynamic content into a 570 document, returning an empty string if there is no exact match is not 571 an option. Instead, the application "falls back" until it finds a 572 suitable piece of content to insert. Other examples of lookup might 573 include: 575 o Selection of a template containing the text for an automated email 576 response. 578 o Selection of a graphic containing text for inclusion in a 579 particular Web page. 581 o Selection of a string of text for inclusion in an error log. 583 In the Lookup scheme, the language range is progressively truncated 584 from the end until a matching piece of content is located. For 585 example, starting with the range "zh-Hant-CN-x-private", the lookup 586 would progressively search for content as shown below: 588 Range to match: zh-Hant-CN-x-private 589 1. zh-Hant-CN-x-private 590 2. zh-Hant-CN 591 3. zh-Hant 592 4. zh 593 5. (default content or the empty tag) 595 Figure 5: Example of a Lookup Fallback Pattern 596 This scheme allows some flexibility in finding content. 
It also 597 typically provides better results when data is not available at a 598 specific level of tag granularity or is sparsely populated (than if 599 the default language for the system or content were used). 601 The language range "*" matches any language tag. In the lookup 602 scheme, this language range does not convey enough information to 603 determine which content is most appropriate. If this language range 604 is the only one in the language priority list, it matches the default 605 content. If this language range is followed by other language 606 ranges, it should be skipped. 608 When performing lookup using a language priority list, the 609 progressive search MUST proceed to consider each language range 610 before finding the default content or empty tag. The default content 611 might be content with no language tag (or with an empty value, as 612 with xml:lang in the XML specification), or it might be a particular 613 language designated for that bit of content. 615 One common way to provide for default content is to allow a specific 616 language range to be set as the default for a specific type of 617 request. This language range is then treated as if it were appended 618 to the end of the language priority list, rather than after each item 619 in the language priority list. 621 For example, if a particular user's language priority list were 622 "fr-FR; zh-Hant" and the program doing the matching had a default 623 language range of "ja-JP", the program would search for content as 624 follows: 625 1. fr-FR 626 2. fr 627 3. zh-Hant // next language 628 4. zh 629 5. (return default content) 630 a. ja-JP 631 b. ja 632 c. (empty tag or other default content) 634 Figure 6: Lookup Using a Language Priority List 636 In some cases, the language priority list might contain one or more 637 extended language ranges (as, for example, when the same language 638 priority list is used as input for both lookup and filtering 639 operations). 
   Wildcard values in an extended language range are
   supposed to match any value that occurs in that position in a
   language tag. Since only one item can be returned for any given
   lookup request, the wildcards must be processed in a predictable
   manner (or the same request might produce widely varying results).

   Thus, for each range in the language priority list, the following
   rules must be applied to produce a basic language range for use in
   the fallback mechanism:

   1. If the first subtag in the extended language range is a "*" then
      the entire range is converted to "*".

   2. For each subsequent subtag, if the value is a "*" then that
      subtag and its preceding hyphen are removed.

   For example:

   *-US         becomes   *
   en-*-US      becomes   en-US
   en-Latn-*    becomes   en-Latn

   Figure 7: Transformation of Extended Language Ranges

   For the language priority list "*-US; fr-*-FR; zh-Hant", the fallback
   pattern would be:
   1. * (skipped)
   2. fr-FR
   3. fr
   4. zh-Hant
   5. zh
   6. (default content)

   Figure 8: Extended Language Range Fallback Example

4. Other Considerations

   When working with language ranges and matching schemes, there are
   some additional points that may influence the choice of either.

4.1. Meaning of Language Tags and Ranges

   Selecting content using language ranges requires some understanding
   by users of what they are selecting. A language tag or range
   identifies a language as spoken (or written, signed or otherwise
   signaled) by human beings for communication of information to other
   human beings.

   If a language tag B contains language tag A as a prefix, then B is
   typically "narrower" or "more specific" than A. For example,
   "zh-Hant-TW" is more specific than "zh-Hant".
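
   The prefix relationship can be tested mechanically. The following
   Python sketch is illustrative only: the function name is_narrower is
   hypothetical, and it assumes that the prefix is compared subtag by
   subtag (so that a prefix never ends in the middle of a subtag) and
   without regard to case.

```python
def is_narrower(tag_b, tag_a):
    """Return True if tag_a is a whole-subtag prefix of a longer tag_b."""
    a = tag_a.lower().split("-")
    b = tag_b.lower().split("-")
    # tag_b must have more subtags than tag_a and begin with all of them.
    return len(b) > len(a) and b[:len(a)] == a
```

   Here is_narrower("zh-Hant-TW", "zh-Hant") is true, while a tag is not
   considered narrower than itself. A true result says only that one
   tag extends the other.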

   This relationship is not guaranteed in all cases: specifically,
   languages that begin with the same sequence of subtags are NOT
   guaranteed to be mutually intelligible, although they might be.

   For example, the tag "az" shares a prefix with both "az-Latn"
   (Azerbaijani written using the Latin script) and "az-Arab"
   (Azerbaijani written using the Arabic script). A person fluent in
   one script might not be able to read the other, even though the text
   might be otherwise identical. Content tagged as "az" most probably
   is written in just one script and thus might not be intelligible to a
   reader familiar with the other script.

   Variant subtags in particular seem to represent specific divisions in
   mutual understanding, since they often encode dialects or other
   idiosyncratic variations within a language.

   The relationship between the language tag and the information it
   relates to is defined by the standard describing the context in which
   it appears. Accordingly, this section can only give possible
   examples of its usage:

   o  For a single information object, the associated language tags
      might be interpreted as the set of languages that are necessary
      for a complete comprehension of the complete object. Example:
      Plain text documents.

   o  For an aggregation of information objects, the associated language
      tags could be taken as the set of languages used inside components
      of that aggregation. Examples: Document stores and libraries.

   o  For information objects whose purpose is to provide alternatives,
      the associated language tags could be regarded as a hint that the
      content is provided in several languages, and that one has to
      inspect each of the alternatives in order to find its language or
      languages. In this case, the presence of multiple tags might not
      mean that one needs to be multi-lingual to get complete
      understanding of the document.
      Example: MIME multipart/alternative.

   o  In markup languages, such as HTML and XML, language information
      can be added to each part of the document identified by the markup
      structure (including the whole document itself). For example, one
      could write <span lang="fr">C'est la vie.</span> inside a
      Norwegian document; the Norwegian-speaking user could then access
      a French-Norwegian dictionary to find out what the marked section
      meant. If the user were listening to that document through a
      speech synthesis interface, this information could be used to
      signal the synthesizer to appropriately apply French
      text-to-speech pronunciation rules to that span of text, instead
      of misapplying the Norwegian rules.

4.2. Considerations for Private Use Subtags

   Private-use subtags require private agreement between the parties
   that intend to use or exchange language tags that use them and great
   caution SHOULD be used in employing them in content or protocols
   intended for general use. Private-use subtags are simply useless for
   information exchange without prior arrangement.

   The value and semantic meaning of private-use tags and of the subtags
   used within such a language tag are not defined. Matching
   private-use tags using language ranges or extended language ranges
   can result in unpredictable content being returned.

4.3. Length Considerations in Matching

   RFC 3066 [RFC3066] did not provide an upper limit on the size of
   language tags or ranges. RFC 3066 did define the semantics of
   particular subtags in such a way that most language tags or ranges
   consisted of language and region subtags with a combined total length
   of up to six characters. Larger tags and ranges (in terms of both
   subtags and characters) did exist, however.

   [RFC3066bis] also does not impose a fixed upper limit on the number
   of subtags in a language tag or range (and thus an upper bound on the
   size of either).
   The syntax in that document suggests that,
   depending on the specific language or range of languages, more
   subtags (and thus characters) are sometimes necessary as a result.

   Length considerations and their impact on the selection and
   processing of tags are described in Section 2.1.1 of that document.

   An application or protocol MAY choose to limit the length of the
   language tags or ranges used in matching. Any such limitation SHOULD
   be clearly documented, and such documentation SHOULD include the
   disposition of any longer tags or ranges (for example, whether an
   error value is generated or the language tag or range is truncated).
   If truncation is permitted it MUST NOT permit a subtag to be divided,
   since this changes the semantics of the subtag being matched and can
   result in false positives or negatives.

   Applications or protocols that restrict storage SHOULD consider the
   impact of tag or range truncation on the resulting matches. For
   example, removing the "*" from the end of an extended language range
   (see Section 2.3) can greatly modify the set of returned matches. A
   protocol that allows tags or ranges to be truncated at an arbitrary
   limit, without giving any indication of what that limit is, has the
   potential for causing harm by changing the meaning of values in
   substantial ways.

   In practice, most tags do not require additional subtags or
   substantially more characters. Additional subtags sometimes add
   useful distinguishing information, but extraneous subtags interfere
   with the meaning, understanding, and especially matching of language
   tags. Since language tags or ranges MAY be truncated by an
   application or protocol that limits storage, when choosing language
   tags or ranges users and applications SHOULD avoid adding subtags
   that add no distinguishing value.
   In particular, users and
   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
   fields in the registry (defined in Section 3.6 of [RFC3066bis]):
   these fields provide guidance on when specific additional subtags
   SHOULD (and SHOULD NOT) be used.

   Implementations MUST support a limit of at least 33 characters. This
   limit includes at least one subtag of each non-extension, non-private
   use type. When choosing a buffer limit, a length of at least 42
   characters is strongly RECOMMENDED.

   The practical limit on tags or ranges derived solely from registered
   values is 42 characters. Implementations MUST be able to handle tags
   and ranges of this length. Support for tags and ranges of at least
   62 characters in length is RECOMMENDED. Implementations MAY support
   longer values, including matching extensive sets of private use or
   extension subtags.

   Applications or protocols which have to truncate a tag MUST do so by
   progressively removing subtags along with their preceding "-" from
   the right side of the language tag until the tag is short enough for
   the given buffer. If the resulting tag ends with a single-character
   subtag, that subtag and its preceding "-" MUST also be removed. For
   example:

   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh

   Figure 9: Example of Tag Truncation

5. IANA Considerations

   This document presents no new or existing considerations for IANA.

6. Changes

   This is the first version of this document.

   The following changes were put into this document since draft-05:

      Modified the ABNF to match changes in [RFC3066bis] (K.Karlsson)

      Matched the references and reference formats to [RFC3066bis]
      (K.Karlsson)

      Various edits, additions, and emendations to deal with changes in
      the Last Call of draft-registry as well as cleaning up the text.

      Changed from 'defined' to 'identifies' in Section 4.1. (M.Gunn)

      Reorganized the text and broke it into sections (M.Duerst)

      Modified occurrences of the word "application" to refer to
      "applications or protocols" or otherwise be specific (E. van der
      Poel)

      Removed "Extended Language Range Lookup", merging it with other
      text on lookup to form a single scheme. (M.Davis)

      Fixed or removed obsolete or dangling references (Ed.)

      Added an introduction to section 4 and added one sentence to make
      it flow better to the start of section 4.1. (Ed.)

7. Security Considerations

   Language ranges used in content negotiation might be used to infer
   the nationality of the sender, and thus identify potential targets
   for surveillance. In addition, unique or highly unusual language
   ranges or combinations of language ranges might be used to track
   specific individuals' activities.

   This is a special case of the general problem that anything you send
   is visible to the receiving party. It is useful to be aware that
   such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application or protocol.

8. Character Set Considerations

   The syntax of language tags and language ranges permits only the
   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters
   are present in most character sets, so presentation of language tags
   should not present any character set issues.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC3066bis]
              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
              Identification of Languages", October 2005.

   [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 4234, October 2005.

9.2. Informative References

   [RFC1766]  Alvestrand, H., "Tags for the Identification of
              Languages", RFC 1766, March 1995.

   [RFC3066]  Alvestrand, H., "Tags for the Identification of
              Languages", BCP 47, RFC 3066, January 2001.

   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
              May 2002.

Appendix A. Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
   which is a precursor to this document, made enormous contributions
   directly or indirectly to this document and are generally responsible
   for the success of language tags.

   The following people (in alphabetical order by family name)
   contributed to this document:

   Jeremy Carroll, John Cowan, Martin Duerst, Frank Ellermann, Doug
   Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M. Patton, Randy
   Presuhn, Eric van der Poel, and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

   For this particular document, John Cowan originated the scheme
   described in Section 3.2.3. Mark Davis originated the scheme
   described in Section 3.3.

Authors' Addresses

   Addison Phillips (editor)
   Quest Software

   Email: addison dot phillips at quest dot com

   Mark Davis (editor)
   IBM

   Email: mark dot davis at ibm dot com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.