idnits 2.17.1 draft-ietf-ltru-matching-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 887. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 864. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 871. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 877. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). 
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (April 6, 2006) is 6585 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234) -- Obsolete informational reference (is this intentional?): RFC 1766 (Obsoleted by RFC 3066, RFC 3282) -- Obsolete informational reference (is this intentional?): RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) -- Duplicate reference: RFC2616, mentioned in 'RFC2616errata', was also mentioned in 'RFC2616'. -- Obsolete informational reference (is this intentional?): RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) -- Obsolete informational reference (is this intentional?): RFC 3066 (Obsoleted by RFC 4646, RFC 4647) Summary: 4 errors (**), 0 flaws (~~), 3 warnings (==), 12 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group A. Phillips, Ed. 3 Internet-Draft Yahoo! Inc. 4 Obsoletes: 3066 (if approved) M. Davis, Ed. 
5 Expires: October 8, 2006 Google 6 April 6, 2006 8 Matching of Language Tags 9 draft-ietf-ltru-matching-12 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on October 8, 2006. 36 Copyright Notice 38 Copyright (C) The Internet Society (2006). 40 Abstract 42 This document describes different mechanisms for comparing and 43 matching language tags. Possible algorithms for language negotiation 44 or content selection, filtering, and lookup are described. This 45 document, in combination with RFC 3066bis (Ed.: replace "3066bis" 46 with the RFC number assigned to draft-ietf-ltru-registry-14), 47 replaces RFC 3066, which replaced RFC 1766. 49 Table of Contents 51 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 52 2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4 53 2.1. Basic Language Range . . . . . . . . . . . . . . . . . . . 4 54 2.2. Extended Language Range . . . . . . . . . . . . . . . . . 5 55 2.3. The Language Priority List . . . . . . . . . . . . . . . . 5 56 3. 
Types of Matching . . . . . . . . . . . . . . . . . . . . . . 7 57 3.1. Choosing a Type of Matching . . . . . . . . . . . . . . . 7 58 3.2. Implementation Considerations . . . . . . . . . . . . . . 8 59 3.3. Filtering . . . . . . . . . . . . . . . . . . . . . . . . 9 60 3.3.1. Basic Filtering . . . . . . . . . . . . . . . . . . . 10 61 3.3.2. Extended Filtering . . . . . . . . . . . . . . . . . . 10 62 3.4. Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . 12 63 3.4.1. Default Values . . . . . . . . . . . . . . . . . . . . 14 64 4. Other Considerations . . . . . . . . . . . . . . . . . . . . . 16 65 4.1. Choosing Language Ranges . . . . . . . . . . . . . . . . . 16 66 4.2. Meaning of Language Tags and Ranges . . . . . . . . . . . 17 67 4.3. Considerations for Private Use Subtags . . . . . . . . . . 17 68 4.4. Length Considerations for Language Ranges . . . . . . . . 18 69 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 70 6. Security Considerations . . . . . . . . . . . . . . . . . . . 20 71 7. Character Set Considerations . . . . . . . . . . . . . . . . . 21 72 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 73 8.1. Normative References . . . . . . . . . . . . . . . . . . . 22 74 8.2. Informative References . . . . . . . . . . . . . . . . . . 22 75 Appendix A. Acknowledgements . . . . . . . . . . . . . . . . . . 23 76 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 24 77 Intellectual Property and Copyright Statements . . . . . . . . . . 25 79 1. Introduction 81 Human beings on our planet have, past and present, used a number of 82 languages. There are many reasons why one would want to identify the 83 language used when presenting or requesting information or in some 84 specific set of information items. 
86 Applications, protocols, or specifications that use language 87 identifiers, such as the language tags defined in [RFC3066bis], 88 sometimes need to match language tags to a user's language 89 preferences. 91 This document defines a syntax (called a language range (Section 2)) 92 for specifying items in the user's list of language preferences 93 (called a language priority list (Section 2.3)), as well as several 94 schemes for selecting or filtering sets of language tags by comparing 95 the language tags to the user's preferences. Applications, 96 protocols, or specifications will have varying needs and requirements 97 that affect the choice of a suitable matching scheme. 99 This document describes: how to indicate a user's preferences using 100 language ranges; three schemes for matching these ranges to a set of 101 language tags; and the various practical considerations that apply to 102 implementing and using these schemes. 104 This document, in combination with [RFC3066bis] (Ed.: replace 105 "3066bis" globally in this document with the RFC number assigned to 106 draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced 107 [RFC1766]. 109 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 110 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 111 document are to be interpreted as described in [RFC2119]. 113 2. The Language Range 115 Language tags [RFC3066bis] are used to help identify languages, 116 whether spoken, written, signed, or otherwise signaled, for the 117 purpose of communication. Applications, protocols, or specifications 118 that use language tags are often faced with the problem of 119 identifying sets of content that share certain language attributes. 120 For example, HTTP/1.1 [RFC2616] describes one such mechanism in its 121 discussion of the Accept-Language header (Section 14.4), which is 122 used when selecting content from servers based on the language of 123 that content. 
125 It is, thus, useful to have a mechanism for identifying sets of 126 language tags that share specific attributes. This allows users to 127 select or filter the language tags based on specific requirements. 128 Such an identifier is called a "language range". 130 There are different types of language range, whose specific 131 attributes vary according to their application. Language ranges are 132 similar to language tags: they consist of a sequence of subtags 133 separated by hyphens. In a language range, each subtag MUST either 134 be a sequence of ASCII alphanumeric characters or the single 135 character '*' (%x2A, ASTERISK). The character '*' is a "wildcard" 136 that matches any sequence of subtags. The meaning and uses of 137 wildcards vary according to the type of language range. 139 Language tags and thus language ranges are to be treated as 140 case-insensitive: there exist conventions for the capitalization of some 141 of the subtags, but these MUST NOT be taken to carry meaning. 142 Matching of language tags to language ranges MUST be done in a 143 case-insensitive manner. 145 2.1. Basic Language Range 147 A "basic language range" consists of a sequence of alphanumeric 148 subtags separated by hyphens. It is defined by the following ABNF 149 [RFC4234]: 151 language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*" 152 alphanum = ALPHA / DIGIT 154 Basic language ranges (originally described by HTTP/1.1 [RFC2616] and 155 later [RFC3066]) have the same syntax as an [RFC3066] language tag or 156 are the single character "*". They differ from the language tags 157 defined in [RFC3066bis] only in that there is no requirement that 158 they be "well-formed" or be validated against the IANA Language 159 Subtag Registry. Such ill-formed ranges will probably not match 160 anything. Note that the ABNF [RFC4234] in [RFC2616] is incorrect, 161 since it disallows the use of digits anywhere in the 'language-range' 162 (see: [RFC2616errata]). 164 2.2. 
Extended Language Range 166 Occasionally users will wish to select a set of language tags based 167 on the presence of specific subtags. An "extended language range" 168 describes a user's language preference as an ordered sequence of 169 subtags. For example, a user might wish to select all language tags 170 that contain the region subtag 'CH' (Switzerland). Extended language 171 ranges are useful in specifying a particular sequence of subtags that 172 appear in the set of matching tags without having to specify all of 173 the intervening subtags. 175 An extended language range can be represented by the following ABNF: 177 extended-language-range = (1*8ALPHA / "*") 178 *("-" (1*8alphanum / "*")) 180 Figure 2: Extended Language Range 182 The wildcard subtag '*' can occur in any position in the extended 183 language range, where it matches any sequence of subtags that might 184 occur in that position in a language tag. However, wildcards outside 185 the first position are ignored by Extended Filtering (see Section 186 3.3.2). The use or absence of one or more wildcards cannot be taken 187 to imply that a certain number of subtags will appear in the matching 188 set of language tags. 190 2.3. The Language Priority List 192 A user's language preferences will often need to specify more than 193 one language range and thus users often need to specify a prioritized 194 list of language ranges in order to best reflect their language 195 preferences. This is especially true for speakers of minority 196 languages. A speaker of Breton in France, for example, may specify 197 "br" followed by "fr", meaning that if Breton is available, it is 198 preferred, but otherwise French is the best alternative. It can get 199 more complex: a user may wish to fall back from Skolt Sami to 200 Northern Sami to Finnish. 202 A "language priority list" is a prioritized or weighted list of 203 language ranges. 
One well known example of such a list is the 204 "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section 205 14.4) and RFC 3282 [RFC3282]. 207 The various matching operations described in this document include 208 considerations for using a language priority list. This document 209 does not define the syntax for a language priority list; defining 210 such a syntax is the responsibility of the protocol, application, or 211 specification that uses it. When given as examples in this document, 212 language priority lists will be shown as a quoted sequence of ranges 213 separated by commas, like this: "en, fr, zh-Hant" (which would be 214 read as "English before French before Chinese as written in the 215 Traditional script"). 217 A simple list of ranges is considered to be in descending order of 218 priority. Other language priority lists provide "quality weights" 219 for the language ranges in order to specify the relative priority of 220 the user's language preferences. An example of this would be the use 221 of "q" values in the syntax of the "Accept-Language" header (defined 222 in [RFC2616], Section 14.4, and [RFC3282]). 224 3. Types of Matching 226 Matching language ranges to language tags can be done in many 227 different ways. This section describes three such matching schemes, 228 as well as the considerations for choosing between them. Protocols 229 and specifications requiring conformance to this specification MUST 230 clearly indicate the particular mechanism used in selecting or 231 matching language tags. 233 There are two types of matching scheme in this document. A matching 234 scheme that produces zero or more matching language tags is called 235 "filtering". A matching scheme that produces exactly one match for a 236 given request is called "lookup". 238 3.1. Choosing a Type of Matching 240 Applications, protocols, and specifications are faced with the 241 decision of what type of matching to use. 
Sometimes, different 242 styles of matching are suited to different kinds of processing within 243 a particular application or protocol. 245 This document describes three types of matching: 247 1. Basic Filtering (Section 3.3.1) matches a language priority list 248 consisting of basic language ranges (Section 2.1) to sets of 249 language tags. 251 2. Extended Filtering (Section 3.3.2) matches a language priority 252 list consisting of extended language ranges (Section 2.2) to sets 253 of language tags. 255 3. Lookup (Section 3.4) matches a language priority list consisting 256 of basic language ranges to sets of language tags to find the one 257 _exact_ language tag that best matches the range. 259 Filtering can be used to produce a set of results (such as a 260 collection of documents) by comparing the user's preferences to a set 261 of language tags. For example, when performing a search, one might 262 use filtering to limit the results to items tagged as being in the 263 French language. Filtering can also be used when deciding whether to 264 perform a language-sensitive process on some content. For example, a 265 process might cause paragraphs whose language tag matched the 266 language range "nl" to be displayed in italics within a document. 268 Lookup produces the single result that best matches the user's 269 preferences from the list of available tags, so it is useful in cases 270 in which a single item is required (and for which only a single item 271 can be returned). For example, if a process were to insert a human 272 readable error message into a protocol header, it might select the 273 text based on the user's language priority list. Since the process 274 can return only one item, it must choose a single item and it must 275 return some item, even if none of the content's language tags match 276 the language priority list supplied by the user. 278 3.2. 
Implementation Considerations 280 Language tag matching is a tool, and does not by itself specify a 281 complete procedure for the use of language tags. Such procedures are 282 intimately tied to the application protocol in which they occur. 283 When specifying a protocol operation using matching, the protocol 284 MUST specify: 286 o Which type(s) of language tag matching it uses 288 o Whether the operation returns a single result (lookup) or a 289 possibly empty set of results (filtering) 291 o For lookup, what the default item is (or the sequence of 292 operations or configuration information used to determine the 293 default) when no matching tag is found. For instance, a protocol 294 might define the result as failure of the operation, an empty 295 value, returning some protocol defined or implementation defined 296 default, or returning i-default [RFC2277]. 298 Applications, protocols, and specifications are not required to 299 validate or understand any of the semantics of the language tags or 300 ranges or of the subtags in them, nor do they require access to the 301 IANA Language Subtag Registry (see Section 3 in [RFC3066bis]). This 302 simplifies implementation. 304 However, designers of applications, protocols, or specifications are 305 encouraged to use the information from the IANA Language Subtag 306 Registry to support canonicalizing language tags and ranges in order 307 to map grandfathered and obsolete tags or subtags into modern 308 equivalents. 310 Applications, protocols, or specifications that canonicalize ranges 311 MUST either perform matching operations with both the canonical and 312 original (unmodified) form of the range or MUST also canonicalize 313 each tag for the purposes of comparison. 315 Note that canonicalizing language ranges makes certain operations 316 impossible. 
For example, an implementation that canonicalizes the 317 language range "art-lojban" to use the more modern "jbo" cannot be 318 used to select just the items with the older tag. 320 Applications, protocols, or specifications that use basic ranges 321 might sometimes receive extended language ranges instead. An 322 application, protocol, or specification MUST choose to: a) map 323 extended language ranges to basic ranges using the algorithm below, 324 b) reject any extended language ranges in the language priority list 325 that are not valid basic language ranges, or c) treat each extended 326 language range as if it were a basic language range, which will have 327 the same result as ignoring them, since these ranges won't match 328 any valid language tags. 330 An extended language range is mapped to a basic language range as 331 follows: if the first subtag is a '*' then the entire range is 332 treated as "*", otherwise each wildcard subtag is removed. For 333 example, if the language range were "en-*-US", then the range would 334 be mapped to "en-US". 336 Applications, protocols, or specifications, in addressing their 337 particular requirements, can offer pre-processing or configuration 338 options. For example, an implementation could allow a user to 339 associate or map a particular language range to a different value. 340 Such a user might wish to associate the language range subtags 'nn' 341 (Nynorsk Norwegian) and 'nb' (Bokmal Norwegian) with the more general 342 subtag 'no' (Norwegian). Or perhaps the user could associate the 343 range "zh-Hans" (Chinese as written in the Simplified script) with 344 the language tag "zh-CN" (Chinese as used in China, where the 345 Simplified script is predominant) because content is available with 346 that tag. Documentation on how the ranges or tags are altered, 347 prioritized, or compared in the subsequent match in such an 348 implementation will assist users in making the best configuration 349 choices. 351 3.3. 
Filtering 353 Filtering is used to select the set of language tags that matches a 354 given language priority list. It is called "filtering" because this 355 set might contain no items at all or it might contain an arbitrarily 356 large number of matching items: as many items as match the language 357 priority list, thus "filtering out" the non-matching items. 359 In filtering, each language range represents the _least_ specific 360 language tag (that is, the language tag with the fewest 361 subtags) which is an acceptable match. All of the language tags in 362 the matching set of tags will have an equal or greater number of 363 subtags than the language range. Every non-wildcard subtag in the 364 language range will appear in every one of the matching language 365 tags. For example, if the language priority list consists of the 366 range "de-CH", one might see tags such as "de-CH-1996" but one will 367 never see a tag such as "de" (because the 'CH' subtag is missing). 369 If the language priority list (see Section 2.3) contains more than 370 one range, the content returned is typically ordered in descending 371 level of preference, but it MAY be unordered, according to the needs 372 of the application or protocol. 374 Some examples of applications where filtering might be appropriate 375 include: 377 o Applying a style to sections of a document in a particular set of 378 languages. 380 o Displaying the set of documents containing a particular set of 381 keywords written in a specific set of languages. 383 o Selecting all email items written in a specific set of languages. 385 o Selecting audio files spoken in a particular language. 387 Filtering seems to imply that there is a semantic relationship 388 between language tags that share the same prefix. 
While this is 389 often the case, it is not always true and users should note that the 390 set of language tags that match a specific language range does not 391 necessarily represent mutually intelligible languages. 393 3.3.1. Basic Filtering 395 Basic filtering uses basic language ranges. Each basic language 396 range in the language priority list is considered in turn, according 397 to priority. A language range matches a particular language tag if, 398 in a case-insensitive comparison, it exactly equals the tag, or if it 399 exactly equals a prefix of the tag such that the first character 400 following the prefix is "-". For example, the language-range "de-de" 401 matches the language tag "de-DE-1996", but not the language tags "de-Deva" 402 or "de-Latn-DE". 404 The special range "*" in a language priority list matches any tag. A 405 protocol which uses language ranges MAY specify additional rules 406 about the semantics of "*"; for instance, HTTP/1.1 [RFC2616] 407 specifies that the range "*" matches only languages not matched by 408 any other range within an "Accept-Language" header. 410 Basic filtering is identical to the type of matching described in 411 [RFC3066], Section 2.5 (Language-range). 413 3.3.2. Extended Filtering 415 Extended filtering compares extended language ranges to language 416 tags. Each extended language range in the language priority list is 417 considered in turn, according to priority. A language range matches 418 a particular language tag if their lists of subtags match. To 419 determine a match: 421 1. Split both the extended language range and the language tag being 422 compared into a list of subtags by dividing on the hyphen (%x2D) 423 character. Two subtags match if either they are the same when 424 compared case-insensitively or the language range's subtag is the 425 wildcard '*'. 427 2. Begin with the first subtag in each list. 
If the first subtag in 428 the range does not match the first subtag in the tag, the overall 429 match fails. Otherwise, move to the next subtag in both the 430 range and the tag. 432 3. While there are more subtags left in the language range's list: 434 A. If the subtag currently being examined in the range is the 435 wildcard ('*'), move to the next subtag in the range and 436 continue with the loop. 438 B. Else, if there are no more subtags in the language tag's 439 list, the match fails. 441 C. Else, if the current subtag in the range's list matches the 442 current subtag in the language tag's list, move to the next 443 subtag in both lists and continue with the loop. 445 D. Else, if the language tag's subtag is a "singleton" (a single 446 letter or digit, which includes the private-use subtag 'x') 447 the match fails. 449 E. Else, move to the next subtag in the language tag's list and 450 continue with the loop. 452 4. When the language range's list has no more subtags, the match 453 succeeds. 455 Subtags not specified, including those at the end of the language 456 range, are thus treated as if assigned the wildcard value '*'. Much 457 like basic filtering, extended filtering selects content with 458 arbitrarily long tags that share the same initial subtags as the 459 language range. In addition, extended filtering selects language 460 tags that contain any intermediate subtags not specified in the 461 language range. For example, the extended language range "de-*-DE" 462 (or its synonym "de-DE") matches all of the following tags: 464 de-DE 466 de-Latn-DE 468 de-Latf-DE 470 de-de 472 de-DE-x-goethe 474 de-Latn-DE-1996 476 de-Deva-DE 478 The same range does not match any of the following tags for the 479 reasons shown: 481 de (missing 'DE') 483 de-x-DE (singleton 'x' occurs before 'DE') 485 de-Deva ('Deva' not equal to 'DE') 487 Note: [RFC3066bis] defines each type of subtag (language, script, 488 region, and so forth) according to position, size, and content. 
This 489 means that subtags in a language range can only match specific types 490 of subtags in a language tag. For example, a subtag such as 'Latn' 491 is always a script subtag (unless it follows a singleton) while a 492 subtag such as 'nedis' can only match the equivalent variant subtag. 493 One such difference is that two-letter subtags in initial position 494 have a different type (language) than two-letter subtags in later 495 positions (region). This is the reason why a wildcard in the 496 extended language range is significant in the first position and 497 subsequently ignored. 499 3.4. Lookup 501 Lookup is used to select the single language tag that best matches 502 the language priority list for a given request. When performing 503 lookup, each language range in the language priority list is 504 considered in turn, according to priority. By contrast with 505 filtering, each language range represents the _most_ specific tag 506 which is an acceptable match. The first matching tag found, 507 according to the user's priority, is considered the closest match and 508 is the item returned. For example, if the language range is "de-ch", 509 a lookup operation can produce content with the tags "de" or "de-CH" 510 but never content with the tag "de-CH-1996". If no language tag 511 matches the request, the "default" value is returned. 513 For example, if an application inserts some dynamic content into a 514 document, returning an empty string if there is no exact match is not 515 an option. Instead, the application "falls back" until it finds a 516 matching language tag associated with a suitable piece of content to 517 insert. Examples of lookup might include: 519 o Selection of a template containing the text for an automated email 520 response. 522 o Selection of an item containing some text for inclusion in a 523 particular Web page. 525 o Selection of a string of text for inclusion in an error log. 
527 o Selection of an audio file to play as a prompt in a phone system. 529 In the lookup scheme, the language range is progressively truncated 530 from the end until a matching language tag is located. Single letter 531 or digit subtags (including both the letter 'x' which introduces 532 private-use sequences, and the subtags that introduce extensions) are 533 removed at the same time as their closest trailing subtag. For 534 example, starting with the range "zh-Hant-CN-x-private1-private2", 535 the lookup progressively searches for content as shown below: 537 Range to match: zh-Hant-CN-x-private1-private2 538 1. zh-Hant-CN-x-private1-private2 539 2. zh-Hant-CN-x-private1 540 3. zh-Hant-CN 541 4. zh-Hant 542 5. zh 543 6. (default) 545 Figure 3: Example of a Lookup Fallback Pattern 547 This allows some flexibility in finding a match. For example, lookup 548 provides better results for cases in which content is not available 549 that exactly matches the user request than if the default language 550 for the system or content were returned immediately. Language 551 material is sometimes sparsely populated, so an item might not be 552 available at every level of tag granularity. "Falling back" through 553 the subtag sequence provides more opportunity to find a match between 554 available language tags and the user's request. 556 Extensions and unrecognized private-use subtags might be unrelated to 557 a particular application of lookup. Since these subtags come at the 558 end of the subtag sequence, they are removed first during the 559 fallback process and usually pose no barrier to interoperability. 560 However, an implementation MAY remove these from ranges prior to 561 performing the lookup (provided the implementation also removes them 562 from the tags being compared). 
Such modification is internal to the 563 implementation and applications, protocols, or specifications SHOULD 564 NOT remove or modify subtags in content that they return or forward, 565 because this removes information that might be used elsewhere. 567 The special language range "*" matches any language tag. In the 568 lookup scheme, this range does not convey enough information by 569 itself to determine which language tag is most appropriate, since it 570 matches everything. If the language range "*" is followed by other 571 language ranges, it is skipped. If the language range "*" is the 572 only one in the language priority list or if no other language range 573 follows, the default value is computed and returned. 575 In some cases, the language priority list might contain one or more 576 extended language ranges (as, for example, when the same language 577 priority list is used as input for both lookup and filtering 578 operations). Wildcard values in an extended language range normally 579 match any value that can occur in that position in a language tag. 580 Since only one item can be returned for any given lookup request, 581 wildcards in a language range have to be processed in a consistent 582 manner or the same request will produce widely varying results. 583 Applications, protocols, or specifications that accept extended 584 language ranges MUST define which item is returned when more than one 585 item matches the extended language range. 587 For example, an implementation could return the matching tag that is 588 first in ASCII-order. If the language range were "*-CH" and the set 589 of tags included "de-CH", "fr-CH", and "it-CH", then the tag "de-CH" 590 would be returned. Another possibility would be for an 591 implementation to map the extended language ranges to basic ranges. 593 3.4.1. Default Values 595 Each application, protocol, or specification MUST define the 596 defaulting behavior when no tag matches the language priority list. 
   What this action consists of strongly depends on how lookup is being
   applied. Some examples of defaulting behavior might include:

   o  return an item with no language tag or an item of a non-linguistic
      nature, such as an image or sound

   o  return a null string as the language tag value, in cases where the
      protocol permits the empty value (see, for example, "xml:lang" in
      [XML10])

   o  return a particular language tag designated for the operation

   o  return the language tag "i-default" (see: [RFC2277])

   o  return an error condition or error message

   o  return a list of available languages for the user to select from

   When performing lookup using a language priority list, the
   progressive search MUST process each language range in the list
   before seeking or calculating the default.

   The default value MAY be calculated and might include additional
   searching or matching. Applications, protocols, or specifications
   can specify different ways in which users can specify or override the
   defaults.

   One common way to provide for a default is to allow a specific
   language range to be set as the default for a specific type of
   request. If this approach is chosen, this language range MUST be
   treated as if it were appended to the end of the language priority
   list as a whole, rather than after each item in the language priority
   list. The application, protocol, or specification MUST also define
   the defaulting behavior if that search fails to find a matching tag
   or item.

   For example, if a particular user's language priority list were
   "fr-FR, zh-Hant" and the program doing the matching had a default
   language range of "ja-JP", the program would search as follows:

   1. fr-FR
   2. fr
   3. zh-Hant // next language
   4. zh
   5. ja-JP // now searching for the default content
   6. ja
   7. (implementation defined default)

           Figure 4: Lookup Using a Language Priority List

4.
Other Considerations

   When working with language ranges and matching schemes, there are
   some additional points that may influence the choice of either.

4.1.  Choosing Language Ranges

   Users indicate their language preferences via the choice of a
   language range or the list of language ranges in a language priority
   list. The type of matching affects what the best choice is for a
   user.

   Most matching schemes make no attempt to process the semantic meaning
   of the subtags. The language range is compared, in a case-
   insensitive manner, to each language tag being matched, using basic
   string processing. Users SHOULD select language ranges that are
   well-formed, valid language tags according to [RFC3066bis]
   (substituting wildcards as appropriate in extended language ranges).

   Applications are encouraged to canonicalize language tags and ranges
   by using the Preferred-Value from the IANA Language Subtag Registry
   for tags or subtags which have been deprecated. If the user is
   working with content that might use the older form, the user might
   want to include both the new and old forms in a language priority
   list. For example, the tag "art-lojban" is deprecated. The subtag
   'jbo' is supposed to be used instead, so the user might use it to
   form the language range. Or the user might include both in a
   language priority list: "jbo, art-lojban".

   Users SHOULD avoid subtags that add no distinguishing value to a
   language range. When filtering, the fewer the number of subtags that
   appear in the language range, the more content the range will
   probably match, while in lookup unnecessary subtags might cause
   "better", more-specific content to be skipped in favor of less
   specific content. For example, the range "de-Latn-DE" would return
   content tagged "de" instead of content tagged "de-DE", even though
   the latter is probably a better match.
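
   As a non-normative illustration, the lookup fallback of Section 3.4
   and the effect of an unnecessary subtag can be sketched in Python.
   The names fallback_chain and lookup, and the "default" parameter,
   are hypothetical and are not defined by this document. The sketch
   assumes basic language ranges only (no wildcard inside a range),
   skips the range "*" as described for lookup, and expects the caller
   to append any default language range to the end of the language
   priority list.

```python
def fallback_chain(language_range):
    """Yield the range, then progressively shorter prefixes of it.

    Hypothetical helper: truncates from the end, one subtag at a time.
    """
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A single letter or digit subtag (such as 'x') is removed at
        # the same time as the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(priority_list, available_tags, default=None):
    """Return the first available tag matched by the priority list.

    Assumption: 'default' stands in for whatever defaulting behavior
    the application defines; it is returned only after every range in
    the list has been processed.
    """
    # Comparison is case-insensitive; remember the original spelling.
    available = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            # "*" carries no preference in lookup, so it is skipped.
            continue
        for candidate in fallback_chain(language_range):
            match = available.get(candidate.lower())
            if match is not None:
                return match
    return default
```

   With the tags "de" and "de-DE" available, lookup(["de-Latn-DE"],
   ["de", "de-DE"]) returns "de", because the fallback sequence is
   "de-Latn-DE", "de-Latn", "de" and never visits "de-DE". The shorter
   range "de-DE" returns "de-DE". A default language range such as
   "ja-JP" is handled by passing ["fr-FR", "zh-Hant", "ja-JP"], which
   searches in the order shown in Figure 4.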

   Whether a subtag adds distinguishing value can depend on the context
   of the request. For example, a user who reads both Simplified and
   Traditional Chinese, but who prefers Simplified, might use the range
   "zh" for filtering (matching all items that user can read) but "zh-
   Hans" for lookup (making sure that user gets the preferred form if
   it's available, but the fallback to "zh" will still work). On the
   other hand, content in this case should be labeled as "zh-Hans" (or
   "zh-Hant" if that applies) for filtering, but for lookup, if there is
   either "zh-Hans" content or "zh-Hant" content, then one of them (the
   one considered 'default') should also be available under a simple
   "zh". Note that the user can create a language priority list "zh-
   Hans, zh" that delivers the best possible results for both schemes.
   If the user cannot be sure which scheme is being used (or if more
   than one might be applied to a given request), the user SHOULD
   specify the most specific (largest number of subtags) range first and
   then supply shorter prefixes later in the list to ensure that
   filtering returns a complete set of tags.

   Many languages are written predominantly in a single script. This is
   usually recorded in the Suppress-Script field in that language
   subtag's registry entry. For these languages, script subtags SHOULD
   NOT be used to form a language range. Thus the language range "en-
   Latn" is inappropriate in most cases (because the vast majority of
   English documents are written in the Latin script and thus the 'en'
   language subtag has a Suppress-Script field for 'Latn' in the
   registry).

   When working with tags and ranges, note that extensions and most
   private-use subtags are orthogonal to language tag matching, in that
   they specify additional attributes of the text not related to the
   goals of most matching schemes.
   Users SHOULD avoid using these
   subtags in language ranges, since they interfere with the selection
   of available content. When used in language tags (as opposed to
   ranges), these subtags normally do not interfere with filtering
   (Section 3), since they appear at the end of the tag and will match
   all prefixes. Lookup (Section 3.4) implementations are advised to
   ignore unrecognized private-use and extension subtags when performing
   language tag fallback.

4.2.  Meaning of Language Tags and Ranges

   Selecting language tags using language ranges requires some
   understanding by users of what they are selecting. The meaning of
   the various subtags in a language range is identical to their
   meaning in a language tag (see Section 4.2 in [RFC3066bis]), with the
   addition that the wildcard "*" represents any matching sequence of
   values.

4.3.  Considerations for Private Use Subtags

   Private-use subtags require private agreement between the parties
   that intend to use or exchange language tags that use them. They
   SHOULD NOT be used in content or protocols intended for general use.
   Private-use subtags are simply useless for information exchange
   without prior arrangement.

   The value and semantic meaning of private-use tags and of the subtags
   used within such a language tag are not defined. Matching private-
   use tags using language ranges or extended language ranges can result
   in unpredictable content being returned.

4.4.  Length Considerations for Language Ranges

   Language ranges are very similar to language tags in terms of content
   and usage. The same types of restrictions on length that apply to
   language tags can also apply to language ranges. See [RFC3066bis]
   Section 4.3 (Length Considerations).

5.  IANA Considerations

   This document presents no new or existing considerations for IANA.

6.
Security Considerations

   Language ranges used in content negotiation might be used to infer
   the nationality of the sender, and thus identify potential targets
   for surveillance. In addition, unique or highly unusual language
   ranges or combinations of language ranges might be used to track a
   specific individual's activities.

   This is a special case of the general problem that anything you send
   is visible to the receiving party. It is useful to be aware that
   such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application or protocol.

7.  Character Set Considerations

   Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-
   MINUS (%x2D). Language ranges also use the character ASTERISK
   (%x2A). These characters are present in most character sets, so
   presentation or exchange of language tags or ranges should not be
   constrained by character set issues.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, January 1998.

   [RFC3066bis]
              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
              Identification of Languages", October 2005, .

   [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 4234, October 2005.

8.2.  Informative References

   [RFC1766]  Alvestrand, H., "Tags for the Identification of
              Languages", RFC 1766, March 1995.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC2616errata]
              IETF, "HTTP/1.1 Specification Errata", October 2004, .

   [RFC3066]  Alvestrand, H., "Tags for the Identification of
              Languages", BCP 47, RFC 3066, January 2001.

   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
              May 2002.

   [XML10]    Bray (et al), T., "Extensible Markup Language (XML) 1.0",
              February 2004.

Appendix A.  Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
   which is a precursor to this document, made enormous contributions
   directly or indirectly to this document and are generally responsible
   for the success of language tags.

   The following people (in alphabetical order by family name)
   contributed to this document:

   Harald Alvestrand, Stephane Bortzmeyer, Jeremy Carroll, John Cowan,
   Martin Duerst, Frank Ellermann, Doug Ewell, Debbie Garside, Marion
   Gunn, Kent Karlsson, Ira McDonald, M. Patton, Randy Presuhn, Eric van
   der Poel, Markus Scherer, and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

Authors' Addresses

   Addison Phillips (editor)
   Yahoo! Inc.

   Email: addison@inter-locale.com

   Mark Davis (editor)
   Google

   Email: mark.davis@macchiato.com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.
   Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2006). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.