Network Working Group                                   A. Phillips, Ed.
Internet-Draft                                               Yahoo! Inc.
Obsoletes: 3066 (if approved)                              M. Davis, Ed.
Expires: November 19, 2006                                        Google
                                                            May 18, 2006

                        Matching of Language Tags
                       draft-ietf-ltru-matching-13

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 19, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document describes a syntax, called a "language-range", for
   specifying items in a user's language preferences, called a "language
   priority list". It also describes different mechanisms for comparing
   and matching these to language tags. Two kinds of matching
   mechanisms, filtering and lookup, are defined. Filtering produces a
   (potentially empty) set of language tags, whereas lookup produces a
   single language tag. Possible applications include language
   negotiation or content selection. This document, in combination with
   RFC 3066bis (Ed.: replace "3066bis" with the RFC number assigned to
   draft-ietf-ltru-registry-14), replaces RFC 3066, which replaced RFC
   1766.

Table of Contents

   1. Introduction
   2. The Language Range
      2.1. Basic Language Range
      2.2. Extended Language Range
      2.3. The Language Priority List
   3. Types of Matching
      3.1. Choosing a Matching Scheme
      3.2. Implementation Considerations
      3.3. Filtering
         3.3.1. Basic Filtering
         3.3.2. Extended Filtering
      3.4. Lookup
         3.4.1. Default Values
   4. Other Considerations
      4.1. Choosing Language Ranges
      4.2. Meaning of Language Tags and Ranges
      4.3. Considerations for Private Use Subtags
      4.4. Length Considerations for Language Ranges
   5. IANA Considerations
   6. Security Considerations
   7. Character Set Considerations
   8. References
      8.1. Normative References
      8.2. Informative References
   Appendix A. Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   Human beings on our planet have, past and present, used a number of
   languages.
   There are many reasons why one would want to identify the
   language used when presenting or requesting information.

   Applications, protocols, or specifications that use language
   identifiers, such as the language tags defined in [RFC3066bis],
   sometimes need to match language tags to a user's language
   preferences.

   This document defines a syntax (called a language range (Section 2))
   for specifying items in the user's list of language preferences
   (called a language priority list (Section 2.3)), as well as several
   schemes for selecting or filtering sets of language tags by comparing
   the language tags to the user's preferences. Applications,
   protocols, or specifications will have varying needs and requirements
   that affect the choice of a suitable matching scheme.

   This document describes: how to indicate a user's preferences using
   language ranges; three schemes for matching these ranges to a set of
   language tags; and the various practical considerations that apply to
   implementing and using these schemes.

   This document, in combination with [RFC3066bis] (Ed.: replace
   "3066bis" globally in this document with the RFC number assigned to
   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
   [RFC1766].

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2. The Language Range

   Language tags [RFC3066bis] are used to help identify languages,
   whether spoken, written, signed, or otherwise signaled, for the
   purpose of communication. Applications, protocols, or specifications
   that use language tags are often faced with the problem of
   identifying sets of content that share certain language attributes.
   For example, HTTP/1.1 [RFC2616] describes one such mechanism in its
   discussion of the Accept-Language header (Section 14.4), which is
   used when selecting content from servers based on the language of
   that content.

   It is, thus, useful to have a mechanism for identifying sets of
   language tags that share specific attributes. This allows users to
   select or filter the language tags based on specific requirements.
   Such an identifier is called a "language range".

   There are different types of language range, whose specific
   attributes vary according to their application. Language ranges are
   similar to language tags: they consist of a sequence of subtags
   separated by hyphens. In a language range, each subtag MUST either
   be a sequence of ASCII alphanumeric characters or the single
   character '*' (%2A, ASTERISK). The character '*' is a "wildcard"
   that matches any sequence of subtags. The meaning and uses of
   wildcards vary according to the type of language range.

   Language tags and thus language ranges are to be treated as
   case-insensitive: there exist conventions for the capitalization of
   some of the subtags, but these MUST NOT be taken to carry meaning.
   Matching of language tags to language ranges MUST be done in a
   case-insensitive manner.

2.1. Basic Language Range

   A "basic language range" has the same syntax as an [RFC3066] language
   tag or is the single character "*". The basic language range was
   originally described by HTTP/1.1 [RFC2616] and later [RFC3066]. It
   is defined by the following ABNF [RFC4234]:

   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
   alphanum       = ALPHA / DIGIT

   A basic language range differs from the language tags defined in
   [RFC3066bis] only in that there is no requirement that it be
   "well-formed" or be validated against the IANA Language Subtag
   Registry. Such ill-formed ranges will probably not match anything.
   Note that the ABNF [RFC4234] in [RFC2616] is incorrect, since it
   disallows the use of digits anywhere in the 'language-range' (see:
   [RFC2616errata]).

2.2. Extended Language Range

   Occasionally users will wish to select a set of language tags based
   on the presence of specific subtags. An "extended language range"
   describes a user's language preference as an ordered sequence of
   subtags. For example, a user might wish to select all language tags
   that contain the region subtag 'CH' (Switzerland). Extended language
   ranges are useful for specifying a particular sequence of subtags
   that appear in the set of matching tags without having to specify all
   of the intervening subtags.

   An extended language range can be represented by the following ABNF:

   extended-language-range = (1*8ALPHA / "*")
                             *("-" (1*8alphanum / "*"))

                  Figure 2: Extended Language Range

   The wildcard subtag '*' can occur in any position in the extended
   language range, where it matches any sequence of subtags that might
   occur in that position in a language tag. However, wildcards outside
   the first position are ignored by Extended Filtering (see Section
   3.3.2). The use or absence of one or more wildcards cannot be taken
   to imply that a certain number of subtags will appear in the matching
   set of language tags.

2.3. The Language Priority List

   A user's language preferences will often need to specify more than
   one language range and thus users often need to specify a prioritized
   list of language ranges in order to best reflect their language
   preferences. This is especially true for speakers of minority
   languages. A speaker of Breton in France, for example, can specify
   "br" followed by "fr", meaning that if Breton is available, it is
   preferred, but otherwise French is the best alternative.
   It can get more complex: a different user might want to fall back
   from Skolt Sami to Northern Sami to Finnish.

   A "language priority list" is a prioritized or weighted list of
   language ranges. One well-known example of such a list is the
   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
   14.4) and RFC 3282 [RFC3282].

   The various matching operations described in this document include
   considerations for using a language priority list. This document
   does not define the syntax for a language priority list; defining
   such a syntax is the responsibility of the protocol, application, or
   specification that uses it. When given as examples in this document,
   language priority lists will be shown as a quoted sequence of ranges
   separated by commas, like this: "en, fr, zh-Hant" (which is read
   "English before French before Chinese as written in the Traditional
   script").

   A simple list of ranges is considered to be in descending order of
   priority. Other language priority lists provide "quality weights"
   for the language ranges in order to specify the relative priority of
   the user's language preferences. An example of this is the use of
   "q" values in the syntax of the "Accept-Language" header (defined in
   [RFC2616], Section 14.4, and [RFC3282]).

3. Types of Matching

   Matching language ranges to language tags can be done in many
   different ways. This section describes three such matching schemes,
   as well as the considerations for choosing between them. Protocols
   and specifications requiring conformance to this specification MUST
   clearly indicate the particular mechanism used in selecting or
   matching language tags.

   There are two types of matching scheme in this document. A matching
   scheme that produces zero or more matching language tags is called
   "filtering". A matching scheme that produces exactly one match for a
   given request is called "lookup".
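
   Before any matching takes place, an implementation may want to check
   that a string is a syntactically valid range. The following Python
   sketch transcribes the ABNF of Sections 2.1 and 2.2 into regular
   expressions. The names BASIC_RANGE, EXTENDED_RANGE, is_basic_range,
   and is_extended_range are illustrative assumptions and are not
   defined by this document.

```python
import re

# Section 2.1: language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")

# Section 2.2: extended-language-range =
#     (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(
    r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*"
)


def is_basic_range(text: str) -> bool:
    """True if the whole string fits the basic language range syntax."""
    return BASIC_RANGE.fullmatch(text) is not None


def is_extended_range(text: str) -> bool:
    """True if the whole string fits the extended language range syntax."""
    return EXTENDED_RANGE.fullmatch(text) is not None


print(is_basic_range("de-CH"))        # True
print(is_basic_range("en-*-US"))      # False: wildcard inside the range
print(is_extended_range("en-*-US"))   # True
```

   This check is purely syntactic. It does not consult the IANA
   Language Subtag Registry, so a range can pass it and still match
   nothing.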

3.1. Choosing a Matching Scheme

   Applications, protocols, and specifications are faced with the
   decision of what type of matching to use. Sometimes, different
   styles of matching are suited to different kinds of processing within
   a particular application or protocol.

   This document describes three matching schemes:

   1. Basic Filtering (Section 3.3.1) matches a language priority list
      consisting of basic language ranges (Section 2.1) to sets of
      language tags.

   2. Extended Filtering (Section 3.3.2) matches a language priority
      list consisting of extended language ranges (Section 2.2) to sets
      of language tags.

   3. Lookup (Section 3.4) matches a language priority list consisting
      of basic language ranges to sets of language tags to find the one
      _exact_ language tag that best matches the range.

   Filtering can be used to produce a set of results (such as a
   collection of documents) by comparing the user's preferences to a set
   of language tags. For example, when performing a search, filtering
   can be used to limit the results to items tagged as being in the
   French language. Filtering can also be used when deciding whether to
   perform a language-sensitive process on some content. For example, a
   process might cause paragraphs whose language tag matched the
   language range "nl" (Dutch) to be displayed in italics within a
   document.

   Lookup produces the single result that best matches the user's
   preferences from the list of available tags, so it is useful in cases
   in which a single item is required (and for which only a single item
   can be returned). For example, if a process were to insert a human
   readable error message into a protocol header, it might select the
   text based on the user's language priority list.
   Since the process can return only one item, it is forced to choose a
   single item and it has to return some item, even if none of the
   content's language tags match the language priority list supplied by
   the user.

3.2. Implementation Considerations

   Language tag matching is a tool, and does not by itself specify a
   complete procedure for the use of language tags. Such procedures are
   intimately tied to the application protocol in which they occur.
   When specifying a protocol operation using matching, the protocol
   MUST specify:

   o Which type(s) of language tag matching it uses

   o Whether the operation returns a single result (lookup) or a
     possibly empty set of results (filtering)

   o For lookup, what the default item is (or the sequence of
     operations or configuration information used to determine the
     default) when no matching tag is found. For instance, a protocol
     might define the result as failure of the operation, an empty
     value, returning some protocol-defined or implementation-defined
     default, or returning i-default [RFC2277].

   Applications, protocols, and specifications are not required to
   validate or understand any of the semantics of the language tags or
   ranges or of the subtags in them, nor do they require access to the
   IANA Language Subtag Registry (see Section 3 in [RFC3066bis]). This
   simplifies implementation.

   However, designers of applications, protocols, or specifications are
   encouraged to use the information from the IANA Language Subtag
   Registry to support canonicalizing language tags and ranges in order
   to map grandfathered and obsolete tags or subtags into modern
   equivalents.

   Applications, protocols, or specifications that canonicalize ranges
   MUST either perform matching operations with both the canonical and
   original (unmodified) form of the range or MUST also canonicalize
   each tag for the purposes of comparison.
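
   The two permitted ways of handling canonicalized ranges can be
   sketched in Python as follows. This is an illustration under stated
   assumptions, not a definitive implementation: the PREFERRED table is
   a hypothetical one-entry excerpt (only the "art-lojban" to "jbo"
   mapping comes from this document), the comparison used is the prefix
   comparison of Section 3.3.1, and all function names are invented for
   the example.

```python
# Hypothetical excerpt of deprecated-tag mappings. A real implementation
# would derive these from the IANA Language Subtag Registry.
PREFERRED = {"art-lojban": "jbo"}


def canonicalize(tag_or_range: str) -> str:
    """Map a deprecated whole tag or range to its modern form, if known."""
    return PREFERRED.get(tag_or_range.lower(), tag_or_range)


def prefix_match(language_range: str, tag: str) -> bool:
    """Case-insensitive prefix comparison in the style of Section 3.3.1."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def match_both_forms(language_range: str, tag: str) -> bool:
    """First option: try the original and the canonical form of the range."""
    return (prefix_match(language_range, tag)
            or prefix_match(canonicalize(language_range), tag))


def match_canonical_tags(language_range: str, tag: str) -> bool:
    """Second option: canonicalize the range and each tag before comparing."""
    return prefix_match(canonicalize(language_range), canonicalize(tag))


# Content tagged with either the old or the new form is still found.
print(match_both_forms("art-lojban", "art-lojban"))   # True
print(match_both_forms("art-lojban", "jbo"))          # True
print(match_canonical_tags("jbo", "art-lojban"))      # True
```

   Either option keeps older content reachable. Comparing a
   canonicalized range against uncanonicalized tags, with neither
   safeguard, would not.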

   Note that canonicalizing language ranges makes certain operations
   impossible. For example, an implementation that canonicalizes the
   language range "art-lojban" (artificial language, lojban variant) to
   use the more modern "jbo" (Lojban) cannot be used to select just the
   items with the older tag.

   Applications, protocols, or specifications that use basic ranges
   might sometimes receive extended language ranges instead. An
   application, protocol, or specification MUST choose to: a) map
   extended language ranges to basic ranges using the algorithm below,
   b) reject any extended language ranges in the language priority list
   that are not valid basic language ranges, or c) treat each extended
   language range as if it were a basic language range, which will have
   the same result as ignoring them, since these ranges will not match
   any valid language tags.

   An extended language range is mapped to a basic language range as
   follows: if the first subtag is a '*' then the entire range is
   treated as "*", otherwise each wildcard subtag is removed. For
   example, the extended language range "en-*-US" maps to "en-US"
   (English, United States).

   Applications, protocols, or specifications, in addressing their
   particular requirements, can offer pre-processing or configuration
   options. For example, an implementation could allow a user to
   associate or map a particular language range to a different value.
   Such a user might wish to associate the language range subtags 'nn'
   (Nynorsk Norwegian) and 'nb' (Bokmal Norwegian) with the more general
   subtag 'no' (Norwegian). Or perhaps a user would want to associate
   requests for the range "zh-Hans" (Chinese as written in the
   Simplified script) with content bearing the language tag "zh-CN"
   (Chinese as used in China, where the Simplified script is
   predominant).
   Documentation on how the ranges or tags are altered,
   prioritized, or compared in the subsequent match in such an
   implementation will assist users in making these types of
   configuration choices.

3.3. Filtering

   Filtering is used to select the set of language tags that matches a
   given language priority list. It is called "filtering" because this
   set might contain no items at all or it might return an arbitrarily
   large number of matching items: as many items as match the language
   priority list, thus "filtering out" the non-matching items.

   In filtering, each language range represents the _least_ specific
   language tag (that is, the language tag with the fewest subtags)
   which is an acceptable match. All of the language tags in
   the matching set of tags will have an equal or greater number of
   subtags than the language range. Every non-wildcard subtag in the
   language range will appear in every one of the matching language
   tags. For example, if the language priority list consists of the
   range "de-CH" (German as used in Switzerland), one might see tags
   such as "de-CH-1996" (German as used in Switzerland, orthography of
   1996) but one will never see a tag such as "de" (because the 'CH'
   subtag is missing).

   If the language priority list (see Section 2.3) contains more than
   one range, the content returned is typically ordered in descending
   level of preference, but it MAY be unordered, according to the needs
   of the application or protocol.

   Some examples of applications where filtering might be appropriate
   include:

   o Applying a style to sections of a document in a particular set of
     languages.

   o Displaying the set of documents containing a particular set of
     keywords written in a specific set of languages.

   o Selecting all email items written in a specific set of languages.

   o Selecting audio files spoken in a particular language.
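
   The behavior described in this section can be sketched in Python
   using the prefix comparison that Section 3.3.1 specifies for basic
   language ranges. The function names basic_match and basic_filter are
   assumptions made for this example, and returning the tags in
   priority order is only one of the orderings this section allows.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """True if the range equals the tag or a prefix of it ending before '-'."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def basic_filter(priority_list, tags):
    """Return the matching tags, ordered by the first range that matched."""
    result = []
    for language_range in priority_list:
        for tag in tags:
            if tag not in result and basic_match(language_range, tag):
                result.append(tag)
    return result


available = ["de", "de-CH", "de-CH-1996", "fr-CH", "de-Latn-DE"]

# "de-CH" is the least specific acceptable tag, so "de" is filtered out.
print(basic_filter(["de-CH"], available))
# ['de-CH', 'de-CH-1996']

# With several ranges, results follow the order of the priority list.
print(basic_filter(["fr", "de-CH"], available))
# ['fr-CH', 'de-CH', 'de-CH-1996']
```

   An empty result is a normal outcome of filtering. Unlike lookup,
   there is no default value to fall back on.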

   Filtering seems to imply that there is a semantic relationship
   between language tags that share the same prefix. While this is
   often the case, it is not always true: the language tags that match a
   specific language range do not necessarily represent mutually
   intelligible languages.

3.3.1. Basic Filtering

   Basic filtering compares basic language ranges to language tags.
   Each basic language range in the language priority list is considered
   in turn, according to priority. A language range matches a
   particular language tag if, in a case-insensitive comparison, it
   exactly equals the tag, or if it exactly equals a prefix of the tag
   such that the first character following the prefix is "-". For
   example, the language-range "de-de" (German as used in Germany)
   matches the language tag "de-DE-1996" (German as used in Germany,
   orthography of 1996), but not the language tags "de-Deva" (German as
   written in the Devanagari script) or "de-Latn-DE" (German, Latin
   script, as used in Germany).

   The special range "*" in a language priority list matches any tag. A
   protocol which uses language ranges MAY specify additional rules
   about the semantics of "*"; for instance, HTTP/1.1 [RFC2616]
   specifies that the range "*" matches only languages not matched by
   any other range within an "Accept-Language" header.

   Basic filtering is identical to the type of matching described in
   [RFC3066], Section 2.5 (Language-range).

3.3.2. Extended Filtering

   Extended filtering compares extended language ranges to language
   tags. Each extended language range in the language priority list is
   considered in turn, according to priority. A language range matches
   a particular language tag if their list of subtags match. To
   determine a match:

   1. Split both the extended language range and the language tag being
      compared into a list of subtags by dividing on the hyphen (%2D)
      character.
      Two subtags match if either they are the same when
      compared case-insensitively or the language range's subtag is the
      wildcard '*'.

   2. Begin with the first subtag in each list. If the first subtag in
      the range does not match the first subtag in the tag, the overall
      match fails. Otherwise, move to the next subtag in both the
      range and the tag.

   3. While there are more subtags left in the language range's list:

      A. If the subtag currently being examined in the range is the
         wildcard ('*'), move to the next subtag in the range and
         continue with the loop.

      B. Else, if there are no more subtags in the language tag's
         list, the match fails.

      C. Else, if the current subtag in the range's list matches the
         current subtag in the language tag's list, move to the next
         subtag in both lists and continue with the loop.

      D. Else, if the language tag's subtag is a "singleton" (a single
         letter or digit, which includes the private-use subtag 'x')
         the match fails.

      E. Else, move to the next subtag in the language tag's list and
         continue with the loop.

   4. When the language range's list has no more subtags, the match
      succeeds.

   Subtags not specified, including those at the end of the language
   range, are thus treated as if assigned the wildcard value '*'. Much
   like basic filtering, extended filtering selects content with
   arbitrarily long tags that share the same initial subtags as the
   language range. In addition, extended filtering selects language
   tags that contain any intermediate subtags not specified in the
   language range.
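
   Steps 1 through 4 above can be sketched in Python as follows. The
   function name extended_match is an assumption made for this example.
   The comments map each branch to the step it is meant to implement.

```python
def extended_match(language_range: str, tag: str) -> bool:
    """Apply the extended filtering steps to one range and one tag."""
    # Step 1: split on hyphens; lowercasing gives case-insensitivity.
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 2: the first subtags must match ('*' matches anything).
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1

    # Step 3: walk the remaining subtags of the range.
    while r < len(range_subtags):
        if range_subtags[r] == "*":                 # 3.A: skip wildcard
            r += 1
        elif t >= len(tag_subtags):                 # 3.B: tag exhausted
            return False
        elif range_subtags[r] == tag_subtags[t]:    # 3.C: subtags match
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:              # 3.D: singleton
            return False
        else:                                       # 3.E: skip tag subtag
            t += 1

    # Step 4: every subtag of the range was accounted for.
    return True


print(extended_match("de-*-DE", "de-Latn-DE-1996"))   # True
print(extended_match("de-*-DE", "de-x-DE"))           # False
print(extended_match("de-DE", "de-Deva"))             # False
```

   Because step 3.A simply skips a wildcard, "de-*-DE" and "de-DE"
   behave identically in this sketch.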

   For example, the extended language range "de-*-DE"
   (or its synonym "de-DE") matches all of the following tags:

      de-DE (German, as used in Germany)

      de-de (German, as used in Germany)

      de-Latn-DE (Latin script)

      de-Latf-DE (Fraktur variant of Latin script)

      de-DE-x-goethe (private use subtag)

      de-Latn-DE-1996 (orthography of 1996)

      de-Deva-DE (Devanagari script)

   The same range does not match any of the following tags for the
   reasons shown:

      de (missing 'DE')

      de-x-DE (singleton 'x' occurs before 'DE')

      de-Deva ('Deva' not equal to 'DE')

   Note: [RFC3066bis] defines each type of subtag (language, script,
   region, and so forth) according to position, size, and content. This
   means that subtags in a language range can only match specific types
   of subtags in a language tag. For example, a subtag such as 'Latn'
   is always a script subtag (unless it follows a singleton) while a
   subtag such as 'nedis' can only match the equivalent variant subtag.
   Two-letter subtags in initial position have a different type
   (language) than two-letter subtags in later positions (region). This
   is the reason why a wildcard in the extended language range is
   significant in the first position but is ignored in all other
   positions.

3.4. Lookup

   Lookup is used to select the single language tag that best matches
   the language priority list for a given request. When performing
   lookup, each language range in the language priority list is
   considered in turn, according to priority. By contrast with
   filtering, each language range represents the _most_ specific tag
   which is an acceptable match. The first matching tag found,
   according to the user's priority, is considered the closest match and
   is the item returned. For example, if the language range is "de-ch",
   a lookup operation can produce content with the tags "de" or "de-CH"
   but never content with the tag "de-CH-1996".
   If no language tag matches the request, the "default" value is
   returned.

   For example, if an application inserts some dynamic content into a
   document, returning an empty string if there is no exact match is not
   an option. Instead, the application "falls back" until it finds a
   matching language tag associated with a suitable piece of content to
   insert. Some applications of lookup include:

   o Selection of a template containing the text for an automated email
     response.

   o Selection of an item containing some text for inclusion in a
     particular Web page.

   o Selection of a string of text for inclusion in an error log.

   o Selection of an audio file to play as a prompt in a phone system.

   In the lookup scheme, the language range is progressively truncated
   from the end until a matching language tag is located. Single letter
   or digit subtags (including both the letter 'x' which introduces
   private-use sequences, and the subtags that introduce extensions) are
   removed at the same time as their closest trailing subtag. For
   example, starting with the range "zh-Hant-CN-x-private1-private2"
   (Chinese, Traditional script, China, two private use tags) the lookup
   progressively searches for content as shown below:

   Range to match: zh-Hant-CN-x-private1-private2
   1. zh-Hant-CN-x-private1-private2
   2. zh-Hant-CN-x-private1
   3. zh-Hant-CN
   4. zh-Hant
   5. zh
   6. (default)

             Figure 3: Example of a Lookup Fallback Pattern

   This fallback behavior allows some flexibility in finding a match.
   Without fallback, the default content would be returned immediately
   if exactly matching content is unavailable. With fallback, a result
   more closely matching the user request can be provided.

   Extensions and unrecognized private-use subtags might be unrelated to
   a particular application of lookup.
   Since these subtags come at the
   end of the subtag sequence, they are removed first during the
   fallback process and usually pose no barrier to interoperability.
   However, an implementation MAY remove these from ranges prior to
   performing the lookup (provided the implementation also removes them
   from the tags being compared). Such modification is internal to the
   implementation and applications, protocols, or specifications SHOULD
   NOT remove or modify subtags in content that they return or forward,
   because this removes information that can be used elsewhere.

   The special language range "*" matches any language tag. In the
   lookup scheme, this range does not convey enough information by
   itself to determine which language tag is most appropriate, since it
   matches everything. If the language range "*" is followed by other
   language ranges, it is skipped. If the language range "*" is the
   only one in the language priority list or if no other language range
   follows, the default value is computed and returned.

   In some cases, the language priority list can contain one or more
   extended language ranges (as, for example, when the same language
   priority list is used as input for both lookup and filtering
   operations). Wildcard values in an extended language range normally
   match any value that can occur in that position in a language tag.
   Since only one item can be returned for any given lookup request,
   wildcards in a language range have to be processed in a consistent
   manner or the same request will produce widely varying results.
   Applications, protocols, or specifications that accept extended
   language ranges MUST define which item is returned when more than one
   item matches the extended language range.

   For example, an implementation could map the extended language ranges
   to basic ranges.
Another possibility would be for an implementation to return the
matching tag that is first in ASCII-order. If the language range were
"*-CH" ('CH' represents Switzerland) and the set of tags included
"de-CH" (German as used in Switzerland), "fr-CH" (French,
Switzerland), and "it-CH" (Italian, Switzerland), then the tag "de-CH"
would be returned.

3.4.1. Default Values

Each application, protocol, or specification that uses lookup MUST
define the defaulting behavior when no tag matches the language
priority list. What this action consists of strongly depends on how
lookup is being applied. Some examples of defaulting behavior
include:

o  return an item with no language tag or an item of a non-linguistic
   nature, such as an image or sound

o  return a null string as the language tag value, in cases where the
   protocol permits the empty value (see, for example, "xml:lang" in
   [XML10])

o  return a particular language tag designated for the operation

o  return the language tag "i-default" (see: [RFC2277])

o  return an error condition or error message

o  return a list of available languages for the user to select from

When performing lookup using a language priority list, the
progressive search MUST process each language range in the list
before seeking or calculating the default.

The default value MAY be calculated or include additional searching
or matching. Applications, protocols, or specifications can specify
different ways in which users can specify or override the defaults.

One common way to provide for a default is to allow a specific
language range to be set as the default for a specific type of
request. If this approach is chosen, this language range MUST be
treated as if it were appended to the end of the language priority
list as a whole, rather than after each item in the language priority
list.
The application, protocol, or specification MUST also define the
defaulting behavior if that search fails to find a matching tag or
item.

For example, if a particular user's language priority list is "fr-FR,
zh-Hant" (French as used in France followed by Chinese as written in
the Traditional script) and the program doing the matching had a
default language range of "ja-JP" (Japanese as used in Japan), then
the program searches as follows:

   1. fr-FR
   2. fr
   3. zh-Hant // next language
   4. zh
   5. ja-JP // now searching for the default content
   6. ja
   7. (implementation defined default)

   Figure 4: Lookup Using a Language Priority List

4. Other Considerations

When working with language ranges and matching schemes, there are
some additional points that can influence the choice of either.

4.1. Choosing Language Ranges

Users indicate their language preferences via the choice of a
language range or the list of language ranges in a language priority
list. The type of matching affects what the best choice is for a
user.

Most matching schemes make no attempt to process the semantic meaning
of the subtags. The language range is compared, in a case-insensitive
manner, to each language tag being matched, using basic string
processing. Users SHOULD select language ranges that are well-formed,
valid language tags according to [RFC3066bis] (substituting wildcards
as appropriate in extended language ranges).

Applications are encouraged to canonicalize language tags and ranges
by using the Preferred-Value from the IANA Language Subtag Registry
for tags or subtags which have been deprecated. If the user is
working with content that might use the older form, the user might
want to include both the new and old forms in a language priority
list. For example, the tag "art-lojban" is deprecated.
The subtag 'jbo' is supposed to be used instead, so the user might
use it to form the language range. Or the user might include both in
a language priority list: "jbo, art-lojban".

Users SHOULD avoid subtags that add no distinguishing value to a
language range. When filtering, the fewer the number of subtags that
appear in the language range, the more content the range will
probably match, while in lookup unnecessary subtags can cause
"better", more-specific content to be skipped in favor of less
specific content. For example, the range "de-Latn-DE" returns
content tagged "de" instead of content tagged "de-DE", even though
the latter is probably a better match.

Whether a subtag adds distinguishing value can depend on the context
of the request. For example, a user who reads both Simplified and
Traditional Chinese, but who prefers Simplified, might use the range
"zh" for filtering (matching all items that user can read) but
"zh-Hans" for lookup (making sure that user gets the preferred form if
it's available, but the fallback to "zh" will still work). On the
other hand, content in this case ought to be labeled as "zh-Hans" (or
"zh-Hant" if that applies) for filtering, while for lookup, if there
is either "zh-Hans" content or "zh-Hant" content, one of them (the
one considered 'default') also ought to be made available with the
simple "zh". Note that the user can create a language priority list
"zh-Hans, zh" that delivers the best possible results for both
schemes. If the user cannot be sure which scheme is being used (or
if more than one might be applied to a given request), the user
SHOULD specify the most specific (largest number of subtags) range
first and then supply shorter prefixes later in the list to ensure
that filtering returns a complete set of tags.

Many languages are written predominantly in a single script.
This is usually recorded in the Suppress-Script field in that
language subtag's registry entry. For these languages, script subtags
SHOULD NOT be used to form a language range. Thus the language range
"en-Latn" is inappropriate in most cases (because the vast majority of
English documents are written in the Latin script and thus the 'en'
language subtag has a Suppress-Script field for 'Latn' in the
registry).

When working with tags and ranges, note that extensions and most
private-use subtags are orthogonal to language tag matching, in that
they specify additional attributes of the text not related to the
goals of most matching schemes. Users SHOULD avoid using these
subtags in language ranges, since they interfere with the selection
of available content. When used in language tags (as opposed to
ranges), these subtags normally do not interfere with filtering
(Section 3), since they appear at the end of the tag and will match
all prefixes. Lookup (Section 3.4) implementations are advised to
ignore unrecognized private-use and extension subtags when performing
language tag fallback.

4.2. Meaning of Language Tags and Ranges

Selecting language tags using language ranges requires some
understanding by users of what they are selecting. The meanings of
the various subtags in a language range are identical to their
meanings in a language tag (see Section 4.2 in [RFC3066bis]), with the
addition that the wildcard "*" represents any matching sequence of
values.

4.3. Considerations for Private Use Subtags

Private agreement is necessary between the parties that intend to use
or exchange language tags that contain private-use subtags. Great
caution SHOULD be used in employing private-use subtags in content or
protocols intended for general use. Private-use subtags are simply
useless for information exchange without prior arrangement.
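
As an illustration of how an implementation without such a private
agreement might treat these subtags during lookup (Section 3.4), the
following Python sketch ignores extension and private-use sequences on
both the ranges and the tags being compared, and otherwise truncates
the range from the end. The function names and the
"ignore_private_use" parameter are hypothetical and chosen only for
this example; this is one possible reading of the text, not a
reference implementation.

```python
def truncate(language_range):
    """Drop the last subtag, plus any single-character subtag left at the end."""
    subtags = language_range.split("-")[:-1]
    # A single letter or digit subtag (such as 'x') is removed at the
    # same time as the subtag that trailed it.
    while subtags and len(subtags[-1]) == 1:
        subtags.pop()
    return "-".join(subtags)


def strip_private_use_and_extensions(tag):
    """Return the tag without its extension and private-use sequences."""
    subtags = tag.split("-")
    if len(subtags[0]) == 1:
        # Wholly private-use ("x-...") or grandfathered ("i-...") tags
        # are left untouched in this sketch.
        return tag
    for i, subtag in enumerate(subtags):
        # The first single-character subtag starts an extension or
        # private-use sequence that runs to the end of the tag.
        if i > 0 and len(subtag) == 1:
            return "-".join(subtags[:i])
    return tag


def lookup(priority_list, available, default=None, ignore_private_use=True):
    """Return the first available tag matched by progressive truncation."""
    if ignore_private_use:
        normalize = strip_private_use_and_extensions
    else:
        normalize = lambda tag: tag
    # Compare case-insensitively, but return the stored tag unmodified.
    index = {}
    for tag in available:
        index.setdefault(normalize(tag).lower(), tag)
    for language_range in priority_list:
        if language_range == "*":
            continue  # "*" alone carries no information for lookup
        candidate = normalize(language_range).lower()
        while candidate:
            if candidate in index:
                return index[candidate]
            candidate = truncate(candidate)
    # Every range has been processed; fall through to the default.
    return default
```

With the available tags "zh-Hant", "de-CH-x-internal", and "en", the
range "zh-Hant-CN-x-private1-private2" yields "zh-Hant" and the range
"de-CH" yields "de-CH-x-internal", returned with its private-use
subtags intact. Passing ignore_private_use=False instead follows the
fallback pattern of Figure 3 step by step.
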

The value and semantic meaning of private-use tags and of the subtags
used within such a language tag are not defined. Matching private-use
tags using language ranges or extended language ranges can result in
unpredictable content being returned.

4.4. Length Considerations for Language Ranges

Language ranges are very similar to language tags in terms of content
and usage. The same types of restrictions on length that can be
applied to language tags can also be applied to language ranges. See
[RFC3066bis] Section 4.3 (Length Considerations).

5. IANA Considerations

This document presents no new or existing considerations for IANA.

6. Security Considerations

Language ranges used in content negotiation might be used to infer
the nationality of the sender, and thus identify potential targets
for surveillance. In addition, unique or highly unusual language
ranges or combinations of language ranges might be used to track a
specific individual's activities.

This is a special case of the general problem that anything you send
is visible to the receiving party. It is useful to be aware that
such concerns can exist in some cases.

The evaluation of the exact magnitude of the threat, and any possible
countermeasures, is left to each application or protocol.

7. Character Set Considerations

Language tags permit only the characters A-Z, a-z, 0-9, and
HYPHEN-MINUS (%x2D). Language ranges also use the character ASTERISK
(%x2A). These characters are present in most character sets, so
presentation or exchange of language tags or ranges should not be
constrained by character set issues.

8. References

8.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
           Languages", BCP 18, RFC 2277, January 1998.

[RFC3066bis]
           Phillips, A., Ed. and M. Davis, Ed., "Tags for the
           Identification of Languages", October 2005.

[RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
           Specifications: ABNF", RFC 4234, October 2005.

8.2. Informative References

[RFC1766]  Alvestrand, H., "Tags for the Identification of
           Languages", RFC 1766, March 1995.

[RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
           Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
           Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[RFC2616errata]
           IETF, "HTTP/1.1 Specification Errata", October 2004.

[RFC3066]  Alvestrand, H., "Tags for the Identification of
           Languages", BCP 47, RFC 3066, January 2001.

[RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
           May 2002.

[XML10]    Bray (et al), T., "Extensible Markup Language (XML) 1.0",
           February 2004.

Appendix A. Acknowledgements

Any list of contributors is bound to be incomplete; please regard the
following as only a selection from the group of people who have
contributed to make this document what it is today.

The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
which is a precursor to this document, made enormous contributions
directly or indirectly to this document and are generally responsible
for the success of language tags.

The following people (in alphabetical order by family name)
contributed to this document:

Harald Alvestrand, Stephane Bortzmeyer, Jeremy Carroll, Peter
Constable, John Cowan, Mark Crispin, Martin Duerst, Frank Ellermann,
Doug Ewell, Debbie Garside, Marion Gunn, Jon Hanna, Kent Karlsson,
Erkki Kolehmainen, Jukka Korpela, Ira McDonald, M. Patton, Randy
Presuhn, Eric van der Poel, Markus Scherer, Misha Wolf, and many,
many others.

Very special thanks must go to Harald Tveit Alvestrand, who
originated RFCs 1766 and 3066, and without whom this document would
not have been possible.

Authors' Addresses

Addison Phillips (editor)
Yahoo! Inc.

Email: addison@inter-locale.com

Mark Davis (editor)
Google

Email: mark.davis@macchiato.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006). This document is subject
to the rights, licenses and restrictions contained in BCP 78, and
except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the
Internet Society.