Network Working Group                                   A. Phillips, Ed.
Internet-Draft                                                Yahoo! Inc
Obsoletes: 3066 (if approved)                              M. Davis, Ed.
Expires: September 5, 2006                                        Google
                                                           March 4, 2006


                       Matching of Language Tags
                      draft-ietf-ltru-matching-11

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 5, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document describes different mechanisms for comparing and
   matching language tags.  Possible algorithms for language negotiation
   or content selection, filtering, and lookup are described.  This
   document, in combination with RFC 3066bis (Ed.: replace "3066bis"
   with the RFC number assigned to draft-ietf-ltru-registry-14),
   replaces RFC 3066, which replaced RFC 1766.

Table of Contents

   1.  Introduction
   2.  The Language Range
     2.1.  Basic Language Range
     2.2.  Extended Language Range
     2.3.  The Language Priority List
   3.  Types of Matching
     3.1.  Choosing a Type of Matching
     3.2.  Filtering
       3.2.1.  Basic Filtering
       3.2.2.  Extended Filtering
     3.3.  Lookup
   4.  Other Considerations
     4.1.  Choosing Language Ranges
     4.2.  Meaning of Language Tags and Ranges
     4.3.  Considerations for Private Use Subtags
     4.4.  Length Considerations for Language Ranges
   5.  IANA Considerations
   6.  Changes
   7.  Security Considerations
   8.  Character Set Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Human beings on our planet have, past and present, used a number of
   languages.  There are many reasons why one would want to identify the
   language used when presenting or requesting information or in some
   specific set of information items or "content".

   One use for language identifiers, such as those defined in
   [RFC3066bis], is to select content by matching the associated
   language tags to a user's language preferences.

   This document defines a syntax (called a language range (Section 2))
   for specifying items in the user's list of language preferences
   (called a language priority list (Section 2.3)), as well as several
   schemes for selecting or filtering sets of content by comparing the
   content's language tags to the user's preferences.  Applications,
   protocols, or specifications will have varying needs and requirements
   that affect the choice of a suitable matching scheme.  Depending on
   the choice of scheme, there are various options left to the
   implementation.  Protocols that implement a matching scheme either
   need to specify each particular choice or indicate the options that
   are left to the implementation to decide.

   This document is divided into three main sections.  One describes how
   to indicate a user's preferences using language ranges.  Then a
   section describes various schemes for matching these ranges to a set
   of language tags.  There is also a section that deals with various
   practical considerations that apply to implementing and using these
   schemes.

   This document, in combination with [RFC3066bis] (Ed.: replace
   "3066bis" globally in this document with the RFC number assigned to
   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
   [RFC1766].

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  The Language Range

   Language Tags [RFC3066bis] are used to identify the language of some
   information item or "content".  Applications or protocols that use
   language tags are often faced with the problem of identifying sets of
   content that share certain language attributes.
   For example, HTTP/1.1 [RFC2616] describes one such mechanism in its
   discussion of the Accept-Language header (Section 14.4), which is
   used when selecting content from servers based on the language of
   that content.

   When selecting content according to its language, it is useful to
   have a mechanism for identifying sets of language tags that share
   specific attributes.  This allows users to select or filter content
   based on specific requirements.  Such an identifier is called a
   "language range".

   There are different types of language range, whose specific
   attributes vary according to their application.  Language ranges are
   similar to language tags: they consist of a sequence of subtags
   separated by hyphens.  In a language range, each subtag MUST either
   be a sequence of ASCII alphanumeric characters or the single
   character '*' (%x2A, ASTERISK).  The character '*' is a "wildcard"
   that matches any sequence of subtags.  The meaning and uses of
   wildcards vary according to the type of language range.

   Language tags and thus language ranges are to be treated as
   case-insensitive: there exist conventions for the capitalization of
   some of the subtags, but these MUST NOT be taken to carry meaning.
   Matching of language tags to language ranges MUST be done in a
   case-insensitive manner.

2.1.  Basic Language Range

   A "basic language range" describes a user's language preference as a
   specific, uninterrupted sequence of subtags.  Each range consists of
   a sequence of alphanumeric subtags separated by hyphens.  The basic
   language range is defined by the following ABNF [RFC4234]:

   language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
   alphanum         = ALPHA / DIGIT

   Basic language ranges (originally described by HTTP/1.1 [RFC2616] and
   later [RFC3066]) have the same syntax as an [RFC3066] language tag or
   are the single character "*".
   They differ from the language tags defined in [RFC3066bis] only in
   that there is no requirement that they be "well-formed" or be
   validated against the IANA Language Subtag Registry (although such
   ill-formed ranges will probably not match anything).  (Note that the
   ABNF [RFC4234] in [RFC2616] is incorrect, since it disallows the use
   of digits anywhere in the 'language-range'; see [RFC2616errata].)

   Use of a basic language range seems to imply that there is a semantic
   relationship between language tags that share the same prefix.  While
   this is often the case, it is not always true and users should note
   that the set of language tags that match a specific language range
   may not represent mutually intelligible languages.

2.2.  Extended Language Range

   Occasionally users will wish to select a set of language tags based
   on the presence of specific subtags.  An "extended language range"
   describes a user's language preference as an ordered sequence of
   subtags.  For example, a user might wish to select all language tags
   that contain the region subtag 'CH' (Switzerland).  Extended language
   ranges are useful in specifying a particular sequence of subtags that
   appear in the set of matching tags without having to specify all of
   the intervening subtags.

   An extended language range can be represented by the following ABNF:

   extended-language-range = (1*8ALPHA / "*")
                             *("-" (1*8alphanum / "*"))

                   Figure 2: Extended Language Range

   The wildcard subtag '*' can occur in any position in the extended
   language range, where it matches any sequence of subtags that might
   occur in that position in a language tag.  However, wildcards outside
   the first position in an extended language range are ignored by most
   matching schemes.  Use of one or more wildcards SHOULD NOT be taken
   to imply that a certain number of subtags will appear in the matching
   set of language tags.
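
   The ABNF productions in Sections 2.1 and 2.2 translate directly into
   regular expressions.  The following Python sketch is illustrative
   only and is not part of this specification; the names BASIC_RANGE,
   EXTENDED_RANGE, is_basic_range, and is_extended_range are assumptions
   introduced for the example.  It checks syntax only and does not
   consult the IANA Language Subtag Registry.

```python
import re

# Transcribed from the ABNF (names are assumptions, not part of the draft):
#   language-range          = (1*8ALPHA *("-" 1*8alphanum)) / "*"
#   extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")
EXTENDED_RANGE = re.compile(
    r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*"
)


def is_basic_range(text: str) -> bool:
    """Return True if text is a syntactically valid basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None


def is_extended_range(text: str) -> bool:
    """Return True if text is a syntactically valid extended language range."""
    return EXTENDED_RANGE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("de-CH", "*", "en-*-US", "*-CH", "de--CH"):
        print(candidate, is_basic_range(candidate), is_extended_range(candidate))
```

   Under these patterns every basic language range is also an extended
   language range, while a range such as "en-*-US" is valid only as an
   extended language range.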

   Implementations that specify basic ranges MAY map extended language
   ranges to basic language ranges: if the first subtag is a "*" then
   the entire range is treated as "*", otherwise each wildcard subtag is
   removed.  For example, if the language range were "en-*-US", then the
   range would be mapped to "en-US".

2.3.  The Language Priority List

   A user's language preferences will often need to specify more than
   one language range and thus users often need to specify a prioritized
   list of language ranges in order to best reflect their language
   preferences.  This is especially true for speakers of minority
   languages.  A speaker of Breton in France, for example, may specify
   "br" followed by "fr", meaning that if Breton is available, it is
   preferred, but otherwise French is the best alternative.  It can get
   more complex: a user may wish to fall back from Skolt Sami to
   Northern Sami to Finnish.

   A "language priority list" is a prioritized or weighted list of
   language ranges.  One well known example of such a list is the
   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
   14.4) and RFC 3282 [RFC3282].

   The various matching operations described in this document include
   considerations for using a language priority list.  This document
   does not define the syntax for a language priority list; defining
   such a syntax is the responsibility of the protocol, application, or
   specification that uses it.  When given as examples in this document,
   language priority lists will be shown as a quoted sequence of ranges
   separated by commas, like this: "en, fr, zh-Hant" (which would be
   read as "English before French before Chinese as written in the
   Traditional script").

   A simple list of ranges is considered to be in descending order of
   priority.
   Other language priority lists provide "quality weights" for the
   language ranges in order to specify the relative priority of the
   user's language preferences.  An example of this would be the use of
   "q" values in the syntax of the "Accept-Language" header (defined in
   [RFC2616], Section 14.4, and [RFC3282]).

3.  Types of Matching

   Matching language ranges to language tags can be done in a number of
   different ways.  This section describes several different matching
   schemes, as well as the considerations for choosing between them.
   Protocols and specifications SHOULD clearly indicate the particular
   mechanism used in selecting or matching language tags.

   There are several types of matching scheme.  This document presents
   two types: those that produce zero or more information items (called
   "filtering") and those that produce a single information item for a
   given request (called "lookup").

   Implementations or protocols MAY use different matching schemes from
   the ones described in this document, as long as those mechanisms are
   clearly specified.

3.1.  Choosing a Type of Matching

   Applications, protocols, and specifications are faced with the
   decision of what type of matching to use.  Sometimes, different
   styles of matching are suited to different kinds of processing within
   a particular application or protocol.

   Language tag matching is a tool, and does not by itself specify a
   complete procedure for the use of language tags.  Such procedures are
   intimately tied to the application protocol in which they occur.
   When specifying a protocol operation using matching, the protocol
   MUST specify:

   o  Which type(s) of language tag matching it uses

   o  Whether the operation returns a single result (lookup) or a
      possibly empty set of results (filtering)

   o  For lookup, what the result is when no matching tag is found.
      For instance, a protocol might define the result as failure of the
      operation, an empty value, returning some protocol defined or
      implementation defined default, or returning i-default [RFC2277].

   This document describes three types of matching:

   1.  Basic Filtering (Section 3.2.1) matches a language priority list
       consisting of basic language ranges (Section 2.1) to sets of
       language tags.

   2.  Extended Filtering (Section 3.2.2) matches a language priority
       list consisting of extended language ranges (Section 2.2) to sets
       of language tags.

   3.  Lookup (Section 3.3) matches a language priority list consisting
       of basic language ranges to sets of language tags to find the one
       _exact_ language tag that best matches the range.

   Filtering can be used to produce a set of results (such as a
   collection of documents) by comparing the user's preferences to
   language tags associated with the set of content.  For example, when
   performing a search, one might use filtering to limit the results to
   items tagged as being in the French language.  Filtering can also be
   used when deciding whether to perform a language-sensitive process on
   some content.  For example, a process might cause paragraphs whose
   language tag matched the language range "nl" to be displayed in
   italics within a document.

   Lookup produces the single result that best matches the user's
   preferences, so it is useful in cases in which only a single item can
   be returned.  For example, if a process were to insert a human
   readable error message into a protocol header, it might select the
   text based on the user's language priority list.  Since the process
   can return only one item, it must choose a single item and it must
   return some item, even if none of the content's language tags match
   the language priority list supplied by the user.

   The types of matching in this document are designed so that
   implementations are not required to validate or understand any of the
   semantics of the language tags or ranges or of the subtags in them.
   None of them require access to the IANA Language Subtag Registry (see
   Section 3 in [RFC3066bis]).  This simplifies implementation of these
   schemes.  An implementation MAY choose to check if either the
   language ranges or language tags being matched are "well-formed" or
   "valid" (see [RFC3066bis], Section 2.2.9) and MAY choose not to
   process invalid ranges.

   Regardless of the matching scheme chosen, protocols and
   implementations MAY canonicalize language tags and ranges by mapping
   grandfathered and obsolete tags or subtags into modern equivalents.
   If an implementation canonicalizes either ranges or tags, then the
   implementation will require the IANA Language Subtag Registry
   information for that purpose.  Implementations MAY also use semantic
   information external to the registry when matching tags.  For
   example, the primary language subtags 'nn' (Nynorsk Norwegian) and
   'nb' (Bokmal Norwegian) might both be usefully matched to the more
   general subtag 'no' (Norwegian).  Or an implementation might infer
   that content labeled "zh-Hans" (Chinese as written in the Simplified
   script) is more likely to match the range "zh-CN" (Chinese as used in
   China, where the Simplified script is predominant) than equivalent
   content labeled "zh-TW" (Chinese as used in Taiwan, where the
   Traditional script is predominant).

3.2.  Filtering

   Filtering is used to select the set of language tags that matches a
   given language priority list and return the associated content.
   It is called "filtering" because this set might contain no items at
   all or it might return an arbitrarily large number of matching items:
   as many items as match the language priority list, thus "filtering
   out" the non-matching items.

   In filtering, each language range represents the _least_ specific
   language tag (that is, the language tag with the fewest subtags)
   which is an acceptable match.  All of the language tags in the
   matching set of tags will have an equal or greater number of subtags
   than the language range.  Every non-wildcard subtag in the language
   range will appear in every one of the matching language tags.  For
   example, if the language priority list consists of the range "de-CH",
   one might see tags such as "de-CH-1996" but one will never see a tag
   such as "de" (because the 'CH' subtag is missing).

   If the language priority list (see Section 2.3) contains more than
   one range, the content returned is typically ordered in descending
   level of preference, but it MAY be unordered, according to the needs
   of the application or protocol.

   Some examples of applications where filtering might be appropriate
   include:

   o  Applying a style to sections of a document in a particular set of
      languages.

   o  Displaying the set of documents containing a particular set of
      keywords written in a specific set of languages.

   o  Selecting all email items written in a specific set of languages.

   o  Selecting audio files spoken in a particular language.

3.2.1.  Basic Filtering

   When filtering using basic language ranges, each basic language range
   in the language priority list is considered in turn, according to
   priority.  A particular language tag matches a language range if, in
   a case-insensitive comparison, the range exactly equals the tag, or
   if the range exactly equals a prefix of the tag such that the first
   character following the prefix is "-".
   For example, the language-range "de-de" matches the language tag
   "de-DE-1996", but not the language tags "de-Deva" or "de-Latn-DE".

   The special range "*" in a language priority list matches any tag.  A
   protocol which uses language ranges MAY specify additional rules
   about the semantics of "*"; for instance, HTTP/1.1 [RFC2616]
   specifies that the range "*" matches only languages not matched by
   any other range within an "Accept-Language" header.

   Basic filtering is identical to the type of matching described in
   [RFC3066], Section 2.5 (Language-range).

3.2.2.  Extended Filtering

   When filtering using extended language ranges, each extended language
   range in the language priority list is considered in turn, according
   to priority.  A particular language range is compared to each
   language tag using the following process:

   Compare the first subtag in the extended language range to the first
   subtag in the language tag in a case-insensitive manner.  If the
   first subtag in the range is "*", it matches any value.  Otherwise
   the two values must match or the overall match fails.

   Take each non-wildcard subtag in the language range and compare it in
   a case-insensitive manner to the next subtag in the language tag.  If
   the range's subtag exactly matches the tag's subtag, proceed to the
   next non-wildcard subtag in the language range (and beginning with
   the next subtag in the language tag) until the list of subtags in the
   language range is exhausted or the match fails.  If the tag's subtag
   is a "singleton" (a single letter or digit, which, in this case,
   includes the private-use subtag 'x') and the range's subtag does not
   match or if the language tag's list of subtags is exhausted, the
   match fails.  If the language range's list of subtags is exhausted,
   the match succeeds.
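
   The two filtering procedures of Sections 3.2.1 and 3.2.2 can be
   sketched in Python as follows.  This is a non-normative illustration:
   the function names basic_filter_match, extended_filter_match, and
   filter_tags are assumptions introduced here, the inputs are assumed
   to be syntactically valid, and the sketch reads the extended
   procedure as skipping over any tag subtag that neither matches nor is
   a singleton.

```python
from typing import Callable, Iterable, List


def basic_filter_match(language_range: str, tag: str) -> bool:
    """Section 3.2.1: the range equals the tag or is a prefix ending at '-'."""
    rng, tag = language_range.lower(), tag.lower()
    if rng == "*":
        return True
    return tag == rng or tag.startswith(rng + "-")


def extended_filter_match(language_range: str, tag: str) -> bool:
    """Section 3.2.2: find each non-wildcard range subtag, in order."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # First subtags must agree unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    # Wildcards after the first position add no constraint.
    wanted = [s for s in range_subtags[1:] if s != "*"]
    position = 1
    for subtag in wanted:
        while True:
            if position >= len(tag_subtags):
                return False      # tag exhausted before the range
            candidate = tag_subtags[position]
            position += 1
            if candidate == subtag:
                break             # matched; go to the next range subtag
            if len(candidate) == 1:
                return False      # a singleton stops the search
    return True


def filter_tags(
    priority_list: Iterable[str],
    tags: Iterable[str],
    matches: Callable[[str, str], bool] = basic_filter_match,
) -> List[str]:
    """Return matching tags, ordered by the priority of the range matched."""
    tags = list(tags)
    selected: List[str] = []
    for language_range in priority_list:
        for tag in tags:
            if tag not in selected and matches(language_range, tag):
                selected.append(tag)
    return selected
```

   With this sketch, filter_tags(["de-de"], ...) keeps "de-DE-1996" but
   not "de-Deva", and passing extended_filter_match with the range
   "de-*-DE" keeps "de-Latn-DE" while rejecting "de-x-DE".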

   Subtags not specified, including those at the end of the language
   range, are thus treated as if assigned the wildcard value "*".  Much
   like basic filtering, extended filtering selects content with
   arbitrarily long tags that share the same initial subtags as the
   language range.  In addition, extended filtering selects content with
   any intermediate subtags unspecified in the language range.  For
   example, the extended language range "de-*-DE" matches all of the
   following tags:

      de-DE

      de-Latn-DE

      de-Latf-DE

      de-de

      de-DE-x-goethe

      de-Latn-DE-1996

   The same range does not match any of the following tags for the
   reasons shown:

      de (missing 'DE')

      de-x-DE (singleton 'x' occurs before 'DE')

      de-Deva ('Deva' not equal to 'DE')

   Note: The structure of language tags defined by [RFC3066bis] defines
   each type of subtag (language, script, region, and so forth)
   according to position, size, and content.  This means that subtags in
   a language range can only match specific types of subtags in a
   language tag.  For example, a subtag such as 'Latn' is always a
   script subtag (unless it follows a singleton) while a subtag such as
   'nedis' can only match the equivalent variant subtag.

3.3.  Lookup

   Lookup is used to select the single language tag that best matches
   the language priority list for a given request and return the
   associated content.  When performing lookup, each language range in
   the language priority list is considered in turn, according to
   priority.  By contrast with filtering, each language range represents
   the _most_ specific tag which is an acceptable match.  The first
   content found with a matching tag, according to the user's priority,
   is considered the closest match and is the content returned.
   For example, if the language range is "de-ch", a lookup operation
   might produce content with the tags "de" or "de-CH" but never one
   with the tag "de-CH-1996".  Usually if no content matches the
   request, the "default" content is returned.

   For example, if an application inserts some dynamic content into a
   document, returning an empty string if there is no exact match is not
   an option.  Instead, the application "falls back" until it finds a
   matching language tag associated with a suitable piece of content to
   insert.  Examples of lookup might include:

   o  Selection of a template containing the text for an automated email
      response.

   o  Selection of an item containing some text for inclusion in a
      particular Web page.

   o  Selection of a string of text for inclusion in an error log.

   o  Selection of an audio file to play as a prompt in a phone system.

   In the lookup scheme, the language range is progressively truncated
   from the end until a matching piece of content is located.  Single
   letter or digit subtags (including both the letter 'x' which
   introduces private-use sequences, and the subtags that introduce
   extensions) are removed at the same time as their closest trailing
   subtag.  For example, starting with the range
   "zh-Hant-CN-x-private1-private2", the lookup progressively searches
   for content as shown below:

   Range to match: zh-Hant-CN-x-private1-private2
   1. zh-Hant-CN-x-private1-private2
   2. zh-Hant-CN-x-private1
   3. zh-Hant-CN
   4. zh-Hant
   5. zh
   6. (default content)

             Figure 3: Example of a Lookup Fallback Pattern

   This allows some flexibility in finding a match.  For example, lookup
   provides better results for cases in which content is not available
   that exactly matches the user request than if the default language
   for the system or content were returned immediately.
Not every 511 specific level of tag granularity is usually available or language 512 content may be sparsely populated. "Falling back" through the subtag 513 sequence provides more opportunity to find a match between available 514 language tags and the user's request. 516 The default behavior when no tag matches the language priority list 517 is implementation defined. An implementation might, for example, 518 return content: 520 o with no language tag 522 o of a non-linguistic nature, such as an image or sound 524 o with an empty language tag value, in cases where the protocol 525 permits the empty value (see, for example, "xml:lang" in [XML10], 526 which indicates that the element contains non-linguistic content) 528 o in a particular language designated for the bit of content being 529 selected 531 o labelled with the tag "i-default" (see [RFC2277]) 533 When performing lookup using a language priority list, the 534 progressive search MUST process each language range in the list 535 before finding the default content or empty tag. 537 One common way for an application or implementation to provide for a 538 default is to allow a specific language range to be set as the 539 default for a specific type of request. This language range is then 540 treated as if it were appended to the end of the language priority 541 list as a whole, rather than after each item in the language priority 542 list. 544 For example, if a particular user's language priority list were 545 "fr-FR, zh-Hant" and the program doing the matching had a default 546 language range of "ja-JP", the program would search for content as 547 follows: 548 1. fr-FR 549 2. fr 550 3. zh-Hant // next language 551 4. zh 552 5. (search for the default content) 553 a. ja-JP 554 b. ja 555 c. 
(implementation defined default) 557 Figure 4: Lookup Using a Language Priority List 559 Implementations SHOULD ignore extensions and unrecognized private-use 560 subtags when performing lookup, since these subtags are usually 561 orthogonal to the user's request. 563 The special language range "*" matches any language tag. In the 564 lookup scheme, this range does not convey enough information by 565 itself to determine which content is most appropriate, since it 566 matches everything. If the language range "*" is followed by other 567 language ranges, it SHOULD be skipped. If the language range "*" is 568 the only one in the language priority list or if no other language 569 range follows, the default content SHOULD be returned. 571 In some cases, the language priority list might contain one or more 572 extended language ranges (as, for example, when the same language 573 priority list is used as input for both lookup and filtering 574 operations). Wildcard values in an extended language range normally 575 match any value that occurs in that position in a language tag. 576 Since only one item can be returned for any given lookup request, 577 wildcards in a language range have to be processed in a consistent 578 manner or the same request will produce widely varying results. 579 Implementations that accept extended language ranges MUST define 580 which content is returned when more than one item matches the 581 extended language range. 583 For example, an implementation could return the matching tag that is 584 first in ASCII-order. If the language range were "*-CH" and the set 585 of tags included "de-CH", "fr-CH", and "it-CH", then the tag "de-CH" 586 would be returned. Another possibility would be for an 587 implementation to map the extended language ranges to basic ranges. 589 4. Other Considerations 591 When working with language ranges and matching schemes, there are 592 some additional points that may influence the choice of either. 594 4.1. 
Choosing Language Ranges

596     Users indicate their language preferences via the choice of a
597     language range or the list of language ranges in a language priority
598     list. The type of matching affects what the best choice is for a
599     user.

601     Most matching schemes make no attempt to process the semantic meaning
602     of the subtags and the language range is compared, in a case-
603     insensitive manner, to each language tag being matched, using basic
604     string processing. Users SHOULD select language ranges that are
605     well-formed, valid language tags according to [RFC3066bis]
606     (substituting wildcards as appropriate in extended language ranges).

608     Users SHOULD replace tags or subtags which have been deprecated with
609     the Preferred-Value from the IANA Language Subtag Registry. If the
610     user is working with content that might use the older form, the user
611     might include both the new and old forms in a language priority list.
612     For example, the tag "art-lojban" is deprecated. The subtag 'jbo' is
613     supposed to be used instead, so the user might use it to form the
614     language range. Or the user might include both in a language
615     priority list: "jbo, art-lojban".

617     Users SHOULD avoid subtags that add no distinguishing value to a
618     language range. When filtering, the fewer the number of subtags that
619     appear in the language range, the more content the range will
620     probably match, while in lookup unnecessary subtags might cause
621     "better", more-specific content to be skipped in favor of less
622     specific content. For example, the range "de-Latn-DE" would return
623     content tagged "de" instead of content tagged "de-DE", even though
624     the latter is probably a better match.

626     Many languages are written predominantly in a single script. This is
627     usually recorded in the Suppress-Script field in that language
628     subtag's registry entry. For these languages, script subtags SHOULD
629     NOT be used to form a language range.
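The cost of an unnecessary subtag in lookup can be illustrated with a
short Python sketch. It is not part of the specification: the names
fallback_sequence and lookup are hypothetical, the sketch handles only
basic language ranges, it simply skips the "*" range, and it gives
extension and private-use subtags no special treatment.

```python
def fallback_sequence(language_range):
    """Yield the range, then each shorter prefix, most specific first."""
    subtags = language_range.lower().split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()  # drop the last subtag and try again


def lookup(priority_list, available_tags, default_range=None):
    """Return the first available tag found by progressive truncation.

    Returns None when nothing matches; the caller is then expected to
    supply its own implementation-defined default content.
    """
    # Compare case-insensitively, but return the tag as it was supplied.
    available = {tag.lower(): tag for tag in available_tags}

    # "*" carries no preference information in lookup, so it is skipped.
    ranges = [r for r in priority_list if r != "*"]

    # The default range is tried once, after the whole priority list.
    if default_range is not None:
        ranges.append(default_range)

    for language_range in ranges:
        for candidate in fallback_sequence(language_range):
            if candidate in available:
                return available[candidate]
    return None


tags = ["de", "de-DE"]
print(lookup(["de-Latn-DE"], tags))  # -> de
print(lookup(["de-DE"], tags))       # -> de-DE
```

Because truncation removes one subtag at a time from the end, the range
"de-Latn-DE" is tried as "de-Latn-DE", "de-Latn", and "de"; the tag
"de-DE" is never a candidate, so the redundant script subtag causes the
more specific content to be missed.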
Thus the language range "en-
630     Latn" is inappropriate in most cases (because the vast majority of
631     English documents are written in the Latin script and thus the 'en'
632     language subtag has a Suppress-Script field for 'Latn' in the
633     registry).

635     When working with tags and ranges, note that extensions and most
636     private-use subtags are orthogonal to language tag matching, in that
637     they specify additional attributes of the text not related to the
638     goals of most matching schemes. Users SHOULD avoid using these
639     subtags in language ranges, since they interfere with the selection
640     of available content. When used in language tags (as opposed to
641     ranges), these subtags normally do not interfere with filtering
642     (Section 3), since they appear at the end of the tag and will match
643     all prefixes. Lookup (Section 3.3) implementations often ignore
644     unrecognized private-use and extension subtags when performing
645     language tag fallback.

647     Applications, specifications, or protocols that choose not to
648     interpret one or more private-use or extension subtags SHOULD NOT
649     remove or modify these extensions in content that they are
650     processing. When a language tag instance is to be used in a
651     specific, known protocol, and is not being passed through to other
652     protocols, language tags MAY be altered to remove subtags and
653     extensions that are not supported by that protocol. Such alterations
654     SHOULD be avoided, if possible, since they remove information that
655     might be relevant to other services that could make use of it.

657     Some applications of language tags might want or need to consider
658     extensions and private-use subtags when matching tags. If extensions
659     and private-use subtags are included in a matching process that
660     utilizes one of the schemes described in this document, then the
661     implementation SHOULD canonicalize the language tags and/or ranges
662     before performing the matching.
Note that language tag processors
663     that claim to be "well-formed" processors as defined in [RFC3066bis]
664     generally fall into this category.

666  4.2. Meaning of Language Tags and Ranges

668     Selecting content using language ranges requires some understanding
669     by users of what they are selecting. The meanings of the various
670     subtags in a language range are identical to their meanings in a
671     language tag (see Section 4.2 in [RFC3066bis]), with the addition
672     that the wildcard "*" represents any matching sequence of values.

674  4.3. Considerations for Private Use Subtags

676     Private-use subtags require private agreement between the parties
677     that intend to use or exchange language tags that use them, and great
678     caution SHOULD be used in employing them in content or protocols
679     intended for general use. Private-use subtags are simply useless for
680     information exchange without prior arrangement.

682     The value and semantic meaning of private-use tags and of the subtags
683     used within such a language tag are not defined. Matching private-
684     use tags using language ranges or extended language ranges can result
685     in unpredictable content being returned.

687  4.4. Length Considerations for Language Ranges

689     Language ranges are very similar to language tags in terms of content
690     and usage. The same types of restrictions on length that apply to
691     language tags can also apply to language ranges. See [RFC3066bis]
692     Section 4.3 (Length Considerations).

694  5. IANA Considerations

696     This document presents no new or existing considerations for IANA.

698  6. Changes

700     This is the first version of this document.

702  7. Security Considerations

704     Language ranges used in content negotiation might be used to infer
705     the nationality of the sender, and thus identify potential targets
706     for surveillance.
In addition, unique or highly unusual language
707     ranges or combinations of language ranges might be used to track a
708     specific individual's activities.

710     This is a special case of the general problem that anything you send
711     is visible to the receiving party. It is useful to be aware that
712     such concerns can exist in some cases.

714     The evaluation of the exact magnitude of the threat, and any possible
715     countermeasures, is left to each application or protocol.

717  8. Character Set Considerations

719     Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-
720     MINUS (%x2D). Language ranges also use the character ASTERISK
721     (%x2A). These characters are present in most character sets, so
722     presentation or exchange of language tags or ranges should not be
723     constrained by character set issues.

725  9. References

727  9.1. Normative References

729     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
730                Requirement Levels", BCP 14, RFC 2119, March 1997.

732     [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
733                Languages", BCP 18, RFC 2277, January 1998.

735     [RFC3066bis]
736                Phillips, A., Ed. and M. Davis, Ed., "Tags for the
737                Identification of Languages", October 2005, .

741     [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
742                Specifications: ABNF", RFC 4234, October 2005.

744  9.2. Informative References

746     [RFC1766]  Alvestrand, H., "Tags for the Identification of
747                Languages", RFC 1766, March 1995.

749     [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
750                Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
751                Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

753     [RFC2616errata]
754                IETF, "HTTP/1.1 Specification Errata", 10 2004,
755                .

757     [RFC3066]  Alvestrand, H., "Tags for the Identification of
758                Languages", BCP 47, RFC 3066, January 2001.

760     [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
761                May 2002.

763     [XML10]    Bray (et al), T., "Extensible Markup Language (XML) 1.0",
764                02 2004.

766  Appendix A. Acknowledgements

768     Any list of contributors is bound to be incomplete; please regard the
769     following as only a selection from the group of people who have
770     contributed to make this document what it is today.

772     The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
773     which is a precursor to this document, made enormous contributions
774     directly or indirectly to this document and are generally responsible
775     for the success of language tags.

777     The following people (in alphabetical order by family name)
778     contributed to this document:

780     Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
781     Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
782     Patton, Randy Presuhn, Eric van der Poel, Markus Scherer, and many,
783     many others.

785     Very special thanks must go to Harald Tveit Alvestrand, who
786     originated RFCs 1766 and 3066, and without whom this document would
787     not have been possible.

789  Authors' Addresses

791     Addison Phillips (editor)
792     Yahoo! Inc

794     Email: addison at inter dash locale dot com

796     Mark Davis (editor)
797     Google

799     Email: mark dot davis at macchiato dot com

801  Intellectual Property Statement

803     The IETF takes no position regarding the validity or scope of any
804     Intellectual Property Rights or other rights that might be claimed to
805     pertain to the implementation or use of the technology described in
806     this document or the extent to which any license under such rights
807     might or might not be available; nor does it represent that it has
808     made any independent effort to identify any such rights. Information
809     on the procedures with respect to rights in RFC documents can be
810     found in BCP 78 and BCP 79.

812     Copies of IPR disclosures made to the IETF Secretariat and any
813     assurances of licenses to be made available, or the result of an
814     attempt made to obtain a general license or permission for the use of
815     such proprietary rights by implementers or users of this
816     specification can be obtained from the IETF on-line IPR repository at
817     http://www.ietf.org/ipr.

819     The IETF invites any interested party to bring to its attention any
820     copyrights, patents or patent applications, or other proprietary
821     rights that may cover technology that may be required to implement
822     this standard. Please address the information to the IETF at
823     ietf-ipr@ietf.org.

825  Disclaimer of Validity

827     This document and the information contained herein are provided on an
828     "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
829     OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
830     ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
831     INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
832     INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
833     WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

835  Copyright Statement

837     Copyright (C) The Internet Society (2006). This document is subject
838     to the rights, licenses and restrictions contained in BCP 78, and
839     except as set forth therein, the authors retain all their rights.

841  Acknowledgment

843     Funding for the RFC Editor function is currently provided by the
844     Internet Society.