Network Working Group                                   A. Phillips, Ed.
Internet-Draft                                            Quest Software
Obsoletes: 3066 (if approved)                              M. Davis, Ed.
Expires: May 22, 2006                                                IBM
                                                       November 18, 2005


                       Matching of Language Tags
                      draft-ietf-ltru-matching-07

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on May 22, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document describes different mechanisms for comparing, matching,
   and evaluating language tags. Possible algorithms for language
   negotiation and content selection are described. This document, in
   combination with RFC 3066bis (replace "3066bis" with the RFC number
   assigned to draft-ietf-ltru-registry-14), replaces RFC 3066, which
   replaced RFC 1766.

Table of Contents

   1.  Introduction
   2.  The Language Range
     2.1.  Lists of Language Ranges
     2.2.  Basic Language Range
     2.3.  Extended Language Range
     2.4.  Choosing a Language Range
   3.  Types of Matching
     3.1.  Choosing a Type of Matching
     3.2.  Filtering
       3.2.1.  Filtering with Basic Language Ranges
       3.2.2.  Filtering with Extended Language Ranges
       3.2.3.  Scored Filtering
     3.3.  Lookup
   4.  Other Considerations
     4.1.  Meaning of Language Tags and Ranges
     4.2.  Considerations for Private Use Subtags
     4.3.  Length Considerations in Matching
   5.  IANA Considerations
   6.  Changes
   7.  Security Considerations
   8.  Character Set Considerations
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Human beings on our planet have, past and present, used a number of
   languages. There are many reasons why one would want to identify the
   language used when presenting or requesting information.

   Information about a user's language preferences commonly needs to be
   identified so that appropriate processing can be applied. For
   example, the user's language preferences in a browser can be used to
   select web pages appropriately. Language preferences can also be
   used to select among tools (such as dictionaries) to assist in the
   processing or understanding of content in different languages.

   Given a set of language identifiers, such as those defined in
   [RFC3066bis], various mechanisms can be envisioned for performing
   language negotiation and tag matching.
   Applications, protocols, or specifications will have varying needs
   and requirements that affect the choice of a suitable mechanism.

   This document defines several mechanisms for matching, selecting, or
   filtering content whose natural language is identified using Language
   Tags [RFC3066bis], as well as the syntax (called a "language range")
   associated with each of these mechanisms for specifying the user's
   language preferences.

   This document, in combination with [RFC3066bis] (replace "3066bis"
   globally in this document with the RFC number assigned to
   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
   [RFC1766].

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  The Language Range

   Language Tags [RFC3066bis] are used to identify the language of some
   information item or content. Applications or protocols that use
   language tags are often faced with the problem of identifying sets of
   content that share certain language attributes. For example,
   HTTP/1.1 [RFC2616] describes language ranges in its discussion of the
   Accept-Language header (Section 14.4). These are to be used when
   selecting content from servers based on the language of that content.

   When selecting content according to its language, it is useful to
   have a mechanism for identifying sets of language tags that share
   specific attributes. This allows users to select or filter content
   based on specific requirements. Such an identifier is called a
   "Language Range".

   Language tags and thus language ranges are to be treated as case
   insensitive: there exist conventions for the capitalization of some
   of the subtags, but these MUST NOT be taken to carry meaning.
   Matching of language tags to language ranges MUST be done in a case
   insensitive manner as well.

2.1.  Lists of Language Ranges

   When users specify a language preference they often need to specify a
   prioritized list of language ranges in order to best reflect their
   language preferences. This is especially true for speakers of
   minority languages. A speaker of Breton in France, for example, may
   specify "br" followed by "fr", meaning that if Breton is available,
   it is preferred, but otherwise French is the best alternative. It
   can get more complex: a speaker may wish to fall back from Skolt Sami
   to Northern Sami to Finnish.

   A "Language Priority List" consists of a prioritized or weighted list
   of language ranges. One well known example of such a list is the
   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
   14.4) and RFC 3282 [RFC3282].

   The various matching operations described in this document include
   considerations for using a language priority list. When given as
   examples in this document, language priority lists will be shown as a
   quoted sequence of ranges separated by semi-colons, like this: "en;
   fr; zh-Hant" (which would be read as "English before French before
   Chinese as written in the Traditional script").

2.2.  Basic Language Range

   A "Basic Language Range" identifies the set of content whose language
   tags begin with the same sequence of subtags. A basic language range
   is identified by its 'language range' tag, by adapting the
   ABNF [RFC4234] from HTTP/1.1 [RFC2616]:

   language-range = language-tag / "*"
   language-tag   = 1*8alphanum *("-" 1*8alphanum)
   alphanum       = ALPHA / DIGIT

   That is, a language-range has the same syntax as a language-tag or is
   the single character "*". Basic Language Ranges imply that there is
   a semantic relationship between language tags that share the same
   prefix.
   While this is often the case, it is not always true and
   users should note that the set of language tags that match a specific
   language-range may not be mutually intelligible.

   Basic language ranges were originally described in [RFC3066] and
   HTTP/1.1 [RFC2616] (where they are referred to as simply a "language
   range").

2.3.  Extended Language Range

   A Basic Language Range does not always provide the most appropriate
   way to specify a user's preferences. Sometimes it is beneficial to
   use a more granular matching scheme that takes advantage of the
   internal structure of language tags, by allowing the user to specify,
   for example, the value of a specific field in a language tag or to
   indicate which values are of interest in filtering or selecting the
   content.

   In an extended language range, the identifier takes the form of a
   series of subtags which MUST consist of well-formed subtags or the
   special subtag "*". For example, the language range "en-*-US"
   specifies a primary language of 'en', followed by any script subtag,
   followed by the region subtag 'US'.

   An extended language range can be represented by the following ABNF:

   extended-language-range = range          ; a range
                           / privateuse     ; private-use tag
                           / grandfathered  ; grandfathered registrations

   range         = (language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse])

   language      = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
                 / 4ALPHA                 ; reserved for future use
                 / 5*8ALPHA               ; registered language subtag
                 / "*"                    ; ... or wildcard

   extlang       = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
                                          ; reserved for future use
                                          ; wildcard can only appear
                                          ; at the end

   script        = 4ALPHA                 ; ISO 15924 code
                 / "*"                    ; or wildcard

   region        = 2ALPHA                 ; ISO 3166 code
                 / 3DIGIT                 ; UN M.49 code
                 / "*"                    ; ... or wildcard

   variant       = 5*8alphanum            ; registered variants
                 / (DIGIT 3alphanum)      ;
                 / "*"                    ; ... or wildcard

   extension     = singleton *("-" (2*8alphanum)) [ "-*" ]
                                          ; extension subtags
                                          ; wildcard can only appear
                                          ; at the end

   singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
                 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
                 ; Single letters: x/X is reserved for private use

   privateuse    = ("x"/"X") 1*("-" (1*8alphanum))

   grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
                                          ; grandfathered registration
                                          ; Note: I is the only singleton
                                          ; that starts a grandfathered tag

   alphanum      = (ALPHA / DIGIT)        ; letters and numbers

   A field not present in the middle of an extended language range MAY
   be treated as if the field contained a "*". For example, the range
   "en-US" MAY be considered to be equivalent to the range "en-*-US".
   This also means that multiple wildcards can be collapsed (so that
   "en-*-*-US" is equivalent to "en-*-US").

2.4.  Choosing a Language Range

   Users indicate their language preferences via the choice of a
   language range or the set of language ranges in the language priority
   list. The type of matching will affect what the best choice is for
   a given user. In addition, users should be aware that, when working
   with language ranges, most matching schemes make no attempt to
   process the semantic meaning of the subtags. The language tag and
   language range (or their subtags) are usually compared in a case
   insensitive manner using basic string processing. Thus the choice of
   subtags in both the language tag and language range may affect the
   results produced.

   Users SHOULD avoid subtags that add no distinguishing value to a
   language range. For example, script subtags SHOULD NOT be used to
   form a language range with language subtags which have a matching
   Suppress-Script field in their registry record.
   Thus the language range "en-Latn" is probably inappropriate in most
   cases (because the vast majority of English documents are written in
   the Latin script and thus the 'en' language subtag has a
   Suppress-Script field for 'Latn' in the registry).

   When working with tags and ranges, note that extensions and most
   private-use subtags are orthogonal to language tag fallback and users
   SHOULD avoid using these subtags in language ranges, since they will
   often interfere with the selection of available language content.
   Since these subtags are always at the end of the sequence of subtags,
   they don't normally interfere with the use of prefixes for the
   filtering schemes described below in Section 3.

   When working with tags and ranges users SHOULD note the following:

   1.  Private-use and Extension subtags are normally orthogonal to
       language tag fallback. Implementations or specifications that
       use a lookup (Section 3.3) matching scheme SHOULD ignore
       unrecognized private-use and extension subtags when performing
       language tag fallback. Since these subtags are always at the end
       of the sequence of subtags, they don't normally interfere with
       the use of prefixes for matching in the schemes described below.

   2.  Applications, specifications, or protocols that choose not to
       interpret one or more private-use or extension subtags SHOULD NOT
       remove or modify these extensions in content that they are
       processing. When a language tag instance is to be used in a
       specific, known protocol, and is not being passed through to
       other protocols, language tags MAY be filtered to remove subtags
       and extensions that are not supported by that protocol. Such
       filtering SHOULD be avoided, if possible, since it removes
       information that might be relevant if services on the other end
       of the protocol would make use of that information.

   3.  Some applications of language tags might want or need to consider
       extensions and private-use subtags when matching tags. If
       extensions and private-use subtags are included in a matching or
       filtering process that uses one of the schemes described in this
       document, then the implementation SHOULD canonicalize the
       language tags and/or ranges before performing the matching. Note
       that language tag processors that claim to be "well-formed"
       processors as defined in [RFC3066bis] generally fall into this
       category.

3.  Types of Matching

   Matching language ranges to language tags can be done in a number of
   different ways. This section describes the different types of
   matching scheme, as well as the considerations for choosing between
   them. Protocols and specifications SHOULD clearly indicate the
   particular mechanism used in selecting or matching language tags.

   There are two basic types of matching scheme: those that produce an
   open-ended set of content (called "filtering") and those that produce
   a single information item for a given request (called "lookup").

   A key difference between these two types of matching scheme is that
   the language range for filtering operations is always the _least_
   specific tag one will accept as a match, while for lookup operations
   the language range is always the _most_ specific tag.

3.1.  Choosing a Type of Matching

   Applications, protocols, and specifications are faced with the
   decision of what type of matching to use. Sometimes, different
   styles of matching might be suited for different kinds of processing
   within a particular application or protocol.

   Filtering can be used to produce a set of results (such as a
   collection of documents). For example, if using a search engine, one
   might use filtering to limit the results to documents written in
   French.
   It can also be used when deciding whether to perform some
   processing that is language sensitive on some content. For example,
   a process might cause paragraphs whose language tag matched the
   language range "nl" to be displayed in italics within a document.

   This document describes three types of filtering:

   1.  Basic Filtering (Section 3.2.1) is used to match content using
       basic language ranges (Section 2.2). It is compatible with
       implementations that do not produce extended language ranges.

   2.  Extended Range Filtering (Section 3.2.2) is used to match content
       using extended language ranges (Section 2.3). Newer
       implementations SHOULD use this form of filtering in preference
       to basic filtering.

   3.  Scored Filtering (Section 3.2.3) produces an ordered set of
       content using either basic or extended language ranges. It
       SHOULD be used when the quality of the match within a specific
       language range is important, as when presenting a list of
       documents resulting from a search.

   Lookup (Section 3.3) is used when each request MUST produce exactly
   one piece of content. For example, a Web server might use the
   Accept-Language HTTP header to choose which language to return a
   custom 404 page in: since it can return only one page, it must choose
   a single item and it must return some item, even if no content
   matches the language ranges supplied by the user.

   Most types of matching in this document are designed so that
   implementations do not have to examine the values of the subtags
   supplied and, except for scored filtering, they do not need access to
   the Language Subtag Registry nor do they require the use of valid
   subtags in either language tags or language ranges. This has great
   benefit for speed and simplicity of implementation.

   Implementations might also wish to use semantic information external
   to the language tags when performing fallback.
   For example, the primary language subtags 'nn' (Nynorsk Norwegian)
   and 'nb' (Bokmal Norwegian) might both be usefully matched to the
   more general subtag 'no' (Norwegian). Or an implementation might
   infer that content labeled "zh-CN" is more likely to match the range
   "zh-Hans" than equivalent content labeled "zh-TW".

3.2.  Filtering

   Filtering is used to select the set of content that matches a given
   prefix. It is called "filtering" because this set of content may
   contain no items at all or it may return an arbitrary number of
   matching items--as many as match the language range used to specify
   the items, thus filtering out the non-matching content.

   In filtering, the language range represents the _least_ specific tag
   which is an acceptable match. That is, all of the language tags in
   the set of filtered content will have an equal or greater number of
   subtags than the language range. For example, if the language range
   is "de-CH", one might see matching content with the tag "de-CH-1996"
   but one will never see a match with the tag "de".

   If the language priority list (see Section 2.1) contains more than
   one range, the content returned is typically ordered in descending
   level of preference.

   Some examples where filtering might be appropriate include:

   o  Applying a style to sections of a document in a particular
      language range.

   o  Displaying the set of documents containing a particular set of
      keywords written in a specific language.

   o  Selecting all email items written in a specific range of
      languages.

   Filtering can produce either an ordered or an unordered set of
   results. For example, applying formatting to a document based on the
   language of specific pieces of content does not require the content
   to be ordered. It is sufficient to know whether a specific piece of
   content matches or does not match.
   A search application, on the other hand, probably would put the
   results into a priority order.

   If an ordered set is desired, as described above, then the
   application or protocol needs to determine the relative "quality" of
   the match between different language tags and the language range.

   This measurement is called a "distance metric". A distance metric
   assigns a numeric value to the comparison of each language tag to a
   language range and represents the 'distance' between the two. A
   distance of zero means that they are identical, a small distance
   indicates that they are very similar, and a large distance indicates
   that they are very different. Using a distance metric, an
   implementation can, for example, allow users to select a threshold
   distance for a match to be "successful" while filtering, or it can
   use the numeric value to order the results.

3.2.1.  Filtering with Basic Language Ranges

   When filtering using a basic language range, the language range
   matches a language tag if it exactly equals the tag, or if it exactly
   equals a prefix of the tag such that the first character following
   the prefix is "-". (That is, the language-range "de-de" matches the
   language tag "de-DE-1996", but not the language tag "de-Deva".)

   The special range "*" matches any tag. A protocol which uses
   language ranges MAY specify additional rules about the semantics of
   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
   languages not matched by any other range within an "Accept-Language"
   header.

3.2.2.  Filtering with Extended Language Ranges

   In the Extended Range Matching scheme, each extended language range
   in the language priority list is considered in turn, according to
   priority. The subtags in each extended language range are compared
   to the corresponding subtags in the language tag being examined.
   The subtag from the range is considered to match if it exactly
   matches the corresponding subtag in the tag or the range's subtag has
   the value "*" (which matches all subtags, including the empty
   subtag). Extended Range Matching is an extension of basic matching
   (Section 3.2.1): the language range represents the least specific tag
   which is an acceptable match.

   Private-use subtags MAY be specified in the language range and MUST
   NOT be ignored when matching.

   Subtags not specified, including those at the end of the language
   range, are assigned the value "*". This makes each range into a
   prefix much like that used in basic language range matching. For
   example, the extended language range "de-*-DE" matches all of the
   following tags because the unspecified variant field is expanded to
   "*":

      de-DE

      de-Latn-DE

      de-Latf-DE

      de-DE-x-goethe

      de-Latn-DE-1996

3.2.3.  Scored Filtering

   Both basic and extended language range filtering produce simple
   boolean matches. Sometimes it may be beneficial to provide an array
   of results with different levels of matching, for example, sorting
   results based on the overall "quality" of the match. Scored (or
   "distance metric") filtering provides a way to generate these quality
   values.

   First both the extended language range and the language tags to be
   matched to it must be canonicalized by mapping grandfathered and
   obsolete tags into modern equivalents.

   The language range and the language tags are then transformed into
   quintuples of elements of the form (language, script, country,
   variant, extension). Any extended language subtags are considered
   part of the language element; private-use subtag sequences are
   considered part of the language element if in the initial position in
   the tag and part of the variant element if not. Language subtags
   'und', 'mul', and the script subtag 'Zyyy' are converted to "*".
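
   The transformation of a language tag into a quintuple can be
   sketched as follows. This is a minimal illustration, not a normative
   algorithm: the function name tag_to_quintuple is hypothetical,
   subtags are classified by length and shape only (no registry
   access), the input is assumed to be already canonicalized, and
   elements absent from the tag are filled with "*", as the text goes
   on to specify for language tags.

```python
def tag_to_quintuple(tag):
    """Split a language tag into (language, script, region, variant, extension).

    Illustrative sketch only: subtags are classified by shape, and the tag
    is assumed to be canonicalized (no non-"i" grandfathered tags).
    """
    subtags = tag.split("-")

    # A tag starting with a private-use ("x") or grandfathered ("i")
    # singleton is kept whole as the language element.
    if subtags[0].lower() in ("x", "i"):
        return (tag, "*", "*", "*", "*")

    i = 1
    language = [subtags[0]]
    # Extended language subtags (three letters) join the language element.
    while i < len(subtags) and len(subtags[i]) == 3 and subtags[i].isalpha():
        language.append(subtags[i])
        i += 1

    script = region = "*"
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        script = subtags[i]
        i += 1
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        region = subtags[i]
        i += 1

    # Variants run until the first single-character subtag (a singleton).
    variant = []
    while i < len(subtags) and len(subtags[i]) > 1:
        variant.append(subtags[i])
        i += 1

    # Extension sequences run until a private-use "x" singleton, if any.
    extension = []
    while i < len(subtags) and subtags[i].lower() != "x":
        extension.append(subtags[i])
        i += 1

    # A non-initial private-use sequence is folded into the variant element.
    variant.extend(subtags[i:])

    language_element = "-".join(language)
    if language_element.lower() in ("und", "mul"):
        language_element = "*"
    if script.lower() == "zyyy":
        script = "*"

    return (
        language_element,
        script,
        region,
        "-".join(variant) or "*",
        "-".join(extension) or "*",
    )


print(tag_to_quintuple("zh-cmn-Hant"))  # ('zh-cmn', 'Hant', '*', '*', '*')
print(tag_to_quintuple("en-x-foo"))     # ('en', '*', '*', 'x-foo', '*')
```

   The sketch preserves the case of the input; because matching is case
   insensitive, a caller would compare elements without regard to case.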

   Missing components in the language-tag are set to "*"; thus a "*"
   pattern becomes the quintuple ("*", "*", "*", "*", "*"). Missing
   components in the extended language-range are handled similarly to
   extended range lookup: missing internal subtags are expanded to "*".
   Missing end subtags are expanded as the empty string. Thus a pattern
   "en-US" becomes the quintuple ("en","*","US","","").

   Here are some examples of language tags, showing their quintuples as
   both language tags and language ranges:

   en-US
      Tag:   (en, *, US, *, *)
      Range: (en, *, US, "", "")

   sr-Latn
      Tag:   (sr, Latn, *, *, *)
      Range: (sr, Latn, "", "", "")

   zh-cmn-Hant
      Tag:   (zh-cmn, Hant, *, *, *)
      Range: (zh-cmn, Hant, "", "", "")

   x-foo
      Tag:   (x-foo, *, *, *, *)
      Range: (x-foo, "", "", "", "")

   en-x-foo
      Tag:   (en, *, *, x-foo, *)
      Range: (en, *, *, x-foo, "")

   i-default
      Tag:   (i-default, *, *, *, *)
      Range: (i-default, "", "", "", "")

   sl-Latn-IT-rozaj
      Tag:   (sl, Latn, IT, rozaj, *)
      Range: (sl, Latn, IT, rozaj, "")

   zh-r-wadegile (hypothetical)
      Tag:   (zh, *, *, *, r-wadegile)
      Range: (zh, *, *, *, r-wadegile)

           Figure 3: Examples of Distance Metric Quintuples

   Each language-range/language-tag pair being compared is assigned a
   distance value, whereby small values indicate better matches and
   large values indicate worse ones. The distance between the pair is
   the sum of the distances for each of the corresponding elements of
   the quintuple. If the elements are identical or one is '*', then the
   distance value between them is zero. Otherwise, it is given by the
   following table:

      256 language mismatch
      128 script mismatch
       32 region mismatch
        4 variant mismatch
        1 extension mismatch

   A value of 0 is a perfect match; 421 is no match at all. Different
   threshold values might be appropriate for different applications or
   protocols.
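
   The summation described above can be sketched as follows, taking the
   quintuples as already computed. The names WEIGHTS and distance are
   hypothetical, and comparing elements without regard to case is an
   assumption carried over from the general rule that matching is case
   insensitive.

```python
# Per-element weights from the table above, in quintuple order:
# (language, script, region, variant, extension).
WEIGHTS = (256, 128, 32, 4, 1)


def distance(range_quintuple, tag_quintuple):
    """Sum the mismatch weights between a range and a tag quintuple."""
    total = 0
    for weight, r, t in zip(WEIGHTS, range_quintuple, tag_quintuple):
        # Identical elements, or a "*" on either side, contribute zero.
        if r == "*" or t == "*" or r.lower() == t.lower():
            continue
        total += weight
    return total


# The range "en-US" as a quintuple; missing end subtags are empty strings.
en_us = ("en", "*", "US", "", "")

print(distance(en_us, ("en", "*", "GB", "*", "*")))      # 32
print(distance(en_us, ("en", "Latn", "US", "*", "*")))   # 0
print(distance(en_us, ("en", "*", "US", "x-foo", "*")))  # 4
```

   Because the weights are distinct powers of two, each possible sum
   identifies exactly which elements mismatched; the maximum, 421, is
   reached only when all five elements differ.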

   Implementations will usually allow users to choose the most
   appropriate selection value, ranking the matched items based on
   score.

   Examples of various tags' distances from the range "en-US":

   "fr-FR"             288 (language & region mismatch)
   "fr"                256 (language mismatch, region match)
   "en-GB"              32 (region mismatch)
   "en-Latn-US"          0 (all fields match)
   "en-Brai"             0 (script and region wildcards match)
   "en-US-x-foo"         4 (variant mismatch: range is the empty string)
   "en-US-r-wadegile"    1 (extension mismatch: range is the empty string)

   Implementations or protocols sometimes might wish to use more
   sophisticated weights that depend on the values of the corresponding
   elements. For example, depending on the domain, an implementation
   might give a small distance to the difference between closely related
   subtags. Some examples of closely related subtags might be:

   Language:
      no (Norwegian)
      nb (Bokmal Norwegian)
      nn (Nynorsk Norwegian)

   Script:
      Kata (katakana)
      Hira (hiragana)

   Region:
      US (United States of America)
      UM (United States Minor Outlying Islands)

             Figure 6: Examples of Closely Related Subtags

3.3.  Lookup

   Lookup is used to select the single information item that best
   matches the language priority list for a given request. In lookup,
   each language-range in the language priority list represents the
   _most_ specific tag which is an acceptable match; only the closest
   matching item according to the user's priority is returned. For
   example, if the language range is "de-CH", one might expect to
   receive an information item with the tag "de" but never one with the
   tag "de-CH-1996". Usually if no content matches the request, a
   "default" item is returned.

   For example, if an application inserts some dynamic content into a
   document, returning an empty string if there is no exact match is not
   an option.
   Instead, the application "falls back" until it finds a
   suitable piece of content to insert. Other examples of lookup might
   include:

   o  Selection of a template containing the text for an automated email
      response.

   o  Selection of a graphic containing text for inclusion in a
      particular Web page.

   o  Selection of a string of text for inclusion in an error log.

   In the Lookup scheme, the language-range is progressively truncated
   from the end until a matching piece of content is located. For
   example, starting with the range "zh-Hant-CN-x-private", the lookup
   would progressively search for content as shown below:

   Range to match: zh-Hant-CN-x-private
   1. zh-Hant-CN-x-private
   2. zh-Hant-CN
   3. zh-Hant
   4. zh
   5. (default content or the empty tag)

             Figure 7: Example of a Lookup Fallback Pattern

   This scheme allows some flexibility in finding content. It also
   typically provides better results when data is not available at a
   specific level of tag granularity or is sparsely populated (than if
   the default language for the system or content were used).

   The language range "*" matches any language tag. In the lookup
   scheme, this language range does not convey enough information to
   determine which content is most appropriate. If this language range
   is the only one in the language priority list, it matches the default
   content. If this language range is followed by other language
   ranges, it should be skipped.

   When performing lookup using a language priority list, the
   progressive search MUST proceed to consider each language range
   before finding the default content or empty tag. The default content
   might be content with no language tag (or with an empty value, as
   with xml:lang in the XML specification), or it might be a particular
   language designated for that bit of content.
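
   The fallback behavior described above might be sketched as follows.
   This is an illustration under stated assumptions rather than a
   normative algorithm: the names fallback_sequence and lookup are
   hypothetical, the available content is modeled as a plain collection
   of language tags, and a single-character subtag left at the end
   after truncation (such as "x") is dropped together with the subtag
   that followed it, which is inferred from Figure 7.

```python
def fallback_sequence(language_range):
    """Yield the range, then progressively shorter prefixes of it."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # Assumption drawn from Figure 7: a trailing singleton such as
        # "x" is not searched for on its own, so drop it as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(priority_list, available_tags, default=None):
    """Return the single best available tag, or the default."""
    # Compare case-insensitively, but return the tag as it was supplied.
    by_key = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            # "*" carries no preference: skip it here and let the
            # default apply if nothing else in the list matches.
            continue
        for candidate in fallback_sequence(language_range):
            match = by_key.get(candidate.lower())
            if match is not None:
                return match
    return default


print(list(fallback_sequence("zh-Hant-CN-x-private")))
# ['zh-Hant-CN-x-private', 'zh-Hant-CN', 'zh-Hant', 'zh']

print(lookup(["de-CH"], ["de-CH-1996", "de"]))  # de
```

   Every range in the priority list is exhausted before the default is
   returned, so a lower-priority range that matches always wins over
   the default content.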

   One common way to provide for default content is to allow a specific
   language range to be set as the default for a specific type of
   request. This language range is then treated as if it were appended
   to the end of the language priority list, rather than after each item
   in the language priority list.

   For example, if a particular user's language priority list were
   "fr-FR; zh-Hant" and the program doing the matching had a default
   language range of "ja-JP", the program would search for content as
   follows:
   1. fr-FR
   2. fr
   3. zh-Hant // next language
   4. zh
   5. (return default content)
      a. ja-JP
      b. ja
      c. (empty tag or other default content)

             Figure 8: Lookup Using a Language Priority List

   In some cases, the language priority list might contain one or more
   extended language ranges (as, for example, when the same language
   priority list is used as input for both lookup and filtering
   operations). Wildcard values in an extended language range are
   supposed to match any value that occurs in that position in a
   language tag. Since only one item can be returned for any given
   lookup request, the wildcards must be processed in a predictable
   manner (or the same request might produce widely varying results).
   Thus, for each range in the language priority list, the following
   rules must be applied to produce a basic language range for use in
   the fallback mechanism:

   1. If the first subtag in the extended language range is a "*" then
      the entire range is converted to "*".

   2. For each subsequent subtag, if the value is a "*" then that
      subtag and its preceding hyphen are removed.

   For example:

   *-US       becomes  *
   en-*-US    becomes  en-US
   en-Latn-*  becomes  en-Latn

          Figure 9: Transformation of Extended Language Ranges

   For the language priority list "*-US; fr-*-FR; zh-Hant", the fallback
   pattern would be:
   1. * (skipped)
   2. fr-FR
   3. fr
   4. zh-Hant
   5. zh
   6. (default content)

           Figure 10: Extended Language Range Fallback Example

4. Other Considerations

   When working with language ranges and matching schemes, there are
   some additional points that may influence the choice of either.

4.1. Meaning of Language Tags and Ranges

   Selecting content using language ranges requires some understanding
   by users of what they are selecting. A language tag or range
   identifies a language as spoken (or written, signed or otherwise
   signaled) by human beings for communication of information to other
   human beings.

   If a language tag B contains language tag A as a prefix, then B is
   typically "narrower" or "more specific" than A. For example,
   "zh-Hant-TW" is more specific than "zh-Hant".

   This relationship is not guaranteed in all cases: specifically,
   languages that begin with the same sequence of subtags are NOT
   guaranteed to be mutually intelligible, although they might be.

   For example, the tag "az" shares a prefix with both "az-Latn"
   (Azerbaijani written using the Latin script) and "az-Arab"
   (Azerbaijani written using the Arabic script). A person fluent in
   one script might not be able to read the other, even though the text
   might be otherwise identical. Content tagged as "az" most probably
   is written in just one script and thus might not be intelligible to a
   reader familiar with the other script.

   Variant subtags in particular seem to represent specific divisions in
   mutual understanding, since they often encode dialects or other
   idiosyncratic variations within a language.

   The relationship between the language tag and the information it
   relates to is defined by the standard describing the context in which
   it appears.
   Accordingly, this section can only give possible
   examples of its usage:

   o  For a single information object, the associated language tags
      might be interpreted as the set of languages that are necessary
      for a complete comprehension of the complete object. Example:
      Plain text documents.

   o  For an aggregation of information objects, the associated language
      tags could be taken as the set of languages used inside components
      of that aggregation. Examples: Document stores and libraries.

   o  For information objects whose purpose is to provide alternatives,
      the associated language tags could be regarded as a hint that the
      content is provided in several languages, and that one has to
      inspect each of the alternatives in order to find its language or
      languages. In this case, the presence of multiple tags might not
      mean that one needs to be multi-lingual to get complete
      understanding of the document. Example: MIME
      multipart/alternative.

   o  In markup languages, such as HTML and XML, language information
      can be added to each part of the document identified by the markup
      structure (including the whole document itself). For example, one
      could write <span lang="fr">C'est la vie.</span> inside a
      Norwegian document; the Norwegian-speaking user could then access
      a French-Norwegian dictionary to find out what the marked section
      meant. If the user were listening to that document through a
      speech synthesis interface, this information could be used to signal
      the synthesizer to appropriately apply French text-to-speech
      pronunciation rules to that span of text, instead of misapplying
      the Norwegian rules.

4.2. Considerations for Private Use Subtags

   Private-use subtags require private agreement between the parties
   that intend to use or exchange language tags that use them and great
   caution SHOULD be used in employing them in content or protocols
   intended for general use.
   Private-use subtags are simply useless for
   information exchange without prior arrangement.

   The value and semantic meaning of private-use tags and of the subtags
   used within such a language tag are not defined. Matching
   private-use tags using language ranges or extended language ranges
   can result in unpredictable content being returned.

4.3. Length Considerations in Matching

   RFC 3066 [RFC3066] did not provide an upper limit on the size of
   language tags or ranges. RFC 3066 did define the semantics of
   particular subtags in such a way that most language tags or ranges
   consisted of language and region subtags with a combined total length
   of up to six characters. Larger tags and ranges (in terms of both
   subtags and characters) did exist, however.

   [RFC3066bis] also does not impose a fixed upper limit on the number
   of subtags in a language tag or range (and thus an upper bound on the
   size of either). The syntax in that document suggests that,
   depending on the specific language or range of languages, more
   subtags (and thus characters) are sometimes necessary as a result.

   Length considerations and their impact on the selection and
   processing of tags are described in Section 2.1.1 of that document.

   An application or protocol MAY choose to limit the length of the
   language tags or ranges used in matching. Any such limitation SHOULD
   be clearly documented, and such documentation SHOULD include the
   disposition of any longer tags or ranges (for example, whether an
   error value is generated or the language tag or range is truncated).
   If truncation is permitted it MUST NOT permit a subtag to be divided,
   since this changes the semantics of the subtag being matched and can
   result in false positives or negatives.

   Applications or protocols that restrict storage SHOULD consider the
   impact of tag or range truncation on the resulting matches.
   For example, removing the "*" from the end of an extended language
   range (see Section 2.3) can greatly modify the set of returned
   matches. A protocol that allows tags or ranges to be truncated at an
   arbitrary limit, without giving any indication of what that limit is,
   has the potential for causing harm by changing the meaning of values
   in substantial ways.

   In practice, most tags do not require additional subtags or
   substantially more characters. Additional subtags sometimes add
   useful distinguishing information, but extraneous subtags interfere
   with the meaning, understanding, and especially matching of language
   tags. Since language tags or ranges MAY be truncated by an
   application or protocol that limits storage, when choosing language
   tags or ranges users and applications SHOULD avoid adding subtags
   that add no distinguishing value. In particular, users and
   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
   fields in the registry (defined in Section 3.6 of [RFC3066bis]):
   these fields provide guidance on when specific additional subtags
   SHOULD (and SHOULD NOT) be used.

   Implementations MUST support a limit of at least 33 characters. This
   limit includes at least one subtag of each non-extension, non-private
   use type. When choosing a buffer limit, a length of at least 42
   characters is strongly RECOMMENDED.

   The practical limit on tags or ranges derived solely from registered
   values is 42 characters. Implementations MUST be able to handle tags
   and ranges of this length. Support for tags and ranges of at least
   62 characters in length is RECOMMENDED. Implementations MAY support
   longer values, including matching extensive sets of private-use or
   extension subtags.

   Applications or protocols which have to truncate a tag MUST do so by
   progressively removing subtags along with their preceding "-" from
   the right side of the language tag until the tag is short enough for
   the given buffer. If the resulting tag ends with a single-character
   subtag, that subtag and its preceding "-" MUST also be removed. For
   example:

   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh

                  Figure 11: Example of Tag Truncation

5. IANA Considerations

   This document presents no new or existing considerations for IANA.

6. Changes

   This is the first version of this document.

   The following changes were put into this document since draft-06:

      Changed the document title from the unwieldy "Matching Tags for
      the Identification of Languages" to "Matching Language Tags" (Ed.)

      Fixed problems with the distance metric filtering scheme
      (Section 3.2.3) examples (in which tags were expanded
      incorrectly). (D.Ewell)

      Moved the sentence "Protocols and specifications SHOULD clearly
      indicate the particular mechanism used in selecting or matching
      language tags." from the introduction (where there should not be
      any normative language) to the start of Section 3. (A.Phillips)

      Created Section 2.4 and moved text there (A.Phillips)

      Modified the examples of closely related subtags in Section 3.2.3
      to show what the examples mean (M.Duerst)

      Various spelling and grammatical fixes (D.Ewell)

7. Security Considerations

   Language ranges used in content negotiation might be used to infer
   the nationality of the sender, and thus identify potential targets
   for surveillance.
   In addition, unique or highly unusual language
   ranges or combinations of language ranges might be used to track a
   specific individual's activities.

   This is a special case of the general problem that anything you send
   is visible to the receiving party. It is useful to be aware that
   such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application or protocol.

8. Character Set Considerations

   The syntax of language tags and language ranges permits only the
   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters
   are present in most character sets, so presentation of language tags
   should not present any character set issues.

9. References

9.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC3066bis]
              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
              Identification of Languages", October 2005.

   [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 4234, October 2005.

9.2. Informative References

   [RFC1766]  Alvestrand, H., "Tags for the Identification of
              Languages", RFC 1766, March 1995.

   [RFC3066]  Alvestrand, H., "Tags for the Identification of
              Languages", BCP 47, RFC 3066, January 2001.

   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
              May 2002.

Appendix A. Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
   which is a precursor to this document, made enormous contributions
   directly or indirectly to this document and are generally responsible
   for the success of language tags.

   The following people (in alphabetical order by family name)
   contributed to this document:

   Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
   Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
   Patton, Randy Presuhn, Eric van der Poel, and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

   For this particular document, John Cowan originated the scheme
   described in Section 3.2.3. Mark Davis originated the scheme
   described in Section 3.3.

Authors' Addresses

   Addison Phillips (editor)
   Quest Software

   Email: addison dot phillips at quest dot com

   Mark Davis (editor)
   IBM

   Email: mark dot davis at ibm dot com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.