Network Working Group                                   A. Phillips, Ed.
Internet-Draft                                            Quest Software
Obsoletes: 3066 (if approved)                              M. Davis, Ed.
Expires: April 10, 2006                                              IBM
                                                         October 7, 2005


           Matching Tags for the Identification of Languages
                      draft-ietf-ltru-matching-05

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 10, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document describes different mechanisms for comparing, matching,
   and evaluating language tags.  Possible algorithms for language
   negotiation and content selection are described.

Table of Contents

   1.  Introduction
   2.  The Language Range
     2.1.  Lists of Language Ranges
     2.2.  Basic Language Range
       2.2.1.  Matching
       2.2.2.  Lookup
     2.3.  Extended Language Range
       2.3.1.  Extended Range Matching
       2.3.2.  Extended Range Lookup
       2.3.3.  Distance Metric Scheme
     2.4.  Meaning of Language Tags and Ranges
     2.5.  Choosing Between Alternate Matching Schemes
     2.6.  Considerations for Private Use Subtags
     2.7.  Length Considerations in Matching
   3.  IANA Considerations
   4.  Changes
   5.  Security Considerations
   6.  Character Set Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1.  Introduction

   Human beings on our planet have, past and present, used a number of
   languages.  There are many reasons why one would want to identify the
   language used when presenting or requesting information.

   Information about a user's language preferences commonly needs to be
   identified so that appropriate processing can be applied.  For
   example, the user's language preferences in a browser can be used to
   select web pages appropriately.  A choice of language preference can
   also be used to select among tools (such as dictionaries) to assist
   in the processing or understanding of content in different languages.

   Given a set of language identifiers, such as those defined in
   [draft-registry], various mechanisms can be envisioned for performing
   language negotiation and tag matching.  The suitability of a
   particular mechanism to a particular application depends on the needs
   of that application.

   This document defines several mechanisms for matching and filtering
   natural language content identified using Language Tags
   [draft-registry].  It also defines the syntax (called a "language
   range") associated with each of these mechanisms for specifying user
   language preferences.

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  The Language Range

   Language Tags [draft-registry] are used to identify the language of
   some information item or content.  Applications that use language
   tags are often faced with the problem of identifying sets of content
   that share certain language attributes.  For example, HTTP 1.1
   [RFC2616] describes language ranges in its discussion of the
   Accept-Language header (Section 14.4), which is used for selecting
   content from servers based on the language of that content.

   When selecting content according to its language, it is useful to
   have a mechanism for identifying sets of language tags that share
   specific attributes.  This allows users to select or filter content
   based on specific requirements.  Such an identifier is called a
   "Language Range".

2.1.  Lists of Language Ranges

   When users specify a language preference they often need to specify a
   prioritized list of language ranges in order to best reflect their
   language requirements for the matching operation.  This is especially
   true for speakers of minority languages.  A speaker of Breton in
   France, for example, may specify "br" followed by "fr", meaning that
   if Breton is available, it is preferred, but otherwise French is the
   best alternative.  It can get more complex: a speaker may wish to
   fall back from Skolt Sami to Northern Sami to Finnish.

   A "Language Priority List" consists of a prioritized or weighted list
   of language ranges.  One well known example of such a list is the
   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
   14.4) and RFC 3282 [RFC3282].  The various matching operations
   described in this document include considerations for using a
   language priority list.

2.2.  Basic Language Range

   A "Basic Language Range" identifies the set of content whose language
   tags begin with the same sequence of subtags.  A basic language range
   is identified by its 'language-range' tag, by adapting the ABNF
   [RFC2234bis] from HTTP/1.1 [RFC2616]:

   language-range = language-tag / "*"
   language-tag   = 1*8alphanum *("-" 1*8alphanum)
   alphanum       = ALPHA / DIGIT

   That is, a language-range has the same syntax as a language-tag or is
   the single character "*".  Basic Language Ranges imply that there is
   a semantic relationship between language tags that share the same
   prefix.  While this is often the case, it is not always true, and
   users should note that the languages identified by the set of tags
   matching a specific language-range may not be mutually intelligible.

   Basic language ranges were originally described in [RFC3066] and HTTP
   1.1 [RFC2616] (where they are referred to as simply a "language
   range").

   Users SHOULD avoid subtags that add no distinguishing value to a
   language range.  For example, script subtags SHOULD NOT be used to
   form a language range with language subtags which have a matching
   Suppress-Script field in their registry record.  Thus the language
   range "en-Latn" is probably inappropriate for most applications
   (because the vast majority of English documents are written in the
   Latin script and thus the 'en' language subtag has a Suppress-Script
   field for 'Latn' in the registry).

   Language tags and thus language ranges are to be treated as case
   insensitive: there exist conventions for the capitalization of some
   of the subtags, but these MUST NOT be taken to carry meaning.
   Matching of language tags to language ranges MUST be done in a case
   insensitive manner.

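   As an illustration only, the syntax and case rules above could be
   checked along the lines of the following Python sketch.  The names
   is_basic_language_range and same_tag are hypothetical and are not
   defined by this document; the pattern assumes that every subtag
   consists of one to eight ASCII letters or digits.

```python
import re

# Assumed reading of the basic language-range syntax: either "*" or one
# or more hyphen-separated subtags of 1 to 8 ASCII letters or digits.
_BASIC_RANGE = re.compile(r"\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_basic_language_range(text: str) -> bool:
    """Return True if text is a syntactically valid basic language range."""
    return _BASIC_RANGE.fullmatch(text) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags or ranges without regard to case."""
    # Tags are restricted to ASCII, so lower() is sufficient here.
    return a.lower() == b.lower()
```

   Lowercasing both sides before comparison is one simple way to meet
   the case-insensitivity requirement; an implementation could equally
   normalize tags once when they are stored.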
   When working with tags and ranges, note that extensions and most
   private use subtags are generally orthogonal to language tag fallback
   and users SHOULD avoid using these subtags in language ranges, since
   they will often interfere with the selection of available language
   content.  Since these subtags are always at the end of the sequence
   of subtags, they don't normally interfere with the use of prefixes
   for matching in the schemes described below.

   There are two matching schemes that are commonly associated with
   basic language ranges: matching and lookup.

   Note that neither matching nor lookup using basic language ranges
   attempts to process the semantics of the tags or ranges in any way.
   The language tag and language range are compared in a case
   insensitive manner using basic string processing.  The choice of
   subtags in both the language tag and the language range can
   therefore affect the results produced.

2.2.1.  Matching

   Language tag matching is used to select all content that matches a
   given prefix.  In matching, the language range represents the least
   specific tag which is an acceptable match and every piece of content
   that matches is returned.  If the language priority list contains
   more than one range, the matches returned are typically ordered in
   descending level of preference.

   For example, if an application is applying a style to all content in
   a document in a particular language, it might use language tag
   matching to select the content to which the style is applied.

   A language-range matches a language-tag if it exactly equals the tag,
   or if it exactly equals a prefix of the tag such that the first
   character following the prefix is "-".  (That is, the language-range
   "de-de" matches the language tag "de-DE-1996", but not the language
   tag "de-Deva".)

   The special range "*" matches any tag.
   A protocol which uses language ranges MAY specify additional rules
   about the semantics of "*"; for instance, HTTP/1.1 specifies that the
   range "*" matches only languages not matched by any other range
   within an "Accept-Language" header.

2.2.2.  Lookup

   Content lookup is used to select the single information item that
   best matches the language priority list for a given request.  In
   lookup, each language range in the language priority list represents
   the most specific tag which is an acceptable match; only the closest
   matching item according to the user's priority is returned.

   For example, if an application inserts some dynamic content into a
   document, returning an empty string if there is no exact match is not
   an option.  Instead, the application "falls back" until it finds a
   suitable piece of content to insert.

   When performing lookup, the language range is progressively truncated
   from the end until a matching piece of content is located.  For
   example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
   would progressively search for content as shown below:

   Range to match: zh-Hant-CN-x-wadegile
   1.  zh-Hant-CN-x-wadegile
   2.  zh-Hant-CN
   3.  zh-Hant
   4.  zh
   5.  (default content or the empty tag)

               Figure 2: Default Fallback Pattern Example

   This scheme allows some flexibility in finding content.  It also
   typically provides better results than falling back directly to the
   default language for the system or content when data is not
   available at a specific level of tag granularity or is sparsely
   populated.

   When performing lookup using a language priority list, the
   progressive search MUST proceed to consider each language range
   before finding the default content or empty tag.  For example, a
   lookup using the list "fr-FR; zh-Hant" would search for content as
   follows:

   1.  fr-FR
   2.  fr
   3.  zh-Hant  // next language
   4.  zh
   5.  (default content or the empty tag)

             Figure 3: Lookup Using a Language Priority List

2.3.  Extended Language Range

   Prefix matching using a Basic Language Range, as described above, is
   not always the most appropriate way to access the information
   contained in language tags when selecting or filtering content.  Some
   applications might wish to define a more granular matching scheme and
   such a matching scheme requires the ability to specify the various
   attributes of a language tag in the language range.  An extended
   language range can be represented by the following ABNF:

   extended-language-range = range          ; a range
                           / privateuse     ; private use tag
                           / grandfathered  ; grandfathered
                                            ; registrations

   range         = (language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse])

   language      = (2*3ALPHA [ extlang ])  ; shortest ISO 639 code
                 / 4ALPHA                  ; reserved for future use
                 / 5*8ALPHA                ; registered language subtag
                 / "*"                     ; ... or wildcard

   extlang       = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
                                           ; reserved for future use
                                           ; wildcard can only appear
                                           ; at the end

   script        = 4ALPHA                  ; ISO 15924 code
                 / "*"                     ; or wildcard

   region        = 2ALPHA                  ; ISO 3166 code
                 / 3DIGIT                  ; UN M.49 code
                 / "*"                     ; ... or wildcard

   variant       = 5*8alphanum             ; registered variants
                 / (DIGIT 3alphanum)       ;
                 / "*"                     ; ... or wildcard

   extension     = singleton *("-" (2*8alphanum)) [ "-*" ]
                                           ; extension subtags
                                           ; wildcard can only appear
                                           ; at the end

   singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
                 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
                 ; Single letters: x/X is reserved for private use

   privateuse    = ("x"/"X") 1*("-" (1*8alphanum))

   grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
                                           ; grandfathered registration
                                           ; Note: i is the only
                                           ; singleton that starts a
                                           ; grandfathered tag

   alphanum      = (ALPHA / DIGIT)         ; letters and numbers

   In an extended language range, the identifier takes the form of a
   series of subtags, each of which must be either a well-formed subtag
   or the special subtag "*".  For example, the language range "en-*-US"
   specifies a primary language of 'en', followed by any script subtag,
   followed by the region subtag 'US'.

   A field not present in the middle of an extended language range MAY
   be treated as if the field contained a "*".  For example, the range
   "en-US" MAY be considered to be equivalent to the range "en-*-US".
   This also means that multiple wildcards can be collapsed (so that
   "en-*-*-US" is equivalent to "en-*-US").

   When working with tags and ranges users SHOULD note the following:

   1.  Private-use and Extension subtags are normally orthogonal to
       language tag fallback.  Implementations SHOULD ignore
       unrecognized private-use and extension subtags when performing
       language tag fallback.  Since these subtags are always at the end
       of the sequence of subtags, they don't normally interfere with
       the use of prefixes for matching in the schemes described below.

   2.  Implementations that choose not to interpret one or more
       private-use or extension subtags SHOULD NOT remove or modify
       these extensions in content that they are processing.
       When a language tag instance is to be used in a specific, known
       protocol, and is not being passed through to other protocols,
       language tags MAY be filtered to remove subtags and extensions
       that are not supported by that protocol.  Such filtering SHOULD
       be avoided, if possible, since it removes information that might
       be relevant if services on the other end of the protocol would
       make use of that information.

   3.  Some applications of language tags might want or need to consider
       extensions and private-use subtags when matching tags.  If
       extensions and private-use subtags are included in a matching or
       filtering process that utilizes one of the schemes described in
       this document, then the implementation SHOULD canonicalize the
       language tags and/or ranges before performing the matching.  Note
       that language tag processors that claim to be "well-formed"
       processors as defined in [draft-registry] generally fall into
       this category.

   There are several matching algorithms or schemes which can be applied
   when matching extended language ranges to language tags.

2.3.1.  Extended Range Matching

   In extended range matching, each extended language range in the
   language priority list is considered in turn, according to priority.
   The subtags in each extended language range are compared to the
   corresponding subtags in the language tag being examined.  The subtag
   from the range is considered to match if it exactly matches the
   corresponding subtag in the tag or the range's subtag has the value
   "*" (which matches all subtags, including the empty subtag).
   Extended Range Matching is an extension of basic matching
   (Section 2.2.1): the language range represents the least specific tag
   which is an acceptable match.

   Private use subtags MAY be specified in the language range and MUST
   NOT be ignored when matching.

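   One way to realize this comparison without classifying each subtag
   by field is the subtag-walking procedure that was later published
   as "Extended Filtering" in RFC 4647, Section 3.3.2.  The Python
   sketch below follows that procedure rather than the field-by-field
   description in this draft, so it should be read as a related
   technique under that assumption, not as this document's algorithm.
   The function name extended_range_matches is hypothetical.

```python
def extended_range_matches(language_range: str, tag: str) -> bool:
    """Return True if tag is matched by the extended language range.

    Follows the extended filtering steps of RFC 4647, Section 3.3.2,
    which is an assumption: this draft describes the comparison in
    terms of corresponding fields instead.
    """
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must agree unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    position = 1  # index of the next unexamined subtag in the tag
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard places no constraint of its own
        while True:
            if position >= len(tag_subtags):
                return False  # the tag ran out before the range did
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break  # found; move on to the next range subtag
            if len(current) == 1:
                # A singleton (including "x") starts an extension or a
                # private use sequence, which is never skipped over.
                return False
    return True
```

   With this sketch, the range "en-*-US" accepts "en-US" and
   "en-Latn-US" but rejects "en-Latn-GB" and "en-x-US".  Trailing
   subtags in the tag are accepted, which corresponds to treating
   unspecified fields at the end of the range as "*".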
   Subtags not specified, including those at the end of the language
   range, are assigned the value "*".  This makes each range into a
   prefix much like that used in basic language range matching.  For
   example, the extended language range "zh-*-CN" matches all of the
   following tags because the unspecified variant field is expanded to
   "*":

      zh-Hant-CN
      zh-CN
      zh-Hans-CN
      zh-CN-x-wadegile
      zh-Latn-CN-boont
      zh-cmn-Hans-CN-x-wadegile

2.3.2.  Extended Range Lookup

   In extended range lookup, each extended language range in the
   language priority list is considered in turn.  The subtags in each
   extended language range are compared to the corresponding subtags in
   the language tag being examined.  A subtag is considered to match if
   it exactly matches the corresponding subtag in the tag or the range's
   subtag has the value "*" (which matches all subtags, including the
   empty subtag).  Extended language range lookup is an extension of
   basic lookup (Section 2.2.2): each language range represents the most
   specific tag which will form an acceptable match.  If no match is
   found, the default content or content with the empty language tag is
   usually returned (or the search can be considered to have failed).

   Subtags not specified are assigned the value "*" prior to performing
   tag matching.  Unlike in extended range matching, however, fields at
   the end of the range MUST NOT be expanded in this manner.  For
   example, "en-US" MUST NOT be considered to be the same as the range
   "en-US-*".  This allows ranges to be specific.  The "*" wildcard MUST
   be used at the end of the range to indicate that all tags with the
   range as a prefix are allowable matches.  That is, the range "zh-*"
   matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
   matches neither of those tags.

   The wildcard "*" at the end of a range SHOULD be considered to match
   any private use subtag sequences (making extended language range
   lookup function exactly like extended range matching, Section 2.3.1).

   By default all extensions and their subtags SHOULD be ignored for
   extended language range lookup.  Private use subtags MAY be specified
   in the language range and MUST NOT be ignored when performing lookup.
   The wildcard "*" at the end of a range SHOULD be considered to match
   any private use subtag sequences in addition to variants.

   For example, the range "*-US" matches all of the following tags:

      en-US
      en-Latn-US
      en-US-r-extends  (extensions are ignored)
      fr-US

   For example, the range "en-*-US" matches _none_ of the following
   tags:

      fr-US
      en  (missing region US)
      en-Latn  (missing region US)
      en-Latn-US-scouse  (variant field is present)

   For example, the range "en-*" matches all of the following tags:

      en-Latn
      en-Latn-US
      en-Latn-US-scouse
      en-US
      en-scouse

   Note that the ability to be specific in extended range lookup can
   make this matching scheme a more appropriate replacement for basic
   matching than the extended range matching scheme.

2.3.3.  Distance Metric Scheme

   Both Basic and Extended Language Ranges produce simple boolean
   matches.  Some applications may benefit by providing an array of
   results with different levels of matching, for example, sorting
   results based on the overall "quality" of the match.

   This type of matching is sometimes called a "distance metric".  A
   distance metric assigns a pair of language tags a numeric value
   representing the 'distance' between the two.  A distance of zero
   means that they are identical, a small distance indicates that they
   are very similar, and a large distance indicates that they are very
   different.
   Using a distance metric, implementations can, for example, allow
   users to select a threshold distance for a match to be successful or
   a filter to be applied.

   The first step in the process is to normalize the extended language
   range and the language tags to be matched to it by canonicalizing
   them, mapping grandfathered and obsolete tags into modern
   equivalents.

   The language range and the language tags are then transformed into
   quintuples of elements of the form (language, script, region,
   variant, extension).  Any extended language subtags are considered
   part of the language element; private use subtag sequences are
   considered part of the language element if in the initial position in
   the tag and part of the variant element if not.  Language subtags
   'und', 'mul', and the script subtag 'Zyyy' are converted to "*".

   Missing components in the language-tag are set to "*"; thus a "*"
   pattern becomes the quintuple ("*", "*", "*", "*", "*").  Missing
   components in the extended language-range are handled similarly to
   extended range lookup: missing internal subtags are expanded to "*".
   Missing end subtags are expanded as the empty string.  Thus a pattern
   "en-US" becomes the quintuple ("en","*","US","","").

   Here are some examples of language-tags and their quintuples:

      en-US             ("en","*","US","*","*")
      sr-Latn           ("sr","Latn","*","*","*")
      zh-cmn-Hant       ("zh-cmn","Hant","*","*","*")
      x-foo             ("x-foo","*","*","*","*")
      en-x-foo          ("en","*","*","x-foo","*")
      i-default         ("i-default","*","*","*","*")
      sl-Latn-IT-rozaj  ("sl","Latn","IT","rozaj","*")
      zh-r-wadegile     ("zh","*","*","*","r-wadegile") // hypothetical

   Each language-range/language-tag pair being matched or filtered is
   assigned a distance value, whereby small values indicate better
   matches and large values indicate worse ones.
   The distance between the pair is the sum of the distances for each
   of the corresponding elements of the quintuple.  If the elements are
   identical or one is '*', then the distance value between them is
   zero.  Otherwise, it is given by the following table:

      256   language mismatch
      128   script mismatch
       32   region mismatch
        4   variant mismatch
        1   extension mismatch

   A value of 0 is a perfect match; 421 (the sum of all five values) is
   no match at all.  Different threshold values might be appropriate for
   different applications and implementations will probably allow users
   to choose the most appropriate selection value, ranking the
   selections based on score.

   Examples of various tags' distances from the range "en-US", whose
   quintuple is ("en","*","US","",""):

      "fr"                256  (language mismatch, region match)
      "en-GB"              32  (region mismatch)
      "en-Latn-US"          0  (all fields match)
      "en-Brai"             0  (script and region each compare to "*")
      "en-US-x-foo"         4  (variant mismatch: range is the empty
                               string)
      "en-US-r-wadegile"    1  (extension mismatch: range is the empty
                               string)

   Implementations may want to use more sophisticated weights that
   depend on the values of the corresponding elements.  For example,
   depending on the domain, an implementation might give a small
   distance to the difference between the language subtag 'no' and the
   closely related language subtags 'nb' or 'nn'; or between the script
   subtags 'Kata' and 'Hira'; or between the region subtags 'US' and
   'UM'.

2.4.  Meaning of Language Tags and Ranges

   A language tag defines a language as spoken (or written, signed or
   otherwise signaled) by human beings for communication of information
   to other human beings.

   If a language tag B contains language tag A as a prefix, then B is
   typically "narrower" or "more specific" than A.  For example,
   "zh-Hant-TW" is more specific than "zh-Hant".

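   The prefix relationship described here is the same test that basic
   matching applies (Section 2.2.1): the shorter tag must equal the
   longer one or be followed in it by "-".  A minimal Python sketch,
   using the hypothetical function name basic_range_matches:

```python
def basic_range_matches(language_range: str, tag: str) -> bool:
    """Return True if the basic language range matches the tag.

    A range matches when it is "*", when it equals the tag, or when it
    is a prefix of the tag that ends exactly at a "-" boundary.  The
    comparison ignores case.
    """
    if language_range == "*":
        return True
    prefix = language_range.lower()
    candidate = tag.lower()
    return candidate == prefix or candidate.startswith(prefix + "-")
```

   Requiring the "-" boundary is what keeps "de-de" from matching
   "de-Deva" even though the first five characters agree when case is
   ignored.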
   This relationship is not guaranteed in all cases: specifically,
   languages that begin with the same sequence of subtags are NOT
   guaranteed to be mutually intelligible, although they might be.

   For example, the tag "az" shares a prefix with both "az-Latn"
   (Azerbaijani written using the Latin script) and "az-Cyrl"
   (Azerbaijani written using the Cyrillic script).  A person fluent in
   one script might not be able to read the other, even though the text
   might be otherwise identical.  Content tagged as "az" most probably
   is written in just one script and thus might not be intelligible to a
   reader familiar with the other script.

   Variant subtags in particular seem to represent specific divisions in
   mutual understanding, since they often encode dialects or other
   idiosyncratic variations within a language.

   The relationship between the language tag and the information it
   relates to is defined by the standard describing the context in which
   it appears.  Accordingly, this section can only give possible
   examples of its usage.

   o  For a single information object, the associated language tags
      might be interpreted as the set of languages that are necessary
      for a complete comprehension of the complete object.  Example:
      Plain text documents.

   o  For an aggregation of information objects, the associated language
      tags could be taken as the set of languages used inside components
      of that aggregation.  Examples: Document stores and libraries.

   o  For information objects whose purpose is to provide alternatives,
      the associated language tags could be regarded as a hint that the
      content is provided in several languages, and that one has to
      inspect each of the alternatives in order to find its language or
      languages.  In this case, the presence of multiple tags might not
      mean that one needs to be multi-lingual to get complete
      understanding of the document.
      Example: MIME multipart/alternative.

   o  In markup languages, such as HTML and XML, language information
      can be added to each part of the document identified by the markup
      structure (including the whole document itself). For example, one
      could write <span lang="fr">C'est la vie.</span> inside a
      Norwegian document; the Norwegian-speaking user could then access
      a French-Norwegian dictionary to find out what the marked section
      meant. If the user were listening to that document through a
      speech synthesis interface, this information could be used to
      signal the synthesizer to appropriately apply French
      text-to-speech pronunciation rules to that span of text, instead
      of misapplying the Norwegian rules.

2.5. Choosing Between Alternate Matching Schemes

   Implementers are faced with the decision of what form of matching to
   use in a specific application. An application can choose to
   implement different styles of matching for different kinds of
   processing.

   The most basic choice is between schemes that produce an open-ended
   set of content (a "matching" application) and those that usually
   produce a single information item (a "lookup" application). Note
   that lookup applications can produce multiple items, but usually only
   a single item for any given piece of content, and they can be used to
   order content (the later in the overall fallback that the content
   appears to match, the more distant the match).

   Matching applications can produce an ordered or unordered set of
   results. For example, applying formatting to a document based on the
   language of specific pieces of content does not require the content
   to be ordered. It is sufficient to know whether a specific piece of
   content matches or does not match. A search application, on the
   other hand, probably would put the results into a priority order.
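
   The difference between the two kinds of application can be sketched
   as follows. This is only an illustration in Python under stated
   assumptions: the function names filter_tags and lookup_tag are
   hypothetical, ranges are treated as plain prefixes compared
   case-insensitively on subtag boundaries, and the lookup fallback
   simply drops trailing subtags from the range until something
   matches. It does not reproduce the full rules of any scheme defined
   in this document.

```python
from typing import Iterable, List, Optional


def _parts(tag: str) -> List[str]:
    # Split a tag or range into lowercase subtags.
    return tag.lower().split("-")


def filter_tags(lang_range: str, tags: Iterable[str]) -> List[str]:
    """'Matching' style: return every tag that the range is a prefix of."""
    r = _parts(lang_range)
    return [t for t in tags if _parts(t)[:len(r)] == r]


def lookup_tag(lang_range: str, tags: Iterable[str]) -> Optional[str]:
    """'Lookup' style: return at most one tag, the one equal to the range
    or to the longest truncation of the range (assumed fallback rule)."""
    available = {t.lower(): t for t in tags}
    r = _parts(lang_range)
    while r:
        candidate = "-".join(r)
        if candidate in available:
            return available[candidate]
        r = r[:-1]  # fall back by dropping the last subtag
    return None
```

   Given the content tags "de", "de-CH", "de-AT-1996", and "fr", the
   range "de" filters to the three German tags, whereas a lookup for
   "de-CH-1996" falls back to the single tag "de-CH".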

   If a single item is to be chosen, it may sometimes be useful to
   apply additional information, such as the most likely script used in
   the language or region in question or the script used by other
   content selected, in order to make a more "informed" choice.

   The matching schemes in this document are designed so that
   implementations do not have to examine the values of the subtags
   supplied and, except for scored matching, they do not need access to
   the Language Subtag Registry nor do they require the use of valid
   subtags in language tags or ranges. This has great benefit for speed
   and simplicity of implementation.

   Implementations might also wish to use semantic information external
   to the language tags when performing fallback. For example, the
   primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
   Norwegian) might both be usefully matched to the more general subtag
   'no' (Norwegian). Or an application might infer that content labeled
   "zh-CN" is more likely to match the range "zh-Hans" than equivalent
   content labeled "zh-TW".

2.6. Considerations for Private Use Subtags

   Private-use subtags require private agreement between the parties
   that intend to use or exchange language tags that use them, and
   great caution SHOULD be used in employing them in content or
   protocols intended for general use. Private-use subtags are simply
   useless for information exchange without prior arrangement.

   The value and semantic meaning of private-use tags and of the subtags
   used within such a language tag are not defined. Matching private
   use tags using language ranges or extended language ranges can result
   in unpredictable content being returned.

2.7. Length Considerations in Matching

   RFC 3066 [RFC3066] did not provide an upper limit on the size of
   language tags or ranges.
   RFC 3066 did define the semantics of particular subtags in such a
   way that most language tags or ranges consisted of language and
   region subtags with a combined total length of up to six characters.
   Larger tags and ranges (in terms of both subtags and characters) did
   exist, however.

   [draft-registry] also does not impose a fixed upper limit on the
   number of subtags in a language tag or range (and thus an upper bound
   on the size of either). The syntax in that document suggests that,
   depending on the specific language or range of languages, more
   subtags (and thus characters) are sometimes necessary as a result.
   Length considerations and their impact on the selection and
   processing of tags are described in Section 2.1.1 of that document.

   A matching implementation MAY choose to limit the length of the
   language tags or ranges used in matching. Any such limitation SHOULD
   be clearly documented, and such documentation SHOULD include the
   disposition of any longer tags or ranges (for example, whether an
   error value is generated or the language tag or range is truncated).
   If truncation is permitted it MUST NOT permit a subtag to be divided,
   since this changes the semantics of the subtag being matched and can
   result in false positives or negatives.

   Implementations that restrict storage SHOULD consider the impact of
   tag or range truncation on the resulting matches. For example,
   removing the "*" from the end of an extended language range (see
   Section 2.3) can greatly modify the set of returned matches. A
   protocol that allows tags or ranges to be truncated at an arbitrary
   limit, without giving any indication of what that limit is, has the
   potential for causing harm by changing the meaning of values in
   substantial ways.

   In practice, most tags do not require additional subtags or
   substantially more characters.
   Additional subtags sometimes add useful distinguishing information,
   but extraneous subtags interfere with the meaning, understanding,
   and especially matching of language tags. Since language tags or
   ranges MAY be truncated by an application or protocol that limits
   storage, when choosing language tags or ranges, users and
   applications SHOULD avoid adding subtags that add no distinguishing
   value. In particular, users and implementations SHOULD follow the
   'Prefix' and 'Suppress-Script' fields in the registry (defined in
   Section 3.6 of [draft-registry]): these fields provide guidance on
   when specific additional subtags SHOULD (and SHOULD NOT) be used.

   Implementations MUST support a limit of at least 33 characters. This
   limit includes at least one subtag of each non-extension, non-private
   use type. When choosing a buffer limit, a length of at least 42
   characters is strongly RECOMMENDED.

   The practical limit on tags or ranges derived solely from registered
   values is 42 characters. Implementations MUST be able to handle tags
   and ranges of this length. Support for tags and ranges of at least
   62 characters in length is RECOMMENDED. Implementations MAY support
   longer values, including matching extensive sets of private use or
   extension subtags.

   Applications or protocols that have to truncate a tag MUST do so by
   progressively removing subtags along with their preceding "-" from
   the right side of the language tag until the tag is short enough for
   the given buffer. If the resulting tag ends with a single-character
   subtag, that subtag and its preceding "-" MUST also be removed. For
   example:

   Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Hant-CN-variant1-a-extend1-x-wadegile
   2. zh-Hant-CN-variant1-a-extend1
   3. zh-Hant-CN-variant1
   4. zh-Hant-CN
   5. zh-Hant
   6. zh

                 Figure 7: Example of Tag Truncation

3. IANA Considerations

   This document presents no new or existing considerations for IANA.

4. Changes

   This is the first version of this document.

   The following changes were put into this document since draft-03:

      Modified the ABNF to match changes in [draft-registry]
      (K.Karlsson)

      Matched the references and reference formats to [draft-registry]
      (K.Karlsson)

      Various edits, additions, and emendations to deal with changes in
      the Last Call of draft-registry as well as cleaning up the text.

5. Security Considerations

   Language ranges used in content negotiation might be used to infer
   the nationality of the sender, and thus identify potential targets
   for surveillance. In addition, unique or highly unusual language
   ranges or combinations of language ranges might be used to track a
   specific individual's activities.

   This is a special case of the general problem that anything you send
   is visible to the receiving party. It is useful to be aware that
   such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application protocol.

6. Character Set Considerations

   The syntax of language tags and language ranges permits only the
   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters
   are present in most character sets, so presentation of language tags
   should not present any character set issues.

7. References

7.1. Normative References

   [ID.ietf-ltru-initial]
              Ewell, D., Ed., "Language Tags Initial Registry (work in
              progress)", August 2005.

   [RFC1327]  Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO
              10021 and RFC 822", RFC 1327, May 1992.

   [RFC1521]  Borenstein, N. and N. Freed, "MIME (Multipurpose Internet
              Mail Extensions) Part One: Mechanisms for Specifying and
              Describing the Format of Internet Message Bodies",
              RFC 1521, September 1993.

   [RFC2028]  Hovey, R. and S. Bradner, "The Organizations Involved in
              the IETF Standards Process", BCP 11, RFC 2028,
              October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2231]  Freed, N. and K. Moore, "MIME Parameter Value and Encoded
              Word Extensions: Character Sets, Languages, and
              Continuations", RFC 2231, November 1997.

   [RFC2234bis]
              Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", draft-crocker-abnf-rfc2234bis-00
              (work in progress), March 2005.

   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

   [RFC2434]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 2434,
              October 1998.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC2860]  Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
              Understanding Concerning the Technical Work of the
              Internet Assigned Numbers Authority", RFC 2860, June 2000.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [draft-registry]
              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
              Identification of Languages (work in progress)",
              August 2005.

7.2. Informative References

   [ISO15924]
              "ISO 15924:2004. Information and documentation -- Codes
              for the representation of names of scripts", January 2004.

   [ISO3166-1]
              "ISO 3166-1:1997. Codes for the representation of names of
              countries and their subdivisions -- Part 1: Country
              codes", 1997.

   [ISO639-1]
              "ISO 639-1:2002. Codes for the representation of names of
              languages -- Part 1: Alpha-2 code", 2002.

   [ISO639-2]
              "ISO 639-2:1998. Codes for the representation of names of
              languages -- Part 2: Alpha-3 code, first edition", 1998.

   [RFC1766]  Alvestrand, H., "Tags for the Identification of
              Languages", RFC 1766, March 1995.

   [RFC3066]  Alvestrand, H., "Tags for the Identification of
              Languages", BCP 47, RFC 3066, January 2001.

   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
              May 2002.

   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
              Timestamps", RFC 3339, July 2002.

   [UN_M.49]  Statistics Division, United Nations, "Standard Country or
              Area Codes for Statistical Use", UN Standard Country or
              Area Codes for Statistical Use, Revision 4 (United Nations
              publication, Sales No. 98.XVII.9), June 1999.

Appendix A. Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to [draft-registry], [RFC3066] and [RFC1766], each
   of which is a precursor to this document, made enormous contributions
   directly or indirectly to this document and are generally responsible
   for the success of language tags.

   The following people (in alphabetical order by family name)
   contributed to this document:

   Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Kent
   Karlsson, Ira McDonald, M. Patton, Randy Presuhn and many, many
   others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

   For this particular document, John Cowan originated the scheme
   described in Section 2.3.3. Mark Davis originated the scheme
   described in Section 2.2.2.

Authors' Addresses

   Addison Phillips (editor)
   Quest Software

   Email: addison dot phillips at quest dot com

   Mark Davis (editor)
   IBM

   Email: mark dot davis at ibm dot com

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.