idnits 2.17.1 draft-ietf-ltru-matching-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 762. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 739. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 746. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 752. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 169 has weird spacing: '...schemes that ...' == Line 170 has weird spacing: '...ing and looku...' == Line 374 has weird spacing: '...age tag being...' == Line 467 has weird spacing: '...ch that imple...' 
== The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 28, 2005) is 6877 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'RFC1327' is defined on line 618, but no explicit reference was found in the text == Unused Reference: 'RFC1521' is defined on line 621, but no explicit reference was found in the text == Unused Reference: 'RFC2028' is defined on line 626, but no explicit reference was found in the text == Unused Reference: 'RFC2231' is defined on line 633, but no explicit reference was found in the text == Unused Reference: 'RFC2234' is defined on line 637, but no explicit reference was found in the text == Unused Reference: 'RFC2396' is defined on line 640, but no explicit reference was found in the text == Unused Reference: 'RFC2434' is defined on line 644, but no explicit reference was found in the text == Unused Reference: 'RFC2860' is defined on line 652, but no explicit reference was found in the text == Unused Reference: 'RFC3629' is defined on line 656, but no explicit reference was found in the text == Unused Reference: 'ISO639-1' is defined on line 661, but no explicit 
reference was found in the text == Unused Reference: 'ISO639-2' is defined on line 666, but no explicit reference was found in the text == Unused Reference: 'ISO15924' is defined on line 672, but no explicit reference was found in the text == Unused Reference: 'ISO3166' is defined on line 676, but no explicit reference was found in the text == Unused Reference: 'RFC3339' is defined on line 691, but no explicit reference was found in the text == Outdated reference: A later version (-14) exists of draft-ietf-ltru-registry-07 ** Obsolete normative reference: RFC 1327 (Obsoleted by RFC 2156) ** Obsolete normative reference: RFC 1521 (Obsoleted by RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049) ** Obsolete normative reference: RFC 2028 (Obsoleted by RFC 9281) ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234) ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986) ** Obsolete normative reference: RFC 2434 (Obsoleted by RFC 5226) ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Downref: Normative reference to an Informational RFC: RFC 2860 -- Obsolete informational reference (is this intentional?): RFC 1766 (Obsoleted by RFC 3066, RFC 3282) -- Obsolete informational reference (is this intentional?): RFC 3066 (Obsoleted by RFC 4646, RFC 4647) Summary: 11 errors (**), 0 flaws (~~), 22 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group A. Phillips, Ed. 3 Internet-Draft Quest Software 4 Expires: December 30, 2005 M. Davis, Ed. 
5 IBM 6 June 28, 2005 8 Matching Tags for the Identification of Languages 9 draft-ietf-ltru-matching-03 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on December 30, 2005. 36 Copyright Notice 38 Copyright (C) The Internet Society (2005). 40 Abstract 42 This document describes different mechanisms for comparing, matching, 43 and evaluating language tags. Possible algorithms for language 44 negotiation and content selection are described. 46 Table of Contents 48 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 49 2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4 50 2.1 Basic Language Range . . . . . . . . . . . . . . . . . . . 4 51 2.1.1 Matching . . . . . . . . . . . . . . . . . . . . . . . 5 52 2.1.2 Lookup . . . . . . . . . . . . . . . . . . . . . . . . 6 53 2.2 Extended Language Range . . . . . . . . . . . . . . . . . 6 54 2.2.1 Extended Range Matching . . . . . . . . . . . . . . . 7 55 2.2.2 Extended Range Lookup . . . . . . . . . . . . . . . . 
8 56 2.2.3 Scored Matching . . . . . . . . . . . . . . . . . . . 9 57 2.3 Meaning of Language Tags and Ranges . . . . . . . . . . . 10 58 2.4 Choosing Between Alternate Matching Schemes . . . . . . . 11 59 2.5 Considerations for Private Use Subtags . . . . . . . . . . 12 60 2.6 Length Considerations in Matching . . . . . . . . . . . . 12 61 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 62 4. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 63 5. Security Considerations . . . . . . . . . . . . . . . . . . . 16 64 6. Character Set Considerations . . . . . . . . . . . . . . . . . 17 65 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 66 7.1 Normative References . . . . . . . . . . . . . . . . . . . 18 67 7.2 Informative References . . . . . . . . . . . . . . . . . . 19 68 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 19 69 A. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 21 70 Intellectual Property and Copyright Statements . . . . . . . . 22 72 1. Introduction 74 Human beings on our planet have, past and present, used a number of 75 languages. There are many reasons why one would want to identify the 76 language used when presenting or requesting information. 78 Information about a user's language preferences commonly needs to be 79 identified so that appropriate processing can be applied. For 80 example, the user's language preferences in a browser can be used to 81 select web pages appropriately. A choice of language preference can 82 also be used to select among tools (such as dictionaries) to assist 83 in the processing or understanding of content in different languages. 85 Given a set of language identifiers, such as those defined in 86 [ID.ietf-ltru-registry], various mechanisms can be envisioned for 87 performing language negotiation and tag matching. The suitability of 88 a particular mechanism to a particular application depends on the 89 needs of that application. 
91 This document defines language ranges and syntax for specifying user 92 preferences in a request for language content. It also specifies 93 various schemes and mechanisms that can be used with language ranges 94 when matching or filtering content based on language tags. 96 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 97 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 98 document are to be interpreted as described in [RFC2119]. 100 2. The Language Range 102 Language Tags are used to identify the language of some information 103 item or content. Applications that use language tags are often faced 104 with the problem of identifying sets of content that share certain 105 language attributes. For example, HTTP 1.1 [RFC2616] describes 106 language ranges in its discussion of the Accept-Language header 107 (Section 14.4), which is used for selecting content from servers 108 based on the language of that content. 110 When selecting content according to its language, it is useful to 111 have a mechanism for identifying sets of language tags that share 112 specific attributes. This allows users to select or filter content 113 based on specific requirements. Such an identifier is called a 114 "Language Range". 116 2.1 Basic Language Range 118 A basic language range (such as described in [RFC3066] and HTTP 1.1 119 [RFC2616]) is a set of languages whose tags all begin with the same 120 sequence of subtags. A basic language range can be represented by a 121 'language-range' tag, by using the definition from HTTP/1.1 [RFC2616] 122 : 123 language-range = language-tag / "*" 125 That is, a language-range has the same syntax as a language-tag or is 126 the single character "*". This definition of language-range implies 127 that there is a semantic relationship between tags that share the 128 same prefix. 130 In particular, the set of language tags that match a specific 131 language-range might not all be mutually intelligible. 
The use of a 132 prefix when matching tags to language ranges does not imply that 133 language tags are assigned to languages in such a way that it is 134 always true that if a user understands a language with a certain tag, 135 then this user will also understand all languages with tags for which 136 this tag is a prefix. The prefix rule simply allows the use of 137 prefix tags if this is the case. 139 When working with tags and ranges you SHOULD also note the following: 141 1. Private-use and Extension subtags are normally orthogonal to 142 language tag fallback. Implementations SHOULD ignore 143 unrecognized private-use and extension subtags when performing 144 language tag fallback. Since these subtags are always at the end 145 of the sequence of subtags, they don't normally interfere with 146 the use of prefixes for matching in the schemes described below. 148 2. Implementations that choose not to interpret one or more private- 149 use or extension subtags SHOULD NOT remove or modify these 150 extensions in content that they are processing. When a language 151 tag instance is to be used in a specific, known protocol, and is 152 not being passed through to other protocols, language tags MAY be 153 filtered to remove subtags and extensions that are not supported 154 by that protocol. Such filtering SHOULD be avoided, if possible, 155 since it removes information that might be relevant if services 156 on the other end of the protocol would make use of that 157 information. 159 3. Some applications of language tags might want or need to consider 160 extensions and private-use subtags when matching tags. If 161 extensions and private-use subtags are included in a matching or 162 filtering process that utilizes one of the schemes described 163 in this document, then the implementation SHOULD canonicalize the 164 language tags and/or ranges before performing the matching. 
Note 165 that language tag processors that claim to be "well-formed" 166 processors as defined in [ID.ietf-ltru-registry] generally fall 167 into this category. 169 There are two matching schemes that are commonly associated with 170 basic language ranges: matching and lookup. 172 2.1.1 Matching 174 Language tag matching is used to select all content that matches a 175 given prefix. In matching, the language range represents the least 176 specific tag which is an acceptable match and every piece of content 177 that matches is returned. 179 For example, if an application is applying a style to all content in 180 a web page in a particular language, it might use language tag 181 matching to select the content to which the style is applied. 183 A language-range matches a language-tag if it exactly equals the tag, 184 or if it exactly equals a prefix of the tag such that the first 185 character following the prefix is "-". (That is, the language-range 186 "de-de" matches the language tag "de-DE-1996", but not the language 187 tag "de-Deva".) 189 The special range "*" matches any tag. A protocol which uses 190 language ranges MAY specify additional rules about the semantics of 191 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only 192 languages not matched by any other range within an "Accept-Language:" 193 header. 195 2.1.2 Lookup 197 Content lookup is used to select the single information item that 198 best matches the language range for a given request. In lookup, the 199 language range represents the most specific tag which is an 200 acceptable match and only the closest matching item is returned. 202 For example, if an application inserts some dynamic content into a 203 web page, returning an empty string if there is no exact match is not 204 an option. Instead, the application "falls back". 206 When performing lookup, the language range is progressively truncated 207 from the end until a matching piece of content is located. 
For 208 example, starting with the range "zh-Hant-CN-x-wadegile", the lookup 209 would progressively search for content as shown below: 211 Range to match: zh-Hant-CN-x-wadegile 212 1. zh-Hant-CN-x-wadegile 213 2. zh-Hant-CN 214 3. zh-Hant 215 4. zh 216 5. (default content or the empty tag) 218 Figure 2: Default Fallback Pattern Example 220 This scheme allows some flexibility in finding content. It also 221 typically provides better results when data is not available at a 222 specific level of tag granularity or is sparsely populated (than if 223 the default language for the system or content were used). 225 2.2 Extended Language Range 227 Prefix matching using a Basic Language Range, as described above, is 228 not always the most appropriate way to access the information 229 contained in language tags when selecting or filtering content. Some 230 applications might wish to define a more granular matching scheme and 231 such a matching scheme requires the ability to specify the various 232 attributes of a language tag in the language range. An extended 233 language range can be represented by the following ABNF: 235 extended-language-range = grandfathered / privateuse / range 236 range = ( lang [ "-" script ] [ "-" region ] *( "-" variant ) 237 [ "-" privateuse ] ) 238 lang = 2*8ALPHA / extlang / "*" 239 extlang = 2*3ALPHA *2("-" 3ALPHA) ( "-" ( 3ALPHA / "*" ) ) 240 script = 4ALPHA / "*" 241 region = 2ALPHA / 3DIGIT / "*" 242 variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*" 243 privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) ) 244 grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) ) 245 alphanum = ( ALPHA / DIGIT ) 247 In an extended language range, the identifier takes the form of a 248 series of subtags which must consist of well-formed subtags or the 249 special subtag "*". For example, the language range "en-*-US" 250 specifies a primary language of 'en', followed by any script subtag, 251 followed by the region subtag 'US'. 
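The basic matching (Section 2.1.1) and basic lookup (Section 2.1.2) schemes described above can be sketched as follows. This is a minimal illustration in Python, not a normative implementation: the function names basic_match and basic_lookup and the 'content' mapping are hypothetical, comparison is assumed to be case-insensitive (as the "de-de" example implies), and a trailing single-character subtag such as "x" is dropped during fallback, as in the fallback pattern shown above.

```python
def basic_match(language_range, tag):
    """Return True if a basic language range matches a tag (Section 2.1.1)."""
    if language_range == "*":
        return True  # the special range matches any tag
    rng, candidate = language_range.lower(), tag.lower()
    # Exact match, or a prefix that ends on a subtag boundary ("-").
    return candidate == rng or candidate.startswith(rng + "-")


def basic_lookup(language_range, content, default=None):
    """Return the closest item by progressive truncation (Section 2.1.2).

    'content' is a hypothetical dict mapping lowercase language tags to items.
    """
    subtags = language_range.lower().split("-")
    while subtags:
        key = "-".join(subtags)
        if key in content:
            return content[key]
        subtags.pop()  # remove the rightmost subtag
        # A single-character subtag cannot end a tag, so drop it as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default  # nothing matched: fall back to the default content
```

Matching answers a yes/no question for each tag and can select many items, while lookup always returns exactly one item (possibly the default), which is why the two schemes suit different kinds of processing.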
253 A field not present in the middle of an extended language range MAY 254 be treated as if the field contained a "*". For example, the range 255 "en-US" MAY be considered to be equivalent to the range "en-*-US". 257 There are several matching algorithms or schemes which can be applied 258 when matching extended language ranges to language tags. 260 2.2.1 Extended Range Matching 262 In extended range matching, the subtags in a language tag are 263 compared to the corresponding subtags in the extended language range. 264 A subtag is considered to match if it exactly matches the 265 corresponding subtag in the range or the range contains a subtag with 266 the value "*" (which matches all subtags, including the empty 267 subtag). Extended Range Matching is an extension of basic matching 268 (Section 2.1.1): the language range represents the least specific tag 269 which is an acceptable match. 271 By default all extensions and their subtags are ignored for extended 272 language range matching. 274 Private use subtags MAY be specified in the language range and MUST 275 NOT be ignored when matching. 277 Subtags not specified, including those at the end of the language 278 range, are assigned the value "*". This makes each range into a 279 prefix much like that used in basic language range matching. For 280 example, the extended language range "zh-*-CN" matches all of the 281 following tags because the unspecified variant field is expanded to 282 "*": 284 zh-Hant-CN 286 zh-CN 288 zh-Hans-CN 290 zh-CN-x-wadegile 292 zh-Latn-CN-boont 294 2.2.2 Extended Range Lookup 296 In extended range lookup, the subtags in a language tag are compared 297 to the corresponding subtags in the extended language range. The 298 subtag is considered to match if it exactly matches the corresponding 299 subtag in the range or the range contains a subtag with the value "*" 300 (which matches all subtags, including the empty subtag). 
Extended 301 language range lookup is an extension of basic lookup 302 (Section 2.1.2): the language range represents the most specific tag 303 which will form an acceptable match. 305 Subtags not specified are assigned the value "*" prior to performing 306 tag matching. Unlike in extended range matching, however, fields at 307 the end of the range MUST NOT be expanded in this manner. For 308 example, "en-US" MUST NOT be considered to be the same as the range 309 "en-US-*". This allows ranges to be specific. The "*" wildcard MUST 310 be used at the end of the range to indicate that all tags with the 311 range as a prefix are allowable matches. That is, the range "zh-*" 312 matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh" 313 matches neither of those tags. 315 The wildcard "*" at the end of a range SHOULD be considered to match 316 any private use subtag sequences (making extended language range 317 lookup function exactly like extended range matching Section 2.2.1). 319 By default all extensions and their subtags SHOULD be ignored for 320 extended language range lookup. Private use subtags MAY be specified 321 in the language range and MUST NOT be ignored when performing lookup. 322 The wildcard "*" at the end of a range SHOULD be considered to match 323 any private use subtag sequences in addition to variants. 
325 For example, the range "*-US" matches all of the following tags: 327 en-US 329 en-Latn-US 330 en-US-r-extends (extensions are ignored) 332 fr-US 334 For example, the range "en-*-US" matches _none_ of the following 335 tags: 337 fr-US 339 en (missing region US) 341 en-Latn (missing region US) 343 en-Latn-US-scouse (variant field is present) 345 For example, the range "en-*" matches all of the following tags: 347 en-Latn 349 en-Latn-US 351 en-Latn-US-scouse 353 en-US 355 en-scouse 357 Note that the ability to be specific in extended range lookup can 358 make this matching scheme a more appropriate replacement for basic 359 matching than the extended range matching scheme. 361 2.2.3 Scored Matching 363 In the "scored matching" scheme, the extended language range and the 364 language tags are pre-normalized by mapping grandfathered and 365 obsolete tags into modern equivalents. 367 The language range and the language tags are normalized into 368 quadruples of the form (language, script, region, variant), where 369 extended language is considered part of language and x-private-codes 370 are considered part of the language if they are initial and part of 371 the variant if not initial. Missing components are set to "*". An 372 "*" pattern becomes the quadruple ("*", "*", "*", "*"). 374 Each language tag being matched or filtered is assigned a "quality 375 value" such that higher values indicate better matches and lower 376 values indicate worse ones. If the language matches, add 8 to the 377 quality value. If the script matches, add 4 to the quality value. 379 If the region matches, add 2 to the quality value. If the variant 380 matches, add 1 to the quality value. Elements of the quadruples are 381 considered to match if they are the same or if one of them is "*". 383 A value of 15 is a perfect match; 0 is no match at all. 
Different 384 values could be more or less appropriate for different applications 385 and implementations SHOULD probably allow users to choose the most 386 appropriate selection value. 388 2.3 Meaning of Language Tags and Ranges 390 A language tag defines a language as spoken (or written, signed or 391 otherwise signaled) by human beings for communication of information 392 to other human beings. 394 If a language tag B contains language tag A as a prefix, then B is 395 typically "narrower" or "more specific" than A. For example, "zh- 396 Hant-TW" is more specific than "zh-Hant". 398 This relationship is not guaranteed in all cases: specifically, 399 languages that begin with the same sequence of subtags are NOT 400 guaranteed to be mutually intelligible, although they might be. 402 For example, the tag "az" shares a prefix with both "az-Latn" 403 (Azerbaijani written using the Latin script) and "az-Cyrl" 404 (Azerbaijani written using the Cyrillic script). A person fluent in 405 one script might not be able to read the other, even though the text 406 might be otherwise identical. Content tagged as "az" most probably 407 is written in just one script and thus might not be intelligible to a 408 reader familiar with the other script. 410 Variant subtags in particular seem to represent specific divisions in 411 mutual understanding, since they often encode dialects or other 412 idiosyncratic variations within a language. 414 The relationship between the language tag and the information it 415 relates to is defined by the standard describing the context in which 416 it appears. Accordingly, this section can only give possible 417 examples of its usage. 419 o For a single information object, the associated language tags 420 might be interpreted as the set of languages that are necessary 421 for a complete comprehension of the complete object. Example: 422 Plain text documents. 
424 o For an aggregation of information objects, the associated language 425 tags could be taken as the set of languages used inside components 426 of that aggregation. Examples: Document stores and libraries. 428 o For information objects whose purpose is to provide alternatives, 429 the associated language tags could be regarded as a hint that the 430 content is provided in several languages, and that one has to 431 inspect each of the alternatives in order to find its language or 432 languages. In this case, the presence of multiple tags might not 433 mean that one needs to be multi-lingual to get complete 434 understanding of the document. Example: MIME multipart/ 435 alternative. 437 o In markup languages, such as HTML and XML, language information 438 can be added to each part of the document identified by the markup 439 structure (including the whole document itself). For example, one 440 could write <span lang="fr">C'est la vie.</span> inside a 441 Norwegian document; the Norwegian-speaking user could then access 442 a French-Norwegian dictionary to find out what the marked section 443 meant. If the user were listening to that document through a 444 speech synthesis interface, this information could be used to signal 445 the synthesizer to appropriately apply French text-to-speech 446 pronunciation rules to that span of text, instead of misapplying 447 the Norwegian rules. 449 2.4 Choosing Between Alternate Matching Schemes 451 Implementations MAY choose to implement different styles of matching 452 for different kinds of processing. For example, an implementation 453 could treat an absent script subtag as a "wildcard" field; thus 454 "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not 455 "az" (this is extended range lookup). If one item is to be chosen, 456 the implementation could pick among those matches based on other 457 information, such as the most likely script used in the language/ 458 region in question or the script used by other content selected. 
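Where one item has to be chosen among several candidates, the scored matching scheme of Section 2.2.3 offers one way to rank them. The following is a minimal sketch in Python under stated assumptions: the range and the tags are assumed to have been normalized already into (language, script, region, variant) quadruples with missing components set to "*", and the names WEIGHTS, score and best_match are hypothetical rather than defined by this document.

```python
WEIGHTS = (8, 4, 2, 1)  # language, script, region, variant


def score(range_quad, tag_quad):
    """Return the quality value (0..15) of a tag quadruple for a range quadruple."""
    total = 0
    for weight, wanted, actual in zip(WEIGHTS, range_quad, tag_quad):
        wanted, actual = wanted.lower(), actual.lower()
        # Elements match if they are the same or if either one is "*".
        if wanted == actual or "*" in (wanted, actual):
            total += weight
    return total


def best_match(range_quad, tag_quads):
    """Return the highest-scoring quadruple; ties keep the earliest candidate."""
    return max(tag_quads, key=lambda quad: score(range_quad, quad))
```

Because a "*" element always counts as a match, a tag can receive a non-zero score even when its language differs from the range; an application that finds this undesirable might require a minimum score of 8 or more before accepting a candidate.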
460 Because the primary language subtag cannot be absent in a language 461 tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic 462 matching. For example, in a query where you want to select all 463 language tags that contain 'Latn' as the script code and 'AZ' as the 464 region code, you could use the range "und-Latn-AZ". This requires an 465 implementation to examine the actual values of the subtags, though. 466 The matching schemes described elsewhere in this document are 467 designed such that implementations do not have to examine the values 468 or subtags supplied and, except for scored matching, they do not need 469 access to the Language Subtag Registry nor the use of valid subtags 470 in language tags or ranges. This has great benefit for speed and 471 simplicity of implementation. 473 Implementations might also wish to use semantic information external 474 to the language tags when performing fallback. For example, the 475 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal 476 Norwegian) might both be usefully matched to the more general subtag 477 'no' (Norwegian). Or an application might infer that content labeled 478 "zh-CN" is more likely to match the range "zh-Hans" than equivalent 479 content labeled "zh-TW". 481 2.5 Considerations for Private Use Subtags 483 Private-use subtags require private agreement between the parties 484 that intend to use or exchange language tags that use them and great 485 caution SHOULD be used in employing them in content or protocols 486 intended for general use. Private-use subtags are simply useless for 487 information exchange without prior arrangement. 489 The value and semantic meaning of private-use tags and of the subtags 490 used within such a language tag are not defined. Matching private 491 use tags using language ranges or extended language ranges can result 492 in unpredictable content being returned. 
494 2.6 Length Considerations in Matching 496 [RFC3066] did not provide an upper limit on the size of language tags 497 or ranges. RFC 3066 did define the semantics of particular subtags 498 in such a way that most language tags or ranges consisted of language 499 and region subtags with a combined total length of up to six 500 characters. Larger tags and ranges (in terms of both subtags and 501 characters) did exist, however. 503 [ID.ietf-ltru-registry] also does not impose a fixed upper limit on 504 the number of subtags in a language tag or range (and thus an upper 505 bound on the size of either). The syntax in that document suggests 506 that, depending on the specific language or range of languages, more 507 subtags (and thus characters) are sometimes necessary as a result. 508 Length considerations and their impact on the selection and 509 processing of tags are described in Section 2.1.1 of that document. 511 A matching implementation MAY choose to limit the length of the 512 language tags or ranges used in matching. Any such limitation SHOULD 513 be clearly documented, and such documentation SHOULD include the 514 disposition of any longer tags or ranges (for example, whether an 515 error value is generated or the language tag or range is truncated). 516 If truncation is permitted it MUST NOT permit a subtag to be divided, 517 since this changes the semantics of the subtag being matched and can 518 result in false positives or negatives. 520 Implementations that restrict storage SHOULD consider the impact of 521 tag or range truncation on the resulting matches. For example, 522 removing the "*" from the end of an extended language range (see 523 Section 2.2) can greatly modify the set of returned matches. A 524 protocol that allows tags or ranges to be truncated at an arbitrary 525 limit, without giving any indication of what that limit is, has the 526 potential for causing harm by changing the meaning of values in 527 substantial ways. 
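Truncation that never divides a subtag can be sketched as follows. This is a minimal Python illustration for language tags (not for ranges that contain the "*" wildcard); the function name truncate_tag and its 'limit' parameter are hypothetical. Whole subtags are removed from the right until the tag fits, and a trailing single-character subtag is removed as well, following the truncation rule given in this section.

```python
def truncate_tag(tag, limit):
    """Shorten a language tag to at most 'limit' characters on subtag boundaries."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > limit:
        subtags.pop()  # remove the rightmost subtag and its preceding "-"
        # A tag must not end with a single-character subtag such as "x" or "a".
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # An empty string means that not even the primary language subtag fits.
    return "-".join(subtags)
```

For instance, truncating "zh-Hant-CN-x-wadegile" to a limit of 12 characters under this sketch yields "zh-Hant-CN", because removing "wadegile" leaves a dangling "x" that is removed too. An implementation would still need to document what it does when the result is empty.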
529 In practice, most tags do not require additional subtags or 530 substantially more characters. Additional subtags sometimes add 531 useful distinguishing information, but extraneous subtags interfere 532 with the meaning, understanding, and especially matching of language 533 tags. Since language tags or ranges MAY be truncated by an 534 application or protocol that limits storage, when choosing language 535 tags or ranges users and applications SHOULD avoid adding subtags 536 that add no distinguishing value. In particular, users and 537 implementations SHOULD follow the 'Prefix' and 'Suppress-Script' 538 fields in the registry (defined in Section 3.6 of [ID.ietf-ltru- 539 registry]): these fields provide guidance on when specific additional 540 subtags SHOULD (and SHOULD NOT) be used. 542 Implementations MUST support a limit of at least 33 characters. This 543 limit includes at least one subtag of each non-extension, non-private 544 use type. When choosing a buffer limit, a length of at least 42 545 characters is strongly RECOMMENDED. 547 The practical limit on tags or ranges derived solely from registered 548 values is 42 characters. Implementations MUST be able to handle tags 549 and ranges of this length. Support for tags and ranges of at least 550 62 characters in length is RECOMMENDED. Implementations MAY support 551 longer values, including matching extensive sets of private use or 552 extension subtags. 554 Applications or protocols which have to truncate a tag MUST do so by 555 progressively removing subtags along with their preceding "-" from 556 the right side of the language tag until the tag is short enough for 557 the given buffer. If the resulting tag ends with a single-character 558 subtag, that subtag and its preceding "-" MUST also be removed. For 559 example: 561 Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1 562 1. zh-Hant-CN-variant1-a-extend1-x-wadegile 563 2. zh-Hant-CN-variant1-a-extend1 564 3. 
zh-Hant-CN-variant1 565 4. zh-Hant-CN 566 5. zh-Hant 567 6. zh 569 Figure 4: Example of Tag Truncation 571 3. IANA Considerations 573 This document presents no new or existing considerations for IANA. 575 4. Changes 577 This is the first version of this document. 579 The following changes were put into this document since draft-02: 581 Turned on symrefs and replaced all reference IDs to make them 582 readable (F.Ellermann) 584 Removed all external references from the abstract (R.Presuhn) 586 5. Security Considerations 588 Language ranges used in content negotiation might be used to infer 589 the nationality of the sender, and thus identify potential targets 590 for surveillance. In addition, unique or highly unusual language 591 ranges or combinations of language ranges might be used to track 592 specific individuals' activities. 594 This is a special case of the general problem that anything you send 595 is visible to the receiving party. It is useful to be aware that 596 such concerns can exist in some cases. 598 The evaluation of the exact magnitude of the threat, and any possible 599 countermeasures, is left to each application protocol. 601 6. Character Set Considerations 603 The syntax of language tags and language ranges permits only the 604 characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters 605 are present in most character sets, so presentation of language tags 606 should not present any character set issues. 608 7. References 610 7.1 Normative References 612 [ID.ietf-ltru-registry] 613 Phillips, A., Ed. and M. Davis, Ed., "Tags for the 614 Identification of Languages (Internet-Draft)", June 2005, 615 . 618 [RFC1327] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 619 10021 and RFC 822", RFC 1327, May 1992. 621 [RFC1521] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet 622 Mail Extensions) Part One: Mechanisms for Specifying and 623 Describing the Format of Internet Message Bodies", 624 RFC 1521, September 1993. 
626 [RFC2028] Hovey, R. and S. Bradner, "The Organizations Involved in 627 the IETF Standards Process", BCP 11, RFC 2028, 628 October 1996. 630 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 631 Requirement Levels", BCP 14, RFC 2119, March 1997. 633 [RFC2231] Freed, N. and K. Moore, "MIME Parameter Value and Encoded 634 Word Extensions: Character Sets, Languages, and 635 Continuations", RFC 2231, November 1997. 637 [RFC2234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax 638 Specifications: ABNF", RFC 2234, November 1997. 640 [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 641 Resource Identifiers (URI): Generic Syntax", RFC 2396, 642 August 1998. 644 [RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an 645 IANA Considerations Section in RFCs", BCP 26, RFC 2434, 646 October 1998. 648 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., 649 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext 650 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999. 652 [RFC2860] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of 653 Understanding Concerning the Technical Work of the 654 Internet Assigned Numbers Authority", RFC 2860, June 2000. 656 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 657 10646", STD 63, RFC 3629, November 2003. 659 7.2 Informative References 661 [ISO639-1] 662 International Organization for Standardization, "ISO 639- 663 1:2002, Codes for the representation of names of languages 664 -- Part 1: Alpha-2 code", ISO Standard 639, 2002. 666 [ISO639-2] 667 International Organization for Standardization, "ISO 639- 668 2:1998 - Codes for the representation of names of 669 languages -- Part 2: Alpha-3 code - edition 1", 670 August 1988. 672 [ISO15924] 673 ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the 674 representation of names of scripts", January 2004. 
676 [ISO3166] International Organization for Standardization, "Codes for 677 the representation of names of countries, 3rd edition", 678 ISO Standard 3166, August 1988. 680 [UN_M49] Statistical Division, United Nations, "Standard Country or 681 Area Codes for Statistical Use", UN Standard Country or 682 Area Codes for Statistical Use, Revision 4 (United Nations 683 publication, Sales No. 98.XVII.9), June 1999. 685 [RFC1766] Alvestrand, H., "Tags for the Identification of 686 Languages", RFC 1766, March 1995. 688 [RFC3066] Alvestrand, H., "Tags for the Identification of 689 Languages", BCP 47, RFC 3066, January 2001. 691 [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet: 692 Timestamps", RFC 3339, July 2002. 694 Authors' Addresses 696 Addison Phillips (editor) 697 Quest Software 699 Email: addison dot phillips at quest dot com 700 Mark Davis (editor) 701 IBM 703 Email: mark dot davis at ibm dot com 705 Appendix A. Acknowledgements 707 Any list of contributors is bound to be incomplete; please regard the 708 following as only a selection from the group of people who have 709 contributed to make this document what it is today. 711 The contributors to [ID.ietf-ltru-registry], [RFC3066] and [RFC1766], 712 each of which is a precursor to this document, made enormous 713 contributions directly or indirectly to this document and are 714 generally responsible for the success of language tags. 716 The following people (in alphabetical order by family name) 717 contributed to this document: 719 Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Ira 720 McDonald, M. Patton, Randy Presuhn and many, many others. 722 Very special thanks must go to Harald Tveit Alvestrand, who 723 originated RFCs 1766 and 3066, and without whom this document would 724 not have been possible. 726 For this particular document, John Cowan originated the scheme 727 described in Section 2.2.3. Mark Davis originated the scheme 728 described in Section 2.1.2. 
730 Intellectual Property Statement 732 The IETF takes no position regarding the validity or scope of any 733 Intellectual Property Rights or other rights that might be claimed to 734 pertain to the implementation or use of the technology described in 735 this document or the extent to which any license under such rights 736 might or might not be available; nor does it represent that it has 737 made any independent effort to identify any such rights. Information 738 on the procedures with respect to rights in RFC documents can be 739 found in BCP 78 and BCP 79. 741 Copies of IPR disclosures made to the IETF Secretariat and any 742 assurances of licenses to be made available, or the result of an 743 attempt made to obtain a general license or permission for the use of 744 such proprietary rights by implementers or users of this 745 specification can be obtained from the IETF on-line IPR repository at 746 http://www.ietf.org/ipr. 748 The IETF invites any interested party to bring to its attention any 749 copyrights, patents or patent applications, or other proprietary 750 rights that may cover technology that may be required to implement 751 this standard. Please address the information to the IETF at 752 ietf-ipr@ietf.org. 754 Disclaimer of Validity 756 This document and the information contained herein are provided on an 757 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 758 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET 759 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, 760 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE 761 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 762 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 764 Copyright Statement 766 Copyright (C) The Internet Society (2005). This document is subject 767 to the rights, licenses and restrictions contained in BCP 78, and 768 except as set forth therein, the authors retain all their rights. 
770 Acknowledgment 772 Funding for the RFC Editor function is currently provided by the 773 Internet Society.