idnits 2.17.1 draft-ietf-ltru-matching-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 805. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 782. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 789. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 795. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([RFC3066], [19], [1]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 169 has weird spacing: '...schemes that ...' == Line 170 has weird spacing: '...ing and looku...' == Line 374 has weird spacing: '...age tag being...' 
== Line 467 has weird spacing: '...ch that imple...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (June 10, 2005) is 6894 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: 'RFC 3066' on line 46 == Unused Reference: '2' is defined on line 652, but no explicit reference was found in the text == Unused Reference: '3' is defined on line 655, but no explicit reference was found in the text == Unused Reference: '4' is defined on line 660, but no explicit reference was found in the text == Unused Reference: '6' is defined on line 666, but no explicit reference was found in the text == Unused Reference: '7' is defined on line 670, but no explicit reference was found in the text == Unused Reference: '8' is defined on line 673, but no explicit reference was found in the text == Unused Reference: '9' is defined on line 677, but no explicit reference was found in the text == Unused Reference: '11' is defined on line 685, but no explicit reference was found in the text == Unused Reference: '12' is defined on line 689, but no explicit reference was found in the text == 
Unused Reference: '13' is defined on line 694, but no explicit reference was found in the text == Unused Reference: '14' is defined on line 698, but no explicit reference was found in the text == Unused Reference: '15' is defined on line 702, but no explicit reference was found in the text == Unused Reference: '16' is defined on line 705, but no explicit reference was found in the text == Unused Reference: '17' is defined on line 709, but no explicit reference was found in the text == Unused Reference: '18' is defined on line 714, but no explicit reference was found in the text == Unused Reference: '20' is defined on line 720, but no explicit reference was found in the text == Outdated reference: A later version (-14) exists of draft-ietf-ltru-registry-03 ** Obsolete normative reference: RFC 1327 (ref. '2') (Obsoleted by RFC 2156) ** Obsolete normative reference: RFC 1521 (ref. '3') (Obsoleted by RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049) ** Obsolete normative reference: RFC 2028 (ref. '4') (Obsoleted by RFC 9281) ** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234) ** Obsolete normative reference: RFC 2396 (ref. '8') (Obsoleted by RFC 3986) ** Obsolete normative reference: RFC 2434 (ref. '9') (Obsoleted by RFC 5226) ** Obsolete normative reference: RFC 2616 (ref. '10') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Downref: Normative reference to an Informational RFC: RFC 2860 (ref. '11') -- Obsolete informational reference (is this intentional?): RFC 1766 (ref. '18') (Obsoleted by RFC 3066, RFC 3282) -- Obsolete informational reference (is this intentional?): RFC 3066 (ref. '19') (Obsoleted by RFC 4646, RFC 4647) Summary: 12 errors (**), 0 flaws (~~), 24 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group A. Phillips, Ed. 
3 Internet-Draft Quest Software 4 Expires: December 12, 2005 M. Davis, Ed. 5 IBM 6 June 10, 2005 8 Matching Language Identifiers 9 draft-ietf-ltru-matching-02 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on December 12, 2005. 36 Copyright Notice 38 Copyright (C) The Internet Society (2005). 40 Abstract 42 This document describes different mechanisms for comparing and 43 matching the tags for the identification of languages defined by [RFC 44 3066bis] [1]. Possible algorithms for language negotiation and 45 content selection are described. This document obsoletes portions of 46 [RFC 3066] [19]. 48 Table of Contents 50 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 51 2. The Language Range . . . . . . . . . . . . . . . . . . . . . . 4 52 2.1 Basic Language Range . . . . . . . . . . . . . . . . . . . 4 53 2.1.1 Matching . . . . . . . . . . . . . . . . . . . . . . . 5 54 2.1.2 Lookup . . . . . . . . . . . . . . . . . . . . . . . . 6 55 2.2 Extended Language Range . . . . . . . . . . 
. . . . . . . 6 56 2.2.1 Extended Range Matching . . . . . . . . . . . . . . . 7 57 2.2.2 Extended Range Lookup . . . . . . . . . . . . . . . . 8 58 2.2.3 Scored Matching . . . . . . . . . . . . . . . . . . . 9 59 2.3 Meaning of Language Tags and Ranges . . . . . . . . . . . 10 60 2.4 Choosing Between Alternate Matching Schemes . . . . . . . 11 61 2.5 Considerations for Private Use Subtags . . . . . . . . . . 12 62 2.6 Length Considerations in Matching . . . . . . . . . . . . 12 63 3. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 64 4. Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 65 5. Security Considerations . . . . . . . . . . . . . . . . . . . 16 66 6. Character Set Considerations . . . . . . . . . . . . . . . . . 17 67 7. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 68 7.1 Normative References . . . . . . . . . . . . . . . . . . . 18 69 7.2 Informative References . . . . . . . . . . . . . . . . . . 19 70 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 19 71 A. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 20 72 Intellectual Property and Copyright Statements . . . . . . . . 21 74 1. Introduction 76 Human beings on our planet have, past and present, used a number of 77 languages. There are many reasons why one would want to identify the 78 language used when presenting or requesting information. 80 Information about a user's language preferences commonly needs to be 81 identified so that appropriate processing can be applied. For 82 example, the user's language preferences in a browser can be used to 83 select web pages appropriately. A choice of language preference can 84 also be used to select among tools (such as dictionaries) to assist 85 in the processing or understanding of content in different languages. 
87 Given a set of language identifiers, such as those defined in 88 RFC3066bis [1], various mechanisms can be envisioned for performing 89 language negotiation and tag matching. The suitability of a 90 particular mechanism to a particular application depends on the needs 91 of that application. 93 This document defines language ranges and syntax for specifying user 94 preferences in a request for language content. It also specifies 95 various schemes and mechanisms that can be used with language ranges 96 when matching or filtering content based on language tags. 98 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 100 document are to be interpreted as described in RFC 2119 [5]. 102 2. The Language Range 104 Language Tags are used to identify the language of some information 105 item or content. Applications that use language tags are often faced 106 with the problem of identifying sets of content that share certain 107 language attributes. For example, HTTP 1.1 [10] describes language 108 ranges in its discussion of the Accept-Language header (Section 109 14.4), which is used for selecting content from servers based on the 110 language of that content. 112 When selecting content according to its language, it is useful to 113 have a mechanism for identifying sets of language tags that share 114 specific attributes. This allows users to select or filter content 115 based on specific requirements. Such an identifier is called a 116 "Language Range". 118 2.1 Basic Language Range 120 A basic language range (such as described in RFC 3066 [19] and HTTP 121 1.1 [10]) is a set of languages whose tags all begin with the same 122 sequence of subtags. 
A basic language range can be represented by a 123 'language-range' tag, by using the definition from HTTP/1.1 [10] : 124 language-range = language-tag / "*" 126 That is, a language-range has the same syntax as a language-tag or is 127 the single character "*". This definition of language-range implies 128 that there is a semantic relationship between tags that share the 129 same prefix. 131 In particular, the set of language tags that match a specific 132 language-range might not all be mutually intelligible. The use of a 133 prefix when matching tags to language ranges does not imply that 134 language tags are assigned to languages in such a way that it is 135 always true that if a user understands a language with a certain tag, 136 then this user will also understand all languages with tags for which 137 this tag is a prefix. The prefix rule simply allows the use of 138 prefix tags if this is the case. 140 When working with tags and ranges you SHOULD also note the following: 142 1. Private-use and Extension subtags are normally orthogonal to 143 language tag fallback. Implementations SHOULD ignore 144 unrecognized private-use and extension subtags when performing 145 language tag fallback. Since these subtags are always at the end 146 of the sequence of subtags, they don't normally interfere with 147 the use of prefixes for matching in the schemes described below. 149 2. Implementations that choose not to interpret one or more private- 150 use or extension subtags SHOULD NOT remove or modify these 151 extensions in content that they are processing. When a language 152 tag instance is to be used in a specific, known protocol, and is 153 not being passed through to other protocols, language tags MAY be 154 filtered to remove subtags and extensions that are not supported 155 by that protocol. 
Such filtering SHOULD be avoided, if possible, 156 since it removes information that might be relevant if services 157 on the other end of the protocol would make use of that 158 information. 160 3. Some applications of language tags might want or need to consider 161 extensions and private-use subtags when matching tags. If 162 extensions and private-use subtags are included in a matching or 163 filtering process that utilizes one of the schemes described 164 in this document, then the implementation SHOULD canonicalize the 165 language tags and/or ranges before performing the matching. Note 166 that language tag processors that claim to be "well-formed" 167 processors as defined in [1] generally fall into this category. 169 There are two matching schemes that are commonly associated with 170 basic language ranges: matching and lookup. 172 2.1.1 Matching 174 Language tag matching is used to select all content that matches a 175 given prefix. In matching, the language range represents the least 176 specific tag which is an acceptable match and every piece of content 177 that matches is returned. 179 For example, if an application is applying a style to all content in 180 a web page in a particular language, it might use language tag 181 matching to select the content to which the style is applied. 183 A language-range matches a language-tag if it exactly equals the tag, 184 or if it exactly equals a prefix of the tag such that the first 185 character following the prefix is "-". (That is, the language-range 186 "en-de" matches the language tag "en-DE-boont", but not the language 187 tag "en-Deva".) 189 The special range "*" matches any tag. A protocol which uses 190 language ranges MAY specify additional rules about the semantics of 191 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only 192 languages not matched by any other range within an "Accept-Language:" 193 header.
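The prefix rule for basic matching described above can be illustrated with a short sketch. This is not part of the draft: the function names basic_range_matches and filter_by_range are hypothetical, and the sketch assumes that tags and ranges are compared case-insensitively, as the "en-de" / "en-DE-boont" example implies. It treats "*" as matching any tag and does not model protocol-specific refinements such as the HTTP/1.1 rule for "*".

```python
def basic_range_matches(language_range: str, language_tag: str) -> bool:
    """Return True if a basic language-range matches a language tag.

    Hypothetical helper: the range matches when it equals the tag, or when
    it equals a prefix of the tag that is immediately followed by "-".
    """
    # Assumption: comparison is case-insensitive.
    rng = language_range.lower()
    tag = language_tag.lower()

    # The special range "*" matches any tag.
    if rng == "*":
        return True

    # Exact match, or prefix match ending on a subtag boundary.
    return tag == rng or tag.startswith(rng + "-")


def filter_by_range(language_range, tagged_items):
    """Return every (tag, item) pair whose tag matches the range.

    Basic matching returns all matching content, not only the best match.
    """
    return [
        (tag, item)
        for tag, item in tagged_items
        if basic_range_matches(language_range, tag)
    ]
```

Requiring the "-" after the prefix is what keeps the range "en-de" from matching "en-Deva": a plain string-prefix test would accept it, even though "Deva" is a different subtag.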
195 2.1.2 Lookup 197 Content lookup is used to select the single information item that 198 best matches the language range for a given request. In lookup, the 199 language range represents the most specific tag which is an 200 acceptable match and only the closest matching item is returned. 202 For example, if an application inserts some dynamic content into a 203 web page, returning an empty string if there is no exact match is not 204 an option. Instead, the application "falls back". 206 When performing lookup, the language range is progressively truncated 207 from the end until a matching piece of content is located. For 208 example, starting with the range "zh-Hant-CN-x-wadegile", the lookup 209 would progressively search for content as shown below: 211 Range to match: zh-Hant-CN-x-wadegile 212 1. zh-Hant-CN-x-wadegile 213 2. zh-Hant-CN 214 3. zh-Hant 215 4. zh 216 5. (default content or the empty tag) 218 Figure 2: Default Fallback Pattern Example 220 This scheme allows some flexibility in finding content. It also 221 typically provides better results, when data is not available at a 222 specific level of tag granularity or is sparsely populated, than 223 using the default language for the system or content would. 225 2.2 Extended Language Range 227 Prefix matching using a Basic Language Range, as described above, is 228 not always the most appropriate way to access the information 229 contained in language tags when selecting or filtering content. Some 230 applications might wish to define a more granular matching scheme and 231 such a matching scheme requires the ability to specify the various 232 attributes of a language tag in the language range.
An extended 233 language range can be represented by the following ABNF: 235 extended-language-range = grandfathered / privateuse / range 236 range = ( lang [ "-" script ] [ "-" region ] *( "-" variant ) 237 [ "-" privateuse ] ) 238 lang = ( 2*8ALPHA *[ "-" extlang ] ) / "*" 239 extlang = 3ALPHA / "*" 240 script = 4ALPHA / "*" 241 region = 2ALPHA / 3DIGIT / "*" 242 variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*" 243 privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) ) 244 grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) ) 245 alphanum = ( ALPHA / DIGIT ) 247 In an extended language range, the identifier takes the form of a 248 series of subtags which must consist of well-formed subtags or the 249 special subtag "*". For example, the language range "en-*-US" 250 specifies a primary language of 'en', followed by any script subtag, 251 followed by the region subtag 'US'. 253 A field not present in the middle of an extended language range MAY 254 be treated as if the field contained a "*". For example, the range 255 "en-US" MAY be considered to be equivalent to the range "en-*-US". 257 There are several matching algorithms or schemes which can be applied 258 when matching extended language ranges to language tags. 260 2.2.1 Extended Range Matching 262 In extended range matching, the subtags in a language tag are 263 compared to the corresponding subtags in the extended language range. 264 A subtag is considered to match if it exactly matches the 265 corresponding subtag in the range or the range contains a subtag with 266 the value "*" (which matches all subtags, including the empty 267 subtag). Extended Range Matching is an extension of basic matching 268 (Section 2.1.1): the language range represents the least specific tag 269 which is an acceptable match. 271 By default all extensions and their subtags are ignored for extended 272 language range matching. 274 Private use subtags MAY be specified in the language range and MUST 275 NOT be ignored when matching. 
277 Subtags not specified, including those at the end of the language 278 range, are assigned the value "*". This makes each range into a 279 prefix much like that used in basic language range matching. For 280 example, the extended language range "zh-*-CN" matches all of the 281 following tags because the unspecified variant field is expanded to 282 "*": 284 zh-Hant-CN 286 zh-CN 288 zh-Hans-CN 290 zh-CN-x-wadegile 292 zh-Latn-CN-boont 294 2.2.2 Extended Range Lookup 296 In extended range lookup, the subtags in a language tag are compared 297 to the corresponding subtags in the extended language range. The 298 subtag is considered to match if it exactly matches the corresponding 299 subtag in the range or the range contains a subtag with the value "*" 300 (which matches all subtags, including the empty subtag). Extended 301 language range lookup is an extension of basic lookup 302 (Section 2.1.2): the language range represents the most specific tag 303 which will form an acceptable match. 305 Subtags not specified are assigned the value "*" prior to performing 306 tag matching. Unlike in extended range matching, however, fields at 307 the end of the range MUST NOT be expanded in this manner. For 308 example, "en-US" MUST NOT be considered to be the same as the range 309 "en-US-*". This allows ranges to be specific. The "*" wildcard MUST 310 be used at the end of the range to indicate that all tags with the 311 range as a prefix are allowable matches. That is, the range "zh-*" 312 matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh" 313 matches neither of those tags. 315 The wildcard "*" at the end of a range SHOULD be considered to match 316 any private use subtag sequences (making extended language range 317 lookup function exactly like extended range matching; see Section 2.2.1). 319 By default all extensions and their subtags SHOULD be ignored for 320 extended language range lookup.
Private use subtags MAY be specified 321 in the language range and MUST NOT be ignored when performing lookup. 322 The wildcard "*" at the end of a range SHOULD be considered to match 323 any private use subtag sequences in addition to variants. 325 For example, the range "*-US" matches all of the following tags: 327 en-US 329 en-Latn-US 330 en-US-r-extends (extensions are ignored) 332 fr-US 334 For example, the range "en-*-US" matches _none_ of the following 335 tags: 337 fr-US 339 en (missing region US) 341 en-Latn (missing region US) 343 en-Latn-US-scouse (variant field is present) 345 For example, the range "en-*" matches all of the following tags: 347 en-Latn 349 en-Latn-US 351 en-Latn-US-scouse 353 en-US 355 en-scouse 357 Note that the ability to be specific in extended range lookup can 358 make this matching scheme a more appropriate replacement for basic 359 matching than the extended range matching scheme. 361 2.2.3 Scored Matching 363 In the "scored matching" scheme, the extended language range and the 364 language tags are pre-normalized by mapping grandfathered and 365 obsolete tags into modern equivalents. 367 The language range and the language tags are normalized into 368 quadruples of the form (language, script, country, variant), where 369 extended language is considered part of language and x-private-codes 370 are considered part of the language if they are initial and part of 371 the variant if not initial. Missing components are set to "*". An 372 "*" pattern becomes the quadruple ("*", "*", "*", "*"). 374 Each language tag being matched or filtered is assigned a "quality 375 value" such that higher values indicate better matches and lower 376 values indicate worse ones. If the language matches, add 8 to the 377 quality value. If the script matches, add 4 to the quality value. 379 If the region matches, add 2 to the quality value. If the variant 380 matches, add 1 to the quality value. 
Elements of the quadruples are 381 considered to match if they are the same or if one of them is "*". 383 A value of 15 is a perfect match; 0 is no match at all. Different 384 values could be more or less appropriate for different applications 385 and implementations SHOULD probably allow users to choose the most 386 appropriate selection value. 388 2.3 Meaning of Language Tags and Ranges 390 A language tag defines a language as spoken (or written, signed or 391 otherwise signaled) by human beings for communication of information 392 to other human beings. 394 If a language tag B contains language tag A as a prefix, then B is 395 typically "narrower" or "more specific" than A. For example, "zh- 396 Hant-TW" is more specific than "zh-Hant". 398 This relationship is not guaranteed in all cases: specifically, 399 languages that begin with the same sequence of subtags are NOT 400 guaranteed to be mutually intelligible, although they might be. 402 For example, the tag "az" shares a prefix with both "az-Latn" 403 (Azerbaijani written using the Latin script) and "az-Cyrl" 404 (Azerbaijani written using the Cyrillic script). A person fluent in 405 one script might not be able to read the other, even though the text 406 might be otherwise identical. Content tagged as "az" most probably 407 is written in just one script and thus might not be intelligible to a 408 reader familiar with the other script. 410 Variant subtags in particular seem to represent specific divisions in 411 mutual understanding, since they often encode dialects or other 412 idiosyncratic variations within a language. 414 The relationship between the language tag and the information it 415 relates to is defined by the standard describing the context in which 416 it appears. Accordingly, this section can only give possible 417 examples of its usage. 
419 o For a single information object, the associated language tags 420 might be interpreted as the set of languages that are necessary 421 for a complete comprehension of the complete object. Example: 422 Plain text documents. 424 o For an aggregation of information objects, the associated language 425 tags could be taken as the set of languages used inside components 426 of that aggregation. Examples: Document stores and libraries. 428 o For information objects whose purpose is to provide alternatives, 429 the associated language tags could be regarded as a hint that the 430 content is provided in several languages, and that one has to 431 inspect each of the alternatives in order to find its language or 432 languages. In this case, the presence of multiple tags might not 433 mean that one needs to be multi-lingual to get complete 434 understanding of the document. Example: MIME 435 multipart/alternative. 437 o In markup languages, such as HTML and XML, language information 438 can be added to each part of the document identified by the markup 439 structure (including the whole document itself). For example, one 440 could write <span lang="fr">C'est la vie.</span> inside a 441 Norwegian document; the Norwegian-speaking user could then access 442 a French-Norwegian dictionary to find out what the marked section 443 meant. If the user were listening to that document through a 444 speech synthesis interface, this information could be used to signal 445 the synthesizer to appropriately apply French text-to-speech 446 pronunciation rules to that span of text, instead of misapplying 447 the Norwegian rules. 449 2.4 Choosing Between Alternate Matching Schemes 451 Implementations MAY choose to implement different styles of matching 452 for different kinds of processing. For example, an implementation 453 could treat an absent script subtag as a "wildcard" field; thus 454 "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not 455 "az" (this is extended range lookup).
If one item is to be chosen, 456 the implementation could pick among those matches based on other 457 information, such as the most likely script used in the 458 language/region in question or the script used by other content selected. 460 Because the primary language subtag cannot be absent in a language 461 tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic 462 matching. For example, in a query where you want to select all 463 language tags that contain 'Latn' as the script code and 'AZ' as the 464 region code, you could use the range "und-Latn-AZ". This requires an 465 implementation to examine the actual values of the subtags, though. 466 The matching schemes described elsewhere in this document are 467 designed such that implementations do not have to examine the values 468 of the subtags supplied and, except for scored matching, they do not need 469 access to the Language Subtag Registry, nor do they require the use of valid subtags 470 in language tags or ranges. This has great benefit for speed and 471 simplicity of implementation. 473 Implementations might also wish to use semantic information external 474 to the language tags when performing fallback. For example, the 475 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal 476 Norwegian) might both be usefully matched to the more general subtag 477 'no' (Norwegian). Or an application might infer that content labeled 478 "zh-CN" is more likely to match the range "zh-Hans" than equivalent 479 content labeled "zh-TW". 481 2.5 Considerations for Private Use Subtags 483 Private-use subtags require private agreement between the parties 484 that intend to use or exchange language tags that use them and great 485 caution SHOULD be used in employing them in content or protocols 486 intended for general use. Private-use subtags are simply useless for 487 information exchange without prior arrangement.
489 The value and semantic meaning of private-use tags and of the subtags 490 used within such a language tag are not defined. Matching private 491 use tags using language ranges or extended language ranges can result 492 in unpredictable content being returned. 494 2.6 Length Considerations in Matching 496 RFC 3066 [19] did not provide an upper limit on the size of language 497 tags or ranges. RFC 3066 did define the semantics of particular 498 subtags in such a way that most language tags or ranges consisted of 499 language and region subtags with a combined total length of up to six 500 characters. Larger tags and ranges (in terms of both subtags and 501 characters) did exist, however. 503 [1] also does not impose a fixed upper limit on the number of subtags 504 in a language tag or range (and thus an upper bound on the size of 505 either). The syntax in that document suggests that, depending on the 506 specific language or range of languages, more subtags (and thus 507 characters) are sometimes necessary as a result. Length 508 considerations and their impact on the selection and processing of 509 tags are described in Section 2.1.1 of that document. 511 A matching implementation MAY choose to limit the length of the 512 language tags or ranges used in matching. Any such limitation SHOULD 513 be clearly documented, and such documentation SHOULD include the 514 disposition of any longer tags or ranges (for example, whether an 515 error value is generated or the language tag or range is truncated). 516 If truncation is permitted it MUST NOT permit a subtag to be divided, 517 since this changes the semantics of the subtag being matched and can 518 result in false positives or negatives. 520 Implementations that restrict storage SHOULD consider the impact of 521 tag or range truncation on the resulting matches. For example, 522 removing the "*" from the end of an extended language range (see 523 Section 2.2) can greatly modify the set of returned matches. 
A 524 protocol that allows tags or ranges to be truncated at an arbitrary 525 limit, without giving any indication of what that limit is, has the 526 potential for causing harm by changing the meaning of values in 527 substantial ways. 529 In practice, most tags do not require additional subtags or 530 substantially more characters. Additional subtags sometimes add 531 useful distinguishing information, but extraneous subtags interfere 532 with the meaning, understanding, and especially matching of language 533 tags. Since language tags or ranges MAY be truncated by an 534 application or protocol that limits storage, when choosing language 535 tags or ranges users and applications SHOULD avoid adding subtags 536 that add no distinguishing value. In particular, users and 537 implementations SHOULD follow the 'Prefix' and 'Suppress-Script' 538 fields in the registry (defined in Section 3.6 of [1]): these fields 539 provide guidance on when specific additional subtags SHOULD (and 540 SHOULD NOT) be used. 542 Implementations MUST support a limit of at least 33 characters. This 543 limit includes at least one subtag of each non-extension, non-private 544 use type. When choosing a buffer limit, a length of at least 42 545 characters is strongly RECOMMENDED. 547 The practical limit on tags or ranges derived solely from registered 548 values is 42 characters. Implementations MUST be able to handle tags 549 and ranges of this length. Support for tags and ranges of at least 550 62 characters in length is RECOMMENDED. Implementations MAY support 551 longer values, including matching extensive sets of private use or 552 extension subtags. 554 Applications or protocols which have to truncate a tag MUST do so by 555 progressively removing subtags along with their preceding "-" from 556 the right side of the language tag until the tag is short enough for 557 the given buffer. 
   If the resulting tag ends with a single-character subtag, that
   subtag and its preceding "-" MUST also be removed. For example:

   Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Hant-CN-variant1-a-extend1-x-wadegile
   2. zh-Hant-CN-variant1-a-extend1
   3. zh-Hant-CN-variant1
   4. zh-Hant-CN
   5. zh-Hant
   6. zh

                  Figure 4: Example of Tag Truncation

3.  IANA Considerations

   This document presents no new or existing considerations for IANA.

4.  Changes

   This is the first version of this document.

   The following changes were put into this document since draft-00:

      Fixed text in the introduction that is no longer accurate.
      Specifically, there no longer is a default matching algorithm.
      (A.Phillips)

      Fixed text in Section 2.1 which incorrectly discussed the default
      fallback mechanism. (A.Phillips)

      Minor changes to Section 2.3, in particular, the addition of the
      'variant' paragraph and some tidying of the text. (A.Phillips)

      Fixed a minor glitch in the ABNF caused by taking the output of
      Bill Fenner's parser and not looking too closely at it.
      (M.Patton)

      Fixed some minor reference problems. (M.Patton)

      Added Section 2.6 on length considerations in matching.
      (R.Presuhn)

      Copied various materials from the length considerations section of
      the registry draft to keep the two documents in sync.
      (A.Phillips)

5.  Security Considerations

   The only security issue that has been raised with language tags since
   the publication of RFC 1766, which stated that "Security issues are
   believed to be irrelevant to this memo", is a concern with language
   ranges used in content negotiation - that they might be used to infer
   the nationality of the sender, and thus identify potential targets
   for surveillance.

   This is a special case of the general problem that anything you send
   is visible to the receiving party.
   It is useful to be aware that such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application protocol.

   Although the specification of valid subtags for an extension MUST be
   available over the Internet, implementations SHOULD NOT mechanically
   depend on it being always accessible, to prevent denial-of-service
   attacks.

6.  Character Set Considerations

   The syntax in this document requires that language ranges use only
   the characters A-Z, a-z, 0-9, and HYPHEN-MINUS legal in language
   tags. These characters are present in most character sets, so
   presentation of language tags should not have any character set
   issues.

   Rendering of characters based on the content of a language tag is not
   addressed in this memo. Historically, some languages have relied on
   the use of specific character sets or other information in order to
   infer how a specific character should be rendered (notably this
   applies to language and culture specific variations of Han ideographs
   as used in Japanese, Chinese, and Korean). When language tags are
   applied to spans of text, rendering engines sometimes use that
   information in deciding which font to use in the absence of other
   information, particularly where languages with distinct writing
   traditions use the same characters.

7.  References

7.1  Normative References

   [1]  Phillips, A., Ed. and M. Davis, Ed., "Tags for the
        Identification of Languages (Internet-Draft)", June 2005.

   [2]  Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021
        and RFC 822", RFC 1327, May 1992.

   [3]  Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
        Extensions) Part One: Mechanisms for Specifying and Describing
        the Format of Internet Message Bodies", RFC 1521,
        September 1993.

   [4]  Hovey, R. and S.
        Bradner, "The Organizations Involved in the IETF Standards
        Process", BCP 11, RFC 2028, October 1996.

   [5]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.

   [6]  Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word
        Extensions: Character Sets, Languages, and Continuations",
        RFC 2231, November 1997.

   [7]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
        Specifications: ABNF", RFC 2234, November 1997.

   [8]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
        Resource Identifiers (URI): Generic Syntax", RFC 2396,
        August 1998.

   [9]  Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
        Considerations Section in RFCs", BCP 26, RFC 2434,
        October 1998.

   [10] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
        Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
        HTTP/1.1", RFC 2616, June 1999.

   [11] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
        Understanding Concerning the Technical Work of the Internet
        Assigned Numbers Authority", RFC 2860, June 2000.

   [12] Yergeau, F., "UTF-8, a transformation format of ISO 10646",
        STD 63, RFC 3629, November 2003.

7.2  Informative References

   [13] International Organization for Standardization, "ISO
        639-1:2002, Codes for the representation of names of languages
        -- Part 1: Alpha-2 code", ISO Standard 639, 2002.

   [14] International Organization for Standardization, "ISO 639-2:1998
        - Codes for the representation of names of languages -- Part 2:
        Alpha-3 code - edition 1", August 1988.

   [15] ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
        representation of names of scripts", January 2004.

   [16] International Organization for Standardization, "Codes for the
        representation of names of countries, 3rd edition",
        ISO Standard 3166, August 1988.

   [17] Statistical Division, United Nations, "Standard Country or Area
        Codes for Statistical Use", UN Standard Country or Area Codes
        for Statistical Use, Revision 4 (United Nations publication,
        Sales No. 98.XVII.9), June 1999.

   [18] Alvestrand, H., "Tags for the Identification of Languages",
        RFC 1766, March 1995.

   [19] Alvestrand, H., "Tags for the Identification of Languages",
        BCP 47, RFC 3066, January 2001.

   [20] Klyne, G. and C. Newman, "Date and Time on the Internet:
        Timestamps", RFC 3339, July 2002.

Authors' Addresses

   Addison Phillips (editor)
   Quest Software

   Email: addison dot phillips at quest dot com

   Mark Davis (editor)
   IBM

   Email: mark dot davis at ibm dot com

Appendix A.  Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to RFC 3066 and RFC 1766, the precursors of this
   document, made enormous contributions directly or indirectly to this
   document and are generally responsible for the success of language
   tags.

   The following people (in alphabetical order) contributed to this
   document or to RFCs 1766 and 3066:

   Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
   Nathaniel Borenstein, Eric Brunner, Sean M. Burke, Jeremy Carroll,
   John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin,
   Dave Crocker, Martin Duerst, Michael Everson, Doug Ewell, Ned Freed,
   Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpren,
   Elliotte Rusty Harold, Paul Hoffman, Richard Ishida, Olle Jarnefors,
   Kent Karlsson, John Klensin, Alain LaBonte, Eric Mader, Keith Moore,
   Chris Newman, Masataka Ohta, Michael S.
   Patton, Randy Presuhn, George Rhoten, Markus Scherer, Keld Jorn
   Simonsen, Thierry Sourbier, Otto Stolz, Tex Texin, Andrea Vine, Rhys
   Weatherley, Misha Wolf, Francois Yergeau and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible. Special thanks must go to Michael Everson,
   who has served as language tag reviewer for almost the complete
   period since the publication of RFC 1766. Special thanks to Doug
   Ewell, for his production of the first complete subtag registry, and
   his work in producing a test parser for verifying language tags.

   For this particular document, John Cowan originated the scheme
   described in Section 2.2.3. Mark Davis originated the scheme
   described in Section 2.1.2.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights. Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard. Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005). This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.