idnits 2.17.1 draft-ietf-ltru-matching-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 16. -- Found old boilerplate from RFC 3978, Section 5.5 on line 774. -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 751. -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 758. -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 764. ** This document has an original RFC 3978 Section 5.4 Copyright Line, instead of the newer IETF Trust Copyright according to RFC 4748. ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead of the newer disclaimer which includes the IETF Trust according to RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([RFC3066], [19], [1]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Line 168 has weird spacing: '...schemes that ...' == Line 169 has weird spacing: '...ing and looku...' == Line 371 has weird spacing: '...age tag being...' 
== The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (May 30, 2005) is 6906 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: 'RFC 3066' on line 46 == Unused Reference: '2' is defined on line 621, but no explicit reference was found in the text == Unused Reference: '3' is defined on line 624, but no explicit reference was found in the text == Unused Reference: '4' is defined on line 629, but no explicit reference was found in the text == Unused Reference: '6' is defined on line 635, but no explicit reference was found in the text == Unused Reference: '7' is defined on line 639, but no explicit reference was found in the text == Unused Reference: '8' is defined on line 642, but no explicit reference was found in the text == Unused Reference: '9' is defined on line 646, but no explicit reference was found in the text == Unused Reference: '11' is defined on line 654, but no explicit reference was found in the text == Unused Reference: '12' is defined on line 658, but no explicit reference was found in the text == Unused Reference: '13' is defined on line 663, but no 
explicit reference was found in the text == Unused Reference: '14' is defined on line 667, but no explicit reference was found in the text == Unused Reference: '15' is defined on line 671, but no explicit reference was found in the text == Unused Reference: '16' is defined on line 674, but no explicit reference was found in the text == Unused Reference: '17' is defined on line 678, but no explicit reference was found in the text == Unused Reference: '18' is defined on line 683, but no explicit reference was found in the text == Unused Reference: '20' is defined on line 689, but no explicit reference was found in the text == Outdated reference: A later version (-14) exists of draft-ietf-ltru-registry-01 ** Obsolete normative reference: RFC 1327 (ref. '2') (Obsoleted by RFC 2156) ** Obsolete normative reference: RFC 1521 (ref. '3') (Obsoleted by RFC 2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049) ** Obsolete normative reference: RFC 2028 (ref. '4') (Obsoleted by RFC 9281) ** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234) ** Obsolete normative reference: RFC 2396 (ref. '8') (Obsoleted by RFC 3986) ** Obsolete normative reference: RFC 2434 (ref. '9') (Obsoleted by RFC 5226) ** Obsolete normative reference: RFC 2616 (ref. '10') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) ** Downref: Normative reference to an Informational RFC: RFC 2860 (ref. '11') -- Obsolete informational reference (is this intentional?): RFC 1766 (ref. '18') (Obsoleted by RFC 3066, RFC 3282) -- Obsolete informational reference (is this intentional?): RFC 3066 (ref. '19') (Obsoleted by RFC 4646, RFC 4647) Summary: 12 errors (**), 0 flaws (~~), 23 warnings (==), 10 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group A. Phillips, Ed. 3 Internet-Draft Quest Software 4 Expires: December 1, 2005 M. 
Davis, Ed. 5 IBM 6 May 30, 2005 8 Matching Language Identifiers 9 draft-ietf-ltru-matching-01 11 Status of this Memo 13 By submitting this Internet-Draft, each author represents that any 14 applicable patent or other IPR claims of which he or she is aware 15 have been or will be disclosed, and any of which he or she becomes 16 aware will be disclosed, in accordance with Section 6 of BCP 79. 18 Internet-Drafts are working documents of the Internet Engineering 19 Task Force (IETF), its areas, and its working groups. Note that 20 other groups may also distribute working documents as Internet- 21 Drafts. 23 Internet-Drafts are draft documents valid for a maximum of six months 24 and may be updated, replaced, or obsoleted by other documents at any 25 time. It is inappropriate to use Internet-Drafts as reference 26 material or to cite them other than as "work in progress." 28 The list of current Internet-Drafts can be accessed at 29 http://www.ietf.org/ietf/1id-abstracts.txt. 31 The list of Internet-Draft Shadow Directories can be accessed at 32 http://www.ietf.org/shadow.html. 34 This Internet-Draft will expire on December 1, 2005. 36 Copyright Notice 38 Copyright (C) The Internet Society (2005). 40 Abstract 42 This document describes different mechanisms for comparing and 43 matching the tags for the identification of languages defined by [RFC 44 3066bis] [1]. Possible algorithms for language negotiation and 45 content selection are described. This document obsoletes portions of 46 [RFC 3066] [19]. 48 Table of Contents
50 1. Introduction
51 2. The Language Range
52 2.1 Basic Language Range
53 2.1.1 Matching
54 2.1.2 Lookup
55 2.2 Extended Language Range
56 2.2.1 Extended Range Matching
57 2.2.2 Extended Range Lookup
58 2.2.3 Scored Matching
59 2.3 Meaning of Language Tags and Ranges
60 2.4 Choosing Between Alternate Matching Schemes
61 2.5 Considerations for Private Use Subtags
62 2.6 Length Considerations in Matching
63 3. IANA Considerations
64 4. Changes
65 5. Security Considerations
66 6. Character Set Considerations
67 7. References
68 7.1 Normative References
69 7.2 Informative References
70 Authors' Addresses
71 A. Acknowledgements
72 Intellectual Property and Copyright Statements
74 1. Introduction 76 Human beings on our planet have, past and present, used a number of 77 languages. There are many reasons why one would want to identify the 78 language used when presenting or requesting information. 80 Information about a user's language preferences commonly needs to be 81 identified so that appropriate processing can be applied. For 82 example, the user's language preferences in a browser can be used to 83 select web pages appropriately. A choice of language preference can 84 also be used to select among tools (such as dictionaries) to assist 85 in the processing or understanding of content in different languages. 87 Given a set of language identifiers, such as those defined in 88 RFC3066bis [1], various mechanisms can be envisioned for performing 89 language negotiation and tag matching.
The suitability of a 90 particular mechanism to a particular application depends on the needs 91 of that application. 93 This document defines language ranges and syntax for specifying user 94 preferences in a request for language content. It also specifies 95 various schemes and mechanisms that can be used with language ranges 96 when matching or filtering content based on language tags. 98 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 100 document are to be interpreted as described in RFC 2119 [5]. 102 2. The Language Range 104 Language Tags are used to identify the language of some information 105 item or content. Applications that use language tags are often faced 106 with the problem of identifying sets of content that share certain 107 language attributes. For example, HTTP 1.1 [10] describes language 108 ranges in its discussion of the Accept-Language header (Section 109 14.4), which is used for selecting content from servers based on the 110 language of that content. 112 When selecting content according to its language, it is useful to 113 have a mechanism for identifying sets of language tags that share 114 specific attributes. This allows users to select or filter content 115 based on specific requirements. Such an identifier is called a 116 "Language Range". 118 2.1 Basic Language Range 120 A basic language range (such as described in RFC 3066 [19] and HTTP 121 1.1 [10]) is a set of languages whose tags all begin with the same 122 sequence of subtags. A basic language range can be represented by a 123 'language-range' tag, by using the definition from HTTP/1.1 [10] : 124 language-range = language-tag / "*" 126 That is, a language-range has the same syntax as a language-tag or is 127 the single character "*". This definition of language-range implies 128 that there is a semantic relationship between tags that share the 129 same prefix. 
131 In particular, the set of language tags that match a specific 132 language-range may not all be mutually intelligible. The use of a 133 prefix when matching tags to language ranges does not imply that 134 language tags are assigned to languages in such a way that it is 135 always true that if a user understands a language with a certain tag, 136 then this user will also understand all languages with tags for which 137 this tag is a prefix. The prefix rule simply allows the use of 138 prefix tags if this is the case. 140 When working with tags and ranges you should also note the following: 142 1. Private-use and Extension subtags are normally orthogonal to 143 language tag fallback. Implementations should ignore 144 unrecognized private-use and extension subtags when performing 145 language tag fallback. Since these subtags are always at the end 146 of the sequence of subtags, they don't normally interfere with 147 the use of prefixes for matching in the schemes described below. 149 2. Implementations that choose not to interpret one or more private- 150 use or extension subtags should not remove or modify these 151 extensions in content that they are processing. When a language 152 tag instance is to be used in a specific, known protocol, and is 153 not being passed through to other protocols, language tags may be 154 filtered to remove subtags and extensions that are not supported 155 by that protocol. This should be done with caution, since it is 156 removing information that may be relevant if services on the 157 other end of the protocol would make use of that information. 159 3. Some applications of language tags may want or need to consider 160 extensions and private-use subtags when matching tags. 
If 161 extensions and private-use subtags are included in a matching or 162 filtering process that utilizes one of the schemes described 163 in this document, then the implementation should canonicalize the 164 language tags and/or ranges before performing the matching. Note 165 that language tag processors that claim to be "well-formed" 166 processors as defined in [1] generally fall into this category. 168 There are two matching schemes that are commonly associated with 169 basic language ranges: matching and lookup. 171 2.1.1 Matching 173 Language tag matching is used to select all content that matches a 174 given prefix. In matching, the language range represents the least 175 specific tag which is an acceptable match and every piece of content 176 that matches is returned. 178 For example, if an application is applying a style to all content in 179 a web page in a particular language, it might use language tag 180 matching to select the content to which the style is applied. 182 A language-range matches a language-tag if, ignoring case, it exactly 183 equals the tag, or if it exactly equals a prefix of the tag such that the first 184 character following the prefix is "-". (That is, the language-range 185 "en-de" matches the language tag "en-DE-boont", but not the language 186 tag "en-Deva".) 188 The special range "*" matches any tag. A protocol which uses 189 language ranges may specify additional rules about the semantics of 190 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only 191 languages not matched by any other range within an "Accept-Language:" 192 header. 194 2.1.2 Lookup 196 Content lookup is used to select the single information item that 197 best matches the language range for a given request. In lookup, the 198 language range represents the most specific tag which is an 199 acceptable match and only the closest matching item is returned.
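As an illustration only, the two basic schemes can be sketched in Python. The names basic_match and lookup, and the dictionary of available content keyed by lower-cased tag, are assumptions made for this example and are not defined by this document. basic_match applies the prefix rule of Section 2.1.1. lookup applies the progressive truncation that this section goes on to describe; it also assumes, from the zh-Hant-CN-x-wadegile example that follows, that a single-character subtag such as "x" is removed together with the subtag that followed it.

```python
def basic_match(language_range, tag):
    """Prefix rule of Section 2.1.1, compared case-insensitively."""
    r, t = language_range.lower(), tag.lower()
    # "*" matches anything; otherwise the range must equal the tag or be
    # a prefix of it that ends exactly at a "-" boundary.
    return r == "*" or t == r or t.startswith(r + "-")


def lookup(language_range, available, default=None):
    """Return the content for the closest tag, truncating from the end.

    `available` is a hypothetical dict mapping lower-cased tags to content.
    """
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # Assumption: never leave a dangling singleton such as "x".
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default


content = {"zh-hant": "traditional", "zh": "generic"}
print(lookup("zh-Hant-CN-x-wadegile", content))
```

With the sample dictionary shown, the lookup tries zh-hant-cn-x-wadegile, then zh-hant-cn, then zh-hant, and stops there. Matching, by contrast, would return every item whose tag passes basic_match for the given range.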
201 For example, if an application inserts some dynamic content into a 202 web page, returning an empty string if there is no exact match is not 203 an option. Instead, the application "falls back". 205 When performing lookup, the language range is progressively truncated 206 from the end until a matching piece of content is located. For 207 example, starting with the range "zh-Hant-CN-x-wadegile", the lookup 208 would progressively search for content as shown below:
210 Range to match: zh-Hant-CN-x-wadegile
211 1. zh-Hant-CN-x-wadegile
212 2. zh-Hant-CN
213 3. zh-Hant
214 4. zh
215 5. (default content or the empty tag)
217 Figure 2: Default Fallback Pattern Example
219 This scheme allows some flexibility in finding content. When data 220 is not available at a specific level of tag granularity, or is 221 sparsely populated, it also typically provides better results than 222 falling back directly to the default language for the system or content. 224 2.2 Extended Language Range 226 Prefix matching using a Basic Language Range, as described above, is 227 not always the most appropriate way to access the information 228 contained in language tags when selecting or filtering content. Some 229 applications may wish to define a more granular matching scheme and 230 such a matching scheme requires the ability to specify the various 231 attributes of a language tag in the language range.
An extended 232 language range can be represented by the following ABNF:
233 extended-language-range = grandfathered / privateuse / range
234 range = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
235 [ "-" privateuse ] )
236 lang = ( 2*8ALPHA *[ "-" extlang ] ) / "*"
237 extlang = 3ALPHA / "*"
238 script = 4ALPHA / "*"
239 region = 2ALPHA / 3DIGIT / "*"
240 variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
241 privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
242 grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
243 alphanum = ( ALPHA / DIGIT )
244 In an extended language range, the identifier takes the form of a 245 series of subtags which must consist of well-formed subtags or the 246 special subtag "*". For example, the language range "en-*-US" 247 specifies a primary language of 'en', followed by any script subtag, 248 followed by the region subtag 'US'. 250 A field not present in the middle of an extended language range MAY 251 be treated as if the field contained a "*". For example, the range 252 "en-US" MAY be considered to be equivalent to the range "en-*-US". 254 There are several matching algorithms or schemes which may be applied 255 when matching extended language ranges to language tags. 257 2.2.1 Extended Range Matching 259 In extended range matching, the subtags in a language tag are 260 compared to the corresponding subtags in the extended language range. 261 A subtag is considered to match if it exactly matches the 262 corresponding subtag in the range or the range contains a subtag with 263 the value "*" (which matches all subtags, including the empty 264 subtag). Extended Range Matching is an extension of basic matching 265 (Section 2.1.1): the language range represents the least specific tag 266 which is an acceptable match. 268 By default all extensions and their subtags are ignored for extended 269 language range matching. 271 Private use subtags may be specified in the language range and MUST 272 NOT be ignored when matching.
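The following Python sketch shows one possible reading of extended range matching; it is not a normative algorithm. It assumes that subtags can be assigned to fields by their shape alone (per the ABNF above), that a "*" in the range occupies the next unfilled field in the order script, region, variant, that any field absent from the range behaves as "*", and that only the first variant is compared. Extended language subtags and grandfathered tags are not handled. The names parse_fields and extended_match are hypothetical.

```python
import re

# Field shapes taken from the extended-language-range ABNF (lower-cased input).
ORDER = [
    ("script", re.compile(r"[a-z]{4}")),
    ("region", re.compile(r"[a-z]{2}|[0-9]{3}")),
    ("variant", re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}")),
]


def parse_fields(tag):
    """Split a tag or range into named fields; absent fields default to '*'."""
    subtags = tag.lower().split("-")
    fields = {"language": subtags[0], "script": "*", "region": "*",
              "variant": "*", "private": "*"}
    slot = 0              # index in ORDER of the next field that may be filled
    in_extension = False  # True once an extension singleton has been seen
    for i, sub in enumerate(subtags[1:], start=1):
        if sub == "x":    # private use runs to the end of the tag
            fields["private"] = "-".join(subtags[i:])
            break
        if in_extension:
            continue      # extensions and their subtags are ignored
        if sub == "*":
            slot += 1     # assumption: a wildcard fills the next open field
        elif len(sub) == 1:
            in_extension = True
        else:
            for j in range(slot, len(ORDER)):
                name, shape = ORDER[j]
                if shape.fullmatch(sub):
                    fields[name] = sub
                    slot = j + 1
                    break
    return fields


def extended_match(language_range, tag):
    """True if every field of the range is '*' or equals the tag's field."""
    wanted, actual = parse_fields(language_range), parse_fields(tag)
    return all(v == "*" or v == actual[k] for k, v in wanted.items())
```

Under these assumptions, extended_match("zh-*-CN", "zh-Latn-CN-boont") is true and extended_match("zh-*-CN", "zh-Hant-TW") is false. A range that names a private use sequence, such as "zh-CN-x-wadegile", does not match the tag "zh-CN", since private use subtags in the range are not ignored.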
274 Subtags not specified, including those at the end of the language 275 range, are assigned the value "*". This makes each range into a 276 prefix much like that used in basic language range matching. For 277 example, the extended language range "zh-*-CN" matches all of the 278 following tags because the unspecified variant field is expanded to 279 "*": 281 zh-Hant-CN 283 zh-CN 285 zh-Hans-CN 287 zh-CN-x-wadegile 289 zh-Latn-CN-boont 291 2.2.2 Extended Range Lookup 293 In extended range lookup, the subtags in a language tag are compared 294 to the corresponding subtags in the extended language range. The 295 subtag is considered to match if it exactly matches the corresponding 296 subtag in the range or the range contains a subtag with the value "*" 297 (which matches all subtags, including the empty subtag). Extended 298 language range lookup is an extension of basic lookup 299 (Section 2.1.2): the language range represents the most specific tag 300 which will form an acceptable match. 302 Subtags not specified are assigned the value "*" prior to performing 303 tag matching. Unlike in extended range matching, however, fields at 304 the end of the range MUST NOT be expanded in this manner. For 305 example, "en-US" must not be considered to be the same as the range 306 "en-US-*". This allows ranges to be specific. The "*" wildcard MUST 307 be used at the end of the range to indicate that all tags with the 308 range as a prefix are allowable matches. That is, the range "zh-*" 309 matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh" 310 matches neither of those tags. 312 The wildcard "*" at the end of a range SHOULD be considered to match 313 any private use subtag sequences (making extended language range 314 lookup function exactly like extended range matching Section 2.2.1). 316 By default all extensions and their subtags SHOULD be ignored for 317 extended language range lookup. 
Private use subtags may be specified 318 in the language range and MUST NOT be ignored when performing lookup. 319 The wildcard "*" at the end of a range SHOULD be considered to match 320 any private use subtag sequences in addition to variants. 322 For example, the range "*-US" matches all of the following tags:
324 en-US
326 en-Latn-US
328 en-US-r-extends (extensions are ignored)
330 fr-US
332 For example, the range "en-*-US" matches _none_ of the following 333 tags:
335 fr-US
337 en (missing region US)
338 en-Latn (missing region US)
340 en-Latn-US-scouse (variant field is present)
342 For example, the range "en-*" matches all of the following tags:
344 en-Latn
346 en-Latn-US
348 en-Latn-US-scouse
350 en-US
352 en-scouse
354 It should be noted that the ability to be specific in extended range 355 lookup may make this matching scheme a more appropriate replacement 356 for basic matching than the extended range matching scheme. 358 2.2.3 Scored Matching 360 In the "scored matching" scheme, the extended language range and the 361 language tags are pre-normalized by mapping grandfathered and 362 obsolete tags into modern equivalents. 364 The language range and the language tags are normalized into 365 quadruples of the form (language, script, region, variant), where 366 extended language is considered part of language and x-private-codes 367 are considered part of the language if they are initial and part of 368 the variant if not initial. Missing components are set to "*". An 369 "*" pattern becomes the quadruple ("*", "*", "*", "*"). 371 Each language tag being matched or filtered is assigned a "quality 372 value" such that higher values indicate better matches and lower 373 values indicate worse ones. If the language matches, add 8 to the 374 quality value. If the script matches, add 4 to the quality value. 375 If the region matches, add 2 to the quality value. If the variant 376 matches, add 1 to the quality value.
Elements of the quadruples are 377 considered to match if they are the same or if one of them is "*". 379 A value of 15 is a perfect match; 0 is no match at all. Different 380 values may be more or less appropriate for different applications and 381 implementations should probably allow users to choose the most 382 appropriate selection value. 384 2.3 Meaning of Language Tags and Ranges 386 A language tag defines a language as spoken (or written, signed or 387 otherwise signaled) by human beings for communication of information 388 to other human beings. 390 If a language tag B contains language tag A as a prefix, then B is 391 typically "narrower" or "more specific" than A. For example, "zh- 392 Hant-TW" is more specific than "zh-Hant". 394 This relationship is not guaranteed in all cases: specifically, 395 languages that begin with the same sequence of subtags are NOT 396 guaranteed to be mutually intelligible, although they may be. 398 For example, the tag "az" shares a prefix with both "az-Latn" 399 (Azerbaijani written using the Latin script) and "az-Cyrl" 400 (Azerbaijani written using the Cyrillic script). A person fluent in 401 one script may not be able to read the other, even though the text 402 might be otherwise identical. Content tagged as "az" most probably 403 is written in just one script and thus might not be intelligible to a 404 reader familiar with the other script. 406 Variant subtags in particular seem to represent specific divisions in 407 mutual understanding, since they often encode dialects or other 408 idiosyncratic variations within a language. 410 The relationship between the language tag and the information it 411 relates to is defined by the standard describing the context in which 412 it appears. Accordingly, this section can only give possible 413 examples of its usage. 
415 o For a single information object, the associated language tags 416 might be interpreted as the set of languages that is required for 417 a complete comprehension of the complete object. Example: Plain 418 text documents.
420 o For an aggregation of information objects, the associated language 421 tags could be taken as the set of languages used inside components 422 of that aggregation. Examples: Document stores and libraries.
424 o For information objects whose purpose is to provide alternatives, 425 the associated language tags could be regarded as a hint that the 426 content is provided in several languages, and that one has to 427 inspect each of the alternatives in order to find its language or 428 languages. In this case, the presence of multiple tags might not 429 mean that one needs to be multi-lingual to get complete 430 understanding of the document. Example: MIME 431 multipart/alternative.
433 o In markup languages, such as HTML and XML, language information 434 can be added to each part of the document identified by the markup 435 structure (including the whole document itself). For example, one 436 could write <span lang="fr">C'est la vie.</span> inside a 437 Norwegian document; the Norwegian-speaking user could then access 438 a French-Norwegian dictionary to find out what the marked section 439 meant. If the user were listening to that document through a 440 speech synthesis interface, this information could be used to signal 441 the synthesizer to appropriately apply French text-to-speech 442 pronunciation rules to that span of text, instead of misapplying 443 the Norwegian rules.
445 2.4 Choosing Between Alternate Matching Schemes 447 Implementations MAY choose to implement different styles of matching 448 for different kinds of processing. For example, an implementation 449 could treat an absent script subtag as a "wildcard" field; thus 450 "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not 451 "az" (this is extended range lookup).
If one item is to be chosen, 452 the implementation could pick among those matches based on other 453 information, such as the most likely script used in the language/ 454 region in question or the script used by other content selected. 456 Because the primary language subtag cannot be absent in a language 457 tag, the 'UND' subtag may sometimes be used as a 'wildcard' in basic 458 matching. For example, in a query where you want to select all 459 language tags that contain 'Latn' as the script code and 'AZ' as the 460 region code, you could use the range "und-Latn-AZ". This requires an 461 implementation to examine the actual values of the subtags, though. 462 The matching schemes described elsewhere in this document do not 463 require implementations to examine the values supplied and, except 464 for scored matching, they do not require access to the Language 465 Subtag Registry nor the use of valid subtags in language tags or 466 ranges. This has great benefit for speed and simplicity of 467 implementation. 469 Implementations may also wish to use semantic information external to 470 the language tags when performing fallback. For example, the primary 471 language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal Norwegian) 472 might both be usefully matched to the more general subtag 'no' 473 (Norwegian). Or an application might infer that content labeled 474 "zh-CN" is more likely to match the range "zh-Hans" than equivalent 475 content labeled "zh-TW". 477 2.5 Considerations for Private Use Subtags 479 Private-use subtags require private agreement between the parties 480 that intend to use or exchange language tags that use them and great 481 caution should be used in employing them in content or protocols 482 intended for general use. Private-use subtags are simply useless for 483 information exchange without prior arrangement. 485 The value and semantic meaning of private-use tags and of the subtags 486 used within such a language tag are not defined.
Matching private 487 use tags using language ranges or extended language ranges may result 488 in unpredictable content being returned. 490 2.6 Length Considerations in Matching 492 Although there is no upper bound on the number of subtags in a 493 language tag and it is possible to envision quite long and complex 494 subtag sequences, in practice these are rare because of the various 495 considerations discussed in Section 2.1.1 of [1]. 497 A matching implementation MAY choose not to support the storage or 498 matching of language tags and ranges which exceed a specified length. 499 Any such limitation SHOULD be clearly documented, and such 500 documentation SHOULD include the disposition of any longer tags or 501 ranges (for example, whether an error value is generated or the 502 language tag is truncated). If truncation is permitted it must not 503 permit a subtag to be divided, since this changes the semantics of 504 the tag or range being matched and may result in false positives or 505 negatives. Implementations that restrict storage should consider 506 removing extensions before matching. A protocol that allows tags or 507 ranges to be truncated at an arbitrary limit, without giving any 508 indication of what that limit is, has the potential for causing harm 509 by changing the meaning of values in substantial ways. 511 In practice, tags and ranges are limited to a sequence of four 512 subtags, and thus a maximum length of 26 characters (excluding any 513 extensions or private use sequences). This is because subtags are 514 limited to a length of eight characters and the extlang, script, and 515 region subtags are additionally limited to even fewer characters. In 516 addition, the Language Subtag Registry provides guidance on the use 517 of subtags (via fields such as Suppress-Script and Recommended- 518 Prefix) which further limit useful combination of subtags in a 519 language tag or range. 521 Longer tags are possible. 
The longest practical tags (excluding 522 extensions) could have a length of up to 58 characters, as shown 523 below. Implementations MUST be able to handle matching tags of this 524 length. Support for tags and ranges of up to 64 characters is 525 RECOMMENDED. Implementations MAY support longer tags, including 526 matching extensive sets of private use or extension subtags. 528 Here is how the 58-character length of the longest practical tag 529 (excluding extensions) is derived:
531 language = 3
532 extlang1 = 4 (currently undefined)
533 extlang2 = 4 (unlikely)
534 script = 5
535 region = 4 (UN M.49)
536 variant = 9
537 variant = 9 (unlikely)
538 private use 1 = 11
539 private use 2 = 9
540 total = 58 characters
542 Figure 4: Derivation of the Longest Tag
544 3. IANA Considerations 546 This document presents no new or existing considerations for IANA. 548 4. Changes 550 This is the first version of this document. 552 The following changes were put into this document since draft-00: 554 Fixed text in the introduction that is no longer accurate. 555 Specifically, there no longer is a default matching algorithm. 556 (A.Phillips) 558 Fixed text in Section 2.1 which incorrectly discussed the default 559 fallback mechanism. (A.Phillips) 561 Minor changes to Section 2.3, in particular, the addition of the 562 'variant' paragraph and some tidying of the text. (A.Phillips) 564 Fixed a minor glitch in the ABNF caused by taking the output of 565 Bill Fenner's parser and not looking too closely at it (M. Patton) 567 Fixed some minor reference problems. (M.Patton) 569 Added Section 2.6 on length considerations in matching. 570 (R.Presuhn) 572 5.
Security Considerations

   The only security issue that has been raised with language tags since
   the publication of RFC 1766, which stated that "Security issues are
   believed to be irrelevant to this memo", is a concern with language
   ranges used in content negotiation - that they may be used to infer
   the nationality of the sender, and thus identify potential targets
   for surveillance.

   This is a special case of the general problem that anything you send
   is visible to the receiving party.  It is useful to be aware that
   such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application protocol.

   Although the specification of valid subtags for an extension MUST be
   available over the Internet, implementations SHOULD NOT mechanically
   depend on it being always accessible, to prevent denial-of-service
   attacks.

6.  Character Set Considerations

   The syntax in this document requires that language ranges use only
   the characters A-Z, a-z, 0-9, and HYPHEN-MINUS legal in language
   tags.  These characters are present in most character sets, so
   presentation of language tags should not have any character set
   issues.

   Rendering of characters based on the content of a language tag is not
   addressed in this memo.  Historically, some languages have relied on
   the use of specific character sets or other information in order to
   infer how a specific character should be rendered (notably this
   applies to language and culture specific variations of Han ideographs
   as used in Japanese, Chinese, and Korean).  When language tags are
   applied to spans of text, rendering engines may use that information
   in deciding which font to use in the absence of other information,
   particularly where languages with distinct writing traditions use the
   same characters.

7.
References

7.1  Normative References

   [1]   Phillips, A., Ed. and M. Davis, Ed., "Tags for the
         Identification of Languages (Internet-Draft)", February 2005,
         <http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-01.txt>.

   [2]   Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021
         and RFC 822", RFC 1327, May 1992.

   [3]   Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
         Extensions) Part One: Mechanisms for Specifying and Describing
         the Format of Internet Message Bodies", RFC 1521,
         September 1993.

   [4]   Hovey, R. and S. Bradner, "The Organizations Involved in the
         IETF Standards Process", BCP 11, RFC 2028, October 1996.

   [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
         Levels", BCP 14, RFC 2119, March 1997.

   [6]   Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word
         Extensions: Character Sets, Languages, and Continuations",
         RFC 2231, November 1997.

   [7]   Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
         Specifications: ABNF", RFC 2234, November 1997.

   [8]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
         Resource Identifiers (URI): Generic Syntax", RFC 2396,
         August 1998.

   [9]   Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
         Considerations Section in RFCs", BCP 26, RFC 2434,
         October 1998.

   [10]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
         HTTP/1.1", RFC 2616, June 1999.

   [11]  Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
         Understanding Concerning the Technical Work of the Internet
         Assigned Numbers Authority", RFC 2860, June 2000.

   [12]  Yergeau, F., "UTF-8, a transformation format of ISO 10646",
         STD 63, RFC 3629, November 2003.

7.2  Informative References

   [13]  International Organization for Standardization, "ISO
         639-1:2002, Codes for the representation of names of
         languages -- Part 1: Alpha-2 code", ISO Standard 639, 2002.

   [14]  International Organization for Standardization, "ISO 639-2:1998
         - Codes for the representation of names of languages -- Part 2:
         Alpha-3 code - edition 1", August 1988.

   [15]  ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
         representation of names of scripts", January 2004.

   [16]  International Organization for Standardization, "Codes for the
         representation of names of countries, 3rd edition",
         ISO Standard 3166, August 1988.

   [17]  Statistical Division, United Nations, "Standard Country or Area
         Codes for Statistical Use", UN Standard Country or Area Codes
         for Statistical Use, Revision 4 (United Nations publication,
         Sales No. 98.XVII.9), June 1999.

   [18]  Alvestrand, H., "Tags for the Identification of Languages",
         RFC 1766, March 1995.

   [19]  Alvestrand, H., "Tags for the Identification of Languages",
         BCP 47, RFC 3066, January 2001.

   [20]  Klyne, G. and C. Newman, "Date and Time on the Internet:
         Timestamps", RFC 3339, July 2002.

Authors' Addresses

   Addison Phillips (editor)
   Quest Software

   Email: addison dot phillips at quest dot com

   Mark Davis (editor)
   IBM

   Email: mark dot davis at ibm dot com

Appendix A.  Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to RFC 3066 and RFC 1766, the precursors of this
   document, made enormous contributions directly or indirectly to this
   document and are generally responsible for the success of language
   tags.

   The following people (in alphabetical order) contributed to this
   document or to RFCs 1766 and 3066:

   Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
   Nathaniel Borenstein, Eric Brunner, Sean M. Burke, Jeremy Carroll,
   John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin,
   Dave Crocker, Martin Duerst, Michael Everson, Doug Ewell, Ned Freed,
   Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpern,
   Elliotte Rusty Harold, Paul Hoffman, Richard Ishida, Olle Jarnefors,
   Kent Karlsson, John Klensin, Alain LaBonte, Eric Mader, Keith Moore,
   Chris Newman, Masataka Ohta, Michael S. Patton, Randy Presuhn, George
   Rhoten, Markus Scherer, Keld Jorn Simonsen, Thierry Sourbier, Otto
   Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha Wolf, Francois
   Yergeau and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.  Special thanks must go to Michael Everson,
   who has served as language tag reviewer for almost the complete
   period since the publication of RFC 1766.  Special thanks to Doug
   Ewell, for his production of the first complete subtag registry, and
   his work in producing a test parser for verifying language tags.

   For this particular document, John Cowan originated the scheme
   described in Section 2.2.3.  Mark Davis originated the scheme
   described in Section 2.1.2.

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.
   Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

   Copyright (C) The Internet Society (2005).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.

Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.