idnits 2.17.1
draft-ietf-ltru-matching-03.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5 on line 762.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 739.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 746.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 752.
** This document has an original RFC 3978 Section 5.4 Copyright Line,
instead of the newer IETF Trust Copyright according to RFC 4748.
** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
of the newer disclaimer which includes the IETF Trust according to RFC
4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
No issues found here.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
== Line 169 has weird spacing: '...schemes that ...'
== Line 170 has weird spacing: '...ing and looku...'
== Line 374 has weird spacing: '...age tag being...'
== Line 467 has weird spacing: '...ch that imple...'
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (June 28, 2005) is 6877 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
== Unused Reference: 'RFC1327' is defined on line 618, but no explicit
reference was found in the text
== Unused Reference: 'RFC1521' is defined on line 621, but no explicit
reference was found in the text
== Unused Reference: 'RFC2028' is defined on line 626, but no explicit
reference was found in the text
== Unused Reference: 'RFC2231' is defined on line 633, but no explicit
reference was found in the text
== Unused Reference: 'RFC2234' is defined on line 637, but no explicit
reference was found in the text
== Unused Reference: 'RFC2396' is defined on line 640, but no explicit
reference was found in the text
== Unused Reference: 'RFC2434' is defined on line 644, but no explicit
reference was found in the text
== Unused Reference: 'RFC2860' is defined on line 652, but no explicit
reference was found in the text
== Unused Reference: 'RFC3629' is defined on line 656, but no explicit
reference was found in the text
== Unused Reference: 'ISO639-1' is defined on line 661, but no explicit
reference was found in the text
== Unused Reference: 'ISO639-2' is defined on line 666, but no explicit
reference was found in the text
== Unused Reference: 'ISO15924' is defined on line 672, but no explicit
reference was found in the text
== Unused Reference: 'ISO3166' is defined on line 676, but no explicit
reference was found in the text
== Unused Reference: 'RFC3339' is defined on line 691, but no explicit
reference was found in the text
== Outdated reference: A later version (-14) exists of
draft-ietf-ltru-registry-07
** Obsolete normative reference: RFC 1327 (Obsoleted by RFC 2156)
** Obsolete normative reference: RFC 1521 (Obsoleted by RFC 2045, RFC 2046,
RFC 2047, RFC 2048, RFC 2049)
** Obsolete normative reference: RFC 2028 (Obsoleted by RFC 9281)
** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)
** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)
** Obsolete normative reference: RFC 2434 (Obsoleted by RFC 5226)
** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
RFC 7232, RFC 7233, RFC 7234, RFC 7235)
** Downref: Normative reference to an Informational RFC: RFC 2860
-- Obsolete informational reference (is this intentional?): RFC 1766
(Obsoleted by RFC 3066, RFC 3282)
-- Obsolete informational reference (is this intentional?): RFC 3066
(Obsoleted by RFC 4646, RFC 4647)
Summary: 11 errors (**), 0 flaws (~~), 22 warnings (==), 9 comments (--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group A. Phillips, Ed.
3 Internet-Draft Quest Software
4 Expires: December 30, 2005 M. Davis, Ed.
5 IBM
6 June 28, 2005
8 Matching Tags for the Identification of Languages
9 draft-ietf-ltru-matching-03
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on December 30, 2005.
36 Copyright Notice
38 Copyright (C) The Internet Society (2005).
40 Abstract
42 This document describes different mechanisms for comparing, matching,
43 and evaluating language tags. Possible algorithms for language
44 negotiation and content selection are described.
46 Table of Contents
48 1. Introduction
49 2. The Language Range
50 2.1 Basic Language Range
51 2.1.1 Matching
52 2.1.2 Lookup
53 2.2 Extended Language Range
54 2.2.1 Extended Range Matching
55 2.2.2 Extended Range Lookup
56 2.2.3 Scored Matching
57 2.3 Meaning of Language Tags and Ranges
58 2.4 Choosing Between Alternate Matching Schemes
59 2.5 Considerations for Private Use Subtags
60 2.6 Length Considerations in Matching
61 3. IANA Considerations
62 4. Changes
63 5. Security Considerations
64 6. Character Set Considerations
65 7. References
66 7.1 Normative References
67 7.2 Informative References
68 Authors' Addresses
69 A. Acknowledgements
70 Intellectual Property and Copyright Statements
72 1. Introduction
74 Human beings on our planet have, past and present, used a number of
75 languages. There are many reasons why one would want to identify the
76 language used when presenting or requesting information.
78 Information about a user's language preferences commonly needs to be
79 identified so that appropriate processing can be applied. For
80 example, the user's language preferences in a browser can be used to
81 select web pages appropriately. A choice of language preference can
82 also be used to select among tools (such as dictionaries) to assist
83 in the processing or understanding of content in different languages.
85 Given a set of language identifiers, such as those defined in
86 [ID.ietf-ltru-registry], various mechanisms can be envisioned for
87 performing language negotiation and tag matching. The suitability of
88 a particular mechanism to a particular application depends on the
89 needs of that application.
91 This document defines language ranges and syntax for specifying user
92 preferences in a request for language content. It also specifies
93 various schemes and mechanisms that can be used with language ranges
94 when matching or filtering content based on language tags.
96 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
97 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
98 document are to be interpreted as described in [RFC2119].
100 2. The Language Range
102 Language Tags are used to identify the language of some information
103 item or content. Applications that use language tags are often faced
104 with the problem of identifying sets of content that share certain
105 language attributes. For example, HTTP 1.1 [RFC2616] describes
106 language ranges in its discussion of the Accept-Language header
107 (Section 14.4), which is used for selecting content from servers
108 based on the language of that content.
110 When selecting content according to its language, it is useful to
111 have a mechanism for identifying sets of language tags that share
112 specific attributes. This allows users to select or filter content
113 based on specific requirements. Such an identifier is called a
114 "Language Range".
116 2.1 Basic Language Range
118 A basic language range (such as described in [RFC3066] and HTTP 1.1
119 [RFC2616]) is a set of languages whose tags all begin with the same
120 sequence of subtags. A basic language range can be represented by a
121 'language-range' tag, by using the definition from HTTP/1.1 [RFC2616]
122 :
123 language-range = language-tag / "*"
125 That is, a language-range has the same syntax as a language-tag or is
126 the single character "*". This definition of language-range implies
127 that there is a semantic relationship between tags that share the
128 same prefix, although such a relationship is not guaranteed.
130 In particular, the set of language tags that match a specific
131 language-range might not all be mutually intelligible. The use of a
132 prefix when matching tags to language ranges does not imply that
133 language tags are assigned to languages in such a way that it is
134 always true that if a user understands a language with a certain tag,
135 then this user will also understand all languages with tags for which
136 this tag is a prefix. The prefix rule simply allows the use of
137 prefix tags if this is the case.
139 When working with tags and ranges you SHOULD also note the following:
141 1. Private-use and Extension subtags are normally orthogonal to
142 language tag fallback. Implementations SHOULD ignore
143 unrecognized private-use and extension subtags when performing
144 language tag fallback. Since these subtags are always at the end
145 of the sequence of subtags, they don't normally interfere with
146 the use of prefixes for matching in the schemes described below.
148 2. Implementations that choose not to interpret one or more private-
149 use or extension subtags SHOULD NOT remove or modify these
150 extensions in content that they are processing. When a language
151 tag instance is to be used in a specific, known protocol, and is
152 not being passed through to other protocols, language tags MAY be
153 filtered to remove subtags and extensions that are not supported
154 by that protocol. Such filtering SHOULD be avoided, if possible,
155 since it removes information that might be relevant if services
156 on the other end of the protocol would make use of that
157 information.
159 3. Some applications of language tags might want or need to consider
160 extensions and private-use subtags when matching tags. If
161 extensions and private-use subtags are included in a matching or
162 filtering process that utilizes one of the schemes described
163 in this document, then the implementation SHOULD canonicalize the
164 language tags and/or ranges before performing the matching. Note
165 that language tag processors that claim to be "well-formed"
166 processors as defined in [ID.ietf-ltru-registry] generally fall
167 into this category.
169 There are two matching schemes that are commonly associated with
170 basic language ranges: matching and lookup.
172 2.1.1 Matching
174 Language tag matching is used to select all content that matches a
175 given prefix. In matching, the language range represents the least
176 specific tag which is an acceptable match and every piece of content
177 that matches is returned.
179 For example, if an application is applying a style to all content in
180 a web page in a particular language, it might use language tag
181 matching to select the content to which the style is applied.
183 A language-range matches a language-tag if it exactly equals the tag,
184 or if it exactly equals a prefix of the tag such that the first
185 character following the prefix is "-". (That is, the language-range
186 "de-de" matches the language tag "de-DE-1996", but not the language
187 tag "de-Deva".)
189 The special range "*" matches any tag. A protocol which uses
190 language ranges MAY specify additional rules about the semantics of
191 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
192 languages not matched by any other range within an "Accept-Language:"
193 header.
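
The prefix rule above can be sketched as follows. This is a minimal illustration, not a normative implementation: the function name basic_match is hypothetical, and the comparison is assumed to be case-insensitive, as the "de-de" example suggests. Protocol-specific semantics for "*" (such as those of HTTP/1.1) are not modeled.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Return True if the basic language range matches the tag.

    Assumption: comparison is case-insensitive.
    """
    if language_range == "*":
        # The special range "*" matches any tag.
        return True
    r = language_range.lower()
    t = tag.lower()
    # Exact match, or the range is a prefix followed directly by "-".
    return t == r or t.startswith(r + "-")


if __name__ == "__main__":
    for candidate in ("de-DE-1996", "de-Deva", "de-DE"):
        print(candidate, basic_match("de-de", candidate))
```

Requiring the "-" after the prefix is what keeps "de-de" from matching "de-Deva", even though the two strings share their first five characters.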
195 2.1.2 Lookup
197 Content lookup is used to select the single information item that
198 best matches the language range for a given request. In lookup, the
199 language range represents the most specific tag which is an
200 acceptable match and only the closest matching item is returned.
202 For example, if an application inserts some dynamic content into a
203 web page, returning an empty string if there is no exact match is not
204 an option. Instead, the application "falls back".
206 When performing lookup, the language range is progressively truncated
207 from the end until a matching piece of content is located. For
208 example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
209 would progressively search for content as shown below:
211 Range to match: zh-Hant-CN-x-wadegile
212 1. zh-Hant-CN-x-wadegile
213 2. zh-Hant-CN
214 3. zh-Hant
215 4. zh
216 5. (default content or the empty tag)
218 Figure 2: Default Fallback Pattern Example
220 This scheme allows some flexibility in finding content. When data is
221 sparsely populated or not available at a specific level of tag
222 granularity, it also typically provides better results than falling
223 back directly to the default language for the system or content.
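
The progressive truncation used in lookup might be sketched as shown here. The names fallback_chain and lookup are hypothetical, comparison is assumed to be case-insensitive, and the removal of a trailing single-character subtag (such as "x") along with the subtag that followed it is inferred from the fallback pattern in Figure 2 rather than stated as a rule.

```python
from typing import Iterable, List, Optional


def fallback_chain(language_range: str) -> List[str]:
    """List the ranges to try, from most to least specific."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # Assumption (from Figure 2): a dangling singleton such as "x"
        # is dropped together with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(language_range: str, available: Iterable[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the single closest available tag, or the default."""
    by_lower = {tag.lower(): tag for tag in available}
    for candidate in fallback_chain(language_range):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return default


if __name__ == "__main__":
    print(fallback_chain("zh-Hant-CN-x-wadegile"))
    print(lookup("zh-Hant-CN-x-wadegile", ["en", "zh", "zh-Hant"]))
```

Step 5 of the figure (default content or the empty tag) corresponds to the default argument, which the caller supplies.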
225 2.2 Extended Language Range
227 Prefix matching using a Basic Language Range, as described above, is
228 not always the most appropriate way to access the information
229 contained in language tags when selecting or filtering content. Some
230 applications might wish to define a more granular matching scheme and
231 such a matching scheme requires the ability to specify the various
232 attributes of a language tag in the language range. An extended
233 language range can be represented by the following ABNF:
235 extended-language-range = grandfathered / privateuse / range
236 range = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
237 [ "-" privateuse ] )
238 lang = 2*8ALPHA / extlang / "*"
239 extlang = 2*3ALPHA *2("-" 3ALPHA) ( "-" ( 3ALPHA / "*" ) )
240 script = 4ALPHA / "*"
241 region = 2ALPHA / 3DIGIT / "*"
242 variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
243 privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
244 grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
245 alphanum = ( ALPHA / DIGIT )
247 In an extended language range, the identifier takes the form of a
248 series of subtags which must consist of well-formed subtags or the
249 special subtag "*". For example, the language range "en-*-US"
250 specifies a primary language of 'en', followed by any script subtag,
251 followed by the region subtag 'US'.
253 A field not present in the middle of an extended language range MAY
254 be treated as if the field contained a "*". For example, the range
255 "en-US" MAY be considered to be equivalent to the range "en-*-US".
257 There are several matching algorithms or schemes which can be applied
258 when matching extended language ranges to language tags.
260 2.2.1 Extended Range Matching
262 In extended range matching, the subtags in a language tag are
263 compared to the corresponding subtags in the extended language range.
264 A subtag is considered to match if it exactly matches the
265 corresponding subtag in the range or the range contains a subtag with
266 the value "*" (which matches all subtags, including the empty
267 subtag). Extended Range Matching is an extension of basic matching
268 (Section 2.1.1): the language range represents the least specific tag
269 which is an acceptable match.
271 By default all extensions and their subtags are ignored for extended
272 language range matching.
274 Private use subtags MAY be specified in the language range and MUST
275 NOT be ignored when matching.
277 Subtags not specified, including those at the end of the language
278 range, are assigned the value "*". This makes each range into a
279 prefix much like that used in basic language range matching. For
280 example, the extended language range "zh-*-CN" matches all of the
281 following tags because the unspecified variant field is expanded to
282 "*":
284 zh-Hant-CN
286 zh-CN
288 zh-Hans-CN
290 zh-CN-x-wadegile
292 zh-Latn-CN-boont
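
One way to sketch extended range matching is to split both the range and the tag into fields and compare them field by field. Everything in this sketch is an assumption made for illustration: the names split_fields and extended_match, the shape-based recognition of script, region, and variant subtags, exact comparison of private use sequences, and the omission of extended language subtags and of grandfathered or private-use-only tags.

```python
import re

SCRIPT = re.compile(r"[a-z]{4}")
REGION = re.compile(r"[a-z]{2}|[0-9]{3}")


def split_fields(value):
    """Split a tag or range into (language, script, region, variants, private)."""
    subtags = value.lower().split("-")
    language, rest = subtags[0], subtags[1:]
    private = ""
    if "x" in rest:
        cut = rest.index("x")
        private = "-".join(rest[cut + 1:])
        rest = rest[:cut]
    # Extensions are ignored: drop everything from the first singleton on.
    for i, subtag in enumerate(rest):
        if len(subtag) == 1 and subtag != "*":
            rest = rest[:i]
            break
    pos = 0
    script = region = None
    if pos < len(rest) and (rest[pos] == "*" or SCRIPT.fullmatch(rest[pos])):
        script, pos = rest[pos], pos + 1
    if pos < len(rest) and (rest[pos] == "*" or REGION.fullmatch(rest[pos])):
        region, pos = rest[pos], pos + 1
    return language, script, region, rest[pos:], private


def extended_match(language_range, tag):
    """Unspecified range fields behave as "*", including trailing ones."""
    r_lang, r_script, r_region, r_variants, r_private = split_fields(language_range)
    t_lang, t_script, t_region, t_variants, t_private = split_fields(tag)
    for wanted, actual in ((r_lang, t_lang), (r_script, t_script), (r_region, t_region)):
        if wanted not in (None, "*") and wanted != actual:
            return False
    for i, wanted in enumerate(r_variants):
        if wanted != "*" and (i >= len(t_variants) or t_variants[i] != wanted):
            return False
    # Private use subtags named in the range are not ignored.
    return not r_private or r_private == t_private


if __name__ == "__main__":
    for tag in ("zh-Hant-CN", "zh-CN", "zh-CN-x-wadegile", "zh-Hant-TW"):
        print(tag, extended_match("zh-*-CN", tag))
```

Because the sketch classifies subtags by shape alone, it needs no access to a subtag registry, but it will misclassify anything outside the simple language, script, region, variant pattern.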
294 2.2.2 Extended Range Lookup
296 In extended range lookup, the subtags in a language tag are compared
297 to the corresponding subtags in the extended language range. The
298 subtag is considered to match if it exactly matches the corresponding
299 subtag in the range or the range contains a subtag with the value "*"
300 (which matches all subtags, including the empty subtag). Extended
301 language range lookup is an extension of basic lookup
302 (Section 2.1.2): the language range represents the most specific tag
303 which will form an acceptable match.
305 Subtags not specified are assigned the value "*" prior to performing
306 tag matching. Unlike in extended range matching, however, fields at
307 the end of the range MUST NOT be expanded in this manner. For
308 example, "en-US" MUST NOT be considered to be the same as the range
309 "en-US-*". This allows ranges to be specific. The "*" wildcard MUST
310 be used at the end of the range to indicate that all tags with the
311 range as a prefix are allowable matches. That is, the range "zh-*"
312 matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
313 matches neither of those tags.
315 The wildcard "*" at the end of a range SHOULD be considered to match
316 any private use subtag sequences in addition to variants (making extended
317 language range lookup function exactly like extended range matching,
318 Section 2.2.1).
319 By default all extensions and their subtags SHOULD be ignored for
320 extended language range lookup. Private use subtags MAY be specified
321 in the language range and MUST NOT be ignored when performing lookup.
325 For example, the range "*-US" matches all of the following tags:
327 en-US
329 en-Latn-US
330 en-US-r-extends (extensions are ignored)
332 fr-US
334 For example, the range "en-*-US" matches _none_ of the following
335 tags:
337 fr-US
339 en (missing region US)
341 en-Latn (missing region US)
343 en-Latn-US-scouse (variant field is present)
345 For example, the range "en-*" matches all of the following tags:
347 en-Latn
349 en-Latn-US
351 en-Latn-US-scouse
353 en-US
355 en-scouse
357 Note that the ability to be specific in extended range lookup can
358 make this matching scheme a more appropriate replacement for basic
359 matching than the extended range matching scheme.
361 2.2.3 Scored Matching
363 In the "scored matching" scheme, the extended language range and the
364 language tags are pre-normalized by mapping grandfathered and
365 obsolete tags into modern equivalents.
367 The language range and the language tags are normalized into
368 quadruples of the form (language, script, country, variant), where
369 extended language is considered part of language and x-private-codes
370 are considered part of the language if they are initial and part of
371 the variant if not initial. Missing components are set to "*". An
372 "*" pattern becomes the quadruple ("*", "*", "*", "*").
374 Each language tag being matched or filtered is assigned a "quality
375 value" such that higher values indicate better matches and lower
376 values indicate worse ones. If the language matches, add 8 to the
377 quality value. If the script matches, add 4 to the quality value.
379 If the region matches, add 2 to the quality value. If the variant
380 matches, add 1 to the quality value. Elements of the quadruples are
381 considered to match if they are the same or if one of them is "*".
383 A value of 15 is a perfect match; 0 is no match at all. Different
384 values could be more or less appropriate for different applications
385 and implementations SHOULD probably allow users to choose the most
386 appropriate selection value.
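
The scoring rule can be illustrated with a short sketch. It assumes the range and the tag have already been normalized into (language, script, region, variant) quadruples with "*" for missing components; the normalization step itself is not shown, and the names WEIGHTS, element_matches, and quality are hypothetical.

```python
# Weights for (language, script, region, variant), as given in the text.
WEIGHTS = (8, 4, 2, 1)


def element_matches(a: str, b: str) -> bool:
    """Elements match if they are the same or if either one is "*"."""
    return a == "*" or b == "*" or a.lower() == b.lower()


def quality(range_quad, tag_quad) -> int:
    """Sum the weights of the matching elements (15 = perfect, 0 = none)."""
    return sum(
        weight
        for weight, r, t in zip(WEIGHTS, range_quad, tag_quad)
        if element_matches(r, t)
    )


if __name__ == "__main__":
    wanted = ("en", "*", "US", "*")
    for tag_quad in (("en", "Latn", "US", "*"),
                     ("en", "Latn", "GB", "*"),
                     ("fr", "Latn", "CA", "*")):
        print(tag_quad, quality(wanted, tag_quad))
```

Note that a tag whose language differs from the range can still receive a non-zero value when other elements are wildcards, which is one reason an application might apply a minimum selection value.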
388 2.3 Meaning of Language Tags and Ranges
390 A language tag defines a language as spoken (or written, signed or
391 otherwise signaled) by human beings for communication of information
392 to other human beings.
394 If a language tag B contains language tag A as a prefix, then B is
395 typically "narrower" or "more specific" than A. For example, "zh-
396 Hant-TW" is more specific than "zh-Hant".
398 This relationship is not guaranteed in all cases: specifically,
399 languages that begin with the same sequence of subtags are NOT
400 guaranteed to be mutually intelligible, although they might be.
402 For example, the tag "az" shares a prefix with both "az-Latn"
403 (Azerbaijani written using the Latin script) and "az-Cyrl"
404 (Azerbaijani written using the Cyrillic script). A person fluent in
405 one script might not be able to read the other, even though the text
406 might be otherwise identical. Content tagged as "az" most probably
407 is written in just one script and thus might not be intelligible to a
408 reader familiar with the other script.
410 Variant subtags in particular seem to represent specific divisions in
411 mutual understanding, since they often encode dialects or other
412 idiosyncratic variations within a language.
414 The relationship between the language tag and the information it
415 relates to is defined by the standard describing the context in which
416 it appears. Accordingly, this section can only give possible
417 examples of its usage.
419 o For a single information object, the associated language tags
420 might be interpreted as the set of languages that are necessary
421 for a complete comprehension of the complete object. Example:
422 Plain text documents.
424 o For an aggregation of information objects, the associated language
425 tags could be taken as the set of languages used inside components
426 of that aggregation. Examples: Document stores and libraries.
428 o For information objects whose purpose is to provide alternatives,
429 the associated language tags could be regarded as a hint that the
430 content is provided in several languages, and that one has to
431 inspect each of the alternatives in order to find its language or
432 languages. In this case, the presence of multiple tags might not
433 mean that one needs to be multi-lingual to get complete
434 understanding of the document. Example: MIME multipart/
435 alternative.
437 o In markup languages, such as HTML and XML, language information
438 can be added to each part of the document identified by the markup
439 structure (including the whole document itself). For example, one
440 could write <span lang="fr">C'est la vie.</span> inside a
441 Norwegian document; the Norwegian-speaking user could then access
442 a French-Norwegian dictionary to find out what the marked section
443 meant. If the user were listening to that document through a
444 speech synthesis interface, this information could be used to signal
445 the synthesizer to appropriately apply French text-to-speech
446 pronunciation rules to that span of text, instead of misapplying
447 the Norwegian rules.
449 2.4 Choosing Between Alternate Matching Schemes
451 Implementations MAY choose to implement different styles of matching
452 for different kinds of processing. For example, an implementation
453 could treat an absent script subtag as a "wildcard" field; thus
454 "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not
455 "az" (this is extended range lookup). If one item is to be chosen,
456 the implementation could pick among those matches based on other
457 information, such as the most likely script used in the language/
458 region in question or the script used by other content selected.
460 Because the primary language subtag cannot be absent in a language
461 tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic
462 matching. For example, in a query where you want to select all
463 language tags that contain 'Latn' as the script code and 'AZ' as the
464 region code, you could use the range "und-Latn-AZ". This requires an
465 implementation to examine the actual values of the subtags, though.
466 The matching schemes described elsewhere in this document are
467 designed such that implementations do not have to examine the values
468 or subtags supplied and, except for scored matching, they do not need
469 access to the Language Subtag Registry nor the use of valid subtags
470 in language tags or ranges. This has great benefit for speed and
471 simplicity of implementation.
473 Implementations might also wish to use semantic information external
474 to the language tags when performing fallback. For example, the
475 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
476 Norwegian) might both be usefully matched to the more general subtag
477 'no' (Norwegian). Or an application might infer that content labeled
478 "zh-CN" is more likely to match the range "zh-Hans" than equivalent
479 content labeled "zh-TW".
481 2.5 Considerations for Private Use Subtags
483 Private-use subtags require private agreement between the parties
484 that intend to use or exchange language tags that use them and great
485 caution SHOULD be used in employing them in content or protocols
486 intended for general use. Private-use subtags are simply useless for
487 information exchange without prior arrangement.
489 The value and semantic meaning of private-use tags and of the subtags
490 used within such a language tag are not defined. Matching private
491 use tags using language ranges or extended language ranges can result
492 in unpredictable content being returned.
494 2.6 Length Considerations in Matching
496 [RFC3066] did not provide an upper limit on the size of language tags
497 or ranges. RFC 3066 did define the semantics of particular subtags
498 in such a way that most language tags or ranges consisted of language
499 and region subtags with a combined total length of up to six
500 characters. Larger tags and ranges (in terms of both subtags and
501 characters) did exist, however.
503 [ID.ietf-ltru-registry] also does not impose a fixed upper limit on
504 the number of subtags in a language tag or range (and thus an upper
505 bound on the size of either). The syntax in that document suggests
506 that, depending on the specific language or range of languages, more
507 subtags (and thus characters) are sometimes necessary as a result.
508 Length considerations and their impact on the selection and
509 processing of tags are described in Section 2.1.1 of that document.
511 A matching implementation MAY choose to limit the length of the
512 language tags or ranges used in matching. Any such limitation SHOULD
513 be clearly documented, and such documentation SHOULD include the
514 disposition of any longer tags or ranges (for example, whether an
515 error value is generated or the language tag or range is truncated).
516 If truncation is permitted it MUST NOT permit a subtag to be divided,
517 since this changes the semantics of the subtag being matched and can
518 result in false positives or negatives.
520 Implementations that restrict storage SHOULD consider the impact of
521 tag or range truncation on the resulting matches. For example,
522 removing the "*" from the end of an extended language range (see
523 Section 2.2) can greatly modify the set of returned matches. A
524 protocol that allows tags or ranges to be truncated at an arbitrary
525 limit, without giving any indication of what that limit is, has the
526 potential for causing harm by changing the meaning of values in
527 substantial ways.
529 In practice, most tags do not require additional subtags or
530 substantially more characters. Additional subtags sometimes add
531 useful distinguishing information, but extraneous subtags interfere
532 with the meaning, understanding, and especially matching of language
533 tags. Since language tags or ranges MAY be truncated by an
534 application or protocol that limits storage, when choosing language
535 tags or ranges users and applications SHOULD avoid adding subtags
536 that add no distinguishing value. In particular, users and
537 implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
538 fields in the registry (defined in Section 3.6 of [ID.ietf-ltru-
539 registry]): these fields provide guidance on when specific additional
540 subtags SHOULD (and SHOULD NOT) be used.
542 Implementations MUST support a limit of at least 33 characters. This
543 limit includes at least one subtag of each non-extension, non-private
544 use type. When choosing a buffer limit, a length of at least 42
545 characters is strongly RECOMMENDED.
547 The practical limit on tags or ranges derived solely from registered
548 values is 42 characters. Implementations MUST be able to handle tags
549 and ranges of this length. Support for tags and ranges of at least
550 62 characters in length is RECOMMENDED. Implementations MAY support
551 longer values, including matching extensive sets of private use or
552 extension subtags.
554 Applications or protocols which have to truncate a tag MUST do so by
555 progressively removing subtags along with their preceding "-" from
556 the right side of the language tag until the tag is short enough for
557 the given buffer. If the resulting tag ends with a single-character
558 subtag, that subtag and its preceding "-" MUST also be removed. For
559 example:
561 Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
562 1. zh-Hant-CN-variant1-a-extend1-x-wadegile
563 2. zh-Hant-CN-variant1-a-extend1
564 3. zh-Hant-CN-variant1
565 4. zh-Hant-CN
566 5. zh-Hant
567 6. zh
569 Figure 4: Example of Tag Truncation
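The truncation rule above can be sketched in Python. This is a minimal illustration rather than a normative implementation: the function name truncate_tag and the max_length parameter are hypothetical, and the choice to return the primary subtag unchanged when even it exceeds the buffer is an assumption, since the text does not cover that case.

```python
def truncate_tag(tag: str, max_length: int) -> str:
    """Shorten a language tag to fit max_length characters.

    Subtags are removed from the right, together with their preceding "-".
    If the result would end in a single-character subtag, that subtag is
    removed as well.
    """
    subtags = tag.split("-")

    # Keep dropping the rightmost subtag until the joined tag fits.
    # Assumption: the primary subtag is never removed, even if it alone
    # is longer than max_length.
    while len(subtags) > 1 and len("-".join(subtags)) > max_length:
        subtags.pop()
        # A trailing single-character subtag must also be removed.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()

    return "-".join(subtags)


TAG = "zh-Hant-CN-variant1-a-extend1-x-wadegile-private1"

# With a 30-character buffer, "private1", "wadegile", and the dangling "x"
# are all removed, which corresponds to step 2 of Figure 4.
print(truncate_tag(TAG, 30))  # zh-Hant-CN-variant1-a-extend1
```

A tag that already fits in the buffer is returned unchanged, so the function can be applied unconditionally before a value is stored.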
571 3. IANA Considerations
573 This document presents no new or existing considerations for IANA.
575 4. Changes
577 This is the first version of this document.
579 The following changes were put into this document since draft-02:
581 Turned on symrefs and replaced all reference IDs to make them
582 readable (F.Ellermann)
584 Removed all external references from the abstract (R.Presuhn)
586 5. Security Considerations
588 Language ranges used in content negotiation might be used to infer
589 the nationality of the sender, and thus identify potential targets
590 for surveillance. In addition, unique or highly unusual language
591 ranges or combinations of language ranges might be used to track
592 specific individuals' activities.
594 This is a special case of the general problem that anything you send
595 is visible to the receiving party. It is useful to be aware that
596 such concerns can exist in some cases.
598 The evaluation of the exact magnitude of the threat, and any possible
599 countermeasures, is left to each application protocol.
601 6. Character Set Considerations
603 The syntax of language tags and language ranges permits only the
604 characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D). These characters
605 are present in most character sets, so presentation of language tags
606 should not present any character set issues.
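As an illustration, a check against this character repertoire might look like the following Python sketch. The function name uses_only_tag_characters is hypothetical. The pattern tests only the characters listed above and is not a syntax validator; it deliberately excludes the "*" wildcard that appears in extended language ranges, because the paragraph above does not list it.

```python
import re

# Assumption: this mirrors only the repertoire named in the text:
# A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D).
_TAG_CHARACTERS = re.compile(r"[A-Za-z0-9-]+")


def uses_only_tag_characters(value: str) -> bool:
    """Return True if value is non-empty and uses only the listed characters."""
    return _TAG_CHARACTERS.fullmatch(value) is not None
```

A value such as "fr_CA" is rejected, because the underscore is outside the listed repertoire.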
608 7. References
610 7.1 Normative References
612 [ID.ietf-ltru-registry]
613 Phillips, A., Ed. and M. Davis, Ed., "Tags for the
614 Identification of Languages (Internet-Draft)", June 2005.
618 [RFC1327] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO
619 10021 and RFC 822", RFC 1327, May 1992.
621 [RFC1521] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet
622 Mail Extensions) Part One: Mechanisms for Specifying and
623 Describing the Format of Internet Message Bodies",
624 RFC 1521, September 1993.
626 [RFC2028] Hovey, R. and S. Bradner, "The Organizations Involved in
627 the IETF Standards Process", BCP 11, RFC 2028,
628 October 1996.
630 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
631 Requirement Levels", BCP 14, RFC 2119, March 1997.
633 [RFC2231] Freed, N. and K. Moore, "MIME Parameter Value and Encoded
634 Word Extensions: Character Sets, Languages, and
635 Continuations", RFC 2231, November 1997.
637 [RFC2234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
638 Specifications: ABNF", RFC 2234, November 1997.
640 [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
641 Resource Identifiers (URI): Generic Syntax", RFC 2396,
642 August 1998.
644 [RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an
645 IANA Considerations Section in RFCs", BCP 26, RFC 2434,
646 October 1998.
648 [RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
649 Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
650 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
652 [RFC2860] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
653 Understanding Concerning the Technical Work of the
654 Internet Assigned Numbers Authority", RFC 2860, June 2000.
656 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
657 10646", STD 63, RFC 3629, November 2003.
659 7.2 Informative References
661 [ISO639-1]
662 International Organization for Standardization, "ISO 639-
663 1:2002, Codes for the representation of names of languages
664 -- Part 1: Alpha-2 code", ISO Standard 639, 2002.
666 [ISO639-2]
667 International Organization for Standardization, "ISO 639-
668 2:1998 - Codes for the representation of names of
669 languages -- Part 2: Alpha-3 code - edition 1",
670 August 1988.
672 [ISO15924]
673 ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
674 representation of names of scripts", January 2004.
676 [ISO3166] International Organization for Standardization, "Codes for
677 the representation of names of countries, 3rd edition",
678 ISO Standard 3166, August 1988.
680 [UN_M49] Statistical Division, United Nations, "Standard Country or
681 Area Codes for Statistical Use", UN Standard Country or
682 Area Codes for Statistical Use, Revision 4 (United Nations
683 publication, Sales No. 98.XVII.9), June 1999.
685 [RFC1766] Alvestrand, H., "Tags for the Identification of
686 Languages", RFC 1766, March 1995.
688 [RFC3066] Alvestrand, H., "Tags for the Identification of
689 Languages", BCP 47, RFC 3066, January 2001.
691 [RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet:
692 Timestamps", RFC 3339, July 2002.
694 Authors' Addresses
696 Addison Phillips (editor)
697 Quest Software
699 Email: addison dot phillips at quest dot com
700 Mark Davis (editor)
701 IBM
703 Email: mark dot davis at ibm dot com
705 Appendix A. Acknowledgements
707 Any list of contributors is bound to be incomplete; please regard the
708 following as only a selection from the group of people who have
709 contributed to make this document what it is today.
711 The contributors to [ID.ietf-ltru-registry], [RFC3066] and [RFC1766],
712 each of which is a precursor to this document, made enormous
713 contributions directly or indirectly to this document and are
714 generally responsible for the success of language tags.
716 The following people (in alphabetical order by family name)
717 contributed to this document:
719 Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Ira
720 McDonald, M. Patton, Randy Presuhn and many, many others.
722 Very special thanks must go to Harald Tveit Alvestrand, who
723 originated RFCs 1766 and 3066, and without whom this document would
724 not have been possible.
726 For this particular document, John Cowan originated the scheme
727 described in Section 2.2.3. Mark Davis originated the scheme
728 described in Section 2.1.2.
730 Intellectual Property Statement
732 The IETF takes no position regarding the validity or scope of any
733 Intellectual Property Rights or other rights that might be claimed to
734 pertain to the implementation or use of the technology described in
735 this document or the extent to which any license under such rights
736 might or might not be available; nor does it represent that it has
737 made any independent effort to identify any such rights. Information
738 on the procedures with respect to rights in RFC documents can be
739 found in BCP 78 and BCP 79.
741 Copies of IPR disclosures made to the IETF Secretariat and any
742 assurances of licenses to be made available, or the result of an
743 attempt made to obtain a general license or permission for the use of
744 such proprietary rights by implementers or users of this
745 specification can be obtained from the IETF on-line IPR repository at
746 http://www.ietf.org/ipr.
748 The IETF invites any interested party to bring to its attention any
749 copyrights, patents or patent applications, or other proprietary
750 rights that may cover technology that may be required to implement
751 this standard. Please address the information to the IETF at
752 ietf-ipr@ietf.org.
754 Disclaimer of Validity
756 This document and the information contained herein are provided on an
757 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
758 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
759 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
760 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
761 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
762 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
764 Copyright Statement
766 Copyright (C) The Internet Society (2005). This document is subject
767 to the rights, licenses and restrictions contained in BCP 78, and
768 except as set forth therein, the authors retain all their rights.
770 Acknowledgment
772 Funding for the RFC Editor function is currently provided by the
773 Internet Society.