idnits 2.17.1
draft-ietf-ltru-matching-02.txt:
Checking boilerplate required by RFC 5378 and the IETF Trust (see
https://trustee.ietf.org/license-info):
----------------------------------------------------------------------------
** It looks like you're using RFC 3978 boilerplate. You should update this
to the boilerplate described in the IETF Trust License Policy document
(see https://trustee.ietf.org/license-info), which is required now.
-- Found old boilerplate from RFC 3978, Section 5.1 on line 16.
-- Found old boilerplate from RFC 3978, Section 5.5 on line 805.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 782.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 789.
-- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 795.
** This document has an original RFC 3978 Section 5.4 Copyright Line,
instead of the newer IETF Trust Copyright according to RFC 4748.
** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
of the newer disclaimer which includes the IETF Trust according to RFC
4748.
Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
----------------------------------------------------------------------------
== No 'Intended status' indicated for this document; assuming Proposed
Standard
Checking nits according to https://www.ietf.org/id-info/checklist :
----------------------------------------------------------------------------
** The abstract seems to contain references ([RFC3066], [19], [1]), which
it shouldn't. Please replace those with straight textual mentions of the
documents in question.
Miscellaneous warnings:
----------------------------------------------------------------------------
== The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
match the current year
== Line 169 has weird spacing: '...schemes that ...'
== Line 170 has weird spacing: '...ing and looku...'
== Line 374 has weird spacing: '...age tag being...'
== Line 467 has weird spacing: '...ch that imple...'
== The document seems to lack the recommended RFC 2119 boilerplate, even if
it appears to use RFC 2119 keywords.
(The document does seem to have the reference to RFC 2119 which the
ID-Checklist requires).
-- The document seems to lack a disclaimer for pre-RFC5378 work, but may
have content which was first submitted before 10 November 2008. If you
have contacted all the original authors and they are all willing to grant
the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
this comment. If not, you may need to add the pre-RFC5378 disclaimer.
(See the Legal Provisions document at
https://trustee.ietf.org/license-info for more information.)
-- The document date (June 10, 2005) is 6894 days in the past. Is this
intentional?
Checking references for intended status: Proposed Standard
----------------------------------------------------------------------------
(See RFCs 3967 and 4897 for information about using normative references
to lower-maturity documents in RFCs)
-- Looks like a reference, but probably isn't: 'RFC 3066' on line 46
== Unused Reference: '2' is defined on line 652, but no explicit reference
was found in the text
== Unused Reference: '3' is defined on line 655, but no explicit reference
was found in the text
== Unused Reference: '4' is defined on line 660, but no explicit reference
was found in the text
== Unused Reference: '6' is defined on line 666, but no explicit reference
was found in the text
== Unused Reference: '7' is defined on line 670, but no explicit reference
was found in the text
== Unused Reference: '8' is defined on line 673, but no explicit reference
was found in the text
== Unused Reference: '9' is defined on line 677, but no explicit reference
was found in the text
== Unused Reference: '11' is defined on line 685, but no explicit reference
was found in the text
== Unused Reference: '12' is defined on line 689, but no explicit reference
was found in the text
== Unused Reference: '13' is defined on line 694, but no explicit reference
was found in the text
== Unused Reference: '14' is defined on line 698, but no explicit reference
was found in the text
== Unused Reference: '15' is defined on line 702, but no explicit reference
was found in the text
== Unused Reference: '16' is defined on line 705, but no explicit reference
was found in the text
== Unused Reference: '17' is defined on line 709, but no explicit reference
was found in the text
== Unused Reference: '18' is defined on line 714, but no explicit reference
was found in the text
== Unused Reference: '20' is defined on line 720, but no explicit reference
was found in the text
== Outdated reference: A later version (-14) exists of
draft-ietf-ltru-registry-03
** Obsolete normative reference: RFC 1327 (ref. '2') (Obsoleted by RFC 2156)
** Obsolete normative reference: RFC 1521 (ref. '3') (Obsoleted by RFC
2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049)
** Obsolete normative reference: RFC 2028 (ref. '4') (Obsoleted by RFC 9281)
** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234)
** Obsolete normative reference: RFC 2396 (ref. '8') (Obsoleted by RFC 3986)
** Obsolete normative reference: RFC 2434 (ref. '9') (Obsoleted by RFC 5226)
** Obsolete normative reference: RFC 2616 (ref. '10') (Obsoleted by RFC
7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)
** Downref: Normative reference to an Informational RFC: RFC 2860 (ref.
'11')
-- Obsolete informational reference (is this intentional?): RFC 1766 (ref.
'18') (Obsoleted by RFC 3066, RFC 3282)
-- Obsolete informational reference (is this intentional?): RFC 3066 (ref.
'19') (Obsoleted by RFC 4646, RFC 4647)
Summary: 12 errors (**), 0 flaws (~~), 24 warnings (==), 10 comments
(--).
Run idnits with the --verbose option for more detailed information about
the items above.
--------------------------------------------------------------------------------
2 Network Working Group A. Phillips, Ed.
3 Internet-Draft Quest Software
4 Expires: December 12, 2005 M. Davis, Ed.
5 IBM
6 June 10, 2005
8 Matching Language Identifiers
9 draft-ietf-ltru-matching-02
11 Status of this Memo
13 By submitting this Internet-Draft, each author represents that any
14 applicable patent or other IPR claims of which he or she is aware
15 have been or will be disclosed, and any of which he or she becomes
16 aware will be disclosed, in accordance with Section 6 of BCP 79.
18 Internet-Drafts are working documents of the Internet Engineering
19 Task Force (IETF), its areas, and its working groups. Note that
20 other groups may also distribute working documents as Internet-
21 Drafts.
23 Internet-Drafts are draft documents valid for a maximum of six months
24 and may be updated, replaced, or obsoleted by other documents at any
25 time. It is inappropriate to use Internet-Drafts as reference
26 material or to cite them other than as "work in progress."
28 The list of current Internet-Drafts can be accessed at
29 http://www.ietf.org/ietf/1id-abstracts.txt.
31 The list of Internet-Draft Shadow Directories can be accessed at
32 http://www.ietf.org/shadow.html.
34 This Internet-Draft will expire on December 12, 2005.
36 Copyright Notice
38 Copyright (C) The Internet Society (2005).
40 Abstract
42 This document describes different mechanisms for comparing and
43 matching the tags for the identification of languages defined by [RFC
44 3066bis] [1]. Possible algorithms for language negotiation and
45 content selection are described. This document obsoletes portions of
46 [RFC 3066] [19].
48 Table of Contents
50 1. Introduction
51 2. The Language Range
52 2.1 Basic Language Range
53 2.1.1 Matching
54 2.1.2 Lookup
55 2.2 Extended Language Range
56 2.2.1 Extended Range Matching
57 2.2.2 Extended Range Lookup
58 2.2.3 Scored Matching
59 2.3 Meaning of Language Tags and Ranges
60 2.4 Choosing Between Alternate Matching Schemes
61 2.5 Considerations for Private Use Subtags
62 2.6 Length Considerations in Matching
63 3. IANA Considerations
64 4. Changes
65 5. Security Considerations
66 6. Character Set Considerations
67 7. References
68 7.1 Normative References
69 7.2 Informative References
70 Authors' Addresses
71 A. Acknowledgements
72 Intellectual Property and Copyright Statements
74 1. Introduction
76 Human beings on our planet have, past and present, used a number of
77 languages. There are many reasons why one would want to identify the
78 language used when presenting or requesting information.
80 Information about a user's language preferences commonly needs to be
81 identified so that appropriate processing can be applied. For
82 example, the user's language preferences in a browser can be used to
83 select web pages appropriately. A choice of language preference can
84 also be used to select among tools (such as dictionaries) to assist
85 in the processing or understanding of content in different languages.
87 Given a set of language identifiers, such as those defined in
88 RFC3066bis [1], various mechanisms can be envisioned for performing
89 language negotiation and tag matching. The suitability of a
90 particular mechanism to a particular application depends on the needs
91 of that application.
93 This document defines language ranges and syntax for specifying user
94 preferences in a request for language content. It also specifies
95 various schemes and mechanisms that can be used with language ranges
96 when matching or filtering content based on language tags.
98 The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
99 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
100 document are to be interpreted as described in RFC 2119 [5].
102 2. The Language Range
104 Language Tags are used to identify the language of some information
105 item or content. Applications that use language tags are often faced
106 with the problem of identifying sets of content that share certain
107 language attributes. For example, HTTP 1.1 [10] describes language
108 ranges in its discussion of the Accept-Language header (Section
109 14.4), which is used for selecting content from servers based on the
110 language of that content.
112 When selecting content according to its language, it is useful to
113 have a mechanism for identifying sets of language tags that share
114 specific attributes. This allows users to select or filter content
115 based on specific requirements. Such an identifier is called a
116 "Language Range".
118 2.1 Basic Language Range
120 A basic language range (such as described in RFC 3066 [19] and HTTP
121 1.1 [10]) is a set of languages whose tags all begin with the same
122 sequence of subtags. A basic language range can be represented by a
123 'language-range' tag, by using the definition from HTTP/1.1 [10] :
124 language-range = language-tag / "*"
126 That is, a language-range has the same syntax as a language-tag or is
127 the single character "*". This definition of language-range implies
128 that there is a semantic relationship between tags that share the
129 same prefix.
131 However, the set of language tags that match a specific
132 language-range might not all be mutually intelligible. The use of a
133 prefix when matching tags to language ranges does not imply that
134 language tags are assigned to languages in such a way that it is
135 always true that if a user understands a language with a certain tag,
136 then this user will also understand all languages with tags for which
137 this tag is a prefix. The prefix rule simply allows the use of
138 prefix tags if this is the case.
140 When working with tags and ranges you SHOULD also note the following:
142 1. Private-use and Extension subtags are normally orthogonal to
143 language tag fallback. Implementations SHOULD ignore
144 unrecognized private-use and extension subtags when performing
145 language tag fallback. Since these subtags are always at the end
146 of the sequence of subtags, they don't normally interfere with
147 the use of prefixes for matching in the schemes described below.
149 2. Implementations that choose not to interpret one or more private-
150 use or extension subtags SHOULD NOT remove or modify these
151 extensions in content that they are processing. When a language
152 tag instance is to be used in a specific, known protocol, and is
153 not being passed through to other protocols, language tags MAY be
154 filtered to remove subtags and extensions that are not supported
155 by that protocol. Such filtering SHOULD be avoided, if possible,
156 since it removes information that might be relevant if services
157 on the other end of the protocol would make use of that
158 information.
160 3. Some applications of language tags might want or need to consider
161 extensions and private-use subtags when matching tags. If
162 extensions and private-use subtags are included in a matching or
163 filtering process that utilizes one of the schemes described
164 in this document, then the implementation SHOULD canonicalize the
165 language tags and/or ranges before performing the matching. Note
166 that language tag processors that claim to be "well-formed"
167 processors as defined in [1] generally fall into this category.
169 There are two matching schemes that are commonly associated with
170 basic language ranges: matching and lookup.
172 2.1.1 Matching
174 Language tag matching is used to select all content that matches a
175 given prefix. In matching, the language range represents the least
176 specific tag which is an acceptable match and every piece of content
177 that matches is returned.
179 For example, if an application is applying a style to all content in
180 a web page in a particular language, it might use language tag
181 matching to select the content to which the style is applied.
183 A language-range matches a language-tag if it exactly equals the tag,
184 or if it exactly equals a prefix of the tag such that the first
185 character following the prefix is "-". (That is, the language-range
186 "en-de" matches the language tag "en-DE-boont", but not the language
187 tag "en-Deva".)
189 The special range "*" matches any tag. A protocol which uses
190 language ranges MAY specify additional rules about the semantics of
191 "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
192 languages not matched by any other range within an "Accept-Language:"
193 header.
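
The prefix rule above can be sketched in a few lines of Python. This is a minimal illustration, not a normative implementation: the function names basic_match and basic_filter are hypothetical, and the comparison is assumed to be case-insensitive, as the "en-de" example implies. Protocol-specific rules for "*" (such as those of HTTP/1.1) are not modeled.

```python
def basic_match(language_range: str, tag: str) -> bool:
    """Return True if a basic language range matches a language tag."""
    if language_range == "*":
        return True  # the special range matches any tag
    rng, candidate = language_range.lower(), tag.lower()
    # Exact match, or the range is a prefix that ends on a subtag boundary.
    return candidate == rng or candidate.startswith(rng + "-")


def basic_filter(language_range: str, tags: list[str]) -> list[str]:
    """Return every tag the range matches, in input order."""
    return [tag for tag in tags if basic_match(language_range, tag)]
```

For instance, basic_match("en-de", "en-DE-boont") holds, while basic_match("en-de", "en-Deva") does not, because the character following the prefix "en-De" in "en-Deva" is not "-".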
195 2.1.2 Lookup
197 Content lookup is used to select the single information item that
198 best matches the language range for a given request. In lookup, the
199 language range represents the most specific tag which is an
200 acceptable match and only the closest matching item is returned.
202 For example, if an application inserts some dynamic content into a
203 web page, returning an empty string if there is no exact match is not
204 an option. Instead, the application "falls back".
206 When performing lookup, the language range is progressively truncated
207 from the end until a matching piece of content is located. For
208 example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
209 would progressively search for content as shown below:
211 Range to match: zh-Hant-CN-x-wadegile
212 1. zh-Hant-CN-x-wadegile
213 2. zh-Hant-CN
214 3. zh-Hant
215 4. zh
216 5. (default content or the empty tag)
218 Figure 2: Default Fallback Pattern Example
220 This scheme allows some flexibility in finding content. When data
221 is not available at a specific level of tag granularity, or is
222 sparsely populated, it also typically provides better results than
223 falling back directly to the default language for the system or content.
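
The fallback pattern can be sketched as follows. The names fallback_chain and lookup are hypothetical, and the content store is assumed to be a simple mapping from language tag to item. Dropping a dangling single-character subtag such as "x" is inferred from the example above, where "zh-Hant-CN-x" is never tried.

```python
def fallback_chain(language_range: str) -> list[str]:
    """List the ranges to try, from most to least specific."""
    parts = language_range.split("-")
    chain = []
    while parts:
        chain.append("-".join(parts))
        parts.pop()  # truncate one subtag from the end
        # Assumption (from the example): a trailing singleton is dropped too.
        while parts and len(parts[-1]) == 1:
            parts.pop()
    return chain


def lookup(language_range: str, available: dict, default=None):
    """Return the single closest item, or the default if nothing matches."""
    by_tag = {tag.lower(): item for tag, item in available.items()}
    for candidate in fallback_chain(language_range):
        if candidate.lower() in by_tag:
            return by_tag[candidate.lower()]
    return default
```

With content available only for "zh-Hant" and "en", a lookup for "zh-Hant-CN-x-wadegile" returns the "zh-Hant" item. The default is returned only once the chain is exhausted.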
225 2.2 Extended Language Range
227 Prefix matching using a Basic Language Range, as described above, is
228 not always the most appropriate way to access the information
229 contained in language tags when selecting or filtering content. Some
230 applications might wish to define a more granular matching scheme and
231 such a matching scheme requires the ability to specify the various
232 attributes of a language tag in the language range. An extended
233 language range can be represented by the following ABNF:
235 extended-language-range = grandfathered / privateuse / range
236 range = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
237 [ "-" privateuse ] )
238 lang = ( 2*8ALPHA *[ "-" extlang ] ) / "*"
239 extlang = 3ALPHA / "*"
240 script = 4ALPHA / "*"
241 region = 2ALPHA / 3DIGIT / "*"
242 variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
243 privateuse = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
244 grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
245 alphanum = ( ALPHA / DIGIT )
247 In an extended language range, the identifier takes the form of a
248 series of subtags which must consist of well-formed subtags or the
249 special subtag "*". For example, the language range "en-*-US"
250 specifies a primary language of 'en', followed by any script subtag,
251 followed by the region subtag 'US'.
253 A field not present in the middle of an extended language range MAY
254 be treated as if the field contained a "*". For example, the range
255 "en-US" MAY be considered to be equivalent to the range "en-*-US".
257 There are several matching algorithms or schemes which can be applied
258 when matching extended language ranges to language tags.
260 2.2.1 Extended Range Matching
262 In extended range matching, the subtags in a language tag are
263 compared to the corresponding subtags in the extended language range.
264 A subtag is considered to match if it exactly matches the
265 corresponding subtag in the range or the range contains a subtag with
266 the value "*" (which matches all subtags, including the empty
267 subtag). Extended Range Matching is an extension of basic matching
268 (Section 2.1.1): the language range represents the least specific tag
269 which is an acceptable match.
271 By default all extensions and their subtags are ignored for extended
272 language range matching.
274 Private use subtags MAY be specified in the language range and MUST
275 NOT be ignored when matching.
277 Subtags not specified, including those at the end of the language
278 range, are assigned the value "*". This makes each range into a
279 prefix much like that used in basic language range matching. For
280 example, the extended language range "zh-*-CN" matches all of the
281 following tags because the unspecified variant field is expanded to
282 "*":
284 zh-Hant-CN
286 zh-CN
288 zh-Hans-CN
290 zh-CN-x-wadegile
292 zh-Latn-CN-boont
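
One possible reading of extended range matching is sketched below. It is an interpretation, not the definitive algorithm: the helper parse and the function extended_match are hypothetical names, subtags are assigned to fields purely by their shape in the ABNF above (so no registry is needed), extlang subtags are folded into the language field, extensions are skipped, and grandfathered tags are not handled.

```python
def parse(tag: str) -> dict:
    """Split a tag or range into fields by subtag shape; None = unspecified."""
    parts = tag.lower().split("-")
    lang, script, region, variants, private = [], None, None, [], []
    i = 0
    while i < len(parts):
        s = parts[i]
        if s == "x":                      # private use runs to the end
            private = parts[i + 1:]
            break
        if s == "*":                      # wildcard leaves its field unspecified
            pass
        elif i == 0:
            lang.append(s)
        elif len(s) == 1:                 # extension: skip it and its subtags
            while i + 1 < len(parts) and len(parts[i + 1]) > 1:
                i += 1
        elif (len(s) == 3 and s.isalpha() and lang
              and not (script or region or variants)):
            lang.append(s)                # extlang, folded into the language
        elif len(s) == 4 and s.isalpha():
            script = s
        elif (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            region = s
        else:
            variants.append(s)
        i += 1
    return {"lang": "-".join(lang) or None, "script": script,
            "region": region, "variant": "-".join(variants) or None,
            "private": "-".join(private) or None}


def extended_match(language_range: str, tag: str) -> bool:
    """Every field the range specifies must equal the tag's field."""
    rng, candidate = parse(language_range), parse(tag)
    return all(rng[k] is None or rng[k] == candidate[k] for k in rng)
```

Under these assumptions, "zh-*-CN" matches all five tags listed above, and a range that names private use subtags, such as "zh-CN-x-wadegile", does not match the plain tag "zh-CN".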
294 2.2.2 Extended Range Lookup
296 In extended range lookup, the subtags in a language tag are compared
297 to the corresponding subtags in the extended language range. The
298 subtag is considered to match if it exactly matches the corresponding
299 subtag in the range or the range contains a subtag with the value "*"
300 (which matches all subtags, including the empty subtag). Extended
301 language range lookup is an extension of basic lookup
302 (Section 2.1.2): the language range represents the most specific tag
303 which will form an acceptable match.
305 Subtags not specified are assigned the value "*" prior to performing
306 tag matching. Unlike in extended range matching, however, fields at
307 the end of the range MUST NOT be expanded in this manner. For
308 example, "en-US" MUST NOT be considered to be the same as the range
309 "en-US-*". This allows ranges to be specific. The "*" wildcard MUST
310 be used at the end of the range to indicate that all tags with the
311 range as a prefix are allowable matches. That is, the range "zh-*"
312 matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
313 matches neither of those tags.
315 The wildcard "*" at the end of a range SHOULD be considered to match
316 any private use subtag sequences (making extended language range
317 lookup function exactly like extended range matching, Section 2.2.1).
319 By default all extensions and their subtags SHOULD be ignored for
320 extended language range lookup. Private use subtags MAY be specified
321 in the language range and MUST NOT be ignored when performing lookup.
322 The wildcard "*" at the end of a range SHOULD be considered to match
323 any private use subtag sequences in addition to variants.
325 For example, the range "*-US" matches all of the following tags:
327 en-US
329 en-Latn-US
330 en-US-r-extends (extensions are ignored)
332 fr-US
334 For example, the range "en-*-US" matches _none_ of the following
335 tags:
337 fr-US
339 en (missing region US)
341 en-Latn (missing region US)
343 en-Latn-US-scouse (variant field is present)
345 For example, the range "en-*" matches all of the following tags:
347 en-Latn
349 en-Latn-US
351 en-Latn-US-scouse
353 en-US
355 en-scouse
357 Note that the ability to be specific in extended range lookup can
358 make this matching scheme a more appropriate replacement for basic
359 matching than the extended range matching scheme.
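
The difference from extended range matching lies in how fields after the last specified one are treated. The sketch below is one interpretation under stated assumptions: parse and extended_lookup_match are hypothetical names, fields are recognized by subtag shape only, extlang is folded into the language field, extensions are ignored, and grandfathered tags are not handled. A range is treated as open-ended only when it is "*" or ends in "-*".

```python
ORDER = ("lang", "script", "region", "variant", "private")


def parse(tag: str) -> dict:
    """Split a tag or range into fields by subtag shape; None = unspecified."""
    parts = tag.lower().split("-")
    lang, script, region, variants, private = [], None, None, [], []
    i = 0
    while i < len(parts):
        s = parts[i]
        if s == "x":                      # private use runs to the end
            private = parts[i + 1:]
            break
        if s == "*":                      # wildcard leaves its field unspecified
            pass
        elif i == 0:
            lang.append(s)
        elif len(s) == 1:                 # extension: skip it and its subtags
            while i + 1 < len(parts) and len(parts[i + 1]) > 1:
                i += 1
        elif (len(s) == 3 and s.isalpha() and lang
              and not (script or region or variants)):
            lang.append(s)                # extlang, folded into the language
        elif len(s) == 4 and s.isalpha():
            script = s
        elif (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            region = s
        else:
            variants.append(s)
        i += 1
    return {"lang": "-".join(lang) or None, "script": script,
            "region": region, "variant": "-".join(variants) or None,
            "private": "-".join(private) or None}


def extended_lookup_match(language_range: str, tag: str) -> bool:
    """Middle gaps are wildcards; trailing gaps need an explicit final '*'."""
    rng, candidate = parse(language_range), parse(tag)
    open_ended = language_range == "*" or language_range.endswith("-*")
    specified = [i for i, k in enumerate(ORDER) if rng[k] is not None]
    last = specified[-1] if specified else -1
    for pos, key in enumerate(ORDER):
        if rng[key] is not None:
            if rng[key] != candidate[key]:
                return False
        elif pos > last and not open_ended and candidate[key] is not None:
            return False                  # tag is more specific than the range
    return True
```

This reproduces the three example lists above: "*-US" matches "en-US-r-extends" because the extension is skipped, "en-*-US" rejects "en-Latn-US-scouse" because the variant field is present, and "zh" matches neither "zh-Hant" nor "zh-Hant-CN".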
361 2.2.3 Scored Matching
363 In the "scored matching" scheme, the extended language range and the
364 language tags are pre-normalized by mapping grandfathered and
365 obsolete tags into modern equivalents.
367 The language range and the language tags are normalized into
368 quadruples of the form (language, script, country, variant), where
369 extended language is considered part of language and x-private-codes
370 are considered part of the language if they are initial and part of
371 the variant if not initial. Missing components are set to "*". An
372 "*" pattern becomes the quadruple ("*", "*", "*", "*").
374 Each language tag being matched or filtered is assigned a "quality
375 value" such that higher values indicate better matches and lower
376 values indicate worse ones. If the language matches, add 8 to the
377 quality value. If the script matches, add 4 to the quality value.
379 If the region matches, add 2 to the quality value. If the variant
380 matches, add 1 to the quality value. Elements of the quadruples are
381 considered to match if they are the same or if one of them is "*".
383 A value of 15 is a perfect match; 0 is no match at all. Different
384 values could be more or less appropriate for different applications
385 and implementations SHOULD probably allow users to choose the most
386 appropriate selection value.
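
The scoring rule can be sketched as below, read literally from the text. The names quadruple and score are hypothetical, fields are recognized by subtag shape only, extensions are skipped, and the pre-normalization of grandfathered and obsolete tags is omitted because it would require registry data.

```python
WEIGHTS = (8, 4, 2, 1)  # language, script, region, variant


def quadruple(tag: str) -> tuple:
    """Normalize a tag or range to (language, script, region, variant)."""
    parts = tag.lower().split("-")
    if parts[0] == "x":                   # initial private use: part of language
        return ("-".join(parts), "*", "*", "*")
    lang, script, region, variant = [parts[0]], "*", "*", []
    i = 1
    while i < len(parts):
        s = parts[i]
        if s == "x":                      # non-initial private use: part of variant
            variant.extend(parts[i:])
            break
        if s == "*":                      # wildcard leaves its field as "*"
            pass
        elif len(s) == 1:                 # extension: skip it and its subtags
            while i + 1 < len(parts) and len(parts[i + 1]) > 1:
                i += 1
        elif (len(s) == 3 and s.isalpha()
              and script == "*" and region == "*" and not variant):
            lang.append(s)                # extlang is part of the language
        elif len(s) == 4 and s.isalpha():
            script = s
        elif (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            region = s
        else:
            variant.append(s)
        i += 1
    return ("-".join(lang), script, region, "-".join(variant) or "*")


def score(language_range: str, tag: str) -> int:
    """Sum the weights of the fields that are equal or wildcarded."""
    pairs = zip(WEIGHTS, quadruple(language_range), quadruple(tag))
    return sum(w for w, a, b in pairs if a == b or "*" in (a, b))
```

Read this way, score("en-US", "en-GB") is 13 (language, script and variant match; region does not), and a missing field always earns its weight because it is set to "*". An application would typically compare the result against a threshold of its own choosing.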
388 2.3 Meaning of Language Tags and Ranges
390 A language tag defines a language as spoken (or written, signed or
391 otherwise signaled) by human beings for communication of information
392 to other human beings.
394 If a language tag B contains language tag A as a prefix, then B is
395 typically "narrower" or "more specific" than A. For example, "zh-
396 Hant-TW" is more specific than "zh-Hant".
398 This relationship is not guaranteed in all cases: specifically,
399 languages that begin with the same sequence of subtags are NOT
400 guaranteed to be mutually intelligible, although they might be.
402 For example, the tag "az" shares a prefix with both "az-Latn"
403 (Azerbaijani written using the Latin script) and "az-Cyrl"
404 (Azerbaijani written using the Cyrillic script). A person fluent in
405 one script might not be able to read the other, even though the text
406 might be otherwise identical. Content tagged as "az" most probably
407 is written in just one script and thus might not be intelligible to a
408 reader familiar with the other script.
410 Variant subtags in particular seem to represent specific divisions in
411 mutual understanding, since they often encode dialects or other
412 idiosyncratic variations within a language.
414 The relationship between the language tag and the information it
415 relates to is defined by the standard describing the context in which
416 it appears. Accordingly, this section can only give possible
417 examples of its usage.
419 o For a single information object, the associated language tags
420 might be interpreted as the set of languages that are necessary
421 for a complete comprehension of the complete object. Example:
422 Plain text documents.
424 o For an aggregation of information objects, the associated language
425 tags could be taken as the set of languages used inside components
426 of that aggregation. Examples: Document stores and libraries.
428 o For information objects whose purpose is to provide alternatives,
429 the associated language tags could be regarded as a hint that the
430 content is provided in several languages, and that one has to
431 inspect each of the alternatives in order to find its language or
432 languages. In this case, the presence of multiple tags might not
433 mean that one needs to be multi-lingual to get complete
434 understanding of the document. Example: MIME multipart/
435 alternative.
437 o In markup languages, such as HTML and XML, language information
438 can be added to each part of the document identified by the markup
439 structure (including the whole document itself). For example, one
440 could write <span lang="FR">C'est la vie.</span> inside a
441 Norwegian document; the Norwegian-speaking user could then access
442 a French-Norwegian dictionary to find out what the marked section
443 meant. If the user were listening to that document through a
444 speech synthesis interface, this information could be used to signal
445 the synthesizer to appropriately apply French text-to-speech
446 pronunciation rules to that span of text, instead of misapplying
447 the Norwegian rules.
449 2.4 Choosing Between Alternate Matching Schemes
451 Implementations MAY choose to implement different styles of matching
452 for different kinds of processing. For example, an implementation
453 could treat an absent script subtag as a "wildcard" field; thus
454 "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not
455 "az" (this is extended range lookup). If one item is to be chosen,
456 the implementation could pick among those matches based on other
457 information, such as the most likely script used in the language/
458 region in question or the script used by other content selected.
460 Because the primary language subtag cannot be absent in a language
461 tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic
462 matching. For example, in a query where you want to select all
463 language tags that contain 'Latn' as the script code and 'AZ' as the
464 region code, you could use the range "und-Latn-AZ". This requires an
465 implementation to examine the actual values of the subtags, though.
466 The matching schemes described elsewhere in this document are
467 designed such that implementations do not have to examine the values
468 of the subtags supplied and, except for scored matching, they need
469 neither access to the Language Subtag Registry nor valid subtags
470 in language tags or ranges. This has great benefit for speed and
471 simplicity of implementation.
473 Implementations might also wish to use semantic information external
474 to the language tags when performing fallback. For example, the
475 primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
476 Norwegian) might both be usefully matched to the more general subtag
477 'no' (Norwegian). Or an application might infer that content labeled
478 "zh-CN" is more likely to match the range "zh-Hans" than equivalent
479 content labeled "zh-TW".
481 2.5 Considerations for Private Use Subtags
483 Private-use subtags require private agreement between the parties
484 that intend to use or exchange language tags that use them and great
485 caution SHOULD be used in employing them in content or protocols
486 intended for general use. Private-use subtags are simply useless for
487 information exchange without prior arrangement.
489 The value and semantic meaning of private-use tags and of the subtags
490 used within such a language tag are not defined. Matching private
491 use tags using language ranges or extended language ranges can result
492 in unpredictable content being returned.
494 2.6 Length Considerations in Matching
496 RFC 3066 [19] did not provide an upper limit on the size of language
497 tags or ranges. RFC 3066 did define the semantics of particular
498 subtags in such a way that most language tags or ranges consisted of
499 language and region subtags with a combined total length of up to six
500 characters. Larger tags and ranges (in terms of both subtags and
501 characters) did exist, however.
503 [1] also does not impose a fixed upper limit on the number of subtags
504 in a language tag or range (and thus an upper bound on the size of
505 either). The syntax in that document suggests that, depending on the
506 specific language or range of languages, more subtags (and thus
507 characters) are sometimes necessary as a result. Length
508 considerations and their impact on the selection and processing of
509 tags are described in Section 2.1.1 of that document.
511 A matching implementation MAY choose to limit the length of the
512 language tags or ranges used in matching. Any such limitation SHOULD
513 be clearly documented, and such documentation SHOULD include the
514 disposition of any longer tags or ranges (for example, whether an
515 error value is generated or the language tag or range is truncated).
516 If truncation is permitted it MUST NOT permit a subtag to be divided,
517 since this changes the semantics of the subtag being matched and can
518 result in false positives or negatives.
520 Implementations that restrict storage SHOULD consider the impact of
521 tag or range truncation on the resulting matches. For example,
522 removing the "*" from the end of an extended language range (see
523 Section 2.2) can greatly modify the set of returned matches. A
524 protocol that allows tags or ranges to be truncated at an arbitrary
525 limit, without giving any indication of what that limit is, has the
526 potential for causing harm by changing the meaning of values in
527 substantial ways.
529 In practice, most tags do not require additional subtags or
530 substantially more characters. Additional subtags sometimes add
531 useful distinguishing information, but extraneous subtags interfere
532 with the meaning, understanding, and especially matching of language
533 tags. Since language tags or ranges MAY be truncated by an
534 application or protocol that limits storage, when choosing language
535 tags or ranges users and applications SHOULD avoid adding subtags
536 that add no distinguishing value. In particular, users and
537 implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
538 fields in the registry (defined in Section 3.6 of [1]): these fields
539 provide guidance on when specific additional subtags SHOULD (and
540 SHOULD NOT) be used.
542 Implementations MUST support a limit of at least 33 characters. This
543 limit includes at least one subtag of each non-extension, non-private
544 use type. When choosing a buffer limit, a length of at least 42
545 characters is strongly RECOMMENDED.
547 The practical limit on tags or ranges derived solely from registered
548 values is 42 characters. Implementations MUST be able to handle tags
549 and ranges of this length. Support for tags and ranges of at least
550 62 characters in length is RECOMMENDED. Implementations MAY support
551 longer values, including matching extensive sets of private use or
552 extension subtags.
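One way an implementation might make its length limit explicit is sketched below in Python. This is an illustration only: the names MIN_SUPPORTED_LENGTH, RECOMMENDED_LENGTH, TagTooLongError, and check_length are hypothetical, the figures 42 and 62 are taken from the paragraph above, and rejecting an over-long value with an error (rather than truncating it) is just one of the dispositions a protocol could choose.

```python
MIN_SUPPORTED_LENGTH = 42   # length every implementation is required to handle
RECOMMENDED_LENGTH = 62     # length that is recommended for buffers


class TagTooLongError(ValueError):
    """Raised when a tag or range exceeds the configured limit."""


def check_length(value: str, limit: int = RECOMMENDED_LENGTH) -> str:
    """Return `value` unchanged if it fits within `limit` characters.

    The limit itself is validated first, so a configuration that
    cannot hold the required minimum is reported immediately.
    """
    if limit < MIN_SUPPORTED_LENGTH:
        raise ValueError(
            f"limit {limit} is below the required minimum "
            f"of {MIN_SUPPORTED_LENGTH}"
        )
    if len(value) > limit:
        # Assumed disposition: signal an error instead of truncating.
        raise TagTooLongError(
            f"{len(value)} characters exceeds the limit of {limit}"
        )
    return value
```

Reporting the limit in the error message gives the sender an indication of what the limit is, which the text above identifies as important when values cannot be stored in full.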
554 Applications or protocols that have to truncate a tag MUST do so by
555 progressively removing subtags along with their preceding "-" from
556 the right side of the language tag until the tag is short enough for
557 the given buffer. If the resulting tag ends with a single-character
558 subtag, that subtag and its preceding "-" MUST also be removed. For
559 example:
561 Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
562 1. zh-Hant-CN-variant1-a-extend1-x-wadegile
563 2. zh-Hant-CN-variant1-a-extend1
564 3. zh-Hant-CN-variant1
565 4. zh-Hant-CN
566 5. zh-Hant
567 6. zh
569 Figure 4: Example of Tag Truncation
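The truncation rule illustrated in Figure 4 can be sketched in Python as follows. This is an illustration rather than part of the specification: the function name truncate_tag, the limit parameter, and the decision to raise ValueError when even the first subtag does not fit are assumptions made for the example.

```python
def truncate_tag(tag: str, limit: int) -> str:
    """Shorten `tag` to at most `limit` characters, whole subtags only."""
    subtags = tag.split("-")

    while len("-".join(subtags)) > limit and len(subtags) > 1:
        # Remove the rightmost subtag together with its preceding "-".
        subtags.pop()
        # A single-character subtag left at the end (an extension or
        # private use singleton) is removed as well.
        if len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()

    result = "-".join(subtags)
    if len(result) > limit:
        # Assumption: a subtag is never divided, so a first subtag that
        # does not fit is reported as an error instead of being cut.
        raise ValueError(f"cannot truncate {tag!r} to {limit} characters")
    return result
```

With the tag from Figure 4, a limit of 40 yields the tag shown in step 1, while a limit of 39 yields the tag shown in step 2: removing "wadegile" leaves a trailing "x", which is removed too.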
571 3. IANA Considerations
573 This document presents no new or existing considerations for IANA.
575 4. Changes
577 This is the first version of this document.
579 The following changes were put into this document since draft-00:
581 Fixed text in the introduction that is no longer accurate.
582 Specifically, there no longer is a default matching algorithm.
583 (A.Phillips)
585 Fixed text in Section 2.1 which incorrectly discussed the default
586 fallback mechanism. (A.Phillips)
588 Minor changes to Section 2.3, in particular, the addition of the
589 'variant' paragraph and some tidying of the text. (A.Phillips)
591 Fixed a minor glitch in the ABNF caused by taking the output of
592 Bill Fenner's parser and not looking too closely at it. (M.Patton)
594 Fixed some minor reference problems. (M.Patton)
596 Added Section 2.6 on length considerations in matching.
597 (R.Presuhn)
599 Copied various materials from the length considerations section of
600 the registry draft to keep the two documents in sync.
601 (A.Phillips)
603 5. Security Considerations
605 The only security issue that has been raised with language tags since
606 the publication of RFC 1766, which stated that "Security issues are
607 believed to be irrelevant to this memo", is a concern with language
608 ranges used in content negotiation - that they might be used to infer
609 the nationality of the sender, and thus identify potential targets
610 for surveillance.
612 This is a special case of the general problem that anything you send
613 is visible to the receiving party. It is useful to be aware that
614 such concerns can exist in some cases.
616 The evaluation of the exact magnitude of the threat, and any possible
617 countermeasures, is left to each application protocol.
619 Although the specification of valid subtags for an extension MUST be
620 available over the Internet, implementations SHOULD NOT mechanically
621 depend on it being always accessible, to prevent denial-of-service
622 attacks.
624 6. Character Set Considerations
626 The syntax in this document requires that language ranges use only
627 the characters that are legal in language tags: A-Z, a-z, 0-9, and
628 HYPHEN-MINUS. These characters are present in most character sets, so
629 presentation of language tags should not have any character set
630 issues.
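A check against this character repertoire might look like the following Python sketch. The names ALLOWED and uses_only_tag_characters are hypothetical, and the optional allow_wildcard flag, which additionally admits the "*" used in extended language ranges, is an assumption about how an implementation might treat wildcards. The check covers the character repertoire only and does not validate subtag structure.

```python
import string

# A-Z, a-z, 0-9, and HYPHEN-MINUS.
ALLOWED = frozenset(string.ascii_letters + string.digits + "-")


def uses_only_tag_characters(value: str, allow_wildcard: bool = False) -> bool:
    """Return True if every character of `value` is in the repertoire.

    An empty string is rejected. Only the repertoire is checked; the
    arrangement of subtags is not examined.
    """
    allowed = (ALLOWED | {"*"}) if allow_wildcard else ALLOWED
    return bool(value) and all(ch in allowed for ch in value)
```

For instance, "zh-Hant-CN" passes, while "en_US" fails because LOW LINE is outside the repertoire.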
632 Rendering of characters based on the content of a language tag is not
633 addressed in this memo. Historically, some languages have relied on
634 the use of specific character sets or other information in order to
635 infer how a specific character should be rendered (notably this
636 applies to language and culture specific variations of Han ideographs
637 as used in Japanese, Chinese, and Korean). When language tags are
638 applied to spans of text, rendering engines sometimes use that
639 information in deciding which font to use in the absence of other
640 information, particularly where languages with distinct writing
641 traditions use the same characters.
643 7. References
645 7.1 Normative References
647 [1] Phillips, A., Ed. and M. Davis, Ed., "Tags for the
648 Identification of Languages (Internet-Draft)", June 2005.
652 [2] Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021
653 and RFC 822", RFC 1327, May 1992.
655 [3] Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
656 Extensions) Part One: Mechanisms for Specifying and Describing
657 the Format of Internet Message Bodies", RFC 1521,
658 September 1993.
660 [4] Hovey, R. and S. Bradner, "The Organizations Involved in the
661 IETF Standards Process", BCP 11, RFC 2028, October 1996.
663 [5] Bradner, S., "Key words for use in RFCs to Indicate Requirement
664 Levels", BCP 14, RFC 2119, March 1997.
666 [6] Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word
667 Extensions: Character Sets, Languages, and Continuations",
668 RFC 2231, November 1997.
670 [7] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
671 Specifications: ABNF", RFC 2234, November 1997.
673 [8] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
674 Resource Identifiers (URI): Generic Syntax", RFC 2396,
675 August 1998.
677 [9] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
678 Considerations Section in RFCs", BCP 26, RFC 2434,
679 October 1998.
681 [10] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
682 Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
683 HTTP/1.1", RFC 2616, June 1999.
685 [11] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
686 Understanding Concerning the Technical Work of the Internet
687 Assigned Numbers Authority", RFC 2860, June 2000.
689 [12] Yergeau, F., "UTF-8, a transformation format of ISO 10646",
690 STD 63, RFC 3629, November 2003.
692 7.2 Informative References
694 [13] International Organization for Standardization, "ISO
695 639-1:2002, Codes for the representation of names of languages --
696 Part 1: Alpha-2 code", ISO Standard 639, 2002.
698 [14] International Organization for Standardization, "ISO 639-2:1998
699 - Codes for the representation of names of languages -- Part 2:
700 Alpha-3 code - edition 1", August 1988.
702 [15] ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
703 representation of names of scripts", January 2004.
705 [16] International Organization for Standardization, "Codes for the
706 representation of names of countries, 3rd edition",
707 ISO Standard 3166, August 1988.
709 [17] Statistical Division, United Nations, "Standard Country or Area
710 Codes for Statistical Use", UN Standard Country or Area Codes
711 for Statistical Use, Revision 4 (United Nations publication,
712 Sales No. 98.XVII.9), June 1999.
714 [18] Alvestrand, H., "Tags for the Identification of Languages",
715 RFC 1766, March 1995.
717 [19] Alvestrand, H., "Tags for the Identification of Languages",
718 BCP 47, RFC 3066, January 2001.
720 [20] Klyne, G. and C. Newman, "Date and Time on the Internet:
721 Timestamps", RFC 3339, July 2002.
723 Authors' Addresses
725 Addison Phillips (editor)
726 Quest Software
728 Email: addison dot phillips at quest dot com
730 Mark Davis (editor)
731 IBM
733 Email: mark dot davis at ibm dot com
735 Appendix A. Acknowledgements
737 Any list of contributors is bound to be incomplete; please regard the
738 following as only a selection from the group of people who have
739 contributed to make this document what it is today.
741 The contributors to RFC 3066 and RFC 1766, the precursors of this
742 document, made enormous contributions directly or indirectly to this
743 document and are generally responsible for the success of language
744 tags.
746 The following people (in alphabetical order) contributed to this
747 document or to RFCs 1766 and 3066:
749 Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
750 Nathaniel Borenstein, Eric Brunner, Sean M. Burke, Jeremy Carroll,
751 John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin,
752 Dave Crocker, Martin Duerst, Michael Everson, Doug Ewell, Ned Freed,
753 Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpern,
754 Elliotte Rusty Harold, Paul Hoffman, Richard Ishida, Olle Jarnefors,
755 Kent Karlsson, John Klensin, Alain LaBonte, Eric Mader, Keith Moore,
756 Chris Newman, Masataka Ohta, Michael S. Patton, Randy Presuhn, George
757 Rhoten, Markus Scherer, Keld Jorn Simonsen, Thierry Sourbier, Otto
758 Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha Wolf, Francois
759 Yergeau and many, many others.
761 Very special thanks must go to Harald Tveit Alvestrand, who
762 originated RFCs 1766 and 3066, and without whom this document would
763 not have been possible. Special thanks must go to Michael Everson,
764 who has served as language tag reviewer for almost the complete
765 period since the publication of RFC 1766. Special thanks to Doug
766 Ewell, for his production of the first complete subtag registry, and
767 his work in producing a test parser for verifying language tags.
769 For this particular document, John Cowan originated the scheme
770 described in Section 2.2.3. Mark Davis originated the scheme
771 described in Section 2.1.2.
773 Intellectual Property Statement
775 The IETF takes no position regarding the validity or scope of any
776 Intellectual Property Rights or other rights that might be claimed to
777 pertain to the implementation or use of the technology described in
778 this document or the extent to which any license under such rights
779 might or might not be available; nor does it represent that it has
780 made any independent effort to identify any such rights. Information
781 on the procedures with respect to rights in RFC documents can be
782 found in BCP 78 and BCP 79.
784 Copies of IPR disclosures made to the IETF Secretariat and any
785 assurances of licenses to be made available, or the result of an
786 attempt made to obtain a general license or permission for the use of
787 such proprietary rights by implementers or users of this
788 specification can be obtained from the IETF on-line IPR repository at
789 http://www.ietf.org/ipr.
791 The IETF invites any interested party to bring to its attention any
792 copyrights, patents or patent applications, or other proprietary
793 rights that may cover technology that may be required to implement
794 this standard. Please address the information to the IETF at
795 ietf-ipr@ietf.org.
797 Disclaimer of Validity
799 This document and the information contained herein are provided on an
800 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
801 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
802 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
803 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
804 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
805 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
807 Copyright Statement
809 Copyright (C) The Internet Society (2005). This document is subject
810 to the rights, licenses and restrictions contained in BCP 78, and
811 except as set forth therein, the authors retain all their rights.
813 Acknowledgment
815 Funding for the RFC Editor function is currently provided by the
816 Internet Society.