idnits 2.17.1 

draft-ietf-ltru-matching-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 805.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 782.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 789.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 795.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RFC3066], [19], [1]), which
     it shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 169 has weird spacing: '...schemes  that ...'

  == Line 170 has weird spacing: '...ing and  looku...'

  == Line 374 has weird spacing: '...age tag  being...'

  == Line 467 has weird spacing: '...ch that  imple...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 10, 2005) is 6894 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RFC 3066' on line 46

  == Unused Reference: '2' is defined on line 652, but no explicit reference
     was found in the text

  == Unused Reference: '3' is defined on line 655, but no explicit reference
     was found in the text

  == Unused Reference: '4' is defined on line 660, but no explicit reference
     was found in the text

  == Unused Reference: '6' is defined on line 666, but no explicit reference
     was found in the text

  == Unused Reference: '7' is defined on line 670, but no explicit reference
     was found in the text

  == Unused Reference: '8' is defined on line 673, but no explicit reference
     was found in the text

  == Unused Reference: '9' is defined on line 677, but no explicit reference
     was found in the text

  == Unused Reference: '11' is defined on line 685, but no explicit reference
     was found in the text

  == Unused Reference: '12' is defined on line 689, but no explicit reference
     was found in the text

  == Unused Reference: '13' is defined on line 694, but no explicit reference
     was found in the text

  == Unused Reference: '14' is defined on line 698, but no explicit reference
     was found in the text

  == Unused Reference: '15' is defined on line 702, but no explicit reference
     was found in the text

  == Unused Reference: '16' is defined on line 705, but no explicit reference
     was found in the text

  == Unused Reference: '17' is defined on line 709, but no explicit reference
     was found in the text

  == Unused Reference: '18' is defined on line 714, but no explicit reference
     was found in the text

  == Unused Reference: '20' is defined on line 720, but no explicit reference
     was found in the text

  == Outdated reference: A later version (-14) exists of
     draft-ietf-ltru-registry-03

  ** Obsolete normative reference: RFC 1327 (ref. '2') (Obsoleted by RFC 2156)

  ** Obsolete normative reference: RFC 1521 (ref. '3') (Obsoleted by RFC
     2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049)

  ** Obsolete normative reference: RFC 2028 (ref. '4') (Obsoleted by RFC 9281)

  ** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234)

  ** Obsolete normative reference: RFC 2396 (ref. '8') (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2434 (ref. '9') (Obsoleted by RFC 5226)

  ** Obsolete normative reference: RFC 2616 (ref. '10') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 2860 (ref.
     '11')

  -- Obsolete informational reference (is this intentional?): RFC 1766 (ref.
     '18') (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 3066 (ref.
     '19') (Obsoleted by RFC 4646, RFC 4647)


     Summary: 12 errors (**), 0 flaws (~~), 24 warnings (==), 10 comments
     (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                            Quest Software
4	Expires: December 12, 2005                                 M. Davis, Ed.
5	                                                                     IBM
6	                                                           June 10, 2005

8	                     Matching Language Identifiers
9	                      draft-ietf-ltru-matching-02

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on December 12, 2005.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2005).

40	Abstract

42	   This document describes different mechanisms for comparing and
43	   matching the tags for the identification of languages defined by [RFC
44	   3066bis] [1].  Possible algorithms for language negotiation and
45	   content selection are described.  This document obsoletes portions of
46	   [RFC 3066] [19].

48	Table of Contents

50	   1.  Introduction
51	   2.  The Language Range
52	     2.1   Basic Language Range
53	       2.1.1   Matching
54	       2.1.2   Lookup
55	     2.2   Extended Language Range
56	       2.2.1   Extended Range Matching
57	       2.2.2   Extended Range Lookup
58	       2.2.3   Scored Matching
59	     2.3   Meaning of Language Tags and Ranges
60	     2.4   Choosing Between Alternate Matching Schemes
61	     2.5   Considerations for Private Use Subtags
62	     2.6   Length Considerations in Matching
63	   3.  IANA Considerations
64	   4.  Changes
65	   5.  Security Considerations
66	   6.  Character Set Considerations
67	   7.  References
68	     7.1   Normative References
69	     7.2   Informative References
70	       Authors' Addresses
71	   A.  Acknowledgements
72	       Intellectual Property and Copyright Statements

74	1.  Introduction

76	   Human beings on our planet have, past and present, used a number of
77	   languages.  There are many reasons why one would want to identify the
78	   language used when presenting or requesting information.

80	   Information about a user's language preferences commonly needs to be
81	   identified so that appropriate processing can be applied.  For
82	   example, the user's language preferences in a browser can be used to
83	   select web pages appropriately.  A choice of language preference can
84	   also be used to select among tools (such as dictionaries) to assist
85	   in the processing or understanding of content in different languages.

87	   Given a set of language identifiers, such as those defined in
88	   RFC3066bis [1], various mechanisms can be envisioned for performing
89	   language negotiation and tag matching.  The suitability of a
90	   particular mechanism to a particular application depends on the needs
91	   of that application.

93	   This document defines language ranges and syntax for specifying user
94	   preferences in a request for language content.  It also specifies
95	   various schemes and mechanisms that can be used with language ranges
96	   when matching or filtering content based on language tags.

98	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
99	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
100	   document are to be interpreted as described in RFC 2119 [5].

102	2.  The Language Range

104	   Language Tags are used to identify the language of some information
105	   item or content.  Applications that use language tags are often faced
106	   with the problem of identifying sets of content that share certain
107	   language attributes.  For example, HTTP 1.1 [10] describes language
108	   ranges in its discussion of the Accept-Language header (Section
109	   14.4), which is used for selecting content from servers based on the
110	   language of that content.

112	   When selecting content according to its language, it is useful to
113	   have a mechanism for identifying sets of language tags that share
114	   specific attributes.  This allows users to select or filter content
115	   based on specific requirements.  Such an identifier is called a
116	   "Language Range".

118	2.1  Basic Language Range

120	   A basic language range (such as described in RFC 3066 [19] and HTTP
121	   1.1 [10]) is a set of languages whose tags all begin with the same
122	   sequence of subtags.  A basic language range can be represented by a
123	   'language-range' tag, by using the definition from HTTP/1.1 [10]:
124	   language-range = language-tag / "*"

126	   That is, a language-range has the same syntax as a language-tag or is
127	   the single character "*".  This definition of language-range implies
128	   that there is a semantic relationship between tags that share the
129	   same prefix.

131	   However, the set of language tags that match a specific
132	   language-range might not all be mutually intelligible.  The use of a
133	   prefix when matching tags to language ranges does not imply that
134	   language tags are assigned to languages in such a way that it is
135	   always true that if a user understands a language with a certain tag,
136	   then this user will also understand all languages with tags for which
137	   this tag is a prefix.  The prefix rule simply allows the use of
138	   prefix tags if this is the case.

140	   When working with tags and ranges you SHOULD also note the following:

142	   1.  Private-use and Extension subtags are normally orthogonal to
143	       language tag fallback.  Implementations SHOULD ignore
144	       unrecognized private-use and extension subtags when performing
145	       language tag fallback.  Since these subtags are always at the end
146	       of the sequence of subtags, they don't normally interfere with
147	       the use of prefixes for matching in the schemes described below.

149	   2.  Implementations that choose not to interpret one or more private-
150	       use or extension subtags SHOULD NOT remove or modify these
151	       extensions in content that they are processing.  When a language
152	       tag instance is to be used in a specific, known protocol, and is
153	       not being passed through to other protocols, language tags MAY be
154	       filtered to remove subtags and extensions that are not supported
155	       by that protocol.  Such filtering SHOULD be avoided, if possible,
156	       since it removes information that might be relevant if services
157	       on the other end of the protocol would make use of that
158	       information.

160	   3.  Some applications of language tags might want or need to consider
161	       extensions and private-use subtags when matching tags.  If
162	       extensions and private-use subtags are included in a matching or
163	       filtering process that utilizes one of the schemes described
164	       in this document, then the implementation SHOULD canonicalize the
165	       language tags and/or ranges before performing the matching.  Note
166	       that language tag processors that claim to be "well-formed"
167	       processors as defined in [1] generally fall into this category.

169	   There are two matching schemes that are commonly associated with
170	   basic language ranges: matching and lookup.

172	2.1.1  Matching

174	   Language tag matching is used to select all content that matches a
175	   given prefix.  In matching, the language range represents the least
176	   specific tag which is an acceptable match and every piece of content
177	   that matches is returned.

179	   For example, if an application is applying a style to all content in
180	   a web page in a particular language, it might use language tag
181	   matching to select the content to which the style is applied.

183	   A language-range matches a language-tag if it exactly equals the tag,
184	   or if it exactly equals a prefix of the tag such that the first
185	   character following the prefix is "-".  (That is, the language-range
186	   "en-de" matches the language tag "en-DE-boont", but not the language
187	   tag "en-Deva".)

189	   The special range "*" matches any tag.  A protocol which uses
190	   language ranges MAY specify additional rules about the semantics of
191	   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
192	   languages not matched by any other range within an "Accept-Language:"
193	   header.
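
   The prefix rule above can be sketched in Python as follows.  This is
   a minimal illustration, not part of the specification: the function
   names basic_match and basic_filter are hypothetical, and the
   case-insensitive comparison is inferred from the "en-de" example
   rather than stated as a rule in this section.  Protocol-specific
   semantics for "*" (such as those of HTTP/1.1) are not modeled.

```python
def basic_match(language_range: str, language_tag: str) -> bool:
    """Return True if language_range matches language_tag (basic matching)."""
    if language_range == "*":
        # The special range "*" matches any tag.
        return True
    rng = language_range.lower()
    tag = language_tag.lower()
    # Exact match, or a prefix that ends exactly at a subtag boundary ("-").
    return tag == rng or tag.startswith(rng + "-")


def basic_filter(language_range: str, tags: list[str]) -> list[str]:
    """Return every tag that matches the range, in the order given."""
    return [t for t in tags if basic_match(language_range, t)]
```

   With these definitions, basic_match("en-de", "en-DE-boont") is true
   and basic_match("en-de", "en-Deva") is false, as in the example
   above.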

195	2.1.2  Lookup

197	   Content lookup is used to select the single information item that
198	   best matches the language range for a given request.  In lookup, the
199	   language range represents the most specific tag which is an
200	   acceptable match and only the closest matching item is returned.

202	   For example, if an application inserts some dynamic content into a
203	   web page, returning an empty string if there is no exact match is not
204	   an option.  Instead, the application "falls back".

206	   When performing lookup, the language range is progressively truncated
207	   from the end until a matching piece of content is located.  For
208	   example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
209	   would progressively search for content as shown below:

211	   Range to match: zh-Hant-CN-x-wadegile
212	   1. zh-Hant-CN-x-wadegile
213	   2. zh-Hant-CN
214	   3. zh-Hant
215	   4. zh
216	   5. (default content or the empty tag)

218	                Figure 2: Default Fallback Pattern Example

220	   This scheme allows some flexibility in finding content.  When data is
221	   not available at a specific level of tag granularity, or is sparsely
222	   populated, it also typically provides better results than falling
223	   back directly to the default language for the system or content.
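
   The progressive truncation can be sketched in Python as follows.
   This is an illustration under stated assumptions: the names
   fallback_chain and lookup are hypothetical, content is modeled as a
   simple mapping from language tag to item, and the removal of a
   trailing single-character subtag (such as "x") together with the
   subtag that follows it is inferred from the fallback pattern in the
   figure above.

```python
from typing import Optional


def fallback_chain(language_range: str) -> list[str]:
    """List the ranges tried during lookup, most specific first."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # Assumption: a trailing single-character subtag (e.g. "x") is
        # removed as well, since it only introduces what followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(language_range: str, content: dict[str, str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the item for the closest matching tag, else the default."""
    by_tag = {tag.lower(): item for tag, item in content.items()}
    for candidate in fallback_chain(language_range):
        item = by_tag.get(candidate.lower())
        if item is not None:
            return item
    return default
```

   For the range "zh-Hant-CN-x-wadegile", fallback_chain yields the
   four ranges listed in steps 1 through 4 of the figure; step 5
   corresponds to the default argument of lookup.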

225	2.2  Extended Language Range

227	   Prefix matching using a Basic Language Range, as described above, is
228	   not always the most appropriate way to access the information
229	   contained in language tags when selecting or filtering content.  Some
230	   applications might wish to define a more granular matching scheme and
231	   such a matching scheme requires the ability to specify the various
232	   attributes of a language tag in the language range.  An extended
233	   language range can be represented by the following ABNF:

235	   extended-language-range = grandfathered / privateuse / range
236	   range   = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
237	                [ "-" privateuse ] )
238	   lang    = ( 2*8ALPHA *[ "-" extlang ] ) / "*"
239	   extlang = 3ALPHA / "*"
240	   script  = 4ALPHA / "*"
241	   region  = 2ALPHA / 3DIGIT / "*"
242	   variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
243	   privateuse    = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
244	   grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
245	   alphanum      = ( ALPHA / DIGIT )

247	   In an extended language range, the identifier takes the form of a
248	   series of subtags which must consist of well-formed subtags or the
249	   special subtag "*".  For example, the language range "en-*-US"
250	   specifies a primary language of 'en', followed by any script subtag,
251	   followed by the region subtag 'US'.

253	   A field not present in the middle of an extended language range MAY
254	   be treated as if the field contained a "*".  For example, the range
255	   "en-US" MAY be considered to be equivalent to the range "en-*-US".

257	   There are several matching algorithms or schemes which can be applied
258	   when matching extended language ranges to language tags.

260	2.2.1  Extended Range Matching

262	   In extended range matching, the subtags in a language tag are
263	   compared to the corresponding subtags in the extended language range.
264	   A subtag is considered to match if it exactly matches the
265	   corresponding subtag in the range or the range contains a subtag with
266	   the value "*" (which matches all subtags, including the empty
267	   subtag).  Extended Range Matching is an extension of basic matching
268	   (Section 2.1.1): the language range represents the least specific tag
269	   which is an acceptable match.

271	   By default all extensions and their subtags are ignored for extended
272	   language range matching.

274	   Private use subtags MAY be specified in the language range and MUST
275	   NOT be ignored when matching.

277	   Subtags not specified, including those at the end of the language
278	   range, are assigned the value "*".  This makes each range into a
279	   prefix much like that used in basic language range matching.  For
280	   example, the extended language range "zh-*-CN" matches all of the
281	   following tags because the unspecified variant field is expanded to
282	   "*":

284	      zh-Hant-CN

286	      zh-CN

288	      zh-Hans-CN

290	      zh-CN-x-wadegile

292	      zh-Latn-CN-boont

294	2.2.2  Extended Range Lookup

296	   In extended range lookup, the subtags in a language tag are compared
297	   to the corresponding subtags in the extended language range.  The
298	   subtag is considered to match if it exactly matches the corresponding
299	   subtag in the range or the range contains a subtag with the value "*"
300	   (which matches all subtags, including the empty subtag).  Extended
301	   language range lookup is an extension of basic lookup
302	   (Section 2.1.2): the language range represents the most specific tag
303	   which will form an acceptable match.

305	   Subtags not specified are assigned the value "*" prior to performing
306	   tag matching.  Unlike in extended range matching, however, fields at
307	   the end of the range MUST NOT be expanded in this manner.  For
308	   example, "en-US" MUST NOT be considered to be the same as the range
309	   "en-US-*".  This allows ranges to be specific.  The "*" wildcard MUST
310	   be used at the end of the range to indicate that all tags with the
311	   range as a prefix are allowable matches.  That is, the range "zh-*"
312	   matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
313	   matches neither of those tags.

315	   The wildcard "*" at the end of a range SHOULD be considered to match
316	   any private use subtag sequences (making extended language range
317	   lookup function exactly like extended range matching, Section 2.2.1).

319	   By default all extensions and their subtags SHOULD be ignored for
320	   extended language range lookup.  Private use subtags MAY be specified
321	   in the language range and MUST NOT be ignored when performing lookup.
322	   The wildcard "*" at the end of a range SHOULD be considered to match
323	   any private use subtag sequences in addition to variants.

325	   For example, the range "*-US" matches all of the following tags:

327	      en-US

329	      en-Latn-US
330	      en-US-r-extends (extensions are ignored)

332	      fr-US

334	   For example, the range "en-*-US" matches _none_ of the following
335	   tags:

337	      fr-US

339	      en (missing region US)

341	      en-Latn (missing region US)

343	      en-Latn-US-scouse (variant field is present)

345	   For example, the range "en-*" matches all of the following tags:

347	      en-Latn

349	      en-Latn-US

351	      en-Latn-US-scouse

353	      en-US

355	      en-scouse

357	   Note that the ability to be specific in extended range lookup can
358	   make this matching scheme a more appropriate replacement for basic
359	   matching than the extended range matching scheme.
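
   The two extended schemes (Sections 2.2.1 and 2.2.2) differ mainly in
   how unspecified fields at the end of the range are treated, so both
   can be sketched with one Python function and a flag.  This is a
   rough illustration, not a definitive implementation.  The names
   parse, extended_match, and expand_trailing are hypothetical.
   Subtags are assigned to fields by their shape, following a
   simplified reading of the ABNF in Section 2.2; a "*" is assigned to
   the next unfilled field; extensions are ignored.  Grandfathered
   tags, wholly private-use tags, and a wildcard mixed into a run of
   several variants are not handled.

```python
import re

FIELDS = ("language", "script", "region", "variant", "private")
SHAPES = {
    "script": re.compile(r"[a-z]{4}"),
    "region": re.compile(r"[a-z]{2}|[0-9]{3}"),
    "variant": re.compile(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}"),
}


def parse(text: str) -> dict[str, str]:
    """Split a tag or range into named fields; absent fields are omitted."""
    subtags = text.lower().split("-")
    fields = {"language": subtags[0]}
    i = 1
    # Extended language subtags are treated as part of the language.
    while i < len(subtags) and re.fullmatch(r"[a-z]{3}", subtags[i]):
        fields["language"] += "-" + subtags[i]
        i += 1
    slot = 1  # index in FIELDS of the next field that may be filled
    while i < len(subtags):
        s = subtags[i]
        if s == "x":  # private use: everything after "x"
            fields["private"] = "-".join(subtags[i + 1:])
            break
        if len(s) == 1 and s != "*":  # extension: skip it and its subtags
            i += 1
            while i < len(subtags) and len(subtags[i]) > 1:
                i += 1
            continue
        for k in range(slot, 4):  # script, region, variant
            name = FIELDS[k]
            if s == "*" or SHAPES[name].fullmatch(s):
                if name == "variant" and "variant" in fields:
                    fields["variant"] += "-" + s
                else:
                    fields[name] = s
                slot = k if name == "variant" else k + 1
                break
        else:
            raise ValueError(f"cannot place subtag {s!r} in {text!r}")
        i += 1
    return fields


def extended_match(language_range: str, language_tag: str,
                   expand_trailing: bool) -> bool:
    """expand_trailing=True sketches matching; False sketches lookup."""
    rng, tag = parse(language_range), parse(language_tag)
    last = max(k for k, name in enumerate(FIELDS) if name in rng)
    for k, name in enumerate(FIELDS):
        if name in rng:
            want = rng[name]
        elif k < last:
            want = "*"  # a field missing in the middle of the range
        elif expand_trailing or rng[FIELDS[last]] == "*":
            want = "*"  # trailing fields are open
        else:
            want = ""   # lookup: trailing fields must be absent
        if want != "*" and want != tag.get(name, ""):
            return False
    return True
```

   Under these assumptions, extended_match("zh", "zh-Hant", True) is
   true (matching expands the end of the range), while
   extended_match("zh", "zh-Hant", False) is false (lookup does not).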

361	2.2.3  Scored Matching

363	   In the "scored matching" scheme, the extended language range and the
364	   language tags are pre-normalized by mapping grandfathered and
365	   obsolete tags into modern equivalents.

367	   The language range and the language tags are normalized into
368	   quadruples of the form (language, script, country, variant), where
369	   extended language is considered part of language and x-private-codes
370	   are considered part of the language if they are initial and part of
371	   the variant if not initial.  Missing components are set to "*".  An
372	   "*" pattern becomes the quadruple ("*", "*", "*", "*").

374	   Each language tag being matched or filtered is assigned a "quality
375	   value" such that higher values indicate better matches and lower
376	   values indicate worse ones.  If the language matches, add 8 to the
377	   quality value.  If the script matches, add 4 to the quality value.

379	   If the region matches, add 2 to the quality value.  If the variant
380	   matches, add 1 to the quality value.  Elements of the quadruples are
381	   considered to match if they are the same or if one of them is "*".

383	   A value of 15 is a perfect match; 0 is no match at all.  Different
384	   values could be more or less appropriate for different applications
385	   and implementations SHOULD probably allow users to choose the most
386	   appropriate selection value.
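
   The scoring rule can be sketched in Python as follows.  This is a
   minimal illustration: the names Quad, WEIGHTS, and quality are
   hypothetical, and the function assumes that both the range and the
   tag have already been normalized into lowercase quadruples of the
   form (language, script, region, variant) with missing components set
   to "*".

```python
Quad = tuple[str, str, str, str]

# Weights for (language, script, region, variant), as given in the text.
WEIGHTS = (8, 4, 2, 1)


def quality(range_quad: Quad, tag_quad: Quad) -> int:
    """Return the quality value (0..15) of a tag against a range."""
    score = 0
    for weight, wanted, actual in zip(WEIGHTS, range_quad, tag_quad):
        # Elements match if they are the same or if either one is "*".
        if wanted == actual or "*" in (wanted, actual):
            score += weight
    return score
```

   Note that, read literally, wildcards contribute to the score even
   when the language differs: the range ("fr", "*", "*", "*") scores 7
   against the tag ("en", "latn", "us", "scouse").  An application that
   allows users to choose a selection value would need to take this
   into account.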

388	2.3  Meaning of Language Tags and Ranges

390	   A language tag defines a language as spoken (or written, signed or
391	   otherwise signaled) by human beings for communication of information
392	   to other human beings.

394	   If a language tag B contains language tag A as a prefix, then B is
395	   typically "narrower" or "more specific" than A. For example, "zh-
396	   Hant-TW" is more specific than "zh-Hant".

398	   This relationship is not guaranteed in all cases: specifically,
399	   languages that begin with the same sequence of subtags are NOT
400	   guaranteed to be mutually intelligible, although they might be.

402	   For example, the tag "az" shares a prefix with both "az-Latn"
403	   (Azerbaijani written using the Latin script) and "az-Cyrl"
404	   (Azerbaijani written using the Cyrillic script).  A person fluent in
405	   one script might not be able to read the other, even though the text
406	   might be otherwise identical.  Content tagged as "az" most probably
407	   is written in just one script and thus might not be intelligible to a
408	   reader familiar with the other script.

410	   Variant subtags in particular seem to represent specific divisions in
411	   mutual understanding, since they often encode dialects or other
412	   idiosyncratic variations within a language.

414	   The relationship between the language tag and the information it
415	   relates to is defined by the standard describing the context in which
416	   it appears.  Accordingly, this section can only give possible
417	   examples of its usage.

419	   o  For a single information object, the associated language tags
420	      might be interpreted as the set of languages that are necessary
421	      for a complete comprehension of the complete object.  Example:
422	      Plain text documents.

424	   o  For an aggregation of information objects, the associated language
425	      tags could be taken as the set of languages used inside components
426	      of that aggregation.  Examples: Document stores and libraries.

428	   o  For information objects whose purpose is to provide alternatives,
429	      the associated language tags could be regarded as a hint that the
430	      content is provided in several languages, and that one has to
431	      inspect each of the alternatives in order to find its language or
432	      languages.  In this case, the presence of multiple tags might not
433	      mean that one needs to be multi-lingual to get complete
434	      understanding of the document.  Example: MIME multipart/
435	      alternative.

437	   o  In markup languages, such as HTML and XML, language information
438	      can be added to each part of the document identified by the markup
439	      structure (including the whole document itself).  For example, one
440	      could write <span lang="FR">C'est la vie.</span> inside a
441	      Norwegian document; the Norwegian-speaking user could then access
442	      a French-Norwegian dictionary to find out what the marked section
443	      meant.  If the user were listening to that document through a
444	      speech synthesis interface, this information could be used to signal
445	      the synthesizer to appropriately apply French text-to-speech
446	      pronunciation rules to that span of text, instead of misapplying
447	      the Norwegian rules.

449	2.4  Choosing Between Alternate Matching Schemes

451	   Implementations MAY choose to implement different styles of matching
452	   for different kinds of processing.  For example, an implementation
453	   could treat an absent script subtag as a "wildcard" field; thus
454	   "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not
455	   "az" (this is extended range lookup).  If one item is to be chosen,
456	   the implementation could pick among those matches based on other
457	   information, such as the most likely script used in the language/
458	   region in question or the script used by other content selected.

460	   Because the primary language subtag cannot be absent in a language
461	   tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic
462	   matching.  For example, in a query where you want to select all
463	   language tags that contain 'Latn' as the script code and 'AZ' as the
464	   region code, you could use the range "und-Latn-AZ".  This requires an
465	   implementation to examine the actual values of the subtags, though.
466	   The matching schemes described elsewhere in this document are
467	   designed such that implementations do not have to examine the values
468	   or subtags supplied and, except for scored matching, they do not need
469	   access to the Language Subtag Registry nor the use of valid subtags
470	   in language tags or ranges.  This has great benefit for speed and
471	   simplicity of implementation.

473	   Implementations might also wish to use semantic information external
474	   to the language tags when performing fallback.  For example, the
475	   primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
476	   Norwegian) might both be usefully matched to the more general subtag
477	   'no' (Norwegian).  Or an application might infer that content labeled
478	   "zh-CN" is more likely to match the range "zh-Hans" than equivalent
479	   content labeled "zh-TW".

481	2.5  Considerations for Private Use Subtags

483	   Private-use subtags require private agreement between the parties
484	   that intend to use or exchange language tags that use them and great
485	   caution SHOULD be used in employing them in content or protocols
486	   intended for general use.  Private-use subtags are simply useless for
487	   information exchange without prior arrangement.

489	   The value and semantic meaning of private-use tags and of the subtags
490	   used within such a language tag are not defined.  Matching private
491	   use tags using language ranges or extended language ranges can result
492	   in unpredictable content being returned.

494	2.6  Length Considerations in Matching

496	   RFC 3066 [19] did not provide an upper limit on the size of language
497	   tags or ranges.  RFC 3066 did define the semantics of particular
498	   subtags in such a way that most language tags or ranges consisted of
499	   language and region subtags with a combined total length of up to six
500	   characters.  Larger tags and ranges (in terms of both subtags and
501	   characters) did exist, however.

503	   [1] also does not impose a fixed upper limit on the number of subtags
504	   in a language tag or range (and thus an upper bound on the size of
505	   either).  The syntax in that document suggests that, depending on the
506	   specific language or range of languages, more subtags (and thus
507	   characters) are sometimes necessary as a result.  Length
508	   considerations and their impact on the selection and processing of
509	   tags are described in Section 2.1.1 of that document.

511	   A matching implementation MAY choose to limit the length of the
512	   language tags or ranges used in matching.  Any such limitation SHOULD
513	   be clearly documented, and such documentation SHOULD include the
514	   disposition of any longer tags or ranges (for example, whether an
515	   error value is generated or the language tag or range is truncated).
516	   If truncation is permitted, it MUST NOT permit a subtag to be divided,
517	   since this changes the semantics of the subtag being matched and can
518	   result in false positives or negatives.
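
   To illustrate why a length limit has to fall on a subtag boundary,
   the following Python sketch tests whether a shortened value cut the
   original tag in the middle of a subtag.  The function name is
   hypothetical and chosen only for this example.

```python
def divides_subtag(original, shortened):
    """Return True if `shortened` ends in the middle of a subtag.

    `shortened` is expected to be a leading portion of `original`.
    A cut is clean only when the next character of the original is
    the "-" that separates two subtags, or when nothing was removed.
    """
    if not original.startswith(shortened):
        raise ValueError("shortened value is not a prefix of the original")
    if len(shortened) == len(original):
        return False  # nothing was removed
    return original[len(shortened)] != "-"
```

   For example, cutting "zh-Hant-CN" to nine characters gives
   "zh-Hant-C", which divides the region subtag "CN" and is no longer
   the tag that was intended; cutting it to "zh-Hant" leaves only
   whole subtags.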

520	   Implementations that restrict storage SHOULD consider the impact of
521	   tag or range truncation on the resulting matches.  For example,
522	   removing the "*" from the end of an extended language range (see
523	   Section 2.2) can greatly modify the set of returned matches.  A
524	   protocol that allows tags or ranges to be truncated at an arbitrary
525	   limit, without giving any indication of what that limit is, has the
526	   potential for causing harm by changing the meaning of values in
527	   substantial ways.

529	   In practice, most tags do not require additional subtags or
530	   substantially more characters.  Additional subtags sometimes add
531	   useful distinguishing information, but extraneous subtags interfere
532	   with the meaning, understanding, and especially matching of language
533	   tags.  Since language tags or ranges MAY be truncated by an
534	   application or protocol that limits storage, when choosing language
535	   tags or ranges users and applications SHOULD avoid adding subtags
536	   that add no distinguishing value.  In particular, users and
537	   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
538	   fields in the registry (defined in Section 3.6 of [1]): these fields
539	   provide guidance on when specific additional subtags SHOULD (and
540	   SHOULD NOT) be used.

542	   Implementations MUST support a limit of at least 33 characters.  This
543	   limit includes at least one subtag of each non-extension, non-private
544	   use type.  When choosing a buffer limit, a length of at least 42
545	   characters is strongly RECOMMENDED.

547	   The practical limit on tags or ranges derived solely from registered
548	   values is 42 characters.  Implementations MUST be able to handle tags
549	   and ranges of this length.  Support for tags and ranges of at least
550	   62 characters in length is RECOMMENDED.  Implementations MAY support
551	   longer values, including matching extensive sets of private use or
552	   extension subtags.

554	   Applications or protocols which have to truncate a tag MUST do so by
555	   progressively removing subtags along with their preceding "-" from
556	   the right side of the language tag until the tag is short enough for
557	   the given buffer.  If the resulting tag ends with a single-character
558	   subtag, that subtag and its preceding "-" MUST also be removed.  For
559	   example:

561	   Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
562	   1. zh-Hant-CN-variant1-a-extend1-x-wadegile
563	   2. zh-Hant-CN-variant1-a-extend1
564	   3. zh-Hant-CN-variant1
565	   4. zh-Hant-CN
566	   5. zh-Hant
567	   6. zh

569	                    Figure 4: Example of Tag Truncation
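
   The truncation rule above can be sketched in Python as follows.
   This is an illustration under stated assumptions rather than a
   reference implementation: the function names are hypothetical, and
   the choice to raise an error when no part of the tag fits the
   buffer is a decision of this sketch, not a requirement of this
   document.

```python
def remove_last_subtag(tag):
    """Remove the rightmost subtag and its preceding "-".

    If the result would end with a single-character subtag, that
    subtag and its preceding "-" are removed as well.
    """
    subtags = tag.split("-")[:-1]
    if subtags and len(subtags[-1]) == 1:
        subtags = subtags[:-1]
    return "-".join(subtags)


def truncation_steps(tag):
    """List each successively shorter form of the tag."""
    steps = []
    shorter = remove_last_subtag(tag)
    while shorter:
        steps.append(shorter)
        shorter = remove_last_subtag(shorter)
    return steps


def truncate_tag(tag, max_length):
    """Shorten the tag, whole subtags at a time, until it fits."""
    while len(tag) > max_length:
        tag = remove_last_subtag(tag)
        if not tag:
            # Assumption of this sketch: report an error rather than
            # return an empty value.
            raise ValueError("no prefix of the tag fits in max_length")
    return tag
```

   Applied to the tag in Figure 4, truncation_steps() produces the six
   values listed there, in the same order.  With a buffer of 33
   characters, truncate_tag() returns "zh-Hant-CN-variant1-a-extend1",
   which is 29 characters long.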

571	3.  IANA Considerations

573	   This document presents no new or existing considerations for IANA.

575	4.  Changes

577	   This is the first version of this document.

579	   The following changes were put into this document since draft-00:

581	      Fixed text in the introduction that is no longer accurate.
582	      Specifically, there no longer is a default matching algorithm.
583	      (A.Phillips)

585	      Fixed text in Section 2.1 which incorrectly discussed the default
586	      fallback mechanism.  (A.Phillips)

588	      Minor changes to Section 2.3, in particular, the addition of the
589	      'variant' paragraph and some tidying of the text.  (A.Phillips)

591	      Fixed a minor glitch in the ABNF caused by taking the output of
592	      Bill Fenner's parser and not looking too closely at it.  (M.Patton)

594	      Fixed some minor reference problems.  (M.Patton)

596	      Added Section 2.6 on length considerations in matching.
597	      (R.Presuhn)

599	      Copied various materials from the length considerations section of
600	      the registry draft to keep the two documents in sync.
601	      (A.Phillips)

603	5.  Security Considerations

605	   The only security issue that has been raised with language tags since
606	   the publication of RFC 1766, which stated that "Security issues are
607	   believed to be irrelevant to this memo", is a concern with language
608	   ranges used in content negotiation - that they might be used to infer
609	   the nationality of the sender, and thus identify potential targets
610	   for surveillance.

612	   This is a special case of the general problem that anything you send
613	   is visible to the receiving party.  It is useful to be aware that
614	   such concerns can exist in some cases.

616	   The evaluation of the exact magnitude of the threat, and any possible
617	   countermeasures, is left to each application protocol.

619	   Although the specification of valid subtags for an extension MUST be
620	   available over the Internet, implementations SHOULD NOT mechanically
621	   depend on it being always accessible, to prevent denial-of-service
622	   attacks.

624	6.  Character Set Considerations

626	   The syntax in this document requires that language ranges use only
627	   the characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are legal in
628	   language tags.  These characters are present in most character
629	   sets, so presentation of language tags should not have any
630	   character set issues.
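
   A minimal Python sketch of a check for this repertoire is shown
   below.  The function name is hypothetical.  The check covers only
   the characters named in the paragraph above; it does not validate
   the syntax of a tag or range, and it deliberately does not accept
   the "*" wildcard mentioned in Section 2.6.

```python
import re

# ASCII letters, ASCII digits, and HYPHEN-MINUS only.
_TAG_CHARACTERS = re.compile(r"[A-Za-z0-9-]+")


def uses_only_tag_characters(value):
    """Return True if every character of a non-empty value is allowed."""
    return _TAG_CHARACTERS.fullmatch(value) is not None
```

   Under this check "zh-Hant" is accepted, while "fr_CA" is rejected
   because of the underscore.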

632	   Rendering of characters based on the content of a language tag is not
633	   addressed in this memo.  Historically, some languages have relied on
634	   the use of specific character sets or other information in order to
635	   infer how a specific character should be rendered (notably this
636	   applies to language and culture specific variations of Han ideographs
637	   as used in Japanese, Chinese, and Korean).  When language tags are
638	   applied to spans of text, rendering engines sometimes use that
639	   information in deciding which font to use in the absence of other
640	   information, particularly where languages with distinct writing
641	   traditions use the same characters.

643	7.  References

645	7.1  Normative References

647	   [1]   Phillips, A., Ed. and M. Davis, Ed., "Tags for the
648	         Identification of Languages (Internet-Draft)", June 2005,
649	         <http://www.ietf.org/internet-drafts/
650	         draft-ietf-ltru-registry-03.txt>.

652	   [2]   Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021
653	         and RFC 822", RFC 1327, May 1992.

655	   [3]   Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
656	         Extensions) Part One: Mechanisms for Specifying and Describing
657	         the Format of Internet Message Bodies", RFC 1521,
658	         September 1993.

660	   [4]   Hovey, R. and S. Bradner, "The Organizations Involved in the
661	         IETF Standards Process", BCP 11, RFC 2028, October 1996.

663	   [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
664	         Levels", BCP 14, RFC 2119, March 1997.

666	   [6]   Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word
667	         Extensions: Character Sets, Languages, and Continuations",
668	         RFC 2231, November 1997.

670	   [7]   Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
671	         Specifications: ABNF", RFC 2234, November 1997.

673	   [8]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
674	         Resource Identifiers (URI): Generic Syntax", RFC 2396,
675	         August 1998.

677	   [9]   Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
678	         Considerations Section in RFCs", BCP 26, RFC 2434,
679	         October 1998.

681	   [10]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
682	         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
683	         HTTP/1.1", RFC 2616, June 1999.

685	   [11]  Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
686	         Understanding Concerning the Technical Work of the Internet
687	         Assigned Numbers Authority", RFC 2860, June 2000.

689	   [12]  Yergeau, F., "UTF-8, a transformation format of ISO 10646",
690	         STD 63, RFC 3629, November 2003.

692	7.2  Informative References

694	   [13]  International Organization for Standardization, "ISO
695	         639-1:2002, Codes for the representation of names of languages
696	         -- Part 1: Alpha-2 code", ISO Standard 639, 2002.

698	   [14]  International Organization for Standardization, "ISO 639-2:1998
699	         - Codes for the representation of names of languages -- Part 2:
700	         Alpha-3 code - edition 1", 1998.

702	   [15]  ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
703	         representation of names of scripts", January 2004.

705	   [16]  International Organization for Standardization, "Codes for the
706	         representation of names of countries, 3rd edition",
707	         ISO Standard 3166, August 1988.

709	   [17]  Statistical Division, United Nations, "Standard Country or Area
710	         Codes for Statistical Use", UN Standard Country or Area Codes
711	         for Statistical Use, Revision 4 (United Nations publication,
712	         Sales No. 98.XVII.9), June 1999.

714	   [18]  Alvestrand, H., "Tags for the Identification of Languages",
715	         RFC 1766, March 1995.

717	   [19]  Alvestrand, H., "Tags for the Identification of Languages",
718	         BCP 47, RFC 3066, January 2001.

720	   [20]  Klyne, G. and C. Newman, "Date and Time on the Internet:
721	         Timestamps", RFC 3339, July 2002.

723	Authors' Addresses

725	   Addison Phillips (editor)
726	   Quest Software

728	   Email: addison dot phillips at quest dot com

730	   Mark Davis (editor)
731	   IBM

733	   Email: mark dot davis at ibm dot com

735	Appendix A.  Acknowledgements

737	   Any list of contributors is bound to be incomplete; please regard the
738	   following as only a selection from the group of people who have
739	   contributed to make this document what it is today.

741	   The contributors to RFC 3066 and RFC 1766, the precursors of this
742	   document, made enormous contributions directly or indirectly to this
743	   document and are generally responsible for the success of language
744	   tags.

746	   The following people (in alphabetical order) contributed to this
747	   document or to RFCs 1766 and 3066:

749	   Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
750	   Nathaniel Borenstein, Eric Brunner, Sean M. Burke, Jeremy Carroll,
751	   John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin,
752	   Dave Crocker, Martin Duerst, Michael Everson, Doug Ewell, Ned Freed,
753	   Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpern,
754	   Elliotte Rusty Harold, Paul Hoffman, Richard Ishida, Olle Jarnefors,
755	   Kent Karlsson, John Klensin, Alain LaBonte, Eric Mader, Keith Moore,
756	   Chris Newman, Masataka Ohta, Michael S. Patton, Randy Presuhn, George
757	   Rhoten, Markus Scherer, Keld Jorn Simonsen, Thierry Sourbier, Otto
758	   Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha Wolf, Francois
759	   Yergeau and many, many others.

761	   Very special thanks must go to Harald Tveit Alvestrand, who
762	   originated RFCs 1766 and 3066, and without whom this document would
763	   not have been possible.  Special thanks must go to Michael Everson,
764	   who has served as language tag reviewer for almost the complete
765	   period since the publication of RFC 1766.  Special thanks to Doug
766	   Ewell, for his production of the first complete subtag registry, and
767	   his work in producing a test parser for verifying language tags.

769	   For this particular document, John Cowan originated the scheme
770	   described in Section 2.2.3.  Mark Davis originated the scheme
771	   described in Section 2.1.2.

773	Intellectual Property Statement

775	   The IETF takes no position regarding the validity or scope of any
776	   Intellectual Property Rights or other rights that might be claimed to
777	   pertain to the implementation or use of the technology described in
778	   this document or the extent to which any license under such rights
779	   might or might not be available; nor does it represent that it has
780	   made any independent effort to identify any such rights.  Information
781	   on the procedures with respect to rights in RFC documents can be
782	   found in BCP 78 and BCP 79.

784	   Copies of IPR disclosures made to the IETF Secretariat and any
785	   assurances of licenses to be made available, or the result of an
786	   attempt made to obtain a general license or permission for the use of
787	   such proprietary rights by implementers or users of this
788	   specification can be obtained from the IETF on-line IPR repository at
789	   http://www.ietf.org/ipr.

791	   The IETF invites any interested party to bring to its attention any
792	   copyrights, patents or patent applications, or other proprietary
793	   rights that may cover technology that may be required to implement
794	   this standard.  Please address the information to the IETF at
795	   ietf-ipr@ietf.org.

797	Disclaimer of Validity

799	   This document and the information contained herein are provided on an
800	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
801	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
802	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
803	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
804	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
805	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

807	Copyright Statement

809	   Copyright (C) The Internet Society (2005).  This document is subject
810	   to the rights, licenses and restrictions contained in BCP 78, and
811	   except as set forth therein, the authors retain all their rights.

813	Acknowledgment

815	   Funding for the RFC Editor function is currently provided by the
816	   Internet Society.