idnits 2.17.1 

draft-ietf-ltru-matching-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 762.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 739.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 746.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 752.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 169 has weird spacing: '...schemes  that ...'

  == Line 170 has weird spacing: '...ing and  looku...'

  == Line 374 has weird spacing: '...age tag  being...'

  == Line 467 has weird spacing: '...ch that  imple...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 28, 2005) is 6877 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC1327' is defined on line 618, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1521' is defined on line 621, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2028' is defined on line 626, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2231' is defined on line 633, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2234' is defined on line 637, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2396' is defined on line 640, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2434' is defined on line 644, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2860' is defined on line 652, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3629' is defined on line 656, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO639-1' is defined on line 661, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO639-2' is defined on line 666, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO15924' is defined on line 672, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO3166' is defined on line 676, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3339' is defined on line 691, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-14) exists of
     draft-ietf-ltru-registry-07

  ** Obsolete normative reference: RFC 1327 (Obsoleted by RFC 2156)

  ** Obsolete normative reference: RFC 1521 (Obsoleted by RFC 2045, RFC 2046,
     RFC 2047, RFC 2048, RFC 2049)

  ** Obsolete normative reference: RFC 2028 (Obsoleted by RFC 9281)

  ** Obsolete normative reference: RFC 2234 (Obsoleted by RFC 4234)

  ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2434 (Obsoleted by RFC 5226)

  ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
     RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 2860

  -- Obsolete informational reference (is this intentional?): RFC 1766
     (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)


     Summary: 11 errors (**), 0 flaws (~~), 22 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                            Quest Software
4	Expires: December 30, 2005                                 M. Davis, Ed.
5	                                                                     IBM
6	                                                           June 28, 2005

8	           Matching Tags for the Identification of Languages
9	                      draft-ietf-ltru-matching-03

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on December 30, 2005.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2005).

40	Abstract

42	   This document describes different mechanisms for comparing, matching,
43	   and evaluating language tags.  Possible algorithms for language
44	   negotiation and content selection are described.

46	Table of Contents

48	   1.  Introduction
49	   2.  The Language Range
50	     2.1   Basic Language Range
51	       2.1.1   Matching
52	       2.1.2   Lookup
53	     2.2   Extended Language Range
54	       2.2.1   Extended Range Matching
55	       2.2.2   Extended Range Lookup
56	       2.2.3   Scored Matching
57	     2.3   Meaning of Language Tags and Ranges
58	     2.4   Choosing Between Alternate Matching Schemes
59	     2.5   Considerations for Private Use Subtags
60	     2.6   Length Considerations in Matching
61	   3.  IANA Considerations
62	   4.  Changes
63	   5.  Security Considerations
64	   6.  Character Set Considerations
65	   7.  References
66	     7.1   Normative References
67	     7.2   Informative References
68	       Authors' Addresses
69	   A.  Acknowledgements
70	       Intellectual Property and Copyright Statements

72	1.  Introduction

74	   Human beings on our planet have, past and present, used a number of
75	   languages.  There are many reasons why one would want to identify the
76	   language used when presenting or requesting information.

78	   Information about a user's language preferences commonly needs to be
79	   identified so that appropriate processing can be applied.  For
80	   example, the user's language preferences in a browser can be used to
81	   select web pages appropriately.  A choice of language preference can
82	   also be used to select among tools (such as dictionaries) to assist
83	   in the processing or understanding of content in different languages.

85	   Given a set of language identifiers, such as those defined in
86	   [ID.ietf-ltru-registry], various mechanisms can be envisioned for
87	   performing language negotiation and tag matching.  The suitability of
88	   a particular mechanism to a particular application depends on the
89	   needs of that application.

91	   This document defines language ranges and syntax for specifying user
92	   preferences in a request for language content.  It also specifies
93	   various schemes and mechanisms that can be used with language ranges
94	   when matching or filtering content based on language tags.

96	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
97	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
98	   document are to be interpreted as described in [RFC2119].

100	2.  The Language Range

102	   Language Tags are used to identify the language of some information
103	   item or content.  Applications that use language tags are often faced
104	   with the problem of identifying sets of content that share certain
105	   language attributes.  For example, HTTP 1.1 [RFC2616] describes
106	   language ranges in its discussion of the Accept-Language header
107	   (Section 14.4), which is used for selecting content from servers
108	   based on the language of that content.

110	   When selecting content according to its language, it is useful to
111	   have a mechanism for identifying sets of language tags that share
112	   specific attributes.  This allows users to select or filter content
113	   based on specific requirements.  Such an identifier is called a
114	   "Language Range".

116	2.1  Basic Language Range

118	   A basic language range (such as described in [RFC3066] and HTTP 1.1
119	   [RFC2616]) is a set of languages whose tags all begin with the same
120	   sequence of subtags.  A basic language range can be represented by a
121	   'language-range' tag, using the definition from HTTP/1.1 [RFC2616]:

123	   language-range = language-tag / "*"

125	   That is, a language-range has the same syntax as a language-tag or is
126	   the single character "*".  This definition of language-range implies
127	   that there is a semantic relationship between tags that share the
128	   same prefix.

130	   However, the set of language tags that match a specific
131	   language-range might not all be mutually intelligible.  The use of a
132	   prefix when matching tags to language ranges does not imply that
133	   language tags are assigned to languages in such a way that it is
134	   always true that if a user understands a language with a certain tag,
135	   then this user will also understand all languages with tags for which
136	   this tag is a prefix.  The prefix rule simply allows the use of
137	   prefix tags if this is the case.

139	   When working with tags and ranges you SHOULD also note the following:

141	   1.  Private-use and Extension subtags are normally orthogonal to
142	       language tag fallback.  Implementations SHOULD ignore
143	       unrecognized private-use and extension subtags when performing
144	       language tag fallback.  Since these subtags are always at the end
145	       of the sequence of subtags, they don't normally interfere with
146	       the use of prefixes for matching in the schemes described below.

148	   2.  Implementations that choose not to interpret one or more private-
149	       use or extension subtags SHOULD NOT remove or modify these
150	       extensions in content that they are processing.  When a language
151	       tag instance is to be used in a specific, known protocol, and is
152	       not being passed through to other protocols, language tags MAY be
153	       filtered to remove subtags and extensions that are not supported
154	       by that protocol.  Such filtering SHOULD be avoided, if possible,
155	       since it removes information that might be relevant if services
156	       on the other end of the protocol would make use of that
157	       information.

159	   3.  Some applications of language tags might want or need to consider
160	       extensions and private-use subtags when matching tags.  If
161	       extensions and private-use subtags are included in a matching or
162	       filtering process that utilizes one of the schemes described
163	       in this document, then the implementation SHOULD canonicalize the
164	       language tags and/or ranges before performing the matching.  Note
165	       that language tag processors that claim to be "well-formed"
166	       processors as defined in [ID.ietf-ltru-registry] generally fall
167	       into this category.

169	   There are two matching schemes  that are commonly associated with
170	   basic language ranges:  matching and  lookup.

172	2.1.1  Matching

174	   Language tag matching is used to select all content that matches a
175	   given prefix.  In matching, the language range represents the least
176	   specific tag which is an acceptable match and every piece of content
177	   that matches is returned.

179	   For example, if an application is applying a style to all content in
180	   a web page in a particular language, it might use language tag
181	   matching to select the content to which the style is applied.

183	   A language-range matches a language-tag if it exactly equals the tag,
184	   or if it exactly equals a prefix of the tag such that the first
185	   character following the prefix is "-".  Comparison is not case
186	   sensitive: the language-range "de-de" matches the language tag
187	   "de-DE-1996", but not the language tag "de-Deva".

189	   The special range "*" matches any tag.  A protocol which uses
190	   language ranges MAY specify additional rules about the semantics of
191	   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
192	   languages not matched by any other range within an "Accept-Language:"
193	   header.
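
The prefix rule above can be sketched in a few lines of Python.  This is a minimal illustration, not a normative implementation; the function names basic_match and filter_tags are hypothetical, and the protocol-specific handling of "*" (such as the HTTP/1.1 rule) is deliberately left out.

```python
def basic_match(language_range: str, language_tag: str) -> bool:
    """Return True if a basic language range matches a tag (prefix rule)."""
    if language_range == "*":
        # The special range "*" matches any tag.
        return True
    rng = language_range.lower()  # comparison is not case sensitive
    tag = language_tag.lower()
    # Exact match, or a prefix that is followed immediately by "-".
    return tag == rng or tag.startswith(rng + "-")


def filter_tags(language_range: str, tags: list) -> list:
    """Matching returns every tag that matches, not just the best one."""
    return [t for t in tags if basic_match(language_range, t)]
```

For instance, filter_tags("de", ["de", "de-CH", "den", "en"]) keeps "de" and "de-CH" but not "den", because the character after the prefix "de" in "den" is not "-".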

195	2.1.2  Lookup

197	   Content lookup is used to select the single information item that
198	   best matches the language range for a given request.  In lookup, the
199	   language range represents the most specific tag which is an
200	   acceptable match and only the closest matching item is returned.

202	   For example, if an application inserts some dynamic content into a
203	   web page, returning an empty string if there is no exact match is not
204	   an option.  Instead, the application "falls back".

206	   When performing lookup, the language range is progressively truncated
207	   from the end until a matching piece of content is located.  For
208	   example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
209	   would progressively search for content as shown below:

211	   Range to match: zh-Hant-CN-x-wadegile
212	   1. zh-Hant-CN-x-wadegile
213	   2. zh-Hant-CN
214	   3. zh-Hant
215	   4. zh
216	   5. (default content or the empty tag)

218	                Figure 2: Default Fallback Pattern Example

220	   This scheme allows some flexibility in finding content.  When data
221	   is not available at a specific level of tag granularity, or is
222	   sparsely populated, it also typically provides better results than
223	   using the default language for the system or content.
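
The progressive truncation can be sketched as follows.  This is an illustrative sketch under stated assumptions: the names fallback_chain and lookup are hypothetical, the available content is assumed to be a dictionary keyed by language tag, and a dangling single-character subtag (such as "x") is assumed to be removed together with the subtag that followed it, which is what the fallback pattern in Figure 2 suggests.

```python
def fallback_chain(language_range: str) -> list:
    """List the ranges to try, from most specific to least specific."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()  # truncate from the end
        # Assumption: never leave a lone singleton such as "x" at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(language_range: str, content: dict, default=None):
    """Return the single closest item, or the default if nothing matches."""
    by_tag = {tag.lower(): item for tag, item in content.items()}
    for candidate in fallback_chain(language_range):
        key = candidate.lower()  # comparison is not case sensitive
        if key in by_tag:
            return by_tag[key]
    return default
```

With content available only for "zh-Hant" and "zh", a lookup for "zh-Hant-CN-x-wadegile" in this sketch returns the "zh-Hant" item on the third step.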

225	2.2  Extended Language Range

227	   Prefix matching using a Basic Language Range, as described above, is
228	   not always the most appropriate way to access the information
229	   contained in language tags when selecting or filtering content.  Some
230	   applications might wish to define a more granular matching scheme and
231	   such a matching scheme requires the ability to specify the various
232	   attributes of a language tag in the language range.  An extended
233	   language range can be represented by the following ABNF:

235	   extended-language-range = grandfathered / privateuse / range
236	   range   = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
237	                [ "-" privateuse ] )
238	   lang    = 2*8ALPHA / extlang / "*"
239	   extlang = 2*3ALPHA *2("-" 3ALPHA) ( "-" ( 3ALPHA / "*" ) )
240	   script  = 4ALPHA / "*"
241	   region  = 2ALPHA / 3DIGIT / "*"
242	   variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
243	   privateuse    = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
244	   grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
245	   alphanum      = ( ALPHA / DIGIT )

247	   In an extended language range, the identifier takes the form of a
248	   series of subtags which must consist of well-formed subtags or the
249	   special subtag "*".  For example, the language range "en-*-US"
250	   specifies a primary language of 'en', followed by any script subtag,
251	   followed by the region subtag 'US'.

253	   A field not present in the middle of an extended language range MAY
254	   be treated as if the field contained a "*".  For example, the range
255	   "en-US" MAY be considered to be equivalent to the range "en-*-US".

257	   There are several matching algorithms or schemes which can be applied
258	   when matching extended language ranges to language tags.

260	2.2.1  Extended Range Matching

262	   In extended range matching, the subtags in a language tag are
263	   compared to the corresponding subtags in the extended language range.
264	   A subtag is considered to match if it exactly matches the
265	   corresponding subtag in the range or the range contains a subtag with
266	   the value "*" (which matches all subtags, including the empty
267	   subtag).  Extended Range Matching is an extension of basic matching
268	   (Section 2.1.1): the language range represents the least specific tag
269	   which is an acceptable match.

271	   By default all extensions and their subtags are ignored for extended
272	   language range matching.

274	   Private use subtags MAY be specified in the language range and MUST
275	   NOT be ignored when matching.

277	   Subtags not specified, including those at the end of the language
278	   range, are assigned the value "*".  This makes each range into a
279	   prefix much like that used in basic language range matching.  For
280	   example, the extended language range "zh-*-CN" matches all of the
281	   following tags because the unspecified variant field is expanded to
282	   "*":

284	      zh-Hant-CN

286	      zh-CN

288	      zh-Hans-CN

290	      zh-CN-x-wadegile

292	      zh-Latn-CN-boont
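
One possible reading of extended range matching is sketched below in Python.  It is an interpretation, not the definitive algorithm: the names classify, to_fields and extended_match are hypothetical, subtags are assigned to fields by their shape alone (no registry access), extended language subtags are not handled, and additional variants in the tag are treated as unspecified in the range.

```python
import re


def classify(subtag: str) -> int:
    """Guess a field from subtag shape: 1=script, 2=region, 3=variant."""
    if re.fullmatch(r"[a-z]{4}", subtag):
        return 1
    if re.fullmatch(r"[a-z]{2}|[0-9]{3}", subtag):
        return 2
    return 3


def to_fields(value: str, missing: str) -> list:
    """Split into [language, script, region, variant, privateuse]."""
    head, _, private = ("-" + value.lower()).partition("-x-")
    subtags = head[1:].split("-")
    fields = [subtags[0], missing, missing, missing,
              "x-" + private if private else missing]
    slot = 1  # next field that may still be filled
    for sub in subtags[1:]:
        if len(sub) == 1 and sub != "*":
            break  # extension singleton: ignore it and its subtags
        target = slot if sub == "*" else max(slot, classify(sub))
        target = min(target, 3)
        if target == 3 and fields[3] != missing:
            fields[3] += "-" + sub  # collect further variants
        else:
            fields[target] = sub
        slot = min(target + 1, 3)
    return fields


def extended_match(language_range: str, tag: str) -> bool:
    """Compare field by field; unspecified range fields act as "*"."""
    wanted = to_fields(language_range, "*")
    actual = to_fields(tag, "")
    for index, (w, a) in enumerate(zip(wanted, actual)):
        if w == "*" or w == a:
            continue  # "*" matches any subtag, including the empty one
        if index == 3 and a.startswith(w + "-"):
            continue  # extra variants in the tag are unspecified in the range
        return False
    return True
```

Under these assumptions the range "zh-*-CN" matches each of the five tags listed above, while a range that names a private use sequence, such as "zh-CN-x-wadegile", does not match the plain tag "zh-CN".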

294	2.2.2  Extended Range Lookup

296	   In extended range lookup, the subtags in a language tag are compared
297	   to the corresponding subtags in the extended language range.  The
298	   subtag is considered to match if it exactly matches the corresponding
299	   subtag in the range or the range contains a subtag with the value "*"
300	   (which matches all subtags, including the empty subtag).  Extended
301	   language range lookup is an extension of basic lookup
302	   (Section 2.1.2): the language range represents the most specific tag
303	   which will form an acceptable match.

305	   Subtags not specified are assigned the value "*" prior to performing
306	   tag matching.  Unlike in extended range matching, however, fields at
307	   the end of the range MUST NOT be expanded in this manner.  For
308	   example, "en-US" MUST NOT be considered to be the same as the range
309	   "en-US-*".  This allows ranges to be specific.  The "*" wildcard MUST
310	   be used at the end of the range to indicate that all tags with the
311	   range as a prefix are allowable matches.  That is, the range "zh-*"
312	   matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
313	   matches neither of those tags.

315	   The wildcard "*" at the end of a range SHOULD be considered to match
316	   any private use subtag sequences (making extended language range
317	   lookup function exactly like extended range matching, Section 2.2.1).

319	   By default all extensions and their subtags SHOULD be ignored for
320	   extended language range lookup.  Private use subtags MAY be specified
321	   in the language range and MUST NOT be ignored when performing lookup.
322	   The wildcard "*" at the end of a range SHOULD be considered to match
323	   any private use subtag sequences in addition to variants.

325	   For example, the range "*-US" matches all of the following tags:

327	      en-US

329	      en-Latn-US
330	      en-US-r-extends (extensions are ignored)

332	      fr-US

334	   For example, the range "en-*-US" matches _none_ of the following
335	   tags:

337	      fr-US

339	      en (missing region US)

341	      en-Latn (missing region US)

343	      en-Latn-US-scouse (variant field is present)

345	   For example, the range "en-*" matches all of the following tags:

347	      en-Latn

349	      en-Latn-US

351	      en-Latn-US-scouse

353	      en-US

355	      en-scouse

357	   Note that the ability to be specific in extended range lookup can
358	   make this matching scheme a more appropriate replacement for basic
359	   matching than the extended range matching scheme.

361	2.2.3  Scored Matching

363	   In the "scored matching" scheme, the extended language range and the
364	   language tags are pre-normalized by mapping grandfathered and
365	   obsolete tags into modern equivalents.

367	   The language range and the language tags are normalized into
368	   quadruples of the form (language, script, country, variant), where
369	   extended language is considered part of language and x-private-codes
370	   are considered part of the language if they are initial and part of
371	   the variant if not initial.  Missing components are set to "*".  An
372	   "*" pattern becomes the quadruple ("*", "*", "*", "*").

374	   Each language tag  being matched or filtered is assigned a "quality
375	   value" such that higher values indicate better matches and lower
376	   values indicate worse ones.  If the language matches, add 8 to the
377	   quality value.  If the script matches, add 4 to the quality value.

379	   If the region matches, add 2 to the quality value.  If the variant
380	   matches, add 1 to the quality value.  Elements of the quadruples are
381	   considered to match if they are the same or if one of them is "*".

383	   A value of 15 is a perfect match; 0 is no match at all.  Different
384	   values could be more or less appropriate for different applications
385	   and implementations SHOULD probably allow users to choose the most
386	   appropriate selection value.
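
The scoring rule can be illustrated with a short Python sketch.  It assumes that the range and the tags have already been normalized into (language, script, region, variant) quadruples as described above, with missing components set to "*"; the names quality and rank are hypothetical.

```python
WEIGHTS = (8, 4, 2, 1)  # language, script, region, variant


def quality(range_quad: tuple, tag_quad: tuple) -> int:
    """Sum the weights of the components that match (15 = perfect)."""
    score = 0
    for weight, wanted, actual in zip(WEIGHTS, range_quad, tag_quad):
        # Components match if they are the same or if either one is "*".
        if wanted == "*" or actual == "*" or wanted.lower() == actual.lower():
            score += weight
    return score


def rank(range_quad: tuple, tag_quads: list) -> list:
    """Order candidate quadruples from best match to worst."""
    return sorted(tag_quads, key=lambda q: quality(range_quad, q),
                  reverse=True)
```

Note that a missing component counts as a match because it is normalized to "*".  In this sketch the range ("en", "*", "US", "*") therefore gives ("en", "Latn", "GB", "*") a quality value of 13 rather than 8.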

388	2.3  Meaning of Language Tags and Ranges

390	   A language tag defines a language as spoken (or written, signed or
391	   otherwise signaled) by human beings for communication of information
392	   to other human beings.

394	   If a language tag B contains language tag A as a prefix, then B is
395	   typically "narrower" or "more specific" than A. For example, "zh-
396	   Hant-TW" is more specific than "zh-Hant".

398	   This relationship is not guaranteed in all cases: specifically,
399	   languages that begin with the same sequence of subtags are NOT
400	   guaranteed to be mutually intelligible, although they might be.

402	   For example, the tag "az" shares a prefix with both "az-Latn"
403	   (Azerbaijani written using the Latin script) and "az-Cyrl"
404	   (Azerbaijani written using the Cyrillic script).  A person fluent in
405	   one script might not be able to read the other, even though the text
406	   might be otherwise identical.  Content tagged as "az" most probably
407	   is written in just one script and thus might not be intelligible to a
408	   reader familiar with the other script.

410	   Variant subtags in particular seem to represent specific divisions in
411	   mutual understanding, since they often encode dialects or other
412	   idiosyncratic variations within a language.

414	   The relationship between the language tag and the information it
415	   relates to is defined by the standard describing the context in which
416	   it appears.  Accordingly, this section can only give possible
417	   examples of its usage.

419	   o  For a single information object, the associated language tags
420	      might be interpreted as the set of languages that are necessary
421	      for a complete comprehension of the complete object.  Example:
422	      Plain text documents.

424	   o  For an aggregation of information objects, the associated language
425	      tags could be taken as the set of languages used inside components
426	      of that aggregation.  Examples: Document stores and libraries.

428	   o  For information objects whose purpose is to provide alternatives,
429	      the associated language tags could be regarded as a hint that the
430	      content is provided in several languages, and that one has to
431	      inspect each of the alternatives in order to find its language or
432	      languages.  In this case, the presence of multiple tags might not
433	      mean that one needs to be multi-lingual to get complete
434	      understanding of the document.  Example: MIME multipart/
435	      alternative.

437	   o  In markup languages, such as HTML and XML, language information
438	      can be added to each part of the document identified by the markup
439	      structure (including the whole document itself).  For example, one
440	      could write <span lang="FR">C'est la vie.</span> inside a
441	      Norwegian document; the Norwegian-speaking user could then access
442	      a French-Norwegian dictionary to find out what the marked section
443	      meant.  If the user were listening to that document through a
444	      speech synthesis interface, this information could be used to
445	      signal the synthesizer to appropriately apply French
446	      text-to-speech pronunciation rules to that span of text, instead
447	      of misapplying the Norwegian rules.

449	2.4  Choosing Between Alternate Matching Schemes

451	   Implementations MAY choose to implement different styles of matching
452	   for different kinds of processing.  For example, an implementation
453	   could treat an absent script subtag as a "wildcard" field; thus
454	   "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not
455	   "az" (this is extended range lookup).  If one item is to be chosen,
456	   the implementation could pick among those matches based on other
457	   information, such as the most likely script used in the language/
458	   region in question or the script used by other content selected.

460	   Because the primary language subtag cannot be absent in a language
461	   tag, the 'UND' subtag is sometimes used as a 'wildcard' in basic
462	   matching.  For example, in a query where you want to select all
463	   language tags that contain 'Latn' as the script code and 'AZ' as the
464	   region code, you could use the range "und-Latn-AZ".  This requires an
465	   implementation to examine the actual values of the subtags, though.
466	   The matching schemes described elsewhere in this document are
467	   designed such that  implementations do not have to examine the values
468	   or subtags supplied and, except for scored matching, they do not need
469	   access to the Language Subtag Registry nor the use of valid subtags
470	   in language tags or ranges.  This has great benefit for speed and
471	   simplicity of implementation.

473	   Implementations might also wish to use semantic information external
474	   to the language tags when performing fallback.  For example, the
475	   primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
476	   Norwegian) might both be usefully matched to the more general subtag
477	   'no' (Norwegian).  Or an application might infer that content labeled
478	   "zh-CN" is more likely to match the range "zh-Hans" than equivalent
479	   content labeled "zh-TW".

481	2.5  Considerations for Private Use Subtags

483	   Private-use subtags require private agreement between the parties
484	   that intend to use or exchange language tags that use them and great
485	   caution SHOULD be used in employing them in content or protocols
486	   intended for general use.  Private-use subtags are simply useless for
487	   information exchange without prior arrangement.

489	   The value and semantic meaning of private-use tags and of the subtags
490	   used within such a language tag are not defined.  Matching private
491	   use tags using language ranges or extended language ranges can result
492	   in unpredictable content being returned.

494	2.6  Length Considerations in Matching

496	   [RFC3066] did not provide an upper limit on the size of language tags
497	   or ranges.  RFC 3066 did define the semantics of particular subtags
498	   in such a way that most language tags or ranges consisted of language
499	   and region subtags with a combined total length of up to six
500	   characters.  Larger tags and ranges (in terms of both subtags and
501	   characters) did exist, however.

503	   [ID.ietf-ltru-registry] also does not impose a fixed upper limit on
504	   the number of subtags in a language tag or range (and thus an upper
505	   bound on the size of either).  The syntax in that document suggests
506	   that, depending on the specific language or range of languages, more
507	   subtags (and thus characters) are sometimes necessary as a result.
508	   Length considerations and their impact on the selection and
509	   processing of tags are described in Section 2.1.1 of that document.

511	   A matching implementation MAY choose to limit the length of the
512	   language tags or ranges used in matching.  Any such limitation SHOULD
513	   be clearly documented, and such documentation SHOULD include the
514	   disposition of any longer tags or ranges (for example, whether an
515	   error value is generated or the language tag or range is truncated).
516	   If truncation is permitted, it MUST NOT divide a subtag, since
517	   doing so changes the semantics of the subtag being matched and can
518	   result in false positives or negatives.
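
As an illustration of why a plain character-count cutoff is unsafe, the following Python sketch contrasts simple slicing with a cut made only at subtag boundaries. The helper name `cut_at_subtag_boundary` and the 6-character limit are assumptions made for this example and are not defined by this document. The sketch shows only the "do not divide a subtag" constraint; it does not handle the additional rule about trailing single-character subtags.

```python
def cut_at_subtag_boundary(tag: str, limit: int) -> str:
    """Shorten `tag` to at most `limit` characters using whole subtags only.

    Hypothetical helper for illustration. Returns "" when not even the
    first subtag fits within `limit`.
    """
    if len(tag) <= limit:
        return tag
    kept = []
    length = 0
    for subtag in tag.split("-"):
        # Every subtag after the first also costs one "-" separator.
        needed = len(subtag) + (1 if kept else 0)
        if length + needed > limit:
            break
        kept.append(subtag)
        length += needed
    return "-".join(kept)


tag = "en-Latn-US"
naive = tag[:6]                        # slicing cuts through the "Latn" subtag
safe = cut_at_subtag_boundary(tag, 6)  # keeps only subtags that fit whole
```

Here the sliced value ends in a fragment that is no longer the script subtag the author chose, while the boundary-aware value is shorter but still made of intact subtags.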

520	   Implementations that restrict storage SHOULD consider the impact of
521	   tag or range truncation on the resulting matches.  For example,
522	   removing the "*" from the end of an extended language range (see
523	   Section 2.2) can greatly modify the set of returned matches.  A
524	   protocol that allows tags or ranges to be truncated at an arbitrary
525	   limit, without giving any indication of what that limit is, has the
526	   potential for causing harm by changing the meaning of values in
527	   substantial ways.

529	   In practice, most tags do not require additional subtags or
530	   substantially more characters.  Additional subtags sometimes add
531	   useful distinguishing information, but extraneous subtags interfere
532	   with the meaning, understanding, and especially matching of language
533	   tags.  Since language tags or ranges MAY be truncated by an
534	   application or protocol that limits storage, users and applications
535	   choosing language tags or ranges SHOULD avoid adding subtags
536	   that add no distinguishing value.  In particular, users and
537	   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
538	   fields in the registry (defined in Section 3.6 of [ID.ietf-ltru-
539	   registry]): these fields provide guidance on when specific additional
540	   subtags SHOULD (and SHOULD NOT) be used.

542	   Implementations MUST support a limit of at least 33 characters.  This
543	   limit includes at least one subtag of each non-extension, non-private
544	   use type.  When choosing a buffer limit, a length of at least 42
545	   characters is strongly RECOMMENDED.

547	   The practical limit on tags or ranges derived solely from registered
548	   values is 42 characters.  Implementations MUST be able to handle tags
549	   and ranges of this length.  Support for tags and ranges of at least
550	   62 characters in length is RECOMMENDED.  Implementations MAY support
551	   longer values, including matching extensive sets of private use or
552	   extension subtags.

554	   Applications or protocols that have to truncate a tag MUST do so by
555	   progressively removing subtags along with their preceding "-" from
556	   the right side of the language tag until the tag is short enough for
557	   the given buffer.  If the resulting tag ends with a single-character
558	   subtag, that subtag and its preceding "-" MUST also be removed.  For
559	   example:

561	   Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
562	   1. zh-Hant-CN-variant1-a-extend1-x-wadegile
563	   2. zh-Hant-CN-variant1-a-extend1
564	   3. zh-Hant-CN-variant1
565	   4. zh-Hant-CN
566	   5. zh-Hant
567	   6. zh

569	                    Figure 4: Example of Tag Truncation
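
The truncation rule illustrated in Figure 4 can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function names `truncation_steps` and `truncate_tag` are assumptions introduced here, and raising `ValueError` when even the first subtag does not fit is one possible disposition among those an implementation might document.

```python
def truncation_steps(tag: str):
    """Yield each successively shorter form of `tag`.

    Subtags are removed from the right, one at a time; if the result
    would end in a single-character subtag (for example "a" or "x"),
    that subtag is removed as well.
    """
    subtags = tag.split("-")
    while len(subtags) > 1:
        subtags.pop()
        # Drop a trailing single-character subtag along with its "-".
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
        yield "-".join(subtags)


def truncate_tag(tag: str, limit: int) -> str:
    """Return the longest truncation of `tag` that fits in `limit` characters."""
    if len(tag) <= limit:
        return tag
    for candidate in truncation_steps(tag):
        if len(candidate) <= limit:
            return candidate
    # Assumption: report an error rather than return a divided subtag.
    raise ValueError("first subtag does not fit within the limit")


example = "zh-Hant-CN-variant1-a-extend1-x-wadegile-private1"
steps = list(truncation_steps(example))
```

With the tag from Figure 4, `steps` contains the six values listed in the figure, in the same order, and `truncate_tag(example, 33)` yields `zh-Hant-CN-variant1-a-extend1`.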

571	3.  IANA Considerations

573	   This document presents no new or existing considerations for IANA.

575	4.  Changes

577	   This is the first version of this document.

579	   The following changes were put into this document since draft-02:

581	      Turned on symrefs and replaced all reference IDs to make them
582	      readable (F.Ellermann)

584	      Removed all external references from the abstract (R.Presuhn)

586	5.  Security Considerations

588	   Language ranges used in content negotiation might be used to infer
589	   the nationality of the sender, and thus identify potential targets
590	   for surveillance.  In addition, unique or highly unusual language
591	   ranges or combinations of language ranges might be used to track
592	   a specific individual's activities.

594	   This is a special case of the general problem that anything you send
595	   is visible to the receiving party.  It is useful to be aware that
596	   such concerns can exist in some cases.

598	   The evaluation of the exact magnitude of the threat, and any possible
599	   countermeasures, is left to each application protocol.

601	6.  Character Set Considerations

603	   The syntax of language tags and language ranges permits only the
604	   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D).  These characters
605	   are present in most character sets, so presentation of language tags
606	   should not present any character set issues.
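
A character-level check for the repertoire listed above might look like the following Python sketch. The function name `uses_only_permitted_characters` is an assumption made for this example. The check covers only the characters named in this section (A-Z, a-z, 0-9, and HYPHEN-MINUS); it does not validate tag syntax, and it deliberately rejects the "*" wildcard that extended language ranges can carry.

```python
import re

# Letters, digits, and HYPHEN-MINUS only; explicit ASCII ranges are used so
# that non-ASCII letters and digits are rejected.
_PERMITTED = re.compile(r"[A-Za-z0-9-]+")


def uses_only_permitted_characters(value: str) -> bool:
    """Return True if `value` is non-empty and uses only the listed characters."""
    return _PERMITTED.fullmatch(value) is not None
```

A value such as `zh-Hant-CN` passes this check, while `en_US` fails because of the underscore.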

608	7.  References

610	7.1  Normative References

612	   [ID.ietf-ltru-registry]
613	              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
614	              Identification of Languages (Internet-Draft)", June 2005,
615	              <http://www.ietf.org/internet-drafts/
616	              draft-ietf-ltru-registry-07.txt>.

618	   [RFC1327]  Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO
619	              10021 and RFC 822", RFC 1327, May 1992.

621	   [RFC1521]  Borenstein, N. and N. Freed, "MIME (Multipurpose Internet
622	              Mail Extensions) Part One: Mechanisms for Specifying and
623	              Describing the Format of Internet Message Bodies",
624	              RFC 1521, September 1993.

626	   [RFC2028]  Hovey, R. and S. Bradner, "The Organizations Involved in
627	              the IETF Standards Process", BCP 11, RFC 2028,
628	              October 1996.

630	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
631	              Requirement Levels", BCP 14, RFC 2119, March 1997.

633	   [RFC2231]  Freed, N. and K. Moore, "MIME Parameter Value and Encoded
634	              Word Extensions: Character Sets, Languages, and
635	              Continuations", RFC 2231, November 1997.

637	   [RFC2234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
638	              Specifications: ABNF", RFC 2234, November 1997.

640	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
641	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
642	              August 1998.

644	   [RFC2434]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
645	              IANA Considerations Section in RFCs", BCP 26, RFC 2434,
646	              October 1998.

648	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
649	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
650	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

652	   [RFC2860]  Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
653	              Understanding Concerning the Technical Work of the
654	              Internet Assigned Numbers Authority", RFC 2860, June 2000.

656	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
657	              10646", STD 63, RFC 3629, November 2003.

659	7.2  Informative References

661	   [ISO639-1]
662	              International Organization for Standardization, "ISO 639-
663	              1:2002, Codes for the representation of names of languages
664	              -- Part 1: Alpha-2 code", ISO Standard 639, 2002.

666	   [ISO639-2]
667	              International Organization for Standardization, "ISO 639-
668	              2:1998 - Codes for the representation of names of
669	              languages -- Part 2: Alpha-3 code - edition 1",
670	              August 1988.

672	   [ISO15924]
673	              ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
674	              representation of names of scripts", January 2004.

676	   [ISO3166]  International Organization for Standardization, "Codes for
677	              the representation of names of countries, 3rd edition",
678	              ISO Standard 3166, August 1988.

680	   [UN_M49]   Statistical Division, United Nations, "Standard Country or
681	              Area Codes for Statistical Use", UN Standard Country or
682	              Area Codes for Statistical Use, Revision 4 (United Nations
683	              publication, Sales No. 98.XVII.9), June 1999.

685	   [RFC1766]  Alvestrand, H., "Tags for the Identification of
686	              Languages", RFC 1766, March 1995.

688	   [RFC3066]  Alvestrand, H., "Tags for the Identification of
689	              Languages", BCP 47, RFC 3066, January 2001.

691	   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
692	              Timestamps", RFC 3339, July 2002.

694	Authors' Addresses

696	   Addison Phillips (editor)
697	   Quest Software

699	   Email: addison dot phillips at quest dot com
700	   Mark Davis (editor)
701	   IBM

703	   Email: mark dot davis at ibm dot com

705	Appendix A.  Acknowledgements

707	   Any list of contributors is bound to be incomplete; please regard the
708	   following as only a selection from the group of people who have
709	   contributed to make this document what it is today.

711	   The contributors to [ID.ietf-ltru-registry], [RFC3066] and [RFC1766],
712	   each of which is a precursor to this document, made enormous
713	   contributions directly or indirectly to this document and are
714	   generally responsible for the success of language tags.

716	   The following people (in alphabetical order by family name)
717	   contributed to this document:

719	   Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Ira
720	   McDonald, M. Patton, Randy Presuhn and many, many others.

722	   Very special thanks must go to Harald Tveit Alvestrand, who
723	   originated RFCs 1766 and 3066, and without whom this document would
724	   not have been possible.

726	   For this particular document, John Cowan originated the scheme
727	   described in Section 2.2.3.  Mark Davis originated the scheme
728	   described in Section 2.1.2.

730	Intellectual Property Statement

732	   The IETF takes no position regarding the validity or scope of any
733	   Intellectual Property Rights or other rights that might be claimed to
734	   pertain to the implementation or use of the technology described in
735	   this document or the extent to which any license under such rights
736	   might or might not be available; nor does it represent that it has
737	   made any independent effort to identify any such rights.  Information
738	   on the procedures with respect to rights in RFC documents can be
739	   found in BCP 78 and BCP 79.

741	   Copies of IPR disclosures made to the IETF Secretariat and any
742	   assurances of licenses to be made available, or the result of an
743	   attempt made to obtain a general license or permission for the use of
744	   such proprietary rights by implementers or users of this
745	   specification can be obtained from the IETF on-line IPR repository at
746	   http://www.ietf.org/ipr.

748	   The IETF invites any interested party to bring to its attention any
749	   copyrights, patents or patent applications, or other proprietary
750	   rights that may cover technology that may be required to implement
751	   this standard.  Please address the information to the IETF at
752	   ietf-ipr@ietf.org.

754	Disclaimer of Validity

756	   This document and the information contained herein are provided on an
757	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
758	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
759	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
760	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
761	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
762	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

764	Copyright Statement

766	   Copyright (C) The Internet Society (2005).  This document is subject
767	   to the rights, licenses and restrictions contained in BCP 78, and
768	   except as set forth therein, the authors retain all their rights.

770	Acknowledgment

772	   Funding for the RFC Editor function is currently provided by the
773	   Internet Society.