idnits 2.17.1 

draft-ietf-ltru-matching-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 774.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 751.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 758.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 764.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RFC3066], [19], [1]), which
     it shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 168 has weird spacing: '...schemes  that ...'

  == Line 169 has weird spacing: '...ing and  looku...'

  == Line 371 has weird spacing: '...age tag  being...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 30, 2005) is 6906 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RFC 3066' on line 46

  == Unused Reference: '2' is defined on line 621, but no explicit reference
     was found in the text

  == Unused Reference: '3' is defined on line 624, but no explicit reference
     was found in the text

  == Unused Reference: '4' is defined on line 629, but no explicit reference
     was found in the text

  == Unused Reference: '6' is defined on line 635, but no explicit reference
     was found in the text

  == Unused Reference: '7' is defined on line 639, but no explicit reference
     was found in the text

  == Unused Reference: '8' is defined on line 642, but no explicit reference
     was found in the text

  == Unused Reference: '9' is defined on line 646, but no explicit reference
     was found in the text

  == Unused Reference: '11' is defined on line 654, but no explicit reference
     was found in the text

  == Unused Reference: '12' is defined on line 658, but no explicit reference
     was found in the text

  == Unused Reference: '13' is defined on line 663, but no explicit reference
     was found in the text

  == Unused Reference: '14' is defined on line 667, but no explicit reference
     was found in the text

  == Unused Reference: '15' is defined on line 671, but no explicit reference
     was found in the text

  == Unused Reference: '16' is defined on line 674, but no explicit reference
     was found in the text

  == Unused Reference: '17' is defined on line 678, but no explicit reference
     was found in the text

  == Unused Reference: '18' is defined on line 683, but no explicit reference
     was found in the text

  == Unused Reference: '20' is defined on line 689, but no explicit reference
     was found in the text

  == Outdated reference: A later version (-14) exists of
     draft-ietf-ltru-registry-01

  ** Obsolete normative reference: RFC 1327 (ref. '2') (Obsoleted by RFC 2156)

  ** Obsolete normative reference: RFC 1521 (ref. '3') (Obsoleted by RFC
     2045, RFC 2046, RFC 2047, RFC 2048, RFC 2049)

  ** Obsolete normative reference: RFC 2028 (ref. '4') (Obsoleted by RFC 9281)

  ** Obsolete normative reference: RFC 2234 (ref. '7') (Obsoleted by RFC 4234)

  ** Obsolete normative reference: RFC 2396 (ref. '8') (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2434 (ref. '9') (Obsoleted by RFC 5226)

  ** Obsolete normative reference: RFC 2616 (ref. '10') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 2860 (ref.
     '11')

  -- Obsolete informational reference (is this intentional?): RFC 1766 (ref.
     '18') (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 3066 (ref.
     '19') (Obsoleted by RFC 4646, RFC 4647)


     Summary: 12 errors (**), 0 flaws (~~), 23 warnings (==), 10 comments
     (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                            Quest Software
4	Expires: December 1, 2005                                  M. Davis, Ed.
5	                                                                     IBM
6	                                                            May 30, 2005

8	                     Matching Language Identifiers
9	                      draft-ietf-ltru-matching-01

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on December 1, 2005.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2005).

40	Abstract

42	   This document describes different mechanisms for comparing and
43	   matching the tags for the identification of languages defined by [RFC
44	   3066bis] [1].  Possible algorithms for language negotiation and
45	   content selection are described.  This document obsoletes portions of
46	   [RFC 3066] [19].

48	Table of Contents

50	   1.  Introduction
51	   2.  The Language Range
52	     2.1   Basic Language Range
53	       2.1.1   Matching
54	       2.1.2   Lookup
55	     2.2   Extended Language Range
56	       2.2.1   Extended Range Matching
57	       2.2.2   Extended Range Lookup
58	       2.2.3   Scored Matching
59	     2.3   Meaning of Language Tags and Ranges
60	     2.4   Choosing Between Alternate Matching Schemes
61	     2.5   Considerations for Private Use Subtags
62	     2.6   Length Considerations in Matching
63	   3.  IANA Considerations
64	   4.  Changes
65	   5.  Security Considerations
66	   6.  Character Set Considerations
67	   7.  References
68	     7.1   Normative References
69	     7.2   Informative References
70	       Authors' Addresses
71	   A.  Acknowledgements
72	       Intellectual Property and Copyright Statements

74	1.  Introduction

76	   Human beings on our planet have, past and present, used a number of
77	   languages.  There are many reasons why one would want to identify the
78	   language used when presenting or requesting information.

80	   Information about a user's language preferences commonly needs to be
81	   identified so that appropriate processing can be applied.  For
82	   example, the user's language preferences in a browser can be used to
83	   select web pages appropriately.  A choice of language preference can
84	   also be used to select among tools (such as dictionaries) to assist
85	   in the processing or understanding of content in different languages.

87	   Given a set of language identifiers, such as those defined in
88	   RFC3066bis [1], various mechanisms can be envisioned for performing
89	   language negotiation and tag matching.  The suitability of a
90	   particular mechanism to a particular application depends on the needs
91	   of that application.

93	   This document defines language ranges and syntax for specifying user
94	   preferences in a request for language content.  It also specifies
95	   various schemes and mechanisms that can be used with language ranges
96	   when matching or filtering content based on language tags.

98	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
99	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
100	   document are to be interpreted as described in RFC 2119 [5].

102	2.  The Language Range

104	   Language Tags are used to identify the language of some information
105	   item or content.  Applications that use language tags are often faced
106	   with the problem of identifying sets of content that share certain
107	   language attributes.  For example, HTTP 1.1 [10] describes language
108	   ranges in its discussion of the Accept-Language header (Section
109	   14.4), which is used for selecting content from servers based on the
110	   language of that content.

112	   When selecting content according to its language, it is useful to
113	   have a mechanism for identifying sets of language tags that share
114	   specific attributes.  This allows users to select or filter content
115	   based on specific requirements.  Such an identifier is called a
116	   "Language Range".

118	2.1  Basic Language Range

120	   A basic language range (such as described in RFC 3066 [19] and HTTP
121	   1.1 [10]) is a set of languages whose tags all begin with the same
122	   sequence of subtags.  A basic language range can be represented by a
123	   'language-range' tag, by using the definition from HTTP/1.1 [10]:
124	   language-range = language-tag / "*"

126	   That is, a language-range has the same syntax as a language-tag or is
127	   the single character "*".  This definition of language-range implies
128	   that there is a semantic relationship between tags that share the
129	   same prefix.

131	   However, the languages identified by tags that match a specific
132	   language-range may not all be mutually intelligible.  The use of a
133	   prefix when matching tags to language ranges does not imply that
134	   language tags are assigned to languages in such a way that it is
135	   always true that if a user understands a language with a certain tag,
136	   then this user will also understand all languages with tags for which
137	   this tag is a prefix.  The prefix rule simply allows the use of
138	   prefix tags if this is the case.

140	   When working with tags and ranges you should also note the following:

142	   1.  Private-use and Extension subtags are normally orthogonal to
143	       language tag fallback.  Implementations should ignore
144	       unrecognized private-use and extension subtags when performing
145	       language tag fallback.  Since these subtags are always at the end
146	       of the sequence of subtags, they don't normally interfere with
147	       the use of prefixes for matching in the schemes described below.

149	   2.  Implementations that choose not to interpret one or more private-
150	       use or extension subtags should not remove or modify these
151	       extensions in content that they are processing.  When a language
152	       tag instance is to be used in a specific, known protocol, and is
153	       not being passed through to other protocols, language tags may be
154	       filtered to remove subtags and extensions that are not supported
155	       by that protocol.  This should be done with caution, since it is
156	       removing information that may be relevant if services on the
157	       other end of the protocol would make use of that information.

159	   3.  Some applications of language tags may want or need to consider
160	       extensions and private-use subtags when matching tags.  If
161	       extensions and private-use subtags are included in a matching or
162	       filtering process that utilizes one of the schemes described
163	       in this document, then the implementation should canonicalize the
164	       language tags and/or ranges before performing the matching.  Note
165	       that language tag processors that claim to be "well-formed"
166	       processors as defined in [1] generally fall into this category.

168	   There are two matching schemes  that are commonly associated with
169	   basic language ranges:  matching and  lookup.

171	2.1.1  Matching

173	   Language tag matching is used to select all content that matches a
174	   given prefix.  In matching, the language range represents the least
175	   specific tag which is an acceptable match and every piece of content
176	   that matches is returned.

178	   For example, if an application is applying a style to all content in
179	   a web page in a particular language, it might use language tag
180	   matching to select the content to which the style is applied.

182	   A language-range matches a language-tag if it exactly equals the tag,
183	   or if it exactly equals a prefix of the tag such that the first
184	   character following the prefix is "-".  (That is, the language-range
185	   "en-de" matches the language tag "en-DE-boont", but not the language
186	   tag "en-Deva".)

188	   The special range "*" matches any tag.  A protocol which uses
189	   language ranges may specify additional rules about the semantics of
190	   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
191	   languages not matched by any other range within an "Accept-Language:"
192	   header.

194	2.1.2  Lookup

196	   Content lookup is used to select the single information item that
197	   best matches the language range for a given request.  In lookup, the
198	   language range represents the most specific tag which is an
199	   acceptable match and only the closest matching item is returned.

201	   For example, if an application inserts some dynamic content into a
202	   web page, returning an empty string if there is no exact match is not
203	   an option.  Instead, the application "falls back".

205	   When performing lookup, the language range is progressively truncated
206	   from the end until a matching piece of content is located.  For
207	   example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
208	   would progressively search for content as shown below:

210	   Range to match: zh-Hant-CN-x-wadegile
211	   1. zh-Hant-CN-x-wadegile
212	   2. zh-Hant-CN
213	   3. zh-Hant
214	   4. zh
215	   5. (default content or the empty tag)

217	                Figure 2: Default Fallback Pattern Example

219	   This scheme allows some flexibility in finding content.  When data
220	   is not available at a specific level of tag granularity, or is
221	   sparsely populated, it also typically provides better results than
222	   using the default language for the system or content.
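
The truncation described in this section could be sketched in Python
as shown here.  The names fallback_chain and lookup are hypothetical,
as is the use of a plain list of available tags.  Dropping a trailing
single-character subtag (such as "x") together with the subtag after
it is an inference from the example, in which "zh-Hant-CN-x" never
appears as a candidate.

```python
def fallback_chain(lang_range):
    """List the candidates tried during lookup, most specific first."""
    subtags = lang_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # Assumption: never leave a dangling singleton such as "x".
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(lang_range, available, default=None):
    """Return the closest available tag, or the default if none fits."""
    by_lower = {tag.lower(): tag for tag in available}
    for candidate in fallback_chain(lang_range):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return default


if __name__ == "__main__":
    print(fallback_chain("zh-Hant-CN-x-wadegile"))
    print(lookup("zh-Hant-CN-x-wadegile", ["en", "zh-Hant"]))
```

The final step of the fallback pattern (default content or the empty
tag) corresponds to the default argument in this sketch.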

224	2.2  Extended Language Range

226	   Prefix matching using a Basic Language Range, as described above, is
227	   not always the most appropriate way to access the information
228	   contained in language tags when selecting or filtering content.  Some
229	   applications may wish to define a more granular matching scheme and
230	   such a matching scheme requires the ability to specify the various
231	   attributes of a language tag in the language range.  An extended
232	   language range can be represented by the following ABNF:
233	   extended-language-range = grandfathered / privateuse / range
234	   range   = ( lang [ "-" script ] [ "-" region ] *( "-" variant )
235	                [ "-" privateuse ] )
236	   lang    = ( 2*8ALPHA *[ "-" extlang ] ) / "*"
237	   extlang = 3ALPHA / "*"
238	   script  = 4ALPHA / "*"
239	   region  = 2ALPHA / 3DIGIT / "*"
240	   variant = 5*8alphanum / ( DIGIT 3alphanum ) / "*"
241	   privateuse    = ( "x" / "X" ) 1*( "-" ( 1*8alphanum ) )
242	   grandfathered = 1*3ALPHA 1*2( "-" ( 2*8alphanum ) )
243	   alphanum      = ( ALPHA / DIGIT )
244	   In an extended language range, the identifier takes the form of a
245	   series of subtags which must consist of well-formed subtags or the
246	   special subtag "*".  For example, the language range "en-*-US"
247	   specifies a primary language of 'en', followed by any script subtag,
248	   followed by the region subtag 'US'.

250	   A field not present in the middle of an extended language range MAY
251	   be treated as if the field contained a "*".  For example, the range
252	   "en-US" MAY be considered to be equivalent to the range "en-*-US".

254	   There are several matching algorithms or schemes which may be applied
255	   when matching extended language ranges to language tags.

257	2.2.1  Extended Range Matching

259	   In extended range matching, the subtags in a language tag are
260	   compared to the corresponding subtags in the extended language range.
261	   A subtag is considered to match if it exactly matches the
262	   corresponding subtag in the range or the range contains a subtag with
263	   the value "*" (which matches all subtags, including the empty
264	   subtag).  Extended Range Matching is an extension of basic matching
265	   (Section 2.1.1): the language range represents the least specific tag
266	   which is an acceptable match.

268	   By default all extensions and their subtags are ignored for extended
269	   language range matching.

271	   Private use subtags may be specified in the language range and MUST
272	   NOT be ignored when matching.

274	   Subtags not specified, including those at the end of the language
275	   range, are assigned the value "*".  This makes each range into a
276	   prefix much like that used in basic language range matching.  For
277	   example, the extended language range "zh-*-CN" matches all of the
278	   following tags because the unspecified variant field is expanded to
279	   "*":

281	      zh-Hant-CN

283	      zh-CN

285	      zh-Hans-CN

287	      zh-CN-x-wadegile

289	      zh-Latn-CN-boont
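
One possible way to realize this kind of matching is sketched here in
Python.  The sketch does not parse subtags into typed fields as the
text above describes.  Instead it uses the subtag-walking procedure
that was later published as "extended filtering" in RFC 4647
(Section 3.3.2), which reproduces the example list above.  The function
name extended_filter is hypothetical.

```python
def extended_filter(lang_range: str, tag: str) -> bool:
    """Return True if the extended language range matches the tag."""
    r = lang_range.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match (or the range starts with "*").
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1              # wildcard: nothing is required here
        elif ti >= len(t):
            return False         # range wants more than the tag has
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False         # do not skip past a singleton like "x"
        else:
            ti += 1              # skip an unrequested subtag in the tag
    return True


if __name__ == "__main__":
    for tag in ("zh-Hant-CN", "zh-CN", "zh-Hans-CN",
                "zh-CN-x-wadegile", "zh-Latn-CN-boont"):
        print(tag, extended_filter("zh-*-CN", tag))
```

Once every subtag in the range has been satisfied, any remaining
subtags in the tag are accepted, which gives the prefix-like behavior
described above.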

291	2.2.2  Extended Range Lookup

293	   In extended range lookup, the subtags in a language tag are compared
294	   to the corresponding subtags in the extended language range.  The
295	   subtag is considered to match if it exactly matches the corresponding
296	   subtag in the range or the range contains a subtag with the value "*"
297	   (which matches all subtags, including the empty subtag).  Extended
298	   language range lookup is an extension of basic lookup
299	   (Section 2.1.2): the language range represents the most specific tag
300	   which will form an acceptable match.

302	   Subtags not specified are assigned the value "*" prior to performing
303	   tag matching.  Unlike in extended range matching, however, fields at
304	   the end of the range MUST NOT be expanded in this manner.  For
305	   example, "en-US" must not be considered to be the same as the range
306	   "en-US-*".  This allows ranges to be specific.  The "*" wildcard MUST
307	   be used at the end of the range to indicate that all tags with the
308	   range as a prefix are allowable matches.  That is, the range "zh-*"
309	   matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
310	   matches neither of those tags.

312	   The wildcard "*" at the end of a range SHOULD be considered to match
313	   any private use subtag sequences (so that lookup then functions
314	   exactly like extended range matching, Section 2.2.1).

316	   By default all extensions and their subtags SHOULD be ignored for
317	   extended language range lookup.  Private use subtags may be specified
318	   in the language range and MUST NOT be ignored when performing lookup.
319	   The wildcard "*" at the end of a range SHOULD be considered to match
320	   any private use subtag sequences in addition to variants.

322	   For example, the range "*-US" matches all of the following tags:

324	      en-US

326	      en-Latn-US

328	      en-US-r-extends (extensions are ignored)

330	      fr-US

332	   For example, the range "en-*-US" matches _none_ of the following
333	   tags:

335	      fr-US

337	      en (missing region US)
338	      en-Latn (missing region US)

340	      en-Latn-US-scouse (variant field is present)

342	   For example, the range "en-*" matches all of the following tags:

344	      en-Latn

346	      en-Latn-US

348	      en-Latn-US-scouse

350	      en-US

352	      en-scouse

354	   It should be noted that the ability to be specific in extended range
355	   lookup may make this matching scheme a more appropriate replacement
356	   for basic matching than the extended range matching scheme.

358	2.2.3  Scored Matching

360	   In the "scored matching" scheme, the extended language range and the
361	   language tags are pre-normalized by mapping grandfathered and
362	   obsolete tags into modern equivalents.

364	   The language range and the language tags are normalized into
365	   quadruples of the form (language, script, region, variant), where
366	   extended language is considered part of language and x-private-codes
367	   are considered part of the language if they are initial and part of
368	   the variant if not initial.  Missing components are set to "*".  An
369	   "*" pattern becomes the quadruple ("*", "*", "*", "*").

371	   Each language tag  being matched or filtered is assigned a "quality
372	   value" such that higher values indicate better matches and lower
373	   values indicate worse ones.  If the language matches, add 8 to the
374	   quality value.  If the script matches, add 4 to the quality value.
375	   If the region matches, add 2 to the quality value.  If the variant
376	   matches, add 1 to the quality value.  Elements of the quadruples are
377	   considered to match if they are the same or if one of them is "*".

379	   A value of 15 is a perfect match; 0 is no match at all.  Different
380	   values may be more or less appropriate for different applications and
381	   implementations should probably allow users to choose the most
382	   appropriate selection value.
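
The additive scoring in this section could be sketched in Python as
follows.  The sketch assumes that the range and the tag have already
been normalized into quadruples.  The normalization step is not shown,
and the names WEIGHTS and score are hypothetical.

```python
# Weights for (language, script, region, variant), as given above.
WEIGHTS = (8, 4, 2, 1)


def score(range_quad, tag_quad):
    """Return the quality value (0 to 15) for a tag against a range.

    Elements match if they are equal (compared case-insensitively
    here, which is an assumption) or if either one is "*".
    """
    total = 0
    for weight, r, t in zip(WEIGHTS, range_quad, tag_quad):
        if r == "*" or t == "*" or r.lower() == t.lower():
            total += weight
    return total


if __name__ == "__main__":
    wanted = ("en", "*", "US", "*")
    print(score(wanted, ("en", "Latn", "US", "*")))  # 15
    print(score(wanted, ("fr", "Latn", "CA", "*")))  # 5
```

Because "*" matches anything, a tag whose language differs from the
range can still receive a nonzero value, so an application would
likely apply a minimum threshold before accepting a match.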

384	2.3  Meaning of Language Tags and Ranges

386	   A language tag defines a language as spoken (or written, signed or
387	   otherwise signaled) by human beings for communication of information
388	   to other human beings.

390	   If a language tag B contains language tag A as a prefix, then B is
391	   typically "narrower" or "more specific" than A. For example, "zh-
392	   Hant-TW" is more specific than "zh-Hant".

394	   This relationship is not guaranteed in all cases: specifically,
395	   languages that begin with the same sequence of subtags are NOT
396	   guaranteed to be mutually intelligible, although they may be.

398	   For example, the tag "az" shares a prefix with both "az-Latn"
399	   (Azerbaijani written using the Latin script) and "az-Cyrl"
400	   (Azerbaijani written using the Cyrillic script).  A person fluent in
401	   one script may not be able to read the other, even though the text
402	   might be otherwise identical.  Content tagged as "az" most probably
403	   is written in just one script and thus might not be intelligible to a
404	   reader familiar with the other script.

406	   Variant subtags in particular seem to represent specific divisions in
407	   mutual understanding, since they often encode dialects or other
408	   idiosyncratic variations within a language.

410	   The relationship between the language tag and the information it
411	   relates to is defined by the standard describing the context in which
412	   it appears.  Accordingly, this section can only give possible
413	   examples of its usage.

415	   o  For a single information object, the associated language tags
416	      might be interpreted as the set of languages that is required for
417	      complete comprehension of the whole object.  Example: Plain
418	      text documents.

420	   o  For an aggregation of information objects, the associated language
421	      tags could be taken as the set of languages used inside components
422	      of that aggregation.  Examples: Document stores and libraries.

424	   o  For information objects whose purpose is to provide alternatives,
425	      the associated language tags could be regarded as a hint that the
426	      content is provided in several languages, and that one has to
427	      inspect each of the alternatives in order to find its language or
428	      languages.  In this case, the presence of multiple tags might not
429	      mean that one needs to be multi-lingual to get complete
430	      understanding of the document.  Example: MIME multipart/
431	      alternative.

433	   o  In markup languages, such as HTML and XML, language information
434	      can be added to each part of the document identified by the markup
435	      structure (including the whole document itself).  For example, one
436	      could write <span lang="FR">C'est la vie.</span> inside a
437	      Norwegian document; the Norwegian-speaking user could then access
438	      a French-Norwegian dictionary to find out what the marked section
439	      meant.  If the user were listening to that document through a
440	      speech synthesis interface, this information could be used to
441	      signal the synthesizer to appropriately apply French
442	      text-to-speech pronunciation rules to that span of text, instead of
443	      misapplying the Norwegian rules.

445	2.4  Choosing Between Alternate Matching Schemes

447	   Implementations MAY choose to implement different styles of matching
448	   for different kinds of processing.  For example, an implementation
449	   could treat an absent script subtag as a "wildcard" field; thus
450	   "az-AZ" would match "az-AZ", "az-Cyrl-AZ", "az-Latn-AZ", etc. but not
451	   "az" (this is extended range lookup).  If one item is to be chosen,
452	   the implementation could pick among those matches based on other
453	   information, such as the most likely script used in the language/
454	   region in question or the script used by other content selected.

456	   Because the primary language subtag cannot be absent in a language
457	   tag, the 'UND' subtag may sometimes be used as a 'wildcard' in basic
458	   matching.  For example, in a query where you want to select all
459	   language tags that contain 'Latn' as the script code and 'AZ' as the
460	   region code, you could use the range "und-Latn-AZ".  This requires an
461	   implementation to examine the actual values of the subtags, though.
462	   The matching schemes described elsewhere in this document do not
463	   require implementations to examine the values supplied and, except
464	   for scored matching, they do not require access to the Language
465	   Subtag Registry nor the use of valid subtags in language tags or
466	   ranges.  This has great benefit for speed and simplicity of
467	   implementation.

469	   Implementations may also wish to use semantic information external to
470	   the language tags when performing fallback.  For example, the primary
471	   language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal Norwegian)
472	   might both be usefully matched to the more general subtag 'no'
473	   (Norwegian).  Or an application might infer that content labeled
474	   "zh-CN" is more likely to match the range "zh-Hans" than equivalent
475	   content labeled "zh-TW".

477	2.5  Considerations for Private Use Subtags

479	   Private-use subtags require private agreement between the parties
480	   that intend to use or exchange language tags that use them and great
481	   caution should be used in employing them in content or protocols
482	   intended for general use.  Private-use subtags are simply useless for
483	   information exchange without prior arrangement.

485	   The value and semantic meaning of private-use tags and of the subtags
486	   used within such a language tag are not defined.  Matching private
487	   use tags using language ranges or extended language ranges may result
488	   in unpredictable content being returned.

490	2.6  Length Considerations in Matching

492	   Although there is no upper bound on the number of subtags in a
493	   language tag and it is possible to envision quite long and complex
494	   subtag sequences, in practice these are rare because of the various
495	   considerations discussed in Section 2.1.1 of [1].

497	   A matching implementation MAY choose not to support the storage or
498	   matching of language tags and ranges which exceed a specified length.
499	   Any such limitation SHOULD be clearly documented, and such
500	   documentation SHOULD include the disposition of any longer tags or
501	   ranges (for example, whether an error value is generated or the
502	   language tag is truncated).  If truncation is permitted it must not
503	   permit a subtag to be divided, since this changes the semantics of
504	   the tag or range being matched and may result in false positives or
505	   negatives.  Implementations that restrict storage should consider
506	   removing extensions before matching.  A protocol that allows tags or
507	   ranges to be truncated at an arbitrary limit, without giving any
508	   indication of what that limit is, has the potential for causing harm
509	   by changing the meaning of values in substantial ways.
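
One way to truncate without dividing a subtag is sketched below.  The function name and the choice to raise an error when nothing fits are assumptions made for illustration.  The sketch also drops a trailing single-character subtag, since a singleton such as "x" is meaningless without the subtags that follow it.

```python
def truncate_tag(tag, limit):
    """Shorten `tag` to at most `limit` characters on a subtag boundary."""
    if len(tag) <= limit:
        return tag
    kept = []
    length = 0
    for subtag in tag.split("-"):
        # One extra character for the hyphen joining this subtag to the
        # previous one (none is needed before the first subtag).
        needed = len(subtag) + (1 if kept else 0)
        if length + needed > limit:
            break
        kept.append(subtag)
        length += needed
    # Do not leave a dangling singleton (for example "x") at the end.
    while len(kept) > 1 and len(kept[-1]) == 1:
        kept.pop()
    if not kept:
        raise ValueError("limit too small to hold the primary subtag")
    return "-".join(kept)
```

A plain character slice of "zh-Hans-CN-x-private" at nine characters would give "zh-Hans-C", which splits the region subtag; the sketch returns "zh-Hans" instead.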

511	   In practice, tags and ranges are limited to a sequence of four
512	   subtags, and thus a maximum length of 26 characters (excluding any
513	   extensions or private use sequences).  This is because subtags are
514	   limited to a length of eight characters and the extlang, script, and
515	   region subtags are additionally limited to even fewer characters.  In
516	   addition, the Language Subtag Registry provides guidance on the use
517	   of subtags (via fields such as Suppress-Script and Recommended-
518	   Prefix) which further limit useful combination of subtags in a
519	   language tag or range.

521	   Longer tags are possible.  The longest practical tags (excluding
522	   extensions) could have a length of up to 58 characters, as shown
523	   below.  Implementations MUST be able to handle matching tags of this
524	   length.  Support for tags and ranges of up to 64 characters is
525	   RECOMMENDED.  Implementations MAY support longer tags, including
526	   matching extensive sets of private use or extension subtags.

528	   Here is how the 58-character length of the longest practical tag
529	   (excluding extensions) is derived:

531	   language      = 3
532	   extlang1      = 4 (currently undefined)
533	   extlang2      = 4 (unlikely)
534	   script        = 5
535	   region        = 4 (UN M.49)
536	   variant       = 9
537	   variant       = 9 (unlikely)
538	   private use 1 = 11
539	   private use 2 = 9
540	   total         = 58 characters

542	                  Figure 4: Derivation of the Longest Tag
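
The arithmetic in Figure 4 can be checked with a short script.  The per-component totals are taken from the figure; the split of each total into subtag characters and separator characters (a hyphen, or "-x-" before the first private use subtag) is an assumed reading of the figure, not something it states.

```python
# (component, characters in the subtag, separator characters before it)
# Totals per component match Figure 4; the subtag/separator split is an
# assumption made for illustration.
COMPONENTS = [
    ("language",      3, 0),
    ("extlang1",      3, 1),
    ("extlang2",      3, 1),
    ("script",        4, 1),
    ("region",        3, 1),
    ("variant",       8, 1),
    ("variant",       8, 1),
    ("private use 1", 8, 3),  # "-x-" introduces the private use sequence
    ("private use 2", 8, 1),
]


def longest_practical_length(components=COMPONENTS):
    """Sum the subtag and separator characters of every component."""
    return sum(chars + seps for _, chars, seps in components)


TOTAL = longest_practical_length()
```

Under these assumptions TOTAL is 58, which leaves six characters of headroom within the 64 characters that are RECOMMENDED.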

544	3.  IANA Considerations

546	   This document presents no new or existing considerations for IANA.

548	4.  Changes

550	   This is the first version of this document.

552	   The following changes were put into this document since draft-00:

554	      Fixed text in the introduction that is no longer accurate.
555	      Specifically, there no longer is a default matching algorithm.
556	      (A.Phillips)

558	      Fixed text in Section 2.1 which incorrectly discussed the default
559	      fallback mechanism.  (A.Phillips)

561	      Minor changes to Section 2.3, in particular, the addition of the
562	      'variant' paragraph and some tidying of the text.  (A.Phillips)

564	      Fixed a minor glitch in the ABNF caused by taking the output of
565	      Bill Fenner's parser and not looking too closely at it.  (M.Patton)

567	      Fixed some minor reference problems.  (M.Patton)

569	      Added Section 2.6 on length considerations in matching.
570	      (R.Presuhn)

572	5.  Security Considerations

574	   The only security issue that has been raised with language tags since
575	   the publication of RFC 1766, which stated that "Security issues are
576	   believed to be irrelevant to this memo", is a concern with language
577	   ranges used in content negotiation - that they may be used to infer
578	   the nationality of the sender, and thus identify potential targets
579	   for surveillance.

581	   This is a special case of the general problem that anything you send
582	   is visible to the receiving party.  It is useful to be aware that
583	   such concerns can exist in some cases.

585	   The evaluation of the exact magnitude of the threat, and any possible
586	   countermeasures, is left to each application protocol.

588	   Although the specification of valid subtags for an extension MUST be
589	   available over the Internet, implementations SHOULD NOT mechanically
590	   depend on it being always accessible, to prevent denial-of-service
591	   attacks.

593	6.  Character Set Considerations

595	   The syntax in this document requires that language ranges use only
596	   the characters A-Z, a-z, 0-9, and HYPHEN-MINUS legal in language
597	   tags.  These characters are present in most character sets, so
598	   presentation of language tags should not have any character set
599	   issues.
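
A character-level check along these lines might look like the sketch below.  The function name is hypothetical, and the check covers only the character repertoire, not whether the value is a well-formed tag.  An explicit ASCII set is used because Python's str.isalnum() also accepts letters and digits outside the ASCII range.

```python
import string

# A-Z, a-z, 0-9, and HYPHEN-MINUS, as listed in the paragraph above.
ALLOWED = set(string.ascii_letters + string.digits + "-")


def uses_only_tag_characters(value):
    """Return True if `value` is non-empty and uses only allowed characters."""
    return bool(value) and all(ch in ALLOWED for ch in value)
```

For example, "zh-Hans" passes this check, while "fr_CA" fails because of the underscore.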

601	   Rendering of characters based on the content of a language tag is not
602	   addressed in this memo.  Historically, some languages have relied on
603	   the use of specific character sets or other information in order to
604	   infer how a specific character should be rendered (notably this
605	   applies to language and culture specific variations of Han ideographs
606	   as used in Japanese, Chinese, and Korean).  When language tags are
607	   applied to spans of text, rendering engines may use that information
608	   in deciding which font to use in the absence of other information,
609	   particularly where languages with distinct writing traditions use the
610	   same characters.

612	7.  References

614	7.1  Normative References

616	   [1]   Phillips, A., Ed. and M. Davis, Ed., "Tags for the
617	         Identification of Languages (Internet-Draft)", February 2005, <
618	         http://www.ietf.org/internet-drafts/
619	         draft-ietf-ltru-registry-01.txt>.

621	   [2]   Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO 10021
622	         and RFC 822", RFC 1327, May 1992.

624	   [3]   Borenstein, N. and N. Freed, "MIME (Multipurpose Internet Mail
625	         Extensions) Part One: Mechanisms for Specifying and Describing
626	         the Format of Internet Message Bodies", RFC 1521,
627	         September 1993.

629	   [4]   Hovey, R. and S. Bradner, "The Organizations Involved in the
630	         IETF Standards Process", BCP 11, RFC 2028, October 1996.

632	   [5]   Bradner, S., "Key words for use in RFCs to Indicate Requirement
633	         Levels", BCP 14, RFC 2119, March 1997.

635	   [6]   Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word
636	         Extensions: Character Sets, Languages, and Continuations",
637	         RFC 2231, November 1997.

639	   [7]   Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
640	         Specifications: ABNF", RFC 2234, November 1997.

642	   [8]   Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
643	         Resource Identifiers (URI): Generic Syntax", RFC 2396,
644	         August 1998.

646	   [9]   Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA
647	         Considerations Section in RFCs", BCP 26, RFC 2434,
648	         October 1998.

650	   [10]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L.,
651	         Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol --
652	         HTTP/1.1", RFC 2616, June 1999.

654	   [11]  Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
655	         Understanding Concerning the Technical Work of the Internet
656	         Assigned Numbers Authority", RFC 2860, June 2000.

658	   [12]  Yergeau, F., "UTF-8, a transformation format of ISO 10646",
659	         STD 63, RFC 3629, November 2003.

661	7.2  Informative References

663	   [13]  International Organization for Standardization, "ISO 639-
664	         1:2002, Codes for the representation of names of languages --
665	         Part 1: Alpha-2 code", ISO Standard 639, 2002.

667	   [14]  International Organization for Standardization, "ISO 639-2:1998
668	         - Codes for the representation of names of languages -- Part 2:
669	         Alpha-3 code - edition 1", August 1988.

671	   [15]  ISO TC46/WG3, "ISO 15924:2003 (E/F) - Codes for the
672	         representation of names of scripts", January 2004.

674	   [16]  International Organization for Standardization, "Codes for the
675	         representation of names of countries, 3rd edition",
676	         ISO Standard 3166, August 1988.

678	   [17]  Statistical Division, United Nations, "Standard Country or Area
679	         Codes for Statistical Use", UN Standard Country or Area Codes
680	         for Statistical Use, Revision 4 (United Nations publication,
681	         Sales No. 98.XVII.9), June 1999.

683	   [18]  Alvestrand, H., "Tags for the Identification of Languages",
684	         RFC 1766, March 1995.

686	   [19]  Alvestrand, H., "Tags for the Identification of Languages",
687	         BCP 47, RFC 3066, January 2001.

689	   [20]  Klyne, G. and C. Newman, "Date and Time on the Internet:
690	         Timestamps", RFC 3339, July 2002.

692	Authors' Addresses

694	   Addison Phillips (editor)
695	   Quest Software

697	   Email: addison dot phillips at quest dot com

699	   Mark Davis (editor)
700	   IBM

702	   Email: mark dot davis at ibm dot com

704	Appendix A.  Acknowledgements

706	   Any list of contributors is bound to be incomplete; please regard the
707	   following as only a selection from the group of people who have
708	   contributed to make this document what it is today.

710	   The contributors to RFC 3066 and RFC 1766, the precursors of this
711	   document, made enormous contributions directly or indirectly to this
712	   document and are generally responsible for the success of language
713	   tags.

715	   The following people (in alphabetical order) contributed to this
716	   document or to RFCs 1766 and 3066:

718	   Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
719	   Nathaniel Borenstein, Eric Brunner, Sean M. Burke, Jeremy Carroll,
720	   John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin,
721	   Dave Crocker, Martin Duerst, Michael Everson, Doug Ewell, Ned Freed,
722	   Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpern,
723	   Elliotte Rusty Harold, Paul Hoffman, Richard Ishida, Olle Jarnefors,
724	   Kent Karlsson, John Klensin, Alain LaBonte, Eric Mader, Keith Moore,
725	   Chris Newman, Masataka Ohta, Michael S. Patton, Randy Presuhn, George
726	   Rhoten, Markus Scherer, Keld Jorn Simonsen, Thierry Sourbier, Otto
727	   Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha Wolf, Francois
728	   Yergeau and many, many others.

730	   Very special thanks must go to Harald Tveit Alvestrand, who
731	   originated RFCs 1766 and 3066, and without whom this document would
732	   not have been possible.  Special thanks must go to Michael Everson,
733	   who has served as language tag reviewer for almost the complete
734	   period since the publication of RFC 1766.  Special thanks to Doug
735	   Ewell, for his production of the first complete subtag registry, and
736	   his work in producing a test parser for verifying language tags.

738	   For this particular document, John Cowan originated the scheme
739	   described in Section 2.2.3.  Mark Davis originated the scheme
740	   described in Section 2.1.2.

742	Intellectual Property Statement

744	   The IETF takes no position regarding the validity or scope of any
745	   Intellectual Property Rights or other rights that might be claimed to
746	   pertain to the implementation or use of the technology described in
747	   this document or the extent to which any license under such rights
748	   might or might not be available; nor does it represent that it has
749	   made any independent effort to identify any such rights.  Information
750	   on the procedures with respect to rights in RFC documents can be
751	   found in BCP 78 and BCP 79.

753	   Copies of IPR disclosures made to the IETF Secretariat and any
754	   assurances of licenses to be made available, or the result of an
755	   attempt made to obtain a general license or permission for the use of
756	   such proprietary rights by implementers or users of this
757	   specification can be obtained from the IETF on-line IPR repository at
758	   http://www.ietf.org/ipr.

760	   The IETF invites any interested party to bring to its attention any
761	   copyrights, patents or patent applications, or other proprietary
762	   rights that may cover technology that may be required to implement
763	   this standard.  Please address the information to the IETF at
764	   ietf-ipr@ietf.org.

766	Disclaimer of Validity

768	   This document and the information contained herein are provided on an
769	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
770	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
771	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
772	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
773	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
774	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

776	Copyright Statement

778	   Copyright (C) The Internet Society (2005).  This document is subject
779	   to the rights, licenses and restrictions contained in BCP 78, and
780	   except as set forth therein, the authors retain all their rights.

782	Acknowledgment

784	   Funding for the RFC Editor function is currently provided by the
785	   Internet Society.