idnits 2.17.1 

draft-ietf-ltru-matching-15.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 903.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 880.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 887.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 893.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 22, 2006) is 6512 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)

  -- Obsolete informational reference (is this intentional?): RFC 1766
     (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Duplicate reference: RFC2616, mentioned in 'RFC2616errata', was also
     mentioned in 'RFC2616'.

  -- Obsolete informational reference (is this intentional?): RFC 2616
     (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)


     Summary: 4 errors (**), 0 flaws (~~), 2 warnings (==), 12 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                               Yahoo! Inc.
4	Obsoletes: 3066 (if approved)                              M. Davis, Ed.
5	Expires: December 24, 2006                                        Google
6	                                                           June 22, 2006

8	                       Matching of Language Tags
9	                      draft-ietf-ltru-matching-15

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on December 24, 2006.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2006).

40	Abstract

42	   This document describes a syntax, called a "language-range", for
43	   specifying items in a user's list of language preferences.  It also
44	   describes different mechanisms for comparing and matching these to
45	   language tags.  Two kinds of matching mechanisms, filtering and
46	   lookup, are defined.  Filtering produces a (potentially empty) set of
47	   language tags, whereas lookup produces a single language tag.
48	   Possible applications include language negotiation or content
49	   selection.  This document, in combination with RFC 3066bis (Ed.:
50	   replace "3066bis" with the RFC number assigned to
51	   draft-ietf-ltru-registry-14), replaces RFC 3066, which replaced RFC
52	   1766.

54	Table of Contents

56	   1.  Introduction
57	   2.  The Language Range
58	     2.1.  Basic Language Range
59	     2.2.  Extended Language Range
60	     2.3.  The Language Priority List
61	   3.  Types of Matching
62	     3.1.  Choosing a Matching Scheme
63	     3.2.  Implementation Considerations
64	     3.3.  Filtering
65	       3.3.1.  Basic Filtering
66	       3.3.2.  Extended Filtering
67	     3.4.  Lookup
68	       3.4.1.  Default Values
69	   4.  Other Considerations
70	     4.1.  Choosing Language Ranges
71	     4.2.  Meaning of Language Tags and Ranges
72	     4.3.  Considerations for Private Use Subtags
73	     4.4.  Length Considerations for Language Ranges
74	   5.  IANA Considerations
75	   6.  Security Considerations
76	   7.  Character Set Considerations
77	   8.  References
78	     8.1.  Normative References
79	     8.2.  Informative References
80	   Appendix A.  Acknowledgments
81	   Authors' Addresses
82	   Intellectual Property and Copyright Statements

84	1.  Introduction

86	   Human beings on our planet have, past and present, used a number of
87	   languages.  There are many reasons why one would want to identify the
88	   language used when presenting or requesting information.

90	   Applications, protocols, or specifications that use language
91	   identifiers, such as the language tags defined in [RFC3066bis],
92	   sometimes need to match language tags to a user's language
93	   preferences.

95	   This document defines a syntax (called a language range (Section 2))
96	   for specifying items in the user's list of language preferences
97	   (called a language priority list (Section 2.3)), as well as several
98	   schemes for selecting or filtering sets of language tags by comparing
99	   the language tags to the user's preferences.  Applications,
100	   protocols, or specifications will have varying needs and requirements
101	   that affect the choice of a suitable matching scheme.

103	   This document describes: how to indicate a user's preferences using
104	   language ranges; three schemes for matching these ranges to a set of
105	   language tags; and the various practical considerations that apply to
106	   implementing and using these schemes.

108	   This document, in combination with [RFC3066bis] (Ed.: replace
109	   "3066bis" globally in this document with the RFC number assigned to
110	   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
111	   [RFC1766].

113	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
114	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
115	   document are to be interpreted as described in [RFC2119].

117	2.  The Language Range

119	   Language tags [RFC3066bis] are used to help identify languages,
120	   whether spoken, written, signed, or otherwise signaled, for the
121	   purpose of communication.  Applications, protocols, or specifications
122	   that use language tags are often faced with the problem of
123	   identifying sets of content that share certain language attributes.
124	   For example, HTTP/1.1 [RFC2616] describes one such mechanism in its
125	   discussion of the Accept-Language header (Section 14.4), which is
126	   used when selecting content from servers based on the language of
127	   that content.

129	   It is, thus, useful to have a mechanism for identifying sets of
130	   language tags that share specific attributes.  This allows users to
131	   select or filter the language tags based on specific requirements.
132	   Such an identifier is called a "language range".

134	   There are different types of language range, whose specific
135	   attributes vary according to their application.  Language ranges are
136	   similar to language tags: they consist of a sequence of subtags
137	   separated by hyphens.  In a language range, each subtag MUST either
138	   be a sequence of ASCII alphanumeric characters or the single
139	   character '*' (%x2A, ASTERISK).  The character '*' is a "wildcard"
140	   that matches any sequence of subtags.  The meaning and uses of
141	   wildcards vary according to the type of language range.

143	   Language tags and thus language ranges are to be treated as case-
144	   insensitive: there exist conventions for the capitalization of some
145	   of the subtags, but these MUST NOT be taken to carry meaning.
146	   Matching of language tags to language ranges MUST be done in a case-
147	   insensitive manner.

149	2.1.  Basic Language Range

151	   A "basic language range" has the same syntax as an [RFC3066] language
152	   tag or is the single character "*".  The basic language range was
153	   originally described by HTTP/1.1 [RFC2616] and later [RFC3066].  It
154	   is defined by the following ABNF [RFC4234]:

156	   language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
157	   alphanum         = ALPHA / DIGIT
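
   As an illustration only, the ABNF above can be transcribed into a
   regular expression.  The following Python sketch is not part of
   this specification; the names BASIC_RANGE and
   is_basic_language_range are hypothetical and chosen for this
   example.

```python
import re

# Transcription of:
#   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
#   alphanum       = ALPHA / DIGIT
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_language_range(text):
    """Return True if text is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None
```

   This checks syntax only.  It does not check that the range is
   well-formed in the sense of [RFC3066bis], nor does it consult the
   IANA Language Subtag Registry.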

159	   A basic language range differs from the language tags defined in
160	   [RFC3066bis] only in that there is no requirement that it be "well-
161	   formed" or be validated against the IANA Language Subtag Registry.
162	   Such ill-formed ranges will probably not match anything.  Note that
163	   the ABNF [RFC4234] in [RFC2616] is incorrect, since it disallows the
164	   use of digits anywhere in the 'language-range' (see [RFC2616errata]).

168	2.2.  Extended Language Range

170	   Occasionally users will wish to select a set of language tags based
171	   on the presence of specific subtags.  An "extended language range"
172	   describes a user's language preference as an ordered sequence of
173	   subtags.  For example, a user might wish to select all language tags
174	   that contain the region subtag 'CH' (Switzerland).  Extended language
175	   ranges are useful for specifying a particular sequence of subtags
176	   that appear in the set of matching tags without having to specify all
177	   of the intervening subtags.

179	   An extended language range can be represented by the following ABNF:

181	   extended-language-range = (1*8ALPHA / "*")
182	                             *("-" (1*8alphanum / "*"))

184	   The wildcard subtag '*' can occur in any position in the extended
185	   language range, where it matches any sequence of subtags that might
186	   occur in that position in a language tag.  However, wildcards outside
187	   the first position are ignored by Extended Filtering (see Section
188	   3.3.2).  The use or absence of one or more wildcards cannot be taken
189	   to imply that a certain number of subtags will appear in the matching
190	   set of language tags.
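
   A syntax check for extended language ranges can be sketched in the
   same way.  This Python fragment is illustrative only; the names
   EXTENDED_RANGE and is_extended_language_range are hypothetical.

```python
import re

# Transcription of:
#   extended-language-range = (1*8ALPHA / "*")
#                             *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(
    r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*"
)


def is_extended_language_range(text):
    """Return True if text is syntactically an extended language range."""
    return EXTENDED_RANGE.fullmatch(text) is not None
```

   Under this sketch, the range "*-CH" (any language, region
   Switzerland) is accepted, as is every basic language range.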

192	2.3.  The Language Priority List

194	   A single language range is often not enough to express a user's
195	   language preferences, so users often need to specify a prioritized
196	   list of language ranges in order to best reflect those
197	   preferences.  This is especially true for speakers of minority
198	   languages.  A speaker of Breton in France, for example, can specify
199	   "br" followed by "fr", meaning that if Breton is available, it is
200	   preferred, but otherwise French is the best alternative.  It can get
201	   more complex: a different user might want to fall back from Skolt
202	   Sami to Northern Sami to Finnish.

204	   A "language priority list" is a prioritized or weighted list of
205	   language ranges.  One well known example of such a list is the
206	   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
207	   14.4) and RFC 3282 [RFC3282].

209	   The various matching operations described in this document include
210	   considerations for using a language priority list.  This document
211	   does not define the syntax for a language priority list; defining
212	   such a syntax is the responsibility of the protocol, application, or
213	   specification that uses it.  When given as examples in this document,
214	   language priority lists will be shown as a quoted sequence of ranges
215	   separated by commas, like this: "en, fr, zh-Hant" (which is read
216	   "English before French before Chinese as written in the Traditional
217	   script").

219	   A simple list of ranges is considered to be in descending order of
220	   priority.  Other language priority lists provide "quality weights"
221	   for the language ranges in order to specify the relative priority of
222	   the user's language preferences.  An example of this is the use of
223	   "q" values in the syntax of the "Accept-Language" header (defined in
224	   [RFC2616], Section 14.4, and [RFC3282]).
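
   Because this document does not define a syntax for language
   priority lists, the following Python sketch assumes a simplified
   comma-separated form with an optional ";q=" weight, loosely modeled
   on "Accept-Language".  The function name parse_priority_list, the
   treatment of a zero weight as "not acceptable", and the dropping of
   ill-formed weights are assumptions of this example, not
   requirements of this document.

```python
def parse_priority_list(text):
    """Return the language ranges in text, highest priority first."""
    weighted = []
    for position, item in enumerate(text.split(",")):
        language_range, _, parameter = item.strip().partition(";")
        language_range = language_range.strip()
        if not language_range:
            continue
        quality = 1.0  # a range without a weight gets full priority
        parameter = parameter.strip().lower()
        if parameter.startswith("q="):
            try:
                quality = float(parameter[2:])
            except ValueError:
                continue  # sketch-level choice: drop ill-formed weights
        if quality > 0:  # assumption: q=0 means "not acceptable"
            # Sort key: higher weight first, then original position.
            weighted.append((-quality, position, language_range))
    return [language_range for _, _, language_range in sorted(weighted)]
```

   With this sketch, a simple list such as "en, fr, zh-Hant" keeps its
   written order, while "fr;q=0.5, br, en;q=0.8" is reordered to
   "br", "en", "fr".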

226	3.  Types of Matching

228	   Matching language ranges to language tags can be done in many
229	   different ways.  This section describes three such matching schemes,
230	   as well as the considerations for choosing between them.  Protocols
231	   and specifications requiring conformance to this specification MUST
232	   clearly indicate the particular mechanism used in selecting or
233	   matching language tags.

235	   There are two types of matching scheme in this document.  A matching
236	   scheme that produces zero or more matching language tags is called
237	   "filtering".  A matching scheme that produces exactly one match for a
238	   given request is called "lookup".

240	3.1.  Choosing a Matching Scheme

242	   Applications, protocols, and specifications are faced with the
243	   decision of what type of matching to use.  Sometimes, different
244	   styles of matching are suited to different kinds of processing within
245	   a particular application or protocol.

247	   This document describes three matching schemes:

249	   1.  Basic Filtering (Section 3.3.1) matches a language priority list
250	       consisting of basic language ranges (Section 2.1) to sets of
251	       language tags.

253	   2.  Extended Filtering (Section 3.3.2) matches a language priority
254	       list consisting of extended language ranges (Section 2.2) to sets
255	       of language tags.

257	   3.  Lookup (Section 3.4) matches a language priority list consisting
258	       of basic language ranges to sets of language tags to find the one
259	       _exact_ language tag that best matches the range.

261	   Filtering can be used to produce a set of results (such as a
262	   collection of documents) by comparing the user's preferences to a set
263	   of language tags.  For example, when performing a search, filtering
264	   can be used to limit the results to items tagged as being in the
265	   French language.  Filtering can also be used when deciding whether to
266	   perform a language-sensitive process on some content.  For example, a
267	   process might cause paragraphs whose language tag matched the
268	   language range "nl" (Dutch) to be displayed in italics within a
269	   document.

271	   Lookup produces the single result that best matches the user's
272	   preferences from the list of available tags, so it is useful in cases
273	   in which a single item is required (and for which only a single item
274	   can be returned).  For example, if a process were to insert a human
275	   readable error message into a protocol header, it might select the
276	   text based on the user's language priority list.  Since the process
277	   can return only one item, it is forced to choose a single item and it
278	   has to return some item, even if none of the content's language tags
279	   match the language priority list supplied by the user.

281	3.2.  Implementation Considerations

283	   Language tag matching is a tool, and does not by itself specify a
284	   complete procedure for the use of language tags.  Such procedures are
285	   intimately tied to the application protocol in which they occur.
286	   When specifying a protocol operation using matching, the protocol
287	   MUST specify:

289	   o  Which type(s) of language tag matching it uses

291	   o  Whether the operation returns a single result (lookup) or a
292	      possibly empty set of results (filtering)

294	   o  For lookup, what the default item is (or the sequence of
295	      operations or configuration information used to determine the
296	      default) when no matching tag is found.  For instance, a protocol
297	      might define the result as failure of the operation, an empty
298	      value, returning some protocol defined or implementation defined
299	      default, or returning i-default [RFC2277].

301	   Applications, protocols, and specifications are not required to
302	   validate or understand any of the semantics of the language tags or
303	   ranges or of the subtags in them, nor do they require access to the
304	   IANA Language Subtag Registry (see Section 3 in [RFC3066bis]).  This
305	   simplifies implementation.

307	   However, designers of applications, protocols, or specifications are
308	   encouraged to use the information from the IANA Language Subtag
309	   Registry to support canonicalizing language tags and ranges in order
310	   to map grandfathered and obsolete tags or subtags into modern
311	   equivalents.

313	   Applications, protocols, or specifications that canonicalize ranges
314	   MUST either perform matching operations with both the canonical and
315	   original (unmodified) form of the range or MUST also canonicalize
316	   each tag for the purposes of comparison.

318	   Note that canonicalizing language ranges makes certain operations
319	   impossible.  For example, an implementation that canonicalizes the
320	   language range "art-lojban" (artificial language, lojban variant) to
321	   use the more modern "jbo" (Lojban) cannot be used to select just the
322	   items with the older tag.

324	   Applications, protocols, or specifications that use basic ranges
325	   might sometimes receive extended language ranges instead.  An
326	   application, protocol, or specification MUST choose to: a) map
327	   extended language ranges to basic ranges using the algorithm below,
328	   b) reject any extended language ranges in the language priority list
329	   that are not valid basic language ranges, or c) treat each extended
330	   language range as if it were a basic language range, which will have
331	   the same result as ignoring them, since these ranges will not match
332	   any valid language tags.

334	   An extended language range is mapped to a basic language range as
335	   follows: if the first subtag is a '*' then the entire range is
336	   treated as "*", otherwise each wildcard subtag is removed.  For
337	   example, the extended language range "en-*-US" maps to "en-US"
338	   (English, United States).
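
   The mapping just described is small enough to sketch directly.  The
   function name to_basic_range is hypothetical, and the input is
   assumed to be a syntactically valid extended language range.

```python
def to_basic_range(extended_range):
    """Map an extended language range to a basic language range."""
    subtags = extended_range.split("-")
    if subtags[0] == "*":
        # A leading wildcard turns the whole range into "*".
        return "*"
    # Otherwise every wildcard subtag is simply removed.
    return "-".join(subtag for subtag in subtags if subtag != "*")
```

   For instance, to_basic_range("en-*-US") yields "en-US", and
   to_basic_range("*-CH") yields "*".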

340	   Applications, protocols, or specifications, in addressing their
341	   particular requirements, can offer pre-processing or configuration
342	   options.  For example, an implementation could allow a user to
343	   associate or map a particular language range to a different value.
344	   Such a user might wish to associate the language range subtags 'nn'
345	   (Nynorsk Norwegian) and 'nb' (Bokmal Norwegian) with the more general
346	   subtag 'no' (Norwegian).  Or perhaps a user would want to associate
347	   requests for the range "zh-Hans" (Chinese as written in the
348	   Simplified script) with content bearing the language tag "zh-CN"
349	   (Chinese as used in China, where the Simplified script is
350	   predominant).  Documentation on how the ranges or tags are altered,
351	   prioritized, or compared in the subsequent match in such an
352	   implementation will assist users in making these types of
353	   configuration choices.

355	3.3.  Filtering

357	   Filtering is used to select the set of language tags that matches a
358	   given language priority list.  It is called "filtering" because this
359	   set might contain no items at all or it might contain an arbitrarily
360	   large number of matching items: as many items as match the language
361	   priority list, thus "filtering out" the non-matching items.

363	   In filtering, each language range represents the _least_ specific
364	   language tag (that is, the language tag with the fewest
365	   subtags) which is an acceptable match.  All of the language tags in
366	   the matching set of tags will have an equal or greater number of
367	   subtags than the language range.  Every non-wildcard subtag in the
368	   language range will appear in every one of the matching language
369	   tags.  For example, if the language priority list consists of the
370	   range "de-CH" (German as used in Switzerland), one might see tags
371	   such as "de-CH-1996" (German as used in Switzerland, orthography of
372	   1996) but one will never see a tag such as "de" (because the 'CH'
373	   subtag is missing).

375	   If the language priority list (see Section 2.3) contains more than
376	   one range, the content returned is typically ordered in descending
377	   level of preference, but it MAY be unordered, according to the needs
378	   of the application or protocol.

380	   Some examples of applications where filtering might be appropriate
381	   include:

383	   o  Applying a style to sections of a document in a particular set of
384	      languages.

386	   o  Displaying the set of documents containing a particular set of
387	      keywords written in a specific set of languages.

389	   o  Selecting all email items written in a specific set of languages.

391	   o  Selecting audio files spoken in a particular language.

393	   Filtering seems to imply that there is a semantic relationship
394	   between language tags that share the same prefix.  While this is
395	   often the case, it is not always true: the language tags that match a
396	   specific language range do not necessarily represent mutually
397	   intelligible languages.

399	3.3.1.  Basic Filtering

401	   Basic filtering compares basic language ranges to language tags.
402	   Each basic language range in the language priority list is considered
403	   in turn, according to priority.  A language range matches a
404	   particular language tag if, in a case-insensitive comparison, it
405	   exactly equals the tag, or if it exactly equals a prefix of the tag
406	   such that the first character following the prefix is "-".  For
407	   example, the language-range "de-de" (German as used in Germany)
408	   matches the language tag "de-DE-1996" (German as used in Germany,
409	   orthography of 1996), but not the language tags "de-Deva" (German as
410	   written in the Devanagari script) or "de-Latn-DE" (German, Latin
411	   script, as used in Germany).

413	   The special range "*" in a language priority list matches any tag.  A
414	   protocol which uses language ranges MAY specify additional rules
415	   about the semantics of "*"; for instance, HTTP/1.1 [RFC2616]
416	   specifies that the range "*" matches only languages not matched by
417	   any other range within an "Accept-Language" header.
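
   The comparison described above can be sketched in Python as
   follows.  The names basic_match and basic_filter are hypothetical.
   The sketch treats "*" as matching every tag and does not implement
   the additional HTTP/1.1 rule.  Returning the tags ordered by the
   priority of the first range that matched them is one possible
   choice among those an application might make.

```python
def basic_match(language_range, tag):
    """Return True if a basic language range matches a language tag."""
    if language_range == "*":
        return True
    range_lower = language_range.lower()
    tag_lower = tag.lower()
    # Either an exact match, or a prefix followed immediately by "-".
    return tag_lower == range_lower or tag_lower.startswith(range_lower + "-")


def basic_filter(priority_list, tags):
    """Return the tags matched by any range, in priority order."""
    result = []
    for language_range in priority_list:
        for tag in tags:
            if tag not in result and basic_match(language_range, tag):
                result.append(tag)
    return result
```

   With these definitions, basic_match("de-de", "de-DE-1996") is true,
   while basic_match("de-de", "de-Deva") and
   basic_match("de-de", "de-Latn-DE") are both false.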

419	   Basic filtering is identical to the type of matching described in
420	   [RFC3066], Section 2.5 (Language-range).

422	3.3.2.  Extended Filtering

424	   Extended filtering compares extended language ranges to language
425	   tags.  Each extended language range in the language priority list is
426	   considered in turn, according to priority.  A language range matches
427	   a particular language tag if their lists of subtags match.  To
428	   determine a match:

430	   1.  Split both the extended language range and the language tag being
431	       compared into a list of subtags by dividing on the hyphen (%x2D)
432	       character.  Two subtags match if either they are the same when
433	       compared case-insensitively or the language range's subtag is the
434	       wildcard '*'.

436	   2.  Begin with the first subtag in each list.  If the first subtag in
437	       the range does not match the first subtag in the tag, the overall
438	       match fails.  Otherwise, move to the next subtag in both the
439	       range and the tag.

441	   3.  While there are more subtags left in the language range's list:

443	       A.  If the subtag currently being examined in the range is the
444	           wildcard ('*'), move to the next subtag in the range and
445	           continue with the loop.

447	       B.  Else, if there are no more subtags in the language tag's
448	           list, the match fails.

450	       C.  Else, if the current subtag in the range's list matches the
451	           current subtag in the language tag's list, move to the next
452	           subtag in both lists and continue with the loop.

454	       D.  Else, if the language tag's subtag is a "singleton" (a single
455	           letter or digit, which includes the private-use subtag 'x')
456	           the match fails.

458	       E.  Else, move to the next subtag in the language tag's list and
459	           continue with the loop.

461	   4.  When the language range's list has no more subtags, the match
462	       succeeds.

464	   Subtags not specified, including those at the end of the language
465	   range, are thus treated as if assigned the wildcard value '*'.  Much
466	   like basic filtering, extended filtering selects content with
467	   arbitrarily long tags that share the same initial subtags as the
468	   language range.  In addition, extended filtering selects language
469	   tags that contain any intermediate subtags not specified in the
470	   language range.  For example, the extended language range "de-*-DE"
471	   (or its synonym "de-DE") matches all of the following tags:

473	      de-DE (German, as used in Germany)

475	      de-de (German, as used in Germany)

477	      de-Latn-DE (Latin script)

479	      de-Latf-DE (Fraktur variant of Latin script)

481	      de-DE-x-goethe (private use subtag)

483	      de-Latn-DE-1996 (orthography of 1996)

485	      de-Deva-DE (Devanagari script)

487	   The same range does not match any of the following tags for the
488	   reasons shown:

490	      de (missing 'DE')

492	      de-x-DE (singleton 'x' occurs before 'DE')

494	      de-Deva ('Deva' not equal to 'DE')
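
   The numbered steps of the extended filtering comparison can be
   sketched in Python as follows.  The function name extended_match is
   hypothetical, and the sketch assumes that both arguments are
   syntactically valid.  Comments refer to the step numbers used
   above.

```python
def extended_match(language_range, tag):
    """Return True if an extended language range matches a language tag."""
    # Step 1: split on hyphens; lowercase for case-insensitive comparison.
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 2: the first subtags must match (a wildcard matches anything).
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r = 1  # index of the current subtag in the range
    t = 1  # index of the current subtag in the tag

    # Step 3: consume the remaining subtags of the range.
    while r < len(range_subtags):
        if range_subtags[r] == "*":               # 3.A: skip wildcard
            r += 1
        elif t >= len(tag_subtags):               # 3.B: tag exhausted
            return False
        elif range_subtags[r] == tag_subtags[t]:  # 3.C: subtags match
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:            # 3.D: singleton in tag
            return False
        else:                                     # 3.E: skip tag subtag
            t += 1

    # Step 4: the range has no more subtags, so the match succeeds.
    return True
```

   Applied to the examples above, extended_match("de-*-DE", tag) is
   true for "de-Latn-DE" and "de-DE-x-goethe", and false for "de",
   "de-x-DE", and "de-Deva".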

496	   Note: [RFC3066bis] defines each type of subtag (language, script,
497	   region, and so forth) according to position, size, and content.  This
498	   means that subtags in a language range can only match specific types
499	   of subtags in a language tag.  For example, a subtag such as 'Latn'
500	   is always a script subtag (unless it follows a singleton) while a
501	   subtag such as 'nedis' can only match the equivalent variant subtag.
502	   Two-letter subtags in initial position have a different type
503	   (language) than two-letter subtags in later positions (region).  This
504	   is the reason why a wildcard in the extended language range is
505	   significant in the first position but is ignored in all other
506	   positions.

508	3.4.  Lookup

510	   Lookup is used to select the single language tag that best matches
511	   the language priority list for a given request.  When performing
512	   lookup, each language range in the language priority list is
513	   considered in turn, according to priority.  By contrast with
514	   filtering, each language range represents the _most_ specific tag
515	   which is an acceptable match.  The first matching tag found,
516	   according to the user's priority, is considered the closest match and
517	   is the item returned.  For example, if the language range is "de-ch",
518	   a lookup operation can produce content with the tags "de" or "de-CH"
519	   but never content with the tag "de-CH-1996".  If no language tag
520	   matches the request, the "default" value is returned.

522	   For example, if an application inserts some dynamic content into a
523	   document, returning an empty string if there is no exact match is not
524	   an option.  Instead, the application "falls back" until it finds a
525	   matching language tag associated with a suitable piece of content to
526	   insert.  Some applications of lookup include:

528	   o  Selection of a template containing the text for an automated email
529	      response.

531	   o  Selection of an item containing some text for inclusion in a
532	      particular Web page.

534	   o  Selection of a string of text for inclusion in an error log.

536	   o  Selection of an audio file to play as a prompt in a phone system.

538	   In the lookup scheme, the language range is progressively truncated
539	   from the end until a matching language tag is located.  Single letter
540	   or digit subtags (including both the letter 'x' which introduces
541	   private-use sequences, and the subtags that introduce extensions) are
542	   removed at the same time as their closest trailing subtag.  For
543	   example, starting with the range "zh-Hant-CN-x-private1-private2"
544	   (Chinese, Traditional script, China, two private use subtags), the
545	   lookup progressively searches for content as shown below:

547	   Example of a Lookup Fallback Pattern

549	   Range to match: zh-Hant-CN-x-private1-private2
550	   1. zh-Hant-CN-x-private1-private2
551	   2. zh-Hant-CN-x-private1
552	   3. zh-Hant-CN
553	   4. zh-Hant
554	   5. zh
555	   6. (default)
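
   The truncation rule above can be sketched in Python as follows.  This
   is an illustrative sketch only, not a normative algorithm; the
   function name "fallback_sequence" is hypothetical.  It yields the
   ranges that would be tried before the default, most specific first.

```python
def fallback_sequence(language_range):
    """Yield progressively shorter ranges, most specific first."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        # Truncate from the end.
        subtags.pop()
        # A single letter or digit subtag (such as 'x') is removed at the
        # same time as the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


steps = list(fallback_sequence("zh-Hant-CN-x-private1-private2"))
```

   When the sequence is exhausted without a match, the caller falls
   through to the default value.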

557	   This fallback behavior allows some flexibility in finding a match.
558	   Without fallback, the default content would be returned immediately
559	   if exactly matching content is unavailable.  With fallback, a result
560	   more closely matching the user request can be provided.

562	   Extensions and unrecognized private-use subtags might be unrelated to
563	   a particular application of lookup.  Since these subtags come at the
564	   end of the subtag sequence, they are removed first during the
565	   fallback process and usually pose no barrier to interoperability.
566	   However, an implementation MAY remove these from ranges prior to
567	   performing the lookup (provided the implementation also removes them
568	   from the tags being compared).  Such modification is internal to the
569	   implementation and applications, protocols, or specifications SHOULD
570	   NOT remove or modify subtags in content that they return or forward,
571	   because this removes information that can be used elsewhere.
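
   A minimal Python sketch of this optional pre-processing step follows.
   The function name is hypothetical.  The sketch assumes that any
   single-character subtag after the first position introduces an
   extension or private-use sequence, and it produces a copy for
   internal comparison only; the original tag is what is returned or
   forwarded.

```python
def strip_extensions_and_private_use(tag_or_range):
    """Return a copy without extension and private-use sequences."""
    subtags = tag_or_range.split("-")
    for index, subtag in enumerate(subtags):
        # Position 0 is skipped so that whole-tag private use ("x-...")
        # is left alone; the wildcard "*" is not a singleton.
        if index > 0 and len(subtag) == 1 and subtag != "*":
            return "-".join(subtags[:index])
    return tag_or_range
```

   The same function has to be applied to both the ranges and the tags
   being compared, as required above.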

573	   The special language range "*" matches any language tag.  In the
574	   lookup scheme, this range does not convey enough information by
575	   itself to determine which language tag is most appropriate, since it
576	   matches everything.  If the language range "*" is followed by other
577	   language ranges, it is skipped.  If the language range "*" is the
578	   only one in the language priority list or if no other language range
579	   follows, the default value is computed and returned.
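
   The following Python sketch puts the lookup rules described so far
   together: ranges are tried in priority order, each is truncated from
   the end, the range "*" is skipped, and the default is returned when
   nothing matches.  The function name and the "default" parameter are
   hypothetical, and how the default is computed is left to the caller.

```python
def lookup(priority_list, available_tags, default=None):
    """Return the first available tag matched by lookup, else default."""
    # Comparison is case-insensitive; remember the original spelling.
    available = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            # "*" carries no preference on its own, so it is skipped.
            continue
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            subtags.pop()
            # Remove a trailing single letter or digit subtag as well.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default
```

   With the range "de-ch", this sketch can return content tagged "de" or
   "de-CH" but never content tagged "de-CH-1996".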

581	   In some cases, the language priority list can contain one or more
582	   extended language ranges (as, for example, when the same language
583	   priority list is used as input for both lookup and filtering
584	   operations).  Wildcard values in an extended language range normally
585	   match any value that can occur in that position in a language tag.
586	   Since only one item can be returned for any given lookup request,
587	   wildcards in a language range have to be processed in a consistent
588	   manner or the same request will produce widely varying results.
589	   Applications, protocols, or specifications that accept extended
590	   language ranges MUST define which item is returned when more than one
591	   item matches the extended language range.

593	   For example, an implementation could map the extended language ranges
594	   to basic ranges.  Another possibility would be for an implementation
595	   to return the matching tag that is first in ASCII-order.  If the
596	   language range were "*-CH" ('CH' represents Switzerland) and the set
597	   of tags included "de-CH" (German as used in Switzerland), "fr-CH"
598	   (French, Switzerland), and "it-CH" (Italian, Switzerland), then the
599	   tag "de-CH" would be returned.
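
   The ASCII-order strategy can be sketched in Python as shown below.
   Both function names are hypothetical, and the matcher is a simplified
   reading of extended filtering (a wildcard matches any run of
   subtags, and matching never skips over a single-character subtag).
   It is one possible policy, not the only conforming one.

```python
def extended_match(language_range, tag):
    """Simplified extended-range match, case-insensitive."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    position = 1
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard matches any sequence of subtags
        while True:
            if position >= len(tag_subtags):
                return False  # the tag ran out before the range did
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break
            if len(current) == 1:
                return False  # do not search past a singleton
    return True


def first_in_ascii_order(language_range, tags):
    """Return the matching tag that sorts first, or None."""
    matches = sorted(t for t in tags if extended_match(language_range, t))
    return matches[0] if matches else None


chosen = first_in_ascii_order("*-CH", ["it-CH", "fr-CH", "de-CH", "de-DE"])
```

   Here "chosen" is "de-CH", in line with the example above.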

601	3.4.1.  Default Values

603	   Each application, protocol, or specification that uses lookup MUST
604	   define the defaulting behavior when no tag matches the language
605	   priority list.  What this action consists of strongly depends on how
606	   lookup is being applied.  Some examples of defaulting behavior
607	   include:

609	   o  return an item with no language tag or an item of a non-linguistic
610	      nature, such as an image or sound

612	   o  return a null string as the language tag value, in cases where the
613	      protocol permits the empty value (see, for example, "xml:lang" in
614	      [XML10])

616	   o  return a particular language tag designated for the operation

618	   o  return the language tag "i-default" (see [RFC2277])

620	   o  return an error condition or error message

622	   o  return a list of available languages for the user to select from

624	   When performing lookup using a language priority list, the
625	   progressive search MUST process each language range in the list
626	   before seeking or calculating the default.

628	   The default value MAY be calculated or include additional searching
629	   or matching.  Applications, protocols, or specifications can define
630	   different ways in which users can specify or override the defaults.

632	   One common way to provide for a default is to allow a specific
633	   language range to be set as the default for a specific type of
634	   request.  If this approach is chosen, this language range MUST be
635	   treated as if it were appended to the end of the language priority
636	   list as a whole, rather than after each item in the language priority
637	   list.  The application, protocol, or specification MUST also define
638	   the defaulting behavior if that search fails to find a matching tag
639	   or item.

641	   For example, if a particular user's language priority list is "fr-FR,
642	   zh-Hant" (French as used in France followed by Chinese as written in
643	   the Traditional script) and the program doing the matching had a
644	   default language range of "ja-JP" (Japanese as used in Japan), then
645	   the program searches as follows:
646	   1. fr-FR
647	   2. fr
648	   3. zh-Hant // next language
649	   4. zh
650	   5. ja-JP   // now searching for the default content
651	   6. ja
652	   7. (implementation defined default)
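
   The search order above can be reproduced with a short Python sketch.
   The function name and the "default_range" parameter are hypothetical.
   The point illustrated is that the default range is appended once, to
   the end of the whole list, rather than after each item.

```python
def search_order(priority_list, default_range=None):
    """List every range tried, in order, before the final default."""
    ranges = list(priority_list)
    if default_range is not None:
        # Appended to the list as a whole, not after each entry.
        ranges.append(default_range)
    order = []
    for language_range in ranges:
        subtags = language_range.split("-")
        while subtags:
            order.append("-".join(subtags))
            subtags.pop()
            # Remove a trailing single letter or digit subtag as well.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return order


order = search_order(["fr-FR", "zh-Hant"], default_range="ja-JP")
```

   If none of the ranges in "order" matches, the implementation-defined
   default still applies.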

654	4.  Other Considerations

656	   When working with language ranges and matching schemes, there are
657	   some additional points that can influence the choice of either.

659	4.1.  Choosing Language Ranges

661	   Users indicate their language preferences via the choice of a
662	   language range or the list of language ranges in a language priority
663	   list.  The type of matching affects what the best choice is for a
664	   user.

666	   Most matching schemes make no attempt to process the semantic meaning
667	   of the subtags.  The language range is compared, in a case-
668	   insensitive manner, to each language tag being matched, using basic
669	   string processing.  Users SHOULD select language ranges that are
670	   well-formed, valid language tags according to [RFC3066bis]
671	   (substituting wildcards as appropriate in extended language ranges).

673	   Applications are encouraged to canonicalize language tags and ranges
674	   by using the Preferred-Value from the IANA Language Subtag Registry
675	   for tags or subtags which have been deprecated.  If the user is
676	   working with content that might use the older form, the user might
677	   want to include both the new and old forms in a language priority
678	   list.  For example, the tag "art-lojban" is deprecated.  The subtag
679	   'jbo' is supposed to be used instead, so the user might use it to
680	   form the language range.  Or the user might include both in a
681	   language priority list: "jbo, art-lojban".
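
   As an illustration, the Python sketch below canonicalizes whole tags
   and builds a priority list that carries both forms.  The table is a
   one-entry excerpt taken from the example above, not the registry
   itself; the function names are hypothetical, and deprecated
   individual subtags are not handled.

```python
# Illustrative excerpt only: deprecated tag -> Preferred-Value.
PREFERRED_VALUE = {"art-lojban": "jbo"}


def canonicalize(tag):
    """Replace a deprecated whole tag with its Preferred-Value."""
    return PREFERRED_VALUE.get(tag.lower(), tag)


def with_deprecated_forms(priority_list):
    """Return the list with preferred forms first, old forms after."""
    deprecated_for = {}
    for old, new in PREFERRED_VALUE.items():
        deprecated_for.setdefault(new, []).append(old)
    expanded = []
    for language_range in priority_list:
        preferred = canonicalize(language_range)
        older = deprecated_for.get(preferred.lower(), [])
        for candidate in [preferred] + older:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```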

683	   Users SHOULD avoid subtags that add no distinguishing value to a
684	   language range.  When filtering, the fewer the number of subtags that
685	   appear in the language range, the more content the range will
686	   probably match, while in lookup unnecessary subtags can cause
687	   "better", more-specific content to be skipped in favor of less
688	   specific content.  For example, the range "de-Latn-DE" returns
689	   content tagged "de" instead of content tagged "de-DE", even though
690	   the latter is probably a better match.

692	   Whether a subtag adds distinguishing value can depend on the context
693	   of the request.  For example, a user who reads both Simplified and
694	   Traditional Chinese, but who prefers Simplified, might use the range
695	   "zh" for filtering (matching all items that user can read) but "zh-
696	   Hans" for lookup (making sure that user gets the preferred form if
697	   it's available, but the fallback to "zh" will still work).  On the
698	   other hand, content in this case ought to be labeled as "zh-Hans" (or
699	   "zh-Hant" if that applies) for filtering, while for lookup, if there
700	   is either "zh-Hans" content or "zh-Hant" content, one of them (the
701	   one considered 'default') also ought to be made available with the
702	   simple "zh".  Note that the user can create a language priority list
703	   "zh-Hans, zh" that delivers the best possible results for both
704	   schemes.  If the user cannot be sure which scheme is being used (or
705	   if more than one might be applied to a given request), the user
706	   SHOULD specify the most specific (largest number of subtags) range
707	   first and then supply shorter prefixes later in the list to ensure
708	   that filtering returns a complete set of tags.
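
   The effect of the priority list "zh-Hans, zh" under both schemes can
   be seen in the following Python sketch.  The function names are
   hypothetical; "basic_filter" assumes prefix matching on whole subtags
   and "lookup" assumes truncation from the end, and both are simplified
   for illustration.

```python
def basic_filter(priority_list, tags):
    """Return every tag matched by at least one range."""
    def matches(language_range, tag):
        r, t = language_range.lower(), tag.lower()
        return r == "*" or t == r or t.startswith(r + "-")
    return [t for t in tags if any(matches(r, t) for r in priority_list)]


def lookup(priority_list, tags, default=None):
    """Return the single best tag, falling back by truncation."""
    available = {tag.lower(): tag for tag in tags}
    for language_range in priority_list:
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in available:
                return available[candidate]
            subtags.pop()
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


content = ["zh-Hant", "zh-Hans", "zh", "en"]
filtered = basic_filter(["zh-Hans", "zh"], content)
looked_up = lookup(["zh-Hans", "zh"], content)
```

   Filtering returns all three Chinese items because of the trailing
   "zh", while lookup returns "zh-Hans" because it is listed first.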

710	   Many languages are written predominantly in a single script.  This is
711	   usually recorded in the Suppress-Script field in that language
712	   subtag's registry entry.  For these languages, script subtags SHOULD
713	   NOT be used to form a language range.  Thus the language range "en-
714	   Latn" is inappropriate in most cases (because the vast majority of
715	   English documents are written in the Latin script and thus the 'en'
716	   language subtag has a Suppress-Script field for 'Latn' in the
717	   registry).
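
   A tool that helps users build language ranges could drop such a
   redundant script subtag, as in the Python sketch below.  The table is
   a one-entry excerpt based on the example above rather than the full
   registry, the function name is hypothetical, and the sketch assumes
   the script subtag directly follows the primary language subtag.

```python
# Illustrative excerpt only: language subtag -> Suppress-Script value.
SUPPRESS_SCRIPT = {"en": "Latn"}


def drop_suppressed_script(language_range):
    """Remove a script subtag that the registry marks as suppressed."""
    subtags = language_range.split("-")
    if len(subtags) > 1:
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if suppressed and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)
```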

719	   When working with tags and ranges, note that extensions and most
720	   private-use subtags are orthogonal to language tag matching, in that
721	   they specify additional attributes of the text not related to the
722	   goals of most matching schemes.  Users SHOULD avoid using these
723	   subtags in language ranges, since they interfere with the selection
724	   of available content.  When used in language tags (as opposed to
725	   ranges), these subtags normally do not interfere with filtering
726	   (Section 3), since they appear at the end of the tag and will match
727	   all prefixes.  Lookup (Section 3.4) implementations are advised to
728	   ignore unrecognized private-use and extension subtags when performing
729	   language tag fallback.

731	4.2.  Meaning of Language Tags and Ranges

733	   Selecting language tags using language ranges requires some
734	   understanding by users of what they are selecting.  The meaning of
735	   the various subtags in a language range is identical to their
736	   meaning in a language tag (see Section 4.2 in [RFC3066bis]), with the
737	   addition that the wildcard "*" represents any matching sequence of
738	   values.

740	4.3.  Considerations for Private Use Subtags

742	   Private agreement is necessary between the parties that intend to use
743	   or exchange language tags that contain private-use subtags.  Great
744	   caution SHOULD be used in employing private-use subtags in content or
745	   protocols intended for general use.  Private-use subtags are simply
746	   useless for information exchange without prior arrangement.

748	   The value and semantic meaning of private-use tags and of the subtags
749	   used within such a language tag are not defined.  Matching private-
750	   use tags using language ranges or extended language ranges can result
751	   in unpredictable content being returned.

753	4.4.  Length Considerations for Language Ranges

755	   Language ranges are very similar to language tags in terms of content
756	   and usage.  The same types of restrictions on length that can be
757	   applied to language tags can also be applied to language ranges.  See
758	   [RFC3066bis] Section 4.3 (Length Considerations).

760	5.  IANA Considerations

762	   This document presents no new or existing considerations for IANA.

764	6.  Security Considerations

766	   Language ranges used in content negotiation might be used to infer
767	   the nationality of the sender, and thus identify potential targets
768	   for surveillance.  In addition, unique or highly unusual language
769	   ranges or combinations of language ranges might be used to track a
770	   specific individual's activities.

772	   This is a special case of the general problem that anything you send
773	   is visible to the receiving party.  It is useful to be aware that
774	   such concerns can exist in some cases.

776	   The evaluation of the exact magnitude of the threat, and any possible
777	   countermeasures, is left to each application or protocol.

779	7.  Character Set Considerations

781	   Language tags permit only the characters A-Z, a-z, 0-9, and HYPHEN-
782	   MINUS (%x2D).  Language ranges also use the character ASTERISK
783	   (%x2A).  These characters are present in most character sets, so
784	   presentation or exchange of language tags or ranges should not be
785	   constrained by character set issues.
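
   A character repertoire check along these lines can be written in a
   few lines of Python.  The function name is hypothetical, and the
   check covers only the permitted characters; it says nothing about
   whether the range is well-formed.

```python
import re

# A-Z, a-z, 0-9, HYPHEN-MINUS (%x2D), and ASTERISK (%x2A).
_PERMITTED = re.compile(r"[A-Za-z0-9*-]+")


def uses_permitted_characters(language_range):
    """True if the range is non-empty and uses only permitted characters."""
    return _PERMITTED.fullmatch(language_range) is not None
```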

787	8.  References

789	8.1.  Normative References

791	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
792	              Requirement Levels", BCP 14, RFC 2119, March 1997.

794	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
795	              Languages", BCP 18, RFC 2277, January 1998.

797	   [RFC3066bis]
798	              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
799	              Identification of Languages", October 2005, <http://
800	              www.ietf.org/internet-drafts/
801	              draft-ietf-ltru-registry-14.txt>.

803	   [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
804	              Specifications: ABNF", RFC 4234, October 2005.

806	8.2.  Informative References

808	   [RFC1766]  Alvestrand, H., "Tags for the Identification of
809	              Languages", RFC 1766, March 1995.

811	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
812	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
813	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

815	   [RFC2616errata]
816	              IETF, "HTTP/1.1 Specification Errata", October 2004,
817	              <http://purl.org/NET/http-errata>.

819	   [RFC3066]  Alvestrand, H., "Tags for the Identification of
820	              Languages", BCP 47, RFC 3066, January 2001.

822	   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
823	              May 2002.

825	   [XML10]    Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., and
826	              F. Yergeau, "Extensible Markup Language (XML) 1.0 (Third
827	              Edition)", World Wide Web Consortium Recommendation,
828	              February 2004, <http://www.w3.org/TR/REC-xml>.

830	Appendix A.  Acknowledgments

832	   Any list of contributors is bound to be incomplete; please regard the
833	   following as only a selection from the group of people who have
834	   contributed to make this document what it is today.

836	   The contributors to [RFC1766] and [RFC3066], each of which was a
837	   precursor to this document, contributed greatly to the development of
838	   language tag matching, and, in particular, the basic language range
839	   and the basic matching scheme.  This document was originally part of
840	   [RFC3066bis], but was split off before that document's completion.
841	   Thus, directly or indirectly, those acknowledged in [RFC3066bis] also
842	   had a hand in the development of this document, and work done prior
843	   to the split is acknowledged in that document.

845	   The following people (in alphabetical order by family name)
846	   contributed to this document:

848	   Harald Alvestrand, Stephane Bortzmeyer, Jeremy Carroll, Peter
849	   Constable, John Cowan, Mark Crispin, Martin Duerst, Frank Ellermann,
850	   Doug Ewell, Debbie Garside, Marion Gunn, Jon Hanna, Kent Karlsson,
851	   Erkki Kolehmainen, Jukka Korpela, Ira McDonald, M. Patton, Randy
852	   Presuhn, Eric van der Poel, Markus Scherer, Misha Wolf, and many,
853	   many others.

855	   Very special thanks must go to Harald Tveit Alvestrand, who
856	   originated RFCs 1766 and 3066, and without whom this document would
857	   not have been possible.

859	Authors' Addresses

861	   Addison Phillips (editor)
862	   Yahoo! Inc.

864	   Email: addison@inter-locale.com

866	   Mark Davis (editor)
867	   Google

869	   Email: mark.davis@macchiato.com

871	Intellectual Property Statement

873	   The IETF takes no position regarding the validity or scope of any
874	   Intellectual Property Rights or other rights that might be claimed to
875	   pertain to the implementation or use of the technology described in
876	   this document or the extent to which any license under such rights
877	   might or might not be available; nor does it represent that it has
878	   made any independent effort to identify any such rights.  Information
879	   on the procedures with respect to rights in RFC documents can be
880	   found in BCP 78 and BCP 79.

882	   Copies of IPR disclosures made to the IETF Secretariat and any
883	   assurances of licenses to be made available, or the result of an
884	   attempt made to obtain a general license or permission for the use of
885	   such proprietary rights by implementers or users of this
886	   specification can be obtained from the IETF on-line IPR repository at
887	   http://www.ietf.org/ipr.

889	   The IETF invites any interested party to bring to its attention any
890	   copyrights, patents or patent applications, or other proprietary
891	   rights that may cover technology that may be required to implement
892	   this standard.  Please address the information to the IETF at
893	   ietf-ipr@ietf.org.

895	Disclaimer of Validity

897	   This document and the information contained herein are provided on an
898	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
899	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
900	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
901	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
902	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
903	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

905	Copyright Statement

907	   Copyright (C) The Internet Society (2006).  This document is subject
908	   to the rights, licenses and restrictions contained in BCP 78, and
909	   except as set forth therein, the authors retain all their rights.

911	Acknowledgment

913	   Funding for the RFC Editor function is currently provided by the
914	   Internet Society.