idnits 2.17.1 

draft-ietf-ltru-matching-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 990.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 967.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 974.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 980.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 658 has weird spacing: '...becomes  en-US...'

  == Line 659 has weird spacing: '...becomes  en-La...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (November 16, 2005) is 6729 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
     RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)

  -- Obsolete informational reference (is this intentional?): RFC 1766
     (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)


     Summary: 5 errors (**), 0 flaws (~~), 5 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                            Quest Software
4	Obsoletes: 3066 (if approved)                              M. Davis, Ed.
5	Expires: May 20, 2006                                                IBM
6	                                                       November 16, 2005

8	           Matching Tags for the Identification of Languages
9	                      draft-ietf-ltru-matching-06

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on May 20, 2006.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2005).

40	Abstract

42	   This document describes different mechanisms for comparing, matching,
43	   and evaluating language tags.  Possible algorithms for language
44	   negotiation and content selection are described.  This document, in
45	   combination with RFC 3066bis (replace "3066bis" with the RFC number
46	   assigned to draft-ietf-ltru-registry-14), replaces RFC 3066, which
47	   replaced RFC 1766.

49	Table of Contents

51	   1.  Introduction
52	   2.  The Language Range
53	     2.1.  Lists of Language Ranges
54	     2.2.  Basic Language Range
55	     2.3.  Extended Language Range
56	   3.  Types of Matching
57	     3.1.  Choosing a Type of Matching
58	     3.2.  Filtering
59	       3.2.1.  Filtering with Basic Language Ranges
60	       3.2.2.  Filtering with Extended Language Ranges
61	       3.2.3.  Distance Metric Filtering
62	     3.3.  Lookup
63	   4.  Other Considerations
64	     4.1.  Meaning of Language Tags and Ranges
65	     4.2.  Considerations for Private Use Subtags
66	     4.3.  Length Considerations in Matching
67	   5.  IANA Considerations
68	   6.  Changes
69	   7.  Security Considerations
70	   8.  Character Set Considerations
71	   9.  References
72	     9.1.  Normative References
73	     9.2.  Informative References
74	   Appendix A.  Acknowledgements
75	   Authors' Addresses
76	   Intellectual Property and Copyright Statements

78	1.  Introduction

80	   Human beings on our planet have, past and present, used a number of
81	   languages.  There are many reasons why one would want to identify the
82	   language used when presenting or requesting information.

84	   Information about a user's language preferences commonly needs to be
85	   identified so that appropriate processing can be applied.  For
86	   example, the user's language preferences in a browser can be used to
87	   select web pages appropriately.  Language preferences can also be
88	   used to select among tools (such as dictionaries) to assist in the
89	   processing or understanding of content in different languages.

91	   Given a set of language identifiers, such as those defined in
92	   [RFC3066bis], various mechanisms can be envisioned for performing
93	   language negotiation and tag matching.  Applications, protocols, or
94	   specifications will have varying needs and requirements that will
95	   affect the choice of a suitable mechanism.  Protocols and
96	   specifications SHOULD clearly indicate the particular mechanism used
97	   in selecting or matching language tags.

99	   This document defines several mechanisms for matching, selecting, or
100	   filtering content whose natural language is identified using Language
101	   Tags [RFC3066bis], as well as the syntax (called a "language range")
102	   associated with each of these mechanisms for specifying the user's
103	   language preferences.

105	   This document, in combination with [RFC3066bis] (replace "3066bis"
106	   globally in this document with the RFC number assigned to
107	   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
108	   [RFC1766].

110	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
111	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
112	   document are to be interpreted as described in [RFC2119].

114	2.  The Language Range

116	   Language Tags [RFC3066bis] are used to identify the language of some
117	   information item or content.  Applications or protocols that use
118	   language tags are often faced with the problem of identifying sets of
119	   content that share certain language attributes.  For example, HTTP
120	   1.1 [RFC2616] describes language ranges in its discussion of the
121	   Accept-Language header (Section 14.4), which is used for selecting
122	   content from servers based on the language of that content.

124	   When selecting content according to its language, it is useful to
125	   have a mechanism for identifying sets of language tags that share
126	   specific attributes.  This allows users to select or filter content
127	   based on specific requirements.  Such an identifier is called a
128	   "Language Range".

130	2.1.  Lists of Language Ranges

132	   When users specify a language preference they often need to specify a
133	   prioritized list of language ranges in order to best reflect their
134	   language requirements for the matching operation.  This is especially
135	   true for speakers of minority languages.  A speaker of Breton in
136	   France, for example, may specify "br" followed by "fr", meaning that
137	   if Breton is available, it is preferred, but otherwise French is the
138	   best alternative.  It can get more complex: a speaker may wish to
139	   fall back from Skolt Sami to Northern Sami to Finnish.

141	   A "Language Priority List" consists of a prioritized or weighted list
142	   of language ranges.  One well known example of such a list is the
143	   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
144	   14.4) and RFC 3282 [RFC3282].  The various matching operations
145	   described in this document include considerations for using a
146	   language priority list.
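
As an illustration of a language priority list, the following Python
sketch turns an Accept-Language style value into a list of ranges ordered
by weight.  The function name parse_priority_list and the handling of
malformed weights are assumptions made for this example; they are not
defined by this document or by RFC 2616.

```python
def parse_priority_list(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language style value into (range, weight) pairs,
    highest weight first. Ranges without a q-value default to 1.0."""
    entries = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        language_range, _, params = item.partition(";")
        weight = 1.0
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                # Assumption: an unparsable weight means "not acceptable".
                weight = 0.0
        entries.append((weight, position, language_range.strip()))
    # Sort by descending weight; equal weights keep their original order.
    entries.sort(key=lambda entry: (-entry[0], entry[1]))
    return [(language_range, weight) for weight, _, language_range in entries]
```

For the Sami example, parse_priority_list("sms, se;q=0.8, fi;q=0.5")
yields the ranges "sms", "se", and "fi" in that order.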

148	2.2.  Basic Language Range

150	   A "Basic Language Range" identifies the set of content whose language
151	   tags begin with the same sequence of subtags.  A basic language range
152	   is identified by its 'language-range' tag, by adapting the
153	   ABNF [RFC4234] from HTTP/1.1 [RFC2616]:

155	   language-range = language-tag / "*"
156	   language-tag   = 1*8alphanum *("-" 1*8alphanum)
157	   alphanum       = ALPHA / DIGIT

159	   That is, a language-range has the same syntax as a language-tag or is
160	   the single character "*".  Basic Language Ranges imply that there is
161	   a semantic relationship between language tags that share the same
162	   prefix.  While this is often the case, it is not always true and
163	   users should note that the set of language tags that match a specific
164	   language-range may not be mutually intelligible.
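
A minimal Python sketch of a syntax check for basic language ranges
follows.  It reads the grammar as "either the single character '*' or one
or more hyphen-separated subtags of 1 to 8 ASCII letters or digits"; the
names BASIC_RANGE and is_basic_language_range are hypothetical.

```python
import re

# Assumption: every subtag is 1 to 8 ASCII letters or digits.
BASIC_RANGE = re.compile(r"\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_basic_language_range(text: str) -> bool:
    """Return True if text is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None
```

The check is purely syntactic: it accepts "de-CH" and "*", and rejects
"en-*-US", which is only meaningful as an extended language range.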

166	   Basic language ranges were originally described in [RFC3066] and HTTP
167	   1.1 [RFC2616] (where they are referred to as simply a "language
168	   range").

170	   Users SHOULD avoid subtags that add no distinguishing value to a
171	   language range.  For example, script subtags SHOULD NOT be used to
172	   form a language range with language subtags which have a matching
173	   Suppress-Script field in their registry record.  Thus the language
174	   range "en-Latn" is probably inappropriate in most cases (because the
175	   vast majority of English documents are written in the Latin script and
176	   thus the 'en' language subtag has a Suppress-Script field for 'Latn'
177	   in the registry).

179	   Language tags and thus language ranges are to be treated as case
180	   insensitive: there exist conventions for the capitalization of some
181	   of the subtags, but these MUST NOT be taken to carry meaning.
182	   Matching of language tags to language ranges MUST be done in a case
183	   insensitive manner.

185	   When working with tags and ranges, note that extensions and most
186	   private use subtags are generally orthogonal to language tag fallback
187	   and users SHOULD avoid using these subtags in language ranges, since
188	   they will often interfere with the selection of available language
189	   content.  Since these subtags are always at the end of the sequence
190	   of subtags, they don't normally interfere with the use of prefixes
191	   for matching in the schemes described below.

193	   Note that when working with basic language ranges, no attempt is made
194	   to process the semantics of the tags or ranges in any way.  The
195	   language tag and language range are compared in a case insensitive
196	   manner using basic string processing.  Thus the choice of subtags in
197	   both the language tag and the language range may affect the results
198	   produced.

200	2.3.  Extended Language Range

202	   A Basic Language Range does not always provide the most appropriate
203	   way to specify a user's preferences.  Sometimes it is beneficial to
204	   define a more granular matching scheme that takes advantage of the
205	   internal structure of language tags, by allowing the user to specify,
206	   for example, the value of a specific field in a language tag or to
207	   indicate which values are of interest in filtering or selecting the
208	   content.

210	   In an extended language range, the identifier takes the form of a
211	   series of subtags which must consist of well-formed subtags or the
212	   special subtag "*".  For example, the language range "en-*-US"
213	   specifies a primary language of 'en', followed by any script subtag,
214	   followed by the region subtag 'US'.

216	   An extended language range can be represented by the following ABNF:
217	   extended-language-range  = range ; a range
218	                 / privateuse              ; private use tag
219	                 / grandfathered           ; grandfathered registrations

221	   range         = (language
222	                    ["-" script]
223	                    ["-" region]
224	                    *("-" variant)
225	                    *("-" extension)
226	                    ["-" privateuse])

228	   language      = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
229	                 / 4ALPHA                 ; reserved for future use
230	                 / 5*8ALPHA               ; registered language subtag
231	                 / "*"                    ; ... or wildcard

233	   extlang       = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
234	                                          ; reserved for future use
235	                                          ; wildcard can only appear
236	                                          ;   at the end

238	   script        = 4ALPHA                 ; ISO 15924 code
239	                 / "*"                    ; or wildcard

241	   region        = 2ALPHA                 ; ISO 3166 code
242	                 / 3DIGIT                 ; UN M.49 code
243	                 / "*"                    ; ... or wildcard

245	   variant       = 5*8alphanum            ; registered variants
246	                 / (DIGIT 3alphanum)      ;
247	                 / "*"                    ; ... or wildcard

249	   extension     = singleton *("-" (2*8alphanum)) [ "-*" ]
250	                                          ; extension subtags
251	                                          ; wildcard can only appear
252	                                          ;   at the end

254	   singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
255	                 ; "A"-"W" / "Y"-"Z" / "a"-"w" / "y"-"z" / "0"-"9"
256	                 ; Single letters: x/X is reserved for private use

258	   privateuse    = ("x"/"X") 1*("-" (1*8alphanum))

260	   grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
261	                   ; grandfathered registration
262	                   ; Note: i is the only singleton
263	                   ; that starts a grandfathered tag

265	   alphanum      = (ALPHA / DIGIT)       ; letters and numbers

267	   A field not present in the middle of an extended language range MAY
268	   be treated as if the field contained a "*".  For example, the range
269	   "en-US" MAY be considered to be equivalent to the range "en-*-US".
270	   This also means that multiple wildcards can be collapsed (so that
271	   "en-*-*-US" is equivalent to "en-*-US").
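
The collapsing of adjacent wildcards can be sketched in Python as shown
here.  The function name collapse_wildcards is an assumption for
illustration; the sketch does not attempt to insert wildcards for fields
that are absent from the range.

```python
def collapse_wildcards(language_range: str) -> str:
    """Replace each run of consecutive "*" subtags with a single "*"."""
    collapsed = []
    for subtag in language_range.split("-"):
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue  # skip a wildcard that directly follows another one
        collapsed.append(subtag)
    return "-".join(collapsed)
```

With this helper, collapse_wildcards("en-*-*-US") returns "en-*-US".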

273	   When working with tags and ranges users SHOULD note the following:

275	   1.  Private-use and Extension subtags are normally orthogonal to
276	       language tag fallback.  Implementations or specifications that
277	       use a lookup (Section 3.3) matching scheme SHOULD ignore
278	       unrecognized private-use and extension subtags when performing
279	       language tag fallback.  Since these subtags are always at the end
280	       of the sequence of subtags, they don't normally interfere with
281	       the use of prefixes for matching in the schemes described below.

283	   2.  Applications, specifications, or protocols that choose not to
284	       interpret one or more private-use or extension subtags SHOULD NOT
285	       remove or modify these extensions in content that they are
286	       processing.  When a language tag instance is to be used in a
287	       specific, known protocol, and is not being passed through to
288	       other protocols, language tags MAY be filtered to remove subtags
289	       and extensions that are not supported by that protocol.  Such
290	       filtering SHOULD be avoided, if possible, since it removes
291	       information that might be relevant if services on the other end
292	       of the protocol would make use of that information.

294	   3.  Some applications of language tags might want or need to consider
295	       extensions and private-use subtags when matching tags.  If
296	       extensions and private-use subtags are included in a matching or
297	       filtering process that utilizes one of the schemes described
298	       in this document, then the implementation SHOULD canonicalize the
299	       language tags and/or ranges before performing the matching.  Note
300	       that language tag processors that claim to be "well-formed"
301	       processors as defined in [RFC3066bis] generally fall into this
302	       category.

304	   There are several matching algorithms or schemes which can be applied
305	   when matching extended language ranges to language tags.

307	3.  Types of Matching

309	   Matching language ranges to language tags can be done in a number of
310	   different ways.  This section describes the different types of
311	   matching scheme, as well as the considerations for choosing between
312	   them.

314	   There are two basic types of matching scheme: those that produce an
315	   open-ended set of content (called "filtering") and those that produce
316	   a single information item for a given request (called "lookup").

318	   A key difference between these two types of matching scheme is that
319	   the language range for filtering operations is always the _least_
320	   specific tag one will accept as a match, while for lookup operations
321	   the language range is always the _most_ specific tag.

323	3.1.  Choosing a Type of Matching

325	   Applications, protocols, and specifications are faced with the
326	   decision of what type of matching to use.  Sometimes, different
327	   styles of matching might be suited for different kinds of processing
328	   within a particular application or protocol.

330	   Filtering can be used to produce a set of results (such as a
331	   collection of documents).  For example, if using a search engine, one
332	   might use filtering to limit the results to documents written in
333	   French.  It can also be used when deciding whether to perform some
334	   processing that is language sensitive on some content.  For example,
335	   a process might cause paragraphs whose language tag matched the
336	   language range "nl" to be displayed in italics within a document.

338	   This document describes three types of filtering:

340	   1.  Basic Filtering (Section 3.2.1) is used to match content using
341	       basic language ranges (Section 2.2).  It is compatible with
342	       implementations that do not produce extended language ranges.

344	   2.  Extended Range Filtering (Section 3.2.2) is used to match content
345	       using extended language ranges (Section 2.3).  Newer implementations
346	       SHOULD use this form of filtering in preference to basic
347	       filtering.

349	   3.  Scored Filtering (Section 3.2.3), also called distance metric
350	       filtering, produces an ordered set of content using either basic
351	       or extended language ranges.  It should be used when the quality
352	       of the match within a specific language range is important, as
353	       when presenting a list of documents resulting from a search.

355	   Lookup (Section 3.3) is used when each request MUST produce exactly
356	   one piece of content.  For example, a Web server might use the
357	   Accept-Language HTTP header to choose which language to return a
358	   custom 404 page in: since it can return only one page, it must choose
359	   a single item and it must return some item, even if no content
360	   matches the language ranges supplied by the user.

362	   Most types of matching in this document are designed so that
363	   implementations do not have to examine the values of the subtags
364	   supplied and, except for scored filtering, they do not need access to
365	   the Language Subtag Registry nor do they require the use of valid
366	   subtags in either language tags or language ranges.  This has great
367	   benefit for speed and simplicity of implementation.

369	   Implementations might also wish to use semantic information external
370	   to the language tags when performing fallback.  For example, the
371	   primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
372	   Norwegian) might both be usefully matched to the more general subtag
373	   'no' (Norwegian).  Or an implementation might infer that content
374	   labeled "zh-CN" is more likely to match the range "zh-Hans" than
375	   equivalent content labeled "zh-TW".

377	3.2.  Filtering

379	   Filtering is used to select the set of content that matches a given
380	   prefix.  It is called "filtering" because this set of content may
381	   contain no items at all or it may return an arbitrary number of
382	   matching items--as many as match the language range used to specify
383	   the items, thus filtering out the non-matching content.

385	   In filtering, the language range represents the _least_ specific tag
386	   which is an acceptable match.  That is, all of the language tags in
387	   the set of filtered content will have an equal or greater number of
388	   subtags than the language range.  For example, if the language range
389	   is "de-CH", one might see matching content with the tag "de-CH-1996"
390	   but one will never see a match with the tag "de".

392	   If the language priority list (see Section 2.1) contains more than
393	   one range, the content returned is typically ordered in descending
394	   level of preference.
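
One way to order filtered content by preference is sketched below in
Python, assuming simple prefix matching of basic language ranges and a
priority list that is already sorted from most to least preferred.  The
names prefix_match and filter_ordered, and the (tag, item) input shape,
are hypothetical.

```python
def prefix_match(language_range: str, tag: str) -> bool:
    """Case-insensitive prefix match on whole subtags; "*" matches any tag."""
    wanted, candidate = language_range.lower(), tag.lower()
    return wanted == "*" or candidate == wanted or candidate.startswith(wanted + "-")


def filter_ordered(priority_list, tagged_items):
    """Return the items whose tag matches some range, ordered by the
    position of the first range they match (most preferred first)."""
    ranked = []
    for index, (tag, item) in enumerate(tagged_items):
        for rank, language_range in enumerate(priority_list):
            if prefix_match(language_range, tag):
                ranked.append((rank, index, item))
                break
    # (rank, index) pairs are unique, so the items themselves are never compared.
    ranked.sort()
    return [item for _, _, item in ranked]
```

Items that match no range are dropped; items that match the same range
keep their original relative order.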

396	   Some examples where filtering might be appropriate include:

398	   o  Applying a style to sections of a document in a particular
399	      language range.

401	   o  Displaying the set of documents containing a particular set of
402	      keywords written in a specific language.

404	   o  Selecting all email items written in a specific range of languages.

406	   Filtering can produce either an ordered or an unordered set of results.
407	   For example, applying formatting to a document based on the language
408	   of specific pieces of content does not require the content to be
409	   ordered.  It is sufficient to know whether a specific piece of
410	   content matches or does not match.  A search application, on the
411	   other hand, probably would put the results into a priority order.

413	   If an ordered set is desired, as described above, then the
414	   application or protocol needs to determine the relative "quality" of
415	   the match between different language tags and the language range.

417	   This measurement is called a "distance metric".  A distance metric
418	   assigns a numeric value to the comparison of each language tag to a
419	   language range and represents the 'distance' between the two.  A
420	   distance of zero means that they are identical, a small distance
421	   indicates that they are very similar, and a large distance indicates
422	   that they are very different.  Using a distance metric,
423	   implementations can, for example, allow users to select a threshold
424	   distance for a match to be "successful" while filtering, or they can
425	   use the numeric value to order the results.

427	3.2.1.  Filtering with Basic Language Ranges

429	   When filtering using a basic language range, the language range
430	   matches a language tag if it exactly equals the tag, or if it exactly
431	   equals a prefix of the tag such that the first character following
432	   the prefix is "-".  (That is, the language-range "de-de" matches the
433	   language tag "de-DE-1996", but not the language tag "de-Deva".)

435	   The special range "*" matches any tag.  A protocol which uses
436	   language ranges MAY specify additional rules about the semantics of
437	   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
438	   languages not matched by any other range within an "Accept-Language"
439	   header.
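
The basic filtering rule can be sketched in Python as follows.  The
function names are assumptions for illustration, and the sketch applies
only the generic meaning of "*" (it matches any tag), not the additional
HTTP/1.1 semantics.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Return True if the basic language range matches the language tag."""
    wanted = language_range.lower()
    candidate = tag.lower()
    if wanted == "*":
        return True
    # Either an exact match, or a prefix that ends at a subtag boundary.
    return candidate == wanted or candidate.startswith(wanted + "-")


def basic_filter(language_range: str, tags):
    """Return the tags matched by the range, in their original order."""
    return [tag for tag in tags if basic_filter_match(language_range, tag)]
```

Here basic_filter_match("de-de", "de-DE-1996") is True, while
basic_filter_match("de-de", "de-Deva") is False because the prefix does
not end at a hyphen.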

441	3.2.2.  Filtering with Extended Language Ranges

443	   In the Extended Range Matching scheme, each extended language range
444	   in the language priority list is considered in turn, according to
445	   priority.  The subtags in each extended language range are compared
446	   to the corresponding subtags in the language tag being examined.  The
447	   subtag from the range is considered to match if it exactly matches
448	   the corresponding subtag in the tag or the range's subtag has the
449	   value "*" (which matches all subtags, including the empty subtag).
450	   Extended Range Matching is an extension of basic matching
451	   (Section 3.2.1): the language range represents the least specific tag
452	   which is an acceptable match.

454	   Private use subtags MAY be specified in the language range and MUST
455	   NOT be ignored when matching.

457	   Subtags not specified, including those at the end of the language
458	   range, are assigned the value "*".  This makes each range into a
459	   prefix much like that used in basic language range matching.  For
460	   example, the extended language range "zh-*-CN" matches all of the
461	   following tags because the unspecified variant field is expanded to
462	   "*":

464	      zh-Hant-CN

466	      zh-CN

468	      zh-Hans-CN

470	      zh-CN-x-wadegile

472	      zh-Latn-CN-boont

474	      zh-cmn-Hans-CN-x-private
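
A possible implementation of extended range filtering is sketched below
in Python.  This section does not spell out a step-by-step procedure, so
the greedy subtag walk used here is an assumption; it is modelled on the
extended filtering steps later published in RFC 4647, and the function
name extended_filter_match is hypothetical.

```python
def extended_filter_match(language_range: str, tag: str) -> bool:
    """Return True if the extended language range matches the tag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtags must agree unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    position = 1  # index of the next tag subtag to examine
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard matches any run of subtags, including none
        while True:
            if position >= len(tag_subtags):
                return False  # tag ran out before the range was satisfied
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break
            if len(current) == 1:
                # Assumption: never skip past an extension or private-use
                # singleton while searching for the wanted subtag.
                return False
    # Anything left in the tag is covered by the implicit trailing wildcard.
    return True
```

Under these assumptions the range "zh-*-CN" matches all six tags listed
above and does not match "zh-Hant-TW".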

476	3.2.3.  Distance Metric Filtering

478	   Both basic and extended language range filtering produce simple
479	   boolean matches.  Sometimes it may be beneficial to provide an array
480	   of results with different levels of matching, for example, sorting
481	   results based on the overall "quality" of the match.  Distance metric
482	   filtering provides a way to generate these quality values.

484	   First both the extended language range and the language tags to be
485	   matched to it must be canonicalized by mapping grandfathered and
486	   obsolete tags into modern equivalents.

488	   The language range and the language tags are then transformed into
489	   quintuples of elements of the form (language, script, region,
490	   variant, extension).  Any extended language subtags are considered
491	   part of the language element; private use subtag sequences are
492	   considered part of the language element if in the initial position in
493	   the tag and part of the variant element if not.  Language subtags
494	   'und', 'mul', and the script subtag 'Zyyy' are converted to "*".

496	   Missing components in the language-tag are set to "*"; thus a "*"
497	   pattern becomes the quintuple ("*", "*", "*", "*", "*").  Missing
498	   components in the extended language-range are handled similarly to
499	   extended range lookup: missing internal subtags are expanded to "*".

501	   Missing end subtags are expanded as the empty string.  Thus a pattern
502	   "en-US" becomes the quintuple ("en","*","US","","").

504	   Here are some examples of language-tags and their quintuples:

506	      en-US ("en","*","US","*","*")

508	      sr-Latn ("sr","Latn","*","*","*")

510	      zh-cmn-Hant ("zh-cmn","Hant","*","*","*")

512	      x-foo ("x-foo","*","*","*","*")

514	      en-x-foo ("en","*","*","x-foo","*")

516	      i-default ("i-default","*","*","*","*")

518	      sl-Latn-IT-rozaj ("sl","Latn","IT","rozaj","*")

520	      zh-r-wadegile ("zh","*","*","*","r-wadegile") // hypothetical

522	   Each language-range/language-tag pair being compared is assigned a
523	   distance value, whereby small values indicate better matches and
524	   large values indicate worse ones.  The distance between the pair is
525	   the sum of the distances for each of the corresponding elements of
526	   the quintuple.  If the elements are identical or one is '*', then the
527	   distance value between them is zero.  Otherwise, it is given by the
528	   following table:
529	     256    language mismatch
530	     128    script mismatch
531	      32    region mismatch
532	       4    variant mismatch
533	       1    extension mismatch

535	   A value of 0 is a perfect match; 421 is no match at all.  Different
536	   threshold values might be appropriate for different applications or
537	   protocols.  Implementations will usually allow users to choose the
538	   most appropriate selection value, ranking the matched items based on
539	   score.

541	   Examples of various tags' distances from the range "en-US":

543	   "fr"             256 (language mismatch, region match)
544	   "en-GB"           32 (region mismatch)
545	   "en-Latn-US"       0 (all fields match)
546	   "en-Brai"         32 (region mismatch)
547	   "en-US-x-foo"      4 (variant mismatch: range is the empty string)
548	   "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
549	   Implementations or protocols sometimes might wish to use more
550	   sophisticated weights that depend on the values of the corresponding
551	   elements.  For example, depending on the domain, an implementation
552	   might give a small distance to the difference between the language
553	   subtag 'no' and the closely related language subtags 'nb' or 'nn'; or
554	   between the script subtags 'Kata' and 'Hira'; or between the region
555	   subtags 'US' and 'UM'.

557	3.3.  Lookup

559	   Lookup is used to select the single information item that best
560	   matches the language priority list for a given request.  In lookup,
561	   each language range in the language priority list represents the
562	   _most_ specific tag which is an acceptable match; only the closest
563	   matching item according to the user's priority is returned.  For
564	   example, if the language range is "de-CH", one might expect to
565	   receive an information item with the tag "de" but never one with the
566	   tag "de-CH-1996".  Usually if no content matches the request, a
567	   "default" item is returned.

569	   For example, if an application inserts some dynamic content into a
570	   document, returning an empty string if there is no exact match is not
571	   an option.  Instead, the application "falls back" until it finds a
572	   suitable piece of content to insert.  Other examples of lookup might
573	   include:

575	   o  Selection of a template containing the text for an automated email
576	      response.

578	   o  Selection of a graphic containing text for inclusion in a
579	      particular Web page.

581	   o  Selection of a string of text for inclusion in an error log.

583	   In the Lookup scheme, the language range is progressively truncated
584	   from the end until a matching piece of content is located.  For
585	   example, starting with the range "zh-Hant-CN-x-private", the lookup
586	   would progressively search for content as shown below:

588	   Range to match: zh-Hant-CN-x-private
589	   1. zh-Hant-CN-x-private
590	   2. zh-Hant-CN
591	   3. zh-Hant
592	   4. zh
593	   5. (default content or the empty tag)

595	   Figure 5: Example of a Lookup Fallback Pattern
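
   The fallback pattern in Figure 5 might be generated by a routine
   such as the one below.  This is only a sketch: the function name
   fallback_ranges is hypothetical, and the removal of a trailing
   single-character subtag (the "x" in the example) is an assumption
   inferred from the figure, which goes directly from
   "zh-Hant-CN-x-private" to "zh-Hant-CN".

```python
def fallback_ranges(language_range):
    """Yield the range, then each progressively shorter form of it."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        # Remove the last subtag.
        subtags.pop()
        # Assumption drawn from Figure 5: a single-character subtag
        # left at the end (such as "x") is removed as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


for candidate in fallback_ranges("zh-Hant-CN-x-private"):
    print(candidate)
```

   The caller is expected to try the default content once the
   generator is exhausted.
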
596	   This scheme allows some flexibility in finding content.  When data
597	   is not available at a specific level of tag granularity, or is
598	   sparsely populated, it also typically provides better results than
599	   falling back directly to the default language for the system or
599	   content.

601	   The language range "*" matches any language tag.  In the lookup
602	   scheme, this language range does not convey enough information to
603	   determine which content is most appropriate.  If this language range
604	   is the only one in the language priority list, it matches the default
605	   content.  If this language range is followed by other language
606	   ranges, it should be skipped.

608	   When performing lookup using a language priority list, the
609	   progressive search MUST proceed to consider each language range
610	   before finding the default content or empty tag.  The default content
611	   might be content with no language tag (or with an empty value, as
612	   with xml:lang in the XML specification), or it might be a particular
613	   language designated for that bit of content.

615	   One common way to provide for default content is to allow a specific
616	   language range to be set as the default for a specific type of
617	   request.  This language range is then treated as if it were appended
618	   to the end of the language priority list, rather than after each item
619	   in the language priority list.

621	   For example, if a particular user's language priority list were
622	   "fr-FR; zh-Hant" and the program doing the matching had a default
623	   language range of "ja-JP", the program would search for content as
624	   follows:
625	   1. fr-FR
626	   2. fr
627	   3. zh-Hant // next language
628	   4. zh
629	   5. (return default content)
630	      a. ja-JP
631	      b. ja
632	      c. (empty tag or other default content)

634	   Figure 6: Lookup Using a Language Priority List
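
   A lookup over a language priority list, with the default range
   treated as if appended to the end of the list, could be sketched as
   shown below.  The names lookup and fallbacks, the dictionary mapping
   tags to content, and the use of the empty string as the key for
   untagged default content are all assumptions made for illustration.
   Tags are compared case-insensitively, and the "*" range is skipped
   so that it leads to the default content.

```python
def fallbacks(language_range):
    """Yield the range and its progressively truncated forms."""
    subtags = language_range.split("-")
    while subtags:
        yield "-".join(subtags)
        subtags.pop()
        # A trailing single-character subtag is removed as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()


def lookup(priority_list, content, default_range=None):
    """Return (tag, item) for the best match, or (None, None)."""
    # Hypothetical content store: a dict of language tag -> item.
    index = {tag.lower(): (tag, item) for tag, item in content.items()}
    # The "*" range carries no preference, so it is skipped.
    ranges = [r for r in priority_list if r != "*"]
    if default_range is not None:
        # The default is consulted once, after every listed range.
        ranges.append(default_range)
    for language_range in ranges:
        for candidate in fallbacks(language_range):
            hit = index.get(candidate.lower())
            if hit is not None:
                return hit
    # Last resort: content stored under the empty tag, if any.
    return index.get("", (None, None))


content = {"zh": "Chinese text", "ja": "Japanese text", "": "untagged"}
print(lookup(["fr-FR", "zh-Hant"], content, default_range="ja-JP"))
```

   Every range in the priority list is exhausted before the default
   range is considered, which mirrors the order shown in Figure 6.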

636	   In some cases, the language priority list might contain one or more
637	   extended language ranges (as, for example, when the same language
638	   priority list is used as input for both lookup and filtering
639	   operations).  Wildcard values in an extended language range are
640	   supposed to match any value that occurs in that position in a
641	   language tag.  Since only one item can be returned for any given
642	   lookup request, the wildcards must be processed in a predictable
643	   manner (or the same request might produce widely varying results).

645	   Thus, for each range in the language priority list, the following
646	   rules must be applied to produce a basic language range for use in
647	   the fallback mechanism:

649	   1.  If the first subtag in the extended language range is a "*" then
650	       the entire range is converted to "*".

652	   2.  For each subsequent subtag, if the value is a "*" then that
653	       subtag and its preceding hyphen are removed.

655	   For example:

657	   *-US      becomes  *
658	   en-*-US   becomes  en-US
659	   en-Latn-* becomes  en-Latn

661	   Figure 7: Transformation of Extended Language Ranges
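
   The two rules above are simple enough to express in a few lines.
   The following is a sketch under the assumption that the extended
   language range is already syntactically valid; the function name
   to_basic_range is hypothetical.

```python
def to_basic_range(extended_range):
    """Convert an extended language range to a basic language range."""
    subtags = extended_range.split("-")
    # Rule 1: a leading "*" turns the entire range into "*".
    if subtags[0] == "*":
        return "*"
    # Rule 2: any later "*" is removed along with its hyphen.
    return "-".join(s for s in subtags if s != "*")


for extended in ("*-US", "en-*-US", "en-Latn-*"):
    print(extended, "becomes", to_basic_range(extended))
```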

663	   For the language priority list "*-US; fr-*-FR; zh-Hant", the fallback
664	   pattern would be:
665	   1. * (skipped)
666	   2. fr-FR
667	   3. fr
668	   4. zh-Hant
669	   5. zh
670	   6. (default content)

672	   Figure 8: Extended Language Range Fallback Example

674	4.  Other Considerations

676	   When working with language ranges and matching schemes, there are
677	   some additional points that may influence the choice of either.

679	4.1.  Meaning of Language Tags and Ranges

681	   Selecting content using language ranges requires some understanding
682	   by users of what they are selecting.  A language tag or range
683	   identifies a language as spoken (or written, signed or otherwise
684	   signaled) by human beings for communication of information to other
685	   human beings.

687	   If a language tag B contains language tag A as a prefix, then B is
688	   typically "narrower" or "more specific" than A. For example, "zh-
689	   Hant-TW" is more specific than "zh-Hant".

691	   This relationship is not guaranteed in all cases: specifically,
692	   languages that begin with the same sequence of subtags are NOT
693	   guaranteed to be mutually intelligible, although they might be.

695	   For example, the tag "az" shares a prefix with both "az-Latn"
696	   (Azerbaijani written using the Latin script) and "az-Arab"
697	   (Azerbaijani written using the Arabic script).  A person fluent in
698	   one script might not be able to read the other, even though the text
699	   might be otherwise identical.  Content tagged as "az" most probably
700	   is written in just one script and thus might not be intelligible to a
701	   reader familiar with the other script.

703	   Variant subtags in particular seem to represent specific divisions in
704	   mutual understanding, since they often encode dialects or other
705	   idiosyncratic variations within a language.

707	   The relationship between the language tag and the information it
708	   relates to is defined by the standard describing the context in which
709	   it appears.  Accordingly, this section can only give possible
710	   examples of its usage:

712	   o  For a single information object, the associated language tags
713	      might be interpreted as the set of languages that are necessary
714	      for a complete comprehension of the complete object.  Example:
715	      Plain text documents.

717	   o  For an aggregation of information objects, the associated language
718	      tags could be taken as the set of languages used inside components
719	      of that aggregation.  Examples: Document stores and libraries.

721	   o  For information objects whose purpose is to provide alternatives,
722	      the associated language tags could be regarded as a hint that the
723	      content is provided in several languages, and that one has to
724	      inspect each of the alternatives in order to find its language or
725	      languages.  In this case, the presence of multiple tags might not
726	      mean that one needs to be multi-lingual to get complete
727	      understanding of the document.  Example: MIME multipart/
728	      alternative.

730	   o  In markup languages, such as HTML and XML, language information
731	      can be added to each part of the document identified by the markup
732	      structure (including the whole document itself).  For example, one
733	      could write <span lang="FR">C'est la vie.</span> inside a
734	      Norwegian document; the Norwegian-speaking user could then access
735	      a French-Norwegian dictionary to find out what the marked section
736	      meant.  If the user were listening to that document through a
737	      speech synthesis interface, this information could be used to signal
738	      the synthesizer to appropriately apply French text-to-speech
739	      pronunciation rules to that span of text, instead of misapplying
740	      the Norwegian rules.

742	4.2.  Considerations for Private Use Subtags

744	   Private-use subtags require private agreement between the parties
745	   that intend to use or exchange language tags that use them and great
746	   caution SHOULD be used in employing them in content or protocols
747	   intended for general use.  Private-use subtags are simply useless for
748	   information exchange without prior arrangement.

750	   The value and semantic meaning of private-use tags and of the subtags
751	   used within such a language tag are not defined.  Matching private
752	   use tags using language ranges or extended language ranges can result
753	   in unpredictable content being returned.

755	4.3.  Length Considerations in Matching

757	   RFC 3066 [RFC3066] did not provide an upper limit on the size of
758	   language tags or ranges.  RFC 3066 did define the semantics of
759	   particular subtags in such a way that most language tags or ranges
760	   consisted of language and region subtags with a combined total length
761	   of up to six characters.  Larger tags and ranges (in terms of both
762	   subtags and characters) did exist, however.

764	   [RFC3066bis] also does not impose a fixed upper limit on the number
765	   of subtags in a language tag or range (and thus an upper bound on the
766	   size of either).  The syntax in that document suggests that,
767	   depending on the specific language or range of languages, more
768	   subtags (and thus characters) are sometimes necessary as a result.

770	   Length considerations and their impact on the selection and
771	   processing of tags are described in Section 2.1.1 of that document.

773	   An application or protocol MAY choose to limit the length of the
774	   language tags or ranges used in matching.  Any such limitation SHOULD
775	   be clearly documented, and such documentation SHOULD include the
776	   disposition of any longer tags or ranges (for example, whether an
777	   error value is generated or the language tag or range is truncated).
778	   If truncation is permitted it MUST NOT permit a subtag to be divided,
779	   since this changes the semantics of the subtag being matched and can
780	   result in false positives or negatives.

782	   Applications or protocols that restrict storage SHOULD consider the
783	   impact of tag or range truncation on the resulting matches.  For
784	   example, removing the "*" from the end of an extended language range
785	   (see Section 2.3) can greatly modify the set of returned matches.  A
786	   protocol that allows tags or ranges to be truncated at an arbitrary
787	   limit, without giving any indication of what that limit is, has the
788	   potential for causing harm by changing the meaning of values in
789	   substantial ways.

791	   In practice, most tags do not require additional subtags or
792	   substantially more characters.  Additional subtags sometimes add
793	   useful distinguishing information, but extraneous subtags interfere
794	   with the meaning, understanding, and especially matching of language
795	   tags.  Since language tags or ranges MAY be truncated by an
796	   application or protocol that limits storage, when choosing language
797	   tags or ranges users and applications SHOULD avoid adding subtags
798	   that add no distinguishing value.  In particular, users and
799	   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
800	   fields in the registry (defined in Section 3.6 of [RFC3066bis]):
801	   these fields provide guidance on when specific additional subtags
802	   SHOULD (and SHOULD NOT) be used.

804	   Implementations MUST support a limit of at least 33 characters.  This
805	   limit includes at least one subtag of each non-extension, non-private
806	   use type.  When choosing a buffer limit, a length of at least 42
807	   characters is strongly RECOMMENDED.

809	   The practical limit on tags or ranges derived solely from registered
810	   values is 42 characters.  Implementations MUST be able to handle tags
811	   and ranges of this length.  Support for tags and ranges of at least
812	   62 characters in length is RECOMMENDED.  Implementations MAY support
813	   longer values, including matching extensive sets of private use or
814	   extension subtags.

816	   Applications or protocols which have to truncate a tag MUST do so by
817	   progressively removing subtags along with their preceding "-" from
818	   the right side of the language tag until the tag is short enough for
819	   the given buffer.  If the resulting tag ends with a single-character
820	   subtag, that subtag and its preceding "-" MUST also be removed.  For
821	   example:

823	   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
824	   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
825	   2. zh-Latn-CN-variant1-a-extend1
826	   3. zh-Latn-CN-variant1
827	   4. zh-Latn-CN
828	   5. zh-Latn
829	   6. zh

831	   Figure 9: Example of Tag Truncation
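
   The truncation rule can be sketched as follows.  The function name
   truncate_tag and the max_length parameter are hypothetical, and the
   sketch assumes a well-formed tag as input.  It removes a trailing
   single-character subtag after each step rather than only at the
   end, which produces the same final result.

```python
def truncate_tag(tag, max_length):
    """Shorten a tag to fit max_length without dividing a subtag."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_length:
        # Remove the rightmost subtag and its preceding hyphen.
        subtags.pop()
        # A single-character subtag must not be left at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return "-".join(subtags)


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
for limit in (42, 33, 20, 8):
    print(limit, truncate_tag(tag, limit))
```

   If even the first subtag does not fit, this sketch returns an empty
   string; an application would need to decide how to report that case.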

833	5.  IANA Considerations

835	   This document presents no new or existing considerations for IANA.

837	6.  Changes

839	   This is the first version of this document.

841	   The following changes were put into this document since draft-05:

843	      Modified the ABNF to match changes in [RFC3066bis] (K.Karlsson)

845	      Matched the references and reference formats to [RFC3066bis]
846	      (K.Karlsson)

848	      Various edits, additions, and emendations to deal with changes in
849	      the Last Call of draft-registry as well as cleaning up the text.

851	      Changed from 'defined' to 'identifies' in Section 4.1.  (M.Gunn)

853	      Reorganized the text and broke it into sections (M.Duerst)

855	      Modified occurrences of the word "application" to refer to
856	      "applications or protocols" or otherwise be specific (E. van der
857	      Poel)

859	      Removed "Extended Language Range Lookup", merging it with other
860	      text on lookup to form a single scheme.  (M.Davis)

862	      Fixed or removed obsolete or dangling references (Ed.)

864	      Added an introduction to section 4 and added one sentence to make
865	      it flow better to the start of section 4.1.  (Ed.)

867	7.  Security Considerations

869	   Language ranges used in content negotiation might be used to infer
870	   the nationality of the sender, and thus identify potential targets
871	   for surveillance.  In addition, unique or highly unusual language
872	   ranges or combinations of language ranges might be used to track
873	   specific individuals' activities.

875	   This is a special case of the general problem that anything you send
876	   is visible to the receiving party.  It is useful to be aware that
877	   such concerns can exist in some cases.

879	   The evaluation of the exact magnitude of the threat, and any possible
880	   countermeasures, is left to each application or protocol.

882	8.  Character Set Considerations

884	   The syntax of language tags and language ranges permits only the
885	   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D).  These characters
886	   are present in most character sets, so presentation of language tags
887	   should not present any character set issues.

889	9.  References

891	9.1.  Normative References

893	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
894	              Requirement Levels", BCP 14, RFC 2119, March 1997.

896	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
897	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
898	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

900	   [RFC3066bis]
901	              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
902	              Identification of Languages", October 2005, <http://
903	              www.ietf.org/internet-drafts/
904	              draft-ietf-ltru-registry-14.txt>.

906	   [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
907	              Specifications: ABNF", RFC 4234, October 2005.

909	9.2.  Informative References

911	   [RFC1766]  Alvestrand, H., "Tags for the Identification of
912	              Languages", RFC 1766, March 1995.

914	   [RFC3066]  Alvestrand, H., "Tags for the Identification of
915	              Languages", BCP 47, RFC 3066, January 2001.

917	   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
918	              May 2002.

920	Appendix A.  Acknowledgements

922	   Any list of contributors is bound to be incomplete; please regard the
923	   following as only a selection from the group of people who have
924	   contributed to make this document what it is today.

926	   The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
927	   which is a precursor to this document, made enormous contributions
928	   directly or indirectly to this document and are generally responsible
929	   for the success of language tags.

931	   The following people (in alphabetical order by family name)
932	   contributed to this document:

934	   Jeremy Carroll, John Cowan, Martin Duerst, Frank Ellermann, Doug
935	   Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M. Patton, Randy
936	   Presuhn, Eric van der Poel, and many, many others.

938	   Very special thanks must go to Harald Tveit Alvestrand, who
939	   originated RFCs 1766 and 3066, and without whom this document would
940	   not have been possible.

942	   For this particular document, John Cowan originated the scheme
943	   described in Section 3.2.3.  Mark Davis originated the scheme
944	   described in Section 3.3.

946	Authors' Addresses

948	   Addison Phillips (editor)
949	   Quest Software

951	   Email: addison dot phillips at quest dot com

953	   Mark Davis (editor)
954	   IBM

956	   Email: mark dot davis at ibm dot com

958	Intellectual Property Statement

960	   The IETF takes no position regarding the validity or scope of any
961	   Intellectual Property Rights or other rights that might be claimed to
962	   pertain to the implementation or use of the technology described in
963	   this document or the extent to which any license under such rights
964	   might or might not be available; nor does it represent that it has
965	   made any independent effort to identify any such rights.  Information
966	   on the procedures with respect to rights in RFC documents can be
967	   found in BCP 78 and BCP 79.

969	   Copies of IPR disclosures made to the IETF Secretariat and any
970	   assurances of licenses to be made available, or the result of an
971	   attempt made to obtain a general license or permission for the use of
972	   such proprietary rights by implementers or users of this
973	   specification can be obtained from the IETF on-line IPR repository at
974	   http://www.ietf.org/ipr.

976	   The IETF invites any interested party to bring to its attention any
977	   copyrights, patents or patent applications, or other proprietary
978	   rights that may cover technology that may be required to implement
979	   this standard.  Please address the information to the IETF at
980	   ietf-ipr@ietf.org.

982	Disclaimer of Validity

984	   This document and the information contained herein are provided on an
985	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
986	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
987	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
988	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
989	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
990	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

992	Copyright Statement

994	   Copyright (C) The Internet Society (2005).  This document is subject
995	   to the rights, licenses and restrictions contained in BCP 78, and
996	   except as set forth therein, the authors retain all their rights.

998	Acknowledgment

1000	   Funding for the RFC Editor function is currently provided by the
1001	   Internet Society.