idnits 2.17.1 

draft-ietf-ltru-matching-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1026.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1003.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1010.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1016.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 699 has weird spacing: '...becomes  en-US...'

  == Line 700 has weird spacing: '...becomes  en-La...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (November 18, 2005) is 6727 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
     RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Obsolete normative reference: RFC 4234 (Obsoleted by RFC 5234)

  -- Obsolete informational reference (is this intentional?): RFC 1766
     (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)


     Summary: 5 errors (**), 0 flaws (~~), 5 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                            Quest Software
4	Obsoletes: 3066 (if approved)                              M. Davis, Ed.
5	Expires: May 22, 2006                                                IBM
6	                                                       November 18, 2005

8	                       Matching of Language Tags
9	                      draft-ietf-ltru-matching-07

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on May 22, 2006.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2005).

40	Abstract

42	   This document describes different mechanisms for comparing, matching,
43	   and evaluating language tags.  Possible algorithms for language
44	   negotiation and content selection are described.  This document, in
45	   combination with RFC 3066bis (replace "3066bis" with the RFC number
46	   assigned to draft-ietf-ltru-registry-14), replaces RFC 3066, which
47	   replaced RFC 1766.

49	Table of Contents

51	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
52	   2.  The Language Range . . . . . . . . . . . . . . . . . . . . . .  4
53	     2.1.  Lists of Language Ranges . . . . . . . . . . . . . . . . .  4
54	     2.2.  Basic Language Range . . . . . . . . . . . . . . . . . . .  4
55	     2.3.  Extended Language Range  . . . . . . . . . . . . . . . . .  5
56	     2.4.  Choosing a Language Range  . . . . . . . . . . . . . . . .  6
57	   3.  Types of Matching  . . . . . . . . . . . . . . . . . . . . . .  9
58	     3.1.  Choosing a Type of Matching  . . . . . . . . . . . . . . .  9
59	     3.2.  Filtering  . . . . . . . . . . . . . . . . . . . . . . . . 10
60	       3.2.1.  Filtering with Basic Language Ranges . . . . . . . . . 11
61	       3.2.2.  Filtering with Extended Language Ranges  . . . . . . . 11
62	       3.2.3.  Scored Filtering . . . . . . . . . . . . . . . . . . . 12
63	     3.3.  Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . 14
64	   4.  Other Considerations . . . . . . . . . . . . . . . . . . . . . 18
65	     4.1.  Meaning of Language Tags and Ranges  . . . . . . . . . . . 18
66	     4.2.  Considerations for Private Use Subtags . . . . . . . . . . 19
67	     4.3.  Length Considerations in Matching  . . . . . . . . . . . . 19
68	   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 22
69	   6.  Changes  . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
70	   7.  Security Considerations  . . . . . . . . . . . . . . . . . . . 24
71	   8.  Character Set Considerations . . . . . . . . . . . . . . . . . 25
72	   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 26
73	     9.1.  Normative References . . . . . . . . . . . . . . . . . . . 26
74	     9.2.  Informative References . . . . . . . . . . . . . . . . . . 26
75	   Appendix A.  Acknowledgements  . . . . . . . . . . . . . . . . . . 27
76	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 28
77	   Intellectual Property and Copyright Statements . . . . . . . . . . 29

79	1.  Introduction

81	   Human beings on our planet have, past and present, used a number of
82	   languages.  There are many reasons why one would want to identify the
83	   language used when presenting or requesting information.

85	   Information about a user's language preferences commonly needs to be
86	   identified so that appropriate processing can be applied.  For
87	   example, the user's language preferences in a browser can be used to
88	   select web pages appropriately.  Language preferences can also be
89	   used to select among tools (such as dictionaries) to assist in the
90	   processing or understanding of content in different languages.

92	   Given a set of language identifiers, such as those defined in
93	   [RFC3066bis], various mechanisms can be envisioned for performing
94	   language negotiation and tag matching.  Applications, protocols, or
95	   specifications will have varying needs and requirements that affect
96	   the choice of a suitable mechanism.

98	   This document defines several mechanisms for matching, selecting, or
99	   filtering content whose natural language is identified using Language
100	   Tags [RFC3066bis], as well as the syntax (called a "language range")
101	   associated with each of these mechanisms for specifying the user's
102	   language preferences.

104	   This document, in combination with [RFC3066bis] (replace "3066bis"
105	   globally in this document with the RFC number assigned to
106	   draft-ietf-ltru-registry-14), replaces [RFC3066], which replaced
107	   [RFC1766].

109	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
110	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
111	   document are to be interpreted as described in [RFC2119].

113	2.  The Language Range

115	   Language Tags [RFC3066bis] are used to identify the language of some
116	   information item or content.  Applications or protocols that use
117	   language tags are often faced with the problem of identifying sets of
118	   content that share certain language attributes.  For example,
119	   HTTP/1.1 [RFC2616] describes language ranges in its discussion of the
120	   Accept-Language header (Section 14.4).  These are to be used when
121	   selecting content from servers based on the language of that content.

123	   When selecting content according to its language, it is useful to
124	   have a mechanism for identifying sets of language tags that share
125	   specific attributes.  This allows users to select or filter content
126	   based on specific requirements.  Such an identifier is called a
127	   "Language Range".

129	   Language tags and thus language ranges are to be treated as case
130	   insensitive: there exist conventions for the capitalization of some
131	   of the subtags, but these MUST NOT be taken to carry meaning.
132	   Matching of language tags to language ranges MUST be done in a case
133	   insensitive manner as well.

135	2.1.  Lists of Language Ranges

137	   When users specify a language preference they often need to specify a
138	   prioritized list of language ranges in order to best reflect their
139	   language preferences.  This is especially true for speakers of
140	   minority languages.  A speaker of Breton in France, for example, may
141	   specify "br" followed by "fr", meaning that if Breton is available,
142	   it is preferred, but otherwise French is the best alternative.  It
143	   can get more complex: a speaker may wish to fall back from Skolt Sami
144	   to Northern Sami to Finnish.

146	   A "Language Priority List" consists of a prioritized or weighted list
147	   of language ranges.  One well known example of such a list is the
148	   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
149	   14.4) and RFC 3282 [RFC3282].

151	   The various matching operations described in this document include
152	   considerations for using a language priority list.  When given as
153	   examples in this document, language priority lists will be shown as a
154	   quoted sequence of ranges separated by semi-colons, like this: "en;
155	   fr; zh-Hant" (which would be read as "English before French before
156	   Chinese as written in the Traditional script").

158	2.2.  Basic Language Range

160	   A "Basic Language Range" identifies the set of content whose language
161	   tags begin with the same sequence of subtags.  A basic language range
162	   is identified by its 'language range' tag, by adapting the
163	   ABNF [RFC4234] from HTTP/1.1 [RFC2616]:

165	   language-range = language-tag / "*"
166	   language-tag   = 1*8alphanum *("-" 1*8alphanum)
167	   alphanum       = ALPHA / DIGIT

169	   That is, a language-range has the same syntax as a language-tag or is
170	   the single character "*".  Basic Language Ranges imply that there is
171	   a semantic relationship between language tags that share the same
172	   prefix.  While this is often the case, it is not always true and
173	   users should note that the set of language tags that match a specific
174	   language-range may not be mutually intelligible.

176	   Basic language ranges were originally described in [RFC3066] and
177	   HTTP/1.1 [RFC2616] (where they are referred to as simply a "language
178	   range").
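
   As a rough illustration of the syntax above, the following Python
   sketch checks whether a string is shaped like a basic language
   range.  It reads the grammar as one to eight alphanumeric characters
   per subtag, with subtags separated by hyphens, or the single
   character "*".  The names BASIC_RANGE and is_basic_language_range
   are hypothetical and are not defined by this document.

```python
import re

# Assumed reading of the grammar: 1 to 8 alphanumerics per subtag,
# subtags joined by "-", or the lone wildcard "*".
BASIC_RANGE = re.compile(r"(?:\*|[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*)")


def is_basic_language_range(text):
    """Return True if text is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(text) is not None
```

   The check is purely syntactic: it does not consult any registry and
   says nothing about whether the subtags are valid.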

180	2.3.  Extended Language Range

182	   A Basic Language Range does not always provide the most appropriate
183	   way to specify a user's preferences.  Sometimes it is beneficial to
184	   use a more granular matching scheme that takes advantage of the
185	   internal structure of language tags, by allowing the user to specify,
186	   for example, the value of a specific field in a language tag or to
187	   indicate which values are of interest in filtering or selecting the
188	   content.

190	   In an extended language range, the identifier takes the form of a
191	   series of subtags which MUST consist of well-formed subtags or the
192	   special subtag "*".  For example, the language range "en-*-US"
193	   specifies a primary language of 'en', followed by any script subtag,
194	   followed by the region subtag 'US'.

196	   An extended language range can be represented by the following ABNF:
197	   extended-language-range  = range ; a range
198	                 / privateuse              ; private-use tag
199	                 / grandfathered           ; grandfathered registrations

201	   range         = (language
202	                    ["-" script]
203	                    ["-" region]
204	                    *("-" variant)
205	                    *("-" extension)
206	                    ["-" privateuse])

208	   language      = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
209	                 / 4ALPHA                 ; reserved for future use
210	                 / 5*8ALPHA               ; registered language subtag
211	                 / "*"                    ; ... or wildcard

213	   extlang       = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
214	                                          ; reserved for future use
215	                                          ; wildcard can only appear
216	                                          ;   at the end

218	   script        = 4ALPHA                 ; ISO 15924 code
219	                 / "*"                    ; or wildcard

221	   region        = 2ALPHA                 ; ISO 3166 code
222	                 / 3DIGIT                 ; UN M.49 code
223	                 / "*"                    ; ... or wildcard

225	   variant       = 5*8alphanum            ; registered variants
226	                 / (DIGIT 3alphanum)      ;
227	                 / "*"                    ; ... or wildcard

229	   extension     = singleton *("-" (2*8alphanum)) [ "-*" ]
230	                                          ; extension subtags
231	                                          ; wildcard can only appear
232	                                          ;   at the end

234	   singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
235	                 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
236	                 ; Single letters: x/X is reserved for private use

238	   privateuse    = ("x"/"X") 1*("-" (1*8alphanum))

240	   grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
241	                   ; grandfathered registration
242	                   ; Note: I is the only singleton
243	                   ; that starts a grandfathered tag

245	   alphanum      = (ALPHA / DIGIT)       ; letters and numbers

247	   A field not present in the middle of an extended language range MAY
248	   be treated as if the field contained a "*".  For example, the range
249	   "en-US" MAY be considered to be equivalent to the range "en-*-US".
250	   This also means that multiple wildcards can be collapsed (so that
251	   "en-*-*-US" is equivalent to "en-*-US").
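
   The collapsing of adjacent wildcards described above could be
   sketched as follows.  The function name collapse_wildcards is
   hypothetical, and the sketch only handles runs of "*" subtags; it
   does not insert wildcards for missing fields.

```python
def collapse_wildcards(language_range):
    """Collapse runs of adjacent "*" subtags into a single "*"."""
    collapsed = []
    for subtag in language_range.split("-"):
        # Skip a wildcard that directly follows another wildcard.
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue
        collapsed.append(subtag)
    return "-".join(collapsed)
```

   With this sketch, collapse_wildcards("en-*-*-US") yields "en-*-US",
   and a range without repeated wildcards is returned unchanged.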

253	2.4.  Choosing a Language Range

255	   Users indicate their language preferences via the choice of a
256	   language range or the set of language ranges in the language priority
257	   list.  The type of matching will affect what the best choice is for a
258	   given user.  In addition, users should be aware that, when working
259	   with language ranges, most matching schemes make no attempt to
260	   process the semantic meaning of the subtags.  The language tag and
261	   language range (or their subtags) are usually compared in a case
262	   insensitive manner using basic string processing.  Thus the choice of
263	   subtags in both the language tag and language range may affect the
264	   results produced.

266	   Users SHOULD avoid subtags that add no distinguishing value to a
267	   language range.  For example, script subtags SHOULD NOT be used to
268	   form a language range with language subtags which have a matching
269	   Suppress-Script field in their registry record.  Thus the language
270	   range "en-Latn" is probably inappropriate in most cases (because the
271	   vast majority of English documents are written in the Latin script
272	   and thus the 'en' language subtag has a Suppress-Script field for
273	   'Latn' in the registry).

275	   When working with tags and ranges, note that extensions and most
276	   private-use subtags are orthogonal to language tag fallback and users
277	   SHOULD avoid using these subtags in language ranges, since they will
278	   often interfere with the selection of available language content.
279	   Since these subtags are always at the end of the sequence of subtags,
280	   they don't normally interfere with the use of prefixes for the
281	   filtering schemes described below in Section 3.

283	   When working with tags and ranges users SHOULD note the following:

285	   1.  Private-use and Extension subtags are normally orthogonal to
286	       language tag fallback.  Implementations or specifications that
287	       use a lookup (Section 3.3) matching scheme SHOULD ignore
288	       unrecognized private-use and extension subtags when performing
289	       language tag fallback.  Since these subtags are always at the end
290	       of the sequence of subtags, they don't normally interfere with
291	       the use of prefixes for matching in the schemes described below.

293	   2.  Applications, specifications, or protocols that choose not to
294	       interpret one or more private-use or extension subtags SHOULD NOT
295	       remove or modify these extensions in content that they are
296	       processing.  When a language tag instance is to be used in a
297	       specific, known protocol, and is not being passed through to
298	       other protocols, language tags MAY be filtered to remove subtags
299	       and extensions that are not supported by that protocol.  Such
300	       filtering SHOULD be avoided, if possible, since it removes
301	       information that might be relevant if services on the other end
302	       of the protocol would make use of that information.

304	   3.  Some applications of language tags might want or need to consider
305	       extensions and private-use subtags when matching tags.  If
306	       extensions and private-use subtags are included in a matching or
307	       filtering process that utilizes one of the schemes described
308	       in this document, then the implementation SHOULD canonicalize the
309	       language tags and/or ranges before performing the matching.  Note
310	       that language tag processors that claim to be "well-formed"
311	       processors as defined in [RFC3066bis] generally fall into this
312	       category.

314	3.  Types of Matching

316	   Matching language ranges to language tags can be done in a number of
317	   different ways.  This section describes the different types of
318	   matching scheme, as well as the considerations for choosing between
319	   them.  Protocols and specifications SHOULD clearly indicate the
320	   particular mechanism used in selecting or matching language tags.

322	   There are two basic types of matching scheme: those that produce an
323	   open-ended set of content (called "filtering") and those that produce
324	   a single information item for a given request (called "lookup").

326	   A key difference between these two types of matching scheme is that
327	   the language range for filtering operations is always the _least_
328	   specific tag one will accept as a match, while for lookup operations
329	   the language range is always the _most_ specific tag.

331	3.1.  Choosing a Type of Matching

333	   Applications, protocols, and specifications are faced with the
334	   decision of what type of matching to use.  Sometimes, different
335	   styles of matching might be suited for different kinds of processing
336	   within a particular application or protocol.

338	   Filtering can be used to produce a set of results (such as a
339	   collection of documents).  For example, if using a search engine, one
340	   might use filtering to limit the results to documents written in
341	   French.  It can also be used when deciding whether to perform some
342	   processing that is language sensitive on some content.  For example,
343	   a process might cause paragraphs whose language tag matched the
344	   language range "nl" to be displayed in italics within a document.

346	   This document describes three types of filtering:

348	   1.  Basic Filtering (Section 3.2.1) is used to match content using
349	       basic language ranges (Section 2.2).  It is compatible with
350	       implementations that do not produce extended language ranges.

352	   2.  Extended Range Filtering (Section 3.2.2) is used to match content
353	       using extended language ranges (Section 2.3).  Newer
354	       implementations SHOULD use this form of filtering in preference
355	       to basic filtering.

357	   3.  Scored Filtering (Section 3.2.3) produces an ordered set of
358	       content using either basic or extended language ranges.  It
359	       SHOULD be used when the quality of the match within a specific
360	       language range is important, as when presenting a list of
361	       documents resulting from a search.

363	   Lookup (Section 3.3) is used when each request MUST produce exactly
364	   one piece of content.  For example, a Web server might use the
365	   Accept-Language HTTP header to choose which language to return a
366	   custom 404 page in: since it can return only one page, it must choose
367	   a single item and it must return some item, even if no content
368	   matches the language ranges supplied by the user.

370	   Most types of matching in this document are designed so that
371	   implementations do not have to examine the values of the subtags
372	   supplied and, except for scored filtering, they do not need access to
373	   the Language Subtag Registry nor do they require the use of valid
374	   subtags in either language tags or language ranges.  This has great
375	   benefit for speed and simplicity of implementation.

377	   Implementations might also wish to use semantic information external
378	   to the language tags when performing fallback.  For example, the
379	   primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
380	   Norwegian) might both be usefully matched to the more general subtag
381	   'no' (Norwegian).  Or an implementation might infer that content
382	   labeled "zh-CN" is more likely to match the range "zh-Hans" than
383	   equivalent content labeled "zh-TW".

385	3.2.  Filtering

387	   Filtering is used to select the set of content that matches a given
388	   prefix.  It is called "filtering" because this set of content may
389	   contain no items at all or it may return an arbitrary number of
390	   matching items--as many as match the language range used to specify
391	   the items, thus filtering out the non-matching content.

393	   In filtering, the language range represents the _least_ specific tag
394	   which is an acceptable match.  That is, all of the language tags in
395	   the set of filtered content will have an equal or greater number of
396	   subtags than the language range.  For example, if the language range
397	   is "de-CH", one might see matching content with the tag "de-CH-1996"
398	   but one will never see a match with the tag "de".

400	   If the language priority list (see Section 2.1) contains more than
401	   one range, the content returned is typically ordered in descending
402	   level of preference.

404	   Some examples where filtering might be appropriate include:

406	   o  Applying a style to sections of a document in a particular
407	      language range.

409	   o  Displaying the set of documents containing a particular set of
410	      keywords written in a specific language.

412	   o  Selecting all email items written in a specific range of languages.

414	   Filtering can produce either an ordered or an unordered set of
415	   results.  For example, applying formatting to a document based on the
416	   language of specific pieces of content does not require the content
417	   to be ordered.  It is sufficient to know whether a specific piece of
418	   content matches or does not match.  A search application, on the
419	   other hand, probably would put the results into a priority order.

421	   If an ordered set is desired, as described above, then the
422	   application or protocol needs to determine the relative "quality" of
423	   the match between different language tags and the language range.

425	   This measurement is called a "distance metric".  A distance metric
426	   assigns a numeric value to the comparison of each language tag to a
427	   language range and represents the 'distance' between the two.  A
428	   distance of zero means that they are identical, a small distance
429	   indicates that they are very similar, and a large distance indicates
430	   that they are very different.  Using a distance metric,
431	   implementations can, for example, allow users to select a threshold
432	   distance for a match to be "successful" while filtering or can use
433	   the numeric value to order the results.

435	3.2.1.  Filtering with Basic Language Ranges

437	   When filtering using a basic language range, the language range
438	   matches a language tag if it exactly equals the tag, or if it exactly
439	   equals a prefix of the tag such that the first character following
440	   the prefix is "-".  (That is, the language-range "de-de" matches the
441	   language tag "de-DE-1996", but not the language tag "de-Deva".)

443	   The special range "*" matches any tag.  A protocol which uses
444	   language ranges MAY specify additional rules about the semantics of
445	   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
446	   languages not matched by any other range within an "Accept-Language"
447	   header.
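
   A minimal Python sketch of basic filtering as described in this
   section follows.  The function names basic_match and basic_filter
   are hypothetical.  The sketch implements only the generic rule (exact
   match, prefix followed by "-", or the range "*") and none of the
   protocol-specific semantics for "*" mentioned above.

```python
def basic_match(language_range, tag):
    """Return True if a basic language range matches a language tag."""
    wanted = language_range.lower()   # matching is case insensitive
    actual = tag.lower()
    if wanted == "*":
        return True
    # Exact match, or a prefix that ends right before a "-".
    return actual == wanted or actual.startswith(wanted + "-")


def basic_filter(language_range, tags):
    """Return the tags, in their original order, that match the range."""
    return [tag for tag in tags if basic_match(language_range, tag)]
```

   For example, basic_match("de-de", "de-DE-1996") is true while
   basic_match("de-de", "de-Deva") is false, because the character
   following the prefix "de-de" in "de-Deva" is not "-".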

449	3.2.2.  Filtering with Extended Language Ranges

451	   In the Extended Range Matching scheme, each extended language range
452	   in the language priority list is considered in turn, according to
453	   priority.  The subtags in each extended language range are compared
454	   to the corresponding subtags in the language tag being examined.  The
455	   subtag from the range is considered to match if it exactly matches
456	   the corresponding subtag in the tag or the range's subtag has the
457	   value "*" (which matches all subtags, including the empty subtag).
458	   Extended Range Matching is an extension of basic matching
459	   (Section 3.2.1): the language range represents the least specific tag
460	   which is an acceptable match.

462	   Private-use subtags MAY be specified in the language range and MUST
463	   NOT be ignored when matching.

465	   Subtags not specified, including those at the end of the language
466	   range, are assigned the value "*".  This makes each range into a
467	   prefix much like that used in basic language range matching.  For
468	   example, the extended language range "de-*-DE" matches all of the
469	   following tags because the unspecified variant field is expanded to
470	   "*":

472	      de-DE

474	      de-Latn-DE

476	      de-Latf-DE

478	      de-DE-x-goethe

480	      de-Latn-DE-1996
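
   One possible way to approximate extended range filtering without
   knowing which field each subtag belongs to is sketched below in
   Python.  This is an assumption-laden simplification, not the
   procedure defined by this section: it walks the subtags in sequence,
   lets "*" in the range stand for any number of subtags, and stops at
   single-character subtags (extension or private-use singletons) in
   the tag.  The function name extended_match is hypothetical.

```python
def extended_match(language_range, tag):
    """Sequence-based approximation of extended language range matching."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must agree unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    position = 1
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard places no constraint of its own
        while True:
            if position >= len(tag_subtags):
                return False  # tag ran out before the subtag was found
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break
            if len(current) == 1:
                return False  # do not search past a singleton
    return True
```

   Under these assumptions the range "de-*-DE" matches each of the five
   tags listed above, and does not match "de", "de-Deva", or
   "de-x-DE".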

482	3.2.3.  Scored Filtering

484	   Both basic and extended language range filtering produce simple
485	   boolean matches.  Sometimes it may be beneficial to provide an array
486	   of results with different levels of matching, for example, sorting
487	   results based on the overall "quality" of the match.  Scored (or
488	   "distance metric") filtering provides a way to generate these quality
489	   values.

491	   First both the extended language range and the language tags to be
492	   matched to it must be canonicalized by mapping grandfathered and
493	   obsolete tags into modern equivalents.

495	   The language range and the language tags are then transformed into
496	   quintuples of elements of the form (language, script, region,
497	   variant, extension).  Any extended language subtags are considered
498	   part of the language element; private-use subtag sequences are
499	   considered part of the language element if in the initial position in
500	   the tag and part of the variant element if not.  Language subtags
501	   'und', 'mul', and the script subtag 'Zyyy' are converted to "*".

503	   Missing components in the language-tag are set to "*"; thus a "*"
504	   pattern becomes the quintuple ("*", "*", "*", "*", "*").  Missing
505	   components in the extended language-range are handled similarly to
506	   extended range lookup: missing internal subtags are expanded to "*".
507	   Missing end subtags are expanded as the empty string.  Thus a pattern
508	   "en-US" becomes the quintuple ("en","*","US","","").

510	   Here are some examples of language tags, showing their quintuples as
511	   both language tags and language ranges:

513	   en-US
514	      Tag:   (en, *, US, *, *)
515	      Range: (en, *, US, "", "")

517	   sr-Latn
518	      Tag:   (sr, Latn, *, *, *)
519	      Range: (sr, Latn, "", "", "")

521	   zh-cmn-Hant
522	      Tag:   (zh-cmn, Hant, *, *, *)
523	      Range: (zh-cmn, Hant, "", "", "")

525	   x-foo
526	      Tag:   (x-foo, *, *, *, *)
527	      Range: (x-foo, "", "", "", "")

529	   en-x-foo
530	      Tag:   (en, *, *, x-foo, *)
531	      Range: (en, *, *, x-foo, "")

533	   i-default
534	      Tag:   (i-default, *, *, *, *)
535	      Range: (i-default, "", "", "", "")

537	   sl-Latn-IT-rozaj
538	      Tag:   (sl, Latn, IT, rozaj, *)
539	      Range: (sl, Latn, IT, rozaj, "")

541	   zh-r-wadegile (hypothetical)
542	      Tag:   (zh, *, *, *, r-wadegile)
543	      Range: (zh, *, *, *, r-wadegile)

545	   Figure 3: Examples of Distance Metric Quintuples
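
   The transformation into quintuples can be sketched in Python as
   shown below.  This is a simplified illustration under several
   assumptions: the input has already been canonicalized, fields are
   recognized only by the length and character class of each subtag,
   ranges containing explicit "*" subtags are not handled, and output
   is lowercased because matching is case insensitive.  The function
   name to_quintuple and its as_range parameter are hypothetical.

```python
import re


def to_quintuple(tag, as_range=False):
    """Return (language, script, region, variant, extension) for a tag."""
    subtags = tag.lower().split("-")
    count = len(subtags)
    fields = [None] * 5
    if subtags[0] in ("x", "i"):
        # Private-use or "i-" tags in initial position: all language.
        fields[0] = "-".join(subtags)
    else:
        i = 1
        while i < count and re.fullmatch(r"[a-z]{3}", subtags[i]):
            i += 1                       # extended language subtags
        fields[0] = "-".join(subtags[:i])
        if i < count and re.fullmatch(r"[a-z]{4}", subtags[i]):
            fields[1] = subtags[i]       # script
            i += 1
        if i < count and re.fullmatch(r"[a-z]{2}|[0-9]{3}", subtags[i]):
            fields[2] = subtags[i]       # region
            i += 1
        variants = []
        while i < count and re.fullmatch(
                r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", subtags[i]):
            variants.append(subtags[i])
            i += 1
        extensions = []
        while i < count and len(subtags[i]) == 1 and subtags[i] != "x":
            start = i
            i += 1
            while i < count and len(subtags[i]) > 1:
                i += 1
            extensions.append("-".join(subtags[start:i]))
        if i < count and subtags[i] == "x":
            # Trailing private-use sequence joins the variant element.
            variants.append("-".join(subtags[i:]))
        fields[3] = "-".join(variants) or None
        fields[4] = "-".join(extensions) or None

    if fields[0] in ("und", "mul"):
        fields[0] = "*"
    if fields[1] == "zyyy":
        fields[1] = "*"

    # Index of the last element that was actually present.
    last = max(n for n, value in enumerate(fields) if value is not None)
    result = []
    for n, value in enumerate(fields):
        if value is not None:
            result.append(value)
        elif as_range and n > last:
            result.append("")            # missing end element of a range
        else:
            result.append("*")           # missing element otherwise
    return tuple(result)
```

   For instance, to_quintuple("en-US") gives ("en", "*", "us", "*",
   "*"), while to_quintuple("en-US", as_range=True) gives ("en", "*",
   "us", "", "").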

547	   Each language-range/language-tag pair being compared is assigned a
548	   distance value, whereby small values indicate better matches and
549	   large values indicate worse ones.  The distance between the pair is
550	   the sum of the distances for each of the corresponding elements of
551	   the quintuple.  If the elements are identical or one is '*', then the
552	   distance value between them is zero.  Otherwise, it is given by the
553	   following table:

555	     256    language mismatch
556	     128    script mismatch
557	      32    region mismatch
558	       4    variant mismatch
559	       1    extension mismatch

561	   A value of 0 is a perfect match; 421 is no match at all.  Different
562	   threshold values might be appropriate for different applications or
563	   protocols.  Implementations will usually allow users to choose the
564	   most appropriate selection value, ranking the matched items based on
565	   score.

567	   Examples of various tags' distances from the range "en-US":

569	   "fr-FR"          288 (language & region mismatch)
570	   "fr"             256 (language mismatch, region match)
571	   "en-GB"           32 (region mismatch)
572	   "en-Latn-US"       0 (all fields match)
573	   "en-Brai"         32 (region mismatch)
574	   "en-US-x-foo"      4 (variant mismatch: range is the empty string)
575	   "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
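
   The summation described above can be sketched in a few lines of
   Python.  This is only an illustration, not part of the scheme's
   definition: the names WEIGHTS and distance are hypothetical, and the
   sketch assumes that both quintuples have already been built and
   case-normalized as described earlier in this section.

```python
# Illustrative sketch; WEIGHTS and distance() are hypothetical names.
# Weights follow the table above, in quintuple order:
# language, script, region, variant, extension.
WEIGHTS = (256, 128, 32, 4, 1)

def distance(range_q, tag_q):
    """Sum the per-element distances between two quintuples."""
    total = 0
    for weight, r, t in zip(WEIGHTS, range_q, tag_q):
        # Identical elements, or a "*" on either side, contribute zero.
        if r == t or r == "*" or t == "*":
            continue
        total += weight
    return total

en_us_range = ("en", "*", "US", "", "")
print(distance(en_us_range, ("en", "*", "GB", "*", "*")))      # 32
print(distance(en_us_range, ("en", "Latn", "US", "*", "*")))   # 0
print(distance(en_us_range, ("en", "*", "US", "x-foo", "*")))  # 4
```

   Because each weight exceeds the sum of all lower weights, a mismatch
   in a more significant element always outranks any combination of
   mismatches in less significant ones.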

577	   Implementations or protocols sometimes might wish to use more
578	   sophisticated weights that depend on the values of the corresponding
579	   elements.  For example, depending on the domain, an implementation
580	   might give a small distance to the difference between closely
581	   related subtags.  Some examples of closely related subtags might be:

583	   Language:
584	     no (Norwegian)
585	     nb (Bokmal Norwegian)
586	     nn (Nynorsk Norwegian)

588	   Script:
589	     Kata (katakana)
590	     Hira (hiragana)

592	   Region:
593	     US (United States of America)
594	     UM (United States Minor Outlying Islands)

596	   Figure 6: Examples of Closely Related Subtags

598	3.3.  Lookup

600	   Lookup is used to select the single information item that best
601	   matches the language priority list for a given request.  In lookup,
602	   each language-range in the language priority list represents the
603	   _most_ specific tag which is an acceptable match; only the closest
604	   matching item according to the user's priority is returned.  For
605	   example, if the language range is "de-CH", one might expect to
606	   receive an information item with the tag "de" but never one with the
607	   tag "de-CH-1996".  Usually if no content matches the request, a
608	   "default" item is returned.

610	   For example, if an application inserts some dynamic content into a
611	   document, returning an empty string if there is no exact match is not
612	   an option.  Instead, the application "falls back" until it finds a
613	   suitable piece of content to insert.  Other examples of lookup might
614	   include:

616	   o  Selection of a template containing the text for an automated email
617	      response.

619	   o  Selection of a graphic containing text for inclusion in a
620	      particular Web page.

622	   o  Selection of a string of text for inclusion in an error log.

624	   In the Lookup scheme, the language-range is progressively truncated
625	   from the end until a matching piece of content is located.  For
626	   example, starting with the range "zh-Hant-CN-x-private", the lookup
627	   would progressively search for content as shown below:

629	   Range to match: zh-Hant-CN-x-private
630	   1. zh-Hant-CN-x-private
631	   2. zh-Hant-CN
632	   3. zh-Hant
633	   4. zh
634	   5. (default content or the empty tag)

636	   Figure 7: Example of a Lookup Fallback Pattern
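
   The fallback pattern in the figure above could be generated along
   the following lines.  This is a minimal sketch: fallback_sequence is
   a hypothetical name, and the handling of a trailing single-character
   subtag (such as "x") is an assumption inferred from the figure, which
   goes directly from "zh-Hant-CN-x-private" to "zh-Hant-CN".

```python
# Illustrative sketch; fallback_sequence() is a hypothetical name.
def fallback_sequence(language_range):
    """Return the ranges tried during lookup, most specific first."""
    subtags = language_range.split("-")
    sequence = []
    while subtags:
        sequence.append("-".join(subtags))
        subtags.pop()  # drop the rightmost subtag
        # Assumption: a trailing single-character subtag only introduces
        # the subtags that followed it, so it is dropped as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return sequence

for candidate in fallback_sequence("zh-Hant-CN-x-private"):
    print(candidate)
```

   The default content (step 5 in the figure) is deliberately left out
   of the returned list, since how it is located depends on the
   application.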

638	   This scheme allows some flexibility in finding content.  When data is
639	   not available at a specific level of tag granularity, or is sparsely
640	   populated, it also typically provides better results than falling
641	   back directly to the default language for the system or content.

643	   The language range "*" matches any language tag.  In the lookup
644	   scheme, this language range does not convey enough information to
645	   determine which content is most appropriate.  If this language range
646	   is the only one in the language priority list, it matches the default
647	   content.  If this language range is followed by other language
648	   ranges, it should be skipped.

650	   When performing lookup using a language priority list, the
651	   progressive search MUST proceed to consider each language range
652	   before finding the default content or empty tag.  The default content
653	   might be content with no language tag (or with an empty value, as
654	   with xml:lang in the XML specification), or it might be a particular
655	   language designated for that bit of content.

657	   One common way to provide for default content is to allow a specific
658	   language range to be set as the default for a specific type of
659	   request.  This language range is then treated as if it were appended
660	   to the end of the language priority list, rather than after each item
661	   in the language priority list.

663	   For example, if a particular user's language priority list were
664	   "fr-FR; zh-Hant" and the program doing the matching had a default
665	   language range of "ja-JP", the program would search for content as
666	   follows:
667	   1. fr-FR
668	   2. fr
669	   3. zh-Hant // next language
670	   4. zh
671	   5. (return default content)
672	      a. ja-JP
673	      b. ja
674	      c. (empty tag or other default content)

676	   Figure 8: Lookup Using a Language Priority List
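
   One possible way to combine the priority list, the default language
   range, and the default content is sketched below.  The names lookup,
   fallback_sequence, and available are hypothetical.  The sketch
   assumes that content is held in a dictionary keyed by lowercase
   language tag, with the empty string as the key for untagged default
   content, and that a "*" range is simply skipped.

```python
# Illustrative sketch; all names here are hypothetical.
def fallback_sequence(language_range):
    """Progressively truncate a range from the right."""
    subtags = language_range.split("-")
    sequence = []
    while subtags:
        sequence.append("-".join(subtags))
        subtags.pop()
        # Assumption: never leave a single-character subtag at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return sequence

def lookup(priority_list, available, default_range=None):
    """Return the single best item, or the default content."""
    # "*" carries no preference, so it is skipped here; if nothing else
    # matches, the default content is returned anyway.
    ranges = [r for r in priority_list if r != "*"]
    if default_range is not None:
        # The default range is tried once, after every listed range.
        ranges.append(default_range)
    for language_range in ranges:
        for candidate in fallback_sequence(language_range):
            item = available.get(candidate.lower())
            if item is not None:
                return item
    return available.get("")  # empty tag: untagged default content

content = {"zh": "Chinese text", "ja": "Japanese text", "": "Default text"}
print(lookup(["fr-FR", "zh-Hant"], content, default_range="ja-JP"))
```

   With the content shown, the search tries "fr-FR", "fr", and
   "zh-Hant" without success and then returns the item tagged "zh".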

678	   In some cases, the language priority list might contain one or more
679	   extended language ranges (as, for example, when the same language
680	   priority list is used as input for both lookup and filtering
681	   operations).  Wildcard values in an extended language range are
682	   supposed to match any value that occurs in that position in a
683	   language tag.  Since only one item can be returned for any given
684	   lookup request, the wildcards must be processed in a predictable
685	   manner (or the same request might produce widely varying results).
686	   Thus, for each range in the language priority list, the following
687	   rules must be applied to produce a basic language range for use in
688	   the fallback mechanism:

690	   1.  If the first subtag in the extended language range is a "*" then
691	       the entire range is converted to "*".

693	   2.  For each subsequent subtag, if the value is a "*" then that
694	       subtag and its preceding hyphen are removed.

696	   For example:

698	   *-US      becomes  *
699	   en-*-US   becomes  en-US
700	   en-Latn-* becomes  en-Latn

702	   Figure 9: Transformation of Extended Language Ranges
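
   The two rules above are small enough to express directly.  The
   following sketch uses the hypothetical name to_basic_range and
   assumes that the input is a syntactically valid extended language
   range.

```python
# Illustrative sketch; to_basic_range() is a hypothetical name.
def to_basic_range(extended_range):
    """Convert an extended language range to a basic one for lookup."""
    subtags = extended_range.split("-")
    # Rule 1: a leading "*" turns the whole range into "*".
    if subtags[0] == "*":
        return "*"
    # Rule 2: any later "*" is removed along with its preceding hyphen.
    return "-".join(s for s in subtags if s != "*")

for example in ("*-US", "en-*-US", "en-Latn-*"):
    print(example, "becomes", to_basic_range(example))
```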

704	   For the language priority list "*-US; fr-*-FR; zh-Hant", the fallback
705	   pattern would be:
706	   1. * (skipped)
707	   2. fr-FR
708	   3. fr
709	   4. zh-Hant
710	   5. zh
711	   6. (default content)

713	   Figure 10: Extended Language Range Fallback Example

715	4.  Other Considerations

717	   When working with language ranges and matching schemes, there are
718	   some additional points that may influence the choice of either.

720	4.1.  Meaning of Language Tags and Ranges

722	   Selecting content using language ranges requires some understanding
723	   by users of what they are selecting.  A language tag or range
724	   identifies a language as spoken (or written, signed or otherwise
725	   signaled) by human beings for communication of information to other
726	   human beings.

728	   If a language tag B contains language tag A as a prefix, then B is
729	   typically "narrower" or "more specific" than A. For example, "zh-
730	   Hant-TW" is more specific than "zh-Hant".

732	   This relationship is not guaranteed in all cases: specifically,
733	   languages that begin with the same sequence of subtags are NOT
734	   guaranteed to be mutually intelligible, although they might be.

736	   For example, the tag "az" shares a prefix with both "az-Latn"
737	   (Azerbaijani written using the Latin script) and "az-Arab"
738	   (Azerbaijani written using the Arabic script).  A person fluent in
739	   one script might not be able to read the other, even though the text
740	   might be otherwise identical.  Content tagged as "az" most probably
741	   is written in just one script and thus might not be intelligible to a
742	   reader familiar with the other script.

744	   Variant subtags in particular seem to represent specific divisions in
745	   mutual understanding, since they often encode dialects or other
746	   idiosyncratic variations within a language.

748	   The relationship between the language tag and the information it
749	   relates to is defined by the standard describing the context in which
750	   it appears.  Accordingly, this section can only give possible
751	   examples of its usage:

753	   o  For a single information object, the associated language tags
754	      might be interpreted as the set of languages that are necessary
755	      for a complete comprehension of the complete object.  Example:
756	      Plain text documents.

758	   o  For an aggregation of information objects, the associated language
759	      tags could be taken as the set of languages used inside components
760	      of that aggregation.  Examples: Document stores and libraries.

762	   o  For information objects whose purpose is to provide alternatives,
763	      the associated language tags could be regarded as a hint that the
764	      content is provided in several languages, and that one has to
765	      inspect each of the alternatives in order to find its language or
766	      languages.  In this case, the presence of multiple tags might not
767	      mean that one needs to be multi-lingual to get complete
768	      understanding of the document.  Example: MIME multipart/
769	      alternative.

771	   o  In markup languages, such as HTML and XML, language information
772	      can be added to each part of the document identified by the markup
773	      structure (including the whole document itself).  For example, one
774	      could write <span lang="FR">C'est la vie.</span> inside a
775	      Norwegian document; the Norwegian-speaking user could then access
776	      a French-Norwegian dictionary to find out what the marked section
777	      meant.  If the user were listening to that document through a
778	      speech synthesis interface, this information could be used to signal
779	      the synthesizer to appropriately apply French text-to-speech
780	      pronunciation rules to that span of text, instead of misapplying
781	      the Norwegian rules.

783	4.2.  Considerations for Private Use Subtags

785	   Private-use subtags require private agreement between the parties
786	   that intend to use or exchange language tags that use them and great
787	   caution SHOULD be used in employing them in content or protocols
788	   intended for general use.  Private-use subtags are simply useless for
789	   information exchange without prior arrangement.

791	   The value and semantic meaning of private-use tags and of the subtags
792	   used within such a language tag are not defined.  Matching private-
793	   use tags using language ranges or extended language ranges can result
794	   in unpredictable content being returned.

796	4.3.  Length Considerations in Matching

798	   RFC 3066 [RFC3066] did not provide an upper limit on the size of
799	   language tags or ranges.  RFC 3066 did define the semantics of
800	   particular subtags in such a way that most language tags or ranges
801	   consisted of language and region subtags with a combined total length
802	   of up to six characters.  Larger tags and ranges (in terms of both
803	   subtags and characters) did exist, however.

805	   [RFC3066bis] also does not impose a fixed upper limit on the number
806	   of subtags in a language tag or range (and thus an upper bound on the
807	   size of either).  The syntax in that document suggests that,
808	   depending on the specific language or range of languages, more
809	   subtags (and thus characters) are sometimes necessary as a result.

811	   Length considerations and their impact on the selection and
812	   processing of tags are described in Section 2.1.1 of that document.

814	   An application or protocol MAY choose to limit the length of the
815	   language tags or ranges used in matching.  Any such limitation SHOULD
816	   be clearly documented, and such documentation SHOULD include the
817	   disposition of any longer tags or ranges (for example, whether an
818	   error value is generated or the language tag or range is truncated).
819	   If truncation is permitted it MUST NOT permit a subtag to be divided,
820	   since this changes the semantics of the subtag being matched and can
821	   result in false positives or negatives.

823	   Applications or protocols that restrict storage SHOULD consider the
824	   impact of tag or range truncation on the resulting matches.  For
825	   example, removing the "*" from the end of an extended language range
826	   (see Section 2.3) can greatly modify the set of returned matches.  A
827	   protocol that allows tags or ranges to be truncated at an arbitrary
828	   limit, without giving any indication of what that limit is, has the
829	   potential for causing harm by changing the meaning of values in
830	   substantial ways.

832	   In practice, most tags do not require additional subtags or
833	   substantially more characters.  Additional subtags sometimes add
834	   useful distinguishing information, but extraneous subtags interfere
835	   with the meaning, understanding, and especially matching of language
836	   tags.  Since language tags or ranges MAY be truncated by an
837	   application or protocol that limits storage, when choosing language
838	   tags or ranges users and applications SHOULD avoid adding subtags
839	   that add no distinguishing value.  In particular, users and
840	   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
841	   fields in the registry (defined in Section 3.6 of [RFC3066bis]):
842	   these fields provide guidance on when specific additional subtags
843	   SHOULD (and SHOULD NOT) be used.

845	   Implementations MUST support a limit of at least 33 characters.  This
846	   limit includes at least one subtag of each non-extension, non-private
847	   use type.  When choosing a buffer limit, a length of at least 42
848	   characters is strongly RECOMMENDED.

850	   The practical limit on tags or ranges derived solely from registered
851	   values is 42 characters.  Implementations MUST be able to handle tags
852	   and ranges of this length.  Support for tags and ranges of at least
853	   62 characters in length is RECOMMENDED.  Implementations MAY support
854	   longer values, including matching extensive sets of private-use or
855	   extension subtags.

857	   Applications or protocols which have to truncate a tag MUST do so by
858	   progressively removing subtags along with their preceding "-" from
859	   the right side of the language tag until the tag is short enough for
860	   the given buffer.  If the resulting tag ends with a single-character
861	   subtag, that subtag and its preceding "-" MUST also be removed.  For
862	   example:

864	   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
865	   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
866	   2. zh-Latn-CN-variant1-a-extend1
867	   3. zh-Latn-CN-variant1
868	   4. zh-Latn-CN
869	   5. zh-Latn
870	   6. zh

872	   Figure 11: Example of Tag Truncation
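
   A truncation routine following this rule might look like the sketch
   below.  The name truncate_tag and the max_length parameter are
   hypothetical.  The sketch returns an empty string if not even the
   first subtag fits, which is a choice made here and is not specified
   by this document.

```python
# Illustrative sketch; truncate_tag() and max_length are hypothetical.
def truncate_tag(tag, max_length):
    """Shorten a tag to at most max_length characters, subtag by subtag."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_length:
        subtags.pop()  # remove the rightmost subtag and its hyphen
        # A trailing single-character subtag must also be removed.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # Subtags are never split, so the result may be shorter than
    # max_length; it is empty if even the first subtag does not fit.
    return "-".join(subtags)

tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(tag, 42))
print(truncate_tag(tag, 20))
```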

874	5.  IANA Considerations

876	   This document presents no new or existing considerations for IANA.

878	6.  Changes

880	   This is the first version of this document.

882	   The following changes were put into this document since draft-06:

884	      Changed the document title from the unwieldy "Matching Tags for
885	      the Identification of Languages" to "Matching Language Tags" (Ed.)

887	      Fixed problems with the distance metric filtering scheme
888	      (Section 3.2.3) examples (in which tags were expanded
889	      incorrectly).  (D.Ewell)

891	      Moved the sentence "Protocols and specifications SHOULD clearly
892	      indicate the particular mechanism used in selecting or matching
893	      language tags." from the introduction (where there should not be
894	      any normative language) to the start of Section 3.  (A.Phillips)

896	      Created Section 2.4 and moved text there (A.Phillips)

898	      Modified the examples of closely related subtags in Section 3.2.3
899	      to show what the examples mean (M.Duerst)

901	      Various spelling and grammatical fixes (D.Ewell)

903	7.  Security Considerations

905	   Language ranges used in content negotiation might be used to infer
906	   the nationality of the sender, and thus identify potential targets
907	   for surveillance.  In addition, unique or highly unusual language
908	   ranges or combinations of language ranges might be used to track a
909	   specific individual's activities.

911	   This is a special case of the general problem that anything you send
912	   is visible to the receiving party.  It is useful to be aware that
913	   such concerns can exist in some cases.

915	   The evaluation of the exact magnitude of the threat, and any possible
916	   countermeasures, is left to each application or protocol.

918	8.  Character Set Considerations

920	   The syntax of language tags and language ranges permits only the
921	   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D).  These characters
922	   are present in most character sets, so presentation of language tags
923	   should not present any character set issues.

925	9.  References

927	9.1.  Normative References

929	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
930	              Requirement Levels", BCP 14, RFC 2119, March 1997.

932	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
933	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
934	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

936	   [RFC3066bis]
937	              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
938	              Identification of Languages", October 2005, <http://
939	              www.ietf.org/internet-drafts/
940	              draft-ietf-ltru-registry-14.txt>.

942	   [RFC4234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
943	              Specifications: ABNF", RFC 4234, October 2005.

945	9.2.  Informative References

947	   [RFC1766]  Alvestrand, H., "Tags for the Identification of
948	              Languages", RFC 1766, March 1995.

950	   [RFC3066]  Alvestrand, H., "Tags for the Identification of
951	              Languages", BCP 47, RFC 3066, January 2001.

953	   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
954	              May 2002.

956	Appendix A.  Acknowledgements

958	   Any list of contributors is bound to be incomplete; please regard the
959	   following as only a selection from the group of people who have
960	   contributed to make this document what it is today.

962	   The contributors to [RFC3066bis], [RFC3066] and [RFC1766], each of
963	   which is a precursor to this document, made enormous contributions
964	   directly or indirectly to this document and are generally responsible
965	   for the success of language tags.

967	   The following people (in alphabetical order by family name)
968	   contributed to this document:

970	   Harald Alvestrand, Jeremy Carroll, John Cowan, Martin Duerst, Frank
971	   Ellermann, Doug Ewell, Marion Gunn, Kent Karlsson, Ira McDonald, M.
972	   Patton, Randy Presuhn, Eric van der Poel, and many, many others.

974	   Very special thanks must go to Harald Tveit Alvestrand, who
975	   originated RFCs 1766 and 3066, and without whom this document would
976	   not have been possible.

978	   For this particular document, John Cowan originated the scheme
979	   described in Section 3.2.3.  Mark Davis originated the scheme
980	   described in Section 3.3.

982	Authors' Addresses

984	   Addison Phillips (editor)
985	   Quest Software

987	   Email: addison dot phillips at quest dot com

989	   Mark Davis (editor)
990	   IBM

992	   Email: mark dot davis at ibm dot com

994	Intellectual Property Statement

996	   The IETF takes no position regarding the validity or scope of any
997	   Intellectual Property Rights or other rights that might be claimed to
998	   pertain to the implementation or use of the technology described in
999	   this document or the extent to which any license under such rights
1000	   might or might not be available; nor does it represent that it has
1001	   made any independent effort to identify any such rights.  Information
1002	   on the procedures with respect to rights in RFC documents can be
1003	   found in BCP 78 and BCP 79.

1005	   Copies of IPR disclosures made to the IETF Secretariat and any
1006	   assurances of licenses to be made available, or the result of an
1007	   attempt made to obtain a general license or permission for the use of
1008	   such proprietary rights by implementers or users of this
1009	   specification can be obtained from the IETF on-line IPR repository at
1010	   http://www.ietf.org/ipr.

1012	   The IETF invites any interested party to bring to its attention any
1013	   copyrights, patents or patent applications, or other proprietary
1014	   rights that may cover technology that may be required to implement
1015	   this standard.  Please address the information to the IETF at
1016	   ietf-ipr@ietf.org.

1018	Disclaimer of Validity

1020	   This document and the information contained herein are provided on an
1021	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1022	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1023	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1024	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1025	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1026	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1028	Copyright Statement

1030	   Copyright (C) The Internet Society (2005).  This document is subject
1031	   to the rights, licenses and restrictions contained in BCP 78, and
1032	   except as set forth therein, the authors retain all their rights.

1034	Acknowledgment

1036	   Funding for the RFC Editor function is currently provided by the
1037	   Internet Society.