idnits 2.17.1 

draft-ietf-ltru-matching-05.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 955.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 932.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 939.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 945.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document obsoletes RFC3066, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 7, 2005) is 6776 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'ID.ietf-ltru-initial' is defined on line 795, but no
     explicit reference was found in the text

  == Unused Reference: 'RFC1327' is defined on line 800, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC1521' is defined on line 803, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2028' is defined on line 808, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2231' is defined on line 815, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2396' is defined on line 824, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2434' is defined on line 828, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC2860' is defined on line 836, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3629' is defined on line 840, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO15924' is defined on line 851, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO3166-1' is defined on line 855, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO639-1' is defined on line 860, but no explicit
     reference was found in the text

  == Unused Reference: 'ISO639-2' is defined on line 864, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC3339' is defined on line 877, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref.
     'ID.ietf-ltru-initial'

  ** Obsolete normative reference: RFC 1327 (Obsoleted by RFC 2156)

  ** Obsolete normative reference: RFC 1521 (Obsoleted by RFC 2045, RFC 2046,
     RFC 2047, RFC 2048, RFC 2049)

  ** Obsolete normative reference: RFC 2028 (Obsoleted by RFC 9281)

  ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2434 (Obsoleted by RFC 5226)

  ** Obsolete normative reference: RFC 2616 (Obsoleted by RFC 7230, RFC 7231,
     RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 2860

  -- Obsolete informational reference (is this intentional?): RFC 1766
     (Obsoleted by RFC 3066, RFC 3282)

  -- Obsolete informational reference (is this intentional?): RFC 3066
     (Obsoleted by RFC 4646, RFC 4647)


     Summary: 10 errors (**), 0 flaws (~~), 17 warnings (==), 11 comments
     (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                   A. Phillips, Ed.
3	Internet-Draft                                            Quest Software
4	Obsoletes: 3066 (if approved)                              M. Davis, Ed.
5	Expires: April 10, 2006                                              IBM
6	                                                         October 7, 2005

8	           Matching Tags for the Identification of Languages
9	                      draft-ietf-ltru-matching-05

11	Status of this Memo

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as "work in progress."

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt.

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   This Internet-Draft will expire on April 10, 2006.

36	Copyright Notice

38	   Copyright (C) The Internet Society (2005).

40	Abstract

42	   This document describes different mechanisms for comparing, matching,
43	   and evaluating language tags.  Possible algorithms for language
44	   negotiation and content selection are described.

46	Table of Contents

48	   1.  Introduction
49	   2.  The Language Range
50	     2.1.  Lists of Language Ranges
51	     2.2.  Basic Language Range
52	       2.2.1.  Matching
53	       2.2.2.  Lookup
54	     2.3.  Extended Language Range
55	       2.3.1.  Extended Range Matching
56	       2.3.2.  Extended Range Lookup
57	       2.3.3.  Distance Metric Scheme
58	     2.4.  Meaning of Language Tags and Ranges
59	     2.5.  Choosing Between Alternate Matching Schemes
60	     2.6.  Considerations for Private Use Subtags
61	     2.7.  Length Considerations in Matching
62	   3.  IANA Considerations
63	   4.  Changes
64	   5.  Security Considerations
65	   6.  Character Set Considerations
66	   7.  References
67	     7.1.  Normative References
68	     7.2.  Informative References
69	   Appendix A.  Acknowledgements
70	   Authors' Addresses
71	   Intellectual Property and Copyright Statements

73	1.  Introduction

75	   Human beings on our planet have, past and present, used a number of
76	   languages.  There are many reasons why one would want to identify the
77	   language used when presenting or requesting information.

79	   Information about a user's language preferences commonly needs to be
80	   identified so that appropriate processing can be applied.  For
81	   example, the user's language preferences in a browser can be used to
82	   select web pages appropriately.  A choice of language preference can
83	   also be used to select among tools (such as dictionaries) to assist
84	   in the processing or understanding of content in different languages.

86	   Given a set of language identifiers, such as those defined in [draft-
87	   registry], various mechanisms can be envisioned for performing
88	   language negotiation and tag matching.  The suitability of a
89	   particular mechanism to a particular application depends on the needs
90	   of that application.

92	   This document defines several mechanisms for matching and filtering
93	   natural language content identified using Language Tags [draft-
94	   registry].  It also defines the syntax (called a "language range")
95	   associated with each of these mechanisms for specifying user language
96	   preferences.

98	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
99	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
100	   document are to be interpreted as described in [RFC2119].

102	2.  The Language Range

104	   Language Tags [draft-registry] are used to identify the language of
105	   some information item or content.  Applications that use language
106	   tags are often faced with the problem of identifying sets of content
107	   that share certain language attributes.  For example, HTTP 1.1
108	   [RFC2616] describes language ranges in its discussion of the Accept-
109	   Language header (Section 14.4), which is used for selecting content
110	   from servers based on the language of that content.

112	   When selecting content according to its language, it is useful to
113	   have a mechanism for identifying sets of language tags that share
114	   specific attributes.  This allows users to select or filter content
115	   based on specific requirements.  Such an identifier is called a
116	   "Language Range".

118	2.1.  Lists of Language Ranges

120	   When users specify a language preference they often need to specify a
121	   prioritized list of language ranges in order to best reflect their
122	   language requirements for the matching operation.  This is especially
123	   true for speakers of minority languages.  A speaker of Breton in
124	   France, for example, may specify "br" followed by "fr", meaning that
125	   if Breton is available, it is preferred, but otherwise French is the
126	   best alternative.  It can get more complex: a speaker may wish to
127	   fall back from Skolt Sami to Northern Sami to Finnish.

129	   A "Language Priority List" consists of a prioritized or weighted list
130	   of language ranges.  One well known example of such a list is the
131	   "Accept-Language" header defined in RFC 2616 [RFC2616] (see Section
132	   14.4) and RFC 3282 [RFC3282].  The various matching operations
133	   described in this document include considerations for using a
134	   language priority list.

136	2.2.  Basic Language Range

138	   A "Basic Language Range" identifies the set of content whose language
139	   tags begin with the same sequence of subtags.  A basic language range
140	   is identified by its 'language-range' tag, by adapting the
141	   ABNF [RFC2234bis] from HTTP/1.1 [RFC2616]:

143	   language-range = language-tag / "*"
144	   language-tag   = 1*8alphanum *("-" 1*8alphanum)
145	   alphanum       = ALPHA / DIGIT

147	   That is, a language-range has the same syntax as a language-tag or is
148	   the single character "*".  Basic Language Ranges imply that there is
149	   a semantic relationship between language tags that share the same
150	   prefix.  While this is often the case, it is not always true and
151	   users should note that the set of language tags that match a specific
152	   language-range may not be mutually intelligible.

154	   Basic language ranges were originally described in [RFC3066] and HTTP
155	   1.1 [RFC2616] (where they are referred to as simply a "language
156	   range").

158	   Users SHOULD avoid subtags that add no distinguishing value to a
159	   language range.  For example, script subtags SHOULD NOT be used to
160	   form a language range with language subtags which have a matching
161	   Suppress-Script field in their registry record.  Thus the language
162	   range "en-Latn" is probably inappropriate for most applications
163	   (because the vast majority of English documents are written in the
164	   Latin script and thus the 'en' language subtag has a Suppress-Script
165	   field for 'Latn' in the registry).

167	   Language tags and thus language ranges are to be treated as case
168	   insensitive: there exist conventions for the capitalization of some
169	   of the subtags, but these MUST NOT be taken to carry meaning.
170	   Matching of language tags to language ranges MUST be done in a case
171	   insensitive manner.

173	   When working with tags and ranges, note that extensions and most
174	   private use subtags are generally orthogonal to language tag fallback
175	   and users SHOULD avoid using these subtags in language ranges, since
176	   they will often interfere with the selection of available language
177	   content.  Since these subtags are always at the end of the sequence
178	   of subtags, they don't normally interfere with the use of prefixes
179	   for matching in the schemes described below.

181	   There are two matching schemes that are commonly associated with
182	   basic language ranges: matching and lookup.

184	   Note that neither matching nor lookup using basic language ranges
185	   attempts to process the semantics of the tags or ranges in any way.
186	   The language tag and language range are compared in a case
187	   insensitive manner using basic string processing.  The choice of
188	   subtags in both the language tag and language range can therefore
189	   affect the results produced.

191	2.2.1.  Matching

193	   Language tag matching is used to select all content that matches a
194	   given prefix.  In matching, the language range represents the least
195	   specific tag which is an acceptable match and every piece of content
196	   that matches is returned.  If the language priority list contains
197	   more than one range, the matches returned are typically ordered in
198	   descending level of preference.

200	   For example, if an application is applying a style to all content in
201	   a document in a particular language, it might use language tag
202	   matching to select the content to which the style is applied.

204	   A language-range matches a language-tag if it exactly equals the tag,
205	   or if it exactly equals a prefix of the tag such that the first
206	   character following the prefix is "-".  (That is, the language-range
207	   "de-de" matches the language tag "de-DE-1996", but not the language
208	   tag "de-Deva".)

210	   The special range "*" matches any tag.  A protocol which uses
211	   language ranges MAY specify additional rules about the semantics of
212	   "*"; for instance, HTTP/1.1 specifies that the range "*" matches only
213	   languages not matched by any other range within an "Accept-Language"
214	   header.
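
   The prefix rule above can be sketched in a few lines of Python.  This
   is only an illustration under stated assumptions: the function names
   basic_match and filter_tags are hypothetical, and the HTTP/1.1
   refinement of "*" is not modelled.

```python
def basic_match(language_range, language_tag):
    """Return True if the basic language range matches the tag."""
    rng = language_range.lower()   # comparison is case insensitive
    tag = language_tag.lower()
    if rng == "*":                 # the special range matches any tag
        return True
    # Exact match, or a prefix that is immediately followed by "-".
    return tag == rng or tag.startswith(rng + "-")


def filter_tags(priority_list, tags):
    """Return every matching tag, ordered by descending preference."""
    result = []
    for rng in priority_list:      # ranges are listed most preferred first
        for tag in tags:
            if basic_match(rng, tag) and tag not in result:
                result.append(tag)
    return result
```

   With these definitions, basic_match("de-de", "de-DE-1996") is true
   and basic_match("de-de", "de-Deva") is false, as in the example
   above.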

216	2.2.2.  Lookup

218	   Content lookup is used to select the single information item that
219	   best matches the language priority list for a given request.  In
220	   lookup, each language range in the language priority list represents
221	   the most specific tag which is an acceptable match; only the closest
222	   matching item according to the user's priority is returned.

224	   For example, if an application inserts some dynamic content into a
225	   document, returning an empty string if there is no exact match is not
226	   an option.  Instead, the application "falls back" until it finds a
227	   suitable piece of content to insert.

229	   When performing lookup, the language range is progressively truncated
230	   from the end until a matching piece of content is located.  For
231	   example, starting with the range "zh-Hant-CN-x-wadegile", the lookup
232	   would progressively search for content as shown below:

234	   Range to match: zh-Hant-CN-x-wadegile
235	   1. zh-Hant-CN-x-wadegile
236	   2. zh-Hant-CN
237	   3. zh-Hant
238	   4. zh
239	   5. (default content or the empty tag)

241	   Figure 2: Default Fallback Pattern Example

243	   This scheme allows some flexibility in finding content.  It also
244	   typically provides better results when data is not available at a
245	   specific level of tag granularity or is sparsely populated (than if
246	   the default language for the system or content were used).

248	   When performing lookup using a language priority list, the
249	   progressive search MUST proceed to consider each language range
250	   before finding the default content or empty tag.  For example, lookup
251	   for the list "fr-FR; zh-Hant" would search for content as follows:
252	   1. fr-FR
253	   2. fr
254	   3. zh-Hant // next language
255	   4. zh
256	   5. (default content or the empty tag)

258	   Figure 3: Lookup Using a Language Priority List
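
   The progressive truncation shown in Figures 2 and 3 might be
   implemented along the following lines.  This is a sketch, not a
   normative algorithm: the names fallback_chain and lookup are
   hypothetical, and the rule that a dangling single-character subtag
   (such as "x") is removed together with the subtag that followed it is
   an assumption inferred from Figure 2.

```python
def fallback_chain(language_range):
    """List the range followed by its progressively truncated forms."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()                      # truncate from the end
        # Assumption: never leave a singleton such as "x" dangling.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(priority_list, available, default=None):
    """Return the single best available tag, or the default."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:   # every range is tried in turn
        for candidate in fallback_chain(language_range):
            if candidate.lower() in by_lower:
                return by_lower[candidate.lower()]
    return default                         # default content or empty tag
```

   For the range "zh-Hant-CN-x-wadegile", fallback_chain yields the
   four candidates of Figure 2; the default is returned only after
   every range in the priority list has been exhausted.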

260	2.3.  Extended Language Range

262	   Prefix matching using a Basic Language Range, as described above, is
263	   not always the most appropriate way to access the information
264	   contained in language tags when selecting or filtering content.  Some
265	   applications might wish to define a more granular matching scheme and
266	   such a matching scheme requires the ability to specify the various
267	   attributes of a language tag in the language range.  An extended
268	   language range can be represented by the following ABNF:
269	   extended-language-range  = range ; a range
270	                 / privateuse              ; private use tag
271	                 / grandfathered           ; grandfathered registrations

273	   range         = (language
274	                    ["-" script]
275	                    ["-" region]
276	                    *("-" variant)
277	                    *("-" extension)
278	                    ["-" privateuse])

280	   language      = (2*3ALPHA [ extlang ]) ; shortest ISO 639 code
281	                 / 4ALPHA                 ; reserved for future use
282	                 / 5*8ALPHA               ; registered language subtag
283	                 / "*"                    ; ... or wildcard

285	   extlang       = *2("-" 3ALPHA) ("-" ( 3ALPHA / "*"))
286	                                          ; reserved for future use
287	                                          ; wildcard can only appear
288	                                          ;   at the end

290	   script        = 4ALPHA                 ; ISO 15924 code
291	                 / "*"                    ; or wildcard

293	   region        = 2ALPHA                 ; ISO 3166 code
294	                 / 3DIGIT                 ; UN M.49 code
295	                 / "*"                    ; ... or wildcard

297	   variant       = 5*8alphanum            ; registered variants
298	                 / (DIGIT 3alphanum)      ;
299	                 / "*"                    ; ... or wildcard

301	   extension     = singleton *("-" (2*8alphanum)) [ "-*" ]
302	                                          ; extension subtags
303	                                          ; wildcard can only appear
304	                                          ;   at the end

306	   singleton     = %x41-57 / %x59-5A / %x61-77 / %x79-7A / DIGIT
307	                 ; "a"-"w" / "y"-"z" / "A"-"W" / "Y"-"Z" / "0"-"9"
308	                 ; Single letters: x/X is reserved for private use

310	   privateuse    = ("x"/"X") 1*("-" (1*8alphanum))

312	   grandfathered = 1*3ALPHA 1*2("-" (2*8alphanum))
313	                   ; grandfathered registration
314	                   ; Note: i is the only singleton
315	                   ; that starts a grandfathered tag

317	   alphanum      = (ALPHA / DIGIT)       ; letters and numbers

319	   In an extended language range, the identifier takes the form of a
320	   series of subtags which must consist of well-formed subtags or the
321	   special subtag "*".  For example, the language range "en-*-US"
322	   specifies a primary language of 'en', followed by any script subtag,
323	   followed by the region subtag 'US'.

325	   A field not present in the middle of an extended language range MAY
326	   be treated as if the field contained a "*".  For example, the range
327	   "en-US" MAY be considered to be equivalent to the range "en-*-US".
328	   This also means that multiple wildcards can be collapsed (so that
329	   "en-*-*-US" is equivalent to "en-*-US").
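
   As an illustration, the collapsing of wildcards could be sketched as
   follows.  The function names are hypothetical, and the second and
   third functions assume an implementation that exercises the MAY above
   and treats a missing internal field as "*".

```python
def collapse_wildcards(language_range):
    """Collapse runs of "*" subtags into a single "*"."""
    collapsed = []
    for subtag in language_range.split("-"):
        if subtag == "*" and collapsed and collapsed[-1] == "*":
            continue                       # skip a repeated wildcard
        collapsed.append(subtag)
    return "-".join(collapsed)


def drop_internal_wildcards(language_range):
    """Remove "*" subtags that are neither first nor last."""
    subtags = collapse_wildcards(language_range).split("-")
    last = len(subtags) - 1
    kept = [s for i, s in enumerate(subtags) if s != "*" or i in (0, last)]
    return "-".join(kept)


def equivalent_ranges(a, b):
    """Compare two ranges, ignoring case and internal wildcards."""
    return drop_internal_wildcards(a).lower() == drop_internal_wildcards(b).lower()
```

   Under these assumptions "en-*-*-US" collapses to "en-*-US", which in
   turn compares as equivalent to "en-US".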

331	   When working with tags and ranges users SHOULD note the following:

333	   1.  Private-use and Extension subtags are normally orthogonal to
334	       language tag fallback.  Implementations SHOULD ignore
335	       unrecognized private-use and extension subtags when performing
336	       language tag fallback.  Since these subtags are always at the end
337	       of the sequence of subtags, they don't normally interfere with
338	       the use of prefixes for matching in the schemes described below.

340	   2.  Implementations that choose not to interpret one or more private-
341	       use or extension subtags SHOULD NOT remove or modify these
342	       extensions in content that they are processing.  When a language
343	       tag instance is to be used in a specific, known protocol, and is
344	       not being passed through to other protocols, language tags MAY be
345	       filtered to remove subtags and extensions that are not supported
346	       by that protocol.  Such filtering SHOULD be avoided, if possible,
347	       since it removes information that might be relevant if services
348	       on the other end of the protocol would make use of that
349	       information.

351	   3.  Some applications of language tags might want or need to consider
352	       extensions and private-use subtags when matching tags.  If
353	       extensions and private-use subtags are included in a matching or
354	       filtering process that utilizes one of the schemes described
355	       in this document, then the implementation SHOULD canonicalize the
356	       language tags and/or ranges before performing the matching.  Note
357	       that language tag processors that claim to be "well-formed"
358	       processors as defined in [draft-registry] generally fall into
359	       this category.

361	   There are several matching algorithms or schemes which can be applied
362	   when matching extended language ranges to language tags.

364	2.3.1.  Extended Range Matching

366	   In extended range matching, each extended language range in the
367	   language priority list is considered in turn, according to priority.
368	   The subtags in each extended language range are compared to the
369	   corresponding subtags in the language tag being examined.  The subtag
370	   from the range is considered to match if it exactly matches the
371	   corresponding subtag in the tag or the range's subtag has the value
372	   "*" (which matches all subtags, including the empty subtag).
373	   Extended Range Matching is an extension of basic matching
374	   (Section 2.2.1): the language range represents the least specific tag
375	   which is an acceptable match.

377	   Private use subtags MAY be specified in the language range and MUST
378	   NOT be ignored when matching.

380	   Subtags not specified, including those at the end of the language
381	   range, are assigned the value "*".  This makes each range into a
382	   prefix much like that used in basic language range matching.  For
383	   example, the extended language range "zh-*-CN" matches all of the
384	   following tags because the unspecified variant field is expanded to
385	   "*":

387	      zh-Hant-CN
388	      zh-CN

390	      zh-Hans-CN

392	      zh-CN-x-wadegile

394	      zh-Latn-CN-boont

396	      zh-cmn-Hans-CN-x-wadegile

398	2.3.2.  Extended Range Lookup

400	   In extended range lookup, each extended language range in the
401	   language priority list is considered in turn.  The subtags in each
402	   extended language range are compared to the corresponding subtags in
403	   the language tag being examined.  A subtag is considered to match if
404	   it exactly matches the corresponding subtag in the tag or the range's
405	   subtag has the value "*" (which matches all subtags, including the
406	   empty subtag).  Extended language range lookup is an extension of
407	   basic lookup (Section 2.2.2): each language range represents the most
408	   specific tag which will form an acceptable match.  If no match is
409	   found, the default content or content with the empty language tag is
410	   usually returned (or the search can be considered to have failed).

412	   Subtags not specified are assigned the value "*" prior to performing
413	   tag matching.  Unlike in extended range matching, however, fields at
414	   the end of the range MUST NOT be expanded in this manner.  For
415	   example, "en-US" MUST NOT be considered to be the same as the range
416	   "en-US-*".  This allows ranges to be specific.  The "*" wildcard MUST
417	   be used at the end of the range to indicate that all tags with the
418	   range as a prefix are allowable matches.  That is, the range "zh-*"
419	   matches the tags "zh-Hant" and "zh-Hant-CN", while the range "zh"
420	   matches neither of those tags.

422	   The wildcard "*" at the end of a range SHOULD be considered to match
423	   any private use subtag sequences (making extended language range
424	   lookup function exactly like extended range matching, Section 2.3.1).

426	   By default all extensions and their subtags SHOULD be ignored for
427	   extended language range lookup.  Private use subtags MAY be specified
428	   in the language range and MUST NOT be ignored when performing lookup.
429	   The wildcard "*" at the end of a range SHOULD be considered to match
430	   any private use subtag sequences in addition to variants.

432	   For example, the range "*-US" matches all of the following tags:

434	      en-US
435	      en-Latn-US

437	      en-US-r-extends (extensions are ignored)

439	      fr-US

441	   For example, the range "en-*-US" matches _none_ of the following
442	   tags:

444	      fr-US

446	      en (missing region US)

448	      en-Latn (missing region US)

450	      en-Latn-US-scouse (variant field is present)

452	   For example, the range "en-*" matches all of the following tags:

454	      en-Latn

456	      en-Latn-US

458	      en-Latn-US-scouse

460	      en-US

462	      en-scouse

464	   Note that the ability to be specific in extended range lookup can
465	   make this matching scheme a more appropriate replacement for basic
466	   matching than the extended range matching scheme.
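
   The difference between extended range matching (Section 2.3.1) and
   extended range lookup can be made concrete with a small sketch.  It
   rests on several assumptions that are not part of this document:
   subtags are assigned to fields purely by their shape, extensions are
   always ignored, grandfathered and private-use-only tags are not
   handled, and the names ORDER, parse, extended_match and expand_tail
   are hypothetical.

```python
ORDER = ("language", "extlang", "script", "region", "variant", "privateuse")


def parse(tag):
    """Split a tag or range into fields by subtag shape (simplified)."""
    subtags = tag.lower().split("-")
    fields = {name: [] for name in ORDER}
    fields["language"] = [subtags[0]]
    i = 1
    while i < len(subtags):
        s = subtags[i]
        i += 1
        if s == "*":                       # a wildcard leaves its field unset
            continue
        if s == "x":                       # the rest is private use
            fields["privateuse"] = subtags[i:]
            break
        if len(s) == 1:                    # extension: skip it entirely
            while i < len(subtags) and len(subtags[i]) > 1:
                i += 1
            continue
        if s.isalpha() and len(s) == 3:
            fields["extlang"].append(s)
        elif s.isalpha() and len(s) == 4:
            fields["script"].append(s)
        elif (s.isalpha() and len(s) == 2) or (s.isdigit() and len(s) == 3):
            fields["region"].append(s)
        else:
            fields["variant"].append(s)
    return fields


def extended_match(language_range, tag, expand_tail):
    """expand_tail=True gives matching; False gives lookup semantics."""
    r, t = parse(language_range), parse(tag)
    tail_wild = expand_tail or language_range.endswith("*")
    last = max(i for i, name in enumerate(ORDER) if r[name])
    for i, name in enumerate(ORDER):
        if name == "language" and r[name] == ["*"]:
            continue                       # wildcard language
        if r[name]:
            if r[name] != t[name]:         # specified fields must agree
                return False
        elif i > last and not tail_wild and t[name]:
            return False                   # lookup: trailing field must be empty
    return True
```

   With expand_tail set to False, the range "en-*-US" rejects
   "en-Latn-US-scouse" because the variant field is present, while with
   expand_tail set to True the same range accepts it.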

468	2.3.3.  Distance Metric Scheme

470	   Both Basic and Extended Language Ranges produce simple boolean
471	   matches.  Some applications may benefit by providing an array of
472	   results with different levels of matching, for example, sorting
473	   results based on the overall "quality" of the match.

475	   This type of matching is sometimes called a "distance metric".  A
476	   distance metric assigns a pair of language tags a numeric value
477	   representing the 'distance' between the two.  A distance of zero
478	   means that they are identical, a small distance indicates that they
479	   are very similar, and a large distance indicates that they are very
480	   different.  Using a distance metric, implementations can, for
481	   example, allow users to select a threshold distance for a match to be
482	   successful or a filter to be applied.

484	   The first step in the process is to normalize the extended language
485	   range and the language tags to be matched to it by canonicalizing
486	   them, mapping grandfathered and obsolete tags into modern
487	   equivalents.

489	   The language range and the language tags are then transformed into
490	   quintuples of elements of the form (language, script, region,
491	   variant, extension).  Any extended language subtags are considered
492	   part of the language element; private use subtag sequences are
493	   considered part of the language element if in the initial position in
494	   the tag and part of the variant element if not.  Language subtags
495	   'und', 'mul', and the script subtag 'Zyyy' are converted to "*".

497	   Missing components in the language-tag are set to "*"; thus a "*"
498	   pattern becomes the quintuple ("*", "*", "*", "*", "*").  Missing
499	   components in the extended language-range are handled similarly to
500	   extended range lookup: missing internal subtags are expanded to "*".
501	   Missing end subtags are expanded as the empty string.  Thus a pattern
502	   "en-US" becomes the quintuple ("en","*","US","","").

504	   Here are some examples of language-tags and their quintuples:

506	      en-US ("en","*","US","*","*")

508	      sr-Latn ("sr","Latn","*","*","*")

510	      zh-cmn-Hant ("zh-cmn","Hant","*","*","*")

512	      x-foo ("x-foo","*","*","*","*")

514	      en-x-foo ("en","*","*","x-foo","*")

516	      i-default ("i-default","*","*","*","*")

518	      sl-Latn-IT-rozaj ("sl","Latn","IT","rozaj","*")

520	      zh-r-wadegile ("zh","*","*","*","r-wadegile") // hypothetical
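
   A language tag might be converted to its quintuple roughly as shown
   below.  This is a simplified sketch: the name to_quintuple is
   hypothetical, subtags are classified by shape only, any tag starting
   with "x" or "i" is kept whole as the language element, and the
   canonicalization step described earlier is assumed to have been done
   already.

```python
def to_quintuple(tag):
    """Return (language, script, region, variant, extension) for a tag."""
    subtags = tag.split("-")
    if subtags[0].lower() in ("x", "i"):   # private use or grandfathered
        return (tag, "*", "*", "*", "*")
    language, script, region = [subtags[0]], "*", "*"
    variant, extension = [], []
    i = 1
    # Extended language subtags belong to the language element.
    while i < len(subtags) and len(subtags[i]) == 3 and subtags[i].isalpha():
        language.append(subtags[i])
        i += 1
    if i < len(subtags) and len(subtags[i]) == 4 and subtags[i].isalpha():
        script = subtags[i]
        i += 1
    if i < len(subtags) and (
        (len(subtags[i]) == 2 and subtags[i].isalpha())
        or (len(subtags[i]) == 3 and subtags[i].isdigit())
    ):
        region = subtags[i]
        i += 1
    while i < len(subtags) and len(subtags[i]) > 1:
        variant.append(subtags[i])         # registered variants
        i += 1
    while i < len(subtags) and subtags[i].lower() != "x":
        extension.append(subtags[i])       # singleton and its subtags
        i += 1
    variant.extend(subtags[i:])            # trailing private use, if any
    lang = "-".join(language)
    if lang.lower() in ("und", "mul"):
        lang = "*"
    if script.lower() == "zyyy":
        script = "*"
    return (lang, script, region,
            "-".join(variant) or "*", "-".join(extension) or "*")
```

   Missing elements come back as "*", which is the rule for language
   tags; an extended language range would need the empty string for
   missing end subtags instead.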

522	   Each language-range/language-tag pair being matched or filtered is
523	   assigned a distance value, whereby small values indicate better
524	   matches and large values indicate worse ones.  The distance between
525	   the pair is the sum of the distances for each of the corresponding
526	   elements of the quintuple.  If the elements are identical or one is
527	   '*', then the distance value between them is zero.  Otherwise, it is
528	   given by the following table:

530	     256    language mismatch
531	     128    script mismatch
532	      32    region mismatch
533	       4    variant mismatch
534	       1    extension mismatch

536	   A value of 0 is a perfect match; 421 is no match at all.  Different
537	   threshold values might be appropriate for different applications and
538	   implementations will probably allow users to choose the most
539	   appropriate selection value, ranking the selections based on score.

541	   Examples of various tags' distances from the range "en-US":

543	   "fr"             256 (language mismatch, region match)
544	   "en-GB"           32 (region mismatch)
545	   "en-Latn-US"       0 (all fields match)
546	   "en-Brai"         32 (region mismatch)
547	   "en-US-x-foo"      4 (variant mismatch: range is the empty string)
548	   "en-US-r-wadegile" 1 (extension mismatch: range is the empty string)
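
   The distance rule can be sketched in a few lines.  This is a minimal
   illustration that applies the rule exactly as stated above (zero
   when the elements are identical or either is "*", otherwise the
   weight from the table).  The names WEIGHTS, element_distance, and
   distance are hypothetical, and the case-insensitive comparison is an
   assumption of the sketch.

```python
# Weights from the table above, in quintuple order:
# (language, script, region, variant, extension).
WEIGHTS = (256, 128, 32, 4, 1)


def element_distance(range_elem, tag_elem, weight):
    """Return 0 if the elements are identical or either is "*", else weight."""
    a, b = range_elem.lower(), tag_elem.lower()  # assumed case-insensitive
    if a == b or "*" in (a, b):
        return 0
    return weight


def distance(range_q, tag_q):
    """Sum the per-element distances of two quintuples."""
    return sum(element_distance(r, t, w)
               for r, t, w in zip(range_q, tag_q, WEIGHTS))


# The range "en-US" as a quintuple: missing end subtags are empty strings.
EN_US_RANGE = ("en", "*", "US", "", "")

print(distance(EN_US_RANGE, ("fr", "*", "*", "*", "*")))           # "fr"
print(distance(EN_US_RANGE, ("en", "Latn", "US", "*", "*")))       # "en-Latn-US"
print(distance(EN_US_RANGE, ("en", "*", "US", "x-foo", "*")))      # "en-US-x-foo"
print(distance(EN_US_RANGE, ("en", "*", "US", "*", "r-wadegile"))) # "en-US-r-wadegile"
```

   An application could then compare the returned value against
   whatever threshold it considers acceptable, or sort candidate tags
   by it.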

550	   Implementations may want to use more sophisticated weights that
551	   depend on the values of the corresponding elements.  For example,
552	   depending on the domain, an implementation might give a small distance
553	   to the difference between the language subtag 'no' and the closely
554	   related language subtags 'nb' or 'nn'; or between the script subtags
555	   'Kata' and 'Hira'; or between the region subtags 'US' and 'UM'.

557	2.4.  Meaning of Language Tags and Ranges

559	   A language tag defines a language as spoken (or written, signed or
560	   otherwise signaled) by human beings for communication of information
561	   to other human beings.

563	   If a language tag B contains language tag A as a prefix, then B is
564	   typically "narrower" or "more specific" than A. For example, "zh-
565	   Hant-TW" is more specific than "zh-Hant".

567	   This relationship is not guaranteed in all cases: specifically,
568	   languages that begin with the same sequence of subtags are NOT
569	   guaranteed to be mutually intelligible, although they might be.

571	   For example, the tag "az" shares a prefix with both "az-Latn"
572	   (Azerbaijani written using the Latin script) and "az-Cyrl"
573	   (Azerbaijani written using the Cyrillic script).  A person fluent in
574	   one script might not be able to read the other, even though the text
575	   might be otherwise identical.  Content tagged as "az" most probably
576	   is written in just one script and thus might not be intelligible to a
577	   reader familiar with the other script.

579	   Variant subtags in particular seem to represent specific divisions in
580	   mutual understanding, since they often encode dialects or other
581	   idiosyncratic variations within a language.

583	   The relationship between the language tag and the information it
584	   relates to is defined by the standard describing the context in which
585	   it appears.  Accordingly, this section can only give possible
586	   examples of its usage.

588	   o  For a single information object, the associated language tags
589	      might be interpreted as the set of languages that are necessary
590	      for a complete comprehension of the complete object.  Example:
591	      Plain text documents.

593	   o  For an aggregation of information objects, the associated language
594	      tags could be taken as the set of languages used inside components
595	      of that aggregation.  Examples: Document stores and libraries.

597	   o  For information objects whose purpose is to provide alternatives,
598	      the associated language tags could be regarded as a hint that the
599	      content is provided in several languages, and that one has to
600	      inspect each of the alternatives in order to find its language or
601	      languages.  In this case, the presence of multiple tags might not
602	      mean that one needs to be multi-lingual to get complete
603	      understanding of the document.  Example: MIME multipart/
604	      alternative.

606	   o  In markup languages, such as HTML and XML, language information
607	      can be added to each part of the document identified by the markup
608	      structure (including the whole document itself).  For example, one
609	      could write <span lang="FR">C'est la vie.</span> inside a
610	      Norwegian document; the Norwegian-speaking user could then access
611	      a French-Norwegian dictionary to find out what the marked section
612	      meant.  If the user were listening to that document through a
613	      speech synthesis interface, this information could be used to
614	      signal the synthesizer to apply French text-to-speech
615	      pronunciation rules to that span of text, instead of misapplying
616	      the Norwegian rules.

618	2.5.  Choosing Between Alternate Matching Schemes

620	   Implementers are faced with the decision of what form of matching to
621	   use in a specific application.  An application can choose to
622	   implement different styles of matching for different kinds of
623	   processing.

625	   The most basic choice is between schemes that produce an open-ended
626	   set of content (a "matching" application) and those that usually
627	   produce a single information item (a "lookup" application).  Note
628	   that lookup applications can produce multiple items, but usually only
629	   a single item for any given piece of content, and they can be used to
630	   order content (the later in the overall fallback that the content
631	   appears to match, the more distant the match).

633	   Matching applications can produce an ordered or unordered set of
634	   results.  For example, applying formatting to a document based on the
635	   language of specific pieces of content does not require the content
636	   to be ordered.  It is sufficient to know whether a specific piece of
637	   content matches or does not match.  A search application, on the
638	   other hand, probably would put the results into a priority order.

640	   If a single item is to be chosen, it may sometimes be useful to
641	   apply additional information, such as the most likely script used
642	   in the language or region in question or the script used by other
643	   content selected, in order to make a more "informed" choice.

645	   The matching schemes in this document are designed so that
646	   implementations do not have to examine the values of the subtags
647	   supplied and, except for scored matching, they do not need access to
648	   the Language Subtag Registry nor do they require the use of valid
649	   subtags in language tags or ranges.  This has great benefit for speed
650	   and simplicity of implementation.

652	   Implementations might also wish to use semantic information external
653	   to the language tags when performing fallback.  For example, the
654	   primary language subtags 'nn' (Nynorsk Norwegian) and 'nb' (Bokmal
655	   Norwegian) might both be usefully matched to the more general subtag
656	   'no' (Norwegian).  Or an application might infer that content labeled
657	   "zh-CN" is more likely to match the range "zh-Hans" than equivalent
658	   content labeled "zh-TW".

660	2.6.  Considerations for Private Use Subtags

662	   Private-use subtags require private agreement between the parties
663	   that intend to use or exchange language tags that use them and great
664	   caution SHOULD be used in employing them in content or protocols
665	   intended for general use.  Private-use subtags are simply useless for
666	   information exchange without prior arrangement.

668	   The value and semantic meaning of private-use tags and of the subtags
669	   used within such a language tag are not defined.  Matching private
670	   use tags using language ranges or extended language ranges can result
671	   in unpredictable content being returned.

673	2.7.  Length Considerations in Matching

675	   RFC 3066 [RFC3066] did not provide an upper limit on the size of
676	   language tags or ranges.  RFC 3066 did define the semantics of
677	   particular subtags in such a way that most language tags or ranges
678	   consisted of language and region subtags with a combined total length
679	   of up to six characters.  Larger tags and ranges (in terms of both
680	   subtags and characters) did exist, however.

682	   [draft-registry] also does not impose a fixed upper limit on the
683	   number of subtags in a language tag or range (and thus an upper bound
684	   on the size of either).  The syntax in that document suggests that,
685	   depending on the specific language or range of languages, more
686	   subtags (and thus characters) are sometimes necessary as a result.
687	   Length considerations and their impact on the selection and
688	   processing of tags are described in Section 2.1.1 of that document.

690	   A matching implementation MAY choose to limit the length of the
691	   language tags or ranges used in matching.  Any such limitation SHOULD
692	   be clearly documented, and such documentation SHOULD include the
693	   disposition of any longer tags or ranges (for example, whether an
694	   error value is generated or the language tag or range is truncated).
695	   If truncation is permitted it MUST NOT permit a subtag to be divided,
696	   since this changes the semantics of the subtag being matched and can
697	   result in false positives or negatives.

699	   Implementations that restrict storage SHOULD consider the impact of
700	   tag or range truncation on the resulting matches.  For example,
701	   removing the "*" from the end of an extended language range (see
702	   Section 2.3) can greatly modify the set of returned matches.  A
703	   protocol that allows tags or ranges to be truncated at an arbitrary
704	   limit, without giving any indication of what that limit is, has the
705	   potential for causing harm by changing the meaning of values in
706	   substantial ways.

708	   In practice, most tags do not require additional subtags or
709	   substantially more characters.  Additional subtags sometimes add
710	   useful distinguishing information, but extraneous subtags interfere
711	   with the meaning, understanding, and especially matching of language
712	   tags.  Since language tags or ranges MAY be truncated by an
713	   application or protocol that limits storage, when choosing language
714	   tags or ranges users and applications SHOULD avoid adding subtags
715	   that add no distinguishing value.  In particular, users and
716	   implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
717	   fields in the registry (defined in Section 3.6 of [draft-registry]):
718	   these fields provide guidance on when specific additional subtags
719	   SHOULD (and SHOULD NOT) be used.

721	   Implementations MUST support a limit of at least 33 characters.  This
722	   limit includes at least one subtag of each non-extension, non-private
723	   use type.  When choosing a buffer limit, a length of at least 42
724	   characters is strongly RECOMMENDED.

726	   The practical limit on tags or ranges derived solely from registered
727	   values is 42 characters.  Implementations MUST be able to handle tags
728	   and ranges of this length.  Support for tags and ranges of at least
729	   62 characters in length is RECOMMENDED.  Implementations MAY support
730	   longer values, including matching extensive sets of private use or
731	   extension subtags.

733	   Applications or protocols which have to truncate a tag MUST do so by
734	   progressively removing subtags along with their preceding "-" from
735	   the right side of the language tag until the tag is short enough for
736	   the given buffer.  If the resulting tag ends with a single-character
737	   subtag, that subtag and its preceding "-" MUST also be removed.  For
738	   example:

740	   Tag to truncate: zh-Hant-CN-variant1-a-extend1-x-wadegile-private1
741	   1. zh-Hant-CN-variant1-a-extend1-x-wadegile
742	   2. zh-Hant-CN-variant1-a-extend1
743	   3. zh-Hant-CN-variant1
744	   4. zh-Hant-CN
745	   5. zh-Hant
746	   6. zh

748	   Figure 7: Example of Tag Truncation
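
   The truncation rule above can be sketched as follows.  This is an
   illustration only: the function name truncate_tag and the max_len
   parameter are hypothetical, and the sketch assumes that a tag whose
   first subtag alone exceeds the limit is returned as that first
   subtag, since a subtag is never divided.

```python
def truncate_tag(tag, max_len):
    """Remove whole subtags from the right until the tag fits in max_len."""
    subtags = tag.split("-")
    while len("-".join(subtags)) > max_len and len(subtags) > 1:
        subtags.pop()
        # A tag must not end with a single-character subtag, so a
        # trailing singleton such as "a" or "x" is removed as well.
        if len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    return "-".join(subtags)


tag = "zh-Hant-CN-variant1-a-extend1-x-wadegile-private1"
for limit in (40, 39, 28, 10, 7, 2):
    print(limit, truncate_tag(tag, limit))
```

   Because whole subtags are removed, the result can be considerably
   shorter than the buffer allows; in the example tag, a limit of 39
   characters yields a 29-character result.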

750	3.  IANA Considerations

752	   This document presents no new or existing considerations for IANA.

754	4.  Changes

756	   This is the first version of this document.

758	   The following changes were put into this document since draft-03:

760	      Modified the ABNF to match changes in [draft-registry]
761	      (K.Karlsson)

763	      Matched the references and reference formats to [draft-registry]
764	      (K.Karlsson)

766	      Various edits, additions, and emendations to deal with changes in
767	      the Last Call of draft-registry as well as cleaning up the text.

769	5.  Security Considerations

771	   Language ranges used in content negotiation might be used to infer
772	   the nationality of the sender, and thus identify potential targets
773	   for surveillance.  In addition, unique or highly unusual language
774	   ranges or combinations of language ranges might be used to track
775	   a specific individual's activities.

777	   This is a special case of the general problem that anything you send
778	   is visible to the receiving party.  It is useful to be aware that
779	   such concerns can exist in some cases.

781	   The evaluation of the exact magnitude of the threat, and any possible
782	   countermeasures, is left to each application protocol.

784	6.  Character Set Considerations

786	   The syntax of language tags and language ranges permits only the
787	   characters A-Z, a-z, 0-9, and HYPHEN-MINUS (%x2D).  These characters
788	   are present in most character sets, so presentation of language tags
789	   should not present any character set issues.

791	7.  References

793	7.1.  Normative References

795	   [ID.ietf-ltru-initial]
796	              Ewell, D., Ed., "Language Tags Initial Registry (work in
797	              progress)", August 2005, <http://www.ietf.org/
798	              internet-drafts/draft-ietf-ltru-initial-04.txt>.

800	   [RFC1327]  Hardcastle-Kille, S., "Mapping between X.400(1988) / ISO
801	              10021 and RFC 822", RFC 1327, May 1992.

803	   [RFC1521]  Borenstein, N. and N. Freed, "MIME (Multipurpose Internet
804	              Mail Extensions) Part One: Mechanisms for Specifying and
805	              Describing the Format of Internet Message Bodies",
806	              RFC 1521, September 1993.

808	   [RFC2028]  Hovey, R. and S. Bradner, "The Organizations Involved in
809	              the IETF Standards Process", BCP 11, RFC 2028,
810	              October 1996.

812	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
813	              Requirement Levels", BCP 14, RFC 2119, March 1997.

815	   [RFC2231]  Freed, N. and K. Moore, "MIME Parameter Value and Encoded
816	              Word Extensions: Character Sets, Languages, and
817	              Continuations", RFC 2231, November 1997.

819	   [RFC2234bis]
820	              Crocker, D. and P. Overell, "Augmented BNF for Syntax
821	              Specifications: ABNF", draft-crocker-abnf-rfc2234bis-00
822	              (work in progress), March 2005.

824	   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
825	              Resource Identifiers (URI): Generic Syntax", RFC 2396,
826	              August 1998.

828	   [RFC2434]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
829	              IANA Considerations Section in RFCs", BCP 26, RFC 2434,
830	              October 1998.

832	   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
833	              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
834	              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

836	   [RFC2860]  Carpenter, B., Baker, F., and M. Roberts, "Memorandum of
837	              Understanding Concerning the Technical Work of the
838	              Internet Assigned Numbers Authority", RFC 2860, June 2000.

840	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
841	              10646", STD 63, RFC 3629, November 2003.

843	   [draft-registry]
844	              Phillips, A., Ed. and M. Davis, Ed., "Tags for the
845	              Identification of Languages (work in progress)",
846	              August 2005, <http://www.ietf.org/internet-drafts/
847	              draft-ietf-ltru-registry-12.txt>.

849	7.2.  Informative References

851	   [ISO15924]
852	              "ISO 15924:2004. Information and documentation -- Codes
853	              for the representation of names of scripts", January 2004.

855	   [ISO3166-1]
856	              "ISO 3166-1:1997. Codes for the representation of names of
857	              countries and their subdivisions -- Part 1: Country
858	              codes", 1997.

860	   [ISO639-1]
861	              "ISO 639-1:2002. Codes for the representation of names of
862	              languages -- Part 1: Alpha-2 code", 2002.

864	   [ISO639-2]
865	              "ISO 639-2:1998. Codes for the representation of names of
866	              languages -- Part 2: Alpha-3 code, first edition", 1998.

868	   [RFC1766]  Alvestrand, H., "Tags for the Identification of
869	              Languages", RFC 1766, March 1995.

871	   [RFC3066]  Alvestrand, H., "Tags for the Identification of
872	              Languages", BCP 47, RFC 3066, January 2001.

874	   [RFC3282]  Alvestrand, H., "Content Language Headers", RFC 3282,
875	              May 2002.

877	   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
878	              Timestamps", RFC 3339, July 2002.

880	   [UN_M.49]  Statistics Division, United Nations, "Standard Country or
881	              Area Codes for Statistical Use", UN Standard Country or
882	              Area Codes for Statistical Use, Revision 4 (United Nations
883	              publication, Sales No. 98.XVII.9), June 1999.

885	Appendix A.  Acknowledgements

887	   Any list of contributors is bound to be incomplete; please regard the
888	   following as only a selection from the group of people who have
889	   contributed to make this document what it is today.

891	   The contributors to [draft-registry], [RFC3066] and [RFC1766], each
892	   of which is a precursor to this document, made enormous contributions
893	   directly or indirectly to this document and are generally responsible
894	   for the success of language tags.

896	   The following people (in alphabetical order by family name)
897	   contributed to this document:

899	   Jeremy Carroll, John Cowan, Frank Ellermann, Doug Ewell, Kent
900	   Karlsson, Ira McDonald, M. Patton, Randy Presuhn and many, many
901	   others.

903	   Very special thanks must go to Harald Tveit Alvestrand, who
904	   originated RFCs 1766 and 3066, and without whom this document would
905	   not have been possible.

907	   For this particular document, John Cowan originated the scheme
908	   described in Section 2.3.3.  Mark Davis originated the scheme
909	   described in Section 2.2.2.

911	Authors' Addresses

913	   Addison Phillips (editor)
914	   Quest Software

916	   Email: addison dot phillips at quest dot com

918	   Mark Davis (editor)
919	   IBM

921	   Email: mark dot davis at ibm dot com

923	Intellectual Property Statement

925	   The IETF takes no position regarding the validity or scope of any
926	   Intellectual Property Rights or other rights that might be claimed to
927	   pertain to the implementation or use of the technology described in
928	   this document or the extent to which any license under such rights
929	   might or might not be available; nor does it represent that it has
930	   made any independent effort to identify any such rights.  Information
931	   on the procedures with respect to rights in RFC documents can be
932	   found in BCP 78 and BCP 79.

934	   Copies of IPR disclosures made to the IETF Secretariat and any
935	   assurances of licenses to be made available, or the result of an
936	   attempt made to obtain a general license or permission for the use of
937	   such proprietary rights by implementers or users of this
938	   specification can be obtained from the IETF on-line IPR repository at
939	   http://www.ietf.org/ipr.

941	   The IETF invites any interested party to bring to its attention any
942	   copyrights, patents or patent applications, or other proprietary
943	   rights that may cover technology that may be required to implement
944	   this standard.  Please address the information to the IETF at
945	   ietf-ipr@ietf.org.

947	Disclaimer of Validity

949	   This document and the information contained herein are provided on an
950	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
951	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
952	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
953	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
954	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
955	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

957	Copyright Statement

959	   Copyright (C) The Internet Society (2005).  This document is subject
960	   to the rights, licenses and restrictions contained in BCP 78, and
961	   except as set forth therein, the authors retain all their rights.

963	Acknowledgment

965	   Funding for the RFC Editor function is currently provided by the
966	   Internet Society.