idnits 2.17.1 

draft-davis-t-langtag-ext-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 16, 2011) is 4698 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'TBD' is mentioned on line 362, but not defined

  == Unused Reference: 'US-ASCII' is defined on line 408, but no explicit
     reference was found in the text


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Internet Engineering Task Force                                 M. Davis
3	Internet-Draft                                                    Google
4	Intended status: Informational                               A. Phillips
5	Expires: December 18, 2011                                        Lab126
6	                                                               Y. Umaoka
7	                                                                     IBM
8	                                                           June 16, 2011

10	                           BCP 47 Extension T
11	                      draft-davis-t-langtag-ext-00

13	Abstract

15	   This document specifies an Extension to BCP 47 which provides subtags
16	   for specifying the source language or script of transformed text,
17	   including text that has been transliterated, transcribed, or
18	   translated.  It also provides for additional information used for
19	   identification.

21	Status of this Memo

23	   This Internet-Draft is submitted in full conformance with the
24	   provisions of BCP 78 and BCP 79.

26	   Internet-Drafts are working documents of the Internet Engineering
27	   Task Force (IETF).  Note that other groups may also distribute
28	   working documents as Internet-Drafts.  The list of current Internet-
29	   Drafts is at http://datatracker.ietf.org/drafts/current/.

31	   Internet-Drafts are draft documents valid for a maximum of six months
32	   and may be updated, replaced, or obsoleted by other documents at any
33	   time.  It is inappropriate to use Internet-Drafts as reference
34	   material or to cite them other than as "work in progress."

36	   This Internet-Draft will expire on December 18, 2011.

38	Copyright Notice

40	   Copyright (c) 2011 IETF Trust and the persons identified as the
41	   document authors.  All rights reserved.

43	   This document is subject to BCP 78 and the IETF Trust's Legal
44	   Provisions Relating to IETF Documents
45	   (http://trustee.ietf.org/license-info) in effect on the date of
46	   publication of this document.  Please review these documents
47	   carefully, as they describe your rights and restrictions with respect
48	   to this document.

50	Table of Contents

52	   1.  Introduction
53	     1.1.  Requirements Language
54	   2.  BCP47 Required Information
55	     2.1.  Summary
56	       2.1.1.  Canonicalization
57	     2.2.  Registration Form
58	   3.  Acknowledgements
59	   4.  IANA Considerations
60	   5.  Security Considerations
61	   6.  References
62	     6.1.  Normative References
63	     6.2.  Informative References
64	   Authors' Addresses

66	1.  Introduction

68	   [BCP47] permits the definition and registration of language tag
69	   extensions "that contain a language component and are compatible with
70	   applications that understand language tags".  This document defines
71	   an extension for specifying the source of a text transformation,
72	   including text that has been transliterated, transcribed, or
73	   translated.  The "singleton" identifier for this extension is 't'.

75	1.1.  Requirements Language

77	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
78	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
79	   document are to be interpreted as described in RFC 2119.

81	2.  BCP47 Required Information

83	   Language tags, as defined by [BCP47], are useful for identifying the
84	   language of content.  There are mechanisms for specifying variant
85	   subtags for special purposes.  However, these variants are
86	   insufficient for specifying text transformations, including text that
87	   has been transliterated, transcribed, or translated.  That is, for
88	   fully specifying such text, it is important to specify the source
89	   language and/or script.  In addition, it may also be important to
90	   specify a particular specification for the transformation.

92	   For example, if one is transcribing the names of Italian or Russian
93	   cities on a map for Japanese users, each name will need to be
94	   transliterated into katakana using rules appropriate for the source
95	   and target languages.  When tagging such data, it is
96	   important to be able to indicate not only the resulting content
97	   language ("ja" in this case), but also the source language.

99	   Transforms such as transliteration may vary depending not only on the
100	   source and target scripts, but also on the language.  Thus the
101	   Russian <U+041F U+0443 U+0442 U+0438 U+043D> (which corresponds to
102	   the Cyrillic <PE, U, TE, I, EN>) transliterates into "Putin" in
103	   English but "Poutine" in French.  The identifier may need to indicate
104	   a desired mechanical transformation in an API, or may need to tag
105	   data that has been converted (mechanically or by hand) according to a
106	   transliteration method.

108	   Such identification is accomplished by using the 't' extension
109	   defined in this document.  This extension is formed by the 't'
110	   singleton followed by a sequence of subtags that would form a
111	   language tag defined by [BCP47].  This allows for the source language
112	   or script to be specified to the degree of precision required.  There
113	   are restrictions on the sequence of subtags.  They MUST form a
114	   regular, valid, canonical language tag, and MUST NOT include either
115	   extensions or private use sequences introduced by the singleton 'x'.
116	   Where only the script is relevant (such as identifying a script-
117	   script transliteration) then 'und' is used for the primary language
118	   subtag.

120	   For example:

122	   +---------------------+---------------------------------------------+
123	   | Language Tag        | Description                                 |
124	   +---------------------+---------------------------------------------+
125	   | ja-t-it             | The content is Japanese, transformed from   |
126	   |                     | Italian.                                    |
127	   | ja-Kana-t-it        | The content is Japanese Katakana,           |
128	   |                     | transformed from Italian.                   |
129	   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
130	   |                     | transformed from the Cyrillic script.       |
131	   +---------------------+---------------------------------------------+

133	   Note that the sequence of subtags governed by 't' cannot contain a
134	   singleton (a single-character subtag), because that would start a new
135	   extension.  For example, the tag "ja-t-i-ami" does not indicate that
136	   the source is in "i-ami", because "i-ami" is not a regular language
137	   tag in [BCP47].  That tag would express an empty 't' extension
138	   followed by an 'i' extension.
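
   The singleton rule described above can be illustrated with a short
   Python sketch.  The helper name split_extensions is hypothetical and
   is not defined by this document or by [BCP47]; it only splits a tag
   on single-character subtags and performs no other validation.

```python
def split_extensions(tag):
    """Split a tag into its language part and its extensions.

    Illustrative assumption: every single-character subtag after the
    first position starts a new extension, and everything after 'x'
    is private use. No other validation is performed.
    """
    language = []
    extensions = {}
    current = language
    for index, subtag in enumerate(tag.lower().split("-")):
        in_private_use = current is extensions.get("x")
        if len(subtag) == 1 and index > 0 and not in_private_use:
            # A singleton opens a new extension sequence.
            current = extensions.setdefault(subtag, [])
        else:
            current.append(subtag)
    joined = {key: "-".join(value) for key, value in extensions.items()}
    return "-".join(language), joined


print(split_extensions("ja-t-it"))
print(split_extensions("ja-t-i-ami"))
```

   For "ja-t-i-ami" the sketch reports an empty 't' extension and an
   'i' extension holding "ami", which matches the reading given above.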

140	   In addition, it is sometimes necessary to indicate additional
141	   information, such as the mechanism used to do the transformation,
142	   optionally including the version of the mechanism.  The mechanism can
143	   be supplied by using the 'm0' separator.  The format of such a 't'
144	   extension is thus:

146	   "t-<language-tag>-m0-<mechanism>".

148	   (The full format reserves some additional syntax for future
149	   expansion, as described below.)

151	   The transform <mechanism> is a series of subtags that indicate the
152	   specification used for the transformation, such as "UNGEGN" for the
153	   United Nations Group of Experts on Geographical Names
154	   transliterations and transcriptions.

156	   For example:

158	   +------------------------------------+------------------------------+
159	   | Language Tag                       | Description                  |
160	   +------------------------------------+------------------------------+
161	   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
162	   |                                    | transformed from Latin,      |
163	   |                                    | according to a UNGEGN        |
164	   |                                    | specification dated 2007.    |
165	   +------------------------------------+------------------------------+
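
   As a minimal sketch of the "t-<language-tag>-m0-<mechanism>" format,
   the following Python function assembles a tag from its parts.  The
   function name and its parameters are hypothetical, and the inputs
   are assumed to be valid already; nothing is checked against [BCP47]
   or [UTS35].

```python
def build_transformed_tag(content_tag, source_tag, mechanism=()):
    """Join a content tag, a source tag, and optional mechanism subtags.

    Assumption of this sketch: the caller supplies well-formed parts.
    The subtags inside the 't' extension are written in lowercase.
    """
    parts = [content_tag, "t", source_tag.lower()]
    if mechanism:
        parts.append("m0")
        parts.extend(subtag.lower() for subtag in mechanism)
    return "-".join(parts)


print(build_transformed_tag("ja", "it"))
print(build_transformed_tag("und-Cyrl", "und-Latn", ["UNGEGN", "2007"]))
```

   The second call reproduces the tag shown in the table above.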

167	   The separator subtags such as 'm0' were chosen because they are
168	   short, visually distinctive, and cannot occur in a language subtag
169	   (outside of an extension and after 'x'), thus eliminating the
170	   potential for collision or confusion with the source language tag.

172	   The subtags that are valid in the 't' extension are provided by
173	   Section 3 [1] of Unicode Technical Standard #35: Unicode Locale Data
174	   Markup Language [UTS35].  As required by BCP 47, subtags follow the
175	   language tag ABNF and other rules for the formation of language tags
176	   and subtags, are restricted to the ASCII letters and digits, are not
177	   case sensitive, and do not exceed eight characters in length.

179	   EDITORIAL NOTE: This new facility has been accepted by the Unicode
180	   CLDR committee for incorporation into the next version of Unicode
181	   CLDR, parallel with the structure of the 'u' extension [RFC6067], for
182	   which it is already the maintaining authority.  The data and
183	   specification will be available by the time this internet draft has
184	   been approved.

186	   LDML is available over the Internet and at no cost, and is available
187	   via a royalty-free license at http://unicode.org/copyright.html.
188	   LDML is versioned, and each version of LDML is numbered, dated, and
189	   stable.  Extension subtags, once defined by LDML, are never retracted
190	   or changed in meaning in a substantial way.

192	   The structure of 't' subtags is determined by the Unicode CLDR
193	   Technical Committee, in accordance with the policies and procedures
194	   in http://www.unicode.org/consortium/tc-procedures.html, and subject
195	   to the Unicode Consortium Policies on
196	   http://www.unicode.org/policies/policies.html.

198	   Changes that can be made by successive versions of LDML [UTS35] by
199	   the Unicode Consortium without requiring a new RFC include the
200	   allocation of new subtags for use after the 't' extension.  A new RFC
201	   would be required for material changes to an existing 't' subtag, or
202	   an incompatible change to the overall syntactic structure of the 't'
203	   extension; however, such a change would be contrary to the policies
204	   of the Unicode Consortium, and thus is not anticipated.

206	   The maintaining authority for the 't' extension is the Unicode
207	   Consortium:

209	   +---------------+---------------------------------------------------+
210	   | Item          | Value                                             |
211	   +---------------+---------------------------------------------------+
212	   | Name          | Unicode Consortium                                |
213	   | Contact Email | cldr-contact@unicode.org                          |
214	   | Discussion    | cldr-users@unicode.org                            |
215	   | List Email    |                                                   |
216	   | URL Location  | cldr.unicode.org                                  |
217	   | Specification | Unicode Technical Standard #35 Unicode Locale     |
218	   |               | Data Markup Language (LDML),                      |
219	   |               | http://unicode.org/reports/tr35/                  |
220	   | Section       | Section 3 Unicode Language and Locale Identifiers |
221	   +---------------+---------------------------------------------------+

223	2.1.  Summary

225	   The following is a summary of the definition for the 't' subtags
226	   defined by Section 3 [1] of Unicode Technical Standard #35: Unicode
227	   Locale Data Markup Language [UTS35], which is relevant for this
228	   specification.

230	   The subtags in the 't' extension are of the following form:

232	     +--------+-------------------------+----------------------------+
233	     | Label  | ABNF                    | Comment                    |
234	     +--------+-------------------------+----------------------------+
235	     | t_ext= | "t-"                    | Extension                  |
236	     |        | [lang]                  | Source                     |
237	     |        | *("-" field)            | Optional information       |
238	     | lang=  | language                | [BCP47], with restrictions |
239	     |        | ["-" script]            |                            |
240	     |        | ["-" region]            |                            |
241	     |        | *("-" variant)          |                            |
242	     | field= | sep 1*("-" 3*8alphanum) | With restrictions          |
243	     | sep=   | 1ALPHA 1DIGIT           | Subtag separators          |
244	     +--------+-------------------------+----------------------------+
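
   The productions in the table can be approximated with a small Python
   parser.  This is a sketch under stated assumptions, not a reference
   implementation: the name parse_t_extension is hypothetical, the
   source language subtags are collected without [BCP47] validation,
   and only the "at least one subtag" and "no duplicate separator"
   restrictions from this section are enforced.

```python
import re

SEP = re.compile(r"[a-z][0-9]")              # sep = 1ALPHA 1DIGIT
FIELD_SUBTAG = re.compile(r"[a-z0-9]{3,8}")  # 3*8alphanum


def parse_t_extension(ext):
    """Parse the subtags that follow 't-' into (source, fields).

    Assumption of this sketch: the source language part is returned
    as-is, without checking it against [BCP47].
    """
    subtags = ext.lower().split("-") if ext else []
    if not subtags:
        raise ValueError("the 't' extension needs at least one subtag")
    source = []
    fields = {}
    current = None
    for subtag in subtags:
        if SEP.fullmatch(subtag):
            if subtag in fields:
                raise ValueError("duplicate separator: " + subtag)
            current = fields[subtag] = []
        elif current is None:
            # Still before the first separator: part of the source tag.
            source.append(subtag)
        elif FIELD_SUBTAG.fullmatch(subtag):
            current.append(subtag)
        else:
            raise ValueError("bad field subtag: " + subtag)
    for sep, values in fields.items():
        if not values:
            raise ValueError("separator without subtags: " + sep)
    return "-".join(source), fields


print(parse_t_extension("und-latn-m0-ungegn-2007"))
```

   The split works because a two-character subtag made of one letter
   and one digit cannot appear in the source language tag.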

246	   Description and restrictions:

248	   a.  The 't' extension MUST have at least one subtag.

250	   b.  The 't' extension normally starts with a source language tag,
251	       which MUST be a regular, canonical language tag as specified by
252	       [BCP47].  Tags described by the 'irregular' production in BCP 47
253	       MUST NOT be used to form the language tag.  The source language
254	       tag MAY be omitted: some field values do not require it.

256	   c.  There is optionally a sequence of fields, where each field is a
257	       separator followed by a sequence of subtags.  Two identical
258	       separators MUST NOT be present.

260	   d.  One field is initially specified in [UTS35]: the transform
261	       mechanism.

263	       A.  The transform mechanism consists of a sequence of subtags
264	           starting with the 'm0' separator followed by one or more
265	           mechanism subtags.  Each mechanism subtag has a length of 3
266	           to 8 alphanumeric characters.  The sequence as a whole
267	           provides an identification of the specification for the
268	           transform, such as the mechanism subtag 'UNGEGN' in "und-
269	           Cyrl-t-und-latn-m0-ungegn".  In many cases, only one
270	           mechanism subtag is necessary, but multiple subtags MAY be
271	           defined in [UTS35] where necessary.

273	       B.  Any purely numeric subtag is a representation of a date in
274	           the Gregorian calendar.  It MAY occur in any mechanism field.
275	           If it does occur:

277	           +  it MUST occur as the final subtag in the field,

279	           +  it MUST NOT be the only subtag in the field, and

281	           +  it MUST consist of a sequence of digits of the form YYYY,
282	              YYYYMM, or YYYYMMDD.

284	           For example, 20110623 represents June 23rd, 2011.  A date
285	           subtag SHOULD only be used where necessary, and then SHOULD
286	           be as short as possible.  For example, suppose that the BGN
287	           transliteration specification for Cyrillic to Latin had three
288	           versions, dated June 11th, 1999; Dec 30th, 1999; and May 1st,
289	           2011.  In that case, the corresponding first two DATE subtags
290	           would require months to be distinctive (199906 and 199912),
291	           but the last subtag would only require the year (2011).

293	       C.  Some mechanisms may use a versioning system that is not
294	           distinguished by date, or not by date alone.  In such
295	           cases, the version will be of a form specified by [UTS35] for
296	           that mechanism.  For example, if the mechanism XXX uses
297	           versions of the form v21a, then a tag could look like "ja-t-
298	           it-m0-xxx-v21a".  If there are multiple subversions
299	           distinguished by date, then a tag could look like "ja-t-it-
300	           m0-xxx-v21a-2007".

302	   e.  Successive versions of [UTS35] could define additional separator
303	       subtags, and additional subtags for those separators.  Once
304	       defined, those subtags will never be removed.

306	   f.  The order of the subtags is significant (see Section 2.1.1
307	       Canonicalization).
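
   The date subtag rules in item d.B can be sketched in Python as
   follows.  Both function names are hypothetical.  The first function
   checks only the three MUST-level rules and does not verify that the
   digits form a real calendar date; the second illustrates the "as
   short as possible" recommendation using the BGN example above.

```python
import re

# YYYY, YYYYMM, or YYYYMMDD
DATE_FORM = re.compile(r"[0-9]{4}(?:[0-9]{2}){0,2}")


def check_mechanism_dates(subtags):
    """Return a list of problems with date subtags in one mechanism field."""
    problems = []
    for index, subtag in enumerate(subtags):
        if not re.fullmatch(r"[0-9]+", subtag):
            continue  # not purely numeric, so not a date subtag
        if index != len(subtags) - 1:
            problems.append(subtag + ": not the final subtag")
        if len(subtags) == 1:
            problems.append(subtag + ": only subtag in the field")
        if not DATE_FORM.fullmatch(subtag):
            problems.append(subtag + ": not YYYY, YYYYMM, or YYYYMMDD")
    return problems


def shortest_date_subtags(dates):
    """Shorten YYYYMMDD strings to the shortest prefixes that stay distinct."""
    result = []
    for date in dates:
        chosen = date
        for length in (4, 6, 8):
            prefix = date[:length]
            if sum(other[:length] == prefix for other in dates) == 1:
                chosen = prefix
                break
        result.append(chosen)
    return result


print(check_mechanism_dates(["ungegn", "2007"]))
print(shortest_date_subtags(["19990611", "19991230", "20110501"]))
```

   For the three BGN dates, the second function yields 199906, 199912,
   and 2011, in agreement with the example in item d.B.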

309	   EDITORIAL NOTE: The following parallels the structure used for the
310	   'u' extension [RFC6067], for which the Unicode Consortium is the
311	   maintaining authority.  The data and specification will be available
312	   by the time this internet draft has been approved.

314	   Beginning with CLDR version 1.7.2, machine-readable files are
315	   available listing the data defined for BCP47 extensions for each
316	   successive version of [UTS35].  These releases are listed on
317	   http://cldr.unicode.org/index/downloads.  Each release has an
318	   associated data directory of the form
319	   "http://unicode.org/Public/cldr/<version>", where "<version>" is
320	   replaced by the release number.  For example, for version 1.7.2, the
321	   "core.zip" file is located at
322	   http://unicode.org/Public/cldr/1.7.2/core.zip [2].  Inside the
323	   "core.zip" file, the path "common/bcp47" contains the data files
324	   for BCP47 extensions.  The most recent
325	   version is always identified by the version "latest" and can be
326	   accessed by the URL in Section 2.2.
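
   A minimal Python sketch of locating these files follows.  The
   function names are hypothetical, the URL pattern is the one quoted
   above, and the archive in the usage lines is an in-memory stand-in
   with made-up member names rather than a real CLDR release.

```python
import io
import zipfile

CLDR_PUBLIC = "http://unicode.org/Public/cldr"  # base quoted in the text


def cldr_core_zip_url(version):
    """Build the core.zip URL for a release number such as '1.7.2'."""
    return CLDR_PUBLIC + "/" + version + "/core.zip"


def list_bcp47_files(core_zip):
    """List the members under common/bcp47/ in a core.zip archive.

    core_zip may be a file path or a binary file object.
    """
    with zipfile.ZipFile(core_zip) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("common/bcp47/") and not name.endswith("/")
        )


print(cldr_core_zip_url("1.7.2"))

# Stand-in archive built in memory; the member names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("common/bcp47/example.xml", "<example/>")
    archive.writestr("common/dtd/example.dtd", "")
print(list_bcp47_files(buffer))
```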

328	   To get the version information in XML when working with the data
329	   files, the XML parser must be validating.  When the 'core.zip' file
330	   is unzipped, the 'dtd' directory will be at the same level as the
331	   'bcp47' directory; that is required for correct validation.  For each
332	   release after CLDR 1.8, types introduced in that release are also
333	   marked in the data files by the XML attribute "since", such as in the
334	   following example:
335	   <type name="adp" since="1.9"/>
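
   The "since" attribute can be read with any XML parser.  The sketch
   below uses Python's xml.etree.ElementTree, which is not a validating
   parser, so it sees only attributes written explicitly in the file
   and not values supplied by the DTD.  The function name and the
   wrapper elements in the sample are hypothetical; only the "adp"
   element is taken from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <type name="adp" since="1.9"/> comes from
# the text above; the surrounding elements are placeholders.
SAMPLE = """
<sample>
  <key name="example">
    <type name="adp" since="1.9"/>
    <type name="other"/>
  </key>
</sample>
"""


def types_with_since(xml_text):
    """Map each type name to its explicit 'since' value, if present."""
    root = ET.fromstring(xml_text)
    return {
        element.get("name"): element.get("since")
        for element in root.iter("type")
        if element.get("since") is not None
    }


print(types_with_since(SAMPLE))
```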

337	   The data is also currently maintained in a source code repository,
338	   with each release tagged, for viewing directly without unzipping.
339	   For example, see:

341	   o  http://unicode.org/repos/cldr/tags/release-1-7-2/common/bcp47/

343	   o  http://unicode.org/repos/cldr/tags/release-1-8/common/bcp47/

345	2.1.1.  Canonicalization

347	   As required by [BCP47], the use of uppercase or lowercase letters is
348	   not significant in the subtags used in this extension.  The canonical
349	   form for all subtags in the extension is lowercase, with the fields
350	   ordered by the separators, alphabetically.
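
   The canonical form described above can be sketched in Python as
   follows.  The function name is hypothetical, the input is assumed
   to be a well-formed sequence of subtags following "t-", and the
   'k0' separator in the usage lines is a made-up placeholder, since
   this document defines only 'm0'.

```python
import re

SEP = re.compile(r"[a-z][0-9]")  # separator: one letter, one digit


def canonicalize_t_extension(ext):
    """Lowercase the subtags and order the fields by separator.

    Assumption of this sketch: ext is well-formed; nothing is validated.
    """
    source = []
    fields = []
    current = None
    for subtag in ext.lower().split("-"):
        if SEP.fullmatch(subtag):
            current = [subtag]
            fields.append(current)
        elif current is None:
            source.append(subtag)
        else:
            current.append(subtag)
    # Subtags keep their order inside a field; only whole fields move.
    fields.sort(key=lambda field: field[0])
    return "-".join(source + [s for field in fields for s in field])


print(canonicalize_t_extension("und-Latn-M0-UNGEGN-2007"))
print(canonicalize_t_extension("it-m0-xxx-k0-abc"))
```

   Only whole fields are reordered; the order of subtags within a
   field is significant and is left unchanged.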

352	2.2.  Registration Form

354	   Per RFC 5646, Section 3.7 [BCP47]:

356	   %%
357	   Identifier: t
358	   Description: Transform Specification
359	   Comments: Subtags for the identification of text transforms,
360	       including transliteration, transcription, and translation.
361	   Added: 2010-mm-dd
362	   RFC: [TBD]
363	   Authority: Unicode Consortium
364	   Contact_Email: cldr-contact@unicode.org
365	   Mailing_List: cldr-users@unicode.org
366	   URL: http://www.unicode.org/Public/cldr/latest/core.zip
367	   %%

369	3.  Acknowledgements

371	   Thanks to John Emmons and the rest of the Unicode CLDR Technical
372	   Committee for their work in developing the BCP 47 subtags for LDML.

374	4.  IANA Considerations

376	   This document will require IANA to insert the record in Section 2.2
377	   into the Language Extensions Registry, according to Section 3.7,
378	   "Extensions and the Extensions Registry", of "Tags for Identifying
379	   Languages" in [BCP47].  Per Section 5.2 of [BCP47], there might be
380	   occasional (rare) requests by the Unicode Consortium (the "Authority"
381	   listed in the record) for maintenance of this record.  Changes that
382	   can be submitted to IANA without the publication of a new RFC are
383	   limited to modification of the Comments, Contact_Email, Mailing_List,
384	   and URL fields.  Any such requested changes MUST use the domain
385	   'unicode.org' in any new addresses or URIs, MUST explicitly cite this
386	   document (so that IANA can reference these requirements), and MUST
387	   originate from the 'unicode.org' domain.  The domain or authority can
388	   only be changed via a new RFC.

390	   This document does not require IANA to create or maintain a new
391	   registry or otherwise impact IANA.

393	5.  Security Considerations

395	   The security considerations for this extension are the same as those
396	   for [BCP47].  See RFC 5646, Section 6, Security Considerations
397	   [BCP47].

399	6.  References

401	6.1.  Normative References

403	   [BCP47]    Davis, M., Ed., "Tags for the Identification of Language
404	              (BCP47)", September 2009.

406	   [RFC6067]  Davis, M., Ed., "BCP 47 Extension U", September 2010.

408	   [US-ASCII]
409	              International Organization for Standardization, "ISO/IEC
410	              646:1991, Information technology -- ISO 7-bit coded
411	              character set for information interchange.", 1991.

413	   [UTS35]    Davis, M., "Unicode Technical Standard #35: Locale Data
414	              Markup Language (LDML)", December 2007,
415	              <http://www.unicode.org/reports/tr35/>.

417	              Section 3: http://unicode.org/reports/
418	              tr35/#Unicode_Language_and_Locale_Identifiers

420	              Appendix Q: http://unicode.org/reports/
421	              tr35/#Locale_Extension_Key_and_Type_Data

423	6.2.  Informative References

425	   [ldml-registry]
426	              "Registry for Common Locale Data Repository tag elements",
427	              September 2009.

429	URIs

431	   [1]  <http://unicode.org/reports/tr35/>

433	   [2]  <http://unicode.org/Public/cldr/1.7.2/>

435	Authors' Addresses

437	   Mark Davis
438	   Google

440	   Email: mark@macchiato.com

442	   Addison Phillips
443	   Lab126

445	   Email: addison@lab126.com

447	   Yoshito Umaoka
448	   IBM

450	   Email: yoshito_umaoka@us.ibm.com