idnits 2.17.1 

draft-davis-t-langtag-ext-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 2 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 22, 2011) is 4691 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'TBD' is mentioned on line 271, but not defined

  == Unused Reference: 'US-ASCII' is defined on line 453, but no explicit
     reference was found in the text


     Summary: 0 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Internet Engineering Task Force                                 M. Davis
3	Internet-Draft                                                    Google
4	Intended status: Informational                               A. Phillips
5	Expires: December 24, 2011                                        Lab126
6	                                                               Y. Umaoka
7	                                                                     IBM
8	                                                                 C. Falk
9	                                                       Infinite Automata
10	                                                           June 22, 2011

12	                           BCP 47 Extension T
13	                      draft-davis-t-langtag-ext-01

15	Abstract

17	   This document specifies an Extension to BCP 47 which provides subtags
18	   for specifying the source language or script of transformed text,
19	   including text that has been transliterated, transcribed, or
20	   translated.  It also provides for additional information used for
21	   identification.

23	Status of this Memo

25	   This Internet-Draft is submitted in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF).  Note that other groups may also distribute
30	   working documents as Internet-Drafts.  The list of current Internet-
31	   Drafts is at http://datatracker.ietf.org/drafts/current/.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on December 24, 2011.

40	Copyright Notice

42	   Copyright (c) 2011 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (http://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document.

52	Table of Contents

54	   1.  Introduction
55	     1.1.  Requirements Language
56	   2.  BCP47 Required Information
57	     2.1.  Introduction
58	     2.2.  Structure
59	     2.3.  Canonicalization
60	     2.4.  BCP47 Registration Form
61	     2.5.  Field Definitions
62	     2.6.  Registration of Field Subtags
63	     2.7.  Machine-Readable Data
64	   3.  Acknowledgements
65	   4.  IANA Considerations
66	   5.  Security Considerations
67	   6.  References
68	     6.1.  Normative References
69	     6.2.  Informative References
70	   Authors' Addresses

72	1.  Introduction

74	   [BCP47] permits the definition and registration of language tag
75	   extensions "that contain a language component and are compatible with
76	   applications that understand language tags".  This document defines
77	   an extension for specifying the source of a text transformation,
78	   including text that has been transliterated, transcribed, or
79	   translated.  The "singleton" identifier for this extension is 't'.

81	   Language tags, as defined by [BCP47], are useful for identifying the
82	   language of content.  There are mechanisms for specifying variant
83	   subtags for special purposes.  However, these variants are
84	   insufficient for specifying text transformations, including text that
85	   has been transliterated, transcribed, or translated.  That is, for
86	   fully specifying such text, it is important to specify the source
87	   language and/or script.  In addition, it may also be important to
88	   identify a particular specification for the transformation.

90	   For example, if one is transcribing the names of Italian or Russian
91	   cities on a map for Japanese users, each name will need to be
92	   transliterated into katakana using rules appropriate for the specific
93	   source and target language.  When tagging such data, it is important
94	   to be able to indicate not only the resulting content language ("ja"
95	   in this case), but also the source language.

97	   Transforms such as transliteration may vary depending not only on the
98	   source and target scripts, but also on the language.  Thus the
99	   Russian <U+041F U+0443 U+0442 U+0438 U+043D> (which corresponds to
100	   the Cyrillic <PE, U, TE, I, EN>) transliterates into "Putin" in
101	   English but "Poutine" in French.  The identifier may need to indicate
102	   a desired mechanical transformation in an API, or may need to tag
103	   data that has been converted (mechanically or by hand) according to a
104	   transliteration method.

106	1.1.  Requirements Language

108	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
109	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
110	   document are to be interpreted as described in RFC 2119.

112	2.  BCP47 Required Information

114	2.1.  Introduction

116	   Identification of transforms can be done using the 't' extension
117	   defined in this document.  This extension is formed by the 't'
118	   singleton followed by a sequence of subtags that would form a
119	   language tag defined by [BCP47].  This allows for the source language
120	   or script to be specified to the degree of precision required.  There
121	   are restrictions on the sequence of subtags.  They MUST form a
122	   regular, valid, canonical language tag, and MUST neither include
123	   extensions nor private use sequences introduced by the singleton 'x'.
124	   Where only the script is relevant (such as identifying a script-
125	   script transliteration) then 'und' is used for the primary language
126	   subtag.

128	   For example:

130	   +---------------------+---------------------------------------------+
131	   | Language Tag        | Description                                 |
132	   +---------------------+---------------------------------------------+
133	   | ja-t-it             | The content is Japanese, transformed from   |
134	   |                     | Italian.                                    |
135	   | ja-Kana-t-it        | The content is Japanese Katakana,           |
136	   |                     | transformed from Italian.                   |
137	   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
138	   |                     | transformed from the Cyrillic script.       |
139	   +---------------------+---------------------------------------------+

141	   Note that the sequence of subtags governed by 't' cannot contain a
142	   singleton (a single-character subtag), because that would start a new
143	   extension.  For example, the tag "ja-t-i-ami" does not indicate that
144	   the source is in "i-ami", because "i-ami" is not a regular language
145	   tag in [BCP47].  That tag would express an empty 't' extension
146	   followed by an 'i' extension.
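
The singleton rule described above can be illustrated with a small sketch. It is not part of this specification: the function name and return shape are assumptions made for illustration, and the code only splits on single-character subtags without validating anything against [BCP47].

```python
def split_extensions(tag: str) -> tuple[list[str], dict[str, list[str]]]:
    """Split a tag into its leading subtags and its singleton-introduced sections.

    Hypothetical helper: it performs no validation against the subtag registry.
    """
    subtags = tag.lower().split("-")
    base: list[str] = []
    sections: dict[str, list[str]] = {}
    current = base
    in_private_use = False
    for position, subtag in enumerate(subtags):
        # A single-character subtag (not in first position, not after 'x')
        # closes the current section and starts a new one.
        if len(subtag) == 1 and position > 0 and not in_private_use:
            current = sections.setdefault(subtag, [])
            in_private_use = subtag == "x"
        else:
            current.append(subtag)
    return base, sections


print(split_extensions("ja-t-i-ami"))    # (['ja'], {'t': [], 'i': ['ami']})
print(split_extensions("ja-Kana-t-it"))  # (['ja', 'kana'], {'t': ['it']})
```

For "ja-t-i-ami" the 't' section comes back empty, which matches the reading given above: an empty 't' extension followed by an 'i' extension.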

148	   In addition, it is sometimes necessary to indicate additional
149	   information about the transformation.  This additional information is
150	   optionally supplied after the source in a series of one or more
151	   fields, where each field consists of a field separator subtag
152	   followed by one or more non-separator subtags.  Each field separator
153	   subtag consists of a single letter followed by a single digit.

155	   A transformation mechanism is an optional field that indicates the
156	   specification used for the transformation, such as "UNGEGN" for the
157	   United Nations Group of Experts on Geographical Names
158	   transliterations and transcriptions.  It uses the 'm0' field
159	   separator followed by certain subtags.

161	   For example:

163	   +------------------------------------+------------------------------+
164	   | Language Tag                       | Description                  |
165	   +------------------------------------+------------------------------+
166	   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
167	   |                                    | transformed from Latin,      |
168	   |                                    | according to a UNGEGN        |
169	   |                                    | specification dated 2007.    |
170	   +------------------------------------+------------------------------+

172	   The field separator subtags such as 'm0' were chosen because they are
173	   short, visually distinctive, and cannot occur in a language tag
174	   (except within an extension or after 'x'), thus eliminating the
175	   potential for collision or confusion with the source language tag.
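
Because a field separator has a shape (one letter, one digit) that no source language subtag can take, a reader can separate the source from the fields with a single pass. The following is a minimal sketch under that assumption; the function name and the returned structure are hypothetical, and no registry validation is attempted.

```python
import re

SEPARATOR = re.compile(r"[a-z][0-9]")  # one letter followed by one digit


def parse_t_extension(tag: str) -> tuple[str, dict[str, list[str]]]:
    """Return (source language tag, {separator: subtags}) for the 't' extension.

    Hypothetical helper. Returns ("", {}) when the tag has no 't' extension
    or when the 't' extension is empty.
    """
    subtags = tag.lower().split("-")
    start = None
    for index, subtag in enumerate(subtags[1:], 1):
        if subtag == "x":          # everything after 'x' is private use
            break
        if subtag == "t":
            start = index + 1
            break
    if start is None:
        return "", {}
    source: list[str] = []
    fields: dict[str, list[str]] = {}
    current = source
    for subtag in subtags[start:]:
        if len(subtag) == 1:       # another singleton ends the 't' extension
            break
        if SEPARATOR.fullmatch(subtag):
            current = fields.setdefault(subtag, [])
        else:
            current.append(subtag)
    return "-".join(source), fields


print(parse_t_extension("und-Cyrl-t-und-latn-m0-ungegn-2007"))
# ('und-latn', {'m0': ['ungegn', '2007']})
```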

177	   The field subtags are defined by Section 3 [1] of Unicode Technical
178	   Standard #35: Unicode Locale Data Markup Language [UTS35].  As
179	   required by BCP 47, subtags follow the language tag ABNF and other
180	   rules for the formation of language tags and subtags, are restricted
181	   to the ASCII letters and digits, are not case sensitive, and do not
182	   exceed eight characters in length.

184	   EDITORIAL NOTE: This new facility has been accepted by the Unicode
185	   CLDR committee for incorporation into the next version of Unicode
186	   CLDR, parallel with the structure of the 'u' extension [RFC6067], for
187	   which it is already the maintaining authority.  The data and
188	   specification will be available by the time this internet draft has
189	   been approved.

191	   LDML is available over the Internet and at no cost, and is available
192	   via a royalty-free license at http://unicode.org/copyright.html.
193	   LDML is versioned, and each version of LDML is numbered, dated, and
194	   stable.  Extension subtags, once defined by LDML, are never retracted
195	   or change in meaning in a substantial way.

197	   The maintaining authority for the 't' extension is the Unicode
198	   Consortium:

200	   +---------------+---------------------------------------------------+
201	   | Item          | Value                                             |
202	   +---------------+---------------------------------------------------+
203	   | Name          | Unicode Consortium                                |
204	   | Contact Email | cldr-contact@unicode.org                          |
205	   | Discussion    | cldr-users@unicode.org                            |
206	   | List Email    |                                                   |
207	   | URL Location  | cldr.unicode.org                                  |
208	   | Specification | Unicode Technical Standard #35 Unicode Locale     |
209	   |               | Data Markup Language (LDML),                      |
210	   |               | http://unicode.org/reports/tr35/                  |
211	   | Section       | Section 3 Unicode Language and Locale Identifiers |
212	   +---------------+---------------------------------------------------+

214	2.2.  Structure

216	   The subtags in the 't' extension are of the following form:

218	     +--------+-------------------------+----------------------------+
219	     | Label  | ABNF                    | Comment                    |
220	     +--------+-------------------------+----------------------------+
221	     | t_ext= | "t"                     | Extension                  |
222	     |        | ("-" lang *("-" field)  | Source + optional field(s) |
223	     |        | / 1*("-" field))        | Field(s) only (no source)  |
224	     | lang=  | language                | [BCP47], with restrictions |
225	     |        | ["-" script]            |                            |
226	     |        | ["-" region]            |                            |
227	     |        | *("-" variant)          |                            |
228	     | field= | sep 1*("-" 3*8alphanum) | With restrictions          |
229	     | sep=   | 1ALPHA 1DIGIT           | Subtag separators          |
230	     +--------+-------------------------+----------------------------+

232	   Description and restrictions:

234	   a.  The 't' extension MUST have at least one subtag.

236	   b.  The 't' extension normally starts with a source language tag,
237	       which MUST be a regular, canonical language tag as specified by
238	       [BCP47].  Tags described by the 'irregular' production in BCP 47
239	       MUST NOT be used to form the language tag.  The source language
240	       tag MAY be omitted: some field values do not require it.

242	   c.  There is optionally a sequence of fields, where each field has a
243	       separator followed by a sequence of one or more subtags.  Two
244	       identical field separators MUST NOT be present in the language
245	       tag.

247	   d.  The order of the subtags in a 't' extension is significant (see
248	       Section 2.3 Canonicalization).

250	   e.  The 't' subtag fields are defined by Section 3 [1] of Unicode
251	       Technical Standard #35: Unicode Locale Data Markup Language
252	       [UTS35].
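
As an illustration only, the ABNF table and restrictions (a) and (c) above can be approximated with a regular expression plus a duplicate-separator check. The names below are assumptions, the "lang" pattern is a simplified rendering of the [BCP47] productions, and the sketch checks shape only: it does not confirm that the source tag is valid or canonical, nor that the field subtags are defined in [UTS35].

```python
import re

ALNUM = "[a-z0-9]"
LANG = (
    r"(?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})"   # language, optional extlang
    r"(?:-[a-z]{4})?"                                 # script
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"                    # region
    rf"(?:-(?:{ALNUM}{{5,8}}|[0-9]{ALNUM}{{3}}))*"    # variants
)
FIELD = rf"[a-z][0-9](?:-{ALNUM}{{3,8}})+"            # sep 1*("-" 3*8alphanum)
T_EXT = re.compile(rf"t(?:-{LANG}(?:-{FIELD})*|(?:-{FIELD})+)")
SEP = re.compile(r"[a-z][0-9]")


def is_well_formed_t_extension(ext: str) -> bool:
    """Check the shape of a 't' extension such as "t-it-m0-ungegn"."""
    ext = ext.lower()
    if not T_EXT.fullmatch(ext):
        return False
    # Restriction (c): identical field separators must not repeat.
    separators = [s for s in ext.split("-") if SEP.fullmatch(s)]
    return len(separators) == len(set(separators))


print(is_well_formed_t_extension("t-und-latn-m0-ungegn-2007"))  # True
print(is_well_formed_t_extension("t"))                          # False
```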

254	2.3.  Canonicalization

256	   As required by [BCP47], the use of uppercase or lowercase letters is
257	   not significant in the subtags used in this extension.  The canonical
258	   form for all subtags in the extension is lowercase, with the fields
259	   ordered by the separators, alphabetically.
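
The canonical form described above might be produced along the following lines. This is a sketch under stated assumptions: the function name is hypothetical, the separator 'a0' in the usage example is invented purely to show the ordering (only 'm0' is described in this document), and canonicalizing the source language tag itself per [BCP47] is out of scope here.

```python
import re

SEPARATOR = re.compile(r"[a-z][0-9]")  # one letter followed by one digit


def canonicalize_t_extension(ext: str) -> str:
    """Lowercase a 't' extension and order its fields by separator."""
    head: list[str] = []          # 't' plus the source language subtags
    fields: list[list[str]] = []  # each entry is [separator, subtag, ...]
    for subtag in ext.lower().split("-"):
        if SEPARATOR.fullmatch(subtag):
            fields.append([subtag])
        elif fields:
            fields[-1].append(subtag)
        else:
            head.append(subtag)
    # Fields are ordered alphabetically by separator; the subtags inside
    # each field keep their original order.
    fields.sort(key=lambda field: field[0])
    return "-".join(head + [subtag for field in fields for subtag in field])


print(canonicalize_t_extension("T-IT-M0-UNGEGN-2007"))
# t-it-m0-ungegn-2007
print(canonicalize_t_extension("t-und-Latn-m0-ungegn-a0-abc"))  # 'a0' is hypothetical
# t-und-latn-a0-abc-m0-ungegn
```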

261	2.4.  BCP47 Registration Form

263	   Per RFC 5646, Section 3.7 [BCP47]:

265	   %%
266	   Identifier: t
267	   Description: Transform Specification
268	   Comments: Subtags for the identification of text transforms,
269	   including transliteration, transcription, and translation.
270	   Added: 2010-mm-dd
271	   RFC: [TBD]
272	   Authority: Unicode Consortium
273	   Contact_Email: cldr-contact@unicode.org
274	   Mailing_List: cldr-users@unicode.org
275	   URL: http://www.unicode.org/Public/cldr/latest/core.zip
276	   %%

278	2.5.  Field Definitions

280	   The structure of 't' field subtags is determined by the Unicode CLDR
281	   Technical Committee, in accordance with the policies and procedures
282	   in http://www.unicode.org/consortium/tc-procedures.html, and subject
283	   to the Unicode Consortium Policies on
284	   http://www.unicode.org/policies/policies.html.

286	   Changes that can be made by successive versions of LDML [UTS35] by
287	   the Unicode Consortium without requiring a new RFC include:

289	   o  The allocation of new field separator subtags for use after the
290	      't' extension.

292	   o  The allocation of subtags valid after a field separator subtag.

294	   A new RFC would be required for material changes to an existing 't'
295	   subtag, or an incompatible change to the overall syntactic structure
296	   of the 't' extension; however, such a change would be contrary to the
297	   policies of the Unicode Consortium, and thus is not anticipated.

299	   One field is initially specified in [UTS35]: the transform mechanism.
300	   That field is summarized here:

302	   a.  The transform mechanism consists of a sequence of subtags
303	       starting with the 'm0' separator followed by one or more
304	       mechanism subtags.  Each mechanism subtag has a length of 3 to 8
305	       alphanumeric characters.  The sequence as a whole provides an
306	       identification of the specification for the transform, such as
307	       the mechanism subtag 'UNGEGN' in "und-Cyrl-t-und-latn-m0-ungegn".
308	       In many cases, only one mechanism subtag is necessary, but
309	       multiple subtags MAY be defined in [UTS35] where necessary.

311	   b.  Any purely numeric subtag is a representation of a date in the
312	       Gregorian calendar.  It MAY occur in any mechanism field.  If it
313	       does occur:

315	       *  it MUST occur as the final subtag in the field,

317	       *  it MUST NOT be the only subtag in the field, and

319	       *  it MUST consist of a sequence of digits of the form YYYY,
320	          YYYYMM, or YYYYMMDD.

322	       For example, 20110623 represents June 23rd, 2011.  A date subtag
323	       SHOULD only be used where necessary, and then SHOULD be as short
324	       as possible.  For example, suppose that the BGN transliteration
325	       specification for Cyrillic to Latin had three versions, dated
326	       June 11th, 1999; Dec 30th, 1999; and May 1st, 2011.  In that
327	       case, the corresponding first two DATE subtags would require
328	       months to be distinctive (199906 and 199912), but the last subtag
329	       would only require the year (2011).

331	   c.  Some mechanisms may use a versioning system that is not
332	       distinguished by date, or not by date alone.  In the latter case,
333	       the version will be of a form specified by [UTS35] for that
334	       mechanism.  For example, if the mechanism XXX uses versions of
335	       the form v21a, then a tag could look like "ja-t-it-m0-xxx-v21a".
336	       If there are multiple subversions distinguished by date, then a
337	       tag could look like "ja-t-it-m0-xxx-v21a-2007".
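
The rules for date subtags in items (b) and (c) above can be sketched as follows. All names are hypothetical. Rejecting impossible months and days is an added assumption of this sketch, and the "as short as possible" computation is one possible reading of the BGN example, not a normative algorithm.

```python
from datetime import date


def is_date_subtag(subtag: str) -> bool:
    """True if subtag has the form YYYY, YYYYMM, or YYYYMMDD and names a real date."""
    if not (subtag.isascii() and subtag.isdigit()) or len(subtag) not in (4, 6, 8):
        return False
    try:
        date(int(subtag[:4]), int(subtag[4:6] or "1"), int(subtag[6:8] or "1"))
    except ValueError:
        return False
    return True


def check_mechanism_subtags(subtags: list[str]) -> bool:
    """Apply the positional rules for purely numeric subtags in one field."""
    for index, subtag in enumerate(subtags):
        if subtag.isdigit():
            is_last = index == len(subtags) - 1
            if not is_last or len(subtags) == 1 or not is_date_subtag(subtag):
                return False
    return bool(subtags)


def shortest_date_subtags(dates: list[date]) -> list[str]:
    """For each version date, pick the shortest form that no other date shares."""
    full = [f"{d.year:04d}{d.month:02d}{d.day:02d}" for d in dates]
    result = []
    for text in full:
        chosen = text
        for length in (4, 6, 8):
            if sum(other[:length] == text[:length] for other in full) == 1:
                chosen = text[:length]
                break
        result.append(chosen)
    return result


versions = [date(1999, 6, 11), date(1999, 12, 30), date(2011, 5, 1)]
print(shortest_date_subtags(versions))  # ['199906', '199912', '2011']
```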

339	2.6.  Registration of Field Subtags

341	   Registration of transform mechanisms is requested by filing a ticket
342	   at cldr.unicode.org [2].  The proposal in the ticket MUST contain the
343	   following information:

345	   +-------------+-----------------------------------------------------+
346	   | Item        | Description                                         |
347	   +-------------+-----------------------------------------------------+
348	   | Subtag      | The proposed mechanism subtag (or subtag sequence). |
349	   | Description | A description of the proposed mechanism; that       |
350	   |             | description MUST be sufficient to distinguish it    |
351	   |             | from other mechanisms in use.                       |
352	   | Version     | If versioning for the mechanism is not done         |
353	   |             | according to date, then a description of the        |
354	   |             | versioning conventions used for the mechanism.      |
355	   +-------------+-----------------------------------------------------+

357	   The committee MAY request more information be supplied in tickets in
358	   the future if such information is found to be useful.

360	   The committee MUST respond to each proposal within 2 weeks.

362	   The response MAY:

364	   o  request more information or clarification

366	   o  accept the proposal, optionally with modifications to the subtag
367	      or description

369	   o  reject the proposal, because of significant objections raised on
370	      the mailing list or due to problems with constraints in this
371	      document or in [UTS35]

373	   Accepted tickets result in a new entry in the machine-readable CLDR
374	   BCP47 data.

376	2.7.  Machine-Readable Data

378	   EDITORIAL NOTE: The following parallels the structure used for the
379	   'u' extension [RFC6067], for which the Unicode Consortium is the
380	   maintaining authority.  The data and specification will be available
381	   by the time this internet draft has been approved.

383	   Beginning with CLDR version 1.7.2, machine-readable files are
384	   available listing the data defined for BCP47 extensions for each
385	   successive version of [UTS35].  These releases are listed on
386	   http://cldr.unicode.org/index/downloads.  Each release has an
387	   associated data directory of the form
388	   "http://unicode.org/Public/cldr/<version>", where "<version>" is
389	   replaced by the release number.  For example, for version 1.7.2, the
390	   "core.zip" file is located at
391	   http://unicode.org/Public/cldr/1.7.2/core.zip [3].  Inside the
392	   "core.zip" file, the path "common/bcp47" contains the data files
393	   that define the BCP47 extension subtags.  The most recent
394	   version is always identified by the version "latest" and can be
395	   accessed by the URL in Section 2.4.
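
The directory layout described above can be expressed as a small helper. This is a sketch only: the function and constant names are hypothetical, the file names in the usage example are invented, and nothing here checks that a given release actually exists at the resulting URL.

```python
CLDR_BASE = "http://unicode.org/Public/cldr"   # data directory prefix given above
BCP47_PREFIX = "common/bcp47/"                 # location of the data inside core.zip


def core_zip_url(version: str = "latest") -> str:
    """Build the core.zip URL for a CLDR release number, or for "latest"."""
    return f"{CLDR_BASE}/{version}/core.zip"


def bcp47_members(names: list[str]) -> list[str]:
    """Keep only the archive member names that sit under common/bcp47/."""
    return [name for name in names if name.startswith(BCP47_PREFIX) and name != BCP47_PREFIX]


print(core_zip_url("1.7.2"))
# http://unicode.org/Public/cldr/1.7.2/core.zip

# The member names below are made up for illustration.
print(bcp47_members(["common/bcp47/example.xml", "common/dtd/example.dtd"]))
# ['common/bcp47/example.xml']
```

In practice the list of names could come from `zipfile.ZipFile(path).namelist()` once the archive has been downloaded.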

397	   To get the version information in XML when working with the data
398	   files, the XML parser must be validating.  When the 'core.zip' file
399	   is unzipped, the 'dtd' directory will be at the same level as the
400	   'bcp47' directory; that is required for correct validation.  For each
401	   release after CLDR 1.8, types introduced in that release are also
402	   marked in the data files by the XML attribute "since", such as in the
403	   following example:
404	   <type name="adp" since="1.9"/>
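
As a hedged illustration, the "since" attribute can be read with Python's standard ElementTree parser. ElementTree is not a validating parser, so it sees only attributes written explicitly in the file, such as "since", and will not supply values that come from the DTD. The wrapper elements, the key name, and the second type in the sample below are assumptions made for the example; only the "adp" line comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the "adp" element is taken from this document.
SAMPLE = """<ldmlBCP47>
  <keyword>
    <key name="cu">
      <type name="adp" since="1.9"/>
      <type name="aed"/>
    </key>
  </keyword>
</ldmlBCP47>"""


def types_by_release(xml_text: str) -> dict[str, list[str]]:
    """Group type names by the release named in their "since" attribute."""
    root = ET.fromstring(xml_text)
    result: dict[str, list[str]] = {}
    for element in root.iter("type"):
        since = element.get("since")
        if since is not None:  # types without "since" are skipped
            result.setdefault(since, []).append(element.get("name", ""))
    return result


print(types_by_release(SAMPLE))  # {'1.9': ['adp']}
```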

406	   The data is also currently maintained in a source code repository,
407	   with each release tagged, for viewing directly without unzipping.
408	   For example, see:

410	   o  http://unicode.org/repos/cldr/tags/release-1-7-2/common/bcp47/

412	   o  http://unicode.org/repos/cldr/tags/release-1-8/common/bcp47/

414	3.  Acknowledgements

416	   Thanks to John Emmons and the rest of the Unicode CLDR Technical
417	   Committee for their work in developing the BCP 47 subtags for LDML.

419	4.  IANA Considerations

421	   This document will require IANA to insert the record in Section 2.4
422	   into the Language Extensions Registry, according to Section 3.7,
423	   "Extensions and the Extensions Registry", of "Tags for Identifying
424	   Languages" in [BCP47].  Per Section 5.2 of [BCP47], there might be
425	   occasional (rare) requests by the Unicode Consortium (the "Authority"
426	   listed in the record) for maintenance of this record.  Changes that
427	   can be submitted to IANA without the publication of a new RFC are
428	   limited to modification of the Comments, Contact_Email, Mailing_List,
429	   and URL fields.  Any such requested changes MUST use the domain
430	   'unicode.org' in any new addresses or URIs, MUST explicitly cite this
431	   document (so that IANA can reference these requirements), and MUST
432	   originate from the 'unicode.org' domain.  The domain or authority can
433	   only be changed via a new RFC.

435	   This document does not require IANA to create or maintain a new
436	   registry or otherwise impact IANA.

438	5.  Security Considerations

440	   The security considerations for this extension are the same as those
441	   for [BCP47].  See RFC 5646, Section 6, Security Considerations
442	   [BCP47].

444	6.  References

446	6.1.  Normative References

448	   [BCP47]    Davis, M., Ed., "Tags for the Identification of Language
449	              (BCP47)", September 2009.

451	   [RFC6067]  Davis, M., Ed., "BCP 47 Extension U", September 2010.

453	   [US-ASCII]
454	              International Organization for Standardization, "ISO/IEC
455	              646:1991, Information technology -- ISO 7-bit coded
456	              character set for information interchange.", 1991.

458	   [UTS35]    Davis, M., "Unicode Technical Standard #35: Locale Data
459	              Markup Language (LDML)", December 2007,
460	              <http://www.unicode.org/reports/tr35/>.

462	              Section 3: http://unicode.org/reports/
463	              tr35/#Unicode_Language_and_Locale_Identifiers

465	              Appendix Q: http://unicode.org/reports/
466	              tr35/#Locale_Extension_Key_and_Type_Data

468	6.2.  Informative References

470	   [ldml-registry]
471	              "Registry for Common Locale Data Repository tag elements",
472	              September 2009.

474	URIs

476	   [1]  <http://unicode.org/reports/tr35/>

478	   [2]  <http://cldr.unicode.org/>

480	   [3]  <http://unicode.org/Public/cldr/1.7.2/>

482	Authors' Addresses

484	   Mark Davis
485	   Google

487	   Email: mark@macchiato.com
488	   Addison Phillips
489	   Lab126

491	   Email: addison@lab126.com

493	   Yoshito Umaoka
494	   IBM

496	   Email: yoshito_umaoka@us.ibm.com

498	   Courtney Falk
499	   Infinite Automata

501	   Email: court@infiauto.com