Internet Engineering Task Force                                 M. Davis
Internet-Draft                                                    Google
Intended status: Informational                               A. Phillips
Expires: December 24, 2011                                        Lab126
                                                               Y. Umaoka
                                                                     IBM
                                                                 C. Falk
                                                       Infinite Automata
                                                           June 22, 2011


                           BCP 47 Extension T
                      draft-davis-t-langtag-ext-01

Abstract

   This document specifies an Extension to BCP 47 which provides subtags
   for specifying the source language or script of transformed text,
   including text that has been transliterated, transcribed, or
   translated.
   It also provides for additional information used for
   identification.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 24, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  BCP47 Required Information
     2.1.  Introduction
     2.2.  Structure
     2.3.  Canonicalization
     2.4.  BCP47 Registration Form
     2.5.  Field Definitions
     2.6.  Registration of Field Subtags
     2.7.  Machine-Readable Data
   3.  Acknowledgements
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   [BCP47] permits the definition and registration of language tag
   extensions "that contain a language component and are compatible with
   applications that understand language tags". This document defines
   an extension for specifying the source of a text transformation,
   including text that has been transliterated, transcribed, or
   translated. The "singleton" identifier for this extension is 't'.

   Language tags, as defined by [BCP47], are useful for identifying the
   language of content. There are mechanisms for specifying variant
   subtags for special purposes. However, these variants are
   insufficient for specifying text transformations, including text that
   has been transliterated, transcribed, or translated. That is, for
   fully specifying such text, it is important to specify the source
   language and/or script. In addition, it may also be important to
   identify a particular specification for the transformation.

   For example, if one is transcribing the names of Italian or Russian
   cities on a map for Japanese users, each name will need to be
   transliterated into katakana using rules appropriate for the specific
   source and target language. When tagging such data, it is important
   to be able to indicate not only the resulting content language ("ja"
   in this case), but also the source language.

   Transforms such as transliteration may vary depending not only on the
   source and target scripts, but also on the language.
   Thus the
   Russian <U+041F U+0443 U+0442 U+0438 U+043D> (which corresponds to
   the Cyrillic <PE, U, TE, I, EN>) transliterates into "Putin" in
   English but "Poutine" in French. The identifier may need to indicate
   a desired mechanical transformation in an API, or may need to tag
   data that has been converted (mechanically or by hand) according to a
   transliteration method.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.  BCP47 Required Information

2.1.  Introduction

   Identification of transforms can be done using the 't' extension
   defined in this document. This extension is formed by the 't'
   singleton followed by a sequence of subtags that would form a
   language tag defined by [BCP47]. This allows for the source language
   or script to be specified to the degree of precision required. There
   are restrictions on the sequence of subtags. They MUST form a
   regular, valid, canonical language tag, and MUST NOT include either
   extensions or private use sequences introduced by the singleton 'x'.
   Where only the script is relevant (such as identifying a
   script-script transliteration) then 'und' is used for the primary
   language subtag.

   For example:

   +---------------------+---------------------------------------------+
   | Language Tag        | Description                                 |
   +---------------------+---------------------------------------------+
   | ja-t-it             | The content is Japanese, transformed from   |
   |                     | Italian.                                    |
   | ja-Kana-t-it        | The content is Japanese Katakana,           |
   |                     | transformed from Italian.                   |
   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
   |                     | transformed from the Cyrillic script.       |
   +---------------------+---------------------------------------------+

   Note that the sequence of subtags governed by 't' cannot contain a
   singleton (a single-character subtag), because that would start a new
   extension. For example, the tag "ja-t-i-ami" does not indicate that
   the source is in "i-ami", because "i-ami" is not a regular language
   tag in [BCP47]. That tag would express an empty 't' extension
   followed by an 'i' extension.

   In addition, it is sometimes necessary to indicate additional
   information about the transformation. This additional information is
   optionally supplied after the source in a series of one or more
   fields, where each field consists of a field separator subtag
   followed by one or more non-separator subtags. Each field separator
   subtag consists of a single letter followed by a single digit.

   A transformation mechanism is an optional field that indicates the
   specification used for the transformation, such as "UNGEGN" for the
   United Nations Group of Experts on Geographical Names
   transliterations and transcriptions. It uses the 'm0' field
   separator followed by certain subtags.

   For example:

   +------------------------------------+------------------------------+
   | Language Tag                       | Description                  |
   +------------------------------------+------------------------------+
   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
   |                                    | transformed from Latin,      |
   |                                    | according to a UNGEGN        |
   |                                    | specification dated 2007.    |
   +------------------------------------+------------------------------+

   The field separator subtags such as 'm0' were chosen because they are
   short, visually distinctive, and cannot occur in a language subtag
   (outside of an extension and after 'x'), thus eliminating the
   potential for collision or confusion with the source language tag.
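
   The layout described above (the 't' singleton, an optional source
   language tag, then fields introduced by letter-plus-digit separators)
   can be illustrated with a minimal Python sketch. The function name
   split_t_extension is hypothetical and not part of any specification;
   the sketch only splits a tag into its parts and makes no attempt to
   validate the tag against [BCP47] or [UTS35].

```python
import re

# A field separator is one ASCII letter followed by one digit, e.g. "m0".
FIELD_SEP = re.compile(r"[a-z][0-9]")


def split_t_extension(tag):
    """Split out the 't' extension of a language tag (illustrative only).

    Returns None when the tag has no 't' extension; otherwise returns
    (source, fields), where source is the lowercased source language tag
    ("" if omitted) and fields maps each separator to its list of subtags.
    No BCP 47 validation is performed.
    """
    subtags = tag.lower().split("-")
    start = None
    for i, subtag in enumerate(subtags):
        if subtag == "x":            # private use: nothing after it counts
            break
        if subtag == "t" and i > 0:  # the 't' singleton opens the extension
            start = i + 1
            break
    if start is None:
        return None

    source, fields, current = [], {}, None
    for subtag in subtags[start:]:
        if len(subtag) == 1:         # any singleton starts a new extension
            break
        if FIELD_SEP.fullmatch(subtag):
            current = fields.setdefault(subtag, [])
        elif current is None:
            source.append(subtag)    # still inside the source language tag
        else:
            current.append(subtag)   # belongs to the most recent field
    return "-".join(source), fields


print(split_t_extension("und-Cyrl-t-und-latn-m0-ungegn-2007"))
# ('und-latn', {'m0': ['ungegn', '2007']})
print(split_t_extension("ja-t-i-ami"))
# ('', {})
```

   The second call mirrors the "ja-t-i-ami" case: the 'i' singleton ends
   the 't' extension immediately, so the source comes back empty.
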

   The field subtags are defined by Section 3 [1] of Unicode Technical
   Standard #35: Unicode Locale Data Markup Language [UTS35]. As
   required by BCP 47, subtags follow the language tag ABNF and other
   rules for the formation of language tags and subtags, are restricted
   to the ASCII letters and digits, are not case sensitive, and do not
   exceed eight characters in length.

   EDITORIAL NOTE: This new facility has been accepted by the Unicode
   CLDR committee for incorporation into the next version of Unicode
   CLDR, parallel with the structure of the 'u' extension [RFC6067], for
   which the Unicode Consortium is already the maintaining authority.
   The data and specification will be available by the time this
   Internet-Draft has been approved.

   LDML is available over the Internet and at no cost, and is available
   via a royalty-free license at http://unicode.org/copyright.html.
   LDML is versioned, and each version of LDML is numbered, dated, and
   stable. Extension subtags, once defined by LDML, are never retracted
   and never change in meaning in a substantial way.

   The maintaining authority for the 't' extension is the Unicode
   Consortium:

   +---------------+---------------------------------------------------+
   | Item          | Value                                             |
   +---------------+---------------------------------------------------+
   | Name          | Unicode Consortium                                |
   | Contact Email | cldr-contact@unicode.org                          |
   | Discussion    | cldr-users@unicode.org                            |
   | List Email    |                                                   |
   | URL Location  | cldr.unicode.org                                  |
   | Specification | Unicode Technical Standard #35 Unicode Locale     |
   |               | Data Markup Language (LDML),                      |
   |               | http://unicode.org/reports/tr35/                  |
   | Section       | Section 3 Unicode Language and Locale Identifiers |
   +---------------+---------------------------------------------------+

2.2.  Structure

   The subtags in the 't' extension are of the following form:

   +--------+-------------------------+----------------------------+
   | Label  | ABNF                    | Comment                    |
   +--------+-------------------------+----------------------------+
   | t_ext= | "t"                     | Extension                  |
   |        | ("-" lang *("-" field)  | Source + optional field(s) |
   |        | / 1*("-" field))        | Field(s) only (no source)  |
   | lang=  | language                | [BCP47], with restrictions |
   |        | ["-" script]            |                            |
   |        | ["-" region]            |                            |
   |        | *("-" variant)          |                            |
   | field= | sep 1*("-" 3*8alphanum) | With restrictions          |
   | sep=   | 1ALPHA 1DIGIT           | Subtag separators          |
   +--------+-------------------------+----------------------------+

   Description and restrictions:

   a.  The 't' extension MUST have at least one subtag.

   b.  The 't' extension normally starts with a source language tag,
       which MUST be a regular, canonical language tag as specified by
       [BCP47]. Tags described by the 'irregular' production in BCP 47
       MUST NOT be used to form the language tag. The source language
       tag MAY be omitted: some field values do not require it.

   c.  There is optionally a sequence of fields, where each field has a
       separator followed by a sequence of one or more subtags. Two
       identical field separators MUST NOT be present in the language
       tag.

   d.  The order of the subtags in a 't' extension is significant (see
       Section 2.3 Canonicalization).

   e.  The 't' subtag fields are defined by Section 3 [1] of Unicode
       Technical Standard #35: Unicode Locale Data Markup Language
       [UTS35].

2.3.  Canonicalization

   As required by [BCP47], the use of uppercase or lowercase letters is
   not significant in the subtags used in this extension. The canonical
   form for all subtags in the extension is lowercase, with the fields
   ordered alphabetically by their separators.

2.4.  BCP47 Registration Form

   Per RFC 5646, Section 3.7 [BCP47]:

   %%
   Identifier: t
   Description: Transform Specification
   Comments: Subtags for the identification of text transforms,
    including transliteration, transcription, and translation.
   Added: 2010-mm-dd
   RFC: [TBD]
   Authority: Unicode Consortium
   Contact_Email: cldr-contact@unicode.org
   Mailing_List: cldr-users@unicode.org
   URL: http://www.unicode.org/Public/cldr/latest/core.zip
   %%

2.5.  Field Definitions

   The structure of 't' field subtags is determined by the Unicode CLDR
   Technical Committee, in accordance with the policies and procedures
   in http://www.unicode.org/consortium/tc-procedures.html, and subject
   to the Unicode Consortium Policies at
   http://www.unicode.org/policies/policies.html.

   Changes that can be made by successive versions of LDML [UTS35] by
   the Unicode Consortium without requiring a new RFC include:

   o  The allocation of new field separator subtags for use after the
      't' extension.

   o  The allocation of subtags valid after a field separator subtag.

   A new RFC would be required for material changes to an existing 't'
   subtag, or an incompatible change to the overall syntactic structure
   of the 't' extension; however, such a change would be contrary to the
   policies of the Unicode Consortium, and thus is not anticipated.

   One field is initially specified in [UTS35]: the transform mechanism.
   That field is summarized here:

   a.  The transform mechanism consists of a sequence of subtags
       starting with the 'm0' separator followed by one or more
       mechanism subtags. Each mechanism subtag has a length of 3 to 8
       alphanumeric characters. The sequence as a whole provides an
       identification of the specification for the transform, such as
       the mechanism subtag 'UNGEGN' in "und-Cyrl-t-und-latn-m0-ungegn".
       In many cases, only one mechanism subtag is necessary, but
       multiple subtags MAY be defined in [UTS35] where necessary.

   b.  Any purely numeric subtag is a representation of a date in the
       Gregorian calendar. It MAY occur in any mechanism field. If it
       does occur:

       *  it MUST occur as the final subtag in the field,

       *  it MUST NOT be the only subtag in the field, and

       *  it MUST consist of a sequence of digits of the form YYYY,
          YYYYMM, or YYYYMMDD.

       For example, 20110623 represents June 23rd, 2011. A date subtag
       SHOULD only be used where necessary, and then SHOULD be as short
       as possible. For example, suppose that the BGN transliteration
       specification for Cyrillic to Latin had three versions, dated
       June 11th, 1999; Dec 30th, 1999; and May 1st, 2011. In that
       case, the first two date subtags would require months to be
       distinctive (199906 and 199912), but the last subtag would only
       require the year (2011).

   c.  Some mechanisms may use a versioning system that is not
       distinguished by date, or not by date alone. In such cases, the
       version will be of a form specified by [UTS35] for that
       mechanism. For example, if the mechanism XXX uses versions of
       the form v21a, then a tag could look like "ja-t-it-m0-xxx-v21a".
       If there are multiple subversions distinguished by date, then a
       tag could look like "ja-t-it-m0-xxx-v21a-2007".

2.6.  Registration of Field Subtags

   Registration of transform mechanisms is requested by filing a ticket
   at cldr.unicode.org [2]. The proposal in the ticket MUST contain the
   following information:

   +-------------+-----------------------------------------------------+
   | Item        | Description                                         |
   +-------------+-----------------------------------------------------+
   | Subtag      | The proposed mechanism subtag (or subtag sequence). |
   | Description | A description of the proposed mechanism; that       |
   |             | description MUST be sufficient to distinguish it    |
   |             | from other mechanisms in use.                       |
   | Version     | If versioning for the mechanism is not done         |
   |             | according to date, then a description of the        |
   |             | versioning conventions used for the mechanism.      |
   +-------------+-----------------------------------------------------+

   The committee MAY request more information be supplied in tickets in
   the future if such information is found to be useful.

   The committee MUST respond to each proposal within 2 weeks.

   The response MAY:

   o  request more information or clarification

   o  accept the proposal, optionally with modifications to the subtag
      or description

   o  reject the proposal, because of significant objections raised on
      the mailing list or due to problems with constraints in this
      document or in [UTS35]

   Accepted tickets result in a new entry in the machine-readable CLDR
   BCP47 data.

2.7.  Machine-Readable Data

   EDITORIAL NOTE: The following parallels the structure used for the
   'u' extension [RFC6067], for which the Unicode Consortium is the
   maintaining authority. The data and specification will be available
   by the time this Internet-Draft has been approved.

   Beginning with CLDR version 1.7.2, machine-readable files are
   available listing the data defined for BCP47 extensions for each
   successive version of [UTS35]. These releases are listed on
   http://cldr.unicode.org/index/downloads. Each release has an
   associated data directory of the form
   "http://unicode.org/Public/cldr/<version>", where "<version>" is
   replaced by the release number. For example, for version 1.7.2, the
   "core.zip" file is located at
   http://unicode.org/Public/cldr/1.7.2/core.zip [3]. Inside the
   "core.zip" file, the path "common/bcp47" contains the data files
   for BCP47 extensions.
   The most recent
   version is always identified by the version "latest" and can be
   accessed by the URL in Section 2.4.

   To get the version information in XML when working with the data
   files, the XML parser must be validating. When the 'core.zip' file
   is unzipped, the 'dtd' directory will be at the same level as the
   'bcp47' directory; that is required for correct validation. For each
   release after CLDR 1.8, types introduced in that release are also
   marked in the data files by the XML attribute "since".

   The data is also currently maintained in a source code repository,
   with each release tagged, for viewing directly without unzipping.
   For example, see:

   o  http://unicode.org/repos/cldr/tags/release-1-7-2/common/bcp47/

   o  http://unicode.org/repos/cldr/tags/release-1-8/common/bcp47/

3.  Acknowledgements

   Thanks to John Emmons and the rest of the Unicode CLDR Technical
   Committee for their work in developing the BCP 47 subtags for LDML.

4.  IANA Considerations

   This document will require IANA to insert the record in Section 2.4
   into the Language Extensions Registry, according to Section 3.7,
   "Extensions and the Extensions Registry", of "Tags for Identifying
   Languages" in [BCP47]. Per Section 5.2 of [BCP47], there might be
   occasional (rare) requests by the Unicode Consortium (the "Authority"
   listed in the record) for maintenance of this record. Changes that
   can be submitted to IANA without the publication of a new RFC are
   limited to modification of the Comments, Contact_Email, Mailing_List,
   and URL fields. Any such requested changes MUST use the domain
   'unicode.org' in any new addresses or URIs, MUST explicitly cite this
   document (so that IANA can reference these requirements), and MUST
   originate from the 'unicode.org' domain. The domain or authority can
   only be changed via a new RFC.

   This document does not require IANA to create or maintain a new
   registry or otherwise impact IANA.

5.  Security Considerations

   The security considerations for this extension are the same as those
   for [BCP47]. See RFC 5646, Section 6, Security Considerations
   [BCP47].

6.  References

6.1.  Normative References

   [BCP47]    Davis, M., Ed., "Tags for the Identification of Language
              (BCP47)", September 2009.

   [RFC6067]  Davis, M., Ed., "BCP 47 Extension U", September 2010.

   [US-ASCII]
              International Organization for Standardization, "ISO/IEC
              646:1991, Information technology -- ISO 7-bit coded
              character set for information interchange.", 1991.

   [UTS35]    Davis, M., "Unicode Technical Standard #35: Locale Data
              Markup Language (LDML)", December 2007.

              Section 3: http://unicode.org/reports/
              tr35/#Unicode_Language_and_Locale_Identifiers

              Appendix Q: http://unicode.org/reports/
              tr35/#Locale_Extension_Key_and_Type_Data

6.2.  Informative References

   [ldml-registry]
              "Registry for Common Locale Data Repository tag elements",
              September 2009.

URIs

   [1]

   [2]

   [3]

Authors' Addresses

   Mark Davis
   Google

   Email: mark@macchiato.com


   Addison Phillips
   Lab126

   Email: addison@lab126.com


   Yoshito Umaoka
   IBM

   Email: yoshito_umaoka@us.ibm.com


   Courtney Falk
   Infinite Automata

   Email: court@infiauto.com