Internet Engineering Task Force                                 M. Davis
Internet-Draft                                                    Google
Intended status: Informational                               A. Phillips
Expires: December 18, 2011                                        Lab126
                                                               Y. Umaoka
                                                                     IBM
                                                           June 16, 2011


                           BCP 47 Extension T
                      draft-davis-t-langtag-ext-00

Abstract

   This document specifies an Extension to BCP 47 which provides subtags
   for specifying the source language or script of transformed text,
   including text that has been transliterated, transcribed, or
   translated.
   It also provides for additional information used for
   identification.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 18, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  BCP47 Required Information
     2.1.  Summary
       2.1.1.  Canonicalization
     2.2.  Registration Form
   3.  Acknowledgements
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   [BCP47] permits the definition and registration of language tag
   extensions "that contain a language component and are compatible with
   applications that understand language tags".  This document defines
   an extension for specifying the source of a text transformation,
   including text that has been transliterated, transcribed, or
   translated.  The "singleton" identifier for this extension is 't'.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.  BCP47 Required Information

   Language tags, as defined by [BCP47], are useful for identifying the
   language of content.  There are mechanisms for specifying variant
   subtags for special purposes.  However, these variants are
   insufficient for specifying text transformations, including text that
   has been transliterated, transcribed, or translated.  That is, for
   fully specifying such text, it is important to specify the source
   language and/or script.  In addition, it may also be important to
   identify a particular specification for the transformation.

   For example, if one is transcribing the names of Italian or Russian
   cities on a map for Japanese users, each name will need to be
   transliterated into katakana using rules appropriate for the source
   and target languages.  When tagging such data, it is important to be
   able to indicate not only the resulting content language ("ja" in
   this case), but also the source language.

   Transforms such as transliteration may vary depending not only on the
   source and target scripts, but also on the language.
   Thus the Russian name <U+041F U+0443 U+0442 U+0438 U+043D> (the
   Cyrillic spelling of "Putin") transliterates into "Putin" in English
   but "Poutine" in French.  The identifier may need to indicate a
   desired mechanical transformation in an API, or may need to tag data
   that has been converted (mechanically or by hand) according to a
   transliteration method.

   Such identification is accomplished by using the 't' extension
   defined in this document.  This extension is formed by the 't'
   singleton followed by a sequence of subtags that would form a
   language tag defined by [BCP47].  This allows for the source language
   or script to be specified to the degree of precision required.  There
   are restrictions on the sequence of subtags.  They MUST form a
   regular, valid, canonical language tag, and MUST NOT include either
   extensions or private use sequences introduced by the singleton 'x'.
   Where only the script is relevant (such as identifying a
   script-script transliteration) then 'und' is used for the primary
   language subtag.

   For example:

   +---------------------+---------------------------------------------+
   | Language Tag        | Description                                 |
   +---------------------+---------------------------------------------+
   | ja-t-it             | The content is Japanese, transformed from   |
   |                     | Italian.                                    |
   | ja-Kana-t-it        | The content is Japanese Katakana,           |
   |                     | transformed from Italian.                   |
   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
   |                     | transformed from the Cyrillic script.       |
   +---------------------+---------------------------------------------+

   Note that the sequence of subtags governed by 't' cannot contain a
   singleton (a single-character subtag), because that would start a new
   extension.  For example, the tag "ja-t-i-ami" does not indicate that
   the source is in "i-ami", because "i-ami" is not a regular language
   tag in [BCP47].  That tag would express an empty 't' extension
   followed by an 'i' extension.
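
   The singleton rule above can be illustrated with a short Python
   sketch.  It is not part of this specification: the function name
   t_extension_subtags is hypothetical, and the sketch assumes a
   syntactically well-formed tag and performs no validation of the
   source language tag itself.

```python
def t_extension_subtags(tag):
    """Return the subtags governed by the 't' singleton in a language tag.

    Returns a list (possibly empty), or None when the tag carries no 't'
    extension at all. Hypothetical helper, for illustration only.
    """
    subtags = tag.lower().split("-")  # subtags are not case sensitive
    # Position 0 is the primary language subtag, so scanning starts at 1.
    for i in range(1, len(subtags)):
        if subtags[i] == "x":
            return None  # everything after 'x' is private use
        if subtags[i] == "t":
            governed = []
            for subtag in subtags[i + 1:]:
                if len(subtag) == 1:
                    break  # any singleton starts a new extension
                governed.append(subtag)
            return governed
    return None


for example in ("ja-t-it", "ja-Kana-t-it", "und-Latn-t-und-cyrl", "ja-t-i-ami"):
    print(example, "->", t_extension_subtags(example))
```

   For "ja-t-i-ami" the sketch yields an empty list, which matches the
   reading given above: an empty 't' extension followed by an 'i'
   extension.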

   In addition, it is sometimes necessary to indicate further
   information, such as the mechanism used to do the transformation,
   optionally including the version of the mechanism.  The mechanism can
   be supplied by using the 'm0' separator.  The format of such a 't'
   extension is thus:

      "t-<source>-m0-<mechanism>".

   (The full format reserves some additional syntax for future
   expansion, as described below.)

   The transform <mechanism> is a series of subtags that indicate the
   specification used for the transformation, such as "UNGEGN" for the
   United Nations Group of Experts on Geographical Names
   transliterations and transcriptions.

   For example:

   +------------------------------------+------------------------------+
   | Language Tag                       | Description                  |
   +------------------------------------+------------------------------+
   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
   |                                    | transformed from Latin,      |
   |                                    | according to a UNGEGN        |
   |                                    | specification dated 2007.    |
   +------------------------------------+------------------------------+

   The separator subtags such as 'm0' were chosen because they are
   short, visually distinctive, and cannot occur in a language tag
   outside of an extension or a private use sequence introduced by 'x',
   thus eliminating the potential for collision or confusion with the
   source language tag.

   The subtags that are valid in the 't' extension are provided by
   Section 3 [1] of Unicode Technical Standard #35: Unicode Locale Data
   Markup Language [UTS35].  As required by BCP 47, subtags follow the
   language tag ABNF and other rules for the formation of language tags
   and subtags, are restricted to the ASCII letters and digits, are not
   case sensitive, and do not exceed eight characters in length.
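
   As a minimal sketch, the source-then-'m0'-then-mechanism layout
   described above could be split as follows in Python.  The function
   name parse_t_extension is hypothetical, only the 'm0' separator is
   handled, and the only checks applied are the BCP 47 constraints just
   mentioned (ASCII letters and digits, at most eight characters per
   subtag).  The source language tag is not validated.

```python
import re

# One subtag: ASCII letters and digits only, at most eight characters.
SUBTAG = re.compile(r"[a-z0-9]{1,8}")


def parse_t_extension(extension):
    """Split a 't' extension into (source, mechanism_subtags).

    Hypothetical helper, for illustration only. 'extension' is the part
    of a tag beginning at the 't' singleton, e.g. 't-und-latn-m0-ungegn'.
    """
    subtags = extension.lower().split("-")  # case is not significant
    if subtags[0] != "t":
        raise ValueError("extension does not start with the 't' singleton")
    for subtag in subtags:
        if not SUBTAG.fullmatch(subtag):
            raise ValueError("malformed subtag: %r" % subtag)
    body = subtags[1:]
    if "m0" in body:
        cut = body.index("m0")
        source, mechanism = body[:cut], body[cut + 1:]
    else:
        source, mechanism = body, []
    return "-".join(source), mechanism


print(parse_t_extension("t-und-latn-m0-ungegn-2007"))
```

   Splitting on the separator rather than on position keeps the sketch
   independent of how many subtags the source language tag contains.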

   EDITORIAL NOTE: This new facility has been accepted by the Unicode
   CLDR committee for incorporation into the next version of Unicode
   CLDR, parallel with the structure of the 'u' extension [RFC6067], for
   which the Unicode Consortium is already the maintaining authority.
   The data and specification will be available by the time this
   Internet-Draft has been approved.

   LDML is available over the Internet and at no cost, and is available
   via a royalty-free license at http://unicode.org/copyright.html.
   LDML is versioned, and each version of LDML is numbered, dated, and
   stable.  Extension subtags, once defined by LDML, are never retracted
   and never change in meaning in a substantial way.

   The structure of 't' subtags is determined by the Unicode CLDR
   Technical Committee, in accordance with the policies and procedures
   in http://www.unicode.org/consortium/tc-procedures.html, and subject
   to the Unicode Consortium Policies at
   http://www.unicode.org/policies/policies.html.

   Changes that can be made by successive versions of LDML [UTS35] by
   the Unicode Consortium without requiring a new RFC include the
   allocation of new subtags for use after the 't' extension.  A new RFC
   would be required for material changes to an existing 't' subtag, or
   an incompatible change to the overall syntactic structure of the 't'
   extension; however, such a change would be contrary to the policies
   of the Unicode Consortium, and thus is not anticipated.

   The maintaining authority for the 't' extension is the Unicode
   Consortium:

   +---------------+---------------------------------------------------+
   | Item          | Value                                             |
   +---------------+---------------------------------------------------+
   | Name          | Unicode Consortium                                |
   | Contact Email | cldr-contact@unicode.org                          |
   | Discussion    | cldr-users@unicode.org                            |
   | List Email    |                                                   |
   | URL Location  | cldr.unicode.org                                  |
   | Specification | Unicode Technical Standard #35 Unicode Locale     |
   |               | Data Markup Language (LDML),                      |
   |               | http://unicode.org/reports/tr35/                  |
   | Section       | Section 3 Unicode Language and Locale Identifiers |
   +---------------+---------------------------------------------------+

2.1.  Summary

   The following is a summary of the definition for the 't' subtags
   defined by Section 3 [1] of Unicode Technical Standard #35: Unicode
   Locale Data Markup Language [UTS35], which is relevant for this
   specification.

   The subtags in the 't' extension are of the following form:

   +--------+-------------------------+----------------------------+
   | Label  | ABNF                    | Comment                    |
   +--------+-------------------------+----------------------------+
   | t_ext= | "t-"                    | Extension                  |
   |        | [lang]                  | Source                     |
   |        | *("-" field)            | Optional information       |
   | lang=  | language                | [BCP47], with restrictions |
   |        | ["-" script]            |                            |
   |        | ["-" region]            |                            |
   |        | *("-" variant)          |                            |
   | field= | sep 1*("-" 3*8alphanum) | With restrictions          |
   | sep=   | 1ALPHA 1DIGIT           | Subtag separators          |
   +--------+-------------------------+----------------------------+

   Description and restrictions:

   a.  The 't' extension MUST have at least one subtag.

   b.  The 't' extension normally starts with a source language tag,
       which MUST be a regular, canonical language tag as specified by
       [BCP47].  Tags described by the 'irregular' production in BCP 47
       MUST NOT be used to form the language tag.
       The source language tag MAY be omitted: some field values do not
       require it.

   c.  There is optionally a sequence of fields, where each field is a
       separator followed by a sequence of subtags.  Two identical
       separators MUST NOT be present.

   d.  One field is initially specified in [UTS35]: the transform
       mechanism.

       A.  The transform mechanism consists of a sequence of subtags
           starting with the 'm0' separator followed by one or more
           mechanism subtags.  Each mechanism subtag has a length of 3
           to 8 alphanumeric characters.  The sequence as a whole
           provides an identification of the specification for the
           transform, such as the mechanism subtag 'UNGEGN' in
           "und-Cyrl-t-und-latn-m0-ungegn".  In many cases, only one
           mechanism subtag is necessary, but multiple subtags MAY be
           defined in [UTS35] where necessary.

       B.  Any purely numeric subtag is a representation of a date in
           the Gregorian calendar.  It MAY occur in any mechanism field.
           If it does occur:

           +  it MUST occur as the final subtag in the field,

           +  it MUST NOT be the only subtag in the field, and

           +  it MUST consist of a sequence of digits of the form YYYY,
              YYYYMM, or YYYYMMDD.

           For example, 20110623 represents June 23rd, 2011.  A date
           subtag SHOULD only be used where necessary, and then SHOULD
           be as short as possible.  For example, suppose that the BGN
           transliteration specification for Cyrillic to Latin had three
           versions, dated June 11th, 1999; December 30th, 1999; and
           May 1st, 2011.  In that case, the first two date subtags
           would require months to be distinctive (199906 and 199912),
           but the last subtag would only require the year (2011).

       C.  Some mechanisms may use a versioning system that is not
           distinguished by date, or not by date alone.  In such cases,
           the version will be of a form specified by [UTS35] for that
           mechanism.
           For example, if the mechanism XXX uses versions of the form
           v21a, then a tag could look like "ja-t-it-m0-xxx-v21a".  If
           there are multiple subversions distinguished by date, then a
           tag could look like "ja-t-it-m0-xxx-v21a-2007".

   e.  Successive versions of [UTS35] could define additional separator
       subtags, and additional subtags for those separators.  Once
       defined, those subtags will never be removed.

   f.  The order of the subtags is significant (see Section 2.1.1
       Canonicalization).

   EDITORIAL NOTE: The following parallels the structure used for the
   'u' extension [RFC6067], for which the Unicode Consortium is the
   maintaining authority.  The data and specification will be available
   by the time this Internet-Draft has been approved.

   Beginning with CLDR version 1.7.2, machine-readable files are
   available listing the data defined for BCP47 extensions for each
   successive version of [UTS35].  These releases are listed on
   http://cldr.unicode.org/index/downloads.  Each release has an
   associated data directory of the form
   "http://unicode.org/Public/cldr/<version>", where "<version>" is
   replaced by the release number.  For example, for version 1.7.2, the
   "core.zip" file is located at
   http://unicode.org/Public/cldr/1.7.2/core.zip [2].  Inside the
   "core.zip" file, the path "common/bcp47" contains the data files for
   BCP47 extensions.  The most recent version is always identified by
   the version "latest" and can be accessed by the URL in Section 2.2.

   To get the version information in XML when working with the data
   files, the XML parser must be validating.  When the 'core.zip' file
   is unzipped, the 'dtd' directory will be at the same level as the
   'bcp47' directory; that is required for correct validation.
   For each release after CLDR 1.8, types introduced in that release are
   also marked in the data files by the XML attribute "since", such as
   in the following example:

   The data is also currently maintained in a source code repository,
   with each release tagged, for viewing directly without unzipping.
   For example, see:

   o  http://unicode.org/repos/cldr/tags/release-1-7-2/common/bcp47/

   o  http://unicode.org/repos/cldr/tags/release-1-8/common/bcp47/

2.1.1.  Canonicalization

   As required by [BCP47], the use of uppercase or lowercase letters is
   not significant in the subtags used in this extension.  The canonical
   form for all subtags in the extension is lowercase, with the fields
   ordered alphabetically by their separators.

2.2.  Registration Form

   Per RFC 5646, Section 3.7 [BCP47]:

   %%
   Identifier: t
   Description: Transform Specification
   Comments: Subtags for the identification of text transforms,
    including transliteration, transcription, and translation.
   Added: 2010-mm-dd
   RFC: [TBD]
   Authority: Unicode Consortium
   Contact_Email: cldr-contact@unicode.org
   Mailing_List: cldr-users@unicode.org
   URL: http://www.unicode.org/Public/cldr/latest/core.zip
   %%

3.  Acknowledgements

   Thanks to John Emmons and the rest of the Unicode CLDR Technical
   Committee for their work in developing the BCP 47 subtags for LDML.

4.  IANA Considerations

   This document will require IANA to insert the record in Section 2.2
   into the Language Extensions Registry, according to Section 3.7,
   "Extensions and the Extensions Registry", of "Tags for Identifying
   Languages" in [BCP47].  Per Section 5.2 of [BCP47], there might be
   occasional (rare) requests by the Unicode Consortium (the "Authority"
   listed in the record) for maintenance of this record.
   Changes that can be submitted to IANA without the publication of a
   new RFC are limited to modification of the Comments, Contact_Email,
   Mailing_List, and URL fields.  Any such requested changes MUST use
   the domain 'unicode.org' in any new addresses or URIs, MUST
   explicitly cite this document (so that IANA can reference these
   requirements), and MUST originate from the 'unicode.org' domain.  The
   domain or authority can only be changed via a new RFC.

   This document does not require IANA to create or maintain a new
   registry or otherwise impact IANA.

5.  Security Considerations

   The security considerations for this extension are the same as those
   for [BCP47].  See RFC 5646, Section 6, Security Considerations
   [BCP47].

6.  References

6.1.  Normative References

   [BCP47]    Davis, M., Ed., "Tags for the Identification of Language
              (BCP47)", September 2009.

   [RFC6067]  Davis, M., Ed., "BCP 47 Extension U", September 2010.

   [US-ASCII]
              International Organization for Standardization, "ISO/IEC
              646:1991, Information technology -- ISO 7-bit coded
              character set for information interchange.", 1991.

   [UTS35]    Davis, M., "Unicode Technical Standard #35: Locale Data
              Markup Language (LDML)", December 2007,
              <http://unicode.org/reports/tr35/>.

              Section 3: http://unicode.org/reports/
              tr35/#Unicode_Language_and_Locale_Identifiers

              Appendix Q: http://unicode.org/reports/
              tr35/#Locale_Extension_Key_and_Type_Data

6.2.  Informative References

   [ldml-registry]
              "Registry for Common Locale Data Repository tag elements",
              September 2009.

URIs

   [1]  <http://unicode.org/reports/
        tr35/#Unicode_Language_and_Locale_Identifiers>

   [2]  <http://unicode.org/Public/cldr/1.7.2/core.zip>

Authors' Addresses

   Mark Davis
   Google

   Email: mark@macchiato.com


   Addison Phillips
   Lab126

   Email: addison@lab126.com


   Yoshito Umaoka
   IBM

   Email: yoshito_umaoka@us.ibm.com