Internet Engineering Task Force                                 M. Davis
Internet-Draft                                                    Google
Intended status: Informational                               A. Phillips
Expires: January 12, 2012                                         Lab126
                                                               Y. Umaoka
                                                                     IBM
                                                                 C. Falk
                                                       Infinite Automata
                                                           July 11, 2011

                BCP 47 Extension T - Transformed Content
                      draft-davis-t-langtag-ext-03

Abstract

   This document specifies an Extension to BCP 47 which provides subtags
   for specifying the source language or script of transformed content,
   including content that has been transliterated, transcribed, or
   translated, or in some other way influenced by the source.
   It also provides for additional information used for identification.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 12, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  BCP47 Required Information
     2.1.  Introduction
     2.2.  Structure
     2.3.  Canonicalization
     2.4.  BCP47 Registration Form
     2.5.  Field Definitions
     2.6.  Registration of Field Subtags
     2.7.  Machine-Readable Data
   3.  Acknowledgements
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   [BCP47] permits the definition and registration of language tag
   extensions "that contain a language component and are compatible with
   applications that understand language tags". This document defines
   an extension for specifying the source of content that has been
   transformed, including text that has been transliterated,
   transcribed, or translated, or in some other way influenced by the
   source. It may be used in queries to request content that has been
   transformed. The "singleton" identifier for this extension is 't'.

   Language tags, as defined by [BCP47], are useful for identifying the
   language of content. There are mechanisms for specifying variant
   subtags for special purposes. However, these variants are
   insufficient for specifying content that has undergone
   transformations, including content that has been transliterated,
   transcribed, or translated. That is, for fully specifying such
   content, it is important to specify the source language and/or
   script. In addition, it may also be important to identify a
   particular specification for the transformation.

   For example, suppose that Italian or Russian cities on a map are
   transcribed for Japanese users. Each name needs to be transliterated
   into katakana using rules appropriate for the specific source and
   target language. When tagging such data, it is important to be able
   to indicate not only the resulting content language ("ja" in this
   case), but also the source language.

   Transforms such as transliteration may vary depending not only on the
   source and target script, but also on the source and target language.
   Thus the Russian <U+041F U+0443 U+0442 U+0438 U+043D> (which
   corresponds to the Cyrillic <PE, U, TE, I, EN>) transliterates into
   "Putin" in English but "Poutine" in French. The identifier could be
   used to indicate a desired mechanical transformation in an API, or
   could be used to tag data that has been converted (mechanically or by
   hand) according to a transliteration method.

   The usage of this extension is not limited to formal transformations,
   and may include other instances where the content is in some other
   way influenced by the source. For example, this extension could be
   used to designate a request for a speech recognizer that is tailored
   specifically for 2nd-language speakers who are 1st-language speakers
   of a particular language (e.g. a recognizer for "English spoken with
   a Chinese accent").

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.  BCP47 Required Information

2.1.  Introduction

   Identification of transformed content can be done using the 't'
   extension defined in this document. This extension is formed by the
   't' singleton followed by a sequence of subtags that would form a
   language tag as defined by [BCP47]. This allows for the source
   language or script to be specified to the degree of precision
   required. There are restrictions on the sequence of subtags. They
   MUST form a regular, valid, canonical language tag, and MUST neither
   include extensions nor private use sequences introduced by the
   singleton 'x'. Where only the script is relevant (such as
   identifying a script-script transliteration) then 'und' is used for
   the primary language subtag.

   For example:

   +---------------------+---------------------------------------------+
   | Language Tag        | Description                                 |
   +---------------------+---------------------------------------------+
   | ja-t-it             | The content is Japanese, transformed from   |
   |                     | Italian.                                    |
   | ja-Kana-t-it        | The content is Japanese Katakana,           |
   |                     | transformed from Italian.                   |
   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
   |                     | transformed from the Cyrillic script.       |
   +---------------------+---------------------------------------------+

   Note that the sequence of subtags governed by 't' cannot contain a
   singleton (a single-character subtag), because that would start a new
   extension. For example, the tag "ja-t-i-ami" does not indicate that
   the source is in "i-ami", because "i-ami" is not a regular language
   tag in [BCP47]. That tag would express an empty 't' extension
   followed by an 'i' extension.

   It is sometimes necessary to indicate additional information about
   the transformation. This additional information is optionally
   supplied after the source in a series of one or more fields, where
   each field consists of a field separator subtag followed by one or
   more non-separator subtags. Each field separator subtag consists of
   a single letter followed by a single digit.

   A transformation mechanism is an optional field that indicates the
   specification used for the transformation, such as "UNGEGN" for the
   United Nations Group of Experts on Geographical Names
   transliterations and transcriptions. It uses the 'm0' field
   separator followed by certain subtags.

   For example:

   +------------------------------------+------------------------------+
   | Language Tag                       | Description                  |
   +------------------------------------+------------------------------+
   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
   |                                    | transformed from Latn,       |
   |                                    | according to a UNGEGN        |
   |                                    | specification dated 2007.    |
   +------------------------------------+------------------------------+

   The field separator subtags such as 'm0' were chosen because they are
   short, visually distinctive, and cannot occur in a language subtag
   (outside of an extension and after 'x'), thus eliminating the
   potential for collision or confusion with the source language tag.

   The field subtags are defined by Section 3 [1] of Unicode Technical
   Standard #35: Unicode Locale Data Markup Language [UTS35]. As
   required by BCP 47, subtags follow the language tag ABNF and other
   rules for the formation of language tags and subtags, are restricted
   to the ASCII letters and digits, are not case sensitive, and do not
   exceed eight characters in length.

   EDITORIAL NOTE: This new facility has been accepted by the Unicode
   CLDR committee for incorporation into the next version of Unicode
   CLDR, parallel with the structure of the 'u' extension [RFC6067], for
   which it is already the maintaining authority. The data and
   specification will be available by the time this Internet-Draft has
   been approved.

   LDML is available over the Internet and at no cost, and is available
   via a royalty-free license at http://unicode.org/copyright.html.
   LDML is versioned, and each version of LDML is numbered, dated, and
   stable. Extension subtags, once defined by LDML, are never retracted
   and never change in meaning in a substantial way.
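
   The way a 't' extension divides into a source tag and fields can be
   illustrated with a short, non-normative Python sketch. It relies only
   on two rules stated above: a single-character subtag starts a new
   extension, and a field separator is one letter followed by one digit.
   The function name split_t_extension and its return shape are
   assumptions made for this illustration. The sketch does not validate
   subtags against the BCP 47 registry or the LDML field definitions.

```python
import re

# A field separator subtag is a single letter followed by a single digit,
# for example "m0".
FIELD_SEPARATOR = re.compile(r"^[a-z][0-9]$")


def split_t_extension(tag):
    """Split a language tag around its 't' extension (illustrative only).

    Returns (target, source, fields): the subtags before 't', the source
    language tag (or None), and a dict mapping each field separator to the
    subtags that follow it. No registry validation is performed.
    """
    subtags = tag.lower().split("-")

    # Locate the 't' singleton; nothing after a private use 'x' counts.
    start = None
    for i, subtag in enumerate(subtags):
        if i > 0 and subtag == "x":
            break
        if i > 0 and subtag == "t":
            start = i
            break
    if start is None:
        return tag, None, {}

    # The extension ends at the next singleton (single-character subtag).
    end = start + 1
    while end < len(subtags) and len(subtags[end]) > 1:
        end += 1

    source, fields, current = [], {}, None
    for subtag in subtags[start + 1:end]:
        if FIELD_SEPARATOR.match(subtag):
            current = subtag
            fields[current] = []
        elif current is None:
            source.append(subtag)
        else:
            fields[current].append(subtag)

    target = "-".join(subtags[:start])
    joined = {key: "-".join(values) for key, values in fields.items()}
    return target, ("-".join(source) or None), joined


if __name__ == "__main__":
    print(split_t_extension("und-Cyrl-t-und-latn-m0-ungegn-2007"))
    print(split_t_extension("ja-t-i-ami"))
```

   Under these assumptions, "ja-t-i-ami" yields no source and no fields,
   matching the empty 't' extension described above, because the
   singleton 'i' ends the extension.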

   The maintaining authority for the 't' extension is the Unicode
   Consortium:

   +---------------+---------------------------------------------------+
   | Item          | Value                                             |
   +---------------+---------------------------------------------------+
   | Name          | Unicode Consortium                                |
   | Contact Email | cldr-contact@unicode.org                          |
   | Discussion    | cldr-users@unicode.org                            |
   | List Email    |                                                   |
   | URL Location  | cldr.unicode.org                                  |
   | Specification | Unicode Technical Standard #35 Unicode Locale     |
   |               | Data Markup Language (LDML),                      |
   |               | http://unicode.org/reports/tr35/                  |
   | Section       | Section 3 Unicode Language and Locale Identifiers |
   +---------------+---------------------------------------------------+

2.2.  Structure

   The subtags in the 't' extension are of the following form:

   t-ext=    "t"                      ; Extension
             (("-" lang *("-" field)) ; Source + optional field(s)
             / 1*("-" field))         ; Field(s) only (no source)

   lang=     language                 ; BCP47, with restrictions
             ["-" script]
             ["-" region]
             *("-" variant)

   field=    sep 1*("-" 3*8alphanum)  ; With restrictions

   sep=      ALPHA DIGIT              ; Subtag separators
   alphanum= ALPHA / DIGIT

   where ,