Internet Engineering Task Force                                 M. Davis
Internet-Draft                                                    Google
Intended status: Informational                               A. Phillips
Expires: June 7, 2012                                             Lab126
                                                               Y. Umaoka
                                                                     IBM
                                                                 C. Falk
                                                       Infinite Automata
                                                        December 5, 2011


                BCP 47 Extension T - Transformed Content
                      draft-davis-t-langtag-ext-07

Abstract

   This document specifies an Extension to BCP 47 which provides subtags
   for specifying the source language or script of transformed content,
   including content that has been transliterated, transcribed, or
   translated, or in some other way influenced by the source.
   It also provides for additional information used for identification.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 7, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  BCP47 Required Information
     2.1.  Overview
     2.2.  Structure
     2.3.  Canonicalization
     2.4.  BCP47 Registration Form
     2.5.  Field Definitions
     2.6.  Registration of Field Subtags
     2.7.  Registration of Additional Fields
     2.8.
           Committee Responses to Registration Proposals
     2.9.  Machine-Readable Data
   3.  Acknowledgements
   4.  IANA Considerations
   5.  Security Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Authors' Addresses

1.  Introduction

   [BCP47] permits the definition and registration of language tag
   extensions "that contain a language component and are compatible with
   applications that understand language tags".  This document defines
   an extension for specifying the source of content that has been
   transformed, including text that has been transliterated,
   transcribed, or translated, or in some other way influenced by the
   source.  It may be used in queries to request content that has been
   transformed.  The "singleton" identifier for this extension is 't'.

   Language tags, as defined by [BCP47], are useful for identifying the
   language of content.  There are mechanisms for specifying variant
   subtags for special purposes.  However, these variants are
   insufficient for specifying content that has undergone
   transformations, including content that has been transliterated,
   transcribed, or translated.  The correct interpretation of the
   content may depend upon knowledge of the conventions used for the
   transformation.

   Suppose that the names of Italian or Russian cities on a map are
   transcribed for Japanese users.  Each name needs to be transliterated
   into katakana using rules appropriate for the specific source and
   target language.
   When tagging such data, it is important to be able to indicate not
   only the resulting content language ("ja" in this case), but also the
   source language.

   Transforms such as transliterations may vary depending not only on
   the source and target script, but also on the source and target
   language.  Thus the Russian <U+041F U+0443 U+0442 U+0438 U+043D>
   (which corresponds to the Cyrillic "Путин") transliterates into
   "Putin" in English but "Poutine" in French.  The identifier could be
   used to indicate a desired mechanical transformation in an API, or
   could be used to tag data that has been converted (mechanically or by
   hand) according to a transliteration method.

   In addition, many different conventions have arisen for how to
   transform text, even between the same languages and scripts.  For
   example, "Gaddafi" is commonly transliterated from Arabic to English
   as any of (G/Q/K/Kh)a(d/dh/dd/dhdh/th/zz)af(i/y).  Some examples of
   standardized conventions used for transcribing or transliterating
   text include:

   a.  United Nations Group of Experts on Geographical Names (UNGEGN)

   b.  US Library of Congress (LOC)

   c.  US Board on Geographic Names (BGN)

   d.  Korean Ministry of Culture, Sports and Tourism (MCST)

   e.  International Organization for Standardization (ISO)

   The usage of this extension is not limited to formal transformations,
   and may include other instances where the content is in some other
   way influenced by the source.  For example, this extension could be
   used to designate a request for a speech recognizer that is tailored
   specifically for 2nd-language speakers who are 1st-language speakers
   of a particular language (e.g. a recognizer for "English spoken with
   a Chinese accent").

1.1.
Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119.

2.  BCP47 Required Information

2.1.  Overview

   Identification of transformed content can be done using the 't'
   extension defined in this document.  This extension is formed by the
   't' singleton followed by a sequence of subtags that would form a
   language tag as defined by [BCP47].  This allows for the source
   language or script to be specified to the degree of precision
   required.  There are restrictions on the sequence of subtags.  They
   MUST form a regular, valid, canonical language tag, and MUST neither
   include extensions nor private use sequences introduced by the
   singleton 'x'.  Where only the script is relevant (such as when
   identifying a script-to-script transliteration), 'und' is used for
   the primary language subtag.

   For example:

   +---------------------+---------------------------------------------+
   | Language Tag        | Description                                 |
   +---------------------+---------------------------------------------+
   | ja-t-it             | The content is Japanese, transformed from   |
   |                     | Italian.                                    |
   | ja-Kana-t-it        | The content is Japanese Katakana,           |
   |                     | transformed from Italian.                   |
   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
   |                     | transformed from the Cyrillic script.       |
   +---------------------+---------------------------------------------+

   Note that the sequence of subtags governed by 't' cannot contain a
   singleton (a single-character subtag), because that would start a new
   extension.  For example, the tag "ja-t-i-ami" does not indicate that
   the source is in "i-ami", because "i-ami" is not a regular language
   tag in [BCP47].  That tag would express an empty 't' extension
   followed by an 'i' extension.
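
   The singleton rule described above can be illustrated with a short
   Python sketch.  It is not part of this specification: the function
   name split_t_extension is hypothetical, the code assumes a
   syntactically plausible tag, and it performs no validation against
   [BCP47] or the subtag registry.  It lowercases the tag for
   comparison, so the original casing is not preserved.

```python
def split_t_extension(tag):
    """Split a language tag at its 't' singleton.

    Returns (prefix, t_subtags). prefix holds the subtags before the
    't' singleton; t_subtags holds the subtags governed by 't', or
    None when the tag has no 't' extension.
    Illustrative helper only; no BCP 47 validation is performed.
    """
    subtags = tag.lower().split("-")
    for i, subtag in enumerate(subtags):
        if i == 0:
            continue  # position 0 is the primary language subtag
        if subtag == "x":
            break  # private use: everything after 'x' is opaque
        if subtag == "t":
            end = i + 1
            # The extension runs until the next singleton, which
            # would start a new extension (or private use).
            while end < len(subtags) and len(subtags[end]) > 1:
                end += 1
            return subtags[:i], subtags[i + 1:end]
    return subtags, None


print(split_t_extension("ja-Kana-t-it"))
print(split_t_extension("ja-t-i-ami"))
```

   For "ja-t-i-ami" the second element is an empty list, which
   corresponds to the empty 't' extension described above; the 'i'
   singleton ends the 't' extension immediately.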

   The 't' extension is not intended for use in structured data that
   already provides separate source and target language identifiers.
   For example, this is the case in localization interchange formats
   such as XLIFF.  In such cases, it would be inappropriate to use
   "ja-t-it" for the target language tag because the source language tag
   "it" would already be present in the data.  Instead one would use the
   language tag "ja".

   As noted earlier, it is sometimes necessary to indicate additional
   information about a transformation.  This additional information is
   optionally supplied after the source in a series of one or more
   fields, where each field consists of a field separator subtag
   followed by one or more non-separator subtags.  Each field separator
   subtag consists of a single letter followed by a single digit.

   A transformation mechanism is an optional field that indicates the
   specification used for the transformation, such as "UNGEGN" for the
   United Nations Group of Experts on Geographical Names
   transliterations and transcriptions.  It uses the 'm0' field
   separator followed by certain subtags.

   For example:

   +------------------------------------+------------------------------+
   | Language Tag                       | Description                  |
   +------------------------------------+------------------------------+
   | und-Cyrl-t-und-latn-m0-ungegn-2007 | The content is in Cyrillic,  |
   |                                    | transformed from Latin,      |
   |                                    | according to a UNGEGN        |
   |                                    | specification dated 2007.    |
   +------------------------------------+------------------------------+

   The field separator subtags such as 'm0' were chosen because they are
   short, visually distinctive, and cannot occur in a language subtag
   (outside of an extension and after 'x'), thus eliminating the
   potential for collision or confusion with the source language tag.
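
   Because a field separator is always one letter followed by one digit,
   the subtags governed by 't' can be divided into the source and its
   fields with a simple scan.  The following Python sketch illustrates
   this under stated assumptions: the function name parse_t_subtags is
   hypothetical, the input is the list of subtags that follow the 't'
   singleton, and the code neither checks that each field has at least
   one subtag nor validates the source as a language tag.

```python
import re

# A field separator is a single letter followed by a single digit.
SEPARATOR = re.compile(r"[a-z][0-9]")


def parse_t_subtags(subtags):
    """Divide the subtags governed by 't' into (source, fields).

    source is the list of source language subtags (possibly empty);
    fields maps each separator (e.g. 'm0') to its list of subtags.
    Illustrative helper only; no validation is performed.
    """
    source = []
    fields = {}
    current = None  # separator of the field being collected
    for subtag in (s.lower() for s in subtags):
        if SEPARATOR.fullmatch(subtag):
            current = subtag
            fields[current] = []
        elif current is None:
            source.append(subtag)  # still inside the source tag
        else:
            fields[current].append(subtag)
    return source, fields


source, fields = parse_t_subtags(
    "und-latn-m0-ungegn-2007".split("-"))
print(source, fields)
```

   The scan relies only on the shape of the separator, which is the
   property noted above: no language, script, region, or variant subtag
   has the form of one letter followed by one digit.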

   The field subtags are defined by Section 3 [1] of Unicode Technical
   Standard #35: Unicode Locale Data Markup Language [UTS35] (LDML), the
   main specification for the Unicode Common Locale Data Repository
   (CLDR) project.  As required by BCP 47, subtags follow the language
   tag ABNF and other rules for the formation of language tags and
   subtags, are restricted to the ASCII letters and digits, are not case
   sensitive, and do not exceed eight characters in length.

   EDITORIAL NOTE: This new facility has been accepted by the Unicode
   CLDR committee for incorporation into the next versions of CLDR and
   LDML, parallel with the structure of the 'u' extension [RFC6067], for
   which it is already the maintaining authority.  The data and
   specification will be available by the time this Internet-Draft has
   been approved.

   The LDML specification is available over the Internet at no cost,
   under a royalty-free license described at
   http://unicode.org/copyright.html.  LDML is versioned, and each
   version of LDML is numbered, dated, and stable.  Extension subtags,
   once defined by LDML, are never retracted or substantially changed in
   meaning.

   The maintaining authority for the 't' extension is the Unicode
   Consortium:

   +---------------+---------------------------------------------------+
   | Item          | Value                                             |
   +---------------+---------------------------------------------------+
   | Name          | Unicode Consortium                                |
   | Contact Email | cldr-contact@unicode.org                          |
   | Discussion    | cldr-users@unicode.org                            |
   | List Email    |                                                   |
   | URL Location  | cldr.unicode.org                                  |
   | Specification | Unicode Technical Standard #35 Unicode Locale     |
   |               | Data Markup Language (LDML),                      |
   |               | http://unicode.org/reports/tr35/                  |
   | Section       | Section 3 Unicode Language and Locale Identifiers |
   +---------------+---------------------------------------------------+

2.2.
Structure

   The subtags in the 't' extension are of the following form:

   t-ext    = "t"                      ; Extension
              (("-" lang *("-" field)) ; Source + optional field(s)
              / 1*("-" field))         ; Field(s) only (no source)

   lang     = language                 ; BCP47, with restrictions
              ["-" script]
              ["-" region]
              *("-" variant)

   field    = sep 1*("-" 3*8alphanum)  ; With restrictions

   sep      = ALPHA DIGIT              ; Subtag separators
   alphanum = ALPHA / DIGIT

   where ,