Internet Draft                                          James Seng, BIX
                                                     Martin Duerst, W3C
27th July 1999                                         Tin Wee Tan, NUS
Expires End of January 2000

        UTF-5, a transformation format of Unicode and ISO 10646

Status of this Memo

   This document is an Internet-Draft and is in full conformance
   with all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as
   "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   Distribution of this document is unlimited. Please send comments
   to the authors at jseng@pobox.org.sg, mduerst@w3.org and
   tinwee@post1.com.

Abstract

   A new transformation format for Unicode, called UTF-5, is proposed.
   The string produced by this UTF lies within the alphanumeric range
   A-V and 0-9. This allows legacy systems and protocols that were
   designed for an alphanumeric character set only to become
   multilingual and internationalized immediately. Examples of such
   systems are the Domain Name System and email addresses.

1. Introduction

   The Unicode Standard, version 2.1 [UNICODE], and ISO/IEC 10646-1
   [ISO-10646] jointly define a 16-bit character set, UCS-2, which
   encompasses most of the world's writing systems. ISO 10646 further
   defines a 31-bit character set, UCS-4, with currently no assignments
   outside of the region corresponding to UCS-2 (the Basic Multilingual
   Plane, BMP). The UCS-2 and UCS-4 encodings, however, are hard to
   use in many current applications and protocols that assume 8-bit or
   even 7-bit characters. Even newer systems able to deal with 16-bit
   characters cannot process UCS-4 data. This situation has led to the
   development of so-called UCS transformation formats (UTF), each with
   different characteristics.

   At this moment, there are three standard UTFs, namely UTF-7 [UTF7],
   UTF-8 [UTF8] and UTF-16 [UTF16]. Each is a variable-length
   transformation, yielding 7-bit, 8-bit and 16-bit strings
   respectively. While these are sufficient for most applications,
   some legacy systems are unfortunately unable to handle even 7-bit
   strings, either because of technical restrictions or because of
   common usage.

   The object of this memo is to propose UTF-5, which yields a
   transformed string that lies within the alphanumeric character set
   A-V and 0-9. This allows legacy systems designed for an alphanumeric
   character set only to become multilingual and internationalized
   immediately.

   UTF-8 is the transformation format for all IETF standards [IETFPC].
   UTF-5 is not intended to change this. It is proposed in order to
   support legacy applications and protocols that cannot be modified in
   a simple way to handle 8-bit data in the UTF-8 encoding. See
   Section 4 for a discussion of how UTF-5 can be used for the Domain
   Name System [DNS] and Simple Mail Transfer Protocol [SMTP]
   addresses.

2. UTF-5 definition

   In UTF-5, each character is encoded using a sequence of 1 to 8
   octets. Two transformations are needed for UTF-5, namely:

   1. Determine the quintet ("5-bit") binary sequence.
   2. Using a table, translate the quintets into the resulting string.

   Note that a UTF-5 string is not a sequence of quintets but a
   sequence of octets, each of which is in the alphanumeric range.
   In this context, alphanumeric is defined as A to V (uppercase only)
   and 0 to 9.

   This memo does not specify the binary pattern of the alphanumeric
   characters, as the purpose of the transformation is to obtain an
   alphanumeric string that represents a multilingual string. However,
   it is presumed that US-ASCII [US-ASCII] is used for most purposes.
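
   As an informal illustration only, the two transformations above,
   together with their inverse, might be sketched in Python as
   follows. The names ALPHABET, utf5_encode and utf5_decode are
   hypothetical and not part of this memo. The sketch assumes the
   quintet-to-character mapping of Section 2.2, and it is limited to
   the code points that Python's chr() accepts (up to 10FFFF), so it
   does not cover the full 31-bit UCS-4 range.

```python
# Illustrative sketch only; all names here are assumptions, not part of the memo.

# Quintet value 0..31 mapped to its alphanumeric character (Section 2.2).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode(text: str) -> str:
    """Encode each character of text as a UTF-5 octet sequence."""
    out = []
    for ch in text:
        # Hexadecimal digits of the character value, without leading zeros
        # (the value 0 yields the single digit 0).
        nibbles = [int(digit, 16) for digit in format(ord(ch), "X")]
        # First quintet: highest-order bit set to 1, then 4 value bits.
        out.append(ALPHABET[0x10 | nibbles[0]])
        # Remaining quintets: highest-order bit 0, then 4 value bits.
        out.extend(ALPHABET[n] for n in nibbles[1:])
    return "".join(out)


def utf5_decode(data: str) -> str:
    """Decode a UTF-5 string back into characters."""
    chars = []
    value = None  # value of the character currently being assembled
    for octet in data:
        quintet = ALPHABET.index(octet)  # ValueError if not in 0-9, A-V
        if quintet & 0x10:
            # An initial octet ('G' to 'V') starts a new character.
            if value is not None:
                chars.append(chr(value))
            value = quintet & 0x0F
        else:
            if value is None:
                raise ValueError("UTF-5 string must start with 'G' to 'V'")
            value = (value << 4) | quintet
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)
```

   With this sketch, utf5_encode("\u65E5\u672C\u8A9E") returns
   "M5E5M72COA9E", and utf5_decode applied to that string returns the
   original three characters.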

2.1 Determine the quintet binary sequence

   The first quintet of a binary sequence has its highest-order bit
   set to 1, and each remaining quintet has its highest-order bit set
   to 0. The remaining 4 bits of every quintet contain bits from the
   value of the character to be encoded.

   The table below summarizes the format of these different quintet
   types. The letter x indicates bits available for encoding bits of
   the UCS-4 character value.

   UCS-4 range (hex.)    UTF-5 quintet sequence (binary)
   0000 0000-0000 000F   1xxxx
   0000 0010-0000 00FF   1xxxx 0xxxx
   0000 0100-0000 0FFF   1xxxx 0xxxx 0xxxx
   0000 1000-0000 FFFF   1xxxx 0xxxx 0xxxx 0xxxx
   ...
   1000 0000-7FFF FFFF   1xxxx 0xxxx 0xxxx ..... 0xxxx

2.2 Translation table for quintet and alphanumeric character

   The translation table between quintet binary patterns and
   alphanumeric characters is as follows:

   quintet       quintet       quintet       quintet
   00000  0      01000  8      10000  G      11000  O
   00001  1      01001  9      10001  H      11001  P
   00010  2      01010  A      10010  I      11010  Q
   00011  3      01011  B      10011  J      11011  R
   00100  4      01100  C      10100  K      11100  S
   00101  5      01101  D      10101  L      11101  T
   00110  6      01110  E      10110  M      11110  U
   00111  7      01111  F      10111  N      11111  V

2.3 Encoding from UCS-4 to UTF-5

   1) Determine the required number of octets from the character
      value. Let U be the UCS-4 value; the required number of octets
      is the number of hexadecimal digits of U without leading zeros,
      that is, floor(log16(U)) + 1 for U > 0, and 1 for U = 0.

   2) Prepare the quintet binary sequence. Set the highest-order bit
      of the first quintet to 1 and the highest-order bit of each of
      the remaining quintets to 0.

   3) Fill in the bits marked x from the bits of the character value,
      starting from the lower-order bits of the character value and
      putting them first in the last quintet of the sequence, then the
      next to last, etc., until all x bits are filled in.

   4) For each quintet, apply the lookup table in Section 2.2 to get
      the corresponding alphanumeric character.

2.4 Decoding UTF-5 to UCS-4

   1) Determine the length of the octet sequence. According to the
      UTF-5 encoding, every character has an initial octet within 'G'
      to 'V'. Thus, the length of the octet sequence can be determined
      by looking for 'G' to 'V' in the UTF-5 string: a sequence runs
      from one such octet up to, but not including, the next one (or
      to the end of the string).

   2) Apply the reverse lookup according to the table in Section 2.2
      to get the quintet binary sequence.

   3) Initialize the 4 octets of the UCS-4 character with all bits set
      to 0.

   4) Distribute the bits from the sequence to the UCS-4 character,
      first the lower-order bits from the last octet of the sequence
      and proceeding to the left until no x bits are left.

   If the UTF-5 sequence is no more than four octets long, decoding
   can proceed directly to UCS-2 (or equivalently Unicode).

2.5 Detecting UTF-5 string

   As a UTF-5 string is an alphanumeric string, it is difficult to
   differentiate between a normal ASCII document and a UTF-5 document.

   Nevertheless, if the string is sufficiently long, it is possible to
   do some detection of UTF-5 strings based on the facts that:

   1. UTF-5 strings only have characters within '0'-'9' and 'A'-'V'.
   2. UTF-5 strings have a well-defined initial octet of 'G' to 'V'.
   3. The 'G' character always occurs as the initial and only octet
      of a sequence (that is, as the encoding of the value 0000).

3. Examples of UTF-5

   The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262,
   0391, 002E) may be encoded as follows:

      "K1I262J91IE"

   The Unicode sequence "Hi Mom <WHITE SMILING FACE>!"
   (0048, 0069, 0020, 004D, 006F, 006D, 0020, 263A, 0021) may be
   encoded as follows:

      "K8M9I0KDMFMDI0I63AI1"

   The Unicode sequence representing the Han characters for the
   Japanese word "nihongo" (65E5, 672C, 8A9E) may be encoded as
   follows:

      "M5E5M72COA9E"

   Note that, as the examples show, there is a short-cut to the UTF-5
   transformation, which goes like this:

   1. Write down the hexadecimal value of the Unicode character as a
      string, without leading zeros.
   2. For the first character of the hexadecimal string, change 0 to
      G, 1 to H, 2 to I, ... F to V.

   This yields the UTF-5 string of the Unicode character.

4. Applications

   There are many applications in which UTF-5 would be useful for
   Internationalization ("i18n"). Here are some of the possible uses.

   a. Internationalization of the Domain Name System

   In the Domain Name System, although the technical standard does not
   prevent 8-bit characters from being used in domain names, general
   use of the system restricts valid domain names to A-Z (upper and
   lower case), 0-9 and "-". This poses great difficulty for the i18n
   of domain names, as the current UTF-7, UTF-8 and UTF-16 are not
   compatible with the existing software already in use.

   Please see draft-xxx-xxx-xxx.txt for a detailed discussion of the
   Internationalization of the Domain Name System ("iDNS").
   http://www.idns.org/

   b. Internationalization of Simple Mail Transfer Protocol Address

   While it is possible for people to send SMTP mail to one another in
   different languages and character sets using Multipurpose Internet
   Mail Extensions [MIME], the SMTP mail address remains a challenge
   to internationalize. Internationalization of the SMTP address faces
   two barriers: 1. the Internationalization of the Domain Name
   System, and 2. the Internationalization of the mailbox or username.
   SMTP mailboxes are subject to very strict checks [RFC822] due to
   the many potential security risks of using symbols or special
   characters in a mailbox. UTF-5 will allow Unicode to be used
   immediately in mailboxes with minimal change to systems and without
   additional security risks.

   Please see draft-xxx-xxx-xxx.txt for a detailed discussion of the
   Internationalization of Simple Mail Transfer Protocol Address
   ("iMail").

   Internationalization of URIs is not discussed in this memo. Please
   refer to http://www.w3.org/International/0-URL-and-ident.html.

   Uses for UTF-5 also go beyond the Internet, back to legacy systems
   such as the telegram system or even Morse code, allowing
   multilingual characters to be transmitted.

5. Security Considerations

   This memo does not address any security considerations at the
   moment.

6. Acknowledgments

   UTF-5 was first defined by Martin Duerst at the University of Zurich
   in draft-duerst-dns-i18n-00.txt.

   Contributors (not in any order):
   Marc Blanchet

7. Bibliography

   [ISO-10646] ISO/IEC 10646-1:1993. International Standard --
   [UTF16]     Information technology -- Universal Multiple-Octet
               Coded Character Set (UCS) -- Part 1: Architecture
               and Basic Multilingual Plane. UTF-8 is described in
               Annex R, adopted but not yet published. UTF-16 is
               described in Annex Q, adopted but not yet published.

   [UTF7]      Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
               Transformation Format of Unicode", RFC 1642,
               Taligent, Inc., July 1994.

   [UTF8]      Yergeau, F., "UTF-8: a transformation format of
               Unicode and ISO 10646", RFC 2044, Alis Technologies,
               October 1996.

   [UNICODE]   The Unicode Consortium, "The Unicode Standard --
               Worldwide Character Encoding -- Version 1.0",
               Addison-Wesley, Volume 1, 1991, Volume 2, 1992.

   [US-ASCII]  Coded Character Set--7-bit American Standard Code for
               Information Interchange, ANSI X3.4-1986.

   [DNS]       Mockapetris, P., "Domain Names - Concepts and
               Facilities", RFC 1034, ISI, November 1987, and "Domain
               Names - Implementation and Specification", RFC 1035,
               ISI, November 1987.

   [SMTP]      Postel, Jonathan B., "Simple Mail Transfer Protocol",
   [RFC822]    RFC 821, ISI, August 1982. Crocker, David H., "Standard
               for ARPA Internet Text Messages", RFC 822, Dept of
               Electrical Engineering, University of Delaware,
               August 1982.

   [MIME]      Borenstein, N., Bellcore, and N. Freed, Innosoft,
               "Multipurpose Internet Mail Extensions", RFC 1341,
               June 1992.

   [IETFPC]    Alvestrand, H., "IETF Policy on Character Sets and
               Languages", BCP 18, RFC 2277, January 1998.

8. Authors' Addresses

   James C.H Seng
   BioInformatrix Pte Ltd
   102 Elm Street
   Menlo Park CA 94025

   Tel: (650) 322-6505
   E-mail: jseng@pobox.org.sg

   Martin J. Duerst
   World Wide Web Consortium
   Keio Research Institute at SFC
   Keio University
   Fujisawa
   252-8520 Japan

   Tel: +81 446 49 11 70
   E-mail: mduerst@w3.org

   NOTE -- Please write the author's name with u-Umlaut wherever
   possible, e.g. in HTML as D&uuml;rst.

   Tin Wee Tan, Dr
   National University of Singapore
   c/o BioInformatic Center
   National University Hospital
   Lower Kent Ridge Road
   Singapore 119074

   Tel: +65 774 7149
   E-mail: tinwee@post1.com

   This memo is also archived at http://www.idns.org/technical.html