idnits 2.17.1 

draft-jseng-utf5-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 6
     longer pages, the longest (page 1) being 59 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The abstract seems to contain references ([0-9], [A-V]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The "Author's Address" (or "Authors' Addresses") section title is
     misspelled.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (January 2000) is 8868 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Missing reference section? 'A-V' on line 66 looks like a reference

  -- Missing reference section? '0-9' on line 66 looks like a reference

  -- Missing reference section? 'UNICODE' on line 281 looks like a reference

  -- Missing reference section? 'ISO-10646' on line 266 looks like a reference

  -- Missing reference section? 'UTF7' on line 273 looks like a reference

  -- Missing reference section? 'UTF8' on line 277 looks like a reference

  -- Missing reference section? 'UTF16' on line 267 looks like a reference

  -- Missing reference section? 'IETFPC' on line 303 looks like a reference

  -- Missing reference section? 'DNS' on line 288 looks like a reference

  -- Missing reference section? 'SMTP' on line 293 looks like a reference

  -- Missing reference section? 'US-ASCII' on line 285 looks like a reference

  -- Missing reference section? 'MIME' on line 299 looks like a reference

  -- Missing reference section? 'RFC822' on line 294 looks like a reference


     Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 15 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Draft                                         James Seng, BIX
2	<draft-jseng-utf5-00.txt>                           Martin Duerst, W3C
3	27th July 1999                                        Tin Wee Tan, NUS
4	Expires End of January 2000

6	            UTF-5, a transformation format of Unicode and ISO 10646

8	Status of this Memo

10	   This document is an Internet-Draft and is in full conformance
11	   with all provisions of Section 10 of RFC2026.

13	   Internet-Drafts are working documents of the Internet Engineering
14	   Task Force (IETF), its areas, and its working groups.  Note that
15	   other groups may also distribute working documents as
16	   Internet-Drafts.

18	   Internet-Drafts are draft documents valid for a maximum of six
19	   months and may be updated, replaced, or obsoleted by other
20	   documents at any time.  It is inappropriate to use Internet-
21	   Drafts as reference material or to cite them other than as
22	   "work in progress."

24	   The list of current Internet-Drafts can be accessed at
25	   http://www.ietf.org/ietf/1id-abstracts.txt

27	   The list of Internet-Draft Shadow Directories can be accessed at
28	   http://www.ietf.org/shadow.html.

30	   Distribution of this document is unlimited. Please send comments
31	   to the authors at jseng@pobox.org.sg, mduerst@w3.org and
32	   tinwee@post1.com.

34	Abstract

36	   A new transformation format for Unicode, called UTF-5, is proposed.
37	   The resulting string of this UTF is within the [A-V][0-9] alphanumeric
38	   range. This enables legacy systems or protocols designed for an
39	   alphanumeric character set only to be made multilingual and
40	   internationalized immediately. Examples of such systems are the
41	   Domain Name System and email addresses.

43	1. Introduction

45	   The Unicode Standard, version 2.1 [UNICODE], and ISO/IEC 10646-1
46	   [ISO-10646] jointly define a 16 bit character set, UCS-2, which
47	   encompasses most of the world's writing systems.  ISO 10646 further
48	   defines a 31-bit character set, UCS-4, with currently no assignments
49	   outside of the region corresponding to UCS-2 (the Basic Multilingual
50	   Plane, BMP).  The UCS-2 and UCS-4 encodings, however, are hard to
51	   use in many current applications and protocols that assume 8 or even
52	   7 bit characters. Even newer systems able to deal with 16 bit char-
53	   acters cannot process UCS-4 data. This situation has led to the
54	   development of so-called UCS transformation formats (UTF), each with
55	   different characteristics.

57	                   Expires End of January 2000     [Page  1]
58	   At present there are three standard UTFs, namely UTF-7 [UTF7], UTF-8
59	   [UTF8] and UTF-16 [UTF16]. Each is a variable-length transformation,
60	   yielding 7-bit, 8-bit and 16-bit strings respectively. While these
61	   are sufficient for most applications, some legacy systems are,
62	   unfortunately, unable to handle even 7-bit strings, either because
63	   of technical restrictions or because of common usage.

65	   The object of this memo is to propose UTF-5, which yields a trans-
66	   formed string within the [A-V][0-9] alphanumeric character set.
67	   This enables legacy systems designed for alphanumeric character sets
68	   only to be made multilingual and internationalized immediately.

70	   UTF-8 is the transformation format for all IETF standards [IETFPC].
71	   UTF-5 is not intended to change this. It is proposed to support
72	   legacy applications or protocols that cannot easily be modified to
73	   handle 8-bit data using the UTF-8 encoding. See Section 4 for a
74	   discussion of how UTF-5 can be used for the Domain Name System [DNS]
75	   and Simple Mail Transfer Protocol [SMTP] addresses.

77	2. UTF-5 definition

79	   In UTF-5, each character is encoded as a sequence of 1 to 8
80	   octets. Two transformations are needed for UTF-5, namely:

82	   1. Determine the quintet ("5-bit") binary sequence.
83	   2. From a table, translate each quintet to an alphanumeric character.

85	   Note that a UTF-5 string is not a sequence of quintets but a
86	   sequence of octets, each of which is in the alphanumeric range.
87	   In this context, alphanumeric means A to V (uppercase only) and
88	   0 to 9.

90	   This memo does not specify the binary pattern of the alphanumeric
91	   characters, as the purpose of the transformation is to obtain an
92	   alphanumeric string that represents a multilingual string. However,
93	   it is presumed that US-ASCII [US-ASCII] is used for most purposes.

95	   2.1 Determine the quintet binary sequence

97	   The first quintet of a binary sequence has its highest-order bit
98	   set to 1, and each remaining quintet has its highest-order bit
99	   set to 0. The remaining 4 bits of every quintet contain bits
100	   from the value of the character to be encoded.

102	   The table below summarizes the format of these different quintet
103	   types.  The letter x indicates bits available for encoding bits of
104	   the UCS-4 character value.

107	   UCS-4 range (hex.)           UTF-5 quintet sequence (binary)
108	   0000 0000-0000 000F          1xxxx
109	   0000 0010-0000 00FF          1xxxx 0xxxx
110	   0000 0100-0000 0FFF          1xxxx 0xxxx 0xxxx
111	   0000 1000-0000 FFFF          1xxxx 0xxxx 0xxxx 0xxxx
112	   ...
113	   1000 0000-7FFF FFFF          1xxxx 0xxxx 0xxxx ..... 0xxxx
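
	   As an illustration of the table above, the following Python sketch
	   prints the quintet bit patterns for a given character value. The
	   function name quintet_bits is hypothetical and not part of this
	   memo; the sketch assumes the value is a non-negative integer.

```python
def quintet_bits(value: int) -> str:
    """Return the quintet sequence for `value` as space-separated bit strings.

    Hypothetical helper for illustration only: each hexadecimal digit of the
    value supplies the four x bits of one quintet.
    """
    digits = format(value, "x")  # hexadecimal digits, no leading zeros
    groups = []
    for index, digit in enumerate(digits):
        # The first quintet is flagged with a leading 1, the rest with 0.
        flag = "1" if index == 0 else "0"
        groups.append(flag + format(int(digit, 16), "04b"))
    return " ".join(groups)


print(quintet_bits(0x41))    # two quintets
print(quintet_bits(0x2262))  # four quintets
```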

115	   2.2 Translation table for quintet and alphanumeric character

117	   The translation between quintet binary patterns and alphanumeric
118	   characters is as follows:

120	   quintet          quintet         quintet         quintet
121	   00000   0        01000   8       10000   G       11000   O
122	   00001   1        01001   9       10001   H       11001   P
123	   00010   2        01010   A       10010   I       11010   Q
124	   00011   3        01011   B       10011   J       11011   R
125	   00100   4        01100   C       10100   K       11100   S
126	   00101   5        01101   D       10101   L       11101   T
127	   00110   6        01110   E       10110   M       11110   U
128	   00111   7        01111   F       10111   N       11111   V
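
	   The table can be expressed compactly in code. The Python sketch
	   below is an illustration only; the names UTF5_ALPHABET and
	   QUINTET_OF are assumptions introduced here. It also checks that
	   the table coincides with the digit values Python's built-in int()
	   uses for base 32.

```python
import string

# Quintet value -> character: '0'-'9' for 0..9, then 'A'-'V' for 10..31.
UTF5_ALPHABET = string.digits + string.ascii_uppercase[:22]

# Character -> quintet value, for the reverse lookup.
QUINTET_OF = {ch: q for q, ch in enumerate(UTF5_ALPHABET)}

# Cross-check: the table matches Python's own base-32 digit values.
assert all(int(ch, 32) == q for ch, q in QUINTET_OF.items())

print(UTF5_ALPHABET[0b10000], UTF5_ALPHABET[0b11111])
```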

130	   2.3 Encoding from UCS-4 to UTF-5

132	   1) Determine the required number of octets from the character value.
133	      Let U be the UCS-4 value; the required number of octets is 1 if
134	      U is 0, and floor(log16(U)) + 1 otherwise.

136	   2) Prepare the quintet binary sequence. Set the highest-order bit
137	      of the first quintet to 1 and the highest-order bit of each of
138	      the remaining quintets to 0.

140	   3) Fill in the bits marked x from the bits of the character value,
141	      starting from the lower-order bits of the character value and
142	      putting them first in the last quintet of the sequence, then the
143	      next to last, etc., until all x bits are filled in.

145	   4) For each quintet, apply the lookup table in Section 2.2 to get
146	      the corresponding alphanumeric character.
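
	   The four steps above can be sketched in Python as follows. This is
	   a minimal illustration, not a reference implementation: the names
	   utf5_encode_char and utf5_encode are hypothetical, and the sketch
	   assumes its input is a Python string of Unicode code points.

```python
UTF5_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # table from Section 2.2


def utf5_encode_char(value: int) -> str:
    """Encode one UCS-4 value (0..0x7FFFFFFF) as a UTF-5 string."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    # Step 1: one quintet per hexadecimal digit, and at least one quintet.
    count = max(1, (value.bit_length() + 3) // 4)
    # Step 2: flag bit 1 on the first quintet, 0 on all the others.
    quintets = [0x10] + [0x00] * (count - 1)
    # Step 3: fill the x bits from the low-order end, last quintet first.
    for index in range(count - 1, -1, -1):
        quintets[index] |= value & 0x0F
        value >>= 4
    # Step 4: map each quintet to its alphanumeric character.
    return "".join(UTF5_ALPHABET[q] for q in quintets)


def utf5_encode(text: str) -> str:
    """Encode every code point of `text` and concatenate the results."""
    return "".join(utf5_encode_char(ord(ch)) for ch in text)


print(utf5_encode("A\u2262\u0391."))
```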

148	   2.4 Decoding UTF-5 to UCS-4

150	   1) Determine the length of the octet sequence. In the UTF-5
151	      encoding, every character has an initial octet in the range 'G'
152	      to 'V'. Thus, the sequence ends just before the next octet in
153	      the range 'G' to 'V', or at the end of the UTF-5 string.

155	   2) Apply the reverse lookup according to the table in Section 2.2
156	      to get the quintet binary sequence.

158	   3) Initialize the 4 octets of the UCS-4 character with all bits set
159	      to 0.

162	   4) Distribute the bits from the sequence to the UCS-4 character,
163	      first the lower-order bits from the last quintet of the sequence,
164	      then proceeding to the left until no x bits are left.

166	      If the UTF-5 sequence is no more than four octets long, decoding
167	      can proceed directly to UCS-2 (or equivalently Unicode).
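
	   A decoder following these steps might look like the Python sketch
	   below. The function name utf5_decode is hypothetical. Returning a
	   list of integer character values and raising ValueError on
	   malformed input are assumptions of this sketch; the memo does not
	   specify error handling.

```python
UTF5_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # table from Section 2.2


def utf5_decode(data: str) -> list:
    """Decode a UTF-5 string into a list of UCS-4 character values."""
    values = []
    current = None  # value being accumulated, or None before the first octet
    length = 0      # number of octets consumed for the current character
    for ch in data:
        quintet = UTF5_ALPHABET.find(ch)  # reverse lookup in the table
        if quintet < 0:
            raise ValueError("not a UTF-5 character: %r" % ch)
        if quintet & 0x10:
            # 'G'..'V': an initial octet, so the previous character is done.
            if current is not None:
                values.append(current)
            current, length = quintet & 0x0F, 1
        elif current is None:
            raise ValueError("string does not start with 'G'..'V'")
        else:
            # '0'..'F': shift in four more low-order bits.
            current, length = (current << 4) | quintet, length + 1
            if length > 8:
                raise ValueError("sequence longer than 8 octets")
    if current is not None:
        values.append(current)
    return values


print([hex(v) for v in utf5_decode("M5E5M72COA9E")])
```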

169	   2.5 Detecting UTF-5 string

171	   As a UTF-5 string is an alphanumeric string, it is difficult to
172	   differentiate between a normal ASCII document and a UTF-5 document.

174	   Nevertheless, if the string is sufficiently long, it is possible to
175	   detect a UTF-5 string to some extent, based on the facts that:
176	   1. UTF-5 strings contain only characters within '0'-'9' and 'A'-'V'.
177	   2. Every character in a UTF-5 string begins with an octet 'G' to 'V'.
178	   3. The 'G' character occurs only as a complete one-octet sequence.
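
	   The three observations above can be combined into a single pattern.
	   The Python sketch below is a heuristic only, and the name
	   looks_like_utf5 is hypothetical. It accepts any string that is
	   structurally valid UTF-5, so ordinary uppercase words may also
	   pass, which illustrates why detection is difficult. It does not
	   check that 8-octet sequences stay within 31 bits.

```python
import re

# Either a lone 'G', or an initial octet 'H'..'V' followed by up to seven
# continuation octets '0'..'9' / 'A'..'F'.
_UTF5_PATTERN = re.compile(r"(?:G|[H-V][0-9A-F]{0,7})+")


def looks_like_utf5(text: str) -> bool:
    """Return True if `text` is non-empty and structurally valid UTF-5."""
    return _UTF5_PATTERN.fullmatch(text) is not None


for sample in ("K1I262J91IE", "HELLO", "hello", "1ABC", "G1"):
    print(sample, looks_like_utf5(sample))
```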

180	3. Examples of UTF-5

182	   The Unicode sequence "A<NOT IDENTICAL TO><ALPHA>." (0041, 2262,
183	   0391, 002E) may be encoded as follows:

185	   "K1I262J91IE"

187	   The Unicode sequence "Hi Mom <WHITE SMILING FACE>!" (0048, 0069,
188	   0020, 004D, 006F, 006D, 0020, 263A, 0021) may be encoded as follows:

190	   "K8M9I0KDMFMDI0I63AI1"

192	   The Unicode sequence representing the Han characters for the
193	   Japanese word "nihongo" (65E5, 672C, 8A9E) may be encoded as
194	   follows:

196	   "M5E5M72COA9E"

198	   The examples show that there is a shortcut to the UTF-5
199	   transformation, which goes like this:

201	   1. Write the character value in hexadecimal, without leading zeros.
202	   2. Replace the first character of the hexadecimal string: 0 becomes
203	      G, 1 becomes H, 2 becomes I, ... F becomes V.

205	   This yields the UTF-5 string for the Unicode character.
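
	   In Python, the shortcut amounts to a few lines. This is a sketch
	   under the assumption that the value is a non-negative integer; the
	   name utf5_shortcut is hypothetical.

```python
def utf5_shortcut(value: int) -> str:
    """Encode one character value using the hexadecimal shortcut."""
    hex_digits = format(value, "X")  # uppercase hex, no leading zeros
    # Replace the first hex digit: 0 -> G, 1 -> H, ..., F -> V.
    first = "GHIJKLMNOPQRSTUV"[int(hex_digits[0], 16)]
    return first + hex_digits[1:]


print(utf5_shortcut(0x263A))
```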

207	4. Applications

209	   There are many applications in which UTF-5 would be useful for
210	   internationalization ("i18n"). Some possible uses follow.

213	   a. Internationalization of the Domain Name System

215	   In the Domain Name System, although the technical standard does not
216	   prevent 8-bit characters from being used in domain names, general
217	   use of the system restricts valid domain names to A-Z (upper and
218	   lower case), 0-9 and "-". This poses great difficulty for i18n of
219	   domain names, as the current UTF-7, UTF-8 and UTF-16 are not
220	   compatible with the existing software already in use.

222	   Please see draft-xxx-xxx-xxx.txt for a detailed discussion of the
223	   Internationalization of the Domain Name System ("iDNS").
224	   http://www.idns.org/

226	   b. Internationalization of Simple Mail Transfer Protocol Addresses

228	   While it is possible to send SMTP mail in different languages and
229	   character sets using Multipurpose Internet Mail Extensions
230	   [MIME], internationalizing the SMTP mail address remains a
231	   challenge. Internationalization of SMTP addresses faces two
232	   barriers: 1. the internationalization of the Domain Name System,
233	   and 2. the internationalization of the mailbox or username.
234	   SMTP mailboxes are subject to very strict checks [RFC822] due to
235	   the many potential security risks of using symbols or special
236	   characters in mailboxes. UTF-5 will allow Unicode to be used
237	   immediately in mailboxes with minimal change to systems and
238	   without additional security risks.

240	   Please see draft-xxx-xxx-xxx.txt for a detailed discussion of the
241	   Internationalization of Simple Mail Transfer Protocol Addresses
242	   ("iMail").

244	   Internationalization of URIs is not discussed in this memo. Please
245	   refer to http://www.w3.org/International/0-URL-and-ident.html.

247	   Uses for UTF-5 also extend beyond the Internet to older legacy
248	   systems such as telegram systems or even Morse code, allowing
249	   multilingual characters to be transmitted.

251	5. Security Considerations

253	   This memo does not currently address any security considerations.

255	6. Acknowledgments

257	   UTF-5 was first defined by Martin Duerst at the University of Zurich
258	   in draft-duerst-dns-i18n-00.txt.

260	   Contributors (not in any order):
261	   Marc Blanchet <Marc.Blanchet@viagenic.qc.ca>

264	7. Bibliography

266	   [ISO-10646]    ISO/IEC 10646-1:1993. International Standard --
267	   [UTF16]        Information technology -- Universal Multiple-Octet
268	                  Coded Character Set (UCS) -- Part 1: Architecture
269	                  and Basic Multilingual Plane. UTF-8 is described in
270	                  Annex R, adopted but not yet published. UTF-16 is
271	                  described in Annex Q, adopted but not yet published.

273	   [UTF7]         Goldsmith, D., and M. Davis, "UTF-7: A Mail-safe
274	                  Transformation Format of Unicode", RFC 1642,
275	                  Taligent, Inc., July 1994.

277	   [UTF8]         F. Yergeau "UTF-8: a transformation format of Unicode
278	                  and ISO 10646", RFC2044, Alis Technologies, October
279	                  1996.

281	   [UNICODE]      The Unicode Consortium, "The Unicode Standard --
282	                  Worldwide Character Encoding -- Version 1.0",
283	                  Addison-Wesley, Volume 1, 1991, Volume 2, 1992.

285	   [US-ASCII]     Coded Character Set--7-bit American Standard Code for
286	                  Information Interchange, ANSI X3.4-1986.

288	   [DNS]          P. Mockapetris "Domain Names - Concepts and
289	                  Facilities", RFC1034, ISI, November 1987, "Domain
290	                  Names - Implementation and Specification", RFC1035,
291	                  ISI, November 1987.

293	   [SMTP]         Jonathan B. Postel "Simple Mail Transfer Protocol",
294	   [RFC822]       RFC821, ISI, August 1982. David H. Crocker "Standard
295	                  for ARPA Internet Text Messages", RFC822, Dept of
296	                  Electrical Engineering, University of Delaware,
297	                  August 1982.

299	   [MIME]         "Multipurpose Internet Mail Extensions", RFC1341,
300	                  N. Borenstein, Bellcore, N. Freed, Innosoft, June
301	                  1992.

303	   [IETFPC]       "IETF Policy on Character Sets and Languages",
304	                  RFC2277 BCP18, H. Alvestrand, Jan 1998.

307	8. Authors' Addresses

309	   James C.H Seng
310	   BioInformatrix Pte Ltd
311	   102 Elm Street
312	   Menlo Park CA 94025

314	   Tel: (650) 322-6505
315	   E-mail: jseng@pobox.org.sg

317	   Martin J. Duerst
318	   World Wide Web Consortium
319	   Keio Research Institute at SFC
320	   Keio University
321	   Fujisawa
322	   252-8520 Japan

324	   Tel: +81 446 49 11 70
325	   E-mail: mduerst@w3.org

327	   NOTE -- Please write the author's name with u-Umlaut wherever
328	   possible, e.g. in HTML as D&uuml;rst.

330	   Tin Wee Tan, Dr
331	   National University of Singapore
332	   c/o BioInformatic Center
333	   National University Hospital
334	   Lower Kent Ridge Road
335	   Singapore 119074

337	   Tel: +65 774 7149
338	   E-mail: tinwee@post1.com

340	   This memo is also archived at http://www.idns.org/technical.html
