idnits 2.17.1 

draft-ietf-idn-utf6-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 452 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 14 instances of too long lines in the document, the longest
     one being 4 characters in excess of 72.

  == There are 3 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (May 16, 2001) is 8380 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'IDNRACE' is mentioned on line 119, but not defined

  == Missing Reference: '0123456789abcdef' is mentioned on line 255, but not
     defined

  -- Looks like a reference, but probably isn't: '0123456789' on line 262

  == Unused Reference: 'IDNCOMP' is defined on line 380, but no explicit
     reference was found in the text

  == Unused Reference: 'IDNNAMEPREP' is defined on line 386, but no explicit
     reference was found in the text

  -- Possible downref: Normative reference to a draft: ref. 'IDNCOMP' 

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNREQ' 

  -- Possible downref: Normative reference to a draft: ref. 'IDNDUERST' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE3'


     Summary: 4 errors (**), 0 flaws (~~), 7 warnings (==), 9 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Engineering Task Force (IETF)                         Mark Welter
2	INTERNET-DRAFT                                          Brian W. Spolarich
3	draft-ietf-idn-utf6-00                                         WALID, Inc.
4	November 16, 2000                                     Expires May 16, 2001

6	        UTF-6 - Yet Another ASCII-Compatible Encoding for IDN

8	Status of this memo

10	This document is an Internet-Draft and is in full conformance with all
11	provisions of Section 10 of RFC2026.

13	Internet-Drafts are working documents of the Internet Engineering Task
14	Force (IETF), its areas, and its working groups. Note that other
15	groups may also distribute working documents as Internet-Drafts.

17	Internet-Drafts are draft documents valid for a maximum of six months
18	and may be updated, replaced, or obsoleted by other documents at any
19	time. It is inappropriate to use Internet-Drafts as reference
20	material or to cite them other than as "work in progress."

22	     The list of current Internet-Drafts can be accessed at
23	     http://www.ietf.org/ietf/1id-abstracts.txt

25	     The list of Internet-Draft Shadow Directories can be accessed at
26	     http://www.ietf.org/shadow.html.

28	The distribution of this document is unlimited.

30	Copyright (c) The Internet Society (2000).  All Rights Reserved.

32	Abstract

34	This document describes a transformation method for representing
35	Unicode character codepoints in host name parts in a fashion that is
36	completely compatible with the current Domain Name System.  It is
37	proposed as a potential candidate for an ASCII-Compatible Encoding (ACE)
38	for supporting the deployment of an internationalized Domain Name System.
39	The transformation method, an extension of the UTF-5 encoding proposed by
40	Duerst, provides a more efficient representation of typical Unicode
41	sequences while preserving simplicity and readability.  This transformation
42	method is deployed as part of the current WALID multilingual domain name
43	system implementation, although that status should not necessarily influence
44	the evaluation of its merits as a candidate encoding method.

46	Table of Contents

48	1.        Introduction
49	1.1         Terminology
50	2.        Hostname Part Transformation
51	2.1         Post-Converted Name Prefix
52	2.2         Hostname Preparation
53	2.3         Definitions
54	2.4         UTF-6 Encoding
55	2.4.1         Variable Length Hex Encoding
56	2.4.2         UTF-6 Compression Algorithm
57	2.4.3         Forward Transformation Algorithm
58	2.5         UTF-6 Decoding
59	2.5.1         Variable Length Hex Decoding
60	2.5.2         UTF-6 Decompression Algorithm
61	2.5.3         Reverse Transformation Algorithm
62	3.        Examples
63	3.1         'www.walid.com' (in Arabic)
64	4.        Security Considerations
65	5.        References

67	1.  Introduction

69	UTF-6 describes an encoding scheme of the ISO/IEC 10646 [ISO10646]
70	character set (whose character code assignments are synchronized
71	with Unicode [UNICODE3]), and the procedures for using this scheme
72	to transform host name parts containing Unicode character sequences
73	into sequences that are compatible with the current DNS protocol
74	[STD13].  As such, it satisfies the definition of a 'charset' as
75	defined in [IDNREQ].

77	1.1  Terminology

79	The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
80	"MAY" in this document are to be interpreted as described in RFC 2119
81	[RFC2119].

83	Hexadecimal values are shown preceded with an "0x". For example,
84	"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
85	shown preceded with an "0b". For example, a nine-bit value might be
86	shown as "0b101101111".

88	Examples in this document use the notation from the Unicode Standard
89	[UNICODE3] as well as the ISO 10646 names. For example, the letter "a"
90	may be represented as either "U+0061" or "LATIN SMALL LETTER A".

92	UTF-6 converts strings with internationalized characters into
93	strings of US-ASCII that are acceptable as host name parts in current
94	DNS host naming usage. The former are called "pre-converted" and the
95	latter are called "post-converted".  This specification defines both
96	a forward and reverse transformation algorithm.

98	2.  Hostname Part Transformation

100	According to [STD13], hostname parts must be case-insensitive, start and
101	end with a letter or digit, and contain only letters, digits, and the
102	hyphen character ("-"). This, of course, excludes most characters used
103	by non-English speakers, as well as many other characters in
104	the ASCII character repertoire. Further, domain name parts must be
105	63 octets or shorter in length.
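
The constraints above can be checked mechanically.  The following is a
minimal Python sketch and is not part of this specification.  The name
is_valid_hostname_part and the regular expression are illustrative
assumptions; they cover only the letter/digit/hyphen rule and the
63-octet limit stated in this section.

```python
import re

# Assumed pattern: one letter or digit, optionally followed by up to 61
# letters, digits or hyphens and a final letter or digit (63 octets max).
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname_part(part: str) -> bool:
    """Return True if 'part' satisfies the rules summarized above."""
    return LDH_LABEL.fullmatch(part) is not None


print(is_valid_hostname_part("wq--ymk5k8k2j9"))  # a post-converted part
print(is_valid_hostname_part("-leading-hyphen"))
```

Case-insensitivity is a property of name comparison rather than of
syntax, so the sketch accepts both upper and lower case letters.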

107	2.1  Post-Converted Name Prefix

109	This document defines the string 'wq--' as a prefix to identify
110	UTF-6-encoded sequences.  For the purposes of comparison in the IDN
111	Working Group activities, the 'wq--' prefix should be used solely to
112	identify UTF-6 sequences.  However, should this document proceed beyond
113	draft status the prefix should be changed to whatever prefix, if any,
114	is the final consensus of the IDN working group.

116	Note that the prepending of a fixed identifier sequence is only one
117	mechanism for differentiating ASCII character encoded international
118	domain names from 'ordinary' domain names.  One method, as proposed in
119	[IDNRACE], is to include a character prefix or suffix that does not
120	appear in any name in any zone file.  A second method is to insert a
121	domain component which pushes off any international names one or more
122	levels deeper into the DNS hierarchy.  There are trade-offs between
123	these two methods which are independent of the Unicode to ASCII
124	transcoding method finally chosen.  We do not address the international
125	vs. 'ordinary' name differentiation issue in this paper.

127	2.2  Hostname Preparation

129	The hostname part is assumed to contain at least one character disallowed
130	by [STD13], and to have been processed for logically equivalent
131	character mapping, filtering of disallowed characters (if any), and
132	compatibility composition/decomposition before presentation to the UTF-6
133	conversion algorithm.

135	While it is possible to invent a transcoding mechanism that relies
136	on certain Unicode characters being deemed illegal within domain names
137	and hence available to the transcoding mechanism for improving encoding
138	efficiency, we feel that such a proposal would complicate matters
139	excessively.  We also believe that Unicode name preprocessing for
140	both name resolution and name registration should be considered as
141	separate, independent issues, which we will attempt to address in a
142	separate document.

144	2.3  Definitions

146	For clarity:

148	  'integer' is an unsigned binary quantity;
149	  'byte' is an 8-bit integer quantity;
150	  'nibble' is a 4-bit integer quantity.

152	2.4  UTF-6 Encoding

154	The idea behind this scheme was to improve on the UTF-5 transformation
155	algorithm described in [IDNDUERST] by providing a straightforward
156	compression mechanism.  UTF-6 defines a compression mechanism by
157	identifying identical leading byte or nibble values in the pre-converted
158	string, and using the length of this leading value to select a mask which
159	can be applied to the pre-converted string.  The resulting post-converted
160	string preserves the simplicity and readability of UTF-5 while
161	enabling longer sequences to be encoded into a single host name part.

163	2.4.1  Variable Length Hex Encoding

165	The variable length hex encoding algorithm was introduced by Duerst in
166	[IDNDUERST].  It encodes an integer value in a slight modification of
167	traditional hexadecimal notation, the difference being that the most
168	significant digit is represented with an alternate set of "digits"
169	-- 'g' through 'v' are used to represent 0 through 15.  The result is a
170	variable length encoding which can efficiently represent integers of
171	arbitrary length.

173	The variable length nibble encoding of an integer, C, is defined
174	as follows:

176	  1.  Skip over leading non-significant zero nibbles to find I,
177	      the first significant nibble of C;

179	  2.  Emit the Ith character of the set [ghijklmnopqrstuv];

181	  3.  Continue from most to least significant, encoding each remaining
182	      nibble J by emitting the Jth character of the set [0123456789abcdef].

184	Examples:

186	  0x1f4c    is encoded as "hf4c"
187	  0x0624    is encoded as "m24"
188	  0x0000    is encoded as "g"
189	  'n'       a single character in single quotes stands for the
190	            Unicode code point for that character.
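
The three steps above can be sketched in Python as follows.  This is
an illustration only, not a normative implementation; the names
vlhex_encode, LEAD_DIGITS and the use of format() to drop leading zero
nibbles are assumptions of the sketch.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"  # alternate digits for the leading nibble


def vlhex_encode(value: int) -> str:
    """Variable length hex encoding of a non-negative integer."""
    if value < 0:
        raise ValueError("value must be non-negative")
    # Step 1: plain hex without leading zero nibbles (zero becomes "0").
    digits = format(value, "x")
    # Step 2: the first significant nibble uses the alternate digit set.
    lead = LEAD_DIGITS[int(digits[0], 16)]
    # Step 3: the remaining nibbles are ordinary lower case hex digits.
    return lead + digits[1:]


for sample in (0x1F4C, 0x0624, 0x0000):
    print(hex(sample), vlhex_encode(sample))
```

Because only the first character of each encoded value comes from the
'g' through 'v' set, a decoder can find value boundaries without a
separator.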

192	2.4.2  UTF-6 Compression Algorithm

194	UTF-6 improves on the UTF-5 encoding by providing compression, which
195	enables encoding of a larger number of characters in each hostname
196	part.  The compression algorithm is defined as follows:

198	  1.  Set the mask to 0xFFFF;

200	  2.  If the number of non '-' characters is less than 2, proceed to
201	      step 5;

203	  3.  If the most significant byte of every non '-' character is the
204	      same value:

206	      3a.  Set HB to this value;
207	      3b.  Emit 'Y';
208	      3c.  Emit the variable length hex encoding of HB;
209	      3d.  Set the mask to 0x00FF;
210	      3e.  Proceed to step 5.

212	  4.  If the most significant nibble of every non '-' character is the
213	      same value:

215	      4a.  Set HN to this value;
216	      4b.  Emit 'Z';
217	      4c.  Emit the variable length hex encoding of HN;
218	      4d.  Set the mask to 0x0FFF.

220	  5.  For each input character:

222	      5a.  Set HN to the result of the bitwise AND of the input
223	           character and the mask;
224	      5b.  Emit the variable length nibble encoding of HN.
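
A possible Python rendering of the compression steps is given below as
a sketch under stated assumptions.  The function names are
illustrative.  Two choices are not spelled out in the steps above and
are assumptions of the sketch: a '-' in the input is copied through
unchanged (mirroring the decompression algorithm in section 2.5.2),
and the 'Y' and 'Z' markers are emitted in lower case to match the
example in section 3.1.

```python
from typing import Sequence

LEAD_DIGITS = "ghijklmnopqrstuv"
HYPHEN = 0x2D  # U+002D, passed through literally (assumption)


def vlhex_encode(value: int) -> str:
    digits = format(value, "x")
    return LEAD_DIGITS[int(digits[0], 16)] + digits[1:]


def utf6_compress(units: Sequence[int]) -> str:
    """Compress and encode one hostname part given as 16-bit values."""
    mask = 0xFFFF                                    # step 1
    out = []
    significant = [u for u in units if u != HYPHEN]
    if len(significant) >= 2:                        # step 2
        high_bytes = {u >> 8 for u in significant}
        high_nibbles = {u >> 12 for u in significant}
        if len(high_bytes) == 1:                     # step 3
            out.append("y" + vlhex_encode(high_bytes.pop()))
            mask = 0x00FF
        elif len(high_nibbles) == 1:                 # step 4
            out.append("z" + vlhex_encode(high_nibbles.pop()))
            mask = 0x0FFF
    for u in units:                                  # step 5
        out.append("-" if u == HYPHEN else vlhex_encode(u & mask))
    return "".join(out)


print(utf6_compress([0x0645, 0x0648, 0x0642, 0x0639]))
```

An uncompressed result can never begin with 'y' or 'z', since every
encoded value starts with a character from 'g' through 'v'; this is
what lets a decoder recognize the marker.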

226	2.4.3  Forward Transformation Algorithm

228	The UTF-6 transformation algorithm accepts a string in UTF-16
229	[ISO10646] format as input.  The encoding algorithm is as follows:

231	  1.  Break the hostname string into dot-separated hostname parts.
232	      For each hostname part, perform steps 2 and 3 below;

234	  2.  Compress the component using the method described in section
235	      2.4.2 above, and encode using the encoding described in section 2.4.1;

237	  3.  Prepend the post-converted name prefix 'wq--' (see section 2.1
238	      above) to the resulting string.
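
The forward transformation can be sketched end to end in Python as
shown below.  This is an illustration, not a reference implementation.
All function names are hypothetical, '-' is assumed to pass through
unchanged, the compression markers are emitted in lower case, and
every part is encoded unconditionally on the assumption (section 2.2)
that each part contains at least one disallowed character.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"
ACE_PREFIX = "wq--"
HYPHEN = 0x2D


def vlhex_encode(value):
    digits = format(value, "x")
    return LEAD_DIGITS[int(digits[0], 16)] + digits[1:]


def utf6_compress(units):
    mask, out = 0xFFFF, []
    significant = [u for u in units if u != HYPHEN]
    if len(significant) >= 2:
        if len({u >> 8 for u in significant}) == 1:
            out.append("y" + vlhex_encode(significant[0] >> 8))
            mask = 0x00FF
        elif len({u >> 12 for u in significant}) == 1:
            out.append("z" + vlhex_encode(significant[0] >> 12))
            mask = 0x0FFF
    out.extend("-" if u == HYPHEN else vlhex_encode(u & mask) for u in units)
    return "".join(out)


def utf16_units(text):
    """Split a Python string into its UTF-16 code units."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def utf6_encode_hostname(hostname):
    # Step 1: split on dots; steps 2 and 3: compress, encode, add prefix.
    parts = hostname.split(".")
    return ".".join(ACE_PREFIX + utf6_compress(utf16_units(p)) for p in parts)


name = "\u0645\u0648\u0642\u0639.\u0648\u0644\u064a\u062f.\u0634\u0631\u0643\u0629"
print(utf6_encode_hostname(name))
```

The sample input is the Arabic name used in section 3.1.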

240	2.5  UTF-6 Decoding

242	2.5.1  Variable Length Hex Decoding

244	  1.  Let N be the lower case of the first input character;

246	      If N is not in set [ghijklmnopqrstuv] return error,
247	        else consume the input character;

249	  2.  Let R = N - 'g';

251	  3.  If another input character exists,
252	        then let N be the lower case of the next input character,
253	        else go to Step 9;

255	  4.  If N is not in the set [0123456789abcdef], go to Step 9;

257	  5.  Let N = the lower case of the next input character and consume
258	      the input character;

260	  6.  Let R = R * 16;

262	  7.  If N is in set [0123456789],
263	        then let R = R + (N - '0'),
264	        else let R = R + (N - 'a') + 10;

266	  8.  Go to step 3;

268	  9.  Return decoded result R.
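
The decoding steps above can be sketched in Python as follows.  The
sketch is illustrative only.  The function name vlhex_decode and its
convention of returning the decoded value together with the position
of the first unconsumed character are assumptions made so that a
caller can decode several values in sequence.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"
HEX_DIGITS = "0123456789abcdef"


def vlhex_decode(text: str, pos: int = 0):
    """Decode one value starting at text[pos]; return (value, next_pos)."""
    if pos >= len(text):
        raise ValueError("no input left to decode")
    # Steps 1 and 2: the first character must come from 'g' through 'v'.
    result = LEAD_DIGITS.find(text[pos].lower())
    if result < 0:
        raise ValueError("invalid leading digit %r" % text[pos])
    pos += 1
    # Steps 3 to 8: accumulate ordinary hex digits until one is not found.
    while pos < len(text):
        digit = HEX_DIGITS.find(text[pos].lower())
        if digit < 0:
            break
        result = result * 16 + digit
        pos += 1
    # Step 9: return the decoded result.
    return result, pos


print(vlhex_decode("hf4c"))
print(vlhex_decode("m24k5"))
```

In the second call decoding stops at 'k', which begins the next value.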

270	2.5.2  UTF-6 Decompression Algorithm

272	  1.  Let N be the lower case of the first input character;

274	  2.  If N != 'y' and N != 'z',

276	      2a.  Let CPART be 0;
277	      2b.  Let VMAX be 0xFFFF;

279	      This is the no-compression case;

281	  3.  If N == 'y',

283	      3a.  Let M be the variable length hex decoding of the next
284	           character;
285	      3b.  Let CPART be the result of M * 0x0100;
286	      3c.  Let VMAX be 0x00FF;
287	      3d.  Continue to Step 5;

289	  4.  If N == 'z',

291	      4a.  Let M be the variable length hex decoding of the next
292	           character;
293	      4b.  Let CPART be the result of M * 0x1000;
294	      4c.  Let VMAX be 0x0FFF;
295	      4d.  Continue to Step 5;

297	  5.  While another input character exists, let N be the lower case of
298	      the next input character, and do the following:

300	      5a.  If N == '-' consume the character and
301	             then append '-' to the result string,
302	             else let VPART be the next variable hex decoded value;
303	      5b.  If VPART > VMAX, return error,
304	             else append CPART + VPART to the result string;

306	  6.  Return the result string.
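
A Python sketch of the decompression steps follows.  It is offered as
an illustration under assumptions rather than as a definitive
implementation: the function names are hypothetical, and the result is
returned as a list of 16-bit values instead of a string.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"
HEX_DIGITS = "0123456789abcdef"


def vlhex_decode(text, pos):
    if pos >= len(text):
        raise ValueError("no input left to decode")
    result = LEAD_DIGITS.find(text[pos].lower())
    if result < 0:
        raise ValueError("invalid leading digit %r" % text[pos])
    pos += 1
    while pos < len(text):
        digit = HEX_DIGITS.find(text[pos].lower())
        if digit < 0:
            break
        result, pos = result * 16 + digit, pos + 1
    return result, pos


def utf6_decompress(text):
    """Decode one part (prefix already removed) into 16-bit values."""
    pos, cpart, vmax = 0, 0, 0xFFFF          # step 2: no compression
    first = text[:1].lower()
    if first == "y":                         # step 3: shared high byte
        m, pos = vlhex_decode(text, 1)
        cpart, vmax = m * 0x0100, 0x00FF
    elif first == "z":                       # step 4: shared high nibble
        m, pos = vlhex_decode(text, 1)
        cpart, vmax = m * 0x1000, 0x0FFF
    result = []
    while pos < len(text):                   # step 5
        if text[pos] == "-":
            result.append(0x2D)
            pos += 1
            continue
        vpart, pos = vlhex_decode(text, pos)
        if vpart > vmax:
            raise ValueError("value exceeds the range allowed by the mask")
        result.append(cpart + vpart)
    return result                            # step 6


print([hex(u) for u in utf6_decompress("ymk5k8k2j9")])
```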

308	2.5.3  Reverse Transformation Algorithm

310	  1.  Break the string into dot-separated components and apply Steps
311	      2 through 4 to each component;

313	  2.  Check for legality (in terms of RFC1035 permitted characters) and
314	      return error status if illegal;

316	  3.  Remove the post converted name prefix 'wq--' (see Section 2.1);

318	  4.  Decompress the component using the decompression algorithm
319	      described above;

321	  5.  Concatenate the decoded segments with dot separators and return.
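
The reverse transformation can be sketched in Python as below.  This
is a hedged illustration, not a reference implementation.  All names
are hypothetical, the legality check is an assumed regular expression
for the letter/digit/hyphen rule, and components that do not carry the
'wq--' prefix are returned unchanged, a case the steps above do not
address.

```python
import re

LEAD_DIGITS = "ghijklmnopqrstuv"
HEX_DIGITS = "0123456789abcdef"
ACE_PREFIX = "wq--"
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def vlhex_decode(text, pos):
    if pos >= len(text) or text[pos].lower() not in LEAD_DIGITS:
        raise ValueError("invalid variable length hex value")
    result, pos = LEAD_DIGITS.index(text[pos].lower()), pos + 1
    while pos < len(text) and text[pos].lower() in HEX_DIGITS:
        result, pos = result * 16 + HEX_DIGITS.index(text[pos].lower()), pos + 1
    return result, pos


def utf6_decompress(text):
    pos, cpart, vmax = 0, 0, 0xFFFF
    first = text[:1].lower()
    if first in ("y", "z"):
        m, pos = vlhex_decode(text, 1)
        cpart, vmax = (m * 0x100, 0xFF) if first == "y" else (m * 0x1000, 0xFFF)
    units = []
    while pos < len(text):
        if text[pos] == "-":
            units.append(0x2D)
            pos += 1
            continue
        vpart, pos = vlhex_decode(text, pos)
        if vpart > vmax:
            raise ValueError("value exceeds the range allowed by the mask")
        units.append(cpart + vpart)
    return units


def utf6_decode_hostname(hostname):
    decoded = []
    for part in hostname.split("."):                 # step 1
        if LDH_LABEL.fullmatch(part) is None:        # step 2
            raise ValueError("illegal hostname part %r" % part)
        if part[:4].lower() != ACE_PREFIX:
            decoded.append(part)                     # assumption: pass through
            continue
        units = utf6_decompress(part[4:])            # steps 3 and 4
        raw = b"".join(u.to_bytes(2, "big") for u in units)
        decoded.append(raw.decode("utf-16-be"))
    return ".".join(decoded)                         # step 5


print(utf6_decode_hostname("wq--ymk5k8k2j9.wq--ymk8k4kaif.wq--ymj4j1k3i9"))
```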

323	3.  Examples

325	The examples below illustrate the encoding algorithm and provide
326	comparisons to alternate encoding schemes.  UTF-5 sequences are
327	prefixed with '----', as no ACE prefix was defined for that encoding.

329	3.1  'www.walid.com' (in Arabic):

331	  UTF-16:  U+0645 U+0648 U+0642 U+0639 . U+0648 U+0644 U+064A U+062F .
332	           U+0634 U+0631 U+0643 U+0629

334	  UTF-6:   wq--ymk5k8k2j9.wq--ymk8k4kaif.wq--ymj4j1k3i9

336	  UTF-5:   ----m45m48m42m39.----m48m44m4am2f.----m34m31m43m29

338	  RACE:    bq--azcuqqrz.bq--azeeisrp.bq--ay2dcqzj

340	  LACE:    bq--aqdekscche.bq--aqdeqrckf5.bq--aqddimkdfe
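
The UTF-5 and UTF-6 rows above can be reproduced for the first
hostname part with a short Python sketch.  It is illustrative only;
the helper name vlhex_encode is an assumption, the ACE prefixes are
omitted, and the "ym" marker (shared high byte 0x06) is written out by
hand rather than derived.

```python
LEAD_DIGITS = "ghijklmnopqrstuv"


def vlhex_encode(value):
    digits = format(value, "x")
    return LEAD_DIGITS[int(digits[0], 16)] + digits[1:]


part = [0x0645, 0x0648, 0x0642, 0x0639]

# UTF-5: each character is encoded in full.
utf5 = "".join(vlhex_encode(c) for c in part)

# UTF-6: 'y' plus the encoded shared high byte, then the low bytes only.
utf6 = "y" + vlhex_encode(0x06) + "".join(vlhex_encode(c & 0x00FF) for c in part)

print(utf5, len(utf5))
print(utf6, len(utf6))
```

For this part the compressed form needs 10 characters where UTF-5
needs 12; the saving grows with the number of characters that share
the high byte.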

342	3.2  Mixed Hiragana and Kanji (SOREZORENOBASHO):

344	  UTF-16:  U+305D U+308C U+305E U+308C U+306E U+5834 U+6240

346	  UTF-6:

348	  UTF-5:

350	  RACE:    bq--4ayf3memgbpdbdbqnzmdiysa

352	  LACE:    bq--auyf4dc7rrxacwbuafrea

354	3.3  Currently Disallowed ASCII Characters ($OneBillionDollars!):

356	  UTF-16:  U+0024 U+004F U+006E U+0065 U+0042 U+0069 U+006C U+006C
357	           U+0069 U+006F U+006E U+0044 U+006F U+006C U+006C U+0061
358	           U+0072 U+0073 U+0021

360	  UTF-6:

362	  UTF-5:

364	  RACE:    bq--aase74tfijuwy4djn6xei44mnrqxe5zb

366	  LACE:    bq--cmacit4omvbgs4dmnfxw5rdpnrwgc5ttee

368	4.  Security Considerations

370	Much of the security of the Internet relies on the DNS and any
371	change to the characteristics of the DNS may change the security of
372	much of the Internet. Therefore UTF-6 makes no changes to the DNS itself.

374	UTF-6 is designed so that distinct Unicode sequences map to distinct
375	domain name sequences (modulo the Unicode and DNS equivalence rules).
376	Therefore use of UTF-6 with DNS will not negatively affect security.

378	5.  References

380	[IDNCOMP] Paul Hoffman, "Comparison of Internationalized Domain Name
381	Proposals", draft-ietf-idn-compare.

383	[IDNREQ] James Seng, "Requirements of Internationalized Domain Names",
384	draft-ietf-idn-requirement.

386	[IDNNAMEPREP] Paul Hoffman and Marc Blanchet, "Preparation of
387	Internationalized Host Names", draft-ietf-idn-nameprep.

389	[IDNDUERST] M. Duerst, "Internationalization of Domain Names",
390	draft-duerst-dns-i18n.

392	[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
393	technology -- Universal Multiple-Octet Coded Character Set (UCS) --
394	Part 1: Architecture and Basic Multilingual Plane.  Five amendments and
395	a technical corrigendum have been published up to now. UTF-16 is
396	described in Annex Q, published as Amendment 1. 17 other amendments are
397	currently at various stages of standardization.

399	[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
400	Requirement Levels", March 1997, RFC 2119.

402	[STD13] Paul Mockapetris, "Domain names - implementation and
403	specification", November 1987, STD 13 (RFC 1035).

405	[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
406	3.0", ISBN 0-201-61633-5. Described at
407	<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

409	A.  Acknowledgements

411	The structure (and some of the structural text) of this document is
412	intentionally borrowed from the LACE IDN draft (draft-ietf-idn-lace)
413	by Mark Davis and Paul Hoffman.

415	The 'SOREZORENOBASHO' example was taken from draft-ietf-idn-brace draft
416	by Adam Costello.

418	B.  IANA Considerations

420	There are no IANA considerations in this document.

422	C.  Author Contact Information

424	Mark Welter
425	Brian W. Spolarich
426	WALID, Inc.
427	State Technology Park
428	2245 S. State St.
429	Ann Arbor, MI  48104
430	+1-734-822-2020

432	mwelter@walid.com
433	briansp@walid.com
441	-----END PGP SIGNATURE-----