idnits 2.17.1 

draft-ietf-idn-lace-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 523 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 3 instances of too long lines in the document, the longest one
     being 8 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == Line 364 has weird spacing: '...   bits   char...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Normative reference to a draft: ref. 'IDNComp' 

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNReq' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode3'


     Summary: 4 errors (**), 0 flaws (~~), 4 warnings (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Draft                                            Mark Davis
2	draft-ietf-idn-lace-00.txt                                       IBM
3	November 6, 2000                                        Paul Hoffman
4	Expires May 6, 2001                                       IMC & VPNC

6	        LACE: Length-based ASCII Compatible Encoding for IDN

8	Status of this memo

10	This document is an Internet-Draft and is in full conformance with all
11	provisions of Section 10 of RFC2026.

13	Internet-Drafts are working documents of the Internet Engineering Task
14	Force (IETF), its areas, and its working groups. Note that other
15	groups may also distribute working documents as Internet-Drafts.

17	Internet-Drafts are draft documents valid for a maximum of six months
18	and may be updated, replaced, or obsoleted by other documents at any
19	time. It is inappropriate to use Internet-Drafts as reference
20	material or to cite them other than as "work in progress."

22	     The list of current Internet-Drafts can be accessed at
23	     http://www.ietf.org/ietf/1id-abstracts.txt

25	     The list of Internet-Draft Shadow Directories can be accessed at
26	     http://www.ietf.org/shadow.html.

28	Abstract

30	This document describes a transformation method for representing
31	non-ASCII characters in host name parts in a fashion that is completely
32	compatible with the current DNS. It is a potential candidate for an
33	ASCII-Compatible Encoding (ACE) for internationalized host names, as
34	described in the comparison document from the IETF IDN Working Group.
35	This method is based on the observation that many internationalized host
36	name parts will have a few substrings from a small number of rows of the
37	ISO 10646 repertoire. Run-length encoding for these types of
38	host names will be fairly compact, and is fairly easy to describe.

40	1. Introduction

42	There is a strong world-wide desire to use characters other than plain
43	ASCII in host names. Host names have become the equivalent of business
44	or product names for many services on the Internet, so there is a need
45	to make them usable by people whose native scripts are not representable
46	by ASCII. The requirements for internationalizing host names are
47	described in the IDN WG's requirements document, [IDNReq].

49	The IDN WG's comparison document [IDNComp] describes three potential
50	main architectures for IDN: arch-1 (just send binary), arch-2 (send
51	binary or ACE), and arch-3 (just send ACE). LACE, the Length-based ACE,
52	is an ACE that can be used with protocols that match arch-2
53	or arch-3. LACE specifies an ACE format as specified in ace-1 in
54	[IDNComp]. Further, it specifies an identifying mechanism for ace-2 in
55	[IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to the
56	beginning of the name part).

58	In formal terms, LACE describes a character encoding scheme of the
59	ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
60	characters is synchronized with Unicode [Unicode3]) and the rules for
61	using that scheme in the DNS. As such, it could also be called a
62	"charset" as defined in [IDNReq].

64	The LACE protocol has the following features:

66	- There is exactly one way to convert internationalized host parts to
67	and from LACE parts. Host name part uniqueness is preserved.

69	- Host parts that have no international characters are not changed.

71	- Names using LACE can include more internationalized characters than
72	with other ACE protocols that have been suggested to date. LACE-encoded
73	names are variable length, depending on the number of transitions
74	between rows in the ISO 10646 repertoire that appear in the name part.
75	Name parts that cannot be compressed using run-length encoding can have
76	up to 17 characters, and names that can be compressed can have up to 35
77	characters. Further, a name that has just a few row transitions
78	typically can have over 30 characters.

80	It is important to note that the following sections contain many
81	normative statements with "MUST" and "MUST NOT". Any implementation that
82	does not follow these statements exactly is likely to cause damage to
83	the Internet by creating non-unique representations of host names.

85	1.1 Terminology

87	The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
88	"MAY" in this document are to be interpreted as described in RFC 2119
89	[RFC2119].

91	Hexadecimal values are shown preceded with an "0x". For example,
92	"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
93	shown preceded with an "0b". For example, a nine-bit value might be
94	shown as "0b101101111".

96	Examples in this document use the notation from the Unicode Standard
97	[Unicode3] as well as the ISO 10646 names. For example, the letter "a"
98	may be represented as either "U+0061" or "LATIN SMALL LETTER A".

100	LACE converts strings with internationalized characters into
101	strings of US-ASCII that are acceptable as host name parts in current
102	DNS host naming usage. The former are called "pre-converted" and the
103	latter are called "post-converted".

105	1.2 IDN summary

107	Using the terminology in [IDNComp], LACE specifies an ACE format as
108	specified in ace-1. Further, it specifies an identifying mechanism for
109	ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
110	of the name part).

112	LACE has the following length characteristics. In this list, "row" means
113	a row from ISO 10646.

115	- LACE-encoded names are variable length, depending on the number of
116	transitions between rows that appear in the name part.

118	- Name parts that cannot be compressed using run-length encoding can
119	have up to 17 characters.

121	- Names that can be compressed can have up to 35 characters.

123	- A name that has just a few row transitions typically can have over 30
124	characters.

126	2. Host Part Transformation

128	According to [STD13], host parts must be case-insensitive, start and
129	end with a letter or digit, and contain only letters, digits, and the
130	hyphen character ("-"). This, of course, excludes any internationalized
131	characters, as well as many other characters in the ASCII character
132	repertoire. Further, domain name parts must be 63 octets or shorter in
133	length.

135	2.1 Name tagging

137	All post-converted name parts that contain internationalized characters
138	begin with the string "bq--". (Of course, because host name parts are
139	case-insensitive, this might also be represented as "Bq--" or "bQ--" or
140	"BQ--".) The string "bq--" was chosen because it is extremely unlikely
141	to exist in host parts before this specification was produced. As a
142	historical note, in late August 2000, none of the second-level host name
143	parts in any of the .com, .edu, .net, and .org top-level domains began
144	with "bq--"; there are many tens of thousands of other strings of three
145	characters followed by a hyphen that have this property and could be
146	used instead. The string "bq--" will change to other strings with the
147	same properties in future versions of this draft.

149	Note that a zone administrator might still choose to use "bq--" at the
150	beginning of a host name part even if that part does not contain
151	internationalized characters. Zone administrators SHOULD NOT create host
152	part names that begin with "bq--" unless those names are post-converted
153	names. Creating host part names that begin with "bq--" but that are not
154	post-converted names may cause two distinct problems. Some display
155	systems, after converting the post-converted name part back to an
156	internationalized name part, might display the name parts in a
157	possibly-confusing fashion to users. More seriously, some resolvers,
158	after converting the post-converted name part back to an
159	internationalized name part, might reject the host name if it contains
160	illegal characters.
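
The tag handling above can be sketched as follows. This is a minimal
illustration, not part of the specification; the names LACE_TAG,
has_lace_tag and strip_lace_tag are hypothetical, and the tag value is
the one used in this version of the draft.

```python
LACE_TAG = "bq--"  # tag from section 2.1; the draft says it may change


def has_lace_tag(label: str) -> bool:
    """Return True if the name part starts with the tag, ignoring case."""
    return label[:len(LACE_TAG)].lower() == LACE_TAG


def strip_lace_tag(label: str) -> str:
    """Remove the tag, or raise ValueError if the tag is missing."""
    if not has_lace_tag(label):
        raise ValueError("name part does not begin with the LACE tag")
    return label[len(LACE_TAG):]
```

Because host name parts are case-insensitive, the comparison lower-cases
the first four characters before comparing them with the tag.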

162	2.2 Converting an internationalized name to an ACE name part

164	To convert a string of internationalized characters into an ACE name
165	part, the following steps MUST be performed in the exact order of the
166	subsections given here.

168	If a name part consists exclusively of characters that conform to the
169	host name requirements in [STD13], the name MUST NOT be converted to
170	LACE. That is, a name part that can be represented without LACE MUST NOT
171	be encoded using LACE. This absolute requirement prevents there from
172	being two different encodings for a single DNS host name.
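
One way to test whether a name part already conforms to the host name
rules summarized in section 2 (and so MUST NOT be converted) is sketched
below. The function name is_plain_host_part is hypothetical, and the
regular expression is one reading of the letter, digit and hyphen rule,
not text taken from [STD13].

```python
import re

# Letters, digits and hyphen only; first and last must be a letter or digit.
_LDH = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_plain_host_part(part: str) -> bool:
    """Return True if the part is a legal host name part without LACE."""
    # A match contains only ASCII, so len() here equals the octet count.
    return len(part) <= 63 and _LDH.fullmatch(part) is not None
```

An encoder would skip conversion when this returns True, and a decoder
could apply the same test to its result to reject names that should never
have been encoded.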

174	If any checking for prohibited name parts (such as prohibited
175	characters, case-folding, or canonicalization) is to be done,
176	it MUST be done before doing the conversion to an ACE name part.

178	The input name string consists of characters from the ISO 10646
179	character set in big-endian UTF-16 encoding. This is the pre-converted
180	string.

182	Characters outside the first plane of characters
183	(those with codepoints above U+FFFF) MUST be represented using surrogates, as
184	described in the UTF-16 description in ISO 10646.
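
As an illustration of the input format, the sketch below produces the
pre-converted octets from a Python string. The function name
pre_converted_octets is hypothetical. Python's "utf-16-be" codec writes
no byte order mark and represents characters above U+FFFF as surrogate
pairs, which matches the input described here.

```python
def pre_converted_octets(name_part: str) -> bytes:
    """Encode a name part as big-endian UTF-16 with no byte order mark."""
    return name_part.encode("utf-16-be")


# Two characters from row 30, then one character outside the first plane.
bmp = pre_converted_octets("\u30e6\u30cb")
astral = pre_converted_octets("\U00010400")
```

Here bmp is the four octets <30 E6 30 CB>, and astral is the surrogate
pair <D8 01 DC 00>.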

186	2.2.1 Compress the pre-converted string

188	The entire pre-converted string MUST be compressed using the compression
189	algorithm specified in section 2.4. The result of this step is the
190	compressed string.

192	2.2.2 Check the length of the compressed string

194	The compressed string MUST be 36 octets or shorter. If the compressed
195	string is 37 octets or longer, the conversion MUST stop with an error.
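
The 36-octet limit is consistent with the 63-octet limit on name parts,
as the following arithmetic sketch suggests. It assumes the four-character
tag from section 2.1 and one Base32 character per five bits (rounded up),
as in section 2.5; the names used are hypothetical.

```python
TAG_LEN = len("bq--")
MAX_LABEL_OCTETS = 63  # limit on a domain name part, from section 2


def encoded_length(compressed_octets: int) -> int:
    """Length of the tagged name part for a compressed string of this size."""
    base32_chars = (compressed_octets * 8 + 4) // 5  # ceiling of bits / 5
    return TAG_LEN + base32_chars


fits = encoded_length(36) <= MAX_LABEL_OCTETS      # 4 + 58 = 62
too_long = encoded_length(37) > MAX_LABEL_OCTETS   # 4 + 60 = 64
```

Under these assumptions a 36-octet compressed string yields a 62-character
name part, while 37 octets would yield 64 characters.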

197	2.2.3 Encode the compressed string with Base32

199	The compressed string MUST be converted using the Base32 encoding
200	described in section 2.5. The result of this step is the encoded string.

202	2.2.4 Prepend "bq--" to the encoded string and finish

204	Prepend the characters "bq--" to the encoded string. This is the host
205	name part that can be used in DNS resolution.

207	2.3 Converting a host name part to an internationalized name

209	The input string for conversion is a valid host name part. Note that if
210	any checking for prohibited name parts (such as prohibited characters,
211	case-folding, or canonicalization) is to be done, it MUST be done after
212	doing the conversion from an ACE name part.

214	If a decoded name part consists exclusively of characters that conform
215	to the host name requirements in [STD13], the conversion from LACE MUST
216	fail. Because a name part that can be represented without LACE MUST NOT
217	be encoded using LACE, the decoding process MUST check for name parts
218	that consist exclusively of characters that conform to the host name
219	requirements in [STD13]; if such a name part is found, it MUST be
220	considered an error (and possibly a security violation).

222	2.3.1 Strip the "bq--"

224	The input string MUST begin with the characters "bq--". If it does not,
225	the conversion MUST stop with an error. Otherwise, remove the characters
226	"bq--" from the input string. The result of this step is the stripped
227	string.

229	2.3.2 Decode the stripped string with Base32

231	The entire stripped string MUST be checked to see if it is valid Base32
232	output. The entire stripped string MUST be changed to all lower-case
233	letters and digits. If any resulting characters are not in Table 1, the
234	conversion MUST stop with an error; the input string is the
235	post-converted string. Otherwise, the entire resulting string MUST be
236	converted to a binary format using the Base32 decoding described in
237	section 2.5. The result of this step is the decoded string.

239	2.3.3 Decompress the decoded string

241	The entire decoded string MUST be converted to ISO 10646 characters
242	using the decompression algorithm described in section 2.4. The result
243	of this is the internationalized string.

245	2.4 Compression algorithm

247	The basic method for compression is to reduce a substring that consists
248	of characters all from a single row of the ISO 10646 repertoire to a
249	count octet followed by the row header followed by the lower octets of
250	the characters. If this ends up being longer than the input, the string
251	is not compressed, but instead has a unique one-octet header attached.

253	Although the uncompressed mode limits the number of characters in a LACE
254	name part to 17, this is still generally enough for almost all names in
255	almost all scripts. Also, this limit is close to the limits set by other
256	encoding proposals.

258	Note that the compression and decompression rules MUST be followed
259	exactly. This requirement prevents a single host name part from having
260	two encodings. Thus, for any input to the algorithm, there is only one
261	possible output. An implementation cannot choose to use one-octet mode or
262	two-octet mode using anything other than the logic given in this
263	section.

265	2.4.1 Compressing a string

267	The input string is in big-endian UTF-16 encoding with no byte order
268	mark.

270	Design note: No checking is done on the input to this algorithm. It is
271	assumed that all checking for valid ISO/IEC 10646 characters has already
272	been done by a previous step in the conversion process.

274	1) If the length of the input is not even, or is less than 2, stop with
275	an error.

277	2) Set the input pointer, called IP, to the first octet of the input
278	string.

280	3) Set the variable called HIGH to the octet at IP.

282	4) Determine the number of consecutive pairs, starting at IP, that have
283	HIGH as the first octet; call this COUNT.

285	5) Put into an output buffer the single octet for COUNT followed by the
286	single octet for HIGH, followed by all those low octets. Move IP to the
287	end of those pairs; that is, set IP to IP+(2*COUNT).

289	6) If IP is not at the end of the input string, go to step 3.

291	7) If the length of the output buffer is less than or equal to the
292	length of the input buffer (in octets, not in characters), output the
293	buffer. Otherwise, output the octet 0xFF followed by the input buffer.
294	Note that there can only be one possible representation for a name part,
295	so that outputting the wrong name part is a serious security error.
296	Decompression schemes MUST accept only the valid form and MUST NOT
297	accept invalid forms.
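
The compression steps can be sketched in Python as follows. This is an
illustrative reading, not a reference implementation. It assumes that
COUNT covers only consecutive pairs starting at IP and that IP then
advances by 2*COUNT octets, which is what the examples in section 2.4.3
imply. The function name lace_compress and the guard against runs longer
than 254 pairs are assumptions that do not appear in the steps.

```python
def lace_compress(utf16be: bytes) -> bytes:
    """Compress big-endian UTF-16 octets following section 2.4.1."""
    # Step 1: the input must be a whole, non-empty number of octet pairs.
    if len(utf16be) < 2 or len(utf16be) % 2 != 0:
        raise ValueError("input must be a non-empty sequence of octet pairs")
    out = bytearray()
    ip = 0                                    # step 2
    while ip < len(utf16be):                  # step 6 loops back to step 3
        high = utf16be[ip]                    # step 3
        count = 0                             # step 4: consecutive pairs
        while ip + 2 * count < len(utf16be) and utf16be[ip + 2 * count] == high:
            count += 1
        if count > 0xFE:
            # Assumption: a COUNT of 0xFF would collide with the
            # uncompressed marker, so such runs are rejected here.
            raise ValueError("run too long for a one-octet COUNT")
        out.append(count)                     # step 5: COUNT, HIGH, low octets
        out.append(high)
        out.extend(utf16be[ip + 1 : ip + 2 * count : 2])
        ip += 2 * count
    # Step 7: keep the run-length form only if it is not longer.
    if len(out) <= len(utf16be):
        return bytes(out)
    return b"\xff" + utf16be
```

For instance, lace_compress(bytes.fromhex("012e00d0014a")) falls back to
the 0xFF form, because the run-length form would need nine octets for six
octets of input.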

299	2.4.2 Decompressing a string

301	1. Set the input pointer, called IP, to the first octet of the input
302	string. If there is no first octet, stop with an error.

304	2. If the octet at IP is 0xFF, go to step 10.

306	3. Get the octet at IP, call it COUNT. Set IP to IP+1. If IP is now at
307	the end of the input string, stop with an error.

309	4. Get the octet at IP, call it HIGH. Set IP to IP+1. If IP is now at
310	the end of the input string, stop with an error.

312	5. Get the octet at IP, call it LOW. Set IP to IP+1.

314	6. Output HIGH, then LOW, to the output buffer.

316	7. Decrement COUNT. If COUNT is greater than 0, go to step 5.

318	8. If IP is not at the end of the input buffer, go to step 3.

320	9. Compare the length of the input string with the length of the output
321	buffer. If the length of the output buffer is shorter than the length of
322	the input buffer, stop with an error because the wrong compression form
323	was used. Otherwise, send out the output buffer and stop.

325	10. Set IP to IP+1. Copy the rest of the input buffer to the output
326	buffer. Compress the output buffer into a separate comparison buffer
327	following the steps for compression above. If the length of the
328	comparison buffer is less than or equal to the length of the output
329	buffer, stop with an error because the wrong compression form was used.
330	Otherwise, send out the output buffer and stop.
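
A sketch of the decompression steps follows. It is an interpretation
under stated assumptions: the run-length form is treated as the wrong
form when it is longer than the data it expands to (the mirror image of
step 7 of compression), and a COUNT of zero or an uncompressed body that
is not a whole number of pairs is rejected, which is stricter than the
steps as written. The names _run_length and lace_decompress are
hypothetical.

```python
def _run_length(utf16be: bytes) -> bytes:
    """Run-length form only (steps 2 to 6 of section 2.4.1)."""
    out = bytearray()
    ip = 0
    while ip < len(utf16be):
        high = utf16be[ip]
        count = 0
        while ip + 2 * count < len(utf16be) and utf16be[ip + 2 * count] == high:
            count += 1
        out += bytes([count, high]) + utf16be[ip + 1 : ip + 2 * count : 2]
        ip += 2 * count
    return bytes(out)


def lace_decompress(data: bytes) -> bytes:
    """Expand a compressed string to big-endian UTF-16 octets."""
    if not data:                                   # step 1
        raise ValueError("empty input")
    if data[0] == 0xFF:                            # steps 2 and 10
        raw = data[1:]
        if len(raw) < 2 or len(raw) % 2:
            raise ValueError("uncompressed body must be whole octet pairs")
        if len(_run_length(raw)) <= len(raw):
            raise ValueError("wrong compression form: should be run-length")
        return raw
    out = bytearray()
    ip = 0
    while ip < len(data):                          # steps 3 to 8
        if ip + 2 >= len(data):                    # need COUNT, HIGH, one LOW
            raise ValueError("truncated run")
        count, high = data[ip], data[ip + 1]
        ip += 2
        if count == 0 or ip + count > len(data):
            raise ValueError("bad run length")
        for low in data[ip : ip + count]:
            out += bytes([high, low])
        ip += count
    if len(out) < len(data):                       # step 9
        raise ValueError("wrong compression form: should be uncompressed")
    return bytes(out)
```

With these checks, the three outputs in section 2.4.3 expand back to
their inputs, while the nine-octet run-length form of the third example
is rejected.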

332	2.4.3 Compression examples

334	The five input characters <U+30E6 U+30CB U+30B3 U+30FC U+30C9> are
335	represented in big-endian UTF-16 as the ten octets <30 E6 30 CB 30 B3 30
336	FC 30 C9>. All the code units are in the same row (30). The output
337	buffer has seven octets <05 30 E6 CB B3 FC C9>, which is shorter than
338	the input string. Thus the output is <05 30 E6 CB B3 FC C9>.

340	The four input characters <U+012E U+0110 U+014A U+00C5> are represented
341	in big-endian UTF-16 as the eight octets <01 2E 01 10 01 4A 00 C5>. The
342	output buffer has eight octets <03 01 2E 10 4A 01 00 C5>, which is the
343	same length as the input string. Thus, the output is <03 01 2E 10 4A 01
344	00 C5>.

346	The three input characters <U+012E U+00D0 U+014A> are represented in
347	big-endian UTF-16 as the six octets <01 2E 00 D0 01 4A>. The output
348	buffer is nine octets <01 01 2E 01 00 D0 01 01 4A>, which is longer than
349	the input buffer. Thus, the output is <FF 01 2E 00 D0 01 4A>.

351	2.5 Base32

353	In order to encode non-ASCII characters in DNS-compatible host name parts,
354	they must be converted into legal characters. This is done with Base32
355	encoding, described here.

357	Table 1 shows the mapping between input bits and output characters in
358	Base32. Design note: the digits used in Base32 are "2" through "7"
359	instead of "0" through "5" in order to avoid digits "0" and "1". This
360	helps reduce errors for users who are entering a Base32 stream and may
361	misinterpret a "0" for an "O" or a "1" for an "l".

363	                    Table 1: Base32 conversion
364	             bits   char  hex         bits   char  hex
365	             00000   a    0x61        10000   q    0x71
366	             00001   b    0x62        10001   r    0x72
367	             00010   c    0x63        10010   s    0x73
368	             00011   d    0x64        10011   t    0x74
369	             00100   e    0x65        10100   u    0x75
370	             00101   f    0x66        10101   v    0x76
371	             00110   g    0x67        10110   w    0x77
372	             00111   h    0x68        10111   x    0x78
373	             01000   i    0x69        11000   y    0x79
374	             01001   j    0x6a        11001   z    0x7a
375	             01010   k    0x6b        11010   2    0x32
376	             01011   l    0x6c        11011   3    0x33
377	             01100   m    0x6d        11100   4    0x34
378	             01101   n    0x6e        11101   5    0x35
379	             01110   o    0x6f        11110   6    0x36
380	             01111   p    0x70        11111   7    0x37

382	2.5.1 Encoding octets as Base32

384	The input is a stream of octets. However, the octets are then treated
385	as a stream of bits.

387	Design note: The assumption that the input is a stream of octets
388	(instead of a stream of bits) was made so that no padding was needed.
389	If you are reusing this algorithm for a stream of bits, you must add a
390	padding mechanism in order to differentiate different lengths of input.

392	1) Set the read pointer to the beginning of the input bit stream.

394	2) Look at the five bits after the read pointer. If there are not five
395	bits, go to step 5.

397	3) Look up the value of the set of five bits in the bits column of
398	Table 1, and output the character from the char column (whose hex value
399	is in the hex column).

401	4) Move the read pointer five bits forward. If the read pointer is at
402	the end of the input bit stream (that is, there are no more bits in the
403	input), stop. Otherwise, go to step 2.

405	5) Pad the bits seen with trailing zero bits until there are five bits.

407	6) Look up the value of the set of five bits in the bits column of
408	Table 1, and output the character from the char column (whose hex value
409	is in the hex column).
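
The encoding steps can be sketched as below. The alphabet string is
Table 1 read in bit-value order. Padding the final chunk with zero bits
is an assumption taken from the example in section 2.5.3. The names
BASE32_CHARS and base32_encode are hypothetical.

```python
BASE32_CHARS = "abcdefghijklmnopqrstuvwxyz234567"  # Table 1, values 0 to 31


def base32_encode(octets: bytes) -> str:
    """Encode octets as Base32 characters, five bits at a time."""
    # Treat the octets as one stream of bits (written here as '0'/'1').
    bits = "".join(f"{octet:08b}" for octet in octets)
    # Step 5: pad the last chunk with zero bits up to five bits.
    if len(bits) % 5:
        bits += "0" * (5 - len(bits) % 5)
    # Steps 2 to 4: look up each five-bit chunk in Table 1.
    return "".join(
        BASE32_CHARS[int(bits[i : i + 5], 2)] for i in range(0, len(bits), 5)
    )
```

Building a bit string is not the fastest approach, but it keeps the code
close to the wording of the steps.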

411	2.5.2 Decoding Base32 as octets

413	The input is octets in network byte order. The input octets MUST be
414	values from the second column in Table 1.

416	1) Set the read pointer to the beginning of the input octet stream.

418	2) Look up the character value of the octet in the char column (or hex
419	value in hex column) of Table 1, and output the five bits from the bits
420	column.

422	3) Move the read pointer one octet forward. If the read pointer is at
423	the end of the input octet stream (that is, there are no more octets in
424	the input), stop. Otherwise, go to step 2.
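
A decoding sketch follows. The steps above produce a stream of bits; this
sketch assumes that any trailing bits that do not fill a whole octet are
padding and drops them, which the steps do not state explicitly. A
stricter decoder might also require those bits to be zero. The names
BASE32_CHARS and base32_decode are hypothetical.

```python
BASE32_CHARS = "abcdefghijklmnopqrstuvwxyz234567"  # Table 1, values 0 to 31


def base32_decode(text: str) -> bytes:
    """Decode Base32 characters to octets, ignoring case."""
    bits = ""
    for ch in text.lower():
        value = BASE32_CHARS.find(ch)
        if value < 0:
            raise ValueError(f"not a Base32 character: {ch!r}")
        bits += f"{value:05b}"          # step 2: five bits per character
    # Assumption: leftover bits short of a whole octet are padding.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i : i + 8], 2) for i in range(0, usable, 8))
```

Lower-casing the input first mirrors the requirement in section 2.3.2
that the stripped string be changed to lower-case before decoding.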

426	2.5.3 Base32 example

428	Assume you want to encode the value 0x3a270f93. The bit string is:

430	3   a    2   7    0   f    9   3
431	00111010 00100111 00001111 10010011

433	Broken into chunks of five bits, this is:

435	00111 01000 10011 10000 11111 00100 11

437	Padding is added to make the last chunk five bits:

439	00111 01000 10011 10000 11111 00100 11000

441	The output of encoding is:

443	00111 01000 10011 10000 11111 00100 11000
444	  h     i     t     q     7     e     y
445	or "hitq7ey".

447	3. Security Considerations

449	Much of the security of the Internet relies on the DNS. Thus, any
450	change to the characteristics of the DNS can change the security of
451	much of the Internet. For this reason, LACE makes no changes to the DNS
452	itself.

454	Host names are used by users to connect to Internet servers. The
455	security of the Internet would be compromised if a user entering a
456	single internationalized name could be connected to different servers
457	based on different interpretations of the internationalized host
458	name.

460	LACE is designed so that every internationalized host name part
461	can be represented as one and only one DNS-compatible string. If there
462	is any way to follow the steps in this document and get two or more
463	different results, it is a severe and fatal error in the protocol.

465	4. References

467	[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name Proposals",
468	draft-ietf-idn-compare.

470	[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
471	draft-ietf-idn-requirement.

473	[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
474	technology -- Universal Multiple-Octet Coded Character Set (UCS) --
475	Part 1: Architecture and Basic Multilingual Plane.  Five amendments and
476	a technical corrigendum have been published up to now. UTF-16 is
477	described in Annex Q, published as Amendment 1. 17 other amendments are
478	currently at various stages of standardization. [[[ THIS REFERENCE
479	NEEDS TO BE UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

481	[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
482	Requirement Levels", March 1997, RFC 2119.

484	[STD13] Paul Mockapetris, "Domain names - implementation and
485	specification", November 1987, STD 13 (RFC 1035).

487	[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
488	3.0", ISBN 0-201-61633-5. Described at
489	<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

491	A. Acknowledgements

493	Base32 is quite obviously inspired by the tried-and-true Base64
494	Content-Transfer-Encoding from MIME.

496	B. IANA Considerations

498	There are no IANA considerations in this document.

500	C. Author Contact Information

502	Mark Davis
503	IBM
504	10275 N. De Anza Blvd
505	Cupertino, CA 95014
506	mark.davis@us.ibm.com and mark.davis@macchiato.com

508	Paul Hoffman
509	Internet Mail Consortium and VPN Consortium
510	127 Segre Place
511	Santa Cruz, CA  95060 USA
512	paul.hoffman@imc.org and paul.hoffman@vpnc.org