idnits 2.17.1 

draft-ietf-idn-nameprep-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate.  This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** Missing expiration date.  The document expiration date should appear on
     the first and last page.

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 1
     longer page, the longest (page 1) being 856 lines


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an Authors' Addresses Section.

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- Couldn't find a document date in the document -- date freshness check
     skipped.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'ROMAN NUMERALS' is mentioned on line 664, but not
     defined

  == Missing Reference: 'SPACES' is mentioned on line 604, but not defined

  == Missing Reference: 'CONTROL CHARACTERS' is mentioned on line 573, but
     not defined

  == Missing Reference: 'PRIVATE USE' is mentioned on line 716, but not
     defined

  == Missing Reference: 'PLANE 0' is mentioned on line 716, but not defined

  == Missing Reference: 'MATHEMATICAL OPERATORS' is mentioned on line 666,
     but not defined

  == Missing Reference: 'ARROWS' is mentioned on line 665, but not defined

  == Missing Reference: 'MISCELLANEOUS TECHNICAL' is mentioned on line 667,
     but not defined

  == Missing Reference: 'CONTROL PICTURES' is mentioned on line 668, but not
     defined

  == Missing Reference: 'BOX DRAWING' is mentioned on line 690, but not
     defined

  == Missing Reference: 'BLOCK ELEMENTS' is mentioned on line 691, but not
     defined

  == Missing Reference: 'GEOMETRIC SHAPES' is mentioned on line 692, but not
     defined

  == Missing Reference: 'MISCELLANEOUS SYMBOLS' is mentioned on line 693, but
     not defined

  == Missing Reference: 'DINGBATS' is mentioned on line 694, but not defined

  == Missing Reference: 'BRAILLE PATTERNS' is mentioned on line 695, but not
     defined

  == Missing Reference: 'SURROGATE CHARACTERS' is mentioned on line 715, but
     not defined

  == Missing Reference: 'KANGXI RADICALS' is mentioned on line 697, but not
     defined

  == Unused Reference: 'Normalize' is defined on line 787, but no explicit
     reference was found in the text

  == Unused Reference: 'STD13' is defined on line 799, but no explicit
     reference was found in the text

  -- Possible downref: Normative reference to a draft: ref. 'IDNComp' 

  -- No information found for draft-ietf-idn-requirement - is the name
     correct?

  -- Possible downref: Normative reference to a draft: ref. 'IDNReq' 

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO10646'

  == Outdated reference: A later version (-04) exists of
     draft-duerst-i18n-norm-03

  -- Possible downref: Normative reference to a draft: ref. 'Normalize' 

  ** Obsolete normative reference: RFC 2396 (Obsoleted by RFC 3986)

  ** Obsolete normative reference: RFC 2732 (Obsoleted by RFC 3986)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Unicode3'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UniData'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UTR15'


     Summary: 7 errors (**), 0 flaws (~~), 23 warnings (==), 10 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet Draft                                          Paul Hoffman
2	draft-ietf-idn-nameprep-00.txt                            IMC & VPNC
3	July 3, 2000                                           Marc Blanchet
4	Expires in six months                                       ViaGenie

6	             Preparation of Internationalized Host Names

8	Status of this memo

10	This document is an Internet-Draft and is in full conformance with all
11	provisions of Section 10 of RFC2026.

13	Internet-Drafts are working documents of the Internet Engineering Task
14	Force (IETF), its areas, and its working groups. Note that other groups
15	may also distribute working documents as Internet-Drafts.

17	Internet-Drafts are draft documents valid for a maximum of six months
18	and may be updated, replaced, or obsoleted by other documents at any
19	time. It is inappropriate to use Internet-Drafts as reference material
20	or to cite them other than as "work in progress."

22	     The list of current Internet-Drafts can be accessed at
23	     http://www.ietf.org/ietf/1id-abstracts.txt

25	     The list of Internet-Draft Shadow Directories can be accessed at
26	     http://www.ietf.org/shadow.html.

28	Abstract

30	This document describes how to prepare internationalized host names for
31	transmission on the wire. The steps include excluding characters that
32	are prohibited from appearing in internationalized host names, changing
33	all characters that have case properties to be lowercase, and
34	normalizing the characters. Further, this document lists the prohibited
35	characters.

37	1. Introduction

39	When expanding today's DNS to include internationalized host names,
40	those new names will be handled in many parts of the DNS. The IDN
41	Working Group's requirements document [IDNReq] describes a framework for
42	domain name handling as well as requirements for the new names. The IDN
43	Working Group's comparison document [IDNComp] gives a framework for how
44	various parts of the IDN solution work together.

46	A user can enter a domain name into an application program in a myriad
47	of fashions. Depending on the input method, the characters entered in
48	the domain name may or may not be those that are allowed in
49	internationalized host names. Thus, there must be a way to canonicalize
50	the user's input before the name is resolved in the DNS.

52	It is a design goal of this document to allow users to enter host names
53	in applications and have the highest chance of getting the name correct.
54	This means that the user should not be limited to entering exactly
55	the characters that might have been used, but should instead be able to
56	enter characters that unambiguously canonicalize to characters in the
57	desired host name. At the same time, this process must not introduce any
58	chance that two host names could be represented by two distinct strings
59	of characters that look identical to typical users. It is also a design
60	goal to have all preprocessing of IDN done before going on the wire, so
61	that no transformation is done in the DNS server space.

63	This document describes the steps needed to convert a name part from one
64	that is entered by the user to one that can be used in the DNS.

66	1.1 Terminology

68	The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
69	"MAY" in this document are to be interpreted as described in RFC 2119
70	[RFC2119].

72	Examples in this document use the notation from the Unicode Standard
73	[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the
74	letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
75	A". In the lists of prohibited characters, the "U+" is left off to make
76	the lists easier to read.
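
As a minimal sketch of how the list notation described above could be
read by software, the following Python fragment parses one list entry
into a first code point, a last code point, and a name. The function
name parse_entry and the regular expression are assumptions made for
illustration; they are not part of this specification.

```python
import re

# One entry is "XXXX" or "XXXX-YYYY" (hex, no "U+"), whitespace, then a name.
ENTRY = re.compile(r"^([0-9A-F]{4,6})(?:-([0-9A-F]{4,6}))?\s+(.*)$")


def parse_entry(line):
    """Return (first, last, name) for one prohibited-character list entry."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a list entry: {line!r}")
    first = int(match.group(1), 16)
    # A single code point is treated as a range of length one.
    last = int(match.group(2), 16) if match.group(2) else first
    return first, last, match.group(3)


print(parse_entry("00AD        SOFT HYPHEN"))
print(parse_entry("2000-200B   [SPACES]"))
```

Treating single code points as one-element ranges lets a single
comparison handle both kinds of entry.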

78	1.2 IDN summary

80	Using the terminology in [IDNComp], this document specifies all of the
81	prohibited characters and the canonicalization for an IDN solution.
82	Specifically, it covers the following sections from [IDNComp]:

84	prohib-1: Identical and near-identical characters
85	prohib-2: Separators
86	prohib-3: Non-displaying and non-spacing characters
87	prohib-4: Private use characters
88	prohib-5: Punctuation
89	prohib-6: Symbols
90	canon-1.2: Normalization Form KC
91	canon-2.1: Case folding in ASCII
92	canon-2.2: Case folding in non-ASCII

94	Note that this document does not cover:
95	canon-1.1: Normalization Form C
96	canon-2.3: Han folding

98	1.3 Open issues

100	This is the first draft of this document. Although there has been much
101	discussion on the WG mailing list about the topics here, there has not
102	yet been much agreement on some issues. Now that there is a document to
103	talk about, that discussion can be more focussed.

105	1.3.1 Where to do name preparation

107	Section 2.1 says to do name preparation in the resolver. An argument can
108	be made for doing name preparation in the application, before the
109	application service interface. An advantage of that proposal is that
110	resolvers would not need to do any name preparation. A disadvantage is
111	that applications would have to be updated each time the IDN protocol is
112	updated, such as if new characters are added to the repertoire of
113	allowed characters. It seems likely that resolvers are more easily
114	updated than all the individual applications that use internationalized
115	host names.

117	1.3.2 Choosing between normalization form C and KC

119	Much of the discussion of normalization on the WG mailing list assumed
120	that normalization form C would be used. Near the time that this
121	document was written, people started considering form KC instead of C.
122	This document uses form KC, but the reasons for doing so could be
123	contentious.

125	1.3.3 Does the prohibition catch all bad characters?

127	The mailing list discussed doing prohibition in two steps: a
128	short list of prohibited characters before case folding in order to
129	prevent uppercase characters that have no lowercase equivalents from
130	getting through, and then a full check on the output of normalization.
131	In this draft, all checking is done before case folding, based on the
132	(possibly wrong) assumption that none of the prohibited characters will
133	re-appear after the case folding and normalization. If that assumption
134	turns out to be wrong, a check for just those problematic characters can
135	be added after normalization, or a full check against the prohibited
136	characters can be added.

138	2. Preparation Overview

140	This section describes where name preparation happens and the steps that
141	name preparation software must take.

143	2.1 Where name preparation happens

145	Part of the chart in section 1.4 of [IDNReq] looks like this:

147	+---------------+
148	| Application   |
149	+---------------+
150	      |  Application service interface
151	      |  For ex. GethostbyXXXX interface
152	+---------------+
153	| Resolver      |
154	+---------------+
155	      |     <-----   DNS service interface
156	+-------------------------------------------+

158	In this specification, the name preparation is done in the resolver,
159	before the DNS service interface. That is, it is acceptable for software
160	in the application service interface (such as a "GetHostByName" API) to
161	pass the resolver a name that has not been prepared. However, the
162	resolver MUST prepare the name as described in this specification before
163	passing it to the DNS service interface.

165	2.2 Name preparation steps

167	The steps for preparing names are:

169	1) Input from the application service interface -- This can be done in
170	many ways and is not specified in this document.

172	2) Look for prohibited input -- Check for any characters that are not
173	allowed in the input. If any are found, return an error to the
174	application service interface. This step is necessary to prevent errors
175	in the following two steps. This step fulfills prohib-1, prohib-2,
176	prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp].

178	3) Fold case -- Change all uppercase characters into lowercase
179	characters. Design note: this step could just as easily have been
180	"change all lowercase characters into uppercase characters". However,
181	the upper-to-lower folding was chosen because most users of the Internet
182	today enter host names in lowercase. This step fulfills canon-2.1 and
183	canon-2.2 from [IDNComp].

185	4) Canonicalize -- Normalize the characters. This step fulfills canon-1.2
186	from [IDNComp].

188	5) Resolution of the prepared name -- This must be specified in a
189	different IDN document.

191	The above steps MUST be performed in the order given to comply with
192	this specification.
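
The steps above could be sketched as follows. This is an illustration
under stated assumptions, not a definitive implementation: the names
prepare_name_part, is_prohibited, and ProhibitedCharacterError are
hypothetical, the prohibited table is only a small subset of section 3,
and Python's str.lower() and its NFKC data follow whatever Unicode
version the interpreter ships with, which may differ from the tables
referenced by this document.

```python
import unicodedata

# Illustrative subset only; the full list is the one given in section 3.
PROHIBITED_RANGES = [
    (0x0000, 0x001F),  # [CONTROL CHARACTERS]
    (0x0020, 0x0020),  # SPACE
    (0x002E, 0x002E),  # FULL STOP
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x2000, 0x200B),  # [SPACES]
]


class ProhibitedCharacterError(ValueError):
    """Hypothetical error returned to the application service interface."""


def is_prohibited(ch):
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PROHIBITED_RANGES)


def prepare_name_part(name_part):
    # Step 2: reject prohibited input before any transformation.
    for ch in name_part:
        if is_prohibited(ch):
            raise ProhibitedCharacterError(
                f"prohibited character U+{ord(ch):04X}")
    # Step 3: fold case, upper to lower (assumption: str.lower() stands in
    # for the case folding that this document leaves to [UniData]).
    folded = name_part.lower()
    # Step 4: canonicalize with Normalization Form KC.
    return unicodedata.normalize("NFKC", folded)


print(prepare_name_part("Example"))
```

The order of the calls mirrors the order of the steps, so the
prohibition check always sees the input exactly as it was entered.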

194	3. Prohibited Input

196	Before the text can be processed, it must be checked for prohibited
197	characters. There is a variety of prohibited characters, as described in
198	this section.

200	Note that one of the goals of IDN is to allow the widest possible set of
201	host names as long as those host names do not cause other problems, such
202	as possible ambiguity. Specifically, experience with current DNS names
203	has shown that there is a desire for host names that include personal
204	names, company names, and spoken phrases. A goal of this section is to
205	prohibit as few characters that might be used in these contexts as
206	possible while making sure that characters that might easily cause
207	confusion or ambiguity are prohibited.

209	Note that every character listed in this section MUST NOT be transmitted
210	on the DNS service interface. Although the checking is being performed
211	before case folding and canonicalization, those steps cannot result in
212	any of these characters if these characters are not in the input stream.
213	[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS
214	server receives a request containing a prohibited character, then the
215	IDN protocol MUST return an error message.
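
One way the algorithmic check called for in the note above might be
automated is sketched below. The function name reappearing is
hypothetical, str.lower() is assumed to stand in for the case folding
step, and the result depends on the Unicode data shipped with the
Python interpreter, so this is a starting point rather than a proof.

```python
import unicodedata


def reappearing(prohibited, candidates):
    """Return (input, output) pairs where an input character that is not
    prohibited yields a prohibited character after folding and NFKC."""
    found = []
    for ch in candidates:
        if ch in prohibited:
            continue  # already rejected by the check on input
        output = unicodedata.normalize("NFKC", ch.lower())
        if any(out_ch in prohibited for out_ch in output):
            found.append((ch, output))
    return found


# Deliberately incomplete prohibited set: only FULL STOP is listed, so
# ONE DOT LEADER is reported as a character that would let it reappear.
for source, result in reappearing({"."}, ["\u2024", "a", "."]):
    print(f"U+{ord(source):04X} -> {result!r}")
```

Running the same function with the full list from this section as the
prohibited set, and every assigned character as a candidate, would show
whether a second check after normalization is needed.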

217	Note that some characters listed in one section could also appear in
218	other sections. Each character is listed only once.

220	3.1 prohib-1: Identical and near-identical characters

222	Many characters in [ISO10646] are identical or nearly identical to other
223	characters. These were often included for compatibility with other
224	character sets.

226	The characters prohibited because they are identical or nearly identical
227	to allowed characters are:

229	00AD        SOFT HYPHEN
230	00D7        MULTIPLICATION SIGN
231	01C3        LATIN LETTER RETROFLEX CLICK
232	02B0-02FF   [SPACING MODIFIER LETTERS]
233	066D        ARABIC FIVE POINTED STAR
234	1806        MONGOLIAN TODO SOFT HYPHEN
235	2010        HYPHEN
236	2011        NON-BREAKING HYPHEN
237	2012        FIGURE DASH
238	2013        EN DASH
239	2014        EM DASH
240	2160-217F   [ROMAN NUMERALS]
241	FB1D-FB4F   [HEBREW PRESENTATION FORMS]
242	FB50-FDFF   [ARABIC PRESENTATION FORMS A]
243	FE20-FE2F   [COMBINING HALF MARKS]
244	FE30-FE4F   [CJK COMPATIBILITY FORMS]
245	FE50-FE6F   [SMALL FORM VARIANTS]
246	FE70-FEFC   [ARABIC PRESENTATION FORMS B]
247	FF00-FFEF   [HALFWIDTH AND FULLWIDTH FORMS]

249	3.2 prohib-2: Separators

251	Horizontal and vertical spacing characters would make it unclear where a
252	host name begins and ends. The prohibited spacing characters are:

254	0020        SPACE
255	00A0        NO-BREAK SPACE
256	1680        OGHAM SPACE MARK
257	2000-200B   [SPACES]
258	2028        LINE SEPARATOR
259	2029        PARAGRAPH SEPARATOR
260	202F        NARROW NO-BREAK SPACE
261	3000        IDEOGRAPHIC SPACE

263	Allowing periods and period-like characters as characters within a name
264	part would also cause similar confusion. The prohibited periods,
265	characters that look like periods, and characters that canonicalize to a
266	period or to a period-like character are:

268	002E        FULL STOP
269	06D4        ARABIC FULL STOP
270	2024        ONE DOT LEADER
271	2025        TWO DOT LEADER
272	2026        HORIZONTAL ELLIPSIS
273	2488        DIGIT ONE FULL STOP
274	2489        DIGIT TWO FULL STOP
275	248A        DIGIT THREE FULL STOP
276	248B        DIGIT FOUR FULL STOP
277	248C        DIGIT FIVE FULL STOP
278	248D        DIGIT SIX FULL STOP
279	248E        DIGIT SEVEN FULL STOP
280	248F        DIGIT EIGHT FULL STOP
281	2490        DIGIT NINE FULL STOP
282	2491        NUMBER TEN FULL STOP
283	2492        NUMBER ELEVEN FULL STOP
284	2493        NUMBER TWELVE FULL STOP
285	2494        NUMBER THIRTEEN FULL STOP
286	2495        NUMBER FOURTEEN FULL STOP
287	2496        NUMBER FIFTEEN FULL STOP
288	2497        NUMBER SIXTEEN FULL STOP
289	2498        NUMBER SEVENTEEN FULL STOP
290	2499        NUMBER EIGHTEEN FULL STOP
291	249A        NUMBER NINETEEN FULL STOP
292	249B        NUMBER TWENTY FULL STOP
293	33C2        SQUARE AM
295	33C7        SQUARE CO
296	33D8        SQUARE PM
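
The statement that some of these characters canonicalize to a period
can be spot-checked with a short Python fragment. The helper name
normalizes_to_period is an assumption for illustration, and the check
uses the NFKC data of the local Python build rather than the tables
referenced by this document.

```python
import unicodedata


def normalizes_to_period(ch):
    """True if the NFKC form of ch contains U+002E FULL STOP."""
    return "\u002E" in unicodedata.normalize("NFKC", ch)


for cp in (0x2024, 0x2488, 0x33C2, 0x33C7, 0x33D8, 0x06D4):
    ch = chr(cp)
    print(f"{cp:04X} {unicodedata.name(ch)}: {normalizes_to_period(ch)}")
```

A character such as 06D4 has no decomposition and is reported as False;
it is in the list because it looks like a period, which no
normalization-based test can detect.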

299	3.3 prohib-3: Non-displaying and non-spacing characters

301	The ISO 10646 character set contains many characters that cannot be
302	seen. These include control characters, non-breaking spaces, formatting
303	characters, and tagging characters. These characters would certainly
304	cause confusion if allowed in host names.

306	0000-001F   [CONTROL CHARACTERS]
307	007F        DELETE
308	0080-009F   [CONTROL CHARACTERS]
309	070F        SYRIAC ABBREVIATION MARK
310	180B        MONGOLIAN FREE VARIATION SELECTOR ONE
311	180C        MONGOLIAN FREE VARIATION SELECTOR TWO
312	180D        MONGOLIAN FREE VARIATION SELECTOR THREE
313	180E        MONGOLIAN VOWEL SEPARATOR
314	200C        ZERO WIDTH NON-JOINER
315	200D        ZERO WIDTH JOINER
316	200E        LEFT-TO-RIGHT MARK
317	200F        RIGHT-TO-LEFT MARK
318	202A        LEFT-TO-RIGHT EMBEDDING
319	202B        RIGHT-TO-LEFT EMBEDDING
320	202C        POP DIRECTIONAL FORMATTING
321	202D        LEFT-TO-RIGHT OVERRIDE
322	202E        RIGHT-TO-LEFT OVERRIDE
323	206A        INHIBIT SYMMETRIC SWAPPING
324	206B        ACTIVATE SYMMETRIC SWAPPING
325	206C        INHIBIT ARABIC FORM SHAPING
326	206D        ACTIVATE ARABIC FORM SHAPING
327	206E        NATIONAL DIGIT SHAPES
328	206F        NOMINAL DIGIT SHAPES
329	FEFF        ZERO WIDTH NO-BREAK SPACE
330	FFF9        INTERLINEAR ANNOTATION ANCHOR
331	FFFA        INTERLINEAR ANNOTATION SEPARATOR
332	FFFB        INTERLINEAR ANNOTATION TERMINATOR
333	FFFC        OBJECT REPLACEMENT CHARACTER
334	FFFD        REPLACEMENT CHARACTER

336	3.4 prohib-4: Private use characters

338	Because private-use characters do not have defined meanings, they are
339	prohibited. The private-use characters are:

341	E000-F8FF   [PRIVATE USE, PLANE 0]

343	3.5 prohib-5: Punctuation

345	The following characters are reserved or delimiters in URLs [RFC2396]
346	and [RFC2732]:

348	" # $ % & + , . / : ; < = > ? @ [ ]

350	3.5.1 Characters from URLs

352	The following punctuation characters are prohibited because they are
353	reserved or delimiters in URLs.

355	0022        QUOTATION MARK
356	0023        NUMBER SIGN
357	0024        DOLLAR SIGN
358	0025        PERCENT SIGN
359	0026        AMPERSAND
360	002B        PLUS SIGN
361	002C        COMMA
362	002E        FULL STOP
363	002F        SOLIDUS
364	003A        COLON
365	003B        SEMICOLON
366	003C        LESS-THAN SIGN
367	003D        EQUALS SIGN
368	003E        GREATER-THAN SIGN
369	003F        QUESTION MARK
370	0040        COMMERCIAL AT
371	005B        LEFT SQUARE BRACKET
372	005D        RIGHT SQUARE BRACKET

374	3.5.2 Characters that canonicalize to characters from URLs

376	The following punctuation characters are prohibited because their
377	normalization contains one or more of the characters from section 3.5.1.

379	037E        GREEK QUESTION MARK
380	2048        QUESTION EXCLAMATION MARK
381	2049        EXCLAMATION QUESTION MARK
382	207A        SUPERSCRIPT PLUS SIGN
383	207C        SUPERSCRIPT EQUALS SIGN
384	208A        SUBSCRIPT PLUS SIGN
385	208C        SUBSCRIPT EQUALS SIGN
386	2100        ACCOUNT OF
387	2101        ADDRESSED TO THE SUBJECT
388	2105        CARE OF
389	2106        CADA UNA

391	3.5.3 Characters that look like characters from URLs

393	The following are prohibited because they look indistinguishable from
394	the characters listed in section 3.5.1.

397	0589        ARMENIAN FULL STOP
398	060C        ARABIC COMMA
399	061B        ARABIC SEMICOLON
400	066A        ARABIC PERCENT SIGN
401	201A        SINGLE LOW-9 QUOTATION MARK
402	2030        PER MILLE SIGN
403	2031        PER TEN THOUSAND SIGN
404	2033        DOUBLE PRIME
405	2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK
406	203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
407	203D        INTERROBANG
408	2044        FRACTION SLASH
409	3001        IDEOGRAPHIC COMMA
410	3002        IDEOGRAPHIC FULL STOP
411	3003        DITTO MARK
412	3008        LEFT ANGLE BRACKET
413	3009        RIGHT ANGLE BRACKET
414	3014        LEFT TORTOISE SHELL BRACKET
415	3015        RIGHT TORTOISE SHELL BRACKET
416	301A        LEFT WHITE SQUARE BRACKET
417	301B        RIGHT WHITE SQUARE BRACKET

419	3.5.4 Other punctuation

421	The following punctuation characters are prohibited because they are
422	unlikely to be used in names and may be confusing to users or to
423	character-entry processes:

425	005C        REVERSE SOLIDUS

427	3.6 prohib-6: Symbols

429	[UniData] has non-normative categories for symbols. The four symbol
430	categories are:

432	Symbol, Currency: Currency symbols could appear in company names and
433	spoken phrases, so they are not prohibited.

435	Symbol, Modifier: Stand-alone modifiers might appear in personal names,
436	company names, and spoken phrases, so they are not prohibited.

438	Symbol, Math: It is very unlikely that there are any significant
439	personal names, company names, or spoken phrases that contain
440	mathematical symbols. Further, many of these symbols are the same or
441	similar to other punctuation, thereby leading to ambiguity. For this
442	reason, math-specific symbols are prohibited. These prohibited math
443	symbols are:

445	00AC        NOT SIGN
446	00B1        PLUS-MINUS SIGN
447	2200-22FF   [MATHEMATICAL OPERATORS]

449	Further, the following characters canonicalize to characters in the
450	above math list, and therefore are also prohibited:

452	00BC        VULGAR FRACTION ONE QUARTER
453	00BD        VULGAR FRACTION ONE HALF
454	00BE        VULGAR FRACTION THREE QUARTERS
455	207B        SUPERSCRIPT MINUS
456	208B        SUBSCRIPT MINUS
457	2153        VULGAR FRACTION ONE THIRD
458	2154        VULGAR FRACTION TWO THIRDS
459	2155        VULGAR FRACTION ONE FIFTH
460	2156        VULGAR FRACTION TWO FIFTHS
461	2157        VULGAR FRACTION THREE FIFTHS
462	2158        VULGAR FRACTION FOUR FIFTHS
463	2159        VULGAR FRACTION ONE SIXTH
464	215A        VULGAR FRACTION FIVE SIXTHS
465	215B        VULGAR FRACTION ONE EIGHTH
466	215C        VULGAR FRACTION THREE EIGHTHS
467	215D        VULGAR FRACTION FIVE EIGHTHS
468	215E        VULGAR FRACTION SEVEN EIGHTHS
469	215F        FRACTION NUMERATOR ONE
470	33A7        SQUARE M OVER S
471	33A8        SQUARE M OVER S SQUARED
472	33AE        SQUARE RAD OVER S
473	33AF        SQUARE RAD OVER S SQUARED
474	33C6        SQUARE C OVER KG

476	Symbol, Other: This category covers a multitude of symbols, few of which
477	would ever appear in personal names, company names, and spoken phrases.
478	The rest of the prohibited symbols are:

480	2190-21FF   [ARROWS]
481	2300-23FF   [MISCELLANEOUS TECHNICAL]
482	2400-243F   [CONTROL PICTURES]
483	2440-245F   [OPTICAL CHARACTER RECOGNITION]
484	2500-257F   [BOX DRAWING]
485	2580-259F   [BLOCK ELEMENTS]
486	25A0-25FF   [GEOMETRIC SHAPES]
487	2600-267F   [MISCELLANEOUS SYMBOLS]
488	2700-27BF   [DINGBATS]
489	2800-287F   [BRAILLE PATTERNS]

491	3.7 Additional prohibited characters

493	3.7.1 Unassigned characters

495	All characters not yet assigned in [ISO10646] are prohibited. Although
496	this may at first seem trivial, it is extremely important because
497	characters that may be assigned in the future might have properties that
498	would cause them to be prohibited or might have case-folding properties.
499	As is the case of all prohibited characters, if a DNS server receives a
500	request containing an unassigned character, then the IDN protocol MUST
501	return an error message.
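
A minimal sketch of an unassigned-character test follows. It assumes
that the general category "Cn" reported by Python's unicodedata module
is an acceptable stand-in for "not yet assigned in [ISO10646]"; the
function name is_unassigned is hypothetical, and the answer depends on
the Unicode version of the Python build, which is newer than the one
this document was written against.

```python
import unicodedata


def is_unassigned(ch):
    """True if the local Unicode data gives ch the category Cn."""
    # Cn covers reserved (unassigned) code points and noncharacters.
    return unicodedata.category(ch) == "Cn"


print(is_unassigned("\u0378"))  # a reserved code point in the Greek block
print(is_unassigned("a"))
```

Because the set of assigned characters grows over time, software using
such a test would accept more characters as its Unicode data is
updated, which is the situation this section guards against.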

503	3.7.2 Surrogate characters

505	So far, all proposals for binary encodings of internationalized name
506	parts have specified UTF-8 as the encoding format. In such an encoding,
507	surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
508	the following are prohibited:

510	D800-DFFF   [SURROGATE CHARACTERS]
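
The prohibition can be illustrated with a short sketch that rejects
surrogate code points before producing UTF-8. The names is_surrogate
and encode_name_part are assumptions for illustration only.

```python
def is_surrogate(ch):
    """True if ch is in the range D800-DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def encode_name_part(text):
    """Encode text as UTF-8, refusing surrogate code points."""
    for ch in text:
        if is_surrogate(ch):
            raise ValueError(f"surrogate U+{ord(ch):04X} is prohibited")
    return text.encode("utf-8")


print(encode_name_part("example"))
```

Python's own UTF-8 codec also refuses lone surrogates in its default
strict mode, so the explicit check mainly serves to report the error in
terms of this specification.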

512	3.7.3 Uppercase characters with no lowercase mappings

514	There are many uppercase characters in [ISO10646] which do not have
515	lowercase equivalents in [UniData]. Therefore, they are prohibited on
516	input because they would get through the case mapping step while still
517	being in uppercase.

519	The characters that are prohibited on input because they are uppercase
520	but have no lowercase mappings are:

522	03D2        GREEK UPSILON WITH HOOK SYMBOL
523	03D3        GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
524	03D4        GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
525	04C0        CYRILLIC LETTER PALOCHKA
526	10A0-10C5   [GEORGIAN CAPITAL LETTERS]

528	Note that many characters in the range U+2100 to U+213A, the letterlike
529	symbols, also are uppercase but have no lowercase mappings. However,
530	they are not listed here because the entire range is already prohibited
531	in section 3.6.
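
A list like the one above could be derived mechanically, as in the
following sketch. The function name uppercase_without_lowercase is
hypothetical, and the test assumes that general category "Lu" together
with an unchanged str.lower() result identifies the characters meant
here. Results depend on the Unicode data of the Python build and may
differ from the list in this section, because later Unicode versions
have added lowercase mappings for some characters.

```python
import unicodedata


def uppercase_without_lowercase(code_points):
    """Return the code points that are uppercase letters (category Lu)
    but are left unchanged by lowercasing."""
    result = []
    for cp in code_points:
        ch = chr(cp)
        if unicodedata.category(ch) == "Lu" and ch.lower() == ch:
            result.append(cp)
    return result


# Scan the Greek block as a small example.
for cp in uppercase_without_lowercase(range(0x0370, 0x0400)):
    print(f"{cp:04X} {unicodedata.name(chr(cp))}")
```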

533	3.7.4 Radicals and Ideographic Description

535	Some Han characters can be informally defined in terms of ideographic
536	descriptions. However, ideographic descriptions can allow multiple
537	character streams to describe the same character in a fashion that does
538	not canonicalize. Thus, the radicals for ideographic description and the
539	ideographic description characters themselves are prohibited. These
540	characters are:

542	2E80-2EFF   [CJK RADICALS SUPPLEMENT]
543	2F00-2FDF   [KANGXI RADICALS]
544	2FF0-2FFF   [IDEOGRAPHIC DESCRIPTION CHARACTERS]

546	3.8 Summary of prohibited characters

548	The following is a collected list from the previous sections.

550	0000-001F   [CONTROL CHARACTERS]
551	0020        SPACE
552	0022        QUOTATION MARK
553	0023        NUMBER SIGN
554	0024        DOLLAR SIGN
555	0025        PERCENT SIGN
556	0026        AMPERSAND
557	002B        PLUS SIGN
558	002C        COMMA
559	002E        FULL STOP
561	002F        SOLIDUS
562	003A        COLON
563	003B        SEMICOLON
564	003C        LESS-THAN SIGN
565	003D        EQUALS SIGN
566	003E        GREATER-THAN SIGN
567	003F        QUESTION MARK
568	0040        COMMERCIAL AT
569	005B        LEFT SQUARE BRACKET
570	005C        REVERSE SOLIDUS
571	005D        RIGHT SQUARE BRACKET
572	007F        DELETE
573	0080-009F   [CONTROL CHARACTERS]
574	00A0        NO-BREAK SPACE
575	00AC        NOT SIGN
576	00AD        SOFT HYPHEN
577	00B1        PLUS-MINUS SIGN
578	00BC        VULGAR FRACTION ONE QUARTER
579	00BD        VULGAR FRACTION ONE HALF
580	00BE        VULGAR FRACTION THREE QUARTERS
581	00D7        MULTIPLICATION SIGN
582	01C3        LATIN LETTER RETROFLEX CLICK
583	02B0-02FF   [SPACING MODIFIER LETTERS]
584	037E        GREEK QUESTION MARK
586	03D2        GREEK UPSILON WITH HOOK SYMBOL
587	03D3        GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
588	03D4        GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
589	04C0        CYRILLIC LETTER PALOCHKA
590	0589        ARMENIAN FULL STOP
591	060C        ARABIC COMMA
592	061B        ARABIC SEMICOLON
593	066A        ARABIC PERCENT SIGN
594	066D        ARABIC FIVE POINTED STAR
595	06D4        ARABIC FULL STOP
596	070F        SYRIAC ABBREVIATION MARK
597	10A0-10C5   [GEORGIAN CAPITAL LETTERS]
598	1680        OGHAM SPACE MARK
599	1806        MONGOLIAN TODO SOFT HYPHEN
600	180B        MONGOLIAN FREE VARIATION SELECTOR ONE
601	180C        MONGOLIAN FREE VARIATION SELECTOR TWO
602	180D        MONGOLIAN FREE VARIATION SELECTOR THREE
603	180E        MONGOLIAN VOWEL SEPARATOR
604	2000-200B   [SPACES]
605	200C        ZERO WIDTH NON-JOINER
606	200D        ZERO WIDTH JOINER
607	200E        LEFT-TO-RIGHT MARK
608	200F        RIGHT-TO-LEFT MARK
609	2010        HYPHEN
610	2011        NON-BREAKING HYPHEN
611	2012        FIGURE DASH
612	2013        EN DASH
613	2014        EM DASH
614	201A        SINGLE LOW-9 QUOTATION MARK
615	2024        ONE DOT LEADER
616	2025        TWO DOT LEADER
617	2026        HORIZONTAL ELLIPSIS
618	2028        LINE SEPARATOR
619	2029        PARAGRAPH SEPARATOR
620	202A        LEFT-TO-RIGHT EMBEDDING
621	202B        RIGHT-TO-LEFT EMBEDDING
622	202C        POP DIRECTIONAL FORMATTING
623	202D        LEFT-TO-RIGHT OVERRIDE
624	202E        RIGHT-TO-LEFT OVERRIDE
625	202F        NARROW NO-BREAK SPACE
626	2030        PER MILLE SIGN
627	2031        PER TEN THOUSAND SIGN
628	2033        DOUBLE PRIME
629	2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK
630	203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
631	203D        INTERROBANG
632	2044        FRACTION SLASH
633	2048        QUESTION EXCLAMATION MARK
634	2049        EXCLAMATION QUESTION MARK
635	206A        INHIBIT SYMMETRIC SWAPPING
636	206B        ACTIVATE SYMMETRIC SWAPPING
637	206C        INHIBIT ARABIC FORM SHAPING
638	206D        ACTIVATE ARABIC FORM SHAPING
639	206E        NATIONAL DIGIT SHAPES
640	206F        NOMINAL DIGIT SHAPES
641	207A        SUPERSCRIPT PLUS SIGN
642	207B        SUPERSCRIPT MINUS
643	207C        SUPERSCRIPT EQUALS SIGN
644	208A        SUBSCRIPT PLUS SIGN
645	208B        SUBSCRIPT MINUS
646	208C        SUBSCRIPT EQUALS SIGN
647	2100        ACCOUNT OF
648	2101        ADDRESSED TO THE SUBJECT
649	2105        CARE OF
650	2106        CADA UNA
651	2153        VULGAR FRACTION ONE THIRD
652	2154        VULGAR FRACTION TWO THIRDS
653	2155        VULGAR FRACTION ONE FIFTH
654	2156        VULGAR FRACTION TWO FIFTHS
655	2157        VULGAR FRACTION THREE FIFTHS
656	2158        VULGAR FRACTION FOUR FIFTHS
657	2159        VULGAR FRACTION ONE SIXTH
658	215A        VULGAR FRACTION FIVE SIXTHS
659	215B        VULGAR FRACTION ONE EIGHTH
660	215C        VULGAR FRACTION THREE EIGHTHS
661	215D        VULGAR FRACTION FIVE EIGHTHS
662	215E        VULGAR FRACTION SEVEN EIGHTHS
663	215F        FRACTION NUMERATOR ONE
664	2160-217F   [ROMAN NUMERALS]
665	2190-21FF   [ARROWS]
666	2200-22FF   [MATHEMATICAL OPERATORS]
667	2300-23FF   [MISCELLANEOUS TECHNICAL]
668	2400-243F   [CONTROL PICTURES]
669	2440-245F   [OPTICAL CHARACTER RECOGNITION]
670	2488        DIGIT ONE FULL STOP
671	2489        DIGIT TWO FULL STOP
672	248A        DIGIT THREE FULL STOP
673	248B        DIGIT FOUR FULL STOP
674	248C        DIGIT FIVE FULL STOP
675	248D        DIGIT SIX FULL STOP
676	248E        DIGIT SEVEN FULL STOP
677	248F        DIGIT EIGHT FULL STOP
678	2490        DIGIT NINE FULL STOP
679	2491        NUMBER TEN FULL STOP
680	2492        NUMBER ELEVEN FULL STOP
681	2493        NUMBER TWELVE FULL STOP
682	2494        NUMBER THIRTEEN FULL STOP
683	2495        NUMBER FOURTEEN FULL STOP
684	2496        NUMBER FIFTEEN FULL STOP
685	2497        NUMBER SIXTEEN FULL STOP
686	2498        NUMBER SEVENTEEN FULL STOP
687	2499        NUMBER EIGHTEEN FULL STOP
688	249A        NUMBER NINETEEN FULL STOP
689	249B        NUMBER TWENTY FULL STOP
690	2500-257F   [BOX DRAWING]
691	2580-259F   [BLOCK ELEMENTS]
692	25A0-25FF   [GEOMETRIC SHAPES]
693	2600-267F   [MISCELLANEOUS SYMBOLS]
694	2700-27BF   [DINGBATS]
695	2800-287F   [BRAILLE PATTERNS]
696	2E80-2EFF   [CJK RADICALS SUPPLEMENT]
697	2F00-2FDF   [KANGXI RADICALS]
698	2FF0-2FFF   [IDEOGRAPHIC DESCRIPTION CHARACTERS]
699	3000        IDEOGRAPHIC SPACE
700	3001        IDEOGRAPHIC COMMA
701	3002        IDEOGRAPHIC FULL STOP
702	3003        DITTO MARK
703	3008        LEFT ANGLE BRACKET
704	3009        RIGHT ANGLE BRACKET
705	33A7        SQUARE M OVER S
706	33A8        SQUARE M OVER S SQUARED
707	33AE        SQUARE RAD OVER S
708	33AF        SQUARE RAD OVER S SQUARED
709	33C2        SQUARE AM
711	33C6        SQUARE C OVER KG
712	33C7        SQUARE CO
713	33D8        SQUARE PM
715	D800-DFFF   [SURROGATE CHARACTERS]
716	E000-F8FF   [PRIVATE USE, PLANE 0]
717	FB1D-FB4F   [HEBREW PRESENTATION FORMS]
718	FB50-FDFF   [ARABIC PRESENTATION FORMS A]
719	FE20-FE2F   [COMBINING HALF MARKS]
720	FE30-FE4F   [CJK COMPATIBILITY FORMS]
721	FE50-FE6F   [SMALL FORM VARIANTS]
722	FE70-FEFC   [ARABIC PRESENTATION FORMS B]
723	FEFF        ZERO WIDTH NO-BREAK SPACE
724	FF00-FFEF   [HALFWIDTH AND FULLWIDTH FORMS]
725	FFF9        INTERLINEAR ANNOTATION ANCHOR
726	FFFA        INTERLINEAR ANNOTATION SEPARATOR
727	FFFB        INTERLINEAR ANNOTATION TERMINATOR
728	FFFC        OBJECT REPLACEMENT CHARACTER
729	FFFD        REPLACEMENT CHARACTER
730	Unassigned characters
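
A check against the list above can be sketched as a lookup over
code point ranges. This is a minimal illustration, not a normative
table: PROHIBITED_RANGES and find_prohibited are hypothetical names,
the ranges are only an excerpt of the entries listed above, and the
"Unassigned characters" entry is left out because it depends on the
Unicode version in use.

```python
# Hypothetical excerpt of the prohibited list; a real table would carry
# every entry from the list above, not just these few ranges.
PROHIBITED_RANGES = [
    (0x2029, 0x202F),  # PARAGRAPH SEPARATOR through NARROW NO-BREAK SPACE
    (0x2160, 0x217F),  # [ROMAN NUMERALS]
    (0x3000, 0x3003),  # IDEOGRAPHIC SPACE through DITTO MARK
    (0xD800, 0xDFFF),  # [SURROGATE CHARACTERS]
    (0xE000, 0xF8FF),  # [PRIVATE USE, PLANE 0]
    (0xFF00, 0xFFEF),  # [HALFWIDTH AND FULLWIDTH FORMS]
    (0xFFF9, 0xFFFD),  # INTERLINEAR ANNOTATION ANCHOR through REPLACEMENT CHARACTER
]


def find_prohibited(text):
    """Return (index, code point) pairs for characters in a prohibited range."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in PROHIBITED_RANGES):
            hits.append((index, cp))
    return hits
```

An input for which find_prohibited returns an empty list contains none
of the characters in this excerpt; any non-empty result identifies the
offending positions.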

732	4. Case Folding

734	After it has been verified that the input text has none of the
735	characters prohibited for case folding, the case-folding step itself is
736	quite straightforward. For each character in the input, if there is a
737	lowercase mapping for that character in [UniData], the input character
738	is changed to the mapped lowercase letter.
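
The step described above can be sketched as follows, assuming the
semicolon-separated layout of the UnicodeData file in which the first
field is the code point and the fourteenth field is the lowercase
mapping. SAMPLE_UNIDATA is a three-line excerpt used only for
illustration, and load_lowercase_map and fold_case are hypothetical
helper names; a real implementation would read the whole file.

```python
# Three-line excerpt in UnicodeData format (illustrative only).
SAMPLE_UNIDATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
"""


def load_lowercase_map(unidata_text):
    """Build {character: lowercase character} from UnicodeData-format lines."""
    mapping = {}
    for line in unidata_text.splitlines():
        fields = line.split(";")
        # Field 0 is the code point; field 13 is the lowercase mapping, if any.
        if len(fields) > 13 and fields[13]:
            mapping[chr(int(fields[0], 16))] = chr(int(fields[13], 16))
    return mapping


def fold_case(text, mapping):
    """Replace each character that has a lowercase mapping; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


LOWER = load_lowercase_map(SAMPLE_UNIDATA)
```

With this excerpt, fold_case("A\u00c9a1", LOWER) maps the two uppercase
letters and leaves the lowercase letter and the digit unchanged, since
neither has a lowercase mapping in the table.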

740	5. Canonicalization

742	After case folding, the input string is normalized using form KC, as
743	described in [UTR15].
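
As an illustration, form KC is available in Python's standard
unicodedata module. The sketch below assumes the input has already been
case folded; canonicalize is a hypothetical wrapper name, and the
result depends on the Unicode version bundled with the Python runtime,
which need not match the version this document is based on.

```python
import unicodedata


def canonicalize(text):
    """Normalize text using Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)
```

For example, canonicalize composes "e" followed by U+0301 COMBINING
ACUTE ACCENT into U+00E9, and replaces the compatibility character
U+FB01 LATIN SMALL LIGATURE FI with the two letters "fi".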

745	6. IDN Table Revisions

747	A table consisting of all characters allowed and prohibited and the
748	rules for case folding and canonicalization will be created based on the
749	content of [UniData] and on the content of this document. This table
750	will be the authority for implementations to follow and will be
751	normatively referenced by this document. Such a table will enable the
752	IDN protocol to have versions independent of the revisions to Unicode
753	and/or to ISO 10646 because the revision of IDN and its deployment may
754	not be in sync with revisions to Unicode and ISO 10646.

756	In a future draft of this document, IANA will be asked to keep this
757	table, with an initial version number of 1. Each new version of the
758	table will have a new, higher version number.

760	7. Security Considerations

762	Much of the security of the Internet relies on the DNS. Thus, any change
763	to the characteristics of the DNS can change the security of much of the
764	Internet.

766	Host names are used by users to connect to Internet servers. The
767	security of the Internet would be compromised if a user entering a
768	single internationalized name could be connected to different servers
769	based on different interpretations of the internationalized host name.

771	8. References

773	[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
774	Proposals", draft-ietf-idn-compare.

776	[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
777	draft-ietf-idn-requirement.

779	[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
780	technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
781	1: Architecture and Basic Multilingual Plane.  Five amendments and a
782	technical corrigendum have been published up to now. UTF-16 is described
783	in Annex Q, published as Amendment 1. 17 other amendments are currently
784	at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE
785	UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

787	[Normalize] "Character Normalization in IETF Protocols",
788	draft-duerst-i18n-norm-03.

790	[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
791	Requirement Levels", March 1997, RFC 2119.

793	[RFC2396] Tim Berners-Lee, et al., "Uniform Resource Identifiers (URI):
794	Generic Syntax", August 1998, RFC 2396.

796	[RFC2732] Robert Hinden, et al., "Format for Literal IPv6 Addresses in
797	URL's", December 1999, RFC 2732.

799	[STD13] Paul Mockapetris, "Domain names - implementation and
800	specification", November 1987, STD 13 (RFC 1035).

802	[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
803	3.0", ISBN 0-201-61633-5. Described at
804	<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

806	[UniData] The Unicode Consortium. UnicodeData File.
807	<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

809	[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms.
810	Unicode Technical Report #15.
811	<http://www.unicode.org/unicode/reports/tr15/>.

813	A. Acknowledgements

815	Many people from the IETF IDN Working Group and the Unicode Technical
816	Committee contributed ideas that went into the first draft of this
817	document. Mark Davis was particularly helpful in some of the early
818	ideas.

820	B. Changes From Previous Versions of this Draft

822	This is the -00 version, so there are no changes.

824	C. IANA Considerations

826	There are no specific IANA considerations in this draft, but there will
827	be in a future draft of this document.

829	D. Author Contact Information

831	Paul Hoffman
832	Internet Mail Consortium and VPN Consortium
833	127 Segre Place
834	Santa Cruz, CA  95060 USA
835	paul.hoffman@imc.org and paul.hoffman@vpnc.org

837	Marc Blanchet
838	Viagenie inc.
839	2875 boul. Laurier, bur. 300
840	Ste-Foy, Quebec, Canada, G1V 2M2
841	Marc.Blanchet@viagenie.qc.ca