Internet Draft                                                 M. Duerst
                                                     W3C/Keio University
Expires in six months                                           M. Davis
                                                                     IBM
                                                              March 2000

              Character Normalization in IETF Protocols

Status of this Memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but may be
discussed on the mailing lists or
.

This is a new version of an Internet Draft entitled "Normalization of
Internationalized Identifiers" that dealt with quite similar issues
and was submitted in July 1997 by the first author while he was at the
University of Zurich.
Abstract

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings. This document proposes
that in IETF protocols, the class of duplicates called canonical
equivalents be dealt with by using Early Uniform Normalization
according to Unicode Normalization Form C, Canonical Composition
[UTR15]. This document describes both Early Uniform Normalization and
Normalization Form C.

Table of contents

0. Change log
1. Introduction
2. Early Uniform Normalization
3. Canonical Composition (Normalization Form C)
3.1 Decomposition
3.2 Reordering
3.3 Recomposition
3.4 Implementation Notes
4. Stability and Versioning
5. Cases not dealt with by Canonical Equivalence
6. Security Considerations
Acknowledgements
References
Copyright
Author's Addresses

0. Change log

Changes from -02 to -03

- Fixed a bad typo in the title.
- Made a lot of wording corrections and presentation improvements,
  most of them suggested by Paul Hoffman.

1. Introduction

1.1 Motivation

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings. This has led to
uncertainty for protocol specifiers and implementers, because it was
not clear which part of the Internet infrastructure should take
responsibility for these duplicates, and how.

There are mainly two kinds of duplicates, singleton equivalences and
precomposed/decomposed equivalences.
Both of these can be illustrated
using the character A with a ring above. This character can be encoded
in three ways:

1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE
3) U+212B ANGSTROM SIGN

In all three cases, the result is supposed to look the same to the
reader. The equivalence between 1) and 3) is a singleton equivalence;
the equivalence between 1) and 2) is a precomposed/decomposed
equivalence. 1) is the precomposed representation, 2) is the decomposed
representation. The inclusion of these various representation
alternatives was a result of the requirement for round-trip conversion
with a wide range of legacy encodings, as well as of the merger between
Unicode and ISO 10646.

The Unicode Standard has from early on defined Canonical Equivalence to
make clear which sequences of codepoints should be treated as pure
encoding duplicates and which sequences of codepoints should be treated
as genuinely different (if in some cases closely related) data.
The Unicode Standard also from early on defined decomposed
normalization, what is now called Normalization Form D (case 2) in the
example above). This is very well suited for some kinds of internal
processing, but decomposition does not correspond to how data gets
converted from legacy encodings and transmitted on the Internet. In
that case, precomposed data (i.e. case 1) in the example above) is
prevalent.

Note: This specification uses the term 'codepoint', and not
'character', to make clear that it speaks about what the standards
encode, and not what the end user thinks about.

Encouraged by many factors, such as a requirements analysis by the W3C
[Charreq], the Unicode Technical Committee defined Normalization Form
C, Canonical Composition (see [UTR15]).
Normalization Form C in general produces
the same representation as straightforward transcoding from legacy
encodings (see Section 3.4 for the known exception). The careful and
detailed definition of Normalization Form C is mainly needed to
unambiguously define edge cases (base letters with two or more
combining characters). Most of these edge cases will turn up extremely
rarely in actual data.

The W3C is adopting Normalization Form C in the form of Early Uniform
Normalization, which means that it assumes that in general, data will
already be in Normalization Form C [Charmod].

This document proposes that in IETF protocols, Canonical Equivalents be
dealt with by using Early Uniform Normalization according to Unicode
Normalization Form C, Canonical Composition [UTR15]. This document
describes both Early Uniform Normalization (in Section 2) and
Normalization Form C (in Section 3). Section 4 contains an analysis of
(mostly theoretical) potential risks to the stability of Normalization
Form C. For reference, Section 5 discusses various cases of
equivalences not dealt with by Normalization Form C.

2. Early Uniform Normalization

This section tries to give some guidance on how Normalization Form C,
defined later in Section 3, should be used by Internet protocols.
Each Internet protocol has to define by itself how to use Normalization
Form C, and has to take into account its particular needs. However,
the advice in this section is intended to help writers of
specifications not very familiar with text normalization issues, and to
try to make sure that the various protocols use solutions that
interface easily with each other.

This section uses various well-known Internet protocols as examples.
However, such examples do not imply that the protocol elements
mentioned actually accept non-ASCII characters.
Depending on the protocol element
mentioned, that may or may not be the case. Also, the examples are not
intended to actually define how a specific protocol deals with text
normalization issues. This is solely the responsibility of the
specification for each specific protocol.

The basic principle for how to use Normalization Form C is Early
Uniform Normalization. This means that ideally, only text in
Normalization Form C appears on the wire on the Internet. This can be
seen as applying 'be conservative in what you send' to the problem
of text normalization. And (again ideally) it should not be necessary
for each implementation of an Internet protocol to separately implement
normalization. Text should just be provided normalized by the
underlying infrastructure, e.g. the operating system or the keyboard
driver.

Early normalization is of particular importance for those parts of
Internet protocols that are used as identifiers. Examples would
be URIs, domain names, email addresses, identifier names in PKIX
certificates, file names in FTP, newsgroup names in NNTP, and so on.
This is due to the following reasons:

- In order for the protocol to work, it has to be very well defined
  when two protocol element values match and when they do not.
- Implementations, in particular on the server side, do not in any
  way have to deal with e.g. the display of multilingual text, but on
  the other hand have to handle a lot of protocol-specific issues.
  Such implementations therefore should not be bothered with text
  normalization.

For free text, e.g. the content of mail messages or news postings,
Early Uniform Normalization is somewhat less important, but can
definitely improve interoperability.
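The effect of Early Uniform Normalization on comparison can be
illustrated with a small sketch (this draft does not mandate any
particular library; Python's standard unicodedata module happens to
implement the normalization forms of [UTR15], and the strings are the
A-with-ring example from Section 1):

```python
import unicodedata

precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed  = "A\u030A"  # A followed by COMBINING RING ABOVE
angstrom    = "\u212B"   # ANGSTROM SIGN

# Without normalization, a purely binary comparison fails:
assert precomposed != decomposed

# Once both sides are in Normalization Form C, binary comparison works:
def nfc(s):
    return unicodedata.normalize("NFC", s)

assert nfc(precomposed) == nfc(decomposed) == nfc(angstrom) == "\u00C5"
```

If the sender normalizes early, the receiver never needs to run this
normalization step and can compare byte-for-byte.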
For protocol elements used as identifiers, this document advises
Internet protocols to specify the following:

- Comparison should be carried out as a purely binary comparison
  (after it has been made sure, where necessary, that the texts to be
  compared are in the same character encoding).
- Any kind of text, and in particular identifier-like protocol
  elements, should be sent normalized to Normalization Form C.
- In case comparison fails due to a difference in text normalization,
  the originator of the non-normalized text is responsible for the
  failure.
- In case implementers are aware, or suspect, that their underlying
  infrastructure produces non-normalized text, they should take care
  to do the necessary tests and, if necessary, the actual
  normalization by themselves.
- In the case of the creation of identifiers, and in particular if
  this creation is comparatively infrequent (e.g. newsgroup names,
  domain names) and happens in a rather centralized manner, explicit
  checks for normalization should be required by the protocol
  specification.

3. Canonical Composition (Normalization Form C)

This section describes Canonical Composition (Normalization Form C).
The description is done in a procedural way, but any other procedure
that leads to identical results can be used. The result is intended
to be exactly identical to that described by [UTR15]. Various notes
are provided to help understand the description and give
implementation hints.

Given a sequence of UCS codepoints, its Canonical Composition can
be computed with the following three steps:

1. Decomposition
2. Reordering
3. Recomposition

These steps are described in detail below.

3.1 Decomposition

For each UCS codepoint in the input sequence, check whether this
codepoint has a canonical decomposition according to the newest
version of the Unicode Character Database (field 5 in [UniData]).
If such a decomposition is found, replace the codepoint in the
input sequence by the codepoint(s) in the decomposition, and
recursively check for and apply decomposition on the first replaced
codepoint.

Note: Fields in [UniData] are delimited by ';'. Field 5 in [UniData] is
the 6th field when counting with an index origin of 1. Fields starting
with a tag delimited by '<' and '>' indicate compatibility
decompositions; these compatibility decompositions MUST NOT be used for
Normalization Form C.

Note: For Korean Hangul, the decompositions are not contained
in [UniData], but have to be generated algorithmically
according to the description in [Unicode].

Note: Some decompositions replace a single codepoint by another
single codepoint.

Note: It is not necessary to check replaced codepoints other than the
first one due to the properties of the data in the Unicode Character
Database.

Note: It is possible to 'precompile' the decompositions to avoid
having to apply them recursively.

3.2 Reordering

For each adjacent pair of UCS codepoints after decomposition,
check the combining classes of the UCS codepoints according to
the newest version of the Unicode Character Database (field 3
in [UniData]). If the combining class of the first codepoint
is higher than the combining class of the second codepoint,
and at the same time the combining class of the second codepoint
is not zero, then exchange the two codepoints. Repeat this process
until no two codepoints can be exchanged anymore.

Note: A combining class greater than zero indicates that a codepoint
is a combining mark that participates in reordering. A combining
class of zero indicates that a codepoint is not a combining mark,
or that it is a combining mark that is not affected by reordering.
There are no combining classes below zero.
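The exchange procedure described above can be sketched as follows (a
minimal illustration in Python; unicodedata.combining() returns the
combining class from [UniData], and the function name
canonical_reorder is ours, not part of any standard):

```python
import unicodedata

def canonical_reorder(s):
    # Exchange adjacent codepoints when the first has a higher
    # combining class than the second and the second's class is
    # non-zero; repeat until no more exchanges apply.
    cps = list(s)
    done = False
    while not done:
        done = True
        for i in range(len(cps) - 1):
            cc1 = unicodedata.combining(cps[i])
            cc2 = unicodedata.combining(cps[i + 1])
            if cc2 != 0 and cc1 > cc2:
                cps[i], cps[i + 1] = cps[i + 1], cps[i]
                done = False
    return "".join(cps)
```

For example, 'q' + COMBINING DOT ABOVE (class 230) + COMBINING DOT
BELOW (class 220) is reordered so that the dot below comes first,
matching the order produced by decomposed normalization.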
Note: Besides a few script-specific combining classes, combining
classes mainly distinguish whether a combining mark is attached to the
base letter or just placed near the base letter, and on which side of
the base letter (e.g. bottom, above right, ...) the combining mark is
attached/placed. Reordering assures that combining marks placed on
different sides of the same character appear in a canonical order
(because any order would visually look the same), while
combining marks placed on the same side of a character
are not reordered (because reordering them would change
the combination they represent).

Note: After completing this step, the sequence of UCS codepoints
is in Canonical Decomposition (Normalization Form D).

3.3 Recomposition

Process the sequence of UCS codepoints resulting from Reordering
from start to end. This process requires a state variable called
'initial'. At the beginning of the process, the value of 'initial'
is empty.

- If 'initial' has a value, and the codepoint immediately
  preceding the current codepoint is this 'initial' or has a combining
  class smaller than the combining class of the current codepoint,
  and the 'initial' can be canonically recombined with the current
  codepoint, then replace the 'initial' with the canonical
  recombination and remove the current codepoint.
- Otherwise, if the current codepoint has combining class zero,
  store its value in 'initial'.

A sequence of two codepoints can be canonically recombined to a
third codepoint if this third codepoint has a canonical decomposition
into the sequence of two codepoints (see [UniData], field 5) and
this canonical decomposition is not excluded from recombination.
For Korean Hangul, the recompositions are not contained
in [UniData], but have to be generated algorithmically
according to the description in [Unicode].
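The pairwise recombination data can be derived from the canonical
decompositions in [UniData]. As a sketch (Python; the name
build_composition_table is ours, the scan is limited to an arbitrary
codepoint range for brevity, and the Hangul algorithmic compositions
and the [CompExcl] exclusion lists for rules 2) to 4) below are
deliberately ignored here):

```python
import unicodedata

def build_composition_table(limit=0x3400):
    # Map (first, second) codepoint pairs to the codepoint that
    # canonically decomposes into them. Decompositions carrying a
    # '<...>' tag are compatibility decompositions and are skipped;
    # singletons (one-codepoint decompositions) are skipped as well,
    # mirroring exclusion rule 1).
    table = {}
    for cp in range(limit):
        d = unicodedata.decomposition(chr(cp))
        if not d or d.startswith("<"):
            continue
        parts = d.split()
        if len(parts) == 2:
            first, second = (chr(int(p, 16)) for p in parts)
            table[(first, second)] = chr(cp)
    return table

table = build_composition_table()
assert table[("A", "\u030A")] == "\u00C5"  # A + ring recombine to Å
```

A real implementation would additionally remove the excluded
codepoints, as specified next.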
The exclusions from recombination are defined as follows:

1) Singletons: Codepoints that have a canonical decomposition into
   a single other codepoint.
2) Non-starters: Codepoints with a decomposition starting with
   a codepoint of a combining class other than zero.
3) Post-Unicode3.0: Codepoints with a decomposition introduced
   after Unicode 3.0.
4) Script-specific: Precomposed codepoints that are not the
   generally preferred form for their script.

The lists of codepoints for 1) and 2) can be produced directly
from the Unicode Character Database [UniData]. The list of
codepoints for 3) can be produced from a comparison between
the 3.0.0 version and the latest version of [UniData], but this
may be difficult. The list of codepoints for 4) cannot be computed.
For 3) and 4), the lists provided in [CompExcl] MUST be used.
[CompExcl] also provides lists for 1) and 2) for cross-checking.
The list for 3) is currently empty because there are currently
no post-Unicode3.0 codepoints with decompositions.

Note: At the beginning of recomposition, there is no 'initial'.
An 'initial' is remembered as soon as the first codepoint
with a combining class of zero is found. Not every codepoint
with a combining class of zero becomes an 'initial'; the
exceptions are those that are the second codepoint in
a recomposition. The 'initial' as used in this description
is slightly different from the 'starter' used in [UTR15].

Note: Checking that the previous codepoint has a combining class
smaller than the combining class of the current codepoint
assures that the conditions used for reordering are maintained
in the recombination step.

Note: Exclusion of singletons is necessary because in a pair of
canonically equivalent codepoints, the canonical decomposition
points from the 'less desirable' codepoint to the preferred
codepoint.
In this way, both canonical decomposition and
canonical composition prefer the same codepoint.

Note: For a discussion of the exclusion of Post-Unicode3.0
codepoints from recombination, please see Section 4
on versioning issues.

Note: Other algorithms for recomposition have been considered, but
this algorithm has been chosen because it provides a very good
balance between computational and implementation complexity
and the 'power' of recombination.

3.4 Implementation Notes

This section contains various notes on potential implementation
issues, improvements, and shortcuts.

3.4.1 Avoiding decomposition, and checking for Normalization Form C

It is not always necessary to decompose
and recompose. In particular, any sequence that does not contain
any of the following is already in Normalization Form C:

- Codepoints that are excluded from recomposition
- Codepoints that appear in second position in a canonical
  recomposition
- Hangul Jamo codepoints (U+1100-U+11F9)
- Unknown codepoints

If a contiguous part of a sequence satisfies the above criterion,
all but the last of its codepoints are already in Normalization Form C.

The above criterion can also be used to easily check that some data
is already in Normalization Form C. However, this check will reject
some cases that actually are normalized.

3.4.2 Unknown codepoints

Unknown codepoints are listed above to avoid claiming
that something is in Normalization Form C when it may indeed not be,
but they usually will be treated differently from the others. The
following behaviours may be possible, depending on the context of
normalization:

- Stop the normalization process with a fatal error. (This should be
  done only in very exceptional circumstances. It would mean that
  the implementation fails on data that conforms to a future version
  of Unicode.)
- Produce some warning that such codepoints have been seen, for
  further checking.
- Just copy the unknown codepoint from the input to the output,
  running the risk of not normalizing completely.
- Check via the Internet that the program-internal data is up to date.
- Distinguish behaviour depending on the range of codepoints in which
  the unknown codepoint has been found.

3.4.3 Surrogates

When implementing normalization for sequences of UCS codepoints
represented as UTF-16 code units, care has to be taken that pairs of
surrogate code units that represent a single UCS codepoint are treated
appropriately.

3.4.4 Korean Hangul

There are no interactions between the normalization of
Korean Hangul and the other normalizations. These two parts of
normalization can therefore be carried out separately, with different
implementation improvements.

3.4.5 Piecewise application

The various steps, such as decomposition,
reordering, and recomposition, can be applied to parts of a
codepoint sequence. As an example, when normalizing a large file,
normalization can be done on each line separately, because line
endings and normalization do not interact.

3.4.6 Integrating decomposition and recomposition

It is possible to
avoid full decomposition by noting that the decomposition of
a codepoint that is not in the exclusion list can be avoided
if it is not followed by a codepoint that can appear in second
position in a canonical recomposition. This condition can
be strengthened by noting that decomposition is not necessary
if the combining class of the following codepoint is higher
than the highest combining class obtained from decomposing
the character in question. In other cases, a decomposition
followed immediately by a recomposition can be precalculated.
Further details are left to the reader.
3.4.7 Decomposition

Recursive application of decomposition can be
avoided by a preprocessing step that calculates a full canonical
decomposition for each character with a canonical decomposition.

3.4.8 Reordering

The reordering step basically is a sorting problem.
Because the number of consecutive combining marks (i.e. consecutive
codepoints with a combining class greater than zero) is usually
extremely small, a very simple sorting algorithm can be used,
e.g. a straightforward bubble sort.

Because reordering will occur
extremely locally, the following variant of bubble sort will lead
to a fast and simple implementation:

- Start by checking the first pair (i.e. the first two codepoints).
- If there is an exchange, and we are not at the start of the
  sequence, move back by one codepoint and check again.
- Otherwise (i.e. if there is no exchange, or we are at the start
  of the sequence), and we are not at the end of the sequence,
  move forward by one codepoint and check again.
- If we are at the end of the sequence, and there has been no
  exchange for the last pair, then we are done.

3.4.9 Conversion from legacy encodings

Normalization Form C is designed so that
in almost all cases, one-to-one conversion from legacy encodings (e.g.
iso-8859-1, ...) to the UCS will produce a result that is already in
Normalization Form C.

The one known exception at the moment is the Vietnamese Windows
code page, which uses a kind of 'half-precomposed' encoding, whereas
Normalization Form C uses full precomposition for the characters
needed for Vietnamese. It was impossible to preserve the
'half-precomposed' encoding for Vietnamese in Normalization Form C
because this would have led to anomalies, among others, for French.
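The iso-8859-1 case can be checked directly (a sketch in Python,
assuming nothing beyond the standard library):

```python
import unicodedata

# One-to-one conversion from iso-8859-1 yields precomposed codepoints,
# so the decoded text is already in Normalization Form C.
legacy_bytes = b"caf\xe9"                  # "cafe" + e-acute in iso-8859-1
text = legacy_bytes.decode("iso-8859-1")
assert text == unicodedata.normalize("NFC", text)

# Normalization Form D, by contrast, would change the transcoded text.
assert text != unicodedata.normalize("NFD", text)
```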
3.4.10 Uses of the UCS in non-normalized form

The only known case where the UCS is used
in a way that is not in Normalization Form C is a group of users using
the UCS for Yiddish. The few combinations of Hebrew base letters and
diacritics used to write Yiddish are available precomposed in the UCS.
On the other hand, the many combinations used in writing the Hebrew
language are only available by using combining characters.

In order to lead to a uniform model of
encoding Hebrew, the precomposed Hebrew codepoints were excluded from
recombination. This means that Yiddish using precomposed codepoints is
not in Normalization Form C. It is hoped that as soon as systems that
transparently handle composition become more widespread, Yiddish users
will move to using a decomposed representation that is in
Normalization Form C.

Implementation examples can be found at [Charlint] (Perl) and
[Normalizer] (Java).

4. Stability and Versioning

Defining a normalization form for Internet-wide use requires that
this normalization form stays as stable as possible. Stability for
Normalization Form C is mainly achieved by introducing a cutoff
version. For precomposed characters encoded up to and including this
version, in principle the precomposed version is the normal form, but
precomposed codepoints introduced after the cutoff version are
decomposed in Normalization Form C.

As the cutoff version, version 3.0 of Unicode and the second edition
of ISO/IEC 10646-1 have been chosen. These are aligned codepoint-by-
codepoint, and are easily available.

The rest of this section discusses potential threats to the stability
of Normalization Form C, the probability of such threats, and how to
avoid them.

The analysis below shows that the probability of the various
threats is extremely low.
The analysis is provided here to
document the awareness of these threats and the measures that
have to be taken to avoid them. This section is only of marginal
importance to an implementer of Normalization Form C or to an
author of an Internet protocol specification.

4.1 New Precomposed Codepoints

The introduction of new (post-Unicode 3.0) precomposed codepoints
is not a threat to the stability of Normalization Form C. Such
codepoints would just provide an alternate way of encoding characters
that can already be encoded without them, by using a decomposed
form. The normalization algorithm already provides for the exclusion
of such characters from recomposition.

While Normalization Form C itself is not affected, such new codepoints
would affect implementations of Normalization Form C, because such
implementations have to be updated to correctly decompose the new
codepoints.

Note: While the new codepoints may be correctly normalized only by
updated implementations, once normalized, neither older nor updated
implementations will change anything anymore.

Because the new codepoints do not actually encode any new
characters that couldn't be encoded before, because the new codepoints
won't actually be used due to Early Uniform Normalization, and because
of the above implementation problems, encoding new precomposed
characters is superfluous and should be very clearly avoided.

4.2 New Combining Marks

It is in theory possible that a new combining mark would be encoded
that is intended to represent decomposable pieces of already existing
encoded characters. In case this indeed happens, problems for
Normalization Form C can be avoided by making sure that the
precomposed character that now has a decomposition is not included in
the list of recomposition exclusions.
While this helps for Normalization Form
C, adding a canonical decomposition would affect other normalization
forms, and it is therefore highly unlikely that such a canonical
decomposition will ever be added in the first place.

In case new combining marks are encoded for new scripts, or in case
a combining mark is introduced that does not yet appear in any
precomposed character, then the appropriate normalization for these
characters can easily be defined by providing the appropriate data.
However, hopefully no new encoding ambiguities will be introduced for
new scripts.

4.3 Changed Codepoints

A major threat to the stability of Normalization Form C would
come from changes to ISO/IEC 10646/Unicode itself, i.e. from moving
characters around, from redefining codepoints, or from ISO/IEC 10646
and Unicode evolving differently in the future. These threats are
not specific to Normalization Form C, but relevant for the use
of the UCS in general, and are mentioned here for completeness.

Because of the very wide and increasing use of the UCS throughout
the world, the resistance to any changes of defined
codepoints or to any divergence between ISO/IEC 10646 and Unicode
is extremely strong. Awareness of the need for stability in
this point, as well as in others, is particularly high due to the
experiences with some changes in the early history of these standards,
in particular with the reencoding of some Korean Hangul characters
in ISO/IEC 10646 amendment 5 (and the corresponding change in
Unicode). For the IETF in particular, the wording in [RFC 2279] and
[RFC 2781] stresses the importance of stability in this respect.

5. Cases not dealt with by Canonical Equivalence

This section gives a list of cases that are not dealt with by
Canonical Equivalence and Normalization Form C. This is done to help
the reader understand Normalization Form C and its limits.
The list in this section contains many cases of widely varying
nature. In most cases, a viewer familiar with the script in question
will be able to distinguish the various variants.

Internet protocols can deal with the cases below in various ways.
One way is to limit the characters allowed, e.g. in an identifier, so
that one of the variants is disallowed. Another is to assume that
the user can make the distinction him/herself. Yet another is to
recognize that some characters or combinations of characters that
would lead to confusion are very difficult to actually enter on any
keyboard; it may therefore not really be worth excluding them
explicitly.

- Various ligatures (Latin, Arabic)

- Croatian digraphs

- Full-width Latin compatibility variants

- Half-width Kana and Hangul compatibility variants

- Vertical compatibility variants (U+FE30...)

- Superscript/subscript variants (numbers and IPA)

- Small form compatibility variants (U+FE50...)

- Enclosed/encircled alphanumerics, Kana, Hangul,...

- Letterlike symbols, Roman numerals,...

- Squared Katakana and Latin abbreviations (units,...)

- Hangul jamo representation alternatives for historical Hangul

- Presence or absence of joiner/non-joiner and other control characters

- Upper case/lower case distinction

- Distinction between Katakana and Hiragana

- Similar letters from different scripts
  (e.g. "A" in Latin, Greek, and Cyrillic)

- CJK ideograph variants (glyph variants introduced due to the source
  separation rule, simplifications)

- Various punctuation variants (apostrophes, middle dots, spaces,...)

- Ignorable whitespace, hyphens,...

- Ignorable accents,...

Many of the cases above are identified as compatibility equivalences
in the Unicode database. [UTR15] defines Normalization Forms KC and
KD to normalize compatibility equivalences.
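The difference between canonical and compatibility equivalence can be
illustrated with one of the cases listed above, a full-width Latin
compatibility variant; this sketch again uses Python's standard
unicodedata module:

```python
import unicodedata

fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A

# NFC leaves compatibility variants untouched: the full-width form
# is not canonically equivalent to plain 'A'.
assert unicodedata.normalize("NFC", fullwidth_a) == "\uFF21"

# NFKC additionally applies compatibility mappings and folds the
# full-width variant to plain 'A' (U+0041).
assert unicodedata.normalize("NFKC", fullwidth_a) == "A"
```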
It may look attractive to just use Normalization Form KC instead of
Normalization Form C for Internet protocols. However, while the
Canonical Equivalence that forms the basis of Normalization Form C
deals with a very small number of very well defined cases of complete
equivalence (from a user's point of view), Compatibility Equivalence
comprises a very wide range of cases that usually have to be examined
one at a time.

6. Security Considerations

Improper implementation of normalization can cause problems in
security protocols. For example, in certificate chaining, if the
program validating a certificate chain mis-implements normalization
rules, an attacker might be able to spoof an identity by picking a
name that the validator thinks is equivalent to another name.

Acknowledgements

An earlier version of this document benefited from ideas, advice,
criticism and help from: Mark Davis, Larry Masinter, Michael Kung,
Edward Cherlin, Alain LaBonte, Francois Yergeau, and others. For
the current version, the authors were encouraged in particular by
Patrik Faltstrom and Paul Hoffman. The discussion of potential
stability threats is based on contributions by John Cowan and
Kenneth Whistler. Further contributions are due to Dan Oscarson.

References

[Charlint]  Martin Duerst.  Charlint - A Character Normalization
            Tool.

[Charreq]   Martin J. Duerst, Ed.  Requirements for String Identity
            Matching and String Indexing.  World Wide Web Consortium
            Working Draft.

[Charmod]   Martin J. Duerst and Francois Yergeau, Eds.  Character
            Model for the World Wide Web.  World Wide Web Consortium
            Working Draft.

[CompExcl]  The Unicode Consortium.  Composition Exclusions.

[ISO10646]  ISO/IEC 10646-1:1993.
            International standard -- Information technology --
            Universal multiple-octet coded character set (UCS) --
            Part 1: Architecture and Basic Multilingual Plane, and
            its Amendments.

[Normalizer] The Unicode Consortium.  Normalization Demo.

[RFC 2277]  Harald Alvestrand.  IETF Policy on Character Sets and
            Languages.  January 1998.

[RFC 2279]  Francois Yergeau.  UTF-8, a transformation format of
            ISO 10646.

[RFC 2781]  Paul Hoffman and Francois Yergeau.  UTF-16, an encoding
            of ISO 10646.

[Unicode]   The Unicode Consortium.  The Unicode Standard, Version
            3.0.  Reading, MA, Addison-Wesley Developers Press,
            2000.  ISBN 0-201-61633-5.

[UniData]   The Unicode Consortium.  UnicodeData File.  For an
            explanation of the content of this file, please see .

[UTR15]     Mark Davis and Martin Duerst.  Unicode Normalization
            Forms.  Unicode Technical Report #15.

Copyright

Copyright (C) The Internet Society, 2000.  All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works.  However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Authors' Addresses

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
mailto:duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: Please write "Duerst" with u-umlaut wherever possible, i.e. as
"D&#252;rst" in HTML and XML.

Mark E. Davis
IBM Center for Java Technology
10275 North De Anza Boulevard
Cupertino, CA 95014
U.S.A.
mailto:mark.davis@us.ibm.com
http://www.macchiato.com
Tel: +1 (408) 777-5850
Fax: +1 (408) 777-5891