Internet Draft                                                M. Duerst
                                                    W3C/Keio University
Expires in six months                                          M. Davis
                                                                    IBM
                                                             March 2000

              Character Normalization in IETF Protocols

Status of this Memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but may be
discussed on the mailing lists <...> or <...>.

This is a new version of an Internet Draft entitled "Normalization of
Internationalized Identifiers" that dealt with quite similar issues
and was submitted in July 1997 by the first author while he was at
the University of Zurich.

Abstract

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings.
This document proposes that in IETF protocols, the class of
duplicates called canonical equivalents be dealt with by using Early
Uniform Normalization according to Unicode Normalization Form C,
Canonical Composition [UTR15]. This document describes both Early
Uniform Normalization and Normalization Form C.

Table of contents

1. Introduction
2. Early Uniform Normalization
3. Canonical Composition (Normalization Form C)
3.1 Decomposition
3.2 Reordering
3.3 Recomposition
3.4 Implementation Notes
4. Stability and Versioning
5. Cases not dealt with by Canonical Equivalence
Acknowledgements
References
Copyright
Author's Addresses

1. Introduction

1.1 Motivation

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings. This has led to
uncertainty for protocol specifiers and implementers, because it was
not clear which part of the Internet infrastructure should take
responsibility for these duplicates, and how.

There are mainly two kinds of duplicates: singleton equivalences and
precomposed/decomposed equivalences. Both of these can be illustrated
using the character A with a ring above. This character can be
encoded in three ways:

1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE
3) U+212B ANGSTROM SIGN

In all three cases, the result is supposed to look the same to the
reader. The equivalence between 1) and 3) is a singleton equivalence;
the equivalence between 1) and 2) is a precomposed/decomposed
equivalence. 1) is the precomposed representation, 2) is the
decomposed representation.
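The three encodings above can be checked with Python's standard
unicodedata module, which implements the Unicode normalization forms
(a sketch for illustration; the assertions below reflect the behavior
of Normalization Form C as described in this document):

```python
import unicodedata

precomposed = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed  = "\u0041\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom    = "\u212B"        # ANGSTROM SIGN

# The three strings are distinct as codepoint sequences ...
assert len({precomposed, decomposed, angstrom}) == 3

# ... but Normalization Form C maps all of them to the single
# precomposed codepoint U+00C5.
for s in (precomposed, decomposed, angstrom):
    print(unicodedata.normalize("NFC", s) == precomposed)  # True each time
```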
The inclusion of these various representation alternatives was a
result of the requirement for round-trip conversion with a wide range
of legacy encodings, as well as of the merger between Unicode and
ISO 10646.

The Unicode Standard has from early on defined Canonical Equivalence
to make clear which cases should be treated as pure encoding
duplicates and which cases should be treated as genuinely different
(if in some cases closely related) data. The Unicode Standard also
from early on defined decomposed normalization, now called
Normalization Form D (case 2) in the example above). This is very
well suited for some kinds of internal processing, but decomposition
does not correspond to how data gets converted from legacy encodings
and transmitted on the Internet. In that case, precomposed data (i.e.
case 1) in the example above) is prevalent.

Encouraged among other things by a requirements analysis of the W3C
[Charreq], the Unicode Technical Committee defined Normalization Form
C, Canonical Composition (see [UTR15]). Normalization Form C in
general produces the same representation as straightforward
transcoding from legacy encodings (see Section 3.4 for the known
exception). The careful and detailed definition of Normalization
Form C is mainly needed to unambiguously define edge cases. Most of
these edge cases will turn up extremely rarely in actual data.

The W3C is adopting Normalization Form C in the form of Early Uniform
Normalization, which means that it assumes that in general, data will
already be in Normalization Form C [Charmod].

This document proposes that in IETF protocols, Canonical Equivalents
be dealt with by using Early Uniform Normalization according to
Unicode Normalization Form C, Canonical Composition [UTR15]. This
document describes both Early Uniform Normalization (in Section 2)
and Normalization Form C (in Section 3).
Section 4 contains an analysis of (mostly theoretical) potential
risks for the stability of Normalization Form C. For reference,
Section 5 discusses various cases of equivalences not dealt with by
Normalization Form C.

2. Early Uniform Normalization

This section gives some guidance on how Normalization Form C, defined
later in Section 3, should be used by Internet protocols. Each
Internet protocol has to define by itself how to use Normalization
Form C, and has to take into account its particular needs. However,
the advice in this section is intended to help writers of
specifications not very familiar with text normalization issues, and
to make sure that the various protocols use solutions that interface
easily with each other.

This section uses various well-known Internet protocols as examples.
However, such examples do not imply that the protocol elements
mentioned actually accept non-ASCII characters. Depending on the
protocol element mentioned, that may or may not be the case. Also,
the examples are not intended to actually define how a specific
protocol deals with text normalization issues. This is solely the
responsibility of the specification for each specific protocol.

The basic principle for how to use Normalization Form C is Early
Uniform Normalization. This means that ideally, only text in
Normalization Form C appears on the Internet. This can be seen as
applying 'be conservative in what you send' to the problem of text
normalization. And (again ideally) it should not be necessary for
each implementation of an Internet protocol to separately implement
normalization. Text should just be provided normalized by the
underlying infrastructure, e.g. the operating system or the keyboard
driver.

Early normalization is of particular importance for those parts of
Internet protocols that are used as identifiers.
Examples would be file names in FTP, newsgroup names in NNTP, and so
on. This is due to the following reasons:

- In order for the protocol to work, it has to be very well defined
  when two protocol element values match and when they do not.
- Implementations, in particular on the server side, do not in any
  way have to deal with e.g. the display of multilingual text, but on
  the other hand have to handle a lot of protocol-specific issues.
  Such implementations therefore should not be bothered with text
  normalization.

For free text, e.g. the content of mail messages or news postings,
Early Uniform Normalization is somewhat less important, but can
definitely improve interoperability.

For protocol elements used as identifiers, this document advises
Internet protocols to specify the following:

- Comparison should be carried out purely binary (after it has been
  made sure, where necessary, that the texts to be compared are in
  the same character encoding).
- Any kind of text, and in particular identifier-like protocol
  elements, should be sent normalized to Normalization Form C.
- In case comparison fails due to a difference in text normalization,
  the originator of the non-normalized text is responsible for the
  failure.
- In case implementers know, or suspect, that their underlying
  infrastructure produces non-normalized text, they should take care
  to do the necessary tests and, if necessary, the actual
  normalization themselves.
- In the case of creation of identifiers, and in particular if this
  creation is comparatively infrequent (e.g. newsgroup names, domain
  names) and happens in a rather centralized manner, explicit checks
  for normalization should be required by the protocol specification.

3. Canonical Composition (Normalization Form C)

This section describes Canonical Composition (Normalization Form C).
The description is done in a procedural way, but any other procedure
that leads to identical results can be used. The result is intended
to be exactly identical to that described by [UTR15]. Various notes
are provided to help understand the description and to give
implementation hints.

Given a sequence of UCS codepoints, its Canonical Composition can be
computed with the following three steps:

1. Decomposition
2. Reordering
3. Recomposition

These steps are described in detail below.

3.1 Decomposition

For each UCS codepoint in the input sequence, check whether this
codepoint has a canonical decomposition according to the newest
version of the Unicode Character Database (field 5 in [UniData]). If
such a decomposition is found, replace the codepoint in the input
sequence by the codepoint(s) in the decomposition, and apply
decomposition again to the replacement codepoints.

Note: Fields in [UniData] are delimited by ';'. Field 5 in [UniData]
is the 6th field when counting with an index origin of 1. Fields
starting with a tag delimited by '<' and '>' indicate compatibility
decompositions and therefore have to be ignored.

Note: For Korean Hangul, the decompositions are not contained in
[UniData], but have to be generated algorithmically according to the
description in [Unicode].

Note: Some decompositions replace a single codepoint by another
single codepoint.

Note: Due to the properties of the data in the Unicode Character
Database, recursive application of decompositions is necessary only
for the first codepoint of a decomposition.

3.2 Reordering

For each adjacent pair of UCS codepoints after decomposition, check
the combining classes of the UCS codepoints according to the newest
version of the Unicode Character Database (field 3 in [UniData]).
If the combining class of the first codepoint is higher than the
combining class of the second codepoint, and at the same time the
combining class of the second codepoint is not zero, then exchange
the two codepoints. Repeat this process until no two codepoints can
be exchanged anymore.

Note: A combining class greater than zero indicates that a codepoint
is a combining mark that participates in reordering. A combining
class of zero indicates that a codepoint is not a combining mark, or
that it is a combining mark that is not affected by reordering. There
are no combining classes below zero.

Note: Besides a few script-specific combining classes, combining
classes mainly distinguish whether a combining mark is attached to
the base letter or just placed near the base letter, and on which
side of the base letter (e.g. bottom, above right, ...) the combining
mark is attached/placed. Reordering assures that combining marks
placed on different sides of the same character appear in a canonical
order (because any order would visually look the same), while
combining marks placed on the same side of a character are not
reordered (because reordering them would change the combination they
represent).

Note: As a result of this step, the sequence of UCS codepoints is in
Canonical Decomposition (Normalization Form D).

3.3 Recomposition

Process the sequence of UCS codepoints resulting from Reordering from
start to end. At the start, no 'initial' is remembered.
For each of the codepoints, do the following:

- If you have remembered an 'initial', and the codepoint immediately
  preceding the current codepoint is this 'initial' or has a
  combining class smaller than the combining class of the current
  codepoint, and the 'initial' can be canonically recombined with the
  current codepoint, then replace the 'initial' with the canonical
  recombination and remove the current codepoint.
- Else, if the current codepoint has combining class zero, remember
  it as the new 'initial'.

A sequence of two codepoints can be canonically recombined to a third
codepoint if this third codepoint has a canonical decomposition into
the sequence of two codepoints (see [UniData], field 5) and this
canonical decomposition is not excluded from recombination. For
Korean Hangul, the decompositions are not contained in [UniData], but
have to be generated algorithmically according to the description in
[Unicode]. The exclusions from recombination are defined as follows:

1) Singletons: Codepoints that have a canonical decomposition into a
   single other codepoint.
2) Non-starters: Codepoints with a decomposition starting with a
   codepoint of a combining class other than zero.
3) Post-Unicode 3.0: Codepoints with a decomposition introduced after
   Unicode 3.0.
4) Script-specific: Precomposed codepoints that are not the generally
   preferred form for their script.

The lists of codepoints for 1) and 2) can be produced directly from
the Unicode Character Database [UniData]. The list of codepoints for
3) can be produced from a comparison between the 3.0.0 version and
the latest version of [UniData], but this may be difficult. The list
of codepoints for 4) cannot be computed.
[CompExcl] provides a normative list for 4), lists for 1) and 2) for
cross-checking, and an empty slot for 3) (because there are currently
no post-Unicode 3.0 codepoints with decompositions).

Note: At the beginning of recomposition, there is no 'initial'. An
'initial' is remembered as soon as the first codepoint with a
combining class of zero is found. Not every codepoint with a
combining class of zero becomes an 'initial'; the exceptions are
those that are the second codepoint in a recomposition. The 'initial'
as used in this description is slightly different from the 'starter'
used in [UTR15].

Note: Checking that the previous codepoint has a combining class
smaller than the combining class of the current codepoint assures
that the conditions used for reordering are maintained in the
recombination step.

Note: Exclusion of singletons is necessary because in a pair of
canonically equivalent codepoints, the canonical decomposition points
from the 'less desirable' codepoint to the preferred codepoint. In
this case, both canonical decomposition and canonical composition
have the same preference.

Note: For a discussion of the exclusion of post-Unicode 3.0
codepoints from recombination, please see Section 4 on versioning
issues.

Note: Other algorithms for recomposition have been considered, but
this algorithm has been chosen because it provides a very good
balance between computational and implementation complexity and the
'power' of recombination.

3.4 Implementation Notes

This section contains various notes on potential implementation
issues, improvements, and shortcuts.

Avoiding decomposition: It is not always necessary to decompose
and recompose.
In particular, any sequence that does not contain any of the
following is already in Normalization Form C:

- Codepoints that are excluded from recomposition
- Codepoints that appear in second position in a canonical
  recomposition
- Hangul Jamo codepoints (U+1100-U+11F9)
- Unknown codepoints

If a contiguous part of a sequence satisfies the above criterion, all
but the last of its codepoints are already in Normalization Form C.

Unknown codepoints: Unknown codepoints are listed above to avoid
claiming that something is in Normalization Form C when it may indeed
not be, but they usually will be treated differently from the others.
The following behaviours may be possible, depending on the context of
normalization:

- Stop the normalization process with a fatal error. (This should be
  done only in very exceptional circumstances. It would mean that the
  implementation dies on data that conforms to a future version of
  Unicode.)
- Produce some warning that such codepoints have been seen, for
  further checking.
- Just copy the unknown codepoint from the input to the output,
  running the risk of not normalizing completely.
- Check via the Internet that the program-internal data is up to
  date.
- Distinguish behaviour depending on the range of codepoints in which
  the unknown codepoint has been found.

Surrogates: When implementing normalization for sequences of UCS
codepoints represented as UTF-16 code units, care has to be taken
that pairs of surrogate code units that represent a single UCS
codepoint are treated appropriately.

Korean Hangul: There are no interactions between the normalization of
Korean Hangul and the other normalizations. These two parts of
normalization can therefore be carried out separately, with different
implementation improvements.
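The 'avoiding decomposition' shortcut above amounts to a quick check
followed by normalization only when needed. A minimal sketch in
Python (the standard unicodedata module exposes such a quick check as
is_normalized() in Python 3.8 and later; ensure_nfc is a hypothetical
helper name):

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # Fast path: most real-world text is already in Normalization
    # Form C, so a quick check avoids the full decompose/reorder/
    # recompose work in the common case.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: actually normalize.
    return unicodedata.normalize("NFC", text)

print(ensure_nfc("A\u030A"))  # -> "Å" (the single codepoint U+00C5)
```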
Piecewise application: The various steps such as decomposition,
reordering, and recomposition can be applied to parts of a codepoint
sequence. As an example, when normalizing a large file, normalization
can be done on each line separately, because line endings and
normalization do not interact.

Integrating decomposition and recomposition: It is possible to avoid
full decomposition by noting that the decomposition of a codepoint
that is not in the exclusion list can be avoided if it is not
followed by a codepoint that can appear in second position in a
canonical recomposition. This condition can be strengthened by noting
that decomposition is not necessary if the combining class of the
following codepoint is higher than the highest combining class
obtained from decomposing the character in question. In other cases,
a decomposition followed immediately by a recomposition can be
precalculated. Further details are left to the reader.

Decomposition: Recursive application of decomposition can be avoided
by a preprocessing step that calculates a full canonical
decomposition for each character with a canonical decomposition.

Reordering: The reordering step is basically a sorting problem.
Because the number of consecutive combining marks (i.e. consecutive
codepoints with combining class greater than zero) is usually
extremely small, a very simple sorting algorithm can be used, e.g. a
straightforward bubble sort. Because reordering will occur extremely
locally, the following variant of bubble sort will lead to a fast and
simple implementation:

- Start by checking the first pair (i.e. the first two codepoints).
- If there is an exchange, and we are not at the start of the
  sequence, move back by one codepoint and check again.
- Otherwise (i.e.
if there is no exchange, or we are at the start of the sequence) and
we are not at the end of the sequence, move forward by one codepoint
and check again.
- If we are at the end of the sequence, and there has been no
  exchange for the last pair, then we are done.

Conversion from legacy encodings: Normalization Form C is designed so
that in almost all cases, one-to-one conversion from legacy encodings
(e.g. iso-8859-1, ...) to the UCS will produce a result that is
already in Normalization Form C. The one known exception to this at
the moment is the Vietnamese Windows code page, which uses a kind of
'half-precomposed' encoding, whereas Normalization Form C uses full
precomposition for the characters needed for Vietnamese. It was
impossible to preserve the 'half-precomposed' encoding for Vietnamese
in Normalization Form C because this would have led to anomalies
elsewhere, for example for French.

Uses of the UCS in non-normalized form: The only known case where the
UCS is used in a way that is not in Normalization Form C is a group
of users using the UCS for Yiddish. The few combinations of Hebrew
base letters and diacritics used to write Yiddish are available
precomposed in the UCS. On the other hand, the many combinations used
in writing the Hebrew language are only available by using combining
characters. In order to arrive at a uniform model of encoding Hebrew,
the precomposed Hebrew codepoints were excluded from recombination.
This means that Yiddish using precomposed codepoints is not in
Normalization Form C. It is hoped that once systems that
transparently handle composition become more widespread, Yiddish
users can move to using a decomposed representation that is in
Normalization Form C.

Implementation examples can be found at [Charlint] (Perl) and
[Normalizer] (Java).

4. Stability and Versioning

Defining a normalization form for Internet-wide use requires that
this normalization form stay as stable as possible. Stability for
Normalization Form C is mainly achieved by introducing a cutoff
version. For precomposed characters encoded up to and including this
version, in principle the precomposed version is the normal form, but
precomposed codepoints introduced after the cutoff version are
decomposed in Normalization Form C.

As the cutoff version, version 3.0 of Unicode and the second edition
of ISO/IEC 10646-1 have been chosen. These are aligned codepoint by
codepoint, and are easily available.

The rest of this section discusses potential threats to the stability
of Normalization Form C, the probability of such threats, and how to
avoid them.

The analysis below shows that the probability of the various threats
is extremely low. The analysis is provided here to document the
awareness of these threats and the measures that have to be taken to
avoid them. This section is only of marginal importance to an
implementer of Normalization Form C or to an author of an Internet
protocol specification.

4.1 New Precomposed Codepoints

The introduction of new (post-Unicode 3.0) precomposed codepoints is
not a threat to the stability of Normalization Form C. Such
codepoints would just provide an alternate way of encoding characters
that can already be encoded without them, by using a decomposed form.
The normalization algorithm already provides for the exclusion of
such characters from recomposition.

While Normalization Form C itself is not affected, such new
codepoints would affect implementations of Normalization Form C,
because such implementations have to be updated to correctly
decompose the new codepoints.
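The behavior of already-excluded codepoints illustrates how an
excluded precomposed codepoint behaves under Normalization Form C. A
Python sketch using a singleton exclusion (U+212B) and a
script-specific Hebrew exclusion (U+FB2A, the kind of exclusion
discussed in Section 3.4) as stand-ins for how any excluded
precomposed codepoint is handled:

```python
import unicodedata

# U+212B ANGSTROM SIGN: a singleton exclusion. NFC replaces it by its
# canonical equivalent, the preferred codepoint U+00C5.
print(unicodedata.normalize("NFC", "\u212B") == "\u00C5")  # True

# U+FB2A HEBREW LETTER SHIN WITH SHIN DOT: a script-specific
# exclusion. Under NFC the excluded precomposed codepoint is
# decomposed (SHIN + SHIN DOT) and never recomposed.
decomposed = unicodedata.normalize("NFC", "\uFB2A")
print([hex(ord(c)) for c in decomposed])  # -> ['0x5e9', '0x5c1']
```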
Note: While the new codepoint may be correctly normalized only by
updated implementations, once it is normalized, neither older nor
updated implementations will change anything anymore.

Because the new codepoints would not actually encode any new
characters that couldn't be encoded before, because the new
codepoints won't actually be used due to Early Uniform Normalization,
and because of the above implementation problems, encoding new
precomposed characters is superfluous and should be very clearly
avoided.

4.2 New Combining Marks

It is in theory possible that a new combining mark would be encoded
that is intended to represent decomposable pieces of already existing
encoded characters. In case this indeed happened, problems for
Normalization Form C can be avoided by making sure the precomposed
character that now has a decomposition is not included in the list of
recomposition exclusions. While this helps for Normalization Form C,
adding a canonical decomposition would affect other normalization
forms, and it is therefore highly unlikely that such a canonical
decomposition will ever be added in the first place.

In case new combining marks are encoded for new scripts, or in case a
combining mark is introduced that does not appear in any precomposed
character yet, then the appropriate normalization for these
characters can easily be defined by providing the appropriate data.
However, hopefully no new encoding ambiguities will be introduced for
new scripts.

4.3 Changed Codepoints

A major threat to the stability of Normalization Form C would come
from changes to ISO/IEC 10646/Unicode itself, i.e. by moving around
characters or redefining codepoints, or by ISO/IEC 10646 and Unicode
evolving differently in the future. These threats are not specific to
Normalization Form C, but relevant for the use of the UCS in general,
and are mentioned here for completeness.
Because of the very wide and increasing use of the UCS throughout the
world, the resistance to any changes of defined codepoints or to any
divergence between ISO/IEC 10646 and Unicode is extremely strong.
Awareness of the need for stability in this respect, as well as
others, is particularly high due to the experiences with some changes
in the early history of these standards, in particular with the
reencoding of some Korean Hangul characters in ISO/IEC 10646
Amendment 5 (and the corresponding change in Unicode). For the IETF
in particular, the wording in [RFC 2279] and [RFC 2781] stresses the
importance of stability in this respect.

5. Cases not dealt with by Canonical Equivalence

This section gives a list of cases that are not dealt with by
Canonical Equivalence and Normalization Form C. This is done to help
the reader understand Normalization Form C and its limits. The list
in this section contains many cases of widely varying nature. In most
cases, a viewer, if familiar with the script in question, will be
able to distinguish the various variants.

Internet protocols can deal in various ways with the cases below. One
way is to limit the characters allowed e.g. in an identifier so that
one of the variants is disallowed. Another way is to assume that the
user can make the distinction him/herself. Yet another is to
recognize that some characters or combinations of characters that
would lead to confusion are very difficult to actually enter on any
keyboard; it may therefore not really be worth excluding them
explicitly.

- Various ligatures (Latin, Arabic)

- Croatian digraphs

- Full-width Latin compatibility variants

- Half-width Kana and Hangul compatibility variants

- Vertical compatibility variants (U+FE30...)

- Superscript/subscript variants (numbers and IPA)

- Small form compatibility variants (U+FE50...)
   - Enclosed/encircled alphanumerics, Kana, Hangul,...

   - Letterlike symbols, Roman numerals,...

   - Squared Katakana and Latin abbreviations (units,...)

   - Hangul jamo representation alternatives for historical Hangul

   - Presence or absence of joiner/non-joiner and other control
     characters

   - Upper case/lower case distinction

   - Distinction between Katakana and Hiragana

   - Similar letters from different scripts
     (e.g. "A" in Latin, Greek, and Cyrillic)

   - CJK ideograph variants (glyph variants introduced due to the
     source separation rule, simplifications)

   - Various punctuation variants (apostrophes, middle dots, spaces,...)

   - Ignorable whitespace, hyphens,...

   - Ignorable accents,...

   Many of the cases above are identified as compatibility equivalences
   in the Unicode database. [UTR15] defines Normalization Forms KC and
   KD to normalize compatibility equivalences. It may look attractive
   to simply use Normalization Form KC instead of Normalization Form C
   for Internet protocols. However, while the Canonical Equivalence
   that forms the basis of Normalization Form C deals with a very small
   number of very well defined cases of complete equivalence (from a
   user's point of view), Compatibility Equivalence comprises a very
   wide range of cases that usually have to be examined one at a time.

Acknowledgements

   An earlier version of this document benefited from ideas, advice,
   criticism and help from Mark Davis, Larry Masinter, Michael Kung,
   Edward Cherlin, Alain LaBonte, Francois Yergeau, and others. For the
   current version, the authors were encouraged in particular by Patrik
   Faltstrom and Paul Hoffman. The discussion of potential stability
   threats is based on contributions by John Cowan and Kenneth
   Whistler.

References

   [Charlint]   Martin Duerst. Charlint - A Character Normalization
                Tool.

   [Charreq]    Martin J. Duerst, Ed.
Requirements for String Identity
                Matching and String Indexing. World Wide Web Consortium
                Working Draft.

   [Charmod]    Martin J. Duerst and Francois Yergeau, Eds. Character
                Model for the World Wide Web. World Wide Web Consortium
                Working Draft.

   [CompExcl]   The Unicode Consortium. Composition Exclusions.

   [ISO10646]   ISO/IEC 10646-1:1993. International standard --
                Information technology -- Universal multiple-octet
                coded character set (UCS) -- Part 1: Architecture and
                Basic Multilingual Plane, and its Amendments.

   [Normalizer] The Unicode Consortium. Normalization Demo.

   [RFC 2277]   Harald Alvestrand. IETF Policy on Character Sets and
                Languages, RFC 2277, January 1998.

   [RFC 2279]   Francois Yergeau. UTF-8, a transformation format of
                ISO 10646, RFC 2279.

   [RFC 2781]   Paul Hoffman and Francois Yergeau. UTF-16, an encoding
                of ISO 10646, RFC 2781.

   [Unicode]    The Unicode Consortium. The Unicode Standard, Version
                3.0. Reading, MA, Addison-Wesley Developers Press,
                2000. ISBN 0-201-61633-5.

   [UniData]    The Unicode Consortium. UnicodeData File.
                For an explanation of the content of this file, please
                see the UnicodeData file format documentation.

   [UTR15]      Mark Davis and Martin Duerst. Unicode Normalization
                Forms. Unicode Technical Report #15.

Copyright

   Copyright (C) The Internet Society, 2000. All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other
   than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Author's Addresses

   Martin J. Duerst
   W3C/Keio University
   5322 Endo, Fujisawa
   252-8520 Japan
   mailto:duerst@w3.org
   http://www.w3.org/People/D%C3%BCrst/
   Tel/Fax: +81 466 49 1170

   Note: Please write "Duerst" with u-umlaut wherever possible,
   i.e. as "D&#252;rst" in HTML and XML.

   Mark E. Davis
   IBM Center for Java Technology
   10275 North De Anza Boulevard
   Cupertino, CA 95014
   U.S.A.
   mailto:mark.davis@us.ibm.com
   http://www.macchiato.com
   Tel: +1 (408) 777-5850
   Fax: +1 (408) 777-5891