Internet Draft                                               M. Duerst
                                                   W3C/Keio University
Expires in six months                                         M. Davis
                                                                   IBM
                                                        September 2000

             Character Normalization in IETF Protocols

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but may be discussed on the relevant mailing lists.

Abstract

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very wide repertoire of characters. The IETF, in [RFC 2277], requires that future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible encoding of the UCS. The wide range of characters included in the UCS has led to some cases of duplicate encodings.
This document proposes that in IETF protocols, the class of duplicates called canonical equivalents be dealt with by using Early Uniform Normalization according to Unicode Normalization Form C, Canonical Composition (NFC) [UTR15]. This document describes both Early Uniform Normalization and Normalization Form C.

Table of contents

0. Change Log
1. Introduction
1.1 Motivation
1.2 Notational Conventions
2. Early Uniform Normalization
3. Canonical Composition (Normalization Form C)
3.1 Decomposition
3.2 Reordering
3.3 Recomposition
3.4 Implementation Notes
4. Stability and Versioning
5. Cases not Dealt with by Canonical Equivalence
6. Security Considerations
Acknowledgements
References
Copyright
Author's Addresses

0. Change Log

Changes from -03 to -04:

- Changed the introduction to make clear that this document is mainly about canonical equivalences
- Made UTR #15, Version 18.0, the normative description of NFC
- Added a subsection on interaction with text processing (3.4.11)
- Added various examples
- Various small wording changes
- Added a reference to the test file
- Added a note on terminology (normalization vs. canonicalization)

Changes from -02 to -03:

- Fixed a bad typo in the title.
- Made a lot of wording corrections and presentation improvements, most of them suggested by Paul Hoffman.

1. Introduction

1.1 Motivation

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very wide repertoire of characters. The IETF, in [RFC 2277], requires that future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible encoding of the UCS. The need for round-trip conversion to pre-existing character encodings has led to some cases of duplicate encodings. This has led to uncertainty for protocol specifiers and implementers, because it was not clear which part of the Internet infrastructure should take responsibility for these duplicates, and how.
Besides straight-out duplicates, there are also many cases of characters that are in one way or another similar. The equivalence between duplicates is called canonical equivalence. Many of the equivalences between similar characters are called compatibility equivalences. This document concentrates on canonical equivalence. The various cases of similar characters are listed in Section 5.

There are mainly two kinds of canonical equivalences: singleton equivalences and precomposed/decomposed equivalences. Both can be illustrated using the character A with a ring above. This character can be encoded in three ways:

1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE
3) U+212B ANGSTROM SIGN

The equivalence between 1) and 3) is a singleton equivalence. The equivalence between 1) and 2) is a precomposed/decomposed equivalence, where 1) is the precomposed representation and 2) is the decomposed representation.

All three representations are supposed to look the same to the reader. Applications may use one representation or another, or even more than one, but they are not allowed to assume that other applications will preserve the difference between them.

The inclusion of these various representation alternatives was a result of the requirement for round-trip conversion with a wide range of legacy encodings, as well as of the merger between Unicode and ISO 10646.

The Unicode Standard has from early on defined Canonical Equivalence to make clear which sequences of codepoints should be treated as pure encoding duplicates and which sequences of codepoints should be treated as genuinely different (if in some cases closely related) data.
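Note: As an illustration (not part of this specification), the equivalences above can be observed with Python's standard unicodedata module, which implements the Unicode normalization forms. All three representations normalize to the single precomposed codepoint under NFC:

```python
import unicodedata

# The three encodings of A with ring above from the example:
precomposed = "\u00C5"        # 1) LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"   # 2) A followed by COMBINING RING ABOVE
angstrom = "\u212B"           # 3) ANGSTROM SIGN

# Under NFC, all three become the single precomposed codepoint U+00C5.
for s in (precomposed, decomposed, angstrom):
    assert unicodedata.normalize("NFC", s) == "\u00C5"

# Under NFD, all three become the decomposed representation 2).
for s in (precomposed, decomposed, angstrom):
    assert unicodedata.normalize("NFD", s) == "\u0041\u030A"
```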
The Unicode Standard also from early on defined decomposed normalization, what is now called Normalization Form D (case 2) in the example above). This is very well suited for some kinds of internal processing, but decomposition does not correspond to how data gets converted from legacy encodings and transmitted on the Internet. There, precomposed data (i.e. case 1) in the example above) is prevalent.

Note: This specification uses the term 'codepoint', and not 'character', to make clear that it speaks about what the standards encode, not about what end users think of, which is not always the same.

Encouraged by many factors, such as a requirements analysis by the W3C [Charreq], the Unicode Technical Committee defined Normalization Form C, Canonical Composition (see [UTR15]). Normalization Form C in general produces the same representation as straightforward transcoding from legacy encodings (see Section 3.4 for the known exception). The careful and detailed definition of Normalization Form C is mainly needed to unambiguously define edge cases (base letters with two or more combining characters). Most of these edge cases will turn up extremely rarely in actual data.

The W3C is adopting Normalization Form C in the form of Early Uniform Normalization, which means that it assumes that, in general, data will already be in Normalization Form C [Charmod].

This document recommends that in IETF protocols, canonical equivalents be dealt with by using Early Uniform Normalization according to Unicode Normalization Form C, Canonical Composition [UTR15]. This document describes both Early Uniform Normalization (in Section 2) and Normalization Form C (in Section 3). Section 4 contains an analysis of (mostly theoretical) potential risks to the stability of Normalization Form C.
For reference, Section 5 discusses various cases of equivalences not dealt with by Normalization Form C.

Note: The terms 'normalization' (as in 'Normalization Form C') and 'canonicalization' (as in XML Canonicalization) can mean virtually the same thing. In the context of the topics described in this document, only 'normalization' is used, because 'canonical' is used to distinguish between canonical equivalents and compatibility equivalents.

1.2 Notational Conventions

For UCS codepoints, the notation U+HHHH is used, where HHHH is the hexadecimal representation of the codepoint. This may be followed by the official name of the character in all caps.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this specification are to be interpreted as described in [RFC2119].

2. Early Uniform Normalization

This section gives some guidance on how Normalization Form C (NFC), described in Section 3, should be used by Internet protocols. Each Internet protocol has to define for itself how to use NFC, taking into account its particular needs. However, the advice in this section is intended to help writers of specifications who are not very familiar with text normalization issues, and to make sure that the various protocols use solutions that interface easily with each other.

This section uses various well-known Internet protocols as examples. However, such examples do not imply that the protocol elements mentioned actually accept non-ASCII characters. Depending on the protocol element, that may or may not be the case, and it may change in the future. Also, the examples are not intended to actually define how a specific protocol deals with text normalization issues. That is the responsibility of the specification for each specific protocol.
The basic principle for how to use Normalization Form C is Early Uniform Normalization. This means that, ideally, only text in Normalization Form C appears on the wire on the Internet. This can be seen as applying 'be conservative in what you send' to the problem of text normalization. And (again ideally) it should not be necessary for each implementation of an Internet protocol to separately implement normalization. Text should simply be provided normalized by the underlying infrastructure, e.g. the operating system or the keyboard driver.

Early normalization is of particular importance for those parts of Internet protocols that are used as identifiers. Examples would be URIs, domain names, email addresses, identifier names in PKIX certificates, identifiers in ACAP, file names in FTP, folder names in IMAP, newsgroup names in NNTP, and so on. This is due to the following reasons:

- In order for the protocol to work, it has to be very well defined when two protocol element values match and when they do not.
- Implementations, in particular on the server side, do not in any way have to deal with e.g. the display of multilingual text, but on the other hand have to handle a lot of protocol-specific issues. Such implementations therefore should not be bothered with text normalization.

For free text, e.g. the content of mail messages or news postings, Early Uniform Normalization is somewhat less important, but it definitely improves interoperability.

For protocol elements used as identifiers, this document recommends that Internet protocols specify the following:

- Comparison SHOULD be carried out purely binary (after it has been made sure, where necessary, that the texts to be compared are in the same character encoding).
- Any kind of text, and in particular identifier-like protocol elements, SHOULD be sent normalized to Normalization Form C.
- In case comparison fails due to a difference in text normalization, the originator of the non-normalized text is responsible for the failure.
- In case implementers are aware, or suspect, that their underlying infrastructure produces non-normalized text, they SHOULD take care to do the necessary tests, and if necessary the actual normalization, themselves.
- In the case of the creation of identifiers, in particular if this creation is comparatively infrequent (e.g. newsgroup names, domain names) and happens in a rather centralized manner, explicit checks for normalization SHOULD be required by the protocol specification.

3. Canonical Composition (Normalization Form C)

This section describes Canonical Composition (Normalization Form C, NFC). The normative specification of Canonical Composition is found in [UTR15]. The description here is procedural, but any other procedure that leads to identical results can be used. The result is supposed to be exactly identical to that described by [UTR15]. If any differences are found, [UTR15] must be followed. For each step, various notes are provided to help understand the description and to give implementation hints.

Given a sequence of UCS codepoints, its Canonical Composition can be computed with the following three steps:

1. Decomposition (Section 3.1)
2. Reordering (Section 3.2)
3. Recomposition (Section 3.3)

Additional implementation notes are given in Section 3.4.

3.1 Decomposition

For each UCS codepoint in the input sequence, check whether this codepoint has a canonical decomposition according to the newest version of the Unicode Character Database (field 5 in [UniData]).
If such a decomposition is found, replace the codepoint in the input sequence by the codepoint(s) in the decomposition, and recursively check for and apply decomposition on the first replaced codepoint.

Note: Fields in [UniData] are delimited by ';'. Field 5 in [UniData] is the 6th field when counting with an index origin of 1. Fields starting with a tag delimited by '<' and '>' indicate compatibility decompositions; these compatibility decompositions MUST NOT be used for Normalization Form C.

Note: For Korean Hangul, the decompositions are not contained in [UniData], but have to be generated algorithmically according to the description in [Unicode], Section 3.11.

Note: Some decompositions replace a single codepoint by another single codepoint.

Note: It is not necessary to check replaced codepoints other than the first one, due to the properties of the data in the Unicode Character Database.

Note: It is possible to 'precompile' the decompositions to avoid having to apply them recursively.

3.2 Reordering

For each adjacent pair of UCS codepoints after decomposition, check the combining classes of the UCS codepoints according to the newest version of the Unicode Character Database (field 3 in [UniData]). If the combining class of the first codepoint is higher than the combining class of the second codepoint, and at the same time the combining class of the second codepoint is not zero, then exchange the two codepoints. Repeat this process until no two codepoints can be exchanged anymore.

Note: A combining class greater than zero indicates that a codepoint is a combining mark that participates in reordering. A combining class of zero indicates that a codepoint is not a combining mark, or that it is a combining mark that is not affected by reordering. There are no combining classes below zero.
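Note: As an illustration (not part of this specification), the exchange rule above can be sketched as follows in Python, using the stdlib unicodedata module to look up combining classes:

```python
import unicodedata

def canonical_reorder(codepoints):
    """Sketch of the reordering step: exchange adjacent codepoints while
    the first has a higher combining class than the second and the
    second's combining class is not zero."""
    cps = list(codepoints)
    changed = True
    while changed:
        changed = False
        for i in range(len(cps) - 1):
            c1 = unicodedata.combining(cps[i])
            c2 = unicodedata.combining(cps[i + 1])
            if c2 != 0 and c1 > c2:
                cps[i], cps[i + 1] = cps[i + 1], cps[i]
                changed = True
    return "".join(cps)

# RING ABOVE (class 230) and PLUS SIGN BELOW (class 220) are exchanged:
assert canonical_reorder("\u0041\u030A\u031F") == "\u0041\u031F\u030A"
# Marks with the same combining class keep their relative order:
assert canonical_reorder("\u0041\u0323\u0323") == "\u0041\u0323\u0323"
```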
Note: Besides a few script-specific combining classes, combining classes mainly distinguish whether a combining mark is attached to the base letter or just placed near it, and on which side of the base letter (e.g. bottom, above right, ...) the combining mark is attached or placed. Reordering ensures that combining marks placed on different sides of the same character are placed in a canonical order (because any order would look the same visually), while combining marks placed on the same side of a character are not reordered (because reordering them would change the combination they represent).

Note: After completing this step, the sequence of UCS codepoints is in Canonical Decomposition (Normalization Form D).

3.3 Recomposition

This section describes recomposition in a top-down manner, first describing recomposition processing in general (Section 3.3.1), then describing which pairs of codepoints can be canonically combined (Section 3.3.2), and then describing the combination exclusions (Section 3.3.3).

3.3.1 Recomposition Processing

Process the sequence of UCS codepoints resulting from Reordering from start to end. This process requires a state variable called 'initial'. At the beginning of the process, the value of 'initial' is empty.

For each codepoint in the sequence resulting from Reordering, do the following:
- If the following three conditions all apply:
  - 'initial' has a value
  - the codepoint immediately preceding the current codepoint is this 'initial' or has a combining class not equal to the combining class of the current codepoint
  - the 'initial' can be canonically recombined (see Section 3.3.2) with the current codepoint
  then replace the 'initial' with the canonical recombination and remove the current codepoint.
- Otherwise, if the current codepoint has combining class zero, store its value in 'initial'.
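Note: As an illustration (not part of this specification), the pairwise "canonically recombined" test can be sketched in Python. The helper below is hypothetical; rather than reading field 5 of [UniData] directly, it uses the stdlib's NFC implementation as an oracle for the pairwise composition table (which already honors the exclusions):

```python
import unicodedata

def compose_pair(a, b):
    """Return the canonical recombination of codepoints a and b, or None
    if they do not canonically combine. Hypothetical helper that uses
    NFC as an oracle instead of reading the Unicode Character Database."""
    c = unicodedata.normalize("NFC", a + b)
    if len(c) == 1 and unicodedata.normalize("NFD", c) == a + b:
        return c
    return None

assert compose_pair("\u0041", "\u030A") == "\u00C5"  # A + ring above
assert compose_pair("\u0041", "\u0301") == "\u00C1"  # A + acute accent
assert compose_pair("\u0041", "\u031F") is None      # no precomposed form
```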
Note: At the beginning of recomposition, there is no 'initial'. An 'initial' is remembered as soon as the first codepoint with a combining class of zero is found. Not every codepoint with a combining class of zero becomes an 'initial'; the exceptions are those that are the second codepoint in a recomposition. The 'initial' as used in this description is slightly different from the 'starter' as defined in [UTR15], but this does not affect the result.

Note: Checking that the previous codepoint has a combining class smaller than the combining class of the current codepoint (except if the previous codepoint is the 'initial' and therefore has a combining class of zero) ensures that the conditions used for reordering are maintained in the recombination step.

Note: Other algorithms for recomposition have been considered, but this algorithm has been chosen because it provides a very good balance between computational and implementation complexity and the 'power' of recombination. As an example, assume a text contains a U+0041 LATIN CAPITAL LETTER A with a U+030A COMBINING RING ABOVE and a U+031F COMBINING PLUS SIGN BELOW. Because canonical reordering puts the COMBINING PLUS SIGN BELOW before the COMBINING RING ABOVE, a more straightforward algorithm would not be able to recombine this to U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE followed by U+031F COMBINING PLUS SIGN BELOW.

3.3.2 Pairs of Codepoints that can be Canonically Recombined

A pair of codepoints can be canonically recombined into a third codepoint if this third codepoint has a canonical decomposition into the sequence of the two codepoints (see [UniData], field 5) and this canonical decomposition is not excluded from recombination.
For Korean Hangul, the recompositions are not contained in [UniData], but have to be generated algorithmically according to the description in [Unicode], Section 3.11.

3.3.3 Combination Exclusions

The exclusions from recombination are defined as follows:

1) Singletons: Codepoints that have a canonical decomposition into a single other codepoint (example: U+212B ANGSTROM SIGN).
2) Non-starters: Codepoints with a decomposition starting with a codepoint of a combining class other than zero (example: U+0F75 TIBETAN VOWEL SIGN UU).
3) Post-Unicode 3.0: Codepoints with a decomposition introduced after Unicode 3.0 (no applicable example).
4) Script-specific: Precomposed codepoints that are not the generally preferred form for their script (example: U+0959 DEVANAGARI LETTER KHHA).

The lists of codepoints for 1) and 2) can be produced directly from the Unicode Character Database [UniData]. The list of codepoints for 3) can be produced from a comparison between the 3.0.0 version and the latest version of [UniData], but this may be difficult. The list of codepoints for 4) cannot be computed. For 3) and 4), the lists provided in [CompExcl] MUST be used. [CompExcl] also provides lists for 1) and 2) for cross-checking. The list for 3) is currently empty because there are at the moment no post-Unicode 3.0 codepoints with decompositions.

Note: The exclusion of singletons is necessary because in a pair of canonically equivalent codepoints, the canonical decomposition points from the 'less desirable' codepoint to the preferred codepoint. In this case, both canonical decomposition and canonical composition have the same preference.

Note: For a discussion of the exclusion of post-Unicode 3.0 codepoints from recombination, please see Section 4 on versioning issues.
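Note: As an illustration (not part of this specification), the recomposition behavior and two of the exclusion classes can be observed with Python's stdlib unicodedata module, which implements these rules:

```python
import unicodedata

# Recombination across an intervening mark: A + RING ABOVE (class 230)
# + PLUS SIGN BELOW (class 220). Reordering puts the plus sign first,
# yet NFC still recombines A with the ring above, because the
# intervening mark has a different (lower) combining class.
s = "\u0041\u030A\u031F"
assert unicodedata.normalize("NFD", s) == "\u0041\u031F\u030A"
assert unicodedata.normalize("NFC", s) == "\u00C5\u031F"

# Exclusion 1), singleton: ANGSTROM SIGN normalizes to the preferred
# codepoint U+00C5 and is never produced by recomposition.
assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"

# Exclusion 4), script-specific: DEVANAGARI LETTER KHHA stays
# decomposed (KHA + NUKTA) even under NFC.
assert unicodedata.normalize("NFC", "\u0959") == "\u0916\u093C"
```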
3.4 Implementation Notes

This section contains various notes on potential implementation issues, improvements, and shortcuts. Further notes on implementation may be found in [UTR15] or in newer versions of that document.

3.4.1 Avoiding Decomposition, and Checking for Normalization Form C

It is not always necessary to decompose and recompose. In particular, any sequence that does not contain any of the following is already in Normalization Form C:

- Codepoints that are excluded from recomposition (see Section 3.3.3)
- Codepoints that appear in second position in a canonical recomposition
- Hangul Jamo codepoints (U+1100-U+11F9)
- Unassigned codepoints

If a contiguous part of a sequence satisfies the above criterion, all but the last of its codepoints are already in Normalization Form C.

The above criteria can also be used to easily check that some data is already in Normalization Form C. However, this check will reject some cases that actually are normalized.

3.4.2 Unassigned Codepoints

Unassigned codepoints (codepoints that are not assigned in the current version of Unicode) are listed above to avoid claiming that something is in Normalization Form C when it may indeed not be, but they usually will be treated differently from the others. The following behaviours may be possible, depending on the context of normalization:

- Stop the normalization process with a fatal error. (This should be done only in very exceptional circumstances. It would mean that the implementation will die on data that conforms to a future version of Unicode.)
- Produce some warning that such codepoints have been seen, for further checking.
- Just copy the unassigned codepoint from the input to the output, running the risk of not normalizing completely.
- Check via the Internet that the program-internal data is up to date.
- Distinguish behaviour depending on the range of codepoints in which the unassigned codepoint has been found.

3.4.3 Surrogates

When implementing normalization for sequences of UCS codepoints represented as UTF-16 code units, care has to be taken that pairs of surrogate code units that represent a single UCS codepoint are treated appropriately.

3.4.4 Korean Hangul

There are no interactions between the normalization of Korean Hangul and the other normalizations. These two parts of normalization can therefore be carried out separately, with different implementation improvements.

3.4.5 Piecewise Application

The various steps, such as decomposition, reordering, and recomposition, can be applied to appropriately chosen parts of a codepoint sequence. As an example, when normalizing a large file, normalization can be done on each line separately, because line endings and normalization do not interact.

3.4.6 Integrating Decomposition and Recomposition

It is possible to avoid full decomposition by noting that decomposition of a codepoint that is not in the exclusion list can be avoided if it is not followed by a codepoint that can appear in second position in a canonical recomposition. This condition can be strengthened by noting that decomposition is not necessary if the combining class of the following codepoint is higher than the highest combining class obtained from decomposing the character in question. In other cases, a decomposition followed immediately by a recomposition can be precalculated. Further details are left to the reader.

3.4.7 Decomposition

Recursive application of decomposition can be avoided by a preprocessing step that calculates a full canonical decomposition for each character with a canonical decomposition.

3.4.8 Reordering

The reordering step is basically a sorting problem.
Because the number of consecutive combining marks (i.e. consecutive codepoints with combining class greater than zero) is usually extremely small, a very simple sorting algorithm can be used, e.g. a straightforward bubble sort.

Because reordering will occur extremely locally, the following variant of bubble sort will lead to a fast and simple implementation:

- Start by checking the first pair (i.e. the first two codepoints).
- If there is an exchange, and we are not at the start of the sequence, move back by one codepoint and check again.
- Otherwise (i.e. if there is no exchange, or we are at the start of the sequence), and we are not at the end of the sequence, move forward by one codepoint and check again.
- If we are at the end of the sequence, and there has been no exchange for the last pair, then we are done.

3.4.9 Conversion from Legacy Encodings

Normalization Form C is designed so that in almost all cases, one-to-one conversion from legacy encodings (e.g. iso-8859-1, ...) to the UCS will produce a result that is already in Normalization Form C.

The one known exception is the Windows character encoding for Vietnamese (charset=windows-1258 [windows-1258]). This character encoding uses a kind of 'half-precomposed' encoding, whereas Normalization Form C uses full precomposition for the characters needed for Vietnamese. As an example, U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW is encoded as U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX followed by U+0323 COMBINING DOT BELOW in windows-1258, but U+1EAD is the normalized form.

3.4.10 Uses of UCS in Non-Normalized Form

One known case where the UCS is used in a way that is not in Normalization Form C is a group of users using the UCS for Yiddish. The few combinations of Hebrew base letters and diacritics used to write Yiddish are available precomposed in the UCS (example: U+FB2F HEBREW LETTER ALEF WITH QAMATS).
On the other hand, the many combinations used in writing the Hebrew language are only available by using combining characters.

In order to lead to a uniform model of encoding Hebrew, the precomposed Hebrew codepoints were excluded from recombination. This means that Yiddish using precomposed codepoints is not in Normalization Form C.

3.4.11 Interaction with Text Processing

There are many operations on text strings that can create non-normalized output even if the input was normalized. Examples are concatenation (if the second string starts with one of the characters discussed in Section 3.4.1) or case changes (as an example, U+1E98 LATIN SMALL LETTER W WITH RING ABOVE does not have a precomposed capital equivalent).

3.4.12 Implementations and Test Suites

Implementation examples can be found at [Charlint] (Perl), [ICU] (C/C++), and [Normalizer] (Java).

A huge file with test cases for normalization is available as part of Unicode 3.0.1 [NormTest].

4. Stability and Versioning

Defining a normalization form for Internet-wide use requires that this normalization form stay as stable as possible. Stability for Normalization Form C is mainly achieved by introducing a cutoff version. For precomposed characters encoded up to and including this version, in principle the precomposed version is the normal form, but precomposed codepoints introduced after the cutoff version are decomposed in Normalization Form C.

As the cutoff version, Version 3.0 of Unicode and the second edition of ISO/IEC 10646-1 have been chosen. These are aligned codepoint-by-codepoint. They are both widely and integrally available, i.e. they do not require the application of updates or amendments.

The rest of this section discusses potential threats to the stability of Normalization Form C, the probability of such threats, and how to avoid them.
[UniPolicy] documents policies adopted by the Unicode Consortium to
limit the impact of changes on existing implementations.

The analysis below shows that the probability of the various threats
is extremely low. The analysis is provided here to document the
awareness of these threats and the measures that have to be taken to
avoid them. This section is only of marginal importance to an
implementer of Normalization Form C or to an author of an Internet
protocol specification.

4.1 New Precomposed Codepoints

The introduction of new (post-Unicode 3.0) precomposed codepoints is
not a threat to the stability of Normalization Form C. Such codepoints
would just provide an alternate way of encoding characters that can
already be encoded without them, by using a decomposed form. The
normalization algorithm already provides for the exclusion of such
characters from recomposition.

While Normalization Form C itself is not affected, such new codepoints
would affect implementations of Normalization Form C, because such
implementations have to be updated to correctly decompose the new
codepoints.

Note: While a new codepoint may be correctly normalized only by
updated implementations, once it is normalized, neither older nor
updated implementations will change anything anymore.

Because the new codepoints would not actually encode any new
characters that could not be encoded before, because they would not
actually be used due to Early Uniform Normalization, and because of
the above implementation problems, encoding new precomposed characters
is superfluous and should be very clearly avoided.

4.2 New Combining Marks

It is in theory possible that a new combining mark would be encoded
that is intended to represent decomposable pieces of already existing
encoded characters.
Should this indeed happen, problems for Normalization Form C can be
avoided by making sure the precomposed character that now has a
decomposition is not included in the list of recomposition exclusions.
While this helps for Normalization Form C, adding a canonical
decomposition would affect other normalization forms, and it is
therefore highly unlikely that such a canonical decomposition will
ever be added in the first place.

In case new combining marks are encoded for new scripts, or in case a
combining mark is introduced that does not appear in any precomposed
character yet, the appropriate normalization for these characters can
easily be defined by providing the appropriate data. However,
hopefully no new encoding ambiguities will be introduced for new
scripts.

4.3 Changed Codepoints

A major threat to the stability of Normalization Form C would come
from changes to ISO/IEC 10646/Unicode itself, i.e. from moving around
characters or redefining codepoints, or from ISO/IEC 10646 and Unicode
evolving differently in the future. These threats are not specific to
Normalization Form C, but relevant for the use of the UCS in general;
they are mentioned here for completeness.

Because of the very wide and increasing use of the UCS throughout the
world, the resistance to any changes of defined codepoints or to any
divergence between ISO/IEC 10646 and Unicode is extremely strong.
Awareness of the need for stability in this point, as well as in
others, is particularly high due to the experiences with some changes
in the early history of these standards, in particular the reencoding
of some Korean Hangul characters in ISO/IEC 10646 Amendment 5 (and the
corresponding change in Unicode). For the IETF in particular, the
wording in [RFC2279] and [RFC2781] stresses the importance of
stability in this respect.

5. Cases not dealt with by Canonical Equivalence

This section gives a list of cases that are not dealt with by
Canonical Equivalence and Normalization Form C. This is done to help
the reader understand Normalization Form C and its limits. The list in
this section contains many cases of widely varying nature. In many
cases, a viewer, if familiar with the script in question, will be able
to distinguish the various variants.

Internet protocols can deal in various ways with the cases below. One
way is to limit the characters allowed, e.g. in an identifier, so that
all but one of the variants are disallowed. Another way is to assume
that the user can make the distinction him/herself. Yet another is to
understand that some characters or combinations of characters that
would lead to confusion are very difficult to actually enter on any
keyboard; it may therefore not really be worth excluding them
explicitly.

- Various ligatures (Latin, Arabic, e.g. U+FB01 LATIN SMALL LIGATURE
  FI vs. U+0066 LATIN SMALL LETTER F followed by U+0069 LATIN SMALL
  LETTER I)

- Croatian digraphs (e.g. U+01C8 LATIN CAPITAL LETTER L WITH SMALL
  LETTER J vs. U+004C LATIN CAPITAL LETTER L followed by U+006A LATIN
  SMALL LETTER J)

- Full-width Latin compatibility variants (e.g. U+FF21 FULLWIDTH
  LATIN CAPITAL LETTER A vs. U+0041 LATIN CAPITAL LETTER A)

- Half-width Kana and Hangul compatibility variants (e.g. U+FF76
  HALFWIDTH KATAKANA LETTER KA vs. U+30AB KATAKANA LETTER KA)

- Vertical compatibility variants (e.g. U+FE35 PRESENTATION FORM FOR
  VERTICAL LEFT PARENTHESIS vs. U+0028 LEFT PARENTHESIS)

- Superscript/subscript variants (numbers and IPA, e.g. U+00B2
  SUPERSCRIPT TWO)

- Small form compatibility variants (e.g. U+FE6A SMALL PERCENT SIGN)

- Enclosed/encircled alphanumerics, Kana, Hangul,... (e.g. U+2460
  CIRCLED DIGIT ONE)

- Letterlike symbols, Roman numerals,... (e.g. U+210E PLANCK CONSTANT
  vs. U+0068 LATIN SMALL LETTER H)

- Squared Katakana and Latin abbreviations (units,..., e.g. U+334C
  SQUARE MEGATON)

- Hangul jamo representation alternatives for historical Hangul

- Presence or absence of joiner/non-joiner and other control
  characters

- Upper case/lower case distinction

- Distinction between Katakana and Hiragana

- Similar letters from different scripts
  (e.g. "A" in Latin, Greek, and Cyrillic)

- CJK ideograph variants (glyph variants introduced due to the
  source separation rule, simplifications)

- Various punctuation variants (apostrophes, middle dots,
  spaces,...)

- Ignorable whitespace, hyphens,...

- Ignorable accents,...

Many of the cases above are identified as compatibility equivalences
in the Unicode database. [UTR15] defines Normalization Forms KC and
KD to normalize compatibility equivalences. It may look attractive
to just use Normalization Form KC instead of Normalization Form C for
Internet protocols. However, while Canonical Equivalence, which forms
the basis of Normalization Form C, deals with a very small number of
very well defined cases of complete equivalence (from a user point
of view), Compatibility Equivalence comprises a very wide range of
cases that usually have to be examined one at a time. If the domain
of acceptable characters is suitably limited, such as for program
identifiers, then NFKC may be a suitable normalization form.

6. Security Considerations

Security problems can result from:

- Improper implementations of normalization. For example, in
  certificate chaining, if the program validating a certificate chain
  mis-implements normalization rules, an attacker might be able to
  spoof an identity by picking a name that the validator thinks is
  equivalent to another name.

- The fact that normalization maps several input sequences to the
  same output sequence. If a digital signature calculation includes
  normalization, this can make it slightly easier to find a fake
  document that has the same digest as a real one.

- The use of normalization in only part of the applications involved.
  In particular, if software used for security purposes, e.g. to
  create and check digital signatures, normalizes data, but the
  applications actually using the data do not normalize, it can be
  very easy to create a fake document that claims to be the real one
  but produces different behavior.

- Different behavior in programs that do not respect canonical
  equivalence.

Security-related applications therefore MAY check for normalized
input, but MUST NOT actually apply normalization unless it can be
guaranteed that all related applications also apply normalization.

Acknowledgements

The earliest version of this Internet Draft, which dealt with quite
similar issues, was entitled "Normalization of Internationalized
Identifiers" and was submitted in July 1997 by the first author while
he was at the University of Zurich. It benefited from ideas, advice,
criticism and help from: Mark Davis, Larry Masinter, Michael Kung,
Edward Cherlin, Alain LaBonte, Francois Yergeau, and others.

For the current version, the authors were encouraged in particular by
Patrik Faltstrom and Paul Hoffman. The discussion of potential
stability threats is based on contributions by John Cowan and Kenneth
Whistler. Some security threats were pointed out by Masahiro
Sekiguchi. Further contributions are due to Dan Oscarson.

References

[Charlint]     Martin Duerst. Charlint - A Character Normalization
               Tool.

[Charreq]      Martin J. Duerst, Ed. Requirements for String
               Identity Matching and String Indexing. World Wide
               Web Consortium Working Draft.

[Charmod]      Martin J. Duerst and Francois Yergeau, Eds.
               Character Model for the World Wide Web. World Wide
               Web Consortium Working Draft.

[CompExcl]     The Unicode Consortium. Composition Exclusions.

[ICU]          International Components for Unicode.

[ISO10646]     ISO/IEC 10646-1:2000. International Standard --
               Information technology -- Universal multiple-octet
               coded character set (UCS) -- Part 1: Architecture
               and basic multilingual plane, and its Amendments.

[Normalizer]   The Unicode Consortium. Normalization Demo.

[NormTest]     Mark Davis. Unicode Normalization Test Suite.

[RFC2119]      Scott Bradner. Key words for use in RFCs to
               Indicate Requirement Levels, March 1997.

[RFC2277]      Harald Alvestrand. IETF Policy on Character Sets and
               Languages, January 1998.

[RFC2279]      Francois Yergeau. UTF-8, a transformation format of
               ISO 10646, January 1998.

[RFC2781]      Paul Hoffman and Francois Yergeau. UTF-16, an
               encoding of ISO 10646, February 2000.

[Unicode]      The Unicode Consortium. The Unicode Standard,
               Version 3.0. Reading, MA, Addison-Wesley Developers
               Press, 2000. ISBN 0-201-61633-5.

[UniData]      The Unicode Consortium. UnicodeData File.

[UniPolicy]    The Unicode Consortium. Unicode Consortium Policies.

[UTR15]        Mark Davis and Martin Duerst. Unicode Normalization
               Forms. Unicode Technical Report #15, Version 18.0;
               also on the CD of [Unicode].

[windows-1258] Microsoft Windows Codepage: 1258 (Viet Nam).
               <http://www.microsoft.com/globaldev/reference/sbcs/1258.htm>

Copyright

Copyright (C) The Internet Society, 2000. All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Authors' Addresses

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
mailto:duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: Please write "Duerst" with u-umlaut wherever
possible, e.g. as "D&#252;rst" in HTML and XML.

Mark E. Davis
IBM Center for Java Technology
10275 North De Anza Boulevard
Cupertino, CA 95014
U.S.A.
mailto:mark.davis@us.ibm.com
http://www.macchiato.com
Tel: +1 (408) 777-5850
Fax: +1 (408) 777-5891