

Internet Draft                                               M. Duerst
<draft-duerst-i18n-norm-00.txt>                   University of Zurich
Expires in six months                                        July 1997

             Normalization of Internationalized Identifiers

Status of this Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months. Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time.  It is not appropriate to use
   Internet-Drafts as reference material or to cite them other than as
   a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited.  Please send comments to
   the author at <mduerst@ifi.unizh.ch> or to the uri mailing list at
   uri@bunyip.com. This document is currently a very early draft,
   intended to stimulate discussion only.  It is intended to become
   part of a suite of documents related to the internationalization of
   URLs.

Abstract

   The Universal Character Set (UCS) makes it possible to extend the
   repertoire of characters used in non-local identifiers beyond
   US-ASCII. The UCS contains a large overall number of characters,
   many codepoints for backwards compatibility, and various mechanisms
   to cope with the features of the writing systems of the world.
   Together, these can lead to ambiguities in representation.  Such
   ambiguities are not a problem when representing running text, so
   existing standards have only defined equivalences.  For use in
   identifiers, which are compared using their binary representation,
   this is not sufficient.  This document defines a normalization
   algorithm and gives usage guidelines to avoid such ambiguities.

Table of contents

   1. Introduction
     1.1 Motivation
     1.2 List of Potential Ambiguities
     1.3 Categories
       1.3.1 Category Overview
       1.3.2 Category List
     1.4 Applicability and Conformance
     1.5 Notation
   2. Normalization Rules
     2.1 Normalization of Combining Sequences
     2.2 Hangul Jamo Normalization
     2.3 Arabic Ligature and Presentation Form Normalization
   3. Forbidden Characters and Character Combinations
   4. Dangerous Characters and Character Combinations
   5. Discouraged Characters and Character Combinations
     5.1 Similar Letters in Different Alphabets
   6. No Normalization nor Restriction
     6.1 Case Folding
   Acknowledgements
   Bibliography
   Author's Address

1. Introduction

1.1 Motivation

   Many kinds of identifiers are in use for identifying resources in
   networks. Locally, identifiers can often contain characters from
   many languages and scripts, but as long as different encodings
   exist for the same characters, such identifiers cannot be used
   across a wider network. Network identifiers have therefore been
   limited to a very restricted character repertoire, usually a subset
   of US-ASCII.

   With the definition of the Universal Character Set (UCS) [ISO10646]
   [Unicode2], it becomes possible to extend the character repertoire
   of such identifiers. In some cases this has already been done, for
   example in Java and for URNs [URN-Syntax]; other cases are under
   study.  Identifiers for resources of full worldwide interest should
   continue to be limited to a very restricted set of the most widely
   known characters. Names for resources used mainly in a
   language-local or script-local context, however, may provide
   significant additional user convenience if they can draw on a wider
   character repertoire.

   The UCS contains a large overall number of characters, many
   codepoints for backwards compatibility, and various mechanisms that
   allow it to cope with the features of the writing systems of the
   world. These all lead to ambiguities. In some cases the ambiguities
   can be resolved by careful display, printing, and examination by
   the reader; in other cases they are intended to be unnoticeable by
   the reader.  Systems that process running text can deal with such
   ambiguities by using various kinds of equivalences and
   normalizations, which may differ by implementation.

   Identifier processing software, however, usually compares binary
   representations to establish that two identifiers are identical. In
   some cases, additional processing is done to account for variations
   specific to the identifier syntax. Upgrading all such software to
   take the equivalences and ambiguities of the UCS into account would
   be extremely tedious. For some classes of identifiers it is
   impossible, because their binary representation is transparent: it
   may allow legacy character encodings to be used besides a character
   encoding based on the UCS, and/or it may allow arbitrary binary
   data to be contained in identifiers.

   To facilitate the use of identifiers containing characters from the
   UCS, this document therefore intends to develop clear
   specifications for a normalization algorithm that removes basic
   ambiguities, and guidelines for the use of characters with
   potential ambiguity.

   A key design goal of the algorithm is that, for most identifiers in
   current use, applying the algorithm results in the identity
   transform (i.e. the identifier is already normalized). This allows
   existing identifiers to remain in use, and internationalized
   identifiers to be introduced in new settings, even before all the
   details of the normalization algorithm have been agreed upon.

   Other goals in designing the algorithms and rules have been as
   follows:

   -  Avoid bad surprises for users who cannot understand why two
      identifiers that look exactly the same do not match.  The user in
      this case is an average user without any specific knowledge of
      character encoding, but with a basic dose of "computer literacy"
      (e.g. knowing that 0 and O have distinct keys on a keyboard).

   -  Restrict normalization to cases where it is really necessary;
      cover remaining ambiguities by guidelines.

   -  Define normalization so that it can be implemented using widely
      accessible documentation.

   -  Take measures for the best possible compatibility with future
      additions to the UCS.

   There are some issues that this document does not currently
   address, in particular bidirectionality. It is not yet clear
   whether this will be included in this document or treated
   separately.

1.2 List of Potential Ambiguities

   To give an idea of the extent of the problem, this section lists
   potential character ambiguities, roughly ordered so that the cases
   that are more difficult to distinguish come first. The difficulty
   of distinguishing certain characters or combinations may depend
   greatly on context.

   -  Precomposed/decomposed diacritic character representation

   -  Hangul jamo vs. johab and jamo representation alternatives

   -  CJK compatibility ideographs

   -  Other backwards compatibility duplicated characters

   -  Separately coded Indic length/AI/AU marks

   -  Glyphs for vertical variants

   -  Croatian digraphs, other ligatures (Latin, Arabic,...)

   -  Various variant punctuation (apostrophes, middle dots, spaces,...)

   -  Half-width/full-width characters (Latin, Katakana and Hangul)

   -  Vertical variants (U+FE30...)

   -  Presence or absence of joiner/non-joiner

   -  Superscript/subscript variants (numbers and IPA)

   -  Small form variants (U+FE50...)

   -  Upper case/lower case

   -  Similar letters from different scripts (varying degrees) (e.g. "A"
      in Latin, Greek, and Cyrillic)

   -  Letterlike symbols, Roman numerals (varying degrees)

   -  Enclosed alphanumerics, katakana, hangul,...

   -  Squared katakana (units,...), squared Latin abbreviations,...

   -  CJK ideograph variants (varying degrees, in particular general
      simplifications, backwards-compatibility non-unifications, JIS
      78/83 problems)

   -  Ignorable whitespace, hyphens,... (sorting)

   -  Ignorable accents,... (sorting)
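
   The first item in the list can be illustrated with a short sketch.
   It assumes UTF-8 as the UCS-based encoding and uses the "NFD" form
   of Python's unicodedata module as a stand-in for canonical
   decomposition; neither is prescribed by this document.

```python
import unicodedata

# Two spellings of an identifier that looks the same on screen.
precomposed = "caf\u00e9"   # ends in U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # ends in "e" + U+0301 COMBINING ACUTE ACCENT

# UTF-8 is an assumption; any UCS-based encoding shows the same effect.
bytes_a = precomposed.encode("utf-8")
bytes_b = decomposed.encode("utf-8")

# A byte-wise comparison treats the two identifiers as different ...
binary_equal = bytes_a == bytes_b

# ... although they are canonically equivalent once both are decomposed.
canonically_equal = (
    unicodedata.normalize("NFD", precomposed)
    == unicodedata.normalize("NFD", decomposed)
)

print(binary_equal, canonically_equal)
```

   Software that compares only the encoded bytes sees two distinct
   identifiers here, which is the situation the normalization rules in
   this document aim to avoid.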

1.3 Categories

1.3.1 Category Overview

   This specification distinguishes various categories of ambiguous
   characters or strings. For each category, it will list or describe:

   -  The characters and character combinations in the category

   -  The context, if necessary

   -  The nature of the ambiguity

   -  The necessary actions or recommendations

1.3.2 Category List

   The following categories are currently under investigation:

   -  Normalized: Characters and character combinations in this category
      are not allowed in identifiers; they MUST be converted to a
      normalized form. Examples include characters with strong
      equivalences.

   -  Forbidden: Characters and character combinations in this category
      are not allowed at all in identifiers; identifiers containing them
      are illegal. Examples include characters that cause problems to
      software, such as control characters, and cases that need
      normalization but where normalization is too difficult to specify
      algorithmically.

   -  Dangerous: Characters and character combinations in this category
      are seriously advised against. Software would usually alert a user
      of an attempt to use such a character, but not force the user to
      remove it.

   -  Discouraged: Characters and character combinations in this
      category are advised against, but not so strongly as to
      necessitate an alert.
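
   One way software might represent these categories is sketched
   below. The names Category, ACTIONS, and action_for are hypothetical
   and chosen for illustration only; the action strings paraphrase the
   list above.

```python
from enum import Enum


class Category(Enum):
    """The four categories currently under investigation."""
    NORMALIZED = "normalized"
    FORBIDDEN = "forbidden"
    DANGEROUS = "dangerous"
    DISCOURAGED = "discouraged"


# What software is expected to do for each category (paraphrased).
ACTIONS = {
    Category.NORMALIZED: "convert to the normalized form",
    Category.FORBIDDEN: "reject the identifier",
    Category.DANGEROUS: "alert the user, but allow the identifier",
    Category.DISCOURAGED: "allow without an alert",
}


def action_for(category: Category) -> str:
    """Return the expected handling for a category."""
    return ACTIONS[category]
```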

1.4 Applicability and Conformance

   Where identifiers are used just to transmit data from one point to
   another, e.g. in the case of the query component of a URL resulting
   from a FORM reply, there is no need to apply the normalization rules
   and guidelines defined in this document.

   Identifiers containing a wide range of characters should be used
   with care and only for an audience that is understood to be able to
   transcribe them without problems.

1.5 Notation

   Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
   hexadecimal representation, according to [Unicode2].

   Ranges of characters are expressed as U+XXXX-U+YYYY. A block of
   characters may also be identified by its first codepoint, followed
   by "...".  Official ISO character names are given in all upper case.
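
   As a small illustration, the notation can be produced mechanically.
   The helper names below are hypothetical; the character names come
   from Python's unicodedata module, which reflects a later version of
   the character database than [Unicode2].

```python
import unicodedata


def ucs_notation(ch: str) -> str:
    """Format one character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def range_notation(first: str, last: str) -> str:
    """Format a range of characters as U+XXXX-U+YYYY."""
    return f"{ucs_notation(first)}-{ucs_notation(last)}"


def block_notation(first: str) -> str:
    """Identify a block by its first codepoint followed by '...'."""
    return ucs_notation(first) + "..."


print(ucs_notation("\u212a"), unicodedata.name("\u212a"))
print(range_notation("\u1100", "\u11ff"))
print(block_notation("\ufe30"))
```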

2. Normalization Rules

   This chapter defines several normalization algorithms.  They deal
   with different kinds of phenomena, or with different scripts. They
   are defined so that the order of their application does not change
   the normalization result; each algorithm has to be applied at least
   once. Applying an algorithm a second time does not change the
   result any further.

   The algorithms are to a certain extent written in a procedural
   fashion. This does not imply that an implementation has to follow
   each step. The only thing that is relevant is whether an
   implementation produces the same outputs on the same inputs for all
   possible inputs, i.e. for arbitrary strings of any length. An
   implementation may also combine the various algorithms into a
   single one if the result is the same as applying each of the
   algorithms at least once.

2.1 Normalization of Combining Sequences

   The UCS contains a general mechanism for encoding diacritic
   combinations from base letters and modifying diacritics, and it
   also contains many such combinations as precomposed codepoints.

   The following algorithm normalizes such combinations:

   Step 1: Starting from the beginning of the identifier, find a maximal
   sequence of a base character (possibly decomposable) followed by
   modifying letters.

   Step 2: Fully decompose the sequence found in Step 1, using all
   canonical decompositions defined in [Unicode2] and all canonical
   decompositions defined for future additions to the UCS.

   Step 3: Sort the sequence of modifying letters found in Step 2
   according to the canonical ordering algorithm of Section 3.9 of
   [Unicode2].

   Step 4: If the base character is a Hebrew character, go to Step 6.

   Step 5: Try to recombine as much as possible of the sequence
   resulting from Step 3 into a precomposed character by finding the
   longest initial match with any canonical decomposition sequence
   defined in [Unicode2], ignoring decomposition sequences of length 1.

   Step 6: Use the result obtained so far as output and continue with
   Step 1.

        NOTE -- In Step 5, the decomposition sequences in
        [Unicode2] have to be recursively expanded for each character
        (except for decomposition sequences of length 1) before
        application. Otherwise, a character such as U+1E1C, LATIN
        CAPITAL LETTER E WITH CEDILLA AND BREVE, will not be
        recomposed correctly.

        NOTE -- In Step 5, canonical decompositions defined for
        future additions to the UCS are explicitly not considered.
        This is done to ease forwards compatibility. It is assumed
        that systems knowing about newly defined precompositions
        will be able to decompose them correctly in Step 2, but
        that it would be hard to change identifiers on older
        systems using a decomposed representation.

        NOTE -- Maybe we have to define additions to the canonical
        equivalences, and/or to add more exceptions such as Hebrew.

        NOTE -- A different definition of Step 5 may lead to
        shorter normalizations for some identifiers. The current
        definition was chosen for simplicity and implementation
        speed.  (This may be subject to discussion, in particular
        if somebody has an implementation and is ready to share the
        code.)

        NOTE -- The above algorithm can be sped up by shortcuts, in
        particular by noting that most precomposed characters which
        are not followed by modifying letters are already
        normalized.

        NOTE -- The exception for "precomposed letters that have a
        decomposition sequence of length 1" in Step 5 is necessary
        to avoid e.g. the letter "K" being "aggregated" to "KELVIN
        SIGN" U+212A.
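
   The steps above can be sketched in code as follows. This is a
   minimal illustration under stated assumptions, not a reference
   implementation. It relies on Python's unicodedata module, whose
   data comes from a much later Unicode version than [Unicode2], and
   it uses the later "NFD" form to carry out Steps 2 and 3. The
   function and table names are hypothetical. "Modifying letters" are
   approximated by general category M, "Hebrew" by the block
   U+0590-U+05FF, and only the Basic Multilingual Plane is scanned.
   Precomposed Hangul syllables are put in the table so that they
   survive unchanged; conjoining jamo are left to Section 2.2.

```python
import unicodedata


def _build_recomposition_table():
    """Map fully expanded canonical decompositions to precomposed characters."""
    table = {}
    for cp in range(0x10000):  # assumption: Basic Multilingual Plane only
        ch = chr(cp)
        if 0xAC00 <= cp <= 0xD7A3:
            # Hangul syllables decompose algorithmically; keep them stable.
            table[unicodedata.normalize("NFD", ch)] = ch
            continue
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility decomposition
        if len(decomp.split()) < 2:
            continue  # ignore decompositions of length 1 (e.g. KELVIN SIGN)
        # NFD yields the recursively expanded decomposition.
        table.setdefault(unicodedata.normalize("NFD", ch), ch)
    return table


RECOMPOSE = _build_recomposition_table()


def _is_mark(ch):
    return unicodedata.category(ch).startswith("M")


def _is_hebrew(ch):
    return 0x0590 <= ord(ch) <= 0x05FF


def normalize_combining(identifier):
    out = []
    i = 0
    while i < len(identifier):
        # Step 1: a base character followed by as many marks as possible.
        j = i + 1
        while j < len(identifier) and _is_mark(identifier[j]):
            j += 1
        # Steps 2 and 3: full canonical decomposition and canonical ordering.
        seq = unicodedata.normalize("NFD", identifier[i:j])
        # Step 4: Hebrew base characters stay decomposed.
        if not _is_hebrew(seq[0]):
            # Step 5: longest initial match against the table.
            for n in range(len(seq), 1, -1):
                if seq[:n] in RECOMPOSE:
                    seq = RECOMPOSE[seq[:n]] + seq[n:]
                    break
        # Step 6: emit the result and continue with the next sequence.
        out.append(seq)
        i = j
    return "".join(out)
```

   With this sketch, "e" followed by U+0301 becomes U+00E9, the
   sequence "E", U+0306, U+0327 becomes U+1E1C, and U+212A becomes a
   plain "K". The result can differ from the later standardized form
   NFC, which excludes some characters from recomposition.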

2.2 Hangul Jamo Normalization

   Hangul Jamo (U+1100-U+11FF) provide ample possibilities for ambiguous
   notations and therefore must be carefully normalized.  The following
   algorithm should be used:

   Step 1: A sequence of Hangul jamo is split up into syllables
   according to the definition of syllable boundaries on page 3-12 of
   [Unicode2]. Each of these syllables is processed according to
   Steps 2-4.

   Step 2: Fillers are inserted as necessary to form a canonical
   syllable as defined on page 3-12 of [Unicode2].

   Step 3: Sequences of choseong, jungseong, and jongseong (leading
   consonants, vowels, and trailing consonants) are replaced by a
   single choseong, jungseong, and jongseong respectively according to
   the compatibility decompositions given in [Unicode2]. If this is
   not possible, this is a forbidden sequence.

   Step 4: The sequence is replaced by a Hangul Syllable
   (U+AC00-U+D7AF) if this is possible according to the algorithm
   given on pp. 3-12/3 of [Unicode2].

        NOTE -- We are not currently dealing with compatibility
        Jamo (U+3130...).
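
   Step 4 amounts to simple arithmetic on jamo indices. The sketch
   below shows only that step, using the constants of the Unicode
   Hangul syllable composition algorithm; the function name is
   hypothetical, and Steps 1-3 (syllable splitting, fillers, and
   merging of jamo sequences) are not covered.

```python
# Constants of the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_syllable(lead, vowel, trail=None):
    """Return the precomposed syllable, or None if none exists."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        return None
    t_index = 0  # index 0 means "no trailing consonant"
    if trail is not None:
        t_index = ord(trail) - T_BASE
        if not (1 <= t_index < T_COUNT):
            return None
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


# U+1112, U+1161, U+11AB compose to U+D55C.
print(f"U+{ord(compose_syllable(chr(0x1112), chr(0x1161), chr(0x11AB))):04X}")
```

   A result of None corresponds to the case where replacement by a
   Hangul Syllable is not possible and the jamo sequence is kept.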

2.3 Arabic Ligature and Presentation Form Normalization

   It is not yet clear whether a normalization algorithm should be
   defined here, or whether ligatures and presentation forms should
   simply be forbidden.

3. Forbidden Characters and Character Combinations

   To be completed.

4. Dangerous Characters and Character Combinations

   Half-width and full-width compatibility characters (U+FF00...) can
   easily be mistaken for one another and are frequently interchanged.
   The version not in the compatibility section (i.e. half-width for
   Latin and symbols, full-width for Katakana, Hangul, "LIGHT
   VERTICAL", arrows, black square, and white circle) should be used
   wherever possible. Because half-width Latin characters may be
   needed in certain parts of certain identifiers anyway, keyboard
   settings in places where identifiers are input should produce
   half-width Latin characters by default, making the input of
   full-width characters more tedious. Also, while the difference
   between half-width and full-width characters is clearly visible on
   computers in contexts that use fixed-pitch displays, it is not
   transcribed well on paper or in high-quality printing.
   Identifiers should never differ by a half-width/full-width
   difference only.
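
   A check for the last rule could look like the sketch below. It
   assumes that the "<wide>" and "<narrow>" compatibility
   decompositions in Python's unicodedata module identify the
   characters of the compatibility section; the function names are
   hypothetical.

```python
import unicodedata


def fold_width(identifier):
    """Replace half-width/full-width compatibility characters by the
    version outside the compatibility section."""
    out = []
    for ch in identifier:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in ("<wide>", "<narrow>"):
            out.append("".join(chr(int(cp, 16)) for cp in parts[1:]))
        else:
            out.append(ch)
    return "".join(out)


def differ_only_by_width(a, b):
    """True if two distinct identifiers become equal after width folding."""
    return a != b and fold_width(a) == fold_width(b)


# Full-width "abc" (U+FF41 U+FF42 U+FF43) against plain "abc".
print(differ_only_by_width("abc", "\uff41\uff42\uff43"))
```

   Software could use such a check to alert a user who creates an
   identifier that collides with an existing one in this way.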

   To be completed.

5. Discouraged Characters and Character Combinations

   To be completed.

5.1 Similar Letters in Different Alphabets

   Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A)
   are discouraged in contexts where their assignment to a given
   alphabet is or may be ambiguous. This means that mixed-alphabet
   identifiers are discouraged, in particular where the use of each
   alphabet is not clearly marked, e.g. by separators.
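
   A rough way to detect mixed-alphabet identifiers is sketched below.
   Python's standard library has no script property, so the sketch
   takes the first word of each character name as the alphabet; this
   is a heuristic assumption, the function names are hypothetical, and
   separators are not taken into account.

```python
import unicodedata

ALPHABETS = ("LATIN", "GREEK", "CYRILLIC")


def alphabets_used(identifier):
    """Return the set of alphabets whose letters occur in the identifier."""
    found = set()
    for ch in identifier:
        if not ch.isalpha():
            continue  # digits, symbols, and separators are ignored
        first_word = unicodedata.name(ch, "").split(" ")[0]
        if first_word in ALPHABETS:
            found.add(first_word)
    return found


def is_mixed_alphabet(identifier):
    """True if letters from more than one of the alphabets are present."""
    return len(alphabets_used(identifier)) > 1


# The second letter is U+0430 CYRILLIC SMALL LETTER A, not Latin "a".
print(sorted(alphabets_used("p\u0430ypal")))
```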

   In the case of single letters mixed with numbers and symbols, such
   as typically appear in part numbers, it should be assumed that such
   letters are Latin with first priority, and Cyrillic with second
   priority. Priority could also be different for different locations.
   [what is best, fixed priorities or regional?]

   Lower-case identifiers should be preferred to upper-case identifiers
   because lower-case letters are more distinct.

6. No Normalization nor Restriction

   This chapter lists cases where normalization is applied in some
   circumstances or may seem advisable, but which are explicitly not
   normalized, for example because a consistent normalization
   worldwide is not possible.

6.1 Case Folding

   This document assumes that case is distinguished, and does not have
   to be folded or normalized. However, for some identifiers or parts
   thereof, case folding may be taking place. In the absence of any
   specific knowledge about this, it is very much advisable, both for
   automatic processing and for user behaviour, to copy identifiers
   without changing case in any way. On the other hand, it is
   advisable for identifier creators to choose simple and consistent
   casing. Intermittent casing can be copied visually, but is
   difficult to transmit aurally.

   The decision whether to make some part of an identifier
   case-sensitive or not can be taken freely as long as identifiers
   are limited to the basic Latin alphabet.  In many cases, there is a
   tendency to extrapolate this to the Latin script in general.
   However, the Latin script at large contains several special cases
   which are language-dependent (e.g. Turkish dotted and dotless I/i)
   or invalidate the one-to-one correspondence of upper case and lower
   case (e.g. German sharp s).  For identifiers with a repertoire
   extending beyond the basic Latin alphabet, it is therefore highly
   advisable to strictly distinguish case, i.e. to make identifiers
   case-sensitive.
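
   The two special cases can be observed with the default,
   language-independent case mappings of a general-purpose programming
   language. The sketch below uses Python's built-in str.upper() and
   str.lower() as an example; other environments may map these
   characters differently.

```python
# German sharp s: upper-casing yields two letters, and lower-casing
# the result does not give the original character back.
sharp_s = "\u00df"
upper = sharp_s.upper()         # "SS"
round_trip = upper.lower()      # "ss", not U+00DF

# Turkish dotless i: the default mappings are not language-aware, so
# a round trip through upper case ends at the dotted "i".
dotless_i = "\u0131"
dotless_round_trip = dotless_i.upper().lower()

print(upper, round_trip, round_trip == sharp_s)
print(dotless_round_trip == dotless_i)
```

   In both cases a case-insensitive comparison would have to decide
   which spelling is the identifier, which is why case-sensitive
   identifiers are the safer choice beyond the basic Latin alphabet.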

Acknowledgements

   I am grateful in particular to the following persons for contributing
   ideas, advice, criticism and help: Mark Davis, Larry Masenter,
   Michael Kung, Edward Cherlin, Alain LaBonte, Francois Yergeau, (to be
   completed).

Bibliography

   [ISO10646]     ISO/IEC 10646-1:1993. International standard --
                  Information technology -- Universal multiple-octet
                  coded character Set (UCS) -- Part 1: Architecture and
                  basic multilingual plane.

   [Unicode2]     The Unicode Standard, Version 2, Addison-Wesley,
                  Reading, MA, 1996.

   [URN-Syntax]   R. Moats, "URN Syntax", RFC 2141, May 1997.

Author's Address

   Martin J. Duerst
   Multimedia-Laboratory
   Department of Computer Science
   University of Zurich
   Winterthurerstrasse 190
   CH-8057 Zurich
   Switzerland

   Tel: +41 1 257 43 16
   Fax: +41 1 363 00 35
   E-mail: mduerst@ifi.unizh.ch

     NOTE -- Please write the author's name with u-Umlaut wherever
     possible, e.g. in HTML as D&uuml;rst.