Internet Draft                                                 M. Duerst
                                                    University of Zurich
Expires in six months                                          July 1997

             Normalization of Internationalized Identifiers

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months. Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use
   Internet-Drafts as reference material or to cite them other than as
   a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited. Please send comments to
   the author (see Author's Address) or to the uri mailing list at
   uri@bunyip.com. This document is currently a very early draft,
   intended to stimulate discussion only. It is intended to become part
   of a suite of documents related to the internationalization of URLs.

Abstract

   The Universal Character Set (UCS) makes it possible to extend the
   repertoire of characters used in non-local identifiers beyond
   US-ASCII. The UCS contains a large overall number of characters,
   many codepoints for backwards compatibility, and various mechanisms
   to cope with the features of the writing systems of the world.
   All this together can lead to ambiguities in representation. Such
   ambiguities are not a problem when representing running text.
   Therefore existing standards have only defined equivalences. For use
   in identifiers, which are compared using their binary
   representation, this is not sufficient. This document defines a
   normalization algorithm and gives usage guidelines to avoid such
   ambiguities.

Table of contents

   1. Introduction
     1.1 Motivation
     1.2 List of Potential Ambiguities
     1.3 Categories
       1.3.1 Category Overview
       1.3.2 Category List
     1.4 Applicability and Conformance
     1.5 Notation
   2. Normalization Rules
     2.1 Normalization of Combining Sequences
     2.2 Hangul Jamo Normalization
     2.3 Arabic Ligature and Presentation Form Normalization
   3. Forbidden Characters and Character Combinations
   4. Dangerous Characters and Character Combinations
   5. Discouraged Characters and Character Combinations
     5.1 Similar Letters in Different Alphabets
   6. No Normalization nor Restriction
     6.1 Case Folding
   Acknowledgements
   Bibliography
   Author's Address

1. Introduction

1.1 Motivation

   For the identification of resources in networks, many kinds of
   identifiers are in use. Locally, many kinds of identifiers can
   contain characters from all kinds of languages and scripts, but as
   long as different encodings for the same characters exist, these
   cannot be used in identifiers across a wider network. Therefore,
   network identifiers had to be limited to a very restricted character
   repertoire, usually a subset of US-ASCII.

   With the definition of the Universal Character Set (UCS) [ISO10646]
   [Unicode2], it becomes possible to extend the character repertoire
   of such identifiers. In some cases, this has already been done, for
   example in Java and for URNs [URN-Syntax]; other cases are under
   study. While identifiers for resources of full worldwide interest
   should continue to be limited to a very restricted set of the most
   widely known characters, names for resources mainly used in a
   language-local or script-local context may provide significant
   additional user convenience if they can make use of a wider
   character repertoire.

   The UCS contains a large overall number of characters, many
   codepoints for backwards compatibility, and various mechanisms to
   allow it to cope with the features of the writing systems of the
   world. These all lead to ambiguities that in some cases can be
   resolved by careful display, printing, and examination by the
   reader, but in other cases are intended to be unnoticeable by the
   reader. Such ambiguities can be dealt with in systems processing
   running text by using various kinds of equivalences and
   normalizations, which may differ by implementation.

   However, identifier processing software usually compares the binary
   representations of two identifiers to establish that they are
   identical. In some cases, some additional processing may be done to
   account for the specifics of identifier syntax variation.
   To upgrade all such software to take into account the equivalences
   and ambiguities in the UCS would be extremely tedious. For some
   classes of identifiers, it is impossible because their binary
   representation is transparent, in the sense that it may allow legacy
   character encodings to be used besides a character encoding based on
   UCS, and/or it may allow arbitrary binary data to be contained in
   identifiers.

   In order to facilitate the use of identifiers containing characters
   from UCS, this document therefore intends to develop clear
   specifications for a normalization algorithm removing basic
   ambiguities, and guidelines for the use of characters with potential
   ambiguity.

   A key design goal of the algorithm was and is that for most
   identifiers in current use, applying the algorithm results in the
   identity transform (i.e. the identifier is already normalized). This
   allows existing identifiers to remain in use, and internationalized
   identifiers to be introduced in new settings, even before all the
   details of the normalization algorithm have been agreed upon.

   Other goals when designing the algorithms and rules have been as
   follows:

   - Avoid bad surprises for users when they cannot understand why two
     identifiers that look exactly the same don't match. The user in
     this case is an average user without any specific knowledge of
     character encoding, but with a basic dose of "computer literacy"
     (e.g. knowing that 0 and O have distinct keys on a keyboard).

   - Restrict normalization to cases where it is really necessary;
     cover remaining ambiguities by guidelines.

   - Define normalization so that it can be implemented using widely
     accessible documentation.

   - Take measures for the best possible compatibility with future
     additions to the UCS.

   There are some issues this document does not currently address, in
   particular bidirectionality.
   It is not clear yet whether this will be included in this document
   or treated separately.

1.2 List of Potential Ambiguities

   To give an idea of the extent of the problem, this section lists
   potential character ambiguities, roughly ordered so that those cases
   that are more difficult to distinguish come first. The difficulty of
   distinguishing certain characters or combinations may depend greatly
   on context.

   - Precomposed/decomposed diacritic character representation

   - Hangul jamo vs. johab and jamo representation alternatives

   - CJK compatibility ideographs

   - Other backwards compatibility duplicated characters

   - Separately coded Indic length/AI/AU marks

   - Glyphs for vertical variants

   - Croatian digraphs, other ligatures (Latin, Arabic,...)

   - Various variant punctuation (apostrophes, middle dots, spaces,...)

   - Half-width/full-width characters (Latin, Katakana and Hangul)

   - Vertical variants (U+FE30...)

   - Presence or absence of joiner/non-joiner

   - Superscript/subscript variants (numbers and IPA)

   - Small form variants (U+FE50...)

   - Upper case/lower case

   - Similar letters from different scripts (varying degrees) (e.g. "A"
     in Latin, Greek, and Cyrillic)

   - Letterlike symbols, Roman numerals (varying degrees)

   - Enclosed alphanumerics, katakana, hangul,...

   - Squared katakana (units,...), squared Latin abbreviations,...

   - CJK ideograph variants (varying degrees, in particular general
     simplifications, backwards-compatibility non-unifications, JIS
     78/83 problems)

   - Ignorable whitespace, hyphens,... (sorting)

   - Ignorable accents,... (sorting)

1.3 Categories

1.3.1 Category Overview

   This specification distinguishes various categories of ambiguous
   characters or strings.
   For each category, it will list or describe:

   - The characters and character combinations in the category

   - The context, if necessary

   - The nature of the ambiguity

   - The necessary actions or recommendations

1.3.2 Category List

   The following categories are currently under investigation:

   - Normalized: Characters and character combinations in this category
     are not allowed in identifiers; they MUST be converted to a
     normalized form. Examples include characters with strong
     equivalences.

   - Forbidden: Characters and character combinations in this category
     are not allowed at all in identifiers; identifiers containing them
     are illegal. Examples include characters that cause problems for
     software, such as control characters, and cases that need
     normalization but where normalization is too difficult to specify
     algorithmically.

   - Dangerous: Characters and character combinations in this category
     are seriously advised against. Software would usually alert a user
     to an attempt to use such a character, but not force the user to
     remove it.

   - Discouraged: Characters and character combinations in this
     category are advised against, but not so strongly as to
     necessitate an alert.

1.4 Applicability and Conformance

   Where identifiers are used just to transmit data from one point to
   another, e.g. in the case of the query component of a URL resulting
   from a FORM reply, there is no need to apply the normalization rules
   and guidelines defined in this document.

   Identifiers containing a wide range of characters should be used
   with care and only for an audience that is understood to be able to
   transcribe them without problems.

1.5 Notation

   Codepoints from the UCS are denoted as U+XXXX, where XXXX is their
   hexadecimal representation, according to [Unicode2].

   Ranges of characters are expressed as U+XXXX-U+YYYY.
   A block of characters may also be identified by its first codepoint,
   followed by "...". Official ISO character names are given in all
   upper case.

2. Normalization Rules

   This chapter defines several normalization algorithms. They deal
   with different kinds of phenomena, or different scripts. They are
   defined so that the sequence of their application does not change
   the normalization result; each algorithm has to be applied at least
   once. Applying an algorithm a second time will not change the result
   anymore.

   The algorithms are to a certain extent written in a procedural
   fashion. This does not imply that an implementation has to follow
   each step. The only thing that is relevant is whether an
   implementation produces the same outputs on the same inputs for all
   possible inputs, i.e. for all randomly generated strings of
   arbitrary length. An implementation may also combine the various
   algorithms into a single one if the result is the same as applying
   each of the algorithms at least once.

2.1 Normalization of Combining Sequences

   UCS contains a general mechanism for encoding diacritic combinations
   from base letters and modifying diacritics, as well as many
   combinations as precomposed codepoints.

   The following algorithm normalizes such combinations:

   Step 1: Starting from the beginning of the identifier, find a
   maximal sequence of a base character (possibly decomposable)
   followed by modifying letters.

   Step 2: Fully decompose the sequence found in Step 1, using all
   canonical decompositions defined in [Unicode2] and all canonical
   decompositions defined for future additions to the UCS.

   Step 3: Sort the sequence of modifying letters found in Step 2
   according to the canonical ordering algorithm of Section 3.9 of
   [Unicode2].

   Step 4: If the base character is a Hebrew character, go to Step 6.

   Step 5: Try to recombine as much as possible of the sequence
   resulting from Step 3 into a precomposed character by finding the
   longest initial match with any canonical decomposition sequence
   defined in [Unicode2], ignoring decomposition sequences of length 1.

   Step 6: Use the result obtained so far as output and continue with
   Step 1.

      NOTE -- In Step 5, the decomposition sequences in [Unicode2]
      have to be recursively expanded for each character (except for
      decomposition sequences of length 1) before application.
      Otherwise, a character such as U+1E1C, LATIN CAPITAL LETTER E
      WITH CEDILLA AND BREVE, will not be recomposed correctly.

      NOTE -- In Step 5, canonical decompositions defined for future
      additions to the UCS are explicitly not considered. This is done
      to ease forwards compatibility. It is assumed that systems
      knowing about newly defined precompositions will be able to
      decompose them correctly in Step 2, but that it would be hard to
      change identifiers on older systems using a decomposed
      representation.

      NOTE -- Maybe we have to define additions to the canonical
      equivalences, and/or to add more exceptions such as Hebrew.

      NOTE -- A different definition of Step 5 may lead to shorter
      normalizations for some identifiers. The current definition was
      chosen for simplicity and implementation speed. (This may be
      subject to discussion, in particular if somebody has an
      implementation and is ready to share the code.)

      NOTE -- The above algorithm can be sped up by shortcuts, in
      particular by noting that most precomposed characters which are
      not followed by modifying letters are already normalized.

      NOTE -- The exception for "precomposed letters that have a
      decomposition sequence of length 1" in Step 5 is necessary to
      avoid e.g. the letter "K" being "aggregated" to "KELVIN SIGN"
      U+212A.
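
   The six steps above can be sketched in Python as follows. This is
   an illustration under stated assumptions, not a reference
   implementation: it relies on the character database shipped with
   Python's unicodedata module, which is much newer than [Unicode2];
   it approximates "modifying letters" by characters with a non-zero
   canonical combining class; it takes "Hebrew character" to mean the
   block U+0590-U+05FF; and it builds the recomposition table from the
   first 0x10000 codepoints only. All function names are hypothetical.

```python
import unicodedata


def canonical_decomposition(ch):
    """Direct canonical decomposition of ch as a list of characters, or None."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):  # none, or compatibility only
        return None
    return [chr(int(cp, 16)) for cp in mapping.split()]


def fully_decompose(chars):
    """Step 2: apply canonical decompositions recursively."""
    out = []
    for ch in chars:
        parts = canonical_decomposition(ch)
        out.extend([ch] if parts is None else fully_decompose(parts))
    return out


def canonical_order(chars):
    """Step 3: stable sort of each run of marks by combining class."""
    out = list(chars)
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and unicodedata.combining(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)
        i = max(j, i + 1)
    return out


def build_recomposition_table(limit=0x10000):
    """Map recursively expanded decompositions to precomposed characters.

    Decompositions of length 1 are skipped, so "K" is never turned
    into KELVIN SIGN. The limit is an assumption of this sketch.
    """
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        direct = canonical_decomposition(ch)
        if direct is None or len(direct) < 2:
            continue
        key = tuple(canonical_order(fully_decompose(direct)))
        table.setdefault(key, ch)  # keep the lowest codepoint on a clash
    return table


RECOMPOSE = build_recomposition_table()


def is_hebrew(ch):
    """Step 4 test; the block range is an assumption of this sketch."""
    return "\u0590" <= ch <= "\u05ff"


def recompose(seq):
    """Step 5: replace the longest initial match by a precomposed character."""
    for n in range(len(seq), 1, -1):
        composed = RECOMPOSE.get(tuple(seq[:n]))
        if composed is not None:
            return [composed] + seq[n:]
    return seq


def normalize_combining(identifier):
    """Steps 1-6 applied to every base-plus-marks sequence in turn."""
    clusters = []
    for ch in identifier:  # Step 1: split into base + modifying letters
        if unicodedata.combining(ch) == 0 or not clusters:
            clusters.append([ch])
        else:
            clusters[-1].append(ch)
    out = []
    for cluster in clusters:
        seq = canonical_order(fully_decompose(cluster))  # Steps 2 and 3
        if not is_hebrew(seq[0]):                        # Step 4
            seq = recompose(seq)                         # Step 5
        out.extend(seq)                                  # Step 6
    return "".join(out)
```

   With this sketch, normalize_combining("E\u0306\u0327") yields
   U+1E1C, because the two marks are reordered before the recursively
   expanded table is consulted, and normalize_combining("\u212a")
   yields the plain letter "K", which is then left alone. Note that
   the table here also contains precompositions added after
   [Unicode2], which Step 5 as specified would not consider.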

2.2 Hangul Jamo Normalization

   Hangul Jamo (U+1100-U+11FF) provide ample possibilities for
   ambiguous notations and therefore must be carefully normalized. The
   following algorithm should be used:

   Step 1: A sequence of Hangul jamo is split up into syllables
   according to the definition of syllable boundaries on page 3-12 of
   [Unicode2]. Each of these syllables is processed according to Steps
   2-4.

   Step 2: Fillers are inserted as necessary to form a canonical
   syllable as defined on page 3-12 of [Unicode2].

   Step 3: Sequences of choseong, jungseong, and jongseong (leading
   consonants, vowels, and trailing consonants) are replaced by a
   single choseong, jungseong, and jongseong respectively according to
   the compatibility decompositions given in [Unicode2]. If this is not
   possible, this is a forbidden sequence.

   Step 4: The sequence is replaced by a Hangul Syllable
   (U+AC00-U+D7AF) if this is possible according to the algorithm given
   on pp. 3-12/3 of [Unicode2].

      NOTE -- We are not currently dealing with compatibility Jamo
      (U+3130...).

2.3 Arabic Ligature and Presentation Form Normalization

   It is not yet clear whether a normalization algorithm should be
   defined here, or whether ligatures and presentation forms should
   simply be forbidden.

3. Forbidden Characters and Character Combinations

   To be completed.

4. Dangerous Characters and Character Combinations

   Half-width and full-width compatibility characters (U+FF00...) can
   easily be mistaken and are frequently interchanged. The version not
   in the compatibility section (i.e. half-width for Latin and symbols,
   full-width for Katakana, Hangul, "LIGHT VERTICAL", arrows, black
   square, and white circle) should be used wherever possible.
   Because half-width Latin characters may be needed in certain parts
   of certain identifiers anyway, keyboard settings in places where
   identifiers are input should be set to produce half-width Latin
   characters by default, making the input of full-width characters
   more tedious. Also, while the difference between half-width and
   full-width characters is clearly visible on computers in contexts
   that use fixed-pitch displays, it does not transcribe well on paper
   or with high-quality printing. Two identifiers should never differ
   only in the use of half-width versus full-width characters.

   To be completed.

5. Discouraged Characters and Character Combinations

   To be completed.

5.1 Similar Letters in Different Alphabets

   Similar letters in different alphabets (e.g. Latin/Greek/Cyrillic A)
   are discouraged in contexts where their assignment to a given
   alphabet is or may be ambiguous. This means that the use of
   mixed-alphabet identifiers is discouraged, in particular in cases
   where the use of each alphabet is not clearly marked, e.g. by
   separators.

   In the case of single letters mixed with numbers and symbols, such
   as typically appear in part numbers, it should be assumed that such
   letters are Latin with first priority, and Cyrillic with second
   priority. Priority could also be different for different locations.
   [what is best, fixed priorities or regional?]

   Lower-case identifiers should be preferred to upper-case identifiers
   because lower-case letters are more distinct.

6. No Normalization nor Restriction

   This chapter lists cases where in some circumstances normalization
   is applied or may seem advisable, but which are explicitly not
   normalized, for example because a consistent normalization worldwide
   is not possible.

6.1 Case Folding

   This document assumes that case is distinguished, and does not have
   to be folded or normalized.
   However, for some identifiers or parts thereof, case folding may be
   taking place. In the absence of any specific knowledge about this,
   it is very much advisable, both for automatic processing and for
   user behaviour, to copy identifiers without changing case in any
   way. On the other hand, it is advisable for identifier creators to
   choose simple and consistent casing. Intermittent casing can be
   copied visually, but is difficult to transmit aurally.

   The decision whether to make some part of an identifier
   case-sensitive or not can be taken freely as long as identifiers are
   limited to the basic Latin alphabet. In many cases, there is a
   tendency to extrapolate this to the Latin script in general.
   However, the Latin script at large contains several special cases
   which are language-dependent (e.g. Turkish dotted and dotless I/i)
   or invalidate the one-to-one correspondence of upper case and lower
   case (e.g. German sharp s). For identifiers with a repertoire
   extending beyond the basic Latin alphabet, it is therefore highly
   advisable to strictly distinguish case, i.e. to make identifiers
   case-sensitive.

Acknowledgements

   I am grateful in particular to the following persons for
   contributing ideas, advice, criticism and help: Mark Davis, Larry
   Masinter, Michael Kung, Edward Cherlin, Alain LaBonte, Francois
   Yergeau, (to be completed).

Bibliography

   [ISO10646]     ISO/IEC 10646-1:1993. International standard --
                  Information technology -- Universal multiple-octet
                  coded character Set (UCS) -- Part 1: Architecture and
                  basic multilingual plane.

   [Unicode2]     The Unicode Standard, Version 2, Addison-Wesley,
                  Reading, MA, 1996.

   [URN-Syntax]   R. Moats, "URN Syntax", RFC 2141, May 1997.

Author's Address

   Martin J. Duerst
   Multimedia-Laboratory
   Department of Computer Science
   University of Zurich
   Winterthurerstrasse 190
   CH-8057 Zurich
   Switzerland

   Tel: +41 1 257 43 16
   Fax: +41 1 363 00 35
   E-mail: mduerst@ifi.unizh.ch

      NOTE -- Please write the author's name with u-Umlaut wherever
      possible, e.g. in HTML as Dürst.