Internet Draft                                                M. Duerst
                                                    W3C/Keio University
Expires in six months                                          M. Davis
                                                                    IBM
                                                             March 2000

              Character Normalization in IETF Protocols

Status of this Memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but may be
discussed on the mailing lists <...> or <...>.

This is a new version of an Internet Draft entitled "Normalization of
Internationalized Identifiers" that dealt with quite similar issues
and was submitted in July 1997 by the first author while he was at
the University of Zurich.

Abstract

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings.
This document proposes that in IETF protocols, the class of
duplicates called canonical equivalents be dealt with by using Early
Uniform Normalization according to Unicode Normalization Form C,
Canonical Composition [UTR15]. This document describes both Early
Uniform Normalization and Normalization Form C.

Table of contents

1. Introduction
2. Early Uniform Normalization
3. Canonical Composition (Normalization Form C)
3.1 Decomposition
3.2 Reordering
3.3 Recomposition
3.4 Implementation Notes
4. Stability and Versioning
5. Cases not dealt with by Canonical Equivalence
Acknowledgements
References
Copyright
Author's Addresses

1. Introduction

1.1 Motivation

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings. This has led to
uncertainty for protocol specifiers and implementers, because it was
not clear which part of the Internet infrastructure should take
responsibility for these duplicates, and how.

There are mainly two kinds of duplicates: singleton equivalences and
precomposed/decomposed equivalences. Both of these can be illustrated
using the character A with a ring above. This character can be
encoded in three ways:

1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE
3) U+212B ANGSTROM SIGN

In all three cases, the result is supposed to look the same to the
reader. The equivalence between 1) and 3) is a singleton equivalence;
the equivalence between 1) and 2) is a precomposed/decomposed
equivalence. 1) is the precomposed representation, 2) is the
decomposed representation.
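The three encodings above can be checked with Python's standard
unicodedata module, which implements the Unicode normalization forms
(a sketch for illustration; the assertions below reflect the behavior
of Normalization Form C as described in this document):

```python
import unicodedata

precomposed = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed  = "\u0041\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom    = "\u212B"        # ANGSTROM SIGN

# The three strings are distinct as codepoint sequences ...
assert len({precomposed, decomposed, angstrom}) == 3

# ... but Normalization Form C maps all of them to the single
# precomposed codepoint U+00C5.
for s in (precomposed, decomposed, angstrom):
    print(unicodedata.normalize("NFC", s) == precomposed)  # True each time
```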
The inclusion of these various representation alternatives was a
result of the requirement for round-trip conversion with a wide range
of legacy encodings, as well as of the merger between Unicode and
ISO 10646.

The Unicode Standard has from early on defined Canonical Equivalence
to make clear which cases should be treated as pure encoding
duplicates and which cases should be treated as genuinely different
(if in some cases closely related) data. The Unicode Standard also
from early on defined decomposed normalization, now called
Normalization Form D (case 2) in the example above). This is very
well suited for some kinds of internal processing, but decomposition
does not correspond to how data gets converted from legacy encodings
and transmitted on the Internet. In that case, precomposed data (i.e.
case 1) in the example above) is prevalent.

Encouraged among other things by a requirements analysis of the W3C
[Charreq], the Unicode Technical Committee defined Normalization Form
C, Canonical Composition (see [UTR15]). Normalization Form C in
general produces the same representation as straightforward
transcoding from legacy encodings (see Section 3.4 for the known
exception). The careful and detailed definition of Normalization
Form C is mainly needed to unambiguously define edge cases. Most of
these edge cases will turn up extremely rarely in actual data.

The W3C is adopting Normalization Form C in the form of Early Uniform
Normalization, which means that it assumes that in general, data will
already be in Normalization Form C [Charmod].

This document proposes that in IETF protocols, Canonical Equivalents
be dealt with by using Early Uniform Normalization according to
Unicode Normalization Form C, Canonical Composition [UTR15]. This
document describes both Early Uniform Normalization (in Section 2)
and Normalization Form C (in Section 3).
Section 4 contains an analysis of (mostly theoretical) potential
risks for the stability of Normalization Form C. For reference,
Section 5 discusses various cases of equivalences not dealt with by
Normalization Form C.

2. Early Uniform Normalization

This section gives some guidance on how Normalization Form C, defined
later in Section 3, should be used by Internet protocols. Each
Internet protocol has to define by itself how to use Normalization
Form C, and has to take into account its particular needs. However,
the advice in this section is intended to help writers of
specifications not very familiar with text normalization issues, and
to make sure that the various protocols use solutions that interface
easily with each other.

This section uses various well-known Internet protocols as examples.
However, such examples do not imply that the protocol elements
mentioned actually accept non-ASCII characters. Depending on the
protocol element mentioned, that may or may not be the case. Also,
the examples are not intended to actually define how a specific
protocol deals with text normalization issues. This is solely the
responsibility of the specification for each specific protocol.

The basic principle for how to use Normalization Form C is Early
Uniform Normalization. This means that ideally, only text in
Normalization Form C appears on the Internet. This can be seen as
applying 'be conservative in what you send' to the problem of text
normalization. And (again ideally) it should not be necessary for
each implementation of an Internet protocol to separately implement
normalization. Text should just be provided normalized by the
underlying infrastructure, e.g. the operating system or the keyboard
driver.

Early normalization is of particular importance for those parts of
Internet protocols that are used as identifiers.
Examples would be file names in FTP, newsgroup names in NNTP, and so
on. This is due to the following reasons:

- In order for the protocol to work, it has to be very well defined
  when two protocol element values match and when they do not.
- Implementations, in particular on the server side, do not in any
  way have to deal with e.g. the display of multilingual text, but on
  the other hand have to handle a lot of protocol-specific issues.
  Such implementations therefore should not be bothered with text
  normalization.

For free text, e.g. the content of mail messages or news postings,
Early Uniform Normalization is somewhat less important, but can
definitely improve interoperability.

For protocol elements used as identifiers, this document advises
Internet protocols to specify the following:

- Comparison should be carried out purely binary (after it has been
  made sure, where necessary, that the texts to be compared are in
  the same character encoding).
- Any kind of text, and in particular identifier-like protocol
  elements, should be sent normalized to Normalization Form C.
- In case comparison fails due to a difference in text normalization,
  the originator of the non-normalized text is responsible for the
  failure.
- In case implementers know, or suspect, that their underlying
  infrastructure produces non-normalized text, they should take care
  to do the necessary tests and, if necessary, the actual
  normalization themselves.
- In the case of creation of identifiers, and in particular if this
  creation is comparatively infrequent (e.g. newsgroup names, domain
  names) and happens in a rather centralized manner, explicit checks
  for normalization should be required by the protocol specification.

3. Canonical Composition (Normalization Form C)

This section describes Canonical Composition (Normalization Form C).
The description is done in a procedural way, but any other procedure
that leads to identical results can be used. The result is intended
to be exactly identical to that described by [UTR15]. Various notes
are provided to help understand the description and to give
implementation hints.

Given a sequence of UCS codepoints, its Canonical Composition can be
computed with the following three steps:

1. Decomposition
2. Reordering
3. Recomposition

These steps are described in detail below.

3.1 Decomposition

For each UCS codepoint in the input sequence, check whether this
codepoint has a canonical decomposition according to the newest
version of the Unicode Character Database (field 5 in [UniData]). If
such a decomposition is found, replace the codepoint in the input
sequence by the codepoint(s) in the decomposition, and apply
decomposition again to the replacement codepoints.

Note: Fields in [UniData] are delimited by ';'. Field 5 in [UniData]
is the 6th field when counting with an index origin of 1. Fields
starting with a tag delimited by '<' and '>' indicate compatibility
decompositions and therefore have to be ignored.

Note: For Korean Hangul, the decompositions are not contained in
[UniData], but have to be generated algorithmically according to the
description in [Unicode].

Note: Some decompositions replace a single codepoint by another
single codepoint.

Note: Due to the properties of the data in the Unicode Character
Database, recursive application of decompositions is necessary only
for the first codepoint of a decomposition.

3.2 Reordering

For each adjacent pair of UCS codepoints after decomposition, check
the combining classes of the UCS codepoints according to the newest
version of the Unicode Character Database (field 3 in [UniData]).
If the combining class of the first codepoint is higher than the
combining class of the second codepoint, and at the same time the
combining class of the second codepoint is not zero, then exchange
the two codepoints. Repeat this process until no two codepoints can
be exchanged anymore.

Note: A combining class greater than zero indicates that a codepoint
is a combining mark that participates in reordering. A combining
class of zero indicates that a codepoint is not a combining mark, or
that it is a combining mark that is not affected by reordering. There
are no combining classes below zero.

Note: Besides a few script-specific combining classes, combining
classes mainly distinguish whether a combining mark is attached to
the base letter or just placed near the base letter, and on which
side of the base letter (e.g. bottom, above right, ...) the combining
mark is attached/placed. Reordering assures that combining marks
placed on different sides of the same character appear in a canonical
order (because any order would visually look the same), while
combining marks placed on the same side of a character are not
reordered (because reordering them would change the combination they
represent).

Note: As a result of this step, the sequence of UCS codepoints is in
Canonical Decomposition (Normalization Form D).

3.3 Recomposition

Process the sequence of UCS codepoints resulting from Reordering from
start to end. At the start, no 'initial' is remembered.
For each of the codepoints, do the following:

- If you have remembered an 'initial', and the codepoint immediately
  preceding the current codepoint is this 'initial' or has a
  combining class smaller than the combining class of the current
  codepoint, and the 'initial' can be canonically recombined with the
  current codepoint, then replace the 'initial' with the canonical
  recombination and remove the current codepoint.
- Else, if the current codepoint has combining class zero, remember
  it as the new 'initial'.

A sequence of two codepoints can be canonically recombined to a third
codepoint if this third codepoint has a canonical decomposition into
the sequence of two codepoints (see [UniData], field 5) and this
canonical decomposition is not excluded from recombination. For
Korean Hangul, the decompositions are not contained in [UniData], but
have to be generated algorithmically according to the description in
[Unicode]. The exclusions from recombination are defined as follows:

1) Singletons: Codepoints that have a canonical decomposition into a
   single other codepoint.
2) Non-starters: Codepoints with a decomposition starting with a
   codepoint of a combining class other than zero.
3) Post-Unicode 3.0: Codepoints with a decomposition introduced after
   Unicode 3.0.
4) Script-specific: Precomposed codepoints that are not the generally
   preferred form for their script.

The lists of codepoints for 1) and 2) can be produced directly from
the Unicode Character Database [UniData]. The list of codepoints for
3) can be produced from a comparison between the 3.0.0 version and
the latest version of [UniData], but this may be difficult. The list
of codepoints for 4) cannot be computed.
[CompExcl] provides a normative list for 4), lists for 1) and 2) for
cross-checking, and an empty slot for 3) (because there are currently
no post-Unicode 3.0 codepoints with decompositions).

Note: At the beginning of recomposition, there is no 'initial'. An
'initial' is remembered as soon as the first codepoint with a
combining class of zero is found. Not every codepoint with a
combining class of zero becomes an 'initial'; the exceptions are
those that are the second codepoint in a recomposition. The 'initial'
as used in this description is slightly different from the 'starter'
used in [UTR15].

Note: Checking that the previous codepoint has a combining class
smaller than the combining class of the current codepoint assures
that the conditions used for reordering are maintained in the
recombination step.

Note: Exclusion of singletons is necessary because in a pair of
canonically equivalent codepoints, the canonical decomposition points
from the 'less desirable' codepoint to the preferred codepoint. In
this case, both canonical decomposition and canonical composition
have the same preference.

Note: For a discussion of the exclusion of post-Unicode 3.0
codepoints from recombination, please see Section 4 on versioning
issues.

Note: Other algorithms for recomposition have been considered, but
this algorithm has been chosen because it provides a very good
balance between computational and implementation complexity and the
'power' of recombination.

3.4 Implementation Notes

This section contains various notes on potential implementation
issues, improvements, and shortcuts.

Avoiding decomposition: It is not always necessary to decompose
and recompose.
In particular, any sequence that does not contain any of the
following is already in Normalization Form C:

- Codepoints that are excluded from recomposition
- Codepoints that appear in second position in a canonical
  recomposition
- Hangul Jamo codepoints (U+1100-U+11F9)
- Unknown codepoints

If a contiguous part of a sequence satisfies the above criterion, all
but the last of its codepoints are already in Normalization Form C.

Unknown codepoints: Unknown codepoints are listed above to avoid
claiming that something is in Normalization Form C when it may indeed
not be, but they usually will be treated differently from the others.
The following behaviours may be possible, depending on the context of
normalization:

- Stop the normalization process with a fatal error. (This should be
  done only in very exceptional circumstances. It would mean that the
  implementation dies on data that conforms to a future version of
  Unicode.)
- Produce some warning that such codepoints have been seen, for
  further checking.
- Just copy the unknown codepoint from the input to the output,
  running the risk of not normalizing completely.
- Check via the Internet that the program-internal data is up to
  date.
- Distinguish behaviour depending on the range of codepoints in which
  the unknown codepoint has been found.

Surrogates: When implementing normalization for sequences of UCS
codepoints represented as UTF-16 code units, care has to be taken
that pairs of surrogate code units that represent a single UCS
codepoint are treated appropriately.

Korean Hangul: There are no interactions between the normalization of
Korean Hangul and the other normalizations. These two parts of
normalization can therefore be carried out separately, with different
implementation improvements.
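The 'avoiding decomposition' shortcut above amounts to a quick check
followed by normalization only when needed. A minimal sketch in
Python (the standard unicodedata module exposes such a quick check as
is_normalized() in Python 3.8 and later; ensure_nfc is a hypothetical
helper name):

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # Fast path: most real-world text is already in Normalization
    # Form C, so a quick check avoids the full decompose/reorder/
    # recompose work in the common case.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: actually normalize.
    return unicodedata.normalize("NFC", text)

print(ensure_nfc("A\u030A"))  # -> "Å" (the single codepoint U+00C5)
```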
Piecewise application: The various steps such as decomposition,
reordering, and recomposition can be applied to parts of a codepoint
sequence. As an example, when normalizing a large file, normalization
can be done on each line separately, because line endings and
normalization do not interact.

Integrating decomposition and recomposition: It is possible to avoid
full decomposition by noting that the decomposition of a codepoint
that is not in the exclusion list can be avoided if it is not
followed by a codepoint that can appear in second position in a
canonical recomposition. This condition can be strengthened by noting
that decomposition is not necessary if the combining class of the
following codepoint is higher than the highest combining class
obtained from decomposing the character in question. In other cases,
a decomposition followed immediately by a recomposition can be
precalculated. Further details are left to the reader.

Decomposition: Recursive application of decomposition can be avoided
by a preprocessing step that calculates a full canonical
decomposition for each character with a canonical decomposition.

Reordering: The reordering step is basically a sorting problem.
Because the number of consecutive combining marks (i.e. consecutive
codepoints with combining class greater than zero) is usually
extremely small, a very simple sorting algorithm can be used, e.g. a
straightforward bubble sort. Because reordering will occur extremely
locally, the following variant of bubble sort will lead to a fast and
simple implementation:

- Start by checking the first pair (i.e. the first two codepoints).
- If there is an exchange, and we are not at the start of the
  sequence, move back by one codepoint and check again.
- Otherwise (i.e.
if there is no exchange, or we are at the start of the sequence) and
we are not at the end of the sequence, move forward by one codepoint
and check again.
- If we are at the end of the sequence, and there has been no
  exchange for the last pair, then we are done.

Conversion from legacy encodings: Normalization Form C is designed so
that in almost all cases, one-to-one conversion from legacy encodings
(e.g. iso-8859-1, ...) to the UCS will produce a result that is
already in Normalization Form C. The one known exception to this at
the moment is the Vietnamese Windows code page, which uses a kind of
'half-precomposed' encoding, whereas Normalization Form C uses full
precomposition for the characters needed for Vietnamese. It was
impossible to preserve the 'half-precomposed' encoding for Vietnamese
in Normalization Form C because this would have led to anomalies
elsewhere, for example for French.

Uses of the UCS in non-normalized form: The only known case where the
UCS is used in a way that is not in Normalization Form C is a group
of users using the UCS for Yiddish. The few combinations of Hebrew
base letters and diacritics used to write Yiddish are available
precomposed in the UCS. On the other hand, the many combinations used
in writing the Hebrew language are only available by using combining
characters. In order to arrive at a uniform model of encoding Hebrew,
the precomposed Hebrew codepoints were excluded from recombination.
This means that Yiddish using precomposed codepoints is not in
Normalization Form C. It is hoped that once systems that
transparently handle composition become more widespread, Yiddish
users can move to using a decomposed representation that is in
Normalization Form C.

Implementation examples can be found at [Charlint] (Perl) and
[Normalizer] (Java).

4. Stability and Versioning

Defining a normalization form for Internet-wide use requires that
this normalization form stay as stable as possible. Stability for
Normalization Form C is mainly achieved by introducing a cutoff
version. For precomposed characters encoded up to and including this
version, in principle the precomposed version is the normal form, but
precomposed codepoints introduced after the cutoff version are
decomposed in Normalization Form C.

As the cutoff version, version 3.0 of Unicode and the second edition
of ISO/IEC 10646-1 have been chosen. These are aligned codepoint by
codepoint, and are easily available.

The rest of this section discusses potential threats to the stability
of Normalization Form C, the probability of such threats, and how to
avoid them.

The analysis below shows that the probability of the various threats
is extremely low. The analysis is provided here to document the
awareness of these threats and the measures that have to be taken to
avoid them. This section is only of marginal importance to an
implementer of Normalization Form C or to an author of an Internet
protocol specification.

4.1 New Precomposed Codepoints

The introduction of new (post-Unicode 3.0) precomposed codepoints is
not a threat to the stability of Normalization Form C. Such
codepoints would just provide an alternate way of encoding characters
that can already be encoded without them, by using a decomposed form.
The normalization algorithm already provides for the exclusion of
such characters from recomposition.

While Normalization Form C itself is not affected, such new
codepoints would affect implementations of Normalization Form C,
because such implementations have to be updated to correctly
decompose the new codepoints.
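The behavior of already-excluded codepoints illustrates how an
excluded precomposed codepoint behaves under Normalization Form C. A
Python sketch using a singleton exclusion (U+212B) and a
script-specific Hebrew exclusion (U+FB2A, the kind of exclusion
discussed in Section 3.4) as stand-ins for how any excluded
precomposed codepoint is handled:

```python
import unicodedata

# U+212B ANGSTROM SIGN: a singleton exclusion. NFC replaces it by its
# canonical equivalent, the preferred codepoint U+00C5.
print(unicodedata.normalize("NFC", "\u212B") == "\u00C5")  # True

# U+FB2A HEBREW LETTER SHIN WITH SHIN DOT: a script-specific
# exclusion. Under NFC the excluded precomposed codepoint is
# decomposed (SHIN + SHIN DOT) and never recomposed.
decomposed = unicodedata.normalize("NFC", "\uFB2A")
print([hex(ord(c)) for c in decomposed])  # -> ['0x5e9', '0x5c1']
```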
Note: While the new codepoint may be correctly normalized only by
updated implementations, once it is normalized, neither older nor
updated implementations will change anything anymore.

Because the new codepoints would not actually encode any new
characters that couldn't be encoded before, because the new
codepoints won't actually be used due to Early Uniform Normalization,
and because of the above implementation problems, encoding new
precomposed characters is superfluous and should be very clearly
avoided.

4.2 New Combining Marks

It is in theory possible that a new combining mark would be encoded
that is intended to represent decomposable pieces of already existing
encoded characters. In case this indeed happened, problems for
Normalization Form C can be avoided by making sure the precomposed
character that now has a decomposition is not included in the list of
recomposition exclusions. While this helps for Normalization Form C,
adding a canonical decomposition would affect other normalization
forms, and it is therefore highly unlikely that such a canonical
decomposition will ever be added in the first place.

In case new combining marks are encoded for new scripts, or in case a
combining mark is introduced that does not appear in any precomposed
character yet, then the appropriate normalization for these
characters can easily be defined by providing the appropriate data.
However, hopefully no new encoding ambiguities will be introduced for
new scripts.

4.3 Changed Codepoints

A major threat to the stability of Normalization Form C would come
from changes to ISO/IEC 10646/Unicode itself, i.e. by moving around
characters or redefining codepoints, or by ISO/IEC 10646 and Unicode
evolving differently in the future. These threats are not specific to
Normalization Form C, but relevant for the use of the UCS in general,
and are mentioned here for completeness.
Because of the very wide and increasing use of the UCS throughout the
world, the resistance to any changes of defined codepoints or to any
divergence between ISO/IEC 10646 and Unicode is extremely strong.
Awareness of the need for stability in this respect, as well as
others, is particularly high due to the experiences with some changes
in the early history of these standards, in particular with the
reencoding of some Korean Hangul characters in ISO/IEC 10646
Amendment 5 (and the corresponding change in Unicode). For the IETF
in particular, the wording in [RFC 2279] and [RFC 2781] stresses the
importance of stability in this respect.

5. Cases not dealt with by Canonical Equivalence

This section gives a list of cases that are not dealt with by
Canonical Equivalence and Normalization Form C. This is done to help
the reader understand Normalization Form C and its limits. The list
in this section contains many cases of widely varying nature. In most
cases, a viewer, if familiar with the script in question, will be
able to distinguish the various variants.

Internet protocols can deal in various ways with the cases below. One
way is to limit the characters allowed e.g. in an identifier so that
one of the variants is disallowed. Another way is to assume that the
user can make the distinction him/herself. Yet another is to
recognize that some characters or combinations of characters that
would lead to confusion are very difficult to actually enter on any
keyboard; it may therefore not really be worth excluding them
explicitly.

- Various ligatures (Latin, Arabic)

- Croatian digraphs

- Full-width Latin compatibility variants

- Half-width Kana and Hangul compatibility variants

- Vertical compatibility variants (U+FE30...)

- Superscript/subscript variants (numbers and IPA)

- Small form compatibility variants (U+FE50...)
   - Enclosed/encircled alphanumerics, Kana, Hangul,...

   - Letterlike symbols, Roman numerals,...

   - Squared Katakana and Latin abbreviations (units,...)

   - Hangul jamo representation alternatives for historical Hangul

   - Presence or absence of joiner/non-joiner and other control
     characters

   - Upper case/lower case distinction

   - Distinction between Katakana and Hiragana

   - Similar letters from different scripts
     (e.g. "A" in Latin, Greek, and Cyrillic)

   - CJK ideograph variants (glyph variants introduced due to the
     source separation rule, simplifications)

   - Various punctuation variants (apostrophes, middle dots, spaces,...)

   - Ignorable whitespace, hyphens,...

   - Ignorable accents,...

   Many of the cases above are identified as compatibility equivalences
   in the Unicode database. [UTR15] defines Normalization Forms KC and
   KD to normalize compatibility equivalences. It may look attractive
   to simply use Normalization Form KC instead of Normalization Form C
   for Internet protocols. However, while the Canonical Equivalence
   that forms the basis of Normalization Form C deals with a very small
   number of very well defined cases of complete equivalence (from a
   user's point of view), Compatibility Equivalence comprises a very
   wide range of cases that usually have to be examined one at a time.

Acknowledgements

   An earlier version of this document benefited from ideas, advice,
   criticism and help from Mark Davis, Larry Masinter, Michael Kung,
   Edward Cherlin, Alain LaBonte, Francois Yergeau, and others. For the
   current version, the authors were encouraged in particular by Patrik
   Faltstrom and Paul Hoffman. The discussion of potential stability
   threats is based on contributions by John Cowan and Kenneth
   Whistler.

References

   [Charlint]   Martin Duerst. Charlint - A Character Normalization
                Tool.

   [Charreq]    Martin J. Duerst, Ed.
Requirements for String Identity
                Matching and String Indexing. World Wide Web Consortium
                Working Draft.

   [Charmod]    Martin J. Duerst and Francois Yergeau, Eds. Character
                Model for the World Wide Web. World Wide Web Consortium
                Working Draft.

   [CompExcl]   The Unicode Consortium. Composition Exclusions.

   [ISO10646]   ISO/IEC 10646-1:1993. International standard --
                Information technology -- Universal multiple-octet
                coded character set (UCS) -- Part 1: Architecture and
                Basic Multilingual Plane, and its Amendments.

   [Normalizer] The Unicode Consortium. Normalization Demo.

   [RFC 2277]   Harald Alvestrand. IETF Policy on Character Sets and
                Languages, RFC 2277, January 1998.

   [RFC 2279]   Francois Yergeau. UTF-8, a transformation format of
                ISO 10646, RFC 2279.

   [RFC 2781]   Paul Hoffman and Francois Yergeau. UTF-16, an encoding
                of ISO 10646, RFC 2781.

   [Unicode]    The Unicode Consortium. The Unicode Standard, Version
                3.0. Reading, MA, Addison-Wesley Developers Press,
                2000. ISBN 0-201-61633-5.

   [UniData]    The Unicode Consortium. UnicodeData File.
                For an explanation of the content of this file, please
                see the UnicodeData file format documentation.

   [UTR15]      Mark Davis and Martin Duerst. Unicode Normalization
                Forms. Unicode Technical Report #15.

Copyright

   Copyright (C) The Internet Society, 2000. All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other
   than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Author's Addresses

   Martin J. Duerst
   W3C/Keio University
   5322 Endo, Fujisawa
   252-8520 Japan
   mailto:duerst@w3.org
   http://www.w3.org/People/D%C3%BCrst/
   Tel/Fax: +81 466 49 1170

   Note: Please write "Duerst" with u-umlaut wherever possible,
   i.e. as "D&#252;rst" in HTML and XML.

   Mark E. Davis
   IBM Center for Java Technology
   10275 North De Anza Boulevard
   Cupertino, CA 95014
   U.S.A.
   mailto:mark.davis@us.ibm.com
   http://www.macchiato.com
   Tel: +1 (408) 777-5850
   Fax: +1 (408) 777-5891