Internet Draft                                                 M. Duerst
                                                     W3C/Keio University
Expires in six months                                           M. Davis
                                                                     IBM
                                                              March 2000

              Character Normalization in IETF Protocols

Status of this Memo

This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but may be
discussed on the mailing lists or
.

This is a new version of an Internet Draft entitled "Normalization of
Internationalized Identifiers" that dealt with quite similar issues
and was submitted in July 1997 by the first author while he was at the
University of Zurich.
Abstract

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings. This document proposes
that in IETF protocols, the class of duplicates called canonical
equivalents be dealt with by using Early Uniform Normalization
according to Unicode Normalization Form C, Canonical Composition
[UTR15]. This document describes both Early Uniform Normalization and
Normalization Form C.

Table of contents

0. Change log
1. Introduction
2. Early Uniform Normalization
3. Canonical Composition (Normalization Form C)
3.1 Decomposition
3.2 Reordering
3.3 Recomposition
3.4 Implementation Notes
4. Stability and Versioning
5. Cases not dealt with by Canonical Equivalence
6. Security Considerations
Acknowledgements
References
Copyright
Author's Addresses

0. Change log

Changes from -02 to -03

- Fixed a bad typo in the title.
- Made a lot of wording corrections and presentation improvements,
  most of them suggested by Paul Hoffman.

1. Introduction

1.1 Motivation

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very
wide repertoire of characters. The IETF, in [RFC 2277], requires that
future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible
encoding of the UCS. The wide range of characters included in the UCS
has led to some cases of duplicate encodings. This has led to
uncertainty for protocol specifiers and implementers, because it was
not clear which part of the Internet infrastructure should take
responsibility for these duplicates, and how.

There are mainly two kinds of duplicates, singleton equivalences and
precomposed/decomposed equivalences.
Both of these can be illustrated
using the character A with a ring above. This character can be encoded
in three ways:

1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE
3) U+212B ANGSTROM SIGN

In all three cases, the result is supposed to look the same to the
reader. The equivalence between 1) and 3) is a singleton equivalence;
the equivalence between 1) and 2) is a precomposed/decomposed
equivalence. 1) is the precomposed representation, 2) is the decomposed
representation. The inclusion of these various representation
alternatives was a result of the requirement for round-trip conversion
with a wide range of legacy encodings, as well as of the merger between
Unicode and ISO 10646.

The Unicode Standard has from early on defined Canonical Equivalence to
make clear which sequences of codepoints should be treated as pure
encoding duplicates and which sequences of codepoints should be treated
as genuinely different (if in some cases closely related) data.
The Unicode Standard also from early on defined decomposed
normalization, what is now called Normalization Form D (case 2) in the
example above). This is very well suited for some kinds of internal
processing, but decomposition does not correspond to how data gets
converted from legacy encodings and transmitted on the Internet. In
that case, precomposed data (i.e. case 1) in the example above) is
prevalent.

Note: This specification uses the term 'codepoint', and not
'character', to make clear that it speaks about what the standards
encode, and not what the end user thinks about.

Encouraged by many factors, such as a requirements analysis by the W3C
[Charreq], the Unicode Technical Committee defined Normalization Form
C, Canonical Composition (see [UTR15]).
Normalization Form C in general produces
the same representation as straightforward transcoding from legacy
encodings (see Section 3.4 for the known exception). The careful and
detailed definition of Normalization Form C is mainly needed to
unambiguously define edge cases (base letters with two or more
combining characters). Most of these edge cases will turn up extremely
rarely in actual data.

The W3C is adopting Normalization Form C in the form of Early Uniform
Normalization, which means that it assumes that in general, data will
already be in Normalization Form C [Charmod].

This document proposes that in IETF protocols, Canonical Equivalents be
dealt with by using Early Uniform Normalization according to Unicode
Normalization Form C, Canonical Composition [UTR15]. This document
describes both Early Uniform Normalization (in Section 2) and
Normalization Form C (in Section 3). Section 4 contains an analysis of
(mostly theoretical) potential risks to the stability of Normalization
Form C. For reference, Section 5 discusses various cases of
equivalences not dealt with by Normalization Form C.

2. Early Uniform Normalization

This section tries to give some guidance on how Normalization Form C,
defined later in Section 3, should be used by Internet protocols.
Each Internet protocol has to define by itself how to use Normalization
Form C, and has to take into account its particular needs. However,
the advice in this section is intended to help writers of
specifications not very familiar with text normalization issues, and to
try to make sure that the various protocols use solutions that
interface easily with each other.

This section uses various well-known Internet protocols as examples.
However, such examples do not imply that the protocol elements
mentioned actually accept non-ASCII characters.
Depending on the protocol element
mentioned, that may or may not be the case. Also, the examples are not
intended to actually define how a specific protocol deals with text
normalization issues. This is solely the responsibility of the
specification for each specific protocol.

The basic principle for how to use Normalization Form C is Early
Uniform Normalization. This means that ideally, only text in
Normalization Form C appears on the wire on the Internet. This can be
seen as applying 'be conservative in what you send' to the problem
of text normalization. And (again ideally) it should not be necessary
for each implementation of an Internet protocol to separately implement
normalization. Text should just be provided normalized by the
underlying infrastructure, e.g. the operating system or the keyboard
driver.

Early normalization is of particular importance for those parts of
Internet protocols that are used as identifiers. Examples would
be URIs, domain names, email addresses, identifier names in PKIX
certificates, file names in FTP, newsgroup names in NNTP, and so on.
This is due to the following reasons:

- In order for the protocol to work, it has to be very well defined
  when two protocol element values match and when they do not.
- Implementations, in particular on the server side, do not in any
  way have to deal with e.g. the display of multilingual text, but on
  the other hand have to handle a lot of protocol-specific issues.
  Such implementations therefore should not be bothered with text
  normalization.

For free text, e.g. the content of mail messages or news postings,
Early Uniform Normalization is somewhat less important, but can
definitely improve interoperability.
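The effect of Early Uniform Normalization on comparison can be
illustrated with a small sketch (this draft does not mandate any
particular library; Python's standard unicodedata module happens to
implement the normalization forms of [UTR15], and the strings are the
A-with-ring example from Section 1):

```python
import unicodedata

precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed  = "A\u030A"  # A followed by COMBINING RING ABOVE
angstrom    = "\u212B"   # ANGSTROM SIGN

# Without normalization, a purely binary comparison fails:
assert precomposed != decomposed

# Once both sides are in Normalization Form C, binary comparison works:
def nfc(s):
    return unicodedata.normalize("NFC", s)

assert nfc(precomposed) == nfc(decomposed) == nfc(angstrom) == "\u00C5"
```

If the sender normalizes early, the receiver never needs to run this
normalization step and can compare byte-for-byte.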
For protocol elements used as identifiers, this document advises
Internet protocols to specify the following:

- Comparison should be carried out as a purely binary comparison
  (after it has been made sure, where necessary, that the texts to be
  compared are in the same character encoding).
- Any kind of text, and in particular identifier-like protocol
  elements, should be sent normalized to Normalization Form C.
- In case comparison fails due to a difference in text normalization,
  the originator of the non-normalized text is responsible for the
  failure.
- In case implementers are aware, or suspect, that their underlying
  infrastructure produces non-normalized text, they should take care
  to do the necessary tests and, if necessary, the actual
  normalization by themselves.
- In the case of the creation of identifiers, and in particular if
  this creation is comparatively infrequent (e.g. newsgroup names,
  domain names) and happens in a rather centralized manner, explicit
  checks for normalization should be required by the protocol
  specification.

3. Canonical Composition (Normalization Form C)

This section describes Canonical Composition (Normalization Form C).
The description is done in a procedural way, but any other procedure
that leads to identical results can be used. The result is intended
to be exactly identical to that described by [UTR15]. Various notes
are provided to help understand the description and give
implementation hints.

Given a sequence of UCS codepoints, its Canonical Composition can
be computed with the following three steps:

1. Decomposition
2. Reordering
3. Recomposition

These steps are described in detail below.

3.1 Decomposition

For each UCS codepoint in the input sequence, check whether this
codepoint has a canonical decomposition according to the newest
version of the Unicode Character Database (field 5 in [UniData]).
If such a decomposition is found, replace the codepoint in the
input sequence by the codepoint(s) in the decomposition, and
recursively check for and apply decomposition on the first replaced
codepoint.

Note: Fields in [UniData] are delimited by ';'. Field 5 in [UniData] is
the 6th field when counting with an index origin of 1. Fields starting
with a tag delimited by '<' and '>' indicate compatibility
decompositions; these compatibility decompositions MUST NOT be used for
Normalization Form C.

Note: For Korean Hangul, the decompositions are not contained
in [UniData], but have to be generated algorithmically
according to the description in [Unicode].

Note: Some decompositions replace a single codepoint by another
single codepoint.

Note: It is not necessary to check replaced codepoints other than the
first one due to the properties of the data in the Unicode Character
Database.

Note: It is possible to 'precompile' the decompositions to avoid
having to apply them recursively.

3.2 Reordering

For each adjacent pair of UCS codepoints after decomposition,
check the combining classes of the UCS codepoints according to
the newest version of the Unicode Character Database (field 3
in [UniData]). If the combining class of the first codepoint
is higher than the combining class of the second codepoint,
and at the same time the combining class of the second codepoint
is not zero, then exchange the two codepoints. Repeat this process
until no two codepoints can be exchanged anymore.

Note: A combining class greater than zero indicates that a codepoint
is a combining mark that participates in reordering. A combining
class of zero indicates that a codepoint is not a combining mark,
or that it is a combining mark that is not affected by reordering.
There are no combining classes below zero.
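The exchange procedure described above can be sketched as follows (a
minimal illustration in Python; unicodedata.combining() returns the
combining class from [UniData], and the function name
canonical_reorder is ours, not part of any standard):

```python
import unicodedata

def canonical_reorder(s):
    # Exchange adjacent codepoints when the first has a higher
    # combining class than the second and the second's class is
    # non-zero; repeat until no more exchanges apply.
    cps = list(s)
    done = False
    while not done:
        done = True
        for i in range(len(cps) - 1):
            cc1 = unicodedata.combining(cps[i])
            cc2 = unicodedata.combining(cps[i + 1])
            if cc2 != 0 and cc1 > cc2:
                cps[i], cps[i + 1] = cps[i + 1], cps[i]
                done = False
    return "".join(cps)
```

For example, 'q' + COMBINING DOT ABOVE (class 230) + COMBINING DOT
BELOW (class 220) is reordered so that the dot below comes first,
matching the order produced by decomposed normalization.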
Note: Besides a few script-specific combining classes, combining
classes mainly distinguish whether a combining mark is attached to the
base letter or just placed near the base letter, and on which side of
the base letter (e.g. bottom, above right, ...) the combining mark is
attached/placed. Reordering assures that combining marks placed on
different sides of the same character appear in a canonical order
(because any order would visually look the same), while
combining marks placed on the same side of a character
are not reordered (because reordering them would change
the combination they represent).

Note: After completing this step, the sequence of UCS codepoints
is in Canonical Decomposition (Normalization Form D).

3.3 Recomposition

Process the sequence of UCS codepoints resulting from Reordering
from start to end. This process requires a state variable called
'initial'. At the beginning of the process, the value of 'initial'
is empty.

- If 'initial' has a value, and the codepoint immediately
  preceding the current codepoint is this 'initial' or has a combining
  class smaller than the combining class of the current codepoint,
  and the 'initial' can be canonically recombined with the current
  codepoint, then replace the 'initial' with the canonical
  recombination and remove the current codepoint.
- Otherwise, if the current codepoint has combining class zero,
  store its value in 'initial'.

A sequence of two codepoints can be canonically recombined to a
third codepoint if this third codepoint has a canonical decomposition
into the sequence of two codepoints (see [UniData], field 5) and
this canonical decomposition is not excluded from recombination.
For Korean Hangul, the recompositions are not contained
in [UniData], but have to be generated algorithmically
according to the description in [Unicode].
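The pairwise recombination data can be derived from the canonical
decompositions in [UniData]. As a sketch (Python; the name
build_composition_table is ours, the scan is limited to an arbitrary
codepoint range for brevity, and the Hangul algorithmic compositions
and the [CompExcl] exclusion lists for rules 2) to 4) below are
deliberately ignored here):

```python
import unicodedata

def build_composition_table(limit=0x3400):
    # Map (first, second) codepoint pairs to the codepoint that
    # canonically decomposes into them. Decompositions carrying a
    # '<...>' tag are compatibility decompositions and are skipped;
    # singletons (one-codepoint decompositions) are skipped as well,
    # mirroring exclusion rule 1).
    table = {}
    for cp in range(limit):
        d = unicodedata.decomposition(chr(cp))
        if not d or d.startswith("<"):
            continue
        parts = d.split()
        if len(parts) == 2:
            first, second = (chr(int(p, 16)) for p in parts)
            table[(first, second)] = chr(cp)
    return table

table = build_composition_table()
assert table[("A", "\u030A")] == "\u00C5"  # A + ring recombine to Å
```

A real implementation would additionally remove the excluded
codepoints, as specified next.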
The exclusions from recombination are defined as follows:

1) Singletons: Codepoints that have a canonical decomposition into
   a single other codepoint.
2) Non-starters: Codepoints with a decomposition starting with
   a codepoint of a combining class other than zero.
3) Post-Unicode3.0: Codepoints with a decomposition introduced
   after Unicode 3.0.
4) Script-specific: Precomposed codepoints that are not the
   generally preferred form for their script.

The lists of codepoints for 1) and 2) can be produced directly
from the Unicode Character Database [UniData]. The list of
codepoints for 3) can be produced from a comparison between
the 3.0.0 version and the latest version of [UniData], but this
may be difficult. The list of codepoints for 4) cannot be computed.
For 3) and 4), the lists provided in [CompExcl] MUST be used.
[CompExcl] also provides lists for 1) and 2) for cross-checking.
The list for 3) is currently empty because there are currently
no post-Unicode3.0 codepoints with decompositions.

Note: At the beginning of recomposition, there is no 'initial'.
An 'initial' is remembered as soon as the first codepoint
with a combining class of zero is found. Not every codepoint
with a combining class of zero becomes an 'initial'; the
exceptions are those that are the second codepoint in
a recomposition. The 'initial' as used in this description
is slightly different from the 'starter' used in [UTR15].

Note: Checking that the previous codepoint has a combining class
smaller than the combining class of the current codepoint
assures that the conditions used for reordering are maintained
in the recombination step.

Note: Exclusion of singletons is necessary because in a pair of
canonically equivalent codepoints, the canonical decomposition
points from the 'less desirable' codepoint to the preferred
codepoint.
In this way, both canonical decomposition and
canonical composition prefer the same codepoint.

Note: For a discussion of the exclusion of Post-Unicode3.0
codepoints from recombination, please see Section 4
on versioning issues.

Note: Other algorithms for recomposition have been considered, but
this algorithm has been chosen because it provides a very good
balance between computational and implementation complexity
and the 'power' of recombination.

3.4 Implementation Notes

This section contains various notes on potential implementation
issues, improvements, and shortcuts.

3.4.1 Avoiding decomposition, and checking for Normalization Form C

It is not always necessary to decompose
and recompose. In particular, any sequence that does not contain
any of the following is already in Normalization Form C:

- Codepoints that are excluded from recomposition
- Codepoints that appear in second position in a canonical
  recomposition
- Hangul Jamo codepoints (U+1100-U+11F9)
- Unknown codepoints

If a contiguous part of a sequence satisfies the above criterion,
all but the last of its codepoints are already in Normalization Form C.

The above criterion can also be used to easily check that some data
is already in Normalization Form C. However, this check will reject
some cases that actually are normalized.

3.4.2 Unknown codepoints

Unknown codepoints are listed above to avoid claiming
that something is in Normalization Form C when it may indeed not be,
but they usually will be treated differently from the others. The
following behaviours may be possible, depending on the context of
normalization:

- Stop the normalization process with a fatal error. (This should be
  done only in very exceptional circumstances. It would mean that
  the implementation fails on data that conforms to a future version
  of Unicode.)
- Produce some warning that such codepoints have been seen, for
  further checking.
- Just copy the unknown codepoint from the input to the output,
  running the risk of not normalizing completely.
- Check via the Internet that the program-internal data is up to date.
- Distinguish behaviour depending on the range of codepoints in which
  the unknown codepoint has been found.

3.4.3 Surrogates

When implementing normalization for sequences of UCS codepoints
represented as UTF-16 code units, care has to be taken that pairs of
surrogate code units that represent a single UCS codepoint are treated
appropriately.

3.4.4 Korean Hangul

There are no interactions between the normalization of
Korean Hangul and the other normalizations. These two parts of
normalization can therefore be carried out separately, with different
implementation improvements.

3.4.5 Piecewise application

The various steps, such as decomposition,
reordering, and recomposition, can be applied to parts of a
codepoint sequence. As an example, when normalizing a large file,
normalization can be done on each line separately, because line
endings and normalization do not interact.

3.4.6 Integrating decomposition and recomposition

It is possible to
avoid full decomposition by noting that the decomposition of
a codepoint that is not in the exclusion list can be avoided
if it is not followed by a codepoint that can appear in second
position in a canonical recomposition. This condition can
be strengthened by noting that decomposition is not necessary
if the combining class of the following codepoint is higher
than the highest combining class obtained from decomposing
the character in question. In other cases, a decomposition
followed immediately by a recomposition can be precalculated.
Further details are left to the reader.
3.4.7 Decomposition

Recursive application of decomposition can be
avoided by a preprocessing step that calculates a full canonical
decomposition for each character with a canonical decomposition.

3.4.8 Reordering

The reordering step basically is a sorting problem.
Because the number of consecutive combining marks (i.e. consecutive
codepoints with a combining class greater than zero) is usually
extremely small, a very simple sorting algorithm can be used,
e.g. a straightforward bubble sort.

Because reordering will occur
extremely locally, the following variant of bubble sort will lead
to a fast and simple implementation:

- Start by checking the first pair (i.e. the first two codepoints).
- If there is an exchange, and we are not at the start of the
  sequence, move back by one codepoint and check again.
- Otherwise (i.e. if there is no exchange, or we are at the start
  of the sequence), and we are not at the end of the sequence,
  move forward by one codepoint and check again.
- If we are at the end of the sequence, and there has been no
  exchange for the last pair, then we are done.

3.4.9 Conversion from legacy encodings

Normalization Form C is designed so that
in almost all cases, one-to-one conversion from legacy encodings (e.g.
iso-8859-1, ...) to the UCS will produce a result that is already in
Normalization Form C.

The one known exception at the moment is the Vietnamese Windows
code page, which uses a kind of 'half-precomposed' encoding, whereas
Normalization Form C uses full precomposition for the characters
needed for Vietnamese. It was impossible to preserve the
'half-precomposed' encoding for Vietnamese in Normalization Form C
because this would have led to anomalies, among others, for French.
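The iso-8859-1 case can be checked directly (a sketch in Python,
assuming nothing beyond the standard library):

```python
import unicodedata

# One-to-one conversion from iso-8859-1 yields precomposed codepoints,
# so the decoded text is already in Normalization Form C.
legacy_bytes = b"caf\xe9"                  # "cafe" + e-acute in iso-8859-1
text = legacy_bytes.decode("iso-8859-1")
assert text == unicodedata.normalize("NFC", text)

# Normalization Form D, by contrast, would change the transcoded text.
assert text != unicodedata.normalize("NFD", text)
```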
3.4.10 Uses of the UCS in non-normalized form

The only known case where the UCS is used
in a way that is not in Normalization Form C is a group of users using
the UCS for Yiddish. The few combinations of Hebrew base letters and
diacritics used to write Yiddish are available precomposed in the UCS.
On the other hand, the many combinations used in writing the Hebrew
language are only available by using combining characters.

In order to lead to a uniform model of
encoding Hebrew, the precomposed Hebrew codepoints were excluded from
recombination. This means that Yiddish using precomposed codepoints is
not in Normalization Form C. It is hoped that as soon as systems that
transparently handle composition become more widespread, Yiddish users
will move to using a decomposed representation that is in
Normalization Form C.

Implementation examples can be found at [Charlint] (Perl) and
[Normalizer] (Java).

4. Stability and Versioning

Defining a normalization form for Internet-wide use requires that
this normalization form stays as stable as possible. Stability for
Normalization Form C is mainly achieved by introducing a cutoff
version. For precomposed characters encoded up to and including this
version, in principle the precomposed version is the normal form, but
precomposed codepoints introduced after the cutoff version are
decomposed in Normalization Form C.

As the cutoff version, version 3.0 of Unicode and the second edition
of ISO/IEC 10646-1 have been chosen. These are aligned codepoint-by-
codepoint, and are easily available.

The rest of this section discusses potential threats to the stability
of Normalization Form C, the probability of such threats, and how to
avoid them.

The analysis below shows that the probability of the various
threats is extremely low.
The analysis is provided here to
document the awareness of these threats and the measures that
have to be taken to avoid them. This section is only of marginal
importance to an implementer of Normalization Form C or to an
author of an Internet protocol specification.

4.1 New Precomposed Codepoints

The introduction of new (post-Unicode 3.0) precomposed codepoints
is not a threat to the stability of Normalization Form C. Such
codepoints would just provide an alternate way of encoding characters
that can already be encoded without them, by using a decomposed
form. The normalization algorithm already provides for the exclusion
of such characters from recomposition.

While Normalization Form C itself is not affected, such new codepoints
would affect implementations of Normalization Form C, because such
implementations have to be updated to correctly decompose the new
codepoints.

Note: While the new codepoints may be correctly normalized only by
updated implementations, once normalized, neither older nor updated
implementations will change anything anymore.

Because the new codepoints do not actually encode any new
characters that couldn't be encoded before, because the new codepoints
won't actually be used due to Early Uniform Normalization, and because
of the above implementation problems, encoding new precomposed
characters is superfluous and should be very clearly avoided.

4.2 New Combining Marks

It is in theory possible that a new combining mark would be encoded
that is intended to represent decomposable pieces of already existing
encoded characters. In case this indeed happens, problems for
Normalization Form C can be avoided by making sure that the
precomposed character that now has a decomposition is not included in
the list of recomposition exclusions.
While this helps for Normalization Form
C, adding a canonical decomposition would affect other normalization
forms, and it is therefore highly unlikely that such a canonical
decomposition will ever be added in the first place.

In case new combining marks are encoded for new scripts, or in case
a combining mark is introduced that does not yet appear in any
precomposed character, then the appropriate normalization for these
characters can easily be defined by providing the appropriate data.
However, hopefully no new encoding ambiguities will be introduced for
new scripts.

4.3 Changed Codepoints

A major threat to the stability of Normalization Form C would
come from changes to ISO/IEC 10646/Unicode itself, i.e. from moving
characters around, from redefining codepoints, or from ISO/IEC 10646
and Unicode evolving differently in the future. These threats are
not specific to Normalization Form C, but relevant for the use
of the UCS in general, and are mentioned here for completeness.

Because of the very wide and increasing use of the UCS throughout
the world, the resistance to any changes of defined
codepoints or to any divergence between ISO/IEC 10646 and Unicode
is extremely strong. Awareness of the need for stability in
this point, as well as in others, is particularly high due to the
experiences with some changes in the early history of these standards,
in particular with the reencoding of some Korean Hangul characters
in ISO/IEC 10646 amendment 5 (and the corresponding change in
Unicode). For the IETF in particular, the wording in [RFC 2279] and
[RFC 2781] stresses the importance of stability in this respect.

5. Cases not dealt with by Canonical Equivalence

This section gives a list of cases that are not dealt with by
Canonical Equivalence and Normalization Form C. This is done to help
the reader understand Normalization Form C and its limits.
The list in this section contains many cases of widely varying
nature. In most cases, a viewer familiar with the script in question
will be able to distinguish the various variants.

Internet protocols can deal with the cases below in various ways.
One way is to limit the characters allowed, e.g. in an identifier, so
that one of the variants is disallowed. Another is to assume that
the user can make the distinction him/herself. Yet another is to
recognize that some characters or combinations of characters that
would lead to confusion are very difficult to actually enter on any
keyboard; it may therefore not really be worth excluding them
explicitly.

- Various ligatures (Latin, Arabic)

- Croatian digraphs

- Full-width Latin compatibility variants

- Half-width Kana and Hangul compatibility variants

- Vertical compatibility variants (U+FE30...)

- Superscript/subscript variants (numbers and IPA)

- Small form compatibility variants (U+FE50...)

- Enclosed/encircled alphanumerics, Kana, Hangul,...

- Letterlike symbols, Roman numerals,...

- Squared Katakana and Latin abbreviations (units,...)

- Hangul jamo representation alternatives for historical Hangul

- Presence or absence of joiner/non-joiner and other control characters

- Upper case/lower case distinction

- Distinction between Katakana and Hiragana

- Similar letters from different scripts
  (e.g. "A" in Latin, Greek, and Cyrillic)

- CJK ideograph variants (glyph variants introduced due to the source
  separation rule, simplifications)

- Various punctuation variants (apostrophes, middle dots, spaces,...)

- Ignorable whitespace, hyphens,...

- Ignorable accents,...

Many of the cases above are identified as compatibility equivalences
in the Unicode database. [UTR15] defines Normalization Forms KC and
KD to normalize compatibility equivalences.
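The difference between canonical and compatibility equivalence can be
illustrated with one of the cases listed above, a full-width Latin
compatibility variant; this sketch again uses Python's standard
unicodedata module:

```python
import unicodedata

fullwidth_a = "\uFF21"   # FULLWIDTH LATIN CAPITAL LETTER A

# NFC leaves compatibility variants untouched: the full-width form
# is not canonically equivalent to plain 'A'.
assert unicodedata.normalize("NFC", fullwidth_a) == "\uFF21"

# NFKC additionally applies compatibility mappings and folds the
# full-width variant to plain 'A' (U+0041).
assert unicodedata.normalize("NFKC", fullwidth_a) == "A"
```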
It may look attractive to just use Normalization Form KC instead of
Normalization Form C for Internet protocols. However, while the
Canonical Equivalence that forms the basis of Normalization Form C
deals with a very small number of very well defined cases of complete
equivalence (from a user's point of view), Compatibility Equivalence
comprises a very wide range of cases that usually have to be examined
one at a time.

6. Security Considerations

Improper implementation of normalization can cause problems in
security protocols. For example, in certificate chaining, if the
program validating a certificate chain mis-implements normalization
rules, an attacker might be able to spoof an identity by picking a
name that the validator thinks is equivalent to another name.

Acknowledgements

An earlier version of this document benefited from ideas, advice,
criticism and help from: Mark Davis, Larry Masinter, Michael Kung,
Edward Cherlin, Alain LaBonte, Francois Yergeau, and others. For
the current version, the authors were encouraged in particular by
Patrik Faltstrom and Paul Hoffman. The discussion of potential
stability threats is based on contributions by John Cowan and
Kenneth Whistler. Further contributions are due to Dan Oscarson.

References

[Charlint]  Martin Duerst.  Charlint - A Character Normalization
            Tool.

[Charreq]   Martin J. Duerst, Ed.  Requirements for String Identity
            Matching and String Indexing.  World Wide Web Consortium
            Working Draft.

[Charmod]   Martin J. Duerst and Francois Yergeau, Eds.  Character
            Model for the World Wide Web.  World Wide Web Consortium
            Working Draft.

[CompExcl]  The Unicode Consortium.  Composition Exclusions.

[ISO10646]  ISO/IEC 10646-1:1993.
            International standard -- Information technology --
            Universal multiple-octet coded character set (UCS) --
            Part 1: Architecture and Basic Multilingual Plane, and
            its Amendments.

[Normalizer] The Unicode Consortium.  Normalization Demo.

[RFC 2277]  Harald Alvestrand.  IETF Policy on Character Sets and
            Languages.  January 1998.

[RFC 2279]  Francois Yergeau.  UTF-8, a transformation format of
            ISO 10646.

[RFC 2781]  Paul Hoffman and Francois Yergeau.  UTF-16, an encoding
            of ISO 10646.

[Unicode]   The Unicode Consortium.  The Unicode Standard, Version
            3.0.  Reading, MA, Addison-Wesley Developers Press,
            2000.  ISBN 0-201-61633-5.

[UniData]   The Unicode Consortium.  UnicodeData File.  For an
            explanation of the content of this file, please see .

[UTR15]     Mark Davis and Martin Duerst.  Unicode Normalization
            Forms.  Unicode Technical Report #15.

Copyright

Copyright (C) The Internet Society, 2000.  All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works.  However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Authors' Addresses

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
mailto:duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: Please write "Duerst" with u-umlaut wherever possible, i.e. as
"D&#252;rst" in HTML and XML.

Mark E. Davis
IBM Center for Java Technology
10275 North De Anza Boulevard
Cupertino, CA 95014
U.S.A.
mailto:mark.davis@us.ibm.com
http://www.macchiato.com
Tel: +1 (408) 777-5850
Fax: +1 (408) 777-5891