Internet Draft                                               M. Duerst
                                                   W3C/Keio University
Expires in six months                                         M. Davis
                                                                   IBM
                                                        September 2000

             Character Normalization in IETF Protocols

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but may be discussed on the relevant mailing lists.

Abstract

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very wide repertoire of characters. The IETF, in [RFC 2277], requires that future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible encoding of the UCS. The wide range of characters included in the UCS has led to some cases of duplicate encodings.
This document proposes that in IETF protocols, the class of duplicates called canonical equivalents be dealt with by using Early Uniform Normalization according to Unicode Normalization Form C, Canonical Composition (NFC) [UTR15]. This document describes both Early Uniform Normalization and Normalization Form C.

Table of contents

0. Change Log
1. Introduction
1.1 Motivation
1.2 Notational Conventions
2. Early Uniform Normalization
3. Canonical Composition (Normalization Form C)
3.1 Decomposition
3.2 Reordering
3.3 Recomposition
3.4 Implementation Notes
4. Stability and Versioning
5. Cases not Dealt with by Canonical Equivalence
6. Security Considerations
Acknowledgements
References
Copyright
Author's Addresses

0. Change Log

Changes from -03 to -04:

- Changed the introduction to make clear that this document is mainly about canonical equivalences
- Made UTR #15, Version 18.0, the normative description of NFC
- Added a subsection on interaction with text processing (3.4.11)
- Added various examples
- Various small wording changes
- Added a reference to the test file
- Added a note on terminology (normalization vs. canonicalization)

Changes from -02 to -03:

- Fixed a bad typo in the title.
- Made a lot of wording corrections and presentation improvements, most of them suggested by Paul Hoffman.

1. Introduction

1.1 Motivation

The Universal Character Set (UCS) [ISO10646, Unicode] covers a very wide repertoire of characters. The IETF, in [RFC 2277], requires that future IETF protocols support UTF-8 [RFC 2279], an ASCII-compatible encoding of the UCS. The need for round-trip conversion to pre-existing character encodings has led to some cases of duplicate encodings. This has led to uncertainty for protocol specifiers and implementers, because it was not clear which part of the Internet infrastructure should take responsibility for these duplicates, and how.
Besides straight-out duplicates, there are also many cases of characters that are in one way or another similar. The equivalence between duplicates is called canonical equivalence. Many of the equivalences between similar characters are called compatibility equivalences. This document concentrates on canonical equivalence. The various cases of similar characters are listed in Section 5.

There are mainly two kinds of canonical equivalences: singleton equivalences and precomposed/decomposed equivalences. Both can be illustrated using the character A with a ring above. This character can be encoded in three ways:

1) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
2) U+0041 LATIN CAPITAL LETTER A followed by U+030A COMBINING RING ABOVE
3) U+212B ANGSTROM SIGN

The equivalence between 1) and 3) is a singleton equivalence. The equivalence between 1) and 2) is a precomposed/decomposed equivalence, where 1) is the precomposed representation and 2) is the decomposed representation.

All three representations are supposed to look the same to the reader. Applications may use one representation or another, or even more than one, but they are not allowed to assume that other applications will preserve the difference between them.

The inclusion of these various representation alternatives was a result of the requirement for round-trip conversion with a wide range of legacy encodings, as well as of the merger between Unicode and ISO 10646.

The Unicode Standard has from early on defined Canonical Equivalence to make clear which sequences of codepoints should be treated as pure encoding duplicates and which sequences of codepoints should be treated as genuinely different (if in some cases closely related) data.
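Note: As an illustration (not part of this specification), the equivalences above can be observed with Python's standard unicodedata module, which implements the Unicode normalization forms. All three representations normalize to the single precomposed codepoint under NFC:

```python
import unicodedata

# The three encodings of A with ring above from the example:
precomposed = "\u00C5"        # 1) LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"   # 2) A followed by COMBINING RING ABOVE
angstrom = "\u212B"           # 3) ANGSTROM SIGN

# Under NFC, all three become the single precomposed codepoint U+00C5.
for s in (precomposed, decomposed, angstrom):
    assert unicodedata.normalize("NFC", s) == "\u00C5"

# Under NFD, all three become the decomposed representation 2).
for s in (precomposed, decomposed, angstrom):
    assert unicodedata.normalize("NFD", s) == "\u0041\u030A"
```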
The Unicode Standard also from early on defined decomposed normalization, what is now called Normalization Form D (case 2) in the example above). This is very well suited for some kinds of internal processing, but decomposition does not correspond to how data gets converted from legacy encodings and transmitted on the Internet. There, precomposed data (i.e. case 1) in the example above) is prevalent.

Note: This specification uses the term 'codepoint', and not 'character', to make clear that it speaks about what the standards encode, not about what end users think of, which is not always the same.

Encouraged by many factors, such as a requirements analysis by the W3C [Charreq], the Unicode Technical Committee defined Normalization Form C, Canonical Composition (see [UTR15]). Normalization Form C in general produces the same representation as straightforward transcoding from legacy encodings (see Section 3.4 for the known exception). The careful and detailed definition of Normalization Form C is mainly needed to unambiguously define edge cases (base letters with two or more combining characters). Most of these edge cases will turn up extremely rarely in actual data.

The W3C is adopting Normalization Form C in the form of Early Uniform Normalization, which means that it assumes that, in general, data will already be in Normalization Form C [Charmod].

This document recommends that in IETF protocols, canonical equivalents be dealt with by using Early Uniform Normalization according to Unicode Normalization Form C, Canonical Composition [UTR15]. This document describes both Early Uniform Normalization (in Section 2) and Normalization Form C (in Section 3). Section 4 contains an analysis of (mostly theoretical) potential risks to the stability of Normalization Form C.
For reference, Section 5 discusses various cases of equivalences not dealt with by Normalization Form C.

Note: The terms 'normalization' (as in 'Normalization Form C') and 'canonicalization' (as in XML Canonicalization) can mean virtually the same thing. In the context of the topics described in this document, only 'normalization' is used, because 'canonical' is used to distinguish between canonical equivalents and compatibility equivalents.

1.2 Notational Conventions

For UCS codepoints, the notation U+HHHH is used, where HHHH is the hexadecimal representation of the codepoint. This may be followed by the official name of the character in all caps.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this specification are to be interpreted as described in [RFC2119].

2. Early Uniform Normalization

This section gives some guidance on how Normalization Form C (NFC), described in Section 3, should be used by Internet protocols. Each Internet protocol has to define for itself how to use NFC, taking into account its particular needs. However, the advice in this section is intended to help writers of specifications who are not very familiar with text normalization issues, and to make sure that the various protocols use solutions that interface easily with each other.

This section uses various well-known Internet protocols as examples. However, such examples do not imply that the protocol elements mentioned actually accept non-ASCII characters. Depending on the protocol element, that may or may not be the case, and it may change in the future. Also, the examples are not intended to actually define how a specific protocol deals with text normalization issues. That is the responsibility of the specification for each specific protocol.
The basic principle for how to use Normalization Form C is Early Uniform Normalization. This means that, ideally, only text in Normalization Form C appears on the wire on the Internet. This can be seen as applying 'be conservative in what you send' to the problem of text normalization. And (again ideally) it should not be necessary for each implementation of an Internet protocol to separately implement normalization. Text should simply be provided normalized by the underlying infrastructure, e.g. the operating system or the keyboard driver.

Early normalization is of particular importance for those parts of Internet protocols that are used as identifiers. Examples would be URIs, domain names, email addresses, identifier names in PKIX certificates, identifiers in ACAP, file names in FTP, folder names in IMAP, newsgroup names in NNTP, and so on. This is due to the following reasons:

- In order for the protocol to work, it has to be very well defined when two protocol element values match and when they do not.
- Implementations, in particular on the server side, do not in any way have to deal with e.g. the display of multilingual text, but on the other hand have to handle a lot of protocol-specific issues. Such implementations therefore should not be bothered with text normalization.

For free text, e.g. the content of mail messages or news postings, Early Uniform Normalization is somewhat less important, but it definitely improves interoperability.

For protocol elements used as identifiers, this document recommends that Internet protocols specify the following:

- Comparison SHOULD be carried out purely binary (after it has been made sure, where necessary, that the texts to be compared are in the same character encoding).
- Any kind of text, and in particular identifier-like protocol elements, SHOULD be sent normalized to Normalization Form C.
- In case comparison fails due to a difference in text normalization, the originator of the non-normalized text is responsible for the failure.
- In case implementers are aware, or suspect, that their underlying infrastructure produces non-normalized text, they SHOULD take care to do the necessary tests, and if necessary the actual normalization, themselves.
- In the case of the creation of identifiers, in particular if this creation is comparatively infrequent (e.g. newsgroup names, domain names) and happens in a rather centralized manner, explicit checks for normalization SHOULD be required by the protocol specification.

3. Canonical Composition (Normalization Form C)

This section describes Canonical Composition (Normalization Form C, NFC). The normative specification of Canonical Composition is found in [UTR15]. The description here is procedural, but any other procedure that leads to identical results can be used. The result is supposed to be exactly identical to that described by [UTR15]. If any differences are found, [UTR15] must be followed. For each step, various notes are provided to help understand the description and to give implementation hints.

Given a sequence of UCS codepoints, its Canonical Composition can be computed with the following three steps:

1. Decomposition (Section 3.1)
2. Reordering (Section 3.2)
3. Recomposition (Section 3.3)

Additional implementation notes are given in Section 3.4.

3.1 Decomposition

For each UCS codepoint in the input sequence, check whether this codepoint has a canonical decomposition according to the newest version of the Unicode Character Database (field 5 in [UniData]).
If such a decomposition is found, replace the codepoint in the input sequence by the codepoint(s) in the decomposition, and recursively check for and apply decomposition on the first replaced codepoint.

Note: Fields in [UniData] are delimited by ';'. Field 5 in [UniData] is the 6th field when counting with an index origin of 1. Fields starting with a tag delimited by '<' and '>' indicate compatibility decompositions; these compatibility decompositions MUST NOT be used for Normalization Form C.

Note: For Korean Hangul, the decompositions are not contained in [UniData], but have to be generated algorithmically according to the description in [Unicode], Section 3.11.

Note: Some decompositions replace a single codepoint by another single codepoint.

Note: It is not necessary to check replaced codepoints other than the first one, due to the properties of the data in the Unicode Character Database.

Note: It is possible to 'precompile' the decompositions to avoid having to apply them recursively.

3.2 Reordering

For each adjacent pair of UCS codepoints after decomposition, check the combining classes of the UCS codepoints according to the newest version of the Unicode Character Database (field 3 in [UniData]). If the combining class of the first codepoint is higher than the combining class of the second codepoint, and at the same time the combining class of the second codepoint is not zero, then exchange the two codepoints. Repeat this process until no two codepoints can be exchanged anymore.

Note: A combining class greater than zero indicates that a codepoint is a combining mark that participates in reordering. A combining class of zero indicates that a codepoint is not a combining mark, or that it is a combining mark that is not affected by reordering. There are no combining classes below zero.
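Note: As an illustration (not part of this specification), the exchange rule above can be sketched as follows in Python, using the stdlib unicodedata module to look up combining classes:

```python
import unicodedata

def canonical_reorder(codepoints):
    """Sketch of the reordering step: exchange adjacent codepoints while
    the first has a higher combining class than the second and the
    second's combining class is not zero."""
    cps = list(codepoints)
    changed = True
    while changed:
        changed = False
        for i in range(len(cps) - 1):
            c1 = unicodedata.combining(cps[i])
            c2 = unicodedata.combining(cps[i + 1])
            if c2 != 0 and c1 > c2:
                cps[i], cps[i + 1] = cps[i + 1], cps[i]
                changed = True
    return "".join(cps)

# RING ABOVE (class 230) and PLUS SIGN BELOW (class 220) are exchanged:
assert canonical_reorder("\u0041\u030A\u031F") == "\u0041\u031F\u030A"
# Marks with the same combining class keep their relative order:
assert canonical_reorder("\u0041\u0323\u0323") == "\u0041\u0323\u0323"
```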
Note: Besides a few script-specific combining classes, combining classes mainly distinguish whether a combining mark is attached to the base letter or just placed near it, and on which side of the base letter (e.g. bottom, above right, ...) the combining mark is attached or placed. Reordering ensures that combining marks placed on different sides of the same character are placed in a canonical order (because any order would look the same visually), while combining marks placed on the same side of a character are not reordered (because reordering them would change the combination they represent).

Note: After completing this step, the sequence of UCS codepoints is in Canonical Decomposition (Normalization Form D).

3.3 Recomposition

This section describes recomposition in a top-down manner, first describing recomposition processing in general (Section 3.3.1), then describing which pairs of codepoints can be canonically combined (Section 3.3.2), and then describing the combination exclusions (Section 3.3.3).

3.3.1 Recomposition Processing

Process the sequence of UCS codepoints resulting from Reordering from start to end. This process requires a state variable called 'initial'. At the beginning of the process, the value of 'initial' is empty.

For each codepoint in the sequence resulting from Reordering, do the following:
- If the following three conditions all apply:
  - 'initial' has a value
  - the codepoint immediately preceding the current codepoint is this 'initial' or has a combining class not equal to the combining class of the current codepoint
  - the 'initial' can be canonically recombined (see Section 3.3.2) with the current codepoint
  then replace the 'initial' with the canonical recombination and remove the current codepoint.
- Otherwise, if the current codepoint has combining class zero, store its value in 'initial'.
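Note: As an illustration (not part of this specification), the pairwise "canonically recombined" test can be sketched in Python. The helper below is hypothetical; rather than reading field 5 of [UniData] directly, it uses the stdlib's NFC implementation as an oracle for the pairwise composition table (which already honors the exclusions):

```python
import unicodedata

def compose_pair(a, b):
    """Return the canonical recombination of codepoints a and b, or None
    if they do not canonically combine. Hypothetical helper that uses
    NFC as an oracle instead of reading the Unicode Character Database."""
    c = unicodedata.normalize("NFC", a + b)
    if len(c) == 1 and unicodedata.normalize("NFD", c) == a + b:
        return c
    return None

assert compose_pair("\u0041", "\u030A") == "\u00C5"  # A + ring above
assert compose_pair("\u0041", "\u0301") == "\u00C1"  # A + acute accent
assert compose_pair("\u0041", "\u031F") is None      # no precomposed form
```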
Note: At the beginning of recomposition, there is no 'initial'. An 'initial' is remembered as soon as the first codepoint with a combining class of zero is found. Not every codepoint with a combining class of zero becomes an 'initial'; the exceptions are those that are the second codepoint in a recomposition. The 'initial' as used in this description is slightly different from the 'starter' as defined in [UTR15], but this does not affect the result.

Note: Checking that the previous codepoint has a combining class smaller than the combining class of the current codepoint (except if the previous codepoint is the 'initial' and therefore has a combining class of zero) ensures that the conditions used for reordering are maintained in the recombination step.

Note: Other algorithms for recomposition have been considered, but this algorithm has been chosen because it provides a very good balance between computational and implementation complexity and the 'power' of recombination. As an example, assume a text contains a U+0041 LATIN CAPITAL LETTER A with a U+030A COMBINING RING ABOVE and a U+031F COMBINING PLUS SIGN BELOW. Because canonical reordering puts the COMBINING PLUS SIGN BELOW before the COMBINING RING ABOVE, a more straightforward algorithm would not be able to recombine this to U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE followed by U+031F COMBINING PLUS SIGN BELOW.

3.3.2 Pairs of Codepoints that can be Canonically Recombined

A pair of codepoints can be canonically recombined into a third codepoint if this third codepoint has a canonical decomposition into the sequence of the two codepoints (see [UniData], field 5) and this canonical decomposition is not excluded from recombination.
For Korean Hangul, the recompositions are not contained in [UniData], but have to be generated algorithmically according to the description in [Unicode], Section 3.11.

3.3.3 Combination Exclusions

The exclusions from recombination are defined as follows:

1) Singletons: Codepoints that have a canonical decomposition into a single other codepoint (example: U+212B ANGSTROM SIGN).
2) Non-starters: Codepoints with a decomposition starting with a codepoint of a combining class other than zero (example: U+0F75 TIBETAN VOWEL SIGN UU).
3) Post-Unicode 3.0: Codepoints with a decomposition introduced after Unicode 3.0 (no applicable example).
4) Script-specific: Precomposed codepoints that are not the generally preferred form for their script (example: U+0959 DEVANAGARI LETTER KHHA).

The lists of codepoints for 1) and 2) can be produced directly from the Unicode Character Database [UniData]. The list of codepoints for 3) can be produced from a comparison between the 3.0.0 version and the latest version of [UniData], but this may be difficult. The list of codepoints for 4) cannot be computed. For 3) and 4), the lists provided in [CompExcl] MUST be used. [CompExcl] also provides lists for 1) and 2) for cross-checking. The list for 3) is currently empty because there are at the moment no post-Unicode 3.0 codepoints with decompositions.

Note: The exclusion of singletons is necessary because in a pair of canonically equivalent codepoints, the canonical decomposition points from the 'less desirable' codepoint to the preferred codepoint. In this case, both canonical decomposition and canonical composition have the same preference.

Note: For a discussion of the exclusion of post-Unicode 3.0 codepoints from recombination, please see Section 4 on versioning issues.
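Note: As an illustration (not part of this specification), the recomposition behavior and two of the exclusion classes can be observed with Python's stdlib unicodedata module, which implements these rules:

```python
import unicodedata

# Recombination across an intervening mark: A + RING ABOVE (class 230)
# + PLUS SIGN BELOW (class 220). Reordering puts the plus sign first,
# yet NFC still recombines A with the ring above, because the
# intervening mark has a different (lower) combining class.
s = "\u0041\u030A\u031F"
assert unicodedata.normalize("NFD", s) == "\u0041\u031F\u030A"
assert unicodedata.normalize("NFC", s) == "\u00C5\u031F"

# Exclusion 1), singleton: ANGSTROM SIGN normalizes to the preferred
# codepoint U+00C5 and is never produced by recomposition.
assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"

# Exclusion 4), script-specific: DEVANAGARI LETTER KHHA stays
# decomposed (KHA + NUKTA) even under NFC.
assert unicodedata.normalize("NFC", "\u0959") == "\u0916\u093C"
```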
3.4 Implementation Notes

This section contains various notes on potential implementation issues, improvements, and shortcuts. Further notes on implementation may be found in [UTR15] or in newer versions of that document.

3.4.1 Avoiding Decomposition, and Checking for Normalization Form C

It is not always necessary to decompose and recompose. In particular, any sequence that does not contain any of the following is already in Normalization Form C:

- Codepoints that are excluded from recomposition (see Section 3.3.3)
- Codepoints that appear in second position in a canonical recomposition
- Hangul Jamo codepoints (U+1100-U+11F9)
- Unassigned codepoints

If a contiguous part of a sequence satisfies the above criterion, all but the last of its codepoints are already in Normalization Form C.

The above criteria can also be used to easily check that some data is already in Normalization Form C. However, this check will reject some cases that actually are normalized.

3.4.2 Unassigned Codepoints

Unassigned codepoints (codepoints that are not assigned in the current version of Unicode) are listed above to avoid claiming that something is in Normalization Form C when it may indeed not be, but they usually will be treated differently from the others. The following behaviours may be possible, depending on the context of normalization:

- Stop the normalization process with a fatal error. (This should be done only in very exceptional circumstances. It would mean that the implementation will die on data that conforms to a future version of Unicode.)
- Produce some warning that such codepoints have been seen, for further checking.
- Just copy the unassigned codepoint from the input to the output, running the risk of not normalizing completely.
- Check via the Internet that the program-internal data is up to date.
- Distinguish behaviour depending on the range of codepoints in which the unassigned codepoint has been found.

3.4.3 Surrogates

When implementing normalization for sequences of UCS codepoints represented as UTF-16 code units, care has to be taken that pairs of surrogate code units that represent a single UCS codepoint are treated appropriately.

3.4.4 Korean Hangul

There are no interactions between the normalization of Korean Hangul and the other normalizations. These two parts of normalization can therefore be carried out separately, with different implementation improvements.

3.4.5 Piecewise Application

The various steps, such as decomposition, reordering, and recomposition, can be applied to appropriately chosen parts of a codepoint sequence. As an example, when normalizing a large file, normalization can be done on each line separately, because line endings and normalization do not interact.

3.4.6 Integrating Decomposition and Recomposition

It is possible to avoid full decomposition by noting that decomposition of a codepoint that is not in the exclusion list can be avoided if it is not followed by a codepoint that can appear in second position in a canonical recomposition. This condition can be strengthened by noting that decomposition is not necessary if the combining class of the following codepoint is higher than the highest combining class obtained from decomposing the character in question. In other cases, a decomposition followed immediately by a recomposition can be precalculated. Further details are left to the reader.

3.4.7 Decomposition

Recursive application of decomposition can be avoided by a preprocessing step that calculates a full canonical decomposition for each character with a canonical decomposition.

3.4.8 Reordering

The reordering step is basically a sorting problem.
Because the number of consecutive combining marks (i.e. consecutive codepoints with combining class greater than zero) is usually extremely small, a very simple sorting algorithm can be used, e.g. a straightforward bubble sort.

Because reordering will occur extremely locally, the following variant of bubble sort will lead to a fast and simple implementation:

- Start by checking the first pair (i.e. the first two codepoints).
- If there is an exchange, and we are not at the start of the sequence, move back by one codepoint and check again.
- Otherwise (i.e. if there is no exchange, or we are at the start of the sequence), and we are not at the end of the sequence, move forward by one codepoint and check again.
- If we are at the end of the sequence, and there has been no exchange for the last pair, then we are done.

3.4.9 Conversion from Legacy Encodings

Normalization Form C is designed so that in almost all cases, one-to-one conversion from legacy encodings (e.g. iso-8859-1, ...) to the UCS will produce a result that is already in Normalization Form C.

The one known exception is the Windows character encoding for Vietnamese (charset=windows-1258 [windows-1258]). This character encoding uses a kind of 'half-precomposed' encoding, whereas Normalization Form C uses full precomposition for the characters needed for Vietnamese. As an example, U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW is encoded as U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX followed by U+0323 COMBINING DOT BELOW in windows-1258, but U+1EAD is the normalized form.

3.4.10 Uses of UCS in Non-Normalized Form

One known case where the UCS is used in a way that is not in Normalization Form C is a group of users using the UCS for Yiddish. The few combinations of Hebrew base letters and diacritics used to write Yiddish are available precomposed in the UCS (example: U+FB2F HEBREW LETTER ALEF WITH QAMATS).
On the other hand, the many combinations used in writing the Hebrew language are only available by using combining characters.

In order to lead to a uniform model of encoding Hebrew, the precomposed Hebrew codepoints were excluded from recombination. This means that Yiddish using precomposed codepoints is not in Normalization Form C.

3.4.11 Interaction with Text Processing

There are many operations on text strings that can create non-normalized output even if the input was normalized. Examples are concatenation (if the second string starts with one of the characters discussed in Section 3.4.1) or case changes (as an example, U+1E98 LATIN SMALL LETTER W WITH RING ABOVE does not have a precomposed capital equivalent).

3.4.12 Implementations and Test Suites

Implementation examples can be found at [Charlint] (Perl), [ICU] (C/C++), and [Normalizer] (Java).

A huge file with test cases for normalization is available as part of Unicode 3.0.1 [NormTest].

4. Stability and Versioning

Defining a normalization form for Internet-wide use requires that this normalization form stay as stable as possible. Stability for Normalization Form C is mainly achieved by introducing a cutoff version. For precomposed characters encoded up to and including this version, in principle the precomposed version is the normal form, but precomposed codepoints introduced after the cutoff version are decomposed in Normalization Form C.

As the cutoff version, Version 3.0 of Unicode and the second edition of ISO/IEC 10646-1 have been chosen. These are aligned codepoint-by-codepoint. They are both widely and integrally available, i.e. they do not require the application of updates or amendments.

The rest of this section discusses potential threats to the stability of Normalization Form C, the probability of such threats, and how to avoid them.
[UniPolicy] documents policies adopted by the Unicode Consortium to
limit the impact of changes on existing implementations.

The analysis below shows that the probability of the various threats
is extremely low. The analysis is provided here to document the
awareness of these threats and the measures that have to be taken to
avoid them. This section is only of marginal importance to an
implementer of Normalization Form C or to an author of an Internet
protocol specification.

4.1 New Precomposed Codepoints

The introduction of new (post-Unicode 3.0) precomposed codepoints is
not a threat to the stability of Normalization Form C. Such codepoints
would just provide an alternate way of encoding characters that can
already be encoded without them, by using a decomposed form. The
normalization algorithm already provides for the exclusion of such
characters from recomposition.

While Normalization Form C itself is not affected, such new codepoints
would affect implementations of Normalization Form C, because such
implementations have to be updated to correctly decompose the new
codepoints.

Note: While a new codepoint may be correctly normalized only by
updated implementations, once it is normalized, neither older nor
updated implementations will change anything anymore.

Because the new codepoints would not actually encode any new
characters that could not be encoded before, because they would not
actually be used due to Early Uniform Normalization, and because of
the above implementation problems, encoding new precomposed characters
is superfluous and should be very clearly avoided.

4.2 New Combining Marks

It is in theory possible that a new combining mark would be encoded
that is intended to represent decomposable pieces of already existing
encoded characters.
Should this indeed happen, problems for Normalization Form C can be
avoided by making sure the precomposed character that now has a
decomposition is not included in the list of recomposition exclusions.
While this helps for Normalization Form C, adding a canonical
decomposition would affect other normalization forms, and it is
therefore highly unlikely that such a canonical decomposition will
ever be added in the first place.

In case new combining marks are encoded for new scripts, or in case a
combining mark is introduced that does not appear in any precomposed
character yet, the appropriate normalization for these characters can
easily be defined by providing the appropriate data. However,
hopefully no new encoding ambiguities will be introduced for new
scripts.

4.3 Changed Codepoints

A major threat to the stability of Normalization Form C would come
from changes to ISO/IEC 10646/Unicode itself, i.e. from moving around
characters or redefining codepoints, or from ISO/IEC 10646 and Unicode
evolving differently in the future. These threats are not specific to
Normalization Form C, but relevant for the use of the UCS in general;
they are mentioned here for completeness.

Because of the very wide and increasing use of the UCS throughout the
world, the resistance to any changes of defined codepoints or to any
divergence between ISO/IEC 10646 and Unicode is extremely strong.
Awareness of the need for stability in this point, as well as in
others, is particularly high due to the experiences with some changes
in the early history of these standards, in particular the reencoding
of some Korean Hangul characters in ISO/IEC 10646 Amendment 5 (and the
corresponding change in Unicode). For the IETF in particular, the
wording in [RFC2279] and [RFC2781] stresses the importance of
stability in this respect.

5. Cases not dealt with by Canonical Equivalence

This section gives a list of cases that are not dealt with by
Canonical Equivalence and Normalization Form C. This is done to help
the reader understand Normalization Form C and its limits. The list in
this section contains many cases of widely varying nature. In many
cases, a viewer, if familiar with the script in question, will be able
to distinguish the various variants.

Internet protocols can deal in various ways with the cases below. One
way is to limit the characters allowed, e.g. in an identifier, so that
all but one of the variants are disallowed. Another way is to assume
that the user can make the distinction him/herself. Yet another is to
understand that some characters or combinations of characters that
would lead to confusion are very difficult to actually enter on any
keyboard; it may therefore not really be worth excluding them
explicitly.

- Various ligatures (Latin, Arabic, e.g. U+FB01 LATIN SMALL LIGATURE
  FI vs. U+0066 LATIN SMALL LETTER F followed by U+0069 LATIN SMALL
  LETTER I)

- Croatian digraphs (e.g. U+01C8 LATIN CAPITAL LETTER L WITH SMALL
  LETTER J vs. U+004C LATIN CAPITAL LETTER L followed by U+006A LATIN
  SMALL LETTER J)

- Full-width Latin compatibility variants (e.g. U+FF21 FULLWIDTH
  LATIN CAPITAL LETTER A vs. U+0041 LATIN CAPITAL LETTER A)

- Half-width Kana and Hangul compatibility variants (e.g. U+FF76
  HALFWIDTH KATAKANA LETTER KA vs. U+30AB KATAKANA LETTER KA)

- Vertical compatibility variants (e.g. U+FE35 PRESENTATION FORM FOR
  VERTICAL LEFT PARENTHESIS vs. U+0028 LEFT PARENTHESIS)

- Superscript/subscript variants (numbers and IPA, e.g. U+00B2
  SUPERSCRIPT TWO)

- Small form compatibility variants (e.g. U+FE6A SMALL PERCENT SIGN)

- Enclosed/encircled alphanumerics, Kana, Hangul,... (e.g. U+2460
  CIRCLED DIGIT ONE)

- Letterlike symbols, Roman numerals,... (e.g. U+210E PLANCK CONSTANT
  vs. U+0068 LATIN SMALL LETTER H)

- Squared Katakana and Latin abbreviations (units,..., e.g. U+334C
  SQUARE MEGATON)

- Hangul jamo representation alternatives for historical Hangul

- Presence or absence of joiner/non-joiner and other control
  characters

- Upper case/lower case distinction

- Distinction between Katakana and Hiragana

- Similar letters from different scripts
  (e.g. "A" in Latin, Greek, and Cyrillic)

- CJK ideograph variants (glyph variants introduced due to the
  source separation rule, simplifications)

- Various punctuation variants (apostrophes, middle dots,
  spaces,...)

- Ignorable whitespace, hyphens,...

- Ignorable accents,...

Many of the cases above are identified as compatibility equivalences
in the Unicode database. [UTR15] defines Normalization Forms KC and
KD to normalize compatibility equivalences. It may look attractive
to just use Normalization Form KC instead of Normalization Form C for
Internet protocols. However, while Canonical Equivalence, which forms
the basis of Normalization Form C, deals with a very small number of
very well defined cases of complete equivalence (from a user point
of view), Compatibility Equivalence comprises a very wide range of
cases that usually have to be examined one at a time. If the domain
of acceptable characters is suitably limited, such as for program
identifiers, then NFKC may be a suitable normalization form.

6. Security Considerations

Security problems can result from:

- Improper implementations of normalization. For example, in
  certificate chaining, if the program validating a certificate chain
  mis-implements normalization rules, an attacker might be able to
  spoof an identity by picking a name that the validator thinks is
  equivalent to another name.

- The fact that normalization maps several input sequences to the
  same output sequence. If a digital signature calculation includes
  normalization, this can make it slightly easier to find a fake
  document that has the same digest as a real one.

- The use of normalization in only part of the applications involved.
  In particular, if software used for security purposes, e.g. to
  create and check digital signatures, normalizes data, but the
  applications actually using the data do not normalize, it can be
  very easy to create a fake document that claims to be the real one
  but produces different behavior.

- Different behavior in programs that do not respect canonical
  equivalence.

Security-related applications therefore MAY check for normalized
input, but MUST NOT actually apply normalization unless it can be
guaranteed that all related applications also apply normalization.

Acknowledgements

The earliest version of this Internet Draft, which dealt with quite
similar issues, was entitled "Normalization of Internationalized
Identifiers" and was submitted in July 1997 by the first author while
he was at the University of Zurich. It benefited from ideas, advice,
criticism and help from: Mark Davis, Larry Masinter, Michael Kung,
Edward Cherlin, Alain LaBonte, Francois Yergeau, and others.

For the current version, the authors were encouraged in particular by
Patrik Faltstrom and Paul Hoffman. The discussion of potential
stability threats is based on contributions by John Cowan and Kenneth
Whistler. Some security threats were pointed out by Masahiro
Sekiguchi. Further contributions are due to Dan Oscarson.

References

[Charlint]     Martin Duerst. Charlint - A Character Normalization
               Tool.

[Charreq]      Martin J. Duerst, Ed. Requirements for String
               Identity Matching and String Indexing. World Wide
               Web Consortium Working Draft.

[Charmod]      Martin J. Duerst and Francois Yergeau, Eds.
               Character Model for the World Wide Web. World Wide
               Web Consortium Working Draft.

[CompExcl]     The Unicode Consortium. Composition Exclusions.

[ICU]          International Components for Unicode.

[ISO10646]     ISO/IEC 10646-1:2000. International Standard --
               Information technology -- Universal multiple-octet
               coded character set (UCS) -- Part 1: Architecture
               and basic multilingual plane, and its Amendments.

[Normalizer]   The Unicode Consortium. Normalization Demo.

[NormTest]     Mark Davis. Unicode Normalization Test Suite.

[RFC2119]      Scott Bradner. Key words for use in RFCs to
               Indicate Requirement Levels, March 1997.

[RFC2277]      Harald Alvestrand. IETF Policy on Character Sets and
               Languages, January 1998.

[RFC2279]      Francois Yergeau. UTF-8, a transformation format of
               ISO 10646, January 1998.

[RFC2781]      Paul Hoffman and Francois Yergeau. UTF-16, an
               encoding of ISO 10646, February 2000.

[Unicode]      The Unicode Consortium. The Unicode Standard,
               Version 3.0. Reading, MA, Addison-Wesley Developers
               Press, 2000. ISBN 0-201-61633-5.

[UniData]      The Unicode Consortium. UnicodeData File.

[UniPolicy]    The Unicode Consortium. Unicode Consortium Policies.

[UTR15]        Mark Davis and Martin Duerst. Unicode Normalization
               Forms. Unicode Technical Report #15, Version 18.0;
               also on the CD of [Unicode].

[windows-1258] Microsoft Windows Codepage: 1258 (Viet Nam).
               <http://www.microsoft.com/globaldev/reference/sbcs/1258.htm>

Copyright

Copyright (C) The Internet Society, 2000. All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Authors' Addresses

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
mailto:duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: Please write "Duerst" with u-umlaut wherever
possible, e.g. as "D&#252;rst" in HTML and XML.

Mark E. Davis
IBM Center for Java Technology
10275 North De Anza Boulevard
Cupertino, CA 95014
U.S.A.
mailto:mark.davis@us.ibm.com
http://www.macchiato.com
Tel: +1 (408) 777-5850
Fax: +1 (408) 777-5891