Network Working Group                                         P. Hoffman
Internet-Draft                                            VPN Consortium
Obsoletes: 3536 (if approved)                                 J. Klensin
Intended status: BCP                                        July 9, 2011
Expires: January 10, 2012


          Terminology Used in Internationalization in the IETF
                    draft-ietf-appsawg-rfc3536bis-06

Abstract

   This document provides a list of terms used in the IETF when
   discussing internationalization. The purpose is to help frame
   discussions of internationalization in the various areas of the IETF
   and to help introduce the main concepts to IETF participants.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 10, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Purpose of this Document
     1.2. Format of the Definitions in this Document
     1.3. Normative Terminology
   2. Fundamental Terms
   3. Standards Bodies and Standards
     3.1. Standards bodies
     3.2. Encodings and Transformation Formats of ISO/IEC 10646
     3.3. Native CCSs and charsets
   4. Character Issues
     4.1. Types of Characters
     4.2. Differentiation of Subsets
   5. User Interface for Text
   6. Text in Current IETF Protocols
   7. Terms Associated with Internationalized Domain Names
     7.1. IDNA Terminology
     7.2. Character Relationships and Variants
   8. Other Common Terms In Internationalization
   9. Security Considerations
   10. IANA Considerations
   11. References
     11.1. Normative References
     11.2. Informative References
   Appendix A. Additional Interesting Reading
   Appendix B. Acknowledgements
   Appendix C. Significant Changes from RFC 3536
   Index
   Authors' Addresses

1. Introduction

   As the IETF Character Set Policy specification [RFC2277] summarizes:
   "Internationalization is for humans. This means that protocols are
   not subject to internationalization; text strings are." Many
   protocols throughout the IETF use text strings that are entered by,
   or are visible to, humans. It should be possible for anyone to enter
   or read these text strings, which means that Internet users must be
   able to enter text using typical input methods and have it displayed
   in any human language. Further, text containing any character should
   be able to be passed between Internet applications easily. This is
   the challenge of internationalization.

1.1. Purpose of this Document

   This document provides a glossary of terms used in the IETF when
   discussing internationalization. The purpose is to help frame
   discussions of internationalization in the various areas of the IETF
   and to help introduce the main concepts to IETF participants.

   Internationalization is discussed in many working groups of the IETF.
   However, few working groups have internationalization experts. When
   designing or updating protocols, the question often comes up "should
   we internationalize this?" (or, more likely, "do we have to
   internationalize this?").

   This document gives an overview of internationalization terminology
   as it applies to IETF standards work by lightly covering the many
   aspects of internationalization and the vocabulary associated with
   those topics. Some of the overview is somewhat tutorial in nature.
   It is not meant to be a complete description of internationalization.
   The definitions here SHOULD be used by IETF standards. IETF
   standards that explicitly want to create different definitions for
   the terms defined here can do so, but unless an alternate definition
   is provided, the definitions of the terms in this document apply.
   IETF standards that have a requirement for different definitions are
   encouraged, for clarity's sake, to find terms different than the ones
   defined here. Some of the definitions in this document come from
   earlier IETF documents and books.

   As in many fields, there is disagreement in the internationalization
   community on definitions for many words. The topic of language
   brings up particularly passionate opinions for experts and
   non-experts alike. This document attempts to define terms in a way
   that will be most useful to the IETF audience.

   This document uses definitions from many documents that have been
   developed inside and outside the IETF. The primary documents used
   are:

   o  ISO/IEC 10646 [ISOIEC10646]

   o  The Unicode Standard [UNICODE]

   o  W3C Character Model [CHARMOD]

   o  IETF RFCs, including the Character Set Policy specification
      [RFC2277] and the domain name internationalization standard
      [RFC5890]

1.2. Format of the Definitions in this Document

   In the body of this document, the source for the definition is shown
   in angle brackets, such as "<ISOIEC10646>". Many definitions are
   shown as "<RFCtbd>", which means that the definitions were crafted
   originally for this document. The angle bracket notation for the
   source of definitions is different than the square bracket notation
   used for references to documents, such as in the paragraph above;
   these references are given in the reference sections of this
   document.

   [[ RFC Editor: please change the "tbd" in "RFCtbd" to be the RFC
   number assigned to this RFC when published. ]]

   For some terms, there are commentary and examples after the
   definitions. In those cases, the part before the angle brackets is
   the definition that comes from the original source, and the part
   after the angle brackets is commentary that is not a definition (such
   as examples or further exposition).
   Examples in this document use the notation for code points and names
   from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646].
   For example, the letter "a" may be represented as either "U+0061" or
   "LATIN SMALL LETTER A". See RFC 5137 [RFC5137] for a description of
   this notation.

1.3. Normative Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2. Fundamental Terms

   This section covers basic topics that are needed for almost anyone
   who is involved with making IETF protocols more friendly to non-ASCII
   text (see Section 4.2) and with other aspects of
   internationalization.

   language

      A language is a way that humans communicate. The use of language
      occurs in many forms, the most common of which are speech,
      writing, and signing.

      Some languages have a close relationship between the written and
      spoken forms, while others have a looser relationship. The
      so-called LTRU (Language Tag Registry Update) standards [RFC5646]
      [RFC4647] discuss languages in more detail and provide identifiers
      for languages for use in Internet protocols. Note that computer
      languages are explicitly excluded from this definition.

   script

      A set of graphic characters used for the written form of one or
      more languages.

      Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han
      (the characters, often called ideographs after a subset of them,
      used in writing Chinese, Japanese, and Korean). RFC 2277
      discusses scripts in detail.

      It is common for internationalization novices to mix up the terms
      "language" and "script". This can be a problem in protocols that
      differentiate the two.
      Almost all protocols that are designed (or were re-designed) to
      handle non-ASCII text deal with scripts (the written systems) or
      characters, while fewer actually deal with languages.

      A single name can mean either a language or a script; for example,
      "Arabic" is both the name of a language and the name of a script.
      In fact, many scripts borrow their names from the names of
      languages. Further, many scripts are used to write more than one
      language; for example, the Russian and Bulgarian languages are
      written in the Cyrillic script. Some languages can be expressed
      using different scripts or were used with different scripts at
      different times; the Mongolian language can be written in either
      the Mongolian or Cyrillic scripts; Malay is primarily written in
      Latin script today but the earlier, Arabic-script-based, Jawi form
      is still in use; and a number of languages were converted from
      other scripts to Cyrillic in the first half of the last century,
      some of which have switched again more recently. Further, some
      languages are normally expressed with more than one script at the
      same time; for example, the Japanese language is normally
      expressed in the Kanji (Han), Katakana, and Hiragana scripts in a
      single string of text.

   writing system

      A set of rules for using one or more scripts to write a particular
      language. Examples include the American English writing system,
      the British English writing system, the French writing system, and
      the Japanese writing system.

   character

      A member of a set of elements used for the organization, control,
      or representation of data.
      There are at least three common definitions of the word
      "character":

      *  a general description of a text entity

      *  a unit of a writing system, often synonymous with "letter" or
         similar terms, but generalized to include digits and symbols of
         various sorts

      *  the encoded entity itself

      When people talk about characters, they usually intend one of the
      first two definitions. The term "character" is often abbreviated
      as "char".

      A particular character is identified by its name, not by its
      shape. A name may suggest a meaning, but the character may be
      used for representing other meanings as well. A name may suggest
      a shape, but that does not imply that only that shape is commonly
      used in print, nor that the particular shape is associated only
      with that name.

   coded character

      A character together with its coded representation.

   coded character set

      A coded character set (CCS) is a set of unambiguous rules that
      establishes a character set and the relationship between the
      characters of the set and their coded representation.

   character encoding form

      A character encoding form is a mapping from a coded character set
      (CCS) to the actual code units used to represent the data.

   repertoire

      The collection of characters included in a character set. Also
      called a character repertoire.

   glyph

      A glyph is an image of a character that can be displayed after
      being imaged onto a display surface.

      The Unicode Standard has a different definition that refers to an
      abstract form that may represent different images when the same
      character is rendered under different circumstances.

   glyph code

      A glyph code is a numeric code that refers to a glyph. Usually,
      the glyphs contained in a font are referenced by their glyph code.
      Glyph codes are local to a particular font; that is, a different
      font containing the same glyphs may use different codes.
   transcoding

      Transcoding is the process of converting text data from one
      character encoding form to another. Transcoders work only at the
      level of character encoding and do not parse the text. Note:
      Transcoding may involve one-to-one, many-to-one, one-to-many or
      many-to-many mappings. Because some legacy mappings are glyphic,
      they may not only be many-to-many, but also unordered: thus XYZ
      may map to yxz.

      In this definition, "many-to-one" means a sequence of characters
      mapped to a single character. The "many" does not mean
      alternative characters that map to the single character.

   character encoding scheme

      A character encoding scheme (CES) is a character encoding form
      plus byte serialization. There are many character encoding
      schemes in Unicode, such as UTF-8 and UTF-16BE.

      Some CESs are associated with a single CCS; for example, UTF-8
      [RFC3629] applies only to the identical CCSs of ISO/IEC 10646 and
      Unicode. Other CESs, such as ISO 2022, are associated with many
      CCSs.

   charset

      A charset is a method of mapping a sequence of octets to a
      sequence of abstract characters. A charset is, in effect, a
      combination of one or more CCSs with a CES. Charset names are
      registered by the IANA according to procedures documented in
      [RFC2978].

      Many protocol definitions use the term "character set" in their
      descriptions. The terms "charset" or "character encoding scheme"
      and "coded character set" are strongly preferred over the term
      "character set" because "character set" has other definitions in
      other contexts, particularly outside the IETF. IETF standards
      that use "character set" without defining the term usually mean
      "a specific combination of one CCS with a CES", particularly when
      they are talking about the "US-ASCII character set".
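
      As an illustration only, the following Python sketch serializes
      the same two abstract characters under three charsets and then
      decodes the resulting octets again. The sample string and the
      variable names are hypothetical, and the codec names are the ones
      used by Python's standard library rather than names taken from
      this document.

```python
# Illustrative sketch: one sequence of abstract characters, three charsets.
# The sample text is an assumption chosen for this example:
# U+0061 LATIN SMALL LETTER A followed by U+00E9 LATIN SMALL LETTER E WITH ACUTE.
text = "a\u00e9"

# Encoding maps abstract characters to a sequence of octets.
utf8_octets = text.encode("utf-8")
utf16be_octets = text.encode("utf-16-be")
latin1_octets = text.encode("iso-8859-1")

print(utf8_octets.hex())     # 61c3a9
print(utf16be_octets.hex())  # 006100e9
print(latin1_octets.hex())   # 61e9

# Decoding with the matching charset recovers the original characters.
assert utf8_octets.decode("utf-8") == text
assert utf16be_octets.decode("utf-16-be") == text
assert latin1_octets.decode("iso-8859-1") == text

# Decoding with the wrong charset silently yields different characters,
# which is why a charset has to be identified rather than guessed.
misread = utf8_octets.decode("iso-8859-1")
print(misread == text)  # False
```

      The last step shows why leaving the charset to guesswork causes
      interoperability problems: the octets are unchanged, but the
      characters recovered from them are not.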
   internationalization

      In the IETF, "internationalization" means to add or improve the
      handling of non-ASCII text in a protocol. A different
      perspective, more appropriate to protocols that are designed for
      global use from the beginning, is the definition used by W3C:

         "Internationalization is the design and development of a
         product, application or document content that enables easy
         localization for target audiences that vary in culture, region,
         or language." [W3C-i18n-Def]

      Many protocols that handle text only handle one charset
      (US-ASCII), or leave the question of what CCS and encoding are
      used up to local guesswork (which leads, of course, to
      interoperability problems). If multiple charsets are permitted
      they must be explicitly identified [RFC2277]. Adding non-ASCII
      text to a protocol allows the protocol to handle more scripts,
      hopefully all of the ones useful in the world. In today's world,
      that is normally best accomplished by allowing Unicode encoded in
      UTF-8 only, thereby shifting conversion issues away from
      individual choices.

   localization

      The process of adapting an internationalized application platform
      or application to a specific cultural environment. In
      localization, the same semantics are preserved while the syntax
      may be changed. [FRAMEWORK]

      Localization is the act of tailoring an application for a
      different language or script or culture. Some internationalized
      applications can handle a wide variety of languages. Typical
      users only understand a small number of languages, so the program
      must be tailored to interact with users in just the languages they
      know.

      The major work of localization is translating the user interface
      and documentation. Localization involves not only changing the
      language interaction, but also other relevant changes such as
      display of numbers, dates, currency, and so on.
      The better internationalized an application is, the easier it is
      to localize it for a particular language and character encoding
      scheme.

      Localization is rarely an IETF matter, and protocols that are
      merely localized, even if they are serially localized for several
      locations, are generally considered unsatisfactory for the global
      Internet.

      Do not confuse "localization" with "locale", which is described in
      Section 8 of this document.

   i18n, l10n

      These are abbreviations for "internationalization" and
      "localization".

      "18" is the number of characters between the "i" and the "n" in
      "internationalization", and "10" is the number of characters
      between the "l" and the "n" in "localization".

   multilingual

      The term "multilingual" has many widely-varying definitions and
      thus is not recommended for use in standards. Some of the
      definitions relate to the ability to handle international
      characters; other definitions relate to the ability to handle
      multiple charsets; and still others relate to the ability to
      handle multiple languages.

   displaying and rendering text

      To display text, a system puts characters on a visual display
      device such as a screen or a printer. To render text, a system
      analyzes the character input to determine how to display the text.
      The terms "display" and "render" are sometimes used
      interchangeably. Note, however, that text might be rendered as
      audio and/or tactile output, such as in systems that have been
      designed for people with visual disabilities.

      Combining characters modify the display of the character (or, in
      some cases, characters) that precede them. When rendering such
      text, the display engine must either find the glyph in the font
      that represents the base character and all of the combining
      characters, or it must render the combination itself.
      Such rendering can be straightforward, but it is sometimes
      complicated when the combining marks interact with each other,
      such as when there are two combining marks that would appear above
      the same character. Formatting characters can also change the way
      that a renderer would display text. Rendering can also be
      difficult for some scripts that have complex display rules for
      base characters, such as Arabic and Indic scripts.

3. Standards Bodies and Standards

   This section describes some of the standards bodies and standards
   that appear in discussions of internationalization in the IETF. This
   is an incomplete and possibly over-full list; listing too few bodies
   or standards can be just as politically dangerous as listing too
   many. Note that there are many other bodies that deal with
   internationalization; however, few if any of them appear commonly in
   IETF standards work.

3.1. Standards bodies

   ISO and ISO/IEC JTC 1

      The International Organization for Standardization has been
      involved with standards for characters since before the IETF was
      started. ISO is a non-governmental group made up of national
      bodies. Most of ISO's work in information technology is performed
      jointly with a similar body, the International Electrotechnical
      Commission (IEC) through a joint committee known as "JTC 1". ISO
      and ISO/IEC JTC 1 have many diverse standards in the international
      characters area; the one that is most used in the IETF is commonly
      referred to as "ISO/IEC 10646", sometimes with a specific date.
      ISO/IEC 10646 describes a CCS that covers almost all known written
      characters in use today.

      ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC
      1/SC 2 WG2", often called "SC2/WG2" or "WG2" for short.
      ISO standards go through many steps before being finished, and
      years often go by between changes to the base ISO/IEC 10646
      standard although amendments are now issued to track Unicode
      changes. Information on WG2, and its work products, can be found
      at . Information on SC2, and its work products, can be found at

      The standard comes as a base part and a series of attachments or
      amendments. It is available in PDF form for downloading or in a
      CD-ROM version. One example of how to cite the standard is given
      in [RFC3629]. Any standard that cites ISO/IEC 10646 needs to
      evaluate how to handle the versioning problem that is relevant to
      the protocol's needs.

      ISO is responsible for other standards that might be of interest
      to protocol developers concerned about internationalization. ISO
      639 [ISO639] specifies the names of languages and forms part of
      the basis for the IETF's Language Tag work [RFC5646]. ISO 3166
      [ISO3166] specifies the names and code abbreviations for countries
      and territories and is used in several protocols and databases
      including names for country-code top level domain names. The
      responsibilities of ISO TC 46 on Information and Documentation
      include a series of standards for transliteration of various
      languages into Latin characters.

      Another relevant ISO group was JTC 1/SC22/WG20, which was
      responsible for internationalization in JTC1, such as for
      international string ordering. Information on WG20, and its work
      products, can be found at . The specific tasks of SC22/WG20 were
      moved from SC22 into SC2 and there has been little significant
      activity since that occurred.

   Unicode Consortium

      The second important group for international character standards
      is the Unicode Consortium. The Unicode Consortium is a trade
      association of companies, governments, and other groups interested
      in promoting the Unicode Standard [UNICODE].
      The Unicode Standard is a CCS whose repertoire and code points are
      identical to ISO/IEC 10646. The Unicode Consortium has added
      features to the base CCS which make it more useful in protocols,
      such as defining attributes for each character. Examples of these
      attributes include case conversion and numeric properties.

      The actual technical and definitional work of the Unicode
      Consortium is done in the Unicode Technical Committee (UTC). The
      terms "UTC" and "Unicode Consortium" are often treated,
      imprecisely, as synonymous in the IETF.

      The Unicode Consortium publishes addenda to the Unicode Standard
      as Unicode Technical Reports. There are many types of technical
      reports at various stages of maturity. The Unicode Standard and
      affiliated technical reports can be found at .

      A reciprocal agreement between the Unicode Consortium and ISO/IEC
      JTC 1/SC 2 provides for ISO/IEC 10646 and The Unicode Standard to
      track each other for definitions of characters and assignments of
      code points. Updates, often in the form of amendments, to the
      former sometimes lag updates to the latter for a short period, but
      the gap has rarely been significant in recent years.

      At the time that the IETF character set policy [RFC2277] was
      established and the first version of this terminology
      specification was published, there was a strong preference in the
      IETF community for references to ISO/IEC 10646 (rather than
      Unicode) when possible. That preference largely reflected a more
      general IETF preference for referencing established open
      international standards in preference to specifications from
      consortia. However, the Unicode definitions of character
      properties and classes are not part of ISO/IEC 10646.
      Because IETF specifications are increasingly dependent on those
      definitions (for example, see the explanation in Section 4.2) and
      the Unicode specifications are freely available online in
      convenient machine-readable form, the IETF's preference has
      shifted to referencing the Unicode Standard. The latter is
      especially important when version consistency between code points
      (either standard) and Unicode properties (Unicode only) is
      required.

   World Wide Web Consortium (W3C)

      This group created and maintains the standard for XML, the markup
      language for text that has become very popular. XML has always
      been fully internationalized so that there is no need for a new
      version to handle international text. However, in some
      circumstances, XML files may be sensitive to differences among
      Unicode versions.

   local and regional standards organizations

      Just as there are many native CCSs and charsets, there are many
      local and regional standards organizations to create and support
      them. Common examples of these are ANSI (United States), CEN/ISSS
      (Europe), JIS (Japan), and SAC (China).

3.2. Encodings and Transformation Formats of ISO/IEC 10646

   Characters in the ISO/IEC 10646 CCS can be expressed in many ways.
   Historically, "encoding forms" are direct addressing methods, while
   "transformation formats" are methods for expressing encoding forms
   as bits on the wire. That distinction has mostly disappeared in
   recent years.

   Documents that discuss characters in the ISO/IEC 10646 CCS often need
   to list specific characters. RFC 5137 describes the common methods
   for doing so in IETF documents, and these practices have been adopted
   by many other communities as well.

   Basic Multilingual Plane (BMP)

      The BMP is composed of the first 2^16 code points in ISO/IEC 10646
      and contains almost all characters in contemporary use. The BMP
      is also called "Plane 0".
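
      A minimal Python sketch of the plane arithmetic follows. It
      assumes that every plane, like the BMP, holds 2^16 code points
      and that the codespace ends at 0x10FFFF; the function names
      plane_of and in_bmp are hypothetical and exist only for this
      illustration.

```python
# Illustrative sketch: which plane holds a given code point?
# Assumption: planes are consecutive blocks of 2**16 code points,
# and the codespace runs from 0 through 0x10FFFF.
MAX_CODE_POINT = 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the number of the plane that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a valid code point: {code_point!r}")
    # Dropping the low 16 bits leaves the plane number.
    return code_point >> 16


def in_bmp(character: str) -> bool:
    """Return True if the single character lies in the BMP (Plane 0)."""
    return plane_of(ord(character)) == 0


print(in_bmp("a"))           # True:  U+0061
print(in_bmp("\u4e2d"))      # True:  U+4E2D, a Han character
print(in_bmp("\U0001d11e"))  # False: U+1D11E lies in Plane 1
```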
   UCS-2 and UCS-4

      UCS-2 and UCS-4 are the two encoding forms historically defined
      for ISO/IEC 10646. UCS-2 addresses only the BMP. Because many
      useful characters (such as many Han characters) have been defined
      outside of the BMP, many people consider UCS-2 to be obsolete.
      UCS-4 addresses the entire range of code points from ISO/IEC 10646
      (by agreement between ISO/IEC JTC1 SC2 and the Unicode Consortium,
      a range from 0..0x10FFFF) as 32-bit values with zero padding to
      the left. UCS-4 is identical to UTF-32BE (without use of a BOM
      (see below)); UTF-32BE is now the preferred term.

   UTF-8

      UTF-8 [RFC3629] is the preferred encoding for IETF protocols.
      Characters in the BMP are encoded as one, two, or three octets.
      Characters outside the BMP are encoded as four octets. Characters
      from the US-ASCII repertoire have the same on-the-wire
      representation in UTF-8 as they do in US-ASCII. The IETF-specific
      definition of UTF-8 in RFC 3629 is identical to that in recent
      versions of the Unicode Standard (e.g., in Section 3.9 of Version
      6.0 [UNICODE]).

   UTF-16, UTF-16BE, and UTF-16LE

      UTF-16, UTF-16BE, and UTF-16LE, three transformation formats
      described in [RFC2781] and defined in The Unicode Standard
      (Sections 3.9 and 16.8 of Version 6.0), are not required by any
      IETF standards, and are thus used much less often in protocols
      than UTF-8. Characters in the BMP are always encoded as two
      octets, and characters outside the BMP are encoded as four octets
      using a "surrogate pair" arrangement. The latter is not part of
      UCS-2, marking the difference between UTF-16 and UCS-2. The three
      UTF-16 formats differ based on the order of the octets and the
      presence or absence of a special lead-in ordering identifier
      called the "byte order mark" or "BOM".
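
      The octet counts and the surrogate pair arrangement described
      above can be sketched in Python as follows. This is an
      illustration under stated assumptions, not a normative algorithm:
      the helper names utf8_length and surrogate_pair are hypothetical,
      the boundary values come from the UTF-8 and UTF-16 definitions in
      RFC 3629 and RFC 2781, and the results are cross-checked against
      the codecs in Python's standard library.

```python
import codecs


def utf8_length(code_point: int) -> int:
    """Return how many octets UTF-8 uses for one code point."""
    if code_point < 0x80:
        return 1  # US-ASCII repertoire
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # rest of the BMP
    return 4      # outside the BMP


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point outside the BMP into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogates")
    offset = code_point - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)           # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # lower 10 bits
    return high, low


# Sample code points chosen for this example: U+0061, U+00E9, U+4E2D, U+1D11E.
for cp in (0x61, 0xE9, 0x4E2D, 0x1D11E):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

high, low = surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E
assert chr(0x1D11E).encode("utf-16-be") == bytes.fromhex("d834dd1e")

# UTF-16BE and UTF-16LE fix the octet order; the byte order mark is the
# lead-in that tells a reader of plain "UTF-16" which order was used.
assert codecs.BOM_UTF16_BE == b"\xfe\xff"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
```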
   UTF-32

      The Unicode Consortium and ISO/IEC JTC 1 have defined UTF-32 as a
      transformation format that incorporates the integer code point
      value right-justified in a 32 bit field. As with UTF-16, the byte
      order mark (BOM) can be used and UTF-32BE and UTF-32LE are
      defined. UTF-32 and UCS-4 are essentially equivalent and the
      terms are often used interchangeably.

   SCSU and BOCU-1

      The Unicode Consortium has defined an encoding, SCSU [UTR6], which
      is designed to offer good compression for typical text. A
      different encoding that is meant to be MIME-friendly, BOCU-1, is
      described in [UTN6]. Although their compression is attractive
      compared with UTF-8, neither of these (at the time of this
      writing) has attracted much interest.

      The compression provided as a side effect of the Punycode
      algorithm [RFC3492] is heavily used in some contexts, especially
      IDNA [RFC5890], but imposes some restrictions (see also
      Section 7).

3.3. Native CCSs and charsets

   Before ISO/IEC 10646 was developed, many countries developed their
   own CCSs and charsets. Some of these were adopted into international
   standards for the relevant scripts or writing systems. Many dozens
   of these are in common use on the Internet today. Examples include
   ISO 8859-5 for Cyrillic and Shift-JIS for Japanese scripts.

   The official list of the registered charset names for use with IETF
   protocols is maintained by IANA and can be found at . The list
   contains preferred names and aliases. Note that this list has
   historically contained many errors, such as names that are in fact
   not charsets or references that do not give enough detail to reliably
   map names to charsets.

   Probably the most well-known native CCS is ASCII [US-ASCII].
   This CCS is used as the basis for keywords and parameter names in
   many IETF protocols, and as the sole CCS in numerous IETF protocols
   that have not yet been internationalized. ASCII became the basis for
   ISO/IEC 646 which, in turn, formed the basis for many national and
   international standards, such as the ISO 8859 series, that mix Basic
   Latin characters with characters from another script.

   It is important to note that, strictly speaking, "ASCII" is a CCS and
   repertoire, not an encoding. The encoding used for ASCII in IETF
   protocols involves the seven-bit integer ASCII code point
   right-justified in an 8-bit field and is sometimes described as the
   "Network Virtual Terminal" or "NVT" encoding [RFC5198]. Less
   formally, "ASCII" and "NVT" are often used interchangeably. However,
   "non-ASCII" refers only to characters outside the ASCII repertoire
   and is not linked to a specific encoding. See Section 4.2.

   A Unicode publication describes issues involved in mapping character
   data between charsets, and an XML format for mapping table data
   [UTR22].

4. Character Issues

   This section contains terms and topics that are commonly used in
   character handling and therefore are of concern to people adding
   non-ASCII text handling to protocols. These topics are standardized
   outside the IETF.

   code point

      A value in the codespace of a repertoire. For all common
      repertoires developed in recent years, code point values are
      integers (code points for ASCII and its immediate descendants were
      defined in terms of column and row positions of a table).

   combining character

      A member of an identified subset of the coded character set of
      ISO/IEC 10646 intended for combination with the preceding
      non-combining graphic character, or with a sequence of combining
      characters preceded by a non-combining character. Combining
      characters are inherently non-spacing.
   composite sequence or combining character sequence

      A sequence of graphic characters consisting of a non-combining
      character followed by one or more combining characters. A graphic
      symbol for a composite sequence generally consists of the
      combination of the graphic symbols of each character in the
      sequence. The Unicode Standard often uses the term "combining
      character sequence" to refer to composite sequences. A composite
      sequence is not a character and therefore is not a member of the
      repertoire of ISO/IEC 10646. However, Unicode now assigns names
      to some such sequences, especially when the names are required to
      match terminology in other standards [UAX34].

      In some CCSs, some characters consist of combinations of other
      characters. For example, the letter "a with acute" might be a
      combination of the two characters "a" and "combining acute", or it
      might be a combination of the three characters "a", a
      non-destructive backspace, and an acute. In the same or other
      CCSs, it might be available as a single code point. The rules for
      combining two or more characters are called "composition rules",
      and the rules for taking apart a character into other characters
      are called "decomposition rules". The result of composition is
      called a "precomposed character"; the result of decomposition is
      called a "decomposed character".

   normalization

      Normalization is the transformation of data to a normal form, for
      example, to unify spelling.

      Note that the phrase "unify spelling" in the definition above does
      not mean unifying different strings with the same meaning as words
      (such as "color" and "colour"). Instead, it means unifying
      different character sequences that are intended to form the same
      composite characters, such as "n" followed by a combining tilde
      and the single character "n with tilde" (where "n" is U+006E, the
      combining tilde is U+0303, and "n with tilde" is U+00F1).
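The two representations of "n with tilde" mentioned above can be compared with a short Python sketch. It uses the standard unicodedata module and NFC, one of the Unicode normalization forms; the helper name equivalent is an assumption made for this illustration, and a protocol would still have to specify where and how normalization is applied.

```python
import unicodedata

# Two representations that are intended to form the same character.
decomposed = "\u006E\u0303"   # "n" followed by a combining tilde
precomposed = "\u00F1"        # "n with tilde" as a single code point

def equivalent(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form.

    The function name and the default form are illustrative choices.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Without normalization, the code point sequences differ.
print("raw comparison:", decomposed == precomposed)

# After normalization, the two strings compare as equal.
print("normalized comparison:", equivalent(decomposed, precomposed))

# Normalization does not unify different spellings of a word.
print("color vs. colour:", equivalent("color", "colour"))
```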
      The purpose of normalization is to allow two strings to be
      compared for equivalence. The string consisting of "n" followed by
      a combining tilde (U+006E U+0303) and the string consisting of
      "n with tilde" (U+00F1) would be shown identically on a text
      display device. If a protocol designer wants those two strings to
      be considered equivalent during comparison, the protocol must
      define where normalization occurs.

      The terms "normalization" and "canonicalization" are often used
      interchangeably. Generally, they both mean to convert a string of
      one or more characters into another string based on standardized
      rules. However, in Unicode, "canonicalization" or similar terms
      are used to refer to a particular type of normalization
      equivalence ("canonical equivalence", in contrast to
      "compatibility equivalence"), so the term should be used with some
      care. Some CCSs allow multiple equivalent representations for a
      written string; normalization selects one among multiple
      equivalent representations as a base for reference purposes in
      comparing strings. In strings of text, these rules are usually
      based on decomposing combined characters or composing characters
      with combining characters. Unicode Standard Annex #15 [UTR15]
      describes the process and many forms of normalization in detail.
      Normalization is important when comparing strings to see if they
      are the same.

      The Unicode NFC and NFD normalizations support canonical
      equivalence; NFKC and NFKD support canonical and compatibility
      equivalence.

   case

      Case is the feature of certain alphabets where the letters have
      two (or occasionally more) distinct forms. These forms, which may
      differ markedly in shape and size, are called the uppercase letter
      (also known as capital or majuscule) and the lowercase letter
      (also known as small or minuscule). Case mapping is the
      association of the uppercase and lowercase forms of a letter.
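To make case mapping concrete, the following Python sketch uses the built-in str.upper, str.lower, and str.casefold methods, which apply Unicode's default, locale-independent rules. The helper name caseless_equal and the sample strings are assumptions made for this illustration; locale-sensitive behavior, such as the Turkish handling of "i", is not covered by these methods.

```python
# Sketch of case mapping and case folding with Python's built-in,
# locale-independent string methods (an illustrative choice of tool).

def caseless_equal(a, b):
    """Compare two strings after case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()

# The common situation: a one-to-one mapping between the two cases.
print("a".upper(), "A".lower())

# German sharp s: the default uppercase mapping expands to two letters.
sharp_s_upper = "\u00DF".upper()
print(sharp_s_upper)

# Greek capital sigma at the end of a word lowercases to the final form.
greek_lower = "\u039F\u0394\u039F\u03A3".lower()
print(greek_lower)

# Case folding for comparison purposes.
print(caseless_equal("STRASSE", "stra\u00DFe"))
```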
      There is usually (but not always) a one-to-one mapping between the
      same letter in the two cases. However, there are many examples of
      characters which exist in one case but for which there is no
      corresponding character in the other case or for which there is a
      special mapping rule, such as the Turkish dotless "i", some Greek
      characters with modifiers, and characters like the German Sharp S
      (Eszett) and Greek Final Sigma that traditionally do not have
      uppercase forms. Case mapping can even be dependent on locale or
      language. Converting text to have only a single case, primarily
      for comparison purposes, is called "case folding". Because of the
      various unusual cases, case mapping can be quite controversial and
      some case folding algorithms even more so. For example, some
      programming languages such as Java have case-folding algorithms
      that are locale-sensitive; this makes those algorithms incredibly
      resource-intensive, and makes them act differently depending on
      the location of the system at the time the algorithm is used.

   sorting and collation

      Collating is the process of ordering units of textual information.
      Collation is usually specific to a particular language or even to
      a particular application or locale. It is sometimes known as
      alphabetizing, although alphabetization is just a special case of
      sorting and collation.

      Collation is concerned with the determination of the relative
      order of any particular pair of strings, and algorithms concerned
      with collation focus on the problem of providing appropriate
      weighted keys for string values, to enable binary comparison of
      the key values to determine the relative ordering of the strings.

      The relative orders of letters in collation sequences can differ
      widely based on the needs of the system or protocol defining the
      collation order.
      For example, even within ASCII characters, there
      are two common and very different collation orders: "A, a, B,
      b,..." and "A, B, C, ..., Z, a, b,...", with additional variations
      for lower case first and digits before and after letters.

      In practice, it is rarely necessary to define a collation sequence
      for characters drawn from different scripts, but arranging such
      sequences so that they do not surprise users is usually
      particularly difficult.

      Sorting is the process of actually putting data records into
      specified orders, according to criteria for comparison between the
      records. Sorting can apply to any kind of data (including textual
      data) for which an ordering criterion can be defined. Algorithms
      concerned with sorting focus on the problem of performance (in
      terms of time, memory, or other resources) in actually putting the
      data records into the desired order.

      A sorting algorithm for string data can be internationalized by
      providing it with the appropriate collation-weighted keys
      corresponding to the strings to be ordered.

      Many processes have a need to order strings in a consistent
      (sorted) sequence. For only a few CCS/CES combinations, there is
      an obvious sort order that can be applied without reference to the
      linguistic meaning of the characters: the code point order is
      sufficient for sorting. That is, the code point order is also the
      order that a person would use in sorting the characters. For many
      CCS/CES combinations, the code point order would make no sense to
      a person and therefore is not useful for sorting if the results
      will be displayed to a person.

      Code point order is usually not how any human educated by a local
      school system expects to see strings ordered; if one orders to the
      expectations of a human, one has a "language-specific" or "human
      language" sort.
      Sorting to code point order will seem
      inconsistent if the strings are not normalized before sorting
      because different representations of the same character will sort
      differently. This problem may be smaller with a language-specific
      sort.

   code table

      A code table is a table showing the characters allocated to the
      octets in a code.

      Code tables are also commonly called "code charts".

4.1. Types of Characters

   The following definitions of types of characters do not clearly
   delineate each character into one type, nor do they allow someone to
   accurately predict what types would apply to a particular character.
   The definitions are intended for application designers to help them
   think about the many (sometimes confusing) properties of text.

   alphabetic

      An informative Unicode property. Characters that are the primary
      units of alphabets and/or syllabaries, whether combining or
      noncombining. This includes composite characters that are
      canonical equivalents to a combining character sequence of an
      alphabetic base character plus one or more combining characters;
      letter digraphs; contextual variants of alphabetic characters;
      ligatures of alphabetic characters; contextual variants of
      ligatures; modifier letters; letterlike symbols that are
      compatibility equivalents of single alphabetic letters; and
      miscellaneous letter elements.

   ideographic

      Any symbol that primarily denotes an idea (or meaning) in contrast
      to a sound (or pronunciation), for example, a symbol showing a
      telephone or the Han characters used in Chinese, Japanese, and
      Korean.

      While Unicode and many other systems use this term to refer to all
      Han characters, strictly speaking not all of those characters are
      actually ideographic. Some are pictographic (such as the
      telephone example above), some are used phonetically, and so on.
      However, the convention is to describe the script as ideographic
      as contrasted to alphabetic.

   digit or number

      All modern writing systems use decimal digits in some form; some
      older ones use non-positional or other systems. Different scripts
      may have their own digits. Unicode distinguishes between numbers
      and other kinds of characters by assigning a special General
      Category value to them and subdividing that value to distinguish
      between decimal digits, letter digits, and other digits.

   punctuation

      Characters that separate units of text, such as sentences and
      phrases, thus clarifying the meaning of the text. The use of
      punctuation marks is not limited to prose; they are also used in
      mathematical and scientific formulae, for example.

   symbol

      One of a set of characters other than those used for letters,
      digits, or punctuation, and representing various concepts
      generally not connected to written language use per se.

      Examples of symbols include characters for mathematical operators,
      symbols for OCR, symbols for box-drawing or graphics, as well as
      symbols for dingbats, arrows, faces, and geometric shapes.
      Unicode has a property that identifies symbol characters.

   nonspacing character

      A combining character whose positioning in presentation is
      dependent on its base character. It generally does not consume
      space along the visual baseline in and of itself.

      A combining acute accent (U+0301) is an example of a nonspacing
      character.

   diacritic

      A mark applied or attached to a symbol to create a new symbol that
      represents a modified or new value. Diacritics can also be marks
      applied to a symbol irrespective of whether they change the value
      of that symbol. In the latter case, the diacritic usually
      represents an independent value (for example, an accent, tone, or
      some other linguistic information). Also called diacritical mark
      or diacritical.
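The General Category values behind several of the character types above can be inspected with Python's standard unicodedata module, as in the sketch below. The sample characters are chosen here for illustration, and the results reflect the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

# Sample characters chosen for illustration (not taken from the source).
samples = [
    "\u0037",  # DIGIT SEVEN
    "\u0663",  # ARABIC-INDIC DIGIT THREE
    "\u2163",  # ROMAN NUMERAL FOUR
    "\u00BD",  # VULGAR FRACTION ONE HALF
    "\u0021",  # EXCLAMATION MARK
    "\u002B",  # PLUS SIGN
    "\u0301",  # COMBINING ACUTE ACCENT
]

def describe(ch):
    """Return (code point, character name, General Category) for ch."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch)

for ch in samples:
    print(*describe(ch))

# Decimal digits from different scripts carry a numeric value.
arabic_indic_three = unicodedata.decimal("\u0663")

# A nonzero combining class marks a combining (here nonspacing) character.
acute_class = unicodedata.combining("\u0301")
print(arabic_indic_three, acute_class)
```

In the two-letter category codes, "Nd", "Nl", and "No" correspond to the decimal, letter, and other number subdivisions; codes starting with "P" are punctuation, codes starting with "S" are symbols, and "Mn" is a nonspacing mark.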
   control character

      The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.
      The basic space character, U+0020, is often considered a
      control character as well, making the total number 66. They are
      also known as control codes. In terminology adopted by Unicode
      from ASCII and the ISO 8859 standards, these codes are treated as
      belonging to three ranges: "C0" (for U+0000..U+001F), "C1" (for
      U+0080..U+009F), and the single control character "DEL" (U+007F).

      Occasionally, in other vocabularies, the term "control character"
      is used to describe any character that does not have an associated
      glyph or to describe device control sequences [ISO6429]. Neither
      of those usages is appropriate to internationalization terminology
      in the IETF.

   formatting character

      Characters that are inherently invisible but that have an effect
      on the surrounding characters.

      Examples of formatting characters include characters for
      specifying the direction of text and characters that specify how
      to join multiple characters.

   compatibility character or compatibility variant

      A graphic character included as a coded character of ISO/IEC 10646
      primarily for compatibility with existing coded character sets.

      The Unicode definition of compatibility character also includes
      characters that have been incorporated for other reasons. Their
      list includes several separate groups of characters included for
      compatibility purposes: halfwidth and fullwidth characters used
      with East Asian scripts, Arabic contextual forms (e.g., initial or
      final forms), some ligatures, deprecated formatting characters,
      variant forms of characters (or even copies of them) for
      particular uses (e.g., phonetic or mathematical applications),
      font variations, CJK compatibility ideographs, and so on.
      For additional information and the separate term "compatibility
      decomposable character", see the Unicode standard.

      For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for
      compatibility with Asian charsets that include full-width and
      half-width ASCII characters.

      Some efforts in the IETF have concluded that it would be useful to
      support mapping of some groups of compatibility equivalents and
      not others (e.g., supporting or mapping width variations while
      preserving or rejecting mathematical variations). See the IDNA
      Mapping document [RFC5895] for one example.

4.2. Differentiation of Subsets

   Especially as existing IETF standards are internationalized, it is
   necessary to describe collections of characters, in particular
   various subsets of Unicode. Because Unicode includes ways to code
   substantially all characters in contemporary use, subsets of the
   Unicode repertoire can be a useful tool for defining these
   collections as repertoires independent of specific Unicode coding.

   However specific collections are defined, it is important to remember
   that, while older CCSs such as ASCII and the ISO 8859 family are
   close-ended and fixed, Unicode is open-ended, with new character
   definitions, and often new scripts, being added every year or so.
   So, while, e.g., an ASCII subset, such as "upper case letters", can
   be specified as a range of code points (4/1 to 5/10 for that
   example), similar definitions for Unicode either have to be specified
   in terms of Unicode properties or are very dependent on Unicode
   versions (and the relevant version must be identified in any
   specification). See the IDNA code point specification [RFC5892] for
   an example of specification by combinations of properties.

   Some terms are commonly used in the IETF to define character ranges
   and subsets.
   Some of these are imprecise and can cause confusion if
   not used carefully.

   non-ASCII  The term "non-ASCII" strictly refers to characters other
      than those that appear in the ASCII repertoire, independent of the
      CCS or encoding used for them. In practice, if a repertoire such
      as that of Unicode is established as context, "non-ASCII" refers
      to characters in that repertoire that do not appear in the ASCII
      repertoire. "Outside the ASCII repertoire" and "outside the ASCII
      range" are practical, and more precise, synonyms for "non-ASCII".

   letters  The term "letters" does not have an exact equivalent in the
      Unicode standard. Letters are generally characters that are used
      to write words, but that means very different things in different
      languages and cultures.

5. User Interface for Text

   Although the IETF does not standardize user interfaces, many
   protocols make assumptions about how a user will enter or see text
   that is used in the protocol. Internationalization challenges
   assumptions about the type and limitations of the input and output
   devices that may be used with applications that use various
   protocols. It is therefore useful to consider how users typically
   interact with text that might contain one or more non-ASCII
   characters.

   input methods

      An input method is a mechanism for a person to enter text into an
      application.

      Text can be entered into a computer in many ways. Keyboards are
      by far the most common device used, but many characters cannot be
      entered on typical computer keyboards in a single stroke. Many
      operating systems come with system software that lets users input
      characters outside the range of what is allowed by keyboards.

      For example, there are dozens of different input methods for Han
      characters in Chinese, Japanese, and Korean.
      Some start with
      phonetic input through the keyboard, while others use the number
      of strokes in the character. Input methods are also needed for
      scripts that have many diacritics, such as European or Vietnamese
      characters that have two or three diacritics on a single
      alphabetic character.

      The term "input method editor" (IME) is often used generically to
      describe the tools and software used to deal with input of
      characters on a particular system.

   rendering rules

      A rendering rule is an algorithm that a system uses to decide how
      to display a string of text.

      Some scripts can be directly displayed with fonts, where each
      character from an input stream can simply be copied from a glyph
      system and put on the screen or printed page. Other scripts need
      rules that are based on the context of the characters in order to
      render text for display.

      Some examples of these rendering rules include:

      *  Scripts such as Arabic (and many others), where the form of the
         letter changes depending on the adjacent letters, whether the
         letter is standing alone, at the beginning of a word, in the
         middle of a word, or at the end of a word. The rendering rules
         must choose between two or more glyphs.

      *  Scripts such as the Indic scripts, where consonants may change
         their form if they are adjacent to certain other consonants or
         may be displayed in an order different from the way they are
         stored and pronounced. The rendering rules must choose between
         two or more glyphs.

      *  Arabic and Hebrew scripts, where the order of the characters
         displayed is changed by the bidirectional properties of the
         alphabetic and other characters and by right-to-left and
         left-to-right ordering marks. The rendering rules must choose
         the order in which characters are displayed.
      *  Some writing systems cannot have their rendering rules suitably
         defined using mechanisms that are now defined in the Unicode
         Standard. None of those languages are in active non-scholarly
         use today.

      *  Many systems use a special rendering rule when they lack a font
         or other mechanism for rendering a particular character
         correctly. That rule typically involves substitution of a
         small open box or a question mark for the missing character.
         See "undisplayable character" below.

   graphic symbol

      A graphic symbol is the visual representation of a graphic
      character or of a composite sequence.

   font

      A font is a collection of glyphs used for the visual depiction of
      character data. A font is often associated with a set of
      parameters (for example, size, posture, weight, and serifness),
      which, when set to particular values, generate a collection of
      imageable glyphs.

      The term "font" is often used interchangeably with "typeface". As
      historically used in typography, a typeface is a family of one or
      more fonts that share a common general design. For example,
      "Times Roman" is actually a typeface, with a collection of fonts
      such as "Times Roman Bold", "Times Roman Medium", "Times Roman
      Italic", and so on. Some sources even consider different type
      sizes within a typeface to be different fonts. While those
      distinctions are rarely important for internationalization
      purposes, there are exceptions. Those writing specifications
      should be very careful about definitions in cases in which the
      exceptions might lead to ambiguity.

   bidirectional display

      The process or result of mixing left-to-right oriented text and
      right-to-left oriented text in a single line is called
      bidirectional display, often abbreviated as "bidi".

      Most of the world's written languages are displayed left-to-right.
      However, many widely-used written languages such as ones based on
      the Hebrew or Arabic scripts are displayed primarily right-to-left
      (numerals are a common exception in the modern scripts).
      Right-to-left text often confuses protocol writers because they
      have to keep thinking in terms of the order of characters in a
      string in memory, an order that might be different from what they
      see on the screen. (Note that some languages are written both
      horizontally and vertically and that some historical ones use
      other display orderings.)

      Further, bidirectional text can cause confusion because there are
      formatting characters in ISO/IEC 10646 that cause the order of
      display of text to change. These explicit formatting characters
      change the display regardless of the implicit left-to-right or
      right-to-left properties of characters. Text that might contain
      those characters typically requires careful processing before
      being sorted or compared for equality.

      It is common to see strings with text in both directions, such as
      strings that include both text and numbers, or strings that
      contain a mixture of scripts.

      Unicode has a long and incredibly detailed algorithm for
      displaying bidirectional text [UAX9].

   undisplayable character

      A character that has no displayable form.

      For instance, the zero-width space (U+200B) cannot be displayed
      because it takes up no horizontal space. Formatting characters
      such as those for setting the direction of text are also
      undisplayable. Note, however, that every character in [UNICODE]
      has a glyph associated with it, and that the glyphs for
      undisplayable characters are enclosed in a dashed square as an
      indication that the actual character is undisplayable.

      The property of a character that causes it to be undisplayable is
      intrinsic to its definition.
      Undisplayable characters can never
      be displayed in normal text (the dashed square notation is used
      only in special circumstances). Printable characters whose
      Unicode definitions are associated with glyphs that cannot be
      rendered on a particular system are not, in this sense,
      undisplayable.

   writing style

      Conventions of writing the same script in different styles.

      Different communities using the script may find text in different
      writing styles difficult to read and possibly unintelligible. For
      example, the Perso-Arabic Nastalique writing style and the Arabic
      Naskh writing style both use the Arabic script but have very
      different renderings and are not mutually comprehensible. Writing
      styles may have significant impact on internationalization; for
      example, the Nastalique writing style requires significantly more
      line height than Naskh writing style.

6. Text in Current IETF Protocols

   Many IETF protocols started off being fully internationalized, while
   others have been internationalized as they were revised. In this
   process, IETF members have seen patterns in the way that many
   protocols use text. This section describes some specific protocol
   interactions with text.

   protocol elements

      Protocol elements are uniquely-named parts of a protocol.

      Almost every protocol has named elements, such as "source port" in
      TCP. In some protocols, the names of the elements (or text tokens
      for the names) are transmitted within the protocol. For example,
      in SMTP and numerous other IETF protocols, the names of the verbs
      are part of the command stream. The names are thus part of the
      protocol standard. The names of protocol elements are not
      normally seen by end users and it is rarely appropriate to
      internationalize protocol element names (even while the elements
      themselves can be internationalized).
   name spaces

      A name space is the set of valid names for a particular item, or
      the syntactic rules for generating these valid names.

      Many items in Internet protocols use names to identify specific
      instances or values. The names may be generated (by some
      prescribed rules), registered centrally (e.g., with IANA), or have
      a distributed registration and control mechanism, such as the
      names in the DNS.

   on-the-wire encoding

      The encoding and decoding used before and after transmission over
      the network is often called the "on-the-wire" (or sometimes just
      "wire") format.

      Characters are identified by code points. Before being
      transmitted in a protocol, they must first be encoded as bits and
      octets. Similarly, when characters are received in a
      transmission, they have been encoded, and a protocol that needs to
      process the individual characters needs to decode them before
      processing.

   parsed text

      Text strings that are analyzed for subparts.

      In some protocols, free text in text fields might be parsed. For
      example, many mail user agents (MUAs) will parse the words in the
      text of the Subject: field to attempt to thread based on what
      appears after the "Re:" prefix.

      Such conventions are very sensitive to localization. If, for
      example, a form like "Re:" is altered by an MUA to reflect the
      language of the sender or recipient, a system that subsequently
      does threading may not recognize the replacement term as a
      delimiter string.

   charset identification

      Specification of the charset used for a string of text.

      Protocols that allow more than one charset to be used in the same
      place should require that the text be identified with the
      appropriate charset. Without this identification, a program
      looking at the text cannot definitively discern the charset of the
      text.
      Charset identification is also called "charset tagging".

   language identification

      Specification of the human language used for a string of text.

      Some protocols (such as MIME and HTTP) allow text that is meant
      for machine processing to be identified with the language used in
      the text. Such identification is important for machine processing
      of the text, such as by systems that render the text by speaking
      it. Language identification is also called "language tagging".
      The IETF "LTRU" standards [RFC5646] and [RFC4647] provide a
      comprehensive model for language identification.

   MIME

      MIME (Multipurpose Internet Mail Extensions) is a message format
      that allows for textual message bodies and headers in character
      sets other than US-ASCII in formats that require ASCII (most
      notably RFC 5322, the standard for Internet mail headers
      [RFC5322]). MIME is described in RFCs 2045 through 2049, as well
      as more recent RFCs.

   transfer encoding syntax

      A transfer encoding syntax (TES) (sometimes called a transfer
      encoding scheme) is a reversible transform of already-encoded data
      that is represented in one or more character encoding schemes.

      TESs are useful for encoding types of character data into
      another format, usually for allowing new types of data to be
      transmitted over legacy protocols. The main examples of TESs used
      in the IETF include Base64 and quoted-printable. MIME identifies
      the transfer encoding syntax for body parts as a
      Content-transfer-encoding, occasionally abbreviated C-T-E.

   Base64

      Base64 is a transfer encoding syntax that allows binary data to be
      represented by the ASCII characters A through Z, a through z, 0
      through 9, +, /, and =. It is defined in [RFC2045].
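The layering described above, in which text is first encoded with a character encoding and the resulting octets are then passed through a transfer encoding syntax, can be sketched in Python with the standard base64 module. The sample string and the choice of UTF-8 as the character encoding are assumptions made for this illustration.

```python
import base64
import string

# Step 1: character encoding. Text becomes octets (UTF-8 assumed here).
text = "caf\u00E9"
octets = text.encode("utf-8")

# Step 2: transfer encoding. Octets become ASCII-only Base64 data.
transfer_encoded = base64.b64encode(octets)
print(transfer_encoded.decode("ascii"))

# The transform is reversible: undo Base64, then decode the charset.
restored = base64.b64decode(transfer_encoded).decode("utf-8")
print(restored == text)

# Only the characters listed in the Base64 entry appear in the output.
alphabet = set(string.ascii_letters + string.digits + "+/=")
uses_only_alphabet = set(transfer_encoded.decode("ascii")) <= alphabet
print(uses_only_alphabet)
```

A receiver needs both pieces of information to recover the text: the transfer encoding (Base64) and the charset of the underlying octets.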
   quoted printable

      Quoted printable is a transfer encoding syntax that allows strings
      that have non-ASCII characters mixed in with mostly ASCII
      printable characters to be somewhat human readable. It is
      described in [RFC2047].

      The quoted printable syntax is generally considered to be a
      failure at being readable. It is jokingly referred to as "quoted
      unreadable".

   XML

      XML (which is an approximate abbreviation for Extensible Markup
      Language) is a popular method for structuring text. XML text that
      is not encoded as UTF-8 is explicitly tagged with charsets, and
      all text in XML consists only of Unicode characters. The
      specification for XML is published by the W3C.

   ASN.1 text formats

      The ASN.1 data description language has many formats for text
      data. The formats allow for different repertoires and different
      encodings. Some of the formats that appear in IETF standards
      based on ASN.1 include IA5String (all ASCII characters),
      PrintableString (most ASCII characters, but missing many
      punctuation characters), BMPString (characters from ISO/IEC 10646
      plane 0 in UTF-16BE format), UTF8String (just as the name
      implies), and TeletexString (also called T61String).

   ASCII-compatible encoding (ACE)

      Starting in 1996, many ASCII-compatible encoding schemes (which
      are actually transfer encoding syntaxes) have been proposed as
      possible solutions for internationalizing host names and some
      other purposes. Their goal is to be able to encode any string of
      ISO/IEC 10646 characters using the preferred syntax for domain
      names (as described in STD 13). At the time of this writing, only
      the ACE encoding produced by Punycode [RFC3492] has become an IETF
      standard.

      The choice of ACE forms to internationalize legacy protocols must
      be made with care as it can cause some difficult side effects
      [RFC6055].
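As a small illustration of an ASCII-compatible encoding, the sketch below converts one label with Python's built-in "punycode" codec and then adds the "xn--" ACE prefix used by IDNA. The sample label is an assumption made for this example. Python's built-in "idna" codec is shown only for comparison: it implements the older IDNA2003 rules rather than IDNA2008, so it should not be read as a reference implementation of the current specification.

```python
# Sketch: producing an ASCII-compatible encoding (ACE) for one label.
# The sample label is an illustrative assumption.
label = "b\u00FCcher"

# The Punycode algorithm maps the Unicode label to an ASCII-only string.
punycode_form = label.encode("punycode")
print(punycode_form)

# IDNA marks such labels with the ACE prefix "xn--".
ace_label = b"xn--" + punycode_form
print(ace_label.decode("ascii"))

# The encoding is reversible.
round_trip = punycode_form.decode("punycode")
print(round_trip == label)

# Python's "idna" codec (IDNA2003 rules) gives the same result here.
idna2003_form = label.encode("idna")
print(idna2003_form == ace_label)
```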
1360 LDH label 1362 The classical label form used in the DNS and most applications 1363 that call on it, albeit with some additional restrictions, 1364 reflects the early syntax of "hostnames" [RFC0952] and limits 1365 those names to ASCII letters, digits, and embedded hyphens. The 1366 hostname syntax is identical to that described as the "preferred 1367 name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by 1368 RFC 1123 [RFC1123]. LDH labels are defined in a more restrictive 1369 and precise way for internationalization contexts as part of the 1370 IDNA2008 specification [RFC5890]. 1372 7. Terms Associated with Internationalized Domain Names 1374 7.1. IDNA Terminology 1376 The current specification for Internationalized Domain Names (IDNs), 1377 known formally as Internationalized Domain Names for Applications or 1378 IDNA, is referred to in the IETF and parts of the broader community 1379 as "IDNA2008" and consists of several documents. Section 2.3 of the 1380 first of those documents, commonly known as "IDNA2008 Definitions" 1381 [RFC5890] provides definitions and introduces some specialized terms 1382 for differentiating among types of DNS labels in an IDN context. 1383 Those terms are listed in the table below; see RFC 5890 for the 1384 specific definitions if needed. 1386 ACE Prefix 1387 A-label 1388 Domain Name Slot 1389 IDNA-valid string 1390 Internationalized Domain Name (IDN) 1391 Internationalized Label 1392 LDH Label 1393 NR-LDH label 1394 U-label 1396 Two additional terms entered the IETF's vocabulary as part of the 1397 earlier IDN effort [RFC3490] (IDNA2003): 1399 Stringprep 1401 Stringprep [RFC3454] provides a model and character tables for 1402 preparing and handling internationalized strings. It was used 1403 in the original IDN specification (IDNA2003) via a profile 1404 called "Nameprep" [RFC3491]. It is no longer in use in IDNA, 1405 but continues to be used in profiles by a number of other 1406 protocols. 
1408 Punycode 1410 This is the name of the algorithm [RFC3492] used to convert 1411 otherwise-valid IDN labels from native-character strings 1412 expressed in Unicode to an ASCII-compatible encoding (ACE). 1413 Strictly speaking, the term applies to the algorithm only. In 1414 practice, it is widely, if erroneously, used to refer to 1415 strings that the algorithm encodes. 1417 7.2. Character Relationships and Variants 1419 The term "variant" was introduced into the IETF i18n vocabulary with 1420 the JET recommendations [RFC3743]. As used there, it referred 1421 strictly to the relationship between Traditional Chinese characters 1422 and their Simplified equivalents. The JET recommendations provided a 1423 model for identifying these pairs of characters and labels that used 1424 them. Specific recommendations for variant handling for the Chinese 1425 language were provided in a follow-up document [RFC4713]. 1427 In more recent years, the term has also been used to describe other 1428 collections of characters or strings that might be perceived as 1429 equivalent. Those collections have involved one or more of several 1430 categories of characters and labels containing them including: 1432 o "visually similar" or "visually confusable" characters. These may 1433 be limited to characters in different scripts, characters in a 1434 single script, or both, and may be those that can appear to be 1435 alike even when high-distinguishability reference fonts are used 1436 or under various circumstances that may involve malicious choices 1437 of typefaces or other ways to trick user perception. Trivial 1438 examples include ASCII "l" and "1" and Latin and Cyrillic "a". 1440 o Characters assigned more than one Unicode code point because of 1441 some special property. These characters may be considered "the 1442 same" for some purposes and different for others (or by other 1443 users).
One of the most commonly-cited examples is the Arabic 1444 YEH, which is encoded more than once because some of its shapes 1445 are different across different languages. Another example is the 1446 Greek lower case sigma and final sigma: if the latter were viewed 1447 purely as a positional presentation variation on the former, it 1448 should not have been assigned a separate code point. 1450 o Numerals and labels including them. Unlike letters, the "meaning" 1451 of decimal digits is clear and unambiguous regardless of the 1452 script with which they are associated. Some scripts are routinely 1453 used almost interchangeably with European digits and digits native 1454 to that script. Arabic script has two sets of digits (U+0660..U+0669 1455 and U+06F0..U+06F9), written identically for zero through 1456 three and seven through nine but differently for four through six; 1457 European digits predominate in other areas. Substitution of 1458 digits with the same numeric value in labels may give rise to 1459 another type of variant. 1461 o Orthographic differences within a language. Many languages have 1462 alternate choices of spellings or spellings that differ by locale. 1463 Users of those languages generally recognize the spellings as 1464 equivalent, at least as much so as the variations described above. 1465 Examples include "color" and "colour" in English, German words 1466 spelled with o-umlaut or "oe", and so on. Some of these 1467 differences may also create other types of language-specific 1468 perceived differences that do not exist for other languages using the same 1469 script.
For example, in Arabic language usage at the end of 1470 words, ARABIC LETTER TEH MARBUTA (U+0629) and ARABIC LETTER HEH 1471 (U+0647) are differently-shaped (one has two dots on top of it) but 1472 they are used interchangeably in writing: they "sound" similar 1473 when pronounced at the end of a phrase, and hence the LETTER TEH 1474 MARBUTA sometimes is written as LETTER HEH and the two are 1475 considered "confusable" in that context. 1477 The term "variant" as used in this section should also not be 1478 confused with other uses of the term in this document or in Unicode 1479 terminology (e.g., those in Section 4.1 above). If the term is to be 1480 used at all, context should clearly distinguish among these different 1481 uses and, in particular, between variant characters and variant 1482 labels. Local text should identify which meaning, or combination of 1483 meanings, is intended. 1485 8. Other Common Terms In Internationalization 1487 This is a hodge-podge of other terms that have appeared in 1488 internationalization discussions in the IETF. 1490 locale 1492 Locale is the user-specific location and cultural information 1493 managed by a computer. 1495 Because languages and orthographic conventions differ from country 1496 to country (and even region to region within a country), the 1497 locale of the user can often be an important factor. Typically, 1498 the locale information for a user includes the language(s) used. 1500 Locale issues go beyond character use, and can include things such 1501 as the display format for currency, dates, and times. Some 1502 locales (especially the popular "C" and "POSIX" locales) do not 1503 include language information. 1505 It should be noted that there are many thorny, unsolved issues 1506 with locale.
For example, should text be viewed using the locale 1507 information of the person who wrote the text, information that 1508 would apply to the location of the system storing or providing the 1509 text, or the person viewing it? What if the person viewing it is 1510 traveling to different locations? Should only some of the locale 1511 information affect creation and editing of text? 1513 Latin characters 1515 "Latin characters" is an imprecise term for characters 1516 historically related to ancient Greek script as modified in the 1517 Roman Republic and Empire and currently used throughout the world. 1518 1520 The base Latin characters are a subset of the ASCII repertoire and 1521 have been augmented by many single and multiple diacritics and 1522 quite a few other characters. ISO/IEC 10646 encodes the Latin 1523 characters in ranges including U+0020..U+024F and U+1E00..U+1EFF. 1525 Because "Latin characters" is used in different contexts to refer 1526 to the letters from the ASCII repertoire, the subset of those 1527 characters used late in the Roman Republic period or the different 1528 subset used to write Latin in medieval times, the entire ASCII 1529 repertoire, all of the code points in the extended Latin script as 1530 defined by Unicode, and other collections, the term should be 1531 avoided in IETF specifications when possible. Similarly, "Basic 1532 Latin" should not be used as a synonym for "ASCII". 1534 romanization 1536 The transliteration of a non-Latin script into Latin characters. 1537 1539 Because of the widespread use of Latin characters, people have 1540 tried to represent many languages that are not based on a Latin 1541 repertoire in Latin characters. For example, there are two 1542 popular romanizations of Chinese: Wade-Giles and Pinyin, the 1543 latter of which is by far more common today. Many romanization 1544 systems are inexact and do not give perfect round trip mappings 1545 between the native script and the Latin characters.
1547 CJK characters and Han characters 1549 The ideographic characters used in Chinese, Japanese, Korean, and 1550 traditional Vietnamese writing systems are often called "CJK 1551 characters" after the initial letters of the language names in 1552 English. They are also called "Han characters", after the term in 1553 Chinese that is often used for these characters. 1555 Note that Han characters do not include the phonetic characters 1556 used in the Japanese and Korean languages. Users of the term "CJK 1557 characters" may or may not assume those additional characters are 1558 included. 1560 In ISO/IEC 10646, the Han characters were "unified", meaning that 1561 each set of Han characters from Japanese, Chinese, and/or Korean 1562 that had the same origin was assigned a single code point. The 1563 positive result of this was that many fewer code points were 1564 needed to represent Han; the negative result of this was that 1565 characters that people who write the three languages think are 1566 different have the same code point. There is a great deal of 1567 disagreement on the nature, the origin, and the severity of the 1568 problems caused by Han unification. 1570 translation 1572 The process of conveying the meaning of some passage of text in 1573 one language, so that it can be expressed equivalently in another 1574 language. 1576 Many language translation systems are inexact and cannot be 1577 applied repeatedly to go from one language to another to another. 1579 transliteration 1581 The process of representing the characters of an alphabetical or 1582 syllabic system of writing by the characters of a conversion 1583 alphabet. 1585 Many script transliterations are exact, and many have perfect 1586 round-trip mappings. The notable exception to this is 1587 romanization, described above. Transliteration involves 1588 converting text expressed in one script into another script, 1589 generally on a letter-by-letter basis.
There are many official 1590 and unofficial transliteration standards, most notably those from 1591 ISO TC 46 and the U.S. Library of Congress. 1593 transcription 1595 The process of systematically writing the sounds of some passage 1596 of spoken language, generally with the use of a technical phonetic 1597 alphabet (usually Latin-based) or other systematic transcriptional 1598 orthography. Transcription also sometimes refers to the 1599 conversion of written text into a transcribed form, based on the 1600 sound of the text as if it had been spoken. 1602 Unlike transliterations, which are generally designed to be round- 1603 trip convertible, transcriptions of written material are almost 1604 never round-trip convertible to their original form, at least 1605 without some supplemental information. 1607 regular expressions 1609 Regular expressions provide a mechanism to select specific strings 1610 from a set of character strings. Regular expressions are a 1611 language used to search for text within strings, and possibly 1612 modify the text found with other text. 1614 Pattern matching for text involves being able to represent one or 1615 more code points in an abstract notation, such as searching for 1616 all capital Latin letters or all punctuation. The most common 1617 mechanism in IETF protocols for naming such patterns is the use of 1618 regular expressions. There is no single regular expression 1619 language, but there are numerous very similar dialects that are 1620 not quite consistent with each other. 1622 The Unicode Consortium has a good discussion about how to adapt 1623 regular expression engines to use Unicode. [UTR18] 1625 private use character 1627 ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to 1628 U+FFFFD, and U+100000 to U+10FFFD are available for private use. 
1629 This refers to code points of the standard whose interpretation is 1630 not specified by the standard and whose use may be determined by 1631 private agreement among cooperating users. 1633 The use of these "private use" characters is defined by the 1634 parties who transmit and receive them, and is thus not appropriate 1635 for standardization. (The IETF has a long history of private use 1636 names for things such as "x-" names in MIME types, charsets, and 1637 languages. Most of the experience with these has been quite 1638 negative, with many implementors assuming that private use names 1639 are in fact public and long-lived.) 1641 9. Security Considerations 1643 Security is not discussed directly in this document. While the 1644 definitions here have no direct effect on security, they are used in 1645 many security contexts. For example, authentication usually involves 1646 comparing two tokens, and one or both of those tokens might be text; 1647 thus, some methods of comparison might involve using some of the 1648 internationalization concepts for which terms are defined in this 1649 document. 1651 Having said that, other RFCs dealing with internationalization have 1652 security consideration descriptions that may be useful to the reader 1653 of this document. In particular, the security considerations in RFC 1654 3454, RFC 3629, RFC 4013, and RFC 5890 go into a fair amount of 1655 detail. 1657 10. IANA Considerations 1659 [RFC Editor: Please remove this section before publication.] 1661 This document contains definitions and discussion only -- there are 1662 no actions for IANA. 1664 11. References 1666 11.1. Normative References 1668 [ISOIEC10646] 1669 ISO/IEC, "ISO/IEC 10646:2011. International Standard -- 1670 Information technology - Universal Multiple-Octet Coded 1671 Character Set (UCS)", 2011.
1673 [RFC2047] Moore, K., "MIME (Multipurpose Internet Mail Extensions) 1674 Part Three: Message Header Extensions for Non-ASCII Text", 1675 RFC 2047, November 1996. 1677 [UNICODE] The Unicode Consortium, "The Unicode Standard, Version 1678 6.0", Mountain View, CA: The Unicode Consortium, 1679 ISBN 978-1-936213-01-6, 2011, 1680 . 1682 11.2. Informative References 1684 [CHARMOD] W3C, "Character Model for the World Wide Web 1.0", 2005, 1685 . 1687 [FRAMEWORK] 1688 ISO/IEC, "ISO/IEC TR 11017:1997(E). Information technology 1689 - Framework for internationalization, prepared by ISO/IEC 1690 JTC 1/SC 22/WG 20", 1997. 1692 [ISO3166] ISO, "ISO 3166-1:2006 - Codes for the representation of 1693 names of countries and their subdivisions -- Part 1: 1694 Country codes", 2006. 1696 [ISO639] ISO, "ISO 639-1:2002 - Code for the representation of 1697 names of languages - Part 1: Alpha-2 code", 2002. 1699 [ISO6429] ISO/IEC, "ISO/IEC 6429:1992. Information 1700 technology -- Control functions for coded character 1701 sets", 1992. 1703 [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet 1704 host table specification", RFC 952, October 1985. 1706 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 1707 STD 13, RFC 1034, November 1987. 1709 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application 1710 and Support", STD 3, RFC 1123, October 1989. 1712 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 1713 Extensions (MIME) Part One: Format of Internet Message 1714 Bodies", RFC 2045, November 1996. 1716 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1717 Requirement Levels", BCP 14, RFC 2119, March 1997. 1719 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 1720 Languages", BCP 18, RFC 2277, January 1998. 1722 [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 1723 10646", RFC 2781, February 2000. 1725 [RFC2978] Freed, N. and J.
Postel, "IANA Charset Registration 1726 Procedures", BCP 19, RFC 2978, October 2000. 1728 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of 1729 Internationalized Strings ("stringprep")", RFC 3454, 1730 December 2002. 1732 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1733 "Internationalizing Domain Names in Applications (IDNA)", 1734 RFC 3490, March 2003. 1736 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1737 Profile for Internationalized Domain Names (IDN)", 1738 RFC 3491, March 2003. 1740 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode 1741 for Internationalized Domain Names in Applications 1742 (IDNA)", RFC 3492, March 2003. 1744 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1745 10646", STD 63, RFC 3629, November 2003. 1747 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint 1748 Engineering Team (JET) Guidelines for Internationalized 1749 Domain Names (IDN) Registration and Administration for 1750 Chinese, Japanese, and Korean", RFC 3743, April 2004. 1752 [RFC4647] Phillips, A. and M. Davis, "Matching of Language Tags", 1753 BCP 47, RFC 4647, September 2006. 1755 [RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin, 1756 "Registration and Administration Recommendations for 1757 Chinese Domain Names", RFC 4713, October 2006. 1759 [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", 1760 BCP 137, RFC 5137, February 2008. 1762 [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network 1763 Interchange", RFC 5198, March 2008. 1765 [RFC5322] Resnick, P., Ed., "Internet Message Format", RFC 5322, 1766 October 2008. 1768 [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying 1769 Languages", BCP 47, RFC 5646, September 2009. 1771 [RFC5890] Klensin, J., "Internationalized Domain Names for 1772 Applications (IDNA): Definitions and Document Framework", 1773 RFC 5890, August 2010. 
1775 [RFC5892] Faltstrom, P., "The Unicode Code Points and 1776 Internationalized Domain Names for Applications (IDNA)", 1777 RFC 5892, August 2010. 1779 [RFC5895] Resnick, P. and P. Hoffman, "Mapping Characters for 1780 Internationalized Domain Names in Applications (IDNA) 1781 2008", RFC 5895, September 2010. 1783 [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on 1784 Encodings for Internationalized Domain Names", RFC 6055, 1785 February 2011. 1787 [UAX34] The Unicode Consortium, "Unicode Standard Annex #34: 1788 Unicode Named Character Sequences", 2010, 1789 . 1791 [UAX9] The Unicode Consortium, "Unicode Standard Annex #9: 1792 Unicode Bidirectional Algorithm", 2010, 1793 . 1795 [US-ASCII] 1796 ANSI, "Coded Character Set -- 7-bit American Standard Code 1797 for Information Interchange, ANSI X3.4-1986", 1986. 1799 [UTN6] The Unicode Consortium, "Unicode Technical Note #6: 1800 BOCU-1: MIME-Compatible Unicode Compression", 2006, 1801 . 1803 [UTR15] The Unicode Consortium, "Unicode Standard Annex #15: 1804 Unicode Normalization Forms", 2010, 1805 . 1807 [UTR18] The Unicode Consortium, "Unicode Standard Annex #18: 1808 Unicode Regular Expressions", 2008, 1809 . 1811 [UTR22] The Unicode Consortium, "Unicode Technical Standard #22: 1812 Unicode Character Mapping Markup Language", 2009, 1813 . 1815 [UTR6] The Unicode Consortium, "Unicode Technical Standard #6: A 1816 Standard Compression Scheme for Unicode", 2005, 1817 . 1819 [W3C-i18n-Def] 1820 W3C, "Localization vs. Internationalization", 1821 September 2010, 1822 . 1824 Appendix A. Additional Interesting Reading 1826 [[anchor20: RFC Editor: should these be standardized into your normal 1827 reference format??]] 1829 ALA-LC Romanization Tables, Randall Barry (ed.), U.S.
Library of 1830 Congress, 1997, ISBN 0844409405 1832 The Alphabetic Labyrinth: The Letters in History and Imagination, 1833 Johanna Drucker, Thames and Hudson Ltd, 1995, ISBN 0-500-28068-1 1835 Blackwell Encyclopedia of Writing Systems, Florian Coulmas, Blackwell 1836 Publishers, 1999, ISBN 063121481X 1838 Chinese Calligraphy, Edoardo Fazzioli, Abbeville Press, 1986, 1987 1839 (English translation), ISBN 0-89659-774-1 1840 The Chinese Language: Fact and Fantasy, John DeFrancis, University of 1841 Hawaii Press, 1984, ISBN 0-8284-085505 and 0-8248-1058-6 1843 CJKV Information Processing, Ken Lunde, O'Reilly & Assoc., 1999, ISBN 1844 1-56592-224-7 1846 Dictionary of Languages: The Definitive Reference to More than 400 1847 Languages, Andrew Dalby, 2004, ISBN 978-0231115698 1849 Language Visible, David Sacks, Bantam Dell, 2003. Also published as 1850 Letter Perfect: The Marvelous History of Our Alphabet From A to Z, 1851 Broadway, 2004, ISBN 978-0767911733 1853 Reading the Past: Ancient Writing from Cuneiform to the Alphabet, 1854 introduction by J.T. Hooker, British Museum Press, 1990, ISBN 0-7141- 1855 8077-7 1857 The Story of Writing: Alphabets, Hieroglyphs, & Pictograms, Andrew 1858 Robinson, Thames and Hudson, 1995, 2000, ISBN 0-500-28156-4 1860 The World's Writing Systems, Peter Daniels and William Bright, Oxford 1861 University Press, 1996, ISBN 0195079930 1863 Writing Systems of the World, Akira Nakanishi, Charles E. Tuttle 1864 Company, 1980, ISBN 0804816549 1866 Appendix B. Acknowledgements 1868 The definitions in this document come from many sources, including a 1869 wide variety of IETF documents. 1871 James Seng contributed to the initial outline of RFC 3536. Harald 1872 Alvestrand and Martin Duerst made extensive useful comments on early 1873 versions. 
Others who contributed to the development of RFC 3536 1874 include Dan Kohn, Jacob Palme, Johan van Wingen, Peter Constable, 1875 Yuri Demchenko, Susan Harris, Zita Wenzel, John Klensin, Henning 1876 Schulzrinne, Leslie Daigle, Markus Scherer, and Ken Whistler. 1878 Abdulaziz Al-Zoman, Tim Bray, Frank Ellermann, Antonio Marko, JFC 1879 Morphin, Sarmad Hussain, Mykyta Yevstifeyev, Ken Whistler, and others 1880 identified important issues with, or made specific suggestions for, 1881 this new version. 1883 Appendix C. Significant Changes from RFC 3536 1885 This document mostly consists of additions to RFC 3536. The 1886 following is a list of the most significant changes. 1888 o Change the document's status to BCP. 1890 o Commonly-used synonyms added to several descriptions and indexed. 1892 o A list of terms defined and used in IDNA2008 was added, with a 1893 pointer to RFC 5890. Those definitions have not been repeated in 1894 this document. 1896 o The much-abused term "variant" is now discussed in some detail. 1898 o A discussion of different subsets of the Unicode repertoire was 1899 added as Section 4.2 and associated definitions were included. 1901 o Added a new term, "writing style". 1903 o Discussions of case-folding and mapping were expanded. 1905 o Minor edits were made to some section titles and a number of other 1906 editorial improvements were made. 1908 o The discussion of control codes was updated to include additional 1909 information and clarify that "control code" and "control 1910 character" are synonyms. 1912 o Many terms were clarified to reflect contemporary usage. 1914 o The index to terms by section in RFC 3536 was replaced by an index 1915 to pages containing considerably more terms. 1917 o The acknowledgments were updated. 1919 o Some of the references were updated. 1921 o The supplemental reading list was expanded somewhat. 
1923 Index 1925 A 1926 A-label 30 1927 ACE 29, 31 1928 ACE Prefix 30 1929 alphabetic 19 1930 ANSI 13 1931 ASCII 14 1932 ASCII-compatible encoding 29, 31 1933 ASN.1 text formats 29 1935 B 1936 Base64 28 1937 Basic Multilingual Plane 13 1938 bidi 25 1939 bidirectional display 25 1940 BMP 13 1941 BMPString 29 1942 BOCU-1 14 1943 BOM 14 1944 byte order mark 14 1946 C 1947 C-T-E 28 1948 case 17 1949 CCS 7 1950 CEN/ISSS 13 1951 character 6 1952 character encoding form 7 1953 character encoding scheme 7 1954 character repertoire 7 1955 charset 8 1956 charset identification 27 1957 CJK characters 33 1958 code chart 19 1959 code point 15 1960 code table 19 1961 coded character 6 1962 coded character set 7 1963 collation 18 1964 combining character 16 1965 combining character sequence 16 1966 compatibility character 21 1967 compatibility variant 21 1968 composite sequence 16 1969 content-transfer-encoding 28 1970 control character 21 1971 control code 21 1972 control sequence 21 1974 D 1975 decomposed character 16 1976 diacritic 20 1977 displaying and rendering text 10 1978 Domain Name Slot 30 1980 E 1981 encoding forms 13 1983 F 1984 font 24 1985 formatting character 21 1987 G 1988 glyph 7 1989 glyph code 7 1990 graphic symbol 24 1992 H 1993 Han characters 33 1995 I 1996 i10n 9 1997 i18n 9 1998 IA5String 29 1999 ideographic 19 2000 IDN 30 2001 IDNA 30 2002 IDNA-valid string 30 2003 IDNA2003 30 2004 IDNA2008 30 2005 IME 23 2006 input method editor 23 2007 input methods 23 2008 internationalization 8 2009 Internationalized Domain Name 30 2010 Internationalized domain names 30 2011 Internationalized Label 30 2012 ISO 11 2013 ISO 639 11 2014 ISO 3166 11 2015 ISO 8859 14 2016 ISO TC 46 11 2018 J 2019 JIS 13 2020 JTC 1 11 2022 L 2023 language 5 2024 language identification 28 2025 Latin characters 33 2026 LDH Label 30 2027 letters 22 2028 Local and regional standards organizations 13 2029 locale 32 2030 localization 9 2032 M 2033 MIME 28 2034 multilingual 9 2036 N 2037 name 
spaces 27 2038 Nameprep 30 2039 NFC 17 2040 NFD 17 2041 NFKC 17 2042 NFKD 17 2043 non-ASCII 22 2044 nonspacing character 20 2045 normalization 16 2046 NR-LDH label 30 2047 NVT 15 2049 O 2050 on-the-wire encoding 27 2052 P 2053 parsed text 27 2054 precomposed character 16 2055 PrintableString 29 2056 private use character 35 2057 protocol elements 26 2058 punctuation 20 2059 Punycode 29, 31 2061 Q 2062 quoted-printable 28 2064 R 2065 regular expressions 35 2066 rendering rules 23 2067 repertoire 7 2068 romanization 33 2070 S 2071 SAC 13 2072 script 5 2073 SCSU 14 2074 sorting 18 2075 Stringprep 30 2076 surrogate pair 14 2077 symbol 20 2079 T 2080 T61String 29 2081 TeletexString 29 2082 TES 28 2083 transcoding 7 2084 transcription 34 2085 transfer encoding syntax 28 2086 transformation formats 13 2087 translation 34 2088 transliteration 33-34 2089 typeface 24 2091 U 2092 U-label 30 2093 UCS-2 13 2094 UCS-4 13 2095 undisplayable character 26 2096 Unicode Consortium 12 2097 US-ASCII 14 2098 UTC 12 2099 UTF-8 14 2100 UTF-16 14 2101 UTF-16BE 14 2102 UTF-16LE 14 2103 UTF-32 14 2104 UTF8String 29 2106 V 2107 variant 31 2109 W 2110 W3C 13 2111 World Wide Web Consortium 13 2112 writing style 26 2113 writing system 6 2115 X 2116 XML 13, 29 2118 Authors' Addresses 2120 Paul Hoffman 2121 VPN Consortium 2123 Email: paul.hoffman@vpnc.org 2125 John C Klensin 2126 1770 Massachusetts Ave, Ste 322 2127 Cambridge, MA 02140 2128 USA 2130 Phone: +1 617 245 1457 2131 Email: john+ietf@jck.com