Network Working Group                                         P. Hoffman
Internet-Draft                                            VPN Consortium
Obsoletes: 3536 (if approved)                                 J. Klensin
Intended status: BCP                                      April 21, 2011
Expires: October 23, 2011

          Terminology Used in Internationalization in the IETF
                      draft-hoffman-rfc3536bis-02

Abstract

   This document provides a glossary of terms used in the IETF when
   discussing internationalization. The purpose is to help frame
   discussions of internationalization in the various areas of the IETF
   and to help introduce the main concepts to IETF participants.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 23, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Purpose of this Document
      1.2. Format of the Definitions in this Document
      1.3. Discussion of This Document
   2. Fundamental Terms
   3. Standards Bodies and Standards
      3.1. Standards bodies
      3.2. Encodings and Transformation Formats of ISO/IEC 10646
      3.3. Native CCSs and charsets
   4. Character Issues
      4.1. Types of Characters
      4.2. Differentiation of Subsets
   5. User Interface for Text
   6. Text in Current IETF Protocols
   7. Terms Associated with Internationalized Domain Names
      7.1. IDNA Terminology
      7.2. Character Relationships and Variants
   8. Other Common Terms In Internationalization
   9. Security Considerations
   10. References
      10.1. Normative References
      10.2. Informative References
   Appendix A. Additional Interesting Reading
   Appendix B. Acknowledgements
   Appendix C. Changes from RFC 3536
   Appendix D. Changes Between Versions of this Draft
      D.1. Changes in version -01
   Index
   Authors' Addresses

1. Introduction

   As the IETF Character Set Policy specification [RFC2277] summarizes:
   "Internationalization is for humans. This means that protocols are
   not subject to internationalization; text strings are." Many
   protocols throughout the IETF use text strings that are entered by,
   or are visible to, humans.
   It should be possible for anyone to enter or read these text
   strings, which means that Internet users must be able to enter text
   using typical input methods and to have it displayed in any human
   language. Further, text containing any character should be able to
   be passed between Internet applications easily. This is the
   challenge of internationalization.

1.1. Purpose of this Document

   This document provides a glossary of terms used in the IETF when
   discussing internationalization. The purpose is to help frame
   discussions of internationalization in the various areas of the IETF
   and to help introduce the main concepts to IETF participants.

   Internationalization is discussed in many working groups of the IETF.
   However, few working groups have internationalization experts. When
   designing or updating protocols, the question often comes up "should
   we internationalize this" (or, more likely, "do we have to
   internationalize this").

   This document gives an overview of internationalization as it applies
   to IETF standards work by lightly covering the many aspects of
   internationalization and the vocabulary associated with those topics.
   Some of the overview is somewhat tutorial in nature. It is not
   meant to be a complete description of internationalization. The
   definitions in this document are not normative for IETF standards;
   however, they are useful and standards may make informative reference
   to this document after it becomes an RFC. Some of the definitions in
   this document come from many earlier IETF documents and books.

   As in many fields, there is disagreement in the internationalization
   community on definitions for many words. The topic of language
   brings up particularly passionate opinions for experts and
   non-experts alike. This document attempts to define terms in a way
   that will be most useful to the IETF audience.

   This document uses definitions from many documents that have been
   developed outside the IETF. The primary documents used are:

   o  ISO/IEC 10646 [ISOIEC10646]

   o  The Unicode Standard [UNICODE]

   o  W3C Character Model [CHARMOD]

   o  IETF RFCs, including the Character Set Policy specification
      [RFC2277]

1.2. Format of the Definitions in this Document

   In the body of this document, the source for the definition is shown
   in angle brackets, such as "". Many definitions are
   shown as "", which means that the definitions were crafted
   originally for this document. The angle bracket notation for the
   source of definitions is different from the square bracket notation
   used for references to documents, such as in the paragraph above;
   these references are given in the reference sections of this
   document.

   For some terms, there are commentary and examples after the
   definitions. In those cases, the part before the angle brackets is
   the definition that comes from the original source, and the part
   after the angle brackets is commentary that is not a definition (such
   as examples or further exposition).

   Examples in this document use the notation for code points and names
   from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646].
   For example, the letter "a" may be represented as either "U+0061" or
   "LATIN SMALL LETTER A". See RFC 5137 [RFC5137] for a description of
   this notation.

1.3. Discussion of This Document

   [[ This section is to be removed before the RFC is published. ]]

   This document is being discussed on the apps-discuss@ietf.org mailing
   list. For more information, see
   .

2. Fundamental Terms

   This section covers basic topics that are needed for almost anyone
   who is involved with making IETF protocols more friendly to non-ASCII
   text (see Section 4.2) and with other aspects of
   internationalization.

   language

      A language is a way that humans interact. The use of language
      occurs in many forms, the most common of which are speech,
      writing, and signing.

      Some languages have a close relationship between the written and
      spoken forms, while others have a looser relationship. The
      so-called LTRU (Language Tag Registry Update) standards [RFC5646]
      [RFC4647] discuss languages in more detail and provide
      identifiers for languages for use in Internet protocols. Note
      that computer languages are explicitly excluded from this
      definition.

   script

      A set of graphic characters used for the written form of one or
      more languages.

      Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han
      (the characters, often called ideographs after a subset of them,
      used in writing Chinese, Japanese, and Korean). RFC 2277
      discusses scripts in detail.

      It is common for internationalization novices to mix up the terms
      "language" and "script". This can be a problem in protocols that
      differentiate the two. Almost all protocols that are designed (or
      were re-designed) to handle non-ASCII text deal with scripts (the
      written systems) or characters, while fewer actually deal with
      languages.

      A single name can mean either a language or a script; for example,
      "Arabic" is both the name of a language and the name of a script.
      In fact, many scripts borrow their names from the names of
      languages. Further, many scripts are used to write more than one
      language; for example, the Russian and Bulgarian languages are
      written in the Cyrillic script.
      Some languages can be expressed
      using different scripts or were used with different scripts at
      different times; the Mongolian language can be written in either
      the Mongolian or Cyrillic scripts; Malay is primarily written in
      Latin script today but the earlier, Arabic-script-based, Jawi form
      is still in use; and a number of languages were converted from
      other scripts to Cyrillic in the first half of the last century,
      some of which have switched again more recently. Further, some
      languages are normally expressed with more than one script at the
      same time; for example, the Japanese language is normally
      expressed in the Kanji (Han), Katakana, and Hiragana scripts in a
      single string of text.

   writing system

      A set of rules for using one or more scripts to write a particular
      language. Examples include the American English writing system,
      the British English writing system, the French writing system, and
      the Japanese writing system.

   character

      A member of a set of elements used for the organization, control,
      or representation of data.

      There are at least three common definitions of the word
      "character":

      *  a general description of a text entity

      *  a unit of a writing system, often synonymous with "letter" or
         similar terms, but generalized to include digits and symbols of
         various sorts

      *  the encoded entity itself

      When people talk about characters, they usually intend one of the
      first two definitions.

      A particular character is identified by its name, not by its
      shape. A name may suggest a meaning, but the character may be
      used for representing other meanings as well. A name may suggest
      a shape, but that does not imply that only that shape is commonly
      used in print, nor that the particular shape is associated only
      with that name.

   coded character

      A character together with its coded representation.
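
      As a rough illustration of a coded character, the following
      Python sketch pairs a character's code point notation with its
      name. It assumes the Unicode character database that ships with
      the Python interpreter; the helper name "describe" is
      hypothetical and does not come from any of the cited standards.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    # ord() gives the integer code point; unicodedata.name() gives the
    # character name recorded in the Unicode character database.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("a"))        # U+0061 LATIN SMALL LETTER A
print(describe("\u00f1"))   # U+00F1 LATIN SMALL LETTER N WITH TILDE
```

      The name identifies the character independently of the glyph
      that a particular font happens to use for it.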

   coded character set

      A coded character set (CCS) is a set of unambiguous rules that
      establishes a character set and the relationship between the
      characters of the set and their coded representation.

   character encoding form

      A character encoding form is a mapping from a coded character set
      (CCS) to the actual code units used to represent the data.

   repertoire

      The collection of characters included in a character set. Also
      called a character repertoire.

   glyph

      A glyph is an abstract form that represents one or more glyph
      images. The term "glyph" is often a synonym for glyph image,
      which is the actual, concrete image of a glyph representation
      having been rasterized or otherwise imaged onto some display
      surface. In displaying character data, one or more glyphs may be
      selected to depict a particular character. These glyphs are
      selected by a rendering engine during composition and layout
      processing.

   glyph code

      A glyph code is a numeric code that refers to a glyph. Usually,
      the glyphs contained in a font are referenced by their glyph code.
      Glyph codes are local to a particular font; that is, a different
      font containing the same glyphs may use different codes.

   transcoding

      Transcoding is the process of converting text data from one
      character encoding form to another. Transcoders work only at the
      level of character encoding and do not parse the text. Note:
      Transcoding may involve one-to-one, many-to-one, one-to-many or
      many-to-many mappings. Because some legacy mappings are glyphic,
      they may not only be many-to-many, but also unordered: thus XYZ
      may map to yxz.

      In this definition, "many-to-one" means a sequence of characters
      mapped to a single character. The "many" does not mean
      alternative characters that map to the single character.
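
      The transcoding definition above can be sketched in Python as
      follows. This is a minimal illustration, not a reference
      implementation: the function name "transcode" is hypothetical,
      and "iso-8859-1" and "utf-8" are the codec labels used by the
      Python standard library rather than names taken from this
      document.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert octets from one encoding to another (hypothetical helper).

    The octets are first mapped to abstract characters using the source
    encoding, then serialized again using the target encoding. The text
    itself is never parsed or interpreted.
    """
    return data.decode(source).encode(target)


latin1 = b"\xf1"  # U+00F1 as a single octet in ISO 8859-1
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(utf8)       # b'\xc3\xb1', the same character as two octets
```

      Transcoding fails when the target cannot represent a character
      in the input. In this sketch, converting a Cyrillic letter from
      UTF-8 to ISO 8859-1 raises UnicodeEncodeError.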

   character encoding scheme

      A character encoding scheme (CES) is a character encoding form
      plus byte serialization. There are many character encoding
      schemes in Unicode, such as UTF-8 and UTF-16BE.

      Some CESs are associated with a single CCS; for example, UTF-8
      [RFC3629] applies only to the identical CCSs of ISO/IEC 10646 and
      Unicode. Other CESs, such as ISO 2022, are associated with many
      CCSs.

   charset

      A charset is a method of mapping a sequence of octets to a
      sequence of abstract characters. A charset is, in effect, a
      combination of one or more CCSs with a CES. Charset names are
      registered by the IANA according to procedures documented in
      [RFC2978].

      Many protocol definitions use the term "character set" in their
      descriptions. The terms "charset" or "character encoding scheme"
      and "coded character set" are strongly preferred over the term
      "character set" because "character set" has other definitions in
      other contexts and this can be confusing.

   internationalization

      In the IETF, "internationalization" means to add or improve the
      handling of non-ASCII text in a protocol.

      Many protocols that handle text only handle one charset
      (US-ASCII), or leave the question of what CCS and encoding are
      used up to local guesswork (which leads, of course, to
      interoperability problems). If multiple charsets are permitted
      they must be explicitly identified [RFC2277]. Adding non-ASCII
      text to a protocol allows the protocol to handle more scripts,
      hopefully all of the ones useful in the world. In today's world,
      that is normally best accomplished by allowing Unicode encoded in
      UTF-8 only, thereby shifting conversion issues away from
      individual choices.

   localization

      The process of adapting an internationalized application platform
      or application to a specific cultural environment.
      In localization, the same semantics are preserved while the syntax
      may be changed. [FRAMEWORK]

      Localization is the act of tailoring an application for a
      different language or script or culture. Some internationalized
      applications can handle a wide variety of languages. Typical
      users only understand a small number of languages, so the program
      must be tailored to interact with users in just the languages they
      know.

      The major work of localization is translating the user interface
      and documentation. Localization involves not only changing the
      language interaction, but also other relevant changes such as
      display of numbers, dates, currency, and so on. The better
      internationalized an application is, the easier it is to localize
      it for a particular language and character encoding scheme.

      Localization is rarely an IETF matter, and protocols that are
      merely localized, even if they are serially localized for several
      locations, are generally considered unsatisfactory for the global
      Internet.

      Do not confuse "localization" with "locale", which is described in
      Section 8 of this document.

   i18n, l10n

      These are abbreviations for "internationalization" and
      "localization".

      "18" is the number of characters between the "i" and the "n" in
      "internationalization", and "10" is the number of characters
      between the "l" and the "n" in "localization".

   multilingual

      The term "multilingual" has many widely-varying definitions and
      thus is not recommended for use in standards. Some of the
      definitions relate to the ability to handle international
      characters; other definitions relate to the ability to handle
      multiple charsets; and still others relate to the ability to
      handle multiple languages.

   displaying and rendering text

      To display text, a system puts characters on a visual display
      device such as a screen or a printer.
      To render text, a system analyzes the character input to
      determine how to display the text. The terms "display" and
      "render" are sometimes used interchangeably. Note, however, that
      text might be rendered as audio and/or tactile output, such as in
      systems that have been designed for people with visual
      disabilities.

      Combining characters modify the display of the character (or, in
      some cases, characters) that precede them. When rendering such
      text, the display engine must either find the glyph in the font
      that represents the base character and all of the combining
      characters, or it must render the combination itself. Such
      rendering can be straightforward, but it is sometimes complicated
      when the combining marks interact with each other, such as when
      there are two combining marks that would appear above the same
      character. Formatting characters can also change the way that a
      renderer would display text. Rendering can also be difficult for
      some scripts that have complex display rules for base characters,
      such as Arabic and Indic scripts.

3. Standards Bodies and Standards

   This section describes some of the standards bodies and standards
   that appear in discussions of internationalization in the IETF. This
   is an incomplete and possibly over-full list; listing too few bodies
   or standards can be just as politically dangerous as listing too
   many. Note that there are many other bodies that deal with
   internationalization; however, few if any of them appear commonly in
   IETF standards work.

3.1. Standards bodies

   ISO and ISO/IEC JTC 1

      The International Organization for Standardization has been
      involved with standards for characters since before the IETF was
      started. ISO is a non-governmental group made up of national
      bodies.
      Most of ISO's work in information technology is performed
      jointly with a similar body, the International Electrotechnical
      Commission (IEC) through a joint committee known as "JTC 1". ISO
      and ISO/IEC JTC 1 have many diverse standards in the international
      characters area; the one that is most used in the IETF is commonly
      referred to as "ISO/IEC 10646", sometimes with a specific date.
      ISO/IEC 10646 describes a CCS that covers almost all known written
      characters in use today.

      ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC
      1/SC 2 WG2", often called "SC2/WG2" or "WG2" for short. ISO
      standards go through many steps before being finished, and years
      often go by between changes to the base ISO/IEC 10646 standard
      although amendments are now issued to track Unicode changes.
      Information on WG2, and its work products, can be found at
      . Information on SC2, and its
      work products, can be found at

      The standard comes as a base part and a series of attachments or
      amendments. It is available in PDF form for downloading or in a
      CD-ROM version. One example of how to cite the standard is given
      in [RFC3629]. Any standard that cites ISO/IEC 10646 needs to
      evaluate how to handle the versioning problem that is relevant to
      the protocol's needs.

      ISO is responsible for other standards that might be of interest
      to protocol developers concerned about internationalization. ISO
      639 [ISO639] specifies the names of languages and forms part of
      the basis for the IETF's Language Tag work [RFC5646]. ISO 3166
      [ISO3166] specifies the names and code abbreviations for countries
      and territories and is used in several protocols and databases
      including names for country-code top level domain names. The
      responsibilities of ISO TC 46 on Information and Documentation
      include a series of
      standards for transliteration of various languages into Latin
      characters.

      Another relevant ISO group was JTC 1/SC22/WG20, which was
      responsible for internationalization in JTC1, such as for
      international string ordering. Information on WG20, and its work
      products, can be found at .
      The specific tasks of SC22/WG20 were moved from SC22 into SC2 and
      there has been little significant activity since that occurred.

   Unicode Consortium

      The second important group for international character standards
      is the Unicode Consortium. The Unicode Consortium is a trade
      association of companies, governments, and other groups interested
      in promoting the Unicode Standard [UNICODE]. The Unicode Standard
      is a CCS whose repertoire and code points are identical to ISO/IEC
      10646. The Unicode Consortium has added features to the base CCS
      which make it more useful in protocols, such as defining
      attributes for each character. Examples of these attributes
      include case conversion and numeric properties.

      The actual technical and definitional work of the Unicode
      Consortium is done in the Unicode Technical Committee (UTC). The
      terms "UTC" and "Unicode Consortium" are often treated,
      imprecisely, as synonymous in the IETF.

      The Unicode Consortium publishes addenda to the Unicode Standard
      as Unicode Technical Reports. There are many types of technical
      reports at various stages of maturity. The Unicode Standard and
      affiliated technical reports can be found at
      .

      A reciprocal agreement between the Unicode Consortium and ISO/IEC
      JTC 1/SC 2 provides for ISO/IEC 10646 and The Unicode Standard to
      track each other for definitions of characters and assignments of
      code points. Updates, often in the form of amendments, to the
      former sometimes lag updates to the latter for a short period, but
      the gap has rarely been significant in recent years.

      At the time that the IETF character set policy [RFC2277] was
      established and the first version of this terminology
      specification was published, there was a strong preference in the
      IETF community for references to ISO/IEC 10646 (rather than
      Unicode) when possible. That preference largely reflected a more
      general IETF preference for referencing established open
      international standards in preference to specifications from
      consortia. However, the Unicode definitions of character
      properties and classes are not part of ISO/IEC 10646. Because
      IETF specifications are increasingly dependent on those
      definitions (for example, see the explanation in Section 4.2) and
      the Unicode specifications are freely available online in
      convenient machine-readable form, the IETF's preference has
      shifted to referencing the Unicode Standard. The latter is
      especially important when version consistency between code points
      (either standard) and Unicode properties (Unicode only) is
      required.

   World Wide Web Consortium (W3C)

      This group created and maintains the standard for XML, the markup
      language for text that has become very popular. XML has always
      been fully internationalized so that there is no need for a new
      version to handle international text. However, in some
      circumstances, XML files may be sensitive to differences among
      Unicode versions.

   local and regional standards organizations

      Just as there are many native CCSs and charsets, there are many
      local and regional standards organizations to create and support
      them. Common examples of these are ANSI (United States), CEN/ISSS
      (Europe), JIS (Japan), and SAC (China).

3.2. Encodings and Transformation Formats of ISO/IEC 10646

   Characters in the ISO/IEC 10646 CCS can be expressed in many ways.
   Encoding forms are direct addressing methods, while transformation
   formats are methods for expressing encoding forms as bits on the
   wire.
   [[anchor9: Note in Draft: The current Unicode Standard, e.g., Section
   2.5 of version 5, refers to UTF-8, UTF-16, and UTF-32 as "encoding
   forms". Consequently, the distinction made above may no longer be
   useful or its definition precisely correct. Comments and suggestions
   welcome.]]

   Documents that discuss characters in the ISO/IEC 10646 CCS often need
   to list specific characters. RFC 5137 describes the common methods
   for doing so in IETF documents, and these practices have been adopted
   by many other communities as well.

   Basic Multilingual Plane (BMP)

      The BMP is composed of the first 2^16 code points in ISO/IEC 10646
      and contains almost all characters in contemporary use. The BMP
      is also called "Plane 0".

   UCS-2 and UCS-4

      UCS-2 and UCS-4 are the two encoding forms historically defined
      for ISO/IEC 10646. UCS-2 addresses only the BMP. Because many
      useful characters (such as many Han characters) have been defined
      outside of the BMP, many people consider UCS-2 to be obsolete.
      UCS-4 addresses the entire range of code points from ISO/IEC 10646
      (by agreement between ISO/IEC JTC1 SC2 and the Unicode Consortium,
      a range from 0..0x10FFFF) as 32-bit values with zero padding to
      the left. UCS-4 is identical to UTF-32BE (without use of a BOM
      (see below)); UTF-32BE is now the preferred term.

   UTF-8

      UTF-8 [RFC3629] is the preferred encoding for IETF protocols.
      Characters in the BMP are encoded as one, two, or three octets.
      Characters outside the BMP are encoded as four octets. Characters
      from the US-ASCII repertoire have the same on-the-wire
      representation in UTF-8 as they do in US-ASCII.
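
      The octet counts stated above for UTF-8 can be checked with a
      short Python sketch. The helper names "utf8_octets" and "in_bmp"
      are hypothetical, and the sample code points were chosen only
      for illustration; the sketch relies on the UTF-8 codec in the
      Python standard library.

```python
def utf8_octets(ch: str) -> int:
    """Number of octets UTF-8 uses for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))


def in_bmp(ch: str) -> bool:
    """True if the code point is among the first 2^16 code points."""
    return ord(ch) < 2 ** 16


# One sample each: US-ASCII, two-octet BMP, three-octet BMP, outside the BMP.
for ch in ("\u0061", "\u00f1", "\u4e2d", "\U00010000"):
    print(f"U+{ord(ch):04X} BMP={in_bmp(ch)} octets={utf8_octets(ch)}")
# U+0061 BMP=True octets=1
# U+00F1 BMP=True octets=2
# U+4E2D BMP=True octets=3
# U+10000 BMP=False octets=4
```

      For the first sample, the UTF-8 octets are identical to the
      US-ASCII octets, which is why US-ASCII text needs no conversion
      to be valid UTF-8.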
The IETF-specific 593 definition of UTF-8 in RFC 3629 is identical to that in recent 594 versions of the Unicode Standard (e.g., in Section 3.9 of Version 595 5.2 [UNICODE]). 597 UTF-16, UTF-16BE, and UTF-16LE 599 UTF-16, UTF-16BE, and UTF-16LE, three transformation formats 600 described in [RFC2781] and defined in The Unicode Standard 601 (Sections 3.9 and 16.8 of Version 5.2), are not required by any 602 IETF standards, and are thus used much less often in protocols 603 than UTF-8. Characters in the BMP are always encoded as two 604 octets, and characters outside the BMP are encoded as four octets 605 using a "surrogate pair" arrangement. The latter is not part of 606 UCS-2, marking the difference between UTF-16 and UCS-2. The three 607 UTF-16 formats differ based on the order of the octets and the 608 presence or absence of a special lead-in ordering identifier 609 called the "byte order mark" or "BOM". 611 UTF-32 613 The Unicode Consortium and ISO/IEC JTC 1 have defined UTF-32 as a 614 transformation format that incorporates the integer code point 615 value right-justified in a 32 bit field. As with UTF-16, the byte 616 order mark (BOM) can be used and UTF-32BE and UTF-32LE are 617 defined. UTF-32 and UCS-4 are essentially equivalent and the 618 terms are often used interchangeably. 620 SCSU and BOCU-1 622 The Unicode Consortium has defined an encoding, SCSU [UTR6], which 623 is designed to offer good compression for typical text. A 624 different encoding that is meant to be MIME-friendly, BOCU-1, is 625 described in [UTN6]. Although compression is attractive, as 626 opposed to UTF-8, neither of these (at the time of this writing) 627 has attracted much interest. 629 The compression provided as a side effect of the Punycode 630 algorithm [RFC3492] is heavily used in some contexts, especially 631 IDNA [RFC5890], but imposes some restrictions (See also 632 Section 7). 634 3.3. 
Native CCSs and charsets 636 Before ISO/IEC 10646 was developed, many countries developed their 637 own CCSs and charsets. Some of these were adopted into international 638 standards for the relevant scripts or writing systems. Many dozen of 639 these are in common use on the Internet today. Examples include ISO 640 8859-5 for Cyrillic and Shift- JIS for Japanese scripts. 642 The official list of the registered charset names for use with IETF 643 protocols is maintained by IANA and can be found at 644 . The list contains 645 preferred names and aliases. Note that this list has historically 646 contained many errors, such as names that are in fact not charsets or 647 references that do not give enough detail to reliably map names to 648 charsets. 650 Probably the most well-known native CCS is ASCII [US-ASCII]. This 651 CCS is used as the basis for keywords and parameter names in many 652 IETF protocols, and as the sole CCS in numerous IETF protocols that 653 have not yet been internationalized. ASCII became the basis for ISO/ 654 IEC 646 which, in turn, formed the basis for many national and 655 international standards, such as the ISO 8859 series, that mix Basic 656 Latin characters with characters from another script. 658 It is important to note that, strictly speaking, "ASCII" is a CCS and 659 repertoire, not an encoding. The encoding used for ASCII in IETF 660 protocols involves the seven-bit integer ASCII code point right- 661 justified an an 8-bit field and is sometimes described as the 662 "Network Virtual Terminal" or "NVT" encoding [RFC5198]. Less 663 formally, "ASCII" and "NVT" are often used interchangeably. However, 664 "non-ASCII" refers only to characters outside the ASCII repertoire 665 and is not linked to a specific encoding. See Section 4.2. 667 A Unicode publication describes issues involved in mapping character 668 data between charsets, and an XML format for mapping table data 669 [UTR22]. 671 4. 
Character Issues

   This section contains terms and topics that are commonly used in
   character handling and therefore are of concern to people adding
   non-ASCII text handling to protocols.  These topics are standardized
   outside the IETF.

   code point

      A value in the codespace of a repertoire.  For all common
      repertoires developed in recent years, code point values are
      integers (code points for ASCII and its immediate descendants
      were defined in terms of column and row positions of a table).

   combining character

      A member of an identified subset of the coded character set of
      ISO/IEC 10646 intended for combination with the preceding
      non-combining graphic character, or with a sequence of combining
      characters preceded by a non-combining character.  Combining
      characters are inherently non-spacing.

   composite sequence or combining character sequence

      A sequence of graphic characters consisting of a non-combining
      character followed by one or more combining characters.  A
      graphic symbol for a composite sequence generally consists of the
      combination of the graphic symbols of each character in the
      sequence.  The Unicode Standard often uses the term "combining
      character sequence" to refer to composite sequences.  A composite
      sequence is not a character and therefore is not a member of the
      repertoire of ISO/IEC 10646.  However, Unicode now assigns names
      to some such sequences, especially when the names are required to
      match terminology in other standards [UAX34].

      In some CCSs, some characters consist of combinations of other
      characters.  For example, the letter "a with acute" might be a
      combination of the two characters "a" and "combining acute", or
      it might be a combination of the three characters "a", a
      non-destructive backspace, and an acute.  In the same or other
      CCSs, it might be available as a single code point.
      The rules for combining two or more characters are called
      "composition rules", and the rules for taking apart a character
      into other characters are called "decomposition rules".  The
      result of composition is called a "precomposed character"; the
      result of decomposition is called a "decomposed character".

   normalization

      Normalization is the transformation of data to a normal form, for
      example, to unify spelling.

      Note that the phrase "unify spelling" in the definition above
      does not mean unifying different strings with the same meaning as
      words (such as "color" and "colour").  Instead, it means unifying
      different character sequences that are intended to form the same
      composite characters, such as "n" followed by "combining tilde"
      and the single character "n with tilde" (where "n" is U+006E,
      "combining tilde" is U+0303, and "n with tilde" is U+00F1).

      The purpose of normalization is to allow two strings to be
      compared for equivalence.  The string "n" followed by "combining
      tilde" and the string "n with tilde" would be shown identically
      on a text display device.  If a protocol designer wants those two
      strings to be considered equivalent during comparison, the
      protocol must define where normalization occurs.

      The terms "normalization" and "canonicalization" are often used
      interchangeably.  Generally, they both mean to convert a string
      of one or more characters into another string based on
      standardized rules.  However, in Unicode, "canonicalization" or
      similar terms are used to refer to a particular type of
      normalization equivalence ("canonical equivalence", in contrast
      to "compatibility equivalence"), so the term should be used with
      some care.  Some CCSs allow multiple equivalent representations
      for a written string; normalization selects one among multiple
      equivalent representations as a base for reference purposes in
      comparing strings.  In strings of text, these rules are usually
      based on decomposing combined characters or composing characters
      with combining characters.
Unicode Standard Annex #15 [UTR15] 751 describes the process and many forms of normalization in detail. 752 Normalization is important when comparing strings to see if they 753 are the same. 755 The Unicode NFC and NFD normalizations support canonical 756 equivalence; NFKC and NFKD support canonical and compatibility 757 equivalence. 759 case 761 Case is the feature of certain alphabets where the letters have 762 two (or occasionally more) distinct forms. These forms, which may 763 differ markedly in shape and size, are called the uppercase letter 764 (also known as capital or majuscule) and the lowercase letter 765 (also known as small or minuscule). Case mapping is the 766 association of the uppercase and lowercase forms of a letter. 767 769 There is usually (but not always) a one-to-one mapping between the 770 same letter in the two cases. However, there are many examples of 771 characters which exist in one case but for which there is no 772 corresponding character in the other case or for which there is a 773 special mapping rule, such as the Turkish dotless "i", some Greek 774 characters with modifiers, and characters like the German Sharp S 775 (Eszett) and Greek Final Sigma that traditionally do not have 776 uppercase forms. Case mapping can even be dependent on locale or 777 language. Converting text to have only a single case, primarily 778 for comparison purposes, is called "case folding". Because of the 779 various unusual cases, case mapping can be quite controversial and 780 some case folding algorithms even more so. For example, some 781 programming languages such as Java have case-folding algorithms 782 that are locale-sensitive; this makes those algorithms incredibly 783 resource-intensive, and makes them act differently depending on 784 the location of the system at the time the algorithm is used. 786 sorting and collation 788 Collating is the process of ordering units of textual information. 
789 Collation is usually specific to a particular language or even to 790 a particular application or locale. It is sometimes known as 791 alphabetizing, although alphabetization is just a special case of 792 sorting and collation. 794 Collation is concerned with the determination of the relative 795 order of any particular pair of strings, and algorithms concerned 796 with collation focus on the problem of providing appropriate 797 weighted keys for string values, to enable binary comparison of 798 the key values to determine the relative ordering of the strings. 800 The relative orders of letters in collation sequences can differ 801 widely based on the needs of the system or protocol defining the 802 collation order. For example, even within ASCII characters, there 803 are two common and very different collation orders: "A, a, B, 804 b,..." and "A, B, C, ..., Z, a, b,...", with additional variations 805 for lower case first and digits before and after letters. 807 In practice, it is rarely necessary to define a collation sequence 808 for characters drawn from different scripts, but arranging such 809 sequences so as to not surprise users is usually particularly 810 problematic. 812 Sorting is the process of actually putting data records into 813 specified orders, according to criteria for comparison between the 814 records. Sorting can apply to any kind of data (including textual 815 data) for which an ordering criterion can be defined. Algorithms 816 concerned with sorting focus on the problem of performance (in 817 terms of time, memory, or other resources) in actually putting the 818 data records into the desired order. 820 A sorting algorithm for string data can be internationalized by 821 providing it with the appropriate collation-weighted keys 822 corresponding to the strings to be ordered. 824 Many processes have a need to order strings in a consistent 825 (sorted) sequence. 
For only a few CCS/CES combinations, there is 826 an obvious sort order that can be applied without reference to the 827 linguistic meaning of the characters: the code point order is 828 sufficient for sorting. That is, the code point order is also the 829 order that a person would use in sorting the characters. For many 830 CCS/CES combinations, the code point order would make no sense to 831 a person and therefore is not useful for sorting if the results 832 will be displayed to a person. 834 Code Point order is usually not how any human educated by a local 835 school system expects to see strings ordered; if one orders to the 836 expectations of a human, one has a language-specific sort. 837 Sorting to code point order will seem inconsistent if the strings 838 are not normalized before sorting because different 839 representations of the same character will sort differently. This 840 problem may be smaller with a language-specific sort. 842 code table 844 A code table is a table showing the characters allocated to the 845 octets in a code. 847 Code tables are also commonly called "code charts". 849 4.1. Types of Characters 851 The following definitions of types of characters do not clearly 852 delineate each character into one type, nor do they allow someone to 853 accurately predict what types would apply to a particular character. 854 The definitions are intended for application designers to help them 855 think about the many (sometimes confusing) properties of text. 857 alphabetic 859 An informative Unicode property. Characters that are the primary 860 units of alphabets and/or syllabaries, whether combining or 861 noncombining. 
This includes composite characters that are 862 canonical equivalents to a combining character sequence of an 863 alphabetic base character plus one or more combining characters: 864 letter digraphs; contextual variant of alphabetic characters; 865 ligatures of alphabetic characters; contextual variants of 866 ligatures; modifier letters; letterlike symbols that are 867 compatibility equivalents of single alphabetic letters; and 868 miscellaneous letter elements. 870 ideographic 872 Any symbol that primarily denotes an idea (or meaning) in contrast 873 to a sound (or pronunciation), for example, a symbol showing a 874 telephone or the Han characters used in Chinese, Japanese, and 875 Korean. 877 While Unicode and many other systems use this term to refer to all 878 Han characters, strictly speaking not all of those characters are 879 actually ideographic. Some are pictographic (such as the 880 telephone example above), some are used phonetically, and so on. 881 However, the convention is to describe the script as ideographic 882 as contrasted to alphabetic. 884 digit or number 886 All modern writing systems use decimal digits in some form; some 887 older ones use non-positional or other systems. Different scripts 888 may have their own digits. Unicode distinguishes between numbers 889 and other kinds of characters by assigning a special General 890 Category value to them and subdividing that value to distinguish 891 between decimal digits, letter digits, and other digits. 893 punctuation 895 Characters that separate units of text, such as sentences and 896 phrases, thus clarifying the meaning of the text. The use of 897 punctuation marks is not limited to prose; they are also used in 898 mathematical and scientific formulae, for example. 900 symbol 902 One of a set of characters other than those used for letters, 903 digits, or punctuation, and representing various concepts 904 generally not connected to written language use per se. 
      Examples of symbols include characters for mathematical
      operators, symbols for OCR, symbols for box-drawing or graphics,
      as well as symbols for dingbats, arrows, faces, and geometric
      shapes.  Unicode has a property that identifies symbol
      characters.

   nonspacing character

      A combining character whose positioning in presentation is
      dependent on its base character.  It generally does not consume
      space along the visual baseline in and of itself.

      A combining acute accent (U+0301) is an example of a nonspacing
      character.

   diacritic

      A mark applied or attached to a symbol to create a new symbol
      that represents a modified or new value.  A diacritic can also be
      a mark applied to a symbol irrespective of whether it changes the
      value of that symbol.  In the latter case, the diacritic usually
      represents an independent value (for example, an accent, tone, or
      some other linguistic information).  Also called diacritical mark
      or diacritical.

   control character

      The 65 characters in the ranges U+0000..U+001F and
      U+007F..U+009F.  The basic space character, U+0020, is often
      considered as a control character as well, making the total
      number 66.  They are also known as control codes.  In terminology
      adopted by Unicode from ASCII and the ISO 8859 standards, these
      codes are treated as belonging to three ranges: "C0" (for
      U+0000..U+001F), "C1" (for U+0080..U+009F), and the single
      control character "DEL" (U+007F).

   formatting character

      Characters that are inherently invisible but that have an effect
      on the surrounding characters.

      Examples of formatting characters include characters for
      specifying the direction of text and characters that specify how
      to join multiple characters.
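      The nonspacing, control, and formatting character entries above
      can be roughly illustrated with the Unicode General Category
      property.  The following is a minimal sketch in Python, not a
      definitive classification: it assumes that the General Category
      values Mn, Cc, and Cf are a reasonable stand-in for those three
      glossary terms, and the names LABELS and describe are
      hypothetical helpers introduced only for this illustration.

```python
import unicodedata

# Hypothetical labels for three General Category values. Treating these
# categories as equivalent to the glossary terms is an assumption of
# this sketch, not a definition taken from the glossary.
LABELS = {
    "Cc": "control character",
    "Cf": "formatting character",
    "Mn": "nonspacing character",
}


def describe(ch: str) -> str:
    """Return a rough glossary label for a single character."""
    category = unicodedata.category(ch)
    return LABELS.get(category, "other (" + category + ")")


# Count every code point whose General Category is Cc.
control_count = sum(
    1 for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"
)

print(describe("\u0301"))  # combining acute accent
print(describe("\u007f"))  # DEL
print(describe("\u200f"))  # right-to-left mark
print(describe("\u0020"))  # space
print(control_count)
```

      Under these assumptions the count of Cc code points matches the
      65 characters named in the control character entry.  The space
      character, U+0020, has General Category Zs, so this sketch does
      not count it as a control character.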
   compatibility character or compatibility variant

      A graphic character included as a coded character of ISO/IEC
      10646 primarily for compatibility with existing coded character
      sets.

      The Unicode definition of compatibility character also includes
      characters that have been incorporated for other reasons.  The
      Unicode list includes several separate groups of characters
      included for compatibility purposes: halfwidth and fullwidth
      characters used with East Asian scripts, Arabic contextual forms
      (e.g., initial or final forms), some ligatures, deprecated
      formatting characters, variant forms of characters (or even
      copies of them) for particular uses (e.g., phonetic or
      mathematical applications), font variations, CJK compatibility
      ideographs, and so on.  For additional information and the
      separate term "compatibility decomposable character", see the
      Unicode Standard.

      For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for
      compatibility with Asian character sets that include full-width
      and half-width ASCII characters.

      Some efforts in the IETF have concluded that it would be useful
      to support mapping of some groups of compatibility equivalents
      and not others (e.g., supporting or mapping width variations
      while preserving or rejecting mathematical variations).  See the
      IDNA Mapping document [RFC5895] for one example.

4.2.  Differentiation of Subsets

   Especially as existing IETF standards are internationalized, it is
   necessary to describe collections of characters, in particular
   various subsets of Unicode.  Because Unicode includes ways to code
   substantially all characters in contemporary use, subsets of the
   Unicode repertoire can be a useful tool for defining these
   collections as repertoires independent of specific Unicode coding.
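   As a rough illustration of a repertoire that is defined as a subset
   of Unicode and is independent of the encoding in use, consider the
   following minimal Python sketch.  The repertoire chosen here
   (characters whose General Category is Lu) and the helper names
   in_repertoire and all_in_repertoire are assumptions made only for
   this example; no IETF protocol is implied to use them.

```python
import unicodedata


def in_repertoire(ch: str) -> bool:
    """Membership test for a hypothetical repertoire: General Category Lu."""
    return unicodedata.category(ch) == "Lu"


def all_in_repertoire(text: str) -> bool:
    """True if the text is non-empty and every character is in the repertoire."""
    return len(text) > 0 and all(in_repertoire(ch) for ch in text)


# The same two characters (U+00C9 and U+03A9) as received in two encodings.
utf8_octets = b"\xc3\x89\xce\xa9"
utf16be_octets = b"\x00\xc9\x03\xa9"

from_utf8 = utf8_octets.decode("utf-8")
from_utf16 = utf16be_octets.decode("utf-16-be")

# Membership is decided on the decoded characters, not on the octets.
print(from_utf8 == from_utf16)
print(all_in_repertoire(from_utf8), all_in_repertoire(from_utf16))
print(all_in_repertoire("\u00e9"))  # lowercase e with acute
```

   The membership test looks only at characters, so the answer is the
   same whichever encoding carried them.  Because the test relies on a
   Unicode property, its results for newly assigned characters depend
   on the Unicode version of the library in use.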
987 However specific collections are defined, it is important to remember 988 that, while older CCSs such as ASCII and the ISO 8859 family are 989 close-ended and fixed, Unicode is open-ended, with new character 990 definitions, and often new scripts, being added every year or so. 991 So, while, e.g., an ASCII subset, such as "upper case letters", can 992 be specified as a range of code points (4/1 to 5/10 for that 993 example), similar definitions for Unicode either have to be specified 994 in terms of Unicode properties or are very dependent on Unicode 995 versions (and the relevant version must be identified in any 996 specification). See the IDNA code point specification [RFC5892] for 997 an example of specification by combinations of properties. 999 Some terms are commonly used in the IETF to define character ranges 1000 and subsets. Some of these are imprecise and can cause confusion if 1001 not used carefully. 1003 non-ASCII The term "non-ASCII" strictly refers to characters other 1004 than those that appear in the ASCII repertoire, independent of the 1005 CCS or encoding used for them. In practice, if a repertoire such 1006 as that of Unicode is established as context, "non-ASCII" refers 1007 to characters in that repertoire that do not appear in the ASCII 1008 repertoire. "Outside the ASCII repertoire" and "outside the ASCII 1009 range" are practical, and more precise, synonyms for "non-ASCII". 1011 letters The term "letters" does not have an exact equivalent in the 1012 Unicode standard. Letters are generally characters that are used 1013 to write words, but that means very different things in different 1014 languages and cultures. 1016 5. User Interface for Text 1018 Although the IETF does not standardize user interfaces, many 1019 protocols make assumptions about how a user will enter or see text 1020 that is used in the protocol. 
Internationalization challenges 1021 assumptions about the type and limitations of the input and output 1022 devices that may be used with applications that use various 1023 protocols. It is therefore useful to consider how users typically 1024 interact with text that might contain one or more non-ASCII 1025 characters. 1027 input methods 1029 An input method is a mechanism for a person to enter text into an 1030 application. 1032 Text can be entered into a computer in many ways. Keyboards are 1033 by far the most common device used, but many characters cannot be 1034 entered on typical computer keyboards in a single stroke. Many 1035 operating systems come with system software that lets users input 1036 characters outside the range of what is allowed by keyboards. 1038 For example, there are dozens of different input methods for Han 1039 characters in Chinese, Japanese, and Korean. Some start with 1040 phonetic input through the keyboard, while others use the number 1041 of strokes in the character. Input methods are also needed for 1042 scripts that have many diacritics, such as European or Vietnamese 1043 characters that have two or three diacritics on a single 1044 alphabetic character. 1046 The term "input method editor" (IME) is often used generically to 1047 describe the tools and software used to deal with input of 1048 characters on a particular system. 1050 rendering rules 1052 A rendering rule is an algorithm that a system uses to decide how 1053 to display a string of text. 1055 Some scripts can be directly displayed with fonts, where each 1056 character from an input stream can simply be copied from a glyph 1057 system and put on the screen or printed page. Other scripts need 1058 rules that are based on the context of the characters in order to 1059 render text for display. 
      Some examples of these rendering rules include:

      *  Scripts such as Arabic (and many others), where the form of
         the letter changes depending on the adjacent letters, whether
         the letter is standing alone, at the beginning of a word, in
         the middle of a word, or at the end of a word.  The rendering
         rules must choose between two or more glyphs.

      *  Scripts such as the Indic scripts, where consonants may change
         their form if they are adjacent to certain other consonants or
         may be displayed in an order different from the way they are
         stored and pronounced.  The rendering rules must choose
         between two or more glyphs.

      *  Arabic and Hebrew scripts, where the order of the characters
         displayed is changed by the bidirectional properties of the
         alphabetic and other characters and by right-to-left and
         left-to-right ordering marks.  The rendering rules must choose
         the order in which characters are displayed.

      *  Some writing systems cannot have their rendering rules
         suitably defined using mechanisms that are now defined in the
         Unicode Standard.  None of those writing systems is in active
         non-scholarly use today.

      *  Many systems use a special rendering rule when they lack a
         font or other mechanism for rendering a particular character
         correctly.  That rule typically involves substitution of a
         small open box or a question mark for the missing character.
         See "undisplayable character" below.

   graphic symbol

      A graphic symbol is the visual representation of a graphic
      character or of a composite sequence.

   font

      A font is a collection of glyphs used for the visual depiction of
      character data.  A font is often associated with a set of
      parameters (for example, size, posture, weight, and serifness),
      which, when set to particular values, generate a collection of
      imagable glyphs.
1105 The term "font" is often used interchangeably with "typeface". As 1106 historically used in typography, a typeface is a family of one or 1107 more fonts that share a common general design. For example, 1108 "Times Roman" is actually a typeface, with a collection of fonts 1109 such as "Times Roman Bold", "Times Roman Medium", "Times Roman 1110 Italic", and so on. Some sources even consider different type 1111 sizes within a typeface to be different fonts. While those 1112 distinctions are rarely important for internationalization 1113 purposes, there are exceptions. Those writing specifications 1114 should be very careful about definitions in cases in which the 1115 exceptions might lead to ambiguity. 1117 bidirectional display 1119 The process or result of mixing left-to-right oriented text and 1120 right-to-left oriented text in a single line is called 1121 bidirectional display, often abbreviated as "bidi". 1123 Most of the world's written languages are displayed left-to-right. 1124 However, many widely-used written languages such as ones based on 1125 the Hebrew or Arabic scripts are displayed primarily right-to-left 1126 (numerals are a common exception in the modern scripts). Right- 1127 to-left text often confuses protocol writers because they have to 1128 keep thinking in terms of the order of characters in a string in 1129 memory, an order that might be different from what they see on the 1130 screen. (Note that some languages are written both horizontally 1131 and vertically and that some historical ones use other display 1132 orderings.) 1134 Further, bidirectional text can cause confusion because there are 1135 formatting characters in ISO/IEC 10646 that cause the order of 1136 display of text to change. These explicit formatting characters 1137 change the display regardless of the implicit left-to-right or 1138 right-to-left properties of characters. 
Text that might contain 1139 those characters typically requires careful processing before 1140 being sorted or compared for equality. 1142 It is common to see strings with text in both directions, such as 1143 strings that include both text and numbers, or strings that 1144 contain a mixture of scripts. 1146 Unicode has a long and incredibly detailed algorithm for 1147 displaying bidirectional text [UAX9]. 1149 undisplayable character 1151 A character that has no displayable form. 1153 For instance, the zero-width space (U+200B) cannot be displayed 1154 because it takes up no horizontal space. Formatting characters 1155 such as those for setting the direction of text are also 1156 undisplayable. Note, however, that every character in [UNICODE] 1157 has a glyph associated with it, and that the glyphs for 1158 undisplayable characters are enclosed in a dashed square as an 1159 indication that the actual character is undisplayable. 1161 The property of a character that causes it to be undisplayable is 1162 intrinsic to its definition. Undisplayable characters can never 1163 be displayed in normal text (the dashed square notation is used 1164 only in special circumstances). Printable characters whose 1165 Unicode definitions are associated with glyphs that cannot be 1166 rendered on a particular system are not, in this sense, 1167 undisplayable. 1169 6. Text in Current IETF Protocols 1171 Many IETF protocols started off being fully internationalized, while 1172 others have been internationalized as they were revised. In this 1173 process, IETF members have seen patterns in the way that many 1174 protocols use text. This section describes some specific protocol 1175 interactions with text. 1177 protocol elements 1179 Protocol elements are uniquely-named parts of a protocol. 1181 Almost every protocol has named elements, such as "source port" in 1182 TCP. 
      In some protocols, the names of the elements (or text tokens for
      the names) are transmitted within the protocol.  For example, in
      SMTP and numerous other IETF protocols, the names of the verbs
      are part of the command stream.  The names are thus part of the
      protocol standard.  The names of protocol elements are not
      normally seen by end users and it is rarely appropriate to
      internationalize protocol element names (even while the elements
      themselves can be internationalized).

   name spaces

      A name space is the set of valid names for a particular item, or
      the syntactic rules for generating these valid names.

      Many items in Internet protocols use names to identify specific
      instances or values.  The names may be generated (by some
      prescribed rules), registered centrally (e.g., with IANA), or
      have a distributed registration and control mechanism, such as
      the names in the DNS.

   on-the-wire encoding

      The encoding and decoding used before and after transmission over
      the network is often called the "on-the-wire" (or sometimes just
      "wire") format.

      Characters are identified by code points.  Before being
      transmitted in a protocol, they must first be encoded as bits and
      octets.  Similarly, when characters are received in a
      transmission, they have been encoded, and a protocol that needs
      to process the individual characters needs to decode them before
      processing.

   parsed text

      Text strings that are analyzed for subparts.

      In some protocols, free text in text fields might be parsed.  For
      example, many mail user agents (MUAs) will parse the words in the
      text of the Subject: field to attempt to thread based on what
      appears after the "Re:" prefix.

      Such conventions are very sensitive to localization.
If, for 1225 example, a form like "Re:" is altered by an MUA to reflect the 1226 language of the sender or recipient, a system that subsequently 1227 does threading may not recognize the replacement term as a 1228 delimiter string. 1230 charset identification 1232 Specification of the charset used for a string of text. 1234 Protocols that allow more than one charset to be used in the same 1235 place should require that the text be identified with the 1236 appropriate charset. Without this identification, a program 1237 looking at the text cannot definitively discern the charset of the 1238 text. Charset identification is also called "charset tagging". 1240 language identification 1242 Specification of the human language used for a string of text. 1243 1245 Some protocols (such as MIME and HTTP) allow text that is meant 1246 for machine processing to be identified with the language used in 1247 the text. Such identification is important for machine processing 1248 of the text, such as by systems that render the text by speaking 1249 it. Language identification is also called "language tagging". 1250 The IETF "LTRU" standards [RFC5646] and [RFC4647] provide a 1251 comprehensive model for language identification. 1253 MIME 1255 MIME (Multipurpose Internet Mail Extensions) is a message format 1256 that allows for textual message bodies and headers in character 1257 sets other than US-ASCII in formats that require ASCII (most 1258 notably RFC 5322, the standard for Internet mail headers 1259 [RFC5322]). MIME is described in RFCs 2045 through 2049, as well 1260 as more recent RFCs. 1262 transfer encoding syntax 1264 A transfer encoding syntax (TES) (sometimes called a transfer 1265 encoding scheme) is a reversible transform of already-encoded data 1266 that is represented in one or more character encoding schemes. 
      TESs are useful for encoding types of character data into another
      format, usually for allowing new types of data to be transmitted
      over legacy protocols.  The main examples of TESs used in the
      IETF include Base64 and quoted-printable.  MIME identifies the
      transfer encoding syntax for body parts as a
      Content-transfer-encoding, occasionally abbreviated C-T-E.

   Base64

      Base64 is a transfer encoding syntax that allows binary data to
      be represented by the ASCII characters A through Z, a through z,
      0 through 9, +, /, and =.  It is defined in [RFC2045].

   quoted printable

      Quoted printable is a transfer encoding syntax that allows
      strings that have non-ASCII characters mixed in with mostly ASCII
      printable characters to be somewhat human readable.  It is
      described in [RFC2047].

      The quoted printable syntax is generally considered to be a
      failure at being readable.  It is jokingly referred to as "quoted
      unreadable".

   XML

      XML (which is an approximate abbreviation for Extensible Markup
      Language) is a popular method for structuring text.  XML text
      that is not encoded as UTF-8 is explicitly tagged with charsets,
      and all text in XML consists only of Unicode characters.  The
      specification for XML can be found at .

   ASN.1 text formats

      The ASN.1 data description language has many formats for text
      data.  The formats allow for different repertoires and different
      encodings.  Some of the formats that appear in IETF standards
      based on ASN.1 include IA5String (all ASCII characters),
      PrintableString (most ASCII characters, but missing many
      punctuation characters), BMPString (characters from ISO/IEC 10646
      plane 0 in UTF-16BE format), UTF8String (just as the name
      implies), and TeletexString (also called T61String).
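      The transfer encoding syntax, Base64, and quoted printable
      entries above can be illustrated with a minimal Python sketch.
      It assumes UTF-8 as the character encoding scheme and uses the
      base64 and quopri modules of the Python standard library; the
      sample string is arbitrary and the variable names are
      hypothetical.

```python
import base64
import quopri

text = "caf\u00e9"  # three ASCII letters followed by e with acute

# Step 1: a character encoding scheme (UTF-8 here) turns characters
# into octets.
octets = text.encode("utf-8")

# Step 2: a transfer encoding syntax turns those already-encoded octets
# into ASCII-only data.
as_base64 = base64.b64encode(octets)
as_qp = quopri.encodestring(octets)

print(as_base64)  # b'Y2Fmw6k='
print(as_qp)      # ASCII letters stay readable; other octets appear as =XX

# Both transforms are reversible, which is what makes them TESs.
assert base64.b64decode(as_base64) == octets
assert quopri.decodestring(as_qp) == octets
```

      The sketch shows why a TES is described as operating on
      already-encoded data: both transforms take octets, not
      characters, so the charset still has to be identified separately
      for the receiver to recover the text.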
1313 ASCII-compatible encoding (ACE) 1315 Starting in 1996, many ASCII-compatible encoding schemes (which 1316 are actually transfer encoding syntaxes) have been proposed as 1317 possible solutions for internationalizing host names and some 1318 other purposes. Their goal is to be able to encode any string of 1319 ISO/IEC 10646 characters using the preferred syntax for domain 1320 names (as described in STD 13). At the time of this writing, only 1321 the ACE encoding produced by Punycode [RFC3492] has become an IETF 1322 standard. 1324 The choice of ACE forms to internationalize legacy protocols must 1325 be made with care as it can cause some difficult side effects 1326 [RFC6055]. 1328 LDH label 1330 The classical label form used in the DNS and most applications 1331 that call on it, albeit with some additional restrictions, 1332 reflects the early syntax of "hostnames" [RFC0952] and limits 1333 those names to ASCII letters, digits, and embedded hyphens. The 1334 hostname syntax is identical to that described as the "preferred 1335 name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by 1336 RFC 1123 [RFC1123]. LDH labels are defined in a more restrictive 1337 and precise way for internationalization contexts as part of the 1338 IDNA2008 specification [RFC5890]. 1340 7. Terms Associated with Internationalized Domain Names 1342 7.1. IDNA Terminology 1344 The current specification for Internationalized Domain Names (IDNs), 1345 known formally as Internationalized Domain Names for Applications or 1346 IDNA, is referred to in the IETF and parts of the broader community 1347 as "IDNA2008" and consists of several documents. Section 2.3 of the 1348 first of those documents, commonly known as "IDNA2008 Definitions" 1349 [RFC5890] provides definitions and introduces some specialized terms 1350 for differentiating among types of DNS labels in an IDN context. 1351 Those terms are listed in the table below; see RFC 5890 for the 1352 specific definitions if needed. 
1354 ACE Prefix 1355 A-label 1356 Domain Name Slot 1357 IDNA-valid string 1358 Internationalized Domain Name 1359 Internationalized Label 1360 LDH Label 1361 NR-LDH label 1362 U-label 1364 Two additional terms entered the IETF's vocabulary as part of the 1365 earlier IDN effort [RFC3490] (IDNA2003): 1367 Stringprep 1369 Stringprep [RFC3454] provides a model and character tables for 1370 preparing and handling internationalized strings. It was used 1371 in the original IDN specification (IDNA2003) via a profile 1372 called "Nameprep" [RFC3491]. It is no longer in use in IDNA, 1373 but continues to be used in profiles by a number of other 1374 protocols. 1376 Punycode 1378 This is the name of the algorithm [RFC3492] used to convert 1379 otherwise-valid IDN labels from native-character strings 1380 expressed in Unicode to an ASCII-compatible encoding (ACE). 1381 Strictly speaking, the term applies to the algorithm only. In 1382 practice, it is widely, if erroneously, used to refer to 1383 strings that the algorithm encodes. 1385 7.2. Character Relationships and Variants 1387 The term "variant" was introduced into the IETF i18n vocabulary with 1388 the JET recommendations [RFC3743]. As used there, it referred 1389 strictly to the relationship between Traditional Chinese characters 1390 and their Simplified equivalents. The JET recommendations provided a 1391 model for identifying these pairs of characters and labels that used 1392 them. Specific recommendations for variant handling for the Chinese 1393 language were provided in a follow-up document [RFC4713]. 1395 In more recent years, the term has also been used to describe other 1396 collections of characters or strings that might be perceived as 1397 equivalent. Those collections have involved one or more of several 1398 categories of characters and labels containing them including: 1400 o "visually similar" or "visually confusable" characters.
These may 1401 be limited to characters in different scripts, characters in a 1402 single script, or both, and may be those that can appear to be 1403 alike even when high-distinguishability reference fonts are used 1404 or under various circumstances that may involve malicious choices 1405 of typefaces or other ways to trick user perception. Trivial 1406 examples include ASCII "l" and "1" and Latin and Cyrillic "a". 1408 o Characters assigned more than one Unicode code point because of 1409 some special property. These characters may be considered "the 1410 same" for some purposes and different for others (or by other 1411 users). One of the most commonly-cited examples is the Arabic 1412 YEH, which has different presentation forms when used with the 1413 Arabic language and some other languages that use Arabic script. 1414 Another example is the pair of Greek lower case sigma and final sigma: if 1415 the latter were viewed purely as a positional presentation 1416 variation on the former, it should not have been assigned a 1417 separate code point. 1419 o Numerals and labels including them. Unlike letters, the "meaning" 1420 of decimal digits is clear and unambiguous regardless of the 1421 script with which they are associated. Some scripts are routinely 1422 used almost interchangeably with European digits and digits native 1423 to that script. Arabic script has two sets of digits (U+0660..U+0669 1424 and U+06F0..U+06F9), written identically for zero through 1425 three and seven through nine but differently for four through six; 1426 European digits predominate in other areas. Substitution of 1427 digits with the same numeric value in labels may give rise to 1428 another type of variant. 1430 o Orthographic differences within a language. Many languages have 1431 alternate choices of spellings or spellings that differ by locale. 1432 Users of those languages generally recognize the spellings as 1433 equivalent, at least as much so as the variations described above.
1434 Examples include "color" and "colour" in English, German words 1435 spelled with o-umlaut or "oe", and so on. 1437 The term "variant" as used in this section should also not be 1438 confused with other uses of the term in this document or in Unicode 1439 terminology (e.g., those in Section 4.1 above). If the term is to be 1440 used at all, context should clearly distinguish among these different 1441 uses and, in particular, between variant characters and variant 1442 labels. Local text should identify which meaning, or combination of 1443 meanings, is intended. 1445 8. Other Common Terms In Internationalization 1447 This is a hodge-podge of other terms that have appeared in 1448 internationalization discussions in the IETF. It is likely that 1449 additional terms will be added as this document matures. 1451 locale 1453 Locale is the user-specific location and cultural information 1454 managed by a computer. 1456 Because languages and orthographic conventions differ from country 1457 to country (and even region to region within a country), the 1458 locale of the user can often be an important factor. Typically, 1459 the locale information for a user includes the language(s) used. 1461 Locale issues go beyond character use, and can include things such 1462 as the display format for currency, dates, and times. Some 1463 locales (especially the popular "C" and "POSIX" locales) do not 1464 include language information. 1466 It should be noted that there are many thorny, unsolved issues 1467 with locale. For example, should text be viewed using the locale 1468 information of the person who wrote the text or the person viewing 1469 it? What if the person viewing it is travelling to different 1470 locations? Should only some of the locale information affect 1471 creation and editing of text?
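The point that locale affects the display format of dates and numbers, not just character use, can be sketched as follows. The CONVENTIONS table, its two entries, and the helper names format_date and format_number are hypothetical and hand-written for this illustration; real systems obtain such conventions from locale databases, and this sketch deliberately avoids depending on whichever locales happen to be installed on the host.

```python
from datetime import date

# Hypothetical, hand-written conventions for two locales (illustration only).
CONVENTIONS = {
    "en_US": {"date": "%m/%d/%Y", "decimal": ".", "group": ","},
    "de_DE": {"date": "%d.%m.%Y", "decimal": ",", "group": "."},
}

def format_date(d, locale_name):
    """Render a date using the date pattern of the named locale."""
    return d.strftime(CONVENTIONS[locale_name]["date"])

def format_number(value, locale_name):
    """Render a number with two decimals and locale-specific separators."""
    conv = CONVENTIONS[locale_name]
    neutral = f"{value:,.2f}"  # e.g. "1,234.50", with "," and "." as placeholders
    # Replace both placeholders in a single pass so they cannot clobber each other.
    table = str.maketrans({",": conv["group"], ".": conv["decimal"]})
    return neutral.translate(table)

when = date(2011, 4, 21)
for name in ("en_US", "de_DE"):
    print(name, format_date(when, name), format_number(1234.5, name))
```

The same date and the same number produce different text for each locale, which is why a protocol that carries such values as display strings, without saying whose locale applies, runs into the unsolved questions listed above.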
1473 Latin characters 1475 "Latin characters" is an imprecise term for characters 1476 historically related to ancient Greek script as modified in the 1477 Roman Republic and Empire and currently used throughout the world. 1478 1480 The base Latin characters are a subset of the ASCII repertoire and 1481 have been augmented by many single and multiple diacritics and 1482 quite a few other characters. ISO/IEC 10646 encodes the Latin 1483 characters in ranges including U+0020..U+024F and U+1E00..U+1EFF. 1485 Because "Latin characters" is used in different contexts to refer 1486 to the letters from the ASCII repertoire, the subset of those 1487 characters used late in the Roman Republic period or the different 1488 subset used to write Latin in medieval times, the entire ASCII 1489 repertoire, all of the code points in the extended Latin script as 1490 defined by Unicode, and other collections, the term should be 1491 avoided in IETF specifications when possible. Similarly, "Basic 1492 Latin" should not be used as a synonym for "ASCII". 1494 romanization 1496 The transliteration of a non-Latin script into Latin characters. 1497 1499 Because of the widespread use of Latin characters, people have 1500 tried to represent many languages that are not based on a Latin 1501 repertoire in Latin. For example, there are two popular 1502 romanizations of Chinese: Wade-Giles and Pinyin, the latter of 1503 which is by far more common today. Many romanization systems are 1504 inexact and do not give perfect round trip mappings between the 1505 native script and the Latin characters. 1507 CJK characters and Han characters 1509 The ideographic characters used in Chinese, Japanese, Korean, and 1510 traditional Vietnamese writing systems are often called 'CJK 1511 characters' after the initial letters of the language names in 1512 English. They are also called "Han characters", after the term in 1513 Chinese that is often used for these characters.
1515 Note that Han characters do not include the phonetic characters 1516 used in the Japanese and Korean languages. Users of the term "CJK 1517 characters" may or may not assume those additional characters are 1518 included. 1520 In ISO/IEC 10646, the Han characters were "unified", meaning that 1521 each set of Han characters from Japanese, Chinese, and/or Korean 1522 that had the same origin was assigned a single code point. The 1523 positive result of this was that many fewer code points were 1524 needed to represent Han; the negative result of this was that 1525 characters that people who write the three languages think are 1526 different have the same code point. There is a great deal of 1527 disagreement on the nature, the origin, and the severity of the 1528 problems caused by Han unification. 1530 translation 1532 The process of conveying the meaning of some passage of text in 1533 one language, so that it can be expressed equivalently in another 1534 language. 1536 Many language translation systems are inexact and cannot be 1537 applied repeatedly to go from one language to another to another. 1539 transliteration 1541 The process of representing the characters of an alphabetical or 1542 syllabic system of writing by the characters of a conversion 1543 alphabet. 1545 Many script transliterations are exact, and many have perfect 1546 round-trip mappings. The notable exception to this is 1547 romanization, described above. Transliteration involves 1548 converting text expressed in one script into another script, 1549 generally on a letter-by-letter basis. There are many official 1550 and unofficial transliteration standards, most notably those from 1551 ISO TC 46 and the U.S. Library of Congress. 1553 transcription 1555 The process of systematically writing the sounds of some passage 1556 of spoken language, generally with the use of a technical phonetic 1557 alphabet (usually Latin-based) or other systematic transcriptional 1558 orthography. 
Transcription also sometimes refers to the 1559 conversion of written text into a transcribed form, based on the 1560 sound of the text as if it had been spoken. 1562 Unlike transliterations, which are generally designed to be round- 1563 trip convertible, transcriptions of written material are almost 1564 never round-trip convertible to their original form, at least 1565 without some supplemental information. 1567 regular expressions 1569 Regular expressions provide a mechanism to select specific strings 1570 from a set of character strings. Regular expressions are a 1571 language used to search for text within strings, and possibly 1572 modify the text found with other text. 1574 Pattern matching for text involves being able to represent one or 1575 more code points in an abstract notation, such as searching for 1576 all capital Latin letters or all punctuation. The most common 1577 mechanism in IETF protocols for naming such patterns is the use of 1578 regular expressions. There is no single regular expression 1579 language, but there are numerous very similar dialects that are 1580 not quite consistent with each other. 1582 The Unicode Consortium has a good discussion about how to adapt 1583 regular expression engines to use Unicode. [UTR18] 1585 private use 1587 ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to 1588 U+FFFFD, and U+100000 to U+10FFFD are available for private use. 1589 This refers to code points of the standard whose interpretation is 1590 not specified by the standard and whose use may be determined by 1591 private agreement among cooperating users. 1593 The use of these "private use" characters is defined by the 1594 parties who transmit and receive them, and is thus not appropriate 1595 for standardization. (The IETF has a long history of private use 1596 names for things such as "x-" names in MIME types, charsets, and 1597 languages. 
Most of the experience with these has been quite 1598 negative, with many implementors assuming that private use names 1599 are in fact public and long-lived.) 1601 9. Security Considerations 1603 Security is not discussed directly in this document. While the 1604 definitions here have no direct effect on security, they are used in 1605 many security contexts. For example, authentication usually involves 1606 comparing two tokens, and one or both of those tokens might be text; 1607 thus, some methods of comparison might involve using some of the 1608 internationalization concepts for which terms are defined in this 1609 document. 1611 Having said that, other RFCs dealing with internationalization have 1612 security consideration descriptions that may be useful to the reader 1613 of this document. In particular, the security considerations in RFC 1614 3454, RFC 3629, RFC 4013, and RFC 5890 go into a fair amount of 1615 detail. 1617 10. References 1619 10.1. Normative References 1621 [ISOIEC10646] 1622 ISO/IEC, "ISO/IEC 10646-1:2003. International Standard -- 1623 Information technology - Universal Multiple-Octet Coded 1624 Character Set (UCS)", 2003. 1626 [UNICODE] The Unicode Consortium, "The Unicode Standard, Version 1627 5.2.0", Mountain View, CA: The Unicode Consortium, 1628 2009. ISBN 978-1-936213-00-9)., 2010, 1629 . 1631 10.2. Informative References 1633 [CHARMOD] W3C, "Character Model for the World Wide Web 1.0", 2005, 1634 . 1636 [FRAMEWORK] 1637 ISO/IEC, "ISO/IEC TR 11017:1997(E). Information technology 1638 - Framework for internationalization, prepared by ISO/IEC 1639 JTC 1/SC 22/WG 20", 1997. 1641 [ISO3166] ISO, "ISO 3166-1:2006 - Codes for the representation of 1642 names of countries and their subdivisions -- Part 1: 1643 Country codes", 2006. 1645 [ISO639] ISO, "ISO 639-1:2002 - Code for the representation of 1646 names of languages - Part 1: Alpha-2 code", 2002. 1648 [RFC0952] Harrenstien, K., Stahl, M., and E.
Feinler, "DoD Internet 1649 host table specification", RFC 952, October 1985. 1651 [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", 1652 STD 13, RFC 1034, November 1987. 1654 [RFC1123] Braden, R., "Requirements for Internet Hosts - Application 1655 and Support", STD 3, RFC 1123, October 1989. 1657 [RFC2045] Freed, N. and N. Borenstein, "Multipurpose Internet Mail 1658 Extensions (MIME) Part One: Format of Internet Message 1659 Bodies", RFC 2045, November 1996. 1661 [RFC2047] Moore, K., "MIME (Multipurpose Internet Mail Extensions) 1662 Part Three: Message Header Extensions for Non-ASCII Text", 1663 RFC 2047, November 1996. 1665 [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and 1666 Languages", BCP 18, RFC 2277, January 1998. 1668 [RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 1669 10646", RFC 2781, February 2000. 1671 [RFC2978] Freed, N. and J. Postel, "IANA Charset Registration 1672 Procedures", BCP 19, RFC 2978, October 2000. 1674 [RFC3454] Hoffman, P. and M. Blanchet, "Preparation of 1675 Internationalized Strings ("stringprep")", RFC 3454, 1676 December 2002. 1678 [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, 1679 "Internationalizing Domain Names in Applications (IDNA)", 1680 RFC 3490, March 2003. 1682 [RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1683 Profile for Internationalized Domain Names (IDN)", 1684 RFC 3491, March 2003. 1686 [RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode 1687 for Internationalized Domain Names in Applications 1688 (IDNA)", RFC 3492, March 2003. 1690 [RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 1691 10646", STD 63, RFC 3629, November 2003. 1693 [RFC3743] Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint 1694 Engineering Team (JET) Guidelines for Internationalized 1695 Domain Names (IDN) Registration and Administration for 1696 Chinese, Japanese, and Korean", RFC 3743, April 2004. 1698 [RFC4647] Phillips, A. and M. 
Davis, "Matching of Language Tags", 1699 BCP 47, RFC 4647, September 2006. 1701 [RFC4713] Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin, 1702 "Registration and Administration Recommendations for 1703 Chinese Domain Names", RFC 4713, October 2006. 1705 [RFC5137] Klensin, J., "ASCII Escaping of Unicode Characters", 1706 BCP 137, RFC 5137, February 2008. 1708 [RFC5198] Klensin, J. and M. Padlipsky, "Unicode Format for Network 1709 Interchange", RFC 5198, March 2008. 1711 [RFC5322] Resnick, P., Ed., "Internet Message Format", RFC 5322, 1712 October 2008. 1714 [RFC5646] Phillips, A. and M. Davis, "Tags for Identifying 1715 Languages", BCP 47, RFC 5646, September 2009. 1717 [RFC5890] Klensin, J., "Internationalized Domain Names for 1718 Applications (IDNA): Definitions and Document Framework", 1719 RFC 5890, August 2010. 1721 [RFC5892] Faltstrom, P., "The Unicode Code Points and 1722 Internationalized Domain Names for Applications (IDNA)", 1723 RFC 5892, August 2010. 1725 [RFC5895] Resnick, P. and P. Hoffman, "Mapping Characters for 1726 Internationalized Domain Names in Applications (IDNA) 1727 2008", RFC 5895, September 2010. 1729 [RFC6055] Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on 1730 Encodings for Internationalized Domain Names", RFC 6055, 1731 February 2011. 1733 [UAX34] The Unicode Consortium, "Unicode Standard Annex #34: 1734 Unicode Named Character Sequences", 2010, 1735 . 1737 [UAX9] The Unicode Consortium, "Unicode Standard Annex #9: 1738 Unicode Bidirectional Algorithm", 2010, 1739 . 1741 [US-ASCII] 1742 ANSI, "Coded Character Set -- 7-bit American Standard Code 1743 for Information Interchange, ANSI X3.4-1986", 1986. 1745 [UTN6] The Unicode Consortium, "Unicode Technical Note #6: 1746 BOCU-1: MIME-Compatible Unicode Compression", 2006, 1747 . 1749 [UTR15] The Unicode Consortium, "Unicode Standard Annex #15: 1750 Unicode Normalization Forms", 2010, 1751 .
1753 [UTR18] The Unicode Consortium, "Unicode Standard Annex #18: 1754 Unicode Regular Expressions", 2008, 1755 . 1757 [UTR22] The Unicode Consortium, "Unicode Technical Standard #22: 1758 Unicode Character Mapping Markup Language", 2009, 1759 . 1761 [UTR6] The Unicode Consortium, "Unicode Technical Standard #6: A 1762 Standard Compression Scheme for Unicode", 2005, 1763 . 1765 Appendix A. Additional Interesting Reading 1767 [[anchor20: RFC Editor: should these be standardized into your normal 1768 reference format??]] 1770 ALA-LC Romanization Tables, Randall Barry (ed.), U.S. Library of 1771 Congress, 1997, ISBN 0844409405 1773 The Alphabetic Labyrinth: The Letters in History and Imagination, 1774 Johanna Drucker, Thames and Hudson Ltd, 1995, ISBN 0-500-28068-1 1776 Blackwell Encyclopedia of Writing Systems, Florian Coulmas, Blackwell 1777 Publishers, 1999, ISBN 063121481X 1779 Chinese Calligraphy, Edoardo Fazzioli, Abbeville Press, 1986, 1987 1780 (English translation), ISBN 0-89659-774-1 1782 The Chinese Language: Fact and Fantasy, John DeFrancis, University of 1783 Hawaii Press, 1984, ISBN 0-8284-085505 and 0-8248-1058-6 1785 CJKV Information Processing, Ken Lunde, O'Reilly & Assoc., 1999, ISBN 1786 1-56592-224-7 1788 Dictionary of Languages: The Definitive Reference to More than 400 1789 Languages, Andrew Dalby, 2004, ISBN 978-0231115698 1791 Language Visible, David Sacks, Bantam Dell, 2003. Also published as 1792 Letter Perfect: The Marvelous History of Our Alphabet From A to Z, 1793 Broadway, 2004, ISBN 978-0767911733 1795 Reading the Past: Ancient Writing from Cuneiform to the Alphabet, 1796 introduction by J.T. 
Hooker, British Museum Press, 1990, ISBN 0-7141- 1797 8077-7 1799 The Story of Writing: Alphabets, Hieroglyphs, & Pictograms, Andrew 1800 Robinson, Thames and Hudson, 1995, 2000, ISBN 0-500-28156-4 1802 The World's Writing Systems, Peter Daniels and William Bright, Oxford 1803 University Press, 1996, ISBN 0195079930 1805 Writing Systems of the World, Akira Nakanishi, Charles E. Tuttle 1806 Company, 1980, ISBN 0804816549 1808 Appendix B. Acknowledgements 1810 The definitions in this document come from many sources, including a 1811 wide variety of IETF documents. 1813 James Seng contributed to the initial outline of RFC 3536. Harald 1814 Alvestrand and Martin Duerst made extensive useful comments on early 1815 versions. Others who contributed to the development of RFC 3536 1816 include Dan Kohn, Jacob Palme, Johan van Wingen, Peter Constable, 1817 Yuri Demchenko, Susan Harris, Zita Wenzel, John Klensin, Henning 1818 Schulzrinne, Leslie Daigle, Markus Scherer, and Ken Whistler. 1820 Frank Ellermann, Antonio Marko, Tim Bray, and others identified 1821 important issues with this new version. 1823 Appendix C. Changes from RFC 3536 1825 NOTE: This appendix is still quite sketchy. It won't be finalized 1826 until later in the life of the document. 1828 This document mostly consists of additions to RFC 3536. The terms 1829 added in this document are: 1831 o New Section 4.2 and associated definitions. 1833 o Commonly-used synonyms added to several descriptions and indexed. 1835 o ... 1837 In addition, the following changes were made: 1839 o Minor edits were made to some section titles and a number of other 1840 editorial improvements were made. 1842 o The discussion of control codes was updated to include additional 1843 information and clarify that "control code" and "control 1844 character" are synonyms. 1846 o Many terms were clarified to reflect contemporary usage. 
1848 o The index to terms by section in RFC 3536 was replaced by an index 1849 to pages containing considerably more terms. 1851 o The acknowledgments were updated. 1853 o Some of the references were updated. 1855 o The supplemental reading list was expanded somewhat. 1857 There is still much to do before this document becomes an RFC. 1858 Intended changes include: 1860 o Adding some of the terms from IDNA2008. 1862 o Adding terms from other RFCs that relate to internationalization. 1864 o More updating of references. 1866 Appendix D. Changes Between Versions of this Draft 1868 [[anchor24: RFC Editor: Please remove this section.]] 1870 D.1. Changes in version -01 1872 o Changed RFC 4646 reference to 5646 (thanks to Doug Ewell) 1874 o Added a new comment about rendering rules and languages. 1876 o New section on IDNA terms 1878 o New section on variants 1880 o Several small errors corrected, tidbits added, and additional 1881 items indexed. 1883 Index 1885 A 1886 A-label 1 1887 ACE 1 1888 ACE Prefix 1 1889 alphabetic 1 1890 ANSI 1 1891 ASCII 1 1892 ASCII-compatible encoding 1 1893 ASN.1 text formats 1 1895 B 1896 Base64 1 1897 Basic Multilingual Plane 1 1898 bidi 1 1899 bidirectional display 1 1900 BMP 1 1901 BMPString 1 1902 BOCU-1 1 1903 BOM 1 1904 byte order mark 1 1906 C 1907 C-T-E 1 1908 case 1 1909 CCS 1 1910 CEN/ISSS 1 1911 character 1 1912 character encoding form 1 1913 character encoding scheme 1 1914 character repertoire 1 1915 charset 1 1916 charset identification 1 1917 CJK characters 1 1918 code chart 1 1919 code point 1 1920 code table 1 1921 coded character 1 1922 coded character set 1 1923 collation 1 1924 combining character 1 1925 combining character sequence 1 1926 compatibility character 1 1927 compatibility variant 1 1928 composite sequence 1 1929 content-transfer-encoding 1 1930 control character 1 1932 D 1933 decomposed character 1 1934 diacritic 1 1935 displaying and rendering text 1 1936 Domain Name Slot 1 1938 F 1939 font 1 1940 formatting 
character 1 1942 G 1943 glyph 1 1944 glyph code 1 1945 graphic symbol 1 1947 H 1948 Han characters 1 1950 I 1951 i10n 1 1952 i18n 1 1953 IA5String 1 1954 ideographic 1 1955 IDN 1 1956 IDNA 1 1957 IDNA-valid string 1 1958 IDNA2003 1 1959 IDNA2008 1 1960 IME 1 1961 input method editor 1 1962 input methods 1 1963 internationalization 1 1964 Internationalized Domain Name 1 1965 Internationalized domain names 1 1966 Internationalized Label 1 1967 ISO 1 1968 ISO 639 1 1969 ISO 3166 1 1970 ISO 8859 1 1971 ISO TC 46 1 1973 J 1974 JIS 1 1975 JTC 1 1 1977 L 1978 language 1 1979 language identification 1 1980 Latin characters 1 1981 LDH Label 1 1982 letters 1 1983 Local and regional standards organizations 1 1984 locale 1 1985 localization 1 1987 M 1988 MIME 1 1989 multilingual 1 1991 N 1992 name spaces 1 1993 Nameprep 1 1994 NFC 1 1995 NFD 1 1996 NFKC 1 1997 NFKD 1 1998 non-ASCII 1 1999 nonspacing character 1 2000 normalization 1 2001 NR-LDH label 1 2002 NVT 1 2004 O 2005 on-the-wire encoding 1 2007 P 2008 parsed text 1 2009 precomposed character 1 2010 PrintableString 1 2011 private use 1 2012 protocol elements 1 2013 punctuation 1 2014 Punycode 1 2016 Q 2017 quoted-printable 1 2019 R 2020 regular expressions 1 2021 rendering rules 1 2022 repertoire 1 2023 romanization 1 2025 S 2026 SAC 1 2027 script 1 2028 SCSU 1 2029 sorting 1 2030 Stringprep 1 2031 symbol 1 2033 T 2034 T61String 1 2035 TeletexString 1 2036 TES 1 2037 transcoding 1 2038 transcription 1 2039 transfer encoding syntax 1 2040 translation 1 2041 transliteration 1 2042 typeface 1 2044 U 2045 U-label 1 2046 UCS-2 1 2047 UCS-4 1 2048 undisplayable character 1 2049 Unicode Consortium 1 2050 US-ASCII 1 2051 UTC 1 2052 UTF-8 1 2053 UTF-16 1 2054 UTF-16BE 1 2055 UTF-16LE 1 2056 UTF-32 1 2057 UTF8String 1 2059 V 2060 variant 1 2062 W 2063 W3C 1 2064 World Wide Web Consortium 1 2065 writing system 1 2067 X 2068 XML 1 2070 Authors' Addresses 2072 Paul Hoffman 2073 VPN Consortium 2075 Email: paul.hoffman@vpnc.org 2077 
John C Klensin 2078 1770 Massachusetts Ave, Ste 322 2079 Cambridge, MA 02140 2080 USA 2082 Phone: +1 617 245 1457 2083 Email: john+ietf@jck.com