idnits 2.17.1 

draft-ietf-appsawg-rfc3536bis-06.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  -- The draft header indicates that this document obsoletes RFC3536, but the
     abstract doesn't seem to mention this, which it should.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 9, 2011) is 4672 days in the past.  Is this
     intentional?


  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISOIEC10646'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'UNICODE'

  -- Obsolete informational reference (is this intentional?): RFC 3454
     (Obsoleted by RFC 7564)

  -- Obsolete informational reference (is this intentional?): RFC 3490
     (Obsoleted by RFC 5890, RFC 5891)

  -- Obsolete informational reference (is this intentional?): RFC 3491
     (Obsoleted by RFC 5891)


     Summary: 0 errors (**), 0 flaws (~~), 1 warning (==), 7 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network Working Group                                         P. Hoffman
3	Internet-Draft                                            VPN Consortium
4	Obsoletes: 3536 (if approved)                                 J. Klensin
5	Intended status: BCP                                        July 9, 2011
6	Expires: January 10, 2012

8	          Terminology Used in Internationalization in the IETF
9	                    draft-ietf-appsawg-rfc3536bis-06

11	Abstract

13	   This document provides a list of terms used in the IETF when
14	   discussing internationalization.  The purpose is to help frame
15	   discussions of internationalization in the various areas of the IETF
16	   and to help introduce the main concepts to IETF participants.

18	Status of this Memo

20	   This Internet-Draft is submitted in full conformance with the
21	   provisions of BCP 78 and BCP 79.

23	   Internet-Drafts are working documents of the Internet Engineering
24	   Task Force (IETF).  Note that other groups may also distribute
25	   working documents as Internet-Drafts.  The list of current Internet-
26	   Drafts is at http://datatracker.ietf.org/drafts/current/.

28	   Internet-Drafts are draft documents valid for a maximum of six months
29	   and may be updated, replaced, or obsoleted by other documents at any
30	   time.  It is inappropriate to use Internet-Drafts as reference
31	   material or to cite them other than as "work in progress."

33	   This Internet-Draft will expire on January 10, 2012.

35	Copyright Notice

37	   Copyright (c) 2011 IETF Trust and the persons identified as the
38	   document authors.  All rights reserved.

40	   This document is subject to BCP 78 and the IETF Trust's Legal
41	   Provisions Relating to IETF Documents
42	   (http://trustee.ietf.org/license-info) in effect on the date of
43	   publication of this document.  Please review these documents
44	   carefully, as they describe your rights and restrictions with respect
45	   to this document.  Code Components extracted from this document must
46	   include Simplified BSD License text as described in Section 4.e of
47	   the Trust Legal Provisions and are provided without warranty as
48	   described in the Simplified BSD License.

50	Table of Contents

52	   1.  Introduction
53	     1.1.  Purpose of this Document
54	     1.2.  Format of the Definitions in this Document
55	     1.3.  Normative Terminology
56	   2.  Fundamental Terms
57	   3.  Standards Bodies and Standards
58	     3.1.  Standards bodies
59	     3.2.  Encodings and Transformation Formats of ISO/IEC 10646
60	     3.3.  Native CCSs and charsets
61	   4.  Character Issues
62	     4.1.  Types of Characters
63	     4.2.  Differentiation of Subsets
64	   5.  User Interface for Text
65	   6.  Text in Current IETF Protocols
66	   7.  Terms Associated with Internationalized Domain Names
67	     7.1.  IDNA Terminology
68	     7.2.  Character Relationships and Variants
69	   8.  Other Common Terms In Internationalization
70	   9.  Security Considerations
71	   10. IANA Considerations
72	   11. References
73	     11.1. Normative References
74	     11.2. Informative References
75	   Appendix A.  Additional Interesting Reading
76	   Appendix B.  Acknowledgements
77	   Appendix C.  Significant Changes from RFC 3536
78	   Index
79	   Authors' Addresses

81	1.  Introduction

83	   As the IETF Character Set Policy specification [RFC2277] summarizes:
84	   "Internationalization is for humans.  This means that protocols are
85	   not subject to internationalization; text strings are."  Many
86	   protocols throughout the IETF use text strings that are entered by,
87	   or are visible to, humans.  It should be possible for anyone to enter
88	   or read these text strings, which means that Internet users must be
89	   able to enter text using typical input methods and to have it
90	   displayed in any human language.  Further, text containing any
91	   character should be able to be passed between Internet applications
92	   easily.  This is the challenge of internationalization.

94	1.1.  Purpose of this Document

96	   This document provides a glossary of terms used in the IETF when
97	   discussing internationalization.  The purpose is to help frame
98	   discussions of internationalization in the various areas of the IETF
99	   and to help introduce the main concepts to IETF participants.

101	   Internationalization is discussed in many working groups of the IETF.
102	   However, few working groups have internationalization experts.  When
103	   designing or updating protocols, the question often comes up "should
104	   we internationalize this?" (or, more likely, "do we have to
105	   internationalize this?").

107	   This document gives an overview of internationalization terminology
108	   as it applies to IETF standards work by lightly covering the many
109	   aspects of internationalization and the vocabulary associated with
110	   those topics.  Some of the overview is somewhat tutorial in nature.
111	   It is not meant to be a complete description of internationalization.
112	   The definitions here SHOULD be used by IETF standards.  IETF
113	   standards that explicitly want to create different definitions for
114	   the terms defined here can do so, but unless an alternate definition
115	   is provided the definitions of the terms in this document apply.
116	   IETF standards that have a requirement for different definitions are
117	   encouraged, for clarity's sake, to find terms different than the ones
118	   defined here.  Some of the definitions in this document come from
119	   earlier IETF documents and books.

121	   As in many fields, there is disagreement in the internationalization
122	   community on definitions for many words.  The topic of language
123	   brings up particularly passionate opinions for experts and non-
124	   experts alike.  This document attempts to define terms in a way that
125	   will be most useful to the IETF audience.

127	   This document uses definitions from many documents that have been
128	   developed inside and outside the IETF.  The primary documents used
129	   are:

131	   o  ISO/IEC 10646 [ISOIEC10646]

133	   o  The Unicode Standard [UNICODE]

135	   o  W3C Character Model [CHARMOD]

137	   o  IETF RFCs, including the Character Set Policy specification
138	      [RFC2277] and the domain name internationalization standard
139	      [RFC5890]

141	1.2.  Format of the Definitions in this Document

143	   In the body of this document, the source for the definition is shown
144	   in angle brackets, such as "<ISOIEC10646>".  Many definitions are
145	   shown as "<RFCtbd>", which means that the definitions were crafted
146	   originally for this document.  The angle bracket notation for the
147	   source of definitions is different than the square bracket notation
148	   used for references to documents, such as in the paragraph above;
149	   these references are given in the reference sections of this
150	   document.

152	   [[ RFC Editor: please change the "tbd" in "RFCtbd" to be the RFC
153	   number assigned to this RFC when published. ]]

155	   For some terms, there are commentary and examples after the
156	   definitions.  In those cases, the part before the angle brackets is
157	   the definition that comes from the original source, and the part
158	   after the angle brackets is commentary that is not a definition (such
159	   as examples or further exposition).

161	   Examples in this document use the notation for code points and names
162	   from the Unicode Standard [UNICODE] and ISO/IEC 10646 [ISOIEC10646].
163	   For example, the letter "a" may be represented as either "U+0061" or
164	   "LATIN SMALL LETTER A".  See RFC 5137 [RFC5137] for a description of
165	   this notation.
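
   As an illustration only, the following Python sketch prints both
   forms of this notation for a given character.  It assumes the
   standard "unicodedata" module; the helper name "describe" is
   hypothetical and is not taken from any specification.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and character name, e.g. 'U+0061 LATIN SMALL LETTER A'."""
    # ord() gives the code point; unicodedata.name() gives the formal name.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("a"))  # U+0061 LATIN SMALL LETTER A
```

   The reverse direction, from a name to a character, is available in
   the same module as unicodedata.lookup().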

167	1.3.  Normative Terminology

169	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
170	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
171	   document are to be interpreted as described in RFC 2119 [RFC2119].

173	2.  Fundamental Terms

175	   This section covers basic topics that are needed for almost anyone
176	   who is involved with making IETF protocols more friendly to non-ASCII
177	   text (see Section 4.2) and with other aspects of
178	   internationalization.

180	   language

182	      A language is a way that humans communicate.  The use of language
183	      occurs in many forms, the most common of which are speech,
184	      writing, and signing. <RFCtbd>

186	      Some languages have a close relationship between the written and
187	      spoken forms, while others have a looser relationship.  The so-
188	      called LTRU (Language Tag Registry Update) standards [RFC5646]
189	      [RFC4647] discuss languages in more detail and provide identifiers
190	      for languages for use in Internet protocols.  Note that computer
191	      languages are explicitly excluded from this definition.

193	   script

195	      A set of graphic characters used for the written form of one or
196	      more languages. <ISOIEC10646>

198	      Examples of scripts are Latin, Cyrillic, Greek, Arabic, and Han
199	      (the characters, often called ideographs after a subset of them,
200	      used in writing Chinese, Japanese, and Korean).  RFC 2277
201	      discusses scripts in detail.

203	      It is common for internationalization novices to mix up the terms
204	      "language" and "script".  This can be a problem in protocols that
205	      differentiate the two.  Almost all protocols that are designed (or
206	      were re-designed) to handle non-ASCII text deal with scripts (the
207	      written systems) or characters, while fewer actually deal with
208	      languages.

210	      A single name can mean either a language or a script; for example,
211	      "Arabic" is both the name of a language and the name of a script.
212	      In fact, many scripts borrow their names from the names of
213	      languages.  Further, many scripts are used to write more than one
214	      language; for example, the Russian and Bulgarian languages are
215	      written in the Cyrillic script.  Some languages can be expressed
216	      using different scripts or were used with different scripts at
217	      different times; the Mongolian language can be written in either
218	      the Mongolian or Cyrillic scripts; Malay is primarily written in
219	      Latin script today but the earlier, Arabic-script-based, Jawi form
220	      is still in use; and a number of languages were converted from
221	      other scripts to Cyrillic in the first half of the last century,
222	      some of which have switched again more recently.  Further, some
223	      languages are normally expressed with more than one script at the
224	      same time; for example, the Japanese language is normally
225	      expressed in the Kanji (Han), Katakana, and Hiragana scripts in a
226	      single string of text.
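
      A rough sketch of the last point, in Python: the string below is
      a short Japanese phrase that mixes three scripts.  Python's
      standard library does not expose the Unicode Script property, so
      this sketch ASSUMES that the first word of each character name is
      a usable stand-in for the script.  That is a heuristic chosen for
      illustration, not a reliable script detector, and the function
      name "script_guess" is hypothetical.

```python
import unicodedata


def script_guess(ch: str) -> str:
    """Heuristic only: use the first word of the character name as a script label."""
    return unicodedata.name(ch).split()[0]


# U+65E5 U+672C (Han), U+306E (Hiragana), U+30C6 U+30EC U+30D3 (Katakana)
text = "\u65e5\u672c\u306e\u30c6\u30ec\u30d3"
labels = sorted({script_guess(c) for c in text})
print(labels)  # ['CJK', 'HIRAGANA', 'KATAKANA']
```

      Note that the label for Han characters comes out as "CJK" because
      of how those characters are named; a protocol that needs real
      script information would have to use the Unicode Script property
      data instead.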

228	   writing system

230	      A set of rules for using one or more scripts to write a particular
231	      language.  Examples include the American English writing system,
232	      the British English writing system, the French writing system, and
233	      the Japanese writing system. <UNICODE>

235	   character

237	      A member of a set of elements used for the organization, control,
238	      or representation of data. <ISOIEC10646>

240	      There are at least three common definitions of the word
241	      "character":

243	      *  a general description of a text entity

245	      *  a unit of a writing system, often synonymous with "letter" or
246	         similar terms, but generalized to include digits and symbols of
247	         various sorts

249	      *  the encoded entity itself

251	      When people talk about characters, they usually intend one of the
252	      first two definitions.  The term "character" is often abbreviated
253	      as "char".

255	      A particular character is identified by its name, not by its
256	      shape.  A name may suggest a meaning, but the character may be
257	      used for representing other meanings as well.  A name may suggest
258	      a shape, but that does not imply that only that shape is commonly
259	      used in print, nor that the particular shape is associated only
260	      with that name.

262	   coded character

264	      A character together with its coded representation. <ISOIEC10646>

266	   coded character set

268	      A coded character set (CCS) is a set of unambiguous rules that
269	      establishes a character set and the relationship between the
270	      characters of the set and their coded representation.

272	      <ISOIEC10646>

274	   character encoding form

276	      A character encoding form is a mapping from a coded character set
277	      (CCS) to the actual code units used to represent the data.
278	      <UNICODE>

280	   repertoire

282	      The collection of characters included in a character set.  Also
283	      called a character repertoire. <UNICODE>

285	   glyph

287	      A glyph is an image of a character that can be displayed after
288	      being imaged onto a display surface. <RFCtbd>

290	      The Unicode Standard has a different definition that refers to an
291	      abstract form that may represent different images when the same
292	      character is rendered under different circumstances.

294	   glyph code

296	      A glyph code is a numeric code that refers to a glyph.  Usually,
297	      the glyphs contained in a font are referenced by their glyph code.
298	      Glyph codes are local to a particular font; that is, a different
299	      font containing the same glyphs may use different codes. <UNICODE>

301	   transcoding

303	      Transcoding is the process of converting text data from one
304	      character encoding form to another.  Transcoders work only at the
305	      level of character encoding and do not parse the text.  Note:
306	      Transcoding may involve one-to-one, many-to-one, one-to-many or
307	      many-to-many mappings.  Because some legacy mappings are glyphic,
308	      they may not only be many-to-many, but also unordered: thus XYZ
309	      may map to yxz. <CHARMOD>

311	      In this definition, "many-to-one" means a sequence of characters
312	      mapped to a single character.  The "many" does not mean
313	      alternative characters that map to the single character.
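
      A minimal transcoding sketch in Python follows.  It assumes the
      source and target are identified by the codec labels that Python
      happens to use ("iso-8859-1" and "utf-8"); the function name
      "transcode" is hypothetical.  The text is never parsed: octets
      are decoded to characters and immediately re-encoded.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert octets from one encoding to another without interpreting the text."""
    return data.decode(src).encode(dst)


latin1 = b"caf\xe9"  # "cafe" with U+00E9 as the last character, in ISO-8859-1
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(utf8)  # b'caf\xc3\xa9'
```

      Here the mapping is one-to-one at the character level, even
      though the single octet 0xE9 becomes two octets in UTF-8.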

315	   character encoding scheme

317	      A character encoding scheme (CES) is a character encoding form
318	      plus byte serialization.  There are many character encoding
319	      schemes in Unicode, such as UTF-8 and UTF-16BE. <UNICODE>

320	      Some CESs are associated with a single CCS; for example, UTF-8
321	      [RFC3629] applies only to the identical CCSs of ISO/IEC 10646 and
322	      Unicode.  Other CESs, such as ISO 2022, are associated with many
323	      CCSs.
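
      The layering of CCS, character encoding form, and character
      encoding scheme can be sketched in Python as below.  The choice
      of U+20AC (EURO SIGN) and the variable names are assumptions made
      for the example.

```python
ch = "\u20ac"  # EURO SIGN

# CCS: the character is assigned a code point.
code_point = ord(ch)

# Character encoding form: in UTF-16 this code point is one 16-bit code unit.
utf16_be = ch.encode("utf-16-be")
code_unit = int.from_bytes(utf16_be, "big")

# Character encoding scheme: the same code unit serialized in two byte orders.
utf16_le = ch.encode("utf-16-le")

# A different encoding form and scheme for the same code point.
utf8 = ch.encode("utf-8")

print(hex(code_point), hex(code_unit))  # 0x20ac 0x20ac
print(utf16_be.hex(), utf16_le.hex(), utf8.hex())  # 20ac ac20 e282ac
```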

325	   charset

327	      A charset is a method of mapping a sequence of octets to a
328	      sequence of abstract characters.  A charset is, in effect, a
329	      combination of one or more CCSs with a CES.  Charset names are
330	      registered by the IANA according to procedures documented in
331	      [RFC2978]. <RFCtbd>

333	      Many protocol definitions use the term "character set" in their
334	      descriptions.  The terms "charset" or "character encoding scheme"
335	      and "coded character set" are strongly preferred over the term
336	      "character set" because "character set" has other definitions in
337	      other contexts, particularly outside the IETF.  IETF standards
338	      that use "character set" without defining the term usually mean
339	      "a specific combination of one CCS with a CES", particularly when
340	      they are talking about the "US-ASCII character set".
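
      To see why the charset has to be known, consider this Python
      sketch, in which one octet is mapped to two different characters
      depending on the charset label.  The labels are the codec names
      Python accepts, which are assumed here to correspond to the
      IANA-registered charsets of the same names; the helper
      "decode_with" is hypothetical.

```python
def decode_with(octets: bytes, charset: str) -> str:
    """Map a sequence of octets to characters using the named charset."""
    return octets.decode(charset)


octets = b"\xe9"
as_latin = decode_with(octets, "iso-8859-1")     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
as_cyrillic = decode_with(octets, "iso-8859-5")  # U+0449 CYRILLIC SMALL LETTER SHCHA
print(f"U+{ord(as_latin):04X}", f"U+{ord(as_cyrillic):04X}")  # U+00E9 U+0449
```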

343	   internationalization

345	      In the IETF, "internationalization" means to add or improve the
346	      handling of non-ASCII text in a protocol. <RFCtbd> A different
347	      perspective, more appropriate to protocols that are designed for
348	      global use from the beginning, is the definition used by W3C:

350	         "Internationalization is the design and development of a
351	         product, application or document content that enables easy
352	         localization for target audiences that vary in culture, region,
353	         or language."  [W3C-i18n-Def]

355	      Many protocols that handle text only handle one charset (US-
356	      ASCII), or leave the question of what CCS and encoding are used up
357	      to local guesswork (which leads, of course, to interoperability
358	      problems).  If multiple charsets are permitted they must be
359	      explicitly identified [RFC2277].  Adding non-ASCII text to a
360	      protocol allows the protocol to handle more scripts, hopefully all
361	      of the ones useful in the world.  In today's world, that is
362	      normally best accomplished by allowing Unicode encoded in UTF-8
363	      only, thereby shifting conversion issues away from individual
364	      choices.

366	   localization

368	      The process of adapting an internationalized application platform
369	      or application to a specific cultural environment.  In
370	      localization, the same semantics are preserved while the syntax
371	      may be changed.  [FRAMEWORK]

373	      Localization is the act of tailoring an application for a
374	      different language or script or culture.  Some internationalized
375	      applications can handle a wide variety of languages.  Typical
376	      users only understand a small number of languages, so the program
377	      must be tailored to interact with users in just the languages they
378	      know.

380	      The major work of localization is translating the user interface
381	      and documentation.  Localization involves not only changing the
382	      language interaction, but also other relevant changes such as
383	      display of numbers, dates, currency, and so on.  The better
384	      internationalized an application is, the easier it is to localize
385	      it for a particular language and character encoding scheme.

387	      Localization is rarely an IETF matter, and protocols that are
388	      merely localized, even if they are serially localized for several
389	      locations, are generally considered unsatisfactory for the global
390	      Internet.

392	      Do not confuse "localization" with "locale", which is described in
393	      Section 8 of this document.

395	   i18n, l10n

397	      These are abbreviations for "internationalization" and
398	      "localization". <RFCtbd>

400	      "18" is the number of characters between the "i" and the "n" in
401	      "internationalization", and "10" is the number of characters
402	      between the "l" and the "n" in "localization".
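
      The counting rule can be checked with a few lines of Python.  The
      function name "numeronym" is simply a label chosen for this
      sketch.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner characters + last letter."""
    if len(word) < 3:
        return word  # nothing between the first and last letter to count
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```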

404	   multilingual

406	      The term "multilingual" has many widely varying definitions and
407	      thus is not recommended for use in standards.  Some of the
408	      definitions relate to the ability to handle international
409	      characters; other definitions relate to the ability to handle
410	      multiple charsets; and still others relate to the ability to
411	      handle multiple languages. <RFCtbd>

413	   displaying and rendering text

415	      To display text, a system puts characters on a visual display
416	      device such as a screen or a printer.  To render text, a system
417	      analyzes the character input to determine how to display the text.
418	      The terms "display" and "render" are sometimes used
419	      interchangeably.  Note, however, that text might be rendered as
420	      audio and/or tactile output, such as in systems that have been
421	      designed for people with visual disabilities. <RFCtbd>

423	      Combining characters modify the display of the character (or, in
424	      some cases, characters) that precede them.  When rendering such
425	      text, the display engine must either find the glyph in the font
426	      that represents the base character and all of the combining
427	      characters, or it must render the combination itself.  Such
428	      rendering can be straightforward, but it is sometimes complicated
429	      when the combining marks interact with each other, such as when
430	      there are two combining marks that would appear above the same
431	      character.  Formatting characters can also change the way that a
432	      renderer would display text.  Rendering can also be difficult for
433	      some scripts that have complex display rules for base characters,
434	      such as Arabic and Indic scripts.
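
      The Python sketch below looks at a base character followed by a
      combining mark.  It works at the character level only and does
      not model glyph selection; Unicode NFC normalization is used here
      merely as an analogy for a display engine that finds a single
      precomposed form, which is an assumption of this example rather
      than a description of any renderer.

```python
import unicodedata

decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# A non-zero canonical combining class marks a combining character.
classes = [unicodedata.combining(c) for c in decomposed]
print(classes)  # [0, 230]

# NFC composes the pair into the precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```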

436	3.  Standards Bodies and Standards

438	   This section describes some of the standards bodies and standards
439	   that appear in discussions of internationalization in the IETF.  This
440	   is an incomplete and possibly over-full list; listing too few bodies
441	   or standards can be just as politically dangerous as listing too
442	   many.  Note that there are many other bodies that deal with
443	   internationalization; however, few if any of them appear commonly in
444	   IETF standards work.

446	3.1.  Standards bodies

448	   ISO and ISO/IEC JTC 1

450	      The International Organization for Standardization has been
451	      involved with standards for characters since before the IETF was
452	      started.  ISO is a non-governmental group made up of national
453	      bodies.  Most of ISO's work in information technology is performed
454	      jointly with a similar body, the International Electrotechnical
455	      Commission (IEC) through a joint committee known as "JTC 1".  ISO
456	      and ISO/IEC JTC 1 have many diverse standards in the international
457	      characters area; the one that is most used in the IETF is commonly
458	      referred to as "ISO/IEC 10646", sometimes with a specific date.
459	      ISO/IEC 10646 describes a CCS that covers almost all known written
460	      characters in use today.

462	      ISO/IEC 10646 is controlled by the group known as "ISO/IEC JTC
463	      1/SC 2 WG2", often called "SC2/WG2" or "WG2" for short.  ISO
464	      standards go through many steps before being finished, and years
465	      often go by between changes to the base ISO/IEC 10646 standard
466	      although amendments are now issued to track Unicode changes.
467	      Information on WG2, and its work products, can be found at
468	      <http://www.dkuug.dk/JTC1/SC2/WG2/>.  Information on SC2, and its
469	      work products, can be found at <http://www.iso.org/iso/
470	      standards_development/technical_committees/
471	      list_of_iso_technical_committees/
472	      iso_technical_committee.htm?commid=45050>

474	      The standard comes as a base part and a series of attachments or
475	      amendments.  It is available in PDF form for downloading or in a
476	      CD-ROM version.  One example of how to cite the standard is given
477	      in [RFC3629].  Any standard that cites ISO/IEC 10646 needs to
478	      evaluate how to handle the versioning problem that is relevant to
479	      the protocol's needs.

481	      ISO is responsible for other standards that might be of interest
482	      to protocol developers concerned about internationalization.  ISO
483	      639 [ISO639] specifies the names of languages and forms part of
484	      the basis for the IETF's Language Tag work [RFC5646].  ISO 3166
485	      [ISO3166] specifies the names and code abbreviations for countries
486	      and territories and is used in several protocols and databases
487	      including names for country-code top level domain names.  The
488	      responsibilities of ISO TC 46 on Information and Documentation
489	      <http://www.iso.org/iso/standards_development/
490	      technical_committees/list_of_iso_technical_committees/
491	      iso_technical_committee.htm?commid=48750> include a series of
492	      standards for transliteration of various languages into Latin
493	      characters.

495	      Another relevant ISO group was JTC 1/SC22/WG20, which was
496	      responsible for internationalization in JTC1, such as for
497	      international string ordering.  Information on WG20, and its work
498	      products, can be found at <http://www.dkuug.dk/jtc1/sc22/wg20/>.
499	      The specific tasks of SC22/WG20 were moved from SC22 into SC2 and
500	      there has been little significant activity since that occurred.

502	   Unicode Consortium

504	      The second important group for international character standards
505	      is the Unicode Consortium.  The Unicode Consortium is a trade
506	      association of companies, governments, and other groups interested
507	      in promoting the Unicode Standard [UNICODE].  The Unicode Standard
508	      is a CCS whose repertoire and code points are identical to ISO/IEC
509	      10646.  The Unicode Consortium has added features to the base CCS
510	      which make it more useful in protocols, such as defining
511	      attributes for each character.  Examples of these attributes
512	      include case conversion and numeric properties.

514	      The actual technical and definitional work of the Unicode
515	      Consortium is done in the Unicode Technical Committee (UTC).  The
516	      terms "UTC" and "Unicode Consortium" are often treated,
517	      imprecisely, as synonymous in the IETF.

519	      The Unicode Consortium publishes addenda to the Unicode Standard
520	      as Unicode Technical Reports.  There are many types of technical
521	      reports at various stages of maturity.  The Unicode Standard and
522	      affiliated technical reports can be found at
523	      <http://www.unicode.org/>.

525	      A reciprocal agreement between the Unicode Consortium and ISO/IEC
526	      JTC 1/SC 2 provides for ISO/IEC 10646 and The Unicode Standard to
527	      track each other for definitions of characters and assignments of
528	      code points.  Updates, often in the form of amendments, to the
529	      former sometimes lag updates to the latter for a short period, but
530	      the gap has rarely been significant in recent years.

532	      At the time that the IETF character set policy [RFC2277] was
533	      established and the first version of this terminology
534	      specification was published, there was a strong preference in the
535	      IETF community for references to ISO/IEC 10646 (rather than
536	      Unicode) when possible.  That preference largely reflected a more
537	      general IETF preference for referencing established open
538	      international standards in preference to specifications from
539	      consortia.  However, the Unicode definitions of character
540	      properties and classes are not part of ISO/IEC 10646.  Because
541	      IETF specifications are increasingly dependent on those
542	      definitions (for example, see the explanation in Section 4.2) and
543	      the Unicode specifications are freely available online in
544	      convenient machine-readable form, the IETF's preference has
545	      shifted to referencing the Unicode Standard.  The latter is
546	      especially important when version consistency between code points
547	      (either standard) and Unicode properties (Unicode only) is
548	      required.

550	   World Wide Web Consortium (W3C)

552	      This group created and maintains the standard for XML, the markup
553	      language for text that has become very popular.  XML has always
554	      been fully internationalized so that there is no need for a new
555	      version to handle international text.  However, in some
556	      circumstances, XML files may be sensitive to differences among
557	      Unicode versions.

559	   local and regional standards organizations

561	      Just as there are many native CCSs and charsets, there are many
562	      local and regional standards organizations to create and support
563	      them.  Common examples of these are ANSI (United States), CEN/ISSS
564	      (Europe), JIS (Japan), and SAC (China).

566	3.2.  Encodings and Transformation Formats of ISO/IEC 10646

568	   Characters in the ISO/IEC 10646 CCS can be expressed in many ways.
569	   Historically, "encoding forms" are direct addressing methods,
570	   while "transformation formats" are methods for expressing encoding
571	   forms as bits on the wire.  That distinction has mostly disappeared
572	   in recent years.

574	   Documents that discuss characters in the ISO/IEC 10646 CCS often need
575	   to list specific characters.  RFC 5137 describes the common methods
576	   for doing so in IETF documents, and these practices have been adopted
577	   by many other communities as well.

579	   Basic Multilingual Plane (BMP)

581	      The BMP is composed of the first 2^16 code points in ISO/IEC 10646
582	      and contains almost all characters in contemporary use.  The BMP
583	      is also called "Plane 0".

585	   UCS-2 and UCS-4

587	      UCS-2 and UCS-4 are the two encoding forms historically defined
588	      for ISO/IEC 10646.  UCS-2 addresses only the BMP.  Because many
589	      useful characters (such as many Han characters) have been defined
590	      outside of the BMP, many people consider UCS-2 to be obsolete.
591	      UCS-4 addresses the entire range of code points from ISO/IEC 10646
592	      (by agreement between ISO/IEC JTC1 SC2 and the Unicode Consortium,
593	      a range from 0..0x10FFFF) as 32-bit values with zero padding to
594	      the left.  UCS-4 is identical to UTF-32BE (without use of a BOM
595	      (see below)); UTF-32BE is now the preferred term.

597	   UTF-8

599	      UTF-8 [RFC3629] is the preferred encoding for IETF protocols.
600	      Characters in the BMP are encoded as one, two, or three octets.
601	      Characters outside the BMP are encoded as four octets.  Characters
602	      from the US-ASCII repertoire have the same on-the-wire
603	      representation in UTF-8 as they do in US-ASCII.  The IETF-specific
604	      definition of UTF-8 in RFC 3629 is identical to that in recent
605	      versions of the Unicode Standard (e.g., in Section 3.9 of Version
606	      6.0 [UNICODE]).
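
              The octet counts above can be checked with a minimal sketch.
              It assumes Python's built-in "utf-8" codec; the helper name
              utf8_length and the sample characters are illustrative
              choices and do not come from RFC 3629.

```python
def utf8_length(ch):
    """Number of octets UTF-8 uses for one character (hypothetical helper)."""
    return len(ch.encode("utf-8"))

# (character, description) pairs chosen for illustration only.
samples = [
    ("A", "U+0041, US-ASCII repertoire"),
    ("\u00f1", "U+00F1, BMP"),
    ("\u20ac", "U+20AC, BMP"),
    ("\U00010348", "U+10348, outside the BMP"),
]
for ch, note in samples:
    print(note, "->", utf8_length(ch), "octet(s)")

# US-ASCII text has the same octets in UTF-8 as in US-ASCII.
assert "IETF".encode("utf-8") == "IETF".encode("ascii")
```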

608	   UTF-16, UTF-16BE, and UTF-16LE

610	      UTF-16, UTF-16BE, and UTF-16LE, three transformation formats
611	      described in [RFC2781] and defined in The Unicode Standard
612	      (Sections 3.9 and 16.8 of Version 6.0), are not required by any
613	      IETF standards, and are thus used much less often in protocols
614	      than UTF-8.  Characters in the BMP are always encoded as two
615	      octets, and characters outside the BMP are encoded as four octets
616	      using a "surrogate pair" arrangement.  The latter is not part of
617	      UCS-2, marking the difference between UTF-16 and UCS-2.  The three
618	      UTF-16 formats differ based on the order of the octets and the
619	      presence or absence of a special lead-in ordering identifier
620	      called the "byte order mark" or "BOM".
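
              As a rough illustration of the surrogate pair arrangement
              and of octet order, the sketch below assumes Python's
              "utf-16", "utf-16-be", and "utf-16-le" codecs.  The helper
              surrogate_pair and the sample character are hypothetical
              and are not taken from [RFC2781].

```python
def surrogate_pair(cp):
    """Split a code point outside the BMP into (high, low) surrogates."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = surrogate_pair(ord(clef))

# Four octets either way; only the octet order differs.
big = clef.encode("utf-16-be")
little = clef.encode("utf-16-le")
assert big == high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert little == high.to_bytes(2, "little") + low.to_bytes(2, "little")

# With the plain "utf-16" label, a leading BOM tells the decoder the order.
with_bom = b"\xfe\xff" + big
assert with_bom.decode("utf-16") == clef
```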

622	   UTF-32

624	      The Unicode Consortium and ISO/IEC JTC 1 have defined UTF-32 as a
625	      transformation format that incorporates the integer code point
626	      value right-justified in a 32 bit field.  As with UTF-16, the byte
627	      order mark (BOM) can be used and UTF-32BE and UTF-32LE are
628	      defined.  UTF-32 and UCS-4 are essentially equivalent and the
629	      terms are often used interchangeably.
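
              A minimal sketch of "right-justified in a 32 bit field",
              assuming Python's struct module and its "utf-32-be" and
              "utf-32-le" codecs.  The function name utf32be is an
              illustrative choice.

```python
import struct

def utf32be(cp):
    """Pack a code point into four octets, most significant octet first."""
    if not (0 <= cp <= 0x10FFFF):
        raise ValueError("outside the range 0..0x10FFFF")
    return struct.pack(">I", cp)

# Zero padding appears on the left of small values.
assert utf32be(0x41) == b"\x00\x00\x00A"

# The result matches the codec output for the same character.
assert utf32be(0x1D11E) == "\U0001D11E".encode("utf-32-be")

# For a single character, UTF-32LE is the same four octets reversed.
assert utf32be(0x1D11E)[::-1] == "\U0001D11E".encode("utf-32-le")
```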

631	   SCSU and BOCU-1

633	      The Unicode Consortium has defined an encoding, SCSU [UTR6], which
634	      is designed to offer good compression for typical text.  A
635	      different encoding that is meant to be MIME-friendly, BOCU-1, is
636	      described in [UTN6].  Although compression is attractive, as
637	      opposed to UTF-8, neither of these (at the time of this writing)
638	      has attracted much interest.

640	      The compression provided as a side effect of the Punycode
641	      algorithm [RFC3492] is heavily used in some contexts, especially
642	      IDNA [RFC5890], but imposes some restrictions (see also
643	      Section 7).
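
              The Punycode transformation itself can be tried with
              Python's "punycode" codec, as in the sketch below.  This
              shows only the raw algorithm of [RFC3492] on an
              illustrative label.  IDNA adds the "xn--" prefix and
              further rules that are not modeled here.

```python
label = "b\u00fccher"  # "bücher", an illustrative label

# Raw Punycode: ASCII characters first, then a delimiter and the encoded
# positions of the non-ASCII characters.
encoded = label.encode("punycode")
print(encoded)

# The transformation is reversible.
assert encoded.decode("punycode") == label

# IDNA marks such labels with the ACE prefix "xn--".
ace_label = "xn--" + encoded.decode("ascii")
print(ace_label)
```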

645	3.3.  Native CCSs and charsets

647	   Before ISO/IEC 10646 was developed, many countries developed their
648	   own CCSs and charsets.  Some of these were adopted into international
649	   standards for the relevant scripts or writing systems.  Many dozens
650	   of these are in common use on the Internet today.  Examples include
651	   ISO 8859-5 for Cyrillic and Shift-JIS for Japanese scripts.

653	   The official list of the registered charset names for use with IETF
654	   protocols is maintained by IANA and can be found at
655	   <http://www.iana.org/assignments/character-sets>.  The list contains
656	   preferred names and aliases.  Note that this list has historically
657	   contained many errors, such as names that are in fact not charsets or
658	   references that do not give enough detail to reliably map names to
659	   charsets.

661	   Probably the most well-known native CCS is ASCII [US-ASCII].  This
662	   CCS is used as the basis for keywords and parameter names in many
663	   IETF protocols, and as the sole CCS in numerous IETF protocols that
664	   have not yet been internationalized.  ASCII became the basis for ISO/
665	   IEC 646 which, in turn, formed the basis for many national and
666	   international standards, such as the ISO 8859 series, that mix Basic
667	   Latin characters with characters from another script.

669	   It is important to note that, strictly speaking, "ASCII" is a CCS and
670	   repertoire, not an encoding.  The encoding used for ASCII in IETF
671	   protocols involves the seven-bit integer ASCII code point right-
672	   justified in an 8-bit field and is sometimes described as the
673	   "Network Virtual Terminal" or "NVT" encoding [RFC5198].  Less
674	   formally, "ASCII" and "NVT" are often used interchangeably.  However,
675	   "non-ASCII" refers only to characters outside the ASCII repertoire
676	   and is not linked to a specific encoding.  See Section 4.2.

678	   A Unicode publication describes issues involved in mapping character
679	   data between charsets, and an XML format for mapping table data
680	   [UTR22].

682	4.  Character Issues

684	   This section contains terms and topics that are commonly used in
685	   character handling and therefore are of concern to people adding non-
686	   ASCII text handling to protocols.  These topics are standardized
687	   outside the IETF.

689	   code point

691	      A value in the codespace of a repertoire.  For all common
692	      repertoires developed in recent years, code point values are
693	      integers (code points for ASCII and its immediate descendants were
694	      defined in terms of column and row positions of a table).

696	   combining character

698	      A member of an identified subset of the coded character set of
699	      ISO/IEC 10646 intended for combination with the preceding non-
700	      combining graphic character, or with a sequence of combining
701	      characters preceded by a non-combining character.  Combining
702	      characters are inherently non-spacing. <ISOIEC10646>

704	   composite sequence or combining character sequence

706	      A sequence of graphic characters consisting of a non-combining
707	      character followed by one or more combining characters.  A graphic
708	      symbol for a composite sequence generally consists of the
709	      combination of the graphic symbols of each character in the
710	      sequence.  The Unicode Standard often uses the term "combining
711	      character sequence" to refer to composite sequences.  A composite
712	      sequence is not a character and therefore is not a member of the
713	      repertoire of ISO/IEC 10646. <ISOIEC10646> However, Unicode now
714	      assigns names to some such sequences especially when the names are
715	      required to match terminology in other standards [UAX34].

717	      In some CCSs, some characters consist of combinations of other
718	      characters.  For example, the letter "a with acute" might be a
719	      combination of the two characters "a" and "combining acute", or it
720	      might be a combination of the three characters "a", a non-
721	      destructive backspace, and an acute.  In the same or other CCSs,
722	      it might be available as a single code point.  The rules for
723	      combining two or more characters are called "composition rules",
724	      and the rules for taking apart a character into other characters
725	      are called "decomposition rules".  The result of composition is
726	      called a "precomposed character"; the result of decomposition is
727	      called a "decomposed character".

729	   normalization

731	      Normalization is the transformation of data to a normal form, for
732	      example, to unify spelling. <UNICODE>

734	      Note that the phrase "unify spelling" in the definition above does
735	      not mean unifying different strings with the same meaning as words
736	      (such as "color" and "colour").  Instead, it means unifying
737	      different character sequences that are intended to form the same
738	      composite characters, such as "<n><combining tilde>" and "<n with
739	      tilde>" (where "<n>" is U+006E, "<combining tilde>" is U+0303, and
740	      "<n with tilde>" is U+00F1).

742	      The purpose of normalization is to allow two strings to be
743	      compared for equivalence.  The strings "<a><n><combining
744	      tilde><o>" and "<a><n with tilde><o>" would be shown identically
745	      on a text display device.  If a protocol designer wants those two
746	      strings to be considered equivalent during comparison, the
747	      protocol must define where normalization occurs.

749	      The terms "normalization" and "canonicalization" are often used
750	      interchangeably.  Generally, they both mean to convert a string of
751	      one or more characters into another string based on standardized
752	      rules.  However, in Unicode, "canonicalization" or similar terms
753	      are used to refer to a particular type of normalization
754	      equivalence ("canonical equivalence", in contrast to
755	      "compatibility equivalence"), so the term should be used with some
756	      care.  Some CCSs allow multiple equivalent representations for a
757	      written string; normalization selects one among multiple
758	      equivalent representations as a base for reference purposes in
759	      comparing strings.  In strings of text, these rules are usually
760	      based on decomposing combined characters or composing characters
761	      with combining characters.  Unicode Standard Annex #15 [UTR15]
762	      describes the process and many forms of normalization in detail.
763	      Normalization is important when comparing strings to see if they
764	      are the same.

766	      The Unicode NFC and NFD normalizations support canonical
767	      equivalence; NFKC and NFKD support canonical and compatibility
768	      equivalence.
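
              The effect of the four forms on string comparison can be
              sketched with Python's unicodedata module.  The helper
              name equivalent and the sample strings are assumptions
              made for illustration.

```python
import unicodedata

decomposed = "an\u0303o"   # <a><n><combining tilde><o>
precomposed = "a\u00f1o"   # <a><n with tilde><o>

# Without normalization the two strings compare as different.
assert decomposed != precomposed

def equivalent(a, b, form="NFC"):
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonical equivalence: NFC and NFD both unify the two spellings.
assert equivalent(decomposed, precomposed, "NFC")
assert equivalent(decomposed, precomposed, "NFD")

# Compatibility equivalence: U+FB01 (LATIN SMALL LIGATURE FI) matches
# "fi" only under the K forms.
assert not equivalent("\ufb01", "fi", "NFC")
assert equivalent("\ufb01", "fi", "NFKC")
```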

770	   case

772	      Case is the feature of certain alphabets where the letters have
773	      two (or occasionally more) distinct forms.  These forms, which may
774	      differ markedly in shape and size, are called the uppercase letter
775	      (also known as capital or majuscule) and the lowercase letter
776	      (also known as small or minuscule).  Case mapping is the
777	      association of the uppercase and lowercase forms of a letter.
778	      <UNICODE>

780	      There is usually (but not always) a one-to-one mapping between the
781	      same letter in the two cases.  However, there are many examples of
782	      characters which exist in one case but for which there is no
783	      corresponding character in the other case or for which there is a
784	      special mapping rule, such as the Turkish dotless "i", some Greek
785	      characters with modifiers, and characters like the German Sharp S
786	      (Eszett) and Greek Final Sigma that traditionally do not have
787	      uppercase forms.  Case mapping can even be dependent on locale or
788	      language.  Converting text to have only a single case, primarily
789	      for comparison purposes, is called "case folding".  Because of the
790	      various unusual cases, case mapping can be quite controversial and
791	      some case folding algorithms even more so.  For example, some
792	      programming languages such as Java have case-folding algorithms
793	      that are locale-sensitive; this makes those algorithms incredibly
794	      resource-intensive, and makes them act differently depending on
795	      the location of the system at the time the algorithm is used.
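
              Some of the irregular mappings mentioned above can be
              observed with Python's string methods, which apply
              Unicode case mappings without reference to locale.  The
              sketch below is illustrative only, and the helper
              caseless_equal is a hypothetical name.

```python
sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
final_sigma = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA
dotless_i = "\u0131"    # LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"     # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Sharp S: uppercasing and case folding both change the string length.
assert sharp_s.upper() == "SS"
assert sharp_s.casefold() == "ss"

# Final sigma folds to the ordinary small sigma.
assert final_sigma.casefold() == "\u03c3"

# Turkish letters: these locale-independent mappings do not round-trip.
assert dotless_i.upper() == "I"
assert "I".lower() == "i"
assert dotted_I.lower() == "i\u0307"

def caseless_equal(a, b):
    """Compare two strings after case folding (no normalization applied)."""
    return a.casefold() == b.casefold()

assert caseless_equal("STRASSE", "stra\u00dfe")
assert "STRASSE".lower() != "stra\u00dfe".lower()
```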

797	   sorting and collation

799	      Collating is the process of ordering units of textual information.
800	      Collation is usually specific to a particular language or even to
801	      a particular application or locale.  It is sometimes known as
802	      alphabetizing, although alphabetization is just a special case of
803	      sorting and collation. <UNICODE>

805	      Collation is concerned with the determination of the relative
806	      order of any particular pair of strings, and algorithms concerned
807	      with collation focus on the problem of providing appropriate
808	      weighted keys for string values, to enable binary comparison of
809	      the key values to determine the relative ordering of the strings.

811	      The relative orders of letters in collation sequences can differ
812	      widely based on the needs of the system or protocol defining the
813	      collation order.  For example, even within ASCII characters, there
814	      are two common and very different collation orders: "A, a, B,
815	      b,..." and "A, B, C, ..., Z, a, b,...", with additional variations
816	      for lower case first and digits before and after letters.

818	      In practice, it is rarely necessary to define a collation sequence
819	      for characters drawn from different scripts, but arranging such
820	      sequences so as to not surprise users is usually particularly
821	      problematic.

823	      Sorting is the process of actually putting data records into
824	      specified orders, according to criteria for comparison between the
825	      records.  Sorting can apply to any kind of data (including textual
826	      data) for which an ordering criterion can be defined.  Algorithms
827	      concerned with sorting focus on the problem of performance (in
828	      terms of time, memory, or other resources) in actually putting the
829	      data records into the desired order.

831	      A sorting algorithm for string data can be internationalized by
832	      providing it with the appropriate collation-weighted keys
833	      corresponding to the strings to be ordered.

835	      Many processes have a need to order strings in a consistent
836	      (sorted) sequence.  For only a few CCS/CES combinations, there is
837	      an obvious sort order that can be applied without reference to the
838	      linguistic meaning of the characters: the code point order is
839	      sufficient for sorting.  That is, the code point order is also the
840	      order that a person would use in sorting the characters.  For many
841	      CCS/CES combinations, the code point order would make no sense to
842	      a person and therefore is not useful for sorting if the results
843	      will be displayed to a person.

845	      Code Point order is usually not how any human educated by a local
846	      school system expects to see strings ordered; if one orders to the
847	      expectations of a human, one has a "language-specific" or "human
848	      language" sort.  Sorting to code point order will seem
849	      inconsistent if the strings are not normalized before sorting
850	      because different representations of the same character will sort
851	      differently.  This problem may be smaller with a language-specific
852	      sort.
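
              The difference between code point order and a weighted
              key, and the effect of normalization on sorting, can be
              sketched as follows.  The key function simple_key is a
              deliberately naive assumption, not a real collation
              algorithm.

```python
import unicodedata

words = ["b", "A", "a", "B"]

# Code point order: all uppercase ASCII letters sort before lowercase.
code_point_order = sorted(words)

def simple_key(s):
    """Naive weighted key: case-folded text first, original as tiebreak."""
    return (s.casefold(), s)

# Interleaved order "A, a, B, b".
interleaved = sorted(words, key=simple_key)

# Two representations of the same character sort apart by code point,
# with "o" landing between them.
mixed = ["n\u0303", "o", "\u00f1"]
unnormalized = sorted(mixed)

# Normalizing the sort key brings the two representations together.
normalized = sorted(mixed, key=lambda s: unicodedata.normalize("NFC", s))
```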

854	   code table

856	      A code table is a table showing the characters allocated to the
857	      octets in a code. <ISOIEC10646>

859	      Code tables are also commonly called "code charts".

861	4.1.  Types of Characters

863	   The following definitions of types of characters do not clearly
864	   delineate each character into one type, nor do they allow someone to
865	   accurately predict what types would apply to a particular character.
866	   The definitions are intended for application designers to help them
867	   think about the many (sometimes confusing) properties of text.

869	   alphabetic

871	      An informative Unicode property.  Characters that are the primary
872	      units of alphabets and/or syllabaries, whether combining or
873	      noncombining.  This includes composite characters that are
874	      canonical equivalents to a combining character sequence of an
875	      alphabetic base character plus one or more combining characters:
876	      letter digraphs; contextual variants of alphabetic characters;
877	      ligatures of alphabetic characters; contextual variants of
878	      ligatures; modifier letters; letterlike symbols that are
879	      compatibility equivalents of single alphabetic letters; and
880	      miscellaneous letter elements. <UNICODE>

882	   ideographic

884	      Any symbol that primarily denotes an idea (or meaning) in contrast
885	      to a sound (or pronunciation), for example, a symbol showing a
886	      telephone or the Han characters used in Chinese, Japanese, and
887	      Korean. <UNICODE>

889	      While Unicode and many other systems use this term to refer to all
890	      Han characters, strictly speaking not all of those characters are
891	      actually ideographic.  Some are pictographic (such as the
892	      telephone example above), some are used phonetically, and so on.

894	      However, the convention is to describe the script as ideographic
895	      as contrasted to alphabetic.

897	   digit or number

899	      All modern writing systems use decimal digits in some form; some
900	      older ones use non-positional or other systems.  Different scripts
901	      may have their own digits.  Unicode distinguishes between numbers
902	      and other kinds of characters by assigning a special General
903	      Category value to them and subdividing that value to distinguish
904	      between decimal digits, letter digits, and other digits. <UNICODE>
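
              The three subdivisions correspond to the General Category
              values Nd, Nl, and No.  The sketch below reads them with
              Python's unicodedata module.  The sample characters are
              illustrative, and the results depend on the Unicode
              version bundled with the interpreter.

```python
import unicodedata

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
roman_eight = "\u2167"         # ROMAN NUMERAL EIGHT
one_half = "\u00bd"            # VULGAR FRACTION ONE HALF

# General Category values for the three kinds of number characters.
assert unicodedata.category("7") == "Nd"
assert unicodedata.category(arabic_indic_three) == "Nd"
assert unicodedata.category(roman_eight) == "Nl"
assert unicodedata.category(one_half) == "No"

# Decimal digits from any script carry a digit value.
assert unicodedata.decimal(arabic_indic_three) == 3

# Other number characters have a numeric value but no decimal value.
assert unicodedata.numeric(roman_eight) == 8.0
assert unicodedata.numeric(one_half) == 0.5
```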

906	   punctuation

908	      Characters that separate units of text, such as sentences and
909	      phrases, thus clarifying the meaning of the text.  The use of
910	      punctuation marks is not limited to prose; they are also used in
911	      mathematical and scientific formulae, for example. <UNICODE>

913	   symbol

915	      One of a set of characters other than those used for letters,
916	      digits, or punctuation, and representing various concepts
917	      generally not connected to written language use per se. <RFCtbd>

919	      Examples of symbols include characters for mathematical operators,
920	      symbols for OCR, symbols for box-drawing or graphics, as well as
921	      symbols for dingbats, arrows, faces, and geometric shapes.
922	      Unicode has a property that identifies symbol characters.

924	   nonspacing character

926	      A combining character whose positioning in presentation is
927	      dependent on its base character.  It generally does not consume
928	      space along the visual baseline in and of itself. <UNICODE>

930	      A combining acute accent (U+0301) is an example of a nonspacing
931	      character.

933	   diacritic

935	      A mark applied or attached to a symbol to create a new symbol that
936	      represents a modified or new value.  They can also be marks
937	      applied to a symbol irrespective of whether it changes the value
938	      of that symbol.  In the latter case, the diacritic usually
939	      represents an independent value (for example, an accent, tone, or
940	      some other linguistic information).  Also called diacritical mark
941	      or diacritical. <UNICODE>

943	   control character

945	      The 65 characters in the ranges U+0000..U+001F and U+007F..U+009F.
946	      The basic space character, U+0020, is often considered as a
947	      control character as well, making the total number 66.  They are
948	      also known as control codes.  In terminology adopted by Unicode
949	      from ASCII and the ISO 8859 standards, these codes are treated as
950	      belonging to three ranges: "C0" (for U+0000..U+001F), "C1" (for
951	      U+0080..U+009F), and the single control character "DEL" (U+007F).
952	      <UNICODE>

954	      Occasionally, in other vocabularies, the term "control character"
955	      is used to describe any character that does not have an associated
956	      glyph or to device control sequences [ISO6429].  Neither of those
957	      usages is appropriate to internationalization terminology in the
958	      IETF.
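
              The count of 65 can be confirmed with a short sketch.  It
              assumes Python's unicodedata module, where these
              characters carry the General Category value "Cc".  The
              function name is_control is illustrative.

```python
import unicodedata

def is_control(ch):
    """True for the C0 range, DEL, and the C1 range."""
    cp = ord(ch)
    return cp <= 0x1F or 0x7F <= cp <= 0x9F

controls = [chr(cp) for cp in range(0x100) if is_control(chr(cp))]

# 32 C0 controls + DEL + 32 C1 controls.
assert len(controls) == 65

# The range test agrees with the General Category value.
assert all(unicodedata.category(ch) == "Cc" for ch in controls)

# U+0020 is a space separator in Unicode terms, not a control.
assert unicodedata.category(" ") == "Zs"
```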

960	   formatting character

962	      Characters that are inherently invisible but that have an effect
963	      on the surrounding characters. <UNICODE>

965	      Examples of formatting characters include characters for
966	      specifying the direction of text and characters that specify how
967	      to join multiple characters.

969	   compatibility character or compatibility variant

971	      A graphic character included as a coded character of ISO/IEC 10646
972	      primarily for compatibility with existing coded character sets.
973	      <ISOIEC10646>

975	      The Unicode definition of compatibility character also includes
976	      characters that have been incorporated for other reasons.  Their
977	      list includes several separate groups of characters included for
978	      compatibility purposes: halfwidth and fullwidth characters used
979	      with East Asian scripts, Arabic contextual forms (e.g., initial or
980	      final forms), some ligatures, deprecated formatting characters,
981	      variant forms of characters (or even copies of them) for
982	      particular uses (e.g., phonetic or mathematical applications),
983	      font variations, CJK compatibility ideographs, and so on.  For
984	      additional information and the separate term "compatibility
985	      decomposable character", see the Unicode standard.

987	      For example, U+FF01 (FULLWIDTH EXCLAMATION MARK) was included for
988	      compatibility with Asian charsets that include full-width and
989	      half-width ASCII characters.

991	      Some efforts in the IETF have concluded that it would be useful to
992	      support mapping of some groups of compatibility equivalents and
993	      not others (e.g., supporting or mapping width variations while
994	      preserving or rejecting mathematical variations).  See the IDNA
995	      Mapping document [RFC5895] for one example.
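
              Mapping only some groups of compatibility equivalents can
              be sketched by inspecting the decomposition tag of each
              character.  The function fold_width below is a
              hypothetical illustration that uses Python's unicodedata
              module.  It is not the algorithm specified in [RFC5895].

```python
import unicodedata

# Decomposition tags that mark width variants.
WIDTH_TAGS = ("<wide>", "<narrow>")

def fold_width(text):
    """Map fullwidth and halfwidth variants; leave other characters alone."""
    out = []
    for ch in text:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in WIDTH_TAGS:
            out.append("".join(chr(int(p, 16)) for p in parts[1:]))
        else:
            out.append(ch)
    return "".join(out)

fullwidth_bang = "\uff01"   # FULLWIDTH EXCLAMATION MARK
math_bold_a = "\U0001d400"  # MATHEMATICAL BOLD CAPITAL A

# The width variant is mapped; the mathematical variant is preserved.
assert fold_width(fullwidth_bang) == "!"
assert fold_width(math_bold_a) == math_bold_a

# NFKC, by contrast, maps both groups.
assert unicodedata.normalize("NFKC", fullwidth_bang + math_bold_a) == "!A"
```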

997	4.2.  Differentiation of Subsets

999	   Especially as existing IETF standards are internationalized, it is
1000	   necessary to describe collections of characters, in particular
1001	   various subsets of Unicode.  Because Unicode includes ways to code
1002	   substantially all characters in contemporary use, subsets of the
1003	   Unicode repertoire can be a useful tool for defining these
1004	   collections as repertoires independent of specific Unicode coding.

1006	   However specific collections are defined, it is important to remember
1007	   that, while older CCSs such as ASCII and the ISO 8859 family are
1008	   close-ended and fixed, Unicode is open-ended, with new character
1009	   definitions, and often new scripts, being added every year or so.
1010	   So, while, e.g., an ASCII subset, such as "upper case letters", can
1011	   be specified as a range of code points (4/1 to 5/10 for that
1012	   example), similar definitions for Unicode either have to be specified
1013	   in terms of Unicode properties or are very dependent on Unicode
1014	   versions (and the relevant version must be identified in any
1015	   specification).  See the IDNA code point specification [RFC5892] for
1016	   an example of specification by combinations of properties.
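
           The contrast between a fixed range and a property-based
           definition can be sketched as follows.  The helper names are
           illustrative, and General Category "Lu" is used here only as
           one example of a Unicode property.  The property-based
           result depends on the Unicode version bundled with the
           Python interpreter.

```python
import unicodedata

def from_column_row(column, row):
    """Convert ASCII column/row notation (e.g. 4/1) to a code point."""
    return column * 16 + row

# Closed definition: a fixed range, 4/1 to 5/10.
ASCII_UPPER = range(from_column_row(4, 1), from_column_row(5, 10) + 1)

def is_ascii_upper(ch):
    return ord(ch) in ASCII_UPPER

# Open-ended definition: depends on the property data in use.
def is_unicode_upper(ch):
    return unicodedata.category(ch) == "Lu"

# A specification would need to name the Unicode version it relies on.
print("Unicode version in use:", unicodedata.unidata_version)

assert is_ascii_upper("Z") and not is_ascii_upper("a")
assert is_unicode_upper("\u00c9") and not is_ascii_upper("\u00c9")
```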

1018	   Some terms are commonly used in the IETF to define character ranges
1019	   and subsets.  Some of these are imprecise and can cause confusion if
1020	   not used carefully.

1022	   non-ASCII  The term "non-ASCII" strictly refers to characters other
1023	      than those that appear in the ASCII repertoire, independent of the
1024	      CCS or encoding used for them.  In practice, if a repertoire such
1025	      as that of Unicode is established as context, "non-ASCII" refers
1026	      to characters in that repertoire that do not appear in the ASCII
1027	      repertoire.  "Outside the ASCII repertoire" and "outside the ASCII
1028	      range" are practical, and more precise, synonyms for "non-ASCII".

1030	   letters  The term "letters" does not have an exact equivalent in the
1031	      Unicode standard.  Letters are generally characters that are used
1032	      to write words, but that means very different things in different
1033	      languages and cultures.

1035	5.  User Interface for Text

1037	   Although the IETF does not standardize user interfaces, many
1038	   protocols make assumptions about how a user will enter or see text
1039	   that is used in the protocol.  Internationalization challenges
1040	   assumptions about the type and limitations of the input and output
1041	   devices that may be used with applications that use various
1042	   protocols.  It is therefore useful to consider how users typically
1043	   interact with text that might contain one or more non-ASCII
1044	   characters.

1046	   input methods

1048	      An input method is a mechanism for a person to enter text into an
1049	      application. <RFCtbd>

1051	      Text can be entered into a computer in many ways.  Keyboards are
1052	      by far the most common device used, but many characters cannot be
1053	      entered on typical computer keyboards in a single stroke.  Many
1054	      operating systems come with system software that lets users input
1055	      characters outside the range of what is allowed by keyboards.

1057	      For example, there are dozens of different input methods for Han
1058	      characters in Chinese, Japanese, and Korean.  Some start with
1059	      phonetic input through the keyboard, while others use the number
1060	      of strokes in the character.  Input methods are also needed for
1061	      scripts that have many diacritics, such as European or Vietnamese
1062	      characters that have two or three diacritics on a single
1063	      alphabetic character.

1065	      The term "input method editor" (IME) is often used generically to
1066	      describe the tools and software used to deal with input of
1067	      characters on a particular system.

1069	   rendering rules

1071	      A rendering rule is an algorithm that a system uses to decide how
1072	      to display a string of text. <RFCtbd>

1074	      Some scripts can be directly displayed with fonts, where each
1075	      character from an input stream can simply be copied from a glyph
1076	      system and put on the screen or printed page.  Other scripts need
1077	      rules that are based on the context of the characters in order to
1078	      render text for display.

1080	      Some examples of these rendering rules include:

1082	      *  Scripts such as Arabic (and many others), where the form of the
1083	         letter changes depending on the adjacent letters, whether the
1084	         letter is standing alone, at the beginning of a word, in the
1085	         middle of a word, or at the end of a word.  The rendering rules
1086	         must choose between two or more glyphs.

1088	      *  Scripts such as the Indic scripts, where consonants may change
1089	         their form if they are adjacent to certain other consonants or
1090	         may be displayed in an order different from the way they are
1091	         stored and pronounced.  The rendering rules must choose between
1092	         two or more glyphs.

1094	      *  Arabic and Hebrew scripts, where the order in which characters
1095	         are displayed is changed by the bidirectional properties of
1096	         the alphabetic and other characters and by right-to-left and
1097	         left-to-right ordering marks.  The rendering rules must choose
1098	         the order in which characters are displayed.

1100	      *  Some writing systems cannot have their rendering rules suitably
1101	         defined using mechanisms that are now defined in the Unicode
1102	         Standard.  None of those languages are in active non-scholarly
1103	         use today.

1105	      *  Many systems use a special rendering rule when they lack a font
1106	         or other mechanism for rendering a particular character
1107	         correctly.  That rule typically involves substitution of a
1108	         small open box or a question mark for the missing character.
1109	         See "undisplayable character" below.

1111	   graphic symbol

1113	      A graphic symbol is the visual representation of a graphic
1114	      character or of a composite sequence. <ISOIEC10646>

1116	   font

1118	      A font is a collection of glyphs used for the visual depiction of
1119	      character data.  A font is often associated with a set of
1120	      parameters (for example, size, posture, weight, and serifness),
1121	      which, when set to particular values, generate a collection of
1122	      imagable glyphs. <UNICODE>

1124	      The term "font" is often used interchangeably with "typeface".  As
1125	      historically used in typography, a typeface is a family of one or
1126	      more fonts that share a common general design.  For example,
1127	      "Times Roman" is actually a typeface, with a collection of fonts
1128	      such as "Times Roman Bold", "Times Roman Medium", "Times Roman
1129	      Italic", and so on.  Some sources even consider different type
1130	      sizes within a typeface to be different fonts.  While those
1131	      distinctions are rarely important for internationalization
1132	      purposes, there are exceptions.  Those writing specifications
1133	      should be very careful about definitions in cases in which the
1134	      exceptions might lead to ambiguity.

1136	   bidirectional display

1138	      The process or result of mixing left-to-right oriented text and
1139	      right-to-left oriented text in a single line is called
1140	      bidirectional display, often abbreviated as "bidi". <UNICODE>

1142	      Most of the world's written languages are displayed left-to-right.
1143	      However, many widely-used written languages such as ones based on
1144	      the Hebrew or Arabic scripts are displayed primarily right-to-left
1145	      (numerals are a common exception in the modern scripts).  Right-
1146	      to-left text often confuses protocol writers because they have to
1147	      keep thinking in terms of the order of characters in a string in
1148	      memory, an order that might be different from what they see on the
1149	      screen.  (Note that some languages are written both horizontally
1150	      and vertically and that some historical ones use other display
1151	      orderings.)

1153	      Further, bidirectional text can cause confusion because there are
1154	      formatting characters in ISO/IEC 10646 that cause the order of
1155	      display of text to change.  These explicit formatting characters
1156	      change the display regardless of the implicit left-to-right or
1157	      right-to-left properties of characters.  Text that might contain
1158	      those characters typically requires careful processing before
1159	      being sorted or compared for equality.

1161	      It is common to see strings with text in both directions, such as
1162	      strings that include both text and numbers, or strings that
1163	      contain a mixture of scripts.

1165	      Unicode has a long and incredibly detailed algorithm for
1166	      displaying bidirectional text [UAX9].
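
              The implicit directional properties and the explicit
              formatting characters can be inspected with Python's
              unicodedata module, as in the sketch below.  It does not
              implement the algorithm of [UAX9], and the helper names
              are hypothetical.  The set of explicit classes is written
              out by hand as an assumption about which classes matter.

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override, and
# isolate formatting characters.
EXPLICIT = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_classes(text):
    """Bidirectional class of each character, in memory order."""
    return [unicodedata.bidirectional(ch) for ch in text]

def has_explicit_bidi_formatting(text):
    """True if the text contains an explicit directional formatting char."""
    return any(unicodedata.bidirectional(ch) in EXPLICIT for ch in text)

# Latin letter, Hebrew letter, Arabic letter, European digit.
assert bidi_classes("a\u05d0\u06271") == ["L", "R", "AL", "EN"]

# U+202E (RIGHT-TO-LEFT OVERRIDE) changes display order explicitly.
assert has_explicit_bidi_formatting("abc\u202edef")
assert not has_explicit_bidi_formatting("abc")
```

              A check of this kind could be one step in the careful
              processing mentioned above, before strings are sorted or
              compared.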

1168	   undisplayable character

1170	      A character that has no displayable form. <RFCtbd>

1172	      For instance, the zero-width space (U+200B) cannot be displayed
1173	      because it takes up no horizontal space.  Formatting characters
1174	      such as those for setting the direction of text are also
1175	      undisplayable.  Note, however, that every character in [UNICODE]
1176	      has a glyph associated with it, and that the glyphs for
1177	      undisplayable characters are enclosed in a dashed square as an
1178	      indication that the actual character is undisplayable.

1180	      The property of a character that causes it to be undisplayable is
1181	      intrinsic to its definition.  Undisplayable characters can never
1182	      be displayed in normal text (the dashed square notation is used
1183	      only in special circumstances).  Printable characters whose
1184	      Unicode definitions are associated with glyphs that cannot be
1185	      rendered on a particular system are not, in this sense,
1186	      undisplayable.
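
      The two examples given above (the zero-width space and the
      direction-setting characters) share the Unicode general category
      "Cf" (Other, format).  The Python sketch below tests for that
      category.  Treating "Cf" as a stand-in for "undisplayable" is an
      assumption of this example only; the definition above is not
      stated in terms of general categories.

```python
import unicodedata


def is_format_character(ch):
    """True if ch has Unicode general category Cf (Other, format)."""
    return unicodedata.category(ch) == "Cf"


# ZERO WIDTH SPACE, RIGHT-TO-LEFT OVERRIDE, and an ordinary letter.
for ch in ("\u200B", "\u202E", "A"):
    print("U+%04X" % ord(ch), is_format_character(ch))
```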

1188	   writing style

1190	      Conventions of writing the same script in different styles.
1191	      <RFCtbd>

1193	      Different communities using the script may find text in different
1194	      writing styles difficult to read and possibly unintelligible.  For
1195	      example, the Perso-Arabic Nastalique writing style and the Arabic
1196	      Naskh writing style both use the Arabic script but have very
1197	      different renderings and are not mutually comprehensible.  Writing
1198	      styles may have significant impact on internationalization; for
1199	      example, the Nastalique writing style requires significantly more
1200	      line height than the Naskh writing style.

1202	6.  Text in Current IETF Protocols

1204	   Many IETF protocols started off being fully internationalized, while
1205	   others have been internationalized as they were revised.  In this
1206	   process, IETF members have seen patterns in the way that many
1207	   protocols use text.  This section describes some specific protocol
1208	   interactions with text.

1210	   protocol elements

1212	      Protocol elements are uniquely-named parts of a protocol. <RFCtbd>

1214	      Almost every protocol has named elements, such as "source port" in
1215	      TCP.  In some protocols, the names of the elements (or text tokens
1216	      for the names) are transmitted within the protocol.  For example,
1217	      in SMTP and numerous other IETF protocols, the names of the verbs
1218	      are part of the command stream.  The names are thus part of the
1219	      protocol standard.  The names of protocol elements are not
1220	      normally seen by end users and it is rarely appropriate to
1221	      internationalize protocol element names (even while the elements
1222	      themselves can be internationalized).

1224	   name spaces

1226	      A name space is the set of valid names for a particular item, or
1227	      the syntactic rules for generating these valid names. <RFCtbd>

1229	      Many items in Internet protocols use names to identify specific
1230	      instances or values.  The names may be generated (by some
1231	      prescribed rules), registered centrally (e.g., with IANA),
1232	      or have a distributed registration and control mechanism, such as
1233	      the names in the DNS.

1235	   on-the-wire encoding

1237	      The encoding and decoding used before and after transmission over
1238	      the network is often called the "on-the-wire" (or sometimes just
1239	      "wire") format. <RFCtbd>

1241	      Characters are identified by code points.  Before being
1242	      transmitted in a protocol, they must first be encoded as bits and
1243	      octets.  Similarly, when characters are received in a
1244	      transmission, they have been encoded, and a protocol that needs to
1245	      process the individual characters needs to decode them before
1246	      processing.
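
      A minimal Python sketch of this round trip follows.  UTF-8 is
      chosen here only as an example of an on-the-wire encoding; the
      variable names are illustrative.

```python
# Four characters; the last one is U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
text = "caf\u00E9"

# Sender side: code points are encoded as octets before transmission.
wire = text.encode("utf-8")
print(len(text), "characters ->", len(wire), "octets")

# Receiver side: octets are decoded back into characters before any
# per-character processing takes place.
received = wire.decode("utf-8")
assert received == text
```

      Note that the number of octets on the wire differs from the
      number of characters, which is why per-character processing has
      to happen after decoding.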

1248	   parsed text

1250	      Text strings that are analyzed for subparts. <RFCtbd>

1252	      In some protocols, free text in text fields might be parsed.  For
1253	      example, many mail user agents (MUAs) will parse the words in the
1254	      text of the Subject: field to attempt to thread based on what
1255	      appears after the "Re:" prefix.

1257	      Such conventions are very sensitive to localization.  If, for
1258	      example, a form like "Re:" is altered by an MUA to reflect the
1259	      language of the sender or recipient, a system that subsequently
1260	      does threading may not recognize the replacement term as a
1261	      delimiter string.
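
      The fragility described above can be sketched in Python.  The
      prefix list and the function name are hypothetical: "AW" and "SV"
      are reply prefixes produced by some localized mail clients, but
      no such list can be assumed to be complete, which is exactly the
      problem.

```python
import re

# Hypothetical, deliberately incomplete list of reply prefixes.
REPLY_PREFIXES = ("Re", "AW", "SV")

_prefix_re = re.compile(
    r"^\s*(?:(?:%s)\s*:\s*)+" % "|".join(REPLY_PREFIXES),
    re.IGNORECASE,
)


def base_subject(subject):
    """Strip leading reply prefixes that appear in REPLY_PREFIXES."""
    return _prefix_re.sub("", subject).strip()


print(base_subject("Re: AW: Meeting"))   # prefixes known to the list
print(base_subject("Antw: Meeting"))     # prefix unknown to the list
```

      The second subject is left unchanged because its prefix is not in
      the list, so a threading system built this way would treat it as
      a new conversation.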

1263	   charset identification

1265	      Specification of the charset used for a string of text. <RFCtbd>

1267	      Protocols that allow more than one charset to be used in the same
1268	      place should require that the text be identified with the
1269	      appropriate charset.  Without this identification, a program
1270	      looking at the text cannot definitively discern the charset of the
1271	      text.  Charset identification is also called "charset tagging".
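
      The following Python sketch shows why untagged text is ambiguous:
      the same octet decodes to different characters under two
      different charsets.  The two charsets were picked arbitrarily for
      the example.

```python
octets = b"\xE9"

# The same octet, interpreted under two different charset labels.
as_latin1 = octets.decode("iso-8859-1")
as_cp1251 = octets.decode("windows-1251")

print("U+%04X" % ord(as_latin1))   # a Latin letter
print("U+%04X" % ord(as_cp1251))   # a Cyrillic letter
```

      Nothing in the octet itself indicates which interpretation the
      sender intended.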

1273	   language identification

1275	      Specification of the human language used for a string of text.
1276	      <RFCtbd>

1278	      Some protocols (such as MIME and HTTP) allow text that is meant
1279	      for machine processing to be identified with the language used in
1280	      the text.  Such identification is important for machine processing
1281	      of the text, such as by systems that render the text by speaking
1282	      it.  Language identification is also called "language tagging".
1283	      The IETF "LTRU" standards [RFC5646] and [RFC4647] provide a
1284	      comprehensive model for language identification.

1286	   MIME

1288	      MIME (Multipurpose Internet Mail Extensions) is a message format
1289	      that allows for textual message bodies and headers in character
1290	      sets other than US-ASCII in formats that require ASCII (most
1291	      notably RFC 5322, the standard for Internet mail headers
1292	      [RFC5322]).  MIME is described in RFCs 2045 through 2049, as well
1293	      as more recent RFCs. <RFCtbd>

1295	   transfer encoding syntax

1297	      A transfer encoding syntax (TES) (sometimes called a transfer
1298	      encoding scheme) is a reversible transform of already-encoded data
1299	      that is represented in one or more character encoding schemes.
1300	      <RFCtbd>

1302	      TESs are useful for encoding types of character data into
1303	      another format, usually for allowing new types of data to be
1304	      transmitted over legacy protocols.  The main examples of TESs used
1305	      in the IETF include Base64 and quoted-printable.  MIME identifies
1306	      the transfer encoding syntax for body parts as a Content-transfer-
1307	      encoding, occasionally abbreviated C-T-E.

1309	   Base64

1311	      Base64 is a transfer encoding syntax that allows binary data to be
1312	      represented by the ASCII characters A through Z, a through z, 0
1313	      through 9, +, /, and =.  It is defined in [RFC2045]. <RFCtbd>
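
      As a small illustration, the Python standard library's base64
      module can be used to see the alphabet and the "=" padding in
      action.  The input values are arbitrary.

```python
import base64

# One, two, and three input octets: the "=" character pads the output
# to a multiple of four characters.
for data in (b"M", b"Ma", b"Man"):
    encoded = base64.b64encode(data).decode("ascii")
    print(data, "->", encoded)
    # The transform is reversible.
    assert base64.b64decode(encoded) == data
```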

1315	   quoted printable

1317	      Quoted printable is a transfer encoding syntax that allows strings
1318	      that have non-ASCII characters mixed in with mostly ASCII
1319	      printable characters to be somewhat human readable.  It is
1320	      described in [RFC2047]. <RFCtbd>
1321	      The quoted printable syntax is generally considered to be a
1322	      failure at being readable.  It is jokingly referred to as "quoted
1323	      unreadable".
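
      A short Python sketch using the standard library's quopri module
      shows the effect on a mostly-ASCII string.  UTF-8 is assumed as
      the underlying character encoding for this example.

```python
import quopri

# Mostly ASCII text with one non-ASCII character (U+00E9), encoded as UTF-8.
original = "caf\u00E9".encode("utf-8")

encoded = quopri.encodestring(original)
print(encoded)

# The ASCII letters pass through unchanged; the transform is reversible.
assert quopri.decodestring(encoded) == original
```

      The ASCII portion stays legible, while each non-ASCII octet
      becomes an "=" followed by two hexadecimal digits.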

1325	   XML

1327	      XML (which is an approximate abbreviation for Extensible Markup
1328	      Language) is a popular method for structuring text.  XML text that
1329	      is not encoded as UTF-8 is explicitly tagged with charsets, and
1330	      all text in XML consists only of Unicode characters.  The
1331	      specification for XML can be found at <http://www.w3.org/XML/>.
1332	      <RFCtbd>

1334	   ASN.1 text formats

1336	      The ASN.1 data description language has many formats for text
1337	      data.  The formats allow for different repertoires and different
1338	      encodings.  Some of the formats that appear in IETF standards
1339	      based on ASN.1 include IA5String (all ASCII characters),
1340	      PrintableString (most ASCII characters, but missing many
1341	      punctuation characters), BMPString (characters from ISO/IEC 10646
1342	      plane 0 in UTF-16BE format), UTF8String (just as the name
1343	      implies), and TeletexString (also called T61String).

1345	   ASCII-compatible encoding (ACE)

1347	      Starting in 1996, many ASCII-compatible encoding schemes (which
1348	      are actually transfer encoding syntaxes) have been proposed as
1349	      possible solutions for internationalizing host names and some
1350	      other purposes.  Their goal is to be able to encode any string of
1351	      ISO/IEC 10646 characters using the preferred syntax for domain
1352	      names (as described in STD 13).  At the time of this writing, only
1353	      the ACE encoding produced by Punycode [RFC3492] has become an IETF
1354	      standard.

1356	      The choice of ACE forms to internationalize legacy protocols must
1357	      be made with care as it can cause some difficult side effects
1358	      [RFC6055].

1360	   LDH label

1362	      The classical label form used in the DNS and most applications
1363	      that call on it, albeit with some additional restrictions,
1364	      reflects the early syntax of "hostnames" [RFC0952] and limits
1365	      those names to ASCII letters, digits, and embedded hyphens.  The
1366	      hostname syntax is identical to that described as the "preferred
1367	      name syntax" in Section 3.5 of RFC 1034 [RFC1034] as modified by
1368	      RFC 1123 [RFC1123].  LDH labels are defined in a more restrictive
1369	      and precise way for internationalization contexts as part of the
1370	      IDNA2008 specification [RFC5890].
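
      The letters, digits, and embedded hyphens rule can be sketched as
      a Python regular expression.  This is a simplified check written
      for illustration, with a hypothetical function name; it also
      applies the 63-octet DNS label length limit, and it does not
      implement the finer distinctions (such as NR-LDH labels) made in
      [RFC5890].

```python
import re

# ASCII letter or digit at each end, hyphens only in the interior,
# and at most 63 characters overall.
_LDH_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label):
    """Simplified test for the letters-digits-hyphen label form."""
    return _LDH_RE.fullmatch(label) is not None


for candidate in ("example", "-bad", "bad-", "b\u00FCcher"):
    print(repr(candidate), is_ldh_label(candidate))
```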

1372	7.  Terms Associated with Internationalized Domain Names

1374	7.1.  IDNA Terminology

1376	   The current specification for Internationalized Domain Names (IDNs),
1377	   known formally as Internationalized Domain Names for Applications or
1378	   IDNA, is referred to in the IETF and parts of the broader community
1379	   as "IDNA2008" and consists of several documents.  Section 2.3 of the
1380	   first of those documents, commonly known as "IDNA2008 Definitions"
1381	   [RFC5890] provides definitions and introduces some specialized terms
1382	   for differentiating among types of DNS labels in an IDN context.
1383	   Those terms are listed in the table below; see RFC 5890 for the
1384	   specific definitions if needed.

1386	      ACE Prefix
1387	      A-label
1388	      Domain Name Slot
1389	      IDNA-valid string
1390	      Internationalized Domain Name (IDN)
1391	      Internationalized Label
1392	      LDH Label
1393	      NR-LDH label
1394	      U-label

1396	   Two additional terms entered the IETF's vocabulary as part of the
1397	   earlier IDN effort [RFC3490] (IDNA2003):

1399	      Stringprep

1401	         Stringprep [RFC3454] provides a model and character tables for
1402	         preparing and handling internationalized strings.  It was used
1403	         in the original IDN specification (IDNA2003) via a profile
1404	         called "Nameprep" [RFC3491].  It is no longer in use in IDNA,
1405	         but continues to be used in profiles by a number of other
1406	         protocols. <RFCtbd>

1408	      Punycode

1410	         This is the name of the algorithm [RFC3492] used to convert
1411	         otherwise-valid IDN labels from native-character strings
1412	         expressed in Unicode to an ASCII-compatible encoding (ACE).
1413	         Strictly speaking, the term applies to the algorithm only.  In
1414	         practice, it is widely, if erroneously, used to refer to
1415	         strings that the algorithm encodes.
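
         The distinction between the algorithm and its output can be
         sketched with Python's built-in "punycode" codec.  The sample
         label is arbitrary.  The sketch applies only the Punycode
         algorithm and the ACE prefix; it performs none of the
         validity checks that IDNA2008 requires.

```python
label = "b\u00FCcher"

# Punycode proper: the algorithm applied to a Unicode string.
encoded = label.encode("punycode").decode("ascii")
print(encoded)

# The ACE form used in the DNS adds the ACE prefix "xn--".
ace_form = "xn--" + encoded
print(ace_form)

# The algorithm is reversible.
assert encoded.encode("ascii").decode("punycode") == label
```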

1417	7.2.  Character Relationships and Variants

1419	   The term "variant" was introduced into the IETF i18n vocabulary with
1420	   the JET recommendations [RFC3743].  As used there, it referred
1421	   strictly to the relationship between Traditional Chinese characters
1422	   and their Simplified equivalents.  The JET recommendations provided a
1423	   model for identifying these pairs of characters and labels that used
1424	   them.  Specific recommendations for variant handling for the Chinese
1425	   language were provided in a follow-up document [RFC4713].

1427	   In more recent years, the term has also been used to describe other
1428	   collections of characters or strings that might be perceived as
1429	   equivalent.  Those collections have involved one or more of several
1430	   categories of characters and labels containing them including:

1432	   o  "visually similar" or "visually confusable" characters.  These may
1433	      be limited to characters in different scripts, characters in a
1434	      single script, or both, and may be those that can appear to be
1435	      alike even when high-distinguishability reference fonts are used
1436	      or under various circumstances that may involve malicious choices
1437	      of typefaces or other ways to trick user perception.  Trivial
1438	      examples include ASCII "l" and "1" and Latin and Cyrillic "a".

1440	   o  Characters assigned more than one Unicode code point because of
1441	      some special property.  These characters may be considered "the
1442	      same" for some purposes and different for others (or by other
1443	      users).  One of the most commonly-cited examples is the Arabic
1444	      YEH, which is encoded more than once because some of its shapes
1445	      are different across different languages.  Another example is the
1446	      Greek lower case sigma and final sigma: if the latter were viewed
1447	      purely as a positional presentation variation on the former, it
1448	      should not have been assigned a separate code point.

1450	   o  Numerals and labels including them.  Unlike letters, the "meaning"
1451	      of decimal digits is clear and unambiguous regardless of the
1452	      script with which they are associated.  Some scripts are routinely
1453	      used almost interchangeably with European digits and digits native
1454	      to that script.  Arabic script has two sets of digits
1455	      (U+0660..U+0669 and U+06F0..U+06F9), written identically for zero
1456	      through three and seven through nine but differently for four
1457	      through six; European digits predominate in other areas.
1458	      Substitution of digits with the same numeric value in labels may
1459	      give rise to another type of variant.

1461	   o  Orthographic differences within a language.  Many languages have
1462	      alternate choices of spellings or spellings that differ by locale.
1463	      Users of those languages generally recognize the spellings as
1464	      equivalent, at least as much so as the variations described above.
1465	      Examples include "color" and "colour" in English, German words
1466	      spelled with o-umlaut or "oe", and so on.  Some of these
1467	      differences may also create other types of language-specific
1468	      perceived differences that do not exist for other languages using
1469	      the same script.  For example, in Arabic language usage at the end
1470	      of words, ARABIC LETTER TEH MARBUTA (U+0629) and ARABIC LETTER HEH
1471	      (U+0647) are differently shaped (one has two dots on top of it)
1472	      but are used interchangeably in writing: they "sound" similar when
1473	      pronounced at the end of a phrase, and hence the LETTER TEH
1474	      MARBUTA sometimes is written as LETTER HEH and the two are
1475	      considered "confusable" in that context.
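
   The numerals item in the list above can be illustrated with a short
   Python sketch: three strings built from different digit code points
   carry the same decimal values, yet they are distinct strings.  The
   helper name is hypothetical.

```python
import unicodedata


def digit_values(s):
    """Return the decimal value of each digit character in s."""
    return [unicodedata.decimal(ch) for ch in s]


european = "456"
arabic_indic = "\u0664\u0665\u0666"           # from U+0660..U+0669
extended_arabic_indic = "\u06F4\u06F5\u06F6"  # from U+06F0..U+06F9

for s in (european, arabic_indic, extended_arabic_indic):
    print([("U+%04X" % ord(ch)) for ch in s], digit_values(s))
```

   A plain string comparison treats the three as different, which is
   why substitution of digits can produce labels that users perceive
   as variants of one another.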

1477	   The term "variant" as used in this section should also not be
1478	   confused with other uses of the term in this document or in Unicode
1479	   terminology (e.g., those in Section 4.1 above).  If the term is to be
1480	   used at all, context should clearly distinguish among these different
1481	   uses and, in particular, between variant characters and variant
1482	   labels.  Local text should identify which meaning, or combination of
1483	   meanings, is intended.

1485	8.  Other Common Terms In Internationalization

1487	   This is a hodge-podge of other terms that have appeared in
1488	   internationalization discussions in the IETF.

1490	   locale

1492	      Locale is the user-specific location and cultural information
1493	      managed by a computer. <RFCtbd>

1495	      Because languages and orthographic conventions differ from country
1496	      to country (and even region to region within a country), the
1497	      locale of the user can often be an important factor.  Typically,
1498	      the locale information for a user includes the language(s) used.

1500	      Locale issues go beyond character use, and can include things such
1501	      as the display format for currency, dates, and times.  Some
1502	      locales (especially the popular "C" and "POSIX" locales) do not
1503	      include language information.

1505	      It should be noted that there are many thorny, unsolved issues
1506	      with locale.  For example, should text be viewed using the locale
1507	      information of the person who wrote the text, information that
1508	      would apply to the location of the system storing or providing the
1509	      text, or the person viewing it?  What if the person viewing it is
1510	      traveling to different locations?  Should only some of the locale
1511	      information affect creation and editing of text?

1513	   Latin characters

1515	      "Latin characters" is an imprecise term for characters
1516	      historically related to ancient Greek script as modified in the
1517	      Roman Republic and Empire and currently used throughout the world.
1518	      <RFCtbd>

1520	      The base Latin characters are a subset of the ASCII repertoire and
1521	      have been augmented by many single and multiple diacritics and
1522	      quite a few other characters.  ISO/IEC 10646 encodes the Latin
1523	      characters in ranges including U+0020..U+024F and U+1E00..U+1EFF.

1525	      Because "Latin characters" is used in different contexts to refer
1526	      to the letters from the ASCII repertoire, the subset of those
1527	      characters used late in the Roman Republic period or the different
1528	      subset used to write Latin in medieval times, the entire ASCII
1529	      repertoire, all of the code points in the extended Latin script as
1530	      defined by Unicode, and other collections, the term should be
1531	      avoided in IETF specifications when possible.  Similarly, "Basic
1532	      Latin" should not be used as a synonym for "ASCII".

1534	   romanization

1536	      The transliteration of a non-Latin script into Latin characters.
1537	      <RFCtbd>

1539	      Because of the widespread use of Latin characters, people have
1540	      tried to represent many languages that are not based on a Latin
1541	      repertoire in Latin characters.  For example, there are two
1542	      popular romanizations of Chinese: Wade-Giles and Pinyin, the
1543	      latter of which is by far more common today.  Many romanization
1544	      systems are inexact and do not give perfect round trip mappings
1545	      between the native script and the Latin characters.

1547	   CJK characters and Han characters

1549	      The ideographic characters used in Chinese, Japanese, Korean, and
1550	      traditional Vietnamese writing systems are often called "CJK
1551	      characters" after the initial letters of the language names in
1552	      English.  They are also called "Han characters", after the term in
1553	      Chinese that is often used for these characters. <RFCtbd>

1555	      Note that Han characters do not include the phonetic characters
1556	      used in the Japanese and Korean languages.  Users of the term "CJK
1557	      characters" may or may not assume those additional characters are
1558	      included.

1560	      In ISO/IEC 10646, the Han characters were "unified", meaning that
1561	      each set of Han characters from Japanese, Chinese, and/or Korean
1562	      that had the same origin was assigned a single code point.  The
1563	      positive result of this was that many fewer code points were
1564	      needed to represent Han; the negative result of this was that
1565	      characters that people who write the three languages think are
1566	      different have the same code point.  There is a great deal of
1567	      disagreement on the nature, the origin, and the severity of the
1568	      problems caused by Han unification.

1570	   translation

1572	      The process of conveying the meaning of some passage of text in
1573	      one language, so that it can be expressed equivalently in another
1574	      language. <RFCtbd>

1576	      Many language translation systems are inexact and cannot be
1577	      applied repeatedly to go from one language to another to another.

1579	   transliteration

1581	      The process of representing the characters of an alphabetical or
1582	      syllabic system of writing by the characters of a conversion
1583	      alphabet. <RFCtbd>

1585	      Many script transliterations are exact, and many have perfect
1586	      round-trip mappings.  The notable exception to this is
1587	      romanization, described above.  Transliteration involves
1588	      converting text expressed in one script into another script,
1589	      generally on a letter-by-letter basis.  There are many official
1590	      and unofficial transliteration standards, most notably those from
1591	      ISO TC 46 and the U.S. Library of Congress.

1593	   transcription

1595	      The process of systematically writing the sounds of some passage
1596	      of spoken language, generally with the use of a technical phonetic
1597	      alphabet (usually Latin-based) or other systematic transcriptional
1598	      orthography.  Transcription also sometimes refers to the
1599	      conversion of written text into a transcribed form, based on the
1600	      sound of the text as if it had been spoken. <RFCtbd>

1602	      Unlike transliterations, which are generally designed to be round-
1603	      trip convertible, transcriptions of written material are almost
1604	      never round-trip convertible to their original form, at least
1605	      without some supplemental information.

1607	   regular expressions

1609	      Regular expressions provide a mechanism to select specific strings
1610	      from a set of character strings.  Regular expressions are a
1611	      language used to search for text within strings, and possibly
1612	      modify the text found with other text. <RFCtbd>

1614	      Pattern matching for text involves being able to represent one or
1615	      more code points in an abstract notation, such as searching for
1616	      all capital Latin letters or all punctuation.  The most common
1617	      mechanism in IETF protocols for naming such patterns is the use of
1618	      regular expressions.  There is no single regular expression
1619	      language, but there are numerous very similar dialects that are
1620	      not quite consistent with each other.

1622	      The Unicode Consortium has a good discussion about how to adapt
1623	      regular expression engines to use Unicode [UTR18].
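
      As one example of how dialects differ in their treatment of
      non-ASCII text, the Python sketch below runs the same pattern in
      Python's default Unicode-aware mode and in its ASCII-only mode.
      The behavior shown is specific to Python's "re" module; other
      regular expression dialects may differ.

```python
import re

text = "na\u00EFve caf\u00E9"

# Default for str patterns in Python 3: \w matches Unicode word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w matches only [a-zA-Z0-9_], so words are split at
# the non-ASCII letters.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```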

1625	   private use character

1627	      ISO/IEC 10646 code points from U+E000 to U+F8FF, U+F0000 to
1628	      U+FFFFD, and U+100000 to U+10FFFD are available for private use.
1629	      This refers to code points of the standard whose interpretation is
1630	      not specified by the standard and whose use may be determined by
1631	      private agreement among cooperating users. <UNICODE>

1633	      The use of these "private use" characters is defined by the
1634	      parties who transmit and receive them, and is thus not appropriate
1635	      for standardization.  (The IETF has a long history of private use
1636	      names for things such as "x-" names in MIME types, charsets, and
1637	      languages.  Most of the experience with these has been quite
1638	      negative, with many implementors assuming that private use names
1639	      are in fact public and long-lived.)
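
      The three private use ranges given above can be checked directly,
      as in the Python sketch below.  The function name is
      hypothetical.  The final loop cross-checks the ranges against the
      Unicode general category "Co" (Other, private use) as reported by
      Python's unicodedata module.

```python
import unicodedata

# The three private use ranges listed above, as inclusive bounds.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(code_point):
    """True if code_point falls in one of the private use ranges."""
    return any(lo <= code_point <= hi for lo, hi in PRIVATE_USE_RANGES)


for cp in (0xE000, 0xF8FF, 0xF900, 0xFFFFD, 0xFFFFE, 0x10FFFD):
    print("U+%04X" % cp, is_private_use(cp),
          unicodedata.category(chr(cp)) == "Co")
```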

1641	9.  Security Considerations

1643	   Security is not discussed directly in this document.  While the
1644	   definitions here have no direct effect on security, they are used in
1645	   many security contexts.  For example, authentication usually involves
1646	   comparing two tokens, and one or both of those tokens might be text;
1647	   thus, some methods of comparison might involve using some of the
1648	   internationalization concepts for which terms are defined in this
1649	   document.

1651	   Having said that, other RFCs dealing with internationalization have
1652	   security consideration descriptions that may be useful to the reader
1653	   of this document.  In particular, the security considerations in RFC
1654	   3454, RFC 3629, RFC 4013, and RFC 5890 go into a fair amount of
1655	   detail.

1657	10.  IANA Considerations

1659	   [RFC Editor: Please remove this section before publication.]

1661	   This document contains definitions and discussion only -- there are
1662	   no actions for IANA.

1664	11.  References

1666	11.1.  Normative References

1668	   [ISOIEC10646]
1669	              ISO/IEC, "ISO/IEC 10646:2011. International Standard --
1670	              Information technology - Universal Multiple-Octet Coded
1671	              Character Set (UCS)", 2011.

1673	   [RFC2047]  Moore, K., "MIME (Multipurpose Internet Mail Extensions)
1674	              Part Three: Message Header Extensions for Non-ASCII Text",
1675	              RFC 2047, November 1996.

1677	   [UNICODE]  The Unicode Consortium, "The Unicode Standard, Version
1678	              6.0", Mountain View, CA: The Unicode Consortium,
1679	              ISBN 978-1-936213-01-6, 2011,
1680	              <http://www.unicode.org/versions/Unicode6.0.0/>.

1682	11.2.  Informative References

1684	   [CHARMOD]  W3C, "Character Model for the World Wide Web 1.0", 2005,
1685	              <http://www.w3.org/TR/charmod/>.

1687	   [FRAMEWORK]
1688	              ISO/IEC, "ISO/IEC TR 11017:1997(E). Information technology
1689	              - Framework for internationalization, prepared by ISO/IEC
1690	              JTC 1/SC 22/WG 20", 1997.

1692	   [ISO3166]  ISO, "ISO 3166-1:2006 - Codes for the representation of
1693	              names of countries and their subdivisions -- Part 1:
1694	              Country codes", 2006.

1696	   [ISO639]   ISO, "ISO 639-1:2002 - Code for the representation of
1697	              names of languages - Part 1: Alpha-2 code", 2002.

1699	   [ISO6429]  ISO/IEC, "ISO/IEC 6429:1992. Information
1700	              technology -- Control functions for coded character
1701	              sets", 1992.

1703	   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
1704	              host table specification", RFC 952, October 1985.

1706	   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
1707	              STD 13, RFC 1034, November 1987.

1709	   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
1710	              and Support", STD 3, RFC 1123, October 1989.

1712	   [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
1713	              Extensions (MIME) Part One: Format of Internet Message
1714	              Bodies", RFC 2045, November 1996.

1716	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1717	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1719	   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
1720	              Languages", BCP 18, RFC 2277, January 1998.

1722	   [RFC2781]  Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO
1723	              10646", RFC 2781, February 2000.

1725	   [RFC2978]  Freed, N. and J. Postel, "IANA Charset Registration
1726	              Procedures", BCP 19, RFC 2978, October 2000.

1728	   [RFC3454]  Hoffman, P. and M. Blanchet, "Preparation of
1729	              Internationalized Strings ("stringprep")", RFC 3454,
1730	              December 2002.

1732	   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
1733	              "Internationalizing Domain Names in Applications (IDNA)",
1734	              RFC 3490, March 2003.

1736	   [RFC3491]  Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
1737	              Profile for Internationalized Domain Names (IDN)",
1738	              RFC 3491, March 2003.

1740	   [RFC3492]  Costello, A., "Punycode: A Bootstring encoding of Unicode
1741	              for Internationalized Domain Names in Applications
1742	              (IDNA)", RFC 3492, March 2003.

1744	   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
1745	              10646", STD 63, RFC 3629, November 2003.

1747	   [RFC3743]  Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
1748	              Engineering Team (JET) Guidelines for Internationalized
1749	              Domain Names (IDN) Registration and Administration for
1750	              Chinese, Japanese, and Korean", RFC 3743, April 2004.

1752	   [RFC4647]  Phillips, A. and M. Davis, "Matching of Language Tags",
1753	              BCP 47, RFC 4647, September 2006.

1755	   [RFC4713]  Lee, X., Mao, W., Chen, E., Hsu, N., and J. Klensin,
1756	              "Registration and Administration Recommendations for
1757	              Chinese Domain Names", RFC 4713, October 2006.

1759	   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
1760	              BCP 137, RFC 5137, February 2008.

1762	   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
1763	              Interchange", RFC 5198, March 2008.

1765	   [RFC5322]  Resnick, P., Ed., "Internet Message Format", RFC 5322,
1766	              October 2008.

1768	   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
1769	              Languages", BCP 47, RFC 5646, September 2009.

1771	   [RFC5890]  Klensin, J., "Internationalized Domain Names for
1772	              Applications (IDNA): Definitions and Document Framework",
1773	              RFC 5890, August 2010.

1775	   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
1776	              Internationalized Domain Names for Applications (IDNA)",
1777	              RFC 5892, August 2010.

1779	   [RFC5895]  Resnick, P. and P. Hoffman, "Mapping Characters for
1780	              Internationalized Domain Names in Applications (IDNA)
1781	              2008", RFC 5895, September 2010.

1783	   [RFC6055]  Thaler, D., Klensin, J., and S. Cheshire, "IAB Thoughts on
1784	              Encodings for Internationalized Domain Names", RFC 6055,
1785	              February 2011.

1787	   [UAX34]    The Unicode Consortium, "Unicode Standard Annex #34:
1788	              Unicode Named Character Sequences", 2010,
1789	              <http://www.unicode.org/reports/tr34>.

1791	   [UAX9]     The Unicode Consortium, "Unicode Standard Annex #9:
1792	              Unicode Bidirectional Algorithm", 2010,
1793	              <http://www.unicode.org/reports/tr9>.

1795	   [US-ASCII]
1796	              ANSI, "Coded Character Set -- 7-bit American Standard Code
1797	              for Information Interchange, ANSI X3.4-1986", 1986.

1799	   [UTN6]     The Unicode Consortium, "Unicode Technical Note #6:
1800	              BOCU-1: MIME-Compatible Unicode Compression", 2006,
1801	              <http://www.unicode.org/notes/tn6/>.

1803	   [UTR15]    The Unicode Consortium, "Unicode Standard Annex #15:
1804	              Unicode Normalization Forms", 2010,
1805	              <http://www.unicode.org/reports/tr15>.

1807	   [UTR18]    The Unicode Consortium, "Unicode Technical Standard #18:
1808	              Unicode Regular Expressions", 2008,
1809	              <http://www.unicode.org/reports/tr18>.

1811	   [UTR22]    The Unicode Consortium, "Unicode Technical Standard #22:
1812	              Unicode Character Mapping Markup Language", 2009,
1813	              <http://www.unicode.org/reports/tr22>.

1815	   [UTR6]     The Unicode Consortium, "Unicode Technical Standard #6: A
1816	              Standard Compression Scheme for Unicode", 2005,
1817	              <http://www.unicode.org/reports/tr6>.

1819	   [W3C-i18n-Def]
1820	              W3C, "Localization vs. Internationalization",
1821	              September 2010,
1822	              <http://www.w3.org/International/questions/qa-i18n.en>.

1824	Appendix A.  Additional Interesting Reading

1826	   [[anchor20: RFC Editor: should these be standardized into your normal
1827	   reference format??]]

1829	   ALA-LC Romanization Tables, Randall Barry (ed.), U.S. Library of
1830	   Congress, 1997, ISBN 0844409405

1832	   The Alphabetic Labyrinth: The Letters in History and Imagination,
1833	   Johanna Drucker, Thames and Hudson Ltd, 1995, ISBN 0-500-28068-1

1835	   Blackwell Encyclopedia of Writing Systems, Florian Coulmas, Blackwell
1836	   Publishers, 1999, ISBN 063121481X

1838	   Chinese Calligraphy, Edoardo Fazzioli, Abbeville Press, 1986, 1987
1839	   (English translation), ISBN 0-89659-774-1

1840	   The Chinese Language: Fact and Fantasy, John DeFrancis, University of
1841	   Hawaii Press, 1984, ISBN 0-8284-085505 and 0-8248-1058-6

1843	   CJKV Information Processing, Ken Lunde, O'Reilly & Assoc., 1999, ISBN
1844	   1-56592-224-7

1846	   Dictionary of Languages: The Definitive Reference to More than 400
1847	   Languages, Andrew Dalby, 2004, ISBN 978-0231115698

1849	   Language Visible, David Sacks, Bantam Dell, 2003.  Also published as
1850	   Letter Perfect: The Marvelous History of Our Alphabet From A to Z,
1851	   Broadway, 2004, ISBN 978-0767911733

1853	   Reading the Past: Ancient Writing from Cuneiform to the Alphabet,
1854	   introduction by J.T. Hooker, British Museum Press, 1990,
1855	   ISBN 0-7141-8077-7

1857	   The Story of Writing: Alphabets, Hieroglyphs, & Pictograms, Andrew
1858	   Robinson, Thames and Hudson, 1995, 2000, ISBN 0-500-28156-4

1860	   The World's Writing Systems, Peter Daniels and William Bright, Oxford
1861	   University Press, 1996, ISBN 0195079930

1863	   Writing Systems of the World, Akira Nakanishi, Charles E. Tuttle
1864	   Company, 1980, ISBN 0804816549

1866	Appendix B.  Acknowledgements

1868	   The definitions in this document come from many sources, including a
1869	   wide variety of IETF documents.

1871	   James Seng contributed to the initial outline of RFC 3536.  Harald
1872	   Alvestrand and Martin Duerst made extensive useful comments on early
1873	   versions.  Others who contributed to the development of RFC 3536
1874	   include Dan Kohn, Jacob Palme, Johan van Wingen, Peter Constable,
1875	   Yuri Demchenko, Susan Harris, Zita Wenzel, John Klensin, Henning
1876	   Schulzrinne, Leslie Daigle, Markus Scherer, and Ken Whistler.

1878	   Abdulaziz Al-Zoman, Tim Bray, Frank Ellermann, Antonio Marko, JFC
1879	   Morfin, Sarmad Hussain, Mykyta Yevstifeyev, Ken Whistler, and others
1880	   identified important issues with, or made specific suggestions for,
1881	   this new version.

1883	Appendix C.  Significant Changes from RFC 3536

1885	   This document mostly consists of additions to RFC 3536.  The
1886	   following is a list of the most significant changes.

1888	   o  The document's status was changed to BCP.

1890	   o  Common synonyms were added to several descriptions and indexed.

1892	   o  A list of terms defined and used in IDNA2008 was added, with a
1893	      pointer to RFC 5890.  Those definitions have not been repeated in
1894	      this document.

1896	   o  The much-abused term "variant" is now discussed in some detail.

1898	   o  A discussion of different subsets of the Unicode repertoire was
1899	      added as Section 4.2 and associated definitions were included.

1901	   o  A new term, "writing style", was added.

1903	   o  Discussions of case-folding and mapping were expanded.

1905	   o  Minor edits were made to some section titles and a number of other
1906	      editorial improvements were made.

1908	   o  The discussion of control codes was updated to include additional
1909	      information and clarify that "control code" and "control
1910	      character" are synonyms.

1912	   o  Many terms were clarified to reflect contemporary usage.

1914	   o  The index to terms by section in RFC 3536 was replaced by an index
1915	      to pages containing considerably more terms.

1917	   o  The acknowledgements were updated.

1919	   o  Some of the references were updated.

1921	   o  The supplemental reading list was expanded somewhat.

1923	Index

1925	   A
1926	      A-label  30
1927	      ACE  29, 31
1928	      ACE Prefix  30
1929	      alphabetic  19
1930	      ANSI  13
1931	      ASCII  14
1932	      ASCII-compatible encoding  29, 31
1933	      ASN.1 text formats  29

1935	   B
1936	      Base64  28
1937	      Basic Multilingual Plane  13
1938	      bidi  25
1939	      bidirectional display  25
1940	      BMP  13
1941	      BMPString  29
1942	      BOCU-1  14
1943	      BOM  14
1944	      byte order mark  14

1946	   C
1947	      C-T-E  28
1948	      case  17
1949	      CCS  7
1950	      CEN/ISSS  13
1951	      character  6
1952	      character encoding form  7
1953	      character encoding scheme  7
1954	      character repertoire  7
1955	      charset  8
1956	      charset identification  27
1957	      CJK characters  33
1958	      code chart  19
1959	      code point  15
1960	      code table  19
1961	      coded character  6
1962	      coded character set  7
1963	      collation  18
1964	      combining character  16
1965	      combining character sequence  16
1966	      compatibility character  21
1967	      compatibility variant  21
1968	      composite sequence  16
1969	      content-transfer-encoding  28
1970	      control character  21
1971	      control code  21
1972	      control sequence  21

1974	   D
1975	      decomposed character  16
1976	      diacritic  20
1977	      displaying and rendering text  10
1978	      Domain Name Slot  30

1980	   E
1981	      encoding forms  13

1983	   F
1984	      font  24
1985	      formatting character  21

1987	   G
1988	      glyph  7
1989	      glyph code  7
1990	      graphic symbol  24

1992	   H
1993	      Han characters  33

1995	   I
1996	      l10n  9
1997	      i18n  9
1998	      IA5String  29
1999	      ideographic  19
2000	      IDN  30
2001	      IDNA  30
2002	      IDNA-valid string  30
2003	      IDNA2003  30
2004	      IDNA2008  30
2005	      IME  23
2006	      input method editor  23
2007	      input methods  23
2008	      internationalization  8
2009	      Internationalized Domain Name  30
2010	      Internationalized domain names  30
2011	      Internationalized Label  30
2012	      ISO  11
2013	      ISO 639  11
2014	      ISO 3166  11
2015	      ISO 8859  14
2016	      ISO TC 46  11

2018	   J
2019	      JIS  13
2020	      JTC 1  11

2022	   L
2023	      language  5
2024	      language identification  28
2025	      Latin characters  33
2026	      LDH Label  30
2027	      letters  22
2028	      Local and regional standards organizations  13
2029	      locale  32
2030	      localization  9

2032	   M
2033	      MIME  28
2034	      multilingual  9

2036	   N
2037	      name spaces  27
2038	      Nameprep  30
2039	      NFC  17
2040	      NFD  17
2041	      NFKC  17
2042	      NFKD  17
2043	      non-ASCII  22
2044	      nonspacing character  20
2045	      normalization  16
2046	      NR-LDH label  30
2047	      NVT  15

2049	   O
2050	      on-the-wire encoding  27

2052	   P
2053	      parsed text  27
2054	      precomposed character  16
2055	      PrintableString  29
2056	      private use character  35
2057	      protocol elements  26
2058	      punctuation  20
2059	      Punycode  29, 31

2061	   Q
2062	      quoted-printable  28

2064	   R
2065	      regular expressions  35
2066	      rendering rules  23
2067	      repertoire  7
2068	      romanization  33

2070	   S
2071	      SAC  13
2072	      script  5
2073	      SCSU  14
2074	      sorting  18
2075	      Stringprep  30
2076	      surrogate pair  14
2077	      symbol  20

2079	   T
2080	      T61String  29
2081	      TeletexString  29
2082	      TES  28
2083	      transcoding  7
2084	      transcription  34
2085	      transfer encoding syntax  28
2086	      transformation formats  13
2087	      translation  34
2088	      transliteration  33-34
2089	      typeface  24

2091	   U
2092	      U-label  30
2093	      UCS-2  13
2094	      UCS-4  13
2095	      undisplayable character  26
2096	      Unicode Consortium  12
2097	      US-ASCII  14
2098	      UTC  12
2099	      UTF-8  14
2100	      UTF-16  14
2101	      UTF-16BE  14
2102	      UTF-16LE  14
2103	      UTF-32  14
2104	      UTF8String  29

2106	   V
2107	      variant  31

2109	   W
2110	      W3C  13
2111	      World Wide Web Consortium  13
2112	      writing style  26
2113	      writing system  6

2115	   X
2116	      XML  13, 29

2118	Authors' Addresses

2120	   Paul Hoffman
2121	   VPN Consortium

2123	   Email: paul.hoffman@vpnc.org

2125	   John C Klensin
2126	   1770 Massachusetts Ave, Ste 322
2127	   Cambridge, MA  02140
2128	   USA

2130	   Phone: +1 617 245 1457
2131	   Email: john+ietf@jck.com