Network Working Group                                            T. Bray
Internet-Draft                                       Textuality Services
Intended status: Standards Track                              P. Hoffman
Expires: 22 February 2024                                          ICANN
                                                          21 August 2023


            Specifying Unicode Character Repertoires in RFCs
                         draft-bray-unichars-00

Abstract

   This document describes how to specify the use of Unicode characters
   in a helpful and unambiguous way.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 22 February 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.


Bray & Hoffman          Expires 22 February 2024                [Page 1]

Internet-Draft             Specifying Unicode                August 2023


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   3
     1.2.  Notation  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Character Concepts  . . . . . . . . . . . . . . . . . . . . .   3
     2.1.  Transformation Formats  . . . . . . . . . . . . . . . . .   3
     2.2.  Problematic Code Points . . . . . . . . . . . . . . . . .   4
       2.2.1.  Surrogates  . . . . . . . . . . . . . . . . . . . . .   4
       2.2.2.  Control Codes . . . . . . . . . . . . . . . . . . . .   4
       2.2.3.  Noncharacters . . . . . . . . . . . . . . . . . . . .   4
   3.  Subsets Defined in the Unicode Standard . . . . . . . . . . .   5
     3.1.  Unicode Code Points . . . . . . . . . . . . . . . . . . .   5
     3.2.  Unicode Scalar Values . . . . . . . . . . . . . . . . . .   5
   4.  Other Definitions . . . . . . . . . . . . . . . . . . . . . .   6
     4.1.  XML Characters  . . . . . . . . . . . . . . . . . . . . .   6
     4.2.  Basic Unicode Characters  . . . . . . . . . . . . . . . .   6
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   7
   6.  Security Considerations . . . . . . . . . . . . . . . . . . .   7
   7.  Normative References  . . . . . . . . . . . . . . . . . . . .   7
   8.  Informative References  . . . . . . . . . . . . . . . . . . .   8
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .   8
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   8

1.  Introduction

   When a protocol or data format has text fields, that text is normally
   composed of Unicode [UNICODE] characters, to support use by speakers
   of all the world's languages.  Unfortunately, the Unicode Standard
   does not define the term "Unicode character" in a way that is useful
   for technical specifications.

   Protocols and data formats SHOULD describe exactly which selection of
   the available Unicode characters is to be used.  This document uses
   the term "character repertoire" to describe such a subset of the
   Unicode characters.  Authors should have a way to concisely and
   exactly reference a stable specification that identifies a protocol
   or data format's character repertoire.

   Several subsets have been popular choices as character repertoires
   in code and specifications.  This document describes and names them,
   and proposes one new subset.  The goal of this document is to provide
   a convenient target for cross-reference from other specifications
   that wish to use one of these character repertoires.


1.1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

1.2.  Notation

   In this document, the numeric values assigned to Unicode characters
   are provided in hexadecimal.  In the text, Unicode's standard "U+"
   notation [RFC5137] is used.  For example, "A", decimal 65, would be
   expressed as U+0041, and "😉" (Winking Face), decimal 128,521, would
   be U+1F609.

   Certain groups of numeric values described in Section 3 and Section 4
   are given in ABNF [RFC5234].  In ABNF, the hexadecimal values for
   characters are preceded by "%x" rather than "U+".

   All the numeric ranges in this document are inclusive.
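
   As an informal illustration, which is not part of this
   specification, both notations can be produced from an integer code
   point with a few lines of Python.  The function names u_plus and
   abnf_value are hypothetical and are used only in this sketch.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in "U+" notation, padded to at least four hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not in the range U+0000 through U+10FFFF")
    return f"U+{code_point:04X}"


def abnf_value(code_point: int) -> str:
    """Format a code point as an ABNF hexadecimal terminal, e.g. %x41."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not in the range U+0000 through U+10FFFF")
    return f"%x{code_point:X}"


if __name__ == "__main__":
    print(u_plus(65))        # "A", decimal 65
    print(u_plus(128521))    # Winking Face, decimal 128,521
    print(abnf_value(65))    # the same "A" as it would appear in ABNF
```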

2.  Character Concepts

   The Unicode Standard's definition of "Unicode character" is
   conceptual.  However, each Unicode character is assigned an integer
   identifier in the range U+0000 through U+10FFFF, and these numbers
   are used to specify the allowed repertoires of Unicode characters in
   code and specifications.

   The numbers assigned to Unicode characters are called "code points";
   there are potentially 1,114,112 of them.  As of 2023, fewer than
   150,000 characters have had code points assigned.  While the
   inclusion of unassigned code points in text data is undesirable, it
   is difficult to specify that it should be avoided, because unassigned
   code points regularly become assigned as new characters are added to
   Unicode.  Fortunately, the occurrence of unassigned code points in
   texts is generally unlikely to cause software to malfunction.

2.1.  Transformation Formats

   Unicode describes a variety of "transformation formats", ways to
   encode code points in bytes of computer memory.  A survey of
   transformation formats is beyond the scope of this document.
   However, it is useful to note that the "UTF-16" transformation format
   represents each code point with one or two 16-bit chunks, and the
   "UTF-8" transformation format uses variable-length sequences of one
   to four bytes.


   The UTF-8 transformation format is very widely used for interoperable
   data formats such as JSON, YAML, and XML.
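
   The difference between the two transformation formats can be
   observed with a short Python sketch.  It relies only on the standard
   "utf-8" and "utf-16-be" codecs; the helper name encoded_sizes is
   hypothetical.

```python
def encoded_sizes(code_point: int) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 16-bit unit count) for one code point.

    Surrogate code points are rejected by both codecs and raise
    UnicodeEncodeError.
    """
    ch = chr(code_point)
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-be" emits no byte order mark, so every two bytes are one unit.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return utf8_bytes, utf16_units


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F609):
        print(f"U+{cp:04X}", encoded_sizes(cp))
```

   Code points up to U+FFFF fit in a single 16-bit unit in UTF-16,
   while UTF-8 needs between one and four bytes depending on the
   magnitude of the code point.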

2.2.  Problematic Code Points

   Some code points are assigned to constructs which are not actually
   characters or whose value as Unicode characters is questionable.

2.2.1.  Surrogates

   A total of 2,048 code points, in the range U+D800-U+DFFF, are divided
   into two blocks called "high surrogates" and "low surrogates";
   collectively the 2,048 code points are referred to as "surrogates".
   Surrogates can be used in high-surrogate/low-surrogate pairs to
   represent code points greater than 65,535 in the UTF-16
   transformation format.

   A surrogate which occurs as a singleton, or which is in an
   improperly-composed pair, or which occurs in UTF-8-encoded text, has
   no meaning and may cause malfunctions in software which encounters
   it.
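
   The pairing arithmetic used by UTF-16 can be sketched in Python as
   follows.  This is an illustration under the assumption that the
   caller works with integer code points; the function names are
   hypothetical.

```python
def is_surrogate(code_point: int) -> bool:
    """True for any of the 2,048 code points in U+D800-U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point greater than 65,535 into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a properly composed pair; reject anything else."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("improperly composed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encodes_as_utf8(text: str) -> bool:
    """Python's strict UTF-8 encoder refuses surrogate code points."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F609)
    print(f"U+{high:04X} U+{low:04X}")
    print(encodes_as_utf8("\ud800"))       # a singleton surrogate
```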

2.2.2.  Control Codes

   Section 23.1 in chapter 23 of [UNICODE], "Special Areas and Format
   Characters", introduces the concept of "Control Codes".  They
   comprise 65 code points in the ranges U+0000-U+001F ("C0 Controls")
   and U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".

   The C0 controls include the newline (U+000A), carriage return
   (U+000D), and Tab (U+0009); this document refers to these three
   characters as the "useful controls".  Aside from these, the control
   codes are mostly obsolete and generally lack interoperable semantics.
   This document uses the phrase "useless controls" to describe control
   codes that are not useful controls.

   Since the C0 controls comprise zero and the 31 smallest positive
   integers, they are likely to occur in data as a result of
   programming errors.
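
   The distinction between useful and useless controls drawn in this
   section can be sketched in Python.  The names below are hypothetical
   and simply mirror the terms used in this document.

```python
# Tab, newline, and carriage return: the "useful controls".
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control_code(code_point: int) -> bool:
    """True for C0 controls, DEL (U+007F), and C1 controls."""
    return 0x00 <= code_point <= 0x1F or 0x7F <= code_point <= 0x9F


def is_useless_control(code_point: int) -> bool:
    """True for control codes that are not useful controls."""
    return is_control_code(code_point) and code_point not in USEFUL_CONTROLS


if __name__ == "__main__":
    controls = [cp for cp in range(0x100) if is_control_code(cp)]
    useless = [cp for cp in controls if is_useless_control(cp)]
    print(len(controls), len(useless))
```

   Under these definitions there are 65 control codes, of which 62 are
   useless controls.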

2.2.3.  Noncharacters

   Certain code points are permanently reserved by [UNICODE] for
   internal use and are referred to as "noncharacters".

   Code points are organized into 17 "planes", each containing 2^16 code
   points.  The last two code points in each plane are noncharacters:
   U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so
   on, up to U+10FFFE, U+10FFFF.


   In addition, the 32 code points in the range U+FDD0 to U+FDEF are
   noncharacters, for a total of 66 noncharacters.
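
   A noncharacter test follows directly from the two rules above.  The
   Python sketch below is illustrative only, and its names are
   hypothetical.

```python
PLANE_COUNT = 17
PLANE_SIZE = 2 ** 16


def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0-U+FDEF and for the last two code points of each plane."""
    if not 0 <= code_point < PLANE_COUNT * PLANE_SIZE:
        return False
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # The low 16 bits of the last two code points in a plane are FFFE or FFFF.
    return (code_point & 0xFFFE) == 0xFFFE


def plane_noncharacters() -> list[int]:
    """List the 34 noncharacters that end each of the 17 planes."""
    return [
        plane * PLANE_SIZE + offset
        for plane in range(PLANE_COUNT)
        for offset in (0xFFFE, 0xFFFF)
    ]


if __name__ == "__main__":
    print(PLANE_COUNT * PLANE_SIZE)        # size of the codespace
    total = sum(1 for cp in range(PLANE_COUNT * PLANE_SIZE) if is_noncharacter(cp))
    print(total)                           # 34 plane-final + 32 in U+FDD0-U+FDEF
```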

3.  Subsets Defined in the Unicode Standard

   This section describes popular subsets of the code points that are
   defined in [UNICODE].  Specifications can refer to these repertoires
   by the names "Unicode Code Points" and "Unicode Scalar Values".

3.1.  Unicode Code Points

   Definition D9 in chapter 3 of [UNICODE], "Conformance", defines the
   term "Unicode codespace" as "a range of integers from 0 to
   10FFFF_16".  Definition D10 defines the term "Code point" as "Any
   value in the Unicode codespace".

   The "Unicode Code Points" subset can be expressed as an ABNF
   production:

   unicode-code-points =
      %x0-10FFFF

   This subset has the advantage of including all possible code points.
   It has been adopted by JSON [RFC8259].

   However, this subset includes all of the problematic code points
   listed above, and implementors must be prepared to deal with
   meaningless code points such as those assigned to surrogates, useless
   controls, and noncharacters.

3.2.  Unicode Scalar Values

   Definition D76 in chapter 3 of [UNICODE] defines the term "Unicode
   scalar value" as "Any Unicode code point except high-surrogate and
   low-surrogate code points."

   The "Unicode Scalar Values" subset can be expressed as an ABNF
   production:

   unicode-scalar-values =
      %x0-D7FF / %xE000-10FFFF  ; exclude surrogates

   This subset has the advantage of excluding surrogates, which can
   never add any value and have the potential to cause problems.  This
   subset has been adopted by I-JSON [RFC7493].

   However, this subset still includes the useless controls and the
   noncharacters.
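
   Membership in the two subsets of this section can be checked with
   simple range comparisons.  The following Python sketch is one
   possible rendering; the function names are hypothetical.  Note that
   a Python string may itself contain surrogate code points, so the
   scalar-value check is not redundant.

```python
def is_code_point(value: int) -> bool:
    """The "Unicode Code Points" subset: %x0-10FFFF."""
    return 0x0 <= value <= 0x10FFFF


def is_scalar_value(value: int) -> bool:
    """The "Unicode Scalar Values" subset: code points minus surrogates."""
    return 0x0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def all_scalar_values(text: str) -> bool:
    """True if every code point in the string is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)


if __name__ == "__main__":
    print(0x110000 - 0x800)                  # number of scalar values
    print(all_scalar_values("caf\u00e9"))    # ordinary text
    print(all_scalar_values("a\ud800b"))     # contains a singleton surrogate
```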


4.  Other Definitions

   This section lists other ways to specify subsets of the code points
   beyond those provided by the Unicode Standard itself.  These subsets
   may serve as more appropriate character repertoires for some
   protocols and data formats than those in Section 3, depending on
   their needs.  Specifications can refer to these repertoires by the
   names "XML Characters" and "Basic Unicode Characters".

4.1.  XML Characters

   The XML 1.0 Specification [XML], in its grammar production labeled
   "Char", specifies a range of Unicode code points that excludes
   surrogates, useless C0 control codes, and the noncharacters U+FFFE
   and U+FFFF.

   The "XML Characters" subset can be expressed as an ABNF production:

   xml-chars =
      %x9 / %xA / %xD /   ; useful controls
      %x20-D7FF /         ; exclude surrogates
      %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
      %x10000-10FFFF

   While this subset does not exclude all the problematic code points,
   the C1 controls are less likely than the C0 controls to appear
   erroneously in data, and have not been observed to be a frequent
   source of problems.  Also, the noncharacters greater in value than
   U+FFFF are rarely encountered.

   This subset may be especially appropriate for data formats which may
   be represented in either JSON or XML.
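
   A membership test for this subset might look like the following
   Python sketch, which follows the "Char" production of XML 1.0.  The
   function name is_xml_char is hypothetical.

```python
def is_xml_char(code_point: int) -> bool:
    """True if the code point matches the XML 1.0 "Char" production."""
    return (
        code_point in (0x9, 0xA, 0xD)           # useful controls
        or 0x20 <= code_point <= 0xD7FF         # up to the surrogates
        or 0xE000 <= code_point <= 0xFFFD       # skips U+FFFE and U+FFFF
        or 0x10000 <= code_point <= 0x10FFFF    # all supplementary planes
    )


if __name__ == "__main__":
    for cp in (0x0, 0x9, 0x7F, 0x85, 0xD800, 0xFFFE, 0x1FFFE, 0x1F609):
        print(f"U+{cp:04X}", is_xml_char(cp))
```

   As the text above notes, DEL, the C1 controls, and the noncharacters
   above U+FFFF remain members of this subset.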

4.2.  Basic Unicode Characters

   For convenience, this document defines the "Basic Unicode Characters"
   subset as the Unicode code points, excluding the useless controls,
   surrogates, and noncharacters.

   Basic Unicode characters can be expressed as an ABNF production:


   basic-unichars =
      %x9 / %xA / %xD /               ; useful controls
      %x20-7E /                       ; exclude C1 controls and DEL
      %xA0-D7FF /                     ; exclude surrogates
      %xE000-FDCF /                   ; exclude FDD0 nonchars
      %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
      %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
      %x30000-3FFFD / %x40000-4FFFD /
      %x50000-5FFFD / %x60000-6FFFD /
      %x70000-7FFFD / %x80000-8FFFD /
      %x90000-9FFFD / %xA0000-AFFFD /
      %xB0000-BFFFD / %xC0000-CFFFD /
      %xD0000-DFFFD / %xE0000-EFFFD /
      %xF0000-FFFFD / %x100000-10FFFD
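
   The definition of this subset (all code points except the useless
   controls, surrogates, and noncharacters) can also be written as a
   predicate.  The Python sketch below is an illustration rather than a
   normative implementation, and its names are hypothetical.

```python
def is_basic_unichar(code_point: int) -> bool:
    """True for code points in the "Basic Unicode Characters" subset."""
    if code_point in (0x9, 0xA, 0xD):               # useful controls
        return True
    if code_point < 0x20 or 0x7F <= code_point <= 0x9F:
        return False                                # useless controls
    if 0xD800 <= code_point <= 0xDFFF:
        return False                                # surrogates
    if 0xFDD0 <= code_point <= 0xFDEF:
        return False                                # FDD0 noncharacters
    if (code_point & 0xFFFE) == 0xFFFE:
        return False                                # last two of each plane
    return code_point <= 0x10FFFF


def find_excluded(text: str) -> list[tuple[int, str]]:
    """Return (index, "U+XXXX") for each code point outside the subset."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if not is_basic_unichar(ord(ch))
    ]


if __name__ == "__main__":
    print(sum(1 for cp in range(0x110000) if is_basic_unichar(cp)))
    print(find_excluded("ok\x00\ufffe"))
```

   Counting the members gives 1,114,112 minus 2,048 surrogates, 62
   useless controls, and 66 noncharacters, that is, 1,111,936 code
   points.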

5.  IANA Considerations

   This document makes no requests of IANA.

6.  Security Considerations

   Unicode Security Considerations [TR36] is a wide-ranging survey of
   the issues implementors should consider while writing software to
   process Unicode text.  Many of the exploits it discusses are aimed at
   deceiving human readers, but vulnerabilities involving issues such as
   surrogates and noncharacters are also covered, and in fact can
   contribute to human-deceiving exploits.

   Note that the Unicode-character subsets specified in this document
   include a successively-decreasing number of surrogates and
   noncharacters, and thus should be less and less susceptible to
   vulnerabilities.  The Section 4.2 subset, "Basic Unicode Characters",
   excludes all of them.

7.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

   [TR36]     The Unicode Consortium, "Unicode Security Considerations",
              <https://www.unicode.org/reports/tr36/>.  Note that this
              reference is to the latest version of this document,
              rather than to a specific release.  It is not expected
              that future updates will affect the referenced
              discussions.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard",
              <http://www.unicode.org/versions/latest/>.  Note that this
              reference is to the latest version of Unicode, rather than
              to a specific release.  It is not expected that future
              changes in the Unicode Standard will affect the referenced
              definitions.

   [XML]      Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F.
              Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
              Edition)", 26 November 2008,
              <http://www.w3.org/TR/2008/REC-xml-20081126/>.  Note that
              this reference is to a specific release, based on a
              history of previous "Edition" releases having changed this
              production.

8.  Informative References

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
              BCP 137, RFC 5137, DOI 10.17487/RFC5137, February 2008,
              <https://www.rfc-editor.org/info/rfc5137>.

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015,
              <https://www.rfc-editor.org/info/rfc7493>.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

Acknowledgements

   Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata
   Report against RFC 8259, "The JavaScript Object Notation (JSON) Data
   Interchange Format", noting frequent references to "Unicode
   characters" when in fact the RFC formally specifies the use of
   Unicode code points.

Authors' Addresses


   Tim Bray
   Textuality Services
   Email: tbray@textuality.com


   Paul Hoffman
   ICANN
   Email: paul.hoffman@icann.org

