I am the assigned Gen-ART reviewer for this draft. The General Area Review Team (Gen-ART) reviews all IETF documents being processed by the IESG for the IETF Chair. Please treat these comments just like any other comments.

For more information, please see the FAQ at .

Document: draft-bray-unichars-09
Reviewer: Dale R. Worley
Review Date: 2024-10-20
IETF LC End Date: [not known]
IESG Telechat date: [not known]

Summary:

This draft is basically ready for publication, but has a considerable number of editorial issues that should be fixed before publication.

Editorial comments:

Check whether "numeric values", "code points", and "characters" are used correctly throughout the document. I don't have a good sense of the proper usage of these terms regarding Unicode, but I have a sense (that might be incorrect) that "code point" is a subclass of "numeric value", and should always be used when referring to the number representing a character.

You probably want to ASCII-ize various quote symbols used in the document. I'm not sure how the Editor wants to handle the "black heart" characters, but they are informative examples and ought to be retained if possible.

1. Introduction

   This document discusses issues that apply in choosing subsets, names two subsets that have been popular in practice, and suggests one new subset. The goal is to provide a convenient target for cross-reference from other specifications.

It would be useful to describe here why the newly defined subsets are superior to the two existing subsets. Also, this statement is incorrect; the document defines four new subsets, comprising one base class and three profiles.

1.1. Notation

   In the text, Unicode’s standard "U+", zero-padded to four places, is used. For example, "A", decimal 65, would be expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, would be U+1F5A4.

This seems awkward to me. Perhaps:

   In the text, we use Unicode’s standard notation of "U+" followed by four or more hexadecimal digits.
   For example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, is U+1F5A4.

--

   The subsets are described both in ABNF and as PRECIS profiles [RFC8264].

This is correct, but ... The entire document is organized as being within the PRECIS conceptual framework, and yet the references to PRECIS are all phrased as pointers to various parts of the PRECIS RFCs, not to the whole. The document should "at the top level" (it seems like this means in section 1) state that it is part of, or within, the PRECIS framework, and reference the relevant PRECIS RFCs at that point. The later references to PRECIS can then be omitted unless they are to specific sections of RFCs that are relevant to the particular reference.

2. Characters and Code Points

   However, each Unicode character is assigned a code point, used to represent the characters in computer memory and storage systems and, in specifications, to specify allowed subsets.

This is an awkward mix of singular and plural usages. Inquire of Editor the best way to phrase this.

   Section 6.1 defines a new PRECIS base class that encompasses all Unicode code points. This base class is used for the PRECIS profiles for the subsets defined in this document.

Would be a little clearer as

   Section 6.1 defines a new PRECIS base class, UnicodeBaseClass, that encompasses all Unicode code points. UnicodeBaseClass is used for the PRECIS profiles for the subsets defined in this document.

Also, "used for" could probably be replaced with a more specific term describing the relationship between a base class and a profile.

2.1. Transformation Formats

   However, it is useful to note that the "UTF-16" format represents each code point with one or two 16-bit chunks, and the “UTF-8” format uses variable-length byte sequences.

I think the usual terminology would be "variable-length sequences of 8-bit chunks" or better "variable-length sequences of octets".

2.2. Problematic Code Points

   [...]
   would benefit from careful consideration of the issues described by PRECIS; [...]

It seems to me this ought to specify where these issues are described.

   Definition D10a in section 3.4 of [UNICODE] defines seven code point types. Three types of code points are assigned to entities which are not actually characters or whose value as Unicode characters in text fields is questionable: "Surrogate", "Control", and "Noncharacter". In this document, "problematic" refers to code points whose type is "Surrogate" or "Noncharacter", and to "legacy controls" as defined in Section 2.2.2.2.

Given that "section 3.4" at the beginning of the paragraph refers to [UNICODE], it might be clearer to say "as defined in Section 2.2.2.2 of this document" or "as defined in Section 2.2.2.2 below".

2.2.1. Surrogates

   A total of 2,048 code points, in the range U+D800-U+DFFF, are divided

Since "the range" consists of 2,048 code points, this can be said more exactly:

   A total of 2,048 code points, the range U+D800-U+DFFF, are divided

Also, doesn't "total" take a singular verb? Or is that an Americanism?

2.2.2.2. Legacy Controls

   Aside from the useful controls, the control codes are mostly obsolete

I think you need to capitalize "Control Codes" here.

2.2.3. Noncharacters

Rule D15 in section 3.4 of Unicode 15.0.0 shows "noncharacter" as not intrinsically capitalized in Unicode usage. But rule D10a shows "Noncharacter" as intrinsically capitalized. Perhaps ask the Editor about this.

3. Dealing With Problematic Code Points

   [RFC9413], "Maintaining Robust Protocols", provides a thorough discussion of strategies for dealing with issues in input data, for example problematic code points.

Probably better to use "including" in place of "for example".

   [...] can be used in attacks based on misleading human readers of text that attempt to display them [TR36].

Text does not itself attempt anything. Better is "attacks based on attempting to display text that includes them".

   [...]
   differs in programming-language implementations [...]

I would say "differs between".

   Thus, in theory, if a specification requires that input data be encoded with UTF-8, implementors should never have to concern themselves with surrogates.

This sentence doesn't make sense to me. If a specification requires something, there is no "in theory" which implies that the input data will conform to the specification. Perhaps something like

   Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence which would map to a surrogate is ill-formed. If a specification requires that input data be encoded with UTF-8, and all input were well-formed, implementors would never have to concern themselves with surrogates.

But it's not clear to me that the second sentence adds any useful information. It seems that the paragraph could just continue with the next sentence:

   Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor.

If the source of the data is controlled by the implementor, it isn't "input". So it seems to me that "where the source of input data is not controlled by the implementor" can be omitted.

   In particular, the specification of JSON allows any code point to appear in object member names and string values [RFC8259]; the following is a conforming JSON text:

It seems like this should start a new paragraph, and be prefixed with "For example,".

   Reasonable options for dealing with problematic input include, first, rejecting text containing problematic code points, and second, replacing them with placeholders. (As an exception, [UNICODE] notes that it may in some cases be appropriate, specifically for noncharacters, to treat them as non-problematic unassigned code points.)

I think you can omit "As an exception", since the parenthesized sentence already contains "may in some cases be appropriate".
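
To make the surrogate discussion above concrete, here is a minimal Python sketch. It is my own illustration, not text from the draft: the JSON text, the helper name `is_surrogate`, and the choice of U+FFFD as the placeholder are all assumptions made for the example. It shows a lone surrogate arriving through a JSON escape even though the JSON text itself is plain ASCII, and then the two handling options quoted above (reject, or replace with a placeholder).

```python
import json

def is_surrogate(ch: str) -> bool:
    """True if ch is a code point in the surrogate range U+D800-U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF

# Hypothetical input: the escape \uDEAD names a lone surrogate, but the
# JSON text itself contains only ASCII characters.
text = '{"example": "\\uDEAD"}'

# Python's json module accepts the escape and yields a lone surrogate.
value = json.loads(text)["example"]
found_surrogate = any(is_surrogate(ch) for ch in value)

# Option 1: reject. Encoding to UTF-8 fails because a surrogate has no
# well-formed UTF-8 representation.
try:
    value.encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True

# Option 2: replace each surrogate with a placeholder (U+FFFD assumed here).
replaced = "".join("\uFFFD" if is_surrogate(ch) else ch for ch in value)
```

Other JSON parsers may behave differently on such input (some reject it outright), which is part of why the draft's point about interoperability seems worth illustrating.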
   Silently deleting an ill-formed part of a string is a known security risk.

It seems well worth referencing a discussion of the "known security risk".

   [RFC9413] emphasizes that when encountering problematic input, software should consider the field as a whole, not individual code points or bytes.

This needs to be clarified; RFC 9413 does not contain the word "field", and has only one instance of "as a whole" (in the phrase "protocol as a whole").

4.1. Unicode Scalars

   This subset is called the UnicodeScalarsClass for use in PRECIS.

This is awkward. Why not:

   This subset is the PRECIS profile UnicodeScalarsClass.

Similarly in sections 4.2 and 4.3.

4.2. XML Characters

   [...] surrogates, legacy C0 Controls, and the noncharacters U+FFFE [...]

The phrase "legacy C0 Controls" is not defined. I think you mean "C0 Controls".

4.3. Unicode Assignables

   This subset comprises all code points that are currently assigned, or might in future be assigned, to characters that are not legacy control codes.

This is awkward because it seems to be careful to exclude "code points that might in future be assigned to characters that are legacy control codes", and of course there are none of those. Probably better:

   This subset comprises all code points that are currently assigned, excluding legacy control codes, or that might in future be assigned.

5. Using Subsets

   These formats specify default subsets.

This is unclear. Do you mean:

   These specifications specify default subsets of Unicode for use in their protocols.

--

   Note that escaping techniques such as those in the JSON example in Section 3 cannot be used to circumvent this sort of restriction, which applies to data content, not textual representation in packaging formats.

This could be clarified.
Perhaps:

   A restriction placed on the contents of a name or value would not be circumventable by an escaping technique (such as those in the JSON example in Section 3) because the restriction applies to the data content, not the textual representation of the content.

6.1. Addition to the PRECIS Base Classes Registry

   Reference: Section 2 of this RFC

This isn't flagged explicitly for Editor/IANA attention. That may be OK, but usually these items are marked explicitly. See also other occurrences of "this RFC".

6.2.3. Unicode Assignables Profile

   Applicability: Protocols that want to allow all Unicode code points that are currently assigned, or might be assigned in the future, to characters that are not "legacy controls" as defined in Section 2.2.2.2

It seems like this should be "section 2.2.2.2 of [this RFC]". Also, see the comment for section 4.3.

7. Security Considerations

It might be worth pointing to section 3 here, as that section contains some security considerations, and points to security considerations documented elsewhere.

   Note that the Unicode-character subsets specified in this document include a successively-decreasing number of problematic code points, [...]

It might be worth explicitly saying "problematic code points (as defined in section 2.2)" so section 7 can be read correctly by someone who hasn't read the rest of the document.

8. Normative References

   [UNICODE] The Unicode Consortium, "The Unicode Standard", . Note that this reference is to the latest version of Unicode, rather than to a specific release. It is not expected that future changes in the Unicode Standard will affect the referenced definitions.

It isn't your problem, but currently the URL goes to a page titled "Unicode(R) 16.0.0" that gives only a summary of changes, not the contents of Unicode 16. You have to go to e.g. to see the standard.

[END]