[apps-discuss] APPSDIR review of draft-farrell-decade-ni-07

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Tue, 05 June 2012 09:43 UTC

Message-ID: <4FCDD499.7060206@it.aoyama.ac.jp>
Date: Tue, 05 Jun 2012 18:42:49 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: draft-farrell-decade-ni.all@tools.ietf.org
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Cc: IETF discussion list <ietf@ietf.org>, "apps-discuss@ietf.org" <apps-discuss@ietf.org>
Subject: [apps-discuss] APPSDIR review of draft-farrell-decade-ni-07
Precedence: list

Hello everybody,

[For replies, please trim the cc list, thanks!]

I have been selected as the Applications Area Directorate reviewer for
this draft (for background on appsdir, please see
http://trac.tools.ietf.org/area/app/trac/wiki/ApplicationsAreaDirectorate ).

Please resolve these comments along with any other Last Call comments
you may receive. Please wait for direction from your document shepherd
or AD before posting a new version of the draft.

Document: draft-farrell-decade-ni-07
Title: Naming Things with Hashes
Reviewer: Martin Dürst
Review Date: 2012-06-03 (written up 2012-06-04/05)
IETF Last Call Date: started 2012-06-04, ends 2012-07-02

Summary: This draft addresses a real generic need, but the current form
of the draft is the result of adding more and more special cases without
a clear overall view and a firm hand to separate the wheat from the
chaff. This shows both in the technical issues as well as in many of the
editorial issues below. This draft is not ready for publication without
some serious additional work, but that work is mostly straightforward
and should be easy to complete quickly.

Major design issue:

The draft defines two schemes, which differ only slightly, and mostly
just gratuitously (see also editorial issues).
These are the ni: and the nih: scheme. As far as I understand, they
differ as follows:
                                    ni:         nih:
authority:                          optional    disallowed
ascii-compatible encoding:          base64url   base16
check digit:                        disallowed  optional
query part:                         optional    disallowed
decimal presentation of algorithm:  disallowed  possible

The usability of URIs is strongly influenced by the number of different
schemes: the smaller the number, the better. As a somewhat made-up
example, if the original URIs had been separated into httph: for HTML
pages and httpi: for images, or any other arbitrary subdivision that one
can envision, that would have hurt the growth and extensibility of the
Web. Creating new URI schemes is occasionally necessary, and the ideas
that led to this draft definitely seem to warrant a new scheme (*), but
there's no reason for two schemes.
[(*) I know people who would claim that the .well-known http/https thing
is completely sufficient, no new scheme needed at all.]

More specifically, if the original URIs had been separated into httpm:
(for machines) and httph: (for humans), the Web for sure wouldn't have
grown at the speed it did (and does) grow. In practice, there are huge
differences in human 'speakability' for URIs (and IRIs, for that
matter); compare e.g. http://google.com with
http://www.google.co.jp/#sclient=psy-ab&hl=en&site=&source=hp&q=hash&oq=hash&aq=f&aqi=g4&aql=
(which I have significantly shortened to hopefully eliminate potential
privacy issues), or compare the average mailto: URI with the average
data: URI. However, what's important is that there never has been a
strong dividing line between machine-only and human-only URIs or
schemes, the division has always been very gradual. Short and mainly
human-oriented URIs have of course been handled by machines, and on the
other hand, very long URIs have been spoken when really necessary.
"Speakability" has been maintained to some extent by scheme designers,
and to some extent by "survival of the fittest" (URIs that weren't very
speakable (or spellable/memorizable/guessable/...), and their Web sites,
might just die out slowly).

It should also be noted that the resistance against multiple URI schemes
may have been low because there are so many different ways to express
hashes in the draft anyway, and one more (the nih: section is the last
one before the examples section) didn't seem like much of a deal
anymore. But when it comes to URIs, one less is a lot better than one more.

In the above ni:/nih: distinction, nih: seems to have been added as an
afterthought after realizing that reading an ni: URI aloud over the
phone may be somewhat suboptimal because there is a need for repeated
"upper case" - "lower case" (sure very quickly shortened to "upper" -
"lower" and then to "up" - "low" or something similar). It is not a bad
idea to try to make sure that IETF technology, and URIs in particular,
are accessible to people with certain kinds of dyslexia. (There are
indeed people who have tremendous difficulties with distinguishing
upper- and lower-case letters, and this may or may not be connected with
other aspects of dyslexia.) It is however totally unclear to this
reviewer why this has to lead to two different URI schemes with other
gratuitous differences.

Finding a solution is rather easy (of course, other solutions may also
be possible): Merge the schemes, so that authority, check digit, and
query part are all optional (an authority part and/or a query part may
very well be very useful in human communication, and a check digit won't
hurt when transmitted electronically) and the decimal presentation of
the algorithm is always allowed, and use base32
(http://tools.ietf.org/html/rfc4648) as the encoding. This leads to a
16.6% less efficient encoding of the value part of the ni: URI, but
given that other URI-related encodings, e.g. the %-encoding resulting
when converting an IRI to an URI, are much less efficient, and that URI
infrastructure these days can handle URIs with more than 1000 bytes,
this should not be a serious problem. Also, there's a separate binary
format (section 6) that is more compact already.
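
As a rough check of the size trade-off described above, the following
Python sketch compares the unpadded length of a SHA-256 digest in
base64url, base32, and base16. The input string and the choice to strip
"=" padding are assumptions made for illustration only.

```python
import base64
import hashlib

# Illustrative input only; any 32-byte SHA-256 digest gives the same lengths.
digest = hashlib.sha256(b"Hello World!").digest()


def unpadded_len(encoded: bytes) -> int:
    """Length of an encoding with trailing '=' padding stripped."""
    return len(encoded.rstrip(b"="))


lengths = {
    "base64url": unpadded_len(base64.urlsafe_b64encode(digest)),
    "base32": unpadded_len(base64.b32encode(digest)),
    "base16": unpadded_len(base64.b16encode(digest)),
}

for name, n in lengths.items():
    # The ratio relative to base64url shows the cost of each alternative.
    print(f"{name:10s} {n:3d} chars  x{n / lengths['base64url']:.2f}")
```

Per character, base32 carries 5 bits against 6 for base64url, which is
where the loss of roughly one sixth in efficiency comes from.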

(relatively) Minor technical issues:

Section 2, "When the input to the hash algorithm is a public key value":
Is it absolutely clear that this will work for any and all public key
values, existing and future, and not only for what's currently around?
After all, as far as I understand, the concept of a public key is a
fairly general one.

"Other than in the above special case where public keys are used, we do
not specify the hash function input here. Other specifications are
expected to define this.": Do you really expect that to happen? Wouldn't
it be better to limit variability here as much as possible, and to use
media types to identify different kinds of data? This would also work
for public keys: If there's a MIME media type for a
SubjectPublicKeyInfo, then the fact that this media type is the
preferred way to transfer a public key becomes an application convention
rather than a special case in the spec. If a better way (or just another
way) to encode/transfer public keys became popular at a later date,
there would be no need to change the spec.

Related, in Section 3:
The "val" field MUST contain the output of base64url encoding the
result of applying the hash function ("alg") to its defined input,
which defaults to the object bytes that are expected to be returned
when the URI is dereferenced.
How do I know whether the default applies or not? The URI doesn't tell
you. Deducing from context is a bad idea.

Section 3: "Thus to ensure interoperability, implementations SHOULD NOT
generate URIs that employ URI character escaping": This is wrong and
needs to be fixed. Characters such as "&", "=", "#", and "%", as well as
ASCII characters not allowed in URIs and non-ASCII characters MUST be
%-encoded if they appear in query parameter values in URIs (or in query
parameter tags, which is however less likely). It would be better if the
spec here deferred to the URI spec rather than trying to come up with
its own rules.
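
To illustrate why %-encoding cannot be avoided in query parameter
values, here is a minimal Python sketch using the standard library. The
parameter tag "tag" and the sample value are hypothetical and not taken
from the draft.

```python
from urllib.parse import parse_qs, quote

# Hypothetical query parameter value containing characters that would
# otherwise be read as delimiters ("&", "=", "#") or as an escape ("%").
raw_value = "a&b=c#d%e"

# safe="" forces every reserved character to be %-encoded.
encoded = quote(raw_value, safe="")
query = "tag=" + encoded
print(query)

# Round trip: a generic query parser recovers the original value.
decoded = parse_qs(query)["tag"][0]
```

Without the encoding step, a generic parser would split the value at
"&" and cut the URI off at "#".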

Section 3: "The Named Information URI adapts the URI definition from the
URI Generic Syntax [RFC3986].": This sounds as if this were a voluntary
decision (and the text should be changed to avoid such an impression),
but if you don't conform to RFC 3986 syntax, you're not an URI. This is
the first time I have seen an URI scheme definition starting explicitly
with the top ABNF rule from RFC 3986
(http://tools.ietf.org/html/rfc3986#appendix-A). This is completely
unnecessary. Just make sure your production conforms to the generic URI
syntax, and mention all the ABNF rules from RFC3986 that you use.

Also, using the "URI" production from RFC 3986, and then silently
dropping the #fragment part, is technically wrong. Scheme definitions
have nothing to do with the fragment (including the question of whether
there's a fragment or not; the semantics of fragments are defined by the
MIME media type that you get when you resolve). This may not be
completely clear in RFC 4395, but the IRI WG is working on an update of
RFC 4395 where this will be made clearer (see also
http://trac.tools.ietf.org/wg/iri/trac/ticket/126; thanks for giving me
a chance to remember that I had to create a new issue in the tracker for
this :-).

Section 3, ABNF:
ni-hier-part = "//" authority path-algval
/ path-algval
This gives you ni://example.com/sha-256;f4OxZX_x_FO5... (//authority/)
and ni:/sha-256;f4OxZX_x_FO5... (one slash only), but the examples show
ni:///sha-256;f4OxZX_x_FO5... (three slashes). It looks like the ABNF
you want is:
ni-hier-part = "//" authority path-algval
             / "//" path-algval
(aligning "=" and "/" helps!)
or more simply:
ni-hier-part = "//" [authority] path-algval
or even more simply:
ni-hier-part = "//" authority path-algval
because authority can be empty; let's show this:
authority = [ userinfo "@" ] host [ ":" port ]
If we can show that host can be empty, we're done:
host = IP-literal / IPv4address / reg-name
If we can show that any one of these can be empty, we're done, let's
pick reg-name:
reg-name = *( unreserved / pct-encoded / sub-delims )
* means "zero or more", thus reg-name can be empty. QED.
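
The argument above can be checked mechanically. The Python sketch below
uses simplified stand-ins for the RFC 3986 productions (authority is
reduced to reg-name plus an optional port, and alg and val are reduced
to unreserved characters), so it is an illustration under those
assumptions, not a full parser. The value "f4OxZX_x_FO5" is the
truncated example from the text, used only as a placeholder.

```python
import re

# Simplified stand-ins for RFC 3986 productions (assumptions for illustration).
# reg-name = *( unreserved / pct-encoded / sub-delims ), so it may be empty.
REG_NAME = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
AUTHORITY = REG_NAME + r"(?::[0-9]*)?"
ALG_VAL = r"[A-Za-z0-9\-._~]+;[A-Za-z0-9\-._~]+"

# ni-hier-part = "//" authority "/" alg-val
NI_URI = re.compile(
    r"ni://(?P<authority>" + AUTHORITY + r")/(?P<algval>" + ALG_VAL + r")"
)


def parse_ni(uri: str):
    """Return (authority, alg-val), or None if the URI does not match."""
    m = NI_URI.fullmatch(uri)
    return (m.group("authority"), m.group("algval")) if m else None


with_authority = parse_ni("ni://example.com/sha-256;f4OxZX_x_FO5")
empty_authority = parse_ni("ni:///sha-256;f4OxZX_x_FO5")
single_slash = parse_ni("ni:/sha-256;f4OxZX_x_FO5")
```

With the single production, the three-slash form matches with an empty
authority, while the one-slash form is rejected.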

Section 4:
The HTTP(S) mapping MAY be used in any context where clients without
support for ni URIs are needed without loss of interoperability or
functionality.
What is meant by "support for ni"? There's nowhere in the spec where
this is explained clearly. If I were a browser maker, or writing an URI
library,..., what would I do to support the ni scheme? The only thing I
have come up with is to convert ni to the .well-known format, then use
HTTP(S). In that case, the above text seems wrong, as it says that
.well-known is used when there's no support for ni, not in order to
support ni.
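
A minimal sketch of such a conversion in Python follows. It assumes a
mapping of the form http://authority/.well-known/ni/alg/val; that
layout and the function name are assumptions made here and should be
checked against the draft.

```python
def ni_to_well_known(ni_uri: str, scheme: str = "http") -> str:
    """Map ni://authority/alg;val to a .well-known URL (assumed layout)."""
    prefix = "ni://"
    if not ni_uri.startswith(prefix):
        raise ValueError("not an ni URI with an authority part")
    # Split off an optional query part and carry it over unchanged.
    rest = ni_uri[len(prefix):].split("?", 1)
    authority, _, alg_val = rest[0].partition("/")
    if not authority:
        raise ValueError("an authority is needed to build an HTTP(S) URL")
    alg, sep, val = alg_val.partition(";")
    if not sep:
        raise ValueError("expected alg;val")
    url = f"{scheme}://{authority}/.well-known/ni/{alg}/{val}"
    if len(rest) == 2:
        url += "?" + rest[1]
    return url


# "f4OxZX_x_FO5" is a truncated placeholder value, not a complete hash.
mapped = ni_to_well_known("ni://example.com/sha-256;f4OxZX_x_FO5")
print(mapped)
```

Note that an ni URI without an authority cannot be mapped this way,
which is one more reason to spell out what "support for ni" means.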

Section 5: This defines an "URL segment format". It seems to be limited
to path components in HTTP URIs. What if I want to use this in a query
part, or maybe even as a fragment identifier? What if I want to use this
as a path component in an FTP URI? Or in some other scheme? It would be
better to define the alg-val (see next point) part as such (before the
other things), with an explanation along the following lines: "This is
defined here both for use in other sections of this document as well as
for use in other places where it may be helpful, such as HTTP URI path
segments,..."

Section 5 (and Section 3): "To do this one simply uses the "alg;val"
production": There is no "alg;val" production. Please change to "To do
this one simply uses the <alg-val> production" and fix the ABNF in
section 3 to
path-algval = "/" alg-val
alg-val = alg ";" val
It's probably even better to fold this in with the changes to
ni-hier-part, resulting e.g. in:
ni-hier-part = "//" authority "/" alg-val
alg-val = alg ";" val

Section 9.4: Status can be 'empty' or 'deprecated'. I suggest to replace
'empty' with something positive, such as 'valid' or 'active'. This will
help people who go to the IANA page and start to ask "well, it doesn't
have a status, what does that mean". Also, I strongly suggest to add an
additional status 'reserved', and remove the current "Reserved" hash
name string from the entries with IDs 0 and 32.

Section 9.4: "The Suite ID value 32 is reserved for compatibility with
ORCHIDs [RFC4843].": How will compatibility be kept for future
changes/additions in ORCHID?

Major editorial issues:

Title and abstract (and the spec itself) use the wording "Naming
Things". While in a security context, it may be that there is an
implicit assumption that there are only digital things, in a wider
context, this is of course not true. Research on the Internet of Things
and efforts such as the Semantic Web/Linked Data try to deal with things
in the real world. People in these areas will be confused by the title,
abstract, and text, unless you can show (me and) them an ni: hash for a
person, an apple, a building, or an elephant. Therefore, while it may be
possible to keep the catchy title, the abstract has to be fixed to avoid
such misunderstandings, e.g. by changing "to identify a thing" to "to
identify a digital object" or some such in the abstract, and likewise in
the main text of the spec.

"Human-speakable" (e.g. ), "human-readable" (e.g. section title of
section 7), and "for humans" (e.g. section title of section 9.2): These
terms are used throughout the spec, but are imprecise and confusing.
First, there's the problem of interpreting "for humans" in the sense of
the previous paragraph, which of course has to be fixed. But the main
problem is that none of the "ni:" URIs are "non-human-readable" or
"non-human-speakable". Reading them aloud is only somewhat more tedious,
but not at all impossible. And because the value part of the nih: form
is 50% longer, and people quickly develop conventions for shortening
things such as "upper case" and "lower case", it's not even clear that
reading aloud the nih: form will necessarily take that much time.
Therefore, I strongly recommend to change all occurrences of
"Human-speakable", "human-readable", "for humans", and the like, to the
more precise "more easily read out aloud by humans" or something equivalent.

Abstract and further on: "specifying URI, URL": By all URx theories (see
e.g. http://www.w3.org/TR/uri-clarification/), URLs are a subset of
URIs, and therefore saying that the spec specifies an URI and an URL is
somewhat confusing. I'd propose using wording along the following lines:
"specifying an URI scheme and a way to map these URIs to http".

Section 2, "When the input to the hash algorithm is a public key value",
and example section: It took me a while to understand that the "public
key" stuff was not yet another way to present a hash, and also not a way
to mix in a public key to the hash in order to obtain some specific
security property (I wasn't able to figure out how that would work, but
draft-hallambaker-decade-ni-params contains something similar involving
digital signatures and a public key). The document would be much easier
to understand if there was a section e.g. entitled "Forms of input to
hash", with subsections e.g. "general data", "public keys", "other stuff
(not defined in this document)". As it is written, the relevant
paragraphs in section 2 look like an afterthought, and it's not clear to
what.
Also, the example section should be fixed as follows: 1) say upfront
that there will be two examples, one for a short string and another for
a public key. 2) Make sure both examples exercise all forms (the public
key example seems to be pretty complete, but the "Hello World!" example
seems to be incomplete). 3) Use the same form of presentation (either a
table in both cases or short paragraphs in both cases).
The caption on Figure 7 is also way too unspecific.

Section 9.4: "Hash Name Algorithm Registry", and later "a new registry
for hash algorithms as used in the name formats specified here": IANA
will be helped tremendously if your draft comes with an
easy-to-understand and unambiguous name for the new registry. "Hash Name
Algorithm Registry" may be okay, but is probably not specific enough.
The description at the start of the section is definitely not good
enough because you're not registering hash algorithms, but names of hash
algorithms and their truncations.

Minor editorial issues:

Introduction: It would be good to have a general reference to hashing
(for security purposes) for people not utterly familiar with the subject.

Intro: After reading the whole document, the structure of the Intro
seems to make some sense, but it didn't on first reading (where it's
actually more important). The main problem I was able to identify was
that after a general outlook in paragraph 1, the Intro drops into a list
of examples without saying what they are good for. I suggest to, after
the sentence "This document specifies standard ways to do that to aid
interoperability.", add a sentence along the lines: "The next few
paragraphs give usage examples for the various ways to include a hash in
a name or identifier as they are defined later in this document.". It
may also make sense to further streamline the following paragraphs, so
that it is clearer which pieces of text refer each to one of the
"standard ways".

There are two instances of the term "binary presentation". Looking
around, it seems that they are supposed to mean the same as "binary
format". Please replace all instances of "binary presentation" with
"binary format" to avoid misunderstandings and useless search time.

Section 3: "A Named Information (ni) URI consists of the following
components:": It would be good to know exactly where the list ended. One
way to do this would be to say "consists of the following nine components".

Section 3: "Note that while the ni names with and without an authority
differ syntactically, both names refer to the same object if the digest
algorithm and value are the same.": What about cases with different
authority? The text seems to apply by transitivity, but this may be easy
to miss for an implementer. I suggest changing to: "Note that while ni
names with and without an authority, and ni names with different
authorities, differ syntactically, they all refer to the same object if
the digest algorithm and value are the same.".

Section 3: "Consequently no special escaping mechanism is required for
the query parameter portion of ni URIs.": Does this mean "no escaping
mechanism at all"? Or "nothing besides %-encoding"? Or something else?
Please clarify.

Figure 3: the "=" characters of the various rules should be aligned as
much as possible to make it easier to scan the productions (see
http://tools.ietf.org/html/rfc3986#appendix-A for an example).

Section 3:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
; directly from RFC 3986, section 2.3
; "authority" and "pct-encoded" are also from RFC 3986
Please don't copy productions. Please don't copy half (or one-third,
actually) of the productions you use, and reference the rest. Please
don't say what productions you copy from where in a comment, and even
less in a comment for an unrelated production. Please before the ABNF,
say which productions are used from another spec.

Section 4:
The HTTP(S) mapping MAY be used in any context where clients without
support for ni URIs are needed without loss of interoperability or
functionality.
This is difficult to understand. If some new functionality is proposed,
it's usually a client *with* the new functionality that's needed, not
one without. Also, the "without loss of interoperability or
functionality" is unclear: Sure if ni isn't supported, there's a loss in
interoperability. So I suggest to rewrite this as:
The HTTP(S) mapping MAY be used in any context where clients with
support for ni URIs are not available.
(but see also the comment in minor technical issues)

Section 6: "binary format name": Why 'name'? Why not just "binary
format"? The latter is completely clear in the context of the document or
together with an indication of the document; for something that can be
used independently, even "binary format name" isn't enough.

Section 6: "suite ID": The word "suite" seems out of place here. In the
general use of the term, it refers to "a group of things forming a unit
or constituting a collection" (see
http://www.merriam-webster.com/dictionary/suite). A good definition that
works for the uses I'm familiar with in digital security would be "An
algorithm suite is a coherent collection of cryptographic algorithms for
performing operations such as signing, encryption, generating message
digests, and so on."
(http://fusesource.com/docs/framework/2.4/security/MsgProtect-SOAP-SpecifyAlgorithmSuite.html;
disclaimer: I'm in no way a SOAP fan). The use here is not for a
collection, but for a single truncated-length variant of a single hash
algorithm. I seriously hope you can find a better name.

Section 6: "Note that a hash value that is truncated to 120 bits will
result in the overall name being a 128-bit value which may be useful
with certain use-cases.": This left me really wondering: Is there
something magic to 128 bits in computer/internet security? What are the
"certain use cases"? Or is this just an example to make sure the reader
got the relationships, and it could have been as well "Note that a hash
value that is truncated to 64 bits will result in the overall name being
a 72-bit value which may be useful with certain use-cases." (or whatever
other value that's registered in section 9)?
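
The arithmetic behind both variants can be sketched in Python as
follows. The sketch assumes a one-octet header that carries the suite
ID, followed by the left-most N bits of a SHA-256 hash; the suite ID
value used is a hypothetical placeholder and not a registry value.

```python
import hashlib


def binary_name(suite_id: int, data: bytes, bits: int) -> bytes:
    """One header octet followed by the left-most `bits` bits of SHA-256."""
    if not 0 <= suite_id <= 255:
        raise ValueError("suite ID must fit in one octet")
    if bits % 8 or not 0 < bits <= 256:
        raise ValueError("this sketch only handles whole-octet truncations")
    digest = hashlib.sha256(data).digest()
    return bytes([suite_id]) + digest[: bits // 8]


# Placeholder only; not taken from the registry in section 9.
HYPOTHETICAL_SUITE_ID = 0x3F

name_120 = binary_name(HYPOTHETICAL_SUITE_ID, b"Hello World!", 120)
name_64 = binary_name(HYPOTHETICAL_SUITE_ID, b"Hello World!", 64)

# Overall name sizes in bits for the two truncation lengths.
print(len(name_120) * 8, len(name_64) * 8)
```

Under these assumptions every truncation length simply yields a name
that is 8 bits longer than the truncated hash, so 128 bits is not
special as far as the format itself is concerned.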

Section 7: Just for the highly unfortunate case that this doesn't
disappear, it would be very helpful if the presentation of this section
paralleled section 3.

Section 7: "contain the ID value as a UTF-8 encoded decimal number": I'm
an internationalization expert with a strong affection for UTF-8, but
even for me, this should be "contain the ID value as an ASCII encoded
decimal number".

Section 9: The registration templates refer to sections. This is fine
for readers of the draft, but not if the template is standalone. I
suggest using a format such as that at
http://tools.ietf.org/html/rfc6068#section-8.1, which in draft stage may
look e.g. like
http://tools.ietf.org/html/draft-duerst-eai-mailto-03#section-8.1.

Section 9.3: "Assignment of Well Known URI prefix ni" and later (and
elsewhere in the draft) "URI suffix": Are we dealing with a prefix or a
suffix here?

Section 9.4: "This registry has five fields, the binary suite ID,...":
Better to remove the word "binary", because the actual number is decimal.

Section 9.4: "The expert SHOULD seek IETF review before approving a
request to mark an entry as "deprecated." Such requests may simply take
the form of a mail to the designated expert (an RFC is not required).
IETF review can be achieved if the designated expert sends a mail to the
IETF discussion list. At least two weeks for comments MUST be allowed
thereafter before the request is approved and actioned.": I'm at a loss
to see why asking the IETF at large is a SHOULD, but if it's done, then
the two weeks period is a MUST.

Section 9.4: The registry initialization in Fig. 8 refers to RFC4055
many times. But RFC 4055 does in no way define SHA-256. It looks like
the actual spec is http://tools.ietf.org/html/rfc4055#ref-SHA2 (National
Institute of Standards and Technology (NIST), FIPS 180-2: Secure Hash
Standard, 1 August 2002.) I think this should be cited, in particular
because there is a "Specification Required" requirement, and this sure
should mean that there is a Specification for the actual algorithm, and
not just a specification that mentions some labels. So using RFC4055 as
a reference could be taken as creating bad precedent.

Section 9.4: "The designated expert is responsible for ensuring that the
document referenced for the hash algorithm is such that it would be
acceptable were the "specification required" rule applied.": Why all
this circumlocution? Why not just say something like: "The designated
expert is responsible for ensuring that the document referenced for the
hash algorithm meets the "specification required" rule."

Nits:

Authors' list: Last time I heard about this, there was a general limit
of 5 authors per RFC. I'm not sure whether this still exists, and what'd
be needed to get around it, but I just wanted to point out that this may
be a potential problem or additional work (hoops to get through).

Intro: "Since, there is no standard" -> "Since there is no standard"

Intro: "for these various purposes" -> "for these purposes" or "for
various purposes" (the indefinite 'various' is incompatible with the
definite 'these').

"2. Hashes are what Count" -> "2. Hashes are what Counts" (the former
may look logically correct, but 'what' requires a singular verb form).

Section 2: "the left-most or most significant in network byte order N
bits from the binary representation of the hash value" -> "the left-most
(or most significant in network byte order) N bits from the binary
representation of the hash value" or "the left-most N bits, or the N
most significant bits in network byte order, from the binary
representation of the hash value" (the current text is virtually
unparsable).

Figure 1: The 0x notation is never explained. A short clause or phrase
is all that would be needed, but it would be better if this were spelled
out.

Section 3, Query Parameter separator: "The query parameter separator
acts a separator between" -> "The query parameter separator acts *as* a
separator between".

Section 3, Query Parameters: "A tag=value list of optional query
parameters as are used with HTTP URLs" -> "A tag=value list of optional
query parameters as used with HTTP URLs" (or "A tag=value list of
optional query parameters as they are used with HTTP URLs").

Section 4: "the object named by the ni URI will be available at the
corresponding HTTP(S) URL" -> "the object named by the ni URI will be
available via the corresponding HTTP(S) URL" (via stresses the point
that this should be done via (sic) redirection)

Section 4: "so there may still be reasons to use" -> "so there can still
be reasons to use" (better to use can because non-normative; the
document otherwise does a good job on this)

Section 10: "Note that fact that" -> "Note the fact that", or much
better: "Note that".

Regards, Martin.
