Comments on content sniffing algorithm draft-abarth-mime-sniff-03

David Booth <david@dbooth.org> Thu, 21 January 2010 00:54 UTC

Subject: Comments on content sniffing algorithm draft-abarth-mime-sniff-03
From: David Booth <david@dbooth.org>
To: apps-discuss <apps-discuss@ietf.org>
Content-Type: text/plain; charset="UTF-8"
Date: Wed, 20 Jan 2010 19:53:56 -0500
Message-ID: <1264035236.23097.29453.camel@dbooth-laptop>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Precedence: list

Some comments on
http://tools.ietf.org/html/draft-abarth-mime-sniff-03

1. This point is not a criticism of the sniffing algorithm proposed, but
rather a comment on the way that the problem is described.  I don't have
a specific suggestion for rewording, so perhaps you should just take
this first comment as food for thought.

It bothers me to see HTML being called a "high-privilege media
type" . . . "(and thus privileged to execute any scripts contained
therein)".  It isn't the basic HTML that is dangerous; it is the
JavaScript embedded in the HTML that is dangerous, and the same goes for
Flash, ActiveX or any other scripting language that may be embedded.
Basic HTML is relatively safe.

HTML is really just embedded in text, just as JavaScript is embedded in
HTML, yet we don't think of plain text as a high-privilege media type
because our content types distinguish plain text from text that "embeds"
HTML.  But they do not distinguish plain HTML from HTML that embeds
JavaScript or other scripting languages.  This forces us to paint plain
HTML with the same security brush as we paint JavaScript, and this seems
wrong.  

2. Section 2 says "The algorithm for extracting an encoding from a
Content-Type, given a string s, is as follows."  But what exactly is
string s?  Where is s bound?  Is s the Content-Type?
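
One way to make the binding explicit is to state s as a parameter of the
algorithm.  A minimal sketch in Python, assuming s is the raw value of
the Content-Type header; the name extract_encoding and the regular
expression are hypothetical simplifications, not the draft's actual
steps:

```python
import re
from typing import Optional

# Simplified pattern (an assumption, not the draft's algorithm): accepts
# charset="value" or charset=value, case-insensitively.
_CHARSET_RE = re.compile(r'charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))', re.IGNORECASE)

def extract_encoding(s: str) -> Optional[str]:
    """Return the charset named in s, the raw Content-Type header value."""
    match = _CHARSET_RE.search(s)
    if match is None:
        return None
    quoted, bare = match.groups()
    return quoted if quoted is not None else bare

print(extract_encoding('text/plain; charset="UTF-8"'))
```

Here s is bound at the call site, so there is no question of where it
comes from.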

3. Section 3.1 says "the last step in this set of steps".  I think it
would be slightly clearer to say "step 9", though this is perhaps a
minor stylistic issue.

4. There are several uses of the word "resource" that should be "entity
body", as this is the term used in RFC 2616 section 14.17:
http://tools.ietf.org/html/rfc2616#section-14.17

5. Section 3.3 says "the last such header has bytes that exactly match".
I suggest changing the word "has" to be more specific, as "has" often
means "contains", and I do not think "contains" is what you meant.  (For
example, one might say "File x has the word 'Foo' in it".)
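
The difference between the two readings is easy to show.  A small
Python illustration; the header value and candidate bytes are assumed
examples chosen for the illustration, not values quoted from Section
3.3:

```python
# Assumed example values, not taken from the draft.
header = b"text/plain; charset=UTF-8"
candidate = b"text/plain"

contains = candidate in header   # the "contains" reading of "has"
exact = header == candidate      # the "exactly match" reading

print(contains, exact)
```

The two readings give different answers for the same header, which is
why a more specific word than "has" would help.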

6. Section 3 defines the term /sniffed type/, which is either sniffed or
the /official type/.  This is a little misleading.  I suggest
distinguishing three terms: /sniffed type/, which really is the sniffed
type; /official type/, as already defined; and /effective type/, which
is determined by your algorithm based on either /sniffed type/
or /official type/.
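
A minimal sketch of the three-term distinction in Python.  The function
name, the sniffing_allowed flag and the selection rule are hypothetical
placeholders used only to show how the terms relate; they are not the
draft's actual decision procedure:

```python
from typing import Optional

def effective_type(official_type: Optional[str],
                   sniffed_type: Optional[str],
                   sniffing_allowed: bool) -> Optional[str]:
    """Pick the type the user agent acts on (placeholder rule)."""
    # sniffed_type is what byte inspection produced, and nothing else.
    if sniffing_allowed and sniffed_type is not None:
        return sniffed_type
    # Otherwise fall back to the type declared by the server.
    return official_type

print(effective_type("text/plain", "image/png", sniffing_allowed=True))
print(effective_type("text/html", "image/png", sniffing_allowed=False))
```

With the terms separated this way, /sniffed type/ never silently means
/official type/.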

7. Section 3.4 says "jump to the unknown type step below", but it is not
clear what step you mean.  Since the steps are numbered, it would be
better to give the target step number instead of "the unknown type step
below".  [After reading further]  Oh, it looks like you may have meant
to refer to Section 5.

8. Section 4.2 says "already available".  After how much time or after
what event?

9. Section 4.4 mentions "binary data bytes".  Where is this term
defined?  I.e., how exactly are binary data bytes identified?



-- 
David Booth, Ph.D.
Cleveland Clinic (contractor)

Opinions expressed herein are those of the author and do not necessarily
reflect those of Cleveland Clinic.
