Re: [apps-discuss] What does it mean? (Re: Scope of RFC3986 and successor - what is a URI?)

Dave Cridland <dave@cridland.net> Fri, 16 January 2015 13:57 UTC

In-Reply-To: <54B909F4.4020908@intertwingly.net>
References: <54B18B61.8010308@seantek.com> <54B19435.8070401@intertwingly.net> <54B1B211.3050807@seantek.com> <54B1B682.3070609@intertwingly.net> <012001d02d91$6ec42300$4001a8c0@gateway.2wire.net> <54B2781C.4040505@intertwingly.net> <018e01d02dc6$1d03b0a0$4001a8c0@gateway.2wire.net> <54B2CC75.5080900@intertwingly.net> <54B79930.3070009@ninebynine.org> <54B7AEC2.9010109@intertwingly.net> <20150116033032.GD2350@localhost> <DM2PR0201MB096082B3915B85F60EDB617DC34F0@DM2PR0201MB0960.namprd02.prod.outlook.com> <54B8F74A.70601@intertwingly.net> <f5bwq4nexmp.fsf@troutbeck.inf.ed.ac.uk> <54B909F4.4020908@intertwingly.net>
Date: Fri, 16 Jan 2015 13:57:39 +0000
Message-ID: <CAKHUCzx0Ci=UrMCy6iFSq5qX-fEHHdaVVUHfnwGLs1P3HKbsVw@mail.gmail.com>
From: Dave Cridland <dave@cridland.net>
To: Sam Ruby <rubys@intertwingly.net>
Content-Type: multipart/alternative; boundary="001a1139bcda0da1b6050cc55bc4"
Archived-At: <http://mailarchive.ietf.org/arch/msg/apps-discuss/5KF7RA8xcG9k8Ud96bY_Ns52Cz8>
Cc: "apps-discuss@ietf.org" <apps-discuss@ietf.org>, Larry Masinter <masinter@adobe.com>
Subject: Re: [apps-discuss] What does it mean? (Re: Scope of RFC3986 and successor - what is a URI?)

A couple of points. I shall use the terms "URI" and "IRI" as defined in the
IETF specs, and W_URL to mean the W3C/WHATWG/etc thing.

On 16 January 2015 at 12:54, Sam Ruby <rubys@intertwingly.net> wrote:

> The WHATWG URL specification is not limited to browsers.  It defines a
> syntax that is fit for human production and consumption that is intended to
> be consistently parsed by a variety of parsers in a variety of contexts.
> Every modern programming language has a URI or URL parse mechanism.
>
>
Yes, but it is only concerned with parsers.

The IETF is mostly concerned with protocols.

Taking the example of LDAP, this relies on schemas where a URI is
transported in an IA5String, which is ASCII-only (loosely, it's actually
7-bit ASCII supersets).

Allowing non-ASCII breaks this, no matter how we wave magic wands about
parsers - the parser simply never gets to see the URI, because the BER
decoder (or whatever) throws an error.
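
To make that concrete, here is a minimal Python sketch. The helper name
fits_ia5string is made up for illustration, and the check is a
simplification: it only tests that every code point fits in 7 bits, which is
the constraint that matters here, not everything a real BER decoder enforces.

```python
def fits_ia5string(value: str) -> bool:
    """Hypothetical check: True if every code point fits in 7 bits."""
    return all(ord(ch) < 0x80 for ch in value)


# A URI (already mapped to ASCII) and the corresponding IRI.
uri = "http://xn--caf-dma.im/t%C5%B7/"
iri = "http://café.im/tŷ/"

print(fits_ia5string(uri))  # True
print(fits_ia5string(iri))  # False
```

The second value is rejected before any URI parser is involved, which is the
point: the transport encoding fails first.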


> Minor is in the eye of the beholder.  What I am gathering is that changing
> RFC 3986 -- even in ways that only break things "on paper" -- is
> controversial.
>
>
I suspect that there are at least two different change types possible.

A) One is to change reserved "ASCII codepoints" within particular
components of an IRI. I think this is pretty reasonable to explore, and is
the kind of thing that your work with examining the behaviour of parsers is
excellent for. The output of your work essentially informs us on the actual
levels of risk involved, and allows us to consider "running code", which,
with something as widespread in deployment as this, is tremendously
important.

B) The other is to consider changes to the range of codepoints allowed in a
URI (or IRI) as a whole. For a URI, this is highly controversial, and I
don't believe it is possible. That said, we clearly do have the IRI
specifications (and their bis drafts), and if there is interest from the
W3C I think we could find the energy to complete that work. Changing the
range within an IRI would, I suspect, be similarly controversial. Even if
every parser in the world copes fine, it doesn't matter: the protocols that
carry the identifier would still have to cope.

Examples of (A) include the "[]" in query strings, "#" in fragments, and so
on. These I'm personally open to in principle.

(B) includes things like allowing non-ASCII in URI fragments (as opposed to
IRI fragments). It would also include allowing spaces, line feeds, and so
on.
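
As a rough illustration of the split between (A) and (B), here is a minimal
Python sketch. The helper name outside_uri_repertoire is an assumption for
this example; the character sets are the unreserved, gen-delims and
sub-delims productions from RFC 3986, plus "%" for percent-encoded octets.

```python
import string

# Characters RFC 3986 allows somewhere in a URI: unreserved, gen-delims,
# sub-delims, plus "%" to introduce percent-encoded octets.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
URI_REPERTOIRE = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}


def outside_uri_repertoire(value: str) -> list:
    """Hypothetical helper: characters that no URI component may carry."""
    return [ch for ch in value if ch not in URI_REPERTOIRE]


# "[]" in a query: every character is in the repertoire, so the only open
# question is whether it may appear in that component, i.e. type (A).
print(outside_uri_repertoire("http://example.com/?a[]=1"))  # []

# A space and a non-ASCII letter fall outside the repertoire, i.e. type (B).
print(outside_uri_repertoire("http://example.com/a b#caf\u00e9"))  # [' ', 'é']
```

An empty result does not mean the string is a valid URI; it only means that
any remaining disagreement is about which component a character may appear
in, which is the (A) kind of change.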


> I welcome everybody to participate in the discussion as to what W_URLs map
> to.
>
> I welcome discussion as to whether that mapping produces valid I_URIs.
>
> As to whether the changes necessary to the definition of I_URIs turn out
> to be 'minor' or the end result being that W_URLs map to something that is
> neither a proper superset or a proper subset of valid I_URIs is only
> something we will know after those discussions take place.
>
>> I _think_ there's an implication here that I at least had not been
>> aware of in your previous references to "new URL(x).href".
>>
>> Is it in fact the case that the value of this expression (also I think
>> what you've been calling the 'stringification' of a W_URL), _is_
>> exactly, byte for byte, what is put on the wire in HTTP requests by
>> browsers?
>>
>
> That is my understanding.  To put it another way: if it is not, that would
> be considered a bug.
>
> Note that stringification produces a string of Unicode code points.  As
> all of those code points -- with the notable exception of fragments -- are
> a proper subset of US ASCII graphic characters, the mapping of such to
> bytes is straightforward.
>
> At a minimum, the WHATWG/W3C specifications should be updated to make this
> clear.
>

How would you feel about the possibility of having an API that looked
somewhat like this:

Given:

u = new URL(S) # Accepts strictly valid, and certain invalid, IRIs.

u.href # "cleaned" valid IRI string, with any syntax cleaned according to rules.

u.canonical # "canonical" IRI string.

u.toASCII() # "cleaned" URI string, for use in HTTP, LDAP, etc.

So with S as http://café.im:80/tŷ/%2e

u.href => http://café.im:80/tŷ/ (although variants allowed)
u.canonical => http://café.im/ty
u.toASCII() => http://xn--caf-dma.im/t%C5%B7
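
For what it's worth, the toASCII() step can be approximated with the Python
standard library. This is only a sketch under stated assumptions: to_ascii
and DEFAULT_PORTS are names invented here, the host mapping uses Python's
built-in "idna" codec (IDNA 2003 rules), userinfo is ignored, and no
dot-segment cleaning is done, so the "%2e" segment survives where the
example above drops it.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumed default ports; only used to decide whether to drop the port.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Characters left untouched when percent-encoding; "%" keeps existing escapes.
SAFE = "/%:@!$&'()*+,;=?"


def to_ascii(iri: str) -> str:
    """Hypothetical helper: map an IRI to an ASCII-only URI string."""
    parts = urlsplit(iri)
    # Host: IDNA ToASCII via the built-in codec (userinfo is not handled).
    host = (parts.hostname or "").encode("idna").decode("ascii")
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"
    # Percent-encode non-ASCII characters as UTF-8 octets.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(to_ascii("http://café.im:80/tŷ/%2e"))
# http://xn--caf-dma.im/t%C5%B7/%2e
```

The result is plain ASCII, so it could be carried in HTTP or in an LDAP
IA5String; the "cleaned" and "canonical" forms would need further rules on
top of this.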

How hard would it be to persuade parser implementors to do this?

Dave.