URL Problem Statement and Directions
IBM
Raleigh
NC
USA
rubys@intertwingly.net
http://intertwingly.net/
Adobe
345 Park Ave
San Jose
CA
95110
USA
masinter@adobe.com
http://larry.masinter.net/
Applications
URI
URL
IRI
This document lays out the problem space of possibly conflicting
standards between multiple organizations for URLs and things like
them, and proposes some actions to resolve the conflicts.
From a user or developer point of view, it makes no sense for there to
be a proliferation of definitions of URL nor for there to be a
proliferation of incompatible implementations. This shouldn't be a
competitive feature. Therefore there is a need for the organizations
involved to update and reconcile the various Internet Drafts,
Recommendations, and Standards in this area.
This document lays out the problem space around
standards for URLs and things like
them, and proposes some actions to resolve the conflicts.
From a user or developer point of view, it makes no sense for there to
be a proliferation of definitions of URL nor for there to be a
proliferation of incompatible implementations. This shouldn't be a
competitive feature. Therefore there is a need for the organizations
involved to update and reconcile the various Internet Drafts,
Recommendations, and Standards in this area.
Possible next steps are discussed in .
Discussions have taken place on public-ietf-w3c@w3.org
(archive) and public-ietf-w3c@w3.org
(archive). In addition, the W3C TAG has discussed these issues in
meetings and on their mailing list.
This document, as well as a test suite, reference implementation, and
are being developed at
, including
an issue tracker, Wiki, and related resources.
Pull requests
for edits to documents or tests are most welcome.
Raising issues in the
GitHub tracker is also helpful.
Comments to the editors or on those mailing lists in email
are also welcome.
This section contains a very compressed history of URL standards,
in sufficient detail to set some context.
REVIEWERS: history is necessarily incomplete, but please
report incorrect or missing essential facts.
The first standards-track specification for URLs was
in 1994. (That spec contains more background material.) It defined URLs
as ASCII only. later separated the generic
syntax from concrete scheme definitions which are defined in
separate RFCs. Many of those scheme definitions turned out not to get
the attention that they needed.
When it became clear that it was desirable to allow non-ASCII
characters, it was widely feared that support for Unicode by ASCII-only
systems would turn out to be problematic. The tack was therefore taken
to leave "URI" alone and define a new protocol element,
"IRI". was published in 2005 (in sync with the
update to the URI definition). This also
turned out not to get the attention it needed.
To address issues raised both in IETF and for HTML5 (see
for more details), the IRI working
group was established in the IETF in 2009. However,
primarily due to lack of engagement, the IRI group was closed in
2014, with the plan that the documents that had been under
development in the IRI working group could be updated as
individual submissions or within the IETF applications area
working group. In particular, one of the IRI working group
items was to update , which is currently
under development in IETF's application area (see ).
Independently, the HTML specifications in the WHATWG and W3C redefined
"URL" in an attempt to match what some of the browsers were doing. This
definition was later moved out into the "URL Living Standard"
.
When W3C produced the HTML5
recommendation, the normative
reference to the WHATWG URL standard was a gating issue, and
an
unusual compromise was reached, where the [URL] reference is given a descriptive
paragraph rather than a single document reference.
The world has moved on in other ways. ICANN has approved non-ASCII top-level
domains, but IDNA specs ( and )
did not fully address IRI processing. Subsequently, the Unicode Consortium
produced , which mentions URL processing in passing.
The web security working group developed
("The Web Origin Concept"), which
was refined in the W3C specification,
which redefines. Updates
in the IETF were abandoned. Work continues in the WHATWG in
the specification.
There are multiple umbrella organizations which have
produced multiple documents, and it's unclear whether
there's a trajectory to make them consistent. This section
tries to enumerate currently active organizations and specs.
REVIEWERS: are there important ongoing activities we've missed
or gotten wrong? Who are the stakeholders whose current
work might be affected? (This input will help determine the
organizational coordination needed.)
Organizations include
the IETF,
the WHATWG,
the W3C,
WebPlatform.org, and
the Unicode Consortium.
Relevant specs under development in each organization include:
has
passed working group last call and entered IESG
review.
New schemes and updates to old ones continue, including
'file:'
and 'urn:'.
The IRI working group closed, but work can continue in the Applications
Area working group. Three drafts (,
, and
), which were
originally intended to obsolete , are now abandoned and in need of
updating.
The URNBis
working group has been working to update the definitions
of URNs, but has difficulty with some of the wording in
. In particular,
updates .
The is being developed as a living
standard. It primarily focuses on specifying what is
important for browsers. The means by which new schemes might
be registered is not yet defined. This work is
based on , and includes an explicit goal of
obsoleting both and .
The Web
Applications Working Group, in conjunction with the
TAG,
has sporadically republished the WHATWG work with no
technical content differences as . There is a
proposal to formalize this
relationship.
The W3C TAG
developed
Best Practices for Fragment Identifiers and Media Type Definitions
, which points out several problems with the definitions
for the 'fragment' part of URLs. The TAG is working to
ensure liaison exchange happens.
Note also the interim solution for the
HTML5 reference to [URL], which should be updated by
the HTML working
group .
WebPlatform.org is an activity sponsored by W3C and web vendors.
is being developed on a
develop
GitHub branch based on . It
currently contains work that has yet to be folded back into the
, primarily to rewrite the parser logic
in a way that is more understandable and approachable. The intent is
to merge this work once it is ready, and to actively work to keep the
two versions in sync.
defines parameterized functions for mapping
domain names. builds upon this work, specifying
particular values to be used for these parameters. The Unicode
Consortium plans to adapt as registries (e.g.
DENIC) move from to
.
This section lays out the problems that we believe need a coordinated
solution. REVIEWERS: have we missed some things? Are any of these
non-problems or not worth solving?
The main problem is conflicting specifications that overlap
but don't match each other.
Additionally, the following are issues that need to be resolved to
make URL processing unambiguous and stable.
Nomenclature: over the years, a number of different sets of
terminology have been used. URL / URI / IRI is not the only difference.
chronicles a number of differences.
Deterministic parsing and transformation: The IRI-to-URI
transformation specified in had
options; it was not a deterministic path. In particular, it left open
which substrings of which URLs were to be transformed to Punycode and
which to percent-encoded UTF-8. The
URI-to-IRI transformation was also heuristic, since there was
no guarantee that %xx-encoded bytes in the URI were actually
meant to be the percent-encoded bytes of a UTF-8 encoding
of a Unicode string.
Parameterization: standards in this area need to define such
matters as normalization forms and values for parameters such as
UseSTD3ASCIIRules.
Interoperability: even after accounting for the above, there
is a demonstrable lack of interoperability across popular libraries
and browsers. identifies a number
of such differences.
Stability: Before any standard document can be marked as obsoleted,
the requirements of other specs that normatively reference the
to-be-obsoleted standard need to be considered, to avoid
dangling references.
IDNA: defines processing for 'IDN-aware
domain name slots' (where "the host portion of the URI in the
src attribute of an HTML <IMG> tag" is given as an
example). Later, "IDNA is applicable to all domain names in all
domain name slots". So in mailto:user@host, is the host an
IDN-aware domain name slot? A domain name slot at all?
Bidi URLs: The problems with writing URLs using characters
from right-to-left languages are well-known among experts; what is not
known is a solution for these problems. The solution given in
has some obvious errors (how to handle
combining marks); its general approach can probably also be improved
on, but it is not clear how.
Specific scheme definitions: some UR* scheme definitions are woefully
out of date, incomplete, or don't correspond to current practice,
but how to update their definitions is unclear. This includes 'file:',
for which there is a current effort, but there are others which
need review (including 'ftp:', 'data:').
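The non-determinism described under "Deterministic parsing and
transformation" above can be illustrated with a minimal sketch. It uses
only the Python standard library as a stand-in and is not the algorithm
of any of the specifications discussed here. The host name
"bücher.example" and the helper name uri_to_iri_guess are hypothetical,
chosen purely for illustration.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical non-ASCII host name, used only for illustration.
host = "bücher.example"

# Option 1: treat the string as a domain name and map it to Punycode.
# Python's built-in "idna" codec implements IDNA 2003; it is used here
# only as a convenient stand-in for "some IDNA mapping".
as_punycode = host.encode("idna").decode("ascii")

# Option 2: percent-encode the UTF-8 bytes of the same string.
as_percent_utf8 = quote(host)

# Both results are ASCII, but they are different strings, so a
# transformation that permits either is not deterministic.
print(as_punycode)
print(as_percent_utf8)


def uri_to_iri_guess(component: str):
    """Hypothetical helper: try to read %xx bytes as UTF-8.

    Returns the decoded string, or None when the bytes are not valid
    UTF-8. Even a successful decode is only a guess, since the
    original bytes may have been produced with another encoding.
    """
    raw = unquote_to_bytes(component)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(uri_to_iri_guess("caf%C3%A9"))  # bytes are valid UTF-8
print(uri_to_iri_guess("caf%E9"))     # a lone 0xE9 byte is not valid UTF-8
```

The second half shows why the reverse direction is heuristic: the
percent-encoded form carries bytes only, with no record of the character
encoding that produced them.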
Many of the problems above require some cross-organizational
collaboration. This section outlines alternatives and possible
next steps, both in terms of documents and possible updates and
also procedural issues.
REVIEWERS: Necessary? Sufficient? What are we missing, what
did we get wrong?
The XML Signature
WG is an example of a joint IETF/W3C Working Group. Perhaps a
joint working group covering the topics of URL and URI could be
formed. Elements of the proposal could
be incorporated into the charter of this new WG, thereby
establishing the WHATWG as a third joint participant in this
activity.
Failing that, it may be desirable to have some organizational
assignment of responsibility in IETF and W3C to working groups in each
organization.
There has been discussion of IETF/W3C liaison getting
involved, with the proposal that the W3C liaison to the IETF
make a formal liaison request, to which the IETF would respond.
Perhaps the liaison request might reference this document.
In IETF, the scope of changes proposed may determine how
IETF consensus can best be obtained. It seems unlikely that
the scope of necessary changes to IETF documents could be
managed through individual submissions. Some opinions have
been that updating and/or obsoleting
would
require a full IETF working group. Unless and until another
group is chartered (perhaps using this document as the Problem
Statement / scope), discussion is occurring in the IETF apps area.
Previous venues for
related topics (
public-iri@w3.org,
uri@w3.org)
are old enough that there is likely poor representation of
important communities, unless a concerted effort is made
to revive them.
In W3C, either W3C WebApps, TAG, HTML or some new activity
might be necessary to manage changes, but the nature of the
group necessary to review depends on the extent of changes
needed.
At the moment, the most reliable way of giving feedback on
this document is to raise or comment on issues in the GitHub
issue list.
At various times, many have called for replacing the IETF
URI standard , or updating it. How to
approach this is controversial, but at a minimum the following
are needed:
Make it clear that ASCII-only URIs (as now defined by
) are not what is mainly used on the web.
Incorporate updates for URN.
Incorporate updates for fragment identifier semantics.
Note terminology issue and resolution.
More controversial is whether this can be done on a
strictly "need-to" basis, or whether the merger of URI from
and IRI from would
result in clearer specifications for implementors.
There is some sentiment to restart the work of updating
by starting again, fixing errors and
integrating errata. However, this path doesn't seem to satisfy the
desire for a single spec that lays out deterministic processing for
URLs and references for browser and operating-system handling
of both.
After ensuring that topics covered in are also covered by a W3C URL
recommendation, mark as obsolete
with a short RFC noting the conditions laid out in this
document.
Coordinate 'file:' syntax in
and , possibly
moving the 'file:' part of URL-LS into a separate
document.
Update to be consistent with
and . This
may involve working to get the other specifications updated,
if only to clarify nomenclature.
Obsolete any previous definition of x-url-encoded.
Change the goals to only obsolete
specifications listed above that are not updated. Presuming that
is updated, explicitly state that
conforming URLs are a proper subset of valid URIs, and further state
that canonical URLs (i.e., the output of the URL parser) not only
round trip, but also are valid URIs.
Update and incorporate (or reference) the content currently
present in , probably as an appendix
to , so that readers will understand
what terms are in use and how they map.
Reconcile how and
handle currently unknown schemes,
update to state that
registration applies to both URIs and URLs, and update
to indicate that
is how you register
schemes.
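The round-trip property mentioned above (that the output of the URL
parser, when parsed and serialized again, comes back unchanged) can be
stated as a simple check. The following is a minimal sketch, assuming
Python's urllib.parse as a stand-in for "the URL parser". It is not the
parser defined by any of the specifications discussed here, and the
function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Parse and re-serialize a URL.

    Stand-in for "the output of the URL parser"; urllib.parse only
    lower-cases the scheme and reassembles the components.
    """
    return urlunsplit(urlsplit(url))


def round_trips(url: str) -> bool:
    """True when canonicalizing the canonical form changes nothing."""
    once = canonicalize(url)
    return canonicalize(once) == once


print(canonicalize("HTTP://example.com/a?b#c"))
print(round_trips("HTTP://example.com/a?b#c"))
```

A test suite could apply a check of this shape to every test input.
Verifying that each canonical form is also a valid URI would need a
separate validator, which this sketch does not attempt.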
Have the W3C adopt .
Other than keeping on top of and responding
to any feedback that may be provided, no changes to any Unicode
Consortium product are required.
Helpful comments and improvements to this document have
come from
Anne van Kesteren,
Bjoern Hoehrmann,
Graham Klyne,
Julian Reschke,
and Martin Duerst.
This memo currently includes no request to IANA,
although an updated
might add some additional requirements and information to
the IANA
URI scheme registry to make clear that the
schemes serve as URL schemes and IRI schemes as well as
URI schemes.
In addition to the security exposures created when URLs work
differently in different systems, all of the security considerations
defined in , ,
, and apply to URLs.
Guidelines and Registration Procedures for New URI Schemes
Microsoft
AT&T Laboratories
Google
Adobe
Cross-Origin Resource Sharing
Fetch Living Standard
Internationalized Resource Identifiers (IRIs)
Aoyama Gakuin University
Unicode Consortium
Adobe
Comparison, Equivalence and Canonicalization of
Internationalized Resource Identifiers
Adobe
Aoyama Gakuin University
Guidelines for Internationalized Resource Identifiers with
Bi-directional Characters (Bidi IRIs)
Aoyama Gakuin University
Adobe
Diwan Software Limited
The file URI Scheme
QUT
How many ways can you slice a URL and name the pieces?
Mozilla
URL Living Standard
Unicode IDNA Compatibility Processing
URL Working Draft
URL test results
IBM
URL WorkMode
IBM
URL Standard