Re: [urn] URNs and URI <fragment>

Juha Hakala <juha.hakala@helsinki.fi> Fri, 05 August 2011 05:48 UTC

Message-ID: <4E3B8416.2010307@helsinki.fi>
Date: Fri, 05 Aug 2011 08:48:06 +0300
From: Juha Hakala <juha.hakala@helsinki.fi>
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
To: Mykyta Yevstifeyev <evnikita2@gmail.com>
References: <4E3A37FE.7060609@helsinki.fi> <4E3A4947.1070209@gmx.de> <4E3A5B9A.9060908@gmail.com> <4E3A7ABA.7060500@helsinki.fi> <4E3B6AA6.80901@gmail.com>
In-Reply-To: <4E3B6AA6.80901@gmail.com>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: urn@ietf.org, Stella Griffiths <stella@isbn-international.org>
Subject: Re: [urn] URNs and URI <fragment>

Hello Mykyta; all,

This debate has much to do with how identifiers - and in this case 
particularly ISBN and NBN - are and will be applied in practice, and how 
that relates to URN usage in namespaces. If the aim is to build reliable 
and persistent URN resolution services, we need identifier communities 
with well defined professional practices. The trick is to apply these 
practices in the Internet in a fruitful manner, taking into account the 
requirements of the new environment. If all goes well, both the 
identifier and the IETF communities may learn something in this process.

Since Mykyta brought up many important points about identifiers and
identification, I will discuss them at length.

Mykyta Yevstifeyev wrote:

>> The reason why URN:ISBN namespace does not allow fragments has to do 
>> with ISBN syntax. This is OK:
>>
>> http://urn.fi/URN:ISBN:978-952-10-7060-0
>>
>> but I am not allowed to add anything to the namespace specific string 
>> even if the media type in which this book is published would allow 
>> that. Of course, adding <query> would be OK since it would not be part 
>> of the identifier.
> 
> Neither would the fragment ID, though.

That assumption is not correct. According to RFC2141bis, the fragment ID
- if we allow its use in the URN syntax - will be part of the
identifier. An identifier which consists of an ISBN and a fragment no
longer identifies the book as a whole but a component part of it. An
ISBN with <query>, by contrast, still identifies the book as a whole;
the role of the <query> is to specify the resolution service the user
wants.
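The distinction can be sketched in a few lines of Python. This is only
an illustration under assumptions: the names split_urn and
identified_thing are made up for this sketch, and the "?s=I2L" query
syntax is hypothetical, since no query syntax for URNs has been agreed.

```python
from typing import NamedTuple


class UrnParts(NamedTuple):
    nid: str       # namespace identifier, e.g. "isbn"
    nss: str       # namespace specific string, e.g. the ISBN itself
    query: str     # resolution service request; not part of the identifier
    fragment: str  # would narrow the identifier to a component part


def split_urn(urn: str) -> UrnParts:
    """Split a URN into NID, NSS, query and fragment (illustrative only)."""
    rest, _, fragment = urn.partition("#")
    rest, _, query = rest.partition("?")
    scheme, nid, nss = rest.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    return UrnParts(nid.lower(), nss, query, fragment)


def identified_thing(urn: str) -> tuple:
    """What the string identifies: the query is ignored, the fragment is not."""
    parts = split_urn(urn)
    return (parts.nid, parts.nss, parts.fragment)
```

With this model, "URN:ISBN:978-952-10-7060-0" and the same string with
a query appended identify the same book, while appending "#chapter2"
yields a different identifier.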
> 
>>
>> The URN:ISBN string itself does not imply any media type directly, but 
>> ISBN assignment rules make it clear that if the media type changes, a 
>> new ISBN must be assigned. The ISBN above belongs to a PDF version of 
>> a dissertation; if it is migrated to EPUB 3, a new ISBN is required.
> 
> But you should consider that an ISBN may also be assigned to a printed 
> version, which may also be transformed to electronic form.  In this 
> case, even being the same book, the format will be different, and most 
> likely different from the one used to produce the printed copy.  Then 
> such scanned copies may be entered in the URN resolution system, and 
> the URN with the ISBN will be resolved to one of such copies.

It should not, at least not automatically. When a book (with or without
an ISBN) is scanned and published on the Web (many national libraries
are busy with this kind of activity), the resulting digital book shall
not get an ISBN at all (according to the well-established rules of the
community). So we will end up with a URN:ISBN for the printed version
and URN:NBNs for digitized versions. If the publisher has created
"official" digital versions, then those will have URN:ISBNs as well.

It is very important that via URN resolution services we can keep track
of all the manifestations of a book. It is also important to give the
users the possibility to choose between them, because there is no way of
knowing in advance which manifestation will best fit the needs of the
user. He might not know that himself, without some orientation. Some
people may prefer a modern version, which after many successive
migrations may be quite different from the original. And some may want
the original version, even though that means digital archaeology is
needed to actually read the document.

Interlinking between manifestations can be achieved via well-planned
usage of metadata. All records describing any manifestation of the book
should contain links to all the other versions. In the future our
catalogues will contain descriptions of immaterial works such as
Shakespeare's Hamlet, and these descriptions will contain links to the
manifestation-level records, which in turn will contain the actual URN
links.
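A minimal sketch of such interlinking, assuming a simple in-memory
model. The class names Work and Manifestation and the method related_to
are made up for illustration and do not correspond to any catalogue
format.

```python
from dataclasses import dataclass, field


@dataclass
class Manifestation:
    urn: str          # the actual URN link of this manifestation
    description: str  # free-text note, e.g. "printed", "scanned copy"


@dataclass
class Work:
    title: str
    manifestations: list = field(default_factory=list)

    def related_to(self, urn: str) -> list:
        """Return the URNs of all other manifestations of the same work."""
        urns = [m.urn for m in self.manifestations]
        if urn not in urns:
            raise KeyError(urn)
        return [u for u in urns if u != urn]
```

Given a work record holding a printed version, a publisher's PDF and a
library scan, related_to() on any one of their URNs lists the other
two, which is the choice the user should be offered.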

This may sound a little bit complicated, but when the aim is to provide 
access to digital resources for centuries, simple solutions will not work.

As an aside, the current resolution services document (RFC 2483) does
not specify a service for finding all related manifestations of the
resource. In the late 90s this was not deemed necessary; now even Amazon
provides it. This indicates, at least to me, that we cannot specify a
fixed list of resolution services, since the services that are valid at
any given time depend on the technical infrastructure. It follows that
it is a bad idea to carve the URN resolution services in stone (specify
them in a standards track RFC). Instead we need a flexible mechanism
such as an IANA registry where new services can be registered easily. I
will start writing a private contribution I-D today to provide a basis
for this.
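To illustrate the difference between a fixed list and a registry, here
is a rough sketch. It is not a proposal for the registry itself: the
class ServiceRegistry is made up for this example, and any service name
other than those in RFC 2483 (such as I2L) would be hypothetical.

```python
from typing import Callable, Dict

# A handler takes a URN string and returns whatever the service produces.
ResolutionHandler = Callable[[str], object]


class ServiceRegistry:
    """Open-ended mapping from service names to handlers."""

    def __init__(self) -> None:
        self._services: Dict[str, ResolutionHandler] = {}

    def register(self, name: str, handler: ResolutionHandler) -> None:
        # New services can be added at any time; names must be unique.
        if name in self._services:
            raise ValueError(f"service already registered: {name}")
        self._services[name] = handler

    def resolve(self, name: str, urn: str) -> object:
        try:
            handler = self._services[name]
        except KeyError:
            raise LookupError(f"unknown resolution service: {name}") from None
        return handler(urn)

    def names(self) -> list:
        return sorted(self._services)
```

For example, registry.register("I2L", lambda urn: "http://urn.fi/" + urn)
adds the existing URN-to-URL service, and a "related manifestations"
service could be registered later without touching the code that
dispatches requests.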
> 
> I also don't actually think that when some person transforms the text of 
> the book into the format he/she likes, he/she will request a new ISBN.  
> ISBNs identify contents, unlike fragment identifiers, which concretize 
> contents in terms of media type.

ISBN does not identify intellectual content (ISTC does; see 
http://www.istc-international.org/html/). ISBN identifies a particular 
manifestation of the content, such as a paperback or a PDF version of a 
book.

If somebody digitizes a book, it is not possible to acquire an ISBN for
the resulting book. Another identifier, such as URN:NBN, must be used
instead.  When digitized books are catalogued, it is common practice
to state that there is a printed version and provide its identifier
(ISBN or NBN).

> One of the assumptions of URN namespaces is global uniqueness.  From RFC 
> 1737:
> 
>>     o  Global uniqueness: The same URN will never be assigned to two
>>        different resources.
> 
> and therefore I don't understand URNs for NBNs with such identifier 
> possible to be assigned twice or more to the same resource.  If the work 
> doesn't change, the URN must be stable.

As I said above, ISBN identifies manifestations. The PDF version of the
book gets one ISBN, the EPUB 3 version gets another one, and so on. See

http://www.isbn-international.org/pages/media/101118%20Guidelines%20for%20the%20assignment%20of%20ISBNs%20to%20ebooks.pdf

for details. The raison d'etre for ISBN assignment is derived from the
book trade; anything that is for sale as a separate item must have an
ISBN, so that the particular manifestation can be told apart from other
manifestations.

Each time a URN namespace is established, the identifier community
brings in its own traditions. In some cases the community is not well
formed; for example, I have no idea what best practices the URN:IRI
community as a whole would have for assigning identifiers. AFAIK
well-defined namespaces such as URN:ISBN will have the best chances of
actually preserving the resources and resolution services in the long
term.

There is a namespace where your view is correct: in the URN:ISTC
namespace the URN will never change as long as the work remains the
same. When Hamlet is translated into Finnish the translation will of
course get a new ISTC, but there will be a link to the English original
(and to translations in other languages, such as Russian and Ukrainian).
>
>> Independently of any physical manifestations libraries will also 
>> describe and identify works, but then we do not use ISBN but ISTC 
>> (International standard text code) for textual resources. And 
>> registering a namespace for that identifier (which has just recently 
>> gone into production) remains to be done.
> 
> NBN namespace, with what you've said, breaks the aforementioned 
> assumption for URNs; different manifestations of a similar work 
> shouldn't have different URNs, I'll repeat.  So I suppose the case with 
> NBNs isn't generally applicable to the URNs.  (It also breaks the 
> "persistence" principle, which means that one URN will identify the 
> resource forever.  With NBNs, which may change depending on the version 
> of the resource and its format, it isn't possible to persistently refer 
> to the "newest version".)

To conclude, there will be namespaces where the identifier identifies
works (such as ISTC, ISWC, ISAN) and namespaces where manifestations are
in focus (ISBN, NBN). From the URN point of view, both approaches must
be correct if the URN system is to accommodate these namespaces. Of
course I am assuming here that the IETF would approve an ISTC namespace
registration request if the ISTC community produces one.

>> My take on this is that if we have e.g. a book in a file format that 
>> allows specification of fragments, then it is possible that these 
>> fragments can be accessed directly using HTTP URIs, and improving the 
>> persistence of these access points with URNs may make sense.
> 
> And I'll ask the same question - how do you determine the format a 
> particular URN may theoretically be resolved to?

The answer to this will depend on the namespace.

With ISBN, there is always a single manifestation of the book that 
should be retrieved (if the user wants a book). For digital content, 
this approach works for a few decades at most. After a couple of 
centuries it gets a little difficult to carry on like this :-). Please 
note that the time scale for the national libraries really is centuries. 
We may have a slightly different notion of persistence than e.g. most
web developers.

The solution is to provide links between manifestations of the work. 
This will allow the user to pick the one he prefers.

In the URN:ISTC namespace, there is no single manifestation the
identifier should resolve to. It is up to the user to decide what to do.
Having started from the English original of Hamlet he may end up
requesting a digital version of the great Russian Hamlet movie from the
50s.
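The namespace-dependent behaviour could be sketched as follows. This is
only an illustration: the function resolution_targets and the two sets
are made up here, and the grouping covers identifier systems mentioned
in this message whether or not a URN namespace has been registered for
them.

```python
# Identifier systems grouped by what they identify (illustrative sets).
MANIFESTATION_LEVEL = {"isbn", "nbn"}
WORK_LEVEL = {"istc", "iswc", "isan"}


def resolution_targets(nid: str, known_manifestations: list) -> list:
    """Return what a resolver would offer for an identifier in namespace nid."""
    nid = nid.lower()
    if nid in MANIFESTATION_LEVEL:
        # One identifier, one manifestation; anything else is a data error.
        if len(known_manifestations) != 1:
            raise ValueError("a manifestation-level URN should match exactly one item")
        return list(known_manifestations)
    if nid in WORK_LEVEL:
        # No single target: the user chooses among all manifestations.
        return list(known_manifestations)
    raise LookupError(f"unknown namespace: {nid}")
```

A resolver for a manifestation-level namespace can still offer the
other manifestations, but only by following the metadata links between
records, not by redefining what the identifier itself resolves to.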

Best regards,

Juha
> 
>>
>> In case of multimedia documents (with a multitude of file formats), 
>> URN:NBNs must be assigned at the file format level if the intention is 
>> to use fragments in one or more of these formats.
>>>
>>> (BTW: Fragment IDs, per RFC 3986, are allowed to be present in any 
>>> URI, including URN; however, they cannot be effectively handled in 
>>> the latter case.  I think RFC 2141bis should be clear that this part 
>>> of an URI, if present, should be ignored).
>>>
>>> (BTW2: Is new revision of 2141bis going to be published soon?)
>>
>> Alfred Hoenes indicated before his summer vacation that he plans to 
>> concentrate on URNBIS work during the first half of August.
> 
> Thanks for info.
> 
> Mykyta
> 
>>
>> Best regards,
>>
>> Juha
>>>
>>> Mykyta Yevstifeyev
>>>
>>>>
>>>> Best regards, Julian
>>>> _______________________________________________
>>>> urn mailing list
>>>> urn@ietf.org
>>>> https://www.ietf.org/mailman/listinfo/urn
>>>>
>>>
>>
> 
> 

-- 

  Juha Hakala
  Senior advisor, standardisation and IT

  The National Library of Finland
  P.O.Box 15 (Unioninkatu 36, room 503), FIN-00014 Helsinki University
  Email juha.hakala@helsinki.fi, tel +358 50 382 7678