idnits 2.17.1 

draft-pwid-urn-specification-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  == There are 1 instance of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (September 5, 2019) is 1694 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet Engineering Task Force                           E. Zierau, Ed.
3	Internet-Draft                                      Royal Danish Library
4	Intended status: Informational                         September 5, 2019
5	Expires: March 8, 2020

7	            A Persistent Web IDentifier (PWID) URN Namespace
8	                    draft-pwid-urn-specification-09

10	Abstract

12	   This document specifies a Uniform Resource Name (URN) for Persistent
13	   Web IDentifiers for web material in web archives using the 'pwid'
14	   namespace identifier.

16	   The main purpose of the standard is to support specification of
17	   references that are not covered by other reference techniques: to
18	   support references to material in web archives with restricted
19	   access.  Furthermore, it supports persistent technology agnostic
20	   references to web archives in general, in a form that can work as an
21	   algorithmic basis for finding web archive resources in general.  An
22	   additional important benefit is that the standard can be used for
23	   specifying web collections, which can then form a persistent
24	   computational basis for the extract of the archived collection parts.

26	   The PWID URN is designed to meet requirements for proper referencing
27	   needed by researchers.  Therefore, it is designed as general, global,
28	   sustainable, humanly readable, technology agnostic, persistent and
29	   precise web references for web materials in web archives.

31	Status of This Memo

33	   This Internet-Draft is submitted in full conformance with the
34	   provisions of BCP 78 and BCP 79.

36	   Internet-Drafts are working documents of the Internet Engineering
37	   Task Force (IETF).  Note that other groups may also distribute
38	   working documents as Internet-Drafts.  The list of current Internet-
39	   Drafts is at https://datatracker.ietf.org/drafts/current/.

41	   Internet-Drafts are draft documents valid for a maximum of six months
42	   and may be updated, replaced, or obsoleted by other documents at any
43	   time.  It is inappropriate to use Internet-Drafts as reference
44	   material or to cite them other than as "work in progress."

46	   This Internet-Draft will expire on March 8, 2020.

48	Copyright Notice

50	   Copyright (c) 2019 IETF Trust and the persons identified as the
51	   document authors.  All rights reserved.

53	   This document is subject to BCP 78 and the IETF Trust's Legal
54	   Provisions Relating to IETF Documents
55	   (https://trustee.ietf.org/license-info) in effect on the date of
56	   publication of this document.  Please review these documents
57	   carefully, as they describe your rights and restrictions with respect
58	   to this document.  Code Components extracted from this document must
59	   include Simplified BSD License text as described in Section 4.e of
60	   the Trust Legal Provisions and are provided without warranty as
61	   described in the Simplified BSD License.

63	Table of Contents

65	   1.  Introduction
66	     1.1.  Requirements Language
67	   2.  Namespace Registration Template
68	   3.  Acknowledgements
69	   4.  References
70	     4.1.  Normative References
71	     4.2.  Informative References
72	   Author's Address

74	1.  Introduction

76	   The PWID URN is a supplement to existing reference standards, where
77	   the PWID URN will support references to web archives, including areas
78	   that are not supported today: support of references to material in
79	   web archives with restricted access.  Furthermore, the PWID URN
80	   enables technology agnostic references to web archives in general,
81	   which can be needed, for instance for references to dynamic web
82	   material with frequent updates (e.g. a news site) or a specific
83	   version of a web material (e.g. specific version of the DOI
84	   handbook).

86	   The PWID URN is in a form which can work as an algorithmic basis for
87	   finding the resource.  This also enables computation of archived web
88	   parts to a collection from one or more web archives, if the
89	   collection parts are specified by PWID URNs.

91	   Furthermore, the PWID URN includes information about the resource
92	   which makes it possible to find alternative resources, in cases where
93	   the original precise resource has become unavailable.

95	   The PWID URN is designed to be a persistent reference that is
96	   general, global and technology agnostic in order to enhance its
97	   chances of being sustainable.  Furthermore, it is designed to be
98	   humanly readable and with an ability to specify precision about what
99	   the referenced web archive resource covers.  This design enables a
100	   PWID URN to:

102	   o  be used in technical solutions, e.g. to make them resolvable

104	   o  cover references to materials from all sorts of web archives

106	   The motivation for defining a PWID namespace is the growing
107	   challenges of references to archived web resources, and the PWID as a
108	   URN can assist in overcoming a lot of these challenges.  The standard
109	   is needed to address web materials meeting precision and persistency
110	   issues on par with precision in traditional references for analogue
111	   material.  Furthermore, it is needed in order to address web archive
112	   resources that are not freely available online.  The PWID URN covers
113	   both referencing of web resources from research papers and definition
114	   of web collections/corpora.  In detail the challenges are:

116	   o  Persistent Identifier systems (like DOI [DOI]) will only cover
117	      registered resources.  In general, citation guidelines do not
118	      cover general and persistent referencing techniques for web
119	      resources that are not registered.  However, an increasing number
120	      of references point to resources that only exist on the web, e.g.
121	      blogs that turn out to have a historical impact.  In order to
122	      obtain persistency for a reference, the target needs to be stable.
123	      For non-registered web resources, the common rule is that the
124	      resource will change, since the live-web is constantly changing.
125	      Persistency can only be obtained by referring to something stable,
126	      i.e. an archived version of the resource from the web.  The PWID
127	      URN is therefore focused on referencing archived web material in a
128	      technology agnostic way (research documented in [IPRES2016] and
129	      [ResawRef]).

131	   o  References to materials that only exist in web archives (i.e. no
132	      longer on the live web) are not well supported, especially not for
133	      materials that only exist in archives with restricted access.
134	      There are many new initiatives for web archive referencing, most
135	      of which are centralized solutions offering harvesting and
136	      referencing, but these cannot be used for materials that only
137	      exist in web archives.  The PWID URN can be used for all web
138	      archives, including web archives with restricted access.

140	   o  One of the referencing initiatives for open web archives uses URLs
141	      which depend on the current setup of the web archive's access
142	      platform.  These URLs are usually technology and placement
143	      dependent, and therefore such a reference style is not suited for
144	      references that are important to retrace for a long period.  The
145	      PWID URN can be used for such reference purposes, since it is
146	      technology agnostic.

148	   o  Another referencing initiative, for open web archives, is omitting
149	      specification of the web archive where the resource was found.
150	      This strategy is used in order to open the possibility of using
151	      alternatives from other archives.  However, this also adds a risk
152	      of imprecision since different archives tend to have different
153	      versions even when harvesting at the same time.  Therefore, such a
154	      reference style is not suited for references where it is important
155	      that the reference is precisely the verified reference.  The PWID
156	      URN can provide an exact reference for where the reference was
157	      validated.  Additionally, the PWID contains the needed information
158	      in order to search for alternative resources, if needed.

160	   o  For web collections/corpora (possibly across different web
161	      archives), recent research has found that various legal and
162	      sustainability issues have led to a need for a collection
163	      definition of references to their web parts.  Furthermore, there
164	      is a need for a similar persistent referencing for all parts for
165	      calculation and sustainability reasons.  So far, there has been no
166	      stable standard for definition of such collection parts.  The PWID
167	      URN can be used for such definitions in order to fulfil these
168	      requirements (research documented in [ResawColl]).

170	   The PWID URN is especially useful for web material where precision is
171	   in focus and/or there are references to materials from web archives
172	   requiring special permissions in order to gain access.  The precision
173	   regards two aspects.  Firstly, pointing out the archive where the
174	   resource was found and validated against its purpose (other archived
175	   versions in other web archives may differ both regarding completeness
176	   and contents even within short time periods).  Secondly, specifying
177	   whether the referred resource is a web page or a part in form of one
178	   file.

180	   The possibility of specifying the part/file precision enables the
181	   PWID URN to be used in specification of contents of a web collection.
182	   Definitions of web collections are often needed for extraction of
183	   data used in production of research results, e.g. for future
184	   evaluations.  Current practices are not persistent as they often use
185	   some CDX version, which varies for different implementations.

187	   Strict syntax is needed for the PWID URN, in order to ensure that it
188	   can act as a reference which can be used for computational purposes.
189	   This is especially relevant for automatic extraction of parts from
190	   web collection definitions.  Furthermore, today's readers of research
191	   papers are expecting to be able to access a referenced resource by
192	   clicking an actionable URI; therefore, a similar possibility will be
193	   expected for references to available archived web material, and this
194	   is possible with a strict syntax.  A prototype for resolving URN
195	   PWIDs has been developed for the Danish web archive data and open web
196	   archives with standard patterns for the current technologies.
197	   Implementations for resolution of PWID URNs for other web archives
198	   may be developed.

200	   The purpose of the PWID URN is also to express a web archive
201	   reference as simply as possible and at the same time meet the
202	   requirements for sustainability, usability and scope.  Therefore, the
203	   PWID URN is focused on having only the minimum required information
204	   to make a precise identification of a resource in an arbitrary web
205	   archive.  Recent research has shown that this can be obtained by the
206	   following information [ResawRef]:

208	   o  Identification of web archive

210	   o  Identification of source:

212	      *  Archived URI or identifier

214	      *  Archival timestamp

216	   o  Intended precision (page, part/file)

218	   The PWID URN represents this information in a human readable way as
219	   well as a well-defined way that enables technical solutions to
220	   interpret the URN.

222	1.1.  Requirements Language

224	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
225	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
226	   document are to be interpreted as described in [RFC2119].

228	2.  Namespace Registration Template

230	   Namespace Identifier:

232	      PWID

234	   Version:

236	      1

238	   Date:

240	      2019-09-06

242	   Registrant:

244	      Eld Maj-Britt Olmuetz Zierau
245	      Royal Danish Library
246	      Soeren Kierkegaards Plads 1
247	      1219 Copenhagen
248	      Denmark
249	      ph: +45 9132 4690
250	      email: elzi@kb.dk

252	   Purpose:

254	      The PWID URN is a supplement to existing reference standards,
255	      where the PWID URN will support references to web archives,
256	      including areas that are not supported today: support of
257	      references to material in web archives with restricted access.
258	      Furthermore, the PWID URN enables technology agnostic references
259	      to web archives in general, which can be needed, for instance for
260	      references to dynamic web material with frequent updates (e.g. a
261	      news site) or a specific version of a web material (e.g. specific
262	      version of the DOI handbook).

264	      The PWID URN is in a form which can work as an algorithmic basis
265	      for finding the resource.  This also enables computation of
266	      archived web parts to a collection from one or more web archives,
267	      if the collection parts are specified by PWID URNs.

269	      Furthermore, the PWID URN includes information about the resource
270	      which makes it possible to find alternative resources, in cases
271	      where the original precise resource has become unavailable.

273	      The PWID URN is designed to be a persistent reference that is
274	      general, global and technology agnostic in order to enhance its
275	      chances of being sustainable.  Furthermore, it is designed to be
276	      humanly readable and with an ability to specify precision about
277	      what the referenced web archive resource covers.  This design
278	      enables a PWID URN to:

280	      *  be used in technical solutions, e.g. to make them resolvable

282	      *  cover references to materials from all sorts of web archives

284	      The motivation for defining a PWID namespace is the growing
285	      challenges of references to archived web resources, and the PWID
286	      as a URN can assist in overcoming a lot of these challenges.  The
287	      standard is needed to address web materials meeting precision and
288	      persistency issues on par with precision in traditional references
289	      for analogue material.  Furthermore, it is needed in order to
290	      address web archive resources that are not freely available
291	      online.  The PWID URN covers both referencing of web resources
292	      from research papers and definition of web collections/corpora.
293	      In detail the challenges are:

295	      *  Persistent Identifier systems (like DOI [DOI]) will only cover
296	         registered resources.  In general, citation guidelines do not
297	         cover general and persistent referencing techniques for web
298	         resources that are not registered.  However, an increasing
299	         number of references point to resources that only exist on the
300	         web, e.g. blogs that turn out to have a historical impact.  In
301	         order to obtain persistency for a reference, the target needs
302	         to be stable.  For non-registered web resources, the common
303	         rule is that the resource will change, since the live-web is
304	         constantly changing.  Persistency can only be obtained by
305	         referring to something stable, i.e. an archived version of the
306	         resource from the web.  The PWID URN is therefore focused on
307	         referencing archived web material in a technology agnostic way
308	         (research documented in [IPRES2016] and [ResawRef]).

310	      *  References to materials that only exist in web archives (i.e.
311	         no longer on the live web) are not well supported, especially
312	         not for materials that only exist in archives with restricted
313	         access.  There are many new initiatives for web archive
314	         referencing, most of which are centralized solutions offering
315	         harvesting and referencing, but these cannot be used for
316	         materials that only exist in web archives.  The PWID URN can be
317	         used for all web archives, including web archives with
318	         restricted access.

320	      *  One of the referencing initiatives for open web archives uses
321	         URLs which depend on the current setup of the web archive's
322	         access platform.  These URLs are usually technology and
323	         placement dependent, and therefore such a reference style is
324	         not suited for references that are important to retrace for a
325	         long period.  The PWID URN can be used for such reference
326	         purposes, since it is technology agnostic.

328	      *  Another referencing initiative, for open web archives, is
329	         omitting specification of the web archive where the resource
330	         was found.  This strategy is used in order to open the
331	         possibility of using alternatives from other archives.
332	         However, this also adds a risk of imprecision since different
333	         archives tend to have different versions even when harvesting
334	         at the same time.  Therefore, such a reference style is not
335	         suited for references where it is important that the reference
336	         is precisely the verified reference.  The PWID URN can provide
337	         an exact reference for where the reference was validated.
338	         Additionally, the PWID contains the needed information in order
339	         to search for alternative resources, if needed.

341	      *  For web collections/corpora (possibly across different web
342	         archives), recent research has found that various legal and
343	         sustainability issues have led to a need for a collection
344	         definition of references to their web parts.  Furthermore,
345	         there is a need for a similar persistent referencing for all
346	         parts for calculation and sustainability reasons.  So far,
347	         there has been no stable standard for definition of such
348	         collection parts.  The PWID URN can be used for such
349	         definitions in order to fulfil these requirements (research
350	         documented in [ResawColl]).

352	      The PWID URN is especially useful for web material where precision
353	      is in focus and/or there are references to materials from web
354	      archives requiring special permissions in order to gain access.
355	      The precision regards two aspects.  Firstly, pointing out the
356	      archive where the resource was found and validated against its
357	      purpose (other archived versions in other web archives may differ
358	      both regarding completeness and contents even within short time
359	      periods).  Secondly, specifying whether the referred resource is a
360	      web page or a part in form of one file.

362	      The possibility of specifying the part/file precision enables the
363	      PWID URN to be used in specification of contents of a web
364	      collection.  Definitions of web collections are often needed for
365	      extraction of data used in production of research results, e.g.
366	      for future evaluations.  Current practices are not persistent as
367	      they often use some CDX version, which varies for different
368	      implementations.

370	      Strict syntax is needed for the PWID URN, in order to ensure that
371	      it can act as a reference which can be used for computational
372	      purposes.  This is especially relevant for automatic extraction of
373	      parts from web collection definitions.  Furthermore, today's
374	      readers of research papers are expecting to be able to access a
375	      referenced resource by clicking an actionable URI; therefore, a
376	      similar possibility will be expected for references to available
377	      archived web material, and this is possible with a strict syntax.
378	      A prototype for resolving URN PWIDs has been developed for the
379	      Danish web archive data and open web archives with standard
380	      patterns for the current technologies.  Implementations for
381	      resolution of PWID URNs for other web archives may be developed.

383	      The purpose of the PWID URN is also to express a web archive
384	      reference as simply as possible and at the same time meet the
385	      requirements for sustainability, usability and scope.  Therefore,
386	      the PWID URN is focused on having only the minimum required
387	      information to make a precise identification of a resource in an
388	      arbitrary web archive.  Recent research has shown that this can
389	      be obtained by the following information [ResawRef]:

391	      *  Identification of web archive

393	      *  Identification of source:

395	         +  Archived URI or identifier

397	         +  Archival timestamp

399	      *  Intended precision (page, part/file)

401	      The PWID URN represents this information in a human readable way
402	      as well as a well-defined way that enables technical solutions to
403	      interpret the URN.

405	   Syntax:

407	      The syntax of the PWID URN is specified below in Augmented Backus-
408	      Naur Form (ABNF) [RFC5234] and conforms to URN syntax defined in
409	      [RFC8141].  The syntax definition of the PWID URN is:

411	        pwid-urn = "urn:" pwid-NID ":" pwid-NSS

413	        pwid-NID = "pwid"
414	        pwid-NSS = archive-domain ":" archival-time ":" precision-spec
415	                              ":" archived-uri

417	        archival-time = utc-date ["T" utc-time] "Z"
418	        utc-date   = utc-year "-" utc-month "-" utc-day
419	        utc-year   = 4DIGIT
420	        utc-month  = 2DIGIT  ; 01-12
421	        utc-day    = 2DIGIT  ; 01-28, 01-29, 01-30, 01-31 based on
422	                         ; month/year in UTC time
423	        utc-time   = utc-hour ":" utc-minute [":" utc-second [secfrac]]
424	        utc-hour   = 2DIGIT  ; 00-23
425	        utc-minute = 2DIGIT  ; 00-59
426	        utc-second = 2DIGIT  ; 00-58, 00-59, 00-60 based on leap second
427	                                   ; rules
428	        secfrac       = "." 1*9DIGIT

430	        precision-spec = "part" / "page"

432	      where

434	      *  All parts of the pwid-NSS are case insensitive, except for
435	         cases where the archived-uri represents a URI with case
436	         sensitive parts.  According to [RFC8141] (section 3.1) this
437	         means that the PWID URNs in general are case insensitive,
438	         except for cases where they include a case sensitive archived
439	         URI.

441	      *  'archive-domain' is defined as in [RFC1034] (section 3.5).
442	         The 'archive-domain' must identify the web archive by the
443	         domain for the archive leading to descriptions of how to access
444	         (or apply for access) materials in the archive.  (This way of
445	         identifying the web archive is described in the "Assignment"
446	         section and discussed in the "Additional Information"
447	         section.)

449	      *  'archival-time' is a UTC timestamp which conforms to the W3C
450	         profile of [ISO8601] [W3CDTF] and a subset of date-time
451	         specified in [RFC3339] (except for allowing partial time
452	         specification).
453	         The 'archival-time' may be specified at any of the levels of
454	         granularity, as long as it reflects exactly the granularity of
455	         the timestamp recorded in the archive (which is in accordance
456	         with the WARC standard [ISO28500]).

458	      *  'DIGIT' is defined as in [RFC5234].

460	      *  'archived-uri' is defined as 'URI' in [RFC3986] but where
461	         occurrences of "[", "]", "?", "#" and "%" are %-encoded in
462	         order not to clash with URN reserved characters [RFC8141] as
463	         well as having unambiguous use of "%".
464	         The 'archived-uri' must be the URI for the archived source.
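
      The %-encoding rule for 'archived-uri' can be illustrated with a
      minimal Python sketch.  The function names are hypothetical and
      not part of this specification.  The sketch assumes that every
      "%" in the archived URI is escaped, including those belonging to
      existing escapes, so that decoding gives back the archived URI
      unchanged.

```python
import re

# The five characters the text above lists as %-encoded in 'archived-uri'.
_PWID_ESCAPES = {"%": "%25", "[": "%5B", "]": "%5D", "?": "%3F", "#": "%23"}

# Matches exactly those five escapes; hex digits may be in either case.
_PWID_ESCAPE_RE = re.compile(r"%(25|5B|5D|3F|23)", re.IGNORECASE)


def encode_archived_uri(uri: str) -> str:
    """Escape "[", "]", "?", "#" and "%" in one pass over the URI.

    Working character by character means a "%" introduced by an
    escape is never escaped a second time.
    """
    return "".join(_PWID_ESCAPES.get(ch, ch) for ch in uri)


def decode_archived_uri(encoded: str) -> str:
    """Reverse encode_archived_uri, again in a single pass."""
    return _PWID_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), encoded)
```

      Under this assumption, an archived URI that already contains
      "%20" is written with "%2520" in the PWID URN, and decoding
      restores "%20".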

466	      The precision specification is expressing the intended precision
467	      of the reference, which is needed for specification of

469	      *  precise coverage of the reference
470	         e.g. to an html file, since the precise meaning of what the
471	         reference covers can be very varied (the html file itself? the
472	         web page it renders to?) or precise web parts of a collection
473	         specification.

475	      *  degree of how precise the reference is with respect to what can
476	         be viewed in the future
477	         The html file itself will be the same.  However, for web pages
478	         there is interpretation involved, which means the result of
479	         rendering them in the web archive can change over time.  This
480	         may happen if the web archive changes its algorithm for
481	         calculating which archived web parts to use for the web page.
482	         It may also happen if the web page refers to parts which are
483	         added to the web archive later, and therefore will give another
484	         expression than the originally referenced expression.

486	      The following precision-spec values are valid:

488	      *  'page'
489	         Meaning that an application like Wayback calculates a resulting
490	         web page based on calculated referenced web parts (display
491	         templates, images etc.).  For example, an html page displaying
492	         an image will need both the html and the referred image.

494	      *  'part'
495	         Meaning the single archived file/web part harvested from the
496	         specified URI.  For references to web pages with html code
497	         (i.e. pages where there is an option to "View page source"),
498	         this will mean the actual file with the html code.  It is
499	         relevant to refer to web pages this way, in case it is part of
500	         a collection specification or in case it is the html that is of
501	         interest (e.g. JavaScript or hidden links that are not
502	         visible when rendering the web page).
503	         For all other types of files, the URI will be for a single
504	         file to be interpreted as a file.
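
      As an illustration of how the strict syntax supports
      computational use, the following Python sketch splits a PWID URN
      into its four parts with a regular expression derived from the
      ABNF above.  The names PWID_RE and parse_pwid are hypothetical.
      The pattern for 'archive-domain' is a simplification of
      [RFC1034], and the numeric ranges given in the ABNF comments
      (e.g. months 01-12) are not checked.

```python
import re
from typing import Optional

# One pattern per ABNF rule; named groups mirror the pwid-NSS parts.
PWID_RE = re.compile(
    r"^urn:pwid:"
    # Simplified domain pattern (assumption): letters, digits, "." and "-".
    r"(?P<archive_domain>[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?):"
    # utc-date ["T" utc-time] "Z", with optional seconds and fraction.
    r"(?P<archival_time>[0-9]{4}-[0-9]{2}-[0-9]{2}"
    r"(?:T[0-9]{2}:[0-9]{2}(?::[0-9]{2}(?:\.[0-9]{1,9})?)?)?Z):"
    r"(?P<precision_spec>part|page):"
    # Everything after the precision-spec is the (encoded) archived URI.
    r"(?P<archived_uri>.+)$",
    re.IGNORECASE,
)


def parse_pwid(urn: str) -> Optional[dict]:
    """Return the PWID parts as a dict, or None if the syntax does not match."""
    match = PWID_RE.match(urn)
    if match is None:
        return None
    parts = match.groupdict()
    # These parts are case insensitive, so normalise them to lower case.
    parts["archive_domain"] = parts["archive_domain"].lower()
    parts["precision_spec"] = parts["precision_spec"].lower()
    return parts
```

      The archived URI is left untouched because it may contain case
      sensitive parts.  Since 'archive-domain' cannot contain ":" and
      the timestamp has a fixed shape, the colons inside the time and
      inside the archived URI do not make the split ambiguous.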

506	   Assignment:

508	      The PWID URNs do not have to be assigned by an authority, as they
509	      are based on the information created at the time of archiving.  In
510	      other words: a PWID URN is created independently, but following an
511	      algorithm which ensures that the referred item can be found if it
512	      is still available.  A prerequisite for assignment of a PWID is
513	      that the web archive can be identified (with a domain describing
514	      the web archive) and that it has registered metadata about the
515	      archived URI and the time of archiving (also discussed in section
516	      "Additional Information").

518	      A PWID URN is created by finding the relevant information of the
519	      syntax parts of the PWID:

521	        "urn:pwid:" archive-domain ":" archival-time ":" precision-spec
522	                               ":" archived-uri

524	      The PWID URN for an archived item at hand can be constructed by
525	      exchanging the unspecified PWID parts with relevant information,
526	      as explained in the following:

528	      *  archive-domain (identification of web archive):
529	         This must be the domain of the web archive as identification of
530	         the web archive (e.g. archive.org for Internet Archive's open
531	         web archive and netarkivet.dk for the Danish web archive with
532	         restricted access).  Use of the web archive domain as an
533	         identification of a web archive is chosen, since most web
534	         archives have a web archive domain page that leads to a
535	         description of how to access the web archive, e.g. by online
536	         access or by applying for access grants.  Furthermore, it is
537	         more precise than e.g. the name of the archive, since there
538	         may be more than one installation of web archives at the same
539	         organization, e.g. archive.org and archive-it.org are both
540	         covered by Internet Archive.
541	         A more precise and persistent identification would require a
542	         formal registry of web archives, but such a registry does not
543	         yet exist.

545	      *  archival-time (archival timestamp):
546	         The archival time for the archived item must be specified with
547	         as much granularity as possible in order to make sure it
548	         uniquely identifies the resource at hand.  The archival time
549	         may be displayed along with the archived item, but there are
550	         different implementations.  It is important to be aware of
551	         whether a more precise timestamp can be found, and whether the
552	         correct timestamp is used.  In many Wayback implementations,
553	         the precise timestamp can be found as part of the URI used for
554	         viewing the archived item.  For example, in the archive URI
555	         https://web.archive.org/web/20160122100823/https://www.dr.dk
556	         for an archived resource viewable via the Internet Archive's
557	         Wayback installation, the number 20160122100823 represents the
558	         archival time 2016-01-22T10:08:23Z.  In other installations,
559	         the most precise timestamp may be found in the URI from a
560	         search result leading to the resource (which usually redirects
561	         on the basis of a call to the underlying archive index).
562	         Especially for web pages with frames, there may be cases where
563	         the actual time is not displayed with the source, since only
564	         the times for the contents of the frames are displayed.

566	      *  precision-spec (part or page):
567	         The precision specification specifies how the referred item
568	         should be regarded.  A typical PWID URN reference in a paper
569	         would be 'page', where a tool will be needed to render the web
570	         page.  Alternatively, the precision-spec can be 'part', which
571	         is the most precise reference since it references a specific
572	         file where no additional calculations are needed (e.g. as part
573	         of a collection, a specific html file with hidden links or to
574	         indicate that a single image is referenced).  In order to see
575	         whether a viewed browser page is a computed web page or a
576	         single file, browsers have a function "View page source", which
577	         is not activated for single files.

579	      *  archived-uri (archived URI):
580	         The URI that was harvested by the web archive for the
581	         referenced resource.
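
As an illustration of the parts described above, the following Python
sketch derives a PWID URN from an Internet Archive Wayback URI of the
form shown in the archival-time example.  The function name
wayback_to_pwid and the regular expression are assumptions made for
this illustration only; they are not part of this specification.  The
sketch does not apply the %-encoding of the archived URI described
under Interoperability.

```python
import re
from datetime import datetime

# Assumed shape of an Internet Archive Wayback URI, taken from the example
# in the text: https://web.archive.org/web/<14 digits>/<archived URI>
WAYBACK_RE = re.compile(r"^https://web\.archive\.org/web/(\d{14})/(.+)$")


def wayback_to_pwid(wayback_uri, precision="page"):
    """Build a PWID URN from a Wayback URI (hypothetical helper)."""
    if precision not in ("page", "part"):
        raise ValueError("precision-spec must be 'page' or 'part'")
    match = WAYBACK_RE.match(wayback_uri)
    if match is None:
        raise ValueError("not a recognised Wayback URI: " + wayback_uri)
    digits, archived_uri = match.groups()
    # 20160122100823 -> 2016-01-22T10:08:23Z (the digits are read as UTC)
    stamp = datetime.strptime(digits, "%Y%m%d%H%M%S")
    archival_time = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    return "urn:pwid:archive.org:{}:{}:{}".format(
        archival_time, precision, archived_uri
    )


print(wayback_to_pwid(
    "https://web.archive.org/web/20160122100823/https://www.dr.dk"
))
```

Passing precision="part" instead yields the corresponding 'part'
reference for the same archived URI.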

583	      A much easier way to construct PWID URNs is to use tools that
584	      construct them.  Currently, there is also a prototype for a SOLR-
585	      Wayback tool (Source at https://github.com/netarchivesuite/
586	      solrwayback) [PWIDprovider], which can assist in finding the most
587	      precise reference to an archived web page.  This Wayback version
588	      can list all PWID URNs belonging to a shown page.  The PWID URN
589	      for the page itself is given at the top of the list with 'part'
590	      precision, so the page PWID URN can be obtained by replacing
591	      'part' with 'page'.  Alternatively, all listed PWID URNs can be
592	      taken and used e.g. in a collection definition.

594	   Security and Privacy:

596	      Security and privacy considerations are restricted to accessible
597	      web resources in web archives.  Resolution of PWID URNs will
598	      usually only be possible using the web archives' access tools.
599	      In such cases, security and privacy are covered by these tools
600	      and depend on the protections that the tools provide.

602	      It should be noted that an archived web page or part could be just
603	      as dangerous as a "live" page or part; for instance, it could
604	      include insecure scripts, malware, trackers, etc.  Furthermore, an
605	      archived page can in fact be more dangerous, because it could
606	      include outdated scripts with known vulnerabilities that can never
607	      be patched because the script is archived for all time in a
608	      vulnerable state.

610	   Interoperability:

612	      This is covered by comments in the Syntax description:

614	      *  the PWID URN conforms to the URI standard as defined in
615	         [RFC3986] and the URN standard [RFC8141]

617	      *  the 'archival-time' of the PWID URN conforms to the UTC
618	         timestamp as described in the W3C profile of ISO 8601 [ISO8601]
619	         [W3CDTF] and is in accordance with the WARC standard ISO 28500
620	         [ISO28500].

622	      *  for 'archived-uri', this URI conforms to the URI standard
623	         as defined in [RFC3986], with %-encodings of "[", "]", "#", "?"
624	         and "%" in order to conform to the URN standard [RFC8141] as
625	         well as having unambiguous use of "%"
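
The %-encoding rule for 'archived-uri' in the last item can be
sketched as follows in Python.  The function names are hypothetical,
and the sketch assumes that exactly the five characters listed above
are encoded, each as "%" followed by two upper-case hexadecimal
digits.  Both directions are done in a single pass so that no escape
is encoded or decoded twice.

```python
import re

# The five characters named in the text: "[", "]", "#", "?" and "%".
_SPECIAL = re.compile(r"[\[\]#?%]")
# Their %-encoded forms; lower-case hex digits are accepted when decoding.
_ESCAPED = re.compile(r"%(5B|5D|23|3F|25)", re.IGNORECASE)


def encode_archived_uri(uri):
    """%-encode the listed characters of an archived URI."""
    return _SPECIAL.sub(lambda m: "%{:02X}".format(ord(m.group())), uri)


def decode_archived_uri(encoded):
    """Reverse encode_archived_uri."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), encoded)


original = "http://example.org/x%20y[1]?q=a#top"
encoded = encode_archived_uri(original)
print(encoded)
print(decode_archived_uri(encoded) == original)
```

Because every "%" in the archived URI becomes "%25", an escape such as
"%20" that was already present in the harvested URI is preserved
literally and is restored unchanged on decoding.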

627	   Resolution:

629	      The information in a PWID URN can be used for locating a web
630	      archive resource, for any kind of web archive.  It includes the
631	      minimum information for web archive materials, which enables
632	      resolvability, manually or by a resolver.  Resolution of a PWID
633	      URN is the primary motivation for making a formal URN definition,
634	      instead of just a textual representation of the needed parts of a
635	      PWID.

637	      Resolution is done based on the PWID parts.  This can be done
638	      manually by using information from the PWID parts to look up the
639	      web archive and use the web archive's tools to search for the
640	      resource.  It can also be done automatically by using the
641	      information from the PWID parts to construct a URI to locate the
642	      archived resource on the internet (for online web archives) or on
643	      a local restricted network (for web archives with access
644	      restrictions).  The relevant information from the PWID parts is:

646	      *  Web archive domain for web archive holding referred resource
647	         The domain name for the web archive.  For the manual solution,
648	         this domain is used to find a description of how to access the
649	         web archive's materials.  For example, "archive.org" is the
650	         domain name leading to the Internet Archive's interface to
651	         their online web collection, and "netarkivet.dk" is the domain
652	         name leading to the website for the Danish web archive with
653	         information about how to apply for access permission to the web
654	         collections.  For an automatic solution, the domain will be
655	         used to identify how to calculate the pattern for the URI for
656	         the resource.

658	      *  Archived URI of archived resource
659	         For the manual solution, the archived URI for the resource
660	         must be used in the search for the resource.  For the
661	         automatic solution, this is used as a parameter for
662	         construction of the URI for the resource.

664	      *  Date and time associated with the archived item
665	         The archival date and time must be used in manual search for
666	         the resource or as parameter to automatic construction of the
667	         URI for the resource.

669	      *  Precision of what is referred
670	         The precision contributes to the guidance of how to view the
671	         referred item.  If the precision is 'page', the resource must
672	         be browsed using the web archive browsing tool, which computes
673	         all parts needed for browsing of the page.  If the precision is
674	         'part', the "View page source" browser function can be used for
675	         pages to get the referred resource.  If the resource is a
676	         single file, this option is not activated, since the full
677	         resource is already shown.  The 'part' precision can also be
678	         an indicator for tools (e.g. a collection extraction tool) that
679	         they can fetch the contents by fetching the file pointed to.

681	      In the following, the different resolution techniques are
682	      explained (manual as well as via a service) using the following
683	      PWID URN as an example:

685	      urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk

687	      In this example, the information from the PWID URN parts is:

689	      *  "archive.org"
690	         Currently known identifier in form of the Internet Archive
691	         domain name for their open access web archive.

693	      *  "2016-01-22T10:08:23Z"
694	         UTC date and time associated with the archived URI

696	      *  "page"
697	         Clarification that the reference covers the full web page with
698	         all its inherited parts selected by the web archive

700	      *  "https://www.dr.dk"
701	         archived URI of the referenced resource
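
The split of the example PWID URN into these four parts can be
sketched in Python as follows.  The names Pwid, PWID_RE and parse_pwid
are hypothetical, and the regular expression is a simplification: it
only accepts the full-second UTC timestamp form used in the example
and does not implement the complete syntax definition.  Since the
timestamp itself contains ":", the parts cannot be obtained by simply
splitting the URN on ":".

```python
import re
from typing import NamedTuple


class Pwid(NamedTuple):
    archive_id: str
    archival_time: str
    precision_spec: str
    archived_uri: str


# Simplified pattern (assumption): full-second UTC timestamps only.
PWID_RE = re.compile(
    r"^urn:pwid:"
    r"(?P<archive_id>[^:]+):"
    r"(?P<archival_time>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z):"
    r"(?P<precision_spec>page|part):"
    r"(?P<archived_uri>.+)$"
)


def parse_pwid(urn):
    """Split a PWID URN into its four parts (hypothetical helper)."""
    match = PWID_RE.match(urn)
    if match is None:
        raise ValueError("not a PWID URN in the expected form: " + urn)
    return Pwid(**match.groupdict())


parts = parse_pwid(
    "urn:pwid:archive.org:2016-01-22T10:08:23Z:page:https://www.dr.dk"
)
print(parts.archive_id)
print(parts.archival_time)
print(parts.precision_spec)
print(parts.archived_uri)
```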

703	      A manual resolution technique would be to go through the following
704	      steps using the specified web archive's search interface (which
705	      will work for both open web archives and web archives with
706	      restricted access onsite):

708	      *  Browse the web archive domain "archive.org"
709	         In this case, the domain leads directly into a page where you
710	         can search for archived URIs (in other cases there may be need
711	         for additional clicks to get to search interface or
712	         descriptions of how to apply for access).

714	      *  Enter the archived URI "https://www.dr.dk" in the search field
715	         and make a search, which will result in an overview of the
716	         different times that the resource was archived.

718	      *  Use the archival time "2016-01-22T10:08:23Z" to select the
719	         correct resource

721	      The "page" information is used in verification that the right
722	      precision level is reached.  In case the precision-spec had been
723	      'part', it would require an extra step selecting "View page
724	      source" on the resulting page.

726	      It is also noteworthy that the information in the PWID can help in
727	      finding an alternative resource, in case the original referred
728	      resource is no longer available.  The archived URI can be searched
729	      in other web archives, where the date and time can help to find
730	      the best match, e.g. via Memento [MEMENTO] (for some open web
731	      archives) or via possible coming web archive infrastructures.
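
As a hedged illustration of the Memento-based fallback, the Memento
protocol (RFC 7089) lets a client state the desired time in an
"Accept-Datetime" HTTP request header, formatted as an HTTP date in
GMT.  The Python sketch below only converts the archival time of a
PWID URN into such a header value.  The function name is
hypothetical, no request is sent, and which Memento service to query
is left open.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(archival_time):
    """Convert e.g. 2016-01-22T10:08:23Z to an HTTP date in GMT."""
    stamp = datetime.strptime(archival_time, "%Y-%m-%dT%H:%M:%SZ")
    # format_datetime(..., usegmt=True) requires an aware UTC datetime.
    return format_datetime(stamp.replace(tzinfo=timezone.utc), usegmt=True)


print(accept_datetime("2016-01-22T10:08:23Z"))
# prints: Fri, 22 Jan 2016 10:08:23 GMT
```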

733	      Alternative resolution (automatically or manually) of this PWID
734	      URN can be deduced based on the current (2019) knowledge of
735	      Internet Archive's open Wayback access web interface, which has
736	      the pattern:

738	         https://web.archive.org/web/<time>/<uri>

740	      Using this pattern (where only digits from the timestamp are
741	      included), it is possible to deduce the online https URI:

743	         https://web.archive.org/web/20160122100823/https://www.dr.dk
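
The deduction above can be sketched in Python as follows.  The table
URI_PATTERNS and the function resolve_pwid_parts are hypothetical
names; the only pattern included is the Internet Archive pattern
quoted above, and an archived URI that was %-encoded in the PWID URN
would have to be decoded before it is inserted.

```python
import re

# Known access patterns per archive domain.  Only the Internet Archive
# pattern from the text is listed; other entries would be assumptions.
URI_PATTERNS = {
    "archive.org": "https://web.archive.org/web/{time}/{uri}",
}


def resolve_pwid_parts(archive_id, archival_time, archived_uri):
    """Construct an access URI from PWID parts (hypothetical helper)."""
    pattern = URI_PATTERNS.get(archive_id)
    if pattern is None:
        raise ValueError("no known URI pattern for " + archive_id)
    # Keep only the digits of the timestamp:
    # 2016-01-22T10:08:23Z -> 20160122100823
    digits = re.sub(r"\D", "", archival_time)
    return pattern.format(time=digits, uri=archived_uri)


print(resolve_pwid_parts(
    "archive.org", "2016-01-22T10:08:23Z", "https://www.dr.dk"
))
```

Keeping the patterns in a table keyed by archive domain reflects the
remark that the domain identifies how to calculate the pattern for
the URI of the resource.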

745	      The same recipe can be used for other Wayback platforms for open
746	      web archives.  For web archives with restricted access, there may
747	      be similar recipes, but it may also require special applications
748	      to extract the local URI for the resource (e.g. for Netarkivet, it
749	      is constructed using an API which uses the local CDX to generate
750	      the correct local URI for the resource).

752	      A resolving service is currently available in form of code for a
753	      prototype that runs at the Royal Danish Library [PWIDresolver] and
754	      is planned to be more widely available.  This service currently
755	      covers both the Danish web archive (with the proper rights) and
756	      open web archives with access services based on a pattern
757	      including archive, archival time and archived URI.  In other
758	      words, for open web archives it covers conversion of PWID URNs
759	      for: archive.org, archive-it.org, arquivo.pt, bibalex.org,
760	      nationalarchives.gov.uk, stanford.edu and vefsafn.is.  For the
761	      Danish web archive with restricted access, the prototype works
762	      locally accessing the CDX of the library, and providing access via
763	      a local proxy to a restricted environment.  The source code for
764	      this prototype is available from
765	      https://github.com/netarchivesuite/NAS-research/releases/
766	      tag/0.0.6.

768	   Documentation:

770	      None relevant

772	   Additional Information:

774	      Background:
775	      The PWID was originally suggested as a URI, based on research
776	      between a computer science researcher with knowledge of web
777	      archiving and researchers from the humanities (History and
778	      Literature).  This resulted in the paper "Persistent Web
779	      References - Best Practices and New Suggestions" [IPRES2016] from
780	      the iPres 2016 conference.  In this paper, the PWID is referred to
781	      as WPID.  However, feedback was received displaying a concern that
782	      WPID was interpreted as a PID related to a PID-system, e.g. as the
783	      DOI.  Although the definition of a PID does not contradict the
784	      name "WPID", there would still be a danger of confusing it with
785	      PID-systems, which is not the intention.  Consequently, this
786	      suggestion uses the name PWID instead.

788	      Comments on the drafted PWID URI ([DraftPwidUri]) have suggested
789	      that it should be a URN rather than a URI, which is why the PWID
790	      URN is defined here.

792	      Interest in the PWID has been expressed on several occasions
793	      where it has been presented (iPRES 2016 [IPRES2016] paper,
794	      RESAW 2017 [ResawRef][ResawColl] papers, iPRES 2018 [IPRES2018]
795	      best poster, iDCC 2019 [IDCC2019] poster).  In particular,
796	      web researchers from digital humanities have expressed a strong
797	      interest in the PWID, since it will fill a gap and make it
798	      possible for the researchers to make the necessary references.

800	      Limitations to when a PWID URN can be created:
801	      It can be argued that the PWID URN should not have any
802	      restrictions to which material it can be applied.  However, in
803	      order to make a standardized general way to identify material,
804	      there need to be assumptions on a set of information that can be
805	      used for identification.

807	      It can also be argued that these limits are essential: material
808	      that is to be referenced on a long-term basis needs to have
809	      information about which archive, when it was archived and what was
810	      archived.  (See also discussion of web archive identification
811	      below).

813	      Discussion of the web archive identification:

815	      Using the domain for a web archive as an identifier of the web
816	      archive is not ideal, but it is workable.  There are a number of
817	      examples where the domain may not work in the future:

819	         The web archive no longer exists

821	         The web archive has been merged into another web archive

823	         The web archive has changed the domain it uses

825	      In the first case, the precise material has been lost, which would
826	      be the same situation if the web archive had been identified in
827	      any other way.  It is, however, advisable for any user of
828	      references to evaluate the likely sustainability of a web
829	      archive before using the reference, e.g. by evaluating the
830	      probability of continued funding for the web archive.  In any
831	      case, the PWID contains information about archived URI and
832	      archival time, which makes it possible to search for an
833	      alternative (possibly less precise) reference in other web
834	      archives, e.g. by using Memento.

836	      In the second and the third case, identification of the resource
837	      will require that the new domain is found.  The likelihood of
838	      finding such information is rather high for well-established web
839	      archives, by using one of two ways.  One way is to search for the
840	      domain change information online (if transition is described for
841	      the web archive at the new domain for the web archive).  Another
842	      way would be to search other web archives for the last harvests of
843	      the archive domain with information about forthcoming transition
844	      (many web archives harvest each other's domain home page, e.g.
845	      all the web archives mentioned in this document can be found in
846	      both archive.org and arquivo.pt).

848	      It would of course be ideal to have a registry that has exactly
849	      one identifier for a web archive, with different domains/patterns
850	      for online material for different periods if there have been
851	      changes.

853	      Possible extensions to be investigated:
854	      This first version of the PWID only contains a basic definition,
855	      which means that it does not include all of the possible
856	      extensions which have been suggested at different conferences.
857	      The reason is that these suggestions are not mature enough to be
858	      included at this stage.  The extensions suggested so far have
859	      been:

861	      *  Having web archives identified by registered identifiers.
862	         An update to the PWID URN will be considered if a workable
863	         solution can be found, e.g. by establishing such a registry
864	         at IANA.

866	      *  Having the possibility to use PWID for other web material than
867	         archived URIs, e.g. snapshots and collections

869	      *  Various possibilities for specifying the identified material,
870	         e.g. snapshot

872	      *  Discussion of how to extend use of PWID URNs via plugins in
873	         browsers, standardized way to ask web archives for resource
874	         specified as a PWID URN and access via future web research
875	         infrastructure

877	   Revision Information:

879	      This is the first version of PWID as a URN.

881	3.  Acknowledgements

883	   A special thanks to Caroline Nyvang and Thomas Kromann who have
884	   contributed to the research identifying the minimum information
885	   required in a persistent web reference, and to Bolette Jurik who
886	   contributed with supplementary research concerning requirements for
887	   web collection/corpora definitions.  Also thanks to everybody who has
888	   contributed to this work with the research parts and with reviewing
889	   of this RFC.

891	4.  References

893	4.1.  Normative References

895	   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
896	              STD 13, RFC 1034, DOI 10.17487/RFC1034, November 1987,
897	              <https://www.rfc-editor.org/info/rfc1034>.

899	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
900	              Requirement Levels", BCP 14, RFC 2119,
901	              DOI 10.17487/RFC2119, March 1997,
902	              <https://www.rfc-editor.org/info/rfc2119>.

904	   [RFC3339]  Klyne, G. and C. Newman, "Date and Time on the Internet:
905	              Timestamps", RFC 3339, DOI 10.17487/RFC3339, July 2002,
906	              <https://www.rfc-editor.org/info/rfc3339>.

908	   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
909	              Resource Identifier (URI): Generic Syntax", STD 66,
910	              RFC 3986, DOI 10.17487/RFC3986, January 2005,
911	              <https://www.rfc-editor.org/info/rfc3986>.

913	   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
914	              Specifications: ABNF", STD 68, RFC 5234,
915	              DOI 10.17487/RFC5234, January 2008,
916	              <https://www.rfc-editor.org/info/rfc5234>.

918	   [RFC8141]  Saint-Andre, P. and J. Klensin, "Uniform Resource Names
919	              (URNs)", RFC 8141, DOI 10.17487/RFC8141, April 2017,
920	              <https://www.rfc-editor.org/info/rfc8141>.

922	4.2.  Informative References

924	   [DOI]      International DOI Foundation, "The DOI System", 2016,
925	              <https://web.archive.org/web/20161020222635/
926	              https:/www.doi.org/>.

928	              urn:pwid:archive.org:2016-10-20T22:26:35Z:page:https://www.
929	              doi.org/

931	   [DraftPwidUri]
932	              Zierau, E., "DRAFT: Scheme Specification for the pwid URI,
933	              Version 4", June 2018, <https://datatracker.ietf.org/doc/
934	              draft-pwid-uri-specification/>.

936	   [IDCC2019]
937	              Zierau, E., "Web References Meeting Requirements for
938	              Proper Referencing Principles", February 2019,
939	              <http://www.dcc.ac.uk/sites/default/files/documents/IDCC19
940	              /222_Web%20References%20Meeting%20Requirements%20for%20Pro
941	              per%20Referencing%20Principles.pdf>.

943	              Poster at 14th International Digital Curation Conference
944	              (iDCC) 2019

946	   [IPRES2016]
947	              Zierau, E., Nyvang, C., and T. Kromann, "Persistent Web
948	              References - Best Practices and New Suggestions", October
949	              2016, <http://www.ipres2016.ch/frontend/organizers/media/
950	              iPRES2016/_PDF/
951	              IPR16.Proceedings_4_Web_Broschuere_Link.pdf>.

953	              In: proceedings of the 13th International Conference on
954	              Preservation of Digital Objects (iPres) 2016, pp. 237-246

956	   [IPRES2018]
957	              Zierau, E., "Precise and Persistent Web Archive References
958	              - Status, context and expected progress of the PWID",
959	              September 2018, <https://osf.io/u5w3q/>.

961	              In: proceedings of the 15th International Conference on
962	              Preservation of Digital Objects (iPres) 2018, DOI:
963	              10.17605/OSF.IO/U5W3Q

965	   [ISO28500]
966	              International Organization for Standardization,
967	              "Information and documentation -- WARC file format", 2017,
968	              <https://www.iso.org/standard/68004.html>.

970	   [ISO8601]  International Organization for Standardization, "Data
971	              elements and interchange formats -- Information
972	              interchange -- Representation of dates and times", 2004,
973	              <https://www.iso.org/standard/40874.html>.

975	   [MEMENTO]  Memento Development Group, "About the Memento Project",
976	              January 2015, <http://mementoweb.org/about/>.

978	              urn:pwid:archive.org:2018-11-
979	              01T15:26:28Z:page:http://mementoweb.org/about/

981	   [PWIDprovider]
982	              Royal Danish Library (Netarkivet), "SolrWayback 3.1",
983	              2018, <https://github.com/netarchivesuite/solrwayback>.

985	              urn:pwid:archive.org:2018-06-
986	              11T02:00:05Z:page:https://github.com/netarchivesuite/
987	              solrwayback

989	   [PWIDresolver]
990	              Royal Danish Library (Netarkivet), "NAS-research version
991	              0.0.6", 2018, <https://github.com/netarchivesuite/NAS-
992	              research/releases/tag/0.0.6>.

994	              urn:pwid:archive.org:2018-07-
995	              16T06:53:51Z:page:https://github.com/netarchivesuite/NAS-
996	              research/releases/tag/0.0.6

998	   [ResawColl]
999	              Jurik, B. and E. Zierau, "Data Management of Web archive
1000	              Research Data", 2017,
1001	              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
1002	              RESAW2017-JurikZierau-
1003	              Data_management_of_web_archive_research_data.pdf>.

1005	              In: proceedings of the RESAW 2017 Conference, DOI:
1006	              10.14296/resaw.0002

1008	   [ResawRef]
1009	              Nyvang, C., Kromann, T., and E. Zierau, "Capturing the Web
1010	              at Large - a Critique of Current Web Referencing
1011	              Practices", 2017,
1012	              <https://archivedweb.blogs.sas.ac.uk/files/2017/06/
1013	              RESAW2017-NyvangKromannZierau-
1014	              Capturing_the_web_at_large.pdf>.

1016	              In: proceedings of the RESAW 2017 Conference, DOI:
1017	              10.14296/resaw.0004

1019	   [W3CDTF]   W3C, "Date and Time Formats: note submitted to the W3C. 15
1020	              September 1997", 1997,
1021	              <http://www.w3.org/TR/NOTE-datetime>.

1023	              urn:pwid:archive.org:2017-04-
1024	              03T03:37:42Z:page:http://www.w3.org/TR/NOTE-datetime

1026	Author's Address

1028	   Eld Maj-Britt Olmuetz Zierau (editor)
1029	   Royal Danish Library
1030	   Soeren Kierkegaards Plads 1
1031	   Copenhagen  1219
1032	   Denmark

1034	   Phone: +45 9132 4690
1035	   Email: elzi@kb.dk