idnits 2.17.1

draft-pwid-uri-specification-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (December 7, 2017) is 2328 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 2141
     (Obsoleted by RFC 8141)

     Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

Internet Engineering Task Force                           E. Zierau, Ed.
Internet-Draft                                  The Royal Danish Library
Intended status: Informational                          December 7, 2017
Expires: June 10, 2018

                 Scheme Specification for the pwid URI
                    draft-pwid-uri-specification-03

Abstract

   This document specifies a Uniform Resource Identifier (URI) for
   Persistent Web IDentifiers to web material in web archives using the
   'pwid' scheme name.
   The purpose of the standard is to support general, global,
   sustainable, humanly readable, technology agnostic, persistent and
   precise web references for web materials.

   The PWID URI can assist in two ways.  First, it provides a
   potentially resolvable, precise and persistent reference scheme for
   documents, which is not sufficiently covered by existing web
   reference practices.  Second, it provides a standardized way to
   specify the web elements in a web collection, also known as a web
   corpus.  Definitions of web collections are often needed for
   extraction of data used in the production of research results, e.g.
   for evaluations in the future.  Current practices are not
   persistent, as they often rely on some version of CDX, which varies
   between implementations.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 10, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the Trust
   Legal Provisions and are provided without warranty as described in
   the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Demonstrable, New, Long-Lived Utility
   3.  Syntactic Compatibility
   4.  Well Defined
   5.  Definition of Operations
   6.  Context of Use
   7.  Internationalization and Character Encoding
   8.  Scheme Name Considerations
   9.  Interoperability Considerations
   10. Acknowledgements
   11. IANA Considerations
   12. Clear Security and Privacy Considerations
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Author's Address

1.  Introduction

   The purpose of the PWID URI is to represent general, global,
   sustainable, humanly readable, technology agnostic, persistent and
   precise references to web archive resources, in a scheme that can be
   used for technical solutions.  The motivation for defining a PWID
   URI scheme is the growing challenge of referencing web resources,
   both when citing web resources from papers and when defining a web
   collection/corpus.

   o  Citation guidelines generally do not cover general and persistent
      referencing techniques for web resources that are not registered
      by Persistent Identifier systems (like DOI [DOI]).  However, an
      increasing number of references point to resources that only
      exist on the web, e.g. blogs that turned out to have a historical
      impact.  In order to obtain persistency for a reference, the
      target needs to be stable.  As the live web is 'alive' and in
      constant change, persistency can only be obtained by referring to
      archived snapshots of the web.  The PWID URI is therefore focused
      on referencing archived web material in a technology agnostic way
      (research documented in [IPRES] and [ResawRef]).

   o  There are many different requirements for construction of
      collection definitions for web material besides precision and
      persistency.  Recent research has found that various legal and
      sustainability issues lead to a need for a collection to be
      defined by references to the web parts in the collection.  The
      PWID URI is needed in such definitions in order to fulfil these
      requirements and to enable a collection to cover web materials
      from more than one archive (research documented in [ResawColl]).

   For the sake of usability and sustainability, the definition of the
   PWID URI scheme is limited to the minimum information required to
   make a precise identification of a resource in an arbitrary web
   archive.  Recent research has found that this is obtained with the
   following information [ResawRef]:

   o  Identification of web archive

   o  Identification of source:

      *  Archived URI or identifier

      *  Archival timestamp

   o  Intended coverage (page, part, subsite etc.)

   The PWID URI scheme represents this information in an unambiguous
   way, thus enabling technical solutions to be defined based on this
   scheme.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Demonstrable, New, Long-Lived Utility

   The purpose of the PWID URI is to represent the needed referencing
   information (as listed in the introduction) in a scheme that can be
   used for technical solutions.  As described in [ResawColl], such
   references can be represented in a textual way.  However, a strict,
   unambiguous syntax is needed in order to ensure that they can be
   used for computational purposes.  This is relevant for web
   collection definitions, which need a strict scheme in order to be a
   basis for automatic extraction.  Furthermore, readers of research
   papers today expect to be able to access a referenced resource by
   clicking an actionable URI; a similar facility will therefore be
   expected for references to available archived web material.

   Interest in the new PWID URI scheme has already been shown: a paper
   about the invention of the PWID URI, "Persistent Web References -
   Best Practices and New Suggestions" [IPRES], was accepted for the
   iPres 2016 conference and nominated as best paper.  At the RESAW
   2017 conference there are two related papers: one on referencing
   practices [ResawRef] and one on research data management practices
   [ResawColl].  The interest in the PWID URI so far indicates that
   this is a recognized issue, and that the PWID URI can fill a gap.

   The PWID URI could function as a URN RFC 2141 [RFC2141], and this
   will be the starting point (a proposal will be sent in December
   2017).  The ambition is to make an easily understandable and
   technology independent persistent identifier, for which the "urn:"
   prefix would be disturbing.
   It is therefore also proposed as a URI, since in time there will be
   a way for it to function as a URI and to enjoy the same common
   syntactic, semantic, and shared language benefits that the URI
   presentation confers.

   It should be noted that for closed web archives, the PWID URI can be
   used to resolve within a closed environment.  Likewise, the PWID URI
   can be resolved within the forthcoming web archive research
   infrastructure, which is currently being proposed in the RESAW
   community [RESAW].

3.  Syntactic Compatibility

   The syntax of the PWID URI scheme is specified below in Augmented
   Backus-Naur Form (ABNF) RFC 5234 [RFC5234], and it conforms to the
   URI syntax defined in RFC 3986 [RFC3986].  The syntax definition of
   the PWID URI is:

   pwid-uri = pwid-scheme ":" pwid-spec
   pwid-scheme = "pwid"
   pwid-spec = archive-id ":" archival-time ":" coverage-spec
               ":" archived-item

   archive-id = 1*( unreserved )

   archival-time = full-date datetime-delim full-pwid-time
   datetime-delim = "_" / "T"
   full-pwid-time = time-hour ["."] time-minute ["."] time-second "Z"

   coverage-spec = "part" / "page" / "subsite" / "site"
                 / "collection" / "recording" / "snapshot"
                 / "other"

   archived-item = URI / archived-item-id
   archived-item-id = 1*( unreserved )

   where

   o  'unreserved' is defined as in RFC 3986 [RFC3986]

   o  'coverage-spec' values are not case sensitive (i.e. "PAGE" /
      "PART" / "PaGe" / ... are valid values as well.)

   o  'archival-time' is a UTC timestamp conforming to the W3C profile
      of ISO 8601 [ISO8601] (also defined in RFC 3339 [RFC3339]), with
      a few exceptions.  It has to be a UTC timestamp in order to
      conform with web archiving practices, which always use UTC in
      order to avoid confusion.  The few exceptions concern
      'datetime-delim' and 'full-pwid-time', including the use of "."
      instead of ":" so as not to collide with the ":" used to delimit
      the parts of the URI.  The 'full-date' is defined as in RFC 3339
      [RFC3339].  The 'archival-time' must represent the time specified
      in the archive, and can therefore be specified at any of the
      levels of granularity described in [W3CDTF], in accordance with
      the WARC standard ISO 28500 [ISO28500].

      The 'datetime-delim' "_" is accepted in order to make the
      timestamp more readable, in the same way as the W3C profile
      accepts " "; "_" is used here so that only characters allowed in
      a URI appear in the URI.  In line with RFC 3339 [RFC3339] the "T"
      may alternatively be lower case "t".

      'time-hour', 'time-minute' and 'time-second' are defined as in
      RFC 3339 [RFC3339].

      In line with RFC 3339 [RFC3339] the "Z" may alternatively be
      lower case "z".

   o  'URI' is defined as in RFC 3986 [RFC3986]

   The 'coverage-spec' defines the type of archived item, specifying
   more precisely what is referred to:

   o  part
      the single archived element, e.g. a PDF, an HTML text, an image

   o  page
      the full context as a page, e.g. an HTML page with referred
      images

   o  subsite
      the full context as a subsite within its domain, e.g. a document
      represented in a web structure

   o  site
      the full context as a site within its domain

   o  collection
      a collection/corpus definition, e.g. defined as described in
      [ResawColl]

   o  snapshot
      a snapshot (image) representation of web material, e.g. a web
      page

   o  recording
      a recording of a web browsing session

   o  other
      anything else

   Note that the 'coverage-spec' is a parameter that could have been
   specified as a query.  However, since the 'pwid-uri' can include a
   URI as 'archived-item', it would introduce ambiguities if the
   'coverage-spec' was specified as a query, since it would not be
   clear whether the query belonged to the 'pwid-uri' or the
   'archived-item'.

4.  Well Defined

   The information in a PWID URI can be used for locating a web archive
   resource, for any kind of web archive.  It includes the minimum
   information for web archive materials that enables resolvability,
   manually or by a resolver.  One of the reasons for defining PWID as
   a URI is to enable a general, technology agnostic, persistent
   representation to be resolvable at any time.

   The information needed is:

   o  Web archive identification
      to find the archive holding the material

   o  Archived URI or identifier of item
      as part of identifying the material

   o  Date and time associated with the archived URI/item
      as part of precise identification of the material

   o  Coverage of what is referred to
      as part of clarification of what the referred material covers
      (page, part etc.)

   For example the PWID URI:

   pwid:archive.org:2016-01-22T11.20.29Z:page:http://www.dr.dk

   has the information:

   o  archive.org
      currently known identifier, in the form of the Internet Archive
      domain name for their open access web archive

   o  2016-01-22T11.20.29Z
      UTC date and time associated with the archived URI

   o  page
      clarification that the reference covers the full web page with
      all its inherited parts selected by the web archive

   o  http://www.dr.dk
      archived URI of item

   With knowledge of the current (2017) Internet Archive open access web
   interface having the form:

   https://web.archive.org/web/