idnits 2.17.1

draft-kunze-ark-08.txt:

  Each of the following lines drew the notice "Line appears to be too long,
  but this could be caused by non-ascii characters in UTF-8 encoding":

  -(62) -(289) -(352) -(399) -(419) -(442) -(444) -(450) -(452) -(806)
  -(840) -(844) -(1021) -(1056) -(1064) -(1077) -(1206) -(1230) -(1401)
  -(1403) -(1404) -(1414) -(1416) -(1419) -(1443) -(1444) -(1445) -(1469)
  -(1485) -(1502) -(1512) -(1561) -(1667) -(1673) -(1676) -(1677) -(1797)

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** Looks like you're using RFC 2026 boilerplate. This must be updated to
     follow RFC 3978/3979, as updated by RFC 4748.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == There are 59 instances of lines with non-ascii characters in the document.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 41
     longer pages, the longest (page 2) being 63 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 42 pages

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section. (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** There are 11 instances of too long lines in the document, the longest one
     being 7 characters in excess of 72.

  == There are 15 instances of lines with non-RFC2606-compliant FQDNs in the
     document.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 575 has weird spacing: '...eful to remem...'

  == Line 795 has weird spacing: '... regexp repla...'

  == Line 1898 has weird spacing: '...for the purpo...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008. If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment. If not, you may need to add the pre-RFC5378 disclaimer.
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (31 July 2004) is 7202 days in the past. Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'MD5' is defined on line 1750, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ARK'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI'

  ** Obsolete normative reference: RFC 822 (ref. 'EMHDRS') (Obsoleted by RFC
     2822)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC'

  ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref.
     'MD5')

  ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC
     3401, RFC 3402, RFC 3403, RFC 3404)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'THUMP'

  ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC
     3986)

  ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref.
     'URNBIB')

  ** Obsolete normative reference: RFC 2141 (ref. 'URNSYN') (Obsoleted by RFC
     8141)

  ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC
     3406)

  Summary: 14 errors (**), 0 flaws (~~), 10 warnings (==), 11 comments (--).

  Run idnits with the --verbose option for more detailed information about
  the items above.

--------------------------------------------------------------------------------

2  Internet-Draft: draft-kunze-ark-08.txt                         J. Kunze
3  ARK Identifier Scheme                 University of California (UCOP)
4  Expires 31 January 2005                              R. P. C. Rodgers
5                                        US National Library of Medicine
6                                                           31 July 2004

8  The ARK Persistent Identifier Scheme

10  (http://www.ietf.org/internet-drafts/draft-kunze-ark-08.txt)

12  Status of this Document

14  This document is an Internet-Draft and is in full conformance with
15  all provisions of Section 10 of RFC2026.

17  Internet-Drafts are working documents of the Internet Engineering
18  Task Force (IETF), its areas, and its working groups. Note that
19  other groups may also distribute working documents as Internet-
20  Drafts.

22  Internet-Drafts are draft documents valid for a maximum of six months
23  and may be updated, replaced, or obsoleted by other documents at any
24  time. It is inappropriate to use Internet-Drafts as reference
25  material or to cite them other than as ``work in progress.''

27  The list of current Internet-Drafts can be accessed at
28  http://www.ietf.org/ietf/1id-abstracts.txt

30  The list of Internet-Draft Shadow Directories can be accessed at
31  http://www.ietf.org/shadow.html.

33  Distribution of this document is unlimited. Please send comments to
34  jak@ucop.edu.

36  Copyright (C) The Internet Society (2004). All Rights Reserved.

38  Abstract

40  The ARK (Archival Resource Key) is a scheme intended to facilitate
41  the persistent naming and retrieval of information objects. It
42  comprises an identifier syntax and three services. An ARK has four
43  components:

45     [http://NMAH/]ark:/NAAN/Name

47  an optional and mutable Name Mapping Authority Hostport part (NMAH,
48  where "hostport" is a hostname followed optionally by a colon and
49  port number), the "ark:" label, the Name Assigning Authority Number
50  (NAAN), and the assigned Name. The NAAN and Name together form the
51  immutable persistent identifier for the object.

53  An ARK request is an ARK with a service request and a question mark
54  appended to it. Use of an ARK request proceeds in two steps. First,
55  the NMAH, if not specified, is discovered based on the NAAN. Two
56  discovery methods are proposed: one is file based, the other based
57  on the DNS NAPTR record. Second, the ARK request is submitted to the
58  NMAH. Three ARK services are defined, gaining access to: (1) the
59  object (or a sensible substitute), (2) a description of the object
60  (metadata), and (3) a description of the commitment made by the NMA
61  regarding the persistence of the object (policy). These services are
62  defined initially to use the HTTP protocol. When the NMAH is speci-
63  fied, the ARK is a valid URL that can gain access to ARK services
64  using an unmodified Web client.

66  1. Introduction

68  This document describes a scheme for the high-quality naming of
69  information resources. The scheme, called the Archival Resource Key
70  (ARK), is well suited to long-term access and identification for any
71  information resources that accommodate reasonably regular electronic
72  description. This includes digital documents, databases, software,
73  and websites, as well as physical objects (such as books, bones, and
74  statues) and intangible objects (chemicals, diseases, vocabulary
75  terms, performances). Hereafter the term "object" refers to an
76  information resource. The term ARK itself refers both to the scheme
77  and to any single identifier that conforms to it. A reasonably
78  concise and accessible overview and rationale for the scheme is
79  available at [ARK].

81  Schemes for persistent identification of network-accessible objects
82  are not new. In the early 1990's, the design of the Uniform Resource
83  Name [URNSYN] responded to the observed failure rate of URLs by
84  articulating an indirect, non-hostname-based naming scheme and the
85  need for responsible name management. Meanwhile, promoters of the
86  Digital Object Identifier [DOI] succeeded in building a community of
87  providers around a mature software system that supports name
88  management. The Persistent Uniform Resource Locator [PURL] was a
89  third scheme that has the unique advantage of working with unmodified
90  web browsers. The ARK scheme is a new approach.

92  A founding principle of the ARK is that persistence is purely a
93  matter of service. Persistence is neither inherent in an object nor
94  conferred on it by a particular naming syntax. Rather, persistence
95  is achieved through a provider's successful stewardship of objects
96  and their identifiers. The highest level of persistence will be
97  reinforced by a provider's robust contingency, redundancy, and
98  succession strategies. It is further safeguarded to the extent that
99  a provider's mission is shielded from marketplace and political
100  instabilities.

102  1.1. Three Reasons to Use ARKs

104  The first requirement of an ARK is to give users a link from an
105  object to a promise of stewardship for it. That promise is a multi-
106  faceted covenant that binds the word of an identified service
107  provider to a specific set of responsibilities. No one can tell if
108  successful stewardship will take place because no one can predict the
109  future. Reasonable conjecture, however, may be based on past
110  performance. There must be a way to tie a promise of persistence to
111  a provider's demonstrated or perceived ability -- its reputation --
112  in that arena. Provider reputations would then rise and fall as
113  promises are observed variously to be kept and broken. This is
114  perhaps the best way we have for gauging the strength of any
115  persistence promise.

117  The second requirement of an ARK is to give users a link from an
118  object to a description of it. The problem with a naked identifier
119  is that without a description real identification is incomplete.
120  Identifiers common today are relatively opaque, though some contain
121  ad hoc clues that reflect fleeting life cycle events such as the
122  address of a short stay in a filesystem hierarchy. Possession of
123  both an identifier and an object is some improvement, but positive
124  identification may still be elusive since the object itself might not
125  include a matching identifier or might not carry evidence obvious
126  enough to reveal its identity without significant research. In
127  either case, what is called for is a record bearing witness to the
128  identifier's association with the object, as supported by a recorded
129  set of object characteristics. This descriptive record is partly an
130  identification "receipt" with which users and archivists can verify
131  an object's identity after brief inspection and a plausible match
132  with recorded characteristics such as title and size.

134  The final requirement of an ARK is to give users a link to the object
135  itself (or to a copy) if at all possible. Persistent access is the
136  central duty of an ARK, with persistent identification playing a
137  vital but supporting role. Object access may not be feasible for
138  various reasons, such as catastrophic loss of the object, a licensing
139  agreement that keeps an archive "dark" for a period of years, or when
140  an object's own lack of tangible existence precludes normal concepts
141  of access (e.g., a vocabulary term might be accessed through its
142  definition). In such cases the ARK's identification role assumes a
143  much higher profile. But attempts to simplify the persistence
144  problem by decoupling access from identification and concentrating
145  exclusively on the latter are of questionable utility. A perfect
146  system for assigning forever unique identifiers might be created, but
147  if it did so without reducing access failure rates, no one would be
148  interested. The central issue -- which may be summed up as the "HTTP
149  404 Not Found" problem -- would not have been addressed.

151  1.2. Organizing Support for ARKs

153  An organization and the user community it serves can often be seen to
154  struggle with two different areas of persistent identification: the
155  Our Stuff problem and the Their Stuff problem. In the Our Stuff
156  problem, the organization wants its "own" objects to acquire
157  persistent names. It possesses or controls these objects, so our
158  organization tackles the Our Stuff problem directly; being the
159  responsible party, it can plan for, maintain, project, and make
160  commitments about the objects.

162  In the Their Stuff problem, the organization wants others' objects to
163  acquire persistent names, in other words, objects that it does not
164  own or control. Some of these objects will be critically important
165  to the organization but beyond its influence as far as persistence
166  support is concerned. As a result, creating and maintaining
167  persistent identifiers for Their Stuff is difficult.

169  Co-location of persistent access and identification services is
170  natural. Any organization that undertakes ongoing support of true
171  persistent identification (which includes description) is well-served
172  if it controls, owns, or otherwise has clear internal access to the
173  identified objects, and this gives it an advantage if it wishes also
174  to support persistent external access. Conversely, persistent
175  external access requires orderly internal collection management and
176  all that that entails including monitoring, acquisition,
177  verification, and change control over objects carrying identifiers
178  persistent enough to support accountable record keeping practices;
179  this covers the major prerequisite for external support of persistent
180  identification. Organizing ARK services under one roof thus tends to
181  make sense.

183  ARK support is not for everybody. By requiring specific, revealed
184  commitments to preservation, object access, and description, the bar
185  for providing ARK services is high. On the other hand, it would be
186  hard to grant credence to a persistence promise from an organization
187  that could not muster the minimum ARK services. Not that there isn't
188  a business model for an ARK-like, description-only service built on
189  top of another organization's full complement of ARK services. For
190  example, there might be competition at the description level for
191  abstracting and indexing a body of scientific literature archived in
192  a combination of open and fee-based repositories. Such a business
193  would benefit more from persistence than it would directly support
194  it.

196  1.3. A Definition of Identifier

198  Heretofore, persistence discussion has been hampered by a borrowed
199  meaning for "identifier" that emerged as a side effect of defining
200  the Uniform Resource Identifier in [URI]:

202     (formerly) An identifier is a sequence of characters with a
203     restricted syntax ... that can act as a reference to something
204     that has identity.

206 The term works in context, but falters when employed for persistence. 207 Troubling phrases arise, such as, 209 "The goal is to create an identifier that does not break." 211 As defined this kind of identifier "breaks" when it sustains damage 212 to its character sequence, but really what breaks has to do with the 213 identifier's reference role. The following definition is proposed. 215 (new definition) An identifier is an association between a 216 string (a sequence of characters) and an information resource. 217 That association is made manifest by a record (e.g., a 218 cataloging or other metadata record) that binds the identifier 219 string to a set of identifying resource characteristics. 221 The identifier (the association) must be vouched for by some sort of 222 record. In the complete absence of any testimony (e.g., metadata) 223 regarding an association, a would-be identifier string is a 224 meaningless sequence of characters. To keep an externally visible 225 but otherwise internal identifier string opaque to outsiders, for 226 example, it suffices for an organization not to disclose the nature 227 of its association. For our immediate purpose, actual existence of 228 an association record is more important than its authenticity. If 229 one is lucky an object carries its own identifier as part of itself 230 (e.g., imprinted on the first page), but in processes such as 231 resource discovery and retrieval the typical object is often unwieldy 232 or unavailable (such as when licensing restrictions are in effect). 233 A metadata record that includes the identifier string is the next 234 best thing -- a conveniently manipulable surrogate that can act as 235 both an association "receipt" and "declaration". 237 It now makes sense to speak of preventing an identifier, as an 238 association, from breaking. 
Having said that, this document still 239 (ab)uses the terms "ARK" and "identifier" as shorthands to refer to 240 identifier strings, in other words, to sequences of characters. Thus 241 a discussion of ARK syntax refers to a string format, not an 242 association format. The context should make the meaning clear. 244 2. ARK Anatomy 246 An ARK is represented by a sequence of characters (a string) that 247 contains the label, "ark:", optionally preceded by the beginning part 248 of a URL. Here is a diagrammed example. 250 http://foobar.zaf.org/ark:/12025/654xz321 251 \___________________/ \__/ \___/ \______/ 252 (replaceable) | | | 253 | ARK Label | Name (assigned by the NAA) 254 | | 255 Name Mapping Authority Name Assigning Authority 256 Hostport (NMAH) Number (NAAN) 258 The ARK syntax can be summarized, 260 [http://NMAH/]ark:/NAAN/Name 262 where the NMAH part is in brackets to indicate that it is temporary, 263 replaceable, and optional. 265 2.1. The Name Mapping Authority Hostport (NMAH) 267 Before the "ark:" label may appear an optional Name Mapping Authority 268 Hostport (NMAH) that is a temporary address where ARK service 269 requests may be sent. It consists of "http://" (or any service 270 specification valid for a URL) followed by an Internet hostname or 271 hostport combination having the same format and semantics as the 272 hostport part of a URL. The most important thing about the NMAH is 273 that it is "identity inert" from the point of view of object 274 identification. In other words, ARKs that differ only in the 275 optional NMAH part identify the same object. Thus, for example, the 276 following three ARKs are synonyms for but one information resource: 278 http://foobar.zaf.org/ark:/12025/654xz321 279 http://sneezy.dopey.com/ark:/12025/654xz321 280 ark:/12025/654xz321 282 The NMAH part makes an ARK into an actionable URL. 
Conversely, any 283 URL whose path component begins with "ark:/" stands a reasonable 284 chance of being an ARK (only because such URLs are not common), but 285 further verification is still required (such as probing the URL for 286 the three ARK services). 288 The NMAH part is temporary, disposable, and replaceable. Over time 289 the NMAH will likely stop working and have to be replaced with a cur� 290 rently active service provider. This relies on a mapping authority 291 discovery process, of which two alternate methods are outlined in a 292 later section. Meanwhile, a carefully chosen NMAH can be as durable 293 as any Internet domain name, and so may last for a decade or longer. 294 Users should be prepared, however, to refresh the NMAH because the 295 one found in the URL form of the ARK may have stopped working. 297 The above method for creating an actionable identifier from a basic 298 ARK (prepending "http://" and an NMAH) is itself temporary. Assuming 299 that the reign of [HTTP] in information retrieval will end one day, 300 ARKs will have to be converted into new kinds of actionable identi� 301 fiers. In any event, if ARKs see widespread use, web browsers would 302 presumably evolve to perform this (currently simple) transformation 303 automatically. 305 2.2. The Name Assigning Authority Number (NAAN) 307 The part of the ARK directly following the "ark:" is the Name 308 Assigning Authority Number (NAAN) enclosed in `/' (slash) characters. 309 This part is always required, as it identifies the organization that 310 originally assigned the Name of the object. It is used to discover a 311 currently valid NMAH and to provide top-level partitioning of the 312 space of all ARKs. NAANs are registered in a manner similar to URN 313 Namespaces, but they are pure numbers consisting of 5 digits or 9 314 digits. Thus, the first 100,000 registered NAAs fit compactly into 315 the 5 digits, and if growth warrants, the next billion fit into the 9 316 digit form. 
In either case the fixed odd number of digits helps 317 reduce the chances of finding a NAAN out of context and confusing it 318 with nearby quantities such as 4-digit dates. 320 2.3. The Name Part 322 The final part of the ARK is the Name assigned by the NAA, and it is 323 also required. The Name is a string of visible ASCII characters and 324 should be less than 128 bytes in length. The length restriction 325 keeps the ARK short enough to append ordinary ARK request strings 326 without running into transport restrictions within HTTP GET requests. 327 Characters may be letters, digits, or any of these six characters: 329 = @ $ _ * + # 331 The following characters may also be used, but in limited ways: 333 / . - % 335 The characters `/' and `.' are ignored if either appears as the last 336 character of an ARK. If used internally, they allow a name assigning 337 authority to reveal object hierarchy and object variants as described 338 in the next two sections. 340 A `-' (hyphen) may appear in an ARK, but must be ignored in lexical 341 comparisons. The `%' character is reserved for %-encoding all other 342 octets that would appear in the ARK string, in the same manner as for 343 URIs [URI]. A %-encoded octet consists of a `%' followed by two hex 344 digits; for example, "%7d" stands in for `}'. Lower case hex digits 345 are preferred to reduce the chances of false acronym recognition; 346 thus it is better to use "%acT" instead of "%ACT". The character `%' 347 itself must be represented using "%25". As with URNs, %-encoding 348 permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) 349 that have less restricted character repertoires [URNBIB]. 351 The creation of names that include linguistically based constructs 352 (having recognizable meaning from natural language) is strongly dis� 353 couraged if long-term persistence is a naming priority. Such names 354 do not age or travel well. 
Names that look more or less like numbers 355 avoid common problems that defeat persistence and international 356 acceptance. The use of digits is highly recommended. Mixing in non- 357 vowel alphabetic characters is a relatively safe and easy way to 358 achieve more compact names, although any character repertoire can 359 work if potentially troublesome names will be discarded during a 360 screening process. More on naming considerations is given in a later 361 section. 363 2.3.1. Names that Reveal Object Hierarchy 365 A name assigning authority may choose to reveal the presence of a 366 hierarchical relationship between objects using the `/' (slash) 367 character in the Name part of an ARK. If the Name contains an 368 internal slash, the piece to its left indicates a containing object. 369 For example, publishing an ARK of the form, 371 ark:/12025/654/xz/321 373 is equivalent to publishing three ARKs, 375 ark:/12025/654/xz/321 376 ark:/12025/654/xz 377 ark:/12025/654 379 together with a declaration that the first object is contained in the 380 second object, and that the second object is contained in the third. 382 Revealing the presence of hierarchy is completely up to the assigning 383 authority. It is hard enough to commit to one object's name, let 384 alone to three objects' names and to a specific, ongoing relatedness 385 among them. Thus, regardless of whether hierarchy was present ini� 386 tially, the assigning authority, by not using slashes, reveals no 387 shared inferences about hierarchical or other inter-relatedness in 388 the following ARKs: 390 ark:/12025/654_xz_321 391 ark:/12025/654_xz 392 ark:/12025/654xz321 393 ark:/12025/654xz 394 ark:/12025/654 396 Note that slashes around the ARK's NAAN (/12025/ in these examples) 397 are not part of the ARK's Name and therefore do not indicate the 398 existence of some sort of NAAN super object containing all objects in 399 its namespace. 
A slash must have at least one non-structural charac� 400 ter (one that is neither a slash nor a period) on both sides in order 401 for it to separate recognizable structural components. So initial or 402 final slashes may be removed, and double slashes may be converted 403 into single slashes. 405 2.3.2. Names that Reveal Object Variants 407 A name assigning authority may choose to reveal the possible presence 408 of variant objects using the `.' (period) character in the Name part 409 of an ARK. If the Name contains an internal period, the piece to its 410 left is a base name and the piece to its right, and up to the end of 411 the ARK or to the next period is a suffix. A Name may have more than 412 one suffix, for example, 414 ark:/12025/654.24 415 ark:/12025/xz4/654.24 416 ark:/12025/654.f55.g78.v20 418 There are two main rules. First, if two ARKs share the same base 419 name but have different suffixes, the corresponding objects were con� 420 sidered variants of each other (different formats, languages, ver� 421 sions, etc.) by the assigning authority. Thus, the following ARKs 422 are variants of each other: 424 ark:/12025/654.f55.g78.v20 425 ark:/12025/654.321xz 426 ark:/12025/654.44 428 Second, publishing an ARK with a suffix implies the existence of at 429 least one variant identified by the ARK without its suffix. The ARK 430 otherwise permits no further assumptions about what variants might 431 exist. So publishing the ARK, 433 ark:/12025/654.f55.g78.v20 435 is equivalent to publishing the four ARKs, 437 ark:/12025/654.f55.g78.v20 438 ark:/12025/654.f55.g78 439 ark:/12025/654.f55 440 ark:/12025/654 442 Revealing the possibility of variants is completely up to the assign� 443 ing authority. It is hard enough to commit to one object's name, let 444 alone to multiple variants' names and to a specific, ongoing related� 445 ness among them. 
The assigning authority is the sole arbiter of what 446 constitutes a variant within its namespace, and whether to reveal 447 that kind of relatedness by using periods within its names. 449 A period must have at least one non-structural character (one that is 450 neither a slash nor a period) on both sides in order for it to sepa� 451 rate recognizable structural components. So initial or final periods 452 may be removed, and double periods may be converted into single peri� 453 ods. Multiple suffixes should be arranged in sorted order (pure 454 ASCII collating sequence) at the end of an ARK. 456 2.3.3. Hyphens are Ignored 458 Hyphens are always ignored in ARKs. Hyphens may be added to an ARK's 459 Name part for readability, or during the formatting and wrapping of 460 text lines, but (as in phone numbers) they are treated as if they 461 were not present. Thus, like the NMAH, hyphens are "identity inert" 462 in comparing ARKs for equivalence. For example, the following ARKs 463 are equivalent for purposes of comparison and ARK service access: 465 ark:/12025/65-4-xz-321 466 ark:sneezy.dopey.com/12025/654--xz32-1 467 ark:/12025/654xz321 469 2.4. Normalization and Lexical Equivalence 471 To determine if two or more ARKs identify the same object, the ARKs 472 are compared for lexical equivalence after first being normalized. 473 Since ARK strings may appear in various forms (e.g., having different 474 NMAHs), normalizing them minimizes the chances that comparing two ARK 475 strings for equality will fail unless they actually identify 476 different objects. In a specified-host ARK (one having an NMAH), the 477 NMAH never participates in such comparisons. 479 Normalization of an ARK for the purpose of octet-by-octet equality 480 comparison with another ARK consists of four steps. First, any upper 481 case letters in the "ark:" label and the two characters following a 482 `%' are converted to lower case. 
   The case of all other letters in the ARK string must be preserved.
   Second, any NMAH part is removed (everything from an initial
   "http://" up to the next slash) and all hyphens are removed.

   Third, structural characters (slash and period) are normalized.
   Initial and final occurrences are removed, and two structural
   characters in a row (e.g., // or ./) are replaced by the first
   character, iterating until each occurrence has at least one
   non-structural character on either side. Finally, if there are any
   components with a period on the left and a slash on the right, either
   the component and the preceding period must be moved to the end of
   the Name part or the ARK must be thrown out as malformed.

   The fourth and final step is to arrange the suffixes in ASCII
   collating sequence (that is, to sort them) and to remove duplicate
   suffixes, if any. It is also permissible to throw out ARKs for which
   the suffixes are not sorted.

   The resulting ARK string is now normalized. Comparisons between
   normalized ARKs are case-sensitive, meaning that upper case letters
   are considered different from their lower case counterparts.

   To keep ARK string variation to a minimum, no reserved ARK characters
   should be %-encoded unless it is deliberately to conceal their
   reserved meanings. No non-reserved ARK characters should ever be
   %-encoded. Finally, no %-encoded character should ever appear in an
   ARK in its decoded form.

2.5. Naming Considerations

   The ARK has different goals from the URI, so it has different
   character set requirements. Because linguistic constructs imperil
   persistence, for ARKs non-ASCII character support is unimportant.
   ARKs and URIs share goals of transcribability and transportability
   within web documents, so characters are required to be visible,
   non-conflicting with HTML/XML syntax, and not subject to tampering
   during transmission across common transport gateways. Add the goal
   of making an undelimited ARK recognizable in running prose, as in
   ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma,
   period) end up being excluded from the ARK lest the end of a phrase
   or sentence be mistaken for part of the ARK.

   A valuable technique for provision of persistent objects is to try to
   arrange for the complete identifier to appear on, with, or near its
   retrieved object. An object encountered at a moment in time when its
   discovery context has long since disappeared could then easily be
   traced back to its metadata, to alternate versions, to updates, etc.
   This has seen reasonable success, for example, in book publishing and
   software distribution.

   If persistence is the goal, a deliberate local strategy for
   systematic name assignment is crucial. Names must be chosen with
   great care. Poorly chosen and managed names will devastate any
   persistence strategy, and they do not discriminate based on naming
   scheme. Whether a mistakenly re-assigned identifier is a URN, DOI,
   PURL, URL, or ARK, the damage -- failed access and confusion -- is
   not mitigated more in one scheme than in another. Conversely,
   in-house efforts to manage names responsibly will go much further
   towards safeguarding persistence than any choice of naming scheme or
   name resolution technology.

   Hostnames appearing in any identifier meant to be persistent must be
   chosen with extra care.
   The tendency in hostname selection has traditionally been to choose a
   token with recognizable attributes, such as a corporate brand, but
   that tendency wreaks havoc with persistence that is supposed to
   outlive brands, corporations, subject classifications, and natural
   language semantics (e.g., what did the three letters "gay" mean in
   1958, 1978, and 1998?). Today's recognized and correct attributes
   are tomorrow's stale or incorrect attributes. In making hostnames
   (any names, actually) long-term persistent, it helps to eliminate
   recognizable attributes to the extent possible. This affects
   selection of any name based on URLs, including PURLs and the
   explicitly disposable NMAHs. There is no excuse for a provider that
   manages its internal names impeccably not to exercise the same care
   in choosing what could be an exceptionally durable hostname,
   especially if it would form the prefix for all the provider's
   URL-based external names. Registering an opaque hostname in the
   ".org" or ".net" domain would not be a bad start.

   Dubious persistence speculation does not make selecting naming
   strategies any easier. For example, despite rumors to the contrary,
   there are really no obvious reasons why the organizations registering
   DNS names, URN Namespaces, and DOI publisher IDs should have among
   them one that is intrinsically more fallible than the next.
   Moreover, it is a misconception that the demise of DNS and of HTTP
   need adversely affect the persistence of URLs. At such a time,
   certainly URLs from the present day might not then be actionable by
   our present-day mechanisms, but resolution systems for future
   non-actionable URLs are no harder to imagine than resolution systems
   for present-day non-actionable URNs and DOIs. There is no more
   stable a namespace than one that is dead and frozen, and that would
   then characterize the space of names bearing the "http://" prefix.
   It is useful to remember that just because hostnames have been
   carelessly chosen in their brief history does not mean that they are
   unsuitable in NMAHs (and URLs) intended for use in situations
   demanding the highest level of persistence available in the Internet
   environment. A well-planned name assignment strategy is everything.

3. Assigners of ARKs

   A Name Assigning Authority (NAA) is an organization that creates (or
   delegates creation of) long-term associations between identifiers and
   information objects. Examples of NAAs include national libraries,
   national archives, and publishers. An NAA may arrange with an
   external organization for identifier assignment. The US Library of
   Congress, for example, allows OCLC (the Online Computer Library
   Center, a major world cataloger of books) to create associations
   between Library of Congress call numbers (LCCNs) and the books that
   OCLC processes. A cataloging record is generated that testifies to
   each association, and the identifier is included by the publisher,
   for example, in the front matter of a book.

   An NAA does not so much create an identifier as create an
   association. The NAA first draws an unused identifier string from
   its namespace, which is the set of all identifiers under its control.
   It then records the assignment of the identifier to an information
   object having sundry witnessed characteristics, such as a particular
   author and modification date. A namespace is usually reserved for an
   NAA by agreement with recognized community organizations (such as
   IANA and ISO) that all names containing a particular string be under
   its control. In the ARK an NAA is represented by the Name Assigning
   Authority Number (NAAN).

   The ARK namespace reserved for an NAA is the set of names bearing its
   particular NAAN.
   For example, all strings beginning with "ark:/12025/" are under
   control of the NAA registered under 12025, which might be the
   National Library of Finland. Because each NAA has a different NAAN,
   names from one namespace cannot conflict with those from another.
   Each NAA is free to assign names from its namespace (or delegate
   assignment) according to its own policies. These policies must be
   documented in a manner similar to the declarations required for URN
   Namespace registration [URNNID].

   For now, registration of ARK NAAs is in a bootstrapping phase. To
   register, please read about the mapping authority discovery file in
   the next section and send email to ark@cdlib.org.

4. Finding a Name Mapping Authority

   In order to derive an actionable identifier (these days, a URL) from
   an ARK, a hostport (hostname or hostname plus port combination) for a
   working Name Mapping Authority (NMA) must be found. An NMA is a
   service that is able to respond to the three basic ARK service
   requests. Relying on registration and client-side discovery, NMAs
   make known which NAAs' identifiers they are willing to service.

   Upon encountering an ARK, a user (or client software) looks inside it
   for the optional NMAH part (the hostport of the NMA's ARK service).
   If it contains an NMAH that is working, this NMAH discovery step may
   be skipped; the NMAH effectively uses the beginning of an ARK to
   cache the results of a prior mapping authority discovery process. If
   a new NMAH needs to be found, the client looks inside the ARK again
   for the NAAN (Name Assigning Authority Number). Querying a global
   database, it then uses the NAAN to look up all current NMAHs that
   service ARKs issued by the identified NAA. The global database is
   key, and two specific methods for querying it are given in this
   section.
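
   The inspection step described above (look for an NMAH, then fall
   back to the NAAN) can be sketched in Python. This is only an
   illustration under stated assumptions: the function name split_ark,
   the ArkParts type, and the regular expression are hypothetical and
   not part of the ARK specification, and the sketch assumes the shape
   [http://NMAH/]ark:/NAAN/Name.

```python
import re
from typing import NamedTuple, Optional


class ArkParts(NamedTuple):
    nmah: Optional[str]  # hostport of a Name Mapping Authority, if present
    naan: str            # Name Assigning Authority Number
    name: str            # Name part assigned by the NAA


# Assumed shape: [http://NMAH/]ark:/NAAN/Name  (illustrative only)
_ARK_RE = re.compile(
    r"^(?:http://(?P<nmah>[^/]+)/)?"  # optional NMAH (hostport)
    r"ark:/(?P<naan>[^/]+)/"          # "ark:" label followed by the NAAN
    r"(?P<name>.+)$",                 # everything else is the Name
    re.IGNORECASE,
)


def split_ark(ark: str) -> ArkParts:
    """Split an ARK string into its optional NMAH, its NAAN, and its Name."""
    match = _ARK_RE.match(ark.strip())
    if match is None:
        raise ValueError(f"not a recognizable ARK: {ark!r}")
    return ArkParts(match.group("nmah"), match.group("naan"), match.group("name"))
```

   A client would try the returned nmah first, if there is one, and
   use naan as the lookup key only when a new NMAH has to be found.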

   In the interests of long-term persistence, however, ARK mechanisms
   are first defined in high-level, protocol-independent terms so that
   mechanisms may evolve and be replaced over time without compromising
   fundamental service objectives. Either or both specific methods
   given here may eventually be supplanted by better methods since, by
   design, the ARK scheme does not depend on a particular method, but
   only on having some method to locate an active NMAH.

   At the time of issuance, at least one NMAH for an ARK should be
   prepared to service it. That NMA may or may not be administered by
   the Name Assigning Authority (NAA) that created it. Consider the
   following hypothetical example of providing long-term access to a
   cancer research journal. The publisher wishes to turn a profit and
   the National Library of Medicine wishes to preserve the scholarly
   record. An agreement might be struck whereby the publisher would act
   as the NAA and the national library would archive the journal issue
   when it appears, but without providing direct access for the first
   six months. During the first six months of peak commercial
   viability, the publisher would retain exclusive delivery rights and
   would charge access fees. Again, by agreement, both the library and
   the publisher would act as NMAs, but during that initial period the
   library would redirect requests for issues less than six months old
   to the publisher. At the end of the waiting period, the library
   would then begin servicing requests for issues older than six months
   by tapping directly into its own archives. Meanwhile, the publisher
   might routinely redirect incoming requests for older issues to the
   library. Long-term access is thereby preserved, and so is the
   commercial incentive to publish content.

   There is never a requirement that an NAA also run an NMA service,
   although it seems not an unlikely scenario.
   Over time NAAs and NMAs would come and go. One NMA would succeed
   another, and there might be many NMAs serving the same ARKs
   simultaneously (e.g., as mirrors or as competitors). There might
   also be asymmetric but coordinated NMAs as in the library-publisher
   example above.

4.1. Looking Up NMAHs in a Globally Accessible File

   This subsection describes a way to look up NMAHs using a simple text
   file. For efficient access the file may be stored in a local
   filesystem, but it needs to be reloaded periodically to incorporate
   updates. It is not expected that the size of the file or frequency
   of update should impose an undue maintenance or searching burden any
   time soon, for even primitive linear search of a file with ten
   thousand NAAs is a subsecond operation on modern server machines.
   The proposed file strategy is similar to the /etc/hosts file strategy
   that supported Internet host address lookup for a period of years
   before the advent of the Domain Name System [DNS].

   A copy of the current file (at the time of writing) appears in an
   appendix and is available on the web. A minimal version of the file
   appears below. Comment lines (lines that begin with `#') explain the
   format and give the file's modification time, reloading address, and
   NAA registration instructions. There is even a Perl script that
   processes the file embedded in the file's comments. Because this is
   still a proposed file, none of the values in it are real.
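
   Ahead of the listing itself, here is a hedged Python sketch of how a
   client might load such a file into memory, in the same spirit as the
   Perl script embedded in the file's comments. The function name
   parse_natab, the dictionary layout, and the SAMPLE text are
   assumptions made for illustration; the sketch relies only on the
   layout the file's own comments describe (an NAA Number, a colon, and
   a policy URL at the start of a line, followed by indented NMA
   hostport lines).

```python
from typing import Dict, List, Optional


def parse_natab(text: str) -> Dict[str, dict]:
    """Map each NAA Number to its policy URL and list of NMA hostports."""
    table: Dict[str, dict] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.lstrip().startswith("#"):
            continue                      # skip blank and comment lines
        if line[0] in " \t":              # indented line: an NMA hostport
            if current is None:
                raise ValueError("NMA line appears before any NAA line")
            table[current]["nmahs"].append(line.split()[0])
        else:                             # NAA line: "NAAN: policy"
            naan, sep, policy = line.partition(":")
            if not sep:
                raise ValueError(f"malformed NAA line: {line!r}")
            current = naan.strip()
            table[current] = {"policy": policy.strip(), "nmahs": []}
    return table


# Hypothetical excerpt; like the proposed file, none of the values are real.
SAMPLE = """\
# National Library of Medicine
12025: http://www.nlm.nih.gov/xxx/naapolicy.html
\tark.nlm.nih.gov\tUSNLM
\tfoobar.zaf.org\tUCSF
# Library of Congress
12026: http://www.loc.gov/xxx/naapolicy.html
\tfoobar.zaf.org\tUSLC
"""

if __name__ == "__main__":
    for hostport in parse_natab(SAMPLE)["12025"]["nmahs"]:
        print(hostport)
```

   Keeping the whole table in a dictionary is a design choice of this
   sketch; a linear scan per lookup, as the text notes, would also be
   adequate at the expected file sizes.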

   #
   # Name Assigning Authority / Name Mapping Authority Lookup Table
   #     Last change:  2 June 2004
   #     Reload from:  http://ark.nlm.nih.gov/etc/natab
   #     Mirrored at:  http://ark.cdlib.org/natab
   #     To register:  mailto:ark@cdlib.org?Subject=naareg
   #     Process with: Perl script at end of this file (optional)
   #
   # Each NAA appears at the beginning of a line with the NAA Number
   # first, a colon, and an ARK or URL to a statement of naming policy
   # (see http://ark.cdlib.org for an example).
   # All the NMA hostports that service an NAA are listed, one per
   # line, indented, after the corresponding NAA line.
   #
   # National Library of Medicine
   12025: http://www.nlm.nih.gov/xxx/naapolicy.html
           ark.nlm.nih.gov         USNLM
           foobar.zaf.org          UCSF
           sneezy.dopey.com        BIREME
   #
   # Library of Congress
   12026: http://www.loc.gov/xxx/naapolicy.html
           foobar.zaf.org          USLC
           sneezy.dopey.com        USLC
   #
   # National Agriculture Library
   12027: http://www.nal.gov/xxx/naapolicy.html
           foobar.zaf.gov:80       USNAL
   #
   # California Digital Library
   13030: http://www.cdlib.org/inside/diglib/ark/
           ark.cdlib.org           CDL
   #
   # World Intellectual Property Organization
   13038: http://www.wipo.int/xxx/naapolicy.html
           www.wipo.int            WIPO
   #
   # University of California San Diego
   20775: http://library.ucsd.edu/xxx/naapolicy.html
           ucsd.edu                UCSD
   #
   # University of California San Francisco
   29114: http://library.ucsf.edu/xxx/naapolicy.html
           ucsf.edu                UCSF
   #
   # University of California Berkeley
   28722: http://library.berkeley.edu/xxx/naapolicy.html
           berkeley.edu            UCB
   #
   # Rutgers University Libraries
   15230: http://rci.rutgers.edu/xxx/naapolicy.html
           rutgers.edu             RUL
   #
   #--- end of data ---
   # The following Perl script takes an NAA as argument and outputs
   # the NMAs in this file listed under any matching NAA.
   #
   #  my $naa = shift;
   #  while (<>) {
   #      next if (! /^$naa:/);
   #      while (<>) {
   #          last if (! /^[#\s]./);
   #          print "$1\n" if (/^\s+(\S+)/);
   #      }
   #  }
   #
   # Create a g/t/nroff-safe version of this table with the UNIX command,
   #
   #  expand natab | sed 's/\\/\\\e/g' > natab.roff
   #
   # end of file

4.2. Looking up NMAHs Distributed via DNS

   This subsection introduces a method for looking up NMAHs that is
   based on the method for discovering URN resolvers described in
   [NAPTR]. It relies on querying the DNS system already installed in
   the background infrastructure of most networked computers. A query
   is submitted to DNS asking for a list of resolvers that match a given
   NAAN. DNS distributes the query to the particular DNS servers that
   can best provide the answer, unless the answer can be found more
   quickly in a local DNS cache as a side-effect of a recent query.
   Responses come back inside Name Authority Pointer (NAPTR) records.
   The normal result is one or more candidate NMAHs.

   In its full generality the [NAPTR] algorithm ambitiously accommodates
   a complex set of preferences, orderings, protocols, mapping services,
   regular expression rewriting rules, and DNS record types. This
   subsection proposes a drastic simplification of it for the special
   case of ARK mapping authority discovery. The simplified algorithm is
   called Maptr. It uses only one DNS record type (NAPTR) and restricts
   most of its field values to constants. The following hypothetical
   excerpt from a DNS data file for the NAAN known as 12026 shows three
   example NAPTR records ready to use with the Maptr algorithm.

      12026.ark.arpa.
      ;; US Library of Congress
      ;;       order pref flags service regexp replacement
      IN NAPTR  0     0   "h"   "ark"   "USLC" lhc.nlm.nih.gov:8080
      IN NAPTR  0     0   "h"   "ark"   "USLC" foobar.zaf.org
      IN NAPTR  0     0   "h"   "ark"   "USLC" sneezy.dopey.com

   All the fields are held constant for Maptr except for the "flags",
   "regexp", and "replacement" fields. The "service" field contains the
   constant value "ark" so that NAPTR records participating in the Maptr
   algorithm will not be confused with other NAPTR records. The "order"
   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
   the algorithm may evolve to use these fields for ranking decisions
   when usage patterns and local administrative needs are better
   understood.

   When a Maptr query returns a record with a flags field of "h" (for
   hostport, a Maptr extension to the NAPTR flags), the replacement
   field contains the NMAH (hostport) of an ARK service provider. When
   a query returns a record with a flags field of "" (the empty string),
   the client needs to submit a new query containing the domain name
   found in the replacement field. This second sort of record exploits
   the distributed nature of DNS by redirecting the query to another
   domain name. It looks like this.

      12345.ark.arpa.
      ;; Digital Library Consortium
      ;;       order pref flags service regexp replacement
      IN NAPTR  0     0   ""    "ark"   ""     dlc.spct.org.

   Here is the Maptr algorithm for ARK mapping authority discovery. In
   it replace <NAAN> with the NAAN from the ARK for which an NMAH is
   sought.

   (1) Initialize the DNS query: type=NAPTR,
       query=<NAAN>.ark.arpa.

   (2) Submit the query to DNS and retrieve (NAPTR) records,
       discarding any record that does not have "ark" for the service
       field.

   (3) All remaining records with a flags field of "h" contain
       candidate NMAHs in their replacement fields. Set them aside, if
       any.

   (4) Any record with an empty flags field ("") has a replacement
       field containing a new domain name to which a subsequent query
       should be redirected. For each such record, set
       query=<replacement> then go to step (2). When all such records
       have been recursively exhausted, go to step (5).

   (5) All redirected queries have been resolved and a set of
       candidate NMAHs has been accumulated from step (3). If there are
       zero NMAHs, exit -- no mapping authority was found. If there is
       one or more NMAH, choose one using any criteria you wish, then
       exit.

   A Perl script that implements this algorithm is included here.

      #!/depot/bin/perl

      use Net::DNS;                       # include simple DNS package
      my $qtype = "NAPTR";                # initialize query type
      my $naa = shift;                    # get NAAN script argument
      my $mad = new Net::DNS::Resolver;   # mapping authority discovery

      &maptr("$naa.ark.arpa");            # call maptr - that's it

      sub maptr {                         # recursive maptr algorithm
          my $dname = shift;              # domain name as argument
          my ($rr, $flags, $replacement);
          my $query = $mad->query($dname, $qtype);
          return                          # non-productive query
              if (! $query || ! $query->answer);
          foreach $rr ($query->answer) {
              next                        # skip records of wrong type
                  if ($rr->type ne $qtype);
              next                        # step (2): keep "ark" service only
                  if ($rr->service ne "ark");
              # read fields via accessors; rdatastr quotes its strings
              $flags = $rr->flags;
              $replacement = $rr->replacement;
              if ($flags eq "") {
                  &maptr($replacement);   # recurse
              } elsif ($flags eq "h") {
                  print "$replacement\n"; # candidate NMAH
              }
          }
      }

   The global database thus distributed via DNS and the Maptr algorithm
   can easily be seen to mirror the contents of the Name Authority Table
   file described in the previous section.

5. Generic ARK Service Definition

   An ARK request's output is delivered information; examples include
   the object itself, a policy declaration (e.g., a promise of support),
   a descriptive metadata record, or an error message.
   ARK services must be couched in high-level, protocol-independent
   terms if persistence is to outlive today's networking infrastructural
   assumptions. The high-level ARK service definitions listed below are
   followed in the next section by a concrete method (one of many
   possible methods) for delivering these services with today's
   technology.

5.1. Generic ARK Access Service (access, location)

   Returns (a copy of) the object or a redirect to the same, although a
   sensible object proxy may be substituted. Examples of sensible
   substitutes include,

   - a table of contents instead of a large complex document,
   - a home page instead of an entire web site hierarchy,
   - a rights clearance challenge before accessing protected data,
   - directions for access to an offline object (e.g., a book),
   - a description of an intangible object (a disease, an event), or
   - an applet acting as "player" for a large multimedia object.

   May also return a discriminated list of alternate object locators.
   If access is denied, returns an explanation of the object's current
   (perhaps permanent) inaccessibility.

5.2. Generic Policy Service (permanence, naming, etc.)

   Returns declarations of policy and support commitments for given
   ARKs. Declarations are returned in either a structured metadata
   format or a human readable text format; sometimes one format may
   serve both purposes. Policy subareas may be addressed in separate
   requests, but the following areas should be covered: object
   permanence, object naming, object fragment addressing, and
   operational service support.

   The permanence declaration for an object is a rating defined with
   respect to an identified permanence provider (guarantor), and may
   include the following aspects. One permanence rating framework is
   given in [NLMPerm].

   (a) "object availability" -- whether and how access to the
       object is supported (e.g., online 24x7, or offline only),

   (b) "identifier validity" -- under what conditions the
       identifier will be or has been re-assigned,

   (c) "content invariance" -- under what conditions the content of
       the object is subject to change, and

   (d) "change history" -- documentation, whether abbreviated or
       detailed, of any or all corrections, migrations, revisions, etc.

   Naming policy for an object includes an historical description of the
   NAA's (and its successor NAA's) policies regarding differentiation of
   objects. It may include the following aspects.

   (e) "similarity" -- (or "unity") the limit, defined by the NAA,
       to the level of dissimilarity beyond which two similar objects
       warrant separate identifiers but before which they share one
       single identifier, and

   (f) "granularity" -- the limit, defined by the NAA, to the level
       of object subdivision beyond which sub-objects do not warrant
       separately assigned identifiers but before which sub-objects are
       assigned separate identifiers.

   Addressing policy for an object includes a description of how, during
   access, object components (e.g., paragraphs, sections) or views
   (e.g., image conversions) may or may not be "addressed", in other
   words, how the NMA permits arguments or parameters to modify the
   object delivered as the result of an ARK request. If supported,
   these sorts of operations would provide things like byte-ranged
   fragment delivery and open-ended format conversions, or any set of
   possible transformations that would be too numerous to list or to
   identify with separately assigned ARKs.

   Operational service support policy includes a description of general
   operational aspects of the NMA service, such as after-hours staffing
   and trouble reporting procedures.

5.3. Generic Description Service

   Returns a description of the object. Descriptions are returned in
   either a structured metadata format or a human readable text format;
   sometimes one format may serve both purposes. A description must at
   a minimum answer the who, what, when, and where questions concerning
   an expression of the object. Standalone descriptions should be
   accompanied by the modification date and source of the description
   itself. May also return discriminated lists of ARKs that are related
   to the given ARK.

6. Overview of the Tiny HTTP URL Mapping Protocol (THUMP)

   The Tiny HTTP URL Mapping Protocol (THUMP) is a way of taking a key
   (a kind of identifier) and asking such questions as, what information
   does this identify and how permanent is it? [THUMP] is in fact one
   specific method under development for delivering ARK services. The
   protocol runs over HTTP to exploit the web browser's current
   pre-eminence as user interface to the Internet. THUMP is designed so
   that a person can enter ARK requests directly into the location field
   of current browser interfaces. Because it runs over HTTP, THUMP can
   be simulated and tested within keyboard-based [TELNET] sessions.

   The asker (a person or client program) starts with an identifier,
   such as an ARK or a URL. The identifier reveals to the asker (or
   allows the asker to infer) the Internet host name and port number of
   a server system that responds to questions. Here, this is just the
   NMAH that is obtained by inspection and possibly lookup based on the
   ARK's NAAN. The asker then sets up an HTTP session with the server
   system, sends a question via a THUMP request (contained within an
   HTTP request), receives an answer via a THUMP response (contained
   within an HTTP response), and closes the session. That concludes the
   connected portion of the protocol.

   A THUMP request is a string of characters beginning with a `?'
   (question mark) that is appended to the identifier string. The
   resulting string is sent as an argument to HTTP's GET command.
   Request strings too long for GET may be sent using HTTP's POST
   command. The three most common requests correspond to three
   degenerate special cases that keep the user's learning and typing
   burden low. First, a simple key with no request at all is the same
   as an ordinary access request. Thus a plain ARK entered into a
   browser's location field behaves much like a plain URL, and returns
   access to the primary identified object, for instance, an HTML
   document.

   The second special case is a minimal ARK description request string
   consisting of just "?". For example, entering the string,

      ark.nlm.nih.gov/12025/psbbantu?

   into the browser's location field directly precipitates a request for
   a metadata record describing the object identified by
   ark:/12025/psbbantu. The browser, unaware of THUMP, prepares and
   sends an HTTP GET request in the same manner as for a URL. THUMP is
   designed so that the response (indicated by the returned HTTP content
   type) is normally displayed, whether the output is structured for
   machine processing (text/plain) or formatted for human consumption
   (text/html).

   In the following example THUMP session, each line has been annotated
   to include a line number and whether it was the client or server that
   sent it. Without going into much depth, the session has four pieces
   separated from each other by blank lines: the client's piece (lines
   1-3), the server's HTTP/THUMP response headers (4-7), and the body of
   the server's response (8-17). The first and last lines (1 and 17)
   correspond to the client's steps to start the TCP session and the
   server's steps to end it, respectively.

       1  C: [opens session]
          C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
          C:
          S: HTTP/1.1 200 OK
       5  S: Content-Type: text/plain
          S: THUMP-Status: 0.1 200 OK
          S:
          S: |set: NLM | 12025/psbbantu? | 20030731
          S:      | http://ark.nlm.nih.gov/ark:/12025/psbbantu?
      10  S: here: 1 | 1 | 1
          S:
          S: erc:
          S: who:    Lederberg, Joshua
          S: what:   Studies of Human Families for Genetic Linkage
      15  S: when:   1974
          S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
          S: [closes session]

   The first two server response lines (4-5) above are typical of HTTP.
   The next line (6) is peculiar to THUMP, and indicates the THUMP
   version and a normal return status. The balance of the response
   consists of a record set header (lines 8-10) and a single metadata
   record (12-16) that comprises the ARK description service response.
   The record set header identifies (8-9) who created the set, what its
   title is, when it was created, and where an automated process can
   access the set; it ends in a line (10) whose respective sub-elements
   indicate that here in this communication the recipient can expect to
   find 1 record, starting at the record numbered 1, from a set
   consisting of a total of 1 record (i.e., here is the entire set,
   consisting of exactly one record).

   The returned record (12-16) is in the format of an Electronic
   Resource Citation [ERC], which is discussed in more detail in the
   next section. For now, note that it contains four elements that
   answer the top priority questions regarding an expression of the
   object: who played a major role in expressing it, what the
   expression was called, when it was created, and where the expression
   may be found. This quartet of elements comes up again and again in
   ERCs.
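
   The request and response pieces of the session above can be handled
   with very little code. The following Python sketch is illustrative
   only: the helper names thump_url, parse_thump_status, and parse_here
   are hypothetical, and the field layouts are inferred solely from the
   example session (a request string appended to the ARK behind the
   NMAH, a THUMP-Status header carrying version, code, and reason, and
   a "here:" line carrying three counts separated by `|').

```python
from typing import Tuple


def thump_url(nmah: str, ark: str, request: str = "") -> str:
    """Build the URL for a THUMP request; "" means plain access, "?" description."""
    if request and not request.startswith("?"):
        raise ValueError("a THUMP request string begins with '?'")
    return f"http://{nmah}/{ark}{request}"


def parse_thump_status(header: str) -> Tuple[str, int, str]:
    """Split 'THUMP-Status: 0.1 200 OK' into (version, code, reason)."""
    name, _, value = header.partition(":")
    if name.strip().lower() != "thump-status":
        raise ValueError(f"not a THUMP-Status header: {header!r}")
    version, code, reason = value.strip().split(None, 2)
    return version, int(code), reason


def parse_here(line: str) -> Tuple[int, int, int]:
    """Split 'here: 1 | 1 | 1' into (records returned, first record, set total)."""
    label, _, value = line.partition(":")
    if label.strip() != "here":
        raise ValueError(f"not a 'here' line: {line!r}")
    returned, start, total = (int(field) for field in value.split("|"))
    return returned, start, total
```

   For instance, thump_url("ark.nlm.nih.gov", "ark:/12025/psbbantu",
   "?") reproduces the URL sent on line 2 of the session.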

   The third degenerate special case of an ARK request (and no other
   cases will be described in this document) is the string "??",
   corresponding to a minimal permanence policy request. It can be seen
   in use appended to an ARK (on line 2) in the example session that
   follows.

       1  C: [opens session]
          C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
          C:
          S: HTTP/1.1 200 OK
       5  S: Content-Type: text/plain
          S: THUMP-Status: 0.1 200 OK
          S:
          S: |set: NLM | 12025/psbbantu?? | 20030731
          S:      | http://ark.nlm.nih.gov/ark:/12025/psbbantu??
      10  S: here: 1 | 1 | 1
          S:
          S: erc:
          S: who:    Lederberg, Joshua
          S: what:   Studies of Human Families for Genetic Linkage
      15  S: when:   1974
          S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
          S: erc-support:
          S: who:    USNLM
          S: what:   Permanent, Unchanging Content
      20  S: when:   20010421
          S: where:  http://ark.nlm.nih.gov/yy22948
          S: [closes session]

   Again, a single metadata record (lines 12-21) is returned, but it
   consists of two segments. The first segment (12-16) gives the same
   basic citation information as in the previous example. It is
   returned in order to establish context for the persistence
   declaration in the second segment (17-21).

   Each segment in an ERC tells a different story relating to the
   object, so although the same four questions (elements) appear in
   each, the answers depend on the segment's story type. While the
   first segment tells the story of an expression of the object, the
   second segment tells the story of the support commitment made to it:
   who made the commitment, what the nature of the commitment was, when
   it was made, and where a fuller explanation of the commitment may be
   found.

7. Overview of Electronic Resource Citations (ERCs)

   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
   simple, compact, and printable record designed to hold data
   associated with an information resource. By design, the ERC is a
   metadata format that balances the needs for expressive power, very
   simple machine processing, and direct human manipulation.

   A founding principle of the ERC is that direct human contact with
   metadata will be a necessary and sufficient condition for the near
   term rapid development of metadata standards, systems, and services.
   Thus the machine-processable ERC format must only minimally strain
   people's ability to read, understand, change, and transmit ERCs
   without their relying on intermediation with specialized software
   tools. The basic ERC needs to be succinct, transparent, and
   trivially parseable by software.

   In the current Internet, it is natural to seriously consider using
   XML as an exchange format because of predictions that it will obviate
   many ad hoc formats and programs, and unify much of the world's
   information under one reliable data structuring discipline that is
   easy to generate, verify, parse, and render. It appears, however,
   that XML is still only catching on after years of standards work and
   implementation experience. The reasons for it are unclear, but for
   now very simple XML interpretation is still out of reach. Another
   important caution is that XML structures are hard on the eyeballs,
   taking up an amount of display (and page) space that significantly
   exceeds that of traditional formats. Until these conflicts with ERC
   principle are resolved, XML is not a first choice for representing
   ERCs.
Borrowing instead from the data structuring format that underlies
the successful spread of email and web services, the first ERC
format is based on email and HTTP headers (RFC 822) [EMHDRS].
There is a naturalness to its label-colon-value format (seen in the
previous section) that barely needs explanation to a person beginning
to enter ERC metadata.

Besides simplicity of ERC system implementation and data entry
mechanics, ERC semantics (what the record and its constituent parts
mean) must also be easy to explain. ERC semantics are based on a
reformulation and extension of the Dublin Core [DCORE] hypothesis,
which suggests that the fifteen Dublin Core metadata elements have a
key role to play in cross-domain resource description. The ERC
design recognizes that the Dublin Core's primary contribution is the
international, interdisciplinary consensus that identified fifteen
semantic buckets (element categories), regardless of how they are
labeled. The ERC then adds a definition for a record and some
minimal compliance rules. In pursuing the limits of simplicity, the
ERC design combines and relabels some Dublin Core buckets to isolate
a tiny kernel (subset) of four elements for basic cross-domain
resource description.

For the cross-domain kernel, the ERC uses the four basic elements --
who, what, when, and where -- to pretend that every object in the
universe can have a uniform minimal description. Each has a name or
other identifier, a location, some responsible person or party, and a
date. It doesn't matter what type of object it is, or whether one
plans to read it, interact with it, smoke it, wear it, or navigate
it. Of course, this approach is flawed because uniformity of
description for some object types requires more semantic contortion
and sacrifice than for others.
That is why, at the beginning of this document, the ARK was said to
be suited to objects that accommodate reasonably regular electronic
description.

While insisting on uniformity at the most basic level provides
powerful cross-domain leverage, the semantic sacrifice is great for
many applications. So the ERC also permits a semantically rich and
nuanced description to co-exist in a record along with a basic
description. In that way both sophisticated and naive recipients of
the record can extract the level of meaning from it that best suits
their needs and abilities. Key to unlocking the richer description
is a controlled vocabulary of ERC record types (not explained in this
document) that permit knowledgeable recipients to apply defined sets
of additional assumptions to the record.

7.1. ERC Syntax

An ERC record is a sequence of metadata elements ending in a blank
line. An element consists of a label, a colon, and an optional
value. Here is an example of a record with five elements.

    erc:
    who:    Gibbon, Edward
    what:   The Decline and Fall of the Roman Empire
    when:   1781
    where:  http://www.ccel.org/g/gibbon/decline/

A long value may be folded (continued) onto the next line by
inserting a newline and indenting the next line. A value can be thus
folded across multiple lines. Here are two example elements, each
folded across four lines.

    who/created: University of California, San Francisco, AIDS
        Program at San Francisco General Hospital | University
        of California, San Francisco, Center for AIDS Prevention
        Studies
    what/Topic:
        Heart Attack | Heart Failure
        | Heart
        Diseases

An element value folded across several lines is treated as if the
lines were joined together on one long line.
For example, the second element from the previous example is
considered equivalent to

    what/Topic: Heart Attack | Heart Failure | Heart Diseases

An element value may contain multiple values, each one separated from
the next by a `|' (pipe) character. The element from the previous
example contains three values.

For annotation purposes, any line beginning with a `#' (hash)
character is treated as if it were not present; this is a "comment"
line (a feature not available in email or HTTP headers). For example,
the following element is spread across four lines and contains two
values:

    what/Topic:
        Heart Attack
    # | Heart Failure -- hold off until next review cycle
        | Heart Diseases

7.2. ERC Stories

An ERC record is organized into one or more distinct segments, where
each segment tells a story about a different aspect of the
information resource. A segment boundary occurs whenever a segment
label (an element beginning with "erc") is encountered. The basic
label "erc:" introduces the story of an object's expression (e.g.,
its publication, installation, or performance). The label
"erc-about:" introduces the story of an object's content (what it is
about) and "erc-support:" introduces the story of a support
commitment made to it. A story segment that concerns the ERC itself
is introduced by the label "erc-from:". It is an important segment
that tells the story of the ERC's provenance. Elements beginning
with "erc" are reserved for segment labels and their associated story
types. From an earlier example, here is an ERC with two segments.

    erc:
    who:    Lederberg, Joshua
    what:   Studies of Human Families for Genetic Linkage
    when:   1974
    where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
    erc-support:
    who:    NIH/NLM/LHNCBC
    what:   Permanent, Unchanging Content
    # Note to ops staff:  date needs verification.
    when:   2001 04 21
    where:  http://ark.nlm.nih.gov/yy22948

Segment stories are told according to journalistic tradition. While
any number of pertinent elements may appear in a segment, priority is
placed on answering the questions who, what, when, and where at the
beginning of each segment so that readers can make the most important
selection or rejection decisions as soon as possible. To make things
simple, the listed ordering of the questions is maintained in each
segment (as it happens, most people who have been exposed to this
storytelling technique are already familiar with the above ordering).

The four questions are answered by using corresponding element
labels. The four element labels can be re-used in each story
segment, but their meaning changes depending on the segment (the
story type) in which they appear. In the example above, "who" is
first used to name a document's author and subsequently used to name
the permanence guarantor (provider). Similarly, "when" first lists
the date of object creation and in the next segment lists the date of
a commitment decision. Four labels appearing across three segments
effectively map to twelve semantically distinct elements. Distinct
element meanings are mapped to Dublin Core elements in a later
section.

7.3. The ERC Anchoring Story

Each ERC contains an anchoring story. It is usually the first
segment labeled "erc:" and it concerns an "anchoring" expression of
the object.
An "anchoring" expression is the one that a provider deemed the most
suitable basic referent given the audience and application for which
it produced the ERC. If it sounds like the provider has great
latitude in choosing its anchoring expression, it is because it does.
A typical anchoring story in an ERC for a born-digital document would
be the story of the document's release on a web site; such a document
would then be the anchoring expression.

An anchoring story need not be the central descriptive goal of an ERC
record. For example, a museum provider may create an ERC for a
digitized photograph of a painting but choose to anchor it in the
story of the original painting instead of the story of the electronic
likeness; although the ERC may through other segments prove to be
centrally concerned with describing the electronic likeness, the
provider may have chosen this particular anchoring story in order to
make the ERC visible in a way that is most natural to patrons (who
would find the Mona Lisa under da Vinci sooner than they would find
it under the name of the person who snapped the photograph or scanned
the image). In another example, a provider that creates an ERC for a
dramatic play as an abstract work has the task of describing a piece
of intangible intellectual property. To anchor this abstract object
in the concrete world, if only through a derivative expression, it
makes sense for the provider to choose a suitable printed edition of
the play as the anchoring object expression (to describe in the
anchoring story) of the ERC.

The anchoring story has special rules designed to keep ERC processing
simple and predictable. Each of the four basic elements (who, what,
when, and where) must be present, unless a best effort to supply it
fails.
In the event of failure, the element still appears but a special
value (described later) is used to explain the missing value.
While the requirement that each of the four elements be present only
applies to the anchoring story segment, as usual these elements
appear at the beginning of the segment and may only be used in the
prescribed order. A minimal ERC would normally consist of just an
anchoring story and the element quartet, as illustrated in the next
example.

    erc:
    who:    National Research Council
    what:   The Digital Dilemma
    when:   2000
    where:  http://books.nap.edu/html/digital%5Fdilemma

A minimal ERC can be abbreviated so that it resembles a traditional
compact bibliographic citation that is nonetheless completely machine
processable. The required elements and ordering make it possible to
eliminate the element labels, as shown here.

    erc: National Research Council | The Digital Dilemma | 2000
        | http://books.nap.edu/html/digital%5Fdilemma

7.4. ERC Elements

As mentioned, the four basic ERC elements (who, what, when, and
where) take on different specific meanings depending on the story
segment in which they are used. By appearing in each segment, albeit
in different guises, the four elements serve as a valuable mnemonic
device -- a kind of checklist -- for constructing minimal story
segments from scratch. Again, it is only in the anchoring segment
that all four elements are mandatory.

Here are some mappings between ERC elements and Dublin Core [DCORE]
elements.

    Segment    ERC Element  Equivalent Dublin Core Element
    ---------  -----------  ------------------------------
    erc        who          Creator/Contributor/Publisher
    erc        what         Title
    erc        when         Date
    erc        where        Identifier
    erc-about  who
    erc-about  what         Subject
    erc-about  when         Coverage (temporal)
    erc-about  where        Coverage (spatial)

The basic element labels may also be qualified to add nuances to the
semantic categories that they identify. Elements are qualified by
appending a `/' (slash) and a qualifier term. Often qualifier terms
appear as the past tense form of a verb because it makes re-using
qualifiers among elements easier.

    who/published: ...
    when/published: ...
    where/published: ...

Using past tense verbs for qualifiers also reminds providers and
recipients that element values contain transient assertions that may
have been true once, but that tend to become less true over time.
Recipients that don't understand the meaning of a qualifier can fall
back onto the semantic category (bucket) designated by the
unqualified element label. Inevitably recipients (people and
software) will have diverse abilities in understanding elements and
qualifiers.

Any number of other elements and qualifiers may be used in
conjunction with the quartet of basic segment questions. The only
semantic requirement is that they pertain to the segment's story.
Also, it is only the four basic elements that change meaning
depending on their segment context. All other elements have meaning
independent of the segment in which they appear. If an element label
stripped of its qualifier is still not recognized by the recipient, a
second fall back position is to ignore it and rely on the four basic
elements.

Elements may be either Canonical, Provisional, or Local.
Canonical elements are officially recognized via a registry as part
of the metadata vernacular. All elements, qualifiers, and segment
labels used in this document up until now belong to that vernacular.
Provisional elements are also officially recognized via the registry,
but have only been proposed for inclusion in the vernacular. To be
promoted to the vernacular, a provisional element passes through a
vetting process during which its documentation must be in order and
its community acceptance demonstrated. Local elements are any
elements not officially recognized in the registry. The registry
[DERC] is a work in progress.

Local elements are immediately distinguishable from Canonical or
Provisional elements because all terms that begin with an upper case
letter are reserved for spontaneous local use. No term beginning
with an upper case letter will ever be assigned Canonical or
Provisional status, so it should be safe to use such terms for local
purposes. Any recipient of external ERCs containing such terms will
understand them to be part of the originating provider's local
metadata dialect. Here's an example ERC with three segments, one
local element, and two local qualifiers. The segment boundaries have
been emphasized by comment lines (which, as before, are ignored by
processors).

    erc:
    who:    Bullock, TH | Achimowicz, JZ | Duckrow, RB
            | Spencer, SS | Iragui-Madoz, VJ
    what:   Bicoherence of intracranial EEG in sleep,
            wakefulness and seizures
    when:   1997 12 00
    where:  http://cogprints.soton.ac.uk/%{
                documents/disk0/00/00/01/22/index.html %}
    in:     EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
    IDcode: cog00000122
    # ---- new segment ----
    erc-about:
    what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
            | Cooperativity | Subdural | Hippocampus | Higher moment
    # ---- new segment ----
    erc-from:
    who:    NIH/NLM/NCBI
    what:   pm9546494
    when/Reviewed: 1998 04 18 021600
    where:  http://ark.nlm.nih.gov/12025/pm9546494?

The local element "IDcode" immediately precedes the "erc-about"
segment, which itself contains an element with the local qualifier
"Subcategory". The second to last element also carries the local
qualifier "Reviewed". Finally, what might be a provisional element
"in" appears near the end of the first segment. It might have been
proposed as a way to complete a citation for an object originally
appearing inside another object (such as an article appearing in a
journal or an encyclopedia).

7.5. ERC Element Values

ERC element values tend to be straightforward strings. If the
provider intends something special for an element, it will so
indicate with markers at the beginning of its value string. The
markers are designed to be uncommon enough that they would not likely
occur in normal data except by deliberate intent. Markers can only
occur near the beginning of a string, and once any octet of
non-marker data has been encountered, no further marker processing is
done for the element value. In the absence of markers the string is
considered pure data; this has been the case with all the examples
seen thus far.
The fullest form of an element value with all three optional markers
in place looks like this.

    VALUE = [markup_flags] (:ccode) , DATA

In processing, the first non-whitespace character of an ERC element
value is examined. An initial `[' is reserved to introduce a
bracketed set of markup flags (not described in this document) that
ends with `]'. If ERC data is machine-generated, each value string
may be preceded by "[]" to prevent any of its data from being
mistaken for markup flags. Once past the optional markup, the
remaining value may optionally begin with a controlled code. A
controlled code always has the form "(:ccode)", for example,

    who:  (:unkn) Anonymous
    what: (:791) Bee Stings

Any string after such a code is taken to be an uncontrolled (e.g.,
natural language) equivalent. The code "unkn" indicates a
conventional explanation for a missing value (stating that the value
is unknown). The remainder of the string makes an equivalent
statement in a form that the provider deemed most suitable to its
(probably human) audience. The code "791" could be a fixed numeric
topic identifier within an unspecified topic vocabulary. Any code
may be ignored by those that do not understand it.

There are several codes to explain different ways in which a required
element's value may go missing.

    (:unkn)   unknown (e.g., Anonymous, Inconnue)
    (:unav)   value unavailable indefinitely
    (:unac)   temporarily inaccessible
    (:unap)   not applicable, makes no sense
    (:unas)   value unassigned (e.g., Untitled)
    (:none)   never had a value, never will
    (:null)   explicitly empty
    (:unal)   unallowed, suppressed intentionally

Once past an optional controlled code, the remaining string value is
subjected to one final test. If the next non-whitespace character is
a `,' (comma), it indicates that the string value is "sort-friendly".
This means (a) that the value is laid out with an inverted word order
useful for sorting items having comparably laid out element values
(items might be the containing ERC records) and (b) that the value
may contain other commas that indicate inversion points should it
become necessary to recover the value in natural word order.
Typically, this feature is used to express Western-style personal
names in family-name-given-name order. It can also be used wherever
natural word order might make sorting tricky, such as when data
contains titles or corporate names. Here are some example elements.

    who:   , van Gogh, Vincent
    who:,Howell, III, PhD, 1922-1987, Thurston
    who:, Acme Rocket Factory, Inc., The
    who:, Mao Tse Tung
    who:, McCartney, Paul, Sir,
    what:, Health and Human Services, United States Government
        Department of, The,

There are rules to use in recovering a copy of the value in natural
word order, if desired. The above example strings have the following
natural word order values, respectively.

    Vincent van Gogh
    Thurston Howell, III, PhD, 1922-1987
    The Acme Rocket Factory, Inc.
    Mao Tse Tung
    Sir Paul McCartney
    The United States Government Department of Health and Human Services

7.6. ERC Element Encoding and Dates

Some characters that need to appear in ERC element values might
conflict with special characters used for structuring ERCs, so there
needs to be a way to include them as literal characters that are
protected from special interpretation. This is accomplished through
an encoding mechanism that resembles the %-encoding familiar to [URI]
handlers.

The ERC encoding mechanism also uses `%', but instead of taking two
following hexadecimal digits, it takes one non-alphanumeric character
or two alphabetic characters that cannot be mistaken for hex digits.
It is designed not to be confused with normal web-style %-encoding.
In particular it can be decoded without risking unintended decoding
of normal %-encoded data (which would introduce errors). Here are
the one-character (non-alphanumeric) ERC encoding extensions.

    ERC   Purpose
    ---   ------------------------------------------------
    %!    decodes to the element separator `|'
    %%    decodes to a percent sign `%'
    %.    decodes to a comma `,'
    %_    a non-character used as syntax shim
    %{    a non-character that begins an expansion block
    %}    a non-character that ends an expansion block

One particularly useful construct in ERC element values is the pair
of special encoding markers ("%{" and "%}") that indicates an
"expansion" block. Whatever string of characters they enclose will
be treated as if none of the contained whitespace (SPACEs, TABs,
Newlines) were present. This comes in handy for writing long,
multi-part URLs in a readable way. For example, the value in

    where: http://foo.bar.org/node%{
               ? db = foo
               & start = 1
               & end = 5
               & buf = 2
               & query = foo + bar + zaf
           %}

is decoded into an equivalent element, but with a correct and intact
URL:

    where:
      http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf

In a parting word about ERC element values, a commonly recurring
value type is a date, possibly followed by a time. ERC dates take on
one of the following forms:

    1999                (four digit year)
    2000 12 29          (year, month, day)
    2000 12 29 235955   (year, month, day, hour, minute, second)

    21  Spring    31  1st quarter    25  Spring (so. hemisphere)
    22  Summer    32  2nd quarter    26  Summer (so. hemisphere)
    23  Fall      33  3rd quarter    27  Fall (so. hemisphere)
    24  Winter    34  4th quarter    28  Winter (so. hemisphere)

In dates, all internal whitespace is squeezed out to achieve a
normalized form suitable for lexical comparison and sorting. This
means that the following dates

    2000 12 29 235955     (recommended for readability)
    2000 12 29 23 59 55
    20001229 23 59 55
    20001229235955        (normalized date and time)

are all equivalent. The first form is recommended for readability.
The last form (shortest and easiest to compute with) is the
normalized form. Hyphens and commas are reserved to create date
ranges and lists, for example,

    1996-2000               (a range of four years)
    1952, 1957, 1969        (a list of three years)
    1952, 1958-1967, 1985   (a mixed list of dates and ranges)
    20001229-20001231       (a range of three days)

7.7. ERC Stub Records and Internal Support

The ERC design introduces the concept of a "stub" record, which is an
incomplete ERC record intended to be supplemented with additional
elements before being released as a standalone ERC record. A stub
ERC record has no minimum required elements. It is just a group of
elements that does not begin with "erc:" but otherwise conforms to
the ERC record syntax.

ERC stubs may be useful in supporting internal procedures using the
ERC syntax. Often they rely on the convenience and accuracy of
automatically supplied elements, even the basic ones. To be ready
for external use, however, an ERC stub must be transformed into a
complete ERC record having the usual required elements. An ERC stub
record can be convenient for metadata embedded in a document, where
elements such as location, modification date, and size -- which one
would not omit from an externalized record -- are omitted simply
because they are much better supplied by a computation. A separate
local administrative procedure, not defined for ERCs in general,
would effect the promotion of stubs into complete records.
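
The distinction between a stub and a complete record can be
illustrated with a small sketch. The Python below is not part of the
ERC design; it is a minimal illustration, assuming the syntax rules of
section 7.1 (folding, comment lines, `|'-separated values) and the
anchoring rules of section 7.3. The function names are hypothetical,
folded lines are assumed to be joined with a single space, and a
qualified label such as "who/created" is assumed to count as its
unqualified basic element.

```python
BASIC = ["who", "what", "when", "where"]

def parse_elements(text):
    """Parse ERC-style text into a list of (label, [values]) pairs."""
    raw = []                                  # (label, unsplit value) pairs
    for line in text.splitlines():
        if line.startswith("#"):              # comment lines are treated as absent
            continue
        if not line.strip():                  # a blank line ends the record
            if raw:
                break
            continue
        if line[0] in " \t" and raw:          # indented line: fold into previous value
            label, value = raw[-1]
            raw[-1] = (label, (value + " " + line.strip()).strip())
        else:                                 # label, colon, optional value
            label, _, value = line.partition(":")
            raw.append((label.strip(), value.strip()))
    return [(label, [v.strip() for v in value.split("|")] if value else [])
            for label, value in raw]

def is_stub(elements):
    """A stub is any group of elements that does not begin with "erc:"."""
    return not elements or elements[0][0] != "erc"

def has_anchoring_quartet(elements):
    """Check that who, what, when, where directly follow "erc:" in order."""
    if is_stub(elements):
        return False
    if elements[0][1]:                        # abbreviated form: values follow "erc:"
        return len(elements[0][1]) == 4
    labels = [label.split("/")[0] for label, _ in elements[1:5]]
    return labels == BASIC
```

A local promotion procedure of the kind mentioned above might call
is_stub() on embedded metadata, supply the computed elements, and then
confirm with has_anchoring_quartet() that the result is ready for
external use.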
While the ERC is a general-purpose container for exchange of resource
descriptions, it does not dictate how records must be internally
stored, laid out, or assembled by data providers or recipients.
Arbitrary internal descriptive frameworks can support ERCs simply by
mapping (e.g., on demand) local records to the ERC container format
and making them available for export. Therefore, to support ERCs
there is no need for a data provider to convert internal data to be
stored in an ERC format. On the other hand, any provider (such as
one just getting started in the business of resource description) may
choose to store and manipulate local data natively in the ERC format.

8. Advice to Web Clients

This section offers some advice to web client software developers.
It is hard to write about because it tries to anticipate a series of
events that might lead to native web browser support for ARKs.

ARKs are envisaged to appear wherever durable object references are
planned. Library cataloging records, literature citations, and
bibliographies are important examples. In many of these places URLs
(Uniform Resource Locators) currently stand in, and URNs, DOIs, and
PURLs have been proposed as alternatives.

The strings representing ARKs are also envisaged to appear in some of
the places where URLs currently appear: in hypertext links (where
they are not normally shown to users) and in rendered text (displayed
or printed). Internet search engines, for example, tend to include
both actionable and manifest links when listing each item found. A
normal HTML link for which the URL is not displayed looks like this.

    Click Here

The same link with an ARK instead of a URL:

    Click Here

Web browsers would in general require a small modification to
recognize and convert this ARK, via mapping authority discovery, to
the URL form.

    Click Here

A browser that knows how to make that conversion could also
automatically detect and replace a non-working NMAH.

An NAA will typically make known the associations it creates by
publishing them in catalogs, actively advertising them, or simply
leaving them on web sites for visitors (e.g., users, indexing
spiders) to stumble across in browsing.

9. Security Considerations

The ARK naming scheme poses no direct risk to computers and networks.
Implementors of ARK services need to be aware of security issues when
querying networks and filesystems for Name Mapping Authority
services, and the concomitant risks from spoofing and obtaining
incorrect information. These risks are no greater for ARK mapping
authority discovery than for other kinds of service discovery. For
example, recipients of ARKs with a specified hostport (NMAH) should
treat it like a URL and be aware that the identified ARK service may
no longer be operational.

Apart from mapping authority discovery, ARK clients and servers
subject themselves to all the risks that accompany normal operation
of the protocols underlying mapping services (e.g., HTTP, Z39.50).
As specializations of such protocols, an ARK service may limit
exposure to the usual risks. Indeed, ARK services may enhance a kind
of security by helping users identify long-term reliable references
to information objects.

10. Authors' Addresses

    John A. Kunze
    California Digital Library
    University of California, Office of the President
    415 20th St, 4th Floor
    Oakland, CA 94612-3550, USA

    Fax:   +1 510-893-5212
    EMail: jak@ucop.edu

    R. P. C. Rodgers
    US National Library of Medicine
    8600 Rockville Pike, Bldg. 38A
    Bethesda, MD 20894, USA

    Fax:   +1 301-496-0673
    EMail: rodgers@nlm.nih.gov

11. References

[ARK]      J. Kunze, "Towards Electronic Persistence Using ARK
           Identifiers", Proceedings of the 3rd ECDL Workshop on Web
           Archives, August 2003, (PDF)
           http://bibnum.bnf.fr/ecdl/2003/proceedings.php?f=kunze

[DCORE]    Dublin Core Metadata Initiative, "Dublin Core Metadata
           Element Set, Version 1.1: Reference Description", July
           1999, http://dublincore.org/documents/dces/.

[DERC]     J. Kunze, "Dictionary of the ERC", work in progress.

[DNS]      P.V. Mockapetris, "Domain Names - Concepts and
           Facilities", RFC 1034, November 1987.

[DOI]      International DOI Foundation, "The Digital Object
           Identifier (DOI) System", February 2001,
           http://dx.doi.org/10.1000/203.

[EMHDRS]   D. Crocker, "Standard for the format of ARPA Internet text
           messages", RFC 822, August 1982.

[ERC]      J. Kunze, "A Metadata Kernel for Electronic Permanence",
           Journal of Digital Information, Vol 2, Issue 2, January
           2002, ISSN 1368-7506, (PDF)
           http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

[HTTP]     R. Fielding, et al, "Hypertext Transfer Protocol --
           HTTP/1.1", RFC 2616, June 1999.

[MD5]      R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
           April 1992.

[NAPTR]    M. Mealling, R. Daniel, "The Naming Authority Pointer
           (NAPTR) DNS Resource Record", RFC 2915, September 2000.

[NLMPerm]  M. Byrnes, "Defining NLM's Commitment to the Permanence of
           Electronic Information", ARL 212:8-9, October 2000,
           http://www.arl.org/newsltr/212/nlm.html

[PURL]     K. Shafer, et al, "Introduction to Persistent Uniform
           Resource Locators", 1996,
           http://purl.oclc.org/OCLC/PURL/INET96

[TELNET]   J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
           RFC 854, May 1983.

[THUMP]    J. Kunze, "The HTTP URL Mapping Protocol", work in
           progress.

[URI]      T. Berners-Lee, et al, "Uniform Resource Identifiers
           (URI): Generic Syntax", RFC 2396, August 1998.

[URNBIB]   C. Lynch, et al, "Using Existing Bibliographic Identifiers
           as Uniform Resource Names", RFC 2288, February 1998.

[URNSYN]   R. Moats, "URN Syntax", RFC 2141, May 1997.

[URNNID]   L. Daigle, et al, "URN Namespace Definition Mechanisms",
           RFC 2611, June 1999.

12. Appendix: ARK Implementations

Currently, the primary implementation activity is at the California
Digital Library (CDL),

    http://ark.cdlib.org/

housed at the University of California Office of the President, where
over 150,000 ARKs have been assigned to objects that the CDL owns or
controls. Some experimentation in ARKs is taking place at WIPO and
at the University of California San Diego.

The US National Library of Medicine (NLM) also has an experimental,
prototype ARK service under development. It is being made available
for purposes of demonstrating various aspects of the ARK system, but
is subject to temporary or permanent withdrawal (without notice)
depending upon the circumstances of the small research group
responsible for making it available. It is described at:

    http://ark.nlm.nih.gov/

Comments and feedback may be addressed to rodgers@nlm.nih.gov.

13. Appendix: Current ARK Name Authority Table

This appendix contains a copy of the Name Authority Table (a file) at
the time of writing. It may be loaded into a local filesystem (e.g.,
/etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
NMAHs (Name Mapping Authority Hostports). It contains Perl code that
can be copied into a standalone script that processes the table (as a
file). Because this is still a proposed file, none of the values in
it are real.
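
For readers who do not use Perl, the lookup performed by the embedded
script can also be sketched in Python. This is only an illustration
under stated assumptions: the function name lookup_nmahs is
hypothetical, the table is assumed to be passed in as a string, and an
entry is assumed to end at the first line that is not indented, which
is a simplification of what the Perl script does.

```python
def lookup_nmahs(natab_text, naan):
    """Return the hostports listed under every NAA line matching `naan`."""
    hostports = []
    collecting = False
    for line in natab_text.splitlines():
        if line.startswith(naan + ":"):       # NAA line: number, colon, policy URL
            collecting = True
        elif collecting and line[:1] in (" ", "\t"):
            fields = line.split()             # the hostport is the first field
            if fields:
                hostports.append(fields[0])
        else:                                 # any other line ends the current entry
            collecting = False
    return hostports
```

Called with the text of the table and the string "12025", the function
would return the hostports indented beneath the line that begins with
"12025:".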
1814 # 1815 # Name Assigning Authority / Name Mapping Authority Lookup Table 1816 # Last change: 2 June 2004 1817 # Reload from: http://ark.nlm.nih.gov/etc/natab 1818 # Mirrored at: http://ark.cdlib.org/natab 1819 # To register: mailto:ark@cdlib.org?Subject=naareg 1820 # Process with: Perl script at end of this file (optional) 1821 # 1822 # Each NAA appears at the beginning of a line with the NAA Number 1823 # first, a colon, and an ARK or URL to a statement of naming policy 1824 # (see http://ark.cdlib.org for an example). 1825 # All the NMA hostports that service an NAA are listed, one per 1826 # line, indented, after the corresponding NAA line. 1827 # 1828 # National Library of Medicine 1829 12025: http://www.nlm.nih.gov/xxx/naapolicy.html 1830 ark.nlm.nih.gov USNLM 1831 foobar.zaf.org UCSF 1832 sneezy.dopey.com BIREME 1833 # 1834 # Library of Congress 1835 12026: http://www.loc.gov/xxx/naapolicy.html 1836 foobar.zaf.org USLC 1837 sneezy.dopey.com USLC 1838 # 1839 # National Agriculture Library 1840 12027: http://www.nal.gov/xxx/naapolicy.html 1841 foobar.zaf.gov:80 USNAL 1842 # 1843 # California Digital Library 1844 13030: http://www.cdlib.org/inside/diglib/ark/ 1845 ark.cdlib.org CDL 1846 # 1847 # World Intellectual Property Organization 1848 13038: http://www.wipo.int/xxx/naapolicy.html 1849 www.wipo.int WIPO 1850 # 1851 # University of California San Diego 1852 20775: http://library.ucsd.edu/xxx/naapolicy.html 1853 ucsd.edu UCSD 1854 # 1855 # University of California San Francisco 1856 29114: http://library.ucsf.edu/xxx/naapolicy.html 1857 ucsf.edu UCSF 1858 # 1859 # University of California Berkeley 1860 28722: http://library.berkeley.edu/xxx/naapolicy.html 1861 berkeley.edu UCB 1862 # 1863 # Rutgers University Libraries 1864 15230: http://rci.rutgers.edu/xxx/naapolicy.html 1865 rutgers.edu RUL 1866 # 1867 #--- end of data --- 1868 # The following Perl script takes an NAA as argument and outputs 1869 # the NMAs in this file listed under any matching NAA. 
   #
   # my $naa = shift;
   # while (<>) {
   #     next if (! /^$naa:/);
   #     while (<>) {
   #         last if (! /^[#\s]./);
   #         print "$1\n" if (/^\s+(\S+)/);
   #     }
   # }
   #
   # Create a g/t/nroff-safe version of this table with the UNIX command,
   #
   #     expand natab | sed 's/\\/\\\e/g' > natab.roff
   #
   # end of file

14. Copyright Notice

   Copyright (C) The Internet Society (2004). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

Expires 31 January 2005

Table of Contents

   Status of this Document
   Abstract
   1. Introduction
   1.1. Three Reasons to Use ARKs
   1.2. Organizing Support for ARKs
   1.3. A Definition of Identifier
   2. ARK Anatomy
   2.1. The Name Mapping Authority Hostport (NMAH)
   2.2. The Name Assigning Authority Number (NAAN)
   2.3. The Name Part
   2.3.1. Names that Reveal Object Hierarchy
   2.3.2. Names that Reveal Object Variants
   2.3.3. Hyphens are Ignored
   2.4. Normalization and Lexical Equivalence
   2.5. Naming Considerations
   3. Assigners of ARKs
   4. Finding a Name Mapping Authority
   4.1. Looking Up NMAHs in a Globally Accessible File
   4.2. Looking up NMAHs Distributed via DNS
   5. Generic ARK Service Definition
   5.1. Generic ARK Access Service (access, location)
   5.2. Generic Policy Service (permanence, naming, etc.)
   5.3. Generic Description Service
   6. Overview of the Tiny HTTP URL Mapping Protocol (THUMP)
   7. Overview of Electronic Resource Citations (ERCs)
   7.1. ERC Syntax
   7.2. ERC Stories
   7.3. The ERC Anchoring Story
   7.4. ERC Elements
   7.5. ERC Element Values
   7.6. ERC Element Encoding and Dates
   7.7. ERC Stub Records and Internal Support
   8. Advice to Web Clients
   9. Security Considerations
   10. Authors' Addresses
   11. References
   12. Appendix: ARK Implementations
   13. Appendix: Current ARK Name Authority Table
   14. Copyright Notice