idnits 2.17.1 draft-kunze-ark-14.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** It looks like you're using RFC 3978 boilerplate. You should update this to the boilerplate described in the IETF Trust License Policy document (see https://trustee.ietf.org/license-info), which is required now. -- Found old boilerplate from RFC 3978, Section 5.1 on line 17. -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on line 2370. ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure Acknowledgement. ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure Acknowledgement. ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure Invitation. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 51 longer pages, the longest (page 2) being 63 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 51 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 8 instances of too long lines in the document, the longest one being 21 characters in excess of 72. 
** The abstract seems to contain references ([Qualifier]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 8 instances of lines with non-RFC2606-compliant FQDNs in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust Copyright Line does not match the current year == Line 1138 has weird spacing: '... regexp repla...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (24 July 2007) is 6119 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'Qualifier' is mentioned on line 437, but not defined == Unused Reference: 'MD5' is defined on line 2143, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'ANVL' -- Possible downref: Non-RFC (?) normative reference: ref. 'ARK' -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE' -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI' -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC' -- Possible downref: Non-RFC (?) normative reference: ref. 
'Handle' ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC 7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235) -- Possible downref: Non-RFC (?) normative reference: ref. 'Kernel' ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref. 'MD5') -- Possible downref: Non-RFC (?) normative reference: ref. 'N2T' ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC 3401, RFC 3402, RFC 3403, RFC 3404) -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm' -- Possible downref: Non-RFC (?) normative reference: ref. 'NOID' -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL' ** Obsolete normative reference: RFC 822 (Obsoleted by RFC 2822) -- Possible downref: Non-RFC (?) normative reference: ref. 'TEMPER' -- Possible downref: Non-RFC (?) normative reference: ref. 'THUMP' ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC 3986) ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref. 'URNBIB') ** Obsolete normative reference: RFC 2141 (ref. 'URNSYN') (Obsoleted by RFC 8141) ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC 3406) Summary: 17 errors (**), 0 flaws (~~), 8 warnings (==), 18 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Internet-Draft: draft-kunze-ark-14.txt J. Kunze 3 ARK Identifier Scheme University of California (UCOP) 4 Expires 24 January 2008 R. P. C. 
Rodgers 5 US National Library of Medicine 6 24 July 2007 8 The ARK Persistent Identifier Scheme 10 (http://www.ietf.org/internet-drafts/draft-kunze-ark-14.txt) 12 Status of this Document 14 By submitting this Internet-Draft, each author represents that any 15 applicable patent or other IPR claims of which he or she is aware 16 have been or will be disclosed, and any of which he or she becomes 17 aware will be disclosed, in accordance with Section 6 of BCP 79. 19 Internet-Drafts are working documents of the Internet Engineering 20 Task Force (IETF), its areas, and its working groups. Note that 21 other groups may also distribute working documents as Internet- 22 Drafts. 24 Internet-Drafts are draft documents valid for a maximum of six months 25 and may be updated, replaced, or obsoleted by other documents at any 26 time. It is inappropriate to use Internet-Drafts as reference 27 material or to cite them other than as "work in progress." 29 The list of current Internet-Drafts can be accessed at 30 http://www.ietf.org/1id-abstracts.html 32 The list of Internet-Draft Shadow Directories can be accessed at 33 http://www.ietf.org/shadow.html 35 Distribution of this document is unlimited. Please send comments to 36 jak@ucop.edu 38 Copyright (C) The IETF Trust (2007). All Rights Reserved. 40 Abstract 42 The ARK (Archival Resource Key) naming scheme is designed to 43 facilitate the high-quality and persistent identification of 44 information objects. A founding principle of the ARK is that 45 persistence is purely a matter of service and is neither inherent in 46 an object nor conferred on it by a particular naming syntax. The best 47 that an identifier can do is to lead users to the services that 48 support persistence. The term ARK itself refers both to the scheme 49 and to any single identifier that conforms to it. 
An ARK has five 50 components: 52 [http://NMAH/]ark:/NAAN/Name[Qualifier] 54 an optional and mutable Name Mapping Authority Hostport, the "ark:" 55 label, the Name Assigning Authority Number (NAAN), the assigned Name, 56 and an optional and possibly mutable Qualifier supported by the NMA. 57 The NAAN and Name together form the immutable persistent identifier 58 for the object. An ARK is a special kind of URL that connects users 59 to three things: the named object, its metadata, and the provider's 60 promise about its persistence. When entered into the location field 61 of a Web browser, the ARK leads the user to the named object. That 62 same ARK, followed by a single question mark ('?'), returns a brief 63 metadata record that is both human- and machine-readable. When the 64 ARK is followed by dual question marks ('??'), the returned metadata 65 contains a commitment statement from the current provider. Tools 66 exist for minting, binding, and resolving ARKs. 68 1. Introduction 70 This document describes a scheme for the high-quality naming of 71 information resources. The scheme, called the Archival Resource Key 72 (ARK), is well suited to long-term access and identification of any 73 information resources that accommodate reasonably regular electronic 74 description. This includes digital documents, databases, software, 75 and websites, as well as physical objects (books, bones, statues, 76 etc.) and intangible objects (chemicals, diseases, vocabulary terms, 77 performances). Hereafter the term "object" refers to an information 78 resource. The term ARK itself refers both to the scheme and to any 79 single identifier that conforms to it. A reasonably concise and 80 accessible overview and rationale for the scheme is available at 81 [ARK]. 83 Schemes for persistent identification of network-accessible objects 84 are not new. 
In the early 1990s, the design of the Uniform Resource 85 Name [URNSYN] responded to the observed failure rate of URLs by 86 articulating an indirect, non-hostname-based naming scheme and the 87 need for responsible name management. Meanwhile, promoters of the 88 Digital Object Identifier [DOI] succeeded in building a community of 89 providers around a mature software system [Handle] that supports name 90 management. The Persistent Uniform Resource Locator [PURL] is 91 another scheme that has the unique advantage of working with 92 unmodified web browsers. ARKs represent an approach that attempts to 93 build on the strengths and to avoid the weaknesses of the other 94 schemes. 96 A founding principle of the ARK is that persistence is purely a 97 matter of service. Persistence is neither inherent in an object nor 98 conferred on it by a particular naming syntax. Nor is the technique 99 of name indirection - upon which URNs, Handles, DOIs, and PURLs are 100 founded - of central importance. Name indirection is an ancient and 101 well-understood practice; new mechanisms for it keep appearing and 102 distracting practitioner attention, with the Domain Name System [DNS] 103 being a particularly dazzling and elegant example. What is often 104 forgotten is that maintenance of an indirection table is the 105 overwhelming and unavoidable cost to the organization providing 106 persistence, and the cost is equivalent across naming schemes. That 107 indirection has always been a native part of the web while being so 108 lightly utilized for the persistence of web-based objects is an 109 indication of how unsuited most organizations are to the task of 110 table maintenance and to the overall challenge of digital permanence. 112 Persistence is achieved through a provider's successful stewardship 113 of objects and their identifiers. The highest level of persistence 114 will be reinforced by a provider's robust contingency, redundancy, 115 and succession strategies.
It is further safeguarded to the extent 116 that a provider's mission is shielded from marketplace and political 117 instabilities. These are by far the major challenges confronting 118 persistence providers, and no identifier scheme has any direct impact 119 on them. In fact, some schemes may be actual liabilities for 120 persistence because they create short- and long-term dependencies for 121 every object access on complex, special-purpose local and global 122 infrastructures, parts of which are proprietary and all of which 123 increase the carry-forward burden for the preservation community. It 124 is for this reason that the ARK scheme relies only on educated name 125 assignment and light use of general-purpose infrastructures that the 126 entire internet community needs (the DNS, web servers, and web 127 browsers) and that one can reasonably expect many others to help 128 carry forward into the technologically evolving future. 130 1.1. Reasons to Use ARKs 132 If no persistent identifier scheme contributes directly to 133 persistence, why not just use URLs? A particular URL may be as 134 durable an identifier as it is possible to have, but nothing 135 distinguishes it from an ordinary URL to the recipient who is 136 wondering if it is suitable for long-term reference. An ARK is just 137 a URL, distinguished by its form, that provides some of the necessary 138 conditions for credible persistence. An ARK invites access to not 139 one, but to three things: to the object, to its metadata, and to a 140 nuanced statement of commitment from the provider regarding the 141 object. Existence of the two extra services can be probed 142 automatically by appending either `?' or `??' to the ARK. 
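As an illustration of the probing convention just described, the following minimal Python sketch derives the three request URLs (object, metadata, commitment statement) from a single actionable ARK. The function name ark_service_urls and the dictionary keys are hypothetical names chosen for this sketch, not part of the scheme, and the sketch only builds the URLs; it does not issue any requests.

```python
def ark_service_urls(ark_url: str) -> dict:
    """Return the three request URLs implied by one actionable ARK.

    Assumption for this sketch: any trailing '?' characters on the input
    are inflections rather than part of the identifier, so they are
    stripped before the three forms are rebuilt.
    """
    base = ark_url.rstrip("?")
    return {
        "object": base,             # the named object itself
        "metadata": base + "?",     # brief metadata record
        "commitment": base + "??",  # provider's commitment statement
    }


urls = ark_service_urls("http://foobar.zaf.org/ark:/12025/654xz321")
for service, url in urls.items():
    print(service, url)
```

A client wishing to check that a URL found in the wild really is an ARK could request each of the three forms in turn and verify that all three services respond.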
144 The form of the ARK also supports the natural separation of naming 145 authorities into the original name assigning authority and the 146 diverse multiple name mapping (or servicing) authorities that in 147 succession and in parallel will take over custodial responsibilities 148 from the original assigner for the large majority of a long-term 149 object's archival lifetime. The mapping authority, indicated by the 150 hostname part of the URL that contains the ARK, serves to launch the 151 ARK into cyberspace. Should it ever fail (and there is no reason why 152 a well-chosen hostname of a 100-year-old cultural memory institution 153 shouldn't last as long as the DNS), that hostname is considered 154 disposable and replaceable. Again, the form of the ARK helps 155 because it defines exactly how to recover the core immutable object 156 identity, and several simple algorithms (based on the URN model) are 157 defined for locating another mapping authority. 159 There are tools to assist in generating ARKs and other identifiers, 160 such as [NOID] and "uuidgen", both of which rely for uniqueness on 161 human-maintained registries. This document also contains some 162 guidelines and considerations for managing namespaces and choosing 163 hostnames wisely. 165 1.2. Three Requirements of ARKs 167 The first requirement of an ARK is to give users a link from an 168 object to a promise of stewardship for it. That promise is a multi- 169 faceted covenant that binds the word of an identified service 170 provider to a specific set of responsibilities. No one can tell if 171 successful stewardship will take place because no one can predict the 172 future. Reasonable conjecture, however, may be based on past 173 performance. There must be a way to tie a promise of persistence to 174 a provider's demonstrated or perceived ability - its reputation - in 175 that arena. Provider reputations would then rise and fall as 176 promises are observed variously to be kept and broken.
This is 177 perhaps the best way we have for gauging the strength of any 178 persistence promise. Note that over time, current providers have 179 nothing to do with the intentions of the original assigners of names. 181 The second requirement of an ARK is to give users a link from an 182 object to a description of it. The problem with a naked identifier 183 is that without a description real identification is incomplete. 184 Identifiers common today are relatively opaque, though some contain 185 ad hoc clues that reflect brief life cycle periods such as the 186 address of a short stay in a filesystem hierarchy. Possession of 187 both an identifier and an object is some improvement, but positive 188 identification may still be uncertain since the object itself might 189 not include a matching identifier or might not carry evidence obvious 190 enough to reveal its identity without significant research. In 191 either case, what is called for is a record bearing witness to the 192 identifier's association with the object, as supported by a recorded 193 set of object characteristics. This descriptive record is partly an 194 identification "receipt" with which users and archivists can verify 195 an object's identity after brief inspection and a plausible match 196 with recorded characteristics such as title and size. 198 The final requirement of an ARK is to give users a link to the object 199 itself (or to a copy) if at all possible. Persistent access is the 200 central duty of an ARK. Persistent identification plays a vital 201 supporting role but, strictly speaking, it can be construed as no 202 more than a record attesting to the original assignment of a never- 203 reassigned identifier. 
Object access may not be feasible for various 204 reasons, such as catastrophic loss of the object, a licensing 205 agreement that keeps an archive "dark" for a period of years, or when 206 an object's own lack of tangible existence confuses normal concepts 207 of access (e.g., a vocabulary term might be accessed through its 208 definition). In such cases the ARK's identification role assumes a 209 much higher profile. But attempts to simplify the persistence 210 problem by decoupling access from identification and concentrating 211 exclusively on the latter are of questionable utility. A perfect 212 system for assigning forever unique identifiers might be created, but 213 if it did so without reducing access failure rates, no one would be 214 interested. The central issue - which may be summed up as the "HTTP 215 404 Not Found" problem - would not have been addressed. 217 1.3. Organizing Support for ARKs: Our Stuff vs. Their Stuff 219 An organization and the user community it serves can often be seen to 220 struggle with two different areas of persistent identification: the 221 Our Stuff problem and the Their Stuff problem. In the Our Stuff 222 problem, we in the organization want our own objects to acquire 223 persistent names. Since we possess or control these objects, our 224 organization tackles the Our Stuff problem directly. Whether or not 225 the objects are named by ARKs, our organization is the responsible 226 party, so it can plan for, maintain, and make commitments about the 227 objects. 229 In the Their Stuff problem, we in the organization want others' 230 objects to acquire persistent names. These are objects that we do 231 not own or control, but some of which are critically important to us. 232 But because they are beyond our influence as far as support is 233 concerned, creating and maintaining persistent identifiers for Their 234 Stuff is not especially purposeful or feasible for us to do. 
There 235 is little that we can do about someone else's stuff except encourage 236 them to find or become providers of persistence services. 238 Co-location of persistent access and identification services is 239 natural. Any organization that undertakes ongoing support of true 240 persistent identification (which includes description) is well-served 241 if it controls, owns, or otherwise has clear internal access to the 242 identified objects, and this gives it an advantage if it wishes also 243 to support persistent access to outsiders. Conversely, persistent 244 access to outsiders requires orderly internal collection management 245 procedures that include monitoring, acquisition, verification, and 246 change control over objects, which in turn requires object 247 identifiers persistent enough to support auditable record keeping 248 practices. 250 Although organizing ARK services under one roof thus tends to make 251 sense, object hosting can successfully be separated from name 252 mapping. An example is when a name mapping authority centrally 253 provides uniform resolution services via a protocol gateway on behalf 254 of organizations that host objects behind a variety of access 255 protocols. It is also reasonable to build value-added description 256 services that rely on the underlying services of a set of mapping 257 authorities. 259 Supporting ARKs is not for every organization. By requiring 260 specific, revealed commitments to preservation, to object access, and 261 to description, the bar for providing ARK services is higher than for 262 some other identifier schemes. On the other hand, it would be hard 263 to grant credence to a persistence promise from an organization that 264 could not muster the minimum ARK services. Not that there isn't a 265 business model for an ARK-like, description-only service built on top 266 of another organization's full complement of ARK services.
For 267 example, there might be competition at the description level for 268 abstracting and indexing a body of scientific literature archived in 269 a combination of open and fee-based repositories. The description- 270 only service would have no direct commitment to the objects, but 271 would act as an intermediary, forwarding commitment statements from 272 object hosting services to requestors. 274 1.4. Definition of Identifier 276 An identifier is not a string of character data - an identifier is an 277 association between a string of data and an object. This abstraction 278 is necessary because without it a string is just data. It's nonsense 279 to talk about a string's breaking, or about its being strong, 280 maintained, and authentic. But as a representative of an 281 association, a string can do, metaphorically, the things that we 282 expect of it. 284 Without regard to whether an object is physical, digital, or 285 conceptual, to identify it is to claim an association between it and 286 a representative string, such as "Jane" or "ISBN 0596000278". What 287 gives a claim credibility is a set of verifiable assertions, or 288 metadata, about the object, such as age, height, title, or number of 289 pages. In other words, the association is made manifest by a record 290 (e.g., a cataloging or other metadata record) that vouches for it. 292 In the complete absence of any testimony (metadata) regarding an 293 association, a would-be identifier string is a meaningless sequence 294 of characters. To keep an externally visible but otherwise internal 295 string from being perceived as an identifier by outsiders, for 296 example, it suffices for an organization not to disclose the nature 297 of its association. For our immediate purpose, actual existence of 298 an association record is more important than its authenticity or 299 verifiability, which are outside the scope of this specification. 
301 It is a gift to the identification process if an object carries its 302 own name as an inseparable part of itself, such as an identifier 303 imprinted on the first page of a document or embedded in a data 304 structure element of a digital document header. In cases where the 305 object is large, unwieldy, or unavailable (such as when licensing 306 restrictions are in effect), a metadata record that includes the 307 identifier string will usually suffice. That record becomes a 308 conveniently manipulable object surrogate, acting as both an 309 association "receipt" and "declaration". 311 Note that our definition of identifier extends the one in use for 312 Uniform Resource Identifiers [URI]. The present document still 313 sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for 314 the string part of an identifier, but the context should make the 315 meaning clear. 317 2. ARK Anatomy 319 An ARK is represented by a sequence of characters (a string) that 320 contains the label, "ark:", optionally preceded by the beginning part 321 of a URL. Here is a diagrammed example. 323 http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff 324 \___________________/ \__/ \___/ \______/ \____________/ 325 (replaceable) | | | Qualifier 326 | ARK Label | | (NMA-supported) 327 | | | 328 Name Mapping Authority | Name (NAA-assigned) 329 Hostport (NMAH) | 330 Name Assigning Authority Number (NAAN) 332 The ARK syntax can be summarized, 334 [http://NMAH/]ark:/NAAN/Name[Qualifier] 336 where the NMAH and Qualifier parts are in brackets to indicate that 337 they are optional. 339 2.1. The Name Mapping Authority Hostport (NMAH) 341 Before the "ark:" label may appear an optional Name Mapping Authority 342 Hostport (NMAH) that is a temporary address where ARK service 343 requests may be sent. 
It consists of "http://" (or any service 344 specification valid for a URL) followed by an Internet hostname or 345 hostport combination having the same format and semantics as the 346 hostport part of a URL. The most important thing about the NMAH is 347 that it is "identity inert" from the point of view of object 348 identification. In other words, ARKs that differ only in the 349 optional NMAH part identify the same object. Thus, for example, the 350 following three ARKs are synonyms for just one information object: 352 http://loc.gov/ark:/12025/654xz321 353 http://rutgers.edu/ark:/12025/654xz321 354 ark:/12025/654xz321 356 Strictly speaking, in the realm of digital objects, these ARKs may 357 lead over time to somewhat different or diverging instances of the 358 originally named object. In an ideal world, divergence of persistent 359 objects is not desirable, but it is widely believed that digital 360 preservation efforts will inevitably lead to alterations in some 361 original objects (e.g., a format migration in order to preserve the 362 ability to display a document). If any of those objects are held 363 redundantly in more than one organization (a common preservation 364 strategy), chances are small that all holding organizations will 365 perform the same precise transformations and all maintain the same 366 object metadata. More significant divergence would be expected when 367 the holding organizations serve different audiences or compete with 368 each other. 370 The NMAH part makes an ARK into an actionable URL. As with many 371 internet parameters, it is helpful to approach the NMAH being liberal 372 in what you accept and conservative in what you propose. From the 373 recipient's point of view, the NMAH part should be treated as 374 temporary, disposable, and replaceable. From the NMA's point of 375 view, it should be chosen with the greatest concern for longevity.
A 376 carefully chosen NMAH should be at least as permanent as the 377 providing organization's own hostname. In the case of a national or 378 university library, for example, there is no reason why the NMAH 379 should not be considerably more permanent than soft-funded proxy 380 hostnames such as hdl.handle.net, dx.doi.org, and purl.org. In 381 general and over time, however, it is not unexpected for an NMAH 382 eventually to stop working and require replacement with the NMAH of a 383 currently active service provider. 385 This replacement relies on a mapping authority "resolver" discovery 386 process, of which two alternate methods are outlined in a later 387 section. The ARK, URN, Handle, and DOI schemes all use a resolver 388 discovery model that sooner or later requires matching the original 389 assigning authority with a current provider servicing that 390 authority's named objects; once found, the resolver at that provider 391 performs what amounts to a redirect to a place where the object is 392 currently held. All the schemes rely on the ongoing functionality of 393 currently mainstream technologies such as the Domain Name System 394 [DNS] and web browsers. The Handle and DOI schemes in addition 395 require that the Handle protocol layer and global server grid be 396 available at all times. 398 The practice of prepending "http://" and an NMAH to an ARK is a way 399 of creating an actionable identifier by a method that is itself 400 temporary. Assuming that infrastructure supporting [HTTP] 401 information retrieval will no longer be available one day, ARKs will 402 then have to be converted into new kinds of actionable identifiers. 403 By that time, if ARKs see widespread use, web browsers would 404 presumably evolve to perform this (currently simple) transformation 405 automatically. 407 2.2. The ARK Label Part - ark: 409 The label part distinguishes an ARK from an ordinary identifier. 
In 410 a URL found in the wild, the string, "ark:/", indicates that the URL 411 stands a reasonable chance of being an ARK. If the context warrants, 412 verification that it actually is an ARK can be done by testing it for 413 existence of the three ARK services. 415 Since nothing about an identifier syntax directly affects 416 persistence, the "ark:" label (like "urn:", "doi:", and "hdl:") 417 cannot tell you whether the identifier is persistent or whether the 418 object is available. It does tell you that the original Name 419 Assigning Authority (NAA) had some sort of hopes for it, but it 420 doesn't tell you whether that NAA is still in existence, or whether a 421 decade ago it ceased to have any responsibility for providing 422 persistence, or whether it ever had any responsibility beyond naming. 424 Only a current provider can say for certain what sort of commitment 425 it intends, and the ARK label suggests that you can query the NMAH 426 directly to find out exactly what kind of persistence is promised. 427 Even if what is promised is impersistence (i.e., a short-term 428 identifier), saying so is valuable information to the recipient. 429 Thus an ARK is a high-functioning identifier in the sense that it 430 provides access to the object, the metadata, and a commitment 431 statement, even if the commitment is explicitly very weak. 433 2.3. The Name Assigning Authority Number (NAAN) 435 Recalling that the general form of the ARK is, 437 [http://NMAH/]ark:/NAAN/Name[Qualifier] 439 the part of the ARK directly following the "ark:" is the Name 440 Assigning Authority Number (NAAN) enclosed in `/' (slash) characters. 441 This part is always required, as it identifies the organization that 442 originally assigned the Name of the object. It is used to discover a 443 currently valid NMAH and to provide top-level partitioning of the 444 space of all ARKs. 
NAANs are registered in a manner similar to URN 445 Namespaces, but they are pure numbers consisting of 5 digits or 9 446 digits. Thus, the first 100,000 registered NAAs fit compactly into 447 the 5 digits, and if growth warrants, the next billion fit into the 9 448 digit form. In either case the fixed odd number of digits helps 449 reduce the chances of finding a NAAN out of context and confusing it 450 with nearby quantities such as 4-digit dates. 452 The NAAN designates a top-level ARK namespace. Once registered for a 453 namespace, a NAAN is never re-registered. It is possible, however, 454 for there to be a succession of organizations that manage an ARK 455 namespace. 457 2.4. The Name Part 459 The part of the ARK just after the NAAN is the Name assigned by the 460 NAA, and it is also required. Semantic opaqueness in the Name part 461 is strongly encouraged in order to reduce an ARK's vulnerability to 462 era- and language-specific change. Identifier strings containing 463 linguistic fragments can create support difficulties down the road. 464 No matter how appropriate or even meaningless they are today, such 465 fragments may one day create confusion, give offense, or infringe on 466 a trademark as the semantic environment around us and our communities 467 evolves. 469 Names that look more or less like numbers avoid common problems that 470 defeat persistence and international acceptance. The use of digits 471 is highly recommended. Mixing in non-vowel alphabetic characters a 472 couple at a time is a relatively safe and easy way to achieve a 473 denser namespace (more possible names for a given length of the name 474 string). Such names have a chance of aging and traveling well. 475 Tools exist that mint, bind, and resolve opaque identifiers, with or 476 without check characters [NOID]. More on naming considerations is 477 given in a subsequent section. 479 2.5.
The Qualifier Part 481 The part of the ARK following the NAA-assigned Name is an optional 482 Qualifier. It is a string that extends the base ARK in order to 483 create a kind of service entry point into the object named by the 484 NAA. At the discretion of the providing NMA, such a service entry 485 point permits an ARK to support access to individual hierarchical 486 components and subcomponents of an object, and to variants (versions, 487 languages, formats) of components. A Qualifier may be invented by 488 the NAA or by any NMA servicing the object. 490 In form, the Qualifier is a ComponentPath, or a VariantPath, or a 491 ComponentPath followed by a VariantPath. A VariantPath is introduced 492 and subdivided by the reserved character `.', and a ComponentPath is 493 introduced and subdivided by the reserved character `/'. In this 494 example, 496 http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff 498 the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is 499 a VariantPath. The ARK Qualifier is a formalization of some 500 currently mainstream URL syntax conventions. This formalization 501 specifically reserves meanings that permit recipients to make strong 502 inferences about logical sub-object containment and equivalence based 503 only on the form of the received identifiers; there is great 504 efficiency in not having to inspect metadata records to discover such 505 relationships. NMAs are free not to disclose any of these 506 relationships merely by avoiding the reserved characters above. 507 Hierarchical components and variants are discussed further in the 508 next two sections. 510 The Qualifier, if present, differs from the Name in several important 511 respects. First, a Qualifier may have been assigned either by the 512 NAA or later by the NMA. The assignment of a Qualifier by an NMA 513 effectively amounts to an act of publishing a service entry point 514 within the conceptual object originally named by the NAA. 
For our purposes, an ARK extended with a Qualifier assigned by an NMA will be called an NMA-qualified ARK.

Second, a Qualifier assignment on the part of an NMA is made in fulfillment of its service obligations and may reflect changing service expectations and technology requirements. NMA-qualified ARKs could therefore be transient, even if the base, unqualified ARK is persistent. For example, it would be reasonable for an NMA to support access to an image object through an actionable ARK that is considered persistent even if the experience of that access changes as linking, labeling, and presentation conventions evolve and as format and security standards are updated. For an image "thumbnail", that NMA could also support an NMA-qualified ARK that is considered impersistent because the thumbnail will be replaced with higher resolution images as network bandwidth and CPU speeds increase. At the same time, for an originally scanned, high-resolution master, the NMA could publish an NMA-qualified ARK that is itself considered persistent. Of course, the NMA must be able to return its separate commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs, and to any NAA-qualified ARKs that it supports.

A third difference between a Qualifier and a Name concerns the semantic opaqueness constraint. When an NMA-qualified ARK is to be used as a transient service entry point into a persistent object, the priority given to semantic opaqueness observed by the NAA in the Name part may be relaxed by the NMA in the Qualifier part. If service priorities in the Qualifier take precedence over persistence, short-term usability considerations may recommend somewhat semantically laden Qualifier strings.

Finally, not only is the set of Qualifiers supported by an NMA mutable, but different NMAs may support different Qualifier sets for the same NAA-identified object.
In this regard the NMAs act independently of each other and of the NAA.

The next two sections describe how ARK syntax may be used to declare, or to avoid declaring, certain kinds of relatedness among qualified ARKs.

2.5.1.  ARKs that Reveal Object Hierarchy

An NAA or NMA may choose to reveal the presence of a hierarchical relationship between objects using the `/' (slash) character after the Name part of an ARK. Some authorities will choose not to disclose this information, while others will go ahead and disclose so that manipulators of large sets of ARKs can infer object relationships by simple identifier inspection; for example, this makes it possible for a system to present a collapsed view of a large search result set.

If the ARK contains an internal slash after the NAAN, the piece to its left indicates a containing object. For example, publishing an ARK of the form,

   ark:/12025/654/xz/321

is equivalent to publishing three ARKs,

   ark:/12025/654/xz/321
   ark:/12025/654/xz
   ark:/12025/654

together with a declaration that the first object is contained in the second object, and that the second object is contained in the third.

Revealing the presence of hierarchy is completely up to the assigner (NMA or NAA). It is hard enough to commit to one object's name, let alone to three objects' names and to a specific, ongoing relatedness among them.
Thus, regardless of whether hierarchy was present initially, the assigner, by not using slashes, reveals no shared inferences about hierarchical or other inter-relatedness in the following ARKs:

   ark:/12025/654_xz_321
   ark:/12025/654_xz
   ark:/12025/654xz321
   ark:/12025/654xz
   ark:/12025/654

Note that slashes around the ARK's NAAN (/12025/ in these examples) are not part of the ARK's Name and therefore do not indicate the existence of some sort of NAAN super object containing all objects in its namespace. A slash must have at least one non-structural character (one that is neither a slash nor a period) on both sides in order for it to separate recognizable structural components. So initial or final slashes may be removed, and double slashes may be converted into single slashes.

2.5.2.  ARKs that Reveal Object Variants

An NAA or NMA may choose to reveal the possible presence of variant objects or object components using the `.' (period) character after the Name part of an ARK. Some authorities will choose not to disclose this information, while others will go ahead and disclose so that manipulators of large sets of ARKs can infer object relationships by simple identifier inspection; for example, this makes it possible for a system to present a collapsed view of a large search result set.

If the ARK contains an internal period after the Name, the piece to its left is a base name and the piece to its right, up to the end of the ARK or to the next period, is a suffix. A Name may have more than one suffix, for example,

   ark:/12025/654.24
   ark:/12025/xz4/654.24
   ark:/12025/654.20v.78g.f55

There are two main rules. First, if two ARKs share the same base name but have different suffixes, the corresponding objects were considered variants of each other (different formats, languages, versions, etc.) by the assigner (NMA or NAA).
Thus, the following ARKs are variants of each other:

   ark:/12025/654.20v.78g.f55
   ark:/12025/654.321xz
   ark:/12025/654.44

Second, publishing an ARK with a suffix implies the existence of at least one variant identified by the ARK without its suffix. The ARK otherwise permits no further assumptions about what variants might exist. So publishing the ARK,

   ark:/12025/654.20v.78g.f55

is equivalent to publishing the four ARKs,

   ark:/12025/654.20v.78g.f55
   ark:/12025/654.20v.78g
   ark:/12025/654.20v
   ark:/12025/654

Revealing the possibility of variants is completely up to the assigner. It is hard enough to commit to one object's name, let alone to multiple variants' names and to a specific, ongoing relatedness among them. The assigner is the sole arbiter of what constitutes a variant within its namespace, and whether to reveal that kind of relatedness by using periods within its names.

A period must have at least one non-structural character (one that is neither a slash nor a period) on both sides in order for it to separate recognizable structural components. So initial or final periods may be removed, and adjacent periods may be converted into a single period. Multiple suffixes should be arranged in sorted order (pure ASCII collating sequence) at the end of an ARK.

2.6.  Character Repertoires

The Name and Qualifier parts are strings of visible ASCII characters and should be less than 128 bytes in length. The length restriction keeps the ARK short enough to append ordinary ARK request strings without running into transport restrictions (e.g., within HTTP GET requests). Characters may be letters, digits, or any of these seven characters:

   = # * + @ _ $

The following characters may also be used, but their meanings are reserved:

   % - . /

The characters `/' and `.' are ignored if either appears as the last character of an ARK.
If used internally, they allow a name assigner to reveal object hierarchy and object variants as previously described.

Hyphens are considered to be insignificant and are always ignored in ARKs. A `-' (hyphen) may appear in an ARK for readability, or it may have crept in during the formatting and wrapping of text, but it must be ignored in lexical comparisons. As in a telephone number, hyphens have no meaning in an ARK. It is always safe for an NMA that receives an ARK to remove any hyphens found in it. As a result, like the NMAH, hyphens are "identity inert" in comparing ARKs for equivalence. For example, the following ARKs are equivalent for purposes of comparison and ARK service access:

   ark:/12025/65-4-xz-321
   http://sneezy.dopey.com/ark:/12025/654--xz32-1
   ark:/12025/654xz321

The `%' character is reserved for %-encoding all other octets that would appear in the ARK string, in the same manner as for URIs [URI]. A %-encoded octet consists of a `%' followed by two hex digits; for example, "%7d" stands in for `}'. Lower case hex digits are preferred to reduce the chances of false acronym recognition; thus it is better to use "%acT" instead of "%ACT". The character `%' itself must be represented using "%25". As with URNs, %-encoding permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) that have less restricted character repertoires [URNBIB].

2.7.  Normalization and Lexical Equivalence

To determine if two or more ARKs identify the same object, the ARKs are compared for lexical equivalence after first being normalized. Since ARK strings may appear in various forms (e.g., having different NMAHs), normalizing them minimizes the chances that comparing two ARK strings for equality will fail unless they actually identify different objects. In a specified-host ARK (one having an NMAH), the NMAH never participates in such comparisons.
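
As a minimal sketch of the two "identity inert" parts described so far (the NMAH and hyphens), the following Python fragment reduces the three example ARKs from the hyphen discussion to a single comparison key. The function names strip_nmah and comparison_key are hypothetical and not part of this specification, and the fragment deliberately covers only NMAH and hyphen removal, not complete normalization.

```python
import re

def strip_nmah(ark: str) -> str:
    """Drop an optional leading "http://hostport/" (the NMAH), if present."""
    return re.sub(r"^http://[^/]*/", "", ark)

def comparison_key(ark: str) -> str:
    """Remove the NMAH first, then every hyphen; all else is left untouched."""
    return strip_nmah(ark).replace("-", "")

examples = [
    "ark:/12025/65-4-xz-321",
    "http://sneezy.dopey.com/ark:/12025/654--xz32-1",
    "ark:/12025/654xz321",
]

# All three example ARKs collapse to the same key.
keys = {comparison_key(a) for a in examples}
```

Removing the NMAH before removing hyphens keeps a hyphenated hostname from leaking into the key.
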
Normalization of an ARK for the purpose of octet-by-octet equality comparison with another ARK consists of four steps. First, any upper case letters in the "ark:" label and the two characters following a `%' are converted to lower case. The case of all other letters in the ARK string must be preserved. Second, any NMAH part is removed (everything from an initial "http://" up to the next slash) and all hyphens are removed.

Third, structural characters (slash and period) are normalized. Initial and final occurrences are removed, and two structural characters in a row (e.g., // or ./) are replaced by the first character, iterating until each occurrence has at least one non-structural character on either side. In addition, if there are any components with a period on the left and a slash on the right, either the component and the preceding period must be moved to the end of the Name part or the ARK must be thrown out as malformed.

The fourth and final step is to arrange the suffixes in ASCII collating sequence (that is, to sort them) and to remove duplicate suffixes, if any. It is also permissible to throw out ARKs for which the suffixes are not sorted.

The resulting ARK string is now normalized. Comparisons between normalized ARKs are case-sensitive, meaning that upper case letters are considered different from their lower case counterparts.

To keep ARK string variation to a minimum, no reserved ARK characters should be %-encoded unless it is deliberately to conceal their reserved meanings. No non-reserved ARK characters should ever be %-encoded. Finally, no %-encoded character should ever appear in an ARK in its decoded form.

3.  Naming Considerations

The most important threats faced by persistence providers include such things as funding loss, natural disaster, political and social upheaval, processing faults, and errors in human oversight.
There is nothing that an identifier scheme can do about such things. Still, a few observed identifier failures and inconveniences can be traced back to naming practices that we now know to be less than optimal for persistence.

3.1.  ARKs Embedded in Language

The ARK has different goals from the URI, so it has different character set requirements. Because linguistic constructs imperil persistence, for ARKs non-ASCII character support is unimportant. ARKs and URIs share goals of transcribability and transportability within web documents, so characters are required to be visible, non-conflicting with HTML/XML syntax, and not subject to tampering during transmission across common transport gateways. Add the goal of making an undelimited ARK recognizable in running prose, as in ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma, period) end up being excluded from the ARK lest the end of a phrase or sentence be mistaken for part of the ARK.

This consideration has more direct effect on ARK usability in a natural language context than it has on ARK persistence. The same is true of the rule preventing hyphens from having lexical significance. It is fine to publish ARKs with hyphens in them (e.g., such as the output of UUID/GUID generators), but the uniform treatment of hyphens as insignificant reduces the possibility of users transcribing identifiers that will have been broken through unpredictable hyphenation by word processors. Any measure that reduces user irritation with an identifier will increase its chances of survival.

3.2.  Objects Should Wear Their Identifiers

A valuable technique for provision of persistent objects is to try to arrange for the complete identifier to appear on, with, or near its retrieved object.
An object encountered at a moment in time when its discovery context has long since disappeared could then easily be traced back to its metadata, to alternate versions, to updates, etc. This has seen reasonable success, for example, in book publishing and software distribution. An identifier string only has meaning when its association is known, and this is a very sure, simple, and low-tech method of reminding everyone exactly what that association is.

3.3.  Names are Political, not Technological

If persistence is the goal, a deliberate local strategy for systematic name assignment is crucial. Names must be chosen with great care. Poorly chosen and managed names will devastate any persistence strategy, and they do not discriminate by identifier scheme. Whether a mistakenly re-assigned name is a URN, DOI, PURL, URL, or ARK, the damage - failed access and confusion - is not mitigated more in one scheme than in another. Conversely, in-house efforts to manage names responsibly will go much further towards safeguarding persistence than any choice of naming scheme or name resolution technology.

Branding (e.g., at the corporate or departmental level) is important for funding and visibility, but substrings representing brands and organizational names should be given a wide berth except when absolutely necessary in the hostname (the identity-inert) part of the ARK. These substrings are not only unstable because organizations change frequently, but they are also dangerous because successor organizations often have political or legal reasons to actively suppress predecessor names and brands. Any measure that reduces the chances of future political or legal pressure on an identifier will decrease the chances that our descendants will be obliged to deliberately break it.

3.4.  Choosing a Hostname or NMA

Hostnames appearing in any identifier meant to be persistent must be chosen with extra care. The tendency in hostname selection has traditionally been to choose a token with recognizable attributes, such as a corporate brand, but that tendency wreaks havoc with persistence that is supposed to outlive brands, corporations, subject classifications, and natural language semantics (e.g., what did the three letters "gay" mean in 1958, 1978, and 1998?). Today's recognized and correct attributes are tomorrow's stale or incorrect attributes. In making hostnames (any names, actually) long-term persistent, it helps to eliminate recognizable attributes to the extent possible. This affects selection of any name based on URLs, including PURLs and the explicitly disposable NMAHs.

There is no excuse for a provider that manages its internal names impeccably not to exercise the same care in choosing what could be an exceptionally durable hostname, especially if it would form the prefix for all the provider's URL-based external names. Registering an opaque hostname in the ".org" or ".net" domain would not be a bad start. Another way is to publish your ARKs with an organizational domain name that will be mapped by DNS to an appropriate NMA host. This makes for shorter names with less branding vulnerability.

It is a mistake to think that hostnames are inherently unstable. If you require brand visibility, that may be a fact of life. But things are easier if yours is the brand of a long-lived cultural memory institution such as a national or university library or archive.
Well-chosen hostnames from organizations that are sheltered from the direct effects of a volatile marketplace can easily provide longer-lived global resolvers than the domain names explicitly or implicitly used as starting points for global resolution by indirection-based persistent identifier schemes. For example, it is hard to imagine circumstances under which the Library of Congress' domain name would disappear sooner than, say, "handle.net".

For smaller libraries, archives, and preservation organizations, there is a natural concern about whether they will be able to keep their web servers and domain names in the face of uncertain funding. One option is to form or join a consortium [N2T] of like-minded organizations with the purpose of providing mutual preservation support. The first goal of such a consortium would be to perpetually rent a hostname on which to establish a web server that simply redirects incoming member organization requests to the appropriate member server; using ARKs, for example, a 150-member consortium could run a very small server (24x7) that contained nothing more than 150 rewrite rules in its configuration file. Even more helpful would be additional consortial support for a member organization that was unable to continue providing services and needed to find a successor archival organization. This would be a low-cost, low-tech way to publish ARKs (or URLs) under highly persistent hostnames.

There are no obvious reasons why the organizations registering DNS names, URN Namespaces, and DOI publisher IDs should have among them one that is intrinsically more fallible than the next. Moreover, it is a misconception that the demise of DNS and of HTTP need adversely affect the persistence of URLs.
At such a time, certainly URLs from the present day might not then be actionable by our present-day mechanisms, but resolution systems for future non-actionable URLs are no harder to imagine than resolution systems for present-day non-actionable URNs and DOIs. There is no more stable a namespace than one that is dead and frozen, and that would then characterize the space of names bearing the "http://" prefix. It is useful to remember that just because hostnames have been carelessly chosen in their brief history does not mean that they are unsuitable in NMAHs (and URLs) intended for use in situations demanding the highest level of persistence available in the Internet environment. A well-planned name assignment strategy is everything.

3.5.  Assigners of ARKs

A Name Assigning Authority (NAA) is an organization that creates (or delegates creation of) long-term associations between identifiers and information objects. Examples of NAAs include national libraries, national archives, and publishers. An NAA may arrange with an external organization for identifier assignment. The US Library of Congress, for example, allows OCLC (the Online Computer Library Center, a major world cataloger of books) to create associations between Library of Congress call numbers (LCCNs) and the books that OCLC processes. A cataloging record is generated that testifies to each association, and the identifier is included by the publisher, for example, in the front matter of a book.

An NAA does not so much create an identifier as create an association. The NAA first draws an unused identifier string from its namespace, which is the set of all identifiers under its control. It then records the assignment of the identifier to an information object having sundry witnessed characteristics, such as a particular author and modification date.
A namespace is usually reserved for an NAA by agreement with recognized community organizations (such as IANA and ISO) that all names containing a particular string be under its control. In the ARK an NAA is represented by the Name Assigning Authority Number (NAAN).

The ARK namespace reserved for an NAA is the set of names bearing its particular NAAN. For example, all strings beginning with "ark:/12025/" are under control of the NAA registered under 12025, which might be the National Library of Finland. Because each NAA has a different NAAN, names from one namespace cannot conflict with those from another. Each NAA is free to assign names from its namespace (or delegate assignment) according to its own policies. These policies must be documented in a manner similar to the declarations required for URN Namespace registration [URNNID].

To register for a NAAN, please read about the mapping authority discovery file in the next section and send email to ark@cdlib.org.

3.6.  NAAN Namespace Management

Every NAA must have a namespace management strategy. A time-honored technique is to hierarchically partition a namespace into subnamespaces using prefixes that guarantee non-collision of names in different partitions. This practice is strongly encouraged for all NAAs, especially when subnamespace management will be delegated to other departments, units, or projects within an organization. For example, with a NAAN that is assigned to a university and managed by its main library, care should be taken to reserve semantically opaque prefixes that will set aside large parts of the unused namespace for future assignments. Prefix-based partition management is an important responsibility of the NAA.

This sort of delegation by prefix is well-used in the formation of DNS names and ISBN identifiers.
An important difference is that in the former, the hierarchy is deliberately exposed and in the latter it is hidden. Rather than using lexical boundary markers such as the period (`.') found in domain names, the ISBN uses a publisher prefix but doesn't disclose where the prefix ends and the publisher's assigned name begins. This practice of non-disclosure, borrowed from the ISBN and ISSN schemes, is encouraged in assigning ARKs, because it reduces the visibility of an assertion that is probably not important now and may become a vulnerability later.

Reasonable prefixes for assigned names usually consist of consonants and digits and are 1-5 characters in length. For example, the constant prefix "x9t" might be delegated to a book digitization project that creates identifiers such as

   http://444.berkeley.edu/ark:/28722/x9t38rk45c

If longevity is the goal, it is important to keep the prefixes free of recognizable semantics; for example, using an acronym representing a project or a department is discouraged. At the same time, you may wish to set aside a subnamespace for testing purposes under a prefix such as "fk..." that can serve as a visual clue and reminder to maintenance staff that this "fake" identifier was never published.

There are other measures one can take to avoid user confusion, transcription errors, and the appearance of accidental semantics when creating identifiers. If you are generating identifiers automatically, pure numeric identifiers are likely to be semantically opaque enough, but it's probably useful to avoid leading zeroes because some users mistakenly treat them as optional, thinking (arithmetically) that they don't contribute to the "value" of the identifier.
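
The following Python fragment is one possible sketch of automatic generation of pure numeric names that never begin with a zero. The function name mint_numeric and the default length of 8 digits are assumptions made for illustration only; this specification prescribes neither, and tools such as [NOID] work differently. The optional prefix argument stands for a delegated opaque prefix like the "x9t" mentioned above.

```python
import secrets

def mint_numeric(prefix: str = "", digits: int = 8) -> str:
    """Return prefix + a random digit string whose first digit is never 0.

    Hypothetical helper: a real minter must also record each name and
    reject any candidate that has already been assigned.
    """
    if digits < 1:
        raise ValueError("digits must be at least 1")
    first = secrets.choice("123456789")  # no leading zero
    rest = "".join(secrets.choice("0123456789") for _ in range(digits - 1))
    return prefix + first + rest

name = mint_numeric()             # bare numeric name
test_name = mint_numeric("fk")    # name in a "fake" testing subnamespace
```

Random selection by itself does not guarantee uniqueness, so the sketch is only useful together with a registry of names already assigned.
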
If you need lots of identifiers and you don't want them to get too long, you can mix digits with consonants (but avoid vowels since they might accidentally spell words) to get more identifiers without increasing the string length. In this case you may not want more than two letters in a row because that limit reduces the chance of generating acronyms. Generator tools such as [NOID] provide support for these sorts of identifiers, and can also add a computed check character as a guarantee against the most common transcription errors.

3.7.  Sub-Object Naming

As mentioned previously, semantically opaque identifiers are very useful for long-term naming of abstract objects; however, it may be appropriate to extend these names with less opaque extensions that reference contemporary service entry points (sub-objects) in support of the object. Sub-object extensions beginning with a digit or underscore (`_') are reserved for the possibility of developing a future registry of canonical service points (e.g., numeric references to versions, formats, languages, etc).

4.  Finding a Name Mapping Authority

In order to derive an actionable identifier (these days, a URL) from an ARK, a hostport (hostname or hostname plus port combination) for a working Name Mapping Authority (NMA) must be found. An NMA is a service that is able to respond to the three basic ARK service requests. Relying on registration and client-side discovery, NMAs make known which NAAs' identifiers they are willing to service.

Upon encountering an ARK, a user (or client software) looks inside it for the optional NMAH part (the hostport of the NMA's ARK service). If it contains an NMAH that is working, this NMAH discovery step may be skipped; the NMAH effectively uses the beginning of an ARK to cache the results of a prior mapping authority discovery process.
If a new NMAH needs to be found, the client looks inside the ARK again for the NAAN (Name Assigning Authority Number). Querying a global database, it then uses the NAAN to look up all current NMAHs that service ARKs issued by the identified NAA. The global database is key, and two specific methods for querying it are given in this section.

A third very promising method, called the Name-to-Thing [N2T] Resolver, is being explored. It is a low-cost, highly stable, consortially maintained NMAH that simply exists to support actionable HTTP-based URLs for as long as HTTP is used. One of its big advantages over the other two methods and the URN, Handle, DOI, and PURL methods, is that N2T addresses the namespace splitting problem. When objects maintained by one NMA are inherited by more than one successor NMA, until now one of those successors would be required to maintain forwarding tables on behalf of the other successors.

In the interests of long-term persistence, however, ARK mechanisms are first defined in high-level, protocol-independent terms so that mechanisms may evolve and be replaced over time without compromising fundamental service objectives. Either or both specific methods given here may eventually be supplanted by better methods since, by design, the ARK scheme does not depend on a particular method, but only on having some method to locate an active NMAH.

At the time of issuance, at least one NMAH for an ARK should be prepared to service it. That NMA may or may not be administered by the Name Assigning Authority (NAA) that created it. Consider the following hypothetical example of providing long-term access to a cancer research journal. The publisher wishes to turn a profit and the National Library of Medicine wishes to preserve the scholarly record.
An agreement might be struck whereby the publisher would act as the NAA and the national library would archive the journal issue when it appears, but without providing direct access for the first six months. During the first six months of peak commercial viability, the publisher would retain exclusive delivery rights and would charge access fees. Again, by agreement, both the library and the publisher would act as NMAs, but during that initial period the library would redirect requests for issues less than six months old to the publisher. At the end of the waiting period, the library would then begin servicing requests for issues older than six months by tapping directly into its own archives. Meanwhile, the publisher might routinely redirect incoming requests for older issues to the library. Long-term access is thereby preserved, and so is the commercial incentive to publish content.

Although it will be common for an NAA also to run an NMA service, it is never a requirement. Over time NAAs and NMAs will come and go. One NMA will succeed another, and there might be many NMAs serving the same ARKs simultaneously (e.g., as mirrors or as competitors). There might also be asymmetric but coordinated NMAs as in the library-publisher example above.

4.1.  Looking Up NMAHs in a Globally Accessible File

This subsection describes a way to look up NMAHs using a simple name authority table represented as a plain text file. For efficient access the file may be stored in a local filesystem, but it needs to be reloaded periodically to incorporate updates. It is not expected that the size of the file or frequency of update should impose an undue maintenance or searching burden any time soon, for even a primitive linear search of a file with ten-thousand NAAs is a subsecond operation on modern server machines.
The proposed file strategy is similar to the /etc/hosts file strategy that supported Internet host address lookup for a period of years before the advent of DNS.

The name authority table file is updated on an ongoing basis and is available for copying over the internet from the California Digital Library at http://www.cdlib.org/inside/diglib/ark/natab and from a number of mirror sites. The file contains comment lines (lines that begin with `#') explaining the format and giving the file's modification time, reloading address, and NAA registration instructions. There is even a Perl script that processes the file embedded in the file's comments. As of February 2006, currently registered Name Assigning Authorities are:

   12025   National Library of Medicine
   12026   Library of Congress
   12027   National Agriculture Library
   13030   California Digital Library
   13038   World Intellectual Property Organization
   20775   University of California San Diego
   29114   University of California San Francisco
   28722   University of California Berkeley
   21198   University of California Los Angeles
   15230   Rutgers University
   13960   Internet Archive
   64269   Digital Curation Centre
   62624   New York University
   67531   University of North Texas
   27927   Ithaka Electronic-Archiving Initiative
   12148   Bibliotheque nationale de France / National Library of France
   78319   Google
   88435   Princeton University
   78428   University of Washington
   89901   Archives of Region of Vastra Gotaland and City of Gothenburg, Sweden
   80444   Northwest Digital Archives
   25593   Emory University
   25031   University of Kansas
   17101   Centre for Ecology & Hydrology, UK
   65323   University of Calgary

A snapshot of the name authority table file appears in an appendix.

4.2.  Looking up NMAHs Distributed via DNS

This subsection introduces a method for looking up NMAHs that is based on the method for discovering URN resolvers described in [NAPTR]. It relies on querying the DNS system already installed in the background infrastructure of most networked computers. A query is submitted to DNS asking for a list of resolvers that match a given NAAN. DNS distributes the query to the particular DNS servers that can best provide the answer, unless the answer can be found more quickly in a local DNS cache as a side-effect of a recent query. Responses come back inside Name Authority Pointer (NAPTR) records. The normal result is one or more candidate NMAHs.

In its full generality the [NAPTR] algorithm ambitiously accommodates a complex set of preferences, orderings, protocols, mapping services, regular expression rewriting rules, and DNS record types. This subsection proposes a drastic simplification of it for the special case of ARK mapping authority discovery. The simplified algorithm is called Maptr. It uses only one DNS record type (NAPTR) and restricts most of its field values to constants. The following hypothetical excerpt from a DNS data file for the NAAN known as 12026 shows three example NAPTR records ready to use with the Maptr algorithm.

   12026.ark.arpa.
   ;; US Library of Congress
   ;;        order pref flags service regexp replacement
   IN NAPTR  0     0    "h"   "ark"   "USLC" lhc.nlm.nih.gov:8080
   IN NAPTR  0     0    "h"   "ark"   "USLC" foobar.zaf.org
   IN NAPTR  0     0    "h"   "ark"   "USLC" sneezy.dopey.com

All the fields are held constant for Maptr except for the "flags", "regexp", and "replacement" fields. The "service" field contains the constant value "ark" so that NAPTR records participating in the Maptr algorithm will not be confused with other NAPTR records.
The "order" and "pref" fields are held to 0 (zero) and otherwise
ignored for now; the algorithm may evolve to use these fields for
ranking decisions when usage patterns and local administrative needs
are better understood.

When a Maptr query returns a record with a flags field of "h" (for
hostport, a Maptr extension to the NAPTR flags), the replacement
field contains the NMAH (hostport) of an ARK service provider. When
a query returns a record with a flags field of "" (the empty string),
the client needs to submit a new query containing the domain name
found in the replacement field. This second sort of record exploits
the distributed nature of DNS by redirecting the query to another
domain name. It looks like this.

   12345.ark.arpa.
   ;; Digital Library Consortium
   ;;          order pref flags service regexp replacement
    IN NAPTR   0     0    ""    "ark"   ""     dlc.spct.org.

Here is the Maptr algorithm for ARK mapping authority discovery. In
it replace <NAAN> with the NAAN from the ARK for which an NMAH is
sought.

(1) Initialize the DNS query: type=NAPTR,
    query=<NAAN>.ark.arpa.

(2) Submit the query to DNS and retrieve (NAPTR) records,
    discarding any record that does not have "ark" for the service
    field.

(3) All remaining records with a flags field of "h" contain
    candidate NMAHs in their replacement fields. Set them aside, if
    any.

(4) Any record with an empty flags field ("") has a replacement
    field containing a new domain name to which a subsequent query
    should be redirected. For each such record, set
    query=<replacement> then go to step (2). When all such records
    have been recursively exhausted, go to step (5).

(5) All redirected queries have been resolved and a set of
    candidate NMAHs has been accumulated from step (3). If there
    are zero NMAHs, exit - no mapping authority was found.
    If there is one or more NMAH, choose one using any criteria you
    wish, then exit.

A Perl script that implements this algorithm is included here.

   #!/usr/bin/env perl

   use strict;
   use warnings;
   use Net::DNS;                       # include simple DNS package

   my $qtype = "NAPTR";                # initialize query type
   my $naa = shift;                    # get NAAN script argument
   die "usage: $0 NAAN\n" unless defined $naa;
   my $mad = Net::DNS::Resolver->new;  # mapping authority discovery
   my %seen;                           # domain names already queried

   maptr("$naa.ark.arpa");             # call maptr - that's it

   sub maptr {                         # recursive maptr algorithm
       my $dname = shift;              # domain name as argument
       return if $seen{lc $dname}++;   # guard against redirect loops
       my $query = $mad->query($dname, $qtype);
       return unless $query;           # non-productive query
       foreach my $rr ($query->answer) {
           next                        # skip records of wrong type
               if ($rr->type ne $qtype);
           next                        # step (2): service must be "ark"
               if (lc($rr->service) ne "ark");
           my $flags = lc($rr->flags); # accessor yields unquoted value
           if ($flags eq "") {
               maptr($rr->replacement);        # step (4): recurse
           } elsif ($flags eq "h") {
               print $rr->replacement, "\n";   # candidate NMAH
           }
       }
   }

The global database thus distributed via DNS and the Maptr algorithm
can easily be seen to mirror the contents of the Name Authority Table
file described in the previous section.

5. Generic ARK Service Definition

An ARK request's output is delivered information; examples include
the object itself, a policy declaration (e.g., a promise of support),
a descriptive metadata record, or an error message. The experience
of object delivery is expected to be an evolving mix of information
that reflects changing service expectations and technology
requirements; contemporary examples include such things as an object
summary and component links formatted for human consumption. ARK
services must be couched in high-level, protocol-independent terms if
persistence is to outlive today's networking infrastructural
assumptions.
The high-level ARK service definitions listed below are followed in
the next section by a concrete method (one of many possible methods)
for delivering these services with today's technology.

5.1. Generic ARK Access Service (access, location)

Returns (a copy of) the object or a redirect to the same, although a
sensible object proxy may be substituted. Examples of sensible
substitutes include,

   - a table of contents instead of a large complex document,
   - a home page instead of an entire web site hierarchy,
   - a rights clearance challenge before accessing protected data,
   - directions for access to an offline object (e.g., a book),
   - a description of an intangible object (a disease, an event), or
   - an applet acting as "player" for a large multimedia object.

May also return a discriminated list of alternate object locators.
If access is denied, returns an explanation of the object's current
(perhaps permanent) inaccessibility.

5.2. Generic Policy Service (permanence, naming, etc.)

Returns declarations of policy and support commitments for given
ARKs. Declarations are returned in either a structured metadata
format or a human readable text format; sometimes one format may
serve both purposes. Policy subareas may be addressed in separate
requests, but the following areas should be covered: object
permanence, object naming, object fragment addressing, and
operational service support.

The permanence declaration for an object is a rating defined with
respect to an identified permanence provider (guarantor), which will
be the NMA. It may include the following aspects.

   (a) "object availability" - whether and how access to the object
       is supported (e.g., online 24x7, or offline only),

   (b) "identifier validity" - under what conditions the identifier
       will be or has been re-assigned,

   (c) "content invariance" - under what conditions the content of
       the object is subject to change, and

   (d) "change history" - access to corrections, migrations, and
       revisions, whether through links to the changed objects
       themselves or through a document summarizing the change
       history.

One approach to a permanence rating framework, conceived
independently from ARKs, is given in [NLMPerm]. Under ongoing
development and limited deployment at the US National Library of
Medicine, it identifies the following "permanence levels":

   Not Guaranteed: No commitment has been made to retain this
   resource. It could become unavailable at any time. Its
   identifier could be changed.

   Permanent: Dynamic Content: A commitment has been made to keep
   this resource permanently available. Its identifier will always
   provide access to the resource. Its content could be revised or
   replaced.

   Permanent: Stable Content: A commitment has been made to keep
   this resource permanently available. Its identifier will always
   provide access to the resource. Its content is subject only to
   minor corrections or additions.

   Permanent: Unchanging Content: A commitment has been made to
   keep this resource permanently available. Its identifier will
   always provide access to the resource. Its content will not
   change.

Naming policy for an object includes an historical description of the
NAA's (and its successor NAA's) policies regarding differentiation of
objects. Since it is the NMA that responds to requests for policy
statements, it is useful for the NMA to be able to produce or
summarize these historical NAA documents.
Naming policy may include the following aspects.

   (i) "similarity" - (or "unity") the limit, defined by the NAA,
       to the level of dissimilarity beyond which two similar
       objects warrant separate identifiers but before which they
       share one single identifier, and

   (ii) "granularity" - the limit, defined by the NAA, to the level
       of object subdivision beyond which sub-objects do not warrant
       separately assigned identifiers but before which sub-objects
       are assigned separate identifiers.

Subnaming policy for an object describes the qualifiers that the NMA,
in fulfilling its ongoing and evolving service obligations, allows as
extensions to an NAA-assigned ARK. To the conceptual object that the
NAA named with an ARK, the NMA may add component access points and
derivatives (e.g., format migrations in aid of preservation) in order
to provide both basic and value-added services.

Addressing policy for an object includes a description of how, during
access, object components (e.g., paragraphs, sections) or views
(e.g., image conversions) may or may not be "addressed", in other
words, how the NMA permits arguments or parameters to modify the
object delivered as the result of an ARK request. If supported,
these sorts of operations would provide things like byte-ranged
fragment delivery and open-ended format conversions, or any set of
possible transformations that would be too numerous to list or to
identify with separately assigned ARKs.

Operational service support policy includes a description of general
operational aspects of the NMA service, such as after-hours staffing
and trouble reporting procedures.

5.3. Generic Description Service

Returns a description of the object. Descriptions are returned in
either a structured metadata format or a human readable text format;
sometimes one format may serve both purposes.
A description must at a minimum answer the who, what, when, and where
questions concerning an expression of the object. Standalone
descriptions should be accompanied by the modification date and
source of the description itself. May also return discriminated
lists of ARKs that are related to the given ARK.

6. Overview of The HTTP URL Mapping Protocol (THUMP)

The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (a
kind of identifier) and asking such questions as, what information
does this identify and how permanent is it? [THUMP] is in fact one
specific method under development for delivering ARK services. The
protocol runs over HTTP to exploit the web browser's current
pre-eminence as user interface to the Internet. THUMP is designed so
that a person can enter ARK requests directly into the location field
of current browser interfaces. Because it runs over HTTP, THUMP can
be simulated and tested within keyboard-based [TELNET] sessions.

The asker (a person or client program) starts with an identifier,
such as an ARK or a URL. The identifier reveals to the asker (or
allows the asker to infer) the Internet host name and port number of
a server system that responds to questions. Here, this is just the
NMAH that is obtained by inspection and possibly lookup based on the
ARK's NAAN. The asker then sets up an HTTP session with the server
system, sends a question via a THUMP request (contained within an
HTTP request), receives an answer via a THUMP response (contained
within an HTTP response), and closes the session. That concludes the
connected portion of the protocol.

A THUMP request is a string of characters beginning with a `?'
(question mark) that is appended to the identifier string. The
resulting string is sent as an argument to HTTP's GET command.
Request strings too long for GET may be sent using HTTP's POST
command. The three most common requests correspond to three
degenerate special cases that keep the user's learning and typing
burden low. First, a simple key with no request at all is the same
as an ordinary access request. Thus a plain ARK entered into a
browser's location field behaves much like a plain URL, and returns
access to the primary identified object, for instance, an HTML
document.

The second special case is a minimal ARK description request string
consisting of just "?". For example, entering the string,

   ark.nlm.nih.gov/12025/psbbantu?

into the browser's location field directly precipitates a request for
a metadata record describing the object identified by
ark:/12025/psbbantu. The browser, unaware of THUMP, prepares and
sends an HTTP GET request in the same manner as for a URL. THUMP is
designed so that the response (indicated by the returned HTTP content
type) is normally displayed, whether the output is structured for
machine processing (text/plain) or formatted for human consumption
(text/html).

In the following example THUMP session, each line has been annotated
to include a line number and whether it was the client or server that
sent it. Without going into much depth, the session has three pieces
separated from each other by blank lines: the client's piece (lines
1-3), the server's HTTP/THUMP response headers (4-7), and the body of
the server's response (8-17). The first and last lines (1 and 17)
correspond to the client's steps to start the TCP session and the
server's steps to end it, respectively.

    1  C: [opens session]
       C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
       C:
       S: HTTP/1.1 200 OK
    5  S: Content-Type: text/plain
       S: THUMP-Status: 0.1 200 OK
       S:
       S: |set: NLM | 12025/psbbantu? | 20030731
       S:      | http://ark.nlm.nih.gov/ark:/12025/psbbantu?
   10  S: here: 1 | 1 | 1
       S:
       S: erc:
       S: who:   Lederberg, Joshua
       S: what:  Studies of Human Families for Genetic Linkage
   15  S: when:  1974
       S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
       S: [closes session]

The first two server response lines (4-5) above are typical of HTTP.
The next line (6) is peculiar to THUMP, and indicates the THUMP
version and a normal return status. The balance of the response
consists of a record set header (lines 8-10) and a single metadata
record (12-16) that comprises the ARK description service response.
The record set header identifies (8-9) who created the set, what its
title is, when it was created, and where an automated process can
access the set; it ends in a line (10) whose respective sub-elements
indicate that here in this communication the recipient can expect to
find 1 record, starting at the record numbered 1, from a set
consisting of a total of 1 record (i.e., here is the entire set,
consisting of exactly one record).

The returned record (12-16) is in the format of an Electronic
Resource Citation [ERC], which is discussed in more detail in the
next section. For now, note that it contains four elements that
answer the top priority questions regarding an expression of the
object: who played a major role in expressing it, what the
expression was called, when it was created, and where the expression
may be found. This quartet of elements comes up again and again in
ERCs.

The third degenerate special case of an ARK request (and no other
cases will be described in this document) is the string "??",
corresponding to a minimal permanence policy request. It can be seen
in use appended to an ARK (on line 2) in the example session that
follows.

    1  C: [opens session]
       C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
       C:
       S: HTTP/1.1 200 OK
    5  S: Content-Type: text/plain
       S: THUMP-Status: 0.1 200 OK
       S:
       S: |set: NLM | 12025/psbbantu?? | 20030731
       S:      | http://ark.nlm.nih.gov/ark:/12025/psbbantu??
   10  S: here: 1 | 1 | 1
       S:
       S: erc:
       S: who:   Lederberg, Joshua
       S: what:  Studies of Human Families for Genetic Linkage
   15  S: when:  1974
       S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
       S: erc-support:
       S: who:   USNLM
       S: what:  Permanent, Unchanging Content
   20  S: when:  20010421
       S: where: http://ark.nlm.nih.gov/yy22948
       S: [closes session]

Again, a single metadata record (lines 12-21) is returned, but it
consists of two segments. The first segment (12-16) gives the same
basic citation information as in the previous example. It is
returned in order to establish context for the persistence
declaration in the second segment (17-21).

Each segment in an ERC tells a different story relating to the
object, so although the same four questions (elements) appear in
each, the answers depend on the segment's story type. While the
first segment tells the story of an expression of the object, the
second segment tells the story of the support commitment made to it:
who made the commitment, what the nature of the commitment was, when
it was made, and where a fuller explanation of the commitment may be
found.

7. Overview of Electronic Resource Citations (ERCs)

An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
kind of object description that uses Dublin Core Kernel [Kernel]
metadata elements. The ERC with Kernel elements provides a simple,
compact, and printable record for holding data associated with an
information resource.
By design, Kernel metadata balances the needs for expressive power,
very simple machine processing, and direct human manipulation.

A founding principle of Kernel metadata is that direct human contact
with metadata will be a necessary and sufficient condition for the
near term rapid development of metadata standards, systems, and
services. Thus the machine-processable Kernel elements must only
minimally strain people's ability to read, understand, change, and
transmit ERCs without relying on specialized software tools as
intermediaries. The basic ERC needs to be succinct, transparent, and
trivially parseable by software.

In the current Internet, it is natural to give serious consideration
to XML as an exchange format because of predictions that it will
obviate many ad hoc formats and programs, and unify much of the
world's information under one reliable data structuring discipline
that is easy to generate, verify, parse, and render. It appears,
however, that XML is still only catching on after years of standards
work and implementation experience. The reasons for this are
unclear, but for now very simple XML interpretation is still out of
reach. Another important caution is that XML structures are hard on
the eyeballs, taking up an amount of display (and page) space that
significantly exceeds that of traditional formats. Until these
conflicts with ERC principle are resolved, XML is not a first choice
for representing ERCs. Borrowing instead from the data structuring
format that underlies the successful spread of email and web
services, the first ERC format uses [ANVL], which is based on email
and HTTP headers [RFC822]. There is a naturalness to ANVL's
label-colon-value format (seen in the previous section) that barely
needs explanation to a person beginning to enter ERC metadata.

Besides simplicity of ERC system implementation and data entry
mechanics, ERC semantics (what the record and its constituent parts
mean) must also be easy to explain. ERC semantics are based on a
reformulation and extension of the Dublin Core [DCORE] hypothesis,
which suggests that the fifteen Dublin Core metadata elements have a
key role to play in cross-domain resource description. The ERC
design recognizes that the Dublin Core's primary contribution is the
international, interdisciplinary consensus that identified fifteen
semantic buckets (element categories), regardless of how they are
labeled. The ERC then adds a definition for a record and some
minimal compliance rules. In pursuing the limits of simplicity, the
ERC design combines and relabels some Dublin Core buckets to isolate
a tiny kernel (subset) of four elements for basic cross-domain
resource description.

For the cross-domain kernel, the ERC uses the four basic elements -
who, what, when, and where - to pretend that every object in the
universe can have a uniform minimal description. Each has a name or
other identifier, a location, some responsible person or party, and a
date. It doesn't matter what type of object it is, or whether one
plans to read it, interact with it, smoke it, wear it, or navigate
it. Of course, this approach is flawed because uniformity of
description for some object types requires more semantic contortion
and sacrifice than for others. That is why at the beginning of this
document, the ARK was said to be suited to objects that accommodate
reasonably regular electronic description.

While insisting on uniformity at the most basic level provides
powerful cross-domain leverage, the semantic sacrifice is great for
many applications. So the ERC also permits a semantically rich and
nuanced description to co-exist in a record along with a basic
description. In that way both sophisticated and naive recipients of
the record can extract the level of meaning from it that best suits
their needs and abilities. Key to unlocking the richer description
is a controlled vocabulary of ERC record types (not explained in this
document) that permit knowledgeable recipients to apply defined sets
of additional assumptions to the record.

7.1. ERC Syntax

An ERC record is a sequence of metadata elements ending in a blank
line. An element consists of a label, a colon, and an optional
value. Here is an example of a record with five elements.

   erc:
   who:   Gibbon, Edward
   what:  The Decline and Fall of the Roman Empire
   when:  1781
   where: http://www.ccel.org/g/gibbon/decline/

A long value may be folded (continued) onto the next line by
inserting a newline and indenting the next line. A value can be thus
folded across multiple lines. Here are two example elements, each
folded across four lines.

   who/created: University of California, San Francisco, AIDS
      Program at San Francisco General Hospital | University
      of California, San Francisco, Center for AIDS Prevention
      Studies
   what/Topic:
      Heart Attack | Heart Failure
      | Heart
      Diseases

An element value folded across several lines is treated as if the
lines were joined together on one long line. For example, the second
element from the previous example is considered equivalent to

   what/Topic: Heart Attack | Heart Failure | Heart Diseases

An element value may contain multiple values, each one separated from
the next by a `|' (pipe) character. The element from the previous
example contains three values.
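
The folding and multiple-value rules above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions, not a reference parser: the function name parse_elements
is hypothetical, a line is taken to be a continuation whenever it
starts with whitespace, and the first colon on an element line is
taken to end the label.

```python
def parse_elements(record_text):
    """Return a list of (label, values) pairs from ERC-style text.

    Assumption: a line starting with whitespace continues the previous
    element, and the first ':' on an element line ends the label.
    """
    elements = []  # list of [label, joined_value_string]
    for line in record_text.splitlines():
        if not line.strip():
            break  # a blank line ends the record
        if line[0].isspace() and elements:
            # Folded line: join it onto the previous value with one space.
            joined = elements[-1][1] + " " + line.strip()
            elements[-1][1] = joined.strip()
        else:
            label, _, value = line.partition(":")
            elements.append([label.strip(), value.strip()])
    # Split each joined value on '|' into its individual values.
    return [
        (label, [v.strip() for v in value.split("|")] if value else [])
        for label, value in elements
    ]


record = (
    "what/Topic:\n"
    "   Heart Attack | Heart Failure\n"
    "   | Heart\n"
    "   Diseases\n"
)
print(parse_elements(record))
```

Run on the folded what/Topic element, the sketch yields one element
with three values, matching the one-line equivalent shown above.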

For annotation purposes, any line beginning with a `#' (hash)
character is treated as if it were not present; this is a "comment"
line (a feature not available in email or HTTP headers). For
example, the following element is spread across four lines and
contains two values:

   what/Topic:
      Heart Attack
   #  | Heart Failure -- hold off until next review cycle
      | Heart Diseases

7.2. ERC Stories

An ERC record is organized into one or more distinct segments, where
each segment tells a story about a different aspect of the
information resource. A segment boundary occurs whenever a segment
label (an element beginning with "erc") is encountered. The basic
label "erc:" introduces the story of an object's expression (e.g.,
its publication, installation, or performance). The label
"erc-about:" introduces the story of an object's content (what it is
about) and "erc-support:" introduces the story of a support
commitment made to it. A story segment that concerns the ERC itself
is introduced by the label "erc-from:". It is an important segment
that tells the story of the ERC's provenance. Elements beginning
with "erc" are reserved for segment labels and their associated story
types. From an earlier example, here is an ERC with two segments.

   erc:
   who:   Lederberg, Joshua
   what:  Studies of Human Families for Genetic Linkage
   when:  1974
   where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
   erc-support:
   who:   NIH/NLM/LHNCBC
   what:  Permanent, Unchanging Content
   # Note to ops staff: date needs verification.
   when:  2001 04 21
   where: http://ark.nlm.nih.gov/yy22948

Segment stories are told according to journalistic tradition. While
any number of pertinent elements may appear in a segment, priority is
placed on answering the questions who, what, when, and where at the
beginning of each segment so that readers can make the most important
selection or rejection decisions as soon as possible. To make things
simple, the listed ordering of the questions is maintained in each
segment (as it happens, most people who have been exposed to this
story telling technique are already familiar with the above
ordering).

The four questions are answered by using corresponding element
labels. The four element labels can be re-used in each story
segment, but their meaning changes depending on the segment (the
story type) in which they appear. In the example above, "who" is
first used to name a document's author and subsequently used to name
the permanence guarantor (provider). Similarly, "when" first lists
the date of object creation and in the next segment lists the date of
a commitment decision. Four labels appearing across three segments
effectively map to twelve semantically distinct elements. Distinct
element meanings are mapped to Dublin Core elements in a later
section.

7.3. The ERC Anchoring Story

Each ERC contains an anchoring story. It is usually the first
segment labeled "erc:" and it concerns an "anchoring" expression of
the object. An "anchoring" expression is the one that a provider
deemed the most suitable basic referent given the audience and
application for which it produced the ERC. If it sounds like the
provider has great latitude in choosing its anchoring expression, it
is because it does. A typical anchoring story in an ERC for a
born-digital document would be the story of the document's release on
a web site; such a document would then be the anchoring expression.
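
The segment rules of Section 7.2 (a label beginning with "erc" opens
a new story, and the same four labels are re-used within each story)
can be sketched in Python as follows. This is a rough illustration,
not a reference implementation: the function name split_segments is
hypothetical, and for brevity the input is assumed to be the lines of
a single record that have already been unfolded to one element per
line.

```python
def split_segments(lines):
    """Group ERC lines into (segment_label, {label: value}) pairs.

    Assumptions: one element per line (no folding), a single record,
    and lines beginning with '#' are comments to be skipped.
    """
    segments = []
    for line in lines:
        if line.startswith("#"):
            continue  # comment lines are treated as if not present
        label, _, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if label.startswith("erc"):
            segments.append((label, {}))  # segment label opens a new story
        elif segments:
            segments[-1][1][label] = value
    return segments


erc = [
    "erc:",
    "who: Lederberg, Joshua",
    "what: Studies of Human Families for Genetic Linkage",
    "when: 1974",
    "where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf",
    "erc-support:",
    "who: NIH/NLM/LHNCBC",
    "what: Permanent, Unchanging Content",
    "# Note to ops staff: date needs verification.",
    "when: 2001 04 21",
    "where: http://ark.nlm.nih.gov/yy22948",
]
for segment, elements in split_segments(erc):
    print(segment, "->", elements["who"])
```

Keeping each segment's elements in its own dictionary reflects the
point that "who" in the "erc" story and "who" in the "erc-support"
story are distinct elements that merely share a label.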

An anchoring story need not be the central descriptive goal of an ERC
record. For example, a museum provider may create an ERC for a
digitized photograph of a painting but choose to anchor it in the
story of the original painting instead of the story of the electronic
likeness; although the ERC may through other segments prove to be
centrally concerned with describing the electronic likeness, the
provider may have chosen this particular anchoring story in order to
make the ERC visible in a way that is most natural to patrons (who
would find the Mona Lisa under da Vinci sooner than they would find
it under the name of the person who snapped the photograph or scanned
the image). In another example, a provider that creates an ERC for a
dramatic play as an abstract work has the task of describing a piece
of intangible intellectual property. To anchor this abstract object
in the concrete world, if only through a derivative expression, it
makes sense for the provider to choose a suitable printed edition of
the play as the anchoring object expression (to describe in the
anchoring story) of the ERC.

The anchoring story has special rules designed to keep ERC processing
simple and predictable. Each of the four basic elements (who, what,
when, and where) must be present, unless a best effort to supply it
fails. In the event of failure, the element still appears but a
special value (described later) is used to explain the missing value.
While the requirement that each of the four elements be present only
applies to the anchoring story segment, as usual these elements
appear at the beginning of the segment and may only be used in the
prescribed order. A minimal ERC would normally consist of just an
anchoring story and the element quartet, as illustrated in the next
example.

   erc:
   who:   National Research Council
   what:  The Digital Dilemma
   when:  2000
   where: http://books.nap.edu/html/digital%5Fdilemma

A minimal ERC can be abbreviated so that it resembles a traditional
compact bibliographic citation that is nonetheless completely machine
processable. The required elements and ordering make it possible to
eliminate the element labels, as shown here.

   erc: National Research Council | The Digital Dilemma | 2000
        | http://books.nap.edu/html/digital%5Fdilemma

7.4. ERC Elements

As mentioned, the four basic ERC elements (who, what, when, and
where) take on different specific meanings depending on the story
segment in which they are used. By appearing in each segment, albeit
in different guises, the four elements serve as a valuable mnemonic
device - a kind of checklist - for constructing minimal story
segments from scratch. Again, it is only in the anchoring segment
that all four elements are mandatory.

Here are some mappings between ERC elements and Dublin Core [DCORE]
elements.

   Segment    ERC Element  Equivalent Dublin Core Element
   ---------  -----------  ------------------------------
   erc        who          Creator/Contributor/Publisher
   erc        what         Title
   erc        when         Date
   erc        where        Identifier
   erc-about  who
   erc-about  what         Subject
   erc-about  when         Coverage (temporal)
   erc-about  where        Coverage (spatial)

The basic element labels may also be qualified to add nuances to the
semantic categories that they identify. Elements are qualified by
appending a `/' (slash) and a qualifier term. Often qualifier terms
appear as the past tense form of a verb because it makes re-using
qualifiers among elements easier.

   who/published: ...
   when/published: ...
   where/published: ...
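
As a hedged illustration of the mapping table and the slash
qualifier syntax above, the following Python sketch looks up the
Dublin Core equivalent of a possibly qualified element label. The
names DUBLIN_CORE, split_label, and dublin_core_for are hypothetical;
the table contents are copied from the mappings listed above, and the
"erc-about" / "who" row is left out because no equivalent is listed
for it.

```python
# (segment, element) -> Dublin Core element, per the table above.
DUBLIN_CORE = {
    ("erc", "who"): "Creator/Contributor/Publisher",
    ("erc", "what"): "Title",
    ("erc", "when"): "Date",
    ("erc", "where"): "Identifier",
    ("erc-about", "what"): "Subject",
    ("erc-about", "when"): "Coverage (temporal)",
    ("erc-about", "where"): "Coverage (spatial)",
}


def split_label(label):
    """Split 'who/published' into ('who', 'published').

    The qualifier is None when the label carries no '/'.
    """
    element, _, qualifier = label.partition("/")
    return element, (qualifier or None)


def dublin_core_for(segment, label):
    """Return the listed Dublin Core equivalent, or None if none is listed."""
    element, _qualifier = split_label(label)
    return DUBLIN_CORE.get((segment, element))


print(split_label("who/published"))
print(dublin_core_for("erc", "when/published"))
print(dublin_core_for("erc-about", "who"))
```

The lookup key includes the segment label because the same element
label maps to different Dublin Core elements in different stories.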

   Using past tense verbs for qualifiers also reminds providers and
   recipients that element values contain transient assertions that may
   have been true once, but that tend to become less true over time.
   Recipients that don't understand the meaning of a qualifier can fall
   back onto the semantic category (bucket) designated by the
   unqualified element label. Inevitably recipients (people and
   software) will have diverse abilities in understanding elements and
   qualifiers.

   Any number of other elements and qualifiers may be used in
   conjunction with the quartet of basic segment questions. The only
   semantic requirement is that they pertain to the segment's story.
   Also, it is only the four basic elements that change meaning
   depending on their segment context. All other elements have meaning
   independent of the segment in which they appear. If an element label
   stripped of its qualifier is still not recognized by the recipient, a
   second fall-back position is to ignore it and rely on the four basic
   elements.

   Elements may be Canonical, Provisional, or Local. Canonical
   elements are officially recognized via a registry as part of the
   metadata vernacular. All elements, qualifiers, and segment labels
   used in this document up until now belong to that vernacular.
   Provisional elements are also officially recognized via the registry,
   but have only been proposed for inclusion in the vernacular. To be
   promoted to the vernacular, a provisional element passes through a
   vetting process during which its documentation must be in order and
   its community acceptance demonstrated. Local elements are any
   elements not officially recognized in the registry. The registry
   [Kernel] is a work in progress.

   Local elements can be immediately distinguished from Canonical or
   Provisional elements because all terms that begin with an upper case
   letter are reserved for spontaneous local use. No term beginning
   with an upper case letter will ever be assigned Canonical or
   Provisional status, so it should be safe to use such terms for local
   purposes. Any recipient of external ERCs containing such terms will
   understand them to be part of the originating provider's local
   metadata dialect. Here's an example ERC with three segments, one
   local element, and two local qualifiers. The segment boundaries have
   been emphasized by comment lines (which, as before, are ignored by
   processors).

     erc:
     who:    Bullock, TH | Achimowicz, JZ | Duckrow, RB
             | Spencer, SS | Iragui-Madoz, VJ
     what:   Bicoherence of intracranial EEG in sleep,
             wakefulness and seizures
     when:   1997 12 00
     where:  http://cogprints.soton.ac.uk/%{
                documents/disk0/00/00/01/22/index.html %}
     in:     EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
     IDcode: cog00000122
     # ---- new segment ----
     erc-about:
     what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
             | Cooperativity | Subdural | Hippocampus | Higher moment
     # ---- new segment ----
     erc-from:
     who:    NIH/NLM/NCBI
     what:   pm9546494
     when/Reviewed: 1998 04 18 021600
     where:  http://ark.nlm.nih.gov/12025/pm9546494?

   The local element "IDcode" immediately precedes the "erc-about"
   segment, which itself contains an element with the local qualifier
   "Subcategory". The second to last element also carries the local
   qualifier "Reviewed". Finally, what might be a provisional element
   "in" appears near the end of the first segment. It might have been
   proposed as a way to complete a citation for an object originally
   appearing inside another object (such as an article appearing in a
   journal or an encyclopedia).
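
   The labeling rules of this section (the `/' qualifier separator, the
   upper case convention for local terms, and the two fall-back
   positions) can be sketched in Python. This is a minimal illustration,
   not part of the ERC definition: the function names and the two sets
   standing in for a recipient's knowledge of the registry are
   assumptions made for the example.

```python
# Hypothetical stand-ins for what one recipient happens to understand.
# The real registry [Kernel] is a work in progress, so these sets are
# assumptions made only for illustration.
KNOWN_ELEMENTS = {"who", "what", "when", "where"}
KNOWN_QUALIFIERS = {"published"}


def split_label(label):
    """Split 'when/Reviewed' into ('when', 'Reviewed').

    The qualifier is None when the label carries no `/' separator.
    """
    base, _, qualifier = label.partition("/")
    return base, (qualifier or None)


def is_local(term):
    """Terms beginning with an upper case letter are reserved for local use."""
    return bool(term) and term[0].isupper()


def interpret(label):
    """Return the most specific label this recipient understands.

    First fall-back: an unknown (or local) qualifier is dropped and the
    unqualified element label is used instead.
    Second fall-back: an unrecognized element is ignored (None).
    """
    base, qualifier = split_label(label)
    if base not in KNOWN_ELEMENTS:
        return None
    if qualifier is None or is_local(qualifier) or qualifier not in KNOWN_QUALIFIERS:
        return base
    return f"{base}/{qualifier}"
```

   With these assumed sets, interpret("what/Subcategory") falls back to
   "what", while interpret("IDcode") returns None because the recipient
   does not share the provider's local dialect.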

7.5. ERC Element Values

   ERC element values tend to be straightforward strings. If the
   provider intends something special for an element, it will so
   indicate with markers at the beginning of its value string. The
   markers are designed to be uncommon enough that they would not likely
   occur in normal data except by deliberate intent. Markers can only
   occur near the beginning of a string, and once any octet of
   non-marker data has been encountered, no further marker processing is
   done for the element value. In the absence of markers the string is
   considered pure data; this has been the case with all the examples
   seen thus far. The fullest form of an element value with all three
   optional markers in place looks like this.

     VALUE = [markup_flags] (:ccode) , DATA

   In processing, the first non-whitespace character of an ERC element
   value is examined. An initial `[' is reserved to introduce a
   bracketed set of markup flags (not described in this document) that
   ends with `]'. If ERC data is machine-generated, each value string
   may be preceded by "[]" to prevent any of its data from being
   mistaken for markup flags. Once past the optional markup, the
   remaining value may optionally begin with a controlled code. A
   controlled code always has the form "(:ccode)", for example,

     who:  (:unkn) Anonymous
     what: (:791) Bee Stings

   Any string after such a code is taken to be an uncontrolled (e.g.,
   natural language) equivalent. The code "unkn" indicates a
   conventional explanation for a missing value (stating that the value
   is unknown). The remainder of the string makes an equivalent
   statement in a form that the provider deemed most suitable to its
   (probably human) audience. The code "791" could be a fixed numeric
   topic identifier within an unspecified topic vocabulary. Any code
   may be ignored by those that do not understand it.
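
   The first two marker steps described above (optional markup flags,
   then an optional controlled code) can be sketched in Python. This is
   an illustration under stated assumptions rather than a reference
   parser: the function name parse_value and the regular expressions are
   hypothetical, the contents of the markup flags are returned
   uninterpreted because this document does not describe them, and the
   sketch stops after the controlled code, returning the remaining
   string with only surrounding whitespace trimmed.

```python
import re

# Optional "[flags]" group at the start of the value (assumed pattern).
_FLAGS = re.compile(r"\s*\[([^\]]*)\]")
# Optional "(:ccode)" group following the flags (assumed pattern).
_CCODE = re.compile(r"\s*\(:([^)\s]+)\)")


def parse_value(value):
    """Return (markup_flags, ccode, rest) for one ERC element value.

    markup_flags and ccode are None when the corresponding marker is
    absent; an empty "[]" yields an empty string for markup_flags.
    """
    flags = ccode = None
    match = _FLAGS.match(value)
    if match:
        flags, value = match.group(1), value[match.end():]
    match = _CCODE.match(value)
    if match:
        ccode, value = match.group(1), value[match.end():]
    return flags, ccode, value.strip()


print(parse_value("(:unkn) Anonymous"))
print(parse_value("[] (:791) Bee Stings"))
```

   A recipient that does not understand a returned code can simply
   discard it and keep the remaining string.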

   There are several codes to explain different ways in which a required
   element's value may go missing.

     (:unac)  temporarily inaccessible
     (:unal)  unallowed, suppressed intentionally
     (:unap)  not applicable, makes no sense
     (:unas)  value unassigned (e.g., Untitled)
     (:unav)  value unavailable indefinitely
     (:unkn)  unknown (e.g., Anonymous, Inconnue)
     (:etal)  too numerous to list (et alia)
     (:none)  never had a value, never will
     (:null)  explicitly empty
     (:tba)   to be assigned or announced later

   Once past an optional controlled code, the remaining string value is
   subjected to one final test. If the next non-whitespace character is
   a `,' (comma), it indicates that the string value is "sort-friendly".
   This means that the value (a) is laid out with an inverted word order
   useful for sorting items having comparably laid out element values
   (items might be the containing ERC records) and (b) may contain other
   commas that indicate inversion points should it become necessary to
   recover the value in natural word order. Typically, this feature is
   used to express Western-style personal names in
   family-name-given-name order. It can also be used wherever natural
   word order might make sorting tricky, such as when data contains
   titles or corporate names. Here are some example elements.

     who:   , van Gogh, Vincent
     who:,Howell, III, PhD, 1922-1987, Thurston
     who:, Acme Rocket Factory, Inc., The
     who:, Mao Tse Tung
     who:, McCartney, Paul, Sir,
     what:, Health and Human Services, United States Government
            Department of, The,

   There are rules to use in recovering a copy of the value in natural
   word order, if desired. The above example strings have the following
   natural word order values, respectively.

     Vincent van Gogh
     Thurston Howell, III, PhD, 1922-1987
     The Acme Rocket Factory, Inc.
     Mao Tse Tung
     Sir Paul McCartney
     The United States Government Department of Health and Human Services

7.6. ERC Element Encoding and Dates

   Some characters that need to appear in ERC element values might
   conflict with special characters used for structuring ERCs, so there
   needs to be a way to include them as literal characters that are
   protected from special interpretation. This is accomplished through
   an encoding mechanism that resembles the %-encoding familiar to [URI]
   handlers.

   The ERC encoding mechanism also uses `%', but instead of taking two
   following hexadecimal digits, it takes one non-alphanumeric character
   or two alphabetic characters that cannot be mistaken for hex digits.
   It is designed not to be confused with normal web-style %-encoding.
   In particular it can be decoded without risking unintended decoding
   of normal %-encoded data (which would introduce errors). Here are
   the one-character (non-alphanumeric) ERC encoding extensions.

     ERC  Purpose
     ---  ------------------------------------------------
     %!   decodes to the element separator `|'
     %%   decodes to a percent sign `%'
     %.   decodes to a comma `,'
     %_   a non-character used as syntax shim
     %{   a non-character that begins an expansion block
     %}   a non-character that ends an expansion block

   One particularly useful construct in ERC element values is the pair
   of special encoding markers ("%{" and "%}") that indicate an
   "expansion" block. Whatever string of characters they enclose will
   be treated as if none of the contained whitespace (SPACEs, TABs,
   Newlines) were present. This comes in handy for writing long,
   multi-part URLs in a readable way. For example, the value in

     where: http://foo.bar.org/node%{
                    ? db = foo
                    & start = 1
                    & end = 5
                    & buf = 2
                    & query = foo + bar + zaf
             %}

   is decoded into an equivalent element, but with a correct and intact
   URL:

     where:
     http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf

   In a parting word about ERC element values, a commonly recurring
   value type is a date, possibly followed by a time. ERC dates use the
   [TEMPER] format, taking on one of the following forms:

     1999                (four digit year)
     2000 12 29          (year, month, day)
     2000 12 29 235955   (year, month, day, hour, minute, second)

   In dates, all internal whitespace is squeezed out to achieve a
   normalized form suitable for lexical comparison and sorting. This
   means that the following dates

     2000 12 29 235955     (recommended for readability)
     2000 12 29 23 59 55
     20001229 23 59 55
     20001229235955        (normalized date and time)

   are all equivalent. The first form is recommended for readability.
   The last form (shortest and easiest to compute with) is the
   normalized form. Hyphens and commas are reserved to create date
   ranges and lists, for example,

     1996-2000               (a range of five years)
     1952, 1957, 1969        (a list of three years)
     1952, 1958-1967, 1985   (a mixed list of dates and ranges)
     20001229-20001231       (a range of three days)

7.7. ERC Stub Records and Internal Support

   The ERC design introduces the concept of a "stub" record, which is an
   incomplete ERC record intended to be supplemented with additional
   elements before being released as a standalone ERC record. A stub
   ERC record has no minimum required elements. It is just a group of
   elements that does not begin with "erc:" but otherwise conforms to
   the ERC record syntax.

   ERC stubs may be useful in supporting internal procedures using the
   ERC syntax. Often they rely on the convenience and accuracy of
   automatically supplied elements, even the basic ones.
   To be ready for external use, however, an ERC stub must be
   transformed into a complete ERC record having the usual required
   elements. An ERC stub record can be convenient for metadata embedded
   in a document, where elements such as location, modification date,
   and size - which one would not omit from an externalized record - are
   omitted simply because they are much better supplied by a
   computation. A separate local administrative procedure, not defined
   for ERCs in general, would effect the promotion of stubs into
   complete records.

   While the ERC is a general-purpose container for exchange of resource
   descriptions, it does not dictate how records must be internally
   stored, laid out, or assembled by data providers or recipients.
   Arbitrary internal descriptive frameworks can support ERCs simply by
   mapping (e.g., on demand) local records to the ERC container format
   and making them available for export. Therefore, to support ERCs
   there is no need for a data provider to convert internal data to be
   stored in an ERC format. On the other hand, any provider (such as
   one just getting started in the business of resource description) may
   choose to store and manipulate local data natively in the ERC format.

8. Advice to Web Clients

   This section offers some advice to web client software developers.
   The advice is necessarily tentative, because it tries to anticipate a
   series of events that might lead to native web browser support for
   ARKs.

   ARKs are envisaged to appear wherever durable object references are
   planned. Library cataloging records, literature citations, and
   bibliographies are important examples. In many of these places URLs
   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
   PURLs have been proposed as alternatives.

   The strings representing ARKs are also envisaged to appear in some of
   the places where URLs currently appear: in hypertext links (where
   they are not normally shown to users) and in rendered text (displayed
   or printed). Internet search engines, for example, tend to include
   both actionable and manifest links when listing each item found. A
   normal HTML link for which the URL is not displayed looks like this.

     Click Here

   The same link with an ARK instead of a URL:

     Click Here

   Web browsers would in general require a small modification to
   recognize and convert this ARK, via mapping authority discovery, to
   the URL form.

     Click Here

   A browser that knows how to make that conversion could also
   automatically detect and replace a non-working NMAH.

   An NAA will typically make known the associations it creates by
   publishing them in catalogs, actively advertising them, or simply
   leaving them on web sites for visitors (e.g., users, indexing
   spiders) to stumble across in browsing.

9. Security Considerations

   The ARK naming scheme poses no direct risk to computers and networks.
   Implementors of ARK services need to be aware of security issues when
   querying networks and filesystems for Name Mapping Authority
   services, and the concomitant risks from spoofing and obtaining
   incorrect information. These risks are no greater for ARK mapping
   authority discovery than for other kinds of service discovery. For
   example, recipients of ARKs with a specified hostport (NMAH) should
   treat it like a URL and be aware that the identified ARK service may
   no longer be operational.

   Apart from mapping authority discovery, ARK clients and servers
   subject themselves to all the risks that accompany normal operation
   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
   As a specialization of such protocols, an ARK service may limit
   exposure to the usual risks. Indeed, ARK services may enhance a kind
   of security by helping users identify long-term reliable references
   to information objects.

10. Authors' Addresses

   John A. Kunze
   California Digital Library
   University of California, Office of the President
   415 20th St, 4th Floor
   Oakland, CA 94612-3550, USA

   Fax:   +1 510-893-5212
   EMail: jak@ucop.edu

   R. P. C. Rodgers
   US National Library of Medicine
   8600 Rockville Pike, Bldg. 38A
   Bethesda, MD 20894, USA

   Fax:   +1 301-496-0673
   EMail: rodgers@nlm.nih.gov

11. References

   [ANVL]    J. Kunze, B. Kahle, et al, "A Name-Value Language", work
             in progress,
             http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf

   [ARK]     J. Kunze, "Towards Electronic Persistence Using ARK
             Identifiers", Proceedings of the 3rd ECDL Workshop on Web
             Archives, August 2003, (PDF)
             http://bibnum.bnf.fr/ecdl/2003/proceedings.php?f=kunze

   [DCORE]   Dublin Core Metadata Initiative, "Dublin Core Metadata
             Element Set, Version 1.1: Reference Description", July
             1999, http://dublincore.org/documents/dces/.

   [DNS]     P.V. Mockapetris, "Domain Names - Concepts and
             Facilities", RFC 1034, November 1987.

   [DOI]     International DOI Foundation, "The Digital Object
             Identifier (DOI) System", February 2001,
             http://dx.doi.org/10.1000/203.

   [ERC]     J. Kunze, "A Metadata Kernel for Electronic Permanence",
             Journal of Digital Information, Vol 2, Issue 2, January
             2002, ISSN 1368-7506, (PDF)
             http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

   [Handle]  L. Lannom, "Handle System Overview", ICSTI Forum, No. 30,
             April 1999, http://www.icsti.org/forum/30/#lannom

   [HTTP]    R. Fielding, et al, "Hypertext Transfer Protocol --
             HTTP/1.1", RFC 2616, June 1999.

   [Kernel]  Dublin Core Metadata Initiative, "Kernel Metadata Working
             Group", http://dublincore.org/groups/kernel/

   [MD5]     R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
             April 1992.

   [N2T]     CDL, "Name-to-Thing Resolver", work in progress, August
             2006, http://n2t.info

   [NAPTR]   M. Mealling, R. Daniel, "The Naming Authority Pointer
             (NAPTR) DNS Resource Record", RFC 2915, September 2000.

   [NLMPerm] M. Byrnes, "Defining NLM's Commitment to the Permanence of
             Electronic Information", ARL 212:8-9, October 2000,
             http://www.arl.org/newsltr/212/nlm.html

   [NOID]    J. Kunze, "Nice Opaque Identifiers", February 2005,
             http://www.cdlib.org/inside/diglib/ark/noid.pdf

   [PURL]    K. Shafer, et al, "Introduction to Persistent Uniform
             Resource Locators", 1996,
             http://purl.oclc.org/OCLC/PURL/INET96

   [RFC822]  D. Crocker, "Standard for the format of ARPA Internet text
             messages", RFC 822, August 1982.

   [TELNET]  J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
             RFC 854, May 1983.

   [TEMPER]  J. Kunze, "Temporal Enumerated Ranges", work in progress,
             http://www.cdlib.org/inside/diglib/ark/temperspec.pdf

   [THUMP]   K. Gamiel, J. Kunze, "The HTTP URL Mapping Protocol", work
             in progress,
             http://www.ietf.org/internet-drafts/draft-kunze-thump-00.txt

   [URI]     T. Berners-Lee, et al, "Uniform Resource Identifiers
             (URI): Generic Syntax", RFC 2396, August 1998.

   [URNBIB]  C. Lynch, et al, "Using Existing Bibliographic Identifiers
             as Uniform Resource Names", RFC 2288, February 1998.

   [URNSYN]  R. Moats, "URN Syntax", RFC 2141, May 1997.

   [URNNID]  L. Daigle, et al, "URN Namespace Definition Mechanisms",
             RFC 2611, June 1999.

12. Appendix: ARK Implementations

   Currently, the primary implementation activity is at the California
   Digital Library (CDL),

     http://ark.cdlib.org/

   housed at the University of California Office of the President, where
   over 200,000 ARKs have been assigned to objects that the CDL owns or
   controls. Some experimentation in ARKs is taking place at JSTOR, the
   Digital Curation Centre, WIPO and at the University of California's
   San Diego, San Francisco, and Berkeley campuses.

   The US National Library of Medicine (NLM) also has an experimental,
   prototype ARK service under development. It is being made available
   for purposes of demonstrating various aspects of the ARK system, but
   is subject to temporary or permanent withdrawal (without notice)
   depending upon the circumstances of the small research group
   responsible for making it available. It is described at:

     http://ark.nlm.nih.gov/

   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

13. Appendix: Current ARK Name Authority Table

   This appendix contains a copy of the Name Authority Table (a file) at
   the time of writing. It may be loaded into a local filesystem (e.g.,
   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
   NMAHs (Name Mapping Authority Hostports). It contains Perl code that
   can be copied into a standalone script that processes the table (as a
   file). Because this is still a proposed file, none of the values in
   it are real.
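
   For readers who would rather not use the Perl script embedded in the
   table, the same kind of NAA-to-NMAH lookup can be sketched in Python.
   This is only an illustration under stated assumptions: the function
   name lookup_nmahs is hypothetical, the SAMPLE text is a short excerpt
   in the table's format (whose values, as noted, are not real), and
   unlike the Perl script the sketch stops after the first matching NAA
   block.

```python
def lookup_nmahs(lines, naa):
    """Return the NMA hostports listed under the given NAA number.

    Assumes the layout described in the table's own header comments: an
    NAA line starts at the beginning of a line with the number and a
    colon, and its hostports follow on indented lines.
    """
    hostports = []
    in_block = False
    for line in lines:
        line = line.rstrip("\n")
        if not in_block:
            # Skip everything until the requested NAA line is found.
            in_block = line.startswith(naa + ":")
            continue
        if line[:1].isspace() and line.strip():
            hostports.append(line.split()[0])  # first field is the hostport
        elif line.startswith("#") and len(line) > 1:
            continue                           # comment inside the block
        else:
            break                              # bare "#" or the next NAA line
    return hostports


# Hypothetical excerpt in the table's format; none of the values are real.
SAMPLE = """\
#
# National Library of Medicine
12025: http://www.nlm.nih.gov/xxx/naapolicy.html
\tark.nlm.nih.gov\tUSNLM
\tfoobar.zaf.org\tUCSF
#
# Library of Congress
12026: http://www.loc.gov/xxx/naapolicy.html
\tfoobar.zaf.org\tUSLC
#
"""

print(lookup_nmahs(SAMPLE.splitlines(), "12025"))
```

   Passing an open file object instead of SAMPLE.splitlines() would let
   the same function read a locally installed copy of the table.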

#
# Name Assigning Authority / Name Mapping Authority Lookup Table
#     Last change:  2007.06.05
#     Reload from:  http://ark.nlm.nih.gov/etc/natab
#     Mirrored at:  http://www.cdlib.org/inside/diglib/ark/natab
#     To register:  mailto:ark@cdlib.org?Subject=naareg
#     Process with: Perl script at end of this file (optional)
#
# Each NAA appears at the beginning of a line with the NAA Number
# first, a colon, and an ARK or URL to a statement of naming policy
# (see http://ark.cdlib.org for an example).
# All the NMA hostports that service an NAA are listed, one per
# line, indented, after the corresponding NAA line.
#
# National Library of Medicine
12025: http://www.nlm.nih.gov/xxx/naapolicy.html
        ark.nlm.nih.gov         USNLM
        foobar.zaf.org          UCSF
#
# Library of Congress
12026: http://www.loc.gov/xxx/naapolicy.html
        foobar.zaf.org          USLC
#
# National Agriculture Library
12027: http://www.nal.gov/xxx/naapolicy.html
        foobar.zaf.gov:80       USNAL
#
# California Digital Library
13030: http://www.cdlib.org/inside/diglib/ark/
        ark.cdlib.org           CDL
#
# World Intellectual Property Organization
13038: http://www.wipo.int/xxx/naapolicy.html
        www.wipo.int            WIPO
#
# University of California San Diego
20775: http://library.ucsd.edu/xxx/naapolicy.html
        ucsd.edu                UCSD
#
# University of California San Francisco
29114: http://library.ucsf.edu/xxx/naapolicy.html
        ucsf.edu                UCSF
#
# University of California Berkeley
28722: http://library.berkeley.edu/xxx/naapolicy.html
        berkeley.edu            UCB
#
# University of California Los Angeles
21198: http://library.ucla.edu/xxx/naapolicy.html
        ucla.edu                UCLA
#
# Rutgers University
15230: http://rci.rutgers.edu/xxx/naapolicy.html
        rutgers.edu             RU
#
# Internet Archive
13960: http://www.archive.org/xxx/naapolicy.html
        archive.org             IA
#
# Digital Curation Centre
64269: http://www.dcc.ac.uk/xxx/naapolicy.html
        dcc.ac.uk               DCC
#
# New York University
62624: http://library.nyu.edu/xxx/naapolicy.html
        nyu.edu                 NYU
#
# University of North Texas
67531: http://www.library.unt.edu/xxx/naapolicy.html
        unt.edu                 UNT
#
# Ithaka Electronic-Archiving Initiative
27927: http://www.ithaka.org/xxx/naapolicy.html
        ithaka.org              ITHAKA
#
# Bibliothèque nationale de France / National Library of France
12148: http://www.bnf.fr/xxx/naapolicy.html
        bnf.fr                  BNF
#
# Princeton University
88435: http://diglib.princeton.edu/xxx/naapolicy.html
        princeton.edu           PU
#
# University of Washington
78428: http://u.washington.edu/xxx/naapolicy.html
        u.washington.edu        UW
#
# Archives of Region of Västra Götaland and City of Gothenburg, Sweden
89901: http://www.arkivnamnden.org/xxx/naapolicy.html
        arkivnamnden.org        AVGG
#
# Northwest Digital Archives
80444: http://nwda.wsulibs.wsu.edu/xxx/naapolicy.html
        nwda.wsulibs.wsu.edu    NWDA
#
# Emory University
25593: http://id.library.emory.edu/xxx/naapolicy.html
        id.library.emory.edu    EMORY
#
# University of Kansas
25031: http://www.lib.ku.edu/xxx/naapolicy.html
        www.lib.ku.edu          UKANSAS
#
# Google
78319: http://www.google.com/xxx/naapolicy.html
        www.google.com          GOOGLE
#
# Centre for Ecology & Hydrology, UK
17101: http://www.ceh.ac.uk/xxx/naapolicy.html
        www.ceh.ac.uk           CEH
#
# University of Calgary
65323: http://library.ucalgary.ca/xxx/naapolicy.html
        ucalgary.ca             UCALGARY
#
#12345: reserved for examples
#
#--- end of data ---
# The following Perl script takes an NAA as argument and outputs
# the NMAs in this file listed under any matching NAA.
#
# my $naa = shift;
# while (<>) {
#     next if (! /^$naa:/);
#     while (<>) {
#         last if (! /^[#\s]./);
#         print "$1\n" if (/^\s+(\S+)/);
#     }
# }
#
# Create a g/t/nroff-safe version of this table with the UNIX command,
#
#     expand natab | sed 's/\\/\\\e/g' > natab.roff
#
# end of file

14. Copyright Notice

   Copyright (C) The IETF Trust (2007). This document is subject to the
   rights, licenses and restrictions contained in BCP 78, and except as
   set forth therein, the authors retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Expires 24 January 2008

Table of Contents

   Status of this Document
   Abstract
   1. Introduction
      1.1. Reasons to Use ARKs
      1.2. Three Requirements of ARKs
      1.3. Organizing Support for ARKs: Our Stuff vs. Their Stuff
      1.4. Definition of Identifier
   2. ARK Anatomy
      2.1. The Name Mapping Authority Hostport (NMAH)
      2.2. The ARK Label Part - ark:
      2.3. The Name Assigning Authority Number (NAAN)
      2.4. The Name Part
      2.5. The Qualifier Part
         2.5.1. ARKs that Reveal Object Hierarchy
         2.5.2. ARKs that Reveal Object Variants
      2.6. Character Repertoires
      2.7. Normalization and Lexical Equivalence
   3. Naming Considerations
      3.1. ARKS Embedded in Language
      3.2. Objects Should Wear Their Identifiers
      3.3. Names are Political, not Technological
      3.4. Choosing a Hostname or NMA
      3.5. Assigners of ARKs
      3.6. NAAN Namespace Management
      3.7. Sub-Object Naming
   4. Finding a Name Mapping Authority
      4.1. Looking Up NMAHs in a Globally Accessible File
      4.2. Looking up NMAHs Distributed via DNS
   5. Generic ARK Service Definition
      5.1. Generic ARK Access Service (access, location)
      5.2. Generic Policy Service (permanence, naming, etc.)
      5.3. Generic Description Service
   6. Overview of The HTTP URL Mapping Protocol (THUMP)
   7. Overview of Electronic Resource Citations (ERCs)
      7.1. ERC Syntax
      7.2. ERC Stories
      7.3. The ERC Anchoring Story
      7.4. ERC Elements
      7.5. ERC Element Values
      7.6. ERC Element Encoding and Dates
      7.7. ERC Stub Records and Internal Support
   8. Advice to Web Clients
   9. Security Considerations
   10. Authors' Addresses
   11. References
   12. Appendix: ARK Implementations
   13. Appendix: Current ARK Name Authority Table
   14. Copyright Notice