Network Working Group                                           J. Kunze
Internet-Draft                                California Digital Library
Expires: November 23, 2008                                    R. Rodgers
                                         US National Library of Medicine
                                                            May 22, 2008


                       The ARK Identifier Scheme
       http://www.ietf.org/internet-drafts/draft-kunze-ark-15.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on November 23, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Abstract

   The ARK (Archival Resource Key) naming scheme is designed to
   facilitate the high-quality and persistent identification of
   information objects. A founding principle of the ARK is that
   persistence is purely a matter of service and is neither inherent in
   an object nor conferred on it by a particular naming syntax. The
   best that an identifier can do is to lead users to the services that
   support robust reference. The term ARK itself refers both to the
   scheme and to any single identifier that conforms to it. An ARK has
   five components:

      [http://NMAH/]ark:/NAAN/Name[Qualifier]

   an optional and mutable Name Mapping Authority Hostport (usually a
   hostname), the "ark:" label, the Name Assigning Authority Number
   (NAAN), the assigned Name, and an optional and possibly mutable
   Qualifier supported by the NMA. The NAAN and Name together form the
   immutable persistent identifier for the object independent of the URL
   hostname. An ARK is a special kind of URL that connects users to
   three things: the named object, its metadata, and the provider's
   promise about its persistence. When entered into the location field
   of a Web browser, the ARK leads the user to the named object. That
   same ARK, inflected by appending a single question mark (`?'),
   returns a brief metadata record that is both human- and
   machine-readable. When the ARK is inflected by appending dual
   question marks (`??'), the returned metadata contains a commitment
   statement from the current provider. Tools exist for minting,
   binding, and resolving ARKs.

Table of Contents

   1.  Introduction
     1.1.  Reasons to Use ARKs
     1.2.  Three Requirements of ARKs
     1.3.  Organizing Support for ARKs: Our Stuff vs. Their Stuff
     1.4.  Definition of Identifier
   2.  ARK Anatomy
     2.1.  The Name Mapping Authority Hostport (NMAH)
     2.2.  The ARK Label Part (ark:/)
     2.3.  The Name Assigning Authority Number (NAAN)
     2.4.  The Name Part
     2.5.  The Qualifier Part
       2.5.1.  ARKs that Reveal Object Hierarchy
       2.5.2.  ARKs that Reveal Object Variants
     2.6.  Character Repertoires
     2.7.  Normalization and Lexical Equivalence
   3.  Naming Considerations
     3.1.  ARKS Embedded in Language
     3.2.  Objects Should Wear Their Identifiers
     3.3.  Names are Political, not Technological
     3.4.  Choosing a Hostname or NMA
     3.5.  Assigners of ARKs
     3.6.  NAAN Namespace Management
     3.7.  Sub-Object Naming
   4.  Finding a Name Mapping Authority
     4.1.  Looking Up NMAHs in a Globally Accessible File
   5.  Generic ARK Service Definition
     5.1.  Generic ARK Access Service (access, location)
       5.1.1.  Generic Policy Service (permanence, naming, etc.)
       5.1.2.  Generic Description Service
     5.2.  Overview of The HTTP URL Mapping Protocol (THUMP)
     5.3.  The Electronic Resource Citation (ERC)
     5.4.  Advice to Web Clients
     5.5.  Security Considerations
   6.  References
   Appendix A.  ARK Maintenance Agency
   Appendix B.  Looking up NMAHs Distributed via DNS
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   This document describes a scheme for the high-quality naming of
   information resources. The scheme, called the Archival Resource Key
   (ARK), is well suited to long-term access and identification of any
   information resources that accommodate reasonably regular electronic
   description. This includes digital documents, databases, software,
   and websites, as well as physical objects (books, bones, statues,
   etc.) and intangible objects (chemicals, diseases, vocabulary terms,
   performances). Hereafter the term "object" refers to an information
   resource. The term ARK itself refers both to the scheme and to any
   single identifier that conforms to it. A reasonably concise and
   accessible overview and rationale for the scheme is available at
   [ARK].

   Schemes for persistent identification of network-accessible objects
   are not new. In the early 1990's, the design of the Uniform Resource
   Name [RFC2141] responded to the observed failure rate of URLs by
   articulating an indirect, non-hostname-based naming scheme and the
   need for responsible name management. Meanwhile, promoters of the
   Digital Object Identifier [DOI] succeeded in building a community of
   providers around a mature software system [Handle] that supports name
   management.
   The Persistent Uniform Resource Locator [PURL] was
   another scheme that had the advantage of working with unmodified web
   browsers. ARKs represent an approach that attempts to build on the
   strengths and to avoid the weaknesses of these schemes.

   A founding principle of the ARK is that persistence is purely a
   matter of service. Persistence is neither inherent in an object nor
   conferred on it by a particular naming syntax. Nor is the technique
   of name indirection -- upon which URNs, Handles, DOIs, and PURLs are
   founded -- of central importance. Name indirection is an ancient and
   well-understood practice; new mechanisms for it keep appearing and
   distracting practitioner attention, with the Domain Name System (DNS)
   [RFC1034] being a particularly dazzling and elegant example. What is
   often forgotten is that maintenance of an indirection table is an
   unavoidable cost to the organization providing persistence, and that
   cost is equivalent across naming schemes. That indirection has
   always been a native part of the web while being so lightly utilized
   for the persistence of web-based objects indicates how unsuited most
   organizations will probably be to the task of table maintenance and
   to the much more fundamental challenge of keeping the objects
   themselves viable.

   Persistence is achieved through a provider's successful stewardship
   of objects and their identifiers. The highest level of persistence
   will be reinforced by a provider's robust contingency, redundancy,
   and succession strategies. It is further safeguarded to the extent
   that a provider's mission is shielded from funding and political
   instabilities. These are by far the major challenges confronting
   persistence providers, and no identifier scheme has any direct impact
   on them.
   In fact, some schemes may actually be liabilities for
   persistence because they create short- and long-term dependencies for
   every object access on complex, special-purpose infrastructures,
   parts of which are proprietary and all of which increase the
   carry-forward burden for the preservation community. It is for this
   reason that the ARK scheme relies only on educated name assignment
   and light use of general-purpose infrastructures that are maintained
   mostly by the internet community at large (the DNS, web servers, and
   web browsers).

1.1. Reasons to Use ARKs

   If no persistent identifier scheme contributes directly to
   persistence, why not just use URLs? A particular URL may be as
   durable an identifier as it is possible to have, but nothing
   distinguishes it from an ordinary URL to the recipient who is
   wondering if it is suitable for long-term reference. An ARK embedded
   in a URL provides some of the necessary conditions for credible
   persistence, inviting access to not one, but to three things: to the
   object, to its metadata, and to a nuanced statement of commitment
   from the provider in question (the NMA, described below) regarding
   the object. Existence of the two extra services can be probed
   automatically by appending `?' and `??' to the ARK.

   The form of the ARK also supports the natural separation of naming
   authorities into the original name assigning authority and the
   diverse multiple name mapping (or servicing) authorities that in
   succession and in parallel will take over custodial responsibilities
   from the original assigner (assuming the assigner ever held that
   responsibility) for the large majority of a long-term object's
   archival lifetime. The name mapping authority, indicated by the
   hostname part of the URL that contains the ARK, serves to launch the
   ARK into cyberspace.
   Should it ever fail (and there is no reason why
   a well-chosen hostname for a 100-year-old cultural memory institution
   shouldn't last as long as the DNS), that host name is considered
   disposable and replaceable. Again, the form of the ARK helps
   because it defines exactly how to recover the core immutable object
   identity, and simple algorithms (one based on the URN model) or even
   by-hand internet query can be used for locating another mapping
   authority.

   There are tools to assist in generating ARKs and other identifiers,
   such as [NOID] and "uuidgen", both of which rely for uniqueness on
   human-maintained registries. This document also contains some
   guidelines and considerations for managing namespaces and choosing
   hostnames with persistence in mind.

1.2. Three Requirements of ARKs

   The first requirement of an ARK is to give users a link from an
   object to a promise of stewardship for it. That promise is a
   multi-faceted covenant that binds the word of an identified service
   provider to a specific set of responsibilities. It is critical for
   the promise to come from a current provider and almost irrelevant,
   over a long period of time, what the original assigner's intentions
   were. No one can tell if successful stewardship will take place
   because no one can predict the future. Reasonable conjecture,
   however, may be based on past performance. There must be a way to
   tie a promise of persistence to a provider's demonstrated or
   perceived ability -- its reputation -- in that arena. Provider
   reputations would then rise and fall as promises are observed
   variously to be kept and broken. This is perhaps the best way we
   have for gauging the strength of any persistence promise.

   The second requirement of an ARK is to give users a link from an
   object to a description of it.
   The problem with a naked identifier
   is that without a description real identification is incomplete.
   Identifiers common today are relatively opaque, though some contain
   ad hoc clues reflecting assertions that were briefly true, such as
   where in a filesystem hierarchy an object lived during a short stay.
   Possession of both an identifier and an object is some improvement,
   but positive identification may still be uncertain since the object
   itself might not include a matching identifier or might not carry
   evidence obvious enough to reveal its identity without significant
   research. In either case, what is called for is a record bearing
   witness to the identifier's association with the object, as supported
   by a recorded set of object characteristics. This descriptive record
   is partly an identification "receipt" with which users and archivists
   can verify an object's identity after brief inspection and a
   plausible match with recorded characteristics such as title and size.

   The final requirement of an ARK is to give users a link to the object
   itself (or to a copy) if at all possible. Persistent access is the
   central duty of an ARK. Persistent identification plays a vital
   supporting role but, strictly speaking, it can be construed as no
   more than a record attesting to the original assignment of a
   never-reassigned identifier. Object access may not be feasible for
   various reasons, such as a transient service outage, a catastrophic
   loss, a licensing agreement that keeps an archive "dark" for a period
   of years, or when an object's own lack of tangible existence confuses
   normal concepts of access (e.g., a vocabulary term might be
   "accessed" through its definition). In such cases the ARK's
   identification role assumes a much higher profile.
   But attempts to
   simplify the persistence problem by decoupling access from
   identification and concentrating exclusively on the latter are of
   questionable utility. A perfect system for assigning forever unique
   identifiers might be created, but if it did so without reducing
   access failure rates, no one would be interested. The central issue
   -- which may be summed up as the "HTTP 404 Not Found" problem --
   would not have been addressed.

1.3. Organizing Support for ARKs: Our Stuff vs. Their Stuff

   An organization and the user community it serves can often be seen to
   struggle with two different areas of persistent identification: the
   Our Stuff problem and the Their Stuff problem. In the Our Stuff
   problem, we in the organization want our own objects to acquire
   persistent names. Since we possess or control these objects, our
   organization tackles the Our Stuff problem directly. Whether or not
   the objects are named by ARKs, our organization is the responsible
   party, so it can plan for, maintain, and make commitments about the
   objects.

   In the Their Stuff problem, we in the organization want others'
   objects to acquire persistent names. These are objects that we do
   not own or control, but some of which are critically important to us.
   But because they are beyond our influence as far as support is
   concerned, creating and maintaining persistent identifiers for Their
   Stuff is not especially purposeful or feasible for us to engage in.
   There is little that we can do about someone else's stuff except
   encourage their uptake or adoption of persistence services.

   Co-location of persistent access and identification services is
   natural.
   Any organization that undertakes ongoing support of true
   persistent identification (which includes description) is well-served
   if it controls, owns, or otherwise has clear internal access to the
   identified objects, and this gives it an advantage if it wishes also
   to support persistent access to outsiders. Conversely, persistent
   access to outsiders requires orderly internal collection management
   procedures that include monitoring, acquisition, verification, and
   change control over objects, which in turn requires object
   identifiers persistent enough to support auditable record keeping
   practices.

   Although organizing ARK services under one roof thus tends to make
   sense, object hosting can successfully be separated from name
   mapping. An example is when a name mapping authority centrally
   provides uniform resolution services via a protocol gateway on behalf
   of organizations that host objects behind a variety of access
   protocols. It is also reasonable to build value-added description
   services that rely on the underlying services of a set of mapping
   authorities.

   Supporting ARKs is not for every organization. By requiring
   specific, revealed commitments to preservation, to object access, and
   to description, the bar for providing ARK services is higher than for
   some other identifier schemes. On the other hand, it would be hard
   to grant credence to a persistence promise from an organization that
   could not muster the minimum ARK services. Not that there isn't a
   business model for an ARK-like, description-only service built on top
   of another organization's full complement of ARK services. For
   example, there might be competition at the description level for
   abstracting and indexing a body of scientific literature archived in
   a combination of open and fee-based repositories. The
   description-only service would have no direct commitment to the
   objects, but would act as an intermediary, forwarding commitment
   statements from object hosting services to requestors.

1.4. Definition of Identifier

   An identifier is not a string of character data -- an identifier is
   an association between a string of data and an object. This
   abstraction is necessary because without it a string is just data.
   It's nonsense to talk about a string's breaking, or about its being
   strong, maintained, and authentic. But as a representative of an
   association, a string can do, metaphorically, the things that we
   expect of it.

   Without regard to whether an object is physical, digital, or
   conceptual, to identify it is to claim an association between it and
   a representative string, such as "Jane" or "ISBN 0596000278". What
   gives a claim credibility is a set of verifiable assertions, or
   metadata, about the object, such as age, height, title, or number of
   pages. In other words, the association is made manifest by a record
   (e.g., a cataloging or other metadata record) that vouches for it.

   In the complete absence of any testimony (metadata) regarding an
   association, a would-be identifier string is a meaningless sequence
   of characters. To keep an externally visible but otherwise internal
   string from being perceived as an identifier by outsiders, for
   example, it suffices for an organization not to disclose the nature
   of its association. For our immediate purpose, actual existence of
   an association record is more important than its authenticity or
   verifiability, which are outside the scope of this specification.

   It is a gift to the identification process if an object carries its
   own name as an inseparable part of itself, such as an identifier
   imprinted on the first page of a document or embedded in a data
   structure element of a digital document header.
   In cases where the
   object is large, unwieldy, or unavailable (such as when licensing
   restrictions are in effect), a metadata record that includes the
   identifier string will usually suffice. That record becomes a
   conveniently manipulable object surrogate, acting as both an
   association "receipt" and "declaration".

   Note that our definition of identifier extends the one in use for
   Uniform Resource Identifiers [RFC3986]. The present document still
   sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for
   the string part of an identifier, but the context should make the
   meaning clear.

2. ARK Anatomy

   An ARK is represented by a sequence of characters (a string) that
   contains the label, "ark:", optionally preceded by the beginning part
   of a URL. Here is a diagrammed example.

   http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff
   \________________/ \__/ \___/ \______/ \____________/
     (replaceable)     |     |      |       Qualifier
           |       ARK Label |      |    (NMA-supported)
           |                 |      |
   Name Mapping Authority    |   Name (NAA-assigned)
      Hostport (NMAH)        |
          Name Assigning Authority Number (NAAN)

   The ARK syntax can be summarized,

      [http://NMAH/]ark:/NAAN/Name[Qualifier]

   where the NMAH and Qualifier parts are in brackets to indicate that
   they are optional.

2.1. The Name Mapping Authority Hostport (NMAH)

   Before the "ark:" label may appear an optional Name Mapping Authority
   Hostport (NMAH) that is a temporary address where ARK service
   requests may be sent. It consists of "http://" (or any service
   specification valid for a URL) followed by an Internet hostname or
   hostport combination having the same format and semantics as the
   hostport part of a URL. The most important thing about the NMAH is
   that it is "identity inert" from the point of view of object
   identification. In other words, ARKs that differ only in the
   optional NMAH part identify the same object.
   Thus, for example, the
   following three ARKs are synonyms for just one information object:

      http://loc.gov/ark:/12025/654xz321
      http://rutgers.edu/ark:/12025/654xz321
      ark:/12025/654xz321

   Strictly speaking, in the realm of digital objects, these ARKs may
   lead over time to somewhat different or diverging instances of the
   originally named object. In an ideal world, divergence of persistent
   objects is not desirable, but it is widely believed that digital
   preservation efforts will inevitably lead to alterations in some
   original objects (e.g., a format migration in order to preserve the
   ability to display a document). If any of those objects are held
   redundantly in more than one organization (a common preservation
   strategy), chances are small that all holding organizations will
   perform the same precise transformations and all maintain the same
   object metadata. More significant divergence would be expected when
   the holding organizations serve different audiences or compete with
   each other.

   The NMAH part makes an ARK into an actionable URL. As with many
   internet parameters, it is helpful to approach the NMAH being liberal
   in what you accept and conservative in what you propose. From the
   recipient's point of view, the NMAH part should be treated as
   temporary, disposable, and replaceable. From the NMA's point of
   view, it should be chosen with the greatest concern for longevity. A
   carefully chosen NMAH should be at least as permanent as the
   providing organization's own hostname. In the case of a national or
   university library, for example, there is no reason why the NMAH
   should not be considerably more permanent than soft-funded proxy
   hostnames such as hdl.handle.net, dx.doi.org, and purl.org.
   In general and over time, however, it is not unexpected for an NMAH
   eventually to stop working and require replacement with the NMAH of a
   currently active service provider.

   This replacement relies on a mapping authority "resolver" discovery
   process, of which two alternate methods are outlined in a later
   section. The ARK, URN, Handle, and DOI schemes all use a resolver
   discovery model that sooner or later requires matching the original
   assigning authority with a current provider servicing that
   authority's named objects; once found, the resolver at that provider
   performs what amounts to a redirect to a place where the object is
   currently held. All the schemes rely on the ongoing functionality of
   currently mainstream technologies such as the Domain Name System
   [RFC1034] and web browsers. The Handle and DOI schemes in addition
   require that the Handle protocol layer and global server grid be
   available at all times.

   The practice of prepending "http://" and an NMAH to an ARK is a way
   of creating an actionable identifier by a method that is itself
   temporary. Assuming that infrastructure supporting [RFC2616]
   information retrieval will no longer be available one day, ARKs will
   then have to be converted into new kinds of actionable identifiers.
   By that time, if ARKs see widespread use, web browsers would
   presumably evolve to perform this (currently simple) transformation
   automatically.

2.2. The ARK Label Part (ark:/)

   The label part distinguishes an ARK from an ordinary identifier. In
   a URL found in the wild, the string, "ark:/", indicates that the URL
   stands a reasonable chance of being an ARK. If the context warrants,
   verification that it actually is an ARK can be done by testing it for
   existence of the three ARK services.
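
   As a rough illustration of the label test just described, the
   following Python sketch checks a string for the "ark:/" label,
   recovers the NMAH-independent core of the identifier, and builds the
   three URLs at which the object, the `?' metadata record, and the `??'
   commitment statement would be requested. The names ARK_RE, core_ark,
   and service_urls, and the regular expression itself, are assumptions
   made for this example and are not defined by this specification; the
   NAAN pattern follows the 5- or 9-digit form given in Section 2.3. No
   network requests are made, so actually probing the services is left
   to the caller.

```python
import re

# Hypothetical pattern (not from the specification): an optional
# "scheme://hostport/" prefix, the "ark:/" label, a 5- or 9-digit NAAN,
# then the Name together with any Qualifier.
ARK_RE = re.compile(
    r"^(?:(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://(?P<nmah>[^/]+)/)?"
    r"ark:/(?P<naan>\d{5}|\d{9})/(?P<rest>[^?#\s]+)$"
)


def core_ark(text):
    """Return the NMAH-free form "ark:/NAAN/Name[Qualifier]", or None."""
    match = ARK_RE.match(text.strip())
    if match is None:
        return None
    return "ark:/{}/{}".format(match.group("naan"), match.group("rest"))


def service_urls(text):
    """Return the object, metadata (`?'), and commitment (`??') URLs."""
    url = text.strip()
    match = ARK_RE.match(url)
    if match is None or match.group("nmah") is None:
        raise ValueError("an NMAH is needed to form an actionable URL")
    return {"object": url, "metadata": url + "?", "commitment": url + "??"}
```

   Because the NMAH is identity inert, two ARKs can be compared by
   comparing the values that core_ark returns for them.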

   Since nothing about an identifier syntax directly affects
   persistence, the "ark:" label (like "urn:", "doi:", and "hdl:")
   cannot tell you whether the identifier is persistent or whether the
   object is available. It does tell you that the original Name
   Assigning Authority (NAA) had some sort of hopes for it, but it
   doesn't tell you whether that NAA is still in existence, or whether a
   decade ago it ceased to have any responsibility for providing
   persistence, or whether it ever had any responsibility beyond naming.

   Only a current provider can say for certain what sort of commitment
   it intends, and the ARK label suggests that you can query the NMAH
   directly to find out exactly what kind of persistence is promised.
   Even if what is promised is impersistence (i.e., a short-term
   identifier), saying so is valuable information to the recipient.
   Thus an ARK is a high-functioning identifier in the sense that it
   provides access to the object, the metadata, and a commitment
   statement, even if the commitment is explicitly very weak.

2.3. The Name Assigning Authority Number (NAAN)

   Recalling that the general form of the ARK is,

      [http://NMAH/]ark:/NAAN/Name[Qualifier]

   the part of the ARK directly following the "ark:" is the Name
   Assigning Authority Number (NAAN) enclosed in `/' (slash) characters.
   This part is always required, as it identifies the organization that
   originally assigned the Name of the object. It is used to discover a
   currently valid NMAH and to provide top-level partitioning of the
   space of all ARKs. NAANs are registered in a manner similar to URN
   Namespaces, but they are pure numbers consisting of 5 digits or 9
   digits. Thus, the first 100,000 registered NAAs fit compactly into
   the 5 digits, and if growth warrants, the next billion fit into the 9
   digit form.
In either case the fixed odd number of digits helps 495 reduce the chances of finding a NAAN out of context and confusing it 496 with nearby quantities such as 4-digit dates. 498 The NAAN designates a top-level ARK namespace. Once registered for a 499 namespace, a NAAN is never re-registered. It is possible, however, 500 for there to be a succession of organizations that manage an ARK 501 namespace. 503 2.4. The Name Part 505 The part of the ARK just after the NAAN is the Name assigned by the 506 NAA, and it is also required. Semantic opaqueness in the Name part 507 is strongly encouraged in order to reduce an ARK's vulnerability to 508 era- and language-specific change. Identifier strings containing 509 linguistic fragments can create support difficulties down the road. 510 No matter how appropriate or even meaningless they are today, such 511 fragments may one day create confusion, give offense, or infringe on 512 a trademark as the semantic environment around us and our communities 513 evolves. 515 Names that look more or less like numbers avoid common problems that 516 defeat persistence and international acceptance. The use of digits 517 is highly recommended. Mixing in non-vowel alphabetic characters a 518 couple at a time is a relatively safe and easy way to achieve a 519 denser namespace (more possible names for a given length of the name 520 string). Such names have a chance of aging and traveling well. 521 Tools exist that mint, bind, and resolve opaque identifiers, with or 522 without check characters [NOID]. More on naming considerations is 523 given in a subsequent section. 525 2.5. The Qualifier Part 527 The part of the ARK following the NAA-assigned Name is an optional 528 Qualifier. It is a string that extends the base ARK in order to 529 create a kind of service entry point into the object named by the 530 NAA.
At the discretion of the providing NMA, such a service entry 531 point permits an ARK to support access to individual hierarchical 532 components and subcomponents of an object, and to variants (versions, 533 languages, formats) of components. A Qualifier may be invented by 534 the NAA or by any NMA servicing the object. 536 In form, the Qualifier is a ComponentPath, or a VariantPath, or a 537 ComponentPath followed by a VariantPath. A VariantPath is introduced 538 and subdivided by the reserved character `.', and a ComponentPath is 539 introduced and subdivided by the reserved character `/'. In this 540 example, 542 http://example.org/ark:/12025/654xz321/s3/f8.05v.tiff 544 the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is 545 a VariantPath. The ARK Qualifier is a formalization of some 546 currently mainstream URL syntax conventions. This formalization 547 specifically reserves meanings that permit recipients to make strong 548 inferences about logical sub-object containment and equivalence based 549 only on the form of the received identifiers; there is great 550 efficiency in not having to inspect metadata records to discover such 551 relationships. NMAs are free not to disclose any of these 552 relationships merely by avoiding the reserved characters above. 553 Hierarchical components and variants are discussed further in the 554 next two sections. 556 The Qualifier, if present, differs from the Name in several important 557 respects. First, a Qualifier may have been assigned either by the 558 NAA or later by the NMA. The assignment of a Qualifier by an NMA 559 effectively amounts to an act of publishing a service entry point 560 within the conceptual object originally named by the NAA. For our 561 purposes, an ARK extended with a Qualifier assigned by an NMA will be 562 called an NMA-qualified ARK. 
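The division of a Qualifier into its two paths can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names split_qualifier and path_parts are hypothetical, the input is assumed to be the Qualifier alone (already separated from the Name), and no validation is attempted for malformed input such as a slash appearing after a period.

```python
def split_qualifier(qualifier):
    """Split a Qualifier into (component_path, variant_path).

    Assumes the form described above: an optional ComponentPath introduced
    by '/', followed by an optional VariantPath introduced by '.'.
    """
    dot = qualifier.find(".")
    if dot == -1:
        # No period at all: the whole Qualifier is a ComponentPath.
        return qualifier, ""
    return qualifier[:dot], qualifier[dot:]


def path_parts(path, separator):
    """List the non-empty pieces of a path subdivided by the given character."""
    return [piece for piece in path.split(separator) if piece]


if __name__ == "__main__":
    component_path, variant_path = split_qualifier("/s3/f8.05v.tiff")
    print(component_path, path_parts(component_path, "/"))
    print(variant_path, path_parts(variant_path, "."))
```

Applied to the Qualifier of the example ARK above, the split yields "/s3/f8" and ".05v.tiff", the same ComponentPath and VariantPath named in the text.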
564 Second, a Qualifier assignment on the part of an NMA is made in 565 fulfillment of its service obligations and may reflect changing 566 service expectations and technology requirements. NMA-qualified ARKs 567 could therefore be transient, even if the base, unqualified ARK is 568 persistent. For example, it would be reasonable for an NMA to 569 support access to an image object through an actionable ARK that is 570 considered persistent even if the experience of that access changes 571 as linking, labeling, and presentation conventions evolve and as 572 format and security standards are updated. For an image "thumbnail", 573 that NMA could also support an NMA-qualified ARK that is considered 574 impersistent because the thumbnail will be replaced with higher 575 resolution images as network bandwidth and CPU speeds increase. At 576 the same time, for an originally scanned, high-resolution master, the 577 NMA could publish an NMA-qualified ARK that is itself considered 578 persistent. Of course, the NMA must be able to return its separate 579 commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs, 580 and to any NAA-qualified ARKs that it supports. 582 A third difference between a Qualifier and a Name concerns the 583 semantic opaqueness constraint. When an NMA-qualified ARK is to be 584 used as a transient service entry point into a persistent object, the 585 priority given to semantic opaqueness observed by the NAA in the Name 586 part may be relaxed by the NMA in the Qualifier part. If service 587 priorities in the Qualifier take precedence over persistence, short- 588 term usability considerations may recommend somewhat semantically 589 laden Qualifier strings. 591 Finally, not only is the set of Qualifiers supported by an NMA 592 mutable, but different NMAs may support different Qualifier sets for 593 the same NAA-identified object. In this regard the NMAs act 594 independently of each other and of the NAA.
596 The next two sections describe how ARK syntax may be used to declare, 597 or to avoid declaring, certain kinds of relatedness among qualified 598 ARKs. 600 2.5.1. ARKs that Reveal Object Hierarchy 602 An NAA or NMA may choose to reveal the presence of a hierarchical 603 relationship between objects using the `/' (slash) character after 604 the Name part of an ARK. Some authorities will choose not to 605 disclose this information, while others will go ahead and disclose so 606 that manipulators of large sets of ARKs can infer object 607 relationships by simple identifier inspection; for example, this 608 makes it possible for a system to present a collapsed view of a large 609 search result set. 611 If the ARK contains an internal slash after the NAAN, the piece to 612 its left indicates a containing object. For example, publishing an 613 ARK of the form, 615 ark:/12025/654/xz/321 617 is equivalent to publishing three ARKs, 619 ark:/12025/654/xz/321 620 ark:/12025/654/xz 621 ark:/12025/654 623 together with a declaration that the first object is contained in the 624 second object, and that the second object is contained in the third. 626 Revealing the presence of hierarchy is completely up to the assigner 627 (NMA or NAA). It is hard enough to commit to one object's name, let 628 alone to three objects' names and to a specific, ongoing relatedness 629 among them. Thus, regardless of whether hierarchy was present 630 initially, the assigner, by not using slashes, reveals no shared 631 inferences about hierarchical or other inter-relatedness in the 632 following ARKs: 634 ark:/12025/654_xz_321 635 ark:/12025/654_xz 636 ark:/12025/654xz321 637 ark:/12025/654xz 638 ark:/12025/654 640 Note that slashes around the ARK's NAAN (/12025/ in these examples) 641 are not part of the ARK's Name and therefore do not indicate the 642 existence of some sort of NAAN super object containing all objects in 643 its namespace. 
A slash must have at least one non-structural 644 character (one that is neither a slash nor a period) on both sides in 645 order for it to separate recognizable structural components. So 646 initial or final slashes may be removed, and double slashes may be 647 converted into single slashes. 649 2.5.2. ARKs that Reveal Object Variants 651 An NAA or NMA may choose to reveal the possible presence of variant 652 objects or object components using the `.' (period) character after 653 the Name part of an ARK. Some authorities will choose not to 654 disclose this information, while others will go ahead and disclose so 655 that manipulators of large sets of ARKs can infer object 656 relationships by simple identifier inspection; for example, this 657 makes it possible for a system to present a collapsed view of a large 658 search result set. 660 If the ARK contains an internal period after the Name, the piece to its 661 left is a base name and the piece to its right, up to the end of 662 the ARK or to the next period, is a suffix. A Name may have more than 663 one suffix, for example, 665 ark:/12025/654.24 666 ark:/12025/xz4/654.24 667 ark:/12025/654.20v.78g.f55 673 There are two main rules. First, if two ARKs share the same base 674 name but have different suffixes, the corresponding objects were 675 considered variants of each other (different formats, languages, 676 versions, etc.) by the assigner (NMA or NAA). Thus, the following 677 ARKs are variants of each other: 679 ark:/12025/654.20v.78g.f55 680 ark:/12025/654.321xz 681 ark:/12025/654.44 683 Second, publishing an ARK with a suffix implies the existence of at 684 least one variant identified by the ARK without its suffix. The ARK 685 otherwise permits no further assumptions about what variants might 686 exist.
So publishing the ARK, 688 ark:/12025/654.20v.78g.f55 690 is equivalent to publishing the four ARKs, 692 ark:/12025/654.20v.78g.f55 693 ark:/12025/654.20v.78g 694 ark:/12025/654.20v 695 ark:/12025/654 697 Revealing the possibility of variants is completely up to the 698 assigner. It is hard enough to commit to one object's name, let 699 alone to multiple variants' names and to a specific, ongoing 700 relatedness among them. The assigner is the sole arbiter of what 701 constitutes a variant within its namespace, and whether to reveal 702 that kind of relatedness by using periods within its names. 704 A period must have at least one non-structural character (one that is 705 neither a slash nor a period) on both sides in order for it to 706 separate recognizable structural components. So initial or final 707 periods may be removed, and adjacent periods may be converted into a 708 single period. Multiple suffixes should be arranged in sorted order 709 (pure ASCII collating sequence) at the end of an ARK. 711 2.6. Character Repertoires 713 The Name and Qualifier parts are strings of visible ASCII characters 714 and should be less than 128 bytes in length. The length restriction 715 keeps the ARK short enough to append ordinary ARK request strings 716 without running into transport restrictions (e.g., within HTTP GET 717 requests). Characters may be letters, digits, or any of these seven 718 characters: 720 = # * + @ _ $ 722 The following characters may also be used, but their meanings are 723 reserved: 725 % - . / 727 The characters `/' and `.' are ignored if either appears as the last 728 character of an ARK. If used internally, they allow a name assigner 729 to reveal object hierarchy and object variants as previously 730 described. 732 Hyphens are considered to be insignificant and are always ignored in 733 ARKs.
A `-' (hyphen) may appear in an ARK for readability, or it may 734 have crept in during the formatting and wrapping of text, but it must 735 be ignored in lexical comparisons. As in a telephone number, hyphens 736 have no meaning in an ARK. It is always safe for an NMA that 737 receives an ARK to remove any hyphens found in it. As a result, like 738 the NMAH, hyphens are "identity inert" in comparing ARKs for 739 equivalence. For example, the following ARKs are equivalent for 740 purposes of comparison and ARK service access: 742 ark:/12025/65-4-xz-321 743 http://sneezy.dopey.com/ark:/12025/654--xz32-1 744 ark:/12025/654xz321 746 The `%' character is reserved for %-encoding all other octets that 747 would appear in the ARK string, in the same manner as for URIs 748 [RFC3986]. A %-encoded octet consists of a `%' followed by two hex 749 digits; for example, "%7d" stands in for `}'. Lower case hex digits 750 are preferred to reduce the chances of false acronym recognition; 751 thus it is better to use "%acT" instead of "%ACT". The character `%' 752 itself must be represented using "%25". As with URNs, %-encoding 753 permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) 754 that have less restricted character repertoires [RFC2288]. 756 2.7. Normalization and Lexical Equivalence 758 To determine if two or more ARKs identify the same object, the ARKs 759 are compared for lexical equivalence after first being normalized. 760 Since ARK strings may appear in various forms (e.g., having different 761 NMAHs), normalizing them minimizes the chances that comparing two ARK 762 strings for equality will fail unless they actually identify 763 different objects. In a specified-host ARK (one having an NMAH), the 764 NMAH never participates in such comparisons. 766 Normalization of an ARK for the purpose of octet-by-octet equality 767 comparison with another ARK consists of four steps. 
First, any upper 768 case letters in the "ark:" label and the two characters following a 769 `%' are converted to lower case. The case of all other letters in 770 the ARK string must be preserved. Second, any NMAH part is removed 771 (everything from an initial "http://" up to the next slash) and all 772 hyphens are removed. 774 Third, structural characters (slash and period) are normalized. 775 Initial and final occurrences are removed, and two structural 776 characters in a row (e.g., // or ./) are replaced by the first 777 character, iterating until each occurrence has at least one non- 778 structural character on either side. As part of this step, if there are any 779 components with a period on the left and a slash on the right, either 780 the component and the preceding period must be moved to the end of 781 the Name part or the ARK must be thrown out as malformed. 783 The fourth and final step is to arrange the suffixes in ASCII 784 collating sequence (that is, to sort them) and to remove duplicate 785 suffixes, if any. It is also permissible to throw out ARKs for which 786 the suffixes are not sorted. 788 The resulting ARK string is now normalized. Comparisons between 789 normalized ARKs are case-sensitive, meaning that upper case letters 790 are considered different from their lower case counterparts. 792 To keep ARK string variation to a minimum, no reserved ARK characters 793 should be %-encoded unless it is deliberately to conceal their 794 reserved meanings. No non-reserved ARK characters should ever be 795 %-encoded. Finally, no %-encoded character should ever appear in an 796 ARK in its decoded form. 798 3. Naming Considerations 800 The most important threats faced by persistence providers include 801 such things as funding loss, natural disaster, political and social 802 upheaval, processing faults, and errors in human oversight. There is 803 nothing that an identifier scheme can do about such things.
Still, a 804 few observed identifier failures and inconveniences can be traced 805 back to naming practices that we now know to be less than optimal for 806 persistence. 808 3.1. ARKs Embedded in Language 810 The ARK has different goals from the URI, so it has different 811 character set requirements. Because linguistic constructs imperil 812 persistence, for ARKs non-ASCII character support is unimportant. 813 ARKs and URIs share goals of transcribability and transportability 814 within web documents, so characters are required to be visible, non- 815 conflicting with HTML/XML syntax, and not subject to tampering during 816 transmission across common transport gateways. Add the goal of 817 making an undelimited ARK recognizable in running prose, as in ark:/ 818 12025/=@_22*$, and certain punctuation characters (e.g., comma, 819 period) end up being excluded from the ARK lest the end of a phrase 820 or sentence be mistaken for part of the ARK. 822 This consideration has more direct effect on ARK usability in a 823 natural language context than it has on ARK persistence. The same is 824 true of the rule preventing hyphens from having lexical significance. 825 It is fine to publish ARKs with hyphens in them (e.g., the 826 output of UUID/GUID generators), but the uniform treatment of hyphens 827 as insignificant reduces the possibility of users transcribing 828 identifiers that will have been broken through unpredictable 829 hyphenation by word processors. Any measure that reduces user 830 irritation with an identifier will increase its chances of survival. 832 3.2. Objects Should Wear Their Identifiers 834 A valuable technique for provision of persistent objects is to try to 835 arrange for the complete identifier to appear on, with, or near its 836 retrieved object. An object encountered at a moment in time when its 837 discovery context has long since disappeared could then easily be 838 traced back to its metadata, to alternate versions, to updates, etc.
839 This has seen reasonable success, for example, in book publishing and 840 software distribution. An identifier string only has meaning when 841 its association is known, and this is a very sure, simple, and low-tech 842 method of reminding everyone exactly what that association is. 844 3.3. Names are Political, not Technological 846 If persistence is the goal, a deliberate local strategy for 847 systematic name assignment is crucial. Names must be chosen with 848 great care. Poorly chosen and managed names will devastate any 849 persistence strategy, and they do not discriminate by identifier 850 scheme. Whether a mistakenly re-assigned name is a URN, DOI, PURL, 851 URL, or ARK, the damage -- failed access and confusion -- is not 852 mitigated more in one scheme than in another. Conversely, in-house 853 efforts to manage names responsibly will go much further towards 854 safeguarding persistence than any choice of naming scheme or name 855 resolution technology. 857 Branding (e.g., at the corporate or departmental level) is important 858 for funding and visibility, but substrings representing brands and 859 organizational names should be given a wide berth except when 860 absolutely necessary in the hostname (the identity-inert part) of the 861 ARK. These substrings are not only unstable because organizations 862 change frequently, but they are also dangerous because successor 863 organizations often have political or legal reasons to actively 864 suppress predecessor names and brands. Any measure that reduces the 865 chances of future political or legal pressure on an identifier will 866 decrease the chances that our descendants will be obliged to 867 deliberately break it. 869 3.4. Choosing a Hostname or NMA 871 Hostnames appearing in any identifier meant to be persistent must be 872 chosen with extra care.
The tendency in hostname selection has 873 traditionally been to choose a token with recognizable attributes, 874 such as a corporate brand, but that tendency wreaks havoc with 875 persistence that is supposed to outlive brands, corporations, subject 876 classifications, and natural language semantics (e.g., what did the 877 three letters "gay" mean in 1958, 1978, and 1998?). Today's 878 recognized and correct attributes are tomorrow's stale or incorrect 879 attributes. In making hostnames (any names, actually) long-term 880 persistent, it helps to eliminate recognizable attributes to the 881 extent possible. This affects selection of any name based on URLs, 882 including PURLs and the explicitly disposable NMAHs. 884 There is no excuse for a provider that manages its internal names 885 impeccably not to exercise the same care in choosing what could be an 886 exceptionally durable hostname, especially if it would form the 887 prefix for all the provider's URL-based external names. Registering 888 an opaque hostname in the ".org" or ".net" domain would not be a bad 889 start. Another way is to publish your ARKs with an organizational 890 domain name that will be mapped by DNS to an appropriate NMA host. 891 This makes for shorter names with less branding vulnerability. 893 It is a mistake to think that hostnames are inherently unstable. If 894 you require brand visibility, that may be a fact of life. But things 895 are easier if yours is the brand of a long-lived cultural memory 896 institution such as a national or university library or archive. 897 Well-chosen hostnames from organizations that are sheltered from the 898 direct effects of a volatile marketplace can easily provide longer- 899 lived global resolvers than the domain names explicitly or implicitly 900 used as starting points for global resolution by indirection-based 901 persistent identifier schemes.
For example, it is hard to imagine 902 circumstances under which the Library of Congress' domain name would 903 disappear sooner than, say, "handle.net". 905 For smaller libraries, archives, and preservation organizations, 906 there is a natural concern about whether they will be able to keep 907 their web servers and domain names in the face of uncertain funding. 908 One option is to form or join a consortium [N2T] of like-minded 909 organizations with the purpose of providing mutual preservation 910 support. The first goal of such a consortium would be to perpetually 911 rent a hostname on which to establish a web server that simply 912 redirects incoming member organization requests to the appropriate 913 member server; using ARKs, for example, a 150-member consortium could 914 run a very small server (24x7) that contained nothing more than 150 915 rewrite rules in its configuration file. Even more helpful would be 916 additional consortial support for a member organization that was 917 unable to continue providing services and needed to find a successor 918 archival organization. This would be a low-cost, low-tech way to 919 publish ARKs (or URLs) under highly persistent hostnames. 921 There are no obvious reasons why the organizations registering DNS 922 names, URN Namespaces, and DOI publisher IDs should have among them 923 one that is intrinsically more fallible than the next. Moreover, it 924 is a misconception that the demise of DNS and of HTTP need adversely 925 affect the persistence of URLs. At such a time, certainly URLs from 926 the present day might not then be actionable by our present-day 927 mechanisms, but resolution systems for future non-actionable URLs are 928 no harder to imagine than resolution systems for present-day non- 929 actionable URNs and DOIs. There is no more stable a namespace than 930 one that is dead and frozen, and that would then characterize the 931 space of names bearing the "http://" prefix. 
It is useful to 932 remember that just because hostnames have been carelessly chosen in 933 their brief history does not mean that they are unsuitable in NMAHs 934 (and URLs) intended for use in situations demanding the highest level 935 of persistence available in the Internet environment. A well-planned 936 name assignment strategy is everything. 938 3.5. Assigners of ARKs 940 A Name Assigning Authority (NAA) is an organization that creates (or 941 delegates creation of) long-term associations between identifiers and 942 information objects. Examples of NAAs include national libraries, 943 national archives, and publishers. An NAA may arrange with an 944 external organization for identifier assignment. The US Library of 945 Congress, for example, allows OCLC (the Online Computer Library 946 Center, a major world cataloger of books) to create associations 947 between Library of Congress call numbers (LCCNs) and the books that 948 OCLC processes. A cataloging record is generated that testifies to 949 each association, and the identifier is included by the publisher, 950 for example, in the front matter of a book. 952 An NAA does not so much create an identifier as create an 953 association. The NAA first draws an unused identifier string from 954 its namespace, which is the set of all identifiers under its control. 955 It then records the assignment of the identifier to an information 956 object having sundry witnessed characteristics, such as a particular 957 author and modification date. A namespace is usually reserved for an 958 NAA by agreement with recognized community organizations (such as 959 IANA and ISO) that all names containing a particular string be under 960 its control. In the ARK an NAA is represented by the Name Assigning 961 Authority Number (NAAN). 963 The ARK namespace reserved for an NAA is the set of names bearing its 964 particular NAAN. 
For example, all strings beginning with "ark:/ 965 12025/" are under control of the NAA registered under 12025, which 966 might be the National Library of Finland. Because each NAA has a 967 different NAAN, names from one namespace cannot conflict with those 968 from another. Each NAA is free to assign names from its namespace 969 (or delegate assignment) according to its own policies. These 970 policies must be documented in a manner similar to the declarations 971 required for URN Namespace registration [RFC2611]. 973 To register for a NAAN, please read about the mapping authority 974 discovery file in the next section and send email to ark@cdlib.org. 976 3.6. NAAN Namespace Management 978 Every NAA must have a namespace management strategy. A time-honored 979 technique is to hierarchically partition a namespace into 980 subnamespaces using prefixes that guarantee non-collision of names in 981 different partitions. This practice is strongly encouraged for all 982 NAAs, especially when subnamespace management will be delegated to 983 other departments, units, or projects within an organization. For 984 example, with a NAAN that is assigned to a university and managed by 985 its main library, care should be taken to reserve semantically opaque 986 prefixes that will set aside large parts of the unused namespace for 987 future assignments. Prefix-based partition management is an 988 important responsibility of the NAA. 990 This sort of delegation by prefix is well-used in the formation of 991 DNS names and ISBN identifiers. An important difference is that in 992 the former, the hierarchy is deliberately exposed and in the latter 993 it is hidden. Rather than using lexical boundary markers such as the 994 period (`.') found in domain names, the ISBN uses a publisher prefix 995 but doesn't disclose where the prefix ends and the publisher's 996 assigned name begins.
This practice of non-disclosure, borrowed from 997 the ISBN and ISSN schemes, is encouraged in assigning ARKs, because 998 it reduces the visibility of an assertion that is probably not 999 important now and may become a vulnerability later. 1001 Reasonable prefixes for assigned names usually consist of consonants 1002 and digits and are 1-5 characters in length. For example, the 1003 constant prefix "x9t" might be delegated to a book digitization 1004 project that creates identifiers such as 1006 http://444.berkeley.edu/ark:/28722/x9t38rk45c 1008 If longevity is the goal, it is important to keep the prefixes free 1009 of recognizable semantics; for example, using an acronym representing 1010 a project or a department is discouraged. At the same time, you may 1011 wish to set aside a subnamespace for testing purposes under a prefix 1012 such as "fk..." that can serve as a visual clue and reminder to 1013 maintenance staff that this "fake" identifier was never published. 1015 There are other measures one can take to avoid user confusion, 1016 transcription errors, and the appearance of accidental semantics when 1017 creating identifiers. If you are generating identifiers 1018 automatically, pure numeric identifiers are likely to be 1019 semantically opaque enough, but it's probably useful to avoid leading 1020 zeroes because some users mistakenly treat them as optional, thinking 1021 (arithmetically) that they don't contribute to the "value" of the 1022 identifier. 1024 If you need lots of identifiers and you don't want them to get too 1025 long, you can mix digits with consonants (but avoid vowels since they 1026 might accidentally spell words) to get more identifiers without 1027 increasing the string length. In this case you may not want more 1028 than two letters in a row, since limiting runs of letters reduces the chance of 1029 generating acronyms.
Generator tools such as [NOID] provide support 1030 for these sorts of identifiers, and can also add a computed check 1031 character as a guarantee against the most common transcription 1032 errors. 1034 3.7. Sub-Object Naming 1036 As mentioned previously, semantically opaque identifiers are very 1037 useful for long-term naming of abstract objects; however, it may be 1038 appropriate to extend these names with less opaque extensions that 1039 reference contemporary service entry points (sub-objects) in support 1040 of the object. Sub-object extensions beginning with a digit or 1041 underscore (`_') are reserved for the possibility of developing a 1042 future registry of canonical service points (e.g., numeric references 1043 to versions, formats, languages, etc.). 1045 4. Finding a Name Mapping Authority 1047 In order to derive an actionable identifier (these days, a URL) from 1048 an ARK, a hostport (hostname or hostname plus port combination) for a 1049 working Name Mapping Authority (NMA) must be found. An NMA is a 1050 service that is able to respond to the three basic ARK service 1051 requests. Relying on registration and client-side discovery, NMAs 1052 make known which NAAs' identifiers they are willing to service. 1054 Upon encountering an ARK, a user (or client software) looks inside it 1055 for the optional NMAH part (the hostport of the NMA's ARK service). 1056 If it contains an NMAH that is working, this NMAH discovery step may 1057 be skipped; the NMAH effectively uses the beginning of an ARK to 1058 cache the results of a prior mapping authority discovery process. If 1059 a new NMAH needs to be found, the client looks inside the ARK again for 1060 the NAAN (Name Assigning Authority Number). Querying a global 1061 database, it then uses the NAAN to look up all current NMAHs that 1062 service ARKs issued by the identified NAA. 1064 The global database is key, and ideally the lookup would be automatic 1065 and transparent to the user.
For this, the most promising method is 1066 probably the Name-to-Thing (N2T) Resolver [N2T] at n2t.info. It is a 1067 proposed low-cost, highly reliable, consortially maintained NMAH that 1068 simply exists to support actionable HTTP-based URLs for as long as 1069 HTTP is used. One of its big advantages over the other two methods 1070 and the URN, Handle, DOI, and PURL methods, is that N2T addresses the 1071 namespace splitting problem. When objects maintained by one NMA are 1072 inherited by more than one successor NMA, until now one of those 1073 successors would be required to maintain forwarding tables on behalf 1074 of the other successors. 1076 There are two other ways to discover an NMAH, one of them described 1077 in a subsection below. Another way, described in an appendix, is 1078 based on a simplification of the URN resolver discovery method, 1079 itself very similar in principle to the resolver discovery method 1080 used by Handles and DOIs. None of these methods does more than what 1081 can be done with a very small, consortially maintained web server 1082 such as [N2T]. 1084 In the interests of long-term persistence, however, ARK mechanisms 1085 are first defined in high-level, protocol-independent terms so that 1086 mechanisms may evolve and be replaced over time without compromising 1087 fundamental service objectives. Either or both specific methods 1088 given here may eventually be supplanted by better methods since, by 1089 design, the ARK scheme does not depend on a particular method, but 1090 only on having some method to locate an active NMAH. 1092 At the time of issuance, at least one NMAH for an ARK should be 1093 prepared to service it. That NMA may or may not be administered by 1094 the Name Assigning Authority (NAA) that created it. Consider the 1095 following hypothetical example of providing long-term access to a 1096 cancer research journal. 
   The publisher wishes to turn a profit and
   the National Library of Medicine wishes to preserve the scholarly
   record.  An agreement might be struck whereby the publisher would act
   as the NAA and the national library would archive the journal issue
   when it appears, but without providing direct access for the first
   six months.  During the first six months of peak commercial
   viability, the publisher would retain exclusive delivery rights and
   would charge access fees.  Again, by agreement, both the library and
   the publisher would act as NMAs, but during that initial period the
   library would redirect requests for issues less than six months old
   to the publisher.  At the end of the waiting period, the library
   would then begin servicing requests for issues older than six months
   by tapping directly into its own archives.  Meanwhile, the publisher
   might routinely redirect incoming requests for older issues to the
   library.  Long-term access is thereby preserved, and so is the
   commercial incentive to publish content.

   Although it will be common for an NAA also to run an NMA service, it
   is never a requirement.  Over time NAAs and NMAs will come and go.
   One NMA will succeed another, and there might be many NMAs serving
   the same ARKs simultaneously (e.g., as mirrors or as competitors).
   There might also be asymmetric but coordinated NMAs as in the
   library-publisher example above.

4.1.  Looking Up NMAHs in a Globally Accessible File

   This subsection describes a way to look up NMAHs using a simple name
   authority table represented as a plain text file.  For efficient
   access the file may be stored in a local filesystem, but it needs to
   be reloaded periodically to incorporate updates.
   It is not expected
   that the size of the file or frequency of update should impose an
   undue maintenance or searching burden any time soon, for even a
   primitive linear search of a file with ten thousand NAAs is a
   subsecond operation on modern server machines.  The proposed file
   strategy is similar to the /etc/hosts file strategy that supported
   Internet host address lookup for a period of years before the advent
   of DNS.

   The name authority table file is updated on an ongoing basis and is
   available for copying over the internet from the California Digital
   Library at http://www.cdlib.org/inside/diglib/ark/natab and from a
   number of mirror sites.  The file contains comment lines (lines that
   begin with `#') explaining the format and giving the file's
   modification time, reloading address, and NAA registration
   instructions.  There is even a Perl script that processes the file
   embedded in the file's comments.  The currently registered Name
   Assigning Authorities are:

      12025    National Library of Medicine
      12026    Library of Congress
      12027    National Agricultural Library
      13030    California Digital Library
      13038    World Intellectual Property Organization
      20775    University of California San Diego
      29114    University of California San Francisco
      28722    University of California Berkeley
      21198    University of California Los Angeles
      15230    Rutgers University
      13960    Internet Archive
      64269    Digital Curation Centre
      62624    New York University
      67531    University of North Texas
      27927    Ithaka Electronic-Archiving Initiative
      12148    Bibliotheque nationale de France
                / National Library of France
      78319    Google
      88435    Princeton University
      78428    University of Washington
      89901    Archives of the Region of Vaestra Goetaland
                and City of Gothenburg, Sweden
      80444    Northwest Digital Archives
      25593    Emory University
      25031    University of Kansas
      17101    Centre for Ecology & Hydrology, UK
      65323    University of Calgary
      61001    University of Chicago
      52327    Bibliotheque et Archives Nationales du Quebec
                / National Library and Archives of Quebec
      39331    National Szechenyi Library / National Library of Hungary
      26677    Library and Archives Canada / Bibliotheque et Archives Canada

5.  Generic ARK Service Definition

   An ARK request's output is delivered information; examples include
   the object itself, a policy declaration (e.g., a promise of support),
   a descriptive metadata record, or an error message.  The experience
   of object delivery is expected to be an evolving mix of information
   that reflects changing service expectations and technology
   requirements; contemporary examples include such things as an object
   summary and component links formatted for human consumption.  ARK
   services must be couched in high-level, protocol-independent terms if
   persistence is to outlive today's networking infrastructural
   assumptions.  The high-level ARK service definitions listed below are
   followed in the next section by a concrete method (one of many
   possible methods) for delivering these services with today's
   technology.

5.1.  Generic ARK Access Service (access, location)

   Returns (a copy of) the object or a redirect to the same, although a
   sensible object proxy may be substituted.  Examples of sensible
   substitutes include

   o  a table of contents instead of a large complex document,

   o  a home page instead of an entire web site hierarchy,

   o  a rights clearance challenge before accessing protected data,

   o  directions for access to an offline object (e.g., a book),

   o  a description of an intangible object (a disease, an event), or

   o  an applet acting as "player" for a large multimedia object.

   May also return a discriminated list of alternate object locators.
   If access is denied, returns an explanation of the object's current
   (perhaps permanent) inaccessibility.

5.1.1.  Generic Policy Service (permanence, naming, etc.)

   Returns declarations of policy and support commitments for given
   ARKs.  Declarations are returned in either a structured metadata
   format or a human readable text format; sometimes one format may
   serve both purposes.  Policy subareas may be addressed in separate
   requests, but the following areas should be covered: object
   permanence, object naming, object fragment addressing, and
   operational service support.

   The permanence declaration for an object is a rating defined with
   respect to an identified permanence provider (guarantor), which will
   be the NMA.  It may include the following aspects.

   (a)  "object availability" -- whether and how access to the object
      is supported (e.g., online 24x7, or offline only),

   (b)  "identifier validity" -- under what conditions the identifier
      will be or has been re-assigned,

   (c)  "content invariance" -- under what conditions the content of
      the object is subject to change, and

   (d)  "change history" -- access to corrections, migrations, and
      revisions, whether through links to the changed objects themselves
      or through a document summarizing the change history.

   One approach to a permanence rating framework, conceived
   independently from ARKs, is given in [NLMPerm].  Under ongoing
   development and limited deployment at the US National Library of
   Medicine, it identifies the following "permanence levels":

      Not Guaranteed:  No commitment has been made to retain this
      resource.  It could become unavailable at any time.  Its
      identifier could be changed.

      Permanent: Dynamic Content:  A commitment has been made to keep
      this resource permanently available.  Its identifier will always
      provide access to the resource.
      Its content could be revised or
      replaced.

      Permanent: Stable Content:  A commitment has been made to keep this
      resource permanently available.  Its identifier will always
      provide access to the resource.  Its content is subject only to
      minor corrections or additions.

      Permanent: Unchanging Content:  A commitment has been made to keep
      this resource permanently available.  Its identifier will always
      provide access to the resource.  Its content will not change.

   Naming policy for an object includes an historical description of the
   NAA's (and its successor NAA's) policies regarding differentiation of
   objects.  Since it is the NMA that responds to requests for policy
   statements, it is useful for the NMA to be able to produce or
   summarize these historical NAA documents.  Naming policy may include
   the following aspects.

   (i)  "similarity" -- (or "unity") the limit, defined by the NAA, to
      the level of dissimilarity beyond which two similar objects
      warrant separate identifiers but before which they share one
      single identifier, and

   (ii)  "granularity" -- the limit, defined by the NAA, to the level
      of object subdivision beyond which sub-objects do not warrant
      separately assigned identifiers but before which sub-objects are
      assigned separate identifiers.

   Subnaming policy for an object describes the qualifiers that the NMA,
   in fulfilling its ongoing and evolving service obligations, allows as
   extensions to an NAA-assigned ARK.  To the conceptual object that the
   NAA named with an ARK, the NMA may add component access points and
   derivatives (e.g., format migrations in aid of preservation) in order
   to provide both basic and value-added services.
   Addressing policy for an object includes a description of how, during
   access, object components (e.g., paragraphs, sections) or views
   (e.g., image conversions) may or may not be "addressed", in other
   words, how the NMA permits arguments or parameters to modify the
   object delivered as the result of an ARK request.  If supported,
   these sorts of operations would provide things like byte-ranged
   fragment delivery and open-ended format conversions, or any set of
   possible transformations that would be too numerous to list or to
   identify with separately assigned ARKs.

   Operational service support policy includes a description of general
   operational aspects of the NMA service, such as after-hours staffing
   and trouble reporting procedures.

5.1.2.  Generic Description Service

   Returns a description of the object.  Descriptions are returned in
   either a structured metadata format or a human readable text format;
   sometimes one format may serve both purposes.  A description must at
   a minimum answer the who, what, when, and where questions concerning
   an expression of the object.  Standalone descriptions should be
   accompanied by the modification date and source of the description
   itself.  May also return discriminated lists of ARKs that are related
   to the given ARK.

5.2.  Overview of The HTTP URL Mapping Protocol (THUMP)

   The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (any
   identifier) and asking such questions as, what information does this
   identify and how permanent is it?  [THUMP] is in fact one specific
   method under development for delivering ARK services.  The protocol
   runs over HTTP to exploit the web browser's current pre-eminence as
   user interface to the Internet.  THUMP is designed so that a person
   can enter ARK requests directly into the location field of current
   browser interfaces.
   Because it runs over HTTP, THUMP can be
   simulated and tested via keyboard-based (Telnet) interactions
   [RFC0854].

   The asker (a person or client program) starts with an identifier,
   such as an ARK or a URL.  The identifier reveals to the asker (or
   allows the asker to infer) the Internet host name and port number of
   a server system that responds to questions.  Here, this is just the
   NMAH that is obtained by inspection and possibly lookup based on the
   ARK's NAAN.  The asker then sets up an HTTP session with the server
   system, sends a question via a THUMP request (contained within an
   HTTP request), receives an answer via a THUMP response (contained
   within an HTTP response), and closes the session.  That concludes the
   connected portion of the protocol.

   A THUMP request is a string of characters beginning with a `?'
   (question mark) that is appended to the identifier string.  The
   resulting string is sent as an argument to HTTP's GET command.
   Request strings too long for GET may be sent using HTTP's POST
   command.  The three most common requests correspond to three
   degenerate special cases that keep the user's learning and typing
   burden low.  First, a simple key with no request at all is the same
   as an ordinary access request.  Thus a plain ARK entered into a
   browser's location field behaves much like a plain URL, and returns
   access to the primary identified object, for instance, an HTML
   document.

   The second special case is a minimal ARK description request string
   consisting of just "?".  For example, entering the string,

      ark.nlm.nih.gov/ark:/12025/psbbantu?

   into the browser's location field directly precipitates a request for
   a metadata record describing the object identified by
   ark:/12025/psbbantu.  The browser, unaware of THUMP, prepares and
   sends an HTTP GET request in the same manner as for a URL.
   THUMP is designed so
   that the response (indicated by the returned HTTP content type) is
   normally displayed, whether the output is structured for machine
   processing (text/plain) or formatted for human consumption
   (text/html).

   In the following example THUMP session, each line has been annotated
   to include a line number and whether it was the client or server that
   sent it.  Without going into much depth, the session has three pieces
   separated from each other by blank lines: the client's piece (lines
   1-3), the server's HTTP/THUMP response headers (4-7), and the body of
   the server's response (8-13).  The first and last lines (1 and 13)
   correspond to the client's steps to start the TCP session and the
   server's steps to end it, respectively.

     1  C: [opens session]
        C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
        C:
        S: HTTP/1.1 200 OK
     5  S: Content-Type: text/plain
        S: THUMP-Status: 0.6 200 OK
        S:
        S: erc:
        S: who:   Lederberg, Joshua
    10  S: what:  Studies of Human Families for Genetic Linkage
        S: when:  1974
        S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
        S: [closes session]

   The first two server response lines (4-5) above are typical of HTTP.
   The next line (6) is peculiar to THUMP, and indicates the THUMP
   version and a normal return status.

   The balance of the response consists of a single metadata record
   (8-12) that comprises the ARK description service response.  The
   returned record is in the format of an Electronic Resource Citation
   [ERC], which is discussed in overview in the next section.  For now,
   note that it contains four elements that answer the top priority
   questions regarding an expression of the object: who played a major
   role in expressing it, what the expression was called, when it was
   created, and where the expression may be found.
   This quartet of
   elements comes up again and again in ERCs.

   The third degenerate special case of an ARK request (and no other
   cases will be described in this document) is the string "??",
   corresponding to a minimal permanence policy request.  It can be seen
   in use appended to an ARK (on line 2) in the example session that
   follows.

     1  C: [opens session]
        C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
        C:
        S: HTTP/1.1 200 OK
     5  S: Content-Type: text/plain
        S: THUMP-Status: 0.6 200 OK
        S:
        S: erc:
        S: who:   Lederberg, Joshua
    10  S: what:  Studies of Human Families for Genetic Linkage
        S: when:  1974
        S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
        S: erc-support:
        S: who:   USNLM
    15  S: what:  Permanent, Unchanging Content
        S: when:  20010421
        S: where: http://ark.nlm.nih.gov/yy22948
        S: [closes session]

   Each segment in an ERC tells a different story relating to the
   object, so although the same four questions (elements) appear in
   each, the answers depend on the segment's story type.  While the
   first segment tells the story of an expression of the object, the
   second segment tells the story of the support commitment made to it:
   who made the commitment, what the nature of the commitment was, when
   it was made, and where a fuller explanation of the commitment may be
   found.

5.3.  The Electronic Resource Citation (ERC)

   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
   kind of object description that uses Dublin Core Kernel metadata
   elements [DCKernel].  The ERC with Kernel elements provides a simple,
   compact, and printable record for holding data associated with an
   information resource.  As originally designed [Kernel], Kernel
   metadata balances the needs for expressive power, very simple machine
   processing, and direct human manipulation.
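
   As an illustration of how little machinery such a record needs, the
   two-segment response body from the "??" session in Section 5.2 can be
   split into its stories with a few lines of Python.  This is only a
   minimal sketch, not part of the ARK or ERC specifications: the
   function name parse_erc is hypothetical, and the rule that a label
   with an empty value (such as "erc:" or "erc-support:") opens a new
   segment is an assumption inferred from the example sessions.
   Continuation lines and other features described in [ANVL] are not
   handled.

```python
def parse_erc(text):
    """Split an ERC response body into segments of label -> value pairs.

    Assumption (for illustration only): a line whose value is empty,
    such as "erc:" or "erc-support:", opens a new segment, and every
    other "label: value" line belongs to the most recent segment.
    """
    segments = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # ignore blank or malformed lines
        # Split on the first colon only, so URLs in values stay intact.
        label, _, value = line.partition(":")
        label, value = label.strip(), value.strip()
        if not value:
            current = segments.setdefault(label, {})
        elif current is not None:
            current[label] = value
    return segments


# Response body copied from the "??" example session (lines 8-17).
SAMPLE = """\
erc:
who:   Lederberg, Joshua
what:  Studies of Human Families for Genetic Linkage
when:  1974
where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
erc-support:
who:   USNLM
what:  Permanent, Unchanging Content
when:  20010421
where: http://ark.nlm.nih.gov/yy22948
"""

record = parse_erc(SAMPLE)
print(record["erc"]["who"])
print(record["erc-support"]["what"])
```

   Keeping each segment as its own mapping preserves the point made
   above: the same four labels recur, but their meaning depends on the
   segment in which they appear.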
   The previous section shows two limited examples of what is fully
   described elsewhere [ERC].  The rest of this short section provides
   some of the background and rationale for this record format.

   A founding principle of Kernel metadata is that direct human contact
   with metadata will be a necessary and sufficient condition for the
   near term rapid development of metadata standards, systems, and
   services.  Thus the machine-processable Kernel elements must only
   minimally strain people's ability to read, understand, change, and
   transmit ERCs without their relying on intermediation with
   specialized software tools.  The basic ERC needs to be succinct,
   transparent, and trivially parseable by software.

   In the current Internet, it is natural to consider seriously using
   XML as an exchange format because of predictions that it will obviate
   many ad hoc formats and programs, and unify much of the world's
   information under one reliable data structuring discipline that is
   easy to generate, verify, parse, and render.  It appears, however,
   that XML is still only catching on after years of standards work and
   implementation experience.  The reasons for this are unclear, but for
   now very simple XML interpretation is still out of reach.  Another
   important caution is that XML structures are hard on the eyeballs,
   taking up an amount of display (and page) space that significantly
   exceeds that of traditional formats.  Until these conflicts with ERC
   principle are resolved, XML is not a first choice for representing
   ERCs.  Borrowing instead from the data structuring format that
   underlies the successful spread of email and web services, the first
   ERC format uses [ANVL], which is based on email and HTTP headers
   [RFC2822].
   There is a naturalness to ANVL's label-colon-value format
   (seen in the previous section) that barely needs explanation to a
   person beginning to enter ERC metadata.

   Besides simplicity of ERC system implementation and data entry
   mechanics, ERC semantics (what the record and its constituent parts
   mean) must also be easy to explain.  ERC semantics are based on a
   reformulation and extension of the Dublin Core [RFC5013] hypothesis,
   which suggests that the fifteen Dublin Core metadata elements have a
   key role to play in cross-domain resource description.  The ERC
   design recognizes that the Dublin Core's primary contribution is the
   international, interdisciplinary consensus that identified fifteen
   semantic buckets (element categories), regardless of how they are
   labeled.  The ERC then adds a definition for a record and some
   minimal compliance rules.  In pursuing the limits of simplicity, the
   ERC design combines and relabels some Dublin Core buckets to isolate
   a tiny kernel (subset) of four elements for basic cross-domain
   resource description.

   For the cross-domain kernel, the ERC uses the four basic elements --
   who, what, when, and where -- to pretend that every object in the
   universe can have a uniform minimal description.  Each has a name or
   other identifier, a location, some responsible person or party, and a
   date.  It doesn't matter what type of object it is, or whether one
   plans to read it, interact with it, smoke it, wear it, or navigate
   it.  Of course, this approach is flawed because uniformity of
   description for some object types requires more semantic contortion
   and sacrifice than for others.  That is why at the beginning of this
   document, the ARK was said to be suited to objects that accommodate
   reasonably regular electronic description.
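
   The uniform minimal description can be made concrete with a short
   Python sketch that holds the four kernel elements and writes them out
   in label-colon-value form.  The Kernel class and the to_erc function
   are hypothetical names introduced here for illustration; the output
   layout simply imitates the example sessions in Section 5.2 and is not
   a complete rendering of the rules in [ANVL] or [ERC].

```python
from dataclasses import dataclass, fields


@dataclass
class Kernel:
    """The four-element cross-domain kernel (hypothetical helper type)."""
    who: str    # responsible person or party
    what: str   # name or other identifier
    when: str   # date
    where: str  # location


def to_erc(kernel, segment="erc"):
    """Render the kernel as label-colon-value lines under a segment label."""
    lines = [f"{segment}:"]
    for field in fields(kernel):  # fields() preserves declaration order
        lines.append(f"{field.name}: {getattr(kernel, field.name)}")
    return "\n".join(lines) + "\n"


# Values taken from the example session in Section 5.2.
paper = Kernel(
    who="Lederberg, Joshua",
    what="Studies of Human Families for Genetic Linkage",
    when="1974",
    where="http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf",
)
print(to_erc(paper), end="")
```

   The same type serves any segment: passing segment="erc-support" would
   label a support commitment instead of an expression of the object,
   with the four answers supplied accordingly.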
   While insisting on uniformity at the most basic level provides
   powerful cross-domain leverage, the semantic sacrifice is great for
   many applications.  So the ERC also permits a semantically rich and
   nuanced description to co-exist in a record along with a basic
   description.  In that way both sophisticated and naive recipients of
   the record can extract the level of meaning from it that best suits
   their needs and abilities.  Key to unlocking the richer description
   is a controlled vocabulary of ERC record types (not explained in this
   document) that permit knowledgeable recipients to apply defined sets
   of additional assumptions to the record.

5.4.  Advice to Web Clients

   ARKs are envisaged to appear wherever durable object references are
   planned.  Library cataloging records, literature citations, and
   bibliographies are important examples.  In many of these places URLs
   (Uniform Resource Locators) are currently used, and inside some of
   those URLs are embedded URNs, Handles, and DOIs.  Unfortunately,
   there's no suggestion of a way to probe for extra services that would
   build confidence in those identifiers; in other words, there's no way
   to tell whether any of those identifiers is any better managed than
   the average URL.

   ARKs are also envisaged to appear in hypertext links (where they are
   not normally shown to users) and in rendered text (displayed or
   printed).  A normal HTML link for which the URL is not displayed
   looks like this.

      Click Here

   A URL with an embedded ARK invites access (via `?'
   and `??') to extra services:

      Click Here

   Using the [N2T] resolver to provide identifier-scheme-agnostic
   protection against hostname instability, this ARK could be published
   as:

      Click Here

   An NAA will typically make known the associations it creates by
   publishing them in catalogs, actively advertising them, or simply
   leaving them on web sites for visitors (e.g., users, indexing
   spiders) to stumble across in browsing.

5.5.  Security Considerations

   The ARK naming scheme poses no direct risk to computers and networks.
   Implementors of ARK services need to be aware of security issues when
   querying networks and filesystems for Name Mapping Authority
   services, and the concomitant risks from spoofing and obtaining
   incorrect information.  These risks are no greater for ARK mapping
   authority discovery than for other kinds of service discovery.  For
   example, recipients of ARKs with a specified hostport (NMAH) should
   treat it like a URL and be aware that the identified ARK service may
   no longer be operational.

   Apart from mapping authority discovery, ARK clients and servers
   subject themselves to all the risks that accompany normal operation
   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
   As specializations of such protocols, an ARK service may limit
   exposure to the usual risks.  Indeed, ARK services may enhance a kind
   of security by helping users identify long-term reliable references
   to information objects.

6.  References

   [ANVL]     Kunze, J. and B. Kahle, "A Name-Value Language", 2008.

   [ARK]      Kunze, J., "Towards Electronic Persistence Using ARK
              Identifiers", IWAW/ECDL Annual Workshop Proceedings 3rd,
              August 2003.

   [DCKernel] DCMI, "Kernel Metadata Working Group", 2001-2008.

   [DOI]      IDF, "The Digital Object Identifier (DOI) System",
              February 2001.

   [ERC]      Kunze, J.
              and A. Turner, "Kernel Metadata and Electronic
              Resource Citations", October 2007.

   [Handle]   Lannom, L., "Handle System Overview", ICSTI Forum No. 30,
              April 1999.

   [Kernel]   Kunze, J., "A Metadata Kernel for Electronic Permanence",
              Journal of Digital Information Vol 2, Issue 2,
              ISSN 1368-7506, January 2002.

   [N2T]      CDL, "Name-to-Thing Resolver", August 2006.

   [NLMPerm]  Byrnes, M., "Defining NLM's Commitment to the Permanence
              of Electronic Information", ARL 212:8-9, October 2000.

   [NOID]     Kunze, J., "Nice Opaque Identifiers", February 2005.

   [PURL]     Shafer, K., "Introduction to Persistent Uniform Resource
              Locators", 1996.

   [RFC0854]  Postel, J. and J. Reynolds, "Telnet Protocol
              Specification", STD 8, RFC 854, May 1983.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2288]  Lynch, C., Preston, C., and R. Daniel, "Using Existing
              Bibliographic Identifiers as Uniform Resource Names",
              RFC 2288, February 1998.

   [RFC2611]  Daigle, L., van Gulik, D., Iannella, R., and P. Faltstrom,
              "URN Namespace Definition Mechanisms", BCP 33, RFC 2611,
              June 1999.

   [RFC2616]  Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
              Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext
              Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [RFC2822]  Resnick, P., "Internet Message Format", RFC 2822,
              April 2001.

   [RFC2915]  Mealling, M. and R. Daniel, "The Naming Authority Pointer
              (NAPTR) DNS Resource Record", RFC 2915, September 2000.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC5013]  Kunze, J. and T. Baker, "The Dublin Core Metadata Element
              Set", RFC 5013, August 2007.

   [THUMP]    Gamiel, K. and J.
              Kunze, "The HTTP URL Mapping Protocol",
              August 2007.

Appendix A.  ARK Maintenance Agency

   Production settings in which ARKs are used include the University of
   California, the National Library of France, the Internet Archive, and
   Portico, with maintenance based at the California Digital Library
   (CDL), housed at the University of California Office of the
   President.

      http://ark.cdlib.org/

Appendix B.  Looking up NMAHs Distributed via DNS

   This appendix introduces an older method for looking up NMAHs that
   is based on the method for discovering URN resolvers described in
   [RFC2915].  It relies on querying the DNS system already installed in
   the background infrastructure of most networked computers.  A query
   is submitted to DNS asking for a list of resolvers that match a given
   NAAN.  DNS distributes the query to the particular DNS servers that
   can best provide the answer, unless the answer can be found more
   quickly in a local DNS cache as a side-effect of a recent query.
   Responses come back inside Name Authority Pointer (NAPTR) records.
   The normal result is one or more candidate NMAHs.

   In its full generality the [RFC2915] algorithm ambitiously
   accommodates a complex set of preferences, orderings, protocols,
   mapping services, regular expression rewriting rules, and DNS record
   types.  This appendix proposes a drastic simplification of it for
   the special case of ARK mapping authority discovery.  The simplified
   algorithm is called Maptr.  It uses only one DNS record type (NAPTR)
   and restricts most of its field values to constants.  The following
   hypothetical excerpt from a DNS data file for the NAAN known as 12026
   shows three example NAPTR records ready to use with the Maptr
   algorithm.

       12026.ark.arpa.
       ;; US Library of Congress
       ;;       order pref flags service regexp replacement
       IN NAPTR 0     0    "h"   "ark"   "USLC" lhc.nlm.nih.gov:8080
       IN NAPTR 0     0    "h"   "ark"   "USLC" foobar.zaf.org
       IN NAPTR 0     0    "h"   "ark"   "USLC" sneezy.dopey.com

   All the fields are held constant for Maptr except for the "flags",
   "regexp", and "replacement" fields.  The "service" field contains the
   constant value "ark" so that NAPTR records participating in the Maptr
   algorithm will not be confused with other NAPTR records.  The "order"
   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
   the algorithm may evolve to use these fields for ranking decisions
   when usage patterns and local administrative needs are better
   understood.

   When a Maptr query returns a record with a flags field of "h" (for
   hostport, a Maptr extension to the NAPTR flags), the replacement
   field contains the NMAH (hostport) of an ARK service provider.  When
   a query returns a record with a flags field of "" (the empty string),
   the client needs to submit a new query containing the domain name
   found in the replacement field.  This second sort of record exploits
   the distributed nature of DNS by redirecting the query to another
   domain name.  It looks like this.

       12345.ark.arpa.
       ;; Digital Library Consortium
       ;;       order pref flags service regexp replacement
       IN NAPTR 0     0    ""    "ark"   ""     dlc.spct.org.

   Here is the Maptr algorithm for ARK mapping authority discovery.  In
   it, replace <NAAN> with the NAAN from the ARK for which an NMAH is
   sought.

   1.  Initialize the DNS query: type=NAPTR, query=<NAAN>.ark.arpa.

   2.  Submit the query to DNS and retrieve (NAPTR) records, discarding
       any record that does not have "ark" for the service field.

   3.  All remaining records with a flags field of "h" contain
       candidate NMAHs in their replacement fields.  Set them aside, if
       any.

   4.
       Any record with an empty flags field ("") has a replacement field
       containing a new domain name to which a subsequent query should
       be redirected.  For each such record, set query=<replacement>
       then go to step (2).  When all such records have been recursively
       exhausted, go to step (5).

   5.  All redirected queries have been resolved and a set of candidate
       NMAHs has been accumulated from step (3).  If there are zero
       NMAHs, exit -- no mapping authority was found.  If there is one
       or more NMAH, choose one using any criteria you wish, then exit.

   A Perl script that implements this algorithm is included here.

   #!/depot/bin/perl

   use strict;
   use warnings;
   use Net::DNS;                          # include simple DNS package

   my $qtype = "NAPTR";                   # initialize query type
   my $naa = shift                        # get NAAN script argument
       or die "usage: $0 NAAN\n";
   my $mad = Net::DNS::Resolver->new;     # mapping authority discovery

   maptr("$naa.ark.arpa");                # call maptr - that's it

   sub maptr {                            # recursive maptr algorithm
       my $dname = shift;                 # domain name as argument
       my $query = $mad->query($dname, $qtype);
       return                             # non-productive query
           if (! $query || ! $query->answer);
       foreach my $rr ($query->answer) {
           next                           # skip records of wrong type
               if ($rr->type ne $qtype);
           next                           # step 2: keep "ark" service only
               if (lc($rr->service) ne "ark");
           # Read fields through the NAPTR accessors; splitting the
           # rdata string would leave the quotes on flags and service.
           if ($rr->flags eq "") {
               maptr($rr->replacement);   # recurse
           } elsif (lc($rr->flags) eq "h") {
               print $rr->replacement, "\n";   # candidate NMAH
           }
       }
   }

   The global database thus distributed via DNS and the Maptr algorithm
   can easily be seen to mirror the contents of the Name Authority Table
   file described in Section 4.1.

Authors' Addresses

   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA  94612
   US

   Fax:   +1 510-893-5212
   Email: jak@ucop.edu

   R. P. C.
Rodgers 1783 US National Library of Medicine 1784 8600 Rockville Pike, Bldg. 38A 1785 Bethesda, MD 20894 1786 USA 1788 Fax: +1 301-496-0673 1789 Email: rodgers@nlm.nih.gov 1791 Full Copyright Statement 1793 Copyright (C) The IETF Trust (2008). 1795 This document is subject to the rights, licenses and restrictions 1796 contained in BCP 78, and except as set forth therein, the authors 1797 retain all their rights. 1799 This document and the information contained herein are provided on an 1800 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 1801 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 1802 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 1803 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 1804 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 1805 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1807 Intellectual Property 1809 The IETF takes no position regarding the validity or scope of any 1810 Intellectual Property Rights or other rights that might be claimed to 1811 pertain to the implementation or use of the technology described in 1812 this document or the extent to which any license under such rights 1813 might or might not be available; nor does it represent that it has 1814 made any independent effort to identify any such rights. Information 1815 on the procedures with respect to rights in RFC documents can be 1816 found in BCP 78 and BCP 79. 1818 Copies of IPR disclosures made to the IETF Secretariat and any 1819 assurances of licenses to be made available, or the result of an 1820 attempt made to obtain a general license or permission for the use of 1821 such proprietary rights by implementers or users of this 1822 specification can be obtained from the IETF on-line IPR repository at 1823 http://www.ietf.org/ipr. 

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).