Internet-Draft: draft-kunze-ark-03.txt                          J. Kunze
ARK Identifier Scheme                    University of California (UCSF)
Expires 20 August 2002                                  R. P. C. Rodgers
                                         US National Library of Medicine
                                                        20 February 2002

                  The ARK Persistent Identifier Scheme

      (http://www.ietf.org/internet-drafts/draft-kunze-ark-03.txt)

Status of this Document

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Distribution of this document is unlimited. Please send comments to
jak@ckm.ucsf.edu.

Copyright (C) The Internet Society (2002). All Rights Reserved.

Abstract

The ARK (Archival Resource Key) is a scheme intended to facilitate
the persistent naming and retrieval of information objects. It
comprises an identifier syntax and three services. An ARK has four
components:

     [http://NMAH/]ark:/NAAN/Name

an optional and mutable Name Mapping Authority Hostport part (NMAH,
where "hostport" is a hostname followed optionally by a colon and
port number), the "ark:" label, the Name Assigning Authority Number
(NAAN), and the assigned Name.
The NAAN and Name together form the
immutable persistent identifier for the object.

An ARK request is an ARK with a service request and a question mark
appended to it. Use of an ARK request proceeds in two steps. First,
the NMAH, if not specified, is discovered based on the NAAN. Two
discovery methods are proposed: one is file based, the other based
on the DNS NAPTR record. Second, the ARK request is submitted to the
NMAH. Three ARK services are defined, gaining access to: (1) the
object (or a sensible substitute), (2) a description of the object
(metadata), and (3) a description of the commitment made by the NMA
regarding the persistence of the object (policy). These services are
defined initially to use the HTTP protocol. When the NMAH is
specified, the ARK is a valid URL that can gain access to ARK
services using an unmodified Web client.

1. Introduction

This document describes a scheme for the high-quality naming of
information resources. The scheme, called the Archival Resource Key
(ARK), is well suited to long-term access and identification for any
information resources that accommodate reasonably regular electronic
description. This includes digital documents, databases, software,
and websites, as well as physical objects (such as books, bones, and
statues) and intangible objects (chemicals, diseases, vocabulary
terms, performances). Hereafter the term "object" refers to an
information resource. The term ARK itself refers both to the scheme
and to any single identifier that conforms to it.

Schemes for persistent identification of network-accessible objects
are not new. In the early 1990s, the design of the Uniform Resource
Name [URNSYN] responded to the observed failure rate of URLs by
articulating an indirect, non-hostname-based naming scheme and the
need for responsible name management.
Meanwhile, promoters of the
Digital Object Identifier [DOI] succeeded in building a community of
providers around a mature software system that supports name
management. The Persistent Uniform Resource Locator [PURL] is a
third scheme, with the unique advantage of working with unmodified
web browsers. The ARK scheme is a new approach.

A founding principle of the ARK is that persistence is purely a
matter of service. Persistence is neither inherent in an object nor
conferred on it by a particular naming syntax. Rather, persistence
is achieved through a provider's successful stewardship of objects
and their identifiers. The highest level of persistence will be
reinforced by a provider's robust contingency, redundancy, and
succession strategies. It is further safeguarded to the extent that
a provider's mission is shielded from marketplace and political
instabilities.

1.1. Three Reasons to Use ARKs

The first requirement of an ARK is to give users a link from an
object to a promise of stewardship for it. That promise is a
multi-faceted covenant that binds the word of an identified service
provider to a specific set of responsibilities. No one can tell if
successful stewardship will take place because no one can predict the
future. Reasonable conjecture, however, may be based on past
performance. There must be a way to tie a promise of persistence to
a provider's demonstrated or perceived ability -- its reputation --
in that arena. Provider reputations would then rise and fall as
promises are observed variously to be kept and broken. This is
perhaps the best way we have for gauging the strength of any
persistence promise.

The second requirement of an ARK is to give users a link from an
object to a description of it. The problem with a naked identifier
is that, without a description, real identification is incomplete.
Identifiers common today are relatively opaque, though some contain
ad hoc clues that reflect fleeting life cycle events such as the
address of a short stay in a filesystem hierarchy. Possession of
both an identifier and an object is some improvement, but positive
identification may still be elusive since the object itself need not
include a matching identifier or be transparent enough to reveal its
identity without significant research. In either case, what is
called for is a record bearing witness to the identifier's
association with the object, as supported by a recorded set of object
characteristics. This descriptive record is partly an identification
"receipt" with which users and archivists can verify an object's
identity after brief inspection and a plausible match with recorded
characteristics such as title and size.

The final requirement of an ARK is to give users a link to the object
itself (or to a copy) if at all possible. Persistent access is the
central duty of an ARK, with persistent identification playing a
vital but supporting role. Object access may not be feasible for
various reasons, such as catastrophic loss of the object, a licensing
agreement that keeps an archive "dark" for a period of years, or when
an object's own lack of tangible existence precludes normal concepts
of access (e.g., a vocabulary term might be accessed through its
definition). In such cases the ARK's identification role assumes a
much higher profile. But attempts to simplify the persistence
problem by decoupling access from identification and concentrating
exclusively on the latter are of questionable utility. A perfect
system for assigning forever unique identifiers might be created, but
if it did so without reducing access failure rates, no one would be
interested.
The central issue -- which may be summed up as the "HTTP
404 Not Found" problem -- would not have been addressed.

1.2. Organizing Support for ARKs

Co-location of persistent access and identification services is
natural. Any organization that undertakes ongoing support of true
persistent identification (which includes description) is well-served
if it controls, owns, or otherwise has clear internal access to the
identified objects, and this gives it an advantage if it wishes also
to support persistent external access. Conversely, the latter
implies a commitment to collection management activities such as
monitoring, acquisition, verification, and change control over
objects that are persistently identified at least for the sake of
internal record keeping and accountability; this covers the major
prerequisite for external support of persistent identification.
Organizing ARK services under one roof thus tends to make sense.

ARK support is not for everybody. By requiring specific, revealed
commitments to preservation, object access, and description, the bar
for providing ARK services is high. On the other hand, it would be
hard to grant credence to a persistence promise from an organization
that could not muster the minimum ARK services. Not that there isn't
a business model for an ARK-like, description-only service built on
top of another organization's full complement of ARK services. For
example, there might be competition at the description level for
abstracting and indexing a body of scientific literature archived in
a combination of open and fee-based repositories. Such a business
would benefit more from persistence than it would directly support
it.

1.3. A Definition of Identifier

Heretofore, persistence discussion has been hampered by a borrowed
meaning for "identifier" that emerged as a side effect of defining
the Uniform Resource Identifier in [URI]:

     (formerly) An identifier is a sequence of characters with a
     restricted syntax ... that can act as a reference to something
     that has identity.

The term works in context, but falters when employed for persistence.
Troubling phrases arise, such as,

     "The goal is to create an identifier that does not break."

As defined, this kind of identifier "breaks" when it sustains damage
to its character sequence, but really what breaks has to do with the
identifier's reference role. The following definition is proposed.

     (new definition) An identifier is an association between a
     string (a sequence of characters) and an information resource.
     That association is made manifest by a record (e.g., a
     cataloging or other metadata record) that binds the identifier
     string to a set of identifying resource characteristics.

The identifier (the association) must be vouched for by some sort of
record. In the complete absence of any testimony (e.g., metadata)
regarding an association, a would-be identifier string is a
meaningless sequence of characters. To keep an externally visible
but otherwise internal identifier string opaque to outsiders, for
example, it suffices for an organization not to disclose the nature
of its association. For our immediate purpose, actual existence of
an association record is more important than its authenticity. If
one is lucky, an object carries its own identifier as part of itself
(e.g., imprinted on the first page), but in processes such as
resource discovery and retrieval the typical object is often unwieldy
or unavailable (such as when licensing restrictions are in effect).
A metadata record that includes the identifier string is the next
best thing -- a conveniently manipulable surrogate that can act as
both an association "receipt" and "declaration".

It now makes sense to speak of preventing an identifier, as an
association, from breaking. Having said that, this document still
(ab)uses the terms "ARK" and "identifier" as shorthands to refer to
identifier strings, in other words, to sequences of characters. Thus
a discussion of ARK syntax refers to a string format, not an
association format. The context should make the meaning clear.

2. ARK Anatomy

An ARK is represented by a sequence of characters (a string) that
contains the label, "ark:", optionally preceded by the beginning part
of a URL. Here is a diagrammed example.

     http://foobar.zaf.org/ark:/12025/654xz321
     \___________________/ \__/ \___/ \______/
          (optional)         |    |      |
               |     ARK Label    |   Name (assigned by the NAA)
               |                  |
     Name Mapping Authority  Name Assigning Authority
        Hostport (NMAH)          Number (NAAN)

The ARK syntax can be summarized,

     [http://NMAH/]ark:/NAAN/Name

where the NMAH part is in brackets to indicate that it is optional.

2.1. The Name Mapping Authority Hostport (NMAH)

Before the "ark:" label may appear an optional Name Mapping Authority
Hostport (NMAH) that is a temporary address where ARK service
requests may be sent. It consists of "http://" (or any service
specification valid for a URL) followed by an Internet hostname or
hostport combination having the same format and semantics as the
hostport part of a URL. The most important thing about the NMAH is
that it is "identity inert" from the point of view of object
identification. In other words, ARKs that differ only in the
optional NMAH part identify the same object.
Thus, for example, the
following three ARKs are synonyms for but one information resource:

     http://foobar.zaf.org/ark:/12025/654xz321
     http://sneezy.dopey.com/ark:/12025/654xz321
     ark:/12025/654xz321

The NMAH part makes an ARK into an actionable URL. Conversely, any
URL whose path component begins with "ark:/" stands a reasonable
chance of being an ARK (only because such URLs are not common), but
further verification is still required (such as probing the URL for
the three ARK services).

The NMAH part is temporary, disposable, and replaceable. Over time
the NMAH will likely stop working and have to be replaced with a
currently active service provider. This relies on a mapping
authority discovery process, of which two alternate methods are
outlined in a later section. Meanwhile, a carefully chosen NMAH can
be as durable as any Internet domain name, and so may last for a
decade or longer. Users should be prepared, however, to refresh the
NMAH because the one found in the URL form of the ARK may have
stopped working.

The above method for creating an actionable identifier from a basic
ARK (prepending "http://" and an NMAH) is itself temporary. Assuming
that the reign of [HTTP] in information retrieval will end one day,
ARKs will have to be converted into new kinds of actionable
identifiers. In any event, if ARKs see widespread use, web browsers
would presumably evolve to perform this (currently simple)
transformation automatically.

2.2. The Name Assigning Authority Number (NAAN)

The part of the ARK directly following the "ark:" is the Name
Assigning Authority Number (NAAN) enclosed in `/' (slash) characters.
This part is always required, as it identifies the organization that
originally assigned the Name of the object. It is used to discover a
currently valid NMAH and to provide top-level partitioning of the
space of all ARKs.
NAANs are registered in a manner similar to URN
Namespaces, but they are pure numbers consisting of 5 digits or 9
digits. Thus, the first 100,000 registered NAAs fit compactly into
the 5 digits, and if growth warrants, the next billion fit into the 9
digit form. In either case the fixed odd number of digits helps
reduce the chances of finding a NAAN out of context and confusing it
with nearby quantities such as 4-digit dates.

2.3. The Name Part

The final part of the ARK is the Name assigned by the NAA, and it is
also required. The Name is a string of visible ASCII characters and
should be less than 128 bytes in length. The length restriction
keeps the ARK short enough to append ordinary ARK request strings
without running into transport restrictions within HTTP GET requests.
Characters may be letters, digits, or any of these seven characters:

     = @ $ _ * + #

The following characters may also be used, but in limited ways:

     / . - %

The characters `/' and `.' are ignored if either appears as the last
character of an ARK. If used internally, they allow a name assigning
authority to reveal object hierarchy and object variants as described
in the next two sections.

A `-' (hyphen) may appear in an ARK, but must be ignored in lexical
comparisons. The `%' character is reserved for %-encoding all other
octets that would appear in the ARK string, in the same manner as for
URIs [URI]. A %-encoded octet consists of a `%' followed by two hex
digits; for example, "%7d" stands in for `}'. Lower case hex digits
are preferred to reduce the chances of false acronym recognition;
thus it is better to use "%acT" instead of "%ACT". The character `%'
itself must be represented using "%25". As with URNs, %-encoding
permits ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI)
that have less restricted character repertoires [URNBIB].
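
The character rules above can be illustrated with a short Python
sketch. It is not part of the ARK scheme: the function names
encode_name and is_valid_name are hypothetical, and treating input as
UTF-8 before %-encoding is an assumption (the text above speaks only
of ASCII and of octets).

```python
import re

# Characters that may appear literally in a Name: letters, digits, and
# the punctuation characters listed above (= @ $ _ * + #).
PLAIN = set("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789"
            "=@$_*+#")
# Limited-use characters that are passed through unencoded here.
STRUCTURAL = set("/.-")


def encode_name(raw: str) -> str:
    """%-encode every octet that may not appear literally in a Name.

    Assumption: the input is converted to UTF-8 octets first.
    """
    out = []
    for octet in raw.encode("utf-8"):
        ch = chr(octet)
        if ch in PLAIN or ch in STRUCTURAL:
            out.append(ch)
        else:
            # Lower case hex digits are preferred; "%" itself becomes "%25".
            out.append("%{:02x}".format(octet))
    return "".join(out)


# One or more allowed characters or well-formed %-encoded octets.
NAME_RE = re.compile(r"(?:[A-Za-z0-9=@$_*+#/.\-]|%[0-9A-Fa-f]{2})+")


def is_valid_name(name: str) -> bool:
    """Check the character repertoire and the recommended length limit."""
    return len(name) < 128 and NAME_RE.fullmatch(name) is not None
```

Under these assumptions, encode_name("a}b") yields "a%7db", matching
the "%7d" example in the text.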

The creation of names that include linguistically based constructs
(having recognizable meaning from natural language) is strongly
discouraged if long-term persistence is a naming priority. Such
names do not age or travel well. Names that look more or less like
numbers avoid common problems that defeat persistence and
international acceptance. The use of digits is highly recommended.
Mixing in non-vowel alphabetic characters is a relatively safe and
easy way to achieve more compact names, although any character
repertoire can work if potentially troublesome names will be
discarded during a screening process. More on naming considerations
is given in a later section.

2.3.1. Names that Reveal Object Hierarchy

A name assigning authority may choose to reveal the presence of a
hierarchical relationship between objects using the `/' (slash)
character in the Name part of an ARK. If the Name contains an
internal slash, the piece to its left indicates a containing object.
For example, publishing an ARK of the form,

     ark:/12025/654/xz/321

is equivalent to publishing three ARKs,

     ark:/12025/654/xz/321
     ark:/12025/654/xz
     ark:/12025/654

together with a declaration that the first object is contained in the
second object, and that the second object is contained in the third.

Revealing the presence of hierarchy is completely up to the assigning
authority. It is hard enough to commit to one object's name, let
alone to three objects' names and to a specific, ongoing relatedness
among them.
Thus, regardless of whether hierarchy was present
initially, the assigning authority, by not using slashes, reveals no
shared inferences about hierarchical or other inter-relatedness in
the following ARKs:

     ark:/12025/654_xz_321
     ark:/12025/654_xz
     ark:/12025/654xz321
     ark:/12025/654xz
     ark:/12025/654

Note that slashes around the ARK's NAAN (/12025/ in these examples)
are not part of the ARK's Name and therefore do not indicate the
existence of some sort of NAAN super object containing all objects in
its namespace. A slash must have at least one non-structural
character (one that is neither a slash nor a period) on both sides in
order for it to separate recognizable structural components. So
initial or final slashes may be removed, and double slashes may be
converted into single slashes.

2.3.2. Names that Reveal Object Variants

A name assigning authority may choose to reveal the possible presence
of variant objects using the `.' (period) character in the Name part
of an ARK. If the Name contains an internal period, the piece to its
left is a base name and the piece to its right up to the end of the
ARK or to the next period is a suffix. A Name may have more than one
suffix, for example,

     ark:/12025/654.24
     ark:/12025/xz4/654.24
     ark:/12025/654.f55.g78.v20

There are two main rules. First, if two ARKs share the same base
name but have different suffixes, the corresponding objects were
considered variants of each other (different formats, languages,
versions, etc.) by the assigning authority. Thus, the following ARKs
are variants of each other:

     ark:/12025/654.f55.g78.v20
     ark:/12025/654.321xz
     ark:/12025/654.44

Second, publishing an ARK with a suffix implies the existence of at
least one variant identified by the ARK without its suffix. The ARK
otherwise permits no further assumptions about what variants might
exist.
So publishing the ARK,

     ark:/12025/654.f55.g78.v20

is equivalent to publishing the four ARKs,

     ark:/12025/654.f55.g78.v20
     ark:/12025/654.f55.g78
     ark:/12025/654.f55
     ark:/12025/654

Revealing the possibility of variants is completely up to the
assigning authority. It is hard enough to commit to one object's
name, let alone to multiple variants' names and to a specific,
ongoing relatedness among them. The assigning authority is the sole
arbiter of what constitutes a variant within its namespace, and
whether to reveal that kind of relatedness by using periods within
its names.

A period must have at least one non-structural character (one that is
neither a slash nor a period) on both sides in order for it to
separate recognizable structural components. So initial or final
periods may be removed, and double periods may be converted into
single periods. Multiple suffixes should be arranged in sorted order
(pure ASCII collating sequence) at the end of an ARK.

2.3.3. Hyphens are Ignored

Hyphens are always ignored in ARKs. Hyphens may be added to an ARK's
Name part for readability, or during the formatting and wrapping of
text lines, but (as in phone numbers) they are treated as if they
were not present. Thus, like the NMAH, hyphens are "identity inert"
in comparing ARKs for equivalence. For example, the following ARKs
are equivalent for purposes of comparison and ARK service access:

     ark:/12025/65-4-xz-321
     http://sneezy.dopey.com/ark:/12025/654--xz32-1
     ark:/12025/654xz321

2.4. Normalization and Lexical Equivalence

To determine if two or more ARKs identify the same object, the ARKs
are compared for lexical equivalence after first being normalized.
Since ARK strings may appear in various forms (e.g., having different
NMAHs), normalizing them minimizes the chances that comparing two ARK
strings for equality will fail unless they actually identify
different objects. In a specified-host ARK (one having an NMAH), the
NMAH never participates in such comparisons.

Normalization of an ARK for the purpose of octet-by-octet equality
comparison with another ARK consists of four steps. First, any upper
case letters in the "ark:" label and the two characters following a
`%' are converted to lower case. The case of all other letters in
the ARK string must be preserved. Second, any NMAH part is removed
(everything from an initial "http://" up to and including the next
slash) and all hyphens are removed.

Third, structural characters (slash and period) are normalized.
Initial and final occurrences are removed, and two structural
characters in a row (e.g., // or ./) are replaced by the first
character, iterating until each occurrence has at least one
non-structural character on each side. Still within this third step,
if there are any components with a period on the left and a slash on
the right, either the component and the preceding period must be
moved to the end of the Name part or the ARK must be thrown out as
malformed.

The fourth and final step is to arrange the suffixes in ASCII
collating sequence (that is, to sort them) and to remove duplicate
suffixes, if any. It is also permissible to throw out ARKs for which
the suffixes are not sorted.

The resulting ARK string is now normalized. Comparisons between
normalized ARKs are case-sensitive, meaning that upper case letters
are considered different from their lower case counterparts.

To keep ARK string variation to a minimum, no reserved ARK characters
should be %-encoded unless it is deliberately to conceal their
reserved meanings.
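
The four normalization steps can be sketched in Python as follows.
This is one possible reading, not a reference implementation: the
function name normalize_ark is hypothetical, a component with a period
on its left and a slash on its right is rejected (the text equally
permits moving it to the end), only an "http://" NMAH is recognized,
and the 5-or-9-digit NAAN check is taken from Section 2.2.

```python
import re


def normalize_ark(ark: str) -> str:
    """Return the normalized form of an ARK string, or raise ValueError."""
    # Step 2 (first half): drop any NMAH, i.e. an initial "http://" and
    # everything up to and including the next slash.
    rest = re.sub(r"^http://[^/]*/", "", ark.strip(), count=1,
                  flags=re.IGNORECASE)

    # Step 1: the "ark:" label is compared in lower case.
    if rest[:4].lower() != "ark:":
        raise ValueError("missing ark: label")
    body = rest[4:]

    # Step 1 (continued): lower-case the two hex digits after each "%".
    body = re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).lower(), body)

    # Step 2 (second half): hyphens are identity inert, so remove them.
    body = body.replace("-", "")

    naan, _, name = body.lstrip("/").partition("/")
    if not re.fullmatch(r"\d{5}|\d{9}", naan):
        raise ValueError("NAAN must be 5 or 9 digits")

    # Step 3: collapse runs of structural characters to the first one,
    # then strip initial and final structural characters.
    name = re.sub(r"([/.])[/.]+", r"\1", name).strip("/.")
    if not name:
        raise ValueError("empty Name")
    # A suffix followed by a slash is treated as malformed (assumption:
    # we take the "throw out" option rather than moving the suffix).
    if re.search(r"\.[^/.]+/", name):
        raise ValueError("suffix appears before a slash")

    # Step 4: sort the suffixes of the last component and drop duplicates.
    head, slash, last = name.rpartition("/")
    base, *suffixes = last.split(".")
    last = ".".join([base] + sorted(set(suffixes)))

    return f"ark:/{naan}/{head}{slash}{last}"
```

Two ARKs are then lexically equivalent when normalize_ark returns the
same string for both; for instance, "ark:/12025/65-4-xz-321" and
"http://foobar.zaf.org/ark:/12025/654xz321" both normalize to
"ark:/12025/654xz321" in this sketch.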
No non-reserved ARK characters should ever be
%-encoded. Finally, no %-encoded character should ever appear in an
ARK in its decoded form.

2.5. Naming Considerations

The ARK has different goals from the URI, so it has different
character set requirements. Because linguistic constructs imperil
persistence, for ARKs non-ASCII character support is unimportant.
ARKs and URIs share goals of transcribability and transportability
within web documents, so characters are required to be visible,
non-conflicting with HTML/XML syntax, and not subject to tampering
during transmission across common transport gateways. Add the goal of
making an undelimited ARK recognizable in running prose, as in
ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma,
period) end up being excluded from the ARK lest the end of a phrase
or sentence be mistaken for part of the ARK.

A valuable technique for provision of persistent objects is to try to
arrange for the complete identifier to appear on, with, or near its
retrieved object. An object encountered at a moment in time when its
discovery context has long since disappeared could then easily be
traced back to its metadata, to alternate versions, to updates, etc.
This has seen reasonable success, for example, in book publishing and
software distribution.

If persistence is the goal, a deliberate local strategy for
systematic name assignment is crucial. Names must be chosen with
great care. Poorly chosen and managed names will devastate any
persistence strategy, and they do not discriminate based on naming
scheme. Whether a mistakenly re-assigned identifier is a URN, DOI,
PURL, URL, or ARK, the damage -- failed access and confusion -- is
not mitigated more in one scheme than in another.
Conversely, in- 522 house efforts to manage names responsibly will go much further 523 towards safeguarding persistence than any choice of naming scheme or 524 name resolution technology. 526 Hostnames appearing in any identifier meant to be persistent must be 527 chosen with extra care. The tendency in hostname selection has 528 traditionally been to choose a token with recognizable attributes, 529 such as a corporate brand, but that tendency wreaks havoc with 530 persistence that is to outlive brands, corporations, subject 531 classifications, and natural language semantics (e.g., what did the 532 three letters "gay" mean in 1958, 1978, and 1998?). Today's recognized 533 and correct attributes are tomorrow's stale or incorrect attributes. 534 In making hostnames (any names, actually) long-term persistent, it 535 helps to eliminate recognizable attributes to the extent possible. 536 This affects selection of any name based on URLs, including PURLs and 537 the explicitly disposable NMAHs. There is no excuse for a provider 538 that manages its internal names impeccably not to exercise the same 539 care in choosing what could be an exceptionally durable hostname, 540 especially if it would form the prefix for all the provider's URL- 541 based external names. Registering an opaque hostname in the ".org" 542 or ".net" domain would not be a bad start. 544 Dubious persistence speculation does not make selecting naming 545 strategies any easier. For example, despite rumors to the contrary, 546 there are really no obvious reasons why the organizations registering 547 DNS names, URN Namespaces, and DOI publisher IDs should have among 548 them one that is intrinsically more fallible than the next. 549 Moreover, it is a misconception that the demise of DNS and of HTTP 550 need adversely affect the persistence of URLs.
At such a time, 551 certainly URLs from the present day might not then be actionable by 552 our present-day mechanisms, but resolution systems for future non- 553 actionable URLs are no harder to imagine than resolution systems for 554 present-day non-actionable URNs and DOIs. There is no more stable a 555 namespace than one that is dead and frozen, and that would then 556 characterize the space of names bearing the "http://" prefix. It is 557 useful to remember that just because hostnames have been carelessly 558 chosen in their brief history does not mean that they are unsuitable 559 in NMAHs (and URLs) intended for use in situations demanding the 560 highest level of persistence available in the Internet environment. 561 A well-planned name assignment strategy is everything. 563 3. Assigners of ARKs 565 A Name Assigning Authority (NAA) is an organization that creates (or 566 delegates creation of) long-term associations between identifiers and 567 information objects. Examples of NAAs include national libraries, 568 national archives, and publishers. An NAA may arrange with an 569 external organization for identifier assignment. The US Library of 570 Congress, for example, allows OCLC (the Online Computer Library 571 Center, a major world cataloger of books) to create associations 572 between Library of Congress call numbers (LCCNs) and the books that 573 OCLC processes. A cataloging record is generated that testifies to 574 each association, and the identifier is included by the publisher, 575 for example, in the front matter of a book. 577 An NAA does not so much create an identifier as create an 578 association. The NAA first draws an unused identifier string from 579 its namespace, which is the set of all identifiers under its control. 580 It then records the assignment of the identifier to an information 581 object having sundry witnessed characteristics, such as a particular 582 author and modification date. 
A namespace is usually reserved for an 583 NAA by agreement with recognized community organizations (such as 584 IANA and ISO) that all names containing a particular string be under 585 its control. In the ARK an NAA is represented by the Name Assigning 586 Authority Number (NAAN). 588 The ARK namespace reserved for an NAA is the set of names bearing its 589 particular NAAN. For example, all strings beginning with 590 "ark:/12025/" are under control of the NAA registered under 12025, 591 which might be the National Library of Finland. Because each NAA has 592 a different NAAN, names from one namespace cannot conflict with those 593 from another. Each NAA is free to assign names from its namespace 594 (or delegate assignment) according to its own policies. These 595 policies must be documented in a manner similar to the declarations 596 required for URN Namespace registration [URNNID]. 598 For now, registration of ARK NAAs is in a bootstrapping phase. To 599 register, please read about the mapping authority discovery file in 600 the next section and send email to jak@ckm.ucsf.edu. 602 4. Finding a Name Mapping Authority 604 In order to derive an actionable identifier (these days, a URL) from 605 an ARK, a hostport (hostname or hostname plus port combination) for a 606 working Name Mapping Authority (NMA) must be found. An NMA is a 607 service that is able to respond to the three basic ARK service 608 requests. Relying on registration and client-side discovery, NMAs 609 make known which NAAs' identifiers they are willing to service. 611 Upon encountering an ARK, a user (or client software) looks inside it 612 for the optional NMAH part (the hostport of the NMA's ARK service). 613 If it contains an NMAH that is working, this NMAH discovery step may 614 be skipped; the NMAH effectively uses the beginning of an ARK to 615 cache the results of a prior mapping authority discovery process. 
If 616 a new NMAH needs to found, the client looks inside the ARK again for 617 the NAAN (Name Assigning Authority Number). Querying a global 618 database, it then uses the NAAN to look up all current NMAHs that 619 service ARKs issued by the identified NAA. The global database is 620 key, and two specific methods for querying it are given in this 621 section. 623 In the interests of long-term persistence, however, ARK mechanisms 624 are first defined in high-level, protocol-independent terms so that 625 mechanisms may evolve and be replaced over time without compromising 626 fundamental service objectives. Either or both specific methods 627 given here may eventually be supplanted by better methods since, by 628 design, the ARK scheme does not depend on a particular method, but 629 only on having some method to locate an active NMAH. 631 At the time of issuance, at least one NMAH for an ARK should be 632 prepared to service it. That NMA may or may not be administered by 633 the Name Assigning Authority (NAA) that created it. Consider the 634 following hypothetical example of providing long-term access to a 635 cancer research journal. The publisher wishes to turn a profit and 636 the National Library of Medicine wishes to preserve the scholarly 637 record. An agreement might be struck whereby the publisher would act 638 as the NAA and the national library would archive the journal issue 639 when it appears, but without providing direct access for the first 640 six months. During the first six months of peak commercial 641 viability, the publisher would retain exclusive delivery rights and 642 would charge access fees. Again, by agreement, both the library and 643 the publisher would act as NMAs, but during that initial period the 644 library would redirect requests for issues less than six months old 645 to the publisher. 
At the end of the waiting period, the library 646 would then begin servicing requests for issues older than six months 647 by tapping directly into its own archives. Meanwhile, the publisher 648 might routinely redirect incoming requests for older issues to the 649 library. Long-term access is thereby preserved, and so is the 650 commercial incentive to publish content. 652 There is never a requirement that an NAA also run an NMA service, 653 although it seems not an unlikely scenario. Over time NAAs and NMAs 654 would come and go. One NMA would succeed another, and there might be 655 many NMAs serving the same ARKs simultaneously (e.g., as mirrors or 656 as competitors). There might also be asymmetric but coordinated NMAs 657 as in the library-publisher example above. 659 4.1. Looking Up NMAHs in a Globally Accessible File 661 This subsection describes a way to look up NMAHs using a simple text 662 file. For efficient access the file may be stored in a local 663 filesystem, but it needs to be reloaded periodically to incorporate 664 updates. It is not expected that the size of the file or frequency 665 of update should impose an undue maintenance or searching burden any 666 time soon, for even primitive linear search of a file with ten- 667 thousand NAAs is a subsecond operation on modern server machines. 668 The proposed file strategy is similar to the /etc/hosts file strategy 669 that supported Internet host address lookup for a period of years 670 before the advent of the Domain Name System [DNS]. 672 A copy of the current file (at the time of writing) appears in an 673 appendix and is available on the web. A minimal version of the file 674 appears below. Comment lines (lines that begin with `#') explain the 675 format and give the file's modification time, reloading address, and 676 NAA registration instructions. There is even a Perl script that 677 processes the file embedded in the file's comments. 
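The file-based lookup described in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a replacement for the Perl script embedded in the file: the function name lookup_nmahs is hypothetical, the layout is assumed to be the one the proposed table documents in its own comments (an NAA Number and a colon at the start of a line, followed by indented NMA hostports, one per line), and the sample values are taken from the proposed table and are not real.

```python
# Sample in the layout of the proposed lookup table (values are not real).
SAMPLE_TABLE = """\
#
# US National Library of Medicine
12025: http://www.nlm.nih.gov/xxx/naapolicy.html
        lhc.nlm.nih.gov:8080    USNLM
        foobar.zaf.org          UCSF
        sneezy.dopey.com        BIREME
#
# US Library of Congress
12026: http://www.loc.gov/xxx/naapolicy.html
        foobar.zaf.org          USLC
        sneezy.dopey.com        USLC
"""


def lookup_nmahs(table_text: str, naan: str) -> list:
    """Return the NMA hostports listed under the given NAA Number."""
    nmahs = []
    in_entry = False  # True while scanning lines under the matching NAA
    for line in table_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # comment or empty line
        if line[0].isspace():
            # Indented line: a hostport (first token) for the current NAA.
            if in_entry:
                nmahs.append(line.split()[0])
        else:
            # Unindented line: starts a new NAA entry ("NAAN: policy").
            in_entry = line.split(":", 1)[0] == naan
    return nmahs


if __name__ == "__main__":
    for hostport in lookup_nmahs(SAMPLE_TABLE, "12025"):
        print(hostport)
```

A linear scan like this is in keeping with the observation above that even primitive search of such a file is inexpensive.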
Because this is 678 still a proposed file, none of the values in it are real. 680 # 681 # Name Assigning Authority / Name Mapping Authority Lookup Table 682 # Last change: 22 February 2001 683 # Reload from: http://ark.nlm.nih.gov/etc/natab 684 # Mirrored at: http://www.ckm.ucsf.edu/people/jak/home/etc/natab 685 # http://....../etc/natab 686 # To register: mailto:jak@ckm.ucsf.edu?Subject=naareg 687 # Process with: Perl script at end of this file (optional) 688 # 689 # Each NAA appears at the beginning of a line with the NAA Number 690 # first, a colon, and an ARK or URL to a statement of naming policy 691 # (see http://ark.nlm.nih.gov/naapolicyeg.html for an example). 692 # All the NMA hostports that service an NAA are listed, one per 693 # line, indented, after the corresponding NAA line. 694 # 695 # US National Library of Medicine 696 12025: http://www.nlm.nih.gov/xxx/naapolicy.html 697 lhc.nlm.nih.gov:8080 USNLM 698 foobar.zaf.org UCSF 699 sneezy.dopey.com BIREME 700 # 701 # US Library of Congress 702 12026: http://www.loc.gov/xxx/naapolicy.html 703 foobar.zaf.org USLC 704 sneezy.dopey.com USLC 705 # 706 # US National Agriculture Library 707 12027: http://www.nal.gov/xxx/naapolicy.html 708 foobar.zaf.gov:80 USNAL 709 # 710 #--- end of data --- 711 # The enclosed Perl script takes an NAA as argument and outputs 712 # the NMAs in this file listed under any matching NAA. 713 # 714 # my $naa = shift; 715 # while (<>) { 716 # next if (! /^$naa:/); 717 # while (<>) { 718 # last if (! /^[#\s]./); 719 # print "$1\n" if (/^\s+(\S+)/); 720 # } 721 # } 722 # end of file 724 4.2. Looking up NMAHs Distributed via DNS 726 This subsection introduces a method for looking up NMAHs that is 727 based on the method for discovering URN resolvers described in 728 [NAPTR]. It relies on querying the DNS system already installed in 729 the background infrastructure of most networked computers. A query 730 is submitted to DNS asking for a list of resolvers that match a given 731 NAAN. 
DNS distributes the query to the particular DNS servers that 732 can best provide the answer, unless the answer can be found more 733 quickly in a local DNS cache as a side-effect of a recent query. 734 Responses come back inside Name Authority Pointer (NAPTR) records. 735 The normal result is one or more candidate NMAHs. 737 In its full generality the [NAPTR] algorithm ambitiously accommodates 738 a complex set of preferences, orderings, protocols, mapping services, 739 regular expression rewriting rules, and DNS record types. This 740 subsection proposes a drastic simplification of it for the special 741 case of ARK mapping authority discovery. The simplified algorithm is 742 called Maptr. It uses only one DNS record type (NAPTR) and restricts 743 most of its field values to constants. The following hypothetical 744 excerpt from a DNS data file for the NAAN known as 12026 shows three 745 example NAPTR records ready to use with the Maptr algorithm. 747 12026.ark.arpa. 748 ;; US Library of Congress 749 ;; order pref flags service regexp replacement 750 IN NAPTR 0 0 "h" "ark" "USLC" lhc.nlm.nih.gov:8080 751 IN NAPTR 0 0 "h" "ark" "USLC" foobar.zaf.org 752 IN NAPTR 0 0 "h" "ark" "USLC" sneezy.dopey.com 754 All the fields are held constant for Maptr except for the "flags", 755 "regexp", and "replacement" fields. The "service" field contains the 756 constant value "ark" so that NAPTR records participating in the Maptr 757 algorithm will not be confused with other NAPTR records. The "order" 758 and "pref" fields are held to 0 (zero) and otherwise ignored for now; 759 the algorithm may evolve to use these fields for ranking decisions 760 when usage patterns and local administrative needs are better 761 understood. 763 When a Maptr query returns a record with a flags field of "h" (for 764 hostport, a Maptr extension to the NAPTR flags), the replacement 765 field contains the NMAH (hostport) of an ARK service provider. 
When 766 a query returns a record with a flags field of "" (the empty string), 767 the client needs to submit a new query containing the domain name 768 found in the replacement field. This second sort of record exploits 769 the distributed nature of DNS by redirecting the query to another 770 domain name. It looks like this. 772 12345.ark.arpa. 773 ;; Digital Library Consortium 774 ;; order pref flags service regexp replacement 775 IN NAPTR 0 0 "" "ark" "" dlc.spct.org. 777 Here is the Maptr algorithm for ARK mapping authority discovery. In 778 it, NAAN stands for the NAAN from the ARK for which an NMAH is 779 sought. 781 (1) Initialize the DNS query: type=NAPTR, 782 query=NAAN.ark.arpa. 784 (2) Submit the query to DNS and retrieve (NAPTR) records, 785 discarding any record that does not have "ark" for the service 786 field. 788 (3) All remaining records with a flags field of "h" contain 789 candidate NMAHs in their replacement fields. Set them aside, if 790 any. 792 (4) Any record with an empty flags field ("") has a replacement 793 field containing a new domain name to which a subsequent query 794 should be redirected. For each such record, set 795 query to that domain name, then go to step (2). When all such records 796 have been recursively exhausted, go to step (5). 798 (5) All redirected queries have been resolved and a set of 799 candidate NMAHs has been accumulated from step (3). If there 800 are zero NMAHs, exit -- no mapping authority was found. If 801 there is one or more NMAH, choose one using any criteria you 802 wish, then exit. 804 A Perl script that implements this algorithm is included here.
806 #!/depot/bin/perl 808 use Net::DNS; # include simple DNS package 809 my $qtype = "NAPTR"; # initialize query type 810 my $naa = shift; # get NAAN script argument 811 my $mad = new Net::DNS::Resolver; # mapping authority discovery 813 &maptr("$naa.ark.arpa"); # call maptr - that's it 815 sub maptr { # recursive maptr algorithm 816 my $dname = shift; # domain name as argument 817 my ($rr, $order, $pref, $flags, $service, $regexp, 818 $replacement); 819 my $query = $mad->query($dname, $qtype); 820 return # non-productive query 821 if (! $query || ! $query->answer); 822 foreach $rr ($query->answer) { 823 next # skip records of wrong type 824 if ($rr->type ne $qtype); 825 ($order, $pref, $flags, $service, $regexp, 826 $replacement) = split(/\s/, $rr->rdatastr); 827 if ($flags eq "") { 828 &maptr($replacement); # recurse 829 } elsif ($flags eq "h") { 830 print "$replacement\n"; # candidate NMAH 831 } 832 } 833 } 835 The global database thus distributed via DNS and the Maptr algorithm 836 can easily be seen to mirror the contents of the Name Authority Table 837 file described in the previous section. 839 5. Generic ARK Service Definition 841 An ARK request's output is delivered information; examples include 842 the object itself, a policy declaration (e.g., a promise of support), 843 a descriptive metadata record, or an error message. ARK services 844 must be couched in high-level, protocol-independent terms if 845 persistence is to outlive today's networking infrastructural 846 assumptions. The high-level ARK service definitions listed below are 847 followed in the next section by a concrete method (one of many 848 possible methods) for delivering these services with today's 849 technology. 851 5.1. Generic ARK Access Service (access, location) 853 Returns (a copy of) the object or a redirect to the same, although a 854 sensible object proxy may be substituted. 
Examples of sensible 855 substitutes include, 857 - a table of contents instead of a large complex document, 858 - a home page instead of an entire web site hierarchy, 859 - a rights clearance challenge before accessing protected data, 860 - directions for access to an offline object (e.g., a book), 861 - a description of an intangible object (a disease, an event), or 862 - an applet acting as "player" for a large multimedia object. 864 May also return a discriminated list of alternate object locators. 865 If access is denied, returns an explanation of the object's current 866 (perhaps permanent) inaccessibility. 868 5.2. Generic Policy Service (permanence, naming, etc.) 870 Returns declarations of policy and support commitments for given 871 ARKs. Declarations are returned in either a structured metadata 872 format or a human readable text format; sometimes one format may 873 serve both purposes. Policy subareas may be addressed in separate 874 requests, but the following areas should be covered: object 875 permanence, object naming, object fragment addressing, and 876 operational service support. 878 The permanence declaration for an object is a rating defined with 879 respect to an identified permanence provider (guarantor), and may 880 include the following aspects. One permanence rating framework is 881 given in [NLMPerm]. 883 (a) "object availability" -- whether and how access to the 884 object is supported (e.g., online 24x7, or offline only), 886 (b) "identifier validity" -- under what conditions the 887 identifier will be or has been re-assigned, 889 (c) "content invariance" -- under what conditions the content of 890 the object is subject to change, and 892 (d) "change history" -- documentation, whether abbreviated or 893 detailed, of any or all corrections, migrations, revisions, etc. 895 Naming policy for an object includes an historical description of the 896 NAA's (and its successor NAA's) policies regarding differentiation of 897 objects.
It may include the following aspects. 899 (e) "similarity" -- (or "unity") the limit, defined by the NAA, 900 to the level of dissimilarity beyond which two similar objects 901 warrant separate identifiers but before which they share one 902 single identifier, and 904 (f) "granularity" -- the limit, defined by the NAA, to the level 905 of object subdivision beyond which sub-objects do not warrant 906 separately assigned identifiers but before which sub-objects are 907 assigned separate identifiers. 909 Addressing policy for an object includes a description of how, during 910 access, object components (e.g., paragraphs, sections) or views 911 (e.g., image conversions) may or may not be "addressed", in other 912 words, how the NMA permits arguments or parameters to modify the 913 object delivered as the result of an ARK request. If supported, 914 these sorts of operations would provide things like byte-ranged 915 fragment delivery and open-ended format conversions, or any set of 916 possible transformations that would be too numerous to list or to 917 identify with separately assigned ARKs. 919 Operational service support policy includes a description of general 920 operational aspects of the NMA service, such as after-hours staffing 921 and trouble reporting procedures. 923 5.3. Generic Description Service 925 Returns a description of the object. Descriptions are returned in 926 either a structured metadata format or a human readable text format; 927 sometimes one format may serve both purposes. A description must at 928 a minimum answer the who, what, when, and where questions concerning 929 an expression of the object. Standalone descriptions should be 930 accompanied by the modification date and source of the description 931 itself. May also return discriminated lists of ARKs that are related 932 to the given ARK. 934 6. 
Overview of the HTTP Key Mapping Protocol (HKMP) 936 The HTTP Key Mapping Protocol (HKMP) is a way of taking a key (a kind 937 of identifier) and asking such questions as, what information does 938 this identify and how permanent is it? [HKMP] is in fact one 939 specific method under development for delivering ARK services. The 940 protocol runs over HTTP to exploit the web browser's current pre- 941 eminence as user interface to the Internet. HKMP is designed so that 942 a person can enter ARK requests directly into the location field of 943 current browser interfaces. Because it runs over HTTP, HKMP can be 944 simulated and tested within keyboard-based [TELNET] sessions. 946 The asker (a person or client program) starts with an identifier, 947 such as an ARK or a URL. The identifier reveals to the asker (or 948 allows the asker to infer) the Internet host name and port number of 949 a server system that responds to questions. Here, this is just the 950 NMAH that is obtained by inspection and possibly lookup based on the 951 ARK's NAAN. The asker then sets up an HTTP session with the server 952 system, sends a question via an HKMP request (contained within an 953 HTTP request), receives an answer via an HKMP response (contained 954 within an HTTP response), and closes the session. That concludes the 955 connected portion of the protocol. 957 An HKMP request is a string of characters beginning with a `?' 958 (question mark) that is appended to the identifier string. The 959 resulting string is sent as an argument to HTTP's GET command. 960 Request strings too long for GET may be sent using HTTP's POST 961 command. The three most common requests correspond to three 962 degenerate special cases that keep the user's learning and typing 963 burden low. First, a simple key with no request at all is the same 964 as an ordinary access request. 
Thus a plain ARK entered into a 965 browser's location field behaves much like a plain URL, and returns 966 access to the primary identified object, for instance, an HTML 967 document. 969 The second special case is a minimal ARK description request string 970 consisting of just "?". For example, entering the string, 972 ark.nlm.nih.gov/12025/psbbantu? 974 into the browser's location field directly precipitates a request for 975 a metadata record describing the object identified by 976 ark:/12025/psbbantu. The browser, unaware of HKMP, prepares and 977 sends an HTTP GET request in the same manner as for a URL. HKMP is 978 designed so that the response (indicated by the returned HTTP content 979 type) is normally displayed, whether the output is structured for 980 machine processing (text/plain) or formatted for human consumption 981 (text/html). 983 In the following example HKMP session, each line has been annotated 984 to include a line number and whether it was the client or server that 985 sent it. Without going into much depth, the session has three pieces 986 separated from each other by blank lines: the client's piece (lines 987 1-3), the server's HTTP/HKMP response headers (4-7), and the body of 988 the server's response (8-13). The first and last lines (1 and 13) 989 correspond to the client's steps to start the TCP session and the 990 server's steps to end it, respectively. 992 1 C: [opens session] 993 C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1 994 C: 995 S: HTTP/1.1 200 OK 996 5 S: Content-Type: text/plain 997 S: HKMP-Status: 0.1 200 OK 998 S: 999 S: erc: 1000 S: who: Lederberg, Joshua 1001 10 S: what: Studies of Human Families for Genetic Linkage 1002 S: when: 1974 1003 S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf 1004 S: [closes session] 1006 The first two server response lines (4-5) above are typical of HTTP. 1007 The next line (6) is peculiar to HKMP, and indicates the HKMP version 1008 and a normal return status. 
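To make the request and response shapes in the session above concrete, here is a small Python sketch that forms the request URL for an access or description request and picks the HKMP-Status header out of a response. It is only a sketch under assumptions: the function names are hypothetical, no network connection is made, and the status value is assumed to have exactly the three-part layout shown in the example (version, numeric code, reason text).

```python
def hkmp_url(nmah: str, ark: str, request: str = "") -> str:
    """Form the URL sent to the NMAH.

    request is "" for a plain access request or "?" for a description
    request, as in the example session.
    """
    return "http://" + nmah + "/" + ark + request


def split_response(raw: str):
    """Split a raw response into (status line, headers dict, body)."""
    raw = raw.replace("\r\n", "\n")        # tolerate CRLF line endings
    head, _, body = raw.partition("\n\n")  # a blank line separates the body
    lines = head.split("\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


def parse_hkmp_status(value: str):
    """Parse a value such as '0.1 200 OK' into (version, code, reason)."""
    version, code, reason = value.split(None, 2)
    return version, int(code), reason


if __name__ == "__main__":
    print(hkmp_url("ark.nlm.nih.gov", "ark:/12025/psbbantu", "?"))
    raw = ("HTTP/1.1 200 OK\nContent-Type: text/plain\n"
           "HKMP-Status: 0.1 200 OK\n\nerc:\nwho: Lederberg, Joshua\n")
    status_line, headers, body = split_response(raw)
    print(parse_hkmp_status(headers["hkmp-status"]))
```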
The balance of the response (8-12) is 1009 the single metadata record that comprises the ARK description service 1010 response. The record is in the format of an Electronic Resource 1011 Citation [ERC], which is discussed in more detail in the next 1012 section. For now, note that it contains four elements that answer 1013 the top priority questions regarding an expression of the object: 1014 who played a major role in expressing it, what the expression was 1015 called, when it was created, and where the expression may be found. 1016 This quartet of elements comes up again and again in ERCs. 1018 The third degenerate special case of an ARK request (and no other 1019 cases will be described in this document) is the string "??", 1020 corresponding to a minimal permanence policy request. It can be seen 1021 in use appended to an ARK (on line 2) in the example session that 1022 follows. 1024 1 C: [opens session] 1025 C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1 1026 C: 1027 S: HTTP/1.1 200 OK 1028 5 S: Content-Type: text/plain 1029 S: HKMP-Status: 0.1 200 OK 1030 S: 1031 S: erc: 1032 S: who: Lederberg, Joshua 1033 10 S: what: Studies of Human Families for Genetic Linkage 1034 S: when: 1974 1035 S: where: http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf 1036 S: erc-support: 1037 S: who: USNLM 1038 15 S: what: Permanent, Unchanging Content 1039 S: when: 2001 04 21 1040 S: where: http://ark.nlm.nih.gov/yy22948 1041 S: [closes session] 1043 Again, a single metadata record (lines 8-17) is returned, but it 1044 consists of two segments. The first segment (8-12) gives the same 1045 basic citation information as in the previous example. It is 1046 returned in order to establish context for the persistence 1047 declaration in the second segment (13-17). 1049 Each segment in an ERC tells a different story relating to the 1050 object, so although the same four questions (elements) appear in 1051 each, the answers depend on the segment's story type.
While the 1052 first segment tells the story of an expression of the object, the 1053 second segment tells the story of the support commitment made to it: 1054 who made the commitment, what the nature of the commitment was, when 1055 it was made, and where a fuller explanation of the commitment may be 1056 found. 1058 7. Overview of Electronic Resource Citations (ERCs) 1060 An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a 1061 simple, compact, and printable record designed to hold data 1062 associated with an information resource. By design, the ERC is a 1063 metadata format that balances the needs for expressive power, very 1064 simple machine processing, and direct human manipulation. 1066 A founding principle of the ERC is that direct human contact with 1067 metadata will be a necessary and sufficient condition for the near 1068 term rapid development of metadata standards, systems, and services. 1069 Thus the machine-processable ERC format must only minimally strain 1070 people's ability to read, understand, change, and transmit ERCs 1071 without their relying on intermediation with specialized software 1072 tools. The basic ERC needs to be succinct, transparent, and 1073 trivially parseable by software. 1075 In the current Internet, it is natural seriously to consider using 1076 XML as an exchange format because of predictions that it will obviate 1077 many ad hoc formats and programs, and unify much of the world's 1078 information under one reliable data structuring discipline that is 1079 easy to generate, verify, parse, and render. It appears, however, 1080 that XML is still only catching on after years of standards work and 1081 implementation experience. The reasons for it are unclear, but for 1082 now very simple XML interpretation is still out of reach. 
Another 1083 important caution is that XML structures are hard on the eyeballs, 1084 taking up an amount of display (and page) space that significantly 1085 exceeds that of traditional formats. Until these conflicts with ERC 1086 principle are resolved, XML is not a first choice for representing 1087 ERCs. Borrowing instead from the data structuring format that 1088 underlies the successful spread of email and web services, the first 1089 ERC format is based on email and HTTP headers (RFC822) [EMHDRS]. 1090 There is a naturalness to its label-colon-value format (seen in the 1091 previous section) that barely needs explanation to a person beginning 1092 to enter ERC metadata. 1094 Besides simplicity of ERC system implementation and data entry 1095 mechanics, ERC semantics (what the record and its constituent parts 1096 mean) must also be easy to explain. ERC semantics are based on a 1097 reformulation and extension of the Dublin Core [DCORE] hypothesis, 1098 which suggests that the fifteen Dublin Core metadata elements have a 1099 key role to play in cross-domain resource description. The ERC 1100 design recognizes that the Dublin Core's primary contribution is the 1101 international, interdisciplinary consensus that identified fifteen 1102 semantic buckets (element categories), regardless of how they are 1103 labeled. The ERC then adds a definition for a record and some 1104 minimal compliance rules. In pursuing the limits of simplicity, the 1105 ERC design combines and relabels some Dublin Core buckets to isolate 1106 a tiny kernel (subset) of four elements for basic cross-domain 1107 resource description. 1109 For the cross-domain kernel, the ERC uses the four basic elements -- 1110 who, what, when, and where -- to pretend that every object in the 1111 universe can have a uniform minimal description. Each has a name or 1112 other identifier, a location, some responsible person or party, and a 1113 date. 
It doesn't matter what type of object it is, or whether one 1114 plans to read it, interact with it, smoke it, wear it, or navigate 1115 it. Of course, this approach is flawed because uniformity of 1116 description for some object types requires more semantic contortion 1117 and sacrifice than for others. That is why at the beginning of this 1118 document, the ARK was said to be suited to objects that accommodate 1119 reasonably regular electronic description. 1121 While insisting on uniformity at the most basic level provides 1122 powerful cross-domain leverage, the semantic sacrifice is great for 1123 many applications. So the ERC also permits a semantically rich and 1124 nuanced description to co-exist in a record along with a basic 1125 description. In that way both sophisticated and naive recipients of 1126 the record can extract the level of meaning from it that best suits 1127 their needs and abilities. Key to unlocking the richer description 1128 is a controlled vocabulary of ERC record types (not explained in this 1129 document) that permit knowledgeable recipients to apply defined sets 1130 of additional assumptions to the record. 1132 7.1. ERC Syntax 1134 An ERC record is a sequence of metadata elements ending in a blank 1135 line. An element consists of a label, a colon, and an optional 1136 value. Here is an example of a record with five elements. 1138 erc: 1139 who: Gibbon, Edward 1140 what: The Decline and Fall of the Roman Empire 1141 when: 1781 1142 where: http://www.ccel.org/g/gibbon/decline/ 1144 A long value may be folded (continued) onto the next line by 1145 inserting a newline and indenting the next line. A value can be thus 1146 folded across multiple lines. Here are two example elements, each 1147 folded across four lines. 

      who/created:  University of California, San Francisco, AIDS
         Program at San Francisco General Hospital | University
         of California, San Francisco, Center for AIDS Prevention
         Studies
      what/Topic:
         Heart Attack | Heart Failure
         | Heart
         Diseases

   An element value folded across several lines is treated as if the
   lines were joined together on one long line. For example, the second
   element from the previous example is considered equivalent to

      what/Topic: Heart Attack | Heart Failure | Heart Diseases

   An element value may contain multiple values, each one separated from
   the next by a `|' (pipe) character. The element from the previous
   example contains three values.

   For annotation purposes, any line beginning with a `#' (hash)
   character is treated as if it were not present; this is a "comment"
   line (a feature not available in email or HTTP headers). For
   example, the following element is spread across four lines and
   contains two values:

      what/Topic:
         Heart Attack
      #  | Heart Failure  -- hold off until next review cycle
         | Heart Diseases

7.2. ERC Stories

   An ERC record is organized into one or more distinct segments, where
   each segment tells a story about a different aspect of the
   information resource. A segment boundary occurs whenever a segment
   label (an element beginning with "erc") is encountered. The basic
   label "erc:" introduces the story of an object's expression (e.g.,
   its publication, installation, or performance). The label
   "erc-about:" introduces the story of an object's content (what it is
   about) and "erc-support:" introduces the story of a support
   commitment made to it. A story segment that concerns the ERC itself
   is introduced by the label "erc-from:". It is an important segment
   that tells the story of the ERC's provenance.
   Elements beginning with "erc" are reserved for segment labels and
   their associated story types. From an earlier example, here is an
   ERC with two segments.

      erc:
      who:    Lederberg, Joshua
      what:   Studies of Human Families for Genetic Linkage
      when:   1974
      where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
      erc-support:
      who:    NIH/NLM/LHNCBC
      what:   Permanent, Unchanging Content
      # Note to ops staff:  date needs verification.
      when:   2001 04 21
      where:  http://ark.nlm.nih.gov/yy22948

   Segment stories are told according to journalistic tradition. While
   any number of pertinent elements may appear in a segment, priority is
   placed on answering the questions who, what, when, and where at the
   beginning of each segment so that readers can make the most important
   selection or rejection decisions as soon as possible. To make things
   simple, the listed ordering of the questions is maintained in each
   segment (as it happens, most people who have been exposed to this
   storytelling technique are already familiar with the above
   ordering).

   The four questions are answered by using corresponding element
   labels. The four element labels can be re-used in each story
   segment, but their meaning changes depending on the segment (the
   story type) in which they appear. In the example above, "who" is
   first used to name a document's author and subsequently used to name
   the permanence guarantor (provider). Similarly, "when" first lists
   the date of object creation and in the next segment lists the date of
   a commitment decision. Four labels appearing across three segments
   effectively map to twelve semantically distinct elements. Distinct
   element meanings are mapped to Dublin Core elements in a later
   section.

7.3. The ERC Anchoring Story

   Each ERC contains an anchoring story.
   It is usually the first segment labeled "erc:" and it concerns an
   "anchoring" expression of the object. An "anchoring" expression is
   the one that a provider deemed the most suitable basic referent
   given the audience and application for which it produced the ERC.
   If it sounds like the provider has great latitude in choosing its
   anchoring expression, it is because it does. A typical anchoring
   story in an ERC for a born-digital document would be the story of
   the document's release on a web site; such a document would then be
   the anchoring expression.

   An anchoring story need not be the central descriptive goal of an ERC
   record. For example, a museum provider may create an ERC for a
   digitized photograph of a painting but choose to anchor it in the
   story of the original painting instead of the story of the electronic
   likeness; although the ERC may through other segments prove to be
   centrally concerned with describing the electronic likeness, the
   provider may have chosen this particular anchoring story in order to
   make the ERC visible in a way that is most natural to patrons (who
   would find the Mona Lisa under da Vinci sooner than they would find
   it under the name of the person who snapped the photograph or scanned
   the image). In another example, a provider that creates an ERC for a
   dramatic play as an abstract work has the task of describing a piece
   of intangible intellectual property. To anchor this abstract object
   in the concrete world, if only through a derivative expression, it
   makes sense for the provider to choose a suitable printed edition of
   the play as the anchoring object expression (to describe in the
   anchoring story) of the ERC.

   The anchoring story has special rules designed to keep ERC processing
   simple and predictable.
   Each of the four basic elements (who, what, when, and where) must be
   present, unless a best effort to supply it fails. In the event of
   failure, the element still appears but a special value (described
   later) is used to explain the missing value. While the requirement
   that each of the four elements be present only applies to the
   anchoring story segment, as usual these elements appear at the
   beginning of the segment and may only be used in the prescribed
   order. A minimal ERC would normally consist of just an anchoring
   story and the element quartet, as illustrated in the next example.

      erc:
      who:    National Research Council
      what:   The Digital Dilemma
      when:   2000
      where:  http://books.nap.edu/html/digital%5Fdilemma

   A minimal ERC can be abbreviated so that it resembles a traditional
   compact bibliographic citation that is nonetheless completely machine
   processable. The required elements and ordering make it possible to
   eliminate the element labels, as shown here.

      erc: National Research Council | The Digital Dilemma | 2000
           | http://books.nap.edu/html/digital%5Fdilemma

7.4. ERC Elements

   As mentioned, the four basic ERC elements (who, what, when, and
   where) take on different specific meanings depending on the story
   segment in which they are used. By appearing in each segment, albeit
   in different guises, the four elements serve as a valuable mnemonic
   device -- a kind of checklist -- for constructing minimal story
   segments from scratch. Again, it is only in the anchoring segment
   that all four elements are mandatory.

   Here are some mappings between ERC elements and Dublin Core [DCORE]
   elements.

      Segment     ERC Element   Equivalent Dublin Core Element
      ---------   -----------   ------------------------------
      erc         who           Creator/Contributor/Publisher
      erc         what          Title
      erc         when          Date
      erc         where         Identifier
      erc-about   who
      erc-about   what          Subject
      erc-about   when          Coverage (temporal)
      erc-about   where         Coverage (spatial)

   The basic element labels may also be qualified to add nuances to the
   semantic categories that they identify. Elements are qualified by
   appending a `/' (slash) and a qualifier term. Often qualifier terms
   appear as the past tense form of a verb because it makes re-using
   qualifiers among elements easier.

      who/published:   ...
      when/published:  ...
      where/published: ...

   Using past tense verbs for qualifiers also reminds providers and
   recipients that element values contain transient assertions that may
   have been true once, but that tend to become less true over time.
   Recipients that don't understand the meaning of a qualifier can fall
   back onto the semantic category (bucket) designated by the
   unqualified element label. Inevitably recipients (people and
   software) will have diverse abilities in understanding elements and
   qualifiers.

   Any number of other elements and qualifiers may be used in
   conjunction with the quartet of basic segment questions. The only
   semantic requirement is that they pertain to the segment's story.
   Also, it is only the four basic elements that change meaning
   depending on their segment context. All other elements have meaning
   independent of the segment in which they appear. If an element label
   stripped of its qualifier is still not recognized by the recipient, a
   second fall back position is to ignore it and rely on the four basic
   elements.

   Elements may be Canonical, Provisional, or Local.
   Canonical elements are officially recognized via a registry as part
   of the metadata vernacular. All elements, qualifiers, and segment
   labels used in this document up until now belong to that vernacular.
   Provisional elements are also officially recognized via the registry,
   but have only been proposed for inclusion in the vernacular. To be
   promoted to the vernacular, a provisional element passes through a
   vetting process during which its documentation must be in order and
   its community acceptance demonstrated. Local elements are any
   elements not officially recognized in the registry. The registry
   [REG] is a work in progress.

   Local elements are immediately distinguishable from Canonical or
   Provisional elements because all terms that begin with an upper case
   letter are reserved for spontaneous local use. No term beginning
   with an upper case letter will ever be assigned Canonical or
   Provisional status, so it should be safe to use such terms for local
   purposes. Any recipient of external ERCs containing such terms will
   understand them to be part of the originating provider's local
   metadata dialect. Here's an example ERC with three segments, one
   local element, and two local qualifiers. The segment boundaries have
   been emphasized by comment lines (which, as before, are ignored by
   processors).

      erc:
      who:    Bullock, TH | Achimowicz, JZ | Duckrow, RB
              | Spencer, SS | Iragui-Madoz, VJ
      what:   Bicoherence of intracranial EEG in sleep,
              wakefulness and seizures
      when:   1997 12 00
      where:  http://cogprints.soton.ac.uk/%{
                 documents/disk0/00/00/01/22/index.html %}
      in:     EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
      IDcode: cog00000122
      # ---- new segment ----
      erc-about:
      what/Subcategory:  Bispectrum | Nonlinearity | Epilepsy
              | Cooperativity | Subdural | Hippocampus | Higher moment
      # ---- new segment ----
      erc-from:
      who:    NIH/NLM/NCBI
      what:   pm9546494
      when/Reviewed: 1998 04 18 021600
      where:  http://ark.nlm.nih.gov/12025/pm9546494?

   The local element "IDcode" immediately precedes the "erc-about"
   segment, which itself contains an element with the local qualifier
   "Subcategory". The second to last element also carries the local
   qualifier "Reviewed". Finally, what might be a provisional element
   "in" appears near the end of the first segment. It might have been
   proposed as a way to complete a citation for an object originally
   appearing inside another object (such as an article appearing in a
   journal or an encyclopedia).

7.5. ERC Element Values

   ERC element values tend to be straightforward strings. If the
   provider intends something special for an element, it will so
   indicate with markers at the beginning of its value string. The
   markers are designed to be uncommon enough that they would not likely
   occur in normal data except by deliberate intent. Markers can only
   occur near the beginning of a string, and once any octet of
   non-marker data has been encountered, no further marker processing is
   done for the element value. In the absence of markers the string is
   considered pure data; this has been the case with all the examples
   seen thus far.
   The fullest form of an element value with all three optional markers
   in place looks like this.

      VALUE = [markup_flags] (:ccode) , DATA

   In processing, the first non-whitespace character of an ERC element
   value is examined. An initial `[' is reserved to introduce a
   bracketed set of markup flags (not described in this document) that
   ends with `]'. If ERC data is machine-generated, each value string
   may be preceded by "[]" to prevent any of its data from being
   mistaken for markup flags. Once past the optional markup, the
   remaining value may optionally begin with a controlled code. A
   controlled code always has the form "(:ccode)", for example,

      who:  (:unkn) Anonymous
      what: (:791) Bee Stings

   Any string after such a code is taken to be an uncontrolled (e.g.,
   natural language) equivalent. The code "unkn" indicates a
   conventional explanation for a missing value (stating that the value
   is unknown). The remainder of the string makes an equivalent
   statement in a form that the provider deemed most suitable to its
   (probably human) audience. The code "791" could be a fixed numeric
   topic identifier within an unspecified topic vocabulary. Any code
   may be ignored by those that do not understand it.

   There are several codes to explain different ways in which a required
   element's value may go missing.

      (:unkn)   unknown (e.g., Anonymous, Inconnue)
      (:unav)   value unavailable indefinitely
      (:unac)   temporarily inaccessible
      (:unap)   not applicable, makes no sense
      (:unas)   value unassigned (e.g., Untitled)
      (:none)   never had a value, never will
      (:null)   explicitly empty
      (:unal)   unallowed, suppressed intentionally

   Once past an optional controlled code, the remaining string value is
   subjected to one final test. If the next non-whitespace character
   is a `,' (comma), it indicates that the string value is
   "sort-friendly".
   This means (a) that the value is laid out with an inverted word
   order useful for sorting items having comparably laid out element
   values (items might be the containing ERC records) and (b) that the
   value may contain other commas that indicate inversion points should
   it become necessary to recover the value in natural word order.
   Typically, this feature is used to express Western-style personal
   names in family-name-given-name order. It can also be used wherever
   natural word order might make sorting tricky, such as when data
   contains titles or corporate names. Here are some example elements.

      who:  , van Gogh, Vincent
      who:,Howell, III, PhD, 1922-1987, Thurston
      who:, Acme Rocket Factory, Inc., The
      who:, Mao Tse Tung
      who:, McCartney, Paul, Sir,
      what:, Health and Human Services, United States Government
         Department of, The,

   There are rules to use in recovering a copy of the value in natural
   word order, if desired. The above example strings have the following
   natural word order values, respectively.

      Vincent van Gogh
      Thurston Howell, III, PhD, 1922-1987
      The Acme Rocket Factory, Inc.
      Mao Tse Tung
      Sir Paul McCartney
      The United States Government Department of Health and Human Services

7.6. ERC Element Encoding and Dates

   Some characters that need to appear in ERC element values might
   conflict with special characters used for structuring ERCs, so there
   needs to be a way to include them as literal characters that are
   protected from special interpretation. This is accomplished through
   an encoding mechanism that resembles the %-encoding familiar to [URI]
   handlers.

   The ERC encoding mechanism also uses `%', but instead of taking two
   following hexadecimal digits, it takes one non-alphanumeric character
   or two alphabetic characters that cannot be mistaken for hex digits.
   It is designed not to be confused with normal web-style %-encoding.
   In particular it can be decoded without risking unintended decoding
   of normal %-encoded data (which would introduce errors). Here are
   the one-character (non-alphanumeric) ERC encoding extensions.

      ERC   Purpose
      ---   ------------------------------------------------
      %!    decodes to the element separator `|'
      %%    decodes to a percent sign `%'
      %.    decodes to a comma `,'
      %_    a non-character used as a syntax shim
      %{    a non-character that begins an expansion block
      %}    a non-character that ends an expansion block

   One particularly useful construct in ERC element values is the pair
   of special encoding markers ("%{" and "%}") that indicates an
   "expansion" block. Whatever string of characters they enclose will
   be treated as if none of the contained whitespace (SPACEs, TABs,
   Newlines) were present. This comes in handy for writing long,
   multi-part URLs in a readable way. For example, the value in

      where: http://foo.bar.org/node%{
                ? db = foo
                & start = 1
                & end = 5
                & buf = 2
                & query = foo + bar + zaf
             %}

   is decoded into an equivalent element, but with a correct and intact
   URL:

      where:
        http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf

   In a parting word about ERC element values, a commonly recurring
   value type is a date, possibly followed by a time. ERC dates take on
   one of the following forms:

      1999                 (four digit year)
      2000 12 29           (year, month, day)
      2000 12 29 235955    (year, month, day, hour, minute, second)

      21 Spring    31 1st quarter    25 Spring (so. hemisphere)
      22 Summer    32 2nd quarter    26 Summer (so. hemisphere)
      23 Fall      33 3rd quarter    27 Fall (so. hemisphere)
      24 Winter    34 4th quarter    28 Winter (so.
hemisphere)

   In dates, all internal whitespace is squeezed out to achieve a
   normalized form suitable for lexical comparison and sorting. This
   means that the following dates

      2000 12 29 235955      (recommended for readability)
      2000 12 29 23 59 55
      20001229 23 59 55
      20001229235955         (normalized date and time)

   are all equivalent. The first form is recommended for readability.
   The last form (shortest and easiest to compute with) is the
   normalized form. Hyphens and commas are reserved to create date
   ranges and lists, for example,

      1996-2000              (a range of five years)
      1952, 1957, 1969       (a list of three years)
      1952, 1958-1967, 1985  (a mixed list of dates and ranges)
      20001229-20001231      (a range of three days)

7.7. ERC Stub Records and Internal Support

   The ERC design introduces the concept of a "stub" record, which is an
   incomplete ERC record intended to be supplemented with additional
   elements before being released as a standalone ERC record. A stub
   ERC record has no minimum required elements. It is just a group of
   elements that does not begin with "erc:" but otherwise conforms to
   the ERC record syntax.

   ERC stubs may be useful in supporting internal procedures using the
   ERC syntax. Often they rely on the convenience and accuracy of
   automatically supplied elements, even the basic ones. To be ready
   for external use, however, an ERC stub must be transformed into a
   complete ERC record having the usual required elements. An ERC stub
   record can be convenient for metadata embedded in a document, where
   elements such as location, modification date, and size -- which one
   would not omit from an externalized record -- are omitted simply
   because they are much better supplied by a computation. A separate
   local administrative procedure, not defined for ERCs in general,
   would effect the promotion of stubs into complete records.
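
   As an illustration only, one such local procedure might be sketched
   in Python as follows. Nothing here is defined by the ERC design:
   the function names parse_elements and promote_stub, the "computed"
   argument holding automatically supplied values, and the choice of
   the "(:unkn)" code for a value that cannot be supplied are all
   assumptions made for this sketch. It handles a single-segment stub
   and ignores markup flags and %-encoding.

```python
KERNEL = ("who", "what", "when", "where")


def parse_elements(text):
    """Return (label, value) pairs from ERC-style text.

    Lines beginning with '#' are skipped, indented lines are folded
    into the previous value, and a blank line ends the record.
    """
    elements = []
    for line in text.splitlines():
        if line.startswith("#"):
            continue
        if not line.strip():
            break
        if line[0] in " \t" and elements:
            label, value = elements[-1]
            elements[-1] = (label, (value + " " + line.strip()).strip())
            continue
        label, _, value = line.partition(":")
        elements.append((label.strip(), value.strip()))
    return elements


def promote_stub(stub_text, computed):
    """Build a complete record from a stub (hypothetical local procedure).

    'computed' maps kernel labels to automatically supplied values; a
    computed value is used only when the stub lacks that element.
    """
    elements = parse_elements(stub_text)
    if elements and elements[0][0] == "erc":
        raise ValueError("record already begins with erc:, not a stub")
    kernel, extras = {}, []
    for label, value in elements:
        if label in KERNEL and label not in kernel:
            kernel[label] = value
        else:
            extras.append((label, value))
    lines = ["erc:"]
    for label in KERNEL:
        # Assumption: fall back to the "unknown" code when nothing is supplied.
        value = kernel.get(label) or computed.get(label) or "(:unkn)"
        lines.append(f"{label}: {value}")
    lines.extend(f"{label}: {value}" for label, value in extras)
    return "\n".join(lines) + "\n\n"  # a blank line ends the record
```

   Calling promote_stub with a stub that carries only "who" and "what",
   plus a "computed" mapping that supplies "when" and "where" (for
   instance a modification date and a placeholder location such as
   http://example.org/doc), yields a record that begins with "erc:"
   and lists the four basic elements in the prescribed order, followed
   by any remaining elements from the stub.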

   While the ERC is a general-purpose container for exchange of resource
   descriptions, it does not dictate how records must be internally
   stored, laid out, or assembled by data providers or recipients.
   Arbitrary internal descriptive frameworks can support ERCs simply by
   mapping (e.g., on demand) local records to the ERC container format
   and making them available for export. Therefore, to support ERCs
   there is no need for a data provider to convert internal data to be
   stored in an ERC format. On the other hand, any provider (such as
   one just getting started in the business of resource description) may
   choose to store and manipulate local data natively in the ERC format.

8. Advice to Web Clients

   This section offers some advice to web client software developers.
   It is hard to write about because it tries to anticipate a series of
   events that might lead to native web browser support for ARKs.

   ARKs are envisaged to appear wherever durable object references are
   planned. Library cataloging records, literature citations, and
   bibliographies are important examples. In many of these places URLs
   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
   PURLs have been proposed as alternatives.

   The strings representing ARKs are also envisaged to appear in some of
   the places where URLs currently appear: in hypertext links (where
   they are not normally shown to users) and in rendered text (displayed
   or printed). Internet search engines, for example, tend to include
   both actionable and manifest links when listing each item found. A
   normal HTML link for which the URL is not displayed looks like this.

      Click Here

   The same link with an ARK instead of a URL:

      Click Here

   Web browsers would in general require a small modification to
   recognize and convert this ARK, via mapping authority discovery, to
   the URL form.

      Click Here

   A browser that knows how to make that conversion could also
   automatically detect and replace a non-working NMAH.

   An NAA will typically make known the associations it creates by
   publishing them in catalogs, actively advertising them, or simply
   leaving them on web sites for visitors (e.g., users, indexing
   spiders) to stumble across in browsing.

9. Security Considerations

   The ARK naming scheme poses no direct risk to computers and networks.
   Implementors of ARK services need to be aware of security issues when
   querying networks and filesystems for Name Mapping Authority
   services, and the concomitant risks from spoofing and obtaining
   incorrect information. These risks are no greater for ARK mapping
   authority discovery than for other kinds of service discovery. For
   example, recipients of ARKs with a specified hostport (NMAH) should
   treat it like a URL and be aware that the identified ARK service may
   no longer be operational.

   Apart from mapping authority discovery, ARK clients and servers
   subject themselves to all the risks that accompany normal operation
   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
   As specializations of such protocols, an ARK service may limit
   exposure to the usual risks. Indeed, ARK services may enhance a kind
   of security by helping users identify long-term reliable references
   to information objects.

10. Authors' Addresses

   John A. Kunze
   Center for Knowledge Management
   University of California, San Francisco
   530 Parnassus Ave, Box 0840
   San Francisco, CA 94143-0840, USA

   Fax:   +1 415-476-4653
   EMail: jak@ckm.ucsf.edu

   R. P. C. Rodgers
   US National Library of Medicine
   8600 Rockville Pike, Bldg. 38A
   Bethesda, MD 20894

   Fax:   +1 301-496-0673
   EMail: rodgers@nlm.nih.gov

11.
References

   [DCORE]    Dublin Core Metadata Initiative, "Dublin Core Metadata
              Element Set, Version 1.1: Reference Description", July
              1999, http://dublincore.org/documents/dces/.

   [DNS]      P.V. Mockapetris, "Domain Names - Concepts and
              Facilities", RFC 1034, November 1987.

   [DOI]      International DOI Foundation, "The Digital Object
              Identifier (DOI) System", February 2001,
              http://dx.doi.org/10.1000/203.

   [EMHDRS]   D. Crocker, "Standard for the format of ARPA Internet text
              messages", RFC 822, August 1982.

   [ERC]      J. Kunze, "Electronic Resource Citations", work in
              progress.

   [HKMP]     J. Kunze, "HTTP Key Mapping Protocol", work in progress.

   [HTTP]     R. Fielding, et al, "Hypertext Transfer Protocol --
              HTTP/1.1", RFC 2616, June 1999.

   [MD5]      R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
              April 1992.

   [NAPTR]    M. Mealling, R. Daniel, "The Naming Authority Pointer
              (NAPTR) DNS Resource Record", RFC 2915, September 2000.

   [NLMPerm]  M. Byrnes, "Defining NLM's Commitment to the Permanence of
              Electronic Information", ARL 212:8-9, October 2000,
              http://www.arl.org/newsltr/212/nlm.html

   [PURL]     K. Shafer, et al, "Introduction to Persistent Uniform
              Resource Locators", 1996,
              http://purl.oclc.org/OCLC/PURL/INET96

   [REG]      J. Kunze, "Resource Metadata Vocabulary", work in
              progress.

   [URI]      T. Berners-Lee, et al, "Uniform Resource Identifiers
              (URI): Generic Syntax", RFC 2396, August 1998.

   [URNBIB]   C. Lynch, et al, "Using Existing Bibliographic Identifiers
              as Uniform Resource Names", RFC 2288, February 1998.

   [URNSYN]   R. Moats, "URN Syntax", RFC 2141, May 1997.

   [URNNID]   L. Daigle, et al, "URN Namespace Definition Mechanisms",
              RFC 2611, June 1999.

   [TELNET]   J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
              RFC 854, May 1983.

12.
Appendix: An NLM Prototype ARK Service

   The US National Library of Medicine (NLM) has an experimental,
   prototype ARK service under development. It is being made available
   for purposes of demonstrating various aspects of the ARK system, but
   is subject to temporary or permanent withdrawal (without notice)
   depending upon the circumstances of the small research group
   responsible for making it available. It is described at:

      http://ark.nlm.nih.gov/

   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

13. Appendix: Current ARK Name Authority Table

   This appendix contains a copy of the Name Authority Table (a file) at
   the time of writing. It may be loaded into a local filesystem (e.g.,
   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
   NMAHs (Name Mapping Authority Hostports). It contains Perl code that
   can be copied into a standalone script that processes the table (as a
   file). Because this is still a proposed file, none of the values in
   it are real.

#
# Name Assigning Authority / Name Mapping Authority Lookup Table
#     Last change:  22 February 2001
#     Reload from:  http://ark.nlm.nih.gov/etc/natab
#     Mirrored at:  http://www.ckm.ucsf.edu/people/jak/home/etc/natab
#                   http://....../etc/natab
#     To register:  mailto:jak@ckm.ucsf.edu?Subject=naareg
#     Process with: Perl script at end of this file (optional)
#
# Each NAA appears at the beginning of a line with the NAA Number
# first, a colon, and an ARK or URL to a statement of naming policy
# (see http://ark.nlm.nih.gov/naapolicyeg.html for an example).
# All the NMA hostports that service an NAA are listed, one per
# line, indented, after the corresponding NAA line.
#
# US Library of Congress
12025: http://www.loc.gov/xxx/naapolicy.html
    foobar.zaf.org
    sneezy.dopey.com
#
# US National Library of Medicine
12026: http://www.nlm.nih.gov/xxx/naapolicy.html
    lhc.nlm.nih.gov:8080
    foobar.zaf.org
    sneezy.dopey.com
#
# US National Agriculture Library
12027: http://www.nal.gov/xxx/naapolicy.html
    foobar.zaf.gov:80
#
#--- end of data ---
# The enclosed Perl script takes an NAA as argument and outputs
# the NMAs in this file listed under any matching NAA.
#
# my $naa = shift;
# while (<>) {
#     next if (! /^$naa:/);
#     while (<>) {
#         last if (! /^[#\s]./);
#         print "$1\n" if (/^\s+(\S+)/);
#     }
# }
# end of file

14. Copyright Notice

   Copyright (C) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.

   Expires 20 August 2002

Table of Contents

   Status of this Document
   Abstract
   1. Introduction
   1.1. Three Reasons to Use ARKs
   1.2. Organizing Support for ARKs
   1.3. A Definition of Identifier
   2. ARK Anatomy
   2.1. The Name Mapping Authority Hostport (NMAH)
   2.2. The Name Assigning Authority Number (NAAN)
   2.3. The Name Part
   2.3.1. Names that Reveal Object Hierarchy
   2.3.2. Names that Reveal Object Variants
   2.3.3. Hyphens are Ignored
   2.4. Normalization and Lexical Equivalence
   2.5. Naming Considerations
   3. Assigners of ARKs
   4.
   Finding a Name Mapping Authority
   4.1. Looking Up NMAHs in a Globally Accessible File
   4.2. Looking up NMAHs Distributed via DNS
   5. Generic ARK Service Definition
   5.1. Generic ARK Access Service (access, location)
   5.2. Generic Policy Service (permanence, naming, etc.)
   5.3. Generic Description Service
   6. Overview of the HTTP Key Mapping Protocol (HKMP)
   7. Overview of Electronic Resource Citations (ERCs)
   7.1. ERC Syntax
   7.2. ERC Stories
   7.3. The ERC Anchoring Story
   7.4. ERC Elements
   7.5. ERC Element Values
   7.6. ERC Element Encoding and Dates
   7.7. ERC Stub Records and Internal Support
   8. Advice to Web Clients
   9. Security Considerations
   10. Authors' Addresses
   11. References
   12. Appendix: An NLM Prototype ARK Service
   13. Appendix: Current ARK Name Authority Table
   14. Copyright Notice