idnits 2.17.1 

draft-kunze-ark-13.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 16.

  -- Found old boilerplate from RFC 3978, Section 5.5, updated by RFC 4748 on
     line 2363.

  ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure
     Invitation. 


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 50
     longer pages, the longest (page 2) being 63 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 51 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 8 instances of too long lines in the document, the longest one
     being 21 characters in excess of 72.

  ** The abstract seems to contain references ([Qualifier]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.

  == There are 8 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust Copyright Line does not match the
     current year

  == Line 1136 has weird spacing: '... regexp  repla...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (23 February 2007) is 6272 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'Qualifier' is mentioned on line 436, but not defined

  == Unused Reference: 'MD5' is defined on line 2140, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ANVL'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ARK'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Handle'

  ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref.
     'MD5')

  -- Possible downref: Non-RFC (?) normative reference: ref. 'N2T'

  ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC
     3401, RFC 3402, RFC 3403, RFC 3404)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NOID'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL'

  ** Obsolete normative reference: RFC  822 (Obsoleted by RFC 2822)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'TEMPER'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'THUMP'

  ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC
     3986)

  ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref.
     'URNBIB')

  ** Obsolete normative reference: RFC 2141 (ref. 'URNSYN') (Obsoleted by RFC
     8141)

  ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC
     3406)


     Summary: 18 errors (**), 0 flaws (~~), 8 warnings (==), 18 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1	Internet-Draft: draft-kunze-ark-13.txt                          J. Kunze
2	ARK Identifier Scheme                    University of California (UCOP)
3	Expires 23 August 2007                                  R. P. C. Rodgers
4	                                         US National Library of Medicine
5	                                                        23 February 2007

7	                  The ARK Persistent Identifier Scheme

9	      (http://www.ietf.org/internet-drafts/draft-kunze-ark-13.txt)

11	Status of this Document

13	   By submitting this Internet-Draft, each author represents that any
14	   applicable patent or other IPR claims of which he or she is aware
15	   have been or will be disclosed, and any of which he or she becomes
16	   aware will be disclosed, in accordance with Section 6 of BCP 79.

18	   Internet-Drafts are working documents of the Internet Engineering
19	   Task Force (IETF), its areas, and its working groups.  Note that
20	   other groups may also distribute working documents as Internet-
21	   Drafts.

23	   Internet-Drafts are draft documents valid for a maximum of six months
24	   and may be updated, replaced, or obsoleted by other documents at any
25	   time.  It is inappropriate to use Internet-Drafts as reference
26	   material or to cite them other than as ``work in progress.''

28	   The list of current Internet-Drafts can be accessed at
29	   http://www.ietf.org/ietf/1id-abstracts.txt

31	   The list of Internet-Draft Shadow Directories can be accessed at
32	   http://www.ietf.org/shadow.html.

34	   Distribution of this document is unlimited.  Please send comments to
35	   jak@ucop.edu.

37	   Copyright (C) The IETF Trust (2007).  All Rights Reserved.

39	Abstract

41	   The ARK (Archival Resource Key) naming scheme is designed to
42	   facilitate the high-quality and persistent identification of
43	   information objects. A founding principle of the ARK is that
44	   persistence is purely a matter of service and is neither inherent in
45	   an object nor conferred on it by a particular naming syntax. The best
46	   that an identifier can do is to lead users to the services that
47	   support persistence. The term ARK itself refers both to the scheme
48	   and to any single identifier that conforms to it.  An ARK has five
49	   components:

51	              [http://NMAH/]ark:/NAAN/Name[Qualifier]

53	   an optional and mutable Name Mapping Authority Hostport, the "ark:"
54	   label, the Name Assigning Authority Number (NAAN), the assigned Name,
55	   and an optional and possibly mutable Qualifier supported by the NMA.
56	   The NAAN and Name together form the immutable persistent identifier
57	   for the object.  An ARK is a special kind of URL that connects users
58	   to three things: the named object, its metadata, and the provider's
59	   promise about its persistence. When entered into the location field
60	   of a Web browser, the ARK leads the user to the named object. That
61	   same ARK, followed by a single question mark ('?'), returns a brief
62	   metadata record that is both human- and machine-readable. When the
63	   ARK is followed by dual question marks ('??'), the returned metadata
64	   contains a commitment statement from the current provider.  Tools
65	   exist for minting, binding, and resolving ARKs.

67	1.  Introduction

69	   This document describes a scheme for the high-quality naming of
70	   information resources.  The scheme, called the Archival Resource Key
71	   (ARK), is well suited to long-term access and identification of any
72	   information resources that accommodate reasonably regular electronic
73	   description.  This includes digital documents, databases, software,
74	   and websites, as well as physical objects (books, bones, statues,
75	   etc.) and intangible objects (chemicals, diseases, vocabulary terms,
76	   performances).  Hereafter the term "object" refers to an information
77	   resource.  The term ARK itself refers both to the scheme and to any
78	   single identifier that conforms to it.  A reasonably concise and
79	   accessible overview and rationale for the scheme is available at
80	   [ARK].

82	   Schemes for persistent identification of network-accessible objects
83	   are not new.  In the early 1990's, the design of the Uniform Resource
84	   Name [URNSYN] responded to the observed failure rate of URLs by
85	   articulating an indirect, non-hostname-based naming scheme and the
86	   need for responsible name management.  Meanwhile, promoters of the
87	   Digital Object Identifier [DOI] succeeded in building a community of
88	   providers around a mature software system [Handle] that supports name
89	   management.  The Persistent Uniform Resource Locator [PURL] is
90	   another scheme that has the unique advantage of working with
91	   unmodified web browsers.  ARKs represent an approach that attempts to
92	   build on the strengths and to avoid the weaknesses of the other
93	   schemes.

95	   A founding principle of the ARK is that persistence is purely a
96	   matter of service.  Persistence is neither inherent in an object nor
97	   conferred on it by a particular naming syntax.  Nor is the technique
98	   of name indirection - upon which URNs, Handles, DOIs, and PURLs are
99	   founded - of central importance.  Name indirection is an ancient and
100	   well-understood practice; new mechanisms for it keep appearing and
101	   distracting practitioner attention, with the Domain Name System [DNS]
102	   being a particularly dazzling and elegant example.  What is often
103	   forgotten is that maintenance of an indirection table is the
104	   overwhelming and unavoidable cost to the organization providing
105	   persistence, and the cost is equivalent across naming schemes.  That
106	   indirection has always been a native part of the web while being so
107	   lightly utilized for the persistence of web-based objects is an
108	   indication of how unsuited most organizations are to the task of
109	   table maintenance and to the overall challenge of digital permanence.

111	   Persistence is achieved through a provider's successful stewardship
112	   of objects and their identifiers.  The highest level of persistence
113	   will be reinforced by a provider's robust contingency, redundancy,
114	   and succession strategies.  It is further safeguarded to the extent
115	   that a provider's mission is shielded from marketplace and political
116	   instabilities.  These are by far the major challenges confronting
117	   persistence providers, and no identifier scheme has any direct impact
118	   on them.  In fact, some schemes may be actual liabilities for
119	   persistence because they create short- and long-term dependencies for
120	   every object access on complex, special-purpose local and global
121	   infrastructures, parts of which are proprietary and all of which
122	   increase the carry-forward burden for the preservation community.  It
123	   is for this reason that the ARK scheme relies only on educated name
124	   assignment and light use of general-purpose infrastructures that the
125	   entire internet community needs (the DNS, web servers, and web
126	   browsers) and that one can reasonably expect many others to help
127	   carry forward into the technologically evolving future.

129	1.1.  Reasons to Use ARKs

131	   If no persistent identifier scheme contributes directly to
132	   persistence, why not just use URLs?  A particular URL may be as
133	   durable an identifier as it is possible to have, but nothing
134	   distinguishes it from an ordinary URL to the recipient who is
135	   wondering if it is suitable for long-term reference.  An ARK is just
136	   a URL, distinguished by its form, that provides some of the necessary
137	   conditions for credible persistence.  An ARK invites access to not
138	   one, but to three things:  to the object, to its metadata, and to a
139	   nuanced statement of commitment from the provider regarding the
140	   object.  Existence of the two extra services can be probed
141	   automatically by appending either `?' or `??' to the ARK.
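
   The probing convention above can be illustrated with a minimal
   Python sketch.  The function name and dictionary keys are
   hypothetical and not part of this specification; the hostname is the
   illustrative one used in this document's ARK Anatomy example.

```python
def ark_service_urls(ark_url: str) -> dict:
    """Return the URLs for the three services an ARK invites access to.

    Hypothetical helper: names here are illustrative only.
    """
    # Tolerate an ARK that already carries a trailing '?' or '??'.
    base = ark_url.rstrip("?")
    return {
        "object": base,              # the named object itself
        "metadata": base + "?",      # brief metadata record
        "commitment": base + "??",   # metadata with the commitment statement
    }


urls = ark_service_urls("http://foobar.zaf.org/ark:/12025/654xz321")
```

   Whether a given server actually answers the two extra requests can
   only be established by sending them; the sketch merely builds the
   request strings.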

143	   The form of the ARK also supports the natural separation of naming
144	   authorities into the original name assigning authority and the
145	   diverse multiple name mapping (or servicing) authorities that in
146	   succession and in parallel will take over custodial responsibilities
147	   from the original assigner for the large majority of a long-term
148	   object's archival lifetime.  The mapping authority, indicated by the
149	   hostname part of the URL that contains the ARK, serves to launch the
150	   ARK into cyberspace.  Should it ever fail (and there is no reason why
151	   a well-chosen hostname of a 100-year-old cultural memory institution
152	   shouldn't last as long as the DNS), that host name is considered
153	   disposable and replaceable.  Again, the form of the ARK helps
154	   because it defines exactly how to recover the core immutable object
155	   identity, and several simple algorithms (based on the URN model) are
156	   defined for locating another mapping authority.

158	   There are tools to assist in generating ARKs and other identifiers,
159	   such as [NOID] and "uuidgen", both of which rely for uniqueness on
160	   human-maintained registries.  This document also contains some
161	   guidelines and considerations for managing namespaces and choosing
162	   hostnames wisely.

164	1.2.  Three Requirements of ARKs

166	   The first requirement of an ARK is to give users a link from an
167	   object to a promise of stewardship for it.  That promise is a multi-
168	   faceted covenant that binds the word of an identified service
169	   provider to a specific set of responsibilities.  No one can tell if
170	   successful stewardship will take place because no one can predict the
171	   future.  Reasonable conjecture, however, may be based on past
172	   performance.  There must be a way to tie a promise of persistence to
173	   a provider's demonstrated or perceived ability - its reputation - in
174	   that arena.  Provider reputations would then rise and fall as
175	   promises are observed variously to be kept and broken.  This is
176	   perhaps the best way we have for gauging the strength of any
177	   persistence promise.  Note that over time, current providers have
178	   nothing to do with the intentions of the original assigners of names.

180	   The second requirement of an ARK is to give users a link from an
181	   object to a description of it.  The problem with a naked identifier
182	   is that without a description real identification is incomplete.
183	   Identifiers common today are relatively opaque, though some contain
184	   ad hoc clues that reflect brief life cycle periods such as the
185	   address of a short stay in a filesystem hierarchy.  Possession of
186	   both an identifier and an object is some improvement, but positive
187	   identification may still be uncertain since the object itself might
188	   not include a matching identifier or might not carry evidence obvious
189	   enough to reveal its identity without significant research.  In
190	   either case, what is called for is a record bearing witness to the
191	   identifier's association with the object, as supported by a recorded
192	   set of object characteristics.  This descriptive record is partly an
193	   identification "receipt" with which users and archivists can verify
194	   an object's identity after brief inspection and a plausible match
195	   with recorded characteristics such as title and size.

197	   The final requirement of an ARK is to give users a link to the object
198	   itself (or to a copy) if at all possible.  Persistent access is the
199	   central duty of an ARK.  Persistent identification plays a vital
200	   supporting role but, strictly speaking, it can be construed as no
201	   more than a record attesting to the original assignment of a never-
202	   reassigned identifier.  Object access may not be feasible for various
203	   reasons, such as catastrophic loss of the object, a licensing
204	   agreement that keeps an archive "dark" for a period of years, or when
205	   an object's own lack of tangible existence confuses normal concepts
206	   of access (e.g., a vocabulary term might be accessed through its
207	   definition).  In such cases the ARK's identification role assumes a
208	   much higher profile.  But attempts to simplify the persistence
209	   problem by decoupling access from identification and concentrating
210	   exclusively on the latter are of questionable utility.  A perfect
211	   system for assigning forever unique identifiers might be created, but
212	   if it did so without reducing access failure rates, no one would be
213	   interested.  The central issue - which may be summed up as the "HTTP
214	   404 Not Found" problem - would not have been addressed.

216	1.3.  Organizing Support for ARKs:  Our Stuff vs. Their Stuff

218	   An organization and the user community it serves can often be seen to
219	   struggle with two different areas of persistent identification: the
220	   Our Stuff problem and the Their Stuff problem.  In the Our Stuff
221	   problem, we in the organization want our own objects to acquire
222	   persistent names.  Since we possess or control these objects, our
223	   organization tackles the Our Stuff problem directly.  Whether or not
224	   the objects are named by ARKs, our organization is the responsible
225	   party, so it can plan for, maintain, and make commitments about the
226	   objects.

228	   In the Their Stuff problem, we in the organization want others'
229	   objects to acquire persistent names.  These are objects that we do
230	   not own or control, but some of which are critically important to us.
231	   But because they are beyond our influence as far as support is
232	   concerned, creating and maintaining persistent identifiers for Their
233	   Stuff is not especially purposeful or feasible for us to do.  There
234	   is little that we can do about someone else's stuff except encourage
235	   them to find or become providers of persistence services.

237	   Co-location of persistent access and identification services is
238	   natural.  Any organization that undertakes ongoing support of true
239	   persistent identification (which includes description) is well-served
240	   if it controls, owns, or otherwise has clear internal access to the
241	   identified objects, and this gives it an advantage if it wishes also
242	   to support persistent access to outsiders.  Conversely, persistent
243	   access to outsiders requires orderly internal collection management
244	   procedures that include monitoring, acquisition, verification, and
245	   change control over objects, which in turn requires object
246	   identifiers persistent enough to support auditable record keeping
247	   practices.

249	   Although organizing ARK services under one roof thus tends to make
250	   sense, object hosting can successfully be separated from name
251	   mapping.  An example is when a name mapping authority centrally
252	   provides uniform resolution services via a protocol gateway on behalf
253	   of organizations that host objects behind a variety of access
254	   protocols.  It is also reasonable to build value-added description
255	   services that rely on the underlying services of a set of mapping
256	   authorities.

258	   Supporting ARKs is not for every organization.  By requiring
259	   specific, revealed commitments to preservation, to object access, and
260	   to description, the bar for providing ARK services is higher than for
261	   some other identifier schemes.  On the other hand, it would be hard
262	   to grant credence to a persistence promise from an organization that
263	   could not muster the minimum ARK services.  Not that there isn't a
264	   business model for an ARK-like, description-only service built on top
265	   of another organization's full complement of ARK services.  For
266	   example, there might be competition at the description level for
267	   abstracting and indexing a body of scientific literature archived in
268	   a combination of open and fee-based repositories.  The description-
269	   only service would have no direct commitment to the objects, but
270	   would act as an intermediary, forwarding commitment statements from
271	   object hosting services to requestors.

273	1.4.  Definition of Identifier

275	   An identifier is not a string of character data - an identifier is an
276	   association between a string of data and an object.  This abstraction
277	   is necessary because without it a string is just data.  It's nonsense
278	   to talk about a string's breaking, or about its being strong,
279	   maintained, and authentic.  But as a representative of an
280	   association, a string can do, metaphorically, the things that we
281	   expect of it.

283	   Without regard to whether an object is physical, digital, or
284	   conceptual, to identify it is to claim an association between it and
285	   a representative string, such as "Jane" or "ISBN 0596000278".  What
286	   gives a claim credibility is a set of verifiable assertions, or
287	   metadata, about the object, such as age, height, title, or number of
288	   pages.  In other words, the association is made manifest by a record
289	   (e.g., a cataloging or other metadata record) that vouches for it.

291	   In the complete absence of any testimony (metadata) regarding an
292	   association, a would-be identifier string is a meaningless sequence
293	   of characters.  To keep an externally visible but otherwise internal
294	   string from being perceived as an identifier by outsiders, for
295	   example, it suffices for an organization not to disclose the nature
296	   of its association.  For our immediate purpose, actual existence of
297	   an association record is more important than its authenticity or
298	   verifiability, which are outside the scope of this specification.

300	   It is a gift to the identification process if an object carries its
301	   own name as an inseparable part of itself, such as an identifier
302	   imprinted on the first page of a document or embedded in a data
303	   structure element of a digital document header.  In cases where the
304	   object is large, unwieldy, or unavailable (such as when licensing
305	   restrictions are in effect), a metadata record that includes the
306	   identifier string will usually suffice.  That record becomes a
307	   conveniently manipulable object surrogate, acting as both an
308	   association "receipt" and "declaration".

310	   Note that our definition of identifier extends the one in use for
311	   Uniform Resource Identifiers [URI].  The present document still
312	   sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for
313	   the string part of an identifier, but the context should make the
314	   meaning clear.

316	2.  ARK Anatomy

318	   An ARK is represented by a sequence of characters (a string) that
319	   contains the label, "ark:", optionally preceded by the beginning part
320	   of a URL.  Here is a diagrammed example.

322	         http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff
323	         \___________________/ \__/ \___/ \______/ \____________/
324	           (replaceable)        |     |      |       Qualifier
325	                |         ARK Label   |      |    (NMA-supported)
326	                |                     |      |
327	      Name Mapping Authority          |    Name (NAA-assigned)
328	         Hostport (NMAH)              |
329	                           Name Assigning Authority Number (NAAN)

331	   The ARK syntax can be summarized,

333	                    [http://NMAH/]ark:/NAAN/Name[Qualifier]

335	   where the NMAH and Qualifier parts are in brackets to indicate that
336	   they are optional.
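
   As an illustration of the syntax summary, the following Python
   sketch splits an ARK string into its components.  It is not a
   normative parser.  The names "Ark" and "parse_ark" are hypothetical,
   and the rule that the Name ends at the first '/' or '.' is an
   assumption of this sketch (the delimiter is kept as part of the
   Qualifier).

```python
import re
from typing import NamedTuple, Optional


class Ark(NamedTuple):
    nmah: Optional[str]   # Name Mapping Authority Hostport, if present
    naan: str             # Name Assigning Authority Number
    name: str             # Name assigned by the NAA
    qualifier: str        # NMA-supported Qualifier ('' if absent)


# Assumption of this sketch: the Name ends at the first '/' or '.'.
_ARK_RE = re.compile(
    r"^(?:[A-Za-z][A-Za-z0-9+.-]*://(?P<nmah>[^/]+)/)?"  # optional prefix
    r"ark:/(?P<naan>[0-9]+)"                             # label and NAAN
    r"/(?P<name>[^/.]+)"                                 # Name
    r"(?P<qualifier>.*)$"                                # optional Qualifier
)


def parse_ark(text: str) -> Ark:
    """Split an ARK string into NMAH, NAAN, Name, and Qualifier."""
    m = _ARK_RE.match(text)
    if m is None:
        raise ValueError("not an ARK: %r" % text)
    return Ark(m.group("nmah"), m.group("naan"),
               m.group("name"), m.group("qualifier"))


ark = parse_ark("http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff")
```

   Applied to the diagrammed example, the sketch yields the NMAH
   "foobar.zaf.org", the NAAN "12025", the Name "654xz321", and the
   Qualifier "/s3/f8.05v.tiff".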

338	2.1.  The Name Mapping Authority Hostport (NMAH)

340	   Before the "ark:" label may appear an optional Name Mapping Authority
341	   Hostport (NMAH) that is a temporary address where ARK service
342	   requests may be sent.  It consists of "http://" (or any service
343	   specification valid for a URL) followed by an Internet hostname or
344	   hostport combination having the same format and semantics as the
345	   hostport part of a URL.  The most important thing about the NMAH is
346	   that it is "identity inert" from the point of view of object
347	   identification.  In other words, ARKs that differ only in the
348	   optional NMAH part identify the same object.  Thus, for example, the
349	   following three ARKs are synonyms for just one information object:

351	                      http://loc.gov/ark:/12025/654xz321
352	                  http://rutgers.edu/ark:/12025/654xz321
353	                                     ark:/12025/654xz321
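
   A minimal Python sketch of this "identity inert" property follows.
   The helper names are hypothetical; the sketch simply discards
   whatever precedes the "ark:/" label before comparing.

```python
def strip_nmah(ark: str) -> str:
    """Drop any service prefix and NMAH, keeping 'ark:/...' onward."""
    idx = ark.find("ark:/")
    if idx < 0:
        raise ValueError("no 'ark:/' label in %r" % ark)
    return ark[idx:]


def same_object(a: str, b: str) -> bool:
    """True when two ARKs differ at most in their optional NMAH part."""
    return strip_nmah(a) == strip_nmah(b)


synonyms = [
    "http://loc.gov/ark:/12025/654xz321",
    "http://rutgers.edu/ark:/12025/654xz321",
    "ark:/12025/654xz321",
]
distinct = {strip_nmah(s) for s in synonyms}
```

   Here the set "distinct" holds a single string, since all three
   forms reduce to the same NMAH-free ARK.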

355	   Strictly speaking, in the realm of digital objects, these ARKs may
356	   lead over time to somewhat different or diverging instances of the
357	   originally named object.  In an ideal world, divergence of persistent
358	   objects is not desirable, but it is widely believed that digital
359	   preservation efforts will inevitably lead to alterations in some
360	   original objects (e.g., a format migration in order to preserve the
361	   ability to display a document).  If any of those objects are held
362	   redundantly in more than one organization (a common preservation
363	   strategy), chances are small that all holding organizations will
364	   perform the same precise transformations and all maintain the same
365	   object metadata.  More significant divergence would be expected when
366	   the holding organizations serve different audiences or compete with
367	   each other.

369	   The NMAH part makes an ARK into an actionable URL.  As with many
370	   internet parameters, it is helpful to approach the NMAH being liberal
371	   in what you accept and conservative in what you propose.  From the
372	   recipient's point of view, the NMAH part should be treated as
373	   temporary, disposable, and replaceable.  From the NMA's point of
374	   view, it should be chosen with the greatest concern for longevity.  A
375	   carefully chosen NMAH should be at least as permanent as the
376	   providing organization's own hostname.  In the case of a national or
377	   university library, for example, there is no reason why the NMAH
378	   should not be considerably more permanent than soft-funded proxy
379	   hostnames such as hdl.handle.net, dx.doi.org, and purl.org.  In
380	   general and over time, however, it is not unexpected for an NMAH
381	   eventually to stop working and require replacement with the NMAH of a
382	   currently active service provider.

384	   This replacement relies on a mapping authority "resolver" discovery
385	   process, of which two alternate methods are outlined in a later
386	   section.  The ARK, URN, Handle, and DOI schemes all use a resolver
387	   discovery model that sooner or later requires matching the original
388	   assigning authority with a current provider servicing that
389	   authority's named objects; once found, the resolver at that provider
390	   performs what amounts to a redirect to a place where the object is
391	   currently held.  All the schemes rely on the ongoing functionality of
392	   currently mainstream technologies such as the Domain Name System
393	   [DNS] and web browsers.  The Handle and DOI schemes in addition
394	   require that the Handle protocol layer and global server grid be
395	   available at all times.

397	   The practice of prepending "http://" and an NMAH to an ARK is a way
398	   of creating an actionable identifier by a method that is itself
399	   temporary.  Assuming that infrastructure supporting [HTTP]
400	   information retrieval will no longer be available one day, ARKs will
401	   then have to be converted into new kinds of actionable identifiers.
402	   By that time, if ARKs see widespread use, web browsers would
403	   presumably evolve to perform this (currently simple) transformation
404	   automatically.
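
   The transformation mentioned above, prepending a service
   specification and an NMAH, or swapping a failed NMAH for that of a
   current provider, might be sketched in Python as follows.  The
   function name and its "scheme" parameter are hypothetical, and the
   sketch does not perform resolver discovery; it assumes the
   replacement NMAH is already known.

```python
def with_nmah(ark: str, nmah: str, scheme: str = "http") -> str:
    """Return an actionable URL for `ark` served by the given NMAH.

    Any NMAH already present on `ark` is discarded, reflecting its
    temporary and replaceable role.
    """
    idx = ark.find("ark:/")
    if idx < 0:
        raise ValueError("no 'ark:/' label in %r" % ark)
    return "%s://%s/%s" % (scheme, nmah, ark[idx:])


# Make a bare ARK actionable, then move it to another mapping authority.
url = with_nmah("ark:/12025/654xz321", "loc.gov")
moved = with_nmah(url, "rutgers.edu")
```

   Keeping the scheme as a parameter reflects the expectation that
   "http://" is itself a temporary way of making ARKs actionable.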

406	2.2.  The ARK Label Part - ark:

408	   The label part distinguishes an ARK from an ordinary identifier.  In
409	   a URL found in the wild, the string, "ark:/", indicates that the URL
410	   stands a reasonable chance of being an ARK.  If the context warrants,
411	   verification that it actually is an ARK can be done by testing it for
412	   existence of the three ARK services.

414	   Since nothing about an identifier syntax directly affects
415	   persistence, the "ark:" label (like "urn:", "doi:", and "hdl:")
416	   cannot tell you whether the identifier is persistent or whether the
417	   object is available.  It does tell you that the original Name
418	   Assigning Authority (NAA) had some sort of hopes for it, but it
419	   doesn't tell you whether that NAA is still in existence, or whether a
420	   decade ago it ceased to have any responsibility for providing
421	   persistence, or whether it ever had any responsibility beyond naming.

423	   Only a current provider can say for certain what sort of commitment
424	   it intends, and the ARK label suggests that you can query the NMAH
425	   directly to find out exactly what kind of persistence is promised.
426	   Even if what is promised is impersistence (i.e., a short-term
427	   identifier), saying so is valuable information to the recipient.
428	   Thus an ARK is a high-functioning identifier in the sense that it
429	   provides access to the object, the metadata, and a commitment
430	   statement, even if the commitment is explicitly very weak.

432	2.3.  The Name Assigning Authority Number (NAAN)

434	   Recalling that the general form of the ARK is,

436	                    [http://NMAH/]ark:/NAAN/Name[Qualifier]

438	   the part of the ARK directly following the "ark:" is the Name
439	   Assigning Authority Number (NAAN) enclosed in `/' (slash) characters.
440	   This part is always required, as it identifies the organization that
441	   originally assigned the Name of the object.  It is used to discover a
442	   currently valid NMAH and to provide top-level partitioning of the
443	   space of all ARKs.  NAANs are registered in a manner similar to URN
444	   Namespaces, but they are pure numbers consisting of 5 digits or 9
445	   digits.  Thus, the first 100,000 registered NAAs fit compactly into
446	   the 5 digits, and if growth warrants, the next billion fit into the 9
447	   digit form.  In either case the fixed odd number of digits helps
448	   reduce the chances of finding a NAAN out of context and confusing it
449	   with nearby quantities such as 4-digit dates.

451	   The NAAN designates a top-level ARK namespace.  Once registered for a
452	   namespace, a NAAN is never re-registered.  It is possible, however,
453	   for there to be a succession of organizations that manage an ARK
454	   namespace.

456	2.4.  The Name Part

458	   The part of the ARK just after the NAAN is the Name assigned by the
459	   NAA, and it is also required.  Semantic opaqueness in the Name part
460	   is strongly encouraged in order to reduce an ARK's vulnerability to
461	   era- and language-specific change.  Identifier strings containing
462	   linguistic fragments can create support difficulties down the road.
463	   No matter how appropriate or even meaningless they are today, such
464	   fragments may one day create confusion, give offense, or infringe on
465	   a trademark as the semantic environment around us and our communities
466	   evolves.

468	   Names that look more or less like numbers avoid common problems that
469	   defeat persistence and international acceptance.  The use of digits
470	   is highly recommended.  Mixing in non-vowel alphabetic characters a
471	   couple at a time is a relatively safe and easy way to achieve a
472	   denser namespace (more possible names for a given length of the name
473	   string).  Such names have a chance of aging and traveling well.
474	   Tools exist that mint, bind, and resolve opaque identifiers, with or
475	   without check characters [NOID].  More on naming considerations is
476	   given in a subsequent section.

478	2.5.  The Qualifier Part

480	   The part of the ARK following the NAA-assigned Name is an optional
481	   Qualifier.  It is a string that extends the base ARK in order to
482	   create a kind of service entry point into the object named by the
483	   NAA.  At the discretion of the providing NMA, such a service entry
484	   point permits an ARK to support access to individual hierarchical
485	   components and subcomponents of an object, and to variants (versions,
486	   languages, formats) of components.  A Qualifier may be invented by
487	   the NAA or by any NMA servicing the object.

489	   In form, the Qualifier is a ComponentPath, or a VariantPath, or a
490	   ComponentPath followed by a VariantPath.  A VariantPath is introduced
491	   and subdivided by the reserved character `.', and a ComponentPath is
492	   introduced and subdivided by the reserved character `/'.  In this
493	   example,

495	         http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff

497	   the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is
498	   a VariantPath.  The ARK Qualifier is a formalization of some
499	   currently mainstream URL syntax conventions.  This formalization
500	   specifically reserves meanings that permit recipients to make strong
501	   inferences about logical sub-object containment and equivalence based
502	   only on the form of the received identifiers; there is great
503	   efficiency in not having to inspect metadata records to discover such
504	   relationships.  NMAs are free not to disclose any of these
505	   relationships merely by avoiding the reserved characters above.
506	   Hierarchical components and variants are discussed further in the
507	   next two sections.
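
   As an illustration only, the general form and the Qualifier syntax
   described above can be sketched as a regular expression.  The names
   ARK_PATTERN and split_ark are hypothetical, and the sketch assumes
   that the Name ends at the first structural character, that the label
   is written in lower case, and that hyphens have already been removed;
   none of this is mandated by the specification.

```python
import re

# Assumed pattern for [http://NMAH/]ark:/NAAN/Name[Qualifier].
# It is an illustration, not a normative grammar.
ARK_PATTERN = re.compile(
    r"(?:http://(?P<nmah>[^/]+)/)?"      # optional NMAH (hostport)
    r"ark:/(?P<naan>\d{5}|\d{9})/"       # label, then a 5- or 9-digit NAAN
    r"(?P<name>[^/.]+)"                  # Name: up to the first '/' or '.'
    r"(?P<component_path>(?:/[^/.]+)*)"  # ComponentPath, subdivided by '/'
    r"(?P<variant_path>(?:\.[^/.]+)*)"   # VariantPath, subdivided by '.'
)


def split_ark(ark):
    """Split an ARK into NMAH, NAAN, Name, ComponentPath, and VariantPath."""
    match = ARK_PATTERN.fullmatch(ark)
    if match is None:
        raise ValueError("not in the assumed ARK form: " + ark)
    # Absent optional parts come back as None (NMAH) or "" (paths).
    return match.groupdict()
```

   Applied to the example above, the sketch reports "/s3/f8" as the
   ComponentPath and ".05v.tiff" as the VariantPath.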

509	   The Qualifier, if present, differs from the Name in several important
510	   respects.  First, a Qualifier may have been assigned either by the
511	   NAA or later by the NMA.  The assignment of a Qualifier by an NMA
512	   effectively amounts to an act of publishing a service entry point
513	   within the conceptual object originally named by the NAA.  For our
514	   purposes, an ARK extended with a Qualifier assigned by an NMA will be
515	   called an NMA-qualified ARK.

517	   Second, a Qualifier assignment on the part of an NMA is made in
518	   fulfillment of its service obligations and may reflect changing
519	   service expectations and technology requirements.  NMA-qualified ARKs
520	   could therefore be transient, even if the base, unqualified ARK is
521	   persistent.  For example, it would be reasonable for an NMA to
522	   support access to an image object through an actionable ARK that is
523	   considered persistent even if the experience of that access changes
524	   as linking, labeling, and presentation conventions evolve and as
525	   format and security standards are updated.  For an image "thumbnail",
526	   that NMA could also support an NMA-qualified ARK that is considered
527	   impersistent because the thumbnail will be replaced with higher
528	   resolution images as network bandwidth and CPU speeds increase.  At
529	   the same time, for an originally scanned, high-resolution master, the
530	   NMA could publish an NMA-qualified ARK that is itself considered
531	   persistent.  Of course, the NMA must be able to return its separate
532	   commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs,
533	   and to any NAA-qualified ARKs that it supports.

535	   A third difference between a Qualifier and a Name concerns the
536	   semantic opaqueness constraint.  When an NMA-qualified ARK is to be
537	   used as a transient service entry point into a persistent object, the
538	   priority given to semantic opaqueness observed by the NAA in the Name
539	   part may be relaxed by the NMA in the Qualifier part.  If service
540	   priorities in the Qualifier take precedence over persistence, short-
541	   term usability considerations may recommend somewhat semantically
542	   laden Qualifier strings.

544	   Finally, not only is the set of Qualifiers supported by an NMA
545	   mutable, but different NMAs may support different Qualifier sets for
546	   the same NAA-identified object.  In this regard the NMAs act
547	   independently of each other and of the NAA.

549	   The next two sections describe how ARK syntax may be used to declare,
550	   or to avoid declaring, certain kinds of relatedness among qualified
551	   ARKs.

553	2.5.1.  ARKs that Reveal Object Hierarchy

555	   An NAA or NMA may choose to reveal the presence of a hierarchical
556	   relationship between objects using the `/' (slash) character after
557	   the Name part of an ARK.  Some authorities will choose not to
558	   disclose this information, while others will go ahead and disclose so
559	   that manipulators of large sets of ARKs can infer object
560	   relationships by simple identifier inspection; for example, this
561	   makes it possible for a system to present a collapsed view of a large
562	   search result set.

564	   If the ARK contains an internal slash after the NAAN, the piece to
565	   its left indicates a containing object.  For example, publishing an
566	   ARK of the form,

568	                         ark:/12025/654/xz/321

570	   is equivalent to publishing three ARKs,

572	                         ark:/12025/654/xz/321
573	                         ark:/12025/654/xz
574	                         ark:/12025/654

576	   together with a declaration that the first object is contained in the
577	   second object, and that the second object is contained in the third.
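
   The inference just described can be sketched in a few lines.  The
   function name implied_containers is hypothetical, and the sketch
   assumes a normalized ARK with no NMAH and no variant suffixes.

```python
def implied_containers(ark):
    """List an ARK followed by the containing ARKs implied by its slashes.

    Assumes input of the form ark:/NAAN/part[/part...] with no NMAH.
    """
    # Split off the label and the NAAN; their slashes carry no hierarchy.
    label, naan, rest = ark.split("/", 2)
    parts = rest.split("/")
    # Drop one trailing component at a time, innermost object first.
    return ["/".join([label, naan] + parts[:n])
            for n in range(len(parts), 0, -1)]
```

   Each ARK in the returned list is contained in the one that follows
   it, so a client could use the list to collapse a large result set.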

579	   Revealing the presence of hierarchy is completely up to the assigner
580	   (NMA or NAA).  It is hard enough to commit to one object's name, let
581	   alone to three objects' names and to a specific, ongoing relatedness
582	   among them.  Thus, regardless of whether hierarchy was present
583	   initially, the assigner, by not using slashes, reveals no shared
584	   inferences about hierarchical or other inter-relatedness in the
585	   following ARKs:

587	                         ark:/12025/654_xz_321
588	                         ark:/12025/654_xz
589	                         ark:/12025/654xz321
590	                         ark:/12025/654xz
591	                         ark:/12025/654

593	   Note that slashes around the ARK's NAAN (/12025/ in these examples)
594	   are not part of the ARK's Name and therefore do not indicate the
595	   existence of some sort of NAAN super object containing all objects in
596	   its namespace.  A slash must have at least one non-structural
597	   character (one that is neither a slash nor a period) on both sides in
598	   order for it to separate recognizable structural components.  So
599	   initial or final slashes may be removed, and double slashes may be
600	   converted into single slashes.

602	2.5.2.  ARKs that Reveal Object Variants

604	   An NAA or NMA may choose to reveal the possible presence of variant
605	   objects or object components using the `.' (period) character after
606	   the Name part of an ARK.  Some authorities will choose not to
607	   disclose this information, while others will go ahead and disclose so
608	   that manipulators of large sets of ARKs can infer object
609	   relationships by simple identifier inspection; for example, this
610	   makes it possible for a system to present a collapsed view of a large
611	   search result set.

613	   If the ARK contains an internal period after Name, the piece to its
614	   left is a base name and the piece to its right, up to the end of
615	   the ARK or to the next period, is a suffix.  A Name may have more than
616	   one suffix, for example,
617	                         ark:/12025/654.24
618	                         ark:/12025/xz4/654.24
619	                         ark:/12025/654.20v.78g.f55

621	   There are two main rules.  First, if two ARKs share the same base
622	   name but have different suffixes, the corresponding objects were
623	   considered variants of each other (different formats, languages,
624	   versions, etc.) by the assigner (NMA or NAA).  Thus, the following
625	   ARKs are variants of each other:

627	                         ark:/12025/654.20v.78g.f55
628	                         ark:/12025/654.321xz
629	                         ark:/12025/654.44

631	   Second, publishing an ARK with a suffix implies the existence of at
632	   least one variant identified by the ARK without its suffix.  The ARK
633	   otherwise permits no further assumptions about what variants might
634	   exist.  So publishing the ARK,

636	                         ark:/12025/654.20v.78g.f55

638	   is equivalent to publishing the four ARKs,

640	                         ark:/12025/654.20v.78g.f55
641	                         ark:/12025/654.20v.78g
642	                         ark:/12025/654.20v
643	                         ark:/12025/654
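
   Both rules can be sketched as follows.  The function names are
   hypothetical, and the sketch assumes normalized ARKs with no NMAH
   (hostnames contain periods) and with periods appearing only in
   suffixes.

```python
def implied_variant_arks(ark):
    """List an ARK followed by the ARKs implied by dropping its suffixes."""
    base, *suffixes = ark.split(".")
    # Remove one trailing suffix at a time until only the base name is left.
    return [".".join([base] + suffixes[:n])
            for n in range(len(suffixes), -1, -1)]


def are_variants(ark_a, ark_b):
    """True if two ARKs share a base name but differ in their suffixes."""
    base_a, *suffixes_a = ark_a.split(".")
    base_b, *suffixes_b = ark_b.split(".")
    return base_a == base_b and suffixes_a != suffixes_b
```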

645	   Revealing the possibility of variants is completely up to the
646	   assigner.  It is hard enough to commit to one object's name, let
647	   alone to multiple variants' names and to a specific, ongoing
648	   relatedness among them.  The assigner is the sole arbiter of what
649	   constitutes a variant within its namespace, and whether to reveal
650	   that kind of relatedness by using periods within its names.

652	   A period must have at least one non-structural character (one that is
653	   neither a slash nor a period) on both sides in order for it to
654	   separate recognizable structural components.  So initial or final
655	   periods may be removed, and adjacent periods may be converted into a
656	   single period.  Multiple suffixes should be arranged in sorted order
657	   (pure ASCII collating sequence) at the end of an ARK.

659	2.6.  Character Repertoires

661	   The Name and Qualifier parts are strings of visible ASCII characters
662	   and should be less than 128 bytes in length.  The length restriction
663	   keeps the ARK short enough to append ordinary ARK request strings
664	   without running into transport restrictions (e.g., within HTTP GET
665	   requests).  Characters may be letters, digits, or any of these seven
666	   characters:

668	         =   #   *   +   @   _   $

670	   The following characters may also be used, but their meanings are
671	   reserved:

673	         %   -   .   /

675	   The characters `/' and `.' are ignored if either appears as the last
676	   character of an ARK.  If used internally, they allow a name assigner
677	   to reveal object hierarchy and object variants as previously
678	   described.

680	   Hyphens are considered to be insignificant and are always ignored in
681	   ARKs.  A `-' (hyphen) may appear in an ARK for readability, or it may
682	   have crept in during the formatting and wrapping of text, but it must
683	   be ignored in lexical comparisons.  As in a telephone number, hyphens
684	   have no meaning in an ARK.  It is always safe for an NMA that
685	   receives an ARK to remove any hyphens found in it.  As a result, like
686	   the NMAH, hyphens are "identity inert" in comparing ARKs for
687	   equivalence.  For example, the following ARKs are equivalent for
688	   purposes of comparison and ARK service access:

690	                                 ark:/12025/65-4-xz-321
691	         http://sneezy.dopey.com/ark:/12025/654--xz32-1
692	                                 ark:/12025/654xz321

694	   The `%' character is reserved for %-encoding all other octets that
695	   would appear in the ARK string, in the same manner as for URIs [URI].
696	   A %-encoded octet consists of a `%' followed by two hex digits; for
697	   example, "%7d" stands in for `}'.  Lower case hex digits are
698	   preferred to reduce the chances of false acronym recognition; thus it
699	   is better to use "%acT" instead of "%ACT".  The character `%' itself
700	   must be represented using "%25".  As with URNs, %-encoding permits
701	   ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) that have
702	   less restricted character repertoires [URNBIB].
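
   A minimal sketch of a repertoire check for the combined Name and
   Qualifier string follows.  The function name is hypothetical, and
   treating the recommended 128-byte limit as a hard limit is an
   assumption of the sketch rather than a requirement stated here.

```python
import re

# Letters, digits, the unreserved punctuation listed above, the reserved
# characters '-', '.', and '/', or a '%' followed by two hex digits.
_ARK_CHARS = re.compile(r"(?:[A-Za-z0-9=#*+@_$\-./]|%[0-9A-Fa-f]{2})+")


def is_valid_name_and_qualifier(text):
    """Check the repertoire and length of a Name plus Qualifier string."""
    # The repertoire is ASCII, so the character count equals the byte count.
    return len(text) < 128 and _ARK_CHARS.fullmatch(text) is not None
```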

704	2.7.  Normalization and Lexical Equivalence

706	   To determine if two or more ARKs identify the same object, the ARKs
707	   are compared for lexical equivalence after first being normalized.
708	   Since ARK strings may appear in various forms (e.g., having different
709	   NMAHs), normalizing them minimizes the chances that comparing two ARK
710	   strings for equality will fail unless they actually identify
711	   different objects.  In a specified-host ARK (one having an NMAH), the
712	   NMAH never participates in such comparisons.

714	   Normalization of an ARK for the purpose of octet-by-octet equality
715	   comparison with another ARK consists of four steps.  First, any upper
716	   case letters in the "ark:" label and the two characters following a
717	   `%' are converted to lower case.  The case of all other letters in
718	   the ARK string must be preserved.  Second, any NMAH part is removed
719	   (everything from an initial "http://" up to the next slash) and all
720	   hyphens are removed.

722	   Third, structural characters (slash and period) are normalized.
723	   Initial and final occurrences are removed, and two structural
724	   characters in a row (e.g., // or ./) are replaced by the first
725	   character, iterating until each occurrence has at least one non-
726	   structural character on each side.  Also in this step, if any
727	   component has a period on its left and a slash on its right, either
728	   the component and the preceding period must be moved to the end of
729	   the Name part or the ARK must be thrown out as malformed.

731	   The fourth and final step is to arrange the suffixes in ASCII
732	   collating sequence (that is, to sort them) and to remove duplicate
733	   suffixes, if any.  It is also permissible to throw out ARKs for which
734	   the suffixes are not sorted.

736	   The resulting ARK string is now normalized.  Comparisons between
737	   normalized ARKs are case-sensitive, meaning that upper case letters
738	   are considered different from their lower case counterparts.
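
   The four steps can be sketched as below.  This is one reading of the
   rules, not a reference implementation: the function name
   normalize_ark is hypothetical, the sketch takes the permitted option
   of rejecting an ARK that has a suffix ahead of a slash rather than
   moving the suffix, and it assumes the slashes around the NAAN are
   kept as written.

```python
import re


def normalize_ark(ark):
    """Return the normalized form of an ARK for octet-by-octet comparison."""
    # Step 2: remove any NMAH, then remove all hyphens.
    text = re.sub(r"^http://[^/]*/", "", ark).replace("-", "")
    # Step 1: the label is matched in any case and re-emitted in lower case.
    match = re.fullmatch(r"[aA][rR][kK]:/(\d{5}|\d{9})/(.*)", text)
    if match is None:
        raise ValueError("not an ARK: " + ark)
    naan, rest = match.groups()
    # Step 1, continued: lower-case the two characters after each '%'.
    rest = re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).lower(), rest)
    # Step 3: trim structural characters at the ends and collapse each
    # run of them to its first character.
    rest = re.sub(r"([/.])[/.]+", r"\1", rest.strip("/."))
    if not rest:
        raise ValueError("missing Name part: " + ark)
    if re.search(r"\.[^/.]+/", rest):
        raise ValueError("suffix precedes a slash: " + ark)
    # Step 4: sort the suffixes in ASCII order and drop duplicates.
    base, *suffixes = rest.split(".")
    return "ark:/" + naan + "/" + ".".join([base] + sorted(set(suffixes)))
```

   Two ARKs are then lexically equivalent when their normalized strings
   are equal; the comparison itself remains case-sensitive.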

740	   To keep ARK string variation to a minimum, no reserved ARK characters
741	   should be %-encoded unless it is deliberately to conceal their
742	   reserved meanings.  No non-reserved ARK characters should ever be
743	   %-encoded.  Finally, no %-encoded character should ever appear in an
744	   ARK in its decoded form.

746	3.  Naming Considerations

748	   The most important threats faced by persistence providers include
749	   such things as funding loss, natural disaster, political and social
750	   upheaval, processing faults, and errors in human oversight.  There is
751	   nothing that an identifier scheme can do about such things.  Still, a
752	   few observed identifier failures and inconveniences can be traced
753	   back to naming practices that we now know to be less than optimal for
754	   persistence.

756	3.1.  ARKs Embedded in Language

758	   The ARK has different goals from the URI, so it has different
759	   character set requirements.  Because linguistic constructs imperil
760	   persistence, for ARKs non-ASCII character support is unimportant.
761	   ARKs and URIs share goals of transcribability and transportability
762	   within web documents, so characters are required to be visible, non-
763	   conflicting with HTML/XML syntax, and not subject to tampering during
764	   transmission across common transport gateways.  Add the goal of
765	   making an undelimited ARK recognizable in running prose, as in
766	   ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma,
767	   period) end up being excluded from the ARK lest the end of a phrase
768	   or sentence be mistaken for part of the ARK.

770	   This consideration has more direct effect on ARK usability in a
771	   natural language context than it has on ARK persistence.  The same is
772	   true of the rule preventing hyphens from having lexical significance.
773	   It is fine to publish ARKs with hyphens in them (e.g., the
774	   output of UUID/GUID generators), but the uniform treatment of hyphens
775	   as insignificant reduces the possibility of users transcribing
776	   identifiers that will have been broken through unpredictable
777	   hyphenation by word processors.  Any measure that reduces user
778	   irritation with an identifier will increase its chances of survival.

780	3.2.  Objects Should Wear Their Identifiers

782	   A valuable technique for provision of persistent objects is to try to
783	   arrange for the complete identifier to appear on, with, or near its
784	   retrieved object.  An object encountered at a moment in time when its
785	   discovery context has long since disappeared could then easily be
786	   traced back to its metadata, to alternate versions, to updates, etc.
787	   This has seen reasonable success, for example, in book publishing and
788	   software distribution.  An identifier string only has meaning when
789	   its association is known, and this is a very sure, simple, and low-tech
790	   method of reminding everyone exactly what that association is.

792	3.3.  Names are Political, not Technological

794	   If persistence is the goal, a deliberate local strategy for
795	   systematic name assignment is crucial.  Names must be chosen with
796	   great care.  Poorly chosen and managed names will devastate any
797	   persistence strategy, and they do not discriminate by identifier
798	   scheme.  Whether a mistakenly re-assigned name is a URN, DOI, PURL,
799	   URL, or ARK, the damage - failed access and confusion - is not
800	   mitigated more in one scheme than in another.  Conversely, in-house
801	   efforts to manage names responsibly will go much further towards
802	   safeguarding persistence than any choice of naming scheme or name
803	   resolution technology.

805	   Branding (e.g., at the corporate or departmental level) is important
806	   for funding and visibility, but substrings representing brands and
807	   organizational names should be given a wide berth except when
808	   absolutely necessary in the hostname (the identity-inert) part of the
809	   ARK.  These substrings are not only unstable because organizations
810	   change frequently, but they are also dangerous because successor
811	   organizations often have political or legal reasons to actively
812	   suppress predecessor names and brands.  Any measure that reduces the
813	   chances of future political or legal pressure on an identifier will
814	   decrease the chances that our descendants will be obliged to
815	   deliberately break it.

817	3.4.  Choosing a Hostname or NMA

819	   Hostnames appearing in any identifier meant to be persistent must be
820	   chosen with extra care.  The tendency in hostname selection has
821	   traditionally been to choose a token with recognizable attributes,
822	   such as a corporate brand, but that tendency wreaks havoc with
823	   persistence that is supposed to outlive brands, corporations, subject
824	   classifications, and natural language semantics (e.g., what did the
825	   three letters "gay" mean in 1958, 1978, and 1998?).  Today's
826	   recognized and correct attributes are tomorrow's stale or incorrect
827	   attributes.  In making hostnames (any names, actually) long-term
828	   persistent, it helps to eliminate recognizable attributes to the
829	   extent possible.  This affects selection of any name based on URLs,
830	   including PURLs and the explicitly disposable NMAHs.

832	   There is no excuse for a provider that manages its internal names
833	   impeccably not to exercise the same care in choosing what could be an
834	   exceptionally durable hostname, especially if it would form the
835	   prefix for all the provider's URL-based external names.  Registering
836	   an opaque hostname in the ".org" or ".net" domain would not be a bad
837	   start.  Another way is to publish your ARKs with an organizational
838	   domain name that will be mapped by DNS to an appropriate NMA host.
839	   This makes for shorter names with less branding vulnerability.

841	   It is a mistake to think that hostnames are inherently unstable.  If
842	   you require brand visibility, that may be a fact of life.  But things
843	   are easier if yours is the brand of a long-lived cultural memory
844	   institution such as a national or university library or archive.
845	   Well-chosen hostnames from organizations that are sheltered from the
846	   direct effects of a volatile marketplace can easily provide longer-
847	   lived global resolvers than the domain names explicitly or implicitly
848	   used as starting points for global resolution by indirection-based
849	   persistent identifier schemes.  For example, it is hard to imagine
850	   circumstances under which the Library of Congress' domain name would
851	   disappear sooner than, say, "handle.net".

853	   For smaller libraries, archives, and preservation organizations,
854	   there is a natural concern about whether they will be able to keep
855	   their web servers and domain names in the face of uncertain funding.
856	   One option is to form or join a consortium [N2T] of like-minded
857	   organizations with the purpose of providing mutual preservation
858	   support.  The first goal of such a consortium would be to perpetually
859	   rent a hostname on which to establish a web server that simply
860	   redirects incoming member organization requests to the appropriate
861	   member server; using ARKs, for example, a 150-member consortium could
862	   run a very small server (24x7) that contained nothing more than 150
863	   rewrite rules in its configuration file.  Even more helpful would be
864	   additional consortial support for a member organization that was
865	   unable to continue providing services and needed to find a successor
866	   archival organization.  This would be a low-cost, low-tech way to
867	   publish ARKs (or URLs) under highly persistent hostnames.

869	   There are no obvious reasons why the organizations registering DNS
870	   names, URN Namespaces, and DOI publisher IDs should have among them
871	   one that is intrinsically more fallible than the next.  Moreover, it
872	   is a misconception that the demise of DNS and of HTTP need adversely
873	   affect the persistence of URLs.  At such a time, certainly URLs from
874	   the present day might not then be actionable by our present-day
875	   mechanisms, but resolution systems for future non-actionable URLs are
876	   no harder to imagine than resolution systems for present-day non-
877	   actionable URNs and DOIs.  There is no more stable a namespace than
878	   one that is dead and frozen, and that would then characterize the
879	   space of names bearing the "http://" prefix.  It is useful to
880	   remember that just because hostnames have been carelessly chosen in
881	   their brief history does not mean that they are unsuitable in NMAHs
882	   (and URLs) intended for use in situations demanding the highest level
883	   of persistence available in the Internet environment.  A well-planned
884	   name assignment strategy is everything.

886	3.5.  Assigners of ARKs

888	   A Name Assigning Authority (NAA) is an organization that creates (or
889	   delegates creation of) long-term associations between identifiers and
890	   information objects.  Examples of NAAs include national libraries,
891	   national archives, and publishers.  An NAA may arrange with an
892	   external organization for identifier assignment.  The US Library of
893	   Congress, for example, allows OCLC (the Online Computer Library
894	   Center, a major world cataloger of books) to create associations
895	   between Library of Congress call numbers (LCCNs) and the books that
896	   OCLC processes.  A cataloging record is generated that testifies to
897	   each association, and the identifier is included by the publisher,
898	   for example, in the front matter of a book.

900	   An NAA does not so much create an identifier as create an
901	   association.  The NAA first draws an unused identifier string from
902	   its namespace, which is the set of all identifiers under its control.
903	   It then records the assignment of the identifier to an information
904	   object having sundry witnessed characteristics, such as a particular
905	   author and modification date.  A namespace is usually reserved for an
906	   NAA by agreement with recognized community organizations (such as
907	   IANA and ISO) that all names containing a particular string be under
908	   its control.  In the ARK an NAA is represented by the Name Assigning
909	   Authority Number (NAAN).

911	   The ARK namespace reserved for an NAA is the set of names bearing its
912	   particular NAAN.  For example, all strings beginning with
913	   "ark:/12025/" are under control of the NAA registered under 12025,
914	   which might be the National Library of Finland.  Because each NAA has
915	   a different NAAN, names from one namespace cannot conflict with those
916	   from another.  Each NAA is free to assign names from its namespace
917	   (or delegate assignment) according to its own policies.  These
918	   policies must be documented in a manner similar to the declarations
919	   required for URN Namespace registration [URNNID].

921	   To register for a NAAN, please read about the mapping authority
922	   discovery file in the next section and send email to ark@cdlib.org.

924	3.6.  NAAN Namespace Management

926	   Every NAA must have a namespace management strategy.  A time-honored
927	   technique is to hierarchically partition a namespace into
928	   subnamespaces using prefixes that guarantee non-collision of names in
929	   different partitions.  This practice is strongly encouraged for all
930	   NAAs, especially when subnamespace management will be delegated to
931	   other departments, units, or projects within an organization.  For
932	   example, with a NAAN that is assigned to a university and managed by
933	   its main library, care should be taken to reserve semantically opaque
934	   prefixes that will set aside large parts of the unused namespace for
935	   future assignments.  Prefix-based partition management is an
936	   important responsibility of the NAA.

938	   This sort of delegation by prefix is well-used in the formation of
939	   DNS names and ISBN identifiers.  An important difference is that in
940	   the former, the hierarchy is deliberately exposed and in the latter
941	   it is hidden.  Rather than using lexical boundary markers such as the
942	   period (`.') found in domain names, the ISBN uses a publisher prefix
943	   but doesn't disclose where the prefix ends and the publisher's
944	   assigned name begins.  This practice of non-disclosure, borrowed from
945	   the ISBN and ISSN schemes, is encouraged in assigning ARKs, because
946	   it reduces the visibility of an assertion that is probably not
947	   important now and may become a vulnerability later.

949	   Reasonable prefixes for assigned names usually consist of consonants
950	   and digits and are 1-5 characters in length.  For example, the
951	   constant prefix "x9t" might be delegated to a book digitization
952	   project that creates identifiers such as

954	             http://444.berkeley.edu/ark:/28722/x9t38rk45c

956	   If longevity is the goal, it is important to keep the prefixes free
957	   of recognizable semantics; for example, using an acronym representing
958	   a project or a department is discouraged.  At the same time, you may
959	   wish to set aside a subnamespace for testing purposes under a prefix
960	   such as "fk..." that can serve as a visual clue and reminder to
961	   maintenance staff that this "fake" identifier was never published.

963	   There are other measures one can take to avoid user confusion,
964	   transcription errors, and the appearance of accidental semantics when
965	   creating identifiers.  If you are generating identifiers
966	   automatically, pure numeric identifiers are likely to be
967	   semantically opaque enough, but it's probably useful to avoid leading
968	   zeroes because some users mistakenly treat them as optional, thinking
969	   (arithmetically) that they don't contribute to the "value" of the
970	   identifier.

972	   If you need lots of identifiers and you don't want them to get too
973	   long, you can mix digits with consonants (but avoid vowels since they
974	   might accidentally spell words) to get more identifiers without
975	   increasing the string length.  In this case you may not want more
976	   than two letters in a row, because that limit reduces the chance of
977	   generating acronyms.  Generator tools such as [NOID] provide support
978	   for these sorts of identifiers, and can also add a computed check
979	   character as a guarantee against the most common transcription
980	   errors.
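
   The advice in this section might be sketched as follows.  This is
   not the algorithm of any particular tool: the function name
   mint_name and the choice of consonant alphabet are assumptions, no
   check character is computed, and a real minter would also have to
   record each name to guarantee that it is never assigned twice.

```python
import random

DIGITS = "0123456789"
# Assumed alphabet: lower-case consonants, also leaving out 'l' and 'y'.
CONSONANTS = "bcdfghjkmnpqrstvwxz"


def mint_name(prefix, length, rng):
    """Append `length` random characters to an opaque prefix.

    Never emits more than two letters in a row, counting any letters
    that end the prefix.
    """
    # Number of consecutive letters at the end of the prefix.
    run = len(prefix) - len(prefix.rstrip(CONSONANTS))
    chars = []
    for _ in range(length):
        # After two letters in a row, force a digit.
        pool = DIGITS if run >= 2 else DIGITS + CONSONANTS
        char = rng.choice(pool)
        run = run + 1 if char in CONSONANTS else 0
        chars.append(char)
    return prefix + "".join(chars)
```

   For example, mint_name("x9t", 7, random.SystemRandom()) yields a
   ten-character Name under the delegated prefix "x9t".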

982	3.7.  Sub-Object Naming

984	   As mentioned previously, semantically opaque identifiers are very
985	   useful for long-term naming of abstract objects; however, it may be
986	   appropriate to extend these names with less opaque extensions that
987	   reference contemporary service entry points (sub-objects) in support
988	   of the object.  Sub-object extensions beginning with a digit or
989	   underscore (`_') are reserved for the possibility of developing a
990	   future registry of canonical service points (e.g., numeric references
991	   to versions, formats, languages, etc).

993	4.  Finding a Name Mapping Authority

995	   In order to derive an actionable identifier (these days, a URL) from
996	   an ARK, a hostport (hostname or hostname plus port combination) for a
997	   working Name Mapping Authority (NMA) must be found.  An NMA is a
998	   service that is able to respond to the three basic ARK service
999	   requests.  Relying on registration and client-side discovery, NMAs
1000	   make known which NAAs' identifiers they are willing to service.

1002	   Upon encountering an ARK, a user (or client software) looks inside it
1003	   for the optional NMAH part (the hostport of the NMA's ARK service).
1004	   If it contains an NMAH that is working, this NMAH discovery step may
1005	   be skipped; the NMAH effectively uses the beginning of an ARK to
1006	   cache the results of a prior mapping authority discovery process.  If
1007	   a new NMAH needs to be found, the client looks inside the ARK again for
1008	   the NAAN (Name Assigning Authority Number).  Querying a global
1009	   database, it then uses the NAAN to look up all current NMAHs that
1010	   service ARKs issued by the identified NAA.  The global database is
1011	   key, and two specific methods for querying it are given in this
1012	   section.
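
	   The inspection step described above might be sketched as follows.
	   The regular expression assumes the shape
	   [http://NMAH/]ark:/NAAN/Name with an all-digit NAAN, as in the
	   examples in this document; the function names and the is_working
	   callback are hypothetical.

```python
import re

# Assumed shape: [http://NMAH/]ark:/NAAN/Name, with a numeric NAAN.
ARK_RE = re.compile(
    r"^(?:https?://(?P<nmah>[^/]+)/)?ark:/(?P<naan>[0-9]+)/(?P<name>.+)$"
)


def split_ark(ark):
    """Return (nmah, naan, name); nmah is None when the ARK carries no hostport."""
    m = ARK_RE.match(ark.strip())
    if m is None:
        raise ValueError("not a recognizable ARK: %r" % ark)
    return m.group("nmah"), m.group("naan"), m.group("name")


def needs_discovery(ark, is_working):
    """True when the NMAH is absent or the caller's probe says it is not working."""
    nmah, _naan, _name = split_ark(ark)
    return nmah is None or not is_working(nmah)
```

	   When needs_discovery() is true, the NAAN returned by split_ark() is
	   the key for the global database lookup.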

1014	   A third very promising method, called the Name-to-Thing [N2T]
1015	   Resolver, is being explored.  It is a low-cost, highly stable,
1016	   consortially maintained NMAH that simply exists to support actionable
1017	   HTTP-based URLs for as long as HTTP is used.  One of its big
1018	   advantages over the other two methods and the URN, Handle, DOI, and
1019	   PURL methods, is that N2T addresses the namespace splitting problem.
1020	   When objects maintained by one NMA are inherited by more than one
1021	   successor NMA, until now one of those successors would be required to
1022	   maintain forwarding tables on behalf of the other successors.

1024	   In the interests of long-term persistence, however, ARK mechanisms
1025	   are first defined in high-level, protocol-independent terms so that
1026	   mechanisms may evolve and be replaced over time without compromising
1027	   fundamental service objectives.  Either or both specific methods
1028	   given here may eventually be supplanted by better methods since, by
1029	   design, the ARK scheme does not depend on a particular method, but
1030	   only on having some method to locate an active NMAH.

1032	   At the time of issuance, at least one NMAH for an ARK should be
1033	   prepared to service it.  That NMA may or may not be administered by
1034	   the Name Assigning Authority (NAA) that created it.  Consider the
1035	   following hypothetical example of providing long-term access to a
1036	   cancer research journal.  The publisher wishes to turn a profit and
1037	   the National Library of Medicine wishes to preserve the scholarly
1038	   record.  An agreement might be struck whereby the publisher would act
1039	   as the NAA and the national library would archive the journal issue
1040	   when it appears, but without providing direct access for the first
1041	   six months.  During the first six months of peak commercial
1042	   viability, the publisher would retain exclusive delivery rights and
1043	   would charge access fees.  Again, by agreement, both the library and
1044	   the publisher would act as NMAs, but during that initial period the
1045	   library would redirect requests for issues less than six months old
1046	   to the publisher.  At the end of the waiting period, the library
1047	   would then begin servicing requests for issues older than six months
1048	   by tapping directly into its own archives.  Meanwhile, the publisher
1049	   might routinely redirect incoming requests for older issues to the
1050	   library.  Long-term access is thereby preserved, and so is the
1051	   commercial incentive to publish content.

1053	   Although it will be common for an NAA also to run an NMA service, it
1054	   is never a requirement.  Over time NAAs and NMAs will come and go.
1055	   One NMA will succeed another, and there might be many NMAs serving
1056	   the same ARKs simultaneously (e.g., as mirrors or as competitors).
1057	   There might also be asymmetric but coordinated NMAs as in the
1058	   library-publisher example above.

1060	4.1.  Looking Up NMAHs in a Globally Accessible File

1062	   This subsection describes a way to look up NMAHs using a simple name
1063	   authority table represented as a plain text file.  For efficient
1064	   access the file may be stored in a local filesystem, but it needs to
1065	   be reloaded periodically to incorporate updates.  It is not expected
1066	   that the size of the file or frequency of update should impose an
1067	   undue maintenance or searching burden any time soon, for even
1068	   a primitive linear search of a file with ten thousand NAAs is a
1069	   subsecond operation on modern server machines.  The proposed file
1070	   strategy is similar to the /etc/hosts file strategy that supported
1071	   Internet host address lookup for a period of years before the advent
1072	   of DNS.

1074	   The name authority table file is updated on an ongoing basis and is
1075	   available for copying over the Internet from the California Digital
1076	   Library at http://www.cdlib.org/inside/diglib/ark/natab and from a
1077	   number of mirror sites.  The file contains comment lines (lines that
1078	   begin with `#') explaining the format and giving the file's
1079	   modification time, reloading address, and NAA registration
1080	   instructions.  There is even a Perl script that processes the file
1081	   embedded in the file's comments.  As of February 2006, currently
1082	   registered Name Assigning Authorities are:

1084	        12025            National Library of Medicine
1085	        12026            Library of Congress
1086	        12027            National Agriculture Library
1087	        13030            California Digital Library
1088	        13038            World Intellectual Property Organization
1089	        20775            University of California San Diego
1090	        29114            University of California San Francisco
1091	        28722            University of California Berkeley
1092	        21198            University of California Los Angeles
1093	        15230            Rutgers University
1094	        13960            Internet Archive
1095	        64269            Digital Curation Centre
1096	        62624            New York University
1097	        67531            University of North Texas
1098	        27927            Ithaka Electronic-Archiving Initiative
1099	        12148            Bibliotheque nationale de France / National Library of France
1100	        78319            Google
1101	        88435            Princeton University
1102	        78428            University of Washington
1103	        89901            Archives of Region of Vastra Gotaland and City of Gothenburg, Sweden
1104	        80444            Northwest Digital Archives
1105	        25593            Emory University
1106	        25031            University of Kansas
1107	        17101            Centre for Ecology & Hydrology, UK

1109	   A snapshot of the name authority table file appears in an appendix.
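
	   As a rough sketch of the file-based lookup, the code below loads a
	   table into memory and looks up a NAAN.  The line layout it assumes
	   (a NAAN followed by one or more hostports, with `#' comment lines)
	   is hypothetical and simplified; the real file's format is explained
	   in its own comments.  The hostnames are placeholders.

```python
def load_table(lines):
    """Map each NAAN to its list of NMAHs, skipping comments and blank lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        naan, *nmahs = line.split()
        table.setdefault(naan, []).extend(nmahs)
    return table


# Hypothetical excerpt in the assumed layout; hostnames are placeholders.
SAMPLE = """\
# NAAN    NMAH [NMAH ...]
12025     ark.example.org
13030     ark.example.net:8080 mirror.example.com
""".splitlines()

table = load_table(SAMPLE)
nmahs = table.get("13030", [])  # empty list when the NAAN is unknown
```

	   A dictionary is used here for convenience; as the text notes, a
	   plain linear scan would also be fast enough at this scale.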

1111	4.2.  Looking up NMAHs Distributed via DNS

1113	   This subsection introduces a method for looking up NMAHs that is
1114	   based on the method for discovering URN resolvers described in
1115	   [NAPTR].  It relies on querying the DNS system already installed in
1116	   the background infrastructure of most networked computers.  A query
1117	   is submitted to DNS asking for a list of resolvers that match a given
1118	   NAAN.  DNS distributes the query to the particular DNS servers that
1119	   can best provide the answer, unless the answer can be found more
1120	   quickly in a local DNS cache as a side-effect of a recent query.
1121	   Responses come back inside Name Authority Pointer (NAPTR) records.
1122	   The normal result is one or more candidate NMAHs.

1124	   In its full generality the [NAPTR] algorithm ambitiously accommodates
1125	   a complex set of preferences, orderings, protocols, mapping services,
1126	   regular expression rewriting rules, and DNS record types.  This
1127	   subsection proposes a drastic simplification of it for the special
1128	   case of ARK mapping authority discovery.  The simplified algorithm is
1129	   called Maptr.  It uses only one DNS record type (NAPTR) and restricts
1130	   most of its field values to constants.  The following hypothetical
1131	   excerpt from a DNS data file for the NAAN known as 12026 shows three
1132	   example NAPTR records ready to use with the Maptr algorithm.

1134	       12026.ark.arpa.
1135	       ;; US Library of Congress
1136	       ;;       order pref flags service regexp  replacement
1137	        IN NAPTR  0     0   "h"  "ark"   "USLC"  lhc.nlm.nih.gov:8080
1138	        IN NAPTR  0     0   "h"  "ark"   "USLC"  foobar.zaf.org
1139	        IN NAPTR  0     0   "h"  "ark"   "USLC"  sneezy.dopey.com

1141	   All the fields are held constant for Maptr except for the "flags",
1142	   "regexp", and "replacement" fields.  The "service" field contains the
1143	   constant value "ark" so that NAPTR records participating in the Maptr
1144	   algorithm will not be confused with other NAPTR records.  The "order"
1145	   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
1146	   the algorithm may evolve to use these fields for ranking decisions
1147	   when usage patterns and local administrative needs are better
1148	   understood.

1150	   When a Maptr query returns a record with a flags field of "h" (for
1151	   hostport, a Maptr extension to the NAPTR flags), the replacement
1152	   field contains the NMAH (hostport) of an ARK service provider.  When
1153	   a query returns a record with a flags field of "" (the empty string),
1154	   the client needs to submit a new query containing the domain name
1155	   found in the replacement field.  This second sort of record exploits
1156	   the distributed nature of DNS by redirecting the query to another
1157	   domain name.  It looks like this.

1159	       12345.ark.arpa.
1160	       ;; Digital Library Consortium
1161	       ;;       order pref flags service regexp replacement
1162	        IN NAPTR  0     0    ""  "ark"     ""   dlc.spct.org.

1164	   Here is the Maptr algorithm for ARK mapping authority discovery.  In
1165	   it replace <NAAN> with the NAAN from the ARK for which an NMAH is
1166	   sought.

1168	        (1) Initialize the DNS query:  type=NAPTR,
1169	        query=<NAAN>.ark.arpa.

1171	        (2) Submit the query to DNS and retrieve (NAPTR) records,
1172	        discarding any record that does not have "ark" for the service
1173	        field.

1175	        (3) All remaining records with a flags field of "h" contain
1176	        candidate NMAHs in their replacement fields.  Set them aside, if
1177	        any.

1179	        (4) Any record with an empty flags field ("") has a replacement
1180	        field containing a new domain name to which a subsequent query
1181	        should be redirected.  For each such record, set
1182	        query=<replacement> then go to step (2).  When all such records
1183	        have been recursively exhausted, go to step (5).

1185	        (5) All redirected queries have been resolved and a set of
1186	        candidate NMAHs has been accumulated from step (3).  If there
1187	        are zero NMAHs, exit - no mapping authority was found.  If there
1188	        is one or more NMAH, choose one using any criteria you wish,
1189	        then exit.

1191	   A Perl script that implements this algorithm is included here.

1193	     #!/depot/bin/perl

1195	     use Net::DNS;                 # include simple DNS package
1196	     my $qtype = "NAPTR";               # initialize query type
1197	     my $naa = shift;              # get NAAN script argument
1198	     my $mad = Net::DNS::Resolver->new; # mapping authority discovery

1200	     maptr("$naa.ark.arpa");       # call maptr - that's it

1202	     sub maptr {                   # recursive maptr algorithm
1203	          my $dname = shift;       # domain name as argument
1204	          my ($rr, $flags,         # only the fields Maptr varies
1205	               $replacement);
1206	          my $query = $mad->query($dname, $qtype);
1207	          return                   # non-productive query
1208	               if (! $query || ! $query->answer);
1209	          foreach $rr ($query->answer) {
1210	               next           # skip wrong type or non-ARK service
1211	                    if ($rr->type ne $qtype || $rr->service ne "ark");
1212	               ($flags, $replacement) =   # accessors return the
1213	                    ($rr->flags, $rr->replacement); # unquoted values
1214	               if ($flags eq "") {
1215	                    maptr($replacement);     # recurse
1216	               } elsif ($flags eq "h") {
1217	                    print "$replacement\n";  # candidate NMAH
1218	               }
1219	          }
1220	     }
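
	   For experimenting with the algorithm without a live DNS server, the
	   five steps can also be sketched against an in-memory table.  This is
	   an illustration only: the lookup callback, the tuple layout
	   (flags, service, replacement), the sample zone, and the guard
	   against redirect loops are all assumptions that are not part of the
	   algorithm as stated.

```python
def maptr(naan, lookup, suffix="ark.arpa"):
    """Return the candidate NMAHs for a NAAN.

    lookup(domain) stands in for a DNS NAPTR query and must return an
    iterable of (flags, service, replacement) tuples.
    """
    candidates = []
    seen = set()
    pending = ["%s.%s" % (naan, suffix)]           # step (1)
    while pending:
        domain = pending.pop().rstrip(".").lower()
        if domain in seen:                         # added guard: redirect loops
            continue
        seen.add(domain)
        for flags, service, replacement in lookup(domain):   # step (2)
            if service != "ark":
                continue                           # discard non-ARK records
            if flags == "h":                       # step (3): candidate NMAH
                candidates.append(replacement)
            elif flags == "":                      # step (4): follow redirect
                pending.append(replacement)
    return candidates                              # step (5): caller chooses one


# Hypothetical zone data; the final hostport is a placeholder.
ZONE = {
    "12345.ark.arpa": [("", "ark", "dlc.spct.org.")],
    "dlc.spct.org": [
        ("h", "ark", "ark.example.org:8080"),
        ("h", "other", "ignored.example.org"),
    ],
}


def zone_lookup(domain):
    return ZONE.get(domain, [])
```

	   With this table, maptr("12345", zone_lookup) follows the redirect to
	   dlc.spct.org and returns the single ARK hostport found there.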

1222	   The global database thus distributed via DNS and the Maptr algorithm
1223	   can easily be seen to mirror the contents of the Name Authority Table
1224	   file described in the previous section.

1226	5.  Generic ARK Service Definition

1228	   An ARK request's output is delivered information; examples include
1229	   the object itself, a policy declaration (e.g., a promise of support),
1230	   a descriptive metadata record, or an error message.  The experience
1231	   of object delivery is expected to be an evolving mix of information
1232	   that reflects changing service expectations and technology
1233	   requirements; contemporary examples include such things as an object
1234	   summary and component links formatted for human consumption.  ARK
1235	   services must be couched in high-level, protocol-independent terms if
1236	   persistence is to outlive today's networking infrastructural
1237	   assumptions.  The high-level ARK service definitions listed below are
1238	   followed in the next section by a concrete method (one of many
1239	   possible methods) for delivering these services with today's
1240	   technology.

1242	5.1.  Generic ARK Access Service (access, location)

1244	   Returns (a copy of) the object or a redirect to the same, although a
1245	   sensible object proxy may be substituted.  Examples of sensible
1246	   substitutes include,

1248	     - a table of contents instead of a large complex document,
1249	     - a home page instead of an entire web site hierarchy,
1250	     - a rights clearance challenge before accessing protected data,
1251	     - directions for access to an offline object (e.g., a book),
1252	     - a description of an intangible object (a disease, an event), or
1253	     - an applet acting as "player" for a large multimedia object.

1255	   May also return a discriminated list of alternate object locators.
1256	   If access is denied, returns an explanation of the object's current
1257	   (perhaps permanent) inaccessibility.

1259	5.2.  Generic Policy Service (permanence, naming, etc.)

1261	   Returns declarations of policy and support commitments for given
1262	   ARKs.  Declarations are returned in either a structured metadata
1263	   format or a human readable text format; sometimes one format may
1264	   serve both purposes.  Policy subareas may be addressed in separate
1265	   requests, but the following areas should be covered:  object
1266	   permanence, object naming, object fragment addressing, and
1267	   operational service support.

1269	   The permanence declaration for an object is a rating defined with
1270	   respect to an identified permanence provider (guarantor), which will
1271	   be the NMA.  It may include the following aspects.

1273	        (a) "object availability" - whether and how access to the object
1274	        is supported (e.g., online 24x7, or offline only),

1276	        (b) "identifier validity" - under what conditions the identifier
1277	        will be or has been re-assigned,

1279	        (c) "content invariance" - under what conditions the content of
1280	        the object is subject to change, and

1282	        (d) "change history" - access to corrections, migrations, and
1283	        revisions, whether through links to the changed objects
1284	        themselves or through a document summarizing the change history.

1286	   One approach to a permanence rating framework, conceived
1287	   independently from ARKs, is given in [NLMPerm].  Under ongoing
1288	   development and limited deployment at the US National Library of
1289	   Medicine, it identifies the following "permanence levels":

1291	        Not Guaranteed: No commitment has been made to retain this
1292	        resource.  It could become unavailable at any time.  Its
1293	        identifier could be changed.

1295	        Permanent: Dynamic Content: A commitment has been made to keep
1296	        this resource permanently available.  Its identifier will always
1297	        provide access to the resource.  Its content could be revised or
1298	        replaced.

1300	        Permanent: Stable Content: A commitment has been made to keep
1301	        this resource permanently available.  Its identifier will always
1302	        provide access to the resource.  Its content is subject only to
1303	        minor corrections or additions.

1305	        Permanent: Unchanging Content: A commitment has been made to
1306	        keep this resource permanently available.  Its identifier will
1307	        always provide access to the resource.  Its content will not
1308	        change.

1310	   Naming policy for an object includes an historical description of the
1311	   NAA's (and its successor NAA's) policies regarding differentiation of
1312	   objects.  Since it is the NMA that responds to requests for policy
1313	   statements, it is useful for the NMA to be able to produce or
1314	   summarize these historical NAA documents.  Naming policy may include
1315	   the following aspects.

1317	        (i) "similarity" - (or "unity") the limit, defined by the NAA,
1318	        to the level of dissimilarity beyond which two similar objects
1319	        warrant separate identifiers but before which they share one
1320	        single identifier, and

1322	        (ii) "granularity" - the limit, defined by the NAA, to the level
1323	        of object subdivision beyond which sub-objects do not warrant
1324	        separately assigned identifiers but before which sub-objects are
1325	        assigned separate identifiers.

1327	   Subnaming policy for an object describes the qualifiers that the NMA,
1328	   in fulfilling its ongoing and evolving service obligations, allows as
1329	   extensions to an NAA-assigned ARK.  To the conceptual object that the
1330	   NAA named with an ARK, the NMA may add component access points and
1331	   derivatives (e.g., format migrations in aid of preservation) in order
1332	   to provide both basic and value-added services.

1334	   Addressing policy for an object includes a description of how, during
1335	   access, object components (e.g., paragraphs, sections) or views
1336	   (e.g., image conversions) may or may not be "addressed", in other
1337	   words, how the NMA permits arguments or parameters to modify the
1338	   object delivered as the result of an ARK request.  If supported,
1339	   these sorts of operations would provide things like byte-ranged
1340	   fragment delivery and open-ended format conversions, or any set of
1341	   possible transformations that would be too numerous to list or to
1342	   identify with separately assigned ARKs.

1344	   Operational service support policy includes a description of general
1345	   operational aspects of the NMA service, such as after-hours staffing
1346	   and trouble reporting procedures.

1348	5.3.  Generic Description Service

1350	   Returns a description of the object.  Descriptions are returned in
1351	   either a structured metadata format or a human readable text format;
1352	   sometimes one format may serve both purposes.  A description must at
1353	   a minimum answer the who, what, when, and where questions concerning
1354	   an expression of the object.  Standalone descriptions should be
1355	   accompanied by the modification date and source of the description
1356	   itself.  May also return discriminated lists of ARKs that are related
1357	   to the given ARK.

1359	6.  Overview of The HTTP URL Mapping Protocol (THUMP)

1361	   The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (a
1362	   kind of identifier) and asking such questions as, what information
1363	   does this identify and how permanent is it?  [THUMP] is in fact one
1364	   specific method under development for delivering ARK services.  The
1365	   protocol runs over HTTP to exploit the web browser's current pre-
1366	   eminence as user interface to the Internet.  THUMP is designed so
1367	   that a person can enter ARK requests directly into the location field
1368	   of current browser interfaces.  Because it runs over HTTP, THUMP can
1369	   be simulated and tested within keyboard-based [TELNET] sessions.

1371	   The asker (a person or client program) starts with an identifier,
1372	   such as an ARK or a URL.  The identifier reveals to the asker (or
1373	   allows the asker to infer) the Internet host name and port number of
1374	   a server system that responds to questions.  Here, this is just the
1375	   NMAH that is obtained by inspection and possibly lookup based on the
1376	   ARK's NAAN.  The asker then sets up an HTTP session with the server
1377	   system, sends a question via a THUMP request (contained within an
1378	   HTTP request), receives an answer via a THUMP response (contained
1379	   within an HTTP response), and closes the session.  That concludes the
1380	   connected portion of the protocol.

1382	   A THUMP request is a string of characters beginning with a `?'
1383	   (question mark) that is appended to the identifier string.  The
1384	   resulting string is sent as an argument to HTTP's GET command.
1385	   Request strings too long for GET may be sent using HTTP's POST
1386	   command.  The three most common requests correspond to three
1387	   degenerate special cases that keep the user's learning and typing
1388	   burden low.  First, a simple key with no request at all is the same
1389	   as an ordinary access request.  Thus a plain ARK entered into a
1390	   browser's location field behaves much like a plain URL, and returns
1391	   access to the primary identified object, for instance, an HTML
1392	   document.

1394	   The second special case is a minimal ARK description request string
1395	   consisting of just "?".  For example, entering the string,

1397	             ark.nlm.nih.gov/ark:/12025/psbbantu?

1399	   into the browser's location field directly precipitates a request for
1400	   a metadata record describing the object identified by
1401	   ark:/12025/psbbantu.  The browser, unaware of THUMP, prepares and
1402	   sends an HTTP GET request in the same manner as for a URL.  THUMP is
1403	   designed so that the response (indicated by the returned HTTP content
1404	   type) is normally displayed, whether the output is structured for
1405	   machine processing (text/plain) or formatted for human consumption
1406	   (text/html).

1408	   In the following example THUMP session, each line has been annotated
1409	   to include a line number and whether it was the client or server that
1410	   sent it.  Without going into much depth, the session has three pieces
1411	   separated from each other by blank lines:  the client's piece (lines
1412	   1-3), the server's HTTP/THUMP response headers (4-7), and the body of
1413	   the server's response (8-17).  The first and last lines (1 and 17)
1414	   correspond to the client's steps to start the TCP session and the
1415	   server's steps to end it, respectively.

1417	      1  C: [opens session]
1418	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
1419	         C:
1420	         S: HTTP/1.1 200 OK
1421	      5  S: Content-Type: text/plain
1422	         S: THUMP-Status: 0.1 200 OK
1423	         S:
1424	         S: |set: NLM | 12025/psbbantu? | 20030731
1425	         S:         | http://ark.nlm.nih.gov/ark:/12025/psbbantu?
1426	     10  S: here: 1 | 1 | 1
1427	         S:
1428	         S: erc:
1429	         S: who:    Lederberg, Joshua
1430	         S: what:   Studies of Human Families for Genetic Linkage
1431	     15  S: when:   1974
1432	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1433	         S: [closes session]

1435	   The first two server response lines (4-5) above are typical of HTTP.
1436	   The next line (6) is peculiar to THUMP, and indicates the THUMP
1437	   version and a normal return status.  The balance of the response
1438	   consists of a record set header (lines 8-10) and a single metadata
1439	   record (12-16) that comprises the ARK description service response.
1440	   The record set header identifies (8-9) who created the set, what its
1441	   title is, when it was created, and where an automated process can
1442	   access the set; it ends in a line (10) whose respective sub-elements
1443	   indicate that here in this communication the recipient can expect to
1444	   find 1 record, starting at the record numbered 1, from a set
1445	   consisting of a total of 1 record (i.e., here is the entire set,
1446	   consisting of exactly one record).
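
	   The final line of the record set header could be read with a few
	   lines of code.  The sketch below assumes only what the paragraph
	   above states about the order of the three sub-elements; the
	   dictionary keys and function names are made up for illustration.

```python
def parse_here(line):
    """Split a 'here:' line into its three numeric sub-elements."""
    label, _, value = line.partition(":")
    if label.strip() != "here":
        raise ValueError("not a 'here' line: %r" % line)
    count, start, total = (int(part) for part in value.split("|"))
    return {"count": count, "start": start, "total": total}


def is_entire_set(here):
    """True when this communication carries every record in the set."""
    return here["start"] == 1 and here["count"] == here["total"]
```

	   For the session above, parse_here("here: 1 | 1 | 1") describes one
	   record, starting at record 1, out of a total of 1.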

1448	   The returned record (12-16) is in the format of an Electronic
1449	   Resource Citation [ERC], which is discussed in more detail in the
1450	   next section.  For now, note that it contains four elements that
1451	   answer the top priority questions regarding an expression of the
1452	   object:  who played a major role in expressing it, what the
1453	   expression was called, when it was created, and where the expression
1454	   may be found.  This quartet of elements comes up again and again in
1455	   ERCs.

1457	   The third degenerate special case of an ARK request (and no other
1458	   cases will be described in this document) is the string "??",
1459	   corresponding to a minimal permanence policy request.  It can be seen
1460	   in use appended to an ARK (on line 2) in the example session that
1461	   follows.

1463	      1  C: [opens session]
1464	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
1465	         C:
1466	         S: HTTP/1.1 200 OK
1467	      5  S: Content-Type: text/plain
1468	         S: THUMP-Status: 0.1 200 OK
1469	         S:
1470	         S: |set: NLM | 12025/psbbantu?? | 20030731
1471	         S:         | http://ark.nlm.nih.gov/ark:/12025/psbbantu??
1472	     10  S: here: 1 | 1 | 1
1473	         S:
1474	         S: erc:
1475	         S: who:    Lederberg, Joshua
1476	         S: what:   Studies of Human Families for Genetic Linkage
1477	     15  S: when:   1974
1478	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1479	         S: erc-support:
1480	         S: who:    USNLM
1481	         S: what:   Permanent, Unchanging Content
1482	     20  S: when:   20010421
1483	         S: where:  http://ark.nlm.nih.gov/yy22948
1484	         S: [closes session]

1486	   Again, a single metadata record (lines 12-21) is returned, but it
1487	   consists of two segments.  The first segment (12-16) gives the same
1488	   basic citation information as in the previous example.  It is
1489	   returned in order to establish context for the persistence
1490	   declaration in the second segment (17-21).

1492	   Each segment in an ERC tells a different story relating to the
1493	   object, so although the same four questions (elements) appear in
1494	   each, the answers depend on the segment's story type.  While the
1495	   first segment tells the story of an expression of the object, the
1496	   second segment tells the story of the support commitment made to it:
1497	   who made the commitment, what the nature of the commitment was, when
1498	   it was made, and where a fuller explanation of the commitment may be
1499	   found.
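
	   The three degenerate request forms described in this section differ
	   only in the suffix appended to the identifier.  A minimal sketch,
	   in which the function name and the request labels "access",
	   "description", and "policy" are illustrative choices rather than
	   THUMP vocabulary:

```python
SUFFIX = {
    "access": "",        # plain key: ordinary access request
    "description": "?",  # minimal description request
    "policy": "??",      # minimal permanence policy request
}


def thump_url(nmah, ark, request="access"):
    """Build the URL sent with HTTP GET for one of the three basic requests."""
    return "http://%s/%s%s" % (nmah, ark, SUFFIX[request])
```

	   For example, thump_url("ark.nlm.nih.gov", "ark:/12025/psbbantu",
	   "policy") yields the URL used on line 2 of the second session above.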

1501	7.  Overview of Electronic Resource Citations (ERCs)

1503	   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
1504	   simple, compact, and printable record designed to hold data
1505	   associated with an information resource.  By design, the ERC is a
1506	   metadata format that balances the needs for expressive power, very
1507	   simple machine processing, and direct human manipulation.

1509	   A founding principle of the ERC is that direct human contact with
1510	   metadata will be a necessary and sufficient condition for the near
1511	   term rapid development of metadata standards, systems, and services.
1512	   Thus the machine-processable ERC format must only minimally strain
1513	   people's ability to read, understand, change, and transmit ERCs
1514	   without their relying on intermediation with specialized software
1515	   tools.  The basic ERC needs to be succinct, transparent, and
1516	   trivially parseable by software.

1518	   In the current Internet, it is natural to seriously consider using
1519	   XML as an exchange format because of predictions that it will obviate
1520	   many ad hoc formats and programs, and unify much of the world's
1521	   information under one reliable data structuring discipline that is
1522	   easy to generate, verify, parse, and render.  It appears, however,
1523	   that XML is still only catching on after years of standards work and
1524	   implementation experience.  The reasons for this are unclear, but for
1525	   now very simple XML interpretation is still out of reach.  Another
1526	   important caution is that XML structures are hard on the eyeballs,
1527	   taking up an amount of display (and page) space that significantly
1528	   exceeds that of traditional formats.  Until these conflicts with ERC
1529	   principle are resolved, XML is not a first choice for representing
1530	   ERCs.  Borrowing instead from the data structuring format that
1531	   underlies the successful spread of email and web services, the first
1532	   ERC format uses [ANVL], which is based on email and HTTP headers
1533	   [RFC822].  There is a naturalness to ANVL's label-colon-value format
1534	   (seen in the previous section) that barely needs explanation to a
1535	   person beginning to enter ERC metadata.

1537	   Besides simplicity of ERC system implementation and data entry
1538	   mechanics, ERC semantics (what the record and its constituent parts
1539	   mean) must also be easy to explain.  ERC semantics are based on a
1540	   reformulation and extension of the Dublin Core [DCORE] hypothesis,
1541	   which suggests that the fifteen Dublin Core metadata elements have a
1542	   key role to play in cross-domain resource description.  The ERC
1543	   design recognizes that the Dublin Core's primary contribution is the
1544	   international, interdisciplinary consensus that identified fifteen
1545	   semantic buckets (element categories), regardless of how they are
1546	   labeled.  The ERC then adds a definition for a record and some
1547	   minimal compliance rules.  In pursuing the limits of simplicity, the
1548	   ERC design combines and relabels some Dublin Core buckets to isolate
1549	   a tiny kernel (subset) of four elements for basic cross-domain
1550	   resource description.

1552	   For the cross-domain kernel, the ERC uses the four basic elements -
1553	   who, what, when, and where - to pretend that every object in the
1554	   universe can have a uniform minimal description.  Each has a name or
1555	   other identifier, a location, some responsible person or party, and a
1556	   date.  It doesn't matter what type of object it is, or whether one
1557	   plans to read it, interact with it, smoke it, wear it, or navigate
1558	   it.  Of course, this approach is flawed because uniformity of
1559	   description for some object types requires more semantic contortion
1560	   and sacrifice than for others.  That is why at the beginning of this
1561	   document, the ARK was said to be suited to objects that accommodate
1562	   reasonably regular electronic description.

1564	   While insisting on uniformity at the most basic level provides
1565	   powerful cross-domain leverage, the semantic sacrifice is great for
1566	   many applications.  So the ERC also permits a semantically rich and
1567	   nuanced description to co-exist in a record along with a basic
1568	   description.  In that way both sophisticated and naive recipients of
1569	   the record can extract the level of meaning from it that best suits
1570	   their needs and abilities.  Key to unlocking the richer description
1571	   is a controlled vocabulary of ERC record types (not explained in this
1572	   document) that permit knowledgeable recipients to apply defined sets
1573	   of additional assumptions to the record.

1575	7.1.  ERC Syntax

1577	   An ERC record is a sequence of metadata elements ending in a blank
1578	   line.  An element consists of a label, a colon, and an optional
1579	   value.  Here is an example of a record with five elements.

1581	          erc:
1582	          who: Gibbon, Edward
1583	          what: The Decline and Fall of the Roman Empire
1584	          when: 1781
1585	          where: http://www.ccel.org/g/gibbon/decline/

1587	   A long value may be folded (continued) onto the next line by
1588	   inserting a newline and indenting the next line.  A value can be thus
1589	   folded across multiple lines.  Here are two example elements, each
1590	   folded across four lines.

1592	          who/created: University of California, San Francisco, AIDS
1593	               Program at San Francisco General Hospital | University
1594	               of California, San Francisco, Center for AIDS Prevention
1595	               Studies
1596	          what/Topic:
1597	                Heart Attack | Heart Failure
1598	               | Heart
1599	                                Diseases

1601	   An element value folded across several lines is treated as if the
1602	   lines were joined together on one long line.  For example, the second
1603	   element from the previous example is considered equivalent to

1605	          what/Topic: Heart Attack | Heart Failure | Heart Diseases

1607	   An element value may contain multiple values, each one separated from
1608	   the next by a `|' (pipe) character.  The element from the previous
1609	   example contains three values.
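The folding and multiple-value rules can be sketched in a few lines of Python. This is an illustration only and not part of the ERC definition: the names unfold and split_values are hypothetical, and the sketch assumes that a continuation line is joined to its element with a single space and that whitespace around values is not significant.

```python
def unfold(lines):
    """Join each indented continuation line onto the element before it."""
    elements = []
    for line in lines:
        if not line.strip():
            break  # a blank line ends the record
        if line[0] in " \t" and elements:
            # Assumption: one space replaces the newline and indentation.
            elements[-1] += " " + line.strip()
        else:
            elements.append(line.strip())
    return elements


def split_values(element):
    """Split 'label: v1 | v2' into its label and a list of values."""
    label, _, value = element.partition(":")
    return label.strip(), [v.strip() for v in value.split("|")]


folded = [
    "what/Topic:",
    "      Heart Attack | Heart Failure",
    "     | Heart",
    "                      Diseases",
]
joined = unfold(folded)
print(joined[0])
print(split_values(joined[0]))
```

Splitting at the first colon only keeps values such as URLs, which contain further colons, intact.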

1611	   For annotation purposes, any line beginning with a `#' (hash)
1612	   character is treated as if it were not present; this is a "comment"
1613	   line (a feature not available in email or HTTP headers).  For
1614	   example, the following element is spread across four lines and
1615	   contains two values:

1617	          what/Topic:
1618	               Heart Attack
1619	          #    | Heart Failure  -- hold off until next review cycle
1620	               | Heart Diseases
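A minimal Python sketch of comment handling follows. It assumes that a comment is a line whose very first character is `#' and that comments are removed before folded lines are joined; the name strip_comments is hypothetical.

```python
def strip_comments(lines):
    """Remove every line whose first character is '#'."""
    return [line for line in lines if not line.startswith("#")]


element = [
    "what/Topic:",
    "     Heart Attack",
    "#    | Heart Failure  -- hold off until next review cycle",
    "     | Heart Diseases",
]
kept = strip_comments(element)

# Join the remaining lines and split the value on the pipe character.
value = " ".join(line.strip() for line in kept).partition(":")[2]
values = [v.strip() for v in value.split("|")]
print(values)   # ['Heart Attack', 'Heart Diseases']
```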

1622	7.2.  ERC Stories

1624	   An ERC record is organized into one or more distinct segments, where
1625	   each segment tells a story about a different aspect of the
1626	   information resource.  A segment boundary occurs whenever a segment
1627	   label (an element beginning with "erc") is encountered.  The basic
1628	   label "erc:" introduces the story of an object's expression (e.g.,
1629	   its publication, installation, or performance).  The label "erc-
1630	   about:" introduces the story of an object's content (what it is
1631	   about) and "erc-support:" introduces the story of a support
1632	   commitment made to it.  A story segment that concerns the ERC itself
1633	   is introduced by the label "erc-from:".  It is an important segment
1634	   that tells the story of the ERC's provenance.  Elements beginning
1635	   with "erc" are reserved for segment labels and their associated story
1636	   types.  From an earlier example, here is an ERC with two segments.

1638	         erc:
1639	         who:    Lederberg, Joshua
1640	         what:   Studies of Human Families for Genetic Linkage
1641	         when:   1974
1642	         where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1643	         erc-support:
1644	         who:    NIH/NLM/LHNCBC
1645	         what:   Permanent, Unchanging Content
1646	         # Note to ops staff:  date needs verification.
1647	         when:   2001 04 21
1648	         where:  http://ark.nlm.nih.gov/yy22948

1650	   Segment stories are told according to journalistic tradition.  While
1651	   any number of pertinent elements may appear in a segment, priority is
1652	   placed on answering the questions who, what, when, and where at the
1653	   beginning of each segment so that readers can make the most important
1654	   selection or rejection decisions as soon as possible.  To make things
1655	   simple, the listed ordering of the questions is maintained in each
1656	   segment (as it happens, most people who have been exposed to this
1657	   storytelling technique are already familiar with the above
1658	   ordering).

1660	   The four questions are answered by using corresponding element
1661	   labels.  The four element labels can be re-used in each story
1662	   segment, but their meaning changes depending on the segment (the
1663	   story type) in which they appear.  In the example above, "who" is
1664	   first used to name a document's author and subsequently used to name
1665	   the permanence guarantor (provider).  Similarly, "when" first lists
1666	   the date of object creation and in the next segment lists the date of
1667	   a commitment decision.  Four labels appearing across three segments
1668	   effectively map to twelve semantically distinct elements.  Distinct
1669	   element meanings are mapped to Dublin Core elements in a later
1670	   section.
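The way a segment label changes the meaning of the four basic labels can be sketched in Python by keying each element on its segment. This is a hedged illustration: the names split_segments and distinct_elements are hypothetical, the input is assumed to be already parsed into (label, value) pairs, and a value carried on the segment label itself is ignored.

```python
def split_segments(elements):
    """Group (label, value) pairs under the segment label preceding them."""
    segments = []
    for label, value in elements:
        if label.startswith("erc"):
            segments.append((label, []))   # a segment label opens a story
        elif segments:
            segments[-1][1].append((label, value))
        else:
            raise ValueError("element appears before any segment label")
    return segments


def distinct_elements(segments):
    """Key each element by (segment, label) so reused labels stay distinct."""
    return {(seg, label): value
            for seg, pairs in segments
            for label, value in pairs}


record = [
    ("erc", ""),
    ("who", "Lederberg, Joshua"),
    ("what", "Studies of Human Families for Genetic Linkage"),
    ("when", "1974"),
    ("where", "http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf"),
    ("erc-support", ""),
    ("who", "NIH/NLM/LHNCBC"),
    ("what", "Permanent, Unchanging Content"),
    ("when", "2001 04 21"),
    ("where", "http://ark.nlm.nih.gov/yy22948"),
]
elements = distinct_elements(split_segments(record))
print(elements[("erc", "who")])
print(elements[("erc-support", "who")])
```

In this two-segment record the four labels yield eight distinct keys, in the same way that three segments would yield twelve.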

1672	7.3.  The ERC Anchoring Story

1674	   Each ERC contains an anchoring story.  It is usually the first
1675	   segment labeled "erc:" and it concerns an "anchoring" expression of
1676	   the object.  An "anchoring" expression is the one that a provider
1677	   deemed the most suitable basic referent given the audience and
1678	   application for which it produced the ERC.  If it sounds like the
1679	   provider has great latitude in choosing its anchoring expression, it
1680	   is because it does.  A typical anchoring story in an ERC for a born-
1681	   digital document would be the story of the document's release on a
1682	   web site; such a document would then be the anchoring expression.

1684	   An anchoring story need not be the central descriptive goal of an ERC
1685	   record.  For example, a museum provider may create an ERC for a
1686	   digitized photograph of a painting but choose to anchor it in the
1687	   story of the original painting instead of the story of the electronic
1688	   likeness; although the ERC may through other segments prove to be
1689	   centrally concerned with describing the electronic likeness, the
1690	   provider may have chosen this particular anchoring story in order to
1691	   make the ERC visible in a way that is most natural to patrons (who
1692	   would find the Mona Lisa under da Vinci sooner than they would find
1693	   it under the name of the person who snapped the photograph or scanned
1694	   the image).  In another example, a provider that creates an ERC for a
1695	   dramatic play as an abstract work has the task of describing a piece
1696	   of intangible intellectual property.  To anchor this abstract object
1697	   in the concrete world, if only through a derivative expression, it
1698	   makes sense for the provider to choose a suitable printed edition of
1699	   the play as the anchoring object expression (to describe in the
1700	   anchoring story) of the ERC.

1702	   The anchoring story has special rules designed to keep ERC processing
1703	   simple and predictable.  Each of the four basic elements (who, what,
1704	   when, and where) must be present, unless a best effort to supply it
1705	   fails.  In the event of failure, the element still appears but a
1706	   special value (described later) is used to explain the missing value.
1707	   While the requirement that each of the four elements be present only
1708	   applies to the anchoring story segment, as usual these elements
1709	   appear at the beginning of the segment and may only be used in the
1710	   prescribed order.  A minimal ERC would normally consist of just an
1711	   anchoring story and the element quartet, as illustrated in the next
1712	   example.

1714	         erc:
1715	         who:   National Research Council
1716	         what:  The Digital Dilemma
1717	         when:  2000
1718	         where: http://books.nap.edu/html/digital%5Fdilemma

1720	   A minimal ERC can be abbreviated so that it resembles a traditional
1721	   compact bibliographic citation that is nonetheless completely machine
1722	   processable.  The required elements and ordering make it possible to
1723	   eliminate the element labels, as shown here.

1725	         erc: National Research Council | The Digital Dilemma | 2000
1726	                | http://books.nap.edu/html/digital%5Fdilemma
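Because the anchoring quartet is required and ordered, converting between the labeled and abbreviated forms is mechanical. The following Python sketch assumes the abbreviated value has already been unfolded onto one line; the names expand_abbreviated and abbreviate are hypothetical.

```python
QUARTET = ("who", "what", "when", "where")


def expand_abbreviated(value):
    """Turn the value of an abbreviated 'erc:' element into labeled pairs."""
    parts = [p.strip() for p in value.split("|")]
    if len(parts) != len(QUARTET):
        raise ValueError("expected who | what | when | where")
    return list(zip(QUARTET, parts))


def abbreviate(pairs):
    """Collapse a minimal anchoring story into the abbreviated form."""
    if tuple(label for label, _ in pairs) != QUARTET:
        raise ValueError("anchoring story needs who, what, when, where")
    return "erc: " + " | ".join(value for _, value in pairs)


value = ("National Research Council | The Digital Dilemma | 2000 "
         "| http://books.nap.edu/html/digital%5Fdilemma")
pairs = expand_abbreviated(value)
for label, text in pairs:
    print(label + ": " + text)
```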

1728	7.4.  ERC Elements

1730	   As mentioned, the four basic ERC elements (who, what, when, and
1731	   where) take on different specific meanings depending on the story
1732	   segment in which they are used.  By appearing in each segment, albeit
1733	   in different guises, the four elements serve as a valuable mnemonic
1734	   device - a kind of checklist - for constructing minimal story
1735	   segments from scratch.  Again, it is only in the anchoring segment
1736	   that all four elements are mandatory.

1738	   Here are some mappings between ERC elements and Dublin Core [DCORE]
1739	   elements.

1741	          Segment     ERC Element     Equivalent Dublin Core Element
1742	         ---------    -----------     ------------------------------
1743	            erc          who          Creator/Contributor/Publisher
1744	            erc          what                Title
1745	            erc          when                Date
1746	            erc          where               Identifier
1747	         erc-about       who                  <none>
1748	         erc-about       what                Subject
1749	         erc-about       when                Coverage (temporal)
1750	         erc-about       where               Coverage (spatial)
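For software that must cross-walk ERC records to Dublin Core, the table above can be held as a lookup keyed on segment and element. The sketch below is one possible Python representation; the names DC_EQUIVALENT and dublin_core are hypothetical, and an empty tuple stands for the table's "<none>" entry.

```python
# (segment, ERC element) -> equivalent Dublin Core element(s)
DC_EQUIVALENT = {
    ("erc", "who"): ("Creator", "Contributor", "Publisher"),
    ("erc", "what"): ("Title",),
    ("erc", "when"): ("Date",),
    ("erc", "where"): ("Identifier",),
    ("erc-about", "who"): (),
    ("erc-about", "what"): ("Subject",),
    ("erc-about", "when"): ("Coverage (temporal)",),
    ("erc-about", "where"): ("Coverage (spatial)",),
}


def dublin_core(segment, element):
    """Return the Dublin Core equivalents, or () when none is listed."""
    return DC_EQUIVALENT.get((segment, element), ())


print(dublin_core("erc", "who"))
print(dublin_core("erc-about", "where"))
```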

1752	   The basic element labels may also be qualified to add nuances to the
1753	   semantic categories that they identify.  Elements are qualified by
1754	   appending a `/' (slash) and a qualifier term.  Often qualifier terms
1755	   appear as the past tense form of a verb because it makes re-using
1756	   qualifiers among elements easier.

1758	         who/published:  ...
1759	         when/published: ...
1760	         where/published: ...

1762	   Using past tense verbs for qualifiers also reminds providers and
1763	   recipients that element values contain transient assertions that may
1764	   have been true once, but that tend to become less true over time.
1765	   Recipients that don't understand the meaning of a qualifier can fall
1766	   back onto the semantic category (bucket) designated by the
1767	   unqualified element label.  Inevitably recipients (people and
1768	   software) will have diverse abilities in understanding elements and
1769	   qualifiers.

1771	   Any number of other elements and qualifiers may be used in
1772	   conjunction with the quartet of basic segment questions.  The only
1773	   semantic requirement is that they pertain to the segment's story.
1774	   Also, it is only the four basic elements that change meaning
1775	   depending on their segment context.  All other elements have meaning
1776	   independent of the segment in which they appear.  If an element label
1777	   stripped of its qualifier is still not recognized by the recipient, a
1778	   second fallback position is to ignore it and rely on the four basic
1779	   elements.
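The two fallback positions for a recipient can be sketched as follows in Python. The set of labels that a recipient understands is an assumption made for the example, and the name interpret is hypothetical.

```python
def interpret(label, known):
    """Return the most specific form of label that the recipient knows."""
    if label in known:
        return label
    base = label.split("/", 1)[0]
    if base in known:
        return base    # first fallback: drop the qualifier
    return None        # second fallback: ignore the element


# Hypothetical recipient vocabulary.
known = {"who", "what", "when", "where", "who/published"}

print(interpret("who/published", known))
print(interpret("when/Reviewed", known))
print(interpret("IDcode", known))
```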

1781	   Elements may be either Canonical, Provisional, or Local.  Canonical
1782	   elements are officially recognized via a registry as part of the
1783	   metadata vernacular.  All elements, qualifiers, and segment labels
1784	   used in this document up until now belong to that vernacular.
1785	   Provisional elements are also officially recognized via the registry,
1786	   but have only been proposed for inclusion in the vernacular.  To be
1787	   promoted to the vernacular, a provisional element passes through a
1788	   vetting process during which its documentation must be in order and
1789	   its community acceptance demonstrated.  Local elements are any
1790	   elements not officially recognized in the registry.  The registry
1791	   [DERC] is a work in progress.

1793	   Local elements can be immediately distinguished from Canonical or
1794	   Provisional elements because all terms that begin with an upper case
1795	   letter are reserved for spontaneous local use.  No term beginning
1796	   with an upper case letter will ever be assigned Canonical or
1797	   Provisional status, so it should be safe to use such terms for local
1798	   purposes.  Any recipient of external ERCs containing such terms will
1799	   understand them to be part of the originating provider's local
1800	   metadata dialect.  Here's an example ERC with three segments, one
1801	   local element, and two local qualifiers.  The segment boundaries have
1802	   been emphasized by comment lines (which, as before, are ignored by
1803	   processors).

1805	         erc:
1806	         who: Bullock, TH | Achimowicz, JZ | Duckrow, RB
1807	                 | Spencer, SS | Iragui-Madoz, VJ
1808	         what: Bicoherence of intracranial EEG in sleep,
1809	                 wakefulness and seizures
1810	         when: 1997 12 00
1811	         where: http://cogprints.soton.ac.uk/%{
1812	                 documents/disk0/00/00/01/22/index.html %}
1813	         in: EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
1814	         IDcode: cog00000122
1815	         # ---- new segment ----
1816	         erc-about:
1817	         what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
1818	                 | Cooperativity | Subdural | Hippocampus | Higher moment
1819	         # ---- new segment ----
1820	         erc-from:
1821	         who: NIH/NLM/NCBI
1822	         what: pm9546494
1823	         when/Reviewed: 1998 04 18 021600
1824	         where: http://ark.nlm.nih.gov/12025/pm9546494?

1826	   The local element "IDcode" immediately precedes the "erc-about"
1827	   segment, which itself contains an element with the local qualifier
1828	   "Subcategory".  The second to last element also carries the local
1829	   qualifier "Reviewed".  Finally, what might be a provisional element
1830	   "in" appears near the end of the first segment.  It might have been
1831	   proposed as a way to complete a citation for an object originally
1832	   appearing inside another object (such as an article appearing in a
1833	   journal or an encyclopedia).
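The upper case convention makes local terms detectable without consulting the registry. A minimal Python sketch, with the hypothetical name local_terms, applied to labels from the example above:

```python
def local_terms(label):
    """Return the parts of an 'element/qualifier' label that are local,
    that is, those that begin with an upper case letter."""
    return [part for part in label.split("/") if part[:1].isupper()]


labels = ["who", "what", "when", "where", "in", "IDcode",
          "what/Subcategory", "when/Reviewed"]
for label in labels:
    print(label, local_terms(label))
```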

1835	7.5.  ERC Element Values

1837	   ERC element values tend to be straightforward strings.  If the
1838	   provider intends something special for an element, it will so
1839	   indicate with markers at the beginning of its value string.  The
1840	   markers are designed to be uncommon enough that they would not likely
1841	   occur in normal data except by deliberate intent.  Markers can only
1842	   occur near the beginning of a string, and once any octet of non-
1843	   marker data has been encountered, no further marker processing is
1844	   done for the element value.  In the absence of markers the string is
1845	   considered pure data; this has been the case with all the examples
1846	   seen thus far.  The fullest form of an element value with all three
1847	   optional markers in place looks like this.

1849	         VALUE =    [markup_flags]    (:ccode)    ,    DATA

1851	   In processing, the first non-whitespace character of an ERC element
1852	   value is examined.  An initial `[' is reserved to introduce a
1853	   bracketed set of markup flags (not described in this document) that
1854	   ends with `]'.  If ERC data is machine-generated, each value string
1855	   may be preceded by "[]" to prevent any of its data from being
1856	   mistaken for markup flags.  Once past the optional markup, the
1857	   remaining value may optionally begin with a controlled code.  A
1858	   controlled code always has the form "(:ccode)", for example,

1860	         who: (:unkn) Anonymous
1861	         what: (:791) Bee Stings

1863	   Any string after such a code is taken to be an uncontrolled (e.g.,
1864	   natural language) equivalent.  The code "unkn" indicates a
1865	   conventional explanation for a missing value (stating that the value
1866	   is unknown).  The remainder of the string makes an equivalent
1867	   statement in a form that the provider deemed most suitable to its
1868	   (probably human) audience.  The code "791" could be a fixed numeric
1869	   topic identifier within an unspecified topic vocabulary.  Any code
1870	   may be ignored by those that do not understand it.

1872	   There are several codes to explain different ways in which a required
1873	   element's value may go missing.

1875	         (:unac)   temporarily inaccessible
1876	         (:unal)   unallowed, suppressed intentionally
1877	         (:unap)   not applicable, makes no sense
1878	         (:unas)   value unassigned (e.g., Untitled)
1879	         (:unav)   value unavailable indefinitely
1880	         (:unkn)   unknown (e.g., Anonymous, Inconnue)
1881	         (:etal)   too numerous to list (et alia)
1882	         (:none)   never had a value, never will
1883	         (:null)   explicitly empty
1884	         (:tba)    to be assigned or announced later
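Recognition of the optional markup flags and controlled code at the start of a value can be sketched with a regular expression. This Python illustration makes assumptions that the text does not settle: flags are taken to contain no `]', a code is taken to contain no `)', and the names split_markers and MISSING_VALUE_CODES are hypothetical. The contents of the markup flags are not interpreted.

```python
import re

# Codes that explain a missing value, as listed above.
MISSING_VALUE_CODES = {
    "unac": "temporarily inaccessible",
    "unal": "unallowed, suppressed intentionally",
    "unap": "not applicable, makes no sense",
    "unas": "value unassigned",
    "unav": "value unavailable indefinitely",
    "unkn": "unknown",
    "etal": "too numerous to list",
    "none": "never had a value, never will",
    "null": "explicitly empty",
    "tba": "to be assigned or announced later",
}

_MARKERS = re.compile(
    r"\s*(?:\[(?P<flags>[^\]]*)\])?"   # optional [markup_flags]
    r"\s*(?:\(:(?P<ccode>[^)]*)\))?"   # optional (:ccode)
    r"\s*(?P<rest>.*)",                # remaining string value
    re.DOTALL,
)


def split_markers(value):
    """Return (flags, ccode, rest); flags and ccode are None if absent."""
    m = _MARKERS.match(value)
    return m.group("flags"), m.group("ccode"), m.group("rest").strip()


for value in ["(:unkn) Anonymous", "[] (:791) Bee Stings", "Gibbon, Edward"]:
    flags, ccode, rest = split_markers(value)
    print(flags, ccode, rest, ccode in MISSING_VALUE_CODES)
```

A recipient that does not understand a code, such as "791" here, can ignore it and use the remaining string.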

1886	   Once past an optional controlled code, the remaining string value is
1887	   subjected to one final test.  If the next non-whitespace
1888	   character is a `,' (comma), it indicates that the string value is
1889	   "sort-friendly".  This means that the value is (a) laid out with an
1890	   inverted word order useful for sorting items having comparably laid
1891	   out element values (items might be the containing ERC records) and
1892	   (b) that the value may contain other commas that indicate inversion
1893	   points should it become necessary to recover the value in natural
1894	   word order.  Typically, this feature is used to express Western-style
1895	   personal names in family-name-given-name order.  It can also be used
1896	   wherever natural word order might make sorting tricky, such as when
1897	   data contains titles or corporate names.  Here are some example
1898	   elements.

1900	         who:   ,  van Gogh, Vincent
1901	         who:,Howell, III, PhD, 1922-1987, Thurston
1902	         who:, Acme Rocket Factory, Inc., The
1903	         who:, Mao Tse Tung
1904	         who:, McCartney, Paul, Sir,
1905	         what:, Health and Human Services, United States Government
1906	                 Department of, The,

1908	   There are rules to use in recovering a copy of the value in natural
1909	   word order, if desired.  The above example strings have the following
1910	   natural word order values, respectively.

1912	         Vincent van Gogh
1913	         Thurston Howell, III, PhD, 1922-1987
1914	         The Acme Rocket Factory, Inc.
1915	         Mao Tse Tung
1916	         Sir Paul McCartney
1917	         The United States Government Department of Health and Human Services
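The recovery rules are not spelled out in this section. The Python sketch below uses a rule inferred solely from the six examples above, so it is an assumption rather than the normative procedure: when the value ends with a comma, every comma is treated as an inversion point and the pieces are reversed; otherwise only the final comma is an inversion point and the last piece moves to the front. The name natural_order is hypothetical, and the input is the value remaining after any markup flags and controlled code.

```python
def natural_order(value):
    """Recover natural word order from a sort-friendly value."""
    text = value.strip()
    if not text.startswith(","):
        return text                    # not marked sort-friendly
    text = text[1:].strip()
    if text.endswith(","):
        # Trailing comma: reverse all comma-separated pieces.
        parts = [p.strip() for p in text[:-1].split(",")]
        return " ".join(reversed(parts))
    head, sep, tail = text.rpartition(",")
    if not sep:
        return text                    # no inversion point at all
    # Only the final comma inverts; earlier commas are kept as data.
    return tail.strip() + " " + head.strip()


examples = [
    "  ,  van Gogh, Vincent",
    ",Howell, III, PhD, 1922-1987, Thurston",
    ", Acme Rocket Factory, Inc., The",
    ", Mao Tse Tung",
    ", McCartney, Paul, Sir,",
    ", Health and Human Services, United States Government"
    " Department of, The,",
]
for value in examples:
    print(natural_order(value))
```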

1919	7.6.  ERC Element Encoding and Dates

1921	   Some characters that need to appear in ERC element values might
1922	   conflict with special characters used for structuring ERCs, so there
1923	   needs to be a way to include them as literal characters that are
1924	   protected from special interpretation.  This is accomplished through
1925	   an encoding mechanism that resembles the %-encoding familiar to [URI]
1926	   handlers.

1928	   The ERC encoding mechanism also uses `%', but instead of taking two
1929	   following hexadecimal digits, it takes one non-alphanumeric character
1930	   or two alphabetic characters that cannot be mistaken for hex digits.
1931	   It is designed not to be confused with normal web-style %-encoding.
1932	   In particular it can be decoded without risking unintended decoding
1933	   of normal %-encoded data (which would introduce errors).  Here are
1934	   the one-character (non-alphanumeric) ERC encoding extensions.

1936	         ERC       Purpose
1937	         ---     ------------------------------------------------
1938	         %!      decodes to the element separator `|'
1939	         %%      decodes to a percent sign `%'
1940	         %.      decodes to a comma `,'
1941	         %_      a non-character used as syntax shim
1942	         %{      a non-character that begins an expansion block
1943	         %}      a non-character that ends an expansion block

1945	   One particularly useful construct in ERC element values is the pair
1946	   of special encoding markers ("%{" and "%}") that indicates an
1947	   "expansion" block.  Whatever string of characters they enclose will
1948	   be treated as if none of the contained whitespace (SPACEs, TABs,
1949	   Newlines) were present.  This comes in handy for writing long, multi-
1950	   part URLs in a readable way.  For example, the value in

1952	         where: http://foo.bar.org/node%{
1953	                    ? db = foo
1954	                    & start = 1
1955	                    & end = 5
1956	                    & buf = 2
1957	                    & query = foo + bar + zaf
1958	                %}

1960	   is decoded into an equivalent element, but with a correct and intact
1961	   URL:

1963	     where:
1964	      http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
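A decoder for the one-character ERC encodings and expansion blocks can be sketched in Python as a single left-to-right scan. The sketch rests on stated assumptions: the shim "%_" is taken to decode to nothing, sequences not in the table above (including web-style "%5F" and the two-letter ERC encodings) are left untouched, and decoding is applied to each value only after the element has been split on `|'. The name erc_decode is hypothetical.

```python
_SIMPLE = {"!": "|", "%": "%", ".": ",", "_": ""}


def erc_decode(value):
    """Decode one-character ERC encodings and collapse %{ ... %} blocks."""
    out = []
    in_block = False
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "%" and i + 1 < len(value):
            nxt = value[i + 1]
            if nxt in "{}":
                in_block = (nxt == "{")   # enter or leave a block
                i += 2
                continue
            if nxt in _SIMPLE:
                out.append(_SIMPLE[nxt])
                i += 2
                continue
        if in_block and ch.isspace():
            i += 1                        # drop whitespace inside a block
            continue
        out.append(ch)
        i += 1
    return "".join(out)


encoded = """http://foo.bar.org/node%{
    ? db = foo
    & start = 1
    & end = 5
    & buf = 2
    & query = foo + bar + zaf
%}"""
print(erc_decode(encoded))
```

Because a hex digit never follows `%' in the ERC encodings handled here, ordinary %-encoded octets pass through the scan unchanged.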

1966	   In a parting word about ERC element values, a commonly recurring
1967	   value type is a date, possibly followed by a time.  ERC dates use the
1968	   [TEMPER] format, taking on one of the following forms:

1970	         1999                (four digit year)
1971	         2000 12 29          (year, month, day)
1972	         2000 12 29 235955   (year, month, day, hour, minute, second)

1974	   In dates, all internal whitespace is squeezed out to achieve a
1975	   normalized form suitable for lexical comparison and sorting.  This
1976	   means that the following dates

1978	         2000 12 29 235955           (recommended for readability)
1979	         2000 12 29 23 59 55
1980	         20001229 23 59 55
1981	         20001229235955              (normalized date and time)

1983	   are all equivalent.  The first form is recommended for readability.
1984	   The last form (shortest and easiest to compute with) is the
1985	   normalized form.  Hyphens and commas are reserved to create date
1986	   ranges and lists, for example,

1988	         1996-2000                   (a range of five years)
1989	         1952, 1957, 1969            (a list of three years)
1990	         1952, 1958-1967, 1985       (a mixed list of dates and ranges)
1991	         20001229-20001231           (a range of three days)
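Date normalization and the splitting of lists and ranges can be sketched in Python as shown below. The sketch covers only the forms shown in this section and does not attempt the full [TEMPER] format; the names normalize_date and expand_dates are hypothetical.

```python
def normalize_date(text):
    """Squeeze out all whitespace to get the normalized form."""
    return "".join(text.split())


def expand_dates(value):
    """Split a date value into (start, end) pairs.

    Commas separate list items and a hyphen separates the two ends of a
    range; a single date is returned with start equal to end.
    """
    items = []
    for item in value.split(","):
        start, sep, end = item.partition("-")
        start = normalize_date(start)
        end = normalize_date(end) if sep else start
        items.append((start, end))
    return items


forms = ["2000 12 29 235955", "2000 12 29 23 59 55",
         "20001229 23 59 55", "20001229235955"]
print({normalize_date(f) for f in forms})
print(expand_dates("1952, 1958-1967, 1985"))
```

Normalized dates of equal length can be compared as plain strings, which is what makes the normalized form convenient for sorting.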

1993	7.7.  ERC Stub Records and Internal Support

1995	   The ERC design introduces the concept of a "stub" record, which is an
1996	   incomplete ERC record intended to be supplemented with additional
1997	   elements before being released as a standalone ERC record.  A stub
1998	   ERC record has no minimum required elements.  It is just a group of
1999	   elements that does not begin with "erc:" but otherwise conforms to
2000	   the ERC record syntax.

2002	   ERC stubs may be useful in supporting internal procedures using the
2003	   ERC syntax.  Often they rely on the convenience and accuracy of
2004	   automatically supplied elements, even the basic ones.  To be ready
2005	   for external use, however, an ERC stub must be transformed into a
2006	   complete ERC record having the usual required elements.  An ERC stub
2007	   record can be convenient for metadata embedded in a document, where
2008	   elements such as location, modification date, and size - which one
2009	   would not omit from an externalized record - are omitted simply
2010	   because they are much better supplied by a computation.  A separate
2011	   local administrative procedure, not defined for ERCs in general,
2012	   would effect the promotion of stubs into complete records.

2014	   While the ERC is a general-purpose container for exchange of resource
2015	   descriptions, it does not dictate how records must be internally
2016	   stored, laid out, or assembled by data providers or recipients.
2017	   Arbitrary internal descriptive frameworks can support ERCs simply by
2018	   mapping (e.g., on demand) local records to the ERC container format
2019	   and making them available for export.  Therefore, to support ERCs
2020	   there is no need for a data provider to convert internal data to be
2021	   stored in an ERC format.  On the other hand, any provider (such as
2022	   one just getting started in the business of resource description) may
2023	   choose to store and manipulate local data natively in the ERC format.

2025	8.  Advice to Web Clients

2027	   This section offers some advice to web client software developers.
2028	   It is hard to write about because it tries to anticipate a series of
2029	   events that might lead to native web browser support for ARKs.

2031	   ARKs are envisaged to appear wherever durable object references are
2032	   planned.  Library cataloging records, literature citations, and
2033	   bibliographies are important examples.  In many of these places URLs
2034	   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
2035	   PURLs have been proposed as alternatives.

2037	   The strings representing ARKs are also envisaged to appear in some of
2038	   the places where URLs currently appear:  in hypertext links (where
2039	   they are not normally shown to users) and in rendered text (displayed
2040	   or printed).  Internet search engines, for example, tend to include
2041	   both actionable and manifest links when listing each item found.  A
2042	   normal HTML link for which the URL is not displayed looks like this.

2044	          <a href = "http://foo.bar.org/index.htm"> Click Here </a>

2046	   The same link with an ARK instead of a URL:

2048	          <a href = "ark:/14697/b12345x"> Click Here </a>

2050	   Web browsers would in general require a small modification to
2051	   recognize and convert this ARK, via mapping authority discovery, to
2052	   the URL form.

2054	          <a href = "http://a.b.org/ark:/14697/b12345x"> Click Here </a>

2056	   A browser that knows how to make that conversion could also
2057	   automatically detect and replace a non-working NMAH.
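The conversion a client would perform can be sketched in Python. Mapping authority discovery itself is not shown: the NMAH is assumed to have been obtained already, the replacement host "c.d.org" is a made-up placeholder, and the names ark_to_url and replace_nmah are hypothetical.

```python
def ark_to_url(ark, nmah):
    """Prefix an ARK with 'http://' and a Name Mapping Authority Hostport."""
    if not ark.startswith("ark:/"):
        raise ValueError("not an ARK: " + ark)
    return "http://" + nmah + "/" + ark


def replace_nmah(url, nmah):
    """Swap the hostport of a URL-form ARK for a different NMAH."""
    _, sep, rest = url.partition("/ark:/")
    if not sep:
        raise ValueError("no ARK found in: " + url)
    return "http://" + nmah + "/ark:/" + rest


url = ark_to_url("ark:/14697/b12345x", "a.b.org")
print(url)                             # http://a.b.org/ark:/14697/b12345x
print(replace_nmah(url, "c.d.org"))    # http://c.d.org/ark:/14697/b12345x
```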

2059	   An NAA will typically make known the associations it creates by
2060	   publishing them in catalogs, actively advertising them, or simply
2061	   leaving them on web sites for visitors (e.g., users, indexing
2062	   spiders) to stumble across in browsing.

2064	9.  Security Considerations

2066	   The ARK naming scheme poses no direct risk to computers and networks.
2067	   Implementors of ARK services need to be aware of security issues when
2068	   querying networks and filesystems for Name Mapping Authority
2069	   services, and the concomitant risks from spoofing and obtaining
2070	   incorrect information.  These risks are no greater for ARK mapping
2071	   authority discovery than for other kinds of service discovery.  For
2072	   example, recipients of ARKs with a specified hostport (NMAH) should
2073	   treat it like a URL and be aware that the identified ARK service may
2074	   no longer be operational.

2076	   Apart from mapping authority discovery, ARK clients and servers
2077	   subject themselves to all the risks that accompany normal operation
2078	   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
2079	   As a specialization of such protocols, an ARK service may limit
2080	   exposure to the usual risks.  Indeed, ARK services may enhance a kind
2081	   of security by helping users identify long-term reliable references
2082	   to information objects.

2084	10.  Authors' Addresses

2086	   John A. Kunze
2087	   California Digital Library
2088	   University of California, Office of the President
2089	   415 20th St, 4th Floor
2090	   Oakland, CA  94612-3550, USA

2092	   Fax:   +1 510-893-5212
2093	   EMail: jak@ucop.edu

2095	   R. P. C. Rodgers
2096	   US National Library of Medicine
2097	   8600 Rockville Pike, Bldg. 38A
2098	   Bethesda, MD  20894, USA

2100	   Fax:   +1 301-496-0673
2101	   EMail: rodgers@nlm.nih.gov

2103	11.  References

2105	   [ANVL]     J. Kunze, B. Kahle, et al, "A Name-Value Language", work
2106	              in progress,
2107	              http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf

2109	   [ARK]      J. Kunze, "Towards Electronic Persistence Using ARK
2110	              Identifiers", Proceedings of the 3rd ECDL Workshop on Web
2111	              Archives, August 2003, (PDF)
2112	              http://bibnum.bnf.fr/ecdl/2003/proceedings.php?f=kunze

2114	   [DCORE]    Dublin Core Metadata Initiative, "Dublin Core Metadata
2115	              Element Set, Version 1.1:  Reference Description", July
2116	              1999, http://dublincore.org/documents/dces/.

2118	   [DERC]     J. Kunze, "Dictionary of the ERC", work in progress within
2119	              the Dublin Core Metadata Initiative's Kernel Working
2120	              Group, http://dublincore.org/groups/kernel/

2122	   [DNS]      P.V. Mockapetris, "Domain Names - Concepts and
2123	              Facilities", RFC 1034, November 1987.

2125	   [DOI]      International DOI Foundation, "The Digital Object
2126	              Identifier (DOI) System", February 2001,
2127	              http://dx.doi.org/10.1000/203.

2129	   [ERC]      J. Kunze, "A Metadata Kernel for Electronic Permanence",
2130	              Journal of Digital Information, Vol 2, Issue 2, January
2131	              2002, ISSN 1368-7506, (PDF)
2132	              http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

2134	   [Handle]   L. Lannom, "Handle System Overview", ICSTI Forum, No. 30,
2135	              April 1999, http://www.icsti.org/forum/30/#lannom

2137	   [HTTP]     R. Fielding, et al, "Hypertext Transfer Protocol --
2138	              HTTP/1.1", RFC 2616, June 1999.

2140	   [MD5]      R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
2141	              April 1992.

2143	   [N2T]      CDL, "Name-to-Thing Resolver", work in progress, August
2144	              2006, http://n2t.info

2146	   [NAPTR]    M. Mealling, R. Daniel, "The Naming Authority Pointer
2147	              (NAPTR) DNS Resource Record", RFC 2915, September 2000.

2149	   [NLMPerm]  M. Byrnes, "Defining NLM's Commitment to the Permanence of
2150	              Electronic Information", ARL 212:8-9, October 2000,
2151	              http://www.arl.org/newsltr/212/nlm.html

2153	   [NOID]     J. Kunze, "Nice Opaque Identifiers", February 2005,
2154	              http://www.cdlib.org/inside/diglib/ark/noid.pdf

2156	   [PURL]     K. Shafer, et al, "Introduction to Persistent Uniform
2157	              Resource Locators", 1996,
2158	              http://purl.oclc.org/OCLC/PURL/INET96

2160	   [RFC822]   D. Crocker, "Standard for the format of ARPA Internet text
2161	              messages", RFC 822, August 1982.

2163	   [TELNET]   J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
2164	              RFC 854, May 1983.

2166	   [TEMPER]   J. Kunze, "Temporal Enumerated Ranges", work in progress,
2167	              http://www.cdlib.org/inside/diglib/ark/temperspec.pdf

2169	   [THUMP]    K. Gamiel, J. Kunze, "The HTTP URL Mapping Protocol", work
2170	              in progress,
2171	              http://www.ietf.org/internet-drafts/draft-kunze-thump-00.txt

2173	   [URI]      T. Berners-Lee, et al, "Uniform Resource Identifiers
2174	              (URI): Generic Syntax", RFC 2396, August 1998.

2176	   [URNBIB]   C. Lynch, et al, "Using Existing Bibliographic Identifiers
2177	              as Uniform Resource Names", RFC 2288, February 1998.

2179	   [URNSYN]   R. Moats, "URN Syntax", RFC 2141, May 1997.

2181	   [URNNID]   L. Daigle, et al, "URN Namespace Definition Mechanisms",
2182	              RFC 2611, June 1999.

2184	12.  Appendix:  ARK Implementations

2186	   Currently, the primary implementation activity is at the California
2187	   Digital Library (CDL),

2189	         http://ark.cdlib.org/

2191	   housed at the University of California Office of the President, where
2192	   over 200,000 ARKs have been assigned to objects that the CDL owns or
2193	   controls.  Some experimentation with ARKs is taking place at JSTOR,
2194	   the Digital Curation Centre, WIPO, and the University of California's
2195	   San Diego, San Francisco, and Berkeley campuses.

2197	   The US National Library of Medicine (NLM) also has an experimental,
2198	   prototype ARK service under development.  It is being made available
2199	   for purposes of demonstrating various aspects of the ARK system, but
2200	   is subject to temporary or permanent withdrawal (without notice)
2201	   depending upon the circumstances of the small research group
2202	   responsible for making it available.  It is described at:

2204	         http://ark.nlm.nih.gov/

2206	   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

2208	13.  Appendix:  Current ARK Name Authority Table

2210	   This appendix contains a copy of the Name Authority Table (a file) at
2211	   the time of writing.  It may be loaded into a local filesystem (e.g.,
2212	   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
2213	   NMAHs (Name Mapping Authority Hostports).  It contains Perl code that
2214	   can be copied into a standalone script that processes the table (as a
2215	   file).  Because this is still a proposed file, none of the values in
2216	   it are real.

2218	     #
2219	     # Name Assigning Authority / Name Mapping Authority Lookup Table
2220	     #       Last change:   2006.08.22
2221	     #       Reload from:   http://ark.nlm.nih.gov/etc/natab
2222	     #       Mirrored at:   http://www.cdlib.org/inside/diglib/ark/natab
2223	     #       To register:   mailto:ark@cdlib.org?Subject=naareg
2224	     #       Process with:  Perl script at end of this file (optional)
2225	     #
2226	     # Each NAA appears at the beginning of a line with the NAA Number
2227	     # first, a colon, and an ARK or URL to a statement of naming policy
2228	     # (see http://ark.cdlib.org for an example).
2229	     # All the NMA hostports that service an NAA are listed, one per
2230	     # line, indented, after the corresponding NAA line.
2231	     #
2232	     #       National Library of Medicine
2233	     12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
2234	             ark.nlm.nih.gov USNLM
2235	             foobar.zaf.org UCSF
2236	     #
2237	     #       Library of Congress
2238	     12026:  http://www.loc.gov/xxx/naapolicy.html
2239	             foobar.zaf.org USLC
2240	     #
2241	     #       National Agricultural Library
2242	     12027:  http://www.nal.gov/xxx/naapolicy.html
2243	             foobar.zaf.gov:80 USNAL
2244	     #
2245	     #       California Digital Library
2246	     13030:  http://www.cdlib.org/inside/diglib/ark/
2247	             ark.cdlib.org CDL
2248	     #
2249	     #       World Intellectual Property Organization
2250	     13038:  http://www.wipo.int/xxx/naapolicy.html
2251	             www.wipo.int WIPO
2252	     #
2253	     #       University of California San Diego
2254	     20775:  http://library.ucsd.edu/xxx/naapolicy.html
2255	             ucsd.edu UCSD
2256	     #
2257	     #       University of California San Francisco
2258	     29114:  http://library.ucsf.edu/xxx/naapolicy.html
2259	             ucsf.edu UCSF
2260	     #
2261	     #       University of California Berkeley
2262	     28722:  http://library.berkeley.edu/xxx/naapolicy.html
2263	             berkeley.edu UCB
2264	     #
2265	     #       University of California Los Angeles
2266	     21198:  http://library.ucla.edu/xxx/naapolicy.html
2267	             ucla.edu UCLA
2268	     #
2269	     #       Rutgers University
2270	     15230:  http://rci.rutgers.edu/xxx/naapolicy.html
2271	             rutgers.edu RU
2272	     #
2273	     #       Internet Archive
2274	     13960:  http://www.archive.org/xxx/naapolicy.html
2275	             archive.org IA
2276	     #
2277	     #       Digital Curation Centre
2278	     64269:  http://www.dcc.ac.uk/xxx/naapolicy.html
2279	             dcc.ac.uk DCC
2280	     #
2281	     #       New York University
2282	     62624:  http://library.nyu.edu/xxx/naapolicy.html
2283	             nyu.edu NYU
2284	     #
2285	     #       University of North Texas
2286	     67531:  http://www.library.unt.edu/xxx/naapolicy.html
2287	             unt.edu UNT
2288	     #
2289	     #       Ithaka Electronic-Archiving Initiative
2290	     27927:  http://www.ithaka.org/xxx/naapolicy.html
2291	             ithaka.org ITHAKA
2292	     #
2293	     #       Bibliothèque nationale de France / National Library of France
2294	     12148:  http://www.bnf.fr/xxx/naapolicy.html
2295	             bnf.fr BNF
2296	     #
2297	     #       Princeton University
2298	     88435:  http://diglib.princeton.edu/xxx/naapolicy.html
2299	             princeton.edu PU
2300	     #
2301	     #       University of Washington
2302	     78428:  http://u.washington.edu/xxx/naapolicy.html
2303	             u.washington.edu UW
2304	     #
2305	     #       Archives of Region of Västra Götaland and City of Gothenburg, Sweden
2306	     89901:  http://www.arkivnamnden.org/xxx/naapolicy.html
2307	             arkivnamnden.org AVGG
2308	     #
2309	     #       Northwest Digital Archives
2310	     80444:  http://nwda.wsulibs.wsu.edu/xxx/naapolicy.html
2311	             nwda.wsulibs.wsu.edu NWDA
2312	     #
2313	     #       Emory University
2314	     25593:  http://id.library.emory.edu/xxx/naapolicy.html
2315	             id.library.emory.edu EMORY
2316	     #
2317	     #       University of Kansas
2318	     25031:  http://www.lib.ku.edu/xxx/naapolicy.html
2319	             www.lib.ku.edu UKANSAS

2321	     #
2322	     #       Google
2323	     78319:  http://www.google.com/xxx/naapolicy.html
2324	             www.google.com GOOGLE
2325	     #
2326	     #       UK Centre for Ecology and Hydrology
2327	     17101:  http://www.ceh.ac.uk/xxx/naapolicy.html
2328	             www.ceh.ac.uk CEH
2329	     #
2330	     #12345: reserved for examples
2331	     #
2332	     #--- end of data ---
2333	     # The following Perl script takes an NAA as argument and outputs
2334	     # the NMAs in this file listed under any matching NAA.
2335	     #
2336	     # my $naa = shift;
2337	     # while (<>) {
2338	     #       next if (! /^$naa:/);
2339	     #       while (<>) {
2340	     #               last if (! /^[#\s]./);
2341	     #               print "$1\n" if (/^\s+(\S+)/);
2342	     #       }
2343	     # }
2344	     #
2345	     # Create a g/t/nroff-safe version of this table with the UNIX command,
2346	     #
2347	     #       expand natab | sed 's/\\/\\\e/g' > natab.roff
2348	     #
2349	     # end of file
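
   The NAA-to-NMAH lookup that the table's embedded Perl script performs
   can also be sketched in Python.  This is a minimal illustration, not
   part of the table file or of any ARK tooling.  It assumes the table
   has been saved without the five-space indent used in this document,
   so that NAA lines begin in column one.  The names parse_natab and
   SAMPLE are hypothetical, and the sample data repeats the placeholder
   values shown above.

```python
import re

# An NAA line: the NAA Number, a colon, then a policy ARK or URL.
NAA_LINE = re.compile(r"^(\d+):\s+(\S+)")
# An NMA line: indented, with the hostport as the first token.
NMA_LINE = re.compile(r"^\s+(\S+)")


def parse_natab(text):
    """Map each NAA Number (as a string) to its list of NMA hostports."""
    table = {}
    current = None
    for line in text.splitlines():
        # Skip blank lines and comment lines (including "#12345: ...").
        if not line.strip() or line.startswith("#"):
            continue
        match = NAA_LINE.match(line)
        if match:
            # Start collecting hostports for this NAA.
            current = table.setdefault(match.group(1), [])
            continue
        match = NMA_LINE.match(line)
        if match and current is not None:
            current.append(match.group(1))
    return table


# Placeholder entries copied from the table above (assumed unindented).
SAMPLE = """\
#       National Library of Medicine
12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
        ark.nlm.nih.gov USNLM
        foobar.zaf.org UCSF
#
#       California Digital Library
13030:  http://www.cdlib.org/inside/diglib/ark/
        ark.cdlib.org CDL
#
#12345: reserved for examples
"""

print(parse_natab(SAMPLE)["12025"])  # ['ark.nlm.nih.gov', 'foobar.zaf.org']
```

   Unlike the Perl script, which scans the file once for each NAA given
   as an argument, this sketch reads the whole table into a dictionary
   so that repeated lookups do not reread the file.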

2351	14.  Copyright Notice

2353	   Copyright (C) The IETF Trust (2007).  This document is subject to the
2354	   rights, licenses and restrictions contained in BCP 78, and except as
2355	   set forth therein, the authors retain all their rights.

2357	   This document and the information contained herein are provided on an
2358	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2359	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
2360	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
2361	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
2362	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2363	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2365	Expires 23 August 2007
2366	                           Table of Contents

2368	Status of this Document  . . . . . . . . . . . . . . . . . . . . . .   1
2369	Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1
2370	1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .   3
2371	1.1.  Reasons to Use ARKs  . . . . . . . . . . . . . . . . . . . . .   4
2372	1.2.  Three Requirements of ARKs . . . . . . . . . . . . . . . . . .   4
2373	1.3.  Organizing Support for ARKs:  Our Stuff vs. Their Stuff  . . .   5
2374	1.4.  Definition of Identifier . . . . . . . . . . . . . . . . . . .   7
2375	2.  ARK Anatomy  . . . . . . . . . . . . . . . . . . . . . . . . . .   8
2376	2.1.  The Name Mapping Authority Hostport (NMAH) . . . . . . . . . .   8
2377	2.2.  The ARK Label Part - ark:  . . . . . . . . . . . . . . . . . .   9
2378	2.3.  The Name Assigning Authority Number (NAAN) . . . . . . . . . .  10
2379	2.4.  The Name Part  . . . . . . . . . . . . . . . . . . . . . . . .  10
2380	2.5.  The Qualifier Part . . . . . . . . . . . . . . . . . . . . . .  11
2381	2.5.1.  ARKs that Reveal Object Hierarchy  . . . . . . . . . . . . .  12
2382	2.5.2.  ARKs that Reveal Object Variants . . . . . . . . . . . . . .  13
2383	2.6.  Character Repertoires  . . . . . . . . . . . . . . . . . . . .  14
2384	2.7.  Normalization and Lexical Equivalence  . . . . . . . . . . . .  15
2385	3.  Naming Considerations  . . . . . . . . . . . . . . . . . . . . .  16
2386	3.1.  ARKS Embedded in Language  . . . . . . . . . . . . . . . . . .  16
2387	3.2.  Objects Should Wear Their Identifiers  . . . . . . . . . . . .  17
2388	3.3.  Names are Political, not Technological . . . . . . . . . . . .  17
2389	3.4.  Choosing a Hostname or NMA . . . . . . . . . . . . . . . . . .  17
2390	3.5.  Assigners of ARKs  . . . . . . . . . . . . . . . . . . . . . .  19
2391	3.6.  NAAN Namespace Management  . . . . . . . . . . . . . . . . . .  20
2392	3.7.  Sub-Object Naming  . . . . . . . . . . . . . . . . . . . . . .  21
2393	4.  Finding a Name Mapping Authority . . . . . . . . . . . . . . . .  21
2394	4.1.  Looking Up NMAHs in a Globally Accessible File . . . . . . . .  22
2395	4.2.  Looking up NMAHs Distributed via DNS . . . . . . . . . . . . .  23
2396	5.  Generic ARK Service Definition . . . . . . . . . . . . . . . . .  26
2397	5.1.  Generic ARK Access Service (access, location)  . . . . . . . .  26
2398	5.2.  Generic Policy Service (permanence, naming, etc.)  . . . . . .  26
2399	5.3.  Generic Description Service  . . . . . . . . . . . . . . . . .  28
2400	6.  Overview of The HTTP URL Mapping Protocol (THUMP)  . . . . . . .  28
2401	7.  Overview of Electronic Resource Citations (ERCs) . . . . . . . .  31
2402	7.1.  ERC Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  33
2403	7.2.  ERC Stories  . . . . . . . . . . . . . . . . . . . . . . . . .  34
2404	7.3.  The ERC Anchoring Story  . . . . . . . . . . . . . . . . . . .  35
2405	7.4.  ERC Elements . . . . . . . . . . . . . . . . . . . . . . . . .  36
2406	7.5.  ERC Element Values . . . . . . . . . . . . . . . . . . . . . .  38
2407	7.6.  ERC Element Encoding and Dates . . . . . . . . . . . . . . . .  40
2408	7.7.  ERC Stub Records and Internal Support  . . . . . . . . . . . .  41
2409	8.  Advice to Web Clients  . . . . . . . . . . . . . . . . . . . . .  42
2410	9.  Security Considerations  . . . . . . . . . . . . . . . . . . . .  43
2411	10.  Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . .  43
2412	11.  References  . . . . . . . . . . . . . . . . . . . . . . . . . .  44
2413	12.  Appendix:  ARK Implementations  . . . . . . . . . . . . . . . .  45
2414	13.  Appendix:  Current ARK Name Authority Table . . . . . . . . . .  46
2415	14.  Copyright Notice  . . . . . . . . . . . . . . . . . . . . . . .  49