idnits 2.17.1 

draft-kunze-ark-11.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 17.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 2334.

  ** Found boilerplate matching RFC 3978, Section 5.4, paragraph 1 (on line
     2326), which is fine, but *also* found old RFC 2026, Section 10.4C,
     paragraph 1 text on line 38.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.

  ** The document seems to lack an RFC 3979 Section 5, para. 1 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 2 IPR Disclosure
     Acknowledgement. 

  ** The document seems to lack an RFC 3979 Section 5, para. 3 IPR Disclosure
     Invitation. 


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document seems to lack a 1id_guidelines paragraph about 6 months
     document validity -- however, there's a paragraph with a matching
     beginning. Boilerplate error?

  ** The document is more than 15 pages and seems to lack a Table of Contents.

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard

  == The page length should not exceed 58 lines per page, but there was 49
     longer pages, the longest (page 2) being 63 lines

  == It seems as if not all pages are separated by form feeds - found 0 form
     feeds but 50 pages


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack an IANA Considerations section.  (See Section
     2.2 of https://www.ietf.org/id-info/checklist for how to handle the case
     when there are no actions for IANA.)

  ** The document seems to lack separate sections for Informative/Normative
     References.  All references will be assumed normative when checking for
     downward references.

  ** There are 8 instances of too long lines in the document, the longest one
     being 19 characters in excess of 72.

  ** The abstract seems to contain references ([Qualifier]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.

  == There are 8 instances of lines with non-RFC2606-compliant FQDNs in the
     document.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  == Line 1124 has weird spacing: '... regexp  repla...'

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (23 February 2006) is 6637 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'Qualifier' is mentioned on line 437, but not defined

  == Unused Reference: 'MD5' is defined on line 2129, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ANVL'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ARK'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DCORE'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'DOI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ERC'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'Handle'

  ** Obsolete normative reference: RFC 2616 (ref. 'HTTP') (Obsoleted by RFC
     7230, RFC 7231, RFC 7232, RFC 7233, RFC 7234, RFC 7235)

  ** Downref: Normative reference to an Informational RFC: RFC 1321 (ref.
     'MD5')

  ** Obsolete normative reference: RFC 2915 (ref. 'NAPTR') (Obsoleted by RFC
     3401, RFC 3402, RFC 3403, RFC 3404)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NLMPerm'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'NOID'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'PURL'

  ** Obsolete normative reference: RFC  822 (Obsoleted by RFC 2822)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'TEMPER'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'THUMP'

  ** Obsolete normative reference: RFC 2396 (ref. 'URI') (Obsoleted by RFC
     3986)

  ** Downref: Normative reference to an Informational RFC: RFC 2288 (ref.
     'URNBIB')

  ** Obsolete normative reference: RFC 2141 (ref. 'URNSYN') (Obsoleted by RFC
     8141)

  ** Obsolete normative reference: RFC 2611 (ref. 'URNNID') (Obsoleted by RFC
     3406)


     Summary: 21 errors (**), 0 flaws (~~), 8 warnings (==), 17 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Internet-Draft: draft-kunze-ark-11.txt                          J. Kunze
3	ARK Identifier Scheme                    University of California (UCOP)
4	Expires 23 August 2006                                  R. P. C. Rodgers
5	                                         US National Library of Medicine
6	                                                        23 February 2006

8	                  The ARK Persistent Identifier Scheme

10	      (http://www.ietf.org/internet-drafts/draft-kunze-ark-11.txt)

12	Status of this Document

14	   By submitting this Internet-Draft, each author represents that any
15	   applicable patent or other IPR claims of which he or she is aware
16	   have been or will be disclosed, and any of which he or she becomes
17	   aware will be disclosed, in accordance with Section 6 of BCP 79.

19	   Internet-Drafts are working documents of the Internet Engineering
20	   Task Force (IETF), its areas, and its working groups.  Note that
21	   other groups may also distribute working documents as Internet-
22	   Drafts.

24	   Internet-Drafts are draft documents valid for a maximum of six months
25	   and may be updated, replaced, or obsoleted by other documents at any
26	   time.  It is inappropriate to use Internet-Drafts as reference
27	   material or to cite them other than as ``work in progress.''

29	   The list of current Internet-Drafts can be accessed at
30	   http://www.ietf.org/ietf/1id-abstracts.txt

32	   The list of Internet-Draft Shadow Directories can be accessed at
33	   http://www.ietf.org/shadow.html.

35	   Distribution of this document is unlimited.  Please send comments to
36	   jak@ucop.edu.

38	   Copyright (C) The Internet Society (2006).  All Rights Reserved.

40	Abstract

42	   The ARK (Archival Resource Key) naming scheme is designed to
43	   facilitate the high-quality and persistent identification of
44	   information objects. A founding principle of the ARK is that
45	   persistence is purely a matter of service and is neither inherent in
46	   an object nor conferred on it by a particular naming syntax. The best
47	   that an identifier can do is to lead users to the services that
48	   support persistence. The term ARK itself refers both to the scheme
49	   and to any single identifier that conforms to it.  An ARK has five
50	   components:

52	              [http://NMAH/]ark:/NAAN/Name[Qualifier]

54	   an optional and mutable Name Mapping Authority Hostport, the "ark:"
55	   label, the Name Assigning Authority Number (NAAN), the assigned Name,
56	   and an optional and possibly mutable Qualifier supported by the NMA.
57	   The NAAN and Name together form the immutable persistent identifier
58	   for the object.  An ARK is a special kind of URL that connects users
59	   to three things: the named object, its metadata, and the provider's
60	   promise about its persistence. When entered into the location field
61	   of a Web browser, the ARK leads the user to the named object. That
62	   same ARK, followed by a single question mark ('?'), returns a brief
63	   metadata record that is both human- and machine-readable. When the
64	   ARK is followed by dual question marks ('??'), the returned metadata
65	   contains a commitment statement from the current provider.  Tools
66	   exist for minting, binding, and resolving ARKs.

68	1.  Introduction

70	   This document describes a scheme for the high-quality naming of
71	   information resources.  The scheme, called the Archival Resource Key
72	   (ARK), is well suited to long-term access and identification of any
73	   information resources that accommodate reasonably regular electronic
74	   description.  This includes digital documents, databases, software,
75	   and websites, as well as physical objects (books, bones, statues,
76	   etc.) and intangible objects (chemicals, diseases, vocabulary terms,
77	   performances).  Hereafter the term "object" refers to an information
78	   resource.  The term ARK itself refers both to the scheme and to any
79	   single identifier that conforms to it.  A reasonably concise and
80	   accessible overview and rationale for the scheme is available at
81	   [ARK].

83	   Schemes for persistent identification of network-accessible objects
84	   are not new.  In the early 1990's, the design of the Uniform Resource
85	   Name [URNSYN] responded to the observed failure rate of URLs by
86	   articulating an indirect, non-hostname-based naming scheme and the
87	   need for responsible name management.  Meanwhile, promoters of the
88	   Digital Object Identifier [DOI] succeeded in building a community of
89	   providers around a mature software system [Handle] that supports name
90	   management.  The Persistent Uniform Resource Locator [PURL] is
91	   another scheme that has the unique advantage of working with
92	   unmodified web browsers.  ARKs represent an approach that attempts to
93	   build on the strengths and to avoid the weaknesses of the other
94	   schemes.

96	   A founding principle of the ARK is that persistence is purely a
97	   matter of service.  Persistence is neither inherent in an object nor
98	   conferred on it by a particular naming syntax.  Nor is the technique
99	   of name indirection - upon which URNs, Handles, DOIs, and PURLs are
100	   founded - of central importance.  Name indirection is an ancient and
101	   well-understood practice; new mechanisms for it keep appearing and
102	   distracting practitioner attention, with the Domain Name System [DNS]
103	   being a particularly dazzling and elegant example.  What is often
104	   forgotten is that maintenance of an indirection table is the
105	   overwhelming and unavoidable cost to the organization providing
106	   persistence, and the cost is equivalent across naming schemes.  That
107	   indirection has always been a native part of the web while being so
108	   lightly utilized for the persistence of web-based objects is an
109	   indication of how unsuited most organizations are to the task of
110	   table maintenance and to the overall challenge of digital permanence.

112	   Persistence is achieved through a provider's successful stewardship
113	   of objects and their identifiers.  The highest level of persistence
114	   will be reinforced by a provider's robust contingency, redundancy,
115	   and succession strategies.  It is further safeguarded to the extent
116	   that a provider's mission is shielded from marketplace and political
117	   instabilities.  These are by far the major challenges confronting
118	   persistence providers, and no identifier scheme has any direct impact
119	   on them.  In fact, some schemes appear to be actual liabilities for
120	   persistence because they create short- and long-term dependencies for
121	   every object access on complex, special-purpose local and global
122	   infrastructures, parts of which are proprietary and all of which
123	   increase the carry-forward burden for the preservation community.  It
124	   is for this reason that the ARK scheme relies only on educated name
125	   assignment and light use of general-purpose infrastructures that the
126	   entire internet community needs (the DNS, web servers, and web
127	   browsers) and that one can reasonably expect many others to help
128	   carry forward into the technologically evolving future.

130	1.1.  Reasons to Use ARKs

132	   If no persistent identifier scheme contributes directly to
133	   persistence, why not just use URLs?  A particular URL may be as
134	   durable an identifier as it is possible to have, but nothing
135	   distinguishes it from an ordinary URL to the recipient who is
136	   wondering if it is suitable for long-term reference.  An ARK is just
137	   a URL, distinguished by its form, that provides some of the necessary
138	   conditions for credible persistence.  An ARK invites access to not
139	   one, but to three things:  to the object, to its metadata, and to a
140	   nuanced statement of commitment from the provider regarding the
141	   object.  Existence of the two extra services can be probed
142	   automatically by appending either `?' or `??' to the ARK.
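
As a minimal sketch of the probing described above, the three access
points of an ARK can be derived by plain string concatenation.  The
function name ark_service_urls and the host example.org are assumptions
made for illustration; they are not defined by this document, and no
network request is made.

```python
def ark_service_urls(ark_url: str) -> dict:
    """Return the three URLs an ARK invites access to.

    Hypothetical helper: only the '?' and '??' suffixes come from the
    text; the dictionary keys are illustrative labels.
    """
    return {
        "object": ark_url,             # the named object itself
        "metadata": ark_url + "?",     # brief metadata record
        "commitment": ark_url + "??",  # provider's commitment statement
    }


urls = ark_service_urls("http://example.org/ark:/12025/654xz321")
```

A client could request each URL in turn and treat a missing metadata or
commitment response as a sign that the URL may not be a serviced ARK.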

144	   The form of the ARK also supports the natural separation of naming
145	   authorities into the original name assigning authority and the
146	   diverse multiple name mapping (or servicing) authorities that in
147	   succession and in parallel will take over custodial responsibilities
148	   from the original assigner for the large majority of a long-term
149	   object's archival lifetime.  The mapping authority, indicated by the
150	   hostname part of the URL that contains the ARK, serves to launch the
151	   ARK into cyberspace.  Should it ever fail (and there is no reason why
152	   a well-chosen hostname of a 100-year-old cultural memory institution
153	   shouldn't last as long as the DNS), that host name is considered
154	   disposable and replaceable.  Again, the form of the ARK helps
155	   because it defines exactly how to recover the core immutable object
156	   identity, and several simple algorithms (based on the URN model) are
157	   defined for locating another mapping authority.

159	   There are tools to assist in generating ARKs and other identifiers,
160	   such as [NOID] and "uuidgen", both of which rely for uniqueness on
161	   human-maintained registries.  This document also contains some
162	   guidelines and considerations for managing namespaces and choosing
163	   hostnames wisely.

165	1.2.  Three Requirements of ARKs

167	   The first requirement of an ARK is to give users a link from an
168	   object to a promise of stewardship for it.  That promise is a multi-
169	   faceted covenant that binds the word of an identified service
170	   provider to a specific set of responsibilities.  No one can tell if
171	   successful stewardship will take place because no one can predict the
172	   future.  Reasonable conjecture, however, may be based on past
173	   performance.  There must be a way to tie a promise of persistence to
174	   a provider's demonstrated or perceived ability - its reputation - in
175	   that arena.  Provider reputations would then rise and fall as
176	   promises are observed variously to be kept and broken.  This is
177	   perhaps the best way we have for gauging the strength of any
178	   persistence promise.  Note that over time, current providers have
179	   nothing to do with the intentions of the original assigners of names.

181	   The second requirement of an ARK is to give users a link from an
182	   object to a description of it.  The problem with a naked identifier
183	   is that without a description real identification is incomplete.
184	   Identifiers common today are relatively opaque, though some contain
185	   ad hoc clues that reflect brief life cycle periods such as the
186	   address of a short stay in a filesystem hierarchy.  Possession of
187	   both an identifier and an object is some improvement, but positive
188	   identification may still be uncertain since the object itself might
189	   not include a matching identifier or might not carry evidence obvious
190	   enough to reveal its identity without significant research.  In
191	   either case, what is called for is a record bearing witness to the
192	   identifier's association with the object, as supported by a recorded
193	   set of object characteristics.  This descriptive record is partly an
194	   identification "receipt" with which users and archivists can verify
195	   an object's identity after brief inspection and a plausible match
196	   with recorded characteristics such as title and size.

198	   The final requirement of an ARK is to give users a link to the object
199	   itself (or to a copy) if at all possible.  Persistent access is the
200	   central duty of an ARK.  Persistent identification plays a vital
201	   supporting role but, strictly speaking, it can be construed as no
202	   more than a record attesting to the original assignment of a never-
203	   reassigned identifier.  Object access may not be feasible for various
204	   reasons, such as catastrophic loss of the object, a licensing
205	   agreement that keeps an archive "dark" for a period of years, or when
206	   an object's own lack of tangible existence confuses normal concepts
207	   of access (e.g., a vocabulary term might be accessed through its
208	   definition).  In such cases the ARK's identification role assumes a
209	   much higher profile.  But attempts to simplify the persistence
210	   problem by decoupling access from identification and concentrating
211	   exclusively on the latter are of questionable utility.  A perfect
212	   system for assigning forever unique identifiers might be created, but
213	   if it did so without reducing access failure rates, no one would be
214	   interested.  The central issue - which may be summed up as the "HTTP
215	   404 Not Found" problem - would not have been addressed.

217	1.3.  Organizing Support for ARKs:  Our Stuff vs. Their Stuff

219	   An organization and the user community it serves can often be seen to
220	   struggle with two different areas of persistent identification: the
221	   Our Stuff problem and the Their Stuff problem.  In the Our Stuff
222	   problem, we in the organization want our own objects to acquire
223	   persistent names.  Since we possess or control these objects, our
224	   organization tackles the Our Stuff problem directly.  Whether or not
225	   the objects are named by ARKs, our organization is the responsible
226	   party, so it can plan for, maintain, and make commitments about the
227	   objects.

229	   In the Their Stuff problem, we in the organization want others'
230	   objects to acquire persistent names.  These are objects that we do
231	   not own or control, but some of which are critically important to us.
232	   But because they are beyond our influence as far as support is
233	   concerned, creating and maintaining persistent identifiers for Their
234	   Stuff is not especially purposeful or feasible for us to do.  There
235	   is little that we can do about someone else's stuff except encourage
236	   them to find or become providers of persistence services.

238	   Co-location of persistent access and identification services is
239	   natural.  Any organization that undertakes ongoing support of true
240	   persistent identification (which includes description) is well-served
241	   if it controls, owns, or otherwise has clear internal access to the
242	   identified objects, and this gives it an advantage if it wishes also
243	   to support persistent access to outsiders.  Conversely, persistent
244	   access to outsiders requires orderly internal collection management
245	   procedures that include monitoring, acquisition, verification, and
246	   change control over objects, which in turn requires object
247	   identifiers persistent enough to support auditable record keeping
248	   practices.

250	   Although organizing ARK services under one roof thus tends to make
251	   sense, object hosting can successfully be separated from name
252	   mapping.  An example is when a name mapping authority centrally
253	   provides uniform resolution services via a protocol gateway on behalf
254	   of organizations that host objects behind a variety of access
255	   protocols.  It is also reasonable to build value-added description
256	   services that rely on the underlying services of a set of mapping
257	   authorities.

259	   Supporting ARKs is not for every organization.  By requiring
260	   specific, revealed commitments to preservation, to object access, and
261	   to description, the bar for providing ARK services is higher than for
262	   some other identifier schemes.  On the other hand, it would be hard
263	   to grant credence to a persistence promise from an organization that
264	   could not muster the minimum ARK services.  Not that there isn't a
265	   business model for an ARK-like, description-only service built on top
266	   of another organization's full complement of ARK services.  For
267	   example, there might be competition at the description level for
268	   abstracting and indexing a body of scientific literature archived in
269	   a combination of open and fee-based repositories.  The description-
270	   only service would have no direct commitment to the objects, but
271	   would act as an intermediary, forwarding commitment statements from
272	   object hosting services to requestors.

274	1.4.  Definition of Identifier

276	   An identifier is not a string of character data - an identifier is an
277	   association between a string of data and an object.  This abstraction
278	   is necessary because without it a string is just data.  It's nonsense
279	   to talk about a string's breaking, or about its being strong,
280	   maintained, and authentic.  But as a representative of an
281	   association, a string can do, metaphorically, the things that we
282	   expect of it.

284	   Without regard to whether an object is physical, digital, or
285	   conceptual, to identify it is to claim an association between it and
286	   a representative string, such as "Jane" or "ISBN 0596000278".  What
287	   gives a claim credibility is a set of verifiable assertions, or
288	   metadata, about the object, such as age, height, title, or number of
289	   pages.  In other words, the association is made manifest by a record
290	   (e.g., a cataloging or other metadata record) that vouches for it.

292	   In the complete absence of any testimony (metadata) regarding an
293	   association, a would-be identifier string is a meaningless sequence
294	   of characters.  To keep an externally visible but otherwise internal
295	   string from being perceived as an identifier by outsiders, for
296	   example, it suffices for an organization not to disclose the nature
297	   of its association.  For our immediate purpose, actual existence of
298	   an association record is more important than its authenticity or
299	   verifiability, which are outside the scope of this specification.

301	   It is a gift to the identification process if an object carries its
302	   own name as an inseparable part of itself, such as an identifier
303	   imprinted on the first page of a document or embedded in a data
304	   structure element of a digital document header.  In cases where the
305	   object is large, unwieldy, or unavailable (such as when licensing
306	   restrictions are in effect), a metadata record that includes the
307	   identifier string will usually suffice.  That record becomes a
308	   conveniently manipulable object surrogate, acting as both an
309	   association "receipt" and "declaration".

311	   Note that our definition of identifier extends the one in use for
312	   Uniform Resource Identifiers [URI].  The present document still
313	   sometimes (ab)uses the terms "ARK" and "identifier" as shorthand for
314	   the string part of an identifier, but the context should make the
315	   meaning clear.

317	2.  ARK Anatomy

319	   An ARK is represented by a sequence of characters (a string) that
320	   contains the label, "ark:", optionally preceded by the beginning part
321	   of a URL.  Here is a diagrammed example.

323	         http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff
324	         \___________________/ \__/ \___/ \______/ \____________/
325	           (replaceable)        |     |      |       Qualifier
326	                |         ARK Label   |      |    (NMA-supported)
327	                |                     |      |
328	      Name Mapping Authority          |    Name (NAA-assigned)
329	         Hostport (NMAH)              |
330	                           Name Assigning Authority Number (NAAN)

332	   The ARK syntax can be summarized,

334	                    [http://NMAH/]ark:/NAAN/Name[Qualifier]

336	   where the NMAH and Qualifier parts are in brackets to indicate that
337	   they are optional.
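
The summarized syntax can be illustrated with a small parser.  This is a
sketch under stated assumptions, not a normative grammar: it accepts
only an "http://" prefix, requires a NAAN of 5 or 9 digits (see Section
2.3), and assumes the Name ends at the first "/" after the NAAN, as in
the diagrammed example.  The names Ark and parse_ark are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Illustrative pattern only; not the formal ARK grammar.
_ARK_RE = re.compile(
    r"^(?:http://(?P<nmah>[^/]+)/)?"     # optional, replaceable NMAH
    r"ark:/(?P<naan>[0-9]{5}|[0-9]{9})"  # "ark:" label and NAAN
    r"/(?P<rest>.+)$"                    # Name plus optional Qualifier
)


class Ark(NamedTuple):
    nmah: Optional[str]  # None when the ARK has no hostport prefix
    naan: str
    name: str
    qualifier: str       # empty string when absent


def parse_ark(text: str) -> Ark:
    """Split an ARK string into its components."""
    match = _ARK_RE.match(text)
    if match is None:
        raise ValueError(f"not an ARK: {text!r}")
    # Assumption: the Name runs up to the first "/" that follows it.
    name, sep, tail = match.group("rest").partition("/")
    return Ark(match.group("nmah"), match.group("naan"), name, sep + tail)


parsed = parse_ark("http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff")
```

For the diagrammed example, the parser separates the NMAH
"foobar.zaf.org", the NAAN "12025", the Name "654xz321", and the
Qualifier "/s3/f8.05v.tiff".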

339	2.1.  The Name Mapping Authority Hostport (NMAH)

341	   Before the "ark:" label may appear an optional Name Mapping Authority
342	   Hostport (NMAH) that is a temporary address where ARK service
343	   requests may be sent.  It consists of "http://" (or any service
344	   specification valid for a URL) followed by an Internet hostname or
345	   hostport combination having the same format and semantics as the
346	   hostport part of a URL.  The most important thing about the NMAH is
347	   that it is "identity inert" from the point of view of object
348	   identification.  In other words, ARKs that differ only in the
349	   optional NMAH part identify the same object.  Thus, for example, the
350	   following three ARKs are synonyms for just one information object:

352	                      http://loc.gov/ark:/12025/654xz321
353	                  http://rutgers.edu/ark:/12025/654xz321
354	                                     ark:/12025/654xz321

356	   Strictly speaking, in the realm of digital objects, these ARKs may
357	   lead over time to somewhat different or diverging instances of the
358	   originally named object.  In an ideal world, divergence of persistent
359	   objects is not desirable, but it is widely believed that digital
360	   preservation efforts will inevitably lead to alterations in some
361	   original objects (e.g., a format migration in order to preserve the
362	   ability to display a document).  If any of those objects are held
363	   redundantly in more than one organization (a common preservation
364	   strategy), chances are small that all holding organizations will
365	   perform the same precise transformations and all maintain the same
366	   object metadata.  More significant divergence would be expected when
367	   the holding organizations serve different audiences or compete with
368	   each other.
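
The "identity inert" property of the NMAH can be sketched as a
comparison that discards everything before the "ark:" label.  The
helper names core_ark and same_object are assumptions for illustration.
The sketch compares the whole remainder of the string, so it makes no
attempt to treat a Qualifier specially.

```python
def core_ark(ark: str) -> str:
    """Drop any NMAH prefix, keeping the string from 'ark:/' onward."""
    index = ark.find("ark:/")
    if index < 0:
        raise ValueError(f"no 'ark:/' label in {ark!r}")
    return ark[index:]


def same_object(first: str, second: str) -> bool:
    """True when two ARKs differ at most in their NMAH part."""
    return core_ark(first) == core_ark(second)


synonyms = [
    "http://loc.gov/ark:/12025/654xz321",
    "http://rutgers.edu/ark:/12025/654xz321",
    "ark:/12025/654xz321",
]
all_same = all(same_object(synonyms[0], other) for other in synonyms)
```

Applied to the three synonyms listed earlier in this section, every
pair compares equal, while an ARK with a different Name does not.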

370	   The NMAH part makes an ARK into an actionable URL.  As with many
371	   internet parameters, it is helpful to approach the NMAH being liberal
372	   in what you accept and conservative in what you propose.  From the
373	   recipient's point of view, the NMAH part should be treated as
374	   temporary, disposable, and replaceable.  From the NMA's point of
375	   view, it should be chosen with the greatest concern for longevity.  A
376	   carefully chosen NMAH should be at least as permanent as the
377	   providing organization's own hostname.  In the case of a national or
378	   university library, for example, there is no reason why the NMAH
379	   should not be considerably more permanent than soft-funded proxy
380	   hostnames such as hdl.handle.net, dx.doi.org, and purl.org.  In
381	   general and over time, however, it is not unexpected for an NMAH
382	   eventually to stop working and require replacement with the NMAH of a
383	   currently active service provider.

385	   This replacement relies on a mapping authority "resolver" discovery
386	   process, of which two alternate methods are outlined in a later
387	   section.  The ARK, URN, Handle, and DOI schemes all use a resolver
388	   discovery model that sooner or later requires matching the original
389	   assigning authority with a current provider servicing that
390	   authority's named objects; once found, the resolver at that provider
391	   performs what amounts to a redirect to a place where the object is
392	   currently held.  All the schemes rely on the ongoing functionality of
393	   currently mainstream technologies such as the Domain Name System
394	   [DNS] and web browsers.  The Handle and DOI schemes in addition
395	   require that the Handle protocol layer and global server grid be
396	   available at all times.

398	   The practice of prepending "http://" and an NMAH to an ARK is a way
399	   of creating an actionable identifier by a method that is itself
400	   temporary.  Assuming that infrastructure supporting [HTTP]
401	   information retrieval will no longer be available one day, ARKs will
402	   then have to be converted into new kinds of actionable identifiers.
403	   By that time, if ARKs see widespread use, web browsers would
404	   presumably evolve to perform this (currently simple) transformation
405	   automatically.
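
The transformation referred to above, replacing a failed NMAH with that
of a currently active provider, might look like the following sketch.
The function name with_nmah is hypothetical, and the sketch assumes the
new hostport has already been found by some resolver discovery process
that is not shown here.

```python
def with_nmah(ark: str, new_nmah: str) -> str:
    """Return an actionable URL for `ark` served by `new_nmah`.

    Works for bare ARKs ("ark:/...") and for ARKs that already carry
    an NMAH, which is treated as disposable and replaced.
    """
    index = ark.find("ark:/")
    if index < 0:
        raise ValueError(f"no 'ark:/' label in {ark!r}")
    # Keep the immutable part; prepend the (temporary) service prefix.
    return "http://" + new_nmah + "/" + ark[index:]


moved = with_nmah("http://loc.gov/ark:/12025/654xz321", "rutgers.edu")
```

Because only the prefix changes, the NAAN and Name that identify the
object are carried over untouched.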

407	2.2.  The ARK Label Part - ark:

409	   The label part distinguishes an ARK from an ordinary identifier.  In
410	   a URL found in the wild, the string, "ark:/", indicates that the URL
411	   stands a reasonable chance of being an ARK.  If the context warrants,
412	   verification that it actually is an ARK can be done by testing it for
413	   existence of the three ARK services.

415	   Since nothing about an identifier syntax directly affects
416	   persistence, the "ark:" label (like "urn:", "doi:", and "hdl:")
417	   cannot tell you whether the identifier is persistent or whether the
418	   object is available.  It does tell you that the original Name
419	   Assigning Authority (NAA) had some sort of hopes for it, but it
420	   doesn't tell you whether that NAA is still in existence, or whether a
421	   decade ago it ceased to have any responsibility for providing
422	   persistence, or whether it ever had any responsibility beyond naming.

424	   Only a current provider can say for certain what sort of commitment
425	   it intends, and the ARK label suggests that you can query the NMAH
426	   directly to find out exactly what kind of persistence is promised.
427	   Even if what is promised is impersistence (i.e., a short-term
428	   identifier), saying so is valuable information to the recipient.
429	   Thus an ARK is a high-functioning identifier in the sense that it
430	   provides access to the object, the metadata, and a commitment
431	   statement, even if the commitment is explicitly very weak.

433	2.3.  The Name Assigning Authority Number (NAAN)

435	   Recalling that the general form of the ARK is,

437	                    [http://NMAH/]ark:/NAAN/Name[Qualifier]

439	   the part of the ARK directly following the "ark:" is the Name
440	   Assigning Authority Number (NAAN) enclosed in `/' (slash) characters.
441	   This part is always required, as it identifies the organization that
442	   originally assigned the Name of the object.  It is used to discover a
443	   currently valid NMAH and to provide top-level partitioning of the
444	   space of all ARKs.  NAANs are registered in a manner similar to URN
445	   Namespaces, but they are pure numbers consisting of 5 digits or 9
446	   digits.  Thus, the first 100,000 registered NAAs fit compactly into
447	   the 5 digits, and if growth warrants, the next billion fit into the 9
448	   digit form.  In either case the fixed odd number of digits helps
449	   reduce the chances of finding a NAAN out of context and confusing it
450	   with nearby quantities such as 4-digit dates.

452	   The NAAN designates a top-level ARK namespace.  Once registered for a
453	   namespace, a NAAN is never re-registered.  It is possible, however,
454	   for there to be a succession of organizations that manage an ARK
455	   namespace, with one organization succeeding another.
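
   As an illustration only, the general form above can be checked with
   a short Python sketch.  The names ARK_FORM and split_ark are
   hypothetical, and the sketch assumes a lower-case "http://" prefix
   and "ark:" label; it does not separate the Name from the Qualifier.

```python
import re

# General form: [http://NMAH/]ark:/NAAN/Name[Qualifier]
# The NAAN is restricted to exactly 5 or exactly 9 digits.
ARK_FORM = re.compile(
    r"(?:http://(?P<nmah>[^/]+)/)?"        # optional NMAH (hostport)
    r"ark:/(?P<naan>[0-9]{5}|[0-9]{9})/"   # label and NAAN between slashes
    r"(?P<rest>.+)"                        # Name plus optional Qualifier
)

def split_ark(ark):
    """Return (nmah, naan, name_and_qualifier), or None if the form is wrong."""
    m = ARK_FORM.fullmatch(ark)
    if m is None:
        return None
    return m.group("nmah"), m.group("naan"), m.group("rest")
```

   A 4-digit quantity such as "1958" in the NAAN position is rejected
   by this sketch, which is the kind of confusion the fixed odd digit
   counts are meant to reduce.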

457	2.4.  The Name Part

459	   The part of the ARK just after the NAAN is the Name assigned by the
460	   NAA, and it is also required.  Semantic opaqueness in the Name part
461	   is strongly encouraged in order to reduce an ARK's vulnerability to
462	   era- and language-specific change.  Identifier strings containing
463	   linguistic fragments can create support difficulties down the road.
464	   No matter how appropriate or even meaningless they are today, such
465	   fragments may one day create confusion, give offense, or infringe on
466	   a trademark as the semantic environment around us and our communities
467	   evolves.

469	   Names that look more or less like numbers avoid common problems that
470	   defeat persistence and international acceptance.  The use of digits
471	   is highly recommended.  Mixing in non-vowel alphabetic characters a
472	   couple at a time is a relatively safe and easy way to achieve a
473	   denser namespace (more possible names for a given length of the name
474	   string).  Such names have a chance of aging and traveling well.
475	   Tools exist that mint, bind, and resolve opaque identifiers, with or
476	   without check characters [NOID].  More on naming considerations is
477	   given in a subsequent section.

479	2.5.  The Qualifier Part

481	   The part of the ARK following the NAA-assigned Name is an optional
482	   Qualifier.  It is a string that extends the base ARK in order to
483	   create a kind of service entry point into the object named by the
484	   NAA.  At the discretion of the providing NMA, such a service entry
485	   point permits an ARK to support access to individual hierarchical
486	   components and subcomponents of an object, and to variants (versions,
487	   languages, formats) of components.  A Qualifier may be invented by
488	   the NAA or by any NMA servicing the object.

490	   In form, the Qualifier is a ComponentPath, or a VariantPath, or a
491	   ComponentPath followed by a VariantPath.  A VariantPath is introduced
492	   and subdivided by the reserved character `.', and a ComponentPath is
493	   introduced and subdivided by the reserved character `/'.  In this
494	   example,

496	         http://foobar.zaf.org/ark:/12025/654xz321/s3/f8.05v.tiff

498	   the string "/s3/f8" is a ComponentPath and the string ".05v.tiff" is
499	   a VariantPath.  The ARK Qualifier is a formalization of some
500	   currently mainstream URL syntax conventions.  This formalization
501	   specifically reserves meanings that permit recipients to make strong
502	   inferences about logical sub-object containment and equivalence based
503	   only on the form of the received identifiers; there is great
504	   efficiency in not having to inspect metadata records to discover such
505	   relationships.  NMAs are free not to disclose any of these
506	   relationships merely by avoiding the reserved characters above.
507	   Hierarchical components and variants are discussed further in the
508	   next two sections.
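
   The split into Name, ComponentPath, and VariantPath can be sketched
   in Python as follows.  This is a minimal illustration, not a
   normative parser: the function name split_qualifier is hypothetical,
   and the input is assumed to be the already normalized string that
   follows "ark:/NAAN/".

```python
import re

# Name, then zero or more "/component" pieces, then zero or more ".variant"
# pieces.  A slash appearing after a period does not fit this form.
QUALIFIED = re.compile(r"([^/.]+)((?:/[^/.]+)*)((?:\.[^/.]+)*)")

def split_qualifier(name_and_qualifier):
    """Return (name, component_path, variant_path) for 'Name[Qualifier]'."""
    m = QUALIFIED.fullmatch(name_and_qualifier)
    if m is None:
        raise ValueError("not in Name[ComponentPath][VariantPath] form")
    return m.group(1), m.group(2), m.group(3)
```

   Applied to the example above, the string "654xz321/s3/f8.05v.tiff"
   yields the Name "654xz321", the ComponentPath "/s3/f8", and the
   VariantPath ".05v.tiff".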

510	   The Qualifier, if present, differs from the Name in several important
511	   respects.  First, a Qualifier may have been assigned either by the
512	   NAA or later by the NMA.  The assignment of a Qualifier by an NMA
513	   effectively amounts to an act of publishing a service entry point
514	   within the conceptual object originally named by the NAA.  For our
515	   purposes, an ARK extended with a Qualifier assigned by an NMA will be
516	   called an NMA-qualified ARK.

518	   Second, a Qualifier assignment on the part of an NMA is made in
519	   fulfillment of its service obligations and may reflect changing
520	   service expectations and technology requirements.  NMA-qualified ARKs
521	   could therefore be transient, even if the base, unqualified ARK is
522	   persistent.  For example, it would be reasonable for an NMA to
523	   support access to an image object through an actionable ARK that is
524	   considered persistent even if the experience of that access changes
525	   as linking, labeling, and presentation conventions evolve and as
526	   format and security standards are updated.  For an image "thumbnail",
527	   that NMA could also support an NMA-qualified ARK that is considered
528	   impersistent because the thumbnail will be replaced with higher
529	   resolution images as network bandwidth and CPU speeds increase.  At
530	   the same time, for an originally scanned, high-resolution master, the
531	   NMA could publish an NMA-qualified ARK that is itself considered
532	   persistent.  Of course, the NMA must be able to return its separate
533	   commitments to unqualified, NAA-assigned ARKs, to NMA-qualified ARKs,
534	   and to any NAA-qualified ARKs that it supports.

536	   A third difference between a Qualifier and a Name concerns the
537	   semantic opaqueness constraint.  When an NMA-qualified ARK is to be
538	   used as a transient service entry point into a persistent object, the
539	   priority given to semantic opaqueness observed by the NAA in the Name
540	   part may be relaxed by the NMA in the Qualifier part.  If service
541	   priorities in the Qualifier take precedence over persistence, short-
542	   term usability considerations may recommend somewhat semantically
543	   laden Qualifier strings.

545	   Finally, not only is the set of Qualifiers supported by an NMA
546	   mutable, but different NMAs may support different Qualifier sets for
547	   the same NAA-identified object.  In this regard the NMAs act
548	   independently of each other and of the NAA.

550	   The next two sections describe how ARK syntax may be used to declare,
551	   or to avoid declaring, certain kinds of relatedness among qualified
552	   ARKs.

554	2.5.1.  ARKs that Reveal Object Hierarchy

556	   An NAA or NMA may choose to reveal the presence of a hierarchical
557	   relationship between objects using the `/' (slash) character after
558	   the Name part of an ARK.  Some authorities will choose not to
559	   disclose this information, while others will go ahead and disclose so
560	   that manipulators of large sets of ARKs can infer object
561	   relationships by simple identifier inspection; for example, this
562	   makes it possible for a system to present a collapsed view of a large
563	   search result set.

565	   If the ARK contains an internal slash after the NAAN, the piece to
566	   its left indicates a containing object.  For example, publishing an
567	   ARK of the form,

569	                         ark:/12025/654/xz/321

571	   is equivalent to publishing three ARKs,

573	                         ark:/12025/654/xz/321
574	                         ark:/12025/654/xz
575	                         ark:/12025/654

577	   together with a declaration that the first object is contained in the
578	   second object, and that the second object is contained in the third.
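
   The inference a recipient may draw from internal slashes can be
   sketched in Python.  The function name containing_arks is
   hypothetical, and the sketch assumes a hostless, normalized ARK
   with no variant suffixes.

```python
def containing_arks(ark):
    """Return the ARK followed by the containing ARKs its slashes imply."""
    prefix = "ark:/"
    if not ark.startswith(prefix):
        raise ValueError("expected a hostless ARK beginning with 'ark:/'")
    # The slashes around the NAAN are not part of the Name, so the NAAN
    # is split off first and never treated as a containing object.
    naan, _, name_part = ark[len(prefix):].partition("/")
    pieces = name_part.split("/")
    return [
        prefix + naan + "/" + "/".join(pieces[:n])
        for n in range(len(pieces), 0, -1)
    ]
```

   For "ark:/12025/654/xz/321" this returns the three ARKs listed
   above, innermost object first.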

580	   Revealing the presence of hierarchy is completely up to the assigner
581	   (NMA or NAA).  It is hard enough to commit to one object's name, let
582	   alone to three objects' names and to a specific, ongoing relatedness
583	   among them.  Thus, regardless of whether hierarchy was present
584	   initially, the assigner, by not using slashes, reveals no shared
585	   inferences about hierarchical or other inter-relatedness in the
586	   following ARKs:

588	                         ark:/12025/654_xz_321
589	                         ark:/12025/654_xz
590	                         ark:/12025/654xz321
591	                         ark:/12025/654xz
592	                         ark:/12025/654

594	   Note that slashes around the ARK's NAAN (/12025/ in these examples)
595	   are not part of the ARK's Name and therefore do not indicate the
596	   existence of some sort of NAAN super object containing all objects in
597	   its namespace.  A slash must have at least one non-structural
598	   character (one that is neither a slash nor a period) on both sides in
599	   order for it to separate recognizable structural components.  So
600	   initial or final slashes may be removed, and double slashes may be
601	   converted into single slashes.

603	2.5.2.  ARKs that Reveal Object Variants

605	   An NAA or NMA may choose to reveal the possible presence of variant
606	   objects or object components using the `.' (period) character after
607	   the Name part of an ARK.  Some authorities will choose not to
608	   disclose this information, while others will go ahead and disclose so
609	   that manipulators of large sets of ARKs can infer object
610	   relationships by simple identifier inspection; for example, this
611	   makes it possible for a system to present a collapsed view of a large
612	   search result set.

614	   If the ARK contains an internal period after the Name, the piece to
615	   its left is a base name and the piece to its right, up to the end of
616	   the ARK or to the next period, is a suffix.  A Name may have more than
617	   one suffix, for example,
618	                         ark:/12025/654.24
619	                         ark:/12025/xz4/654.24
620	                         ark:/12025/654.20v.78g.f55

622	   There are two main rules.  First, if two ARKs share the same base
623	   name but have different suffixes, the corresponding objects were
624	   considered variants of each other (different formats, languages,
625	   versions, etc.) by the assigner (NMA or NAA).  Thus, the following
626	   ARKs are variants of each other:

628	                         ark:/12025/654.20v.78g.f55
629	                         ark:/12025/654.321xz
630	                         ark:/12025/654.44

632	   Second, publishing an ARK with a suffix implies the existence of at
633	   least one variant identified by the ARK without its suffix.  The ARK
634	   otherwise permits no further assumptions about what variants might
635	   exist.  So publishing the ARK,

637	                         ark:/12025/654.20v.78g.f55

639	   is equivalent to publishing the four ARKs,

641	                         ark:/12025/654.20v.78g.f55
642	                         ark:/12025/654.20v.78g
643	                         ark:/12025/654.20v
644	                         ark:/12025/654
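
   Both rules can be illustrated with a small Python sketch.  The
   function names are hypothetical, and the input is assumed to be a
   normalized ARK whose suffixes all follow the final slash.

```python
def implied_variants(ark):
    """Return the ARK followed by the ARKs implied by dropping suffixes."""
    head, slash, last = ark.rpartition("/")
    base, *suffixes = last.split(".")
    return [
        head + slash + ".".join([base] + suffixes[:n])
        for n in range(len(suffixes), -1, -1)
    ]

def are_variants(ark_a, ark_b):
    """True if two distinct ARKs share the same base name."""
    return (ark_a != ark_b
            and implied_variants(ark_a)[-1] == implied_variants(ark_b)[-1])
```

   For "ark:/12025/654.20v.78g.f55" the first function returns the
   four ARKs listed above, ending with the base name.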

646	   Revealing the possibility of variants is completely up to the
647	   assigner.  It is hard enough to commit to one object's name, let
648	   alone to multiple variants' names and to a specific, ongoing
649	   relatedness among them.  The assigner is the sole arbiter of what
650	   constitutes a variant within its namespace, and whether to reveal
651	   that kind of relatedness by using periods within its names.

653	   A period must have at least one non-structural character (one that is
654	   neither a slash nor a period) on both sides in order for it to
655	   separate recognizable structural components.  So initial or final
656	   periods may be removed, and double periods may be converted into
657	   single periods.  Multiple suffixes should be arranged in sorted order
658	   (pure ASCII collating sequence) at the end of an ARK.

660	2.6.  Character Repertoires

662	   The Name and Qualifier parts are strings of visible ASCII characters
663	   and should be less than 128 bytes in length.  The length restriction
664	   keeps the ARK short enough to append ordinary ARK request strings
665	   without running into transport restrictions (e.g., within HTTP GET
666	   requests).  Characters may be letters, digits, or any of these seven
667	   characters:

669	         =   #   *   +   @   _   $

671	   The following characters may also be used, but their meanings are
672	   reserved:

674	         %   -   .   /

676	   The characters `/' and `.' are ignored if either appears as the last
677	   character of an ARK.  If used internally, they allow a name assigner
678	   to reveal object hierarchy and object variants as previously
679	   described.

681	   Hyphens are considered to be insignificant and are always ignored in
682	   ARKs.  A `-' (hyphen) may appear in an ARK for readability, or it may
683	   have crept in during the formatting and wrapping of text, but it must
684	   be ignored in lexical comparisons.  As in a telephone number, hyphens
685	   have no meaning in an ARK.  It is always safe for an NMA that
686	   receives an ARK to remove any hyphens found in it.  As a result, like
687	   the NMAH, hyphens are "identity inert" in comparing ARKs for
688	   equivalence.  For example, the following ARKs are equivalent for
689	   purposes of comparison and ARK service access:

691	                                 ark:/12025/65-4-xz-321
692	         http://sneezy.dopey.com/ark:/12025/654--xz32-1
693	                                 ark:/12025/654xz321

695	   The `%' character is reserved for %-encoding all other octets that
696	   would appear in the ARK string, in the same manner as for URIs [URI].
697	   A %-encoded octet consists of a `%' followed by two hex digits; for
698	   example, "%7d" stands in for `}'.  Lower case hex digits are
699	   preferred to reduce the chances of false acronym recognition; thus it
700	   is better to use "%acT" instead of "%ACT".  The character `%' itself
701	   must be represented using "%25".  As with URNs, %-encoding permits
702	   ARKs to support legacy namespaces (e.g., ISBN, ISSN, SICI) that have
703	   less restricted character repertoires [URNBIB].
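
   A repertoire check along these lines might look as follows in
   Python.  It is a sketch under stated assumptions: the names are
   hypothetical, the length limit is applied to the combined Name and
   Qualifier string, and the "should" on length is treated as a hard
   limit for simplicity.

```python
import re

# Letters, digits, the unreserved punctuation listed above, the reserved
# characters - . /, or a '%' followed by exactly two hex digits.
REPERTOIRE = re.compile(r"(?:[A-Za-z0-9=#*+@_$\-./]|%[0-9A-Fa-f]{2})+")

def in_repertoire(name_and_qualifier):
    """True if the string is short enough and uses only permitted characters."""
    return (len(name_and_qualifier) < 128
            and REPERTOIRE.fullmatch(name_and_qualifier) is not None)
```

   Under this sketch "65%7d4" is accepted, while "65}4" (an unencoded
   brace) and "65%4" (an incomplete %-encoding) are rejected.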

705	2.7.  Normalization and Lexical Equivalence

707	   To determine if two or more ARKs identify the same object, the ARKs
708	   are compared for lexical equivalence after first being normalized.
709	   Since ARK strings may appear in various forms (e.g., having different
710	   NMAHs), normalizing them minimizes the chances that comparing two ARK
711	   strings for equality will fail unless they actually identify
712	   different objects.  In a specified-host ARK (one having an NMAH), the
713	   NMAH never participates in such comparisons.

715	   Normalization of an ARK for the purpose of octet-by-octet equality
716	   comparison with another ARK consists of four steps.  First, any upper
717	   case letters in the "ark:" label and the two characters following a
718	   `%' are converted to lower case.  The case of all other letters in
719	   the ARK string must be preserved.  Second, any NMAH part is removed
720	   (everything from an initial "http://" up to the next slash) and all
721	   hyphens are removed.

723	   Third, structural characters (slash and period) are normalized.
724	   Initial and final occurrences are removed, and two structural
725	   characters in a row (e.g., // or ./) are replaced by the first
726	   character, iterating until each occurrence has at least one non-
727	   structural character on both sides.  In addition, if there are any
728	   components with a period on the left and a slash on the right, either
729	   the component and the preceding period must be moved to the end of
730	   the Name part or the ARK must be thrown out as malformed.

732	   The fourth and final step is to arrange the suffixes in ASCII
733	   collating sequence (that is, to sort them) and to remove duplicate
734	   suffixes, if any.  It is also permissible to throw out ARKs for which
735	   the suffixes are not sorted.

737	   The resulting ARK string is now normalized.  Comparisons between
738	   normalized ARKs are case-sensitive, meaning that upper case letters
739	   are considered different from their lower case counterparts.

741	   To keep ARK string variation to a minimum, no reserved ARK characters
742	   should be %-encoded unless it is deliberately to conceal their
743	   reserved meanings.  No non-reserved ARK characters should ever be
744	   %-encoded.  Finally, no %-encoded character should ever appear in an
745	   ARK in its decoded form.
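
   The normalization steps can be sketched in Python as follows.  This
   is an illustration, not a reference implementation: the function
   name normalize_ark is hypothetical, the steps are applied in an
   order convenient for the code, only a lower-case "http://" prefix
   is recognized, and an ARK with a period-delimited component to the
   left of a slash is thrown out rather than repaired.

```python
import re

def normalize_ark(ark):
    """Normalize an ARK string for octet-by-octet comparison (a sketch)."""
    # Step 2a: remove any NMAH, i.e. an initial "http://" up to the next slash.
    ark = re.sub(r"^http://[^/]*/", "", ark)
    # Step 1a: lower-case the "ark:" label; other letters keep their case.
    if ark[:4].lower() != "ark:":
        raise ValueError("missing ark: label")
    body = ark[4:]
    # Step 1b: lower-case the two characters following each '%'.
    body = re.sub(r"%(..)", lambda m: "%" + m.group(1).lower(), body)
    # Step 2b: hyphens are identity inert, so remove them all.
    body = body.replace("-", "")
    # The NAAN is the first slash-delimited piece; the rest is Name[Qualifier].
    naan, _, rest = body.lstrip("/").partition("/")
    # Step 3a: collapse each run of structural characters to its first
    # character, then strip initial and final occurrences.
    rest = re.sub(r"([/.])[/.]+", r"\1", rest).strip("/.")
    if not naan or not rest:
        raise ValueError("missing NAAN or Name")
    # Step 3b: a period-delimited component followed by a slash is
    # treated here as malformed.
    base, *suffixes = rest.split(".")
    if any("/" in s for s in suffixes):
        raise ValueError("period-delimited component precedes a slash")
    # Step 4: sort suffixes in ASCII order and drop duplicates.
    suffixes = sorted(set(suffixes))
    return "ark:/" + naan + "/" + ".".join([base] + suffixes)
```

   With this sketch the three equivalent ARKs shown in the previous
   section all normalize to "ark:/12025/654xz321", and two ARKs can
   then be compared with ordinary case-sensitive string equality.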

747	3.  Naming Considerations

749	   The most important threats faced by persistence providers include
750	   such things as funding loss, natural disaster, political and social
751	   upheaval, processing faults, and errors in human oversight.  There is
752	   nothing that an identifier scheme can do about such things.  Still, a
753	   few observed identifier failures and inconveniences can be traced
754	   back to naming practices that we now know to be less than optimal for
755	   persistence.

757	3.1.  ARKs Embedded in Language

759	   The ARK has different goals from the URI, so it has different
760	   character set requirements.  Because linguistic constructs imperil
761	   persistence, for ARKs non-ASCII character support is unimportant.
762	   ARKs and URIs share goals of transcribability and transportability
763	   within web documents, so characters are required to be visible, non-
764	   conflicting with HTML/XML syntax, and not subject to tampering during
765	   transmission across common transport gateways.  Add the goal of
766	   making an undelimited ARK recognizable in running prose, as in
767	   ark:/12025/=@_22*$, and certain punctuation characters (e.g., comma,
768	   period) end up being excluded from the ARK lest the end of a phrase
769	   or sentence be mistaken for part of the ARK.

771	   This consideration has more direct effect on ARK usability in a
772	   natural language context than it has on ARK persistence.  The same is
773	   true of the rule preventing hyphens from having lexical significance.
774	   It is fine to publish ARKs with hyphens in them (e.g., the
775	   output of UUID/GUID generators), but the uniform treatment of hyphens
776	   as insignificant reduces the possibility of users transcribing
777	   identifiers that will have been broken through unpredictable
778	   hyphenation by word processors.  Any measure that reduces user
779	   irritation with an identifier will increase its chances of survival.

781	3.2.  Objects Should Wear Their Identifiers

783	   A valuable technique for provision of persistent objects is to try to
784	   arrange for the complete identifier to appear on, with, or near its
785	   retrieved object.  An object encountered at a moment in time when its
786	   discovery context has long since disappeared could then easily be
787	   traced back to its metadata, to alternate versions, to updates, etc.
788	   This has seen reasonable success, for example, in book publishing and
789	   software distribution.  An identifier string only has meaning when
790	   its association is known, and this is a very sure, simple, and low-tech
791	   method of reminding everyone exactly what that association is.

793	3.3.  Names are Political, not Technological

795	   If persistence is the goal, a deliberate local strategy for
796	   systematic name assignment is crucial.  Names must be chosen with
797	   great care.  Poorly chosen and managed names will devastate any
798	   persistence strategy, and they do not discriminate by identifier
799	   scheme.  Whether a mistakenly re-assigned name is a URN, DOI, PURL,
800	   URL, or ARK, the damage - failed access and confusion - is not
801	   mitigated more in one scheme than in another.  Conversely, in-house
802	   efforts to manage names responsibly will go much further towards
803	   safeguarding persistence than any choice of naming scheme or name
804	   resolution technology.

806	   Branding (e.g., at the corporate or departmental level) is important
807	   for funding and visibility, but substrings representing brands and
808	   organizational names should be given a wide berth except when
809	   absolutely necessary in the hostname (the identity-inert) part of the
810	   ARK.  These substrings are not only unstable because organizations
811	   change frequently, but they are also dangerous because successor
812	   organizations often have political or legal reasons to actively
813	   suppress predecessor names and brands.  Any measure that reduces the
814	   chances of future political or legal pressure on an identifier will
815	   decrease the chances that our descendants will be obliged to
816	   deliberately break it.

818	3.4.  Choosing a Hostname or NMA

820	   Hostnames appearing in any identifier meant to be persistent must be
821	   chosen with extra care.  The tendency in hostname selection has
822	   traditionally been to choose a token with recognizable attributes,
823	   such as a corporate brand, but that tendency wreaks havoc with
824	   persistence that is supposed to outlive brands, corporations, subject
825	   classifications, and natural language semantics (e.g., what did the
826	   three letters "gay" mean in 1958, 1978, and 1998?).  Today's
827	   recognized and correct attributes are tomorrow's stale or incorrect
828	   attributes.  In making hostnames (any names, actually) long-term
829	   persistent, it helps to eliminate recognizable attributes to the
830	   extent possible.  This affects selection of any name based on URLs,
831	   including PURLs and the explicitly disposable NMAHs.

833	   There is no excuse for a provider that manages its internal names
834	   impeccably not to exercise the same care in choosing what could be an
835	   exceptionally durable hostname, especially if it would form the
836	   prefix for all the provider's URL-based external names.  Registering
837	   an opaque hostname in the ".org" or ".net" domain would not be a bad
838	   start.  Another way is to publish your ARKs with an organizational
839	   domain name that will be mapped by DNS to an appropriate NMA host.
840	   This makes for shorter names with less branding vulnerability.

842	   It is a mistake to think that hostnames are inherently unstable.  If
843	   you require brand visibility, that may be a fact of life.  But things
844	   are easier if yours is the brand of a long-lived cultural memory
845	   institution such as a national or university library or archive.
846	   Well-chosen hostnames from organizations that are sheltered from the
847	   direct effects of a volatile marketplace can easily provide longer-
848	   lived global resolvers than the domain names explicitly or implicitly
849	   used as starting points for global resolution by indirection-based
850	   persistent identifier schemes.  For example, it is hard to imagine
851	   circumstances under which the Library of Congress' domain name would
852	   disappear sooner than, say, "handle.net".

854	   For smaller libraries, archives, and preservation organizations,
855	   there is a natural concern about whether they will be able to keep
856	   their web servers and domain names in the face of uncertain funding.
857	   One option is to form or join a consortium of like-minded
858	   organizations with the purpose of providing mutual preservation
859	   support.  The first goal of such a consortium would be to perpetually
860	   rent a hostname on which to establish a web server that simply
861	   redirects incoming member organization requests to the appropriate
862	   member server; using ARKs, for example, a 150-member consortium could
863	   run a very small server (24x7) that contained nothing more than 150
864	   rewrite rules in its configuration file.  Even more helpful would be
865	   additional consortial support for a member organization that was
866	   unable to continue providing services and needed to find a successor
867	   archival organization.  This would be a low-cost, low-tech way to
868	   publish ARKs (or URLs) under highly persistent hostnames.
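
   The redirecting server described above amounts to a lookup table
   keyed by NAAN.  The following Python sketch illustrates the idea;
   the member hostnames and the function name are hypothetical, and a
   real deployment would more likely express the same table as rewrite
   rules in a web server configuration file.

```python
import re

# Hypothetical member table: NAAN -> hostport of the member's own server.
MEMBERS = {
    "12025": "ark.member-a.example.org",
    "28722": "ark.member-b.example.org",
}

def redirect_target(request_path):
    """Return the member URL for an incoming ARK request path, or None."""
    m = re.match(r"/?ark:/([0-9]+)/", request_path)
    if m is None or m.group(1) not in MEMBERS:
        return None
    return "http://" + MEMBERS[m.group(1)] + "/" + request_path.lstrip("/")
```

   Each entry in the table plays the role of one rewrite rule, so a
   150-member consortium would carry 150 entries.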

870	   There are no obvious reasons why the organizations registering DNS
871	   names, URN Namespaces, and DOI publisher IDs should have among them
872	   one that is intrinsically more fallible than the next.  Moreover, it
873	   is a misconception that the demise of DNS and of HTTP need adversely
874	   affect the persistence of URLs.  At such a time, certainly URLs from
875	   the present day might not then be actionable by our present-day
876	   mechanisms, but resolution systems for future non-actionable URLs are
877	   no harder to imagine than resolution systems for present-day non-
878	   actionable URNs and DOIs.  There is no more stable a namespace than
879	   one that is dead and frozen, and that would then characterize the
880	   space of names bearing the "http://" prefix.  It is useful to
881	   remember that just because hostnames have been carelessly chosen in
882	   their brief history does not mean that they are unsuitable in NMAHs
883	   (and URLs) intended for use in situations demanding the highest level
884	   of persistence available in the Internet environment.  A well-planned
885	   name assignment strategy is everything.

887	3.5.  Assigners of ARKs

889	   A Name Assigning Authority (NAA) is an organization that creates (or
890	   delegates creation of) long-term associations between identifiers and
891	   information objects.  Examples of NAAs include national libraries,
892	   national archives, and publishers.  An NAA may arrange with an
893	   external organization for identifier assignment.  The US Library of
894	   Congress, for example, allows OCLC (the Online Computer Library
895	   Center, a major world cataloger of books) to create associations
896	   between Library of Congress Control Numbers (LCCNs) and the books that
897	   OCLC processes.  A cataloging record is generated that testifies to
898	   each association, and the identifier is included by the publisher,
899	   for example, in the front matter of a book.

901	   An NAA does not so much create an identifier as create an
902	   association.  The NAA first draws an unused identifier string from
903	   its namespace, which is the set of all identifiers under its control.
904	   It then records the assignment of the identifier to an information
905	   object having sundry witnessed characteristics, such as a particular
906	   author and modification date.  A namespace is usually reserved for an
907	   NAA by agreement with recognized community organizations (such as
908	   IANA and ISO) that all names containing a particular string be under
909	   its control.  In the ARK an NAA is represented by the Name Assigning
910	   Authority Number (NAAN).

912	   The ARK namespace reserved for an NAA is the set of names bearing its
913	   particular NAAN.  For example, all strings beginning with
914	   "ark:/12025/" are under control of the NAA registered under 12025,
915	   which might be the National Library of Finland.  Because each NAA has
916	   a different NAAN, names from one namespace cannot conflict with those
917	   from another.  Each NAA is free to assign names from its namespace
918	   (or delegate assignment) according to its own policies.  These
919	   policies must be documented in a manner similar to the declarations
920	   required for URN Namespace registration [URNNID].

922	   To register for a NAAN, please read about the mapping authority
923	   discovery file in the next section and send email to ark@cdlib.org.

925	3.6.  NAAN Namespace Management

927	   Every NAA must have a namespace management strategy.  A time-honored
928	   technique is to hierarchically partition a namespace into
929	   subnamespaces using prefixes that guarantee non-collision of names in
930	   different partitions.  This practice is strongly encouraged for all
931	   NAAs, especially when subnamespace management will be delegated to
932	   other departments, units, or projects within an organization.  For
933	   example, with a NAAN that is assigned to a university and managed by
934	   its main library, care should be taken to reserve semantically opaque
935	   prefixes that will set aside large parts of the unused namespace for
936	   future assignments.  Prefix-based partition management is an
937	   important responsibility of the NAA.

939	   This sort of delegation by prefix is well-used in the formation of
940	   DNS names and ISBN identifiers.  An important difference is that in
941	   the former, the hierarchy is deliberately exposed and in the latter
942	   it is hidden.  Rather than using lexical boundary markers such as the
943	   period (`.') found in domain names, the ISBN uses a publisher prefix
944	   but doesn't disclose where the prefix ends and the publisher's
945	   assigned name begins.  This practice of non-disclosure, borrowed from
946	   the ISBN and ISSN schemes, is encouraged in assigning ARKs, because
947	   it reduces the visibility of an assertion that is probably not
948	   important now and may become a vulnerability later.

950	   Reasonable prefixes for assigned names usually consist of consonants
951	   and digits and are 1-5 characters in length.  For example, the
952	   constant prefix "x9t" might be delegated to a book digitization
953	   project that creates identifiers such as

955	             http://444.berkeley.edu/ark:/28722/x9t38rk45c

957	   If longevity is the goal, it is important to keep the prefixes free
958	   of recognizable semantics; for example, using an acronym representing
959	   a project or a department is discouraged.  At the same time, you may
960	   wish to set aside a subnamespace for testing purposes under a prefix
961	   such as "fk..." that can serve as a visual clue and reminder to
962	   maintenance staff that this "fake" identifier was never published.

964	   There are other measures one can take to avoid user confusion,
965	   transcription errors, and the appearance of accidental semantics when
966	   creating identifiers.  If you are generating identifiers
967	   automatically, pure numeric identifiers are likely to be
968	   semantically opaque enough, but it's probably useful to avoid leading
969	   zeroes because some users mistakenly treat them as optional, thinking
970	   (arithmetically) that they don't contribute to the "value" of the
971	   identifier.

973	   If you need lots of identifiers and you don't want them to get too
974	   long, you can mix digits with consonants (but avoid vowels since they
975	   might accidentally spell words) to get more identifiers without
976	   increasing the string length.  In this case you may not want more
977	   than two letters in a row, since shorter runs reduce the chance of
978	   generating acronyms.  Generator tools such as [NOID] provide support
979	   for these sorts of identifiers, and can also add a computed check
980	   character as a guarantee against the most common transcription
981	   errors.
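
   As a rough illustration of mixing digits with consonants, the
   following Python sketch maps a counter to a name according to a
   template.  The alphabet, the template letters ('d' for a digit, 'e'
   for a digit or consonant), and the function name are assumptions
   made for this example; the sketch is not the algorithm of any
   particular tool and it computes no check character.

```python
DIGITS = "0123456789"
# Digits plus consonants; vowels are left out so that names cannot spell
# words ('l' and 'y' are also omitted in this assumed alphabet).
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"

def mint(n, template="eedeed"):
    """Map counter n to a name; 'd' is a digit, 'e' a digit or consonant."""
    if "eee" in template:
        raise ValueError("template would allow three letters in a row")
    chars = []
    # Mixed-radix conversion, filling the template from right to left.
    for kind in reversed(template):
        alphabet = DIGITS if kind == "d" else XDIGITS
        n, r = divmod(n, len(alphabet))
        chars.append(alphabet[r])
    if n:
        raise ValueError("counter exceeds template capacity")
    return "".join(reversed(chars))
```

   Because every third position in the default template is a pure
   digit, no minted name contains more than two letters in a row, and
   the six-character template yields 29*29*10*29*29*10 = 70,728,100
   distinct names.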

983	3.7.  Sub-Object Naming

985	   As mentioned previously, semantically opaque identifiers are very
986	   useful for long-term naming of abstract objects; however, it may be
987	   appropriate to extend these names with less opaque extensions that
988	   reference contemporary service entry points (sub-objects) in support
989	   of the object.  Sub-object extensions beginning with a digit or
990	   underscore (`_') are reserved for the possibility of developing a
991	   future registry of canonical service points (e.g., numeric references
992	   to versions, formats, languages, etc.).

994	4.  Finding a Name Mapping Authority

996	   In order to derive an actionable identifier (these days, a URL) from
997	   an ARK, a hostport (hostname or hostname plus port combination) for a
998	   working Name Mapping Authority (NMA) must be found.  An NMA is a
999	   service that is able to respond to the three basic ARK service
1000	   requests.  Relying on registration and client-side discovery, NMAs
1001	   make known which NAAs' identifiers they are willing to service.

1003	   Upon encountering an ARK, a user (or client software) looks inside it
1004	   for the optional NMAH part (the hostport of the NMA's ARK service).
1005	   If it contains an NMAH that is working, this NMAH discovery step may
1006	   be skipped; the NMAH effectively uses the beginning of an ARK to
1007	   cache the results of a prior mapping authority discovery process.  If
1008	   a new NMAH needs to be found, the client looks inside the ARK again
1009	   for the NAAN (Name Assigning Authority Number).  Querying a global
1010	   database, it then uses the NAAN to look up all current NMAHs that
1011	   service ARKs issued by the identified NAA.  The global database is
1012	   key, and two specific methods for querying it are given in this
1013	   section.
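
	   The inspection step described above can be sketched in Python as
	   follows.  This is a minimal illustration, not a normative parser:
	   the regular expression, the function names, and the two callables
	   passed to find_nmah (a liveness test and a NAAN lookup) are
	   assumptions introduced for the example.

```python
import re

# Assumed shape: [http://NMAH/]ark:/NAAN/Name, with the NMAH part optional.
ARK_RE = re.compile(
    r"^(?:(?:https?://)?(?P<nmah>[^/]+)/)?"  # optional NMAH (hostport)
    r"ark:/(?P<naan>[^/]+)"                  # NAAN
    r"(?:/(?P<name>.*))?$"                   # Name
)


def split_ark(ark):
    """Return (nmah, naan, name); nmah is None when the ARK carries no NMAH."""
    m = ARK_RE.match(ark)
    if m is None:
        raise ValueError(f"not an ARK: {ark!r}")
    return m.group("nmah"), m.group("naan"), m.group("name")


def find_nmah(ark, is_working, lookup_nmahs):
    """Use the embedded NMAH if it works; otherwise look up NMAHs by NAAN.

    is_working(hostport) -> bool and lookup_nmahs(naan) -> list are
    hypothetical hooks standing in for a liveness probe and the global
    database query.
    """
    nmah, naan, _ = split_ark(ark)
    if nmah and is_working(nmah):
        return nmah                      # discovery step skipped
    candidates = lookup_nmahs(naan)
    return candidates[0] if candidates else None
```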

1015	   In the interests of long-term persistence, however, ARK mechanisms
1016	   are first defined in high-level, protocol-independent terms so that
1017	   mechanisms may evolve and be replaced over time without compromising
1018	   fundamental service objectives.  Either or both specific methods
1019	   given here may eventually be supplanted by better methods since, by
1020	   design, the ARK scheme does not depend on a particular method, but
1021	   only on having some method to locate an active NMAH.

1023	   At the time of issuance, at least one NMAH for an ARK should be
1024	   prepared to service it.  That NMA may or may not be administered by
1025	   the Name Assigning Authority (NAA) that created it.  Consider the
1026	   following hypothetical example of providing long-term access to a
1027	   cancer research journal.  The publisher wishes to turn a profit and
1028	   the National Library of Medicine wishes to preserve the scholarly
1029	   record.  An agreement might be struck whereby the publisher would act
1030	   as the NAA and the national library would archive the journal issue
1031	   when it appears, but without providing direct access for the first
1032	   six months.  During the first six months of peak commercial
1033	   viability, the publisher would retain exclusive delivery rights and
1034	   would charge access fees.  Again, by agreement, both the library and
1035	   the publisher would act as NMAs, but during that initial period the
1036	   library would redirect requests for issues less than six months old
1037	   to the publisher.  At the end of the waiting period, the library
1038	   would then begin servicing requests for issues older than six months
1039	   by tapping directly into its own archives.  Meanwhile, the publisher
1040	   might routinely redirect incoming requests for older issues to the
1041	   library.  Long-term access is thereby preserved, and so is the
1042	   commercial incentive to publish content.

1044	   Although it will be common for an NAA also to run an NMA service, it
1045	   is never a requirement.  Over time NAAs and NMAs will come and go.
1046	   One NMA will succeed another, and there might be many NMAs serving
1047	   the same ARKs simultaneously (e.g., as mirrors or as competitors).
1048	   There might also be asymmetric but coordinated NMAs as in the
1049	   library-publisher example above.

1051	4.1.  Looking Up NMAHs in a Globally Accessible File

1053	   This subsection describes a way to look up NMAHs using a simple name
1054	   authority table represented as a plain text file.  For efficient
1055	   access the file may be stored in a local filesystem, but it needs to
1056	   be reloaded periodically to incorporate updates.  It is not expected
1057	   that the size of the file or frequency of update should impose an
1058	   undue maintenance or searching burden any time soon, for even
1059	   primitive linear search of a file with ten thousand NAAs is a
1060	   subsecond operation on modern server machines.  The proposed file
1061	   strategy is similar to the /etc/hosts file strategy that supported
1062	   Internet host address lookup for a period of years before the advent
1063	   of DNS.

1065	   The name authority table file is updated on an ongoing basis and is
1066	   available for copying over the Internet from the California Digital
1067	   Library at http://www.cdlib.org/inside/diglib/ark/natab and from a
1068	   number of mirror sites.  The file contains comment lines (lines that
1069	   begin with `#') explaining the format and giving the file's
1070	   modification time, reloading address, and NAA registration
1071	   instructions.  Embedded in the file's comments there is even a Perl
1072	   script that processes the file.  As of February 2006, currently
1073	   registered Name Assigning Authorities are:

1075	        12025            National Library of Medicine
1076	        12026            Library of Congress
1077	        12027            National Agricultural Library
1078	        13030            California Digital Library
1079	        13038            World Intellectual Property Organization
1080	        20775            University of California San Diego
1081	        29114            University of California San Francisco
1082	        28722            University of California Berkeley
1083	        21198            University of California Los Angeles
1084	        15230            Rutgers University
1085	        13960            Internet Archive
1086	        64269            Digital Curation Centre
1087	        62624            New York University
1088	        67531            University of North Texas
1089	        27927            Ithaka Electronic-Archiving Initiative
1090	        12148            Bibliothèque nationale de France / National Library of France
1091	        88435            Princeton University
1092	        78428            University of Washington
1093	        89901            Archives of Region of Västra Götaland and City of Gothenburg, Sweden
1094	        80444            Northwest Digital Archives
1095	        25593            Emory University

1097	   A snapshot of the name authority table file appears in an appendix.
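
	   To make the linear-search claim concrete, here is a small Python
	   sketch that scans a table for a NAAN.  It assumes a simplified
	   two-column layout that mirrors the listing above; this is not the
	   actual format of the name authority table file, which is shown in
	   the appendix.  The function and variable names are hypothetical.

```python
# Simplified stand-in for the table: '#' comment lines, then "NAAN  name".
NAA_LISTING = """\
# NAAN           Name Assigning Authority
12025            National Library of Medicine
12026            Library of Congress
13030            California Digital Library
28722            University of California Berkeley
"""


def lookup_naa(naan, listing=NAA_LISTING):
    """Linear scan: return the organization registered for naan, or None."""
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        number, _, name = line.partition(" ")
        if number == naan:
            return name.strip()
    return None
```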

1099	4.2.  Looking Up NMAHs Distributed via DNS

1101	   This subsection introduces a method for looking up NMAHs that is
1102	   based on the method for discovering URN resolvers described in
1103	   [NAPTR].  It relies on querying the DNS system already installed in
1104	   the background infrastructure of most networked computers.  A query
1105	   is submitted to DNS asking for a list of resolvers that match a given
1106	   NAAN.  DNS distributes the query to the particular DNS servers that
1107	   can best provide the answer, unless the answer can be found more
1108	   quickly in a local DNS cache as a side-effect of a recent query.
1109	   Responses come back inside Name Authority Pointer (NAPTR) records.
1110	   The normal result is one or more candidate NMAHs.

1112	   In its full generality the [NAPTR] algorithm ambitiously accommodates
1113	   a complex set of preferences, orderings, protocols, mapping services,
1114	   regular expression rewriting rules, and DNS record types.  This
1115	   subsection proposes a drastic simplification of it for the special
1116	   case of ARK mapping authority discovery.  The simplified algorithm is
1117	   called Maptr.  It uses only one DNS record type (NAPTR) and restricts
1118	   most of its field values to constants.  The following hypothetical
1119	   excerpt from a DNS data file for the NAAN known as 12026 shows three
1120	   example NAPTR records ready to use with the Maptr algorithm.

1122	       12026.ark.arpa.
1123	       ;; US Library of Congress
1124	       ;;       order pref flags service regexp  replacement
1125	        IN NAPTR  0     0   "h"  "ark"   "USLC"  lhc.nlm.nih.gov:8080
1126	        IN NAPTR  0     0   "h"  "ark"   "USLC"  foobar.zaf.org
1127	        IN NAPTR  0     0   "h"  "ark"   "USLC"  sneezy.dopey.com

1129	   All the fields are held constant for Maptr except for the "flags",
1130	   "regexp", and "replacement" fields.  The "service" field contains the
1131	   constant value "ark" so that NAPTR records participating in the Maptr
1132	   algorithm will not be confused with other NAPTR records.  The "order"
1133	   and "pref" fields are held to 0 (zero) and otherwise ignored for now;
1134	   the algorithm may evolve to use these fields for ranking decisions
1135	   when usage patterns and local administrative needs are better
1136	   understood.

1138	   When a Maptr query returns a record with a flags field of "h" (for
1139	   hostport, a Maptr extension to the NAPTR flags), the replacement
1140	   field contains the NMAH (hostport) of an ARK service provider.  When
1141	   a query returns a record with a flags field of "" (the empty string),
1142	   the client needs to submit a new query containing the domain name
1143	   found in the replacement field.  This second sort of record exploits
1144	   the distributed nature of DNS by redirecting the query to another
1145	   domain name.  It looks like this.

1147	       12345.ark.arpa.
1148	       ;; Digital Library Consortium
1149	       ;;       order pref flags service regexp replacement
1150	        IN NAPTR  0     0    ""  "ark"     ""   dlc.spct.org.

1152	   Here is the Maptr algorithm for ARK mapping authority discovery.  In
1153	   it, replace <NAAN> with the NAAN from the ARK for which an NMAH is
1154	   sought.

1156	        (1) Initialize the DNS query:  type=NAPTR,
1157	        query=<NAAN>.ark.arpa.

1159	        (2) Submit the query to DNS and retrieve (NAPTR) records,
1160	        discarding any record that does not have "ark" for the service
1161	        field.

1163	        (3) All remaining records with a flags field of "h" contain
1164	        candidate NMAHs in their replacement fields.  Set them aside, if
1165	        any.

1167	        (4) Any record with an empty flags field ("") has a replacement
1168	        field containing a new domain name to which a subsequent query
1169	        should be redirected.  For each such record, set
1170	        query=<replacement> then go to step (2).  When all such records
1171	        have been recursively exhausted, go to step (5).

1173	        (5) All redirected queries have been resolved and a set of
1174	        candidate NMAHs has been accumulated from step (3).  If there
1175	        are zero NMAHs, exit - no mapping authority was found.  If there
1176	        are one or more NMAHs, choose one using any criteria you wish,
1177	        then exit.

1179	   A Perl script that implements this algorithm is included here.

1181	     #!/usr/bin/perl

1183	     use strict;
1184	     use warnings;
1185	     use Net::DNS;                      # include simple DNS package

1187	     my $qtype = "NAPTR";               # initialize query type
1188	     my $naa = shift or die "usage: $0 NAAN\n";  # NAAN argument
1189	     my $mad = Net::DNS::Resolver->new; # mapping authority discovery

1191	     maptr("$naa.ark.arpa");            # call maptr - that's it

1193	     sub maptr {                        # recursive maptr algorithm
1194	          my $dname = shift;            # domain name as argument
1195	          my $query = $mad->query($dname, $qtype);
1196	          return unless $query && $query->answer;  # non-productive
1197	          foreach my $rr ($query->answer) {
1198	               next                     # skip records of wrong type
1199	                    if ($rr->type ne $qtype);
1200	               next                     # step (2): "ark" service only
1201	                    if (lc($rr->service) ne "ark");
1202	               if ($rr->flags eq "") {
1203	                    maptr($rr->replacement);       # recurse
1204	               } elsif (lc($rr->flags) eq "h") {
1205	                    print $rr->replacement, "\n";  # candidate NMAH
1206	               }
1207	          }
1208	     }

1210	   The global database thus distributed via DNS and the Maptr algorithm
1211	   can easily be seen to mirror the contents of the Name Authority Table
1212	   file described in the previous section.
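
	   The Maptr steps can also be exercised without live DNS.  The
	   Python sketch below runs the same recursion over an in-memory
	   table of records.  The table contents beyond the two examples
	   above (the "other" service record and the ark.dlc.example host)
	   are invented for illustration, and the loop guard is an addition
	   that the algorithm as stated does not include.

```python
# Hypothetical NAPTR data: owner name -> list of (flags, service, replacement).
RECORDS = {
    "12026.ark.arpa.": [
        ("h", "ark", "lhc.nlm.nih.gov:8080"),
        ("h", "ark", "foobar.zaf.org"),
        ("h", "other", "not-an-ark.example"),      # wrong service: discarded
    ],
    "12345.ark.arpa.": [("", "ark", "dlc.spct.org.")],   # redirect record
    "dlc.spct.org.": [("h", "ark", "ark.dlc.example:8080")],
}


def maptr(naan, lookup=RECORDS.get):
    """Return all candidate NMAHs for naan, following redirect records."""
    candidates, seen = [], set()

    def resolve(dname):
        if dname in seen:                # added guard against redirect loops
            return
        seen.add(dname)
        for flags, service, replacement in lookup(dname) or []:
            if service != "ark":         # step (2): keep only "ark" records
                continue
            if flags == "h":             # step (3): candidate NMAH
                candidates.append(replacement)
            elif flags == "":            # step (4): redirected query
                resolve(replacement)

    resolve(f"{naan}.ark.arpa.")         # step (1): initial query name
    return candidates                    # step (5): caller chooses one
```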

1214	5.  Generic ARK Service Definition

1216	   An ARK request's output is delivered information; examples include
1217	   the object itself, a policy declaration (e.g., a promise of support),
1218	   a descriptive metadata record, or an error message.  The experience
1219	   of object delivery is expected to be an evolving mix of information
1220	   that reflects changing service expectations and technology
1221	   requirements; contemporary examples include such things as an object
1222	   summary and component links formatted for human consumption.  ARK
1223	   services must be couched in high-level, protocol-independent terms if
1224	   persistence is to outlive today's networking infrastructural
1225	   assumptions.  The high-level ARK service definitions listed below are
1226	   followed in the next section by a concrete method (one of many
1227	   possible methods) for delivering these services with today's
1228	   technology.

1230	5.1.  Generic ARK Access Service (access, location)

1232	   Returns (a copy of) the object or a redirect to the same, although a
1233	   sensible object proxy may be substituted.  Examples of sensible
1234	   substitutes include,

1236	     - a table of contents instead of a large complex document,
1237	     - a home page instead of an entire web site hierarchy,
1238	     - a rights clearance challenge before accessing protected data,
1239	     - directions for access to an offline object (e.g., a book),
1240	     - a description of an intangible object (a disease, an event), or
1241	     - an applet acting as "player" for a large multimedia object.

1243	   May also return a discriminated list of alternate object locators.
1244	   If access is denied, returns an explanation of the object's current
1245	   (perhaps permanent) inaccessibility.

1247	5.2.  Generic Policy Service (permanence, naming, etc.)

1249	   Returns declarations of policy and support commitments for given
1250	   ARKs.  Declarations are returned in either a structured metadata
1251	   format or a human readable text format; sometimes one format may
1252	   serve both purposes.  Policy subareas may be addressed in separate
1253	   requests, but the following areas should be covered:  object
1254	   permanence, object naming, object fragment addressing, and
1255	   operational service support.

1257	   The permanence declaration for an object is a rating defined with
1258	   respect to an identified permanence provider (guarantor), which will
1259	   be the NMA.  It may include the following aspects.

1261	        (a) "object availability" - whether and how access to the object
1262	        is supported (e.g., online 24x7, or offline only),

1264	        (b) "identifier validity" - under what conditions the identifier
1265	        will be or has been re-assigned,

1267	        (c) "content invariance" - under what conditions the content of
1268	        the object is subject to change, and

1270	        (d) "change history" - access to corrections, migrations, and
1271	        revisions, whether through links to the changed objects
1272	        themselves or through a document summarizing the change history.

1274	   One approach to a permanence rating framework, conceived
1275	   independently from ARKs, is given in [NLMPerm].  Under ongoing
1276	   development and limited deployment at the US National Library of
1277	   Medicine, it identifies the following "permanence levels":

1279	        Not Guaranteed: No commitment has been made to retain this
1280	        resource.  It could become unavailable at any time.  Its
1281	        identifier could be changed.

1283	        Permanent: Dynamic Content: A commitment has been made to keep
1284	        this resource permanently available.  Its identifier will always
1285	        provide access to the resource.  Its content could be revised or
1286	        replaced.

1288	        Permanent: Stable Content: A commitment has been made to keep
1289	        this resource permanently available.  Its identifier will always
1290	        provide access to the resource.  Its content is subject only to
1291	        minor corrections or additions.

1293	        Permanent: Unchanging Content: A commitment has been made to
1294	        keep this resource permanently available.  Its identifier will
1295	        always provide access to the resource.  Its content will not
1296	        change.

1298	   Naming policy for an object includes an historical description of the
1299	   NAA's (and its successor NAA's) policies regarding differentiation of
1300	   objects.  Since it is the NMA that responds to requests for policy
1301	   statements, it is useful for the NMA to be able to produce or
1302	   summarize these historical NAA documents.  Naming policy may include
1303	   the following aspects.

1305	        (i) "similarity" - (or "unity") the limit, defined by the NAA,
1306	        to the level of dissimilarity beyond which two similar objects
1307	        warrant separate identifiers but before which they share one
1308	        single identifier, and

1310	        (ii) "granularity" - the limit, defined by the NAA, to the level
1311	        of object subdivision beyond which sub-objects do not warrant
1312	        separately assigned identifiers but before which sub-objects are
1313	        assigned separate identifiers.

1315	   Subnaming policy for an object describes the qualifiers that the NMA,
1316	   in fulfilling its ongoing and evolving service obligations, allows as
1317	   extensions to an NAA-assigned ARK.  To the conceptual object that the
1318	   NAA named with an ARK, the NMA may add component access points and
1319	   derivatives (e.g., format migrations in aid of preservation) in order
1320	   to provide both basic and value-added services.

1322	   Addressing policy for an object includes a description of how, during
1323	   access, object components (e.g., paragraphs, sections) or views
1324	   (e.g., image conversions) may or may not be "addressed", in other
1325	   words, how the NMA permits arguments or parameters to modify the
1326	   object delivered as the result of an ARK request.  If supported,
1327	   these sorts of operations would provide things like byte-ranged
1328	   fragment delivery and open-ended format conversions, or any set of
1329	   possible transformations that would be too numerous to list or to
1330	   identify with separately assigned ARKs.

1332	   Operational service support policy includes a description of general
1333	   operational aspects of the NMA service, such as after-hours staffing
1334	   and trouble reporting procedures.

1336	5.3.  Generic Description Service

1338	   Returns a description of the object.  Descriptions are returned in
1339	   either a structured metadata format or a human readable text format;
1340	   sometimes one format may serve both purposes.  A description must at
1341	   a minimum answer the who, what, when, and where questions concerning
1342	   an expression of the object.  Standalone descriptions should be
1343	   accompanied by the modification date and source of the description
1344	   itself.  May also return discriminated lists of ARKs that are related
1345	   to the given ARK.

1347	6.  Overview of The HTTP URL Mapping Protocol (THUMP)

1349	   The HTTP URL Mapping Protocol (THUMP) is a way of taking a key (a
1350	   kind of identifier) and asking such questions as, what information
1351	   does this identify and how permanent is it?  [THUMP] is in fact one
1352	   specific method under development for delivering ARK services.  The
1353	   protocol runs over HTTP to exploit the web browser's current pre-
1354	   eminence as user interface to the Internet.  THUMP is designed so
1355	   that a person can enter ARK requests directly into the location field
1356	   of current browser interfaces.  Because it runs over HTTP, THUMP can
1357	   be simulated and tested within keyboard-based [TELNET] sessions.

1359	   The asker (a person or client program) starts with an identifier,
1360	   such as an ARK or a URL.  The identifier reveals to the asker (or
1361	   allows the asker to infer) the Internet host name and port number of
1362	   a server system that responds to questions.  Here, this is just the
1363	   NMAH that is obtained by inspection and possibly lookup based on the
1364	   ARK's NAAN.  The asker then sets up an HTTP session with the server
1365	   system, sends a question via a THUMP request (contained within an
1366	   HTTP request), receives an answer via a THUMP response (contained
1367	   within an HTTP response), and closes the session.  That concludes the
1368	   connected portion of the protocol.

1370	   A THUMP request is a string of characters beginning with a `?'
1371	   (question mark) that is appended to the identifier string.  The
1372	   resulting string is sent as an argument to HTTP's GET command.
1373	   Request strings too long for GET may be sent using HTTP's POST
1374	   command.  The three most common requests correspond to three
1375	   degenerate special cases that keep the user's learning and typing
1376	   burden low.  First, a simple key with no request at all is the same
1377	   as an ordinary access request.  Thus a plain ARK entered into a
1378	   browser's location field behaves much like a plain URL, and returns
1379	   access to the primary identified object, for instance, an HTML
1380	   document.
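
	   The three special cases (no request at all, and the bare "?" and
	   "??" strings described later in this section) can be told apart
	   with a few lines of Python.  This is only a sketch; the function
	   name and the returned labels are assumptions, and any other
	   request string is simply reported as "other".

```python
def thump_request_kind(identifier_with_request):
    """Classify a THUMP request by the string that follows the identifier."""
    _, mark, rest = identifier_with_request.partition("?")
    request = mark + rest                # the request begins at the first '?'
    if request == "":
        return "access"                  # plain key: ordinary access request
    if request == "?":
        return "description"             # minimal description request
    if request == "??":
        return "policy"                  # minimal permanence policy request
    return "other"                       # not covered in this document
```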

1382	   The second special case is a minimal ARK description request string
1383	   consisting of just "?".  For example, entering the string,

1385	             ark.nlm.nih.gov/ark:/12025/psbbantu?

1387	   into the browser's location field directly precipitates a request for
1388	   a metadata record describing the object identified by
1389	   ark:/12025/psbbantu.  The browser, unaware of THUMP, prepares and
1390	   sends an HTTP GET request in the same manner as for a URL.  THUMP is
1391	   designed so that the response (indicated by the returned HTTP content
1392	   type) is normally displayed, whether the output is structured for
1393	   machine processing (text/plain) or formatted for human consumption
1394	   (text/html).

1396	   In the following example THUMP session, each line has been annotated
1397	   to include a line number and whether it was the client or server that
1398	   sent it.  Without going into much depth, the session has four pieces
1399	   separated from each other by blank lines:  the client's piece (lines
1400	   1-3), the server's HTTP/THUMP response headers (4-7), the record set
1401	   header (8-11), and the metadata record (12-17).  The first and last
1402	   lines (1 and 17) correspond to the client's steps to start the TCP
1403	   session and the server's steps to end it, respectively.

1405	      1  C: [opens session]
1406	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu? HTTP/1.1
1407	         C:
1408	         S: HTTP/1.1 200 OK
1409	      5  S: Content-Type: text/plain
1410	         S: THUMP-Status: 0.1 200 OK
1411	         S:
1412	         S: |set: NLM | 12025/psbbantu? | 20030731
1413	         S:         | http://ark.nlm.nih.gov/ark:/12025/psbbantu?
1414	     10  S: here: 1 | 1 | 1
1415	         S:
1416	         S: erc:
1417	         S: who:    Lederberg, Joshua
1418	         S: what:   Studies of Human Families for Genetic Linkage
1419	     15  S: when:   1974
1420	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1421	         S: [closes session]

1423	   The first two server response lines (4-5) above are typical of HTTP.
1424	   The next line (6) is peculiar to THUMP, and indicates the THUMP
1425	   version and a normal return status.  The balance of the response
1426	   consists of a record set header (lines 8-10) and a single metadata
1427	   record (12-16) that comprises the ARK description service response.
1428	   The record set header identifies (8-9) who created the set, what its
1429	   title is, when it was created, and where an automated process can
1430	   access the set; it ends in a line (10) whose respective sub-elements
1431	   indicate that here in this communication the recipient can expect to
1432	   find 1 record, starting at the record numbered 1, from a set
1433	   consisting of a total of 1 record (i.e., here is the entire set,
1434	   consisting of exactly one record).
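
	   As an illustration of how a client might pick apart such a
	   response, the Python sketch below extracts the THUMP status and
	   the three counts from the "here:" line.  It works on the example
	   text with the annotations removed and with plain newlines (real
	   HTTP uses CRLF line endings); the function name and return shape
	   are assumptions, and [THUMP] remains the authority on the format.

```python
RESPONSE = """\
HTTP/1.1 200 OK
Content-Type: text/plain
THUMP-Status: 0.1 200 OK

|set: NLM | 12025/psbbantu? | 20030731
        | http://ark.nlm.nih.gov/ark:/12025/psbbantu?
here: 1 | 1 | 1

erc:
who:    Lederberg, Joshua
what:   Studies of Human Families for Genetic Linkage
when:   1974
where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
"""


def parse_thump(text):
    """Return (status, here, record_text) from a single-record response."""
    # Blank lines separate the headers, the record set header, and the record.
    head, set_header, record = text.split("\n\n", 2)
    status = here = None
    for line in head.splitlines():
        if line.startswith("THUMP-Status:"):
            version, code, reason = line.split(":", 1)[1].split(None, 2)
            status = (version, int(code), reason)
    for line in set_header.splitlines():
        if line.startswith("here:"):
            # records in this message | number of first record | set total
            here = tuple(int(n) for n in line.split(":", 1)[1].split("|"))
    return status, here, record
```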

1436	   The returned record (12-16) is in the format of an Electronic
1437	   Resource Citation [ERC], which is discussed in more detail in the
1438	   next section.  For now, note that it contains four elements that
1439	   answer the top priority questions regarding an expression of the
1440	   object:  who played a major role in expressing it, what the
1441	   expression was called, when it was created, and where the expression
1442	   may be found.  This quartet of elements comes up again and again in
1443	   ERCs.

1445	   The third degenerate special case of an ARK request (and no other
1446	   cases will be described in this document) is the string "??",
1447	   corresponding to a minimal permanence policy request.  It can be seen
1448	   in use appended to an ARK (on line 2) in the example session that
1449	   follows.

1451	      1  C: [opens session]
1452	         C: GET http://ark.nlm.nih.gov/ark:/12025/psbbantu?? HTTP/1.1
1453	         C:
1454	         S: HTTP/1.1 200 OK
1455	      5  S: Content-Type: text/plain
1456	         S: THUMP-Status: 0.1 200 OK
1457	         S:
1458	         S: |set: NLM | 12025/psbbantu?? | 20030731
1459	         S:         | http://ark.nlm.nih.gov/ark:/12025/psbbantu??
1460	     10  S: here: 1 | 1 | 1
1461	         S:
1462	         S: erc:
1463	         S: who:    Lederberg, Joshua
1464	         S: what:   Studies of Human Families for Genetic Linkage
1465	     15  S: when:   1974
1466	         S: where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1467	         S: erc-support:
1468	         S: who:    USNLM
1469	         S: what:   Permanent, Unchanging Content
1470	     20  S: when:   20010421
1471	         S: where:  http://ark.nlm.nih.gov/yy22948
1472	         S: [closes session]

1474	   Again, a single metadata record (lines 12-21) is returned, but it
1475	   consists of two segments.  The first segment (12-16) gives the same
1476	   basic citation information as in the previous example.  It is
1477	   returned in order to establish context for the persistence
1478	   declaration in the second segment (17-21).

1480	   Each segment in an ERC tells a different story relating to the
1481	   object, so although the same four questions (elements) appear in
1482	   each, the answers depend on the segment's story type.  While the
1483	   first segment tells the story of an expression of the object, the
1484	   second segment tells the story of the support commitment made to it:
1486	   who made the commitment, what the nature of the commitment was, when
1487	   it was made, and where a fuller explanation of the commitment may be
1488	   found.
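
	   The two-segment structure can be illustrated with a short Python
	   sketch that groups a record's elements by segment.  It assumes,
	   for this example only, that a segment starts at an element whose
	   label begins with "erc" and whose value is empty, as "erc:" and
	   "erc-support:" do in the session above; the function name is
	   hypothetical.

```python
def erc_segments(record):
    """Group label/value pairs under the segment label that precedes them."""
    segments = {}
    current = None
    for line in record.splitlines():
        label, colon, value = line.partition(":")   # split at the first colon
        if not colon:
            continue                                # not an element line
        label, value = label.strip(), value.strip()
        if value == "" and label.startswith("erc"):
            current = segments.setdefault(label, {})  # new segment begins
        elif current is not None:
            current[label] = value
    return segments


RECORD = """\
erc:
who:    Lederberg, Joshua
what:   Studies of Human Families for Genetic Linkage
when:   1974
where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
erc-support:
who:    USNLM
what:   Permanent, Unchanging Content
when:   20010421
where:  http://ark.nlm.nih.gov/yy22948
"""
```

	   Calling erc_segments(RECORD) yields one dictionary per segment,
	   so the same four questions can be asked of each story separately.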

1490	7.  Overview of Electronic Resource Citations (ERCs)

1492	   An Electronic Resource Citation (or ERC, pronounced e-r-c) [ERC] is a
1493	   simple, compact, and printable record designed to hold data
1494	   associated with an information resource.  By design, the ERC is a
1495	   metadata format that balances the needs for expressive power, very
1496	   simple machine processing, and direct human manipulation.

1498	   A founding principle of the ERC is that direct human contact with
1499	   metadata will be a necessary and sufficient condition for the near
1500	   term rapid development of metadata standards, systems, and services.
1501	   Thus the machine-processable ERC format must only minimally strain
1502	   people's ability to read, understand, change, and transmit ERCs
1503	   without their relying on intermediation with specialized software
1504	   tools.  The basic ERC needs to be succinct, transparent, and
1505	   trivially parseable by software.

1507	   In the current Internet, it is natural to consider seriously using
1508	   XML as an exchange format because of predictions that it will obviate
1509	   many ad hoc formats and programs, and unify much of the world's
1510	   information under one reliable data structuring discipline that is
1511	   easy to generate, verify, parse, and render.  It appears, however,
1512	   that XML is still only catching on after years of standards work and
1513	   implementation experience.  The reasons for this are unclear, but for
1514	   now very simple XML interpretation is still out of reach.  Another
1515	   important caution is that XML structures are hard on the eyeballs,
1516	   taking up an amount of display (and page) space that significantly
1517	   exceeds that of traditional formats.  Until these conflicts with ERC
1518	   principle are resolved, XML is not a first choice for representing
1519	   ERCs.  Borrowing instead from the data structuring format that
1520	   underlies the successful spread of email and web services, the first
1521	   ERC format uses [ANVL], which is based on email and HTTP headers
1522	   [RFC822].  There is a naturalness to ANVL's label-colon-value format
1523	   (seen in the previous section) that barely needs explanation to a
1524	   person beginning to enter ERC metadata.

1526	   Besides simplicity of ERC system implementation and data entry
1527	   mechanics, ERC semantics (what the record and its constituent parts
1528	   mean) must also be easy to explain.  ERC semantics are based on a
1529	   reformulation and extension of the Dublin Core [DCORE] hypothesis,
1530	   which suggests that the fifteen Dublin Core metadata elements have a
1531	   key role to play in cross-domain resource description.  The ERC
1532	   design recognizes that the Dublin Core's primary contribution is the
1533	   international, interdisciplinary consensus that identified fifteen
1534	   semantic buckets (element categories), regardless of how they are
1535	   labeled.  The ERC then adds a definition for a record and some
1536	   minimal compliance rules.  In pursuing the limits of simplicity, the
1537	   ERC design combines and relabels some Dublin Core buckets to isolate
1538	   a tiny kernel (subset) of four elements for basic cross-domain
1539	   resource description.

1541	   For the cross-domain kernel, the ERC uses the four basic elements -
1542	   who, what, when, and where - to pretend that every object in the
1543	   universe can have a uniform minimal description.  Each has a name or
1544	   other identifier, a location, some responsible person or party, and a
1545	   date.  It doesn't matter what type of object it is, or whether one
1546	   plans to read it, interact with it, smoke it, wear it, or navigate
1547	   it.  Of course, this approach is flawed because uniformity of
1548	   description for some object types requires more semantic contortion
1549	   and sacrifice than for others.  That is why at the beginning of this
1550	   document, the ARK was said to be suited to objects that accommodate
1551	   reasonably regular electronic description.

1553	   While insisting on uniformity at the most basic level provides
1554	   powerful cross-domain leverage, the semantic sacrifice is great for
1555	   many applications.  So the ERC also permits a semantically rich and
1556	   nuanced description to co-exist in a record along with a basic
1557	   description.  In that way both sophisticated and naive recipients of
1558	   the record can extract the level of meaning from it that best suits
1559	   their needs and abilities.  Key to unlocking the richer description
1560	   is a controlled vocabulary of ERC record types (not explained in this
1561	   document) that permit knowledgeable recipients to apply defined sets
1562	   of additional assumptions to the record.

1564	7.1.  ERC Syntax

1566	   An ERC record is a sequence of metadata elements ending in a blank
1567	   line.  An element consists of a label, a colon, and an optional
1568	   value.  Here is an example of a record with five elements.

1570	          erc:
1571	          who: Gibbon, Edward
1572	          what: The Decline and Fall of the Roman Empire
1573	          when: 1781
1574	          where: http://www.ccel.org/g/gibbon/decline/

1576	   A long value may be folded (continued) onto the next line by
1577	   inserting a newline and indenting the next line.  A value can be thus
1578	   folded across multiple lines.  Here are two example elements, each
1579	   folded across four lines.

1581	          who/created: University of California, San Francisco, AIDS
1582	               Program at San Francisco General Hospital | University
1583	               of California, San Francisco, Center for AIDS Prevention
1584	               Studies
1585	          what/Topic:
1586	                Heart Attack | Heart Failure
1587	               | Heart
1588	                                Diseases

1590	   An element value folded across several lines is treated as if the
1591	   lines were joined together on one long line.  For example, the second
1592	   element from the previous example is considered equivalent to

1594	          what/Topic: Heart Attack | Heart Failure | Heart Diseases

1596	   An element value may contain multiple values, each one separated from
1597	   the next by a `|' (pipe) character.  The element from the previous
1598	   example contains three values.

1600	   For annotation purposes, any line beginning with a `#' (hash)
1601	   character is treated as if it were not present; this is a "comment"
1602	   line (a feature not available in email or HTTP headers).  For
1603	   example, the following element is spread across four lines and
1604	   contains two values:

1606	          what/Topic:
1607	               Heart Attack
1608	          #    | Heart Failure  -- hold off until next review cycle
1609	               | Heart Diseases
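
	   The folding, comment, and multi-value rules above can be sketched in
	   a few lines of Python.  This is a minimal illustration, not a
	   reference parser.  The function name parse_erc_elements is
	   hypothetical, and the sketch assumes that element labels start in
	   the first column once common indentation is removed, that any line
	   starting with whitespace is a continuation, and that folded
	   fragments are joined with a single space.

```python
import textwrap


def parse_erc_elements(record_text):
    """Parse one ERC record into a list of (label, [values]) pairs."""
    raw = []  # each entry: [label, [value fragments in order]]
    for line in textwrap.dedent(record_text).splitlines():
        if not line.strip():
            if raw:
                break        # a blank line ends the record
            continue         # ignore blank lines before the record
        if line.startswith("#"):
            continue         # comment line: treated as if not present
        if line[0] in " \t":
            if raw:
                raw[-1][1].append(line.strip())  # folded continuation
            continue
        label, _, value = line.partition(":")
        raw.append([label.strip(), [value.strip()]])

    elements = []
    for label, fragments in raw:
        # Assumption: fragments are joined with one space.
        joined = " ".join(f for f in fragments if f)
        values = [v.strip() for v in joined.split("|")] if joined else []
        elements.append((label, values))
    return elements


record = """\
what/Topic:
     Heart Attack
#    | Heart Failure  -- hold off until next review cycle
     | Heart Diseases
"""
print(parse_erc_elements(record))
```

	   Splitting on the first colon only keeps URL values such as
	   "http://..." intact, since the label itself never contains a colon
	   in the examples shown here.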

1611	7.2.  ERC Stories

1613	   An ERC record is organized into one or more distinct segments,
1614	   where each segment tells a story about a different aspect of the
1615	   information resource.  A segment boundary occurs whenever a segment
1616	   label (an element beginning with "erc") is encountered.  The basic
1617	   label "erc:" introduces the story of an object's expression (e.g.,
1618	   its publication, installation, or performance).  The label "erc-
1619	   about:" introduces the story of an object's content (what it is
1620	   about) and "erc-support:" introduces the story of a support
1621	   commitment made to it.  A story segment that concerns the ERC itself
1622	   is introduced by the label "erc-from:".  It is an important segment
1623	   that tells the story of the ERC's provenance.  Elements beginning
1624	   with "erc" are reserved for segment labels and their associated story
1625	   types.  From an earlier example, here is an ERC with two segments.

1627	         erc:
1628	         who:    Lederberg, Joshua
1629	         what:   Studies of Human Families for Genetic Linkage
1630	         when:   1974
1631	         where:  http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf
1632	         erc-support:
1633	         who:    NIH/NLM/LHNCBC
1634	         what:   Permanent, Unchanging Content
1635	         # Note to ops staff:  date needs verification.
1636	         when:   2001 04 21
1637	         where:  http://ark.nlm.nih.gov/yy22948

1639	   Segment stories are told according to journalistic tradition.  While
1640	   any number of pertinent elements may appear in a segment, priority is
1641	   placed on answering the questions who, what, when, and where at the
1642	   beginning of each segment so that readers can make the most important
1643	   selection or rejection decisions as soon as possible.  To make things
1644	   simple, the listed ordering of the questions is maintained in each
1645	   segment (as it happens, most people who have been exposed to this
1646	   storytelling technique are already familiar with the above
1647	   ordering).

1649	   The four questions are answered by using corresponding element
1650	   labels.  The four element labels can be re-used in each story
1651	   segment, but their meaning changes depending on the segment (the
1652	   story type) in which they appear.  In the example above, "who" is
1653	   first used to name a document's author and subsequently used to name
1654	   the permanence guarantor (provider).  Similarly, "when" first lists
1655	   the date of object creation and in the next segment lists the date of
1656	   a commitment decision.  Four labels appearing across three segments
1657	   effectively map to twelve semantically distinct elements.  Distinct
1658	   element meanings are mapped to Dublin Core elements in a later
1659	   section.
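
	   As a rough sketch of how a recipient might apply the segment rule,
	   the following Python groups an ordered list of elements into
	   segments, starting a new segment at every label that begins with
	   "erc".  The function name split_segments and the tuple layout are
	   assumptions made for illustration only.

```python
def split_segments(elements):
    """Group (label, value) pairs into (segment_label, members) tuples."""
    segments = []
    for label, value in elements:
        if label.startswith("erc"):
            segments.append((label, []))            # segment boundary
        elif segments:
            segments[-1][1].append((label, value))  # belongs to current story
        # Elements before the first segment label are skipped in this sketch.
    return segments


record = [
    ("erc", ""),
    ("who", "Lederberg, Joshua"),
    ("what", "Studies of Human Families for Genetic Linkage"),
    ("when", "1974"),
    ("where", "http://profiles.nlm.nih.gov/BB/A/N/T/U/_/bbantu.pdf"),
    ("erc-support", ""),
    ("who", "NIH/NLM/LHNCBC"),
    ("what", "Permanent, Unchanging Content"),
    ("when", "2001 04 21"),
    ("where", "http://ark.nlm.nih.gov/yy22948"),
]

for segment_label, members in split_segments(record):
    # The same label "who" means something different in each segment.
    print(segment_label, "->", dict(members)["who"])
```

	   Keeping the segment label next to its members is what lets the
	   same four labels carry a different meaning in each story.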

1661	7.3.  The ERC Anchoring Story

1663	   Each ERC contains an anchoring story.  It is usually the first
1664	   segment labeled "erc:" and it concerns an "anchoring" expression of
1665	   the object.  An "anchoring" expression is the one that a provider
1666	   deemed the most suitable basic referent given the audience and
1667	   application for which it produced the ERC.  If it sounds like the
1668	   provider has great latitude in choosing its anchoring expression, it
1669	   is because it does.  A typical anchoring story in an ERC for a born-
1670	   digital document would be the story of the document's release on a
1671	   web site; such a document would then be the anchoring expression.

1673	   An anchoring story need not be the central descriptive goal of an ERC
1674	   record.  For example, a museum provider may create an ERC for a
1675	   digitized photograph of a painting but choose to anchor it in the
1676	   story of the original painting instead of the story of the electronic
1677	   likeness; although the ERC may through other segments prove to be
1678	   centrally concerned with describing the electronic likeness, the
1679	   provider may have chosen this particular anchoring story in order to
1680	   make the ERC visible in a way that is most natural to patrons (who
1681	   would find the Mona Lisa under da Vinci sooner than they would find
1682	   it under the name of the person who snapped the photograph or scanned
1683	   the image).  In another example, a provider that creates an ERC for a
1684	   dramatic play as an abstract work has the task of describing a piece
1685	   of intangible intellectual property.  To anchor this abstract object
1686	   in the concrete world, if only through a derivative expression, it
1687	   makes sense for the provider to choose a suitable printed edition of
1688	   the play as the anchoring object expression (to describe in the
1689	   anchoring story) of the ERC.

1691	   The anchoring story has special rules designed to keep ERC processing
1692	   simple and predictable.  Each of the four basic elements (who, what,
1693	   when, and where) must be present, unless a best effort to supply it
1694	   fails.  In the event of failure, the element still appears but a
1695	   special value (described later) is used to explain the missing value.
1696	   While the requirement that each of the four elements be present only
1697	   applies to the anchoring story segment, as usual these elements
1698	   appear at the beginning of the segment and may only be used in the
1699	   prescribed order.  A minimal ERC would normally consist of just an
1700	   anchoring story and the element quartet, as illustrated in the next
1701	   example.

1703	         erc:
1704	         who:   National Research Council
1705	         what:  The Digital Dilemma
1706	         when:  2000
1707	         where: http://books.nap.edu/html/digital%5Fdilemma
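
	   One possible reading of the anchoring story rules is sketched
	   below in Python.  The function name check_anchoring_story is
	   hypothetical.  The sketch assumes that only unqualified labels
	   satisfy the requirement and that an empty value counts as a
	   missing explanation; neither point is settled by the text above.

```python
BASIC_ELEMENTS = ("who", "what", "when", "where")


def check_anchoring_story(elements):
    """Return a list of problems found in an anchoring segment.

    `elements` is the ordered list of (label, value) pairs that follow
    the "erc:" segment label.
    """
    problems = []
    leading = tuple(label for label, _ in elements[:4])
    if leading != BASIC_ELEMENTS:
        problems.append("segment must begin with who, what, when, where")
    for label, value in elements[:4]:
        # A missing value should be explained by a special value instead.
        if label in BASIC_ELEMENTS and not value.strip():
            problems.append(label + " has no value or explanatory code")
    return problems


minimal = [
    ("who", "National Research Council"),
    ("what", "The Digital Dilemma"),
    ("when", "2000"),
    ("where", "http://books.nap.edu/html/digital%5Fdilemma"),
]
print(check_anchoring_story(minimal))
```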

1709	   A minimal ERC can be abbreviated so that it resembles a traditional
1710	   compact bibliographic citation that is nonetheless completely machine
1711	   processable.  The required elements and ordering make it possible to
1712	   eliminate the element labels, as shown here.

1714	         erc: National Research Council | The Digital Dilemma | 2000
1715	                | http://books.nap.edu/html/digital%5Fdilemma
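
	   Because the order is fixed, the abbreviated form can be expanded
	   back into labeled elements mechanically.  The Python below is a
	   small sketch of that idea.  The name expand_abbreviated is
	   hypothetical, and the sketch assumes that each of the four
	   elements carries exactly one value, since `|' also separates
	   multiple values within an element.

```python
BASIC_ELEMENTS = ("who", "what", "when", "where")


def expand_abbreviated(value):
    """Map the value of an abbreviated "erc:" element to labeled pairs."""
    unfolded = " ".join(value.split())  # undo folding across lines
    parts = [part.strip() for part in unfolded.split("|")]
    if len(parts) != len(BASIC_ELEMENTS):
        raise ValueError("expected exactly four '|'-separated values")
    return list(zip(BASIC_ELEMENTS, parts))


abbreviated = (
    "National Research Council | The Digital Dilemma | 2000\n"
    "       | http://books.nap.edu/html/digital%5Fdilemma"
)
for label, value in expand_abbreviated(abbreviated):
    print(label + ": " + value)
```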

1717	7.4.  ERC Elements

1719	   As mentioned, the four basic ERC elements (who, what, when, and
1720	   where) take on different specific meanings depending on the story
1721	   segment in which they are used.  By appearing in each segment, albeit
1722	   in different guises, the four elements serve as a valuable mnemonic
1723	   device - a kind of checklist - for constructing minimal story
1724	   segments from scratch.  Again, it is only in the anchoring segment
1725	   that all four elements are mandatory.

1727	   Here are some mappings between ERC elements and Dublin Core [DCORE]
1728	   elements.

1730	          Segment     ERC Element     Equivalent Dublin Core Element
1731	         ---------    -----------     ------------------------------
1732	            erc          who          Creator/Contributor/Publisher
1733	            erc          what                Title
1734	            erc          when                Date
1735	            erc          where               Identifier
1736	         erc-about       who                  <none>
1737	         erc-about       what                Subject
1738	         erc-about       when                Coverage (temporal)
1739	         erc-about       where               Coverage (spatial)

1741	   The basic element labels may also be qualified to add nuances to the
1742	   semantic categories that they identify.  Elements are qualified by
1743	   appending a `/' (slash) and a qualifier term.  Often qualifier terms
1744	   appear as the past tense form of a verb because it makes re-using
1745	   qualifiers among elements easier.

1747	         who/published:  ...
1748	         when/published: ...
1749	         where/published: ...

1751	   Using past tense verbs for qualifiers also reminds providers and
1752	   recipients that element values contain transient assertions that may
1753	   have been true once, but that tend to become less true over time.
1754	   Recipients that don't understand the meaning of a qualifier can fall
1755	   back onto the semantic category (bucket) designated by the
1756	   unqualified element label.  Inevitably recipients (people and
1757	   software) will have diverse abilities in understanding elements and
1758	   qualifiers.

1760	   Any number of other elements and qualifiers may be used in
1761	   conjunction with the quartet of basic segment questions.  The only
1762	   semantic requirement is that they pertain to the segment's story.
1763	   Also, it is only the four basic elements that change meaning
1764	   depending on their segment context.  All other elements have meaning
1765	   independent of the segment in which they appear.  If an element label
1766	   stripped of its qualifier is still not recognized by the recipient, a
1767	   second fallback position is to ignore it and rely on the four basic
1768	   elements.
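
	   The two fallback positions can be expressed as a short lookup.  The
	   Python below is only a sketch; the function name resolve_label and
	   the idea of passing the recipient's vocabulary as a set are
	   assumptions made for illustration.

```python
def resolve_label(label, understood):
    """Return the most specific form of `label` the recipient understands.

    Falls back from the qualified label to the unqualified label, and
    returns None when the element should simply be ignored.
    """
    if label in understood:
        return label
    base = label.split("/", 1)[0]   # strip the qualifier, if any
    if base in understood:
        return base                 # first fallback: the semantic bucket
    return None                     # second fallback: ignore the element


vocabulary = {"who", "what", "when", "where", "who/published"}
for label in ("who/published", "when/Reviewed", "IDcode"):
    print(label, "->", resolve_label(label, vocabulary))
```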

1770	   Elements may be either Canonical, Provisional, or Local.  Canonical
1771	   elements are officially recognized via a registry as part of the
1772	   metadata vernacular.  All elements, qualifiers, and segment labels
1773	   used in this document up until now belong to that vernacular.
1774	   Provisional elements are also officially recognized via the registry,
1775	   but have only been proposed for inclusion in the vernacular.  To be
1776	   promoted to the vernacular, a provisional element passes through a
1777	   vetting process during which its documentation must be in order and
1778	   its community acceptance demonstrated.  Local elements are any
1779	   elements not officially recognized in the registry.  The registry
1780	   [DERC] is a work in progress.

1782	   Local elements are immediately distinguishable from Canonical or
1783	   Provisional elements because all terms that begin with an upper case
1784	   letter are reserved for spontaneous local use.  No term beginning
1785	   with an upper case letter will ever be assigned Canonical or
1786	   Provisional status, so it should be safe to use such terms for local
1787	   purposes.  Any recipient of external ERCs containing such terms will
1788	   understand them to be part of the originating provider's local
1789	   metadata dialect.  Here's an example ERC with three segments, one
1790	   local element, and two local qualifiers.  The segment boundaries have
1791	   been emphasized by comment lines (which, as before, are ignored by
1792	   processors).

1794	         erc:
1795	         who: Bullock, TH | Achimowicz, JZ | Duckrow, RB
1796	                 | Spencer, SS | Iragui-Madoz, VJ
1797	         what: Bicoherence of intracranial EEG in sleep,
1798	                 wakefulness and seizures
1799	         when: 1997 12 00
1800	         where: http://cogprints.soton.ac.uk/%{
1801	                 documents/disk0/00/00/01/22/index.html %}
1802	         in: EEG Clin Neurophysiol | 1997 12 00 | v103, i6, p661-678
1803	         IDcode: cog00000122
1804	         # ---- new segment ----
1805	         erc-about:
1806	         what/Subcategory: Bispectrum | Nonlinearity | Epilepsy
1807	                 | Cooperativity | Subdural | Hippocampus | Higher moment
1808	         # ---- new segment ----
1809	         erc-from:
1810	         who: NIH/NLM/NCBI
1811	         what: pm9546494
1812	         when/Reviewed: 1998 04 18 021600
1813	         where: http://ark.nlm.nih.gov/12025/pm9546494?

1815	   The local element "IDcode" immediately precedes the "erc-about"
1816	   segment, which itself contains an element with the local qualifier
1817	   "Subcategory".  The second to last element also carries the local
1818	   qualifier "Reviewed".  Finally, what might be a provisional element
1819	   "in" appears near the end of the first segment.  It might have been
1820	   proposed as a way to complete a citation for an object originally
1821	   appearing inside another object (such as an article appearing in a
1822	   journal or an encyclopedia).
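
	   The upper case convention makes local terms easy to detect without
	   consulting the registry.  A minimal Python sketch follows; the
	   function name local_terms is hypothetical.

```python
def local_terms(label):
    """Return the local terms (element and/or qualifier) in a label."""
    element, _, qualifier = label.partition("/")
    # Terms beginning with an upper case letter are reserved for local use.
    return [term for term in (element, qualifier) if term[:1].isupper()]


for label in ("IDcode", "what/Subcategory", "when/Reviewed", "in", "who"):
    print(label, "->", local_terms(label))
```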

1824	7.5.  ERC Element Values

1826	   ERC element values tend to be straightforward strings.  If the
1827	   provider intends something special for an element, it will so
1828	   indicate with markers at the beginning of its value string.  The
1829	   markers are designed to be uncommon enough that they would not likely
1830	   occur in normal data except by deliberate intent.  Markers can only
1831	   occur near the beginning of a string, and once any octet of non-
1832	   marker data has been encountered, no further marker processing is
1833	   done for the element value.  In the absence of markers the string is
1834	   considered pure data; this has been the case with all the examples
1835	   seen thus far.  The fullest form of an element value with all three
1836	   optional markers in place looks like this.

1838	         VALUE =    [markup_flags]    (:ccode)    ,    DATA

1840	   In processing, the first non-whitespace character of an ERC element
1841	   value is examined.  An initial `[' is reserved to introduce a
1842	   bracketed set of markup flags (not described in this document) that
1843	   ends with `]'.  If ERC data is machine-generated, each value string
1844	   may be preceded by "[]" to prevent any of its data from being
1845	   mistaken for markup flags.  Once past the optional markup, the
1846	   remaining value may optionally begin with a controlled code.  A
1847	   controlled code always has the form "(:ccode)", for example,

1849	         who: (:unkn) Anonymous
1850	         what: (:791) Bee Stings

1852	   Any string after such a code is taken to be an uncontrolled (e.g.,
1853	   natural language) equivalent.  The code "unkn" indicates a
1854	   conventional explanation for a missing value (stating that the value
1855	   is unknown).  The remainder of the string makes an equivalent
1856	   statement in a form that the provider deemed most suitable to its
1857	   (probably human) audience.  The code "791" could be a fixed numeric
1858	   topic identifier within an unspecified topic vocabulary.  Any code
1859	   may be ignored by those that do not understand it.

1861	   There are several codes to explain different ways in which a required
1862	   element's value may go missing.

1864	         (:unac)   temporarily inaccessible
1865	         (:unal)   unallowed, suppressed intentionally
1866	         (:unap)   not applicable, makes no sense
1867	         (:unas)   value unassigned (e.g., Untitled)
1868	         (:unav)   value unavailable indefinitely
1869	         (:unkn)   unknown (e.g., Anonymous, Inconnue)
1870	         (:etal)   too numerous to list (et alia)
1871	         (:none)   never had a value, never will
1872	         (:null)   explicitly empty
1873	         (:tba)    to be assigned or announced later

1875	   Once past an optional controlled code, the remaining string value is
1876	   subjected to one final test.  If the next non-whitespace
1877	   character is a `,' (comma), it indicates that the string value is
1878	   "sort-friendly".  This means that the value is (a) laid out with an
1879	   inverted word order useful for sorting items having comparably laid
1880	   out element values (items might be the containing ERC records) and
1881	   (b) that the value may contain other commas that indicate inversion
1882	   points should it become necessary to recover the value in natural
1883	   word order.  Typically, this feature is used to express Western-style
1884	   personal names in family-name-given-name order.  It can also be used
1885	   wherever natural word order might make sorting tricky, such as when
1886	   data contains titles or corporate names.  Here are some example
1887	   elements.

1889	         who:   ,  van Gogh, Vincent
1890	         who:,Howell, III, PhD, 1922-1987, Thurston
1891	         who:, Acme Rocket Factory, Inc., The
1892	         who:, Mao Tse Tung
1893	         who:, McCartney, Paul, Sir,
1894	         what:, Health and Human Services, United States Government
1895	                 Department of, The,

1897	   There are rules to use in recovering a copy of the value in natural
1898	   word order, if desired.  The above example strings have the following
1899	   natural word order values, respectively.

1901	         Vincent van Gogh
1902	         Thurston Howell, III, PhD, 1922-1987
1903	         The Acme Rocket Factory, Inc.
1904	         Mao Tse Tung
1905	         Sir Paul McCartney
1906	         The United States Government Department of Health and Human Services
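
	   The three optional markers can be recognized with one regular
	   expression, as in the Python sketch below.  The function name
	   parse_value and the dictionary keys are assumptions made for
	   illustration.  The sketch only separates the markers from the
	   data; it does not interpret markup flags and does not attempt to
	   recover natural word order, since those rules are not given here.

```python
import re

VALUE_RE = re.compile(
    r"""
    \s*
    (?: \[ (?P<flags>[^\]]*) \] \s* )?   # optional markup flags in brackets
    (?: \( : (?P<ccode>[^)]*) \) \s* )?  # optional controlled code
    (?P<sort> , )?                       # optional sort-friendly comma
    \s*
    (?P<data> .* )                       # everything else is data
    """,
    re.VERBOSE | re.DOTALL,
)


def parse_value(value):
    """Split an ERC element value into its markers and its data."""
    match = VALUE_RE.match(value)  # always matches: every part is optional
    return {
        "flags": match.group("flags"),        # None when no [..] is present
        "ccode": match.group("ccode"),        # None when no (:..) is present
        "sort_friendly": match.group("sort") is not None,
        "data": match.group("data").strip(),
    }


print(parse_value("(:unkn) Anonymous"))
print(parse_value("   ,  van Gogh, Vincent"))
```

	   A value with no markers, such as "Gibbon, Edward", comes back
	   with its data unchanged, which matches the rule that such a string
	   is pure data.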

1908	7.6.  ERC Element Encoding and Dates

1910	   Some characters that need to appear in ERC element values might
1911	   conflict with special characters used for structuring ERCs, so there
1912	   needs to be a way to include them as literal characters that are
1913	   protected from special interpretation.  This is accomplished through
1914	   an encoding mechanism that resembles the %-encoding familiar to [URI]
1915	   handlers.

1917	   The ERC encoding mechanism also uses `%', but instead of taking two
1918	   following hexadecimal digits, it takes one non-alphanumeric character
1919	   or two alphabetic characters that cannot be mistaken for hex digits.
1920	   It is designed not to be confused with normal web-style %-encoding.
1921	   In particular it can be decoded without risking unintended decoding
1922	   of normal %-encoded data (which would introduce errors).  Here are
1923	   the one-character (non-alphanumeric) ERC encoding extensions.

1925	         ERC       Purpose
1926	         ---     ------------------------------------------------
1927	         %!      decodes to the element separator `|'
1928	         %%      decodes to a percent sign `%'
1929	         %.      decodes to a comma `,'
1930	         %_      a non-character used as syntax shim
1931	         %{      a non-character that begins an expansion block
1932	         %}      a non-character that ends an expansion block

1934	   One particularly useful construct in ERC element values is the pair
1935	   of special encoding markers ("%{" and "%}") that indicates an
1936	   "expansion" block.  Whatever string of characters they enclose will
1937	   be treated as if none of the contained whitespace (SPACEs, TABs,
1938	   Newlines) were present.  This comes in handy for writing long, multi-
1939	   part URLs in a readable way.  For example, the value in

1941	         where: http://foo.bar.org/node%{
1942	                    ? db = foo
1943	                    & start = 1
1944	                    & end = 5
1945	                    & buf = 2
1946	                    & query = foo + bar + zaf
1947	                %}

1949	   is decoded into an equivalent element, but with a correct and intact
1950	   URL:

1952	     where:
1953	      http://foo.bar.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
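
	   A decoder for the one-character extensions can be written as a
	   single left-to-right scan, sketched below in Python.  The function
	   name decode_erc_value is hypothetical.  The sketch assumes that the
	   "%_" shim decodes to nothing, and it leaves every other "%"
	   sequence untouched, including normal web-style %-encoding and the
	   two-letter codes, which are not listed in this section.

```python
SIMPLE_CODES = {"!": "|", "%": "%", ".": ","}


def decode_erc_value(value):
    """Decode the one-character ERC encoding extensions in a value."""
    out = []
    in_block = False  # True while inside a %{ ... %} expansion block
    i = 0
    while i < len(value):
        ch = value[i]
        nxt = value[i + 1] if i + 1 < len(value) else ""
        if ch == "%" and nxt == "{":
            in_block = True
            i += 2
        elif ch == "%" and nxt == "}":
            in_block = False
            i += 2
        elif ch == "%" and nxt == "_":
            i += 2                      # assumption: the shim yields nothing
        elif ch == "%" and nxt in SIMPLE_CODES:
            out.append(SIMPLE_CODES[nxt])
            i += 2
        elif in_block and ch.isspace():
            i += 1                      # whitespace vanishes inside a block
        else:
            out.append(ch)              # includes web-style %XX, kept as is
            i += 1
    return "".join(out)


value = """http://foo.bar.org/node%{
    ? db = foo
    & start = 1
    & end = 5
    & buf = 2
    & query = foo + bar + zaf
%}"""
print(decode_erc_value(value))
```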

1955	   In a parting word about ERC element values, a commonly recurring
1956	   value type is a date, possibly followed by a time.  ERC dates use the
1957	   [TEMPER] format, taking on one of the following forms:

1959	         1999                (four digit year)
1960	         2000 12 29          (year, month, day)
1961	         2000 12 29 235955   (year, month, day, hour, minute, second)

1963	   In dates, all internal whitespace is squeezed out to achieve a
1964	   normalized form suitable for lexical comparison and sorting.  This
1965	   means that the following dates

1967	         2000 12 29 235955           (recommended for readability)
1968	         2000 12 29 23 59 55
1969	         20001229 23 59 55
1970	         20001229235955              (normalized date and time)

1972	   are all equivalent.  The first form is recommended for readability.
1973	   The last form (shortest and easiest to compute with) is the
1974	   normalized form.  Hyphens and commas are reserved to create date
1975	   ranges and lists, for example,

1977	         1996-2000                   (a range of five years)
1978	         1952, 1957, 1969            (a list of three years)
1979	         1952, 1958-1967, 1985       (a mixed list of dates and ranges)
1980	         20001229-20001231           (a range of three days)
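
	   Normalization and the range and list conventions can be sketched
	   in Python as follows.  The function names normalize_date and
	   parse_dates are hypothetical, and representing a single date as a
	   range whose start and end are equal is a choice made for this
	   illustration, not something taken from [TEMPER].

```python
def normalize_date(date):
    """Squeeze out all whitespace to get the normalized form."""
    return "".join(date.split())


def parse_dates(value):
    """Split a date value into a list of (start, end) pairs."""
    items = []
    for part in value.split(","):        # commas separate list items
        start, hyphen, end = part.partition("-")  # a hyphen marks a range
        start = normalize_date(start)
        items.append((start, normalize_date(end) if hyphen else start))
    return items


forms = [
    "2000 12 29 235955",
    "2000 12 29 23 59 55",
    "20001229 23 59 55",
    "20001229235955",
]
print({normalize_date(form) for form in forms})
print(parse_dates("1952, 1958-1967, 1985"))
```

	   Because normalized values of the same precision have the same
	   length, plain string comparison orders them chronologically.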

1982	7.7.  ERC Stub Records and Internal Support

1984	   The ERC design introduces the concept of a "stub" record, which is an
1985	   incomplete ERC record intended to be supplemented with additional
1986	   elements before being released as a standalone ERC record.  A stub
1987	   ERC record has no minimum required elements.  It is just a group of
1988	   elements that does not begin with "erc:" but otherwise conforms to
1989	   the ERC record syntax.

1991	   ERC stubs may be useful in supporting internal procedures using the
1992	   ERC syntax.  Often they rely on the convenience and accuracy of
1993	   automatically supplied elements, even the basic ones.  To be ready
1994	   for external use, however, an ERC stub must be transformed into a
1995	   complete ERC record having the usual required elements.  An ERC stub
1996	   record can be convenient for metadata embedded in a document, where
1997	   elements such as location, modification date, and size - which one
1998	   would not omit from an externalized record - are omitted simply
1999	   because they are much better supplied by a computation.  A separate
2000	   local administrative procedure, not defined for ERCs in general,
2001	   would effect the promotion of stubs into complete records.

2003	   While the ERC is a general-purpose container for exchange of resource
2004	   descriptions, it does not dictate how records must be internally
2005	   stored, laid out, or assembled by data providers or recipients.
2006	   Arbitrary internal descriptive frameworks can support ERCs simply by
2007	   mapping (e.g., on demand) local records to the ERC container format
2008	   and making them available for export.  Therefore, to support ERCs
2009	   there is no need for a data provider to convert internal data to be
2010	   stored in an ERC format.  On the other hand, any provider (such as
2011	   one just getting started in the business of resource description) may
2012	   choose to store and manipulate local data natively in the ERC format.

2014	8.  Advice to Web Clients

2016	   This section offers some advice to web client software developers.
2017	   It is necessarily speculative because it tries to anticipate a series
2018	   of events that might lead to native web browser support for ARKs.

2020	   ARKs are envisaged to appear wherever durable object references are
2021	   planned.  Library cataloging records, literature citations, and
2022	   bibliographies are important examples.  In many of these places URLs
2023	   (Uniform Resource Locators) currently stand in, and URNs, DOIs, and
2024	   PURLs have been proposed as alternatives.

2026	   The strings representing ARKs are also envisaged to appear in some of
2027	   the places where URLs currently appear:  in hypertext links (where
2028	   they are not normally shown to users) and in rendered text (displayed
2029	   or printed).  Internet search engines, for example, tend to include
2030	   both actionable and manifest links when listing each item found.  A
2031	   normal HTML link for which the URL is not displayed looks like this.

2033	          <a href = "http://foo.bar.org/index.htm"> Click Here </a>

2035	   The same link with an ARK instead of a URL:

2037	          <a href = "ark:/14697/b12345x"> Click Here </a>

2039	   Web browsers would in general require a small modification to
2040	   recognize and convert this ARK, via mapping authority discovery, to
2041	   the URL form.

2043	          <a href = "http://a.b.org/ark:/14697/b12345x"> Click Here </a>

2045	   A browser that knows how to make that conversion could also
2046	   automatically detect and replace a non-working NMAH.
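
	   The conversion itself is a simple string operation once an NMAH is
	   known.  The Python sketch below shows it, together with replacing
	   the NMAH in an existing URL.  The function names and the
	   replacement hostport "example.org" are hypothetical, and mapping
	   authority discovery is left out: the NMAH is simply passed in.

```python
def ark_to_url(ark, nmah):
    """Build the URL form of an ARK for a given NMAH (hostport)."""
    if not ark.startswith("ark:/"):
        raise ValueError("not an ARK: " + ark)
    return "http://" + nmah + "/" + ark


def replace_nmah(url, new_nmah):
    """Re-home the ARK embedded in `url` onto a different NMAH."""
    start = url.find("ark:/")
    if start < 0:
        raise ValueError("no ARK found in: " + url)
    return ark_to_url(url[start:], new_nmah)


url = ark_to_url("ark:/14697/b12345x", "a.b.org")
print(url)
print(replace_nmah(url, "example.org"))  # hypothetical working NMAH
```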

2048	   An NAA will typically make known the associations it creates by
2049	   publishing them in catalogs, actively advertising them, or simply
2050	   leaving them on web sites for visitors (e.g., users, indexing
2051	   spiders) to stumble across in browsing.

2053	9.  Security Considerations

2055	   The ARK naming scheme poses no direct risk to computers and networks.
2056	   Implementors of ARK services need to be aware of security issues when
2057	   querying networks and filesystems for Name Mapping Authority
2058	   services, and the concomitant risks from spoofing and obtaining
2059	   incorrect information.  These risks are no greater for ARK mapping
2060	   authority discovery than for other kinds of service discovery.  For
2061	   example, recipients of ARKs with a specified hostport (NMAH) should
2062	   treat it like a URL and be aware that the identified ARK service may
2063	   no longer be operational.

2065	   Apart from mapping authority discovery, ARK clients and servers
2066	   subject themselves to all the risks that accompany normal operation
2067	   of the protocols underlying mapping services (e.g., HTTP, Z39.50).
2068	   As a specialization of such protocols, an ARK service may limit
2069	   exposure to the usual risks.  Indeed, ARK services may enhance a kind
2070	   of security by helping users identify long-term reliable references
2071	   to information objects.

2073	10.  Authors' Addresses

2075	   John A. Kunze
2076	   California Digital Library
2077	   University of California, Office of the President
2078	   415 20th St, 4th Floor
2079	   Oakland, CA  94612-3550, USA

2081	   Fax:   +1 510-893-5212
2082	   EMail: jak@ucop.edu

2084	   R. P. C. Rodgers
2085	   US National Library of Medicine
2086	   8600 Rockville Pike, Bldg. 38A
2087	   Bethesda, MD  20894, USA

2089	   Fax:   +1 301-496-0673
2090	   EMail: rodgers@nlm.nih.gov

2092	11.  References

2094	   [ANVL]     J. Kunze, B. Kahle, et al, "A Name-Value Language", work
2095	              in progress,
2096	              http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf

2098	   [ARK]      J. Kunze, "Towards Electronic Persistence Using ARK
2099	              Identifiers", Proceedings of the 3rd ECDL Workshop on Web
2100	              Archives, August 2003, (PDF)
2101	              http://bibnum.bnf.fr/ecdl/2003/proceedings.php?f=kunze

2103	   [DCORE]    Dublin Core Metadata Initiative, "Dublin Core Metadata
2104	              Element Set, Version 1.1:  Reference Description", July
2105	              1999, http://dublincore.org/documents/dces/.

2107	   [DERC]     J. Kunze, "Dictionary of the ERC", work in progress within
2108	              the Dublin Core Metadata Initiative's Kernel Working
2109	              Group, http://dublincore.org/groups/kernel/

2111	   [DNS]      P.V. Mockapetris, "Domain Names - Concepts and
2112	              Facilities", RFC 1034, November 1987.

2114	   [DOI]      International DOI Foundation, "The Digital Object
2115	              Identifier (DOI) System", February 2001,
2116	              http://dx.doi.org/10.1000/203.

2118	   [ERC]      J. Kunze, "A Metadata Kernel for Electronic Permanence",
2119	              Journal of Digital Information, Vol 2, Issue 2, January
2120	              2002, ISSN 1368-7506, (PDF)
2121	              http://jodi.ecs.soton.ac.uk/Articles/v02/i02/Kunze/

2123	   [Handle]   L. Lannom, "Handle System Overview", ICSTI Forum, No. 30,
2124	              April 1999, http://www.icsti.org/forum/30/#lannom

2126	   [HTTP]     R. Fielding, et al, "Hypertext Transfer Protocol --
2127	              HTTP/1.1", RFC 2616, June 1999.

2129	   [MD5]      R. Rivest, "The MD5 Message-Digest Algorithm", RFC 1321,
2130	              April 1992.

2132	   [NAPTR]    M. Mealling, R. Daniel, "The Naming Authority Pointer
2133	              (NAPTR) DNS Resource Record", RFC 2915, September 2000.

2135	   [NLMPerm]  M. Byrnes, "Defining NLM's Commitment to the Permanence of
2136	              Electronic Information", ARL 212:8-9, October 2000,
2137	              http://www.arl.org/newsltr/212/nlm.html

2139	   [NOID]     J. Kunze, "Nice Opaque Identifiers", February 2005,
2140	              http://www.cdlib.org/inside/diglib/ark/noid.pdf

2142	   [PURL]     K. Shafer, et al, "Introduction to Persistent Uniform
2143	              Resource Locators", 1996,
2144	              http://purl.oclc.org/OCLC/PURL/INET96

2146	   [RFC822]   D. Crocker, "Standard for the format of ARPA Internet text
2147	              messages", RFC 822, August 1982.

2149	   [TELNET]   J. Postel, J.K. Reynolds, "Telnet Protocol Specification",
2150	              RFC 854, May 1983.

2152	   [TEMPER]   J. Kunze, "Temporal Enumerated Ranges", work in progress,
2153	              http://www.cdlib.org/inside/diglib/ark/temperspec.pdf

2155	   [THUMP]    K. Gamiel, J. Kunze, N. Nassar, "The HTTP URL Mapping
2156	              Protocol", work in progress.

2158	   [URI]      T. Berners-Lee, et al, "Uniform Resource Identifiers
2159	              (URI): Generic Syntax", RFC 2396, August 1998.

2161	   [URNBIB]   C. Lynch, et al, "Using Existing Bibliographic Identifiers
2162	              as Uniform Resource Names", RFC 2288, February 1998.

2164	   [URNSYN]   R. Moats, "URN Syntax", RFC 2141, May 1997.

2166	   [URNNID]   L. Daigle, et al, "URN Namespace Definition Mechanisms",
2167	              RFC 2611, June 1999.

2169	12.  Appendix:  ARK Implementations

2171	   Currently, the primary implementation activity is at the California
2172	   Digital Library (CDL),

2174	         http://ark.cdlib.org/

2176	   housed at the University of California Office of the President, where
2177	   over 200,000 ARKs have been assigned to objects that the CDL owns or
2178	   controls.  Some experimentation with ARKs is taking place at JSTOR,
2179	   the Digital Curation Centre, WIPO, and the University of California's
2180	   San Diego, San Francisco, and Berkeley campuses.

2182	   The US National Library of Medicine (NLM) also has an experimental,
2183	   prototype ARK service under development.  It is being made available
2184	   for purposes of demonstrating various aspects of the ARK system, but
2185	   is subject to temporary or permanent withdrawal (without notice)
2186	   depending upon the circumstances of the small research group
2187	   responsible for making it available.  It is described at:

2189	         http://ark.nlm.nih.gov/

2191	   Comments and feedback may be addressed to rodgers@nlm.nih.gov.

2193	13.  Appendix:  Current ARK Name Authority Table

2195	   This appendix contains a copy of the Name Authority Table (a file) at
2196	   the time of writing.  It may be loaded into a local filesystem (e.g.,
2197	   /etc/natab) for use in mapping NAAs (Name Assigning Authorities) to
2198	   NMAHs (Name Mapping Authority Hostports).  It contains Perl code that
2199	   can be copied into a standalone script that processes the table (as a
2200	   file).  Because this is still a proposed file, none of the values in
2201	   it are real.

2203	     #
2204	     # Name Assigning Authority / Name Mapping Authority Lookup Table
2205	     #       Last change:   2006.01.12
2206	     #       Reload from:   http://ark.nlm.nih.gov/etc/natab
2207	     #       Mirrored at:   http://www.cdlib.org/inside/diglib/ark/natab
2208	     #       To register:   mailto:ark@cdlib.org?Subject=naareg
2209	     #       Process with:  Perl script at end of this file (optional)
2210	     #
2211	     # Each NAA appears at the beginning of a line with the NAA Number
2212	     # first, a colon, and an ARK or URL to a statement of naming policy
2213	     # (see http://ark.cdlib.org for an example).
2214	     # All the NMA hostports that service an NAA are listed, one per
2215	     # line, indented, after the corresponding NAA line.
2216	     #
2217	     #       National Library of Medicine
2218	     12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
2219	             ark.nlm.nih.gov USNLM
2220	             foobar.zaf.org UCSF
2221	     #
2222	     #       Library of Congress
2223	     12026:  http://www.loc.gov/xxx/naapolicy.html
2224	             foobar.zaf.org USLC
2225	     #
2226	     #       National Agricultural Library
2227	     12027:  http://www.nal.gov/xxx/naapolicy.html
2228	             foobar.zaf.gov:80 USNAL
2229	     #
2230	     #       California Digital Library
2231	     13030:  http://www.cdlib.org/inside/diglib/ark/
2232	             ark.cdlib.org CDL
2233	     #
2234	     #       World Intellectual Property Organization
2235	     13038:  http://www.wipo.int/xxx/naapolicy.html
2236	             www.wipo.int WIPO
2237	     #
2238	     #       University of California San Diego
2239	     20775:  http://library.ucsd.edu/xxx/naapolicy.html
2240	             ucsd.edu UCSD
2241	     #
2242	     #       University of California San Francisco
2243	     29114:  http://library.ucsf.edu/xxx/naapolicy.html
2244	             ucsf.edu UCSF
2245	     #
2246	     #       University of California Berkeley
2247	     28722:  http://library.berkeley.edu/xxx/naapolicy.html
2248	             berkeley.edu UCB
2249	     #
2250	     #       University of California Los Angeles
2251	     21198:  http://library.ucla.edu/xxx/naapolicy.html
2252	             ucla.edu UCLA
2253	     #
2254	     #       Rutgers University
2255	     15230:  http://rci.rutgers.edu/xxx/naapolicy.html
2256	             rutgers.edu RU
2257	     #
2258	     #       Internet Archive
2259	     13960:  http://www.archive.org/xxx/naapolicy.html
2260	             archive.org IA
2261	     #
2262	     #       Digital Curation Centre
2263	     64269:  http://www.dcc.ac.uk/xxx/naapolicy.html
2264	             dcc.ac.uk DCC
2265	     #
2266	     #       New York University
2267	     62624:  http://library.nyu.edu/xxx/naapolicy.html
2268	             nyu.edu NYU
2269	     #
2270	     #       University of North Texas
2271	     67531:  http://www.library.unt.edu/xxx/naapolicy.html
2272	             unt.edu UNT
2273	     #
2274	     #       Ithaka Electronic-Archiving Initiative
2275	     27927:  http://www.ithaka.org/xxx/naapolicy.html
2276	             ithaka.org ITHAKA
2277	     #
2278	     #       Bibliothèque nationale de France / National Library of France
2279	     12148:  http://www.bnf.fr/xxx/naapolicy.html
2280	             bnf.fr BNF
2281	     #
2282	     #       Princeton University
2283	     88435:  http://diglib.princeton.edu/xxx/naapolicy.html
2284	             princeton.edu PU
2285	     #
2286	     #       University of Washington
2287	     78428:  http://u.washington.edu/xxx/naapolicy.html
2288	             u.washington.edu UW
2289	     #
2290	     #       Archives of Region of Västra Götaland and City of Gothenburg, Sweden
2291	     89901:  http://www.arkivnamnden.org/xxx/naapolicy.html
2292	             arkivnamnden.org AVGG
2293	     #
2294	     #       Northwest Digital Archives
2295	     80444:  http://nwda.wsulibs.wsu.edu/xxx/naapolicy.html
2296	             nwda.wsulibs.wsu.edu NWDA
2297	     #
2298	     #       Emory University
2299	     25593:  http://id.library.emory.edu/xxx/naapolicy.html
2300	             id.library.emory.edu EMORY
2301	     #
2302	     #--- end of data ---
2303	     # The following Perl script takes an NAA as argument and outputs
2304	     # the NMAs in this file listed under any matching NAA.

2306	     #
2307	     # my $naa = shift;
2308	     # while (<>) {
2309	     #       next if (! /^$naa:/);
2310	     #       while (<>) {
2311	     #               last if (! /^[#\s]./);
2312	     #               print "$1\n" if (/^\s+(\S+)/);
2313	     #       }
2314	     # }
2315	     #
2316	     # Create a g/t/nroff-safe version of this table with the UNIX command,
2317	     #
2318	     #       expand natab | sed 's/\\/\\\e/g' > natab.roff
2319	     #
2320	     # end of file
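
The lookup that the commented Perl script performs can also be sketched in Python, for readers who want to see the table format exercised outside Perl. This is a minimal illustration, not part of the specification: the function name lookup_nmahs and the SAMPLE excerpt are assumptions made for the example, the excerpt is shown with this document's leading indentation removed (so that each NAA Number starts in column one, as it would in the file itself), and the sketch simplifies the Perl by ending an entry at the first line that is not indented.

```python
def lookup_nmahs(natab_text, naa):
    """Return the NMA hostports listed under every NAA line matching `naa`.

    Assumes NAA lines start in column one as "<number>:" and that the
    hostport lines that follow are indented, one hostport per line.
    """
    hostports = []
    in_entry = False
    for line in natab_text.splitlines():
        if line.startswith(naa + ":"):
            # Found the NAA line; its hostport lines follow.
            in_entry = True
        elif in_entry and line[:1] in (" ", "\t") and line.strip():
            # First whitespace-delimited token is the hostport;
            # the trailing label (e.g., "CDL") is ignored.
            hostports.append(line.split()[0])
        else:
            # Any non-indented line (such as "#") ends the entry.
            in_entry = False
    return hostports


# Hypothetical excerpt of the table above, with document indentation removed.
SAMPLE = """\
#       National Library of Medicine
12025:  http://www.nlm.nih.gov/xxx/naapolicy.html
        ark.nlm.nih.gov USNLM
        foobar.zaf.org UCSF
#
#       California Digital Library
13030:  http://www.cdlib.org/inside/diglib/ark/
        ark.cdlib.org CDL
#
"""

if __name__ == "__main__":
    for hostport in lookup_nmahs(SAMPLE, "12025"):
        print(hostport)
```

Passing the NAA as a string rather than an integer keeps the match a plain prefix comparison against the start of each line, which is the same test the Perl pattern applies.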

2322	14.  Copyright Notice

2324	   Copyright (C) The Internet Society (2006).  This document is subject
2325	   to the rights, licenses and restrictions contained in BCP 78, and
2326	   except as set forth therein, the authors retain all their rights.

2328	   This document and the information contained herein are provided on an
2329	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2330	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2331	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2332	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2333	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2334	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2336	Expires 23 August 2006
2337	                           Table of Contents

2339	Status of this Document  . . . . . . . . . . . . . . . . . . . . . .   1
2340	Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1
2341	1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .   3
2342	1.1.  Reasons to Use ARKs  . . . . . . . . . . . . . . . . . . . . .   4
2343	1.2.  Three Requirements of ARKs . . . . . . . . . . . . . . . . . .   4
2344	1.3.  Organizing Support for ARKs:  Our Stuff vs. Their Stuff  . . .   5
2345	1.4.  Definition of Identifier . . . . . . . . . . . . . . . . . . .   7
2346	2.  ARK Anatomy  . . . . . . . . . . . . . . . . . . . . . . . . . .   8
2347	2.1.  The Name Mapping Authority Hostport (NMAH) . . . . . . . . . .   8
2348	2.2.  The ARK Label Part - ark:  . . . . . . . . . . . . . . . . . .   9
2349	2.3.  The Name Assigning Authority Number (NAAN) . . . . . . . . . .  10
2350	2.4.  The Name Part  . . . . . . . . . . . . . . . . . . . . . . . .  10
2351	2.5.  The Qualifier Part . . . . . . . . . . . . . . . . . . . . . .  11
2352	2.5.1.  ARKs that Reveal Object Hierarchy  . . . . . . . . . . . . .  12
2353	2.5.2.  ARKs that Reveal Object Variants . . . . . . . . . . . . . .  13
2354	2.6.  Character Repertoires  . . . . . . . . . . . . . . . . . . . .  14
2355	2.7.  Normalization and Lexical Equivalence  . . . . . . . . . . . .  15
2356	3.  Naming Considerations  . . . . . . . . . . . . . . . . . . . . .  16
2357	3.1.  ARKS Embedded in Language  . . . . . . . . . . . . . . . . . .  16
2358	3.2.  Objects Should Wear Their Identifiers  . . . . . . . . . . . .  17
2359	3.3.  Names are Political, not Technological . . . . . . . . . . . .  17
2360	3.4.  Choosing a Hostname or NMA . . . . . . . . . . . . . . . . . .  17
2361	3.5.  Assigners of ARKs  . . . . . . . . . . . . . . . . . . . . . .  19
2362	3.6.  NAAN Namespace Management  . . . . . . . . . . . . . . . . . .  20
2363	3.7.  Sub-Object Naming  . . . . . . . . . . . . . . . . . . . . . .  21
2364	4.  Finding a Name Mapping Authority . . . . . . . . . . . . . . . .  21
2365	4.1.  Looking Up NMAHs in a Globally Accessible File . . . . . . . .  22
2366	4.2.  Looking up NMAHs Distributed via DNS . . . . . . . . . . . . .  23
2367	5.  Generic ARK Service Definition . . . . . . . . . . . . . . . . .  25
2368	5.1.  Generic ARK Access Service (access, location)  . . . . . . . .  26
2369	5.2.  Generic Policy Service (permanence, naming, etc.)  . . . . . .  26
2370	5.3.  Generic Description Service  . . . . . . . . . . . . . . . . .  28
2371	6.  Overview of The HTTP URL Mapping Protocol (THUMP)  . . . . . . .  28
2372	7.  Overview of Electronic Resource Citations (ERCs) . . . . . . . .  31
2373	7.1.  ERC Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  32
2374	7.2.  ERC Stories  . . . . . . . . . . . . . . . . . . . . . . . . .  33
2375	7.3.  The ERC Anchoring Story  . . . . . . . . . . . . . . . . . . .  34
2376	7.4.  ERC Elements . . . . . . . . . . . . . . . . . . . . . . . . .  35
2377	7.5.  ERC Element Values . . . . . . . . . . . . . . . . . . . . . .  37
2378	7.6.  ERC Element Encoding and Dates . . . . . . . . . . . . . . . .  39
2379	7.7.  ERC Stub Records and Internal Support  . . . . . . . . . . . .  41
2380	8.  Advice to Web Clients  . . . . . . . . . . . . . . . . . . . . .  41
2381	9.  Security Considerations  . . . . . . . . . . . . . . . . . . . .  42
2382	10.  Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . .  42
2383	11.  References  . . . . . . . . . . . . . . . . . . . . . . . . . .  43
2384	12.  Appendix:  ARK Implementations  . . . . . . . . . . . . . . . .  44
2385	13.  Appendix:  Current ARK Name Authority Table . . . . . . . . . .  45
2386	14.  Copyright Notice  . . . . . . . . . . . . . . . . . . . . . . .  48