DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT DRAFT
This is a draft document in progress.
It is subject to change without notice.
Last change date: April 26, 1995
Comments are most welcome.
Contact: Donald Waters
Address: donald_waters@yale.edu
The Social Organization of Archiving Digital Information
- I. Overview of Major Issues
- A. Statement of the problem
- 1. The Task Force on Archiving of Digital Information is concerned
primarily with ensuring that information in digital form endures for
future generations.
- 2. Preserving the media on which information is electronically
recorded is now well understood to be a relatively short-term and partial
solution to the general problem of preserving digital information.
- 3. Even if the media could be physically well-preserved, rapid changes
in the means of recording, in the formats for storage, and in the software
for use threaten to render the life of information in the digital age as,
to borrow a phrase from another arena of discourse on civil society,
"nasty, brutish and short."
- a) Given the threat of technological obsolescence, Michael Lesk has
argued that the preservation of electronic information into the
indefinite future "means copying, not physical preservation." In this
sense, preservation means "refreshing" information from old to new
technologies.
- (1) Devices, processes and software used to record, store, and
retrieve digital information are being replaced with new products and
methods on a 2- to 5-year cycle.
- (2) Backward compatibility between versions of software and
generations of hardware is not assured.
- (3) Interoperability among competing hardware and software product
lines is not assured.
- (4) Therefore, "copying" for preservation purposes is more complex
than replicating digital information on new storage media.
- b) Jeff Rothenberg has recently suggested another possible solution:
Create and archive emulators of software and operating systems which
allow the content of digital information objects to be carried forward
and used in its original format.
- 4. Given various technical options, however, preserving electronic
information is not only, or even primarily, a technical matter, as anyone
knows who has participated in the copying, say, of a bibliographic or
corporate financial database from one on-line system to another.
- B. Some issues are important but fall outside the purview of the Task
Force.
- 1. Questions of intellectual judgment -- what information to eschew
and what to carry forward in what structure and format -- are always among
the more difficult issues in creating and maintaining an archive.
- a) Some principles that would influence the selection of materials
in electronic form are treated below in this report.
- b) But selection is an issue common to all archiving functions.
- c) Moreover, selection criteria cannot be generalized because they
are dependent on the goals and policies of each archive and so are
highly archive-specific.
- d) In general, the Task Force views the criteria for selecting
digital materials for archiving as beyond the scope of its work.
- 2. The Task Force also regards as out of scope issues pertaining to
the preservation of materials in analog form.
- a) We recognize, however, that materials in analog form may be
converted and need thereafter to be preserved in digital form.
- b) The principles and practices established here will no doubt
inform responsible parties who are converting information objects from
analog to digital form.
- c) Moreover, interoperability between analog and digital
preservation systems is an important goal.
- C. Mobilizing the will and resources of a variety of agents to a common
good.
- 1. How can we build a meaningful, dynamic organizational framework for
these agents to work together productively?
- 2. What are the principles, economic incentives and contractual
relationships that might serve to create an environment most conducive to
the preservation of digital information into the indefinite
future?
- D. Among the many factors conditioning intellectual judgment and the
application of technological wizardry in the archiving of digital
information is, of course, the social organization of information, its
generation and use.
- 1. We intend to observe the 80/20 rule
- a) Our objective is not to resolve all issues associated with the
preservation of all kinds of digital information.
- b) We want to address the most vexing issues that currently stand in
the way of an organized approach to digital archiving.
- c) We hope to articulate enough basic principles to stimulate
progress, but not so many as to stifle it
- 2. We advance a four-part approach to the social organization of
archiving digital information.
- a) How is the emerging digital information environment changing the
life cycle of information and the identity and role of the various
stakeholders in that information?
- b) What features of electronic information objects most affect our
ability to archive them?
- c) What kinds of roles, functions and other organizational
structures are needed to preserve those objects?
- d) What infrastructures are needed to enable digital
preservation?
- II. The Digital Environment: Stakeholders in the Information Life
Cycle
- A. Critical factors in the emerging digital environment
- 1. Digital networks are central to information access and distribution
- a) Assume the existence of reliable, secure, high-bandwidth networks
- b) Specific policy decisions regarding pricing, security and
extension of networks, however, will greatly affect the viability of
efforts to preserve digital information residing on the
networks.
- 2. Digital technologies facilitate the reclamation and reuse of
information objects.
- a) Ease of reuse may increase value of archive to all stakeholders
- b) The creator/publisher may have sufficient incentive to maintain
its own archive
- c) Need to plan for cases where there is little incentive for
preserving digital information objects
- d) Easily reused digital objects may complicate selection decisions,
as well as indexing and cataloging
- 3. Digital technologies also provide greater flexibility in the
distribution of information
- a) Emergence of consortial arrangements to explore distribution
mechanisms among various partners in the information cycle
- (1) Role of RLG/DPC
- (2) Rise of regional consortia
- b) Distribution of responsibilities for collecting information as a
basis for archiving function
- 4. Digital technologies increasingly are serving to integrate the
delivery of information in various media
- a) Digital storage and transmission of voice, text, images and video
introduces a common layer of technology.
- b) Integration affects user behavior.
- c) Integration also gives rise to new (e.g. multimedia) information
objects that need to be preserved.
- d) These objects in many cases will exist only in digital
form
- B. Stakeholders in the information life cycle
- 1. Stakeholders include the following
- a) Author/creator, Publisher, Distributor, Library/Archive,
Reader/Consumer
- b) In the digital environment, the traditional relationships among
these identities are shifting and new stakeholders will emerge.
- 2. There are various models to represent these shifting relationships
and to suggest where archiving might fit in the emerging digital
environment. These models cannot well be represented linearly.
- a) The traditional model of the information life cycle emphasizes
the tension between copyright and fair use.
- (1) Provider --> publisher --> distributor
- (2) From the distributor the information may flow directly to the
reader or to a library and then to the reader.
- (3) One function of the library is to assure fair use
- b) Fair use analysis depends on a static information object to which
the fair use tests can be applied; the traditional model therefore does
not apply in the digital environment, where objects are often dynamic.
- c) Alternative models posited for the digital environment generally
are silent on fair use but assume copyright or other rights for some
objects.
- d) These alternative models emphasize other values in the
information life cycle.
- (1) Direct to reader: information costs are so low that
information does not need to pass through central repositories.
- (2) Author as publisher: barriers to publication are so low that
authors can publish directly (e.g. as pre-prints)
- (3) Library as publisher/distributor: libraries are seen as
archives for publishing on demand.
- III. Information objects
- A. Digital information objects have attributes that are structured in
multiple dimensions and which influence the technical, economic and
operational characteristics of information use.
- B. Among the attributes that are critical to digital preservation are
the following:
- 1. Source and mode of distribution
- a) Individually owned materials (such as mail, notes, manuscripts,
preprints, databases, etc.); corporately owned materials (employment and
financial records, planning documents, reports, etc.); materials issued
by publishers (books, serials, films, recordings, etc.); and materials
from libraries, museums and other educational institutions.
- b) Personal and organizational archives require an information
management infrastructure for ongoing preservation
- c) Publishers maintain an infrastructure that may or may not meet
preservation requirements
- 2. Context
- a) Meta-information (bibliographic catalogs, indices, data
dictionaries, directory systems, etc.); as opposed to
- b) The documents and other objects to which they refer, such as
monographic and serial texts, graphic and photographic images, sound
recordings, data collections, software-dependent data objects (GIS and
CAD)
- 3. Encoding of structure, format and content may vary and affect both
use and ability to archive
- a) All features of information objects may be encoded in proprietary
software that runs on specific operating systems
- b) They may be encoded in standard formats: ASCII, TIFF, etc.
- c) They may have self-defining qualities: SGML with DTD; data and
codebooks
- d) Multimedia objects may incorporate all of these qualities
- 4. The attributes may be dynamic in various ways:
- a) Information objects may be revised and updated so that there are
instances, versions or editions of the object
- b) Information objects may change cumulatively or interactively, as
in contributions to a listserv
- c) Information objects may be dynamic in the various views one has
or takes of the information
- d) Information objects may change in the linkages made among them
- (1) For objects that are dynamic in this way, like WWW pages,
there appears to be no good archiving solution other than to take
periodic snapshots or to archive everything
- (2) Solution only for component pieces
- 5. Perhaps the most important attribute is use
- a) What observers seem most to fear is that digital information
objects that are no longer used will simply be deleted without
consideration for future use and without being made available for
archiving by some other custodian.
- b) Put a different way: the best insurance that information will
endure is use, but use is not a sufficient criterion for continuous
archiving
- c) Put another way: a digital archive serves as a safety net for
user demand.
- IV. Roles, Functions and Organization for Archiving
- A. Hypotheses about effective organization for archiving digital
information follow from the preceding analysis.
- 1. Effective management of relationships among stakeholders is the key
to successful archiving
- 2. The distributed network of information suggests the need for
distributed responsibility for archiving the information
- a) Stakeholders will invoke a variety of consortial models as the
emerging digital environment gives rise to new ways of interacting and
dividing labor and responsibility
- b) Collaborative models will likely include partnerships,
federations, contractor/subcontractor relationships, etc.
- c) Organizations will form around intellectual discipline, types of
material, functional role such as storage or cataloging, and across
regional, national or international boundaries
- d) As the digital preservation environment takes shape the most
effective organizational structures will likely be those that are agile
and bear the least overhead
- 3. Given that digital technologies facilitate the reuse of information
objects and given that the best insurance that information will endure is
use, it seems plausible to suggest the following tiered structure of
responsibility:
- a) Information creators/providers/owners have initial responsibility
to provide for the archiving of digital information.
- b) The creator/provider/owner may engage libraries and archives to
take over some or all of the archival responsibility
- (1) Libraries and archives may also interact with
creators/providers as subcontractors for maintaining an archive even
during the active life of information objects
- (2) Libraries and archives may exercise an aggressive rescue
function to preserve information objects that become endangered
because the creator/provider/owner no longer takes responsibility for
the archiving function and does not take steps formally to hand it
over
- c) Libraries and archives would assume responsibility for selecting
and archiving material for which there is no natural institutional
home
- 4. The organization for archiving ought to be designed to accommodate
information objects that are self-describing; that is, each object is
packaged with information about what it is, what is needed to use it
effectively and how to use it
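As a minimal sketch of what a self-describing information object might look like, the following Python fragment packages content together with statements of what the object is, what is needed to use it, and how to use it. The class name, field names, identifier and the choice of JSON as a packaging format are all illustrative assumptions, not part of the Task Force's proposal.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SelfDescribingObject:
    """Hypothetical package: content plus the description needed to use it."""
    identifier: str      # assumed naming scheme, for illustration only
    what_it_is: str      # description of the object
    needed_to_use: list  # software, formats or codebooks required
    how_to_use: str      # instructions for a future user
    content: str         # the information itself

    def package(self) -> str:
        # Serialize description and content together as one unit.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @staticmethod
    def unpack(text: str) -> "SelfDescribingObject":
        # Rebuild the object from its packaged form.
        return SelfDescribingObject(**json.loads(text))


obj = SelfDescribingObject(
    identifier="example-0001",
    what_it_is="Survey data file with accompanying codebook",
    needed_to_use=["ASCII text reader", "codebook (included in how_to_use)"],
    how_to_use="Columns 1-4 hold the respondent id; column 5 holds the answer.",
    content="00011\n00022\n",
)

packaged = obj.package()
restored = SelfDescribingObject.unpack(packaged)
```

Because the description travels inside the package, a later custodian who receives only `packaged` can recover both the content and the instructions for using it.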
- B. A commitment to enduring access is a defining feature of a digital
archive and is fulfilled in practice by the exercise of these critical
functions:
- 1. Managing the operating environment, which consists of the following
areas of responsibility
- a) Storage of the copy of record
- (1) Storage may be on-line, near-line, or off-line but must be
accessible when needed
- (2) Storage practice may support just-in-case as well as
just-in-time distribution strategies
- b) Access policies
- (1) Level of access
- (2) Nature of access
- c) Connectivity
- d) Description
- (1) In order to decide what to preserve, you've got to decide what
"what" is
- (2) Presume that provider/publisher will provide a basic set of
metadata
- (3) The archiving system must support generating and managing
common metadata from multiple objects
- e) Retrieval
- (1) You do not have what you cannot retrieve
- (2) Develop and maintain mechanisms for searching metadata and
information objects
- f) Assure authority and provenance via cryptographic techniques
- g) Capacity
- (1) Are there sufficient access points?
- (2) Does the archive meet current computational standards,
including user display capabilities?
- 2. Managing the migration of the archive as the operating environment
changes.
- a) Working Definition: Migration is the periodic transfer of digital
information from one hardware/software configuration to another or from
one generation of technology to a subsequent generation in order to
retain the ability to access, display, retrieve, manipulate, and use the
information.
- b) Migration is different from copying. Copying is transferring the
same bit stream from one medium or storage device to another.
- c) In some migrations, it may not be possible to migrate an exact
"replica" or "copy" of the original object and still retain software
compatibility.
- d) Changing hardware and software will drive the need for
migration.
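The difference between copying and migration can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the "migration" is a change of character encoding and that a SHA-256 digest stands in for comparing bit streams; the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    # Fingerprint of a bit stream; equal digests mean identical bits.
    return hashlib.sha256(data).hexdigest()


def copy_bits(data: bytes) -> bytes:
    # Copying: the same bit stream is transferred to a new medium.
    return bytes(data)


def migrate_latin1_to_utf8(data: bytes) -> bytes:
    # Illustrative migration: the content is re-expressed in a new format,
    # so the bit stream changes even though the text is retained.
    return data.decode("latin-1").encode("utf-8")


original = "R\u00e9sum\u00e9 of holdings".encode("latin-1")
copied = copy_bits(original)
migrated = migrate_latin1_to_utf8(original)

same_bits_after_copy = digest(copied) == digest(original)
same_bits_after_migration = digest(migrated) == digest(original)
same_text_after_migration = (
    migrated.decode("utf-8") == original.decode("latin-1")
)
```

In this sketch the copy is bit-for-bit identical to the original, while the migrated version is not an exact replica yet still yields the same text. Real migrations may also lose or alter functionality, which a digest comparison cannot detect.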
- 3. Managing the costs of the operating environment of the archive and
of periodic migrations.
- a) The ability to estimate/predict the costs of operation and of
migration will be an important factor in planning and resource
allocation (and possibly selection).
- b) Costs of the operating environment will likely vary over time:
- (1) The principal cost factors are those associated with storage,
use, property rights transactions, and the systems engineering needed
to maintain the distributed infrastructure.
- (2) Storage costs need to be managed as an amortized capital cost
and will likely continue to decline both absolutely and relative to
the other cost factors.
- (3) The costs of access and of property rights transactions are
relatively high because the supporting systems are highly immature (or
non-existent); these systems are developing very rapidly and their
relative costs will fall.
- (4) In the long run, the primary cost factor in the management of
digital archives will likely be the costs of systems engineering to
support the highly distributed network-based functions needed to
operate a digital archive effectively.
- c) Costs of migration will vary depending on:
- (1) complexity of original data structures
- (2) frequency of migration (e.g. the life cycle of software and
how vendors are positioning themselves in different application
environments).
- (3) the extent to which the functionality for computation,
display, indexing, linkage, etc. must be migrated in addition to
content.
- (4) the need to compensate for acquisition or intellectual
property rights.
- d) We have little reliable data on these costs and little experience
in managing them.
- e) There are unresolved issues regarding the distribution of costs
(e.g. whether you charge to recover migration costs; and who you charge,
etc.).
- C. Organizational mechanisms to facilitate distributed responsibilities
for archiving
- 1. Is there a need for a central repository, like the Iron Mountain
facility for master microfilm copies?
- 2. Migration Strategies
- a) There are a variety of migration strategies, none of which is
entirely satisfactory or universally applicable.
- (1) Migration strategies may vary in different application
environments, for different types of material and depending on the
need to preserve various levels of functionality (computational,
display, indexing, etc.).
- (2) Our community is only beginning to address migration issues
and our experience is limited in terms of technical feasibility,
costs, benchmarks, etc.
- (3) Migration should become more effective as the community
matures, gains experience, and learns how to select appropriate
migration strategies.
- (4) We still need to work on some useful ways to break digital
preservation into several specific types of material and then refine
ideas for migration strategies in these scenarios.
- b) Strategy 1: Migrate digital materials from less stable to more
stable media and/or from formats that are highly software dependent to
formats that are less software-intensive.
- (1) This strategy is most commonly implemented by printing to
paper or microfilm.
- (2) This strategy is also used for some digital materials (keeping
ASCII text or delimited ASCII data files) when retaining content is
paramount or when display, computational, indexing and other
functionality is not critical.
- (3) This is a feasible and cost-effective strategy for a certain
slice of digital materials because it eliminates the need for future
migrations or it reduces migration to simple copying.
- (4) As long as we lack skills, standards, and more robust
strategies to avoid this hybrid solution, printing to paper or film
will remain a migration strategy for certain types of materials in
many institutions.
- (5) Many types of digital materials are not amenable to this
strategy (e.g. how do you microfilm a database or print out a
full-motion video?)
- (6) Migration strategies developed for digital preservation may be
applicable to business environments where there is a desire to reduce
or eliminate paper documents.
- c) Strategy 2: Migrate digital materials from the multiplicity of
formats present at any time to a smaller number of common formats.
- (1) Subsequent migrations will involve economies of scale and
fewer customized transformations.
- (2) Development of standard interchange formats that all documents
can head toward may be a more cost-effective approach (such as OpenDoc
or SQL/CCL for databases).
- (3) Wide scale adoption of standards may be difficult to achieve
because vendors will determine whether open systems are desirable.
- (4) The need within communities for interchange or sharing of
documents will drive interchange formats and standards, not the need
for preservation (e.g. current trends within the GIS community or
within the business community around EDI). We won't solve the
compatibility problem, but we should take advantage of these trends.
- (5) Make sure that institutional hardware/software platforms
comply with standards or common configurations (e.g. Don't install
Apple in the library or archives when the rest of the organization is
using Windows; also try to keep hardware/software platforms in the
library or archives on the same generation of technology as the rest
of the organization.)
- d) Strategy 3: Develop/impose standards
- (1) Look to common usage rules; Adopt de facto standards or
commonly used packages as the only acceptable formats for
preservation.
- (2) This strategy is not likely to succeed in many environments
where imposition of standards is viewed as a limitation on freedom of
choice.
- e) Strategy 4: Work with industry to develop backward compatibility
paths as a standard feature in all software
- (1) Where migration paths are not commonly included in software
packages (such as between software product lines), raise user
awareness of the need for a migration path for: new versions,
different vendors, vendors that go out of business
- f) Strategy 5: Develop "processing centers" that can handle
migration and reformatting of materials in obsolete formats
- (1) Even if we succeed with standards and migration paths, there
is a large body of materials in non-standard formats (and such
materials are likely to continue to be produced).
- (2) Processing centers should be established that specialize in
conversions of materials in one or a few obsolete formats (e.g. text,
certain types of databases, GIS, CAD, multimedia)
- (3) Technical strategies might involve maintaining obsolete
hardware/software to provide the look and feel of the original
material
- (4) Processing centers might develop software emulators or
retrospective migration programs
- (5) This approach would take advantage of economies of scale and
maximize use of expertise
- (6) Possible models are consortia of institutions, regional
centers, commercial firms (similar to services that convert old movies
to current video formats), and national labs (e.g. establish a national
hardware and software laboratory like the National Media Lab).
- (7) We have very limited experience with the technical feasibility
of this approach
- 3. Registries for distribution of functions
- a) Registry of archives
- (1) Archives must be self-describing, so that users can
interrogate them on-line and understand how they are organized
- (2) Rather than a standard set of elements or a standard
organization, we need instead a standard method for declaring the
existence of an archive and describing what it contains and what
services it provides.
- (3) Perhaps a tool like the finding aid might satisfy this
requirement
- b) Registries of locations
- (1) Distinction between Uniform Resource Name (URN) and Uniform
Resource Locator (URL)
- (2) Need for indirection between the name and location so that
there can be multiple instances of the object and the underlying
location can change without affecting pointers to the name.
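The indirection between name and location can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LocationRegistry` class, its method names, the example name and the `example.org` addresses are all hypothetical and do not describe any existing resolution service.

```python
class LocationRegistry:
    """Hypothetical registry mapping a stable name to current locations."""

    def __init__(self):
        self._locations = {}  # name -> list of locations

    def register(self, name, location):
        # Record one more instance of the named object.
        instances = self._locations.setdefault(name, [])
        if location not in instances:
            instances.append(location)

    def relocate(self, name, old_location, new_location):
        # Change where an instance lives; the name itself is untouched.
        instances = self._locations[name]
        instances[instances.index(old_location)] = new_location

    def resolve(self, name):
        # Return a copy of every currently known location for the name.
        return list(self._locations.get(name, []))


registry = LocationRegistry()
name = "urn:example:task-force-draft"
registry.register(name, "http://archive-a.example.org/drafts/1995-04-26")
registry.register(name, "http://archive-b.example.org/mirror/draft")

# A citation stores only the name, never a location.
citation = name

# One archive reorganizes its storage; the citation does not change.
registry.relocate(
    name,
    "http://archive-a.example.org/drafts/1995-04-26",
    "http://archive-a.example.org/v2/drafts/1995-04-26",
)
current_locations = registry.resolve(citation)
```

Pointers held by users refer to the name, so multiple instances can exist and any of them can move without breaking those pointers; only the registry entry needs updating.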
- c) Registry of ownership
- (1) Means for transacting rights clearance
- (2) Principle: clarity in legal rights will drive decisions to
preserve
- d) Registry of bootstrapping tools
- (1) DTDs, codebooks
- (2) Licensed software and operating
systems
- V. Enabling Infrastructure: Factors affecting the efficacy and
durability of the organizational model
- A. Scale and timing
- 1. We are in the early stages of this business
- 2. Current migration activities are sub-critical in terms of scale;
larger scale activities that better apply or distribute expertise may be
more economically feasible
- 3. The ability to spread migration costs across multiple users may
also make migration more cost-effective.
- 4. We need benchmarks to measure what migration costs, when it is
worthwhile, and where we are improving
- 5. How does emulation compare with copying or refreshing digital
information? Are there other technical solutions that we can imagine or
propose?
- B. What market, contractual and other kinds of support are necessary to
facilitate the transfer of archival responsibility?
- 1. Legal protections, similar to the preservation functions under
present copyright law, are needed to enable libraries/archives to assume
responsibility for preserving digital information objects in the event
that the creator/provider/owner no longer takes responsibility for the
archiving function and does not take steps formally to hand it over
- 2. To the extent that present motivation to maintain archives arises
from the principle of fair use, how is the principle maintained in the
model?
- C. What financial incentives might motivate the archival enterprise?
- 1. Financial support through life cycle budgeting, access charges etc.
- 2. Investment in digital technologies calls for capital budgeting
techniques
- 3. Tax implications for providers/publishers building digital stock
rather than paper stock
- D. Feasibility of registries
- 1. Transactions for rights permissions and payments
- 2. Bootstrapping techniques, particularly with obsolete software and
operating systems
- E. Search engines and metadata
- F. Links to other subsystems inside and outside the infrastructure (e.g.
commercial repositories, GILS, analog archives, etc.)
- VI. Recommendations
- A. Identify best practices in other environments and learn from them.
- 1. There are many situations where digital preservation will be
handled/solved by others to meet their own business needs (e.g. in
government, industry, medicine, etc.).
- 2. We should focus efforts on areas where solutions in industry,
government, etc. will not meet the needs of our users for continuing
access to digital information; e.g. identify where migration will not
occur if we do not intervene.
- B. Encourage the creators of digital materials to keep digital
information in a live, native software environment as long as possible.
- 1. We need to concentrate on migration strategies for distributed
archives if and when the original custodian goes out of business or has no
continuing interest in preserving digital materials.
- C. Develop strategies to encourage/support preservation of software as a
significant intellectual and cultural product in its own right.
- 1. Build alliances with software engineers, computing enterprises,
etc. as sources of both financial support and expertise for software
preservation.
- 2. Create computer hardware/software repositories that are both
museums documenting the significance of computing and working laboratories
for research and migration.
- 3. Integrate software preservation strategies with support for
migration