The Social Organization of Archiving Digital Information
DRAFT     DRAFT     DRAFT     DRAFT     DRAFT     DRAFT     DRAFT     DRAFT



This is a draft document in progress.

It is subject to change without notice.

Last change date:  April 26, 1995



Comments are most welcome.

Contact:  Donald Waters

Address:  donald_waters@yale.edu








I. Overview of Major Issues
A. Statement of the problem
1. The Task Force on Archiving of Digital Information is concerned primarily with ensuring that information in digital form endures for future generations.
2. Preserving the media on which information is electronically recorded is now well understood to be a relatively short-term and partial solution to the general problem of preserving digital information.
3. Even if the media could be physically well-preserved, rapid changes in the means of recording, in the formats for storage, and in the software for use threaten to render the life of information in the digital age as, to borrow a phrase from another arena of discourse on civil society, "nasty, brutish and short."
a) Given the threat of technological obsolescence, Michael Lesk has argued that the preservation of electronic information into the indefinite future "means copying, not physical preservation." In this sense, preservation means "refreshing" information from old to new technologies.
(1) Devices, processes and software used to record, store, and retrieve digital information are being replaced with new products and methods on a 2- to 5-year cycle.
(2) Backward compatibility between versions of software and generations of hardware is not assured.
(3) Interoperability among competing hardware and software product lines is not assured.
(4) Therefore, "copying" for preservation purposes is more complex than replicating digital information on new storage media.
b) Jeff Rothenberg has recently suggested another possible solution: Create and archive emulators of software and operating systems which allow the content of digital information objects to be carried forward and used in its original format.
4. Whatever the technical options, however, preserving electronic information is not only, or even primarily, a technical matter, as anyone knows who has helped copy, say, a bibliographic or corporate financial database from one on-line system to another.
B. Some issues are important but fall outside the purview of the Task Force.
1. Questions of intellectual judgment -- what information to eschew and what to carry forward in what structure and format -- are always among the more difficult issues in creating and maintaining an archive.
a) Some principles that would influence the selection of materials in electronic form are treated below in this report.
b) But selection is an issue common to all archiving functions.
c) Moreover, selection criteria cannot be generalized because they are dependent on the goals and policies of each archive and so are highly archive-specific.
d) In general, the Task Force views the criteria for selecting digital materials for archiving as beyond the scope of its work.
2. The Task Force also regards as out of scope issues pertaining to the preservation of materials in analog form.
a) We recognize, however, that materials in analog form may be converted and need thereafter to be preserved in digital form.
b) The principles and practices established here will no doubt inform responsible parties who are converting information objects from analog to digital form.
c) Moreover, interoperability between analog and digital preservation systems is an important goal
C. Mobilizing the will and resources of a variety of agents to a common good.
1. How can we build a meaningful, dynamic organizational framework for these agents to work together productively?
2. What are the principles, economic incentives and contractual relationships that might serve to create an environment most conducive to the preservation of digital information into the indefinite future?
D. Among the many factors conditioning intellectual judgment and the application of technological wizardry in the archiving of digital information is, of course, the social organization of information, its generation and use.
1. We intend to observe the 80/20 rule
a) Our objective is not to resolve all issues associated with the preservation of all kinds of digital information.
b) We want to address the most vexing issues that currently stand in the way of an organized approach to digital archiving.
c) We hope to articulate enough basic principles to stimulate progress, but not so many as to stifle it
2. We advance a four-part approach to the social organization of archiving digital information.
a) How is the emerging digital information environment changing the life cycle of information and the identity and role of the various stakeholders in that information?
b) What features of electronic information objects most affect our ability to archive them?
c) What kinds of roles, functions and other organizational structures are needed to preserve those objects?
d) What infrastructures are needed to enable digital preservation?
II. The Digital Environment: Stakeholders in the Information Life Cycle
A. Critical factors in the emerging digital environment
1. Digital networks are central to information access and distribution
a) Assume the existence of reliable, secure, high-bandwidth networks
b) Specific policy decisions regarding pricing, security and extension of networks, however, will greatly affect the viability of efforts to preserve digital information residing on the networks.
2. Digital technologies facilitate the reclamation and reuse of information objects.
a) Ease of reuse may increase value of archive to all stakeholders
b) The creator/publisher may have sufficient incentive to maintain its own archive
c) Need to plan for cases where there is little incentive for preserving digital information objects
d) Easily reused digital objects may complicate selection decisions, as well as indexing and cataloging
3. Digital technologies also provide greater flexibility in the distribution of information
a) Emergence of consortial arrangements to explore distribution mechanisms among various partners in the information cycle
(1) Role of RLG/DPC
(2) Rise of regional consortia
b) Distribution of responsibilities for collecting information as a basis for archiving function
4. Digital technologies increasingly are serving to integrate the delivery of information in various media
a) Digital storage and transmission of voice, text, images and video introduces a common layer of technology.
b) Integration affects user behavior.
c) Integration also gives rise to new (e.g. multimedia) information objects that need to be preserved.
d) These objects in many cases will exist only in digital form
B. Stakeholders in the information life cycle
1. Stakeholders include the following
a) Author/creator, Publisher, Distributor, Library/Archive, Reader/Consumer
b) In the digital environment, the traditional relationships among these identities are shifting and new stakeholders will emerge.
2. There are various models to represent these shifting relationships and to suggest where archiving might fit in the emerging digital environment. These models cannot well be represented linearly.
a) The traditional model of the information life cycle emphasizes the tension between copyright and fair use.
(1) Provider --> publisher --> distributor
(2) From the distributor the information may flow directly to the reader or to a library and then to the reader.
(3) One function of the library is to assure fair use
b) Fair use analysis requires a static information object to which the fair use tests can be applied; the traditional model therefore does not apply well in the digital environment.
c) Alternative models posited for the digital environment generally are silent on fair use but assume copyright or other rights for some objects.
d) These alternative models emphasize other values in the information life cycle.
(1) Direct to reader: information costs are so low that information does not need to pass through central repositories.
(2) Author as publisher: barriers to publication are so low that authors can publish directly (e.g. as pre-prints)
(3) Library as publisher/distributor: libraries are seen as archives for publishing on demand.
III. Information objects
A. Digital information objects have attributes that are structured in multiple dimensions and which influence the technical, economic and operational characteristics of information use.
B. Among the attributes that are critical to digital preservation are the following:
1. Source and mode of distribution
a) Sources include individually owned materials (mail, notes, manuscripts, preprints, databases, etc.); corporately owned materials (employment and financial records, planning documents, reports, etc.); published materials (books, serials, films, recordings, etc.); and the holdings of libraries, museums and other educational institutions.
b) Personal and organizational archives require an information management infrastructure for ongoing preservation
c) Publishers maintain an infrastructure that may or may not meet preservation requirements
2. Context
a) Meta-information (bibliographic catalogs, indices, data dictionaries, directory systems, etc.); as opposed to
b) The documents and other objects to which they refer, such as monographic and serial texts, graphic and photographic images, sound recordings, data collections, software-dependent data objects (GIS and CAD)
3. Encoding of structure, format and content may vary and affect both use and ability to archive
a) All features of information objects may be encoded in proprietary software that runs on specific operating systems
b) They may be encoded in standard formats: ASCII, TIFF, etc.
c) They may have self-defining qualities: SGML with DTD; data and codebooks
d) Multimedia objects may incorporate all of these qualities
4. The attributes may be dynamic in various ways:
a) Information objects may be revised and updated so that there are instances, versions or editions of the object
b) Information objects may change cumulatively or interactively, as in contributions to a listserv
c) Information objects may be dynamic in the various views one has or takes of the information
d) Information objects may change in the linkages made among them
(1) For objects that are dynamic in this way, like WWW pages, there appears to be no good archiving solution other than to take periodic snapshots or to archive everything
(2) Solution only for component pieces
5. Perhaps the most important attribute is use
a) What observers seem most to fear is that digital information objects that are no longer used will simply be deleted without consideration for future use and without being made available for archiving by some other custodian.
b) Put a different way: the best insurance that information will endure is use, but use is not a sufficient criterion for continuous archiving
c) Put another way: a digital archive serves as a safety net for user demand.
IV. Roles, Functions and Organization for Archiving
A. Hypotheses about effective organization for archiving digital information follow from the preceding analysis.
1. Effective management of relationships among stakeholders is the key to successful archiving
2. The distributed network of information suggests the need for distributed responsibility for archiving the information
a) Stakeholders will invoke a variety of consortial models as the emerging digital environment gives rise to new ways of interacting and dividing labor and responsibility
b) Collaborative models will likely include partnerships, federations, contractor/subcontractor relationships, etc.
c) Organizations will form around intellectual discipline, types of material, functional role such as storage or cataloging, and across regional, national or international boundaries
d) As the digital preservation environment takes shape the most effective organizational structures will likely be those that are agile and bear the least overhead
3. Given that digital technologies facilitate the reuse of information objects and given that the best insurance that information will endure is use, it seems plausible to suggest the following tiered structure of responsibility:
a) Information creators/providers/owners have initial responsibility to provide for the archiving of digital information.
b) The creator/provider/owner may engage libraries and archives to take over some or all of the archival responsibility
(1) Libraries and archives may also interact with creators/providers as subcontractors for maintaining an archive even during the active life of information objects
(2) Libraries and archives may exercise an aggressive rescue function to preserve information objects that become endangered because the creator/provider/owner no longer takes responsibility for the archiving function and does not take steps formally to hand it over
c) Libraries and archives would assume responsibility for selecting and archiving material for which there is no natural institutional home
4. The organization for archiving ought to be designed to accommodate information objects that are self-describing; that is, packaged with information about what they are, what is needed to use them effectively and how to use them
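A self-describing object of the kind item 4 calls for can be sketched as a small package that carries its own description. This is a minimal illustration only; the class and field names (`SelfDescribingObject`, `what_it_is`, `needed_to_use`, `how_to_use`) and the sample data are hypothetical and do not correspond to any established packaging standard.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SelfDescribingObject:
    """Hypothetical package: content plus the description needed to use it."""
    content: bytes            # the information object itself
    what_it_is: str           # a plain statement of what the object is
    needed_to_use: List[str]  # software, formats, or codebooks required
    how_to_use: str           # brief instructions for a future user

    def manifest(self) -> str:
        """Return the description as JSON, separate from the content bytes."""
        return json.dumps(
            {
                "what_it_is": self.what_it_is,
                "needed_to_use": self.needed_to_use,
                "how_to_use": self.how_to_use,
                "content_length": len(self.content),
            },
            indent=2,
        )


# Illustrative data only.
survey = SelfDescribingObject(
    content=b"id,age\n1,34\n2,51\n",
    what_it_is="Survey data file (illustrative)",
    needed_to_use=["comma-delimited ASCII reader", "codebook"],
    how_to_use="Read as delimited text; consult the codebook for field meanings.",
)
print(survey.manifest())
```

Because the manifest travels with the object, an archive that receives the package can in principle learn what is needed to use it without consulting the original custodian.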
B. A commitment to enduring access is a defining feature of a digital archive and is fulfilled in practice by the exercise of these critical functions:
1. Managing the operating environment, which consists of the following areas of responsibility
a) Storage of the copy of record
(1) Storage may be on-line, near-line, or off-line but must be accessible when needed
(2) Storage practice may support just-in-case as well as just-in-time distribution strategies
b) Access policies
(1) Level of access
(2) Nature of access
c) Connectivity
d) Description
(1) In order to decide what to preserve, you've got to decide what the "what" is
(2) Presume that provider/publisher will provide a basic set of metadata
(3) The archiving system must support generating and managing common metadata from multiple objects
e) Retrieval
(1) You do not have what you cannot retrieve
(2) Develop and maintain mechanisms for searching metadata and information objects
f) Assure authority and provenance via cryptographic techniques
g) Capacity
(1) Are there sufficient access points?
(2) Does the archive meet current computational standards, including user display capabilities?
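Item f) above mentions cryptographic techniques for assuring authority and provenance without naming one. As one hedged illustration, an archive might record a keyed digest when an object is deposited and check it again on retrieval. The sketch below uses HMAC-SHA256 from the Python standard library; the choice of HMAC, the function names `seal` and `verify`, and the key and sample content are assumptions made for the example. A keyed digest only shows that content is unchanged to parties who hold the key; establishing provenance among independent parties would more likely call for digital signatures.

```python
import hashlib
import hmac


def seal(content: bytes, archive_key: bytes) -> str:
    """Return a keyed digest (HMAC-SHA256) of the content."""
    return hmac.new(archive_key, content, hashlib.sha256).hexdigest()


def verify(content: bytes, archive_key: bytes, recorded_seal: str) -> bool:
    """Check that the content still matches the seal recorded at deposit."""
    return hmac.compare_digest(seal(content, archive_key), recorded_seal)


# Hypothetical key and content, for illustration only.
key = b"hypothetical-archive-secret"
deposited = b"Minutes of the 1995 planning meeting"
recorded = seal(deposited, key)

print(verify(deposited, key, recorded))                  # unchanged content
print(verify(deposited + b" (altered)", key, recorded))  # altered content
```

The first check succeeds and the second fails, since any change to the content changes the digest.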
2. Managing the migration of the archive as the operating environment changes.
a) Working Definition: Migration is the periodic transfer of digital information from one hardware/software configuration to another or from one generation of technology to a subsequent generation in order to retain the ability to access, display, retrieve, manipulate, and use the information.
b) Migration is different from copying. Copying is transferring the same bit stream from one medium or storage device to another.
c) In some migrations, it may not be possible to migrate an exact "replica" or "copy" of the original object and still retain software compatibility.
d) Changing hardware and software will drive the need for migration.
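The distinction between copying and migration drawn above can be made concrete with a small sketch. Copying leaves the bit stream identical, so a digest of the copy matches the original. Migration changes the bit stream while trying to retain the content. A change of text encoding stands in here for a change of format; that stand-in, the function names, and the sample text are assumptions chosen for illustration, and real migrations are usually far more involved.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 digest used to compare bit streams."""
    return hashlib.sha256(data).hexdigest()


def copy_bits(data: bytes) -> bytes:
    """Copying: transfer the same bit stream to a new holder."""
    return bytes(data)


def migrate_text(data: bytes, old_encoding: str, new_encoding: str) -> bytes:
    """Migration (illustrative): re-encode text for a newer environment."""
    return data.decode(old_encoding).encode(new_encoding)


# Sample text stored in an older single-byte encoding.
original = "caf\u00e9".encode("latin-1")

copied = copy_bits(original)
migrated = migrate_text(original, "latin-1", "utf-8")

print(digest(copied) == digest(original))    # copy: same bits
print(digest(migrated) == digest(original))  # migration: different bits
print(migrated.decode("utf-8") == original.decode("latin-1"))  # same text
```

In this simple case the content survives intact. Item c) above covers the harder cases, where an exact replica cannot be carried forward and the archive must decide what to retain.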
3. Managing the costs of the operating environment of the archive and of periodic migrations.
a) The ability to estimate/predict the costs of operation and of migration will be an important factor in planning and resource allocation (and possibly selection).
b) Costs of the operating environment will likely vary over time:
(1) The principal cost factors are those associated with storage, use, property rights transactions, and the systems engineering needed to maintain the distributed infrastructure.
(2) Storage costs need to be managed as an amortized capital cost and will likely continue to decline both absolutely and relative to the other cost factors.
(3) The costs of access and of property rights transactions are relatively high because the supporting systems are highly immature (or non-existent); these systems are developing very rapidly and their relative costs will fall.
(4) In the long run, the primary cost factor in the management of digital archives will likely be the costs of systems engineering to support the highly distributed network-based functions needed to operate a digital archive effectively.
c) Costs of migration will vary depending on:
(1) complexity of original data structures
(2) frequency of migration (e.g. the life cycle of software and how vendors are positioning themselves in different application environments).
(3) the extent to which the functionality for computation, display, indexing, linkage, etc. must be migrated in addition to content.
(4) the need to compensate for acquisition or intellectual property rights.
d) We have little reliable data on these costs and little experience in managing them.
e) There are unresolved issues regarding the distribution of costs (e.g. whether you charge to recover migration costs; and who you charge, etc.).
C. Organizational mechanisms to facilitate distributed responsibilities for archiving
1. Is there a need for a central repository, like the Iron Mountain facility for master microfilm copies?
2. Migration Strategies
a) There are a variety of migration strategies, none of which is entirely satisfactory or universally applicable.
(1) Migration strategies may vary in different application environments, for different types of material and depending on the need to preserve various levels of functionality (computational, display, indexing, etc.).
(2) Our community is only beginning to address migration issues and our experience is limited in terms of technical feasibility, costs, benchmarks, etc.
(3) Migration should become more effective as the community matures, gains experience, and learns how to select appropriate migration strategies.
(4) We still need to work on some useful ways to break digital preservation into several specific types of material and then refine ideas for migration strategies in these scenarios.
b) Strategy 1: Migrate digital materials from less stable to more stable media and/or from formats that are highly software dependent to formats that are less software-intensive.
(1) This strategy is most commonly implemented by printing to paper or microfilm.
(2) This strategy is also used for some digital materials (keeping ASCII text or delimited ASCII data files) when retaining content is paramount or when display, computational, indexing and other functionality is not critical.
(3) This is a feasible and cost-effective strategy for a certain slice of digital materials because it eliminates the need for future migrations or it reduces migration to simple copying.
(4) As long as we lack skills, standards, and more robust strategies to avoid this hybrid solution, printing to paper or film will remain a migration strategy for certain types of materials in many institutions.
(5) Many types of digital materials are not amenable to this strategy (e.g. how do you microfilm a database or print out a full-motion video?)
(6) Migration strategies developed for digital preservation may be applicable to business environments where there is a desire to reduce or eliminate paper documents.
c) Strategy 2: Migrate digital materials from the multiplicity of formats present at any time to a smaller number of common formats.
(1) Subsequent migrations will involve economies of scale and fewer customized transformations.
(2) Development of standard interchange formats toward which all documents can be migrated may be a more cost-effective approach (such as OpenDoc or SQL/CCL for databases)
(3) Wide scale adoption of standards may be difficult to achieve because vendors will determine whether open systems are desirable.
(4) The need within communities for interchange or sharing of documents will drive interchange formats and standards, not the need for preservation. (e.g. current trends within the GIS community or within the business community around EDI. We won't solve the compatibility problem, but we should take advantage of it.)
(5) Make sure that institutional hardware/software platforms comply with standards or common configurations (e.g. Don't install Apple in the library or archives when the rest of the organization is using Windows; also try to keep hardware/software platforms in the library or archives on the same generation of technology as the rest of the organization.)
d) Strategy 3: Develop/impose standards
(1) Look to common usage rules; adopt de facto standards or commonly used packages as the only acceptable formats for preservation.
(2) This strategy is not likely to succeed in many environments where imposition of standards is viewed as a limitation on freedom of choice.
e) Strategy 4: Work with industry to develop backward compatibility paths as a standard feature in all software
(1) Where migration paths are not commonly included in software packages (such as between software product lines), raise user awareness of the need for a migration path for: new versions, different vendors, vendors that go out of business
f) Strategy 5: Develop "processing centers" that can handle migration and reformatting of materials in obsolete formats
(1) Even if we succeed with standards and migration paths, there is a large body of materials in non-standard formats (and such materials are likely to continue to be produced).
(2) Processing centers should be established that specialize in conversions of materials in one or a few obsolete formats (e.g. text, certain types of databases, GIS, CAD, multimedia)
(3) Technical strategies might involve maintaining obsolete hardware/software to provide the look and feel of the original material
(4) Processing centers might develop software emulators or retrospective migration programs
(5) This approach would take advantage of economies of scale and maximize use of expertise
(6) Possible models are consortia of institutions, regional centers, commercial firms (similar to services that convert old movies to current video formats), and national labs (e.g. establish a national hardware and software laboratory like the National Media Lab).
(7) We have very limited experience with the technical feasibility of this approach
3. Registries for distribution of functions
a) Registry of archives
(1) Archives must be self-describing, so that users can interrogate them on-line and understand how they are organized
(2) Rather than a standard set of elements or a standard organization, we need instead a standard method for declaring the existence of an archive and describing what it contains and what services it provides.
(3) Perhaps a tool like the finding aid might satisfy this requirement
b) Registries of locations
(1) Distinction between Uniform Resource Name (URN) and Uniform Resource Locator (URL)
(2) Need for indirection between the name and location so that there can be multiple instances of the object and the underlying location can change without affecting pointers to the name.
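The indirection between name and location described above can be sketched as a simple lookup table. The sketch is illustrative only. The class `LocationRegistry`, its methods, and the example name and locations are hypothetical, and it does not implement any actual name-resolution protocol.

```python
from typing import Dict, List


class LocationRegistry:
    """Hypothetical registry mapping a stable name to its current locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, name: str, location: str) -> None:
        """Record one more instance of the named object."""
        self._locations.setdefault(name, []).append(location)

    def relocate(self, name: str, old: str, new: str) -> None:
        """Replace one location; the name that users cite is untouched."""
        locations = self._locations[name]
        locations[locations.index(old)] = new

    def resolve(self, name: str) -> List[str]:
        """Return a copy of all known locations for the name."""
        return list(self._locations.get(name, []))


# Hypothetical name and locations, for illustration only.
registry = LocationRegistry()
name = "urn:example:task-force-draft"
registry.register(name, "ftp://archive-a.example/draft.txt")
registry.register(name, "ftp://archive-b.example/mirror/draft.txt")

# One archive reorganizes its holdings; pointers to the name stay valid.
registry.relocate(
    name,
    "ftp://archive-a.example/draft.txt",
    "http://archive-a.example/reports/draft.txt",
)
print(registry.resolve(name))
```

Citations and links refer only to the name, so multiple instances can exist and any of them can move without breaking references.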
c) Registry of ownership
(1) Means for transacting rights clearance
(2) Principle: clarity in legal rights will drive decisions to preserve
d) Registry of bootstrapping tools
(1) DTDs, codebooks
(2) Licensed software and operating systems
V. Enabling Infrastructure: Factors affecting the efficacy and durability of the organizational model
A. Scale and timing
1. We are in the early stages of this business
2. Current migration activities are sub-critical in terms of scale; larger scale activities that better apply or distribute expertise may be more economically feasible
3. The ability to spread migration costs across multiple users may also make migration more cost-effective.
4. We need benchmarks to measure what migration costs, when it is worthwhile, and where we are improving
5. How does emulation compare with copying or refreshing digital information? Are there other technical solutions that we can imagine or propose?
B. What market, contractual and other kinds of support are necessary to facilitate the transfer of archival responsibility?
1. Legal protections, similar to the preservation functions under present copyright law, are needed to enable libraries/archives to assume responsibility for preserving digital information objects in the event that the creator/provider/owner no longer takes responsibility for the archiving function and does not take steps formally to hand it over
2. To the extent that present motivation to maintain archives arises from the principle of fair use, how is the principle maintained in the model?
C. What financial incentives might motivate the archival enterprise?
1. Financial support through life cycle budgeting, access charges etc.
2. Digital technologies require capital budgeting techniques
3. Tax implications for providers/publishers building digital stock rather than paper stock
D. Feasibility of registries
1. Transactions for rights permissions and payments
2. Bootstrapping techniques, particularly with obsolete software and operating systems
E. Search engines and metadata
F. Links to other subsystems inside and outside the infrastructure (e.g. commercial repositories, GILS, analog archives, etc.)
VI. Recommendations
A. Identify best practices in these environments and learn from them.
1. There are many situations where digital preservation will be handled/solved by others to meet their own business needs (e.g. in government, industry, medicine, etc.).
2. We should focus efforts on areas where solutions in industry, government, etc. will not meet the needs of our users for continuing access to digital information; e.g. identify where migration will not occur if we do not intervene.
B. Encourage the creators of digital materials to keep digital information in a live, native software environment as long as possible.
1. We need to concentrate on migration strategies for distributed archives if and when the original custodian goes out of business or has no continuing interest in preserving digital materials.
C. Develop strategies to encourage/support preservation of software as a significant intellectual and cultural product in its own right.
1. Build alliances with software engineers, computing enterprises, etc. as sources of both financial support and expertise for software preservation.
2. Create computer hardware/software repositories that are both museums documenting the significance of computing and working laboratories for research and migration.
3. Integrate software preservation strategies with support for migration