Position Paper on Archiving of Digital Information: Don Waters, Yale University Library

Some Considerations on the Archiving of Digital Information

Don Waters

Associate University Librarian

Yale University Library

January, 1995

Individuals and organizations today increasingly create, publish and otherwise disseminate information in electronic formats. In addition, much existing information is being converted to electronic formats for a variety of reasons, including improved access. The vigorous flow of information in digital form, which will be essential to a democratic citizenry in the future, depends at least on the following conditions:

  • Authors and publishers must be able to register publicly the existence and location of their intellectual property;
  • Parties to the exchange of information must have confidence that their transactions are secure and confidential;
  • Readers must be able to verify that the attribution of authorship in a document is true and that the copy they are viewing has the same content as the version that the author originally created; and
  • Authors and readers must have access to an accumulated store of knowledge that is preserved from the past and will be preserved into the future.

The Task Force on Archiving of Digital Information is concerned primarily with one of these essential conditions, namely ensuring that information in digital form endures for future generations. The question of preserving or archiving digital information is not a new one and has been explored at a variety of levels over the last two decades. Archivists have perhaps been most acutely aware of the difficulties as they have observed the rapid and widespread shift from the use of typewriters and other analog media to word processors, spreadsheets and other digital means of recording individual and institutional decisions and actions.

Preserving the media on which information is electronically recorded is now well understood to be a relatively short-term and partial solution to the general problem of preserving digital information. Even if the media could be physically well-preserved, rapid changes in the means of recording, in the formats for storage, and in the software for use threaten to render the life of information in the digital age as, to borrow a phrase from another arena of discourse on civil society, "nasty, brutish and short."

Given the threat of technological obsolescence, Michael Lesk has argued that the preservation of electronic information into the indefinite future "means copying, not physical preservation." In this sense, preservation means "refreshing" information from old to new technologies. Or does it? Jeff Rothenberg has recently suggested another possible solution: Create and archive emulators of software and operating systems which allow the content of digital information objects to be carried forward and used in its original format. How does emulation compare with the notion of copying or refreshing digital information? Are there other technical solutions that we can imagine or propose?
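
As a minimal sketch of what "refreshing" could involve at the level of a single file, the following Python fragment copies a file to a new location and confirms that the copy is byte-for-byte identical to the original. The function names (sha256_of, refresh) and the choice of a SHA-256 checksum are illustrative assumptions, not anything proposed in this paper, and the sketch addresses only the copying step; it does nothing about obsolete formats or software.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination (hypothetical 'new medium') and verify it.

    Returns the digest of the verified copy, or raises OSError if the
    copy does not match the original.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"refresh of {source} failed verification")
    return after
```

Recording the returned digest alongside the copy would let a later reader check that the content has not changed since it was refreshed.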

Whatever options are available, preserving electronic information is not only, or even primarily, a technical matter, as anyone knows who has participated in the copying, say, of a bibliographic or corporate financial database from one on-line system to another. Questions of intellectual judgment -- what information to discard and what to carry forward in what structure and format -- are always among the more difficult issues in creating and maintaining an archive. Can we articulate how the issues associated with selecting digital materials for archiving are similar to and different from those associated with archiving materials in analog form? Are the issues of identifying and saving, say, the e-mail messages being generated out there on the Net today by future Nobel Prize winners any different from the issues encountered in identifying and saving the early written correspondence of past prize winners?

Among the many factors conditioning intellectual judgment and the application of technological wizardry in the archiving of digital information is, of course, the social organization of information, its generation and use. What kinds of electronic information objects are being produced and need to be archived for future generations? What are the individual and organizational interests in those objects?

In an analysis of types of information objects, one might usefully begin by distinguishing meta-information (bibliographic catalogs, indices, data dictionaries, directory systems, etc.) from the documents and other objects to which it refers, such as monographic and serial texts, graphic and photographic images, sound recordings, data collections, software-dependent data objects (GIS and CAD), and hyper-media or compound documents, which combine some or all of the other types. Each type of object has distinctive features of content, structure, format and context (i.e., relationship to other objects) that are relevant to long-term preservation. Is such a typology useful for our purposes? Do many or most of the objects we expect to see generated in electronic form cross the boundaries of this typology and so render it useless? Are other factors, such as the changing nature of the object (e.g., a dynamic WWW page), more relevant to consider than format?

A typology of individual and organizational interests in these various information objects, however we best characterize them, might usefully distinguish the following types of archives: those of individually-owned materials (such as mail, notes, manuscripts, preprints, databases, etc.); those of corporately-owned materials (employment and financial records, planning documents, reports, etc.); those of publishers (books, serials, films, recordings, etc.); and those of libraries, museums and other educational institutions. Are these distinctions useful? If so, how can we usefully enhance them? And where among these disparate, sometimes competing, interests can the commitment be found or created to ensure that the information needed by future generations is preserved for their use?

At its most general level, then, the real issue before the Task Force is a question of identifying the ways and means of mobilizing the will and resources of a variety of agents to a common good. How can we build a meaningful, dynamic organizational framework for these agents to work together productively? What are the principles, economic incentives and contractual relationships that might serve to create an environment most conducive to the preservation of digital information into the indefinite future?