ICADL 2007 - LNCS 4822

Archival Tools to Match the Web: Open, International, Comprehensive

Gordon Mohr

Internet Archive, 4 Funston Ave, San Francisco, CA, 94129, USA

Abstract. Together with a number of national libraries, the Internet Archive committed itself in 2003 to international collaboration to create open source tools and standardized formats for web archiving. This project was motivated by our experience as home to over 100 billion archived web resources dating back to 1996, and as a partner to memory institutions building thematic web archives. Resulting tools include the Heritrix archival web crawler/harvester, the Wayback archive browsing service, and the NutchWAX archive full-text index and query utilities. A standard ingest/archival format for web resources called WARC has also been developed. Software with full source code is free to download and reuse, and organizations worldwide have adopted and contributed to these tools. Working with large collections remains a challenge, and the web itself is constantly growing and changing, so we continue to seek international cooperation to expand and improve this web archive tool set.

Keywords: World Wide Web, internet, harvesting, crawling, archives, indexing, search, HTTP, open source, collaboration

LNCS 4822, p. 7 f.

© Springer-Verlag Berlin Heidelberg 2007