Digital Library of Historical Newspapers

by Martin Doerr, Georgios Markakis, Maria Theodoridou


A management system for historical newspapers that supports both digital library functionality and archival management of original newspaper articles is being developed for the needs of the Vikelea Municipal Library of Heraklion. It includes OCR-based page analysis and article clipping, article-level metadata generation, semantic indexing and multifaceted classification of articles using a built-in thesaurus. We aim to improve the classification, completeness and precision of retrieved information - supporting both metadata and full-text searching - and to provide user-friendly Web access.

An important part of the study of historical newspapers consists of classifying and annotating the material so that it can be retrieved more easily in the future. The system has a variety of goals, including supporting the preservation, documentation and study of historical newspapers. It also aims to protect people from exposure to potential health hazards and to assist in the production and dissemination of electronic versions of publications, thereby promoting cultural education.

The structural particularities of digitized newspaper documents pose a significant challenge in creating an efficient digital library system interface. A newspaper page consists of articles (text blocks), pictures and advertisements that refer to a variety of real-world events, activities, actors and/or objects. Consequently, the page itself is not the basic conceptual unit of information and is therefore not suitable for a thorough metadata-based description of the material. Instead, we focused on the notion of the segment as the basic conceptual unit. A segment may consist of one or more parts of the newspaper document that are conceptually related (eg an article, a group of articles or advertisements).
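The segment notion described above can be sketched as a small data structure. This is a minimal illustration, not the system's actual schema: the class names, field names and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Region:
    """A rectangular area on one page image (pixel coordinates, hypothetical layout)."""
    page: int
    x: int
    y: int
    width: int
    height: int


@dataclass
class Segment:
    """One conceptual unit: an article, a group of articles, an advertisement, etc."""
    segment_id: str
    kind: str
    regions: List[Region]                            # a segment may span several parts or pages
    terms: List[str] = field(default_factory=list)   # thesaurus terms assigned by an annotator
    text: str = ""                                   # extracted text, kept for full-text search

    def pages(self) -> List[int]:
        """Return the distinct page numbers this segment touches, in ascending order."""
        return sorted({region.page for region in self.regions})


# An article that starts on page 1 and continues on page 2 (invented sample data).
article = Segment(
    segment_id="seg-0001",
    kind="article",
    regions=[Region(1, 40, 120, 600, 900), Region(2, 40, 60, 600, 300)],
    terms=["harbour works"],
)
print(article.pages())
```

Because the metadata is attached to the segment rather than to the page, an article that continues on another page is still described exactly once.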

The historical newspaper management system implements a 'hybrid' form of classification and searching that combines metadata-based and full-text searching.

The large volume of material that needs to be digitized and classified poses another important challenge. The system will be used to digitize approximately 100,000 pages. Given that each page generally contains between five and twenty articles, we need to create an efficient and flexible interface as well as a mass import/OCR mechanism in order to reduce the time and cost of the digitization process.
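A quick calculation shows the scale these figures imply. The page count and the articles-per-page range come from the text; the annotation time per article is purely a hypothetical parameter, included only to show how such an estimate could be made.

```python
PAGES = 100_000              # approximate number of pages to digitize (from the text)
ARTICLES_PER_PAGE_MIN = 5    # lower bound per page (from the text)
ARTICLES_PER_PAGE_MAX = 20   # upper bound per page (from the text)


def article_range(pages: int, low: int, high: int) -> tuple:
    """Return the (minimum, maximum) number of articles for the given page count."""
    return pages * low, pages * high


def person_hours(articles: int, minutes_per_article: float) -> float:
    """Convert an assumed per-article annotation time into total hours."""
    return articles * minutes_per_article / 60


low, high = article_range(PAGES, ARTICLES_PER_PAGE_MIN, ARTICLES_PER_PAGE_MAX)
print(f"{low:,} to {high:,} articles")

# Hypothetical: two minutes of manual work per article, at the lower bound.
print(f"{person_hours(low, 2.0):,.0f} hours at 2 minutes per article")
```

Even at the lower bound the collection holds half a million articles, which is why small savings per article matter.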

The historical newspaper management system consists of three subsystems: the Digital Library, the Documentation Tool and the Administrator Tool.

The Digital Library deals with the management of the archival catalogue and information on the contents of the newspapers. It therefore supports thematic indexing and classification based on concepts retrieved from appropriate thesauri.

At the core of the Historical Newspaper Digital Library is Fedora, an open-source digital repository system that provides organizations with flexible tools for managing and delivering their digital content. Fedora is jointly developed by Cornell University and the University of Virginia Library.

Historical newspaper management system architecture.

The functionality of the digital repository is enhanced by the SIS Thesaurus Management System, a semantic network used to store, develop and access multiple thesauri and their interrelations under one database schema. The semantic interoperability of the digital repository with the thesaurus management system aids users in classifying and retrieving newspaper articles.
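One way a thesaurus can aid retrieval is by expanding a query term to its narrower terms before matching it against the indexed articles. The sketch below illustrates that general idea only. It does not use the SIS or Fedora interfaces, and the terms, segment identifiers and function names are invented for the example.

```python
from typing import Dict, List, Set

# Hypothetical broader-to-narrower term relations.
NARROWER: Dict[str, List[str]] = {
    "transport": ["shipping", "railways"],
    "shipping": ["harbour works"],
}

# Hypothetical index: segment identifier -> thesaurus terms assigned to it.
SEGMENT_TERMS: Dict[str, Set[str]] = {
    "seg-0001": {"harbour works"},
    "seg-0002": {"railways"},
    "seg-0003": {"olive trade"},
}


def expand(term: str, narrower: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with all of its direct and indirect narrower terms."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:      # guards against cycles in the relations
            continue
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen


def search_by_term(term: str) -> List[str]:
    """Return identifiers of segments indexed with the term or any narrower term."""
    wanted = expand(term, NARROWER)
    return sorted(sid for sid, terms in SEGMENT_TERMS.items() if terms & wanted)


print(search_by_term("transport"))
```

A search for the broad term also finds segments that were indexed only with a more specific term, so annotators can always choose the most precise concept available.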

The Documentation Tool provides an efficient Web-based user interface for the insertion, filing, documentation and classification of material, and follows international standards for information modelling and interoperability.

We have created a flexible, easily deployable and user-friendly Web interface for this system. It enables the researcher to isolate a specific conceptual entity within the document and to create, describe and store the resulting metadata on the fly.

In addition to creating the segment, the system extracts the text contained in the annotated segment of the document and stores it for full-text searching.
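Storing the extracted text per segment makes it possible to build a full-text index over segments rather than pages. The following is a minimal sketch of an inverted index, assuming the text has already been extracted. The tokenizer, function names and sample texts are illustrative assumptions, not the system's implementation.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_index(segment_texts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every token to the set of segment identifiers whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for segment_id, text in segment_texts.items():
        for token in tokenize(text):
            index[token].add(segment_id)
    return index


def full_text_search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the segments that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented sample texts standing in for extracted segment text.
texts = {
    "seg-0001": "New harbour works begin in Heraklion",
    "seg-0002": "Railway plans for Crete discussed",
    "seg-0003": "Harbour traffic in Crete increases",
}
index = build_index(texts)
print(sorted(full_text_search(index, "harbour crete")))
```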

Graphical terminology visualization techniques enable the user to annotate the document according to appropriately developed thesauri. The combination of thesauri visual graphs and auto-complete algorithms significantly reduces the time needed for the creation of metadata and supports the efficient sharing of knowledge among the members of a community of annotators.
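The auto-complete step can be illustrated with a prefix lookup over a sorted list of thesaurus terms. This is only a sketch of one common technique (binary search on a sorted list). The article does not state which algorithm the system uses, and the term list and function name are invented.

```python
from bisect import bisect_left
from typing import List


def complete(prefix: str, sorted_terms: List[str], limit: int = 5) -> List[str]:
    """Return up to `limit` terms that start with `prefix`.

    `sorted_terms` must be lowercase and sorted in ascending order.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)   # first term that is >= prefix
    matches: List[str] = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break                               # sorted order: no later term can match
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Hypothetical thesaurus terms.
TERMS = sorted(["harbour works", "harvest", "health", "railways", "shipping"])
print(complete("ha", TERMS))
```

Because the candidates come from the thesaurus itself, the annotator selects an existing term instead of typing a free-text variant, which keeps the vocabulary consistent across annotators.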

The Administrator Tool allows the mass storage of digitized material (JPEG images) into the digital repository, and the transformation of this material into a format that can be annotated and indexed by the experts via the documentation tool.

The historical newspaper management system is currently being used in the Vikelea Municipal Library of Heraklion to upload a significant part of the historical archive of newspapers and magazines regarding the history of Crete.

Link:
http://www.ics.forth.gr/isl/cci.html

Please contact:
Maria Theodoridou, ICS-FORTH, Greece
Tel: +30 2810 391 731
E-mail: maria@ics.forth.gr