ERCIM News No.26 - July 1996 - SGFI

Language Processing in Document Engineering

by Afzal Ballim, Christine Vanoirbeek and Giovanni Coray

Natural language processing (NLP) techniques are of growing importance to the field of document engineering (DE). Trends in the use of structured multimedia documents favour the application of such techniques in document creation, synthesis, and dissemination, as well as in indexing and retrieval. Research in document engineering at the Laboratoire d'Informatique Théorique (LITH) of the Swiss Federal Institute of Technology in Lausanne (EPFL) makes use of state-of-the-art NLP, and entails direct research in language processing.

The field of document engineering is concerned with the creation and use of documents and document collections. The emergence of large-scale networks has affected many aspects of the document. Collaborative document creation is much enhanced by the ability of authors to interact over networks. Disseminating a document over networks gives the authors feedback from their audience, which may be used to update the document. The availability of large document collections over networks such as the Internet makes it attractive to consider synthesizing documents in reaction to user requests, rather than simply retrieving text from a database. Of course, the indexing of documents for retrieval has itself been brought to the fore by the escalation of interest in the World Wide Web (WWW), where the volume of available information makes effective indexing a necessity.

All of these aspects of document engineering can benefit from NLP techniques. One goal of our work at LITH is to use these techniques to their fullest potential and, we hope, to contribute to the advancement of NLP itself. This aim is reflected in a number of our research projects, such as those described below.

Projects

TALC (Text ALignment and Consistency): The production of multilingual documents is of particular concern in the European setting. It often involves a number of translators (even for translation into a single destination language), and time constraints may severely restrict the checking of translation consistency. Advanced NLP techniques can be applied to construct new tools that address vital issues of quality control for translated documents. Essentially, two kinds of capability are needed:

ways to recognize corresponding parts of texts and translations
ways to evaluate translation quality in corresponding parts.

Sentence boundary marking and the automatic linking of sentences to their translations (alignment) can already be achieved by computers with a high degree of reliability. Linguistic techniques permit the recognition of words and phrases (e.g., technical terms) within the sentences, and thus offer one measure of translation quality: whether a recognized term is rendered consistently across the aligned sentences.
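To make the idea of alignment concrete, here is a minimal sketch of length-based sentence alignment by dynamic programming. It is not the TALC method: it assumes only that a sentence and its translation tend to have similar character lengths, and the bead types and penalty values in `BEADS` are illustrative choices, not figures from the project.

```python
from typing import List, Optional, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences consumed),
# each with an illustrative fixed penalty (assumed values, not from the article).
BEADS = {(1, 1): 0, (1, 0): 50, (0, 1): 50, (2, 1): 10, (1, 2): 10}


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Align two sentence lists by minimising differences in character length."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: cheapest way to align the first i source and j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int]]]] = [
        [None] * (m + 1) for _ in range(n + 1)
    ]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the chosen beads.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Each returned pair groups the source sentences with the target sentences judged to correspond to them; a pair with an empty side marks a sentence left untranslated or inserted. A real aligner would replace the raw length difference with a statistical cost and could also exploit document structure.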

The goal of this project, carried out in collaboration with the Swiss Federal Chancellery and ISSCO, is to extend known techniques to allow for more sophisticated alignment mechanisms that take into account a complex view of a document as a richly structured object. In addition, by elaborating a richer set of linguistic objects and their translation properties, we can establish a range of criteria for translation quality control.

Robust Text Analysis: The ability to deal with large amounts of possibly ill-formed or unforeseen text is one of the principal objectives of current research in NLP, an ability which is particularly necessary for advanced information extraction and retrieval from large textual corpora. This project investigates methods of epitomising the compilation of Definite Clause Grammars (DCGs) in such a way that the grammars can be used for robust parsing of noisy input, or input for which the grammar only gives a partial description. It builds on previous work by us in this domain, and we believe it will have a number of important benefits:

existing grammars written using DCGs can instantly be converted for use in notoriously difficult domains - for example, in the indexing of large text collections to permit intelligent Information Retrieval
grammar writers can develop their grammars in an incremental fashion, with instantaneous feedback on the coverage that their grammars provide of the information being analysed
grammars can be developed to analyse only those portions of the input that are deemed to be important, allowing graceful failure on unimportant portions.
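The last two benefits, coverage feedback and graceful failure, can be illustrated with a small sketch. It is written in Python rather than as a DCG, and it does not reproduce the compilation method studied in the project: the toy grammar `GRAMMAR`, the part-of-speech tags and the function names are all hypothetical. The parser tries the grammar at each position, skips any token where it fails, and reports what fraction of the input was covered.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical toy grammar: nonterminal -> alternative right-hand sides.
# Symbols that are not keys are treated as terminal tags.
# The grammar must not be left-recursive, or match() would not terminate.
GRAMMAR: Dict[str, List[List[str]]] = {
    "NP": [["DET", "ADJ", "N"], ["DET", "N"], ["N"]],
    "PP": [["P", "NP"]],
    "CHUNK": [["NP", "PP"], ["NP"]],
}


def match(symbol: str, tokens: List[Token], pos: int) -> Optional[int]:
    """Return the furthest end position reached for `symbol` at `pos`, or None.

    Each sub-symbol is matched greedily; there is no backtracking into
    shorter matches of a sub-symbol.
    """
    if symbol not in GRAMMAR:  # terminal: compare with the token's tag
        if pos < len(tokens) and tokens[pos][1] == symbol:
            return pos + 1
        return None
    best: Optional[int] = None
    for rhs in GRAMMAR[symbol]:
        end: Optional[int] = pos
        for part in rhs:
            end = match(part, tokens, end)
            if end is None:
                break
        if end is not None and (best is None or end > best):
            best = end
    return best


def robust_parse(
    tokens: List[Token], start: str = "CHUNK"
) -> Tuple[List[Tuple[int, int]], float]:
    """Find non-overlapping chunks and the fraction of tokens they cover."""
    chunks: List[Tuple[int, int]] = []
    covered = 0
    pos = 0
    while pos < len(tokens):
        end = match(start, tokens, pos)
        if end is None:
            pos += 1  # graceful failure: skip a token the grammar cannot use
        else:
            chunks.append((pos, end))
            covered += end - pos
            pos = end
    coverage = covered / len(tokens) if tokens else 0.0
    return chunks, coverage
```

A grammar writer could rerun `robust_parse` after each change to the grammar and watch the coverage figure rise, which is the incremental style of development described above.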

Document Synthesis: One long-term research goal at LITH is the creation of a system for the synthesis of a virtual document from a collection of existing documents in reaction to a complex specification from a user. This specification would largely be made through the use of a dialogue between the user and the system, whereby a complex model of the user would be generated together with an analysis of the expectations of the user. The synthesized document would be highly structured, providing various types of hyperlinks to allow for a thematic navigation of the document by the user.

Our interest in NLP is perhaps particularly evident in this project because of its use of semantic and pragmatic models of the document: we study the intentions behind the document, as well as the applications that could be associated with it. The creation of such models for existing documents requires mechanisms for document understanding, a process which in the long term entails research in almost all aspects of NLP.

We anticipate that the importance of NLP in document engineering will continue to grow as documents become more dynamic and reactive to the needs of the user. We are convinced, therefore, that advances in DE will rely on advances in NLP, and we believe that the requirements of DE will in turn help drive developments in NLP. To this end, we are actively pursuing natural language work within the framework of document engineering in our laboratory.

Please contact:
Afzal Ballim - LITH-EPFL
Tel: +41 21 693 52 34
E-mail: ballim@di.epfl.ch
