ERCIM News No.26 - July 1996 - SGFI
Language Processing in Document Engineering
by Afzal Ballim, Christine Vanoirbeek and Giovanni Coray
Natural language processing (NLP) techniques are of growing importance
to the field of document engineering (DE). Trends in the use of structured
multimedia documents favour the application of such techniques in document
creation, synthesis, and dissemination, as well as in indexing and retrieval.
Research in document engineering at the Laboratoire d'Informatique Théorique
(LITH) at the Swiss Federal Institute of Technology (EPFL) makes use of
state-of-the-art NLP, and entails direct research in language processing.
The field of document engineering is concerned with the creation and use
of documents and document collections. The emergence of large scale networks
has affected many aspects of the document. Collaborative document creation
is much enhanced by the ability of authors to interact over networks. Disseminating
a document over networks gives the authors feedback from their audience
which may be used to update the document. The availability of large document
collections over networks such as the Internet makes it attractive to consider
the synthesis of documents in reaction to user requests, as opposed to simply
retrieving entries from a textual database. Of course, the indexing of documents for
retrieval has itself been brought to the fore by the escalation of interest
in the World Wide Web (WWW), where the volume of available information makes
effective indexing a necessity.
All of these aspects of document engineering can benefit from techniques
in NLP. One goal of our work at LITH is to use these techniques to their
fullest potential and, we hope, to contribute to the advancement of NLP
itself. This goal is reflected in a number of our research projects, such as
those described below.
Projects
TALC (Text ALignment and Consistency): The production of multilingual documents
is of particular concern in the European setting. It often involves a number
of translators (even for translation to a single destination language) and
time constraints may impose tight restrictions on the checking of translation
consistency. Advanced NLP techniques can be applied to construct new tools
that address vital issues of quality control of translated documents.
Essentially, two capabilities are needed:
- ways to recognize corresponding parts of texts and translations
- ways to evaluate translation quality in corresponding parts.
Sentence boundary marking and the automatic linking of sentences to their
translations (alignment) can already be achieved by computers with a high
degree of reliability. Linguistic techniques permit the recognition of words
and phrases (e.g., technical terms) within the sentences and thus offer one
measure of translation quality.
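
As an illustration of what sentence alignment involves, the following is a
minimal sketch of a length-based aligner using dynamic programming. It is not
the TALC method: it assumes that sentence boundaries are already marked, that
character length alone is a usable signal, and that the bead types and
penalties in BEADS (all hypothetical values) are adequate for the texts at hand.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences
# consumed) mapped to a penalty. The penalties are illustrative assumptions.
BEADS = {(1, 1): 0.0, (2, 1): 0.3, (1, 2): 0.3, (1, 0): 1.0, (0, 1): 1.0}


def length_cost(src: List[str], tgt: List[str]) -> float:
    """Relative mismatch in total character length, between 0 and 1."""
    ls = sum(len(s) for s in src)
    lt = sum(len(t) for t in tgt)
    return abs(ls - lt) / max(ls, lt, 1)


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return the cheapest sequence of beads as (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = best[i][j] + penalty + length_cost(src[i:ni], tgt[j:nj])
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end of both texts to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


source = [
    "The committee met on Monday.",
    "It approved the budget.",
    "The next meeting is in June.",
]
target = [
    "Der Ausschuss tagte am Montag und genehmigte den Haushalt.",
    "Das folgende Treffen findet im Juni statt.",
]
for src_ids, tgt_ids in align(source, target):
    print(src_ids, "<->", tgt_ids)
```

In the sample, the translator has merged the first two source sentences into
one, which the 2-1 bead is meant to capture. A practical aligner would also
draw on lexical cues and on document structure rather than on length alone.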
In collaboration with the Swiss Federal Chancellery and ISSCO, the goal
of this project is to extend known techniques to allow for more sophisticated
alignment mechanisms which can take into account a complex view of a document
as a richly structured object. In addition, with the elaboration of a richer
set of linguistic objects and their translation properties, we can establish
a range of translation quality control criteria.
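
One simple quality-control criterion of this kind is terminological
consistency over aligned sentence pairs. The sketch below assumes a bilingual
glossary is available and uses naive lowercase substring matching in place of
real linguistic analysis; the function name, the glossary and the sample
sentences are all hypothetical.

```python
from typing import Dict, List, Tuple


def term_consistency(
    pairs: List[Tuple[str, str]], glossary: Dict[str, str]
) -> Tuple[float, List[Tuple[int, str, str]]]:
    """Check that each glossary term found in a source sentence has its
    expected translation in the aligned target sentence.

    Returns (score, issues), where score is the fraction of term occurrences
    translated as expected and issues lists (pair index, term, expected).
    """
    checked = 0
    issues = []
    for idx, (src, tgt) in enumerate(pairs):
        src_l, tgt_l = src.lower(), tgt.lower()
        for term, expected in glossary.items():
            if term in src_l:  # naive matching: an assumption of this sketch
                checked += 1
                if expected not in tgt_l:
                    issues.append((idx, term, expected))
    score = 1.0 if checked == 0 else 1 - len(issues) / checked
    return score, issues


glossary = {"federal council": "bundesrat", "ordinance": "verordnung"}
aligned = [
    ("The Federal Council adopted the ordinance.",
     "Der Bundesrat hat die Verordnung verabschiedet."),
    ("The ordinance enters into force in May.",
     "Das Reglement tritt im Mai in Kraft."),
]
score, issues = term_consistency(aligned, glossary)
print(score, issues)
```

Here the second pair is flagged because "ordinance" was rendered differently
from the glossary entry. Whether such a deviation is an error is a judgement
for the reviser; the check only points to where to look.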
Robust Text Analysis: The ability to deal with large amounts of possibly
ill-formed or unforeseen text is one of the principal objectives of current
research in NLP, an ability which is particularly necessary for advanced
information extraction and retrieval from large textual corpora. This project
investigates methods of epitomising the compilation of Definite Clause Grammars
(DCGs) in such a way that the grammars can be used for robust parsing of
noisy input, or input for which the grammar only gives a partial description.
It builds on our previous work in this domain, and we believe it will
have a number of important benefits:
- existing grammars written using DCGs can instantly be converted for
use in notoriously difficult domains - for example, in the indexing of large
text collections to permit intelligent Information Retrieval
- grammar writers can develop their grammars in an incremental fashion,
with instantaneous feedback on the coverage that their grammars provide
of the information being analysed
- grammars can be developed to analyse only those portions of the input
that are deemed to be important, allowing graceful failure on unimportant
portions.
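
The idea of partial analysis with graceful failure can be sketched as follows.
This is not the compilation scheme investigated in the project, and it is
written in Python rather than as a DCG: it assumes a toy grammar without left
recursion (GRAMMAR, entirely hypothetical), a greedy matcher without full
backtracking, and a driver that skips any token it cannot analyse while
reporting how much of the input the grammar covers.

```python
from typing import Dict, List, Optional, Tuple

# Toy grammar: each nonterminal maps to alternative right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR: Dict[str, List[List[str]]] = {
    "NP": [["Det", "N"], ["N"]],
    "Det": [["the"], ["a"]],
    "N": [["grammar"], ["parser"], ["input"]],
}


def parse(symbol: str, tokens: List[str], pos: int) -> Optional[int]:
    """Return the end position of the longest match of symbol at pos, or None."""
    if symbol not in GRAMMAR:  # terminal: must equal the current token
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    best = None
    for rhs in GRAMMAR[symbol]:
        end: Optional[int] = pos
        for part in rhs:
            end = parse(part, tokens, end)
            if end is None:
                break
        if end is not None and (best is None or end > best):
            best = end
    return best


def robust_parse(
    tokens: List[str], start: str = "NP"
) -> Tuple[List[Tuple[int, int]], float]:
    """Find every span the grammar accepts, skipping what it cannot analyse.

    Returns (chunks, coverage): chunks are (start, end) token spans and
    coverage is the fraction of tokens inside some chunk.
    """
    chunks = []
    covered = 0
    pos = 0
    while pos < len(tokens):
        end = parse(start, tokens, pos)
        if end is None or end == pos:
            pos += 1  # graceful failure: skip one token and carry on
            continue
        chunks.append((pos, end))
        covered += end - pos
        pos = end
    coverage = covered / len(tokens) if tokens else 1.0
    return chunks, coverage


tokens = "the parser uh reads a noisy input".split()
chunks, coverage = robust_parse(tokens)
print(chunks, round(coverage, 2))
```

The coverage figure gives the grammar writer the kind of feedback described
above: extending the grammar, for instance with adjectives, raises the
fraction of the input that is analysed without requiring a complete grammar
first.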
Document Synthesis: One long-term research goal at LITH is the creation
of a system for the synthesis of a virtual document from a collection of
existing documents in reaction to a complex specification from a user. This
specification would largely be elicited through a dialogue between the user
and the system, from which a complex model of the user would be generated
together with an analysis of the user's expectations. The synthesized
document would be highly structured, providing various types of hyperlinks
to allow for a thematic navigation of the document by the user.
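
To make the notion of a synthesized, thematically linked document more
concrete, here is a deliberately simplified sketch. It assumes that fragments
of existing documents already carry theme labels and that the user model has
been reduced to a set of themes of interest; the Fragment and Section types
and the synthesize function are hypothetical and stand in for the much richer
models the project envisages.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Fragment:
    doc_id: str
    text: str
    themes: Set[str]


@dataclass
class Section:
    fragment: Fragment
    # theme -> indices of other sections sharing that theme (thematic links)
    links: Dict[str, List[int]] = field(default_factory=dict)


def synthesize(collection: List[Fragment], interests: Set[str]) -> List[Section]:
    """Assemble a virtual document from fragments relevant to the user."""
    # Keep fragments sharing a theme with the user's interests, most
    # relevant first; the sort is stable, so ties keep collection order.
    selected = [f for f in collection if f.themes & interests]
    selected.sort(key=lambda f: -len(f.themes & interests))
    sections = [Section(f) for f in selected]
    for i, section in enumerate(sections):
        for theme in sorted(section.fragment.themes & interests):
            related = [
                j for j, other in enumerate(sections)
                if j != i and theme in other.fragment.themes
            ]
            if related:
                section.links[theme] = related
    return sections


collection = [
    Fragment("a", "Sentence alignment basics.", {"alignment", "translation"}),
    Fragment("b", "Indexing web pages.", {"indexing"}),
    Fragment("c", "Checking translated terms.", {"translation", "terminology"}),
]
document = synthesize(collection, {"translation", "alignment"})
for number, section in enumerate(document):
    print(number, section.fragment.doc_id, section.links)
```

Each link maps a theme to the other sections that share it, which is one
plain way to support thematic navigation across the assembled sections.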
Our interest in NLP is perhaps particularly evident in this project because
of the use of semantic and pragmatic models of the document: we study the
intentions behind the document, as well as the applications that could be
associated with it. The creation of such models for existing documents requires
mechanisms for document understanding - a process which in the long term
entails research in almost all aspects of NLP.
We anticipate that the importance of NLP in document engineering will continue
to grow as documents become more dynamic and reactive to the needs of
the user. We are convinced, therefore, that advancements in DE will rely
on those in NLP, and at the same time believe that the requirements of DE
will help drive developments in NLP. To this end, we are actively pursuing
natural language work within the framework of document engineering in our
laboratory.
Please contact:
Afzal Ballim - LITH-EPFL
Tel: +41 21 693 52 34
E-mail: ballim@di.epfl.ch