ERCIM News No.27 - October 1996
A System for Cross-Language Information Retrieval
by Carol Peters and Eugenio Picchi
We describe a system to query comparable corpora, ie collections
of texts in more than one language from a common domain. Given a particular
term or set of terms in the texts in one language, contexts which contain
lexically equivalent or related expressions can be retrieved from the texts
in the other(s). The system has been developed to process Italian/English
texts but could be extended to include other languages. The initial implementation
was made with the requirements of contrastive linguistics in mind; however,
the system could also have applications in the field of bilingual/multilingual
document retrieval.
With the recent rapid spread over the Internet of world-wide distributed
document bases, the question of multilingual information retrieval is becoming
increasingly relevant as the disadvantages of allowing English-only systems
and document collections to dominate the global scene unchallenged are gradually
being recognised. Natural language processing techniques and tools have
already been incorporated into IR processes with varying degrees of success.
We feel that such methodologies have an important role to play in the development
of multilingual document systems in which users can formulate queries in
their preferred languages and retrieve all relevant documents in whatever
language they are stored. Below, we describe a strategy being studied
for comparable corpus querying and explain why we feel this approach can
be extended to cross-language information retrieval applications.
Comparable corpora are sets of texts in pairs (or multiples) of languages
on the same topic or domain. Given a particular term or set of terms in
a domain-specific corpus in one language, the aim is to identify contexts
which contain equivalent or related expressions in a comparable corpus in
another language. We do this by extracting lexical and linguistic knowledge
from the first corpus, and projecting it onto the second. Our starting point
is a basic tenet of corpus linguistics: a word acquires sense from its context.
We thus attempt to isolate the vocabulary related to the terms in the corpus
in one language (L1), hypothesising that lexically equivalent terms
will be associated with a similar vocabulary in the comparable corpus for
the other (L2).
Thus for any term, T, and using a well-known statistical procedure (Church
and Hanks' Mutual Information Index), we calculate its set of significant
collocates in our L1 corpus; the set of lemmas derived makes up the vocabulary,
V1, that characterises T in this particular subdomain corpus. Next, using
our lexical tools (eg English/Italian morphological procedures, a bilingual
lexical database), we construct a corresponding L2 vocabulary of translation
equivalents (V2). Words or expressions that can be considered as lexically
equivalent to our selected term in the L1 texts are then searched for in the
L2 corpus, ie we identify those contexts in L2 in which there is a significant
presence of the L2 vocabulary for T. The significance is determined on the
basis of a statistical procedure that assesses the probability that different
sets of L2 co-occurrences represent lexically equivalent contexts for
T. The L2 contexts retrieved are written to a file and listed in descending
order of relevance to our L1 term.
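The three steps just described can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the window size, the thresholds and the bilingual lexicon format are all assumptions, and the final ranking uses a simple overlap ratio as a stand-in for the statistical procedure mentioned above, whose details are not given in this article. The texts are assumed to be already tokenised and lemmatised.

```python
import math
from collections import Counter


def significant_collocates(sentences, term, window=5, min_mi=1.0, min_count=2):
    """Return {lemma: score} for lemmas that co-occur significantly with `term`.

    The score is the mutual information log2(P(x,y) / (P(x) * P(y))),
    estimated from raw frequencies in the L1 corpus.
    """
    word_freq = Counter()   # f(x): frequency of each lemma
    pair_freq = Counter()   # f(term, y): co-occurrences within the window
    total = 0               # N: corpus size in tokens
    for tokens in sentences:
        total += len(tokens)
        word_freq.update(tokens)
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_freq[tokens[j]] += 1

    scores = {}
    for lemma, f_xy in pair_freq.items():
        if f_xy < min_count:
            continue
        mi = math.log2(f_xy * total / (word_freq[term] * word_freq[lemma]))
        if mi >= min_mi:
            scores[lemma] = mi
    return scores   # the keys form the vocabulary V1


def translate_vocabulary(v1, bilingual_lexicon):
    """Build V2: every L2 equivalent listed for each V1 lemma.

    `bilingual_lexicon` is a hypothetical {l1_lemma: [l2_lemma, ...]} mapping.
    """
    v2 = set()
    for lemma in v1:
        v2.update(bilingual_lexicon.get(lemma, ()))
    return v2


def rank_contexts(l2_contexts, v2, min_hits=2):
    """Rank L2 contexts by the share of V2 they contain (assumed stand-in score)."""
    if not v2:
        return []
    ranked = []
    for context in l2_contexts:
        hits = v2.intersection(context)
        if len(hits) >= min_hits:
            ranked.append((len(hits) / len(v2), context))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

Note that the search term itself is never translated: an L2 context is retrieved because it shares the vocabulary that surrounds the term, so it can be found even if the lexicon lists no equivalent for the term.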
Although these procedures are still in an experimental phase, the first
results are encouraging: we can retrieve contexts which refer to a particular
concept represented in L1 by a given expression (term or set of terms),
even when no known translation equivalent for that expression
is present.
When we began this work our main interest was linguistic; however, we now
intend to extend the procedures to run in a multilingual document query
system. Most current IR systems which include a multilingual component use
a thesaurus in order to search for keywords across languages. However, a multilingual
thesaurus that makes any attempt towards exhaustiveness is difficult and
expensive to construct and maintain. Technical vocabulary is in continual
development as new ideas mature and new processes are introduced. Any thesaurus
must be frequently updated if it is to be useful for query and retrieval
purposes. Even if a thesaurus is well constructed and includes pointers
to semantically (eg synonyms, hyponyms, meronyms) and lexically (eg close
collocates) related items, the users are still obliged to base their query
on a keyword list rather than formulating a fully natural language query.
We thus intend to test two applications of our system: as a method that
can be used when no multilingual thesaurus is available, and as a method for
constructing and/or enriching a multilingual termbank. Our hypothesis is
that a document base consisting of texts on the same topic in more than
one language in itself constitutes a set of comparable corpora. It should
thus be possible to apply procedures such as those outlined here to retrieve
all documents in a second language which contain lexical equivalences to
a term or set of terms searched in a first language even when no multilingual
thesaurus is available. We also intend to test the system as a tool for
the semi-automatic construction of a thesaurus in a second language on the
basis of an existing thesaurus in L1. In this case, the system would be
run for each term in the L1 thesaurus in order to retrieve corresponding
L2 equivalent contexts. The terminologist could then select the relevant
set of L2 (multiword) terms for each L1 item searched. At the same time,
both the L1 and L2 thesauri could be enriched by automatically associating
with each node of each side of the multilingual thesaurus all the significant
collocates characterising that particular term. In this way, we can create
a multilingual search tool which combines the features of a keyword-based
tool with those of our comparable corpus procedure, and thus searches for both pre-identified
multilingual thesaurus terms and also cross-language lexical equivalences.
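A combined search of this kind might look like the following Python sketch. It is only one possible reading of the idea: the `ThesaurusNode` structure, the `search_l2` function and the `min_hits` threshold are hypothetical names and values introduced for illustration, and the L2 contexts are assumed to be lists of lemmas.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusNode:
    term: str                                          # the L1 thesaurus term
    l2_terms: set = field(default_factory=set)         # equivalents selected by the terminologist
    l2_collocates: set = field(default_factory=set)    # significant collocates attached to the node


def search_l2(node, l2_contexts, min_hits=2):
    """Return (match_type, context) pairs, keyword matches first.

    A context matches by keyword if it contains a pre-identified L2 term,
    and by collocates if it contains at least `min_hits` of the node's
    L2 collocates.
    """
    keyword_matches, collocate_matches = [], []
    for context in l2_contexts:
        words = set(context)
        if node.l2_terms & words:
            keyword_matches.append(("keyword", context))
        elif len(node.l2_collocates & words) >= min_hits:
            collocate_matches.append(("collocates", context))
    return keyword_matches + collocate_matches
```

Listing keyword matches first is a design choice of the sketch, not something stated in the article; a real system might merge both kinds of match into a single relevance ranking.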
There should be no problems in extending the system to cover additional
languages, providing the necessary lexical and morphological resources are
available. Any language can be adopted as the starting point, much as is
currently done in the construction of many multilingual thesauri where one
language (usually English) acts as the base. The vocabulary associated with
any term in the corpus for this language (V1) will then be translated into
all the other languages (constructing Vn vocabularies). Each comparable
corpus (or set of documents) will then be searched for contexts with a
significant co-occurrence of lexical items from the relevant target-language
vocabulary for T. This will be the subject of a future study.
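As a rough illustration of this extension, the construction of the Vn vocabularies could be sketched in Python as below. The function name and the layout of `lexicons` (one hypothetical bilingual lexicon per target language, keyed by a language code) are assumptions made for the example.

```python
def build_target_vocabularies(v1, lexicons):
    """Translate the base-language vocabulary V1 into one vocabulary per target language.

    `lexicons` maps a language code to a {base_lemma: [target_lemma, ...]} lexicon.
    Lemmas with no entry in a given lexicon are simply skipped for that language.
    """
    return {
        language: {
            target
            for lemma in v1
            for target in lexicon.get(lemma, ())
        }
        for language, lexicon in lexicons.items()
    }
```

Each resulting vocabulary would then be used to search the corpus for its own language, exactly as V2 is used in the two-language case.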
Please contact:
Carol Peters - IEI-CNR
Tel: +39 50 593 429
E-mail: carol@iei.pi.cnr.it
or Eugenio Picchi - ILC-CNR
Tel: +39 50 560481
E-mail: picchi@ilc.pi.cnr.it