ERCIM News No.26 - July 1996 - INRIA
Analysing Information from Large Documentary Bases - The ILC Project
by Yannick Toussaint and Jean Royaute
The ILC Project (Infométrie, Langage et Connaissance)
is a collaboration between the DIALOGUE Team of the INRIA-Lorraine &
CRIN-CNRS Laboratory in Nancy and the Infometry Research Program of the
INIST-CNRS Laboratory. It aims at partly modelling and structuring the knowledge
written in large documentary bases. This modelling will facilitate information
analysis. The project is part of the ILIAD project, supported by the French
National Cognitive Science Program (GIS 'Science de la Cognition').
The tools and methods currently being developed in the ILC project should
enable a human operator to collect the information content of a text without
reading it sequentially. Information analysis is the step that follows
information retrieval. It is based on methods particular to informetrics,
which use statistical techniques of data analysis. These are combined with
approaches from large-corpus linguistics that identify term structures and
locate them, and the relations between them, in the texts. Techniques from
artificial intelligence are called upon to collect and organise the knowledge
that emerges from these linguistic and statistical processes.
We assume that the major part of the information in technical texts is located
in noun phrases. Therefore, our strategy for analysing information relies
upon performing robust and partial linguistic processing based on term and
noun phrase identification. Combining statistical and linguistic methods,
we search the texts for the conceptual links that exist between terms in
the domain. We pay special attention to the identification of a set of linguistic
connectors and to certain domain-specific predicative structures.
We divided the project into two phases. The first, which is now near completion,
consists in building an automatic process that recognises thesaurus terms in
texts and classifies those texts according to criteria of term co-occurrence.
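As a rough illustration of this first phase, the sketch below spots thesaurus terms in a few texts and counts how often terms occur together in the same text. It is only a minimal sketch: the mini-thesaurus, the sample sentences and the function names are invented for the example, and plain string matching stands in for the tagging and linguistic analysis the project actually relies on.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-thesaurus and corpus, invented for illustration only.
THESAURUS = ["soil erosion", "crop rotation", "irrigation"]
DOCUMENTS = [
    "Crop rotation limits soil erosion on sloping fields.",
    "Irrigation and crop rotation were compared over three seasons.",
    "Soil erosion increased where irrigation was poorly managed.",
]

def find_terms(text, thesaurus):
    """Return the set of thesaurus terms occurring verbatim in the text."""
    lowered = text.lower()
    return {
        term for term in thesaurus
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }

def cooccurrence_counts(documents, thesaurus):
    """Count, per term and per term pair, the documents in which they occur."""
    term_counts, pair_counts = Counter(), Counter()
    for text in documents:
        found = sorted(find_terms(text, thesaurus))
        term_counts.update(found)
        # Pairs are stored in alphabetical order so each pair has one key.
        pair_counts.update(combinations(found, 2))
    return term_counts, pair_counts

term_counts, pair_counts = cooccurrence_counts(DOCUMENTS, THESAURUS)
```

Exact matching of this kind misses inflected forms and term variants, which is precisely why the project adds lemmatisation and variant recognition on top of thesaurus lookup.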
The second phase aims to identify structures in the texts, predicative or not,
that could reveal a conceptual link between two terms. This should lead to the
construction of a knowledge base containing the terms and their conceptual
relations, with the initial thesaurus as its main structure.
Searching for terms in corpora and classifying them
The first phase of the project combines three different stages:
- a probabilistic approach, which relies on morphological and weighted
contextual rules, in order to tag terms from a thesaurus. The resource we
use is the AGROVOC thesaurus, a trilingual thesaurus covering the agricultural
domain. We restored the accents on the French entries using a semi-automatic
procedure. The Brill tagger was then trained on this corpus.
- a computational linguistics approach focusing on the partial treatment
of noun phrases. It relies on the identification of terms and their variation
in corpora. For example, storage of medical data is a variant form of medical
data storage. This process uses the FASTR analyser (developed by C. Jacquemin,
IRIN-Nantes, France) working with unification grammars written in the PATR-II
formalism.
- a statistical approach using the NDOC tool (developed at INIST-CNRS,
Nancy), based on term co-occurrence in corpora. A statistical index, called
Equivalence, gives the degree of association of two terms. A hierarchical
classification algorithm then allows the definition of clusters of close
terms.
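The statistical stage can be pictured with a small sketch. The article does not give the formula of the Equivalence index, so the code assumes the definition commonly used in co-word analysis, E(i, j) = c_ij^2 / (c_i * c_j), where c_i and c_j are the numbers of documents containing each term and c_ij the number containing both. The clustering step is a simple single-link grouping cut at one threshold, used here as a stand-in for a hierarchical classification; it is not a description of the NDOC algorithm. All counts and function names are hypothetical.

```python
def equivalence(c_ij, c_i, c_j):
    """Assumed co-word Equivalence index: c_ij^2 / (c_i * c_j), in [0, 1]."""
    if c_i == 0 or c_j == 0:
        return 0.0
    return (c_ij * c_ij) / (c_i * c_j)

def single_link_clusters(terms, pair_scores, threshold):
    """Group terms linked by a chain of pairs scoring at least `threshold`."""
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster representative (path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), set()).add(t)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# Hypothetical document counts, invented for illustration only.
term_counts = {"wheat": 10, "barley": 8, "tractor": 5, "plough": 4}
pair_counts = {
    ("wheat", "barley"): 6,
    ("tractor", "plough"): 4,
    ("wheat", "tractor"): 1,
}

scores = {
    (a, b): equivalence(n, term_counts[a], term_counts[b])
    for (a, b), n in pair_counts.items()
}
clusters = single_link_clusters(term_counts, scores, threshold=0.3)
```

With these invented counts, the weak link between "wheat" and "tractor" falls below the threshold, so the terms split into a cereal group and a machinery group. Lowering the threshold step by step would merge the groups and trace out a hierarchy.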
Conclusion and future work
In order to integrate these three stages, we had to develop robust linguistic
tools such as a lemmatiser for French. Equivalent tools were developed for
English, and the results of experiments on the same domain are very
close to those for French.
The second phase of the project will start next month, and we will
then focus on three points:
- the specification of how predicate structures could be used to make
explicit relations inside a cluster or between two clusters
- the representation model that could be used to structure the information
- the structuring of the data following the model to obtain a valid
hierarchical classification.
See also:
http://www.loria.fr/exterieur/equipe/dialogue
Please contact:
Yannick Toussaint - INRIA Lorraine
& CRIN-CNRS
Tel: +33 83 59 20 91
E-mail: Yannick.Toussaint@inria.fr