Joint ERCIM Actions
ERCIM News No.33 - April 1998

Contrastive Indexing of Full Text Documents

by Laurent Romary and Patrice Bonhomme


In the context of the general Aquarelle scenario, the creation of folders allows a user to gather pieces of information that he considers useful for his own purposes. In particular, he may include textual fields, which in turn have to be made accessible for further retrieval. To this end, we designed a full text indexing method which, rather than providing an absolute set of indexes for each textual field, contrasts each field with the other fields the user might point to, either in the same folder or in other folders he has created or extracted from an Aquarelle server.

The basic idea behind the contrastive indexing method is to consider a given document, or rather the set of tokens it contains, as a sample taken from the set of all tokens in the reference corpus of documents it belongs to. The frequency of a token within the document can then be compared to the distribution expected from the reference corpus, in order to evaluate whether it is in keeping with that distribution or, on the contrary, deviates from it so much that it must be interpreted as indicating a particular relevance for the document. For each document, we thus compute a set of so-called contrasting tokens, which is a good indication of its informational content relative to the contents of the documents it is compared to. As a consequence, this method has several interesting properties which, from both a linguistic and an information retrieval point of view, make it a good option for a full text indexing mechanism.
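The comparison between observed and expected token frequencies can be illustrated with a minimal Python sketch. The article does not say which statistical test is used, so the binomial z-score below, the function name `contrasting_tokens`, the `threshold` parameter and the restriction to over-represented tokens are all assumptions made for illustration only.

```python
import math
from collections import Counter


def contrasting_tokens(doc_tokens, corpus_tokens, threshold=2.0):
    """Return {token: z_score} for tokens over-represented in the document.

    Assumption: the document is treated as a sample of len(doc_tokens)
    draws from the reference corpus, and each token count is modelled as
    binomial. The source does not specify the actual test.
    """
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)
    n = len(doc_tokens)      # sample size (tokens in the document)
    total = len(corpus_tokens)  # tokens in the reference corpus

    result = {}
    for token, observed in doc_counts.items():
        p = corpus_counts[token] / total   # relative frequency in the corpus
        expected = n * p                   # expected count in the document
        variance = n * p * (1.0 - p)
        if variance == 0:
            # Token absent from the corpus, or the only token in it:
            # no meaningful deviation can be computed.
            continue
        z = (observed - expected) / math.sqrt(variance)
        if z > threshold:
            result[token] = z
    return result
```

In this sketch the reference corpus is expected to include the tokens of the document itself, since the document is described as belonging to the corpus. A common word whose frequency in the document matches its corpus frequency gets a score near zero and is not retained, whereas a word concentrated in the document gets a high score.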

The full text indexing module is designed as a semi-automatic process offered to the user during the folder editing stage: the user always has the possibility to edit and validate the set of candidate terms before they are actually inserted into the folder itself.

Given the robustness of the method as observed in our first trials within the Aquarelle project, we have considered extending it into a general mechanism for content identification within a set of more or less homogeneous documents. Indeed, the contrastive indexing process yields a kind of thematic description of the document in comparison with a given reference, which acts as a background. This makes it possible to iteratively group together documents with similar descriptions and, further, to build up a thematic map of the whole reference database. This concept has recently been applied within a project funded by the DGLF (Délégation Générale à la Langue Française) aiming at automatically producing thematic descriptions of a given web site. The contrastive indexing method, combined with a hierarchical clustering algorithm, has allowed us to produce topic maps of a given web site independently of its actual language or content domain.
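The iterative grouping of documents with similar descriptions can be sketched as a simple agglomerative procedure. The article does not name the clustering algorithm or the similarity measure actually used, so the Jaccard similarity, the `min_similarity` stopping threshold, the union-based merging of descriptions and the function names below are hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of contrasting tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(descriptions, min_similarity=0.3):
    """Greedy agglomerative clustering of documents.

    descriptions: dict mapping a document name to its set of contrasting
    tokens. At each step the two most similar clusters are merged, until
    no pair reaches min_similarity (an assumed stopping rule).
    Returns a list of clusters, each a sorted list of document names.
    """
    # Each cluster is a pair (document names, merged token description).
    clusters = [({name}, set(tokens)) for name, tokens in descriptions.items()]

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < min_similarity:
            break
        # Assumption: the description of a merged cluster is the union
        # of the descriptions of its members.
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return [sorted(names) for names, _ in clusters]
```

Because the grouping relies only on shared tokens and not on any dictionary or linguistic resource, a sketch of this kind makes no assumption about the language or content domain of the documents.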

Please contact:

Laurent Romary and Patrice Bonhomme - LORIA
Tel: +33 3 8359 2037
E-mail: {romary,bonhomme}@loria.fr

