ERCIM News No.27 - October 1996
Bipartite Graphs and Automatic Generation of Thesauri
by Michiel Hazewinkel
The key to information finding (information retrieval) in large and
very large collections is METADATA, that is, extra information added to the
main document (article, chapter, ...) designed to make it findable
- in particular, metadata in the form of subject indicators
(key words and key phrases) and classification information.
A thesaurus and classification scheme can be of enormous value to make a
given (large) corpus of material accessible. This is 'proved' for example
by the very considerable effort that Elsevier Science Publishers invests
in maintaining the thesaurus of 30,000 concepts underlying EMBASE (Excerpta
Medica; four full-time employees). Of course there are many other aspects
to the general idea of metadata. For instance, the matter of suitable containers
for metadata has attracted considerable interest in recent years.
Building a sufficiently large thesaurus by hand is an enormously expensive
and laborious task; certainly if we are thinking of one of the size really
needed to deal with a field like mathematics or computer science (let alone
a field like physics). For mathematics I estimate that a thesaurus of key
phrases (not words - these are not information rich enough) should
comprise about 120,000 terms. Thus it is natural to begin to think about
the automatic generation of thesauri. By now there are (first generation)
commercial and academic programs available for extracting index and thesaurus
terms (noun phrases, prepositional noun phrases) from a corpus of electronic
texts. Other linguistic software can be used to 'standardize' the raw list
of key phrases thus obtained. The available data currently consists of a
bipartite graph between terms and documents. At this stage a number of
mathematical and computer science problems arise, and I should like
to mention some of them.
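Such a bipartite graph can be represented very simply; the following sketch,
with invented terms and document identifiers, shows one side of the graph as
a mapping from terms to the documents they occur in, and derives the other
side from it:

```python
# A minimal sketch of the term-document bipartite graph.
# The terms and document identifiers here are invented for illustration.
term_docs = {
    "graph theory":    {"d1", "d2"},
    "bipartite graph": {"d1", "d3"},
    "thesaurus":       {"d2", "d3", "d4"},
}

# The same graph viewed from the document side.
doc_terms = {}
for term, docs in term_docs.items():
    for d in docs:
        doc_terms.setdefault(d, set()).add(term)

print(doc_terms["d1"])  # the terms occurring in document d1
```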
Thesaurus problems
A thesaurus, according to ISO standard 2788 and a number of other national
and international standards, is much more than an alphabetical list of
standardized key phrases. Fortunately, the bipartite graph defines a metric
on both the set of terms and the set of documents: each term corresponds to
its incidence vector over the documents, and the Hamming distance between
two such vectors counts the documents in which exactly one of the two terms
occurs. This gives a notion of distance between terms (and between documents),
and thus a quantitative version of the thesaurus concept 'related term'.
It also opens promising possibilities for such things as
neighbourhood search. A much harder problem
mathematically is how to capture (quantitatively) the thesaurus concepts
of 'broader terms' and 'narrower terms'.
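The distance and the neighbourhood search it enables can be sketched as
follows; the terms, documents and the search radius are invented for
illustration:

```python
# Sketch of the Hamming-distance metric on terms: each term is represented
# by the set of documents it occurs in (its incidence vector), and the
# distance between two terms is the number of documents containing exactly
# one of them. All data here are invented.
term_docs = {
    "graph":     {"d1", "d2"},
    "bigraph":   {"d1", "d3"},
    "thesaurus": {"d3", "d4"},
}

def hamming(t1, t2):
    """Number of documents containing exactly one of the two terms."""
    return len(term_docs[t1] ^ term_docs[t2])  # symmetric difference

def neighbourhood(term, radius):
    """All other terms within the given distance: a simple neighbourhood search."""
    return {t for t in term_docs if t != term and hamming(term, t) <= radius}

print(hamming("graph", "bigraph"))   # 2
print(neighbourhood("graph", 2))     # {'bigraph'}
```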
Bottom up classification
The notion of distance on the set of terms makes it possible to apply clustering
techniques and thus to define a hierarchy. The problem is then to relate this
bottom-up hierarchy to the top-down hierarchy represented by a classification
scheme for a given field.
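A toy sketch of such bottom-up clustering, using single-linkage agglomerative
merging on an invented term distance matrix (the distances stand in for the
Hamming distances above):

```python
# Bottom-up (agglomerative, single-linkage) clustering on a toy distance
# matrix; terms and distances are invented for illustration.
dist = {
    ("graph", "bigraph"): 1,
    ("graph", "thesaurus"): 4,
    ("bigraph", "thesaurus"): 4,
}

def d(a, b):
    return 0 if a == b else dist.get((a, b), dist.get((b, a)))

clusters = [frozenset({t}) for t in ("graph", "bigraph", "thesaurus")]
hierarchy = []  # records each merge, from the leaves up to the root

while len(clusters) > 1:
    # Merge the pair of clusters at minimum single-linkage distance.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda p: min(d(a, b) for a in clusters[p[0]] for b in clusters[p[1]]),
    )
    merged = clusters[i] | clusters[j]
    hierarchy.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(hierarchy)  # the induced hierarchy, smallest cluster first
```

The nested clusters recorded in `hierarchy` are exactly the bottom-up
hierarchy that must then be matched against a top-down classification scheme.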
Missing terms problem
The set of terms generated as above will have a tendency to be incomplete,
particularly as regards the more general, broader terms (which can also
be thought of as missing centres of clusters). The problem is how to recover
these missing centres.
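One possible approach, sketched here purely as an assumption, is to construct
the incidence vector that minimises the total Hamming distance to the cluster
members; this optimum is the coordinate-wise majority vote, and its document
set suggests what a missing broader term would index. The cluster is invented:

```python
# Candidate "missing centre" of a cluster of narrow terms: the document set
# obtained by majority vote over the members' incidence vectors, which
# minimises the total Hamming distance to the cluster. Data are invented.
documents = ["d1", "d2", "d3", "d4"]
cluster = {
    "finite group":  {"d1", "d2"},
    "abelian group": {"d1", "d3"},
    "group action":  {"d1", "d2", "d4"},
}

centre = {
    doc
    for doc in documents
    if sum(doc in docs for docs in cluster.values()) * 2 > len(cluster)
}
print(centre)  # documents a hypothetical broader term might index
```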
Matching problem
Generating a thesaurus in one go for a large area is not feasible (mathematics
is as large a field as I care to contemplate in this respect; the life sciences
and physics are far larger). Thus the problem arises of constructing
several thesauri (to be thought of as charts forming part of an atlas) and
of matching them, i.e. of describing the overlap between them.
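In its simplest form the overlap between two charts is just the set of shared
terms, possibly scored for size; the following sketch, with two invented term
sets and a Jaccard score, illustrates the idea:

```python
# A sketch of the matching problem for two partial thesauri ("charts"):
# their overlap is the set of shared terms, here scored by Jaccard
# similarity. Both term sets are invented for illustration.
chart_a = {"graph", "bigraph", "matrix", "eigenvalue"}
chart_b = {"matrix", "eigenvalue", "norm", "operator"}

overlap = chart_a & chart_b
jaccard = len(overlap) / len(chart_a | chart_b)

print(sorted(overlap))  # ['eigenvalue', 'matrix']
```

The real problem is of course harder: the same concept may appear under
different phrasings in different charts, which is where the standardization
step and the multilingual variant come in.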
These and various other related problems (such as the multilingual version
of the Matching problem) form the subject matter of a number of research
projects being carried out at CWI in the context of Digital Libraries. Related
efforts involve interactive books and multimedia.
Please contact:
Michiel Hazewinkel - CWI
Tel: +31 20 5924204
E-mail: mich@cwi.nl