ERCIM News No.25 - April 1996 - INRIA
Authors may index their own Web Documents
by Jacques André and Hélène Richy
The main way to access information on the World Wide Web is navigational.
Many research projects and commercial crawlers have been designed for that
purpose; in particular, the concept of cartography is one of the most successful.
On the other hand, many studies are concerned with automatic indexing:
tools are written to extract from (full) texts the pertinent information
the reader is looking for.
Between these two approaches, structural and statistical, we propose a third
one, based on the traditional technique: authors have the best knowledge
of the contents of their documents and are able to give key-words summarizing
their thought. However, many problems remain unsolved.
A first approach, using the structured document editor Grif, allowed us
to produce large index tables for traditional paper books, such as the
Cartulaire de Saint Laurent, the first cartulary written in French during
the 14th century.
Extending such tools to the Web requires many improvements at various
levels. Note that the term index is used here in a broad sense that also
covers bibliographies, references, tables of contents, etc.
From the authoring system's point of view, a set of three tasks is useful:
- a preliminary task is to decide which entities are to be indexed and
how these entities will be indexed: a marking tool should enable the creation
of such descriptions
- a second task consists in specifying how index tables will be constructed:
an index selector should propose a list of index tables to be constructed
- finally, an index builder should produce structured and formatted
index tables after collecting and sorting information.
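The three tasks above can be sketched as follows. This is only an illustrative sketch, not the Grif or Thot implementation: the `Mark` record, the function names and the sample locations are all hypothetical, and an index mark is assumed to be a simple (table, key-word, location) triple supplied by the author.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """Marking tool output (hypothetical): one author-supplied index mark."""
    table: str     # index table the mark belongs to, e.g. "subjects"
    key: str       # key-word chosen by the author
    location: str  # where the marked entity lives (assumed to be a URL or anchor)


def select_tables(marks):
    """Index selector: propose the list of index tables that can be built."""
    return sorted({m.table for m in marks})


def build_index(marks, table):
    """Index builder: collect the marks of one table and sort the entries."""
    entries = defaultdict(set)
    for m in marks:
        if m.table == table:
            entries[m.key].add(m.location)
    # Sort key-words case-insensitively, and locations within each entry.
    return [(key, sorted(entries[key])) for key in sorted(entries, key=str.lower)]


def format_index(rows):
    """Format a built table as plain text, one entry per line."""
    return "\n".join(f"{key}: {', '.join(locs)}" for key, locs in rows)


if __name__ == "__main__":
    sample = [
        Mark("subjects", "index", "doc1.html#s2"),
        Mark("subjects", "Grif", "doc1.html#s1"),
        Mark("names", "André", "doc2.html#a"),
    ]
    for name in select_tables(sample):
        print(f"[{name}]")
        print(format_index(build_index(sample, name)))
```

Keeping the marks separate from the built tables means that several tables (subjects, names, bibliography) can be derived from the same marked documents.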
When considering large documents from the Web (i.e. the reader's) point
of view, such an index is not a static document but an active one that
has to be kept up to date. Index documents need updating on many occasions,
such as:
- the content of some previously indexed documents has changed
- new pages have to be indexed
- a new index table is required (with new options).
Various updating strategies may be proposed:
- immediate updating, which is more or less unrealistic
- updating when the index is accessed, which supposes that all links are
checked before the index tables are displayed
- updating on the user's demand.
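As an illustration of the second strategy, the sketch below checks every indexed location each time the index table is requested and re-indexes only the documents whose content has changed. It is a minimal sketch under stated assumptions, not the system described in this article: the `LazyIndex` class, the `fetch` and `extract_keys` callables and the use of a content digest to detect changes are all hypothetical choices made for the example.

```python
import hashlib


class LazyIndex:
    """Hypothetical index that is brought up to date only when it is accessed."""

    def __init__(self, fetch, extract_keys):
        self._fetch = fetch            # assumed: location -> text, or None if the link is dead
        self._extract = extract_keys   # assumed: text -> iterable of key-words
        self._digests = {}             # location -> digest of the content last indexed
        self._keys = {}                # location -> key-words found at last indexing
        self.refreshed = []            # locations re-indexed during the last access

    def add(self, location):
        """Register a new page; it is indexed at the next access."""
        self._digests.setdefault(location, None)

    def _check_links(self):
        self.refreshed = []
        for location in list(self._digests):
            text = self._fetch(location)
            if text is None:
                # Dead link: drop the page from the index.
                del self._digests[location]
                self._keys.pop(location, None)
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest != self._digests[location]:
                # New or changed content: re-index this page only.
                self._digests[location] = digest
                self._keys[location] = set(self._extract(text))
                self.refreshed.append(location)

    def table(self):
        """Check all links, then return the index table (key-word -> locations)."""
        self._check_links()
        entries = {}
        for location, keys in self._keys.items():
            for key in keys:
                entries.setdefault(key, []).append(location)
        return {key: sorted(locs) for key, locs in sorted(entries.items())}
```

The cost of checking every link is paid at each access, which is the price of never displaying a stale table; immediate updating would instead pay that cost on every change to any indexed document.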
At INRIA-Rennes, we are working on such index manipulation in the context
of the Thot system (Opera project, INRIA). Work is in progress to implement
a system based on the second strategy (updating when the index is accessed)
within the Tamaya environment.
More information on the Web: http://www-bi.imag.fr/OPERA/BibOpera.html
Please contact:
Jacques André or Hélène Richy - INRIA
Tel: +33 99 84 71 00
E-mail: jacques.andre@irisa.fr
or helene.richy@irisa.fr