COMPUTATIONAL LINGUISTICS
ERCIM News No.26 - July 1996 - SICS

Non-Topic Information Retrieval using Computational Stylistics


by Jussi Karlgren

Research in information retrieval and document analysis has traditionally concentrated on building general, task-independent representations of the content of documents based on word, phrase, or term occurrence statistics of different kinds, in effect performing a sort of shallow semantic analysis. SICS is currently investigating text variation on dimensions other than text topic.

While topic or content is arguably the most important characteristic of a text, texts vary in other ways as well. Indeed, stylistic variation between texts on the same topic is often at least as noticeable as the variation between texts on different topics but of the same genre or variety. For instance, a text about a certain topic can belong to any of several genres - journalistic, scientific, legal, fictional, poetic - each reflected in the style of writing. Secondarily, style is an indication of quality: texts about the same topic in the same genre can vary widely in quality and in usefulness for a given information need or retrieval task.

Users of document bases typically are quite aware of which genres they search for: popular science, overviews, technical descriptions, program manuals, long texts, short texts.

The recent extension of information technology to the general public places an increased burden of information sifting on the consumer. With paper-based technology, the consumer finds clues to discriminate between different types of publication in extra-textual factors such as print and paper quality, method of delivery and so forth. These clues are all negated or weakened by low-cost information production tools such as desktop publishing and by distribution mechanisms such as the World Wide Web, which lower the threshold of publication and increase the diversity of data and sources available to the information consumer. The increase in supply does not make it easier to choose sources and assess the quality and veracity of information.

So, in short: texts differ, users know it, and consumers need more information categorization tools.

Now, stylistic variation is easy to detect - down to the level of individual variation between authors within the same genre - using computationally simple methods. We study variation on several different levels of analysis. Factors measured at these levels are recombined to model various types of style and register variation.
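As a rough illustration of how computationally simple such measurements can be, the sketch below computes a handful of surface statistics for a text. The particular features (average word length, average sentence length, type-token ratio, rate of first- and second-person pronouns) and the name style_features are assumptions chosen for illustration only; they are not a list of the factors studied in the SICS experiments.

```python
import re

# Hypothetical word list for illustration; not taken from the article.
FIRST_SECOND_PERSON = {"i", "me", "my", "we", "us", "our", "you", "your"}


def style_features(text):
    """Return a few surface-level style measurements for one text."""
    # Crude sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Crude tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {
            "avg_word_len": 0.0,
            "avg_sent_len": 0.0,
            "type_token_ratio": 0.0,
            "pronoun_rate": 0.0,
        }
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
        "pronoun_rate": sum(w in FIRST_SECOND_PERSON for w in words) / len(words),
    }
```

Feature vectors of this kind could then be handed to any standard classifier or discriminant method to separate genres; which method suits a given collection is left open here.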

The results of the first experiments are positive. The hypothesis has been that relevant texts in an information retrieval scenario will show systematic differences in stylistic factors from texts which are not relevant. If texts are categorized for genre, this has indeed turned out to be true. It is not difficult to discriminate between genres, and once that is done, the genre information can be presented to the user, so that the user can restrict retrieval to a chosen genre, or it can be used to rank the output from a topic-based search to ensure higher precision in the retrieval results.
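The two uses of genre information mentioned above, selection by the user and re-ranking of topic-based output, might be sketched as follows. This is a minimal sketch under stated assumptions: each hit is taken to be a (doc_id, topic_score, genre) tuple whose genre label comes from some earlier classification step, and the function names and the additive boost are hypothetical choices, not the method used in the experiments.

```python
def filter_by_genre(hits, preferred_genre):
    """Keep only hits labelled with the genre the user selected."""
    return [hit for hit in hits if hit[2] == preferred_genre]


def rerank_by_genre(hits, preferred_genre, boost=0.5):
    """Re-rank topic-search hits, favouring the preferred genre.

    hits: list of (doc_id, topic_score, genre) tuples (assumed format).
    boost: amount added to the topic score of matching hits (assumed scheme).
    """
    def adjusted_score(hit):
        _doc_id, topic_score, genre = hit
        return topic_score + boost if genre == preferred_genre else topic_score

    return sorted(hits, key=adjusted_score, reverse=True)


# Usage with made-up scores and labels:
hits = [("a", 0.9, "legal"), ("b", 0.7, "scientific"), ("c", 0.3, "scientific")]
ranked = rerank_by_genre(hits, "scientific")
only_scientific = filter_by_genre(hits, "scientific")
```

Filtering gives the user strict control over genre, while boosting keeps highly topical documents from other genres in the list; which trade-off is preferable depends on the retrieval task.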


Please contact:
Jussi Karlgren - SICS
Tel: +1 212 998 3496
E-mail: karlgren@cs.nyu.edu
or Jussi.Karlgren@sics.se

http://www.sics.se/~jussi/
