ERCIM News No.26 - July 1996 - SICS
Non-Topic Information Retrieval using Computational Stylistics
by Jussi Karlgren
Research in information retrieval and document analysis has traditionally
concentrated on building general, task-independent representations about
the content of documents based on word, phrase, or term occurrence statistics
of different kinds: performing a sort of shallow semantic analysis. SICS
is currently investigating text variation on dimensions other than text
topic.
While topic or content arguably is the most important characteristic of
a text, texts vary in other ways as well. Indeed, stylistic variation between
texts of the same topic is often at least as noticeable as the variation
between texts of different topics but the same genre or variety. For instance,
a text about a certain topic can be of several different genres - journalistic,
scientific, legal, fictional, poetic - all reflected in the style of
writing. Secondarily, style is an indication of quality: texts about the
same topic in the same genre can vary widely in quality and usefulness
for a certain information need or retrieval task.
Users of document bases typically are quite aware of which genres they search
for: popular science, overviews, technical descriptions, program manuals,
long texts, short texts.
The recent extension of information technology to the general public places
an increased burden of information sifting on the consumer. With paper-based
technology, the consumer finds clues to discriminate between different types
of publication in extra-textual factors such as print and paper quality,
method of delivery and so forth - these are all negated or weakened by low-cost
information production tools such as desktop publishing and distribution
mechanisms such as the World Wide Web. They lower the threshold of publication
and increase the diversity of data and sources available to the information
consumer. The increase in supply does not make it easier to choose sources
and assess the quality and veracity of information.
So, in short: texts differ, users know it, and consumers need more information
categorization tools.
Now, stylistic variation is easy to detect - down to the level of individual
variation between authors within the same genre - using computationally
non-complex methods. We study variation on several different levels of analysis:
- Lexical variation: type/token ratios, long word content, pronoun content,
adverbial content, difficult word percentage, average word length
- Syntactic variation: average sentence length, parse tree measures,
parser performance data
- Textual variation: average paragraph and text length; subtopic measures.
These factors are recombined to model various types of style and register
variation.
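As an illustration only, some of the lexical and syntactic measures listed above can be sketched in a few lines of Python. This is not the SICS implementation: the function name style_features, the PRONOUNS word list, the seven-letter threshold for a "long word" and the naive sentence splitter are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small pronoun list (an assumption for this sketch).
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}


def style_features(text, long_word_len=7):
    """Compute a few simple stylistic measures for one text.

    long_word_len is an assumed threshold: words with at least this
    many characters are counted as "long".
    """
    # Naive sentence split: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Tokens are lowercased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"type_token_ratio": 0.0, "avg_word_length": 0.0,
                "long_word_share": 0.0, "pronoun_share": 0.0,
                "avg_sentence_length": 0.0}
    n = len(tokens)
    return {
        # Lexical variation
        "type_token_ratio": len(set(tokens)) / n,
        "avg_word_length": sum(len(t) for t in tokens) / n,
        "long_word_share": sum(len(t) >= long_word_len for t in tokens) / n,
        "pronoun_share": sum(t in PRONOUNS for t in tokens) / n,
        # Syntactic variation (words per sentence)
        "avg_sentence_length": n / len(sentences),
    }


if __name__ == "__main__":
    sample = "I like cats. You like dogs."
    for name, value in style_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Each text is reduced to a small vector of numbers, which could then be passed to any standard classifier to separate genres. Parse tree measures and subtopic measures would need a parser and a text segmenter and are left out of this sketch.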
The results of the first experiments are positive. The hypothesis has been
that relevant texts in an information retrieval scenario will show systematic
differences in stylistic factors from texts which are not relevant. If texts
are categorized for genre, this has indeed turned out to be true. It is
not difficult to discriminate between genres, and once that is done, the
information can be presented to the user, to allow the user to select which genres
to retrieve, or it can be used to rank the output from a topic-based search
to ensure higher precision in the retrieval results.
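A minimal sketch of the second use, re-ranking topic-based search output by genre, might look as follows. It assumes that each retrieved document already carries a topic score and a genre label from some earlier classification step. The function name rerank_by_genre and the additive boost are hypothetical choices for the example, not the method used in the experiments.

```python
def rerank_by_genre(results, preferred_genres, boost=0.5):
    """Re-rank topic search results so preferred genres come first.

    results          -- list of (doc_id, topic_score, genre) tuples
    preferred_genres -- set of genre labels the user asked for
    boost            -- assumed additive bonus for a matching genre
    """
    def adjusted_score(item):
        _doc_id, topic_score, genre = item
        bonus = boost if genre in preferred_genres else 0.0
        return topic_score + bonus

    # Highest adjusted score first; sorted() is stable, so ties keep
    # their original topic-based order.
    return sorted(results, key=adjusted_score, reverse=True)


if __name__ == "__main__":
    hits = [("a", 0.9, "legal"),
            ("b", 0.7, "scientific"),
            ("c", 0.6, "journalistic")]
    for doc_id, score, genre in rerank_by_genre(hits, {"scientific"}):
        print(doc_id, score, genre)
```

An additive boost keeps documents of other genres in the list and only moves them down. Filtering them out entirely would correspond to the first use, where the user selects the genres to retrieve.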
Please contact:
Jussi Karlgren - SICS
Tel: +1 212 998 3496
E-mail: karlgren@cs.nyu.edu
or Jussi.Karlgren@sics.se
http://www.sics.se/~jussi/