Text Document Classification
by Jana Novovicová
During the last twenty years, the number of text documents in digital form has grown enormously. As a consequence, it is of great practical importance to be able to organize and classify documents automatically. Research into document classification aims to partition unstructured sets of documents into groups according to their content. There are two main variants of document classification: text clustering and text categorization. The former is concerned with finding a latent group structure in the set of documents, while the latter (also known as text classification) can be seen as the task of structuring the repository of documents according to a group structure that is known in advance.
Document classification appears in many applications, including e-mail filtering, mail routing, spam filtering, news monitoring, selective dissemination of information to information consumers, automated indexing of scientific articles, automated population of hierarchical catalogues of Web resources, identification of document genre, authorship attribution, survey coding and so on. Automated text categorization is attractive because manually organizing document collections can be prohibitively expensive, or infeasible given the time constraints of the application or the number of documents involved.
The task of text categorization (TC) is the focus of a research project at the Institute of Information Theory and Automation (UTIA) of the Academy of Sciences of the Czech Republic. The dominant approach to TC is based on machine learning techniques. We can roughly distinguish three phases in the design of a TC system: document representation, classifier construction and classifier evaluation. Document representation denotes the mapping of a document into a compact form that captures its content. A text document is typically represented as a vector of term weights (word features) over a set of terms (the dictionary), where each term is required to occur in at least a certain minimum number (k) of documents. A major characteristic of the TC problem is the extremely high dimensionality of text data: the number of potential features often exceeds the number of training documents. Dimensionality reduction (DR) is therefore a very important step in TC, because irrelevant and redundant features often degrade the performance of classification algorithms in both speed and classification accuracy.
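As an illustration of this representation, the following minimal Python sketch builds a dictionary of terms occurring in at least k documents and maps a document to a vector of term weights. The TF-IDF weighting, the toy documents and the function names are illustrative assumptions, not the weighting actually used in the project.

import math
from collections import Counter

def build_dictionary(documents, k=2):
    """Keep only terms whose document frequency is at least k."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return sorted(term for term, count in df.items() if count >= k)

def term_weight_vector(doc, dictionary, documents):
    """Map one document to a vector of term weights (TF-IDF, one common choice)."""
    n_docs = len(documents)
    tf = Counter(doc.lower().split())
    vector = []
    for term in dictionary:
        df = sum(1 for d in documents if term in set(d.lower().split()))
        idf = math.log(n_docs / df)          # df >= k >= 1, so this is well defined
        vector.append(tf[term] * idf)
    return vector

# Toy example (illustrative documents only).
docs = [
    "interest rates rise as markets fall",
    "central bank raises interest rates",
    "football team wins the cup final",
]
dictionary = build_dictionary(docs, k=2)      # terms occurring in at least k = 2 documents
print(dictionary)                             # ['interest', 'rates']
print(term_weight_vector(docs[0], dictionary, docs))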
DR in TC often takes the form of feature selection. Feature subset selection methods for TC typically rely on an evaluation function that is applied to individual features. The best individual features (BIF) method evaluates all words individually according to a given criterion, sorts them and selects the best subset of words. Since the vocabulary usually contains several thousand or tens of thousands of words, the low computational cost of BIF methods makes them popular in TC. However, such methods evaluate each word separately, and completely ignore the existence of other words and the manner in which words work together.
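The BIF scheme can be sketched as follows. Mutual information between a binary term-occurrence indicator and the class label is used here as one example of a single-feature criterion; the data, the function names and this particular scoring choice are illustrative assumptions, not the project's implementation.

import math
from collections import Counter

def mutual_information(term, documents, labels):
    """MI (in nats) between a binary term-occurrence indicator and the class label."""
    n = len(documents)
    joint, occ_marg, lab_marg = Counter(), Counter(), Counter(labels)
    for doc, label in zip(documents, labels):
        occurs = term in set(doc.lower().split())
        joint[(occurs, label)] += 1
        occ_marg[occurs] += 1
    mi = 0.0
    for (occurs, label), count in joint.items():
        p_xy = count / n
        p_x = occ_marg[occurs] / n
        p_y = lab_marg[label] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

def best_individual_features(documents, labels, dictionary, m):
    """BIF: score every term independently, sort by score and keep the m best."""
    scored = sorted(((mutual_information(t, documents, labels), t) for t in dictionary),
                    reverse=True)
    return [term for _, term in scored[:m]]

# Toy usage (illustrative data and class labels).
docs = ["interest rates rise", "bank raises interest rates", "team wins cup", "cup final tonight"]
labels = ["economy", "economy", "sport", "sport"]
vocab = sorted(set(w for d in docs for w in d.split()))
print(best_individual_features(docs, labels, vocab, m=3))

Note that each term is scored on its own, which is exactly what makes BIF cheap but blind to interactions between terms.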
UTIA has proposed the use of the sequential forward selection (SFS) method with novel, improved mutual information measures as criteria for reducing the dimensionality of text data. These criteria take into consideration how features work together. The performance of SFS with the proposed criteria was compared to that of the BIF method using mutual information, information gain, the chi-square statistic and the odds ratio. Experiments with a naive Bayes classifier based on the multinomial model, a linear support vector machine (SVM) and a k-nearest neighbour classifier were performed on the Reuters data sets, and the results were analysed from various perspectives, including precision, recall and the F1-measure. Preliminary experimental results on the Reuters data indicate that the SFS methods significantly outperform BIF with the above-mentioned evaluation functions. Furthermore, SVM on average outperforms both the naive Bayes and k-nearest neighbour classifiers on the test data.
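For comparison with BIF, the greedy loop at the heart of SFS looks roughly like the sketch below. The criterion is deliberately left abstract: the improved mutual information measures developed at UTIA are not reproduced here, and the toy stand-in only illustrates that the criterion is evaluated on the candidate subset as a whole, which is what allows SFS to account for feature interactions. (The F1-measure mentioned above is the harmonic mean of precision and recall, F1 = 2PR/(P+R).)

def sequential_forward_selection(candidates, criterion, target_size):
    """SFS: start from the empty set and greedily add, at each step, the feature whose
    inclusion maximizes the criterion evaluated on the whole candidate subset,
    so interactions with the already selected features are taken into account."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < target_size:
        best = max(remaining, key=lambda term: criterion(selected + [term]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy stand-in criterion: a real TC criterion would measure class separability or
# held-out classification performance of the candidate subset.
terms = ["interest", "rates", "bank", "football", "cup"]
subset_score = lambda subset: len({t[0] for t in subset})   # rewards diverse first letters
print(sequential_forward_selection(terms, subset_score, target_size=3))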
Currently, text classification research at UTIA is heading in two directions: first, investigation of sequential floating search methods and oscillating algorithms (developed at UTIA) for reducing the dimensionality of text data; and second, design of a new probabilistic model for document modelling, based on mixtures, that solves the feature selection and classification problems simultaneously. These phases of the project rely on the involvement of PhD students from the Faculty of Mathematics and Physics at Charles University in Prague.
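A rough, simplified outline of the floating-search idea is given below, assuming a generic bounded criterion; it is a sketch of the general technique, not UTIA's implementation. After each forward step, features are conditionally removed as long as this strictly improves on the best subset previously found at the smaller size.

def sequential_forward_floating_selection(candidates, criterion, target_size):
    """Simplified SFFS outline: forward steps as in SFS, plus conditional backward
    steps that undo earlier choices when a better smaller subset is found."""
    selected = []
    best_at_size = {}                         # subset size -> best criterion value seen so far
    while len(selected) < target_size:
        # Forward step: include the single most promising feature.
        remaining = [f for f in candidates if f not in selected]
        selected.append(max(remaining, key=lambda f: criterion(selected + [f])))
        k = len(selected)
        best_at_size[k] = max(best_at_size.get(k, float("-inf")), criterion(selected))
        # Conditional (floating) exclusion steps.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: criterion([x for x in selected if x != f]))
            reduced = [x for x in selected if x != drop]
            if criterion(reduced) > best_at_size.get(len(reduced), float("-inf")):
                selected = reduced
                best_at_size[len(selected)] = criterion(reduced)
            else:
                break
    return selected

terms = ["interest", "rates", "bank", "football", "cup"]
score = lambda subset: len({t[0] for t in subset})        # toy criterion, as before
print(sequential_forward_floating_selection(terms, score, target_size=3))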
This research is partially supported by GAAV No A2075302, the EC MUSCLE project FP6-507752 and the project 1M6798555601 DAR.
Please contact:
Jana Novovicová, CRCIM (UTIA), Czech Republic
Tel: +420 2 6605 2224
E-mail: novovic@utia.cas.cz