ERCIM News No.26 - July 1996
Optimizing Analysis and Generation in Natural Language Processing
by Christer Samuelsson
This article describes the main work underlying the 1995 ERCIM Cor Baayen Fellowship Award: two methods for speeding up natural language analysis and generation, respectively, by making the processing
more deterministic. This is done by optimizing the analysis and generation
machinery through the use of previously processed training examples.
The work for which the author received the first ERCIM Cor
Baayen Fellowship Award in 1995 has largely been concerned with speeding
up syntactic analysis and generation in natural-language processing (NLP)
systems.
Parsers (and generators) that analyse (or generate) sentences in natural
language with respect to a formal grammar of the language in question are normally
realized as abstract machines that transit between internal states while
reading (or producing) a sentence. Examples of these are finite-state automata,
LR parsers, chart parsers, etc. (Some of these do not rely only on the internal
states, but can store intermediate results in various ways, e.g., in pushdown
stacks, charts, etc.) The basic idea behind the techniques developed is
to reduce the number of possibilities when transiting from one state to
another. This makes the processing more deterministic, and thus faster.
As a side effect, the total number of states visited for any particular
sentence is also reduced. The cost is a larger number of internal states
in total; the methods thus trade memory requirements for processing speed.
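To make this trade-off concrete, the following small Python sketch (illustrative only; the states, symbols, and tables are invented for the example) counts how many partial paths an abstract machine has to consider when its transition table is nondeterministic, compared with a more deterministic table for the same input.

# Toy transition tables: (state, symbol) -> set of possible next states.
NONDET = {
    (0, "a"): {1, 2},
    (1, "b"): {1, 3},
    (2, "b"): {3},
    (3, "a"): {0, 3},
}
DET = {
    (0, "a"): {1},
    (1, "b"): {3},
    (3, "a"): {3},
}

def paths_considered(table, start, sentence):
    """Count the partial paths the machine explores while reading a sentence."""
    frontier = [start]
    considered = 1
    for symbol in sentence:
        frontier = [nxt for state in frontier
                    for nxt in table.get((state, symbol), set())]
        considered += len(frontier)
    return considered

sentence = ["a", "b", "a"]
print(paths_considered(NONDET, 0, sentence))  # 10 partial paths
print(paths_considered(DET, 0, sentence))     # 4 partial paths

The deterministic table follows far fewer paths but, in general, needs more states to do so; this is the memory-for-speed trade-off described above.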
The author's doctoral dissertation focused on speeding up syntactic analysis.
Here, the existing formal grammar of some NLP system and a training corpus,
i.e., a set of pre-analysed sentences, are used to create a new grammar that
allows much faster analysis. The new grammar is then compiled into a set
of LR-parsing tables, resulting in a speedup of about a factor of 60.
In more detail, the various rules of the original grammar are chunked together,
based on how they were used to analyse the sentences in the training corpus,
to form composite grammar rules. Thus the rules of the new, learned grammar
consist of combinations of the original grammar rules. This means that the
coverage of the new grammar is strictly less than that of the old one, due
to rule combinations that do not occur in the training data. In practice, this only occasionally prevents the system from appropriately processing sentences subsequently submitted to it. It does, however, often rule out readings of input sentences that are theoretically possible but not the desired ones. Furthermore, it rules out many blind alleys in the set of potential paths through the internal states of the LR parser. These two effects drastically cut down the amount of work the parser has to perform. With a 60-fold speedup, the system retained 90% of its original coverage.
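As an illustration of the chunking idea, here is a minimal Python sketch (not the dissertation's actual algorithm; the cut categories and the toy tree are invented). It flattens the fragments of a training-corpus parse tree into composite rules by cutting the tree at a chosen set of categories; every fragment that is observed becomes one rule of the learned grammar.

# A parse tree is (category, [children]); leaves are plain word strings.
CUT_CATEGORIES = {"S", "NP", "VP"}   # hypothetical cut points

def fragment_to_rule(tree):
    """Flatten one tree fragment into a single composite rule."""
    root_cat, children = tree
    rhs, sub_fragments = [], []

    def walk(node):
        if isinstance(node, str):       # lexical leaf
            rhs.append(node)
            return
        cat, kids = node
        if cat in CUT_CATEGORIES:       # cut here: this node starts a new fragment
            rhs.append(cat)
            sub_fragments.append(node)
        else:                           # fold the node's expansion into this rule
            for kid in kids:
                walk(kid)

    for child in children:
        walk(child)
    return (root_cat, tuple(rhs)), sub_fragments

def learn_composite_rules(treebank):
    """Collect the composite rules induced by all trees in the training corpus."""
    rules, agenda = set(), list(treebank)
    while agenda:
        rule, subs = fragment_to_rule(agenda.pop())
        rules.add(rule)
        agenda.extend(subs)
    return rules

# Tiny example tree for "the cat sleeps":
tree = ("S", [("NP", [("Det", ["the"]), ("N", ["cat"])]),
              ("VP", [("V", ["sleeps"])])])

for rule in sorted(learn_composite_rules([tree])):
    print(rule)

In practice the cuts would be chosen more carefully (for instance, keeping lexical rules separate), but the principle is the same: rule combinations that never occur in the training data simply do not appear in the learned grammar.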
The work on speeding up natural language generation is a bit more technical
in nature. A new algorithm was devised, based on the existing semantic-head-driven generation (SHDG) algorithm, that compiles the generation grammar into generation tables using LR-compilation techniques. This means that much of the work that the SHDG algorithm does at runtime is instead done at compile time.
Intuitively, the new algorithm performs functor merging, analogous to the
prefix merging done by an LR parser, thereby processing several alternatives in parallel at runtime rather than constructing and trying each one separately in turn.
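The following toy Python sketch conveys the flavour of functor merging (the rule format and names are assumptions made for the example, not the actual SHDG machinery): realization alternatives that share the same semantic functor are merged into one table entry at compile time, so a single runtime lookup retrieves all of them at once.

from collections import defaultdict

# Hypothetical generation rules: semantic functor -> one possible realization.
GENERATION_RULES = [
    ("give(X,Y,Z)", ["X", "gives", "Y", "Z"]),
    ("give(X,Y,Z)", ["X", "gives", "Z", "to", "Y"]),
    ("sleep(X)",    ["X", "sleeps"]),
]

def compile_generation_table(rules):
    """Merge all rules with the same functor into a single table entry."""
    table = defaultdict(list)
    for functor, realization in rules:
        table[functor].append(realization)
    return dict(table)

TABLE = compile_generation_table(GENERATION_RULES)

def generate(functor, bindings):
    """Expand every merged alternative for a functor from one table lookup."""
    for template in TABLE.get(functor, []):
        yield [bindings.get(tok, tok) for tok in template]

for sentence in generate("give(X,Y,Z)", {"X": "Kim", "Y": "Lee", "Z": "a book"}):
    print(" ".join(sentence))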
Nevertheless, the resulting transitions between the various states of the generator are not necessarily deterministic. Here, too, a training corpus
is used to construct generation tables that are minimally nondeterministic
on the training data, by expanding out the number of internal states. However,
we cannot guarantee that the amount of nondeterminism encountered for new
input is minimal, though in general it will be (almost) minimal. In contrast to the method for speeding up syntactic analysis, this does not affect the
coverage of the grammar; we can still generate exactly the same sentences
for any input semantic structure that we could generate before. Thus, the
two methods differ in that grammatical completeness is preserved only for the latter.
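A rough Python sketch of the state-expansion step might look as follows (the table, the traces, and the one-symbol look-back are invented for illustration; the actual construction is more involved). Each state is split according to the symbol that led into it, the split states record the choices observed in training, and unseen contexts fall back to the full original set of alternatives, so nothing that could be generated before is lost.

from collections import defaultdict

# Original nondeterministic table: (state, symbol) -> set of next states.
ORIGINAL = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q1", "q2"},
    ("q1", "c"): {"q3"},
}

# Training traces: lists of (state, symbol, chosen next state).
TRAINING = [
    [("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "c", "q3")],
    [("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "b", "q2")],
]

def compile_expanded_table(training):
    """Split each state on its incoming symbol, recording the training choices."""
    expanded = defaultdict(set)
    for trace in training:
        incoming = None                      # symbol that led into the current state
        for state, symbol, chosen in trace:
            expanded[(state, incoming), symbol].add(chosen)
            incoming = symbol
    return dict(expanded)

EXPANDED = compile_expanded_table(TRAINING)

def next_states(state, incoming, symbol):
    """Prefer the (near-)deterministic expanded entry; fall back to the original."""
    return EXPANDED.get(((state, incoming), symbol),
                        ORIGINAL.get((state, symbol), set()))

# In a context seen during training the choice is unique ...
print(next_states("q1", "a", "b"))   # {'q1'}
# ... while an unseen context still offers every original alternative.
print(next_states("q1", "x", "b"))   # {'q1', 'q2'}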
Please contact:
Christer Samuelsson - Universität des Saarlandes
Tel: +49 681 302 4502
E-mail: christer@coli.uni-sb.de