ERCIM News No.26 - July 1996
Optimizing Analysis and Generation in Natural Language Processing
by Christer Samuelsson
This article describes the main work underlying the 1995 ERCIM Cor Baayen Fellowship Award: two methods for speeding up natural language analysis and generation, respectively, by making the processing
more deterministic. This is done by optimizing the analysis and generation
machinery through the use of previously processed training examples.
The work for which the author received the first ERCIM Cor
Baayen Fellowship Award in 1995 has largely been concerned with speeding
up syntactic analysis and generation in natural-language processing (NLP)
systems.
Parsers (and generators) that analyse (or generate) sentences in natural
language with respect to a formal grammar of the language in question are normally
realized as abstract machines that transit between internal states while
reading (or producing) a sentence. Examples of these are finite-state automata,
LR parsers, chart parsers, etc. (Some of these do not rely only on the internal
states, but can store intermediate results in various ways, e.g., in pushdown
stacks, charts, etc.) The basic idea behind the techniques developed is
to reduce the number of possibilities when transiting from one state to
another. This makes the processing more deterministic, and thus faster.
As a side effect, the total number of states visited for any particular
sentence is also reduced. The cost is a larger number of internal states
in total; the methods thus trade memory requirements for processing speed.
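To make this trade-off concrete, the following small Python sketch (illustrative only; the states, symbols, and tables are invented for the example) counts how many partial paths an abstract machine has to consider when its transition table is nondeterministic, compared with a more deterministic table for the same input.

# Toy transition tables: (state, symbol) -> set of possible next states.
NONDET = {
    (0, "a"): {1, 2},
    (1, "b"): {1, 3},
    (2, "b"): {3},
    (3, "a"): {0, 3},
}
DET = {
    (0, "a"): {1},
    (1, "b"): {3},
    (3, "a"): {3},
}

def paths_considered(table, start, sentence):
    """Count the partial paths the machine explores while reading a sentence."""
    frontier = [start]
    considered = 1
    for symbol in sentence:
        frontier = [nxt for state in frontier
                    for nxt in table.get((state, symbol), set())]
        considered += len(frontier)
    return considered

sentence = ["a", "b", "a"]
print(paths_considered(NONDET, 0, sentence))  # 10 partial paths
print(paths_considered(DET, 0, sentence))     # 4 partial paths

The deterministic table follows far fewer paths but, in general, needs more states to do so; this is the memory-for-speed trade-off described above.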
The author's doctoral dissertation focused on speeding up syntactic analysis.
Here, the existing formal grammar of some NLP system and a training corpus,
i.e., a set of pre-analysed sentences, are used to create a new grammar that
allows much faster analysis. The new grammar is then compiled into a set
of LR-parsing tables, resulting in a speedup of about a factor of 60.
In more detail, the various rules of the original grammar are chunked together,
based on how they were used to analyse the sentences in the training corpus,
to form composite grammar rules. Thus the rules of the new, learned grammar
consist of combinations of the original grammar rules. This means that the
coverage of the new grammar is strictly less than that of the old one, due
to rule combinations that do not occur in the training data. In practice, this only occasionally prevents the system from appropriately processing sentences subsequently submitted to it. It does, however, often rule out readings of input sentences that are theoretically possible but not the desired ones. Furthermore, it rules out many blind alleys in the set of potential paths through the internal states of the LR parser. These two effects drastically cut down the amount of work the parser has to perform. With a 60-fold speedup, the system retained 90% of its original coverage.
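As an illustration of the chunking idea, here is a minimal Python sketch (not the dissertation's actual algorithm; the cut categories and the toy tree are invented). It flattens the fragments of a training-corpus parse tree into composite rules by cutting the tree at a chosen set of categories; every fragment that is observed becomes one rule of the learned grammar.

# A parse tree is (category, [children]); leaves are plain word strings.
CUT_CATEGORIES = {"S", "NP", "VP"}   # hypothetical cut points

def fragment_to_rule(tree):
    """Flatten one tree fragment into a single composite rule."""
    root_cat, children = tree
    rhs, sub_fragments = [], []

    def walk(node):
        if isinstance(node, str):       # lexical leaf
            rhs.append(node)
            return
        cat, kids = node
        if cat in CUT_CATEGORIES:       # cut here: this node starts a new fragment
            rhs.append(cat)
            sub_fragments.append(node)
        else:                           # fold the node's expansion into this rule
            for kid in kids:
                walk(kid)

    for child in children:
        walk(child)
    return (root_cat, tuple(rhs)), sub_fragments

def learn_composite_rules(treebank):
    """Collect the composite rules induced by all trees in the training corpus."""
    rules, agenda = set(), list(treebank)
    while agenda:
        rule, subs = fragment_to_rule(agenda.pop())
        rules.add(rule)
        agenda.extend(subs)
    return rules

# Tiny example tree for "the cat sleeps":
tree = ("S", [("NP", [("Det", ["the"]), ("N", ["cat"])]),
              ("VP", [("V", ["sleeps"])])])

for rule in sorted(learn_composite_rules([tree])):
    print(rule)

In practice the cuts would be chosen more carefully (for instance, keeping lexical rules separate), but the principle is the same: rule combinations that never occur in the training data simply do not appear in the learned grammar.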
The work on speeding up natural language generation is a bit more technical
in nature. A new algorithm was devised, based on the existing semantic-head-driven generation (SHDG) algorithm, that compiles the generation grammar into generation tables using LR-compilation techniques. This means that much of the work that the SHDG algorithm does at runtime is instead done at compile time.
Intuitively, the new algorithm performs functor merging, analogous to the
prefix merging done by an LR parser, thereby processing several alternatives in parallel at runtime rather than constructing and trying each one separately in turn.
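The following toy Python sketch conveys the flavour of functor merging (the rule format and names are assumptions made for the example, not the actual SHDG machinery): realization alternatives that share the same semantic functor are merged into one table entry at compile time, so a single runtime lookup retrieves all of them at once.

from collections import defaultdict

# Hypothetical generation rules: semantic functor -> one possible realization.
GENERATION_RULES = [
    ("give(X,Y,Z)", ["X", "gives", "Y", "Z"]),
    ("give(X,Y,Z)", ["X", "gives", "Z", "to", "Y"]),
    ("sleep(X)",    ["X", "sleeps"]),
]

def compile_generation_table(rules):
    """Merge all rules with the same functor into a single table entry."""
    table = defaultdict(list)
    for functor, realization in rules:
        table[functor].append(realization)
    return dict(table)

TABLE = compile_generation_table(GENERATION_RULES)

def generate(functor, bindings):
    """Expand every merged alternative for a functor from one table lookup."""
    for template in TABLE.get(functor, []):
        yield [bindings.get(tok, tok) for tok in template]

for sentence in generate("give(X,Y,Z)", {"X": "Kim", "Y": "Lee", "Z": "a book"}):
    print(" ".join(sentence))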
Nevertheless, the resulting transitions between the various states of the generator are not necessarily deterministic. Here, too, a training corpus
is used to construct generation tables that are minimally nondeterministic
on the training data, by expanding out the number of internal states. However,
we cannot guarantee that the amount of nondeterminism encountered for new
input is minimal, though in general it will be (almost) minimal. In contrast to the method for speeding up syntactic analysis, this does not affect the
coverage of the grammar; we can still generate exactly the same sentences
for any input semantic structure that we could generate before. Thus, the
two methods differ in that grammatical completeness is preserved only for the latter.
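A rough Python sketch of the state-expansion step might look as follows (the table, the traces, and the one-symbol look-back are invented for illustration; the actual construction is more involved). Each state is split according to the symbol that led into it, the split states record the choices observed in training, and unseen contexts fall back to the full original set of alternatives, so nothing that could be generated before is lost.

from collections import defaultdict

# Original nondeterministic table: (state, symbol) -> set of next states.
ORIGINAL = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q1", "q2"},
    ("q1", "c"): {"q3"},
}

# Training traces: lists of (state, symbol, chosen next state).
TRAINING = [
    [("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "c", "q3")],
    [("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "b", "q2")],
]

def compile_expanded_table(training):
    """Split each state on its incoming symbol, recording the training choices."""
    expanded = defaultdict(set)
    for trace in training:
        incoming = None                      # symbol that led into the current state
        for state, symbol, chosen in trace:
            expanded[(state, incoming), symbol].add(chosen)
            incoming = symbol
    return dict(expanded)

EXPANDED = compile_expanded_table(TRAINING)

def next_states(state, incoming, symbol):
    """Prefer the (near-)deterministic expanded entry; fall back to the original."""
    return EXPANDED.get(((state, incoming), symbol),
                        ORIGINAL.get((state, symbol), set()))

# In a context seen during training the choice is unique ...
print(next_states("q1", "a", "b"))   # {'q1'}
# ... while an unseen context still offers every original alternative.
print(next_states("q1", "x", "b"))   # {'q1', 'q2'}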
Please contact:
Christer Samuelsson - Universität des Saarlandes
Tel: +49 681 302 4502
E-mail: christer@coli.uni-sb.de