ERCIM News No.26 - July 1996
Data-driven Linguistics
by Marc Moens
When it was established in 1989, the Human Communication Research
Centre in Edinburgh and Glasgow committed itself to expanding the scope
of formal investigations of language to encompass as wide a range as possible
of real language structure and real language use. Since then, HCRC has helped
create, collect and disseminate corpus resources, which are available for
study and use by researchers all over the world.
With the exception of sociolinguistics, traditional linguistics - including
computational linguistics - has tended to concentrate on short, carefully
constructed sentences. Not surprisingly, this has had consequences both
for the types of theories developed, and for their applicability to real-world
problems. But recent years have seen a sea change in attitudes
among researchers addressing human linguistic communication. Particularly
in computationally-oriented research and development, people have turned
away from abstract, theoretical work, towards concrete data-driven activities.
This shift has been made possible because substantial bodies of text and
speech have become available in electronic form. In turn, the shift in attitude
has increased demand for real data, and as a result, there has been a dramatic
growth in the number of new text collections or corpora.
These corpora tend to be large - in the order of hundreds of millions of
words of text. By way of comparison, a page of printed text usually contains
around 600 words, so a 100 million word corpus occupies more than 150,000
printed pages.
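The page estimate above can be checked with a line of arithmetic. The following is a minimal sketch: the function name `printed_pages` is hypothetical, and the default of 600 words per page is simply the rough figure quoted above, not a property of any particular corpus.

```python
import math


def printed_pages(corpus_words: int, words_per_page: int = 600) -> int:
    """Estimate how many printed pages a corpus of the given size would fill."""
    # Round up: a partly filled final page still counts as a page.
    return math.ceil(corpus_words / words_per_page)


# A 100 million word corpus at roughly 600 words per page.
pages = printed_pages(100_000_000)
print(pages)  # 166667, which is indeed more than 150,000 pages
```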
Our initial foray into the field of linguistic resources, the HCRC Map Task
Corpus, was built up from 128 dialogues between people carrying out a simple
cooperative task. Each of the two participants has a schematic map which
the other cannot see, but both collaborate to reproduce on one of the maps
a route already printed on the other. The dialogues were annotated at several
levels of detail, and these annotations, together with the maps and the
digitally recorded speech itself, are included on an eight-disc CD-ROM set.
Other corpus collection work has been concerned with textual, rather than
spoken, material. One example is the European Corpus Initiative, carried out by HCRC under
the aegis of the European Union and the Association for Computational Linguistics.
Until our production and distribution of the ECI disc (100 million words
in 22 languages), researchers in languages other than English had essentially
no access to large amounts of real text in their language in electronic
form. A new corpus collection project covers financial journalism between
1989 and 1991 across six European languages. Its balanced composition makes
it particularly suitable for comparative study, and as the basis for the
development of multilingual systems.
As well as helping to provide linguistic resources for worldwide use, HCRC
also carries out various research projects using these and other corpus
resources. In the course of this work, a number of tools have been developed
which help researchers find their way through these large corpora and derive
significant generalisations from them. These tools are distributed to other R&D
groups via HCRC's Language Technology Group. The Web pages at http://www.ltg.ed.ac.uk
give full details on how the tools can be downloaded.
Please contact:
Marc Moens - HCRC
University of Edinburgh
Tel: +44 131 650 4427
E-mail: M.Moens@ed.ac.uk