ERCIM News No.26 - July 1996 - CNR
Introduction
by Antonio Zampolli
The discipline of Computational Linguistics (CL) as it is known today
originated from the Machine Translation Research of the '50s and '60s. In
1966, a DARPA report attributed the failure of the machine translation project
to achieve concrete results to the concentration of the work on 'small scale
examples', and on 'miniature models of language'. However, despite the report's
recommendations to work with 'real language problems, above a certain scale
of grammar size, dictionary size and available corpora', in the following
two decades, CL focused mainly on the computational implementation of linguistic
models to deal with short, carefully selected sentences. This work produced
excellent theoretical insights, but proved to be insufficient when the major
funding agencies turned their attention to the potential of CL in answering
the growing needs of managing and accessing the wealth of information transmitted
by natural languages in the fast developing information society.
The demand for language industry products, to assist the traditional linguistic
professions (translation, language teaching, etc.) and to develop new language
processing applications (natural language interfaces, speech input and output,
document retrieval and indexing, etc.) has led to the emergence of the
language engineering paradigm, which requires development of robust language
processing components, capable of dealing with real texts in concrete information
and communication systems. This, in turn, requires the availability of reusable
language resources, (typically large) sets of language data and descriptions,
for building, training and evaluating written and spoken language processing
systems: spoken and written corpora, lexica, grammars and terminologies.
In this way, the major national and international funding agencies and organisations
have assumed and continue to have a key role in shaping our field. They
are currently sponsoring a large part of the on-going research, through
programmes which, by determining the objectives of the largest projects, in
practice define the main trends and strategies. For this reason, we felt
it appropriate to invite leaders of the main North American (NSF) and European
(EU) programmes to describe the general framework and the overall objectives
of the sponsored activities (see articles by Ballim et al., etc.).
The global information society has clear multilingual implications. Recently,
authoritative sources have warned that languages for which no adequate
computer processing is being developed risk gradually losing their place
in the global information society, with serious implications for the culture
of which they are the vehicle, to the detriment of one of humanity's greatest
values: cultural diversity. Bernard Quemada discusses the relationship between
language technology and multilingualism, and presents a set of recommended
actions.
International collaboration is particularly important for the progress of
our field and the success of its applications, especially those aiming at
producing multilingual information and communication services. Multilingual
systems production requires close coordination between the partners of the
different languages, to ensure the integrity of the components, and in particular
the interoperability of the embedded language resources. Two major infrastructural
European initiatives, EAGLES and ELRA, are described by Calzolari et al.
and Choukri, respectively.
The paradigm shift is reflected in the topics of the articles presented
by the ERCIM associated authors, which describe either individual projects
or the general action lines of their Institutes. The current Zeitgeist is
evident in the fact that several articles refer to the construction
of (large) language resources and/or focus on practical applications
of real language use. The mandate and the programmes of some Institutes
explicitly include the creation of multifunctional language resources, i.e.
resources intended for reuse by the R&D community (e.g. see the articles
by Moens, Calzolari, Hajicová and Wittmann). Other language resources
are created for direct use in the authors' own systems.
Innovative methods are researched and tools constructed for extracting knowledge
from language resources, e.g. identifying stylistic variations (Karlgren),
term extraction from corpora and use of corpora for training language processing
components (Samuelsson, Calzolari) and for structuring and organising the
knowledge acquired (Calzolari and André et al.). In parallel, models,
methods and tools are actively explored for annotating corpora and lexica
with increasingly deep levels of linguistic descriptions (see contributions
from ILC-CNR and CWI). In this way, synergies and convergences are reinforced
between abstract theoretical work and concrete data-driven activities.
Multilingual resources are developed for preparing multilingual applications:
e.g. translation (CRCIM) and aids for the disabled (FORTH). Tools are developed
for localising software (see the contribution from VTT). Robust linguistic
processing tools are developed to annotate large, real language corpora:
properly adapted, they can be incorporated as morphological, syntactic and
semantic components in application systems (see contributions from Ballim
et al., Prószéky and Calzolari).
The process of producing and using documents has received great attention
in the language engineering framework. Documents transmit knowledge and
present information organised for human understanding and work. The use
of language technology brings significant improvements to all document
processing phases and to work productivity in general. Topics discussed
in this issue's articles include document preparation and production, multilingual
document generation, content representation and synthesis, document navigation
and retrieval, extension of multimedia capabilities of information systems
(Alexa et al., André et al., Toussaint et al., Ballim et al., Pierrel
et al.).
In the past, the activities in the field of speech and of (written natural)
language processing have been developed separately for various reasons,
including the different scientific and technical knowledge and disciplinary
backgrounds required. Recently, the need for integration has become increasingly
apparent.
The last decade has witnessed a dramatic improvement in speech recognition.
The transition from laboratory demonstrations to commercial deployment has
already begun, providing services like voice dialling, call routing and simple
data entry. The next challenge is a fascinating one: to build spoken language
interfaces, in which both the user and the computer play active roles in
conversation, in the user's own language. Speech interfaces are the most
efficient, flexible and natural for humans, and will open access to the wealth
of information and services in the information networks, to a larger part
of society. The realisation of these interfaces requires that language processing
components work in synergy with speech recognition and generation components,
to produce meaning representation and natural speech output. Activities
in this direction are reported in the articles by Moens, Calzolari, Alexa
et al., Stephanidis and Antona, Trancoso, Gyimóthy, and Pierrel et
al. Trancoso refers not only to the activities of her Institute, but also
to ELSNET and, in particular, to actions for joint student training.
Please contact:
Antonio Zampolli - ILC-CNR
Tel: +39 50 560481
E-mail: hilary@ilc.pi.cnr.it