ERCIM News No.26 - July 1996 - CNR
Activities at the Institute for Computational Linguistics of CNR
by Nicoletta Calzolari
The Institute for Computational Linguistics of the Italian National
Research Council (ILC-CNR) of Pisa, known, together with the Department
of Linguistics of Pisa University and the Consorzio Pisa Ricerche, as the
'Pisa Group', has been active in the field of Computational Linguistics (CL)
since 1967.
The Pisa Group is now involved in a large number of national and international
projects, ranging from Text Processing (concordances, indices, lemmatisation,
statistical analyses, etc.), to building a Reference Corpus of the Italian
Language in co-ordination with parallel initiatives on other languages,
and including the development of large Textual Databases, use and analysis
of Machine-Readable Dictionaries, development of large Lexical Knowledge
Bases (monolingual and bilingual), study of parallel/contrastive multilingual
corpora, implementation of morphology for several languages, design of computational
grammars and development of parsers (in different frameworks), implementation
of Knowledge Representation languages and systems, study of dialogue and
natural language interfaces, Machine Translation, digital image processing,
acquisition of (lexical) information from large text corpora, and application
of natural language processing techniques in Information Retrieval and in
the field of digital libraries.
As it would be impossible to provide a detailed description of all the ongoing
activities here, we simply outline the main sectors of research and development,
highlighting the fact that they cover central and mainstream themes in
state-of-the-art CL, and are articulated within an overall design which
encompasses both CL proper and so-called Literary and Linguistic Computing.
This important convergence allows us to integrate the best of the two areas
of interest.
We can mention six main sectors of activity:
Very Large Reusable Linguistic Resources
In recent years the Pisa Group has strongly promoted the concept of reusability
of linguistic resources at the international level. In particular, it has
been very active in promoting awareness of the need for adequate linguistic
resources for all languages, and for actions directed at fostering development
in this sector. It has thus co-ordinated, and now co-ordinates, a number
of projects and activities of the EC: for the definition of common specifications
(LRE-EAGLES); for the definition of a European infrastructure for the creation,
management and distribution of resources; for exploration of possible cooperation
with the United States (NSF/ESPRIT, and EAGLES International Cooperation);
for experimentation of methods for the re-use of existing resources (ESPRIT
ACQUILEX); for a harmonised development of large generic Corpora and Lexicons
for European languages based on common specifications (LE-PAROLE); for the
(semi-)automatic acquisition of lexical information from large corpora (LE-SPARKLE);
for the collection and distribution of linguistic resources (the director
of ILC is president of ELRA; see article by K. Choukri in this issue).
Methods and Tools for Text Processing
Conceived for both literary and linguistic work and paying particular attention
to lexicographic needs, the Pisa text processing system has now been developed
as a highly complex structure composed of independent modules with the DBT,
a textual database system, as the core system, and including components
for morphological analysis and generation, POS tagging and lemmatisation,
and lexical database management. The DBT is software for mono- and bilingual
full text access and analyses. Recently, a set of procedures has been added
to the DBT system so that it can be used over the Internet.
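As an illustration of the kind of full-text access described above, here is a minimal sketch of a positional word index and a keyword-in-context concordance. It is not the DBT implementation; the function names (tokenize, build_index, kwic) and the sample line are assumptions chosen for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (letters only, accents included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def build_index(tokens):
    """Map each word form to the list of token positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index


def kwic(tokens, index, word, width=3):
    """Return keyword-in-context lines, with `width` tokens of context per side."""
    lines = []
    for pos in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, pos - width):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + width])
        lines.append(f"{left} [{tokens[pos]}] {right}".strip())
    return lines


# Hypothetical sample text, used only to exercise the functions.
sample = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
tokens = tokenize(sample)
index = build_index(tokens)

for line in kwic(tokens, index, "vita", width=2):
    print(line)
```

A real system would add lemmatisation on top of such an index, so that a query on a lemma retrieves all of its inflected forms.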
Image Processing and Computational Philology
The combination of text and image processing techniques seems to offer interesting
possibilities to various designers dealing with large collections of texts.
A particular task is the computer-assisted presentation and translation
of ancient manuscripts and old printed documents. In this field, the ILC
has also developed methods and tools in the framework of the European project
This line of activity provides a system, particularly appropriate for classical
scholars, to facilitate look-up of an image archive with digital representation
of the sources, transcribe the text contained in the images, and electronically
match each word of the transcription with the portion of the image in which
the word is inscribed.
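The word-to-image matching described above can be pictured as a simple data structure. The sketch below is only an assumption about how such an alignment might be stored, not the system developed at ILC; the class names, the page identifier and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Rectangular portion of a page image, in pixel coordinates."""
    page: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px, py):
        """True if the pixel (px, py) falls inside this rectangle."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


@dataclass(frozen=True)
class AlignedWord:
    """One word of the transcription linked to the region where it is written."""
    text: str
    region: Region


def regions_for(word, alignment):
    """From transcription to image: all regions in which `word` is inscribed."""
    return [a.region for a in alignment if a.text.lower() == word.lower()]


def word_at(page, px, py, alignment):
    """From image to transcription: the word written at a given pixel, if any."""
    for a in alignment:
        if a.region.page == page and a.region.contains(px, py):
            return a.text
    return None


# Hypothetical alignment for one line of an invented page image.
alignment = [
    AlignedWord("arma", Region("folio_1r", 10, 20, 60, 18)),
    AlignedWord("virumque", Region("folio_1r", 75, 20, 110, 18)),
    AlignedWord("cano", Region("folio_1r", 190, 20, 55, 18)),
]

print(word_at("folio_1r", 80, 25, alignment))
print(regions_for("cano", alignment))
```

Storing the link in both directions lets a reader move from a word in the transcription to its place in the image, and from a point in the image back to the transcribed word.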
Data, Methods and Tools for Analysis and Generation of Linguistic Structures
Data, methods and tools are designed and developed to deal with linguistic
structures at different levels of description. We give just a few examples.
At the phonological level, the Italian lexicon (both lemmas and inflected
word-forms) is provided with the phonological transcription, and a very
large inventory of proper names accompanied by the phonological transcription
has been built within LRE ONOMASTICA.
A Reference Corpus of Contemporary Italian and an Italian Lexicon are being
built within the framework of national and EC projects (among others LRE
DELIS, LRE MULTEXT, ET-10-Cobuild, in addition to those mentioned above).
A corpus of child language is being built within a national project to further
the study of language acquisition. A semantic lexicon is being built in
the framework of the LE EuroWordNet project, modelled on the Princeton WordNet,
and linked to other European WordNets.
At the syntactic level two main lines are worth mentioning: i) the creation
of grammars for Italian (eg in ATN and CGU (Complex Grammar Unit)), and
ii) the implementation of development tools. In the EC projects COLSIT and
LS-GRAM the aim is to port the EUROTRA grammars to the ALEP platform.
Within ESPRIT IDEAL and EUREKA PROMETHEUS, the ILC has contributed to formulating
a theoretical model for communication and dialogue applied to a number of
man-machine interactions.
Knowledge Representation and Cognitive Research
The theoretical study of knowledge structures, analysed in their logical
and cognitive components, allowed the development of a knowledge representation
language in the form of a semantic network. This type of research has been
developed in a number of national projects and in LRE CRISTAL, with the
design of modules for conceptual modelling and for developing domain-specific
Language Engineering Applications
The various data, methods, techniques and tools listed above are also used
in a number of application projects, as components of systems in the broad
area of information technology. We list a few of the relevant application
areas here.
In the Information Retrieval field the ILC has developed linguistic components
in LRE RENOS for the extraction of legal terms from a corpus, while in LRE
CRISTAL it has developed a multilingual environment for a natural language
interface in the retrieval of information from financial texts. In LE TAMIC-P,
a technical dictionary will be linked to a Knowledge Base in the pension
In the digital libraries area, we can cite the MLAP MEMORIA project aimed
at designing an intelligent reading environment to access large electronic
libraries, exploiting both natural language and image processing techniques.
In the multimedia sector, some projects use NLP methods and tools in an
environment which helps in learning, teaching and supporting the disabled,
eg by improving vocal man-machine interaction.
In the didactic area, the ADDIZIONARIO project aims at creating a hypermedia
linguistic laboratory to help children in the process of first language
learning, in particular through the use of a multimedia dictionary.
Much fuller information on our activities can be found at our Web site:
Please contact:
Nicoletta Calzolari
Tel: +39-50-560481
E-mail: glottolo@ilc.pi.cnr.it