ERCIM News No.26 - July 1996 - INESC
Natural Language Processing at INESC
by Luzia Wittmann
The Natural Language Group of INESC has developed a broad-coverage
system, named Palavroso, for automatic morphological processing of European
and Brazilian Portuguese. It is intended to be the first block of a more
complex system, to serve as a base for the development of commercial products,
and to be useful for scientific research on the Portuguese language.
The core of Palavroso is a rule-based morphological analyzer, to which lexicons
of variable size can be linked. The current European Portuguese (EP)
lexicon contains about 60,000 root words accepting up to 1,300,000 forms.
The Brazilian Portuguese lexicon is now in the final phase of construction.
Palavroso covers all inflectional morphology of Portuguese, and correctly
handles clitics (enclisis and mesoclisis), compounds, superlatives, augmentatives
and diminutives.
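As a rough illustration of how a rule-based analyzer can be linked to a
lexicon of root words, the following minimal Python sketch strips an
inflectional ending, rebuilds a candidate lemma and accepts the analysis only
if the lemma is in the lexicon. The lexicon entries, the rule table and the
function name analyze are hypothetical and chosen for illustration only; they
are not Palavroso's actual data, rules or interface.

```python
# Hypothetical toy lexicon: lemma -> part of speech (not Palavroso's data).
LEXICON = {"gato": "noun", "casa": "noun", "falar": "verb", "comer": "verb"}

# Hypothetical inflection rules:
# (surface ending, lemma ending, part of speech, features)
RULES = [
    ("", "", "noun", "singular"),
    ("s", "", "noun", "plural"),
    ("o", "ar", "verb", "present 1sg"),
    ("amos", "ar", "verb", "present 1pl"),
    ("o", "er", "verb", "present 1sg"),
    ("emos", "er", "verb", "present 1pl"),
    ("", "", "verb", "infinitive"),
]


def analyze(word):
    """Return every (lemma, part of speech, features) licensed by the rules."""
    analyses = []
    for surface, lemma_ending, pos, features in RULES:
        if not word.endswith(surface):
            continue
        # Strip the surface ending and restore the lemma ending.
        stem = word[: len(word) - len(surface)]
        lemma = stem + lemma_ending
        # Accept the analysis only if the lexicon lists this lemma
        # with the part of speech the rule expects.
        if LEXICON.get(lemma) == pos:
            analyses.append((lemma, pos, features))
    return analyses


if __name__ == "__main__":
    for form in ["gatos", "falamos", "como", "xyz"]:
        print(form, analyze(form))
```

Because the rules are kept separate from the lexicon, a larger or smaller
lexicon can be swapped in without touching the rules, which is one way a small
set of roots can accept a much larger number of inflected forms.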
The EP lexicon is compatible with the EAGLES recommendations and will be
reused for the Portuguese lexicon of the LE-PAROLE project, in which INESC
is participating as an associated partner of the Centre of Linguistics of the
University of Lisbon (CLUL). The Group is sharing with CLUL the construction
of a lexicon with 20,000 entries, with morphosyntactic and syntactic information.
The lexical entries will be selected with the help of the corpora tools
based on Palavroso. In the same project, Palavroso will be used for tagging
the Portuguese corpus (20 million running words).
Palavroso has been designed to run on, and is successfully installed on, two
different computer platforms, UNIX and Windows, and is easily adaptable
to any other computer system. Several applications have already been developed
using Palavroso as the core and underlying base. The most important are
a set of corpora tools, and a spelling checker, named Correcto.
Correcto also runs on the UNIX and Windows platforms and, as it is
intended to be commercialised, has been compared with the existing
commercial spelling aids for European Portuguese. The results show that
Correcto performs very well in all of the aspects measured. It
is clearly better at providing fewer and more precise suggestions. In
addition, its coverage of specific morphological phenomena (such as, for
instance, compounds and verbs with clitics) is far superior, due to the
underlying system. The measures and the method adopted are published and
available. The adaptation of Correcto to Brazilian Portuguese is under way.
Contrastive studies of the European and Brazilian variants of Portuguese
have been one of our research lines since 1994, bearing in mind that a common
NLP effort across the several variants can be advantageous for the Portuguese
language as a whole. The Natural Language Group carried out a first survey
of qualitative and quantitative differences between the two variants in
a joint project with Logos Inc. (USA) and is now continuing work in this
domain, expecting to obtain official funding to start a larger project in collaboration
with CLUL and UNESP (University of the State of São Paulo, Brazil).
Created in 1987 as a joint centre with IBM (IBM-INESC Scientific Group),
the Natural Language Group at INESC was reorganized in 1990 as a regular
R&D group of INESC, losing contact with IBM and diversifying cooperation
at national and international level. Since then the Group has been acquiring
experience in several domains of NLP for Portuguese. At present, besides
the activities mentioned above, the Group's main fields of interest
are grammar checking, machine translation (including MT between closely
related languages), computational aids for teaching Portuguese, and intelligent
text retrieval.
Please contact:
Luzia Wittmann - INESC
Tel: +351 1 3100303
E-mail: luzia.wittmann@inesc.pt