ERCIM News No.26 - July 1996 - SZTAKI
MorphoLogic - A Language Engineering Company from Hungary
by Gábor Prószéky
MorphoLogic, a Hungarian enterprise, was established in 1991 by a
group of NL researchers from the Hungarian Academy of Sciences and universities
in Budapest. MorphoLogic is the only organization in Hungary that is doing
R&D solely in the field of natural language processing. MorphoLogic
has noticed that one way of bridging the gap between basic research and
profit-oriented development is to do basic research within the company.
The close ties that MorphoLogic maintains with academic labs in Budapest
have resulted in a number of very profitable language products, and the
sale of these products funds not only the profit-making activities
of the company but its non-profit activities as well. MorphoLogic
is involved in four EC-sponsored (Copernicus) projects, GLOSSER, GRAMLEX,
MULTEXT-EAST and ELSnet goes East, with academic and industrial partners from
more than ten European countries.
The name of MorphoLogic refers to the company's focus on R&D work in
morphology and syntax. R&D efforts over the past few years have focused
on the following main related areas:
- development of a string-based unification morpho-syntactic formalism
- development of a family of proof-reading tools (spelling and grammar
checkers, hyphenators, thesauri) for Hungarian and other agglutinative and
highly inflectional languages (Polish, Romanian, Bulgarian, etc.)
- development of tools supporting intelligent text analysis, free text
search and database indexing
- development of intelligent bilingual morphological dictionaries
- development of a set of programs supporting character, handwriting
and speech recognition tools.
Each of these areas has one or more specific projects and partners associated
with it. The Research Institute for Linguistics (RIL) of the Hungarian Academy
of Sciences has been an important partner in the development of the string-based
morpho-syntactic formalism, since the first users of commercial morpho-syntactic
systems in Hungary were the lexicographers at RIL who were writing a corpus-based
Historical Dictionary of Hungarian.
MorphoLogic's basic system consists of both a morphological analyser and
a generator, and it can handle derivational and inflectional affixes and
compounding. Both the linguistic description language and the internal database
format with its search routines are in-house developments of MorphoLogic.
The linguistic databases of these models cover various natural languages,
from Hungarian through Eastern-European Slavic languages to German or English.
All the kernel linguistic software has been written in standard C, hence
the MorphoLogic program modules are totally portable.
Spell-checking for highly inflectional, agglutinative languages such as
Hungarian requires a thorough morphological analysis. This is very different
from spell-checking for morphologically simple languages like English, which
involves the trivial task of looking up the word in a word list. The morphology-based
speller, called Helyes-e?, consists of lexicons and algorithms that enable
the software to handle billions of possible words and to propose intelligent
corrections for the misspelled words. Helyes-e? can be customized by the
user, and thus it is easily adapted to OCR, handwriting and speech recognition
systems where error-types are different from typical typing errors. The
hyphenator, Helyesel, hyphenates any word-form, again using a morphological
segmentation algorithm. This model is useful for languages in which morpheme
boundaries override the usual hyphenation points. List-based hyphenation
does not work in such languages. What's more, Helyesel also allows hyphenation
with optional letter-insertion or letter-change. Helyette, the so-called
inflectional thesaurus, is a combination of a morphological analyser, a
synonym dictionary and a morphological generator. It works by finding the
lexical base of an input word and storing the inflectional information.
It then offers the synonyms of the stem, and finally, it generates the morpho-phonologically
correct combination of the chosen synonym and the stored inflectional information.
Helyette is meant to be language-independent. Its first implementation with
the complex suffix system of Hungarian has been successful and MorphoLogic
is now looking to test the system on other languages.
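
The three steps described for Helyette (analyse the input word, look up
synonyms of its stem, regenerate the inflected form) can be illustrated with
a minimal sketch. This is not MorphoLogic's implementation: the lexicon, the
two case suffixes, the vowel-harmony rule and all function names below are
hypothetical toy assumptions, and real Hungarian morphology is far richer.

```python
# Toy sketch of an "inflectional thesaurus": analyse, look up, regenerate.
# Everything here (data, names, rules) is a hypothetical simplification.

BACK_VOWELS = set("aáoóuú")

# Each case tag maps to its (back-vowel, front-vowel) suffix variant.
SUFFIXES = {
    "INESSIVE": ("ban", "ben"),  # "in ..."
    "DATIVE": ("nak", "nek"),    # "to/for ..."
}

# Toy synonym dictionary keyed by stem.
SYNONYMS = {"ház": ["épület", "otthon"]}


def analyse(word):
    """Return (stem, case tag) if the word can be segmented, else None."""
    for tag, variants in SUFFIXES.items():
        for suffix in variants:
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and stem in SYNONYMS:
                return stem, tag
    if word in SYNONYMS:
        return word, None  # bare stem, no inflection to store
    return None


def generate(stem, tag):
    """Attach the suffix variant that agrees with the stem's vowels."""
    if tag is None:
        return stem
    back, front = SUFFIXES[tag]
    # Simplified harmony rule: any back vowel selects the back variant.
    return stem + (back if BACK_VOWELS & set(stem) else front)


def inflected_synonyms(word):
    """Offer synonyms of the stem, re-inflected like the input word."""
    analysis = analyse(word)
    if analysis is None:
        return []
    stem, tag = analysis
    return [generate(synonym, tag) for synonym in SYNONYMS[stem]]


print(inflected_synonyms("házban"))
```

In this toy setting the stored inflectional information is a single case
tag, and the generator picks "ben" for "épület" but "ban" for "otthon", which
is the kind of morpho-phonological adjustment a plain string substitution of
the stem would miss.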
The project concentrating on intelligent dictionaries is called MoBiDic
(MorphoLogic Bilingual Dictionaries). The word or expression to be translated
first goes through morphological segmentation; its stem(s) then form the actual
query, which is matched either against the headwords or against the full entry. This
latter option makes the lexical search similar to free text search with
linguistic filters. MoBiDic is, therefore, able to treat dictionaries and
corpora with the help of the same set of linguistic functions. Furthermore,
the number of dictionaries is not limited: MoBiDic looks up a word in open
dictionaries that you either buy or build yourself. The most recent features
are support for any sort of multimedia and a well-defined
API to MoBiDic, which is open to researchers.
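
The stem-based lookup described above can be sketched as follows. This is
only an illustration under stated assumptions, not the MoBiDic API: the
two-entry dictionary, the suffix list, and the names stem and lookup are all
hypothetical, and the "segmentation" is a crude suffix stripper.

```python
import re

# Hypothetical toy data: a tiny Hungarian-English dictionary and suffix list.
SUFFIXES = ("ban", "ben", "nak", "nek", "je")
ENTRIES = {
    "ház": "house (a ház kertje: the garden of the house)",
    "kert": "garden",
}


def stem(token):
    """Strip one known suffix if what remains is a known headword."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and token[: -len(suffix)] in ENTRIES:
            return token[: -len(suffix)]
    return token


def lookup(query, full_entry=False):
    """Return headwords whose headword (or, optionally, body) matches the stem."""
    target = stem(query)
    hits = []
    for headword, body in ENTRIES.items():
        if headword == target:
            hits.append(headword)
        elif full_entry:
            # Apply the same linguistic filter to every token of the entry.
            body_stems = {stem(t) for t in re.findall(r"\w+", body)}
            if target in body_stems:
                hits.append(headword)
    return hits


print(lookup("kertben"))                   # headword search only
print(lookup("kertben", full_entry=True))  # also searches entry bodies
```

Because the same stemming function is applied to the query and to the entry
text, the inflected query "kertben" also finds the entry for "ház", whose
example phrase contains the differently inflected form "kertje". This is the
sense in which full-entry lookup resembles free text search with linguistic
filters.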
Two projects have been started recently: one for the linguistic support of
recognition tools, and one for the development of a parser, HumorESK (Humor
Enhanced with Syntactic Knowledge), that relies on the morphological engine
Humor (High-speed Unification Morphology).
Please contact:
Gábor Prószéky - MorphoLogic
Tel: +36 1 2018355
E-mail: h6109pro@ella.hu