Guaranteeing Multilinguality in the Information Society
Carol Peters
Istituto di Elaborazione della Informazione
Consiglio Nazionale delle Ricerche
With the recent rapid diffusion over the international computer networks
of world wide distributed document bases, the question of multilingual access
and multilingual information retrieval is becoming increasingly relevant.
So far research and development activities have been concentrated on monolingual
environments and, in the large majority of cases, the default language has
been English. Although English admittedly tends to play a predominant role
in international communications, there are many risks inherent in allowing
this predominance to remain unchallenged. The diversity of the world's languages
and cultures gives rise to an enormous wealth of knowledge and ideas. It
is thus essential to study and develop computational methodologies and tools
that allow us to preserve and exploit this heritage rather than helping
to destroy it. Ideally, it should be possible for users throughout the world
- independently of their native tongue - to have access to the massive volumes
of information of all types - scientific, economic, literary, news, etc.
- now available over the networks, and in particular through the Internet and
the World Wide Web. It must also be possible for information providers throughout
the world to make their work and ideas available to anyone, anywhere, in
whatever language. At the same time, attention must be paid to assisting
non-expert users. They must be provided with easy-to-use, flexible tools
that help and guide them in the search for knowledge. This is especially
important for developing countries where one of the main keys to progress
is education - and the main path to education is access to knowledge. By
this we mean access to a wide diversity of unrestricted information
sources that give the user the opportunity to select and discard, retrieving
exactly those texts that are of interest.
To make all this possible, we must begin to take measures to guarantee multilinguality
in the global information society.
This implies that users can access document bases available on the Internet
in different languages, specifying their information needs in their native
language, yet achieving a high level of search and retrieval precision.
Ideally, they should be able to retrieve documents matching their query
in whatever language the document is stored. Of course, this also means
that information providers can make their material available on the Web in their
preferred language, confident that this does not in itself preclude or limit
access.
We will not discuss the question of machine translation here. This must
be considered as secondary to multilingual access. Access is an essential
first stage; if necessary - once the right information source has been identified
and retrieved - you can proceed to its translation. If you are unable to
access the document then you may never know of its existence, however useful
it might have been.
However, the question of multilingual access is an extremely complex one.
Two basic issues are involved:
(i) Multiple language representation, manipulation, and display
(ii) Multilingual search and retrieval.
The first item involves the question of multilingual character set encoding.
An application that claims to be multilingual must obviously support the
character sets and encodings used to represent the information it is processing.
However, at present, the vast majority of WWW browsers have no support
for multilingual data representation and recognition. The second issue
is that of multilingual querying, i.e. the study and development of tools
that make it possible for users to interact with document bases in different
languages, formulating queries in their preferred language.
In the following, we focus on these two issues, mentioning some of the current
research efforts to resolve them, with particular reference to the approach
that will be adopted in an ERCIM-sponsored project - the SAMOS project.
The SAMOS project aims at the development of a networked computer science
technical report library. A digital library architecture will be developed
which provides Internet access to a distributed, decentralised multi-format
collection of documents and includes a multilingual interface.
SAMOS is running in collaboration with the US-based Networked Computer Science
Technical Report Library - NCSTRL, which includes some of the major US universities.
The issue of multilinguality has not been addressed to date in the NCSTRL
consortium since it has been operating in an exclusively English-speaking
environment. However, it immediately becomes of great significance once
we start working in Europe - and if, as we hope, the collaboration can be
extended to cover other areas of the world.
In the part of the project responsible for studying and implementing
multilingual access and query functionalities, we have
identified four areas in which substantial progress must be made:
- the provision of an interface to the SAMOS system in each European
language;
- the compilation of large-scale multilingual language reference resources
in the domain of Computer Science;
- the design of an enriched multilingual classification thesaurus for
computer science;
- the investigation of approaches to multilingual information retrieval.
The first stage is to provide the Multilingual Interface to the system:
an application that claims to be multilingual must be able to present data
in multiple languages meaningfully.
The ERCIM consortium represents a multilingual community of more than one
dozen languages, including English, German, the Romance languages (French,
Spanish, Portuguese, Italian), most of the Scandinavian languages, Czech,
Hungarian and Greek. An important implication of this diversity is that
the combined character set of these languages is much greater than that
of the ASCII character set used to encode the English documents of the NCSTRL
consortium to date. SAMOS must therefore identify and adopt a suitable character
encoding standard to cover all of the languages it represents.
We have identified two different approaches that may be taken to encode
such a large combined character set. The first approach is to use several
of the 8-bit ISO standard Latin character sets and to include in the document
metadata the character code used in that document (for example a document
written in German requires a different 8-bit code than a document written
in Greek). This is perhaps feasible as long as we limit our coverage to
the most common European languages. The problem becomes much more complex,
however, if we want to start moving between, for example, French and Arabic,
or English and Japanese. If we start to use a large number of character sets
and encodings, and if the browser is to handle translation from one set
to another, the system response times will clearly be heavily affected.
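The dependence of this first approach on document metadata can be
illustrated with a minimal Python sketch. It assumes ISO 8859-1 for German
and ISO 8859-7 for Greek; the decode_document helper and the "charset"
metadata field are hypothetical names chosen for illustration and are not
part of SAMOS.

```python
# Illustrative sketch: the same byte values denote different characters
# depending on which 8-bit ISO 8859 character set is assumed.


def decode_document(raw: bytes, metadata: dict) -> str:
    """Decode document bytes using the charset named in its metadata.

    The "charset" key is a hypothetical metadata field.
    """
    return raw.decode(metadata["charset"])


# A German word stored in ISO 8859-1 and a Greek word stored in ISO 8859-7.
german_bytes = "Größe".encode("iso-8859-1")
greek_bytes = "λόγος".encode("iso-8859-7")

# With the correct metadata, both documents decode to the intended text.
german_text = decode_document(german_bytes, {"charset": "iso-8859-1"})
greek_text = decode_document(greek_bytes, {"charset": "iso-8859-7"})

# With the wrong metadata, decoding raises no error but yields the wrong
# characters: every byte is valid in ISO 8859-1, just with another meaning.
garbled = decode_document(greek_bytes, {"charset": "iso-8859-1"})
```

The last call shows why the character code has to travel with the document:
nothing in the bytes themselves reveals which 8-bit set was intended.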
An alternative is to use a single 16-bit character encoding, like the Unicode
standard, to represent all the languages. The Unicode Character Standard
encodes scripts (collections of symbols) rather than languages. It currently
contains 34,168 coded characters covering the principal written languages of
the Americas, Europe, the Middle East, Africa, India and Asia. Unicode characters
are language neutral; if necessary, a higher level protocol must be used
to specify the language.
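As a small illustration of this language neutrality, the following Python
sketch uses the standard unicodedata module to show that a code point
carries a script-level name but no language. The "lang" field in the tagged
record is a hypothetical stand-in for the higher level protocol mentioned
above.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and the standard Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The same code point serves French, Italian, Spanish and other languages;
# its name mentions the script (Latin) but no language.
e_acute = describe("\u00e9")

# A Greek letter is likewise identified by script only.
greek_letter = describe("\u03bb")

# If the language matters, it has to be recorded alongside the text.
# The "lang" key is an assumption made for this example.
tagged = {"lang": "fr", "text": "r\u00e9seau"}
```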
We will investigate both of these alternatives (and possibly other solutions)
and decide on what is most suitable for our digital library application,
taking into consideration both the immediate future and long-term possible
developments.
In particular, although a 16-bit code ensures that a document can be displayed
without relying on its metadata, it places higher demands on storage, and
could significantly affect long-distance transmission times. We will also
have to investigate thoroughly the implications of introducing an extended
character encoding into the wider NCSTRL consortium, particularly from the
point of view of compatibility with existing document collections. At the
same time, we intend to maintain
close contact with the World Wide Web Consortium through ERCIM. The Web
Consortium has recently set up a working group to study this question, and
the standard we decide to adopt must be compatible with any decision they
may make on a character encoding standard for the World Wide Web.
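The storage trade-off mentioned above can be made concrete with a short
Python sketch. It uses Python's "utf-16-be" codec as a stand-in for a
fixed-width 16-bit encoding; that choice, and the sample strings, are
assumptions made for illustration only.

```python
# Compare the size of the same text in an 8-bit character set and in a
# 16-bit encoding, then store mixed-script text in a single encoding.

sample = "Technical report library"  # characters in the ASCII range only

eight_bit = sample.encode("iso-8859-1")   # one byte per character
sixteen_bit = sample.encode("utf-16-be")  # two bytes per character, no BOM

# For text of this kind the 16-bit form needs twice the storage.
ratio = len(sixteen_bit) / len(eight_bit)

# In return, German and Greek text can share one document without any
# per-document character set metadata.
mixed = "Größe λόγος"
round_trip = mixed.encode("utf-16-be").decode("utf-16-be")
```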
In the first stage of SAMOS, we will implement and test the encoding of
our extended character set by providing user interfaces to the system in
English, French, German, Greek and Italian. It will be possible for the
user to independently select the language of the interface and the language
in which the query will be formulated and submitted. Once the character
encoding has been implemented and tested on these initial five languages,
we intend to provide additional interfaces in the rest of the project languages.
Of course the question of multilingual data representation is just the first
step. Within the project we intend to study the development of multilingual
search tools. We have two types of tools in mind. In the first stage, our
intention is to implement a multilingual keyword-based search tool to enable
fairly rudimentary searching over documents in several languages. In the second
stage, we will investigate the design of more sophisticated search
mechanisms.
A particularly interesting feature of the SAMOS multilingual effort is that
it strongly reflects a recent trend: the convergence between NLP and IR.
SAMOS will provide a platform for the integration of methodologies and tools
developed for Natural Language Processing (NLP), and in particular in the
field of computational lexicography, with techniques and results coming
from the information retrieval (IR) field.
Over the last decade or so, in the field of computational linguistics, there
has been intensive work on the development of a series of important lexicon
and text management and analysis tools, needed for all types of Natural
Language Processing applications. In particular, we can mention mono- and
bilingual electronic dictionaries and lexical databases, lexical knowledge
bases, morphological analysers and generators, procedures to generate taxonomies
from dictionary data, monolingual text analysis systems, bilingual corpus
systems, part-of-speech taggers, syntactic parsers, sense disambiguators,
etc.
Such tools are now being applied to typical information retrieval tasks
and a number will be employed in the implementation of the SAMOS multilingual
access functionalities.
The basis for our search tools will be provided by a multilingual classification
thesaurus. Recent studies in the field of corpus linguistics show the importance
of real language data in order to acquire reliable statistical data on term
usage and frequency; this can be supplied by language reference corpora
for the domain of interest. We thus intend to construct reference corpora
for Computer Science: the main corpus will be an English sub-language representative
corpus for computer science. Important resources in the creation and evaluation
of multilingual search tools will be a set of comparable corpora in the
main project languages. Comparable corpora are sets of texts from pairs
or multiples of languages that have the same communicative function and
can be contrasted and compared because of their common features. These corpora
should be initially representative of a sufficiently wide theoretical computer
science sub-domain or set of sub-domains.
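A minimal Python sketch of the kind of term-frequency data a reference
corpus can supply is given below. The two-sentence toy corpus and the
term_frequencies helper are hypothetical; a real reference corpus would be
far larger and would normally be processed with proper tokenisation and
morphological analysis.

```python
import re
from collections import Counter


def term_frequencies(texts):
    """Count lower-cased word forms across a collection of texts."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


# Toy corpus invented for this example.
toy_corpus = [
    "The parser builds a syntax tree.",
    "A syntax error stops the parser.",
]

freqs = term_frequencies(toy_corpus)
parser_count = freqs["parser"]
tree_count = freqs["tree"]
```

Frequency lists of this kind, computed per language over comparable
corpora, are one possible input for selecting candidate thesaurus terms.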
The development of tools for multilingual querying is heavily dependent
on the availability of adequate thesauri in each language with translation
links. We will thus construct an enriched multilingual thesaurus for computer
science, on the basis of both existing classification thesauri and of the
corpus data. To serve as a basis for multilingual querying, the thesaurus
to be created must go into much finer detail than do existing thesauri such
as the 'classical' INSPEC thesaurus for physics, electrotechnology, computers
and control. A core thesaurus will be constructed for English and then be
translated into German, French, and Italian, using electronic dictionaries
where possible. The translated terminology will be mapped to the base core
thesaurus. It is not to be expected that the result will perfectly reflect
usage and structure of terminology in these target languages. The results
will be evaluated and correlated with lists of key terms directly generated
from corpora in the target languages (French, German, Italian).
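One possible way to picture a core thesaurus with translation links is as a
table of concepts, each mapped to its terms in every language. The Python
sketch below is illustrative only: the concept identifiers, the two sample
entries and the translate_term helper are assumptions, not the SAMOS
design.

```python
# Each concept of the (hypothetical) core thesaurus maps language codes
# to the terms that express the concept in that language.
THESAURUS = {
    "C001": {
        "en": ["database"],
        "fr": ["base de données"],
        "de": ["Datenbank"],
        "it": ["base di dati"],
    },
    "C002": {
        "en": ["operating system"],
        "fr": ["système d'exploitation"],
        "de": ["Betriebssystem"],
        "it": ["sistema operativo"],
    },
}


def translate_term(term, source_lang, target_langs):
    """Map a query term to its equivalents through the shared concept.

    Returns an empty dict when the term is not in the thesaurus.
    """
    needle = term.casefold()
    for concept in THESAURUS.values():
        source_terms = concept.get(source_lang, [])
        if any(needle == t.casefold() for t in source_terms):
            return {lang: list(concept.get(lang, [])) for lang in target_langs}
    return {}


# A German query term expanded into the other project languages.
result = translate_term("Datenbank", "de", ["en", "fr", "it"])
missing = translate_term("unknown term", "en", ["fr"])
```

Routing every term through a language-independent concept, rather than
through pairwise dictionaries, means each new language needs only one
mapping to the core thesaurus.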
Of immediate relevance to our work on multilingual thesaurus building and
the development of keyword-based search tools are two current projects
in the Libraries programme: TRANSLIB and CANAL/LS 3. Both projects aim at
supporting multilingual access (TRANSLIB: Greek, Spanish, English; CANAL/LS:
English, German, French, Spanish) to library on-line public access catalogues
(OPACs) and both are building multilingual thesauri as tools for this purpose.
SAMOS aims at extending the multilingual access to search not only catalogue
but also full text document data in a specific domain. The multilingual
thesaurus to be designed and built in SAMOS will thus differ in that it
will refer to a selected sublanguage (computer science) and should be more
exhaustive: it will be supplemented using reference corpus data and expanded
to include a network of semantically and syntactically related data.
In this respect, we hope to establish links between the multilingual work
in SAMOS and that of EuroWordNet LE 4003, which aims at building a multilingual
wordnet database with semantic relations between words for English, Dutch,
Spanish and Italian. The wordnets will be stored in a central lexical database
system and word meanings linked to meanings in the Princeton WordNet 1.5.
Major concepts and words in the individual wordnets will be merged to form
a language-independent ontology (a set of semantic relations between concepts).
The aim is to create a flexible general-language multilingual search tool
(not domain-specific terminology).
We will use the multilingual thesaurus in the study and development of multilingual
search tools; we will also study the design of broader-coverage tools
to complement the thesaurus-based tools. The aim is to cover the different
requirements of users of our system: librarians using rich thesauri to specify
precise queries where exact results are required, and researchers specifying
more vague queries where a high level of recall is desired.
The methodologies studied should be extendible to other languages. There
are three proposed pilot programs within SAMOS:
I. Corpus-based enrichment of multilingual thesauri and lexicon-based query
procedures
II. Multilingual querying based on a graphical thesaurus browser
III. Automatic query expansion for multilingual information retrieval.
By the end of the first stage of the SAMOS project, we should have defined
a proposal for multiple language representation and implemented a first
prototype of a multilingual search tool.