Guaranteeing Multilinguality in the Information Society

Carol Peters
Istituto di Elaborazione della Informazione
Consiglio Nazionale delle Ricerche

With the recent rapid diffusion of worldwide distributed document bases over the international computer networks, the question of multilingual access and multilingual information retrieval is becoming increasingly relevant.

So far, research and development activities have concentrated on monolingual environments and, in the large majority of cases, the default language has been English. Although English admittedly tends to play a predominant role in international communications, there are many risks inherent in allowing this predominance to remain unchallenged. The diversity of the world's languages and cultures gives rise to an enormous wealth of knowledge and ideas. It is thus essential to study and develop computational methodologies and tools that allow us to preserve and exploit this heritage rather than helping to destroy it. Ideally, it should be possible for users throughout the world - independently of their native tongue - to have access to the massive volumes of information of all types - scientific, economic, literary, news, etc. - now available over the networks, and in particular through the Internet and the World Wide Web. It must also be possible for information providers throughout the world to make their work and ideas available to anyone, anywhere, in whatever language. At the same time, attention must be paid to assisting non-expert users. They must be provided with easy-to-use, flexible tools that help and guide them in the search for knowledge. This is especially important for developing countries where one of the main keys to progress is education - and the main path to education is access to knowledge. By this we mean access to a wide diversity of unrestricted information sources that give the user the opportunity to select and discard, retrieving exactly those texts that are of interest.

To make all this possible, we must begin to take measures to guarantee multilinguality in the global information society.

This implies that users should be able to access document bases available over the Internet in different languages, specifying their information needs in their native language, yet achieving a high level of search and retrieval precision. Ideally, they should be able to retrieve documents matching their query in whatever language the document is stored. Of course, this also means that information providers can make their material available on the Web in their preferred language, confident that this does not in itself preclude or limit access.

We will not discuss the question of machine translation here. This must be considered secondary to multilingual access. Access is an essential first stage; if necessary - once the right information source has been identified and retrieved - you can proceed to its translation. If you are unable to access the document then you may never know of its existence, however useful it might have been.

However, the question of multilingual access is an extremely complex one. Two basic issues are involved:
(i) Multiple language representation, manipulation, and display
(ii) Multilingual search and retrieval.

The first item involves the question of multilingual character set encoding. An application that claims to be multilingual must obviously support the character sets and encodings used to represent the information it is processing. However, at present, the vast majority of WWW browsers have no support for multilingual data representation and recognition. The second issue is that of multilingual querying, i.e. the study and development of tools that make it possible for users to interact with document bases in different languages, formulating queries in their preferred language.

In the following, we focus on these two issues, mentioning some of the current research efforts to resolve them, with particular reference to the approach that will be adopted in an ERCIM-sponsored project - the SAMOS project.

The SAMOS project aims at the development of a networked computer science technical report library. A digital library architecture will be developed which provides Internet access to a distributed, decentralised multi-format collection of documents and includes a multilingual interface.

SAMOS is running in collaboration with the US-based Networked Computer Science Technical Report Library - NCSTRL, which includes some of the major US universities. The issue of multilinguality has not been addressed to date in the NCSTRL consortium since it has been operating in an exclusively English-speaking environment. However, it immediately becomes of great significance once we start working in Europe - and if, as we hope, the collaboration can be extended to cover other areas of the world.

In the part of the project that will be responsible for studying and implementing multilingual access and query functionalities, we have identified four areas in which substantial progress must be made:

(i) the provision of an interface to the SAMOS system in each European language;
(ii) the compilation of large-scale multilingual language reference resources in the domain of computer science;
(iii) the design of an enriched multilingual classification thesaurus for computer science;
(iv) the investigation of approaches to multilingual information retrieval.

The first stage is to provide the Multilingual Interface to the system: an application that claims to be multilingual must be able to present data in multiple languages meaningfully.

The ERCIM consortium represents a multilingual community of more than a dozen languages, including English, German, the Romance languages (French, Spanish, Portuguese, Italian), most of the Scandinavian languages, Czech, Hungarian and Greek. An important implication of this diversity is that the combined character set of these languages is much larger than the ASCII character set used to encode the English documents of the NCSTRL consortium to date. SAMOS must therefore identify and adopt a suitable character encoding standard to cover all of the languages it represents.

We have identified two different approaches that may be taken to encode such a large combined character set. The first approach is to use several of the 8-bit ISO standard Latin character sets and to include in the document metadata the character code used in that document (for example, a document written in German requires a different 8-bit code from a document written in Greek). This is perhaps feasible as long as we limit our coverage to the most common European languages. The problem becomes much more complex, however, if we want to start moving between, for example, French and Arabic, or English and Japanese. If we start to use a large number of character sets and encodings, and if the browser is to handle translation from one set to another, the system response times will clearly be heavily affected.
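To make the first approach concrete, the following is a minimal sketch in Python of decoding driven by document metadata. The record layout and the field names "charset" and "body" are assumptions made for illustration only; they are not taken from the SAMOS or NCSTRL metadata formats.

```python
# Hypothetical metadata records: each document names the 8-bit
# character set that was used to encode its body.
documents = [
    {"charset": "iso-8859-1", "body": "Übersetzung".encode("iso-8859-1")},
    {"charset": "iso-8859-7", "body": "μετάφραση".encode("iso-8859-7")},
]


def decode_document(doc):
    """Decode the raw bytes using the charset named in the metadata."""
    return doc["body"].decode(doc["charset"])


decoded = [decode_document(d) for d in documents]

# The same byte value stands for different characters in different
# 8-bit sets, so the bytes cannot be interpreted without the metadata.
ambiguous = bytes([0xE1])
as_latin1 = ambiguous.decode("iso-8859-1")  # Latin small letter a with acute
as_greek = ambiguous.decode("iso-8859-7")   # Greek small letter alpha
```

The last three lines show why the character code has to travel with the document: a single byte is read as a Latin letter under one set and as a Greek letter under another.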

An alternative is to use a single 16-bit character encoding, like the Unicode standard, to represent all the languages. The Unicode Character Standard encodes scripts (collections of symbols) rather than languages. It currently contains 34,168 coded characters covering the principal written languages of the Americas, Europe, the Middle East, Africa, India and Asia. Unicode characters are language neutral; if necessary, a higher level protocol must be used to specify the language.

We will investigate both of these alternatives (and possibly other solutions) and decide on what is most suitable for our digital library application, taking into consideration both the immediate future and long-term possible developments.

In particular, although a 16-bit code ensures that a document can be displayed without relying on its metadata, it places higher demands on storage, and could significantly affect long-distance transmission times. We will also have to investigate thoroughly the implications of introducing an extended character encoding into the wider NCSTRL consortium, particularly from the point of view of compatibility with existing document collections, so that full compatibility may be maintained. At the same time, we intend to maintain close contact with the World Wide Web Consortium through ERCIM. The Web Consortium has recently set up a working group to study this question, and the standard we decide to adopt must be compatible with any decision they may make on a character encoding standard for the World Wide Web.
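The storage trade-off mentioned above can be illustrated with a short Python sketch. It assumes Python's "utf-16-be" codec as a stand-in for a fixed 16-bit code (for the characters used here each one occupies exactly two bytes); the sample strings are invented for the example.

```python
# Plain English sample text (invented for illustration)
text = "Multilingual access"

eight_bit = text.encode("iso-8859-1")   # one byte per character
sixteen_bit = text.encode("utf-16-be")  # two bytes per character, no byte-order mark

# Size of the 16-bit form relative to the 8-bit form
ratio = len(sixteen_bit) / len(eight_bit)

# With a single 16-bit code, Latin and Greek text can share one string
# and be decoded without any per-document charset metadata.
mixed = "Bibliothèque / βιβλιοθήκη"
mixed_bytes = mixed.encode("utf-16-be")
round_trip = mixed_bytes.decode("utf-16-be")
```

For text that fits in a single 8-bit set the 16-bit form is twice as large, which is the cost to be weighed against the ability to mix scripts freely.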

In the first stage of SAMOS, we will implement and test the encoding of our extended character set by providing user interfaces to the system in English, French, German, Greek and Italian. It will be possible for the user to independently select the language of the interface and the language in which the query will be formulated and submitted. Once the character encoding has been implemented and tested on these initial five languages, we intend to provide additional interfaces in the rest of the project languages.

Of course the question of multilingual data representation is just the first step. Within the project we intend to study the development of multilingual search tools. We have two types of tools in mind. In a first stage, our intention is to implement a multilingual key-word based search tool to enable fairly rudimentary searching over documents in several languages. In a second stage, we will be investigating the design of more sophisticated search mechanisms.

A particularly interesting feature of the SAMOS multilingual effort is that it strongly reflects a recent trend: the convergence between NLP and IR.
SAMOS will provide a platform for the integration of methodologies and tools developed for Natural Language Processing (NLP), and in particular in the field of computational lexicography, with techniques and results coming from the information retrieval (IR) field.

Over the last decade or so, in the field of computational linguistics, there has been intensive work on the development of a series of important lexicon and text management and analysis tools, needed for all types of Natural Language Processing applications. In particular, we can mention mono- and bilingual electronic dictionaries and lexical databases, lexical knowledge bases, morphological analysers and generators, procedures to generate taxonomies from dictionary data, monolingual text analysis systems, bilingual corpus systems, part-of-speech taggers, syntactic parsers, sense disambiguators, etc.
Such tools are now being applied to typical information retrieval tasks and a number will be employed in the implementation of the SAMOS multilingual access functionalities.

The basis for our search tools will be provided by a multilingual classification thesaurus. Recent studies in the field of corpus linguistics show the importance of real language data in order to acquire reliable statistical data on term usage and frequency; this can be supplied by language reference corpora for the domain of interest. We thus intend to construct reference corpora for Computer Science: the main corpus will be an English sub-language representative corpus for computer science. Important resources in the creation and evaluation of multilingual search tools will be a set of comparable corpora in the main project languages. Comparable corpora are sets of texts from pairs or multiples of languages that have the same communicative function and can be contrasted and compared because of their common features. These corpora should be initially representative of a sufficiently wide theoretical computer science sub-domain or set of sub-domains.
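As an illustration of the kind of statistical data a reference corpus can supply, the sketch below counts term frequencies over a toy corpus in Python. The three sentences, the stop-word list and the function name term_frequencies are all invented for the example; a real reference corpus would be far larger and would call for proper tokenisation and morphological analysis.

```python
import re
from collections import Counter

# Tiny made-up sample standing in for a domain reference corpus
corpus = [
    "The parser builds a syntax tree from the token stream.",
    "A compiler uses the parser and the syntax tree.",
    "The token stream is produced by the lexer.",
]

# Hypothetical stop-word list; function words carry no terminological value
STOPWORDS = {"the", "a", "and", "is", "by", "from", "uses"}


def term_frequencies(texts):
    """Count how often each non-stop-word token occurs in the texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


freqs = term_frequencies(corpus)
```

Frequency lists of this kind, computed separately for each language, are one possible starting point for the key term lists against which a translated thesaurus can be checked.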

The development of tools for multilingual querying is heavily dependent on the availability of adequate thesauri in each language with translation links. We will thus construct an enriched multilingual thesaurus for computer science, on the basis of both existing classification thesauri and of the corpus data. To serve as a basis for multilingual querying, the thesaurus to be created must go into much finer detail than do existing thesauri such as the 'classical' INSPEC thesaurus for physics, electrotechnology, computers and control. A core thesaurus will be constructed for English and then be translated into German, French, and Italian, using electronic dictionaries where possible. The translated terminology will be mapped to the base core thesaurus. It is not to be expected that the result will perfectly reflect usage and structure of terminology in these target languages. The results will be evaluated and correlated with lists of key terms directly generated from corpora in the target languages (French, German, Italian).
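One possible way to picture a core thesaurus with translation links is sketched below in Python. The Concept class, its fields and the two sample entries are hypothetical and chosen purely for illustration; they do not describe the structure of the thesaurus to be built in the project.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Concept:
    """One entry of a core (English) thesaurus with translation links."""
    term: str                       # preferred English term
    broader: Optional[str] = None   # link to a more general concept
    translations: Dict[str, str] = field(default_factory=dict)


# Hypothetical sample entries
thesaurus = {
    "database": Concept(
        "database", None,
        {"de": "Datenbank", "fr": "base de données", "it": "base di dati"},
    ),
    "relational database": Concept(
        "relational database", "database",
        {"de": "relationale Datenbank",
         "fr": "base de données relationnelle",
         "it": "base di dati relazionale"},
    ),
}


def build_reverse_index(entries):
    """Map (language, translated term) back to the core English term."""
    index = {}
    for concept in entries.values():
        for lang, word in concept.translations.items():
            index[(lang, word.lower())] = concept.term
    return index


reverse_index = build_reverse_index(thesaurus)
```

The reverse index is what maps translated terminology back to the base core thesaurus, so that a term entered in German, French or Italian can be resolved to the same concept.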

Of immediate relevance to our work on multilingual thesaurus building and the development of key-word based search tools are two current projects in the Libraries programme: TRANSLIB and CANAL/LS 3. Both projects aim at supporting multilingual access (TRANSLIB: Greek, Spanish, English; CANAL/LS: English, German, French, Spanish) to library on-line public access catalogues (OPACs) and both are building multilingual thesauri as tools for this purpose.

SAMOS aims at extending multilingual access to search not only catalogue data but also full-text document data in a specific domain. The multilingual thesaurus to be designed and built in SAMOS will thus differ in that it will refer to a selected sublanguage (computer science) and should be more exhaustive: it will be supplemented using reference corpus data and expanded to include a network of semantically and syntactically related data.

In this respect, we hope to establish links between the multilingual work in SAMOS and that of EuroWordNet LE 4003, which aims at building a multilingual wordnet database with semantic relations between words for English, Dutch, Spanish and Italian. The wordnets will be stored in a central lexical database system and word meanings linked to meanings in the Princeton WordNet 1.5. Major concepts and words in the individual wordnets will be merged to form a language-independent ontology (set of semantic relations between concepts). The aim is to create a flexible general-language multilingual search tool (not domain-specific terminology).

We will use the multilingual thesaurus in the study and development of multilingual search tools; we will also study the design of broader-coverage tools to complement the thesaurus-based tools. The aim is to cover the different requirements of users of our system: librarians using rich thesauri to specify precise queries where exact results are required, and researchers specifying vaguer queries where a high level of recall is desired.
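The two user requirements above correspond to the standard retrieval measures of precision and recall. The Python sketch below computes both for two invented result sets; the document identifiers and the function name precision_recall are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents are actually relevant to the query
relevant = {"d1", "d2", "d3", "d4"}

# A precise query retrieves few documents, all of them relevant
narrow = precision_recall({"d1", "d2"}, relevant)

# A vague query retrieves every relevant document plus some noise
broad = precision_recall(
    {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant
)
```

In this toy case the narrow query achieves full precision but finds only half of the relevant documents, while the broad query finds them all at the price of half its results being irrelevant.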

The methodologies studied should be extendible to other languages. There are three proposed pilot programs within SAMOS:
I. Corpus-based enrichment of multilingual thesauri and lexicon-based query procedures
II. Multilingual querying based on a graphical thesaurus browser
III. Automatic query expansion for multilingual information retrieval.
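To give an idea of what the third pilot program involves, the following Python sketch expands a query term through translation links and matches the expanded set against documents in several languages. It is a deliberately naive illustration under stated assumptions: the TRANSLATIONS table, the document identifiers and the whitespace-based matching are all invented here and do not represent the method to be developed in SAMOS.

```python
# Hypothetical translation links between term forms in four languages
TRANSLATIONS = {
    "network": {"en": "network", "de": "netzwerk", "fr": "réseau", "it": "rete"},
    "compiler": {"en": "compiler", "de": "compiler",
                 "fr": "compilateur", "it": "compilatore"},
}


def expand_query(term, source_lang):
    """Return all known translations of a term, or the term itself."""
    term = term.lower()
    for forms in TRANSLATIONS.values():
        if forms.get(source_lang) == term:
            return set(forms.values())
    return {term}


def search(term, source_lang, documents):
    """Return the ids of documents containing any variant of the term."""
    variants = expand_query(term, source_lang)
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if variants & words:
            hits.append(doc_id)
    return sorted(hits)


# Invented sample collection
DOCUMENTS = {
    "tr-en-01": "A compiler for a functional language",
    "tr-fr-02": "Un compilateur pour un langage fonctionnel",
    "tr-it-03": "Protocolli di rete per sistemi distribuiti",
}

results = search("compilatore", "it", DOCUMENTS)
```

Here an Italian query term retrieves the English and French reports on the same subject. A realistic tool would also need morphological analysis and sense disambiguation, since a single word form rarely maps to one translation.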

By the end of the first stage of the SAMOS project, we should have defined a proposal for multiple language representation and implemented a first prototype of a multilingual search tool.
