Dutch Language Still Difficult for Automatic Translation
by Henk Nieland
English is a relatively easy language to process automatically. However, several other languages, including Dutch and Latin, contain features that make this far more difficult. At CWI a way has been found to deal with some of these difficulties by applying techniques used for mathematically defined languages to the description of natural languages. The research is part of a project that exploits the cross-fertilization of linguistics and computer science.
"Vot flites aagh zair fghom Boston to Lozz Endzjelease tomogho eefning?" CWI researcher Annius Groenink put this question to a large SUN workstation while visiting a colleague in Cambridge some time ago. The workstation was running a program that translates spoken sentences about flight reservations into French. Ten seconds later it answered in good French: "Quels vols y a-t-il de Boston à Los Angeles demain soir ?" The same sentence, pronounced with a lot of "... euh ..."s in it and with an even stronger Dutch-French accent, did not lead the computer astray either. The colleague explained that only a real Frenchman can speak with an accent that drives the computer mad.
The example indicates how well translation jobs like this are handled nowadays. For specific applications with a restricted vocabulary, like flight reservations, the results are far better than is usually thought. However, automatic translation still shows some substantial gaps that have not always received enough attention from the technicians. One reason is their concentration on English, which in some mysterious way is more closely related to computer languages than Dutch is. The difficulties in Dutch (for example the word order) are even more pronounced in Latin: "For the computer, Latin is an extreme form of Dutch", says Groenink.
Groenink first studied how existing description techniques for natural languages must be changed in order to deal with more "complex" languages like Dutch. To make a language suitable for automation, its grammar must be described so precisely that a computer can parse sentences automatically on that basis. Such descriptions are now produced fairly regularly for English and French, with satisfactory results.
However, existing methods turn out to perform rather poorly when applied to Dutch. One cause is the difficulty of constructing an underlying "tree structure" for Dutch sentences. Unlike sentences in English and in computer languages, many Dutch sentences do not have such a tree structure (see Figure), a well-known and old problem in linguistics.
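The tree-structure problem can be made concrete with a small sketch. The Dutch clause used here is a standard textbook example and is an assumption of this illustration; it is not taken from the article or its figure. In "dat Jan Piet Marie zag helpen zwemmen" ("that Jan saw Piet help Marie swim"), each noun belongs with a verb further to the right, so the links between them cross. The function names and the arc encoding are hypothetical, chosen only for this sketch.

```python
from itertools import combinations

def arcs_cross(a, b):
    """True if two arcs over word positions interleave (i < k < j < l)."""
    (i, j), (k, l) = sorted((tuple(sorted(a)), tuple(sorted(b))))
    return i < k < j < l

def has_crossing(arcs):
    """True if any pair of arcs crosses, i.e. the links cannot be drawn
    as a tree over the sentence without crossing branches."""
    return any(arcs_cross(a, b) for a, b in combinations(arcs, 2))

# index:     0       1      2       3        4       5         6
dutch   = ["dat",  "Jan", "Piet", "Marie", "zag",  "helpen", "zwemmen"]
english = ["that", "Jan", "saw",  "Piet",  "help", "Marie",  "swim"]

# Each arc links a noun to the verb it is the subject of (assumed analysis).
dutch_arcs   = [(1, 4), (2, 5), (3, 6)]   # Jan-zag, Piet-helpen, Marie-zwemmen
english_arcs = [(1, 2), (3, 4), (5, 6)]   # Jan-saw, Piet-help, Marie-swim

print(has_crossing(dutch_arcs))    # links cross in the Dutch word order
print(has_crossing(english_arcs))  # links do not cross in the English order
```

In the English order every noun sits next to its verb, so each pair can form one contiguous branch; in the Dutch order the pairs interleave, which is what a description based on contiguous branches cannot express.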
The solution proposed by Groenink had been known for some time among researchers in formal language theory (strangely enough, outside The Netherlands), but had never been applied in this context. A part of a sentence that corresponds to a branching point in the tree is viewed not as a single string of words within the sentence, but as two or more separate parts. Such extensions of the existing methods provide, in principle, a description of Dutch that can serve as a basis for automation. The proposed method also turns out to work for a language like Latin, in which, according to common opinion, the word order is completely free. Groenink showed formally that this order is not nearly as free as was assumed, thus giving Latinists who already doubted this an argument to reinforce their ideas about the role of word order in various kinds of Latin texts.
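The idea of a sentence part consisting of two separate pieces can be sketched as follows. This is a minimal illustration under stated assumptions, not Groenink's actual formalism: the example clause, the pair representation and the names combine and linearize are all hypothetical. Each part is a pair (nouns, verbs), and combining two parts joins the pieces component by component instead of gluing whole strings together.

```python
from functools import reduce

def combine(outer, inner):
    """Combine two two-part constituents component-wise:
    nouns are joined with nouns, verbs with verbs."""
    return (outer[0] + inner[0], outer[1] + inner[1])

def linearize(cluster, complementizer="dat"):
    """Spell out a two-part constituent as a Dutch subordinate clause:
    complementizer, then all nouns, then all verbs."""
    nouns, verbs = cluster
    return " ".join([complementizer, *nouns, *verbs])

# One (noun, verb) pair per embedded clause (assumed example).
clauses = [(["Jan"], ["zag"]), (["Piet"], ["helpen"]), (["Marie"], ["zwemmen"])]

cluster = reduce(combine, clauses)
sentence = linearize(cluster)
print(sentence)  # dat Jan Piet Marie zag helpen zwemmen
```

Because the two pieces are kept apart until the very end, the noun of each clause and its verb stay part of the same unit in the description even though other words end up between them in the sentence.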
Word order also causes trouble in German, as can be seen in a translation program from dictionary publisher Langenscheidt (on the Web: http://www.gmsmuc.de/trans.html). Simple German sentences are translated fairly well into English or Spanish, but even a slight increase in complexity causes the program to produce a totally wrong word order, and a sentence like "Sah ich dich dem Mann gestern schwimmen helfen?" is not understood at all.
Coming up with a theoretical solution is one thing; showing that it also works in practice is quite another. Groenink paid particular attention to the links between these two aspects of his research and proved that his proposed extensions do not cause parsing time to grow exponentially. In addition, he made it plausible that more realistic implementations of his software, which include, for example, checks on case and on singular/plural agreement, also remain efficient. For more information, see Annius Groenink's home page at http://www.cwi.nl/~avg/
Please contact:
Annius Groenink - CWI
Tel: +31 20 592 4113
E-mail: Annius.Groenink@cwi.nl