Google Teaches Computers the Meaning of Words
by Rudi Cilibrasi and Paul Vitányi
Computers can learn the meaning of words with the help of the Google search engine. CWI researchers Rudi Cilibrasi and Paul Vitányi found a way to use the World Wide Web as a massive database from which to extract similarity of meaning between words. The approach is novel in its unrestricted problem domain, simplicity of implementation, and manifestly ontological underpinnings.
To make computers more intelligent, one would like to represent the meaning of words and phrases in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project (Cycorp) and the WordNet project (Princeton University) try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence and knowledge about the real world spontaneously emerge. This comes at the great cost of designing structures capable of manipulating knowledge, and of having knowledgeable human experts enter high-quality content into these structures. While these efforts are long-running and large-scale, the overall information entered is minute compared to what is available on the World Wide Web.
The rise of the World Wide Web has enticed millions of users to type in trillions of characters to create billions of web pages of, on average, low-quality content. The sheer mass of information available about almost every conceivable topic makes it likely that extremes cancel out and that the majority or average is meaningful, in a low-quality, approximate sense. We devised a general method to tap the amorphous, low-grade knowledge available for free on the World Wide Web, typed in by local users pursuing diverse personal objectives, yet globally constituting what is effectively the largest semantic electronic database in the world. Moreover, this database is available to all via search engines like Google.
We developed a method that uses only the name of an object and obtains knowledge about the semantic (meaning) similarity of objects by tapping and distilling the great mass of information available on the web. Intuitively, the approach is as follows. The meaning of a word can often be derived from the words in the context in which it occurs. Two related words are likely to give more hits - web pages on which they both occur - than two unrelated words. For instance, the combined terms 'head' and 'hat' will give more hits in a Google search than 'head' and 'banana'.
The Google search engine indexes around ten billion pages on the web today. Each such page can be viewed as a set of index terms. A search for a particular index term, say 'horse', returns a certain number of hits, say 46,700,000. The number of hits for the search term 'rider' is, say, 12,200,000. It is also possible to search for the pages where both 'horse' and 'rider' occur. This gives, say, 2,630,000 hits. Dividing by the total number of pages indexed by Google yields the 'Google probability' of the search terms. Using standard information theory, we take the negative logarithm of these probabilities to obtain code word lengths for the search terms - this 'Google code' is optimal in expected code word length, and hence we can view Google as a compressor.
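To make this step concrete, here is a minimal Python sketch; the index size of 8,058,044,651 pages is an assumption taken from the accompanying technical report rather than from this article:

    from math import log2

    N = 8_058_044_651           # assumed number of pages indexed by Google at the time
    hits_horse = 46_700_000
    hits_rider = 12_200_000
    hits_both  = 2_630_000

    # 'Google probability' of each search term: hits divided by the index size
    p_horse = hits_horse / N
    p_rider = hits_rider / N
    p_both  = hits_both / N

    # Code word lengths in bits: the negative base-2 logarithm of the probability
    l_horse = -log2(p_horse)    # about 7.4 bits
    l_rider = -log2(p_rider)    # about 9.4 bits
    l_both  = -log2(p_both)     # about 11.6 bits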
We then plug these code word lengths into our formula, derived in a decade of theoretical and experimental work by others and us on compression-based similarity distances, to obtain the normalized Google distance or NGD, which represents the relative distance in meaning between the words. The lower this normalized Google distance, the more closely the words are related. With the Google hit numbers above, we computed NGD(horse, rider) = 0.443. The same calculation when Google indexed only one-half of the current number of pages gave NGD(horse, rider) = 0.460. This is in line with our contention that the relative frequencies of web pages containing search terms give objective, absolute information about the true semantic distances between the search terms: that is, as the number of pages indexed by Google goes to infinity, the NGDs must stabilize and converge to definite limit values.
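The published NGD formula is short enough to state in a few lines of Python. In terms of the code lengths above, it divides the extra bits needed for the joint search beyond the code word of the more frequent term by the code word length of the less frequent term; the base of the logarithm cancels out. With the hit counts above and the assumed index size of 8,058,044,651 pages, it reproduces the quoted value of 0.443:

    from math import log

    def ngd(f_x, f_y, f_xy, n):
        # Normalized Google distance from the single hit counts f_x and f_y,
        # the joint hit count f_xy, and the total number of indexed pages n.
        lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
        return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

    print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # prints about 0.443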
By calculating NGDs, networks of semantic distances between words can be generated, from which a computer can internalize the meaning of these words through their semantic interrelations. In several tests, we demonstrated that the method can distinguish between colors and numbers, between prime numbers and composite numbers, and between 17th-century Dutch painters. It can also distinguish electrical terms from non-electrical terms, and perform similar feats for religious terms and emergency-incident terms. Furthermore, we conducted a massive experiment in understanding 100 randomly selected WordNet categories. There, we obtained a mean agreement of 87.5 percent in the accuracy of classifying unknown terms against the PhD-expert-entered semantic knowledge in the WordNet database. The method also exhibited the ability to do a simple automatic English-Spanish translation.
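In outline, such a network can be computed from pairwise NGD values. The sketch below is a hypothetical illustration rather than the authors' actual software: the hits function stands in for whatever search-engine interface is used to obtain hit counts, and n is the corresponding index size.

    from itertools import combinations
    from math import log

    def ngd(f_x, f_y, f_xy, n):
        # Normalized Google distance from single and joint hit counts.
        lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
        return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

    def ngd_matrix(terms, hits, n):
        # Pairwise semantic distances between the given terms.
        # hits(query) must return the number of pages matching the query,
        # for example by wrapping a web-search API; n is the index size.
        single = {t: hits(t) for t in terms}
        return {(x, y): ngd(single[x], single[y], hits(x + ' ' + y), n)
                for x, y in combinations(terms, 2)}

Such a distance matrix can then be fed to standard clustering or classification methods to produce groupings like those described above.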
This research was supported in part by the Netherlands ICES-KIS-3 program BRICKS, by NWO, the EU, and the ESF.
Links:
http://www.arxiv.org/abs/cs.CL/0412098
http://homepages.cwi.nl/~paulv/
http://www.newscientist.com/channel/info-tech/mg18524846.100
Please contact:
Paul Vitányi, CWI and University of Amsterdam, The Netherlands
Tel: +31 20 592 4124
E-mail: Paul.Vitanyi@cwi.nl
Rudi Cilibrasi, CWI, The Netherlands
Tel: +31 20 5924232
E-mail: Rudi.Cilibrasi@cwi.nl