The Google Similarity Distance

Rudi L. Cilibrasi and Paul M.B. Vitanyi

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 3, MARCH 2007, 370–383

If you’re going to remember one(?) thing:

Semantic cognition using algorithms appears to be possible.

Running code is available for download at http://www.complearn.org

Overview:

A way of using search-engine hit counts to compute a semantic distance between any two terms (maybe n?). It builds on information distance and Kolmogorov complexity to measure similarity. From the paper:

While the theory we propose is rather intricate, the resulting method is simple enough. We give an example: At the time of doing the experiment, a Google search for “horse”, returned 46,700,000 hits. The number of hits for the search term “rider” was 12,200,000. Searching for the pages where both “horse” and “rider” occur gave 2,630,000 hits, and Google indexed 8,058,044,651 web pages. Using these numbers in the main formula (III.3) we derive below, with N = 8,058,044,651, this yields a Normalized Google Distance between the terms “horse” and “rider” as follows:

NGD(horse, rider) ≈ 0.443.

In the sequel of the paper we argue that the NGD is a normed semantic distance between the terms in question, usually (but not always, see below) in between 0 (identical) and 1 (unrelated), in the cognitive space invoked by the usage of the terms on the world-wide-web as filtered by Google.
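To make the arithmetic concrete, here is a minimal Python sketch that reproduces the “horse”/“rider” number from hit counts. It assumes the paper's formula (III.3) has the form NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}), where f(x) and f(y) are the hit counts for each term, f(x, y) is the count for pages containing both, and N is the number of indexed pages. The function name `ngd`, its parameter names, and the handling of zero co-occurrence are my own choices for illustration; this is not the code distributed at complearn.org.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts (illustrative sketch).

    fx, fy -- hit counts for each term on its own
    fxy    -- hit count for pages containing both terms
    n      -- total number of pages indexed by the search engine
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term needs at least one hit")
    if fxy <= 0:
        # Assumption: terms that never co-occur are treated as infinitely far apart.
        return math.inf

    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Hit counts quoted in the paper's example.
distance = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(distance, 3))  # 0.443
```

The base of the logarithm does not matter, since it cancels in the ratio. Two terms that always appear together (f(x) = f(y) = f(x, y)) come out at distance 0.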

This really sounds like a usable model of cognition. For example:

For us, the Google semantics of a word or phrase consists of the set of web pages returned by the query concerned. Note that this can mean that terms with different meaning have the same semantics, and that opposites like “true” and “false” often have a similar semantics. Thus, we just discover associations between terms, suggesting a likely relationship.
