Semantic relatedness

From Wikipedia, the free encyclopedia

Main article: Semantic similarity

Computational measures of semantic relatedness include:

  • Latent semantic analysis (LSA): (+) vector-based, so multi-word terms can be measured by adding their term vectors; (-) non-incremental vocabulary, long pre-processing times
  • Pointwise mutual information (PMI): (+) large vocabulary, because it can use any search engine (such as Google); (-) cannot measure relatedness between whole sentences or documents
  • Generalized latent semantic analysis (GLSA): (+) vector-based, so multi-word terms can be measured by adding their term vectors; (-) non-incremental vocabulary, long pre-processing times
  • ICAN (incremental construction of an associative network): (+) incremental, network-based measure, good for spreading activation, accounts for second-order relatedness; (-) cannot measure relatedness between multi-word terms, long pre-processing times
  • Normalized Google distance (NGD): (+) large vocabulary, because it can use any search engine (such as Google); (-) cannot measure relatedness between whole sentences or documents
  • WordNet: (+) constructed by humans; (-) constructed by humans rather than learned automatically, cannot measure relatedness between multi-word terms, non-incremental vocabulary
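
The vector-addition idea noted for the vector-based measures in the list above can be illustrated with a minimal sketch. The three-dimensional vectors, the TERM_VECTORS table, and the function names are all hypothetical; a real LSA space would be derived from a corpus and have far more dimensions. The sketch only shows how a multi-word term gets a vector (by summing its word vectors) and how two such vectors are compared by cosine.

```python
import math

# Hypothetical toy term vectors, invented for illustration only.
# A real vector space would be learned from a large corpus.
TERM_VECTORS = {
    "search": [0.9, 0.1, 0.0],
    "engine": [0.7, 0.3, 0.1],
    "web":    [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def phrase_vector(phrase):
    """Represent a multi-word term as the sum of its word vectors."""
    dims = len(next(iter(TERM_VECTORS.values())))
    total = [0.0] * dims
    for word in phrase.lower().split():
        for i, value in enumerate(TERM_VECTORS[word]):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relatedness(term_a, term_b):
    """Relatedness of two (possibly multi-word) terms as vector cosine."""
    return cosine(phrase_vector(term_a), phrase_vector(term_b))
```

With these made-up vectors, relatedness("search engine", "web") comes out much higher than relatedness("search engine", "banana"). A word missing from the table raises a KeyError, which mirrors the fixed-vocabulary limitation listed above.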


Google distance is a measure of semantic interrelatedness derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.

Specifically, the normalized Google distance between two search terms x and y is

\operatorname{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)} {\log M - \min\{\log f(x), \log f(y)\}}

where M is the total number of web pages searched by Google; f(x) and f(y) are the numbers of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.

If the two search terms x and y never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
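
As a rough illustration, the formula can be evaluated directly from hit counts. The sketch below assumes the counts have already been obtained somehow; it does not query any search engine. The function name ngd and its parameter names are illustrative choices, not part of any library.

```python
import math


def ngd(fx, fy, fxy, m):
    """Normalized Google distance from raw hit counts.

    fx, fy -- hits for each term alone (assumed positive)
    fxy    -- hits for pages containing both terms
    m      -- total number of pages indexed (assumed > min(fx, fy))
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if m <= min(fx, fy):
        raise ValueError("m must exceed the smaller hit count")
    if fxy == 0:
        # The terms never co-occur: log f(x, y) diverges, so the distance is infinite.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(m) - min(log_fx, log_fy)
    return numerator / denominator
```

With made-up counts, ngd(1000, 1000, 1000, 10**6) gives zero (the terms always occur together), and ngd(1000, 500, 0, 10**6) gives infinity (they never co-occur), matching the two limiting cases described above.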

References

  • Cilibrasi, R. & Vitanyi, P.M.B. (2006). Similarity of objects and the meaning of words. Proc. 3rd Conf. Theory and Applications of Models of Computation (TAMC), J.-Y. Cai, S. B. Cooper, and A. Li (Eds.), Lecture Notes in Computer Science, Vol. 3959, Springer-Verlag, Berlin.
  • Dumais, S. (2003). Data-driven approaches to information access. Cognitive Science, 27(3), 491-524.
  • Juvina, I., van Oostendorp, H., Karbor, P., & Pauw, B. (2005). Towards modeling contextual information in web navigation. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1078-1083). Austin, Tx: The Cognitive Science Society, Inc.
  • Kaur, I. & Hornof, A.J. (2005). A Comparison of LSA, WordNet and PMI for Predicting User Click Behavior. Proceedings of the Conference on Human Factors in Computing, CHI 2005 (pp. 51-60).
  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
  • Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. In B. G. Bara & L. Barsalou & M. Bucciarelli (Eds.), 27th Annual Meeting of the Cognitive Science Society, CogSci2005 (pp. 1254-1259). Austin, Tx: The Cognitive Science Society, Inc.
  • Lemaire, B., & Denhière, G. (2004). Incremental construction of an associative network from a corpus. In K. D. Forbus & D. Gentner & T. Regier (Eds.), 26th Annual Meeting of the Cognitive Science Society, CogSci2004. Hillsdale, NJ: Lawrence Erlbaum Publisher.
  • Pirolli, P. (2005). Rational analyses of information foraging on the Web. Cognitive Science, 29(3), 343-373.
  • Pirolli, P., & Fu, W.-T. (2003). SNIF-ACT: A model of information foraging on the World Wide Web. Lecture Notes in Computer Science, 2702, 45-54.
  • Turney, P. (2001). Mining the Web for Synonyms: PMI versus LSA on TOEFL. In L. De Raedt & P. Flach (Eds.), Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001) (pp. 491-502). Freiburg, Germany.
  • Veksler, V.D. & Gray, W.D. (2006). Test Case Selection for Evaluating Measures of Semantic Distance. Proceedings of the 28th Annual Meeting of the Cognitive Science Society, CogSci2006.
