Lindsey, R., Veksler, V. D., Grintsvayg, A. & Gray, W. D. (2007)

Lindsey, R., Veksler, V. D., Grintsvayg, A., & Gray, W. D. (2007). Be wary of what your computer reads: The effects of corpus selection on measuring semantic relatedness. Proceedings of the 8th International Conference on Cognitive Modeling. Ann Arbor, MI.

Be wary of what your computer reads: The effects of corpus selection on measuring semantic relatedness

Measures of Semantic Relatedness (MSRs) provide models of human semantic associations and, as such, have been applied to predict human text comprehension (Lemaire, Denhière, Bellissens, & Jhean-Larose, 2006). In addition, MSRs form key components in more integrated cognitive models, such as models that perform information search on the World Wide Web (WWW) (Pirolli, 2005). However, the effectiveness of an MSR depends on the algorithm it uses as well as the text corpus on which it is trained. In this paper, we examine the impact of corpus selection on the performance of two popular MSRs, Pointwise Mutual Information and Normalised Google Distance. We tested these measures with corpora derived from the WWW, books, news articles, emails, web forums, and an encyclopedia. Results indicate that for the tested MSRs, the traditionally employed books and WWW-based corpora are less than optimal, and that using a corpus based on the New York Times news articles best predicts human behavior.
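The abstract names the two measures without giving their formulas. As a rough illustration only, the sketch below computes both from document counts using their commonly cited textbook definitions. The function names, the parameter names (f_x, f_y, f_xy, n), and the choice of document frequency as the counting unit are assumptions made for this example; the paper's exact counting scheme may differ.

```python
import math


def pmi(f_x, f_y, f_xy, n):
    """Pointwise Mutual Information (standard definition, assumed here).

    f_x, f_y: number of documents containing each term
    f_xy:     number of documents containing both terms
    n:        total number of documents in the corpus
    Higher values mean the terms co-occur more often than chance predicts.
    """
    if min(f_x, f_y, f_xy) <= 0:
        return float("-inf")  # no evidence of co-occurrence
    # log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated as f / n
    return math.log2((f_xy * n) / (f_x * f_y))


def ngd(f_x, f_y, f_xy, n):
    """Normalised Google Distance (standard definition, assumed here).

    Same arguments as pmi(). Lower values mean the terms are more related.
    Assumes n is larger than both term counts.
    """
    if n <= max(f_x, f_y):
        raise ValueError("n must exceed both term counts")
    if min(f_x, f_y, f_xy) <= 0:
        return float("inf")  # terms never co-occur
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(n) - min(log_x, log_y))
```

Because both functions depend only on counts, running them against counts drawn from different corpora (for example, a news archive versus web pages) can yield different relatedness scores for the same word pair, which is the effect the paper investigates.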

Please note that the copyright of this article is owned by Elsevier.