SEMANTIC SIMILARITY AND RELATEDNESS

About
  • Semantic similarity and relatedness measures quantify the degree to which two concepts are similar (e.g. liver-organ) or related (e.g. headache-aspirin). The automated discovery of groups of semantically similar or related concepts and terms is critical to improving the retrieval and clustering of biomedical and clinical documents, and to the development of biomedical terminologies and ontologies.
  • Relatedness measures quantify the degree to which two terms are associated with each other in any way (e.g. scissors-paper). Similarity is a special case of relatedness: it quantifies how alike two concepts are based on their locations within an is-a hierarchy (e.g. car-vehicle), so the score assigned to a term pair reflects how closely the terms are connected through is-a relations. For example, "Lung Cancer" is a type of "Disease", so the pair receives a high similarity score. "Lung Cancer" and "Coughing" do not receive a high similarity score, although the two are clearly related.
  • In this work, we are exploring taxonomy-based metrics, corpus-based metrics, and hybrids of the two.
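
To make the is-a example above concrete, here is a minimal sketch of a taxonomy-based (path) similarity score in Python. The tiny hierarchy, the names PARENT, ancestors, and path_similarity, and the choice of scoring as 1 divided by the number of nodes on the shortest is-a path are all assumptions made for illustration. None of this is taken from the UMLS or from the project's software.

```python
# Toy is-a hierarchy (child -> parent). Hypothetical data, not from the UMLS.
PARENT = {
    "Disease": "Entity",
    "Lung Cancer": "Disease",
    "Symptom": "Entity",
    "Coughing": "Symptom",
}


def ancestors(concept):
    """Return the is-a chain from the concept up to the root, concept first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a, b):
    """Score a pair as 1 / (number of nodes on the shortest is-a path)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    steps_from_b = {c: i for i, c in enumerate(chain_b)}
    for steps_from_a, c in enumerate(chain_a):
        if c in steps_from_b:
            # c is the lowest common ancestor: count the edges on both
            # sides, then add 1 to turn the edge count into a node count.
            return 1.0 / (steps_from_a + steps_from_b[c] + 1)
    raise ValueError("concepts share no common ancestor")


print(path_similarity("Lung Cancer", "Disease"))
print(path_similarity("Lung Cancer", "Coughing"))
```

In this toy hierarchy the pair "Lung Cancer"/"Disease" is joined by a single is-a link and scores higher than "Lung Cancer"/"Coughing", whose only connection runs through the root. A relatedness measure would be needed to capture the association between the latter two.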

Software
  • UMLS-Similarity -- a suite of Perl modules that implement several semantic similarity measures. The measures use the UMLS-Interface module to access the UMLS and generate similarity scores between concepts. Currently, this package includes programs that implement the similarity measures described by Leacock & Chodorow (1998), Wu & Palmer (1994), Nguyen & Al-Mubaid (2006), Rada et al. (1989), Jiang & Conrath (1997), Resnik (1995) and Lin (1998), and the relatedness measures proposed by Banerjee & Pedersen (2002) and Patwardhan (2003).
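
For orientation, several of the similarity measures cited above can be written as short formulas. The Python sketch below uses their commonly stated textbook forms. The function names and argument conventions are assumptions made for illustration, and the package's own Perl implementation may differ in details such as how depth and path length are counted.

```python
import math


def wu_palmer(depth_a, depth_b, depth_lcs):
    """Wu & Palmer: depth of the lowest common subsumer (LCS), scaled by
    the depths of the two concepts. Assumes the root has depth 1."""
    return 2.0 * depth_lcs / (depth_a + depth_b)


def leacock_chodorow(path_nodes, max_depth):
    """Leacock & Chodorow: negative log of the path length (counted in
    nodes) scaled by twice the maximum depth of the taxonomy."""
    return -math.log(path_nodes / (2.0 * max_depth))


def information_content(probability):
    """Information content of a concept with the given corpus probability."""
    return -math.log(probability)


def resnik(ic_lcs):
    """Resnik: the information content of the LCS."""
    return ic_lcs


def lin(ic_a, ic_b, ic_lcs):
    """Lin: information content of the LCS scaled by that of the concepts."""
    return 2.0 * ic_lcs / (ic_a + ic_b)


def jiang_conrath_distance(ic_a, ic_b, ic_lcs):
    """Jiang & Conrath, in its distance form (smaller means more similar)."""
    return ic_a + ic_b - 2.0 * ic_lcs
```

The first two functions depend only on the shape of the is-a hierarchy, while the last three also need concept probabilities estimated from a corpus.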


Publications