This page lists the data sets that are or have been used in our research. The data is distributed without any warranty. For any questions regarding the content of this page, please contact Bridget McInnes. [see also the research page for related information].

Abbrev Dataset
  • The Abbrev dataset was made available by Stevenson et al. (2009). It consists of the acronyms and long forms from Medline abstracts that were initially presented by Liu et al. (2001). The dataset is automatically re-created by identifying each acronym's long forms in Medline and replacing them with the acronym. The dataset consists of three subsets containing 100, 200 and 300 instances respectively [download].
    • M. Stevenson, Y. Guo, A. Al Amri, and R. Gaizauskas. 2009. Disambiguation of biomedical abbreviations. In Proceedings of the ACL BioNLP Workshop, pages 71–79.
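
The re-creation step described for the Abbrev dataset (find a long form in the text, substitute its acronym, and keep the long form as the sense label) can be sketched roughly as follows. This is only an illustration under assumptions, not the authors' code: the function name `make_instance`, the sample sentence, and the long form/acronym pair are hypothetical.

```python
import re

def make_instance(text, long_form, acronym):
    """Replace each whole-phrase occurrence of long_form with acronym.

    Matching is case-insensitive. Returns (rewritten_text, replacement_count);
    the long form itself serves as the sense label for the new instance.
    """
    pattern = re.compile(r"\b" + re.escape(long_form) + r"\b", re.IGNORECASE)
    return pattern.subn(acronym, text)

# Hypothetical example sentence, not taken from the dataset.
sentence = "Patients with blood pressure above the threshold were excluded."
new_text, count = make_instance(sentence, "blood pressure", "BP")
print(new_text)
```

A real pipeline would also need to handle plural and hyphenated variants of the long form, which this sketch ignores.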

Conflate Dataset

MayoSRS Reference Standard
  • MayoSRS, developed by Pakhomov, et al., consists of 101 clinical term pairs whose relatedness was determined by nine medical coders and three physicians from the Mayo Clinic. The relatedness of each term pair was assessed on a four-point scale: (4.0) practically synonymous, (3.0) related, (2.0) marginally related and (1.0) unrelated.
    • T. Pedersen, S. Pakhomov, S. Patwardhan, C. Chute, Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics 40 (3) (2007) 288–299.

MiniMayoSRS Semantic Similarity Reference Standard
  • MiniMayoSRS is a subset of the MayoSRS and consists of 30 term pairs on which a higher inter-annotator agreement was achieved. The average correlation between physicians is 0.68. The average correlation between medical coders is 0.78.

MSH-WSD Dataset
  • The data set contains 203 ambiguous terms and acronyms from the 2010 Medline baseline. Each instance of a term was automatically assigned a CUI from the 2009AB version of the UMLS by exploiting the fact that each citation in Medline is manually indexed with Medical Subject Headings, each of which has an associated CUI. On average, each target word has approximately 187 instances, 2.08 possible senses, and a 54.5% majority sense. Out of 203 target words, 106 are terms, 88 are acronyms, and 9 have possible senses that are both acronyms and terms [download].
    • Exploiting MeSH Indexing in MEDLINE to Generate a Data set For Word Sense Disambiguation. Antonio Jimeno-Yepes, Bridget T. McInnes and Alan R. Aronson. BMC Bioinformatics. 2011 Jun 2;12(1):223.
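
The per-word figures quoted for MSH-WSD (instance count, number of senses, majority-sense share) are simple to compute once instances are available as (target word, CUI) pairs. The sketch below assumes that input shape; the function name `sense_stats` and the placeholder CUIs are made up for illustration and do not reflect the distributed file format.

```python
from collections import Counter

def sense_stats(instances):
    """Compute per-word statistics from (target_word, cui) pairs.

    Returns a dict mapping each word to its instance count, number of
    distinct senses, and the fraction of instances in the majority sense.
    """
    by_word = {}
    for word, cui in instances:
        by_word.setdefault(word, Counter())[cui] += 1
    stats = {}
    for word, counts in by_word.items():
        total = sum(counts.values())
        stats[word] = {
            "instances": total,
            "senses": len(counts),
            "majority": counts.most_common(1)[0][1] / total,
        }
    return stats

# Toy data with placeholder CUIs (not real UMLS identifiers for these senses).
toy = [("cold", "C0000001"), ("cold", "C0000001"),
       ("cold", "C0000001"), ("cold", "C0000002")]
stats = sense_stats(toy)
print(stats["cold"])
```

The majority-sense fraction is also the accuracy of a baseline that always predicts the most frequent sense, which makes it a useful lower bound when evaluating a disambiguation system.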

NLM-WSD Dataset
  • The National Library of Medicine’s Word Sense Disambiguation (NLM-WSD) dataset contains 100 randomly selected instances of 50 frequent and highly ambiguous words from 1998 MEDLINE abstracts. Each instance of a target word was manually disambiguated by 11 human evaluators who assigned the word a CUI or “None” if none of the CUIs described the concept [download].
    • M. Weeber, J. Mork, A. Aronson. Developing a test collection for biomedical word sense disambiguation. In Proceedings of the American Medical Informatics Association (AMIA) Symposium. November 2001; p. 746-50.

Review Dataset

UMNSRS Reference Standard
  • UMNSRS, developed by Pakhomov, et al., consists of 725 clinical term pairs annotated for semantic similarity and relatedness. Each term pair was rated on a continuous scale: the resident touched a bar on a touch-sensitive computer screen to indicate the degree of similarity or relatedness. The Intraclass Correlation Coefficient (ICC) was 0.47 for the reference standard tagged for similarity and 0.50 for the one tagged for relatedness. Therefore, as suggested by Pakhomov and colleagues, the subset below consists of 401 pairs for the similarity set and 430 pairs for the relatedness set, each of which has an ICC of 0.73.