VCU NLP Lab: Data

CONTENT

ABBREV DATASET
CONFLATE DATASET
MAYOSRS
MINIMAYOSRS
MSH-WSD DATASET
NLM-WSD DATASET
REVIEW DATASET
UMNSRS

DATA

This page contains the data sets that are or were used in our research. The data are distributed without any warranty. For questions regarding the content of this page, please contact Bridget McInnes. See also the research page for related information.

Abbrev Dataset

The Abbrev dataset is made available by Stevenson, et al. (2009). It consists of the acronyms and long forms from Medline abstracts that were initially presented by Liu, et al. (2001). The dataset is automatically re-created by identifying each acronym's long forms in Medline and replacing each long form with its acronym. The dataset consists of three subsets containing 100, 200 and 300 instances respectively [download].

M. Stevenson, Y. Guo, A. Al Amri, and R. Gaizauskas. 2009. Disambiguation of biomedical abbreviations. In Proceedings of the ACL BioNLP Workshop, pages 71–79.
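The re-creation step described above (find a long form in an abstract, substitute the acronym, and keep the long form as the sense label) can be sketched as follows. This is a minimal illustration and not the procedure used by Stevenson, et al.; the function name `make_instance`, the dictionary layout, and the example sentence are hypothetical.

```python
import re

def make_instance(text, long_form, acronym):
    """Replace a long form with its acronym and keep the long form as the label.

    Hypothetical helper: matching is case-insensitive and whole-word only,
    which is an assumption, not something the dataset description specifies.
    """
    pattern = re.compile(r"\b" + re.escape(long_form) + r"\b", re.IGNORECASE)
    replaced, count = pattern.subn(acronym, text)
    if count == 0:
        # The long form does not occur in this text, so no instance is created.
        return None
    return {"text": replaced, "target": acronym, "sense": long_form}

# Made-up sentence for illustration; it is not taken from the dataset.
example = make_instance(
    "The magnetic resonance imaging scan was clear.",
    "magnetic resonance imaging",
    "MRI",
)
print(example)
```

Because the label comes from the text that was replaced, no manual annotation is needed, which is what makes the automatic re-creation possible.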

Conflate Dataset

These are the nine WSD conflate datasets.

Bridget T. McInnes. An Unsupervised Vector Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline. In Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL-SRW) 2008.

MayoSRS Reference Standard

MayoSRS, developed by Pakhomov, et al., consists of 101 clinical term pairs whose relatedness was determined by nine medical coders and three physicians from the Mayo Clinic. The relatedness of each term pair was assessed on a four-point scale: (4.0) practically synonymous, (3.0) related, (2.0) marginally related and (1.0) unrelated.

T. Pedersen, S. Pakhomov, S. Patwardhan, C. Chute, Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics 40 (3) (2007) 288–299.

Term Pairs
SNOMED CT mappings:

MiniMayoSRS Semantic Similarity Reference Standard

MiniMayoSRS is a subset of the MayoSRS and consists of 30 term pairs on which a higher inter-annotator agreement was achieved. The average correlation between physicians is 0.68. The average correlation between medical coders is 0.78.

T. Pedersen, S. Pakhomov, S. Patwardhan, C. Chute, Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics 40 (3) (2007) 288–299.

Term Pairs
SNOMED CT mappings:

MSH mappings:
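The inter-annotator correlations reported for MiniMayoSRS compare the scores that different raters gave to the same term pairs. The sketch below shows one way such a correlation could be computed between two raters. It uses Spearman's rank correlation, which is an assumption: this page does not state which correlation coefficient was used. The function names and the ratings are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up ratings on the four-point scale for five term pairs.
rater_a = [4.0, 3.0, 2.0, 1.0, 3.0]
rater_b = [4.0, 2.0, 2.0, 1.0, 3.0]
print(spearman(rater_a, rater_b))
```

The same function can compare the scores of an automatic similarity measure against the reference standard, since both are just one score per term pair.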

MSH-WSD Dataset

The data set contains 203 ambiguous terms and acronyms from the 2010 Medline baseline. Each instance of a term was automatically assigned a CUI from the 2009AB version of the UMLS by exploiting the fact that each citation in Medline is manually indexed with Medical Subject Headings, each of which has an associated CUI. On average, each target word has approximately 187 instances, 2.08 possible senses, and a 54.5% majority sense. Out of 203 target words, 106 are terms, 88 are acronyms, and 9 have possible senses that are both acronyms and terms [download].

Exploiting MeSH Indexing in MEDLINE to Generate a Data set For Word Sense Disambiguation. Antonio Jimeno-Yepes, Bridget T. McInnes and Alan R. Aronson. BMC Bioinformatics. 2011 Jun 2;12(1):223.
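The majority sense figure quoted above is the share of a target word's instances that carry its most frequent sense, which is the accuracy a system would reach by always guessing that sense. A minimal sketch of the calculation follows, assuming the per-word values are averaged with equal weight per target word (the page does not say how the 54.5% was aggregated). The function names, sense labels, and counts are made up for illustration.

```python
from collections import Counter

def majority_sense_baseline(labels):
    """Return the most frequent sense and its share of the instances."""
    counts = Counter(labels)
    sense, freq = counts.most_common(1)[0]
    return sense, freq / len(labels)

def average_baseline(labels_by_word):
    """Average the majority sense share over target words (equal weight per word)."""
    shares = [majority_sense_baseline(labels)[1] for labels in labels_by_word.values()]
    return sum(shares) / len(shares)

# Placeholder sense labels; real instances would carry UMLS CUIs.
labels_by_word = {
    "cold": ["M1", "M1", "M1", "M2"],
    "BAT": ["M1", "M2"],
}
print(majority_sense_baseline(labels_by_word["cold"]))
print(average_baseline(labels_by_word))
```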

NLM-WSD Dataset

The National Library of Medicine’s Word Sense Disambiguation (NLM-WSD) dataset contains 100 randomly selected instances of 50 frequent and highly ambiguous words from 1998 MEDLINE abstracts. Each instance of a target word was manually disambiguated by 11 human evaluators who assigned the word a CUI or “None” if none of the CUIs described the concept [download].

M. Weeber, J. Mork, A. Aronson. Developing a test collection for biomedical word sense disambiguation. In Proceedings of the American Medical Informatics Association (AMIA) Symposium. November 2001; p. 746-50.

Review Dataset

These are four datasets containing product reviews:

UMNSRS Reference Standard

UMNSRS, developed by Pakhomov, et al., consists of 725 clinical term pairs whose semantic similarity and relatedness were assessed by medical residents. Each term pair was annotated on a continuous scale by having the resident touch a bar on a touch-sensitive computer screen to indicate the degree of similarity or relatedness. The Intraclass Correlation Coefficient (ICC) for the reference standard was 0.47 for similarity and 0.50 for relatedness. Therefore, as suggested by Pakhomov and colleagues, the subset below consists of 401 pairs for the similarity set and 430 pairs for the relatedness set, each of which has an ICC of 0.73.

S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, G. Melton, Semantic similarity and relatedness between clinical terms: An experimental study, in: Proceedings of the American Medical Informatics Association (AMIA) Symposium, Washington, DC, 2010, pp. 572–576.

Subset tagged for similarity:
