Word Sense Disambiguation
About
Word Sense Disambiguation (WSD) is the task of automatically identifying the intended sense (or concept) of an ambiguous word based on the context in which the word is used. In our work, the set of possible meanings for a word is defined by the Concept Unique Identifiers (CUIs) associated with that term in the Unified Medical Language System (UMLS). Thus, when performing WSD of biomedical terms, our more specific goal is to assign a term one of its possible CUIs based on its surrounding context. For example, the term "cold" could refer to the temperature (C0009264) or the common cold (C0009443), depending on the context in which it occurs.
Automatically identifying the intended concept of ambiguous words improves the performance of clinical and biomedical applications such as medical coding and indexing for quality assessment, cohort discovery, and other secondary uses of data. These capabilities are becoming essential because of the growing amount of information available to researchers, the transition of health care documentation towards electronic health records, and the push for quality and efficiency in health care.
In this work, we are exploring three types of methods: supervised, unsupervised, and knowledge-based. Supervised methods use machine learning algorithms (e.g., SVMs, Naive Bayes) to learn from manually tagged training data; unsupervised methods rely on the distributional characteristics of the terms in large unannotated corpora; and knowledge-based methods use information from an external knowledge source such as the UMLS.
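As a rough illustration of the knowledge-based idea, the sketch below picks a CUI for "cold" by counting how many words the surrounding context shares with a short description of each sense. It is a toy example under stated assumptions, not the method implemented in our software: the descriptions in `SENSE_GLOSSES` are hand-written for this example rather than taken from the UMLS, and the names `tokenize` and `disambiguate` are hypothetical.

```python
import re

# Hypothetical, hand-written glosses for illustration only; they are NOT
# taken from the UMLS. The keys are the two CUIs for "cold" named above.
SENSE_GLOSSES = {
    "C0009264": "low temperature weather freezing ice exposure chill",
    "C0009443": "viral infection nose throat cough sneezing congestion",
}

# A small, assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "was", "is", "for", "had"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def disambiguate(context, glosses=SENSE_GLOSSES):
    """Return (best_cui, scores), scoring each sense by gloss/context word overlap."""
    context_tokens = tokenize(context)
    scores = {
        cui: len(context_tokens & tokenize(gloss))
        for cui, gloss in glosses.items()
    }
    # On a tie, max() keeps the sense that appears first in the dictionary.
    best_cui = max(scores, key=scores.get)
    return best_cui, scores


if __name__ == "__main__":
    for sentence in (
        "The patient presented with a cold, cough and nasal congestion.",
        "Prolonged exposure to cold weather caused freezing of the fingers.",
    ):
        print(sentence, "->", disambiguate(sentence)[0])
```

Simple word overlap is used here only because it is easy to follow; a supervised method would instead learn from sense-tagged examples, and an unsupervised method would rely on distributional statistics from unannotated text.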
Software
- CuiTools -- A freely available suite of Perl programs for supervised, unsupervised and knowledge-based word sense disambiguation.
- UMLS-SenseRelate -- A freely available suite of Perl programs for exploring the use of semantic similarity and relatedness between UMLS concepts to disambiguate terms in biomedical text.
Publications
- Local Ensemble Learning from Imbalanced and Noisy Data for Word Sense Disambiguation. Bartosz Krawczyk and Bridget T. McInnes. Pattern Recognition, 2018.
- Evaluating Feature Extraction Methods for Knowledge-based Biomedical Word Sense Disambiguation.
Sam Henry, Clint Cuffy and Bridget McInnes.
In Proceedings of the 16th Workshop on Biomedical Natural Language Processing (BioNLP) at the Association for Computational Linguistics, 2017.
- Challenges and Practical Approaches with Word Sense Disambiguation of Acronyms and
Abbreviations in the Clinical Domain. Sungrim Moon, Bridget T. McInnes, and Genevieve B. Melton. Healthcare Informatics
Research, 2015, 21(1), 35-42.
- Determining the Difficulty of Word Sense Disambiguation. Bridget T. McInnes and Mark Stevenson.
Journal of Biomedical Informatics. 2014 Feb; 47:83-90. (data: MSH-WSD NLM-WSD Abbrev; code: UMLS-Similarity)
- Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity.
Bridget T. McInnes, Ted Pedersen, Ying Liu, Serguei Pakhomov, and Genevieve B. Melton. Appears in the Proceedings of the
Annual Symposium of the American Medical Informatics Association (AMIA). Oct. 2011, Washington DC. (data: MSHWSD; code:
UMLS-SenseRelate; presentation: pptx, pdf)
- Exploiting MeSH Indexing in MEDLINE to Generate a Data Set for Word Sense
Disambiguation. Antonio Jimeno-Yepes, Bridget T. McInnes and Alan R. Aronson. BMC Bioinformatics. 2011 Jun 2;12(1):223. (data: MSHWSD)
- Using Second-order Vectors in a Knowledge-based Method for Acronym Disambiguation.
Bridget T. McInnes, Ted Pedersen, Ying Liu, Serguei Pakhomov, and Genevieve B. Melton. Appears in the Proceedings of the
Fifteenth Conference on Computational Natural Language Learning (CoNLL 2011), June 23-24, 2011, pp. 145-153, Portland,
Oregon. (data: Abbrev; code: CuiTools)
- Collocation Analysis for UMLS Knowledge-based Word Sense Disambiguation.
Antonio Jimeno-Yepes, Bridget T. McInnes and Alan R. Aronson. BMC Bioinformatics. 2011, 12(Suppl 3):S4.
- Automated Identification of Synonyms in Biomedical Acronym Sense Inventories.
Genevieve B. Melton, SungRim Moon, Bridget T. McInnes, and Serguei Pakhomov. Appears in the Proceedings of the Louhi Workshop at
the North American Chapter of the Association for Computational Linguistics (NAACL). June 1- 16, 2010, Los Angeles, CA.
- Supervised and Knowledge-based Methods for Disambiguating Terms in Biomedical Text using the UMLS and MetaMap.
Bridget T. McInnes, Doctor of Philosophy Dissertation, Department of Computer Science, University of Minnesota, Twin Cities, September, 2009. (code: CuiTools; data: NLM-WSD)
- Using CuiTools to Identify Obesity and its Co-morbidities in Discharge Summaries.
Bridget T. McInnes. Appears in the Proceedings of the Second i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data, Nov 7-8,
2008, Washington, DC. (code: CuiTools)
- An Unsupervised Vector Approach to Biomedical Term Disambiguation: Integrating UMLS and Medline.
Bridget T. McInnes. Appears in the Proceedings of the Association for Computational Linguistics Student Research Workshop (ACL-SRW) 2008.
(code: CuiTools; data: Conflate Data; poster: ppt)
- Using Domain Specific Information for Word Sense Disambiguation. Bridget T. McInnes, Ted Pedersen and John Carlis. Grace Hopper Conference for
Women in Computing, October 2007, Orlando, Florida. (code: CuiTools; poster: pdf)
- Using UMLS Concept Unique Identifiers (CUIs) for Word Sense Disambiguation in the Biomedical
Domain. Bridget T. McInnes, Ted Pedersen, and John Carlis. Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association
(AMIA), pages 533-37, Nov. 2007, Chicago, IL. (code: CuiTools; data: NLM-WSD)