Word Sense Disambiguation

About
  • Word Sense Disambiguation (WSD) is the task of automatically identifying the intended sense (or concept) of an ambiguous word based on the context in which the word is used. In our work, the set of possible meanings for a word is defined by the Concept Unique Identifiers (CUIs) associated with that term in the Unified Medical Language System (UMLS). Thus, when performing WSD of biomedical terms, our more specific goal is to assign a term one of its possible CUIs based on its surrounding context. For example, the term "cold" could refer to the temperature (C0009264) or the common cold (C0009443), depending on the context in which it occurs.

    Automatically identifying the intended concept of ambiguous words improves the performance of clinical and biomedical applications such as medical coding and indexing for quality assessment, cohort discovery, and other secondary uses of data. These capabilities are becoming essential because of the growing amount of information available to researchers, the transition of health care documentation towards electronic health records, and the push for quality and efficiency in health care.

    In this work, we are exploring three types of methods: supervised, unsupervised, and knowledge-based. Supervised methods use machine learning algorithms (e.g., SVMs, Naive Bayes) to learn from manually tagged training data; unsupervised methods rely on the distributional characteristics of terms in large unannotated corpora; and knowledge-based methods use information from an external knowledge source.
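
    To make the knowledge-based idea concrete, here is a minimal Python sketch of a simplified Lesk-style overlap method: each candidate CUI is scored by how many context words also appear in a short description of that concept. This is an illustration only, not the algorithm implemented in CuiTools or UMLS-SenseRelate. The two CUIs for "cold" come from the example above; the description strings, the stopword list, and the names SENSE_GLOSSES, tokenize, and disambiguate are assumptions made up for this sketch and are not taken from the UMLS.

```python
import re

# Hypothetical sense inventory. The CUIs are the two senses of "cold" named in
# the text; the description strings are made-up stand-ins, NOT UMLS definitions.
SENSE_GLOSSES = {
    "cold": {
        "C0009264": "low temperature weather exposure freezing environment water air",
        "C0009443": "viral infection upper respiratory tract cough sneezing runny nose sore throat",
    }
}

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "with", "was", "were", "at", "for", "is"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic, non-stopword tokens."""
    return {tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS}


def disambiguate(term, context, inventory=SENSE_GLOSSES):
    """Return the candidate CUI whose description shares the most words with the context.

    Returns None when the term is unknown, when no candidate overlaps with the
    context, or when the best score is tied between candidates.
    """
    candidates = inventory.get(term)
    if not candidates:
        return None

    # The ambiguous term itself carries no disambiguating information.
    context_tokens = tokenize(context) - {term}

    scores = {
        cui: len(context_tokens & tokenize(gloss))
        for cui, gloss in candidates.items()
    }
    best_score = max(scores.values())
    winners = [cui for cui, score in scores.items() if score == best_score]

    if best_score == 0 or len(winners) > 1:
        return None
    return winners[0]


if __name__ == "__main__":
    print(disambiguate("cold", "Patient presents with a cold, cough and a runny nose."))
    print(disambiguate("cold", "Hands were exposed to cold water at freezing temperature."))
```

    With these made-up descriptions, the first context shares "cough", "runny", and "nose" with the common cold sense, and the second shares "water", "freezing", and "temperature" with the temperature sense. A realistic system would draw its descriptions and related concepts from the knowledge source itself and would need a better fallback than returning None when nothing overlaps.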

Software
  • CuiTools -- A freely available suite of Perl programs for supervised, unsupervised, and knowledge-based word sense disambiguation.
  • UMLS-SenseRelate -- A freely available suite of Perl programs for exploring the use of semantic similarity and relatedness between UMLS concepts to disambiguate terms in biomedical text.

Publications

Presentations