
First published April 2, 2004 as JAMIA PrePrint; doi:10.1197/jamia.M1533
Journal of the American Medical Informatics Association 2004;11(4):320-331
© 2004 American Medical Informatics Association


A more recent version of this article appeared on July 1, 2004

Submitted on January 9, 2004
Accepted on March 16, 2004

A Multi-Aspect Comparison Study of Supervised Word Sense Disambiguation

Hongfang Liu PhD1*, Virginia Teller PhD2, and Carol Friedman PhD3

Affiliation of the authors: 1 Department of Information Systems, University of Maryland at Baltimore County, Baltimore, MD; 2 Department of Computer Science, Hunter College, City University of New York, New York, NY; 3 Department of Biomedical Informatics, Columbia University, New York, NY

* To whom correspondence should be addressed.

Objective To investigate relations among different aspects in supervised word sense disambiguation (WSD; supervised machine learning for disambiguating the sense of a term in a context) and compare supervised WSD in the biomedical domain with that in the general English domain.

Methods The study involved three data sets: a biomedical abbreviation data set, a general biomedical term data set, and a general English data set. We implemented four machine learning algorithms: Naive Bayes (NBL), traditional Decision Lists (TDLL), our adaptation of Decision Lists (ODLL), and our Mixed Supervised Learning (MSL). We tested six feature representations (various combinations of collocations, bag of words, oriented bag of words, etc.) and five window sizes (2, 4, 6, 8, and 10).
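To make the notions of window size and feature representation concrete, the following is a minimal Python sketch. The abstract does not define the feature types, so the definitions used here are assumptions: a bag of words is taken to be the unordered set of words within the window, an oriented bag of words marks each word as left or right of the target, and collocations record each word's exact offset from the target. The function name `context_features` and the key formats are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Set


def context_features(tokens: List[str], target_index: int, window: int) -> Dict[str, Set[str]]:
    """Build three illustrative context representations for one ambiguous token.

    The feature definitions are assumptions made for this sketch, not the
    definitions used in the study.
    """
    # Up to `window` tokens on each side of the ambiguous term.
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]

    # Bag of words: which words occur nearby, ignoring side and position.
    bag = set(left) | set(right)

    # Oriented bag of words: same words, but tagged with the side they occur on.
    oriented = {f"L:{w}" for w in left} | {f"R:{w}" for w in right}

    # Collocations: words tagged with their exact offset from the target.
    collocations = {f"{i - len(left)}:{w}" for i, w in enumerate(left)}
    collocations |= {f"+{i + 1}:{w}" for i, w in enumerate(right)}

    return {"bag": bag, "oriented": oriented, "collocations": collocations}


# Toy sentence (invented for illustration); the ambiguous term is "ra".
sentence = "the patient has a history of ra and joint pain".split()
features = context_features(sentence, target_index=6, window=2)
```

With a window of 2, only "history", "of", "and", and "joint" contribute features; a larger window would pull in more of the sentence.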

Results Supervised WSD is suitable only when enough sense-tagged instances are available, with at least a few dozen instances for each sense. The combination of collocations and neighboring words is an appropriate choice for representing the context. For terms with unrelated biomedical senses, a large window size such as the whole paragraph should be used, while for general English words a moderate window size between 4 and 10 should be used. For abbreviations, our implementation of decision list classifiers performed better than traditional decision list classifiers; the opposite held for the other two data sets. Our mixed supervised learning was stable and generally better than the other classifiers on all three data sets.

Conclusion From this study, we found that different aspects of supervised WSD depend on each other. The experimental method presented in the study can be used to select the best supervised WSD classifier for each ambiguous term.
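The idea of selecting a configuration separately for each ambiguous term can be sketched as a small search over window sizes. This is only an illustration under stated assumptions: the abstract does not describe the evaluation protocol, so leave-one-out accuracy is an assumed stand-in; the classifier is a simple Naive Bayes with add-one smoothing rather than the study's NBL implementation; and the helper names and the toy instances for the word "cold" are invented. A fuller version would also loop over classifiers and feature representations.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Instance = Tuple[List[str], int, str]  # (tokens, index of ambiguous term, sense)


def window_bag(tokens: List[str], idx: int, window: int) -> Set[str]:
    """Unordered set of words within `window` tokens of the ambiguous term."""
    return set(tokens[max(0, idx - window):idx]) | set(tokens[idx + 1:idx + 1 + window])


def train_nb(examples: Sequence[Tuple[Set[str], str]]):
    """Count senses and per-sense feature occurrences."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for feats, sense in examples:
        feat_counts[sense].update(feats)
        vocab |= feats
    return sense_counts, feat_counts, vocab


def predict_nb(model, feats: Set[str]) -> str:
    """Return the sense with the highest smoothed log-probability."""
    sense_counts, feat_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = "", -math.inf
    for sense in sorted(sense_counts):  # sorted for deterministic tie-breaking
        score = math.log(sense_counts[sense] / total)
        # Add-one smoothing, with one extra slot for unseen features.
        denom = sum(feat_counts[sense].values()) + len(vocab) + 1
        for f in feats:
            score += math.log((feat_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def loo_accuracy(instances: Sequence[Instance], window: int) -> float:
    """Leave-one-out accuracy for one window size (assumed protocol)."""
    examples = [(window_bag(t, i, window), s) for t, i, s in instances]
    correct = 0
    for k, (feats, sense) in enumerate(examples):
        model = train_nb(examples[:k] + examples[k + 1:])
        correct += predict_nb(model, feats) == sense
    return correct / len(examples)


def select_window(instances: Sequence[Instance], windows=(2, 4, 6, 8, 10)):
    """Pick the window size with the best accuracy; ties go to the earliest listed."""
    scores = {w: loo_accuracy(instances, w) for w in windows}
    return max(windows, key=lambda w: scores[w]), scores


# Toy sense-tagged instances for "cold" (invented for illustration).
toy: List[Instance] = [
    ("she caught a cold and had a fever last week".split(), 3, "illness"),
    ("his cold came with a cough and sore throat".split(), 1, "illness"),
    ("a bad cold kept him in bed with fever".split(), 2, "illness"),
    ("the cold wind blew snow across the frozen lake".split(), 1, "temperature"),
    ("it was a cold winter night with heavy snow".split(), 3, "temperature"),
    ("the water felt cold after the winter frost".split(), 3, "temperature"),
]
best_window, window_scores = select_window(toy)
```

Six instances are far too few for a meaningful estimate; the Results above suggest at least a few dozen instances per sense before supervised WSD is appropriate.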







