
First published June 23, 2006 as JAMIA PrePrint; doi:10.1197/jamia.M2085
Journal of the American Medical Informatics Association 2006;13(5):497-507
© 2006 American Medical Informatics Association


A more recent version of this article appeared on September 1, 2006

Submitted on February 16, 2006
Accepted on June 2, 2006

Quantitative Assessment of Dictionary-based Protein Named Entity Tagging

Hongfang Liu PhD1*, Zhang-Zhi Hu MD2, Manabu Torii PhD1, Cathy Wu PhD2, and Carol Friedman PhD3

Affiliation of the authors: 1 Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Georgetown University, Washington, DC; 2 Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Georgetown University, Washington, DC; 3 Department of Biomedical Informatics, Columbia University, New York, NY

* To whom correspondence should be addressed.

Objective Natural language processing (NLP) approaches have been explored to manage and mine information recorded in biological literature. A critical step in biological literature mining is biological named entity tagging (BNET), which identifies names mentioned in text and normalizes them to entities in biological databases. The aim of this study was to provide a quantitative assessment of the complexity of BNET on protein entities through BioThesaurus, a thesaurus of gene and protein names for UniProt entries that was acquired using online resources.

Methods We evaluated the complexity from several perspectives: ambiguity (i.e., the number of genes or proteins represented by one name), synonymy (i.e., the number of names associated with one gene or protein), and coverage (i.e., the percentage of gene and protein names in text that are included in the thesaurus). We also normalized the names in BioThesaurus, and each measure was obtained twice: once before normalization and once after.
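The three measures defined above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the entry identifiers (E1, E2, E3) and names are toy data, the normalize function (lowercasing and dropping non-alphanumeric characters) is a hypothetical stand-in for the normalization actually applied to BioThesaurus, and the averages are taken as simple means over names and over entries, which may differ from the study's exact procedure.

```python
import re
from collections import defaultdict


def normalize(name):
    # Hypothetical normalization: lowercase and drop non-alphanumerics.
    # The normalization used in the study is not described in the abstract.
    return re.sub(r"[^a-z0-9]", "", name.lower())


def thesaurus_stats(pairs, text_names, norm=None):
    """Return (average ambiguity, average synonymy, coverage).

    pairs      -- iterable of (name, entry_id) thesaurus records
    text_names -- names observed in text, used for the coverage measure
    norm       -- optional function applied to every name before counting
    """
    norm = norm or (lambda s: s)
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for name, entry in pairs:
        key = norm(name)
        name_to_entries[key].add(entry)
        entry_to_names[entry].add(key)

    # Ambiguity: mean number of entries represented by one name.
    ambiguity = sum(len(e) for e in name_to_entries.values()) / len(name_to_entries)
    # Synonymy: mean number of distinct names associated with one entry.
    synonymy = sum(len(n) for n in entry_to_names.values()) / len(entry_to_names)
    # Coverage: fraction of names from text that appear in the thesaurus.
    covered = sum(1 for n in text_names if norm(n) in name_to_entries)
    coverage = covered / len(text_names)
    return ambiguity, synonymy, coverage


# Toy thesaurus records and text mentions (illustrative only).
PAIRS = [
    ("p53", "E1"), ("TP53", "E1"),
    ("p53", "E2"), ("Tp-53", "E2"), ("Tp53", "E2"),
    ("cdc2", "E3"),
]
TEXT_NAMES = ["TP-53", "cdc2", "BRCA1", "p53"]

raw_stats = thesaurus_stats(PAIRS, TEXT_NAMES)
norm_stats = thesaurus_stats(PAIRS, TEXT_NAMES, norm=normalize)
print("before normalization:", raw_stats)
print("after normalization: ", norm_stats)
```

On this toy input, normalization merges spelling variants, so synonymy falls while ambiguity and coverage rise. The magnitudes carry no meaning beyond the example.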

Results The current version of BioThesaurus contains over 2.6 million names (2.1 million after normalization) covering more than 1.8 million UniProt Knowledgebase entries. The average synonymy is 3.53 before normalization and 2.86 after, and the average ambiguity is 2.31 before normalization and 2.32 after. The coverage is 94.0% on the BioCreAtive data set, which comprises MEDLINE abstracts containing gene or protein named entities.

Conclusion The study indicated that names for genes or proteins are highly ambiguous and there are usually multiple names for the same gene or protein. It also demonstrated that most gene or protein names appearing in BioCreAtive text can be found in BioThesaurus, which was acquired using annotation fields from online resources.







