First published February 5, 2004 as JAMIA PrePrint; doi:10.1197/jamia.M1453
Journal of the American Medical Informatics Association 2004;11(3):174-178
© 2004 American Medical Informatics Association


A more recent version of this article appeared on May 1, 2004

Submitted on September 8, 2003
Accepted on January 11, 2004

A Simple and Practical Dictionary-Based Approach for Identification of Proteins in MEDLINE Abstracts

Sergei Egorov, PhD¹, Anton Yuryev, PhD¹, and Nikolai Daraselia, PhD¹*

Affiliation of the authors: ¹ Ariadne Genomics, Inc., Rockville, MD

* To whom correspondence should be addressed.

Objective: To develop a practical and efficient protein identification system for biomedical corpora.

Design: The system, called ProtScan, combines a carefully constructed dictionary of mammalian proteins with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of the MEDLINE Name-of-Substance (NOS) annotation. The ProtScan dictionaries were constructed semi-automatically from various public-domain sequence databases, followed by intensive expert curation.
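As a rough illustration of the dictionary-plus-tokenization idea described above, the following minimal Python sketch tags names by greedy longest match against a small lookup table. It is not the ProtScan implementation: the dictionary entries, the identifiers, the `tokenize` and `tag_proteins` helpers, and the rule of splitting letters from digits (so that "IL-6", "IL 6", and "IL6" normalize to the same tokens) are all assumptions made for the example.

```python
import re

# Hypothetical toy dictionary: normalized token tuples -> made-up protein IDs.
# A curated dictionary would be far larger; these entries are illustrative only.
TOY_DICTIONARY = {
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
    ("tnf", "alpha"): "TNF",
    ("interleukin", "6"): "IL6",
    ("il", "6"): "IL6",
    ("p", "53"): "TP53",
}

# Assumed tokenization rule: runs of letters and runs of digits are separate
# tokens; hyphens, spaces, and other punctuation only act as separators.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokenize(text):
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tag_proteins(text, dictionary=TOY_DICTIONARY):
    """Greedy longest-match lookup of token sequences in the dictionary.

    Returns a list of (surface string, protein ID, start, end) tuples.
    """
    tokens = tokenize(text)
    max_len = max(len(key) for key in dictionary)
    hits = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        # Try the longest candidate name first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in dictionary:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append((text[start:end], dictionary[key], start, end))
                matched_len = n
                break
        # Skip past a matched name, or advance by one token if nothing matched.
        i += matched_len or 1
    return hits


if __name__ == "__main__":
    sample = "IL-6 and TNF-alpha levels rose, while p53 and IL6 were unchanged."
    for hit in tag_proteins(sample):
        print(hit)
```

Because punctuation is discarded during tokenization, this sketch would also match tokens separated by a sentence boundary (for example "IL. 6"); a production system would need additional rules to avoid such false positives.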

Measurements: The recall and precision of the system were determined using 1,000 randomly selected, hand-tagged MEDLINE abstracts.

Results: The system identifies protein occurrences in MEDLINE abstracts with 98% precision and 88% recall, and processes approximately 300 abstracts per second. Without NOS annotation, precision and recall were 98.5% and 84%, respectively.
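For readers unfamiliar with the two metrics, the sketch below shows one common way to compute precision and recall by comparing system output against hand-tagged annotations. The exact-span matching criterion, the `precision_recall` helper, and the example spans are assumptions for illustration; they are not the evaluation data or protocol of the study.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over sets of annotated spans.

    Each span is an (abstract_id, start, end) tuple. A predicted span counts
    as correct only if it exactly matches a gold span (an assumed criterion).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: four hand-tagged names, three found by the tagger.
gold_spans = {(1, 0, 4), (1, 9, 18), (2, 5, 8), (2, 20, 25)}
predicted_spans = {(1, 0, 4), (1, 9, 18), (2, 5, 8)}

precision, recall = precision_recall(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted name is correct but one hand-tagged name is missed, which mirrors the general pattern reported above: a dictionary-based tagger tends to have high precision, while recall is limited by names absent from the dictionary.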

Conclusion: The system appears well suited for protein-based MEDLINE indexing and can help improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed. The authors state that they have a financial interest in the ProtScan system (see Acknowledgements).




This article has been cited by other articles:


W. Zhou, V. I. Torvik, and N. R. Smalheiser
ADAM: another database of abbreviations in MEDLINE
Bioinformatics, November 15, 2006; 22(22): 2813 - 2818.



