
First published February 5, 2004 as JAMIA PrePrint; doi:10.1197/jamia.M1453
J Am Med Inform Assoc. 2004;11:174-178. DOI 10.1197/jamia.M1453.
© 2004 American Medical Informatics Association


Research Paper

A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts

Sergei Egorov, PhD, Anton Yuryev, PhD and Nikolai Daraselia, PhD

Affiliation of the authors: Ariadne Genomics, Inc., Rockville, MD.

Correspondence and reprints: Nikolai Daraselia, PhD, Ariadne Genomics, Inc., 9700 Great Seneca Highway, Rockville, MD 20850; e-mail: <nikolai{at}ariadnegenomics.com>.

Received for publication: 09/08/03; accepted for publication: 01/11/04.

Objective: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.

Design: The system, called ProtScan, uses a carefully constructed dictionary of mammalian proteins together with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of the Medline "Name-of-Substance" (NOS) annotation. The ProtScan dictionaries were constructed semi-automatically from various public-domain sequence databases and then subjected to intensive expert curation.
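To make the dictionary-plus-tokenization idea concrete, the following is a minimal sketch of a longest-match dictionary tagger. It is not ProtScan's actual algorithm, which the abstract does not specify. The toy dictionary, the alphanumeric tokenizer, and the names `TOY_DICTIONARY`, `tokenize`, and `tag_proteins` are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary: normalized token tuples -> canonical identifier.
# These entries are illustrative only and are not taken from ProtScan.
TOY_DICTIONARY = {
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
    ("tnf", "alpha"): "TNF",
    ("interleukin", "6"): "IL6",
    ("il", "6"): "IL6",
    ("p53",): "TP53",
}
MAX_LEN = max(len(key) for key in TOY_DICTIONARY)


def tokenize(text):
    """Split text into lowercase alphanumeric tokens with character offsets.

    Assumption: hyphens and spaces are treated alike, so "IL-6" and "IL 6"
    yield the same token sequence ("il", "6").
    """
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def tag_proteins(text):
    """Return (start, end, surface_text, identifier) for each dictionary hit.

    At each token position the longest dictionary entry is tried first,
    so multi-word names win over shorter names nested inside them.
    """
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        matched_len = 0
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in TOY_DICTIONARY:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append((start, end, text[start:end], TOY_DICTIONARY[key]))
                matched_len = n
                break
        # Skip past a match, or advance one token if nothing matched here.
        i += matched_len if matched_len else 1
    return hits


if __name__ == "__main__":
    sample = "IL-6 and TNF-alpha levels rose, while p53 was unchanged."
    for hit in tag_proteins(sample):
        print(hit)
```

Normalizing tokens before lookup lets one dictionary entry cover several surface spellings. A production system would need a much larger curated dictionary and more careful handling of ambiguous short names.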

Measurements: Recall and precision were determined using 1,000 randomly selected, hand-tagged Medline abstracts.

Results: The system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall, and processes approximately 300 abstracts per second. Without NOS annotation, precision and recall were 98.5% and 84%, respectively.
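For readers unfamiliar with the two measures, the sketch below shows one common way to compute precision and recall by comparing system-tagged spans against hand-tagged spans. Exact span matching is an assumption here, since the abstract does not state the matching criterion. The function name `precision_recall` and the span sets are made up for illustration and are not the study's data.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over sets of (abstract_id, start, end) spans.

    precision = correct predictions / all predictions
    recall    = correct predictions / all hand-tagged occurrences
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans, for illustration only (not results from the paper).
gold = {("a1", 0, 4), ("a1", 9, 18), ("a2", 5, 8), ("a2", 20, 27), ("a3", 2, 6)}
predicted = {("a1", 0, 4), ("a1", 9, 18), ("a2", 5, 8), ("a3", 10, 13)}

if __name__ == "__main__":
    p, r = precision_recall(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case three of the four predicted spans are correct and three of the five hand-tagged spans are found. High precision paired with lower recall, as reported above, is the pattern to expect from a dictionary-based tagger: names in the dictionary are matched reliably, while names missing from it are not found.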

Conclusion: The system appears to be well suited for protein-based Medline indexing and can help improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.







