Submitted on September 8, 2003
Accepted on January 11, 2004
Affiliation of the authors: 1 Ariadne Genomics, Inc., Rockville, MD
* To whom correspondence should be addressed.
Objective To develop a practical and efficient protein identification system for biomedical corpora.
Design The system, called ProtScan, uses a carefully constructed dictionary of mammalian proteins together with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of MEDLINE Name-of-Substance (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by intensive expert curation.
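As an illustration of the general idea of dictionary-based tagging, the following is a minimal sketch and not ProtScan's actual algorithm. The toy dictionary, the identifiers, the tokenizer rule (hyphens and punctuation treated as separators, case ignored), and the longest-match-first strategy are all assumptions made for this example.

```python
import re

# Hypothetical toy dictionary: normalized token tuples -> protein identifier.
# The real ProtScan dictionaries are not reproduced here.
PROTEIN_DICT = {
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
    ("tnf", "alpha"): "TNF",
    ("interleukin", "6"): "IL6",
    ("p53",): "TP53",
}
MAX_NAME_LEN = max(len(key) for key in PROTEIN_DICT)


def tokenize(text):
    """Return (normalized_token, start, end) triples.

    Assumed rule: runs of letters and digits are tokens, everything else
    (spaces, hyphens, punctuation) is a separator, and case is ignored.
    """
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def tag_proteins(text):
    """Greedy longest-match lookup; returns (start, end, protein_id) triples."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate name first, then shorter ones.
        for n in range(min(MAX_NAME_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in PROTEIN_DICT:
                hits.append((tokens[i][1], tokens[i + n - 1][2], PROTEIN_DICT[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits


text = "TNF-alpha induces interleukin-6 and p53."
for start, end, protein_id in tag_proteins(text):
    print(text[start:end], "->", protein_id)
```

Normalizing tokens before lookup lets spelling variants such as "TNF-alpha" and "TNF alpha" share one dictionary entry, which is one plausible reason to pair a dictionary with a specialized tokenizer.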
Measurements The recall and precision of the system were determined using 1,000 randomly selected and hand-tagged MEDLINE abstracts.
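For readers unfamiliar with these measures, the sketch below shows one common way to compute precision and recall from hand-tagged and system-tagged occurrences. It assumes exact span matching, with each occurrence represented as an (abstract_id, start, end) triple; the scoring criteria actually used in the study are not stated here, and the data are invented for illustration.

```python
def precision_recall(gold, predicted):
    """Precision and recall under exact-match scoring (an assumption).

    gold, predicted: iterables of hashable occurrence records,
    for example (abstract_id, start, end) triples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: 5 hand-tagged occurrences, 4 system tags, 3 in common.
gold = {(1, 0, 9), (1, 18, 31), (2, 5, 8), (2, 20, 27), (3, 40, 46)}
predicted = {(1, 0, 9), (1, 18, 31), (2, 5, 8), (3, 0, 4)}
p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```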
Results The system identifies protein occurrences in MEDLINE abstracts with 98% precision and 88% recall, and processes approximately 300 abstracts per second. Without NOS annotation, precision and recall were 98.5% and 84%, respectively.
Conclusion The system appears to be well suited for protein-based MEDLINE indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed. The authors declare a financial interest in the ProtScan system (see Acknowledgements).