First published June 23, 2006 as JAMIA PrePrint; doi:10.1197/jamia.M2051
Journal of the American Medical Informatics Association 2006;13(5):526-535
© 2006 American Medical Informatics Association


A more recent version of this article appeared on September 1, 2006

Submitted on January 9, 2006
Accepted on June 6, 2006

Enhancing text categorization with semantic-enriched representation and training data augmentation

Xinghua Lu MD, PhD1*, Bin Zheng MD, PhD1, Atulya Velivelli2, and ChengXiang Zhai PhD3

Affiliation of the authors: 1 Dept. Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC; 2 Dept. Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL; 3 Dept. Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL

* To whom correspondence should be addressed.

Objective Acquiring and representing biomedical knowledge is an increasingly important component of contemporary bioinformatics. A critical step in this process is to efficiently identify and retrieve relevant documents from the vast volume of modern biomedical literature. In practice, many information retrieval tasks are difficult because relevant documents constitute only a small fraction of the literature database and because annotated examples for training a retrieval algorithm are scarce. Under such conditions, the performance of information retrieval algorithms is often unsatisfactory, and improvements are needed.

Design We studied two approaches to enhancing text categorization performance on sparse-data tasks: (1) semantic-preserving dimension reduction by representing text with semantic-enriched features; and (2) augmenting training data with semi-supervised learning. A probabilistic topic model was applied to extract major semantic topics from a text corpus of interest. Document representations were projected from the high-dimensional vocabulary space onto a reduced semantic topic space. A graph-based semi-supervised learning algorithm was applied to identify potential positive training cases, which were then used to augment the training data. The effects of the transformed and augmented training data on text categorization by a support vector machine (SVM) were evaluated.
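To make the training data augmentation step concrete, the following is a minimal sketch of graph-based semi-supervised learning over documents that are already represented in a topic space. It is not the authors' implementation: the abstract does not say which graph-based algorithm, similarity measure, or selection rule was used, so the cosine-similarity graph, the iterative label propagation with clamped seed labels, and the names `cosine`, `propagate_labels`, and `pseudo_positives` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-space vectors (assumed edge weight)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def propagate_labels(docs, seeds, n_iter=50):
    """Spread seed labels over a dense similarity graph.

    docs  : list of topic-space vectors, one per document
    seeds : dict mapping document index -> 1.0 (positive) or 0.0 (negative)
    Returns one score in [0, 1] per document; higher means "more likely positive".
    """
    n = len(docs)
    # Fully connected graph without self-loops (an assumption; a k-NN graph
    # would be the usual choice for a large corpus).
    weights = [
        [cosine(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Unlabeled documents start at the neutral score 0.5.
    scores = [seeds.get(i, 0.5) for i in range(n)]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if i in seeds:
                updated.append(seeds[i])  # clamp labeled documents
                continue
            total = sum(weights[i])
            if total > 0:
                # Weighted average of the neighbours' current scores.
                updated.append(
                    sum(weights[i][j] * scores[j] for j in range(n)) / total
                )
            else:
                updated.append(scores[i])
        scores = updated
    return scores


def pseudo_positives(docs, seeds, k=1):
    """Return the k unlabeled documents with the highest propagated score."""
    scores = propagate_labels(docs, seeds)
    unlabeled = [i for i in range(len(docs)) if i not in seeds]
    return sorted(unlabeled, key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical 3-topic proportions for five documents.
    docs = [
        [0.90, 0.05, 0.05],  # 0: labeled positive
        [0.05, 0.90, 0.05],  # 1: labeled negative
        [0.80, 0.10, 0.10],  # 2: unlabeled, close to the positive seed
        [0.10, 0.80, 0.10],  # 3: unlabeled, close to the negative seed
        [0.30, 0.30, 0.40],  # 4: unlabeled, ambiguous
    ]
    seeds = {0: 1.0, 1: 0.0}
    print(pseudo_positives(docs, seeds, k=1))
```

In a pipeline along the lines described above, the selected indices would be added to the positive training set before fitting the SVM. How many pseudo-positive cases to accept, and at what score threshold, is a tuning decision this sketch leaves open.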

Results and Conclusion Semantic-enriched transformation and training data augmented with pseudo-positive cases enhance the efficiency and performance of text categorization by SVM.






