First published June 23, 2006 as JAMIA PrePrint; doi:10.1197/jamia.M2051
J Am Med Inform Assoc. 2006;13:526-535. DOI 10.1197/jamia.M2051.
© 2006 American Medical Informatics Association


Research Paper

Enhancing Text Categorization with Semantic-enriched Representation and Training Data Augmentation

Xinghua Lu, MD, PhD (a,*), Bin Zheng, MD, PhD (a), Atulya Velivelli (b), and ChengXiang Zhai, PhD (c)

a Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC
b Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL
c Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL

* Correspondence and reprints: Xinghua Lu, MD, PhD, Department of Biostatistics, Bioinformatics and Epidemiology, 135 Cannon St, Suite 303, Charleston, SC 29425. (Email: lux{at}musc.edu).

Received for publication: 01/09/06; accepted for publication: 06/06/06.

Objective: Acquiring and representing biomedical knowledge is an increasingly important component of contemporary bioinformatics. A critical step in this process is to efficiently identify and retrieve relevant documents from the vast volume of modern biomedical literature. In practice, many information retrieval tasks are difficult because the data are high-dimensional and few annotated examples are available to train a retrieval algorithm. Under these conditions, the performance of information retrieval algorithms is often unsatisfactory, and improvements are needed.

Design: We studied two approaches to enhancing text categorization performance on sparse, high-dimensional data: (1) semantic-preserving dimension reduction, in which text is represented with semantic-enriched features; and (2) augmentation of the training data through semi-supervised learning. A probabilistic topic model was applied to extract the major semantic topics from a text corpus of interest. Document representations were then projected from the high-dimensional vocabulary space onto a lower-dimensional semantic topic space. A graph-based semi-supervised learning algorithm was applied to identify potential positive training cases, which were then used to augment the training data. The effects of data transformation and augmentation on text categorization by support vector machine (SVM) were evaluated.

Results and Conclusion: Semantic-enriched data transformation and training data augmented with pseudo-positive cases both enhance the efficiency and performance of text categorization by SVM.
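
The two steps described under Design can be illustrated with a minimal sketch. The abstract does not specify which topic model or which graph algorithm was used, so everything below is an assumption made for illustration: the topic-word probabilities are taken as already estimated, the projection splits each token across topics in proportion to p(word | topic), and the semi-supervised step is a generic score propagation over a cosine-similarity graph. The function names `project_to_topics` and `pseudo_positives` and the parameters `alpha`, `iterations`, and `top_k` are hypothetical, and the toy data are invented.

```python
import math


def project_to_topics(word_counts, topic_word_probs):
    """Map a bag-of-words dict onto a normalized topic-weight vector.

    topic_word_probs[k][w] is an assumed, already-estimated p(w | topic k).
    Each token is shared among topics in proportion to p(w | k), which
    corresponds to assuming a uniform prior over topics.
    """
    weights = [0.0] * len(topic_word_probs)
    for word, count in word_counts.items():
        likelihoods = [topic.get(word, 0.0) for topic in topic_word_probs]
        total = sum(likelihoods)
        if total == 0.0:
            continue  # word is unknown to every topic, so it is ignored
        for k, lik in enumerate(likelihoods):
            weights[k] += count * lik / total
    norm = sum(weights)
    return [w / norm for w in weights] if norm else weights


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def pseudo_positives(vectors, positive_ids, top_k=1, alpha=0.8, iterations=50):
    """Rank unlabeled documents by scores propagated from known positives.

    Builds a row-normalized cosine-similarity graph and repeatedly mixes each
    node's neighbor scores (weight alpha) with its seed label (weight 1 - alpha).
    Returns the indices of the top_k highest-scoring unlabeled documents.
    """
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    for row in sim:
        row_sum = sum(row)
        if row_sum:
            row[:] = [x / row_sum for x in row]
    seed = [1.0 if i in positive_ids else 0.0 for i in range(n)]
    scores = seed[:]
    for _ in range(iterations):
        scores = [alpha * sum(sim[i][j] * scores[j] for j in range(n))
                  + (1 - alpha) * seed[i] for i in range(n)]
    unlabeled = [i for i in range(n) if i not in positive_ids]
    unlabeled.sort(key=lambda i: scores[i], reverse=True)
    return unlabeled[:top_k]


# Toy example: two hand-written topics and four tiny documents.
topics = [{"gene": 0.5, "protein": 0.5}, {"patient": 0.5, "clinic": 0.5}]
docs = [
    {"gene": 2, "protein": 1},
    {"gene": 1, "protein": 2, "patient": 1},
    {"patient": 3, "clinic": 1},
    {"clinic": 2, "gene": 1},
]
topic_vectors = [project_to_topics(d, topics) for d in docs]
candidates = pseudo_positives(topic_vectors, positive_ids={0}, top_k=1)
```

In this sketch, the documents returned in `candidates` would be added to the positive training set before fitting a classifier such as an SVM on the topic-space vectors. The classifier itself is left out so the example stays dependency-free.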
