Submitted on January 9, 2006
Accepted on June 6, 2006
Affiliation of the authors: 1 Dept. Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC; 2 Dept. Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL; 3 Dept. Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
* To whom correspondence should be addressed.
Objective Acquiring and representing biomedical knowledge is an increasingly important component of contemporary bioinformatics. A critical step in this process is to efficiently identify and retrieve relevant documents from the vast volume of modern biomedical literature. In practice, many information retrieval tasks are difficult because relevant documents constitute only a small fraction of the literature database and because annotated examples for training a retrieval algorithm are scarce. Under such conditions, the performance of information retrieval algorithms is often unsatisfactory, and improvements are needed.
Design We studied two approaches to enhancing text categorization performance on sparse-data tasks: (1) semantic-preserving dimension reduction, by representing text with semantic-enriched features; and (2) augmentation of the training data through semi-supervised learning. A probabilistic topic model was applied to extract the major semantic topics from a text corpus of interest. The representation of each document was then projected from the high-dimensional vocabulary space onto the reduced semantic topic space. A graph-based semi-supervised learning algorithm was applied to identify potential positive training cases, which were then used to augment the training data. We evaluated the effects of the transformed and augmented training data on text categorization with a support vector machine (SVM).
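The augmentation step can be illustrated with a minimal sketch. The abstract does not name the graph-based algorithm used, so the code below assumes a simple iterative label propagation with clamped labels over a Gaussian similarity graph; it is not the authors' implementation. The function names, the `gamma` and `threshold` values, and the two-topic document vectors are all hypothetical and chosen only for illustration. Documents are assumed to be already represented as topic-proportion vectors, as a topic model would produce.

```python
import math

def rbf_similarity(a, b, gamma=10.0):
    """Gaussian similarity between two topic-proportion vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def propagate_labels(vectors, labels, n_iter=100, gamma=10.0):
    """Spread known labels over a fully connected similarity graph.

    vectors: list of topic-proportion vectors, one per document.
    labels:  dict mapping document index -> 1.0 (positive) or 0.0 (negative).
    Returns one score per document; higher means more likely positive.
    """
    n = len(vectors)
    # Row-normalized edge weights, with no self-loops.
    weights = []
    for i in range(n):
        row = [0.0 if i == j else rbf_similarity(vectors[i], vectors[j], gamma)
               for j in range(n)]
        total = sum(row) or 1.0  # guard against an isolated node
        weights.append([w / total for w in row])

    # Unlabeled documents start at an uninformative 0.5.
    scores = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(n_iter):
        # Each document takes the weighted average of its neighbors' scores.
        new_scores = [sum(weights[i][j] * scores[j] for j in range(n))
                      for i in range(n)]
        # Clamp the documents whose labels are known.
        for i, y in labels.items():
            new_scores[i] = y
        scores = new_scores
    return scores

def pseudo_positives(scores, labels, threshold=0.8):
    """Indices of unlabeled documents scored at or above the threshold."""
    return [i for i, s in enumerate(scores) if i not in labels and s >= threshold]

# Hypothetical two-topic representations of six documents.
docs = [
    [0.90, 0.10],  # 0: labeled positive
    [0.85, 0.15],  # 1: unlabeled
    [0.80, 0.20],  # 2: unlabeled
    [0.10, 0.90],  # 3: labeled negative
    [0.15, 0.85],  # 4: unlabeled
    [0.20, 0.80],  # 5: unlabeled
]
labels = {0: 1.0, 3: 0.0}

scores = propagate_labels(docs, labels)
candidates = pseudo_positives(scores, labels)
print(candidates)
```

Under these assumptions, the returned indices would be added to the positive training set before the SVM is trained on the topic-space vectors. A stricter threshold admits fewer but more reliable pseudo-positive cases.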
Results and Conclusion Both the semantic-enriched transformation and the training data augmented with pseudo-positive cases enhance the efficiency and performance of text categorization by SVM.