Research Paper
a Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, MD
b Graduate School of Medicine, University of Tennessee, Knoxville, TN.
* Correspondence: Charles Sneiderman, MD, PhD, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894 (Email: charlie{at}nlm.nih.gov).
Received for publication: 02/21/07; accepted for publication: 07/26/07.
| Abstract |
|---|
Design: Three experimental systems were evaluated for their ability to find MEDLINE citations providing answers to clinical questions of different complexity. The systems (SemRep, Essie, and CQA-1.0), which rely on domain knowledge and semantic processing to varying extents, were evaluated separately and in combination. Fifteen therapy and prevention questions in three categories (general, intermediate, and specific questions) were searched. The first 10 citations retrieved by each system were randomized, anonymized, and evaluated on a three-point scale. The reasons for ratings were documented.
Measurements: Metrics evaluating the overall performance of a system (mean average precision, binary preference) and metrics evaluating the number of relevant documents in the first several presented to a physician were used.
Results: Scores (mean average precision = 0.57, binary preference = 0.71) for fusion of the retrieval results of the three systems are significantly (p < 0.01) better than those for any individual system. All three systems present three to four relevant citations in the first five for any question type.
Conclusion: The improvements in finding relevant MEDLINE citations due to knowledge-based processing show promise in assisting physicians to answer questions in clinical practice.
| Introduction |
|---|
Several possible approaches to addressing these problems have been investigated. Studies of interfaces for structured query formulation report improvements in precision without negative impact on recall, but not necessarily higher user satisfaction or acceptance of the interface.3 Moreover, training in searching and appraising the medical literature is essential for finding satisfactory answers to clinical questions.1,4 Physicians who merely used their electronic information resources of choice were not always successful in answering clinical questions.5 The paradigm of evidence-based medicine (EBM)6 is an important resource in devising solutions to these problems.
In an environment where time and effort are at a premium, the value of MEDLINE for assisting in therapeutic decision making depends not only on well-formed questions but also on algorithms that can improve precision and recall in finding clinically relevant information.7 Research indicates that given enough time and skill, clinicians can find answers to their questions in MEDLINE.8 A recommended strategy is to reduce search results by focusing the question, often by adding more terms to a query. This requires a clinician to invest time in analyzing the information needed to identify search terms describing a clinical situation. An alternative approach is often observed in practice9: A clinician underspecifies a search by submitting two or three terms and then selects relevant documents while browsing the results, often using clinical practice guidelines for evaluation. Some of these strategies could be implemented automatically by incorporating domain knowledge in the search and then postprocessing the search results in order to rerank results for relevance to the query.
The research presented here explores the effectiveness of such automatic methods. We evaluate three knowledge-based automatic methods being developed at the National Library of Medicine to help physicians find clinically relevant information in MEDLINE. The three systems use medical domain knowledge encoded in the Unified Medical Language System (UMLS)10 (alone or in combination with corpus-based methods) to find information for clinical queries of varying complexity.
The first system, SemRep Summarization,11 uses natural language processing and automatic summarization of MEDLINE citations to find the most relevant information about a clinical query within PubMed retrieval results. The second system, CQA-1.0,12 uses EBM recommendations for finding best answers for questions about treatment and prevention.13 Both SemRep and CQA-1.0 rerank a set of MEDLINE citations retrieved using PubMed and clinical query filters.14 The third system, Essie,15 is a probabilistic search engine that uses fine-grained tokenization, concept searching utilizing UMLS-derived synonymy, and phrase searching based on the user's query to find the best MEDLINE citations for answering a clinical question. All three systems use structured domain knowledge from the UMLS to a varying extent and rely directly or indirectly on the medical subject headings (MeSH) controlled vocabulary used to manually index MEDLINE citations.
After providing an overview of the three systems, we concentrate on evaluating them with respect to finding information about treatment and prevention of 15 disorders. A test collection was constructed using the Text Retrieval Conference (TREC) pooling strategy.16 A rating scale developed to evaluate the utility of MEDLINE citations in clinical decision making17 was used to compare the performance of the three methods in answering clinical queries. We provide results in several evaluation metrics as a way of predicting the effectiveness of the systems under consideration in different clinical situations. The goal of this work is to explore approaches to increasing the utility of the primary literature with respect to answering clinical questions of varying complexity. Specifically, we investigate whether automatic understanding of MEDLINE citations based on medical domain knowledge can provide practical support for clinicians in therapeutic decision making.
| Background |
|---|
In addition to using domain knowledge to assist clinicians as they actively seek information, promising results have been obtained through passive query generation using patient records for query formulation.27 The effect of unobtrusively providing context-specific links between clinical data and information resources in the form of "infobuttons"28 has been examined in several recent studies. Rosenbloom et al.29 found a significant increase in the use of educational materials in the Care Provider Order Entry system when the materials could be accessed through visible hyperlinks (as opposed to menus). Cimino et al.30 observed positive results and increased use of infobuttons over 5.7 years. The success of context-specific access to knowledge in this study varied with context and user type. Similar to the study by Cimino et al., Del Fiol et al.31 observed increased infobutton use with preference for secondary sources (such as Micromedex and UpToDate) that summarize the results of clinical studies and underutilization of resources providing access to the primary literature (such as MDConsult and PubMed). Although these resources provide valuable general information to the clinician, there remains a need for methods that help find answers to particular questions.
PubMed
PubMed automatically recognizes controlled vocabulary terms by matching a user's query against the entries in several translation tables. If a match is found in the MeSH translation table, the term is searched as MeSH (including the MeSH term and any specific terms indented under that term in the MeSH hierarchy) and as a text word. One of the advanced PubMed search options, clinical queries, is a set of filters designed to find clinically relevant and scientifically sound studies.32 These filters automatically expand queries using predefined sets of terms designed to limit search results to articles addressing one of the four major clinical tasks (etiology, diagnosis, therapy, and prognosis). For each task, clinical queries provide two search choices: specific (narrow) or sensitive (broad). For example, a "narrow therapy" clinical query augments a user's query with the following search terms: randomized controlled trial [Publication Type] OR (randomized [Title/Abstract] AND controlled [Title/Abstract] AND trial [Title/Abstract]).
SemRep Summarization
The first reranking system considered in this study is based on SemRep Summarization,11,33 which depends on the semantic natural language processing system SemRep.34,35 SemRep identifies semantic predications (relationships) in biomedical text using underspecified syntactic analysis and structured domain knowledge from the UMLS. SemRep predications consist of UMLS Metathesaurus concepts as arguments and UMLS Semantic Network relations as predicates (relations between the concepts). Analysis begins with an underspecified syntactic parse that relies on the SPECIALIST lexicon36 and a part-of-speech tagger.37 MetaMap38 then matches noun phrases to concepts in the UMLS Metathesaurus and determines the semantic type for each concept. Concepts are identified as arguments in a predication using syntactic constraints based on dependency grammar rules and semantic constraints imposed by the Semantic Network. Predications representing core aspects of the clinical scenario were central to this study. These predications have predicates such as TREATS, CO-OCCURS_WITH, and OCCURS_IN and arguments belonging to the UMLS semantic groups39 Chemicals and Drugs, Disorders, and Population Groups.
SemRep Summarization is an automatic summarization system in the semantic abstraction paradigm.40 The system takes as input a list of predications extracted by SemRep from biomedical text (MEDLINE citations to be reranked in this study). Output is a condensed set of predications that serves as a summary of salient information on a specified topic in the citations processed. The core of the system is a transformation stage that identifies the most important information with respect to the specified topic. The transformation stage relies on four principles: (1) relevance, which keeps predications on the topic of the summary; (2) connectivity, which keeps related predications that share an argument with the summary topic; (3) novelty, which eliminates uninformative predications; and (4) saliency, which keeps high frequency predications.41 Predications in the summary are linked to the citations from which they were extracted and play an important role in exploiting SemRep Summarization for reranking retrieved citations in this study.
CQA-1.0
Another reranking method is implemented in the prototype clinical question-answering system CQA-1.0. In this system, questions and MEDLINE citations are represented using frames that capture the fundamental elements of EBM: (1) clinical scenario, (2) clinical task, and (3) strength of evidence. A question frame submitted to the system is used to generate a query and search MEDLINE using PubMed. Retrieved citations are processed with several knowledge extractors and classifiers that rely on a combination of UMLS concept recognition using MetaMap,38 manually derived patterns and rules, and supervised machine learning techniques12 to identify the fundamental EBM components listed. The PICO framework (Problem/Patient, Intervention, Comparison, and Outcome) designed to help clinicians formulate clinical questions42 is used to capture the first fundamental component (clinical scenario) in a MEDLINE citation. The elements of a clinical scenario are identified and extracted by four knowledge extractors. The problem extractor identifies a UMLS concept in the semantic group39 Disorders, which is the focus of a given study. The population extractor identifies phrases containing numerical expressions and concepts with the semantic type Group and its subcategories. The intervention and comparison extractor is based on finding concepts with nine semantic types (for example, Therapeutic or Preventive Procedure and Diagnostic Procedure). Identification of the second fundamental component (clinical task) is based on rules derived from: (1) search strategies encoded in PubMed clinical queries, (2) the JAMA EBM tutorial series on critical appraisal of medical literature,43 and (3) MeSH scope notes. The third fundamental component (strength of evidence) is based on the type of clinical study presented in the publication, authority of the journal that published it, and date of publication. 
Citation scoring and reranking with respect to a question are based on: (1) matching the question and citation with PICO frames, (2) matching the clinical task that generated the question with the task identified in the clinical study (treatment and prevention, for this study), and (3) the strength of the evidence presented in the study.
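The three scoring components listed above can be illustrated with a minimal sketch. The article does not give the actual scoring formula or weights used by CQA-1.0, so the `CitationFeatures` fields, the equal default weights, and the additive combination below are all assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CitationFeatures:
    """Hypothetical per-citation features; not the CQA-1.0 data model."""
    pico_match: float         # assumed 0..1 overlap of question and citation PICO frames
    task_match: bool          # does the citation's clinical task match the question's?
    evidence_strength: float  # assumed 0..1 strength-of-evidence score


def citation_score(f: CitationFeatures,
                   w_pico: float = 1.0,
                   w_task: float = 1.0,
                   w_evidence: float = 1.0) -> float:
    """Additive combination of the three components (assumed, equal weights)."""
    task_component = 1.0 if f.task_match else 0.0
    return (w_pico * f.pico_match
            + w_task * task_component
            + w_evidence * f.evidence_strength)


def rerank(citations: Dict[str, CitationFeatures]) -> List[str]:
    """Order citation IDs by descending score, breaking ties by ID."""
    return sorted(citations, key=lambda c: (-citation_score(citations[c]), c))


# Toy feature values, invented for the example.
citations = {
    "c1": CitationFeatures(pico_match=0.9, task_match=True, evidence_strength=0.8),
    "c2": CitationFeatures(pico_match=0.9, task_match=False, evidence_strength=0.9),
    "c3": CitationFeatures(pico_match=0.2, task_match=True, evidence_strength=0.3),
}
ranked = rerank(citations)
print(ranked)
```

In this toy example a citation whose task does not match the question (c2) falls below one that matches on all three components (c1), even though its evidence score is slightly higher.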
Essie
A different approach to finding citations answering clinical questions is implemented in Essie, a probabilistic search engine developed at the National Library of Medicine for the ClinicalTrials.gov database. Essie incorporates a number of strategies aimed at alleviating the need for sophisticated user queries.15 These strategies include a fine-grained tokenization algorithm that preserves punctuation information, concept searching utilizing UMLS-derived synonymy, and phrase searching based on the user's query. Citations containing phrases identified in a user's query are ranked higher than citations containing individual words comprising the phrase. Position of a matching phrase or term in a citation also influences the rank of a citation with respect to a query. For example, if a phrase is found in the title, the citation is ranked higher than one that contains this phrase in the abstract. Essie provides several possibilities for query expansion: exact match, SPECIALIST lexicon-based36 morphological expansion of terms, and UMLS-based expansion of concepts. Essie was the best-performing search engine in the 2003 TREC Genomics track44 and one of the best-performing systems in the 2006 TREC Genomics track.45
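The two ranking preferences described here (phrase matches over individual-word matches, and title matches over abstract matches) can be sketched as a simple tiering function. This is not Essie's probabilistic scoring: the tier values, the naive whitespace tokenization, and the relative order of "words in title" versus "phrase in abstract" are assumptions chosen only to make the stated preferences concrete.

```python
from typing import Dict, List, Tuple


def match_tier(phrase: str, title: str, abstract: str) -> int:
    """Return a hypothetical tier: higher means a stronger match."""
    phrase_l, title_l, abstract_l = phrase.lower(), title.lower(), abstract.lower()
    words = phrase_l.split()
    if phrase_l in title_l:
        return 4  # whole phrase in the title
    if phrase_l in abstract_l:
        return 3  # whole phrase in the abstract
    if all(w in title_l.split() for w in words):
        return 2  # individual words scattered in the title (assumed ordering)
    if all(w in abstract_l.split() for w in words):
        return 1  # individual words scattered in the abstract
    return 0


def rank_citations(phrase: str, docs: Dict[str, Tuple[str, str]]) -> List[str]:
    """Sort citation IDs by descending tier, breaking ties by ID."""
    return sorted(docs, key=lambda d: (-match_tier(phrase, *docs[d]), d))


# Toy titles and abstracts, invented for the example.
docs = {
    "c1": ("Oral rehydration therapy in children", "A trial of fluids."),
    "c2": ("Managing gastroenteritis",
           "We compared oral rehydration therapy with intravenous fluids."),
    "c3": ("Therapy choices: oral versus intravenous rehydration",
           "Outcomes were similar."),
    "c4": ("Ondansetron for vomiting", "No fluids were studied."),
}
ranked = rank_citations("oral rehydration therapy", docs)
print(ranked)
```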
Evaluation Strategy
Our evaluation is based on techniques developed over the past 15 years in the framework of TREC—a yearly large-scale evaluation of information retrieval and question answering systems.16 Traditionally, systems are evaluated using test collections consisting of: (1) a corpus of documents (for example, MEDLINE citations), (2) a set of queries or questions (called topics in TREC), and (3) relevance judgments—human assessment of the relevance of each document in the collection to a given topic. Ideally, each document in the corpus would be judged with respect to each topic. Due to the size of modern document collections, such evaluation is not feasible even in the framework of TREC, which leads to an alternative strategy of first selecting a subset of documents to be judged, then assessing the relevance of these documents to the topics, and finally using these relevance judgments to assess the relative performance of the systems. A practical solution to the question of selection of an appropriate small subset of documents is the TREC pooling strategy. Documents to be judged for each topic are contributed to the pool by each information retrieval system participating in the evaluation. In TREC, the top 75 to 100 documents returned by each system are combined into a set given to the judge. The judged documents are subsequently used to evaluate the relative performance of the contributing systems.
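The pooling step can be sketched in a few lines: the pool for a topic is the union of the top-ranked documents contributed by each system. The function name `build_pool`, the system labels, and the document IDs below are hypothetical.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top `depth` documents from each system's ranked list."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Toy ranked lists for one topic; IDs are made up for illustration.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
    "system_c": ["d7", "d2", "d8", "d9"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Because documents retrieved by more than one system enter the pool only once, the judge sees at most `depth` times the number of systems documents per topic, and usually fewer.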
| Methods |
|---|
Creating a Test Collection
To evaluate the performance of the three systems under scrutiny, we constructed a test collection consisting of 15 clinical questions along with relevant MEDLINE citations and judgments of their relevance to the questions. The top 10 documents returned by each system were added to the pool of documents evaluated by the first author, who did not participate in the development of any of the experimental systems.
Question Selection
For the questions, the first author, a practicing family physician, selected 15 queries (Table 1) from the Family Practice Information Network (FPIN) clinical queries collection, which is published monthly in the Journal of Family Practice and American Family Physician and contains queries typically generated in the daily practice of general medicine.46 Even if the query did not adhere to the syntactic form of a question (for example, specific queries 4 and 5), the original queries were not modified. The queries selected pertain to therapeutic or preventive interventions for clinical problems and can be regarded as instances of generic clinical questions.47 We identified two types of clinicians' information needs: general (an overview of a topic) and specific (an exact answer to a focused question). When inspecting the FPIN clinical queries collection, we determined that some questions are intermediate; they do not call for an overview but are not focused enough for an exact answer. The nature of the questions in the FPIN collection warrants exploration of all three question types. Five queries were selected as general in that the only element of a clinical scenario in the question was the problem. Five were intermediate, with clinical scenario elements of population group, intervention, or outcome included with the request for therapy or prevention of a problem. Finally, five were specific or complex, including at least two elements of a clinical scenario selected from population group, intervention, or outcome (in addition to the problem). Our focus on therapy and prevention questions and the intent to evaluate the systems' performance for all levels of difficulty precluded random selection of the questions. Instead, the first author selected five questions of interest to his practice from each level.
Retrieving and Reranking MEDLINE Citations
Each FPIN question was used to search MEDLINE with PubMed and Essie, limited to no later than the date of the FPIN answers for each question. Essie returns relevance-ranked output directly. The chronologically ordered citations from PubMed were subsequently reranked for each query using SemRep Summarization and the CQA-1.0 system.
For the strategies based on SemRep Summarization and CQA-1.0, an initial PubMed search strategy was to use the narrow therapy clinical queries filter and the clinical terms identified in a given question. For example, the clinical term in the FPIN question, "What is the best approach to treatment of osteoporosis?" is osteoporosis. The addition of the PubMed clinical queries filter to this term yields the following query: (osteoporosis[MeSH Terms] OR osteoporosis[Text Word]) AND (randomized controlled trial[Publication Type] OR (randomized[Title/Abstract] AND controlled[Title/Abstract] AND trial[Title/Abstract])). If the initial search yielded no results, the search was repeated with the clinical queries filter replaced with the following limits: citations with abstracts, restricted to human studies written in English. Two of the intermediate questions and all specific questions required this substitution. A total of 1,305 documents for the first set, 925 for the second set, and 959 for the third set were retrieved from MEDLINE using PubMed. Unranked PubMed results were used as a baseline against which experimental results were compared.
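The query construction described above can be reproduced with a short sketch. The filter text and the field tags are taken from the example in this section; the helper name `build_narrow_therapy_query` is hypothetical, and the sketch only builds the query string without contacting PubMed.

```python
# Filter text quoted from the "narrow therapy" clinical query described in the article.
NARROW_THERAPY_FILTER = (
    "(randomized controlled trial[Publication Type] OR "
    "(randomized[Title/Abstract] AND controlled[Title/Abstract] "
    "AND trial[Title/Abstract]))"
)


def build_narrow_therapy_query(term: str) -> str:
    """Combine one clinical term with the narrow therapy filter (hypothetical helper)."""
    # The term is searched both as a MeSH heading and as a text word.
    term_clause = f"({term}[MeSH Terms] OR {term}[Text Word])"
    return f"{term_clause} AND {NARROW_THERAPY_FILTER}"


query = build_narrow_therapy_query("osteoporosis")
print(query)
```

For questions with several clinical terms, the article does not state how the term clauses were combined, so that case is left out of the sketch.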
In exploiting SemRep Summarization for reranking retrieved citations, predications were extracted from the MEDLINE citations retrieved for each FPIN query. After summarizing the predications, the citations from which the predications were extracted were promoted as being more highly relevant to the query based on how closely and how frequently arguments in those predications matched Metathesaurus concepts extracted from the query.
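A rough illustration of the frequency part of this idea follows. It counts, for each citation, how many predication arguments exactly match a concept from the query, and orders citations by that count. The tuple layout, the exact-match test, and the citation IDs are assumptions; the sketch does not model how closely an argument matches a concept, and it is not the SemRep Summarization algorithm.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# (citation_id, subject, predicate, object): an assumed, simplified layout.
Predication = Tuple[str, str, str, str]


def rerank_by_argument_matches(predications: Iterable[Predication],
                               query_concepts: Set[str]) -> List[str]:
    """Order citations by how many predication arguments match query concepts."""
    counts: Counter = Counter()
    for citation_id, subject, _predicate, obj in predications:
        # Each matching argument adds one; citations with no match still get an entry.
        counts[citation_id] += int(subject in query_concepts) + int(obj in query_concepts)
    return sorted(counts, key=lambda c: (-counts[c], c))


# Toy predications, invented for the example.
predications = [
    ("cit1", "Alendronate", "TREATS", "Osteoporosis"),
    ("cit2", "Calcium", "TREATS", "Osteoporosis"),
    ("cit2", "Osteoporosis", "OCCURS_IN", "Women"),
    ("cit3", "Aspirin", "TREATS", "Headache"),
]
ranked = rerank_by_argument_matches(predications, {"Osteoporosis"})
print(ranked)
```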
The CQA-1.0 reranking algorithm promotes citations in which the automatically identified problems and interventions match those in the question, patient-oriented outcomes are identified with strong confidence, the task matches that of the question, the study population is large, and the strength of evidence is high.50
In searching with Essie, a strategy similar to PubMed clinical queries, using EBM-related and therapy-related terms (such as therapeutic use, clinical trial, etc.) was applied. Unlike the clinical queries filters, this strategy promotes EBM-oriented citations without reducing the number of retrieved citations. Essie core document ranking promotes citations that contain query phrases in the fields observed to be most informative, for example in the title.51 To take advantage of UMLS synonymy, UMLS-based expansion of concepts was used in the search. Essie returned 2,500 citations in the first set, 896 in the second, and 673 in the third.
Fusion of Results
In addition to evaluation of individual systems, the ranked results generated by each system were merged using fusion. Fusion was based on the rank order assigned to a document by each system, rather than on scores. This is because the systems either do not score documents or generate scores for ranking purposes only (that is, scores represent neither the similarity of a citation and the query nor the system's confidence in the relevance of a citation to the query). This approach relies on document overlap, which for SemRep, CQA-1.0, and the baseline PubMed retrieval constitutes the whole result set. The results were merged using the fusion approach proposed by Fox et al.52 The contribution of each system to the final ranking was weighted equally.
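Rank-based fusion with equal weights can be sketched as follows. The article does not give the exact formula, so the mapping from rank to score used here (top document scores 1.0, last document scores 1/n, absent documents contribute nothing) and the summation across systems are assumptions, as are the function name and toy document IDs.

```python
from collections import defaultdict
from typing import Dict, List


def fuse_by_rank(runs: Dict[str, List[str]]) -> List[str]:
    """Sum rank-derived scores across systems, each system weighted equally."""
    fused: Dict[str, float] = defaultdict(float)
    for ranked_docs in runs.values():
        n = len(ranked_docs)
        for position, doc_id in enumerate(ranked_docs):
            # Assumed mapping: position 0 scores 1.0, the last position scores 1/n.
            fused[doc_id] += (n - position) / n
    # Sort by fused score (descending), then by ID so ties are deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy ranked lists; IDs are made up for illustration.
runs = {
    "semrep": ["d1", "d2", "d3"],
    "cqa": ["d2", "d1", "d3"],
    "essie": ["d2", "d3", "d4"],
}
fused_ranking = fuse_by_rank(runs)
print(fused_ranking)
```

Documents ranked highly by several systems rise to the top, which is why the approach depends on overlap between the result sets.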
Evaluation
Five sets of output were evaluated as part of this study: the ranked output from each of the systems under consideration, the fused output from all three, and unprocessed PubMed output (baseline). The trec_eval-8.0 package53 was used to evaluate the results. The systems were evaluated under two conditions: strict, considering only citations graded A in the three-point scale evaluation to be relevant, and soft, considering both three-point scale A-grade and B-grade citations relevant to the question. Because the relative ranking of the systems with respect to the baseline is identical under both conditions, we present and discuss the results of the soft evaluation. The differences in retrieval results between systems were compared using a Wilcoxon signed ranks test for all metrics. p values <0.05 were considered significant. The Wilcoxon signed ranks test is used when the values in the two results being compared are naturally paired (for example, the same set of documents is ranked by two systems) and the relative magnitude as well as the direction of the differences is considered.54
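The strict and soft conditions amount to two ways of turning three-point grades into binary relevance judgments. A minimal sketch, with a hypothetical function name and made-up grades:

```python
from typing import Dict, Set

# Grades counted as relevant under each condition, as described in the text.
ACCEPTED_GRADES = {"strict": {"A"}, "soft": {"A", "B"}}


def relevant_set(grades: Dict[str, str], condition: str) -> Set[str]:
    """Return the citations counted as relevant under 'strict' or 'soft'."""
    accepted = ACCEPTED_GRADES[condition]
    return {doc for doc, grade in grades.items() if grade in accepted}


# Toy judgments for one question; IDs and grades are invented.
grades = {"d1": "A", "d2": "B", "d3": "C", "d4": "A"}
strict_relevant = relevant_set(grades, "strict")
soft_relevant = relevant_set(grades, "soft")
print(sorted(strict_relevant), sorted(soft_relevant))
```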
Two classes of evaluation metrics were used to account for two different information needs experienced by clinicians, one general and the other focused. The first type of information need is reflected in our general questions and corresponds to a situation in which a clinician might need an overview of a topic. In this scenario, a clinician would be interested in both precision (the percentage of the retrieved citations that are relevant) and recall (the percentage of the relevant documents that are retrieved). Evaluation metrics that reflect this need are mean average precision (MAP), binary preference (Bpref), and R-precision (R-prec).
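Assuming the metrics meant here are MAP, Bpref, and R-prec (the three that the Results section associates with general questions), their per-question forms can be sketched as below. These are textbook-style definitions written for illustration; the trec_eval implementation used in the study may differ in details, particularly for Bpref, and the function names and toy data are hypothetical. MAP is the mean of `average_precision` over all questions.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def bpref(ranking: List[str], relevant: Set[str], nonrelevant: Set[str]) -> float:
    """Binary preference: penalize relevant documents ranked below judged nonrelevant ones.

    Unjudged documents are ignored. The min(R, N) bound is one common variant.
    """
    r, n = len(relevant), len(nonrelevant)
    if r == 0:
        return 0.0
    bound = min(r, n)
    nonrel_seen = 0
    total = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            total += 1.0 if bound == 0 else 1.0 - min(nonrel_seen, bound) / bound
    return total / r


# Toy data: d5 is unjudged and d6 is relevant but never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
nonrelevant = {"d2", "d4"}
ap = average_precision(ranking, relevant)
rp = r_precision(ranking, relevant)
bp = bpref(ranking, relevant, nonrelevant)
print(ap, rp, bp)
```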
The second type of information need experienced by clinicians corresponds to a situation in which an exact answer to a well-focused question is required (reflected in our specific questions). Because clinicians are willing to spend no more than 4 to 5 minutes evaluating search results,56 it is important that the answer to the question be found in the first few citations retrieved. Metrics that evaluate how soon a user will see the answer and how many relevant citations are at the top of the retrieval results list are mean reciprocal rank (MRR), precision at five retrieved documents (P@5), and precision at 10 retrieved documents (P@10).
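Assuming the metrics meant here are MRR, P@5, and P@10 (the measures reported for specific questions in the Results section), a minimal sketch of their computation follows. The function names and toy rankings are hypothetical; MRR is the mean of the per-question reciprocal ranks.

```python
from typing import List, Set


def reciprocal_rank(ranking: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Two toy questions with invented rankings and judgments.
q1_ranking, q1_relevant = ["d9", "d1", "d3", "d4", "d5"], {"d1", "d3", "d5"}
q2_ranking, q2_relevant = ["d2", "d7", "d8", "d6", "d0"], {"d2", "d6"}

rr_values = [
    reciprocal_rank(q1_ranking, q1_relevant),
    reciprocal_rank(q2_ranking, q2_relevant),
]
mrr = sum(rr_values) / len(rr_values)
p5_q1 = precision_at_k(q1_ranking, q1_relevant, 5)
print(mrr, p5_q1)
```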
| Results |
|---|
In Tables 3 through 5, results are presented by question complexity and by how well the evaluated systems respond to general versus focused information needs. For general questions (Table 3), no single trend is discernible. As noted, MAP, Bpref, and R-prec are likely to be most valuable for general questions, which express a general information need. Essie and CQA-1.0 significantly outperformed PubMed according to MAP, but not Bpref. Fusion does well for Bpref.
The baseline is higher for specific questions; however, the experimental approaches apparently benefited from additional details provided in the complex questions (Table 5). The CQA-1.0 system, specifically designed to handle questions in the EBM-recommended form, benefited most among individual systems, scoring particularly well on MRR, P@5, and P@10. Fusion also does well on these measures in response to a focused information need. CQA-1.0 also did well according to MAP for the complex questions (MAP = 0.6286). However, the difference between CQA-1.0 and Essie is not statistically significant. Fusion of the results for the three systems (MAP = 0.7839) is particularly successful for this class of questions.
In terms of finding answers to specific questions, all experimental methods were successful in promoting relevant documents to the higher ranks, achieving MRR from 0.86 to 0.96 (Figure 1), 79% to 85% precision at five retrieved documents, and 69% to 87% precision at 10 documents, meaning that three to four of the first five documents retrieved by the evaluated systems (and six to eight of the first 10) provide information that potentially or definitely leads to answers to a clinical question.
| Discussion |
|---|
External Knowledge
The three systems evaluated rely on UMLS domain knowledge to manipulate semantic content in MEDLINE citations. Such content includes: (1) the number of subjects, (2) comparison of multiple therapies, (3) placebo control, and (4) comparative cost of interventions. Previous research49 has identified nonsemantic characteristics of articles as being important in identifying key articles. These include methodological rigor, authors and their institutional affiliations, document types, and population studied. Our research suggests that such cues, which are used in Essie and CQA-1.0 but not in SemRep, contribute to performance. Judging by the reciprocal rank of the top retrieved document, and precision at five and 10 documents, semantic reranking is necessary when a clinician is interested in (or has time for) only the first few citations. However, using Essie might preclude the need for reranking for general and intermediate questions.
Yet another type of key element identified in this study requires external knowledge in addition to semantic processing. These characteristics include: (1) availability of a therapy for the local practitioner community (e.g., approval by the U.S. Food and Drug Administration or availability in a community environment) and (2) applicability of the study results more generally, for example, extending the results of a clinical trial conducted in a subpopulation to the population of interest.
Notes taken during evaluation identified additional nonsemantic criteria used to assess usefulness of the citation to a clinician. The rater (who considers himself a typical primary care physician1) evaluated the utility of a citation and used nontopical cues present in the citation, as well as "world knowledge." For example, for the query, "what is the most effective treatment for ADHD in children?" a citation entitled "Attention-deficit hyperactivity disorder in children and youth: a quantitative systematic review of the efficacy of different management strategies" was judged as A grade (leads to an answer; definitely useful in clinical decision making for the question) with the assumption that a systematic review was exhaustive of the published literature for efficacy. In contrast, for the query, "What is the best antiviral agent for influenza infection?" a citation "Efficacy and safety of oseltamivir in treatment of acute influenza: a randomized control trial" was judged as A grade even though the comparisons were to placebo only. The rater believes that a citation with comparisons to various treatment methods is unlikely to appear in the primary literature.
Citations that were judged to be B grade (not sufficient to answer the query but helpful in medical decision making) also need to be qualified by an understanding that the rater assumed the knowledge level of the typical primary care physician. Thus for the query, "What is the best treatment for gastroesophageal reflux and vomiting in infants?" a citation not related to therapy entitled "The infant with chronic vomiting: the value of the upper GI series" was retrieved by the probabilistic search method. It was rated B because the rater thought that most primary care physicians might not know that "in a study of 344 otherwise healthy infants referred to pediatric gastroenterologists for chronic vomiting findings other than gastroesophageal reflux were seen in only 2 patients ... (0.6%)" and that knowledge might influence a best-therapy decision.
Citations that were judged to be C grade (not helpful in answering the clinical query) also involved some assumptions regarding utility to the decision maker. For the query, "In children with acute vomiting and diarrhea (gastroenteritis), does treatment with intravenous fluids improve recovery compared with oral rehydration therapy (ORT)?" a citation entitled "Ondansetron decreases vomiting associated with acute gastroenteritis: a randomized control trial" was rated C because the study population reported included only children who had been assigned to intravenous fluid therapy. The information may have been new and helpful in treatment of the disorder, but was not helpful in the decision called for in the query.
| Limitations |
|---|
Implications and Future Work
This study presents some evidence showing that the burden of overcoming several of the major obstacles1 in practicing evidence-based medicine could be alleviated by integrating into information retrieval systems the domain knowledge in the UMLS and the EBM principles. Unless connected to an electronic patient record, automatic methods cannot be used for the initial step of formulating an information need nor (under any circumstances) for the final steps of appraising the evidence and making a clinical decision. However, automatic methods could address the challenging task of determining an optimal search strategy. A system might first provide the clinician with a pick list for selecting question type, for example, an overview of best available treatments for a given condition. The system could then use a predetermined optimal search strategy for the question type chosen.
Our study suggests several areas for further exploration. We are currently developing question templates for submitting therapy questions to our systems. We plan to expand these to accommodate other types of clinical questions, including those involving diagnosis, prognosis, and cost effectiveness. Uncertainty about finding all relevant evidence could be mitigated by using optimal recall-oriented strategies. Subsequently, the difficulty of synthesizing and appraising all evidence found could be addressed by presenting aggregated search results to the clinician (using SemRep summaries and patient-oriented outcomes extracted by CQA-1.0, for example).
| Conclusion |
|---|
| Footnotes |
|---|
1 See American Academy of Family Physicians Policy and Advocacy58 for one definition of a typical family doctor.
| References |
|---|