Viewpoint Paper |
Stanford Center for Biomedical Informatics, Department of Medicine and Department of Pediatrics, Stanford University School of Medicine, Stanford, CA; Lucile Packard Children's Hospital, Palo Alto, CA
* Correspondence: Atul Butte, MD, PhD, Stanford Center for Biomedical Informatics, 251 Campus Drive, Room X-215 MS-5479, Stanford, CA 94305-5479 (Email: abutte{at}stanford.edu).
Received for publication: 04/10/08; accepted for publication: 08/15/08.
| Abstract |
|---|
| Introduction |
|---|
Achieving the impact of translational medicine requires expanding the role and scope of bioinformatics just as much as those of clinical informatics. In 1999, the Advisory Committee to the Director, National Institutes of Health (NIH) Working Group on Biomedical Computing, co-chaired by David Botstein and Larry Smarr, released the Biomedical Information Science and Technology Initiative (BISTI) report, which recommended that NIH be responsive to the growth in biological data and apply funding resources to accelerate the development and application of computational tools to science. While the BISTI report certainly led to increased funding for bioinformatics research, in retrospect, the subsequent initiatives often led to the development of novel tools, perhaps at the expense of identifying novel questions. Perhaps there was no way for the BISTI authors to predict that a generation of scientists, asking medical questions at a molecular level solely using computational resources, could appear so quickly.
The circumstances are now such that it is time to recognize this new area of inquiry called Translational Bioinformatics. The American Medical Informatics Association (AMIA) recently added translational bioinformatics as one of its three major domains of informatics. The AMIA has defined translational bioinformatics as:
"... the development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data into proactive, predictive, preventative, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders, including biomedical scientists, clinicians, and patients."3
Translational Bioinformatics involves the development and use of computational methods that can reason over the enormous amounts of life science data being collected and stored, for the purpose of creating new tools for medicine. While bioinformatics methodologies have been used to enable biological discoveries for decades, here the end product has to be translational, that is, applicable to human health and disease.
Why should investigators in computer science, biomedical informatics, and biomedical research in general be interested in Translational Bioinformatics today? I will list eight reasons why now is an excellent time to be studying Translational Bioinformatics. Five of these reasons are intrinsic to this scientific discipline, while three are extrinsic, regarding the practice of this discipline in today's scientific, funding, and political context. I will end with the significant challenge of building a community of future investigators in Translational Bioinformatics.
| Availability of Molecular Tools |
|---|
Beyond being large, these technologies are nearly comprehensive. It is one thing to have a technology that measures ten genes, or 10,000 genes, but once you get close to 40,000 genes, there are not many more genes left to measure. This may be only an illusion of stability, however, as demand grows for finer levels of resolution. For instance, newer gene expression microarrays have evolved to measure exons, the individual components of RNA molecules, instead of entire transcripts.8 Future technologies may enable faster measurements to be made, with less bias towards the known catalog, or with less measurement noise.
Another important point is the low cost of these modalities. Gene expression microarrays were a cost-prohibitive technology when they were developed 11 years ago, but now they are essentially commodity items. Microarrays that measure activity of every gene in the genome now cost only about $300 per sample (plus labor and supplies) for academics.
Other research modalities have also become inexpensive. Between any two individuals, there are an estimated 10 million differences, or single nucleotide polymorphisms (SNPs), in DNA.9 The measurement of 1.8 million of these differences also costs about $300 per sample in academia. Last year's analytic model had half a million SNPs for the same price, and the model from the previous year had about 10,000 SNPs for the same price, so the price per SNP measured has fallen geometrically.
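The price trend can be made concrete with a little arithmetic. The following is a minimal sketch using only the figures quoted above (about $300 per sample, and 10,000, 500,000, and 1.8 million SNPs per array); the variable names and generation labels are illustrative assumptions, and the $300 figure is treated as constant across generations, as the text implies.

```python
# Approximate academic price per sample quoted in the text, assumed constant
# across array generations for the purpose of this illustration.
PRICE_PER_SAMPLE_USD = 300.0

# SNPs measured per sample, oldest generation first (values from the text;
# the labels are illustrative).
snps_per_array = {
    "two years earlier": 10_000,
    "previous year": 500_000,
    "current": 1_800_000,
}


def cost_per_snp(price_usd: float, n_snps: int) -> float:
    """Return the cost in US dollars of measuring a single SNP."""
    return price_usd / n_snps


costs = {
    label: cost_per_snp(PRICE_PER_SAMPLE_USD, n)
    for label, n in snps_per_array.items()
}

# Fold reduction in per-SNP cost between consecutive generations.
labels = list(costs)
fold_reductions = [costs[a] / costs[b] for a, b in zip(labels, labels[1:])]

for label, cost in costs.items():
    print(f"{label}: ${cost:.6f} per SNP")
print("fold reduction per generation:", [round(f, 1) for f in fold_reductions])
```

Under these assumptions the per-SNP cost falls from about three cents to well under a tenth of a cent, a roughly 180-fold drop over two product generations.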
| Public Availability of Molecular Measurement Data |
|---|
The premier example of an internationally available data resource is GenBank, initially created in the early 1980s by Walter Goad.10 Because so many investigators at the dawn of the sequencing era were generating DNA sequences, there was a need for a repository to centrally manage and use these sequences. Funding from the NIH for GenBank started in 1982, and in the subsequent quarter century, GenBank has grown to include 82 billion nucleotides in 78 million sequences.11 At the time of this writing, hundreds of organisms have been completely sequenced including, of course, human and mouse, and a total of 270,000 species have had at least some sequence measured.12 In this way, GenBank has both breadth and depth.
The equivalent of GenBank for gene expression microarrays is known as the Gene Expression Omnibus (GEO).13 The GEO is also maintained by the National Center for Biotechnology Information at the National Library of Medicine. At the time of this writing, GEO has over 183,000 samples from over 7,200 experiments, an impressive growth in seven years. The number of samples either doubles or triples each year.
This availability of massive data sets is not just an American initiative. The European Bioinformatics Institute (EBI) has a similar web-based database called ArrayExpress14 with over 100,000 samples from over 3,000 experiments. Altogether, translational bioinformaticians can likely get their hands on more than a quarter million microarray samples today. This is more data than can be generated by any one biologist, and the results from analyzing these larger collections of samples are potentially enormous in impact. As of 2007, diseases contributing to nearly a third of human disease-related mortality in the United States have been studied by microarrays.15
This availability is not limited to gene expression data. The EBI also has a web-based database called PRIDE, which holds proteomics data.16 The PRIDE database holds 3,200 independent samples with 2.6 million mass spectra freely available for download. Data from genome-wide association studies have their own repository, the NCBI Database of Genotype and Phenotype (dbGaP).17 As of this writing, fourteen genetic studies, with over 40,000 human samples, are available for download from this one-year-old database.
| Culture of Sharing Molecular Data and Tools |
|---|
Funding agencies such as the Wellcome Trust and NIH also increasingly require the public availability of scientific data.18 Grant proposals to NIH asking for over $500,000 per year need to include text describing how the data will be shared.19 Though this requirement is new, policies on sharing data from major projects, such as the Human Genome Project, go back more than a decade.20
Beyond these requirements, however, there is a culture of open sharing in molecular biology and bioinformatics that continues to grow: sharing of tools, data, findings, and publications. Important tools for bioinformatics, such as Significance Analysis of Microarrays (SAM),21 TM4 MultiExperiment Viewer,22 GenePattern,23 GenMAPP,24 and R and Bioconductor,25,26 are downloadable for free and, in many cases, have source code available.
In many cases, biomedical research communities have come together and have realized that sharing takes more than just uploading files to a common website. Through contention and agreement, these communities are starting to standardize terminology, phenotypes, and gene names.27 Challenges still remain in cataloging, calibrating, and normalizing data across experimenters, across measurement modalities, and across biological models; improper attention to these could lead to false positives and negatives. These biomedical research communities could benefit from learning how some of these challenges were addressed by the clinical informatics community.
Where those standards have not yet been reached, there is at least the understanding in the appropriate communities that standards must be reached. Increasingly, there is inter-community sharing, where one community learns from the standardization efforts of another. Examples of inter-community sharing include the design of the Minimum Information About a Proteomics Experiment (MIAPE)28 using the Minimum Information About a Microarray Experiment (MIAME),29 and partnerships of FlyBase and ZFIN with the National Center for Biomedical Ontology to standardize phenotype descriptors.30
Curiously, this culture of sharing has not extended well to clinical research or clinical informatics. Clinical informatics tools, including vocabularies and text-parsing tools, are not always shared, or require signed licensing agreements. Clinical data, even de-identified subsets, are not as available on the Internet as molecular measurements. This could be due to fears of release of personal medical information, disclosure of evidence of culpability, or worries that one might miss a discovery in one's own patient cohort.31
| Clinicians are Expected to Interpret Bioinformatics Methodologies |
|---|
| Question Asking in Translational Bioinformatics |
|---|
In addition, bioinformatics must play a key role in the storage and retrieval of high-throughput data. A bioinformatician could work with a biologist to set up a web site and a standardized database for experimental measurements, facilitate the sharing of the measurements, and relate them to clinical outcomes.
Because of the public availability of raw high-throughput molecular data, roles for translational bioinformaticians can now extend beyond just providing a service. Translational bioinformaticians, given the data resources outlined above, have more samples available regarding a given disease, e.g., breast cancer, than any individual biologist studying breast cancer might create alone. A translational bioinformatician can go to the NCBI GEO and download over 9,300 microarray studies on breast cancer (over 1,800 of them entered in 2007).
The availability of substantial public data enables bioinformaticians' roles to change. Instead of just facilitating the questions of biologists, the bioinformatician, adequately prepared in both clinical science and bioinformatics, can ask new and interesting questions that could never have been asked before. For example, Mootha et al. integrated four publicly available expression data sets with genetic linkage data and proteins identified from mitochondria to find the gene mutation associated with Leigh syndrome, French-Canadian type.33 English collected 49 publicly available high-throughput experiments of multiple types, such as genetic scans, gene expression microarrays, proteomics, and RNA interference, all related to the study of obesity. She found that an integrative model across the 49 experiments could outperform each of the independent experiments, with statistical significance, in rediscovering known obesity-associated genes and predicting novel ones.34
These examples demonstrate an approach to integrating public and private data sets to address an important question in medicine. There is a role for the translational bioinformatician as question-asker, not just as infrastructure-builder or assistant to a biologist.
| Calls for Translational Medicine |
|---|
| Increasing Research Funding for Translational Bioinformatics |
|---|
Most importantly, Dr. Zerhouni wrote in The New England Journal of Medicine in 2005:
It is the responsibility of those of us involved in today's biomedical research enterprise to translate the remarkable scientific innovations we are witnessing into health gains for the nation ... At no other time has the need for a robust, bidirectional information flow between basic and translational scientists been so necessary.41
There are impressive informatics-related terms in that quote for a Director of NIH, such as "robust, bidirectional information flow." Coincident with this quote and publication, the push to reinvent clinical research reached a new peak with the release of the Request for Applications (RFA) for the NIH Roadmap Institutional Clinical and Translational Science Awards (CTSA).43 These awards required that medical schools, research hospitals, and related institutions commit to reinventing how they perform and teach clinical and translational research. To enable this transformation, NIH planned to fund approximately 60 institutions at about $30 million each. As might be expected, the RFA for a $30 million grant spans over 50 printed pages. Unexpectedly, however, the word "informatics" appeared 38 times in this RFA.43 An institution cannot apply for a CTSA grant without organizing a clear plan for informatics, and must include tools and infrastructure to enable Translational Medicine. Each institution is required to designate a local Biomedical Informatics Director, and each of these directors serves on a national committee to set standards for clinical and translational research. This was clear recognition by NIH that the problems of Translational Medicine will not be solved without the help of informatics, and substantial money backed up this statement. Beyond the CTSA, NIH has continued to support Translational Bioinformatics through its funding of other large programs, including seven National Centers for Biomedical Computing (NCBC) and the Cancer Bioinformatics Grid (caBIG).
The CTSA effort provides an example of the depth of funding available when NIH focuses on specific major problems. There is also breadth. Figure 1 shows the yearly count of how often the word "informatics" appears in Request for Applications (RFAs) and Program Announcements (PAs) issued in the NIH Guide, the weekly publication issued by NIH on new funding mechanisms and program announcements.
While the 2007 count of 136 was lower than that of 2006, we have now reached a new watershed in that a quarter of the RFAs and PAs mentioned the term "informatics." This crude method of counting admittedly does not distinguish clinical informatics from bioinformatics, does not consider the dollars available for each RFA, ignores the duration, availability, and expiration of RFAs, and may even falsely count RFAs in which informatics is explicitly excluded. Yet this still remains a simple example of how broadly informatics is now considered across many funding mechanisms involving all institutes of NIH.
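A keyword count of this kind might be sketched as follows. This is only an illustration of the general approach, not the procedure actually used to produce the figures above: the records are hypothetical toy data rather than real NIH Guide entries, and the function and variable names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy records of (year, announcement text).
# These are NOT real NIH Guide entries; they only illustrate the method.
announcements = [
    (2006, "RFA: Informatics tools for clinical research networks"),
    (2006, "PA: Mechanisms of cellular aging"),
    (2007, "RFA: Bioinformatics resources for genome-wide studies"),
    (2007, "PA: Training in biomedical informatics"),
    (2007, "PA: Basic research in vision science"),
]

# Case-insensitive substring match, so "bioinformatics" also counts.
PATTERN = re.compile(r"informatics", re.IGNORECASE)


def count_mentions_by_year(records):
    """Map each year to (announcements mentioning the term, all announcements)."""
    totals = Counter()
    mentions = Counter()
    for year, text in records:
        totals[year] += 1
        if PATTERN.search(text):
            mentions[year] += 1
    return {year: (mentions[year], totals[year]) for year in sorted(totals)}


summary = count_mentions_by_year(announcements)
for year, (hits, total) in summary.items():
    print(f"{year}: {hits} of {total} announcements mention 'informatics' ({hits / total:.0%})")
```

The sketch shares the limitations noted above: a plain substring match cannot tell clinical informatics from bioinformatics, and it would also count an announcement that mentions informatics only to exclude it.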
| Few Investigators in Translational Bioinformatics |
|---|
The future development of practitioners of Translational Bioinformatics will require that individuals enter this discipline from even more diverse backgrounds. For instance, it is still rare for a clinician-scientist who has completed training in medicine, pediatrics, or surgery to undergo joint training in a sub-specialty as well as bioinformatics. A quantitatively-thinking cardiologist-scientist-in-training could be trained in both human physiological measurements and methods for multi-scale modeling of the heart. A quantitatively-thinking oncology research-nurse-in-training could be trained in both making molecular measurements and machine learning methods to find genes that predict outcome. Success in these joint training programs will require vision, as well as bioinformatics training program directors who reach out to and work with traditional subspecialty fellowship directors.
| Conclusion |
|---|
Computer scientists, even at the undergraduate level, should learn that the algorithms and methods they develop in machine learning, visualization, network modeling, and knowledge representation will find a receptive audience in biomedical research. Quantitatively-thinking undergraduate and graduate students in biology and chemistry should be exposed to, and excited by, increasing digital sources of data. There is no single educational solution that spans these constituencies, but the pieces have to include web-based instruction, traditional lecture-based courses, graduate degree programs, research fellowships, and continuing medical education courses. Some of these educational opportunities might be most efficiently delivered when centralized within departmental structures, but clearly Translational Bioinformatics will be practiced outside existing department walls.
Despite these challenges in developing a committed set of investigators in Translational Bioinformatics, this is clearly a unique and exciting time to be part of the growth phase of this new scientific discipline.
| Acknowledgments |
|---|
| References |
|---|