The development of research infrastructure is the foundation of the Biomedical Informatics program at Georgetown and underlies the mission of integrating and making sense of the enormous volumes of data being generated in both the lab and the clinic. ICBI scientists and software engineers work together to develop technologies that integrate biomedical data with state-of-the-art tools, using a multidisciplinary and collaborative approach that drives translational research and clinical care.
Our primary areas of technology development involve:
Practically every major sector of the economy and of the scientific enterprise is kindling (or rekindling) the idea of Big Data as key to solving important problems within and across disciplines. The Wall Street Journal has highlighted Big Data as one of three game-changing or ‘black swan’ technologies that will transform the future. It is fair to say that a big data rush is underway, and nowhere is the promise and potential more real than in the rapid rise of inexpensive whole-genome sequencing using next-generation sequencing (NGS) instruments.
The world’s current sequencing capacity is estimated to be 13 quadrillion DNA bases a year. The cost of producing an accurate human whole-genome sequence is dropping rapidly and is expected to fall below $100 per genome within the next decade; the capacity to sequence the genomes of a billion people will have been realized within the next twenty years. These will be truly Big Data, requiring 3 or more exabytes of storage.
A number of public and private projects are already contributing to this biological data deluge. The NIH-funded 1000 Genomes Project deposited 200 terabytes of raw sequencing data into the GenBank archive during the project’s first 6 months of operation, twice as much as had been deposited into all of GenBank in the preceding 30 years.
Our team is working on a new generation of data management and mining systems to support diverse needs, from personalized clinical medicine to translational and population research involving molecular and genetic outcomes. This project spans the disciplines of computer systems, analytics, and genomics. Our systems research, in collaboration with the departments of computer science at Virginia Tech and Georgetown University, will help harness petabytes of genomic information in a manner cognizant of the multimodal, multilevel nature of the datasets, and will extend the query capabilities of existing NoSQL models.
Our analytics research extends the MapReduce workflow paradigm to support more complex workflows on genomic and other biological data, while attending to the need to save some intermediate results and discard others. More broadly, this project aims to be the first implementation of petabyte-scale compositional data mining to extract actionable, ranked knowledge from large-scale genome studies.
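To make the idea of a multi-stage MapReduce workflow with selectively retained intermediates concrete, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not the project's actual system: the names `Stage`, `map_reduce`, and `run_workflow`, the `keep` flag, and the toy variant calls are all hypothetical, and a real deployment would run on a distributed framework rather than in a single process.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, NamedTuple, Tuple


class Stage(NamedTuple):
    """One map/reduce step; `keep` says whether to retain its output (hypothetical design)."""
    name: str
    mapper: Callable[[Any], Iterable[Tuple[Any, Any]]]
    reducer: Callable[[Any, List[Any]], Any]
    keep: bool


def map_reduce(records, mapper, reducer) -> Dict[Any, Any]:
    """Run a single in-memory map -> group-by-key -> reduce pass."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def run_workflow(records, stages):
    """Chain stages, feeding each stage's output to the next.

    Returns the final result plus only those intermediate results
    whose stage was flagged keep=True; the others are discarded.
    """
    kept = {}
    current = list(records)
    result: Dict[Any, Any] = {}
    for stage in stages:
        result = map_reduce(current, stage.mapper, stage.reducer)
        if stage.keep:
            kept[stage.name] = result
        current = list(result.items())
    return result, kept


def count(_key, values):
    """Reducer shared by all stages: sum the emitted values."""
    return sum(values)


# Toy input: (sample, gene) pairs, one per variant call (made-up data).
variant_calls = [("S1", "TP53"), ("S1", "TP53"), ("S2", "TP53"), ("S2", "BRCA1")]

stages = [
    # Calls per (sample, gene): large and cheap to recompute, so discarded.
    Stage("calls_per_sample_gene", lambda rec: [(rec, 1)], count, keep=False),
    # Number of samples carrying a variant in each gene: retained for reuse.
    Stage("samples_per_gene", lambda item: [(item[0][1], 1)], count, keep=True),
    # Histogram: how many genes are hit in n samples.
    Stage("genes_by_sample_count", lambda item: [(item[1], 1)], count, keep=False),
]

final, kept = run_workflow(variant_calls, stages)
```

In this sketch the retention decision is a per-stage flag; a production workflow engine would more likely decide based on size, recomputation cost, or downstream reuse.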
Clinical Omics Data Integration
The development of research infrastructure is the foundation of our program and underlies the mission of integrating and making sense of enormous volumes of data. Our long-term goal is to develop methodologies that help provide clinical decision support through the integration of available “omics” and patient data. Toward this goal we have developed G-CODE, the Georgetown Clinical & Omics Development Engine, to help empower the next generation of translational research. The power of the G-CODE concept lies in integrating multi-omics data with clinical outcome data within a powerful but easy-to-use environment that is accessible both to clinicians deciding among treatment options and to researchers looking for trends in large datasets.
We are interested not only in the development of research platforms but also in the analysis of large datasets for novel information. We use a variety of open source tools and infrastructure in addition to our own tools and algorithms. Our research team is asking key biological and medical research questions that can be addressed through data mining, analysis, and the integration of a wide array of disparate datasets, obtained primarily through public studies, although we collaborate on private studies as well.
The Georgetown Clinical and Omics Development Engine (G-CODE)
G-CODE has been developed to help empower the next generation of translational research across a wide array of disease areas by making powerful bioinformatics tools and integrated experimental and clinical data easily accessible to both physician-scientists and laboratory researchers within a unified and quickly deployable environment. This tool is freely available for use with public or private studies and can be tailored to specific use cases. To discuss how G-CODE can enable and accelerate your translational, basic, or clinical research, please contact us at firstname.lastname@example.org.
Clinical Research Management Systems
ICBI provides support for clinical research at the Georgetown University Medical Center through the Clinical Research Management Office, which supports clinical trials within the Lombardi Comprehensive Cancer Center.
Whole genome sequencing has brought about a new set of research challenges in the biomedical field. The vast amounts of data produced by sequencing the human genome necessitate a new computational strategy as well. It is no longer feasible for most researchers to build out datacenters with adequate computational power and storage capacity for whole genome sequencing. Cloud computing lowers the barrier to entry for many institutions and provides access to the required resources at a fraction of the cost of building out a datacenter with the same capacity.
Currently, the leading player in the cloud computing arena is Amazon Web Services, which offers a multitude of resources that help to store and analyze the data produced by whole genome sequencing. Amazon S3 is used to store the data in a secure, encrypted, redundant environment. EC2 provides a computational environment that is flexible, scalable, and stable. Users can create virtual machines of various sizes, with up to 60 GB of RAM and 88 cores, and can spin up multiple instances so that workflows can be parallelized. Elastic MapReduce provides a framework for parallelizing jobs, so that tasks that previously took days can now be performed in a matter of hours. Together, these services give research institutions the capacity to store and analyze the onslaught of next-generation sequencing data.
ICBI uses Amazon Web Services to store and analyze hundreds of whole genome sequences in a secure and scalable environment. Data analysis pipelines leverage the elastic nature of the cloud and allow the center to scale to thousands of whole genome sequences.
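The parallelization described above follows a scatter-gather pattern: split the data into independent chunks, analyze each chunk on its own worker, then collect the results. The sketch below illustrates only that pattern, under loud assumptions: local threads stand in for cloud instances, no AWS service is called, the function names `gc_fraction` and `scatter_gather` are hypothetical, and the per-chunk GC-content calculation is a placeholder for a real analysis step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def gc_fraction(sequence: str) -> float:
    """Fraction of G/C bases in one chunk (placeholder for a real per-chunk analysis)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence.upper()) / len(sequence)


def scatter_gather(chunks: Dict[str, str], workers: int = 4) -> Dict[str, float]:
    """Analyze each named chunk on its own worker and gather results by name.

    Local threads are an assumption standing in for separate cloud instances.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = pool.map(gc_fraction, chunks.values())
        return dict(zip(chunks.keys(), results))


# Toy per-chromosome chunks (made-up sequences, not real data).
toy_chunks = {"chr1": "ACGTGGCC", "chr2": "ATATATAT", "chrX": "GGCC"}
gc_by_chunk = scatter_gather(toy_chunks)
```

Because each chunk is processed independently, adding workers shortens wall-clock time until the slowest chunk dominates, which is why elastic capacity suits this kind of workload.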
iCancerLab is a research platform we are developing for cancer systems immunology. The platform will include tools capable of integrating large sets of heterogeneous and context-specific information to describe the time-dependent relationships among cancer, immunity, and immunotherapies. The suite is being built using evidence from public databases, in near real time, and will include five specific tools:
- miRNAs (iMIRLab) to explore immune gene targets of microRNAs
- SNP2Structure to identify the impact of mutations on immunoprotein structures
- Significant Intratumor Genomic Heterogeneity (SIGH) to look at tumor heterogeneity in the context of immune response
- Differential Dependence Network (DDN) to look at differential biological networks in inflamed versus non-inflamed tumors
- Compositional Data Mining (CDM) to create stories by connecting a wide variety of data and text to generate patterns.
These tools will enhance immunotherapy development by identifying optimal target antigens, and will improve immunotherapy trials by determining biomarker signatures that predict immune responsiveness, ultimately making immunotherapy discovery and development faster, cheaper, and more effective.
Our data portals and tools can be found here: https://apps.icbi.georgetown.edu/