Curated and Standardized Research Datasets

ICBI faculty conduct research using public and proprietary datasets to advance Precision Medicine. They are involved in developing and applying Bioinformatics and Medical informatics methods to derive actionable knowledge from genomics, electronic health records, registries, patient-reported, public health and other datasets.

Below is a list of specialized datasets that were co-developed by ICBI investigators and collaborators. These datasets are currently managed by ICBI and available to GUMC investigators for research use. Please contact the ICBI investigators listed under each dataset to initiate a collaboration.

1. REMBRANDT (REpository for Molecular BRAin Neoplasia DaTa) Dataset

ICBI Investigators – Yuriy Gusev, Krithika Bhuvaneshwar, Camelia Bencheqroun, Subha Madhavan

Description: The REMBRANDT dataset was originally created at the National Cancer Institute and funded by Glioma Molecular Diagnostic Initiative. The data was collected from 2004-2006. In 2015, the NCI transferred this dataset to Georgetown, and it is now physically located on the Georgetown Database of Cancer (G-DOC), a cancer data integration and sharing platform for hosting alongside other cancer studies. REMBRANDT includes genomic data from 261 samples of glioblastoma, 170 of astrocytoma, 86 tissues of oligodendroglioma, and a number that are mixed or of an unknown subclass. Outcomes data include more than 13,000 data points.

The dataset is accessible for conducting clinical translational research using the open access Georgetown Database of Cancer (G-DOC) platform. In addition, the raw and processed genomics and transcriptomics data have also been made available via the public NCBI GEO repository as a super series GSE108476. Such combined datasets would provide researchers with a unique opportunity to conduct integrative analysis of gene expression and copy number changes in patients alongside clinical outcomes (overall survival) using this large brain cancer study

Raw data: GSE108476
MRI images: The Cancer Imaging Archive (TCIA)

Citations:

2. Georgetown Pediatric Cancer Outcomes Database

ICBI Investigators: Shruti Rao, Subha Madhavan

Collaborators: Aziza Shad, Kenneth Tercyak

Description: The Georgetown Pediatric Cancer Outcomes Database was co-developed by investigators at the Georgetown University Medical Center (GUMC) Lombardi Comprehensive Cancer Center (LCCC), the Pediatric Hematology Oncology and Bone Marrow Transplantation Program and ICBI. The 620 pediatric cancer patients (as of May 2015) in this database were diagnosed with various cancers at Lombardi Cancer Center’s Pediatric Oncology Program and were enrolled or treated as per Children’s Oncology Group (COG) protocols between 1990 and 2014. GUMC’s EMR systems – ARIA, AMALGA along with the paper charts were curated to obtain complete clinical information for these patients. Several retrospective research projects have been conducted on this pediatric cancer outcomes database such as – evaluation of secondary cancers in this patient cohort after therapy; assessment of risk factors associated with childhood cancer therapy and extracting predictor variables for late effects of childhood cancer treatments from clinical notes.

Data Formats: Excel/CSV, SAS, SPSS, R, Stata

3. Georgetown Immuno Oncology Registry

ICBI Investigators: Anas Belouali, Subha Madhavan

Collaborators: Neil Shah, Michael Atkins

Description: Our multidisciplinary team of clinicians and informaticians built a centralized research data warehouse for ImmunoOncology that is enabling novel hypothesis generation and retrospective outcomes research at the 10 DC-Baltimore based MedStar Health network hospitals. Data sources include hospital medical records, Labs, pathology, radiology, and Cancer Registry data. Data was extracted from different hospital systems and integrated into REDCap to allow clinicians to capture additional data from patient charts. Our integrative approach helps assess outcomes in patients with different comorbidities across all hospitals, effects of treatment with steroids on immune toxicities and their outcomes, use of various drugs before and during ImmunoOncology treatment and their impact on outcomes and toxicities. Toxicities and adverse events data in EHRs is not comprehensive. Hence, in addition to data extracted using ICD codes from structured fields/tables, we are using our in house developed NLP methodologies to automatically identify toxicities in patient charts. The Immuno Oncology registry has 758 patients as of Jan 2020.

Data Formats: Excel/CSV, SAS, SPSS, R, Stata

Citations:

Shah NJ, Al-Shbool G, Blackburn M, Cook M, Belouali A, Liu SV, Madhavan S, He AR, Atkins MB, Gibney G, Kim C.’Safety and efficacy of immune checkpoint inhibitors (ICIs) in cancer patients with HIV, hepatitis B, or hepatitis C viral infection.’ J Immunother Cancer. 2019 Dec 17;7(1):353. doi: 10.1186/s40425-019-0771-1. PMID: 31847881

4. CARIS Molecular Diagnostic Dataset

ICBI Investigators: Kanchi Krishnamurthy

Collaborators: John Marshall

Description: Molecular profiling of any patient’s tumor identifies their disease biomarker pattern that then allows that patient’s medical team to select personalized treatment options that they may not have previously considered. ICBI investigators collaborated with the Ruesch center on a project that involves the molecular profiling of cancer patients seen at Georgetown. We obtained and analyzed test results of around 3900 patients (as of Jul 2019) from Caris Molecular Intelligence™ service with the ultimate goal of better informing treatment decisions. We are currently pursuing several projects centered around these datasets, including correlating survival with choice of standard vs personalized treatment, correlation of mutational burden and microsatellite instability in colorectal cancer, development of open-source visualization and analysis tools for molecular profiling, and comprehensive comparisons with public resources such as TCGA and AACR Genie.

Data Formats: Excel/CSV

5. Stage II Colorectal Cancer – Multi-omics Molecular Profiling Dataset

ICBI Investigators: Yuiry Gusev, Krithika Bhuvaneshwar
Collaborators: Lou Weiner

Description: Colorectal cancer (CRC) patient biospecimens with extensive clinical and follow-up data were selected from the Indivumed GmbH biobank for 40 patients (20 relapse and 20 no-relapse). The patients consisted of 12 with late stage I, and 28 with stage II. Four patients (out of 12) with late stage I had experienced relapse (~33%), and it is important to note that 12 patients (out of 28) with stage II were relapse-free (~43%). Therefore, the relapse-free group of samples, and the group with relapse are both represented by a mixture of late stage I and stage II patients. Only nine stage II patients (out of 28) had rectal cancer; of these 6 had relapsed within 5 years. Of more than 180 clinical attributes, 64 were shortlisted based on relevance to clinical outcome and biomarker analysis. The molecular data included gene expression, DNA copy number, microRNA, serum and urine metabolites.

Data Formats: Tab delimited text files

Citations:

Subha Madhavan,,* Yuriy Gusev, Thanemozhi G. Natarajan, Lei Song, Krithika Bhuvaneshwar, Robinder Gauba, Abhishek Pandey, Bassem R. Haddad, David Goerlitz, Amrita K. Cheema, Hartmut Juhl, Bhaskar Kallakury, John L. Marshall, Stephen W. Byers, and Louis M. Weiner. ‘Genome-wide multi-omics profiling of colorectal cancer identifies immune determinants strongly associated with relapse’. Frontiers in Genetics, Nov 2013, 4: 236

6. ICBI Molecular Simulation Datasets

ICBI Investigators: Matthew McCoy

Description: Protein structure simulation results generated from the SNP2SIM workflow, and used to develop protein specific models of variant function. Contains metadata on simulation configuration/parameterization, and output from Nanoscale Molecular Dynamics (NAMD) and Autodock Vina.

Data Formats: NAMD output binaries (MD simulation results) or text files stored in the SNP2SIM defined file structure.

Citations:

7. OMOP-mapped Cancer Dataset (under-development)

ICBI Investigators: Shuo Wang, Kanchi Krishnamurthy, Adil Alaoui

Description: Our team is developing short and long-term strategies to map oncology-specific electronic health record data (EHR) hosted and managed in ARIA into the Observational Medical Outcomes Partnership common data model (OMOP CDM), which is one of the leading standard data models used in nationwide research and information sharing initiatives. OMOP was developed to be a shared analytics model and it has been adopted by the Observational Health Data Sciences and Informatics (OHDSI) Consortium. Our approach was to transform oncology EHR data and related databases into a common format as well as a common representation (terminologies, vocabularies, coding schemes), and then perform systematic analyses using a library of standard analytic routines that were developed and shared by the community based on the common data format.

8. Curated dataset for CDGnet

ICBI Investigators: Simina Boca

Collaborators: Hector Corrada Bravo

Description: CDGnet is an informatics tool for recommending targeted therapies to individuals with cancer using biological networks. The CDGnet tool is currently hosted at http://epiviz.cbcb.umd.edu/shiny/CDGnet/, with the Github repository – which includes the curated dataset at https://github.com/SiminaB/CDGnet. This includes curation from the DailyMed labels for targeted cancer therapies, along with drug-gene connections from DrugBank.

Data Formats: R objects, Excel file available at https://www.biorxiv.org/content/10.1101/605261v1.supplementary-material

Citations:

Kancherla J*, Rao S*, Bhuvaneshwar K, Riggins RB, Beckman RA, Madhavan S, Corrada Bravo H, Boca SM. “An evidence-based network approach to recommending targeted cancer therapies.” JCO Clinical Cancer Informatics, 2020, 4:71-88. * Joint first authorship

9. TCGA Data

ICBI Investigators: Yuriy Gusev, Krithika Bhuvaneshwar

Collaborators: Ruth He

Description: ICBI scientists have extensive experience with access, and analysis of multi-omics data from the TCGA and TARGET data collections. We are also experienced with online public TCGA resources such as the Genomic Data Commons (GDC), Cbio portal, UCSC Xena Browser, and Firebrowse/Firehose.

Data Formats: TCGA data formats

Citations:

Bhuvaneshwar et al. Genome sequencing analysis of blood cells identifies germline haplotypes strongly associated with drug resistance in osteosarcoma patients. BMC Cancer. 2019 Dec (analyzed TARGET osteosarcoma data)
Hau et al. Dynamic Regulation of Caveolin-1 Phosphorylation and Caveolae Formation by Mammalian Target of Rapamycin Complex 2 in Bladder Cancer Cells. The American Journal of Pathology. Volume 189, Issue 9, September 2019. (Explored TCGA Bladder Cancer data)
Gusev et al. Exploration of the immune cell landscape in brain cancer utilizing gene expression and copy number data. bioRxiv. Dec 2018 (Explored TCGA Brain cancer data)
Bhuvaneshwar et al. viGEN: An open source pipeline for the detection and quantification of viral RNA in human tumors. Frontiers in Microbiology. Sep 2018 (TCGA cervical cancer and TCGA liver cancer data)
Bhuvaneshwar et al. RNAseq analysis of infiltrating immune cells in liver cancer. Cancer Research. 2017/7 (Poster, TCGA Liver cancer data)
Bhuvaneshwar et al. Variant analysis of LY6 genes in TCGA ovarian cancer 2017/7 (Poster, TCGA Liver ovarian data)
Luo et al. Distinct lymphocyte antigens 6 (Ly6) family members Ly6D, Ly6E, Ly6K and Ly6H drive tumorigenesis and clinical outcome. Oncotarget. 2016 Mar 8; 7(10): 11165–11193. (Multiple TCGA datasets)

10. Bladder Cancer Dataset for Immuno Oncology analysis

ICBI Investigators: Yuriy Gusev, Krithika Bhuvaneshwar

Collaborators: Geoffrey Gibney

Description: This is a public gene expression dataset containing primary bladder cancer samples. It includes 165 primary bladder cancer samples, 23 recurrent non-muscle invasive tumor tissues, 58 normal looking bladder mucosae surrounding cancer and 10 normal bladder mucosae for microarray analysis. Available in NCBI GEO at Series GSE13507

Data Formats: Same as in NCBI GEO

Please complete our research support and consultation request form to request access to the above mentioned datasets.

Policies for Access and Use of Specialized Datasets curated and managed by ICBI for Research Purposes

Please note the following policy for data access, use and collaboration with the ICBI. This policy applies to individuals associated with Georgetown University, MedStar Health and/or the Georgetown-Howard Universities Center for Clinical and Translational Science (GHUCCTS) only.

De-identified summary level data and data access through G-DOC will be provided for research use at no cost
LDS or PHI data will not be provided to investigators at record level. Investigators can collaborate with ICBI faculty and staff to conduct research on these datasets
- For research and analysis support, we will first schedule a meeting to discuss details of your project and estimate the level of effort to complete project milestones.
- The first two hours of preliminary data analysis and assessment will be provided for free.
- Cost of IT infrastructure for data hosting and management will be assessed and an estimate will be provided to the co-investigators
- Additional data analysis, integration and support will be provided for a flat rate of $120/hr.
- Please also invite ICBI faculty and/or staff involved in planning, designing, analysis and/or writing for your project, to be a co-author on your conference abstract and/or publication.