Data Challenges

The team at the Innovation Center for Biomedical Informatics at Georgetown University (Georgetown-ICBI) has participated in and/or organized multiple international data challenges, which are described below.

Georgetown University was one of the co-sponsors and co-organizers of the COVID-19 Data Visualization Challenge. The challenge brought together academic and professional talent in data science, economics, global health, public health policy, and related fields to crowdsource data analysis and create visualizations or analysis tools that clearly communicate findings. Read more about the challenge here: https://cgdv.github.io/challenges/COVID-19/

A COVID-19 resource was also made available to researchers, international public health specialists, public policy analysts, and experts worldwide. It can be found here: https://cgdv.github.io/challenges/COVID-19/datasource/

Our Georgetown-ICBI team was involved in co-organizing the challenge and engaging the judges, and was part of the judging team. Georgetown-ICBI’s own Tableau Developer/Clinical Database Analyst Shuo Wang participated in this challenge and received an honorable mention.

Clinical investigators at Georgetown University are seeking to advance precision medicine techniques for the prognosis and treatment of brain tumors through the identification of novel multi-omics biomarkers. In support of this goal, precisionFDA, the Innovation Center for Biomedical Informatics, and the Lombardi Comprehensive Cancer Center at Georgetown University launched and executed the Brain Cancer Predictive Modeling and Biomarker Discovery Challenge. The challenge was hosted on precisionFDA, the Food and Drug Administration (FDA)’s crowdsourcing platform for regulatory science advancement.

The challenge ran from November 2019 to February 2020 and asked participating teams to develop unsupervised machine learning and/or artificial intelligence models to identify biomarkers and predict patient outcomes using gene expression, DNA copy number, and clinical data from the Rembrandt data collection. Read all the details about the challenge on the precisionFDA website here: https://precision.fda.gov/challenges/8/view

The challenge highlights not only the value of, and the continued need for, public and open data, but also the rich history of innovative crowdsourcing competitions that allow for new discoveries through team science. Detailed challenge results are here: https://precision.fda.gov/challenges/8/view/results. The top three performing teams were awarded a podium presentation and a poster at the 9th Annual Health Informatics and Data Science Symposium at Georgetown University in October 2020 (https://icbi.georgetown.edu/symposium/). The challenge team is working on a manuscript together with the top performers of the challenge. It will provide an overview of the challenge data and design, and a summary of the submissions from the participating teams.

The Rembrandt data collection: The REMBRANDT (REpository for Molecular BRAin Neoplasia DaTa) dataset was originally created at the National Cancer Institute (NCI) and funded by the Glioma Molecular Diagnostic Initiative. The data were collected from 2004 to 2006. In 2015, the NCI transferred this dataset to Georgetown. The dataset is accessible for conducting clinical translational research using the open access Georgetown Database of Cancer (G-DOC) platform. In addition, the raw and processed genomics and transcriptomics data have been made available via the public NCBI GEO repository as super series GSE108476. Together, these datasets give researchers a unique opportunity to conduct integrative analysis of gene expression and copy number changes in patients alongside clinical outcomes (overall survival) using this large brain cancer study.

Publications

  • Gusev Y, Bhuvaneshwar K, Song L, Zenklusen JC, Fine H, Madhavan S. The REMBRANDT study, a large collection of genomic data from brain cancer patients. Nature Scientific Data, Aug 2018. PMID: 30106394
  • Madhavan S, Zenklusen JC, Kotliarov Y, Sahni H, Fine HA, Buetow. Rembrandt: helping personalized medicine become a reality through integrative translational research. Molecular Cancer Research. Feb 2009. PMID: 19208739
  • Madhavan S, Gusev Y, Harris M, Tanenbaum DM, Gauba R, Bhuvaneshwar K, Shinohara A, Rosso K, Carabet L, Song L, Riggins RB, Dakshanamurthy S, Wang Y, Byers SW, Clarke R, Weiner LM. G-DOC®: A Systems Medicine Platform for Personalized Oncology. Neoplasia 13:9. Sep 2011. PMID: 21969811
  • Bhuvaneshwar K, Belouali A, Singh V, Johnson RM, Song L, Alaoui A, Harris MA, Clarke R, Weiner LM, Gusev Y, Madhavan S. G-DOC Plus – an integrative bioinformatics platform for precision medicine. BMC Bioinformatics. April 2016. PMID: 27130330

In this challenge, electronic health record (EHR) data from the University of Washington, spanning 10 years of clinical records (2009-2019) from 1.2 million patients, were made available to all participants. The records included medications prescribed, patient conditions, observations such as blood pressure and heart rate, demographic information, procedures, and laboratory measurements. The task was to predict the mortality status of patients within 180 days of their last visit using EHR data in the OMOP common data model format. A full summary of the challenge can be found here: https://www.synapse.org/#!Synapse:syn18405991/wiki/589657
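To make the prediction target concrete, the following minimal sketch derives a binary label from a patient's last visit date and an optional death date. The function name, argument names, and the choice to treat the 180-day boundary as inclusive are assumptions made for illustration; they are not taken from the challenge specification or the OMOP schema.

```python
from datetime import date, timedelta
from typing import Optional

# The 180-day window comes from the task description; treating the
# boundary day as inside the window is an assumption of this sketch.
PREDICTION_WINDOW = timedelta(days=180)


def mortality_label(last_visit: date, death_date: Optional[date]) -> int:
    """Return 1 if death occurred within 180 days after the last visit, else 0."""
    if death_date is None:
        # No recorded death: the patient is labeled as alive.
        return 0
    window_end = last_visit + PREDICTION_WINDOW
    return int(last_visit <= death_date <= window_end)
```

A patient whose last visit was on 1 January 2019 and who died on 1 March 2019 would receive label 1 under this sketch, while a death on 31 December 2019 would receive label 0.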

The Georgetown-ICBI team, which included our collaborators from ESAC Inc., used engineered features such as various risk scores, life-threatening diseases, and the Charlson comorbidity index, along with demographics, to train machine learning (ML) models such as logistic regression, SVM, and XGBoost for mortality prediction. A full report about our methodology can be found here.
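As an illustration of the kind of engineered feature mentioned above, the sketch below computes a Charlson comorbidity index from a list of condition labels and combines it with demographics into one feature row. It is not the team's actual pipeline: the condition labels and function names are hypothetical, the weight table is only a subset of the original Charlson weights, and real OMOP data would first require mapping condition concept codes to these categories.

```python
# Illustrative subset of the original Charlson weights. A complete
# implementation would cover all condition categories and map diagnosis
# codes to them; the string labels used here are hypothetical.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "chronic_pulmonary_disease": 1,
    "diabetes_without_complications": 1,
    "diabetes_with_end_organ_damage": 2,
    "moderate_or_severe_renal_disease": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
}


def charlson_index(conditions):
    """Sum the weights of the distinct conditions recorded for a patient."""
    # set() ensures a condition recorded on several visits is counted once.
    return sum(CHARLSON_WEIGHTS.get(name, 0) for name in set(conditions))


def build_features(age, sex, conditions):
    """Combine demographics and the comorbidity score into one feature row."""
    return {
        "age": age,
        "is_female": int(sex == "F"),
        "charlson_index": charlson_index(conditions),
    }
```

Rows built this way could then be passed to any of the model families named above; the choice of model is independent of how the features are derived.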

The challenge consisted of three rounds. Rounds 1 and 2 lasted one month and allowed each participating team to make up to 3 submissions. The final round, Round 3, lasted 6 weeks and allowed each participating team to make up to 5 submissions. At the end of all the rounds, a final leaderboard was released, on which the Georgetown-ICBI team ranked 9th out of a total of 23 participating teams. As a result, our team was invited to be part of a collaborative publication, which was at the preprint stage as of Jan 2021.

Publication: Bergquist T, et al. Evaluation of crowdsourced mortality prediction models as a framework for assessing AI in medicine. medRxiv, 2021. doi: https://doi.org/10.1101/2021.01.18.21250072

The purpose of the Bringing Predictive Analytics to Healthcare Challenge was to explore how predictive analytics and related methods may be applied and contribute to understanding healthcare issues. The challenge involved developing predictive analytics methods to estimate hospital inpatient utilization for selected counties in the U.S. Specifically, participating teams were asked to predict the total number of hospital inpatient discharges and the mean length of stay for selected counties for the year 2017, based on data for the years 2011 to 2016. The challenge ran from March 27 to June 28, 2019. More details about the challenge can be found here: https://www.ahrq.gov/predictive-analytics-challenge/index.html
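The forecasting task described above can be illustrated with a very simple baseline: fit a straight line to a county's yearly values for 2011 to 2016 and extrapolate it to 2017. This is only a sketch of the problem setup, not the method used by our team or by any challenge entry; the function names are hypothetical and the numbers in the usage example are synthetic.

```python
def fit_linear_trend(years, values):
    """Ordinary least-squares fit of values = slope * year + intercept."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(years, values, target_year):
    """Extrapolate the fitted linear trend to target_year."""
    slope, intercept = fit_linear_trend(years, values)
    return slope * target_year + intercept


# Synthetic example values for one county (not real discharge counts).
history_years = [2011, 2012, 2013, 2014, 2015, 2016]
history_discharges = [1000, 1010, 1020, 1030, 1040, 1050]
predicted_2017 = forecast(history_years, history_discharges, 2017)
```

The same function could be applied separately to the mean length of stay series. With only six yearly observations per county, a simple baseline like this is mainly useful as a reference point against which richer models can be compared.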

Details about our challenge methodology can be found here. AHRQ scored each challenge entry and selected five winners. The ranking for the rest of the participating teams was not released.

The Data Science Bowl is a worldwide competition that brings together data scientists, technologists, and domain experts across industries to take on the world’s challenges with data and technology. The 2018 Data Science Bowl challenge ran for 90 days, and its aim was to identify the nuclei in divergent microscopy images, regardless of the experimental setup.

Tissue samples are taken from patients and viewed under the microscope by a pathologist to help understand the nature of the disease. This manual and slow process eventually led to the rise of digital pathology and automated analysis of microscopic images. If the nuclei in these microscopy images could be correctly identified, researchers would be able to study how the cells respond to various drugs and decipher the biological processes at work. The automation of nuclei detection would enable more efficient drug testing, with the ultimate goal of transforming human lives through faster cures. Although existing software can automatically detect nuclei in a specific type of image, the aim of the 2018 Data Science Bowl was to have one algorithm that could detect nuclei across a diverse collection of microscopic images and varied conditions.

Our Georgetown-ICBI team approached the challenge in three different ways. In all three methods, we used open source tools in an effort to make the workflow as reproducible as possible and to comply with the Findable, Accessible, Interoperable, Reusable (FAIR) standards. The same preprocessing methods were used across all three methods. At the end of the competition, our team ranked in the top 12% out of more than 68,000 algorithms that were submitted from around the world. Our work provides a clear reference point for further development of machine learning methods for image analysis. Read more about the challenge here: https://www.kaggle.com/c/data-science-bowl-2018

Publication: Sharma V, Boca S, Bender J, McCoy M, Gusev Y, Bhuvaneshwar K, Harris B, and Madhavan S. Deep Learning Approach to automated detection of nuclei in microscopy images. AMIA 2019 Informatics Summit, San Francisco. Link to proceedings here.

The Georgetown-ICBI team collaborated with the University of Delaware to participate in the Text Retrieval Conference (TREC) Precision Medicine and Clinical Decision Support track, co-sponsored by NIST and DOD. The aim of the competition was to encourage data-driven approaches to identifying the best treatment for a patient, by finding the clinical trials that best matched the patient’s condition as well as evidence-based literature that suggested effective treatment. More details about the challenge can be found here: http://www.trec-cds.org/2017.html

In this challenge, our team employed a two-part system to generate the ranked lists of clinical trials and scientific abstracts. The first part handled query expansion and document retrieval from a document index. The second part generated the final ranked list by applying a heuristic scoring method.
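The general shape of such a two-part system can be sketched as follows. This is a toy illustration under stated assumptions, not the system our team submitted: the synonym table, the substring-based retrieval, the scoring weights, and all function names are hypothetical stand-ins for the real query expansion resources, document index, and heuristics.

```python
# Hypothetical synonym table; a real system would draw on curated
# biomedical vocabularies rather than a hand-written dictionary.
SYNONYMS = {
    "egfr": ["erbb1"],
    "nsclc": ["non-small cell lung cancer"],
}


def expand_query(terms):
    """Part 1a: add known synonyms to each lower-cased query term."""
    expanded = []
    for term in terms:
        key = term.lower()
        for candidate in [key] + SYNONYMS.get(key, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def retrieve(documents, terms):
    """Part 1b: keep documents that mention at least one expanded term."""
    return {
        doc_id: text
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    }


def heuristic_score(text, gene_terms, disease_terms,
                    gene_weight=2.0, disease_weight=1.0):
    """Part 2: weighted count of gene and disease mentions (assumed weights)."""
    lowered = text.lower()
    gene_hits = sum(lowered.count(term) for term in gene_terms)
    disease_hits = sum(lowered.count(term) for term in disease_terms)
    return gene_weight * gene_hits + disease_weight * disease_hits


def rank(documents, gene, disease):
    """Run both parts and return document ids, best score first."""
    gene_terms = expand_query([gene])
    disease_terms = expand_query([disease])
    candidates = retrieve(documents, gene_terms + disease_terms)
    scored = [
        (heuristic_score(text, gene_terms, disease_terms), doc_id)
        for doc_id, text in candidates.items()
    ]
    # Sort by descending score, breaking ties by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Separating retrieval from scoring keeps the candidate set small before the more specific heuristics are applied, and lets either part be changed without touching the other.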

There were a total of 32 participants in the NIST TREC Precision Medicine competition. The scoring for clinical trials involved grouping trials based on different trial fields and extracting features based on occurrences of gene/disease and other terms in the trial. Our system ranked first on this criterion on all three measures (P@5, P@10, and P@15) for grouping and ranking ClinicalTrials.gov data, and placed 1st, 4th, and 5th on three measures of ranking abstracts. This work was the first version of eGARD, the Natural Language Processing (NLP) module of Georgetown-ICBI’s MACE2K project. Read about the eGARD tool here: https://pubmed.ncbi.nlm.nih.gov/29261751/, and about MACE2K here: https://www.biorxiv.org/content/10.1101/2020.12.03.409094v1
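For readers unfamiliar with the measures named above, P@k (precision at k) is the fraction of the top k ranked results that are relevant. The sketch below shows the standard definition; the function and argument names are chosen for illustration and do not come from the TREC evaluation tooling.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k ranked items that appear in relevant_ids."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not by len(top_k)) penalizes rankings shorter than k.
    return hits / k
```

For example, if two of the first five ranked trials are judged relevant, P@5 is 2/5 = 0.4.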

Publications

  • Mahmood, A.S., Li, G., Rao, S., McGarvey, P.B., Wu, C.H., Madhavan, S., & Vijay-Shanker, K. (2017). UD_GU_BioTM at TREC 2017: Precision Medicine Track. TREC. Link: https://trec.nist.gov/pubs/trec26/papers/UD_GU_BioTM-PM.pdf
  • Roberts K et al. Overview of the TREC 2017 Precision Medicine Track. TREC Text Retr Conf. 2017 Nov; 26. PMID: 32776021

The DREAM Challenges (http://dreamchallenges.org/) are a very competitive open science effort. The Georgetown-ICBI team of 7 bioinformatics scientists participated in the DREAM 7 Drug Sensitivity Prediction Challenge in 2012. The team worked to integrate multiple -omics measurements to predict drug sensitivity in breast cancer cell lines, and ranked #7 out of 44 participating groups from around the world. As a result, the team was invited to be part of a collaborative publication (Costello et al., 2014) in the high-impact, peer-reviewed scientific journal Nature Biotechnology. Read more about the challenge here: http://dreamchallenges.org/project-list/dream7-2012/

Publication: Costello, J., Heiser, L., Georgii, E. et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol 32, 1202–1212 (2014). PMID: 24880487. https://doi.org/10.1038/nbt.2877