ICBI Director's Blog, Spring 2013

2013-04-19

It is an exciting time to be a data scientist! From large-scale clinical genomic studies to drug discovery and development, the need for computational analysis and interpretation is more critical now than ever. Commercial, academic, and government sectors alike are developing systems biology and computational approaches to mine BIG DATA to identify biomarkers and drug targets and to predict outcomes for complex diseases. But the reality is that it is not about BIG DATA anymore; we already know how to store, organize, and access these data. The remaining challenge is extracting small, actionable bites to inform biomedical research and care.

I want to share three recent experiences that underscore the need to recalibrate our thinking in times of diminishing resources and to consider how best to apply data science to real-world challenges in biomedicine. The first was the Bio-IT World Conference in Boston, where approximately 2,500 life sciences, pharmaceutical, clinical, healthcare, and IT professionals from 30+ countries gathered to discuss best practices and informatics/IT technologies in genomics, cloud computing, BIG DATA in disease research, and big pharma data management. About midway through the conference, I realized that all 12 tracks appeared to have converged on one theme: we are grappling with how to move forward with $1,000 genomes that require $1,000,000 analyses!

IT professionals are working to break this cost barrier. Cycle Computing orchestrated a 50,000-core supercomputer on the Amazon cloud for Schrödinger to accelerate the screening of potential new cancer drugs. The experiment was completed in three hours, compared with the estimated nine months required to evaluate, design, and build a 50,000-core environment and make it fully operational. The cost of the entire project, including compute time, was less than $5,000 per hour at its peak.

We will see many more such efficiencies as high-throughput (HTP) technologies and methods evolve over the next few years, with data scientists leading the way.

But we must put these advances in technology in the context of policy, which brings me to the second experience I want to highlight. I recently had the wonderful opportunity to take part in a think tank organized by the NIH to discuss the identifiability of genomic data. The think tank brought together 46 leaders from several fields, including cancer genomics, bioinformatics, human subject protection, patient advocacy, and commercial genetics, to discern the preferences and concerns of research participants about data sharing and individual identifiability. Some investigators suggest that a human being can be uniquely identified from just 30 to 80 statistically independent single-nucleotide polymorphisms (SNPs). What does this mean for cloud service providers that currently host several petabytes of genomic data for academic medical centers, hospitals, and pharma? We are already experiencing the need to reexamine HIPAA through the lens of genomic medicine. While policy will eventually catch up with technology, data scientists who manage and analyze human genome data must exercise extra caution and pay close attention to the concerns and policies that protect participant privacy.

The need for well-trained, skilled data scientists to address these challenges is greater than ever. McKinsey predicts that by 2018 the United States will face a shortage of 140,000 to 190,000 people with deep analytical skills.

Lastly, I attended an event hosted by Georgetown’s McDonough School of Business, “Big Data: Educating the Next Generation,” which emphasized, among other things, the need for a data literacy course for every college junior. I would argue that we must start earlier than that. Why not in elementary school? Last week, as one of the science fair judges at my son’s elementary school, I watched children aged 6 through 11 present extensive data tables and charts to explain the outcomes of their physics, chemistry, and biology experiments. I wish I had a penny for every time I heard the phrase “pairwise comparison” at that science fair!

Let’s continue the conversation: reach me by e-mail at sm696@georgetown.edu or on Twitter at @subhamadhavan.