Interview with Simina

What are your current research interests?

My research mostly focuses on the use of “omics” data, including genomics and metabolomics, to further personalize medical screening and treatment. For example, I’m interested in understanding an individual’s risk of developing different types of cancer given their genetic profile or their blood plasma levels of certain metabolites [small molecules], and in determining which drug or treatment regimen is most likely to work for them based on the genetic profile of their tumor. The “classical” epidemiological variables (demographic information, environmental exposures, and health behaviors) and clinical variables (patient history and disease stage) will still play an important role, but they can now be augmented by other sources of data. Integrating information from different types of data is critical to developing risk models of disease. It is becoming increasingly common to have, for example, genomic, metabolomic, and epigenomic data from the same individuals, which may result in better prediction models than if only one type of data were used. Electronic health records can add another layer of information. To use these various data types, it is important to think about the best ways to structure, store, and visualize them so that they are usable for bioinformatic analyses. As for the specific biostatistical methodologies I use, I generally try to tailor them to the scientific problem of interest. Common themes in my work thus far include set-level inference, multiple testing adjustments, mediation analysis, and meta-analysis. It is also important to me to make new methods easily accessible and usable by creating new computational tools.

Biostatisticians are now in high demand, yet there are far fewer graduates in this field than in other bioscience disciplines. Why did you choose to specialize in biostatistics?

Yes, there does seem to be a shortage of biostatisticians. I think this partly comes from a visibility gap among undergraduates: the specialization is not well known. The biostatisticians I know come mostly from a mathematical or basic science background, although some of my peers in graduate school had obtained undergraduate degrees in other fields, including engineering, computer science, and even English and history. I found the field while looking for a way to combine my undergraduate math major with my interest in biology, particularly genetics. After an internship in the Bioinformatics Group at Argonne National Laboratory as an undergraduate, I was convinced that a good path for me would be at the interface of biostatistics and bioinformatics. This led me to attend graduate school at Johns Hopkins, where I obtained a Ph.D. in Biostatistics and an M.H.S. (Master of Health Science) with a focus in Bioinformatics.

As a biostatistician, what do you see as the major challenges for big data analysis in bioinformatics and in biomedical and healthcare research?

I think one of the major challenges is creating functional interdisciplinary teams in which all individuals contribute their expertise while remaining mutually respectful. It is very easy to criticize other people’s work and to feel that everyone else has it easier than you. For instance, I regularly remind myself that my attempts at performing biology experiments in college were quite unsuccessful, so I would be unable to carry out that part of a project. Similarly, biologists, epidemiologists, and clinicians on an interdisciplinary team need to remember that a statistical approach cannot be developed in a half-hour meeting and that a bioinformatics database cannot be created overnight.

You’ve recently shared some hilarious statistical analyses done by a Harvard law student reminding us that correlation does not equal causation. Do you notice this as a big issue in your collaborations with molecular biologists and other bioscience researchers not formally trained in statistics?

Tyler Vigen combined statistics from the US Census Bureau and the CDC to show how spurious correlations can arise between things like per capita consumption of cheese and the number of people who died by becoming tangled in their sheets (http://www.tylervigen.com). Of course this is an issue in science, which is why randomized trials are generally considered to provide more trustworthy results than observational studies. Since it is often impractical and/or unethical to perform randomized trials, it is especially important to keep in mind that correlation is not causation (there is also a whole field of causal inference!). I think biomedical scientists are aware of this, but part of their job is to come up with plausible models for the data, and the human desire to generate a narrative that explains the results makes it even harder to keep the distinction in mind. At the same time, statistical scientists (including myself) are trained to be very skeptical. Both types of researchers are essential to the goal of finding scientific truth.
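
[Editor's illustration] The point about spurious correlations can be sketched with a small simulation. This is not Tyler Vigen's method or data; it is a minimal example under stated assumptions. One random "target" series is compared against many other series generated independently of it, and the strongest correlation found is reported. The function names, the 1,000 candidate series, the ten "yearly" values per series, and the random seed are all hypothetical choices made for the illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """A drifting series: each value is the previous one plus Gaussian noise."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


def strongest_chance_correlation(n_candidates=1000, n_years=10, seed=0):
    """Largest |r| between one target series and many unrelated series.

    Every series is generated independently, so any correlation found
    here is due to chance alone (all parameters are illustrative).
    """
    rng = random.Random(seed)
    target = random_walk(rng, n_years)
    candidates = [random_walk(rng, n_years) for _ in range(n_candidates)]
    return max(abs(pearson(target, c)) for c in candidates)


if __name__ == "__main__":
    best = strongest_chance_correlation()
    print(f"Strongest correlation among unrelated series: {best:.2f}")
```

With short series and many candidates to choose from, the strongest correlation is typically very high even though no series has any causal or statistical link to the target. This is the same search-then-report pattern that makes cheese consumption appear to track bedsheet deaths.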