During the summer of 2014 I worked in the lab of Dr. Chirag Patel, located at 10 Shattuck Street in the Countway Library. My research used Big Data to study the correlation between physical activity and inherited breast cancer risk. Because the work was conducted in a dry lab, most of my data was gathered and analyzed computationally. This project suited me well because I have a particular interest in cancer: I enjoy both the research behind it and the medical side of it. In the summer of 2015, I worked in the Division of Infectious Diseases at Brigham and Women's Hospital in the lab of Dr. Lee under the mentorship of Katherine Prosen. My project consisted of analyzing a protein known as Clumping Factor A (ClfA), located on the surface of the bacterial pathogen Staphylococcus aureus, and measuring the effectiveness of anti-ClfA antibodies against different strains of S. aureus. This was my first experience in a microbiology lab, so I learned many lab skills that will be of great use in college and in my professional life.



1. Gather gene expression data on physical activity from 30 GEO data-sets and analyze it using GEO2R.
2. Find the gene expression signature shared across the 30 data-sets.
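
One way the signature in step 2 could be found is sketched below in Python. This is only an illustration under assumptions: it supposes each data-set has already been reduced to a table of adjusted P values per gene, and it treats a gene as part of the signature when it is significant in several data-sets. The function name, the 0.05 cutoff, the minimum of two data-sets, and the gene labels are all hypothetical and do not come from the actual analysis.

```python
from collections import Counter

def recurrent_genes(datasets, alpha=0.05, min_datasets=2):
    """Return genes with adjusted P < alpha in at least min_datasets data-sets.

    datasets: list of dicts mapping gene label -> adjusted P value
    (assumed to have been extracted beforehand from each data-set's results).
    """
    counts = Counter()
    for adj_p_by_gene in datasets:
        # Genes that pass the significance cutoff in this one data-set.
        significant = {gene for gene, p in adj_p_by_gene.items() if p < alpha}
        counts.update(significant)
    # Keep only genes that recur across enough data-sets.
    return {gene: n for gene, n in counts.items() if n >= min_datasets}

# Toy input: made-up gene labels and P values, for illustration only.
toy = [
    {"GENE_A": 0.01, "GENE_B": 0.20, "GENE_C": 0.03},
    {"GENE_A": 0.04, "GENE_B": 0.02, "GENE_C": 0.50},
    {"GENE_A": 0.001, "GENE_B": 0.30, "GENE_C": 0.04},
]
signature = recurrent_genes(toy)
print(signature)
```

Counting recurrence is a simple stand-in for a formal meta-analysis; a stricter approach might also require the direction of change to agree across data-sets.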

Hypothesis and aims

My hypothesis is that the genes expressed in people who do not exercise correlate with the genes expressed in breast cancer, and that these individuals are therefore at a higher risk of developing breast cancer than those who are physically active.

To test my hypothesis, I will examine the P values of the data gathered in my experiment; I expect them to lead me to reject the null hypothesis. The experiment is divided into two parts: first I have to find the genes expressed in physical activity/exercise, and then I compare this data to the genes expressed in breast cancer in order to find the correlation I presume exists between the two.

1. I will use gene expression data from GEO to test the null hypothesis that there is no relationship between physical activity and breast cancer risk. I will use data from 30 data-sets on the GPL570 platform to find which genes are turned on and off (differentially expressed) during physical activity/exercise.
2. I will then compare the physical activity gene data from GEO2R to the breast cancer risk genes from TCGA (The Cancer Genome Atlas).
3. I will then find the common genes from the two experiments and draw conclusions based on P values.
4. Results (I will find out whether my hypothesis was correct or incorrect.)
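
Steps 2 and 3 above can be illustrated with a short Python sketch. The plan does not say which statistical test will be used, so the hypergeometric overlap test here is an assumption, chosen because it is a common way to ask whether two gene lists share more genes than chance would predict. The function name, the gene labels, and the universe size are hypothetical.

```python
from math import comb

def overlap_p_value(genes_a, genes_b, universe_size):
    """Return (shared genes, P value) for the overlap of two gene lists.

    The P value is the hypergeometric probability of seeing at least the
    observed number of shared genes if both lists were drawn at random
    from universe_size measured genes.
    """
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    k, n, K, N = len(shared), len(a), len(b), universe_size
    # Sum the upper tail: overlaps of size k, k+1, ..., min(n, K).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return shared, tail / comb(N, n)

# Made-up labels standing in for the two gene lists (not real results).
activity_genes = {"GENE_A", "GENE_B", "GENE_C"}
cancer_genes = {"GENE_B", "GENE_C", "GENE_D"}
shared, p = overlap_p_value(activity_genes, cancer_genes, universe_size=10)
print(sorted(shared), round(p, 4))
```

A small P value would suggest the two lists overlap more than expected by chance. It would not by itself show that inactivity causes breast cancer, only that the two expression patterns are related.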

Student reading assignment

The first reading I was assigned is about Translational Bioinformatics, a field that draws on biomedical research data collected over more than fifty years. This data is being used in various ways to transform and improve how patients are treated, by combining bioinformatics and clinical informatics. Translational Bioinformatics has become one of the major domains of bioinformatics for many reasons, but one in particular is that it allows bioinformaticians to ask questions that have never been asked before. Public and private data can be integrated to answer some of medicine's deepest mysteries. The overall gist of this innovation is the use of computer-based tools that can sort through all the data that has been gathered and turn it into something useful in medicine.

I learned from my third reading that environmental and behavioral factors, unlike socioeconomic and demographic factors, are subject to change, and that they are just as important because they contribute to mortality rates. Environmental and behavioral factors can either increase or decrease the risk of death.