
Dr. Sean Young Report from NIH BD2K All Hands Grantee Event

1. What were your biggest take-aways from the BD2K All Hands Grantee event at the NIH?

The meeting focused a lot on data science approaches like creating new machine learning models. One researcher, Dr. Jiawei Han, who leads an expert group at the University of Illinois at Urbana-Champaign, had a poster showing some impressive new methods for data analysis. People were definitely interested in our approaches for social data also, as they see the importance of using data from new media to predict events and solve real-world problems. I think the biggest take-away is that the "big data" area isn't going away anytime soon. The government and companies are putting a lot of resources behind studying this area and see huge potential in how it can change our life and work. It's always exciting being a part of an early movement where there is excitement and a lot of promise. Now that researchers know we have support, it's up to us to deliver on that promise.

2. Have you had specific feedback from the NIH on treating social media in a "serious," epidemiological research area? Did you find others at the BD2K event who are open to your ideas?

People are very open to the idea. Timing is great. I've been studying this area for over 10 years and it's actually the first time that almost everyone understands my research. That might sound crazy, but it's actually pretty common for researchers to be working on things that no one else understands, especially if it's related to technology. But people who used to question whether social media and technologies were a fad now see the tremendous amount of data from these technologies. They understand the area we're studying at a high level, and when we show them specific examples of the things people say on Twitter, or how people use wearable devices, they really get it. They understand our research, the potential of what we're building and studying, and how it can impact society. It's exciting to be able to share this with people.

3. Are there new or upcoming types of data that you would like to include in your research, that only the NIH can give you access to?

I have a call this morning with the Centers for Disease Control and Prevention (CDC). They're really interested in having us model ways to monitor and predict disease. They'll be supplying datasets of disease across the country. We're also looking into game forum datasets from people who play and are interested in video games. We have a lot of data stored and ready to go for analysis.

4. If you could explain the value of BD2K grants to a layman, how would you put it?  What kind of return on investment has there been?

Science is based on math and statistics, but statistics are dependent on data. If enough data aren't available, then the statistics won't mean anything. I was walking my dog the other day and she decided to do one of her infamous "I'm done walking" tricks where she drops to the ground in the middle of the walk and won't move. She's scared of the sound of trash trucks, and whenever a trash truck comes by she drops and tries to take cover. A woman saw me, crossed the street, and told me the fact that my dog was doing that means she has bad joints and I need to get her to the vet immediately. When I asked her why she said that, she explained to me that her 10-year-old dog does this and has bad joints. She surmised that my dog must have bad joints too. She didn't seem willing to listen to the old "correlation is not causation" argument.

The point is, people often come to incorrect conclusions because they don't have enough data. A vet would be less likely to have drawn the conclusion the woman did about my dog, not because vets are smarter or even because they have studied this, but because they see many more of these cases and therefore have a lot more data points to know when dogs drop to the ground because they're scared and when they do it because they're injured. The area of "big data" promises to give us a lot more data to analyze trends and outcomes and reach more accurate conclusions. There's a huge opportunity for a return on investment in this area. It not only allows us to be more accurate, but as in our work, it provides us with the ability to predict events we couldn't have predicted before. That means the potential for huge social returns like preventing disease and reducing poverty, and financial returns like predicting the stock market and finding the right audience of customers who want to buy products.

5. During the event I noticed you live Tweeting.  Did the use of social media change the way that you and your fellow researchers interact at an NIH event?

Most NIH researchers, or scientists in general, aren't big on tweeting. Most researchers are interested in doing their work and leave it up to others who may want to get their work out to the public. I find it tough to tweet and learn at the same time, but I try because I think it's important to let the world know about what is happening in the science, tech, and public health community, and I enjoy interacting with people about it.

6. A lot of the researchers at the BD2K event were focused on genomics and phenotype data collection.  Do we need to import terminology like genotype and phenotype into the study of social media to gain more understanding from the research community?  Are those terms already being used?

Genomics is a big area of study among big data researchers for a few reasons, but the most important reason is that we have a LOT of genome data. In order to do big data research, we need a lot of data, so researchers interested in this area often gravitate toward genomics. A lot of the advanced learning models are built on genomics data. When we work with a researcher like our own Professor Wei Wang, an expert in data mining, she has expertise in genomics data. She brings that language with her to our work. I therefore think that when working with experts in big data, it's unavoidable that we use language often used in genomics research. That's a good thing because it gives a common language that people can use, but social data are different from genomics data, so we'll need to develop our own variation of the language over time.

7. What kind of improvements or additions would you like to see added to next year's BD2K All Hands Grantee event?

The point of the meeting was to encourage cross-collaboration and conversation between different groups and researchers. Doing multi-disciplinary work is something that universities and government always talk about and encourage, but they don't usually provide incentives for doing it. For example, researchers are supposed to publish their research, but most of the top journals are focused on one area, such as cardiology or social psychology, and the researchers reviewing the science for those journals don't usually have interest or experience in other areas. That means that researchers doing interdisciplinary work have a tougher time getting their work respected and known. The big data area is designed to be interdisciplinary. Next year's meeting could really move things forward by creating incentives for researchers to publish interdisciplinary work, like dedicated top journals and funding for projects that bring together experts from different fields to solve important problems.