Video: The importance of publishing datasets

(This is part 3 of a series of posts featuring speakers from “Challenging the Science Publishing Status Quo”, an evening of talks about peer review, data sharing, and open access. Previously: Lawrence Kane on rapid publication, Keith Flaherty on publishing negative results.)

Steven Hyman is Distinguished Service Professor of Stem Cell and Regenerative Biology at Harvard University and Scholar in Residence at the Broad Institute of MIT and Harvard, as well as a former Provost of Harvard University. He started his talk by emphasizing that he was not just going to speak about the undeniable importance of sharing data, but also about some of the obstacles.

Hyman’s slides are available via F1000Posters. All video segments from his talk are linked in the text below:

Hyman points out that researchers often work very hard to generate their data sets and want them to be seen and used by others, but there is often no incentive for them to share the data directly, and not everyone who helped collect the data currently receives the credit they deserve.

The genomics community got into the habit of releasing all raw data during the Human Genome Project, but Hyman explains that there is a difference between linear sequencing data and other, more complex types of data.

“I remember when I was NIMH director and was trying to get people in the autism genetics community [to share their data], they were all arguing that they couldn’t share data and do a meta analysis, because they had asked their phenotyping questions slightly differently.”

Another obstacle is that small groups are often fearful of sharing their data, even though small labs frequently benefit from making their data sets public and can make a valuable contribution. Hyman’s example comes from the genomic analysis of complex disorders (such as schizophrenia or autism), where a sample set of 7000 subjects was not enough to identify regions of genome-wide significance. Only when groups started sharing their data sets, reaching a larger sample of 30,000 patients and 30,000 controls, did 72 regions of genome-wide significance become visible. The number of identified regions has since increased to 90: when these data were presented at meetings, smaller groups who had been holding on to their data realized that the analysis would work better if they all shared their data.

“Often people will say that Big Science, Big Data projects eat up federal funds. (…) But very often, at least when data sets are generated, or tools are generated, these empower small labs.”

Hyman then addresses issues of replication and reproducibility. For example, researchers are not necessarily trained in concepts and standards for blinding or statistical analysis, and cell lines change as they’re split. This all makes it difficult for groups to reproduce each other’s work. But part of the issue of reproducibility is that data and reagents are not being shared. Minimizing issues with reproducibility, through sharing data, is important to ensure continued support from funders.

In closing, Hyman urges researchers to change their habits when assessing individuals based on their publication history.

“We’ve met the enemy and it’s us.”

In his experience as an institute director, he has seen the way study sections and tenure committees look at the impact factors of journals instead of reading the work and looking at the content. Making data available, and giving people proper credit for sharing their data, provides evaluators with a new measure to assess researchers’ productivity, but they do need to get used to using that information and not just go by impact factors.
