Data: why openness and sharing are important

14 March, 2013

Researchers are coming under increasing pressure to share the detailed results of their research, namely the datasets themselves, with other researchers. The pressure may come from the data management plans now requested by many funders, from institutional data repositories asking for the data, or from journals that increasingly encourage data articles or even ask that the data behind more traditional articles be made available on request.

At F1000Research, we think it is essential that the data supporting the results and, ultimately, the article conclusions be made publicly available with the article, thus enabling reproducibility and even reuse in some cases.

Our mandatory policy on data submission (obviously not including those datasets where data protection could be an issue) has been tested by a couple of recent submissions where the authors had understandable initial concerns about such open sharing of their detailed results. One such author, Vathsala Mohan from AgResearch Ltd in New Zealand, initially wrote to us saying that: “providing the raw data is a little difficult as those data are very important and valuable and will form a basis for other papers from my research”.

I am sure this is a familiar scenario and concern for many researchers. It initially made us wonder whether our policy was a viable approach. We explored other options, such as a limited-time embargo before publishing data. We also discussed the options with many members of our Advisory Panel who, to our surprise, unanimously told us that we should be bold and stick to our original plans, and of course they are right. It makes no sense that a reader has to take the authors’ word for it that they really did generate the data behind their graphs, or that they analysed the data correctly and without (deliberate or unintentional) bias.

One of the strongest arguments for publishing your data as early as possible is to establish priority. This means you can truly show, with a formal data citation, that you did the work before anyone else. Such an approach could certainly have prevented many a Nobel Laureate dispute!

Publishing was, in fact, what Mohan and colleagues ultimately chose to do. Given the volume of data behind their original research article, we decided that the work would be best represented as two articles: one focussing purely on the data and protocol information, and the other focussing on the analysis and conclusions. The articles are independently citable while still being tightly linked. One of the advantages of separate publication is that it gives authors the opportunity to provide proper credit on the data paper to those who generated the data, who may not always be the same individuals who conducted the analyses and wrote up the conclusions.

Other important data to share

Most researchers have generated plenty of good data that are not going to be taken further – maybe the student who conducted the work has left, the grant ran out, the PI has left, or the data are interesting but the lab lacks the resources to take the work further. Right now, these data are often lost at the bottom of a filing cabinet somewhere, but it is obviously much better for science if the information in these datasets is shared with others through publication – and publication can bring extra citations for the authors of the data too. All that is required to publish these data articles with F1000Research is the dataset(s) and enough protocol information for someone else to be able to replicate the experiment (see, for example, this data article by Don Cooper).

Some of our data authors have seen publishing their data with F1000Research as a way to search for potential collaborators. Lawrence Kane (University of Pittsburgh, PA, USA), author of this paper, commented that: “The results have exceeded my expectations. We have had several scientists in academia and biotech contact us regarding reagents or possible collaborations”.

Attribution

This leads us to an important issue that publishers and others need to work on: how to build a better system of recognition for the authors of original datasets when someone else reuses them or builds on top of them. Formal citation is an important way to do this. The license currently recommended for all data is CC0 (no rights reserved), which avoids the problem of attribution stacking that arises when studies combine datasets from hundreds of other studies, some of which may themselves have been created from hundreds of other datasets. Even under CC0, the standard cultural norm of citing the original source of the data should still stand in the vast majority of cases.

However, some additional form of recognition for having created the original dataset would seem appropriate. One solution could be some form of data co-authorship on any subsequent papers. The original data authors would need the option to decline such co-authorship, for example in cases where they do not agree with the conclusions of the subsequent paper.

If anyone has a suggestion as to how to enable better recognition of data creators’ work, please let us know! We want to do everything we can to encourage authors to share their data.

And of course if you have useful data quietly declining in the bottom of a drawer somewhere, I urge you to do the right thing and send it in for publication – who knows what interesting discoveries you might find yourself sharing credit for!

topics: Open data, Open research
