What is open data?
7 October, 2014 | Eva Amsen
To continue our series of “What is…” posts, we’re focusing on open data. Previous posts in this series covered open access, open peer review and post-publication peer review.
Open data in science
“Open data” is a broad concept that doesn’t just apply to research data, but also to, for example, the opening up of government data. Many of the underlying ideas are similar: the goal of open data, whether it involves research data or census data, is to make data available to anyone and reusable by anyone for further analysis.
In the sciences, data have not always been easy to come by. Before the internet, journal articles could not feasibly include all the relevant data. If you wanted to use another group’s data, you had to ask them for it.
One of the first, and one of the best-known, data sharing projects in biology is the Human Genome Project. The sequencing of the human genome was a massive undertaking involving many researchers across the world. The results of their efforts have greatly advanced many areas of research and healthcare over the past decade and a half, but none of that would have been possible if the genomic sequences had not been widely available. Imagine if every time you wanted to align a DNA sequence or generate PCR primers you had to ask for permission, or worse, pay for use of the information.
Instead, anyone can freely download human genomic data, use it without asking for explicit permission, re-analyse and interpret it, and use it for anything from art projects to teaching to data mining to including versions of it in their own work. That is what open data is.
The open data movement proposes that these principles should apply not just to big, publicly funded projects like the Human Genome Project, but to any and all kinds of data.
Why use open data in science?
As illustrated by the human genome example above, opening up research data makes it much easier for other scientists to build upon that work and advance the field. Another advantage of open data is that having access to the underlying data used to generate the figures in a paper makes it easier for others to reproduce the work. This complete transparency also discourages researchers from falsifying figures in their publications: too often people get away with photoshopped images or images duplicated from different studies, and that is much easier to catch if the underlying data are available. Finally, open data allows datasets to be easily aggregated for meta-studies.
Regulations and principles for data sharing in biomedical research
A number of organisations recommend, regulate, or advise on data sharing in research. A few of them are listed here, and each of their websites includes much more information:
- NIH data sharing policies
- Biosharing – a resource of various policies, standards and databases for the sharing of research data
- Wellcome Trust Guidance for researchers: Developing a data management and sharing plan
- Panton Principles for open data in science
Incentives for data sharing
Guidelines are a good first step, but researchers also need an incentive to comply with them. Funders may ask you to share your data, but often lack the resources to ensure that you really do. To address a similar shortfall in (mandated) open access publication, NIH no longer renews grants if the grantholder did not make their work available according to open access standards. Similar enforcement for open data is not (yet) in place.
At the moment, if you want to publish work based on certain types of data, such as microarray screens or protein structures, journal editors will ask you to deposit your data in a suitable database within a certain period of publishing your article. However, they often aren’t able to follow up and check that authors have really done so within the required period.
To encourage sharing of all types of data, F1000Research and (since early 2014) PLOS require their authors to make all data underlying their articles openly available from the moment the article is published.
Credit for data publication
Another incentive for data sharing is to provide credit for data. Researchers currently get professional credit only for published articles, as a rule. A few journals now allow researchers to publish data sets in the form of a journal article, such as F1000Research (data notes), GigaScience, Scientific Data and Data. The requirements for such articles (often called “data notes” or “data descriptors”) are that they include only a brief introduction, methods, and results, but no interpretation. F1000Research has had confirmation from several publishers that this sort of publication will still allow researchers to later use these same data sets in another, more in-depth, publication.
Over time, a better way to receive credit for data would be for funders and institutes to formally recognize data deposition and open data sharing as a valuable contribution to research, but until that happens this is one way to formally turn unanalysed data into a tangible credit.
References and links
General:
Image by John Goode, via Flickr, under CC-BY 2.0 license.
All about the human genome project
Wikipedia page on open science data
Open Knowledge Foundation and its Open Definition
Panton Principles for open data in science
Funder guidelines about data sharing:
Biosharing – a resource of various policies, standards and databases for the sharing of research data
NIH data sharing policies
Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020 (pdf), which includes open data pilot information.
Journal data sharing policies:
F1000Research data sharing information
FAQs about PLOS data sharing policy
Publishing data as articles:
Information about F1000Research Data Notes
Journals that do not consider published data sets as “previous publication”.
Format of Scientific Data data descriptors.
Format of GigaScience Data Notes