The importance of providing data, and not just images of data

20 February, 2014

Ross Mounce is a researcher at the University of Bath and an F1000 Specialist. In this guest post he addresses the problem of publishing results merely as images, rather than as reusable data, and he shares his solution to unlocking data from images of phylogenetic trees.

During the course of my PhD thesis research on the importance of fossils in phylogeny, I was particularly struck by three things about the research literature: (1) it’s vast – with millions of papers published each year at an ever-growing rate; (2) few publications come with openly available data – which makes it harder to build upon progress; (3) many reported results aren’t reproducible(!).

The difficulties of just getting access to papers and their supporting data fascinated me – in the 21st century, with all the benefits of the internet and modern web technology, it was difficult to understand why things were so hard. I talked with my peers about these problems at the Young Systematists’ Forum and found I was far from alone – it is routine to re-use data from the literature, and many who try to do this encounter unnecessary and time-consuming problems.

So, back in 2011 I wrote an open letter with colleagues calling for better data archiving in palaeontology, with suggestions as to how and why it should be done. A lot has changed since then, and for the better – many journals such as F1000Research now ensure that 100% of their research articles are published with their full datasets openly available. By making data available to all, we enable and empower attempts to reproduce and extend published research, with benefits to authors as well, e.g. the open data citation advantage.

But what of the older papers, published before authors routinely published their full supporting data? How do we find and extract data from these? My particular interest is in phylogenetic data – both phylogenetic trees and character matrices. Tens of thousands of phylogenetic analyses are published every year, scattered across thousands of different journals – yet few publicly archive data for future re-use. In 2012, I was part of an international collaboration which found that, sadly, less than 4% of phylogenetic studies published in 2010 had publicly available data. Asking the authors for that data doesn’t work well either. A more recent paper, ‘Lost Branches of the Tree of Life’, shows that just 16% of authors actually supply the desired data upon request. What to do?

In my first research contract post-PhD, I’ve got funding from the BBSRC Tools and Resources Development Fund to work with Peter Murray-Rust & Matthew Wills to develop ‘Phylogenetic Literature Unlocking Tools’ (PLUTo) using content mining techniques to find and harvest relevant data directly from the literature. Central to this approach is the idea of converting scientific figures (images; often the only vestige of data that remains in many papers), back into re-usable, re-purposable data. We’re calling it ‘content mining’ because it’s more than just strictly text-mining – it involves the analysis of image data, as well as text. Many traditional publishers use copyright law to make it legally difficult for us to apply these techniques to their published content, but all papers published under the Creative Commons Attribution Licence (CC BY) as used by F1000Research, PLOS, PeerJ and many others automatically allow us to do this – a good thing for science.
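To make the contrast between an image and re-usable data concrete, here is a minimal sketch in Python. It is not part of PLUTo and says nothing about how those tools work; it only assumes that a tree has already been recovered as text in the widely used Newick format. The tree itself and the function name `leaf_names` are made up for illustration.

```python
import re

# Hypothetical example tree in Newick format (taxa chosen for illustration only).
newick = "((Homo_sapiens,Pan_troglodytes),Gorilla_gorilla);"


def leaf_names(tree):
    """Return the taxon labels of a simple Newick string, in order of appearance.

    Assumption: labels are unquoted. A leaf label is taken to be any run of
    characters that directly follows '(' or ',' and stops at the next
    '(', ')', ',', ';' or ':' (a colon introduces a branch length).
    """
    return re.findall(r"[(,]([^(),;:]+)", tree)


print(leaf_names(newick))
```

Once a tree exists as text like this, a few lines of code can list its taxa, compare it with other trees or feed it into a new analysis. None of that is possible with a picture of the same tree until someone converts it back into data.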

Later this year our efforts in this area will be boosted by legislative change in the UK. Following the recommendations of the report ‘Digital Opportunity: A review of Intellectual Property and Growth’ by Professor Ian Hargreaves, UK copyright law is expected to be modernised to include a copyright exception for content mining for non-commercial research purposes. After this change occurs, any researcher in the UK with legitimate access to electronic content will be able to mine it for non-commercial research without infringing copyright. Many traditional journal subscription agreements specifically prevent content mining, or require lengthy negotiation before it can occur, so this will be a big boon for our research as well as that of others.

If you’re interested in content mining, or getting re-usable data back from figures, follow me (@rmounce) or Peter Murray-Rust (@petermurrayrust) on Twitter – we’re actively looking for new applications and users of the technology we’ll be further developing this year 🙂

topics: Open data
