Reanalyse(a)s: making reproducibility easier with Code Ocean widgets
20 April, 2017 | Thomas Ingraham
Several articles in the Preclinical Reproducibility and Robustness channel now include interactive reanalysis interfaces that enable others to reproduce analyses within the article itself. Thomas Ingraham outlines how these work.

Much has been written about the importance of reproducibility and ways to improve it, from pre-registering study protocols to performing full-scale exact replications of entire studies. Another approach is to ensure that all relevant data and code behind the statistics described in an article are available, so that others can re-run the analyses and see whether they can reproduce the results.
This approach, taken by several publishers including F1000, guarantees immediate access to these resources without having to depend on the variable availability and cooperation of the original authors. However, reproducing an analysis can still be seen as a chore: scripts and data need to be downloaded, documentation checked, and the many circles of ‘dependency hell’ navigated. The harder it is to perform a reanalysis, the less likely it is that one will be done at all.
Plain sailing with Code Ocean
The Cornell University-based start-up Code Ocean has built a platform that makes re-running analyses much (much) easier. Authors simply upload their code and data and publish; once live, others need only hit the ‘Run’ button to re-run the analysis*. More importantly, you can edit the code to see how the results differ when you change the parameters, and you can run the analyses on your own uploaded data.
Code Ocean can do all this because Docker containers form a core part of its infrastructure: pre-bundled packages of all the libraries and settings the original creator used to make the code work. This bypasses the need to download the code and data, set up the environment on your own computer, and install the myriad dependencies often required to make it all work. Importantly, neither code creators nor users need any knowledge of Docker.
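To give a flavour of what such a container bundles, here is a hypothetical, minimal Dockerfile for an R-based analysis (an illustration only, not Code Ocean's actual configuration). Every dependency is declared in the recipe, so the environment can be rebuilt identically anywhere without the user installing anything by hand:

```dockerfile
# Hypothetical container recipe: pins the language runtime and the
# libraries an analysis needs, so anyone can re-run it unchanged.
FROM r-base:3.3.2

# Install the packages the original analysis used
RUN R -e 'install.packages(c("ggplot2", "data.table"))'

# Copy in the author's code and data
COPY code/ /code/
COPY data/ /data/

# Re-running the analysis becomes a single command
CMD ["Rscript", "/code/run_analysis.R"]
```

Because all of this happens on Code Ocean's servers, hitting ‘Run’ in the widget triggers the equivalent of building and running this container, with no local setup at all.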
Embedding reproducibility
This interface is also embeddable, and this is where we enter the picture: we are encouraging authors to embed Code Ocean widgets in their articles where appropriate, to make it easier for others to reproduce their analyses. This way, researchers don’t even need to leave the article, where all the key information is, to do the reanalysis. Plus, anything that raises the visibility and prominence of statistical reproducibility in research reporting is to be embraced.
Considering we already have strong data and software availability policies in place, bringing everything together in one place is the natural next step. As a proof of concept (and of usefulness), five articles in the Preclinical Reproducibility & Robustness channel now include Code Ocean widgets in their methods or results sections, so users can reanalyse the reanalyses within the articles themselves. These include:
- Toker, Feng & Pavlidis’ disquieting finding that perhaps as many as 1 in 3 human transcriptomics studies contain samples labelled with the wrong sex, a variable that strongly influences gene expression profiles. They uncovered these mismatches by comparing sex-specific gene expression with the sex stated on the label (example widget below).
- A follow-up analysis by Tarabichi & Detours of the widely publicized, and contested, finding that 2/3 of cancer mutations are down to bad luck (or, more precisely, to the number of lifetime stem cell divisions, LSCDs) rather than environmental factors, and so hard to prevent. Their reanalysis upheld the strong correlation between LSCDs and cancer, but undercut the original claim that prevention methods are unlikely to be effective against many cases of cancer. This is particularly topical, as the original authors Tomasetti & Vogelstein have just published a follow-up to their 2015 paper.
- Gilad & Mizrahi-Man’s reanalysis of comparative gene expression data generated by the mouse ENCODE Consortium, which unexpectedly found gene expression to be influenced more by species than by tissue type. The reanalysis suggested this discovery was actually an artefact caused by batch effects unaccounted for in the original study design; when batch was factored in, gene expression patterns clustered by tissue type rather than species. This highly accessed paper generated a lot of insightful discussion from multiple labs, which can be read in the article’s comments section.
- A reanalysis by Safikhani et al. of drug sensitivity data collected by the Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE) projects. There has been a long-running debate between the two groups as to whether these data are consistent; Safikhani et al. claim they are not, and show this has negative implications for the development of molecular predictors of a patient’s response to specific drugs.
- Do, Mobley & Singhal’s reanalysis of data suggesting the presence of specific domains of gene expression dysregulation in Down’s syndrome. Their reanalysis did not find evidence for the existence of such domains. The authors have gone further in preparing an interactive ‘click-and-select’ interface for those less comfortable working with scripts; users can choose which of the pre-loaded RNA-Seq expression datasets to compare, or upload and compare their own. This interface is not yet available in the widget, but can be accessed by going to the version on Code Ocean (screenshot below).
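The sex-mismatch check described in the first of these reanalyses can be sketched in a few lines. This is a toy illustration of the idea, not the authors' actual pipeline: the sample IDs, expression values, and decision rule below are all hypothetical. It compares the expression of two sex-specific marker genes, XIST (highly expressed in female samples) and the Y-linked RPS4Y1 (highly expressed in male samples), against each sample's labelled sex:

```python
# Toy sketch: flag samples whose labelled sex disagrees with the sex
# predicted from marker gene expression. All values are made up.

def predict_sex(xist, rps4y1):
    """Predict sex from two marker genes: XIST dominates in female
    samples, the Y-linked RPS4Y1 in male samples."""
    return "female" if xist > rps4y1 else "male"

samples = [
    # (sample id, labelled sex, XIST expr, RPS4Y1 expr)
    ("SAMPLE_A", "female", 9.2, 0.3),
    ("SAMPLE_B", "male",   0.4, 7.8),
    ("SAMPLE_C", "female", 0.2, 8.1),  # expression contradicts the label
]

mismatches = [sid for sid, label, xist, y in samples
              if predict_sex(xist, y) != label]
print(mismatches)  # → ['SAMPLE_C']
```

Scaled up across public repositories, this kind of simple consistency check is what surfaces mislabelled samples, and the embedded widget lets readers re-run the real version on the actual datasets.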
How to use Code Ocean widgets
- Click the blue ‘Run’ button in the top-right corner (you will need to register, see below).
- Wait for the script to execute.
- Once complete, the results (typically in the form of figures) will appear in the ‘Results’ pane. Clicking these will bring up the images.
- If you want to edit the code, click the ‘Code’ tab (far left), click the file you want to edit, make your changes, and then run again. You can also upload your own code and data files in this pane.
*An FYI – you will need to register for a Code Ocean account before you can run or edit the algorithms (the registration process is super quick). The default account provides an hour’s worth of run time per month for free, which should be plenty for occasional re-running of analyses (typical run times range from 20 seconds to a couple of minutes). If you find you need more run time, Code Ocean have paid plans.