Going with the workflow: an interview with Bioconductor
14 July 2016 | Thomas Ingraham

Since the Bioconductor project began 15 years ago, it has grown into a vast source of indispensable analytical tools for researchers working with high-throughput genomic data. Its open-source, open-development philosophy naturally aligns with our open science publishing model, and so we were thrilled to collaborate with Bioconductor to launch its own channel a year ago this month. Eighteen articles have been published since then, covering individual Bioconductor packages as well as workflows that combine multiple packages to solve important current problems in genomics. Nine of these were published last month as part of an article series in the run-up to the BioC2016 conference, held this year at Stanford University.
To give some more insight into the Bioconductor project and the associated channel, we spoke with Bioconductor Channel Advisors Wolfgang Huber (EMBL), Kasper D Hansen (Johns Hopkins), Sean Davis (National Cancer Institute), Vincent Carey (Harvard) and Martin Morgan (Roswell Park Cancer Institute).

Wolfgang Huber
Can you provide a brief overview of the historical development of the Bioconductor project, its main activities and its goals?
The aim of the project is to produce and disseminate tools for analysing modern biological data. The tools are interoperable modules written by an open, collaborative network of scientists (Huber et al. 2015, Gentleman et al. 2004). The project leverages the open-source statistical language R, but also extends R with data structures and algorithms adapted to the size and complexity of biotechnology data. It started in 2001 with gene expression microarrays and has since constantly reinvented itself to support, and sometimes drive, many other high-throughput technologies in biology.
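To give a flavour of the kind of data structure meant here (an illustrative example of ours, not one singled out by the interviewees), the GenomicRanges package represents genomic intervals as GRanges objects that keep chromosome, position, and strand together, with range operations built in:

```r
# A minimal sketch using the GenomicRanges package
library(GenomicRanges)

# Two 50-bp features on chromosome 1, one per strand
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(100, 300), width = 50),
              strand = c("+", "-"))

# Range-based operations such as shifting and overlap counting come for free
shift(gr, 10)
countOverlaps(gr, gr)
```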
Were there any particular features of the F1000Research platform that influenced your decision to set up the Bioconductor Channel?
Bioconductor workflows are academic publications: they benefit from peer review, and authors and readers wish to be able to cite them just as they would other papers. What convinced us of the F1000Research platform was its post-publication peer review model, with rapid initial publication and transparent peer review, and how it overcomes the problems of the ‘gatekeeper’ role of traditional peer review.
You mentioned workflows, which are a sequence of steps involving multiple Bioconductor packages. Can you elaborate on what these workflows can do, and why it is important to publish them?

Martin Morgan
The end-to-end workflows close an important gap. There are papers and documentation for individual packages, but these are typically driven by the wish to explain “what can you do with this tool?”, whereas many users are more interested in the question “how can I solve this problem?”. The channel’s workflows address the latter question; as such, they can be considered computational protocol papers. They are also more lightweight, and more rapid to produce and update, than, say, textbooks. For workflows, we are particularly interested in executable documents, where readers know that the code in the document is directly responsible for the output in the document.
That’s good to hear. We understand you are implementing a pipeline that enables authors to automate the submission of manuscripts written in Rmarkdown, the preferred format of the Bioconductor community, to the Bioconductor Channel. Could you outline this process?
Rmarkdown is simply the natural format for writing R-based workflows; it is used widely in the community. Rmarkdown integrates nicely with popular development environments for R (RStudio) and with the web. It is more fun, and more efficient, to use than conventional word processors, yet much easier than LaTeX. An Rmarkdown document contains natural-language text interleaved with programming code. It is processed by R to evaluate the code and produce figures, tables, and other output embedded in the document; the document is then rendered into HTML or PDF for reading. We are developing tools that help automate the conversion from Rmarkdown to the LaTeX files required for submission to F1000Research. The workflows published so far have helped us understand the wide range of strategies authors have adopted to meet the dual requirements of executable workflows and the journal-specific publication format. Authors interested in writing their own workflows and submitting them to F1000Research are encouraged to get in touch with us to ensure they are using a painless pipeline.
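For readers unfamiliar with the format, here is a minimal sketch of what such a document might look like; the package choice (limma) and the placeholder data are ours, purely for illustration:

````markdown
---
title: "A minimal workflow sketch"
output: html_document
---

We fit a linear model to an expression matrix and report the top genes.

```{r fit-model}
library(limma)

# Placeholder data: 20 genes measured in 10 samples, 5 of them treated
expr <- matrix(rnorm(200), nrow = 20)
design <- cbind(Intercept = 1, Treated = rep(0:1, each = 5))

# The table produced here is embedded directly in the rendered document
fit <- eBayes(lmFit(expr, design))
topTable(fit, coef = "Treated")
```
````

Rendering the file (for example with rmarkdown::render()) executes the chunk and places its output inline, which is what makes a published workflow verifiably executable.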
You recently organized the BioC2016 conference. What were the standout highlights of the meeting?

Sean Davis
The amount and the pace of the science being done were outstanding. The community is vibrant, with many new researchers joining as both users and developers. We’re excited to see how the tools that we produce enable such good science, both in understanding individual new data types (say, single-cell ‘omics) and in integrating heterogeneous data.
The conference offers a unique perspective on modern statistical genomics. Morning talks cover a very broad range of statistical, biological, and computational issues. Afternoon workshops translate the ideas of the morning into practical application. A great example of this interplay between ideas and application was the morning talk by Dr. Sandrine Dudoit addressing statistical approaches to identification of novel cell types using single-cell transcriptome sequencing, coupled with an afternoon workshop presented by Davide Risso, Kelly Street and Michael Cole describing Bioconductor packages for analysis of single-cell RNA-seq data.
What’s next for the Bioconductor project?
Perhaps it would be best to look briefly at three concepts for Bioconductor’s future: strategy, environment, and people.
Strategy: Inspired not least by F1000Research, we are moving to a new package submission and review process based on transparent, open post-publication peer review. We’re also responding to other trends in statistical computing, including more flexible access to source code repositories (git), use of markdown-based vignettes and workbooks, and increasingly facile approaches to presenting rigorous and reproducible analytical results as rich, interactive documents and applications. We are striving to make it natural for critical metadata, such as experimental batch conditions and reference genome build identifiers, to be propagated along workflows from lab to report to published dataset. Recognizing that today’s experiment is tomorrow’s reference annotation helps build respect for the preservation of data and annotation provenance. We are also closely watching developments in the space of genomic data standards arising through the Global Alliance for Genomics and Health (GA4GH) and the NIH Genomic Data Commons.

Vincent Carey
Environment: The complexity and heterogeneity of biological data keep increasing, and we need to keep innovating our infrastructure to support this. That includes distributed computing and efficient storage, but also well-engineered data types that are easy to use yet “safe” for the complex operations typical of integrative analysis at genome scale. Particularly as cloud-based environments become more common for genome investigators, we need to ensure that the engineering commitments underlying Bioconductor’s success continue to be respected: separation of release and development streams for infrastructure and analytic software, synchronization of code release processes with updates of key infrastructure components such as the R language itself, and full continuous integration of both release and development streams to ensure uninterrupted interoperability.
People: It is very satisfying to see the growth of the developer/contributor community and the increased interest in the annual Bioconductor conference. Our foundation, Bioconductor Foundation of N.A., Inc., uses revenues from monograph royalties and training activities to fund scholarships for conference attendees and to defray the costs of special developer meetings in Europe and Asia. We are also glad to see that Bioconductor methods are central to modules in several online courses offered through the edX and Coursera platforms. All of these conditions bode well for the continued strength of the developer and user communities.
Great to hear that the open post-publication peer review approach is being applied to package evaluation too! Speaking of which, how can one contribute packages or cross-package workflows and join the Bioconductor community?

Kasper D Hansen
The four steps to participation are outlined on the project home page, at https://bioconductor.org: install R and Bioconductor; use the extensive training material, package vignettes, and now the F1000Research workflows to learn how to work effectively with statistical and biological data and Bioconductor packages; use the software for your own work, perhaps exploring new approaches and tackling new data types; and share your results and knowledge with your lab group, colleagues, and the broader community by developing your own package, helping others on our support site (https://support.bioconductor.org), or contributing a workflow of your own. There are impressive examples of individuals who started out as casual users and successively moved into development roles as they gained experience and began solving their own research problems in innovative ways.
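For readers who want to take the first of those steps right away, installation at the time of this interview went through the biocLite script documented on the project home page; the package named here is just an example:

```r
# Fetch Bioconductor's installer script, then install packages with it
source("https://bioconductor.org/biocLite.R")
biocLite()                  # core Bioconductor packages
biocLite("GenomicRanges")   # any specific package of interest
```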
References
Huber W, et al. (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nature Methods 12:115–121. doi:10.1038/nmeth.3252
Gentleman RC, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5:R80. doi:10.1186/gb-2004-5-10-r80