Cluster Flow - an easy to use bioinformatics tool

17 March, 2017

Cluster Flow is a pipeline tool developed by the SciLifeLab Swedish National Genomics Facility and the Babraham Bioinformatics Group in the UK. It has been described in a Software Tool Article on F1000Research. In this guest blog, one of the article’s authors, Phil Ewels, explains what Cluster Flow is and how it will be of use to the bioinformatics community.

Pipeline tools reduce workloads and aid reproducibility

In any given project, a bioinformatician will typically run many different software packages to process and analyse data. This is especially true for studies with next-generation sequencing, where there can be tens of steps and hundreds (if not thousands) of samples. Running these tasks manually is time-consuming and error-prone. Pipeline tools offer a way to automate this process, allowing samples to be processed in the same way by the same tools every time. This is great for reproducibility between experiments and reduces the manual work required. As the scale of bioinformatics projects increases, such automation is increasingly necessary.
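To make the idea concrete, here is a minimal Python sketch of what "the same tools, in the same order, for every sample" means. It is not how Cluster Flow is implemented (its modules are written in Perl), and the tool names `qc_tool`, `trim_tool` and `align_tool` are hypothetical placeholders rather than real programs. The sketch only builds the list of commands for each sample; it does not run anything.

```python
from typing import Callable, Dict, List

# A step takes a sample name and returns the command that would be run.
# All tool names below are hypothetical placeholders, not real programs.
Step = Callable[[str], str]


def qc(sample: str) -> str:
    return f"qc_tool {sample}.fastq"


def trim(sample: str) -> str:
    return f"trim_tool {sample}.fastq"


def align(sample: str) -> str:
    return f"align_tool {sample}.trimmed.fastq"


# The pipeline is simply an ordered list of steps.
PIPELINE: List[Step] = [qc, trim, align]


def plan(samples: List[str], pipeline: List[Step] = PIPELINE) -> Dict[str, List[str]]:
    """Build the same ordered list of commands for every sample."""
    return {sample: [step(sample) for step in pipeline] for sample in samples}


if __name__ == "__main__":
    for sample, commands in plan(["sampleA", "sampleB"]).items():
        print(sample)
        for command in commands:
            print("  " + command)
```

Because the steps are defined once and applied to every sample, adding a thousand more samples changes only the input list, not the analysis itself.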

What makes Cluster Flow different from other pipeline tools

We designed Cluster Flow from the outset to be simple and easy to use by those running the analysis. We found that other pipeline tools required a substantial investment of time, both for installation and then to write the required pipeline scripts. Cluster Flow supports over 40 common bioinformatics tools assembled into analysis pipelines, ready to run out of the box.

This ease of use has had the unexpected consequence of lab scientists in our institute running their own bioinformatics analyses, confident that the tools are being run in an appropriate manner. This is a great side effect for us – some of the routine workload is removed from the core bioinformatics group, and researchers don’t have to wait in a queue for standard analysis runs.

Cluster Flow as a resource for the bioinformatics community

Cluster Flow is well suited to small and medium-sized research groups who may just be starting to run a significant number of next-generation sequencing samples. It comes with analysis pipelines for a number of common data types (RNA-seq, ChIP-seq, Bisulfite-seq and others) so you can be up and running straight away. Cluster Flow itself can then be easily extended and customised as required.

The Cluster Flow modules that launch software tools are written in the programming language Perl and should be easy to understand for those familiar with bioinformatics. We’ve written quite a bit of documentation about how to create new modules, so we hope that groups will extend Cluster Flow to work with their favourite tools.

Open source code means it can be adapted for your own needs

I think that open source code is critical for scientific research. For results to be trusted, they have to be understood, and it’s almost impossible to know how a program works without looking into its code.

Open source code also helps tools to develop: Cluster Flow is maintained in the open on GitHub, allowing anyone to make their own copy of the program and make their own changes. These can then be contributed back to the main program. Several people have already extended Cluster Flow in this way, helping its development by crowd-sourcing the needs and experience of many bioinformaticians.
