New open source platform helps streamline bioinformatics data management
24 February, 2023 | Balázs Bohár

Data collection underpins biological research more than ever before. Yet, data management can be as time-consuming as the analysis itself. Computational biologists often spend days preparing data before answering any biological questions.
Here, Balázs Bohár digs deeper into his Software Tool Article, which examines how Sherlock, an open source big data platform, helps provide a solution. By streamlining bioinformatics data management, Sherlock helps computational biologists spend more time analyzing data and less time collating it.
What inspired you to develop Sherlock?
The main motivation behind developing Sherlock was the incredible growth of data over the last one to two decades. Due to the increasing quantity of data, computational biologists need new, fast, and efficient bioinformatics tools to be able to process, analyze, and store these large data sets. It was for this reason that we set out to design Sherlock and make it openly available on GitHub. We wanted to help streamline bioinformatics data management.
How do computational biologists use data in their role?
In general, computational biologists use data in four different ways, with each having different requirements. These include:
#1 Data collection
Computational biologists must be proficient in the types and structures of data their tasks require, so that they can find and collect all of the relevant data available online.
#2 Data storage
What’s more, they need a secure, efficient data storage device or tool. Currently, cloud providers offer data storage solutions. However, computational biologists need to be familiar with the structure and operation of these storage systems.
#3 Processing data
Researchers in the field need to be familiar with a range of widely used online tools for different data processing methods. In some circumstances, they’ll need to write their own scripts and programs.
#4 Data analysis
The knowledge required for data analysis and data processing largely depends on the format and structure of the data involved in each task.
“In the era of Big Data, data collection underpins biological research more than ever before. In many cases, this can be as time-consuming as the analysis itself.”
– Balázs Bohár, Research Assistant, Korcsmaros Group at Imperial College London
What challenges do computational biologists face when trying to download and analyze data?
In the era of Big Data, data collection underpins biological research more than ever before. In many cases, this can be as time-consuming as the analysis itself.
Most bioinformatics projects start with gathering a lot of data. This involves downloading multiple public databases and spending days preparing the data before answering any biological questions.
Computational biologists spend some time working on their own data, but most of their time is spent gathering and processing external reference data.
This is problematic because all this data can be extremely large and complex, which makes it difficult to download, store, and analyze efficiently and quickly. In fact, the vast scale of the datasets is precisely why computers and databases are required. We need them to correlate and process all the datasets – we cannot do it manually.
Additionally, working with data from different places in different formats is very difficult. Firstly, you need to transform all the data into a standard format before you start working on it. Secondly, you must write your own scripts to do that, because it’s very rare to find software that will do it for you. Lastly, you must store the downloaded data as well, which can also be very challenging.
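The transformation step above is usually a small, hand-written script. As a minimal sketch (the source databases, field names, and schema here are hypothetical, not Sherlock's actual formats), it often looks like this:

```python
# Hypothetical sketch: records from two different databases arrive with
# different field names; each is mapped onto one shared schema before
# storage and analysis. All names and values here are illustrative.

def normalize(record, source):
    """Map a source-specific record onto a shared (id, symbol, species) schema."""
    if source == "db_a":  # e.g. a record like {"uniprot": ..., "gene": ..., "taxon": ...}
        return {"id": record["uniprot"], "symbol": record["gene"], "species": record["taxon"]}
    if source == "db_b":  # e.g. a record like {"accession": ..., "name": ..., "organism": ...}
        return {"id": record["accession"], "symbol": record["name"], "species": record["organism"]}
    raise ValueError(f"unknown source: {source}")

rows = [
    normalize({"uniprot": "P04637", "gene": "TP53", "taxon": 9606}, "db_a"),
    normalize({"accession": "P42345", "name": "MTOR", "organism": 9606}, "db_b"),
]
```

Once every record shares the same schema, the downstream filtering and analysis code only has to be written once, regardless of where the data came from.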
How does Sherlock provide a solution?
Sherlock helps computational biologists working with large datasets in four ways. These are:
#1 A single standard Data Lake
Sherlock helps in the very first steps of a bioinformatics project (collect, store, process and analyze the data). We have combined many different databases into a standard Data Lake, so it becomes relatively easy to clean, integrate and filter the data you need.
#2 Simple interface
Furthermore, the platform offers a simple interface that leverages big data technologies, such as Docker and PrestoDB, and it is designed to enable users to analyze, process, query, and extract information from extremely large and complex data.
#3 Optimized Row Columnar (ORC)
Moreover, Sherlock can handle differently structured data from several sources and convert it to a common optimized storage format, ORC.
#4 Storage
Finally, Sherlock provides a storage solution for big data computational biology projects.
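Because Sherlock exposes the Data Lake through PrestoDB's SQL interface, cleaning and filtering the integrated data comes down to ordinary SQL queries. The sketch below uses Python's built-in sqlite3 as a stand-in SQL engine just to illustrate the kind of query involved; the table name, columns, and values are made up for the example and are not Sherlock's actual schema.

```python
import sqlite3

# Stand-in for a Data Lake table of integrated interaction records.
# (Sherlock itself would serve this via PrestoDB over ORC files.)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE interactions (source_id TEXT, target_id TEXT, score REAL)")
con.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [("P04637", "Q00987", 0.97), ("P04637", "P38398", 0.42)],
)

# Filter the integrated table down to high-confidence interactions.
high_conf = con.execute(
    "SELECT source_id, target_id FROM interactions WHERE score > 0.9"
).fetchall()
print(high_conf)  # [('P04637', 'Q00987')]
```

The point is that once the databases are combined into one standard Data Lake, this single declarative query replaces the per-database download-and-parse scripts described earlier.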
Can you share some examples of how researchers have been using Sherlock so far?
One of the most repetitive and crucial tasks for those working in bioinformatics is identifier (ID) mapping. Working with many different datasets from multiple sources, all carrying diverse identifiers, is challenging. The principal idea behind these ID mapping steps is to have one or more separate tables, called mapping tables, which contain the different identifiers. The main limitation is that when a team works with many different identifiers at once, the mapping table can become large, which increases the computational time needed to extract data from it. Sherlock’s speed and efficiency enable researchers to avoid these kinds of problems and significantly shorten the time needed for ID mapping.
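In miniature, the ID-mapping step described above looks like the sketch below: a mapping table relating one identifier namespace to another, applied in bulk to a dataset's IDs. The identifiers and table contents are invented for illustration; in practice the mapping table can hold millions of rows, which is where Sherlock's query speed matters.

```python
# Hypothetical mapping table: Ensembl gene IDs -> gene symbols.
# Real mapping tables are far larger and live in the Data Lake.
mapping_table = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}

dataset_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000000005"]

# Map every ID in the dataset, flagging those absent from the table.
mapped = {i: mapping_table.get(i, "UNMAPPED") for i in dataset_ids}
```

Note that IDs missing from the mapping table must be handled explicitly; silently dropping them is a common source of errors in this step.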
What are the next steps for Sherlock and the team?
Our main goal is to disseminate the Sherlock platform widely and enable computational and systems biologists to manage their large-scale datasets more quickly and effectively.
As for next steps, we would like to continuously develop the platform and keep it in step with the evolving IT and bioinformatics landscape. Additionally, we plan on regularly updating Sherlock’s loader scripts to ensure compatibility with commonly used biological databases in the future.
Furthermore, we plan to improve our source code and provide more detailed documentation. Right now, updating the databases included in the Data Lake is a manual process, and we’d like to move towards greater automation. We would also like to include more common and general computational biology examples in the repository. To aid this, we are developing tutorials and extending the use cases showing how Sherlock can be deployed and utilized for different research projects.
Read the full Software Tool Article today on F1000Research or explore our Bioinformatics Gateway to discover a plethora of cutting-edge open research.