New open source platform helps streamline bioinformatics data management
24 February, 2023 | Balázs Bohár

Data collection underpins biological research more than ever before. Yet, data management can be as time-consuming as the analysis itself. Computational biologists often spend days preparing data before answering any biological questions.
Here, Balázs Bohár digs deeper into his Software Tool Article, which examines how Sherlock, an open source big data platform, helps provide a solution. By streamlining bioinformatics data management, Sherlock helps computational biologists spend more time analyzing data and less time collating it.
What inspired you to develop Sherlock?
The main motivation behind developing Sherlock was the incredible growth of data over the last one to two decades. Due to the increasing quantity of data, computational biologists need new, fast, and efficient bioinformatics tools to be able to process, analyze, and store these large data sets. It was for this reason that we set out to design Sherlock and make it openly available on GitHub. We wanted to help streamline bioinformatics data management.
How do computational biologists use data in their role?
In general, computational biologists use data in four different ways, with each having different requirements. These include:
#1 Data collection
Computational biologists must be proficient in the types and structures of data their tasks require, so that they can find and collect all of the relevant data available online.
#2 Data storage
What’s more, they need a secure, efficient data storage device or tool. Currently, cloud providers offer data storage solutions. However, computational biologists need to be familiar with the structure and operation of these storage systems.
#3 Processing data
Researchers in the field need to be familiar with a range of widely used online tools for different data processing methods. In some circumstances, they’ll need to write their own scripts and programs.
#4 Data analysis
The knowledge required for data analysis and data processing largely depends on the format and structure of the data involved in each task.
“In the era of Big Data, data collection underpins biological research more than ever before. In many cases, this can be as time-consuming as the analysis itself.”
– Balázs Bohár, Research Assistant, Korcsmaros Group at Imperial College London
What challenges do computational biologists face when trying to download and analyze data?
In the era of Big Data, data collection underpins biological research more than ever before. In many cases, this can be as time-consuming as the analysis itself.
Most bioinformatics projects start with gathering a lot of data. This involves downloading multiple public databases and spending days preparing the data before answering any biological questions.
Computational biologists spend some time working on their own data, but most of their time is spent gathering and processing external reference data.
This is problematic because all this data can be extremely large and complex, which makes it difficult to download, store, and analyze efficiently and quickly. In fact, the vast scale of the datasets is precisely why computers and databases are required. We need them to correlate and process all the datasets – we cannot do it manually.
Additionally, working with data from different places in different formats is very difficult. Firstly, you need to transform all the data into a standard format before you start working on it. Secondly, you must write your own scripts to do that, because it’s very rare to find software that will do it for you. Lastly, you must store the downloaded data as well, which can also be very challenging.
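The transformation step above is usually a small, hand-written script. As a minimal sketch (the source databases, field names, and schema here are hypothetical, not Sherlock's actual formats), it often looks like this:

```python
# Hypothetical sketch: records from two different databases arrive with
# different field names; each is mapped onto one shared schema before
# storage and analysis. All names and values here are illustrative.

def normalize(record, source):
    """Map a source-specific record onto a shared (id, symbol, species) schema."""
    if source == "db_a":  # e.g. a record like {"uniprot": ..., "gene": ..., "taxon": ...}
        return {"id": record["uniprot"], "symbol": record["gene"], "species": record["taxon"]}
    if source == "db_b":  # e.g. a record like {"accession": ..., "name": ..., "organism": ...}
        return {"id": record["accession"], "symbol": record["name"], "species": record["organism"]}
    raise ValueError(f"unknown source: {source}")

rows = [
    normalize({"uniprot": "P04637", "gene": "TP53", "taxon": 9606}, "db_a"),
    normalize({"accession": "P42345", "name": "MTOR", "organism": 9606}, "db_b"),
]
```

Once every record shares the same schema, the downstream filtering and analysis code only has to be written once, regardless of where the data came from.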
How does Sherlock provide a solution?
Sherlock helps computational biologists working with large datasets in four ways. These are:
#1 A single standard Data Lake
Sherlock helps in the very first steps of a bioinformatics project (collect, store, process and analyze the data). We have combined many different databases into a standard Data Lake, so it becomes relatively easy to clean, integrate and filter the data you need.
#2 Simple interface
Furthermore, the platform offers a simple interface that leverages big data technologies, such as Docker and PrestoDB, and it is designed to enable users to analyze, process, query, and extract information from extremely large and complex data.
#3 Optimized Row Columnar (ORC)
Moreover, Sherlock can handle differently structured data from several sources and convert it to a common optimized storage format, ORC.
#4 Storage
Finally, Sherlock provides a storage solution for big data computational biology projects.
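Because Sherlock exposes the Data Lake through PrestoDB's SQL interface, cleaning and filtering the integrated data comes down to ordinary SQL queries. The sketch below uses Python's built-in sqlite3 as a stand-in SQL engine just to illustrate the kind of query involved; the table name, columns, and values are made up for the example and are not Sherlock's actual schema.

```python
import sqlite3

# Stand-in for a Data Lake table of integrated interaction records.
# (Sherlock itself would serve this via PrestoDB over ORC files.)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE interactions (source_id TEXT, target_id TEXT, score REAL)")
con.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [("P04637", "Q00987", 0.97), ("P04637", "P38398", 0.42)],
)

# Filter the integrated table down to high-confidence interactions.
high_conf = con.execute(
    "SELECT source_id, target_id FROM interactions WHERE score > 0.9"
).fetchall()
print(high_conf)  # [('P04637', 'Q00987')]
```

The point is that once the databases are combined into one standard Data Lake, this single declarative query replaces the per-database download-and-parse scripts described earlier.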
Can you share some examples of how researchers have been using Sherlock so far?
One of the most repetitive and crucial tasks for those working in bioinformatics is identifier (ID) mapping. Working with many different datasets from multiple sources, all carrying diverse identifiers, is challenging. The principal idea behind these ID mapping steps is to have one or more separate tables, called mapping tables, which contain the different identifiers. The main limitation is that when a team works with many different identifiers at once, the mapping table can become large, which increases the computational time needed to extract data from it. Sherlock’s speed and efficiency enable researchers to avoid these kinds of problems and significantly shorten the time needed for ID mapping.
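In miniature, the ID-mapping step described above looks like the sketch below: a mapping table relating one identifier namespace to another, applied in bulk to a dataset's IDs. The identifiers and table contents are invented for illustration; in practice the mapping table can hold millions of rows, which is where Sherlock's query speed matters.

```python
# Hypothetical mapping table: Ensembl gene IDs -> gene symbols.
# Real mapping tables are far larger and live in the Data Lake.
mapping_table = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}

dataset_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000000005"]

# Map every ID in the dataset, flagging those absent from the table.
mapped = {i: mapping_table.get(i, "UNMAPPED") for i in dataset_ids}
```

Note that IDs missing from the mapping table must be handled explicitly; silently dropping them is a common source of errors in this step.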
What are the next steps for Sherlock and the team?
Our main goal is to disseminate the Sherlock platform widely and enable computational and systems biologists to manage their large-scale datasets more quickly and effectively.
As for next steps, we would like to continuously develop the platform and keep it in step with the evolving IT and bioinformatics landscape. Additionally, we plan on regularly updating Sherlock’s loader scripts to ensure compatibility with commonly used biological databases in the future.
Furthermore, we plan to improve our source code and provide more detailed documentation. Right now, updating the databases included in the Data Lake is a manual process, and we’d like to move towards greater automation. We would also like to include more common and general computational biology examples in the repository. To aid this, we are developing tutorials and extending the use cases showing how Sherlock can be deployed and utilized for different research projects.
Read the full Software Tool Article today on F1000Research or explore our Bioinformatics Gateway to discover a plethora of cutting-edge open research.