Reusing human translational research data
19 September, 2017 | Chao Zhang |
Exploring the steps needed to make clinical data reusable

Image credit: the cell portion of this graphic is based on an image by the Database Center for Life Science, CC BY 3.0. The other portions of this graphic are from free-to-use sources.
Chao Zhang is an author on a recent F1000Research article that looked into the options available to researchers for reusing data gathered from patients during clinical research. He outlines the stages needed for these data to be accessed and reused.
Cancer treatment has come a long way since Sidney Farber fathered modern chemotherapy back in the 1950s. Today, big data can be generated from patients’ tumour samples, from which researchers can pinpoint mutations unique to each patient, and clinicians can then select the appropriate medicines required to treat each tumour.
Data is playing an important role in clinical research, and personalized medicine has only been made possible by advances in data analysis methods and high-throughput experimental techniques.
Data spawn
A huge amount of data has been generated through patient involvement in research and, given the impressive number of patients involved, dispersed into many different in-house data repositories. Unfortunately, most of these data are underused: they are not shared; they are shared but cannot easily be found; or they are hard to reuse because the crucial metadata describing the experiments and data processing of the time are missing. The choice of analytical tools can also limit the analysis and prevent interpretation of all the available data.
Reusing data has its own advantages over simply generating new data. Economically, reuse helps us pare down huge costs, since these data are expensive to generate in the first place.
Methodologically, reusing the existing data will help us rapidly develop and verify new methods, which can be used to re-analyse the existing data and reveal information that previously went undetected.
We must also consider data privacy, and ensure that all data are secure and reused responsibly. With this prerequisite in place, the FAIR data principles provide a good guideline for making data ‘R’eusable by making them ‘F’indable, ‘A’ccessible and ‘I’nteroperable.
Data exploration
Thanks to ongoing efforts worldwide, the FAIR principles are materializing in medical big data management. The European Genome-phenome Archive (EGA) has become the de facto data repository for human big data, making these data accessible in a secure way.
Data exploration typically begins with tranSMART, a platform extensively used to integrate clinical data and extract information from big data, such as mutations. Galaxy is a bioinformatics workflow management system that enables the easy use of complex computational pipelines.
Let’s make the data reusable
I joined this collaborative project as part of my internship when I was still a master’s student. My supervisor, Sanne Abeln, enthused me and helped me grasp the mind-blowing concepts of the research and get on with the work. The project lasted until the beginning of my second PhD year and was completed with the help of all the collaborators.
In this implementation study, we aimed to establish a data reuse scenario: first, users explore the integrated information in tranSMART and trace it back to the big data in EGA; afterwards, users can reanalyse the big data in Galaxy. We achieved this by making the data flow from tranSMART to EGA, and from EGA to Galaxy.
To do this, we first made our test data available in EGA and tranSMART with proper metadata, and then developed a Galaxy tool that makes importing EGA data into Galaxy effortless. Finally, by aligning the data models of tranSMART and EGA, we demonstrated that the two systems are interoperable in a demo we developed specifically for this purpose.
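The overall flow can be sketched in a few lines of Python. This is a minimal mock-up, not the actual tool: the accession numbers, record fields, and function names below are illustrative assumptions, and the real Galaxy import tool additionally handles authentication and secure transfer from EGA.

```python
# Mock tranSMART record: an integrated clinical variable annotated with the
# EGA accessions of the raw data it was derived from (hypothetical IDs).
transmart_record = {
    "patient_id": "P001",
    "variable": "TP53 mutation status",
    "value": "mutated",
    "ega_dataset": "EGAD00000000001",  # hypothetical dataset accession
    "ega_file": "EGAF00000000001",     # hypothetical file accession
}

def trace_to_ega(record):
    """Step 1: trace an integrated clinical value back to the raw data in EGA."""
    return record["ega_dataset"], record["ega_file"]

def import_into_galaxy(dataset_acc, file_acc, history):
    """Step 2: stand-in for the Galaxy import tool, which fetches an EGA
    file into a Galaxy history for reanalysis."""
    history.append({"source": "EGA", "dataset": dataset_acc, "file": file_acc})
    return history

galaxy_history = []
dataset_acc, file_acc = trace_to_ega(transmart_record)
import_into_galaxy(dataset_acc, file_acc, galaxy_history)
print(galaxy_history[0]["file"])  # EGAF00000000001
```

The point of the sketch is the direction of travel: from the integrated view in tranSMART, via stable EGA accessions, into a Galaxy history where the raw data can be reanalysed.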
Based on these efforts, we proposed a few essential, well-defined metadata attributes for capturing human big data. Furthermore, we suggest that these metadata attributes be identified by persistent digital identifiers to enable cross-linking between different data resources such as EGA and tranSMART.
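To illustrate why persistent identifiers matter for cross-linking, here is a toy Python example. The identifier scheme (`pid:...`) and the attribute names are made-up assumptions, not a real vocabulary; the idea is only that two resources agree on the same stable ID for the same attribute, so their records can be joined reliably even if local column names differ.

```python
# Hypothetical persistent IDs for shared metadata attributes.
ATTRIBUTE_IDS = {
    "sample_id":  "pid:0001",
    "tissue":     "pid:0002",
    "assay_type": "pid:0003",
}

# Records from two resources, keyed by persistent ID rather than local names.
ega_record       = {"pid:0001": "S42", "pid:0003": "WGS"}
transmart_record = {"pid:0001": "S42", "pid:0002": "lung"}

def same_entity(rec_a, rec_b, key_pid):
    """Two records describe the same entity if they agree on the value of
    the persistent identifier chosen as the linking key."""
    return rec_a.get(key_pid) == rec_b.get(key_pid)

print(same_entity(ega_record, transmart_record, ATTRIBUTE_IDS["sample_id"]))  # True
```

Without a shared, long-lasting identifier, the same sample might be called `sample_id` in one system and `subject_sample` in another, and the link would have to be guessed rather than resolved.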
Further down the road: data reproducibility
Our project focused on data reuse, but we are also interested in data management and reproducibility. We still need to know how information was extracted from human big data, so that we can reproduce the process and retrieve the information faithfully. So far, this has not been possible due to the diversity of bioinformatics workflows and the lack of standard frameworks, but it is a challenge we must eventually overcome.
Both data reuse and reproducibility still rely on interoperability, which is not easy to achieve and will require more effort from all of us.