FAIRtracks: a new solution for the FAIRification of genomic tracks

F1000 Blogs

7 December, 2021

Sveinung Gundersen and the FAIRtracks team

In this blog post, Sveinung Gundersen and colleagues introduce FAIRtracks, a new standard aimed at making genomic track metadata more Findable, Accessible, Interoperable, and Reusable (FAIR). Keep reading to discover the potential of FAIRtracks and what’s needed to transform this novel solution into standard community practice.

What are genomic tracks?

Genomic tracks are data files that annotate positions along a DNA reference sequence and can be visualized in genome browser software such as the UCSC Genome Browser and Ensembl. Track files summarize the raw data according to specific criteria and at a specific granularity, for example “hot spot” regions (with a high number of reads), values deviating from expectations, or cross-genomic links representing closeness in 3D. In essence, the condensed data in track files relate to the raw data much like an abstract relates to a scientific publication. This data reduction allows researchers to scan large amounts of data to define a hypothesis before carrying out more accurate analyses.

Figure: Different types of genomic tracks visualized in the UCSC Genome Browser.
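To make the idea of a track file concrete, here is a minimal sketch that reads a few regions in BED format and reports how many bases they cover. BED is one common track format accepted by genome browsers; this post does not single it out, so treat the choice of format as an assumption. The region names and coordinates are invented for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str


def parse_bed(text: str) -> list[Interval]:
    """Parse the first four BED columns, skipping header and comment lines."""
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else "."
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2]), name))
    return intervals


def covered_bases(intervals: list[Interval]) -> int:
    """Sum of interval lengths (assumes the intervals do not overlap)."""
    return sum(iv.end - iv.start for iv in intervals)


# Invented example data: three "hot spot" regions on two chromosomes.
EXAMPLE_BED = """\
track name="hypothetical_peaks" description="Illustrative hot spot regions"
chr1\t1000\t1500\tpeak_1
chr1\t4000\t4200\tpeak_2
chr2\t300\t900\tpeak_3
"""

intervals = parse_bed(EXAMPLE_BED)
print(len(intervals), "regions covering", covered_bases(intervals), "bases")
```

A handful of lines like these can stand in for millions of raw sequencing reads, which is what makes tracks convenient for a first scan of the data.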

Major obstacles to the reuse of data

Much of the data from major consortia are well annotated with metadata and are available through dedicated data portals. However, the metadata are provided according to distinct models and APIs, with varying levels of breadth and granularity.

Also, many tracks generated in smaller research projects are available in track hubs, listed in the UCSC Public Track Hubs page and the Track Hub Registry. However, these are mostly indexed at the track collection level only, whereas the metadata of interest to researchers tend to reside at the level of individual experiments, samples, or track files. In practice, it is often painfully difficult for researchers to locate the track files relevant to their specific analytical contexts.

We believe the lack of interoperable metadata adhering to the FAIR data principles is a significant obstacle to the reuse of track data at all stages of the scientific process.

Through ELIXIR funds, we assembled a small group of people with broad experience in track data production, interoperability and analysis (1-7) to come up with a possible solution to these issues.

A new solution to FAIRify genomic track metadata

We have proposed a minimal draft standard named FAIRtracks, based on JSON Schema, with core fields that we have found helpful for data analysis. By being minimal, FAIRtracks somewhat contradicts principle F2, which encourages extensive and generous metadata; this is its single exception to the FAIR principles. However, the more generous the metadata are allowed to be, the more difficult it is to enforce stringency. Principle F2 therefore seems to be formulated mainly with free-text search or similar functionality in mind.

In contrast, FAIRtracks is more of a “metadata exchange standard” designed around a set of main object types: track collections, studies, experiments, samples, and track files. All of these can refer directly to records in other repositories containing richer metadata. With this solution, we can enforce the strictness in the core metadata fields required to provide accurate categorical search functionality to end-users.
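As a rough illustration of what strictness in the core fields can mean, the sketch below checks a small metadata document for required fields and for references between object types that do not resolve. The field names (`local_id`, `study_ref`, `file_url`, and so on) and the rules are hypothetical and chosen only for this example; they are not the actual FAIRtracks schema, and the real standard is expressed in JSON Schema rather than hand-written Python. The document as a whole stands in for one track collection.

```python
# Hypothetical required fields per object type (NOT the real FAIRtracks schema).
REQUIRED = {
    "studies": ["local_id", "study_name"],
    "samples": ["local_id", "sample_type"],
    "experiments": ["local_id", "study_ref", "sample_ref", "technique"],
    "tracks": ["local_id", "experiment_ref", "file_url"],
}

# (object type, field) -> object type whose local_id the field must match.
REFERENCES = {
    ("experiments", "study_ref"): "studies",
    ("experiments", "sample_ref"): "samples",
    ("tracks", "experiment_ref"): "experiments",
}


def validate(doc: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    # 1. Every object must carry all of its required fields.
    for obj_type, fields in REQUIRED.items():
        for i, obj in enumerate(doc.get(obj_type, [])):
            for field in fields:
                if not obj.get(field):
                    errors.append(f"{obj_type}[{i}]: missing required field '{field}'")
    # 2. Every reference must point at an existing object of the right type.
    ids = {t: {o.get("local_id") for o in doc.get(t, [])} for t in REQUIRED}
    for (obj_type, field), target in REFERENCES.items():
        for i, obj in enumerate(doc.get(obj_type, [])):
            ref = obj.get(field)
            if ref and ref not in ids[target]:
                errors.append(f"{obj_type}[{i}]: {field} '{ref}' not found in {target}")
    return errors


good_doc = {
    "studies": [{"local_id": "study_1", "study_name": "Hypothetical study"}],
    "samples": [{"local_id": "sample_1", "sample_type": "blood"}],
    "experiments": [{"local_id": "exp_1", "study_ref": "study_1",
                     "sample_ref": "sample_1", "technique": "ChIP-seq"}],
    "tracks": [{"local_id": "track_1", "experiment_ref": "exp_1",
                "file_url": "https://example.org/track_1.bed"}],
}

# Same document, but the only track lacks a file URL and points nowhere.
bad_doc = dict(good_doc, tracks=[{"local_id": "track_2", "experiment_ref": "exp_missing"}])

print(validate(good_doc))
for problem in validate(bad_doc):
    print(problem)
```

Checks of this kind are only feasible when the set of core fields is small and fixed, which is the trade-off behind keeping the standard minimal.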

As a result, FAIRtracks can bridge specialized data portals and analysis tools. As a proof of concept, we have implemented a set of services that comprise the FAIRtracks ecosystem, including metadata validation and search capabilities through the TrackFind service. FAIRtracks makes heavy use of ontologies, while the identifiers are actionable through the services Identifiers.org and N2T.
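An “actionable” identifier is one that can be turned into a working link. The sketch below shows the general idea for compact identifiers of the form `prefix:accession`, using the resolver base URLs `https://identifiers.org/` and `https://n2t.net/`. The helper name `curie_to_url` and the validation pattern are assumptions made for this example and are not part of FAIRtracks; the code only builds the URL and makes no network request.

```python
import re

# Resolver base URLs; a compact identifier is appended directly to them.
RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t": "https://n2t.net/",
}

# Simplified shape check for "prefix:accession" (an assumption for this sketch).
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z0-9_.]+):(?P<accession>\S+)$")


def curie_to_url(curie: str, resolver: str = "identifiers.org") -> str:
    """Turn a compact identifier into a resolvable URL (hypothetical helper)."""
    match = CURIE_PATTERN.match(curie)
    if match is None:
        raise ValueError(f"not a compact identifier: {curie!r}")
    return f"{RESOLVERS[resolver]}{match.group('prefix')}:{match.group('accession')}"


# NCBI Taxonomy entry for Homo sapiens, used here purely as an example.
print(curie_to_url("taxonomy:9606"))
print(curie_to_url("taxonomy:9606", resolver="n2t"))
```

Storing the compact form in the metadata and leaving resolution to a service keeps the records short while still letting each term or record be looked up directly.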

Implications for open science

We invented FAIRtracks in the spirit of open science. We would very much like to bring a community together on these ideals. In our dream scenario, FAIRtracks will be adopted and further developed by the community as a standard. We hope it will connect data producers, tool developers, the FAIR community, and data analysts/researchers. As a result, FAIRtracks will make it possible to mobilize the wealth of existing and newly generated track files.

Adoption of FAIRtracks could allow for creative and novel discoveries based on open data, further inspiring focused research projects with breakthrough, real-world applications. We believe there is great scientific potential in leveraging public data sources this way. This should be of interest to many, especially to research communities with little funding available. It could even inspire other scientific fields.

Challenges ahead

Designing a standard without direct consortium backing is an uphill battle. The most successful standards in bioinformatics have been set under the leadership of large consortia, whose datasets are often so valuable that users are willing to adjust their working habits and adopt new standards to access them.

With the current focus from funding agencies on FAIR and open science, smaller projects are also now searching for solutions to FAIRify their research output. As such, this might present an opportunity for bottom-up approaches like ours to take hold.

We need initial adopters and contributors, but most potential users have plenty on their plates already. However, we are fortunate to have been selected as an ELIXIR Recommended Interoperability Platform. This gives us some organizational backing and has increased our visibility.

Evolving FAIRtracks into a community standard

We are interested in bringing together several types of communities: data producers, biocurators, tool developers (including developers of genome browsers), domain experts (to help expand and fine-tune the standard), the FAIR community, and the researchers/data analysts that are the target end-users.

A particular challenge is moving existing track metadata into the FAIRtracks ecosystem. Initially, we performed this operation in a more “ad hoc” manner for metadata available through BLUEPRINT. We are currently setting up a more reusable infrastructure to help move larger amounts of existing and novel metadata into the ecosystem. It is also important to integrate with other relevant tools and frameworks, as exemplified by ongoing work to integrate with the Galaxy framework. Any community contributions in these regards are much appreciated.

Have your say

For FAIRtracks to evolve into a community standard, interested parties must share their impressions of FAIRtracks in one form or another, whether small or large. To that end, we have created a survey that we now open for the second round of community feedback. We plan to host an online workshop in collaboration with the ELIXIR interoperability platform soon. We are very interested in getting in touch with potential participants to receive ideas on its contents.

Read the full Opinion Article today on F1000Research or visit the FAIRtracks website to learn more about FAIRtracks. You can stay up to date with the latest on FAIRtracks by following on Twitter or subscribing to the FAIRtracks mailing list.

Do you have feedback to share? Fill out the FAIRtracks survey now and have your say.

References:

1. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2021 Nov 17;gkab1049.

2. FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, et al. A promoter-level mammalian expression atlas. Nature. 2014 Mar 27;507(7493):462–70.

3. Fernández JM, de la Torre V, Richardson D, Royo R, Puiggròs M, Moncunill V, et al. The BLUEPRINT Data Analysis Portal. Cell Systems. 2016 Nov 23;3(5):491-495.e5.

4. Harrison PW, Fan J, Richardson D, Clarke L, Zerbino D, Cochrane G, et al. FAANG, establishing metadata standards, validation and best practices for the farmed and companion animal community. Anim Genet. 2018 Dec;49(6):520–6.

5. Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, et al. The Genomic HyperBrowser: inferential genomics at the sequence level. Genome Biol. 2010;11(12):R121.

6. Simovski B, Vodák D, Gundersen S, Domanska D, Azab A, Holden L, et al. GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome. GigaScience. 2017 Jul 1;6(7):1–12.

7. Stunnenberg HG, International Human Epigenome Consortium, Hirst M. The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery. Cell. 2016 Dec 15;167(7):1897.

The rest of the FAIRtracks team, by group affiliation. (Top) From the Track Hub Registry group at EMBL-EBI, Hinxton, UK: Sanjay Boddu, Peter Harrison, Kieron Taylor*, and Daniel Zerbino*. (Middle) From ELIXIR Norway at the Centre for Bioinformatics, University of Oslo (UiO): Dmytro Titov*, Radmila Kompova*, Ahmed Ghanem, Nazeefa Fatima, Federico Bianchini, and Eivind Hovig. (Bottom, first two) From ELIXIR Spain at the Life Sciences Department of the Barcelona Supercomputing Center (BSC): José María Fernández and Salvador Capella-Gutierrez. (Bottom, third) From ELIXIR Norway at the Computational Biology Unit, University of Bergen (UiB): Matúš Kalaš. (Bottom, fourth) From ELIXIR Norway at the Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU): Finn Drabløs. *Has changed affiliation since contributing to FAIRtracks.
