It is as easy as 1, 2, 3 – just ten steps to genome assembly
5 June, 2018 | Henrik Lantz
Henrik Lantz shares his advice on genome assembly and genome annotation, helping researchers to avoid delays and to improve the quality of their sequence.

In our work on the ELIXIR-EXCELERATE task ‘Capacity Building in Genome Assembly and Annotation’, we noticed a need for a document to help researchers get started with genome assembly and genome annotation.
It is difficult to compare subjects in bioinformatics, but at least in my mind, genome assembly and genome annotation are among the most exploratory and resource-demanding subjects of all. In particular, projects involving eukaryotes can often take months or even years to get to an annotated genome of reasonable quality. Our ten steps will hopefully lower the barrier to entry for these complicated, but necessary, steps of a genome project.
How it will help authors
We want to make researchers think before starting. We have seen many projects fail in the planning stage by not using the right type of data or not making sure enough computing resources are available.
By giving advice all the way from DNA extraction to submission of results to a suitable repository, we hope to help researchers avoid unnecessary delays and improve the quality of the final results.
Advice for first timers
When embarking on a genome assembly project for the first time, it is important to think about the organism you are working on. What specific challenges does the genome present? Does it have a high repeat content, or do you expect to see high levels of heterozygosity?
If it does, you need to order the right type of sequence data to help you deal with this. Also, try to stay focused on what you actually need the assembled genome for. Perhaps a fragmented genome will suffice, as long as the parts you are interested in have been assembled. Do not fall into the trap of thinking you must have a perfect genome, because this can easily lead to a project that never ends.
The key aspects in genome assembly projects
DNA extraction is probably the single most important step of a genome assembly project. I cannot think of any step that influences the end results to a higher degree. Extracting DNA for genome assembly is very different from extracting it for a polymerase chain reaction (PCR). Make sure to spend time here because it will save you all sorts of hassle later.
Beyond this, careful quality control of the sequence data is also vitally important. Just looking at the quality values in the FASTQ files (FASTQ is a text-based format that stores sequence reads together with a quality score for each base) is of limited value. If you can, investigate the k-mer composition of the reads (k-mers are all the subsequences of a fixed length k that occur in the reads) and look for contamination before going into the actual assembly phase.
Sometimes this will even tell you that the data is too poor to be used, for example if you have a severe case of contamination. Then you can go back and order new data rather than spend time analysing data that will not help you answer the questions at hand.
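To make the idea of k-mer composition concrete, here is a minimal Python sketch that counts k-mers in FASTQ reads and summarises them as a k-mer spectrum (how many distinct k-mers occur once, twice, and so on). It is only an illustration: the function names and the tiny example reads are made up, it assumes simple four-line FASTQ records, and it does not merge a k-mer with its reverse complement. For real data sets a dedicated k-mer counting tool will be far faster.

```python
from collections import Counter


def read_fastq_sequences(lines):
    """Yield the sequence of each record, assuming 4 lines per FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:  # second line of every record holds the bases
            yield line.strip().upper()


def kmer_counts(sequences, k):
    """Count every subsequence of length k, skipping k-mers that contain N."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


# Hypothetical toy input; in practice you would pass an open FASTQ file.
example_fastq = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTNCGT", "+", "IIIIIIII",
]

counts = kmer_counts(read_fastq_sequences(example_fastq), k=4)
spectrum = kmer_spectrum(counts)
```

In a real project, a large number of k-mers seen only once typically points to sequencing errors, and unexpected extra peaks in the spectrum can be a hint of heterozygosity, repeats or contamination that is worth investigating before assembly.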
How can users improve quality, reusability and sustainability of their results?
The two strongest practical recommendations for improving the quality, reusability and sustainability of results are to use containers to ensure that analyses are repeatable, and to submit data, tools, and results to suitable repositories.
On a more general scale, I would like researchers to have the FAIR (Findable, Accessible, Interoperable and Reusable) principles in mind from the beginning of a project. Think about what you can do to make sure your results are FAIR. It can be as simple as documenting the metadata of your samples or keeping a log of the tools and settings used to analyse the data, but this is easy to forget.
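As one possible way of keeping such a log, the Python sketch below records each analysis step as one JSON line holding the tool, its version, the settings, the input files and the sample metadata. Everything in it is an assumption made for illustration: the field names, the file names and the tool called "example-assembler" are hypothetical and do not come from any standard.

```python
import json
from datetime import datetime, timezone


def make_log_entry(tool, version, parameters, inputs, sample_metadata):
    """Collect what was run, on what, and how, in one dictionary."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "version": version,
        "parameters": parameters,
        "inputs": inputs,
        "sample_metadata": sample_metadata,
    }


def format_log_line(entry):
    """Serialise an entry as a single JSON line with stable key order."""
    return json.dumps(entry, sort_keys=True)


def append_to_log(path, entry):
    """Append one entry to a plain-text log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(format_log_line(entry) + "\n")


# Hypothetical example values, for illustration only.
entry = make_log_entry(
    tool="example-assembler",
    version="0.0.0",
    parameters={"kmer_size": 31, "threads": 8},
    inputs=["sample_R1.fastq.gz", "sample_R2.fastq.gz"],
    sample_metadata={"organism": "unknown", "tissue": "leaf"},
)
line = format_log_line(entry)
```

Calling append_to_log("analysis_log.jsonl", entry) after each step builds up a file that can be submitted alongside the results, so that others can see exactly which tools and settings produced them.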