More than a spoonful of sugar - the complex sugarcane genome

9 August, 2017

F1000Research authors discuss sugarcane and the challenge of sequencing its genome

Image credit: Thamizhpparithi Maari, Wikimedia Commons, CC BY-SA 3.0

Two different research groups, the J. Craig Venter Institute (JCVI) and the Brazilian Center for Research in Energy and Materials, published two separate data articles on the sugarcane genome in the GODAN Gateway to provide a resource to help people find organisms living on or in their sugarcane crops. In this guest blog, the two lead authors from each article explain how they learnt that the genome was much larger than they anticipated.

Sugarcane is important economically. As a crop, it is used to make sugar, ethanol and electricity. As a biofuel feedstock, sugarcane is used to make first-generation bioethanol (from the edible sugars) as well as second-generation bioethanol (from the inedible biomass). It is already cultivated on over 9 million hectares in Brazil, and we believe production can be increased through better management practices, traditional breeding programs, or genetic modification.

From the genomics point of view, sugarcane is very complex. Let’s compare the sugarcane genome to the human genome. The human genome is usually described as having 3 billion bases (3 Gbp). That figure actually describes the “haploid” genome size, or the size of one copy of each chromosome. Actual human somatic cells are “diploid”: they have two copies of each chromosome, so the human genome sequence is really 6 Gbp. Since there are very few differences between the two copies in humans, it was safe to assume the 3 Gbp size during the Human Genome Project (HGP), and indeed the HGP generated one 3 Gbp sequence.

Sugarcane is “polyploid”, i.e., it has many copies of each chromosome. And it gets worse because the copies are not the same! The sugarcane cultivars in use today are hybrids generated by crossing different species of plants. The goal was to combine the best traits of several species. In most cases, the parent species were already polyploid and the hybrids had higher chromosome counts than either parent.

Constructing the bigger picture from an even bigger genome

Today’s cultivars mainly originate from crosses between two already polyploid parental species: Saccharum officinarum, which brought high sugar-producing traits, and S. spontaneum, which brought abiotic and biotic resistance traits. The exact ploidy of the hybrids is uncertain, but it could vary between 8 and 12 copies of each chromosome per cell, and not all chromosomes appear with the same number of copies, a phenomenon called aneuploidy. So, reconstructing a sugarcane genome is going to be much more complicated than reconstructing the human genome.
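To make the haploid, diploid and polyploid comparison concrete, here is a minimal sketch of the arithmetic. The human figures (3 Gbp haploid, two copies) come from the text above. The size of one basic sugarcane chromosome set is not given in this post, so `ASSUMED_SUGARCANE_BASE_GBP` is a placeholder chosen purely for illustration, and the uniform copy number ignores the aneuploidy just described.

```python
def total_sequence_gbp(single_copy_gbp: float, copies: int) -> float:
    """Total DNA per cell if every chromosome is present in `copies` copies."""
    return single_copy_gbp * copies


# Human: 3 Gbp haploid genome, diploid somatic cells (figures from the text).
human_total = total_sequence_gbp(3.0, 2)

# Sugarcane: ASSUMPTION, placeholder size for one basic chromosome set.
# This value is illustrative only and is not taken from the post.
ASSUMED_SUGARCANE_BASE_GBP = 1.0

# The text gives a possible range of 8 to 12 copies per cell.
sugarcane_totals = {
    copies: total_sequence_gbp(ASSUMED_SUGARCANE_BASE_GBP, copies)
    for copies in range(8, 13)
}

print(f"human: {human_total} Gbp per cell")
for copies, total in sugarcane_totals.items():
    print(f"sugarcane, {copies} copies: {total} Gbp per cell (assumed base size)")
```

The point of the sketch is only that the amount of distinct sequence to reconstruct scales with the number of non-identical copies, which is why a polyploid hybrid cannot be treated like a single 3 Gbp human reference.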

Our groups sequenced different cultivars. The JCVI group sequenced one called CP 96-1252. This hybrid, developed by the USDA, is the top commercial sugarcane cultivar planted in the state of Florida. Its genome is derived from four species and the genetic contribution per species is probably unequal. The Brazilian group sequenced cultivar SP80-3280, which was commercially released in Brazil in 1997. It is still among the twenty most planted in the country. It has been extensively used as a workhorse in genomics studies in Brazil. For example, it was the cultivar that contributed most to the EST sequencing project at the beginning of the 2000s, and several research groups have generated BAC sequences, shotgun genome sequences and transcriptome sequences from it, and have surveyed its microbiota.

A large complex genome that can be explored with long sequencing reads

Our groups applied different DNA sequencing methods. The JCVI group used the Illumina NextSeq sequencing machine, which generates lots of short reads quickly. In fact, we generated over 200 billion bases in over 1.3 billion reads. At that point, we stopped to ask how much we had achieved. We analysed the genome to see if we had reached 10X coverage, meaning 10 reads from each unique sequence in the genome, which would have been a good start. We found we had not even reached 1X, so there was no point in trying to assemble these reads. In other words, we learned that the genome was larger than we had thought.
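As a rough illustration of what “coverage” means, the sketch below uses the textbook depth ratio: total sequenced bases divided by genome size. How the JCVI group actually measured coverage is not described in this post, so this is not a reconstruction of their analysis. The read totals are the lower bounds quoted above, and the 3 Gbp haploid human genome is used only as a familiar yardstick.

```python
def coverage(total_bases: float, genome_size_bases: float) -> float:
    """Average depth: how many times each base is sequenced on average."""
    return total_bases / genome_size_bases


def bases_needed(target_coverage: float, genome_size_bases: float) -> float:
    """Sequenced bases required to reach a target average depth."""
    return target_coverage * genome_size_bases


# Lower bounds quoted in the text ("over 200 billion bases in over 1.3 billion reads").
TOTAL_BASES = 200e9
TOTAL_READS = 1.3e9

# Average read length implied by those two totals (roughly 154 bases).
mean_read_length = TOTAL_BASES / TOTAL_READS

# Yardstick only: the 3 Gbp haploid human genome mentioned earlier in the post.
HUMAN_HAPLOID_BASES = 3e9
human_equivalent_depth = coverage(TOTAL_BASES, HUMAN_HAPLOID_BASES)
human_10x_bases = bases_needed(10, HUMAN_HAPLOID_BASES)

print(f"mean read length: {mean_read_length:.0f} bases")
print(f"depth on a 3 Gbp genome: {human_equivalent_depth:.1f}X")
print(f"bases needed for 10X of 3 Gbp: {human_10x_bases:.0f}")
```

The same ratio can be turned around: if a target depth is not reached with a known amount of data, the sequence being covered must be larger or more varied than expected.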

The Brazilian group applied a different sequencing method to a different hybrid. The TruSeq Synthetic Long Reads method combined almost 2 billion short reads (close to 400 billion bases) into almost 1.4 million long reads. In doing so, we showed that long sequencing reads, TruSeq Synthetic Long Reads in our case, are useful for exploring the genome sequence of a complex polyploid. Previous attempts used only short sequencing reads, which produced fragmented assemblies with a reduced capacity to recover complete gene models.
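The rounded totals quoted above give a feel for how much short-read data goes into each synthetic long read. This is only back-of-the-envelope arithmetic on the approximate figures in the text (“almost” and “close to”), so the results are upper-bound estimates and not values reported by the authors.

```python
# Rounded figures from the text; the true values are slightly lower.
SHORT_READS = 2e9          # "almost 2 billion short reads"
SHORT_READ_BASES = 400e9   # "close to 400 billion bases"
LONG_READS = 1.4e6         # "almost 1.4 million long reads"

# Average length of one short read implied by the totals.
mean_short_read_length = SHORT_READ_BASES / SHORT_READS

# Average number of short reads consumed per synthetic long read.
short_reads_per_long_read = SHORT_READS / LONG_READS

print(f"mean short read length: {mean_short_read_length:.0f} bases")
print(f"short reads per long read: {short_reads_per_long_read:.0f}")
```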

After we assembled the long reads, we were surprised by the large number of genes that we were able to predict: over 150 thousand. Close to 40% of the predicted protein-coding genes appear exclusive to sugarcane when compared to other plant species.

Genome assemblies are hypotheses that should be shared

We are convinced that genomic data is more useful to science when it is published sooner rather than later. We could have waited until we generated a better assembly, but perfect genome assemblies do not exist. Assemblies are hypotheses that can be, and are, updated and revised, even more so for crops with complex genomes such as sugarcane.

In the same spirit, we released and published the unassembled short and long reads from both cultivars. The released unassembled reads could prove helpful if other groups incorporate these data into their sequencing projects. For cultivar CP 96-1252, we demonstrated that our starting material was sterile sugarcane, free of pathogens. We can imagine that another group working with this cultivar could test for pathogens by subjecting their plants to light sequencing and comparing their reads to ours.

A valuable resource for the community

We expect that our genome assembly and our DNA sequencing reads will be useful for biotechnology groups working on sugarcane. Our data could help them advance their sugarcane research projects. For instance, the data can be used to identify promoters of genes of interest associated with specific traits, or to exploit population genomics data. Our draft genome sequence could be used as a reference in genotyping-by-sequencing approaches. Until now, scientists have had to use the published genome sequence of Sorghum bicolor, which is in the same family as sugarcane, but clearly has a different genome.

The genes that we predicted from our assembly will almost certainly be used by other scientists. Scientists can select better targets for genetic modification in plants if they understand the evolutionary process that led to the genes that already exist. With many metabolic pathways of interest, for instance in the synthesis of lignin components, it is crucial to understand the roles and interactions of families of closely related genes. Our predicted gene set is rich in gene families due, in part, to the polyploid nature of the hybrid genome.

topics: Biology, Open research
