Genome Assembly and Analysis Methods
Recovery of high-quality and complete genomes from metagenomes
Genomes recovered from metagenomes are increasingly the main source of information about unculturable microbes and viruses. Current sequencing technologies produce reads that are too short to represent an entire genome, so the reads must be assembled into longer contiguous sequences. However, this complex process is prone to various types of errors, including misassemblies, gaps, and chimeras. These errors can misrepresent the biological attributes of an organism, leading to false or misleading evolutionary and ecological interpretations. As such, genome curation – the process of refining, correcting, and improving genome assemblies – is crucial. Genome curation has also allowed us to complete many genomes directly from environmental samples, providing model representations of uncultivated microbes (Chen et al., 2020, Genome Research). One of the pitfalls of genome curation is that it is largely a manual, time-intensive process that does not scale to the thousands of metagenome-assembled genomes (MAGs) that can be generated in contemporary metagenomic studies. We've recognized this challenge and are actively developing novel tools to automate genome curation. Our focus is on creating algorithms that can identify and rectify assembly errors at scale, increasing the accuracy and utility of the assembled genomes and thereby supporting more reliable insights into microbial ecology, evolution, and metabolism.
Addressing Binning Issues in Metagenomic Studies
Binning, the process of grouping assembled sequences into 'bins' that represent individual organisms, viruses, or mobile genetic elements, is a critical step in genome-resolved metagenomic studies. However, it is often plagued by issues such as contamination by fragments from other genomes and misbinning, where sequences are assigned to the wrong bins. These issues can significantly compromise the quality of MAGs, hindering our understanding of the microbial world. We're developing methods to automate the recovery of high-quality bins. Our tools are designed to improve genome bins, facilitating the construction of high-quality MAGs. Our ongoing research in automating genome curation for both assemblies and bins aims to ensure that our exploration of microbial life is based on the most accurate genomic data possible.
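As a rough illustration of how contaminating contigs might be spotted in a bin, the sketch below flags contigs whose GC content or mean coverage departs strongly from the bin median. This is a generic heuristic under stated assumptions, not the method used by our tools: the function names, the input layout (contig name mapped to a sequence and a mean coverage value), and the thresholds are all hypothetical.

```python
from statistics import median


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def flag_outlier_contigs(contigs, max_gc_diff=0.10, max_cov_ratio=2.0):
    """Return names of contigs that look inconsistent with the rest of a bin.

    contigs: dict mapping contig name -> (sequence, mean_coverage).
    Thresholds are illustrative assumptions, not recommended values.
    """
    gcs = {name: gc_fraction(seq) for name, (seq, _) in contigs.items()}
    med_gc = median(gcs.values())
    med_cov = median(cov for _, cov in contigs.values())

    flagged = []
    for name, (_, cov) in contigs.items():
        gc_off = abs(gcs[name] - med_gc) > max_gc_diff
        ratio = cov / med_cov if med_cov > 0 else float("inf")
        cov_off = ratio > max_cov_ratio or ratio < 1 / max_cov_ratio
        if gc_off or cov_off:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    toy_bin = {
        "c1": ("ATGCATGCAT", 20.0),
        "c2": ("ATGCATGCTA", 22.0),
        "c3": ("GGGCGCGCCG", 21.0),  # GC content far from the bin median
        "c4": ("ATGCATGCAA", 80.0),  # coverage far from the bin median
    }
    print(flag_outlier_contigs(toy_bin))
```

Flagged contigs are only candidates for manual or automated follow-up; real bins also vary in GC content and coverage for legitimate biological reasons.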
Identification and correction of assembly errors in metagenomics
Assembly errors occur frequently in metagenomic assemblies, so it is important to identify them and fix them where possible. They are usually caused by (1) the way the assembler resolves paths, particularly when it is based on a de Bruijn graph; (2) genomic fragments that occur multiple times in the community (within the same genome or across different genomes), for example transposons; (3) insufficient read coverage in some regions of a genome; (4) sequencing errors, which produce erroneous reads; and other factors. We have been developing pipelines and tools to identify different types of assembly errors, along with appropriate methods to fix them where applicable.
In practice, the two tasks described above are usually considered and carried out together when curating a genome to the best achievable state.
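One simple signal for cause (3) above is a stretch of a contig with little or no read support. The following minimal sketch scans a per-base depth profile and reports runs that fall below a threshold. It assumes the depth values have already been computed by mapping reads back to the assembly; the function name and the default thresholds are illustrative assumptions, not part of our pipelines.

```python
def low_coverage_regions(depth, min_depth=3, min_length=1):
    """Return half-open (start, end) intervals where depth < min_depth.

    depth: list of per-base read depths for one contig (0-based positions).
    min_length: shortest run of low-depth positions worth reporting.
    """
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < min_depth:
            if start is None:
                start = pos  # a low-coverage run begins here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the contig.
    if start is not None and len(depth) - start >= min_length:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    depth = [10, 12, 11, 2, 0, 1, 9, 10, 2, 12, 1, 1]
    print(low_coverage_regions(depth, min_depth=3, min_length=2))
```

Regions reported this way are candidates for inspection rather than confirmed errors; low depth at the very ends of a contig, for instance, is expected.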
Extension of metagenomic sequences for high-quality and complete genomes
In metagenomic assembly, the resulting contigs or scaffolds usually represent only a fraction of the corresponding genome; only in rare cases, primarily for viruses or plasmids, does a single sequence represent the complete genome. Once a long sequence with high estimated completeness has been obtained from a metagenomic assembly, one may want to complete the genome. One solution is to find the remaining piece(s) of the genome, design primer sets, and then amplify and sequence the regions between the pieces to bridge the gaps, as is usually done to obtain complete genomes for isolates. If the metagenomic sample was sequenced with paired-end short reads, it is also possible to bridge the gaps between the pieces of a genome using unplaced paired reads. For the two reads of a pair, if one read maps to a contig while the other does not, the unmapped one is called an unplaced paired read. Such reads can be used to extend the end of a contig when (1) multiple unplaced paired reads are available in the same region and yield a consensus sequence for extending the contig, and (2) their mapped mates are located nearby (usually within the insert size of the sequencing library). This approach is detailed in our previous review on how to obtain accurate and complete genomes from metagenomes (Chen et al., 2020, Genome Research) and has been used to generate complete prokaryotic, viral, and plasmid genomes. Because this process is labour-intensive, we are developing an automated pipeline to make it a routine step in metagenomic analyses.
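The two conditions for extending a contig with unplaced paired reads can be sketched in a few lines. The example below is a deliberately simplified illustration, not our pipeline: it assumes the unplaced mates have already been oriented to the contig's forward strand, extends only the 3' end, uses exact suffix-prefix overlaps in place of a real aligner, and so only uses mates that still overlap the contig end. All function names and parameters are hypothetical.

```python
from collections import Counter


def candidate_mates(pairs, contig_len, insert_size):
    """Condition (2): keep unplaced mates whose mapped partner starts
    within insert_size of the contig's 3' end.

    pairs: iterable of (mapped_start, unplaced_mate_sequence).
    """
    return [mate for pos, mate in pairs if contig_len - pos <= insert_size]


def overhang(contig, read, min_overlap):
    """Return the part of `read` that extends past the contig end, or None.

    Looks for the longest exact match between a prefix of the read and
    a suffix of the contig, requiring at least min_overlap bases.
    """
    max_ov = min(len(contig), len(read) - 1)
    for ov in range(max_ov, min_overlap - 1, -1):
        if contig.endswith(read[:ov]):
            return read[ov:]
    return None


def extend_contig(contig, pairs, insert_size=500, min_overlap=5, min_support=2):
    """Condition (1): extend base by base while at least min_support
    overhanging reads agree on the most common base."""
    mates = candidate_mates(pairs, len(contig), insert_size)
    overhangs = [o for o in (overhang(contig, m, min_overlap) for m in mates) if o]
    extension = []
    longest = max((len(o) for o in overhangs), default=0)
    for column in range(longest):
        bases = Counter(o[column] for o in overhangs if len(o) > column)
        base, count = bases.most_common(1)[0]
        if count < min_support:
            break  # too little read support to keep extending
        extension.append(base)
    return contig + "".join(extension)


if __name__ == "__main__":
    contig = "ACGTACGTTGCA"
    pairs = [
        (8, "TTGCAGGAT"),
        (6, "GCAGGATC"),
        (9, "GTTGCAGGTT"),
        (0, "TTGCACCCC"),  # mapped mate too far from the contig end
    ]
    print(extend_contig(contig, pairs, insert_size=10, min_overlap=3))
```

The 5' end could be handled the same way after reverse-complementing the contig and reads. After each round of extension, reads would normally be re-mapped to the longer contig so that newly placed pairs can support the next round.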