Improving eukaryotic genome assembly through application of single molecule real-time sequencing data genome: coffee leaf rust fungus, H. vastatrix

Date
2014
Publisher
University of Delaware
Abstract
Coffee production is threatened worldwide by coffee leaf rust disease. The fungal pathogen, Hemileia vastatrix, is estimated to have the largest fungal genome known. Because no draft genome is available, genome sequencing and assembly are a fundamental step toward understanding the infection mechanism of the disease. Next-generation sequencing (NGS) technologies have been applied successfully to the whole-genome sequencing and assembly of many genomes. Second-generation sequencing technologies, such as Illumina, are known for their high throughput but are limited by short read lengths and systematic biases. Applying such technologies to large, more complex genomes results in numerous inaccuracies owing to their inability to handle repeat regions and sequencing errors. Longer reads produced by third-generation sequencing technologies, notably the PacBio RS II (Pacific Biosciences Inc.), show promise for overcoming these issues, as demonstrated by accurate bacterial-scale genome assemblies and by improvements to existing eukaryotic genomes through gap filling and sequencing through repetitive regions. These technologies, however, are limited by a high error rate and lower throughput. In this study, we developed a three-stage pipeline to assess the performance of de novo assembly algorithms (SOAPdenovo2, CLC Genomics Workbench (CLC), and Velvet), error correction tools (LSC and PacBioToCA), and the whole-genome shotgun assembler Celera for the whole-genome assembly of large eukaryotic genomes. The pipeline used synthetic PacBio RS II CLR (Continuous Long Reads) and Illumina paired-end reads created from the Arabidopsis thaliana genome as a proxy for H. vastatrix. At each stage, performance was assessed by mapping to the reference genome with BLASR and BWA-MEM and visualized using SeqMonk and CLC.
The results showed that the pipeline can produce long scaffolds with low nucleotide mapping error. The best overall performance came from the whole-genome shotgun assembly of SOAPdenovo2 scaffolds and PacBioToCA contigs, which produced long genome scaffolds (>1.8 Mb) with high N50 and no captured gaps, spanning 93% of the reference genome with 1% nucleotide mapping error. These findings demonstrate that long genomic scaffolds for complex eukaryotic genomes such as H. vastatrix can be created from NGS data when appropriate de novo assembly algorithms are applied.
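For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how N50 and the fraction of a reference genome spanned by mapped scaffolds are commonly computed. It is not the code used in the study. The function names `n50` and `covered_fraction` are hypothetical, and the sketch assumes scaffold lengths and mapped intervals (half-open, 0-based) have already been extracted from the assembly and alignment output.

```python
def n50(lengths):
    """Return N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def covered_fraction(intervals, reference_length):
    """Return the fraction of the reference covered by the union of
    half-open (start, end) intervals, counting overlaps only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / reference_length


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
    print(covered_fraction([(0, 50), (40, 93)], 100))
```

A high N50 indicates that most assembled bases lie in long scaffolds, while the covered fraction indicates how much of the reference those scaffolds account for; the two are complementary, since an assembly can score well on one and poorly on the other.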