Pre- and post-processing tools for next-generation sequencing de novo assemblies

Khaleel, Sari
University of Delaware
High-throughput Next-Generation Sequencing (NGS) technologies have revolutionized and accelerated genomic analyses. However, their shorter read lengths and higher error rates in comparison to classical Sanger sequencing have hindered downstream analyses such as de novo genome assembly. Therefore, there is an urgent need to develop tools for quality control and pre-processing of NGS short-read data and to systematically assess their impact on downstream analyses. Many de novo assembly projects that use NGS include a pre-processing step in which low-quality reads and sequence artifacts are cleaned from NGS reads using unpublished in-house scripts. Although some open-source and commercial trimming scripts and pre-processing tools are available, a simple and comprehensive open-source toolkit that implements the major trimming algorithms is currently lacking. Furthermore, most of these assembly projects assess an assembly by contiguity alone, without evaluating the correctness of the assembled contigs. The problem of misassembly is complicated further in metagenomic assemblies by contig chimerism, which is caused by the co-assembly of reads from two or more genomes at regions of sequence similarity. Recently developed metagenomic assemblers attempt to solve this problem during assembly, but they are trained on short, high-coverage Illumina reads and not on the long, low-coverage reads of 454, the preferred platform for metagenome sequencing. Contig chimerism affects downstream metagenome analyses, such as binning and gene prediction. Presented here are methods and tools for pre- and post-assembly processing of NGS data. For pre-processing of NGS reads, we developed ngsShoRT (next-generation-sequencing Short Reads Trimmer), a tool written in Perl that implements many of the algorithms commonly used in the trimming literature as well as methods developed by our group.
ngsShoRT was tested on Illumina paired-end (PE) reads of the Caenorhabditis elegans genome, and its trimming methods were compared by the improvement in assembly contiguity as well as in accuracy, as assessed with BLAST. A particular problem in trimming NGS reads is the identification of adaptor sequences that may not be detected by regular text-search algorithms. This problem can be managed by identifying and removing high-frequency k-mers in reads using kmerFreq, our adaptor sequence detection and trimming tool. kmerFreq was tested on 454 reads of a viral metagenome specimen, and the resulting improvement in assembly contiguity and assembler performance was evaluated. Finally, an analysis pipeline for contig chimerism in metagenomic assemblies of simulated 454 reads is presented and tested on a simulated bacterial metagenome. Our hypothesis is that chimeric contigs can be identified in a metagenomic assembly by the presence of unique coverage and polymorphism attributes that distinguish them from non-chimeric contigs. To train this post-assembly approach, a chimeric contig simulation and analysis pipeline was developed to study contig chimerism in assemblies of simulated metagenomes. The pipeline was used to simulate a bacterial metagenome and to analyze and compare chimeric and non-chimeric contigs in its assembly. The results of this analysis may provide insight into coverage and polymorphism patterns in chimeric contigs, and may be useful for the detection of chimeric contigs in a real metagenome assembly.
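To make the idea of detecting adaptors through high-frequency k-mers concrete, the following is a minimal Python sketch. It is not the kmerFreq implementation (that tool is not described in detail here), and the function names, the `min_fraction` threshold, and the choice to clip each read from the first frequent k-mer to its 3' end are all illustrative assumptions.

```python
from collections import Counter


def frequent_kmers(reads, k, min_fraction):
    """Return the k-mers that occur in at least `min_fraction` of the reads.

    Hypothetical helper: each k-mer is counted once per read, so a k-mer
    repeated inside a single read does not inflate its frequency.
    """
    presence = Counter()
    for read in reads:
        kmers_in_read = {read[i:i + k] for i in range(len(read) - k + 1)}
        presence.update(kmers_in_read)
    return {kmer for kmer, n in presence.items() if n / len(reads) >= min_fraction}


def clip_at_frequent_kmer(read, frequent, k):
    """Clip a read at the leftmost frequent k-mer.

    Assumption: the over-represented sequence is a 3' adaptor, so everything
    from its first base onward is discarded. Reads without a hit are unchanged.
    """
    for i in range(len(read) - k + 1):
        if read[i:i + k] in frequent:
            return read[:i]
    return read


# Toy reads: three of the four end in the same made-up adaptor-like suffix.
reads = [
    "ACGTTGCAAGGCTAGATCGGAAG",
    "TTGACCATGAGATCGGAAG",
    "GGCATTCAGCAGATCGGAAG",
    "CCATGACGTTAGCA",
]
k = 8
frequent = frequent_kmers(reads, k, min_fraction=0.6)
trimmed = [clip_at_frequent_kmer(r, frequent, k) for r in reads]
```

In practice, a threshold of this kind would need to be tuned to the data set, since genuinely repetitive or highly covered genomic regions can also produce over-represented k-mers.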