Investigating the nature of metagenomic ORFans: unknown proteins or analytical artifacts?

Date
2014
Publisher
University of Delaware
Abstract
The study of viruses is limited by the inability to culture most of them and by the lack of a universal genetic marker (such as 16S rRNA in cellular organisms). Shotgun metagenomics, the simultaneous sequencing of all viral DNA in a sample, has emerged as an approach that overcomes many of these limitations. However, analysis poses unique challenges due to fragmented sequences, gene structure, and the underrepresentation of viruses in sequence databases. VIROME is a bioinformatics platform that simplifies viral metagenomic analysis and exploration. A key step is the prediction of open reading frames (ORFs) from metagenomes. Despite comparison of these ORFs against several reference databases, a substantial number show no homology to previously observed proteins and are therefore classified as ORFans. This study characterized ORFans to determine whether they represent unknown proteins or may be artifactual. Predicted ORFs from metagenomic samples on VIROME were compared by BLAST against UniRef100 and MgOl environmental database releases since 2005. Over this time course the number of hits increased and the number of ORFans decreased as the databases grew to account for more proteins, indicating that some ORFans were real proteins. However, a significant number remain classified as ORFans. We studied these ORFans to find whether any characteristics distinguish them from non-ORFans. In general, ORFans had lower ORF caller scores and shorter read lengths than non-ORFans. The ORFan fraction was also more likely to show over-representation of several k-mers. Homopolymeric k-mers were particularly over-represented in ORFans from 454 pyrosequencing, potentially indicating sequencing platform artifacts. We next assessed several ORF callers to determine whether ORFs are being wrongly predicted. Three callers, MetaGeneAnnotator (MGA), MetaGeneMark, and Orphelia, were applied to eleven viruses, both as whole genomes and shredded to simulate metagenomes. MGA had the best overall performance, with a precision of 0.82 and a sensitivity of 0.74.
The precision results indicate that a significant number of false positives would be expected, and these likely contribute to ORFans. Varying cutoff filters for ORF length and ORF score were assessed; the results indicate that increasing the cutoffs increases precision but lowers sensitivity. The findings indicate that a significant fraction of ORFans are likely artifacts of the sequencing platform and the ORF caller. These false positives can be managed by applying cutoffs, but the gain in precision must be balanced against the loss of sensitivity.
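
The trade-off between precision and sensitivity described above can be illustrated with a minimal sketch. It is not the method used in the study: the `Orf` record, the `precision_sensitivity` function, the exact-coordinate matching rule, and the toy coordinates and scores are all assumptions made for illustration. Precision is taken as true positives divided by all retained predictions, and sensitivity as true positives divided by all reference genes.

```python
from typing import NamedTuple


class Orf(NamedTuple):
    """A predicted ORF (hypothetical record layout, for illustration only)."""
    start: int
    end: int
    strand: str
    score: float


def precision_sensitivity(predicted, reference, min_length=0, min_score=0.0):
    """Return (precision, sensitivity) after applying length and score cutoffs.

    A prediction counts as a true positive only if its (start, end, strand)
    exactly matches a reference gene. This matching rule is an assumption;
    a real evaluation might accept partial overlaps or stop-codon matches.
    """
    kept = {
        (o.start, o.end, o.strand)
        for o in predicted
        if (o.end - o.start + 1) >= min_length and o.score >= min_score
    }
    true_pos = len(kept & reference)
    precision = true_pos / len(kept) if kept else 0.0
    sensitivity = true_pos / len(reference) if reference else 0.0
    return precision, sensitivity


# Toy reference annotation: three known genes (made-up coordinates).
reference = {(1, 300, "+"), (400, 900, "-"), (1000, 1150, "+")}

# Toy ORF caller output: three correct calls and two false positives.
predicted = [
    Orf(1, 300, "+", 5.0),
    Orf(400, 900, "-", 8.0),
    Orf(1000, 1150, "+", 2.0),
    Orf(1200, 1290, "+", 1.0),  # short false positive
    Orf(50, 200, "-", 1.5),     # false positive
]

for cutoff in (0, 150, 200):
    p, s = precision_sensitivity(predicted, reference, min_length=cutoff)
    print(f"min_length={cutoff}: precision={p:.2f} sensitivity={s:.2f}")
```

In this toy data, raising the length cutoff first removes a short false positive and later also removes a short but genuine gene, which is the same balance between precision and sensitivity that the abstract describes.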