Unraveling viral gene associations through integrative computational approaches

Publisher

University of Delaware

Abstract

Viruses play a central role in shaping microbial ecology and evolution, influencing processes ranging from nutrient cycling to host population dynamics. Understanding viral diversity and function is therefore essential to understanding microbial ecosystems. However, viral genomes remain difficult to interpret because large portions of their genetic material have unknown or poorly characterized functions. This dissertation addresses two key aspects of viral genomic analysis: the interpretation of viral dark matter (genomic regions lacking informative functional annotation) and the characterization of replication-associated proteins. It introduces a combination of computational frameworks that enhance functional prediction by drawing on protein clustering, gene neighborhood analysis, and context-aware protein embeddings. To investigate viral dark matter, co-occurrence analysis of both well-characterized and ambiguous protein clusters shows that these unannotated regions are not randomly distributed but are structurally organized and functionally linked to critical viral processes such as replication, assembly, and genome packaging. For this purpose, the ORf Interaction Ontology Network (ORION) was developed to organize open reading frames into syntenic cluster blocks, reducing sequence complexity and exposing conserved gene patterns. In parallel, the study examines the sequence and structural characteristics of replication-associated proteins, including DNA polymerase A, ribonucleotide reductase, and DNA helicase, to better understand their role in gene neighborhood organization and viral replication strategies. This analysis is supported by PHIDRA (Protein Homology Identification via Domain-Related Architecture), which provides domain-level context to improve traditional functional annotations and support genofeature assignments at the biochemical and family levels.
Additionally, an embedding-based pipeline fine-tuned on viral sequences captures context-aware features at both the primary and secondary sequence levels, allowing high-resolution clustering beyond the limits of traditional sequence identity thresholds. Together, these integrative approaches uncover biologically relevant patterns within viral genomes, reveal ecological and evolutionary trends, and improve the functional interpretation of both well-annotated and previously uninformative proteins. This work contributes a scalable and adaptable suite of computational tools that advances the study of viral genomics and metagenomics.
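
As a rough illustration of the co-occurrence idea described in the abstract, the sketch below counts how often two protein cluster labels fall within a few open reading frames of each other across genomes. It is not the ORION implementation; the function name `cooccurrence_counts`, the `window` parameter, the genome IDs, and the cluster labels (including `unk_17`) are all hypothetical and chosen only for this example.

```python
from collections import Counter


def cooccurrence_counts(genomes, window=2):
    """Count genomes in which two cluster labels lie within `window` ORFs.

    genomes: dict mapping a genome ID to the cluster labels of its ORFs,
             listed in genomic order (hypothetical input layout).
    Each unordered pair of labels is counted at most once per genome.
    """
    counts = Counter()
    for labels in genomes.values():
        pairs = set()
        for i, a in enumerate(labels):
            # Look only at the next `window` ORFs downstream of position i.
            for b in labels[i + 1 : i + 1 + window]:
                if a != b:
                    pairs.add(tuple(sorted((a, b))))
        counts.update(pairs)
    return counts


# Toy data: three invented genomes with annotated and unannotated clusters.
genomes = {
    "phage_1": ["PolA", "unk_17", "helicase", "RNR"],
    "phage_2": ["unk_17", "PolA", "terminase"],
    "phage_3": ["RNR", "helicase", "unk_17", "PolA"],
}

counts = cooccurrence_counts(genomes, window=2)
for pair, n in counts.most_common():
    print(pair, n)
```

In this toy input, an unannotated cluster that repeatedly sits next to a replication-associated cluster would surface as a frequent pair, which is the kind of signal a gene neighborhood analysis looks for. A real analysis would also need to account for strand, contig boundaries, and the background frequency of each cluster.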
