Improvements in viral gene annotation using large language models and soft alignments

Author(s): Harrigan, William L.; Ferrell, Barbra D.; Wommack, K. Eric; Polson, Shawn W.; Schreiber, Zachary D.; Belcaid, Mahdi
Date Accessioned: 2024-05-09T17:48:24Z
Date Available: 2024-05-09T17:48:24Z
Publication Date: 2024-04-25
Description: This article was originally published in BMC Bioinformatics. The version of record is available at: https://doi.org/10.1186/s12859-024-05779-6. © The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Abstract:
Background: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.
Results: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.
Conclusion: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
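Illustrative note: the abstract describes soft alignment as a traditional alignment procedure in which per-residue embedding similarity replaces a substitution matrix. The sketch below is one way that idea could look, not the authors' implementation. It assumes a Smith-Waterman-style local alignment scored by cosine similarity; the function names, the `gap` penalty, the `offset` subtracted from each similarity, and the toy 2-D vectors standing in for LLM residue embeddings are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_local_align(
    emb_a: Sequence[Vector],
    emb_b: Sequence[Vector],
    gap: float = 0.5,      # assumed linear gap penalty
    offset: float = 0.5,   # assumed shift so weak similarity scores negative
) -> Tuple[float, List[Tuple[int, int]]]:
    """Local alignment where the match score is embedding similarity.

    Returns the best score and the aligned (index_in_a, index_in_b) pairs.
    """
    n, m = len(emb_a), len(emb_b)
    # Similarity-derived score for every residue pair (replaces a matrix lookup).
    score = [[cosine(a, b) - offset for b in emb_b] for a in emb_a]
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + score[i - 1][j - 1],  # align residue i with j
                H[i - 1][j] - gap,                      # gap in sequence b
                H[i][j - 1] - gap,                      # gap in sequence a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local alignment score drops to 0.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        if math.isclose(H[i][j], H[i - 1][j - 1] + score[i - 1][j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif math.isclose(H[i][j], H[i - 1][j] - gap):
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy vectors standing in for per-residue embeddings of two short sequences.
toy_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_b = [[0.0, 1.0], [1.0, 1.0]]
toy_score, toy_pairs = soft_local_align(toy_a, toy_b)
print(toy_score, toy_pairs)
```

The returned index pairs are what make such an alignment traceable residue by residue, which is the interpretability property the abstract emphasizes. In practice the embeddings would come from a protein language model, and the scoring and gap scheme would follow the article rather than the placeholder values used here.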
Sponsor: K.E.W. and S.W.P.’s work was funded by NSF Grant OIA1736030. M.B.’s work was funded through a supplement to the OIA1736030 and OIA2149133 awards. W.H. received support through the Hawaii EPSCoR fellowship program (OIA-2149133).
Citation: Harrigan, W.L., Ferrell, B.D., Wommack, K.E. et al. Improvements in viral gene annotation using large language models and soft alignments. BMC Bioinformatics 25, 165 (2024). https://doi.org/10.1186/s12859-024-05779-6
ISSN: 1471-2105
URL: https://udspace.udel.edu/handle/19716/34348
Language: en_US
Publisher: BMC Bioinformatics
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
Keywords: large language models; protein homology; viruses; alignments
Title: Improvements in viral gene annotation using large language models and soft alignments
Type: Article
Files
Original bundle
Name:
Improvements in viral gene annotation using large language models and soft alignments.pdf
Size:
2.21 MB
Format:
Adobe Portable Document Format
Description:
Main article
License bundle
Name:
license.txt
Size:
2.22 KB
Format:
Item-specific license agreed upon to submission