Dual-Modality Transformer-Based Approach for Viral Protein Classification Integrating Protein Language Models and 3Di FASTA Representations


Authors:
Puentes Mozo, Juanita
Resource type:
Undergraduate thesis
Publication date:
2024
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/74920
Online access:
https://hdl.handle.net/1992/74920
Keywords:
Phage protein classification
Multi-modality approach
Viral proteins
Transformer models
Deep learning
Artificial intelligence
PHROGs database
PHROG-function prediction
Transfer learning in virology
Microbiology
Rights
openAccess
License
Attribution-NonCommercial-NoDerivatives 4.0 International
Description
Summary: Viruses significantly impact ecosystems by influencing microbial diversity and facilitating genetic exchange, but their genomes remain poorly annotated. Accurate viral genome annotation is challenging because viral proteins are underrepresented in databases and their sequences diverge rapidly. We present a novel approach for viral protein classification that uses transformer models to integrate text embeddings from protein language models (pLMs) with visual features from 3Di FASTA representations. Leveraging pre-trained models such as ProteinBERT, ProteinBFD, and ESM, we performed a series of viral protein classification experiments at two levels: category level (9 classes) and PHROGs family level (1159 classes). Our model achieved its best results with PHROGs labels, attaining precision, recall, and F-score values of 0.784, 0.789, and 0.786, respectively, at the category level. Integrating 3Di image features with FASTA sequences further improved classification accuracy, raising true positive rates across most classes. These findings highlight the importance of accurate functional annotations and demonstrate the potential of transformer-based models in viral protein classification. The results also suggest that homology-based labels, such as those used by Pharokka, may introduce inconsistencies, which warrants further investigation. Our dual-modality approach provides a robust framework for future research, promoting more precise and comprehensive protein classification methodologies.
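The precision, recall, and F-score figures above can be made concrete with a small sketch. The summary does not say how the scores were averaged across classes, so the macro-averaging below is an assumption, and the label names and predictions are toy placeholders rather than data from the thesis.

```python
from collections import Counter


def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of each metric over classes (assumed averaging scheme)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy example with three hypothetical category labels.
y_true = ["tail", "tail", "lysis", "lysis", "other", "other"]
y_pred = ["tail", "lysis", "lysis", "lysis", "other", "tail"]

scores = per_class_prf(y_true, y_pred)
macro_p, macro_r, macro_f = macro_average(scores)
print(f"macro precision={macro_p:.3f} recall={macro_r:.3f} f1={macro_f:.3f}")

# Harmonic mean of the precision and recall reported in the summary.
reported_p, reported_r = 0.784, 0.789
harmonic = 2 * reported_p * reported_r / (reported_p + reported_r)
```

To three decimals, the harmonic mean of the reported precision and recall equals the reported F-score of 0.786. Under macro-averaging the F-score is the mean of per-class F-scores, which need not equal that harmonic mean in general, so this agreement does not by itself identify the averaging scheme used.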