Dual-Modality Transformer-Based Approach for Viral Protein Classification Integrating Protein Language Models and 3Di FASTA Representations
- Authors:
- Puentes Mozo, Juanita
- Resource type:
- Undergraduate thesis
- Publication date:
- 2024
- Institution:
- Universidad de los Andes
- Repository:
- Séneca: repositorio Uniandes
- Language:
- eng
- OAI Identifier:
- oai:repositorio.uniandes.edu.co:1992/74920
- Online access:
- https://hdl.handle.net/1992/74920
- Keywords:
- Phage protein classification
- Multi-modality approach
- Viral proteins
- Transformer models
- Deep learning
- Artificial intelligence
- PHROGs database
- PHROG-function prediction
- Transfer learning in virology
- Microbiology
- Rights
- openAccess
- License
- Attribution-NonCommercial-NoDerivatives 4.0 International
Summary: Viruses significantly impact ecosystems by influencing microbial diversity and facilitating genetic exchange, but their genomes remain poorly annotated. Accurate viral genome annotation is challenging due to limited viral protein representation in databases and rapid sequence divergence. We present a novel approach for viral protein classification by integrating text embeddings from protein language models (pLMs) and visual features from 3Di FASTA representations using transformer models. Leveraging pre-trained models such as ProteinBERT, ProteinBFD, and ESM, we performed a series of viral protein classification experiments at two levels: category level (9 classes) and PHROGs family level (1159 classes). Our model achieved superior results with PHROGs labels, attaining precision, recall, and F-score values of 0.784, 0.789, and 0.786, respectively, at the category level. The integration of 3Di image features with FASTA sequences further improved classification accuracy, enhancing true positive rates across most classes. These findings highlight the importance of accurate functional annotations and demonstrate the potential of transformer-based models in viral protein classification. The results also suggest that homology-based labels, such as those used by Pharokka, may introduce inconsistencies, warranting further investigation. Our dual-modality approach provides a robust framework for future research, promoting more precise and comprehensive protein classification methodologies.