Development of a variant interpretation framework for the SIGEN genomic diagnostic service
- Authors:
- Mahecha López, Daniel Hernán
- Resource type:
- Publication date:
- 2020
- Institution:
- Universidad de los Andes
- Repository:
- Séneca: repositorio Uniandes
- Language:
- eng
- OAI Identifier:
- oai:repositorio.uniandes.edu.co:1992/48416
- Online access:
- http://hdl.handle.net/1992/48416
- Keywords:
- Sequence alignment (Bioinformatics)
Genetic diseases
Machine learning (Artificial intelligence)
Bayesian statistical decision theory
Bioinformatics
Biology
- Rights
- openAccess
- License
- http://creativecommons.org/licenses/by-nc-sa/4.0/
Summary: "Diagnosis of genetic diseases from high-throughput DNA sequencing data is becoming common practice. The SIGEN diagnostic service aims to offer high-quality genetic diagnosis in Colombia. However, an important concern among practitioners interpreting genetic diagnostic reports is the significant number of disease-related variants classified as Variants of Uncertain Significance (VUS). An additional barrier is the high cost of the software and databases required in the interpretation process. Here, we present a framework for variant interpretation using only open-access software tools and databases, tested with real data from patients with suspected genetic disease. To help prioritize VUS with higher probabilities of being pathogenic, we developed different machine-learning methods. We trained and compared a Naive Bayes model, a Random Forest (RF), a Support Vector Machine, and a five-layer Multilayer Perceptron (MLP) using variants from ClinVar classified as pathogenic, likely pathogenic, likely benign, or benign in October 2019. A set of conservation scores and global allele frequencies from the 1000 Genomes Project were used as features for model training. The RF and MLP models showed the highest accuracy, above that of commonly used tools for variant deleteriousness prediction. Additionally, we developed a database of the variants found in our patient population and a web interface to make it more accessible." (Taken from the degree document form)
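The summary describes ranking VUS by their probability of being pathogenic using classifiers trained on conservation scores and allele frequencies. The sketch below illustrates that idea with a small Gaussian Naive Bayes classifier written in plain Python. It is not the thesis's implementation: the function names (`fit_gaussian_nb`, `posterior`), the two features, and every numeric value are assumptions made up for illustration, and the toy rows are synthetic, not ClinVar data.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-6
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def posterior(model, row):
    """Return P(class | row) for every class, computed in log space."""
    log_scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_scores[label] = score
    top = max(log_scores.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(s - top) for k, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: s / total for k, s in exp_scores.items()}


# Synthetic toy rows (invented): (conservation score, log10 allele frequency).
TRAIN_ROWS = [
    (5.1, -5.0), (4.8, -4.5), (5.6, -5.5), (4.2, -4.0),   # labeled pathogenic
    (0.5, -1.0), (1.2, -1.5), (-0.3, -0.7), (0.9, -2.0),  # labeled benign
]
TRAIN_LABELS = ["pathogenic"] * 4 + ["benign"] * 4

model = fit_gaussian_nb(TRAIN_ROWS, TRAIN_LABELS)

# Hypothetical VUS to prioritize, described by the same two features.
vus = {"vus_a": (4.9, -4.8), "vus_b": (0.7, -1.2), "vus_c": (3.0, -3.0)}
ranked = sorted(
    vus,
    key=lambda name: posterior(model, vus[name])["pathogenic"],
    reverse=True,
)
for name in ranked:
    print(name, round(posterior(model, vus[name])["pathogenic"], 3))
```

Naive Bayes is used here only because it is the simplest of the four model families named in the summary; a Random Forest or MLP would consume the same feature rows and labels.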