Interpretable Deep Embeddings for Single-cell RNA-seq Clustering Analysis via Gene Attention

Authors:
Forigua Díaz, Cristhian David
Resource type:
Undergraduate thesis
Publication date:
2022
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/63770
Online access:
http://hdl.handle.net/1992/63770
Keywords:
scRNA-seq
Autoencoder
Gene attention
Interpretability
Clustering
Biology
Engineering
Rights
openAccess
License
Attribution-ShareAlike 4.0 International
Description
Summary: This work presents a denoising autoencoder based on the zero-inflated negative binomial (ZINB) model that produces interpretable deep embeddings for single-cell RNA-seq clustering through a gene attention mechanism. The method reduces gene expression inputs to a latent space that embeds their semantic information, and the latent representations are then clustered into cell groups. The gene attention mechanism makes it possible to inspect how the autoencoder embeds the gene expression data and to analyze which genes drive the clustering. We perform extensive ablation experiments on the configuration of the autoencoder and of the attention mechanism. We test our method on six scRNA-seq datasets with different cell types. The results indicate that our method is competitive with previous approaches and outperforms them on the 10XPBMC and Worm Neuron Cells datasets. Functional enrichment analysis of the genes highlighted by the attention vectors shows a correspondence between what the network learns and the cell types in the datasets.
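The two ingredients named in the summary can be illustrated with a minimal sketch. The ZINB negative log-likelihood below follows the standard mean/dispersion parameterization of the distribution. The attention function is only an assumption about what a gene attention step could look like (a softmax over per-gene scores that reweights the expression vector); the record does not describe the actual architecture, and every function and parameter name here is hypothetical.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a ZINB model.

    pi (0 <= pi < 1) is the dropout probability: the chance that a zero
    comes from the zero-inflation component rather than the NB component.
    """
    if x == 0:
        # A zero is either a dropout or a genuine NB zero.
        p_zero = pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta))
        return -math.log(p_zero)
    # A positive count can only come from the NB component.
    return -(math.log(1.0 - pi) + nb_log_pmf(x, mu, theta))


def gene_attention(expression, scores):
    """Hypothetical gene attention: softmax the per-gene scores and
    reweight the expression vector elementwise.

    Returns (weights, weighted_expression). The weights are what one
    would inspect to see which genes the model emphasizes.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    weighted = [w * v for w, v in zip(weights, expression)]
    return weights, weighted


if __name__ == "__main__":
    # Toy cell with three genes; all values are made up for illustration.
    counts = [0, 3, 12]
    loss = sum(zinb_nll(x, mu=4.0, theta=2.0, pi=0.2) for x in counts)
    weights, _ = gene_attention([0.0, 3.0, 12.0], scores=[0.1, 0.5, 2.0])
    print("ZINB NLL:", loss)
    print("attention weights:", weights)
```

In a trained model, `mu`, `theta`, and `pi` would be per-gene outputs of the decoder and the scores would be learned parameters; here they are fixed constants so that the sketch stays self-contained.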