Graphing genomes in 2D, applications of multivariate statistics on the genomic composition

Authors:
Martínez Villa, María Camila
Resource type:
Publication date:
2017
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/34158
Online access:
http://hdl.handle.net/1992/34158
Keywords:
Nucleotide sequence - Research
Bioinformatics - Research
Genomics - Research
Big Data - Research
Biology
Rights
openAccess
License
http://creativecommons.org/licenses/by-nc-sa/4.0/
Description
Summary: Next Generation Sequencing has moved the Big Data phenomenon into the Biological Sciences, making the understanding of biological data a computational challenge. Consequently, it is important to create tools that exploit human visual skills in the interpretation of this ever-increasing information. However, transforming genomic data into an image with biological meaning is particularly difficult because the information is not contained in a single variable but in a set of them. The distribution of genomic composition embedded in k-mer frequencies (the frequencies of all possible substrings of size k) is a suitable approach, since it yields a specific signature for each organism that can be used to classify and visualize organisms. The main goal of this study was to develop an R function that transforms a genomic sequence into a specific 2D image based on k-mer frequencies, and to prove that this visualization preserves the biological relationships among organisms. The function fragments a genome, reduces the dimensionality of the genomic composition measurements, and assigns a specific color (RGB) to each fragment, transforming it into an image pixel. The function was applied to 52 bacterial genomes, and related organisms showed similar color patterns across family, class, and phylum. In addition, Mantel and chi-squared tests were performed on two distinct distance matrices, one derived from pixel features and the other from a traditional 16S-based phylogenetic tree, in order to assess the statistical similarity of the resulting 2D images to classical phylogeny. In conclusion, image-based tools can help improve genomic comparisons by exploiting human visual capabilities.
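The pipeline described in the summary (fragment a genome, measure k-mer frequencies, reduce dimensionality, map each fragment to an RGB pixel) can be illustrated with a minimal sketch. This is not the author's R function: it is written in Python with the standard library only, and every name in it (`kmer_frequencies`, `fragment`, `top_components`, `to_rgb`, `genome_to_pixels`) is hypothetical. The summary does not name the dimensionality reduction method, so principal component analysis is assumed here, and the fragment size, the value of k, and the min-max scaling to 0-255 are likewise illustrative assumptions.

```python
from itertools import product
import random


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible DNA k-mer (A, C, G, T) in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in kmers]


def fragment(seq, size):
    """Split seq into consecutive, non-overlapping fragments of length size."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def top_components(rows, n_comp=3, iters=200, seed=0):
    """Project rows onto their first n_comp principal axes.

    Assumption: PCA is used as the reduction step. The axes are found by
    power iteration on X^T X, orthogonalizing against earlier axes.
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    X = [[r[j] - means[j] for j in range(d)] for r in rows]  # centered data
    comps = []
    for _ in range(n_comp):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            s = [sum(x[j] * v[j] for j in range(d)) for x in X]            # X v
            w = [sum(s[i] * X[i][j] for i in range(n)) for j in range(d)]  # X^T (X v)
            for c in comps:  # remove directions already extracted
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = sum(a * a for a in w) ** 0.5
            if norm < 1e-12:  # no variance left in the remaining directions
                v = [0.0] * d
                break
            v = [a / norm for a in w]
        comps.append(v)
    return [[sum(x[j] * c[j] for j in range(d)) for c in comps] for x in X]


def to_rgb(scores):
    """Min-max scale each score column to an integer in 0..255."""
    cols = list(zip(*scores))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple(int(round(255 * (v - l) / (h - l))) if h > l else 0
              for v, l, h in zip(row, lo, hi))
        for row in scores
    ]


def genome_to_pixels(seq, fragment_size=500, k=2):
    """One (R, G, B) pixel per genome fragment."""
    freqs = [kmer_frequencies(f, k) for f in fragment(seq.upper(), fragment_size)]
    return to_rgb(top_components(freqs))
```

Calling `genome_to_pixels(sequence)` on a DNA string returns one `(R, G, B)` tuple per fragment, and these tuples could then be laid out row by row to form a 2D image. Because the colors are scaled within a single sequence in this sketch, comparing several genomes on a common color scale would require fitting the projection and the scaling on all of their fragments together.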