Graphing genomes in 2D, applications of multivariate statistics on the genomic composition
Next Generation Sequencing has moved the Big Data phenomenon into the Biological Sciences, making the understanding of biological data a computational challenge. In consequence, it is important to create tools that exploit human visual skills in the interpretation of this ever-increasing information...
- Authors:
- Martínez Villa, María Camila
- Resource type:
- Master's thesis
- Publication date:
- 2017
- Institution:
- Universidad de los Andes
- Repository:
- Séneca: repositorio Uniandes
- Language:
- eng
- OAI Identifier:
- oai:repositorio.uniandes.edu.co:1992/34158
- Online access:
- http://hdl.handle.net/1992/34158
- Keywords:
- Nucleotide sequence - Research
- Bioinformatics - Research
- Genomics - Research
- Big Data - Research
- Biology
- Rights
- openAccess
- License
- http://creativecommons.org/licenses/by-nc-sa/4.0/
id |
UNIANDES2_584bbbadcb6caaa9fb76de5ce0c0d1d8 |
---|---|
oai_identifier_str |
oai:repositorio.uniandes.edu.co:1992/34158 |
network_acronym_str |
UNIANDES2 |
network_name_str |
Séneca: repositorio Uniandes |
repository_id_str |
|
dc.title.es_CO.fl_str_mv |
Graphing genomes in 2D, applications of multivariate statistics on the genomic composition |
title |
Graphing genomes in 2D, applications of multivariate statistics on the genomic composition |
dc.creator.fl_str_mv |
Martínez Villa, María Camila |
dc.contributor.advisor.none.fl_str_mv |
López Kleine, Liliana; Reyes Muñoz, Alejandro |
dc.contributor.author.none.fl_str_mv |
Martínez Villa, María Camila |
dc.contributor.jury.none.fl_str_mv |
Niño, Luis Fernando |
dc.subject.keyword.es_CO.fl_str_mv |
Nucleotide sequence - Research; Bioinformatics - Research; Genomics - Research; Big Data - Research |
topic |
Nucleotide sequence - Research; Bioinformatics - Research; Genomics - Research; Big Data - Research; Biology |
dc.subject.themes.none.fl_str_mv |
Biology |
description |
Next Generation Sequencing has moved the Big Data phenomenon into the Biological Sciences, making the understanding of biological data a computational challenge. Consequently, it is important to create tools that exploit human visual skills in the interpretation of this ever-increasing information. However, transforming genomic data into an image with biological meaning is particularly difficult because the information is not contained in a single variable but in a set of them. The distribution of genomic composition embedded in k-mer frequencies (frequencies of all possible substrings of size k) is a suitable approach, since it yields a specific signature for different organisms that can be used to classify and visualize them. The main goal of this study was to develop an R function that transforms a genomic sequence into a specific 2D image based on k-mer frequencies, and to prove that this visualization preserves the biological relationships of organisms. The function fragments a genome, reduces the dimensionality of the genomic composition measurements, and assigns a specific color (RGB) to each fragment, transforming it into an image pixel. This function was applied to 52 bacterial genomes, and related organisms presented similar color patterns across family, class and phylum. In addition, Mantel and Chi-squared tests were performed on two distinct distance matrices, one from pixel features and another from a traditional 16S-based phylogenetic tree, in order to assess the statistical similarity of the obtained 2D images to classical phylogeny. In conclusion, image-based tools can help improve genomic comparisons by exploiting human visual capabilities. |
publishDate |
2017 |
dc.date.issued.none.fl_str_mv |
2017 |
dc.date.accessioned.none.fl_str_mv |
2020-06-10T08:58:50Z |
dc.date.available.none.fl_str_mv |
2020-06-10T08:58:50Z |
dc.type.spa.fl_str_mv |
Master's thesis (Trabajo de grado - Maestría) |
dc.type.coarversion.fl_str_mv |
http://purl.org/coar/version/c_970fb48d4fbd8a85 |
dc.type.driver.spa.fl_str_mv |
info:eu-repo/semantics/masterThesis |
dc.type.content.spa.fl_str_mv |
Text |
dc.type.redcol.spa.fl_str_mv |
http://purl.org/redcol/resource_type/TM |
dc.identifier.uri.none.fl_str_mv |
http://hdl.handle.net/1992/34158 |
dc.identifier.pdf.none.fl_str_mv |
u806887.pdf |
dc.identifier.instname.spa.fl_str_mv |
instname:Universidad de los Andes |
dc.identifier.reponame.spa.fl_str_mv |
reponame:Repositorio Institucional Séneca |
dc.identifier.repourl.spa.fl_str_mv |
repourl:https://repositorio.uniandes.edu.co/ |
url |
http://hdl.handle.net/1992/34158 |
identifier_str_mv |
u806887.pdf instname:Universidad de los Andes reponame:Repositorio Institucional Séneca repourl:https://repositorio.uniandes.edu.co/ |
dc.language.iso.es_CO.fl_str_mv |
eng |
language |
eng |
dc.rights.uri.*.fl_str_mv |
http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.accessrights.spa.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.rights.coar.spa.fl_str_mv |
http://purl.org/coar/access_right/c_abf2 |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-sa/4.0/ http://purl.org/coar/access_right/c_abf2 |
eu_rights_str_mv |
openAccess |
dc.format.extent.es_CO.fl_str_mv |
36 leaves |
dc.format.mimetype.es_CO.fl_str_mv |
application/pdf |
dc.publisher.es_CO.fl_str_mv |
Uniandes |
dc.publisher.program.es_CO.fl_str_mv |
Maestría en Biología Computacional |
dc.publisher.faculty.es_CO.fl_str_mv |
Facultad de Ciencias |
dc.publisher.department.es_CO.fl_str_mv |
Departamento de Biología |
dc.source.es_CO.fl_str_mv |
instname:Universidad de los Andes reponame:Repositorio Institucional Séneca |
instname_str |
Universidad de los Andes |
institution |
Universidad de los Andes |
reponame_str |
Repositorio Institucional Séneca |
collection |
Repositorio Institucional Séneca |
bitstream.url.fl_str_mv |
https://repositorio.uniandes.edu.co/bitstreams/cb31a1af-6d3f-4a78-b41b-7d8c0c2162d8/download https://repositorio.uniandes.edu.co/bitstreams/b238406a-93d3-4651-b141-7bfaed4d4efd/download https://repositorio.uniandes.edu.co/bitstreams/2c601da5-de2b-40ad-9656-118366b550ca/download |
bitstream.checksum.fl_str_mv |
255265ba908c7e1d2271c63894647e90 d8286aa5d1b2147a1fb4487953cdc64f 57eca7677ccafecf97b69efdd093e4a0 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositorio institucional Séneca |
repository.mail.fl_str_mv |
adminrepositorio@uniandes.edu.co |
_version_ |
1812134056894136320 |
spelling |
Terms of use: by consulting and using this resource, you accept the conditions of use established by the authors.
License: http://creativecommons.org/licenses/by-nc-sa/4.0/ (info:eu-repo/semantics/openAccess; http://purl.org/coar/access_right/c_abf2).
Advisors: López Kleine, Liliana; Reyes Muñoz, Alejandro (virtual::16184-1). Author: Martínez Villa, María Camila. Jury: Niño, Luis Fernando.
Abstract: recorded in English and in Spanish; the Spanish version is marked "Tomado del Formato de Documento de Grado" (taken from the degree document form).
Degree: Magíster en Biología Computacional (Maestría).
Profiles linked to virtual::16184-1: https://scholar.google.es/citations?user=hbXF8UEAAAAJ; 0000-0003-2907-3265; https://scienti.minciencias.gov.co/cvlac/visualizador/generarCurriculoCv.do?cod_rh=0000395927
Bitstream ORIGINAL: u806887.pdf, application/pdf, size 10660527, https://repositorio.uniandes.edu.co/bitstreams/cb31a1af-6d3f-4a78-b41b-7d8c0c2162d8/download, MD5 255265ba908c7e1d2271c63894647e90
Bitstream TEXT: u806887.pdf.txt (Extracted text), text/plain, size 48617, https://repositorio.uniandes.edu.co/bitstreams/b238406a-93d3-4651-b141-7bfaed4d4efd/download, MD5 d8286aa5d1b2147a1fb4487953cdc64f
Bitstream THUMBNAIL: u806887.pdf.jpg (IM Thumbnail), image/jpeg, size 16922, https://repositorio.uniandes.edu.co/bitstreams/2c601da5-de2b-40ad-9656-118366b550ca/download, MD5 57eca7677ccafecf97b69efdd093e4a0
Record timestamp: 2024-03-13 15:39:34.524 (oai:repositorio.uniandes.edu.co:1992/34158; open.access; Repositorio institucional Séneca, https://repositorio.uniandes.edu.co, adminrepositorio@uniandes.edu.co) |
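
The abstract in this record describes a pipeline that fragments a genome, measures k-mer frequencies per fragment, reduces their dimensionality, and maps each fragment to an RGB pixel. The following is a minimal Python sketch of that idea, not the thesis's R function. The record does not name the reduction method, so principal components (computed here by power iteration) are an assumption, as are the function names, the fragment size, the value of k, and the per-genome rescaling to 0..255.

```python
from itertools import product
import random


def kmer_frequencies(fragment, k):
    """Relative frequency of every possible DNA k-mer in one fragment."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(fragment) - k + 1):
        word = fragment[i:i + k]
        if word in counts:  # windows with N or other symbols are skipped
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in kmers]


def top_components(rows, n_components=3, iterations=200, seed=0):
    """Scores of the rows on the leading principal axes.

    Assumption: PCA via power iteration with deflation stands in for
    whatever reduction the thesis actually uses.
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    means = [sum(col) / len(rows) for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    scores = []
    for _ in range(n_components):
        axis = [rng.uniform(-1, 1) for _ in range(dim)]
        for _ in range(iterations):
            # Apply X^T X to the axis without building the covariance matrix.
            proj = [sum(a * x for a, x in zip(axis, row)) for row in centered]
            axis = [sum(p * row[j] for p, row in zip(proj, centered))
                    for j in range(dim)]
            norm = sum(a * a for a in axis) ** 0.5
            if norm == 0:  # no variance left to explain
                break
            axis = [a / norm for a in axis]
        proj = [sum(a * x for a, x in zip(axis, row)) for row in centered]
        scores.append(proj)
        # Deflate: remove the component just found from every row.
        centered = [[x - p * a for x, a in zip(row, axis)]
                    for row, p in zip(centered, proj)]
    return list(zip(*scores))


def to_rgb(scores):
    """Rescale each component to 0..255 so each fragment becomes one pixel."""
    channels = []
    for component in zip(*scores):
        lo, hi = min(component), max(component)
        span = (hi - lo) or 1.0
        channels.append([round(255 * (v - lo) / span) for v in component])
    return list(zip(*channels))


def genome_to_pixels(sequence, fragment_size=500, k=3):
    """Fragment a sequence and return one (R, G, B) tuple per fragment."""
    fragments = [sequence[i:i + fragment_size]
                 for i in range(0, len(sequence) - fragment_size + 1,
                                fragment_size)]
    if not fragments:
        raise ValueError("sequence is shorter than one fragment")
    freqs = [kmer_frequencies(f, k) for f in fragments]
    return to_rgb(top_components(freqs))


if __name__ == "__main__":
    rng = random.Random(1)
    toy_genome = "".join(rng.choice("ACGT") for _ in range(5000))
    for pixel in genome_to_pixels(toy_genome, fragment_size=500, k=2):
        print(pixel)
```

In this sketch the projection and the color scaling are fitted per sequence, so colors are only comparable within one image. Comparing several genomes, as the abstract does for 52 bacterial genomes, would presumably require a shared projection and scaling; the record does not say how the thesis handles this.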