Taxonomic assignment of 16S rRNA sequences based on Fourier analysis

We introduce TAXOFOR, a novel machine learning classifier that uses Random Forests to assign taxonomy, down to genus level, to amplicons from paired-end sequencing, trained with annotated sequences from the Greengenes database. It achieves an accuracy close to 98%, and it...


Authors:
Luque y Guzmán Sáenz, Guillermo Gustavo
Resource type:
Master's thesis
Publication date:
2016
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/13692
Online access:
http://hdl.handle.net/1992/13692
Keywords:
Metagenomics
Ribosomal DNA
Computational biology
Biology
Rights
openAccess
License
http://creativecommons.org/licenses/by-nc-sa/4.0/
id UNIANDES2_91ed190fd6bfa9d72684202053de0071
oai_identifier_str oai:repositorio.uniandes.edu.co:1992/13692
network_acronym_str UNIANDES2
network_name_str Séneca: repositorio Uniandes
repository_id_str
dc.title.es_CO.fl_str_mv Taxonomic assignment of 16S rRNA sequences based on Fourier analysis
dc.creator.fl_str_mv Luque y Guzmán Sáenz, Guillermo Gustavo
dc.contributor.advisor.none.fl_str_mv Reyes Muñoz, Alejandro
dc.contributor.author.none.fl_str_mv Luque y Guzmán Sáenz, Guillermo Gustavo
dc.subject.keyword.es_CO.fl_str_mv Metagenomics
Ribosomal DNA
Computational biology
topic Metagenomics
Ribosomal DNA
Computational biology
Biology
dc.subject.themes.none.fl_str_mv Biology
description We introduce TAXOFOR, a novel machine learning classifier that uses Random Forests to assign taxonomy, down to genus level, to amplicons from paired-end sequencing. It is trained with annotated sequences from the Greengenes database. It achieves an accuracy close to 98% and is faster than several of the de facto standard tools used for the same purpose in microbial ecology. To handle the DNA sequences, they are first represented numerically as projections into a 3D space defined by the vertices of a tetrahedron. The Discrete Fourier Transform then yields their power spectra, which serve as input both to train the classifier and to predict taxonomy. Parseval's identity ensures that the similarity between the numerical representations of two DNA sequences can be obtained from their power spectra. This property is tested by comparing a dendrogram from a hierarchical clustering based on the pairwise distances between the spectra of DNA sequences with another built from the distance matrix obtained after a multiple sequence alignment (MSA). The performance and accuracy of TAXOFOR were compared with those of UCLUST, RDP and MOTHUR by assigning taxonomy to the same set of 16S rRNA sequences. The initial results are promising and leave room for improvements in parallel processing and memory handling.
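The encoding and spectral steps named in the description can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation: the vertex coordinates, the base-to-vertex mapping, and the helper names `encode`, `dft`, `power_spectrum` and `spectral_distance` are all assumptions made for illustration, since this record does not specify them. The sketch maps each base to a vertex of a regular tetrahedron, takes a naive discrete Fourier transform along each coordinate axis, and sums the squared magnitudes into a single power spectrum.

```python
import cmath
import math

# ASSUMPTION: vertices of a regular tetrahedron inscribed in the unit sphere.
# The record gives neither the coordinates nor the base-to-vertex mapping.
S2, S6 = math.sqrt(2), math.sqrt(6)
VERTICES = {
    "A": (0.0, 0.0, 1.0),
    "C": (-S2 / 3, S6 / 3, -1 / 3),
    "G": (-S2 / 3, -S6 / 3, -1 / 3),
    "T": (2 * S2 / 3, 0.0, -1 / 3),
}


def encode(seq):
    """Map each base to its (hypothetical) tetrahedron vertex."""
    return [VERTICES[base] for base in seq.upper()]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def power_spectrum(seq):
    """Sum of squared DFT magnitudes over the three coordinate axes."""
    points = encode(seq)
    spectrum = [0.0] * len(points)
    for axis in range(3):
        for k, coeff in enumerate(dft([p[axis] for p in points])):
            spectrum[k] += abs(coeff) ** 2
    return spectrum


def spectral_distance(seq_a, seq_b):
    """Euclidean distance between power spectra of equal-length sequences."""
    pa, pb = power_spectrum(seq_a), power_spectrum(seq_b)
    if len(pa) != len(pb):
        raise ValueError("this sketch only compares sequences of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))
```

Because every vertex in this sketch has unit norm, Parseval's identity implies that the power spectrum of a sequence of length n sums to n squared, which gives a quick sanity check. The spectrum also discards phase, so cyclically shifted sequences have distance zero here. A vector such as `power_spectrum(seq)` is the kind of fixed-length feature that could be passed to a Random Forest implementation, although the features and parameters actually used by TAXOFOR are not stated in this record.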
publishDate 2016
dc.date.issued.none.fl_str_mv 2016
dc.date.accessioned.none.fl_str_mv 2018-09-28T10:49:39Z
dc.date.available.none.fl_str_mv 2018-09-28T10:49:39Z
dc.type.spa.fl_str_mv Master's thesis
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/masterThesis
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/TM
dc.identifier.uri.none.fl_str_mv http://hdl.handle.net/1992/13692
dc.identifier.pdf.none.fl_str_mv u728994.pdf
dc.identifier.instname.spa.fl_str_mv instname:Universidad de los Andes
dc.identifier.reponame.spa.fl_str_mv reponame:Repositorio Institucional Séneca
dc.identifier.repourl.spa.fl_str_mv repourl:https://repositorio.uniandes.edu.co/
dc.language.iso.es_CO.fl_str_mv eng
language eng
dc.rights.uri.*.fl_str_mv http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
dc.rights.coar.spa.fl_str_mv http://purl.org/coar/access_right/c_abf2
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-sa/4.0/
http://purl.org/coar/access_right/c_abf2
eu_rights_str_mv openAccess
dc.format.extent.es_CO.fl_str_mv 44 leaves
dc.format.mimetype.es_CO.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidad de los Andes
dc.publisher.program.es_CO.fl_str_mv Maestría en Biología Computacional
dc.publisher.faculty.es_CO.fl_str_mv Facultad de Ciencias
dc.publisher.department.es_CO.fl_str_mv Departamento de Biología
publisher.none.fl_str_mv Universidad de los Andes
dc.source.es_CO.fl_str_mv instname:Universidad de los Andes
reponame:Repositorio Institucional Séneca
bitstream.url.fl_str_mv https://repositorio.uniandes.edu.co/bitstreams/37272a97-8760-49cb-8731-1421698d6221/download
https://repositorio.uniandes.edu.co/bitstreams/f24d1dc8-c5d6-4ea7-a491-a3f059ac4f3a/download
https://repositorio.uniandes.edu.co/bitstreams/7e194934-1f74-42df-8a0c-28c567ecba3b/download
bitstream.checksum.fl_str_mv a37515478c2bded3b2438bdf312c5157
5c666851f12aeb2550cdb9d10500b30b
085178f6a2a2ef909486b7ca9ec190a9
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositorio institucional Séneca
repository.mail.fl_str_mv adminrepositorio@uniandes.edu.co
_version_ 1812133813816393728