Taxonomic assignment of 16S rRNA sequences based on Fourier analysis
We introduce TAXOFOR, a novel machine learning classifier that uses Random Forests to assign taxonomy to paired-end sequencing amplicons down to genus level, trained with annotated sequences from the Greengenes database. It performs this task with an accuracy close to 98%, and it...
- Authors:
- Luque y Guzmán Sáenz, Guillermo Gustavo
- Resource type:
- Publication date:
- 2016
- Institution:
- Universidad de los Andes
- Repository:
- Séneca: repositorio Uniandes
- Language:
- eng
- OAI Identifier:
- oai:repositorio.uniandes.edu.co:1992/13692
- Online access:
- http://hdl.handle.net/1992/13692
- Keywords:
- Metagenomics
Ribosomal DNA
Computational biology
Biology
- Rights
- openAccess
- License
- http://creativecommons.org/licenses/by-nc-sa/4.0/
id |
UNIANDES2_91ed190fd6bfa9d72684202053de0071 |
---|---|
oai_identifier_str |
oai:repositorio.uniandes.edu.co:1992/13692 |
network_acronym_str |
UNIANDES2 |
network_name_str |
Séneca: repositorio Uniandes |
repository_id_str |
|
dc.title.es_CO.fl_str_mv |
Taxonomic assignment of 16S rRNA sequences based on Fourier analysis |
title |
Taxonomic assignment of 16S rRNA sequences based on Fourier analysis |
dc.creator.fl_str_mv |
Luque y Guzmán Sáenz, Guillermo Gustavo |
dc.contributor.advisor.none.fl_str_mv |
Reyes Muñoz, Alejandro |
dc.contributor.author.none.fl_str_mv |
Luque y Guzmán Sáenz, Guillermo Gustavo |
dc.subject.keyword.es_CO.fl_str_mv |
Metagenomics; Ribosomal DNA; Computational biology |
topic |
Metagenomics; Ribosomal DNA; Computational biology; Biology |
dc.subject.themes.none.fl_str_mv |
Biology |
description |
We introduce TAXOFOR, a novel machine learning classifier that uses Random Forests to assign taxonomy to paired-end sequencing amplicons down to genus level. It is trained with annotated sequences from the Greengenes database. It performs this task with an accuracy close to 98%, and it is faster than several of the de facto standard tools used for the same purpose in microbial ecology. To handle the DNA sequences, they are first represented numerically as projections into a 3D space defined by the vertices of a tetrahedron. The Discrete Fourier Transform then yields their power spectra, which serve as input both to train the classifier and to predict taxonomy. Parseval's identity ensures that the similarity between the numerical representations of two DNA sequences can be obtained from their power spectra. This is tested by comparing two dendrograms: one from a hierarchical clustering based on the pairwise distances between the spectra of the DNA sequences, and another built from the distance matrix obtained after a multiple sequence alignment (MSA). The performance and accuracy of TAXOFOR were assessed against UCLUST, RDP and MOTHUR by assigning taxonomy to the same set of 16S rRNA sequences. The initial results are promising and leave room for improvements in parallel processing and memory handling. |
publishDate |
2016 |
dc.date.issued.none.fl_str_mv |
2016 |
dc.date.accessioned.none.fl_str_mv |
2018-09-28T10:49:39Z |
dc.date.available.none.fl_str_mv |
2018-09-28T10:49:39Z |
dc.type.spa.fl_str_mv |
Master's thesis |
dc.type.coarversion.fl_str_mv |
http://purl.org/coar/version/c_970fb48d4fbd8a85 |
dc.type.driver.spa.fl_str_mv |
info:eu-repo/semantics/masterThesis |
dc.type.content.spa.fl_str_mv |
Text |
dc.type.redcol.spa.fl_str_mv |
http://purl.org/redcol/resource_type/TM |
dc.identifier.uri.none.fl_str_mv |
http://hdl.handle.net/1992/13692 |
dc.identifier.pdf.none.fl_str_mv |
u728994.pdf |
dc.identifier.instname.spa.fl_str_mv |
instname:Universidad de los Andes |
dc.identifier.reponame.spa.fl_str_mv |
reponame:Repositorio Institucional Séneca |
dc.identifier.repourl.spa.fl_str_mv |
repourl:https://repositorio.uniandes.edu.co/ |
url |
http://hdl.handle.net/1992/13692 |
identifier_str_mv |
u728994.pdf instname:Universidad de los Andes reponame:Repositorio Institucional Séneca repourl:https://repositorio.uniandes.edu.co/ |
dc.language.iso.es_CO.fl_str_mv |
eng |
language |
eng |
dc.rights.uri.*.fl_str_mv |
http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.rights.accessrights.spa.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.rights.coar.spa.fl_str_mv |
http://purl.org/coar/access_right/c_abf2 |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-sa/4.0/ http://purl.org/coar/access_right/c_abf2 |
eu_rights_str_mv |
openAccess |
dc.format.extent.es_CO.fl_str_mv |
44 leaves |
dc.format.mimetype.es_CO.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidad de los Andes |
dc.publisher.program.es_CO.fl_str_mv |
Maestría en Biología Computacional |
dc.publisher.faculty.es_CO.fl_str_mv |
Facultad de Ciencias |
dc.publisher.department.es_CO.fl_str_mv |
Departamento de Biología |
publisher.none.fl_str_mv |
Universidad de los Andes |
dc.source.es_CO.fl_str_mv |
instname:Universidad de los Andes reponame:Repositorio Institucional Séneca |
instname_str |
Universidad de los Andes |
institution |
Universidad de los Andes |
reponame_str |
Repositorio Institucional Séneca |
collection |
Repositorio Institucional Séneca |
bitstream.url.fl_str_mv |
https://repositorio.uniandes.edu.co/bitstreams/37272a97-8760-49cb-8731-1421698d6221/download https://repositorio.uniandes.edu.co/bitstreams/f24d1dc8-c5d6-4ea7-a491-a3f059ac4f3a/download https://repositorio.uniandes.edu.co/bitstreams/7e194934-1f74-42df-8a0c-28c567ecba3b/download |
bitstream.checksum.fl_str_mv |
a37515478c2bded3b2438bdf312c5157 5c666851f12aeb2550cdb9d10500b30b 085178f6a2a2ef909486b7ca9ec190a9 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositorio institucional Séneca |
repository.mail.fl_str_mv |
adminrepositorio@uniandes.edu.co |
_version_ |
1812133813816393728 |
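
The description above outlines a pipeline in which each DNA sequence is mapped to points on a tetrahedron and then summarized by its Fourier power spectrum. The following is a minimal sketch of that idea, not the TAXOFOR implementation. The vertex assignment in `TETRAHEDRON`, the function names, the naive O(n^2) transform, and the restriction to equal-length sequences containing only A, C, G and T are all assumptions made for illustration; the record does not specify any of them.

```python
import cmath
import math

# Assumed vertex assignment: unit vectors pointing to the corners of a
# regular tetrahedron centred at the origin. The thesis may use another one.
TETRAHEDRON = {
    "A": (0.0, 0.0, 1.0),
    "C": (-math.sqrt(2) / 3, math.sqrt(6) / 3, -1 / 3),
    "G": (-math.sqrt(2) / 3, -math.sqrt(6) / 3, -1 / 3),
    "T": (2 * math.sqrt(2) / 3, 0.0, -1 / 3),
}


def encode(sequence):
    """Map a DNA string to three numeric tracks (x, y, z), one value per base."""
    points = [TETRAHEDRON[base] for base in sequence.upper()]
    return [[p[axis] for p in points] for axis in range(3)]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real-valued list."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(sequence):
    """Sum of squared DFT magnitudes over the three coordinate tracks."""
    spectra = [dft(track) for track in encode(sequence)]
    n = len(sequence)
    return [sum(abs(s[k]) ** 2 for s in spectra) for k in range(n)]


def spectral_distance(seq_a, seq_b):
    """Euclidean distance between the power spectra of two equal-length sequences."""
    pa, pb = power_spectrum(seq_a), power_spectrum(seq_b)
    if len(pa) != len(pb):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
```

Because every base maps to a unit vector in this sketch, Parseval's identity implies that the power spectrum of a sequence of length n sums to n squared, which gives a quick sanity check on the transform. A pairwise matrix of `spectral_distance` values could then feed a hierarchical clustering, and the spectra themselves could serve as feature vectors for a Random Forest classifier, as the description suggests.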