Taxonomic assignment of 16S rRNA sequences based on Fourier analysis

Authors:
Luque y Guzmán Sáenz, Guillermo Gustavo
Resource type:
Publication date:
2016
Institution:
Universidad de los Andes
Repository:
Séneca: repositorio Uniandes
Language:
eng
OAI Identifier:
oai:repositorio.uniandes.edu.co:1992/13692
Online access:
http://hdl.handle.net/1992/13692
Keywords:
Metagenomics
Ribosomal DNA
Computational biology
Biology
Rights
openAccess
License
http://creativecommons.org/licenses/by-nc-sa/4.0/
Description
Summary: We introduce TAXOFOR, a novel machine learning classifier that uses Random Forests to assign taxonomy down to genus level to paired-end sequencing amplicons, trained with annotated sequences from the Greengenes database. It performs this task with an accuracy close to 98%, and it is faster than several of the de facto standard tools used for the same purpose in microbial ecology. To handle the DNA sequences, they are first represented numerically as projections into a 3D space defined by the vertices of a tetrahedron. The Discrete Fourier Transform then yields their power spectra, which serve as input both to train the classifier and to predict taxonomy. Parseval's identity ensures that the similarity between the numerical representations of two DNA sequences can be obtained from their power spectra. This property is tested by comparing a dendrogram obtained by hierarchical clustering on the pairwise distances between the spectra of DNA sequences with another built from the distance matrix obtained after a multiple sequence alignment (MSA). The performance and accuracy of TAXOFOR were assessed against UCLUST, RDP and MOTHUR by assigning taxonomy to the same set of 16S rRNA sequences. The initial results are promising and leave room for improvements in parallel processing and memory handling.
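
The encoding and spectral steps described in the summary can be illustrated with a minimal Python sketch. It is not the TAXOFOR implementation: the nucleotide-to-vertex assignment, the function names (`encode`, `dft`, `power_spectrum`, `spectral_distance`) and the use of a Euclidean distance between spectra of equal-length sequences are all assumptions made for illustration. With unit-length vertices and an unnormalized DFT, Parseval's identity implies that the power spectrum of an N-base sequence sums to N squared, which gives a simple sanity check.

```python
import cmath
import math

# Assumed encoding: each nucleotide maps to one vertex of a regular
# tetrahedron centred at the origin, with unit-length position vectors.
# The specific base-to-vertex assignment is illustrative only.
S = 1 / math.sqrt(3)
TETRAHEDRON = {
    "A": (S, S, S),
    "C": (S, -S, -S),
    "G": (-S, S, -S),
    "T": (-S, -S, S),
}


def encode(seq):
    """Return three numeric tracks (x, y, z), one value per nucleotide."""
    points = [TETRAHEDRON[base] for base in seq.upper()]
    return [list(axis) for axis in zip(*points)]


def dft(signal):
    """Naive O(N^2) unnormalized Discrete Fourier Transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(seq):
    """Sum of |X_k|^2 over the three coordinate tracks, per frequency k."""
    spectra = [dft(track) for track in encode(seq)]
    return [sum(abs(s[k]) ** 2 for s in spectra) for k in range(len(seq))]


def spectral_distance(seq_a, seq_b):
    """Euclidean distance between power spectra (equal lengths assumed)."""
    ps_a, ps_b = power_spectrum(seq_a), power_spectrum(seq_b)
    if len(ps_a) != len(ps_b):
        raise ValueError("this sketch only compares equal-length sequences")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ps_a, ps_b)))


if __name__ == "__main__":
    ps = power_spectrum("ACGTACGT")
    # Parseval check: the total should be close to len(seq) ** 2.
    print(sum(ps))
    print(spectral_distance("ACGT", "AAAA"))
```

In a pipeline like the one the summary describes, vectors such as the output of `power_spectrum` would be the features passed to a Random Forest classifier, and a matrix of `spectral_distance` values would be the input to hierarchical clustering. The sketch rejects ambiguous bases (for example `N`) with a `KeyError` and does not address sequences of different lengths.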