A Recurrent Neural Network approach for whole genome bacteria classification

The classification of bacteria plays an essential role in multiple areas of research. Those areas include experimental biology, food and water industries, pathology, microbiology, and evolutionary studies. Although there exist methodologies for classification - such as mass spectrometry, single-nucl...

Authors:
Lugo Martínez, Luis Eduardo
Resource type:
Publication date:
2018
Institution:
Universidad Nacional de Colombia
Repository:
Universidad Nacional de Colombia
Language:
spa
OAI Identifier:
oai:repositorio.unal.edu.co:unal/68663
Online access:
https://repositorio.unal.edu.co/handle/unal/68663
http://bdigital.unal.edu.co/69758/
Keywords:
0 Generalidades / Computer science, information and general works
5 Ciencias naturales y matemáticas / Science
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
Recurrent neural network
Bacteria identification
Whole genome sequence
Rights
openAccess
License
Attribution-NonCommercial 4.0 International
id UNACIONAL2_68df3ee10bcacf064826e77403a261d5
oai_identifier_str oai:repositorio.unal.edu.co:unal/68663
network_acronym_str UNACIONAL2
network_name_str Universidad Nacional de Colombia
repository_id_str
dc.title.spa.fl_str_mv A Recurrent Neural Network approach for whole genome bacteria classification
dc.creator.fl_str_mv Lugo Martínez, Luis Eduardo
dc.contributor.author.spa.fl_str_mv Lugo Martínez, Luis Eduardo
dc.contributor.spa.fl_str_mv Barreto, Emiliano
dc.subject.ddc.spa.fl_str_mv 0 Generalidades / Computer science, information and general works
5 Ciencias naturales y matemáticas / Science
6 Tecnología (ciencias aplicadas) / Technology
62 Ingeniería y operaciones afines / Engineering
dc.subject.proposal.spa.fl_str_mv Recurrent neural network
Bacteria identification
Whole genome sequence
description The classification of bacteria plays an essential role in multiple areas of research, including experimental biology, the food and water industries, pathology, microbiology, and evolutionary studies. Although classification methodologies already exist (such as mass spectrometry, single-nucleotide polymorphisms, microscopic morphology, and neural network approaches), a transition to a taxonomy based on whole genome sequences is already underway. Next Generation Sequencing supports this transition by producing DNA sequence data efficiently. However, the rate at which DNA sequence data is generated and the high dimensionality of such data call for faster computational methods. Machine learning, an area of artificial intelligence, can analyze high-dimensional data systematically, quickly, and efficiently. Therefore, we propose a sequential deep learning model for bacteria classification. The proposed neural network exploits the vast amounts of information generated by Next Generation Sequencing to learn a classification model for whole genome bacterial sequences. A distributed representation based on k-mers with k = {3, 4, 5} provides an efficient encoding of the bacterial sequences. The classification model relies on a bidirectional recurrent neural network architecture. It achieves an accuracy of 0.99455 +/- 0.00281 for 14 species, 0.95031 +/- 0.00469 for 48 species, and 0.89107 +/- 0.00392 for 111 species. In validation, the bidirectional recurrent neural network outperformed other classification approaches, such as Naive Bayes and a feedforward neural network. The proposed model provides an automated identification method: it infers the species of bacterial whole genome sequences without requiring any manual feature extraction.
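The description mentions an encoding based on k-mers with k = {3, 4, 5}. This record does not say how the thesis builds its distributed representation, so the following is only a minimal sketch of the tokenization step that such an encoding would likely start from. The function names, the stride of 1, the ACGT-only alphabet, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration, not details taken from the thesis.

```python
from itertools import product
from typing import Dict, List

ALPHABET = "ACGT"  # assumption: only unambiguous DNA bases are indexed


def kmers(sequence: str, k: int) -> List[str]:
    """Return all overlapping k-mers of `sequence` (stride 1, an assumption)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(ks=(3, 4, 5)) -> Dict[str, int]:
    """Assign an integer id to every possible k-mer over ACGT for each k."""
    vocab: Dict[str, int] = {}
    for k in ks:
        for letters in product(ALPHABET, repeat=k):
            vocab["".join(letters)] = len(vocab)
    return vocab


def encode(sequence: str, k: int, vocab: Dict[str, int]) -> List[int]:
    """Map a sequence to k-mer ids, skipping k-mers with bases outside ACGT."""
    return [vocab[m] for m in kmers(sequence, k) if m in vocab]
```

With these hypothetical helpers, `build_vocabulary()` covers 4^3 + 4^4 + 4^5 = 1344 distinct k-mers. The resulting id sequences could then be fed to an embedding layer in front of a bidirectional recurrent network, although that step is not shown here.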
publishDate 2018
dc.date.issued.spa.fl_str_mv 2018-09
dc.date.accessioned.spa.fl_str_mv 2019-07-03T07:26:06Z
dc.date.available.spa.fl_str_mv 2019-07-03T07:26:06Z
dc.type.spa.fl_str_mv Trabajo de grado - Maestría
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/masterThesis
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/acceptedVersion
dc.type.content.spa.fl_str_mv Text
dc.type.redcol.spa.fl_str_mv http://purl.org/redcol/resource_type/TM
status_str acceptedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.unal.edu.co/handle/unal/68663
dc.identifier.eprints.spa.fl_str_mv http://bdigital.unal.edu.co/69758/
dc.language.iso.spa.fl_str_mv spa
language spa
dc.relation.ispartof.spa.fl_str_mv Universidad Nacional de Colombia Sede Bogotá Facultad de Ingeniería Departamento de Ingeniería de Sistemas e Industrial
Departamento de Ingeniería de Sistemas e Industrial
dc.relation.references.spa.fl_str_mv Lugo Martínez, Luis Eduardo (2018) A Recurrent Neural Network approach for whole genome bacteria classification. Maestría thesis, Universidad Nacional de Colombia - Sede Bogotá.
dc.rights.spa.fl_str_mv Derechos reservados - Universidad Nacional de Colombia
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.license.spa.fl_str_mv Atribución-NoComercial 4.0 Internacional
dc.rights.uri.spa.fl_str_mv http://creativecommons.org/licenses/by-nc/4.0/
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.mimetype.spa.fl_str_mv application/pdf
institution Universidad Nacional de Colombia
bitstream.url.fl_str_mv https://repositorio.unal.edu.co/bitstream/unal/68663/1/MastersFinalProject_LuisLugo.pdf
https://repositorio.unal.edu.co/bitstream/unal/68663/2/MastersFinalProject_LuisLugo.pdf.jpg
bitstream.checksum.fl_str_mv 0d135c899f47562df21636d52558e4ab
25beac6984c4b72ce486a3f6e5d0e4ec
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
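The record lists an MD5 checksum for each bitstream. As a minimal sketch, assuming the file has already been downloaded to a local path (the path and the helper names below are hypothetical), a downloaded copy could be checked against the listed value like this:

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Return True if the file's MD5 equals the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("MastersFinalProject_LuisLugo.pdf", "0d135c899f47562df21636d52558e4ab")` would compare a local copy of the thesis PDF with the first checksum in this record. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.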
repository.name.fl_str_mv Repositorio Institucional Universidad Nacional de Colombia
repository.mail.fl_str_mv repositorio_nal@unal.edu.co
_version_ 1806886505451356160