DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework

ABSTRACT: Bacterial populations have colonized almost every possible niche on Earth, including those considered harsh for most organisms. These extreme physical conditions make it hard to get genetic information from the organism community. Next-generation sequencing has provided a large amount of D...


Authors:
Benavides Arévalo, Bernardo Andrés
Resource type:
Doctoral thesis
Publication date:
2019
Institution:
Universidad de Antioquia
Repository:
Repositorio UdeA
Language:
spa
OAI Identifier:
oai:bibliotecadigital.udea.edu.co:10495/14471
Online access:
http://hdl.handle.net/10495/14471
Keywords:
Microorganisms
Microorganismo
Genes
Quality control
Control de calidad
Ecosystems
Ecosistema
Gen
http://vocabularies.unesco.org/thesaurus/concept3512
http://vocabularies.unesco.org/thesaurus/concept222
http://vocabularies.unesco.org/thesaurus/concept6517
http://vocabularies.unesco.org/thesaurus/concept211
Rights
openAccess
License
Atribución-NoComercial-SinDerivadas 2.5 Colombia (CC BY-NC-ND 2.5 CO)
id UDEA2_968dd429dca33817eb0a8b6a37f00d3d
oai_identifier_str oai:bibliotecadigital.udea.edu.co:10495/14471
network_acronym_str UDEA2
network_name_str Repositorio UdeA
repository_id_str
dc.title.spa.fl_str_mv DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
title DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
spellingShingle DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
Microorganisms
Microorganismo
Genes
Quality control
Control de calidad
Ecosystems
Ecosistema
Gen
http://vocabularies.unesco.org/thesaurus/concept3512
http://vocabularies.unesco.org/thesaurus/concept222
http://vocabularies.unesco.org/thesaurus/concept6517
http://vocabularies.unesco.org/thesaurus/concept211
title_short DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
title_full DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
title_fullStr DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
title_full_unstemmed DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
title_sort DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
dc.creator.fl_str_mv Benavides Arévalo, Bernardo Andrés
dc.contributor.advisor.none.fl_str_mv Cabarcas Jaramillo, Felipe
dc.contributor.author.none.fl_str_mv Benavides Arévalo, Bernardo Andrés
dc.subject.unesco.none.fl_str_mv Microorganisms
Microorganismo
Genes
Quality control
Control de calidad
Ecosystems
Ecosistema
Gen
topic Microorganisms
Microorganismo
Genes
Quality control
Control de calidad
Ecosystems
Ecosistema
Gen
http://vocabularies.unesco.org/thesaurus/concept3512
http://vocabularies.unesco.org/thesaurus/concept222
http://vocabularies.unesco.org/thesaurus/concept6517
http://vocabularies.unesco.org/thesaurus/concept211
dc.subject.unescouri.none.fl_str_mv http://vocabularies.unesco.org/thesaurus/concept3512
http://vocabularies.unesco.org/thesaurus/concept222
http://vocabularies.unesco.org/thesaurus/concept6517
http://vocabularies.unesco.org/thesaurus/concept211
description ABSTRACT: Bacterial populations have colonized almost every possible niche on Earth, including those considered harsh for most organisms. These extreme physical conditions make it hard to obtain genetic information from the community of organisms. Next-generation sequencing has provided a large amount of DNA data that researchers can use to study environmental samples through culture-independent shotgun metagenomic experiments. Metagenomics has made it possible to explore the large variety of microorganisms present in many complex ecosystems, such as soils, oceans, biosolids, and hot springs. Moreover, it has allowed the identification of novel bacterial and archaeal species, generating complete or near-complete genomes, and it has helped fill blind spots in underrepresented or missed taxonomic clades. One of the main challenges in metagenomic analysis is the assembly process. Microbial communities are complex, bacteria differ in genome size and abundance, some regions of their genomes are very similar, and metagenomic sequencing yields a mixture of reads from the many microorganisms present in the community. Despite the development of dozens of de novo assemblers for metagenomics, they have not eliminated the high risk of assembling reads from different organisms as a single chromosome, which creates chimeric molecules. One way to address this is to separate reads into groups (binning) before the assembly process. Because most assemblers assume that the reads belong to a single species, grouping highly similar reads into bins significantly reduces both the assembly complexity and the probability of creating chimeric contigs. In this dissertation, we introduce a binning strategy that groups reads from the same molecule into a single bin. We named our method CLAME. We show that CLAME decreases the complexity of a metagenome and allows the recovery of almost complete bacterial genomes.
We also introduce DATMA, an integration of CLAME into a distributed workflow for metagenomic analysis. DATMA is a pipeline for fast metagenomic analysis that orchestrates the following steps: sequencing quality control, 16S rRNA identification, read binning, de novo assembly and evaluation, gene prediction, and taxonomic annotation. We demonstrate the functionality of CLAME and DATMA by analyzing complex metagenomes, from which we recovered most of the species present; more importantly, DATMA automatically extracted an almost complete genome of the predominant species.
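The idea of binning reads before assembly, as described in the abstract, can be illustrated with a minimal sketch. This is not CLAME's algorithm, which this record does not describe. As a stand-in, the sketch assumes that two reads belong together when they share at least one exact k-mer, and it merges such reads transitively with a union-find structure. The function name bin_reads, the parameter k, and the toy reads are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict


def bin_reads(reads, k=8):
    """Group reads into bins (illustrative only, not CLAME).

    Two reads end up in the same bin when they are linked, directly or
    through a chain of other reads, by at least one shared exact k-mer.
    Returns a sorted list of bins, each a list of read indices.
    """
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            if kmer in first_seen:
                union(idx, first_seen[kmer])
            else:
                first_seen[kmer] = idx

    bins = defaultdict(list)
    for idx in range(len(reads)):
        bins[find(idx)].append(idx)
    return sorted(bins.values())


# Toy input: reads 0 and 1 overlap, reads 2 and 3 overlap, and the two
# pairs share no 6-mer, so they fall into separate bins.
toy_reads = ["ACGTACGTAA", "GTACGTAATT", "TTTTGGGGCC", "TGGGGCCAAA"]
toy_bins = bin_reads(toy_reads, k=6)
```

Each resulting bin could then be passed to an assembler on its own, which is the motivation the abstract gives for binning: an assembler that assumes a single species sees a less mixed input.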
publishDate 2019
dc.date.issued.none.fl_str_mv 2019
dc.date.accessioned.none.fl_str_mv 2020-05-19T21:42:51Z
dc.date.available.none.fl_str_mv 2020-05-19T21:42:51Z
dc.type.spa.fl_str_mv info:eu-repo/semantics/doctoralThesis
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_b1a7d7d4d402bcce
dc.type.hasversion.spa.fl_str_mv info:eu-repo/semantics/draft
dc.type.coar.spa.fl_str_mv http://purl.org/coar/resource_type/c_db06
dc.type.redcol.spa.fl_str_mv https://purl.org/redcol/resource_type/TD
dc.type.local.spa.fl_str_mv Tesis/Trabajo de grado - Monografía - Doctorado
format http://purl.org/coar/resource_type/c_db06
status_str draft
dc.identifier.uri.none.fl_str_mv http://hdl.handle.net/10495/14471
url http://hdl.handle.net/10495/14471
dc.language.iso.spa.fl_str_mv spa
language spa
dc.rights.*.fl_str_mv Atribución-NoComercial-SinDerivadas 2.5 Colombia (CC BY-NC-ND 2.5 CO)
dc.rights.spa.fl_str_mv info:eu-repo/semantics/openAccess
dc.rights.uri.*.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/2.5/co/
dc.rights.accessrights.spa.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.creativecommons.spa.fl_str_mv https://creativecommons.org/licenses/by-nc-nd/4.0/
rights_invalid_str_mv Atribución-NoComercial-SinDerivadas 2.5 Colombia (CC BY-NC-ND 2.5 CO)
http://creativecommons.org/licenses/by-nc-nd/2.5/co/
http://purl.org/coar/access_right/c_abf2
https://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.format.extent.spa.fl_str_mv 112
dc.format.mimetype.spa.fl_str_mv application/pdf
dc.publisher.place.spa.fl_str_mv Medellín, Colombia
institution Universidad de Antioquia
bitstream.url.fl_str_mv http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/2/license_rdf
http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/3/license.txt
http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/1/BenavidesBernardo_2019_DistributedAutomaticMetagenomic.pdf
bitstream.checksum.fl_str_mv b88b088d9957e670ce3b3fbe2eedbc13
8a4605be74aa9ea9d79846c1fba20a33
85d1e696e10b04fb296741ff3f6d9a27
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad de Antioquia
repository.mail.fl_str_mv andres.perez@udea.edu.co
_version_ 1812173181583097856
spelling Doctor en Ingeniería Electrónica; Doctorado; Facultad de Ingeniería. Doctorado en Ingeniería Electrónica; Universidad de Antioquia
CC-LICENSE license_rdf application/rdf+xml; charset=utf-8 823 http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/2/license_rdf b88b088d9957e670ce3b3fbe2eedbc13 MD5 2
LICENSE license.txt text/plain; charset=utf-8 1748 http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/3/license.txt 8a4605be74aa9ea9d79846c1fba20a33 MD5 3
ORIGINAL BenavidesBernardo_2019_DistributedAutomaticMetagenomic.pdf Tesis doctoral application/pdf 4551146 http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/1/BenavidesBernardo_2019_DistributedAutomaticMetagenomic.pdf 85d1e696e10b04fb296741ff3f6d9a27 MD5 1
10495/14471 oai:bibliotecadigital.udea.edu.co:10495/14471 2021-05-21 11:44:13.498