DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework
ABSTRACT: Bacterial populations have colonized almost every possible niche on Earth, including those considered harsh for most organisms. These extreme physical conditions make it hard to get genetic information from the organism community. Next-generation sequencing has provided a large amount of D...
- Authors:
- Benavides Arévalo, Bernardo Andrés
- Resource type:
- Doctoral thesis
- Publication date:
- 2019
- Institution:
- Universidad de Antioquia
- Repository:
- Repositorio UdeA
- Language:
- spa
- OAI Identifier:
- oai:bibliotecadigital.udea.edu.co:10495/14471
- Online access:
- http://hdl.handle.net/10495/14471
- Keywords:
- Microorganisms
Microorganismo
Genes
Quality control
Control de calidad
Ecosystems
Ecosistema
Gen
http://vocabularies.unesco.org/thesaurus/concept3512
http://vocabularies.unesco.org/thesaurus/concept222
http://vocabularies.unesco.org/thesaurus/concept6517
http://vocabularies.unesco.org/thesaurus/concept211
- Rights
- openAccess
- License
- Atribución-NoComercial-SinDerivadas 2.5 Colombia (CC BY-NC-ND 2.5 CO)
id |
UDEA2_968dd429dca33817eb0a8b6a37f00d3d |
---|---|
oai_identifier_str |
oai:bibliotecadigital.udea.edu.co:10495/14471 |
network_acronym_str |
UDEA2 |
network_name_str |
Repositorio UdeA |
repository_id_str |
|
dc.title.spa.fl_str_mv |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
title |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
spellingShingle |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework Microorganisms Microorganismo Genes Quality control Control de calidad Ecosystems Ecosistema Gen http://vocabularies.unesco.org/thesaurus/concept3512 http://vocabularies.unesco.org/thesaurus/concept222 http://vocabularies.unesco.org/thesaurus/concept6517 http://vocabularies.unesco.org/thesaurus/concept211 |
title_short |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
title_full |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
title_fullStr |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
title_full_unstemmed |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
title_sort |
DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework |
dc.creator.fl_str_mv |
Benavides Arévalo, Bernardo Andrés |
dc.contributor.advisor.none.fl_str_mv |
Cabarcas Jaramillo, Felipe |
dc.contributor.author.none.fl_str_mv |
Benavides Arévalo, Bernardo Andrés |
dc.subject.unesco.none.fl_str_mv |
Microorganisms Microorganismo Genes Quality control Control de calidad Ecosystems Ecosistema Gen |
topic |
Microorganisms Microorganismo Genes Quality control Control de calidad Ecosystems Ecosistema Gen http://vocabularies.unesco.org/thesaurus/concept3512 http://vocabularies.unesco.org/thesaurus/concept222 http://vocabularies.unesco.org/thesaurus/concept6517 http://vocabularies.unesco.org/thesaurus/concept211 |
dc.subject.unescouri.none.fl_str_mv |
http://vocabularies.unesco.org/thesaurus/concept3512 http://vocabularies.unesco.org/thesaurus/concept222 http://vocabularies.unesco.org/thesaurus/concept6517 http://vocabularies.unesco.org/thesaurus/concept211 |
description |
ABSTRACT: Bacterial populations have colonized almost every possible niche on Earth, including those considered harsh for most organisms. These extreme physical conditions make it hard to obtain genetic information from the organism community. Next-generation sequencing has provided a large amount of DNA data that researchers can use to study environmental samples through culture-independent shotgun metagenomic experiments. Metagenomics has made it possible to explore the large variety of microorganisms present in many complex ecosystems, such as soils, oceans, biosolids, and hot springs. Moreover, it has allowed the identification of novel bacterial and archaeal species, generating complete or near-complete genomes, and it has helped fill blind spots in underrepresented or missed taxonomic clades. One of the main challenges in metagenomic analysis is the assembly process. Microbial communities are complex, bacteria have different genome sizes and abundances, some regions of their genomes are very similar, and metagenomic sequencing yields a mixture of reads from the many microorganisms present in the community. Despite the development of dozens of de novo assemblers for metagenomics, these tools have not eliminated the high risk of assembling reads from different organisms into a single chromosome, which creates chimeric molecules. One alternative is to separate the reads into groups (binning) before the assembly process. Because most assemblers assume that the reads belong to a single species, grouping highly similar reads into bins significantly reduces both the assembly complexity and the probability of creating chimeric contigs. In this dissertation, we introduce a binning strategy that groups reads from the same molecule into a single bin. We named our method CLAME. We showed that CLAME decreases the complexity of the metagenome and allows the recovery of almost complete bacterial genomes.
We also introduce DATMA, an integration of CLAME into a distributed workflow for metagenomic analysis. DATMA is a pipeline for fast metagenomic analysis that orchestrates the following steps: sequencing quality control, 16S rRNA identification, read binning, de novo assembly and evaluation, gene prediction, and taxonomic annotation. We demonstrate the functionality of CLAME and DATMA by analyzing complex metagenomes and recovering most of their species; more importantly, DATMA automatically extracted an almost complete genome of the predominant species. |
publishDate |
2019 |
dc.date.issued.none.fl_str_mv |
2019 |
dc.date.accessioned.none.fl_str_mv |
2020-05-19T21:42:51Z |
dc.date.available.none.fl_str_mv |
2020-05-19T21:42:51Z |
dc.type.spa.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
dc.type.coarversion.fl_str_mv |
http://purl.org/coar/version/c_b1a7d7d4d402bcce |
dc.type.hasversion.spa.fl_str_mv |
info:eu-repo/semantics/draft |
dc.type.coar.spa.fl_str_mv |
http://purl.org/coar/resource_type/c_db06 |
dc.type.redcol.spa.fl_str_mv |
https://purl.org/redcol/resource_type/TD |
dc.type.local.spa.fl_str_mv |
Thesis/Degree work - Monograph - Doctorate |
format |
http://purl.org/coar/resource_type/c_db06 |
status_str |
draft |
dc.identifier.uri.none.fl_str_mv |
http://hdl.handle.net/10495/14471 |
url |
http://hdl.handle.net/10495/14471 |
dc.language.iso.spa.fl_str_mv |
spa |
language |
spa |
dc.rights.*.fl_str_mv |
Atribución-NoComercial-SinDerivadas 2.5 Colombia (CC BY-NC-ND 2.5 CO) |
dc.rights.spa.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.rights.uri.*.fl_str_mv |
http://creativecommons.org/licenses/by-nc-nd/2.5/co/ |
dc.rights.accessrights.spa.fl_str_mv |
http://purl.org/coar/access_right/c_abf2 |
dc.rights.creativecommons.spa.fl_str_mv |
https://creativecommons.org/licenses/by-nc-nd/4.0/ |
rights_invalid_str_mv |
Atribución-NoComercial-SinDerivadas 2.5 Colombia (CC BY-NC-ND 2.5 CO) http://creativecommons.org/licenses/by-nc-nd/2.5/co/ http://purl.org/coar/access_right/c_abf2 https://creativecommons.org/licenses/by-nc-nd/4.0/ |
eu_rights_str_mv |
openAccess |
dc.format.extent.spa.fl_str_mv |
112 |
dc.format.mimetype.spa.fl_str_mv |
application/pdf |
dc.publisher.place.spa.fl_str_mv |
Medellín, Colombia |
institution |
Universidad de Antioquia |
bitstream.url.fl_str_mv |
http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/2/license_rdf http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/3/license.txt http://bibliotecadigital.udea.edu.co/bitstream/10495/14471/1/BenavidesBernardo_2019_DistributedAutomaticMetagenomic.pdf |
bitstream.checksum.fl_str_mv |
b88b088d9957e670ce3b3fbe2eedbc13 8a4605be74aa9ea9d79846c1fba20a33 85d1e696e10b04fb296741ff3f6d9a27 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositorio Institucional Universidad de Antioquia |
repository.mail.fl_str_mv |
andres.perez@udea.edu.co |
_version_ |
1812173181583097856 |
spelling |
Advisor: Cabarcas Jaramillo, Felipe. Author: Benavides Arévalo, Bernardo Andrés. Degree: Doctor en Ingeniería Electrónica (Doctorado). Program: Facultad de Ingeniería. Doctorado en Ingeniería Electrónica, Universidad de Antioquia. Files: license_rdf (CC-LICENSE, application/rdf+xml; charset=utf-8, 823 bytes); license.txt (LICENSE, text/plain; charset=utf-8, 1748 bytes); BenavidesBernardo_2019_DistributedAutomaticMetagenomic.pdf (ORIGINAL, Tesis doctoral, application/pdf, 4551146 bytes). Record timestamp: 2021-05-21 11:44:13.498. |