DATMA: Distributed AuTomatic Metagenomic Assembly and Annotation framework


Authors:
Benavides Arévalo, Bernardo Andrés
Resource type:
Doctoral thesis
Publication date:
2019
Institution:
Universidad de Antioquia
Repository:
Repositorio UdeA
Language:
spa
OAI Identifier:
oai:bibliotecadigital.udea.edu.co:10495/14471
Online access:
http://hdl.handle.net/10495/14471
Keywords:
Microorganisms
Microorganismo
Genes
Quality control
Control de calidad
Ecosystems
Ecosistema
Gen
http://vocabularies.unesco.org/thesaurus/concept3512
http://vocabularies.unesco.org/thesaurus/concept222
http://vocabularies.unesco.org/thesaurus/concept6517
http://vocabularies.unesco.org/thesaurus/concept211
Rights
openAccess
License
Attribution-NonCommercial-NoDerivs 2.5 Colombia (CC BY-NC-ND 2.5 CO)
Description
Summary: ABSTRACT: Bacterial populations have colonized almost every possible niche on Earth, including those considered harsh for most organisms. These extreme physical conditions make it hard to obtain genetic information from the community of organisms. Next-generation sequencing has provided a large amount of DNA data that researchers can use to study environmental samples through culture-independent shotgun metagenomic experiments. Metagenomics has made it possible to explore the large variety of microorganisms present in many complex ecosystems, such as soils, oceans, biosolids, and hot springs. Moreover, it has allowed the identification of novel bacterial and archaeal species, generating complete or near-complete genomes, and it has helped fill blind spots in underrepresented or missed taxonomic clades. One of the main challenges in metagenomic analysis is the assembly process. Microbial communities are complex: bacteria differ in genome size and abundance, some regions of their genomes are very similar, and metagenomic sequencing yields a mixture of reads from the many microorganisms present in the community. Despite the development of dozens of de novo assemblers for metagenomics, the high risk of assembling reads from different organisms into a single chromosome, which creates chimeric molecules, has not been eliminated. One alternative is to separate the reads into groups (binning) before assembly. Given that most assemblers assume that the reads belong to a single species, grouping highly similar reads into bins significantly reduces both the assembly complexity and the probability of creating chimeric contigs. In this dissertation, we introduce a binning strategy that groups reads from the same molecule into a single bin. We named our method CLAME. We show that CLAME decreases the complexity of the metagenome and allows the recovery of almost complete bacterial genomes.
We also introduce DATMA, an integration of CLAME into a distributed workflow for metagenomic analysis. DATMA is a pipeline for fast metagenomic analysis that orchestrates the following: sequencing quality control, 16S rRNA identification, read binning, de novo assembly and evaluation, gene prediction, and taxonomic annotation. We demonstrate the functionality of CLAME and DATMA by analyzing complex metagenomes, from which we recovered most of the species present; more importantly, DATMA automatically extracted an almost complete genome of the predominant species.
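
The idea of binning reads before assembly can be illustrated with a toy sketch. This is not CLAME's algorithm, which the abstract does not detail; it is a minimal stand-in that assumes two reads belong together when they are linked, directly or through other reads, by at least one shared k-mer. The function name `bin_reads`, the parameter `k`, and the example sequences are all hypothetical.

```python
from collections import defaultdict


def bin_reads(reads, k=8):
    """Toy binning (an assumption, not CLAME): reads linked directly or
    transitively by a shared k-mer end up in the same bin."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Return the representative of i's bin, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if kmer in first_seen:
                # Merge this read's bin with the bin of the earlier read.
                parent[find(idx)] = find(first_seen[kmer])
            else:
                first_seen[kmer] = idx

    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        bins[find(idx)].append(read)
    return list(bins.values())


# Two made-up "genomes" that share no 8-mer, each sampled by two
# overlapping reads (the overlap is exactly 8 bases).
genome_a = "ACGTTGCAAGCTTAGGCATCGA"
genome_b = "TTTTTGGGGGCCCCCAAAAATTGG"
example_reads = [genome_a[0:14], genome_b[0:14], genome_a[6:20], genome_b[6:20]]

for number, members in enumerate(bin_reads(example_reads), start=1):
    print(f"bin {number}: {len(members)} reads")
```

Each resulting bin could then be handed to an assembler separately, which is the step that reduces the chance of joining reads from different organisms. Real data would also need reverse-complement handling and tolerance for sequencing errors, both of which this sketch omits.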