The complex task of choosing a de novo assembly: Lessons from fungal genomes

Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next generation...


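Authors: Gallo, Juan Esteban; Muñoz, José Fernando; Misas, Elizabeth; McEwen, Juan Guillermo; Clay, Oliver Keatinge

Resource type: article

Publication date: 2014

Institution: Universidad del Rosario

Repository: Repositorio EdocUR - U. Rosario

Language: eng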

id	EDOCUR2_2aca8c616771aa67814d7f918c6c8514
oai_identifier_str	oai:repository.urosario.edu.co:10336/23761
network_acronym_str	EDOCUR2
network_name_str	Repositorio EdocUR - U. Rosario
repository_id_str
spelling	The complex task of choosing a de novo assembly: Lessons from fungal genomes. Gallo, Juan Esteban; Muñoz, José Fernando; Misas, Elizabeth; McEwen, Juan Guillermo; Clay, Oliver Keatinge. Computational Biology and Chemistry, ISSN:14769271, Vol.53, No.PA (2014); pp. 97-107.
dc.title.spa.fl_str_mv	The complex task of choosing a de novo assembly: Lessons from fungal genomes
title	The complex task of choosing a de novo assembly: Lessons from fungal genomes
dc.subject.keyword.spa.fl_str_mv	Complex task; De novo assemblies; Genome assembly; Next-generation sequencing; Spacer DNA; Algorithm; Contig mapping; DNA sequence; Fungal genome; Genetics; High throughput sequencing; Nucleotide repeat; Open reading frame; Paracoccidioides; Quality control; Statistics and numerical data; Algorithms; Benchmarking; Contig Mapping; High-Throughput Nucleotide Sequencing; Open Reading Frames; Paracoccidioides; Genome assembly methods; Next-generation sequencing; Repetitive DNA
dc.subject.keyword.eng.fl_str_mv	DNA; Fungal; Intergenic; Nucleic Acid; DNA; Genome; Repetitive Sequences; Sequence Analysis
description	Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next-generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide whether one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127 bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species. By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply a component of a first-pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes. © 2014 Elsevier Ltd. All rights reserved.
publishDate	2014
dc.date.created.spa.fl_str_mv	2014
dc.date.accessioned.none.fl_str_mv	2020-05-26T00:05:10Z
dc.date.available.none.fl_str_mv	2020-05-26T00:05:10Z
dc.type.eng.fl_str_mv	article
dc.type.coarversion.fl_str_mv	http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv	http://purl.org/coar/resource_type/c_6501
dc.type.spa.spa.fl_str_mv	Artículo
dc.identifier.doi.none.fl_str_mv	https://doi.org/10.1016/j.compbiolchem.2014.08.014
dc.identifier.issn.none.fl_str_mv	14769271
dc.identifier.uri.none.fl_str_mv	https://repository.urosario.edu.co/handle/10336/23761
url	https://doi.org/10.1016/j.compbiolchem.2014.08.014 https://repository.urosario.edu.co/handle/10336/23761
identifier_str_mv	14769271
dc.language.iso.spa.fl_str_mv	eng
language	eng
dc.relation.citationEndPage.none.fl_str_mv	107
dc.relation.citationIssue.none.fl_str_mv	No. PA
dc.relation.citationStartPage.none.fl_str_mv	97
dc.relation.citationTitle.none.fl_str_mv	Computational Biology and Chemistry
dc.relation.citationVolume.none.fl_str_mv	Vol. 53
dc.relation.ispartof.spa.fl_str_mv	Computational Biology and Chemistry, ISSN:14769271, Vol.53, No.PA (2014); pp. 97-107
dc.relation.uri.spa.fl_str_mv	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908554464&doi=10.1016%2fj.compbiolchem.2014.08.014&partnerID=40&md5=66fd3c29a8b9f784aa0c6941b74970e4
dc.rights.coar.fl_str_mv	http://purl.org/coar/access_right/c_abf2
dc.rights.acceso.spa.fl_str_mv	Abierto (Texto Completo)
rights_invalid_str_mv	Abierto (Texto Completo) http://purl.org/coar/access_right/c_abf2
dc.format.mimetype.none.fl_str_mv	application/pdf
dc.publisher.spa.fl_str_mv	Elsevier Ltd
institution	Universidad del Rosario
dc.source.instname.spa.fl_str_mv	instname:Universidad del Rosario
dc.source.reponame.spa.fl_str_mv	reponame:Repositorio Institucional EdocUR
repository.name.fl_str_mv	Repositorio institucional EdocUR
repository.mail.fl_str_mv	edocur@urosario.edu.co
_version_	1837007785834840064
