Efficient Storage of Genomic Sequences in High Performance Computing Systems

ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequence data stored in FASTQ format files. The amount of genomic data available for r...


Authors:
Guerra Soler, Aníbal José
Resource type:
Doctoral thesis
Publication date:
2019
Institution:
Universidad de Antioquia
Repository:
Repositorio UdeA
Language:
spa
OAI Identifier:
oai:bibliotecadigital.udea.edu.co:10495/12525
Online access:
http://hdl.handle.net/10495/12525
Keywords:
Performance - evaluation
Genomic sequences
Parallel computing
Reads alignment
Reads compression
Referential compression
SIMD programming
http://id.loc.gov/authorities/subjects/sh2010105499
Rights
openAccess
License
Attribution-NonCommercial-NoDerivs 2.5 Colombia (CC BY-NC-ND 2.5 CO)
id UDEA2_d638be8d4c5a66072d5709664faf22e8
oai_identifier_str oai:bibliotecadigital.udea.edu.co:10495/12525
network_acronym_str UDEA2
network_name_str Repositorio UdeA
repository_id_str
dc.title.spa.fl_str_mv Efficient Storage of Genomic Sequences in High Performance Computing Systems
dc.creator.fl_str_mv Guerra Soler, Aníbal José
dc.contributor.advisor.none.fl_str_mv Isaza Ramírez, Sebastián
Aedo Cobo, José Edinson
dc.contributor.author.none.fl_str_mv Guerra Soler, Aníbal José
dc.subject.lcsh.none.fl_str_mv Performance - evaluation
dc.subject.proposal.spa.fl_str_mv Genomic sequences
Parallel computing
Reads alignment
Reads compression
Referential compression
SIMD programming
dc.subject.lcshuri.none.fl_str_mv http://id.loc.gov/authorities/subjects/sh2010105499
description ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ files. The amount of genomic data available to researchers has grown exponentially, posing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two approaches have emerged to exploit the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for compressing raw genomic data, which led us to identify the main needs and opportunities in this field. Based on that evaluation, we propose a novel compression workflow that aims to improve the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios compared to the best state-of-the-art compressors, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce their runtime. Using SIMD programming, we significantly accelerated the main bottleneck found in UdeACompress, suffix array construction.
publishDate 2019
dc.date.accessioned.none.fl_str_mv 2019-11-29T16:49:54Z
dc.date.available.none.fl_str_mv 2019-11-29T16:49:54Z
dc.date.issued.none.fl_str_mv 2019
dc.type.spa.fl_str_mv info:eu-repo/semantics/doctoralThesis
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_b1a7d7d4d402bcce
dc.type.hasversion.spa.fl_str_mv info:eu-repo/semantics/draft
dc.type.coar.spa.fl_str_mv http://purl.org/coar/resource_type/c_db06
dc.type.redcol.spa.fl_str_mv https://purl.org/redcol/resource_type/TD
dc.type.local.spa.fl_str_mv Thesis/Degree work - Monograph - Doctorate
format http://purl.org/coar/resource_type/c_db06
status_str draft
dc.identifier.citation.spa.fl_str_mv Guerra-Soler, A. J. (2019). Efficient Storage of Genomic Sequences in High Performance Computing Systems. (Doctoral thesis). Universidad de Antioquia. Medellín, Colombia.
dc.identifier.uri.none.fl_str_mv http://hdl.handle.net/10495/12525
dc.language.iso.spa.fl_str_mv spa
dc.rights.*.fl_str_mv Attribution-NonCommercial-NoDerivs 2.5 Colombia (CC BY-NC-ND 2.5 CO)
dc.rights.spa.fl_str_mv info:eu-repo/semantics/openAccess
dc.rights.uri.*.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/2.5/co/
dc.rights.accessrights.spa.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.creativecommons.spa.fl_str_mv https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.format.extent.spa.fl_str_mv 130
dc.format.mimetype.spa.fl_str_mv application/pdf
dc.publisher.group.spa.fl_str_mv Sistemas Embebidos e Inteligencia Computacional (SISTEMIC)
dc.publisher.place.spa.fl_str_mv Medellín, Colombia
institution Universidad de Antioquia
bitstream.url.fl_str_mv http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/2/license_url
http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/3/license_text
http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/4/license_rdf
http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/5/license.txt
http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/1/GuerraSolerAnibal_2019_EfficientStorageGenomic.pdf
bitstream.checksum.fl_str_mv 4afdbb8c545fd630ea7db775da747b2f
d41d8cd98f00b204e9800998ecf8427e
d41d8cd98f00b204e9800998ecf8427e
8a4605be74aa9ea9d79846c1fba20a33
7cc9422902dfe3e2a7483d46fcde2ddd
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositorio Institucional Universidad de Antioquia
repository.mail.fl_str_mv andres.perez@udea.edu.co
_version_ 1812173204742995968
spelling Doctor en Ingeniería Electrónica (Doctorate); Faculty of Engineering, Doctorate in Electronic Engineering; GuerraSolerAnibal_2019_EfficientStorageGenomic.pdf (Doctoral thesis, application/pdf, 7048897); 2021-05-21 11:45:22.558