Efficient Storage of Genomic Sequences in High Performance Computing Systems
ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequence data stored in FASTQ format files. The amount of genomic data available for r...
- Authors:
- Guerra Soler, Aníbal José
- Resource type:
- Doctoral thesis
- Publication date:
- 2019
- Institution:
- Universidad de Antioquia
- Repository:
- Repositorio UdeA
- Language:
- spa
- OAI Identifier:
- oai:bibliotecadigital.udea.edu.co:10495/12525
- Online access:
- http://hdl.handle.net/10495/12525
- Keywords:
- Performance - evaluation; Genomic sequences; Parallel computing; Reads alignment; Reads compression; Referential compression; SIMD programming; http://id.loc.gov/authorities/subjects/sh2010105499
- Rights
- openAccess
- License
- Attribution-NonCommercial-NoDerivs 2.5 Colombia (CC BY-NC-ND 2.5 CO)
id | UDEA2_d638be8d4c5a66072d5709664faf22e8 |
---|---|
oai_identifier_str | oai:bibliotecadigital.udea.edu.co:10495/12525 |
network_acronym_str | UDEA2 |
network_name_str | Repositorio UdeA |
repository_id_str | |
dc.title.spa.fl_str_mv | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
title | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
spellingShingle | Efficient Storage of Genomic Sequences in High Performance Computing Systems; Performance - evaluation; Genomic sequences; Parallel computing; Reads alignment; Reads compression; Referential compression; SIMD programming; http://id.loc.gov/authorities/subjects/sh2010105499 |
title_short | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
title_full | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
title_fullStr | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
title_full_unstemmed | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
title_sort | Efficient Storage of Genomic Sequences in High Performance Computing Systems |
dc.creator.fl_str_mv | Guerra Soler, Aníbal José |
dc.contributor.advisor.none.fl_str_mv | Isaza Ramírez, Sebastián; Aedo Cobo, José Edinson |
dc.contributor.author.none.fl_str_mv | Guerra Soler, Aníbal José |
dc.subject.lcsh.none.fl_str_mv | Performance - evaluation |
topic | Performance - evaluation; Genomic sequences; Parallel computing; Reads alignment; Reads compression; Referential compression; SIMD programming; http://id.loc.gov/authorities/subjects/sh2010105499 |
dc.subject.proposal.spa.fl_str_mv | Genomic sequences; Parallel computing; Reads alignment; Reads compression; Referential compression; SIMD programming |
dc.subject.lcshuri.none.fl_str_mv | http://id.loc.gov/authorities/subjects/sh2010105499 |
description | ABSTRACT: In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequence data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture, in order to develop a parallel version of the UdeACompress algorithms to reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the suffix array construction. |
publishDate | 2019 |
dc.date.accessioned.none.fl_str_mv | 2019-11-29T16:49:54Z |
dc.date.available.none.fl_str_mv | 2019-11-29T16:49:54Z |
dc.date.issued.none.fl_str_mv | 2019 |
dc.type.spa.fl_str_mv | info:eu-repo/semantics/doctoralThesis |
dc.type.coarversion.fl_str_mv | http://purl.org/coar/version/c_b1a7d7d4d402bcce |
dc.type.hasversion.spa.fl_str_mv | info:eu-repo/semantics/draft |
dc.type.coar.spa.fl_str_mv | http://purl.org/coar/resource_type/c_db06 |
dc.type.redcol.spa.fl_str_mv | https://purl.org/redcol/resource_type/TD |
dc.type.local.spa.fl_str_mv | Thesis/Degree work - Monograph - Doctorate |
format | http://purl.org/coar/resource_type/c_db06 |
status_str | draft |
dc.identifier.citation.spa.fl_str_mv | Guerra-Soler, A. J. (2019). Efficient Storage of Genomic Sequences in High Performance Computing Systems. (Doctoral thesis). Universidad de Antioquia. Medellín, Colombia. |
dc.identifier.uri.none.fl_str_mv | http://hdl.handle.net/10495/12525 |
identifier_str_mv | Guerra-Soler, A. J. (2019). Efficient Storage of Genomic Sequences in High Performance Computing Systems. (Doctoral thesis). Universidad de Antioquia. Medellín, Colombia. |
url | http://hdl.handle.net/10495/12525 |
dc.language.iso.spa.fl_str_mv | spa |
language | spa |
dc.rights.*.fl_str_mv | Attribution-NonCommercial-NoDerivs 2.5 Colombia (CC BY-NC-ND 2.5 CO) |
dc.rights.spa.fl_str_mv | info:eu-repo/semantics/openAccess |
dc.rights.uri.*.fl_str_mv | http://creativecommons.org/licenses/by-nc-nd/2.5/co/ |
dc.rights.accessrights.spa.fl_str_mv | http://purl.org/coar/access_right/c_abf2 |
dc.rights.creativecommons.spa.fl_str_mv | https://creativecommons.org/licenses/by-nc-nd/4.0/ |
rights_invalid_str_mv | Attribution-NonCommercial-NoDerivs 2.5 Colombia (CC BY-NC-ND 2.5 CO); http://creativecommons.org/licenses/by-nc-nd/2.5/co/; http://purl.org/coar/access_right/c_abf2; https://creativecommons.org/licenses/by-nc-nd/4.0/ |
eu_rights_str_mv | openAccess |
dc.format.extent.spa.fl_str_mv | 130 |
dc.format.mimetype.spa.fl_str_mv | application/pdf |
dc.publisher.group.spa.fl_str_mv | Sistemas Embebidos e Inteligencia Computacional (SISTEMIC) |
dc.publisher.place.spa.fl_str_mv | Medellín, Colombia |
institution | Universidad de Antioquia |
bitstream.url.fl_str_mv | http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/2/license_url http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/3/license_text http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/4/license_rdf http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/5/license.txt http://bibliotecadigital.udea.edu.co/bitstream/10495/12525/1/GuerraSolerAnibal_2019_EfficientStorageGenomic.pdf |
bitstream.checksum.fl_str_mv | 4afdbb8c545fd630ea7db775da747b2f d41d8cd98f00b204e9800998ecf8427e d41d8cd98f00b204e9800998ecf8427e 8a4605be74aa9ea9d79846c1fba20a33 7cc9422902dfe3e2a7483d46fcde2ddd |
bitstream.checksumAlgorithm.fl_str_mv | MD5 MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv | Repositorio Institucional Universidad de Antioquia |
repository.mail.fl_str_mv | andres.perez@udea.edu.co |
_version_ | 1812173204742995968 |