A taxonomy of tools and approaches for distributed genomic analyses

The amount of biomedical data collected and stored has grown significantly. Analyzing such large volumes of data can no longer be done by individuals or single organizations. Thus, the scientific community is creating global collaborative efforts to analyze these data. However, biomedical data...

Authors:
Garzón, Wilmer
Benavides, Luis Alberto
Gignard, Alban
Südholt, Mario
Resource type:
Journal article
Publication date:
2022
Institution:
Escuela Colombiana de Ingeniería Julio Garavito
Repository:
Repositorio Institucional ECI
Language:
eng
OAI Identifier:
oai:repositorio.escuelaing.edu.co:001/3156
Online access:
https://repositorio.escuelaing.edu.co/handle/001/3156
https://repositorio.escuelaing.edu.co/
Keywords:
Biometry
Information analysis
Biomedical research
Medical technology
Distributed biomedical analyses
Fully distributed collaborations
Reproducibility
Scalability
Multi-site analyses
Distributed workflow analyses
Rights
openAccess
License
http://purl.org/coar/access_right/c_abf2
id ESCUELAIG2_5d5f54a6d7c352f15aa23341e2ebbb11
network_acronym_str ESCUELAIG2
network_name_str Repositorio Institucional ECI
dc.contributor.researchgroup.spa.fl_str_mv CTG - Informática
description The amount of biomedical data collected and stored has grown significantly. Analyzing such large volumes of data can no longer be done by individuals or single organizations. Thus, the scientific community is creating global collaborative efforts to analyze these data. However, biomedical data are subject to several legal and socio-economic restrictions that hinder research collaboration. In this paper, we argue that researchers require new tools and techniques to address the restrictions and needs of global scientific collaborations over geo-distributed biomedical data. These tools and techniques must support what we call Fully Distributed Collaborations (FDCs): research endeavors that exploit and analyze massive biomedical information collaboratively while respecting legal and socio-economic restrictions. This paper first motivates and discusses the requirements of FDCs in the context of a research collaboration on the development of diagnostic and predictive tools for the risk of intracranial aneurysm formation and rupture (the ICAN project). The paper then presents a taxonomy classifying the current tools and techniques for biomedical analysis with respect to the proposed requirements. The taxonomy considers three key architectural features to support FDC scenarios: data and computation placement, privacy and security, and performance and scalability. The review reveals new research opportunities to design tools and techniques for multi-site analyses that encourage scientific collaboration while mitigating technical and legal constraints.
publishDate 2022
dc.date.issued.none.fl_str_mv 2022
dc.date.accessioned.none.fl_str_mv 2024-07-11T16:51:03Z
dc.date.available.none.fl_str_mv 2024-07-11T16:51:03Z
dc.type.spa.fl_str_mv Journal article
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_2df8fbb1
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.version.spa.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.coar.spa.fl_str_mv http://purl.org/coar/resource_type/c_6501
dc.type.content.spa.fl_str_mv Text
dc.type.driver.spa.fl_str_mv info:eu-repo/semantics/article
format http://purl.org/coar/resource_type/c_6501
status_str publishedVersion
dc.identifier.uri.none.fl_str_mv https://repositorio.escuelaing.edu.co/handle/001/3156
dc.identifier.eissn.spa.fl_str_mv 2352-9148
dc.identifier.instname.spa.fl_str_mv Universidad Escuela Colombiana de Ingeniería Julio Garavito
dc.identifier.reponame.spa.fl_str_mv Repositorio Digital
dc.identifier.repourl.spa.fl_str_mv https://repositorio.escuelaing.edu.co/
dc.language.iso.spa.fl_str_mv eng
language eng
dc.relation.citationedition.spa.fl_str_mv Vol. 32, 2022
dc.relation.citationendpage.spa.fl_str_mv 17
dc.relation.citationstartpage.spa.fl_str_mv 1
dc.relation.citationvolume.spa.fl_str_mv 32
dc.relation.ispartofjournal.eng.fl_str_mv Informatics in Medicine Unlocked
dc.relation.references.spa.fl_str_mv Abouelhoda M, Issa SA, Ghanem M. Tavaxy: integrating taverna and galaxy workflows with cloud computing support. BMC Bioinfo 2012;13:77. https://doi. org/10.1186/1471-2105-13-77
Abu-Doleh A, Catalyurek UV. Spaler: spark and GraphX based de novo genome assembler. In: 2015 IEEE international conference on big data (big data). IEEE; 2015. https://doi.org/10.1109/bigdata.2015.7363853
Abuín JM, Pichel JC, Pena TF, Amigo J. SparkBWA: speeding up the alignment of high-throughput DNA sequencing data. PLOS ONE 2016;11:e0155461. https:// doi.org/10.1371/journal.pone.0155461
Al-Zoubi K, Wainer G. Modelling fog amp; cloud collaboration methods on large scale. In: 2020 winter simulation conference. WSC); 2020. p. 2161–72. https:// doi.org/10.1109/WSC48552.2020.9384058
Almeida JS, Grüneberg A, Maass W, Vinga S. Fractal MapReduce decomposition of sequence alignment. Algorithm Mol Biol 2012;7. https://doi.org/10.1186/ 1748-7188-7-12
ANR. IntraCranial ANeurysms: from familial forms to pathophysiological mechanisms – I-CAN. 2019. http://www.agence-nationale-recherche.fr/Project- ANR-15-CE17-0008. [Accessed 10 October 2019]
Atkinson M, Gesing S, Montagnat J, Taylor I. Scientific workflows: past, present and future. 2017. https://doi.org/10.1016/j.future.2017.05.041
Barillot C, Bannier E, Commowick O, Corouge I, Baire A, Fakhfakh I, Guillaumont J, Yao Y, Kain M. Shanoir: applying the software as a service distribution model to manage brain imaging research repositories. Front ICT 2016;3:25. URL: https://www.frontiersin.org/article/10.3389/fict.2016.00025
Barseghian D, Altintas I, et al. Workflows and extensions to the kepler scientific workflow system to support environmental sensor data access and analysis. Ecol Inf 2010;5:42–50. https://doi.org/10.1016/j.ecoinf.2009.08.008
Bez M, Fornari G, Vardanega T. The scalability challenge of ethereum: an initial quantitative analysis. In: 2019 IEEE international conference on service-oriented system engineering (SOSE). IEEE; 2019. https://doi.org/10.1109/ sose.2019.00031
Bondiombouy C, Valduriez P. Query processing in multistore systems: an overview. Int J Cloud Comput 2016;5:309–46
zahra Boujdad F, Sudholt M. Constructive privacy for shared genetic data. In: Proceedings of the 8th international conference on cloud computing and services science. SCITEPRESS - Science and Technology Publications; 2018. https://doi. org/10.5220/0006765804890496
Boujdad FZ, Gaignard A, et al. On distributed collaboration for biomedical analyses. In: 2019 19th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID), IEEE; 2019. https://doi.org/10.1109/ ccgrid.2019.00079
Boujdad FZ, Niyitegeka D, Bellafqira R, Gouenou C, Emmanuelle G, Südholt M. A hybrid cloud deployment architecture for privacy-preserving collaborative genome-wide association studies. In: ICDF2C 2021 - 12th EAI international conference on digital forensics & cyber crime; 2021
Bourcier R, Chatel S, et al. Understanding the pathophysiology of intracranial aneurysm: the ICAN project. Neurosurgery 2017;80:621–6. https://doi.org/ 10.1093/neuros/nyw135
Bux M, Brandt J, Witt C, Dowling J, Leser U. Hi-way: execution of scientific workflows on hadoop yarn. In: 20th international conference on extending database technology, EDBT 2017, 21 march 2017 through 24 march 2017, Open Proceedings. Org; 2017. p. 668–79. https://doi.org/10.5441/002/edbt.2017.87
Bux M, Leser U. Parallelization in scientific workflow management systems. 2013. arXiv preprint arXiv:1303.7195
Canali C, Lancellotti R, Mione S. Collaboration strategies for fog computing under heterogeneous network-bound scenarios. In: 2020 IEEE 19th international symposium on network computing and applications. NCA); 2020. p. 1–8. https:// doi.org/10.1109/NCA51143.2020.9306730
Cano I, Weimer M, Mahajan D, Curino C, Fumarola GM. Towards geo-distributed machine learning. 2016. arXiv preprint arXiv:1603.09035
de Castro MR, dos Santos Tostes C, et al. SparkBLAST: scalable BLAST processing using in-memory operations. BMC Bioinf 2017;18. https://doi.org/10.1186/ s12859-017-1723-8
Cattaneo G, Giancarlo R, et al. MapReduce in computational biology - a synopsis. 10.1007%2F978-3-319-57711-1_5. In: Advances in artificial life, evolutionary computation, and systems chemistry. Springer International Publishing; 2017. p. 53–64. URL
Cattaneo G, Petrillo UF, Giancarlo R, Roscigno G. An effective extension of the applicability of alignment-free biological sequence comparison algorithms with hadoop. J Supercomput 2016;73:1467–83. https://doi.org/10.1007/s11227-016- 1835-3
Chang YJ, Chen CC, Chen CL, Ho JM. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework. In: BMC genomics, BioMed central; 2012. S28. https://doi.org/ 10.1186/1471-2164-13-S7-S28
Chen Z, Hu J, Min G, Chen X. Effective data placement for scientific workflows in mobile edge computing using genetic particle swarm optimization. Concurrency Comput: Pract Ex 2019;e5413doi. https://doi.org/10.1002/cpe.5413
Chervenak A, Deelman E, Foster I, Guy L, Hoschek W, Iamnitchi A, Kesselman C, Kunszt P, Ripeanu M, Schwartzkopf B, Stockinger H, Stockinger K, Tierney B. Giggle: a framework for constructing scalable replica location services. In: ACM/ IEEE SC 2002 conference (SC’02), IEEE; 2002. https://doi.org/10.1109/ sc.2002.10024
Claerhout B, DeMoor G. Privacy protection for clinical and genomic data: the use of privacy-enhancing techniques in medicine. Int J Med Inf 2005;74:257–65.
Cohen-Boulakia S, Belhajjame K, et al. Scientific workflows for computational reproducibility in the life sciences: status, challenges and opportunities. Future Generat Comput Syst 2017;75:284–98. https://doi.org/10.1016/j. future.2017.01.012
Colosimo ME, Peterson MW, Mardis S, Hirschman L. Nephele: genotyping via complete composition vectors and MapReduce. Source Code Biol Med 2011;6. https://doi.org/10.1186/1751-0473-6-13
Commission, E., Council. Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data. http://data.europa.eu/eli/reg/2016/679/2016-05-04; 2016
Congress of Colombia. Colombian data protection law. URL: https://www.fun cionpublica.gov.co/eva/gestornormativo/norma.php?i=49981. [Accessed 16 September 2021]
Consortium DS, Consortium DM, Mahajan A, et al. Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nature genetics 2014;46:234. https://doi.org/10.1038/ng.2897
Cook CE, Lopez R, et al. The european bioinformatics institute in 2018: tools, infrastructure and training. Nucleic Acids Res 2018;47:D15–22. https://doi.org/ 10.1093/nar/gky1124
Cope JM, Trebon N, Tufo HM, Beckman P. Robust data placement in urgent computing environments. In: 2009 IEEE international symposium on parallel & distributed processing. IEEE; 2009. p. 1–13. https://doi.org/10.1109/ IPDPS.2009.5160914
Corpas M, Kovalevskaya NV, McMurray A, Nielsen FG. A fair guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol 2018; 14:e1005873. https://doi.org/10.1371/journal.pcbi.1005873
De Moor G, Claerhout B, De Meyer F. Privacy enhancing techniques. Method Inf Med 2003;42:148–53
De Roure D, Belhajjam K, Missier P, G´ omez-P´ erez JM, Palma R, Ruiz JE, Hettne K, Roos M, Klyne G, Goble C. Towards the preservation of scientific workflows. In: iPRES 2011-8th international conference on preservation of digital objects. National Library Board Singapore and Nanyang Technology University; 2011. p. 228–31
De Wit P, Pespeni MH, et al. The simple fool’s guide to population genomics via rna-seq: an introduction to high-throughput sequencing data analysis. Mol Eco Res 2012;12:1058–67. https://doi.org/10.1111/1755-0998.12003
Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Halvade: scalable sequence analysis with MapReduce. Bioinformatics 2015;31:2482–8. https://doi.org/ 10.1093/bioinformatics/btv179
Deelman E, Gannon D, et al. Workflows and e-science: an overview of workflow system features and capabilities. Future Generat Comput Syst 2009;25:528–40. https://doi.org/10.1016/j.future.2008.06.012
Deelman E, Vahi K, et al. Pegasus, a workflow management system for science automation. Future Generat Comput Syst 2015;46:17–35. https://doi.org/ 10.1016/j.future.2014.10.008
Dolev S, Florissi P, et al. A survey on geographically distributed big-data processing using MapReduce. IEEE Transact Big Data 2019;5:60–80. https://doi. org/10.1109/tbdata.2017.2723473
Dong G, Fu X, Li H, Pan X. An accurate sequence assembly algorithm for livestock, plants and microorganism based on spark. Int J Pattern Recognit Artif Intell 2017; 31:1750024. https://doi.org/10.1142/s0218001417500240
Ebrahimi M, Mohan A, Kashlev A, Lu S. Bdap: a big data placement strategy for cloud-based scientific workflows. In: 2015 IEEE first international conference on big data computing service and applications. IEEE; 2015. p. 105–14. https://doi. org/10.1109/BigDataService.2015.70
Elmroth E, Hern´andez F, Tordsson J. Three fundamental dimensions of scientific workflow interoperability: model of computation, language, and execution environment. Future Generat Comput Syst 2010;26:245–56
Fakas GJ, Karakostas B. A peer to peer (P2P) architecture for dynamic workflow management. Inf Software Technol 2004;46:423–31
Fan J, Han F, Liu H. Challenges of big data analysis. Nat Sci Rev 2014;1:293–314. https://doi.org/10.1093/nsr/nwt032
Federer LM, Lu YL, et al. Biomedical data sharing and reuse: attitudes and practices of clinical and scientific research staff. PLOS ONE 2015;10:e0129506. https://doi.org/10.1371/journal.pone.0129506
Freire J, Bonnet P, Shasha D. Computational reproducibility: state-of-the-art, challenges, and database research opportunities. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data; 2012. p. 593–6
Frye SV, Arkin MR, et al. Tackling reproducibility in academic preclinical drug discovery. Nat Rev Drug Discovery 2015;14:733–4. https://doi.org/10.1038/ nrd4737
Gil Y, Ratnakar V, et al. Wings: intelligent workflow-based design of computational experiments. IEEE Intell Syst 2011;26:62–72. https://doi.org/ 10.1109/mis.2010.9
Gilbert S, Lynch N. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 2002;33:51–9. https://doi.org/ 10.1145/564585.564601
Goecks J, Nekrutenko A, Taylor J, Team TG. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86. https://doi.org/10.1186/gb-2010- 11-8-r86.
Goodman SN, Fanelli D, Ioannidis JPA. What does research reproducibility mean? Sci Translat Med 2016;8. https://doi.org/10.1126/scitranslmed.aaf5027. 341ps12–341ps12
Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next- generation sequencing technologies. Nature Rev Genet 2016;17:333
Guo R, Zhao Y, Zou Q, et al. Bioinformatics applications on Apache spark. GigaScience 2018. https://doi.org/10.1093/gigascience/giy098
of Health NI, et al. Guidance: rigor and reproducibility in grant applications. 2017
Huang H, Tata S, Prill RJ. BlueSNP: R package for highly scalable genome-wide association studies using hadoop clusters. Bioinformatics 2012;29:135–6. https:// doi.org/10.1093/bioinformatics/bts647
Huang L, Krüger J, Sczyrba A. Analyzing large scale genomic data on the cloud with sparkhit. Bioinformatics 2017;34:1457–65. https://doi.org/10.1093/ bioinformatics/btx808
Huang Y, Gottardo R. Comparability and reproducibility of biomedical data. Briefings Bioinfo 2012;14:391–401. https://doi.org/10.1093/bib/bbs078
Hung CL, Lin YL, Hua GJ, Hu YC. CloudTSS: a TagSNP selection approach on cloud computing. In: Communications in computer and information science. Springer Berlin Heidelberg; 2011. p. 525–34. https://doi.org/10.1007/978-3- 642-27180-9_64
Hutson S. Data handling errors spur debate over clinical trial. 618–618 Nature Med 2010;16. https://doi.org/10.1038/nm0610-618a
Karim MR, Michel A, et al. Improving data workflow systems with cloud services and use of open data for bioinformatics research. Briefings Bioinfo 2017;19: 1035–50. https://doi.org/10.1093/bib/bbx039
Khan A, Kim T, Byun H, Kim Y. Scispace: a scientific collaboration workspace for geo-distributed hpc data centers. Future Generat Comput Syst 2019;101:398–409.
Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: a review of best practices and their practical application in cwlprov. GigaScience 2019;8:giz095
Kim D, Vouk MA. Assessing run-time overhead of securing kepler. Procedia Comput Sci 2016;80:2281–6. https://doi.org/10.1016/j.procs.2016.05.412
Kim JH. Genome data analysis. Springer Singapore; 2019. URL: https://www.sp ringer.com/gp/book/9789811319419
Koster J, Rahmann S. Snakemake–a scalable bioinformatics workflow engine. Bioinfo 2012;28:2520–2. https://doi.org/10.1093/bioinformatics/bts480
Kuhn K, et al. The cancer biomedical informatics grid (cabig): infrastructure and applications for a worldwide research community. Medinfo 2007;1:330
Langmead B, Hansen KD, Leek JT. Cloud-scale RNA-sequencing differential expression analysis with myrna. Genome Biol 2010;11:R83. https://doi.org/ 10.1186/gb-2010-11-8-r83
Langmead B, Schatz MC, et al. Searching for SNPs with cloud computing. Genome Biol 2009;10:R134. https://doi.org/10.1186/gb-2009-10-11-r134
Legislature CS. The California consumer privacy act of. 2018. https://leginfo.legi slature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180SB1121
Leo S, Santoni F, Zanetti G. Biodoop: bioinformatics on hadoop. In: 2009 international conference on parallel processing workshops. IEEE; 2009. https:// doi.org/10.1109/icppw.2009.37
Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J. SNP detection for massively parallel whole-genome resequencing. Genome Res 2009;19:1124–32. https://doi.org/10.1101/gr.088013.108
Li X, Zhang L, et al. A novel workflow-level data placement strategy for data- sharing scientific cloud workflows. IEEE Transact Serv Comput 2016. https://doi. org/10.1109/TSC.2016.2625247
Liu J, Pacitti E, Valduriez P, Mattoso M. Parallelization of scientific workflows in the cloud. 2014
Liu J, Pacitti E, Valduriez P, Mattoso M. A survey of data-intensive scientific workflow management. J Grid Comput 2015;13:457–93. https://doi.org/ 10.1007/s10723-015-9329-8
Liu J, Pacitti E, Valduriez P, Mattoso M. Scientific workflow scheduling with provenance data in a multisite cloud. In: Transactions on large-scale data-and knowledge-centered systems XXXIII. Springer; 2017. p. 80–112
Liu J, Pineda L, Pacitti E, Costan A, Valduriez P, Antoniu G, Mattoso M. Efficient scheduling of scientific workflows using hot metadata in a multisite cloud. IEEE Transact Knowl Data Eng 2019;31:1940–53. https://doi.org/10.1109/ tkde.2018.2867857
Liu X, Datta A. Towards intelligent data placement for scientific workflows in collaborative cloud environment. In: 2011 IEEE international symposium on parallel and distributed processing workshops and phd forum. IEEE; 2011. p. 1052–61. https://doi.org/10.1109/IPDPS.2011.259
Liu Y, Zhang L, Ge N, Li G. A systematic literature review on federated learning: from a model quality perspective. 2020. arXiv preprint arXiv:2012.01973
Lu S, Zhang J. Collaborative scientific workflows supporting collaborative science. Int J Bus Process Integrat Manag 2011;5:185. https://doi.org/10.1504/ ijbpim.2011.040209
Lu YY, Tang K, et al. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic acids research 2017;45:W554–9. https://doi.org/10.1093/nar/gkx351
Malin BA, Emam KE, O’Keefe CM. Biomedical data privacy: problems, perspectives, and recent advances. 2013
McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303. https://doi.org/10.1101/gr.107524.110
McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA. Communication- efficient learning of deep networks from decentralized data. In: Singh A, Zhu J, editors. Proceedings of the 20th international conference on artificial intelligence and statistics. Fort Lauderdale, FL, USA: PMLR; 2017. p. 1273–82. URL: http://pr oceedings.mlr.press/v54/mcmahan17a.html
Moreau L, Missier P, Cheney J, Soiland-Reyes S. Prov-n: the provenance notation. 2013
Nagappan M, Vouk MA. A model for sharing of confidential provenance information in a query based system. In: International provenance and annotation workshop. Springer; 2008. p. 62–9. https://doi.org/10.1007/978-3-540-89965-5_ 8
Nguyen T, Shi W, Ruden D. CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping. BMC Res Notes 2011;4. https://doi.org/ 10.1186/1756-0500-4-171
NHGRI-EBI. GWAS catalog. 2019. https://www.ebi.ac.uk/gwas/. accessed 20- Sept-2019
NIH-BMIC. NIH data sharing repositories. 2019. https://www.nlm.nih.gov/NIH bmic/nih_data_sharing_repositories.html. accessed 20-Sept-2019
Nordberg H, Bhatia K, Wang K, Wang Z. BioPig: a hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 2013;29:3014–9. https://doi.org/ 10.1093/bioinformatics/btt528
NSF, 2019. Chapter XI - Other Post Award Requirements and Consideration. https://www.nsf.gov/pubs/policydocs/pappg19_1/pappg_11.jsp\#XID4. [Online; accessed 20-June-2019]
O’Brien AR, Saunders NFW, et al. VariantSpark: population scale clustering of genotype information. BMC Genom 2015;16. https://doi.org/10.1186/s12864- 015-2269-7
Pandey RV, Schl¨otterer C. DistMap: a toolkit for distributed short read mapping on a hadoop cluster. PLoS ONE 2013;8:e72614. https://doi.org/10.1371/journal. pone.0072614
Papageorgiou L, Eleni P, et al. Genomic big data hitting the storage bottleneck. EMBnetjournal 2018;24:e910. https://doi.org/10.14806/ej.24.0.910
Parks R, Chu CH, Xu H. Healthcare information privacy research: iusses, gaps and what next? AMCIS; 2011
Peteiro-Barral D, Guijarro-Berdi˜ nas B. A survey of methods for distributed machine learning. Prog Artif Intell 2013;2:1–11
Pineda-Morales L, Costan A, Antoniu G. Towards multi-site metadata management for geographically distributed cloud workflows. In: 2015 IEEE international conference on cluster computing. IEEE; 2015. p. 294–303. https:// doi.org/10.1109/cluster.2015.49
Pineda-Morales L, Liu J, Costan A, Pacitti E, Antoniu G, Valduriez P, Mattoso M. Managing hot metadata for scientific workflows on multisite clouds. In: 2016 IEEE international conference on big data (big data). IEEE; 2016. p. 390–7
Pireddu L, Leo S, Zanetti G. SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 2011;27:2159–60. https://doi.org/10.1093/ bioinformatics/btr325
Rasheed Z, Rangwala H. A map-reduce framework for clustering metagenomes. In: 2013 IEEE international symposium on parallel & distributed processing, workshops and phd forum. IEEE; 2013. https://doi.org/10.1109/ ipdpsw.2013.100
Rasheed Z, Rangwala H. A map-reduce framework for clustering metagenomes. In: 2013 IEEE international symposium on parallel & distributed processing, workshops and phd forum. IEEE; 2013. https://doi.org/10.1109/ ipdpsw.2013.100
Rodriguez MA, Buyya R. Scientific workflow management system for clouds. In: Software architecture for big data and the cloud. Elsevier; 2017. p. 367–87. https://doi.org/10.1016/b978-0-12-805467-3.00018-1
Ross RB, Thakur R, et al. Pvfs: a parallel file system for linux clusters. In: Proceedings of the 4th annual Linux showcase and conference; 2000. p. 391–430
Ross RB, Thakur R, et al. Pvfs: a parallel file system for linux clusters. In: Proceedings of the 4th annual Linux showcase and conference; 2000. p. 391–430
Salloum S, Dautov R, et al. Big data analytics on Apache spark. Int J Data Sci Anal 2016;1:145–64. https://doi.org/10.1007/s41060-016-0027-9
Santana-Perez I, P´ erez-Hern´ andez MS. Towards reproducibility in scientific workflows: an infrastructure-based approach. Scientific Program 2015:1–11. https://doi.org/10.1155/2015/243180
Schadt EE, Linderman MD, et al. Computational solutions to large-scale data management and analysis. Nature Rev Genet 2010;11:647–57. https://doi.org/ 10.1038/nrg2857
Schatz MC. BlastReduce: high performance short read mapping with MapReduce. University of Maryland; 2008. http://cgis.cs.umd.edu/Grad/scholarlypapers/pa pers/MichaelSchatz.pdf
Schatz MC. BlastReduce: high performance short read mapping with MapReduce. University of Maryland; 2008. http://cgis.cs.umd.edu/Grad/scholarlypapers/pa pers/MichaelSchatz.pdf
Schatz MC, Sommer D, Kelley D, Pop M. De novo assembly of large genomes using cloud computing. In: Proceedings of the cold spring harbor biology of genomes conference; 2010
Schatz MC, Sommer D, Kelley D, Pop M. De novo assembly of large genomes using cloud computing. In: Proceedings of the cold spring harbor biology of genomes conference; 2010
Senturk IF, Balakrishnan P, et al. A resource provisioning framework for bioinformatics applications in multi-cloud environments. Future Generat Comput Syst 2018;78:379–91. https://doi.org/10.1016/j.future.2016.06.008
Sharov AA, Schlessinger D, Ko MSH. ExAtlas: an interactive online tool for meta- analysis of gene expression data. J Bioinfo Comput Biol 2015;13:1550019. https://doi.org/10.1142/s0219720015500195
Soiland-Reyes S, Alper P, Goble C. Tracking workflow execution with tavernaprov. In: PROV: three tears later: Provenance Week 2016; 2016
Stephens ZD, Lee SY, et al. Big data: astronomical or genomical? PLOS Biology 2015;13:e1002195. https://doi.org/10.1371/journal.pbio.1002195
Tannenbaum T, Wright D, Miller K, Livny M. Condor: a distributed job scheduler. In: Beowulf cluster computing with windows; 2001. p. 307–50
Taylor I, Shields M, Wang I, Harrison A. The triana workflow environment: architecture and applications. In: Workflows for e-Science. Springer; 2007. p. 320–39. https://doi.org/10.1007/978-1-84628-757-2_20
Taylor IJ, Deelman E, et al. Workflows for e-Science: scientific workflows for grids, ume 1. Springer; 2007. https://doi.org/10.1007/978-1-84628-757-2
Thain D, Tannenbaum T, Livny M. Distributed computing in practice: the condor experience. Concurr Comput: Pract Exp 2005;17:323–56. https://doi.org/ 10.1002/cpe.938
Tommaso PD, Chatzou M, et al. Nextflow enables reproducible computational workflows. Nature Biotechnol 2017;35:316–9. https://doi.org/10.1038/ nbt.3820
Turakhia MP, Desai M, Hedlin H, Rajmane A, Talati N, Ferris T, Desai S, Nag D, Patel M, Kowey P, Rumsfeld JS, Russo AM, Hills MT, Granger CB, Mahaffey KW, Perez MV. Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: the apple heart study. Am Heart J 2019; 207:66–75. https://doi.org/10.1016/j.ahj.2018.09.002. https://www.sciencedi rect.com/science/article/pii/S0002870318302710.
Union I. Communication from the commission to the european parliament, the council, the european economic and social committee and the committee of the regions. A new skills agenda for europe. 2014 [Brussels].
Valduriez P, Mattoso M, Akbarinia R, Borges H, Camata J, Coutinho A, Gaspar D, Lemus N, Liu J, Lustosa H, et al. Scientific data analysis using data-intensive scalable computing: the scidisc project. In: LADaS: Latin America data science workshop, CEUR-WS. Org; 2018
Van Hung T, Chuanhe H. An effective data placement strategy in main-memory database cluster. In: 2011 second international conference on networking and distributed computing. IEEE; 2011. p. 93–8. https://doi.org/10.1109/ ICNDC.2011.27.
Verbraeken J, Wolting M, Katzy J, Kloppenburg J, Verbelen T, Rellermeyer JS. A survey on distributed machine learning. ACM Comput Surv (CSUR) 2020;53: 1–33
Wang J, Crawl D, Altintas I. Kepler + hadoop. In: Proceedings of the 4th workshop on workflows in support of large-scale science - WORKS ’09. ACM Press; 2009. https://doi.org/10.1145/1645164.1645176
Wang R, Li M, Peng L, Hu Y, Hassan MM, Alelaiwi A. Cognitive multi-agent empowering mobile edge computing for resource caching and collaboration. Future Generat Comput Syst 2020;102:66–74. https://doi.org/10.1016/j. future.2019.08.001. URL: https://www.sciencedirect.com/science/article/pii/ S0167739X19318783
Wang Y. Automating experimentation with distributed systems using generative techniques. Ph.D. thesis. University of Colorado at Boulder; 2006
Wang Y, Carzaniga A, Wolf AL. Four enhancements to automated distributed system experimentation methods. In: Proceedings of the 30th international conference on Software engineering; 2008. p. 491–500
Wiewi´ orka MS, Messina A, et al. SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2014;30:2652–3. https://doi.org/10.1093/bioinformatics/btu343
Wilde M, Hategan M, et al. Swift: a language for distributed parallel scripting. Parallel Comput 2011;37:633–52. https://doi.org/10.1016/j.parco.2011.05.005.
Wolstencroft K, Haines R, et al. The taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res 2013;41:W557–61. https://doi.org/10.1093/nar/gkt328
Xiao Y, Zhou AC, Yang X, He B. Privacy-preserving workflow scheduling in geo-distributed data centers. Future Generat Comput Syst 2022;130:46–58
Xie J, Yin S, et al. Improving mapreduce performance through data placement in heterogeneous hadoop clusters. In: 2010 IEEE international symposium on parallel & distributed processing, workshops and phd forum (IPDPSW). IEEE; 2010. p. 1–9. https://doi.org/10.1109/IPDPSW.2010.547088
Xie T. Sea: a striping-based energy-aware strategy for data placement in raid-structured storage systems. IEEE Transact Comput 2008;57:748–61. https://doi.org/10.1109/TC.2008.27
Xing EP, Ho Q, Dai W, Kim JK, Wei J, Lee S, Zheng X, Xie P, Kumar A, Yu Y. Petuum: a new platform for distributed machine learning on big data. IEEE Transact Big Data 2015;1:49–67. https://doi.org/10.1109/tbdata.2015.2472014
Xu B, Gao J, Li C. An efficient algorithm for DNA fragment assembly in MapReduce. Biochem Biophys Res Commun 2012;426:395–8. https://doi.org/10.1016/j.bbrc.2012.08.101
Xu B, Li C, Zhuang H, et al. DSA: scalable distributed sequence alignment system using SIMD instructions. In: 2017 17th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID), IEEE; 2017. https://doi.org/10.1109/ccgrid.2017.74
Xu B, Li C, Zhuang H, et al. Efficient distributed smith-waterman algorithm based on Apache spark. In: 2017 IEEE 10th international conference on cloud computing (CLOUD). IEEE; 2017. https://doi.org/10.1109/cloud.2017.83
Yu HF, Hsieh CJ, Chang KW, Lin CJ. Large linear classification when data cannot fit in memory. In: ACM Transactions on Knowledge Discovery from Data (TKDD); 2012. p. 1–23. 5
Yu J, Buyya R. A taxonomy of workflow management systems for grid computing. J Grid Comput 2005;3:171–200. https://doi.org/10.1007/s10723-005-9010-8.
Yuan D, Yang Y, Liu X, Chen J. A data placement strategy in scientific cloud workflows. Future Generat Comput Syst 2010;26:1200–14. https://doi.org/10.1016/j.future.2010.02.004
Zhang D, Zhao L, Li B, et al. SEQSpark: a complete analysis tool for large-scale rare variant association studies using whole-genome and exome sequence data. The American J Human Genet 2017;101:115–22. https://doi.org/10.1016/j.ajhg.2017.05.017
Zhang L, Gu S, Liu Y, Wang B, Azuaje F. Gene set analysis in the cloud. Bioinformatics 2011;28:294–5. https://doi.org/10.1093/bioinformatics/btr630
Zhao G, Ling C, Sun D. SparkSW: scalable distributed computing system for large-scale biological sequence alignment. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, IEEE; 2015. https://doi.org/10.1109/ccgrid.2015.55
Zhao J, Gomez-Perez JM, Belhajjame K, Klyne G, Garcia-Cuesta E, Garrido A, Hettne K, Roos M, De Roure D, Goble C. Why workflows break—understanding and combating decay in taverna workflows. In: 2012 ieee 8th international conference on e-science. IEEE; 2012. p. 1–9
Zhao Q, Xiong, et al. A new energy-aware task scheduling method for data-intensive applications in the cloud. J Network Comput Appl 2016;59:14–27. https://doi.org/10.1016/j.jnca.2015.05.001
Zhao Y, Li Y, Raicu I, Lu S, Tian W, Liu H. Enabling scalable scientific workflow management in the cloud. Future Generat Comput Syst 2015;46:3–16. https://doi.org/10.1016/j.future.2014.10.023.
Zhou W, Li R, Yuan S, et al. MetaSpark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes. Bioinformatics 2017. https://doi.org/10.1093/bioinformatics/btw750. btw750
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017;18. https://doi.org/10.1186/s13059-017-1319-7
Zytnicki M, Quesneville H. S-MART, a software toolbox to aid RNA-seq data analysis. PLoS ONE 2011;6:e25988. https://doi.org/10.1371/journal.pone.0025988.
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_abf2
dc.rights.accessrights.spa.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
rights_invalid_str_mv http://purl.org/coar/access_right/c_abf2
dc.format.extent.spa.fl_str_mv 17 páginas
dc.format.mimetype.spa.fl_str_mv application/pdf
dc.publisher.spa.fl_str_mv Elsevier Ltd
dc.publisher.place.spa.fl_str_mv Bogotá (Colombia)
dc.source.spa.fl_str_mv www.elsevier.com/locate/imu
institution Escuela Colombiana de Ingeniería Julio Garavito
bitstream.url.fl_str_mv https://repositorio.escuelaing.edu.co/bitstream/001/3156/4/A%20taxonomy%20of%20tools%20and%20approaches%20for%20distributed%20genomic%20analyses.pdf.txt
https://repositorio.escuelaing.edu.co/bitstream/001/3156/3/Portada%20A%20toxonomy%20of%20tools%20and%20aproaches%20for%20distributed%20genomic%20analyses.PNG
https://repositorio.escuelaing.edu.co/bitstream/001/3156/5/A%20taxonomy%20of%20tools%20and%20approaches%20for%20distributed%20genomic%20analyses.pdf.jpg
https://repositorio.escuelaing.edu.co/bitstream/001/3156/2/license.txt
https://repositorio.escuelaing.edu.co/bitstream/001/3156/1/A%20taxonomy%20of%20tools%20and%20approaches%20for%20distributed%20genomic%20analyses.pdf
bitstream.checksum.fl_str_mv eb26b4b320214d3ec325cd2e042520db
2c10021a4b8daf155722e9f0326b3472
adfb70997a6f9d9860a919c715d14321
5a7ca94c2e5326ee169f979d71d0f06e
352bd361e8f413892b8b09b9617362eb
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositorio Escuela Colombiana de Ingeniería Julio Garavito
repository.mail.fl_str_mv repositorio.eci@escuelaing.edu.co
_version_ 1814355622643105792
spelling Garzón, Wilmer; Benavides, Luis Alberto; Gignard, Alban; Südholt, Mario. CTG - Informática. 2024-07-11T16:51:03Z. 2022. https://repositorio.escuelaing.edu.co/handle/001/3156. ISSN 2352-9148. Universidad Escuela Colombiana de Ingeniería Julio Garavito, Repositorio Digital, https://repositorio.escuelaing.edu.co/
The amount of biomedical data collected and stored has grown significantly. Individuals and single organizations can no longer analyze such volumes of data on their own, so the scientific community is building global collaborative efforts to analyze them. However, biomedical data is subject to several legal and socio-economic restrictions that hinder research collaboration. In this paper, we argue that researchers require new tools and techniques to address the restrictions and needs of global scientific collaborations over geo-distributed biomedical data. These tools and techniques must support what we call Fully Distributed Collaborations (FDC): research endeavors that exploit and analyze massive biomedical information collaboratively while respecting legal and socio-economic restrictions. This paper first motivates and discusses the requirements of FDCs in the context of a research collaboration on the development of diagnostic and predictive tools for the risk of intracranial aneurysm formation and rupture (the ICAN project). The paper then presents a taxonomy classifying the current tools and techniques for biomedical analysis with respect to the proposed requirements. The taxonomy considers three key architectural features to support FDC scenarios: data and computation placement, privacy and security, and performance and scalability. The review reveals new research opportunities to design tools and techniques for multi-site analyses that encourage scientific collaborations while mitigating technical and legal constraints.
17 pages; application/pdf; eng; Elsevier Ltd; Bogotá (Colombia); www.elsevier.com/locate/imu; A taxonomy of tools and approaches for distributed genomic analyses; Journal article; info:eu-repo/semantics/publishedVersion; http://purl.org/coar/resource_type/c_6501; http://purl.org/coar/resource_type/c_2df8fbb1; Text; info:eu-repo/semantics/article; http://purl.org/coar/version/c_970fb48d4fbd8a85; Vol. 32, 2022; Informatics in Medicine Unlocked
5Yu J, Buyya R. A taxonomy of workflow management systems for grid computing. J Grid Comput 2005;3:171–200. https://doi.org/10.1007/s10723-005-9010-8.Yuan D, Yang Y, Liu X, Chen J. A data placement strategy in scientific cloud workflows. Future Generat Comput Syst 2010;26:1200–14. https://doi.org/ 10.1016/j.future.2010.02.004Zhang D, Zhao L, Li B, et al. SEQSpark: a complete analysis tool for large-scale rare variant association studies using whole-genome and exome sequence data. The American J Human Genet 2017;101:115–22. https://doi.org/10.1016/j. ajhg.2017.05.017Zhang L, Gu S, Liu Y, Wang B, Azuaje F. Gene set analysis in the cloud. Bioinformatics 2011;28:294–5. https://doi.org/10.1093/bioinformatics/btr630Zhao G, Ling C, Sun D. SparkSW: scalable distributed computing system for large- scale biological sequence alignment. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing, IEEE; 2015. https://doi.org/ 10.1109/ccgrid.2015.55Zhao J, Gomez-Perez JM, Belhajjame K, Klyne G, Garcia-Cuesta E, Garrido A, Hettne K, Roos M, De Roure D, Goble C. Why workflows break—understanding and combating decay in taverna workflows. In: 2012 ieee 8th international conference on e-science. IEEE; 2012. p. 1–9Zhao Q, Xiong, et al. A new energy-aware task scheduling method for data- intensive applications in the cloud. J Network Comput Appl 2016;59:14–27. https://doi.org/10.1016/j.jnca.2015.05.001Zhao Y, Li Y, Raicu I, Lu S, Tian W, Liu H. Enabling scalable scientific workflow management in the cloud. Future Generat Comput Syst 2015;46:3–16. https:// doi.org/10.1016/j.future.2014.10.023.Zhou W, Li R, Yuan S, et al. MetaSpark: a spark-based distributed processing tool to recruit metagenomic reads to reference genomes. Bioinformatics 2017. https:// doi.org/10.1093/bioinformatics/btw750. btw750Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017;18. https://doi. 
org/10.1186/s13059-017-1319-7Zytnicki M, Quesneville H. S-MART, a software toolbox to aid RNA-seq data analysis. PLoS ONE 2011;6:e25988. https://doi.org/10.1371/journal. pone.0025988.info:eu-repo/semantics/openAccesshttp://purl.org/coar/access_right/c_abf2BiometríaBiometryAnálisis de la informaciónInformation analysisInvestigación biomédicaBiomedical researchTecnología médicaMedical technologyDistributed biomedical analysesFully distributed collaborationsReproducibilityScalability Multi-site analysesDistributed workflow analysesAnálisis biomédicos distribuidosColaboraciones totalmente distribuidasReproducibilidadAnálisis de escalabilidad multisitioAnálisis de flujo de trabajo distribuidoTEXTA taxonomy of tools and approaches for distributed genomic analyses.pdf.txtA taxonomy of tools and approaches for distributed genomic analyses.pdf.txtExtracted texttext/plain131146https://repositorio.escuelaing.edu.co/bitstream/001/3156/4/A%20taxonomy%20of%20tools%20and%20approaches%20for%20distributed%20genomic%20analyses.pdf.txteb26b4b320214d3ec325cd2e042520dbMD54open accessTHUMBNAILPortada A toxonomy of tools and aproaches for distributed genomic analyses.PNGPortada A toxonomy of tools and aproaches for distributed genomic analyses.PNGimage/png236507https://repositorio.escuelaing.edu.co/bitstream/001/3156/3/Portada%20A%20toxonomy%20of%20tools%20and%20aproaches%20for%20distributed%20genomic%20analyses.PNG2c10021a4b8daf155722e9f0326b3472MD53open accessA taxonomy of tools and approaches for distributed genomic analyses.pdf.jpgA taxonomy of tools and approaches for distributed genomic analyses.pdf.jpgGenerated Thumbnailimage/jpeg15578https://repositorio.escuelaing.edu.co/bitstream/001/3156/5/A%20taxonomy%20of%20tools%20and%20approaches%20for%20distributed%20genomic%20analyses.pdf.jpgadfb70997a6f9d9860a919c715d14321MD55open accessLICENSElicense.txtlicense.txttext/plain; 