Evaluating replication for parallel jobs: an efficient approach

Authors:
Qiu, Zhan
Pérez, Juan F.
Resource type:
Article
Publication date:
2015
Institution:
Universidad del Rosario
Repository:
Repositorio EdocUR - U. Rosario
Language:
eng
OAI Identifier:
oai:repository.urosario.edu.co:10336/27694
Online access:
https://doi.org/10.1109/TPDS.2015.2496593
https://repository.urosario.edu.co/handle/10336/27694
Keywords:
Time factors
Reliability
Correlation
Program processors
Computational modeling
Absorption
Servers
Rights:
Restricted (access limited to specific groups)
Abstract:
Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs.
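The difference between job-level and task-level replication described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the stochastic model from the paper: it assumes independent, exponentially distributed subtask times, no failures, no queueing, and no extra load caused by the replicas, and the function name `simulate` and its parameters are hypothetical. It only shows why taking the faster replica per subtask can never be slower than taking the faster of two whole-job replicas.

```python
import random


def simulate(num_tasks=10, mean_task_time=1.0, trials=10_000, seed=0):
    """Estimate mean job response time under three schemes.

    Assumptions (illustrative only, not taken from the paper):
    i.i.d. exponential subtask times, no failures, no queueing.
    """
    rng = random.Random(seed)
    rate = 1.0 / mean_task_time
    totals = {"none": 0.0, "job_level": 0.0, "task_level": 0.0}

    for _ in range(trials):
        # Subtask times for the original job (a) and its replica (b).
        a = [rng.expovariate(rate) for _ in range(num_tasks)]
        b = [rng.expovariate(rate) for _ in range(num_tasks)]

        # No replication: the job ends when its slowest subtask ends.
        totals["none"] += max(a)

        # Job-level replication: two complete jobs race, the first
        # one to finish all of its subtasks wins.
        totals["job_level"] += min(max(a), max(b))

        # Task-level replication: each subtask races against its own
        # replica, and the job ends when the slowest winner ends.
        totals["task_level"] += max(min(x, y) for x, y in zip(a, b))

    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    for name, mean in simulate().items():
        print(f"{name:>10}: mean response time {mean:.3f}")
```

Under these simplifying assumptions the ordering task-level <= job-level <= no replication holds on every sample, because the winner of each subtask race is never slower than either job's slowest subtask. The abstract notes that the real benefit also depends on system utilization, subtask reliability, and correlation among replica failures, none of which this sketch captures.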
Date created:
2015-10-30
Date deposited:
2020-08-19
ISSN:
1045-9219
EISSN:
1558-2183
Published in:
IEEE Transactions on Parallel and Distributed Systems, Vol. 27, No. 8 (1 Aug 2016), pp. 2288-2302
Publisher:
IEEE
Related URI:
https://ieeexplore.ieee.org/document/7313012
Format:
application/pdf