Evaluating replication for parallel jobs: an efficient approach
Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, req...
- Authors:
- Qiu, Zhan; Pérez, Juan F.
- Resource type:
- article
- Publication date:
- 2015
- Institution:
- Universidad del Rosario
- Repository:
- Repositorio EdocUR - U. Rosario
- Language:
- eng
- OAI Identifier:
- oai:repository.urosario.edu.co:10336/27694
- Online access:
- https://doi.org/10.1109/TPDS.2015.2496593
https://repository.urosario.edu.co/handle/10336/27694
- Keywords:
- Time factors
Reliability
Correlation
Program processors
Computational modeling
Absorption
Servers
- Rights
- License
- Restricted (access limited to specific groups)
id |
EDOCUR2_b2cc5ab0cf086a954358c8ac1d046ae6 |
---|---|
oai_identifier_str |
oai:repository.urosario.edu.co:10336/27694 |
network_acronym_str |
EDOCUR2 |
network_name_str |
Repositorio EdocUR - U. Rosario |
repository_id_str |
|
spelling |
Qiu, Zhan; Pérez, Juan F. Evaluating replication for parallel jobs: an efficient approach. IEEE Transactions on Parallel and Distributed Systems, Vol. 27, No. 8 (1 Aug 2016), pp. 2288-2302. ISSN: 1045-9219; EISSN: 1558-2183. https://doi.org/10.1109/TPDS.2015.2496593 |
dc.title.spa.fl_str_mv |
Evaluating replication for parallel jobs: an efficient approach |
dc.title.TranslatedTitle.spa.fl_str_mv |
Evaluación de la replicación para trabajos paralelos: un enfoque eficiente |
dc.subject.keyword.spa.fl_str_mv |
Time factors; Reliability; Correlation; Program processors; Computational modeling; Absorption; Servers |
description |
Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs. |
publishDate |
2015 |
dc.date.created.spa.fl_str_mv |
2015-10-30 |
dc.date.accessioned.none.fl_str_mv |
2020-08-19T14:43:23Z |
dc.date.available.none.fl_str_mv |
2020-08-19T14:43:23Z |
dc.type.eng.fl_str_mv |
article |
dc.type.coarversion.fl_str_mv |
http://purl.org/coar/version/c_970fb48d4fbd8a85 |
dc.type.coar.fl_str_mv |
http://purl.org/coar/resource_type/c_6501 |
dc.type.spa.spa.fl_str_mv |
Artículo |
dc.identifier.doi.none.fl_str_mv |
https://doi.org/10.1109/TPDS.2015.2496593 |
dc.identifier.issn.none.fl_str_mv |
ISSN: 1045-9219; EISSN: 1558-2183 |
dc.identifier.uri.none.fl_str_mv |
https://repository.urosario.edu.co/handle/10336/27694 |
dc.language.iso.spa.fl_str_mv |
eng |
dc.relation.citationEndPage.none.fl_str_mv |
2302 |
dc.relation.citationIssue.none.fl_str_mv |
No. 8 |
dc.relation.citationStartPage.none.fl_str_mv |
2288 |
dc.relation.citationTitle.none.fl_str_mv |
IEEE Transactions on Parallel and Distributed Systems |
dc.relation.citationVolume.none.fl_str_mv |
Vol. 27 |
dc.relation.ispartof.spa.fl_str_mv |
IEEE Transactions on Parallel and Distributed Systems, ISSN: 1045-9219;EISSN: 1558-2183, Vol.27, No.8 (1 Aug 2016); pp. 2288-2302 |
dc.relation.uri.spa.fl_str_mv |
https://ieeexplore.ieee.org/document/7313012 |
dc.rights.coar.fl_str_mv |
http://purl.org/coar/access_right/c_16ec |
dc.rights.acceso.spa.fl_str_mv |
Restricted (access limited to specific groups) |
rights_invalid_str_mv |
Restricted (access limited to specific groups) http://purl.org/coar/access_right/c_16ec |
dc.format.mimetype.none.fl_str_mv |
application/pdf |
dc.publisher.spa.fl_str_mv |
IEEE |
dc.source.spa.fl_str_mv |
IEEE Transactions on Parallel and Distributed Systems |
institution |
Universidad del Rosario |
dc.source.instname.none.fl_str_mv |
instname:Universidad del Rosario |
dc.source.reponame.none.fl_str_mv |
reponame:Repositorio Institucional EdocUR |
repository.name.fl_str_mv |
Repositorio institucional EdocUR |
repository.mail.fl_str_mv |
edocur@urosario.edu.co |
_version_ |
1818106425221578752 |
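
The abstract's comparison of job-level and task-level replication can be illustrated with a small sketch. This is not the stochastic model from the article. It is a simplified illustration under assumptions that are not in the record: every subtask succeeds with the same probability `p`, failures are independent (the article explicitly studies correlated replica failures), subtask run times are exponential with mean 1, and there is no queueing, so the effect of system utilization is ignored entirely. The function names `job_reliability` and `mean_response_times` are hypothetical.

```python
import random


def job_reliability(p: float, n: int) -> dict:
    """Probability that a job of n subtasks finishes without re-processing.

    Assumption (not from the article): each subtask, and each replica of it,
    succeeds independently with probability p.
    """
    no_replication = p ** n
    # Job-level: at least one of two whole-job replicas must succeed.
    job_level = 1.0 - (1.0 - no_replication) ** 2
    # Task-level: every subtask needs at least one of its two replicas to succeed.
    task_level = (1.0 - (1.0 - p) ** 2) ** n
    return {"none": no_replication, "job": job_level, "task": task_level}


def mean_response_times(n: int, trials: int, seed: int = 0) -> dict:
    """Monte Carlo estimate of mean response time with two replicas and canceling.

    Assumption (not from the article): exponential subtask times with mean 1,
    no failures, no queueing delay.
    """
    rng = random.Random(seed)
    totals = {"none": 0.0, "job": 0.0, "task": 0.0}
    for _ in range(trials):
        a = [rng.expovariate(1.0) for _ in range(n)]  # subtask times, replica A
        b = [rng.expovariate(1.0) for _ in range(n)]  # subtask times, replica B
        # No replication: the job ends when its slowest subtask ends.
        totals["none"] += max(a)
        # Job-level: the first whole replica to finish wins; the other is canceled.
        totals["job"] += min(max(a), max(b))
        # Task-level: each subtask takes the faster of its two copies.
        totals["task"] += max(min(x, y) for x, y in zip(a, b))
    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    print(job_reliability(p=0.99, n=100))
    print(mean_response_times(n=10, trials=10_000))
```

Under these assumptions task-level replication is never worse than job-level replication on either measure, which matches the direction of the finding reported in the abstract. Because the sketch leaves out queueing, it says nothing about the cost of the extra load that replication places on the computing nodes.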