Enhancing reliability and response times via replication in computing clusters

Computing clusters have been widely deployed for scientific and engineering applications to support intensive computation and massive data operations. As applications and resources in a cluster are subject to failures, fault-tolerance strategies are commonly adopted, sometimes at the expense of addi...


Authors:
Qiu, Zhan; Pérez, Juan F.
Resource type:
Book part
Publication date:
2015
Institution:
Universidad del Rosario
Repository:
Repositorio EdocUR - U. Rosario
Language:
eng
OAI Identifier:
oai:repository.urosario.edu.co:10336/28504
Online access:
https://doi.org/10.1109/INFOCOM.2015.7218512
https://repository.urosario.edu.co/handle/10336/28504
Keywords:
Servers
Time factors
Reliability
Computational modeling
Conferences
Computers
Switches
Rights
License
Restricted (access limited to specific groups)
id EDOCUR2_38c54bd73fb21e00a6e741d17625b34e
oai_identifier_str oai:repository.urosario.edu.co:10336/28504
network_acronym_str EDOCUR2
network_name_str Repositorio EdocUR - U. Rosario
repository_id_str
spelling Qiu, Zhan; Pérez, Juan F. Enhancing reliability and response times via replication in computing clusters. 2015 IEEE Conference on Computer Communications (INFOCOM), EISBN: 978-1-4799-8381-0, pp. 1355-1363. IEEE, 2015-08-24. https://doi.org/10.1109/INFOCOM.2015.7218512
dc.title.spa.fl_str_mv Enhancing reliability and response times via replication in computing clusters
dc.title.TranslatedTitle.spa.fl_str_mv Mejora de la confiabilidad y los tiempos de respuesta mediante la replicación en clústeres informáticos
title Enhancing reliability and response times via replication in computing clusters
dc.subject.keyword.spa.fl_str_mv Servers
Time factors
Reliability
Computational modeling
Conferences
Computers
Switches
description Computing clusters are widely deployed for scientific and engineering applications to support intensive computation and massive data operations. Because applications and resources in a cluster are subject to failures, fault-tolerance strategies are commonly adopted, sometimes at the expense of additional delays in job response times or unnecessarily increased resource usage. In this paper, we explore concurrent replication with canceling, a fault-tolerance approach in which jobs and their replicas are processed concurrently, and the successful completion of either triggers the removal of its counterpart. We propose a stochastic model to study how this approach affects the cluster's service level objectives (SLOs), particularly the offered response time percentiles. In addition to the expected gains in reliability, the proposed model allows us to determine the utilization regions where introducing replication with canceling effectively reduces response times. Moreover, we show how this model can support resource provisioning decisions with reliability and response time guarantees.
publishDate 2015
dc.date.created.spa.fl_str_mv 2015-08-24
dc.date.accessioned.none.fl_str_mv 2020-08-28T15:49:14Z
dc.date.available.none.fl_str_mv 2020-08-28T15:49:14Z
dc.type.eng.fl_str_mv bookPart
dc.type.coarversion.fl_str_mv http://purl.org/coar/version/c_970fb48d4fbd8a85
dc.type.coar.fl_str_mv http://purl.org/coar/resource_type/c_3248
dc.type.spa.spa.fl_str_mv Book part
dc.identifier.doi.none.fl_str_mv https://doi.org/10.1109/INFOCOM.2015.7218512
dc.identifier.issn.none.fl_str_mv EISBN: 978-1-4799-8381-0
dc.identifier.uri.none.fl_str_mv https://repository.urosario.edu.co/handle/10336/28504
url https://doi.org/10.1109/INFOCOM.2015.7218512
https://repository.urosario.edu.co/handle/10336/28504
identifier_str_mv EISBN: 978-1-4799-8381-0
dc.language.iso.spa.fl_str_mv eng
language eng
dc.relation.citationEndPage.none.fl_str_mv 1363
dc.relation.citationStartPage.none.fl_str_mv 1355
dc.relation.citationTitle.none.fl_str_mv 2015 IEEE Conference on Computer Communications (INFOCOM)
dc.relation.ispartof.spa.fl_str_mv IEEE Conference on Computer Communications (INFOCOM), EISBN: 978-1-4799-8381-0 (2015); pp. 1355-1363
dc.relation.uri.spa.fl_str_mv https://ieeexplore.ieee.org/abstract/document/7218512
dc.rights.coar.fl_str_mv http://purl.org/coar/access_right/c_16ec
dc.rights.acceso.spa.fl_str_mv Restricted (access limited to specific groups)
rights_invalid_str_mv Restricted (access limited to specific groups)
http://purl.org/coar/access_right/c_16ec
dc.format.mimetype.none.fl_str_mv application/pdf
dc.publisher.spa.fl_str_mv IEEE
dc.source.spa.fl_str_mv 2015 IEEE Conference on Computer Communications (INFOCOM)
institution Universidad del Rosario
dc.source.instname.none.fl_str_mv instname:Universidad del Rosario
dc.source.reponame.none.fl_str_mv reponame:Repositorio Institucional EdocUR
repository.name.fl_str_mv Repositorio institucional EdocUR
repository.mail.fl_str_mv edocur@urosario.edu.co
_version_ 1808390655002542080