Managing Response Time Tails by Sharding

Authors:
Resource type:
Publication date:
2019
Institution:
Universidad del Rosario
Repository:
Repositorio EdocUR - U. Rosario
Language:
eng
OAI Identifier:
oai:repository.urosario.edu.co:10336/23233
Online access:
https://doi.org/10.1145/3300143
https://repository.urosario.edu.co/handle/10336/23233
Keywords:
Multiprocessing systems
Probability distributions
Quality of service
Response time (computer systems)
Distributed storage system
Matrix analytic methods
Mean response time
Numerical results
Parallel task
Performance
Sharding
Workload intensities
Digital storage
Parallel task processing
Response time
Rights
License
Open (Full Text)
Description
Summary: Matrix analytic methods are developed to compute the probability distribution of response times (i.e., data access times) in distributed storage systems protected by erasure coding, which is implemented by sharding a data object into N fragments, of which only K < N are required to reconstruct the object. This leads to a partial-fork-join model with a choice of canceling policies for the redundant N - K tasks. The accuracy of the analytical model is supported by tests against simulation in a broad range of setups. At increasing workload intensities, numerical results show the extent to which increasing the redundancy level reduces the mean response time of storage reads and significantly flattens the tail of their distribution; this is demonstrated at medium-high quantiles, up to the 99th. The reduction in response time achieved by two policies for canceling redundant tasks is also quantified: cancel-at-finish, and cancel-at-start, which limits the additional load introduced at the cost of losing the benefit of selectivity amongst fragment service times. © 2019 Copyright held by the owner/author(s).
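
The "selectivity amongst fragment service times" mentioned in the summary can be illustrated with a minimal sketch. It is not the matrix analytic model of the paper: it ignores queueing and the extra load that redundancy creates, and it assumes, purely for illustration, that fragment service times are independent and exponentially distributed with rate mu. Under those assumptions a read completes when the K-th fastest of N fragment fetches returns, so its duration is the K-th order statistic, whose mean is the sum of 1/((N - i) mu) for i = 0, ..., K - 1. All function and parameter names below are hypothetical.

```python
import random


def kth_of_n_mean(n, k, mu=1.0):
    """Exact mean of the k-th smallest of n i.i.d. Exp(mu) service times."""
    return sum(1.0 / ((n - i) * mu) for i in range(k))


def simulate_read(n, k, mu, rng):
    """One read: fork n fragment fetches, complete when the k-th fastest returns."""
    times = sorted(rng.expovariate(mu) for _ in range(n))
    return times[k - 1]


def quantile(samples, q):
    """Empirical q-quantile using a simple nearest-rank rule."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def compare(k=4, n_values=(4, 6, 8), mu=1.0, runs=20000, seed=1):
    """Map each n to (exact mean, simulated mean, simulated 99th percentile)."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    results = {}
    for n in n_values:
        samples = [simulate_read(n, k, mu, rng) for _ in range(runs)]
        results[n] = (
            kth_of_n_mean(n, k, mu),
            sum(samples) / runs,
            quantile(samples, 0.99),
        )
    return results


if __name__ == "__main__":
    for n, (exact, sim_mean, p99) in compare().items():
        print(f"N={n} K=4  exact mean={exact:.3f}  "
              f"simulated mean={sim_mean:.3f}  simulated p99={p99:.3f}")
```

With K fixed, raising N lowers both the mean and the 99th percentile in this no-queueing setting. In a loaded system the redundant tasks also consume server capacity, which is the trade-off between canceling policies that the summary refers to and that this sketch does not capture.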