Self-Healing Distributed Scheduling Platform

Authors:
Frincu, Marc E.
Rouvoy, Romain
Müller, Hausi A.
Petcu, Dana
Villegas Machado, Norha Milena
Resource type:
Part of book
Publication date:
2011
Institution:
Universidad ICESI
Repository:
Repositorio ICESI
Language:
eng
OAI Identifier:
oai:repository.icesi.edu.co:10906/79548
Online access:
http://dx.doi.org/10.1109/CCGrid.2011.23
http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=5948613
http://hdl.handle.net/10906/79548
Keywords:
Computer programming
Systems and communications engineering
Technology platform
Systems engineering
Rights
openAccess
License
https://creativecommons.org/licenses/by-nc-nd/4.0/
Description
Summary: Distributed systems require effective mechanisms to manage the reliable provisioning of computational resources from different and distributed providers. Moreover, the dynamic environment that affects the behaviour of such systems and the complexity of these dynamics demand autonomous capabilities to ensure the behaviour of distributed scheduling platforms and to achieve business and user objectives. In this paper we propose a self-adaptive distributed scheduling platform composed of multiple agents implemented as intelligent feedback control loops to support policy-based scheduling and expose self-healing capabilities. Our platform leverages distributed scheduling processes by (i) allowing each provider to maintain its own internal scheduling process, and (ii) implementing self-healing capabilities based on agent module recovery. Simulated tests are performed to determine the optimal number of agents to be used in the negotiation phase without affecting the scheduling cost function. Test results on a real-life platform are presented to evaluate recovery times and optimize platform parameters.