Modeling and Optimization of Performance and Reliability of Distributed Autonomic Systems






Distributed systems have long been a central topic in computer science because of the benefits they bring, such as scalability, resource sharing, and ease of collaboration. They will remain important because they are the foundation of many advanced technologies such as grid computing, web services, and cloud computing. Despite these benefits, distributed systems do not by themselves guarantee reliability, fast response times, or security. On the contrary, modern distributed systems involve heterogeneous resources and dynamic architectures, which can introduce failures caused by dependencies between parts of the system or by network disruptions. These weaknesses have spurred research on mechanisms to overcome them. Moreover, the ever-growing complexity of modern systems has driven the development of self-managing systems (also known as autonomic systems) that can determine the optimal or near-optimal configuration of a dynamically changing system in the face of varying environmental conditions. This dissertation proposes analytic models for well-known reliability and fault-tolerance techniques, such as checkpointing and job replication, to improve performance and reliability in distributed systems. Autonomic managers in self-healing and self-optimizing systems can use the models developed here.