Dissertation Defense - University of Houston

Dissertation Defense

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Olga Datskova

will defend her dissertation

Detection and Analysis of Operational Instability within Distributed Computing Environments


Abstract

Distributed computing environments are increasingly deployed across geographically distributed data centers using heterogeneous hardware systems. Failures within such environments incur considerable physical as well as computing time losses that are unacceptable for large-scale scientific processing tasks. At present, resource management systems are limited in detecting and analyzing such occurrences beyond the level of alarms and notifications. The nature of such instabilities is largely unknown, and handling them relies on subsystem expert knowledge and on reacting when they do occur. This work examines performance fluctuations associated with failures within a large scientific distributed production environment. We first present an approach to distinguish between expected operational behavior and service instability occurring within a data center in the context of network quality, production job efficiency, and job error state deviation. This method identifies failure domains, allowing for online detection of service state fluctuations. We then propose a data center stability measure along with an event selection approach, used in analyzing past unstable behavior. We determine that, for a number of detected events, precursors to an instability exist. For selected events, we discovered a reliability model fit, suggesting potential use in predictive analytics. The developed methods are able to detect a pre-failure period, identifying the service failure domains affected by the instability. This allows users as well as central and data center experts to take action in advance of service failure effects, with a view of how the failure is expected to develop. This work represents an instrumental step towards automated proactive management of large distributed computing environments.


Date: Wednesday, August 9, 2017
Time: 9:00 AM 
Place: PGH 501
Advisor: Dr. Weidong Shi

Faculty, students, and the general public are invited.