
Distributed-System Failures: Observations and Implications for Testing

Department of Computer Science, University of Colorado, Boulder, CO (2005)

Distributed software systems are notoriously difficult to test. As in all software testing, the space
of potential test cases for distributed systems is intractably large and so the efforts of testers must
be directed to the scenarios that are most important. Unfortunately, there does not currently
exist a general-purpose, disciplined, and effective testing method for distributed systems. In this
paper we present an empirical study of failures experienced by the users of seven open-source
distributed systems. The goal of the study is to understand the extent to which there are patterns,
commonalities, and correlations in the failure scenarios, such that one could define an improved
testing method for distributed systems. Our results indicate that: a new generation of test-adequacy
criteria is needed to address the failures that are due to distribution; the configurations
that cause user-reported failures are reasonably straightforward to identify; and generic failure
observations (i.e., those for which reusable techniques can be developed) are strongly correlated with
the distributed nature of system failures. The latter two results in particular imply that there is
a reasonable bound on the effort required to organize the testing activity. Overall, the study gives
us some early confidence that it is feasible to consider a testing method that targets distributed
systems. The results of this study are offered as a basis for future work on the definition of such
a method.

CU-CS-994-05.pdf (190.96 KB)