Distance-Based Sampling of Software Configuration Spaces
Configurable software systems often provide a multitude of configuration options to adjust and optimize their functional and non-functional properties. For instance, to find the fastest configuration for a given setting, a brute-force approach is to measure the performance of all configurations, which is typically intractable. To address this challenge, state-of-the-art approaches rely on machine learning: they analyze only a few configurations (i.e., a sample set) to predict the performance of the others. Accurate performance predictions, however, require a representative sample set of configurations. To this end, different sampling strategies have been proposed, each with its own advantages (e.g., covering the configuration space systematically) and disadvantages (e.g., the need to enumerate all configurations). In our experiments, we found that most sampling strategies do not achieve good coverage of the configuration space with respect to relevant performance values. That is, they miss important configurations with distinct performance behavior. Based on this observation, we devise a new sampling strategy, called distance-based sampling, that uses a distance metric and a probability distribution to spread the configurations of the resulting sample set across the configuration space according to that distribution. This way, the sample set covers different kinds of interactions among configuration options. To demonstrate the benefit of distance-based sampling, we compare it to state-of-the-art sampling strategies, such as t-wise sampling, on 10 real-world software systems. Our results show that distance-based sampling leads to higher accuracy of performance models of configurable software systems for medium to large sample-set sizes.
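To make the idea concrete, the following is a minimal sketch of one possible reading of distance-based sampling, not the implementation evaluated in this work. It assumes binary options only, takes the distance of a configuration to be its number of enabled options (the Manhattan distance to the all-disabled configuration), and uses a uniform distribution over the distances that still have unsampled configurations. The function name `distance_based_sample`, the toy space, and its constraint are hypothetical. For brevity the sketch enumerates the toy space, which would not be feasible for a realistic configuration space.

```python
import random
from collections import defaultdict
from itertools import product


def distance_based_sample(configs, n_samples, rng):
    """Pick up to n_samples distinct configurations, spread over distances.

    Assumption: each configuration is a tuple of 0/1 values, and its
    distance is the number of enabled options.
    """
    # Group the valid configurations by their distance.
    buckets = defaultdict(list)
    for config in configs:
        buckets[sum(config)].append(config)
    for bucket in buckets.values():
        rng.shuffle(bucket)

    sample = []
    while len(sample) < n_samples and buckets:
        # Assumed distribution: uniform over the distances still available.
        distance = rng.choice(sorted(buckets))
        # The bucket is shuffled, so pop() yields a random configuration.
        sample.append(buckets[distance].pop())
        if not buckets[distance]:
            del buckets[distance]
    return sample


# Hypothetical toy space: 4 binary options, where option 0 requires option 1.
valid = [c for c in product((0, 1), repeat=4) if not (c[0] and not c[1])]
sample = distance_based_sample(valid, 6, random.Random(0))
for config in sample:
    print(sum(config), config)
```

Because a distance is drawn first and a configuration second, sparsely populated distances (such as the all-disabled or all-enabled configuration) are selected far more often than under uniform random sampling over all valid configurations.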