{"title":"Dividable Configuration Performance Learning","authors":"Jingzhi Gong;Tao Chen;Rami Bahsoon","doi":"10.1109/TSE.2024.3491945","DOIUrl":null,"url":null,"abstract":"Machine/deep learning models have been widely adopted to predict the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherited from the configuration landscape: the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed \n<monospace>DaL</monospace>\n, based on the new paradigm of dividable learning that builds a model via “divide-and-learn”. To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, for each of which we build a sparse local model, e.g., regularized Hierarchical Interaction Neural Network, to deal with the feature sparsity. A newly given configuration would then be assigned to the right model of division for the final prediction. Further, \n<monospace>DaL</monospace>\n adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experiment results from 12 real-world systems and five sets of training data reveal that, compared with the state-of-the-art approaches, \n<monospace>DaL</monospace>\n performs no worse than the best counterpart on 44 out of 60 cases (within which 31 cases are significantly better) with up to \n<inline-formula><tex-math>$1.61\\times$</tex-math></inline-formula>\n improvement on accuracy; requires fewer samples to reach the same/better accuracy; and producing acceptable training overhead. In particular, the mechanism that adapted the parameter \n<inline-formula><tex-math>$d$</tex-math></inline-formula>\n can reach the optimal value for 76.43% of the individual runs. The result also confirms that the paradigm of dividable learning is more suitable than other similar paradigms such as ensemble learning for predicting configuration performance. Practically, \n<monospace>DaL</monospace>\n considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary materials of this work can be accessed at our repository: \n<uri>https://github.com/ideas-labo/DaL-ext</uri>\n.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 1","pages":"106-134"},"PeriodicalIF":6.5000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10744216","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10744216/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Machine/deep learning models have been widely adopted to predict the configuration performance of software systems. However, a crucial yet unaddressed challenge is how to cater for the sparsity inherent in the configuration landscape: both the influence of configuration options (features) and the distribution of data samples are highly sparse. In this paper, we propose a model-agnostic and sparsity-robust framework for predicting configuration performance, dubbed DaL, based on the new paradigm of dividable learning, which builds a model via "divide-and-learn". To handle sample sparsity, the samples from the configuration landscape are divided into distant divisions, and for each division we build a sparse local model, e.g., a regularized Hierarchical Interaction Neural Network, to deal with feature sparsity. A newly given configuration is then assigned to the right division's model for the final prediction. Further, DaL adaptively determines the optimal number of divisions required for a system and sample size without any extra training or profiling. Experimental results from 12 real-world systems and five sets of training data reveal that, compared with state-of-the-art approaches, DaL performs no worse than the best counterpart in 44 out of 60 cases (31 of which are significantly better), with up to $1.61\times$ improvement in accuracy; requires fewer samples to reach the same or better accuracy; and incurs acceptable training overhead. In particular, the mechanism that adapts the parameter $d$ reaches the optimal value in 76.43% of the individual runs. The results also confirm that the paradigm of dividable learning is more suitable for predicting configuration performance than similar paradigms such as ensemble learning. Practically, DaL considerably improves different global models when using them as the underlying local models, which further strengthens its flexibility. To promote open science, all the data, code, and supplementary materials of this work can be accessed at our repository: https://github.com/ideas-labo/DaL-ext.
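The abstract outlines a three-step "divide-and-learn" pipeline: divide the training samples into divisions, fit a sparse local model per division, and route each new configuration to the right division's model for prediction. The following is a minimal, hypothetical sketch of that pipeline, not the authors' implementation: DaL's actual division mechanism, its regularized Hierarchical Interaction Neural Network local models, and its adaptive choice of the parameter $d$ are described in the paper and repository. Here, k-means stands in for the division step, random-forest regressors stand in for the local models, and a random-forest classifier handles the routing.

```python
# Sketch of divide-and-learn under stand-in components (k-means + random
# forests); DaL itself uses a different division strategy and a regularized
# Hierarchical Interaction Neural Network as the local model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def divide_and_learn(X_train, y_train, d=4):
    """Divide samples into d divisions and fit one local model per division."""
    # Step 1: divide the configuration samples into d distant divisions
    # (hypothetical choice: k-means over the configuration options).
    divider = KMeans(n_clusters=d, n_init=10, random_state=0)
    labels = divider.fit_predict(X_train)

    # Step 2: fit a local model per division to combat sample sparsity.
    local_models = {}
    for div in range(d):
        mask = labels == div
        local_models[div] = RandomForestRegressor(random_state=0).fit(
            X_train[mask], y_train[mask])

    # Step 3: train a router that assigns an unseen configuration to the
    # division whose local model should make the final prediction.
    router = RandomForestClassifier(random_state=0).fit(X_train, labels)
    return router, local_models

def predict(router, local_models, X_new):
    divisions = router.predict(X_new)
    return np.array([local_models[div].predict(x.reshape(1, -1))[0]
                     for div, x in zip(divisions, X_new)])

# Usage on synthetic data: 11 binary options with a linear performance model.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 11)).astype(float)
y = X @ rng.normal(size=11) + rng.normal(scale=0.1, size=200)
router, models = divide_and_learn(X, y, d=4)
print(predict(router, models, X[:5]))
```

Note that $d$ is fixed in this sketch; the paper's contribution includes choosing it adaptively per system and sample size, without extra training or profiling.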
About the Journal
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.