Data-Efficient Performance Modeling for Configurable Big Data Frameworks by Reducing Information Overlap Between Training Examples

IF 4.3 · Zone 3, Materials Science · Q1 Engineering, Electrical & Electronic · ACS Applied Electronic Materials · Pub Date: 2022-11-28 · DOI: 10.1016/j.bdr.2022.100358

Zhiqiang Liu, Xuanhua Shi, Hai Jin



Abstract

To support the diverse analysis applications of big data, big data processing frameworks are designed to be highly configurable. However, it is difficult for ordinary users to tailor these frameworks to achieve optimal performance for every application. Recently, many automatic tuning methods have been proposed to configure these frameworks. These methods first build a performance prediction model by randomly sampling configurations and measuring the corresponding performance. They then conduct a heuristic search of the configuration space based on the performance prediction model. For most frameworks, building the performance model is too expensive because it requires measuring the performance of a large number of configurations, which incurs substantial data collection overhead. In this paper, we propose a novel data-efficient method that builds the performance model with little impact on prediction accuracy. Compared with traditional methods, the proposed method reduces the overhead of data collection because it can train the performance model with far fewer training examples. Specifically, the proposed method actively samples the important examples according to the dynamic requirements of the performance model during iterative model updating. Hence, it makes full use of the informative data collected and trains the performance model with far fewer training examples. To sample the important training examples, we employ several virtual performance models to efficiently estimate the importance of all candidate configurations. Experimental results show that our method needs fewer training examples than traditional methods, with little impact on prediction accuracy.
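The abstract does not specify how the virtual performance models are built or how importance is scored, so the following is only a minimal sketch of the general idea of actively sampling informative configurations. It substitutes a different, well-known technique: query-by-committee active sampling, where a small committee of models trained on bootstrap resamples votes, and the candidate on which they disagree most is measured next. Everything here is an assumption for illustration, not the paper's method: the `measure` function is a hypothetical stand-in for running a job on the framework, the k-nearest-neighbour predictor is a placeholder performance model, and the two-parameter configuration grid is invented.

```python
import random


def measure(config):
    """Hypothetical stand-in for running a job and recording its runtime."""
    x, y = config
    return (x - 0.3) ** 2 + 2.0 * (y - 0.7) ** 2


def knn_predict(train, config, k=3):
    """Placeholder performance model: mean of the k nearest measured examples."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], config)),
    )[:k]
    return sum(perf for _, perf in nearest) / len(nearest)


def disagreement(train, config, rng, n_models=5):
    """Variance of predictions from models fit on bootstrap resamples.

    High variance is used here as an assumed proxy for how much new
    information measuring this configuration would add.
    """
    preds = []
    for _ in range(n_models):
        resample = [rng.choice(train) for _ in train]
        preds.append(knn_predict(resample, config))
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)


def active_sample(candidates, n_initial=5, n_queries=10, seed=0):
    """Iteratively measure the candidate with the highest committee disagreement."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    # Start from a small random set of measured configurations.
    train = [(c, measure(c)) for c in pool[:n_initial]]
    pool = pool[n_initial:]
    for _ in range(n_queries):
        best = max(pool, key=lambda c: disagreement(train, c, rng))
        pool.remove(best)
        train.append((best, measure(best)))  # the only expensive step
    return train


# Invented 10 x 10 grid over two normalised configuration parameters.
candidates = [(i / 9, j / 9) for i in range(10) for j in range(10)]
```

Calling `active_sample(candidates)` returns 15 measured `(config, performance)` pairs: 5 random seeds plus 10 actively chosen ones. The point of the design is that `measure` is called only for selected configurations, while the cheap committee scores every remaining candidate in each round.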


