Data-Efficient Performance Modeling for Configurable Big Data Frameworks by Reducing Information Overlap Between Training Examples

IF 4.2 · CAS Zone 3 (Computer Science) · Q2 (Computer Science, Artificial Intelligence) · Big Data Research · Pub Date: 2022-11-28 · DOI: 10.1016/j.bdr.2022.100358
Zhiqiang Liu, Xuanhua Shi, Hai Jin
Big Data Research, Volume 30 (2022), Article 100358. Available at: https://www.sciencedirect.com/science/article/pii/S2214579622000521
Citations: 0

Abstract

To support the various analysis applications of big data, big data processing frameworks are designed to be highly configurable. However, it is difficult for ordinary users to tailor these configurable frameworks to achieve optimal performance for every application. Recently, many automatic tuning methods have been proposed to configure these frameworks. These methods first build a performance prediction model by sampling configurations randomly and measuring the corresponding performance, and then conduct a heuristic search in the configuration space guided by the prediction model. For most frameworks, building the performance model is too expensive because it requires measuring the performance of a large number of configurations, which incurs heavy data-collection overhead. In this paper, we propose a novel data-efficient method that builds the performance model with little impact on prediction accuracy. Compared with traditional methods, the proposed method reduces the overhead of data collection because it trains the performance model with far fewer training examples. Specifically, during iterative model updating, the proposed method actively samples important examples according to the dynamic requirements of the performance model. It can therefore make full use of the collected informative data and train the performance model with far fewer training examples. To sample the important training examples, we employ several virtual performance models to estimate the importance of all candidate configurations efficiently. Experimental results show that our method needs fewer training examples than traditional methods, with little impact on prediction accuracy.
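The active-sampling loop described in the abstract resembles query-by-committee active learning: an ensemble of "virtual" models is trained, and the candidate configuration on which the committee disagrees most is measured next, so each expensive measurement carries maximal information. The sketch below illustrates this idea under stated assumptions; `measure_performance`, the synthetic two-parameter workload, and the bootstrap linear committee are all hypothetical stand-ins, not the paper's actual models.

```python
# Sketch of committee-based active sampling for performance modeling.
# Hypothetical: measure_performance() stands in for running a real
# big-data job under a given configuration; the paper's concrete model
# and importance estimator may differ.
import numpy as np

rng = np.random.default_rng(0)

def measure_performance(cfg):
    # Hypothetical synthetic workload: runtime as a function of two
    # configuration parameters (e.g. memory and parallelism).
    return 3.0 * cfg[0] - 2.0 * cfg[1] + 0.5 * cfg[0] * cfg[1]

def fit_committee(X, y, k=5):
    """Train k 'virtual' linear models on bootstrap resamples."""
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), len(X))     # bootstrap sample
        A = np.c_[X[idx], np.ones(len(idx))]      # add a bias column
        w, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        models.append(w)
    return models

def importance(models, candidates):
    """Committee disagreement (prediction variance) per candidate."""
    A = np.c_[candidates, np.ones(len(candidates))]
    preds = np.stack([A @ w for w in models])     # (k, n_candidates)
    return preds.var(axis=0)

# Candidate configuration space: 200 random 2-parameter configurations.
candidates = rng.uniform(0, 10, size=(200, 2))

# Start from a few random measurements, then in each round measure only
# the candidate the committee is most uncertain about and retrain.
X = candidates[:5].copy()
y = np.array([measure_performance(c) for c in X])
for _ in range(10):
    models = fit_committee(X, y)
    best = int(np.argmax(importance(models, candidates)))
    X = np.vstack([X, candidates[best]])
    y = np.append(y, measure_performance(candidates[best]))
```

In this sketch the measurement budget is spent on the most informative configurations rather than on a larger random sample, which is the data-efficiency argument the abstract makes; the real system would replace the linear committee with stronger learners and `measure_performance` with actual framework runs.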

Source Journal

Big Data Research (Computer Science: Computer Science Applications)
CiteScore: 8.40
Self-citation rate: 3.00%

Aims and scope: The journal aims to promote and communicate advances in big data research by providing a fast and high-quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in dealing with big data, as well as papers on specific platforms and technologies used to deal with big data. To promote data science and interdisciplinary collaboration between fields, and to showcase the benefits of data-driven research, papers demonstrating applications of big data in domains as diverse as geoscience, social web, finance, e-commerce, health care, environment and climate, physics and astronomy, chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.