Authors: Zhiqiang Liu, Xuanhua Shi, Hai Jin
DOI: 10.1016/j.bdr.2022.100358
Published: 2022-11-28 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S2214579622000521
Data-Efficient Performance Modeling for Configurable Big Data Frameworks by Reducing Information Overlap Between Training Examples
To support the various analysis applications of big data, big data processing frameworks are designed to be highly configurable. However, it is difficult for common users to tailor a configurable framework to achieve optimal performance for every application. Recently, many automatic tuning methods have been proposed to configure these frameworks. These methods first build a performance prediction model by sampling configurations randomly and measuring the corresponding performance, and then conduct a heuristic search in the configuration space guided by the prediction model. For most frameworks, building the performance model is too expensive, since it requires measuring the performance of a large number of configurations, which incurs heavy data-collection overhead. In this paper, we propose a novel data-efficient method that builds the performance model with little impact on prediction accuracy. Compared to traditional methods, the proposed method reduces the data-collection overhead because it trains the performance model with far fewer training examples. Specifically, during iterative model updating, the proposed method actively samples the important examples according to the dynamic requirements of the performance model; it can therefore make full use of the collected informative data while needing far fewer training examples. To select these important training examples, we employ several virtual performance models to estimate the importance of all candidate configurations efficiently. Experimental results show that our method needs fewer training examples than traditional methods, with little impact on prediction accuracy.
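The sampling loop the abstract describes can be illustrated with a minimal active-learning sketch. This is not the authors' implementation: the synthetic `measure` benchmark, the choice of random-forest regressors as the "virtual performance models", the committee size, and the use of prediction variance as the importance score are all assumptions made for illustration. The structure, however, matches the abstract: start from a small random sample, train several virtual models on the data collected so far, score every candidate configuration by model disagreement, and measure only the most informative candidate in each iteration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical configuration space: 200 candidate configurations, 4 knobs each.
candidates = rng.uniform(0.0, 1.0, size=(200, 4))

def measure(config):
    """Stand-in for an expensive benchmark run of the framework."""
    return config @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.05 * rng.standard_normal()

# Seed set: a few randomly sampled, measured configurations.
seed_idx = rng.choice(len(candidates), size=10, replace=False)
X = [candidates[i] for i in seed_idx]
y = [measure(candidates[i]) for i in seed_idx]
pool = [i for i in range(len(candidates)) if i not in set(seed_idx)]

for _ in range(20):  # iterative model updating
    # "Virtual" performance models: a committee trained on bootstrap resamples.
    committee = []
    for seed in range(5):
        boot = rng.choice(len(X), size=len(X), replace=True)
        model = RandomForestRegressor(n_estimators=20, random_state=seed)
        model.fit(np.array(X)[boot], np.array(y)[boot])
        committee.append(model)

    # Importance of each unmeasured candidate = disagreement among the
    # virtual models' predictions (higher variance = more informative).
    preds = np.stack([m.predict(candidates[pool]) for m in committee])
    importance = preds.var(axis=0)

    # Measure only the most informative configuration this round.
    best = pool[int(importance.argmax())]
    X.append(candidates[best])
    y.append(measure(candidates[best]))
    pool.remove(best)

# Final performance model trained on the actively collected examples.
final_model = RandomForestRegressor(n_estimators=50, random_state=0)
final_model.fit(np.array(X), np.array(y))
```

Because each round spends a measurement only where the virtual models disagree, the loop avoids collecting configurations whose outcome the current model already predicts confidently, which is how redundant (information-overlapping) training examples are skipped.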