Shixin Huang, Chao Chen, Gangya Zhu, Jinhan Xin, Z. Wang, Kai Hwang, Zhibin Yu
International Journal of Intelligent Computing and Cybernetics. Published 2022-10-06. DOI: 10.34133/2022/9820424 (https://doi.org/10.34133/2022/9820424)
Resource Configuration Tuning for Stream Data Processing Systems via Bayesian Optimization
Stream data processing systems are becoming increasingly popular in the big data era. Systems such as Apache Flink typically provide a number (e.g., 30) of configuration parameters to flexibly specify the amount of resources (e.g., CPU cores and memory) allocated to tasks. These parameters significantly affect task performance. However, it is hard to tune them manually for optimal performance when an unknown program runs on a given cluster. An automatic and fast resource configuration tuning approach is therefore desired. To this end, we propose to leverage Bayesian optimization to automatically tune the resource configurations of stream data processing systems. We first select a machine learning model, Random Forest, to construct accurate performance models for a stream data processing program. We then use the Bayesian optimization (BO) algorithm, together with the performance models, to iteratively search for the optimal configurations of that program. Experimental results show that our approach improves the 99th-percentile tail latency by a factor of 2.62× on average and up to 5.26×. Furthermore, it improves throughput by a factor of 1.05× on average and up to 1.21×.
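The tuning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the parameter names and bounds in `SPACE`, the `synthetic_latency` objective, the simplified bootstrap forest (one random feature per split), and the lower-confidence-bound acquisition rule are all assumptions made for illustration. In a real setting the objective would run the streaming job on the cluster and measure its 99th-percentile latency.

```python
import random
import statistics

# Hypothetical integer search space; names and bounds are illustrative only.
SPACE = {"task_slots": (1, 8), "memory_mb": (512, 4096), "parallelism": (1, 16)}
KEYS = list(SPACE)


def sample_config(rng):
    """Draw one random configuration from SPACE."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in SPACE.items()}


def synthetic_latency(cfg):
    """Made-up stand-in for running the job and measuring p99 latency (ms)."""
    return (50.0 + 20 * abs(cfg["task_slots"] - 4)
            + abs(cfg["memory_mb"] - 2048) / 10
            + 15 * abs(cfg["parallelism"] - 8))


def sse(ys):
    """Sum of squared errors around the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def build_tree(rows, rng, depth=0, max_depth=4):
    """Regression tree that splits on one randomly chosen feature per node."""
    ys = [y for _, y in rows]
    feature = rng.randrange(len(KEYS))
    values = sorted({x[feature] for x, _ in rows})
    if depth >= max_depth or len(values) < 2:
        return sum(ys) / len(ys)  # leaf: mean target of the rows that reach it
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for r in rows if r[0][feature] <= thr]
        right = [r for r in rows if r[0][feature] > thr]
        cost = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or cost < best[0]:
            best = (cost, thr, left, right)
    _, thr, left, right = best
    return (feature, thr,
            build_tree(left, rng, depth + 1, max_depth),
            build_tree(right, rng, depth + 1, max_depth))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        feature, thr, left, right = node
        node = left if x[feature] <= thr else right
    return node


def fit_forest(rows, rng, n_trees=20):
    """Fit each tree on a bootstrap resample of the observations."""
    return [build_tree([rng.choice(rows) for _ in rows], rng)
            for _ in range(n_trees)]


def tune(objective, n_init=8, n_iter=15, n_candidates=200, kappa=1.0, seed=0):
    """Sequential model-based search that minimizes `objective`."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_init):  # random warm-up observations
        cfg = sample_config(rng)
        history.append((cfg, objective(cfg)))
    for _ in range(n_iter):
        rows = [([c[k] for k in KEYS], y) for c, y in history]
        forest = fit_forest(rows, rng)

        def lcb(cfg):
            # Lower confidence bound: predicted mean minus tree disagreement.
            preds = [predict_tree(t, [cfg[k] for k in KEYS]) for t in forest]
            return sum(preds) / len(preds) - kappa * statistics.pstdev(preds)

        candidates = [sample_config(rng) for _ in range(n_candidates)]
        nxt = min(candidates, key=lcb)
        history.append((nxt, objective(nxt)))  # evaluate the chosen config
    best = min(history, key=lambda h: h[1])
    return best, history


if __name__ == "__main__":
    best, history = tune(synthetic_latency)
    print("best config:", best[0], "latency (ms):", best[1])
```

The forest's per-tree disagreement serves as a rough uncertainty estimate here, which lets the acquisition rule trade off exploiting configurations predicted to be fast against exploring ones the model is unsure about. Other acquisition functions, such as expected improvement, would slot into the same loop.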