基于Apache Spark的机器学习模型性能评估:实证研究

Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua
{"title":"基于Apache Spark的机器学习模型性能评估:实证研究","authors":"Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua","doi":"10.1109/CICN56167.2022.10008282","DOIUrl":null,"url":null,"abstract":"Artificial intelligence (AI) and machine learning significantly improve many sectors, such as education, healthcare, and industry. Machine learning techniques mainly depend on the volume and diversity of training data. With the digital transformation we live in, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better for working with specific applications than others. Apache Spark is capable of handling iterative operations like machine learning operations as well as it provides an integrated library of different machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of GPU. We concluded that the use of Spark greatly improves the runtime and memory consumption. However, its use has to be justifiable and needed for the size of the data due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and GPU should only be used when the machine learning algorithm used supports parallelization.","PeriodicalId":287589,"journal":{"name":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study\",\"authors\":\"Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua\",\"doi\":\"10.1109/CICN56167.2022.10008282\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Artificial intelligence (AI) and machine learning significantly improve many sectors, such as education, healthcare, and industry. Machine learning techniques mainly depend on the volume and diversity of training data. With the digital transformation we live in, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better for working with specific applications than others. Apache Spark is capable of handling iterative operations like machine learning operations as well as it provides an integrated library of different machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of GPU. We concluded that the use of Spark greatly improves the runtime and memory consumption. However, its use has to be justifiable and needed for the size of the data due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and GPU should only be used when the machine learning algorithm used supports parallelization.\",\"PeriodicalId\":287589,\"journal\":{\"name\":\"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CICN56167.2022.10008282\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CICN56167.2022.10008282","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

人工智能(AI)和机器学习显著改善了许多领域,如教育、医疗保健和工业。机器学习技术主要依赖于训练数据的数量和多样性。随着我们所处的数字化转型,我们可以从不同的来源收集到大量的数据。然而,需要解决的问题是如何处理这些数据以及将其存储在何处。云服务和分布式文件系统(dfs)有助于解决这个问题。许多dfs(如Hadoop、Quantcast和Apache Spark)在许多方面存在差异,包括调度算法、数据管理协议、吞吐量和运行时。一些dfs可能比其他dfs更适合处理特定的应用程序。Apache Spark能够处理像机器学习操作这样的迭代操作,并且它提供了一个名为MLlib的不同机器学习算法的集成库。在本文中,我们使用两种机器学习算法,即逻辑回归(LR)和随机森林(RF)来评估Spark的使用。我们研究了不同内存分配配置和GPU使用的影响。我们得出的结论是,使用Spark极大地改善了运行时和内存消耗。然而,由于影响机器学习模型准确性的不同因素,它的使用必须是合理的,并且需要用于数据的大小。内存分配应该保持在所需的最小值,并且GPU应该只在使用的机器学习算法支持并行化时使用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study
Artificial intelligence (AI) and machine learning significantly improve many sectors, such as education, healthcare, and industry. Machine learning techniques mainly depend on the volume and diversity of training data. With the digital transformation we live in, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better for working with specific applications than others. Apache Spark is capable of handling iterative operations like machine learning operations as well as it provides an integrated library of different machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of GPU. We concluded that the use of Spark greatly improves the runtime and memory consumption. However, its use has to be justifiable and needed for the size of the data due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and GPU should only be used when the machine learning algorithm used supports parallelization.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Prediction of Downhole Pressure while Tripping A Parallelized Genetic Algorithms approach to Community Energy Systems Planning Application of Artificial Neural Network to Estimate Students Performance in Scholastic Assessment Test A New Intelligent System for Evaluating and Assisting Students in Laboratory Learning Management System Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1