Investigating the Efficiency of Machine Learning Algorithms on MapReduce Clusters with SSDs

Leonidas Akritidis, Athanasios Fevgas, P. Tsompanopoulou, Panayiotis Bozanis
{"title":"Investigating the Efficiency of Machine Learning Algorithms on MapReduce Clusters with SSDs","authors":"Leonidas Akritidis, Athanasios Fevgas, P. Tsompanopoulou, Panayiotis Bozanis","doi":"10.1109/ICTAI.2018.00157","DOIUrl":null,"url":null,"abstract":"In the big data era, the efficient processing of large volumes of data has became a standard requirement for both organizations and enterprises. Since single workstations cannot sustain such tremendous workloads, MapReduce was introduced with the aim of providing a robust, easy, and fault-tolerant parallelization framework for the execution of applications on large clusters. One of the most representative examples of such applications is the machine learning algorithms which dominate the broad research area of data mining. Simultaneously, the recent advances in hardware technology led to the introduction of high-performing alternative devices for secondary storage, known as Solid State Drives (SSDs). In this paper we examine the perfor-mance of several parallel data mining algorithms on MapReduce clusters equipped with such modern hardware. More specifically, we investigate standard dataset preprocessing methods including vectorization and dimensionality reduction, and two supervised classifiers, Naive Bayes and Linear Regression. We compare the execution times of these algorithms on an experimental cluster equipped with both standard magnetic disks and SSDs, by employing two different datasets and by applying several different cluster configurations. Our experiments demonstrate that the usage of SSDs can accelerate the execution of machine learning methods by a margin which depends on the cluster setup and the nature of the applied algorithms.","PeriodicalId":254686,"journal":{"name":"2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICTAI.2018.00157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In the big data era, the efficient processing of large volumes of data has became a standard requirement for both organizations and enterprises. Since single workstations cannot sustain such tremendous workloads, MapReduce was introduced with the aim of providing a robust, easy, and fault-tolerant parallelization framework for the execution of applications on large clusters. One of the most representative examples of such applications is the machine learning algorithms which dominate the broad research area of data mining. Simultaneously, the recent advances in hardware technology led to the introduction of high-performing alternative devices for secondary storage, known as Solid State Drives (SSDs). In this paper we examine the perfor-mance of several parallel data mining algorithms on MapReduce clusters equipped with such modern hardware. More specifically, we investigate standard dataset preprocessing methods including vectorization and dimensionality reduction, and two supervised classifiers, Naive Bayes and Linear Regression. We compare the execution times of these algorithms on an experimental cluster equipped with both standard magnetic disks and SSDs, by employing two different datasets and by applying several different cluster configurations. Our experiments demonstrate that the usage of SSDs can accelerate the execution of machine learning methods by a margin which depends on the cluster setup and the nature of the applied algorithms.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于ssd的MapReduce集群机器学习算法的效率研究
在大数据时代,高效处理海量数据已经成为组织和企业的标准需求。由于单个工作站无法承受如此巨大的工作负载,因此引入MapReduce的目的是为在大型集群上执行应用程序提供一个健壮、简单和容错的并行化框架。这种应用最具代表性的例子之一是机器学习算法,它主导了数据挖掘的广泛研究领域。同时,最近硬件技术的进步导致了二级存储的高性能替代设备的引入,即固态驱动器(ssd)。在本文中,我们研究了几种并行数据挖掘算法在配备这种现代硬件的MapReduce集群上的性能。更具体地说,我们研究了标准的数据集预处理方法,包括向量化和降维,以及两种监督分类器,朴素贝叶斯和线性回归。我们通过使用两种不同的数据集和应用几种不同的集群配置,比较了这些算法在配备标准磁盘和ssd的实验集群上的执行时间。我们的实验表明,ssd的使用可以在一定程度上加速机器学习方法的执行,这取决于集群设置和应用算法的性质。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
[Title page i] Enhanced Unsatisfiable Cores for QBF: Weakening Universal to Existential Quantifiers Effective Ant Colony Optimization Solution for the Brazilian Family Health Team Scheduling Problem Exploiting Global Semantic Similarity Biterms for Short-Text Topic Discovery Assigning and Scheduling Service Visits in a Mixed Urban/Rural Setting
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1