Leonidas Akritidis, Athanasios Fevgas, P. Tsompanopoulou, Panayiotis Bozanis
2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), 2018-11-01. DOI: 10.1109/ICTAI.2018.00157 (https://doi.org/10.1109/ICTAI.2018.00157)
Investigating the Efficiency of Machine Learning Algorithms on MapReduce Clusters with SSDs
In the big data era, the efficient processing of large volumes of data has become a standard requirement for both organizations and enterprises. Since single workstations cannot sustain such tremendous workloads, MapReduce was introduced to provide a robust, easy-to-use, and fault-tolerant parallelization framework for executing applications on large clusters. Machine learning algorithms, which dominate the broad research area of data mining, are among the most representative examples of such applications. At the same time, recent advances in hardware technology have led to the introduction of high-performance alternative devices for secondary storage, known as Solid State Drives (SSDs). In this paper we examine the performance of several parallel data mining algorithms on MapReduce clusters equipped with such modern hardware. More specifically, we investigate standard dataset preprocessing methods, including vectorization and dimensionality reduction, and two supervised classifiers, Naive Bayes and Linear Regression. We compare the execution times of these algorithms on an experimental cluster equipped with both standard magnetic disks and SSDs, using two different datasets and several different cluster configurations. Our experiments demonstrate that using SSDs can accelerate the execution of machine learning methods by a margin that depends on the cluster setup and the nature of the applied algorithms.