面向大数据挖掘处理的动态分布式并行机器学习算法

IF 1.7 4区计算机科学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Data Technologies and Applications Pub Date : 2021-12-21 DOI:10.1108/dta-06-2021-0153

Laouni Djafri

{"title":"面向大数据挖掘处理的动态分布式并行机器学习算法","authors":"Laouni Djafri","doi":"10.1108/dta-06-2021-0153","DOIUrl":null,"url":null,"abstract":"PurposeThis work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, clouds computing or other technologies.Design/methodology/approachIn the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors’ proposed works mainly aim at solving the problem of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how the authors can make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by Map-Reduce algorithm which in turn depends on random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors have extracted the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of our solution that the authors provided without significant loss of the classification results. Thus, in practical terms, the system DDPML is generally dedicated to big data mining processing, and works effectively in distributed systems with a simple structure, such as client-server networks.FindingsThe authors got very satisfactory classification results.Originality/valueDDPML system is specially designed to smoothly handle big data mining classification.","PeriodicalId":56156,"journal":{"name":"Data Technologies and Applications","volume":null,"pages":null},"PeriodicalIF":1.7000,"publicationDate":"2021-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing\",\"authors\":\"Laouni Djafri\",\"doi\":\"10.1108/dta-06-2021-0153\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"PurposeThis work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, clouds computing or other technologies.Design/methodology/approachIn the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors’ proposed works mainly aim at solving the problem of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how the authors can make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by Map-Reduce algorithm which in turn depends on random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors have extracted the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of our solution that the authors provided without significant loss of the classification results. Thus, in practical terms, the system DDPML is generally dedicated to big data mining processing, and works effectively in distributed systems with a simple structure, such as client-server networks.FindingsThe authors got very satisfactory classification results.Originality/valueDDPML system is specially designed to smoothly handle big data mining classification.\",\"PeriodicalId\":56156,\"journal\":{\"name\":\"Data Technologies and Applications\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2021-12-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Technologies and Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1108/dta-06-2021-0153\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Technologies and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1108/dta-06-2021-0153","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 1

摘要

本工作可以用作其他设置中的构建块，例如GPU, Map-Reduce, Spark或任何其他设置。此外，DDPML还可以部署在其他分布式系统上，如P2P网络、集群、云计算或其他技术。设计/方法/方法在大数据时代，所有公司都希望从大量数据中受益。这些数据可以帮助他们了解他们的内部和外部环境，并预测相关的现象，因为这些数据转化为知识，可以用于以后的预测。因此，这些知识成为公司手中的巨大资产。这正是数据挖掘的目的。但随着大量数据和知识以更快的速度产生，作者现在谈论的是大数据挖掘。因此，作者提出的工作主要针对使用分布式和并行处理技术对大数据进行分类时的数量、准确性、有效性和速度问题。因此，作者在这项工作中提出的问题是，作者如何使机器学习算法同时以分布式和并行的方式工作，而不会失去分类结果的准确性。为了解决这个问题，作者提出了一种称为动态分布式并行机器学习(DDPML)算法的系统。为了构建它，作者将他们的工作分为两部分。首先，作者提出了一种由Map-Reduce算法控制的分布式架构，而Map-Reduce算法又依赖于随机抽样技术。因此，作者设计的分布式架构专门用于处理大数据处理，该处理与本工作中提出的采样策略以一致和有效的方式运行。该体系结构还有助于作者实际验证使用代表性学习库(RLB)获得的分类结果。第二部分，采用分层随机抽样的方法，在两个层次上抽样，提取了具有代表性的学习库。该采样方法还应用于提取第一层的共享学习基(SLB)和部分学习基(PLBL1)以及第二层的部分学习基(PLBL2)。实验结果表明，在分类结果没有明显损失的情况下，本文提出的方法是有效的。因此，在实践中，系统DDPML通常专门用于大数据挖掘处理，并且在具有简单结构的分布式系统(例如客户机-服务器网络)中有效地工作。结果得到了满意的分类结果。Originality/valueDDPML系统是专门为顺利处理大数据挖掘分类而设计的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing

PurposeThis work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. Also, DDPML can be deployed on other distributed systems such as P2P networks, clusters, clouds computing or other technologies.Design/methodology/approachIn the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can be used for prediction later. Thus, this knowledge becomes a great asset in companies' hands. This is precisely the objective of data mining. But with the production of a large amount of data and knowledge at a faster pace, the authors are now talking about Big Data mining. For this reason, the authors’ proposed works mainly aim at solving the problem of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. So, the problem that the authors are raising in this work is how the authors can make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML) algorithms. To build it, the authors divided their work into two parts. In the first, the authors propose a distributed architecture that is controlled by Map-Reduce algorithm which in turn depends on random sampling technique. So, the distributed architecture that the authors designed is specially directed to handle big data processing that operates in a coherent and efficient manner with the sampling strategy proposed in this work. This architecture also helps the authors to actually verify the classification results obtained using the representative learning base (RLB). In the second part, the authors have extracted the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning base for the first level (PLBL1) and the partial learning base for the second level (PLBL2). The experimental results show the efficiency of our solution that the authors provided without significant loss of the classification results. Thus, in practical terms, the system DDPML is generally dedicated to big data mining processing, and works effectively in distributed systems with a simple structure, such as client-server networks.FindingsThe authors got very satisfactory classification results.Originality/valueDDPML system is specially designed to smoothly handle big data mining classification.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Data Technologies and Applications Social Sciences-Library and Information Sciences

CiteScore

3.80

自引率

6.20%

发文量

期刊介绍： Previously published as: Program Online from: 2018 Subject Area: Information & Knowledge Management, Library Studies