Title: Node and relevant data selection in distributed predictive analytics: A query-centric approach
Authors: Tahani Aladwani, Christos Anagnostopoulos, Kostas Kolomvatsos
Journal: Journal of Network and Computer Applications, volume 232, Article 104029
Publication date: 2024-09-19
Publication type: Journal Article
DOI: 10.1016/j.jnca.2024.104029
URL: https://www.sciencedirect.com/science/article/pii/S1084804524002066
Impact factor: 7.7 (JCR Q1, Computer Science, Hardware & Architecture)
Citations: 0
Abstract
Distributed Predictive Analytics (DPA) refers to constructing predictive models from data distributed across nodes. DPA reduces the need for data centralization, thereby alleviating concerns about data privacy, decreasing the load on central servers, and minimizing communication overhead. However, data collected by nodes are inherently different; each node can have different distributions, volumes, access patterns, and feature spaces. This heterogeneity hinders the development of accurate models in a distributed fashion. Many state-of-the-art methods adopt random node selection as a straightforward approach. Such a method is particularly ineffective when dealing with data and access pattern heterogeneity, as it increases the likelihood of selecting nodes with low-quality or irrelevant data for DPA. Consequently, the most suitable nodes can be identified, based on predictive performance, only after models have been trained over the randomly selected nodes. This results in more time and resource consumption and increased network load. In this work, we treat holistic knowledge of nodes’ data characteristics and access patterns as crucial. Such knowledge enables the successful selection of a subset of suitable nodes for each DPA task (query) before model training. Our method engages the most suitable nodes by predicting their relevant distributed data and learning predictive models per query. We introduce a novel query-centric DPA mechanism for node and relevant data selection. We contribute (i) predictive selection mechanisms based on the availability and relevance of data per DPA query and (ii) various distributed machine learning mechanisms that engage the most suitable nodes for model training. We evaluate the efficiency of our mechanism and provide a comparative assessment with other methods found in the literature. Our experiments show that our mechanism significantly outperforms other approaches applicable to DPA.
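The abstract does not spell out the selection rule, so the following is only a hypothetical illustration of selecting nodes per query before training, in contrast to random selection. It assumes that each node publishes a min/max range for every feature it holds and that a query is a hyper-rectangle over the same features. The names `NodeSummary`, `overlap_fraction`, `select_nodes`, and the `min_overlap` threshold are invented for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass

Range = tuple[float, float]  # closed interval (low, high) on one feature


@dataclass
class NodeSummary:
    """Hypothetical per-node summary: the min/max of each locally held feature."""
    name: str
    bounds: list[Range]


def overlap_fraction(query: list[Range], bounds: list[Range]) -> float:
    """Fraction of the query hyper-rectangle's volume covered by the node's bounds."""
    fraction = 1.0
    for (q_lo, q_hi), (b_lo, b_hi) in zip(query, bounds):
        width = q_hi - q_lo
        if width <= 0:
            raise ValueError("each query range must have positive width")
        covered = min(q_hi, b_hi) - max(q_lo, b_lo)
        if covered <= 0:
            return 0.0  # disjoint on this feature: nothing relevant on this node
        fraction *= covered / width
    return fraction


def select_nodes(
    query: list[Range], nodes: list[NodeSummary], min_overlap: float = 0.2
) -> list[tuple[str, float]]:
    """Return (name, score) for nodes meeting the threshold, best score first."""
    scored = [(n.name, overlap_fraction(query, n.bounds)) for n in nodes]
    kept = [(name, score) for name, score in scored if score >= min_overlap]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative data only: three nodes with two features each.
nodes = [
    NodeSummary("A", [(0.0, 10.0), (0.0, 10.0)]),
    NodeSummary("B", [(5.0, 20.0), (0.0, 4.0)]),
    NodeSummary("C", [(30.0, 40.0), (0.0, 10.0)]),
]
query = [(4.0, 8.0), (2.0, 6.0)]
selected = select_nodes(query, nodes)
print(selected)
```

With these made-up summaries, node C is excluded before any training because its data lie outside the query region, which is the kind of saving the abstract attributes to selecting nodes ahead of model training. A real mechanism would likely rely on richer statistics than axis-aligned ranges.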
About the journal:
The Journal of Network and Computer Applications welcomes research contributions, surveys, and notes in all areas relating to computer networks and applications thereof. Sample topics include new design techniques, interesting or novel applications, components or standards; computer networks with tools such as WWW; emerging standards for internet protocols; wireless networks; mobile computing; emerging computing models such as cloud computing and grid computing; and applications of networked systems for remote collaboration and telemedicine. The journal is abstracted and indexed in Scopus, Engineering Index, Web of Science, Science Citation Index Expanded, and INSPEC.