Big Data Mining and Analytics: Latest Articles

Replication-Based Query Management for Resource Allocation Using Hadoop and MapReduce over Big Data
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020026
Ankit Kumar;Neeraj Varshney;Surbhi Bhatiya;Kamred Udham Singh
We live in an age where everything around us is being created. Data are generated at an alarming rate, creating pressure to implement costly and straightforward data storage and recovery processes. The MapReduce model is used to run parallel, distributed algorithms over large datasets on a cluster. The MapReduce strategy from Hadoop helps develop a community of non-commercial use to offer a new algorithm for resolving such problems for commercial applications as expected from this working algorithm with insights as a result of disproportionate or discriminatory Hadoop cluster results. Expected results are obtained in the work and the exam conducted under this job; many of them are scheduled to set schedules, match matrices' data positions, clustering before determining to click, and accurate mapping and internal reliability to be closed together to avoid running and execution times. Mapper output and proponents have been implemented, and the map has been used to reduce the function. The input and output key/value pairs for execution have been set. This paper focuses on evaluating this technique for the efficient retrieval of large volumes of data. The technique allows for capabilities to inform a massive database of information, from storage and indexing techniques to the distribution of queries, scalability, and performance in heterogeneous environments. The results show that the proposed work reduces the data processing time by 30%.
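The abstract above mentions mapper output and input/output key/value pairs without showing them. As a rough illustration of the general MapReduce pattern only (not the paper's replication-based algorithm), the following single-process sketch runs a word count through map, shuffle, and reduce steps. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: input pair (doc_id, text) -> list of (word, 1) output pairs."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: input pair (word, [counts]) -> output pair (word, total)."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce in one process (no cluster involved)."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


# Hypothetical input: two tiny "documents" keyed by an id.
counts = run_job({"d1": "big data big cluster", "d2": "data node"})
```

On a real Hadoop cluster the mappers and reducers would run on separate nodes and the shuffle would move data across the network; the sketch only shows how the key/value pairs flow.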
Big Data Mining and Analytics, vol. 6, no. 4, pp. 465-477. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233249.pdf
Citations: 0
A Clinical Data Analysis Based Diagnostic Systems for Heart Disease Prediction Using Ensemble Method
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020052
Ankit Kumar;Kamred Udham Singh;Manish Kumar
The correct diagnosis of heart disease can save lives, while an incorrect diagnosis can be lethal. This work uses the UCI machine learning heart disease dataset to compare the results and analyses of various machine learning approaches, including deep learning. We used a dataset with 13 primary characteristics to carry out the research. Support vector machine and logistic regression algorithms are used to process the datasets, and the latter displays the highest accuracy in predicting coronary disease. Python programming is used to process the datasets. Multiple research initiatives have used machine learning to speed up the healthcare sector. We also used conventional machine learning approaches in our investigation to uncover the links between the numerous features available in the dataset and then used them effectively in anticipation of heart infection risks. Evaluation using accuracy and the confusion matrix showed some favorable outcomes. The dataset contains certain unnecessary features; to get the best results, these are dealt with using isolation logistic regression and Support Vector Machine (SVM) classification.
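The abstract above evaluates classifiers with accuracy and a confusion matrix. A minimal sketch of how those two quantities are computed for a binary classifier follows. The labels are made up for illustration and are not taken from the UCI dataset or from the paper's results.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def accuracy(cm):
    """Fraction of all predictions that were correct."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


# Hypothetical ground truth and predictions for eight patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
```

In a diagnostic setting the false-negative count (`fn`) usually deserves separate attention, since accuracy alone hides how many sick patients were missed.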
Big Data Mining and Analytics, vol. 6, no. 4, pp. 513-525. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233243.pdf
Citations: 0
Call for Papers: Special Issue on Edge AI Empowered Giant Model Training
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29
Big Data Mining and Analytics, vol. 6, no. 4, p. 526. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233251.pdf
Citations: 0
VDCM: A Data Collection Mechanism for Crowd Sensing in Vehicular Ad Hoc Networks
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020041
Juli Yin;Linfeng Wei;Zhiquan Liu;Xi Yang;Hongliang Sun;Yudan Cheng;Jianbin Mai
With the rapid development of mobile devices, aggregation security and efficiency are more important than in the past in crowd sensing. When collecting large-scale vehicle-provided data, the data transmitted via autonomous networks are publicly accessible to all attackers, which increases the risk of vehicle exposure. We therefore need to ensure data aggregation security. In addition, low aggregation efficiency will lead to insufficient sensing data, making the data unable to support data mining services. Aiming at the problem of aggregation security and efficiency in large-scale data collection, this article proposes a data collection mechanism (VDCM) for crowd sensing in vehicular ad hoc networks (VANETs). The mechanism includes two mechanism assumptions and selects appropriate methods to reduce consumption. It selects sub-mechanism 1 when there are very few vehicles or the coalition cannot be formed, and otherwise selects sub-mechanism 2. Single aggregation is used to collect data in sub-mechanism 1. In sub-mechanism 2, cooperative vehicles are selected by using a coalition formation strategy and an auction cooperation agreement, and multi-aggregation is used to collect data. Both sub-mechanisms use Paillier homomorphic encryption to ensure the security of data aggregation. In addition, the mechanism adds data update and scoring steps to increase the amount of available data. The performance analysis shows that the mechanism proposed in this paper can safely aggregate data and reduce consumption. The simulation results indicate that the proposed mechanism reduces time consumption and increases the amount of available data compared with existing mechanisms.
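The abstract above relies on Paillier homomorphic encryption for secure aggregation. The sketch below shows the additive property that makes this possible: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so an aggregator can add readings it cannot read. It is a toy with small hard-coded primes and hypothetical function names, not the VDCM protocol and not safe for real use.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation; real deployments use primes of 1024 bits or more."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # inverse of lam mod n; valid for generator g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) as c = g^m * r^n mod n^2 with g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = priv
    l_value = (pow(c, lam, n * n) - 1) // n
    return (l_value * mu) % n


def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)


n, priv = keygen()
# Two hypothetical vehicle readings, aggregated without being decrypted.
aggregate = add_encrypted(n, encrypt(n, 12), encrypt(n, 30))
total = decrypt(n, priv, aggregate)
```

Only the holder of the private key `(lam, mu)` can read `total`; the party that multiplies the ciphertexts sees neither the individual readings nor their sum.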
Big Data Mining and Analytics, vol. 6, no. 4, pp. 391-403. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233240.pdf
Citations: 0
AI-Based Hybrid Models for Predicting Loan Risk in the Banking Sector
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020037
Vikas Kumar;Shaiku Shahida Saheb;Preeti;Atif Ghayas;Sunil Kumari;Jai Kishan Chandel;Saroj Kumar Pandey;Santosh Kumar
Every real-world scenario is now digitally replicated in order to reduce paperwork and human labor costs. Machine Learning (ML) models are also being used to make predictions in these applications. Accurate forecasting requires knowledge of these machine learning models and their distinguishing features. The datasets used as input to each of these different types of ML models yield different results. The choice of an ML model for a dataset is critical. A loan risk model is used to show how ML models for a dataset can be linked together. The purpose of this study is to look into how we could use machine learning to quantify or forecast mortgage credit risk. This phrase refers to the process of evaluating massive amounts of data in order to derive useful information for making decisions in a variety of fields. If credit risk is considered, a method based on an examination of what caused and how mortgage credit risk affected credit defaults during the still-current economic crisis of 2021 will be tried. Various approaches to credit risk calculation will be examined, ranging from the most basic to the most complex. In addition, we will conduct a case study on a sample of mortgage loans and compare the results of three different analytical approaches (logistic regression, decision tree, and gradient boosting) to see which one produces the most commercially useful insights.
Big Data Mining and Analytics, vol. 6, no. 4, pp. 478-490. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233246.pdf
Citations: 0
Personalized Federated Learning for Heterogeneous Residential Load Forecasting
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020043
Xiaodong Qu;Chengcheng Guan;Gang Xie;Zhiyi Tian;Keshav Sood;Chaoli Sun;Lei Cui
Accurate load forecasting is critical for electricity production, transmission, and maintenance. Deep learning (DL) models have replaced classical models as the most popular prediction models. However, a deep prediction model requires users to provide a large amount of private electricity consumption data, which carries potential privacy risks. With federated learning (FL), edge nodes can jointly train a global model through aggregation. As a novel distributed machine learning (ML) technique, FL exchanges only model parameters and does not share raw data. However, existing forecasting methods based on FL still face challenges from data heterogeneity and privacy disclosure. Accordingly, we propose a user-level load forecasting system based on personalized federated learning (PFL) to address these issues. The obtained personalized model outperforms the global model on local data. Further, we introduce a novel differential privacy (DP) algorithm in the proposed system to provide an additional privacy guarantee. Based on the principle of the generative adversarial network (GAN), the algorithm achieves a balance between privacy and prediction accuracy throughout the game. We perform simulation experiments on a real-world dataset, and the experimental results show that the proposed system can meet the requirements for accuracy and privacy in real load forecasting scenarios.
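The abstract above says that edge nodes exchange only model parameters, which a server aggregates into a global model. It does not name the aggregation rule, so the sketch below assumes a FedAvg-style average weighted by each client's sample count. The personalization and differential privacy steps of the paper are not shown, and all names and numbers are hypothetical.

```python
def fed_avg(client_params, client_sizes):
    """Average parameter vectors, weighted by each client's sample count."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes))
        / total
        for i in range(dim)
    ]


# Two hypothetical households: locally trained parameters and sample counts.
# Raw consumption data never leaves the clients; only these vectors do.
global_params = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Weighting by sample count lets clients with more data pull the global model further, which is one reason a single global model can fit heterogeneous households poorly and personalization becomes attractive.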
Big Data Mining and Analytics, vol. 6, no. 4, pp. 421-432. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233242.pdf
Citations: 0
K-Means Clustering with Local Distance Privacy
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020050
Mengmeng Yang;Longxia Huang;Chenghua Tang
With the development of information technology, massive amounts of data are generated every day. Collecting and analysing these data helps service providers improve their services and gain an advantage in fierce market competition. K-means clustering has been widely used for cluster analysis in real life. However, these analyses are based on users' data, which puts users' privacy at risk. Local differential privacy has attracted a lot of attention recently due to its strong privacy guarantee and has been applied to clustering analysis. However, existing K-means clustering methods with local differential privacy protection cannot obtain an ideal clustering result, because a large amount of noise is introduced into the whole dataset to ensure the privacy guarantee. To solve this problem, we propose a novel method that provides local distance privacy for users who participate in the clustering analysis. Instead of making the users' records indistinguishable from each other in high-dimensional space, we map each user's record into a one-dimensional distance space and make the records indistinguishable from each other in that distance space. To be specific, we generate a noisy distance first and then synthesize the high-dimensional data record. We propose a Bounded Laplace Method (BLM) and a Cluster Indistinguishable Method (CIM) to sample such a noisy distance, which satisfy the local differential privacy guarantee and the local $d_E$-privacy guarantee, respectively. Furthermore, we introduce a way to generate synthetic data records in high-dimensional space. Our experimental evaluation results show that our methods significantly outperform the traditional methods.
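The abstract above describes sampling a noisy one-dimensional distance with a bounded Laplace method, but gives no formula. The sketch below is one simple way to keep Laplace-perturbed distances inside a valid range, using rejection sampling. The scale choice, the bounds, and the function names are assumptions for illustration; this is not the paper's BLM, and plain rejection like this does not by itself guarantee a given privacy level without recalibrating the noise scale.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def bounded_noisy_distance(distance, epsilon, lower, upper, rng):
    """Resample Laplace noise until the perturbed distance lies in [lower, upper]."""
    # Assumed sensitivity: the full width of the allowed distance range.
    scale = (upper - lower) / epsilon
    while True:
        noisy = distance + laplace_noise(scale, rng)
        if lower <= noisy <= upper:
            return noisy


rng = random.Random(0)  # seeded only so the sketch is repeatable
# Hypothetical record at distance 3.2 from a reference point, range [0, 10].
noisy = bounded_noisy_distance(3.2, epsilon=1.0, lower=0.0, upper=10.0, rng=rng)
```

Working in the distance space means noise is added to a single scalar per record rather than to every coordinate, which is the intuition behind the lower noise the abstract claims.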
Big Data Mining and Analytics, vol. 6, no. 4, pp. 433-442. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233248.pdf
Citations: 0
Elastic Optimization for Stragglers in Edge Federated Learning
IF 13.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.26599/BDMA.2022.9020046
Khadija Sultana;Khandakar Ahmed;Bruce Gu;Hua Wang
To fully exploit the enormous data generated by intelligent devices in edge computing, edge federated learning (EFL) is envisioned as a promising solution. Compared with traditional centralized model training, the distributed collaborative training in EFL addresses delay and privacy issues. However, the existence of straggling devices, which respond slowly to servers, degrades model performance. We consider data heterogeneity from two aspects: high-dimensional data generated at edge devices, where the number of features is greater than the number of observations, and the heterogeneity caused by partial device participation. With a large number of features, computation overhead on the devices increases, causing edge devices to become stragglers. Incorporating partial training results causes gradients to diverge, and the divergence is exacerbated when more training is performed to reach local optima. In this paper, we introduce elastic optimization methods for stragglers due to data heterogeneity in edge federated learning. Specifically, we define the problem of stragglers in EFL. Then, we formulate an optimization problem to be solved at edge devices. We customize a benchmark algorithm, FedAvg, to obtain a new elastic optimization algorithm (FedEN), which is applied in the local training of edge devices. FedEN mitigates stragglers by balancing lasso and ridge penalization, thereby generating sparse model updates and keeping parameters as close as possible to local optima. We have evaluated the proposed model on the MNIST and CIFAR-10 datasets. Simulated experiments demonstrate that our approach improves run time training performance by achieving average accuracy with fewer communication rounds. The results confirm the improved performance of our approach over benchmark algorithms.
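The abstract above says FedEN balances lasso and ridge penalization to produce sparse local updates, but does not state its objective. The sketch below shows the generic elastic net penalty, `l1 * |w| + (l2 / 2) * w**2`, applied in one proximal gradient step. It is an assumption about the flavour of the update, not the FedEN algorithm, and the names and numbers are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink toward zero, clip small values."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_step(weights, grads, lr, l1, l2):
    """One local update: gradient step with ridge term, then lasso shrinkage."""
    updated = []
    for w, g in zip(weights, grads):
        stepped = w - lr * (g + l2 * w)  # loss gradient plus ridge gradient l2 * w
        updated.append(soft_threshold(stepped, lr * l1))  # lasso proximal step
    return updated


# Hypothetical local weights and loss gradients on one edge device.
new_weights = elastic_net_step(
    weights=[0.5, -0.02, 0.0], grads=[0.1, 0.0, -0.3], lr=0.1, l1=0.5, l2=0.1
)
```

Here the two small coordinates are driven exactly to zero while the large one is only shrunk, which illustrates how an L1 term can yield sparse updates that are cheaper for a slow device to compute and send.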
为了在边缘计算中充分利用智能设备生成的大量数据,边缘联合学习(EFL)被认为是一种很有前途的解决方案。与传统的集中式模式训练相比,EFL中的分布式协作训练处理了延迟和隐私问题。然而,零散设备的存在,对服务器的响应缓慢,降低了模型的性能。我们从两个方面考虑数据异质性:在特征数量大于观测数量的边缘设备上生成的高维数据,以及部分设备参与引起的异质性。随着大量特征的出现,设备上的计算开销增加,导致边缘设备变得掉队。部分训练结果的结合会导致梯度发散,当进行更多训练以达到局部最优时,这会进一步夸大。在本文中,我们介绍了边缘联合学习中由于数据异构而导致掉队者的弹性优化方法。具体来说,我们定义了英语中的掉队者问题。然后,我们提出了一个要在边缘设备上解决的优化问题。我们定制了一个基准算法FedAvg,以获得一种新的弹性优化算法(FedEN),该算法应用于边缘设备的局部训练。FedEN通过在套索和山脊惩罚之间保持平衡来缓解掉队者,从而生成稀疏模型更新并强制执行接近局部最优的参数。我们已经在MNIST和CIFAR-10数据集上评估了所提出的模型。模拟实验表明,我们的方法通过减少通信轮次来实现平均精度,从而提高了运行时训练性能。结果证实了我们的方法相对于基准算法的改进性能。
Big Data Mining and Analytics, vol. 6, no. 4, pp. 404-420. Journal Article. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233241.pdf
Citations: 3
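The abstract above describes FedEN as FedAvg with a balance of lasso and ridge penalties in local training. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a linear least-squares model and uses proximal gradient descent. The names `soft_threshold`, `local_update`, and `fedavg`, and the parameters `lr`, `lam`, and `alpha`, are hypothetical and do not come from the paper.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1; it sets small entries exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def local_update(w_global, X, y, lr=0.05, lam=0.1, alpha=0.5, epochs=20):
    """Hypothetical client step with an elastic-net penalty (an assumption).

    Approximately minimizes
        (1/2n)||Xw - y||^2 + lam*alpha*||w||_1 + lam*(1-alpha)/2*||w||^2
    by proximal gradient descent, starting from the global weights.
    """
    w = np.array(w_global, dtype=float)
    n = X.shape[0]
    for _ in range(epochs):
        # Gradient of the smooth part: data fit plus the ridge term.
        grad = X.T @ (X @ w - y) / n + lam * (1.0 - alpha) * w
        # The lasso term is handled by soft-thresholding, which gives sparsity.
        w = soft_threshold(w - lr * grad, lr * lam * alpha)
    return w


def fedavg(updates, sizes):
    """FedAvg-style aggregation: client weights averaged by sample count."""
    weights = np.asarray(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=weights)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = np.zeros(50)
    w_true[:3] = 1.0
    # Two simulated clients with more features (50) than observations.
    clients = []
    for n in (20, 30):
        X = rng.normal(size=(n, 50))
        clients.append((X, X @ w_true))
    w = np.zeros(50)
    for _ in range(5):  # communication rounds
        updates = [local_update(w, X, y) for X, y in clients]
        w = fedavg(updates, [X.shape[0] for X, _ in clients])
    print("non-zero weights after 5 rounds:", int(np.count_nonzero(w)))
```

In this sketch `alpha` controls the balance: `alpha=1` is pure lasso, `alpha=0` is pure ridge. The actual FedEN objective may differ from this formulation.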
A PLS-SEM Based Approach: Analyzing Generation Z Purchase Intention Through Facebook's Big Data
IF 13.6 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-08-29 DOI: 10.26599/BDMA.2022.9020033
Vikas Kumar;Preeti;Shaiku Shahida Saheb;Sunil Kumari;Kanishka Pathak;Jai Kishan Chandel;Neeraj Varshney;Ankit Kumar
This paper aims to give a clearer account of Generation Z's intentions to purchase retail products through Facebook. The study centers on how a favorable attitude among Generation Z translates into intentions to purchase retail products through Facebook. It also explores the role of three antecedents of attitude: enjoyment, credibility, and peer communication. The main purpose was to analyze the pervasiveness of F-commerce (retail purchases through Facebook) among Generation Z in India and how it could be realized effectively. A conceptual framework was proposed after a review of the relevant literature. The study focused exclusively on the Generation Z population. The data were analyzed statistically using partial least squares structural equation modelling. The study found that the proposed conceptual model had high predictive power for Generation Z's intentions to purchase retail products through Facebook, supporting the viability of F-commerce. Enjoyment, credibility, and peer communication proved to be good predictors of attitude (R²=0.589), and attitude in turn was found to be a strong antecedent of purchase intentions (R²=0.540).
Big Data Mining and Analytics, vol. 6, no. 4, pp. 491-503. Journal Article. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233245.pdf
Citations: 0
Towards Privacy-Aware and Trustworthy Data Sharing Using Blockchain for Edge Intelligence
IF 13.6 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-08-29 DOI: 10.26599/BDMA.2023.9020012
Youyang Qu;Lichuan Ma;Wenjie Ye;Xuemeng Zhai;Shui Yu;Yunfeng Li;David Smith
The spread of intelligent healthcare devices and big data analytics has significantly boosted the development of Smart Healthcare Networks (SHNs). To improve diagnostic precision, different participants in SHNs share health data that contain sensitive information. The data exchange process therefore raises privacy concerns, especially when integrating health data from multiple sources (a linkage attack) results in further leakage. The linkage attack is a dominant type of attack in the privacy domain and can leverage various data sources to mine private data. Furthermore, adversaries launch poisoning attacks to falsify health data, which can lead to misdiagnosis or even physical harm. To protect private health data, we propose a personalized differential privacy model based on the trust levels among users. Trust is evaluated by a defined community density, and the corresponding privacy protection level is mapped to controllable randomized noise constrained by differential privacy. To avoid linkage attacks in personalized differential privacy, we design a noise correlation decoupling mechanism using a Markov stochastic process. In addition, we build the community model on a blockchain, which can mitigate the risk of poisoning attacks during differentially private data transmission over SHNs. Extensive experiments and analysis on real-world datasets validate the proposed model, which achieves better performance than existing research in terms of privacy protection and effectiveness.
Big Data Mining and Analytics, vol. 6, no. 4, pp. 443-464. Journal Article. PDF: https://ieeexplore.ieee.org/iel7/8254253/10233239/10233247.pdf
Citations: 1
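The abstract above maps a trust level to a privacy protection level and then to randomized noise bounded by differential privacy. The sketch below illustrates that mapping with the standard Laplace mechanism. It is an illustration under stated assumptions, not the paper's model: the linear trust-to-budget map, the function names `epsilon_for_trust` and `laplace_release`, and the bounds `eps_min` and `eps_max` are all hypothetical, and the Markov-based noise decoupling and the blockchain layer are not modeled.

```python
import numpy as np


def epsilon_for_trust(trust, eps_min=0.1, eps_max=2.0):
    """Hypothetical linear map from a trust score in [0, 1] to a privacy budget.

    Assumption: more trusted recipients get a larger epsilon, so less noise.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    return eps_min + trust * (eps_max - eps_min)


def laplace_release(value, sensitivity, epsilon, rng):
    """Standard Laplace mechanism: add noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    heart_rate = 72.0  # illustrative reading, not real data
    for trust in (0.1, 0.5, 0.9):
        eps = epsilon_for_trust(trust)
        noisy = laplace_release(heart_rate, sensitivity=1.0, epsilon=eps, rng=rng)
        print(f"trust={trust:.1f} epsilon={eps:.2f} released={noisy:.2f}")
```

A smaller epsilon gives a larger noise scale, so low-trust recipients see a more heavily perturbed value. The choice of sensitivity depends on the query and is assumed to be 1.0 here.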