Pub Date: 2024-03-20 | DOI: 10.1109/TBDATA.2024.3403375
Jeongsu Park;Dong Hoon Lee
As a Big Data analysis technique, hierarchical clustering is helpful for summarizing data, since it returns both the clusters of the data and their clustering history. Cloud computing is the most suitable option for efficiently performing hierarchical clustering over large volumes of data. However, since a compromised cloud service provider can cause serious privacy problems by revealing data, these problems must be solved before an external cloud computing service is used. No privacy-preserving hierarchical clustering protocol for an outsourced computing environment has been proposed in existing work, and existing protocols either limit the number of participating data owners or disclose information about the data. In this article, we propose a parallelly running and privacy-preserving agglomerative hierarchical clustering (ppAHC) over the union of the datasets of multiple data owners in an outsourced computing environment, which, to the best of our knowledge, is the first such protocol. The proposed ppAHC discloses no information about its input and output, including the data access patterns. It is highly efficient and suitable for Big Data analysis over numerous data, since its cost for one round is independent of the amount of data, and it allows data owners without sufficient computing capability to participate in collaborative hierarchical clustering.
Title: Parallelly Running and Privacy-Preserving Agglomerative Hierarchical Clustering in Outsourced Cloud Computing Environments
IEEE Transactions on Big Data, vol. 11, no. 1, pp. 174-189
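The abstract does not describe the protocol's cryptographic machinery, but the underlying (non-private) computation is standard agglomerative hierarchical clustering. Below is a minimal plaintext sketch of that baseline, assuming single linkage and Euclidean distance; the paper's actual linkage choice and secure operations are not reproduced here.

```python
import numpy as np

def agglomerative_cluster(points, target_k=1):
    """Merge the closest pair of clusters until target_k clusters remain.
    Returns the final clusters and the merge history (the 'clustering
    history' the abstract mentions)."""
    clusters = {i: [i] for i in range(len(points))}
    history = []
    while len(clusters) > target_k:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                # single-linkage distance between clusters a and b
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a].extend(clusters.pop(b))   # merge b into a
        history.append((a, b, d))
    return clusters, history

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
clusters, history = agglomerative_cluster(pts, target_k=2)
```

Each round of this loop scans all cluster pairs, which is exactly the per-round cost the protocol claims to make independent of the data volume through outsourcing.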
Graph Neural Networks (GNNs) are powerful tools for graph representation learning, but they face challenges when applied to large-scale graphs due to substantial computational costs and memory requirements. To address these scalability limitations, various methods have been proposed, including sampling-based and decoupling-based methods. However, both have limitations: sampling-based methods inevitably discard some link information during sampling, while decoupling-based methods require alterations to the model's structure, reducing their adaptability to various GNNs. This paper proposes a novel graph pooling method, Graph Partial Pooling (GPPool), for scaling GNNs to large-scale graphs. GPPool is a versatile and straightforward technique that enhances training efficiency while simultaneously reducing memory requirements. GPPool constructs small-scale pooled graphs by pooling a subset of nodes into supernodes. Each pooled graph consists of supernodes and unpooled nodes, preserving valuable local and global information. Training GNNs on these graphs reduces memory demands and enhances their performance. Additionally, this paper provides a theoretical analysis of training GNNs on GPPool-constructed graphs from a graph diffusion perspective, showing that training on a large-scale graph can be transferred to the pooled graphs with minimal approximation error. A series of experiments on datasets of varying scales demonstrates the effectiveness of GPPool.
Title: Training Large-Scale Graph Neural Networks via Graph Partial Pooling
Authors: Qi Zhang;Yanfeng Sun;Shaofan Wang;Junbin Gao;Yongli Hu;Baocai Yin
Pub Date: 2024-03-20 | DOI: 10.1109/TBDATA.2024.3403380
IEEE Transactions on Big Data, vol. 11, no. 1, pp. 221-233
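The core operation described above, pooling some nodes into supernodes while leaving the rest untouched, can be sketched with the standard assignment-matrix coarsening rule. This is a hypothetical minimal illustration: GPPool's actual node-selection strategy and diffusion-based analysis are not reproduced, and `partial_pool` is an assumed helper name.

```python
import numpy as np

def partial_pool(adj, groups):
    """adj: (n, n) adjacency matrix; groups: lists of node indices to merge
    into supernodes. Nodes not listed stay as singleton (unpooled) nodes.
    Returns the pooled adjacency matrix."""
    n = adj.shape[0]
    pooled = {i for g in groups for i in g}
    clusters = list(groups) + [[i] for i in range(n) if i not in pooled]
    # assignment matrix S: S[i, c] = 1 iff node i belongs to cluster c
    S = np.zeros((n, len(clusters)))
    for c, g in enumerate(clusters):
        for i in g:
            S[i, c] = 1.0
    return S.T @ adj @ S   # standard graph-coarsening rule

# path graph 0-1-2-3; pool nodes 0 and 1 into one supernode
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_pooled = partial_pool(A, groups=[[0, 1]])
```

The pooled graph has three nodes (one supernode plus the two unpooled nodes); the supernode's diagonal entry records its internal edge, and off-diagonal entries preserve connections to unpooled nodes, which is how both local and global structure survive pooling.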
Pub Date: 2024-03-19 | DOI: 10.1109/TBDATA.2024.3378090
Tao Li;Yuhua Qian;Feijiang Li;Xinyan Liang;Zhi-Hui Zhan
Selecting informative features that preserve the manifold structure of the original feature space is a challenging task, and many unsupervised feature selection methods still suffer from poor clustering performance on the selected feature subset. To tackle this problem, a feature subspace learning-based binary differential evolution algorithm is proposed for unsupervised feature selection. First, a new unsupervised feature selection framework based on evolutionary computation is designed, in which feature subspace learning and the population search mechanism are combined into a unified unsupervised feature selection process. Second, a local manifold structure learning strategy and a sample pseudo-label learning strategy are presented to calculate the importance of the selected feature subspace. Third, a binary differential evolution algorithm is developed to optimize the selected feature subspace, in which a binary information migration mutation operator and an adaptive crossover operator are designed to promote the search for the globally optimal feature subspace. Experimental results on various real-world datasets demonstrate that the proposed algorithm obtains a more informative feature subset and competitive clustering performance compared with eight state-of-the-art unsupervised feature selection methods.
Title: Feature Subspace Learning-Based Binary Differential Evolution Algorithm for Unsupervised Feature Selection
IEEE Transactions on Big Data, vol. 11, no. 1, pp. 99-114
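To make the search procedure concrete, here is a hedged sketch of textbook binary differential evolution applied to a feature mask. The paper's feature-subspace-learning fitness (manifold structure plus pseudo-labels) is replaced by a toy variance-based score, and its information migration mutation and adaptive crossover operators are simplified to their basic binary forms; all function names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X):
    # Toy surrogate score: total variance of the selected features.
    # The paper instead scores the subspace via manifold/pseudo-label learning.
    return X[:, mask.astype(bool)].var(axis=0).sum() if mask.any() else 0.0

def binary_de(X, pop_size=10, generations=30, f_prob=0.8, cr=0.5):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))  # population of bit masks
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, size=3, replace=False)]
            diff = b ^ c                                    # binary difference vector
            mutant = np.where(rng.random(n_feat) < f_prob, a ^ diff, a)
            trial = np.where(rng.random(n_feat) < cr, mutant, pop[i])
            if fitness(trial, X) > fitness(pop[i], X):      # greedy selection
                pop[i] = trial
    scores = [fitness(ind, X) for ind in pop]
    return pop[int(np.argmax(scores))]

X = rng.normal(size=(50, 8))
best = binary_de(X)   # best-found feature mask under the toy fitness
```

The population search explores masks globally, while the fitness function is the only place the unsupervised subspace-learning signal enters, which is the coupling the proposed framework unifies.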
Pub Date: 2024-03-19 | DOI: 10.1109/TBDATA.2024.3378100
Jing Zhang;Ming Wu;Zeyi Sun;Cangqi Zhou
Crowdsourcing has played an essential role in machine learning because it can obtain large numbers of labels economically and quickly for training increasingly complex learning models. However, crowdsourcing learning still faces several challenges, such as the low quality of crowd labels and the urgent need for learning models that adapt to label noise. Many studies have focused on truth inference algorithms to improve the quality of labels obtained by crowdsourcing. By comparison, end-to-end predictive model learning in crowdsourcing scenarios, especially with cutting-edge deep learning techniques, is still in its infancy. In this paper, we propose a novel graph convolutional network-based framework, CGNNAT, which models the correlation of instances by combining a GCN with an attention mechanism to learn more representative node embeddings and thus better capture the bias tendencies of crowd workers. Furthermore, a dedicated projection processing layer in CGNNAT models the reliability of each crowd worker, making the model an end-to-end neural network trained directly on noisy crowd labels. Experimental results on several real-world and synthetic datasets show that CGNNAT outperforms state-of-the-art and classical methods in label prediction.
Title: Learning From Crowds Using Graph Neural Networks With Attention Mechanism
IEEE Transactions on Big Data, vol. 11, no. 1, pp. 86-98
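CGNNAT itself requires a full GNN stack, but the idea its per-worker projection layer generalizes, estimating each worker's reliability and weighting their labels accordingly, has a classic lightweight form. The sketch below is that reliability-weighted consensus baseline, not the paper's model; the function name and the iterative scheme are illustrative assumptions.

```python
import numpy as np

def weighted_consensus(labels, n_iter=10):
    """labels: (n_items, n_workers) binary crowd labels.
    Iteratively re-estimates worker reliability from agreement with the
    current consensus, then re-votes with log-odds weights.
    Returns (consensus, worker_reliability)."""
    n_items, n_workers = labels.shape
    reliability = np.full(n_workers, 0.8)            # optimistic prior
    for _ in range(n_iter):
        weights = np.log(reliability / (1 - reliability))   # log-odds weights
        score = labels @ weights - (1 - labels) @ weights   # weighted vote
        consensus = (score > 0).astype(int)
        agree = (labels == consensus[:, None]).mean(axis=0)
        reliability = np.clip(agree, 0.05, 0.95)     # avoid degenerate weights
    return consensus, reliability

# synthetic crowd: 3 workers at ~90% accuracy plus 2 random spammers
rng = np.random.default_rng(1)
truth = rng.integers(0, 2, size=100)
good = truth[:, None] ^ (rng.random((100, 3)) < 0.1).astype(int)
bad = rng.integers(0, 2, size=(100, 2))
labels = np.concatenate([good, bad], axis=1)
consensus, rel = weighted_consensus(labels)
```

An end-to-end model like CGNNAT learns this reliability jointly with the classifier instead of in a separate inference loop, which is what lets it train directly on the noisy labels.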
Pub Date: 2024-03-12 | DOI: 10.1109/TBDATA.2024.3375150
Yunpeng Xiao;Xufeng Li;Tun Li;Rong Wang;Yucai Pang;Guoyin Wang
Vertical federated learning can aggregate the data features of multiple participants. To address the issue of insufficient overlapping data in vertical federated learning, this study presents a generative adversarial network model that enables distributed data augmentation. First, since a generative adversarial network can generate simulated samples, this study proposes FeCGAN, a distributed generative adversarial network for multiple participants with insufficient overlapping data. This network is suitable for multiple data sources and can augment participants' local data. Second, to address the learning divergence caused by the differing local distributions of multiple data sources, this study proposes the aggregation algorithm FedKL, which aggregates the feedback of the local discriminators to interact with the generator and learns the local data distributions more accurately. Finally, given the data waste caused by the unavailability of nonoverlapping data, this study proposes a data augmentation method called VFeDA, which uses FeCGAN to generate pseudo features and expand the overlapping data, thereby improving data utilization. Experiments showed that the proposed model is suitable for multiple data sources and can generate high-quality data.
Title: A Distributed Generative Adversarial Network for Data Augmentation Under Vertical Federated Learning
IEEE Transactions on Big Data, vol. 11, no. 1, pp. 74-85
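The abstract names FedKL but does not give its aggregation rule. As a loosely hedged illustration of the general pattern, aggregating per-party discriminator feedback into one training signal for the shared generator, the sketch below uses data-proportional weights, a common federated heuristic that is not necessarily the paper's formula; `aggregate_feedback` is a hypothetical helper.

```python
import numpy as np

def aggregate_feedback(local_scores, local_sizes):
    """local_scores: list of per-party discriminator outputs on the same
    batch of generated samples; local_sizes: per-party sample counts.
    Returns a single aggregated score vector for the generator's update."""
    sizes = np.asarray(local_sizes, dtype=float)
    weights = sizes / sizes.sum()               # data-proportional weights
    return sum(w * s for w, s in zip(weights, np.asarray(local_scores)))

# two parties score the same two generated samples; party 1 holds 3x the data
scores = [np.array([0.9, 0.2]), np.array([0.5, 0.6])]
agg = aggregate_feedback(scores, local_sizes=[300, 100])
```

FedKL presumably replaces these static weights with divergence-aware ones so that parties whose local distributions drift further from the others do not dominate the generator's update.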
Pub Date: 2024-03-12 | DOI: 10.1109/TBDATA.2024.3375152
Yeyu Yan;Zhongying Zhao;Zhan Yang;Yanwei Yu;Chao Li
Due to the widespread applications of heterogeneous graphs in the real world, heterogeneous graph neural networks (HGNNs) have developed rapidly and achieved great success in recent years. To capture the complex interactions in heterogeneous graphs effectively, various attention mechanisms are widely used in designing HGNNs. However, employing these attention mechanisms brings two key problems: high computational complexity and poor robustness. To address these problems, we propose a Fast