IEEE Transactions on Big Data最新文献_第5页

Fine-Tuned Personality Federated Learning for Graph Data 针对图形数据的微调个性联合学习

IF 7.2 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-19 DOI: 10.1109/TBDATA.2024.3356388

Meiting Xue;Zian Zhou;Pengfei Jiao;Huijun Tang

Federated Learning (FL) empowers multiple clients to collaboratively learn a global generalization model without the need to share their local data, thus reducing privacy risks and expanding the scope of AI applications. However, current works focus less on data in a highly nonidentically distributed manner such as graph data which are common in reality, and ignore the problem of model personalization between clients for graph data training in federated learning. In this paper, we propose a novel personality graph federated learning framework based on variational graph autoencoders that incorporates model contrastive learning and local fine-tuning to achieve personalized federated training on graph data for each client, which is called FedVGAE. Then we introduce an encoder-sharing strategy to the proposed framework that shares the parameters of the encoder layer to further improve personality performance. The node classification and link prediction experiments demonstrate that our method achieves better performance than other federated learning methods on most graph datasets in the non-iid setting. Finally, we conduct ablation experiments, the result demonstrates the effectiveness of our proposed method.

联盟学习（Federated Learning，FL）使多个客户端能够协作学习一个全局泛化模型，而无需共享各自的本地数据，从而降低了隐私风险，扩大了人工智能的应用范围。然而，目前的研究较少关注高度非同分布式的数据，如现实中常见的图数据，而忽略了联合学习中图数据训练的客户端之间的模型个性化问题。在本文中，我们提出了一种基于变异图自编码器的新型个性图联合学习框架，该框架结合了模型对比学习和局部微调，以实现每个客户端对图数据的个性化联合训练，我们称之为 FedVGAE。然后，我们在拟议框架中引入了编码器共享策略，共享编码器层的参数，以进一步提高个性性能。节点分类和链接预测实验表明，我们的方法在非 iid 环境下的大多数图数据集上都取得了比其他联合学习方法更好的性能。最后，我们进行了消融实验，结果证明了我们提出的方法的有效性。

{"title":"Fine-Tuned Personality Federated Learning for Graph Data","authors":"Meiting Xue;Zian Zhou;Pengfei Jiao;Huijun Tang","doi":"10.1109/TBDATA.2024.3356388","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3356388","url":null,"abstract":"Federated Learning (FL) empowers multiple clients to collaboratively learn a global generalization model without the need to share their local data, thus reducing privacy risks and expanding the scope of AI applications. However, current works focus less on data in a highly nonidentically distributed manner such as graph data which are common in reality, and ignore the problem of model personalization between clients for graph data training in federated learning. In this paper, we propose a novel personality graph federated learning framework based on variational graph autoencoders that incorporates model contrastive learning and local fine-tuning to achieve personalized federated training on graph data for each client, which is called FedVGAE. Then we introduce an encoder-sharing strategy to the proposed framework that shares the parameters of the encoder layer to further improve personality performance. The node classification and link prediction experiments demonstrate that our method achieves better performance than other federated learning methods on most graph datasets in the non-iid setting. Finally, we conduct ablation experiments, the result demonstrates the effectiveness of our proposed method.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"313-319"},"PeriodicalIF":7.2,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

LGRL: Local-Global Representation Learning for On-the-Fly FG-SBIR LGRL：用于即时 FG-SBIR 的局部-全局表征学习

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-19 DOI: 10.1109/TBDATA.2024.3356393

Dawei Dai;Yingge Liu;Yutang Li;Shiyu Fu;Shuyin Xia;Guoyin Wang

On-the-fly Fine-grained sketch-based image retrieval (On-the-fly FG-SBIR) framework aim to break the barriers that sketch drawing requires excellent skills and is time-consuming. Considering such problems, a partial sketch with fewer strokes contains only the little local information, and the drawing process may show great difference among users, resulting in poor performance at the early retrieval. In this study, we developed a local-global representation learning (LGRL) method, in which we learn the representations for both the local and global regions of the partial sketch and its target photos. Specifically, we first designed a triplet network to learn the joint embedding space shared between the local and global regions of the entire sketch and its corresponding region of the photo. Then, we divided each partial sketch in the sketch-drawing episode into several local regions; Another learnable module following the triplet network was designed to learn the representations for the local regions of the partial sketch. Finally, by combining both the local and global regions of the sketches and photos, the final distance was determined. In the experiments, our method outperformed state-of-the-art baseline methods in terms of early retrieval efficiency on two publicly sketch-retrieval datasets and the practice test.

基于即时细粒度草图的图像检索（On-the-fly Fine-grained Sketch-based Image Retrieval，简称On-the-fly FG-SBIR）框架旨在打破草图绘制需要高超技巧和耗费时间的障碍。考虑到这些问题，笔画较少的局部草图仅包含很少的局部信息，而且用户之间的绘制过程可能存在很大差异，导致早期检索性能不佳。在本研究中，我们开发了一种局部-全局表示学习（LGRL）方法，即学习局部草图及其目标照片的局部区域和全局区域的表示。具体来说，我们首先设计了一个三元组网络来学习整个草图的局部和全局区域与照片的相应区域之间共享的联合嵌入空间。然后，我们将草图绘制过程中的每个局部草图划分为若干局部区域；继三重网络之后，我们又设计了另一个可学习模块，用于学习局部草图局部区域的表征。最后，结合草图和照片的局部区域和全局区域，确定最终距离。在实验中，在两个公开的草图检索数据集和实践测试中，我们的方法在早期检索效率方面优于最先进的基线方法。

{"title":"LGRL: Local-Global Representation Learning for On-the-Fly FG-SBIR","authors":"Dawei Dai;Yingge Liu;Yutang Li;Shiyu Fu;Shuyin Xia;Guoyin Wang","doi":"10.1109/TBDATA.2024.3356393","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3356393","url":null,"abstract":"On-the-fly Fine-grained sketch-based image retrieval (On-the-fly FG-SBIR) framework aim to break the barriers that sketch drawing requires excellent skills and is time-consuming. Considering such problems, a partial sketch with fewer strokes contains only the little local information, and the drawing process may show great difference among users, resulting in poor performance at the early retrieval. In this study, we developed a local-global representation learning (LGRL) method, in which we learn the representations for both the local and global regions of the partial sketch and its target photos. Specifically, we first designed a triplet network to learn the joint embedding space shared between the local and global regions of the entire sketch and its corresponding region of the photo. Then, we divided each partial sketch in the sketch-drawing episode into several local regions; Another learnable module following the triplet network was designed to learn the representations for the local regions of the partial sketch. Finally, by combining both the local and global regions of the sketches and photos, the final distance was determined. In the experiments, our method outperformed state-of-the-art baseline methods in terms of early retrieval efficiency on two publicly sketch-retrieval datasets and the practice test.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"543-555"},"PeriodicalIF":7.5,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

GAT-COBO: Cost-Sensitive Graph Neural Network for Telecom Fraud Detection GAT-COBO：用于电信欺诈检测的成本敏感图神经网络

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-11 DOI: 10.1109/TBDATA.2024.3352978

Xinxin Hu;Haotian Chen;Junjie Zhang;Hongchang Chen;Shuxin Liu;Xing Li;Yahui Wang;Xiangyang Xue

Along with the rapid evolution of mobile communication technologies, such as 5G, there has been a significant increase in telecom fraud, which severely dissipates individual fortune and social wealth. In recent years, graph mining techniques are gradually becoming a mainstream solution for detecting telecom fraud. However, the graph imbalance problem, caused by the Pareto principle, brings severe challenges to graph data mining. This emerging and complex issue has received limited attention in prior research. In this paper, we propose a Graph ATtention network with COst-sensitive BOosting (GAT-COBO) for the graph imbalance problem. First, we design a GAT-based base classifier to learn the embeddings of all nodes in the graph. Then, we feed the embeddings into a well-designed cost-sensitive learner for imbalanced learning. Next, we update the weights according to the misclassification cost to make the model focus more on the minority class. Finally, we sum the node embeddings obtained by multiple cost-sensitive learners to obtain a comprehensive node representation, which is used for the downstream anomaly detection task. Extensive experiments on two real-world telecom fraud detection datasets demonstrate that our proposed method is effective for the graph imbalance problem, outperforming the state-of-the-art GNNs and GNN-based fraud detectors. In addition, our model is also helpful for solving the widespread over-smoothing problem in GNNs.

伴随着 5G 等移动通信技术的快速发展，电信诈骗案件大幅增加，严重侵蚀了个人财富和社会财富。近年来，图挖掘技术逐渐成为检测电信诈骗的主流解决方案。然而，帕累托原理导致的图不平衡问题给图数据挖掘带来了严峻的挑战。这一新兴而复杂的问题在之前的研究中受到的关注有限。在本文中，我们针对图不平衡问题提出了一种具有 COst-sensitive BOosting（GAT-COBO）功能的图 ATtention 网络。首先，我们设计了一个基于 GAT 的基础分类器来学习图中所有节点的嵌入。然后，我们将嵌入信息输入一个精心设计的成本敏感学习器，以进行不平衡学习。接下来，我们根据误分类成本更新权重，使模型更加关注少数类别。最后，我们将多个成本敏感学习器获得的节点嵌入相加，得到一个综合节点表示，用于下游的异常检测任务。在两个真实世界的电信欺诈检测数据集上进行的大量实验表明，我们提出的方法对图不平衡问题非常有效，性能优于最先进的 GNN 和基于 GNN 的欺诈检测器。此外，我们的模型还有助于解决 GNN 中普遍存在的过度平滑问题。

{"title":"GAT-COBO: Cost-Sensitive Graph Neural Network for Telecom Fraud Detection","authors":"Xinxin Hu;Haotian Chen;Junjie Zhang;Hongchang Chen;Shuxin Liu;Xing Li;Yahui Wang;Xiangyang Xue","doi":"10.1109/TBDATA.2024.3352978","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3352978","url":null,"abstract":"Along with the rapid evolution of mobile communication technologies, such as 5G, there has been a significant increase in telecom fraud, which severely dissipates individual fortune and social wealth. In recent years, graph mining techniques are gradually becoming a mainstream solution for detecting telecom fraud. However, the graph imbalance problem, caused by the Pareto principle, brings severe challenges to graph data mining. This emerging and complex issue has received limited attention in prior research. In this paper, we propose a \u0000<underline>G</u>\u0000raph \u0000<underline>AT</u>\u0000tention network with \u0000<underline>CO</u>\u0000st-sensitive \u0000<underline>BO</u>\u0000osting (GAT-COBO) for the graph imbalance problem. First, we design a GAT-based base classifier to learn the embeddings of all nodes in the graph. Then, we feed the embeddings into a well-designed cost-sensitive learner for imbalanced learning. Next, we update the weights according to the misclassification cost to make the model focus more on the minority class. Finally, we sum the node embeddings obtained by multiple cost-sensitive learners to obtain a comprehensive node representation, which is used for the downstream anomaly detection task. Extensive experiments on two real-world telecom fraud detection datasets demonstrate that our proposed method is effective for the graph imbalance problem, outperforming the state-of-the-art GNNs and GNN-based fraud detectors. In addition, our model is also helpful for solving the widespread over-smoothing problem in GNNs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"528-542"},"PeriodicalIF":7.5,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Core Maintenance on Dynamic Graphs: A Distributed Approach Built on H-Index 动态图上的核心维护：基于 H-Index 的分布式方法

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-11 DOI: 10.1109/TBDATA.2024.3352973

Qiang-Sheng Hua;Hongen Wang;Hai Jin;Xuanhua Shi

Core number is an essential tool for analyzing graph structure. Graphs in the real world are typically large and dynamic, requiring the development of distributed algorithms to refrain from expensive I/O operations and the maintenance algorithms to address dynamism. Core maintenance updates the core number of each vertex upon the insertion/deletion of vertices/edges. Although the state-of-the-art distributed maintenance algorithm (Weng et al.~2022) can handle multiple edge insertions/deletions simultaneously, it still has two aspects to improve. (I) Parallel processing is not allowed when inserting/removing edges with the same core number, reducing the degree of parallelism and raising the number of rounds. (II) During the implementation phase, only one thread is assigned to the vertices with the same core number, leading to the inability to fully utilize the distributed computing power. Furthermore, the h-index (Lü, et al. 2016) based distributed core decomposition algorithm (Montresor et al. 2013) can fully utilize the distributed computing power where all vertices can be processed in parallel. However, it requires all vertices to recompute their core numbers upon graph changes. In this article, we propose a distributed core maintenance algorithm based on h-index, which circumvents the issues of algorithm (Weng et al.~2022). In addition, our algorithm avoids core numbers recalculation where the numbers do not change. In comparison to the state-of-the-art distributed maintenance algorithm (Weng et al.~2022), the time speedup ratio is at least 100 in the scenarios of both insertion and deletion. Compared to the distributed core decomposition algorithm (Montresor et al. 2013), the average time speedup ratios are 2 and 8 for the cases of insertion and deletion, respectively.

核心数是分析图结构的重要工具。现实世界中的图形通常是庞大而动态的，这就要求开发分布式算法以避免昂贵的 I/O 操作，并开发维护算法以解决动态性问题。核心维护在插入/删除顶点/边时更新每个顶点的核心数。虽然最先进的分布式维护算法（Weng 等~2022）可以同时处理多条边的插入/删除，但仍有两方面需要改进。(I) 在插入/删除具有相同核心数的边时，不允许并行处理，从而降低了并行程度，增加了回合数。(二）在执行阶段，只为具有相同核心数的顶点分配一个线程，导致无法充分利用分布式计算能力。此外，基于 h 指数（Lü 等人，2016 年）的分布式核心分解算法（Montresor 等人，2013 年）可以充分利用分布式计算能力，并行处理所有顶点。然而，它要求所有顶点在图发生变化时重新计算其核心数。在本文中，我们提出了一种基于 h-index 的分布式核心维护算法，它规避了算法（Weng 等~2022）的问题。此外，我们的算法还避免了在核心数不变的情况下重新计算核心数。与最先进的分布式维护算法（Weng 等 ~2022）相比，在插入和删除两种情况下，时间加速比至少为 100。与分布式内核分解算法（Montresor 等人，2013 年）相比，插入和删除情况下的平均时间加速比分别为 2 和 8。

{"title":"Core Maintenance on Dynamic Graphs: A Distributed Approach Built on H-Index","authors":"Qiang-Sheng Hua;Hongen Wang;Hai Jin;Xuanhua Shi","doi":"10.1109/TBDATA.2024.3352973","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3352973","url":null,"abstract":"Core number is an essential tool for analyzing graph structure. Graphs in the real world are typically large and dynamic, requiring the development of distributed algorithms to refrain from expensive I/O operations and the maintenance algorithms to address dynamism. Core maintenance updates the core number of each vertex upon the insertion/deletion of vertices/edges. Although the state-of-the-art distributed maintenance algorithm (Weng et al.~2022) can handle multiple edge insertions/deletions simultaneously, it still has two aspects to improve. (I) Parallel processing is not allowed when inserting/removing edges with the same core number, reducing the degree of parallelism and raising the number of rounds. (II) During the implementation phase, only one thread is assigned to the vertices with the same core number, leading to the inability to fully utilize the distributed computing power. Furthermore, the h-index (Lü, et al. 2016) based distributed core decomposition algorithm (Montresor et al. 2013) can fully utilize the distributed computing power where all vertices can be processed in parallel. However, it requires all vertices to recompute their core numbers upon graph changes. In this article, we propose a distributed core maintenance algorithm based on h-index, which circumvents the issues of algorithm (Weng et al.~2022). In addition, our algorithm avoids core numbers recalculation where the numbers do not change. In comparison to the state-of-the-art distributed maintenance algorithm (Weng et al.~2022), the time speedup ratio is at least 100 in the scenarios of both insertion and deletion. Compared to the distributed core decomposition algorithm (Montresor et al. 2013), the average time speedup ratios are 2 and 8 for the cases of insertion and deletion, respectively.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 5","pages":"595-608"},"PeriodicalIF":7.5,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10388383","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142130280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Online Heterogeneous Streaming Feature Selection Without Feature Type Information 无特征类型信息的在线异构流特征选择

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-08 DOI: 10.1109/TBDATA.2024.3350630

Peng Zhou;Yunyun Zhang;Zhaolong Ling;Yuanting Yan;Shu Zhao;Xindong Wu

Feature selection aims to select an optimal minimal feature subset from the original datasets and has become an indispensable preprocessing component before data mining and machine learning, especially in the era of Big Data. However, features may be generated dynamically and arrive individually over time in practice, which we call streaming features. Most existing streaming feature selection methods assume that all dynamically generated features are the same type or assume we can know the feature type for each new arriving feature in advance, but this is unreasonable and unrealistic. Therefore, this paper first studies a practical issue of Online Heterogeneous Streaming Feature Selection without the feature type information before learning, named OHSFS. Specifically, we first model the streaming feature selection issue as a minimax problem. Then, in terms of MIC (Maximal Information Coefficient), we derive a new metric

$MIC_{Gain}$

to determine whether a new streaming feature should be selected. To speed up the efficiency of OHSFS, we present the metric

$MIC_{Cor}$

that can directly discard low correlation features. Finally, extensive experimental results indicate the effectiveness of OHSFS. Moreover, OHSFS is nonparametric and does not need to know the feature type before learning, which aligns with practical application needs.

特征选择的目的是从原始数据集中选择一个最佳的最小特征子集，它已成为数据挖掘和机器学习前不可或缺的预处理组件，尤其是在大数据时代。然而，在实践中，特征可能是动态生成的，并随着时间的推移逐个到达，我们称之为流特征。现有的流特征选择方法大多假设所有动态生成的特征都是同一类型，或者假设我们可以提前知道每个新到达特征的特征类型，但这是不合理的，也是不现实的。因此，本文首先研究了在学习前不知道特征类型信息的在线异构流特征选择的实际问题，并将其命名为 OHSFS。具体来说，我们首先将流式特征选择问题建模为最小问题。然后，根据最大信息系数（MIC），我们推导出一个新指标 $MIC_{Gain}$，用于判断是否应该选择新的流特征。为了加快 OHSFS 的效率，我们提出了可以直接舍弃低相关性特征的指标 $MIC_{Cor}$。最后，大量实验结果表明了 OHSFS 的有效性。此外，OHSFS 是非参数的，在学习之前不需要知道特征类型，这符合实际应用的需要。

{"title":"Online Heterogeneous Streaming Feature Selection Without Feature Type Information","authors":"Peng Zhou;Yunyun Zhang;Zhaolong Ling;Yuanting Yan;Shu Zhao;Xindong Wu","doi":"10.1109/TBDATA.2024.3350630","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350630","url":null,"abstract":"Feature selection aims to select an optimal minimal feature subset from the original datasets and has become an indispensable preprocessing component before data mining and machine learning, especially in the era of Big Data. However, features may be generated dynamically and arrive individually over time in practice, which we call streaming features. Most existing streaming feature selection methods assume that all dynamically generated features are the same type or assume we can know the feature type for each new arriving feature in advance, but this is unreasonable and unrealistic. Therefore, this paper first studies a practical issue of Online Heterogeneous Streaming Feature Selection without the feature type information before learning, named OHSFS. Specifically, we first model the streaming feature selection issue as a minimax problem. Then, in terms of MIC (Maximal Information Coefficient), we derive a new metric \u0000<inline-formula><tex-math>$MIC_{Gain}$</tex-math></inline-formula>\u0000 to determine whether a new streaming feature should be selected. To speed up the efficiency of OHSFS, we present the metric \u0000<inline-formula><tex-math>$MIC_{Cor}$</tex-math></inline-formula>\u0000 that can directly discard low correlation features. Finally, extensive experimental results indicate the effectiveness of OHSFS. Moreover, OHSFS is nonparametric and does not need to know the feature type before learning, which aligns with practical application needs.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"470-485"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Enhanced Multi-Scale Features Mutual Mapping Fusion Based on Reverse Knowledge Distillation for Industrial Anomaly Detection and Localization 基于反向知识提炼的增强型多尺度特征相互映射融合技术，用于工业异常检测和定位

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-08 DOI: 10.1109/TBDATA.2024.3350539

Guoxiang Tong;Quanquan Li;Yan Song

Unsupervised anomaly detection methods based on knowledge distillation have exhibited promising results. However, there is still room for improvement in the differential characterization of anomalous samples. In this article, a novel anomaly detection and localization model based on reverse knowledge distillation is proposed, where an enhanced multi-scale feature mutual mapping feature fusion module is proposed to greatly extract discrepant features at different scales. This module helps enhance the difference in anomaly region representation in the teacher-student structure by inhomogeneously fusing features at different levels. Then, the coordinate attention mechanism is introduced in the reverse distillation structure to pay special attention to dominant issues, facilitating nice direction guidance and position encoding. Furthermore, an innovative single-category embedding memory bank, inspired by human memory mechanisms, is developed to normalize single-category embedding to encourage high-quality model reconstruction. Finally, in several categories of the well-known MVTec dataset, our model achieves better results than state-of-the-art models in terms of AUROC and PRO, with an overall average of 98.1%, 98.3%, and 95.0% for detection AUROC scores, localization AUROC scores, and localization PRO scores, respectively, across 15 categories. Extensive experiments are conducted on the ablation study to validate the contribution of each component of the model.

基于知识提炼的无监督异常检测方法取得了可喜的成果。然而，在对异常样本进行差异化特征描述方面仍有改进的空间。本文提出了一种基于反向知识提炼的新型异常检测和定位模型，其中提出了一个增强型多尺度特征相互映射特征融合模块，以极大地提取不同尺度上的差异特征。该模块通过非均质地融合不同层次的特征，有助于增强师生结构中异常区域表征的差异性。然后，在反向蒸馏结构中引入坐标关注机制，特别关注主导问题，促进良好的方向引导和位置编码。此外，还受人类记忆机制的启发，开发了创新的单类嵌入记忆库，将单类嵌入归一化，以促进高质量的模型重构。最后，在著名的 MVTec 数据集的几个类别中，我们的模型在 AUROC 和 PRO 方面取得了比最先进模型更好的结果，在 15 个类别中，检测 AUROC 分数、定位 AUROC 分数和定位 PRO 分数的总体平均值分别为 98.1%、98.3% 和 95.0%。在消融研究中进行了广泛的实验，以验证模型各组成部分的贡献。

{"title":"Enhanced Multi-Scale Features Mutual Mapping Fusion Based on Reverse Knowledge Distillation for Industrial Anomaly Detection and Localization","authors":"Guoxiang Tong;Quanquan Li;Yan Song","doi":"10.1109/TBDATA.2024.3350539","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350539","url":null,"abstract":"Unsupervised anomaly detection methods based on knowledge distillation have exhibited promising results. However, there is still room for improvement in the differential characterization of anomalous samples. In this article, a novel anomaly detection and localization model based on reverse knowledge distillation is proposed, where an enhanced multi-scale feature mutual mapping feature fusion module is proposed to greatly extract discrepant features at different scales. This module helps enhance the difference in anomaly region representation in the teacher-student structure by inhomogeneously fusing features at different levels. Then, the coordinate attention mechanism is introduced in the reverse distillation structure to pay special attention to dominant issues, facilitating nice direction guidance and position encoding. Furthermore, an innovative single-category embedding memory bank, inspired by human memory mechanisms, is developed to normalize single-category embedding to encourage high-quality model reconstruction. Finally, in several categories of the well-known MVTec dataset, our model achieves better results than state-of-the-art models in terms of AUROC and PRO, with an overall average of 98.1%, 98.3%, and 95.0% for detection AUROC scores, localization AUROC scores, and localization PRO scores, respectively, across 15 categories. Extensive experiments are conducted on the ablation study to validate the contribution of each component of the model.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"498-513"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Scalable Unsupervised Hashing via Exploiting Robust Cross-Modal Consistency 通过利用稳健的跨模态一致性实现可扩展的无监督哈希算法

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-08 DOI: 10.1109/TBDATA.2024.3350541

Xingbo Liu;Jiamin Li;Xiushan Nie;Xuening Zhang;Shaohua Wang;Yilong Yin

Unsupervised cross-modal hashing has received increasing attention because of its efficiency and scalability for large-scale data retrieval and analysis. However, existing unsupervised cross-modal hashing methods primarily focus on learning shared feature embedding, ignoring robustness and consistency across different modalities. To this end, this study proposes a novel method called scalable unsupervised hashing (SUH) for large-scale cross-modal retrieval. In the proposed method, latent semantic information and common semantic embedding within heterogeneous data are simultaneously exploited using multimodal clustering and collective matrix factorization, respectively. Furthermore, the robust norm is seamlessly integrated into the two processes, making SUH insensitive to outliers. Based on the robust consistency exploited from the latent semantic information and feature embedding, hash codes can be learned discretely to avoid cumulative quantitation loss. The experimental results on five benchmark datasets demonstrate the effectiveness of the proposed method under various scenarios.

无监督跨模态哈希因其在大规模数据检索和分析中的高效性和可扩展性而受到越来越多的关注。然而，现有的无监督跨模态散列方法主要侧重于学习共享特征嵌入，而忽略了不同模态之间的鲁棒性和一致性。为此，本研究提出了一种用于大规模跨模态检索的新型方法，即可扩展的无监督散列（SUH）。在所提出的方法中，利用多模态聚类和集合矩阵因式分解，可同时利用异构数据中的潜在语义信息和共同语义嵌入。此外，鲁棒规范被无缝集成到这两个过程中，使得 SUH 对异常值不敏感。基于从潜在语义信息和特征嵌入中利用的鲁棒一致性，哈希代码可以离散学习，以避免累积量化损失。在五个基准数据集上的实验结果证明了所提方法在不同场景下的有效性。

引用次数: 0

Few-Shot Learning With Multi-Granularity Knowledge Fusion and Decision-Making 利用多粒度知识融合和决策的 "少量学习"（Few-Shot Learning with Multi-Granularity Knowledge Fusion and Decision-Making

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2024-01-08 DOI: 10.1109/TBDATA.2024.3350542

Yuling Su;Hong Zhao;Yifeng Zheng;Yu Wang

Few-shot learning (FSL) is a challenging task in classifying new classes from few labelled examples. Many existing models embed class structural knowledge as prior knowledge to enhance FSL against data scarcity. However, they fall short of connecting the class structural knowledge with the limited visual information which plays a decisive role in FSL model performance. In this paper, we propose a unified FSL framework with multi-granularity knowledge fusion and decision-making (MGKFD) to overcome the limitation. We aim to simultaneously explore the visual information and structural knowledge, working in a mutual way to enhance FSL. On the one hand, we strongly connect global and local visual information with multi-granularity class knowledge to explore intra-image and inter-class relationships, generating specific multi-granularity class representations with limited images. On the other hand, a weight fusion strategy is introduced to integrate multi-granularity knowledge and visual information to make the classification decision of FSL. It enables models to learn more effectively from limited labelled examples and allows generalization to new classes. Moreover, considering varying erroneous predictions, a hierarchical loss is established by structural knowledge to minimize the classification loss, where greater degree of misclassification is penalized more. Experimental results on three benchmark datasets show the advantages of MGKFD over several advanced models.

少量学习（FSL）是一项具有挑战性的任务，即从少量标记的示例中对新类别进行分类。许多现有模型都将类结构知识作为先验知识嵌入其中，以增强 FSL 的能力，应对数据匮乏问题。然而，这些模型没有将类别结构知识与有限的视觉信息联系起来，而视觉信息对 FSL 模型的性能起着决定性作用。在本文中，我们提出了一个统一的 FSL 框架，该框架具有多粒度知识融合和决策（MGKFD）功能，以克服上述局限性。我们的目标是同时探索视觉信息和结构知识，以相互促进的方式增强 FSL。一方面，我们将全局和局部视觉信息与多粒度类别知识紧密联系起来，探索图像内和类别间的关系，从而利用有限的图像生成特定的多粒度类别表征。另一方面，我们引入了权重融合策略，以整合多粒度知识和视觉信息，从而做出 FSL 的分类决策。这使模型能更有效地从有限的标注示例中学习，并能泛化到新的类别。此外，考虑到不同的错误预测，通过结构知识建立了分层损失，以最小化分类损失，其中错误分类程度越高，受到的惩罚越大。在三个基准数据集上的实验结果表明，MGKFD 比几种高级模型更具优势。

{"title":"Few-Shot Learning With Multi-Granularity Knowledge Fusion and Decision-Making","authors":"Yuling Su;Hong Zhao;Yifeng Zheng;Yu Wang","doi":"10.1109/TBDATA.2024.3350542","DOIUrl":"https://doi.org/10.1109/TBDATA.2024.3350542","url":null,"abstract":"Few-shot learning (FSL) is a challenging task in classifying new classes from few labelled examples. Many existing models embed class structural knowledge as prior knowledge to enhance FSL against data scarcity. However, they fall short of connecting the class structural knowledge with the limited visual information which plays a decisive role in FSL model performance. In this paper, we propose a unified FSL framework with multi-granularity knowledge fusion and decision-making (MGKFD) to overcome the limitation. We aim to simultaneously explore the visual information and structural knowledge, working in a mutual way to enhance FSL. On the one hand, we strongly connect global and local visual information with multi-granularity class knowledge to explore intra-image and inter-class relationships, generating specific multi-granularity class representations with limited images. On the other hand, a weight fusion strategy is introduced to integrate multi-granularity knowledge and visual information to make the classification decision of FSL. It enables models to learn more effectively from limited labelled examples and allows generalization to new classes. Moreover, considering varying erroneous predictions, a hierarchical loss is established by structural knowledge to minimize the classification loss, where greater degree of misclassification is penalized more. Experimental results on three benchmark datasets show the advantages of MGKFD over several advanced models.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"486-497"},"PeriodicalIF":7.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SCOREH+: A High-Order Node Proximity Spectral Clustering on Ratios-of-Eigenvectors Algorithm for Community Detection SCOREH+：用于群落检测的基于特征向量比的高阶节点邻近度谱聚类算法

IF 7.2 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2023-12-25 DOI: 10.1109/TBDATA.2023.3346715

Yanhui Zhu;Fang Hu;Lei Hsin Kuo;Jia Liu

The research on complex networks has achieved significant progress in revealing the mesoscopic features of networks. Community detection is an important aspect of understanding real-world complex systems. We present in this paper a High-order node proximity Spectral Clustering on Ratios-of-Eigenvectors (SCOREH+) algorithm for locating communities in complex networks. The algorithm improves SCORE and SCORE+ and preserves high-order transitivity information of the network affinity matrix. We optimize the high-order proximity matrix from the initial affinity matrix using the Radial Basis Functions (RBFs) and Katz index. In addition to the optimization of the Laplacian matrix, we implement a procedure that joins an additional eigenvector (the

$(k+1){rm th}$

leading eigenvector) to the spectrum domain for clustering if the network is considered to be a “weak signal” graph. The algorithm has been successfully applied to both real-world and synthetic data sets. The proposed algorithm is compared with state-of-art algorithms, such as ASE, Louvain, Fast-Greedy, Spectral Clustering (SC), SCORE, and SCORE+. To demonstrate the high efficacy of the proposed method, we conducted comparison experiments on eleven real-world networks and a number of synthetic networks with noise. The experimental results in most of these networks demonstrate that SCOREH+ outperforms the baseline methods. Moreover, by tuning the RBFs and their shaping parameters, we may generate state-of-the-art community structures on all real-world networks and even on noisy synthetic networks.

复杂网络研究在揭示网络的中观特征方面取得了重大进展。群落检测是理解现实世界复杂系统的一个重要方面。本文提出了一种在复杂网络中定位群落的高阶节点邻近特征向量比谱聚类算法（SCOREH+）。该算法改进了 SCORE 和 SCORE+，并保留了网络亲缘矩阵的高阶传递信息。我们利用径向基函数（RBF）和卡茨指数从初始亲和矩阵中优化高阶亲和矩阵。除了优化拉普拉斯矩阵外，如果网络被认为是一个 "弱信号 "图，我们还实施了一个程序，将一个额外的特征向量（$(k+1){rm th}$ 领先特征向量）加入频谱域进行聚类。该算法已成功应用于现实世界和合成数据集。该算法与 ASE、Louvain、Fast-Greedy、Spectral Clustering (SC)、SCORE 和 SCORE+ 等先进算法进行了比较。为了证明所提方法的高效性，我们在 11 个真实世界网络和一些带噪声的合成网络上进行了对比实验。其中大部分网络的实验结果表明，SCOREH+ 的性能优于基线方法。此外，通过调整 RBF 及其整形参数，我们可以在所有真实世界网络甚至是有噪声的合成网络上生成最先进的社区结构。

{"title":"SCOREH+: A High-Order Node Proximity Spectral Clustering on Ratios-of-Eigenvectors Algorithm for Community Detection","authors":"Yanhui Zhu;Fang Hu;Lei Hsin Kuo;Jia Liu","doi":"10.1109/TBDATA.2023.3346715","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3346715","url":null,"abstract":"The research on complex networks has achieved significant progress in revealing the mesoscopic features of networks. Community detection is an important aspect of understanding real-world complex systems. We present in this paper a High-order node proximity Spectral Clustering on Ratios-of-Eigenvectors (SCOREH+) algorithm for locating communities in complex networks. The algorithm improves SCORE and SCORE+ and preserves high-order transitivity information of the network affinity matrix. We optimize the high-order proximity matrix from the initial affinity matrix using the Radial Basis Functions (RBFs) and Katz index. In addition to the optimization of the Laplacian matrix, we implement a procedure that joins an additional eigenvector (the \u0000<inline-formula><tex-math>$(k+1){rm th}$</tex-math></inline-formula>\u0000 leading eigenvector) to the spectrum domain for clustering if the network is considered to be a “weak signal” graph. The algorithm has been successfully applied to both real-world and synthetic data sets. The proposed algorithm is compared with state-of-art algorithms, such as ASE, Louvain, Fast-Greedy, Spectral Clustering (SC), SCORE, and SCORE+. To demonstrate the high efficacy of the proposed method, we conducted comparison experiments on eleven real-world networks and a number of synthetic networks with noise. The experimental results in most of these networks demonstrate that SCOREH+ outperforms the baseline methods. Moreover, by tuning the RBFs and their shaping parameters, we may generate state-of-the-art community structures on all real-world networks and even on noisy synthetic networks.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 3","pages":"301-312"},"PeriodicalIF":7.2,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140924735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Learning Causal Chain Graph Structure via Alternate Learning and Double Pruning 通过交替学习和双修剪学习因果链图结构

IF 7.5 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Big Data

Pub Date : 2023-12-25 DOI: 10.1109/TBDATA.2023.3346712

Shujing Yang;Fuyuan Cao;Kui Yu;Jiye Liang

Causal chain graphs model the dependency structure between individuals when the assumption of individual independence in causal inference is violated. However, causal chain graphs are often unknown in practice and require learning from data. Existing learning algorithms have certain limitations. Specifically, learning local information requires multiple subset searches, building the skeleton requires additional conditional independence testing, and directing the edges requires obtaining local information from the skeleton again. To remedy these problems, we propose a novel algorithm for learning causal chain graph structure. The algorithm alternately learns the adjacencies and spouses of each variable as local information and doubly prunes them to obtain more accurate local information, which reduces subset searches, improves its accuracy, and facilitates subsequent learning. It then directly constructs the chain graphs skeleton using the learned adjacencies without conditional independence testing. Finally, it directs the edges of complexes using the learned adjacencies and spouses to learn chain graphs without reacquiring local information, further improving its efficiency. We conduct theoretical analysis to prove the correctness of our algorithm and compare it with the state-of-the-art algorithms on synthetic and real-world datasets. The experimental results demonstrate our algorithm is more reliable than its rivals.

当因果推理中的个体独立性假设被违反时，因果链图可以模拟个体之间的依赖结构。然而，因果链图在实践中往往是未知的，需要从数据中学习。现有的学习算法有一定的局限性。具体来说，学习局部信息需要多次子集搜索，构建骨架需要额外的条件独立性测试，而引导边缘则需要再次从骨架中获取局部信息。为了解决这些问题，我们提出了一种学习因果链图结构的新算法。该算法将每个变量的邻接关系和配偶关系作为局部信息交替学习，并对其进行双重修剪，以获得更准确的局部信息，从而减少子集搜索，提高准确性，方便后续学习。然后，它利用学习到的邻接关系直接构建链图骨架，而无需进行条件独立性测试。最后，它利用学习到的邻接和配偶来引导复合物的边，从而在不重新获取局部信息的情况下学习链图，进一步提高了效率。我们进行了理论分析以证明我们算法的正确性，并在合成数据集和实际数据集上与最先进的算法进行了比较。实验结果表明，我们的算法比竞争对手更可靠。

{"title":"Learning Causal Chain Graph Structure via Alternate Learning and Double Pruning","authors":"Shujing Yang;Fuyuan Cao;Kui Yu;Jiye Liang","doi":"10.1109/TBDATA.2023.3346712","DOIUrl":"https://doi.org/10.1109/TBDATA.2023.3346712","url":null,"abstract":"Causal chain graphs model the dependency structure between individuals when the assumption of individual independence in causal inference is violated. However, causal chain graphs are often unknown in practice and require learning from data. Existing learning algorithms have certain limitations. Specifically, learning local information requires multiple subset searches, building the skeleton requires additional conditional independence testing, and directing the edges requires obtaining local information from the skeleton again. To remedy these problems, we propose a novel algorithm for learning causal chain graph structure. The algorithm alternately learns the adjacencies and spouses of each variable as local information and doubly prunes them to obtain more accurate local information, which reduces subset searches, improves its accuracy, and facilitates subsequent learning. It then directly constructs the chain graphs skeleton using the learned adjacencies without conditional independence testing. Finally, it directs the edges of complexes using the learned adjacencies and spouses to learn chain graphs without reacquiring local information, further improving its efficiency. We conduct theoretical analysis to prove the correctness of our algorithm and compare it with the state-of-the-art algorithms on synthetic and real-world datasets. The experimental results demonstrate our algorithm is more reliable than its rivals.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"10 4","pages":"442-456"},"PeriodicalIF":7.5,"publicationDate":"2023-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141602541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0